
Realtime Log Analytics with Solr, Logstash, Banana and Beats

Logs are everywhere, usually generated in large volumes and at high velocity. They can yield useful information and insights about the domain or process that produces them: platforms, transactions, system users, and so on. In this post, a realtime web (Apache2) log analytics pipeline will be built using Apache Solr, Banana, Logstash and Beats containers.

To get the pipeline running, however, several integration aspects related to streaming data need to be addressed through settings and patches supplied via mounted volumes. The structure of these volumes is as follows:

volumes
 |-solr
   |-configsets
     |-_default
       |-xslt
 |-logstash
   |-xbin
   |-xplugins
     |-solr_http
   |-config
 |-filebeat
 |-banana

Note: volumes/solr/configsets is a copy of $SOLR_HOME/server/solr/configsets
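
If a local Solr installation is not at hand, one way to obtain that copy is to extract the configsets from the Solr image itself (a sketch assuming the official image layout, where Solr lives under /opt/solr):

$ docker create --name solr-tmp solr
$ docker cp solr-tmp:/opt/solr/server/solr/configsets volumes/solr/
$ docker rm solr-tmp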

Filebeat

$ docker pull docker.elastic.co/beats/filebeat:7.3.2

Configure Filebeat

Recent versions of Filebeat export extra ECS-related fields which override the original Combined Log fields and, additionally, cannot be indexed into Solr directly because they do not follow the standard Solr add-document structure. To overcome that, these fields can be dropped here, and the overwritten ones, namely agent and host, are recovered downstream by Logstash. The following filebeat.yml reflects that:

filebeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false

processors:
  - drop_fields:
      fields: ["log", "ecs", "host", "agent", "input"]

filebeat.inputs:
- type: log
  paths:
    - /var/log/apache2/*

output.logstash:
  hosts: ["logstash:5044"]

Logstash

$ docker pull docker.elastic.co/logstash/logstash:7.3.2

Configure Logstash

Create logstash-filebeat.conf under the volumes/logstash/config directory as follows:

input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  useragent {
    source => "agent"
    target => "user_agent"
  }
  geoip {
    source: "clientip"
  }
}

output {
  solr_http {
    id => "solr_plugin_1"
    solr_url => "http://solr:8983/solr/logs"
    tr => "update_geoip.xsl"
  }
}

The settings are straightforward: Beats is used for input, four filters process the stream of data, and the solr_http plugin is used for output:

  1. grok: extracts fields from the payload. The expected format here is the Apache combined log; the plugin supports many other formats (see the quick test after this list).
  2. date: parses the timestamp and converts it to an internal representation.
  3. useragent: adds user agent data.
  4. geoip: adds geographical data.
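
To preview what grok extracts from a combined log line before the whole pipeline is in place, the pattern can be tried with an inline Logstash config from the Logstash home directory (a quick sketch; the sample log line is made up for illustration):

$ echo '127.0.0.1 - - [24/Sep/2019:09:31:06 +0200] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' | \
bin/logstash -e 'input { stdin {} } filter { grok { match => { "message" => "%{COMBINEDAPACHELOG}" } } } output { stdout { codec => rubydebug } }'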

solr_http plugin

The latest version of the solr_http plugin has some glitches related to JSON structure, the ISO8601 time format, and the commit policy, which prevent data from being indexed into Solr. The solr_http.rb patch below addresses these glitches, in addition to applying the XSLT:

--- solr_http.rb	2019-09-24 08:56:50.000000000 +0200
+++ /Users/ahmedadel/workspace/volumes/logstash/xplugins/solr_http/solr_http.rb	2019-09-24 09:31:06.000000000 +0200
@@ -39,10 +39,13 @@
   # '%{foo}' so you can assign your own IDs
   config :document_id, :validate => :string, :default => nil
 
+  # Document transformation XSL
+  config :tr, :validate => :string, :default => nil
+
   public
   def register
     require "rsolr"
-    @solr = RSolr.connect :url => @solr_url
+    @solr = RSolr.connect :url => @solr_url, update_format: :xml
     buffer_initialize(
       :max_items => @flush_size,
       :max_interval => @idle_flush_time,
@@ -62,7 +65,7 @@
 
     events.each do |event|
         document = event.to_hash()
-        document["@timestamp"] = document["@timestamp"].iso8601 #make the timestamp ISO
+        document["@timestamp"] = document["@timestamp"].to_iso8601 #make the timestamp ISO
         if @document_id.nil?
           document ["id"] = UUIDTools::UUID.random_create    #add a unique ID
         else
@@ -71,7 +74,7 @@
         documents.push(document)
     end
 
-    @solr.add(documents)
+    @solr.add(documents, :add_attributes => {:commitWithin=>10000}, :params => {:tr => @tr})
     rescue Exception => e
       @logger.warn("An error occurred while indexing: #{e.message}")
   end #def flush

Place the patch at volumes/logstash/xplugins/solr_http/solr_http.rb.patch.
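
Should the patch need to be regenerated, for example against a newer plugin release, diff(1) in unified mode produces the same format (hypothetical file names for illustration):

$ diff -u solr_http.rb.orig solr_http.rb.patched > solr_http.rb.patch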

Start script

The following shell script can be used to install and patch the plugin before starting Logstash:

#!/usr/bin/sh

cd /usr/share/logstash

yum install -y patch
su - logstash -c "bin/logstash-plugin install logstash-output-solr_http"

# Apply patches for undefined method `iso8601' error and NRT
# PR#9 respectively (https://github.com/logstash-plugins/logstash-output-solr_http/pull/9)
# addressing Issue#10 (https://github.com/logstash-plugins/logstash-output-solr_http/issues/10)
# and PR#7 (https://github.com/logstash-plugins/logstash-output-solr_http/pull/7)

su - logstash -c "cd vendor/bundle/jruby/2.5.0/gems/logstash-output-solr_http-3.0.5/lib/logstash/outputs && patch < /usr/share/logstash/xplugins/solr_http/solr_http.rb.patch"

su - logstash -c "bin/logstash -f /usr/share/logstash/config/logstash-filebeat.conf"
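
Save the script as setup.sh under volumes/logstash/xbin (it is invoked by that path when the Logstash container is started later) and make sure it is executable:

$ chmod +x volumes/logstash/xbin/setup.sh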

Solr

$ docker pull solr

To take advantage of Solr schemaless mode and field type guessing with Logstash and Banana, the guessing settings need to be slightly modified. By default, the guessed field types are multi-valued, which does not play well with Banana as of now. Therefore, the field guessing section of solrconfig.xml under volumes/solr/configsets/_default is modified to use single-valued types as follows:

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">string</str>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Boolean</str>
        <str name="fieldType">boolean</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.util.Date</str>
        <str name="fieldType">pdate</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Long</str>
        <str name="valueClass">java.lang.Integer</str>
        <str name="fieldType">plong</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Number</str>
        <str name="fieldType">pdouble</str>
      </lst>
</processor>

Although the geoip data generated by the Logstash geoip filter is encoded in a single field node, it does not conform to the standard Solr add command format. To address this, an XSL transformation can be used to convert it to a Solr-style structure:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="field[@name='geoip']">
    <field name="country_code2">
      <!-- transform country_code2 only -->
      <xsl:value-of select="@country_code2" />
    </field>
  </xsl:template>
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*" />
      <xsl:apply-templates />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

In the above XSLT, we are interested in the country_code2 field only.
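
The stylesheet is saved as update_geoip.xsl under the xslt directory shown in the volume layout above, so that Solr can resolve the tr parameter at update time. It can also be exercised directly against the update handler (a sketch; the test document is made up and mirrors the attribute layout the stylesheet expects):

$ curl "http://localhost:8983/solr/logs/update?tr=update_geoip.xsl&commit=true" \
-H 'Content-Type: text/xml' \
--data-binary '<add><doc><field name="id">test-1</field><field name="geoip" country_code2="US"/></doc></add>'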

Banana

$ docker pull aaadel/banana

Banana works with this pipeline out of the box: no configuration, no modifications!

Starting the pipeline

Preparing the environment

In this step, we create the network required for the pipeline. Also, as a security measure, the Filebeat configuration file must be writable by its owner only:

$ docker network create somenetwork
$ chmod go-w volumes/filebeat/filebeat.yml

Starting the containers

$ docker run --rm --name solr -v $(pwd)/volumes/solr/configsets:/opt/configsets --network somenetwork -p 8983:8983 -t solr -c -f
$ docker run --rm -it --name logstash -u=0 \
-v $(pwd)/volumes/logstash/config:/usr/share/logstash/config \
-v $(pwd)/volumes/logstash/xbin:/usr/share/logstash/xbin \
-v $(pwd)/volumes/logstash/xplugins:/usr/share/logstash/xplugins \
-v /var/log/apache2:/var/log/apache2 --network somenetwork -p 5044:5044 -t docker.elastic.co/logstash/logstash:7.3.2 xbin/setup.sh
$ docker run --rm --name filebeat \
-v $(pwd)/volumes/filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml \
-v /var/log/apache2:/var/log/apache2 --link logstash --network somenetwork docker.elastic.co/beats/filebeat:7.3.2
$ docker run --rm -e "BANANA_SOLR_HOST=solr" --name banana --link solr --network somenetwork -p 9901:9901 -t aaadel/banana
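
Before moving on, it is worth tailing the Logstash container output to confirm that the plugin was installed, patched, and the pipeline started:

$ docker logs -f logstash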

Creating the logs collection

$ docker exec solr bin/solr create -c logs -d /opt/configsets/_default
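
Once Apache traffic starts flowing, indexing can be verified with a simple match-all query against the new collection:

$ curl "http://localhost:8983/solr/logs/select?q=*:*&rows=1"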

Loading a pre-defined dashboard

Now, navigate to http://localhost:9901 and load the pre-defined dashboard. Done!

[Figure: Realtime log analytics dashboard analyzing Apache2 logs by method, response, client IP, time, and other features.]

Of course, additional panels can be created to get more details about the logs being analyzed. The dashboard is realtime by default, refreshing at a 30-second resolution.
