Thursday, November 30, 2017

ELK

·         MySQL => Databases => Tables => Columns/Rows
·         Elasticsearch => Indices => Types => Documents with Properties

Elasticsearch has to store the data somewhere. The data is stored in shards, which are either primary shards or replicas.
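A quick way to see how the shards are laid out is the _cat/shards API (a minimal check, assuming Elasticsearch is running locally on port 9200):

curl 'localhost:9200/_cat/shards?v'

The prirep column marks each shard as p (primary) or r (replica).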

ELK Stack Installation:
ELK stack components being used are:
·         filebeat version 5.5.2
·         logstash 5.5.2
·         elasticsearch 5.5.2
·         kibana 5.5.2
filebeat
Filebeat needs to be installed on all the host machines from which you want to read your logs.
To get a specific version of the ELK components, browse to https://www.elastic.co/downloads/past-releases
Select the appropriate product and version and download the RPM. In that directory, execute sudo yum install filebeat on all the host machines.
sudo chmod 755 filebeat
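After installing, filebeat can be started and checked with the service manager (assuming the init scripts installed by the RPM, as used for Kibana later in these notes):

sudo service filebeat start
sudo service filebeat status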
Logstash
Needs to be installed on the host machine/edge node. Download the RPM and run:
sudo yum install logstash
To test your installation,
cd /usr/share/logstash/
sudo /usr/share/logstash/bin/logstash -e 'input { stdin { } } output { stdout {} }'
# After starting Logstash, wait until you see "Pipeline main started" and then enter hello world at the command prompt

ElasticSearch
Needs to be installed on the machine that will hold the Elasticsearch data (the Elasticsearch filesystem). Download the RPM and run:
sudo yum install elasticsearch
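Start the Elasticsearch service before testing (assuming the init scripts installed by the RPM):

sudo service elasticsearch start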
To test your installation
curl -XGET 'localhost:9200/?pretty'

Kibana

sudo yum install kibana

vi /etc/kibana/kibana.yml
Edit and enable the server.port: and server.host: settings.
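For example, the relevant lines in kibana.yml might look like the following (illustrative values; server.host must be an address reachable from your browser):

server.port: 5601
server.host: "0.0.0.0"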

sudo service kibana start

To test your installation
Use a browser to open http://[hostname]:5601

Configuration

filebeat
Edit the filebeat config file (/etc/filebeat/filebeat.yml) to add the log files to be scanned and shipped to Logstash.

###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.full.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

#=========================== Filebeat prospectors =============================

filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- input_type: log

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    #- /home/sraja005/flume.log
    - /var/log/flume-ng/flume-ng-agent.log
  fields:
     log_type: flumeLog

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["tsb1.devlab.motive.com:5044"]


Logstash
Create a logstash configuration file and place it in the folder mentioned below
cd /etc/logstash/conf.d/

#Here is a sample conf file.

vi flumetest.conf
input {
  beats {
    port => "5044"
    codec => multiline {
      # Grok pattern names are valid! :)
      pattern => "^(%{MONTHDAY} %{MONTH} %{YEAR} %{TIME}|%{YEAR}-%{MONTHNUM})"
      negate => true
      what => "previous"
    }
  }
}

filter {
  if ([fields][log_type] == "flumeLog") {
    grok {
      match => { "message" => "%{MONTHDAY:logDate} %{MONTH:logMonth} %{YEAR:logYear} %{TIME:logTime} %{LOGLEVEL:logLevel} %{GREEDYDATA:message}"}
    }
  }
}

output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
  }
}
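To check the syntax of the file and then start Logstash with it (the --config.test_and_exit flag only validates the configuration):

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/flumetest.conf --config.test_and_exit
sudo service logstash start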


Issues and Points:

1.     The source location and index of a message can be viewed in the message dropdown in Kibana.
2.     For a log whose lines start like:

[12/Oct/2017 09:05:51 ] supervisor   ERROR    Exception in supervisor main loop

In the config file under /etc/logstash/conf.d, add a grok filter as below, where \[ and \] match the literal [ and ] characters.

if ([fields][log_type] == "hueLog") {
        grok {
                match => { "message" => "\[%{MONTHDAY:logDate}/%{MONTH:logMonth}/%{YEAR:logYear} %{TIME:logTime} \] %{LOGLEVEL:logLevel} %{GREEDYDATA:message}"}
              }

Also add |\[ to the multiline codec pattern so that lines beginning with [ start a new event:

pattern => "^(%{MONTHDAY} %{MONTH} %{YEAR} %{TIME}|%{YEAR}-%{MONTHNUM}|\[| )"

Filter for a log like:

17/10/26 13:37:59 ERROR TaskSchedulerImpl: Lost an executor driver (already removed): Executor heartbeat timed out after 239118 ms

if ([fields][log_type] == "sparkLog") {
        grok {
                match => { "message" => "%{YEAR:logYear}/%{MONTHNUM:logMonth}/%{MONTHDAY:logDate} %{TIME:logTime} %{LOGLEVEL:logLevel} %{GREEDYDATA:message}"}
              }
                }

3.     By default, Logstash creates a new index for every day. To send all the data to a single index, add an index setting to the elasticsearch output in the Logstash config file.

elasticsearch {
  index => "blogs2"
  hosts => [ "localhost:9200" ]
}

To write the output to multiple indexes:

output {
  elasticsearch {
    index => "blogs2"
    hosts => [ "localhost:9200" ]
  }
  elasticsearch {
    index => "blogs"
    hosts => [ "localhost:9200" ]
  }
}
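Note that the form above sends every event to both indexes. To route events to different indexes instead, the outputs can be wrapped in a conditional on the log_type field set by filebeat (a sketch):

output {
  if [fields][log_type] == "flumeLog" {
    elasticsearch {
      index => "blogs2"
      hosts => [ "localhost:9200" ]
    }
  } else {
    elasticsearch {
      index => "blogs"
      hosts => [ "localhost:9200" ]
    }
  }
}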

4.     To list the indexes,
curl 'localhost:9200/_cat/indices?v'

5.     To get info on a particular index 'blogs2':
curl -XGET localhost:9200/blogs2

6.     To test a grok filter against sample text, use https://grokdebug.herokuapp.com/

7.     To combine the date and time values of the log into a single timestamp field, follow https://stackoverflow.com/questions/40385107/logstash-grok-how-to-parse-timestamp-field-using-httpderror-date-pattern
8.     To see the health of the cluster:
curl 'localhost:9200/_cluster/health?pretty'

9.     To create a new index, in Kibana -> Dev Tools execute the following command to create the blogs index:
PUT /blogs
{
   "settings" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1
   }
}
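The same index can also be created from the shell with curl (the Content-Type header is optional in 5.x but harmless):

curl -XPUT 'localhost:9200/blogs?pretty' -H 'Content-Type: application/json' -d '{ "settings": { "number_of_shards": 3, "number_of_replicas": 1 } }'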
10.  Create index patterns:

In Kibana -> Management -> Index Patterns -> Create Index Pattern, provide the index name or pattern as 'blogs*' -> Create.


Extracting values from an existing field and creating a new field in Logstash.

There are 2 approaches for this:

1. Copy the source field into a temporary field and split it:

if ([fields][log_type] == "yarnHive2kafkaLog") {
  grok {
    match => { "message" => "%{YEAR:logYear}-%{MONTHNUM:logMonth}-%{MONTHDAY:logDate} %{TIME:logTime} \!%{SPACE}%{LOGLEVEL:logLevel}%{SPACE}\! %{GREEDYDATA:message}"}
  }
  mutate {
    copy => { "source" => "source_tmp" }
  }
  mutate {
    split => ["source_tmp", "/"]
    add_field => { "applicationID" => "%{source_tmp[4]}" }
  }
}
2. Apply a grok filter to the source field:

if ([fields][log_type] == "yarnHive2kafkaLog") {
  grok {
    match => { "message" => "%{YEAR:logYear}-%{MONTHNUM:logMonth}-%{MONTHDAY:logDate} %{TIME:logTime} \!%{SPACE}%{LOGLEVEL:logLevel}%{SPACE}\! %{GREEDYDATA:message}"}
  }
  grok {
    match => { "source" => "/%{GREEDYDATA:primaryDir}/%{GREEDYDATA:subDir1}/%{GREEDYDATA:subDir2}/%{GREEDYDATA:subDir3}/%{GREEDYDATA:containerID}/%{GREEDYDATA:fileName}"}
  }
  mutate {
    add_field => { "applicationID" => "%{subDir3}" }
  }
}


2017-11-15 09:21:06,578 ! ERROR ! [Driver] ! imps.CuratorFrameworkImpl ! Background 
2017-11-20 03:35:17,730 !  WARN ! [Reporter] ! impl.AMRMClientImpl ! ApplicationMaster 

In the above 2 log lines, the whitespace before ERROR and WARN is not the same. To handle this, use %{SPACE}, which matches zero or more whitespace characters.

grok {
  match => { "message" => "%{YEAR:logYear}-%{MONTHNUM:logMonth}-%{MONTHDAY:logDate} %{TIME:logTime} !%{SPACE}%{LOGLEVEL:logLevel}%{SPACE}! %{GREEDYDATA:message}"}
}

1.     Remove trailing whitespace in a Logstash filter


Approach 1: use NOTSPACE instead of GREEDYDATA.

For a log like:
[24/Oct/2017 15:04:53 ] cluster       WARNING Picking RM HA: ha

[%{MONTHDAY:logDate}/%{MONTH:logMonth}/%{YEAR:logYear} %{TIME:logTime} ]%{SPACE}%{GREEDYDATA:platformType} +\s %{SPACE}%{LOGLEVEL:logLevel}%{SPACE}%{GREEDYDATA:message}

The above filter leaves trailing whitespace in the captured platformType field (e.g. "cluster" followed by spaces).

\[%{MONTHDAY:logDate}/%{MONTH:logMonth}/%{YEAR:logYear} %{TIME:logTime} \]%{SPACE}%{NOTSPACE:platformType}%{SPACE}%{LOGLEVEL:logLevel}%{SPACE}%{GREEDYDATA:message}

Replacing GREEDYDATA with NOTSPACE resolves this issue.

Approach 2:

Place the following after the grok filter to strip whitespace:

mutate {
  strip => ["platformType"]
}

Monday, November 20, 2017

Kafka - Zookeeper configuration

https://www.youtube.com/watch?v=SxHsnNYxcww

Basically there are 2 types of zookeeper configurations:

1. Single Node - Only 1 Zookeeper server - Single point of failure
2. Zookeeper Ensemble - cluster of zookeeper nodes - more robust and no single point of failure.

In the Zookeeper ensemble case, even if one of the Zookeeper nodes goes down, the ensemble can still maintain the cluster state because of the remaining Zookeeper servers running on the other nodes.

Setting up a Zookeeper ensemble with a 3-broker Kafka setup.

Changes on the Zookeeper side:

On mach1, in /usr/lib/zookeeper/conf/zoo1.cfg:

server.1=mach1:2888:3888
server.2=mach2:2889:3889
server.3=mach3:2890:3890

The above 3 entries specify the cluster of Zookeeper servers that form the ensemble.

Start the Zookeeper server with this config file:
zookeeper-server-start.sh config/zoo1.cfg

Use the same numbering sequence in the myid file under each server's "dataDir" directory, as this is what identifies each member of the Zookeeper ensemble.

clientPort=2181. This is the port on which clients connect to this Zookeeper server.

The entries in the config file are the same on the other 2 machines apart from clientPort; they might have clientPort=2182 and clientPort=2183 respectively.
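Putting it together, each zoo cfg might look like the sketch below (dataDir and the timing values are illustrative); the myid file under dataDir holds that server's number and must match the server.N entry:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=mach1:2888:3888
server.2=mach2:2889:3889
server.3=mach3:2890:3890

echo "1" > /var/lib/zookeeper/myid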

Changes on the Kafka side:

On mach1, in /opt/kafka-2.11-0.10.1.1/config/server.properties, specify the client connection ports of all 3 Zookeeper servers in the ensemble:

zookeeper.connect=mach1IP:2181,mach2IP:2182,mach3IP:2183

Configure the server.properties file on all the Kafka brokers, and remember to use a different port and broker.id for each of them.
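For example, the relevant lines on the first broker might be (values illustrative):

broker.id=1
listeners=PLAINTEXT://mach1:9092
log.dirs=/var/lib/kafka-logs
zookeeper.connect=mach1IP:2181,mach2IP:2182,mach3IP:2183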

Start Kafka with this server.properties file. The Kafka cluster state is now watched by the 3-node Zookeeper ensemble.

Eg: bin/kafka-topics.sh --list --zookeeper xvzw160.xdev.motive.com:2181,xvzw161.xdev.motive.com:2181,xvzw162.xdev.motive.com:2181

/bin/kafka-console-producer --broker-list kafka02.example.com:9092,kafka03.example.com:9092 --topic t
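A console consumer can be used to verify the messages end to end (hostnames follow the example above; in 0.10.x the new consumer takes --bootstrap-server):

bin/kafka-console-consumer.sh --bootstrap-server kafka02.example.com:9092 --topic t --from-beginning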

Note: In the case of a single-node Zookeeper, zookeeper.connect will have only 1 Zookeeper server entry.