
Monday, March 14, 2016

How to setup a 2 Node Apache Kafka 0.9.0.1 cluster in CentOS 7?

Apache Kafka is one of the realtime message brokers used for stream processing in the big data world.


The following needs to be done before beginning the Apache Kafka cluster setup.

1. Create 2 CentOS 7 Servers KFNODE1 and KFNODE2 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?.

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change the /etc/hosts file so that all the IPs and names of the servers are resolved, as discussed in How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser, setup passwordless ssh between the 2 cluster servers, namely KFNODE1 and KFNODE2, as discussed in How to setup password less ssh between CentOS 7 cluster servers?

7. Install the Apache Zookeeper cluster as discussed in How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?. Make sure you do the same as in step 5 for these servers too.


For each of the servers KFNODE1 and KFNODE2, do the following.

Login using the bigdataadmin user

//create a folder for kafka under the /usr/local directory
cd /usr/local
sudo mkdir kafka

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup kafka

//create the data/log directory for kafka under /var/lib
cd /var/lib
sudo mkdir kafka

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup kafka

Switch to the bigdatauser

//download kafka
wget http://apache.claz.org/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz

//unpack the file
tar xzf kafka_2.11-0.9.0.1.tgz


//move the kafka installation to /usr/local/kafka from the download directory
mv kafka_2.11-0.9.0.1 /usr/local/kafka/

//switch to the kafka directory
cd /usr/local/kafka/kafka_2.11-0.9.0.1/



//switch to the config directory
cd config

edit the config file and change the following

vi server.properties


#The broker Id should be unique for KFNODE1 and KFNODE2
#KFNODE1
broker.id=1

#KFNODE2
broker.id=2

#change the data/log directory to
log.dirs=/var/lib/kafka


#include the zookeeper servers

zookeeper.connect=ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181
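
Putting it together, the changed lines in config/server.properties read as below (a recap of the settings above, shown for KFNODE1; on KFNODE2 only broker.id differs):

#changed lines in config/server.properties on KFNODE1 (use broker.id=2 on KFNODE2)
broker.id=1
log.dirs=/var/lib/kafka
zookeeper.connect=ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181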

//move to the root of the kafka installation on each server and start kafka on both servers
cd /usr/local/kafka/kafka_2.11-0.9.0.1

//start kafka broker
bin/kafka-server-start.sh config/server.properties >/dev/null &
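
Note that a job started with & can be killed when the shell session ends; a minimal variant using nohup (an addition to the original steps, not required by Kafka itself) keeps the broker running after you log out:

//start the kafka broker so it survives logging out
nohup bin/kafka-server-start.sh config/server.properties >/dev/null 2>&1 &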


//if you need to stop, kill the broker by its process id
kill processid
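
Kafka also ships with a stop script that looks up the broker process and stops it for you:

//graceful alternative to killing the process id by hand
bin/kafka-server-stop.sh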

//check if the process is running
ps aux | grep kafka

//check that the kafka data/log folders were created
ls /var/lib/kafka


//There is no built-in UI for kafka nor any command to query the broker list directly.
//We can use the create topic script to see if we have a cluster: here we attempt to
//create a topic with a replication factor of 3, and the resulting error tells us
//how many brokers we actually have, that is 2.
bin/kafka-topics.sh --create --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181 --replication-factor 3 --partitions 4 --topic testkfbrokers

//Error while executing topic command : replication factor: 3 larger than available brokers: 2 
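
The broker list can also be read straight from Zookeeper, where every live broker registers an ephemeral node, and a topic sized to the 2 available brokers should be created without errors. A sketch using the zookeeper-shell.sh and kafka-topics.sh scripts that ship with Kafka (the topic name testkftopic is just an example):

//list the registered broker ids from zookeeper; a healthy cluster prints [1, 2]
bin/zookeeper-shell.sh ZKNODE1:2181 ls /brokers/ids

//create and describe a topic with a replication factor matching the 2 brokers
bin/kafka-topics.sh --create --zookeeper ZKNODE1:2181 --replication-factor 2 --partitions 4 --topic testkftopic
bin/kafka-topics.sh --describe --zookeeper ZKNODE1:2181 --topic testkftopic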

How to setup a 3 Node Apache Storm 0.10 cluster in CentOS 7?

Apache Storm is the real thing for realtime computing. Some others, like Apache Spark Streaming, claim to be realtime but get there by adapting a design that was built for something else, rather than being designed for it. Apache Storm, or any variation of its design pattern, is the one that should be picked for realtime big data computing. Apache Storm by default wants to run under a supervisor process; here we are trying to run it as a background process instead.


The following needs to be done before beginning the Storm cluster setup.

1. Create 3 CentOS 7 Servers STNODE1, STNODE2, and STNODE3 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?. You can also install Apache Storm with Java 8.

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change the /etc/hosts file so that all the IPs and names of the servers are resolved, as discussed in How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser, setup passwordless ssh across the 3 cluster servers, namely STNODE1, STNODE2 and STNODE3, as discussed in How to setup password less ssh between CentOS 7 cluster servers?

7. Install the Apache Zookeeper cluster as discussed in How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?. Make sure you do the same as in step 5 for these servers too.

Storm has a concept of Master and Supervisor (Worker) nodes. We are going to give STNODE1 the Master (Nimbus), Storm UI and DRPC server roles. The other nodes will run the Supervisor role.

For each of the servers STNODE1, STNODE2 and STNODE3, do the following.

Login using the bigdataadmin user

//create a folder for storm under the /usr/local directory
cd /usr/local
sudo mkdir storm

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup storm

//create the data directory for storm under /var/lib
cd /var/lib
sudo mkdir storm

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  storm


Switch to the bigdatauser

//download storm
wget http://apache.claz.org/storm/apache-storm-0.10.0/apache-storm-0.10.0.tar.gz

//unpack the file
tar xzf apache-storm-0.10.0.tar.gz


//move the storm installation to /usr/local/storm from the download directory
mv apache-storm-0.10.0 /usr/local/storm/

//switch to the storm directory
cd /usr/local/storm/apache-storm-0.10.0



//switch to the conf directory
cd conf

edit the config file and change the following

vi storm.yaml

#include the zookeeper servers

storm.zookeeper.servers:
  - ZKNODE1
  - ZKNODE2
  - ZKNODE3

#change the data directory to
storm.local.dir: "/var/lib/storm"

#set the nimbus host so that all servers know they are in a cluster
nimbus.host: "STNODE1"

#we can run DRPC on all the servers
drpc.servers:
  - STNODE1
  - STNODE2
  - STNODE3
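
For reference, the changed section of conf/storm.yaml looks like this on every node (a recap of the settings above; YAML is indentation sensitive, so keep the two leading spaces before each dash):

storm.zookeeper.servers:
  - ZKNODE1
  - ZKNODE2
  - ZKNODE3
storm.local.dir: "/var/lib/storm"
nimbus.host: "STNODE1"
drpc.servers:
  - STNODE1
  - STNODE2
  - STNODE3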


//move to the root of the storm installation on each of the 3 servers and start the daemons for that server's roles as below
cd /usr/local/storm/apache-storm-0.10.0

//start nimbus (master) only on STNODE1
bin/storm nimbus >/dev/null &


//start storm UI only on STNODE1
bin/storm ui >/dev/null &

//start supervisors on STNODE2 and STNODE3
bin/storm supervisor >/dev/null &

//start DRPC on all Servers
bin/storm drpc >/dev/null &
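
As with the Kafka broker, jobs started with & may not survive logging out of the shell; a nohup variant (an addition to the original steps) is safer, shown here for the supervisor:

//start the supervisor so it survives logging out
nohup bin/storm supervisor >/dev/null 2>&1 &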

//if you need to stop, kill the daemon by its process id
kill processid

//check if the process is running
ps aux | grep java

//check for backtype.storm.daemon.nimbus for Nimbus
//check for backtype.storm.ui.core for UI
//check for backtype.storm.daemon.drpc for DRPC
//check for backtype.storm.daemon.supervisor for supervisor

//check the status of the cluster from the UI
http://STNODE1:8080

//you should see 1 nimbus and 2 supervisor servers if we have configured it correctly.
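
The cluster can also be checked from the command line: bin/storm list asks nimbus for the running topologies and fails if nimbus is unreachable (a fresh cluster simply lists none):

//ask nimbus for running topologies; an empty list still proves nimbus is reachable
bin/storm list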

Troubleshooting
If the UI does not come up, make sure that all the services, including the zookeeper instances, are running, and that the bigdatauser can ssh into all the servers, including the zookeeper servers.

Friday, March 11, 2016

How to setup a 2 Node Elastic Search 2.2.0 cluster in CentOS 7?

Elastic Search is one of the technologies out there that can search across big data. It can scale horizontally as the data volume increases. As with all installations, check the current version before following these instructions.

The following needs to be done before beginning the Elastic cluster setup.

1. Create 2 CentOS 7 Servers ESNODE1 and ESNODE2 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change the /etc/hosts file so that all the IPs and names of the servers are resolved, as discussed in How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser, setup passwordless ssh between the 2 cluster servers, namely ESNODE1 and ESNODE2, as discussed in How to setup password less ssh between CentOS 7 cluster servers?


For each of the servers ESNODE1 and ESNODE2, do the following.


Login using the bigdataadmin user

//create a folder for elastic search under the /usr/local directory
cd /usr/local
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup es

//create the data directory for elastic search under /var/lib
cd /var/lib
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  es


//create the log directory for elastic search under /var/log
cd /var/log
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  es


Switch to the bigdatauser

//download elastic search 2.2.0
wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.2.0/elasticsearch-2.2.0.tar.gz

//unpack the file
tar xzf elasticsearch-2.2.0.tar.gz


//move the elastic installation to /usr/local/es from the download directory


mv elasticsearch-2.2.0 /usr/local/es/

//switch to the /usr/local/es directory
cd /usr/local/es/elasticsearch-2.2.0

//move to the config folder of this elastic search version
cd config

edit the elasticsearch.yml file and change the following settings

vi elasticsearch.yml


#uncomment each of the following settings by removing the leading # and change them as below

#change the cluster name (the same on both servers)
cluster.name: escluster

#change the node name (unique per server, ESNODE1 or ESNODE2)
node.name: ESNODE1

#change the path to the data directory
path.data: /var/lib/es

#change the path to the log directory
path.logs: /var/log/es

#give the IP address assigned to this server
#make sure you have a static IP and a different one for each server
network.host: 192.168.0.5


//move to the root of the elastic search installation on each of the 2 servers and start elastic search
cd /usr/local/es/elasticsearch-2.2.0

//start elastic in daemon mode (background process)
bin/elasticsearch -d
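
If a node does not start cleanly, check the log file, which elastic search names after the cluster (assuming the cluster.name of escluster set above):

//tail the elastic search log; the file name is derived from cluster.name
tail -f /var/log/es/escluster.log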

//if you need to stop, find and kill the process by its id
kill pid


//check if the process is running
ps aux | grep java

//check for ElasticSearch

//check cluster health
http://ESNODE1:9200/_cluster/health?pretty
//you should find a JSON with
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,

//check the version of elastic
http://ESNODE1:9200/
//you should find a JSON with
"version" : { 

 "number" : "2.2.0",