Content

Saturday, May 21, 2016

How to setup standalone instance of Apache Kafka 0.9.0.1 on localhost for Mac OS X?


Apache Kafka is a distributed message broker that reads messages in a sequential manner, maintaining the order in which they arrived.
It also allows multiple reads of messages by way of partitions. So if we have a 4-partition topic, it allows 4 threads to read the messages in parallel. However, it is the developer's job to make sure that messages for the same entity go to the same thread in a sequential manner instead of to different threads. Kafka does this by way of a message key. So if we send an entity E1 with key Key1 at time T1, and another message for E1 comes at time T2, it is the developer's responsibility to give it the same Key1 so that the messages are read in order by a single thread, namely E1 at T1 first and then E1 at T2, and so on.
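The key-to-partition routing described above can be sketched in a few lines. This is a simplified model (Kafka's default partitioner actually uses murmur2 hashing over the key bytes; here CRC32 stands in as a deterministic hash), but it shows why messages with the same key always land in the same partition:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a message key to a partition.

    Simplified stand-in for Kafka's murmur2-based default partitioner:
    same key in, same partition out.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Messages for entity E1 share the key "Key1", so they always land
# in the same partition and are read in arrival order by one thread.
p1 = partition_for("Key1", 4)  # message for E1 at time T1
p2 = partition_for("Key1", 4)  # message for E1 at time T2
assert p1 == p2
```

Because both messages hash to the same partition, a single consumer thread sees them in the order they were produced.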
The following needs to be done before beginning the setup.

1. You have admin privileges for your development box
2. Make sure Java 7 or 8 is installed and configured as default as discussed in How to install Java 7 and Java 8 in Mac OS X
3. Make sure Apache Zookeeper standalone is installed as specified in How to setup standalone instance of Apache Zookeeper 3.4.6 on localhost for Mac OS X?
//create a folder for kafka under the /usr/local directory
cd /usr/local
sudo mkdir kafka

//create the data cum log directory for kafka under the var/lib
cd /var/lib
sudo mkdir kafka

//download kafka
wget http://apache.claz.com/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz

//unpack the file
tar xzf kafka_2.11-0.9.0.1.tgz

//move the kafka installation to /usr/local/kafka from the download directory
mv kafka_2.11-0.9.0.1 /usr/local/kafka/

//switch to the kafka directory
cd /usr/local/kafka/kafka_2.11-0.9.0.1/

//switch to the config directory
cd config

edit the config file and change the following

vi server.properties

#The broker Id should be unique
broker.id=1

#change data cum log directory to
log.dirs=/var/lib/kafka

#include the zookeeper servers
zookeeper.connect=YOURMACHOSTNAME.local:2181

#Since this is a dev machine allow a topic to be deleted
delete.topic.enable=true

//save the file
:wq
//move to the kafka root
cd /usr/local/kafka/kafka_2.11-0.9.0.1

//start kafka broker
bin/kafka-server-start.sh config/server.properties >/dev/null &

//if you need to stop
kill processid

//check if the process is running
ps aux | grep kafka

//or use jps
jps
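Once the broker is started, a quick sanity check is a plain TCP probe of the listener port (9092 is the default in a stock server.properties; adjust if you changed it). A minimal sketch:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 9092 is Kafka's default listener port on a stock server.properties.
print(is_port_open("localhost", 9092))
```

This only proves something is listening on the port; use the console producer/consumer scripts for an end-to-end check.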


How to setup standalone instance of Apache Zookeeper 3.4.6 on localhost for Mac OS X?

Apache Zookeeper is a distributed state manager that other systems use for state management. You can also set up a standalone Zookeeper instead of a built-in one, so that a single instance is shared across multiple technologies like Kafka, Storm, HBase etc. and each technology does not have to start its own instance. These instructions let you set up Zookeeper as a standalone instance.

The following needs to be done before beginning the setup.

1. You have admin privileges for your development box
2. Make sure Java 7 or 8 is installed and configured as default as discussed in How to install Java 7 and Java 8 in Mac OS X



//create a folder for zookeeper under the /usr/local directory
cd /usr/local
sudo mkdir zookeeper

//create the data directory for Zookeeper under the var/lib
cd /var/lib
sudo mkdir zookeeper

//create a file named myid under the data directory
cd /var/lib/zookeeper
vi myid

//Put only the number 1.
1

//save the file
:wq

If you do a cat myid it should display just 1


//download zookeeper on any local directory
wget http://apache.arvixe.com/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

//unpack the file
tar xzf zookeeper-3.4.6.tar.gz


//move the zookeeper installation to /usr/local/zookeeper
//from the download directory
mv zookeeper-3.4.6 /usr/local/zookeeper/

//switch to the /usr/local/zookeeper directory
cd /usr/local/zookeeper/zookeeper-3.4.6

//move to the conf folder
cd conf

//copy the sample config to zoo.cfg
cp zoo_sample.cfg zoo.cfg


edit the zoo.cfg file and change the following settings

vi zoo.cfg

//change data directory to
dataDir=/var/lib/zookeeper

#include the cluster servers

server.1=YOURMACHOSTNAME.local:2888:3888


//move to the root of zookeeper
cd /usr/local/zookeeper/zookeeper-3.4.6

//start zookeeper
bin/zkServer.sh start

//if you need to stop
bin/zkServer.sh stop

//check if the process is running
jps

//check for QuorumPeerMain

//check the status of zookeeper
bin/zkServer.sh status
//This should display
Mode: standalone
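Besides zkServer.sh status, Zookeeper 3.4 answers four-letter-word commands over its client port; `ruok` should come back as `imok` on a healthy server. A small hedged sketch of that probe (host and port are the defaults assumed in this setup):

```python
import socket

def zk_ruok(host: str = "localhost", port: int = 2181,
            timeout: float = 2.0):
    """Send Zookeeper's 'ruok' four-letter command.

    Returns the server reply ('imok' when healthy) or None if the
    server is unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return s.recv(16).decode() or None
    except OSError:
        return None

# On a running standalone instance: zk_ruok() -> 'imok'
# With no server listening:         zk_ruok() -> None
```

The same probe works later against each node of a Zookeeper cluster.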


Monday, March 14, 2016

How to setup a 2 Node Apache Kafka 0.9.0.1 cluster in CentOS 7?

Apache Kafka is one of the realtime message brokers used for realtime stream processing in the big data world.


The following needs to be done before beginning  the Apache Kafka cluster Setup.

1. Create 2 CentOS 7 Servers KFNODE1 and KFNODE2 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?.

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser, set up password less ssh across the 2 cluster servers, namely KFNODE1 and KFNODE2 as discussed in How to setup password less ssh between CentOS 7 cluster servers?

7. Install Apache Zookeeper clusters as discussed in How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?. Make sure you do the same as in step 5 for these servers too.


For each of the Servers KFNODE1 and KFNODE2 do the following.

Login using the bigdataadmin

//create a folder for kafka under the /usr/local directory
cd /usr/local
sudo mkdir kafka

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup kafka

//create the data cum log directory for kafka under the var/lib
cd /var/lib
sudo mkdir kafka

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup kafka

Switch to the bigdatauser

//download kafka
wget http://apache.claz.com/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz

//unpack the file
tar xzf kafka_2.11-0.9.0.1.tgz


//move the kafka installation to /usr/local/kafka from the download directory
mv kafka_2.11-0.9.0.1 /usr/local/kafka/

//switch to the kafka directory
cd /usr/local/kafka/kafka_2.11-0.9.0.1/



//switch to the config directory
cd config

edit the config file and change the following

vi server.properties


#The broker Id should be unique for KFNODE1 and KFNODE2
#KFNODE1
broker.id=1

#KFNODE2
broker.id=2

#change data cum log directory to
log.dirs=/var/lib/kafka


#include the zookeeper servers

zookeeper.connect=ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181

//move to the kafka root on each server and start Kafka on both servers
cd /usr/local/kafka/kafka_2.11-0.9.0.1

//start kafka broker
bin/kafka-server-start.sh config/server.properties >/dev/null &


//if you need to stop
kill processid

//check if the process is running
ps aux | grep kafka

//check for the kafka data/log folders 


//There is no built-in UI for Kafka, nor a command to query the broker list,
//but we can use the create-topic script to see if we have a cluster. Here we
//attempt to create a topic with a replication factor of 3; the error tells us
//how many brokers we have, that is 2.
bin/kafka-topics.sh --create --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181 --replication-factor 3 --partitions 4 --topic testkfbrokers

//Error while executing topic command : replication factor: 3 larger than available brokers: 2 
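The broker-side validation that produces this error can be modeled as a simple precondition check. This is a sketch of the idea, not Kafka's actual code: every replica needs its own broker, so the replication factor can never exceed the broker count:

```python
def validate_topic(replication_factor: int, available_brokers: int) -> None:
    """Mimic Kafka's topic-creation check: each replica of a
    partition must live on a distinct broker."""
    if replication_factor > available_brokers:
        raise ValueError(
            f"replication factor: {replication_factor} larger than "
            f"available brokers: {available_brokers}"
        )

# With our 2-node cluster, a replication factor of 3 must fail:
try:
    validate_topic(replication_factor=3, available_brokers=2)
except ValueError as e:
    print(e)  # replication factor: 3 larger than available brokers: 2

# A replication factor of 2 on 2 brokers is accepted.
validate_topic(replication_factor=2, available_brokers=2)
```

So the error above is actually the confirmation we wanted: the cluster reports exactly 2 live brokers.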

How to setup a 3 Node Apache Storm 0.10 cluster in CentOS 7?

Apache Storm is the real thing for realtime computing. There are some others, like Apache Spark Streaming, that claim to be realtime computing but are really modified behavior from what they were designed for. Apache Storm, or any variation of its design pattern, is the one to pick for realtime big data computing. Apache Storm by default wants to run under a supervisor process; here we are going to run it as a background process.


The following needs to be done before beginning  the Storm cluster Setup.

1. Create 3 CentOS 7 Servers STNODE1, STNODE2, and STNODE3 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?. You can also install Apache Storm with Java 8.

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser, set up password less ssh across the 3 cluster servers, namely STNODE1, STNODE2 and STNODE3 as discussed in How to setup password less ssh between CentOS 7 cluster servers?

7. Install Apache Zookeeper clusters as discussed in How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?. Make sure you do the same as in step 5 for these servers too.

Storm has a concept of Master (Nimbus) and Supervisor (Worker) nodes. We are going to give STNODE1 the Master and Storm UI roles; DRPC runs on all three servers. The other nodes run the Supervisor role.
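The role layout just described can be written down as a small mapping for reference (the hostnames are the hypothetical ones used throughout this walkthrough):

```python
# Role layout assumed in this walkthrough: STNODE1 carries nimbus
# and the UI; supervisors run on the other two; DRPC runs everywhere.
ROLES = {
    "STNODE1": ["nimbus", "ui", "drpc"],
    "STNODE2": ["supervisor", "drpc"],
    "STNODE3": ["supervisor", "drpc"],
}

def nodes_with(role: str) -> list:
    """Return the sorted hostnames that carry a given role."""
    return sorted(h for h, roles in ROLES.items() if role in roles)

assert nodes_with("nimbus") == ["STNODE1"]
assert nodes_with("supervisor") == ["STNODE2", "STNODE3"]
```

Keeping the layout in one place like this makes it easy to see which `bin/storm` daemon has to be started on which box.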

For each of the Servers STNODE1, STNODE2 and STNODE3 do the following.

Login using the bigdataadmin

//create a folder for storm under the /usr/local directory
cd /usr/local
sudo mkdir storm

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup storm

//create the data directory for storm under the var/lib
cd /var/lib
sudo mkdir storm

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  storm


Switch to the bigdatauser

//download storm
wget http://apache.claz.org/storm/apache-storm-0.10.0/apache-storm-0.10.0.tar.gz

//unpack the file
tar xzf apache-storm-0.10.0.tar.gz


//move the storm installation to /usr/local/storm from the download directory
mv apache-storm-0.10.0 /usr/local/storm/

//switch to the storm directory
cd /usr/local/storm/apache-storm-0.10.0



//switch to the conf directory
cd conf

edit the config file and change the following

vi storm.yaml

#include the zookeeper servers

storm.zookeeper.servers:
  - ZKNODE1
  - ZKNODE2
  - ZKNODE3

//change data directory to
storm.local.dir: "/var/lib/storm"


//change the nimbus host so that all servers know which node is the master
nimbus.host: "STNODE1"

//we can run DRPC on all the servers
drpc.servers:
  - STNODE1
  - STNODE2
  - STNODE3


//move to the storm root on each server and start Storm on all 3 servers
cd /usr/local/storm/apache-storm-0.10.0

//start nimbus (master) only on STNODE1
bin/storm nimbus >/dev/null &


//start storm UI only on STNODE1
bin/storm ui >/dev/null &

//start supervisors on STNODE2 and STNODE3
bin/storm supervisor >/dev/null &

//start DRPC on all Servers
bin/storm drpc >/dev/null &

//if you need to stop
kill processid

//check if the process is running
ps aux | grep java

//check for backtype.storm.daemon.nimbus for Nimbus

//check for backtype.storm.ui.core for UI
//check for backtype.storm.daemon.drpc for DRPC
//check for backtype.storm.daemon.supervisor for supervisor

//check the status of the cluster from the UI
http://stnode1:8080

//you should be able to see 1 nimbus and 2 supervisor servers, if we have configured it correctly. 

Troubleshooting
If the UI does not come up, make sure that all the services, including the zookeeper instances, are running, and that the bigdatauser can ssh into all the servers including the zookeeper servers.

Friday, March 11, 2016

How to setup a 2 Node Elastic Search 2.2.0 cluster in CentOS 7?

Elastic Search is one of the technologies out there that can search across big data. It can scale horizontally as the data volume increases. As with all installations, check the current version before following these instructions.

The following needs to be done before beginning  the Elastic cluster Setup.

1. Create 2 CentOS 7 Servers ESNODE1 and ESNODE2 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser, set up password less ssh across the 2 cluster servers, namely ESNODE1 and ESNODE2 as discussed in How to setup password less ssh between CentOS 7 cluster servers?


For each of the servers ESNODE1 and ESNODE2 do the following


Login using the bigdataadmin

//create a folder for elastic search under the /usr/local directory
cd /usr/local
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup es

//create the data directory for elastic search under the var/lib
cd /var/lib
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  es


//create the log directory for elastic search under the var/log
cd /var/log
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  es


Switch to the bigdatauser

//download elastic search 2.2.0
wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.2.0/elasticsearch-2.2.0.tar.gz

//unpack the file
tar xzf elasticsearch-2.2.0.tar.gz


//move the elastic installation to /usr/local/es from the download directory


mv elasticsearch-2.2.0 /usr/local/es/

//switch to the /usr/local/es directory
cd /usr/local/es/elasticsearch-2.2.0

//move to the config folder
cd config

edit the elasticsearch.yml file and change the following settings

vi elasticsearch.yml


//uncomment (remove the #) and change the following settings

//change the cluster name
cluster.name: escluster

//change the node name (use ESNODE2 on the second server)
node.name: ESNODE1

//change the path to data directory
path.data: /var/lib/es


//change the path to the log directory
path.logs: /var/log/es


//give the IP address assigned to this server
//make sure you have a static IP and different for each server
network.host: 192.168.0.5


//move to the elasticsearch root on each server and start Elastic Search on both servers
cd /usr/local/es/elasticsearch-2.2.0

//start elastic in daemon mode (background process)
bin/elasticsearch -d

//if you need to stop find and kill the process
kill pid


//check if the process is running
ps aux | grep java

//check for ElasticSearch

//check cluster health
http://ESNODE1:9200/_cluster/health?pretty
//you should find a JSON with
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,

//check the version of elastic
http://ESNODE1:9200/
//you should find a JSON with
"version" : { 

 "number" : "2.2.0",

Friday, February 19, 2016

How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?

Zookeeper, in short, is a distributed state manager which other clusters can use to maintain state across their own servers. For example, HBase can use Zookeeper to maintain state across its own set of servers without having to keep cluster state within itself.

The following needs to be done before beginning  the Zookeeper cluster Setup.

1. Create 3 CentOS 7 Servers ZKNODE1, ZKNODE2, and ZKNODE3 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser, set up password less ssh across the 3 cluster servers, namely ZKNODE1, ZKNODE2 and ZKNODE3 as discussed in How to setup password less ssh between CentOS 7 cluster servers?


For each of the servers ZKNODE1, ZKNODE2 and ZKNODE3 do the following

Login using the bigdataadmin

//create a folder for zookeeper under the /usr/local directory
cd /usr/local
sudo mkdir zookeeper

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup zookeeper

//create the data directory for Zookeeper under the var/lib
cd /var/lib
sudo mkdir zookeeper

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  zookeeper


Switch to the bigdatauser

//create a file named myid under the data directory
cd /var/lib/zookeeper
vi myid

Put only the number for the corresponding server. DO NOT put all 3 numbers in each server.
on ZKNODE1
1
on ZKNODE2
2
on ZKNODE3
3

if you do a cat myid it should just display 1 for ZKNODE1 and so on.
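The per-node myid convention can be sketched as a tiny helper; the hostname-to-id mapping mirrors the `server.N` entries that go into zoo.cfg below (hostnames are the hypothetical ones used in this walkthrough):

```python
import os

# server.N entries in zoo.cfg map hostnames to ids; the myid file on
# each node must contain only that node's own number.
NODE_IDS = {"ZKNODE1": 1, "ZKNODE2": 2, "ZKNODE3": 3}

def write_myid(hostname: str, data_dir: str) -> str:
    """Write this node's id into <data_dir>/myid and return the path."""
    path = os.path.join(data_dir, "myid")
    with open(path, "w") as f:
        f.write(f"{NODE_IDS[hostname]}\n")
    return path

# Example: on ZKNODE2 this leaves just '2' in /var/lib/zookeeper/myid.
```

The id in myid must agree with the matching `server.N=` line in zoo.cfg, or the node will refuse to join the quorum.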


//download zookeeper
wget http://apache.arvixe.com/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

//unpack the file
tar xzf zookeeper-3.4.6.tar.gz


//move the zookeeper installation to /usr/local/zookeeper from the download directory
mv zookeeper-3.4.6 /usr/local/zookeeper/

//switch to the /usr/local/zookeeper directory
cd /usr/local/zookeeper/zookeeper-3.4.6

//move to the conf folder
cd conf

//copy the sample config to zoo.cfg
cp zoo_sample.cfg zoo.cfg


edit the zoo.cfg file and change the following settings

vi zoo.cfg

//change data directory to
dataDir=/var/lib/zookeeper

#include the cluster servers
server.1=ZKNODE1:2888:3888
server.2=ZKNODE2:2888:3888
server.3=ZKNODE3:2888:3888


//move to the zookeeper root on each server and start Zookeeper on all 3 servers
cd /usr/local/zookeeper/zookeeper-3.4.6

//start zookeeper
bin/zkServer.sh start

//if you need to stop
bin/zkServer.sh stop

//check if the process is running
ps aux | grep java

//check for QuorumPeerMain

//check the status of each server to see if they form a cluster. Exactly one of the 3 should be the leader and the others followers
bin/zkServer.sh status
This reports whether the node is running as a follower or the leader (similar to master/slave):
Mode: follower
Mode: leader


Troubleshooting Errors
Error:
Using config: /usr/local/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
mkdir: cannot create directory '/var/bin': Permission denied
Starting zookeeper ... bin/zkServer.sh: line 113: /var/bin/zookeeper/zookeeper_server.pid: No such file or directory
Solution:
Make sure that the data directory is correct and you are running as bigdatauser and not bigdataadmin





Thursday, February 18, 2016

How to switch Java version from Java 7 to Java 8 on CentOS 7?

Most of the Big Data stack technologies work with Java 1.7 (Java 7), which is installed by default in the CentOS 7 Server UI Edition. If not, follow the instructions described in How to install Java 7 and Java 8 in CentOS 7?

Some technologies require Java 1.8 (Java 8). For example, the Gremlin Server from Titan Graph requires Java 8. Once we have installed Java 8, we can switch the default Java version to 8 for those technologies that require it.

We can attempt to run technologies that require Java 7 and technologies that require Java 8 on the same box by using different user logins, each with a different Java home path, but this is not recommended. Try to stick to technologies that run with the same Java version per server.

Run the following command. You will need administrator privileges.

sudo update-alternatives --config java


This would display

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*  1           /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75-2.5.4.2.el7_0.x86_64/jre/bin/java
 + 2           /usr/java/jdk1.8.0_60/jre/bin/java


Enter to keep the current selection[+], or type selection number: 


The current selection is marked with +. To change it:
Type 1 and press Enter to switch to Java 7.
Type 2 and press Enter to switch to Java 8.
To exit without changing, press Enter.
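When scripting the switch, it can help to detect which major version a given version string represents. Legacy strings like the "1.7.0_75" and "1.8.0_60" shown above carry the major version in the second component; a small sketch:

```python
def java_major_version(version_string: str) -> int:
    """Parse the major version from a Java version string.

    Legacy scheme is '1.<major>.0_<update>' (Java 8 and earlier);
    Java 9+ drops the leading '1.' and starts with the major number.
    """
    parts = version_string.split(".")
    if parts[0] == "1":
        return int(parts[1])
    return int(parts[0])

assert java_major_version("1.7.0_75") == 7  # the OpenJDK 7 entry above
assert java_major_version("1.8.0_60") == 8  # the JDK 8 entry above
```

A shell script wrapping `update-alternatives` could use this check to decide whether a switch is actually needed.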