
Sunday, December 18, 2016

How to work with Github.com Fork Pattern?

Log in to your GitHub account. Make sure that you have access to the repository that you want to fork.

https://github.com/githubacountoffork/repositorytofork

Hit the Fork button. Once you click it, the repository is forked to your GitHub account:
https://github.com/yourgithubaccount/repositoryoffork

This fork becomes the origin to which you push changes. Basically, you make changes in your fork and then open a pull request against the original repository.

Clone the forked repository from GitHub to your local development machine.
git clone https://github.com/yourgithubaccount/repositoryoffork


Add a remote named 'upstream' pointing to the original repository:
git remote add upstream https://github.com/githubacountoffork/repositorytofork

Verify the new remote named 'upstream'. You should see both 'origin' and 'upstream' listed with their repository URLs.
git remote -v


As a best practice, do not work on the local master; create a branch for every story or task:
git branch branch-name


Check the branch that you have created.
The * on the branch name indicates the current active branch.
git branch

One of the greatest features of Git compared to other version control systems is that you don't need multiple folders for different releases such as mainline, dev, release 1, etc. You just switch between branches within the same physical repository.
The downside is that you can work on only one branch at a time, so make sure you are on the right branch at all times. To switch to a given branch use the following command:
git checkout branch-name
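Creating and switching can also be done in one step, which creates the branch and checks it out immediately:

git checkout -b branch-name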

Work on the files using the IDE of your choice. To check which files are modified, issue:
git status

When you are done with the changes, add those files to Git and commit before moving on to new work.
git add .

To add files selectively, run the add command with the file name or folder name (see the sketch below).
Then do a git status and check the output: the add command has only 'staged' the changes; nothing is committed yet.
git status
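For the selective add, for example (hypothetical file and folder names), staging can be limited to specific paths:

git add src/main/App.java
git add src/test/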

The next step is to commit the change using the following command.
Commit will take all the 'staged' files and commit to your local branch.
git commit -m 'commit message detailing the changes.'


Check status and make sure changes are committed.
git status

Push these changes to your fork; if the branch does not exist on the remote yet, it is created automatically.
git push origin 'branch-name'

Now log on to the github.com website.
From the drop down, select the branch name that you just pushed.
Create a pull request targeting the original (upstream) repository.
This will send a pull request to the reviewer, who can review and merge the changes.
After the pull request is merged into the main master, we have the option to delete the branch. Remember we have a local copy and a remote copy, so we have to delete both. Sometimes the remote branch on the fork is deleted by the person who merges the pull request; in that case there is no need to delete the remote branch on the fork.

To delete the local branch, first move to the master branch:
git checkout master

Force delete the local branch
git branch -D branch-name

If the remote branch on the fork was not deleted as part of the pull request, delete it with:
git push origin --delete branch-name

One manual step with this approach is that you need to synchronize the upstream repository into your fork. This is not done automatically, as sometimes people want to keep working on that version of the fork only; GitHub does not enforce it.

Get all changes from the remote upstream.
IMPORTANT: make sure you are on the correct working branch before you issue this command, else you could overwrite your other branches and conflicts can occur. Check the current branch with:
git branch

Pull changes from the remote upstream branch
git pull upstream master

Push these changes to your fork:
git push origin master


How to switch default java version namely Java 7 to Java 8 in Mac OS X?

Once in a while the Java version on your developer environment needs to be upgraded, and the new version needs to become the default across the different software that you use for the product or solution.


Before changing the default, find the current default in the system by typing

java -version

This should give something like 
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

Now check the other versions installed on the system

/usr/libexec/java_home -V

This should give something like
Matching Java Virtual Machines (2):
    1.8.0_45, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home
    1.7.0_79, x86_64: "Java SE 7" /Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home

Now change the version to Java 8

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_45)

Verify that the default has changed by running the same command again

java -version

This should now show the changed default
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
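To make the new default persist for future terminal sessions, the export can be added to the shell profile (a sketch, assuming bash is the login shell):

# append the export to ~/.bash_profile and reload it (assumes bash)
echo 'export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_45)' >> ~/.bash_profile
source ~/.bash_profile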



Saturday, December 17, 2016

Source control management for products or projects

As with any software development, one needs source control to keep versions of the code, for a few of the following reasons.

1. Keep code in safe place apart from your laptop
2. Have different versions of the same code
3. Collaborate with other developers
4. Have different releases and each have its own release cycles of Service/Feature Packs, Hot fixes etc.
5. Release management to separate development code vs shippable code

The choice right now in this age is github.com

Go to the website and create a new GitHub account or handle.

You have two choices
1. Make the repositories public, in which case you don't have to pay anything, but your code is open and anyone can see it.
2. Make the repositories private and incur a cost, starting at $7/month for a personal account and more for other account types.


Source control pattern recommendation

"Fork Pattern is the way forward"
We all come from traditional software development, where we want to keep our branches forever, and old habits die hard. There is no reason for all developers to work off the same branch.

The new mantra is to have a release version defined. Each developer takes a branch for every story or task, makes the changes, opens a pull request to be merged with the master, and deletes the branch once it is merged. This way there are no conflicts between branches and nothing to synchronize across developers; each developer simply throws away the branch and takes a new one to continue. Do make sure that the solution is modular so that developers don't step on each other's toes.

So go ahead and change archaic philosophies to accommodate this, and leave the habit of keeping every branch forever to the legacy software teams that don't want to move on.



Note: The cost was as of Dec 2016



Sunday, December 11, 2016

How to install JSON editor on Eclipse Neon IDE for Java Developers?

I am not sure why Eclipse thinks that a JSON editor is only for the Web Developer IDE; but until they realize that JSON is used even in the Java big data world, we have to install the JSON editor inside the Eclipse Neon IDE for Java Developers to be able to edit JSON documents.

On the Eclipse Neon IDE for Java Developers, click on "Help" - > "Install New Software"

Next to "Work with", click on "Add".

In the "Add Repository" dialog:
for Name enter "eclipseneon"
for Location enter "http://download.eclipse.org/releases/neon"

Click OK.
This would try to fetch all the software available at this location.

Once the software list is loaded, filter by "Eclipse Web Developer Tools"

Find the software and place a check mark, follow instructions to install this software.


Once done, Eclipse Neon would ask you to restart it.

Now you can edit JSON documents in Eclipse IDE for Java Developers.


Note: As of this writing the Eclipse JSON editor has a glaring bug: it doesn't know how to handle arrays when it formats. We are surprised by the quality of the deliverable, which is not typical of Eclipse, and are not sure why this was not caught.

A simpler alternative is Json Tools 1.0.1, which is the best in terms of formatting and handles large files. It does tend to become sluggish, like any XML editor, as JSON files increase in size.


Update: Feb 2019
Do not use this JSON editor or any JSON editor for Eclipse, as they are buggy. Use Visual Studio Code instead, as listed here: Changing JSON Editor in Eclipse

Saturday, October 22, 2016

How to enable SSH on your developer Mac OSX?

For most big data technologies, the ability to do passwordless ssh between machines is a must.
In order to make these technologies work, you need to enable ssh on your Mac (El Capitan). A quick verification sketch follows the steps below.

1. Click on System Preference
2. Click on Sharing
3. On the left hand side under "Service" enable "Remote Login"
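To confirm from a terminal that Remote Login is enabled, the following can be used (a sketch; systemsetup requires admin rights):

# check whether Remote Login (ssh) is enabled
sudo systemsetup -getremotelogin

# try connecting to your own machine; it prompts for your account password unless keys are set up
ssh localhost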

How to Setup a 3 Node Apache Hbase 1.2.3 cluster in CentOS 7?

The following needs to be done before beginning the Apache HBase cluster setup.

1. Create 3 CentOS 7 Servers HBNODE1, HBNODE2 and HBNODE3 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?.

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change the /etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser setup password less ssh across the 3 clusters namely HBNODE1, HBNODE2 and HBNODE3 as discussed in How to setup password less ssh between CentOS 7 cluster servers?


7. Install Apache Zookeeper clusters as discussed in How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7? Make sure you do the same as in step 5 for these servers too.
8. Install Apache Hadoop clusters as discussed in How to Setup a 3 Node Apache Hadoop 2.7.3 cluster in CentOS 7? Make sure you do the same as in step 5 for these servers too.

For each of the Servers HBNODE1, HBNODE2 and HBNODE3 do the following.
 
Login using the bigdataadmin
 
#create a folder for hbase under the /usr/local directory
cd /usr/local
sudo mkdir hbase
 
#change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup hbase

#Switch to bigdatauser
su bigdatauser

#move to a download folder and download hbase
wget http://www-eu.apache.org/dist/hbase/1.2.3/hbase-1.2.3-bin.tar.gz

#unzip the files
tar xzf hbase-1.2.3-bin.tar.gz

#move this to the common directory
mv hbase-1.2.3 /usr/local/hbase

#go to the hbase directory
cd /usr/local/hbase/hbase-1.2.3

#move to config directory
cd conf

#edit hbase-env.sh
vi hbase-env.sh

#change Java Home Path
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75-2.5.4.2.el7_0.x86_64/jre

#disable internal zookeeper
export HBASE_MANAGES_ZK=false

#save
wq

#edit the hbase-site.xml

vi hbase-site.xml

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed Zookeeper
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hdnode1:9000/user/hadoop/hbase</value>
    <description>The directory shared by RegionServers.</description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zknode1,zknode2,zknode3</value>
    <description>The Zookeeper ensemble</description>
  </property>
</configuration>

#save
wq

#edit the regionservers file only on the master node hbnode1

vi regionservers
hbnode2
hbnode3

#save
wq

#move to the root folder and start the HBase cluster from the master node hbnode1
cd /usr/local/hbase/hbase-1.2.3

bin/start-hbase.sh

#This would start the region servers on the other nodes too
#check for the following process 

ps aux | grep hbase

#HMaster on master hbnode1 and HRegionServer on other nodes.

#view the status of the cluster in the following URL
http://hbnode1:16010/master-status

This should display the nodes as well as other details like Zookeeper etc.
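As an additional check, the HBase shell can be used to query the cluster status (a sketch, run from the HBase root directory on hbnode1):

bin/hbase shell
#inside the shell, the status command summarizes the masters and region servers
status
#leave the shell
exit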


Wednesday, October 12, 2016

Linux systems folder structure - File System Hierarchy Standard (FHS)

The following link describes the Linux File System Hierarchy Standard structure that all developers should be aware of when using Linux systems. It should also give an idea of where to place the software we develop for deployment (a small sketch follows the folder list below). https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/3/html/Reference_Guide/s1-filesystem-fhs.html
Pay attention to the following folders:

/usr/
/usr/libexec
/usr/local
/var/lib
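As an illustration of the convention used throughout these posts (a sketch with a hypothetical software name), installations go under /usr/local and variable data under /var/lib:

//create the installation directory for a hypothetical package
sudo mkdir /usr/local/yoursoftware

//create its variable data directory
sudo mkdir /var/lib/yoursoftware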



3.2. Overview of File System Hierarchy Standard (FHS)

Red Hat Enterprise Linux uses the Filesystem Hierarchy Standard (FHS) file system structure, which defines the names, locations, and permissions for many file types and directories.
The FHS document is the authoritative reference to any FHS-compliant file system, but the standard leaves many areas undefined or extensible. This section is an overview of the standard and a description of the parts of the file system not covered by the standard.
Compliance with the standard means many things, but the two most important are compatibility with other compliant systems and the ability to mount a /usr/ partition as read-only. This second point is important because the directory contains common executables and should not be changed by users. Also, since the /usr/ directory is mounted as read-only, it can be mounted from the CD-ROM or from another machine via a read-only NFS mount.

3.2.1. FHS Organization

The directories and files noted here are a small subset of those specified by the FHS document. Refer to the latest FHS document for the most complete information.
The complete standard is available online at http://www.pathname.com/fhs/.

3.2.1.1. The /boot/ Directory

The /boot/ directory contains static files required to boot the system, such as the Linux kernel. These files are essential for the system to boot properly.
Warning
Do not remove the /boot/ directory. Doing so will render the system unbootable.

3.2.1.2. The /dev/ Directory

The /dev/ directory contains file system entries which represent devices that are attached to the system. These files are essential for the system to function properly.

3.2.1.3. The /etc/ Directory

The /etc/ directory is reserved for configuration files that are local to the machine. No binaries are to be put in /etc/. Any binaries that were once located in /etc/ should be placed into /sbin/ or /bin/.
The X11/ and skel/ directories are subdirectories of the /etc/ directory:
/etc
  |- X11/
  |- skel/
The /etc/X11/ directory is for X Window System configuration files such as XF86Config. The /etc/skel/ directory is for "skeleton" user files, which are used to populate a home directory when a user is first created.

3.2.1.4. The /lib/ Directory

The /lib/ directory should contain only those libraries needed to execute the binaries in /bin/ and /sbin/. These shared library images are particularly important for booting the system and executing commands within the root file system.

3.2.1.5. The /mnt/ Directory

The /mnt/ directory is for temporarily mounted file systems, such as CD-ROMs and 3.5 diskettes.

3.2.1.6. The /opt/ Directory

The /opt/ directory provides storage for large, static application software packages.
A package placing files in the /opt/ directory creates a directory bearing the same name as the package. This directory, in turn, holds files that otherwise would be scattered throughout the file system, giving the system administrator an easy way to determine the role of each file within a particular package.
For example, if sample is the name of a particular software package located within the /opt/ directory, then all of its files are placed in directories inside the /opt/sample/ directory, such as /opt/sample/bin/ for binaries and /opt/sample/man/ for manual pages.
Large packages that encompass many different sub-packages, each of which accomplish a particular task, are also located in the /opt/ directory, giving that large package a way to organize itself. In this way, our sample package may have different tools that each go in their own sub-directories, such as /opt/sample/tool1/ and /opt/sample/tool2/, each of which can have their own bin/man/, and other similar directories.

3.2.1.7. The /proc/ Directory

The /proc/ directory contains special files that either extract information from or send information to the kernel.
Due to the great variety of data available within /proc/ and the many ways this directory can be used to communicate with the kernel, an entire chapter has been devoted to the subject. For more information, please refer to Chapter 5 The proc File System.

3.2.1.8. The /sbin/ Directory

The /sbin/ directory stores executables used by the root user. The executables in /sbin/ are only used at boot time and perform system recovery operations. Of this directory, the FHS says:
/sbin contains binaries essential for booting, restoring, recovering, and/or repairing the system in addition to the binaries in /bin. Programs executed after /usr/ is known to be mounted (when there are no problems) are generally placed into /usr/sbin. Locally-installed system administration programs should be placed into /usr/local/sbin.
At a minimum, the following programs should be in /sbin/:
arp, clock,
halt, init, 
fsck.*, grub
ifconfig, lilo, 
mingetty, mkfs.*, 
mkswap, reboot, 
route, shutdown, 
swapoff, swapon

3.2.1.9. The /usr/ Directory

The /usr/ directory is for files that can be shared across multiple machines. The /usr/ directory is often on its own partition and is mounted read-only. At minimum, the following directories should be subdirectories of /usr/:
/usr
  |- bin/
  |- dict/
  |- doc/
  |- etc/
  |- games/
  |- include/
  |- kerberos/
  |- lib/
  |- libexec/     
  |- local/
  |- sbin/
  |- share/
  |- src/
  |- tmp -> ../var/tmp/
  |- X11R6/
Under the /usr/ directory, the bin/ directory contains executables, dict/ contains non-FHS compliant documentation pages, etc/ contains system-wide configuration files, games is for games, include/ contains C header files, kerberos/ contains binaries and other Kerberos-related files, and lib/ contains object files and libraries that are not designed to be directly utilized by users or shell scripts. The libexec/ directory contains small helper programs called by other programs, sbin/ is for system administration binaries (those that do not belong in the /sbin/ directory), share/ contains files that are not architecture-specific, src/ is for source code, and X11R6/ is for the X Window System (XFree86 on Red Hat Enterprise Linux).

3.2.1.10. The /usr/local/ Directory

The FHS says:
The /usr/local hierarchy is for use by the system administrator when installing software locally. It needs to be safe from being overwritten when the system software is updated. It may be used for programs and data that are shareable among a group of hosts, but not found in /usr.
The /usr/local/ directory is similar in structure to the /usr/ directory. It has the following subdirectories, which are similar in purpose to those in the /usr/ directory:
/usr/local
       |- bin/
       |- doc/
       |- etc/
       |- games/
       |- include/
       |- lib/
       |- libexec/
       |- sbin/
       |- share/
       |- src/
In Red Hat Enterprise Linux, the intended use for the /usr/local/ directory is slightly different from that specified by the FHS. The FHS says that /usr/local/ should be where software that is to remain safe from system software upgrades is stored. Since software upgrades can be performed safely with Red Hat Package Manager (RPM), it is not necessary to protect files by putting them in /usr/local/. Instead, the /usr/local/ directory is used for software that is local to the machine.
For instance, if the /usr/ directory is mounted as a read-only NFS share from a remote host, it is still possible to install a package or program under the /usr/local/ directory.

3.2.1.11. The /var/ Directory

Since the FHS requires Linux to mount /usr/ as read-only, any programs that write log files or need spool/ or lock/ directories should write them to the /var/ directory. The FHS states /var/ is for:
...variable data files. This includes spool directories and files, administrative and logging data, and transient and temporary files.
Below are some of the directories found within the /var/ directory:
/var
  |- account/
  |- arpwatch/
  |- cache/
  |- crash/
  |- db/
  |- empty/
  |- ftp/
  |- gdm/
  |- kerberos/
  |- lib/
  |- local/
  |- lock/
  |- log/
  |- mail -> spool/mail/
  |- mailman/
  |- named/
  |- nis/
  |- opt/
  |- preserve/
  |- run/
  +- spool/
       |- at/
       |- clientmqueue/
       |- cron/
       |- cups/
       |- lpd/
       |- mail/
       |- mqueue/
       |- news/
       |- postfix/ 
       |- repackage/
       |- rwho/
       |- samba/ 
       |- squid/
       |- squirrelmail/
       |- up2date/ 
       |- uucppublic/
       |- vbox/
  |- tmp/
  |- tux/
  |- www/
  |- yp/
System log files such as messages/ and lastlog/ go in the /var/log/ directory. The /var/lib/rpm/ directory contains RPM system databases. Lock files go in the /var/lock/ directory, usually in directories for the program using the file. The /var/spool/ directory has subdirectories for programs in which data files are stored.



Sunday, July 3, 2016

How to delete a topic in Apache Kafka Message Broker 0.9.x?

Deleting a topic is relevant only in development or testing environments. DO NOT enable this setting in production.

To delete a topic (comparable to a message queue in other systems), you need the following:
1. The zookeeper ensemble that Kafka clusters use.
2. Enable delete of topic in the server.properties namely
delete.topic.enable=true
Refer to How to setup standalone instance of Apache Kafka 0.9.0.1 on localhost for Mac OS X? for enabling this setting.

For a Kafka cluster installation with a Zookeeper ensemble, refer to How to setup a 2 Node Apache Kafka 0.9.0.1 cluster in CentOS 7?

For the cluster, navigate to the Kafka installation directory on any node:


cd /usr/local/kafka/kafka_2.11-0.9.0.1

bin/kafka-topics.sh --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181  --delete --topic topicName



For a standalone (localhost) installation, navigate to the installation directory:
cd /usr/local/kafka/kafka_2.11-0.9.0.1

bin/kafka-topics.sh --zookeeper yourmac.local:2181 --delete --topic topicName


Note: Make sure that you have killed all consumers before you delete the topic. Kafka can take anywhere between 2 seconds and a minute to delete a topic; when the delete command is issued, the topic is only marked for deletion.
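To confirm that the topic is gone (or still marked for deletion), the topics can be listed (a sketch, using the same cluster Zookeeper ensemble):

bin/kafka-topics.sh --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181 --list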




How to read messages in a topic from Apache Kafka Message Broker 0.9.x?



Sometimes we need to quickly check what messages are present in an Apache Kafka topic. Apache Kafka provides a console consumer shell for reading messages off a topic. Kafka does not allow you to read a message by message id or partition key; you can only read from the beginning or from the last position that was read, which is automatically maintained in Zookeeper.

The reader behaves like an application waiting for messages and reads continuously as long as you don't kill the session in the console. For example, if a Kafka producer produces 100 messages at time t1, you would see all 100 messages printed in the console; if another 10 messages arrive at time t2 while the consumer is still running, you would then see only those next 10 messages.

Kafka consumers have a concept of an offset, i.e. the last position in the messages that they have read. This offset is maintained in Apache Zookeeper. Since Kafka supports multiple partitions, an offset is maintained for each partition.

Apache Kafka is not like most other message queue systems, where a message can be read by only one consumer and is removed after reading. Kafka allows multiple consumers to read from the same topic; it is the responsibility of each consumer to keep track of what it has read. The default Kafka installation keeps messages for 7 days, after which they are removed from the topic.

Every time the Kafka consumer shell is invoked it maintains its offset in Zookeeper. Apache Kafka supports two types of messages, String and Binary. Only the content of the message is printed to the console, without the partition key or the partition it was read from, so the console consumer is useful mainly for String message types.


To read messages from a topic (comparable to a message queue in other systems), you need the following:
1. The zookeeper ensemble that Kafka clusters use.

For a Kafka cluster installation with a Zookeeper ensemble, refer to How to setup a 2 Node Apache Kafka 0.9.0.1 cluster in CentOS 7?

For the cluster, navigate to the Kafka installation directory on any node:


cd /usr/local/kafka/kafka_2.11-0.9.0.1

bin/kafka-console-consumer.sh --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181 --topic topicName --from-beginning


To exit reading you would need to kill the process.

For CentOS7
To stop reading press ctrl + C to exit to the shell 
Kafka would now print the number of messages it has read. It would always be 1 more than the messages that your producer has put inside the topic.

ps -aux | grep kafka
to view the consumer processes and note the process id
kill processid or force kill using kill -9 processid




For a standalone (localhost) installation, navigate to the installation directory:
cd /usr/local/kafka/kafka_2.11-0.9.0.1

bin/kafka-console-consumer.sh --zookeeper yourmac.local:2181 --topic yourTopic --from-beginning

To exit
For Mac
To stop reading press ctrl + C to exit to the shell 
Kafka would now print the number of messages it has read. It would always be 1 more than the messages that your producer has put inside the topic.

ps -a | grep kafka
to view the consumer processes and note the process id
kill processid or force kill using kill -9 processid

Note: if there are no messages, the consumer will simply wait. Do not assume that no messages are coming; check your producer to verify that it is correctly sending to the topic.
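To quickly verify the producer side, test messages can be sent from another terminal with the console producer that ships with Kafka (a sketch; assumes a broker listening on the default port 9092):

bin/kafka-console-producer.sh --broker-list yourmac.local:9092 --topic yourTopic
Type a few lines and press enter after each; they should appear in the running consumer.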

How to create a topic in Apache Kafka Message Broker 0.9.x?

To create a topic (comparable to a message queue in other systems), you need the following:
1. The zookeeper ensemble that Kafka clusters use.

For a Kafka cluster installation with a Zookeeper ensemble, refer to How to setup a 2 Node Apache Kafka 0.9.0.1 cluster in CentOS 7?

For the cluster, navigate to the Kafka installation directory on any node:


cd /usr/local/kafka/kafka_2.11-0.9.0.1

bin/kafka-topics.sh --create --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181 --replication-factor 2 --partitions 8 --topic topicname

This assumes that you have at least a 2 node cluster. If you set up more than 2 nodes you can increase the replication factor correspondingly. The partition count is the number of concurrent reads that you would like to perform from your application. In order to better utilize partitions you need to understand the partition key, which we will cover in later lessons.
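After creation, the partition and replica assignment can be inspected with the describe option (a sketch):

bin/kafka-topics.sh --describe --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181 --topic topicname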


For a standalone (localhost) installation, navigate to the installation directory:
cd /usr/local/kafka/kafka_2.11-0.9.0.1

bin/kafka-topics.sh --create --zookeeper yourmac.local:2181 --replication-factor 1 --partitions 4 --topic topicname





Saturday, May 21, 2016

How to setup standalone instance of Apache Kafka 0.9.0.1 on localhost for Mac OS X?


Apache Kafka is a distributed message broker that allows reading messages sequentially, maintaining the order in which they arrived.
It also allows multiple parallel reads by way of partitions: a 4 partition topic allows 4 threads to read messages in parallel. However, it is the developer's job to make sure that messages for the same entity go sequentially to the same thread rather than to different threads. Kafka does this with a message key: if we send entity E1 with key Key1 at time T1, and another message for E1 arrives at time T2, it is the developer's responsibility to use the same Key1 so that the messages are read in order by one thread, namely E1 T1 first and then E1 T2, and so on. (A sketch of sending keyed messages from the console producer follows at the end of this post.)
1. You have admin privileges for your development box
2. Make sure Java 7 or 8 is installed and configured as default as discussed in How to install Java 7 and Java 8 in Mac OS X
3. Make sure Apache Zookeeper standalone is installed as specified in How to setup standalone instance of Apache Zookeeper 3.4.6 on localhost for Mac OS X?
//create a folder for kafka under the /usr/local directory
cd /usr/local
sudo mkdir kafka

//create the data cum log directory for kafka under the var/lib
cd /var/lib
sudo mkdir kafka

//download kafka
wget http://apache.claz.com/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz

//unpack the file
tar xzf kafka_2.11-0.9.0.1.tgz

//move the kafka installation to the usr/local/kafka from the download directory
mv kafka_2.11-0.9.0.1 /usr/local/kafka/

//switch to the kafka directory
cd /usr/local/kafka/kafka_2.11-0.9.0.1/

//switch to the config directory
cd config

edit the config file and change the following

vi server.properties

#The broker Id should be unique
broker.id=1

#change data cum log directory to
log.dirs=/var/lib/kafka

#include the zookeeper servers
zookeeper.connect=YOURMACHOSTNAME.local:2181

#Since this is a dev machine allow a topic to be deleted
delete.topic.enable=true

//save the file
:wq
//move to the kafka root
cd /usr/local/kafka/kafka_2.11-0.9.0.1

//start kafka broker
bin/kafka-server-start.sh config/server.properties >/dev/null &

//if you need to stop
kill processid

//check if the process is running
ps -a | grep kafka

//or use jps
jps


How to setup standalone instance of Apache Zookeeper 3.4.6 on localhost for Mac OS X?

Apache Zookeeper is a distributed state manager that other systems use for state management. You can set up a standalone Zookeeper instead of a built-in one and share the instance across multiple technologies like Kafka, Storm, HBase, etc., so that each of them does not start its own instance. These instructions let you set up Zookeeper as a standalone instance.

1. You have admin privileges for your development box
2. Make sure Java 7 or 8 is installed and configured as default as discussed in How to install Java 7 and Java 8 in Mac OS X



//create a folder for zookeeper under the /usr/local directory
cd /usr/local
sudo mkdir zookeeper

//create the data directory for Zookeeper under the var/lib
cd /var/lib
sudo mkdir zookeeper

//create a file named myid under the data directory
cd /var/lib/zookeeper
vi myid

//Put only the number 1.
1

//save the file
:wq

if you do a cat myid it should just display 1


//download zookeeper on any local directory
wget http://apache.arvixe.com/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

//unpack the file
tar xzf zookeeper-3.4.6.tar.gz


//move the zookeeper installation to the usr/local/zookeeper 
//from the download directory
mv zookeeper-3.4.6 /usr/local/zookeeper/

//switch to the /usr/local/zookeeper directory
cd /usr/local/zookeeper/zookeeper-3.4.6

//move to the conf folder for the version of zookeeper like
cd conf

//copy the sample config to zoo.cfg
cp zoo_sample.cfg zoo.cfg

//switch to the conf directory
cd /usr/local/zookeeper/zookeeper-3.4.6/conf

edit the zoo.cfg file and change the data directory to

vi zoo.cfg

//change data directory to
dataDir=/var/lib/zookeeper

#include the cluster servers

server.1=YOURMACHOSTNAME.local:2888:3888


//move to the root of zookeeper
cd /usr/local/zookeeper/zookeeper-3.4.6

//start zookeeper
bin/zkServer.sh start

//if you need to stop
bin/zkServer.sh stop

//check if the process is running
jps

//check for QuorumPeerMain

//check the status of zookeeper
bin/zkServer.sh status
//This should display
Mode: standalone
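The instance can also be checked with the Zookeeper command line client (a sketch, run from the Zookeeper root directory):

bin/zkCli.sh -server localhost:2181
//inside the client, list the root znodes and then quit
ls /
quit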


Monday, March 14, 2016

How to setup a 2 Node Apache Kafka 0.9.0.1 cluster in CentOS 7?

Apache Kafka is one of the realtime message brokers used for realtime stream processing in big data world.


The following needs to be done before beginning  the Apache Kafka cluster Setup.

1. Create 2 CentOS 7 Servers KFNODE1 and KFNODE2 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?.

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser setup password less ssh across the 2 cluster servers namely KFNODE1 and KFNODE2 as discussed in How to setup password less ssh between CentOS 7 cluster servers?

7. Install Apache Zookeeper clusters as discussed in How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?. Make sure you do the same as in step 5 for these servers too.


For each of the Servers KFNODE1 and KFNODE2 do the following.

Login using the bigdataadmin

//create a folder for kafka under the /usr/local directory
cd /usr/local
sudo mkdir kafka

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup kafka

//create the data cum log directory for kafka under the var/lib
cd /var/lib
sudo mkdir kafka

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup kafka

Switch to bigdatauser

//download kafka
wget http://apache.claz.com/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz

//unpack the file
tar xzf kafka_2.11-0.9.0.1.tgz


//move the kafka installation to the usr/local/kafka from the download directory
mv kafka_2.11-0.9.0.1 /usr/local/kafka/

//switch to the kafka directory
cd /usr/local/kafka/kafka_2.11-0.9.0.1/



//switch to the config directory
cd config

edit the config file and change the following

vi server.properties


#The broker Id should be unique for KFNODE1 and KFNODE2
#KFNODE1
broker.id=1

#KFNODE2
broker.id=2

#change data cum log directory to
log.dirs=/var/lib/kafka


#include the zookeeper servers

zookeeper.connect=ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181

//move to the kafka root directory on each server. Start Kafka on both servers.
cd /usr/local/kafka/kafka_2.11-0.9.0.1

//start kafka broker
bin/kafka-server-start.sh config/server.properties >/dev/null &


//if you need to stop
kill processid

//check if the process is running
ps -aux | grep kafka

//check for the kafka data/log folders 


//There is no built-in UI for Kafka nor any command to query the broker list.
//We can use the create topic script to see if we have a cluster: here we are
//attempting to create a topic with a replication factor of 3, and the error will report
//how many brokers we have, that is 2
bin/kafka-topics.sh --create --zookeeper ZKNODE1:2181,ZKNODE2:2181,ZKNODE3:2181 --replication-factor 3 --partitions 4 --topic testkfbrokers

//Error while executing topic command : replication factor: 3 larger than available brokers: 2 
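Alternatively, the registered broker ids can be read directly from Zookeeper using the shell that ships with Kafka (a sketch):

bin/zookeeper-shell.sh ZKNODE1:2181
//inside the shell, list the registered broker ids; you should see [1, 2]
ls /brokers/ids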

How to setup a 3 Node Apache Storm 0.10 cluster in CentOS 7?

Apache Storm is the real thing for realtime computing. There are others, like Apache Spark Streaming, that claim to be realtime computing but are really a modified behavior of what they were designed for. Apache Storm, or any variation of its design pattern, is the one to pick for realtime big data computing. Apache Storm by default wants to run under a supervisor process; here we are running it as a background process instead.


The following needs to be done before beginning  the Storm cluster Setup.

1. Create 3 CentOS 7 Servers STNODE1, STNODE2, and STNODE3 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?. You can also install Apache Storm with Java 8.

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser setup password less ssh across the 3 clusters namely STNODE1, STNODE2 and STNODE3 as discussed in How to setup password less ssh between CentOS 7 cluster servers?

7. Install Apache Zookeeper clusters as discussed in How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?. Make sure you do the same as in step 5 for these servers too.

Storm has a concept of Master and Supervisor (Worker) Nodes. We are going to set STNODE1 as the Master, Storm UI and the DRPC Server Roles. The other nodes would run the Supervisor roles.

For each of the Servers STNODE1, STNODE2 and STNODE3 do the following.

Login using the bigdataadmin

//create a folder for storm under the /usr/local directory
cd /usr/local
sudo mkdir storm

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup storm

//create the data directory for storm under the var/lib
cd /var/lib
sudo mkdir storm

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  storm


Switch to bigdatauser

//download storm
wget http://apache.claz.org/storm/apache-storm-0.10.0/apache-storm-0.10.0.tar.gz

//unpack the file
tar xzf apache-storm-0.10.0.tar.gz


//move the storm installation to the usr/local/storm from the download directory
mv apache-storm-0.10.0 /usr/local/storm/

//switch to the storm directory
cd /usr/local/storm/apache-storm-0.10.0



//switch to the conf directory
cd conf

edit the config file and change the following

vi storm.yaml

#include the zookeeper servers

storm.zookeeper.servers:
  - ZKNODE1
  - ZKNODE2
  - ZKNODE3

//change data directory to
storm.local.dir: "/var/lib/storm"


//change the nimbus host so that all servers know its in a cluster
nimbus.host: "STNODE1"

//we can run DRPC on all the servers
drpc.servers:
  - STNODE1
  - STNODE2
  - STNODE3


//move to the storm root directory on each server. Start Storm on all 3 servers as described below.
cd /usr/local/storm/apache-storm-0.10.0

//start nimbus (master) only on STNODE1
bin/storm nimbus >/dev/null &


//start storm UI only on STNODE1
bin/storm ui >/dev/null &

//start supervisors on STNODE2 and STNODE3
bin/storm supervisor >/dev/null &

//start DRPC on all Servers
bin/storm drpc >/dev/null &

//if you need to stop
kill processid

//check if the process is running
ps -aux | grep java

//check for backtype.storm.daemon.nimbus for Nimbus

//check for backtype.storm.ui.core for UI
//check for backtype.storm.daemon.drpc for DRPC
//check for backtype.storm.daemon.supervisor for supervisor

//check the status of the cluster from the UI
http://stnode1:8080

//you should be able to see 1 nimbus and 2 supervisor servers, if we have configured it correctly. 
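Running topologies can also be listed from the command line on the nimbus node (a sketch, run from the Storm root directory):

bin/storm list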

Troubleshooting
If the UI does not come up, make sure that all the services, including the Zookeeper instances, are running and that the bigdatauser can ssh into all the servers, including the Zookeeper ones.

Friday, March 11, 2016

How to setup a 2 Node Elastic Search 2.2.0 cluster in CentOS 7?

Elasticsearch is one of the technologies out there that can search across big data. It can scale horizontally as the data volume increases. As with all installations, check the current version before following these instructions.

The following needs to be done before beginning  the Elastic cluster Setup.

1. Create 2 CentOS 7 Servers ESNODE1 and ESNODE2 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser setup password less ssh across the 2 clusters namely ESNODE1 and ESNODE2 as discussed in How to setup password less ssh between CentOS 7 cluster servers?


For each of the Server ESNODE1 and ESNODE2 do the following


Login using the bigdataadmin

//create a folder for elastic search under the /usr/local directory
cd /usr/local
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup es

//create the data directory for elastic search under the var/lib
cd /var/lib
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  es


//create the log directory for elastic search under the var/log
cd /var/log
sudo mkdir es

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  es


Switch to bigdatauser

//download elastic search 2.2.0
wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.2.0/elasticsearch-2.2.0.tar.gz

//unpack the file
tar xzf elasticsearch-2.2.0.tar.gz


//move the elastic installation to the usr/local/es from the download directory


mv elasticsearch-2.2.0 /usr/local/es/

//switch to the /usr/local/es directory
cd /usr/local/es/elasticsearch-2.2.0

//move to the conf folder for the version of elastic search like
cd config

edit the elasticsearch.yml file and change the following settings

vi elasticsearch.yml


//uncomment (remove the #) and change the following settings

//change the cluster name
cluster.name: escluster

//change the node name (use a different name on each node, e.g. ESNODE1, ESNODE2)
node.name: ESNODE1

//change the path to data directory
path.data: /var/lib/es


//change the path to the log directory
path.logs: /var/log/es


//give the IP address assigned to this server
//make sure you have a static IP and different for each server
network.host: 192.168.0.5


//move to the elasticsearch root directory on each server. Start Elasticsearch on both servers.
cd /usr/local/es/elasticsearch-2.2.0

//start elastic in daemon mode (background process)
bin/elasticsearch -d

//if you need to stop find and kill the process
kill pid


//check if the process is running
ps -aux | grep java

//check for ElasticSearch

//check cluster health
http://ESNODE1:9200/_cluster/health?pretty
//you should find a JSON with
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,

//check the version of elastic
http://ESNODE1:9200/
//you should find a JSON with
"version" : { 

 "number" : "2.2.0",

Friday, February 19, 2016

How to setup a 3 Node Apache Zookeeper 3.4.6 cluster in CentOS 7?

Zookeeper, in short, is a distributed state manager that can be used by many clusters to maintain state. For example, HBase can use Zookeeper to maintain state across its own set of cluster nodes without having to keep cluster state within itself.

The following needs to be done before beginning  the Zookeeper cluster Setup.

1. Create 3 CentOS 7 Servers ZKNODE1, ZKNODE2, and ZKNODE3 as discussed in How to install CentOS 7 on Virtual Machine using VMWare vSphere 6 client?

2. Make sure Java 7 is installed and configured as default as discussed in How to install Java 7 and Java 8 in CentOS 7?

3. Create the bigdatauser, bigdataadmin and the bigdatagroup as discussed in How to create a user, group and enable him to do what a super user can in CentOS7?

4. Make sure the firewall is disabled and stopped as discussed in How to turn off firewall on CentOS 7? 

5. Change etc/hosts file so that all the IPs and the names of the servers are resolved as discussed in
How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

6. Using the bigdatauser setup password less ssh across the 3 clusters namely ZKNODE1, ZKNODE2 and ZKNODE3 as discussed in How to setup password less ssh between CentOS 7 cluster servers?


For each of the Server ZKNODE1, ZKNODE2 and ZKNODE3 do the following

Login using the bigdataadmin

//create a folder for zookeeper under the /usr/local directory
cd /usr/local
sudo mkdir zookeeper

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup zookeeper

//create the data directory for Zookeeper under the var/lib
cd /var/lib
sudo mkdir zookeeper

//change ownership to bigdatauser
sudo chown -R bigdatauser:bigdatagroup  zookeeper


Switch to bigdatauser

//create a file named myid under the data directory
cd /var/lib/zookeeper
vi myid

Put only the number for the corresponding server. DO NOT put all 3 numbers in each server.
on ZKNODE1
1
on ZKNODE2
2
on ZKNODE3
3

if you do a cat myid it should just display 1 for ZKNODE1 and so on.


//download zookeeper
wget http://apache.arvixe.com/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

//unpack the file
tar xzf zookeeper-3.4.6.tar.gz


//move the zookeeper installation to the usr/local/zookeeper from the download directory
mv zookeeper-3.4.6 /usr/local/zookeeper/

//switch to the /usr/local/zookeeper directory
cd /usr/local/zookeeper/zookeeper-3.4.6

//move to the conf folder for the version of zookeeper like
cd conf

//copy the sample config to zoo.cfg
cp zoo_sample.cfg zoo.cfg

//switch to the conf directory
cd /usr/local/zookeeper/zookeeper-3.4.6/conf

edit the zoo.cfg file and change the data directory to

vi zoo.cfg

//change data directory to
dataDir=/var/lib/zookeeper

#include the cluster servers
server.1=ZKNODE1:2888:3888
server.2=ZKNODE2:2888:3888
server.3=ZKNODE3:2888:3888


//move to the zookeeper root directory on each server. Start Zookeeper on all 3 servers.
cd /usr/local/zookeeper/zookeeper-3.4.6

//start zookeeper
bin/zkServer.sh start

//if you need to stop
bin/zkServer.sh stop

//check if the process is running
ps -aux | grep java

//check for QuorumPeerMain

//check the status of each server to see if they are in a cluster. Only one of the 3 should be the leader and the others followers
bin/zkServer.sh status
This shows whether the server is running as follower or leader (similar to master/slave).
Mode: follower
Mode: leader
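Each instance can also be probed with Zookeeper's four-letter commands (a sketch; assumes nc is installed on the server):

//'stat' prints the mode, connections, and node count for the instance
echo stat | nc ZKNODE1 2181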


Troubleshooting Errors
Error:
Using config: /usr/local/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
mkdir: cannot create directory '/var/bin': Permission denied
Starting zookeeper ... bin/zkServer.sh: line 113: /var/bin/zookeeper/zookeeper_server.pid: No such file or directory
Solution:
Make sure that the data directory is correct and you are running as bigdatauser and not bigdataadmin





Thursday, February 18, 2016

How to switch Java version from Java 7 to Java 8 on CentOS 7?

Most of the big data stack technologies work with Java 1.7 (Java 7), which is installed by default in the CentOS 7 Server with UI edition. If not, follow the instructions described in How to install Java 7 and Java 8 in CentOS 7?

Some technologies require Java 1.8 or 8. For example the Gremlin Server from Titan Graph requires Java 8. Once we have installed Java 8 we can switch the default Java version to 8 for those technologies that required Java 8 as the default.

We can attempt to run technologies that require Java 7 and Java 8 on the same box by using different user logins, each with a different Java home path, but this is not recommended. Try to stick to technologies that run with the same Java version per server.

Run the following command. You need to be an administrator.

sudo update-alternatives --config java


This would display

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*  1           /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75-2.5.4.2.el7_0.x86_64/jre/bin/java
 + 2           /usr/java/jdk1.8.0_60/jre/bin/java


Enter to keep the current selection[+], or type selection number: 


The current selection is marked with +. To change it:
Type 1 and press enter to switch to Java 7.
Type 2 and press enter to switch to Java 8.
To exit without changing anything, just press enter.
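The selection can also be made non-interactively (a sketch; the path is the Java 8 entry from the listing above):

sudo alternatives --set java /usr/java/jdk1.8.0_60/jre/bin/java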


Wednesday, February 17, 2016

How to setup password less ssh between CentOS 7 cluster servers?

Now that we are working with clusters, we need a way for the machines to communicate with each other. In the Windows world with Active Directory (AD) we could have created a domain account and added this user to all the machines. The issue with that approach is that each service would still need to log in to these machines using the domain account every time, which means an authentication request has to go to AD for every login.

In the Linux world this is solved differently with the concept of passwordless ssh (Secure Shell): a key pair is generated for the user and the public key is stored on each server. The next time the user connects, the key is used to log in to the server. This way the user who has set up passwordless ssh is logged in automatically without being prompted for a password.

Do the following to create a password less ssh for a specific user. In our case the bigdatauser

The setting is:
3 Apache Hbase Node clusters
HBASENODE1
HBASENODE2
HBASENODE3

On each of these machines we have created a user called the bigdatauser as described in How to create a user, group and enable him to do what a super user can in CentOS7?.
We also need to create the DNS records in /etc/hosts as described in How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

Login to the CentOS 7 Server HBASENODE1 using ssh with "bigdatauser" and issue the following commands

Create the certificate for the user on his local home directory

cd ~

//create the ssh keys
ssh-keygen -t rsa -P ""


press enter do not type anything and accept the default directory.

//copy the keys to the authorized keys from bigdatauser
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


Test the password less ssh is working by typing

ssh localhost


Accept the warnings; you should be able to log in now. To exit localhost, type

exit


Do this another time, this would not show any warnings.

Now that we have set up passwordless ssh for one node, HBASENODE1, we need to do the same for HBASENODE2 and HBASENODE3.

Once we have done the same on all 3 servers, we need to enable passwordless ssh between the nodes.
The logic is as follows: from HBASENODE1, run the following command towards HBASENODE2.

//copy the keys to other nodes
ssh-copy-id -i $HOME/.ssh/id_rsa.pub bigdatauser@HBASENODE2


The same command needs to happen from
HBASENODE1 to HBASENODE3
and from
HBASENODE2 to HBASENODE1, HBASENODE3
HBASENODE3 to HBASENODE1, HBASENODE2
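These copies can also be scripted in a small loop run on each node (a sketch; copying a key to the node itself is harmless since it is already authorized):

//run on every node; pushes this node's public key to all cluster members
for host in HBASENODE1 HBASENODE2 HBASENODE3; do
  ssh-copy-id -i $HOME/.ssh/id_rsa.pub bigdatauser@$host
done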

Once this is done, verify that you can log in from any node to any other node by typing the following

ssh HBASENODE2

on HBASENODE1, and similarly for the other combinations.
Accept the warning the first time; the next time it should log you directly into the servers.


Tuesday, February 16, 2016

How to install Java 7 and Java 8 on Mac OS X?

Now that we have installed the package managers in Mac OS X, we can install the most widely used Java versions on Mac OS X.

Most big data technologies work with Java 1.7 also called Java 7. Few of these use the Java 1.8 also called Java 8.

Unfortunately, for Mac OS X the available distributions are from Oracle and not from OpenJDK. Install Java 7 before installing Java 8. This installs the JDK version of Java.

Issue the following command to install Java 7

sudo brew cask install java7

Issue the following command to install Java 8

sudo brew cask install java

The default location of the installation is present under

/opt/homebrew-cask/Caskroom/

You can also find more information about the software by using the installation keyword, like java, java7, etc.

brew cask info java7
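After installation, the registered JVMs and the current default can be verified (a sketch):

/usr/libexec/java_home -V
java -version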


Update: Feb 19, 2019
An updated article is available for installing Java 8, as Java 7 is quite old.

Monday, February 15, 2016

Install Homebrew and Cask for package management in Mac OS X

Just like we have yum in CentOS 7, we need package managers on the development machine, namely Mac OS X.

Homebrew and Cask are the preferred package managers.

For installing Homebrew give the following command

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Refer the following URL for more information

Once you have installed brew you can install Cask by giving the following command

brew tap caskroom/cask

Refer the following URL for more information

Note: As with all installation make sure that you have admin rights or use the sudo keyword.
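To confirm the package manager is working (a sketch):

brew --version
brew doctor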


Friday, January 29, 2016

How to setup DNS entries for big data servers in the cloud or not on a domain in /etc/hosts file?

IMPORTANT:
a. Do not do this if the servers are already in a domain and DNS is already handled by a dedicated DNS server.
b. It's always better to assign static IPs to the servers before making these changes so that we don't have to change the DNS entries every time a host machine's IP changes due to a restart.
c. You need to be an administrator to perform these changes.

Now that we have created these servers, which are on the virtual network and have only an IPv4 address, we need to make sure that the machines on the network are accessible by their host names.

Most of the big data clusters need not be managed by a domain controller and can exist as standalone in the cloud or in your own virtualization host.

On each of the clusters the DNS entries for the other nodes that it uses in the cluster needs to be set in the /etc/hosts file.
For clusters that interact with other clusters, the DNS entries need to be set for all the other clusters that the current cluster interacts with.

e.g. if we have a 3 node Elasticsearch cluster, then on each of the 3 nodes we need to set up the DNS entries in the /etc/hosts file:

sudo vi /etc/hosts

Insert the following records and save on all the clusters. Keep the localhost entry as it is and append to it. (You can insert with the comment using the #)

# Elastic Nodes
192.168.0.6 SEARCHNODE1
192.168.0.7 SEARCHNODE2
192.168.0.8 SEARCHNODE3


If we have a 3 node Apache Storm cluster that inserts data into the Elasticsearch cluster, then on each of the Storm nodes we need to have the entries for both the Storm nodes and the Elasticsearch nodes.

sudo vi /etc/hosts

Insert the following records and save on all the clusters

# Storm Nodes
192.168.0.10 STORMNODE1
192.168.0.11 STORMNODE2
192.168.0.12 STORMNODE3

# Elastic Nodes
192.168.0.6 SEARCHNODE1
192.168.0.7 SEARCHNODE2
192.168.0.8 SEARCHNODE3


Verify that it's working with the following command

ping STORMNODE1

cancel by  ctrl + c

You should not get any dropped packets.

How to turn off firewall on CentOS 7?

Important:
Do not turn off the firewall on production and pre-production servers. Figure out the ports needed by all the technologies that run on the server and open those rules in the firewall (see the sketch below).

In developer box, developer instance and QA instance environments it's safe to turn it off for quick deployment.
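For production or pre-production, individual ports can be opened instead of disabling the firewall (a sketch, using Zookeeper's client port 2181 as a hypothetical example):

//open a single port permanently and reload the firewall rules
sudo firewall-cmd --permanent --add-port=2181/tcp
sudo firewall-cmd --reload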

You need to be an administrator to perform these changes.

//check the status of firewall
sudo systemctl status firewalld

//stop the firewall
sudo systemctl stop firewalld

//disable firewall so that it doesn't start again when restarted
sudo systemctl disable firewalld

Thursday, January 28, 2016

How to install Java 7 and Java 8 in CentOS 7?

Java has three concepts

1. Versions like 1.7 , 1.8 which are also called Java 7 and Java 8
2. Various distributions like JRE, JDK, etc
3. Vendor distributions, i.e. the Oracle distribution or the OpenJDK distribution.

In the Development machine install the JDK which also has the JRE.
In Server machines install only the JRE.
If there is an OpenJDK version available for your OS, use it; else take the Oracle one.
Since there is an OpenJDK version for CentOS we shall install this distribution.

CentOS 7 Server with UI should come pre-installed with Java 7 JRE.

Check if Java is already installed by issuing the following command.

java -version
It should print out

java version "1.7.0_79"


Yum is the package manager in CentOS (comparable to Windows Installer on Microsoft Windows). It can detect whether Java is already installed and installs it only if it's not.

Install Java 8 (1.8) JDK using the following command. You need to have admin privileges.


sudo yum install java-1.8.0-openjdk-devel.x86_64


If already installed you would get

Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirror.5ninesolutions.com
* extras: mirror.millry.co
* updates: mirror.cogentco.com
Package 1:java-1.8.0-openjdk-devel-1.8.0.65-2.b17.el7_1.x86_64 already installed and latest version


Check the installation using the following command, which lists any running Java processes. It is available only in the JDK and not in the JRE:

jps


Install Java 8 (1.8) JRE using the following command. You need to have admin privileges.

sudo yum install java-1.8.0-openjdk.x86_64


Don't be confused by the naming convention: the package is called openjdk (the project name), but this package installs the JRE.

If Java 7 is not installed (for example, you have a server core version of CentOS 7), then issue the following commands before you install Java 8.

//install the java 7 JRE
sudo yum install java-1.7.0-openjdk.x86_64

//install the java 7 JDK
sudo yum install java-1.7.0-openjdk-devel.x86_64



Wednesday, January 27, 2016

How to build an Apache Storm 0.10 topology job in Eclipse using Maven?

The following is the process of building an Apache Storm 0.10 Topology Job.

Technologies used:
a. Apache Storm 0.10
b. Eclipse Mars
c. Apache Maven 3.3.9
d. Java 7 or Java 8

1. You need two environments to build a Storm topology.
Environment 1: Developer Box
Environment 2: Development / Build Box

IMPORTANT: If the following changes are made in the Developer Box, running topologies on the local Storm cluster will break.

2. Remove the jars that Storm already provides, such as the Storm jar and logging jars like Log4j2.
Go to the project pom.xml that has the Storm jar reference and add the following so that these jars are not included when we compile a single jar with dependencies.


    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>0.10.0</version>
      <!-- This needs to be enabled for storm submit job -->
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-api</artifactId>
      <version>2.5</version>
      <!-- This needs to be enabled for storm submit job -->
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-core</artifactId>
      <version>2.5</version>
      <!-- This needs to be enabled for storm submit job -->
      <scope>provided</scope>
    </dependency>

3. Package the project into a single jar with dependencies.
Go to the Maven project that has the main method and edit its pom.xml to add the following build section right after the closing </dependencies> tag.
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.yourcompany.yourproduct.yourproject.App</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

4. Issue the following Maven command and copy the jar with dependencies from the target folder.

mvn package
Note: you will have both jars; copy the one with the name appended, like yourproject-1.0.0-jar-with-dependencies.jar

5. Follow the instructions in the next step for submitting the Storm topology job (a submission sketch follows for reference).
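For reference, submitting the packaged topology from a cluster node typically looks like the following (a sketch; the jar name and main class are the hypothetical ones from the pom.xml above):

bin/storm jar yourproject-1.0.0-jar-with-dependencies.jar com.yourcompany.yourproduct.yourproject.App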