Deploying G-thinker on Your Cluster

Other Documentation: [How to Run], [How to Program], [The Complete API]
Click here if you wish to deploy with virtual machines on Microsoft Azure.

G-thinker is implemented in C/C++ and runs on Linux. G-thinker reads and writes data as files on the Hadoop Distributed File System (HDFS) through LibHDFS, and sends messages using MPI (Message Passing Interface). Below is a checklist of the software dependencies.

  • GCC (GNU Compiler Collection)    [go]
  • MPICH or Open MPI    [go]
  • JDK    [go]
  • Hadoop    [go]
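
If you are not sure which of these dependencies are already available on a machine, the following commands (a quick check; the exact output depends on your distribution and versions) print the version of each tool, or report an error if it is missing:

g++ --version        # GCC
mpiexec --version    # MPICH or Open MPI
java -version        # JDK
hadoop version       # Hadoop (requires $HADOOP_HOME/bin to be on PATH)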

This webpage explains how to deploy G-thinker on a cluster where each machine has only a 64-bit Linux OS installed. The dependencies and the locations where we install them are given as follows:

  • MPICH v3.2.1:     /usr/local/mpich
  • OpenJDK 7:        /usr/lib/jvm/java-1.7.0-openjdk
  • Hadoop v2.7.5:    /usr/local/hadoop

We also put all HDFS data under /local/hdfs. If you have already installed some of the dependencies above in your preferred locations, you may skip the relevant installation steps and change the paths in this tutorial wherever appropriate. If you are using Hadoop v1, please refer to this configuration tutorial.

Installing G++

G-thinker is written in C++, using C++11 multithreading. Therefore, you need a GCC version that supports C++11 (e.g., GCC 4.8 or later). If GCC is not installed on your Linux machines, install it as follows (the command depends on your Linux distribution).

# For Ubuntu users
sudo apt-get install g++
[type password]
# For CentOS users
sudo yum install gcc-c++
[type password]
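
To verify that the installed compiler supports C++11 (and hence std::thread, which G-thinker relies on), you can compile a trivial test program, for example (a throwaway sketch; test.cpp is just a temporary file name):

echo 'int main(){return 0;}' > test.cpp
g++ -std=c++11 -pthread test.cpp -o test && echo "C++11 OK"
rm -f test test.cpp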

MPI Installation

We now deploy MPICH3 on each machine of the cluster. If you use another MPI implementation such as Open MPI, please adjust the commands accordingly.

Download MPICH3 from mpich.org and extract the package to a temporary location before compiling the source code. We use MPICH v3.2.1.
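
For example, the package can be fetched with wget (the URL below follows the layout of mpich.org at the time of writing and may change; check the download page if it fails):

wget http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1.tar.gz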

cd {the-directory-containing-downloaded-mpich-package}
tar -xvzf mpich-3.2.1.tar.gz
cd mpich-3.2.1

Choose a directory for installing MPICH3 (we use /usr/local/mpich), and then compile and install MPICH3.

./configure --prefix=/usr/local/mpich --disable-fortran
make
sudo make install

Append the following two environment variables to the file $HOME/.bashrc. Here, $HOME, ~ and /home/{your_username} all refer to your home folder.

export MPICH_HOME=/usr/local/mpich
export PATH=$PATH:$MPICH_HOME/bin

Apply the changes with the command source $HOME/.bashrc (you may do this later after adding more environment variables).
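
After sourcing $HOME/.bashrc, you can check that MPICH is on your PATH and can launch processes (a quick sanity check run locally on the current machine):

which mpicc mpiexec
mpiexec -n 2 hostname
[the hostname should be printed twice]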

JDK Installation

Hadoop requires Java 5+ (aka Java 1.5+) to work. We thus first install OpenJDK 7 as follows.

# For Ubuntu users
sudo apt-get install openjdk-7-jdk
[type password]
# For CentOS users
sudo yum install java-1.7.0-openjdk-devel
[type password]

It is also necessary to add the following environment variables to the end of the file $HOME/.bashrc.

# For Ubuntu users
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server
# For CentOS users
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server

Apply the changes with the command source $HOME/.bashrc (you may do this later after adding more environment variables).
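
After sourcing $HOME/.bashrc, you can confirm that the JDK is installed and that JAVA_HOME points to the right place; libjvm.so below is the JVM library that LibHDFS links against:

java -version
ls $JAVA_HOME/jre/lib/amd64/server/libjvm.so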

SSH Configuration

Both Hadoop and MPI require password-less SSH connections from the master to all the slaves. If an SSH server is not installed on your Linux machines, install it as follows.

# For Ubuntu users
sudo apt-get install openssh-server
# For CentOS users
sudo yum install openssh-server

If the SSH server is not running, start it using the command sudo /etc/init.d/ssh start (the service may be named sshd on some distributions, e.g., CentOS).

Suppose that the cluster contains one master machine with IP 192.168.0.1 and hostname master, and N slave machines, where the i-th slave has IP 192.168.0.(i+1) and hostname slavei. Then, for each machine, we need to append the following lines to the file /etc/hosts.

192.168.0.1 master
192.168.0.2 slave1
192.168.0.3 slave2
......
192.168.0.(N+1) slaveN
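
To confirm that the /etc/hosts entries take effect, you can resolve a few of the hostnames (a quick check using getent, which is available on most Linux distributions):

getent hosts master slave1
[the corresponding IP addresses should be printed]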

We now show how to configure the password-less SSH connection. First, create an RSA key pair with an empty password on the master:

ssh-keygen -t rsa
[Press "Enter" for all questions]

This creates files id_rsa and id_rsa.pub under the default directory $HOME/.ssh/. Then, we add the public key id_rsa.pub to all machines, in their file $HOME/.ssh/authorized_keys:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub {username}@master
[type master password]
ssh-copy-id -i $HOME/.ssh/id_rsa.pub {username}@slave1
[type slave1 password]
ssh-copy-id -i $HOME/.ssh/id_rsa.pub {username}@slave2
[type password]
......
ssh-copy-id -i $HOME/.ssh/id_rsa.pub {username}@slaveN
[type slaveN password]

Above, you may omit {username} if you use the same account name across all machines.
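
If the cluster has many slaves, a small loop like the following (a sketch assuming hostnames slave1 to slaveN and the same account name everywhere; replace N with your number of slaves) saves some typing, prompting for each slave's password in turn:

for i in $(seq 1 N)
do
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave$i
done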

Now, you should be able to connect from master to any machine using SSH without entering a password.

ssh master
[no password is required]
exit
ssh slave1
[no password is required]
exit
......

Hadoop Deployment

Download Hadoop (we use Hadoop v2.7.5 here) to master and extract the package to /usr/local. We will copy it to the slaves using "scp" after configuration.
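
For example, Hadoop v2.7.5 can be fetched from the Apache archive (the URL below may change; check hadoop.apache.org for the current download link):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz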

tar -xvzf hadoop-2.7.5.tar.gz
sudo mv hadoop-2.7.5 /usr/local/hadoop
[type password]

In the above setting, /usr/local/hadoop is the root directory of Hadoop, i.e., HADOOP_HOME. Make sure that {username} owns HADOOP_HOME and all the files under it. You can find the group(s) of {username} using the command groups {username}.

sudo chown -R {username}:{usergroup} /usr/local/hadoop
[type password]

It is also necessary to append the following environment variables to the file $HOME/.bashrc.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/etc/hadoop
for i in $HADOOP_HOME/share/hadoop/common/lib/*.jar
do
    CLASSPATH=$CLASSPATH:$i
done
for i in $HADOOP_HOME/share/hadoop/hdfs/lib/*.jar
do
    CLASSPATH=$CLASSPATH:$i
done
for i in $HADOOP_HOME/share/hadoop/common/*.jar
do
    CLASSPATH=$CLASSPATH:$i
done
for i in $HADOOP_HOME/share/hadoop/hdfs/*.jar
do
    CLASSPATH=$CLASSPATH:$i
done
export CLASSPATH

Apply the changes with the command source $HOME/.bashrc. To save effort, you may edit $HOME/.bashrc only on the master, copy it to the slaves using "scp", and then connect to each slave using "ssh" and source the file there.
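
For example, a loop like the following (a sketch assuming hostnames slave1 to slaveN; replace N with your number of slaves) copies $HOME/.bashrc to every slave, after which you can log in to each slave and run source $HOME/.bashrc there:

for i in $(seq 1 N)
do
    scp $HOME/.bashrc slave$i:$HOME/
done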

Next, we need to configure the files under the folder $HADOOP_HOME/etc/hadoop/. Referring to the cluster setting above, the file $HADOOP_HOME/etc/hadoop/masters (create it if it does not exist) should contain your master machine's hostname, which in our case is:

master

The file $HADOOP_HOME/etc/hadoop/slaves should contain your slave machines' hostnames, which in our case are:

slave1
slave2
......
slaveN

Create a directory for holding the data of HDFS, such as /local/hdfs, and set the required ownership and permissions for {username} as follows.

sudo mkdir -p /local/hdfs
[type password]
sudo chown {username}:{usergroup} /local/hdfs
sudo chmod 755 /local/hdfs

Add the following properties between <configuration> and </configuration> in the file $HADOOP_HOME/etc/hadoop/core-site.xml:

<property>
    <name>hadoop.tmp.dir</name>
    <value>/local/hdfs</value>
</property>
<property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
</property>

This indicates that HDFS data are stored under /local/hdfs, and the HDFS master (i.e., the NameNode) resides on master at port 9000. If your master machine's hostname is not master, please update the value accordingly.

By default, the $HADOOP_HOME/etc/hadoop/ folder contains the file mapred-site.xml.template, which we copy to mapred-site.xml as follows:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template \
              $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following property between <configuration> and </configuration> in the file $HADOOP_HOME/etc/hadoop/mapred-site.xml:

<property>
    <name>mapred.job.tracker</name>
    <value>hdfs://master:9001</value>
</property>

This specifies the location of the JobTracker that manages MapReduce jobs; again, if your master machine's hostname is not master, please use your own hostname.

Add the following property between <configuration> and </configuration> in the file $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

This configuration asks HDFS to keep three replicas of each data block, so that data will not be lost even if a machine goes down. You may also use 1 rather than 3 to save disk space.

Other Hadoop properties can also be configured in these files according to your needs. Now that Hadoop is properly configured on the master, we use "scp" to copy it to the slaves. For each slave slavei, we run the following commands on master:

scp -r /usr/local/hadoop {username}@slavei:$HOME
ssh slavei
sudo mv hadoop /usr/local/
[type slavei password]
sudo chown -R {username}:{usergroup} /usr/local/hadoop
sudo mkdir -p /local/hdfs
sudo chown {username}:{usergroup} /local/hdfs
sudo chmod 755 /local/hdfs
exit
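
If you prefer to automate this for all slaves, a loop such as the one below (a sketch assuming hostnames slave1 to slaveN, the same {username} and {usergroup} everywhere, and sudo rights on every slave; ssh -t is used so that sudo can prompt for a password) performs the same steps:

for i in $(seq 1 N)
do
    scp -r /usr/local/hadoop slave$i:$HOME/
    ssh -t slave$i "sudo mv ~/hadoop /usr/local/ && \
        sudo chown -R {username}:{usergroup} /usr/local/hadoop && \
        sudo mkdir -p /local/hdfs && \
        sudo chown {username}:{usergroup} /local/hdfs && \
        sudo chmod 755 /local/hdfs"
done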

Before we start the newly configured Hadoop cluster, we must format HDFS via the NameNode. Type the following command on master to format HDFS:

hadoop namenode -format

Now, you are ready to start the Hadoop cluster. To start both HDFS and JobTracker, type the following command on master:

$HADOOP_HOME/sbin/start-all.sh

To see whether the Hadoop cluster has started properly, you may type "jps" in a terminal. On master you should see processes such as "NameNode", "SecondaryNameNode" and "ResourceManager", and on each slave "DataNode" and "NodeManager" (plus "Jps" itself). You can now access http://master:50070 to view the state of HDFS, and http://master:8088 to view the state of YARN applications such as MapReduce jobs. To stop the Hadoop cluster, type the following command on master:

$HADOOP_HOME/sbin/stop-all.sh

If you are not using Hadoop MapReduce but only G-thinker, there is no need to start YARN. You only need to start HDFS using the following command:

$HADOOP_HOME/sbin/start-dfs.sh

You may stop HDFS later on using the following command:

$HADOOP_HOME/sbin/stop-dfs.sh
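
After HDFS is started (via either start-all.sh or start-dfs.sh), you can verify that it works by writing and listing a small file, and by checking that all DataNodes are live (a quick sanity check; /test is just an arbitrary example path):

hdfs dfs -mkdir /test
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /test/
hdfs dfs -ls /test
hdfs dfsadmin -report
[the report should list all N DataNodes as live]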