Running a G-thinker Application

Other documentation: [How to Deploy], [How to Program], [The Complete API]

This tutorial describes how to run a G-thinker program on your cluster. We assume that:

  • You have already deployed G-thinker according to the deployment note.
  • You have already downloaded G-thinker as described here; we denote the G-thinker root folder as $GTHINKER_HOME in the rest of this tutorial.

Put Data on HDFS

G-thinker requires that a large graph be partitioned into smaller files under the same data folder on HDFS, so that these files can be loaded by different computing processes during processing.

To achieve this, we cannot simply use the command hadoop fs -put {local-file} {HDFS-path}: the graph would then sit in a single HDFS file, which would be loaded by only one computing process while all the other processes simply wait for it to finish loading.

To put a large graph onto HDFS in this partitioned form, one needs to use our put program, which can be obtained and built as follows:

wget http://www.cs.uab.edu/yanda/gthinker/doc/put.zip
unzip put.zip
cd put
make    # generate a program called "run"
mv run put    # rename it to "put"

Now, the "put" program is ready for use, which takes two arguments being {local-file} {HDFS-path}. Let us put our unlabeled toy graph described here to HDFS:

wget http://www.cs.uab.edu/yanda/gthinker/doc/unlabeled.txt
hadoop fs -mkdir /toyFolder
./put unlabeled.txt /toyFolder
hadoop fs -cat /toyFolder/*
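Unlike a plain hadoop fs -put, the put program writes the graph as multiple partition files under /toyFolder (the exact partition file names depend on the put program), so that each computing process can load its own subset. You can verify the partitioning by listing the folder:

hadoop fs -ls /toyFolder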

You may move the "put" program to $HOME, so that you can call it as "~/put" anywhere.

mv put ~

Compiling and Running

We illustrate the process with the application Maximum Clique Finding.

Now change the current directory to $GTHINKER_HOME/app_maxclique. A Makefile is already provided there, so you can run make to compile the application, which generates a program called run with two input arguments: {HDFS-input-path} and {number-of-threads}.
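Concretely, the compilation step is:

cd $GTHINKER_HOME/app_maxclique
make    # generates a program called "run"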

For example, you can run it on your machine using 4 threads as follows, where the input graph is the toy graph we just put on HDFS.

./run /toyFolder 4

Running with Multiple Machines

G-thinker is just a library of header files built on MPI, and thus a G-thinker program is simply an MPI program. MPI requires that our compiled program be available on every machine under the same path. For example, we can move the previously compiled program run to $HOME and then copy it to every machine as follows (assuming that we are currently on the master):

mv run ~    # ~ is equivalent to $HOME
cd ~
scp run slave1:~
scp run slave2:~
...
scp run slaveN:~

Moreover, to run an MPI program on multiple machines, you need to specify the list of hosts to run on. We now prepare a host file named hosts that lists the hostnames of these machines, for example:

master
slave1
slave2
...
slaveN
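With the hosts file in place, the earlier scp step can also be scripted. Below is a minimal sketch; it assumes password-less SSH to every slave and that the hosts file is the one shown above:

while read -r h; do
    [ "$h" = "master" ] && continue    # skip the master, where "run" already resides
    scp ~/run "$h":~                   # copy the compiled program to each slave
done < hosts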

Now, you may run the program with mpiexec, whose option -n indicates how many processes you want to run, and option -f specifies the hostfile.

mpiexec -n N+1 -f hosts ./run /toyFolder 4
# 4 threads per process/machine

Processes are distributed to the machines in your hostfile in a round-robin manner. We recommend running one process on each machine and adjusting the number of threads to fully utilize your multi-core machines.
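For example, with one master and three slaves (N = 3), and assuming each machine has 8 cores (the thread count here is only an illustration), you would run:

mpiexec -n 4 -f hosts ./run /toyFolder 8
# 4 processes, one per machine; 8 threads per process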

Task Buffer on Local Disk

In G-thinker, if a task grows its subgraph by pulling vertices from remote machines, the task is suspended while it waits for the responses.

When the responses are received, such tasks are woken up for processing; if they need further processing, such as pulling more vertices, they are added to an in-memory task queue, waiting to be fetched by a CPU core for computation.

To keep memory consumption low, G-thinker limits the number of tasks allowed in a task queue. If a burst of tasks needs to be added to the task queue but the queue is already full, 1/3 of the queued tasks are moved to local disk to make room for the new tasks.

The tasks buffered on disk are stored in a folder specified by the following variable, defined at Line 173 of $GTHINKER_HOME/system/util/global.h:

string TASK_DISK_BUFFER_DIR;

The default path is a folder named buffered_tasks under the current directory from which you run your G-thinker program, but it can be changed by providing a 2nd input argument to the Worker class (see Lines 95 and 98).

It is recommended to point TASK_DISK_BUFFER_DIR to a path on local disk rather than on an NFS (Network File System); otherwise, the disk of the NFS server is shared by all machines and becomes a bottleneck.
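One simple way to keep the buffer on local disk without changing any code is to launch the program from a working directory that resides on local disk, so that the default buffered_tasks folder is created there. Below is a minimal sketch; it assumes a local (non-NFS) path /local/scratch/gthinker_run that exists on every machine, and that mpiexec starts the remote processes in the same working directory (the default behavior of MPICH's Hydra launcher):

mkdir -p /local/scratch/gthinker_run    # create this directory on every machine
cd /local/scratch/gthinker_run
mpiexec -n N+1 -f ~/hosts ~/run /toyFolder 4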

The buffered tasks are loaded back into the task queue as soon as the queue underflows (i.e., drops to 1/3 or less of its full capacity).