How to make Giraph run?


Apache Giraph is the best-known Pregel-like system, and if you work on a related topic, paper reviewers will almost certainly ask you to compare against Giraph (so you may want to do it before they ask).


Unfortunately, the documentation is scarce, and it is not uncommon for people to struggle to get it working. This tutorial aims to reduce the time you spend getting Giraph to run.


A Giraph deployment tutorial that works can be found here; it uses Hadoop v2.5.1 and Giraph v1.1.0. Put simply, you go to $GIRAPH_HOME and compile with Maven, which generates a giraph-xxxxxx-dependencies.jar under $GIRAPH_HOME/giraph-examples/target. That jar file is the one you run with "hadoop jar".
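As a rough sketch, the build step looks like the following; the Maven profile name and Hadoop version here are assumptions, so adjust them to match your own setup:

    cd $GIRAPH_HOME
    # Build against your Hadoop version; skip tests to save time.
    # The hadoop_2 profile and the version number are illustrative.
    mvn -Phadoop_2 -Dhadoop.version=2.5.1 -DskipTests clean package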

To run it, you need to specify options such as the following (a complete example command is sketched after this list):
    -vif/-vip: vertex input format and input path (e.g., an adjacency-list input)
    -eif/-eip: edge input format and input path (e.g., each line is "src dst")
    -vof, -eof: vertex/edge output format
    -op: output path
    -w: number of workers to run
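For concreteness, here is a minimal sketch of such a command, based on the shortest-paths example from the Giraph quick start; the jar name keeps the placeholder from above, and the input/output paths are hypothetical, so substitute your own:

    hadoop jar giraph-xxxxxx-dependencies.jar org.apache.giraph.GiraphRunner \
        org.apache.giraph.examples.SimpleShortestPathsComputation \
        -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
        -vip /user/you/input/tiny_graph.txt \
        -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
        -op /user/you/output/shortestpaths \
        -w 1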


This is not the end of the story yet... your job may get stuck for a while and then fail. Giraph uses a lot of memory, and the default memory limits are often too small.


If this happens, record your job ID and check its Hadoop job log. If at the end of the log you see "Error: GC overhead limit exceeded", check here. Put simply, you need to add another option:
    -Dmapred.child.java.opts="-Xmx4000m -Xms4000m"
where 4000m means 4 GB of heap; you may increase it further according to your machine's memory size. Note that both JVM flags should go in a single mapred.child.java.opts value, since specifying the property twice just makes the second setting override the first.
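Since GiraphRunner is launched through Hadoop's ToolRunner, this -D generic option typically goes right after the GiraphRunner class name, before the computation class and the Giraph-specific options. The sketch below reuses the hypothetical names from the earlier example:

    hadoop jar giraph-xxxxxx-dependencies.jar org.apache.giraph.GiraphRunner \
        -Dmapred.child.java.opts="-Xmx4000m -Xms4000m" \
        org.apache.giraph.examples.SimpleShortestPathsComputation \
        -vif ... -vip ... -vof ... -op ... -w 1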

You can add further options listed here, such as giraph.numComputeThreads and giraph.SplitMasterWorker. These are passed as custom arguments with -ca; for example, the previously mentioned tutorial uses:
    -ca giraph.SplitMasterWorker=false
which makes the master share a task with a worker instead of taking its own, handy when you only have a single machine or very few containers.
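If you need several such settings, -ca can usually be repeated; the thread count below is purely illustrative:

    -ca giraph.SplitMasterWorker=false \
    -ca giraph.numComputeThreads=4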