Running existing Hadoop applications

When the installed Hadoop distribution is configured to run YARN, ScaleOut hServer can run unchanged Hadoop applications; that is, JARs containing standard Hadoop MapReduce jobs can be run as-is. To direct MapReduce jobs to use ScaleOut hServer as the execution engine, the following actions are required in addition to the general ScaleOut hServer installation procedure described in Installation of the IMDG.

The environment variables in steps 1 and 2 can be set either by editing conf/hadoop-env.sh in the Hadoop installation directory or from the command line:

  1. Configure Hadoop to run in YARN mode. Make sure that the HADOOP_MAPRED_HOME variable is set to the location of the YARN MR2 implementation. Refer to the Hadoop distribution documentation for details on how to configure MapReduce to run on YARN.

    Example for CDH5:

    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  2. Add the ScaleOut hServer library JARs and the appropriate Hadoop distribution JARs to the Java classpath. Make sure that the distribution-specific JAR folder has the “-yarn” or equivalent suffix if applicable:

    $ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/cdh5.2.1-yarn/*
  3. Configure ScaleOut hServer as the MapReduce execution framework by setting the configuration property mapreduce.framework.name to hserver-yarn. This property may be set in conf/mapred-site.xml or passed to the Hadoop executable via the command line.
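The three steps above can be sketched as a single shell fragment. This is a minimal sketch, not a definitive setup: the paths are the CDH5 example locations used above, SOSS_JAVA_API and HSLIB_DIR are hypothetical helper variables introduced only for readability, and the property element is printed for copy-and-paste rather than written to any file.

```shell
# Hypothetical helper variables; the paths are the CDH5 examples from the steps above.
SOSS_JAVA_API=/usr/local/soss/java_api
HSLIB_DIR="$SOSS_JAVA_API/hslib/cdh5.2.1-yarn"

# Step 1: point Hadoop at the YARN MR2 implementation.
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

# Step 2: quote the value so the wildcards reach Hadoop unexpanded by the shell.
export HADOOP_CLASSPATH="$SOSS_JAVA_API/*:$SOSS_JAVA_API/lib/*:$HSLIB_DIR/*"

# Warn early if any of the assumed directories is missing on this machine.
for dir in "$HADOOP_MAPRED_HOME" "$SOSS_JAVA_API" "$HSLIB_DIR"; do
  [ -d "$dir" ] || echo "warning: $dir not found" >&2
done

# Step 3: the property element to place inside <configuration> in
# conf/mapred-site.xml. It is only printed here; no file is modified.
cat <<'EOF'
<property>
  <name>mapreduce.framework.name</name>
  <value>hserver-yarn</value>
</property>
EOF
```

If the property is passed on the command line instead, as in the example that follows, the last part of the sketch is not needed.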

Here is an example of running the standard Hadoop word count example with ScaleOut hServer. The environment variables are assumed to be configured, and mapreduce.framework.name is set on the command line:

$ hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.framework.name=hserver-yarn in out

Note

If a MapReduce job does not require its output keys to be sorted, sorting can be disabled to improve performance and reduce memory usage. This is done by setting the configuration property mapred.hserver.sortkeys to false. The property is set to true by default (output keys are sorted), which matches standard Hadoop behavior.
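As a sketch, sorting could be turned off for a single run by passing the property on the command line. This assumes the job accepts generic -D options in the same way the word count example accepts mapreduce.framework.name. The HSERVER_OPTS variable name is only illustrative, and the command is echoed rather than executed so that it can be reviewed first:

```shell
# Collect the hServer-related options in one place so they are easy to toggle.
# Assumption: the job parses generic -D options, as the word count example does.
HSERVER_OPTS="-Dmapreduce.framework.name=hserver-yarn -Dmapred.hserver.sortkeys=false"

# Dry run: print the command. Remove the leading 'echo' to submit the job.
# HSERVER_OPTS is intentionally unquoted so it splits into two separate options.
echo hadoop jar hadoop-mapreduce-examples.jar wordcount $HSERVER_OPTS in out
```

To apply the setting to every job instead, the same property can be placed in the job configuration alongside mapreduce.framework.name.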