Deploy Samza Job To CDH

This tutorial assumes you have successfully run hello-samza and now want to deploy the job to a cluster running Cloudera's Distribution Including Apache Hadoop (CDH). It is based on CDH 5.4.0 and uses hello-samza as the example job.

Compile Package for CDH 5.4.0

Building the hello-samza package for CDH 5.4.0 requires a specific compile option:

mvn clean package -Dhadoop.version=cdh5.4.0

Upload Package to Cluster

There are a few ways to upload the package to the cluster's HDFS. If the job package is not yet on the cluster, scp it from your local machine to a cluster node. Then run:

hadoop fs -put path/to/hello-samza-0.12.0-dist.tar.gz /path/for/tgz

Get Deployment Scripts

Untar the job package (this assumes you will run the job from the current directory):

tar -xvf path/to/hello-samza-0.12.0-dist.tar.gz -C ./

Add Package Path to Properties File

vim config/wikipedia-parser.properties

Change the yarn package path:

yarn.package.path=hdfs://<hdfs name node ip>:<hdfs name node port>/path/to/tgz
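If you prefer not to edit the file by hand, the same change can be scripted. The following is a minimal sketch, not part of hello-samza: the helper name set_package_path is hypothetical, and it assumes the properties file already contains a yarn.package.path line and that the URL contains no "|" or "&" characters.

```shell
# Hypothetical helper (not part of hello-samza): rewrite the
# yarn.package.path line of a properties file in place.
# Usage: set_package_path <properties-file> <hdfs-url>
set_package_path() {
  props_file="$1"
  package_url="$2"
  # '|' is used as the sed delimiter because the URL contains '/'.
  # -i.bak edits in place and keeps a backup copy with a .bak suffix.
  sed -i.bak "s|^yarn\.package\.path=.*|yarn.package.path=${package_url}|" "$props_file"
}
```

For example, set_package_path config/wikipedia-parser.properties "hdfs://<hdfs name node ip>:<hdfs name node port>/path/to/tgz" with the placeholders filled in. On a cluster node, hdfs getconf -confKey fs.defaultFS usually prints the name node address to use.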

Set Yarn Environment Variable

export HADOOP_CONF_DIR=/etc/hadoop/conf
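Before submitting the job, it can help to confirm that the directory really holds the cluster's client configuration. This is a small sketch under the assumption that core-site.xml and yarn-site.xml are the files that matter; the helper name check_hadoop_conf is hypothetical.

```shell
# Hypothetical helper: verify that a Hadoop configuration directory
# contains the client files assumed to be needed for job submission.
# Usage: check_hadoop_conf [conf-dir]   (defaults to $HADOOP_CONF_DIR)
check_hadoop_conf() {
  conf_dir="${1:-$HADOOP_CONF_DIR}"
  for f in core-site.xml yarn-site.xml; do
    if [ ! -f "$conf_dir/$f" ]; then
      echo "missing $conf_dir/$f" >&2
      return 1
    fi
  done
  return 0
}
```

Calling check_hadoop_conf with no argument checks the directory exported above and returns a non-zero status if either file is absent.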

Run Samza Job

bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/config/wikipedia-parser.properties
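Once the job has been submitted, one way to check that it was accepted is to look for it in the list of running YARN applications. The sketch below assumes the yarn command is on the PATH and that the application name shown by YARN contains the job name from the properties file (assumed here to be wikipedia-parser); the helper name job_is_running is hypothetical.

```shell
# Hypothetical helper: succeed if the list of running YARN applications
# mentions the given job name.
# Usage: job_is_running <job-name>
job_is_running() {
  job_name="$1"
  # -appStates RUNNING restricts the listing to running applications.
  yarn application -list -appStates RUNNING 2>/dev/null | grep -q -- "$job_name"
}
```

For example, job_is_running wikipedia-parser && echo "job is running". The job may take a short while to move from the ACCEPTED state to RUNNING after submission.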