User Guide

Alex Bain edited this page Apr 19, 2016 · 29 revisions

Overview

This is the LinkedIn Gradle Plugin for Apache Hadoop User Guide. For the sake of brevity, we will refer to the plugin as simply the "Hadoop Plugin".

The Hadoop Plugin will help you more effectively build, test and deploy Hadoop applications. In particular, the Plugin will help you easily work with Hadoop applications like Apache Pig and build workflows for Hadoop workflow schedulers like Azkaban and Apache Oozie.

The Plugin includes the LinkedIn Gradle DSL for Apache Hadoop (which we shall refer to as simply the "Hadoop DSL"), a language for specifying jobs and workflows for Hadoop workflow schedulers like Azkaban and Apache Oozie. Go directly to the Hadoop DSL Language Reference.

Using the Open-Source Hadoop Plugin

The Hadoop Plugin is published on plugins.gradle.org. The plugin's page there provides a short snippet that you can add to your build.gradle file to start using the Hadoop Plugin.
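As a rough sketch of what such a snippet looks like: the plugin ID below is an assumption and the version is a placeholder, so copy the exact values from the Hadoop Plugin's listing on plugins.gradle.org rather than from here. The plugins block itself is standard Gradle syntax (Gradle 2.1 or later).

```groovy
// In your <rootProject>/<project>/build.gradle:
// ASSUMPTION: the plugin ID and version shown here are illustrative only.
// Take the exact ID and a published version from plugins.gradle.org.
plugins {
  id "com.linkedin.gradle.hadoop.HadoopPlugin" version "x.y.z"  // replace x.y.z
}
```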

Using the Hadoop Plugin at LinkedIn

If you are using the Hadoop Plugin internally at LinkedIn, see our comprehensive instructions at go/HadoopPlugin on the LinkedIn Wiki to start using the Plugin.

Hadoop Plugin Tasks

To see all of the Hadoop Plugin tasks, run gradle tasks in the project directory of your Hadoop Plugin project and look at the section titled Hadoop Plugin tasks. You may see something like:

Hadoop Plugin tasks
-------------------
azkabanNertzHadoopZip - Creates a Hadoop zip archive for azkabanNertz
azkabanUpload - Uploads Hadoop zip archive to Azkaban
azkabanWarHadoopZip - Creates a Hadoop zip archive for azkabanWar
buildAzkabanFlows - Builds the Hadoop DSL for Azkaban. Have your build task depend on this task.
buildHadoopZips - Builds all of the Hadoop zip archives. Tasks that depend on Hadoop zips should depend on this task
buildOozieFlows - Builds the Hadoop DSL for Apache Oozie. Have your build task depend on this task.
buildPigCache - Build the cache directory to run Pig scripts by Gradle tasks. This task will be run automatically for you.
buildScmMetadata - Writes SCM metadata about the project to the project's build directory
checkDependencies - Task to help in controlling and monitoring the dependencies used in the project
CRTHadoopZip - Creates a Hadoop CRT deployment zip archive
disallowLocalDependencies - Task to disallow users from checking in local dependencies
oozieCommand - Runs the oozieCommand specified by -Pcommand=CommandName
oozieUpload - Uploads the Oozie project folder to HDFS
printScmMetadata - Prints SCM metadata about the project to the screen
run_count_by_country.pig - Run the Pig script src/main/pig/count_by_country.pig with no Pig parameters or JVM properties
run_count_by_country_python.pig - Run the Pig script src/main/pig/count_by_country_python.pig with no Pig parameters or JVM properties
run_member_event_count.pig - Run the Pig script src/main/pig/member_event_count.pig with no Pig parameters or JVM properties
run_postal_code.pig - Run the Pig script src/main/pig/postal_code.pig with no Pig parameters or JVM properties
run_verify_recommendations.pig - Run the Pig script src/main/pig/verify_recommendations.pig with no Pig parameters or JVM properties
runPigJob - Runs a Pig job configured in the Hadoop DSL with gradle runPigJob -Pjob=<job name>. Uses the Pig parameters and JVM properties from the DSL.
runSparkJob - Runs a Spark job configured in the Hadoop DSL with gradle runSparkJob -PjobName=<job name> -PzipTaskName=<zip task name>. Uses the Spark parameters and JVM properties from the DSL.
showPigJobs - Lists Pig jobs configured in the Hadoop DSL that can be run with the runPigJob task
showSparkJobs - Lists Spark jobs configured in the Hadoop DSL that can be run with the runSparkJob task
startHadoopZips - Container task on which all the Hadoop zip tasks depend
writeAzkabanPluginJson - Writes a default .azkabanPlugin.json file in the project directory
writeOoziePluginJson - Writes a default .ooziePlugin.json file in the project directory
writeScmPluginJson - Writes a default .scmPlugin.json file in the root project directory

Some of these tasks will help you run and debug Hadoop jobs, some of them are related to the Hadoop DSL, and some of them will help you upload to Azkaban. See the sections below for descriptions of each.
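The buildAzkabanFlows description above asks you to have your build task depend on it. A minimal sketch of that wiring, assuming the Hadoop Plugin is applied and the project already has a build task (for example, from the Java Plugin):

```groovy
// In your <rootProject>/<project>/build.gradle:
// ASSUMPTION: a `build` task exists (e.g. from the Java Plugin) and the Hadoop
// Plugin has been applied, so that a task named buildAzkabanFlows is defined.
// Passing the task name as a string lets Gradle look the task up by name
// when it resolves task dependencies.
build.dependsOn "buildAzkabanFlows"
```

The same pattern should apply to buildOozieFlows, whose description carries the same advice.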

Hadoop DSL Language

The Hadoop Plugin comes with the Hadoop DSL, which makes it easy to specify workflows and jobs for Hadoop workflow schedulers.

(Since version 0.3.9) If for some reason you need to disable the Hadoop DSL Plugin, you can pass -PdisableHadoopDslPlugin on the Gradle command line or add disableHadoopDslPlugin=true to your gradle.properties file.

Hadoop DSL Language Reference

The Hadoop DSL language is documented on its own page, the Hadoop DSL Language Reference.

Building the Hadoop DSL for Azkaban

Right now, Azkaban is the only Hadoop workflow scheduler which can be targeted by the Hadoop DSL, but eventually the Hadoop Plugin may include compilers for other schedulers such as Apache Oozie.

To see how to use the Hadoop DSL to build job files for Azkaban, see Azkaban Features.

Hadoop Runtime Dependency Configuration

Applying the Hadoop Plugin creates the hadoopRuntime dependency configuration. Add to this configuration any dependencies that your Hadoop code does not need at compile time but does need at runtime, when it executes on the grid.

For projects that also apply the Java Plugin, the hadoopRuntime configuration automatically extends the runtime configuration and adds the output of the jar task. By default, everything in the hadoopRuntime configuration is added to each Hadoop zip artifact you declare in the hadoopZip block.

// In your <rootProject>/<project>/build.gradle:
// Declare Hadoop runtime dependencies using the hadoopRuntime dependency configuration
dependencies {
  hadoopRuntime "org.apache.avro:avro:1.7.7"
  // ...
}
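Because the contents of hadoopRuntime go into your Hadoop zips by default, it can be useful to check what the configuration resolves to. The sketch below uses only the standard Gradle configuration API. The task name printHadoopRuntime is hypothetical and is not provided by the Hadoop Plugin.

```groovy
// In your <rootProject>/<project>/build.gradle:
// Hypothetical helper task (not part of the Hadoop Plugin) that prints the name
// of each file resolved from the hadoopRuntime configuration.
task printHadoopRuntime {
  doLast {
    configurations.hadoopRuntime.each { dep ->
      println dep.name
    }
  }
}
```

Run it with gradle printHadoopRuntime from the project directory.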

Hadoop Zip Artifacts

See the Hadoop Zip Artifacts page.

Azkaban Features

See the Azkaban Features page.

Apache Pig Features

See the Apache Pig Features page.

Apache Spark Features

See the Apache Spark Features page.

Dependency Management Features

See the Dependency Management Features page.

Source Code Metadata Features

See the Source Code Metadata Features page.
