User Guide

Overview

This is the LinkedIn Gradle Plugin for Apache Hadoop User Guide. For the sake of brevity, we will refer to the plugin as simply the "Hadoop Plugin".

The Hadoop Plugin will help you more effectively build, test and deploy Hadoop applications. In particular, the Plugin will help you easily work with Hadoop applications like Apache Pig and build workflows for Hadoop workflow schedulers like Azkaban and Apache Oozie.

The Plugin includes the LinkedIn Gradle DSL for Apache Hadoop (which we shall refer to as simply the "Hadoop DSL"), a language for specifying jobs and workflows for Hadoop workflow schedulers like Azkaban and Apache Oozie. Go directly to the Hadoop DSL Language Reference.

Using the Open-Source Hadoop Plugin

The Hadoop Plugin is now published on plugins.gradle.org. Click on the link for a short snippet to add to your build.gradle file to start using the Hadoop Plugin today!
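
For reference, applying a Gradle plugin from plugins.gradle.org generally looks like the snippet below. The plugin id and version shown here are placeholders only; copy the exact snippet from the Hadoop Plugin's page on plugins.gradle.org.

// In your build.gradle. The id and version below are placeholders; use the
// exact values from the Hadoop Plugin's page on plugins.gradle.org.
plugins {
  id 'com.example.hadoop-plugin' version 'x.y.z'
}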

Using the Hadoop Plugin at LinkedIn

If you are using the Hadoop Plugin internally at LinkedIn, see our comprehensive instructions at go/HadoopPlugin on the LinkedIn Wiki to start using the Plugin.

Hadoop Plugin Tasks

To see all of the Hadoop Plugin tasks, run gradle tasks in the project directory of your Hadoop Plugin project and look at the section titled "Hadoop Plugin tasks". You should see output similar to the following:

Hadoop Plugin tasks
-------------------
azkabanDevHadoopZip - Creates a Hadoop zip archive for azkabanDev
azkabanUpload - Uploads Hadoop zip archive to Azkaban
azkabanProdHadoopZip - Creates a Hadoop zip archive for azkabanProd
buildAzkabanFlows - Builds the Hadoop DSL for Azkaban. Have your build task depend on this task.
buildHadoopZips - Builds all of the Hadoop zip archives. Tasks that depend on Hadoop zips should depend on this task
buildOozieFlows - Builds the Hadoop DSL for Apache Oozie. Have your build task depend on this task.
buildPigCache - Build the cache directory to run Pig scripts by Gradle tasks. This task will be run automatically for you.
buildScmMetadata - Writes SCM metadata about the project to the project's build directory
checkDependencies - Task to help in controlling and monitoring the dependencies used in the project
CRTHadoopZip - Creates a Hadoop CRT deployment zip archive
disallowLocalDependencies - Task to disallow users from checking in local dependencies
oozieCommand - Runs the oozieCommand specified by -Pcommand=CommandName
oozieUpload - Uploads the Oozie project folder to HDFS
printScmMetadata - Prints SCM metadata about the project to the screen
run_count_by_country.pig - Run the Pig script src/main/pig/count_by_country.pig with no Pig parameters or JVM properties
run_count_by_country_python.pig - Run the Pig script src/main/pig/count_by_country_python.pig with no Pig parameters or JVM properties
run_member_event_count.pig - Run the Pig script src/main/pig/member_event_count.pig with no Pig parameters or JVM properties
run_postal_code.pig - Run the Pig script src/main/pig/postal_code.pig with no Pig parameters or JVM properties
run_verify_recommendations.pig - Run the Pig script src/main/pig/verify_recommendations.pig with no Pig parameters or JVM properties
runPigJob - Runs a Pig job configured in the Hadoop DSL with gradle runPigJob -Pjob=<job name>. Uses the Pig parameters and JVM properties from the DSL.
runSparkJob - Runs a Spark job configured in the Hadoop DSL with gradle runSparkJob -PjobName=<job name> -PzipTaskName=<zip task name>. Uses the Spark parameters and JVM properties from the DSL.
showPigJobs - Lists Pig jobs configured in the Hadoop DSL that can be run with the runPigJob task
showSparkJobs - Lists Spark jobs configured in the Hadoop DSL that can be run with the runSparkJob task
startHadoopZips - Container task on which all the Hadoop zip tasks depend
writeAzkabanPluginJson - Writes a default .azkabanPlugin.json file in the project directory
writeOoziePluginJson - Writes a default .ooziePlugin.json file in the project directory
writeScmPluginJson - Writes a default .scmPlugin.json file in the root project directory

Some of these tasks will help you run and debug Hadoop jobs, some of them are related to the Hadoop DSL, and some of them will help you upload to Azkaban. See the sections below for descriptions of each.
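
Each of these tasks can be invoked directly from the command line in your project directory; for example (the Pig job name below is illustrative):

gradle showPigJobs
gradle runPigJob -Pjob=countByCountry
gradle printScmMetadata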

Hadoop DSL Language

The Hadoop Plugin comes with the Hadoop DSL, which makes it easy to specify workflows and jobs for Hadoop workflow schedulers.
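
To give a flavor of the DSL, a minimal workflow declaration might look roughly like the sketch below. The workflow, job, and script names are illustrative, and the authoritative syntax is documented in the Hadoop DSL Language Reference.

// In your <rootProject>/<project>/build.gradle (illustrative sketch)
hadoop {
  buildPath "azkaban"   // where the compiled job files are written
}

workflow('countFlow') {
  pigJob('countByCountry') {
    uses 'src/main/pig/count_by_country.pig'
  }
  targets 'countByCountry'
}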

(Since version 0.3.9) If for some reason you need to disable the Hadoop DSL Plugin, you can pass -PdisableHadoopDslPlugin on the Gradle command line or add disableHadoopDslPlugin=true to your gradle.properties file.
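
For example, either of the following disables the Hadoop DSL Plugin:

gradle build -PdisableHadoopDslPlugin

Or add this line to your gradle.properties file:

disableHadoopDslPlugin=true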

Hadoop DSL Language Reference

The Hadoop DSL Language Reference is documented on its own page at Hadoop DSL Language Reference.

Hadoop DSL Syntax Completion in IntelliJ IDEA

The Hadoop DSL supports automatic syntax completion in all recent versions of IntelliJ IDEA. See the Hadoop DSL Language Reference to learn how to enable this feature.

Building the Hadoop DSL for Azkaban

To learn how to use the Hadoop DSL to build job files for Azkaban, see Azkaban Features.

Hadoop Runtime Dependency Configuration

Applying the Hadoop Plugin will create the hadoopRuntime dependency configuration. You should add dependencies to this configuration that your Hadoop code doesn't need at compile time, but needs at runtime when it executes on the grid.

For projects that also apply the Java Plugin, the hadoopRuntime configuration automatically extends the runtime configuration and includes the output of the jar task. By default, everything in the hadoopRuntime configuration will be added to each Hadoop zip artifact you declare in the hadoopZip block.

To see the dependencies that will be added to the hadoopRuntime configuration, run ligradle dependencies --configuration hadoopRuntime.

// In your <rootProject>/<project>/build.gradle:

// Declare Hadoop runtime dependencies using the hadoopRuntime dependency configuration
dependencies {
  hadoopRuntime "org.apache.avro:avro:1.7.7"
  // ...
}
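
As a rough sketch of how hadoopRuntime and the hadoopZip block fit together, a zip declaration might look like the following. The zip name and paths are illustrative, and the zip(...) syntax shown here is an assumption; see Hadoop Zip Artifacts for the actual syntax.

// In your <rootProject>/<project>/build.gradle (illustrative sketch):
hadoopZip {
  // Declares a zip artifact named "azkabanDev" (built by azkabanDevHadoopZip).
  // Everything in the hadoopRuntime configuration is added to it by default.
  zip("azkabanDev") {
    from("src/main/pig") {
      into "pig"
    }
  }
}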

Hadoop Zip Artifacts

The Hadoop Plugin includes a number of features for building Hadoop zip artifacts that can be uploaded to your Hadoop workflow scheduler: Hadoop Zip Artifacts.

Azkaban Features

The Hadoop Plugin comes with tasks to compile the Hadoop DSL into job files for Azkaban and to upload zip artifacts to Azkaban: Azkaban Features.
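
Assuming your project declares an azkabanProd zip as in the task list above, a typical build-and-upload cycle might look like the following (task dependencies may already chain some of these steps for you):

gradle buildAzkabanFlows
gradle azkabanProdHadoopZip
gradle azkabanUpload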

Apache Pig Features

The Hadoop Plugin comes with features that should make it much easier for you to quickly run and debug Apache Pig scripts: Apache Pig Features.

Apache Spark Features

The Hadoop Plugin comes with features that should make it much easier for you to quickly run Apache Spark programs: Apache Spark Features.

Dependency Management Features

The Hadoop Plugin comes with features that enable your company's Hadoop development and operations teams to detect and prevent poor dependency management practices: Dependency Management Features.

Source Code Metadata Features

The Hadoop Plugin comes with features to record metadata about your source code and to build source code zips for your projects: Source Code Metadata Features.
