
Commit 1c00dcb

Wire up Flink DataProc job submission (#189)
## Summary

This PR adds support for submitting Flink jobs to DataProc. There's a bit of refactoring of the existing submitter code to support the new Flink job type and its params.

For Flink we need two jars: the main jar (flink-assembly*.jar), which contains our FlinkJob main() code, and the cloud_gcp jar, which contains our BigTable classes.

Flink requires some infra that doesn't currently exist in our canary, such as the source Kafka cluster. In the current version of this code, I've created a TestJob (in TestFlinkJob) that sets up an in-memory E2EEvent source along with a mocked GroupBy / GroupByServingInfo. The rest of the job (Spark eval, Avro conversion, BigTable KV store writes) is all wired up.

Follow-ups are called out in a few places in the code; the major ones are:

* More prod-grade Flink settings (things like checkpointing frequency, watermarking interval, ...)
* Support for IDL encoders (starting with proto, as that's what the Etsy folks need)
* Read GroupByServingInfo from BigTable
* Add support for a Kafka source and leverage existing Chronon code for inferring parallelism etc. from the Kafka topic

As we fix these up, we can get rid of TestFlinkJob and the mocked code / classes there.

## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Kicked off Flink jobs using the added test and confirmed that the job comes up, runs successfully, and writes out data to BigTable that I can query:

```
$ cbt -project=canary-443022 -instance=zipline-canary-instance read GROUPBY_STREAMING
E2E_COUNT_STREAMING#test0#1736380800000
  cf:value @ 2025/01/09-15:30:41.199000
    "\x02\x00\x00\x00\x00\x00\x00\x00\x00"
  cf:value @ 2025/01/09-14:55:04.992000
    "\x02\x00\x00\x00\x00\x00\x00\x00\x00"
...
```

## Summary by CodeRabbit (auto-generated release notes)

- **New Features**
  - Enhanced job submission capabilities with support for Spark and Flink job types.
  - Introduced flexible configuration for job properties and parameters.
  - Added command-line argument parsing for Flink jobs.
  - Introduced utility classes for testing Flink jobs, including event generation and data streaming.
- **Improvements**
  - Refined error logging across multiple components for better clarity.
  - Updated dependency management and assembly configurations for improved stability.
  - Improved source and encoder provider abstractions for Flink jobs.
- **Infrastructure**
  - Updated build configurations for better dependency handling and assembly processes.
  - Removed unnecessary configuration entries from the submission configuration file.
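A minimal sketch of how the refactored submitter is invoked for a Flink job, mirroring the "test flink job locally" test added in this PR (the GCS jar URIs are the canary test values used there, not fixed conventions):

```scala
import ai.chronon.integrations.cloud_gcp.DataprocSubmitter
import ai.chronon.spark
import ai.chronon.spark.JobSubmitterConstants.{FlinkMainJarURI, JarURI, MainClass}

// The Flink job type needs both jars: the flink-assembly main jar (FlinkJob.main)
// and the cloud_gcp jar that carries the BigTable classes.
val submitter = DataprocSubmitter()
val jobId = submitter.submit(
  spark.FlinkJob,
  Map(
    MainClass -> "ai.chronon.flink.FlinkJob",
    FlinkMainJarURI -> "gs://zipline-jars/flink-assembly-0.1.0-SNAPSHOT.jar",
    JarURI -> "gs://zipline-jars/cloud_gcp_bigtable.jar"
  ),
  List.empty, // no extra files
  "--online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl",
  "--groupby-name=e2e-count",
  "-ZGCP_PROJECT_ID=bigtable-project-id",
  "-ZGCP_INSTANCE_ID=bigtable-instance-id"
)
```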
1 parent d285791 commit 1c00dcb

11 files changed: +373 additions, −30 deletions

build.sbt

Lines changed: 30 additions & 2 deletions

```diff
@@ -109,7 +109,8 @@ val circe = Seq(
 val flink_all = Seq(
   "org.apache.flink" %% "flink-streaming-scala",
   "org.apache.flink" % "flink-metrics-dropwizard",
-  "org.apache.flink" % "flink-clients"
+  "org.apache.flink" % "flink-clients",
+  "org.apache.flink" % "flink-yarn"
 ).map(_ % flink_1_17)
 
 val vertx_java = Seq(
@@ -213,6 +214,22 @@ lazy val flink = project
   .settings(
     libraryDependencies ++= spark_all,
     libraryDependencies ++= flink_all,
+    assembly / assemblyMergeStrategy := {
+      case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
+      case "reference.conf"                          => MergeStrategy.concat
+      case "application.conf"                        => MergeStrategy.concat
+      case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
+      case _                                         => MergeStrategy.first
+    },
+    // Exclude Hadoop & Guava from the assembled JAR
+    // Else we hit an error - IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its
+    // superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
+    // Or: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(...)
+    // Or: 'com/google/protobuf/MapField' is not assignable to 'com/google/protobuf/MapFieldReflectionAccessor'
+    assembly / assemblyExcludedJars := {
+      val cp = (assembly / fullClasspath).value
+      cp filter { jar => jar.data.getName.startsWith("hadoop-") || jar.data.getName.startsWith("guava") || jar.data.getName.startsWith("protobuf") }
+    },
     libraryDependencies += "org.apache.flink" % "flink-test-utils" % flink_1_17 % Test excludeAll (
       ExclusionRule(organization = "org.apache.logging.log4j", name = "log4j-api"),
       ExclusionRule(organization = "org.apache.logging.log4j", name = "log4j-core"),
@@ -236,13 +253,24 @@ lazy val cloud_gcp = project
     libraryDependencies += "org.json4s" %% "json4s-native" % "3.7.0-M11",
     libraryDependencies += "org.json4s" %% "json4s-core" % "3.7.0-M11",
     libraryDependencies += "org.yaml" % "snakeyaml" % "2.3",
+    libraryDependencies += "io.grpc" % "grpc-netty-shaded" % "1.62.2",
     libraryDependencies ++= avro,
     libraryDependencies ++= spark_all_provided,
     dependencyOverrides ++= jackson,
+    // assembly merge settings to allow Flink jobs to kick off
+    assembly / assemblyMergeStrategy := {
+      case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat // Add to include channel provider
+      case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
+      case "reference.conf"                          => MergeStrategy.concat
+      case "application.conf"                        => MergeStrategy.concat
+      case _                                         => MergeStrategy.first
+    },
     libraryDependencies += "org.mockito" % "mockito-core" % "5.12.0" % Test,
     libraryDependencies += "com.google.cloud" % "google-cloud-bigtable-emulator" % "0.178.0" % Test,
     // force a newer version of reload4j to sidestep: https://security.snyk.io/vuln/SNYK-JAVA-CHQOSRELOAD4J-5731326
-    dependencyOverrides += "ch.qos.reload4j" % "reload4j" % "1.2.25"
+    dependencyOverrides ++= Seq(
+      "ch.qos.reload4j" % "reload4j" % "1.2.25",
+    )
   )
 
 lazy val cloud_gcp_submitter = project
```

cloud_gcp/src/main/resources/dataproc-submitter-conf.yaml

Lines changed: 0 additions & 2 deletions

```diff
@@ -2,5 +2,3 @@
 projectId: "canary-443022"
 region: "us-central1"
 clusterName: "canary-2"
-jarUri: "gs://zipline-jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar"
-mainClass: "ai.chronon.spark.Driver"
```

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitter.scala

Lines changed: 52 additions & 19 deletions

```diff
@@ -1,6 +1,12 @@
 package ai.chronon.integrations.cloud_gcp
 import ai.chronon.spark.JobAuth
 import ai.chronon.spark.JobSubmitter
+import ai.chronon.spark.JobSubmitterConstants.FlinkMainJarURI
+import ai.chronon.spark.JobSubmitterConstants.JarURI
+import ai.chronon.spark.JobSubmitterConstants.MainClass
+import ai.chronon.spark.JobType
+import ai.chronon.spark.{FlinkJob => TypeFlinkJob}
+import ai.chronon.spark.{SparkJob => TypeSparkJob}
 import com.google.api.gax.rpc.ApiException
 import com.google.cloud.dataproc.v1._
 import org.json4s._
@@ -14,9 +20,7 @@ import collection.JavaConverters._
 case class SubmitterConf(
     projectId: String,
     region: String,
-    clusterName: String,
-    jarUri: String,
-    mainClass: String
+    clusterName: String
 ) {
 
   def endPoint: String = s"${region}-dataproc.googleapis.com:443"
@@ -49,38 +53,67 @@ class DataprocSubmitter(jobControllerClient: JobControllerClient, conf: Submitte
     job.getDone
   }
 
-  override def submit(files: List[String], args: String*): String = {
-
-    val sparkJob = SparkJob
-      .newBuilder()
-      .setMainClass(conf.mainClass)
-      .addJarFileUris(conf.jarUri)
-      .addAllFileUris(files.asJava)
-      .addAllArgs(args.toIterable.asJava)
-      .build()
+  override def submit(jobType: JobType,
+                      jobProperties: Map[String, String],
+                      files: List[String],
+                      args: String*): String = {
+    val mainClass = jobProperties.getOrElse(MainClass, throw new RuntimeException("Main class not found"))
+    val jarUri = jobProperties.getOrElse(JarURI, throw new RuntimeException("Jar URI not found"))
+
+    val jobBuilder = jobType match {
+      case TypeSparkJob => buildSparkJob(mainClass, jarUri, files, args: _*)
+      case TypeFlinkJob =>
+        val mainJarUri =
+          jobProperties.getOrElse(FlinkMainJarURI, throw new RuntimeException(s"Missing expected $FlinkMainJarURI"))
+        buildFlinkJob(mainClass, mainJarUri, jarUri, args: _*)
+    }
 
     val jobPlacement = JobPlacement
       .newBuilder()
       .setClusterName(conf.clusterName)
       .build()
 
     try {
-      val job = Job
-        .newBuilder()
+      val job = jobBuilder
         .setReference(jobReference)
         .setPlacement(jobPlacement)
-        .setSparkJob(sparkJob)
         .build()
 
       val submittedJob = jobControllerClient.submitJob(conf.projectId, conf.region, job)
       submittedJob.getReference.getJobId
 
     } catch {
       case e: ApiException =>
-        throw new RuntimeException(s"Failed to submit job: ${e.getMessage}")
+        throw new RuntimeException(s"Failed to submit job: ${e.getMessage}", e)
     }
   }
 
+  private def buildSparkJob(mainClass: String, jarUri: String, files: List[String], args: String*): Job.Builder = {
+    val sparkJob = SparkJob
+      .newBuilder()
+      .setMainClass(mainClass)
+      .addJarFileUris(jarUri)
+      .addAllFileUris(files.asJava)
+      .addAllArgs(args.toIterable.asJava)
+      .build()
+    Job.newBuilder().setSparkJob(sparkJob)
+  }
+
+  private def buildFlinkJob(mainClass: String, mainJarUri: String, jarUri: String, args: String*): Job.Builder = {
+    val envProps =
+      Map("jobmanager.memory.process.size" -> "4G", "yarn.classpath.include-user-jar" -> "FIRST")
+
+    val flinkJob = FlinkJob
+      .newBuilder()
+      .setMainClass(mainClass)
+      .setMainJarFileUri(mainJarUri)
+      .putAllProperties(envProps.asJava)
+      .addJarFileUris(jarUri)
+      .addAllArgs(args.toIterable.asJava)
+      .build()
+    Job.newBuilder().setFlinkJob(flinkJob)
+  }
+
   def jobReference: JobReference = JobReference.newBuilder().build()
 }
 
@@ -146,14 +179,14 @@ object DataprocSubmitter {
     val submitterConf = SubmitterConf(
       projectId,
       region,
-      clusterName,
-      chrononJarUri,
-      "ai.chronon.spark.Driver"
+      clusterName
     )
 
     val a = DataprocSubmitter(submitterConf)
 
     val jobId = a.submit(
+      TypeSparkJob,
+      Map(MainClass -> "ai.chronon.spark.Driver", JarURI -> chrononJarUri),
       gcsFiles.toList,
       userArgs: _*
     )
```

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala

Lines changed: 29 additions & 3 deletions

```diff
@@ -1,5 +1,9 @@
 package ai.chronon.integrations.cloud_gcp
 
+import ai.chronon.spark
+import ai.chronon.spark.JobSubmitterConstants.FlinkMainJarURI
+import ai.chronon.spark.JobSubmitterConstants.JarURI
+import ai.chronon.spark.JobSubmitterConstants.MainClass
 import com.google.api.gax.rpc.UnaryCallable
 import com.google.cloud.dataproc.v1._
 import com.google.cloud.dataproc.v1.stub.JobControllerStub
@@ -37,21 +41,39 @@ class DataprocSubmitterTest extends AnyFunSuite with MockitoSugar {
 
     val submitter = new DataprocSubmitter(
       mockJobControllerClient,
-      SubmitterConf("test-project", "test-region", "test-cluster", "test-jar-uri", "test-main-class"))
+      SubmitterConf("test-project", "test-region", "test-cluster"))
 
-    val submittedJobId = submitter.submit(List.empty)
+    val submittedJobId = submitter.submit(spark.SparkJob, Map(MainClass -> "test-main-class", JarURI -> "test-jar-uri"), List.empty)
     assertEquals(submittedJobId, jobId)
   }
 
   test("Verify classpath with spark-bigquery-connector") {
     BigQueryUtilScala.validateScalaVersionCompatibility()
   }
 
+  ignore("test flink job locally") {
+    val submitter = DataprocSubmitter()
+    val submittedJobId =
+      submitter.submit(spark.FlinkJob,
+                       Map(MainClass -> "ai.chronon.flink.FlinkJob",
+                           FlinkMainJarURI -> "gs://zipline-jars/flink-assembly-0.1.0-SNAPSHOT.jar",
+                           JarURI -> "gs://zipline-jars/cloud_gcp_bigtable.jar"),
+                       List.empty,
+                       "--online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl",
+                       "--groupby-name=e2e-count",
+                       "-ZGCP_PROJECT_ID=bigtable-project-id",
+                       "-ZGCP_INSTANCE_ID=bigtable-instance-id")
+    println(submittedJobId)
+  }
+
   ignore("Used to iterate locally. Do not enable this in CI/CD!") {
 
     val submitter = DataprocSubmitter()
     val submittedJobId =
       submitter.submit(
+        spark.SparkJob,
+        Map(MainClass -> "ai.chronon.spark.Driver",
+            JarURI -> "gs://zipline-jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar"),
         List("gs://zipline-jars/training_set.v1",
              "gs://zipline-jars/dataproc-submitter-conf.yaml",
              "gs://zipline-jars/additional-confs.yaml"),
@@ -67,7 +89,11 @@ class DataprocSubmitterTest extends AnyFunSuite with MockitoSugar {
 
     val submitter = DataprocSubmitter()
     val submittedJobId =
-      submitter.submit(List.empty,
+      submitter.submit(
+        spark.SparkJob,
+        Map(MainClass -> "ai.chronon.spark.Driver",
+            JarURI -> "gs://zipline-jars/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar"),
+        List.empty,
         "groupby-upload-bulk-load",
         "-ZGCP_PROJECT_ID=bigtable-project-id",
         "-ZGCP_INSTANCE_ID=bigtable-instance-id",
```

flink/src/main/scala/ai/chronon/flink/AsyncKVStoreWriter.scala

Lines changed: 1 addition & 1 deletion

```diff
@@ -119,7 +119,7 @@ class AsyncKVStoreWriter(onlineImpl: Api, featureGroupName: String)
       // in the KVStore - we log the exception and skip the object to
       // not fail the app
       errorCounter.inc()
-      logger.error(s"Caught exception writing to KVStore for object: $input - $exception")
+      logger.error(s"Caught exception writing to KVStore for object: $input", exception)
       resultFuture.complete(util.Arrays.asList[WriteResponse](WriteResponse(input, status = false)))
     }
   }
```

flink/src/main/scala/ai/chronon/flink/AvroCodecFn.scala

Lines changed: 1 addition & 1 deletion

```diff
@@ -108,7 +108,7 @@ case class AvroCodecFn[T](groupByServingInfoParsed: GroupByServingInfoParsed)
       case e: Exception =>
         // To improve availability, we don't rethrow the exception. We just drop the event
         // and track the errors in a metric. Alerts should be set up on this metric.
-        logger.error(s"Error converting to Avro bytes - $e")
+        logger.error("Error converting to Avro bytes", e)
         eventProcessingErrorCounter.inc()
         avroConversionErrorCounter.inc()
     }
```

flink/src/main/scala/ai/chronon/flink/FlinkJob.scala

Lines changed: 57 additions & 0 deletions

```diff
@@ -9,6 +9,7 @@ import ai.chronon.flink.window.FlinkRowAggProcessFunction
 import ai.chronon.flink.window.FlinkRowAggregationFunction
 import ai.chronon.flink.window.KeySelector
 import ai.chronon.flink.window.TimestampedTile
+import ai.chronon.online.Api
 import ai.chronon.online.GroupByServingInfoParsed
 import ai.chronon.online.KVStore.PutRequest
 import ai.chronon.online.SparkConversions
@@ -22,6 +23,9 @@ import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner
 import org.apache.flink.streaming.api.windowing.time.Time
 import org.apache.flink.streaming.api.windowing.windows.TimeWindow
 import org.apache.spark.sql.Encoder
+import org.rogach.scallop.ScallopConf
+import org.rogach.scallop.ScallopOption
+import org.rogach.scallop.Serialization
 import org.slf4j.LoggerFactory
 
 /**
@@ -196,3 +200,56 @@ class FlinkJob[T](eventSrc: FlinkSource[T],
     )
   }
 }
+
+object FlinkJob {
+  // Pull in the Serialization trait to sidestep: https://github.com/scallop/scallop/issues/137
+  class JobArgs(args: Seq[String]) extends ScallopConf(args) with Serialization {
+    val onlineClass: ScallopOption[String] =
+      opt[String](required = true,
+                  descr = "Fully qualified Online.Api based class. We expect the jar to be on the class path")
+    val groupbyName: ScallopOption[String] =
+      opt[String](required = true, descr = "The name of the groupBy to process")
+    val mockSource: ScallopOption[Boolean] =
+      opt[Boolean](required = false, descr = "Use a mocked data source instead of a real source", default = Some(true))
+
+    val apiProps: Map[String, String] = props[String]('Z', descr = "Props to configure API / KV Store")
+
+    verify()
+  }
+
+  def main(args: Array[String]): Unit = {
+    val jobArgs = new JobArgs(args)
+    jobArgs.groupbyName()
+    val onlineClassName = jobArgs.onlineClass()
+    val props = jobArgs.apiProps.map(identity)
+    val useMockedSource = jobArgs.mockSource()
+
+    val api = buildApi(onlineClassName, props)
+    val flinkJob =
+      if (useMockedSource) {
+        // We will yank this conditional block when we wire up our real sources etc.
+        TestFlinkJob.buildTestFlinkJob(api)
+      } else {
+        // TODO - what we need to do when we wire this up for real
+        // lookup groupByServingInfo by groupByName from the kv store
+        // based on the topic type (e.g. kafka / pubsub) and the schema class name:
+        // 1. lookup schema object using SchemaProvider (e.g SchemaRegistry / Jar based)
+        // 2. Create the appropriate Encoder for the given schema type
+        // 3. Invoke the appropriate source provider to get the source, encoder, parallelism
+        throw new IllegalArgumentException("We don't support non-mocked sources like Kafka / PubSub yet!")
+      }
+
+    val env = StreamExecutionEnvironment.getExecutionEnvironment
+    // TODO add useful configs
+    flinkJob.runGroupByJob(env).addSink(new PrintSink) // TODO wire up a metrics sink / such
+    env.execute(s"${flinkJob.groupByName}")
+  }
+
+  def buildApi(onlineClass: String, props: Map[String, String]): Api = {
+    val cl = Thread.currentThread().getContextClassLoader // Use Flink's classloader
+    val cls = cl.loadClass(onlineClass)
+    val constructor = cls.getConstructors.apply(0)
+    val onlineImpl = constructor.newInstance(props)
+    onlineImpl.asInstanceOf[Api]
+  }
+}
```
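To make the new CLI surface concrete, here is a small hypothetical snippet showing how the JobArgs class above parses the same style of flags the integration test passes to the Flink job; the project/instance values are placeholders:

```scala
// Hypothetical illustration only: parse flags matching those used in
// DataprocSubmitterTest's local Flink test (values are placeholders).
val jobArgs = new FlinkJob.JobArgs(Seq(
  "--online-class", "ai.chronon.integrations.cloud_gcp.GcpApiImpl",
  "--groupby-name", "e2e-count",
  "-ZGCP_PROJECT_ID=my-project",
  "-ZGCP_INSTANCE_ID=my-instance"
))

jobArgs.onlineClass() // "ai.chronon.integrations.cloud_gcp.GcpApiImpl"
jobArgs.groupbyName() // "e2e-count"
jobArgs.mockSource()  // true - defaults to the mocked TestFlinkJob source for now
jobArgs.apiProps      // Map("GCP_PROJECT_ID" -> "my-project", "GCP_INSTANCE_ID" -> "my-instance")
```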
New file: 23 additions & 0 deletions

```diff
@@ -0,0 +1,23 @@
+package ai.chronon.flink
+
+import ai.chronon.online.GroupByServingInfoParsed
+import org.apache.spark.sql.Encoder
+
+/**
+ * SourceProvider is an abstract class that provides a way to build a source for a Flink job.
+ * It takes the groupByServingInfo as an argument and based on the configured GB details, configures
+ * the Flink source (e.g. Kafka or PubSub) with the right parallelism etc.
+ */
+abstract class SourceProvider[T](maybeGroupByServingInfoParsed: Option[GroupByServingInfoParsed]) {
+  // Returns a tuple of the source, parallelism
+  def buildSource(): (FlinkSource[T], Int)
+}
+
+/**
+ * EncoderProvider is an abstract class that provides a way to build a Spark encoder for a Flink job.
+ * These encoders are used in the SparkExprEval Flink function to convert the incoming stream into types
+ * that are amenable for tiled / untiled processing.
+ */
+abstract class EncoderProvider[T] {
+  def buildEncoder(): Encoder[T]
+}
```
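To illustrate how these abstractions are meant to plug together (the PR's TestFlinkJob does something along these lines with an in-memory E2EEvent source), here is a hedged sketch: E2ETestEvent, E2ETestEncoderProvider, and InMemorySourceProvider are hypothetical names, and a real provider would derive the source and parallelism from the GroupByServingInfo rather than wrapping a prebuilt FlinkSource.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical flat event type, standing in for the in-memory E2EEvent used by TestFlinkJob.
case class E2ETestEvent(id: String, intVal: Int, ts: Long)

// Spark can derive an Encoder for a flat case class directly.
class E2ETestEncoderProvider extends EncoderProvider[E2ETestEvent] {
  override def buildEncoder(): Encoder[E2ETestEvent] = Encoders.product[E2ETestEvent]
}

// Wraps an already-built in-memory FlinkSource and runs it with parallelism 1.
// A real provider would inspect the GroupByServingInfo (topic, schema, ...) instead.
class InMemorySourceProvider(source: FlinkSource[E2ETestEvent])
    extends SourceProvider[E2ETestEvent](maybeGroupByServingInfoParsed = None) {
  override def buildSource(): (FlinkSource[E2ETestEvent], Int) = (source, 1)
}
```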

flink/src/main/scala/ai/chronon/flink/SparkExpressionEvalFn.scala

Lines changed: 1 addition & 1 deletion

```diff
@@ -109,7 +109,7 @@ class SparkExpressionEvalFn[T](encoder: Encoder[T], groupBy: GroupBy) extends Ri
       case e: Exception =>
         // To improve availability, we don't rethrow the exception. We just drop the event
         // and track the errors in a metric. Alerts should be set up on this metric.
-        logger.error(s"Error evaluating Spark expression - $e")
+        logger.error("Error evaluating Spark expression", e)
         exprEvalErrorCounter.inc()
     }
   }
```
