Build failure reasons for synchronous jobs (check/spec/discover) #14715

pedroslopez · 2022-07-14T15:35:00Z

What

As described in Spec: Failure Reasons for Synchronous Jobs, to report connector failures from synchronous jobs to Sentry (#13857), we need to build and surface failure reasons from AirbyteTraceMessages for these jobs.

This PR focuses on building and surfacing these FailureReasons. A follow-up PR will use them to actually send the failures to Sentry.

How

Check/Spec/Discover workers now look for AirbyteTraceMessages and, when they encounter a non-zero exit code, build a FailureReason from them.
Check/Spec/Discover workers now output a ConnectorJobOutput that contains a failureReason field.
TemporalResponse is now marked as succeeded=false if the output is a ConnectorJobOutput with a FailureReason.

🚨 User Impact 🚨

This change should have no user impact.

pedroslopez · 2022-07-14T15:42:45Z

...uler/client/src/main/java/io/airbyte/scheduler/client/DefaultSynchronousSchedulerClient.java

+  <T, U> SynchronousResponse<T> execute(final ConfigType configType,
+                                        @Nullable final UUID connectorDefinitionId,
+                                        final Function<UUID, TemporalResponse<U>> executor,
+                                        final Function<U, T> outputMapper,
+                                        final UUID workspaceId) {


This method is mostly unchanged. To avoid changing the consumers of these SynchronousResponses, I introduced an outputMapper function that converts the common ConnectorJobOutput into the specific response type we want.

pedroslopez · 2022-07-14T15:44:05Z

...uler/client/src/main/java/io/airbyte/scheduler/client/DefaultSynchronousSchedulerClient.java

+      track(jobId, configType, connectorDefinitionId, workspaceId, outputState, mappedOutput);
+      // TODO(pedro): report ConnectorJobOutput's failureReason to the JobErrorReporter, like the above


To keep this PR somewhat manageable, I've limited it to building and surfacing the failure reasons for these jobs, without actually reporting them to the JobErrorReporter (Sentry). This is where that reporting would happen, following the same pattern that's used for the JobTracker (Segment). It will come in a subsequent PR.

pedroslopez · 2022-07-14T15:46:53Z

airbyte-workers/src/main/java/io/airbyte/workers/general/CheckConnectionWorker.java

 import io.airbyte.workers.Worker;

-public interface CheckConnectionWorker extends Worker<StandardCheckConnectionInput, StandardCheckConnectionOutput> {}
+public interface CheckConnectionWorker extends Worker<StandardCheckConnectionInput, ConnectorJobOutput> {}


One thing I don't love about using a common ConnectorJobOutput is that we lose some type safety in what these workers return, but it seems OK in order to stick with our existing convention of using POJOs/JSON schema to define these outputs.

pedroslopez · 2022-07-14T15:49:46Z

airbyte-workers/src/main/java/io/airbyte/workers/general/DefaultDiscoverCatalogWorker.java

+        messagesByType = streamFactory.create(IOs.newBufferedReader(stdout))
+            .collect(Collectors.groupingBy(AirbyteMessage::getType));


This is how we are building failure reasons from trace messages for each of these jobs:

1. Instead of just filtering for the desired output when reading the stream initially, collect the messages by type.

2. Filter for our desired output.

3. Filter for a trace message if needed and use the existing FailureHelper to produce a FailureReason from it.
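The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `Msg`, `MsgType`, and `MessagesByType` are hypothetical stand-ins for `AirbyteMessage` and its type enum, and the failure is reduced to a plain string rather than a FailureReason built by FailureHelper.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-ins for AirbyteMessage and its type enum.
enum MsgType { CATALOG, TRACE, LOG }

final class Msg {
  final MsgType type;
  final String payload;
  final boolean isError; // only meaningful for TRACE messages

  Msg(final MsgType type, final String payload, final boolean isError) {
    this.type = type;
    this.payload = payload;
    this.isError = isError;
  }

  MsgType getType() {
    return type;
  }
}

final class MessagesByType {

  // Step 1: bucket every message by type instead of filtering while reading.
  static Map<MsgType, List<Msg>> group(final List<Msg> messages) {
    return messages.stream().collect(Collectors.groupingBy(Msg::getType));
  }

  // Step 2: pick the first message of the type the job actually wants.
  static Optional<Msg> firstOfType(final Map<MsgType, List<Msg>> byType, final MsgType wanted) {
    return byType.getOrDefault(wanted, List.of()).stream().findFirst();
  }

  // Step 3: pick the first error-type trace message, if any, as the failure text.
  static Optional<String> failureMessage(final Map<MsgType, List<Msg>> byType) {
    return byType.getOrDefault(MsgType.TRACE, List.of()).stream()
        .filter(m -> m.isError)
        .map(m -> m.payload)
        .findFirst();
  }
}
```

Grouping once up front means the trace messages are still available after the desired output has been looked up, which a filter applied while reading the stream would have discarded.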

pedroslopez · 2022-07-14T15:52:18Z

airbyte-workers/src/main/java/io/airbyte/workers/temporal/TemporalClient.java

+    boolean succeeded = exception == null;
+    if (succeeded && operationOutput instanceof ConnectorJobOutput) {
+      succeeded = getConnectorJobSucceeded((ConnectorJobOutput) operationOutput);
+    }
+
+    final JobMetadata metadata = new JobMetadata(succeeded, logPath);


This is the piece that determines whether the TemporalResponse was successful. Previously, only exceptions counted as unsuccessful; now a ConnectorJobOutput with a failure reason also counts as unsuccessful.
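The rule can be illustrated in isolation with the sketch below. `JobOutputSketch` and `SuccessSketch` are hypothetical names, and the body of `getConnectorJobSucceeded` is not shown in this diff, so the null check on the failure reason is an assumption based on the PR description.

```java
// Hypothetical stand-in for ConnectorJobOutput; only the failure reason matters here.
final class JobOutputSketch {
  final String failureReason; // null when the connector reported no failure

  JobOutputSketch(final String failureReason) {
    this.failureReason = failureReason;
  }
}

final class SuccessSketch {

  // Mirrors the shape of the diff: an exception always means failure, and a
  // connector job output additionally fails when it carries a failure reason.
  static boolean succeeded(final Exception exception, final Object operationOutput) {
    boolean succeeded = exception == null;
    if (succeeded && operationOutput instanceof JobOutputSketch) {
      // Assumed behaviour of getConnectorJobSucceeded: no failure reason means success.
      succeeded = ((JobOutputSketch) operationOutput).failureReason == null;
    }
    return succeeded;
  }
}
```

Outputs of any other type keep the old behaviour, so only the check/spec/discover paths are affected by the extra condition.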

pedroslopez · 2022-07-14T15:54:57Z

...workers/src/main/java/io/airbyte/workers/temporal/scheduling/SyncCheckConnectionFailure.java

+    if (failureOutput.getFailureReason() != null) {
+      syncOutput.setFailures(List.of(failureOutput.getFailureReason().withFailureOrigin(origin)));
+    } else {
+      final StandardCheckConnectionOutput checkOutput = failureOutput.getCheckConnection();
+      final Exception ex = new IllegalArgumentException(checkOutput.getMessage());
+      final FailureReason checkFailureReason = FailureHelper.checkFailure(ex, jobId, attemptId, origin);
+      syncOutput.setFailures(List.of(checkFailureReason));
+    }


When building the failure for the checks that run before syncs: if the check failed with a FailureReason from the connector, just use that. Otherwise, build a FailureReason as before.

pedroslopez · 2022-07-14T16:05:24Z

...kers/src/main/java/io/airbyte/workers/temporal/scheduling/ConnectionManagerWorkflowImpl.java

@@ -350,8 +349,8 @@ private SyncCheckConnectionFailure checkConnections(final GenerateInputActivity.
      log.info("SOURCE CHECK: Skipped");
    } else {
      log.info("SOURCE CHECK: Starting");
-      final StandardCheckConnectionOutput sourceCheckResponse = runMandatoryActivityWithOutput(checkActivity::run, checkSourceInput);
-      if (sourceCheckResponse.getStatus() == Status.FAILED) {
+      final ConnectorJobOutput sourceCheckResponse = runMandatoryActivityWithOutput(checkActivity::run, checkSourceInput);


This activity, which is called from the ConnectionManagerWorkflow, now has a different output type. I know that when changing the inputs we have to do some version checks. Is there anything special to do here because of this output change?

@benmoriceau is the expert here, but since these are short-lived activities (vs syncs which can take days and span multiple deployments), maybe we don't need to do a version check?

Unfortunately, we need a new version here. You can check that by replacing the workflowHistory in WorkflowReplayingTest with one generated from master. Here the test doesn't fail because it is already controlled by a version. The time that an activity takes to run doesn't influence how likely it is to have a versioning issue. The main factor is where in the workflow the activity runs. For example, the last activity is less likely to have a version issue because the workflow will terminate or continue-as-new after it, while for the first activity it is more likely that the workflow will be unloaded from memory and potentially replayed.

@benmoriceau Ah! Thanks for pointing out the WorkflowReplayingTest - it did indeed fail and was super helpful for getting this working. I've updated this to consider a version in 6305fe4

Thanks @pedroslopez for that. We have one last issue to resolve to ensure that workflows are not blocked because of a version. After that we will clear all the versions. Going forward we will still use versions, but they will be removed in a timely manner.
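To make the replay concern in this thread concrete, here is a toy simulation of version gating. It does not use the Temporal SDK: `VersionGateSketch` and the change id `check_job_output` are hypothetical, and the `getVersion` method only imitates the general idea of a recorded version marker (a history written before the change has no marker, so replay falls back to a default version).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a version marker store; not the Temporal API.
final class VersionGateSketch {
  static final int DEFAULT_VERSION = -1;

  private final Map<String, Integer> recordedMarkers;
  private final boolean replaying;

  VersionGateSketch(final Map<String, Integer> recordedMarkers, final boolean replaying) {
    this.recordedMarkers = new HashMap<>(recordedMarkers);
    this.replaying = replaying;
  }

  // On replay, return whatever the history recorded (or the default if nothing
  // was recorded). On a fresh run, record and return the newest version.
  int getVersion(final String changeId, final int maxSupported) {
    if (replaying) {
      return recordedMarkers.getOrDefault(changeId, DEFAULT_VERSION);
    }
    recordedMarkers.put(changeId, maxSupported);
    return maxSupported;
  }

  // Workflow code branches on the version so that old histories still
  // deserialize the old activity output type.
  String runCheck() {
    final int version = getVersion("check_job_output", 1);
    return version < 1 ? "legacy: StandardCheckConnectionOutput" : "current: ConnectorJobOutput";
  }
}
```

A workflow replayed from a pre-change history takes the legacy branch, while a newly started one takes the current branch, which is the situation a replay test with an older workflow history exercises.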

evantahler

Really nice work @pedroslopez! I defer to the reviews of the other members of the team, but 👍 from me!

evantahler · 2022-07-14T17:10:01Z

airbyte-config/config-models/src/main/resources/types/ConnectorJobOutput.yaml

+"$schema": http://json-schema.org/draft-07/schema#
+"$id": https://github.com/airbytehq/airbyte/blob/master/airbyte-config/models/src/main/resources/types/ConnectorJobOutput.yaml
+title: ConnectorJobOutput
+description: connector command job output


Suggested change

description: connector command job output

description: connector command job output for all connector commands other than READ and WRITE

evantahler · 2022-07-14T17:26:20Z

@benmoriceau is the expert here, but since these are short-lived activities (vs syncs which can take days and span multiple deployments), maybe we don't need to do a version check?

benmoriceau · 2022-07-14T16:14:46Z

airbyte-scheduler/client/src/main/java/io/airbyte/scheduler/client/SynchronousResponse.java

-                                                                final UUID configId,
-                                                                final long createdAt,
-                                                                final long endedAt) {
+  public static <T, U> SynchronousResponse<T> fromTemporalResponse(final TemporalResponse<U> temporalResponse,


Shouldn't it be SynchronousResponse<U> ?

Because I wanted to keep the SynchronousResponse the same instead of the more generic type, the output is mapped and provided directly as an argument, so the TemporalResponse type and SynchronousResponse types can be different.

For example, for the discover job we have TemporalResponse<ConnectorJobOutput> and SynchronousResponse<AirbyteCatalog>
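A stripped-down sketch of this shape, using hypothetical `*Sketch` classes rather than the real Airbyte types, and a plain string in place of AirbyteCatalog:

```java
import java.util.function.Function;

// Hypothetical stand-in for TemporalResponse<U>.
final class TemporalResponseSketch<U> {
  final U output;
  final boolean succeeded;

  TemporalResponseSketch(final U output, final boolean succeeded) {
    this.output = output;
    this.succeeded = succeeded;
  }
}

// Hypothetical stand-in for SynchronousResponse<T>.
final class SynchronousResponseSketch<T> {
  final T output;
  final boolean succeeded;

  SynchronousResponseSketch(final T output, final boolean succeeded) {
    this.output = output;
    this.succeeded = succeeded;
  }

  // T and U are independent: the already-mapped output is passed in directly.
  static <T, U> SynchronousResponseSketch<T> fromTemporalResponse(final TemporalResponseSketch<U> temporalResponse,
                                                                  final T mappedOutput) {
    return new SynchronousResponseSketch<>(mappedOutput, temporalResponse.succeeded);
  }
}

// Hypothetical stand-in for ConnectorJobOutput carrying a discover result.
final class JobOutputWithCatalog {
  final String catalog;

  JobOutputWithCatalog(final String catalog) {
    this.catalog = catalog;
  }
}

final class ExecuteSketch {

  // The caller supplies the mapper from the common output type to the specific one.
  static <T, U> SynchronousResponseSketch<T> execute(final TemporalResponseSketch<U> temporalResponse,
                                                     final Function<U, T> outputMapper) {
    final U raw = temporalResponse.output;
    final T mapped = raw == null ? null : outputMapper.apply(raw);
    return SynchronousResponseSketch.fromTemporalResponse(temporalResponse, mapped);
  }
}
```

With this arrangement a discover call can go from `TemporalResponseSketch<JobOutputWithCatalog>` to `SynchronousResponseSketch<String>` by passing `o -> o.catalog` as the mapper, so downstream consumers never see the common output type.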

benmoriceau · 2022-07-14T22:50:48Z

Unfortunately, we need a new version here. You can check that by replacing the workflowHistory in WorkflowReplayingTest with one generated from master. Here the test doesn't fail because it is already controlled by a version. The time that an activity takes to run doesn't influence how likely it is to have a versioning issue. The main factor is where in the workflow the activity runs. For example, the last activity is less likely to have a version issue because the workflow will terminate or continue-as-new after it, while for the first activity it is more likely that the workflow will be unloaded from memory and potentially replayed.

benmoriceau · 2022-07-18T23:54:48Z

.../src/main/java/io/airbyte/workers/temporal/check/connection/CheckConnectionWorkflowImpl.java


-    return activity.run(new CheckConnectionInput(jobRunConfig, launcherConfig, connectionConfiguration));
+    return activity.runWithJobOutput(new CheckConnectionInput(jobRunConfig, launcherConfig, connectionConfiguration));


A replay problem is less likely here because of how little time this activity takes to run, but we will still need a version here.

Added in aefc013. As part of this I also added a test with a previous workflow history to make sure it passes (it was indeed failing before I added the version checks).

pedroslopez added 15 commits July 13, 2022 19:05

demo for surfacing synchronous job failures

f942322

add missing changes for StandardDiscoverCatalogOutput impl

cde49d7

extract trace message failure reason for discover job

d40d062

move to using a single pojo to represent synchronous job outputs

fe6489a

format

65a6d29

handle new output type in check before sync

587d15c

re-genericize DefaultSynchronousSchedulerClient.execute

45873df

fix failing tests

d094a91

fix failing scheduler client tests

1e7267d

get spec returns failure reason from trace message

3413970

build failure reason from trace message for check job

46c52c2

type safety

16ec97a

only consider error-type trace messages

02a6023

add more tests

d9d7c6c

just use nulls

1fa8953

github-actions bot added the area/platform, area/scheduler, area/server, and area/worker labels Jul 14, 2022

pedroslopez marked this pull request as ready for review July 14, 2022 16:10

pedroslopez requested review from evantahler, benmoriceau and lmossman July 14, 2022 16:11

pedroslopez added the team/extensibility label Jul 14, 2022

evantahler reviewed Jul 14, 2022

benmoriceau requested changes Jul 14, 2022

Merge branch 'master' into pedroslopez/synchr-job-reporter

4d52806

pedroslopez added 2 commits July 15, 2022 20:05

this was removed but incorrectly re-added when merging master into the branch

0980d3d

check output version for workflow replay support

6305fe4

pedroslopez added 2 commits July 18, 2022 12:09

refactor trace message finding to util method

7eacb6e

additionalProperties: true

ca6e575

pedroslopez requested a review from benmoriceau July 18, 2022 16:25

pedroslopez mentioned this pull request Jul 18, 2022

report synchronous check/spec/discover failures to JobErrorReporter #14818

Merged

benmoriceau reviewed Jul 18, 2022

add versioning for CheckConnectionWorkflow

aefc013

update comment

7808962

pedroslopez requested a review from benmoriceau July 19, 2022 20:39

benmoriceau approved these changes Jul 19, 2022

pedroslopez merged commit 198e580 into master Jul 19, 2022

pedroslopez deleted the pedroslopez/synchr-job-reporter branch July 19, 2022 21:19

pedroslopez mentioned this pull request Jul 19, 2022

fix build: update connector command worker output usage in tests #14859

Merged

edgao mentioned this pull request Jul 19, 2022

FIx build #14860

Closed

octavia-squidington-iii mentioned this pull request Jul 23, 2022

Bump Airbyte version from 0.39.37-alpha to 0.39.38-alpha #14976

Merged
