EcsRunTaskOperator reattach does not work #29601


Closed
1 of 2 tasks
deanmorin opened this issue Feb 17, 2023 · 1 comment · Fixed by #29447
Labels
area:providers · good first issue · kind:bug · provider:amazon

Comments

@deanmorin
Contributor

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==7.1.0

Apache Airflow version

2.5.1

Operating System

Docker on ECS (apache/airflow:2.5.1-python3.9)

Deployment

Other Docker-based deployment

Deployment details

No response

What happened

The EcsRunTaskOperator has a reattach option. The idea is that when a task is launched on ECS, its ARN is saved to the XCom table so that if Airflow restarts mid-task, the operator can reattach to the already-running ECS task rather than launching a new one.

However, it always fails to retrieve the ARN of the running task from the XCom table.

What you think should happen instead

When Airflow restarts and retries an EcsRunTaskOperator task that was killed by the restart, it should find the ARN of the currently-running ECS task and continue waiting for that ECS task to finish instead of starting a new one.

How to reproduce

  1. Create a DAG containing an EcsRunTaskOperator task which, for testing purposes, takes at least a few minutes to complete (a minimal sketch is given below)
  2. Trigger a DAG run
  3. While the task is running, redeploy Airflow

Check the task logs once the task restarts: you'll see "No active previously launched task found to reattach".
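
For reference, a minimal DAG along the lines of step 1 might look like this (a sketch only; the cluster, task definition, and container names are placeholders, and the sleep command just simulates a long-running task):

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Sketch only: cluster, task definition, and container names are placeholders.
with DAG(
    dag_id="ecs_reattach_repro",
    start_date=datetime(2023, 2, 1),
    schedule=None,
    catchup=False,
):
    EcsRunTaskOperator(
        task_id="long_running_ecs_task",
        cluster="my-cluster",
        task_definition="my-task-definition",
        overrides={
            "containerOverrides": [
                {"name": "my-container", "command": ["sleep", "600"]}
            ]
        },
        reattach=True,  # the option this issue is about
    )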

Anything else

There are two problems from what I can tell.

Problem 1

When it pushes the XCom data, it uses the task's own task_id.

When it tries to retrieve the data, it uses a made-up task_id, so it'll never find the value saved earlier.

It also uses the same made-up task_id when it later tries to delete the XCom data.
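
In simplified form (an illustration of the mismatch described above, not the exact provider source), the push and pull end up looking at different rows:

# Simplified illustration, not the exact provider code.

# Push: the ARN is stored under the running task's own task_id.
self.xcom_push(context, key=self.REATTACH_XCOM_KEY, value=self.arn)

# Pull: the lookup uses a synthetic task_id such as "{task_id}_task_arn",
# so it never matches the row pushed above.
previous_task_arn = self.xcom_pull(
    context,
    task_ids=self.REATTACH_XCOM_TASK_ID_TEMPLATE.format(task_id=self.task_id),
    key=self.REATTACH_XCOM_KEY,
)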

Problem 2

Switching from the made-up task_id to the normal task_id during retrieval doesn't help either, since all XCom rows for that task/dag/run id are deleted when the task restarts, so the ARN saved on the previous attempt is never available.

I tried changing the xcom_push to this:

XCom.set(
    task_id=self.REATTACH_XCOM_TASK_ID_TEMPLATE.format(task_id=self.task_id),
    key=self.REATTACH_XCOM_KEY,
    value=self.arn,
    dag_id=self.dag_id,
    run_id=context["ti"].run_id,
)

This causes the following error:

 psycopg2.errors.ForeignKeyViolation: insert or update on table "xcom" violates foreign key constraint "xcom_task_instance_fkey"
DETAIL:  Key (dag_id, task_id, run_id, map_index)=(meltano_distribution, distribution_to_snowflake_task_arn, manual__2023-02-17T19:08:34.015748+00:00, -1) is not present in table "task_instance".

You used to be able to make up a task_id as a hack to save things in the XCom table, but that foreign key constraint must have been added at some point.
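
For what it's worth, pushing under the task's real task_id does satisfy the constraint, since that task instance exists, but per Problem 2 the row is still wiped when the task restarts, so this alone doesn't make reattach work (sketch only):

# Sketch only: the real task_id satisfies xcom_task_instance_fkey because a
# matching task_instance row exists, but the XCom row is still cleared on
# restart (Problem 2).
XCom.set(
    key=self.REATTACH_XCOM_KEY,
    value=self.arn,
    task_id=self.task_id,
    dag_id=self.dag_id,
    run_id=context["ti"].run_id,
)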

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

deanmorin added the area:providers and kind:bug labels on Feb 17, 2023
@Taragolis
Contributor

#29447

Taragolis added the provider:amazon label on Feb 17, 2023
Taragolis linked pull request #29447 on Feb 18, 2023 (will close this issue)
Taragolis self-assigned this on Feb 18, 2023