Skip to content

Commit cc54a19

Browse files
Add basic dag-bundles docs; webserver->api server in security model (#48600)
1 parent 8a784a7 commit cc54a19

File tree

4 files changed

+160
-28
lines changed

4 files changed

+160
-28
lines changed
Lines changed: 129 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,129 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
or more contributor license agreements. See the NOTICE file
3+
distributed with this work for additional information
4+
regarding copyright ownership. The ASF licenses this file
5+
to you under the Apache License, Version 2.0 (the
6+
"License"); you may not use this file except in compliance
7+
with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
software distributed under the License is distributed on an
13+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
KIND, either express or implied. See the License for the
15+
specific language governing permissions and limitations
16+
under the License.
17+
18+
Dag Bundles
19+
===========
20+
21+
Dag bundle are a collection of dags and other files (think the Airflow 2 dags folder). Unlike Airflow 2, where dags were required to be on local disk and getting
22+
the dags there was the sole responsibility of the deployment manager, Airflow 3 is now able to pull dags from external systems
23+
as well. And since dag bundles support versioning, it also allows Airflow to run a task using a specific version
24+
of the dag bundle, allowing for a dag run to use the same code for the whole run, even if the dag is updated mid way through the run.
25+
26+
What's in a dag bundle? One or more dag files along with their associated files, such as
27+
other Python scripts, configuration files, or other resources. By keeping the bundle at a higher level, it allows for versioning
28+
everything the dag needs to run.
29+
30+
Dag bundles can source the dags from various locations, such as local directories, Git repositories, or other external systems.
31+
Deployment administrators can also write their own dag bundle classes to support custom sources.
32+
You can also define more than 1 dag bundle in an Airflow deployments, allowing for better organization of your dags.
33+
34+
Why Are dag bundles important?
35+
------------------------------
36+
37+
- **Version Control**: By supporting versioning, dag bundles allow dag runs to use the same code for the whole run, even if the dag is updated mid way through the run.
38+
- **Scalability**: With dag bundles, Airflow can efficiently manage large numbers of DAGs by organizing them into logical units.
39+
- **Flexibility**: Dag bundles enable seamless integration with external systems, such as Git repositories, to source dags.
40+
41+
Types of dag bundles
42+
--------------------
43+
Airflow supports multiple types of dag Bundles, each catering to specific use cases:
44+
45+
**airflow.dag_processing.bundles.local.LocalDagBundle**
46+
These bundles reference a local directory containing DAG files. They are ideal for development and testing environments, but do not support versioning of the bundle, meaning tasks always run using the latest code.
47+
48+
**airflow.providers.git.bundles.git.GitDagBundle**
49+
These bundles integrate with Git repositories, allowing Airflow to fetch dags directly from a repository.
50+
51+
Configuring dag bundles
52+
-----------------------
53+
54+
Dag bundles are configured in :ref:`config:dag_processor__dag_bundle_config_list`. You can add one or more dag bundles here.
55+
56+
By default, Airflow adds a local dag bundle, which is the same as the old dags folder. This is done for backwards compatibility, and you can remove it if you do not want to use it. You can also keep it and add other dag bundles, such as a git dag bundle.
57+
58+
For example, adding multiple dag bundles to your ``airflow.cfg`` file:
59+
60+
.. code-block:: ini
61+
62+
[dag_processor]
63+
dag_bundle_config_list = [
64+
{
65+
"name": "my_git_repo",
66+
"classpath": "airflow.dag_processing.bundles.git.GitDagBundle",
67+
"kwargs": {"tracking_ref": "main", "git_conn_id": "my_git_conn"}
68+
}
69+
{
70+
"name": "dags-folder",
71+
"classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
72+
"kwargs": {}
73+
}
74+
]
75+
76+
.. note::
77+
78+
The whitespace, particularly on the last line, is important so a multi-line value works properly. More details can be found in the
79+
the `configparser docs <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure>`_.
80+
81+
You can also override the :ref:`config:dag_processor__refresh_interval` per dag bundle by passing it in kwargs.
82+
This controls how often the dag processor refreshes, or looks for new files, in the dag bundles.
83+
84+
Writing custom dag bundles
85+
--------------------------
86+
87+
When implementing your own dag bundle by extending the ``BaseDagBundle`` class, there are several methods you must implement. Below is a guide to help you implement a custom dag bundle.
88+
89+
Abstract Methods
90+
~~~~~~~~~~~~~~~~
91+
The following methods are abstract and must be implemented in your custom bundle class:
92+
93+
**path**
94+
This property should return a ``Path`` to the directory where the dag files for this bundle are stored.
95+
Airflow uses this property to locate the DAG files for processing.
96+
97+
**get_current_version**
98+
This method should return the current version of the bundle as a string.
99+
Airflow will use pass this version to ``__init__`` later to get this version of the bundle again when it runs tasks.
100+
If versioning is not supported, it should return ``None``.
101+
102+
**refresh**
103+
This method should handle refreshing the bundle's contents from its source (e.g., pulling the latest changes from a remote repository).
104+
This is used by the dag processor periodically to ensure that the bundle is up-to-date.
105+
106+
Optional Methods
107+
~~~~~~~~~~~~~~~~
108+
In addition to the abstract methods, you may choose to override the following methods to customize the behavior of your bundle:
109+
110+
**__init__**
111+
This method can be extended to initialize the bundle with extra parameters, such as ``tracking_ref`` for the ``GitDagBundle``.
112+
It should also call the parent class's ``__init__`` method to ensure proper initialization.
113+
Expensive operations, such as network calls, should be avoided in this method to prevent delays during the bundle's instantiation, and done
114+
in the ``initialize`` method instead.
115+
116+
**initialize**
117+
This method is called before the bundle is first used in the dag processor or worker. It allows you to perform expensive operations only when the bundle's content is accessed.
118+
119+
**view_url**
120+
This method should return a URL as a string to view the bundle on an external system (e.g., a Git repository's web interface).
121+
122+
Other Considerations
123+
~~~~~~~~~~~~~~~~~~~~
124+
125+
- **Versioning**: If your bundle supports versioning, ensure that ``initialize``, ``get_current_version`` and ``refresh`` are implemented to handle version-specific logic.
126+
127+
- **Concurrency**: Workers may create many bundles simultaneously, and does nothing to serialize calls to the bundle objects. Thus, the bundle class must handle locking if
128+
that is problematic for the underlying technology. For example, if you are cloning a git repo, the bundle class is responsible for locking to ensure only 1 bundle
129+
object is cloning at a time. There is a ``lock`` method in the base class that can be used for this purpose, if necessary.

airflow-core/docs/administration-and-deployment/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ This section contains information about deploying dags into production and the a
2828
kubernetes
2929
lineage
3030
listeners
31+
dag-bundles
3132
dag-serialization
3233
modules_management
3334
scheduler

airflow-core/docs/core-concepts/dags.rst

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -147,7 +147,7 @@ Chain can also do *pairwise* dependencies for lists the same size (this is diffe
147147
Loading dags
148148
------------
149149

150-
Airflow loads dags from Python source files, which it looks for inside its configured ``DAG_FOLDER``. It will take each file, execute it, and then load any DAG objects from that file.
150+
Airflow loads dags from Python source files in dag bundles. It will take each file, execute it, and then load any DAG objects from that file.
151151

152152
This means you can define multiple dags per Python file, or even spread one very complex DAG across multiple Python files using imports.
153153

@@ -164,11 +164,11 @@ While both DAG constructors get called when the file is accessed, only ``dag_1``
164164

165165
.. note::
166166

167-
When searching for dags inside the ``DAG_FOLDER``, Airflow only considers Python files that contain the strings ``airflow`` and ``dag`` (case-insensitively) as an optimization.
167+
When searching for dags inside the dag bundle, Airflow only considers Python files that contain the strings ``airflow`` and ``dag`` (case-insensitively) as an optimization.
168168

169169
To consider all Python files instead, disable the ``DAG_DISCOVERY_SAFE_MODE`` configuration flag.
170170

171-
You can also provide an ``.airflowignore`` file inside your ``DAG_FOLDER``, or any of its subfolders, which describes patterns of files for the loader to ignore. It covers the directory it's in plus all subfolders underneath it. See :ref:`.airflowignore <concepts:airflowignore>` below for details of the file syntax.
171+
You can also provide an ``.airflowignore`` file inside your dag bundle, or any of its subfolders, which describes patterns of files for the loader to ignore. It covers the directory it's in plus all subfolders underneath it. See :ref:`.airflowignore <concepts:airflowignore>` below for details of the file syntax.
172172

173173
In the case where the ``.airflowignore`` does not meet your needs and you want a more flexible way to control if a python file needs to be parsed by airflow, you can plug your callable by setting ``might_contain_dag_callable`` in the config file.
174174
Note, this callable will replace the default Airflow heuristic, i.e. checking if the strings ``airflow`` and ``dag`` (case-insensitively) are present in the python file.
@@ -691,7 +691,7 @@ Packaging dags
691691

692692
While simpler dags are usually only in a single Python file, it is not uncommon that more complex dags might be spread across multiple files and have dependencies that should be shipped with them ("vendored").
693693

694-
You can either do this all inside of the ``DAG_FOLDER``, with a standard filesystem layout, or you can package the DAG and all of its Python files up as a single zip file. For instance, you could ship two dags along with a dependency they need as a zip file with the following contents::
694+
You can either do this all inside of the dag bundle, with a standard filesystem layout, or you can package the DAG and all of its Python files up as a single zip file. For instance, you could ship two dags along with a dependency they need as a zip file with the following contents::
695695

696696
my_dag1.py
697697
my_dag2.py
@@ -711,7 +711,7 @@ In general, if you have a complex set of compiled dependencies and modules, you
711711
``.airflowignore``
712712
------------------
713713

714-
An ``.airflowignore`` file specifies the directories or files in ``DAG_FOLDER``
714+
An ``.airflowignore`` file specifies the directories or files in the dag bundle
715715
or ``PLUGINS_FOLDER`` that Airflow should intentionally ignore. Airflow supports
716716
two syntax flavors for patterns in the file, as specified by the ``DAG_IGNORE_FILE_SYNTAX``
717717
configuration parameter (*added in Airflow 2.3*): ``regexp`` and ``glob``.
@@ -740,7 +740,7 @@ match any of the patterns would be ignored (under the hood, ``Pattern.search()``
740740
to match the pattern). Use the ``#`` character to indicate a comment; all characters
741741
on lines starting with ``#`` will be ignored.
742742

743-
The ``.airflowignore`` file should be put in your ``DAG_FOLDER``. For example, you can prepare
743+
The ``.airflowignore`` file should be put in your dag bundle. For example, you can prepare
744744
a ``.airflowignore`` file with the ``glob`` syntax
745745

746746
.. code-block::
@@ -749,12 +749,12 @@ a ``.airflowignore`` file with the ``glob`` syntax
749749
tenant_[0-9]*
750750
751751
Then files like ``project_a_dag_1.py``, ``TESTING_project_a.py``, ``tenant_1.py``,
752-
``project_a/dag_1.py``, and ``tenant_1/dag_1.py`` in your ``DAG_FOLDER`` would be ignored
752+
``project_a/dag_1.py``, and ``tenant_1/dag_1.py`` in your dag bundle would be ignored
753753
(If a directory's name matches any of the patterns, this directory and all its subfolders
754754
would not be scanned by Airflow at all. This improves efficiency of DAG finding).
755755

756756
The scope of a ``.airflowignore`` file is the directory it is in plus all its subfolders.
757-
You can also prepare ``.airflowignore`` file for a subfolder in ``DAG_FOLDER`` and it
757+
You can also prepare ``.airflowignore`` file for a subfolder in your dag bundle and it
758758
would only be applicable for that subfolder.
759759

760760
DAG Dependencies

0 commit comments

Comments
 (0)