Commit ccb5f8b

Update incremental strategies per dbt-labs/dbt-spark#141
1 parent 2f3f54a commit ccb5f8b

3 files changed: +82 −12 lines changed

website/docs/docs/building-a-dbt-project/building-models/configuring-incremental-models.md

+1 −1
@@ -156,7 +156,7 @@ the reliability of your `unique_key`, or the availability of certain features.
 
 * [Snowflake](snowflake-configs#merge-behavior-incremental-models): `merge` (default), `delete+insert` (optional)
 * [BigQuery](bigquery-configs#merge-behavior-incremental-models): `merge` (default), `insert_overwrite` (optional)
-* [Spark](spark-configs#incremental-models): `insert_overwrite` (default), `merge` (optional, Delta-only)
+* [Spark](spark-configs#incremental-models): `append` (default), `insert_overwrite` (optional), `merge` (optional, Delta-only)
 
 ### Configuring incremental strategy
 
website/docs/reference/resource-configs/spark-configs.md

+81 −11
@@ -15,7 +15,7 @@ To-do:
 
 ## Configuring tables
 
-When materializing a model as `table`, you may include several optional configs:
+When materializing a model as `table`, you may include several optional configs that are specific to the dbt-spark plugin, in addition to the standard [model configs](model-configs).
 
 | Option | Description | Required? | Example |
 |---------|----------------------------------------------------|-------------------------|--------------------------|
@@ -27,15 +27,86 @@ When materializing a model as `table`, you may include several optional configs:
 
 ## Incremental models
 
-The [`incremental_strategy` config](configuring-incremental-models#what-is-an-incremental_strategy) controls how dbt builds incremental models, and it can be set to one of two values:
-- `insert_overwrite` (default)
-- `merge` (Delta Lake only)
+<Changelog>
+
+- `dbt-spark==0.19.0`: Added the `append` strategy as default for all platforms, file types, and connection methods.
+
+</Changelog>
+
+dbt seeks to offer useful, intuitive modeling abstractions by means of its built-in configurations and materializations. Because there is so much variance between Apache Spark clusters out in the world—not to mention the powerful features offered to Databricks users by the Delta file format and custom runtime—making sense of all the available options is an undertaking in its own right.
+
+For that reason, the dbt-spark plugin leans heavily on the [`incremental_strategy` config](configuring-incremental-models#what-is-an-incremental_strategy). This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of three values:
+- **`append`** (default): Insert new records without updating or overwriting any existing data.
+- **`insert_overwrite`**: If `partition_by` is specified, overwrite partitions in the table with new data. If no `partition_by` is specified, overwrite the entire table with new data.
+- **`merge`** (Delta Lake only): Match records based on a `unique_key`; update old records, insert new ones. (If no `unique_key` is specified, all new data is inserted, similar to `append`.)
+
+Each of these strategies has its pros and cons, which we'll discuss below. As with any model config, `incremental_strategy` may be specified in `dbt_project.yml` or within a model file's `config()` block.
+
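For example, a minimal sketch of the project-file form (the `my_project` and `events` names below are hypothetical, not taken from this commit):

```yaml
# dbt_project.yml
models:
  my_project:          # hypothetical project name
    events:            # hypothetical model folder
      +materialized: incremental
      +incremental_strategy: merge
      +file_format: delta
```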
+### The `append` strategy
+
+Following the `append` strategy, dbt will perform an `insert into` statement with all new data. The appeal of this strategy is that it is straightforward and functional across all platforms, file types, connection methods, and Apache Spark versions. However, this strategy _cannot_ update, overwrite, or delete existing data, so it is likely to insert duplicate records for many data sources.
+
+Specifying `append` as the incremental strategy is optional, since it's the default strategy used when none is specified.
+
+<Tabs
+  defaultValue="source"
+  values={[
+    { label: 'Source code', value: 'source', },
+    { label: 'Run code', value: 'run', },
+  ]
+}>
+<TabItem value="source">
+
+<File name='spark_incremental.sql'>
+
+```sql
+{{ config(
+    materialized='incremental',
+    incremental_strategy='append',
+) }}
+
+-- All rows returned by this query will be appended to the existing table
+
+select * from {{ ref('events') }}
+{% if is_incremental() %}
+  where event_ts > (select max(event_ts) from {{ this }})
+{% endif %}
+```
+</File>
+</TabItem>
+<TabItem value="run">
+
+<File name='spark_incremental.sql'>
+
+```sql
+create temporary view spark_incremental__dbt_tmp as
+
+    select * from analytics.events
+
+    where event_ts > (select max(event_ts) from analytics.spark_incremental)
+
+;
+
+insert into table analytics.spark_incremental
+    select * from spark_incremental__dbt_tmp
+```
+
+</File>
+</TabItem>
+</Tabs>
 
 ### The `insert_overwrite` strategy
 
-Apache Spark does not natively support `delete`, `update`, or `merge` statements. As such, Spark's default incremental behavior is different [from the standard](configuring-incremental-models).
+This strategy is most effective when specified alongside a `partition_by` clause in your model config. dbt will run an [atomic `insert overwrite` statement](https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html) that dynamically replaces all partitions included in your query. Be sure to re-select _all_ of the relevant data for a partition when using this incremental strategy.
+
+If no `partition_by` is specified, then the `insert_overwrite` strategy will atomically replace all contents of the table, overriding all existing data with only the new records. The column schema of the table remains the same, however. This can be desirable in some limited circumstances, since it minimizes downtime while the table contents are overwritten. The operation is comparable to running `truncate` + `insert` on other databases. For atomic replacement of Delta-formatted tables, use the `table` materialization (which runs `create or replace`) instead.
+
+**Usage notes:**
+- This strategy is not supported for tables with `file_format: delta`.
+- This strategy is not available when connecting via Databricks SQL endpoints (`method: odbc` + `endpoint`).
+- If connecting via a Databricks cluster + ODBC driver (`method: odbc` + `cluster`), you **must** include `set spark.sql.sources.partitionOverwriteMode DYNAMIC` in the [cluster Spark Config](https://docs.databricks.com/clusters/configure.html#spark-config) in order for dynamic partition replacement to work (`incremental_strategy: insert_overwrite` + `partition_by`).
 
-To use incremental models, specify a `partition_by` clause in your model config. dbt will run an [atomic `insert overwrite` statement](https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html) that dynamically replaces all partitions included in your query. Be sure to re-select _all_ of the relevant data for a partition when using this incremental strategy.
+<Lightbox src="/img/reference/databricks-cluster-sparkconfig-partition-overwrite.png" title="Databricks cluster: Spark Config" />
 
 <Tabs
   defaultValue="source"
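As a rough sketch of how this strategy is typically configured (the `date_day` and `users` columns and the one-day reprocessing window below are illustrative assumptions, not taken from this commit), an `insert_overwrite` model with `partition_by` might look like:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['date_day'],
    file_format='parquet'
) }}

-- Every partition returned by this query is replaced wholesale on each run,
-- so re-select all of the rows that belong to those partitions

select
    date_day,
    count(*) as users

from {{ ref('events') }}

{% if is_incremental() %}
  -- on incremental runs, only reprocess recent partitions
  where date_day >= date_sub(current_date(), 1)
{% endif %}

group by date_day
```

Because dynamic partition overwrite leaves untouched any partition the query does not return, the filter still selects whole days rather than individual new rows.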
@@ -123,12 +194,11 @@ This functionality is new in dbt-spark v0.15.3. See [installation instructions](
 
 :::
 
-There are three prerequisites for the `merge` incremental strategy:
-- Creating the table in Delta file format
-- Using Databricks Runtime 5.1 and above
-- Specifying a `unique_key`
+**Usage notes:** The `merge` incremental strategy requires:
+- `file_format: delta`
+- Databricks Runtime 5.1 and above
 
-dbt will run an [atomic `merge` statement](https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html) which looks nearly identical to the default merge behavior on Snowflake and BigQuery.
+dbt will run an [atomic `merge` statement](https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html) which looks nearly identical to the default merge behavior on Snowflake and BigQuery. If a `unique_key` is specified (recommended), dbt will update old records with values from new records that match on the key column. If a `unique_key` is not specified, dbt will forgo match criteria and simply insert all new records (similar to `append` strategy).
 
 <Tabs
   defaultValue="source"
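As a rough sketch of the shape of that `merge` statement (reusing the `analytics.spark_incremental` table and `spark_incremental__dbt_tmp` view from the `append` example above, with a hypothetical `event_id` as the `unique_key`; the exact SQL dbt generates may differ):

```sql
merge into analytics.spark_incremental as dbt_internal_dest
using spark_incremental__dbt_tmp as dbt_internal_source
on dbt_internal_source.event_id = dbt_internal_dest.event_id
-- rows that match on the unique_key are updated in place...
when matched then update set *
-- ...and rows with no match are inserted as new records
when not matched then insert *
```

If no `unique_key` is configured, there is no match condition, so every new record falls through to the insert clause, as described above.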