
Commit 6b19bf4

Abhi Vaidyanatha (avaidyanatha) and a co-author authored:

Add high level overview to normalization doc. (#6445)

* Add high level overview to normalization
* Address review comments

Co-authored-by: Abhi Vaidyanatha <[email protected]>

1 parent 911998b · commit 6b19bf4

File tree

1 file changed: +26 −13 lines changed


docs/understanding-airbyte/basic-normalization.md

Lines changed: 26 additions & 13 deletions
```diff
@@ -1,22 +1,17 @@
 # Basic Normalization
 
-At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred in Airbyte's dialect as "Source" and "Destination".
+## High-Level Overview
 
-However, this is actually producing a table in the destination with a JSON blob column... For the typical analytics use case, you probably want this json blob normalized so that each field is its own column.
-
-So, after EL, comes the T \(transformation\) and the first T step that Airbyte actually applies on top of the extracted data is called "Normalization".
-
-Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.
-
-To summarize, we can represent the ELT process in the diagram below. These are steps that happens between your "Source Database or API" and the final "Replicated Tables" with examples of implementation underneath:
+{% hint style="info" %}
+The high-level overview contains all the information you need to use Basic Normalization when pulling from APIs. Everything past that can be read for advanced or educational purposes.
+{% endhint %}
 
-![](../.gitbook/assets/connecting-EL-with-T-4.png)
+When you run your first Airbyte sync without basic normalization, you'll notice that your data gets written to your destination as one column holding a JSON blob that contains all of your data. This is the `_airbyte_raw_` table that you may have seen before. Why do we create this table? A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages, so that the raw data is always accessible. If an unmodified version of the
+data exists in the destination, it can be retransformed without needing to sync data again.
 
-In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:
-- Airbyte base-normalization python package to generate dbt SQL models files
-- dbt to compile and executes the models on top of the data in the destinations that supports it.
+If you have Basic Normalization enabled, Airbyte automatically uses this JSON blob to create a schema and tables from your data, converting it to the format of your destination. This runs after your sync and may take a long time if you have synced a large amount of data. If you don't enable Basic Normalization, you'll have to transform the JSON data from that column yourself.
 
-## Overview
+## Example
 
 Basic Normalization uses a fixed set of rules to map a json object from a source to the types and format that are native to the destination. For example if a source emits data that looks like this:
 
```
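The hunk above describes records landing in an `_airbyte_raw_` table as a single JSON blob column that normalization then expands into one column per field. A minimal sketch of that expansion, with all field names and values purely hypothetical (not taken from the commit):

```python
import json

# Hypothetical row from an `_airbyte_raw_<stream>` table: the whole source
# record sits in one JSON blob column (metadata column name is illustrative).
raw_row = {
    "_airbyte_ab_id": "example-id",
    "_airbyte_data": '{"id": 1, "name": "Ada", "signup": "2021-09-01"}',
}

def normalize(row):
    """Sketch of normalization: expand the JSON blob into per-field columns."""
    return dict(json.loads(row["_airbyte_data"]))

print(normalize(raw_row))
# prints {'id': 1, 'name': 'Ada', 'signup': '2021-09-01'}
```

The real implementation does considerably more (type coercion, nested objects, destination-specific naming), but the input/output shape is the point here.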

```diff
@@ -50,6 +45,24 @@ The [normalization rules](basic-normalization.md#Rules) are _not_ configurable.
 
 Airbyte places the json blob version of your data in a table called `_airbyte_raw_<stream name>`. If basic normalization is turned on, it will place a separate copy of the data in a table called `<stream name>`. Under the hood, Airbyte is using dbt, which means that the data only ingresses into the data store one time. The normalization happens as a query within the datastore. This implementation avoids extra network time and costs.
 
+## Why does Airbyte have Basic Normalization?
+
+At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".
+
+However, this actually produces a table in the destination with a JSON blob column. For the typical analytics use case, you probably want this JSON blob normalized so that each field is its own column.
+
+So, after EL comes the T \(transformation\), and the first T step that Airbyte applies on top of the extracted data is called "Normalization".
+
+Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.
+
+To summarize, we can represent the ELT process in the diagram below. These are the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementation underneath:
+
+![](../.gitbook/assets/connecting-EL-with-T-4.png)
+
+In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:
+- the Airbyte base-normalization Python package, which generates dbt SQL model files
+- dbt, which compiles and executes the models on top of the data in the destinations that support it.
+
 ## Destinations that Support Basic Normalization
 
 * [BigQuery](../integrations/destinations/bigquery.md)
```
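The hunk above notes that, because dbt is used, "the normalization happens as a query within the datastore" and the data ingresses only once. A self-contained sketch of that idea using SQLite's JSON1 functions (table and field names are hypothetical; the real dbt-generated SQL is destination-specific and far more involved):

```python
import sqlite3

# Raw data lands once, as a JSON blob column; the "transform" is then just a
# SQL statement executed inside the datastore, with no further data movement.
# Assumes a SQLite build with the JSON1 functions (true of most modern Python
# distributions).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE _airbyte_raw_users (_airbyte_data TEXT)")
con.execute(
    "INSERT INTO _airbyte_raw_users VALUES (?)",
    ('{"id": 1, "name": "Ada"}',),
)

# A dbt-style model: a query that selects typed columns out of the blob and
# materializes them as the normalized table.
con.execute(
    """
    CREATE TABLE users AS
    SELECT
        json_extract(_airbyte_data, '$.id')   AS id,
        json_extract(_airbyte_data, '$.name') AS name
    FROM _airbyte_raw_users
    """
)
print(con.execute("SELECT id, name FROM users").fetchall())
# prints [(1, 'Ada')]
```

Because the raw table is untouched, the `users` table can be dropped and rebuilt with different rules at any time without re-syncing from the source, which is exactly the ELT property the doc describes.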

0 commit comments
