@@ -1,22 +1,17 @@
# Basic Normalization

-At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred in Airbyte's dialect as "Source" and "Destination".
+## High-Level Overview

-However, this is actually producing a table in the destination with a JSON blob column... For the typical analytics use case, you probably want this json blob normalized so that each field is its own column.
-
-So, after EL, comes the T \(transformation\) and the first T step that Airbyte actually applies on top of the extracted data is called "Normalization".
-
-Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.
-
-To summarize, we can represent the ELT process in the diagram below. These are steps that happens between your "Source Database or API" and the final "Replicated Tables" with examples of implementation underneath:
+{% hint style="info" %}
+The high-level overview contains all the information you need to use Basic Normalization when pulling from APIs. The sections past it can be read for advanced or educational purposes.
+{% endhint %}

-
+When you run your first Airbyte sync without basic normalization, you'll notice that your data gets written to your destination as a single column containing a JSON blob with all of your data. This is the `_airbyte_raw_` table that you may have seen before. Why do we create this table? A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages so that the raw data is always accessible. If an unmodified version of the
+data exists in the destination, it can be retransformed without needing to sync data again.
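+
+As a minimal sketch, assuming a Postgres destination and a hypothetical stream named `users`, the raw table looks something like this:
+
+```sql
+-- One row per synced record; the record itself lives untouched in _airbyte_data.
+CREATE TABLE _airbyte_raw_users (
+  _airbyte_ab_id      VARCHAR,   -- unique id Airbyte attaches to each record
+  _airbyte_data       JSONB,     -- the raw record as a JSON blob
+  _airbyte_emitted_at TIMESTAMP  -- when Airbyte wrote the record
+);
+```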

-In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:
-- Airbyte base-normalization python package to generate dbt SQL models files
-- dbt to compile and executes the models on top of the data in the destinations that supports it.
+If you have Basic Normalization enabled, Airbyte automatically uses this JSON blob to create a schema and tables for your data, converting it to the native format of your destination. This runs after your sync and may take a long time if you synced a large amount of data. If you don't enable Basic Normalization, you'll have to transform the JSON data from that column yourself, as sketched below.
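+
+For instance \(again assuming a hypothetical `users` stream with an `email` field\):
+
+```sql
+-- Without Basic Normalization: pull fields out of the JSON blob yourself.
+SELECT _airbyte_data ->> 'email' AS email FROM _airbyte_raw_users;
+
+-- With Basic Normalization: query the typed table named after the stream.
+SELECT email FROM users;
+```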

-## Overview
+## Example

Basic Normalization uses a fixed set of rules to map a JSON object from a source to the types and format that are native to the destination. For example, if a source emits data that looks like this:

@@ -50,6 +45,24 @@ The [normalization rules](basic-normalization.md#Rules) are _not_ configurable.

Airbyte places the JSON blob version of your data in a table called `_airbyte_raw_<stream name>`. If basic normalization is turned on, it will place a separate copy of the data in a table called `<stream name>`. Under the hood, Airbyte is using dbt, which means that the data only ingresses into the data store one time. The normalization happens as a query within the datastore. This implementation avoids extra network time and costs.
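+
+Conceptually, the normalization step amounts to a single in-database query per stream, along these lines \(a simplification, using the hypothetical `users` stream again\):
+
+```sql
+-- The typed table is derived from the raw table without the data
+-- ever leaving the warehouse.
+CREATE TABLE users AS
+SELECT
+  _airbyte_data ->> 'id'    AS id,
+  _airbyte_data ->> 'email' AS email,
+  _airbyte_emitted_at
+FROM _airbyte_raw_users;
+```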

+## Why does Airbyte have Basic Normalization?
+
+At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".
+
+However, the EL steps alone produce a table in the destination with a single JSON blob column. For the typical analytics use case, you probably want this JSON blob normalized so that each field is its own column.
+
+So, after EL comes the T \(transformation\), and the first T step that Airbyte applies on top of the extracted data is called "Normalization".
+
+Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.
+
+To summarize, we can represent the ELT process in the diagram below. These are the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementations underneath:
+
+
+
+In Airbyte, the current normalization option is implemented using a dbt Transformer composed of \(a sketch of a generated model follows this list\):
+- the Airbyte base-normalization Python package, which generates the dbt SQL model files
+- dbt, which compiles and executes the models on top of the data in destinations that support it
+
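+A rough illustration of the kind of model base-normalization generates, heavily simplified \(the real generated SQL also handles nested objects, type casts, and naming collisions; the `users` stream, its columns, and the `airbyte_raw` source name are hypothetical\):
+
+```sql
+-- models/users.sql: build the typed table from the raw JSON blob.
+select
+  _airbyte_data ->> 'id'    as id,
+  _airbyte_data ->> 'email' as email,
+  _airbyte_emitted_at
+from {{ source('airbyte_raw', '_airbyte_raw_users') }}
+```
+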
## Destinations that Support Basic Normalization

* [BigQuery](../integrations/destinations/bigquery.md)