Commit b805848

2021/v1.1 (#33)
* Create CHANGELOG for version 2021
* Initial text version for visually impaired users
* Add 2021 updates (see CHANGELOG)
* Include link to text version
* Format markdown
* Format markdown
1 parent e7b19ea commit b805848

6 files changed: +232 -1 lines changed

CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+## Roadmap 2021
+
+### Update 2021-01-15
+
+* Added text version for visually impaired users (issue #10)
+* Math & statistics basics have been added to CS fundamentals (issue #22)
+* Dimensional modelling has been added to Database fundamentals
+* Added section for Object storage (issue #7)
+* Azure CosmosDB has been added to Document databases
+* Apache Impala has been moved from Batch processing to Data Warehouses
+* Azure Synapse Analytics (issue #18) and ClickHouse (issue #24) have been added to Data Warehouses
+* Lambda & Kappa architectures have been added to Cluster computing fundamentals (issue #31)
+* Azure Data Lake has been added to Managed Hadoop
+* Apache NiFi has been added to Hybrid data processing
+* Cloud specific messaging services have been added to Messaging (issue #8)
+* Luigi has been added to Workflow scheduling
+* AWS CDK has replaced AWS CloudFormation in Infrastructure provisioning (issue #4, issue #6)
+* Power BI has been added to data visualisation tools (issue #29)
+* MLflow has been added to Machine Learning Ops (issue #30)
+
+## Roadmap 2020
+
+[Modern Data Engineer Roadmap 2020](https://github.com/datastacktv/data-engineer-roadmap/tree/8b1ccdce4524961bfd37495de20117c47766b1eb)

README.md

Lines changed: 5 additions & 1 deletion
@@ -4,7 +4,7 @@
 > Roadmap to becoming a data engineer in 2021
 
 [![Twitter](https://img.shields.io/badge/-Twitter-1DA1F2)](https://twitter.com/datastacktv)
-[![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](https://www.youtube.com/channel/UCQSbqkMlvf_J949HDWxOt7Q)
+[![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](http://youtube.com/c/datastacktv)
 [![Website](https://img.shields.io/badge/-Website-565CD8)](https://datastack.tv/)
 
 This roadmap aims to give a **complete picture of the modern data engineering landscape** and serve as a **study guide** for aspiring data engineers.
@@ -17,10 +17,14 @@ This roadmap aims to give a **complete picture of the modern data engineering la
 
 ***
 
+> [Text version for visually impaired users](text/roadmap.md)
+
 ![Data Engineer Roadmap](img/roadmap.png)
 
 ## Nice to have 😎
 
+> [Text version for visually impaired users](text/extras.md)
+
 ![Data Engineer Roadmap Extras](img/extras.png)
 
 ## Contributions are welcome 💜

img/extras.png

5.66 KB

img/roadmap.png

79.1 KB

text/extras.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+> Text version for visually impaired users
+
+*Note: Data engineers often work closely with Data scientists, Data analysts and Machine Learning engineers. It’s good to have a basic understanding of the tools they use.*
+
+* Visualise data
+* Tableau [general recommendation]
+* Looker [personal recommendation]
+* Grafana [general recommendation]
+* Jupyter Notebook [general recommendation]
+* Microsoft Power BI
+
+* Machine Learning fundamentals
+* Terminology [general recommendation]
+* Supervised vs unsupervised learning
+* Classification vs regression
+* Evaluation metrics
+* scikit-learn [general recommendation]
+* TensorFlow [personal recommendation]
+* Keras [personal recommendation]
+* PyTorch [general recommendation]
+
+* Machine Learning Ops
+* TensorFlow Extended (TFX) [general recommendation]
+* Kubeflow [personal recommendation]
+* MLflow
+* Amazon SageMaker
+* Google Cloud AI Platform
+
+*Note: Keep learning...*

text/roadmap.md

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+> Text version for visually impaired users
+
+# Data Engineer in 2021
+
+* CS fundamentals
+* Basic terminal usage [general recommendation]
+* Data structures & algorithms [general recommendation]
+* APIs [general recommendation]
+* REST [general recommendation]
+* Structured vs unstructured data [general recommendation]
+* Serialisation
+* Linux [general recommendation]
+* CLI
+* Vim
+* Shell scripting
+* Cronjobs
+* How does the computer work? [general recommendation]
+* How does the Internet work? [general recommendation]
+* Git - Version control [general recommendation]
+* Math & statistics basics [general recommendation]
+
+*Note: Git is used for tracking changes in source code and coordinating work among programmers. In your day-to-day work you will use a hosted Git service such as GitHub, GitLab or Bitbucket.*
+
+* Learn a programming language
+* Python [personal recommendation]
+* Java [general recommendation]
+* Scala
+* Go
+
+*Note: Learn how to write clean, extensible code. Spend some time understanding programming paradigms (functional vs. OOP) and best practices (design patterns, YAGNI, stateful vs stateless applications). Get familiar with an IDE or code editor like VSCode.*
+
+* Testing
+* Unit testing [general recommendation]
+* Integration testing [general recommendation]
+* Functional testing [general recommendation]
+
+* Database fundamentals
+* SQL [general recommendation]
+* Normalisation [general recommendation]
+* ACID transactions [general recommendation]
+* CAP theorem [general recommendation]
+* OLTP vs OLAP [general recommendation]
+* Horizontal vs vertical scaling [general recommendation]
+* Dimensional modelling [general recommendation]
+
+* Relational databases
+* MySQL [general recommendation]
+* PostgreSQL [general recommendation]
+* MariaDB
+* Amazon Aurora
+
+* Non-relational databases
+* Document databases
+* MongoDB [general recommendation]
+* Elasticsearch [general recommendation]
+* Apache CouchDB
+* Azure CosmosDB
+* Wide column databases
+* Apache Cassandra [general recommendation]
+* Apache HBase [general recommendation]
+* Google Cloud Bigtable [personal recommendation]
+* Graph databases
+* Neo4j
+* Amazon Neptune
+* Key-value stores
+* Redis [personal recommendation]
+* Memcached
+* Amazon DynamoDB [general recommendation]
+
+*Note: Understand the difference between Document, Wide column, Graph and Key-value NoSQL databases. We recommend mastering one database from each category.*
+
+* Data warehouses
+* Snowflake [general recommendation]
+* Presto
+* Apache Hive
+* Apache Impala
+* Amazon Redshift [general recommendation]
+* Google BigQuery [personal recommendation]
+* Azure Synapse
+* ClickHouse
+
+* Object storage
+* AWS S3 [general recommendation]
+* Azure Blob Storage
+* Google Cloud Storage
+
+* Cluster computing fundamentals
+* Apache Hadoop [general recommendation]
+* HDFS [general recommendation]
+* MapReduce [general recommendation]
+* Lambda & Kappa architectures
+* Managed Hadoop [general recommendation]
+* Amazon EMR
+* Google Dataproc
+* Azure Data Lake
+
+*Note: Most modern data processing frameworks are based on Apache Hadoop and MapReduce to some extent. Understanding these concepts can help you learn modern data processing frameworks much quicker.*
+
+* Data processing
+* Batch
+* Apache Pig [general recommendation]
+* Apache Arrow
+* data build tool [personal recommendation]
+* Hybrid
+* Apache Spark [general recommendation]
+* Apache Beam [personal recommendation]
+* Apache Flink [general recommendation]
+* Apache NiFi
+* Streaming
+* Apache Kafka [personal recommendation]
+* Apache Storm [general recommendation]
+* Apache Samza
+* Amazon Kinesis
+
+*Note: Hybrid frameworks are able to process both batch and streaming data. Batch data processing is often done by analytical data warehouse applications. See the Data warehouses section for more.*
+
+* Messaging
+* RabbitMQ [general recommendation]
+* Apache ActiveMQ
+* Amazon SNS & SQS
+* Google Pub/Sub
+* Azure Service Bus
+
+* Workflow scheduling
+* Apache Airflow [personal recommendation]
+* Google Cloud Composer
+* Apache Oozie
+* Luigi
+
+*Note: Cloud Composer is a managed Apache Airflow service on Google Cloud Platform.*
+
+* Monitoring data pipelines
+* Prometheus [general recommendation]
+* Datadog [general recommendation]
+* Sentry [general recommendation]
+* StatsD
+
+* Networking
+* Protocols [general recommendation]
+* HTTP / HTTPS
+* TCP
+* SSH
+* IP
+* DNS
+* Firewalls [general recommendation]
+* VPN [general recommendation]
+* VPC [general recommendation]
+
+* Infrastructure as Code
+* Containers
+* Docker [personal recommendation]
+* LXC
+* Container orchestration
+* Kubernetes [general recommendation]
+* Docker Swarm
+* Apache Mesos
+* Google Kubernetes Engine (GKE) [general recommendation]
+* Infrastructure provisioning
+* Terraform [personal recommendation]
+* Pulumi
+* AWS CDK [general recommendation]
+
+* CI/CD
+* GitHub Actions [general recommendation]
+* Jenkins [general recommendation]
+
+* Identity and access management
+* Active Directory [general recommendation]
+* Azure Active Directory
+
+* Data security & privacy
+* Legal compliance [general recommendation]
+* Encryption [general recommendation]
+* Key management [general recommendation]
+* Data governance & integrity
