Commit c8fecae

committed
added
1 parent c02ad1d commit c8fecae

File tree

1 file changed

+53
-0
lines changed

assignment/assignment.qmd

+53
@@ -0,0 +1,53 @@
---
title: "Create a Data Cleaning System"
format: pdf
---

### Task

In this assignment you will build a small system that cleans data and produces statistics
automatically.

Each group will have its own synthetic data set.

The data set contains financial and economic data on companies. We are
interested in publishing the average turnover from industrial activities,
turnover from trade, and operating result, by 3-digit NACE code (the first three
digits of the NACE variable). You can use, for example, `dplyr::summarise` to compute the
results.
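
The aggregation described above could be sketched as follows. This is a minimal illustration, not the required solution: the toy data and all column names (`nace`, `turnover_industrial`, `turnover_trade`, `operating_result`) are assumptions, since the real variable names depend on your group's data set.

```r
library(dplyr)

# Toy data. All column names are assumptions made for this sketch;
# replace them with the names used in your own data set.
companies <- tibble(
  nace                = c("10110", "10120", "10200", "46110", "46120"),
  turnover_industrial = c(100, 300, 50, NA, 20),
  turnover_trade      = c(10, 30, 5, 400, 600),
  operating_result    = c(5, 15, -2, 40, 60)
)

estimates <- companies %>%
  # Keep NACE as character so that leading zeros are not lost.
  mutate(nace3 = substr(as.character(nace), 1, 3)) %>%
  group_by(nace3) %>%
  summarise(
    # Mean of each target variable; missing values are simply dropped here.
    across(
      c(turnover_industrial, turnover_trade, operating_result),
      ~ mean(.x, na.rm = TRUE)
    ),
    n = n(),
    .groups = "drop"
  )

print(estimates)
```

Note that `na.rm = TRUE` silently ignores missing values, so a first estimate computed this way on raw data may be biased. Rerunning the same code after each cleaning step shows how the estimates change.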

You will build a set of scripts that clean the raw data and estimate the desired quantities.

In the afternoon each group will share its results by presenting its system.
This may be a PowerPoint presentation, but you can also just show the scripts and
talk through them.

### Tips

- Look at the data. Make plots, and discuss with each other what could be wrong with it and how it might be solved. Don't try to solve everything at once. Solve one problem at a time.
- Start by computing the estimates. They will be way off, but it is good to have a first result and to see the effect of updates to the data cleaning process.
- It is better to have simple code that runs than complicated code that doesn't. Start small, make sure it runs, and then expand.
- Use one script for each step in the statistical value chain. Each script reads an input, does something to the data, and writes an output.
- Define rules to check the quality of the data. Here too, work iteratively: start with a few rules and extend the rule set as you go.
- Iterate often, and view the data and the results often.
- Make plots of the data: are there outliers? Also think about ratios between variables.
- Impute the missing values. Try a few models.

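
The tips on one script per step, quality rules, and imputation could be combined roughly as follows. This is a minimal sketch in base R under stated assumptions: the column name `turnover_trade`, the single non-negativity rule, the median imputation, and the temporary file names are all hypothetical choices for illustration, not part of the assignment.

```r
# Hypothetical step script: read an input, check rules, impute, write an output.

check_rules <- function(d) {
  # One logical column per rule: TRUE = pass, FALSE = fail, NA = cannot be checked.
  data.frame(
    turnover_trade_nonneg = d$turnover_trade >= 0
  )
}

count_failures <- function(checks) {
  # Number of failing records per rule; NA results are not counted as failures.
  sapply(checks, function(x) sum(!x, na.rm = TRUE))
}

impute_median <- function(x) {
  # Simplest baseline model: replace missing values by the observed median.
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

run_step <- function(infile, outfile) {
  d <- read.csv(infile)
  failures <- count_failures(check_rules(d))
  d$turnover_trade <- impute_median(d$turnover_trade)
  write.csv(d, outfile, row.names = FALSE)
  failures
}

# Small demonstration with made-up data written to temporary files.
raw <- data.frame(id = 1:4, turnover_trade = c(10, -5, NA, 30))
infile <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
write.csv(raw, infile, row.names = FALSE)

failures <- run_step(infile, outfile)
cleaned <- read.csv(outfile)
print(failures)
print(cleaned)
```

Because each step only communicates through files, a step can be rerun or replaced without touching the others. Median imputation is only a starting point; a ratio or regression model may fit better once you have looked at the relations between variables.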
### The data

The financial variables have to satisfy the following balance restrictions. You can add extra
restrictions as you see fit.
\begin{center}
\includegraphics[width=\textwidth]{balances.pdf}
\end{center}

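
A balance restriction of the form "total equals the sum of its parts" could be checked along these lines. The rule and the column names below are purely hypothetical and chosen for illustration; the actual restrictions are the ones shown in `balances.pdf`.

```r
# Check a linear balance rule: total == sum of parts, up to a tolerance.
# Returns TRUE (satisfied), FALSE (violated) or NA (not checkable due to missings).
check_balance <- function(total, parts, tol = 1e-8) {
  unname(abs(total - rowSums(parts)) <= tol)
}

# Made-up records; the variable names are assumptions, not the real data set's.
d <- data.frame(
  turnover_industrial = c(100, 50, 20),
  turnover_trade      = c(10, 5, NA),
  turnover_total      = c(110, 60, 25)
)

ok <- check_balance(
  d$turnover_total,
  d[c("turnover_industrial", "turnover_trade")]
)
print(ok)
```

Keeping `NA` separate from `FALSE` distinguishes records that violate a rule from records that cannot be checked yet, which helps when deciding whether to correct a value or to impute one.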
