Repo for Foundations of Data Science for Everyone, a class taught at Lincoln University and the University of Delaware
This course teaches the basics of data-driven research. Students will acquire basic computational skills and basic knowledge of statistical analysis and error analysis, become familiar with good practices for handling small and big data, and learn the basics of machine learning. After this class, students should be able to formulate a question, find appropriate data to answer it, prepare and analyze the data, get an answer, and understand the confidence level of that answer. The course is organized in a modular fashion, with labs and projects assigned to students for group work.
The flow chart of a data-driven project from idea to dissemination; the concepts of falsifiability, reproducibility, and open science; the importance of version control
Lab: setting up GitHub repositories, making Jupyter notebooks (on the free Colab platform)
Data types, missing data, censored data, organization of data in tables, data hygiene
Lab: acquiring and preparing data (CSV, TSV, downloadable ASCII files, basic SQL, APIs) in pandas: reading data collections from CSV files into data frames, selecting columns, selecting rows, merging data frames built from different files
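As a rough illustration of the pandas steps named in this lab, the sketch below reads two small tables, selects columns and rows, and merges them. The table contents, column names (`station_id`, `duration_min`, and so on), and the use of `io.StringIO` in place of real downloaded files are all assumptions made for the example, not course data.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two downloaded CSV files (not course data).
stations_csv = io.StringIO(
    "station_id,name,borough\n"
    "1,Main St,Manhattan\n"
    "2,Oak Ave,Brooklyn\n"
    "3,Pine Rd,Queens\n"
)
trips_csv = io.StringIO(
    "trip_id,station_id,duration_min\n"
    "100,1,12.5\n"
    "101,2,\n"          # missing duration, read as NaN
    "102,1,7.0\n"
    "103,3,30.2\n"
)

# Read each file into a data frame.
stations = pd.read_csv(stations_csv)
trips = pd.read_csv(trips_csv)

# Select columns by name.
durations = trips[["station_id", "duration_min"]]

# Select rows: drop missing durations, then keep trips longer than 10 minutes.
valid = durations.dropna(subset=["duration_min"])
long_trips = valid[valid["duration_min"] > 10]

# Merge the two data frames on their shared key.
merged = long_trips.merge(stations, on="station_id", how="left")
print(merged)
```

With real files, `pd.read_csv` would take a path or URL instead of the in-memory buffers used here; passing `sep="\t"` handles TSV files.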
Inference from plots: plotting histograms and scatter plots; data types, including ordinal, continuous, and categorical data; visual inspection of correlation between variables
Lab: reading and cleaning data (Citibikes, Pluto, Census)
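A minimal sketch of the two plot types mentioned above, assuming matplotlib is available (as it is by default on Colab). The data are synthetic, generated here only so that the example is self-contained; the variable names and the linear relation between `x` and `y` are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=10.0, scale=2.0, size=500)
y = 3.0 * x + rng.normal(scale=4.0, size=500)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(9, 4))

# Histogram: shows the distribution of a single continuous variable.
ax_hist.hist(x, bins=20)
ax_hist.set_xlabel("x")
ax_hist.set_ylabel("count")

# Scatter plot: shows how two variables vary together.
ax_scatter.scatter(x, y, s=8, alpha=0.5)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

# Pearson correlation coefficient as a numeric companion to the visual check.
r = np.corrcoef(x, y)[0, 1]
ax_scatter.set_title(f"Pearson r = {r:.2f}")

fig.tight_layout()
# In a notebook the figure is displayed when the cell finishes running.
```

The number of bins is a free choice here; different binning can change the visual impression of the same data.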
p-value, chi-square, z-test
Lab: basic statistics on Pluto, Census, and Citibikes data: moment extraction, deviations from Gaussian and Poisson distributions, histograms, proper binning
PDF/CDF, data dredging, error analysis, testing models (KS, Anderson-Darling, KL divergence), goodness of fit
Lab: creating and testing simple distribution models in NumPy
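The sketch below illustrates two of the tests listed above, a two-sided z-test and a Kolmogorov-Smirnov (KS) test, on a synthetic sample drawn with NumPy. It assumes SciPy is installed and uses `scipy.stats` for the Gaussian tail probability and the KS test; the sample, the null-hypothesis mean `mu0`, and the known standard deviation `sigma` are all made up for the example.

```python
import numpy as np
from scipy import stats

# Synthetic sample (assumption): the true mean is 5.5, not the 5.0 stated by the null.
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=5.5, scale=1.0, size=200)

# Null hypothesis: the data come from a Gaussian with mean mu0 and known sigma.
mu0, sigma = 5.0, 1.0

# z-test: distance of the sample mean from mu0 in units of the standard error.
z = (sample.mean() - mu0) / (sigma / np.sqrt(sample.size))
p_z = 2 * stats.norm.sf(abs(z))  # two-sided p-value

# KS test: compares the empirical CDF with the CDF of the null model.
ks_stat, p_ks = stats.kstest(sample, "norm", args=(mu0, sigma))

print(f"z = {z:.2f}, p-value = {p_z:.3g}")
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_ks:.3g}")
```

A small p-value means the sample would be unlikely under the null model; it does not by itself say which alternative model is correct.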
Bayesian vs. frequentist statistics: prior, likelihood, posterior
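To make the relation between prior, likelihood, and posterior concrete, here is a minimal grid-based sketch for a coin-flip problem. The data (7 heads in 10 flips), the flat prior, and the grid resolution are assumptions chosen for illustration; the frequentist point estimate is shown alongside for comparison.

```python
import numpy as np

# Grid of candidate values for the probability of heads.
theta = np.linspace(0.0, 1.0, 1001)

# Prior: flat over the grid (an assumption), normalized to sum to 1.
prior = np.ones_like(theta)
prior /= prior.sum()

# Hypothetical data: 7 heads in 10 flips.
heads, flips = 7, 10

# Likelihood of the data for each candidate theta (binomial, up to a constant).
likelihood = theta**heads * (1.0 - theta) ** (flips - heads)

# Bayes' rule: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

# Summaries: Bayesian posterior mode and mean, frequentist maximum likelihood.
map_estimate = theta[np.argmax(posterior)]
posterior_mean = np.sum(theta * posterior)
mle = heads / flips

print(f"MAP = {map_estimate:.3f}, posterior mean = {posterior_mean:.3f}, MLE = {mle:.3f}")
```

With a flat prior the posterior mode coincides with the maximum-likelihood estimate; replacing `prior` with a non-uniform array shows how prior knowledge shifts the posterior.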