Description
Pre-2015 we flattened the JSON events into a pre-defined schema and appended them to a global "timeline" table. This worked, but it had a few issues:
- The GH event format changed over time, whereas our schema stayed the same
- New fields were omitted to preserve backwards compatibility with older data
- One big table was convenient to query, but became expensive to scan as the dataset grew
Post-2015 we moved to a different format that keeps a minimal set of well-defined fields as separate columns and stashes the rest into a JSON payload:
- see https://www.githubarchive.org/#bigquery for more details.
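For illustration, a hypothetical row under that scheme (the field names follow the BigQuery schema linked above, but the values here are made up):

```python
# Hypothetical post-2015 row: a handful of stable, typed columns, with the
# format-dependent remainder packed into a single JSON string.
event_row = {
    "id": "2489651045",
    "type": "PushEvent",
    "actor": {"id": 9152315, "login": "example-user"},
    "repo": {"id": 28688495, "name": "example/repo"},
    "created_at": "2015-01-01T15:00:00Z",
    "public": True,
    "payload": '{"push_id": 536740396, "size": 1, "commits": []}',
}
```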
Ideally, we should reprocess the pre-2015 events to follow the same pattern. The rough work involved:
1. Download the raw pre-2015 hourly gzips (see the download sketch after this list)
2. Reformat each archive to use the new event format with fixed fields and a JSON payload
3. Import the reformatted data into separate daily tables
4. Create appropriate rollups to simplify querying
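A minimal sketch of step (1), assuming the hourly archives follow the naming convention documented on githubarchive.org (YYYY-MM-DD-H.json.gz, hours 0-23); worth verifying against the site before running at scale:

```python
import datetime
import urllib.request

BASE = "http://data.githubarchive.org"

def download_day(day: datetime.date, dest_dir: str = ".") -> None:
    # Fetch all 24 hourly gzip archives for a given day.
    for hour in range(24):
        name = f"{day:%Y-%m-%d}-{hour}.json.gz"
        urllib.request.urlretrieve(f"{BASE}/{name}", f"{dest_dir}/{name}")

download_day(datetime.date(2014, 1, 1))
```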
In other words, an ETL, with the bulk of the work in step (2), where all the data massaging needs to happen -- the caveat is that the format changed over time, so this step needs to be done carefully to avoid losing data.
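A rough sketch of what step (2) could look like, assuming the archives are line-delimited JSON; the specific normalizations shown (`repository` vs. `repo`, `actor` as a plain login string vs. an object) are illustrative examples of the format drift, not an exhaustive list:

```python
import gzip
import json

# Columns to keep as fixed, typed fields in the new format.
FIXED = ("id", "type", "actor", "repo", "created_at", "public", "org")

def reformat_event(old: dict) -> dict:
    # Normalize field names/shapes that drifted over the years
    # (illustrative examples, not a complete mapping).
    if "repository" in old and "repo" not in old:
        old["repo"] = old.pop("repository")
    if isinstance(old.get("actor"), str):
        old["actor"] = {"login": old["actor"]}
    new = {f: old.pop(f) for f in FIXED if f in old}
    # Fold any remaining legacy fields into the payload rather than drop them.
    payload = old.pop("payload", {})
    payload.update(old)
    new["payload"] = json.dumps(payload)
    return new

def reformat_archive(src: str, dst: str) -> None:
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            fout.write(json.dumps(reformat_event(json.loads(line))) + "\n")
```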
Update: we should also scrub emails from the new tables and files, replacing real addresses with a placeholder.
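A crude sketch of the scrub as a regex replace with a placeholder address (both the pattern and the placeholder are illustrative; a real pass would also need to catch emails nested inside the JSON payload strings):

```python
import re

# Naive "looks like an email" pattern; tighten as needed.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_emails(text: str, placeholder: str = "private@example.com") -> str:
    return EMAIL_RE.sub(placeholder, text)

print(scrub_emails('{"email": "jane.doe@gmail.com"}'))
# -> {"email": "private@example.com"}
```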