Description
Pre-2015 we flattened the JSON events into a pre-defined schema and appended them to a global "timeline" table. This worked, but it had a few issues:
- The GH event format changed over time, whereas our schema stayed the same
- New fields were omitted to preserve backwards compatibility with older data
- One big table was convenient to query, but became expensive to scan as the dataset grew
Post-2015 we moved to a different format that keeps a minimal set of well-defined fields as separate columns and stashes the rest into a JSON payload:
- see https://www.githubarchive.org/#bigquery for more details.
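For illustration, a hypothetical row under that scheme (the field names follow the BigQuery schema linked above, but the values here are made up):

```python
# Hypothetical post-2015 row: a handful of stable, typed columns, with the
# format-dependent remainder packed into a single JSON string.
event_row = {
    "id": "2489651045",
    "type": "PushEvent",
    "actor": {"id": 9152315, "login": "example-user"},
    "repo": {"id": 28688495, "name": "example/repo"},
    "created_at": "2015-01-01T15:00:00Z",
    "public": True,
    "payload": '{"push_id": 536740396, "size": 1, "commits": []}',
}
```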
Ideally, we should reprocess the pre-2015 events to follow the same pattern. The rough work involved:
1. Download the raw pre-2015 hourly gzips (see the download sketch after this list)
2. Reformat each archive to use the new event format with fixed fields and a JSON payload
3. Import the reformatted data into separate daily tables
4. Create appropriate rollups to simplify querying
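A minimal sketch of step (1), assuming the hourly archives follow the naming convention documented on githubarchive.org (YYYY-MM-DD-H.json.gz, hours 0-23); worth verifying against the site before running at scale:

```python
import datetime
import urllib.request

BASE = "http://data.githubarchive.org"

def download_day(day: datetime.date, dest_dir: str = ".") -> None:
    # Fetch all 24 hourly gzip archives for a given day.
    for hour in range(24):
        name = f"{day:%Y-%m-%d}-{hour}.json.gz"
        urllib.request.urlretrieve(f"{BASE}/{name}", f"{dest_dir}/{name}")

download_day(datetime.date(2014, 1, 1))
```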
In other words, an ETL, with the bulk of the work in step (2), where all the data massaging needs to happen -- the caveat is that the format changed over time, so this step needs to be done carefully to avoid losing data.
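A rough sketch of what step (2) could look like, assuming the archives are line-delimited JSON; the specific normalizations shown (`repository` vs. `repo`, `actor` as a plain login string vs. an object) are illustrative examples of the format drift, not an exhaustive list:

```python
import gzip
import json

# Columns to keep as fixed, typed fields in the new format.
FIXED = ("id", "type", "actor", "repo", "created_at", "public", "org")

def reformat_event(old: dict) -> dict:
    # Normalize field names/shapes that drifted over the years
    # (illustrative examples, not a complete mapping).
    if "repository" in old and "repo" not in old:
        old["repo"] = old.pop("repository")
    if isinstance(old.get("actor"), str):
        old["actor"] = {"login": old["actor"]}
    new = {f: old.pop(f) for f in FIXED if f in old}
    # Fold any remaining legacy fields into the payload rather than drop them.
    payload = old.pop("payload", {})
    payload.update(old)
    new["payload"] = json.dumps(payload)
    return new

def reformat_archive(src: str, dst: str) -> None:
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            fout.write(json.dumps(reformat_event(json.loads(line))) + "\n")
```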
Update: we should also scrub emails from the new tables and files, replacing real addresses with a placeholder.
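A crude sketch of the scrub as a regex replace with a placeholder address (both the pattern and the placeholder are illustrative; a real pass would also need to catch emails nested inside the JSON payload strings):

```python
import re

# Naive "looks like an email" pattern; tighten as needed.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_emails(text: str, placeholder: str = "private@example.com") -> str:
    return EMAIL_RE.sub(placeholder, text)

print(scrub_emails('{"email": "jane.doe@gmail.com"}'))
# -> {"email": "private@example.com"}
```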