Skip to content

Convert pre-2015 archives to new event format #112

Closed
@igrigorik

Description

@igrigorik

Pre-2015 we flattened the JSON events into a pre-defined schema and appended them to a global "timeline" table. This worked, but it has a few issues:

  • The GH event format changed overtime, whereas our schema stayed the same
    • New fields were omitted to keep backwards-compat with older data
  • One big table was convenient to query but became expensive due to large dataset

Post 2015 we moved to a different format that keeps the minimum number of well-defined fields as separate columns and stashes the rest into a JSON payload - see https://www.githubarchive.org/#bigquery for more details.

Ideally, we should reprocess the pre 2015 events to follow the same pattern. The ~rough work involved:

  1. Downloading the raw pre-2015 hourly gzip's
  2. Reformat each archive to use the new event format with fixed fields and JSON payload
  3. Import above data into separate daily tables
  4. Create appropriate rollups to simplify querying

In other words, an ETL.. with bulk of the work in (2) where all the data massaging needs to happen -- the caveat is that the format changed over time, so this needs to be done carefully to avoid losing data.

Update: we should also scrub email's from the new tables and files: [email protected] -> [email protected]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions