New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature) #37

Open
@fititnt

Currently known context

  • New exported format: JSON-LD metadata to explain the CSVs, using W3C Tabular Data (Basic implementation only) #36, if followed strictly, would allow creating a package importable into some database. But if we did it, it would require duplicating more CSVs for each focused base of dictionaries.
  • New exported format: frictionlessdata Tabular Data Package + Data Package Catalogs #35, from frictionless, has an experimental feature (I only ran a quick test, and it somewhat works) which allows writing a populated SQLite database from a datapackage.json.
  • The entire 1603 is already designed to make it easy for users to keep everything as a local copy.
    • Unlike the generic datasets most data portals ingest, the dictionaries we produce are very structured.
      • Because we use 1603 as the global prefix, once the dictionaries are in a database, users could use other global prefixes to ingest actual data and then use SQL to manipulate/transform real-world data (an alternative to working with the CSVs directly).
  • Given the way we already structured the dictionaries, some from [1603:1] are already required to generate each Cōdex. They already have a somewhat implicit schema, but the CLIs can work with plain text (the CSVs).
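
The prefix idea above can be sketched with plain SQL. This is only an illustration using Python's standard sqlite3 module: the table names, the 9999 prefix for user data, the column names, and the sample rows are all hypothetical and do not come from the actual dictionaries.

```python
import sqlite3

# Hypothetical naming: "1603_..." stands for a reference dictionary table,
# "9999_..." for a table under a user-chosen prefix holding real-world data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "1603_example_places" (code TEXT PRIMARY KEY, label_eng TEXT);
    CREATE TABLE "9999_survey" (place_code TEXT, households INTEGER);
    """
)

# Invented sample rows, only to make the join observable.
conn.executemany(
    'INSERT INTO "1603_example_places" VALUES (?, ?)',
    [("076", "Brazil"), ("508", "Mozambique")],
)
conn.executemany(
    'INSERT INTO "9999_survey" VALUES (?, ?)',
    [("076", 120), ("508", 45), ("076", 30)],
)

# Use the dictionary table to label and aggregate the user data.
rows = conn.execute(
    """
    SELECT d.label_eng, SUM(s.households) AS households
    FROM "9999_survey" AS s
    JOIN "1603_example_places" AS d ON d.code = s.place_code
    GROUP BY d.label_eng
    ORDER BY d.label_eng
    """
).fetchall()
print(rows)
conn.close()
```

Table names that start with a digit have to be quoted in SQL, which is one practical detail a naming convention for prefixes would need to account for.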

Idea of this issue

TODO: Experimental CLI feature to bootstrap a database from selected dictionaries (...somewhat equivalent to bootstrapping a data warehouse)

It does not make sense to pre-generate binary databases for end users; that would be somewhat a waste of space. Also, users could be more interested in some dictionaries than others, so even a near-single global database would be too big, could be in an inconsistent state from time to time, and would obviously make compilation times absurdly long.

However, sooner or later people (or at least we, for our internal use) may want to ingest everything of interest into some relational database. In fact, this would be a side effect of better data formats to explain the datasets, such as frictionless or W3C Tabular Data.

However, we can save a lot of time (and a lot of pain, like commands to re-ingest dictionaries one by one) by simply offering a path (even if it relies on the experimental features of frictionlessdata) that is already optimized to create the full database from already selected groups of dictionaries. This would also be more aligned with the philosophy of automating what would otherwise take more documentation, AND could help get a better overview of the datasets without going through them one by one.
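
A minimal sketch of what such a bootstrap step could look like, assuming a datapackage-style JSON file whose resources each have a `name` and a `path` to a CSV. It deliberately uses only the Python standard library as a stand-in and does not call the experimental frictionless feature; the function name `bootstrap_sqlite`, the `selected` parameter, and the sample files are hypothetical.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def quote(identifier):
    """Quote a table or column name for use in SQLite statements."""
    return '"' + identifier.replace('"', '""') + '"'


def bootstrap_sqlite(descriptor_path, db_path, selected=None):
    """Load CSV resources listed in a datapackage-style JSON into SQLite.

    Only the "name" and "path" keys of each resource are read, and every
    column is stored as TEXT. `selected` optionally restricts which
    resources are loaded. Assumes rectangular CSVs with a header row.
    """
    descriptor_path = Path(descriptor_path)
    package = json.loads(descriptor_path.read_text(encoding="utf-8"))
    loaded = []
    conn = sqlite3.connect(db_path)
    try:
        for resource in package.get("resources", []):
            name = resource["name"]
            if selected is not None and name not in selected:
                continue
            csv_path = descriptor_path.parent / resource["path"]
            with open(csv_path, newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle)
                header = next(reader)
                table = quote(name)
                columns = ", ".join(quote(col) + " TEXT" for col in header)
                marks = ", ".join("?" for _ in header)
                # Re-bootstrapping replaces the table instead of appending.
                conn.execute(f"DROP TABLE IF EXISTS {table}")
                conn.execute(f"CREATE TABLE {table} ({columns})")
                conn.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)
            loaded.append(name)
        conn.commit()
    finally:
        conn.close()
    return loaded


# Usage with two invented dictionaries, loading only one of them.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "dict_a.csv").write_text("code,label\n1,alpha\n2,beta\n", encoding="utf-8")
    (base / "dict_b.csv").write_text("code,label\n9,omega\n", encoding="utf-8")
    descriptor = {
        "resources": [
            {"name": "dict_a", "path": "dict_a.csv"},
            {"name": "dict_b", "path": "dict_b.csv"},
        ]
    }
    (base / "datapackage.json").write_text(json.dumps(descriptor), encoding="utf-8")

    db_file = base / "warehouse.sqlite"
    loaded = bootstrap_sqlite(base / "datapackage.json", db_file, selected={"dict_a"})

    conn = sqlite3.connect(db_file)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    row_count = conn.execute('SELECT COUNT(*) FROM "dict_a"').fetchone()[0]
    conn.close()
    print(loaded, tables, row_count)
```

Dropping and recreating each table keeps the re-bootstrap idempotent, which matches the idea that the database is rebuilt from the dictionaries rather than edited in place.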

Other comments

The common use case here assumes that data related to dictionaries can be re-bootstrapped and that, once finished, no more writes would occur (at least not on the reference tables). So SQLite would be a perfect fit (even for production use and huge databases, as long as no concurrent writes are necessary). However, PostgreSQL (or whatever the user would want to convert the SQLite database to) would be another alternative.
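
The "write once, then only read" assumption can be made explicit when consumers open the database. The sketch below, with a hypothetical file name and table, opens the SQLite file through a URI with `mode=ro`, so any later write attempt fails instead of silently changing the reference tables.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "reference.sqlite"  # hypothetical file name

    # Bootstrap phase: the only moment the reference table is written.
    writer = sqlite3.connect(db_path)
    writer.execute("CREATE TABLE reference (code TEXT, label TEXT)")
    writer.execute("INSERT INTO reference VALUES ('a', 'example')")
    writer.commit()
    writer.close()

    # Consumer phase: mode=ro asks SQLite to open the file read-only.
    reader = sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)
    rows = reader.execute("SELECT code, label FROM reference").fetchall()
    try:
        reader.execute("INSERT INTO reference VALUES ('b', 'blocked')")
        write_blocked = False
    except sqlite3.OperationalError:
        # SQLite refuses the write on a read-only connection.
        write_blocked = True
    reader.close()

print(rows, write_blocked)
```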

Opens room for conventions to store Common Operational Datasets (at least COD-ABs)

While the dictionaries we're building have handcrafted indexes (even if the terminology translations are compiled with software), the perfect first candidates to optimize so users can ingest them in a predictable way would be CODs.

Note: in case we fetch data from other sources (such as @digital-guard), the actual use case here would focus on live data, not archived data.

Before going to CODs, we need to optimize the dictionaries that explain them

To have a sane way to ingest data, we would first need to have the dictionaries from [1603:??] Geographia (create base numerospace) #31 already done.

Our dictionaries can reuse other dictionaries (so things get better over time) and, at least for concepts related to places, the number used to access the dictionary can actually mean the country.

Metadata


Labels

  • archiva-farmatis: archīva fōrmātīs; /formats of files/@eng-Latn; About (new) data formats to package dictionaries
  • librarium-formato: librārium fōrmātō; /library format/@eng-Latn; Related to storage of entire referential data
