Description
- Frictionless framework
  - Read/write from SQL databases (Python API only; does not work from the CLIs); as of 2022-04-29, considered an experimental feature (see the sketch after this list)
- SQLite
  - Official limits
    - https://www.sqlite.org/limits.html
    - Trivia: maybe the limit we would reach first is the number of columns in a single table (the default seems to be 2,000 columns)
  - Etc
- PostgreSQL
  - Limits
    - https://www.postgresql.org/docs/current/limits.html
    - Default maximum number of columns per table: 1,600
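
As a rough illustration of the first point, here is a minimal sketch of writing a data package into SQLite through the frictionless Python API. The file names are hypothetical and the `Package.to_sql` call is an assumption based on how the feature was documented at the time; since this is experimental, the exact method and signature may differ between frictionless versions.

```python
# Sketch only: the SQL support in frictionless is experimental, so the exact
# method name/signature (assumed here to be Package.to_sql) may change.
from sqlalchemy import create_engine
from frictionless import Package

# Hypothetical paths: one of the 1603 dictionaries exported as a data package
package = Package('datapackage.json')
engine = create_engine('sqlite:///1603.sqlite')

# Write every resource of the package as a table in the SQLite file
package.to_sql(engine=engine)
```
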
Current known context at the moment
- The New exported format: JSON-LD metadata to explain the CSVs, using W3C Tabular Data (Basic implementation only) #36, if followed strictly, would allow creating a package importable into some database. But if we did that, it would require duplicating more CSVs on each focused base of dictionaries
- The New exported format: frictionlessdata Tabular Data Package + Data Package Catalogs #35: frictionless has an experimental feature (a quick test was done, and it somewhat works) which allows writing a populated SQLite database from a datapackage.json
- The entire 1603 is already designed to be friendly enough that users can have everything as a local copy
- Unlike the generic datasets most data portals ingest, the dictionaries we produce are very structured
- Since we use 1603 as a global prefix, if the dictionaries are already in a database, users could use other global prefixes to ingest actual data and then use SQL to manipulate/transform real-world data (an alternative to working with the CSVs directly; see the query sketch after this list)
- The way we already structured the dictionaries, some from [1603:1] are already required to generate each Cōdex. They already have a somewhat implicit schema, but the CLIs can work with plain text (the CSVs)
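
To make the last two points concrete: once the dictionaries are inside a SQLite file, plain SQL is enough to join the reference tables with whatever the user ingests under other prefixes. The file and table names below are hypothetical (frictionless derives table names from the resource names of the package).

```python
import sqlite3

# Database previously bootstrapped from the dictionaries (file name is hypothetical)
con = sqlite3.connect('1603.sqlite')

# Hypothetical table name; frictionless names tables after the package resources.
# Any SQL join/filter/transform can now replace manual manipulation of the CSVs.
for row in con.execute('SELECT * FROM "1603_1_51" LIMIT 10'):
    print(row)

con.close()
```
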
Idea of this issue
TODO: experimental CLI feature to bootstrap a database from selected dictionaries (...somewhat equivalent to bootstrapping a data warehouse)
It does not make sense to pre-generate binary databases for end users; that would be somewhat a waste of space. Also, users may be more interested in some dictionaries than in others, so even a near-complete single global database would be too big, could be in an inconsistent state from time to time, and would obviously make the compilation times absurdly long.
However, sooner or later people (or at least we, for our internal use) may want to ingest everything of interest into some relational database. In fact, this would be a side effect of better data formats to explain the datasets, such as frictionless or W3C Tabular Data.
However, we can cut a lot of time (and a lot of pain, such as running commands to re-ingest dictionaries one by one) by simply providing something (even if it relies on the experimental features of frictionlessdata) already optimized to create the full database from selected groups of dictionaries (see the sketch below). This would also be more aligned with the philosophy of automating what would otherwise take more documentation AND could help give a better overview of the datasets without going through them one by one.
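
A sketch of what such a bootstrap could look like internally. Everything here (the function name, the paths, the `to_sql` call) is hypothetical; it only illustrates the idea of ingesting a selected group of dictionaries into a single database in one pass instead of re-ingesting them one by one.

```python
from pathlib import Path
from sqlalchemy import create_engine
from frictionless import Package


def bootstrap_database(datapackage_paths, target='sqlite:///1603-selected.sqlite'):
    """Hypothetical helper: ingest a selected group of dictionaries
    (each one described by a datapackage.json) into one database."""
    engine = create_engine(target)
    for path in datapackage_paths:
        # Experimental frictionless feature; the exact signature may change.
        Package(str(path)).to_sql(engine=engine)
    return target


# Example: bootstrap only the dictionaries of interest (paths are hypothetical)
bootstrap_database(sorted(Path('1603').glob('*/datapackage.json')))
```
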
Other comments
The common use case here assumes the data related to dictionaries can be re-bootstrapped and that, once that is finished, no more writes would occur (at least not on the reference tables). So SQLite would be a perfect fit (even for production use and huge databases, as long as no concurrent writes are necessary). However, PostgreSQL (or whatever the user would want to convert the SQLite into) would be another alternative.
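
Since the frictionless SQL support goes through SQLAlchemy, switching the target from SQLite to PostgreSQL should in principle only be a matter of the connection URL. This is a hedged sketch; the credentials and database name are placeholders and the same caveats about the experimental `to_sql` call apply.

```python
from sqlalchemy import create_engine
from frictionless import Package

# Placeholder credentials/database; any backend SQLAlchemy supports could be the target
engine = create_engine('postgresql://user:password@localhost/example')
Package('datapackage.json').to_sql(engine=engine)
```
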
Leave room for conventions to store Common Operational Datasets (at least COD-ABs)
- See
  - https://en.wikipedia.org/wiki/Common_Operational_Datasets
  - MVP of [1603.45.16] /"Ontologia"."United Nations"."P"/@eng-Latn #2 - Maybe also harmonize with digital-guard data??
While the dictionaries we're building have their index handcrafted (even if the terminology translations are compiled with software), the perfect first candidates to optimize for users to ingest in a predictable way would be the CODs.
Note: in case we fetch data from other sources (such as @digital-guard), the actual use case here would focus on live data, not archived data.
Before going to the CODs, we need to optimize the dictionaries that explain them
To have a sane way to ingest data, we would first need the dictionaries from [1603:??] Geographia (create base numerospace) #31 already done.
Our dictionaries can reuse other dictionaries (so things get better over time) and, at least for concepts related to places, the number used to access the dictionary can actually mean the country.