epic: #9895
Tell us about the problem you're trying to solve
We want to keep track of the output of the discover schema call. See the epic for the full context.
We want the persistence of this information to support the following features:
- Ability to link the catalog used for a connection to the catalog returned by discover
- Ability to see the different catalogs returned by a source over time.
- Ability to prune catalogs if they become too numerous.
- Ability to return a cached catalog instead of calling discover schema if the configuration for the connector has not changed (see the sketch after this list)
- Ability to avoid calling discover schema at all when a connector's configuration is unchanged
- Ability to avoid storing duplicate catalogs
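As a rough illustration of the caching requirement above, the check boils down to comparing a hash of the current configuration against the one recorded for the actor's most recent fetch. The `CatalogFetchEventStore` interface, the MD5 hashing, and the extra `actor_version` comparison below are illustrative assumptions, not part of this proposal:

```java
// Sketch only: the store interface and hashing choice are hypothetical.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Optional;
import java.util.UUID;

interface CatalogFetchEventStore {
  // Most recent fetch event recorded for the given actor, if any.
  Optional<FetchEvent> findLatest(UUID actorId);

  record FetchEvent(UUID actorCatalogId, String configHash, String actorVersion) {}
}

final class DiscoverSchemaCache {
  private final CatalogFetchEventStore store;

  DiscoverSchemaCache(final CatalogFetchEventStore store) {
    this.store = store;
  }

  /** Returns the id of a reusable cached catalog, or empty if discover must be re-run. */
  Optional<UUID> cachedCatalogId(final UUID actorId, final String configJson, final String actorVersion)
      throws NoSuchAlgorithmException {
    final String configHash = md5Hex(configJson);
    return store.findLatest(actorId)
        .filter(event -> event.configHash().equals(configHash)
            && event.actorVersion().equals(actorVersion))
        .map(CatalogFetchEventStore.FetchEvent::actorCatalogId);
  }

  private static String md5Hex(final String input) throws NoSuchAlgorithmException {
    final byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes(StandardCharsets.UTF_8));
    return HexFormat.of().formatHex(digest);
  }
}
```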
Describe the solution you’d like
Add 2 tables:
actor_catalog - this table primarily stores a raw catalog
- id - uuid
- catalog - jsonb
- catalog_hash - hash of the catalog, for easier comparisons
- created_at - timestamptz

actor_catalog_fetch_event - this table records each discover schema fetch
- id - uuid
- actor_catalog_id - uuid - foreign key to the actor_catalog row that was returned
- actor_id - foreign key to the actor the catalog pertains to
- config_hash - hash of the configuration used to produce the catalog
- actor_version - version of the actor at the time of the fetch
Update the connection table to add a column:
- source_catalog_id - uuid - links to the catalog used to configure the connection. Nullable; null for pre-existing connections.
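To make the proposal concrete, here is a minimal jOOQ-style migration sketch, assuming the existing Postgres config database; the hash column sizes, the actor_version type, and the foreign-key targets are assumptions this issue does not pin down:

```java
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.foreignKey;
import static org.jooq.impl.DSL.primaryKey;

import org.jooq.DSLContext;
import org.jooq.impl.SQLDataType;

final class AddActorCatalogTablesSketch {

  static void migrate(final DSLContext ctx) {
    // actor_catalog: one row per distinct catalog payload.
    ctx.createTableIfNotExists("actor_catalog")
        .columns(
            field("id", SQLDataType.UUID.nullable(false)),
            field("catalog", SQLDataType.JSONB.nullable(false)),
            field("catalog_hash", SQLDataType.VARCHAR(32).nullable(false)),
            field("created_at", SQLDataType.TIMESTAMPWITHTIMEZONE.nullable(false)))
        .constraints(primaryKey("id"))
        .execute();

    // actor_catalog_fetch_event: one row per discover schema call.
    ctx.createTableIfNotExists("actor_catalog_fetch_event")
        .columns(
            field("id", SQLDataType.UUID.nullable(false)),
            field("actor_catalog_id", SQLDataType.UUID.nullable(false)),
            field("actor_id", SQLDataType.UUID.nullable(false)),
            field("config_hash", SQLDataType.VARCHAR(32).nullable(false)),
            field("actor_version", SQLDataType.VARCHAR(256).nullable(false)))
        .constraints(
            primaryKey("id"),
            foreignKey("actor_catalog_id").references("actor_catalog", "id"),
            foreignKey("actor_id").references("actor", "id"))
        .execute();

    // connection.source_catalog_id: nullable link to the catalog the connection was configured against.
    ctx.alterTable("connection")
        .addColumn(field("source_catalog_id", SQLDataType.UUID.nullable(true)))
        .execute();
  }
}
```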
The goal of using two tables here is to avoid storing multiple copies of the same catalog output, which can be massive. We know catalogs tend not to change often, so most fetches should map to an existing row.
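A sketch of the corresponding write path, again assuming MD5 hashing and the jOOQ DSL as above: a new actor_catalog row is inserted only when the hash has not been seen before, while every discover call still records a fetch event.

```java
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.OffsetDateTime;
import java.util.HexFormat;
import java.util.UUID;
import org.jooq.DSLContext;
import org.jooq.JSONB;

final class CatalogWriterSketch {

  /** Stores a discovered catalog, reusing an existing row when the payload is identical. */
  static UUID storeCatalog(final DSLContext ctx, final UUID actorId, final String catalogJson,
      final String configHash, final String actorVersion) throws NoSuchAlgorithmException {
    final String catalogHash = HexFormat.of()
        .formatHex(MessageDigest.getInstance("MD5").digest(catalogJson.getBytes(StandardCharsets.UTF_8)));

    // Reuse the existing actor_catalog row if this exact catalog has been seen before.
    UUID catalogId = ctx.select(field("id", UUID.class))
        .from(table("actor_catalog"))
        .where(field("catalog_hash", String.class).eq(catalogHash))
        .fetchOne(0, UUID.class);

    if (catalogId == null) {
      catalogId = UUID.randomUUID();
      ctx.insertInto(table("actor_catalog"))
          .columns(field("id"), field("catalog"), field("catalog_hash"), field("created_at"))
          .values(catalogId, JSONB.valueOf(catalogJson), catalogHash, OffsetDateTime.now())
          .execute();
    }

    // Every fetch is recorded, even when the catalog payload itself is deduplicated.
    ctx.insertInto(table("actor_catalog_fetch_event"))
        .columns(field("id"), field("actor_catalog_id"), field("actor_id"),
            field("config_hash"), field("actor_version"))
        .values(UUID.randomUUID(), catalogId, actorId, configHash, actorVersion)
        .execute();

    return catalogId;
  }
}
```

In practice this would likely run in a single transaction, and a strict implementation could compare the full catalog on a hash match to rule out collisions.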
Describe the alternative you’ve considered or used
We could use a single table, but we would likely end up with a huge number of duplicate catalogs. The objects are large and hard to tell apart, so we would need a separate comparison mechanism (such as a hash) anyway.