Create schema to persist discover schema output #9896

Closed
@cgardens

epic: #9895

Tell us about the problem you're trying to solve

We want to keep track of the output of the discover schema call. See the epic for the full context.

We want the persistence of this information to support the following features:

  • Ability to link the catalog used for a connection to the catalog returned by discover
  • Ability to see the different catalogs returned by a source over time.
  • Ability to prune catalogs if they become too numerous.
  • Ability to return a cached catalog instead of calling discover schema if the configuration for the connector has not changed
  • Ability to avoid calling discover schema if the configuration for a connector is the same
  • Ability to avoid storing duplicate catalogs

Describe the solution you’d like

Add 2 tables:

actor_catalog - this table primarily stores a raw catalog

  • id - uuid
  • catalog - jsonb
  • catalog_hash - hash of catalog for easier comparisons
  • created_at - timestamptz

actor_catalog_fetch_event

  • id - uuid
  • actor_catalog_id - uuid
  • actor_id - foreign key to the actor to which the catalog pertains
  • config_hash - hash of the configuration used to produce the catalog
  • actor_version - version of the actor at the time of fetch
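
For illustration, the two tables could be sketched as the DDL below. This is a rough sketch, not the actual migration: column names follow the issue, but the concrete types, the NOT NULL and UNIQUE constraints, and the use of an in-memory SQLite database (so the snippet runs anywhere) are all assumptions. In Postgres the columns would use uuid, jsonb, and timestamptz as listed above.

```python
import sqlite3

# Illustrative DDL only. Column names come from the issue; the types,
# constraints, and the choice of SQLite are assumptions for this sketch.
SCHEMA = """
CREATE TABLE actor_catalog (
    id           TEXT PRIMARY KEY,       -- uuid
    catalog      TEXT NOT NULL,          -- jsonb in Postgres
    catalog_hash TEXT NOT NULL UNIQUE,   -- assumed unique to block duplicates
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- timestamptz in Postgres
);

CREATE TABLE actor_catalog_fetch_event (
    id               TEXT PRIMARY KEY,   -- uuid
    actor_catalog_id TEXT NOT NULL REFERENCES actor_catalog (id),
    actor_id         TEXT NOT NULL,      -- foreign key to the actor table (not modelled here)
    config_hash      TEXT NOT NULL,
    actor_version    TEXT NOT NULL
);
"""


def create_schema(conn):
    """Create both tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Each fetch event points at exactly one catalog row, while many fetch events may share the same catalog row.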

Update the connection table to add a column:

  • source_catalog_id - uuid - links to the catalog used to configure the connection. Nullable, and null for existing connections.

The goal of using 2 tables here is to avoid storing multiple copies of the same catalog output, which can be massive. We know catalogs tend not to change.
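
To make the deduplication and caching ideas concrete, here is a minimal in-memory sketch. `CatalogStore`, `stable_hash`, `record_fetch`, and `cached_catalog` are hypothetical names invented for this example, and SHA-256 over canonical JSON is an assumed hashing scheme; the issue does not specify one. Matching the cache on actor_version as well as config_hash is also an assumption.

```python
import hashlib
import json
import uuid


def stable_hash(obj):
    """SHA-256 of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CatalogStore:
    """In-memory stand-in for the two tables (hypothetical helper)."""

    def __init__(self):
        self.actor_catalog = {}  # catalog_hash -> {"id": ..., "catalog": ...}
        self.fetch_events = []   # one dict per discover schema call

    def record_fetch(self, actor_id, actor_version, config, catalog):
        """Store the catalog only if its hash is new, then log the fetch."""
        catalog_hash = stable_hash(catalog)
        row = self.actor_catalog.get(catalog_hash)
        if row is None:
            row = {"id": str(uuid.uuid4()), "catalog": catalog}
            self.actor_catalog[catalog_hash] = row
        self.fetch_events.append({
            "id": str(uuid.uuid4()),
            "actor_catalog_id": row["id"],
            "actor_id": actor_id,
            "config_hash": stable_hash(config),
            "actor_version": actor_version,
        })
        return row["id"]

    def cached_catalog(self, actor_id, actor_version, config):
        """Return the most recent catalog fetched with this exact config, else None."""
        key = (actor_id, actor_version, stable_hash(config))
        for event in reversed(self.fetch_events):
            if (event["actor_id"], event["actor_version"], event["config_hash"]) == key:
                for row in self.actor_catalog.values():
                    if row["id"] == event["actor_catalog_id"]:
                        return row["catalog"]
        return None


store = CatalogStore()
catalog = {"streams": [{"name": "users"}]}
first_id = store.record_fetch("actor-1", "0.1.0", {"host": "a"}, catalog)
second_id = store.record_fetch("actor-1", "0.1.0", {"host": "b"}, catalog)
```

Both calls return the same catalog id: two fetch events are recorded, but the catalog itself is stored once. A caller could check `cached_catalog` before invoking discover schema and skip the call on a hit.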

Describe the alternative you’ve considered or used

We could use a single table, but we would likely end up with tons of duplicate catalogs. The objects are huge, which makes them hard to tell apart, so we would need a separate mechanism to do that anyway.
