Create schema to persist discover schema output #9896

Closed
Tracked by #9895
cgardens opened this issue Jan 31, 2022 · 0 comments · Fixed by #10226
Labels
area/platform issues related to the platform team/compose team/platform-move type/enhancement New feature or request

epic: #9895

Tell us about the problem you're trying to solve

We want to keep track of the output of the discover schema call. See the epic for the full context.

We want the persistence of this information to support the following features:

  • Ability to link the catalog used for a connection to the catalog returned by discover.
  • Ability to see the different catalogs returned by a source over time.
  • Ability to prune catalogs if they become too numerous.
  • Ability to return a cached catalog instead of calling discover schema if the configuration for the connector has not changed.
  • Ability to avoid calling discover schema if the configuration for a connector is the same.
  • Ability to avoid storing duplicate catalogs.
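
As a rough illustration of the caching features in the list above, the sketch below hashes a connector configuration in a key-order-independent way and only calls discover on a cache miss. The names `stable_hash`, `CatalogCache`, and `fake_discover` are hypothetical, the in-memory dict stands in for whatever persistence layer is eventually used, and including the actor version in the cache key is an assumption rather than something this issue specifies.

```python
import hashlib
import json


def stable_hash(obj) -> str:
    """Hash a JSON-serializable object so that key order does not matter."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CatalogCache:
    """Hypothetical in-memory stand-in for persisted discover results."""

    def __init__(self):
        # (actor_id, actor_version, config_hash) -> catalog
        self._by_key = {}

    def get_or_discover(self, actor_id, actor_version, config, discover):
        key = (actor_id, actor_version, stable_hash(config))
        if key not in self._by_key:
            # Only run the (expensive) discover call on a cache miss.
            self._by_key[key] = discover(config)
        return self._by_key[key]


# Usage: the second call has the same configuration (different key order),
# so the fake discover function is invoked only once.
calls = []


def fake_discover(config):
    calls.append(config)
    return {"streams": [{"name": "users"}]}


cache = CatalogCache()
a = cache.get_or_discover("actor-1", "0.1.0", {"host": "db", "port": 5432}, fake_discover)
b = cache.get_or_discover("actor-1", "0.1.0", {"port": 5432, "host": "db"}, fake_discover)
```

Serializing with sorted keys before hashing means two configurations that differ only in key order map to the same hash, which is what makes the "configuration has not changed" check cheap.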

Describe the solution you’d like

Add 2 tables:

actor_catalog - this table primarily stores a raw catalog

  • id - uuid
  • catalog - jsonb
  • catalog_hash - hash of the catalog, for easier comparisons
  • created_at - timestamptz

actor_catalog_fetch_event

  • id - uuid
  • actor_catalog_id - uuid
  • actor_id - foreign key to the actor to which the catalog pertains
  • config_hash - hash of the configuration used to produce the catalog
  • actor_version - version of the actor at the time of fetch

Update the connection table to add a column:

  • source_catalog_id - uuid - links the connection to the catalog used to configure it. Nullable, and null for old connections.

The goal of using 2 tables here is to avoid storing multiple copies of the same catalog output, which can be massive. We know catalogs tend not to change.
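
A minimal sketch of how the two tables could work together to avoid duplicate catalogs, using Python's built-in `sqlite3` purely for illustration. The SQLite column types are stand-ins for the Postgres types named above, and the `UNIQUE`, `NOT NULL`, and `REFERENCES` constraints, along with the helper names `stable_hash` and `record_fetch`, are assumptions and not part of this proposal.

```python
import hashlib
import json
import sqlite3
import uuid

# SQLite stand-ins for the Postgres column types (uuid, jsonb, timestamp).
# The constraints are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE actor_catalog (
    id           TEXT PRIMARY KEY,
    catalog      TEXT NOT NULL,
    catalog_hash TEXT NOT NULL UNIQUE,
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE actor_catalog_fetch_event (
    id               TEXT PRIMARY KEY,
    actor_catalog_id TEXT NOT NULL REFERENCES actor_catalog(id),
    actor_id         TEXT NOT NULL,
    config_hash      TEXT NOT NULL,
    actor_version    TEXT NOT NULL
);
"""


def stable_hash(obj) -> str:
    """Hash a JSON-serializable object so that key order does not matter."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_fetch(conn, actor_id, actor_version, config, catalog) -> str:
    """Store the catalog at most once, then log a fetch event pointing at it."""
    catalog_hash = stable_hash(catalog)
    row = conn.execute(
        "SELECT id FROM actor_catalog WHERE catalog_hash = ?", (catalog_hash,)
    ).fetchone()
    if row is None:
        # First time this catalog is seen: persist the raw JSON.
        catalog_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO actor_catalog (id, catalog, catalog_hash) VALUES (?, ?, ?)",
            (catalog_id, json.dumps(catalog), catalog_hash),
        )
    else:
        # Same catalog as before: reuse the existing row.
        catalog_id = row[0]
    conn.execute(
        "INSERT INTO actor_catalog_fetch_event "
        "(id, actor_catalog_id, actor_id, config_hash, actor_version) "
        "VALUES (?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), catalog_id, actor_id, stable_hash(config), actor_version),
    )
    return catalog_id


# Usage: two fetches (different actor versions) return the same catalog,
# so only one actor_catalog row is written.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
catalog = {"streams": [{"name": "users"}]}
first = record_fetch(conn, "actor-1", "0.1.0", {"host": "db"}, catalog)
second = record_fetch(conn, "actor-1", "0.1.1", {"host": "db"}, catalog)
```

Each fetch adds a small row to `actor_catalog_fetch_event`, while the large catalog payload is written only when its hash has not been seen before.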

Describe the alternative you’ve considered or used

We could use a single table, but we would likely end up with tons of duplicate catalogs. The objects are huge, which makes them hard to tell apart, so we would need some separate mechanism to compare them anyway.
