Description
Tell us about the problem you're trying to solve
Discover schema (docs) can be expensive, both in rate-limit usage and in time.
Additionally, because we do not store the output of discover schema (only the configured version, which is lossy), we lose information about the schema in the UI after a connection is configured. For example, if the catalog from discover_schema has stream1 and stream2 but the connection only configures stream1, the user will never see stream2 in the UI again (unless they click force refresh schema). This is very confusing, and we should be able to keep track of this information after configuration.
Finally, if setting up a connection gets interrupted, the user is forced to re-pull the schema. If the schema takes a long time to pull (or the slow pull was itself the problem), this can leave the user stuck. While we should give feedback to the user if discover schema is taking a long time, it is at least slightly better to store the output so that they do not need to do the expensive thing multiple times.
Describe the solution you’d like
We should persist the output of discover schema, so that if there is a failure in the middle of configuration the user does not need to refetch. This will also allow us to not lose information about the schema in the UI.
Acceptance Criteria
- Retain information about the full catalog, even after configuring a connector.
- Avoid calling discover schema again if the connector version and configuration have not changed.
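
To make the second criterion concrete, here is a rough sketch of a cache keyed on the connector version plus a hash of the configuration. It is only an illustration: `get_catalog`, `config_hash`, the in-memory `_catalog_cache`, and the `discover` callback are all hypothetical names, and a real implementation would persist the catalog to the database rather than a dict.

```python
import hashlib
import json
from typing import Callable, Dict, Tuple

# Hypothetical in-memory store standing in for a persisted table.
_catalog_cache: Dict[Tuple[str, str], dict] = {}


def config_hash(config: dict) -> str:
    """Return a stable hash of a connector config (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def get_catalog(
    connector_version: str,
    config: dict,
    discover: Callable[[dict], dict],
    force_refresh: bool = False,
) -> dict:
    """Return the stored catalog, calling discover only on a miss or forced refresh."""
    key = (connector_version, config_hash(config))
    if not force_refresh and key in _catalog_cache:
        return _catalog_cache[key]
    # Expensive call: only reached when version or config changed, or on force refresh.
    catalog = discover(config)
    _catalog_cache[key] = catalog
    return catalog
```

With this shape, an interrupted connection setup would hit the stored catalog on retry, and "force refresh schema" maps to `force_refresh=True`.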
Execution Plan
- Create schema to persist discover schema output #9896
- Persist discover schema output #9897
- Add caching to discover schema API endpoints #9900
- When setting up a connection, track what version of the schema was used. #9901
- Change the settings page in the UI to display the union of the schema and the configured schema #9902
- (optional) https://github.com/airbytehq/airbyte-internal-issues/issues/401
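
For the settings page item, the union could look roughly like the sketch below. It assumes streams can be identified by name alone, which is a simplification of the real catalog objects; `merge_streams` and the `selected` flag are hypothetical names used only for illustration.

```python
from typing import Dict, List


def merge_streams(discovered: List[str], configured: List[str]) -> List[Dict[str, object]]:
    """Union of discovered and configured stream names, flagging which are configured."""
    configured_set = set(configured)
    seen = set()
    merged: List[Dict[str, object]] = []
    # Discovered streams come first; configured streams that are missing from the
    # stored catalog are still kept so nothing disappears from the UI.
    for name in list(discovered) + list(configured):
        if name in seen:
            continue
        seen.add(name)
        merged.append({"name": name, "selected": name in configured_set})
    return merged


# The example from the description: stream2 stays visible, just unselected.
print(merge_streams(["stream1", "stream2"], ["stream1"]))
# [{'name': 'stream1', 'selected': True}, {'name': 'stream2', 'selected': False}]
```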