Skip to content

Split extension into an extension and a library package #580

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Apr 2, 2025
Merged

Conversation

cevian
Copy link
Collaborator

@cevian cevian commented Mar 21, 2025

Purpose

This PR separates the codebase into two distinct components:

  1. The Extension: Focused specifically on model-calling functionality
  2. The pgai Python Library: Houses database functionality installed via Python scripts, including
    vectorizer and text2sql functionality

The key advantage: Portability. The pgai Python library can now be used on any PostgreSQL
installation, including cloud platforms that don't support the pgai extensions.

Architecture

The pgai library installs database objects (tables, schemas, functions) through the pgai install
function using scripts compiled from the projects/pgai/db directory. The migration management follows
the extension's proven approach:

  • Idempotent SQL: Safe to run on each installation, primarily containing function definitions
  • Incremental SQL: Ordered migrations that evolve database objects sequentially, tracked via the
    ai.migration_app table

The approach of using the ai.migration_app table allows us to maintains a single upgrade script compatible with any previous version, eliminating the need for version-specific upgrade paths.

Migration Strategy

To transition from the previous monolithic extension:

  1. Code Relocation: Vectorizer SQL code (both idempotent and incremental) moved from extension to
    pgai/db. Since the vectorizer code was mostly in their own files, these files moved mostly unchanged.
  2. Divest Process: Created a script that cleanly dissociates vectorizer-related objects from the
    extension while preserving their functionality
  3. Migration Tracking: Recorded previously extension-applied vectorizer-related incremental files in
    ai.migration_app

After upgrading to the next extension version, vectorizer components remain intact but are no longer managed by the extension. The extension does not upgrade the vectorizer components before releasing them, since it now lack the required code. Instead, the pgai library can then independently upgrade these components regardless of which extension version previously created them.

@cevian cevian temporarily deployed to internal-contributors March 21, 2025 21:12 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 23, 2025 18:40 — with GitHub Actions Inactive
@@ -177,6 +184,10 @@ def psql_cmd(cmd: str) -> str:


def test_vectorizer_timescaledb():
with psycopg.connect(db_url("test")) as con:
with con.cursor() as cur:
cur.execute("create extension ai cascade")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand why we're installing the extension?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the thing that sends the ping to the lamda is still in the extension so on our cloud there is a dependency

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added logic to the install() function to set up the ai extension if executor_url is set

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we take an advisory lock before attempting install/upgrade? bail if we don't get it.

);

insert into ai.app_version ("name", version)
values ('ai', '__version__') on conflict ("name") do update set version = excluded.version;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check the version? If the version we are trying to install/upgrade is already present, abort? If you're running N apps in a load balancer, we probably want the first to win and the rest to skip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was committed by mistake in the first place? Seems inappropriate on a library that can run on many python versions

-> leaving ai.execute_vectorizer in extension
-> ai._vectorizer_job which calls ai.execute_vectorizer is moved to dbapp

-> creating the index still needs to be handled (probably through the worker)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work for lambda-based workers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For lamba stuff it's done in the timescaledb job. That actually already works

@cevian cevian temporarily deployed to internal-contributors March 28, 2025 16:56 — with GitHub Actions Inactive
@cevian cevian marked this pull request as ready for review March 28, 2025 16:57
@cevian cevian requested a review from a team as a code owner March 28, 2025 16:57
@cevian cevian temporarily deployed to internal-contributors March 28, 2025 17:21 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 28, 2025 21:53 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 28, 2025 21:59 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 28, 2025 22:01 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 17:52 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 18:10 — with GitHub Actions Inactive
@cevian cevian requested review from jgpruitt and JamesGuthrie March 30, 2025 18:12
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 18:25 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 18:26 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 18:36 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 18:37 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 18:55 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 18:57 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 30, 2025 19:30 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors March 31, 2025 01:03 — with GitHub Actions Inactive
Comment on lines 539 to 562
def error_if_pre_release() -> None:
# Note: released versions always have the output sql file
# commited into the repository.
output_file = output_sql_file()
command = (
"just ext build-install"
if "ROOT_JUSTFILE" in os.environ
else "just build-install"
if "PROJECT_JUSTFILE" in os.environ
else "python3 build.py build-install"
)
if not Path(output_file).exists():
print(
textwrap.dedent(f"""
WARNING: You're trying to install a pre-release version of pgai.
This is not supported, and there is no upgrade path.

Instead, install an official release from https://github.com/timescale/pgai/releases.

If you are certain that you want to install a pre-release version, run:
`{command}`
""")
)
exit(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is quite right. Also, I'm not sure it's necessary. People won't be using the just commands to install the SQL anymore. Maybe this check needs to go into the Python library?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this was dead code. I removed it and added a check in the python code.

default="postgres://postgres@localhost:5432/postgres",
show_default=True,
)
def install(db_url: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is upgrading the same as installing over an existing installation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes made

@cevian cevian temporarily deployed to internal-contributors April 1, 2025 01:20 — with GitHub Actions Inactive
@JamesGuthrie JamesGuthrie temporarily deployed to internal-contributors April 1, 2025 09:54 — with GitHub Actions Inactive
Copy link
Member

@JamesGuthrie JamesGuthrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but approving to unblock.

raise notice '%', _sql;
execute _sql;

if _rec.relname != 'vectorizer_id_seq' THEN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not change the ownership of vectorizer_id_seq?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a code comment. The reason is it's linked to the vectorizer table so you change the owner of the table instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes aren't tested by CI because we use pull_request_target to build it. I attempted to run just pgai db ci locally, and it didn't pass, so I don't have confidence that these tests will succeed when this merges.

Copy link
Contributor

@Askir Askir Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I (accidentally) ran them by creating a PR to this branch and they are red: https://github.com/timescale/pgai/actions/runs/14193958551/job/39764748178?pr=598

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run manually through workflow_dispatch. everything passes now https://github.com/timescale/pgai/actions/runs/14205814536

[
"uv run --no-project pgspot --ignore-lang=plpython3u",
'--proc-without-search-path "ai._vectorizer_job(job_id integer,config pg_catalog.jsonb)"',
f"{output_sql_file()}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed a commit with this fix.

USER root

RUN pip install --break-system-packages uv==0.6.3
RUN mkdir /py/ && uv venv --directory /py/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing correctly setting up dependencies in the container.

Suggested change
RUN mkdir /py/ && uv venv --directory /py/
COPY pyproject.toml uv.lock /py/
RUN uv sync --directory /py --no-install-project --only-dev --frozen

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I do that in the just command docker-sync now. Let's review after merge.

just docker-shell

//inside shell 1
uv pip install fastapi[standard]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the suggested change to the dockerfile, this becomes redundant.

Suggested change
uv pip install fastapi[standard]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah these were just outdated docs docker-sync takes care of this

Comment on lines 52 to 53
uv sync --active
cd ../extension && just build && just install && cd ../pgai/db
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should eliminate these extra manual steps required to run the tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah outdate docs this is already not necessary. fixing doc

, installed_at timestamptz not null default pg_catalog.clock_timestamp()
);

--check if the app has already been installed, error if so
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why error? Because it's the only way to stop with the rest of the installation, or because we actually see this as being erroneous?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes because its the only way to stop the install

@cevian cevian temporarily deployed to internal-contributors April 1, 2025 19:04 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors April 1, 2025 19:37 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors April 1, 2025 20:13 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors April 1, 2025 20:14 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors April 1, 2025 20:53 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors April 1, 2025 20:53 — with GitHub Actions Inactive
@cevian cevian requested a review from jgpruitt April 1, 2025 20:54
@jgpruitt jgpruitt temporarily deployed to internal-contributors April 1, 2025 22:15 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors April 2, 2025 15:29 — with GitHub Actions Inactive
@cevian cevian temporarily deployed to internal-contributors April 2, 2025 15:29 — with GitHub Actions Inactive
@cevian cevian merged commit 3fe83c6 into main Apr 2, 2025
13 checks passed
@cevian cevian deleted the mat/repackage branch April 2, 2025 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants