-
Notifications
You must be signed in to change notification settings - Fork 2.5k
indexing - postgres record manager inconsistently identifies which documents already exist, causing error #4851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hey @arrested-developer! Great to see you back. Running into another quirky challenge, I see? Let's dive into this one together. Thank you for the detailed report. The issue you're experiencing seems to be related to how the The const query = `
SELECT k, (key is not null) ex from unnest(ARRAY[${arrayPlaceholders}]) k left join ${this.finalTableName} on k=key and namespace = $1;
`; The PostgreSQL query used in the The The method then maps over the rows returned by the query, extracting the boolean value ( Without more context on how the results of the I hope this helps clarify the issue. If you have any more information or context that could help identify the problem, please share it. Sources
|
@dosu-bot I have validated that the res.rows response from postgres is correct, in that it correctly identifies which rows have a key that already exists. However it is in an order which is not consistent with the array of I would suggest, for safety, mapping over the |
Thank you for providing such a detailed description of the issue you're facing. It seems like the Your suggestion to map over the Here's how you can modify the async exists(keys: string[]): Promise<boolean[]> {
if (keys.length === 0) {
return [];
}
// Generate a series of values that represent the order of the keys
const keysWithOrder = keys.map((key, index) => ({ key, order: index + 1 }));
// Convert the array of keys with their order into a PostgreSQL array literal
const keysOrderArrayLiteral = `ARRAY[${keysWithOrder.map(k => `('${k.key}', ${k.order})`).join(", ")}]`;
const query = `
WITH ordered_keys AS (
SELECT * FROM unnest(${keysOrderArrayLiteral}) AS t(key TEXT, order_id INT)
)
SELECT ok.order_id, (r.key IS NOT NULL) AS ex
FROM ordered_keys ok
LEFT JOIN ${this.finalTableName} r ON ok.key = r.key AND r.namespace = $1
ORDER BY ok.order_id;
`;
const res = await this.pool.query(query, [this.namespace]);
// Since the results are ordered by the order_id, they will match the order of the original keys array
return res.rows.map((row: { ex: boolean }) => row.ex);
} This modification introduces a common table expression (CTE) named This approach should effectively address the issue of ensuring the order of results matches the order of the keys array, which can be particularly useful when the order of keys is significant for the application logic. Please let me know if this solution works for you or if you need further assistance. Sources
|
@dosu-bot I like the suggestion of somehow better preserving order in the postgres query ideally the values of the ordered keys would be escaped. this also does not produce a valid query, for me it produced the following error:
|
The error you're encountering, 'function return row and query-specified return row do not match', typically occurs when the structure of the data returned by a PostgreSQL function (or in this case, a query) does not match the expected structure, either in terms of the number of columns or their data types. In your modified The original To fix the error, ensure that any modifications to this query or its processing in TypeScript still produce an array of boolean values corresponding to the existence of each key in the order they were provided. Specifically:
Without the specific modifications you've made to the For reference, here is the original SELECT k, (key is not null) ex from unnest(ARRAY[${arrayPlaceholders}]) k left join ${this.finalTableName} on k=key and namespace = $1; This query unnests the array of keys into a set of rows ( You can find the original Sources
|
Hey @arrested-developer , Thanks for bringing this up! You're right that the order is not preserved. Curiously this has not really come up for me, but I can reproduce it. I'm going to do some more testing (also for performance), but something like the following query should work: WITH ordered_keys AS (
SELECT * FROM unnest(ARRAY[${arrayPlaceholders}]) WITH ORDINALITY as t(key, o)
)
SELECT ok.key, (r.key IS NOT NULL)
FROM ordered_keys ok
LEFT JOIN ${this.finalTablename} r
ON r.key = ok.key
AND namespace = $1
ORDER BY ok.o; Might have to compare with doing the order in post-processing to see if it's faster than pure SQL. |
@MJDeligan I'm experiencing the same issue so please share if you find a solution! |
@MJDeligan query needed a little cleaning up - specifically the casing on WITH ordered_keys AS (
SELECT * FROM unnest(ARRAY[${arrayPlaceholders}]) WITH ORDINALITY as t(key, o)
)
SELECT ok.key, (r.key IS NOT NULL) ex
FROM ordered_keys ok
LEFT JOIN ${this.finalTableName} r
ON r.key = ok.key
AND namespace = $1
ORDER BY ok.o; This seems to be working nicely for me, the postgres response appears to be in order now. Thanks! I'm happy to contribute to testing/raising a PR if you like - let me know. |
Sorry for the delay. Sure that would be great! |
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
Description
I'm trying to use pgvector, with langchain's provided record manager and indexer. I'm using a customised loader to load in pages from the confluence API.
I have had a really hard time trying to recreate this with mocked data. However consistently when loading from the confluence API, after first indexing all documents successfully, then modifying one page in confluence, publishing it and running a full index again, the second one crashes as it is trying to insert a document which already exists.
As far as I can see, the postgres query which is used for the
exists
method is returning the results out of order. These results are then returned as boolean values, and so some documents are incorrectly inserted when they should be skipped. this results in an error due to duplicate IDs.System Info
node v20.11.1
yarn v1.22.22
ankane/pgvector:latest
MacOS 14.4
The text was updated successfully, but these errors were encountered: