Skip to content

indexing - postgres record manager inconsistently identifies which documents already exist, causing error #4851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
5 tasks done
arrested-developer opened this issue Mar 21, 2024 · 9 comments · Fixed by #5533
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@arrested-developer
Copy link
Contributor

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import { PostgresRecordManager } from '@langchain/community/indexes/postgres';
import { index } from 'langchain/indexes';
import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';
import { PoolConfig } from 'pg';
import { OpenAIEmbeddings } from '@langchain/openai';

import { ConfluencePagesLoader, ConfluencePagesLoaderParams } from '@canx-tools/document-loaders';

// I assume you can provide your own values for these
import { openAIEmbeddingsParams, dbUri, confluenceUser, confluenceAccessToken } from './env';

// for some reason, this occasionally fails with an error on the ID constraint for the documents table...
// on closer inspection, the postgres query used to find if the documents already exist is returning results out of order

const main = async () => {
    const postgresConfig: PoolConfig = {
        connectionString: dbUri,
    };

    const vectorStoreConfig = {
        postgresConnectionOptions: postgresConfig,
        tableName: 'documents_confluence',
        columns: {
            idColumnName: 'id',
            vectorColumnName: 'vector',
            contentColumnName: 'content',
            metadataColumnName: 'metadata',
        },
    };

    const vectorStore = await PGVectorStore.initialize(new OpenAIEmbeddings(openAIEmbeddingsParams), vectorStoreConfig);

    // Create a new record manager
    const recordManagerConfig = {
        postgresConnectionOptions: postgresConfig,
        tableName: 'record_manager_records',
    };

    const recordManager = new PostgresRecordManager('confluence', recordManagerConfig);

    // Create the schema if it doesn't exist
    await recordManager.createSchema();

    const confluenceLoaderParams: ConfluencePagesLoaderParams = {
        baseUrl: 'REDACTED',
        rootPageId: '2347204666',
        spaceKey: 'EN',
        username: confluenceUser,
        accessToken: confluenceAccessToken,
    };
    const confluenceLoader = new ConfluencePagesLoader(confluenceLoaderParams);

    const docs = await confluenceLoader.loadAndSplit();

    try {
        const report = await index({
            docsSource: docs,
            recordManager,
            vectorStore,
            options: {
                cleanup: 'full',
                sourceIdKey: 'id',
            },
        });
        console.log(report);
        console.log('done!');
    } catch (err) {
        console.error(err);
        process.exit(1);
    }
};

main().catch((error) => {
    console.error(error);
    process.exit(1);
});

Error Message and Stack Trace (if applicable)

error: duplicate key value violates unique constraint "documents_confluence_pkey"
    at /Users/michael/beamery/canx-tools/node_modules/pg-pool/index.js:45:11
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at PGVectorStore.addVectors (/Users/michael/beamery/canx-tools/node_modules/@langchain/community/dist/vectorstores/pgvector.cjs:331:17)
    at index (/Users/michael/beamery/canx-tools/node_modules/langchain/dist/indexes/indexing.cjs:229:13)
    at main (/Users/michael/beamery/canx-tools/apps/index-sources/src/main.ts:63:24) {
  length: 260,
  severity: 'ERROR',
  code: '23505',
  detail: 'Key (id)=(63ccb6a2-754d-5596-aeb6-e1f5d09e1398) already exists.',
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: undefined,
  schema: 'public',
  table: 'documents_confluence',
  column: undefined,
  dataType: undefined,
  constraint: 'documents_confluence_pkey',
  file: 'nbtinsert.c',
  line: '664',
  routine: '_bt_check_unique'
}

Error: Error inserting: duplicate key value violates unique constraint "documents_confluence_pkey"
    at PGVectorStore.addVectors (/Users/michael/beamery/canx-tools/node_modules/@langchain/community/dist/vectorstores/pgvector.cjs:335:23)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at index (/Users/michael/beamery/canx-tools/node_modules/langchain/dist/indexes/indexing.cjs:229:13)
    at main (/Users/michael/beamery/canx-tools/apps/index-sources/src/main.ts:63:24)

Description

I'm trying to use pgvector, with langchain's provided record manager and indexer. I'm using a customised loader to load in pages from the confluence API.

I have had a really hard time trying to recreate this with mocked data. However consistently when loading from the confluence API, after first indexing all documents successfully, then modifying one page in confluence, publishing it and running a full index again, the second one crashes as it is trying to insert a document which already exists.

As far as I can see, the postgres query which is used for the exists method is returning the results out of order. These results are then returned as boolean values, and so some documents are incorrectly inserted when they should be skipped. this results in an error due to duplicate IDs.

System Info

{
"@langchain/community": "^0.0.41",
    "@langchain/core": "^0.1.48",
    "@langchain/openai": "^0.0.22",
    "langchain": "^0.1.28"
}

node v20.11.1
yarn v1.22.22
ankane/pgvector:latest
MacOS 14.4

Copy link

dosubot bot commented Mar 21, 2024

Hey @arrested-developer! Great to see you back. Running into another quirky challenge, I see? Let's dive into this one together.

Thank you for the detailed report. The issue you're experiencing seems to be related to how the exists method in the PostgresRecordManager class is handling the results returned from the PostgreSQL query.

The exists method constructs a query that uses the unnest function with an array of keys and performs a left join on the table to check the existence of these keys. The query itself does not specify an order for the results:

const query = `
  SELECT k, (key is not null) ex from unnest(ARRAY[${arrayPlaceholders}]) k left join ${this.finalTableName} on k=key and namespace = $1;
  `;

The PostgreSQL query used in the exists method should return results in the order of the keys as they were provided in the input array, because the unnest function in combination with a left join should preserve the order of the input array. If the results seem out of order, it could be due to how the results are processed after the query or due to an assumption elsewhere in the system that does not hold.

The exists method in the PostgresRecordManager class handles the results returned from the PostgreSQL query by executing a SQL query that checks if each key in the provided keys array exists in the table associated with the PostgresRecordManager instance. It does this by using the unnest function to transform the array of keys into a set of rows, then performing a left join on the table using the key and namespace as join conditions. The query selects a boolean value indicating whether each key exists in the table ((key is not null) ex). This boolean value is determined by checking if the key column in the join result is not null, which would indicate that a matching row was found in the table.

The method then maps over the rows returned by the query, extracting the boolean value (ex) from each row, and returns an array of these boolean values. Each boolean value in the returned array corresponds to a key in the input keys array, indicating whether that key exists in the table.

Without more context on how the results of the exists method are used for indexing documents and how duplicate ID errors are being encountered, it's challenging to pinpoint the exact cause of the issue. The code for handling the query results and the integration with the Confluence API would be critical to analyze for identifying the root cause of the problem.

I hope this helps clarify the issue. If you have any more information or context that could help identify the problem, please share it.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dosubot dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Mar 21, 2024
@arrested-developer
Copy link
Contributor Author

@dosu-bot I have validated that the res.rows response from postgres is correct, in that it correctly identifies which rows have a key that already exists. However it is in an order which is not consistent with the array of keys provided to the exists method.

I would suggest, for safety, mapping over the keys array and using the data from the postgres query response to produce the boolean value which corresponds to each key.

Copy link

dosubot bot commented Mar 22, 2024

Thank you for providing such a detailed description of the issue you're facing. It seems like the exists method in the PostgresRecordManager class is not returning the results in the same order as the keys provided, which is causing a unique constraint violation error when trying to index documents.

Your suggestion to map over the keys array and use the data from the PostgreSQL query response to produce the boolean value which corresponds to each key is a good one. However, I'd like to propose an alternative solution that involves modifying the SQL query used in the exists method to ensure the order of results matches the order of the keys array.

Here's how you can modify the exists method:

async exists(keys: string[]): Promise<boolean[]> {
  if (keys.length === 0) {
    return [];
  }

  // Generate a series of values that represent the order of the keys
  const keysWithOrder = keys.map((key, index) => ({ key, order: index + 1 }));
  // Convert the array of keys with their order into a PostgreSQL array literal
  const keysOrderArrayLiteral = `ARRAY[${keysWithOrder.map(k => `('${k.key}', ${k.order})`).join(", ")}]`;

  const query = `
    WITH ordered_keys AS (
      SELECT * FROM unnest(${keysOrderArrayLiteral}) AS t(key TEXT, order_id INT)
    )
    SELECT ok.order_id, (r.key IS NOT NULL) AS ex
    FROM ordered_keys ok
    LEFT JOIN ${this.finalTableName} r ON ok.key = r.key AND r.namespace = $1
    ORDER BY ok.order_id;
  `;
  const res = await this.pool.query(query, [this.namespace]);
  // Since the results are ordered by the order_id, they will match the order of the original keys array
  return res.rows.map((row: { ex: boolean }) => row.ex);
}

This modification introduces a common table expression (CTE) named ordered_keys that represents each key along with an order_id that indicates its position in the original array. The keys are then left joined with the records in the table based on the key and namespace. The final result is ordered by order_id, ensuring that the order of the results matches the order of the keys in the provided array.

This approach should effectively address the issue of ensuring the order of results matches the order of the keys array, which can be particularly useful when the order of keys is significant for the application logic.

Please let me know if this solution works for you or if you need further assistance.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@arrested-developer
Copy link
Contributor Author

@dosu-bot I like the suggestion of somehow better preserving order in the postgres query

ideally the values of the ordered keys would be escaped. this also does not produce a valid query, for me it produced the following error:

error: function return row and query-specified return row do not match
    at /Users/michael/beamery/canx-tools/node_modules/pg-pool/index.js:45:11
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at PostgresRecordManager.exists (/Users/michael/beamery/canx-tools/node_modules/@langchain/community/dist/indexes/postgres.cjs:137:21)
    at index (/Users/michael/beamery/canx-tools/node_modules/langchain/dist/indexes/indexing.cjs:205:29)
    at main (/Users/michael/beamery/canx-tools/apps/index-sources/src/main.ts:63:24) {
  length: 194,
  severity: 'ERROR',
  code: '42804',
  detail: 'Returned type unknown at ordinal position 1, but query expects text.',
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: undefined,
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'execSRF.c',
  line: '964',
  routine: 'tupledesc_match'
}

Copy link

dosubot bot commented Mar 22, 2024

The error you're encountering, 'function return row and query-specified return row do not match', typically occurs when the structure of the data returned by a PostgreSQL function (or in this case, a query) does not match the expected structure, either in terms of the number of columns or their data types.

In your modified exists method, you're trying to return a result set with two columns (ok.order_id and ex), but the error message suggests that the query is expecting a single column of type text. This discrepancy is likely causing the error.

The original exists method in the PostgresRecordManager class is designed to return a Promise<boolean[]>, indicating the existence of each key in the database. If you've modified this method to ensure the order of results matches the order of the keys array, it's crucial that the modifications do not alter the expected result set structure, which should be an array of booleans corresponding to each key.

To fix the error, ensure that any modifications to this query or its processing in TypeScript still produce an array of boolean values corresponding to the existence of each key in the order they were provided. Specifically:

  1. Maintain the Result Structure: Ensure that the modifications to the query or the subsequent processing of res.rows do not alter the expected structure of the result. The final result should be an array of booleans.

  2. Order Preservation: If you're adding logic to preserve the order of keys, ensure that this logic does not interfere with the data structure expected by the rest of the method. One approach to maintaining order is to use the original order of the keys array to sort the results before returning them, rather than trying to enforce order through the SQL query itself.

  3. Debugging Tip: Log the structure of res.rows before mapping it to the array of booleans. This can help you identify if and how the structure differs from the expected array of booleans.

Without the specific modifications you've made to the exists method, these are general guidelines to address the error. Ensure that the query's output and the TypeScript processing logic align with these expectations.

For reference, here is the original exists method query:

SELECT k, (key is not null) ex from unnest(ARRAY[${arrayPlaceholders}]) k left join ${this.finalTableName} on k=key and namespace = $1;

This query unnests the array of keys into a set of rows (k), then performs a left join on the table to check if each key exists. The result set includes a column for the key (k) and a boolean column (ex) indicating the existence of the key.

You can find the original exists method in the PostgresRecordManager class in the LangChain JS codebase.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@MJDeligan
Copy link
Contributor

Hey @arrested-developer ,

Thanks for bringing this up! You're right that the order is not preserved. Curiously this has not really come up for me, but I can reproduce it.

I'm going to do some more testing (also for performance), but something like the following query should work:

WITH ordered_keys AS (
    SELECT * FROM unnest(ARRAY[${arrayPlaceholders}]) WITH ORDINALITY as t(key, o)
)
SELECT ok.key, (r.key IS NOT NULL) 
FROM ordered_keys ok 
LEFT JOIN ${this.finalTablename} r 
ON r.key = ok.key 
AND namespace = $1
ORDER BY ok.o;

Might have to compare with doing the order in post-processing to see if it's faster than pure SQL.

@awartoft
Copy link

@MJDeligan I'm experiencing the same issue so please share if you find a solution!

@arrested-developer
Copy link
Contributor Author

arrested-developer commented Mar 26, 2024

@MJDeligan query needed a little cleaning up - specifically the casing on this.finalTableName and the column name ex

WITH ordered_keys AS (
    SELECT * FROM unnest(ARRAY[${arrayPlaceholders}]) WITH ORDINALITY as t(key, o)
)
SELECT ok.key, (r.key IS NOT NULL) ex
FROM ordered_keys ok 
LEFT JOIN ${this.finalTableName} r 
ON r.key = ok.key 
AND namespace = $1
ORDER BY ok.o;

This seems to be working nicely for me, the postgres response appears to be in order now. Thanks!

I'm happy to contribute to testing/raising a PR if you like - let me know.

@MJDeligan
Copy link
Contributor

Sorry for the delay. Sure that would be great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature
Projects
None yet
3 participants