draft of decoupling #4893


Open · wants to merge 19 commits into main

Conversation

@evan-onyx (Contributor) commented Jun 13, 2025

Description

Addresses https://linear.app/danswer/issue/DAN-2151/decoupled-indexing-pipeline
Note: the following summary describes a single connector indexing run, ignoring the mechanisms we use to prevent race conditions, overlapping connector runs, etc.

Pre-this-PR indexing pipeline:
0. The "check for indexing" task runs and calls "try_creating_indexing_task", which spawns a new process for a newly scheduled index attempt. That new process (which runs the connector_indexing_task function) does the following:

  1. determines the parameters of the indexing attempt (which connector indexing function to run, the start and end time, whether to resume from a previous checkpoint), then runs that connector. Connectors are responsible for reading data from an outside source and converting it into Onyx documents. At the moment these two steps (reading external data and converting to Onyx documents) are not parallelized in most connectors; that's a subject for future work
  2. upserts documents to Postgres (index_doc_batch_prepare)
  3. chunks each document (optionally adding context for contextual RAG)
  4. embeds chunks (embed_chunks_with_failure_handling) via a call to the model server
  5. writes chunks to Vespa (write_chunks_to_vector_db_with_backoff)
  6. updates document and indexing metadata in Postgres

Note that steps 1-6 all run in the same spawned process. In this (draft) PR, we decouple step 1 from steps 2-6 so that they can run in parallel. At a high level, the approach is to run step 1 in a task that writes its results to a blob store, and to run steps 2-6 in an independent task that reads from that blob store as its input.
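To make the shape of the split concrete, here is a minimal sketch of the two tasks (all names here, such as DocumentBatch, docfetching_task, storage.write, and pipeline.index, are illustrative assumptions rather than the exact code in this PR):

import json
from dataclasses import asdict, dataclass

@dataclass
class DocumentBatch:
    index_attempt_id: int
    batch_num: int
    documents: list[dict]

def docfetching_task(connector, storage, enqueue_indexing_task) -> None:
    # step 1: run the connector and hand each batch off via the blob store
    for batch_num, documents in enumerate(connector.load_documents()):
        batch = DocumentBatch(
            index_attempt_id=connector.index_attempt_id,
            batch_num=batch_num,
            documents=[doc.to_dict() for doc in documents],
        )
        # write the batch to the blob store (local file store, S3, etc.)
        storage.write(f"batch_{batch_num}.json", json.dumps(asdict(batch)))
        # kick off an independent task that will pick this batch up
        enqueue_indexing_task(connector.index_attempt_id, batch_num)

def document_indexing_pipeline_task(storage, pipeline, batch_num: int) -> None:
    # steps 2-6: read a batch back from the blob store and index it
    raw = storage.read(f"batch_{batch_num}.json")
    batch = DocumentBatch(**json.loads(raw))
    pipeline.index(batch.documents)  # upsert, chunk, embed, write to Vespa
    storage.delete(f"batch_{batch_num}.json")  # clean up intermediate storage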

An early implementation had the following issues:
a) both new tasks are run directly with send_task
b) the blob store implementations cover only local file storage and S3; Azure and GCP still need to be implemented. Also need to check the actual implementation code to make sure it leverages existing code
c) several minor features from the old implementation are missing, most notably the indexing callback. It isn't clear to me how to restructure that callback across multiple new tasks (perhaps each task gets its own callback, with information synchronized by index attempt and tenant id?)

These have now been addressed. Remaining issues:
a) some of the code is located in the wrong files, and there is some dead code (_run_indexing in particular, since I was using it as a reference)
b) the error raised when pausing a connector mid-run isn't caught correctly, leading to an "error" state when the attempt should be marked as cancelled
c) need to test with multi-tenant deployments
d) want to test with a variety of connectors and varying levels of parallelism
e) TODO: tell docfetching to back off if the indexing jobs are piling up. Will also need logic for fully stopping docfetching and marking the attempt as an error if, for example, the indexing tasks never pick anything up; a rough sketch of the back-off idea follows this list
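One possible shape for that back-off (purely a sketch; count_pending_batches, mark_attempt_failed, and the thresholds are assumptions, not anything implemented in this PR):

import time

MAX_OUTSTANDING_BATCHES = 50    # assumed threshold for "indexing is falling behind"
STALL_LIMIT_SECONDS = 6 * 3600  # assumed limit for "nothing is being picked up"

def wait_for_indexing_capacity(count_pending_batches, mark_attempt_failed) -> bool:
    # returns True when it is OK to fetch more documents, False if we gave up
    stall_start = time.monotonic()
    while count_pending_batches() >= MAX_OUTSTANDING_BATCHES:
        if time.monotonic() - stall_start > STALL_LIMIT_SECONDS:
            mark_attempt_failed("indexing tasks are not picking up batches")
            return False
        time.sleep(30)  # back off before checking again
    return True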

How Has This Been Tested?

N/A but planning to test across the main connectors in the UI and with all blob store solutions

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

vercel bot commented Jun 13, 2025

The latest updates on your projects:

internal-search: ✅ Ready (preview updated Jul 12, 2025, 2:57am UTC)

@@ -393,7 +399,9 @@ def _run_indexing(
index_attempt: IndexAttempt | None = None
try:
with get_session_with_current_tenant() as db_session_temp:
index_attempt = get_index_attempt(db_session_temp, index_attempt_id)
index_attempt = get_index_attempt(
Contributor:

this eager load stuff seems to be exposing some DB behavior ... maybe a bit problematic

@greptile-apps (bot) left a comment:

PR Summary

Major architectural change to decouple document fetching from processing in the indexing pipeline, allowing parallel execution through a new blob store intermediary layer.

  • Split monolithic indexing process into docfetching_task (fetches source documents) and document_indexing_pipeline_task (handles processing/embedding/storage), coordinated via Redis locks
  • Added new DocumentBatchStorage system using FileStore abstraction to manage intermediate document storage between fetching and processing stages
  • Implemented robust error handling and state management for the decoupled tasks, including proper cleanup of temporary storage
  • Updated run_indexing.py to support the new pipeline architecture with separate DocExtractionContext and DocIndexingContext models
  • Simplified session handling in S3BackedFileStore by moving db_session from constructor to method parameters for better dependency injection

35 files reviewed, 9 comments

Comment on lines 20 to 21
# Default parameters for creation
DEFAULT_KWARGS = {
"http2": True,
"limits": lambda: httpx.Limits(),
}

Contributor:

style: Remove empty comment and blank line as they no longer serve a purpose since DEFAULT_KWARGS was moved

@@ -66,6 +66,7 @@
POSTGRES_CELERY_WORKER_LIGHT_APP_NAME = "celery_worker_light"
POSTGRES_CELERY_WORKER_HEAVY_APP_NAME = "celery_worker_heavy"
POSTGRES_CELERY_WORKER_INDEXING_APP_NAME = "celery_worker_indexing"
POSTGRES_CELERY_WORKER_docfetching_APP_NAME = "celery_worker_docfetching"
Contributor:

style: Use consistent casing ('docfetching' should be 'DOCFETCHING')

Suggested change
POSTGRES_CELERY_WORKER_docfetching_APP_NAME = "celery_worker_docfetching"
POSTGRES_CELERY_WORKER_DOCFETCHING_APP_NAME = "celery_worker_docfetching"

Comment on lines +389 to +392
def parallel_yield_from_funcs(
funcs: list[Callable[..., R]],
max_workers: int = 10,
) -> Iterator[R]:
Contributor:

style: Consider allowing args and kwargs parameters to support functions that take arguments. Currently only supports nullary functions.

Suggested change
def parallel_yield_from_funcs(
funcs: list[Callable[..., R]],
max_workers: int = 10,
) -> Iterator[R]:
def parallel_yield_from_funcs(
funcs: list[Callable[[], R]],
max_workers: int = 10,
) -> Iterator[R]:
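If the nullary-only signature is kept, callers whose functions take arguments can still adapt them with functools.partial (illustrative usage only; fetch_batch is a made-up stand-in):

from functools import partial

def fetch_batch(batch_id: int) -> str:
    # stand-in for real per-batch work that needs an argument
    return f"batch {batch_id} done"

batch_ids = [1, 2, 3]
# wrap argument-taking functions into nullary callables before passing them in
funcs = [partial(fetch_batch, batch_id) for batch_id in batch_ids]
for result in parallel_yield_from_funcs(funcs, max_workers=4):
    print(result)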

Comment on lines 100 to 101
"onyx.background.celery.tasks.docfetching",
"onyx.background.celery.tasks.indexing", # TODO: remove this and move the task to docfetching
Contributor:

logic: The indexing tasks dependency creates coupling that contradicts the PR's decoupling goal. Need to complete the TODO task migration before merging

Contributor:

+1

Comment on lines +1082 to +1119
# TODO: change to doc extraction if it doesnt break things
callback.progress("_run_indexing", 0)
Contributor:

logic: Callback still references '_run_indexing' but should be updated to 'doc_extraction' for clarity

Contributor:

+1

Contributor:

Also should docfetching specific stuff be moved into a docfetching folder?

if batches_processed > last_batches_completed:
last_batches_completed = batches_processed
last_progress_time = time.monotonic()
elif time.monotonic() - last_progress_time > 3600 * 6:
Contributor:

style: 6 hour timeout hardcoded - consider making this configurable via constant or config
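For example, the threshold could be lifted into a named constant or config value (hypothetical plumbing; the PR may expose this differently, e.g. through Onyx's configs module):

import os

DOCFETCHING_STALL_TIMEOUT_SECONDS = int(
    os.environ.get("DOCFETCHING_STALL_TIMEOUT_SECONDS", str(6 * 3600))
)

# the hardcoded check would then become:
# elif time.monotonic() - last_progress_time > DOCFETCHING_STALL_TIMEOUT_SECONDS: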

Contributor:

+1

Comment on lines +155 to +139
if not self.file_store.has_file(
file_id=file_name,
file_origin=FileOrigin.OTHER,
file_type="application/json",
):
Contributor:

style: file_type should be a constant; define FILE_TYPE = 'application/json' at the class level

Comment on lines 278 to 279
except Exception as e:
logger.warning(f"Failed to delete extraction state: {e}")
Contributor:

logic: error message indicates 'extraction state' but this is in a loop over all state types

Suggested change
except Exception as e:
logger.warning(f"Failed to delete extraction state: {e}")
except Exception as e:
logger.warning(f"Failed to delete {state_type.value} state: {e}")

Contributor:

prefer logger.exception here (prints out the stack trace nicely)

Comment on lines +397 to 407
raise

Contributor:

logic: The bare except block here could mask important errors. Consider handling specific exceptions (like ClientError for S3 operations and SQLAlchemyError for database operations) separately.
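A sketch of the narrower handling being suggested (imports and structure are assumptions; cleanup_batches and its delete_document_batches callback are stand-ins for the real body):

import logging

from botocore.exceptions import ClientError
from sqlalchemy.exc import SQLAlchemyError

logger = logging.getLogger(__name__)

def cleanup_batches(delete_document_batches) -> None:
    try:
        delete_document_batches()
    except ClientError:
        logger.exception("S3 error while cleaning up document batch storage")
        raise
    except SQLAlchemyError:
        logger.exception("Database error while cleaning up document batch storage")
        raise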

@Weves (Contributor) left a comment:

Let's discuss how we want to do the monitoring, but overall looks good 🧐

Comment on lines 100 to 101
"onyx.background.celery.tasks.docfetching",
"onyx.background.celery.tasks.indexing", # TODO: remove this and move the task to docfetching
Contributor:

+1

@@ -538,6 +538,27 @@ slackbot:
limits:
cpu: "1000m"
memory: "2000Mi"
celery_worker_docfetching:
Contributor:

missing actual worker for this

Comment on lines 278 to 279
except Exception as e:
logger.warning(f"Failed to delete extraction state: {e}")
Contributor:

prefer logger.exception here (prints out the stack trace nicely)


# Get batch storage (transition to IN_PROGRESS is handled by run_indexing_entrypoint)
with get_session_with_current_tenant() as db_session:
batch_storage = get_document_batch_storage(
Contributor:

this seems slightly problematic (it's also in a few other places). The db session will no longer be valid outside of the with statement
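Roughly, the concern is the following (imports omitted; cleanup_all_batches is a hypothetical method used only to illustrate the pattern):

# problematic: batch_storage keeps a reference to db_session, which is closed
# the moment the `with` block exits
with get_session_with_current_tenant() as db_session:
    batch_storage = get_document_batch_storage(
        tenant_id, index_attempt_id, db_session
    )
batch_storage.cleanup_all_batches()  # may operate on a closed session

# one fix: keep every session-dependent operation inside the block (another is
# to have the storage open its own short-lived session per method call)
with get_session_with_current_tenant() as db_session:
    batch_storage = get_document_batch_storage(
        tenant_id, index_attempt_id, db_session
    )
    batch_storage.cleanup_all_batches()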

Comment on lines +1082 to +1119
# TODO: change to doc extraction if it doesnt break things
callback.progress("_run_indexing", 0)
Contributor:

+1

logger.error(f"Failed to store {state_type} state: {e}")
raise

def _get_state(self, state_type: str) -> DocumentStorageState | None:
Contributor:

should state_type be an Enum type?
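Something along these lines, for instance (hypothetical member names; the real set of state types lives in the PR):

from enum import Enum

class StorageStateType(str, Enum):
    EXTRACTION = "extraction"
    INDEXING = "indexing"

# _get_state(self, state_type: StorageStateType) -> DocumentStorageState | None
# would then accept the enum instead of a raw string, and state_type.value can
# still be used wherever the string form is needed (e.g. in log messages).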

)
# Get the document batch storage
with get_session_with_current_tenant() as db_session:
storage = get_document_batch_storage(tenant_id, index_attempt_id, db_session)
Contributor:

same comment about db_session

if batches_processed > last_batches_completed:
last_batches_completed = batches_processed
last_progress_time = time.monotonic()
elif time.monotonic() - last_progress_time > 3600 * 6:
Contributor:

+1

)

with get_session_with_current_tenant() as db_session:
storage = get_document_batch_storage(tenant_id, index_attempt_id, db_session)
Contributor:

same comment about db_session

tenant_id: str,
) -> int | None:
"""
TODO: update docstring to reflect docfetching
Contributor:

can we do this TODO now 🥺
