The RAG Pipeline: Document Indexing, Vector Storage, and Knowledge Retrieval

Advanced

Prerequisites

  • Articles 1-2: Architecture Overview and Request Flow
  • Basic understanding of text embeddings and vector similarity search
  • Familiarity with RAG (Retrieval-Augmented Generation) concepts

Retrieval-Augmented Generation is the backbone of knowledge-grounded LLM applications. In Dify, the RAG pipeline handles everything from ingesting a PDF upload to returning ranked document chunks during a chat conversation. This article traces both sides of that pipeline — the indexing path that transforms documents into searchable vectors, and the retrieval path that finds relevant knowledge at query time. Along the way, we'll encounter 30+ vector database adapters, multi-dataset routing strategies, and a parent-child chunking model that enables fine-grained retrieval.

The Document Indexing Pipeline

Document indexing is orchestrated by the IndexingRunner class, which coordinates a multi-stage pipeline:

flowchart TD
    Upload[Document Upload] --> Task[Celery Task<br/>document_indexing_task]
    Task --> Runner[IndexingRunner.run]
    Runner --> Extract[Extract Text<br/>PDF, DOCX, HTML, etc.]
    Extract --> Clean[Clean Text<br/>remove extras, normalize]
    Clean --> Split[Split Text<br/>FixedRecursiveCharacterTextSplitter]
    Split --> Embed[Generate Embeddings<br/>via ModelInstance]
    Embed --> Store[Store in Vector DB<br/>+ PostgreSQL metadata]
    Store --> Done[Update Document Status]

    Runner -->|Error| ErrorHandler[Mark Document as ERROR<br/>with error message]

The run() method iterates over documents, processing each through three stages:

  1. Extract — Delegates to format-specific extractors based on the document's data source type
  2. Transform — Applies cleaning rules and splits text into chunks using the dataset's processing rules
  3. Load — Generates embeddings and stores vectors along with metadata

Each document's status is tracked through an IndexingStatus enum (pending → indexing → completed/error/paused), with progress visible in the UI.
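The status transitions and error handling described above can be sketched as follows. This is an illustrative reduction, not Dify's actual code: the enum values match the article, but the function signature and the dict-based document are hypothetical stand-ins for the real ORM models.

```python
from enum import Enum

class IndexingStatus(str, Enum):
    """Mirrors the status values named in the article."""
    PENDING = "pending"
    INDEXING = "indexing"
    COMPLETED = "completed"
    ERROR = "error"
    PAUSED = "paused"

def index_document(document: dict, extract, transform, load) -> IndexingStatus:
    """Run one document through the three stages, recording status transitions."""
    document["status"] = IndexingStatus.INDEXING
    try:
        text = extract(document)      # format-specific extractor
        chunks = transform(text)      # clean + split per dataset rules
        load(chunks)                  # embed + store vectors and metadata
        document["status"] = IndexingStatus.COMPLETED
    except Exception as exc:
        # In the spirit of _handle_indexing_error(): persist the message for the UI.
        document["status"] = IndexingStatus.ERROR
        document["error"] = str(exc)
    return document["status"]
```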

The indexing task at tasks/document_indexing_task.py is dispatched to the dataset Celery queue. This separation means indexing work doesn't compete with real-time workflow execution for worker resources. The task uses a TenantIsolatedTaskQueue for fair scheduling across tenants.

Tip: When indexing fails with cryptic errors, check the Document.error field in the database. The _handle_indexing_error() method stores the error message there, which the UI displays to the user.

Text Extraction and Splitting Strategies

Dify supports text extraction from a wide range of formats through extractor classes under core/rag/extractor/:

| Format | Extractor | Notes |
| --- | --- | --- |
| PDF | PdfExtractor | Uses pypdf or Unstructured API |
| DOCX | WordExtractor | python-docx based |
| HTML | HtmlExtractor | BeautifulSoup parsing |
| Markdown | MarkdownExtractor | Custom parser |
| CSV/Excel | CsvExtractor, ExcelExtractor | Tabular data support |
| Notion | NotionExtractor | API integration |
| Website | WebsiteExtractor | Jina Reader / Firecrawl |

After extraction, the FixedRecursiveCharacterTextSplitter handles chunking. This splitter is configurable per dataset with parameters for:

  • Chunk size — maximum characters per chunk
  • Chunk overlap — number of characters shared between adjacent chunks
  • Separator — the splitting delimiter hierarchy (paragraphs → sentences → words)

There's also an EnhanceRecursiveCharacterTextSplitter that provides better handling of code blocks and structured content. The dataset's DatasetProcessRule model stores the active splitting configuration, allowing different datasets to use different strategies.
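The core idea of recursive character splitting can be sketched in a few lines. This is a simplified model, not the FixedRecursiveCharacterTextSplitter itself: it honors the separator hierarchy and chunk size but omits chunk overlap for brevity.

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse into any piece that is
    still over chunk_size using the remaining (finer) separators. The empty
    separator at the end means 'hard-cut by characters' as a last resort."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    if sep:
        pieces = text.split(sep)
    else:
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size and rest:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
            continue
        # Merge adjacent small pieces back together up to chunk_size.
        joined = (current + sep + piece) if current else piece
        if len(joined) <= chunk_size:
            current = joined
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

Splitting paragraph-first keeps semantically coherent units together, falling back to sentence- and word-level cuts only when a paragraph exceeds the budget.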

Vector Database Adapter Pattern

One of Dify's most impressive engineering feats is its support for 32 vector databases through a common interface. The VectorType enum lists them all:

classDiagram
    class BaseVector {
        <<abstract>>
        +create(texts, embeddings)
        +delete()
        +search_by_vector(query_vector)
        +search_by_full_text(query)
    }

    class QdrantVector
    class WeaviateVector
    class MilvusVector
    class PgVectorVector
    class ElasticsearchVector
    class ChromaVector
    class OceanBaseVector
    class TencentVector

    BaseVector <|-- QdrantVector
    BaseVector <|-- WeaviateVector
    BaseVector <|-- MilvusVector
    BaseVector <|-- PgVectorVector
    BaseVector <|-- ElasticsearchVector
    BaseVector <|-- ChromaVector
    BaseVector <|-- OceanBaseVector
    BaseVector <|-- TencentVector
    note for BaseVector "32 implementations total"

Each adapter lives under core/rag/datasource/vdb/ and implements the BaseVector interface with methods for creating, deleting, and searching vectors. The vector_factory.py module resolves the correct adapter based on the dataset's configuration.
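The adapter pattern can be sketched with a toy in-memory implementation. The BaseVector method names come from the article's class diagram (subset shown); the in-memory adapter, registry dict, and get_vector helper are illustrative stand-ins for the real vector_factory.py resolution logic.

```python
from abc import ABC, abstractmethod

class BaseVector(ABC):
    """Subset of the common adapter interface from the class diagram."""
    @abstractmethod
    def create(self, texts: list, embeddings: list) -> None: ...
    @abstractmethod
    def search_by_vector(self, query_vector: list, top_k: int = 4) -> list: ...

class InMemoryVector(BaseVector):
    """Toy adapter standing in for Qdrant/Weaviate/etc., ranking by cosine similarity."""
    def __init__(self):
        self._rows = []

    def create(self, texts, embeddings):
        self._rows.extend(zip(texts, embeddings))

    def search_by_vector(self, query_vector, top_k=4):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self._rows, key=lambda r: cos(query_vector, r[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

# Factory resolution in the spirit of vector_factory.py: the dataset's
# configured vector type string selects the adapter class.
VECTOR_REGISTRY = {"in_memory": InMemoryVector}  # the real registry maps all 32 types

def get_vector(vector_type: str) -> BaseVector:
    return VECTOR_REGISTRY[vector_type]()
```

The payoff of this pattern is that indexing and retrieval code only ever touch BaseVector, so swapping Qdrant for pgvector is a configuration change, not a code change.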

The corresponding configuration classes under api/configs/middleware/vdb/ provide typed settings for each database. As we saw in Part 1, these are composed into MiddlewareConfig via Pydantic multiple inheritance.

Some adapters support hybrid search (combining vector similarity with keyword/full-text search):

  • Milvus — MILVUS_ENABLE_HYBRID_SEARCH
  • Tencent Vector DB — TENCENT_VECTOR_DB_ENABLE_HYBRID_SEARCH
  • Elasticsearch — native hybrid via BM25 + vector
  • PgVector — combined with pg_bigm for full-text

Tip: When choosing a vector database, consider whether you need hybrid search. Pure semantic search misses keyword-exact matches (like error codes or product SKUs). Databases that support hybrid search out of the box save you from building a separate keyword index.
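One common way hybrid search combines the two signals is weighted-score fusion: normalize each retriever's scores, then mix them with a tunable weight. This sketch illustrates the idea generically; it is not taken from any specific database's implementation.

```python
def hybrid_merge(vector_hits, keyword_hits, alpha=0.7):
    """Weighted fusion of semantic and keyword results.
    Each hits dict maps doc_id -> raw score. Scores are min-max normalized
    per retriever first, since cosine similarity and BM25 live on
    incomparable scales; alpha weights the vector side."""
    def normalize(hits):
        if not hits:
            return {}
        lo, hi = min(hits.values()), max(hits.values())
        span = (hi - lo) or 1.0
        return {doc: (score - lo) / span for doc, score in hits.items()}

    v, k = normalize(vector_hits), normalize(keyword_hits)
    merged = {doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
              for doc in set(v) | set(k)}
    return sorted(merged, key=merged.get, reverse=True)
```

A document that only keyword search finds (an exact SKU match, say) still surfaces, weighted by 1 - alpha, which is exactly what pure semantic search would miss.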

Multi-Dataset Retrieval and Query Routing

When a knowledge retrieval node executes in a workflow, it may need to search across multiple datasets. The DatasetRetrieval class orchestrates this with two routing strategies:

flowchart TD
    Query[User Query] --> Strategy{Routing Strategy?}
    Strategy -->|single_dataset| Direct[Search single dataset]
    Strategy -->|function_call| FCRouter[FunctionCallMultiDatasetRouter<br/>LLM selects relevant datasets]
    Strategy -->|react| ReactRouter[ReactMultiDatasetRouter<br/>ReAct reasoning loop]

    FCRouter --> Selected[Selected datasets]
    ReactRouter --> Selected
    Selected --> Parallel[Parallel retrieval<br/>per-dataset strategy]

    Parallel --> Semantic[Semantic Search<br/>vector similarity]
    Parallel --> Keyword[Keyword Search<br/>BM25/full-text]
    Parallel --> Hybrid[Hybrid Search<br/>combined score]

    Semantic --> Merge[Merge results]
    Keyword --> Merge
    Hybrid --> Merge
    Merge --> Rerank[Reranking Pipeline]
    Rerank --> Return[Return top-K results]

The FunctionCallMultiDatasetRouter is elegant in its approach. It presents each dataset as a "tool" to the LLM and asks it to select which datasets are relevant to the query. If there's only one dataset, it skips the LLM call entirely. The ReactMultiDatasetRouter uses a ReAct reasoning loop for more complex routing decisions.

Each selected dataset is then queried using its own configured retrieval method — semantic, keyword, or hybrid — with dataset-specific parameters like top-K, score thresholds, and reranking settings.
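The function-call routing idea reduces to a small sketch: describe each dataset as a tool, ask the model to pick, and short-circuit when there is nothing to choose. The function name and the llm_select callback are hypothetical; the real router builds provider-specific tool schemas and invokes the model through ModelInstance.

```python
def route_datasets(query, datasets, llm_select):
    """datasets: name -> description. llm_select is a stand-in for the LLM
    call; it receives the query plus tool descriptions and returns the names
    of the datasets it considers relevant."""
    if len(datasets) == 1:
        # Single dataset: skip the LLM round-trip entirely.
        return list(datasets)
    tools = [
        {"name": name, "description": f"Search the '{name}' knowledge base: {desc}"}
        for name, desc in datasets.items()
    ]
    chosen = llm_select(query, tools)
    # Guard against the model naming datasets that don't exist.
    return [name for name in chosen if name in datasets]
```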

The default retrieval model provides sensible defaults:

default_retrieval_model = {
    "search_method": RetrievalMethod.SEMANTIC_SEARCH,
    "reranking_enable": False,
    "top_k": 4,
    "score_threshold_enabled": False,
}

Reranking, Post-Processing, and Parent-Child Chunking

After initial retrieval, results pass through a post-processing pipeline in data_post_processor.py. The DataPostProcessor applies:

  1. Reranking — Uses a reranking model (like Cohere Rerank or BGE Reranker) to re-score results by relevance. Supports multiple reranking modes: model-based reranking, weighted scoring, and cross-encoder reranking.
  2. Score threshold filtering — Removes results below a configurable similarity threshold.
  3. Reordering — The reorder.py module can apply diversity-based reordering to avoid returning semantically duplicate chunks.

The parent-child chunking model is one of Dify's more sophisticated RAG features. The ChildChunk model enables a two-level chunking strategy:

erDiagram
    Dataset ||--o{ Document : contains
    Document ||--o{ DocumentSegment : "split into"
    DocumentSegment ||--o{ ChildChunk : "further split into"

    DocumentSegment {
        string id PK
        string content
        string index_node_id
        int word_count
        int position
    }

    ChildChunk {
        string id PK
        string segment_id FK
        string content
        int position
        string index_node_id
    }

Parent segments provide context windows while child chunks enable fine-grained matching. During retrieval, the system matches against child chunks but returns their parent segments for context. This addresses a common RAG trade-off: small chunks improve retrieval precision, but large chunks provide better context for LLM generation.
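The child-match-to-parent-return lookup is simple once the ER model above is in place. This is an illustrative sketch using plain dicts in place of the DocumentSegment/ChildChunk ORM models.

```python
def retrieve_with_parents(ranked_child_ids, child_to_parent, parent_segments):
    """Match against child chunks but return their parent segments,
    deduplicated in best-match-first order.
    ranked_child_ids: child chunk ids sorted by similarity;
    child_to_parent: child id -> parent segment id;
    parent_segments: segment id -> segment text."""
    seen, results = set(), []
    for child_id in ranked_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # two children may share one parent
            seen.add(parent_id)
            results.append(parent_segments[parent_id])
    return results
```

Deduplication matters here: when several children of the same segment match a query, the segment should appear once, ranked by its best-matching child.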

Async Indexing with Progress Tracking

The indexing pipeline runs asynchronously via Celery, with fine-grained progress tracking. The document_indexing_task is the entry point:

@shared_task(queue="dataset")
def document_indexing_task(dataset_id: str, document_ids: list):
    """Async process document"""
    _document_indexing(dataset_id, document_ids)

The dataset queue isolation ensures indexing jobs don't compete with real-time workflow tasks. The _document_indexing() function creates a session, fetches documents, and delegates to IndexingRunner.run().

Progress is tracked at multiple granularities:

  • Document level — IndexingStatus enum (pending/indexing/completed/error/paused)
  • Segment level — SegmentStatus tracks individual chunk processing
  • Time tracking — indexing_latency, completed_at, stopped_at timestamps

The system also supports incremental re-indexing — when a document is updated, only changed segments need to be re-embedded and re-stored.
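One common way to detect which segments changed is content hashing: re-embed a segment only when its hash differs from the stored one. This is a generic sketch of the technique, not Dify's specific change-detection code; the function name and dict shapes are hypothetical.

```python
import hashlib

def changed_segments(stored_hashes, new_segments):
    """stored_hashes: segment_id -> sha256 hex of the previously indexed text.
    new_segments: segment_id -> current text.
    Returns the ids that need re-embedding (changed or newly added)."""
    to_reindex = []
    for seg_id, text in new_segments.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(seg_id) != digest:
            to_reindex.append(seg_id)
    return to_reindex
```

For a large document where only one paragraph changed, this turns a full re-embedding pass into a single embedding call plus a single vector upsert.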

Tip: For large document sets, watch the Celery dataset queue depth. If indexing is slow, consider increasing CELERY_WORKER_AMOUNT or using autoscaling (CELERY_AUTO_SCALE=true). The tenant-isolated task queue ensures one tenant's bulk upload doesn't block another's.

What's Next

We've traced documents from upload to vector store and back to retrieval. In Part 5, we'll explore how Dify abstracts 100+ LLM providers behind a unified interface, the five types of tools available to agents and workflows, and the plugin daemon architecture that safely executes untrusted code.