The RAG Pipeline: Document Indexing, Vector Storage, and Knowledge Retrieval
Prerequisites
- Articles 1-2: Architecture Overview and Request Flow
- Basic understanding of text embeddings and vector similarity search
- Familiarity with RAG (Retrieval-Augmented Generation) concepts
Retrieval-Augmented Generation is the backbone of knowledge-grounded LLM applications. In Dify, the RAG pipeline handles everything from ingesting a PDF upload to returning ranked document chunks during a chat conversation. This article traces both sides of that pipeline — the indexing path that transforms documents into searchable vectors, and the retrieval path that finds relevant knowledge at query time. Along the way, we'll encounter 30+ vector database adapters, multi-dataset routing strategies, and a parent-child chunking model that enables fine-grained retrieval.
The Document Indexing Pipeline
Document indexing is orchestrated by the IndexingRunner class, which coordinates a multi-stage pipeline:
```mermaid
flowchart TD
    Upload[Document Upload] --> Task[Celery Task<br/>document_indexing_task]
    Task --> Runner[IndexingRunner.run]
    Runner --> Extract[Extract Text<br/>PDF, DOCX, HTML, etc.]
    Extract --> Clean[Clean Text<br/>remove extras, normalize]
    Clean --> Split[Split Text<br/>FixedRecursiveCharacterTextSplitter]
    Split --> Embed[Generate Embeddings<br/>via ModelInstance]
    Embed --> Store[Store in Vector DB<br/>+ PostgreSQL metadata]
    Store --> Done[Update Document Status]
    Runner -->|Error| ErrorHandler[Mark Document as ERROR<br/>with error message]
```
The run() method iterates over documents, processing each through three stages:
- Extract — Delegates to format-specific extractors based on the document's data source type
- Transform — Applies cleaning rules and splits text into chunks using the dataset's processing rules
- Load — Generates embeddings and stores vectors along with metadata
Each document's status is tracked through an IndexingStatus enum (pending → indexing → completed/error/paused), with progress visible in the UI.
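In outline, the per-document loop looks roughly like this (a simplified sketch, not Dify's actual code; the stage callables stand in for the extractor, splitter, and vector-store layers):

```python
from enum import Enum

class IndexingStatus(str, Enum):
    PENDING = "pending"
    INDEXING = "indexing"
    COMPLETED = "completed"
    ERROR = "error"

def run(documents, extract, transform, load):
    """Process each document through extract -> transform -> load,
    tracking per-document status and capturing errors for the UI."""
    statuses = {}
    for doc in documents:
        statuses[doc["id"]] = IndexingStatus.INDEXING
        try:
            text = extract(doc)       # format-specific extraction
            chunks = transform(text)  # clean + split into chunks
            load(doc, chunks)         # embed + store vectors/metadata
            statuses[doc["id"]] = IndexingStatus.COMPLETED
        except Exception as e:
            doc["error"] = str(e)     # surfaced to the user in the UI
            statuses[doc["id"]] = IndexingStatus.ERROR
    return statuses
```

The key property mirrored from the real runner: a failure on one document records the error and moves on rather than aborting the whole batch.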
The indexing task at tasks/document_indexing_task.py is dispatched to the dataset Celery queue. This separation means indexing work doesn't compete with real-time workflow execution for worker resources. The task uses a TenantIsolatedTaskQueue for fair scheduling across tenants.
Tip: When indexing fails with cryptic errors, check the `Document.error` field in the database. The `_handle_indexing_error()` method stores the error message there, which the UI displays to the user.
Text Extraction and Splitting Strategies
Dify supports text extraction from a wide range of formats through extractor classes under core/rag/extractor/:
| Format | Extractor | Notes |
|---|---|---|
| PDF | `PdfExtractor` | Uses pypdf or Unstructured API |
| DOCX | `WordExtractor` | python-docx based |
| HTML | `HtmlExtractor` | BeautifulSoup parsing |
| Markdown | `MarkdownExtractor` | Custom parser |
| CSV/Excel | `CsvExtractor`, `ExcelExtractor` | Tabular data support |
| Notion | `NotionExtractor` | API integration |
| Website | `WebsiteExtractor` | Jina Reader / Firecrawl |
After extraction, the FixedRecursiveCharacterTextSplitter handles chunking. This splitter is configurable per dataset with parameters for:
- Chunk size — maximum characters per chunk
- Chunk overlap — number of characters shared between adjacent chunks
- Separator — the splitting delimiter hierarchy (paragraphs → sentences → words)
There's also an EnhanceRecursiveCharacterTextSplitter that provides better handling of code blocks and structured content. The dataset's DatasetProcessRule model stores the active splitting configuration, allowing different datasets to use different strategies.
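To make the strategy concrete, here is a minimal sketch of recursive character splitting under those parameters (illustrative only; it omits chunk overlap and is far simpler than the real FixedRecursiveCharacterTextSplitter):

```python
def recursive_split(text: str, chunk_size: int, separators=None) -> list[str]:
    """Split text on the coarsest separator first, recursing to finer
    separators for any piece that still exceeds chunk_size."""
    if separators is None:
        separators = ["\n\n", "\n", " "]  # paragraphs -> lines -> words
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: hard-cut into fixed-size windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # greedily pack pieces into one chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # Piece alone is still too large: recurse with finer separators.
                chunks.extend(recursive_split(piece, chunk_size, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Chunk overlap (not shown) would additionally prepend the tail of each chunk to the next one, so context that straddles a boundary is retrievable from either side.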
Vector Database Adapter Pattern
One of Dify's most impressive engineering feats is its support for 32 vector databases through a common interface. The VectorType enum lists them all:
```mermaid
classDiagram
    class BaseVector {
        <<abstract>>
        +create(texts, embeddings)
        +delete()
        +search_by_vector(query_vector)
        +search_by_full_text(query)
    }
    class QdrantVector
    class WeaviateVector
    class MilvusVector
    class PgVectorVector
    class ElasticsearchVector
    class ChromaVector
    class OceanBaseVector
    class TencentVector
    BaseVector <|-- QdrantVector
    BaseVector <|-- WeaviateVector
    BaseVector <|-- MilvusVector
    BaseVector <|-- PgVectorVector
    BaseVector <|-- ElasticsearchVector
    BaseVector <|-- ChromaVector
    BaseVector <|-- OceanBaseVector
    BaseVector <|-- TencentVector
    note for BaseVector "32 implementations total"
```
Each adapter lives under core/rag/datasource/vdb/ and implements the BaseVector interface with methods for creating, deleting, and searching vectors. The vector_factory.py module resolves the correct adapter based on the dataset's configuration.
The corresponding configuration classes under api/configs/middleware/vdb/ provide typed settings for each database. As we saw in Part 1, these are composed into MiddlewareConfig via Pydantic multiple inheritance.
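The shape of the adapter pattern can be sketched as follows (simplified; the method signatures and factory mapping are assumptions, and a toy in-memory adapter stands in for the 32 real databases):

```python
from abc import ABC, abstractmethod

class BaseVector(ABC):
    """Common interface every vector-store adapter implements."""
    @abstractmethod
    def create(self, texts: list[str], embeddings: list[list[float]]) -> None: ...
    @abstractmethod
    def search_by_vector(self, query_vector: list[float], top_k: int = 4) -> list[str]: ...

class InMemoryVector(BaseVector):
    """Toy adapter standing in for Qdrant, Milvus, pgvector, etc."""
    def __init__(self):
        self._texts, self._vecs = [], []

    def create(self, texts, embeddings):
        self._texts += texts
        self._vecs += embeddings

    def search_by_vector(self, query_vector, top_k=4):
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        # Rank stored texts by similarity to the query vector.
        ranked = sorted(zip(self._texts, self._vecs),
                        key=lambda tv: dot(query_vector, tv[1]), reverse=True)
        return [t for t, _ in ranked[:top_k]]

# The real factory resolves the adapter class from the dataset's VectorType.
ADAPTERS = {"inmemory": InMemoryVector}

def vector_factory(vector_type: str) -> BaseVector:
    return ADAPTERS[vector_type]()
```

Because callers only ever see `BaseVector`, swapping Qdrant for Milvus is a configuration change, not a code change.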
Some adapters support hybrid search (combining vector similarity with keyword/full-text search):
- Milvus — `MILVUS_ENABLE_HYBRID_SEARCH`
- Tencent Vector DB — `TENCENT_VECTOR_DB_ENABLE_HYBRID_SEARCH`
- Elasticsearch — native hybrid via BM25 + vector
- PgVector — combined with `pg_bigm` for full-text
Tip: When choosing a vector database, consider whether you need hybrid search. Pure semantic search misses keyword-exact matches (like error codes or product SKUs). Databases that support hybrid search out of the box save you from building a separate keyword index.
Multi-Dataset Retrieval and Query Routing
When a knowledge retrieval node executes in a workflow, it may need to search across multiple datasets. The DatasetRetrieval class orchestrates this with two routing strategies:
```mermaid
flowchart TD
    Query[User Query] --> Strategy{Routing Strategy?}
    Strategy -->|single_dataset| Direct[Search single dataset]
    Strategy -->|function_call| FCRouter[FunctionCallMultiDatasetRouter<br/>LLM selects relevant datasets]
    Strategy -->|react| ReactRouter[ReactMultiDatasetRouter<br/>ReAct reasoning loop]
    FCRouter --> Selected[Selected datasets]
    ReactRouter --> Selected
    Selected --> Parallel[Parallel retrieval<br/>per-dataset strategy]
    Parallel --> Semantic[Semantic Search<br/>vector similarity]
    Parallel --> Keyword[Keyword Search<br/>BM25/full-text]
    Parallel --> Hybrid[Hybrid Search<br/>combined score]
    Semantic --> Merge[Merge results]
    Keyword --> Merge
    Hybrid --> Merge
    Merge --> Rerank[Reranking Pipeline]
    Rerank --> Return[Return top-K results]
```
The FunctionCallMultiDatasetRouter is elegant in its approach. It presents each dataset as a "tool" to the LLM and asks it to select which datasets are relevant to the query. If there's only one dataset, it skips the LLM call entirely. The ReactMultiDatasetRouter uses a ReAct reasoning loop for more complex routing decisions.
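The tool-style selection and the single-dataset shortcut can be sketched like this (a conceptual sketch; `route_datasets` and `llm_select` are hypothetical names, not Dify's actual API):

```python
def route_datasets(query: str, datasets: list[dict], llm_select) -> list[dict]:
    """Pick which datasets to search for a query.

    datasets: [{"id": ..., "description": ...}, ...]
    llm_select: callable(query, tools) -> list of selected dataset ids,
                standing in for an LLM function-calling round trip.
    """
    # Single-dataset shortcut: no LLM call needed.
    if len(datasets) == 1:
        return datasets
    # Present each dataset as a "tool" the LLM can choose to invoke.
    tools = [{"name": d["id"], "description": d["description"]} for d in datasets]
    selected_ids = set(llm_select(query, tools))
    return [d for d in datasets if d["id"] in selected_ids]
```

The shortcut matters in practice: most knowledge apps attach exactly one dataset, so the common path costs zero extra LLM tokens.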
Each selected dataset is then queried using its own configured retrieval method — semantic, keyword, or hybrid — with dataset-specific parameters like top-K, score thresholds, and reranking settings.
The default retrieval model provides sensible defaults:
```python
default_retrieval_model = {
    "search_method": RetrievalMethod.SEMANTIC_SEARCH,
    "reranking_enable": False,
    "top_k": 4,
    "score_threshold_enabled": False,
}
```
Reranking, Post-Processing, and Parent-Child Chunking
After initial retrieval, results pass through a post-processing pipeline in data_post_processor.py. The DataPostProcessor applies:
- Reranking — Uses a reranking model (like Cohere Rerank or BGE Reranker) to re-score results by relevance. Supports multiple reranking modes: model-based reranking, weighted scoring, and cross-encoder reranking.
- Score threshold filtering — Removes results below a configurable similarity threshold.
- Reordering — The `reorder.py` module can apply diversity-based reordering to avoid returning semantically duplicate chunks.
The parent-child chunking model is one of Dify's more sophisticated RAG features. The ChildChunk model enables a two-level chunking strategy:
```mermaid
erDiagram
    Dataset ||--o{ Document : contains
    Document ||--o{ DocumentSegment : "split into"
    DocumentSegment ||--o{ ChildChunk : "further split into"
    DocumentSegment {
        string id PK
        string content
        string index_node_id
        int word_count
        int position
    }
    ChildChunk {
        string id PK
        string segment_id FK
        string content
        int position
        string index_node_id
    }
```
Parent segments provide context windows while child chunks enable fine-grained matching. During retrieval, the system matches against child chunks but returns their parent segments for context. This addresses a common RAG trade-off: small chunks improve retrieval precision, but large chunks provide better context for LLM generation.
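The child-match/parent-return step can be sketched as follows (the data shapes are assumptions; real hits carry full ChildChunk rows):

```python
def resolve_parents(child_hits: list[dict], segments: dict[str, str]) -> list[str]:
    """Map child-chunk hits (ordered by similarity) to their parent
    segments, deduplicating when several children share a parent."""
    seen, contexts = set(), []
    for hit in child_hits:
        seg_id = hit["segment_id"]
        if seg_id not in seen:
            seen.add(seg_id)
            contexts.append(segments[seg_id])  # return the full parent segment
    return contexts
```

Deduplication is the subtle part: a highly relevant segment often has several matching children, and without it the same context would crowd out other results.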
Async Indexing with Progress Tracking
The indexing pipeline runs asynchronously via Celery, with fine-grained progress tracking. The document_indexing_task is the entry point:
```python
from celery import shared_task

@shared_task(queue="dataset")
def document_indexing_task(dataset_id: str, document_ids: list):
    """Async process document"""
    _document_indexing(dataset_id, document_ids)
```
The dataset queue isolation ensures indexing jobs don't compete with real-time workflow tasks. The _document_indexing() function creates a session, fetches documents, and delegates to IndexingRunner.run().
Progress is tracked at multiple granularities:
- Document level — `IndexingStatus` enum (pending/indexing/completed/error/paused)
- Segment level — `SegmentStatus` tracks individual chunk processing
- Time tracking — `indexing_latency`, `completed_at`, `stopped_at` timestamps
The system also supports incremental re-indexing — when a document is updated, only changed segments need to be re-embedded and re-stored.
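One way to implement that change detection is a content hash per segment — a sketch under assumed data shapes, not necessarily how Dify diffs segments internally:

```python
import hashlib

def changed_segments(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return ids of segments that are new or whose content changed,
    i.e. the only ones that need re-embedding and re-storing."""
    def digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    return [seg_id for seg_id, text in new.items()
            if seg_id not in old or digest(old[seg_id]) != digest(text)]
```

Storing the digest alongside each segment's metadata means an update only pays embedding cost proportional to what actually changed, not to document size.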
Tip: For large document sets, watch the Celery `dataset` queue depth. If indexing is slow, consider increasing `CELERY_WORKER_AMOUNT` or using autoscaling (`CELERY_AUTO_SCALE=true`). The tenant-isolated task queue ensures one tenant's bulk upload doesn't block another's.
What's Next
We've traced documents from upload to vector store and back to retrieval. In Part 5, we'll explore how Dify abstracts 100+ LLM providers behind a unified interface, the five types of tools available to agents and workflows, and the plugin daemon architecture that safely executes untrusted code.