This document explains RAGFlow's document parsing and chunking system, which transforms raw files into retrievable text chunks with metadata. The system employs a parser factory pattern where different parsing strategies are selected based on document type and user configuration. Each parsing method applies specialized logic to preserve document structure and semantics.
For information about the task execution system that coordinates parsing operations, see Task Execution and Queue System. For embedding and indexing after chunking, see Document Processing Pipeline.
RAGFlow uses a centralized factory pattern to dispatch parsing requests to specialized implementations. The PARSERS dictionary maps parser identifiers to functions that return parsed document sections.
Parser Registration and Dispatch
Parser Selection Logic
The system selects parsers based on two factors:
- The chunking method (parser_id), which determines the chunk() function to call (e.g., naive.chunk(), book.chunk())
- The layout_recognize value in parser_config, which selects the PDF parsing backend; rag/app/naive.py135-141 defines the PARSERS dictionary that maps these values to implementations
Sources: rag/app/naive.py135-141 rag/app/naive.py694-714
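A minimal sketch of this dispatch pattern (the function names and their behavior here are illustrative stand-ins, not RAGFlow's actual implementations):

```python
# Illustrative parser factory: identifiers map to callables that
# return a list of parsed sections.
def by_plaintext(blob: str) -> list[str]:
    return [line for line in blob.splitlines() if line.strip()]

def by_deepdoc(blob: str) -> list[str]:
    # Placeholder for the vision-based DeepDOC pipeline.
    return by_plaintext(blob)

PARSERS = {
    "DeepDOC": by_deepdoc,
    "Plain Text": by_plaintext,
}

def parse(blob: str, layout_recognize: str = "DeepDOC") -> list[str]:
    # Unknown identifiers fall back to the default parser.
    return PARSERS.get(layout_recognize, by_deepdoc)(blob)
```

The factory keeps callers decoupled from individual parser implementations: adding a backend only means registering one more entry in the dictionary.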
The naive method provides general-purpose document chunking based on token limits and delimiters. It's the default method for most document types and supports embedded files, hyperlink extraction, and custom delimiters.
Naive Chunking Strategy
Configuration Parameters
The naive method accepts these configuration keys in parser_config:
| Parameter | Default | Description |
|---|---|---|
| chunk_token_num | 512 | Maximum tokens per chunk |
| delimiter | "\n!?。;!?" | Character boundaries for splitting |
| children_delimiter | - | Custom delimiters in backticks |
| layout_recognize | "DeepDOC" | PDF parser selection |
| analyze_hyperlink | False | Extract and process URLs |
| table_context_size | 0 | Tokens of context around tables |
| image_context_size | 0 | Tokens of context around images |
| overlapped_percent | 0 | Overlap percentage between chunks |
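To make the token-budgeted splitting concrete, here is a simplified sketch that models only chunk_token_num and delimiter (word count stands in for real tokenization, which the actual implementation performs with a proper tokenizer):

```python
import re

def naive_chunk(text, chunk_token_num=512, delimiter="\n!?。;!?"):
    # Split on any delimiter character, keeping non-empty pieces.
    pattern = "[" + re.escape(delimiter) + "]"
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]
    chunks, current, count = [], [], 0
    for piece in pieces:
        tokens = len(piece.split())  # stand-in for real token counting
        # Close the current chunk when the budget would be exceeded.
        if current and count + tokens > chunk_token_num:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Pieces accumulate until the budget is exceeded, so chunk boundaries always fall on delimiter boundaries rather than mid-sentence.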
Custom Delimiter Handling
Custom delimiters are specified in backticks (e.g., `Chapter`, `Section`). The system extracts each backtick-wrapped term and splits the text at every occurrence, so each delimited segment becomes its own chunk.
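A sketch of how backtick-wrapped delimiters could be compiled into a splitting pattern (illustrative; the real logic lives in rag/app/naive.py619-621):

```python
import re

def build_custom_pattern(delimiter_spec: str):
    # "`Chapter``Section`" -> ["Chapter", "Section"]
    terms = re.findall(r"`([^`]+)`", delimiter_spec)
    if not terms:
        return None
    # Alternation pattern used to split text at each term.
    return "(" + "|".join(re.escape(t) for t in terms) + ")"

def split_on_custom(text: str, delimiter_spec: str) -> list[str]:
    pattern = build_custom_pattern(delimiter_spec)
    if pattern is None:
        return [text]
    # The capture group keeps each delimiter in the output.
    return [p for p in re.split(pattern, text) if p.strip()]
```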
File Type-Specific Processing
DOCX files are parsed with python-docx to extract paragraphs, styles, and images; captions and nested structure are handled as well. rag/app/naive.py655-692
Embedded File Extraction
The naive method recursively processes embedded files (e.g., attachments in DOCX):
Each embedded file is extracted and passed recursively into chunk() with is_root=False, so attachments are chunked with the same configuration as their parent document.
Hyperlink Analysis
When enabled, the system extracts URLs from the document and processes their content into additional chunks.
Sources: rag/app/naive.py604-923 rag/nlp/__init__.py784-840 rag/nlp/__init__.py843-910
The book method applies hierarchical structure detection to create chunks that respect chapter/section boundaries. It's optimized for long-form content with clear hierarchical organization.
Book Structure Recognition
Bullet Pattern Detection
The system recognizes multiple bullet pattern categories: rag/nlp/__init__.py168-200
- Chinese bullets: 第一章, 第一节, (一), etc.
- Numeric bullets: 1., 1.1, 1.1.1, etc.
- English bullets: Chapter I, Section 1, Article 1
- Markdown headings: #, ##, ###, etc.

The detector samples random sections and counts pattern matches to identify the document's structure type.
Hierarchical Merging Algorithm
rag/nlp/__init__.py693-781 implements hierarchical merging:
Example hierarchy:
Level 0: Chapter I (section_id=0)
Level 1: Section 1.1 (section_id=0)
Level 2: Paragraph (section_id=0)
Level 1: Section 1.2 (section_id=1) ← New chunk
Level 2: Paragraph (section_id=1)
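The section_id assignment above can be sketched as follows (a simplification: the real hierarchical_merge also applies bullet detection and token budgets):

```python
def assign_section_ids(sections):
    # sections: (level, text) pairs in document order. A new chunk
    # starts whenever the hierarchy moves back up or stays at the
    # same depth (current level <= previous level).
    ids, section_id, prev_level = [], 0, None
    for level, _text in sections:
        if prev_level is not None and level <= prev_level:
            section_id += 1
        ids.append(section_id)
        prev_level = level
    return ids
```

Running this on the example hierarchy reproduces the chunk boundary at Section 1.2.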
PDF Outline Integration
For PDFs with embedded outlines: rag/app/book.py98-127
Sources: rag/app/book.py68-183 rag/nlp/__init__.py693-781 rag/nlp/__init__.py168-200
The paper method is specialized for academic papers, extracting title, authors, abstract, and sections while preserving scientific document structure.
Paper Structure Extraction
Two-Column Layout Handling
Academic papers often use two-column layouts. The parser detects a two-column page when column_width < page_width / 2 and adjusts the reading order accordingly.
Abstract Extraction
The abstract is treated as a special chunk with enhanced metadata: rag/app/paper.py202-211
important_kwd: ["abstract", "总结", "概括", "summary", "summarize"]
Section-Based Chunking
Unlike naive chunking, paper chunking splits at detected section boundaries rather than at fixed token counts, preserving the paper's structural units.
Beginning Pattern Recognition
The parser identifies paper beginning markers: rag/app/paper.py73-76
introduction, abstract, 摘要, 引言, keywords, background, 目录, 前言, contents
Sources: rag/app/paper.py142-241 rag/app/paper.py29-139
The qa method extracts question-answer pairs from structured documents, creating one chunk per Q&A pair. It supports multiple formats: Excel, CSV, TXT, PDF, Markdown, and DOCX.
Q&A Structure Detection
Question Pattern Recognition
rag/nlp/__init__.py74-86 defines 11 question bullet patterns:
- 第([零一二三四五六七八九十百0-9]+)问 - Chinese "Question N"
- 第([零一二三四五六七八九十百0-9]+)条 - Chinese "Article N"
- [\((][零一二三四五六七八九十百]+[\))] - Parenthesized Chinese numbers
- 第([0-9]+)问 - Numeric "Question N"
- 第([0-9]+)条 - Numeric "Article N"
- ([0-9]{1,2})[\. 、] - Numeric bullets
- ([零一二三四五六七八九十百]+)[ 、] - Chinese word bullets
- [\((][0-9]{1,2}[\))] - Parenthesized numbers
- QUESTION (ONE|TWO|THREE|...) - English word numbers
- QUESTION (I+V?|VI*|XI|IX|X) - Roman numerals
- QUESTION ([0-9]+) - Numeric questions

Format-Specific Processing
Excel/CSV Format: rag/app/qa.py36-76 rag/app/qa.py377-407
PDF Format: rag/app/qa.py79-183
Markdown Format: rag/app/qa.py418-451
mdQuestionLevel() counts # symbols to determine nesting: # Q1 → ## Q1.1 → Q1.1 answer
DOCX Format: rag/app/qa.py185-259 rag/app/qa.py453-460
Hierarchical Question Stacking
For hierarchical formats (Markdown, DOCX): rag/app/qa.py224-229
This creates nested question contexts:
Chapter 1 → Section 1.1 → Question A results in a question whose text carries the full ancestor context (chapter and section headings prefixed to the question).
Prefix Addition
The system adds language-specific prefixes: rag/app/qa.py267-278
"Question: " + q + "\t" + "Answer: " + a"问题:" + q + "\t" + "回答:" + armPrefix() to handle user-entered prefixesSources: rag/app/qa.py313-463 rag/nlp/__init__.py74-129
The table method processes structured tabular data, converting spreadsheets into searchable records. Each row becomes a chunk with field mappings.
Table Processing Pipeline
Complex Header Parsing
Excel files may have multi-level headers with merged cells. The parser: rag/app/table.py80-203
Example:
Row 1: [ Product Info ][ Sales Data ]
Row 2: [ Name ][ Price ][ Q1 ][ Q2 ][ Q3 ][ Q4 ]
Results in: ["Product Info-Name", "Product Info-Price", "Sales Data-Q1", ...]
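The header flattening shown above can be sketched as follows (this assumes merged cells arrive as empty strings that must be forward-filled; the real parser in rag/app/table.py handles more cases):

```python
def expand_merged(row):
    # Forward-fill empty cells left behind by merged header spans.
    filled, last = [], ""
    for cell in row:
        last = cell if cell else last
        filled.append(last)
    return filled

def flatten_headers(top_row, leaf_row):
    # Join each top-level label with its leaf label using "-".
    tops = expand_merged(top_row)
    return [f"{t}-{l}" for t, l in zip(tops, leaf_row)]
```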
Data Type Detection
The column_data_type() function analyzes each column: rag/app/table.py268-304
| Type | Pattern | Example |
|---|---|---|
| int | [+-]?[0-9]+ | 123, -456 |
| float | [+-]?[0-9.]{,19} | 12.34, -0.5 |
| bool | true\|yes\|是\|✓\|false\|no\|否\|× | Yes, No, ✓ |
| datetime | dateutil.parser.parse() | "2024-01-15", "Jan 15" |
| text | Everything else | "Product Name" |
Type detection uses voting: if >50% of non-null values match a pattern, the column gets that type.
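A sketch of the majority-vote typing, using the patterns from the table above (the real column_data_type also tries dateutil for datetimes, which this sketch omits):

```python
import re
from collections import Counter

def column_type(values):
    # Vote each non-null value into a type bucket; the column takes a
    # type only if more than half of the values agree.
    def classify(v):
        s = str(v).strip()
        if re.fullmatch(r"[+-]?[0-9]+", s):
            return "int"
        if re.fullmatch(r"[+-]?[0-9.]{1,19}", s) and s.count(".") == 1:
            return "float"
        if s.lower() in {"true", "yes", "是", "✓", "false", "no", "否", "×"}:
            return "bool"
        return "text"
    votes = Counter(classify(v) for v in values if v not in (None, ""))
    if not votes:
        return "text"
    ty, n = votes.most_common(1)[0]
    return ty if n > sum(votes.values()) / 2 else "text"
```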
Field Name Transformation
Field names are converted to Elasticsearch-compatible formats: rag/app/table.py366-374
The transformation:
- Strips parenthesized value lists like (value1, value2) and /synonym parts
- Converts Chinese to pinyin (姓名 → xingming)
- Appends a type suffix: _tks for text (tokenized), _long for integers, _flt for floats, _dt for datetimes, _kwd for keywords/booleans

Example: "姓名/名字" → xingming_tks
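The transformation can be sketched as follows (the pinyin mapping is passed in explicitly here for illustration; the actual code uses a transliteration library):

```python
def field_name(raw, col_type, pinyin_map=None):
    # Strip "/synonym" parts, transliterate via the supplied pinyin
    # mapping, then append the Elasticsearch type suffix.
    suffix = {"text": "_tks", "int": "_long", "float": "_flt",
              "datetime": "_dt", "bool": "_kwd"}[col_type]
    base = raw.split("/")[0].strip()
    if pinyin_map:
        base = "".join(pinyin_map.get(ch, ch) for ch in base)
    return base + suffix
```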
Field Map Storage
The system stores field mappings in the knowledge base configuration: rag/app/table.py395
This allows the retrieval system to search specific fields (e.g., age_long:[20 TO 30]).
Row-to-Chunk Conversion
Each row becomes a chunk: rag/app/table.py377-393
"field:value" pairsSources: rag/app/table.py307-398 rag/app/table.py80-251 rag/app/table.py268-304
The laws method handles legal documents with hierarchical article structure, using tree-based merging to preserve legal document organization.
Legal Document Structure
Tree-Based Merging
The laws method uses tree_merge() instead of hierarchical_merge(): rag/nlp/__init__.py645-691
Node Class Structure: rag/nlp/__init__.py953-1016
Tree Building Algorithm:
Example hierarchy:
Level 1: 第一章 法律总则
Level 2: 第一条 本法适用范围
Level 3: (一) 条款细节
Level 2: 第二条 定义解释 ← Chunk boundary
Level 3: (一) 术语说明
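The chunk boundaries in this example can be reproduced with a small sketch that starts a new chunk at each Article-level heading (the real tree_merge builds an explicit Node tree and applies token limits):

```python
def tree_chunks(sections, boundary_level=2):
    # sections: (level, text) in document order. A new chunk starts
    # at each heading at boundary_level after the first, so a chapter
    # stays attached to its first article.
    chunks, current, has_boundary = [], [], False
    for level, text in sections:
        if level == boundary_level and has_boundary:
            chunks.append("\n".join(current))
            current, has_boundary = [], False
        current.append(text)
        if level == boundary_level:
            has_boundary = True
    if current:
        chunks.append("\n".join(current))
    return chunks
```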
Article Pattern Recognition
Legal documents use specific patterns: rag/nlp/__init__.py168-173
- 第[零一二三四五六七八九十百0-9]+(分?编|部分) - Parts/Divisions
- 第[零一二三四五六七八九十百0-9]+章 - Chapters
- 第[零一二三四五六七八九十百0-9]+节 - Sections
- 第[零一二三四五六七八九十百0-9]+条 - Articles
- [\((][零一二三四五六七八九十百]+[\))] - Parenthesized clauses

DOCX-Specific Processing
For DOCX files: rag/app/laws.py33-89
Sources: rag/app/laws.py133-226 rag/nlp/__init__.py645-691 rag/nlp/__init__.py953-1016
The manual method is designed for technical manuals and structured documentation, organizing content by hierarchical sections and preserving metadata like section IDs.
Manual Structure Processing
Section ID Assignment
The manual method assigns section_id values to group related content: rag/app/manual.py273-282
Sections with the same section_id are kept together during chunking, ensuring related content stays in one chunk.
Outline-Based Level Detection
For PDFs with embedded outlines: rag/app/manual.py253-271
Token-Limited Merging
Small sections are merged: rag/app/manual.py295-309
Adjacent sections sharing a section_id are appended together until the token limit is reached.
Position Normalization
Manual chunks track precise positions: rag/app/manual.py221-240
Positions format: [(page_num, x1, x2, y1, y2), ...]
Sources: rag/app/manual.py182-338 rag/app/manual.py31-72 rag/app/manual.py75-179
The presentation method treats each slide as a separate chunk, preserving slide images and text for presentation files (PPTX) and slide-formatted PDFs.
Slide-Based Chunking
PPTX Processing
Uses the aspose.slides library: rag/app/presentation.py29-52
Language is detected by calling is_english() on the extracted text.
PDF Slide Processing
For PDF presentations: rag/app/presentation.py54-83
Text matching [0-9\.,%/-]+ (numeric-only content such as page numbers) is discarded.
Garbage Text Detection
The __garbage() method filters out: rag/app/presentation.py58-64
This prevents page numbers and decorative elements from becoming chunks.
Chunk Metadata
Each slide chunk includes: rag/app/presentation.py119-124
The doc_type_kwd="image" flag enables image-aware retrieval in the search system.
Sources: rag/app/presentation.py97-171 rag/app/presentation.py29-52 rag/app/presentation.py54-83
The one method treats the entire document as a single chunk, preserving complete document context. Useful for short documents or when maintaining full context is critical.
Single-Chunk Strategy
Content Aggregation
The one method processes all content then joins it: rag/app/one.py64-157
"\n".join(sections)Example for PDF: rag/app/one.py58-61
Sorts by (page, top, left) to preserve reading order.
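A minimal sketch of that ordering (the box field names here are assumptions about the parser's output shape):

```python
def reading_order(boxes):
    # Sort text boxes by (page, top, left) to reconstruct
    # top-to-bottom, left-to-right reading order per page.
    return sorted(boxes, key=lambda b: (b["page_number"], b["top"], b["x0"]))
```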
Excel Special Handling
Excel files use HTML representation: rag/app/one.py122-124
Each sheet converted to HTML table with all rows included.
Use Cases
The one method is appropriate for short documents, such as contracts or notices, where the complete text must stay retrievable as a single unit.
Limitations
Because the whole document becomes one chunk, very long files can exceed embedding-model context limits and dilute retrieval precision.
Sources: rag/app/one.py64-157 rag/app/one.py28-61
Document parsing behavior is controlled through parser_config dictionaries passed to chunk functions. Configuration supports template-based presets and runtime customization.
Configuration Parameters
Parameter Reference Table
| Parameter | Type | Default | Used By | Description |
|---|---|---|---|---|
| chunk_token_num | int | 512 | All methods | Maximum tokens per chunk |
| delimiter | str | "\n!?。;!?" | naive, book, paper | Character boundaries for splitting |
| children_delimiter | str | - | naive | Custom delimiters (backtick-wrapped) |
| overlapped_percent | int | 0 | naive | Overlap percentage (0-90) |
| layout_recognize | str | "DeepDOC" | PDF parsers | Parser selection |
| from_page | int | 0 | PDF parsers | Start page (0-indexed) |
| to_page | int | 100000 | PDF parsers | End page (exclusive) |
| analyze_hyperlink | bool | False | naive | Extract and chunk URLs |
| table_context_size | int | 0 | All methods | Tokens of context around tables |
| image_context_size | int | 0 | All methods | Tokens of context around images |
| html4excel | bool | False | naive (Excel) | Output Excel as HTML |
Parser Selection Values
The layout_recognize parameter accepts these values: rag/app/naive.py135-141
| Value | Implementation | Description |
|---|---|---|
"DeepDOC" | by_deepdoc() | Default vision pipeline |
"MinerU" | by_mineru() | External MinerU parser |
"Docling" | by_docling() | IBM Docling parser |
"TCADP" | by_tcadp() | Tencent Cloud parser |
"Plain Text" | by_plaintext() | Text-only extraction |
Custom Delimiter Format
Custom delimiters are specified in backticks: rag/app/naive.py619-621
The system extracts the backtick-wrapped terms into a list (e.g., ["Chapter", "Section", "Part"]) and joins them into an alternation pattern ("(Chapter|Section|Part)") used to split the text.
Overlap Percentage
When overlapped_percent > 0: rag/nlp/__init__.py803-806
Example with 20% overlap: with chunk_token_num = 500, the last 100 tokens of each chunk are repeated at the start of the next chunk.
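A token-level sketch of that overlap (chunks are represented as lists of tokens here; the real merge works on text with a tokenizer):

```python
def apply_overlap(chunks, overlapped_percent):
    # chunks: lists of tokens. Prepend the trailing overlapped_percent
    # of each chunk's tokens to the following chunk.
    out = [chunks[0][:]] if chunks else []
    for prev, cur in zip(chunks, chunks[1:]):
        k = int(len(prev) * overlapped_percent / 100)
        out.append(prev[len(prev) - k:] + cur)
    return out
```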
Zero-Token Special Handling
Some parsers set chunk_token_num = 0 to disable chunking: rag/app/naive.py721-722
This preserves the parser's native output structure without additional splitting.
Sources: rag/app/naive.py615-623 rag/nlp/__init__.py784-840 rag/app/naive.py694-724
PDF parsing integrates vision models for OCR (text detection/recognition), layout classification, and table structure recognition. The vision pipeline is optional and configurable per document.
Vision Pipeline Architecture
OCR Models
The system uses computer vision models for text extraction. From the architecture diagrams, the vision pipeline includes text detection, text recognition (OCR), layout classification, and table structure recognition.
Vision Model Configuration
Vision models are optional and configurable via LLMBundle: rag/app/naive.py123-124
Vision Figure Parser Integration
After table extraction, the system can enhance figures: rag/app/naive.py54-56 rag/app/naive.py674
This wrapper sends extracted figure images to the configured vision model and attaches the generated descriptions to the corresponding figure chunks.
Layout Type Handling
Each text box has a layoutno field: rag/app/paper.py83-84
Common layout types:
- text - Body text
- title - Headings
- figure - Images, diagrams
- table - Tabular data
- footer - Page footers
- header - Page headers

Plain Text Fallback
If layout recognition is disabled: rag/app/naive.py119-132
PlainParser skips vision models and extracts raw text only.
Zoom Factor
The zoomin parameter (default 3) controls image resolution: rag/app/naive.py47-48
Higher zoom = better OCR accuracy but slower processing and more memory.
Sources: deepdoc/parser/pdf_parser.py rag/app/naive.py43-57 rag/app/naive.py119-132
The chunking system uses token counting to ensure chunks fit within model context windows. Multiple merging strategies balance chunk size with semantic coherence.
Token Counting
Naive Merge Algorithm
The basic merging algorithm: rag/nlp/__init__.py784-840
Key behaviors: sections accumulate until chunk_token_num is reached, splits fall on delimiter boundaries where possible, and overlap is applied when overlapped_percent > 0.
Custom Delimiter Handling
Custom delimiters override token-based merging: rag/nlp/__init__.py817-835
Each custom-delimited segment becomes a separate chunk regardless of token count.
Hierarchical Merge Algorithm
Hierarchical merging groups by document structure: rag/nlp/__init__.py693-781
Example:
Level 1: Chapter 1 (218 tokens) → Chunk 1
Level 2: Section 1.1 (100 tokens)
Level 3: Subsection (50 tokens)
→ Merged into Chunk 2 (150 tokens)
Level 2: Section 1.2 (300 tokens) → Chunk 3
Tree Merge Algorithm
Tree merging constructs a hierarchy tree: rag/nlp/__init__.py645-691
The Node class manages the tree: rag/nlp/__init__.py953-1016
Markdown-Specific Merging
Markdown uses special handling for sections with images: rag/app/naive.py859-910
Sources: rag/nlp/__init__.py784-840 rag/nlp/__init__.py843-910 rag/nlp/__init__.py693-781 rag/nlp/__init__.py645-691
Each chunk includes metadata for retrieval, filtering, and debugging. Position information enables precise source location within documents.
Chunk Metadata Schema
Core Metadata Fields
| Field | Type | Example | Description |
|---|---|---|---|
| docnm_kwd | str | "contract.pdf" | Source filename |
| title_tks | str | "contract" | Tokenized filename (no extension) |
| title_sm_tks | str | "con tra ct" | Fine-grained tokens |
| content_with_weight | str | "Full chunk text..." | Complete chunk text |
| content_ltks | str | "full chunk text" | Tokenized content |
| content_sm_ltks | str | "fu ll ch un k te xt" | Fine-grained tokens |
Position Metadata
Position tracking enables source highlighting: rag/nlp/__init__.py547-559
Position format: (page_number, x1, x2, y1, y2) in PDF coordinate space
- page_number: 1-indexed page number
- x1, x2: Horizontal bounds (left, right)
- y1, y2: Vertical bounds (top, bottom)

Position Tag Format
Positions are embedded in text as tags: rag/app/manual.py289-293
Example: @@5\t100.0\t200.0\t50.0\t75.0## means page 5, x=[100,200], y=[50,75]
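Parsing these tags back out of chunk text can be sketched as follows (the tag grammar follows the example above):

```python
import re

POS_TAG = re.compile(r"@@(\d+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##")

def parse_positions(text):
    # Extract embedded position tags; return (clean_text, positions).
    positions = [(int(p), float(x1), float(x2), float(y1), float(y2))
                 for p, x1, x2, y1, y2 in POS_TAG.findall(text)]
    clean = POS_TAG.sub("", text)
    return clean, positions
```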
Document Type Classification
The doc_type_kwd field indicates chunk type:
"text" - Default text chunk"table" - Table or structured data"image" - Image or figure with optional captionSet by tokenization functions: rag/nlp/__init__.py337-340
Hierarchical Context Fields
For child-delimiter chunking: rag/nlp/__init__.py293-298
The mom_with_weight field stores the parent chunk text, enabling reconstruction of context during retrieval.
Image Metadata
Image chunks include PIL.Image objects: rag/nlp/__init__.py312-313
Images are stored as binary data in the document store, referenced by chunk ID.
Table Context Attachment
The attach_media_context() function adds surrounding text to tables/images: rag/nlp/__init__.py359-544
The function replaces content_with_weight with the combined text+context.
Example:
Original table chunk: "<table>...</table>"
With context (200 tokens): "Previous paragraph text.\n<table>...</table>\nNext paragraph text."
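A sketch of the budgeted context walk (sections carry precomputed token counts here for simplicity; the real attach_media_context counts tokens itself):

```python
def attach_context(sections, idx, budget):
    # sections: list of (text, n_tokens) in document order. Collect
    # neighboring text around sections[idx] until the budget is spent.
    before, after, used = [], [], 0
    i, j = idx - 1, idx + 1
    while used < budget and (i >= 0 or j < len(sections)):
        if i >= 0:
            text, n = sections[i]
            if used + n <= budget:
                before.insert(0, text)
                used += n
            i -= 1
        if j < len(sections) and used < budget:
            text, n = sections[j]
            if used + n <= budget:
                after.append(text)
                used += n
            j += 1
    return "\n".join(before + [sections[idx][0]] + after)
```

Alternating the walk between the preceding and following neighbors keeps the context balanced around the table or image.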
Sources: rag/nlp/__init__.py267-302 rag/nlp/__init__.py327-356 rag/nlp/__init__.py547-559 rag/nlp/__init__.py359-544