Document Parsing and Chunking Methods

Purpose and Scope

This document explains RAGFlow's document parsing and chunking system, which transforms raw files into retrievable text chunks with metadata. The system employs a parser factory pattern where different parsing strategies are selected based on document type and user configuration. Each parsing method applies specialized logic to preserve document structure and semantics.

For information about the task execution system that coordinates parsing operations, see Task Execution and Queue System. For embedding and indexing after chunking, see Document Processing Pipeline.


Parser Factory Pattern

RAGFlow uses a centralized factory pattern to dispatch parsing requests to specialized implementations. The PARSERS dictionary maps parser identifiers to functions that return parsed document sections.

Parser Registration and Dispatch

Parser Selection Logic

The system selects parsers based on two factors:

  1. File Extension: Determines which chunk() function to call (e.g., naive.chunk(), book.chunk())
  2. Layout Recognizer: For PDFs, determines which parser implementation to use

rag/app/naive.py135-141 defines the PARSERS dictionary.

Sources: rag/app/naive.py135-141 rag/app/naive.py694-714


Chunking Method: Naive (General Purpose)

The naive method provides general-purpose document chunking based on token limits and delimiters. It's the default method for most document types and supports embedded files, hyperlink extraction, and custom delimiters.

Naive Chunking Strategy

Configuration Parameters

The naive method accepts these configuration keys in parser_config:

Parameter | Default | Description
chunk_token_num | 512 | Maximum tokens per chunk
delimiter | "\n!?。；！？" | Character boundaries for splitting
children_delimiter | - | Custom delimiters in backticks
layout_recognize | "DeepDOC" | PDF parser selection
analyze_hyperlink | False | Extract and process URLs
table_context_size | 0 | Tokens of context around tables
image_context_size | 0 | Tokens of context around images
overlapped_percent | 0 | Overlap percentage between chunks

Custom Delimiter Handling

Custom delimiters are specified in backticks (e.g., `Chapter`, `Section`). The system:

  1. Extracts patterns: rag/app/naive.py619-621
  2. Splits content at these boundaries: rag/nlp/__init__.py817-835
  3. Creates separate chunks without merging across delimiters
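The steps above can be sketched in a few lines. This is an illustrative helper (the function name and exact splitting behavior are assumptions, not RAGFlow's actual code): extract the backtick-wrapped patterns, split on them, and keep each delimiter attached to the segment it introduces.

```python
import re

def split_on_custom_delimiters(text: str, delimiter_spec: str) -> list[str]:
    """Split text at backtick-wrapped delimiters, e.g. '`Chapter``Section`'.

    Hypothetical sketch of the backtick-delimiter behavior described above.
    """
    # 1. Extract backtick-wrapped patterns: `Chapter` -> "Chapter"
    patterns = re.findall(r"`([^`]+)`", delimiter_spec)
    if not patterns:
        return [text]
    # 2. Build an alternation that captures the delimiter so it survives the split
    splitter = re.compile("(" + "|".join(map(re.escape, patterns)) + ")")
    parts = splitter.split(text)
    # 3. Re-attach each delimiter to the segment it opens; no merging across boundaries
    chunks, current = [], parts[0]
    for i in range(1, len(parts), 2):
        if current.strip():
            chunks.append(current.strip())
        current = parts[i] + parts[i + 1]
    if current.strip():
        chunks.append(current.strip())
    return chunks
```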

File Type-Specific Processing

Embedded File Extraction

The naive method recursively processes embedded files (e.g., attachments in DOCX):

  1. Extract embedded files: rag/app/naive.py636-653
  2. Recursively call chunk() with is_root=False
  3. Merge results into parent document chunks

Hyperlink Analysis

When enabled, the system:

  1. Extracts URLs from DOCX/PDF: rag/utils/file_utils.py
  2. Fetches HTML content: rag/app/naive.py660-668
  3. Recursively chunks the linked content
  4. Appends to main document results

Sources: rag/app/naive.py604-923 rag/nlp/__init__.py784-840 rag/nlp/__init__.py843-910


Chunking Method: Book

The book method applies hierarchical structure detection to create chunks that respect chapter/section boundaries. It's optimized for long-form content with clear hierarchical organization.

Book Structure Recognition

Bullet Pattern Detection

The system recognizes multiple bullet pattern categories: rag/nlp/__init__.py168-200

  1. Chinese Patterns: 第一章, 第一节, (一), etc.
  2. Numeric Patterns: 1., 1.1, 1.1.1, etc.
  3. Mixed Patterns: Combination of Chinese and numbers
  4. English Patterns: Chapter I, Section 1, Article 1
  5. Markdown Patterns: #, ##, ###, etc.

The detector samples random sections and counts pattern matches to identify the document's structure type.
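The sampling-and-voting idea can be sketched as follows. The pattern lists here are heavily simplified stand-ins (the real ones live in rag/nlp/__init__.py), and the function name is hypothetical:

```python
import random
import re

# Simplified pattern categories; illustrative only.
BULLET_CATEGORIES = {
    "chinese": [r"第[零一二三四五六七八九十百0-9]+[章节条]"],
    "numeric": [r"^[0-9]+(\.[0-9]+)*[\. ]"],
    "english": [r"^(Chapter|Section|Article)\s+[IVX0-9]+"],
    "markdown": [r"^#{1,6}\s"],
}

def detect_bullet_category(sections: list[str], sample_size: int = 100, seed: int = 0) -> str:
    """Sample sections and vote on which pattern category matches most often."""
    rng = random.Random(seed)
    sample = sections if len(sections) <= sample_size else rng.sample(sections, sample_size)
    scores = {name: 0 for name in BULLET_CATEGORIES}
    for sec in sample:
        for name, pats in BULLET_CATEGORIES.items():
            if any(re.match(p, sec.strip()) for p in pats):
                scores[name] += 1
    # The category with the most matches is taken as the document's structure type
    return max(scores, key=scores.get)
```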

Hierarchical Merging Algorithm

rag/nlp/__init__.py693-781 implements hierarchical merging:

  1. Level Assignment: Assign hierarchy levels based on bullet patterns
  2. Pivot Selection: Choose most frequent level as merge boundary
  3. Binary Search: Find parent/child relationships between sections
  4. Token Accumulation: Merge small sections up to token limit (218 tokens for short sections, configurable for others)

Example hierarchy:

Level 0: Chapter I (section_id=0)
  Level 1: Section 1.1 (section_id=0)
    Level 2: Paragraph (section_id=0)
  Level 1: Section 1.2 (section_id=1) ← New chunk
    Level 2: Paragraph (section_id=1)

PDF Outline Integration

For PDFs with embedded outlines: rag/app/book.py98-127

  • Extract outline hierarchy from PDF metadata
  • Match outline entries to text sections using token overlap (80% threshold)
  • Use outline levels instead of bullet pattern detection
  • More accurate than pattern-based detection for well-structured PDFs

Sources: rag/app/book.py68-183 rag/nlp/__init__.py693-781 rag/nlp/__init__.py168-200


Chunking Method: Paper

The paper method is specialized for academic papers, extracting title, authors, abstract, and sections while preserving scientific document structure.

Paper Structure Extraction

Two-Column Layout Handling

Academic papers often use two-column layouts. The parser:

  1. Calculates median column width: rag/app/paper.py60
  2. Detects two-column if column_width < page_width / 2
  3. Sorts text boxes by column order: rag/app/paper.py69
  4. Preserves reading order across columns
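The column-ordering step amounts to partitioning boxes by the page midline and sorting each column top to bottom. A minimal sketch (box layout and field names are assumptions, not paper.py's data model):

```python
def sort_two_column_boxes(boxes: list[dict], page_width: float) -> list[dict]:
    """Sort OCR text boxes into reading order for a two-column page.

    Each box is a dict with 'x0' (left edge) and 'top'. A box whose left
    edge falls past the page midline is assigned to the right column;
    the left column is read first, top to bottom.
    """
    mid = page_width / 2
    left = sorted((b for b in boxes if b["x0"] < mid), key=lambda b: b["top"])
    right = sorted((b for b in boxes if b["x0"] >= mid), key=lambda b: b["top"])
    return left + right
```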

Abstract Extraction

The abstract is treated as a special chunk with enhanced metadata: rag/app/paper.py202-211

  • Marked with important_kwd: ["abstract", "总结", "概括", "summary", "summarize"]
  • Stored as complete section without token splitting
  • Includes cropped image of abstract region

Section-Based Chunking

Unlike naive chunking, paper chunking:

  1. Detects section boundaries (Introduction, Methods, Results, Discussion, etc.)
  2. Assigns section IDs based on hierarchy: rag/app/paper.py216-225
  3. Merges all text within a section boundary: rag/app/paper.py227-235
  4. Preserves section context even if exceeding token limits

Beginning Pattern Recognition

The parser identifies paper beginning markers: rag/app/paper.py73-76

  • introduction, abstract, 摘要, 引言, keywords, background, 目录, 前言, contents
  • Used to skip title page content and focus on substantive content

Sources: rag/app/paper.py142-241 rag/app/paper.py29-139


Chunking Method: Q&A

The qa method extracts question-answer pairs from structured documents, creating one chunk per Q&A pair. It supports multiple formats: Excel, CSV, TXT, PDF, Markdown, and DOCX.

Q&A Structure Detection

Question Pattern Recognition

rag/nlp/__init__.py74-86 defines 11 question bullet patterns:

  1. 第([零一二三四五六七八九十百0-9]+)问 - Chinese "Question N"
  2. 第([零一二三四五六七八九十百0-9]+)条 - Chinese "Article N"
  3. [\(（]([零一二三四五六七八九十百]+)[\)）] - Parenthesized Chinese numbers
  4. 第([0-9]+)问 - Numeric "Question N"
  5. 第([0-9]+)条 - Numeric "Article N"
  6. ([0-9]{1,2})[\. 、] - Numeric bullets
  7. ([零一二三四五六七八九十百]+)[ 、] - Chinese word bullets
  8. [\(（]([0-9]+)[\)）] - Parenthesized numbers
  9. QUESTION (ONE|TWO|THREE|...) - English word numbers
  10. QUESTION (I+V?|VI*|XI|IX|X) - Roman numerals
  11. QUESTION ([0-9]+) - Numeric questions

Format-Specific Processing

Excel/CSV Format: rag/app/qa.py36-76 rag/app/qa.py377-407

  • Two columns: Question | Answer
  • No header row required
  • Empty rows skipped, tracked as failures
  • Delimiter auto-detection (comma vs tab)
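The two-column extraction with delimiter auto-detection can be sketched like this. The helper name and the exact detection heuristic are assumptions for illustration; the real logic is in rag/app/qa.py:

```python
def parse_qa_lines(raw: str):
    """Parse question/answer lines, auto-detecting comma vs tab delimiter.

    Returns (pairs, failures), where failures counts rows that lack two
    non-empty fields, mirroring the 'empty rows tracked as failures' rule.
    """
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    # Auto-detect: prefer tab when it appears in at least as many lines as comma
    tab_hits = sum(1 for ln in lines if "\t" in ln)
    comma_hits = sum(1 for ln in lines if "," in ln)
    delim = "\t" if tab_hits >= comma_hits else ","
    pairs, failures = [], 0
    for ln in lines:
        fields = [f.strip() for f in ln.split(delim)]
        if len(fields) >= 2 and fields[0] and fields[1]:
            pairs.append((fields[0], fields[1]))
        else:
            failures += 1
    return pairs, failures
```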

PDF Format: rag/app/qa.py79-183

  • Uses OCR and layout recognition
  • Applies question bullet patterns
  • Tracks question indices to detect sequences
  • Accumulates answers until next question
  • Associates images with answers

Markdown Format: rag/app/qa.py418-451

  • Heading levels indicate question hierarchy
  • mdQuestionLevel() counts # symbols
  • Questions stack hierarchically: # Q1 → ## Q1.1 → Q1.1 answer
  • Answers converted to HTML tables if present

DOCX Format: rag/app/qa.py185-259 rag/app/qa.py453-460

  • Uses paragraph styles (Heading 1-6)
  • Extracts images from paragraphs
  • Stacks questions hierarchically
  • Concatenates images vertically

Hierarchical Question Stacking

For hierarchical formats (Markdown, DOCX): rag/app/qa.py224-229

This creates nested question contexts:

  • A hierarchy of Chapter 1 → Section 1.1 → Question A results in:
  • Full question: "Chapter 1\nSection 1.1\nQuestion A"
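The stacking behavior can be sketched with a small helper (hypothetical name, not the qa.py implementation): pop headings at the same or deeper level, push the new one, and join the stack top-down.

```python
def stack_question(stack: list[tuple[int, str]], level: int, question: str) -> str:
    """Maintain a stack of (level, text) headings and return the full
    hierarchical question for a new heading at `level`.
    """
    # Pop headings at the same or deeper level before pushing the new one
    while stack and stack[-1][0] >= level:
        stack.pop()
    stack.append((level, question))
    return "\n".join(text for _, text in stack)
```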

Prefix Addition

The system adds language-specific prefixes: rag/app/qa.py267-278

  • English: "Question: " + q + "\t" + "Answer: " + a
  • Chinese: "问题:" + q + "\t" + "回答:" + a
  • Prefix removal via rmPrefix() to handle user-entered prefixes

Sources: rag/app/qa.py313-463 rag/nlp/__init__.py74-129


Chunking Method: Table

The table method processes structured tabular data, converting spreadsheets into searchable records. Each row becomes a chunk with field mappings.

Table Processing Pipeline

Complex Header Parsing

Excel files may have multi-level headers with merged cells. The parser: rag/app/table.py80-203

  1. Detects Complex Structure: Checks for merged cells in first two rows
  2. Counts Header Rows: Identifies rows with "header-like" content vs data
  3. Builds Hierarchical Headers: Merges levels with dashes (e.g., "Category-Subcategory-Field")
  4. Handles Merged Cells: Extracts value from top-left cell of merged range

Example:

Row 1: [  Product Info   ][    Sales Data     ]
Row 2: [ Name ][ Price ][ Q1 ][ Q2 ][ Q3 ][ Q4 ]

Results in: ["Product Info-Name", "Product Info-Price", "Sales Data-Q1", ...]
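The header-flattening behavior in the example can be sketched as follows. Here a merged cell is represented by None in the covered positions (an assumption about input shape; the real parser reads merged ranges from openpyxl):

```python
def build_headers(header_rows: list[list]) -> list[str]:
    """Flatten multi-level headers into dash-joined names.

    None marks a cell covered by a merge; its value is carried over from
    the cell to its left (the top-left of the merged range).
    """
    filled = []
    for row in header_rows:
        out, last = [], ""
        for cell in row:
            if cell is not None and str(cell).strip():
                last = str(cell).strip()
            out.append(last)
        filled.append(out)
    headers = []
    for c in range(len(filled[0])):
        parts = []
        for row in filled:
            # Skip empties and immediate repeats so "A-A-B" collapses to "A-B"
            if row[c] and (not parts or parts[-1] != row[c]):
                parts.append(row[c])
        headers.append("-".join(parts))
    return headers
```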

Data Type Detection

The column_data_type() function analyzes each column: rag/app/table.py268-304

Type | Pattern | Example
int | [+-]?[0-9]+ | 123, -456
float | [+-]?[0-9.]{,19} | 12.34, -0.5
bool | true|yes|是|✓|false|no|否|× | Yes, No, ✓
datetime | dateutil.parser.parse() | "2024-01-15", "Jan 15"
text | Everything else | "Product Name"

Type detection uses voting: if >50% of non-null values match a pattern, the column gets that type.
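The voting rule can be sketched like this. It is a simplified stand-in for column_data_type() (datetime detection via dateutil is omitted, and the patterns are abbreviated):

```python
import re

def column_type(values, threshold: float = 0.5) -> str:
    """Vote on a column's type: if more than `threshold` of the non-null
    values match a pattern, the column gets that type.
    """
    checks = [
        ("int", lambda v: re.fullmatch(r"[+-]?[0-9]+", v) is not None),
        ("float", lambda v: re.fullmatch(r"[+-]?[0-9]*\.[0-9]+", v) is not None),
        ("bool", lambda v: v.lower() in {"true", "yes", "是", "false", "no", "否"}),
    ]
    vals = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not vals:
        return "text"
    for name, pred in checks:
        if sum(pred(v) for v in vals) / len(vals) > threshold:
            return name
    return "text"  # fallback when no pattern wins the vote
```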

Field Name Transformation

Field names are converted to Elasticsearch-compatible formats: rag/app/table.py366-374

  1. Remove Annotations: Strip (value1, value2) and /synonym parts
  2. Pinyin Conversion: Convert Chinese characters to pinyin (e.g., 姓名 → xingming)
  3. Add Type Suffix: Append based on detected type
    • _tks for text (tokenized)
    • _long for integers
    • _flt for floats
    • _dt for datetimes
    • _kwd for keywords/booleans

Example: "姓名/名字" → xingming_tks
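The stripping and suffixing steps can be sketched as below. Pinyin conversion of Chinese characters (done in the real code, e.g. via a pinyin library) is deliberately omitted here, and the function name is hypothetical:

```python
import re

# Type-to-suffix mapping as described above
TYPE_SUFFIX = {"text": "_tks", "int": "_long", "float": "_flt",
               "datetime": "_dt", "bool": "_kwd"}

def transform_field_name(name: str, col_type: str) -> str:
    """Strip annotations/synonyms and append a type suffix."""
    # Remove "(value1, value2)" annotations, half- or full-width parens
    name = re.sub(r"[\(（][^\)）]*[\)）]", "", name)
    # Drop "/synonym" parts, keeping the primary name
    name = name.split("/")[0].strip()
    return name + TYPE_SUFFIX[col_type]
```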

Field Map Storage

The system stores field mappings in the knowledge base configuration: rag/app/table.py395

This allows the retrieval system to search specific fields (e.g., age_long:[20 TO 30]).

Row-to-Chunk Conversion

Each row becomes a chunk: rag/app/table.py377-393

  • Field Assignment: Each cell value assigned to its transformed field name
  • Text Content: All fields concatenated as "field:value" pairs
  • Tokenization: Text fields tokenized, others stored as-is

Sources: rag/app/table.py307-398 rag/app/table.py80-251 rag/app/table.py268-304


Chunking Method: Laws

The laws method handles legal documents with hierarchical article structure, using tree-based merging to preserve legal document organization.

Legal Document Structure

Tree-Based Merging

The laws method uses tree_merge() instead of hierarchical_merge(): rag/nlp/__init__.py645-691

Node Class Structure: rag/nlp/__init__.py953-1016

Tree Building Algorithm:

  1. Level Assignment: Each section assigned a hierarchy level based on bullet patterns
  2. Parent-Child Relationships: Sections with lower level numbers become parents
  3. Depth Selection: Usually depth=2 (article level)
  4. Tree Traversal: Collect all text from root to target depth

Example hierarchy:

Level 1: 第一章 法律总则
  Level 2: 第一条 本法适用范围
    Level 3: (一) 条款细节
  Level 2: 第二条 定义解释 ← Chunk boundary
    Level 3: (一) 术语说明
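The tree-merge idea can be sketched without the Node class: walk sections in order, remember ancestors above the target depth, and start a new chunk at every target-depth node, prefixing it with its ancestors' text. This is a simplified sketch of tree_merge(), not the actual implementation:

```python
def tree_merge(sections: list[tuple[int, str]], target_depth: int = 2) -> list[str]:
    """Group (level, text) sections into chunks at `target_depth`.

    Each node at the target depth opens a new chunk that inherits its
    ancestors' text; deeper nodes are appended to the current chunk.
    """
    chunks, ancestors, current = [], {}, None
    for level, text in sections:
        if level < target_depth:
            ancestors[level] = text
            # Drop stale ancestors deeper than the one just seen
            ancestors = {l: t for l, t in ancestors.items() if l <= level}
        elif level == target_depth:
            if current:
                chunks.append("\n".join(current))
            prefix = [ancestors[l] for l in sorted(ancestors)]
            current = prefix + [text]
        elif current:
            current.append(text)
    if current:
        chunks.append("\n".join(current))
    return chunks
```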

Article Pattern Recognition

Legal documents use specific patterns: rag/nlp/__init__.py168-173

  • 第[零一二三四五六七八九十百0-9]+(分?编|部分) - Parts/Divisions
  • 第[零一二三四五六七八九十百0-9]+章 - Chapters
  • 第[零一二三四五六七八九十百0-9]+节 - Sections
  • 第[零一二三四五六七八九十百0-9]+条 - Articles
  • [\(（][零一二三四五六七八九十百]+[\)）] - Parenthesized clauses

DOCX-Specific Processing

For DOCX files: rag/app/laws.py33-89

  1. Uses paragraph styles (Heading 1-6) if available
  2. Falls back to bullet pattern detection
  3. Constructs tree with configurable depth
  4. Second-level depth (h2) typically used

Sources: rag/app/laws.py133-226 rag/nlp/__init__.py645-691 rag/nlp/__init__.py953-1016


Chunking Method: Manual

The manual method is designed for technical manuals and structured documentation, organizing content by hierarchical sections and preserving metadata like section IDs.

Manual Structure Processing

Section ID Assignment

The manual method assigns section_id values to group related content: rag/app/manual.py273-282

Sections with the same section_id are kept together during chunking, ensuring related content stays in one chunk.

Outline-Based Level Detection

For PDFs with embedded outlines: rag/app/manual.py253-271

  1. Extract outline entries with their levels
  2. Match outline text to document text using token overlap
  3. Use outline levels if >3% of sections have matches
  4. More reliable than bullet pattern detection

Token-Limited Merging

Small sections are merged: rag/app/manual.py295-309

  • If current chunk has <32 tokens, append next section
  • If current chunk has <1024 tokens AND next section has same section_id, append
  • Otherwise, start new chunk
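The merge rules above translate to a short loop. A sketch under assumed names (the real code is in rag/app/manual.py and uses the tokenizer's token count):

```python
def merge_sections(sections, count_tokens, min_tokens=32, max_tokens=1024):
    """Merge small sections per the rules above.

    `sections` is a list of (text, section_id); `count_tokens` is any
    token-counting callable.
    """
    chunks = []
    for text, sid in sections:
        if chunks:
            cur_text, cur_sid = chunks[-1]
            cur_tokens = count_tokens(cur_text)
            # Rule 1: tiny chunk absorbs the next section unconditionally.
            # Rule 2: under the limit AND same section_id -> keep appending.
            if cur_tokens < min_tokens or (cur_tokens < max_tokens and sid == cur_sid):
                chunks[-1] = (cur_text + "\n" + text, sid)
                continue
        chunks.append((text, sid))
    return [t for t, _ in chunks]
```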

Position Normalization

Manual chunks track precise positions: rag/app/manual.py221-240

Positions format: [(page_num, x1, x2, y1, y2), ...]

Sources: rag/app/manual.py182-338 rag/app/manual.py31-72 rag/app/manual.py75-179


Chunking Method: Presentation

The presentation method treats each slide as a separate chunk, preserving slide images and text for presentation files (PPTX) and slide-formatted PDFs.

Slide-Based Chunking

PPTX Processing

Uses the aspose.slides library: rag/app/presentation.py29-52

  1. Text Extraction: Iterate through slides, extract text content
  2. Thumbnail Generation: Render each slide at 0.1x scale to JPEG
  3. Image-Text Pairing: Create tuples of (text, image) for each slide
  4. Language Detection: Use is_english() on extracted text

PDF Slide Processing

For PDF presentations: rag/app/presentation.py54-83

  1. Page-Level OCR: Each page processed independently
  2. Garbage Filtering: Remove noise like page numbers, small fragments
    • Pattern: [0-9\.,%/-]+ (numeric-only text)
    • Length: <3 characters
  3. Image Association: Each page's text paired with page image

Garbage Text Detection

The __garbage() method filters out: rag/app/presentation.py58-64

  • Pure numeric strings: "123", "45.6%"
  • Very short text: <3 characters
  • Common separators: slashes, dashes, commas

This prevents page numbers and decorative elements from becoming chunks.
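The filter reduces to two checks. A minimal sketch of the described __garbage() rules (the exact pattern and length cutoff are taken from the description above):

```python
import re

def is_garbage(text: str) -> bool:
    """Filter slide noise: numeric-only strings (page numbers, percentages,
    separators) and very short fragments."""
    t = text.strip()
    if len(t) < 3:
        return True  # too short to be a meaningful chunk
    if re.fullmatch(r"[0-9\.,%/\- ]+", t):
        return True  # only digits, punctuation, and separators
    return False
```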

Chunk Metadata

Each slide chunk includes: rag/app/presentation.py119-124

The doc_type_kwd="image" flag enables image-aware retrieval in the search system.

Sources: rag/app/presentation.py97-171 rag/app/presentation.py29-52 rag/app/presentation.py54-83


Chunking Method: One

The one method treats the entire document as a single chunk, preserving complete document context. Useful for short documents or when maintaining full context is critical.

Single-Chunk Strategy

Content Aggregation

The one method processes all content then joins it: rag/app/one.py64-157

  1. Extract All Sections: Use appropriate parser for file type
  2. Extract All Tables: Include table HTML content
  3. Sort by Position: For PDFs, maintain reading order
  4. Join with Newlines: "\n".join(sections)

Example for PDF: rag/app/one.py58-61

Sorts by (page, top, left) to preserve reading order.

Excel Special Handling

Excel files use HTML representation: rag/app/one.py122-124

Each sheet converted to HTML table with all rows included.

Use Cases

The one method is appropriate for:

  • Short documents (<512 tokens)
  • Documents requiring full context (e.g., contracts)
  • Summary generation tasks
  • Documents with complex cross-references

Limitations

  • No chunking means large documents may exceed model context windows
  • Retrieval recall may be lower (entire document vs. relevant section)
  • Embedding quality may suffer for very long documents

Sources: rag/app/one.py64-157 rag/app/one.py28-61


Configuration System

Document parsing behavior is controlled through parser_config dictionaries passed to chunk functions. Configuration supports template-based presets and runtime customization.

Configuration Parameters

Parameter Reference Table

Parameter | Type | Default | Used By | Description
chunk_token_num | int | 512 | All methods | Maximum tokens per chunk
delimiter | str | "\n!?。；！？" | naive, book, paper | Character boundaries for splitting
children_delimiter | str | - | naive | Custom delimiters (backtick-wrapped)
overlapped_percent | int | 0 | naive | Overlap percentage (0-90)
layout_recognize | str | "DeepDOC" | PDF parsers | Parser selection
from_page | int | 0 | PDF parsers | Start page (0-indexed)
to_page | int | 100000 | PDF parsers | End page (exclusive)
analyze_hyperlink | bool | False | naive | Extract and chunk URLs
table_context_size | int | 0 | All methods | Tokens of context around tables
image_context_size | int | 0 | All methods | Tokens of context around images
html4excel | bool | False | naive (Excel) | Output Excel as HTML

Parser Selection Values

The layout_recognize parameter accepts these values: rag/app/naive.py135-141

Value | Implementation | Description
"DeepDOC" | by_deepdoc() | Default vision pipeline
"MinerU" | by_mineru() | External MinerU parser
"Docling" | by_docling() | IBM Docling parser
"TCADP" | by_tcadp() | Tencent Cloud parser
"Plain Text" | by_plaintext() | Text-only extraction

Custom Delimiter Format

Custom delimiters are specified in backticks: rag/app/naive.py619-621

The system:

  1. Extracts patterns: ["Chapter", "Section", "Part"]
  2. Creates regex: "(Chapter|Section|Part)"
  3. Splits content at these boundaries
  4. Each segment becomes separate chunk without merging

Overlap Percentage

When overlapped_percent > 0: rag/nlp/__init__.py803-806

Example with 20% overlap:

  • Chunk 1: Tokens 0-100
  • Chunk 2: Tokens 80-180 (last 20 tokens of Chunk 1 + new tokens)

Zero-Token Special Handling

Some parsers set chunk_token_num = 0 to disable chunking: rag/app/naive.py721-722

This preserves the parser's native output structure without additional splitting.

Sources: rag/app/naive.py615-623 rag/nlp/__init__.py784-840 rag/app/naive.py694-724


Vision and OCR Integration

PDF parsing integrates vision models for OCR (text detection/recognition), layout classification, and table structure recognition. The vision pipeline is optional and configurable per document.

Vision Pipeline Architecture

OCR Models

The system uses computer vision models for text extraction. From the architecture diagrams, the vision pipeline includes:

  1. Text Detection: Locates text regions in images
  2. Text Recognition: Converts image regions to text
  3. Layout Recognition: Classifies regions (text, title, figure, table)
  4. Table Structure: Detects table rows, columns, and cells

Vision Model Configuration

Vision models are optional and configurable via LLMBundle: rag/app/naive.py123-124

Vision Figure Parser Integration

After table extraction, the system can enhance figures: rag/app/naive.py54-56 rag/app/naive.py674

This wrapper:

  1. Detects figure regions in tables
  2. Applies vision models to extract figure descriptions
  3. Appends descriptions to table content

Layout Type Handling

Each text box has a layoutno field: rag/app/paper.py83-84

Common layout types:

  • text - Body text
  • title - Headings
  • figure - Images, diagrams
  • table - Tabular data
  • footer - Page footers
  • header - Page headers

Plain Text Fallback

If layout recognition is disabled: rag/app/naive.py119-132

PlainParser skips vision models and extracts raw text only.

Zoom Factor

The zoomin parameter (default 3) controls image resolution: rag/app/naive.py47-48

Higher zoom = better OCR accuracy but slower processing and more memory.

Sources: deepdoc/parser/pdf_parser.py rag/app/naive.py43-57 rag/app/naive.py119-132


Token Management and Merging

The chunking system uses token counting to ensure chunks fit within model context windows. Multiple merging strategies balance chunk size with semantic coherence.

Token Counting

Naive Merge Algorithm

The basic merging algorithm: rag/nlp/__init__.py784-840

Key behaviors:

  1. Accumulate text until token limit reached
  2. When limit exceeded, start new chunk
  3. Optionally include overlap from previous chunk
  4. Position tags are appended without counting toward the token limit (each tag is under 8 tokens)
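The accumulate-and-split loop with overlap can be sketched as follows. This is an illustrative version of the merge behavior, not rag/nlp's implementation; the word-based tail is an approximation of token-based overlap:

```python
def naive_merge(sections, count_tokens, chunk_token_num=512, overlap_pct=0):
    """Accumulate sections into chunks up to a token limit, optionally
    carrying over the last `overlap_pct` percent of the previous chunk."""
    chunks, current, cur_tokens = [], "", 0
    for sec in sections:
        t = count_tokens(sec)
        if current and cur_tokens + t > chunk_token_num:
            chunks.append(current)
            if overlap_pct:
                # Carry the tail of the previous chunk into the next one
                words = current.split()
                keep = words[len(words) * (100 - overlap_pct) // 100:]
                current = " ".join(keep)
                cur_tokens = count_tokens(current)
            else:
                current, cur_tokens = "", 0
        current = (current + "\n" + sec) if current else sec
        cur_tokens += t
    if current:
        chunks.append(current)
    return chunks
```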

Custom Delimiter Handling

Custom delimiters override token-based merging: rag/nlp/__init__.py817-835

Each custom-delimited segment becomes a separate chunk regardless of token count.

Hierarchical Merge Algorithm

Hierarchical merging groups by document structure: rag/nlp/__init__.py693-781

  1. Level Detection: Assign hierarchy levels to each section
  2. Level Selection: Choose target levels to merge (e.g., depth=5 means top 5 levels)
  3. Section Grouping: Group sections between level boundaries
  4. Token Accumulation: Merge small groups (<218 tokens) with same level

Example:

Level 1: Chapter 1 (218 tokens) → Chunk 1
Level 2: Section 1.1 (100 tokens)
Level 3: Subsection (50 tokens)
  → Merged into Chunk 2 (150 tokens)
Level 2: Section 1.2 (300 tokens) → Chunk 3

Tree Merge Algorithm

Tree merging constructs a hierarchy tree: rag/nlp/__init__.py645-691

  1. Tree Construction: Build parent-child relationships
  2. Depth Selection: Choose target depth level
  3. Traversal: Collect all text from root to target depth
  4. Concatenation: Join parent and child text

The Node class manages the tree: rag/nlp/__init__.py953-1016

Markdown-Specific Merging

Markdown uses special handling for sections with images: rag/app/naive.py859-910

  1. Section Iteration: Process each markdown section
  2. Token Counting: Track current chunk tokens
  3. Image Accumulation: Concatenate images vertically
  4. Overlap Handling: Include last N% of previous chunk
  5. Image Association: Each chunk paired with combined image

Sources: rag/nlp/__init__.py784-840 rag/nlp/__init__.py843-910 rag/nlp/__init__.py693-781 rag/nlp/__init__.py645-691


Metadata and Position Tracking

Each chunk includes metadata for retrieval, filtering, and debugging. Position information enables precise source location within documents.

Chunk Metadata Schema

Core Metadata Fields

Field | Type | Example | Description
docnm_kwd | str | "contract.pdf" | Source filename
title_tks | str | "contract" | Tokenized filename (no extension)
title_sm_tks | str | "con tra ct" | Fine-grained tokens
content_with_weight | str | "Full chunk text..." | Complete chunk text
content_ltks | str | "full chunk text" | Tokenized content
content_sm_ltks | str | "fu ll ch un k te xt" | Fine-grained tokens

Position Metadata

Position tracking enables source highlighting: rag/nlp/__init__.py547-559

Position format: (page_number, x1, x2, y1, y2) in PDF coordinate space

  • page_number: 1-indexed page number
  • x1, x2: Horizontal bounds (left, right)
  • y1, y2: Vertical bounds (top, bottom)

Position Tag Format

Positions are embedded in text as tags: rag/app/manual.py289-293

Example: @@5\t100.0\t200.0\t50.0\t75.0## means page 5, x=[100,200], y=[50,75]
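Encoding and decoding this tag format is straightforward. A sketch with hypothetical helper names, matching the @@page\tx1\tx2\ty1\ty2## format shown above:

```python
import re

def encode_position(page: int, x1: float, x2: float, y1: float, y2: float) -> str:
    """Render a position tag in the @@page\\tx1\\tx2\\ty1\\ty2## format."""
    return f"@@{page}\t{x1:.1f}\t{x2:.1f}\t{y1:.1f}\t{y2:.1f}##"

def decode_positions(text: str) -> list[tuple]:
    """Extract (page, x1, x2, y1, y2) tuples from embedded position tags."""
    tags = re.findall(r"@@(\d+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##", text)
    return [(int(p), float(a), float(b), float(c), float(d))
            for p, a, b, c, d in tags]
```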

Document Type Classification

The doc_type_kwd field indicates chunk type:

  • "text" - Default text chunk
  • "table" - Table or structured data
  • "image" - Image or figure with optional caption

Set by tokenization functions: rag/nlp/__init__.py337-340

Hierarchical Context Fields

For child-delimiter chunking: rag/nlp/__init__.py293-298

The mom_with_weight field stores the parent chunk text, enabling reconstruction of context during retrieval.

Image Metadata

Image chunks include PIL.Image objects: rag/nlp/__init__.py312-313

Images are stored as binary data in the document store, referenced by chunk ID.

Table Context Attachment

The attach_media_context() function adds surrounding text to tables/images: rag/nlp/__init__.py359-544

  1. Position-Based Sorting: Order chunks by (page, top, left) if position data exists
  2. Context Collection: Gather previous and next text chunks up to token budget
  3. Sentence Trimming: Trim to sentence boundaries to avoid partial sentences
  4. Content Update: Replace content_with_weight with text+context

Example:

Original table chunk: "<table>...</table>"
With context (200 tokens): "Previous paragraph text.\n<table>...</table>\nNext paragraph text."

Sources: rag/nlp/__init__.py267-302 rag/nlp/__init__.py327-356 rag/nlp/__init__.py547-559 rag/nlp/__init__.py359-544