Document Parsing and Chunking Methods

Purpose and Scope

This document explains RAGFlow's document parsing and chunking system, which transforms raw files into retrievable text chunks with metadata. The system employs a parser factory pattern where different parsing strategies are selected based on document type and user configuration. Each parsing method applies specialized logic to preserve document structure and semantics.

For information about the task execution system that coordinates parsing operations, see Task Execution and Queue System. For embedding and indexing after chunking, see Document Processing Pipeline.


Parser Factory Pattern

RAGFlow uses a centralized factory pattern to dispatch parsing requests to specialized implementations. The PARSERS dictionary maps parser identifiers to functions that return parsed document sections.

Parser Registration and Dispatch

Parser Selection Logic

The system selects parsers based on two factors:

  1. File Extension: Determines which chunk() function to call (e.g., naive.chunk(), book.chunk())
  2. Layout Recognizer: For PDFs, determines which parser implementation to use

rag/app/naive.py135-141 defines the PARSERS dictionary.

Sources: rag/app/naive.py135-141 rag/app/naive.py694-714


Chunking Method: Naive (General Purpose)

The naive method provides general-purpose document chunking based on token limits and delimiters. It's the default method for most document types and supports embedded files, hyperlink extraction, and custom delimiters.

Naive Chunking Strategy

Configuration Parameters

The naive method accepts these configuration keys in parser_config:

Parameter | Default | Description
chunk_token_num | 512 | Maximum tokens per chunk
delimiter | "\n!?。；！？" | Character boundaries for splitting
children_delimiter | - | Custom delimiters in backticks
layout_recognize | "DeepDOC" | PDF parser selection
analyze_hyperlink | False | Extract and process URLs
table_context_size | 0 | Tokens of context around tables
image_context_size | 0 | Tokens of context around images
overlapped_percent | 0 | Overlap percentage between chunks

Custom Delimiter Handling

Custom delimiters are specified in backticks (e.g., `Chapter`, `Section`). The system:

  1. Extracts patterns: rag/app/naive.py619-621
  2. Splits content at these boundaries: rag/nlp/__init__.py817-835
  3. Creates separate chunks without merging across delimiters
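The steps above can be sketched in a few lines. This is an illustrative helper (the function name and exact splitting behavior are assumptions, not RAGFlow's actual code): extract the backtick-wrapped patterns, split on them, and keep each delimiter attached to the segment it introduces.

```python
import re

def split_on_custom_delimiters(text: str, delimiter_spec: str) -> list[str]:
    """Split text at backtick-wrapped delimiters, e.g. '`Chapter``Section`'.

    Hypothetical sketch of the backtick-delimiter behavior described above.
    """
    # 1. Extract backtick-wrapped patterns: `Chapter` -> "Chapter"
    patterns = re.findall(r"`([^`]+)`", delimiter_spec)
    if not patterns:
        return [text]
    # 2. Build an alternation that captures the delimiter so it survives the split
    splitter = re.compile("(" + "|".join(map(re.escape, patterns)) + ")")
    parts = splitter.split(text)
    # 3. Re-attach each delimiter to the segment it opens; no merging across boundaries
    chunks, current = [], parts[0]
    for i in range(1, len(parts), 2):
        if current.strip():
            chunks.append(current.strip())
        current = parts[i] + parts[i + 1]
    if current.strip():
        chunks.append(current.strip())
    return chunks
```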

File Type-Specific Processing

Embedded File Extraction

The naive method recursively processes embedded files (e.g., attachments in DOCX):

  1. Extract embedded files: rag/app/naive.py636-653
  2. Recursively call chunk() with is_root=False
  3. Merge results into parent document chunks

Hyperlink Analysis

When enabled, the system:

  1. Extracts URLs from DOCX/PDF: rag/utils/file_utils.py
  2. Fetches HTML content: rag/app/naive.py660-668
  3. Recursively chunks the linked content
  4. Appends to main document results

Sources: rag/app/naive.py604-923 rag/nlp/__init__.py784-840 rag/nlp/__init__.py843-910


Chunking Method: Book

The book method applies hierarchical structure detection to create chunks that respect chapter/section boundaries. It's optimized for long-form content with clear hierarchical organization.

Book Structure Recognition

Bullet Pattern Detection

The system recognizes multiple bullet pattern categories: rag/nlp/__init__.py168-200

  1. Chinese Patterns: 第一章, 第一节, (一), etc.
  2. Numeric Patterns: 1., 1.1, 1.1.1, etc.
  3. Mixed Patterns: Combination of Chinese and numbers
  4. English Patterns: Chapter I, Section 1, Article 1
  5. Markdown Patterns: #, ##, ###, etc.

The detector samples random sections and counts pattern matches to identify the document's structure type.
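The sampling-and-voting idea can be sketched as follows. The pattern lists here are heavily simplified stand-ins (the real ones live in rag/nlp/__init__.py), and the function name is hypothetical:

```python
import random
import re

# Simplified pattern categories; illustrative only.
BULLET_CATEGORIES = {
    "chinese": [r"第[零一二三四五六七八九十百0-9]+[章节条]"],
    "numeric": [r"^[0-9]+(\.[0-9]+)*[\. ]"],
    "english": [r"^(Chapter|Section|Article)\s+[IVX0-9]+"],
    "markdown": [r"^#{1,6}\s"],
}

def detect_bullet_category(sections: list[str], sample_size: int = 100, seed: int = 0) -> str:
    """Sample sections and vote on which pattern category matches most often."""
    rng = random.Random(seed)
    sample = sections if len(sections) <= sample_size else rng.sample(sections, sample_size)
    scores = {name: 0 for name in BULLET_CATEGORIES}
    for sec in sample:
        for name, pats in BULLET_CATEGORIES.items():
            if any(re.match(p, sec.strip()) for p in pats):
                scores[name] += 1
    # The category with the most matches is taken as the document's structure type
    return max(scores, key=scores.get)
```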

Hierarchical Merging Algorithm

rag/nlp/__init__.py693-781 implements hierarchical merging:

  1. Level Assignment: Assign hierarchy levels based on bullet patterns
  2. Pivot Selection: Choose most frequent level as merge boundary
  3. Binary Search: Find parent/child relationships between sections
  4. Token Accumulation: Merge small sections up to token limit (218 tokens for short sections, configurable for others)

Example hierarchy:

Level 0: Chapter I (section_id=0)
  Level 1: Section 1.1 (section_id=0)
    Level 2: Paragraph (section_id=0)
  Level 1: Section 1.2 (section_id=1) ← New chunk
    Level 2: Paragraph (section_id=1)

PDF Outline Integration

For PDFs with embedded outlines: rag/app/book.py98-127

  • Extract outline hierarchy from PDF metadata
  • Match outline entries to text sections using token overlap (80% threshold)
  • Use outline levels instead of bullet pattern detection
  • More accurate than pattern-based detection for well-structured PDFs

Sources: rag/app/book.py68-183 rag/nlp/__init__.py693-781 rag/nlp/__init__.py168-200


Chunking Method: Paper

The paper method is specialized for academic papers, extracting title, authors, abstract, and sections while preserving scientific document structure.

Paper Structure Extraction

Two-Column Layout Handling

Academic papers often use two-column layouts. The parser:

  1. Calculates median column width: rag/app/paper.py60
  2. Detects two-column if column_width < page_width / 2
  3. Sorts text boxes by column order: rag/app/paper.py69
  4. Preserves reading order across columns
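The column-ordering step amounts to partitioning boxes by the page midline and sorting each column top to bottom. A minimal sketch (box layout and field names are assumptions, not paper.py's data model):

```python
def sort_two_column_boxes(boxes: list[dict], page_width: float) -> list[dict]:
    """Sort OCR text boxes into reading order for a two-column page.

    Each box is a dict with 'x0' (left edge) and 'top'. A box whose left
    edge falls past the page midline is assigned to the right column;
    the left column is read first, top to bottom.
    """
    mid = page_width / 2
    left = sorted((b for b in boxes if b["x0"] < mid), key=lambda b: b["top"])
    right = sorted((b for b in boxes if b["x0"] >= mid), key=lambda b: b["top"])
    return left + right
```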

Abstract Extraction

The abstract is treated as a special chunk with enhanced metadata: rag/app/paper.py202-211

  • Marked with important_kwd: ["abstract", "总结", "概括", "summary", "summarize"]
  • Stored as complete section without token splitting
  • Includes cropped image of abstract region

Section-Based Chunking

Unlike naive chunking, paper chunking:

  1. Detects section boundaries (Introduction, Methods, Results, Discussion, etc.)
  2. Assigns section IDs based on hierarchy: rag/app/paper.py216-225
  3. Merges all text within a section boundary: rag/app/paper.py227-235
  4. Preserves section context even if exceeding token limits

Beginning Pattern Recognition

The parser identifies paper beginning markers: rag/app/paper.py73-76

  • introduction, abstract, 摘要, 引言, keywords, background, 目录, 前言, contents
  • Used to skip title page content and focus on substantive content

Sources: rag/app/paper.py142-241 rag/app/paper.py29-139


Chunking Method: Q&A

The qa method extracts question-answer pairs from structured documents, creating one chunk per Q&A pair. It supports multiple formats: Excel, CSV, TXT, PDF, Markdown, and DOCX.

Q&A Structure Detection

Question Pattern Recognition

rag/nlp/__init__.py74-86 defines 11 question bullet patterns:

  1. 第([零一二三四五六七八九十百0-9]+)问 - Chinese "Question N"
  2. 第([零一二三四五六七八九十百0-9]+)条 - Chinese "Article N"
  3. [\(（]([零一二三四五六七八九十百]+)[\)）] - Parenthesized Chinese numbers
  4. 第([0-9]+)问 - Numeric "Question N"
  5. 第([0-9]+)条 - Numeric "Article N"
  6. ([0-9]{1,2})[\. 、] - Numeric bullets
  7. ([零一二三四五六七八九十百]+)[ 、] - Chinese word bullets
  8. [\(（]([0-9]+)[\)）] - Parenthesized numbers
  9. QUESTION (ONE|TWO|THREE|...) - English word numbers
  10. QUESTION (I+V?|VI*|XI|IX|X) - Roman numerals
  11. QUESTION ([0-9]+) - Numeric questions

Format-Specific Processing

Excel/CSV Format: rag/app/qa.py36-76 rag/app/qa.py377-407

  • Two columns: Question | Answer
  • No header row required
  • Empty rows skipped, tracked as failures
  • Delimiter auto-detection (comma vs tab)
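The two-column extraction with delimiter auto-detection can be sketched like this. The helper name and the exact detection heuristic are assumptions for illustration; the real logic is in rag/app/qa.py:

```python
def parse_qa_lines(raw: str):
    """Parse question/answer lines, auto-detecting comma vs tab delimiter.

    Returns (pairs, failures), where failures counts rows that lack two
    non-empty fields, mirroring the 'empty rows tracked as failures' rule.
    """
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    # Auto-detect: prefer tab when it appears in at least as many lines as comma
    tab_hits = sum(1 for ln in lines if "\t" in ln)
    comma_hits = sum(1 for ln in lines if "," in ln)
    delim = "\t" if tab_hits >= comma_hits else ","
    pairs, failures = [], 0
    for ln in lines:
        fields = [f.strip() for f in ln.split(delim)]
        if len(fields) >= 2 and fields[0] and fields[1]:
            pairs.append((fields[0], fields[1]))
        else:
            failures += 1
    return pairs, failures
```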

PDF Format: rag/app/qa.py79-183

  • Uses OCR and layout recognition
  • Applies question bullet patterns
  • Tracks question indices to detect sequences
  • Accumulates answers until next question
  • Associates images with answers

Markdown Format: rag/app/qa.py418-451

  • Heading levels indicate question hierarchy
  • mdQuestionLevel() counts # symbols
  • Questions stack hierarchically: # Q1 → ## Q1.1 → Q1.1 answer
  • Answers converted to HTML tables if present

DOCX Format: rag/app/qa.py185-259 rag/app/qa.py453-460

  • Uses paragraph styles (Heading 1-6)
  • Extracts images from paragraphs
  • Stacks questions hierarchically
  • Concatenates images vertically

Hierarchical Question Stacking

For hierarchical formats (Markdown, DOCX): rag/app/qa.py224-229

This creates nested question contexts:

  • A hierarchy of Chapter 1 → Section 1.1 → Question A results in:
  • Full question: "Chapter 1\nSection 1.1\nQuestion A"
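The stacking behavior can be sketched with a small helper (hypothetical name, not the qa.py implementation): pop headings at the same or deeper level, push the new one, and join the stack top-down.

```python
def stack_question(stack: list[tuple[int, str]], level: int, question: str) -> str:
    """Maintain a stack of (level, text) headings and return the full
    hierarchical question for a new heading at `level`.
    """
    # Pop headings at the same or deeper level before pushing the new one
    while stack and stack[-1][0] >= level:
        stack.pop()
    stack.append((level, question))
    return "\n".join(text for _, text in stack)
```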

Prefix Addition

The system adds language-specific prefixes: rag/app/qa.py267-278

  • English: "Question: " + q + "\t" + "Answer: " + a
  • Chinese: "问题:" + q + "\t" + "回答:" + a
  • Prefix removal via rmPrefix() to handle user-entered prefixes

Sources: rag/app/qa.py313-463 rag/nlp/__init__.py74-129


Chunking Method: Table

The table method processes structured tabular data, converting spreadsheets into searchable records. Each row becomes a chunk with field mappings.

Table Processing Pipeline

Complex Header Parsing

Excel files may have multi-level headers with merged cells. The parser: rag/app/table.py80-203

  1. Detects Complex Structure: Checks for merged cells in first two rows
  2. Counts Header Rows: Identifies rows with "header-like" content vs data
  3. Builds Hierarchical Headers: Merges levels with dashes (e.g., "Category-Subcategory-Field")
  4. Handles Merged Cells: Extracts value from top-left cell of merged range

Example:

Row 1: [  Product Info   ][    Sales Data     ]
Row 2: [ Name ][ Price ][ Q1 ][ Q2 ][ Q3 ][ Q4 ]

Results in: ["Product Info-Name", "Product Info-Price", "Sales Data-Q1", ...]
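The header-flattening behavior in the example can be sketched as follows. Here a merged cell is represented by None in the covered positions (an assumption about input shape; the real parser reads merged ranges from openpyxl):

```python
def build_headers(header_rows: list[list]) -> list[str]:
    """Flatten multi-level headers into dash-joined names.

    None marks a cell covered by a merge; its value is carried over from
    the cell to its left (the top-left of the merged range).
    """
    filled = []
    for row in header_rows:
        out, last = [], ""
        for cell in row:
            if cell is not None and str(cell).strip():
                last = str(cell).strip()
            out.append(last)
        filled.append(out)
    headers = []
    for c in range(len(filled[0])):
        parts = []
        for row in filled:
            # Skip empties and immediate repeats so "A-A-B" collapses to "A-B"
            if row[c] and (not parts or parts[-1] != row[c]):
                parts.append(row[c])
        headers.append("-".join(parts))
    return headers
```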

Data Type Detection

The column_data_type() function analyzes each column: rag/app/table.py268-304

Type | Pattern | Example
int | [+-]?[0-9]+ | 123, -456
float | [+-]?[0-9.]{,19} | 12.34, -0.5
bool | true|yes|是|✓|false|no|否|× | Yes, No, ✓
datetime | dateutil.parser.parse() | "2024-01-15", "Jan 15"
text | Everything else | "Product Name"

Type detection uses voting: if >50% of non-null values match a pattern, the column gets that type.
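The voting rule can be sketched like this. It is a simplified stand-in for column_data_type() (datetime detection via dateutil is omitted, and the patterns are abbreviated):

```python
import re

def column_type(values, threshold: float = 0.5) -> str:
    """Vote on a column's type: if more than `threshold` of the non-null
    values match a pattern, the column gets that type.
    """
    checks = [
        ("int", lambda v: re.fullmatch(r"[+-]?[0-9]+", v) is not None),
        ("float", lambda v: re.fullmatch(r"[+-]?[0-9]*\.[0-9]+", v) is not None),
        ("bool", lambda v: v.lower() in {"true", "yes", "是", "false", "no", "否"}),
    ]
    vals = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not vals:
        return "text"
    for name, pred in checks:
        if sum(pred(v) for v in vals) / len(vals) > threshold:
            return name
    return "text"  # fallback when no pattern wins the vote
```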

Field Name Transformation

Field names are converted to Elasticsearch-compatible formats: rag/app/table.py366-374

  1. Remove Annotations: Strip (value1, value2) and /synonym parts
  2. Pinyin Conversion: Convert Chinese characters to pinyin (e.g., 姓名 → xingming)
  3. Add Type Suffix: Append based on detected type
    • _tks for text (tokenized)
    • _long for integers
    • _flt for floats
    • _dt for datetimes
    • _kwd for keywords/booleans

Example: "姓名/名字" → xingming_tks
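The stripping and suffixing steps can be sketched as below. Pinyin conversion of Chinese characters (done in the real code, e.g. via a pinyin library) is deliberately omitted here, and the function name is hypothetical:

```python
import re

# Type-to-suffix mapping as described above
TYPE_SUFFIX = {"text": "_tks", "int": "_long", "float": "_flt",
               "datetime": "_dt", "bool": "_kwd"}

def transform_field_name(name: str, col_type: str) -> str:
    """Strip annotations/synonyms and append a type suffix."""
    # Remove "(value1, value2)" annotations, half- or full-width parens
    name = re.sub(r"[\(（][^\)）]*[\)）]", "", name)
    # Drop "/synonym" parts, keeping the primary name
    name = name.split("/")[0].strip()
    return name + TYPE_SUFFIX[col_type]
```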

Field Map Storage

The system stores field mappings in the knowledge base configuration: rag/app/table.py395

This allows the retrieval system to search specific fields (e.g., age_long:[20 TO 30]).

Row-to-Chunk Conversion

Each row becomes a chunk: rag/app/table.py377-393

  • Field Assignment: Each cell value assigned to its transformed field name
  • Text Content: All fields concatenated as "field:value" pairs
  • Tokenization: Text fields tokenized, others stored as-is

Sources: rag/app/table.py307-398 rag/app/table.py80-251 rag/app/table.py268-304


Chunking Method: Laws

The laws method handles legal documents with hierarchical article structure, using tree-based merging to preserve legal document organization.

Legal Document Structure

Tree-Based Merging

The laws method uses tree_merge() instead of hierarchical_merge(): rag/nlp/__init__.py645-691

Node Class Structure: rag/nlp/__init__.py953-1016

Tree Building Algorithm:

  1. Level Assignment: Each section assigned a hierarchy level based on bullet patterns
  2. Parent-Child Relationships: Sections with lower level numbers become parents
  3. Depth Selection: Usually depth=2 (article level)
  4. Tree Traversal: Collect all text from root to target depth

Example hierarchy:

Level 1: 第一章 法律总则
  Level 2: 第一条 本法适用范围
    Level 3: (一) 条款细节
  Level 2: 第二条 定义解释 ← Chunk boundary
    Level 3: (一) 术语说明
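The tree-merge idea can be sketched without the Node class: walk sections in order, remember ancestors above the target depth, and start a new chunk at every target-depth node, prefixing it with its ancestors' text. This is a simplified sketch of tree_merge(), not the actual implementation:

```python
def tree_merge(sections: list[tuple[int, str]], target_depth: int = 2) -> list[str]:
    """Group (level, text) sections into chunks at `target_depth`.

    Each node at the target depth opens a new chunk that inherits its
    ancestors' text; deeper nodes are appended to the current chunk.
    """
    chunks, ancestors, current = [], {}, None
    for level, text in sections:
        if level < target_depth:
            ancestors[level] = text
            # Drop stale ancestors deeper than the one just seen
            ancestors = {l: t for l, t in ancestors.items() if l <= level}
        elif level == target_depth:
            if current:
                chunks.append("\n".join(current))
            prefix = [ancestors[l] for l in sorted(ancestors)]
            current = prefix + [text]
        elif current:
            current.append(text)
    if current:
        chunks.append("\n".join(current))
    return chunks
```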

Article Pattern Recognition

Legal documents use specific patterns: rag/nlp/__init__.py168-173

  • 第[零一二三四五六七八九十百0-9]+(分?编|部分) - Parts/Divisions
  • 第[零一二三四五六七八九十百0-9]+章 - Chapters
  • 第[零一二三四五六七八九十百0-9]+节 - Sections
  • 第[零一二三四五六七八九十百0-9]+条 - Articles
  • [\(（][零一二三四五六七八九十百]+[\)）] - Parenthesized clauses

DOCX-Specific Processing

For DOCX files: rag/app/laws.py33-89

  1. Uses paragraph styles (Heading 1-6) if available
  2. Falls back to bullet pattern detection
  3. Constructs tree with configurable depth
  4. Second-level depth (h2) typically used

Sources: rag/app/laws.py133-226 rag/nlp/__init__.py645-691 rag/nlp/__init__.py953-1016


Chunking Method: Manual

The manual method is designed for technical manuals and structured documentation, organizing content by hierarchical sections and preserving metadata like section IDs.

Manual Structure Processing

Section ID Assignment

The manual method assigns section_id values to group related content: rag/app/manual.py273-282

Sections with the same section_id are kept together during chunking, ensuring related content stays in one chunk.

Outline-Based Level Detection

For PDFs with embedded outlines: rag/app/manual.py253-271

  1. Extract outline entries with their levels
  2. Match outline text to document text using token overlap
  3. Use outline levels if >3% of sections have matches
  4. More reliable than bullet pattern detection

Token-Limited Merging

Small sections are merged: rag/app/manual.py295-309

  • If current chunk has <32 tokens, append next section
  • If current chunk has <1024 tokens AND next section has same section_id, append
  • Otherwise, start new chunk
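The merge rules above translate to a short loop. A sketch under assumed names (the real code is in rag/app/manual.py and uses the tokenizer's token count):

```python
def merge_sections(sections, count_tokens, min_tokens=32, max_tokens=1024):
    """Merge small sections per the rules above.

    `sections` is a list of (text, section_id); `count_tokens` is any
    token-counting callable.
    """
    chunks = []
    for text, sid in sections:
        if chunks:
            cur_text, cur_sid = chunks[-1]
            cur_tokens = count_tokens(cur_text)
            # Rule 1: tiny chunk absorbs the next section unconditionally.
            # Rule 2: under the limit AND same section_id -> keep appending.
            if cur_tokens < min_tokens or (cur_tokens < max_tokens and sid == cur_sid):
                chunks[-1] = (cur_text + "\n" + text, sid)
                continue
        chunks.append((text, sid))
    return [t for t, _ in chunks]
```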

Position Normalization

Manual chunks track precise positions: rag/app/manual.py221-240

Positions format: [(page_num, x1, x2, y1, y2), ...]

Sources: rag/app/manual.py182-338 rag/app/manual.py31-72 rag/app/manual.py75-179


Chunking Method: Presentation

The presentation method treats each slide as a separate chunk, preserving slide images and text for presentation files (PPTX) and slide-formatted PDFs.

Slide-Based Chunking

PPTX Processing

Uses the aspose.slides library: rag/app/presentation.py29-52

  1. Text Extraction: Iterate through slides, extract text content
  2. Thumbnail Generation: Render each slide at 0.1x scale to JPEG
  3. Image-Text Pairing: Create tuples of (text, image) for each slide
  4. Language Detection: Use is_english() on extracted text

PDF Slide Processing

For PDF presentations: rag/app/presentation.py54-83

  1. Page-Level OCR: Each page processed independently
  2. Garbage Filtering: Remove noise like page numbers, small fragments
    • Pattern: [0-9\.,%/-]+ (numeric-only text)
    • Length: <3 characters
  3. Image Association: Each page's text paired with page image

Garbage Text Detection

The __garbage() method filters out: rag/app/presentation.py58-64

  • Pure numeric strings: "123", "45.6%"
  • Very short text: <3 characters
  • Common separators: slashes, dashes, commas

This prevents page numbers and decorative elements from becoming chunks.
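The filter reduces to two checks. A minimal sketch of the described __garbage() rules (the exact pattern and length cutoff are taken from the description above):

```python
import re

def is_garbage(text: str) -> bool:
    """Filter slide noise: numeric-only strings (page numbers, percentages,
    separators) and very short fragments."""
    t = text.strip()
    if len(t) < 3:
        return True  # too short to be a meaningful chunk
    if re.fullmatch(r"[0-9\.,%/\- ]+", t):
        return True  # only digits, punctuation, and separators
    return False
```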

Chunk Metadata

Each slide chunk includes: rag/app/presentation.py119-124

The doc_type_kwd="image" flag enables image-aware retrieval in the search system.

Sources: rag/app/presentation.py97-171 rag/app/presentation.py29-52 rag/app/presentation.py54-83


Chunking Method: One

The one method treats the entire document as a single chunk, preserving complete document context. Useful for short documents or when maintaining full context is critical.

Single-Chunk Strategy

Content Aggregation

The one method processes all content then joins it: rag/app/one.py64-157

  1. Extract All Sections: Use appropriate parser for file type
  2. Extract All Tables: Include table HTML content
  3. Sort by Position: For PDFs, maintain reading order
  4. Join with Newlines: "\n".join(sections)

Example for PDF: rag/app/one.py58-61

Sorts by (page, top, left) to preserve reading order.

Excel Special Handling

Excel files use HTML representation: rag/app/one.py122-124

Each sheet converted to HTML table with all rows included.

Use Cases

The one method is appropriate for:

  • Short documents (<512 tokens)
  • Documents requiring full context (e.g., contracts)
  • Summary generation tasks
  • Documents with complex cross-references

Limitations

  • No chunking means large documents may exceed model context windows
  • Retrieval recall may be lower (entire document vs. relevant section)
  • Embedding quality may suffer for very long documents

Sources: rag/app/one.py64-157 rag/app/one.py28-61


Configuration System

Document parsing behavior is controlled through parser_config dictionaries passed to chunk functions. Configuration supports template-based presets and runtime customization.

Configuration Parameters

Parameter Reference Table

Parameter | Type | Default | Used By | Description
chunk_token_num | int | 512 | All methods | Maximum tokens per chunk
delimiter | str | "\n!?。；！？" | naive, book, paper | Character boundaries for splitting
children_delimiter | str | - | naive | Custom delimiters (backtick-wrapped)
overlapped_percent | int | 0 | naive | Overlap percentage (0-90)
layout_recognize | str | "DeepDOC" | PDF parsers | Parser selection
from_page | int | 0 | PDF parsers | Start page (0-indexed)
to_page | int | 100000 | PDF parsers | End page (exclusive)
analyze_hyperlink | bool | False | naive | Extract and chunk URLs
table_context_size | int | 0 | All methods | Tokens of context around tables
image_context_size | int | 0 | All methods | Tokens of context around images
html4excel | bool | False | naive (Excel) | Output Excel as HTML

Parser Selection Values

The layout_recognize parameter accepts these values: rag/app/naive.py135-141

Value | Implementation | Description
"DeepDOC" | by_deepdoc() | Default vision pipeline
"MinerU" | by_mineru() | External MinerU parser
"Docling" | by_docling() | IBM Docling parser
"TCADP" | by_tcadp() | Tencent Cloud parser
"Plain Text" | by_plaintext() | Text-only extraction

Custom Delimiter Format

Custom delimiters are specified in backticks: rag/app/naive.py619-621

The system:

  1. Extracts patterns: ["Chapter", "Section", "Part"]
  2. Creates regex: "(Chapter|Section|Part)"
  3. Splits content at these boundaries
  4. Each segment becomes separate chunk without merging

Overlap Percentage

When overlapped_percent > 0: rag/nlp/__init__.py803-806

Example with 20% overlap:

  • Chunk 1: Tokens 0-100
  • Chunk 2: Tokens 80-180 (last 20 tokens of Chunk 1 + new tokens)

Zero-Token Special Handling

Some parsers set chunk_token_num = 0 to disable chunking: rag/app/naive.py721-722

This preserves the parser's native output structure without additional splitting.

Sources: rag/app/naive.py615-623 rag/nlp/__init__.py784-840 rag/app/naive.py694-724


Vision and OCR Integration

PDF parsing integrates vision models for OCR (text detection/recognition), layout classification, and table structure recognition. The vision pipeline is optional and configurable per document.

Vision Pipeline Architecture

OCR Models

The system uses computer vision models for text extraction. From the architecture diagrams, the vision pipeline includes:

  1. Text Detection: Locates text regions in images
  2. Text Recognition: Converts image regions to text
  3. Layout Recognition: Classifies regions (text, title, figure, table)
  4. Table Structure: Detects table rows, columns, and cells

Vision Model Configuration

Vision models are optional and configurable via LLMBundle: rag/app/naive.py123-124

Vision Figure Parser Integration

After table extraction, the system can enhance figures: rag/app/naive.py54-56 rag/app/naive.py674

This wrapper:

  1. Detects figure regions in tables
  2. Applies vision models to extract figure descriptions
  3. Appends descriptions to table content

Layout Type Handling

Each text box has a layoutno field: rag/app/paper.py83-84

Common layout types:

  • text - Body text
  • title - Headings
  • figure - Images, diagrams
  • table - Tabular data
  • footer - Page footers
  • header - Page headers

Plain Text Fallback

If layout recognition is disabled: rag/app/naive.py119-132

PlainParser skips vision models and extracts raw text only.

Zoom Factor

The zoomin parameter (default 3) controls image resolution: rag/app/naive.py47-48

Higher zoom = better OCR accuracy but slower processing and more memory.

Sources: deepdoc/parser/pdf_parser.py rag/app/naive.py43-57 rag/app/naive.py119-132


Token Management and Merging

The chunking system uses token counting to ensure chunks fit within model context windows. Multiple merging strategies balance chunk size with semantic coherence.

Token Counting

Naive Merge Algorithm

The basic merging algorithm: rag/nlp/__init__.py784-840

Key behaviors:

  1. Accumulate text until token limit reached
  2. When limit exceeded, start new chunk
  3. Optionally include overlap from previous chunk
  4. Position tags are appended without counting toward the token limit (each tag is under 8 tokens)
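The accumulate-and-split loop with overlap can be sketched as follows. This is an illustrative version of the merge behavior, not rag/nlp's implementation; the word-based tail is an approximation of token-based overlap:

```python
def naive_merge(sections, count_tokens, chunk_token_num=512, overlap_pct=0):
    """Accumulate sections into chunks up to a token limit, optionally
    carrying over the last `overlap_pct` percent of the previous chunk."""
    chunks, current, cur_tokens = [], "", 0
    for sec in sections:
        t = count_tokens(sec)
        if current and cur_tokens + t > chunk_token_num:
            chunks.append(current)
            if overlap_pct:
                # Carry the tail of the previous chunk into the next one
                words = current.split()
                keep = words[len(words) * (100 - overlap_pct) // 100:]
                current = " ".join(keep)
                cur_tokens = count_tokens(current)
            else:
                current, cur_tokens = "", 0
        current = (current + "\n" + sec) if current else sec
        cur_tokens += t
    if current:
        chunks.append(current)
    return chunks
```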

Custom Delimiter Handling

Custom delimiters override token-based merging: rag/nlp/__init__.py817-835

Each custom-delimited segment becomes a separate chunk regardless of token count.

Hierarchical Merge Algorithm

Hierarchical merging groups by document structure: rag/nlp/__init__.py693-781

  1. Level Detection: Assign hierarchy levels to each section
  2. Level Selection: Choose target levels to merge (e.g., depth=5 means top 5 levels)
  3. Section Grouping: Group sections between level boundaries
  4. Token Accumulation: Merge small groups (<218 tokens) with same level

Example:

Level 1: Chapter 1 (218 tokens) → Chunk 1
Level 2: Section 1.1 (100 tokens)
Level 3: Subsection (50 tokens)
  → Merged into Chunk 2 (150 tokens)
Level 2: Section 1.2 (300 tokens) → Chunk 3

Tree Merge Algorithm

Tree merging constructs a hierarchy tree: rag/nlp/__init__.py645-691

  1. Tree Construction: Build parent-child relationships
  2. Depth Selection: Choose target depth level
  3. Traversal: Collect all text from root to target depth
  4. Concatenation: Join parent and child text

The Node class manages the tree: rag/nlp/__init__.py953-1016

Markdown-Specific Merging

Markdown uses special handling for sections with images: rag/app/naive.py859-910

  1. Section Iteration: Process each markdown section
  2. Token Counting: Track current chunk tokens
  3. Image Accumulation: Concatenate images vertically
  4. Overlap Handling: Include last N% of previous chunk
  5. Image Association: Each chunk paired with combined image

Sources: rag/nlp/__init__.py784-840 rag/nlp/__init__.py843-910 rag/nlp/__init__.py693-781 rag/nlp/__init__.py645-691


Metadata and Position Tracking

Each chunk includes metadata for retrieval, filtering, and debugging. Position information enables precise source location within documents.

Chunk Metadata Schema

Core Metadata Fields

Field | Type | Example | Description
docnm_kwd | str | "contract.pdf" | Source filename
title_tks | str | "contract" | Tokenized filename (no extension)
title_sm_tks | str | "con tra ct" | Fine-grained tokens
content_with_weight | str | "Full chunk text..." | Complete chunk text
content_ltks | str | "full chunk text" | Tokenized content
content_sm_ltks | str | "fu ll ch un k te xt" | Fine-grained tokens

Position Metadata

Position tracking enables source highlighting: rag/nlp/__init__.py547-559

Position format: (page_number, x1, x2, y1, y2) in PDF coordinate space

  • page_number: 1-indexed page number
  • x1, x2: Horizontal bounds (left, right)
  • y1, y2: Vertical bounds (top, bottom)

Position Tag Format

Positions are embedded in text as tags: rag/app/manual.py289-293

Example: @@5\t100.0\t200.0\t50.0\t75.0## means page 5, x=[100,200], y=[50,75]
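Encoding and decoding this tag format is straightforward. A sketch with hypothetical helper names, matching the @@page\tx1\tx2\ty1\ty2## format shown above:

```python
import re

def encode_position(page: int, x1: float, x2: float, y1: float, y2: float) -> str:
    """Render a position tag in the @@page\\tx1\\tx2\\ty1\\ty2## format."""
    return f"@@{page}\t{x1:.1f}\t{x2:.1f}\t{y1:.1f}\t{y2:.1f}##"

def decode_positions(text: str) -> list[tuple]:
    """Extract (page, x1, x2, y1, y2) tuples from embedded position tags."""
    tags = re.findall(r"@@(\d+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##", text)
    return [(int(p), float(a), float(b), float(c), float(d))
            for p, a, b, c, d in tags]
```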

Document Type Classification

The doc_type_kwd field indicates chunk type:

  • "text" - Default text chunk
  • "table" - Table or structured data
  • "image" - Image or figure with optional caption

Set by tokenization functions: rag/nlp/__init__.py337-340

Hierarchical Context Fields

For child-delimiter chunking: rag/nlp/__init__.py293-298

The mom_with_weight field stores the parent chunk text, enabling reconstruction of context during retrieval.

Image Metadata

Image chunks include PIL.Image objects: rag/nlp/__init__.py312-313

Images are stored as binary data in the document store, referenced by chunk ID.

Table Context Attachment

The attach_media_context() function adds surrounding text to tables/images: rag/nlp/__init__.py359-544

  1. Position-Based Sorting: Order chunks by (page, top, left) if position data exists
  2. Context Collection: Gather previous and next text chunks up to token budget
  3. Sentence Trimming: Trim to sentence boundaries to avoid partial sentences
  4. Content Update: Replace content_with_weight with text+context

Example:

Original table chunk: "<table>...</table>"
With context (200 tokens): "Previous paragraph text.\n<table>...</table>\nNext paragraph text."

Sources: rag/nlp/__init__.py267-302 rag/nlp/__init__.py327-356 rag/nlp/__init__.py547-559 rag/nlp/__init__.py359-544