Add support for Deletion Vectors to MultiFileParquetPartitionReader #13744
base: release/25.12
Signed-off-by: Raza Jafri <raza.jafri@gmail.com>
d24af74 to a9ef584
IMO, we need to find a way to have fast tests without adding 4.7K files.
I have reduced the number of test files.
You can reduce it further by getting rid of the CRC checksum files.
Greptile Overview
Greptile Summary
This PR successfully adds deletion vector support to the
Key Changes:
Architecture:
Performance:
Confidence Score: 5/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User
participant DeltaMultiFileReaderFactory
participant DeltaCoalescingFileParquetPartitionReader
participant MultiFileParquetPartitionReader
participant RapidsDeletionVectorUtils
participant CoalescedRapidsDropMarkedRowsFilter
User->>DeltaMultiFileReaderFactory: createColumnarReader(partition)
DeltaMultiFileReaderFactory->>DeltaMultiFileReaderFactory: Check if coalescing or multi-threaded
alt Coalescing (Local Files)
DeltaMultiFileReaderFactory->>DeltaCoalescingFileParquetPartitionReader: Create reader
DeltaCoalescingFileParquetPartitionReader->>MultiFileParquetPartitionReader: readBatch()
MultiFileParquetPartitionReader->>MultiFileParquetPartitionReader: Coalesce multiple small files
MultiFileParquetPartitionReader->>DeltaCoalescingFileParquetPartitionReader: Return batch
DeltaCoalescingFileParquetPartitionReader->>RapidsDeletionVectorUtils: getCoalescedRowIndexFilter()
RapidsDeletionVectorUtils->>RapidsDeletionVectorUtils: Find relevant files based on boundaries
RapidsDeletionVectorUtils->>CoalescedRapidsDropMarkedRowsFilter: Create filter with offsets
CoalescedRapidsDropMarkedRowsFilter->>DeltaCoalescingFileParquetPartitionReader: Return filter
DeltaCoalescingFileParquetPartitionReader->>RapidsDeletionVectorUtils: processBatchWithDeletionVector()
RapidsDeletionVectorUtils->>RapidsDeletionVectorUtils: Apply deletion vector to batch
RapidsDeletionVectorUtils->>DeltaCoalescingFileParquetPartitionReader: Return filtered batch
DeltaCoalescingFileParquetPartitionReader->>User: Return batch with deleted rows marked
else Multi-threaded (Cloud Files)
DeltaMultiFileReaderFactory->>DeltaMultiFileParquetPartitionReader: Create reader
DeltaMultiFileParquetPartitionReader->>DeltaMultiFileParquetPartitionReader: get()
DeltaMultiFileParquetPartitionReader->>RapidsDeletionVectorUtils: getRowIndexFilter()
RapidsDeletionVectorUtils->>DeltaMultiFileParquetPartitionReader: Return filter
DeltaMultiFileParquetPartitionReader->>RapidsDeletionVectorUtils: processBatchWithDeletionVector()
RapidsDeletionVectorUtils->>DeltaMultiFileParquetPartitionReader: Return filtered batch
DeltaMultiFileParquetPartitionReader->>User: Return batch with deleted rows marked
end
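To make the coalescing branch of the diagram concrete, here is a minimal Python sketch of the "create filter with offsets" idea: each file's deleted row indices are shifted by that file's starting row inside the coalesced batch, and the batch is then marked row by row. This is an illustration only, not the plugin's code. The function names `coalesced_deleted_rows` and `mark_deleted`, and the use of plain lists and sets in place of the real bitmap, are assumptions made for the example.

```python
from itertools import accumulate


def coalesced_deleted_rows(file_row_counts, deleted_per_file):
    """Map per-file deleted row indices to positions in the coalesced batch.

    file_row_counts:  number of rows each file contributes, in batch order.
    deleted_per_file: for each file, the row indices (file-local) that its
                      deletion vector marks as deleted.
    Both names are hypothetical; they do not come from the plugin.
    """
    # Starting row of each file inside the coalesced batch: [0, n0, n0+n1, ...]
    offsets = [0] + list(accumulate(file_row_counts))[:-1]
    deleted = set()
    for offset, count, rows in zip(offsets, file_row_counts, deleted_per_file):
        for row in rows:
            if not 0 <= row < count:
                raise ValueError(f"row index {row} is outside a file of {count} rows")
            # The per-row shift: one addition and one insert per deleted row.
            deleted.add(offset + row)
    return deleted


def mark_deleted(total_rows, deleted):
    """Return one boolean per batch row: True where the row is marked deleted."""
    return [i in deleted for i in range(total_rows)]


if __name__ == "__main__":
    # Three small files coalesced into one 9-row batch.
    counts = [3, 2, 4]
    per_file = [[1], [], [0, 3]]
    deleted = coalesced_deleted_rows(counts, per_file)
    print(sorted(deleted))                     # [1, 5, 8]
    print(mark_deleted(sum(counts), deleted))
```

The inner loop runs once per deleted row, so in this sketch the work of shifting indices grows with the number of deleted rows rather than with the number of files.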
87 files reviewed, no comments
The performance numbers aren't good for this PR. I am working on improving performance.
NOTE: release/25.12 has been created from main. Please retarget your PR to release/25.12 if it should be included in the release.
88 files reviewed, no comments
Here is the breakdown of performance numbers when benchmarking the
Breaking it down further reveals that as the delete percentage increases, the time taken to add offsets to the bitmap becomes the dominant factor in determining the
A table presenting the performance numbers above as a percentage of the total materialization time shows that adding offsets is a major contributor to the slowness.
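For anyone who wants to reproduce this kind of breakdown, a per-stage share of total time can be collected with a small timing helper along these lines. This is a hedged sketch, not the benchmark harness used for the numbers in this PR. `StageTimer`, `percent_of_total`, and the stage names are all hypothetical, and the workload is a toy stand-in.

```python
import time
from contextlib import contextmanager


def percent_of_total(stage_seconds):
    """Convert {stage: seconds} into {stage: percent of the summed time}."""
    total = sum(stage_seconds.values())
    if total == 0:
        return {name: 0.0 for name in stage_seconds}
    return {name: 100.0 * secs / total for name, secs in stage_seconds.items()}


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def shares(self):
        return percent_of_total(self.seconds)


if __name__ == "__main__":
    timer = StageTimer()
    # Toy stand-ins for the real stages; the names are illustrative only.
    with timer.stage("load_deleted_rows"):
        rows = list(range(200_000))
    with timer.stage("add_offsets"):
        shifted = {r + 1_000 for r in rows}
    for name, pct in timer.shares().items():
        print(f"{name}: {pct:.1f}% of measured time")
```

Reporting shares instead of absolute times makes runs at different delete percentages easier to compare, since the question is which stage dominates rather than how long the whole run took.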


Fixes #13617
Description
MultiFileParquetPartitionReader when Deletion Vectors are enabled on a Delta Table. If a query is run on such a table, the plugin will fall back to the MultiFileCloudParquetPartitionReader
Performance
Baseline: commit id - 21afb61
Target: This PR
Dataset: TPC-DS (sf100_parquet)
Environment: Local
Spark Configs
Query: select sum(ss_list_price) from store_sales
Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)