-
Notifications
You must be signed in to change notification settings - Fork 264
Remove LazySpillableColumnarBatch from join code #13625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR removes LazySpillableColumnarBatch from the join code to simplify the APIs. The changes replace the lazy spillable wrapper pattern with direct usage of SpillableColumnarBatch, which now manages reference counting through incRefCount() instead of the previous "spill only" approach.
Key changes:
- Removed
LazySpillableColumnarBatchclass and replaced all usages withSpillableColumnarBatch - Updated join gatherers to use
SpillableGatherMapinstead ofLazySpillableGatherMap - Modified resource management to use
incRefCount()for shared ownership instead of spill-only wrappers
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| GpuHashJoin.scala | Updated join iterators to use SpillableColumnarBatch and simplified resource management with incRefCount |
| GpuBroadcastNestedLoopJoinExecBase.scala | Replaced LazySpillableColumnarBatch with SpillableColumnarBatch in cross join and conditional nested loop join iterators |
| GpuBroadcastExchangeExec.scala | Added debug println statement in closeInternal |
| ExistenceJoin.scala | Updated ExistenceJoinIterator to use SpillableColumnarBatch instead of LazySpillableColumnarBatch |
| GpuCartesianProductExec.scala | Changed serializable batch handling and replaced spill-only pattern with incRefCount |
| SpillFramework.scala | Added warning log when handles remain in store during close |
| SpillableColumnarBatch.scala | Added numCols method and new apply overloads with description parameter |
| JoinGatherer.scala | Removed LazySpillable trait and LazySpillableColumnarBatch, updated gatherers to use Spillable types |
| GpuShuffledSizedHashJoinExec.scala | Updated join iterators to use SpillableColumnarBatch and simplified batch ownership |
| AbstractGpuJoinIterator.scala | Updated join iterator base class to use SpillableColumnarBatch |
| InternalRowToColumnarBatchIterator.java | Added blank line (formatting only) |
| GpuDeleteLoader.scala | Updated delete loader to return SpillableColumnarBatch |
| GpuDeleteFilter.scala | Updated delete filter context to use SpillableColumnarBatch |
| GpuDeleteFilterSuite.scala | Updated test to use SpillableColumnarBatch and removed unnecessary withResource |
Comments suppressed due to low confidence (1)
sql-plugin/src/main/scala/com/nvidia/spark/rapids/JoinGatherer.scala:1
- Missing method call parentheses: 'data.numRows' should be 'data.numRows()'. This will cause a compilation error as numRows is a method, not a property.
/*
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastExchangeExec.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/spill/SpillFramework.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AbstractGpuJoinIterator.scala
Show resolved
Hide resolved
...rc/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinExecBase.scala
Outdated
Show resolved
Hide resolved
|
build |
revans2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first pass looks good, but I still see some formatting changes that I don't think are correct.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastExchangeExec.scala
Outdated
Show resolved
Hide resolved
|
build |
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
3d3db07 to
81d9560
Compare
|
seeing some issues while running benchmarks, I'll have to fix. |
|
build |
|
No improvement/regression based on this patch at sf3k: |
|
build |
|
NOTE: release/25.12 has been created from main. Please retarget your PR to release/25.12 if it should be included in the release. |
Contributes to #7529
Description
This PR removes
LazySpillableColumnarBatchfrom the join in order to make the apis cleaner to use. I have not run performance tests on this yet.Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)