@firestarman
Collaborator

@firestarman firestarman commented Nov 25, 2025

Contributes to #13412

Rapids UDAF is designed to execute a UDAF (User Defined Aggregate Function) in a columnar way so that it can be accelerated by the GPU.

Complete support for RapidsUDAF covers too much ground, and a single PR (#13450) is too large to review. It is better to add it piece by piece; this PR is the first piece, and it only introduces the relevant interfaces.

  • RapidsUDAF - the top-level interface. It defines the 5 methods below, following the CPU definitions (UserDefinedAggregateFunction) as closely as possible to minimize users' learning effort.

    | RapidsUDAF        | UserDefinedAggregateFunction |
    |-------------------|------------------------------|
    | getDefaultValue   | initialize                   |
    | updateAggregation | update                       |
    | mergeAggregation  | merge                        |
    | getResult         | evaluate                     |
    | bufferTypes       | bufferSchema                 |

    updateAggregation and mergeAggregation each return a RapidsUDAFGroupByAggregation, which contains the APIs that perform the aggregation.

  • RapidsUDAFGroupByAggregation - the base interface for GPU-accelerated UDAF aggregation implementations. It provides the contract for different aggregation strategies. It also supports an optional pair of preStep and postStep hooks that run transformations before and after a "reduce/aggregate" operation, similar to "preMerge" and "postMerge" for the merge-stage aggregate in GpuAggregateFunction.

  • RapidsSimpleGroupByAggregation - a child of RapidsUDAFGroupByAggregation that provides a standard cuDF-based aggregation step using built-in cuDF aggregation operations.
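
To make the method mapping above concrete, here is a minimal sketch of the four lifecycle phases for an average. It is not the API added by this PR: the class name `AverageUdafSketch` and all method signatures are hypothetical, plain `double[]` arrays stand in for cuDF columns, and each call handles a single group rather than a columnar batch.

```java
// Hypothetical stand-in, NOT the real RapidsUDAF API: plain arrays replace
// cuDF columns, and one call handles a single group instead of a whole batch.
final class AverageUdafSketch {
    // Buffer layout is {sum, count}, playing the role of bufferTypes / bufferSchema.

    // Analogous to getDefaultValue / initialize: the buffer for empty input.
    static double[] defaultValue() {
        return new double[] {0.0, 0.0};
    }

    // Analogous to updateAggregation / update: raw input rows -> partial buffer.
    static double[] update(double[] inputRows) {
        double sum = 0.0;
        for (double v : inputRows) {
            sum += v;
        }
        return new double[] {sum, inputRows.length};
    }

    // Analogous to mergeAggregation / merge: combine partial buffers.
    static double[] merge(double[][] partialBuffers) {
        double[] merged = defaultValue();
        for (double[] b : partialBuffers) {
            merged[0] += b[0];
            merged[1] += b[1];
        }
        return merged;
    }

    // Analogous to getResult / evaluate: final buffer -> result (null when empty).
    static Double result(double[] buffer) {
        return buffer[1] == 0.0 ? null : buffer[0] / buffer[1];
    }

    public static void main(String[] args) {
        double[] p1 = update(new double[] {1.0, 2.0, 3.0});
        double[] p2 = update(new double[] {5.0});
        System.out.println(result(merge(new double[][] {p1, p2}))); // prints 2.75
    }
}
```

The point of the sketch is only the division of labor: update and merge both produce buffers with the same layout, so partial results from different tasks can be merged before the final evaluation.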

The groupby 'aggregate' API lives in the child class rather than in the base interface because more types of aggregate may be introduced in the future via other child classes. For example, an aggregate like the one below could access the grouped data and keys, giving users more room for customization.

  /**
   * Performs custom aggregation on data that has been grouped by keys.
   * The data is grouped, with offsets indicating group boundaries.
   * @param keyOffsets A ColumnVector containing the start offset for each group.
   *                   The end offset for group i is `keyOffsets[i+1]` (or total
   *                   rows for the last group).
   * @param groupedData An array of ColumnVectors containing the actual data
   *                    columns, sorted and organized by the grouping keys.
   * @return An array of ColumnVectors with one row per group, containing the
   * aggregated results.
   */
  ColumnVector[] aggregateGrouped(ColumnVector keyOffsets, ColumnVector[] groupedData);
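
As an illustration of the offset convention in the Javadoc above, the following sketch computes a per-group sum. It is an assumption-laden stand-in: `GroupedSumSketch` is a hypothetical name, `int[]` and `long[]` replace `ColumnVector`, and only a single data column is handled instead of an array of columns.

```java
// Hypothetical illustration only: int[]/long[] stand in for cuDF ColumnVector,
// and a per-group sum stands in for whatever custom logic a user would write.
final class GroupedSumSketch {
    static long[] aggregateGrouped(int[] keyOffsets, long[] groupedData) {
        long[] out = new long[keyOffsets.length]; // one row per group
        for (int g = 0; g < keyOffsets.length; g++) {
            int start = keyOffsets[g];
            // End of group g is the next start offset, or total rows for the last group.
            int end = (g + 1 < keyOffsets.length) ? keyOffsets[g + 1] : groupedData.length;
            long sum = 0;
            for (int row = start; row < end; row++) {
                sum += groupedData[row];
            }
            out[g] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Three groups covering rows [0,2), [2,3) and [3,6).
        int[] offsets = {0, 2, 3};
        long[] data = {1, 2, 10, 4, 5, 6};
        System.out.println(java.util.Arrays.toString(aggregateGrouped(offsets, data)));
        // prints [3, 10, 15]
    }
}
```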

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@greptile-apps
Contributor

greptile-apps bot commented Nov 25, 2025

Greptile Overview

Greptile Summary

Introduces API interfaces for GPU-accelerated User Defined Aggregate Functions (UDAF), establishing the foundation for columnar aggregation operations. This is the first step in a phased implementation approach to support UDAF on GPU.

  • Defines RapidsUDAF interface with lifecycle methods mirroring Spark's deprecated UserDefinedAggregateFunction to minimize learning curve
  • Provides RapidsUDAFGroupByAggregation base interface with optional pre/post-processing hooks (preStep, postStep) and required reduce operation
  • Includes RapidsSimpleGroupByAggregation for standard cuDF aggregation patterns using built-in operations
  • Establishes clear resource management contracts specifying when users must close input resources vs. when framework handles cleanup
  • Aligns buffer types with CPU bufferSchema to ensure compatibility during fallback scenarios (partial CPU/partial GPU execution)

Confidence Score: 5/5

  • This PR is safe to merge with no blocking issues found
  • This is a clean API definition PR introducing only interface contracts with no implementation logic, comprehensive documentation, and clear resource management specifications following established patterns from RapidsUDF
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
|----------|-------|----------|
| sql-plugin-api/src/main/java/com/nvidia/spark/RapidsUDAF.java | 5/5 | Main UDAF interface defining core aggregation lifecycle methods with clear resource management contracts |
| sql-plugin-api/src/main/java/com/nvidia/spark/RapidsUDAFGroupByAggregation.java | 5/5 | Base aggregation interface providing pre/post processing hooks and reduction operations |
| sql-plugin-api/src/main/java/com/nvidia/spark/RapidsSimpleGroupByAggregation.java | 5/5 | Simplified aggregation interface for standard cuDF group-by operations |

Sequence Diagram

sequenceDiagram
    participant User as UDAF Implementation
    participant Update as updateAggregation()
    participant Merge as mergeAggregation()
    participant Result as getResult()
    
    Note over User: Initial Aggregation Phase
    User->>Update: Call updateAggregation()
    Update->>Update: preStep(numRows, args)
    Note over Update: Transform input columns<br/>(optional)
    Update->>Update: aggregate/reduce operation
    Note over Update: Process raw input data
    Update->>Update: postStep(numRows, aggregatedData)
    Note over Update: Transform output to buffer format
    
    Note over User: Merge Phase (Distributed)
    User->>Merge: Call mergeAggregation()
    Merge->>Merge: preStep(numRows, args)
    Note over Merge: Prepare buffer data<br/>(optional)
    Merge->>Merge: aggregate/reduce operation
    Note over Merge: Combine partial results
    Merge->>Merge: postStep(numRows, aggregatedData)
    Note over Merge: Transform to final buffer format
    
    Note over User: Final Result Phase
    User->>Result: getResult(numRows, args, outType)
    Result-->>User: Final ColumnVector
    Note over User: Returns single column<br/>with final UDAF result
    
    Note over User: Empty Input Case
    User->>User: getDefaultValue()
    Note over User: Returns Scalar[] for<br/>zero-row reduction
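
The preStep, aggregate/reduce, postStep ordering in the diagram can be sketched as a small interface with two optional hooks and one required reduction. Everything here is hypothetical: the names `StepwiseAggregationSketch` and `SumOfSquaresSketch`, the `run` helper, and the signatures are assumptions for illustration (the `numRows` argument shown in the diagram is omitted), and each `double[]` plays the role of one column.

```java
// Hypothetical stand-in, NOT the real RapidsUDAFGroupByAggregation: each
// double[] plays the role of one column, and the reduction collapses every
// column to a single value.
interface StepwiseAggregationSketch {
    // Optional hook: transform input columns before the reduction (identity by default).
    default double[][] preStep(double[][] columns) { return columns; }

    // Required: the actual reduction, one output value per column.
    double[] reduce(double[][] columns);

    // Optional hook: transform reduced values into the buffer format (identity by default).
    default double[] postStep(double[] reduced) { return reduced; }

    // Runs the three steps in the order shown in the diagram.
    default double[] run(double[][] columns) {
        return postStep(reduce(preStep(columns)));
    }
}

// Sum of squares: preStep squares each value, reduce sums each column,
// and postStep keeps its identity default.
final class SumOfSquaresSketch implements StepwiseAggregationSketch {
    @Override
    public double[][] preStep(double[][] columns) {
        double[][] squared = new double[columns.length][];
        for (int c = 0; c < columns.length; c++) {
            squared[c] = new double[columns[c].length];
            for (int r = 0; r < columns[c].length; r++) {
                squared[c][r] = columns[c][r] * columns[c][r];
            }
        }
        return squared;
    }

    @Override
    public double[] reduce(double[][] columns) {
        double[] sums = new double[columns.length];
        for (int c = 0; c < columns.length; c++) {
            for (double v : columns[c]) {
                sums[c] += v;
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        double[] out = new SumOfSquaresSketch().run(new double[][] {{1.0, 2.0, 3.0}});
        System.out.println(out[0]); // prints 14.0
    }
}
```

Keeping the hooks optional means a simple aggregation only has to supply the reduction, while a more involved one can reshape its inputs and outputs without changing the reduction itself.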
Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@firestarman firestarman requested review from GaryShen2008, abellina, revans2, sameerz and winningsix and removed request for sameerz November 25, 2025 05:45
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

firestarman and others added 2 commits November 27, 2025 09:33
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

firestarman and others added 2 commits November 27, 2025 10:01
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@firestarman
Collaborator Author

build

Collaborator

@res-life res-life left a comment

LGTM
