---
title: "Post from Nov 13, 2025"
date: 2025-11-13T09:46:31
slug: "1763027191"
tags:
- ml
- compiler
- sdkit
---

[PolyBlocks](https://docs.polymagelabs.com/) is another interesting ML compiler, written using MLIR. It comes from a startup incubated at IISc Bangalore and run by Uday Bondhugula, who co-authored a [paper on compiler optimizations for GPGPUs](https://www.ece.lsu.edu/jxr/Publications-pdf/ics08.pdf) back in 2008 (17 years ago)!

Some of the compiler passes to keep in mind:
- fusion
- tiling
- use of hardware acceleration (like tensor cores)
- constant folding
- performing redundant computation to avoid global memory accesses where profitable
- packing into buffers
- loop transformations
- unroll-and-jam (register tiling?)
- vectorization
- reordering execution for better spatial, temporal, and group reuse
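To make the tiling idea concrete, here is a minimal sketch in plain Python (sizes and tile width are made up): the same transpose written naively and tile-by-tile, so each tile's working set can stay resident in a fast memory level while both arrays are touched with good locality.

```python
def transpose_naive(a, n):
    """a is a flat row-major n x n array; the writes stride by n."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]
    return out


def transpose_tiled(a, n, t=4):
    """Same loop nest, split into t x t tiles: the two inner loops
    now touch only a small block of both arrays at a time."""
    out = [0.0] * (n * n)
    for ii in range(0, n, t):                      # tile loops
        for jj in range(0, n, t):
            for i in range(ii, min(ii + t, n)):    # intra-tile loops
                for j in range(jj, min(jj + t, n)):
                    out[j * n + i] = a[i * n + j]
    return out
```

The `min(..., n)` bounds handle partial tiles at the edges, which is exactly the bookkeeping a compiler pass automates.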

Scheduling approaches:
- greedy heuristics
- ILP
- dynamic programming
- analytical cost models
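A toy sketch of combining two of these, a greedy heuristic driven by an analytical cost model. Every name, byte count, and register figure below is made up for illustration; the point is only the shape of the decision loop:

```python
# Hypothetical register footprint per elementwise op (made up).
REGS = {"mul": 1, "add": 1, "exp": 3, "relu": 1}


def group_cost(group, n_bytes, reg_budget=4, spill_factor=3):
    """Toy analytical cost model: a fused group pays one full read and
    one full write of the tensor, plus spill traffic if the group needs
    more registers than the budget allows."""
    cost = 2 * n_bytes
    regs = sum(REGS[op] for op in group)
    if regs > reg_budget:
        cost += (regs - reg_budget) * spill_factor * n_bytes
    return cost


def greedy_fuse(ops, n_bytes):
    """Greedily merge adjacent ops whenever the cost model says the
    merged group is cheaper than the two groups kept separate."""
    groups = [[op] for op in ops]
    i = 0
    while i < len(groups) - 1:
        merged = groups[i] + groups[i + 1]
        separate = group_cost(groups[i], n_bytes) + group_cost(groups[i + 1], n_bytes)
        if group_cost(merged, n_bytes) < separate:
            groups[i:i + 2] = [merged]   # accept; retry from the same spot
        else:
            i += 1                       # unprofitable; try the next pair
    return groups
```

With the 4-register budget, `greedy_fuse(["mul", "add", "exp", "relu"], 1000)` refuses to fuse across the register-hungry `exp` and returns `[["mul", "add"], ["exp", "relu"]]` — the same trade-off (fusion saves traffic until register pressure bites) that a real cost model has to capture.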

For fusion, PolyBlocks uses a polyhedral slicing-based approach in the affine fusion pass of MLIR. This approach seems to perform better than the simpler fusion done by XLA and TorchInductor. I need to read more about this.

[Important optimizations for matrix multiplication kernels](https://www.youtube.com/watch?v=3LLzHKeL2hs) (to get really close to cuBLAS performance):
- Shared Memory Tiling
- Register Tiling
- Padding (to avoid Shared Memory Bank conflicts)
- Load/Store Vectorization (from global memory to shared)
- Prefetching: fetch the data for the next loop iteration while processing the current one
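As a CPU-side analogue of the first two items, here is a blocked matmul sketch in plain Python (block size and matrix layout are arbitrary choices of mine, not from the talk): the `(ii, jj)` block of C corresponds to what a thread block computes, the `kk` slice of A and B is what would be staged in shared memory, and the per-element accumulator stands in for a value kept in registers.

```python
def matmul_naive(A, B, n):
    """Textbook triple loop over n x n row-major lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, n, t=4):
    """Same computation with the loop nest split into t x t x t blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):            # block of C (thread-block analogue)
        for jj in range(0, n, t):
            for kk in range(0, n, t):    # slice staged in "shared memory"
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        acc = C[i][j]    # "register" accumulator
                        for k in range(kk, min(kk + t, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Padding, vectorized loads, and prefetching have no CPU-Python analogue this simple; they only show up once the tiles are mapped onto real shared memory banks and warps.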

Some other random notes:
- User-facing API: TensorRT-style compiled engine files, or Torch/Mojo/PolyBlocks-style JIT compilers inside Python.
- For the host-side code (i.e. the code that talks to the driver), it might be a good idea to generate C++ code that power users can compile themselves. But since that adds extra hoops to jump through, it should probably be an opt-in path rather than the default.
- Quantization hardware-awareness in the compiler is important, so that it can factor the hardware's quantized datatypes into tiling and memory-layout decisions.
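For reference, a minimal symmetric per-tensor int8 quantize/dequantize round trip (a generic textbook scheme, not anything PolyBlocks-specific); this is the arithmetic whose operand widths the compiler has to plan tiles and layouts around.

```python
def quantize_int8(xs):
    """Symmetric per-tensor int8 quantization: choose the scale so the
    largest magnitude maps to 127, then round each value to an int8 code."""
    amax = max(abs(x) for x in xs) or 1.0   # avoid a zero scale
    scale = amax / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale


def dequantize_int8(q, scale):
    """Map the int8 codes back to floats; error is bounded by scale / 2."""
    return [v * scale for v in q]
```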
