---
title: "Post from Nov 13, 2025"
date: 2025-11-13T09:46:31
slug: "1763027191"
tags:
- ml
- compiler
- sdkit
---

[PolyBlocks](https://docs.polymagelabs.com/) is another interesting ML compiler, written using MLIR. It comes from a startup incubated at IISc Bangalore and run by Uday Bondhugula, who co-authored a [paper on compiler optimizations for GPGPUs](https://www.ece.lsu.edu/jxr/Publications-pdf/ics08.pdf) back in 2008 (17 years ago)!

Some of the compiler passes to keep in mind:
- fusion
- tiling
- use of hardware acceleration (like tensor cores)
- constant folding
- performing redundant computation to avoid global memory accesses where profitable
- packing into buffers
- loop transformations
- unroll-and-jam (register tiling?)
- vectorization
- reordering execution for better spatial, temporal, and group reuse
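To make the tiling idea concrete, here is a minimal sketch in plain Python (sizes and tile width are made up): the same transpose written naively and tile-by-tile, so each tile's working set can stay resident in a fast memory level while both arrays are touched with good locality.

```python
def transpose_naive(a, n):
    """a is a flat row-major n x n array; the writes stride by n."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]
    return out


def transpose_tiled(a, n, t=4):
    """Same loop nest, split into t x t tiles: the two inner loops
    now touch only a small block of both arrays at a time."""
    out = [0.0] * (n * n)
    for ii in range(0, n, t):                      # tile loops
        for jj in range(0, n, t):
            for i in range(ii, min(ii + t, n)):    # intra-tile loops
                for j in range(jj, min(jj + t, n)):
                    out[j * n + i] = a[i * n + j]
    return out
```

The `min(..., n)` bounds handle partial tiles at the edges, which is exactly the bookkeeping a compiler pass automates.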

Scheduling approaches:
- greedy heuristics
- ILP
- dynamic programming
- analytical cost models
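A toy sketch of combining two of these, a greedy heuristic driven by an analytical cost model. Every name, byte count, and register figure below is made up for illustration; the point is only the shape of the decision loop:

```python
# Hypothetical register footprint per elementwise op (made up).
REGS = {"mul": 1, "add": 1, "exp": 3, "relu": 1}


def group_cost(group, n_bytes, reg_budget=4, spill_factor=3):
    """Toy analytical cost model: a fused group pays one full read and
    one full write of the tensor, plus spill traffic if the group needs
    more registers than the budget allows."""
    cost = 2 * n_bytes
    regs = sum(REGS[op] for op in group)
    if regs > reg_budget:
        cost += (regs - reg_budget) * spill_factor * n_bytes
    return cost


def greedy_fuse(ops, n_bytes):
    """Greedily merge adjacent ops whenever the cost model says the
    merged group is cheaper than the two groups kept separate."""
    groups = [[op] for op in ops]
    i = 0
    while i < len(groups) - 1:
        merged = groups[i] + groups[i + 1]
        separate = group_cost(groups[i], n_bytes) + group_cost(groups[i + 1], n_bytes)
        if group_cost(merged, n_bytes) < separate:
            groups[i:i + 2] = [merged]   # accept; retry from the same spot
        else:
            i += 1                       # unprofitable; try the next pair
    return groups
```

With the 4-register budget, `greedy_fuse(["mul", "add", "exp", "relu"], 1000)` refuses to fuse across the register-hungry `exp` and returns `[["mul", "add"], ["exp", "relu"]]` — the same trade-off (fusion saves traffic until register pressure bites) that a real cost model has to capture.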

For fusion, PolyBlocks uses a polyhedral slicing-based approach in the affine fusion pass of MLIR. This approach seems to perform better than the simpler fusion done by XLA and TorchInductor. I need to read more about this.

[Important optimizations for matrix multiplication kernels](https://www.youtube.com/watch?v=3LLzHKeL2hs) (to get really close to cuBLAS performance):
- Shared Memory Tiling
- Register Tiling
- Padding (to avoid Shared Memory Bank conflicts)
- Load/Store Vectorization (from global memory to shared)
- Prefetching: fetch the data for the next loop iteration while processing the current one
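As a CPU-side analogue of the first two items, here is a blocked matmul sketch in plain Python (block size and matrix layout are arbitrary choices of mine, not from the talk): the `(ii, jj)` block of C corresponds to what a thread block computes, the `kk` slice of A and B is what would be staged in shared memory, and the per-element accumulator stands in for a value kept in registers.

```python
def matmul_naive(A, B, n):
    """Textbook triple loop over n x n row-major lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, n, t=4):
    """Same computation with the loop nest split into t x t x t blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):            # block of C (thread-block analogue)
        for jj in range(0, n, t):
            for kk in range(0, n, t):    # slice staged in "shared memory"
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        acc = C[i][j]    # "register" accumulator
                        for k in range(kk, min(kk + t, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Padding, vectorized loads, and prefetching have no CPU-Python analogue this simple; they only show up once the tiles are mapped onto real shared memory banks and warps.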

Some other random notes:
- User-facing API: TensorRT-style compiled engine files, or Torch/Mojo/PolyBlocks-style JIT compilers inside Python.
- For the host-side code (i.e. the code that talks to the driver), it might be a good idea to generate C++ code that power users can compile themselves. But since that adds extra hoops to jump through, it should probably be an opt-in path rather than the default.
- Quantization hardware-awareness in the compiler is important, so that it can factor the hardware's quantized datatypes into tiling and memory-layout decisions.
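For reference, a minimal symmetric per-tensor int8 quantize/dequantize round trip (a generic textbook scheme, not anything PolyBlocks-specific); this is the arithmetic whose operand widths the compiler has to plan tiles and layouts around.

```python
def quantize_int8(xs):
    """Symmetric per-tensor int8 quantization: choose the scale so the
    largest magnitude maps to 127, then round each value to an int8 code."""
    amax = max(abs(x) for x in xs) or 1.0   # avoid a zero scale
    scale = amax / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale


def dequantize_int8(q, scale):
    """Map the int8 codes back to floats; error is bounded by scale / 2."""
    return [v * scale for v in q]
```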
