Stuffing a tiny interpreter into the transformer block swaps arithmetic depth for branching, but the moment divergent control flow appears you lose the uniform batched execution that makes GPUs scream. I am curious how they plan to keep throughput predictable on a multi-tenant TPU pod without violating the P (partition tolerance) we all silently assume in the CAP tradeoff for serving clusters.
If the branchy parts are rare you can speculatively execute both paths and mask, but that doubles FLOPs and torches the power budget; otherwise you fall back to micro-batches and your nice big matmul turns into a queue of scalar ops. At that point I would just call out to a tiny Go RPC service that does the calculation deterministically and keep the transformer doing what it is good at: bulk tensor soup.
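The speculate-both-and-mask idea is easy to sketch in NumPy (function names here are illustrative, not from any real system): both branch bodies run for every element, which is exactly where the 2x FLOPs comes from, and a mask selects the result.

```python
import numpy as np

def branchy(x):
    # Scalar reference: a data-dependent branch per element.
    return x * 2.0 if x > 0 else x - 1.0

def masked(x):
    # Speculative version: compute BOTH branch bodies for every
    # element, then select with a mask. Execution stays uniform and
    # batch-friendly, but you pay roughly double the FLOPs.
    taken = x * 2.0        # "then" path, computed for all elements
    not_taken = x - 1.0    # "else" path, also computed for all elements
    return np.where(x > 0, taken, not_taken)

x = np.array([-1.0, 0.5, 2.0])
print(masked(x))  # [-2.  1.  4.]
```

This is the same predication trick GPU compilers apply to short divergent regions; it only pays off when the branch bodies are cheap relative to the cost of breaking up the batch.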
Instead of speculatively executing both sides, you can bucket tokens that take the same branch and run them in groups, exactly the warp-reconvergence trick GPUs already do. The matmul stays dense; you just pay an O(n) gather/scatter which is cheap relative to another full pass. More interesting is to JIT-specialize the tiny program once its constants settle and fuse it into a single linear op, i.e. partial evaluation of the interpreter itself. That keeps the transformer cores hot and spares you the PCIe hop an external Go RPC would need.
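A minimal NumPy sketch of the bucketing idea (the predicate and weight names are made up for illustration): gather the rows that take each branch, run one dense matmul per bucket, then scatter results back to their original positions.

```python
import numpy as np

def bucketed(x, w_pos, w_neg):
    # Partition tokens by branch outcome (here, a toy predicate on the
    # first feature), so each bucket runs as one dense matmul.
    idx_pos = np.flatnonzero(x[:, 0] > 0)
    idx_neg = np.flatnonzero(x[:, 0] <= 0)
    out = np.empty((x.shape[0], w_pos.shape[1]))
    # Gather -> dense matmul per bucket -> scatter back in place.
    out[idx_pos] = x[idx_pos] @ w_pos
    out[idx_neg] = x[idx_neg] @ w_neg
    return out
```

The gather/scatter is the only per-token overhead; the matmuls themselves stay dense, which is the whole point of the reconvergence trick.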
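The partial-evaluation point can be illustrated with a toy two-op "program" (an elementwise scale followed by a shift, then a matmul), assuming its constants have settled. This is a minimal sketch of constant-folding the interpreter away into a single affine op, not anyone's actual mechanism:

```python
import numpy as np

def specialize(scale, shift, w):
    # Partial evaluation: once the interpreted program's constants
    # (scale, shift) are fixed, fold the two-step program
    #   y = (x * scale + shift) @ w
    # into one affine op  y = x @ w_fused + b_fused.
    w_fused = scale[:, None] * w   # fold the elementwise scale into the weights
    b_fused = shift @ w            # fold the additive shift into a bias
    return w_fused, b_fused
```

After specialization the per-token work is a single dense matmul plus a bias add, so the "interpreter" costs nothing at inference time; the price is re-specializing whenever the constants change.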