LLM Architecture Gallery: Comparison of Large Language Model Structures (sebastianraschka.com) ai programming
by raven 27 days ago | 2 comments
  1. ~

    MoE looks great on paper, but once you've fought with routing latency spikes and uneven GPU utilization it stops feeling magical. For most in-house workloads I still get better throughput per watt by fine-tuning a plain 7B dense model and shipping a single checkpoint.

    1. ~

      The worst jitter shows up when the gate kernel is launched per token; doing a token-bucket or batched routing pass (the Flash-Expert trick) turns it into one big GEMM and flattens the utilization curve. It's basically register allocation for experts: precompute the token-to-expert mapping under a fixed fan-out budget and the pipeline behaves almost like a dense 7B.
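
      A minimal NumPy sketch of the batched-routing idea described above. This is my own illustration, not the actual Flash-Expert kernel: the function name `batched_expert_routing` is made up, and the per-expert loop of dense matmuls stands in for what a real implementation would fuse into one grouped GEMM. The point it shows is precomputing the whole batch's token-to-expert mapping up front (fixed fan-out `top_k`), then processing each expert's tokens in a single matmul instead of launching a gate kernel per token.

```python
import numpy as np

def batched_expert_routing(tokens, gate_logits, expert_weights, top_k=2):
    """Sketch of batched MoE routing (hypothetical, simplified).

    tokens:         (n_tokens, d_model)
    gate_logits:    (n_tokens, n_experts) raw router scores
    expert_weights: (n_experts, d_model, d_model) one weight matrix per expert
    top_k:          fixed fan-out budget per token
    """
    n_experts = gate_logits.shape[1]

    # 1. Precompute routing for the whole batch under a fixed fan-out,
    #    instead of deciding per token inside the forward pass.
    top_experts = np.argsort(-gate_logits, axis=1)[:, :top_k]  # (n_tokens, top_k)

    # Softmax over the selected logits gives the mixing weights.
    sel = np.take_along_axis(gate_logits, top_experts, axis=1)
    probs = np.exp(sel - sel.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    out = np.zeros_like(tokens)
    # 2. Gather each expert's tokens and process the group in one matmul.
    #    A fused kernel would do all experts as a single grouped GEMM.
    for e in range(n_experts):
        rows, cols = np.where(top_experts == e)
        if rows.size == 0:
            continue  # idle expert: no launch, no wasted GEMM
        group = tokens[rows]                   # all tokens routed to expert e
        processed = group @ expert_weights[e]  # one GEMM for the whole group
        out[rows] += probs[rows, cols, None] * processed
    return out
```

      With `top_k` fixed, the gather/scatter indices are known before any expert runs, which is what flattens the utilization curve: every expert sees one contiguous batch rather than a trickle of single-token launches.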