LLM Architecture Gallery: Comparison of Large Language Model Structures (sebastianraschka.com) ai programming
by raven 27 days ago | 2 comments
  1. ~

    MoE looks great on paper, but once you've fought with routing latency spikes and uneven GPU utilization it stops feeling magical. For most in-house workloads I still get better throughput per watt by fine-tuning a plain 7B dense model and shipping a single checkpoint.

    1. ~

      The worst jitter shows up when the gate kernel is launched per token; doing a token-bucket or batched routing pass (the Flash-Expert trick) turns it into one big GEMM and flattens the utilization curve. It's basically register allocation for experts: precompute the token-to-expert mapping under a fixed fan-out budget and the pipeline behaves almost like a dense 7B.
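
      A minimal NumPy sketch of the batched-routing idea described above. This is my own illustration, not the actual Flash-Expert kernel: the function name `batched_expert_routing` is made up, and the per-expert loop of dense matmuls stands in for what a real implementation would fuse into one grouped GEMM. The point it shows is precomputing the whole batch's token-to-expert mapping up front (fixed fan-out `top_k`), then processing each expert's tokens in a single matmul instead of launching a gate kernel per token.

```python
import numpy as np

def batched_expert_routing(tokens, gate_logits, expert_weights, top_k=2):
    """Sketch of batched MoE routing (hypothetical, simplified).

    tokens:         (n_tokens, d_model)
    gate_logits:    (n_tokens, n_experts) raw router scores
    expert_weights: (n_experts, d_model, d_model) one weight matrix per expert
    top_k:          fixed fan-out budget per token
    """
    n_experts = gate_logits.shape[1]

    # 1. Precompute routing for the whole batch under a fixed fan-out,
    #    instead of deciding per token inside the forward pass.
    top_experts = np.argsort(-gate_logits, axis=1)[:, :top_k]  # (n_tokens, top_k)

    # Softmax over the selected logits gives the mixing weights.
    sel = np.take_along_axis(gate_logits, top_experts, axis=1)
    probs = np.exp(sel - sel.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    out = np.zeros_like(tokens)
    # 2. Gather each expert's tokens and process the group in one matmul.
    #    A fused kernel would do all experts as a single grouped GEMM.
    for e in range(n_experts):
        rows, cols = np.where(top_experts == e)
        if rows.size == 0:
            continue  # idle expert: no launch, no wasted GEMM
        group = tokens[rows]                   # all tokens routed to expert e
        processed = group @ expert_weights[e]  # one GEMM for the whole group
        out[rows] += probs[rows, cols, None] * processed
    return out
```

      With `top_k` fixed, the gather/scatter indices are known before any expert runs, which is what flattens the utilization curve: every expert sees one contiguous batch rather than a trickle of single-token launches.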