How will the LLM inference roadmap evolve in 2026-2027?

Short answer

The source claims that in 2026-2027 LLM inference will move toward hybrid systems that combine:

Higher-bandwidth, higher-capacity hardware
+ more effective KV cache management
+ prefill/decode disaggregation
+ speculative decoding / continuous batching
+ MoE / sparse attention / model-side efficiency
+ agentic long-context serving
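
To see why KV cache management and long-context serving appear in the same list, a back-of-the-envelope size estimate helps. The sketch below uses the standard per-sequence formula for a decoder-only transformer (keys plus values, per layer, per KV head, per token). The model shape in the example is a hypothetical configuration chosen for round numbers, not a figure from the source, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 covers storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16; 1 to an 8-bit cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape (an assumption for illustration only):
# 32 layers, 8 KV heads, head dimension 128, one 128K-token sequence.
fp16_size = kv_cache_bytes(32, 8, 128, 131072)
int8_size = kv_cache_bytes(32, 8, 128, 131072, bytes_per_elem=1)

print(fp16_size / 2**30, "GiB at 16-bit")  # 16.0 GiB
print(int8_size / 2**30, "GiB at 8-bit")   # 8.0 GiB
```

Under these assumptions a single 128K-token request holds 16 GiB of cache at 16-bit precision, and the figure scales linearly with context length and batch size. That linear growth is the pressure that cache quantization, offload, and higher-capacity memory are each meant to relieve.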

Company perspective

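NVIDIA: Rubin / Rubin Ultra, HBM4, Dynamo, prefill/decode disaggregation, speculative decoding, optical interconnects.
Google: TPU 8i, TurboQuant, DFlash, GKE Inference Gateway, llm-d.
Meta Platforms: MoE, Llama, Avocado (per the source), multimodal and agentic model optimization.
OpenAI: speculative decoding, continuous batching, GPT-OSS (per the source), community optimizations for KV cache offload / quantization.
Anthropic: safety and long-context cost control; the source claims integration with LPU-class hardware.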

Caveat

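All of these are roadmap claims made without citations. They must be verified against product announcements, papers, benchmarks, cloud deployments, and actual cost curves.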