Prefill-Decode Disaggregation

Prefill-Decode Disaggregation is a serving architecture that deploys or schedules the prefill phase and the decode phase of LLM inference separately.

Why it matters

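Prefill usually processes the entire prompt in one pass and is highly parallel; decode generates one token at a time and is more often limited by memory bandwidth, KV Cache, and latency. Because the two phases have different resource profiles, deploying them separately may improve resource utilization.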
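
The gap between the two phases can be illustrated with a rough arithmetic-intensity estimate. This is only a sketch under stated assumptions that are not from the source: a dense model costs about 2 FLOPs per parameter per token, the weights are read from memory once per forward step, and attention FLOPs and KV Cache reads are ignored. The function name and the example token counts are hypothetical.

```python
def arithmetic_intensity(tokens_per_step: int, bytes_per_param: float = 2.0) -> float:
    """Rough FLOPs per byte of weights read in one forward step.

    Assumptions (not from the source): a dense model costs about
    2 FLOPs per parameter per token, and the weights are read from
    memory once per step. Attention FLOPs and KV cache reads are ignored.
    """
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    # The parameter count cancels out: (2 * P * tokens) / (P * bytes_per_param).
    return 2.0 * tokens_per_step / bytes_per_param


# Prefill: a hypothetical 2048-token prompt processed in one step.
prefill_intensity = arithmetic_intensity(2048)
# Decode: one new token per step for a single request.
decode_intensity = arithmetic_intensity(1)
# Decode with a hypothetical batch of 32 concurrent requests.
batched_decode_intensity = arithmetic_intensity(32)

print(prefill_intensity, decode_intensity, batched_decode_intensity)
```

Under this simplified model, prefill does far more compute per byte of weights loaded than decode does, and decode closes the gap only through batching. That is one way to see why the two phases may favor different hardware or scheduling policies.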

Source claims

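The source claims that NVIDIA Dynamo will continue to evolve, supporting disaggregated prefill/decode deployment and combining with speculative decoding, LPX inference racks, and optical interconnects to reduce system latency. The specific figures and architecture still need to be verified.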

Caveat

Disaggregation may improve GPU utilization, but it also introduces network transfer, cache placement, scheduler complexity, and p99 latency risks. Whether it pays off depends on the workload, context length, batching policy, and data center network.
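
The network transfer cost can be estimated with a back-of-the-envelope calculation of how much KV Cache has to move from a prefill worker to a decode worker. This is a minimal sketch: the model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 4096-token context, and the 50 GB/s link are all hypothetical values chosen for illustration, not figures from the source, and the estimate ignores link latency, protocol overhead, and any overlap of transfer with compute.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence, in bytes."""
    # Factor of 2: each layer stores one K tensor and one V tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, link_bytes_per_sec: float) -> float:
    """Ideal transfer time at full link bandwidth (no latency or overhead)."""
    return num_bytes / link_bytes_per_sec


# Hypothetical model shape and context length (assumptions, not from the source).
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
# Hypothetical 50 GB/s link between the prefill and decode workers.
seconds = transfer_seconds(size, 50e9)

print(f"{size / 2**20:.0f} MiB, {seconds * 1e3:.1f} ms")
```

Because the cache size grows linearly with context length, the same estimate can be rerun with longer prompts or a slower link to see when the transfer starts to dominate the latency budget.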
