[Models] Support GLM4.7 Flash #7139
Conversation
Thanks for your contribution!
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@ Coverage Diff @@
##           develop    #7139   +/-   ##
==========================================
  Coverage         ?    73.70%
==========================================
  Files            ?       376
  Lines            ?     52876
  Branches         ?      8244
==========================================
  Hits             ?     38973
  Misses           ?     11193
  Partials         ?      2710
```
Flags with carried forward coverage won't be shown.
Pull request overview
This PR wires the DeepSeekV3/MLA path for the newly supported model variant (GLM4.7 Flash): it consolidates position_ids and the encoder mask into ForwardMeta, adds padding in the MLA backend for head counts below 64, and registers the new architecture class.
Changes:
- Pass DeepSeekV3's `position_ids`/`mask_encoder_batch` through `ForwardMeta` instead of as explicit parameters, reducing the number of arguments threaded through the call chain.
- In the `MLAAttentionBackend` decode path, pad/trim Q/out when `num_heads < 64` (TP > 1) to satisfy the kernel constraint.
- Register the new `Glm4MoeLiteForCausalLM` architecture, reusing the DeepseekV3 implementation.
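The decode-path head padding described above can be sketched as follows. This is a minimal NumPy illustration, not the project's Paddle implementation; the 64-head minimum comes from the kernel constraint mentioned in the review, and the helper names are hypothetical:

```python
import numpy as np

MIN_KERNEL_HEADS = 64  # illustrative kernel constraint from the review


def pad_heads(q: np.ndarray, min_heads: int = MIN_KERNEL_HEADS) -> np.ndarray:
    """Zero-pad the head axis of q [batch, num_heads, head_dim] up to
    min_heads so the decode kernel's head-count constraint holds."""
    num_heads = q.shape[1]
    if num_heads >= min_heads:
        return q
    pad = np.zeros((q.shape[0], min_heads - num_heads, q.shape[2]), dtype=q.dtype)
    return np.concatenate([q, pad], axis=1)


def trim_heads(out: np.ndarray, num_heads: int) -> np.ndarray:
    """Drop the padded heads from the kernel output."""
    return out[:, :num_heads, :]


# e.g. TP > 1 leaves only 16 heads on this rank
q = np.ones((2, 16, 128), dtype=np.float32)
padded = pad_heads(q)
```

The real padded heads carry garbage rather than meaningful attention output, which is why the output must be trimmed back to `num_heads` before leaving the backend.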
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/models/deepseek_v3.py | Pass position_ids/mask via ForwardMeta and register the Glm4MoeLite architecture |
| fastdeploy/model_executor/layers/attention/mla_attention_backend.py | Add head-padding support to MLA decode and adjust the rope_scaling check |
| fastdeploy/model_executor/forward_meta.py | Add position_ids and mask_encoder_batch fields for MLA/DSA |
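The ForwardMeta extension in the last row could look roughly like the following. Field names come from the table; the dataclass shape and types are assumptions, not the actual FastDeploy definition:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ForwardMeta:
    """Sketch of the metadata object; existing fields are elided and only
    the two additions from this PR are shown."""

    # Per-token positions, previously an explicit argument on the MLA/DSA path.
    position_ids: Optional[Any] = None
    # Encoder batch mask, likewise moved off the call chain into the meta object.
    mask_encoder_batch: Optional[Any] = None


meta = ForwardMeta(position_ids=[0, 1, 2])
```

Centralizing these in ForwardMeta means every attention backend reads the same metadata object instead of each layer forwarding two extra positional arguments.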
Comments suppressed due to low confidence (1)
fastdeploy/model_executor/layers/attention/mla_attention_backend.py:295
- This branch also uses `getattr(self.rope_scaling, "factor", None)` to detect rope scaling, but `rope_scaling` is a dict in the config, so the scaling logic never triggers; the branch body also still reads `fd_config.model_config.rope_scaling`, which is inconsistent with the value cached in `self.rope_scaling` above. Suggestion: switch to a dict-key check (e.g. `self.rope_scaling.get("factor")`) and read consistently from `self.rope_scaling` inside the branch.
self.rope_scaling = getattr(fd_config.model_config, "rope_scaling", None)
if self.rope_scaling and getattr(self.rope_scaling, "factor", None):
# if fd_config.model_config.rope_scaling:
mscale_all_dim = fd_config.model_config.rope_scaling.get("mscale_all_dim", False) # 1.0
scaling_factor = fd_config.model_config.rope_scaling["factor"] # 40
mscale = yarn_get_mscale(scaling_factor, float(mscale_all_dim))
self.attn_softmax_scale = self.attn_softmax_scale * mscale * mscale
EmmonsCurse left a comment:
LGTM~ Skip coverage check as it mainly relies on tests with flashmla.
Motivation
Support GLM4.7 Flash.
Modifications
Support GLM4.7 Flash.
Usage or Command
Accuracy Tests
Checklist
- Add at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For PRs targeting the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.