
[feat] bump to 1.2.0 with torch 2.11 / torchrec 1.6 / fbgemm 1.6 #479

Merged
tiankongdeguiji merged 33 commits into alibaba:master from tiankongdeguiji:bump/tzrec-1.2.0
May 2, 2026

Conversation

@tiankongdeguiji
Collaborator

Summary

Coordinated upgrade of the PyTorch stack and companion accelerators for the 1.2.0 release.

  • tzrec 1.1.11 → 1.2.0
  • torch 2.10.0 → 2.11.0
  • torchrec 1.5.0 → 1.6.0 — wheel source switched from mirrors.aliyun.com/pytorch-wheels to https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/torchrec/repo.html
  • fbgemm-gpu 1.5.0 → 1.6.0
  • torch-tensorrt 2.10.0 → 2.11.0 (now also available for the cu126 variant, not just cu129)
  • dynamicemb 0.0.1+20260407.97b80bf → 0.1.0+20260420.c7b9ea2
  • hstu_attn 0.1.0+bea6b4b → 0.1.0+c7b9ea2

Docker images

New 1.2 images pushed to a staging repo tzrec-test first:

  • mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-test:1.2-{cpu,cu126,cu129}

CI in this PR runs against tzrec-test:1.2. Once all required checks pass, the images are promoted to tzrec-devel:1.2 via the new scripts/promote_docker.sh, and a final commit flips the CI workflow YAMLs back to tzrec-devel:1.2 so the merged master points at the promoted repo.

Pre-commit

  • ruff-pre-commit v0.15.4 → v0.15.11
  • codespell v2.4.1 → v2.4.2
  • pre-commit-hooks already at latest (v6.0.0)

Docker hardening

Large torch wheel downloads from the aliyun mirror occasionally time out mid-stream. Wrapped the pip installs in an 8x shell retry loop, added timeout=120 retries=5 to docker/pip.conf, and set pipefail in scripts/build_docker.sh so docker-build failures surface instead of being swallowed by tee.

Test plan

  • buildtest_ci green against tzrec-test:1.2
  • unittest_ci green (GPU, cu129)
  • unittest_cpu_ci green
  • codestyle_ci green (new pre-commit versions)
  • pytyping_ci green (torchrec 1.6 / fbgemm 1.6 / torch 2.11 API surfaces)
  • Promote tzrec-test:1.2-* → tzrec-devel:1.2-* after CI passes
  • Flip CI YAMLs back to tzrec-devel:1.2 and re-run CI green

🤖 Generated with Claude Code

tiankongdeguiji and others added 10 commits April 20, 2026 16:32
- tzrec 1.1.11 -> 1.2.0
- torch 2.10.0 -> 2.11.0
- torchrec 1.5.0 -> 1.6.0 (switch wheel source to tzrec OSS repo.html)
- fbgemm-gpu 1.5.0 -> 1.6.0
- torch-tensorrt 2.10.0 -> 2.11.0, now also available for cu126
- dynamicemb 0.0.1+20260407.97b80bf -> 0.1.0+20260420.c7b9ea2
- hstu_attn 0.1.0+bea6b4b -> 0.1.0+c7b9ea2
- Docker tag 1.1 -> 1.2, staged via new tzrec-test repo before promoting
  to tzrec-devel after CI passes (promote_docker.sh added)
- pre-commit: ruff v0.15.4 -> v0.15.11, codespell v2.4.1 -> v2.4.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- pip.conf: add timeout=120 and retries=5 to tolerate transient
  mirrors.aliyun.com network blips during Dockerfile pip install steps
- build_docker.sh: add set -o pipefail and remove the duplicate shebang
  so docker build failures are surfaced instead of being swallowed by tee

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
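A minimal sketch of why the `pipefail` change matters, assuming `bash` is on `PATH`. The `false | tee /dev/null` pipeline is only a stand-in for a failing `docker build ... | tee` invocation; it is not the actual command in `scripts/build_docker.sh`.

```python
import subprocess


def pipeline_status(script: str) -> int:
    """Run a bash snippet and return its exit status."""
    return subprocess.run(["bash", "-c", script]).returncode


# `false` stands in for a failing build step. Without pipefail the pipeline
# reports the status of its last command (tee), so the failure is hidden.
without_pipefail = pipeline_status("false | tee /dev/null")

# With pipefail the pipeline reports the rightmost non-zero status instead.
with_pipefail = pipeline_status("set -o pipefail; false | tee /dev/null")

print(without_pipefail, with_pipefail)
```

Combined with `set -e`, the non-zero status is what lets the script stop at the failed build rather than carrying on to the push step.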
Wheel downloads from mirrors.aliyun.com/pytorch-wheels occasionally fail
mid-stream with ReadTimeoutError even with pip's own retries and timeout
bumped. Wrap the torch/torchrec/fbgemm pip install commands in an 8x
shell retry loop so transient registry blips don't abort a 40GB image
build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
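The retry idea in the commit above, sketched in Python rather than the Dockerfile's shell. `retry` and `FlakyDownload` are hypothetical names used only for illustration; the real change wraps `pip install` in a shell loop, and a later commit in this PR reverts it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], attempts: int = 8, delay_s: float = 0.0) -> T:
    """Call fn until it succeeds or the attempt budget is exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # a real loop would catch narrower errors
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


class FlakyDownload:
    """Stand-in for a wheel download that times out a few times first."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("read timed out")
        return "ok"


download = FlakyDownload(failures=3)
result = retry(download, attempts=8)
print(result, download.calls)
```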
- tzrec/utils/plan_util.py: torchrec 1.6's hardware-aware perf estimator
  now raises ValueError for any compute kernel not in its internal
  kernel_bw_lookup. Dynamicemb registers CUSTOMIZED_KERNEL which is not
  in that table, so when dynamicemb is loaded we inject a bandwidth
  override for ("cuda", "customized_kernel") on the
  EmbeddingPerfEstimator's HardwarePerfConfig, approximated as a
  fused_uvm_caching-like mix of HBM and HBM-to-DDR bandwidth.

- tzrec/ops/_triton/triton_hstu_attention.py: the triton 3.6 shipped
  with torch 2.11 no longer resolves libdevice.fast_dividef for
  (float64, float64). Replace the two silu formulations
  (fast_dividef(qk, 1 + exp(-qk))) with the mathematically equivalent
  qk * tl.sigmoid(qk), which keeps the dtype flow consistent and also
  avoids the libdevice import.

- .github/workflows/{unittest,buildtest,benchmark,unittest_nightly}_ci.yml:
  add --ulimit memlock=-1 to the GPU container options. dynamicemb 0.1.0
  mlocks physical memory for its HKV cache tables; the default 64 KB
  rlimit in the github actions runner containers made its prefetch
  path raise "mlock physical memory failed".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
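The silu rewrite in the Triton bullet above rests on the identity x / (1 + exp(-x)) = x * sigmoid(x). A quick numerical check in plain Python floats (not Triton, and not the kernel's actual code) might look like this:

```python
import math


def silu_via_division(x: float) -> float:
    """Old formulation: divide x by (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu_via_sigmoid(x: float) -> float:
    """New formulation: multiply x by sigmoid(x)."""
    return x * sigmoid(x)


samples = [-8.0, -1.5, 0.0, 0.25, 3.0, 10.0]
max_abs_diff = max(
    abs(silu_via_division(x) - silu_via_sigmoid(x)) for x in samples
)
print(max_abs_diff)
```

The two forms agree up to floating-point rounding. The check covers only the mathematical equivalence and says nothing about kernel performance.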
CI has passed against tzrec-test:1.2-*, the images have been promoted to
tzrec-devel:1.2-{cpu,cu126,cu129}, so switch the 8 workflow YAMLs back
to tzrec-devel. This is the merge-ready state of the branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert the shell-level pip retry loops in the Dockerfile — they were
added to tolerate a transient mirrors.aliyun.com flap during the 1.2.0
bump build, but aren't needed for the steady state. pip.conf still sets
timeout=120 so individual requests don't hang indefinitely; drop the
retries=5 pin as well.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After auditing the torchrec 1.6 estimator surface (commits 8fe63b34
+ 005fd685 + 1857d90c) and dynamicemb 0.1.0's own planner overrides,
several pieces of tzrec's dynamicemb integration are dead code or
silently bypass the new APIs.

dynamicemb_util.py:

- Drop the GroupedEmbeddingsLookup / GroupedPooledEmbeddingsLookup
  _create_embedding_kernel re-binds. dynamicemb 0.1.0 already provides
  these overrides verbatim
  (recsys-examples@c7b9ea2:corelib/dynamicemb/dynamicemb/planner/
  rw_sharding.py:55-186) with a runtime torchrec-version check; the
  tzrec re-binds shadow them and lose dynamicemb's <1.5 fallback.

- Drop the shard_estimators.kernel_bw_lookup monkey-patch. After PR
  #3723's legacy-estimator cleanup, shard_estimators no longer imports
  kernel_bw_lookup, so the patch is silent dead code. The CUSTOMIZED_KERNEL
  bandwidth override now lives on the EmbeddingPerfEstimator config
  (see plan_util.py).

- Drop the now-unused imports (BaseEmbedding, GroupedEmbeddingConfig,
  ShardingEnv, dist, constants, and the dynamicemb lookup classes).

plan_util.py:

- Replace the static kernel_device_bandwidths dict with a method-level
  HardwarePerfConfig.get_device_bw override that respects the per-shard
  caching_ratio. The dict approach pinned bandwidth across all 10
  cache_load_factor copies enumerate emits per dynamicemb table; the
  method override restores 1.5-equivalent fidelity by computing
  (caching_ratio * hbm + (1 - caching_ratio) * hbm_to_ddr) / 10 per shard.
  Adds a hasattr assert on the private _estimator/_config attribute
  chain so a future torchrec rename surfaces a clear error instead of
  silently regressing.

- Wire forward-compat for torchrec 1.6 stable / nightly:
  * populate self._sharder_data_map = build_sharder_data_map(...) in
    enumerate (no-op in v1.6.0-rc1; required after post-rc1 commits
    b0027133 / 25b9b5ff first tagged in v2026.03.30.00).
  * compute num_buckets via a new _get_num_buckets helper and pass it
    to calculate_shard_sizes_and_offsets, mirroring upstream's
    virtual-table sharding plumbing (enumerators.py:206, 228).
  * thread sharder_key through _filter_sharding_types and apply the new
    GUARDED_SHARDING_TYPES_FOR_FP_MODULES filter for
    FeatureProcessedEmbeddingBagCollection, matching upstream
    enumerators.py:344-365.

- In EmbeddingStorageEstimator, switch sharder_key lookup to
  ShardingOption.module_type_key (precomputed in 1.6 at types.py:1175;
  post-rc1 PR #3917 removes the legacy sharder_name(type(...)) shape).

Verified: all 5 dynamicemb / HSTU / RTP integration tests pass locally
in tzrec-test:1.2-cu129
(create_dynamicemb_init_ckpt + multi_tower_din_with_dynamicemb_train_eval
+ rank_dlrm_hstu_train_eval_export {AOT,unified_aot} +
multi_tower_din_rtp_train_export).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
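To make the per-shard bandwidth formula from the plan_util.py bullet concrete, here is a small worked example. The bandwidth numbers and the function name `customized_kernel_bw` are made up for illustration; only the formula itself comes from the PR.

```python
from typing import Optional


def customized_kernel_bw(
    caching_ratio: Optional[float],
    hbm_mem_bw: float,
    hbm_to_ddr_mem_bw: float,
) -> float:
    """(cr * hbm + (1 - cr) * hbm_to_ddr) / 10, with cr defaulting to 0."""
    cr = caching_ratio if caching_ratio is not None else 0.0
    return (cr * hbm_mem_bw + (1 - cr) * hbm_to_ddr_mem_bw) / 10


# Hypothetical bandwidths in arbitrary units, chosen for easy arithmetic.
HBM_BW = 1000.0
HBM_TO_DDR_BW = 100.0

# A static dict would pin one value for every shard; the method override
# lets each caching ratio produce its own estimate.
per_shard_bw = {
    cr: customized_kernel_bw(cr, HBM_BW, HBM_TO_DDR_BW)
    for cr in (0.0, 0.5, 1.0)
}
print(per_shard_bw)
```

With these inputs the estimate ranges from 10.0 (nothing cached) to 100.0 (fully cached), which is the spread a single pinned value cannot express.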
# Conflicts:
#	tzrec/ops/_triton/triton_hstu_attention.py
#	tzrec/version.py
Class-level monkey-patch on HardwarePerfConfig.get_device_bw, applied
once at dynamicemb_util module load alongside the other dynamicemb
patches. Drops the per-planner _estimator/_config private-attribute
walk and the hasattr assert in plan_util.py. Also tightens the comments
added in this PR to one line each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI consoles don't upload the per-rank stderr files that test_train_eval
/ test_export / test_predict write, so a torchrun subprocess failure
percolates up to ``self.assertTrue(self.success)`` as an opaque
``False is not true`` with no actual error to diagnose.

Print a tail of the failing log file (last 80 lines) right where
run_cmd returns False, so future CI failures include the underlying
exception in the workflow log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
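A sketch of the log-tail idea, assuming the log is a plain text file. `tail_lines` and `LOG_TAIL_LINES` are hypothetical names and this is not the code added to `run_cmd`; it only shows one cheap way to keep the last 80 lines without loading the whole file into a list.

```python
import tempfile
from collections import deque
from pathlib import Path

LOG_TAIL_LINES = 80  # matches the tail length named in the commit message


def tail_lines(path: Path, n: int = LOG_TAIL_LINES) -> str:
    """Return the last n lines of a text file as one string."""
    with open(path, "r", errors="replace") as f:
        # A bounded deque keeps only the most recent n lines while streaming.
        return "".join(deque(f, maxlen=n))


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "rank0.stderr"
    log.write_text("".join(f"line {i}\n" for i in range(200)))
    tail = tail_lines(log)

print(tail.splitlines()[0], tail.splitlines()[-1])
```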
@tiankongdeguiji added the claude-review label ("Let Claude Review") on Apr 28, 2026
@github-actions (bot) removed the claude-review label ("Let Claude Review") on Apr 28, 2026
Comment thread tzrec/utils/plan_util.py
 self._sharder_map = {
     sharder_name(sharder.module_type): sharder for sharder in sharders
 }
+self._sharder_data_map = build_sharder_data_map(self._sharder_map)

self._sharder_data_map is assigned here but never read anywhere in the repo (verified by grep). If torchrec doesn't reflect on this attribute, this and the import on line 66 can be dropped; if it is required, please add a comment so it isn't deleted later.

Comment thread tzrec/utils/plan_util.py
 _constraints, key = self._get_constraints(child_path, name)
-# GRID_SHARD is only supported if specified by user in parameter constraints
+# GRID_SHARD and row-wise on FP modules require explicit opt-in.
+is_fp_module = "FeatureProcessedEmbeddingBagCollection" in sharder_key

Substring matching on the sharder key is brittle: any user-defined wrapper named MyFeatureProcessedEmbeddingBagCollectionXxx matches, while a renamed FP subclass (e.g. FPEBC) silently misses the guard. Since child_module is in scope, prefer isinstance(child_module, FeatureProcessedEmbeddingBagCollection) (importable from torchrec.modules.fp_embedding_modules) and pass that bool through.
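The two failure modes described in this comment can be reproduced with local stand-in classes. Everything below is hypothetical: the classes are defined here instead of being imported from torchrec, and the name check uses the bare class name as a simplification of the real sharder key.

```python
class FeatureProcessedEmbeddingBagCollection:
    """Local stand-in for the torchrec class of the same name."""


class FPEBC(FeatureProcessedEmbeddingBagCollection):
    """A renamed subclass: still an FP module, but the name no longer matches."""


class MyFeatureProcessedEmbeddingBagCollectionWrapper:
    """An unrelated class whose name happens to contain the substring."""


def is_fp_by_name(module: object) -> bool:
    return "FeatureProcessedEmbeddingBagCollection" in type(module).__name__


def is_fp_by_type(module: object) -> bool:
    return isinstance(module, FeatureProcessedEmbeddingBagCollection)


for module in (FPEBC(), MyFeatureProcessedEmbeddingBagCollectionWrapper()):
    print(type(module).__name__, is_fp_by_name(module), is_fp_by_type(module))
```

The name check misses the subclass and accepts the unrelated wrapper, while the `isinstance` check gets both right.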

Comment on lines +387 to +414
+def _customized_kernel_aware_get_device_bw(
+    self,  # pyre-ignore [2]
     compute_device: str,
     compute_kernel: str,
     hbm_mem_bw: float,
     ddr_mem_bw: float,
+    ssd_mem_bw: float,
     hbm_to_ddr_mem_bw: float,
     caching_ratio: Optional[float] = None,
     prefetch_pipeline: bool = False,
 ) -> Optional[float]:
-    """Calculates the device bandwidth.
-
-    Args:
-        compute_kernel (str): compute kernel.
-        compute_device (str): compute device.
-        hbm_mem_bw (float): the bandwidth of the device HBM.
-        ddr_mem_bw (float): the bandwidth of the system DDR memory.
-        hbm_to_ddr_mem_bw (float): the bandwidth between device HBM and system DDR.
-        caching_ratio (Optional[float]): caching ratio used to determine device
-            bandwidth if UVM caching is enabled.
-        prefetch_pipeline (bool): whether prefetch pipeline is enabled.
-
-    Returns:
-        Optional[float]: the device bandwidth.
-    """
     if compute_kernel == EmbeddingComputeKernel.CUSTOMIZED_KERNEL.value:
-        # for dynamic embedding table
-        caching_ratio = caching_ratio if caching_ratio else 0.0
-        return (
-            caching_ratio * hbm_mem_bw + (1 - caching_ratio) * hbm_to_ddr_mem_bw
-        ) / 10
-    else:
-        return constants.kernel_bw_lookup(
-            compute_device=compute_device,
-            compute_kernel=compute_kernel,
-            hbm_mem_bw=hbm_mem_bw,
-            ddr_mem_bw=ddr_mem_bw,
-            hbm_to_ddr_mem_bw=hbm_to_ddr_mem_bw,
-            caching_ratio=caching_ratio,
-            prefetch_pipeline=prefetch_pipeline,
-        )
+        cr = caching_ratio if caching_ratio is not None else 0.0
+        return (cr * hbm_mem_bw + (1 - cr) * hbm_to_ddr_mem_bw) / 10
+    return _orig_hw_perf_config_get_device_bw(
+        self,
+        compute_device,
+        compute_kernel,
+        hbm_mem_bw,
+        ddr_mem_bw,
+        ssd_mem_bw,
+        hbm_to_ddr_mem_bw,
+        caching_ratio,
+        prefetch_pipeline,
+    )

 # pyre-ignore [9]
-shard_estimators.kernel_bw_lookup = _kernel_bw_lookup
+HardwarePerfConfig.get_device_bw = _customized_kernel_aware_get_device_bw

Two concerns on this monkey-patch:

  1. Signature drift risk. Hard-coded positional forwarding (incl. the new ssd_mem_bw) means any future torchrec change — e.g. another memory tier or a renamed kwarg — will fail at planning time with an opaque TypeError. Consider def _customized_kernel_aware_get_device_bw(self, *args, **kwargs) with a kwarg/positional lookup for compute_kernel/hbm_mem_bw/hbm_to_ddr_mem_bw/caching_ratio, then return _orig_hw_perf_config_get_device_bw(self, *args, **kwargs) for the non-customized path.
  2. Lost docstring. The replaced _kernel_bw_lookup had a full Google-style docstring; the replacement has none, despite the new ssd_mem_bw parameter. Project convention asks for docstrings on non-test functions — please restore one explaining the customized-kernel formula (cr * hbm + (1 - cr) * hbm_to_ddr) / 10 and the /10 factor in particular.
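One way the `*args, **kwargs` suggestion in point 1 could look, sketched against a stand-in class so it runs without torchrec. `StandInPerfConfig` and its placeholder return value are hypothetical; only the parameter names and the customized-kernel formula are taken from the hunk above.

```python
import inspect
from typing import Any, Optional

CUSTOMIZED_KERNEL = "customized_kernel"


class StandInPerfConfig:
    """Hypothetical stand-in for torchrec's HardwarePerfConfig."""

    def get_device_bw(
        self,
        compute_device: str,
        compute_kernel: str,
        hbm_mem_bw: float,
        ddr_mem_bw: float,
        ssd_mem_bw: float,
        hbm_to_ddr_mem_bw: float,
        caching_ratio: Optional[float] = None,
        prefetch_pipeline: bool = False,
    ) -> Optional[float]:
        return hbm_mem_bw  # placeholder, not torchrec's real lookup


_orig_get_device_bw = StandInPerfConfig.get_device_bw
_orig_signature = inspect.signature(_orig_get_device_bw)


def _customized_kernel_aware_get_device_bw(
    self: Any, *args: Any, **kwargs: Any
) -> Optional[float]:
    """Handle CUSTOMIZED_KERNEL locally; forward everything else untouched."""
    # Resolve arguments by name so positional order and extra parameters
    # added upstream do not need to be mirrored here.
    bound = _orig_signature.bind(self, *args, **kwargs)
    bound.apply_defaults()
    params = bound.arguments
    if params["compute_kernel"] == CUSTOMIZED_KERNEL:
        cr = params["caching_ratio"]
        cr = cr if cr is not None else 0.0
        return (
            cr * params["hbm_mem_bw"] + (1 - cr) * params["hbm_to_ddr_mem_bw"]
        ) / 10
    return _orig_get_device_bw(self, *args, **kwargs)


StandInPerfConfig.get_device_bw = _customized_kernel_aware_get_device_bw

cfg = StandInPerfConfig()
print(cfg.get_device_bw("cuda", CUSTOMIZED_KERNEL, 1000.0, 50.0, 5.0, 100.0, 0.5))
```

A renamed parameter would still break the name lookups, but the failure would be a `KeyError` on a specific name rather than a positional mismatch, and a newly added parameter would pass through without changes here.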

Comment thread scripts/promote_docker.sh
Comment on lines +11 to +15
+DOCKER_TAG=1.2
+DOCKER_TAG_SUFFIX=
+
+for DEVICE in cpu cu126 cu129
+do

The promote step pulls tzrec-test:<tag>-<device> by mutable tag, so any push to the test repo between CI passing and promote running will be promoted. To make "what was tested" and "what was promoted" the same artifact, capture the digest at the end of CI (docker inspect --format '{{index .RepoDigests 0}}') and pull that @sha256:... here. The current path is acceptable for a single human-driven promotion but invites a TOCTOU footgun if this is ever automated.
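A small sketch of the digest-pinning suggestion, assuming the CI job has already captured the `RepoDigests` string. `pull_command_for` is a hypothetical helper and the digest in the example is fabricated for illustration; the function only validates the reference and builds the `docker pull` argument list, it does not run Docker.

```python
import re
from typing import List

# A digest reference has the form <repository>@sha256:<64 hex characters>.
_DIGEST_REF = re.compile(r"^(?P<repo>[^@\s]+)@(?P<digest>sha256:[0-9a-f]{64})$")


def pull_command_for(repo_digest: str) -> List[str]:
    """Build a `docker pull` command pinned to an immutable digest."""
    match = _DIGEST_REF.match(repo_digest.strip())
    if match is None:
        raise ValueError(f"not a digest reference: {repo_digest!r}")
    return ["docker", "pull", f"{match['repo']}@{match['digest']}"]


# Fabricated digest, standing in for the value captured at the end of CI.
example = (
    "mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-test"
    "@sha256:" + "ab" * 32
)
print(pull_command_for(example))
```

Rejecting tag-only references at this point is what keeps a later push to the mutable tag from being promoted by accident.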

Comment on lines 55 to 62
     KVCounter,
     align_to_table_size,
 )
-from dynamicemb.batched_dynamicemb_compute_kernel import (
-    BatchedDynamicEmbedding,
-    BatchedDynamicEmbeddingBag,
-)
 from dynamicemb.dynamicemb_config import DynamicEmbKernel
 from dynamicemb.planner import (
     DynamicEmbParameterConstraints,
     DynamicEmbParameterSharding,
 )

Removing the GroupedEmbeddingsLookup/GroupedPooledEmbeddingsLookup._create_embedding_kernel monkey-patches drops two pieces of behavior that the previous version explicitly set:

  • BatchedDynamicEmbedding/BatchedDynamicEmbeddingBag instantiation for EmbeddingComputeKernel.CUSTOMIZED_KERNEL.
  • self._need_prefetch = True on the lookup — grep shows no other site sets this in tzrec.

If dynamicemb 0.1.0 or torchrec 1.6 now provides this natively, please mention it in the PR description so a future reader doesn't bisect to here. Otherwise this is a silent regression for dynamic embedding tables.

Comment thread tzrec/utils/plan_util.py
Comment on lines 994 to +1016
 if not _constraints or not _constraints.get(key):
-    return [
+    filtered = [
         t for t in allowed_sharding_types if t != ShardingType.GRID_SHARD.value
     ]
+    if is_fp_module:
+        filtered = [
+            t
+            for t in filtered
+            if t not in GUARDED_SHARDING_TYPES_FOR_FP_MODULES
+        ]
+    return filtered
 constraints: ParameterConstraints = _constraints[key]
 if not constraints.sharding_types:
-    return [
+    filtered = [
         t for t in allowed_sharding_types if t != ShardingType.GRID_SHARD.value
     ]
+    if is_fp_module:
+        filtered = [
+            t
+            for t in filtered
+            if t not in GUARDED_SHARDING_TYPES_FOR_FP_MODULES
+        ]
+    return filtered

Two issues:

  1. The "drop GRID_SHARD; if FP, drop GUARDED_SHARDING_TYPES_FOR_FP_MODULES" block is repeated verbatim in both branches — easy to drift on the next torchrec bump. A small _drop_guarded(types, is_fp) helper would deduplicate.
  2. The third branch (when the user supplies an explicit constraints.sharding_types, just below this block) does not apply the FP guard. If a user constrains an FP module to a guarded type like ROW_WISE, it passes through. This may be intentional ("explicit user opt-in overrides the safety filter") — please add an inline comment stating so, otherwise it reads like an oversight.

@github-actions

Review summary

The version bump itself is mostly mechanical and clean (Dockerfile, requirements, image tags, doc strings). The torchrec-1.6 adapter changes in dynamicemb_util.py and plan_util.py are the substantive part and got the inline comments above. Triton change fast_dividef(qk, 1+exp(-qk)) → qk * tl.sigmoid(qk) is performance-equivalent and removes a deprecating import, which is good.

A few non-inline observations:

  • Test coverage gap. The new _get_num_buckets, FP-module guard, and module_type_key substitution in tzrec/utils/plan_util.py have no direct unit tests in plan_util_test.py (existing tests don't construct EmbeddingBagConfig with use_virtual_table=True). tzrec/utils/dynamicemb_util.py has no test file at all, so the new HardwarePerfConfig.get_device_bw patch, including the (cr*hbm + (1-cr)*hbm_to_ddr)/10 formula, only gets coverage via GPU integration tests gated on has_dynamicemb. Worth a small dynamicemb_util_test.py that asserts the formula and that the original is delegated to with all 8 positional args (after self).

  • --ulimit memlock=-1 on shared CI runners. Per the commit message, dynamicemb 0.1.0 mlocks physical memory for its HKV cache tables and fails under the default 64 KB rlimit, so the change is needed, but on the self-hosted tzrec-runner/tzrec-bench-runner pool a runaway test can now pin unbounded host RAM and starve sibling jobs. If feasible, prefer a finite cap (e.g. 64 GiB) sized to the workload rather than -1.

  • scripts/build_docker.sh. set -eo pipefail upgrade is good. Worth applying the same to promote_docker.sh (currently only set -e) for consistency. Also: the deleted ${REGISTRY}${REPO_NAME} line (missing /) was a pre-existing typo — silent fix worth mentioning in the commit message.

  • misc_util.run_cmd log-tail. The 80-line tail is helpful for opaque CI failures. Consider hoisting 80 to a module constant and noting in a one-liner that print() here is intentional (vs the project logger) so a future cleanup pass doesn't "fix" it.

  • Docs. Grepped docs/ for stale 1.1/2.10.0/1.5.0/pytorch2.10 strings — none missed. Nice.
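On the memlock point in the list above, a job can check the limit it actually received before any mlock-dependent code runs. This sketch uses Python's standard `resource` module (Unix only); the helper names are hypothetical and nothing here is part of the PR.

```python
import resource
from typing import Optional


def memlock_limit_bytes() -> Optional[int]:
    """Return the soft RLIMIT_MEMLOCK in bytes, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft


def describe_memlock(limit: Optional[int]) -> str:
    """Format a memlock limit for a CI log line."""
    return "unlimited" if limit is None else f"{limit // 1024} KiB"


print("memlock soft limit:", describe_memlock(memlock_limit_bytes()))
```

Logging this at the start of a job would show whether the container option took effect, and whether a finite cap is large enough for the workload.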

Nothing blocking; main asks are (a) confirm in the PR description that removing the _create_embedding_kernel patches doesn't drop dynamicemb's _need_prefetch=True semantics, and (b) the isinstance-based FP detection in _filter_sharding_types.

tiankongdeguiji and others added 11 commits April 28, 2026 20:16
oss-accelerate.aliyuncs.com is the global-accelerated CDN endpoint and
gives faster, more reliable downloads (especially for the large
fbgemm_gpu / torchrec / libidn11 / Miniforge / cuda-keyring artifacts)
than the regional oss-cn-beijing.aliyuncs.com endpoint we were using.
The bucket and key paths are identical — only the hostname changes —
so existing wheel and asset URLs keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
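Since only the hostname changes, the endpoint switch amounts to a suffix swap on the URL's host. The sketch below assumes bucket-prefixed hosts of the form `<bucket>.oss-cn-beijing.aliyuncs.com`, as in the torchrec repo.html URL from the PR summary; `to_accelerated` is a hypothetical helper, not code from this PR.

```python
from urllib.parse import urlsplit, urlunsplit

REGIONAL_SUFFIX = "oss-cn-beijing.aliyuncs.com"
ACCELERATED_SUFFIX = "oss-accelerate.aliyuncs.com"


def to_accelerated(url: str) -> str:
    """Swap the regional OSS endpoint for the accelerated one, keeping the path."""
    parts = urlsplit(url)
    host = parts.netloc
    if not host.endswith(REGIONAL_SUFFIX):
        return url  # not a regional OSS URL, leave it alone
    new_host = host[: -len(REGIONAL_SUFFIX)] + ACCELERATED_SUFFIX
    return urlunsplit(parts._replace(netloc=new_host))


print(
    to_accelerated(
        "https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/torchrec/repo.html"
    )
)
```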
- requirements/runtime.txt: bump pyfg pin (cp310/cp311/cp312 wheels)
- docs/feature.md: add ExprFeature isnan (new in 1.0.5), mod, corr; drop duplicate sigmoid
- docs/feature.md: extend CombineFeature/LookupFeature combiner enum with count/avg/gap_min/gap_max
- docs/feature.md: note MatchFeature MAP<K, string> input support
- tzrec/features/tokenize_feature.py: omit output_delim in grouped-sequence path; pyfg 1.0.5 expects
  the inner tokenize feature to emit per-token outputs and rejects output_delim there
Standalone TokenizeFeature parses fine without output_delim too, so the
grouped-sequence branch is unnecessary. Simplifies the previous commit.
Follow-up to dropping output_delim from TokenizeFeature._fg_json — update
the expected dicts in feature_test.test_create_fg_json{,_remove_bucketizer}
so they match the new output. Caught by CI on PR alibaba#489.
After cherry-picking the pyfg 1.0.4 -> 1.0.5 bump and the matching
TokenizeFeature output_delim drop onto bump/tzrec-1.2.0, point CI back
at the staging tzrec-test:1.2 images so the next workflow run validates
the freshly-rebuilt containers against the source tree before we promote
them to tzrec-devel:1.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on tzrec-test:1.2 has passed for the pyfg 1.0.5 + TokenizeFeature
changes; promote_docker.sh has retagged + pushed
tzrec-devel:1.2-{cpu,cu126,cu129} (plus the 1.2 and latest aliases) to
the same digests. Switch the 8 workflow YAMLs back to tzrec-devel:1.2 so
the merged master points at the promoted repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DOCKER_TAG_SUFFIX is a per-build marker on the staging tzrec-test
images (e.g. an "-rc1" tail used during a release candidate cycle).
When promoting to tzrec-devel we want the suffix stripped so consumers
see clean tags like tzrec-devel:1.2-cu129 / 1.2 / latest, not
tzrec-devel:1.2-cu129-rc1. Apply the suffix only to the SRC pull and
omit it from every DST tag/push line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
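The suffix rule from the commit above, written as a pure function. This is an illustration in Python, not the shell in `promote_docker.sh`; `promotion_pair` is a hypothetical name, and placing tzrec-devel under the same registry path as tzrec-test is an assumption.

```python
from typing import Tuple

REGISTRY = "mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec"


def promotion_pair(tag: str, device: str, suffix: str = "") -> Tuple[str, str]:
    """Return (source, destination) image references for one device variant."""
    # The per-build suffix applies only to the staging image being pulled.
    src = f"{REGISTRY}/tzrec-test:{tag}-{device}{suffix}"
    # Consumers always see a clean tag on the promoted image.
    dst = f"{REGISTRY}/tzrec-devel:{tag}-{device}"
    return src, dst


for device in ("cpu", "cu126", "cu129"):
    print(promotion_pair("1.2", device, suffix="-rc1"))
```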
torch_tensorrt==2.11.0 is now available for cu126 too (no longer
cu129-only since the 1.2.0 bump), so the cu126 image ships TensorRT
just like cu129. Strip the stale parenthetical from the local-tutorial
docker section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cu129 PyTorch wheel is no longer compiled with sm_70/sm_60 SASS, so
running the cu129 image on Tesla V100 / P100 / P40 (CC 7.0 / 6.x)
trips the runtime warning ``Found GPU0 ... CC 7.0`` and any CUDA kernel
launch fails. Add a 注意 ("Note") block under the docker image variant list in
local_tutorial.md pointing those users at the cu126 image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaced the bare "cu129 needs CC ≥ 7.5" note with the exact
torch.cuda.get_arch_list() output of each image:

- cu129: sm_75 / 80 / 86 / 90 / 100 / 120 + compute_120 PTX
  (T4, A10/A30/A100, L4/L20, H100/H200, B100/B200; **no V100/P100**)
- cu126: sm_50 / 60 / 70 / 75 / 80 / 86 / 90
  (Pascal/Volta/Turing/Ampere/Hopper; **no Blackwell**)

so users can pick the right image up-front instead of hitting
"Found GPU0 ... CC 7.0" at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tiankongdeguiji and others added 12 commits April 29, 2026 11:21
fbgemm-gpu (the sparse-embedding kernel library tzrec relies on) no
longer ships sm_50/sm_60 SASS, so Pascal (P100/P40) and Maxwell cards
fail at the embedding kernel even though stock PyTorch advertises them
in get_arch_list. Tighten the doc to cu126 = sm_70/75/80/86/90 (Volta
through Hopper) and call out the Pascal caveat explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
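The image choice described in these doc commits can be expressed as a lookup on compute capability. The arch sets below are the ones stated in the commit messages (cu126 after the fbgemm-gpu tightening). The matching rule, where a SASS target serves devices of the same major version and an equal or higher minor, is standard CUDA binary compatibility. PTX JIT fallback is deliberately ignored, and the helper names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Capability = Tuple[int, int]

# (major, minor) SASS targets per image, as listed in the commit messages.
SASS_TARGETS: Dict[str, Set[Capability]] = {
    "cu126": {(7, 0), (7, 5), (8, 0), (8, 6), (9, 0)},
    "cu129": {(7, 5), (8, 0), (8, 6), (9, 0), (10, 0), (12, 0)},
}


def supports(image: str, cc: Capability) -> bool:
    """True if some SASS target has the same major and a minor <= the device's."""
    major, minor = cc
    return any(m == major and n <= minor for m, n in SASS_TARGETS[image])


def images_for(cc: Capability) -> List[str]:
    """List the image variants whose kernels can run on the given device."""
    return sorted(image for image in SASS_TARGETS if supports(image, cc))


for name, cc in [("V100", (7, 0)), ("T4", (7, 5)), ("B200", (10, 0))]:
    print(name, images_for(cc))
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.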
Trim the cu126 bullet to the supported CC range; the fbgemm-gpu Pascal
caveat was redundant with the sm_70+ list right above it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim the cu129 bullet to the supported CC list; the unsupported-card
caveat was redundant with the cu126 bullet right below it covering
Volta (V100) and the Pascal note that came earlier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both image bullets now follow the "Volta (V100)、Turing (T4)、…"
arch-first format with explicit example cards in parentheses, instead
of mixing raw card lists in cu129 with arch-name format in cu126.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fbgemm-gpu wheel was silently updated at the existing OSS URL,
so docker images need a refresh. The CACHE_BUST_PIP arg busts
just the torch/fbgemm/torchrec RUN layer (apt + conda + cuda
toolkit layers stay cached).
- Pin tensorrt_cu12==10.15.1.29 in step 12 (matches torch_tensorrt
  2.11.0's `tensorrt-cu12<10.16.0,>=10.15.1`); requirements step
  no longer triggers a second tensorrt install.
- Strip `tensorrt` (broad) from torch_tensorrt's METADATA so the
  bare `Requires-Dist: tensorrt<10.16.0,>=10.15.1` line — which
  pulled in tensorrt_cu13_libs (~3.7 GB) on top — gets neutralized.
- Strip `cuda-toolkit` extras from torch's METADATA so step 19
  doesn't re-resolve the 10 nvidia-* wheels we uninstalled in step 12.
- Drop the 1.65 GB tensorrt_libs/libnvinfer_builder_resource_win_*.so.*
  (PE/Windows binaries shipped under .so for wheel-format compliance).
- pip cache purge in step 19 to free /root/.cache/pip.
- Generate pip.conf at build time via ARG PIP_MIRROR
  (default: mirrors.cloud.aliyuncs.com, override with
  --build-arg PIP_MIRROR=mirrors.aliyun.com); revert to public
  mirror at the last RUN so end-user images still resolve.
- Trailing slash on pytorch-wheels find-links URLs to avoid the
  301 to mirrors.aliyun.com.
# Conflicts:
#	tzrec/utils/misc_util.py
#	tzrec/version.py
pyfg wheel was silently updated at the existing OSS URL, so the
requirements layer needs to be busted. ARG CACHE_BUST_REQ on the
step-19 RUN forces the layer to rebuild and pull the new wheel
content; layers above it (apt + conda + cuda-toolkit + torch +
fbgemm + torchrec) stay cached.
@tiankongdeguiji merged commit cceb2be into alibaba:master on May 2, 2026
6 checks passed

2 participants