[feat] bump to 1.2.0 with torch 2.11 / torchrec 1.6 / fbgemm 1.6 by tiankongdeguiji · Pull Request #479 · alibaba/TorchEasyRec

tiankongdeguiji · 2026-04-20T20:34:53Z

Summary

Coordinated upgrade of the PyTorch stack and companion accelerators for the 1.2.0 release.

tzrec 1.1.11 → 1.2.0
torch 2.10.0 → 2.11.0
torchrec 1.5.0 → 1.6.0 — wheel source switched from mirrors.aliyun.com/pytorch-wheels to https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/torchrec/repo.html
fbgemm-gpu 1.5.0 → 1.6.0
torch-tensorrt 2.10.0 → 2.11.0 (now also available for the cu126 variant, not just cu129)
dynamicemb 0.0.1+20260407.97b80bf → 0.1.0+20260420.c7b9ea2
hstu_attn 0.1.0+bea6b4b → 0.1.0+c7b9ea2

Docker images

New 1.2 images pushed to a staging repo tzrec-test first:

mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-test:1.2-{cpu,cu126,cu129}

CI in this PR runs against tzrec-test:1.2. Once all required checks pass, the images are promoted to tzrec-devel:1.2 via the new scripts/promote_docker.sh, and a final commit flips the CI workflow YAMLs back to tzrec-devel:1.2 so the merged master points at the promoted repo.

Pre-commit

ruff-pre-commit v0.15.4 → v0.15.11
codespell v2.4.1 → v2.4.2
pre-commit-hooks already at latest (v6.0.0)

Docker hardening

Large torch wheel downloads from the aliyun mirror occasionally time out mid-stream. Wrapped the pip installs in an 8x shell retry loop, added timeout=120 retries=5 to docker/pip.conf, and set pipefail in scripts/build_docker.sh so docker-build failures surface instead of being swallowed by tee.

Test plan

buildtest_ci green against tzrec-test:1.2
unittest_ci green (GPU, cu129)
unittest_cpu_ci green
codestyle_ci green (new pre-commit versions)
pytyping_ci green (torchrec 1.6 / fbgemm 1.6 / torch 2.11 API surfaces)
Promote tzrec-test:1.2-* → tzrec-devel:1.2-* after CI passes
Flip CI YAMLs back to tzrec-devel:1.2 and re-run CI green

🤖 Generated with Claude Code

- tzrec 1.1.11 -> 1.2.0
- torch 2.10.0 -> 2.11.0
- torchrec 1.5.0 -> 1.6.0 (switch wheel source to tzrec OSS repo.html)
- fbgemm-gpu 1.5.0 -> 1.6.0
- torch-tensorrt 2.10.0 -> 2.11.0, now also available for cu126
- dynamicemb 0.0.1+20260407.97b80bf -> 0.1.0+20260420.c7b9ea2
- hstu_attn 0.1.0+bea6b4b -> 0.1.0+c7b9ea2
- Docker tag 1.1 -> 1.2, staged via new tzrec-test repo before promoting to tzrec-devel after CI passes (promote_docker.sh added)
- pre-commit: ruff v0.15.4 -> v0.15.11, codespell v2.4.1 -> v2.4.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- pip.conf: add timeout=120 and retries=5 to tolerate transient mirrors.aliyun.com network blips during Dockerfile pip install steps
- build_docker.sh: add set -o pipefail and remove the duplicate shebang so docker build failures are surfaced instead of being swallowed by tee

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wheel downloads from mirrors.aliyun.com/pytorch-wheels occasionally fail mid-stream with ReadTimeoutError even with pip's own retries and timeout bumped. Wrap the torch/torchrec/fbgemm pip install commands in an 8x shell retry loop so transient registry blips don't abort a 40GB image build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- tzrec/utils/plan_util.py: torchrec 1.6's hardware-aware perf estimator now raises ValueError for any compute kernel not in its internal kernel_bw_lookup. Dynamicemb registers CUSTOMIZED_KERNEL which is not in that table, so when dynamicemb is loaded we inject a bandwidth override for ("cuda", "customized_kernel") on the EmbeddingPerfEstimator's HardwarePerfConfig, approximated as a fused_uvm_caching-like mix of HBM and HBM-to-DDR bandwidth.
- tzrec/ops/_triton/triton_hstu_attention.py: the triton 3.6 shipped with torch 2.11 no longer resolves libdevice.fast_dividef for (float64, float64). Replace the two silu formulations (fast_dividef(qk, 1 + exp(-qk))) with the mathematically equivalent qk * tl.sigmoid(qk), which keeps the dtype flow consistent and also avoids the libdevice import.
- .github/workflows/{unittest,buildtest,benchmark,unittest_nightly}_ci.yml: add --ulimit memlock=-1 to the GPU container options. dynamicemb 0.1.0 mlocks physical memory for its HKV cache tables; the default 64 KB rlimit in the github actions runner containers made its prefetch path raise "mlock physical memory failed".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
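The silu rewrite in the Triton kernel rests on the identity x / (1 + exp(-x)) = x * sigmoid(x). A minimal numeric check of that identity in plain Python is sketched below; it does not touch Triton or libdevice, and the helper names are illustrative rather than taken from the repo.

```python
import math


def silu_divide(x: float) -> float:
    """Old formulation: x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu_sigmoid(x: float) -> float:
    """New formulation: x * sigmoid(x)."""
    return x * sigmoid(x)


# Compare both formulations on a handful of sample points.
samples = [-20.0, -3.5, -1.0, 0.0, 0.25, 1.0, 7.0, 20.0]
max_abs_diff = max(abs(silu_divide(x) - silu_sigmoid(x)) for x in samples)
```

The two forms agree up to floating-point rounding on these points; any difference in speed or accuracy inside the actual kernel is not something this sketch measures.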

CI has passed against tzrec-test:1.2-*, the images have been promoted to tzrec-devel:1.2-{cpu,cu126,cu129}, so switch the 8 workflow YAMLs back to tzrec-devel. This is the merge-ready state of the branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Revert the shell-level pip retry loops in the Dockerfile — they were added to tolerate a transient mirrors.aliyun.com flap during the 1.2.0 bump build, but aren't needed for the steady state. pip.conf still sets timeout=120 so individual requests don't hang indefinitely; drop the retries=5 pin as well. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After auditing the torchrec 1.6 estimator surface (commits 8fe63b34 + 005fd685 + 1857d90c) and dynamicemb 0.1.0's own planner overrides, several pieces of tzrec's dynamicemb integration are dead code or silently bypass the new APIs.

dynamicemb_util.py:
- Drop the GroupedEmbeddingsLookup / GroupedPooledEmbeddingsLookup _create_embedding_kernel re-binds. dynamicemb 0.1.0 already provides these overrides verbatim (recsys-examples@c7b9ea2:corelib/dynamicemb/dynamicemb/planner/rw_sharding.py:55-186) with a runtime torchrec-version check; the tzrec re-binds shadow them and lose dynamicemb's <1.5 fallback.
- Drop the shard_estimators.kernel_bw_lookup monkey-patch. After PR #3723's legacy-estimator cleanup, shard_estimators no longer imports kernel_bw_lookup, so the patch is silent dead code. The CUSTOMIZED_KERNEL bandwidth override now lives on the EmbeddingPerfEstimator config (see plan_util.py).
- Drop the now-unused imports (BaseEmbedding, GroupedEmbeddingConfig, ShardingEnv, dist, constants, and the dynamicemb lookup classes).

plan_util.py:
- Replace the static kernel_device_bandwidths dict with a method-level HardwarePerfConfig.get_device_bw override that respects the per-shard caching_ratio. The dict approach pinned a single bandwidth across all 10 cache_load_factor copies that enumerate emits per dynamicemb table; the method override restores 1.5-equivalent fidelity by computing (caching_ratio * hbm + (1 - caching_ratio) * hbm_to_ddr) / 10 per shard. Adds a hasattr assert on the private _estimator/_config attribute chain so a future torchrec rename surfaces a clear error instead of silently regressing.
- Wire forward-compat for torchrec 1.6 stable / nightly:
  * populate self._sharder_data_map = build_sharder_data_map(...) in enumerate (no-op in v1.6.0-rc1; required after post-rc1 commits b0027133 / 25b9b5ff first tagged in v2026.03.30.00).
  * compute num_buckets via a new _get_num_buckets helper and pass it to calculate_shard_sizes_and_offsets, mirroring upstream's virtual-table sharding plumbing (enumerators.py:206, 228).
  * thread sharder_key through _filter_sharding_types and apply the new GUARDED_SHARDING_TYPES_FOR_FP_MODULES filter for FeatureProcessedEmbeddingBagCollection, matching upstream enumerators.py:344-365.
- In EmbeddingStorageEstimator, switch sharder_key lookup to ShardingOption.module_type_key (precomputed in 1.6 at types.py:1175; post-rc1 PR #3917 removes the legacy sharder_name(type(...)) shape).

Verified: all 5 dynamicemb / HSTU / RTP integration tests pass locally in tzrec-test:1.2-cu129 (create_dynamicemb_init_ckpt + multi_tower_din_with_dynamicemb_train_eval + rank_dlrm_hstu_train_eval_export {AOT,unified_aot} + multi_tower_din_rtp_train_export).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # tzrec/ops/_triton/triton_hstu_attention.py # tzrec/version.py

Class-level monkey-patch on HardwarePerfConfig.get_device_bw, applied once at dynamicemb_util module load alongside the other dynamicemb patches. Drops the per-planner _estimator/_config private-attribute walk and the hasattr assert in plan_util.py. Also tightens the comments added in this PR to one line each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI consoles don't upload the per-rank stderr files that test_train_eval / test_export / test_predict write, so a torchrun subprocess failure percolates up to ``self.assertTrue(self.success)`` as an opaque ``False is not true`` with no actual error to diagnose. Print a tail of the failing log file (last 80 lines) right where run_cmd returns False, so future CI failures include the underlying exception in the workflow log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
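The log-tail idea in this commit can be sketched as follows. This is not the actual misc_util.run_cmd code: `tail_log` and `LOG_TAIL_LINES` are hypothetical names, and the only detail taken from the commit message is the 80-line tail.

```python
from collections import deque
from pathlib import Path

# Hypothetical module constant; the commit message only states "last 80 lines".
LOG_TAIL_LINES = 80


def tail_log(path: str, n: int = LOG_TAIL_LINES) -> str:
    """Return the last n lines of a log file as one string.

    A missing file yields a short placeholder instead of raising, so the
    caller's failure path does not itself fail.
    """
    log_path = Path(path)
    if not log_path.is_file():
        return f"<log file {log_path} not found>"
    with log_path.open("r", errors="replace") as f:
        # deque with maxlen keeps only the final n lines while streaming.
        return "".join(deque(f, maxlen=n))
```

A caller would print the result right where the subprocess is found to have failed, for example `print(tail_log(log_file))`, so the underlying exception shows up in the workflow log.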

github-actions · 2026-04-28T09:55:39Z

        self._sharder_map = {
            sharder_name(sharder.module_type): sharder for sharder in sharders
        }
+        self._sharder_data_map = build_sharder_data_map(self._sharder_map)


self._sharder_data_map is assigned here but never read anywhere in the repo (verified by grep). If torchrec doesn't reflect on this attribute, this and the import on line 66 can be dropped; if it is required, please add a comment so it isn't deleted later.

github-actions · 2026-04-28T09:55:43Z

        _constraints, key = self._get_constraints(child_path, name)
-        # GRID_SHARD is only supported if specified by user in parameter constraints
+        # GRID_SHARD and row-wise on FP modules require explicit opt-in.
+        is_fp_module = "FeatureProcessedEmbeddingBagCollection" in sharder_key


Substring matching on the sharder key is brittle: any user-defined wrapper named MyFeatureProcessedEmbeddingBagCollectionXxx matches, while a renamed FP subclass (e.g. FPEBC) silently misses the guard. Since child_module is in scope, prefer isinstance(child_module, FeatureProcessedEmbeddingBagCollection) (importable from torchrec.modules.fp_embedding_modules) and pass that bool through.
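The difference between the two detection styles can be illustrated with stand-in classes. None of the classes below are the real torchrec types, and the name check uses the class name as a simplified proxy for the sharder key string.

```python
class FeatureProcessedEmbeddingBagCollection:
    """Stand-in for the torchrec class of the same name (not the real one)."""


class FPEBC(FeatureProcessedEmbeddingBagCollection):
    """Hypothetical renamed subclass that should still be guarded."""


class MyFeatureProcessedEmbeddingBagCollectionLogger:
    """Hypothetical unrelated wrapper whose name merely contains the substring."""


def is_fp_by_name(module: object) -> bool:
    # Substring match on the type name, mimicking the check under review.
    return "FeatureProcessedEmbeddingBagCollection" in type(module).__name__


def is_fp_by_type(module: object) -> bool:
    # Type-based check: follows inheritance and ignores naming coincidences.
    return isinstance(module, FeatureProcessedEmbeddingBagCollection)
```

With these stand-ins, the name check misses `FPEBC()` and accepts the unrelated logger wrapper, while the `isinstance` check gets both cases right.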

github-actions · 2026-04-28T09:55:58Z

+    def _customized_kernel_aware_get_device_bw(
+        self,  # pyre-ignore [2]
        compute_device: str,
        compute_kernel: str,
        hbm_mem_bw: float,
        ddr_mem_bw: float,
+        ssd_mem_bw: float,
        hbm_to_ddr_mem_bw: float,
        caching_ratio: Optional[float] = None,
        prefetch_pipeline: bool = False,
    ) -> Optional[float]:
-        """Calculates the device bandwidth.
-
-        Args:
-            compute_kernel (str): compute kernel.
-            compute_device (str): compute device.
-            hbm_mem_bw (float): the bandwidth of the device HBM.
-            ddr_mem_bw (float): the bandwidth of the system DDR memory.
-            hbm_to_ddr_mem_bw (float): the bandwidth between device HBM and system DDR.
-            caching_ratio (Optional[float]): caching ratio used to determine device
-                bandwidth if UVM caching is enabled.
-            prefetch_pipeline (bool): whether prefetch pipeline is enabled.
-
-        Returns:
-            Optional[float]: the device bandwidth.
-        """
        if compute_kernel == EmbeddingComputeKernel.CUSTOMIZED_KERNEL.value:
-            # for dynamic embedding table
-            caching_ratio = caching_ratio if caching_ratio else 0.0
-            return (
-                caching_ratio * hbm_mem_bw + (1 - caching_ratio) * hbm_to_ddr_mem_bw
-            ) / 10
-        else:
-            return constants.kernel_bw_lookup(
-                compute_device=compute_device,
-                compute_kernel=compute_kernel,
-                hbm_mem_bw=hbm_mem_bw,
-                ddr_mem_bw=ddr_mem_bw,
-                hbm_to_ddr_mem_bw=hbm_to_ddr_mem_bw,
-                caching_ratio=caching_ratio,
-                prefetch_pipeline=prefetch_pipeline,
-            )
+            cr = caching_ratio if caching_ratio is not None else 0.0
+            return (cr * hbm_mem_bw + (1 - cr) * hbm_to_ddr_mem_bw) / 10
+        return _orig_hw_perf_config_get_device_bw(
+            self,
+            compute_device,
+            compute_kernel,
+            hbm_mem_bw,
+            ddr_mem_bw,
+            ssd_mem_bw,
+            hbm_to_ddr_mem_bw,
+            caching_ratio,
+            prefetch_pipeline,
+        )

    # pyre-ignore [9]
-    shard_estimators.kernel_bw_lookup = _kernel_bw_lookup
+    HardwarePerfConfig.get_device_bw = _customized_kernel_aware_get_device_bw


Two concerns on this monkey-patch:

Signature drift risk. Hard-coded positional forwarding (incl. the new ssd_mem_bw) means any future torchrec change — e.g. another memory tier or a renamed kwarg — will fail at planning time with an opaque TypeError. Consider def _customized_kernel_aware_get_device_bw(self, *args, **kwargs) with a kwarg/positional lookup for compute_kernel/hbm_mem_bw/hbm_to_ddr_mem_bw/caching_ratio, then return _orig_hw_perf_config_get_device_bw(self, *args, **kwargs) for the non-customized path.

Lost docstring. The replaced _kernel_bw_lookup had a full Google-style docstring; the replacement has none, despite the new ssd_mem_bw parameter. Project convention asks for docstrings on non-test functions — please restore one explaining the customized-kernel formula (cr * hbm + (1 - cr) * hbm_to_ddr) / 10 and the /10 factor in particular.
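One way to address both concerns is sketched below against a stand-in class. `FakeHardwarePerfConfig` is not torchrec's HardwarePerfConfig; it only copies the parameter list shown in the diff above, and its body is a placeholder. Binding the arguments through `inspect.signature` is one possible lookup strategy, chosen here as an assumption.

```python
import inspect
from typing import Any, Optional

CUSTOMIZED_KERNEL = "customized_kernel"


class FakeHardwarePerfConfig:
    """Stand-in with the parameter list from the diff (placeholder body)."""

    def get_device_bw(
        self,
        compute_device: str,
        compute_kernel: str,
        hbm_mem_bw: float,
        ddr_mem_bw: float,
        ssd_mem_bw: float,
        hbm_to_ddr_mem_bw: float,
        caching_ratio: Optional[float] = None,
        prefetch_pipeline: bool = False,
    ) -> Optional[float]:
        return hbm_mem_bw  # placeholder, not torchrec's real lookup


_orig_get_device_bw = FakeHardwarePerfConfig.get_device_bw
_orig_signature = inspect.signature(_orig_get_device_bw)


def _customized_kernel_aware_get_device_bw(
    self: Any, *args: Any, **kwargs: Any
) -> Optional[float]:
    """Return device bandwidth, special-casing the customized kernel.

    For the customized kernel the result is
    (cr * hbm_mem_bw + (1 - cr) * hbm_to_ddr_mem_bw) / 10, where cr is the
    caching ratio (0.0 when unset). The /10 scaling is carried over from
    the PR as-is. All other kernels are forwarded unchanged.
    """
    # Bind by name so extra or reordered upstream parameters do not matter.
    params = _orig_signature.bind(self, *args, **kwargs)
    params.apply_defaults()
    p = params.arguments
    if p["compute_kernel"] == CUSTOMIZED_KERNEL:
        cr = p["caching_ratio"] if p["caching_ratio"] is not None else 0.0
        return (cr * p["hbm_mem_bw"] + (1 - cr) * p["hbm_to_ddr_mem_bw"]) / 10
    return _orig_get_device_bw(self, *args, **kwargs)


FakeHardwarePerfConfig.get_device_bw = _customized_kernel_aware_get_device_bw
```

Because the non-customized path forwards `*args, **kwargs` untouched, an added upstream parameter passes straight through; a renamed `compute_kernel` or `caching_ratio` would surface as a `KeyError` at the lookup rather than a positional mismatch.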

github-actions · 2026-04-28T09:56:03Z

+DOCKER_TAG=1.2
+DOCKER_TAG_SUFFIX=
+
+for DEVICE in cpu cu126 cu129
+do


The promote step pulls tzrec-test:<tag>-<device> by mutable tag, so any push to the test repo between CI passing and promote running will be promoted. To make "what was tested" and "what was promoted" the same artifact, capture the digest at the end of CI (docker inspect --format '{{index .RepoDigests 0}}') and pull that @sha256:... here. The current path is acceptable for a single human-driven promotion but invites a TOCTOU footgun if this is ever automated.
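A digest-pinned promotion could look roughly like the sketch below. It only builds the docker command lines and does not run them; `promote_commands` is a hypothetical helper, and the reference format it accepts is the `repo@sha256:<64 hex chars>` form that a RepoDigests entry takes.

```python
import re
from typing import List

# A digest reference: repository name, "@sha256:", then 64 hex characters.
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def promote_commands(src_digest_ref: str, dst_ref: str) -> List[List[str]]:
    """Build pull/tag/push commands for an image pinned by digest.

    Raises ValueError for a mutable tag reference, so a promotion can
    never silently pick up an image other than the one CI tested.
    """
    src = src_digest_ref.strip()
    if _DIGEST_REF.match(src) is None:
        raise ValueError(f"not a digest reference: {src_digest_ref!r}")
    return [
        ["docker", "pull", src],
        ["docker", "tag", src, dst_ref],
        ["docker", "push", dst_ref],
    ]
```

The digest string would be captured at the end of CI and handed to the promotion step, so the artifact tested and the artifact promoted are the same by construction.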

github-actions · 2026-04-28T09:56:08Z

        KVCounter,
        align_to_table_size,
    )
-    from dynamicemb.batched_dynamicemb_compute_kernel import (
-        BatchedDynamicEmbedding,
-        BatchedDynamicEmbeddingBag,
-    )
    from dynamicemb.dynamicemb_config import DynamicEmbKernel
    from dynamicemb.planner import (
        DynamicEmbParameterConstraints,
        DynamicEmbParameterSharding,
    )


Removing the GroupedEmbeddingsLookup/GroupedPooledEmbeddingsLookup._create_embedding_kernel monkey-patches drops two pieces of behavior that the previous version explicitly set:

BatchedDynamicEmbedding/BatchedDynamicEmbeddingBag instantiation for EmbeddingComputeKernel.CUSTOMIZED_KERNEL.

self._need_prefetch = True on the lookup — grep shows no other site sets this in tzrec.

If dynamicemb 0.1.0 or torchrec 1.6 now provides this natively, please mention it in the PR description so a future reader doesn't bisect to here. Otherwise this is a silent regression for dynamic embedding tables.

github-actions · 2026-04-28T09:56:21Z

        if not _constraints or not _constraints.get(key):
-            return [
+            filtered = [
                t for t in allowed_sharding_types if t != ShardingType.GRID_SHARD.value
            ]
+            if is_fp_module:
+                filtered = [
+                    t
+                    for t in filtered
+                    if t not in GUARDED_SHARDING_TYPES_FOR_FP_MODULES
+                ]
+            return filtered
        constraints: ParameterConstraints = _constraints[key]
        if not constraints.sharding_types:
-            return [
+            filtered = [
                t for t in allowed_sharding_types if t != ShardingType.GRID_SHARD.value
            ]
+            if is_fp_module:
+                filtered = [
+                    t
+                    for t in filtered
+                    if t not in GUARDED_SHARDING_TYPES_FOR_FP_MODULES
+                ]
+            return filtered


Two issues:

The "drop GRID_SHARD; if FP, drop GUARDED_SHARDING_TYPES_FOR_FP_MODULES" block is repeated verbatim in both branches — easy to drift on the next torchrec bump. A small _drop_guarded(types, is_fp) helper would deduplicate.

The third branch (when the user supplies an explicit constraints.sharding_types, just below this block) does not apply the FP guard. If a user constrains an FP module to a guarded type like ROW_WISE, it passes through. This may be intentional ("explicit user opt-in overrides the safety filter") — please add an inline comment stating so, otherwise it reads like an oversight.

github-actions · 2026-04-28T09:56:47Z

Review summary

The version bump itself is mostly mechanical and clean (Dockerfile, requirements, image tags, doc strings). The torchrec-1.6 adapter changes in dynamicemb_util.py and plan_util.py are the substantive part and got the inline comments above. The Triton change fast_dividef(qk, 1+exp(-qk)) → qk * tl.sigmoid(qk) is mathematically equivalent and drops the libdevice.fast_dividef call that no longer resolves under triton 3.6. Good.

A few non-inline observations:

Test coverage gap. The new _get_num_buckets, FP-module guard, and module_type_key substitution in tzrec/utils/plan_util.py have no direct unit tests in plan_util_test.py (existing tests don't construct EmbeddingBagConfig with use_virtual_table=True). tzrec/utils/dynamicemb_util.py has no test file at all, so the new HardwarePerfConfig.get_device_bw patch, including the (cr*hbm + (1-cr)*hbm_to_ddr)/10 formula, only gets coverage via GPU integration tests gated on has_dynamicemb. Worth a small dynamicemb_util_test.py that asserts the formula and that the original is delegated to with all 8 positional args (9 counting self).
--ulimit memlock=-1 on shared CI runners. Required because dynamicemb 0.1.0 mlocks physical memory for its HKV cache tables and the default 64 KB rlimit makes its prefetch path fail, so the change is correct, but on the self-hosted tzrec-runner/tzrec-bench-runner pool a runaway test can now pin unbounded host RAM and starve sibling jobs. If feasible, prefer a finite cap (e.g. 64 GiB) sized to the workload rather than -1.
scripts/build_docker.sh. set -eo pipefail upgrade is good. Worth applying the same to promote_docker.sh (currently only set -e) for consistency. Also: the deleted ${REGISTRY}${REPO_NAME} line (missing /) was a pre-existing typo — silent fix worth mentioning in the commit message.
misc_util.run_cmd log-tail. The 80-line tail is helpful for opaque CI failures. Consider hoisting 80 to a module constant and noting in a one-liner that print() here is intentional (vs the project logger) so a future cleanup pass doesn't "fix" it.
Docs. Grepped docs/ for stale 1.1/2.10.0/1.5.0/pytorch2.10 strings — none missed. Nice.

Nothing blocking; main asks are (a) confirm in the PR description that removing the _create_embedding_kernel patches doesn't drop dynamicemb's _need_prefetch=True semantics, and (b) the isinstance-based FP detection in _filter_sharding_types.

oss-accelerate.aliyuncs.com is the global-accelerated CDN endpoint and gives faster, more reliable downloads (especially for the large fbgemm_gpu / torchrec / libidn11 / Miniforge / cuda-keyring artifacts) than the regional oss-cn-beijing.aliyuncs.com endpoint we were using. The bucket and key paths are identical — only the hostname changes — so existing wheel and asset URLs keep working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- requirements/runtime.txt: bump pyfg pin (cp310/cp311/cp312 wheels)
- docs/feature.md: add ExprFeature isnan (new in 1.0.5), mod, corr; drop duplicate sigmoid
- docs/feature.md: extend CombineFeature/LookupFeature combiner enum with count/avg/gap_min/gap_max
- docs/feature.md: note MatchFeature MAP<K, string> input support
- tzrec/features/tokenize_feature.py: omit output_delim in grouped-sequence path; pyfg 1.0.5 expects the inner tokenize feature to emit per-token outputs and rejects output_delim there

Standalone TokenizeFeature parses fine without output_delim too, so the grouped-sequence branch is unnecessary. Simplifies the previous commit.

Follow-up to dropping output_delim from TokenizeFeature._fg_json — update the expected dicts in feature_test.test_create_fg_json{,_remove_bucketizer} so they match the new output. Caught by CI on PR alibaba#489.

After cherry-picking the pyfg 1.0.4 -> 1.0.5 bump and the matching TokenizeFeature output_delim drop onto bump/tzrec-1.2.0, point CI back at the staging tzrec-test:1.2 images so the next workflow run validates the freshly-rebuilt containers against the source tree before we promote them to tzrec-devel:1.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI on tzrec-test:1.2 has passed for the pyfg 1.0.5 + TokenizeFeature changes; promote_docker.sh has retagged + pushed tzrec-devel:1.2-{cpu,cu126,cu129} (plus the 1.2 and latest aliases) to the same digests. Switch the 8 workflow YAMLs back to tzrec-devel:1.2 so the merged master points at the promoted repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # tzrec/version.py

DOCKER_TAG_SUFFIX is a per-build marker on the staging tzrec-test images (e.g. an "-rc1" tail used during a release candidate cycle). When promoting to tzrec-devel we want the suffix stripped so consumers see clean tags like tzrec-devel:1.2-cu129 / 1.2 / latest, not tzrec-devel:1.2-cu129-rc1. Apply the suffix only to the SRC pull and omit it from every DST tag/push line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
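The suffix handling can be sketched as a pure function that yields source and destination references. The registry prefix is the one quoted in the PR description; the exact tag layout (`<tag>-<device><suffix>`) is inferred from the `1.2-cu129-rc1` example above and is an assumption, as is the helper name.

```python
from typing import List, Sequence, Tuple

REGISTRY = "mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec"


def promotion_pairs(
    tag: str,
    suffix: str = "",
    devices: Sequence[str] = ("cpu", "cu126", "cu129"),
) -> List[Tuple[str, str]]:
    """Return (source, destination) image references for each device.

    The suffix is applied only to the staging source image and is
    dropped from the promoted destination tag.
    """
    pairs = []
    for device in devices:
        src = f"{REGISTRY}/tzrec-test:{tag}-{device}{suffix}"
        dst = f"{REGISTRY}/tzrec-devel:{tag}-{device}"
        pairs.append((src, dst))
    return pairs
```

For example, `promotion_pairs("1.2", suffix="-rc1")` pairs `tzrec-test:1.2-cu129-rc1` with `tzrec-devel:1.2-cu129`.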

torch_tensorrt==2.11.0 is now available for cu126 too (no longer cu129-only since the 1.2.0 bump), so the cu126 image ships TensorRT just like cu129. Strip the stale parenthetical from the local-tutorial docker section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cu129 PyTorch wheel is no longer compiled with sm_70/sm_60 SASS, so running the cu129 image on Tesla V100 / P100 / P40 (CC 7.0 / 6.x) trips the runtime warning ``Found GPU0 ... CC 7.0`` and any CUDA kernel launch fails. Add a 注意 ("Note") block under the docker image variant list in local_tutorial.md pointing those users at the cu126 image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaced the bare "cu129 needs CC ≥ 7.5" note with the exact torch.cuda.get_arch_list() output of each image:

- cu129: sm_75 / 80 / 86 / 90 / 100 / 120 + compute_120 PTX (T4, A10/A30/A100, L4/L20, H100/H200, B100/B200; **no V100/P100**)
- cu126: sm_50 / 60 / 70 / 75 / 80 / 86 / 90 (Pascal/Volta/Turing/Ampere/Hopper; **no Blackwell**)

so users can pick the right image up-front instead of hitting "Found GPU0 ... CC 7.0" at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fbgemm-gpu (the sparse-embedding kernel library tzrec relies on) no longer ships sm_50/sm_60 SASS, so Pascal (P100/P40) and Maxwell cards fail at the embedding kernel even though stock PyTorch advertises them in get_arch_list. Tighten the doc to cu126 = sm_70/75/80/86/90 (Volta through Hopper) and call out the Pascal caveat explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
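The image choice described in these doc commits can be expressed as a small lookup. The architecture lists below are the ones stated in the commit messages (cu126 already tightened to sm_70+); the helper name and the matching rule are assumptions. The rule used is the usual SASS one: code built for sm_XY runs on a device of the same major version with minor version at least Y. PTX forward compatibility (compute_120) is ignored for simplicity.

```python
from typing import Dict, List, Tuple

# (major, minor) compute capabilities each image ships kernels for,
# taken from the commit messages above.
IMAGE_ARCHS: Dict[str, List[Tuple[int, int]]] = {
    "cu129": [(7, 5), (8, 0), (8, 6), (9, 0), (10, 0), (12, 0)],
    "cu126": [(7, 0), (7, 5), (8, 0), (8, 6), (9, 0)],
}


def images_for_cc(major: int, minor: int) -> List[str]:
    """Return the image variants expected to run on a GPU with this CC."""
    return [
        image
        for image, archs in IMAGE_ARCHS.items()
        # Same major version, and the built minor must not exceed the device's.
        if any(a_major == major and a_minor <= minor for a_major, a_minor in archs)
    ]
```

On a machine with PyTorch installed, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()`; a V100 at (7, 0) then maps to cu126 only and a B200 at (10, 0) to cu129 only.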

Trim the cu126 bullet to the supported CC range; the fbgemm-gpu Pascal caveat was redundant with the sm_70+ list right above it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Trim the cu129 bullet to the supported CC list; the unsupported-card caveat was redundant with the cu126 bullet right below it covering Volta (V100) and the Pascal note that came earlier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both image bullets now follow the "Volta (V100)、Turing (T4)、…" arch-first format with explicit example cards in parentheses, instead of mixing raw card lists in cu129 with arch-name format in cu126. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fbgemm-gpu wheel was silently updated at the existing OSS URL, so docker images need a refresh. The CACHE_BUST_PIP arg busts just the torch/fbgemm/torchrec RUN layer (apt + conda + cuda toolkit layers stay cached).

…P knobs

- Pin tensorrt_cu12==10.15.1.29 in step 12 (matches torch_tensorrt 2.11.0's `tensorrt-cu12<10.16.0,>=10.15.1`); requirements step no longer triggers a second tensorrt install.
- Strip `tensorrt` (broad) from torch_tensorrt's METADATA so the bare `Requires-Dist: tensorrt<10.16.0,>=10.15.1` line — which pulled in tensorrt_cu13_libs (~3.7 GB) on top — gets neutralized.
- Strip `cuda-toolkit` extras from torch's METADATA so step 19 doesn't re-resolve the 10 nvidia-* wheels we uninstalled in step 12.
- Drop the 1.65 GB tensorrt_libs/libnvinfer_builder_resource_win_*.so.* (PE/Windows binaries shipped under .so for wheel-format compliance).
- pip cache purge in step 19 to free /root/.cache/pip.
- Generate pip.conf at build time via ARG PIP_MIRROR (default: mirrors.cloud.aliyuncs.com, override with --build-arg PIP_MIRROR=mirrors.aliyun.com); revert to public mirror at the last RUN so end-user images still resolve.
- Trailing slash on pytorch-wheels find-links URLs to avoid the 301 to mirrors.aliyun.com.

# Conflicts: # tzrec/utils/misc_util.py # tzrec/version.py

pyfg wheel was silently updated at the existing OSS URL, so the requirements layer needs to be busted. ARG CACHE_BUST_REQ on the step-19 RUN forces the layer to rebuild and pull the new wheel content; layers above it (apt + conda + cuda-toolkit + torch + fbgemm + torchrec) stay cached.

…Q knobs

tiankongdeguiji and others added 10 commits April 20, 2026 16:32

Merge remote-tracking branch 'origin/master' into bump/tzrec-1.2.0

6e8141c

# Conflicts: # tzrec/ops/_triton/triton_hstu_attention.py # tzrec/version.py

tiankongdeguiji added the claude-review Let Claude Review label Apr 28, 2026

github-actions Bot removed the claude-review Let Claude Review label Apr 28, 2026

github-actions Bot reviewed Apr 28, 2026

tiankongdeguiji and others added 11 commits April 28, 2026 20:16

[refactor] drop tokenize_feature output_delim unconditionally

ee4c5fb

[test] drop output_delim from tokenize_feature expected fg_json

cf367a0

Merge remote-tracking branch 'origin/master' into bump/tzrec-1.2.0

e62665e

# Conflicts: # tzrec/version.py

tiankongdeguiji and others added 12 commits April 29, 2026 11:21

[doc] drop fbgemm-gpu sm_50/sm_60 caveat from cu126 CC note

27d4c0a

Merge remote-tracking branch 'origin/master' into bump/tzrec-1.2.0

b159870

[chore] flip CI yamls to tzrec-test:1.2-u1 + suffix the staging build

484b078

[chore] point CI back at tzrec-devel:1.2 and revert -u1/CACHE_BUST_PIP knobs

ebc9d25

[chore] point CI back at tzrec-devel:1.2 and clear -u2 suffix

acfabb2

Merge remote-tracking branch 'origin/master' into bump/tzrec-1.2.0

1633335

# Conflicts: # tzrec/utils/misc_util.py # tzrec/version.py

[chore] point CI back at tzrec-devel:1.2 and revert -u3/CACHE_BUST_REQ knobs

8cab6ca

eric-gecheng approved these changes May 1, 2026

tiankongdeguiji merged commit cceb2be into alibaba:master May 2, 2026
6 checks passed

tiankongdeguiji merged 33 commits into alibaba:master from tiankongdeguiji:bump/tzrec-1.2.0
