Record training args in output_path #1484

@firefighter-eric

Feature request

Please record the training configuration under --output_path when launching model
training.

Currently, training outputs mainly contain checkpoints such as step-*.safetensors. After
a run finishes, it is hard to recover the exact training setup that produced a checkpoint,
especially when comparing many experiments.

Requested behavior

When a training job starts, write reproducibility metadata into output_path, for
example:

  • training_args.json: parsed training script arguments, e.g. vars(args) from
    argparse
  • launch_command.txt: the original command line, including the accelerate launch ... train.py ... invocation if available
  • runtime_env.json: selected runtime information such as CUDA_VISIBLE_DEVICES,
    distributed rank/world size, mixed precision, and DeepSpeed/Accelerate config path
  • optional copies or snapshots of config files such as DeepSpeed JSON / Accelerate YAML
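
The launch command, runtime environment, and config snapshots listed above could be written by a small helper along these lines. This is only a sketch, not the project's code: the function name `write_run_metadata`, the `config_files` parameter, and the selection of environment variables in `ENV_KEYS` are assumptions made for illustration. Note also that inside `train.py`, `sys.argv` normally holds only the script and its own arguments, so the outer `accelerate launch ...` part may not be recoverable this way.

```python
import json
import os
import shlex
import shutil
import sys
from pathlib import Path

# Environment variables to record. This selection is an assumption;
# adjust it to whatever the launcher actually sets.
ENV_KEYS = ["CUDA_VISIBLE_DEVICES", "RANK", "LOCAL_RANK", "WORLD_SIZE",
            "ACCELERATE_MIXED_PRECISION"]


def write_run_metadata(output_path, argv=None, environ=None, config_files=()):
    """Write launch_command.txt, runtime_env.json and config snapshots."""
    argv = sys.argv if argv is None else argv
    environ = os.environ if environ is None else environ
    out = Path(output_path)
    out.mkdir(parents=True, exist_ok=True)

    # shlex.join quotes arguments so the line can be pasted back into a shell.
    (out / "launch_command.txt").write_text(shlex.join(argv) + "\n", encoding="utf-8")

    # Unset variables are stored as null so their absence is visible.
    env = {key: environ.get(key) for key in ENV_KEYS}
    with open(out / "runtime_env.json", "w", encoding="utf-8") as f:
        json.dump(env, f, indent=2)

    # Optional snapshots of DeepSpeed JSON / Accelerate YAML files.
    for src in config_files:
        if src and Path(src).is_file():
            shutil.copy2(src, out / Path(src).name)
    return env
```

In a distributed run it would probably make sense to call this from the main process only, so that ranks do not overwrite each other's files.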

At minimum, saving vars(args) to training_args.json would already be very useful.
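
As a rough illustration of that minimal version, the sketch below dumps an argparse namespace to training_args.json. The helper name `save_training_args` is hypothetical, and `default=str` is just one possible way to handle values that JSON cannot encode directly (such as `Path` objects).

```python
import argparse
import json
import os


def save_training_args(args: argparse.Namespace, output_path: str) -> str:
    """Dump parsed argparse arguments to <output_path>/training_args.json."""
    os.makedirs(output_path, exist_ok=True)
    target = os.path.join(output_path, "training_args.json")
    with open(target, "w", encoding="utf-8") as f:
        # default=str is a fallback for values json cannot encode itself.
        json.dump(vars(args), f, indent=2, sort_keys=True, default=str)
    return target
```

It could be called as `save_training_args(args, args.output_path)` right after `parser.parse_args()`, before any checkpoint is written.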

Why this is useful

This makes checkpoints self-describing and easier to reproduce and debug. It is
especially important for experiments that vary dataset path, resolution settings,
LoRA/full fine-tuning settings, learning rate, save steps, gradient accumulation,
gradient checkpointing/offload, and DeepSpeed/Accelerate distributed training settings.

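Chinese description (translated)

The training configuration should be recorded automatically under the --output_path
directory when training starts.

Currently the training output directory mainly contains checkpoints such as
step-*.safetensors. After training finishes, unless logs were saved separately, it is hard
to accurately recover the training arguments behind a given checkpoint, especially when
comparing several experiments at once.

Expected behavior

When a training job starts, write reproducibility information under output_path, for
example:

  • training_args.json: the parsed training script arguments, e.g. vars(args) from argparse
  • launch_command.txt: the original launch command; if feasible, the full accelerate launch ... train.py ... command
  • runtime_env.json: key runtime environment details such as CUDA_VISIBLE_DEVICES,
    distributed rank/world size, mixed precision, and the DeepSpeed/Accelerate config path
  • Optional: copies or snapshots of config files such as DeepSpeed JSON / Accelerate YAML

At minimum, saving vars(args) as training_args.json would already be very helpful.

Value

This makes each checkpoint easier to trace and reproduce, and makes it easier to
investigate differences between experiments. It matters most when the dataset path,
resolution, LoRA vs. full-parameter training, learning rate, save interval, gradient
accumulation, gradient checkpointing/offload, or DeepSpeed/Accelerate distributed
configuration changes.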
