Feature request
Please record the training configuration under --output_path when launching model
training.
Currently, training outputs mainly contain checkpoints such as step-*.safetensors. After
a run finishes, it is hard to recover the exact training setup that produced a checkpoint,
especially when comparing many experiments.
Requested behavior
When a training job starts, write reproducibility metadata into output_path, for
example:
- training_args.json: parsed training script arguments, e.g. vars(args) from argparse
- launch_command.txt: the original command line, including the accelerate launch ... train.py ... invocation if available
- runtime_env.json: selected runtime information such as CUDA_VISIBLE_DEVICES, distributed rank/world size, mixed precision, and DeepSpeed/Accelerate config path
- optional copies or snapshots of config files such as DeepSpeed JSON / Accelerate YAML
At minimum, saving vars(args) to training_args.json would already be very useful.
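A minimal sketch of what this could look like, assuming the training script parses its arguments with argparse and that the launcher exposes RANK, LOCAL_RANK, and WORLD_SIZE as environment variables. The helper name save_run_metadata and its env_keys parameter are hypothetical and not part of the existing code base. Note that sys.argv only contains the training script's own arguments, so the outer accelerate launch ... command is not recovered here; mixed precision and config paths would also need to be added by whoever knows where the script stores them.

```python
import json
import os
import shlex
import sys
from pathlib import Path

# Assumption: these variables are set by the distributed launcher.
DEFAULT_ENV_KEYS = ("CUDA_VISIBLE_DEVICES", "RANK", "LOCAL_RANK", "WORLD_SIZE")


def save_run_metadata(args, output_path, env_keys=DEFAULT_ENV_KEYS):
    """Write reproducibility metadata next to the checkpoints (hypothetical helper)."""
    # Assumption: only the process with RANK 0 (or a non-distributed run) should write.
    if int(os.environ.get("RANK", "0")) != 0:
        return

    out_dir = Path(output_path)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Parsed arguments; default=str keeps non-JSON values (e.g. Path objects) serializable.
    with open(out_dir / "training_args.json", "w", encoding="utf-8") as f:
        json.dump(vars(args), f, indent=2, sort_keys=True, default=str)

    # Interpreter plus script arguments as seen by this process, shell-quoted.
    with open(out_dir / "launch_command.txt", "w", encoding="utf-8") as f:
        f.write(shlex.join([sys.executable, *sys.argv]) + "\n")

    # Selected environment variables; missing ones are stored as null.
    runtime_env = {key: os.environ.get(key) for key in env_keys}
    with open(out_dir / "runtime_env.json", "w", encoding="utf-8") as f:
        json.dump(runtime_env, f, indent=2, sort_keys=True)
```

Calling save_run_metadata(args, args.output_path) once, right after argument parsing and before the first training step, would be enough to cover the minimum request.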
Why this is useful
This makes checkpoints self-describing and easier to reproduce and debug. It is especially
important for experiments that vary the dataset path, resolution settings, LoRA/full
fine-tuning settings, learning rate, save steps, gradient accumulation, gradient
checkpointing/offload, and DeepSpeed/Accelerate distributed training settings.
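Chinese description (translated to English)
When training starts, automatically record the training configuration under the --output_path directory.
At present the training output directory mostly contains only checkpoints such as step-*.safetensors. After training finishes, unless logs were saved separately, it is hard to reconstruct exactly which training parameters produced a given checkpoint, especially when comparing several experiments at once.
Expected behavior
When a training job starts, write reproducibility information under output_path, for example:
- training_args.json: the parsed training script arguments, e.g. vars(args) obtained from argparse
- launch_command.txt: the original launch command; if feasible, the complete accelerate launch ... train.py ... command
- runtime_env.json: key runtime environment, such as CUDA_VISIBLE_DEVICES, distributed rank/world size, mixed precision, and the DeepSpeed/Accelerate config path
- optional: copy or snapshot config files such as the DeepSpeed JSON / Accelerate YAML
At a minimum, simply saving vars(args) as training_args.json would already help a lot.
Value
This makes each checkpoint easier to trace and reproduce, and makes differences between experiments easier to diagnose. It matters most when the dataset path, resolution, LoRA/full-parameter training, learning rate, save interval, gradient accumulation, gradient checkpointing/offload, or DeepSpeed/Accelerate distributed configuration changes.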