Record training args in output_path #1484

## Feature request

Please record the training configuration under `--output_path` when launching model
training.

Currently, training outputs mainly contain checkpoints such as `step-*.safetensors`. After
a run finishes, it is hard to recover the exact training setup that produced a checkpoint,
especially when comparing many experiments.

## Requested behavior

When a training job starts, write reproducibility metadata into `output_path`, for
example:

- `training_args.json`: parsed training script arguments, e.g. `vars(args)` from
  `argparse`
- `launch_command.txt`: the original command line, including the
  `accelerate launch ... train.py ...` invocation if available
- `runtime_env.json`: selected runtime information such as `CUDA_VISIBLE_DEVICES`,
  distributed rank/world size, mixed precision, and DeepSpeed/Accelerate config path
- optional copies or snapshots of config files such as DeepSpeed JSON / Accelerate YAML

At minimum, saving `vars(args)` to `training_args.json` would already be very useful.
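
A minimal sketch of what this could look like, in Python. The helper name `record_run_metadata` and the selection of environment variables are assumptions for illustration, not existing project code. `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are the variables that torchrun-style launchers set. Mixed precision and config paths are left out because how to read them depends on the launcher.

```python
import argparse
import json
import os
import shlex
import sys
from pathlib import Path


def record_run_metadata(args: argparse.Namespace, output_path: str) -> None:
    """Hypothetical helper: write reproducibility metadata next to the checkpoints."""
    # Only the main process writes, so distributed runs do not race on the files.
    # An unset RANK is treated as a single-process run.
    if os.environ.get("RANK", "0") != "0":
        return

    out_dir = Path(output_path)
    out_dir.mkdir(parents=True, exist_ok=True)

    # training_args.json: the parsed arguments. default=str turns values that
    # JSON cannot encode (e.g. Path objects) into strings instead of raising.
    with open(out_dir / "training_args.json", "w", encoding="utf-8") as f:
        json.dump(vars(args), f, indent=2, sort_keys=True, default=str)

    # launch_command.txt: sys.argv typically covers only the `train.py ...` part.
    # The outer `accelerate launch ...` wrapper is not visible from inside the
    # script and would have to be captured by the launcher itself.
    with open(out_dir / "launch_command.txt", "w", encoding="utf-8") as f:
        f.write(shlex.join([sys.executable, *sys.argv]) + "\n")

    # runtime_env.json: a small, assumed selection of environment variables.
    # Unset variables are stored as null.
    env_keys = ["CUDA_VISIBLE_DEVICES", "RANK", "LOCAL_RANK", "WORLD_SIZE"]
    runtime_env = {key: os.environ.get(key) for key in env_keys}
    with open(out_dir / "runtime_env.json", "w", encoding="utf-8") as f:
        json.dump(runtime_env, f, indent=2, sort_keys=True)
```

In a training script this would be called once, right after `args = parser.parse_args()`, as `record_run_metadata(args, args.output_path)`, so the files exist even if the run crashes before the first checkpoint.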

## Why this is useful

This makes checkpoints self-describing and easier to reproduce/debug. It is especially
important for experiments that vary dataset path, resolution settings, LoRA/full
fine-tuning settings, learning rate, save steps, gradient accumulation,
gradient checkpointing/offload, and DeepSpeed/Accelerate distributed training settings.

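## Chinese description

Please automatically record the training configuration in the `--output_path` directory
when training starts.

Currently, the training output directory mostly contains only checkpoints such as
`step-*.safetensors`. After training finishes, unless logs were saved separately, it is
hard to accurately recover the training arguments behind a given checkpoint, especially
when comparing several experiments at once.

## Expected behavior

When a training job starts, write reproducibility information under `output_path`, for
example:

- `training_args.json`: the arguments parsed by the training script, e.g. the
  `vars(args)` obtained from `argparse`
- `launch_command.txt`: the original launch command; if feasible, including the full
  `accelerate launch ... train.py ...` command
- `runtime_env.json`: key runtime environment details, such as `CUDA_VISIBLE_DEVICES`,
  distributed rank/world size, mixed precision, and the DeepSpeed/Accelerate config path
- optional: copies or snapshots of config files such as DeepSpeed JSON / Accelerate YAML

At a minimum, just saving `vars(args)` as `training_args.json` would already be very
helpful.

## Value

This makes each checkpoint easier to trace and reproduce, and makes it easier to track
down differences between experiments. It matters most when the dataset path, resolution,
LoRA/full-parameter training, learning rate, save interval, gradient accumulation,
gradient checkpointing/offload, or DeepSpeed/Accelerate distributed configuration changes.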
