This fork was created to run the benchmarks on Modal.com. We focus on benchmarking LoRA models trained for specific styles, so we run only the alignment, diversity, and text benchmarks.
- Alignment: Only run the alignment benchmark for human and object.
- Diversity: Only run the diversity benchmark for human, object, and text.
Anime_Stylization and Knowledge_Reasoning are not relevant to our use case, but you can still use the Modal app to run these two categories: just call the alignment or diversity script with the appropriate --class_items argument.
- 2025.09.19 🌟 OneIG-Bench has been accepted to the NeurIPS 2025 DB Track.
- 2025.09.19 🌟 We updated the Seedream 4.0, Gemini-2.5-Flash-Image (Nano Banana), Step-3o Vision, and HunyuanImage-2.1 evaluation results on our leaderboard here.
- 2025.09.19 🌟 We updated the NextStep-1, Lumina-DiMOO, and IRG official evaluation results on our leaderboard here.
- 2025.08.13 🌟 We updated the Qwen-Image official evaluation results on our leaderboard here.
- 2025.08.13 🌟 We updated the fine-grained analysis script here.
- 2025.07.03 🌟 We updated the Ovis-U1 evaluation results on our leaderboard here.
- 2025.06.25 🌟 We updated the Show-o2 and OmniGen2 evaluation results on our leaderboard here.
- 2025.06.23 🌟 We released the T2I generation script here.
- 2025.06.10 🌟 We released the OneIG-Bench benchmark on 🤗 Hugging Face.
- 2025.06.10 🌟 We released the tech report and the project page.
- 2025.06.10 🌟 We released the evaluation scripts.
- Fine-grained Analysis Script
- Real-time Updating Leaderboard
- OneIG-Bench Release
- Evaluation Scripts, Technical Report & Project Page Release
We introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including subject-element alignment, text rendering precision, reasoning-generated content, stylization, and diversity. Specifically, these dimensions can be flexibly selected for evaluation based on specific needs.
Key contributions:
- We present OneIG-Bench, which consists of six prompt sets, with the first five — 245 Anime and Stylization, 244 Portrait, 206 General Object, 200 Text Rendering, and 225 Knowledge and Reasoning prompts — each provided in both English and Chinese, and 200 Multilingualism prompts, designed for the comprehensive evaluation of current text-to-image models.
- A systematic quantitative evaluation is developed to facilitate objective capability ranking through standardized metrics, enabling direct comparability across models. Specifically, our evaluation framework allows T2I models to generate images only for prompts associated with a particular evaluation dimension, and to assess performance accordingly within that dimension.
- State-of-the-art open-source methods as well as proprietary models are evaluated on our proposed benchmark to facilitate the development of text-to-image research.
We test our benchmark using torch==2.6.0, torchvision==0.21.0 with cuda-11.8, python==3.10.
Install requirements:
```bash
pip install -r requirements.txt
```

The version of flash-attention is in the last line of requirements.txt.
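Optionally, you can confirm from Python that your environment matches the versions listed above before running anything:

```python
# Optional check that the installed versions match the tested setup
# (torch 2.6.0, torchvision 0.21.0, CUDA 11.8).
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("cuda:", torch.version.cuda, "| available:", torch.cuda.is_available())
```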
To evaluate style performance, please download the CSD model and CLIP model, then put them under ./scripts/style/models.
You can use the script to generate images. You only need to implement the inference function in the script.
We recommend generating 4 images for each prompt in OneIG-Bench. The generated images for each prompt should be saved into subfolders based on their category (Anime & Stylization, Portrait, General Object, Text Rendering, Knowledge & Reasoning, Multilingualism), corresponding to the folders anime, human, object, text, reasoning, multilingualism. If any image cannot be generated, the script will save a black image with the specified filename.
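As a rough illustration, here is a minimal sketch of what such an inference function could look like, assuming a diffusers SDXL pipeline with an optional style LoRA; the actual function name, signature, and backend expected by the generation script may differ:

```python
# Minimal sketch (not the repo's code): generate N images per prompt with a
# diffusers pipeline, falling back to a black placeholder image on failure.
import torch
from PIL import Image
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# pipe.load_lora_weights("path/to/your/style_lora")  # optional: apply a style LoRA

def inference(prompt: str, num_images: int = 4, size: int = 1024) -> list[Image.Image]:
    images = []
    for _ in range(num_images):
        try:
            images.append(pipe(prompt, height=size, width=size).images[0])
        except Exception:
            # Keep the benchmark layout intact when generation fails.
            images.append(Image.new("RGB", (size, size), (0, 0, 0)))
    return images
```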
The filename for each image should follow the id assigned to that prompt in OneIG-Bench.csv/OneIG-Bench-ZH.csv. The structure of the images to be saved should look like:
```
📁 images/
├── 📂 anime/
│   ├── 📂 gpt-4o/
│   │   ├── 000.webp
│   │   ├── 001.webp
│   │   └── ...
│   ├── 📂 imagen4/
│   └── ...
├── 📂 human/
│   ├── 📂 gpt-4o/
│   ├── 📂 imagen4/
│   └── ...
├── 📂 object/
│   └── ...
├── 📂 text/
│   └── ...
├── 📂 reasoning/
│   └── ...
└── 📂 multilingualism/   # For OneIG-Bench-ZH
    └── ...
```

```bash
./run_{overall, alignment, diversity, reasoning, style, text}.sh
```

The run_overall.sh script executes all metrics; running it produces the results for every metric in the results directory. You can also evaluate a single metric by running the corresponding script, run_{metric_name}.sh.
To ensure that the generated images are correctly loaded for evaluation, you can modify the following parameters in each script (a small sanity-check sketch follows the list):

- mode: select EN or ZH to evaluate on OneIG-Bench or OneIG-Bench-ZH.
- image_dir: the directory where the images generated by your model are stored.
- model_names: the names or identifiers of the models you want to evaluate.
- image_grid: the grid of images generated per prompt; 1 means 1 image, 2 means 4 images (a 2×2 grid), and so on.
- class_items: the prompt categories (image sets) you want to evaluate.
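As mentioned above, the following is a hypothetical sanity-check sketch (not part of the repo) that mirrors those parameters and counts the image files in each expected folder, so path or naming mismatches surface before you launch an evaluation; the values shown are illustrative only:

```python
# Hypothetical sanity check: the values below are illustrative, adjust to your setup.
from pathlib import Path

image_dir = Path("images")
model_names = ["gpt-4o"]                    # must match the model subfolder names
image_grid = 2                              # 2 -> a 2x2 grid of 4 images per prompt
class_items = ["human", "object", "text"]   # categories this fork evaluates

for cls in class_items:
    for model in model_names:
        folder = image_dir / cls / model
        count = len(list(folder.glob("*.webp"))) if folder.is_dir() else 0
        print(f"{folder}: {count} image files (grid={image_grid})")
```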
Copy all the CSV files generated for each prompt dimension (for the style dimension in particular, the files are named style_style*.csv) into a subfolder named after the model inside the RESULT_DIR directory.
Then, in fine_grained_analysis.py, adjust the MODE, RESULT_DIR, and KEYS
parameters as needed to perform the fine-grained analysis.
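For illustration, the adjusted block at the top of fine_grained_analysis.py might look roughly like the following; the variable names come from the script, but the values shown (and the exact meaning of KEYS) are assumptions to check against the script itself:

```python
# Illustrative values only -- check fine_grained_analysis.py for the accepted options.
MODE = "EN"               # assumed: "EN" for OneIG-Bench, "ZH" for OneIG-Bench-ZH
RESULT_DIR = "results"    # directory containing one subfolder of result CSVs per model
KEYS = ["gpt-4o"]         # assumed: the model subfolders to include in the analysis
```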
We define the sets of images generated based on the OneIG-Bench prompt categories: General Object (O), Portrait (P), Anime and Stylization, split into (A) for prompts without stylization and (S) for prompts with stylization, Text Rendering (T), Knowledge and Reasoning (KR), and Multilingualism (L).
The correspondence between the evaluation metrics and the evaluated image sets in OneIG-Bench and OneIG-Bench-ZH is presented in the table below.
- 📊 Metrics and Image Sets Correspondence
| | Alignment | Text | Reasoning | Style | Diversity |
|---|---|---|---|---|---|
| OneIG-Bench | O, P, A, S | T | KR | S | O, P, A, S, T, KR |
| OneIG-Bench-ZH | Ozh, Pzh, Azh, Szh, Lzh | Tzh | KRzh | Szh | Ozh, Pzh, Azh, Szh, Lzh, Tzh, KRzh |
- Method Comparison on OneIG-Bench:
- Method Comparison on OneIG-Bench-ZH:
- Benchmark Comparison:
OneIG-Bench (also referred to as OneIG-Bench-EN) denotes the English benchmark set.
If you find our work helpful for your research, please consider citing our work.
```bibtex
@article{chang2025oneig,
  title={OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation},
  author={Jingjing Chang and Yixiao Fang and Peng Xing and Shuhan Wu and Wei Cheng and Rui Wang and Xianfang Zeng and Gang Yu and Hai-Bao Chen},
  journal={arXiv preprint arXiv:2506.07977},
  year={2025}
}
```

We would like to express our sincere thanks to the contributors of Qwen, CLIP, CSD_Score, DreamSim, and the HuggingFace team for their open research and exploration.



