This fork was created to run the benchmarks on Modal.com. We focus on benchmarking LoRA models trained for specific styles, so we run only the alignment, diversity, and text benchmarks.
- Alignment: Only run the alignment benchmark for human and object.
- Diversity: Only run the diversity benchmark for human, object, and text.
Anime_Stylization and Knowledge_Reasoning are not relevant to our use case, but you can still use the Modal app to run these two categories: just call the alignment or diversity script with the appropriate --class_items argument.
- 2025.09.19 🌟 OneIG-Bench has been accepted to the NeurIPS 2025 DB Track.
- 2025.09.19 🌟 We updated the Seedream 4.0, Gemini-2.5-Flash-Image (Nano Banana), Step-3o Vision, and HunyuanImage-2.1 evaluation results on our leaderboard here.
- 2025.09.19 🌟 We updated the NextStep-1, Lumina-DiMOO, and IRG official evaluation results on our leaderboard here.
- 2025.08.13 🌟 We updated the Qwen-Image official evaluation results on our leaderboard here.
- 2025.08.13 🌟 We updated the fine-grained analysis script here.
- 2025.07.03 🌟 We updated the Ovis-U1 evaluation results on our leaderboard here.
- 2025.06.25 🌟 We updated the Show-o2 and OmniGen2 evaluation results on our leaderboard here.
- 2025.06.23 🌟 We released the T2I generation script here.
- 2025.06.10 🌟 We released the OneIG-Bench benchmark on 🤗 Hugging Face.
- 2025.06.10 🌟 We released the tech report and the project page.
- 2025.06.10 🌟 We released the evaluation scripts.
- Fine-grained Analysis Script
- Real-time Updating Leaderboard
- OneIG-Bench Release
- Evaluation Scripts, Technical Report & Project Page Release
We introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including subject-element alignment, text rendering precision, reasoning-generated content, stylization, and diversity. Specifically, these dimensions can be flexibly selected for evaluation based on specific needs.
Key contributions:
- We present OneIG-Bench, which consists of six prompt sets, with the first five — 245 Anime and Stylization, 244 Portrait, 206 General Object, 200 Text Rendering, and 225 Knowledge and Reasoning prompts — each provided in both English and Chinese, and 200 Multilingualism prompts, designed for the comprehensive evaluation of current text-to-image models.
- A systematic quantitative evaluation is developed to facilitate objective capability ranking through standardized metrics, enabling direct comparability across models. Specifically, our evaluation framework allows T2I models to generate images only for prompts associated with a particular evaluation dimension, and to assess performance accordingly within that dimension.
- State-of-the-art open-source methods as well as proprietary models are evaluated on our proposed benchmark to facilitate the development of text-to-image research.
We test our benchmark using torch==2.6.0, torchvision==0.21.0 with cuda-11.8, python==3.10.
Install requirements:
```bash
pip install -r requirements.txt
```

The version of flash-attention is in the last line of requirements.txt.
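Optionally, you can confirm from Python that your environment matches the versions listed above before running anything:

```python
# Optional check that the installed versions match the tested setup
# (torch 2.6.0, torchvision 0.21.0, CUDA 11.8).
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("cuda:", torch.version.cuda, "| available:", torch.cuda.is_available())
```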
To evaluate style performance, please download the CSD model and CLIP model, then put them under ./scripts/style/models.
You can use the script to generate images. You only need to implement the inference function in the script.
We recommend generating 4 images for each prompt in OneIG-Bench. The generated images for each prompt should be saved into subfolders based on their category (Anime & Stylization, Portrait, General Object, Text Rendering, Knowledge & Reasoning, Multilingualism), corresponding to the folders anime, human, object, text, reasoning, multilingualism. If any image cannot be generated, the script will save a black image with the specified filename.
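As a rough illustration, here is a minimal sketch of what such an inference function could look like, assuming a diffusers SDXL pipeline with an optional style LoRA; the actual function name, signature, and backend expected by the generation script may differ:

```python
# Minimal sketch (not the repo's code): generate N images per prompt with a
# diffusers pipeline, falling back to a black placeholder image on failure.
import torch
from PIL import Image
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# pipe.load_lora_weights("path/to/your/style_lora")  # optional: apply a style LoRA

def inference(prompt: str, num_images: int = 4, size: int = 1024) -> list[Image.Image]:
    images = []
    for _ in range(num_images):
        try:
            images.append(pipe(prompt, height=size, width=size).images[0])
        except Exception:
            # Keep the benchmark layout intact when generation fails.
            images.append(Image.new("RGB", (size, size), (0, 0, 0)))
    return images
```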
The filename for each image should follow the id assigned to that prompt in OneIG-Bench.csv/OneIG-Bench-ZH.csv. The structure of the images to be saved should look like:
```
📁 images/
├── 📂 anime/
│   ├── 📂 gpt-4o/
│   │   ├── 000.webp
│   │   ├── 001.webp
│   │   └── ...
│   ├── 📂 imagen4/
│   └── ...
├── 📂 human/
│   ├── 📂 gpt-4o/
│   ├── 📂 imagen4/
│   └── ...
├── 📂 object/
│   └── ...
├── 📂 text/
│   └── ...
├── 📂 reasoning/
│   └── ...
└── 📂 multilingualism/   # For OneIG-Bench-ZH
    └── ...
```

```bash
./run_{overall, alignment, diversity, reasoning, style, text}.sh
```

The run_overall.sh script executes all metrics; running it produces the results for every metric in the results directory. You can also evaluate a single metric by running the corresponding script, run_{metric_name}.sh.
To ensure that the generated images are correctly loaded for evaluation, you can modify the following parameters in each script (a small sanity-check sketch follows the list):

- mode: select EN or ZH to evaluate on OneIG-Bench or OneIG-Bench-ZH.
- image_dir: the directory where the images generated by your model are stored.
- model_names: the names or identifiers of the models you want to evaluate.
- image_grid: the grid of images generated per prompt; 1 means 1 image, 2 means 4 images (a 2×2 grid), and so on.
- class_items: the prompt categories (image sets) you want to evaluate.
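As mentioned above, the following is a hypothetical sanity-check sketch (not part of the repo) that mirrors those parameters and counts the image files in each expected folder, so path or naming mismatches surface before you launch an evaluation; the values shown are illustrative only:

```python
# Hypothetical sanity check: the values below are illustrative, adjust to your setup.
from pathlib import Path

image_dir = Path("images")
model_names = ["gpt-4o"]                    # must match the model subfolder names
image_grid = 2                              # 2 -> a 2x2 grid of 4 images per prompt
class_items = ["human", "object", "text"]   # categories this fork evaluates

for cls in class_items:
    for model in model_names:
        folder = image_dir / cls / model
        count = len(list(folder.glob("*.webp"))) if folder.is_dir() else 0
        print(f"{folder}: {count} image files (grid={image_grid})")
```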
Copy all the CSV files generated for each prompt dimension (for the style dimension in particular, the files are named style_style*.csv) into a subfolder named after the model inside the RESULT_DIR directory.
Then, in fine_grained_analysis.py, adjust the MODE, RESULT_DIR, and KEYS
parameters as needed to perform the fine-grained analysis.
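For illustration, the adjusted block at the top of fine_grained_analysis.py might look roughly like the following; the variable names come from the script, but the values shown (and the exact meaning of KEYS) are assumptions to check against the script itself:

```python
# Illustrative values only -- check fine_grained_analysis.py for the accepted options.
MODE = "EN"               # assumed: "EN" for OneIG-Bench, "ZH" for OneIG-Bench-ZH
RESULT_DIR = "results"    # directory containing one subfolder of result CSVs per model
KEYS = ["gpt-4o"]         # assumed: the model subfolders to include in the analysis
```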
We define the sets of images generated based on the OneIG-Bench prompt categories: General Object (O), Portrait (P), Anime and Stylization, split into (A) for prompts without stylization and (S) for prompts with stylization, Text Rendering (T), Knowledge and Reasoning (KR), and Multilingualism (L).
The correspondence between the evaluation metrics and the evaluated image sets in OneIG-Bench and OneIG-Bench-ZH is presented in the table below.
- 📊 Metrics and Image Sets Correspondence
| | Alignment | Text | Reasoning | Style | Diversity |
|---|---|---|---|---|---|
| OneIG-Bench | O, P, A, S | T | KR | S | O, P, A, S, T, KR |
| OneIG-Bench-ZH | Ozh, Pzh, Azh, Szh, Lzh | Tzh | KRzh | Szh | Ozh, Pzh, Azh, Szh, Lzh, Tzh, KRzh |
- Method Comparison on OneIG-Bench:
- Method Comparison on OneIG-Bench-ZH:
- Benchmark Comparison:
OneIG-Bench (also referred to as OneIG-Bench-EN) denotes the English benchmark set.
If you find our work helpful for your research, please consider citing our work.
```bibtex
@article{chang2025oneig,
  title={OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation},
  author={Jingjing Chang and Yixiao Fang and Peng Xing and Shuhan Wu and Wei Cheng and Rui Wang and Xianfang Zeng and Gang Yu and Hai-Bao Chen},
  journal={arXiv preprint arXiv:2506.07977},
  year={2025}
}
```

We would like to express our sincere thanks to the contributors of Qwen, CLIP, CSD_Score, DreamSim, and the HuggingFace team for their open research and exploration.



