
Commit e071d1d

Update GitHub热门评测repo.md
1 parent 1d2d3fa commit e071d1d

File tree

1 file changed: +53 −28 lines changed


GitHub热门评测repo.md

Lines changed: 53 additions & 28 deletions
@@ -1,28 +1,53 @@
Removed (original table):

| repo | star |
|------|------|
| [langfuse](https://github.com/langfuse/langfuse) | 14.9k |
| [opik](https://github.com/comet-ml/opik) | 12.5k |
| [ragas](https://github.com/explodinggradients/ragas) | 10.3k |
| [deepeval](https://github.com/confident-ai/deepeval) | 10.1k |
| [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) | 9.8k |
| [promptfoo](https://github.com/promptfoo/promptfoo) | 7.9k |
| [phoenix](https://github.com/Arize-ai/phoenix) | 6.6k |
| [opencompass](https://github.com/open-compass/opencompass) | 5.8k |
| [garak](https://github.com/NVIDIA/garak) | 4.9k |
| [⭐chinese-llm-benchmark (ours)](https://github.com/jeinlee1991/chinese-llm-benchmark) | 4.7k |
| [ARC-AGI](https://github.com/fchollet/ARC-AGI) | 4.5k |
| [helicone](https://github.com/Helicone/helicone) | 4.3k |
| [AutoRAG](https://github.com/Marker-Inc-Korea/AutoRAG) | 4.2k |
| [simple-evals](https://github.com/openai/simple-evals) | 4.0k |
| [SWE-bench](https://github.com/SWE-bench/SWE-bench) | 3.3k |
| [SuperCLUE](https://github.com/CLUEbenchmark/SuperCLUE) | 3.2k |
| [agenta](https://github.com/Agenta-AI/agenta) | 3.1k |
| [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) | 3.0k |
| [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) | 2.9k |
| [HumanEval](https://github.com/openai/human-eval) | 2.9k |
| [hallucination-leaderboard](https://github.com/vectara/hallucination-leaderboard) | 2.7k |
| [trulens](https://github.com/truera/trulens) | 2.7k |
| [promptbench](https://github.com/microsoft/promptbench) | 2.7k |
| [evaluate](https://github.com/huggingface/evaluate) | 2.3k |
| [langwatch](https://github.com/langwatch/langwatch) | 2.3k |
Added (updated table):

| repo | star | area | about |
|------|------|------|-------|
| [langfuse](https://github.com/langfuse/langfuse) | 14.9k | Intl. | Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23 |
| [opik](https://github.com/comet-ml/opik) | 12.5k | Intl. | Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards. |
| [ragas](https://github.com/explodinggradients/ragas) | 10.3k | Intl. | Supercharge Your LLM Application Evaluations 🚀 |
| [deepeval](https://github.com/confident-ai/deepeval) | 10.1k | Intl. | The LLM Evaluation Framework |
| [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) | 9.8k | Intl. | A framework for few-shot evaluation of language models. |
| [promptfoo](https://github.com/promptfoo/promptfoo) | 7.9k | Intl. | Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. |
| [phoenix](https://github.com/Arize-ai/phoenix) | 6.6k | Intl. | AI Observability & Evaluation |
| [opencompass](https://github.com/open-compass/opencompass) | 5.8k | **China** | OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets. |
| [garak](https://github.com/NVIDIA/garak) | 4.9k | Intl. | the LLM vulnerability scanner |
| [⭐chinese-llm-benchmark (ours)](https://github.com/jeinlee1991/chinese-llm-benchmark) | 4.7k | **China** | ReLE Chinese LLM capability benchmark (continuously updated): currently covers 288 models, including commercial models such as ChatGPT, GPT-5, o4-mini, Google Gemini-2.5, Claude, Zhipu GLM-Z1, ERNIE Bot, Qwen-Max, Baichuan, iFLYTEK Spark, SenseTime SenseChat, and MiniMax, as well as open-source models such as DeepSeek-R1-0528, DeepSeek-V3, Qwen3, Llama4, GLM-4.5, Gemma3, and Mistral. Provides not only a leaderboard but also a defect library of over 2 million LLM failure cases for the community to study, analyze, and use to improve models. |
| [ARC-AGI](https://github.com/fchollet/ARC-AGI) | 4.5k | Intl. | The Abstraction and Reasoning Corpus |
| [helicone](https://github.com/Helicone/helicone) | 4.3k | Intl. | 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓 |
| [AutoRAG](https://github.com/Marker-Inc-Korea/AutoRAG) | 4.2k | Intl. | AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation |
| [simple-evals](https://github.com/openai/simple-evals) | 4.0k | Intl. | This repository contains a lightweight library for evaluating language models. We are open sourcing it so we can be transparent about the accuracy numbers we're publishing alongside our latest models. |
| [SWE-bench](https://github.com/SWE-bench/SWE-bench) | 3.3k | Intl. | SWE-bench [Multimodal]: Can Language Models Resolve Real-world GitHub Issues? |
| [SuperCLUE](https://github.com/CLUEbenchmark/SuperCLUE) | 3.2k | **China** | SuperCLUE: A comprehensive benchmark for general-purpose foundation models in Chinese |
| [agenta](https://github.com/Agenta-AI/agenta) | 3.1k | Intl. | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place. |
| [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) | 3.0k | Intl. | One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks |
| [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) | 2.9k | **China** | Open-source evaluation toolkit of large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks |
| [HumanEval](https://github.com/openai/human-eval) | 2.9k | Intl. | Code for the paper "Evaluating Large Language Models Trained on Code" |
| [hallucination-leaderboard](https://github.com/vectara/hallucination-leaderboard) | 2.7k | Intl. | Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents |
| [trulens](https://github.com/truera/trulens) | 2.7k | Intl. | Evaluation and Tracking for LLM Experiments and AI Agents |
| [promptbench](https://github.com/microsoft/promptbench) | 2.7k | Intl. | A unified evaluation framework for large language models |
| [evaluate](https://github.com/huggingface/evaluate) | 2.3k | Intl. | 🤗 Evaluate: A library for easily evaluating machine learning models and datasets. |
| [langwatch](https://github.com/langwatch/langwatch) | 2.3k | Intl. | The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨ |
| [lmnr](https://github.com/lmnr-ai/lmnr) | 2.2k | Intl. | Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24. |
| [EvalAI](https://github.com/Cloud-CV/EvalAI) | 1.9k | Intl. | EvalAI is an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale. |
| [lighteval](https://github.com/huggingface/lighteval) | 1.8k | Intl. | Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends |
| [alpaca_eval](https://github.com/tatsu-lab/alpaca_eval) | 1.8k | Intl. | An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. |
| [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) | 1.7k | Intl. | CodeXGLUE, a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. |
| [agentic_security](https://github.com/msoedov/agentic_security) | 1.6k | Intl. | Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪 |
| [LLM-eval-survey](https://github.com/MLGroupJLU/LLM-eval-survey) | 1.6k | **China** | The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". |
| [MMLU](https://github.com/hendrycks/test) | 1.5k | Intl. | Measuring Massive Multitask Language Understanding (ICLR 2021) |
| [evalscope](https://github.com/modelscope/evalscope) | 1.5k | **China** | A streamlined and customizable framework for efficient large model evaluation and performance benchmarking |
| [lunary](https://github.com/lunary-ai/lunary) | 1.4k | Intl. | The production toolkit for LLMs. Observability, prompt management and evaluations. |
| [MATH](https://github.com/hendrycks/math) | 1.2k | Intl. | The MATH Dataset (NeurIPS 2021) |
| [prompty](https://github.com/microsoft/prompty) | 1.0k | Intl. | Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers. |
| [Humanity's Last Exam](https://github.com/centerforaisafety/hle) | 1.0k | Intl. | Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. |
| [judgeval](https://github.com/JudgmentLabs/judgeval) | 0.9k | Intl. | The open source post-building layer for agents. Our traces + evals power agent post-training (RL, SFT), monitoring, and regression testing. |
| [LiveBench](https://github.com/LiveBench/LiveBench) | 0.9k | Intl. | LiveBench: A Challenging, Contamination-Free LLM Benchmark |
| [uqlm](https://github.com/cvs-health/uqlm) | 0.8k | Intl. | UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection |
| [AGIEval](https://github.com/ruixiangcui/AGIEval) | 0.8k | Intl. | This repository contains information about AGIEval, data, code and output of baseline systems for the benchmark. |
| [TruthfulQA](https://github.com/sylinrl/TruthfulQA) | 0.8k | Intl. | TruthfulQA: Measuring How Models Imitate Human Falsehoods |
| [tau-bench](https://github.com/sierra-research/tau-bench) | 0.8k | Intl. | τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains |
| [CBLUE](https://github.com/CBLUEbenchmark/CBLUE) | 0.8k | **China** | [CBLUE1] CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark |
| [CMMLU](https://github.com/haonan-li/CMMLU) | 0.8k | **China** | CMMLU: Measuring massive multitask language understanding in Chinese |
| [FuzzyAI](https://github.com/cyberark/FuzzyAI) | 0.7k | Intl. | A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs. |
| [GAOKAO-Bench](https://github.com/OpenLMLab/GAOKAO-Bench) | 0.7k | **China** | GAOKAO-Bench is an evaluation framework that utilizes GAOKAO questions as a dataset to evaluate large language models. |
| [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) | 0.6k | Intl. | Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" |
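Most entries in the updated table are used either as a library in your own evaluation scripts (evaluate, ragas, deepeval, lm-evaluation-harness) or as a hosted observability/eval platform (langfuse, opik, phoenix). As a rough illustration only, not part of this commit, the sketch below scores predictions with huggingface/evaluate from the table; the labels are made up and the snippet assumes the `evaluate` package is installed.

```python
# Minimal sketch: computing a metric with huggingface/evaluate.
# Hypothetical data; assumes `pip install evaluate` (the accuracy metric
# also pulls in scikit-learn under the hood).
import evaluate

# Load a built-in metric by name ("accuracy", "exact_match", "rouge", ...).
accuracy = evaluate.load("accuracy")

predictions = [0, 1, 1, 0]   # hypothetical model outputs
references = [0, 1, 0, 0]    # hypothetical gold labels

result = accuracy.compute(predictions=predictions, references=references)
print(result)  # e.g. {'accuracy': 0.75}
```

The same `load(...)` / `compute(...)` pattern applies to other metrics in that library; the heavier frameworks in the table (lm-evaluation-harness, opencompass, deepeval) wrap this kind of per-metric scoring in task suites and model adapters.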
