Chapter 12: Agent Performance Evaluation
In previous chapters, we built the core functionality of the HelloAgents framework, implementing various agent paradigms, tool systems, memory mechanisms, and reinforcement learning training. When building agent systems, we also need to solve a core problem: How to objectively evaluate agent performance? Specifically, we need to answer the following questions:
- Does the agent possess the expected capabilities?
- How does it perform on different tasks?
- What level is it at compared to other agents?
This chapter will add a Performance Evaluation System to HelloAgents. We will deeply understand the theoretical foundation of agent evaluation and implement evaluation tools.
12.1 Agent Evaluation Fundamentals
12.1.1 Why Agent Evaluation is Needed
We now have SimpleAgent, which already possesses powerful reasoning and tool invocation capabilities. Let's look at a typical usage scenario:
This agent works, but we face a core problem: how do we objectively evaluate its performance? When we optimize prompts or swap LLM models, how do we know whether there is real improvement? Before deploying to a production environment, how do we ensure the agent's reliability? These questions all require systematic evaluation.
The core value of agent evaluation lies in providing standardized methods to measure agent capabilities. Through evaluation, we can quantify agent performance with specific numerical metrics, objectively compare the merits of different design solutions, promptly discover agent weaknesses in specific scenarios, and prove agent reliability to users.
Unlike traditional software testing, agent evaluation faces unique challenges. First is output uncertainty: the same question may have multiple correct answers, so a simple right-or-wrong judgment is difficult. Second is the diversity of evaluation criteria: different tasks require different evaluation methods; tool invocation needs function-signature checks, while Q&A tasks need semantic-similarity evaluation. Finally is the high cost of evaluation: each run requires numerous API calls, potentially costing hundreds of yuan or more.
To address these challenges, academia and industry have proposed multiple standardized Benchmarks. These benchmarks provide unified datasets, evaluation metrics, and scoring methods, enabling us to evaluate and compare different agent systems under the same standards.
12.1.2 Overview of Mainstream Evaluation Benchmarks
The agent evaluation field has produced multiple influential benchmarks. Below are some mainstream benchmarks and their metrics:
(1) Tool Invocation Capability Evaluation
Tool invocation is one of the core capabilities of agents. Agents need to understand user intent, select appropriate tools, and correctly construct function calls. Related evaluation benchmarks include:
- BFCL (Berkeley Function Calling Leaderboard)[1]: Launched by UC Berkeley, includes 1120+ test samples covering four categories (simple, multiple, parallel, irrelevance). It uses an AST matching algorithm for evaluation; the dataset is moderately sized and the community is active.
- ToolBench[2]: Launched by Tsinghua University, includes 16000+ real API call scenarios, covering complex tool usage scenarios in the real world.
- API-Bank[3]: Launched by Alibaba DAMO Academy, includes 53 commonly used API tools, focuses on evaluating agent understanding and invocation of API documentation.
(2) General Capability Evaluation
Evaluates agent comprehensive performance in real-world tasks, including multi-step reasoning, knowledge application, multimodal understanding, etc.:
- GAIA (General AI Assistants)[4]: Jointly launched by Meta AI and Hugging Face, includes 466 real-world problems divided into Level 1/2/3 difficulty levels. It evaluates multi-step reasoning, tool use, file processing, and web browsing, and uses a Quasi Exact Match algorithm; its tasks are realistic and comprehensive.
- AgentBench[5]: Launched by Tsinghua University, includes 8 tasks in different domains, comprehensively evaluates agent general capabilities.
- WebArena[6]: Launched by CMU, evaluates agent task completion and web interaction capabilities in real web environments.
(3) Multi-Agent Collaboration Evaluation
Evaluates the ability of multiple agents to work collaboratively:
- ChatEval[7]: Evaluates quality of multi-agent dialogue systems.
- SOTOPIA[8]: Evaluates agent interaction capabilities in social scenarios.
- Custom Collaboration Scenarios: Evaluation tasks designed according to specific application scenarios.
(4) Common Evaluation Metrics
Different benchmarks use different evaluation metrics, common ones include:
- Accuracy Metrics: Accuracy, Exact Match, F1 Score, used to measure answer correctness.
- Efficiency Metrics: Response Time, Token Usage, used to measure execution efficiency.
- Robustness Metrics: Error Rate, Failure Recovery, used to measure fault tolerance.
- Collaboration Metrics: Communication Efficiency, Task Completion, used to measure collaboration effectiveness.
12.1.3 HelloAgents Evaluation System Design
Considering learning curve and practicality, this chapter will focus on the following evaluation scenarios:
- BFCL: Evaluate tool invocation capability
  - Selection rationale: moderate dataset size, clear evaluation metrics, active community
  - Applicable scenarios: evaluating agent function-call accuracy
- GAIA: Evaluate general AI assistant capability
  - Selection rationale: realistic tasks, graded difficulty, strong comprehensiveness
  - Applicable scenarios: evaluating an agent's comprehensive problem-solving capability
- Data Generation Quality Evaluation: Evaluate LLM-generated data quality
  - Selection rationale: a complete, end-to-end demonstration of using an agent to both create and evaluate data
  - Applicable scenarios: evaluating the quality of generated training and test data
  - Evaluation methods: LLM Judge, Win Rate, manual verification
Through these three evaluation scenarios, we will build a complete evaluation system. Figure 12.1 shows our evaluation system construction approach.
Figure 12.1 HelloAgents Evaluation System Architecture
12.1.4 Chapter Learning Objectives and Quick Experience
Let's first look at the learning content of Chapter 12:
For this chapter's content, the learning objective is to master the ability to apply evaluation tools. Let's first prepare the development environment:
In the following sections, we will examine each evaluation method and its usage in detail.
12.2 BFCL: Tool Invocation Capability Evaluation
12.2.1 BFCL Benchmark Introduction
BFCL (Berkeley Function Calling Leaderboard) is a function calling capability evaluation benchmark launched by UC Berkeley[1]. In agent systems, tool calling is one of the core capabilities. Agents need to complete the following tasks:
- Understand Task Requirements: Extract key information from user's natural language description
- Select Appropriate Tools: Choose the most suitable tool from available tool set
- Construct Function Calls: Correctly fill in function name and parameters
- Handle Complex Scenarios: Support advanced scenarios like multi-function calls, parallel calls
The BFCL benchmark contains four evaluation categories with increasing difficulty. Starting from the most basic single function call (Simple), gradually increasing to scenarios requiring multiple function calls (Multiple), then to complex scenarios requiring parallel calls of multiple functions (Parallel), and finally to scenarios requiring judgment of whether function calls are needed (Irrelevance). These four categories cover various tool calling scenarios that agents may encounter in practical applications, as shown in Table 12.1:
Table 12.1 Four Evaluation Categories in BFCL Benchmark
The BFCL evaluation process follows standard benchmark testing procedures: first load dataset and select evaluation category, then run agent to obtain prediction results, next parse prediction results into Abstract Syntax Tree (AST), and finally judge whether predictions are correct through AST matching algorithm. The entire process traverses all test samples, ultimately calculating evaluation metrics like accuracy and generating evaluation reports. The complete evaluation process is shown in Figure 12.2:
Figure 12.2 BFCL Evaluation Process Diagram
(1) BFCL Dataset Structure
The BFCL dataset uses JSON format, with each test sample containing the following fields:
Key Field Descriptions:
- question: User's natural language request
- function: List of available functions (including function signatures and descriptions)
- ground_truth: Standard answer (expected function call)
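The fields can be illustrated with a simplified, hypothetical sample (the real dataset's field layout varies by category and version):

```json
{
  "id": "simple_0",
  "question": "What is the weather like in Beijing today?",
  "function": [
    {
      "name": "get_weather",
      "description": "Query current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
      }
    }
  ],
  "ground_truth": "get_weather(city='Beijing')"
}
```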
(2) AST Matching Explanation
BFCL uses AST Matching (Abstract Syntax Tree Matching) as its core evaluation algorithm, rather than simple string matching. The core idea is to parse function calls into syntax trees and then compare tree structure and node values.
Given predicted function call $P$ and standard answer $G$, the AST matching function is defined as:
$$ \text{AST\_Match}(P, G) = \begin{cases} 1 & \text{if } \text{AST}(P) \equiv \text{AST}(G) \\ 0 & \text{otherwise} \end{cases} $$
Where $\text{AST}(x)$ represents parsing function call into abstract syntax tree, $\equiv$ represents syntax tree equivalence.
Two syntax trees are equivalent if they satisfy three core conditions: the function names are completely identical (exact match), the parameter key-value pair sets are equal (ignoring order), and each parameter value is semantically equivalent (e.g., 2+3 is equivalent to 5). In the concrete matching process:
- Function name matching requires exact string equality; for example, get_weather and get_temperature are considered different functions.
- Parameter matching uses the AST for intelligent comparison, allowing different parameter orders (f(a=1, b=2) is equivalent to f(b=2, a=1)), equivalent expressions (f(x=2+3) is equivalent to f(x=5)), and different string representations (f(s="hello") is equivalent to f(s='hello')).
- For multi-function call scenarios, the matching algorithm requires the same number of function calls and each call must match, but call order can differ (set matching).
AST Matching Examples:
(3) BFCL Evaluation Metrics
BFCL uses the following metrics to evaluate agent performance:
1. Accuracy
Accuracy is the most core metric, defined as the proportion of samples with successful AST matching:
$$ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \text{AST\_Match}(P_i, G_i) $$
Where:
- $N$ is the total number of samples
- $P_i$ is the prediction result of the $i$-th sample
- $G_i$ is the standard answer of the $i$-th sample
- $\text{AST\_Match}(P_i, G_i) \in \{0, 1\}$ is the AST matching function
2. AST Match Rate
Same as accuracy, emphasizing use of AST matching algorithm:
$$ \text{AST Match Rate} = \text{Accuracy} $$
3. Category-wise Accuracy
For each category $c \in \{\text{simple}, \text{multiple}, \text{parallel}, \ldots\}$, calculate the accuracy for that category:
$$ \text{Accuracy}_c = \frac{1}{|D_c|} \sum_{i \in D_c} \text{AST\_Match}(P_i, G_i) $$
Where $D_c$ is the sample set of category $c$, $|D_c|$ is the number of samples in that category.
4. Weighted Accuracy
Considering difficulty weights of different categories:
$$ \text{Weighted Accuracy} = \sum_{c} w_c \cdot \text{Accuracy}_c $$
Where $w_c$ is the weight of category $c$, satisfying $\sum_c w_c = 1$.
5. Error Rate
Proportion of samples that failed to correctly call functions:
$$ \text{Error Rate} = 1 - \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \left(1 - \text{AST\_Match}(P_i, G_i)\right) $$
Metric Interpretation:
- Accuracy = 1.0: All samples are completely correct
- Accuracy = 0.8: 80% of samples correct, 20% of samples incorrect
- Accuracy = 0.0: All samples are incorrect
Category Accuracy Example:
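As a sketch, per-category accuracy can be computed from (category, match) pairs like this (the sample results below are hypothetical):

```python
from collections import defaultdict

def category_accuracy(results):
    """Compute per-category accuracy from (category, matched) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for category, matched in results:
        totals[category] += 1
        hits[category] += int(matched)
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical results: simple 4/5 correct, multiple 2/4 correct
results = ([("simple", True)] * 4 + [("simple", False)]
           + [("multiple", True)] * 2 + [("multiple", False)] * 2)
print(category_accuracy(results))  # {'simple': 0.8, 'multiple': 0.5}
```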
(4) BFCL Official Evaluation Tool
BFCL provides official CLI tool for evaluation:
Advantages of using the official evaluation tool: it uses the official AST matching algorithm, evaluation results are completely consistent with the leaderboard, supports all BFCL v4 categories, and can automatically generate detailed evaluation reports.
12.2.2 Obtaining BFCL Dataset
The BFCL dataset can be obtained through the following methods:
Method 1: Clone from Official GitHub Repository (Recommended)
This is the most reliable method, obtaining complete dataset and ground truth:
Reasons for recommending this method: it contains complete ground truth (standard answers), data format is completely consistent with official evaluation tool, can directly use official evaluation scripts, and supports BFCL v4 latest version.
Method 2: Load Official Data Using HelloAgents
After cloning repository, load data using HelloAgents:
The working principle of this loader is: first load test data from bfcl_eval/data/, then load ground truth from bfcl_eval/data/possible_answer/, next automatically merge test data and ground truth, and finally preserve original BFCL data format. BFCL v4 dataset categories can be viewed in Table 12.2.
Table 12.2 BFCL v4 Dataset Categories
You can also view available categories through code:
12.2.3 Implementing BFCL Evaluation in HelloAgents
Now let's see how to implement BFCL evaluation in the HelloAgents framework. We provide three usage methods:
Method 1: Using BFCLEvaluationTool (Recommended)
This is the simplest method, completing evaluation, report generation, and official evaluation with one line of code:
Run Output:
Auto-generated Markdown Report:
After evaluation completes, a detailed Markdown report is automatically generated, including:
Method 2: Using One-Click Evaluation Script
Suitable for quick command-line evaluation. In this chapter's accompanying code examples, we provide 04_run_bfcl_evaluation.py, supporting direct command-line evaluation:
The script supports three parameters: --category specifies evaluation category (default simple_python), --samples specifies number of evaluation samples (default 5, 0 means all), --model-name specifies model name for BFCL official evaluation (default Qwen/Qwen3-8B).
Method 3: Directly Using Dataset and Evaluator
Suitable for scenarios requiring custom evaluation process:
Through these three methods, we can choose appropriate evaluation methods based on different needs. If you just want to quickly understand agent performance, using BFCLEvaluationTool's one-click evaluation is most convenient; if you need batch evaluation or integration into CI/CD pipeline, using command-line scripts is more suitable; if you need deep customization of evaluation process or integration into your own system, directly using Dataset and Evaluator provides maximum flexibility.
12.2.4 BFCL Official Evaluation Tool Integration
Previously we learned how to use HelloAgents' built-in evaluation functionality. In fact, BFCLEvaluationTool has automatically integrated BFCL official evaluation tool, allowing you to obtain authoritative, comparable evaluation results.
The entire evaluation process includes four steps: first load test data from BFCL v4 dataset, then use HelloAgents to run evaluation and obtain agent prediction results, next export results to BFCL official format (JSONL), and finally use official evaluation script to calculate final scores. This process ensures evaluation results are completely consistent with BFCL leaderboard, as shown in Figure 12.3:
Figure 12.3 HelloAgents Loading BFCL Evaluation Process
When using BFCLEvaluationTool, official evaluation runs automatically (enabled by default):
The tool automatically executes the complete evaluation process: first run HelloAgents evaluation to obtain prediction results, then export results to BFCL format and save to evaluation_results/bfcl_official/ directory, next copy result file to result/{model_name}/ directory to meet official evaluation tool requirements, then run BFCL official evaluation command to calculate scores, and finally display official evaluation results and generate Markdown format evaluation report.
Official Evaluation Output Example:
If you want to manually control the evaluation process, you can disable automatic official evaluation:
You can also manually generate reports:
12.2.5 Core Component Implementation Details
In previous sections, we learned how to use BFCL evaluation tools. Now let's dive into how HelloAgents evaluation system's core components are implemented. Understanding these implementation details not only helps you better use the evaluation system, but also allows you to customize and extend according to your own needs.
(1) BFCLDataset: Dataset Loader
BFCLDataset is responsible for loading and managing BFCL dataset:
Because the BFCL dataset lives in the official repository, the recommended approach is to clone a local copy for evaluation; the loader falls back to Hugging Face only when no local copy is found.
(2) BFCLEvaluator: Evaluation Executor
BFCLEvaluator is responsible for executing the evaluation process. Its core is the evaluate() method, which coordinates the entire evaluation process:
This evaluator's design contains three core points: first is prompt construction, needing to convert questions and function definitions in dataset into prompts understandable by agent; second is function call extraction, needing to extract function calls from agent's response and support multiple formats (JSON, code blocks, etc.); finally is AST matching, using abstract syntax tree for function call comparison, which is more accurate than simple string matching.
Let's look at the implementation of function call extraction:
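As an illustrative sketch (not the exact HelloAgents code), an extractor can try a JSON-style call first and then fall back to the custom [TOOL_CALL:name:parameters] marker format used by SimpleAgent:

```python
import json
import re

def extract_function_call(response: str):
    """Extract (function_name, arguments) from an agent response.

    Tries a JSON object like {"name": ..., "arguments": {...}} first,
    then falls back to the custom [TOOL_CALL:name:parameters] marker.
    Returns None if no function call is found.
    """
    # JSON style
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match:
        try:
            obj = json.loads(match.group())
            if isinstance(obj, dict) and "name" in obj:
                return obj["name"], obj.get("arguments", {})
        except json.JSONDecodeError:
            pass
    # Custom marker style used by SimpleAgent
    match = re.search(r"\[TOOL_CALL:(\w+):(.*?)\]", response)
    if match:
        try:
            return match.group(1), json.loads(match.group(2))
        except json.JSONDecodeError:
            return match.group(1), {}
    return None
```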
(3) BFCLMetrics: Metrics Calculator
BFCLMetrics is responsible for calculating various evaluation metrics:
AST Matching Implementation:
AST matching is the core technology of BFCL evaluation. It is more intelligent than simple string matching and can identify semantically equivalent function calls:
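A minimal sketch of such a matcher using Python's ast module (simplified: keyword arguments only; the official BFCL matcher covers many more cases, and the eval-based constant folding here is for illustration only):

```python
import ast

def _parse_call(call_str: str):
    """Parse "func(a=1, b=2)" into (function_name, {arg_name: value}).

    Simplified: keyword arguments only; each argument expression is
    evaluated so that 2+3 compares equal to 5.
    """
    node = ast.parse(call_str.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = ast.unparse(node.func)
    kwargs = {kw.arg: eval(ast.unparse(kw.value), {"__builtins__": {}})
              for kw in node.keywords}
    return name, kwargs

def ast_match(pred: str, truth: str) -> int:
    """Return 1 if the two calls are semantically equivalent, else 0."""
    try:
        return int(_parse_call(pred) == _parse_call(truth))
    except (SyntaxError, ValueError, NameError):
        return 0

assert ast_match("f(a=1, b=2)", "f(b=2, a=1)") == 1    # argument order ignored
assert ast_match("f(x=2+3)", "f(x=5)") == 1            # equivalent expression
assert ast_match('f(s="hello")', "f(s='hello')") == 1  # quote style ignored
assert ast_match("get_weather(city='SF')", "get_temperature(city='SF')") == 0
```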
(4) Tool Encapsulation: BFCLEvaluationTool
Finally, we encapsulate these components into a Tool so it can be directly called by agents:
This tool's design follows three core principles: first inherit Tool base class to follow HelloAgents' tool specification, ensuring seamless integration with framework; second perform strict parameter validation, checking required parameters and providing friendly error prompts, improving user experience; finally format results, returning JSON string for easy parsing and display. Through this modular design, we implemented an evaluation system that is both easy to use and flexible. Users can directly use high-level Tool interface to quickly complete evaluation, or dive into low-level components for customization to meet special needs.
12.2.6 Extension and Optimization Recommendations
Through previous learning, we have mastered how to use HelloAgents for BFCL evaluation. It should be noted that our current implementation is a simple reproduction based on SimpleAgent, mainly completing basic BFCL evaluation functionality. In practical applications, BFCL benchmark contains multiple difficulty levels and scenarios. To achieve higher scores on the leaderboard, further optimization and extension are needed.
(1) Limitations of Current Implementation
Our current SimpleAgent implementation mainly focuses on building the evaluation process, with room for improvement in tool calling capabilities. SimpleAgent uses custom tool calling format [TOOL_CALL:tool_name:parameters], which requires LLM to actively learn and use. In complex scenarios, performance may not match agents using native function calling. Additionally, we currently only test basic categories like simple_python. For more complex scenarios like multiple, parallel, irrelevance, targeted optimization is still needed.
(2) Directions for Improving BFCL Scores
To further improve BFCL evaluation scores, you can start from the following directions. First is optimizing agent's tool calling capability - consider using LLMs that support native function calling (like GPT-4, Claude, etc.), or improve prompts to help LLM better understand tool calling format. Second is expanding tool library - BFCL tests involve various types of functions, you can pre-implement common tool types based on test dataset characteristics to improve agent's tool coverage. Third is designing different strategies for different difficulty levels - for example, in multiple scenarios agents need to plan multi-step tool calling sequences, in parallel scenarios they need to identify tool calls that can be executed in parallel, in irrelevance scenarios they need to judge whether tool calling is truly needed.
(3) Practice Recommendations
For developers wanting to achieve better results on BFCL, the following practice strategies are recommended. First, start from simple category, ensure basic single function calls work stably - this is the foundation for subsequent optimization. Then, gradually test more complex categories like multiple, parallel, analyze failure cases, find agent's weak points. During optimization, you can refer to high-scoring models on BFCL leaderboard, learn their design ideas and optimization techniques. Meanwhile, it's recommended to use official evaluation tools for validation, ensuring optimized results are consistent with leaderboard standards.
Here are some suggestions for further processing during evaluation:
1. Progressive Evaluation
Start from small samples, gradually increase sample count:
2. Multi-Category Evaluation
Evaluate tasks of different difficulties:
3. Comparative Evaluation
Compare agents with different configurations:
If your evaluation results are good, consider submitting to BFCL official leaderboard!
Step 1: Prepare Submission Materials
- Model description document
- Evaluation result files (all categories)
- Model access method (API or open-source link)
Step 2: Submit to GitHub
Visit BFCL official repository and submit Pull Request according to instructions:
- Repository: https://github.com/ShishirPatil/gorilla
- Submission guide: refer to CONTRIBUTING.md in the repository
Step 3: Wait for Review
BFCL team will review your submission and verify result accuracy. After approval, your model will appear on the official leaderboard!
12.3 GAIA: General AI Assistant Capability Evaluation
12.3.1 GAIA Benchmark Introduction
GAIA (General AI Assistants) is an evaluation benchmark jointly launched by Meta AI and Hugging Face, focusing on evaluating AI assistants' general capabilities[4]. Unlike BFCL's focus on tool calling, GAIA evaluates agents' comprehensive performance in real-world tasks.
GAIA's design philosophy is: Real-world problems often require comprehensive application of multiple capabilities. An excellent AI assistant not only needs to call tools, but also needs to:
- Multi-step Reasoning: Decompose complex problems into multiple sub-problems
- Knowledge Application: Utilize built-in knowledge and external knowledge bases
- Multimodal Understanding: Process multiple inputs like text, images, files
- Web Browsing: Obtain latest information from the internet
- File Operations: Read and process files in various formats
(1) GAIA Dataset Structure
After understanding GAIA's evaluation philosophy, let's dive into the specific structure of GAIA dataset. GAIA contains 466 carefully designed real-world problems. These problems are divided into three difficulty levels based on complexity and required reasoning steps, from simple zero-step reasoning tasks to difficult tasks requiring multi-step complex reasoning, comprehensively covering various scenarios agents may encounter in practical applications, as shown in Table 12.3:
Table 12.3 GAIA Dataset Difficulty Level Distribution
For GAIA dataset sample examples, refer to the code snippet below:
Key Field Descriptions:
- Question: Question description
- Level: Difficulty level (1-3)
- Final answer: Standard answer (may be a number, text, or file)
- file_name/file_path: Attachment file (if any)
- Annotator Metadata: Metadata provided by the annotator (reasoning steps, required tools, etc.)
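For illustration, a hypothetical sample following the field layout above (all values invented):

```json
{
  "task_id": "0000-0000-0000",
  "Question": "How many studies are listed on page 2 of the attached PDF?",
  "Level": 1,
  "Final answer": "7",
  "file_name": "report.pdf",
  "Annotator Metadata": {
    "Steps": "1. Open the PDF. 2. Count the studies on page 2.",
    "Number of steps": "2",
    "Tools": "1. PDF reader"
  }
}
```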
(2) Quasi Exact Match Introduction
GAIA uses Quasi Exact Match evaluation algorithm, which is GAIA's officially defined evaluation standard. The core idea of this algorithm is: First normalize answers, then perform exact matching.
Given predicted answer $A_{\text{pred}}$ and standard answer $A_{\text{true}}$, the quasi exact match function is defined as:
$$ \text{Quasi\_Exact\_Match}(A_{\text{pred}}, A_{\text{true}}) = \begin{cases} 1 & \text{if } \mathcal{N}(A_{\text{pred}}) = \mathcal{N}(A_{\text{true}}) \\ 0 & \text{otherwise} \end{cases} $$
Where $\mathcal{N}(\cdot)$ is the normalization function, applying different rules based on answer type.
The normalization function applies different rules based on answer type:
- Numeric: remove comma separators (1,000 → 1000) and unit symbols ($100 → 100, 50% → 50); for example, "$1,234.56" normalizes to "1234.56".
- String: convert to lowercase ("Apple" → "apple"), remove articles ("the apple" → "apple"), collapse extra spaces ("hello  world" → "hello world"), and strip trailing punctuation ("hello." → "hello"); for example, "The United States" normalizes to "united states".
- List: split elements by comma, apply string normalization to each element, sort alphabetically, then rejoin; for example, "Paris, London, Berlin" normalizes to "berlin,london,paris".
Normalization Examples:
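A simplified sketch of such a normalization function (the official GAIA scorer differs in details):

```python
import re

def normalize_answer(ans: str) -> str:
    """Simplified GAIA-style answer normalization (illustration only)."""
    ans = ans.strip()
    # Numeric: strip thousands separators, currency and percent symbols
    numeric = ans.replace(",", "").lstrip("$").rstrip("%")
    if re.fullmatch(r"-?\d+(\.\d+)?", numeric):
        return numeric
    # List: normalize each element, then sort alphabetically
    if "," in ans:
        parts = [normalize_answer(part) for part in ans.split(",")]
        return ",".join(sorted(parts))
    # String: lowercase, drop articles, collapse spaces, strip trailing punctuation
    text = ans.lower()
    text = re.sub(r"\b(the|a|an)\b", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(".!?")

assert normalize_answer("$1,234.56") == "1234.56"
assert normalize_answer("The United States") == "united states"
assert normalize_answer("Paris, London, Berlin") == "berlin,london,paris"
```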
(3) GAIA Evaluation Metrics
GAIA uses the following metrics to evaluate agent performance:
1. Exact Match Rate
Exact match rate is GAIA's core metric, defined as the proportion of samples with successful quasi exact matching:
$$ \text{Exact Match Rate} = \frac{1}{N} \sum_{i=1}^{N} \text{Quasi\_Exact\_Match}(A_{\text{pred},i}, A_{\text{true},i}) $$
Where:
- $N$ is the total number of samples
- $A_{\text{pred},i}$ is the predicted answer of the $i$-th sample
- $A_{\text{true},i}$ is the standard answer of the $i$-th sample
- $\text{Quasi\_Exact\_Match}(\cdot, \cdot) \in \{0, 1\}$ is the quasi exact match function
2. Level-wise Accuracy
For each difficulty level $\ell \in \{1, 2, 3\}$, calculate the accuracy for that level:
$$ \text{Accuracy}_\ell = \frac{1}{|D_\ell|} \sum_{i \in D_\ell} \text{Quasi\_Exact\_Match}(A_{\text{pred},i}, A_{\text{true},i}) $$
Where $D_\ell$ is the sample set of difficulty level $\ell$, $|D_\ell|$ is the number of samples at that level.
3. Difficulty Progression Drop Rate
Measures agent's performance degradation as difficulty increases:
$$ \text{Drop Rate}_{\ell \to \ell+1} = \frac{\text{Accuracy}_\ell - \text{Accuracy}_{\ell+1}}{\text{Accuracy}_\ell} $$
- $\text{Drop Rate}_{1 \to 2}$: Drop rate from Level 1 to Level 2
- $\text{Drop Rate}_{2 \to 3}$: Drop rate from Level 2 to Level 3
4. Average Reasoning Steps
Evaluates average number of steps required by agent to complete tasks:
$$ \text{Avg Steps} = \frac{1}{N_{\text{correct}}} \sum_{i \in \text{Correct}} \text{steps}_i $$
Where $N_{\text{correct}}$ is the number of correctly answered samples, $\text{steps}_i$ is the number of reasoning steps for the $i$-th sample.
Metric Interpretation:
- Exact Match Rate = 1.0: All samples are completely correct
- Exact Match Rate = 0.5: 50% of samples correct, 50% of samples incorrect
- Drop Rate = 0.3: Difficulty increase causes 30% accuracy drop
- Drop Rate = 0.0: Difficulty increase doesn't affect accuracy (ideal case)
Evaluation Example:
Suppose we evaluated 10 samples, results can be referenced in Table 12.4:
Table 12.4 Example Evaluation Results for 10 GAIA Samples
To calculate metrics for this case, refer to the Python script below:
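A minimal version of such a script (the per-sample results are hypothetical, chosen to be consistent with the analysis that follows: 3 Level-1, 3 Level-2, and 4 Level-3 samples):

```python
# Hypothetical (level, matched) results for 10 samples
results = [(1, 1), (1, 1), (1, 1),          # Level 1: 3/3 correct
           (2, 1), (2, 1), (2, 0),          # Level 2: 2/3 correct
           (3, 1), (3, 1), (3, 0), (3, 0)]  # Level 3: 2/4 correct

exact_match_rate = sum(ok for _, ok in results) / len(results)

accuracy = {lvl: sum(ok for l, ok in results if l == lvl)
                 / sum(1 for l, _ in results if l == lvl)
            for lvl in (1, 2, 3)}

drop_1_to_2 = (accuracy[1] - accuracy[2]) / accuracy[1]
drop_2_to_3 = (accuracy[2] - accuracy[3]) / accuracy[2]

print(f"Exact Match Rate: {exact_match_rate:.2f}")   # 0.70
print(f"Level accuracy:   {accuracy}")
print(f"Drop 1->2: {drop_1_to_2:.0%}, Drop 2->3: {drop_2_to_3:.0%}")
```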
Result Analysis:
- Overall Performance: 70% exact match rate, good performance
- Difficulty Sensitivity: 33% drop from Level 1 to Level 2, indicating significant degradation in medium difficulty tasks
- Capability Boundary: Level 3 accuracy is 50%, indicating room for improvement in complex tasks
The larger the drop rate, the more obvious the agent's capability degradation when handling complex tasks.
(4) GAIA Official System Prompt
GAIA requires using specific system prompt to ensure model output conforms to evaluation format:
GAIA has strict requirements for answer format: answers must be given in FINAL ANSWER: [answer] format; for numeric answers, don't use comma separators and unit symbols; for string answers, don't use articles and abbreviations; for list answers, use comma separation and arrange alphabetically.
12.3.2 Obtaining GAIA Dataset
Important Note: GAIA is a Gated Dataset, requiring prior application for access permission on HuggingFace.
Step 1: Apply for Access Permission
- Visit https://huggingface.co/datasets/gaia-benchmark/GAIA
- Click "Request access" button
- Fill out application form (usually approved within seconds)
- Get your HuggingFace Token: https://huggingface.co/settings/tokens
Step 2: Configure Environment Variables
Add your HuggingFace Token to .env file:
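For example (placeholder value; HF_TOKEN is the variable name read by recent versions of huggingface_hub — adjust if your configuration uses a different name):

```
# .env
HF_TOKEN=hf_your_token_here
```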
Method 1: Automatic Download Using HelloAgents (Recommended)
HelloAgents automatically handles GAIA dataset download and caching:
Working Principle:
- On first run, uses snapshot_download to download the entire dataset to ./data/gaia/
- The dataset contains 114 files (questions, images, PDFs, etc.)
- Subsequent runs load directly from the local copy, which is very fast
Dataset Directory Structure:
Method 2: Manual Download
If you want to manually download the dataset:
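A sketch using huggingface_hub's snapshot_download (the wrapper name download_gaia is ours; running it requires an approved token for the gated dataset):

```python
from typing import Optional

def download_gaia(local_dir: str = "./data/gaia", token: Optional[str] = None) -> str:
    """Download the gated GAIA dataset snapshot.

    Thin wrapper around huggingface_hub's snapshot_download; returns
    the local path of the downloaded snapshot.
    """
    from huggingface_hub import snapshot_download  # deferred import
    return snapshot_download(
        repo_id="gaia-benchmark/GAIA",
        repo_type="dataset",
        local_dir=local_dir,
        token=token,
    )
```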
View Dataset Statistics:
12.3.3 Implementing GAIA Evaluation in HelloAgents
Similar to BFCL, we provide two evaluation methods; Method 1 is recommended.
Method 1: One-Click Evaluation Using GAIAEvaluationTool
This is the simplest method, automatically completing dataset download, evaluation execution, result export, and report generation:
Run Results:
After evaluation completes, three types of files are generated automatically:
- GAIA format result file (evaluation_results/gaia_official/gaia_level1_result_*.jsonl): JSONL format (one JSON object per line), directly usable for submission to the GAIA leaderboard.
- Submission guide file (evaluation_results/gaia_official/SUBMISSION_GUIDE_*.md): detailed submission steps, result file format description, and notes.
- Evaluation report (evaluation_reports/gaia_report_*.md): evaluation result summary, detailed metrics, sample details, and visualization charts.
Note: If you find generated evaluation results unsatisfactory (e.g., low accuracy), this is normal. Although Level 1 is one-step reasoning tasks, agents still need tool calling capabilities (like search engine, calculator, etc.) to correctly answer questions. Our current SimpleAgent is mainly used to demonstrate evaluation process, with room for improvement in tool calling capabilities.
Method 2: Using Dataset + Evaluator (Flexible Customization)
If you need more fine-grained control, you can directly use low-level components:
Generated evaluation report (gaia_report_*.md) can reference the file below:
Generated GAIA Format Results (gaia_level1_result_*.jsonl):
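An illustrative excerpt (all values invented; the leaderboard's JSONL format carries task_id, model_answer, and a reasoning_trace per line — check the generated submission guide for the authoritative format):

```json
{"task_id": "task-001", "model_answer": "7", "reasoning_trace": "Opened report.pdf and counted the studies on page 2."}
{"task_id": "task-002", "model_answer": "paris", "reasoning_trace": "The capital of France is Paris."}
```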
12.3.4 Submitting Results to GAIA Official Leaderboard
After running evaluation using GAIAEvaluationTool, files required for submission and detailed submission instructions are generated in evaluation_results/gaia_official/ directory.
- GAIA Format Result File: gaia_level1_result_*.jsonl
- Submission Guide File: SUBMISSION_GUIDE_*.md
Open the automatically generated SUBMISSION_GUIDE_*.md file, which contains complete submission guide:
Specifically, open browser and visit:
As shown in Figure 12.4, fill in information in submission form:
Figure 12.4 GAIA Leaderboard Submission Form
Before submission, you can manually check the generated JSONL file:
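For example, with a small sanity-check script (the field names follow the submission format described above; adjust if your file differs):

```python
import json

def check_gaia_jsonl(path: str) -> int:
    """Verify each line is a JSON object carrying the expected submission
    fields; returns the number of records, raising on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)
            for key in ("task_id", "model_answer"):
                if key not in record:
                    raise ValueError(f"line {lineno}: missing field '{key}'")
            count += 1
    return count
```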
12.3.5 Core Component Implementation Details
GAIA evaluation system implementation is similar to BFCL, but has some special designs for general capability evaluation.
(1) GAIADataset: Multimodal Data Loader
The special feature of GAIA dataset is that it contains multimodal data (text, files, images, etc.):
(2) GAIAEvaluator: Implementing GAIA Official Evaluation Algorithm
GAIA evaluation uses the Quasi Exact Match algorithm, which requires special answer normalization and matching logic:
GAIA uses specific normalization rules to handle different types of answers:
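A minimal sketch of such normalization and matching, assuming rules along the lines of the GAIA paper (numeric answers compared as numbers after stripping commas and units, strings lowercased with punctuation removed); the actual evaluator may apply additional rules for list-valued answers:

```python
import re
import string

def normalize_number(text: str):
    """Try to parse a numeric answer, stripping commas and common units."""
    cleaned = text.replace(",", "").replace("$", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

def normalize_string(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def quasi_exact_match(prediction: str, reference: str) -> bool:
    """Compare numerically when both sides parse as numbers, else as normalized strings."""
    pred_num, ref_num = normalize_number(prediction), normalize_number(reference)
    if pred_num is not None and ref_num is not None:
        return abs(pred_num - ref_num) < 1e-6
    return normalize_string(prediction) == normalize_string(reference)

print(quasi_exact_match("42.0", "42"))       # True
print(quasi_exact_match("Paris ", "paris"))  # True
```

Note that this deliberately stays "quasi" exact: "42.0" matches "42", but "forty-two" does not, which is one of the limitations discussed in the exercises.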
GAIA requires the model output to follow the format FINAL ANSWER: [answer]:
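Extracting the answer from that format can be sketched with a regular expression; the helper name is illustrative:

```python
import re

def extract_final_answer(model_output: str) -> str:
    """Pull the answer after the last 'FINAL ANSWER:' marker (case-insensitive).
    Falls back to the whole output if the marker is missing."""
    matches = re.findall(r"FINAL ANSWER:\s*(.+)", model_output, flags=re.IGNORECASE)
    return matches[-1].strip() if matches else model_output.strip()

print(extract_final_answer("Reasoning...\nFINAL ANSWER: 42"))  # 42
```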
After evaluation completes, results can be exported to the JSONL format required by the GAIA leaderboard:
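A minimal export sketch; the `task_id` / `model_answer` field names follow the GAIA submission format, but double-check them against the generated SUBMISSION_GUIDE_*.md before submitting:

```python
import json

def export_gaia_jsonl(results, path):
    """Write one JSON object per line, as the GAIA leaderboard expects.
    `results` is assumed to be a list of dicts with 'task_id' and 'answer'."""
    with open(path, "w", encoding="utf-8") as f:
        for r in results:
            record = {"task_id": r["task_id"], "model_answer": r["answer"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```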
(3) GAIAEvaluationTool: One-Click Evaluation Tool
GAIAEvaluationTool encapsulates complete evaluation process, providing one-click evaluation functionality:
GAIAEvaluationTool automatically generates evaluation report:
12.4 Data Generation Quality Evaluation
In AI system development, high-quality training data is the foundation of system performance. This section introduces how to use the HelloAgents framework to evaluate the quality of generated data, using AIME (American Invitational Mathematics Examination)[9] style mathematics problem generation as an example.
AIME is a medium-difficulty mathematics competition administered by the Mathematical Association of America (MAA), positioned between the AMC 10/12 and the USA Mathematical Olympiad (USAMO). AIME problems have distinctive characteristics: each answer is an integer between 0 and 999; problems span algebra, geometry, number theory, combinatorics, probability, and other domains; they require multi-step reasoning without advanced theory; and they have moderate difficulty (comparable to AIME problems 6-9). These characteristics make AIME problems an ideal benchmark for evaluating the quality of generated mathematics problems: the unified answer format enables automated evaluation, and the moderate difficulty suits large-scale generation. We use the TianHongZXY/aime-1983-2025 dataset on HuggingFace as a reference; it contains over 900 real AIME problems from 1983 to 2025, providing rich reference samples for generation and evaluation.
12.4.1 Evaluation Methods Overview
In data generation quality evaluation, we adopt three complementary methods: LLM Judge, Win Rate, and Manual Verification. There are two reasons for this choice. First, methodologically, these are the automated evaluation schemes commonly used in the agent field and in mainstream academic practice, so they have broad recognition and a solid practical foundation. Second, they fit our scenario naturally: LLM Judge and Win Rate evaluate problem generation quality (multi-dimensional evaluation of correctness, clarity, difficulty matching, etc.), while Manual Verification evaluates answer generation quality (human experts verify answer accuracy). This division of labor is both reasonable and easy to understand.
Below we introduce the specific implementation of these three evaluation methods in detail. The implementation flow of the entire case is shown in Figure 12.5:
Figure 12.5 Data Generation Quality Evaluation Flow Diagram
(1) LLM Judge Evaluation
Design Motivation: In data generation quality evaluation, we need to quickly and consistently evaluate the quality of a large number of generated problems. Traditional manual evaluation, although accurate, is costly and inefficient, making it difficult to meet the demands of large-scale data generation. LLM Judge, by using large language models as judges, can automatically evaluate the quality of generated data from multiple dimensions, not only greatly improving evaluation efficiency but also maintaining consistency in evaluation standards. More importantly, LLM Judge can provide detailed scoring reasons and improvement suggestions, helping us understand the strengths and weaknesses of generated data and providing direction for subsequent optimization.
In our implementation, LLM Judge evaluates AIME problem quality from four key dimensions:
Table 12.5 LLM Judge Evaluation Dimensions for AIME Problems
After obtaining scores from four dimensions, we need to aggregate these scores into overall evaluation metrics. We define three key metrics to measure the quality level of generated problems:
Evaluation Metrics:
1. Average Score: Calculate the average score of all problems across four dimensions, reflecting the overall quality level of generated problems. $$ \text{Average Score} = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{d=1}^{4} S_{i,d}}{4} $$
2. Pass Rate: Count the proportion of problems with average score of 3.5 or above, reflecting basic quality assurance of generated problems.
$$ \text{Pass Rate} = \frac{|\{i : \text{Score}_i \geq 3.5\}|}{N} $$
3. Excellent Rate: Count the proportion of problems with average score of 4.5 or above, reflecting the high-quality proportion of generated problems.
$$ \text{Excellent Rate} = \frac{|\{i : \text{Score}_i \geq 4.5\}|}{N} $$
Where:
- $N$ is the total number of problems evaluated
- $S_{i,d}$ is the score of the $i$-th problem on the $d$-th dimension (1-5 points)
- $\text{Score}_i$ is the average score of the $i$-th problem (average of four dimension scores)
These three metrics reflect generation quality from different angles: the average score gives the overall level, the pass rate ensures basic quality, and the excellent rate measures high-quality output capability.
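The three metrics defined above can be computed directly from the per-dimension scores; a small sketch (the thresholds follow the definitions above):

```python
def judge_metrics(scores, pass_threshold=3.5, excellent_threshold=4.5):
    """scores: list of per-problem dimension scores, e.g. [[5, 4, 4, 5], ...].
    Returns average score, pass rate, and excellent rate."""
    per_problem = [sum(dims) / len(dims) for dims in scores]
    n = len(per_problem)
    return {
        "average_score": sum(per_problem) / n,
        "pass_rate": sum(s >= pass_threshold for s in per_problem) / n,
        "excellent_rate": sum(s >= excellent_threshold for s in per_problem) / n,
    }

m = judge_metrics([[5, 5, 4, 4], [3, 3, 4, 2], [5, 5, 5, 4]])
print(m)  # average_score ≈ 4.08, pass_rate ≈ 0.67, excellent_rate ≈ 0.67
```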
(2) Win Rate Evaluation
Design Motivation: Although LLM Judge provides multi-dimensional absolute scoring, we also need a relative metric that measures the quality gap between generated and real problems. Win Rate evaluation uses pairwise comparison: the LLM directly judges which of a generated problem and a real problem is better. This relative comparison matches human judgment habits better than absolute scoring and more easily reveals the relative strengths and weaknesses of generated problems. Ideally, if the quality of generated problems is close to that of real problems, the Win Rate should be around 50%. The metric is simple and intuitive, allowing a quick judgment of the overall quality of the generation system.
In our implementation, Win Rate evaluation is conducted through the flow shown in Figure 12.6:
Figure 12.6 Win Rate Evaluation Flow Diagram
In pairwise comparison evaluation, each comparison produces three possible results: generated problem wins (Win), real problem wins (Loss), or tie (Tie). We evaluate the quality of generated problems by counting the proportions of these three results:
Evaluation Metrics:
1. Win Rate: Proportion of generated problems judged as better, reflecting advantages of generated problems relative to real problems.
$$ \text{Win Rate} = \frac{\text{Wins}}{\text{Total Comparisons}} $$
2. Loss Rate: Proportion of real problems judged as better, reflecting disadvantages of generated problems relative to real problems.
$$ \text{Loss Rate} = \frac{\text{Losses}}{\text{Total Comparisons}} $$
3. Tie Rate: Proportion judged as equivalent quality, reflecting similarity between generated problems and real problems.
$$ \text{Tie Rate} = \frac{\text{Ties}}{\text{Total Comparisons}} $$
Where Total Comparisons is the total number of comparisons, and Wins, Losses, and Ties are the numbers of generated-problem wins, losses, and ties respectively. The three metrics satisfy: Win Rate + Loss Rate + Tie Rate = 100%.
Ideal Result: Win Rate ≈ 50% (indicating generation quality is close to real problems). If Win Rate is significantly lower than 50%, it indicates generated problem quality is inferior to real problems and generation strategy needs optimization; if Win Rate is significantly higher than 50%, it may indicate generated problems surpass real problems in some aspects, or there is bias in evaluation standards.
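Counting the three outcomes is straightforward; the sketch below assumes the judge's verdicts have already been collected as `"win"` / `"loss"` / `"tie"` strings. (In practice, each pair is often judged twice with the presentation order swapped to control position bias.)

```python
from collections import Counter

def win_rate_metrics(outcomes):
    """outcomes: list of 'win' / 'loss' / 'tie' verdicts from pairwise judging,
    from the generated problem's point of view."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "win_rate": counts["win"] / total,
        "loss_rate": counts["loss"] / total,
        "tie_rate": counts["tie"] / total,
    }

print(win_rate_metrics(["win", "loss", "tie", "win"]))
# {'win_rate': 0.5, 'loss_rate': 0.25, 'tie_rate': 0.25}
```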
(3) Manual Verification
Design Motivation: Although LLM Judge and Win Rate can automatically evaluate problem quality, for mathematical problems that require strict logical reasoning, manual verification is still indispensable. Especially when evaluating answer generation quality, human experts are needed to verify answer accuracy, solution step completeness, and mathematical reasoning rigor. Additionally, manual verification can discover issues that automated evaluation might miss, such as subjective factors like problem innovation and interest. To improve manual verification efficiency and experience, we developed a Gradio-based Web interface, allowing verifiers to conveniently browse problems, score, annotate status, and add comments, greatly lowering the barrier to manual verification.
In our implementation, manual verification is conducted through the following steps:
- Read the problem, answer, and solution
- Score each dimension (1-5 points): correctness, clarity, difficulty matching, completeness
- Annotate the status:
  - ✅ approved
  - ❌ rejected
  - 🔄 needs_revision
- Add comments
12.4.2 System Architecture
Data generation and evaluation system adopts modular design:
The system contains four core components. First is AIMEGenerator (the problem generator), which uses the HelloAgents framework to generate AIME-style problems, supports batch generation and progress saving, and automatically handles API rate limits. Second is LLMJudgeTool (the LLM Judge evaluation tool), which provides four-dimension quality evaluation and automatically generates JSON results and Markdown reports. Third is WinRateTool (the Win Rate evaluation tool), which calculates win, loss, and tie rates through pairwise comparison. Finally, HumanVerificationUI (the manual verification interface) is a Gradio-based Web interface supporting scoring and status annotation.
12.4.3 AIME Problem Generator Implementation
Our goal is to generate a dataset in a similar style, so we randomly select reference examples from the 900+ real AIME problems (1983-2025).
Generation prompt design (English):
We generate problems in English for four reasons: consistency with real AIME problems (AIME is an English competition, so English problems are more faithful); evaluation fairness (LLM Judge comparisons are fairer when both sides are in English); internationalization (English problems can be used more widely); and avoiding translation issues (no concerns about the accuracy of Chinese-English translation).
Batch generation implementation:
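As a hedged sketch of the batch loop (the `generate_one` callable and checkpoint filename are placeholders for the actual LLM call and output path), combining the rate-limit delay and progress saving the generator supports:

```python
import json
import time
from pathlib import Path

def generate_batch(generate_one, n, checkpoint="generated_problems.json", delay=2.5):
    """Generate n problems one at a time. `generate_one` stands in for the
    actual LLM call; the checkpoint file lets an interrupted run resume."""
    path = Path(checkpoint)
    problems = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    while len(problems) < n:
        problems.append(generate_one())  # one LLM API call per problem
        # Save progress after every problem so an interruption loses nothing
        path.write_text(json.dumps(problems, ensure_ascii=False, indent=2),
                        encoding="utf-8")
        time.sleep(delay)  # 2-3 s between calls to stay under API rate limits
    return problems
```

Re-running the function with the same checkpoint path resumes from where the previous run stopped.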
LaTeX mathematical formula support:
Generated AIME problems contain LaTeX mathematical formulas (such as $\frac{a}{b}$ and $\sqrt{x}$), which require special handling during JSON parsing: backslashes in LaTeX commands (such as \frac and \sqrt) are invalid escape sequences in JSON and cause parsing to fail. The fix is to use a regular expression to replace each unescaped backslash with a double backslash, making the string legal JSON.
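A sketch of this repair with one possible regex; note that it also doubles intentional escapes such as \n, which is acceptable when the string values are pure LaTeX text:

```python
import json
import re

def repair_latex_json(raw: str) -> str:
    r"""Double every backslash not already followed by a backslash or quote,
    so LaTeX commands like \frac and \sqrt survive json.loads."""
    return re.sub(r'\\(?![\\"])', r'\\\\', raw)

raw = r'{"problem": "Evaluate $\frac{1}{2}+\sqrt{9}$"}'
print(json.loads(repair_latex_json(raw))["problem"])
# Evaluate $\frac{1}{2}+\sqrt{9}$
```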
12.4.4 LLM Judge Evaluation Tool
The LLM Judge tool uses an LLM as judge to conduct multi-dimensional evaluation of generated problems.
Evaluation Prompt:
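The wording below is an illustrative template, not the framework's actual prompt; it shows the typical shape of a judge prompt that requests per-dimension scores as parseable JSON:

```python
# Hypothetical judge prompt template; dimension names match Table 12.5.
JUDGE_PROMPT = """You are an expert AIME problem reviewer.
Rate the following problem on four dimensions, each from 1 to 5:
- correctness: is the problem mathematically well-posed?
- clarity: is the statement unambiguous?
- difficulty: does it match AIME difficulty (problems 6-9)?
- completeness: can it be solved from the given information alone?

Problem:
{problem}

Respond with JSON only:
{{"correctness": n, "clarity": n, "difficulty": n, "completeness": n, "reason": "..."}}
"""

print(JUDGE_PROMPT.format(problem="Find the remainder when 2^100 is divided by 1000."))
```

Requesting JSON-only output keeps the downstream score aggregation a simple `json.loads` call.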
Evaluation Report Example:
12.4.5 Win Rate Evaluation Tool
Win Rate tool evaluates the quality of generated data relative to real problems through pairwise comparison.
AIDataset is responsible for loading generated data and AIME real problem data, supporting two data types:
We use only the AIME 2025 dataset for four reasons: data timeliness (2025 is the most recent AIME competition); simpler maintenance (only one dataset to maintain, so the code is more concise); a unified format (JSONL, with field names normalized to lowercase); and sufficient representativeness (30 problems are enough to evaluate generation quality).
Comparison Prompt:
Evaluation Report Example:
12.4.6 Manual Verification Interface
We use Gradio to create a Web interface that supports manual verification of generated problems.
Usage Method:
The final result is shown in Figure 12.7. For verifying problem correctness, manual review works best:
Figure 12.7 AIME Problem Manual Verification Page
Verification Process:
- Open verification interface in browser
- Read problem, answer, solution
- Score from 4 dimensions (1-5 points)
- Select verification status (approved/rejected/needs_revision)
- Add comments (optional)
- Click "Submit Verification"
- View next problem
Verification Result Saving:
Verification results are automatically saved to <data_path>_verifications.json:
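The record layout below is a hypothetical example of what such a file might contain; the actual field names may differ:

```python
import json

# Hypothetical verification record; field names are illustrative.
verification = {
    "problem_id": 7,
    "scores": {"correctness": 5, "clarity": 4, "difficulty": 4, "completeness": 5},
    "status": "approved",  # approved / rejected / needs_revision
    "comment": "Solution steps are complete.",
}

with open("generated_problems_verifications.json", "w", encoding="utf-8") as f:
    json.dump([verification], f, ensure_ascii=False, indent=2)
```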
12.4.7 Complete Evaluation Flow
Integrate all evaluation methods into a complete flow.
Run Method:
Output Example:
12.4.8 Comprehensive Evaluation Report
The system automatically generates comprehensive evaluation reports, summarizing all evaluation results. Below is an example report:
Based on practical usage experience, we summarize the following:
During data generation, use an appropriate delay (2-3 seconds) between calls to avoid API rate limits, enable checkpoint saving so interruptions lose no progress, test with a small batch (around 10 problems) before large-scale generation, and regularly check generation quality so prompts can be adjusted in time. For evaluation strategy, we recommend combining LLM Judge and Win Rate: LLM Judge for absolute quality evaluation, Win Rate for relative comparison, and manual verification for final quality control. For quality standards, we recommend an LLM Judge average score above 4.0/5.0, a Win Rate above 45% (close to 50%), a pass rate above 80%, and a manual verification pass rate above 90%. For iterative optimization, adjust generation prompts based on evaluation results, analyze common issues in low-scoring problems, learn from the strengths of high-scoring problems, and continuously improve the generation strategy.
Through learning this section, we have mastered how to use the HelloAgents framework for data generation quality evaluation, including three methods: LLM Judge evaluation, Win Rate evaluation, and manual verification. This complete evaluation system can ensure high quality of generated data, providing reliable data support for AI system training and testing.
HelloAgents also integrates tools for LLM Judge and Win Rate evaluation and provides complete example code; if you are interested in the implementation details of these two methods, refer to the examples.
12.5 Chapter Summary
In this chapter, we built a complete performance evaluation system for the HelloAgents framework. Let's review the core content learned:
(1) Evaluation System Overview
We established a three-tier evaluation system that covers the different capability dimensions of agents. First is tool-calling evaluation (BFCL), which focuses on function-calling accuracy across four categories (simple, multiple, parallel, irrelevance) and uses AST matching for precise evaluation. Second is general-capability evaluation (GAIA), which evaluates comprehensive problem-solving across three difficulty levels with 466 real-world problems, focusing on multi-step reasoning, tool usage, and file processing. Third is data generation quality evaluation (AIME), which evaluates LLM-generated data using LLM Judge and Win Rate, supports manual verification and comprehensive report generation, and ensures generated data meets the quality standard of the reference data.
(2) Core Technical Points
In the implementation, we adopted six core technical points. First is modular design: the evaluation system uses a three-tier architecture of data layer (Dataset, responsible for data loading and management), evaluation layer (Evaluator, responsible for executing the evaluation flow), and metrics layer (Metrics, responsible for calculating evaluation metrics). Second is tool encapsulation: all evaluation functions are packaged as Tools that agents can call directly, integrate into workflows, or use through a unified interface. Third is AST matching: function calls are compared as abstract syntax trees, which is smarter than simple string matching because it can ignore parameter order, recognize equivalent expressions, and ignore formatting differences. Fourth is multimodal support: GAIA evaluation handles text questions, attached files, image inputs, and other multimodal data. Fifth is LLM Judge evaluation: an LLM acts as judge of generated data quality, providing multi-dimensional scoring (correctness, clarity, difficulty matching, completeness), an automated evaluation flow, detailed reports, and support for custom evaluation dimensions and standards. Sixth is Win Rate comparison: generation quality is evaluated through pairwise comparison (generated data vs. reference data), where an LLM judges which is better and win-rate statistics are computed; a value close to 50% indicates equivalent quality.
(3) Extension Directions
Based on this chapter's evaluation system, you can extend in four directions. First, add new evaluation benchmarks: follow the BFCL and GAIA implementation patterns, implement the Dataset, Evaluator, and Metrics components, and wrap them as a Tool. Second, define custom evaluation metrics: add new metric calculations in the Metrics class, designed for your specific application scenario. Third, integrate into a CI/CD flow: automatically run evaluation on code commits, set performance thresholds to prevent regressions, and generate and archive evaluation reports. Fourth, extend data generation evaluation: support more data types (code, dialogue, documents, etc.), add more evaluation dimensions (innovation, diversity, etc.), integrate more reference datasets, and support multi-model comparison.
Congratulations on completing Chapter 12! 🎉
Evaluation is an important part of agent development; it allows us to:
- Objectively measure agent capabilities
- Discover and fix issues
- Continuously improve systems
In the next chapter, we will explore how to apply the HelloAgents framework to actual projects.
Keep going! 💪
Exercises
Hint: Some exercises have no standard answers, focusing on cultivating learners' comprehensive understanding and practical ability in agent performance evaluation.
- This chapter introduced multiple agent evaluation benchmarks. Please analyze:
- In Section 12.1.2, BFCL, GAIA, AgentBench and other evaluation benchmarks were introduced. Please compare BFCL and GAIA: What core capabilities of agents do they evaluate respectively? Why does BFCL use AST matching algorithm while GAIA uses Quasi Exact Match? What are the advantages and disadvantages of these two evaluation methods?
- Suppose you want to build an "intelligent customer service system" that needs to evaluate the following capabilities: (1) accuracy of understanding user intent; (2) correctness of calling backend APIs; (3) friendliness and professionalism of responses; (4) robustness in handling exceptional situations. Please select or design appropriate evaluation metrics and methods for each capability.
- In Section 12.1.1, it was mentioned that agent evaluation faces three major challenges: "output uncertainty", "evaluation standard diversity", and "high evaluation cost". Please propose specific solutions for each challenge and analyze the feasibility and limitations of the solutions.
- BFCL (Berkeley Function Calling Leaderboard) is an important benchmark for evaluating tool calling capabilities. Based on Section 12.2 content, please think deeply:
Hint: This is a hands-on practice question, actual operation is recommended
- In the AST matching algorithm in Section 12.2.3, we judge whether function calls are correct by comparing abstract syntax trees. Please analyze: Why is AST matching more suitable than simple string matching? In what situations might AST matching produce misjudgments (false positives or false negatives)? How to improve the AST matching algorithm to increase accuracy?
- BFCL dataset contains four categories: simple, multiple, parallel, irrelevance. Please design 2-3 new test samples for each category, requiring ability to test boundary cases or error-prone scenarios under that category.
- Please extend the BFCL evaluator based on the code in Section 12.2.4, adding the following functions: (1) support evaluating execution order of tool calls (for multiple tool calls with dependencies); (2) evaluate tool calling efficiency (such as whether minimum number of calls was used); (3) generate detailed error analysis report (such as which types of errors are most common).
- GAIA (General AI Assistants) evaluates agents' comprehensive capabilities. Based on Section 12.3 content, please complete the following extension practice:
Hint: This is a hands-on practice question, actual operation is recommended
- In Section 12.3.2, three difficulty levels of GAIA (Level 1/2/3) were introduced. Please analyze: What are the differences between these three levels in task complexity, required capabilities, evaluation standards, etc.? If designing Level 4 (ultra-high difficulty), what types of tasks should it include?
- GAIA uses "Quasi Exact Match" algorithm to evaluate answer correctness. Please analyze: How does this method handle answer diversity (such as "42", "forty-two", "42.0" should all be considered correct)? In what situations might quasi exact match not be sufficient? Please design a more intelligent answer matching algorithm that can handle semantically equivalent answers.
- Please implement a "custom GAIA evaluation set" based on the code in Section 12.3.4: select a specific domain (such as medical, legal, financial), design 10 real-world questions, and implement complete evaluation flow. Require questions to cover different difficulty levels, and provide standard answers and scoring criteria.
- LLM Judge is an emerging method of using large language models for evaluation. Based on Section 12.4 content, please analyze in depth:
- In Section 12.4.2, we used GPT-4 as judge to evaluate agent response quality. Please analyze: What advantages does LLM Judge have compared to traditional rule matching or metric calculation? What potential biases or limitations does it have (such as preference for certain response styles, sensitivity to length)?
- LLM Judge scoring criteria design is crucial. Please design detailed scoring criteria (including scoring dimensions, weights, examples) for the following three different evaluation scenarios: (1) code generation quality evaluation; (2) creative writing quality evaluation; (3) technical documentation quality evaluation.
- In Section 12.4.3, it was mentioned that multiple LLM Judges can be used for "jury-style" evaluation. Please design a "multi-judge evaluation system": using 3-5 different LLMs (such as GPT-4, Claude, Qwen) as judges, how to aggregate their scores? How to handle disagreements between judges? How to detect and filter abnormal scores?
- Practical application of agent evaluation needs to consider multiple aspects. Please think:
- In actual projects, evaluation often needs to balance between "evaluation cost" and "evaluation quality". Please design a "tiered evaluation strategy": (1) quick evaluation (low cost, for daily development iteration); (2) standard evaluation (medium cost, for pre-release); (3) comprehensive evaluation (high cost, for major updates or public release). What evaluation items should each tier include? How to design evaluation flow?
- Agent performance may change over time (such as changes in dependent external APIs, changes in user needs). Please design a "continuous evaluation system": able to periodically automatically run evaluation, monitor agent performance change trends, and alert in time when performance declines. What components should this system include? How to design alert rules?
- Evaluation results need to be presented clearly to different audiences (such as developers, product managers, users). Please design an "evaluation report generation system": able to automatically generate reports with different levels of detail based on audience type. What technical details should developer reports include? What business metrics should product manager reports highlight? How should user reports be simplified and visualized?
References
[1] Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint arXiv:2305.15334.
[2] Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., ... & Sun, M. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint arXiv:2307.16789.
[3] Li, M., Zhao, Y., Yu, B., Song, F., Li, H., Yu, H., ... & Li, Y. (2023). Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244.
[4] Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., ... & Scialom, T. (2023). GAIA: a benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983.
[5] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., ... & Zhang, D. (2023). AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688.
[6] Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., ... & Neubig, G. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854.
[7] Chan, C. M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., ... & Liu, Z. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv preprint arXiv:2308.07201.
[8] Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., ... & Neubig, G. (2023). SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents. arXiv preprint arXiv:2310.11667.
[9] Mathematical Association of America. (2024). American Invitational Mathematics Examination (AIME). Retrieved from https://www.maa.org/math-competitions/invitational-competitions/aime
