Let's Think in Two Steps:
Mitigating Agreement Bias in MLLMs with
Self-Grounded Verification

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira
Georgia Institute of Technology

Abstract

Verifiers—functions assigning rewards to agent behavior—have been key for AI progress in domains such as math, code and games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: a strong tendency to over-validate agent behavior, a phenomenon we call agreement bias. We show that agreement bias is pervasive across models, is resilient to test-time scaling, and can affect existing methods relying on MLLM-based evaluations. We discuss metrics to measure and methods to mitigate this bias, and introduce Self-Grounded Verification (SGV), a lightweight method that harnesses MLLMs' own sampling mechanisms by modulating (un)conditional generation to better leverage their knowledge, alignment, and reasoning. SGV operates in two steps: first, the MLLM is elicited to generate broad priors about desired agent behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. SGV yields more human-aligned evaluations, improving verification across models, metrics, and benchmarks, with gains of up to 25pp in failure identification, 14pp in accuracy, and benefits extending to downstream applications. In self-refinement and online supervision, SGV boosts task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena—setting a new state of the art, surpassing the previous best by 20pp. Finally, we quantitatively analyze and provide practical guidance on several design choices for MLLM verifiers, and release an updated version of VisualWebArena featuring more human-aligned evaluators, strong agent baselines, environment parallelism with improved execution fidelity, and runtime speedups of over 10x.

Overview

Summary of findings.
Top: example of a task in VisualWebArena and a corresponding agent trajectory. Middle: an MLLM verifier validates and reinforces flawed agent behavior, generating reasoning to rationalize incorrect judgments. Bottom: Self-Grounded Verification (SGV) leads to more accurate verification and enables the agent to backtrack.

Problem Setup

We study verifiers—functions that approximate human judgment of agent behavior in interactive environments. An agent completes a task \(q \in \mathcal{Q}\) through a series of actions \(a_t\). Given a history of environment states \(s_{r:t} \equiv [s_r, ..., s_t]\), the agent executes an action sampled from a policy \(\pi\): \(a_t = \pi(s_{r:t}, q)\), leading to a new state \(s_{t+1}\). The repetition of this process yields a trajectory \(\tau_{r:t} \equiv (s_r, a_r, \ldots, s_t, a_t)\) that serves as the basis for evaluation.

We define a multimodal verifier as a function \(r: \mathcal{T} \times \mathcal{Q} \to \mathbb{R} \times (\mathcal{V} \cup \{\emptyset\})\) that maps trajectory-task pairs to rewards consisting of a real-valued score and optional outputs in other modalities from \(\mathcal{V}\) (e.g., natural-language feedback).
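For concreteness, the following minimal sketch (in Python, with type and field names of our own choosing, not from the paper's codebase) makes this signature explicit:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative types that make the verifier signature
# r: T x Q -> R x (V ∪ {∅}) concrete. Names are hypothetical.

@dataclass
class Step:
    state: bytes          # observation s_t, e.g., a screenshot
    action: str           # action a_t executed from that state

@dataclass
class Trajectory:
    steps: list[Step]     # (s_r, a_r, ..., s_t, a_t)

@dataclass
class Verdict:
    reward: float                   # real-valued score
    feedback: Optional[str] = None  # optional output in another modality (e.g., text)

def verify(trajectory: Trajectory, task: str) -> Verdict:
    """A verifier maps a (trajectory, task) pair to a reward plus optional feedback."""
    raise NotImplementedError
```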

Applications of Verifiers

To characterize the potential benefits and risks of MLLM verifiers, we unify existing methods by how they use \(r\) (and are therefore affected by its quality), categorizing them into offline and online settings.

Offline Applications

The verifier is applied to trajectories post-hoc. Examples include:

  • Agent performance evaluation (e.g., trajectory \(\to\) task-completion scores).
  • Filtering “successful” trajectories for finetuning.
  • Self-refinement pipelines, such as inducing tools, memory, or reflection from past trajectories.

Online Applications

The verifier influences the agent's policy during execution. Examples include:

  • Online supervision: text/scalar rewards to steer agents toward task completion or to prevent undesired behavior.
  • Training signal for online methods such as reinforcement learning.
  • Ranking action-trajectory pairs in methods such as (tree) search.

MLLM Verifiers and Agreement Bias

A conventional approach to building an MLLM-based verifier is to prompt the model with the task \(q\), a trajectory \(\tau\), and additional context \(C\) (rubrics, reasoning steps, formatting instructions), and then map the model's completion \(y\) to a reward:

\(r_{\text{MLLM}}(q,\tau,C) = h\!\left(\prod_{i=1}^{n} P(y_i \mid y_{<i}, q,\tau,C)\right)\)
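As a rough sketch of this single-pass baseline (our prompt wording and parsing, not the paper's templates; `call_mllm` stands in for any multimodal chat-completion API):

```python
from typing import Callable

def conventional_verifier(call_mllm: Callable[..., str], task: str, context: str,
                          trajectory_summary: str, screenshots: list[bytes]) -> float:
    """Single-pass baseline: sample y conditioned on (q, tau, C) all at once,
    then map it to a scalar reward via h(.). Prompt and parsing are illustrative."""
    prompt = (
        f"{context}\n\n"
        f"Task: {task}\n"
        f"Agent trajectory (actions and observations):\n{trajectory_summary}\n\n"
        "Did the agent complete the task? Reason step by step, then answer on the "
        "last line with SUCCESS or FAILURE."
    )
    completion = call_mllm(prompt, images=screenshots)
    # h(.): map the sampled completion y to a binary reward.
    return 1.0 if "SUCCESS" in completion.strip().splitlines()[-1].upper() else 0.0
```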

However, this approach is subject to a critical failure mode we call agreement bias: a tendency to over-validate agent behavior—judging flawed trajectories as aligned with \(q\) despite carefully crafted instructions, established test-time scaling techniques, and decoding algorithms. This degrades the quality of \(r_{\text{MLLM}}\) and can adversely affect methods that make direct or indirect use of it.


Self-Grounded Verification (SGV)

We propose Self-Grounded Verification (SGV), a lightweight two-step method that substantially improves MLLM verifiers by leveraging the model's own sampling mechanisms to enable more effective use of its world knowledge, alignment, and reasoning capabilities.

  1. Step 1 (Priors): elicit broad priors \(\hat{k}_q\) about desired agent behavior for task \(q\), conditioned only on essential information needed to frame the task (e.g., the initial state \(s_r\)):
    \(\hat{k}_{q} = g\!\left(\prod_{i=1}^{n} P(y_i \mid y_{<i}, q, s_r, C)\right)\)
  2. Step 2 (Verification): evaluate the candidate trajectory conditioned on the self-generated priors:
    \(r_{\text{SGV}}(q, \tau, C) = h\!\left(\prod_{i=1}^{n} P(y_i \mid y_{<i}, q, \tau, C, \hat{k}_q)\right)\)

Intuitively, step 1 encourages the model to explore its probability distribution freely, extracting knowledge pertinent to the task at hand and independent of the data under evaluation. In step 2, the MLLM evaluates a candidate trajectory by sampling from a conditional distribution induced by its own priors. We hypothesize that MLLMs, given their extensive world knowledge, can generally produce human-aligned priors on desired behavior that can serve as impartial references for grounding the verification, leading to more truthful and accurate evaluation of agent behavior.
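A minimal sketch of the two steps, under the same assumptions as the baseline sketch above (`call_mllm` is any multimodal chat-completion function; prompt wording is ours, not the paper's templates):

```python
from typing import Callable

def sgv_verify(call_mllm: Callable[..., str], task: str, context: str,
               initial_screenshot: bytes, trajectory_summary: str,
               trajectory_screenshots: list[bytes]) -> float:
    # Step 1 (priors): elicit expectations about desired behavior, conditioned only
    # on information framing the task -- NOT on the trajectory under evaluation.
    priors = call_mllm(
        f"{context}\nTask: {task}\n"
        "Before seeing any agent behavior, describe what a correct solution should "
        "accomplish and what evidence would confirm the task is complete.",
        images=[initial_screenshot],
    )
    # Step 2 (verification): judge the candidate trajectory conditioned on the
    # self-generated priors k_q.
    verdict = call_mllm(
        f"{context}\nTask: {task}\n"
        f"Expected behavior (self-generated priors):\n{priors}\n\n"
        f"Candidate trajectory:\n{trajectory_summary}\n"
        "Judge the trajectory against the expected behavior above. Answer on the "
        "last line with SUCCESS or FAILURE.",
        images=trajectory_screenshots,
    )
    return 1.0 if "SUCCESS" in verdict.strip().splitlines()[-1].upper() else 0.0
```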

Experiments

Evaluating MLLM Verifiers

We outline several key factors relevant for a reliable assessment of verifiers: (i) diversity of environments and agents, (ii) quality and balance of trajectories and annotations, (iii) strong, comprehensive verifier implementations, and (iv) the choice of verifier applications and quantitative metrics for their evaluation.

For works proposing verifiers (or artifacts derived from them), we highlight the importance of:

  • Reporting fine-grained metrics that capture a verifier's ability both to reinforce desired behavior and, crucially, to identify failures, as the latter is essential for providing feedback when agent behavior is flawed and requires correction;
  • Evaluating a verifier not only by how well it judges trajectories post-hoc (the most common setting) but also by its impact on downstream applications, as the leniency-strictness tradeoffs inherent to verification render offline metrics insufficient to characterize a verifier's practical utility.

Environment Diversity

Diversity of environments and agents is key for generalizable findings. The benchmarks evaluated span a diverse set of tasks in web navigation, computer use, and long-horizon robotic manipulation that require nuanced verification and multimodal reasoning:

  • VisualWebArena (VWA): a web-browser environment with 910 tasks, many combining text and image instructions (e.g., "Buy this product" + image).
  • OSWorld: a computer-use benchmark with 369 tasks involving widely used desktop applications in single- and multi-app workflows.
  • robomimic: long-horizon robot manipulation; we focus on tool hang (the most challenging task), which consists of (1) inserting an L-shaped hook into a base and (2) hanging a wrench on it.

Task Examples

Agents & Trajectory Quality

Weak agents and buggy environments can produce trajectories that are trivial to verify and may artificially inflate verifier performance (e.g., long action loops, "page not found" errors). We fix several issues with the VisualWebArena environment and incorporate diverse agents to improve generalization.

  • VWA: We build a strong ReAct agent that produces a balanced distribution of failures and successes.
  • OSWorld: We use the GUI specialist UI-TARS-1.5, the best-performing agent on the benchmark at the time of writing.
  • robomimic: We train a diffusion policy on expert demonstrations.

Trajectory Annotations

Similar to prior work, we rely on the oracle verifiers included in each benchmark as scalable proxies for human judgment, since large-scale human annotation is expensive. For a reliable baseline, we make several improvements to the VisualWebArena oracles and externally validate them on trajectories provided by the concurrent AgentRewardBench, achieving near-perfect alignment with human labels.

Metric      Original   Improved Oracle
Precision   85.2       100
TPR         58.2       92
TNR         95.9       100
Acc         85.1       98

Table: Performance of oracle evaluators on AgentRewardBench-VisualWebArena.

Verifier Implementations

To ensure strong and generalizable findings, we implement robust verifier baselines and explore several design choices, including chain-of-thought (CoT) prompting, set-of-marks (SoM) prompting, majority voting/self-consistency, reasoning models, and 28+ variations of prompt/scoring templates, among them common interventions to mitigate biases in LLM judgments. For VisualWebArena, we additionally consider human-written, benchmark-specific rubrics explored in prior work and adopted by subsequent work for self-refinement, tree search, and finetuning.

Choice of Verifier Applications

We evaluate MLLM verifiers on three representative applications: (i) offline evaluation of trajectories, (ii) self-refinement via Reflexion, and (iii) online supervision.

These applications: (1) introduce fewer confounding factors; (2) are of direct practical interest and often serve as building blocks in larger pipelines; and (3) yield informative signals for broader applicability.

Quantitative Metrics

For offline verification of agent trajectories, we evaluate the alignment between rewards produced by MLLM verifiers and human/oracle judgments through bias, distance skewness, and true positive and true negative rates:

\[ \text{bias} = \frac{1}{n}\sum_i d_i; \qquad d\text{Skew} = 1 - \frac{\sum_{i,j}\lvert d_i - d_j\rvert}{\sum_{i,j}\lvert d_i + d_j\rvert}; \qquad \text{TR}(c) = \frac{\sum_i \mathbf{1}(\hat{r}_i=c \,\wedge\, r_i^*=c)}{\sum_i \mathbf{1}(r_i^*=c)}, \quad c\in\{0,1\} \]

where \(\hat{r}_i\) is the MLLM verifier reward, \(r_i^*\) the human/oracle reward, and \(d_i = \hat{r}_i - r_i^*\); \(c=1\) yields the true positive rate (TPR) and \(c=0\) the true negative rate (TNR).

To facilitate comparisons, we also include accuracy as an auxiliary metric that summarizes the tradeoff between TPR and TNR: \(\text{ACC} = (1-\text{SR})\cdot \text{TNR} + \text{SR}\cdot \text{TPR}\), where \(\text{SR}\) is the base agent success rate.
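A minimal NumPy sketch of these metrics (the function name and exact conventions are ours):

```python
import numpy as np

def verifier_metrics(pred: np.ndarray, oracle: np.ndarray) -> dict:
    """Offline alignment metrics between binary verifier rewards `pred` (r_hat)
    and human/oracle rewards `oracle` (r_star), following the definitions above."""
    d = pred.astype(float) - oracle.astype(float)
    bias = d.mean()
    # Distance skewness over all pairs (i, j): 0 for an error distribution symmetric
    # about zero; larger values indicate asymmetric (e.g., systematically inflated) rewards.
    num = np.abs(d[:, None] - d[None, :]).sum()
    den = np.abs(d[:, None] + d[None, :]).sum()
    dskew = 1.0 - num / den if den > 0 else 0.0
    tpr = (pred[oracle == 1] == 1).mean()   # reinforce true successes
    tnr = (pred[oracle == 0] == 0).mean()   # flag true failures
    sr = oracle.mean()                      # base agent success rate
    acc = (1 - sr) * tnr + sr * tpr
    return {"bias": bias, "dSkew": dskew, "TPR": tpr, "TNR": tnr, "ACC": acc}
```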

Due to the intrinsic leniency-strictness tradeoff in verification, we also evaluate the impact of a verifier on downstream applications by measuring the task completion rate (SR) of agents with and without verifier intervention.

Key Observations:

  • bias and dSkew: summary statistics reflecting the distribution of MLLM responses. Positive values indicate MLLM rewards systematically higher than humans/oracles; values near zero indicate closer alignment.
  • TNR: measures how often the MLLM identifies failures among the subset of trajectories labeled as failures, providing an empirical estimate of the probability of classifying a trajectory as flawed when it truly is. A low TNR (\(\text{TNR} = 1-\text{FPR}\)) implies a high number of false positives; everything is analogous for TPR.
  • Statistics like TPR and TNR directly relate to the core function of verifiers: providing feedback to improve agent performance. Low values indicate not only misalignment but also risks for downstream applications—especially a low probability of flagging failures (low TNR).

External Validation: Results on AgentRewardBench

As a validation of our experimental design, we evaluate our refined oracles and baseline verifier on external human-labeled trajectories provided by the concurrent work AgentRewardBench (AGRB) for the benchmarks we consider (VisualWebArena). Key observations:

  • Our revised oracles achieve near-perfect alignment to human labels.
  • Our baseline verifier already achieves a new state-of-the-art on the benchmark, reinforcing the strength of our findings on agreement bias and SGV's relative gains.
  • SGV further improves performance, outperforming even the original oracles included in the VisualWebArena benchmark.
  • This high performance is partly explained by the weaker agents and environment bugs behind AGRB trajectories, validating our efforts to develop stronger agents and improve the environment.


Offline Verification of Digital Agent Trajectories

MLLMs display a strong tendency to over-validate flawed behavior, manifested as (i) responses tilted toward favorable evaluations (positive bias/skewness), (ii) a high number of false positives (1-TNR), and (iii) a low probability of flagging failures (TNR as low as 50%). This pattern is pervasive across models and persists even when verifiers and agents are built from models of different families, sizes, and methods. It persists for trajectories produced by both weak and strong agents (e.g., it is more severe in VisualWebArena than in OSWorld) and grows as the gap between agent and verifier capabilities increases.

Agreement bias is resilient to established test-time scaling techniques such as CoT, Set-of-Marks (SoM), and autonomously generated reasoning steps (thinking).

Agreement bias and MLLM output distributions

Agreement bias manifests as a skew in MLLM output distributions toward favorable labels. This pattern is consistent across prompt/scoring templates and is largely unaffected by interventions to mitigate biases in LLM judgments [1] [2] and grounding via web-search tools [3].

Critically, the probability of sampling a correct verification from the model's output distribution on failure cases can be close to chance. This explains the ineffectiveness of methods such as majority voting/self-consistency and indicates an imbalance that must be accounted for in test-time and training methods that rely on sampling (see the paper for experiments on the potential of such approaches).
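A back-of-the-envelope illustration of why sampling more verdicts cannot fix this: if the per-sample probability \(p\) of a correct verdict on failure cases is near chance, the accuracy of a majority vote over \(k\) samples stays near chance (the helper below is ours; it simply evaluates the binomial tail):

```python
from math import comb

def majority_accuracy(p: float, k: int) -> float:
    """Probability that a majority of k independent verdicts is correct, given
    per-verdict correctness probability p (ties for even k broken at random)."""
    acc = sum(comb(k, m) * p**m * (1 - p)**(k - m) for m in range(k // 2 + 1, k + 1))
    if k % 2 == 0:
        acc += 0.5 * comb(k, k // 2) * (p * (1 - p)) ** (k // 2)
    return acc

for p in (0.50, 0.55, 0.80):
    print(p, [round(majority_accuracy(p, k), 3) for k in (1, 5, 15)])
# p=0.50 stays at 0.5 for every k; p=0.55 creeps up only slowly (~0.59 at k=5,
# ~0.65 at k=15); p=0.80 climbs quickly (~0.94 at k=5, ~1.0 at k=15).
```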

Effectiveness of SGV and Leniency-Strictness Trade-offs in Verification

By producing more balanced and human-aligned evaluations, SGV improves verification across all models and benchmarks. Remarkably, (i) SGV boosts performance of large reasoning models, and (ii) enables non-reasoning models to match their reasoning counterparts.

SGV can lead to stricter verification (lower TPR), a pattern partly explained by disagreements with permissive oracles on simplistic tasks. For example, for "Find the hours on the engine of the most recently listed red boat," clicking on the "Boats" category and returning the first result is validated by the oracles and the baseline verifiers but rejected by SGV, which prompts the agent to search for and verify the requested attributes and ultimately complete the task via a more robust approach.

This illustrates a fundamental trade-off in verification: leniency can encourage brittle behavior, while strictness may reject valid solutions but enable more robust ones. As seen here, offline metrics provide only a partial view of this balance, highlighting the importance of evaluating verifiers in downstream applications to more clearly characterize their practical usefulness.

Downstream Applications

The adverse effects of agreement bias extend beyond incorrect evaluation of agent performance. The tendency to over-validate flawed behavior means that MLLM verifiers are unable to provide corrective feedback exactly when most needed: when agent behavior is flawed and requires correction.

In self-refinement, agreement bias hinders the agent's ability to self-improve, in contrast to SGV which leads to consistent gains across refinement iterations.
Similarly, agreement bias reduces the effectiveness of MLLM-derived rewards in online methods. In online supervision and diffusion-policy steering, the (strong) MLLM verifier fails to deliver meaningful improvements. In contrast, SGV leads to substantial gains, setting a new state of the art in VisualWebArena while using substantially fewer tokens, requiring no access to previous trajectories, and enabling backtracking via native web actions.



Notably, these results occur for a strong baseline MLLM verifier with over 91% recall and state-of-the-art precision on AgentRewardBench, reinforcing (i) the importance of fine-grained metrics (especially TNR) and (ii) the relevance of downstream-application metrics.

Additional Results

We invite readers to explore our paper for additional results and discussions:

  • Trajectory length & verification difficulty: Verification is generally stronger on longer trajectories (more opportunities for clearly observable failures). Harder-to-verify trajectories (e.g., produced by stronger agents) degrade verification performance, largely due to increased agreement bias. SGV improves verification in all such cases.
  • Qualitative analysis discussing and categorizing factors affecting agent success rates in downstream applications.
  • Periodic vs. outcome-based verification: largely the same conclusions (MLLM-based verification fails to deliver meaningful improvements, in contrast to SGV), except that periodic verification can be more effective in environments with destructive or irreversible actions.
  • Ablations to SGV: SGV outperforms grounding via web-search tools; is robust to moderate noise in first-step generations; weaker models can produce effective priors for stronger or reasoning models; multiple priors from diverse models can further improve performance.
  • VisualWebArena improvements: Additional details on action-execution fixes and runtime speedups.
  • Intervening at the MLLM output distribution: Opportunities and limitations of methods that explicitly account for skewed MLLM distributions associated with agreement bias.

Example Trajectories

1. Example of the agent receiving valid feedback from the verifier and modifying its approach to correctly complete the task:

"Find me the cheapest controller from the classifieds site that is meant for the console in the image on the Reddit tab."
(VisualWebArena)

2. Example of the verifier guiding the agent toward a more robust strategy to complete the task:

"How many hours are on the engine of the most recently listed red boat?"
(VisualWebArena)