Verifiers—functions assigning rewards to agent behavior—have been key for AI progress in domains such as math, code and games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: a strong tendency to over-validate agent behavior, a phenomenon we call agreement bias. We show that agreement bias is pervasive across models, is resilient to test-time scaling, and can affect existing methods relying on MLLM-based evaluations. We discuss metrics to measure and methods to mitigate this bias, and introduce Self-Grounded Verification (SGV), a lightweight method that harnesses MLLMs' own sampling mechanisms by modulating (un)conditional generation to better leverage their knowledge, alignment, and reasoning. SGV operates in two steps: first, the MLLM is elicited to generate broad priors about desired agent behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. SGV yields more human-aligned evaluations, improving verification across models, metrics, and benchmarks, with gains of up to 25pp in failure identification, 14pp in accuracy, and benefits extending to downstream applications. In self-refinement and online supervision, SGV boosts task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena—setting a new state of the art, surpassing the previous best by 20pp. Finally, we quantitatively analyze and provide practical guidance on several design choices for MLLM verifiers, and release an updated version of VisualWebArena featuring more human-aligned evaluators, strong agent baselines, environment parallelism with improved execution fidelity, and runtime speedups of over 10x.
We study verifiers—functions that approximate human judgment of agent behavior in interactive environments. An agent completes a task \(q \in \mathcal{Q}\) through a series of actions \(a_t\). Given a history of environment states \(s_{r:t} \equiv [s_r, \ldots, s_t]\), the agent executes an action sampled from a policy \(\pi\), \(a_t \sim \pi(\cdot \mid s_{r:t}, q)\), leading to a new state \(s_{t+1}\). Repeating this process yields a trajectory \(\tau_{r:t} \equiv (s_r, a_r, \ldots, s_t, a_t)\) that serves as the basis for evaluation.
We define a multimodal verifier as a function \(r: \mathcal{T} \times \mathcal{Q} \to \mathbb{R} \times (\mathcal{V} \cup \{\emptyset\})\) that maps trajectory-task pairs to rewards consisting of a real-valued score and optional outputs in other modalities from \(\mathcal{V}\) (e.g., natural-language feedback).
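For concreteness, these objects can be sketched as lightweight interfaces; the names below (`Step`, `Trajectory`, `Verifier`) are illustrative only and not tied to any released code.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


@dataclass
class Step:
    state: Any   # environment state s_t (e.g., a screenshot, DOM snapshot, or robot observation)
    action: Any  # agent action a_t


@dataclass
class Trajectory:
    steps: list[Step]  # tau_{r:t} = (s_r, a_r, ..., s_t, a_t)


class Verifier(Protocol):
    def __call__(self, trajectory: Trajectory, task: str) -> tuple[float, Optional[str]]:
        """Map a (trajectory, task) pair to a real-valued reward plus optional
        output in another modality (here, natural-language feedback or None)."""
        ...
```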
To characterize the potential benefits and risks of MLLM verifiers, we group existing methods by how they use \(r\), and are therefore affected by its quality, into offline and online settings.
The verifier is applied to trajectories post hoc. Examples include:
The verifier influences the agent's policy during execution. Examples include:
A conventional approach to building an MLLM-based verifier is to prompt the model with the task \(q\), a trajectory \(\tau\), and additional context \(C\) (rubrics, reasoning steps, formatting), then map the model completion \(y\) into a reward:
\(r_{\text{MLLM}}(q,\tau,C) = h(y), \quad y \sim \prod_{i=1}^{n} P(y_i \mid y_{<i},\, q,\tau,C)\)
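A minimal sketch of this conventional setup is given below; the `mllm_generate` client, the prompt wording, and the verdict-parsing step standing in for \(h\) are illustrative placeholders, not the exact prompts or parsers used in our experiments.

```python
import re


def conventional_verifier(task: str, trajectory, context: str, mllm_generate) -> tuple[float, str]:
    """Prompt an MLLM with (q, tau, C) and map its completion y to a reward.

    `trajectory` is a Trajectory-like object with `.steps`; `mllm_generate` is a
    placeholder callable (prompt, images) -> text standing in for any MLLM API.
    """
    prompt = (
        f"Task: {task}\n"
        f"Additional context (rubrics, reasoning steps, formatting): {context}\n"
        "Review the trajectory (screenshots and actions), then answer with "
        "'VERDICT: SUCCESS' or 'VERDICT: FAILURE' followed by a brief justification."
    )
    completion = mllm_generate(prompt, images=[step.state for step in trajectory.steps])

    # h: map the sampled completion y to a scalar reward, keeping the text as feedback.
    match = re.search(r"VERDICT:\s*(SUCCESS|FAILURE)", completion, re.IGNORECASE)
    reward = 1.0 if match and match.group(1).upper() == "SUCCESS" else 0.0
    return reward, completion
```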
However, this approach is subject to a critical failure mode we call agreement bias: a tendency to over-validate agent behavior—judging flawed trajectories as aligned with \(q\) despite carefully crafted instructions, established test-time scaling techniques, and decoding algorithms. This degrades the quality of \(r_{\text{MLLM}}\) and can adversely affect methods that make direct or indirect use of it.
We propose Self-Grounded Verification (SGV), a lightweight two-step method that substantially improves MLLM verifiers by leveraging the model's own sampling mechanisms to enable more effective use of its world knowledge, alignment, and reasoning capabilities.
Intuitively, step 1 encourages the model to explore its probability distribution freely, extracting knowledge pertinent to the task at hand and independent of the data under evaluation. In step 2, the MLLM evaluates a candidate trajectory by sampling from a conditional distribution induced by its own priors. We hypothesize that MLLMs, given their extensive world knowledge, can generally produce human-aligned priors on desired behavior that can serve as impartial references for grounding the verification, leading to more truthful and accurate evaluation of agent behavior.
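Under the same illustrative assumptions as the sketch above (a placeholder `mllm_generate` client and a Trajectory-like object), the two steps can be outlined as follows; the prompts are illustrative rather than the exact ones used in the paper.

```python
def sgv_verifier(task: str, trajectory, mllm_generate) -> tuple[float, str]:
    """Two-step Self-Grounded Verification sketch (prompts are illustrative)."""
    # Step 1: trajectory-independent generation of priors about desired behavior.
    prior_prompt = (
        f"Task: {task}\n"
        "Before seeing any attempt, describe what a successful execution of this task "
        "should accomplish and which checks a careful human reviewer would apply."
    )
    priors = mllm_generate(prior_prompt, images=[])

    # Step 2: evaluate the candidate trajectory conditioned on the self-generated priors.
    verify_prompt = (
        f"Task: {task}\n"
        f"Reference criteria (self-generated):\n{priors}\n"
        "Grounded in these criteria, reason over the trajectory (screenshots and actions) "
        "and answer with 'VERDICT: SUCCESS' or 'VERDICT: FAILURE' plus a brief justification."
    )
    completion = mllm_generate(verify_prompt, images=[step.state for step in trajectory.steps])
    reward = 1.0 if "VERDICT: SUCCESS" in completion.upper() else 0.0
    return reward, completion
```

The only structural change relative to the conventional verifier is that the evaluation in step 2 is conditioned on the model's own, data-independent priors rather than on the trajectory alone.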
We outline several key factors relevant for a reliable assessment of verifiers: (i) diversity of environments and agents, (ii) quality and balance of trajectories and annotations, (iii) strong, comprehensive verifier implementations, and (iv) the choice of verifier applications and quantitative metrics for their evaluation.
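As a minimal illustration of point (iv), binary verifier quality can be summarized with accuracy, the true-positive rate on successful trajectories, and the rate at which failed trajectories are correctly rejected; the sketch below takes "failure identification" to mean that true-negative rate (one reasonable reading, used here only for illustration).

```python
def verification_metrics(preds: list[int], labels: list[int]) -> dict[str, float]:
    """Summarize binary verifier quality (1 = judged/annotated success, 0 = failure).

    'failure_identification' is computed here as the true-negative rate: the share
    of failed trajectories that the verifier correctly rejects.
    """
    pairs = list(zip(preds, labels))
    successes = [p for p, y in pairs if y == 1]
    failures = [p for p, y in pairs if y == 0]
    return {
        "accuracy": sum(p == y for p, y in pairs) / len(pairs),
        "tpr": sum(p == 1 for p in successes) / len(successes) if successes else float("nan"),
        "failure_identification": sum(p == 0 for p in failures) / len(failures) if failures else float("nan"),
    }
```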
For works proposing verifiers (or artifacts derived from them), we highlight the importance of:
Agreement bias is resilient to established test-time scaling techniques such as chain-of-thought (CoT) prompting, Set-of-Marks (SoM), and autonomously generated reasoning steps (thinking).
Agreement bias manifests as a skew in MLLM output distributions toward favorable labels. This pattern is consistent across prompt/scoring templates and is largely unaffected by interventions to mitigate biases in LLM judgments [1] [2] and grounding via web-search tools [3].
Critically, the probability of sampling a correct verification from the model's output distribution on failure cases can be close to chance. This explains the ineffectiveness of methods such as majority voting/self-consistency and indicates an imbalance that test-time and training methods relying on sampling must account for (see the paper for experiments on the potential of such approaches).
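To see why, consider a simple binomial sketch (an illustration, not an analysis from the paper): if each independent sample verifies a failure case correctly with probability \(p\) close to 0.5, aggregating \(k\) samples by majority vote barely improves over a single sample.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a majority of k independent samples is correct when each
    sample is correct with probability p (even-k ties broken by a coin flip)."""
    strict_majority = sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(k // 2 + 1, k + 1))
    tie = 0.5 * comb(k, k // 2) * (p * (1 - p)) ** (k // 2) if k % 2 == 0 else 0.0
    return strict_majority + tie


print(majority_vote_accuracy(0.52, 1))   # ~0.52: a single near-chance sample
print(majority_vote_accuracy(0.52, 15))  # ~0.56: voting barely helps near chance
print(majority_vote_accuracy(0.80, 15))  # ~0.996: voting helps when samples are informative
```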
By producing more balanced and human-aligned evaluations, SGV improves verification across all models and benchmarks. Remarkably, (i) SGV boosts the performance of large reasoning models, and (ii) enables non-reasoning models to match their reasoning counterparts.
SGV can lead to stricter verification (lower TPR), a pattern partly explained by disagreements with permissive oracles on simplistic tasks.
For example, to "Find the hours on the engine of the most recently listed red boat," clicking on the "Boats" category and returning the first result is validated by oracles and the baseline verifiers but is rejected by SGV, which prompts the agent to search for and verify requested attributes, and ultimately completes the task via a more robust approach.
This illustrates a fundamental trade-off in verification: leniency can encourage brittle behavior, while strictness may reject valid solutions but enable more robust ones.
As seen here, offline metrics provide only a partial view of this balance, highlighting the importance of evaluating verifiers in downstream applications to more clearly characterize their practical usefulness.
We invite readers to explore our paper for additional results and discussions:
1. Example of the agent receiving valid feedback from the verifier and modifying its approach to correctly complete the task:
2. Example of the verifier guiding the agent toward a more robust strategy to complete the task: