Training with a Process Reward Model (PRM)
GUI agents are systems that can use real software like a person—they look at the screen, click buttons, type into forms, and navigate multi-step workflows across browsers, operating systems, and mobile apps. A capable agent could automate many everyday tasks end-to-end.
The figure below shows an example of a GUI agent completing one such task: finding a free slot, auto-filling the event, sending an invite, and confirming success.
The hardest part is making these long sequences reliable. Instead of only scaling model size, we improve the agent by letting it take longer action sequences and consider more possibilities at inference time:
- **Process Reward Model (PRM):** Teaches the policy what "good progress" looks like at each step, so longer rollouts stay goal-directed—like a coach saying "yes, keep going" or "no, that's a detour" during learning.
- **Internal world model:** Dynamically retrieves contrastive past experience (successes and failures) to support think-before-acting reasoning and guide action selection—like recalling a similar situation on the spot.
By combining external PRM scoring with internal world-model guidance, we aim for higher success rates, fewer environment interactions, and better generalization than a baseline GUI agent without inference-time scaling.
Our PRM pipeline follows the workflow shown below, which we use both to train a Process Reward Model and to guide downstream GUI agent training.
We build a Process Reward Model that scores each step of a GUI trajectory given the task, screenshots, and actions, then plug that PRM into a reinforcement learning loop:
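A minimal sketch of that loop, assuming a per-step PRM interface; `prm_score` here is a toy stand-in for the fine-tuned model (the real PRM consumes the task, screenshots, and actions), and the discount factor is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: str  # path to the screenshot observed at this step
    action: str      # the action string the agent emitted

def prm_score(task: str, step: Step) -> float:
    # Stand-in for the fine-tuned PRM, which scores
    # (task, screenshot, action) in [0, 1]; toy heuristic so this runs.
    return 1.0 if step.action.startswith("click") else 0.2

def dense_returns(task: str, trajectory: list[Step], gamma: float = 0.9) -> list[float]:
    """Convert per-step PRM scores into discounted returns that serve as
    dense rewards for the policy update."""
    scores = [prm_score(task, s) for s in trajectory]
    returns, g = [], 0.0
    for r in reversed(scores):      # accumulate from the final step backward
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]            # restore forward step order
```

In practice the PRM score replaces the single sparse end-of-episode reward, so every step of a long rollout receives a learning signal.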
Goal: Improve long-horizon GUI reliability by helping the agent think before acting via contrastive retrieval of past successes and failures, reducing repeated failure patterns.
Experience abstraction. Agent trajectories are converted into structured memory items and indexed with FAISS. We maintain separate memory banks for successful and failed trajectories to enable contrastive retrieval at inference time. The ReasoningBank below captures this flow: experience/trajectory → memory extraction → memory items → consolidation and retrieval.
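A minimal sketch of the dual memory banks, using plain numpy cosine search in place of the FAISS + CLIP stack for illustration (the class and method names are ours, not the library's):

```python
import numpy as np

class ContrastiveMemory:
    """Separate success/failure banks over embedded memory items.
    The real pipeline indexes CLIP embeddings with FAISS; plain numpy
    cosine search is used here to keep the sketch self-contained."""

    def __init__(self):
        self.banks = {"success": [], "failure": []}   # unit-norm embeddings
        self.items = {"success": [], "failure": []}   # structured memory items

    def add(self, outcome: str, embedding, memory_item) -> None:
        v = np.asarray(embedding, dtype=np.float32)
        self.banks[outcome].append(v / np.linalg.norm(v))
        self.items[outcome].append(memory_item)

    def retrieve(self, query, k: int = 3) -> dict:
        """Top-k nearest items from BOTH banks, enabling contrastive use."""
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        out = {}
        for outcome, bank in self.banks.items():
            if not bank:
                out[outcome] = []
                continue
            sims = np.stack(bank) @ q          # cosine sim (vectors are unit norm)
            idx = np.argsort(-sims)[:k]        # highest similarity first
            out[outcome] = [self.items[outcome][i] for i in idx]
        return out
```

Keeping the banks separate (rather than one index with labels) guarantees every retrieval returns both positive and negative evidence, which is what makes the guidance contrastive.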
Candidate-action guidance. At each step, the agent evaluates multiple candidate actions. The world model retrieves similar past trajectories and uses contrastive evidence to prioritize actions aligned with successful behaviors while avoiding known failure patterns. The figure below contrasts the expert trajectory (green) with alternative actions and their resulting states.
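The contrastive re-ranking can be sketched as below; token-level Jaccard similarity stands in for the embedding similarity the system actually uses, and all names are illustrative:

```python
def contrastive_action_score(candidate: str,
                             success_actions: list[str],
                             failure_actions: list[str]) -> float:
    """Reward overlap with actions from successful trajectories,
    penalize overlap with actions from failed ones."""
    def sim(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    pos = max((sim(candidate, a) for a in success_actions), default=0.0)
    neg = max((sim(candidate, a) for a in failure_actions), default=0.0)
    return pos - neg

def rank_candidates(candidates: list[str],
                    success_actions: list[str],
                    failure_actions: list[str]) -> list[str]:
    # Best-scoring candidate first; the agent considers these in order.
    return sorted(
        candidates,
        key=lambda c: contrastive_action_score(c, success_actions, failure_actions),
        reverse=True,
    )
```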
Contrastive world model pipeline. Given the current screenshot and task, the system: (1) retrieves the top-k success and failure trajectories via FAISS + CLIP embeddings; (2) analyzes divergence points between successful and failed action sequences; (3) summarizes key success patterns and common pitfalls into ~200-token guidance; (4) injects the guidance into the agent prompt (both in the initial prompt and at each step). We also use confidence checks and evidence aggregation to stabilize guidance when retrieval signals are weak or ambiguous. This is supported by implicit world modeling: in Stage 1 the model learns to predict the next state given the current state and action (world modeling), and in Stage 2 continual training teaches it to predict the action given the state.
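Steps (2) and (3) can be sketched as follows, for a single success/failure pair; the actual pipeline aggregates evidence across the retrieved top-k and applies the confidence checks described above, and the function names are ours:

```python
def divergence_point(success_seq: list[str], failure_seq: list[str]) -> int:
    """Index of the first step where a failed action sequence departs
    from a successful one for the same task."""
    for i, (s, f) in enumerate(zip(success_seq, failure_seq)):
        if s != f:
            return i
    return min(len(success_seq), len(failure_seq))

def build_guidance(task: str,
                   success_seq: list[str],
                   failure_seq: list[str],
                   max_tokens: int = 200) -> str:
    """Summarize the success pattern and the known pitfall into short
    guidance text for injection into the agent prompt."""
    i = divergence_point(success_seq, failure_seq)
    lines = [
        f"Task: {task}",
        f"Successful runs proceed: {' -> '.join(success_seq[:i + 1])}",
    ]
    if i < len(failure_seq):
        lines.append(f"Avoid: '{failure_seq[i]}' at step {i}; it led to failure.")
    guidance = "\n".join(lines)
    # Crude whitespace-token budget standing in for a real tokenizer limit.
    return " ".join(guidance.split()[:max_tokens])
```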
After PRM fine-tuning on 184 generated training tasks, we evaluated on the OSWorld 39-task test split. Each trajectory was split into 3-step windows, yielding 505 PRM-evaluated examples. "Pred = True" means the PRM predicts a positive step (reward > 0).
| Model / Setting | TP | FP | FN | TN | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-4B (zero-shot PRM) | 111 | 9 | 304 | 81 | 38.02% | 92.50% | 26.75% | 41.50% |
| Fine-tuned PRM (LLaMA-Factory) | 313 | 41 | 102 | 49 | 71.68% | 88.42% | 75.42% | 81.40% |
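The derived columns follow directly from the confusion counts; a quick sanity check of the fine-tuned and zero-shot rows:

```python
def prm_metrics(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Accuracy, precision, recall, and F1 from raw confusion counts."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

# Fine-tuned PRM row: 313 / 41 / 102 / 49 over 505 step windows.
acc, prec, rec, f1 = prm_metrics(313, 41, 102, 49)
```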
Fine-tuning yields a clear improvement: accuracy rises from 38.02% to 71.68%, driven by a large drop in false negatives (recall 26.75% → 75.42%) while precision remains high. We then used the fine-tuned PRM to provide dense step-wise rewards for agent training and evaluated the resulting agent on downstream GUI benchmarks:
| Benchmark | Baseline (Qwen3-VL-4B) | PRM-trained agent (ours) |
|---|---|---|
| MMBench-GUI (grounding) | 83.17% | 82.92% |
| UI-Vision (planning + grounding) | 27.76% | 28.13% |
| AndroidControl (High Step SR) | 31.64% | 31.66% |
| Mind2Web (Step SR) | 16.50% | 17.50% |
| AndroidWorld (task SR) | 24.00% | 24.00% |
| AndroidWorld (verification score) | 71.00% | 86.00% |
Across planning-heavy benchmarks, the PRM-guided agent largely preserves grounding ability while modestly improving step-wise decision quality and substantially boosting AndroidWorld verification accuracy.
We compare a CoMEM-style success-only retrieval baseline against our contrastive (success + failure) retrieval with dynamic step-level guidance. Setup: top-k=3 success + 3 failure trajectories per step (FAISS + CLIP); dynamic re-retrieval each step. On average, 40% of retrieved trajectories change between step 0 and step 5.
| Domain | Baseline | World Model | ∆ (pp) |
|---|---|---|---|
| Google Maps | 16.67% | 26.83% | +10.16 |
| Amazon | 19.51% | 39.02% | +19.51 |
| Allrecipes | 15.91% | 11.11% | -4.80 |
| Coursera | 2.38% | 9.52% | +7.14 |
The contrastive world model improves success rate on 3 of 4 domains, with the largest gain on Amazon. Allrecipes shows a regression; we are investigating domain-specific retrieval and confidence filtering.
PRM-guided agent training. We built and curated datasets for PRM training, fine-tuned a PRM, and used it to provide dense step-wise rewards for GUI-agent training. Even under a limited training budget (one epoch and short rollouts), PRM-based dense rewards preserve grounding ability while improving planning-oriented metrics and step success rates, suggesting that PRM supervision helps the agent make better local decisions during multi-step interaction.
Internal world model. We introduced a contrastive memory mechanism that retrieves both successful and failed trajectories to guide agent actions during inference. Dual FAISS indices (success vs. failure) with CLIP embeddings and state-aware step-level retrieval keep guidance aligned with the evolving GUI state. On WebVoyager, this world model improves success rate on 3 of 4 domains, with the largest gain on Amazon, while also revealing domains where additional filtering and domain-specific retrieval are needed.
Toward robust inference-time scaling. Together, PRM-guided training and the internal world model show how inference-time scaling—without increasing model size—can improve GUI agent reliability by turning longer trajectories into structured reasoning traces rather than random wandering. Future work will scale PRM supervision and strengthen contrastive world modeling so that these process-level gains translate into larger end-to-end task success improvements.
Limitations. Due to time and compute constraints, we trained the GUI agent for only 1 epoch and used only the first 4 steps of each rollout for PRM-guided agent training.
If you use this work, please cite:
```bibtex
@article{wang2025inference,
  title  = {Inference-Time Scaling for GUI Agents with Process Reward Models and Internal World Models},
  author = {Wang, Bella and Wu, Rita Yujia and Liu, Shuchang and Huang, Ziyu},
  year   = {2025},
  url    = {https://github.com/RitaYujiaWu/DSC180-A08-GUI-Project},
  note   = {DSC180 A08 Capstone. Mentors: Kun Zhou, Zhiting Hu.},
}
```