🔍 Introduction

GUI agents are systems that can use real software like a person—they look at the screen, click buttons, type into forms, and navigate multi-step workflows across browsers, operating systems, and mobile apps. A capable agent could automate many everyday tasks end-to-end.

The figure below shows an example of a GUI agent completing one such task: finding a free slot, auto-filling the event and sending an invite, then confirming success.

Agent example: Step 1 Finding a slot → Step 2 Auto-fill & invite → Step 3 Success! Task complete.

The hardest part is making these long sequences reliable. Instead of only scaling model size, we improve the agent by letting it take longer action sequences and consider more possibilities at inference time:

Training with a Process Reward Model (PRM)

Teaches the policy what "good progress" looks like at each step, so longer rollouts stay goal-directed—like a coach saying "yes, keep going" or "no, that's a detour" during learning.

Inference with an internal world model

Dynamically retrieves contrastive past experience (successes and failures) to support think-before-acting reasoning and guide action selection—like recalling a similar situation on the spot.

By combining external PRM scoring with internal world-model guidance, we aim for higher success rates, fewer environment interactions, and better generalization than a baseline GUI agent without inference-time scaling.

🔬 Methods

Part A: Process Reward Model and PRM-Guided Agent Training

Our PRM pipeline follows the workflow shown below, which we use both to train a Process Reward Model and to guide downstream GUI agent training.

Process Reward Model pipeline: ZeroGUI-style data generation on OSWorld tasks, reward labeling with GPT-5-mini, LLaMA-Factory–compatible formatting, supervised fine-tuning on Qwen3-VL-4B, and deployment as a trained PRM.

We build a Process Reward Model that scores each step of a GUI trajectory given the task, screenshots, and actions, then plug that PRM into a reinforcement learning loop:

  • Inputs (per step): task instruction (goal) + recent screenshots + the action the agent just took.
  • PRM output: a progress score (0/1 or scaled to [0,1]) and an optional short reason (for debugging).
  • How we fine-tune the PRM: generate and collect OSWorld trajectories → label each step with progress/no-progress using GPT-5-mini → fine-tune a Qwen3-VL-4B model to predict the step score from the inputs above.
  • How we train the agent with the PRM: use Qwen3-VL-4B as the policy backbone; during RL, after every action we query the PRM for a step reward and use that reward to update the agent policy so longer rollouts stay goal-directed.

Agent evaluation case from AndroidWorld: the PRM helps the agent maintain progress over a long-horizon mobile GUI task.
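The PRM's step scores enter the RL loop as dense shaping signals. The sketch below shows one way the per-step scores could be folded into discounted returns alongside the sparse end-of-task reward; `prm_weight` and the shaping scheme are illustrative assumptions, not the exact training recipe.

```python
def prm_shaped_returns(step_scores, final_reward, gamma=0.99, prm_weight=0.5):
    """Fold dense PRM step scores into discounted per-step returns.

    step_scores: per-step PRM progress scores in [0, 1].
    final_reward: sparse task-completion reward added at the last step.
    The weighting scheme here is an illustrative assumption.
    """
    rewards = [prm_weight * s for s in step_scores]
    rewards[-1] += final_reward  # sparse success signal at episode end
    returns, g = [], 0.0
    for r in reversed(rewards):  # standard discounted-return recursion
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]
```

Each step's return can then serve as the target in whichever policy-gradient update the agent uses.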

Part B: Internal World Model

Goal: Improve long-horizon GUI reliability by helping the agent think before acting via contrastive retrieval of past successes and failures, reducing repeated failure patterns.

Experience abstraction. Agent trajectories are converted into structured memory items and indexed with FAISS. We maintain separate memory banks for successful and failed trajectories to enable contrastive retrieval at inference time. The ReasoningBank below captures this flow: experience/trajectory → memory extraction → memory items → consolidation and retrieval.

ReasoningBank: experience/trajectory, memory extraction, and consolidation into memory items for retrieval.
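As a concrete sketch of the dual memory banks, the class below keeps separate success and failure stores and retrieves the top-k nearest items from each. It is a NumPy stand-in for the FAISS indices used in the actual pipeline, and the embeddings are assumed to come from a CLIP-style encoder.

```python
import numpy as np

class ContrastiveMemory:
    """Dual memory banks (success / failure) over trajectory embeddings.
    NumPy stand-in for the FAISS indices in the actual pipeline."""

    def __init__(self):
        self.vecs = {"success": [], "failure": []}
        self.items = {"success": [], "failure": []}

    def add(self, bank, embedding, item):
        v = np.asarray(embedding, dtype=np.float32)
        self.vecs[bank].append(v / np.linalg.norm(v))  # unit vectors → cosine sim
        self.items[bank].append(item)

    def retrieve(self, query, k=3):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        out = {}
        for bank, vecs in self.vecs.items():
            if not vecs:
                out[bank] = []
                continue
            sims = np.stack(vecs) @ q  # cosine similarity to each stored item
            out[bank] = [self.items[bank][i] for i in np.argsort(-sims)[:k]]
        return out
```

Retrieving from both banks with the same query is what enables the contrastive comparison at inference time.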

Candidate-action guidance. At each step, the agent evaluates multiple candidate actions. The world model retrieves similar past trajectories and uses contrastive evidence to prioritize actions aligned with successful behaviors while avoiding known failure patterns. The figure below contrasts the expert trajectory (green) with alternative actions and their resulting states.

Expert trajectory (green) vs alternative actions and resulting states.
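One simple way to turn the contrastive evidence into an action ranking is to score each candidate by its similarity to retrieved successful steps minus its similarity to failed ones. The rule below is illustrative, not the exact scoring used.

```python
import numpy as np

def rank_candidates(cand, succ, fail):
    """Rank candidate-action embeddings by contrastive evidence:
    mean cosine similarity to retrieved success steps minus mean
    similarity to retrieved failure steps (illustrative rule)."""
    def unit(x):
        x = np.asarray(x, dtype=np.float32)
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    c, s, f = unit(cand), unit(succ), unit(fail)
    score = (c @ s.T).mean(axis=1) - (c @ f.T).mean(axis=1)
    return np.argsort(-score)  # candidate indices, best first
```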

Contrastive world model pipeline. Given the current screenshot and task, the system: (1) retrieves top-k success and failure trajectories via FAISS + CLIP embeddings; (2) analyzes divergence points between successful and failed action sequences; (3) summarizes key success patterns and common pitfalls into ~200-token guidance; (4) injects guidance into the agent prompt (initial + step-level). We also use confidence checks and evidence aggregation to stabilize guidance when retrieval signals are weak or ambiguous. This is supported by implicit world modeling: Stage 1 predicts next state given action (world modeling), and Stage 2 uses continual training to predict the action given state.

Implicit World Modeling: Stage 1 (world modeling P(ŝj|s,âj)) and Stage 2 (continual training P(a|s)).
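Step (2) of the pipeline, divergence analysis, reduces to locating where a failed action sequence first departs from a matched successful one. A minimal version, with hypothetical action strings:

```python
def divergence_point(success_actions, failure_actions):
    """Index of the first step where the failed action sequence departs
    from the successful one; if one sequence is a prefix of the other,
    the shorter length is returned."""
    for i, (s, f) in enumerate(zip(success_actions, failure_actions)):
        if s != f:
            return i
    return min(len(success_actions), len(failure_actions))
```

The steps around this index are what the summarizer condenses into the ~200-token guidance.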

📉 Results

PRM evaluation (OSWorld)

After PRM fine-tuning on 184 generated training tasks, we evaluated on the OSWorld 39-task test split. Each trajectory was split into 3-step windows, yielding 505 PRM-evaluated examples. Pred=True means the PRM predicts a positive step (reward > 0).

Model / Setting                   TP   FP   FN   TN   Acc.     Prec.    Rec.     F1
Qwen3-VL-4B (zero-shot PRM)      111    9  304   81   38.02%   92.50%   26.75%   41.50%
Fine-tuned PRM (LLaMA-Factory)   313   41  102   49   71.68%   88.42%   75.42%   81.40%

Fine-tuning yields a clear improvement: accuracy increases from 38.02% to 71.68%, with a substantial reduction in false negatives (higher recall) while precision remains high.
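The reported metrics follow directly from the confusion-matrix counts in the table above; a quick helper to reproduce them:

```python
def prm_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1
```

For the fine-tuned PRM row, `prm_metrics(313, 41, 102, 49)` reproduces 71.68% / 88.42% / 75.42% / 81.40%.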

Agent evaluation (planning benchmarks)

Benchmark                          Baseline (Qwen3-VL-4B)   PRM-trained agent (ours)
MMBench-GUI (grounding)            83.17%                   82.92%
UI-Vision (planning + grounding)   27.76%                   28.13%
AndroidControl (High Step SR)      31.64%                   31.66%
Mind2Web (Step SR)                 16.50%                   17.50%
AndroidWorld (task SR)             24.00%                   24.00%
AndroidWorld (verification score)  71.00%                   86.00%
AndroidWorld (verification score) 71.00% 86.00%

Across planning-heavy benchmarks, the PRM-guided agent largely preserves grounding ability while modestly improving step-wise decision quality and substantially boosting AndroidWorld verification accuracy.

World model evaluation (WebVoyager)

We compare a CoMEM-style success-only retrieval baseline against our contrastive (success + failure) retrieval with dynamic step-level guidance. Setup: top-k=3 success + 3 failure trajectories per step (FAISS + CLIP); dynamic re-retrieval each step. On average, 40% of retrieved trajectories change between step 0 and step 5.
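The 40% figure is the fraction of the retrieved trajectory set replaced between two steps; a small helper showing how such churn could be measured (trajectory IDs hypothetical):

```python
def retrieval_churn(ids_before, ids_after):
    """Fraction of the earlier retrieval set replaced in the later one."""
    before, after = set(ids_before), set(ids_after)
    if not before:
        return 0.0
    return len(before - after) / len(before)
```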

Domain        Baseline   World Model   ∆ (pp)
Google Maps   16.67%     26.83%        +10.16
Amazon        19.51%     39.02%        +19.51
Allrecipes    15.91%     11.11%        -4.80
Coursera       2.38%      9.52%        +7.14

The contrastive world model improves success rate on 3 of 4 domains, with the largest gain on Amazon. Allrecipes shows a regression; we are investigating domain-specific retrieval and confidence filtering.

📝 Conclusion

PRM-guided agent training. We built and curated datasets for PRM training, fine-tuned a PRM, and used it to provide dense step-wise rewards for GUI-agent training. Even under a limited training budget (one epoch and short rollouts), PRM-based dense rewards preserve grounding ability while improving planning-oriented metrics and step success rates, suggesting that PRM supervision helps the agent make better local decisions during multi-step interaction.

Internal world model. We introduced a contrastive memory mechanism that retrieves both successful and failed trajectories to guide agent actions during inference. Dual FAISS indices (success vs. failure) with CLIP embeddings and state-aware step-level retrieval keep guidance aligned with the evolving GUI state. On WebVoyager, this world model improves success rate on 3 of 4 domains, with the largest gain on Amazon, while also revealing domains where additional filtering and domain-specific retrieval are needed.

Toward robust inference-time scaling. Together, PRM-guided training and the internal world model show how inference-time scaling—without increasing model size—can improve GUI agent reliability by turning longer trajectories into structured reasoning traces rather than random wandering. Future work will scale PRM supervision and strengthen contrastive world modeling so that these process-level gains translate into larger end-to-end task success improvements.

Limitations. Due to time and compute constraints, we trained the GUI agent for only 1 epoch and used only the first 4 steps of each rollout for PRM-guided agent training.

BibTeX Citation

If you use this work, please cite:

@misc{wang2025inference,
  title     = {Inference-Time Scaling for GUI Agents with Process Reward
               Models and Internal World Models},
  author    = {Wang, Bella and Wu, Rita Yujia and Liu, Shuchang and
               Huang, Ziyu},
  year      = {2025},
  url       = {https://github.com/RitaYujiaWu/DSC180-A08-GUI-Project},
  note      = {DSC180 A08 Capstone. Mentors: Kun Zhou, Zhiting Hu.},
}