Robotics today treats robots as instruction-followers: give them a command, they execute a plan. But intelligence isn't about perfect execution—it's about exploration, adaptation, and building understanding when reality doesn't match our expectations. What follows are some of my thoughts on what it means for robots to truly explore.

The instruction fallacy

When we tell a robot to “pick up the red cup,” what are we really asking it to do? At first glance, the task seems straightforward: navigate to the table, identify the cup, execute a grasp. But this instruction-following paradigm misses something fundamental about intelligence—the ability to explore, to discover what works when the world doesn’t match our expectations.

Traditional motion planners treat commands as gospel: “move to table → pick up red cup.” They execute predetermined sequences, assuming the world conforms to their models. But what happens when the cup is slippery? When lighting conditions change? When the grasp fails not once, but repeatedly? An instruction-following agent simply retries the same failed approach, unable to adapt.

This brittleness reveals a fundamental gap: these systems lack the ability to truly explore—to discover what works when their models fail. But what does exploration actually mean in robotics?

What does it mean to explore?

The debate centers on a deceptively simple question: what constitutes exploration in robotics?

One perspective argues that exploration requires venturing into continuous action spaces—trying novel motor commands, discovering physical solutions through trial and error. This classical view, common in motion planning and continuous control research, sees exploration as happening in joint angles, gripper forces, and trajectory parameters. Without this low-level experimentation, the argument goes, we're merely combining existing primitives rather than discovering genuinely novel solutions.

But there's another view: exploration can happen at the semantic level, through instruction composition and feedback loops. When an agent tries "move closer," observes failure, then adjusts to "modify gripper angle," it's exploring—not in motor space, but in strategy space. This perspective aligns more closely with how humans actually learn: we don't explore by randomly varying joint angles; we explore through semantic hypotheses like "maybe I need a firmer grip" or "perhaps I should approach from the side."

This isn't just philosophical hairsplitting. It points to three distinct levels of exploration:

  1. Perceptual exploration: What am I seeing? Where are the objects?
  2. Motor exploration: Which actions can I execute? How should I move?
  3. Strategic exploration: What should I try next? When should I switch approaches?

The hardest challenges lie not in discovering new actions, but in discovering when to transition between them.
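To make the distinction between motor-space and strategy-space exploration concrete, here is a minimal sketch. The strategy strings and the `execute` stub are placeholder assumptions standing in for a real perception-and-control stack; the point is only that the search runs over semantic hypotheses rather than joint angles.

```python
import random

# Hypothetical semantic strategies an agent might cycle through; exploration
# happens over these discrete hypotheses, not over motor parameters.
STRATEGIES = [
    "approach from the top with a wide grip",
    "approach from the side with a firmer grip",
    "slow down and re-localize the cup before grasping",
]

def execute(strategy: str) -> bool:
    """Stand-in for the embodied layer: run the strategy and report success."""
    return random.random() < 0.3   # placeholder outcome; a real system would call perception + control

def explore_in_strategy_space(max_attempts: int = 9) -> str | None:
    """Try semantic strategies in turn, switching hypotheses whenever one fails."""
    for attempt in range(max_attempts):
        strategy = STRATEGIES[attempt % len(STRATEGIES)]
        if execute(strategy):
            return strategy        # a strategy that worked
        # A failure is informative: move to a different semantic hypothesis
        # instead of retrying the same motor command.
    return None

print(explore_in_strategy_space())
```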

The subtask transition problem

Consider a seemingly simple task: opening a door to retrieve an object. The challenge isn’t opening the door or grasping the object—it’s knowing when you’ve truly finished opening the door and should proceed to grasping.

This is the subtask transition problem: determining task boundaries in a continuous, ambiguous world. An agent must learn not just how to execute subtasks, but when to recognize their completion and where to explore next. This requires something deeper than motion planning—it requires understanding semantic-geometric alignment.

But even this framing assumes something problematic: that tasks can be neatly carved into discrete, Markovian subtasks with clear boundaries.

When Markov assumptions break

The real world is not Markovian. Agents don't have access to complete state information. Critical facts hide in history, lie outside the field of view, or emerge from the world's own dynamics. When we strip away the Markov assumption, exploration transforms from "finding a feasible trajectory through state space" into "building an internal world model."

Compare two agents attempting to grasp a red cup:

Markov Agent: Observes current image, sees red cup, executes grasp policy. Fails. Retries identical approach. Fails again.

Model-Based Agent: Observes failure, then asks: Why did this fail? Is the cup too slippery? Is the lighting causing mislocalization? Was my approach angle wrong? Did I move too quickly?

The second agent isn't just acting—it's learning how the world works. This is epistemic exploration: actively gathering information to reduce uncertainty about the world itself, not just about what actions to take. This is, fundamentally, the core insight of model-based reinforcement learning—that intelligent behavior emerges not from memorizing action-value mappings, but from building and refining predictive models of how the world responds to our actions.
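The contrast can be sketched in a few lines. The failure hypotheses and probe actions below are illustrative assumptions, not a real diagnosis API; what matters is that the second agent chooses actions to test explanations rather than to repeat a policy.

```python
# Illustrative contrast between the two agents above. The observation callables,
# hypothesis list, and probe actions are assumed for the sketch.

def markov_agent_step(observe, grasp) -> str:
    """Acts on the current observation only; a failure triggers an identical retry."""
    if not grasp(observe()):
        return "retry the identical grasp"      # no memory, no diagnosis
    return "done"

FAILURE_HYPOTHESES = [
    ("cup too slippery",      "probe surface friction with a light touch"),
    ("lighting mislocalizes", "re-image the scene from a second viewpoint"),
    ("approach angle wrong",  "re-approach from the side"),
    ("moved too quickly",     "repeat the approach at half speed"),
]

def model_based_agent_step(history: dict) -> str:
    """Asks why the grasp failed and picks a probe that tests an open hypothesis."""
    for hypothesis, probe in FAILURE_HYPOTHESES:
        if hypothesis not in history.get("ruled_out", set()):
            return probe                        # act to reduce uncertainty, not just to retry
    return "revise the world model and re-plan"
```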

Coarse-grained transitions as compressed causality

What should an intelligent agent learn from exploration? Not the fine-grained details of every motion, but coarse-grained environmental transitions: meaningful semantic changes like “object: free → grasped” or “door: closed → open.”

A Markov policy might learn: “move hand toward object → adjust wrist angle → slowly close fingers → verify contact → increase grip force → lift slightly → confirm stable grasp.” This is a long, brittle chain of motor primitives.

A model-based agent learns: “free → grasped.” This single semantic transition compresses an enormous amount of physical detail—friction coefficients, contact dynamics, force distributions—into a stable, reusable unit.

These coarse transitions represent causal compression: abstraction of low-level physics into semantic events. “Door: closed → open” compresses rotational mechanics, hinge friction, applied torques, and material elasticity into a single, compositional concept.

The power of coarse transitions is generalization. Once an agent understands “free → grasped” as a semantic concept, it can apply the insight across objects, contexts, and even modalities.
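One way to picture this compression is as a data structure: a semantic transition that exposes only the before/after states and a completion predicate, hiding the low-level physics behind it. The field names and predicates below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SemanticTransition:
    """A coarse-grained, reusable world-model unit such as 'object: free -> grasped'.

    Friction coefficients, contact dynamics, and trajectories stay hidden behind
    the `achieved` predicate; only the semantic change is exposed.
    """
    variable: str                        # e.g. "cup" or "door"
    before: str                          # e.g. "free" or "closed"
    after: str                           # e.g. "grasped" or "open"
    achieved: Callable[[dict], bool]     # post-condition check on an observation

GRASP = SemanticTransition(
    variable="cup", before="free", after="grasped",
    achieved=lambda obs: obs.get("contact", False) and obs.get("moves_with_hand", False),
)

OPEN_DOOR = SemanticTransition(
    variable="door", before="closed", after="open",
    achieved=lambda obs: obs.get("door_angle_deg", 0.0) > 60.0,
)

# The same GRASP concept transfers across objects and contexts: only the
# observations feeding the predicate change, not the semantic unit itself.
```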

Exploration beyond motion

Cognitive architectures that lean on large language models (LLMs) often treat them as instruction engines: they parse commands, expand them into subtasks, and hand them off to motion planners. But LLMs can play a much deeper role. They can serve as semantic explorers, hypothesizing about the environment and suggesting structured interactions.

Imagine an LLM-driven agent facing a stuck drawer. Instead of commanding “pull harder,” the LLM can suggest: “Maybe the drawer is locked. Check for a latch. If none, inspect for obstructions.” Each suggestion encodes causal assumptions about the world.

Exploration thrives on this loop: generate hypotheses, act, observe, update. LLMs supply the hypotheses; the embodied agent executes and observes; the system reconciles expectations with outcomes.
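A minimal sketch of that loop, assuming placeholder functions for the LLM call (`llm_suggest`) and the embodied layer (`act_and_observe`); the drawer suggestions mirror the example above.

```python
def llm_suggest(situation: str) -> list[str]:
    """Stand-in for an LLM proposing causally motivated checks."""
    return [
        "check for a latch or lock",
        "inspect the drawer edges for obstructions",
        "pull gently while wiggling side to side",
    ]

def act_and_observe(suggestion: str) -> dict:
    """Stand-in for executing the suggestion and reporting what was observed."""
    return {"suggestion": suggestion, "drawer_opened": False, "latch_found": False}

def semantic_exploration(situation: str, max_rounds: int = 3) -> list[dict]:
    """Generate hypotheses, act, observe, and feed outcomes back into the prompt."""
    evidence = []
    for _ in range(max_rounds):
        for suggestion in llm_suggest(situation):
            outcome = act_and_observe(suggestion)
            evidence.append(outcome)             # reconcile expectation with outcome
            if outcome["drawer_opened"]:
                return evidence
        situation += " (previous suggestions did not open the drawer)"
    return evidence
```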

Epistemic reward functions

Instruction-following agents optimize for task completion. Exploratory agents optimize for information gain. This shift demands new reward structures.

For a door-opening task, the epistemic objective isn’t “open the door.” It’s “reduce uncertainty about the door’s mechanics.” Actions earn reward when they clarify whether the door slides, swings, or lifts; whether the handle turns or pushes; whether the system needs to unlock it first.

This reframing leads to agents that adapt faster across environments. They seek signals that disambiguate latent structure, not just actions that succeed once.
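Concretely, one simple epistemic reward is the reduction in entropy of the agent's belief over door-mechanics hypotheses. The hypothesis names and probabilities below are made up for illustration.

```python
import math

def entropy(belief: dict[str, float]) -> float:
    """Shannon entropy (in bits) of a belief over discrete hypotheses."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def epistemic_reward(belief_before: dict[str, float], belief_after: dict[str, float]) -> float:
    """Reward = how much an action reduced uncertainty about the door's mechanics."""
    return entropy(belief_before) - entropy(belief_after)

before = {"swings": 0.4, "slides": 0.3, "lifts": 0.2, "locked": 0.1}
after  = {"swings": 0.8, "slides": 0.1, "lifts": 0.05, "locked": 0.05}  # belief after pushing the handle
print(round(epistemic_reward(before, after), 3))   # positive: the push was informative
```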

LLMs as belief updaters

Most integrations treat LLMs as static planners. Instead, we should treat them as belief updaters that maintain probability distributions over world hypotheses.

When encountering a door, the system should:

  1. Generate hypotheses: LLM proposes {push-door: 0.4, pull-door: 0.3, slide-door: 0.2, automatic: 0.1}
  2. Explore: Agent tries interactions, observes outcomes
  3. Update beliefs: Reweight hypotheses based on evidence
  4. Plan: Execute high-probability strategy

The reward function changes from “Did you open the door?” to “Did you reduce uncertainty about the door type?” This is the shift from outcome-driven to epistemic exploration.
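A sketch of that four-step loop with a plain Bayesian update: the prior mirrors the hypotheses above, while the likelihood table (how probable each observation is under each door type) is an illustrative assumption.

```python
PRIOR = {"push": 0.4, "pull": 0.3, "slide": 0.2, "automatic": 0.1}

# Assumed P(observation | door type) for the probe "push on the door and watch it".
LIKELIHOOD = {
    "door_moved_inward": {"push": 0.8, "pull": 0.1, "slide": 0.1, "automatic": 0.3},
    "door_did_not_move": {"push": 0.2, "pull": 0.9, "slide": 0.9, "automatic": 0.7},
}

def update_belief(belief: dict[str, float], observation: str) -> dict[str, float]:
    """Bayes rule: reweight each hypothesis by how well it explains the observation."""
    unnormalized = {h: belief[h] * LIKELIHOOD[observation][h] for h in belief}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

belief = update_belief(PRIOR, "door_did_not_move")   # explore, then update beliefs
plan = max(belief, key=belief.get)                   # plan: execute the highest-probability strategy
print(belief, plan)
```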

Training belief-based agents

How do we train such a system? Not by teaching robots to open doors, but by teaching them to figure out what kind of door they’re facing.

The training loop involves co-training between two levels:

Agent Level (M-step): Given hypotheses from LLM, execute interactions, collect evidence, measure belief updates.

LLM Level (E-step): Given interaction outcomes, refine hypothesis generation and belief update strategies.

This is EM-style alternating optimization: the agent learns how to gather evidence that discriminates between hypotheses, and the LLM learns how to generate better hypotheses and update beliefs more accurately.

The reward signal for the agent isn’t task success—it’s information gain: how much did this action reduce uncertainty?
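Structurally, one round of this co-training might look like the sketch below, with information gain as the agent's reward. `propose_hypotheses`, `run_probe`, `update_belief`, and `refine_llm` are placeholders for the LLM and embodied components; only the shape of the loop is the point.

```python
import math

def entropy(belief: dict[str, float]) -> float:
    """Shannon entropy (in bits) of a belief over discrete hypotheses."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def co_training_round(propose_hypotheses, run_probe, update_belief, refine_llm, env):
    # Agent level: given the LLM's hypotheses, act to gather discriminating evidence.
    belief = propose_hypotheses(env)                    # e.g. {"push": 0.4, "pull": 0.3, ...}
    probe_outcome = run_probe(belief, env)              # interact and observe
    new_belief = update_belief(belief, probe_outcome)
    info_gain = entropy(belief) - entropy(new_belief)   # the agent's reward signal

    # LLM level: given the outcome, refine how hypotheses are generated and updated.
    refine_llm(probe_outcome, new_belief)
    return info_gain, new_belief
```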

The credit assignment problem

When a subtask fails, how does the system know whether the semantic plan was wrong or just the motor execution? This is cross-layer credit assignment, and it requires generating expectations—predictive models of what should happen if everything goes right.

The system needs semantic expectation models: "If grasping succeeds, I should observe: object and hand moving together; increased contact force; decreased object-hand distance." One promising approach is to use video generation models (like Veo3) to visualize expected outcomes before execution: hand approaching, fingers closing, cup lifting.

After execution, compare reality against expectation. If the hand reached the cup but the grasp failed, the problem is in control: adjust grip force. If the hand never reached the expected position, the problem is in planning: revise the strategy. In short, contact detected but object slipped points to motor failure; no contact achieved at all points to semantic failure.

This diagnostic process allows the system to distinguish between "try harder" (motor adjustment) and "try differently" (strategy revision).
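As a sketch, the diagnostic can be written as a comparison between an expectation record and what perception actually reports; the field names here are assumptions for illustration.

```python
# Expected observations if the grasp succeeds, produced by a predictive model.
GRASP_EXPECTATION = {
    "hand_reached_object": True,
    "contact_detected": True,
    "object_moves_with_hand": True,
}

def assign_credit(observation: dict) -> str:
    """Compare reality against expectation to decide which layer failed."""
    if not observation.get("hand_reached_object", False):
        return "semantic failure: revise the plan"          # try differently
    if observation.get("contact_detected") and not observation.get("object_moves_with_hand"):
        return "motor failure: adjust grip force or speed"  # try harder
    if all(observation.get(k) == v for k, v in GRASP_EXPECTATION.items()):
        return "success: proceed to the next subtask"
    return "ambiguous: gather more evidence before re-planning"

print(assign_credit({"hand_reached_object": True,
                     "contact_detected": True,
                     "object_moves_with_hand": False}))
```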

Expectation variance and exploration among humans

What's fascinating is that humans exhibit dramatic variation in how they handle expectation mismatches. Some people show remarkable persistence, staying focused on a problem and iterating until they find a solution. Others adapt quickly, pivoting to new strategies when early attempts don't work. This variance in expectation tolerance shapes our individual cognitive styles.

An agent with updateable beliefs, predictive models, and tunable expectation variance can exhibit the same flexibility. It can decide when to persist versus when to pivot. When to trust its model versus when to gather more information. This parameter, expectation variance, becomes a knob for exploration strategy: high-variance agents explore aggressively, while low-variance agents exploit existing strategies.
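One simple way to realize this knob, purely as an illustrative assumption, is to treat expectation variance as a softmax temperature over candidate strategies: high variance spreads probability across many strategies (aggressive exploration), while low variance concentrates on the current best one (exploitation).

```python
import math
import random

def choose_strategy(strategy_values: dict[str, float], expectation_variance: float) -> str:
    """Sample a strategy, using expectation variance as a softmax temperature."""
    strategies = list(strategy_values)
    weights = [math.exp(strategy_values[s] / expectation_variance) for s in strategies]
    return random.choices(strategies, weights=weights)[0]

# Estimated value of each semantic strategy so far (illustrative numbers).
values = {"persist with the current grasp": 0.6,
          "approach from the side": 0.4,
          "re-localize the cup first": 0.2}

print(choose_strategy(values, expectation_variance=5.0))   # near-uniform sampling: explore
print(choose_strategy(values, expectation_variance=0.05))  # near-greedy choice: exploit
```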

This isn't just a technical detail. It's how intelligent agents understand their relationship to uncertainty itself.

Conclusion: exploration as understanding

Rethinking exploration in robotics means moving beyond the instruction-following paradigm. It means building agents that:

  • Form hypotheses about how the world works
  • Actively test those hypotheses through interaction
  • Update their beliefs based on evidence
  • Learn coarse-grained causal structures rather than fine-grained motor sequences
  • Use semantic knowledge as hypothesis generators, not answer sheets
  • Distinguish between failures of understanding and failures of execution

This is exploration not as random search through action space, but as systematic inquiry into the nature of reality. It's the difference between an agent that executes commands and one that learns to understand.

I believe we're at a rare convergence point. Foundation models give us semantic reasoning, modern hardware enables real-time world modeling, and model-based RL provides the learning framework. For the first time, we have the tools to build systems that learn the way humans do: through curiosity, hypothesizing, and adaptation.