Phase-1 validated · May 2026

AnimalTaskSim

Testing what makes a decision animal-like.

Mice and monkeys don't just chase rewards — they hesitate, retry, get distracted. AnimalTaskSim asks the smallest version of a hard question: which parts of a learning agent are actually necessary to behave like a real animal? Then it tests the answer the way labs do — by removing one piece at a time and watching what breaks.

View on GitHub · Python · MIT License · open source

It's not a benchmark. It's a microscope.

Most reinforcement-learning leaderboards ask: how much reward can an agent earn? AnimalTaskSim asks something stranger and more useful: can an agent be wrong in the same ways a real animal is wrong?

Real mice slow down on hard trials. They repeat what worked and switch after surprises. They occasionally lapse on trivial questions. Those "imperfections" are the signal — they reveal the control machinery a brain actually uses. We measure them, then test which pieces of an agent's architecture are necessary to reproduce them.

The agent has five working parts

We call this the adaptive-control agent. It's a computational analogy — not a literal brain map — borrowing the structure that mouse, monkey, and human brains seem to use to balance evidence, memory, and persistence.

Input: current stimulus
Input: previous trial
1

Evidence core

"What does the stimulus say right now?"

Watches the input and accumulates evidence over time, the way neurons in sensory cortex do. Built on a drift-diffusion simulator — the math that turns "stimulus strength" into a choice and a reaction time.

2

Outcome state

"What just happened?"

A short-term memory of recent actions, rewards, and surprises. The agent's sense of "how is this session going?"

3

Persistence controller

"Should I keep trying?"

Pushes the agent to repeat a choice after a failure that wasn't clearly the agent's fault — the kind of borderline trial where animals also try again.

4

Exploration controller

"Should I try something else?"

Nudges toward sampling alternatives when the recent action history feels stale. (Spoiler: this one didn't survive its own test — see below.)

5

Arbitration layer

"Which voice gets heard?"

Decides how loud each voice gets, weighting the evidence core against the persistence and exploration controllers on every trial.

Output: choice + reaction time
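The evidence core's drift-diffusion step (part 1) can be sketched in a few lines. This is a minimal, illustrative simulator, not the repo's implementation; `ddm_trial` and its parameter values are assumptions made for the sketch.

```python
import math
import random

def ddm_trial(drift, rng, threshold=1.0, noise=1.0, dt=0.001, max_t=5.0):
    """One drift-diffusion trial: noisy evidence accumulates until it
    crosses +threshold (choose 1) or -threshold (choose 0).
    Returns (choice, reaction_time_seconds)."""
    x, t = 0.0, 0.0
    while abs(x) < threshold and t < max_t:
        x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return (1 if x > 0 else 0), t

rng = random.Random(42)
easy = [ddm_trial(3.0, rng) for _ in range(200)]  # strong stimulus
hard = [ddm_trial(0.3, rng) for _ in range(200)]  # near-threshold stimulus
# Stronger drift yields more choices toward the drift direction and faster
# reaction times, the same pattern the agent-vs-mouse panels show below.
```

Strength-dependent accuracy and reaction time fall out of the same accumulation process, which is why the two curves move together in the comparison figure.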

Why uncertainty-gating matters. The arbitration layer multiplies control signals by how unsure the agent is. When the stimulus is obvious, persistence and exploration get muted — they can't override clear evidence. This single mechanism is what keeps the agent from devolving into stubbornness or randomness.
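As a toy illustration of that gate (the function name and the linear uncertainty map are assumptions for the sketch, not the repo's code):

```python
def gated_bias(persistence_drive, explore_drive, stimulus_strength):
    """Scale control drives by uncertainty before they can bias the choice.
    stimulus_strength in [0, 1]: 0 = coin-flip trial, 1 = obvious trial."""
    uncertainty = 1.0 - min(abs(stimulus_strength), 1.0)
    return uncertainty * (persistence_drive + explore_drive)

# Obvious stimulus: the controllers are fully muted, evidence dominates.
muted = gated_bias(0.5, 0.2, stimulus_strength=1.0)  # -> 0.0
# Coin-flip stimulus: the controllers pass through at full strength.
loud = gated_bias(0.5, 0.2, stimulus_strength=0.0)
```

The multiplicative gate is the design choice that matters: persistence and exploration can only speak when the evidence core has little to say.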

Side-by-side with a real mouse

The blue line is our agent. The gray line is a real mouse, averaged across 10 sessions and 8,406 trials from the International Brain Laboratory [1]. Three behaviors, one comparison.

Three-panel comparison of the adaptive-control agent against an IBL mouse: accuracy curve, reaction-time curve, and history effects
(a) Accuracy curve

How often the agent picks the rightward target as the stimulus shifts. The agent's curve sits inside the per-mouse range.

(b) Reaction time

Both agent and mouse get faster as the stimulus gets stronger — the signature of evidence accumulation.

(c) History effects

After a win, repeat. After a loss, switch. The agent leans the right direction; it under-stays compared to the mouse.
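Panel (c)'s readout can be computed directly from a sequence of (choice, rewarded) pairs. This is a minimal sketch over an invented toy log, not the repo's analysis code:

```python
def win_stay_lose_shift(trials):
    """P(repeat choice | previous trial rewarded) and
    P(switch choice | previous trial failed)."""
    win_stay = wins = lose_shift = losses = 0
    for (prev_choice, prev_reward), (choice, _) in zip(trials, trials[1:]):
        if prev_reward:
            wins += 1
            win_stay += (choice == prev_choice)
        else:
            losses += 1
            lose_shift += (choice != prev_choice)
    return win_stay / max(wins, 1), lose_shift / max(losses, 1)

# Toy log: (choice, rewarded) per trial.
toy = [(1, True), (1, True), (0, False), (1, False), (1, True), (1, True)]
ws, ls = win_stay_lose_shift(toy)  # ws = 2/3, ls = 1/2
```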

The lesion experiment

Train four versions of the agent: full machinery, none of it, and each controller alone. Then run the same task on each. If a piece is necessary for some behavior, removing it should change that behavior in a measurable, repeatable way.

| Condition | Accuracy slope | RT slope (ms/unit) | Retry gap | Stale-switch lift |
| --- | --- | --- | --- | --- |
| No control | 27.71 | −48.54 | +0.057 | −0.073 |
| Exploration only | 24.00 | −38.83 | +0.092 | −0.160 |
| Persistence only | 21.75 | −33.47 | +0.164 | −0.159 |
| Full control | 22.26 | −33.97 | +0.165 | −0.152 |

Retry gap = how much more often the agent retries after a borderline failure than after an obvious one. Higher = more uncertainty-driven persistence. Stale-switch lift = how much more often the agent switches when its action history has gone stale. Higher = more curiosity-driven exploration. Means across 5 random seeds.
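The retry-gap probe reduces to a conditional retry rate split by difficulty. A sketch with invented field names and a toy log (the repo's per-trial records are richer):

```python
def retry_gap(trials):
    """P(retry | borderline failure) minus P(retry | obvious failure).
    Each trial: (choice, rewarded, difficulty)."""
    rate = {}
    for level in ("borderline", "obvious"):
        retries = fails = 0
        for (c0, r0, d0), (c1, _, _) in zip(trials, trials[1:]):
            if not r0 and d0 == level:
                fails += 1
                retries += (c1 == c0)  # retried = repeated the same choice
        rate[level] = retries / max(fails, 1)
    return rate["borderline"] - rate["obvious"]

toy = [
    (1, False, "borderline"), (1, True, "obvious"),
    (1, False, "obvious"),    (0, True, "borderline"),
    (0, False, "borderline"), (0, True, "obvious"),
]
gap = retry_gap(toy)  # -> 1.0: this toy agent retries only after borderline failures
```

A positive gap means failures on hard trials are followed by retries more often than failures on easy ones, which is the uncertainty-driven persistence signature.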

Four-panel summary of behavioral readouts across the lesion suite

Bars show means across 5 seeds; error bars are 1 standard deviation. Retry gap rises monotonically as control machinery is added back — the smoking gun for persistence. Stale-switch lift stays stubbornly negative everywhere — the smoking gun for exploration not working in this design.

What survives the test

Comparing each lesion to the no-control baseline, seed by seed. We count not just whether the average effect points the right way, but how many seeds individually agree.

Supported

Persistence (the retry instinct)

Δ retry gap: +0.109
5 / 5 seeds positive

Adding persistence reliably makes the agent retry after borderline failures — every single seed shows the effect.

Failed isolation probe

Exploration (the curiosity instinct)

Δ stale-switch lift: −0.079
0 / 5 seeds positive

The exploration mechanism failed its own test. The probe shows the wrong sign in every seed — we won't pretend otherwise.

Paired lesion deltas vs. no-control baseline, seed by seed

Every bar pair compares one lesion to the same seed's no-control run. Numbers are positive-seed counts (n / N) — how many of five seeds move in the expected direction. 5 / 5 for retry (green); 0 / 5 for stale-switch (purple) in every adaptive condition.
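The positive-seed tally is just a paired delta count; a sketch with made-up numbers (not the published values):

```python
def positive_seed_count(lesion, baseline):
    """Paired, seed-by-seed comparison: how many seeds show a positive
    delta (lesion metric minus the same seed's baseline metric)?"""
    assert len(lesion) == len(baseline)
    deltas = [l - b for l, b in zip(lesion, baseline)]
    return sum(d > 0 for d in deltas), len(deltas)

# Illustrative retry-gap values for 5 seeds.
full_control = [0.16, 0.17, 0.15, 0.18, 0.16]
no_control   = [0.05, 0.06, 0.05, 0.07, 0.06]
n_pos, n = positive_seed_count(full_control, no_control)  # -> (5, 5)
```

Pairing by seed removes between-seed variance, so the count reflects whether the mechanism helps on each run, not just on average.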

Honest scope

Supported

Uncertainty-gated retry / persistence

The full agent reliably retries after a borderline failure — positive in 5 / 5 seeds vs. the no-control lesion. Persistence alone recovers ~98% of that effect.

Not yet supported

Rewarded-streak exploration

The exploration controller failed its own probe in every seed. The mechanism needs a different gate, a different probe, or both. We're saying so on the front page rather than burying it.

Why this matters

The probe, not the architecture, is the science

The same lesion-and-probe pipeline can ask, for any candidate circuit, whether it's necessary to produce a behavior we see in animals. The architecture is a hypothesis. The probe is the test.

Two canonical experiments

Task environments mirror the lab protocols exactly. Same stimulus levels, same timing, same response window — anything else and the comparison isn't real.

Mouse 2AFC

Laboratory mice

A mouse sees a faint pattern on the left or right of a screen and turns a wheel toward it. Easy patterns are obvious; the hardest ones are a coin flip.

Reference data

8,406 trials across 10 sessions (International Brain Laboratory) [1]

Accuracy curve · Reaction-time curve · Win-stay / lose-shift · Lapse rate

Macaque RDM

Rhesus macaques

A monkey watches a cloud of dots moving. Most go one direction — but how many? Higher consensus means an easier call. The monkey decides when it's sure enough to commit.

Reference data

2,611 trials from Roitman & Shadlen (Shadlen Lab) [2]

Accuracy curve · Reaction-time curve · Bias · Lapse rate

Built with

Python 3.11+ · PyTorch · Gymnasium · Stable-Baselines3 · Pydantic · SciPy · NumPy · Pandas · Matplotlib

Every trial logged to schema-validated .ndjson. Deterministic seeding. CPU-friendly runs. The environment owns logging — agents never touch the trial log directly.
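A stdlib-only sketch of what schema-validated .ndjson parsing looks like (the project itself uses Pydantic, and these field names are illustrative, not the repo's actual schema):

```python
import json
from dataclasses import dataclass

@dataclass
class TrialRecord:
    # Illustrative fields; the real, Pydantic-validated schema is richer.
    trial: int
    stimulus: float
    choice: int
    rewarded: bool
    rt_ms: float

    def __post_init__(self):
        if self.choice not in (0, 1):
            raise ValueError("choice must be 0 or 1")
        if self.rt_ms < 0:
            raise ValueError("rt_ms must be non-negative")

def load_ndjson(text):
    """Parse one trial per line, failing loudly on malformed records."""
    return [TrialRecord(**json.loads(line))
            for line in text.splitlines() if line.strip()]

log = (
    '{"trial": 0, "stimulus": 0.25, "choice": 1, "rewarded": true, "rt_ms": 412.0}\n'
    '{"trial": 1, "stimulus": -0.125, "choice": 0, "rewarded": false, "rt_ms": 655.5}'
)
trials = load_ndjson(log)
```

Validating at write time is what lets the lesion suite trust every downstream metric: a malformed trial never makes it into a dashboard.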

Run it yourself

Four commands take you from a fresh clone to a published-style agent-vs-animal dashboard.

Terminal
# 1. Install
git clone https://github.com/ermanakar/animaltasksim.git
cd animaltasksim && pip install -e ".[dev]"

# 2. Train one adaptive-control run
python scripts/train_adaptive_control.py \
    --output-dir runs/demo --task ibl_2afc \
    --seed 42 --episodes 5 --epochs 3

# 3. Run the full lesion suite (4 conditions × 5 seeds)
python scripts/adaptive_control_validation_suite.py \
    --run-root runs/validation_suite

# 4. Build an agent-vs-animal dashboard
python scripts/make_dashboard.py \
    --opts.agent-log runs/demo/trials.ndjson \
    --opts.reference-log data/ibl/reference.ndjson \
    --opts.output runs/demo/dashboard.html

References

  1. International Brain Laboratory et al. (2021). Standardized and reproducible measurement of decision-making in mice. Neuron, 109(7), 1166–1180.
  2. Roitman, J. D., & Shadlen, M. N. (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22(21), 9475–9489.
  3. Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation, 20(4), 873–922.
  4. Urai, A. E., et al. (2019). Mechanisms of choice history biases in perceptual decisions. Nature Communications, 10(1), 1983.

Read the full story

Open source, MIT licensed. The repository documents the wins, the negative results, and every wrong turn it took to find them.