AnimalTaskSim
Testing what makes a decision animal-like.
Mice and monkeys don't just chase rewards — they hesitate, retry, get distracted. AnimalTaskSim asks the smallest version of a hard question: which parts of a learning agent are actually necessary to behave like a real animal? Then it tests the answer the way labs do — by removing one piece at a time and watching what breaks.
It's not a benchmark. It's a microscope.
Most reinforcement-learning leaderboards ask: how much reward can an agent earn? AnimalTaskSim asks something stranger and more useful: can an agent be wrong in the same ways a real animal is wrong?
Real mice slow down on hard trials. They repeat what worked and switch after surprises. They occasionally lapse on trivial questions. Those "imperfections" are the signal — they reveal the control machinery a brain actually uses. We measure them, then test which pieces of an agent's architecture are necessary to reproduce them.
The agent has five working parts
We call this the adaptive-control agent. It's a computational analogy — not a literal brain map — borrowing the structure that mouse, monkey, and human brains seem to use to balance evidence, memory, and persistence.
Evidence core
"What does the stimulus say right now?"
Watches the input and accumulates evidence over time, the way neurons in sensory cortex do. Built on a drift-diffusion simulator — the math that turns "stimulus strength" into a choice and a reaction time.
Outcome state
"What just happened?"
A short-term memory of recent actions, rewards, and surprises. The agent's sense of "how is this session going?"
Persistence controller
"Should I keep trying?"
Pushes the agent to repeat a choice after a failure that wasn't clearly the agent's fault — the kind of borderline trial where animals also try again.
Exploration controller
"Should I try something else?"
Nudges toward sampling alternatives when the recent action history feels stale. (Spoiler: this one didn't survive its own test — see below.)
Arbitration layer
"Which voice gets heard?"
Decides how loud each voice gets, trial by trial, so no single controller can hijack the decision.
Why uncertainty-gating matters. The arbitration layer multiplies control signals by how unsure the agent is. When the stimulus is obvious, persistence and exploration get muted — they can't override clear evidence. This single mechanism is what keeps the agent from devolving into stubbornness or randomness.
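In miniature, the whole loop fits in a few lines. Here is a sketch of one uncertainty-gated drift-diffusion trial; every name is illustrative, not the repo's actual API:

```python
import random

def run_trial(stimulus, persist_bias, explore_bias,
              drift_gain=1.0, noise=1.0, threshold=1.0, dt=0.01):
    """One drift-diffusion trial with uncertainty-gated control.

    `stimulus` is signed evidence strength (+ right, - left).
    `persist_bias` / `explore_bias` are the controllers' votes.
    All names here are hypothetical sketches, not the repo's API.
    """
    # Uncertainty is high when the stimulus is weak, low when it is obvious.
    uncertainty = 1.0 - min(abs(stimulus), 1.0)
    # The arbitration layer mutes control signals on easy trials.
    control = uncertainty * (persist_bias + explore_bias)

    x, t = 0.0, 0.0
    while abs(x) < threshold:
        x += (drift_gain * stimulus + control) * dt      # evidence drift
        x += random.gauss(0.0, noise) * dt ** 0.5        # diffusion noise
        t += dt
    choice = "right" if x > 0 else "left"
    return choice, t  # choice and reaction time in seconds
```

Note what the gate buys you: with an obvious stimulus, `uncertainty` is zero, so even enormous persistence or exploration biases cannot move the accumulator.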
Side-by-side with a real mouse
The blue line is our agent. The gray line is a real mouse, averaged across 10 sessions and 8,406 trials from the International Brain Laboratory [1]. Three behaviors, one comparison.

How often the agent picks the rightward target as the stimulus shifts. The agent's curve sits inside the per-mouse range.
Both agent and mouse get faster as the stimulus gets stronger — the signature of evidence accumulation.
After a win, repeat. After a loss, switch. The agent leans in the right direction, though it repeats winning choices less often than the mouse does.
The lesion experiment
Train four versions of the agent: full machinery, none of it, and each controller alone. Then run the same task on each. If a piece is necessary for some behavior, removing it should change that behavior in a measurable, repeatable way.
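The suite itself is just a grid. A sketch of how the 4 × 5 lesion grid might be enumerated; the condition names and flags are illustrative, not the repo's configuration format:

```python
from itertools import product

# Hypothetical lesion conditions: which controllers stay switched on.
CONDITIONS = {
    "no_control":       dict(persistence=False, exploration=False),
    "exploration_only": dict(persistence=False, exploration=True),
    "persistence_only": dict(persistence=True,  exploration=False),
    "full_control":     dict(persistence=True,  exploration=True),
}
SEEDS = [0, 1, 2, 3, 4]

def lesion_grid():
    """Yield every (condition, flags, seed) cell of the 4 x 5 suite."""
    for (name, flags), seed in product(CONDITIONS.items(), SEEDS):
        yield name, flags, seed
```

Each cell trains and probes one agent; because every seed appears in every condition, effects can be compared within a seed rather than only on averages.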
| Condition | Accuracy slope | RT slope (ms/unit) | Retry gap | Stale-switch lift |
|---|---|---|---|---|
| No control | 27.71 | −48.54 | +0.057 | −0.073 |
| Exploration only | 24.00 | −38.83 | +0.092 | −0.160 |
| Persistence only | 21.75 | −33.47 | +0.164 | −0.159 |
| Full control | 22.26 | −33.97 | +0.165 | −0.152 |
Retry gap = how much more often the agent retries after a borderline failure than after an obvious one. Higher = more uncertainty-driven persistence. Stale-switch lift = how much more often the agent switches when its action history has gone stale. Higher = more curiosity-driven exploration. Means across 5 random seeds.

Bars show means across 5 seeds; error bars are 1 standard deviation. Retry gap rises monotonically as control machinery is added back — the smoking gun for persistence. Stale-switch lift stays stubbornly negative everywhere — the smoking gun for exploration not working in this design.
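The retry-gap probe can be computed straight from a trial log. A sketch assuming a simplified record schema ('choice', 'correct', signed 'stimulus'), not the repo's exact fields:

```python
def retry_gap(trials, easy=0.5):
    """Retry gap: P(repeat | borderline failure) - P(repeat | obvious failure).

    `trials` is an ordered list of dicts with 'choice', 'correct', and
    signed 'stimulus' fields -- an illustrative schema for this sketch.
    """
    def repeat_rate(pred):
        # Pair each trial with its successor; keep pairs whose first
        # trial matches the predicate, then count repeated choices.
        pairs = [(a, b) for a, b in zip(trials, trials[1:]) if pred(a)]
        if not pairs:
            return 0.0
        return sum(a["choice"] == b["choice"] for a, b in pairs) / len(pairs)

    borderline_fail = lambda t: not t["correct"] and abs(t["stimulus"]) < easy
    obvious_fail    = lambda t: not t["correct"] and abs(t["stimulus"]) >= easy
    return repeat_rate(borderline_fail) - repeat_rate(obvious_fail)
```

A positive gap means failures on hard trials make the agent try again, while failures on easy trials do not, which is the uncertainty-driven signature the table tracks.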
What survives the test
We compare each lesion to the no-control baseline, seed by seed, counting not just whether the average effect points the right way but how many seeds individually agree.
Persistence (the retry instinct)
Adding persistence reliably makes the agent retry after borderline failures — every single seed shows the effect.
Exploration (the curiosity instinct)
The exploration mechanism failed its own test. The probe shows the wrong sign in every seed — we won't pretend otherwise.

Every bar pair compares one lesion to the same seed's no-control run. Numbers are positive-seed counts (n / N) — how many of five seeds move in the expected direction. 5 / 5 for retry (green); 0 / 5 for stale-switch (purple) in every adaptive condition.
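The seed-agreement count is deliberately simple. A sketch of the paired comparison, assuming each list holds one probe value per seed in matching order:

```python
def positive_seed_count(lesion_vals, baseline_vals):
    """How many seeds move in the expected (positive) direction,
    comparing each lesion run to the same seed's no-control run."""
    return sum(l > b for l, b in zip(lesion_vals, baseline_vals))
```

Pairing within a seed cancels seed-level noise: a 5 / 5 count is far stronger evidence than a positive mean, because no single lucky seed can carry it.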
Honest scope
Uncertainty-gated retry / persistence
The full agent reliably retries after a borderline failure — positive in 5 / 5 seeds vs. the no-control lesion. Persistence alone recovers ~98% of that effect.
Rewarded-streak exploration
The exploration controller failed its own probe in every seed. The mechanism needs a different gate, a different probe, or both. We're saying so on the front page rather than burying it.
The probe, not the architecture, is the science
The same lesion-and-probe pipeline can ask, for any candidate circuit, whether it's necessary to produce a behavior we see in animals. The architecture is a hypothesis. The probe is the test.
Two canonical experiments
Task environments mirror the lab protocols exactly. Same stimulus levels, same timing, same response window — anything else and the comparison isn't real.
Mouse 2AFC
A mouse sees a faint pattern on the left or right of a screen and turns a wheel toward it. Easy patterns are obvious; the hardest ones are a coin flip.
Reference data
8,406 trials across 10 sessions (International Brain Laboratory) [1]
Macaque RDM
A monkey watches a cloud of dots moving. Most go one direction — but how many? Higher consensus means an easier call. The monkey decides when it's sure enough to commit.
Reference data
2,611 trials from Roitman & Shadlen (Shadlen Lab) [2]
Built with
Every trial logged to schema-validated .ndjson. Deterministic seeding. CPU-friendly runs. The environment owns logging — agents never touch the trial log directly.
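A sketch of what that logging contract could look like; the field names and schema are illustrative, not the repo's actual ones:

```python
import json

# Illustrative trial schema: field name -> required type.
REQUIRED = {"trial": int, "stimulus": float, "choice": str,
            "correct": bool, "rt_ms": float}

def log_trial(path, record):
    """Validate a trial record against the schema, then append it as
    one line of .ndjson. In this design only the environment calls this;
    agents never touch the trial log directly."""
    for field, typ in REQUIRED.items():
        if not isinstance(record.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```

One validated JSON object per line means any run can be replayed or re-probed later with standard tools, and a malformed record fails loudly at write time instead of corrupting an analysis.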
Run it yourself
Four steps take you from a fresh clone to a published-style agent-vs-animal dashboard.
```bash
# 1. Install
git clone https://github.com/ermanakar/animaltasksim.git
cd animaltasksim && pip install -e ".[dev]"

# 2. Train one adaptive-control run
python scripts/train_adaptive_control.py \
  --output-dir runs/demo --task ibl_2afc \
  --seed 42 --episodes 5 --epochs 3

# 3. Run the full lesion suite (4 conditions × 5 seeds)
python scripts/adaptive_control_validation_suite.py \
  --run-root runs/validation_suite

# 4. Build an agent-vs-animal dashboard
python scripts/make_dashboard.py \
  --opts.agent-log runs/demo/trials.ndjson \
  --opts.reference-log data/ibl/reference.ndjson \
  --opts.output runs/demo/dashboard.html
```
References
1. International Brain Laboratory et al. (2021). Standardized and reproducible measurement of decision-making in mice. Neuron, 109(7), 1166–1180.
2. Roitman, J. D., & Shadlen, M. N. (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22(21), 9475–9489.
3. Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation, 20(4), 873–922.
4. Urai, A. E., et al. (2019). Mechanisms of choice history biases in perceptual decisions. Nature Communications, 10(1), 1983.
Read the full story
Open source, MIT licensed. Documents the wins, the negative results, and every wrong turn it took to find them.