Thinking Machines Lab Debuts Real-Time Interaction Models
Mira Murati's Thinking Machines Lab has released its first research preview introducing "interaction models" — AI designed from scratch for continuous real-time multimodal processing rather than bolting streaming onto a turn-based LLM stack. The core architectural argument: achieving genuine real-time co-presence across audio, video, and text requires a new model class, not incremental inference optimization. A limited preview is expected within months.
What the Source Actually Says
The lead release is TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts model with 12B active parameters at inference time. Rather than waiting for full user turns, it treats all I/O as continuous 200ms micro-turn streams, processing audio, video, and text in parallel. Encoder-free early fusion and batch-invariant kernels underpin the streaming training regime — choices NLP Newsletter describes as enabling the model to "listen, look, and speak in parallel."
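To make the reported figures concrete, here is a minimal back-of-envelope sketch. The parameter counts and the 200ms window come from the coverage above. The 16 kHz audio sample rate and all function names are illustrative assumptions, not details of TML's implementation.

```python
# Figures reported in the coverage above.
TOTAL_PARAMS_B = 276      # total parameters, in billions
ACTIVE_PARAMS_B = 12      # active parameters per token, in billions
MICRO_TURN_MS = 200       # micro-turn window length

# Assumption for illustration only: the source does not state a sample rate.
SAMPLE_RATE_HZ = 16_000


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the MoE's parameters that are active at inference time."""
    return active_b / total_b


def samples_per_micro_turn(sample_rate_hz: int, window_ms: int) -> int:
    """Number of audio samples that fall inside one micro-turn window."""
    return sample_rate_hz * window_ms // 1000


def split_into_micro_turns(samples: list, sample_rate_hz: int, window_ms: int) -> list:
    """Cut a sample buffer into consecutive fixed-length micro-turn chunks.

    The last chunk may be shorter if the buffer does not divide evenly.
    """
    step = samples_per_micro_turn(sample_rate_hz, window_ms)
    return [samples[i:i + step] for i in range(0, len(samples), step)]


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    print(f"active share: {share:.1%}")
    one_second = list(range(SAMPLE_RATE_HZ))
    chunks = split_into_micro_turns(one_second, SAMPLE_RATE_HZ, MICRO_TURN_MS)
    print(f"micro-turns per second: {len(chunks)}")
```

Under these assumptions, roughly 4.3% of the parameters are active per token, and one second of audio splits into five micro-turns of 3,200 samples each.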
A second, asynchronous background reasoning model handles complex cognition without blocking the interaction loop. This deliberate two-tier split keeps the foreground model latency-bounded while offloading deep deliberation to a heavier async process, and it operates at the infrastructure level rather than as a prompt-routing heuristic. The design echoes dual-process cognitive models and recent multi-agent orchestration patterns, but here it is hardcoded into the architecture itself.
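The two-tier split can be illustrated with a toy asyncio sketch. This is not TML's implementation: `deep_reasoner`, `interaction_loop`, the "?" prefix for marking a hard request, and the timings are all hypothetical stand-ins. The point is only that the foreground loop keeps ticking on a fixed cadence while a slower task runs concurrently.

```python
import asyncio


async def deep_reasoner(question: str, delay_s: float) -> str:
    """Hypothetical stand-in for the asynchronous background reasoning model."""
    await asyncio.sleep(delay_s)  # simulates slow deliberation
    return f"answer to {question!r}"


async def interaction_loop(chunks, tick_s=0.2, reasoning_delay_s=0.5):
    """Foreground loop: handles one input chunk per tick and never awaits the reasoner.

    Returns a log of (turn, kind, text) tuples, where kind is
    "heard", "ack", or "result".
    """
    log = []
    pending = None  # at most one background task in this toy version
    for turn, chunk in enumerate(chunks):
        # Surface a finished background result without blocking on it.
        if pending is not None and pending.done():
            log.append((turn, "result", pending.result()))
            pending = None
        if chunk.startswith("?") and pending is None:
            # Hard request: hand it to the background task and keep going.
            pending = asyncio.create_task(deep_reasoner(chunk[1:], reasoning_delay_s))
            log.append((turn, "ack", chunk[1:]))
        else:
            log.append((turn, "heard", chunk))
        await asyncio.sleep(tick_s)  # fixed micro-turn cadence
    if pending is not None:
        # Input stream ended: only now is it safe to wait for the reasoner.
        log.append((len(chunks), "result", await pending))
    return log


def run_demo(chunks, tick_s=0.2, reasoning_delay_s=0.5):
    """Synchronous convenience wrapper around the async loop."""
    return asyncio.run(interaction_loop(chunks, tick_s, reasoning_delay_s))


if __name__ == "__main__":
    for entry in run_demo(["hello", "?plan a trip", "still here", "go on", "ok"]):
        print(entry)
```

In this sketch the foreground loop logs an entry for every input chunk regardless of how long the background task takes, which is the latency-bounding property the paragraph above describes. A real system would also need cancellation, multiple in-flight requests, and a policy for weaving late results back into the conversation.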
On FD-bench v1.5, a benchmark specifically targeting interrupt handling, visual proactivity, and overlapping speech, TML-Interaction-Small scores 77.8 against a competitor range of 39.0–54.3. AI Search's weekly YouTube roundup independently corroborates the two-model architecture and highlights capabilities only non-turn-based designs can provide: overlapping speech support, visual cue awareness, and time-sensitive mid-task prompting.
Strategic Take
The two-model split — always-present lightweight interaction model plus async deep reasoner — is a reference architecture worth tracking for applications requiring persistent co-presence: voice agents, live tutors, whiteboard collaborators. The 77.8 FD-bench claim awaits independent replication, but the structural argument against retrofitting turn-based LLMs for real-time interaction is sound.


