ml-intern: HuggingFace Releases a Full-Loop Autonomous Post-Training Agent
HuggingFace has released ml-intern, an open-source agent that executes the complete machine learning research-to-training pipeline from a single high-level prompt. It is one of the most concrete demonstrations yet of autonomous AI conducting AI research work — not pausing for human direction at every step, but completing the full loop independently.
What the Source Actually Says
Drop a prompt such as "build the best scientific reasoning model" and ml-intern handles the rest: it reads arXiv papers and citation graphs to identify relevant techniques, pulls and cleans datasets from HF Hub, implements SFT and GRPO training scripts, launches jobs via HF Jobs or local infrastructure, monitors run metrics, diagnoses failures, runs ablations, and iterates until results improve.
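The outer control flow described here (train, evaluate, diagnose, iterate until results improve) can be sketched in miniature. This is a hypothetical illustration of the loop shape, not ml-intern's actual code; the function names, config fields, and stopping rule are all assumptions.

```python
import random

def post_training_loop(evaluate, improve, baseline, max_rounds=5):
    """Hypothetical sketch of a train/evaluate/iterate loop.

    `evaluate` scores a candidate training config (think: GPQA accuracy
    after a run), `improve` proposes a revision (think: the result of an
    ablation). Not ml-intern's real API.
    """
    best_score = baseline
    best_config = {"lr": 1e-5, "epochs": 1}
    for _ in range(max_rounds):
        candidate = improve(best_config)   # propose a tweaked config
        score = evaluate(candidate)        # run and score it
        if score > best_score:             # keep only improvements
            best_score, best_config = score, candidate
    return best_score, best_config

# Toy stand-ins for the real train/eval steps:
def toy_improve(cfg):
    return {**cfg, "lr": cfg["lr"] * random.choice([0.5, 2.0])}

def toy_evaluate(cfg):
    # Pretend accuracy peaks near lr = 2e-5.
    return 0.32 - abs(cfg["lr"] - 2e-5) * 1000

score, cfg = post_training_loop(toy_evaluate, toy_improve, baseline=0.10)
```

The point of the sketch is the shape, not the contents: the agent only ever commits to a candidate that beats its current best, which is what makes the process safe to leave unsupervised.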
The benchmark numbers are concrete. On Qwen3-1.7B targeting scientific reasoning, ml-intern pushed GPQA accuracy from 10% to 32% in under 10 hours, outpacing Claude Code's best-reported score of 22.99% on the same benchmark. In a healthcare domain test, the agent judged the available datasets too low quality, generated 1,100 synthetic data points from scratch, and beat OpenAI's Codex on HealthBench by 60%. In a math domain test, it wrote a GRPO script, launched it on A100s via HF Spaces, watched reward curves collapse, ran ablations, and recovered, all unsupervised.
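GRPO, the RL method the agent scripted here, scores each sampled completion against the others in its group rather than against a learned value function. A minimal sketch of that group-relative advantage computation, following the formula from the GRPO paper (variable names are mine):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of sampled completions.

    Each completion's advantage is its reward normalized by the group
    mean and standard deviation, so no critic network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by a reward function:
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Above-average completions get positive advantages, below-average negative.
```

This also hints at what "watching reward curves collapse" means in practice: when every completion in a group earns the same (or near-zero) reward, the advantages vanish into noise and the policy stops learning, which is exactly the kind of failure signal an automated monitor can catch.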
HuggingFace CEO Clément Delangue ran a personal test: 1 hour of post-training, approximately $1 of compute on HF Jobs. He has provisioned $1,000 in GPU credits plus Anthropic API credits for early community users. The project is available as both a CLI (github.com/huggingface/ml-intern) and a web app (huggingface.co/spaces/smolagents/ml-intern).
Strategic Take
ml-intern directly challenges the assumption that human judgment is required at every training decision point. For AI engineering teams, the entry cost is deliberately low: small base models, commodity compute, hours rather than weeks of wall time. The open-source release means the agent's scaffolding is also a legible, forkable reference for teams building domain-specific post-training pipelines.