HuggingFace ml-intern: Autonomous Post-Training Agent Pushes GPQA 10% → 32% in Under 10 Hours

HuggingFace has released ml-intern, an open-source tool that takes a high-level prompt ("build the best scientific reasoning model") and autonomously executes the full post-training pipeline: it reads arXiv papers and citation graphs, pulls and cleans HF Hub datasets, implements SFT, GRPO, and synthetic-data generation, launches training jobs on HF Jobs or Spaces, monitors runs, diagnoses failures, and iterates with ablations. On scientific reasoning, it pushed Qwen3-1.7B from 10% to 32% on GPQA in under 10 hours; on healthcare, it generated 1,100 synthetic data points from scratch and beat Codex on HealthBench by 60%. Clement Delangue tested it for 1 hour at a cost of roughly $1 in compute.
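GRPO, one of the post-training methods the agent applies, scores a group of sampled completions per prompt and normalizes each reward against the group's own statistics. A minimal sketch of that group-relative advantage step (an illustration of the general technique, not ml-intern's actual code):

```python
def group_relative_advantages(rewards):
    """Normalize each completion's reward against its sampling group.

    rewards: list of scalar rewards for one prompt's completion group.
    Returns one advantage per completion: (r - group mean) / group std.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # All completions scored identically: no relative signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled completions with verifier rewards:
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions above the group mean get positive advantages and are reinforced; below-mean completions are pushed down, with no learned value model required.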

Why It Matters

ml-intern is a credible demonstration of fully automated model improvement with no human in the training loop — a direct challenge to the assumption that post-training requires an ML engineering team. If the benchmark gains hold at scale, this reframes what a small AI team can accomplish without a dedicated research function. Source: GitHub | Web app.