HuggingFace ml-intern: Autonomous Post-Training Agent Pushes GPQA 10% → 32% in Under 10 Hours

HuggingFace has released ml-intern, an open-source tool that takes a high-level prompt ("build the best scientific reasoning model") and autonomously executes the full post-training pipeline: it reads arXiv papers and citation graphs, pulls and cleans HF Hub datasets, implements SFT, GRPO, and synthetic-data generation, launches training jobs on HF Jobs or Spaces, monitors runs, diagnoses failures, and iterates with ablations. On scientific reasoning, it pushed Qwen3-1.7B from 10% to 32% on GPQA in under 10 hours; on healthcare, it generated 1,100 synthetic data points from scratch and beat Codex on HealthBench by 60%. Clement Delangue tested it for 1 hour at a cost of roughly $1 in compute.
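GRPO, one of the post-training methods the agent applies, scores a group of sampled completions per prompt and normalizes each reward against the group's own statistics. A minimal sketch of that group-relative advantage step (an illustration of the general technique, not ml-intern's actual code):

```python
def group_relative_advantages(rewards):
    """Normalize each completion's reward against its sampling group.

    rewards: list of scalar rewards for one prompt's completion group.
    Returns one advantage per completion: (r - group mean) / group std.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # All completions scored identically: no relative signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled completions with verifier rewards:
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions above the group mean get positive advantages and are reinforced; below-mean completions are pushed down, with no learned value model required.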

Why It Matters

ml-intern is a credible demonstration of fully automated model improvement with no human in the training loop — a direct challenge to the assumption that post-training requires an ML engineering team. If the benchmark gains hold at scale, this reframes what a small AI team can accomplish without a dedicated research function. Source: GitHub | Web app.