Last week we released s1 - our simple recipe for sample-efficient reasoning & test-time scaling.
We’re releasing s1.1, trained on the same 1K questions but performing much better by using r1 traces instead of Gemini traces: 60% on AIME25 I.
Details in 🧵1/9
DeepSeek r1 is exciting but does not reproduce OpenAI’s test-time scaling plot and needs lots of data.
We introduce s1, which reproduces o1-preview scaling & performance with just 1K samples & a simple test-time intervention.
📜arxiv.org/abs/2501.19393