I tested 9 flagship models (Claude 4.6, GPT-5.2, Gemini 3.1 Pro, Kimi K2.5, etc.) on my own mini-benchmark of novel tasks, with web search disabled, zero training contamination, and no cheating possible.

TL;DR: Claude 4.6 is currently the best reasoning model, GPT-5.2 is overrated, and open-source is catching up fast; Moonshot.ai’s Kimi K2.5 in particular seems very capable.

  • Cyteseer@lemmy.world · 18 upvotes · 1 day ago

    I don’t really understand how your 6 questions evaluate growth or a plateau in LLM performance. The models did perform a certain way on your questions, but growth has to be evaluated through the lens of time, whether literally or by comparing multiple versions of the same model.