Refact.ai Agent + Claude 3.7 Sonnet scored 92.9% (no Thinking) and 93.3% (with Thinking) on Polyglot Benchmark — fully autonomous!
Benchmark: Aider’s Polyglot (225 hardest coding exercises) across C++, Go, Java, JS, Python, and Rust.
Result: 20 points ahead of the current highest score on the leaderboard (72.9% by Aider with Gemini 2.5 Pro).
How? Refact.ai handles programming tasks end-to-end in your IDE — with high accuracy & without human input:
- Acts autonomously at every step.
- Takes an iterative approach: plans, executes, tests, and self-corrects on its own until the task is fully solved.
- Deeply integrates with dev tools and the environment, enabling the Agent to act independently.
- Self-tests and revises steps mid-process, running multiple checks if needed.
- Solves tasks in ≤30 steps, optimizing token usage.
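For readers who want a concrete picture, the plan/execute/test/self-correct loop with a step budget might look roughly like the sketch below. This is not Refact.ai's actual implementation: `run_agent`, `propose`, `run_tests`, and `RunResult` are hypothetical names made up for illustration, and the only detail taken from the post is the 30-step cap.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_STEPS = 30  # step budget mentioned in the post


@dataclass
class RunResult:
    solved: bool
    steps_used: int
    history: List[str] = field(default_factory=list)  # test failures seen so far


def run_agent(
    propose: Callable[[List[str]], str],   # hypothetical: model call, sees past failures
    run_tests: Callable[[str], str],       # hypothetical: returns "" on success, else failure text
    max_steps: int = MAX_STEPS,
) -> RunResult:
    """Propose a candidate, test it, and feed failures back until solved or out of steps."""
    history: List[str] = []
    for step in range(1, max_steps + 1):
        candidate = propose(history)       # plan + execute
        failure = run_tests(candidate)     # test
        if not failure:
            return RunResult(True, step, history)
        history.append(failure)            # self-correct on the next iteration
    return RunResult(False, max_steps, history)


if __name__ == "__main__":
    # Toy stand-ins: the third attempt passes.
    attempts = iter(["draft 1", "draft 2", "final"])
    result = run_agent(
        propose=lambda past: next(attempts),
        run_tests=lambda code: "" if code == "final" else f"{code} failed",
    )
    print(result.solved, result.steps_used)
```

The point of the budget is that a run which never converges stops after a fixed number of model calls instead of burning tokens indefinitely.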
This is much closer to real-world software development and vibe coding: developers can delegate entire tasks to the AI Agent while doing other work, then simply receive the final result.
__
Thinking vs. No-Thinking Mode
Thinking Mode improved accuracy by 0.4 percentage points (92.9% to 93.3%) but used 2x the tokens.
__
Full breakdown, approach reveal & insights: https://refact.ai/blog/2025/refact-ai-agent-achieves-93-3-on...
Happy to discuss!