Last updated: March 2026

Google Gemini 3.1 Pro

Google keeps pushing Gemini back into the frontier conversation, then seeing the lead challenged again within weeks by OpenAI or Anthropic. Gemini 3.1 Pro, released on February 20, 2026, is the latest example of that cycle.

The headline number is 77.1% on ARC-AGI-2, a benchmark designed to test logic patterns the model has likely not seen before. That is materially higher than Gemini 3 Pro’s prior result, and it put Gemini 3.1 Pro near the front of the pack on this particular test at launch.

Whether that lead lasts a month or only a release cycle is hard to know. The more durable story is that Google paired stronger results with pricing that still sits below the most expensive frontier options.

The Benchmarks

Artificial Analysis, a third-party evaluation firm, placed Gemini 3.1 Pro at or near the top across several categories at launch. Google’s published numbers, echoed by external testing, included:

  • ARC-AGI-2 (novel logic reasoning): 77.1%
  • GPQA Diamond (scientific knowledge): 94.3%
  • SWE-Bench Verified (real-world coding): 80.6%
  • LiveCodeBench Pro (competitive coding): 2887 Elo
  • MMMLU (multilingual understanding): 92.6%

Google says these gains come from architecture optimization and better multimodal data fusion, not just scaling parameters. The model handles text, images, and code simultaneously, maintaining coherence across modalities. That’s been Gemini’s pitch since day one (“native multimodal”), and 3.1 Pro is the strongest execution of it yet.

What’s Actually New

The practical improvements matter more than the benchmark scores.

Adjustable reasoning depth. VentureBeat describes it as a “Deep Think Mini” — you can dial the reasoning effort up or down depending on the task. Simple questions get fast answers. Complex problems get extended thinking chains. This is similar to what OpenAI did with o3’s reasoning tokens, but integrated into the base model rather than a separate “reasoning mode.”

SVG and 3D generation from text. The model can generate animated SVGs directly from prompts — code-based animations that stay scalable and tiny compared to video. One demo had it building a live aerospace dashboard from a public telemetry stream, visualizing the ISS orbit in real time. Another created a 3D starling murmuration with hand-tracking interaction and generative audio.

Better “vibe coding.” Hostinger Horizons reported that 3.1 Pro understands the “vibe” behind a prompt, translating intent into style-accurate code for non-developers. Cartwheel’s co-founder noted it fixed long-standing rotation order bugs in 3D animation pipelines that previous models couldn’t handle.

The Pricing Play

This is where the launch becomes more commercially relevant. Gemini 3.1 Pro costs exactly the same as its predecessor:

          Standard (≤200K tokens)    Long context (>200K tokens)
Input     $2.00 / 1M tokens          $4.00 / 1M tokens
Output    $12.00 / 1M tokens         $18.00 / 1M tokens

For comparison, Anthropic’s paid Claude flagship and OpenAI’s premium reasoning tiers still sit in a meaningfully higher price band. Gemini 3.1 Pro is competitive with those frontier models on some benchmarks at materially lower cost.

JetBrains reported a 15% quality improvement over previous Gemini versions while noting the model is “stronger, faster, and more efficient, requiring fewer output tokens.” Fewer output tokens at the same quality means lower bills for the same work.
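To make the arithmetic concrete, here is a minimal Python sketch that estimates the cost of a single request from the prices listed above. The function name, the sample token counts, and the 20% output reduction are hypothetical and chosen only for illustration. The sketch also assumes the tier is decided by prompt size and that the long-context rate applies to the whole request, which is a simplification rather than a statement of Google's billing rules.

```python
# Prices in USD per 1M tokens, copied from the pricing table in this article.
PRICES = {
    "standard": {"input": 2.00, "output": 12.00},  # prompts of 200K tokens or fewer
    "long": {"input": 4.00, "output": 18.00},      # prompts above 200K tokens
}
LONG_CONTEXT_THRESHOLD = 200_000


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request.

    Assumption: the tier is selected by the prompt (input) size, and the
    selected tier's rates apply to every token in the request.
    """
    tier = "long" if input_tokens > LONG_CONTEXT_THRESHOLD else "standard"
    rates = PRICES[tier]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical workload: a 50K-token prompt with a 4K-token answer.
baseline = estimate_cost(50_000, 4_000)
# Same prompt, but the answer is 20% shorter (illustrative figure only).
shorter = estimate_cost(50_000, 3_200)

print(f"baseline: ${baseline:.4f}")  # baseline: $0.1480
print(f"shorter:  ${shorter:.4f}")   # shorter:  $0.1384
```

Because output tokens cost six times as much as input tokens on the standard tier, trimming the answer moves the bill more than trimming the prompt by the same number of tokens.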

For free users, Google offers 10 complex reasoning requests and 50 basic conversations per day, with a 100K token context window. The paid tier ($20/month) removes limits and extends context to 1 million tokens. Developers get 1,000 free API calls per month through Google AI Studio.

The Context Window Question

Gemini’s 1 million token context window (for paid users) remains its biggest structural advantage. Competing paid frontier tiers generally expose materially shorter standard windows. A million tokens is roughly 750,000 words — enough to process entire codebases, book-length documents, or hours of meeting transcripts in a single prompt.
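As a rough planning aid, the words-to-tokens ratio above can be turned into a quick fit check. This is a minimal sketch: the 0.75 words-per-token figure is only the approximation implied by this article, real tokenizers vary by language and content, and the function names and the output reserve are hypothetical.

```python
# Approximation implied by "1M tokens is roughly 750,000 words".
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate a token count from a whitespace-separated word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits_in_context(text: str, context_window: int = 1_000_000,
                    reserve_for_output: int = 8_000) -> bool:
    """Check whether the text, plus room for the reply, fits the window.

    reserve_for_output is a hypothetical safety margin, not a documented limit.
    """
    return estimate_tokens(text) + reserve_for_output <= context_window


# A 90,000-word manuscript is about 120,000 estimated tokens.
manuscript = "word " * 90_000
print(estimate_tokens(manuscript))                          # 120000
print(fits_in_context(manuscript))                          # True
print(fits_in_context(manuscript, context_window=100_000))  # False
```

For anything near a limit, counting tokens with the provider's own tokenizer is safer than a word-based estimate.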

But there’s a catch that the marketing doesn’t mention. Multiple users and technical bloggers have reported “memory decay” in the back half of very long contexts. Those reports put the drop in information-extraction accuracy at roughly 8% in the latter portions of documents approaching the million-token limit. The context window exists, but the model’s attention isn’t uniform across it.

This is a known limitation across long-context models generally, but it matters here because Google’s product messaging leans heavily on the million-token number.

Where It Falls Short

No model is perfect, and 3.1 Pro has clear gaps:

Multilingual mixed tasks. Several testers report stuttering when the model has to switch between languages within a single task. If you’re working in English and Chinese simultaneously, expect some friction.

Prompt injection resistance. Google hasn’t published specific numbers on this, and independent testing is still early. For agent-style use cases (where the model processes untrusted content from emails or websites), this matters. Claude has historically led on safety benchmarks, and that gap may persist.

Ecosystem maturity. OpenAI has Codex, ChatGPT plugins, and deep enterprise integrations. Anthropic has Claude Code and a growing agent ecosystem. Google’s developer tooling around Gemini is improving but still plays catch-up in terms of third-party integrations and community support.

The Competitive Picture

The AI model leaderboard changes monthly. Gemini 3.1 Pro’s more durable contribution is not a temporary benchmark lead, but the fact that frontier-level reasoning can be offered at mid-tier pricing. At $2/$12 per million tokens, it stays accessible to individual developers and smaller teams in a way that Opus-tier pricing often does not.

Google also teased Gemini 3.1 Ultra, currently in internal testing, with a 2 million token context window and real-time video analysis. It is expected in Q2 2026. The gaps between releases are shrinking and the jumps are getting bigger. For users, that is directionally good news, even if the exact lead changes quickly.