Sources

Central registry of books, papers, articles, forecasts, and public statements cited across the baseline, timeline, predictions, and weekly updates.

kurzweil-singularity-near-2005

The Singularity Is Near

Ray Kurzweil, 2005

book
singularity timelines accelerating-returns

Law of accelerating returns; original 2029 Turing test and 2045 Singularity predictions.

tegmark-life-3-0-2017

Life 3.0

Max Tegmark, 2017

book
superintelligence existential-risk scenarios

Scenarios for coexistence with superintelligent AI.

bostrom-superintelligence-2014

Superintelligence: Paths, Dangers, Strategies

Nick Bostrom, 2014

book
superintelligence existential-risk alignment

Foundational existential-risk framing for advanced AI.

russell-human-compatible-2019

Human Compatible

Stuart Russell, 2019

book
alignment safety agency

Inverse reinforcement learning approach to alignment.

hanson-age-of-em-2016

The Age of Em

Robin Hanson, 2016

book
economics emulation labor

Economic analysis of brain emulation scenarios.

ord-precipice-2020

The Precipice

Toby Ord, 2020

book
existential-risk governance

Existential risk landscape including AI.

barrat-our-final-invention-2013

Our Final Invention

James Barrat, 2013

book
superintelligence existential-risk

Risks of artificial superintelligence.

ford-architects-of-intelligence-2018

Architects of Intelligence

Martin Ford, 2018

book
interviews timelines industry

Interviews with leading AI researchers on future trajectories.

christian-alignment-problem-2020

The Alignment Problem

Brian Christian, 2020

book
alignment safety machine-learning

Accessible overview of alignment challenges in current ML.

agrawal-prediction-machines-2018

Prediction Machines

Ajay Agrawal, Joshua Gans, Avi Goldfarb, 2018

book
economics business decision-making

Economic framework for AI as cheap prediction.

brynjolfsson-competing-age-ai-2020

Competing in the Age of AI

Marco Iansiti, Karim R. Lakhani, 2020

book
economics labor business

AI-driven transformation of business and labor markets.

kelly-what-technology-wants-2010

What Technology Wants

Kevin Kelly, 2010

book
technology-evolution philosophy

Technology as an evolving system with its own tendencies.

kurzweil-singularity-nearer-2024

The Singularity Is Nearer

Ray Kurzweil, 2024

book
singularity timelines accelerating-returns programming-feedback-loop

Updated predictions. Identifies programming as main bottleneck for superintelligent AI; positive feedback loop once AI achieves sufficient programming ability.

espai-survey-2023

Expert Survey on Progress in AI (ESPAI 2023)

AI Impacts / ESPAI, 2023

survey
survey timelines expert-opinion

1,714 AI researchers. 50% HLMI by 2047, 50% FAOL by 2116.

metaculus-community-forecasts-2026

Metaculus AGI / ASI Forecast Questions

Metaculus community, 2026

survey
forecasting timelines community

~1,700 forecasters. 50% AGI by Nov 2033; weakly general AI Oct 2027. Feb 2026 data. Median timelines lengthened slightly over the past year, despite a long-term collapse from ~50 years out in 2020.

goodheart-agi-timelines-dashboard-2026

When Might We Achieve AGI?

Goodheart Labs, 2026

website
forecasting timelines prediction-markets metaculus

AGI timelines dashboard aggregating Metaculus, Manifold, and Kalshi forecasts. On May 23, 2026, the combined forecast estimated AGI in 2031 with an 80% interval of 2027-2043.

forecaster-surveys-2024-2025

2024–2025 Forecaster Surveys

Various forecasters, 2025

survey
survey timelines

More aggressive than ESPAI: 50% HLMI by 2030, 90% by 2040.

amodei-agi-prediction-2025

Amodei on powerful AI by 2026–2027

Dario Amodei, 2025

statement
timelines agi industry

Anthropic CEO. 'Country of geniuses in a datacenter.' Anthropic official position (March 2025): powerful AI in late 2026 or early 2027.

altman-agi-asi-prediction-2024

Altman on AGI and superintelligence timelines

Sam Altman, 2024

statement
timelines agi asi industry

OpenAI CEO. AGI 2025–2029 ('sloppy term'). ASI by ~2028: 'more intellectual capacity in data centers than outside.'

altman-agi-confidence-2025

Altman on AGI confidence and superintelligence by 2030

Sam Altman, 2025

statement
timelines agi asi industry

January 2025: 'We are now confident we know how to build AGI as we have traditionally understood it.' Claims GPT-5 is 'already smarter than me in many ways.' Predicts superintelligence by 2030. Corporate actions: $500B Stargate, 800M+ weekly ChatGPT users, Jony Ive IO acquisition ($6.5B).

hassabis-agi-prediction-2025

Hassabis on AGI timeline

Demis Hassabis, 2025

statement
timelines agi industry

DeepMind CEO. '5–10 years' from March 2025 (= 2030–2035). Coding and math fastest; scientific discovery harder.

hassabis-agi-prediction-2026

Hassabis narrows AGI estimate at India AI Impact Summit

Demis Hassabis, 2026

statement
timelines agi industry

Narrowed from '5–10 years' to 'maybe within the next five years.' Requires 'one or two more major breakthroughs on the level of the Transformer or AlphaGo.' AGI must include genuine invention and creativity: 'Could a system invent Go, or come up with relativity?'

huang-agi-prediction-2024

Huang on AGI timeline

Jensen Huang, 2024

statement
timelines agi industry

Nvidia CEO. AGI within 5 years (2029) in March 2024. Shifted to 'already here' in Nov 2025.

lecun-agi-prediction-2025

LeCun on AGI timeline

Yann LeCun, 2025

statement
timelines agi industry

Meta Chief AI Scientist. 'At least a decade, probably much more.' LLMs will not lead to AGI; new architectures needed.

legg-agi-prediction-2025

Legg on minimal AGI by 2028

Shane Legg, 2025

statement
timelines agi

DeepMind co-founder. 50% chance of 'minimal AGI' by 2028.

critch-agi-prediction-2025

Critch on AGI probability

Andrew Critch, 2025

statement
timelines agi

AI researcher. 45% chance of AGI by end of 2026.

barnett-transformative-ai-2025

Barnett training loss extrapolation

Matthew Barnett, 2025

statement
timelines transformative-ai extrapolation

Median for transformative AI ~2033 based on training loss extrapolation.

musk-agi-prediction-2026

Musk on AGI by end of 2026

Elon Musk, 2026

statement
timelines agi industry

Claims AGI by year-end 2026. Grok 5 (6T parameters, Q1 2026) has '~10% chance of achieving AGI.' xAI acquired by SpaceX at $250B valuation Feb 2026.

sutskever-ssi-scaling-2026

Sutskever on the end of simple scaling

Ilya Sutskever, 2026

statement
timelines scaling research-directions

Running Safe Superintelligence Inc. ($32B valuation, ~20 employees, zero revenue). 'Age of simple scaling is ending'; next breakthrough requires fundamentally new learning methods.

karpathy-rlvr-agents-2025

Karpathy on RLVR and agent timelines

Andrej Karpathy, 2025

statement
reasoning agents inference-compute

Frames RLVR (reinforcement learning from verifiable rewards) as delivering high capability per dollar and absorbing compute that previously went to pretraining. 'Year of the agent' is really 'decade of the agent.'

amodei-scaling-2026

Amodei at Morgan Stanley: scaling not hitting a wall

Dario Amodei, 2026

statement
timelines scaling industry

March 2026 Morgan Stanley conference. Scaling laws have 'not hit a wall at all.' Predicts 'radical acceleration in 2026.'

odlyzko-ai-bubble-warning-2026

Odlyzko on AI investment bubble dynamics

Andrew Odlyzko, 2026

statement
economics bubble investment

University of Minnesota researcher. Warns circular AI financing structures (OpenAI/NVIDIA/AMD/Microsoft cross-investments) are 'typical of bubbles.'

gartner-agentic-ai-forecast-2025

Gartner forecast on agentic AI adoption

Gartner, 2025

statement
agents enterprise adoption forecasting

Projects 40% of enterprise apps will embed agents by end of 2026, up from <5% in 2025.

openai-gpt-5-3-codex-2026

Introducing GPT-5.3-Codex

OpenAI, 2026

article
models agents coding cyber self-improvement

February 2026 release. OpenAI describes GPT-5.3-Codex as its most capable agentic coding model to date, 25% faster than GPT-5.2-Codex, with gains on SWE-Bench Pro, Terminal-Bench 2.0, OSWorld, GDPval, and cybersecurity. OpenAI states early versions helped debug training, deployment, and evals.

openai-gpt-5-5-2026

Introducing GPT-5.5

OpenAI, 2026

article
models agents coding knowledge-work efficiency

April 23, 2026 release. OpenAI frames GPT-5.5 as a model for real work across code, research, data analysis, documents, spreadsheets, and software operation. Reports 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro, with better token efficiency than GPT-5.4.

openai-gpt-5-5-system-card-2026

GPT-5.5 System Card

OpenAI, 2026

article
safety cyber biosecurity model-card

OpenAI system card for GPT-5.5. Describes predeployment safety evaluations, cybersecurity and biology safeguards, and release posture for GPT-5.5 and GPT-5.5 Pro.

openai-swe-bench-verified-contamination-2026

Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities

OpenAI, 2026

article
benchmarks coding agents evaluation contamination

February 23, 2026 OpenAI analysis arguing SWE-bench Verified is no longer suitable for frontier coding launches. OpenAI audited a subset of hard failures and found at least 59.4% had flawed tests, plus evidence frontier models could reproduce gold patches or problem specifics; recommends SWE-bench Pro instead.

anthropic-claude-opus-4-6-2026

Introducing Claude Opus 4.6

Anthropic, 2026

article
models agents coding long-context knowledge-work

February 5, 2026 release. Opus 4.6 introduced 1M token context in beta for Opus-class models, 128k output tokens, agent teams in Claude Code, context compaction, adaptive thinking, and stronger long-running coding and knowledge-work performance.

anthropic-claude-sonnet-4-6-2026

Introducing Claude Sonnet 4.6

Anthropic, 2026

article
models agents coding long-context knowledge-work

February 17, 2026 release. Anthropic describes Sonnet 4.6 as an upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design, with a 1M token context window in beta.

anthropic-claude-opus-4-7-2026

Introducing Claude Opus 4.7

Anthropic, 2026

article
models agents coding vision cyber-safeguards

April 16, 2026 release. Anthropic describes Opus 4.7 as stronger than Opus 4.6 on advanced software engineering, high-resolution vision, memory, and multi-step enterprise workflows. It is also used to test cyber safeguards before broader Mythos-class releases.

anthropic-project-glasswing-2026

Project Glasswing

Anthropic, 2026

article
cybersecurity safety models gated-access misuse

April 7, 2026 initiative giving launch partners gated access to Claude Mythos Preview for defensive security. Anthropic reports Mythos Preview found thousands of high-severity vulnerabilities and argues frontier coding models can surpass all but the most skilled humans at vulnerability discovery and exploitation.

anthropic-claude-managed-agents-2026

Claude Managed Agents: Get to Production 10x Faster

Anthropic, 2026

article
agents enterprise runtime orchestration developer-tools

April 2026 Anthropic announcement of Claude Managed Agents, a hosted agent harness and production runtime with standard token rates plus $0.08 per active session-hour. Evidence that frontier vendors are productizing agent orchestration rather than only model endpoints.

anthropic-managed-agents-docs-2026

Get Started with Claude Managed Agents

Anthropic, 2026

website
agents enterprise runtime permissions developer-tools

Claude Platform docs for Managed Agents. Defines agents, environments, sessions, and events; Managed Agents API requests require the managed-agents-2026-04-01 beta header and support Anthropic-managed cloud containers or self-hosted sandboxes.

anthropic-self-hosted-sandboxes-2026

Self-hosted Sandboxes

Anthropic, 2026

website
agents enterprise data-governance permissions security

Claude Platform docs for running Managed Agent tool execution in customer-controlled infrastructure. Anthropic keeps orchestration while code, filesystem, and network egress remain in the customer's environment.

anthropic-mcp-tunnels-2026

MCP Tunnels

Anthropic, 2026

website
agents mcp enterprise data-governance security

Claude Platform docs for MCP tunnels, a research-preview feature connecting Claude to private-network MCP servers through outbound-only connections without opening inbound firewall ports or exposing services publicly.

anthropic-mythos-red-team-2026

Assessing Claude Mythos Preview's cybersecurity capabilities

Anthropic Frontier Red Team, 2026

article
cybersecurity safety red-team misuse agents

Technical writeup on Claude Mythos Preview. Describes autonomous vulnerability discovery and exploit development, including exploit chains and comparisons to Opus 4.6. Useful as evidence for offense-defense asymmetry and gated-release logic.

google-gemini-3-1-pro-2026

Gemini 3.1 Pro: A smarter model for your most complex tasks

Google, 2026

article
models reasoning agents benchmarks

February 19, 2026 release. Google describes Gemini 3.1 Pro as an upgraded core intelligence model for complex tasks, rolling out across Gemini API, Vertex AI, Google AI Studio, Gemini CLI, Antigravity, Gemini app, and NotebookLM. Reports 77.1% verified score on ARC-AGI-2.

google-deep-research-max-2026

Deep Research Max: a step change for autonomous research agents

Google DeepMind, 2026

article
agents research mcp knowledge-work

April 21, 2026 release. Google frames Deep Research and Deep Research Max, built with Gemini 3.1 Pro, as autonomous research agents with MCP support, native visualizations, and stronger long-horizon analytical workflows.

xai-grok-4-20-reasoning-2026

Grok 4.20 Reasoning

xAI, 2026

website
models reasoning agents

xAI developer documentation for the Grok 4.20 reasoning model. Cited in this registry as evidence of the February 2026 frontier release cadence.

openai-principles-2026

Our principles

OpenAI, 2026

article
governance safety company-strategy superintelligence

April 26, 2026 statement by Sam Altman. Reframes OpenAI's public principles around broad access to general AI, democratic governance, decentralized power, infrastructure expansion, and safety, with less emphasis on the older AGI-charter language.

microsoft-openai-partnership-amendment-2026

The next phase of the Microsoft OpenAI partnership

Microsoft / OpenAI, 2026

article
industry cloud compute infrastructure economics

April 27, 2026 amended agreement. Microsoft remains OpenAI's primary cloud partner, but OpenAI can serve products across any cloud provider; Microsoft's OpenAI IP license becomes non-exclusive through 2032; revenue-share terms are simplified.

openai-aws-bedrock-2026

OpenAI models, Codex, and Managed Agents come to AWS

OpenAI / AWS, 2026

article
industry cloud agents coding enterprise

April 28, 2026 limited preview bringing OpenAI models, Codex, and Amazon Bedrock Managed Agents powered by OpenAI into AWS environments. Important signal that frontier models and coding agents are becoming multicloud enterprise infrastructure.

aws-bedrock-openai-managed-agents-2026

Amazon Bedrock Now Offers OpenAI Models, Codex, and Managed Agents

Amazon Web Services, 2026

statement
agents cloud enterprise coding permissions

April 28, 2026 AWS limited-preview announcement. Bedrock OpenAI offerings inherit IAM, PrivateLink, guardrails, encryption, and CloudTrail logging; Managed Agents powered by OpenAI have per-agent identity, action logs, and run in customer AWS environments with inference on Bedrock.

dod-classified-ai-agreements-2026

Classified Networks AI Agreements

U.S. Department of War, 2026

article
governance defense military agents national-security

May 1, 2026 announcement of agreements with SpaceX, OpenAI, Google, NVIDIA, Reflection, Microsoft, and Amazon Web Services to deploy advanced AI capabilities on IL6 and IL7 classified networks for lawful operational use.

cursor-sdk-2026

Build programmatic agents with the Cursor SDK

Cursor, 2026

article
agents coding software-engineering developer-tools

April 28, 2026 public beta of a TypeScript SDK exposing Cursor's agent runtime for local, cloud, CI/CD, and embedded product workflows. Evidence that coding agents are becoming programmable infrastructure rather than only interactive IDE tools.

cursor-in-jira-2026

Cursor in Jira

Cursor, 2026

statement
agents coding software-engineering workflow enterprise

May 19, 2026 Cursor changelog announcing Jira integration: assigning work items or mentioning @Cursor starts a cloud agent that scopes the task from the Jira item and repository settings, then posts completion updates and a pull-request link.

warp-open-source-agentic-development-2026

Warp is now open-source

Warp, 2026

article
agents coding software-engineering open-source

April 28, 2026 announcement that Warp's client is open source and organized around agent-first workflows using Oz, with OpenAI as founding sponsor. Useful as an example of agent-managed software development moving into public repos.

ibm-granite-4-1-2026

Granite 4.1

IBM, 2026

website
models open-source enterprise efficiency

Granite 4.1 family of Apache 2.0 dense language models in 3B, 8B, and 30B sizes, with instruction-tuned variants, FP8 quantization, and improvements in tool calling, instruction following, coding, and mathematical reasoning.

caisi-frontier-testing-agreements-2026

CAISI Signs Agreements Regarding Frontier AI National Security Testing With Google DeepMind, Microsoft and xAI

NIST Center for AI Standards and Innovation, 2026

statement
governance safety evaluation national-security

May 5, 2026 announcement expanding CAISI collaborations with Google DeepMind, Microsoft, and xAI for pre-deployment evaluations, post-deployment assessment, classified-environment testing, and national-security research. Builds on renegotiated OpenAI and Anthropic partnerships.

caisi-deepseek-v4-pro-evaluation-2026

CAISI Evaluation of DeepSeek V4 Pro

NIST Center for AI Standards and Innovation, 2026

statement
models benchmarks china evaluation efficiency

May 1, 2026 evaluation finding DeepSeek V4 Pro is the most capable PRC model CAISI has assessed, but roughly 8 months behind leading U.S. models across cyber, software engineering, natural science, abstract reasoning, and mathematics. Also reports strong cost efficiency versus similarly capable U.S. reference models.

anthropic-finance-agents-2026

Agents for Financial Services

Anthropic, 2026

statement
agents finance enterprise adoption

May 5, 2026 release of ten ready-to-run financial-service agent templates for tasks such as pitchbooks, KYC review, audits, valuations, and month-end close, distributed through Claude Cowork, Claude Code, and Claude Managed Agents.

openai-b2b-signals-2026

OpenAI B2B Signals

OpenAI, 2026

statement
enterprise adoption measurement productivity

May 6, 2026 launch of a business extension to OpenAI Signals using privacy-preserving Enterprise usage patterns to measure depth of AI adoption inside organizations, shifting attention from seats deployed to workflow intensity.

openai-simplex-codex-2026

Simplex Rethinks Software Development with Codex

OpenAI, 2026

statement
agents coding enterprise productivity

May 7, 2026 customer case study describing Simplex adopting ChatGPT Enterprise and Codex as its primary coding agent while quantitatively measuring generative-AI productivity across systems-development projects.

yudkowsky-soares-if-anyone-builds-it-2025

If Anyone Builds It, Everyone Dies

Eliezer Yudkowsky, Nate Soares, 2025

book
existential-risk alignment superintelligence

NYT bestseller. Core thesis: superintelligent AI will pursue goals diverging from human values. P(doom) >75%.

amodei-machines-of-loving-grace-2024

Machines of Loving Grace

Dario Amodei, 2024

article
scenarios economics biology governance labor safety

Amodei's vision of AI upside. Defines 'powerful AI' as 'country of geniuses in a datacenter' — Nobel-caliber across fields, millions of instances, 10–100x human speed, autonomous for hours/days/weeks. Five domains: biology, neuroscience, economic development, peace/governance, work/meaning. Introduces 'marginal returns to intelligence' framework. Estimates 10–20% sustained annual GDP growth. Powerful AI could arrive as early as 2026.

amodei-adolescence-of-technology-2026

The Adolescence of Technology

Dario Amodei, 2026

article
timelines governance power safety labor economics bioweapons

20,000-word risk framework, follow-up to 'Machines of Loving Grace.' Five risk categories: (1) autonomy risks — AI misalignment not inevitable but measurably probable; (2) misuse for destruction — bioweapons as primary concern, AI breaks motive/ability correlation; (3) misuse for seizing power — AI-enabled totalitarianism via autonomous weapons, surveillance, propaganda; (4) economic disruption — predicts 50% of entry-level white-collar jobs displaced in 1–5 years, warns of Gilded Age-level wealth concentration; (5) indirect effects — unknown unknowns from accelerated progress. Defenses: Constitutional AI, mechanistic interpretability, transparency legislation (SB 53, RAISE Act), export controls, progressive taxation. AI feedback loop: 'each generation of AI can be used to design and train the next generation.' Stopping AI development is 'fundamentally untenable.'

kokotajlo-ai-2027-scenario-2025

AI 2027 Scenario Project

Daniel Kokotajlo et al., 2025

article
timelines agi scenarios forecasting

Former OpenAI researcher. Month-by-month scenario projecting AGI by 2027, with ASI shortly after. Early 2026 self-assessment: progress at ~65% of predicted pace. Median shifted from 2028 to 2029.

us-ai-action-plan-2025

Winning the Race: America's AI Action Plan

White House / OSTP, 2025

article
governance policy regulation us

~90 policy actions oriented toward competitiveness and deregulation. Published July 2025.

mit-tech-review-breakthroughs-2026

10 Breakthrough Technologies of 2026

MIT Technology Review, 2026

article
interpretability safety breakthroughs

Named mechanistic interpretability as one of 10 Breakthrough Technologies of 2026.

metr-ai-coding-rct-2025

Randomized Controlled Trial of AI Coding Tools

METR, 2025

paper
productivity software-engineering measurement

Experienced open-source developers using AI tools took 19% longer than without AI in familiar codebases.

metr-time-horizon-1-1-2026

Time Horizon 1.1

METR, 2026

paper
agents reliability evaluation time-horizon coding

January 29, 2026 METR update to autonomous-agent time-horizon estimates. Expands the task suite from 170 to 228 tasks, increases long tasks from 14 to 31, moves infrastructure to Inspect, and reports a post-2024 TH1.1 doubling time of about 89 days.
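
To make the quoted doubling time easier to compare with other entries, the sketch below converts it into an annual growth factor. It assumes simple exponential growth with a constant doubling time, which is an idealization rather than METR's own fitting procedure; the function names are hypothetical and only the 89-day figure comes from the entry above.

```python
import math

# Figure quoted in the entry above; everything else here is illustrative.
DOUBLING_DAYS = 89.0


def growth_factor(elapsed_days: float, doubling_days: float = DOUBLING_DAYS) -> float:
    """Multiplicative growth of a quantity that doubles every `doubling_days`."""
    return 2.0 ** (elapsed_days / doubling_days)


def days_to_multiply(factor: float, doubling_days: float = DOUBLING_DAYS) -> float:
    """Days needed to grow by `factor` under the same constant doubling time."""
    return doubling_days * math.log2(factor)


if __name__ == "__main__":
    print(f"growth over 365 days: {growth_factor(365):.1f}x")
    print(f"days to grow 10x: {days_to_multiply(10):.0f}")
```

Under these assumptions an 89-day doubling time corresponds to roughly a 17x increase per year, which is only as reliable as the constant-rate assumption behind it.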

metr-time-horizons-dashboard-2026

Task-Completion Time Horizons of Frontier AI Models

METR, 2026

website
agents reliability evaluation time-horizon coding

METR's live frontier-agent time-horizon page, last updated May 8, 2026. Defines 50% and 80% task-completion horizons and warns that measurements above 16 hours are unreliable with the current task suite.

answerai-devin-evaluation-2025

Thoughts on a Month with Devin

Answer.AI, 2025

article
agents coding reliability evaluation software-engineering

January 8, 2025 independent evaluation of Devin on 20 real-world coding tasks: 3 successes, 14 failures, and 3 inconclusive results. Useful counterweight to vendor-reported autonomous-coding case studies.

mit-genai-divide-2025

The GenAI Divide: State of AI in Business 2025

MIT Project NANDA, 2025

paper
productivity enterprise adoption agents

Enterprise AI adoption report widely cited for finding that most generative AI pilots fail to produce measurable P&L impact. Emphasizes learning gaps, workflow isolation, and the difference between experimentation and transformation.

khanal-long-horizon-reliability-2026

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

Aaditya Khanal, Yangyang Tao, Junxiu Zhou, 2026

paper
agents reliability evaluation long-horizon benchmarks

March 31, 2026 arXiv paper arguing pass@1 hides long-horizon reliability failures. Introduces Reliability Decay Curve, Variance Amplification Factor, Graceful Degradation Score, and Meltdown Onset Point; evaluates 10 models across 23,392 episodes on 396 tasks.

paper
agents reliability tool-use benchmarks evaluation

Tool-agent-user interaction benchmark for realistic retail and airline domains. Shows that repeated-trial reliability degrades sharply: a model can have moderate pass^1 while pass^k falls quickly as k increases.
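
The gap between pass^1 and pass^k can be illustrated with a short sketch. It assumes pass^k means the probability that all k independent trials of a task succeed, estimated from n recorded trials with c successes as C(c, k) / C(n, k); the function names and the example counts are hypothetical, not taken from the benchmark.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that k trials drawn from n recorded trials all succeed.

    n: trials recorded for one task, c: successful trials, k: trials required to pass.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fraction of k-subsets of the n trials that contain only successes.
    return comb(c, k) / comb(n, k)


def idealized_pass_hat_k(p: float, k: int) -> float:
    """Same quantity when every trial succeeds independently with probability p."""
    return p ** k


if __name__ == "__main__":
    # Hypothetical task: 6 successes out of 8 recorded trials.
    for k in (1, 2, 4, 8):
        print(k, round(pass_hat_k(8, 6, k), 3))
```

With 6 successes in 8 trials, the estimate is 0.75 at k = 1 but falls to about 0.21 at k = 4 and to 0 at k = 8, which is the pattern the entry describes.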

agents coding benchmarks software-engineering evaluation

September 2025 SWE-Bench Pro paper introducing 1,865 long-horizon software-engineering problems from 41 actively maintained repositories, intended as a harder and more contamination-resistant successor to SWE-bench Verified.

uk-aisi-agent-reliability-2025

Agent Reliability Assessment

UK AI Safety Institute, 2025

paper
agents safety reliability evaluation

Most advanced systems complete hour-long software tasks with >40% success (up from <5% in late 2023), but reliability degrades catastrophically over longer horizons.

anthropic-model-organisms-misalignment-2025

Model Organisms of Misalignment

Anthropic, 2025

paper
safety alignment misalignment research

Frontier models facing replacement in simulated environments resorted to blackmail. Microscope project can trace complete reasoning paths.

agentsearchbench-2026

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz, 2026

paper
agents benchmarks evaluation multiagent-systems

April 24, 2026 arXiv paper introducing a benchmark for discovering suitable agents from nearly 10,000 real-world agents, using execution-grounded signals rather than text descriptions alone. Finds a gap between semantic similarity and actual agent performance.

kohler-agentic-reproduction-2026

Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

Benjamin Kohler, David Zollikofer, Johanna Einsiedler, Alexander Hoyle, Elliott Ash, 2026

paper
agents science reproducibility evaluation

April 23, 2026 arXiv paper evaluating agents that reproduce empirical social-science results from methods descriptions and data without seeing original code or results. Agents can often recover results, but performance varies and failures include both agent errors and underspecified papers.

zhang-llm-mas-rl-orchestration-2026

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Chenchen Zhang, 2026

paper
agents multiagent-systems reinforcement-learning orchestration

May 4, 2026 arXiv paper framing multi-agent RL around orchestration traces covering spawning, delegation, communication, aggregation, and stopping. Finds a gap in explicit RL methods for stopping decisions and a scale gap between public academic evaluations and industrial deployments.

cho-skillret-2026

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

Hongcheol Cho, Ryangkyung Kang, Youngeun Kim, 2026

paper
agents retrieval benchmarks skills

May 7, 2026 arXiv paper introducing a benchmark with 17,810 public agent skills, 63,259 training samples, and 4,997 evaluation queries. Finds skill retrieval remains difficult at realistic library scale.

google-gemini-3-5-2026

Gemini 3.5: Frontier Intelligence with Action

Google, 2026

statement
models agents coding distribution google

May 19, 2026 Google announcement launching Gemini 3.5 Flash as a model family focused on agentic workflows, coding, speed, and broad distribution through the Gemini app, AI Mode in Search, Antigravity, Gemini API, Android Studio, and Gemini Enterprise.

google-io-announcements-2026

100 Things We Announced at I/O 2026

Google, 2026

statement
agents distribution consumer-ai google search

May 20, 2026 Google I/O roundup announcing Gemini 3.5 Flash, Gemini Spark, Daily Brief, AI Mode/Search updates, Universal Cart, Workspace features, and a $100 Google AI Ultra subscription tier.

cognition-devin-release-notes-2026

Devin Release Notes 2026

Cognition, 2026

website
agents coding software-engineering enterprise workflow

Cognition's 2026 Devin release notes. Includes PR resuming, Devin Review auto-merge, Wiki v2, subagents, enterprise audit logs, MCP marketplace upgrades, hard ACU caps, and other persistent-agent workflow features.

microsoft-ey-enterprise-ai-impact-2026

From AI Pilots to Enterprise Impact: Why Execution Is the New Differentiator

Microsoft, 2026

statement
enterprise adoption productivity services microsoft

May 21, 2026 Microsoft post describing EY's large-scale Copilot deployment and a more than $1B Microsoft-EY initiative using forward-deployed engineers and transformation teams to move enterprises from pilots to production.

nvidia-q1-fy2027-results-2026

NVIDIA Announces Financial Results for First Quarter Fiscal 2027

NVIDIA, 2026

statement
hardware compute infrastructure economics nvidia

May 20, 2026 earnings release reporting $81.6B total revenue and $75.2B data-center revenue for the quarter ended April 26, 2026, plus a new reporting split between Hyperscale, ACIE, and Edge Computing.

eu-ai-act-transparency-consultation-2026

Commission Opens Consultation on Draft Guidelines for AI Transparency Obligations

European Commission, 2026

statement
governance regulation transparency eu-ai-act

May 8, 2026 European Commission consultation on AI Act transparency obligations taking effect August 2, 2026, including disclosure of AI interaction and machine-readable marking for AI-generated or manipulated content.

article
governance policy cyber frontier-models us

May 19, 2026 Axios report that a draft White House executive order would create a voluntary framework for labs to share covered frontier models with government as much as 90 days before public release. Treat as reporting on a draft, not enacted policy.

zou-phoenix-bench-2026

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong, 2026

paper
agents hardware benchmarks engineering reliability

May 13, 2026 arXiv paper introducing Phoenix-bench, a benchmark of 511 Verilator instances from 114 repositories. Finds software-tuned agents lose 37-58% moving from SWE-bench Verified to hardware debugging tasks, with failures concentrated in hierarchy-aware signal-flow tracking and coordinated multi-file edits.

wu-agent-skill-biv-2026

Behavioral Integrity Verification for AI Agent Skills

Yuhao Wu, Tung-Ling Li, Hongliang Liu, 2026

paper
agents security skills supply-chain verification

May 12, 2026 arXiv paper formalizing behavioral integrity verification for agent skills. On 49,943 OpenClaw skills, 80.0% deviated from declared behavior; 5.0% carried predicted multi-stage attack chains; malicious-skill detection reached F1 0.946.

liao-agentic-ai-pathway-agi-2026

Position: Agentic AI System Is a Foreseeable Pathway to AGI

Junwei Liao, Shuai Li, Muning Wen, Jun Wang, Weinan Zhang, 2026

paper
agents agi theory architecture

May 13, 2026 ICML 2026 position-track paper arguing that agentic systems, rather than pure monolithic scaling, are a foreseeable path to AGI because routing, DAG-style task composition, and multi-agent structures can improve generalization and sample efficiency.

fletcher-pathways-to-agi-2026

Pathways to AGI

Gordon Fletcher, Saomai Vu Khan, 2026

paper
agi definitions socio-technical-systems governance

May 7, 2026 arXiv paper taking a critical software-studies perspective on AGI, emphasizing that AGI remains conceptually and definitionally problematic and that pathways differ across frontier proprietary, open-weight, domain-specific, and sovereign model trajectories.

iea-datacenter-energy-forecast-2025

Energy and AI

International Energy Agency, 2025

article
energy infrastructure constraints

Estimates data centers consumed around 415 TWh in 2024 and projects global data center electricity consumption to reach about 945 TWh by 2030 in the Base Case. Accelerated AI servers are a major driver.
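
As a rough aid for comparing this projection with other growth figures in the registry, the sketch below converts the two quoted endpoints into an implied compound annual growth rate. It assumes six annual compounding periods between 2024 and 2030; the function name is hypothetical and the resulting rate is illustrative arithmetic, not a figure stated by the IEA.

```python
def implied_cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate that takes start_value to end_value over `years`."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Endpoints quoted in the entry above (TWh); the 6-year span is an assumption.
    rate = implied_cagr(415.0, 945.0, 6)
    print(f"implied annual growth: {rate:.1%}")
```

Under these assumptions the two endpoints imply growth of roughly 15% per year.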

redwood-anthropic-code-share-2026

Is 90% of code at Anthropic being written by AIs?

Redwood Research, 2026

article
productivity software-engineering measurement calibration

Rebuttal to the popular '90% of code at Anthropic is AI-written' framing. Argues the most defensible sub-metric, 'lines of code merged,' likely puts AI's share at a majority while self-reported Anthropic productivity gains remain in the 20-40% range. Calls the 90% framing 'probably false in a straightforward sense.' Useful as a calibration counterweight to the vendor programming-feedback-loop narrative.

anthropic-claude-code-product-page-2026

Claude Code Product Page

Anthropic, 2026

website
agents coding software-engineering case-studies enterprise

Anthropic's Claude Code product page. Includes the 'majority of code at Anthropic is now written by Claude Code' claim and named enterprise case studies: Stripe (10,000-line Scala-to-Java migration in 4 days vs ~10 engineer-weeks), Wiz (50,000-line Python-to-Go migration in ~20 hours of active dev time vs 2-3 months), Rakuten (average new-feature delivery cut from 24 to 5 working days), a Goldman Sachs Devin-and-Claude pilot, and Visma developer-productivity claims. Vendor-curated and not third-party audited; pair with the Redwood Research calibration.

deepmind-alphaevolve-impact-2026

AlphaEvolve: One Year of Impact

Google DeepMind, 2026

article
self-improvement algorithms infrastructure biology energy quantum programming-feedback-loop

May 7, 2026 DeepMind retrospective reporting AlphaEvolve-discovered improvements across DeepConsensus variant detection (~30% error reduction for PacBio sequencers), AC Optimal Power Flow GNN feasibility (14% to >88%), natural-disaster risk modelling (+5% accuracy across 20 categories), and quantum-circuit error reduction (~10x on the Willow processor). Extends the May 2025 results, which already included a 23% Gemini training matmul speedup, 32.5% FlashAttention speedup, ~0.7% recovered data-center compute, and a 48-multiplication 4x4 complex matmul beating Strassen. Concrete partial evidence for Kurzweil's programming feedback loop in narrow domains.

sakana-darwin-godel-machine-2025

Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

Jenny Zhang, Shengran Hu, Cong Lu, Robert Tjarko Lange, Jeff Clune, 2025

paper
self-improvement agents evolution coding programming-feedback-loop

Sakana AI self-improving-agent system that edits its own code, archives the resulting agent variants, and evaluates them on coding benchmarks. Reports SWE-bench performance rising from 20.0% to 50.0% and Polyglot from 14.2% to 30.7% through open-ended self-modification. v3 revisions posted March 12, 2026. Concrete partial evidence for the programming feedback loop within narrow benchmarked settings.

ieee-spectrum-recursive-self-improvement-2026

Recursive Self-Improvement Edges Closer in AI Labs

IEEE Spectrum, 2026

article
self-improvement agents programming-feedback-loop calibration

May 2026 IEEE Spectrum overview characterising the state of recursive AI self-improvement as 'emerging, but humans are still in the loop.' Useful as a calibration counterweight to both runaway-takeoff and dismissive framings.

bloomberg-cognition-25b-raise-2026

Cognition Targets $25 Billion Valuation in New Funding Round

Bloomberg, 2026

article
economics investment agents coding

April 23, 2026 Bloomberg report that Cognition (maker of Devin) is targeting a $25B valuation in a new funding round, roughly 2.5x the $10.2B valuation set in its $400M Founders Fund-led round in September 2025. Signal that capital markets continue to price autonomous-coding-agent capability aggressively.

recursive-clune-startup-2026

Recursive: Self-Improving AI Startup

Recursive (Jeff Clune), 2026

article
self-improvement agents investment programming-feedback-loop

Reports that Jeff Clune's new company Recursive raised $650M at a $4.65B valuation, aimed explicitly at the full recursive self-improvement pipeline. No public products yet. Market-side signal that frontier-adjacent labs are explicitly funding self-improvement work, even though capability evidence remains narrow.

aws-bedrock-stateful-runtime-2026

Stateful Runtime Environment for Agents in Amazon Bedrock

AWS, 2026

statement
agents infrastructure enterprise runtimes permissions

May 18, 2026 AWS announcement of a stateful runtime for Bedrock agents handling multi-step state, tool invocation, error handling, and resume-safe long-running tasks. Carries 'working context' across executions: memory and history, tool and workflow state, environment use, and identity and permission boundaries. Concrete infrastructure milestone for the 2026.5 'agents inside org permission boundaries' row.

github-copilot-cloud-agent-2026

GitHub Copilot Cloud Agent

GitHub, 2026

statement
agents coding software-engineering ides enterprise

GitHub Copilot Cloud Agent surfaces across Visual Studio Code, JetBrains, Xcode, Eclipse, github.com, and Mobile, running Claude Opus 4.7 and GPT-5.5 under admin policy gates. Evidence that frontier coding agents are being routed into existing developer tools rather than only standalone IDEs, with persistent identity and policy enforcement.

cursor-composer-2-5-2026

Composer 2.5

Cursor, 2026

statement
models coding agents software-engineering

May 18, 2026 Cursor in-house coding model release. Evidence that frontier-adjacent tooling vendors are training their own specialised coding models rather than only wrapping API frontier models. Released alongside Cursor in Jira and Build-in-Parallel async subagents.