kurzweil-singularity-near-2005
The Singularity is Near
Ray Kurzweil, 2005
Law of accelerating returns; original 2029 Turing test and 2045 Singularity predictions.
Central registry of books, papers, articles, forecasts, and public statements cited across the baseline, timeline, predictions, and weekly updates.
tegmark-life-3-0-2017
Max Tegmark, 2017
Scenarios for coexistence with superintelligent AI.
bostrom-superintelligence-2014
Nick Bostrom, 2014
Foundational existential-risk framing for advanced AI.
russell-human-compatible-2019
Stuart Russell, 2019
Inverse reinforcement learning approach to alignment.
hanson-age-of-em-2016
Robin Hanson, 2016
Economic analysis of brain emulation scenarios.
ord-precipice-2020
Toby Ord, 2020
Existential risk landscape including AI.
barrat-our-final-invention-2013
James Barrat, 2013
Risks of artificial superintelligence.
ford-architects-of-intelligence-2018
Martin Ford, 2018
Interviews with leading AI researchers on future trajectories.
christian-alignment-problem-2020
Brian Christian, 2020
Accessible overview of alignment challenges in current ML.
agrawal-prediction-machines-2018
Ajay Agrawal, Joshua Gans, Avi Goldfarb, 2018
Economic framework for AI as cheap prediction.
brynjolfsson-competing-age-ai-2020
Erik Brynjolfsson, Andrew McAfee, 2020
AI-driven transformation of business and labor markets.
kelly-what-technology-wants-2010
Kevin Kelly, 2010
Technology as an evolving system with its own tendencies.
kurzweil-singularity-nearer-2024
Ray Kurzweil, 2024
Updated predictions. Identifies programming as the main bottleneck for superintelligent AI and describes a positive feedback loop once AI achieves sufficient programming ability.
espai-survey-2023
AI Impacts / ESPAI, 2023
1,714 AI researchers. 50% HLMI by 2047, 50% FAOL by 2116.
~1,700 forecasters. 50% AGI by Nov 2033; weakly general AI by Oct 2027. Feb 2026 data. Timelines have lengthened slightly over the past year, despite a long-term collapse from a ~50-year horizon in 2020.
AGI timelines dashboard aggregating Metaculus, Manifold, and Kalshi forecasts. On May 23, 2026, the combined forecast estimated AGI in 2031 with an 80% interval of 2027-2043.
forecaster-surveys-2024-2025
Various forecasters, 2025
More aggressive than ESPAI: 50% HLMI by 2030, 90% by 2040.
80000hours-critical-period-2025
80,000 Hours, 2025
Identifies 2028–2032 as likely bottleneck period for AGI arrival.
aschenbrenner-situational-awareness-2024
Leopold Aschenbrenner, 2024
June 2024 essay series by a former OpenAI Superalignment researcher. Core thesis — 'counting the OOMs': compute (~0.5 OOM/yr) plus algorithmic efficiency (~0.5 OOM/yr) plus 'unhobbling' (chatbot → agent → drop-in remote worker) make a 'drop-in AI researcher/engineer' AGI 'strikingly plausible' by 2027 on trendline extrapolation alone. Automated AI research then drives an intelligence explosion; softened-takeoff path 2026/27 proto-engineer → 2027/28 >90%-automated research → 2028/29 superintelligence. Compute/economics: ~$100B AI revenue run-rate by 2026, >$1T/yr total AI investment by 2027, $100B+ individual training clusters by 2028, $1T+ clusters drawing >20% of US electricity by end of decade, US power production up tens of percent. Also argues lab security must be locked down against CCP espionage, superalignment is unsolved but maybe tractable, and that by 2027/28 a government-led 'Project' (Manhattan-style nationalization, labs voluntarily merging) will run AGI development. One of the most influential aggressive-timeline documents of the period and an intellectual antecedent to the AI 2027 scenario.
amodei-agi-prediction-2025
Dario Amodei, 2025
Anthropic CEO. 'Country of geniuses in a datacenter.' Anthropic official position (March 2025): powerful AI in late 2026 or early 2027.
altman-agi-asi-prediction-2024
Sam Altman, 2024
OpenAI CEO. AGI 2025–2029 ('sloppy term'). ASI by ~2028: 'more intellectual capacity in data centers than outside.'
altman-agi-confidence-2025
Sam Altman, 2025
January 2025: 'We are now confident we know how to build AGI as we have traditionally understood it.' Claims GPT-5 is 'already smarter than me in many ways.' Predicts superintelligence by 2030. Corporate actions: $500B Stargate, 800M+ weekly ChatGPT users, acquisition of Jony Ive's io ($6.5B).
hassabis-agi-prediction-2025
Demis Hassabis, 2025
DeepMind CEO. '5–10 years' from March 2025 (= 2030–2035). Coding and math fastest; scientific discovery harder.
hassabis-agi-prediction-2026
Demis Hassabis, 2026
Narrowed from '5–10 years' to 'maybe within the next five years.' Requires 'one or two more major breakthroughs on the level of the Transformer or AlphaGo.' AGI must include genuine invention and creativity: 'Could a system invent Go, or come up with relativity?'
huang-agi-prediction-2024
Jensen Huang, 2024
Nvidia CEO. AGI within 5 years (2029) in March 2024. Shifted to 'already here' in Nov 2025.
lecun-agi-prediction-2025
Yann LeCun, 2025
Meta Chief AI Scientist. 'At least a decade, probably much more.' LLMs will not lead to AGI; new architectures needed.
legg-agi-prediction-2025
Shane Legg, 2025
DeepMind co-founder. 50% chance of 'minimal AGI' by 2028.
critch-agi-prediction-2025
Andrew Critch, 2025
AI researcher. 45% chance of AGI by end of 2026.
barnett-transformative-ai-2025
Matthew Barnett, 2025
Median for transformative AI ~2033 based on training loss extrapolation.
musk-agi-prediction-2026
Elon Musk, 2026
Claims AGI by year-end 2026. Grok 5 (6T parameters, Q1 2026) has '~10% chance of achieving AGI.' xAI acquired by SpaceX at $250B valuation Feb 2026.
sutskever-ssi-scaling-2026
Ilya Sutskever, 2026
Running Safe Superintelligence Inc. ($32B valuation, ~20 employees, zero revenue). 'Age of simple scaling is ending'; next breakthrough requires fundamentally new learning methods.
karpathy-rlvr-agents-2025
Andrej Karpathy, 2025
Describes RLVR (reinforcement learning from verifiable rewards) as delivering high capability per dollar and absorbing compute previously spent on pretraining. 'Year of the agent' is really 'decade of the agent.'
amodei-scaling-2026
Dario Amodei, 2026
March 2026 Morgan Stanley conference. Scaling laws have 'not hit a wall at all.' Predicts 'radical acceleration in 2026.'
odlyzko-ai-bubble-warning-2026
Andrew Odlyzko, 2026
University of Minnesota researcher. Warns circular AI financing structures (OpenAI/NVIDIA/AMD/Microsoft cross-investments) are 'typical of bubbles.'
gartner-agentic-ai-forecast-2025
Gartner, 2025
Projects 40% of enterprise apps will embed agents by end of 2026, up from <5% in 2025.
February 2026 release. OpenAI describes GPT-5.3-Codex as its most capable agentic coding model to date, 25% faster than GPT-5.2-Codex, with gains on SWE-Bench Pro, Terminal-Bench 2.0, OSWorld, GDPval, and cybersecurity. OpenAI states early versions helped debug training, deployment, and evals.
April 23, 2026 release. OpenAI frames GPT-5.5 as a model for real work across code, research, data analysis, documents, spreadsheets, and software operation. Reports 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro, with better token efficiency than GPT-5.4.
OpenAI system card for GPT-5.5. Describes predeployment safety evaluations, cybersecurity and biology safeguards, and release posture for GPT-5.5 and GPT-5.5 Pro.
openai-swe-bench-verified-contamination-2026
OpenAI, 2026
February 23, 2026 OpenAI analysis arguing SWE-bench Verified is no longer suitable for frontier coding launches. OpenAI audited a subset of hard failures and found at least 59.4% had flawed tests, plus evidence frontier models could reproduce gold patches or problem specifics; recommends SWE-bench Pro instead.
February 5, 2026 release. Opus 4.6 introduced 1M token context in beta for Opus-class models, 128k output tokens, agent teams in Claude Code, context compaction, adaptive thinking, and stronger long-running coding and knowledge-work performance.
February 17, 2026 release. Anthropic describes Sonnet 4.6 as an upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design, with a 1M token context window in beta.
April 16, 2026 release. Anthropic describes Opus 4.7 as stronger than Opus 4.6 on advanced software engineering, high-resolution vision, memory, and multi-step enterprise workflows. It is also used to test cyber safeguards before broader Mythos-class releases.
April 7, 2026 initiative giving launch partners gated access to Claude Mythos Preview for defensive security. Anthropic reports Mythos Preview found thousands of high-severity vulnerabilities and argues frontier coding models can surpass all but the most skilled humans at vulnerability discovery and exploitation.
anthropic-claude-managed-agents-2026
Anthropic, 2026
April 2026 Anthropic announcement of Claude Managed Agents, a hosted agent harness and production runtime with standard token rates plus $0.08 per active session-hour. Evidence that frontier vendors are productizing agent orchestration rather than only model endpoints.
Claude Platform docs for Managed Agents. Defines agents, environments, sessions, and events; Managed Agents API requests require the managed-agents-2026-04-01 beta header and support Anthropic-managed cloud containers or self-hosted sandboxes.
Claude Platform docs for running Managed Agent tool execution in customer-controlled infrastructure. Anthropic keeps orchestration while code, filesystem, and network egress remain in the customer's environment.
Claude Platform docs for MCP tunnels, a research-preview feature connecting Claude to private-network MCP servers through outbound-only connections without opening inbound firewall ports or exposing services publicly.
anthropic-mythos-red-team-2026
Anthropic Frontier Red Team, 2026
Technical writeup on Claude Mythos Preview. Describes autonomous vulnerability discovery and exploit development, including exploit chains and comparisons to Opus 4.6. Useful as evidence for offense-defense asymmetry and gated-release logic.
February 19, 2026 release. Google describes Gemini 3.1 Pro as an upgraded core intelligence model for complex tasks, rolling out across Gemini API, Vertex AI, Google AI Studio, Gemini CLI, Antigravity, Gemini app, and NotebookLM. Reports 77.1% verified score on ARC-AGI-2.
google-deep-research-max-2026
Google DeepMind, 2026
April 21, 2026 release. Google frames Deep Research and Deep Research Max, built with Gemini 3.1 Pro, as autonomous research agents with MCP support, native visualizations, and stronger long-horizon analytical workflows.
xAI developer documentation for the Grok 4.20 reasoning model. Used as source registry entry for February 2026 frontier release cadence.
April 26, 2026 statement by Sam Altman. Reframes OpenAI's public principles around broad access to general AI, democratic governance, decentralized power, infrastructure expansion, and safety, with less emphasis on the older AGI-charter language.
microsoft-openai-partnership-amendment-2026
Microsoft / OpenAI, 2026
April 27, 2026 amended agreement. Microsoft remains OpenAI's primary cloud partner, but OpenAI can serve products across any cloud provider; Microsoft's OpenAI IP license becomes non-exclusive through 2032; revenue-share terms are simplified.
April 28, 2026 limited preview bringing OpenAI models, Codex, and Amazon Bedrock Managed Agents powered by OpenAI into AWS environments. Important signal that frontier models and coding agents are becoming multicloud enterprise infrastructure.
aws-bedrock-openai-managed-agents-2026
Amazon Web Services, 2026
April 28, 2026 AWS limited-preview announcement. Bedrock OpenAI offerings inherit IAM, PrivateLink, guardrails, encryption, and CloudTrail logging; Managed Agents powered by OpenAI have per-agent identity, action logs, and run in customer AWS environments with inference on Bedrock.
May 1, 2026 announcement of agreements with SpaceX, OpenAI, Google, NVIDIA, Reflection, Microsoft, and Amazon Web Services to deploy advanced AI capabilities on IL6 and IL7 classified networks for lawful operational use.
April 28, 2026 public beta of a TypeScript SDK exposing Cursor's agent runtime for local, cloud, CI/CD, and embedded product workflows. Evidence that coding agents are becoming programmable infrastructure rather than only interactive IDE tools.
May 19, 2026 Cursor changelog announcing Jira integration: assigning work items or mentioning @Cursor starts a cloud agent that scopes the task from the Jira item and repository settings, then posts completion updates and a pull-request link.
April 28, 2026 announcement that Warp's client is open source and organized around agent-first workflows using Oz, with OpenAI as founding sponsor. Useful as an example of agent-managed software development moving into public repos.
nvidia-nemotron-3-nano-omni-2026
NVIDIA, 2026
April 28, 2026 open omni-modal model for video, audio, image, and text reasoning in agentic workloads. NVIDIA reports higher throughput and lower compute for video reasoning, reinforcing the efficiency-plus-agent-infrastructure trend.
Granite 4.1 family of Apache 2.0 dense language models in 3B, 8B, and 30B sizes, with instruction-tuned variants, FP8 quantization, and improvements in tool calling, instruction following, coding, and mathematical reasoning.
caisi-frontier-testing-agreements-2026
NIST Center for AI Standards and Innovation, 2026
May 5, 2026 announcement expanding CAISI collaborations with Google DeepMind, Microsoft, and xAI for pre-deployment evaluations, post-deployment assessment, classified-environment testing, and national-security research. Builds on renegotiated OpenAI and Anthropic partnerships.
caisi-deepseek-v4-pro-evaluation-2026
NIST Center for AI Standards and Innovation, 2026
May 1, 2026 evaluation finding DeepSeek V4 Pro is the most capable PRC model CAISI has assessed, but roughly 8 months behind leading U.S. models across cyber, software engineering, natural science, abstract reasoning, and mathematics. Also reports strong cost efficiency versus similarly capable U.S. reference models.
anthropic-enterprise-ai-services-company-2026
Anthropic, 2026
May 4, 2026 announcement of an AI services company formed by Anthropic, Blackstone, Hellman & Friedman, and Goldman Sachs to help mid-sized companies deploy Claude into core operations with hands-on engineering support.
May 5, 2026 release of ten ready-to-run financial-service agent templates for tasks such as pitchbooks, KYC review, audits, valuations, and month-end close, distributed through Claude Cowork, Claude Code, and Claude Managed Agents.
May 6, 2026 launch of a business extension to OpenAI Signals using privacy-preserving Enterprise usage patterns to measure depth of AI adoption inside organizations, shifting attention from seats deployed to workflow intensity.
May 7, 2026 customer case study describing Simplex adopting ChatGPT Enterprise and Codex as its primary coding agent while quantitatively measuring generative-AI productivity across systems-development projects.
amd-openai-mrc-2026
AMD, 2026
May 6, 2026 AMD post describing OpenAI, AMD, Microsoft, and other industry contributors making Multipath Reliable Connection available through the Open Compute Project to improve production-scale AI networking.
amd-instinct-mi350p-pcie-2026
AMD, 2026
May 7, 2026 AMD announcement of MI350P PCIe cards aimed at fitting agentic inference into standard air-cooled enterprise servers rather than only purpose-built large GPU clusters.
yudkowsky-soares-if-anyone-builds-it-2025
Eliezer Yudkowsky, Nate Soares, 2025
NYT bestseller. Core thesis: superintelligent AI will pursue goals diverging from human values. P(doom) >75%.
Amodei's vision of AI upside. Defines 'powerful AI' as 'country of geniuses in a datacenter' — Nobel-caliber across fields, millions of instances, 10–100x human speed, autonomous for hours/days/weeks. Five domains: biology, neuroscience, economic development, peace/governance, work/meaning. Introduces 'marginal returns to intelligence' framework. Estimates 10–20% sustained annual GDP growth. Powerful AI could arrive as early as 2026.
20,000-word risk framework, follow-up to 'Machines of Loving Grace.' Five risk categories: (1) autonomy risks — AI misalignment not inevitable but measurably probable; (2) misuse for destruction — bioweapons as primary concern, AI breaks motive/ability correlation; (3) misuse for seizing power — AI-enabled totalitarianism via autonomous weapons, surveillance, propaganda; (4) economic disruption — predicts 50% of entry-level white-collar jobs displaced in 1–5 years, warns of Gilded Age-level wealth concentration; (5) indirect effects — unknown unknowns from accelerated progress. Defenses: Constitutional AI, mechanistic interpretability, transparency legislation (SB 53, RAISE Act), export controls, progressive taxation. AI feedback loop: 'each generation of AI can be used to design and train the next generation.' Stopping AI development is 'fundamentally untenable.'
June 2026 policy essay, third in the sequence after 'Machines of Loving Grace' and 'The Adolescence of Technology.' Argues the Mythos/Glasswing cyber evidence makes AI's risks 'undeniable' and that it is time to go beyond transparency to binding regulation. Marks Anthropic's escalation from its transparency-first posture (SB 53, RAISE, IL SB 315) to advocating an FAA-style regime: mandatory third-party testing for models above a compute threshold in four risk areas — cybersecurity, biological weapons, loss of control, and automated R&D — with government power to block or reverse deployment, scoped and protected against political favoritism, possibly via a 'regulatory markets' model. Anthropic is releasing a frontier-model-testing legislative proposal and a job-displacement policy framework with financial backing. Covers five areas: (1) FAA-style public-safety regulation; (2) macro/tax — 'hypergrowth, hyper-inequality' risk, pro-employment incentives, wage insurance, UBI/universal capital accounts, AI firms absorbing datacenter rate increases; (3) accelerating downstream science — reform FDA/EMA (7–8yr pipeline) to accept AI simulation (PD/PK, toxicology, synthetic control arms); (4) state vs. civil liberties — autonomous-weapon accountability/off-switch, ban domestic autonomous weapons, close the data-broker loophole, public right to AI in adverse government action; (5) democratic AI coalition — coordinated export controls (MATCH, OVERWATCH bills), mutual defense, rejection of AI-powered repression. Reaffirms 'country of geniuses in a datacenter' within 'a year or two' and a 3-year AI lead as militarily decisive.
kokotajlo-ai-2027-scenario-2025
Daniel Kokotajlo et al., 2025
Former OpenAI researcher. Month-by-month scenario projecting AGI by 2027, with ASI shortly after. Early 2026 self-assessment: progress at ~65% of predicted pace. Median shifted from 2028 to 2029.
us-ai-action-plan-2025
White House / OSTP, 2025
~90 policy actions oriented toward competitiveness and deregulation. Published July 2025.
mit-tech-review-breakthroughs-2026
MIT Technology Review, 2026
Named mechanistic interpretability as one of 10 Breakthrough Technologies of 2026.
Experienced open-source developers using AI tools took 19% longer than without AI in familiar codebases.
January 29, 2026 METR update to autonomous-agent time-horizon estimates. Expands the task suite from 170 to 228 tasks, increases long tasks from 14 to 31, moves infrastructure to Inspect, and reports a post-2024 TH1.1 doubling time of about 89 days.
METR's live frontier-agent time-horizon page, last updated May 8, 2026. Defines 50% and 80% task-completion horizons and warns that measurements above 16 hours are unreliable with the current task suite.
January 8, 2025 independent evaluation of Devin on 20 real-world coding tasks: 3 successes, 14 failures, and 3 inconclusive results. Useful counterweight to vendor-reported autonomous-coding case studies.
mit-nber-ai-productivity-2026
Salomé Baslandze et al., 2026
March 2026 NBER working paper using a survey of nearly 750 corporate executives. Finds heterogeneous AI adoption, positive productivity gains concentrated in high-skill services and finance, and expected strengthening in 2026.
mit-genai-divide-2025
MIT Project NANDA, 2025
Enterprise AI adoption report widely cited for finding that most generative AI pilots fail to produce measurable P&L impact. Emphasizes learning gaps, workflow isolation, and the difference between experimentation and transformation.
khanal-long-horizon-reliability-2026
Aaditya Khanal, Yangyang Tao, Junxiu Zhou, 2026
March 31, 2026 arXiv paper arguing pass@1 hides long-horizon reliability failures. Introduces Reliability Decay Curve, Variance Amplification Factor, Graceful Degradation Score, and Meltdown Onset Point; evaluates 10 models across 23,392 episodes on 396 tasks.
yao-tau-bench-2024
Shunyu Yao et al., 2024
Tool-agent-user interaction benchmark for realistic retail and airline domains. Shows that repeated-trial reliability degrades sharply: a model can have moderate pass^1 while pass^k falls quickly as k increases.
scale-swe-bench-pro-2025
Scale AI, 2025
September 2025 SWE-Bench Pro paper introducing 1,865 long-horizon software-engineering problems from 41 actively maintained repositories, intended as a harder and more contamination-resistant successor to SWE-bench Verified.
uk-aisi-agent-reliability-2025
UK AI Safety Institute, 2025
Most advanced systems complete hour-long software tasks with >40% success (up from <5% in late 2023), but reliability degrades catastrophically over longer horizons.
anthropic-model-organisms-misalignment-2025
Anthropic, 2025
Frontier models facing replacement in simulated environments resorted to blackmail. Microscope project can trace complete reasoning paths.
agentsearchbench-2026
Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz, 2026
April 24, 2026 arXiv paper introducing a benchmark for discovering suitable agents from nearly 10,000 real-world agents, using execution-grounded signals rather than text descriptions alone. Finds a gap between semantic similarity and actual agent performance.
kohler-agentic-reproduction-2026
Benjamin Kohler, David Zollikofer, Johanna Einsiedler, Alexander Hoyle, Elliott Ash, 2026
April 23, 2026 arXiv paper evaluating agents that reproduce empirical social-science results from methods descriptions and data without seeing original code or results. Agents can often recover results, but performance varies and failures include both agent errors and underspecified papers.
zhang-llm-mas-rl-orchestration-2026
Chenchen Zhang, 2026
May 4, 2026 arXiv paper framing multi-agent RL around orchestration traces covering spawning, delegation, communication, aggregation, and stopping. Finds a gap in explicit RL methods for stopping decisions and a scale gap between public academic evaluations and industrial deployments.
sharma-agent-execution-validation-2026
Reshabh K Sharma, Gaurav Mittal, Yu Hu, 2026
May 4, 2026 arXiv paper proposing validation of autonomous-agent execution from 2-10 passing traces, using dominator analysis, semantic equivalence, and topological subsequence matching to detect bugs and false successes.
cho-skillret-2026
Hongcheol Cho, Ryangkyung Kang, Youngeun Kim, 2026
May 7, 2026 arXiv paper introducing a benchmark with 17,810 public agent skills, 63,259 training samples, and 4,997 evaluation queries. Finds skill retrieval remains difficult at realistic library scale.
May 19, 2026 Google announcement launching Gemini 3.5 Flash as a model family focused on agentic workflows, coding, speed, and broad distribution through the Gemini app, AI Mode in Search, Antigravity, Gemini API, Android Studio, and Gemini Enterprise.
May 20, 2026 Google I/O roundup announcing Gemini 3.5 Flash, Gemini Spark, Daily Brief, AI Mode/Search updates, Universal Cart, Workspace features, and a $100 Google AI Ultra subscription tier.
Cognition's 2026 Devin release notes. Includes PR resuming, Devin Review auto-merge, Wiki v2, subagents, enterprise audit logs, MCP marketplace upgrades, hard ACU caps, and other persistent-agent workflow features.
infosys-cognition-devin-2026
Infosys / Cognition, 2026
January 7, 2026 Infosys and Cognition announcement to deploy Devin across Infosys's internal engineering ecosystem and client engagements, combining Devin with Infosys Topaz Fabric for enterprise software-development workflows.
openai-dell-codex-enterprise-2026
OpenAI, 2026
May 18, 2026 OpenAI announcement that Codex will connect with Dell AI Data Platform and explore Dell AI Factory integrations so enterprises can run agentic workflows closer to governed on-prem and hybrid data.
microsoft-ey-enterprise-ai-impact-2026
Microsoft, 2026
May 21, 2026 Microsoft post describing EY's large-scale Copilot deployment and a more than $1B Microsoft-EY initiative using forward-deployed engineers and transformation teams to move enterprises from pilots to production.
nvidia-q1-fy2027-results-2026
NVIDIA, 2026
May 20, 2026 earnings release reporting $81.6B total revenue and $75.2B data-center revenue for the quarter ended April 26, 2026, plus a new reporting split between Hyperscale, ACIE, and Edge Computing.
eu-ai-act-transparency-consultation-2026
European Commission, 2026
May 8, 2026 European Commission consultation on AI Act transparency obligations taking effect August 2, 2026, including disclosure of AI interaction and machine-readable marking for AI-generated or manipulated content.
axios-frontier-model-eo-2026
Ashley Gold, 2026
May 19, 2026 Axios report that a draft White House executive order would create a voluntary framework for labs to share covered frontier models with government as much as 90 days before public release. Treat as reporting on a draft, not enacted policy.
zou-phoenix-bench-2026
Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong, 2026
May 13, 2026 arXiv paper introducing Phoenix-bench, a benchmark of 511 Verilator instances from 114 repositories. Finds software-tuned agents lose 37-58% moving from SWE-bench Verified to hardware debugging tasks, with failures concentrated in hierarchy-aware signal-flow tracking and coordinated multi-file edits.
wu-agent-skill-biv-2026
Yuhao Wu, Tung-Ling Li, Hongliang Liu, 2026
May 12, 2026 arXiv paper formalizing behavioral integrity verification for agent skills. On 49,943 OpenClaw skills, 80.0% deviated from declared behavior; 5.0% carried predicted multi-stage attack chains; malicious-skill detection reached F1 0.946.
liao-agentic-ai-pathway-agi-2026
Junwei Liao, Shuai Li, Muning Wen, Jun Wang, Weinan Zhang, 2026
May 13, 2026 ICML 2026 position-track paper arguing that agentic systems, rather than pure monolithic scaling, are a foreseeable path to AGI because routing, DAG-style task composition, and multi-agent structures can improve generalization and sample efficiency.
May 7, 2026 arXiv paper taking a critical software-studies perspective on AGI, emphasizing that AGI remains conceptually and definitionally problematic and that pathways differ across frontier proprietary, open-weight, domain-specific, and sovereign model trajectories.
Estimates data centers consumed around 415 TWh in 2024 and projects global data center electricity consumption to reach about 945 TWh by 2030 in the Base Case. Accelerated AI servers are a major driver.
redwood-anthropic-code-share-2026
Redwood Research, 2026
Rebuttal to the popular '90% of code at Anthropic is AI-written' framing. Argues the most defensible sub-metric, 'lines of code merged,' likely puts AI's share at a majority while self-reported Anthropic productivity gains remain in the 20-40% range. Calls the 90% framing 'probably false in a straightforward sense.' Useful as a calibration counterweight to the vendor programming-feedback-loop narrative.
Anthropic's Claude Code product page. Includes the 'majority of code at Anthropic is now written by Claude Code' claim and named enterprise case studies: Stripe (10,000-line Scala-to-Java migration in 4 days vs ~10 engineer-weeks), Wiz (50,000-line Python-to-Go migration in ~20 hours of active dev time vs 2-3 months), Rakuten (average new-feature delivery cut from 24 to 5 working days), Goldman Sachs Devin-and-Claude pilot, and Visma developer-productivity claims. Vendor-curated and not third-party audited; pair with the Redwood Research calibration.
May 7, 2026 DeepMind retrospective reporting AlphaEvolve-discovered improvements across DeepConsensus variant detection (~30% error reduction for PacBio sequencers), AC Optimal Power Flow GNN feasibility (14% to >88%), natural-disaster risk modelling (+5% accuracy across 20 categories), and quantum-circuit error reduction (~10x on the Willow processor). Extends the May 2025 results, which already included a 23% Gemini training matmul speedup, 32.5% FlashAttention speedup, ~0.7% recovered data-center compute, and a 48-multiplication 4x4 complex matmul beating Strassen. Concrete partial evidence for Kurzweil's programming feedback loop in narrow domains.
sakana-darwin-godel-machine-2025
Jenny Zhang, Shengran Hu, Cong Lu, Robert Tjarko Lange, Jeff Clune, 2025
Sakana AI self-improving-agent system that edits its own code, keeps an archive of agent variants, and validates each change on coding benchmarks. Reports SWE-bench from 20.0% to 50.0% and Polyglot from 14.2% to 30.7% through open-ended self-modification. v3 revisions posted March 12, 2026. Concrete partial evidence for the programming feedback loop within narrow benchmarked settings.
ieee-spectrum-recursive-self-improvement-2026
IEEE Spectrum, 2026
May 2026 IEEE Spectrum overview characterising the state of recursive AI self-improvement as 'emerging, but humans are still in the loop.' Useful as a calibration counterweight to both runaway-takeoff and dismissive framings.
bloomberg-cognition-25b-raise-2026
Bloomberg, 2026
April 23, 2026 Bloomberg report that Cognition (maker of Devin) is targeting a raise at a $25B valuation, roughly 2.5x the $10.2B valuation set in its $400M Founders Fund-led round in September 2025. Signal that capital markets continue to price autonomous-coding-agent capability aggressively.
recursive-clune-startup-2026
Recursive (Jeff Clune), 2026
Reports that Jeff Clune's new company Recursive raised $650M at a $4.65B valuation, aimed explicitly at the full recursive self-improvement pipeline. No public products yet. Market-side signal that frontier-adjacent labs are explicitly funding self-improvement work, even though capability evidence remains narrow.
aws-bedrock-stateful-runtime-2026
AWS, 2026
May 18, 2026 AWS announcement of a stateful runtime for Bedrock agents handling multi-step state, tool invocation, error handling, and resume-safe long-running tasks. Carries 'working context' across executions: memory and history, tool and workflow state, environment use, and identity and permission boundaries. Concrete infrastructure milestone for the 2026.5 'agents inside org permission boundaries' row.
github-copilot-cloud-agent-2026
GitHub, 2026
GitHub Copilot Cloud Agent surfaces across Visual Studio Code, JetBrains, Xcode, Eclipse, github.com, and Mobile, running Claude Opus 4.7 and GPT-5.5 under admin policy gates. Evidence that frontier coding agents are being routed into existing developer tools rather than only standalone IDEs, with persistent identity and policy enforcement.
cursor-composer-2-5-2026
Cursor, 2026
May 18, 2026 Cursor in-house coding model release. Evidence that frontier-adjacent tooling vendors are training their own specialised coding models rather than only wrapping API frontier models. Released alongside Cursor in Jira and Build-in-Parallel async subagents.
May 28, 2026 flagship release, 41 days after Opus 4.7. SWE-bench Verified 88.6% (up from 87.6%), Terminal-Bench 2.1 74.6%, GPQA Diamond 93.6%, GDPval-AA 1890 Elo (+121 over GPT-5.5), Online-Mind2Web 84% (strongest computer-use/browser-agent tested). Pricing unchanged at $5/$25 per M tokens; fast mode 2.5x speed at $10/$50, three times cheaper than 4.7 fast mode; 1M-token input, 128K output. New 'dynamic workflows' in Claude Code orchestrate hundreds of parallel subagents (capped ~1,000) with planning, distribution, and output verification. Notable calibration result: first Claude to score 0% on uncritically reporting flawed results, >10x reduction in overconfident behaviour vs 4.7, fails to surface important events only 3.7% of the time. A capability release whose headline includes an honesty/calibration improvement directly relevant to long-horizon agent reliability.
anthropic-30b-raise-900b-2026
Bloomberg, 2026
Reporting that Anthropic was closing a $30B+ round at a $900B-plus valuation as soon as the week of May 26, 2026, surpassing OpenAI's $852B March valuation to become the most valuable private AI startup. Co-leads (Sequoia, Dragoneer, Altimeter, Greenoaks) each ~$2B. Revenue cited: Q1 $4.8B doubling to a projected $10.9B in Q2; annualised figures reported near $45B (vs OpenAI ~$33B). IPO reportedly targeted October 2026 with ~$1T discussions. Not a capability signal; a market-concentration and circular-financing signal.
anthropic-xai-colossus-compute-2026
Cryptobriefing / SpaceX S-1 reporting, 2026
Disclosed via SpaceX's IPO filing: Anthropic reserves Colossus 1 (Memphis, ~220,000+ NVIDIA H100/H200/GB200 GPUs, ~300 MW) at ~$1.25B/month (~$15B/yr, >$40B through May 2029), reportedly absorbing roughly half of Anthropic's ARR. SpaceX acquired xAI in a Feb 2026 stock merger and is using the lease to boost revenue ahead of its own IPO. Illustrates the scale of compute commitments relative to revenue and the increasingly circular financing among frontier players.
datacenter-electrical-gear-bottleneck-2026
Industry reporting (Data Center Knowledge / Tech-Insider), 2026
Late-May 2026 reporting that of ~12 GW of U.S. data center capacity expected to come online in 2026, only about one-third was under active construction, while lead times for critical electrical gear (transformers, switchgear) stretched to as long as five years, against $650B+ in combined 2026 hyperscaler AI capex. Concrete instance of the energy/supply-chain constraint binding before capital does.
whitehouse-frontier-ai-eo-2026
The White House, 2026
Executive order signed June 2, 2026. Directs a framework under which developers voluntarily give the federal government access to covered frontier models up to 30 days before release to any other party, and lets developers and government select trusted partners for early access to strengthen critical-infrastructure cybersecurity. Explicitly bars any mandatory licensing or preclearance requirement, keeping the regime voluntary. Enacts, with a narrower 30-day window, the approach floated in the May 19 draft reported by Axios, which allowed up to 90 days. State AI legislation continues despite the administration's preemption push.
anthropic-series-h-965b-2026
Anthropic, 2026
Series H closed late May 2026: $65B raised at a $965B post-money valuation, led by Altimeter, Dragoneer, Greenoaks, and Sequoia — the largest single private AI round and the first time Anthropic's private valuation passed OpenAI's ($852B March mark). Run-rate revenue reported to have crossed ~$47B. Disclosed compute agreements: up to 5 GW with Amazon, 5 GW of next-generation TPU capacity with Google and Broadcom, and GPU access in SpaceX's Colossus 1 and 2. Apollo Global and Blackstone arranged a $36B private-credit deal — backed by Broadcom — to buy Google TPUs for Anthropic, described as the largest chip-financing debt transaction on record. Anthropic confidentially filed a draft S-1 with the SEC on June 1, 2026. Finalizes and supersedes the prior reporting in anthropic-30b-raise-900b-2026 ($30B+/$900B+).
microsoft-mai-models-2026
Microsoft AI, 2026
June 2, 2026 (Build 2026). Microsoft AI launched seven in-house models trained from scratch: MAI-Thinking-1 (its first reasoning model, reported 97% on AIME 25 and 53% on SWE-Bench Pro, near Opus 4.6), MAI-Code-1 / MAI-Code-1-Flash (a GitHub-tuned coding model now in Copilot and VS Code), MAI-Image-2.5 / Flash, MAI-Transcribe-1.5, and MAI-Voice-2 / Flash. Framed around 'long-term self-sufficiency' and a 'superintelligence lab,' with co-design against Maia 200 silicon. Notable because Microsoft has been OpenAI's primary partner; the amended April 2026 agreement made that relationship non-exclusive, and these models are the partner becoming a frontier competitor.
rabanser-agent-reliability-science-2026
Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan, 2026
Princeton-led paper (latest version June 2, 2026) decomposing agent reliability into four dimensions — consistency, robustness, predictability, and safety — via twelve metrics. Evaluates 14 models across two benchmarks and finds recent capability gains have produced only small improvements in reliability; standard evaluations ignore whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Independent academic counterweight to vendor-reported calibration claims (e.g., Opus 4.8) and direct support for the baseline's capability-versus-reliability thesis.
softbank-france-5gw-2026
SoftBank Group, 2026
May 31, 2026 (Choose France summit). SoftBank committed up to €75B to develop and operate 5 GW of AI data center capacity in France, its largest European AI infrastructure investment. Phase 1 is ~€45B for 3.1 GW in the Hauts-de-France region by 2031 (Dunkirk, Bosquel, Bouchain), with a Schneider Electric power-module/enclosure manufacturing cluster at the Port of Dunkirk. Siting rationale is explicitly energy: France draws ~70% of power from nuclear and posts industrial electricity prices well under half the UK's. Concrete instance of clean firm baseload power reshaping compute geography.
anthropic-claude-outage-2026-06
Industry reporting (Cybersecurity News), 2026
June 5, 2026 multi-service disruption with elevated error rates across claude.ai, the Claude API, Claude Code, and Claude Cowork. Anthropic attributed it to infrastructure issues rather than a security breach. One of several Claude outages in 2026 (March, May). Minor but concrete deployment-reliability signal: agent workflows inherit the availability of the underlying platform.
Late-May 2026 close: Cognition (maker of Devin) raised over $1B at a $26B post-money valuation ($25B pre-money), led by Lux Capital, General Catalyst, and 8VC — about 2.5x its $10.2B September 2025 mark, finalizing the target reported in bloomberg-cognition-25b-raise-2026. Annualized revenue run-rate cited near $492M with enterprise Devin usage reported growing ~50% month-over-month. Continues the aggressive capital pricing of autonomous-coding-agent capability.
June 10, 2026 DeepMind position paper (15 authors incl. Shane Legg, Marcus Hutter, Allan Dafoe, Joel Z. Leibo, Iason Gabriel, Thore Graepel, Tim Genewein). Deliberately refuses point timelines and frames the AGI-to-ASI transition as a set of open research questions; a measured establishment-DeepMind counterweight to both aggressive-timeline (Aschenbrenner, AI 2027) and doom (Yudkowsky) poles. Characterizes ASI relative to large human-expert collectives and grounds the notion formally via the Legg-Hutter intelligence score and AIXI as the (incomputable) theoretical upper bound; argues the current pretrain-plus-finetune paradigm has no proven fundamental theoretical blocker to scaling toward universal intelligence, but does face clear practical limits (continual learning, long-context, robust planning). Four non-mutually-exclusive, likely-parallel pathways from AGI to ASI: (1) scaling compute/models/data; (2) algorithmic paradigm shifts; (3) recursive self-improvement; (4) multi-agent group-agent formation (collectives, markets, 'multi-agent scaling laws'). Six bottlenecks (Table 4): data wall, economic/natural-resource demand growing too fast, neural paradigm insufficient, research-gets-harder (Bloom et al.), abstraction barrier, and deliberate slowdown/regulation. Each is paired with possible counters, and whether each binds is treated as an open empirical question. Key analytic move: decoupling individual-model plateau from collective ASI. Even if per-model capability stalls at human level, ~10x/yr effective-compute growth (hardware ~1.5x × investment ~2.5x × algorithmic efficiency ~3-6x) plus ~25x/yr 'population scaling' (MacAskill & Moorhouse) could yield collective superintelligence by running millions of AGI instances faster and in parallel.
Introduces the Abstraction Barrier (Lerchner) and the Embodied Bottleneck: models trained on human abstractions may be bounded by human conceptual frameworks, and novel concept discovery must be validated against physical reality at real-world experiment speeds, imposing a linear brake on recursive self-improvement. Also catalogs fundamental limits of any ASI (Table 2: Landauer, Bremermann, Bekenstein, light-speed, P vs NP, Goedel/Halting, real-time physical experimentation) and uses Boden's three creativity levels plus Hassabis's 'could an AI have invented general relativity from 1900s knowledge? today the answer is no' as the test for transformative creativity / true ASI. Net stance: cruising past AGI into ASI within a decade or two 'cannot easily be dismissed,' but absent an intelligence explosion the more likely outcomes are either a plateau before AGI or a relatively smooth AGI-to-(weak-)ASI transition.
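The effective-compute decomposition in the DeepMind position-paper entry above (hardware ~1.5x, investment ~2.5x, algorithmic efficiency ~3-6x) is a simple product, so it can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the three factors are independent and multiply directly, and the variable and function names are hypothetical, not taken from the paper.

```python
import math

# Annual growth factors as (low, high) ranges, copied from the entry above.
# Treating them as independent multiplicative factors is an assumption.
hardware = (1.5, 1.5)
investment = (2.5, 2.5)
algorithmic_efficiency = (3.0, 6.0)


def product_range(*factors):
    """Multiply the low ends together and the high ends together."""
    low = math.prod(f[0] for f in factors)
    high = math.prod(f[1] for f in factors)
    return low, high


low, high = product_range(hardware, investment, algorithmic_efficiency)
print(f"{low:.2f}x to {high:.2f}x per year")  # prints "11.25x to 22.50x per year"
```

Under these assumptions the product spans roughly 11x to 22x per year, so the entry's ~10x/yr figure sits at or slightly below the low end of the listed ranges.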
apple-gemini-siri-wwdc-2026
Apple / industry reporting (TechCrunch, MacRumors, AppleInsider), 2026
WWDC 2026 (June 8-9). Apple shipped a rebuilt Siri whose server-side reasoning runs on a custom ~1.2-trillion-parameter Google Gemini model executed inside Apple's Private Cloud Compute, reportedly for ~$1B/year. Apple's own on-device foundation models remain Apple-built and contain no Gemini (per AppleInsider). Significance is distribution, not capability: the largest consumer device platform routes its assistant's heavy reasoning through a frontier lab's model rather than its own, the clearest consumer-side instance of the baseline's 'distribution cadence rivals release cadence' thread. Also a competitive note — Apple chose Google's model over OpenAI/Anthropic for the core assistant.
eu-ai-content-labelling-code-2026
European Commission (AI Office), 2026
June 10, 2026. The Commission published the final voluntary Code of Practice, prepared by independent experts in a multi-stakeholder process facilitated by the AI Office, to help providers and deployers meet AI Act Article 50 transparency obligations that apply from August 2, 2026. Covers machine-readable marking and detection of AI-generated/manipulated audio, image, video and text; mandatory labelling of deepfakes and of AI text published on matters of public interest; and disclosure when users interact with a chatbot. Commission and AI Board will assess adequacy and complement it with Article 50 implementation guidelines. Concrete operationalization of the EU AI Act transparency thread the baseline already tracks.
anthropic-when-ai-builds-itself-2026
Marina Favaro, Jack Clark (Anthropic Institute), 2026
June 4, 2026 Anthropic Institute report (not covered in the June 7 update). States Claude wrote more than 80% of the code merged into Anthropic's production systems and argues AI may be nearing a point where systems improve themselves with little meaningful human involvement, potentially outpacing safety and governance. Central recommendation: the world should preserve the 'option' to coordinate a slowdown or temporary pause of frontier development to let alignment research and societal structures catch up — Anthropic does not commit to a unilateral halt. Distinct from Amodei's 'Policy on the AI Exponential' (FAA-style mandatory testing) already in the baseline; this is an RSI-framed argument for a coordinated-pause option. Caveat: the >80% figure is the same revealed-preference 'lines merged' metric flagged by Redwood Research, not an audited productivity multiplier (self-reported gains remain 20-40%).
spacex-ipo-2026
Industry reporting (CNBC, CoinDesk), 2026
SpaceX priced its IPO at $135/share on June 11, 2026 (~$1.77T valuation, ~$75B raised, book ~4x oversubscribed), began trading June 12 on Nasdaq as SPCX, and closed at ~$161 (+19%), the largest IPO on record by deal size. Relevant to the baseline only via the compute-financing web: the earlier Colossus 1 lease note described SpaceX 'booking that spend as revenue ahead of its own listing.' The listing has now happened, so the Memphis/Colossus AI-compute revenue line sits inside a public company subject to disclosure.
June 9, 2026 release. Claude Fable 5 is the Mythos-class model made safe for general use — Anthropic calls it the most capable model it has made generally available, state-of-the-art on nearly all tested benchmarks (software engineering, knowledge work, vision, scientific research, autonomous task execution); Stripe is quoted reporting it 'compressed months of engineering into days.' Claude Mythos 5 is the identical underlying model with some safeguards lifted for authorized cybersecurity professionals and infrastructure providers (the Glasswing/defensive-cyber lineage). Public release of the Mythos line first seen as April's gated Claude Mythos Preview. Safety architecture: classifiers in cybersecurity, biology/chemistry, and distillation trigger a fallback to Claude Opus 4.8, on average in under 5% of sessions; mandatory 30-day traffic retention to defend against novel attacks; external bug bounty reported 'no universal jailbreaks in over 1,000 hours.' Pricing $10/M input, $50/M output; free on Pro/Max/Team/seat-based Enterprise plans through June 22, 2026. Significance for the baseline: the clearest instance to date of capability gating shipped as a product feature — and (see anthropic-fable-5-foreign-access-suspension-2026 and fable-5-jailbreak-degradation-backlash-2026) of that gating immediately stress-tested.
fable-5-jailbreak-degradation-backlash-2026
Industry reporting (TechCrunch, TechTimes), 2026
Two controversies within days of the June 9 launch. (1) Jailbreak: red-teamer Pliny the Liberator claimed a coordinated multi-step bypass of Fable 5's classifiers (Unicode substitution, conversation dilution, fictional framing, decomposing prohibited goals into innocuous sub-questions), posting screenshots of the model producing working software-exploit code and chemical-synthesis instructions and claiming to have extracted the system prompt; Anthropic disputed that isolated outputs constitute a true safety-system breach, citing 'no universal jailbreak in over 1,000 hours' of bug-bounty testing. (2) Silent degradation: security researchers, developers, and scientists reported Fable 5 quietly refusing or degrading legitimate high-risk work (cyber, bio, chemistry, distillation) without notice — including for users suspected of building competing systems — plus an aggressive 30-day data-retention policy and over-tuned classifiers. Anthropic apologized within days and made the Opus-4.8 fallback visible so users know when they are no longer talking to the full model, but kept the capability limits. Together a live demonstration that capability gating can be both porous (jailbroken) and over-broad (blocks legitimate work) at once.
anthropic-fable-5-foreign-access-suspension-2026
Anthropic, 2026
June 13, 2026. Anthropic received a U.S. government directive at 5:21pm ET, citing national-security authorities, to suspend access to Fable 5 and Mythos 5 by any foreign national whether inside or outside the United States, including foreign-national Anthropic employees; other Anthropic models unaffected. The letter gave no specific details of the national-security concern; Anthropic's understanding is that the government believes it became aware of a jailbreak method (described as asking the model to read a codebase and fix software flaws). Because nationality cannot be verified per session, the practical effect was that Anthropic disabled Fable 5 and Mythos 5 for ALL customers the same evening (~6:59pm PT) to ensure compliance — taking the just-launched flagship fully dark four days after release. Anthropic concluded the demonstrated capability was widely available from other models and routinely used by security professionals, and committed to sharing more detail within 24 hours. Corroborated by Bloomberg (2026-06-13) and reproduced/annotated by Simon Willison (simonwillison.net, 2026-06-13). Significance: the first time U.S. export-control / national-security authority has been used to deny foreign-national access to a deployed, generally-available frontier model rather than to chips or pre-release review — a new modality in the export-control thread.