모든 모델 랩이 에이전트 랩으로 변하다

Ahead of OpenAI’s likely IPO filing next week, Greg makes the latest in a series of comments where Model Labs are increasingly also building Agents as the product:

The quote is a big reversal of stance from a position ~uniformly held by anyone who worked at Team Big Model, including his previous head of OpenAI Labs:

This comes with the shuttering of AI21’s model team, which is now pivoting to agents:

and even the venerable DeepSeek is now building a “Harness team” for the first time:

The “Systems over Models” people will take this as a point of validation of what they have been saying all along… except for the nuance that models cotrained with harnesses does open the door for closing access to models even further — if you can effectively posttrain a model to only meaningfully perform with your closed source agent, then you get to funnel the majority of users to your agent at the expense of your model/API co-opetition.

But that’s a topic of a much larger discussion…

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

Agent Products, Harnesses, and the Shift Beyond “Just the Model”

The product surface is moving up-stack: A recurring theme was that model quality alone is no longer the moat; the winning product is increasingly model + harness + workflow + UI + memory + economics. @gdb put it bluntly: “the model alone is no longer the product,” while @dzhng argued top-tier products need model <> harness <> product symbiosis. The same pattern shows up in practice: @signulll framed ambient AI and agentic AI as the new seam of computing interfaces, and @teortaxesTex noted that harness research still risks converging on “replicate Claude Code” instead of exploring broader interfaces.
Coding-agent product differentiation is becoming concrete: OpenAI shipped another substantial Codex update via “codex thursday no. 6” with appshots, /goal improvements, remote computer use while locked, annotation mode, plugin sharing, and analytics. @gdb separately highlighted Appshots, while users reported meaningful workflow shifts: @gdb said it’s hard to remember coding before Codex, and @reach_vb said they haven’t opened an IDE in over a month. But product rough edges remain: @theo praised T3 Code’s remote feature as ahead of alternatives, then contrasted it with buggy remote workflows in Codex in a follow-up post. On the Claude side, @ClaudeDevs expanded auto mode to the Pro plan and added Sonnet 4.6 support; @_mohansolo also had to clarify and patch IDE support in Antigravity 2.0 after user backlash.

Model Performance, Cost Curves, and Frontier Competition

DeepSeek’s pricing move was the biggest market signal: @deepseek_ai made the 75% DeepSeek-V4-Pro discount permanent, triggering strong reactions because it materially changes the cost/performance frontier. @ArtificialAnlys quantified first-party pricing at $0.435/M input, $0.87/M output, $0.0036/M cached input, estimating a blended ~$0.18/M and placing V4 Pro on the Pareto frontier for intelligence vs run cost. They estimate running their Intelligence Index on V4 Pro costs ~3x less than Gemini 3.1 Pro Preview, ~12x less than GPT-5.5, and ~19x less than Claude Opus 4.7. Community reaction centered on DeepSeek’s push toward “intelligence too cheap to meter,” as @scaling01 put it. @Yuchenj_UW and @kimmonismus both emphasized the magnitude of the cut.
Gemini Flash improved, but usage feedback was mixed: @OfficialLoganK reported Gemini 3.5 Flash making major progress over 3.1 Pro on GDPval, claiming Flash is now “competing at the frontier,” and @Designarena placed it 16th overall on Design Arena, a 16-position jump from Gemini 3 Flash Preview. But several builders pushed back on usefulness vs benchmark gains: @Alezander907 saw only slight browser-agent improvement at higher cost, @giffmana argued this isn’t “Flash progress” if the brand still implies cheapness, and @jeremyphoward said the model feels optimized to max evals rather than cooperate with humans. That aligns with broader eval skepticism from @HamelHusain, who argued current tooling underweights qualitative, HITL judgment.
Qwen and Chinese frontier models keep compressing the race: The official @Alibaba_Qwen teasers and a long third-party review from @ZhihuFrontier portrayed Qwen3.7-Max as a meaningful step up, especially in instruction following, context reliability, and stability, while still suffering from verbosity and high token usage. Elsewhere, @scaling01 claimed recent ALE-Bench runs show Chinese models like Kimi-K2.6, DeepSeek-V4, GLM-5.1 outperforming several Western releases in that setting. @ArtificialAnlys also reported Cursor Composer 2.5 as 3–18x cheaper than Opus 4.7 and 5–32x cheaper than GPT-5.5 on Coding Agent benchmarks, with notably lower token use.

Protocols, Infra, and Agent Runtime Tooling

MCP’s new release candidate is a substantive protocol simplification: @dsp_ announced the MCP 2026-07-28 release candidate, with the key change that the protocol is now stateless: no handshake, no session ID, and any request can hit any server instance. The RC also introduces first-class extensions like MCP Apps and Tasks, plus auth hardening and a clearer deprecation policy. For infra teams, statelessness is a big operational shift: easier scaling, simpler load balancing, fewer sticky-session concerns.
Sandboxes and managed execution are becoming first-class primitives: @_philschmid demoed Gemini Managed Agents + Interactions API to give an agent a secure hosted Linux sandbox with memory and code execution. @CoreWeave launched CoreWeave Sandboxes in public preview for RL, agent tool use, and model eval, while @cnakazawa released Cloudsail for per-task Cloudflare sandboxes with shell, Codex, and GitHub access without exposing tokens. At the orchestration layer, @skypilot_org argued RL doesn’t work on Slurm because modern RL is a multi-service system with heterogeneous hardware and recovery needs.
Open-source harnesses and memory layers are proliferating: @NVIDIAAI open-sourced AI-Q agent skills for portable deep-research pipelines that can plug into arbitrary harnesses. @Teknium added Bitwarden support for key management in Hermes and later restored 256K context for Grok Build v0.1 in Hermes here. @shannholmberg described a shared-memory “gBrain” layer under Hermes agents, with typed folders and read-first access for specialist agents. @aakashadesara updated CTOP to support Devin and a CLI for listing, searching, and killing agent sessions.

Research: RL, Distillation, Architectures, and Evaluation

RL post-training and reward design are under active reconsideration: @RyanBoldi introduced Vector Policy Optimization (VPO), arguing scalar reward collapse during RL can sabotage test-time scaling. VPO instead optimizes vector-valued rewards, improving search performance even on the original scalar objective. @lateinteraction framed this as a way to train LLMs for more diverse environments and goals, while @FeiziSoheil connected it to broader moves toward structured feedback instead of a single reward number. Separately, @jsuarez teased a solution to a long-standing RL problem involving extreme sparsity, with initial sweeps showing SOTA on one internal environment.
Agent compilation/distillation is emerging as a serious economic idea: @dair_ai highlighted a paper showing a full agentic workflow—multi-step calls, tool use, scratchpads, decision structure—can be distilled into weights and run at ~100x lower inference cost while preserving near-frontier quality. This is one of the clearest technical arguments yet for compiling expensive runtime agent loops into cheaper deployable models.
Architecture work remains lively beyond vanilla transformers: @ChunyuanDeng introduced LT2, a linear-time looped transformer combining sparse and linear attention to make looping practical, along with a distilled Ouro-hybrid-1.4B. @ZyphraAI shared work extending Equilibrium Propagation beyond energy-based models toward biologically realistic neurons. On MoE, @Jianlin_S proposed Moving Quantile Balancing for sequence-level load balancing without a loss penalty. Meanwhile @allen_ai launched ArtifactLinker, which predicts which benchmarks a model is likely to set SOTA on before running them—a useful meta-eval tool amid growing benchmark sprawl.
Math and reasoning capability discourse shifted again: @cozyblaze265065 reported 99.46% on a multi-digit multiplication experiment using gpt-5.5 with medium reasoning and no tools, and @teortaxesTex noted modern LLMs can now do 100-digit multiplication without tools. That’s not a complete theory of reasoning, but it further weakens old “autoregression can’t do arithmetic” talking points.

Multimodal Systems: Video, Speech, World Models, and Imaging

Google’s I/O stack pushed toward persistent agents and world simulators: @Google introduced Gemini Spark, a 24/7 personal AI agent for recurring tasks, skills, and workflows. @GoogleDeepMind also launched Project Genie + Street View, letting users turn real U.S. locations into interactive worlds; follow-up posts confirm rollout to Google AI Ultra subscribers via Google Labs. The multimodal side was reinforced by @Google announcing Gemini Omni for conversational video creation/editing and custom avatars, while @emollick emphasized the significance of a fully multimodal system that can natively edit video.
Runway and image/video tooling keep raising editability: @runwayml released Aleph 2.0, supporting multishot sequences up to 30s at 1080p with targeted edits that preserve the rest of the scene. @CuriousRefuge highlighted SeeDance 2 Stitcher for seamlessly extending AI-generated cinematic clips using Omni-generated continuations.
Speech and image generation saw notable jumps: @ArtificialAnlys ranked Cartesia Sonic-3.5 as the new #1 TTS model on their Speech Arena, citing an Elo of 1218, support for 42 languages, and strong naturalness/transcript following. Cartesia claims 82ms end-to-end first audio in production here. In image generation, @wildmindai flagged Tencent’s Z-Image 6B as a pixel-space generator with no VAE, 1K resolution, and a transfer framework for converting Flux/SD models; related ecosystem work included Pixal3D demos from @victormustar and training support for Z-Image L2P 1k in AI Toolkit from @ostrisai.

Security, Cyber, and Policy Pressure

Cybersecurity is quickly becoming a proving ground for advanced agents: @AnthropicAI said Project Glasswing and partners found more than ten thousand high- or critical-severity vulnerabilities in essential software within a month, and explicitly warned the industry will need to adapt to the volume of vulnerabilities that models like Claude Mythos Preview can find. Security productization is following: @perplexity_ai open-sourced Bumblebee, a read-only scanner for macOS/Linux to detect risky packages, extensions, and AI tool configs; @AravSrinivas said enterprise deployment will require agentic sandboxes plus continuous security engineering.
US immigration policy changes triggered sharp backlash from AI leaders: Several high-engagement posts argued a proposed rule forcing green-card applicants to apply from outside the US would directly damage the AI talent pipeline. See @Nick_Davidov, @AndrewYNg, @theo, @garrytan, and @togelius. The common argument: the rule punishes legal high-skill immigrants, undermines startups and research, and harms US competitiveness in AI.

Top tweets (by engagement)