Feed
New today
PewDiePie builds his own agent orchestrator
RT by @dair_ai: 😂PewDiePie building his own agent orchestrator and releasing it was not on my 2026 bingo card.
Own the agent. Own the harness.
It's not that hard, folks.
- ›PewDiePie released an agent orchestrator that he built himself.
The Efficiency Frontier: Optimizing context management for agents
RT by @dair_ai: // The Efficiency Frontier //
Cool paper on context management.
As agents reuse the same documents and histories across many turns, the cheapest context strategy is not fixed. This work describes a principled rule for picking one per deployment instead of defaulting to whatever topped a benchmark in isolation.
Retrieval and compression methods are almost always benchmarked on accuracy and cost separately, so you never learn when one actually beats another under real load.
The Efficiency Frontier models context strategy selection as a single cost-performance problem, with a log-utility term for diminishing returns from extra context and a reuse parameter N that amortizes preprocessing across repeated queries.
Sweep N and the optimal strategy changes, exposing crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage about 25 percent at the same performance, and amortized memory compression runs over 50 percent cheaper than full-context prompting in higher-performance settings.
Paper: https://arxiv.org/abs/2605.23071
Learn to build effective AI agents in our academy: https://academy.dair.ai/
- ›The paper proposes a rule for choosing the optimal context-management strategy for each deployment.
Top AI papers of the week (May 24-31)
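A minimal sketch of deployment-aware selection as described in the post, not the paper's implementation. It assumes each strategy has a one-off preprocessing cost that is amortized over N reuses, a per-query token cost, and a log utility over the context it effectively delivers. The `Strategy` fields, the `scale` constant, and every number below are hypothetical and chosen only to show the crossover.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    preprocess_tokens: float   # one-off cost, reused across queries (hypothetical)
    query_tokens: float        # tokens sent with every query (hypothetical)
    effective_context: float   # useful context delivered per query (hypothetical)


def amortized_cost(s: Strategy, n_reuse: int) -> float:
    # Preprocessing is paid once and spread over n_reuse queries.
    return s.preprocess_tokens / n_reuse + s.query_tokens


def utility(s: Strategy, scale: float = 1000.0) -> float:
    # Log utility: each extra unit of context helps less than the last.
    return math.log1p(s.effective_context / scale)


def pick(strategies, n_reuse: int, min_utility: float) -> Strategy:
    # Cheapest strategy that still reaches the target utility.
    feasible = [s for s in strategies if utility(s) >= min_utility]
    if not feasible:
        raise ValueError("no strategy reaches the target utility")
    return min(feasible, key=lambda s: amortized_cost(s, n_reuse))


STRATEGIES = [
    Strategy("full_context", 0, 8000, 8000),
    Strategy("retrieval", 8000, 2000, 4000),
    Strategy("compression", 40000, 1000, 5000),
]

# Sweeping N moves the cheapest feasible strategy.
for n in (1, 10, 100):
    print(n, pick(STRATEGIES, n, min_utility=1.5).name)
```

With these made-up numbers the choice moves from full context to retrieval to compression as reuse grows, which is the kind of crossover region the post describes.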
Pinned: The Top AI Papers of the Week (May 24 - May 31)
- SkillOpt
- AutoScientists
- The Efficiency Frontier
- Language Models Need Sleep
- Adapting the Interface, Not the Model
- Forecasting Scientific Progress with AI
- Compiling Agentic Workflows into Weights
Read on for more:
- ›A roundup of 7 notable papers: SkillOpt, AutoScientists, The Efficiency Frontier, Language Models Need Sleep, Adapting the Interface, Not the Model, Forecasting Scientific Progress with AI, and Compiling Agentic Workflows into Weights.
Article on X
x.com/i/article/206078258805…
- ›A link to an article on X; its content is not shown in the source.
Earlier
The efficiency frontier: where will GPT-5.6 land?
RT by @dair_ai: The efficiency frontier!
Where do you think GPT-5.6 will land?
- ›Claude Opus 4.8 scores 58% Pass@1 on DeepSWE Bench, ranking second behind GPT-5.5.
A talk on LLM Wikis and HTML Artifacts
RT by @dair_ai: I did a talk on LLM Wikis and HTML artifacts recently, if you are curious to learn more on the topic: https://academy.dair.ai/events/cmovobp97000904l5h0n9a2yz
Doing a second session and a few releases on our platform around artifacts.
- ›The author gave a talk on LLM Wikis and HTML artifacts on the DAIR.AI platform.
HTML Artifacts are becoming a core tool for working with AI agents
RT by @dair_ai: Increasingly, HTML Artifacts are becoming a core part of how I work with AI agents.
Long-horizon agent sessions need a better way to surface insights about what work it has done.
This may not be obvious right now, but as you start to let your agent work on dynamic workflows, large codebases, long-running loops (e.g., using /goal), and deep research tasks, you need a good way to present results. Chat window is not it.
You also don't want to just trust everything the agents do. Artifacts help provide an important verification layer, which in turn enables important decision-making.
I like HTML artifacts because I can just ask the agent to produce as many of them (and in whatever form) as I need to verify the work and make sense out of everything. I even built a nice tab system for my artifacts. They are great for continual learning and research.
I use HTML artifacts for logging, tracking experiments, brainstorming, managing my inbox, code reviews, agent session management, deep research, writing, reading, and so much more.
I believe @karpathy wrote about this somewhere: As we move on to more advanced applications of AI agents and outputs get more complex, we will start to find the need for even more advanced forms of interactions with AI, including interactive neural videos/simulations.
- ›HTML Artifacts present complex results from long-horizon agent sessions better than a chat window does.
MCP will be a fundamental key to the development of AI agents
RT by @dair_ai: In a few months, people will start to realize how fundamentally important MCP for agents is.
It's not even about connecting tools. There are many ways to do that.
It's about the types of abstraction it already enables. My new self-improving system, enabled through agent-to-agent interaction, is all powered by MCPs.
This was not an accident. I ran my entire orchestrator through a self-improving loop with clear criteria/goal, and it came up with all kinds of interesting ways (mostly powered by MCP tools) on how to enable complex interactions, versioning, eval workflows, communications, tools, etc.
Something new could always emerge, but I think the protocol itself will be crucial and necessary for all the advancements ahead.
MCP is the future. And I am glad a lot of it is built in the open.
- ›MCP is not only about connecting tools; it is also about the new kinds of abstraction it enables.
Do proactive agents really need an LLM to decide when to "wake up"?
Pinned: Do proactive agents really need an LLM to decide when to wake?
The default proactive agent calls an LLM on every event just to decide whether to wake up. That is a lot of expensive inference spent on a yes or no.
New research from Microsoft and Purdue asks whether the trigger really needs a language model at all.
Their answer is a 220MiB temporal-graph encoder that decides when to wake and what context to anchor. It gains +16.7 mean F1 across 14 backbones, runs 4 to 83x faster, and fits on-device at around 11ms per event.
If you run an always-on agent loop, the polling decision is quietly the main cost driver. A tiny encoder removes it without giving up accuracy.
Paper: https://arxiv.org/abs/2605.30152
- ›Conventional proactive agents waste compute by calling an LLM on every event just to decide whether to wake.
Scaling Laws for Agent Harnesses
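The cost argument can be illustrated with a small gating loop: a cheap local scorer sees every event, and the expensive model is called only when the score clears a threshold. This is a sketch under stated assumptions. The paper's trigger is a temporal-graph encoder; the `keyword_gate` below is a deliberately trivial stand-in, and `fake_llm`, the keyword list, and the events are all hypothetical.

```python
from typing import Callable, Iterable, List


def run_gated(
    events: Iterable[str],
    cheap_gate: Callable[[str], float],
    expensive_handler: Callable[[str], str],
    threshold: float = 0.5,
) -> List[str]:
    # Score every event locally; wake the expensive model only above threshold.
    responses = []
    for event in events:
        if cheap_gate(event) >= threshold:
            responses.append(expensive_handler(event))
    return responses


def keyword_gate(event: str) -> float:
    # Stand-in scorer (assumption): a real system would use a learned encoder.
    urgent = ("error", "failed", "deadline")
    return 1.0 if any(word in event.lower() for word in urgent) else 0.0


llm_calls: List[str] = []


def fake_llm(event: str) -> str:
    # Placeholder for an LLM call; it only records that it was invoked.
    llm_calls.append(event)
    return f"handled: {event}"


events = ["heartbeat ok", "Build FAILED on main", "cursor moved", "Deadline in 1 hour"]
responses = run_gated(events, keyword_gate, fake_llm)
print(f"{len(llm_calls)} model calls for {len(events)} events")
```

The design point is that the per-event cost becomes the gate's cost, and the model's cost is paid only for events that warrant a response.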
RT by @dair_ai: // Scaling Laws for Agent Harnesses //
If you build agent harnesses, this one is worth your time.
(bookmark it)
Most harness tuning treats every token and tool call as if volume is all that counts. New research shows that most of it does not.
The work introduces Effective Feedback Compute (EFC), a coordinate that counts only the feedback an agent can actually act on.
Raw token and tool-call counts explain agent failure at R² of 0.33 to 0.42. EFC pushes that to 0.99.
Why does it matter?
Once you budget by useful feedback instead of raw volume, reallocation alone lifts success from 0.27 to 0.90 at the same compute. This also turns harness design from guesswork into something you can predict.
Paper: https://arxiv.org/abs/2605.29682
- ›Introduces Effective Feedback Compute (EFC), a metric that counts only the feedback an agent can actually act on, instead of counting all tokens and tool calls.
AutoScientists: Decentralized AI Scientists
Banger paper from Harvard.
AutoScientists drops the central planner entirely. Agents interpret shared experimental data, self-organize around promising directions, evaluate proposals before resource allocation, and document successes AND failures. Decentralized AI co-scientists with failure documentation as a first-class step.
Validated across three concrete domains. Biomedical ML reaches 74.4% mean leaderboard percentile. Language model training converges 1.9x faster. Protein fitness prediction lifts +12.5% on specific assays and +6.5% broader.
The strongest argument so far that the AI-scientist bottleneck is governance rather than raw capability.
Paper: https://arxiv.org/abs/2605.28655
- ›AutoScientists removes the central planner entirely; agents self-organize around promising directions and evaluate proposals before allocating resources.
Building a Self-Improving Coding Agent in 24 Hours
RT by @dair_ai: It's crazy that this is even possible today.
It inspired me to build my own self-improving coding agent with simple read, write, bash,...
I already used the coding agent to build an entire production-grade application in 24 hrs.
I don't know, man. This feels so strange.
- ›A developer built a self-improving coding agent using basic tools: read, write, bash.
Stronger Models Do Not Need More Complex Harnesses
Stronger models do not always need lighter harnesses.
Everyone believes more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance. Together, that implies a clean inverse relationship between model tier and optimal harness complexity.
This new research tests it with a controlled 432-run experiment, six models across four capability tiers crossed with three harness conditions, on a 24-task benchmark with git-based workspace verification.
For a frontier chat model, increasing harness verbosity dropped success by 29 to 38 percentage points. They call it the harness-complexity paradox.
Paper: https://arxiv.org/abs/2605.26731
- ›A counterintuitive observation: increasing harness complexity reduces the performance of stronger models.
Agents Age Too: Reliability Over Time
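The stated run count is consistent with one run per (model, harness condition, task) cell: 6 × 3 × 24 = 432. The sketch below enumerates such a grid. The one-run-per-cell reading is an assumption, and the model, condition, and task names are placeholders rather than the paper's.

```python
from itertools import product

# Placeholder identifiers (assumptions); the post does not name them.
models = [f"model_{i}" for i in range(1, 7)]       # six models
harnesses = ["minimal", "moderate", "verbose"]     # three harness conditions
tasks = [f"task_{i:02d}" for i in range(1, 25)]    # 24-task benchmark

# Full factorial design: every model under every harness on every task.
runs = list(product(models, harnesses, tasks))
print(len(runs))
```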
RT by @dair_ai: // Your Agents are Aging Too //
Huh!? They need "sleep," and now they are aging?
Joke aside, great write-up on reliable agentic engineering.
This new research introduces AgingBench, a longitudinal reliability benchmark. It organizes agent aging into four mechanisms, including compression aging and interference aging, and measures not just whether deployed agents degrade but what form the degradation takes and where repair should target.
We benchmark agents on day one and then deploy them for months. That gap hides a basic systems question. How long does an agent stay reliable after deployment?
Even with frozen model weights, an agent's effective state keeps shifting. It compresses interaction history, retrieves from a growing memory store, revises facts after updates, and goes through routine maintenance. Reliability becomes a lifespan property of the full harness, not a snapshot of the base model.
Paper: https://arxiv.org/abs/2605.26302
- ›Introduces AgingBench, a benchmark that measures agent reliability over long periods after deployment.
Building future-proof AI with a composable architecture
RT by @dair_ai: For future-proof, build AI that's composable.
Regardless of what you use, all these should be composable, iterative, and customizable:
- LLMs
- Evals
- Automations
- MCP/CLI tools
- Skills/Memory/Context
- Agent Harness (Codex, CC, Pi,...)
The compounding effects are insane.
- ›For AI to be future-proof, its components (LLMs, evals, automations, MCP/CLI tools, skills, memory, context) need to be designed to be composable and customizable.
Language models need sleep
// Language Models Need Sleep //
Let your agents "sleep", folks.
On a serious note, this is a fascinating paper on getting the most from long-horizon agents.
Here is the problem with agents today: Attention scales badly with context length, so long-horizon agents keep paying a quadratic tax at inference time.
This work proposes a sleep-like consolidation step instead. The model periodically does N offline recurrent passes over recent context, writes the result into persistent fast weights in its state-space blocks, then clears the KV cache.
The effect is that extra compute moves to sleep while wake-time prediction stays low latency. On cellular automata, multi-hop graph retrieval, and a math reasoning task where a plain transformer and SSM-attention hybrids fail, longer sleep durations improve performance, with the biggest gains on examples that need deeper reasoning.
Why does it matter?
It points at an alternative to ever-larger KV caches for agents that run for a long time. Consolidate, then forget the raw tokens.
Paper: https://arxiv.org/abs/2605.26099
- ›Today's long-horizon agents pay a quadratic cost because attention scales poorly with long contexts.
System scaling is the real bottleneck of agentic AI
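A back-of-the-envelope sketch of the quadratic tax, assuming plain causal attention where token t attends to all t tokens so far. It compares an ever-growing cache with one that is cleared every `window` tokens. The cost of the consolidation passes themselves is ignored here, and the token counts are hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    # Causal attention: token t attends to t positions, so the total is 1 + 2 + ... + T.
    return num_tokens * (num_tokens + 1) // 2


def pairs_with_periodic_clear(num_tokens: int, window: int) -> int:
    # The cache is cleared every `window` tokens, so each segment starts from empty.
    full_segments, remainder = divmod(num_tokens, window)
    return full_segments * attention_pairs(window) + attention_pairs(remainder)


total_tokens = 100_000   # hypothetical length of an agent session
window = 1_000           # hypothetical tokens between "sleep" steps

growing = attention_pairs(total_tokens)
cleared = pairs_with_periodic_clear(total_tokens, window)
print(growing, cleared, round(growing / cleared, 1))
```

Under these assumptions, wake-time attention work grows linearly in session length instead of quadratically. Whether that is a net win depends on how much compute the offline passes consume and how well the consolidated state preserves what the raw tokens held.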
System scaling is the next real bottleneck in agentic AI.
If you build agent orchestration layers, this is a clean map of where the engineering leverage actually sits. The labs own the model. You own the harness, and that is increasingly where agent quality is won or lost.
The default mental model still puts all the weight on the foundation model. Bigger model, better agent. But agent behavior actually emerges from the whole stack around it. Memory substrate, context constructor, skill routing, orchestration loop, and the verification and governance layer.
This new research calls that stack the harness and argues we should treat it as a first-class object of design and evaluation. It names three core bottlenecks to scale. Context governance, trustworthy memory, and dynamic skill routing. It also ships CheetahClaws, a Python-native reference harness, and compares it with Claude Code and OpenClaw.
Paper: https://arxiv.org/abs/2605.26112
- ›Agent quality depends not only on the foundation model but on the whole stack: memory, context constructor, skill routing, orchestration, and the governance layer.
New agent skill: converting YouTube videos into slides and notes
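One way to picture the harness as a first-class design object is a loop whose parts are swappable. The sketch below is purely illustrative and is not CheetahClaws or any existing harness: the `Harness` class, its field names, and the stub functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Harness:
    build_context: Callable[[str, List[str]], str]   # context constructor
    route_skill: Callable[[str], str]                # skill routing
    call_model: Callable[[str, str], str]            # foundation model (stubbed below)
    verify: Callable[[str], bool]                    # verification / governance layer
    memory: List[str] = field(default_factory=list)  # memory substrate

    def step(self, task: str) -> Optional[str]:
        # One pass of the orchestration loop.
        context = self.build_context(task, self.memory)
        skill = self.route_skill(task)
        output = self.call_model(skill, context)
        if not self.verify(output):
            return None                 # rejected outputs never reach memory
        self.memory.append(output)
        return output


# Stub components (assumptions) so the loop can run without a real model.
harness = Harness(
    build_context=lambda task, mem: f"{task} (memory items: {len(mem)})",
    route_skill=lambda task: "code" if "bug" in task else "research",
    call_model=lambda skill, ctx: f"[{skill}] {ctx}",
    verify=lambda out: out.startswith("[code]"),
)

first = harness.step("fix bug")
second = harness.step("read paper")
print(first, second, len(harness.memory))
```

Each field can be replaced independently, which is the sense in which the stack, rather than the model alone, becomes something to design and evaluate.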
New agent skill to convert YouTube videos to slides and notes.
- ›A new agent skill that can perfectly extract slides from YouTube videos.
New agent skill: extracting slides from YouTube videos and writing notes into Obsidian
RT by @dair_ai: Just built an insane new agent skill.
It can perfectly extract slides from YT videos, then write notes, images, transcripts, and slides into Obsidian vaults.
An HTML artifact allows me to navigate and add more notes as I listen.
Should I release the skill?
- ›An agent skill that can perfectly extract slides from YouTube videos.
/goal is great: how to get the most out of coding agents today
RT by @dair_ai: /goal is really insane!
It's how you can get the most out of coding agents today.
For efficiency, I find it works best when you do planning before /goal. This ensures the agent has the right context and goal, which often only happens with careful planning.
- ›The /goal feature is a powerful tool for getting the most out of coding agents.
The efficiency frontier of LLMs: are you overpaying for context?
// The Efficiency Frontier in LLMs //
(bookmark this one)
How much are you overpaying for context you do not need?
It turns out that context costs dominate production LLM bills, and the right strategy depends on how often you reuse preprocessing. Modeling that explicitly lets you pick the cheapest point that still hits your target quality.
This work treats context-strategy selection as a deployment-aware optimization problem instead of a fixed choice, using amortized cost modeling across performance, token cost, and preprocessing reuse.
It achieves roughly 25% token savings at equal F1 (around 0.78), and amortized memory compression delivers more than 50% lower token cost versus full-context in high-performance settings. Tested on 5,000 HotpotQA instances.
Paper: https://arxiv.org/abs/2605.23071
- ›Context costs dominate production LLM bills, and the optimal strategy depends on how often preprocessing is reused.
Microsoft paper: self-evolving agent skills (SkillOpt)
New paper from Microsoft on Self-Evolving Agent Skills
- ›Microsoft introduces SkillOpt, a tool that optimizes agent skills by editing skill files under validation supervision.
High demand for AI engineers and researchers: learn the fundamentals, build with AI
RT by @dair_ai: We are going to need so many engineers and researchers where things are headed.
Don't listen to the noise, go learn the fundamentals and build/collaborate with AI as much as you can.
- ›The future of AI will need a great many engineers and researchers to develop what comes next.
Top AI papers of the week (May 18-24): AIRA, MetaCogAgent, Memory as a Model, and more
The Top AI Papers of the Week (May 18 - 24):
- AIRA
- MetaCogAgent
- Memory as a Model
- Code as Agent Harness
- Weak-Model Critic-Comparator
- OpenAI Disproves the Unit Distance Conjecture
- Production Agent Architecture Methodology
Read on for more:
- ›This week features 7 important AI papers, including AIRA, MetaCogAgent, Memory as a Model, and Code as Agent Harness.
Top AI papers of the week (full details)
x.com/i/article/205826023785…
- ›A detailed summary of the 7 top AI papers for the week of May 18-24, posted on X.