Feed

25 latest items

Sebastian Raschka · X · Post · 5 days ago

MiniMax M2 technical analysis: attention, MoE, agent training, and self-evolution

The MiniMax M2 series was one of the most widely used open-weight LLM series earlier this year. Now, we got a technical report with some interesting tidbits. I summarized some of them below:

1. Full attention as an anti-trend?: They tried hybrid sliding-window attention variants (like so many others, e.g., Xiaomi MiMo, Laguna, Gemma 4, Arcee, Olmo 3, etc.). But even though there were efficiency gains, they said that the production-quality tradeoffs were not worth it for M2.
2. Linear and sparse attention deployment issues: They found that linear and sparse attention are attractive on paper because they reduce the cost of long-context attention, but they are harder to make work well in a production agent system. In particular, they found that these efficient attention variants may be more fragile when KV-like state or intermediate memory is stored in lower precision. Also, they have worse prefix caching support, which matters a lot when using coding agents (which reuse a lot of the context).
3. Fine-grained Mixture-of-Experts (MoEs) are useful: Finally a recent MoE ablation study! It's only on the 2B-active parameter scale, but hey, better than nothing. Concretely, they compare a baseline with 32 experts and top-2 routing against a fine-grained setup with 128 experts and top-8 routing. The fine-grained setup improves MATH from 19.6 to 24.1 and HumanEval from 29.7 to 32.5. That's clearly a win for more fine-grained experts (confirming what the DeepSeek MoE paper reported ~2 years ago).
4. Sophisticated agent pipeline: It's probably no surprise, but this paper confirms that training for agent-like behavior on software engineering tasks is now a big component of the training pipeline. They mine GitHub pull requests, build runnable Docker environments, extract task-specific test rewards, etc.
5. Interleaved thinking for context management: Interestingly, they found that removing reasoning blocks from previous turns results in worse performance, especially in multi-step agent tasks. (Another reason why long-context support is so important these days.)
6. Speed rewards: It's common to have token usage penalties, but what's interesting is that the MiniMax team adds a task-completion-time reward that depends on wall-clock time. This is to minimize unnecessary (slow) tool calls. Also, I'm thinking that this would encourage agent parallelization (if supported by the harness).
7. Self-evolution: Looks like self-evolution is also already a big design component of open-weight LLMs. E.g., the paper says that M2.7 already handles 30 to 50 percent of the daily RL iteration workload, modifies its own scaffold, and completed a 100-round autonomous scaffold optimization cycle with a 30 percent gain on internal evaluations.
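
To make the two expert configurations in the MoE ablation concrete, here is a minimal sketch of top-k routing in plain Python. The function names, the softmax over only the selected logits, and the toy logits are illustrative assumptions, not details taken from the MiniMax report.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest logits, highest first (ties keep index order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only; subtract the max for stability.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_expert_fraction(num_experts, k):
    """Fraction of experts that process each token."""
    return k / num_experts


# The two setups from the ablation: 32 experts/top-2 vs. 128 experts/top-8.
baseline = active_expert_fraction(32, 2)
fine_grained = active_expert_fraction(128, 8)

# Number of distinct expert subsets a token can be routed to.
baseline_combos = math.comb(32, 2)
fine_grained_combos = math.comb(128, 8)
```

Both setups activate the same fraction of experts per token (2/32 = 8/128), so the fine-grained variant mainly increases the number of possible expert combinations. Whether each fine-grained expert is also proportionally smaller is not stated in the post.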

›Full attention still beats sliding-window variants on production-quality tradeoffs, and linear/sparse attention is fragile in agent systems.
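
As a rough illustration of what is being traded off, the sketch below builds a full causal mask and a sliding-window causal mask as nested Python lists and counts how many query-key pairs each one keeps. The helper names and the window size are hypothetical; real implementations build these masks as tensors inside the attention kernel.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal mask limited to the most recent `window` positions (self included)."""
    return [[i - window < j <= i for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(mask):
    """Count the query-key pairs that the mask keeps."""
    return sum(sum(row) for row in mask)


full = attended_pairs(causal_mask(4))                 # 1 + 2 + 3 + 4 = 10
local = attended_pairs(sliding_window_mask(4, 2))     # 1 + 2 + 2 + 2 = 7
```

Full attention grows quadratically with sequence length, while the windowed variant grows roughly linearly, which is where the efficiency gains mentioned in the post come from.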

#LLM #Mixture-of-Experts #Agent Architecture

Sebastian Raschka · X · Post · 9 days ago

Implementing DeepSeek Sparse Attention from scratch in the LLMs-from-scratch repo

Added a DeepSeek Sparse Attention (DSA) from-scratch implementation to my LLMs-from-scratch repo thanks to an awesome new reader contrib. With motivation, overview, and GPT-style model reference implementation as standalone example code: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/09_dsa

›Adds a DeepSeek Sparse Attention (DSA) implementation to LLMs-from-scratch, with motivation, overview, and a GPT-style reference.

#Sparse Attention #LLM #Open Source

Sebastian Raschka · X · Post · 10 days ago

P.S.: It can be found in recent Qwen models since Qwen3-Next

R to @rasbt: PS: it can be found in recent Qwen models since Qwen3-Next

›This feature is already available in recent Qwen models such as Qwen3-Next.

#LLM #Qwen

Sebastian Raschka · X · Post · 10 days ago

Gated DeltaNet-2: improving linear attention performance with a decoupled mechanism

Gated DeltaNet has been one of my favorite "hybrid attention" newcomers in the good old transformer stack. Excited to see Gated DeltaNet-2. Adding it to my reading stack. In the meantime, I have a primer on Gated DeltaNet here: https://magazine.sebastianraschka.com/i/177848019/26-gated-deltanet

›Gated DeltaNet-2 is a new hybrid attention architecture that separates the erase and write mechanisms instead of sharing a single gate.

#Linear Attention #State Space Models #LLM

Sebastian Raschka · X · Post · 11 days ago

Cohere Command A+: a powerful LLM optimized for hardware efficiency

It's been *almost* a bit quiet around LLM architecture releases in the past two weeks 😅 An interesting tidbit is the parallel block design. Via the Cmd-A tech report: "equivalent performance but significant improvement in throughput compared to the vanilla transformer block."
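
For readers unfamiliar with the parallel block design, here is a toy sketch of the dataflow difference, using scalar stand-ins for the sublayers. It follows the common parallel formulation (attention and MLP both read the same normalized input, as in GPT-J and PaLM); whether Command A+ shares a single norm exactly like this is an assumption, and all function names are hypothetical.

```python
def sequential_block(x, attn, mlp, norm1, norm2):
    """Vanilla pre-norm block: the MLP consumes the attention output."""
    h = x + attn(norm1(x))
    return h + mlp(norm2(h))


def parallel_block(x, attn, mlp, norm):
    """Parallel block: attention and MLP both read the same normalized input."""
    n = norm(x)
    # The two branches do not depend on each other, so they can be
    # computed concurrently (or with fused matmuls) before the residual add.
    return x + attn(n) + mlp(n)


# Toy scalar "sublayers" that only serve to trace the dataflow.
def toy_attn(v):
    return 2 * v


def toy_mlp(v):
    return v + 1


def identity(v):
    return v


seq_out = sequential_block(1, toy_attn, toy_mlp, identity, identity)  # 7
par_out = parallel_block(1, toy_attn, toy_mlp, identity)              # 5
```

The outputs differ because the MLP sees different inputs in the two layouts. The throughput benefit quoted from the tech report comes from removing the dependency between the two sublayers.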

›Cohere released Command A+, its most powerful LLM, optimized to run on less hardware and released as open source.

#LLM #Open Source #Model Architecture

Sebastian Raschka · X · Post · 16 days ago

A visual overview of recent advances in LLM architectures

New article: a visual tour of recent LLM architecture advances, from Gemma 4 to DeepSeek V4. I focus on long-context efficiency tweaks like KV sharing, per-layer embeddings, layer-wise attention budgets, compressed attention, and mHC. Link: https://magazine.sebastianraschka.com/p/recent-developments-in-llm-architectures

›Sebastian Raschka's article presents recent LLM architecture advances, from Gemma 4 to DeepSeek V4, through illustrations.

#LLM #Model Architecture #Long-context

Sebastian Raschka · Blog · Article · 16 days ago

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

›New LLM architectures focus on efficient long-context processing through KV sharing, per-layer embeddings, and compressed attention.
›KV-cache size, memory traffic, and attention cost become the main constraints when agent workflows retain many tokens.
›Gemma 4, Laguna XS.2, ZAYA1-8B, and DeepSeek V4 apply these architectural techniques to reduce compute cost.
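
To see why KV-cache size becomes a constraint, a back-of-the-envelope estimate helps. The sketch below uses the standard formula for a plain (uncompressed) KV cache; the model dimensions are hypothetical and do not describe any of the models named above, and modeling KV sharing as "fewer layers with their own cache" is a simplifying assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of an uncompressed KV cache in bytes."""
    # The leading 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads of size 128, 128k-token context,
# 2 bytes per value (e.g. bfloat16).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=131072)
print(full / GIB)  # 16.0

# If every pair of layers shared one KV cache, only 16 layers would store one.
shared = kv_cache_bytes(num_layers=16, num_kv_heads=8, head_dim=128,
                        seq_len=131072)
print(shared / GIB)  # 8.0
```

The cache grows linearly with context length, so long agent sessions that retain many tokens pay this cost per sequence.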

#LLM #Model Architecture #Attention mechanism #Compute Efficiency

Sebastian Raschka · X · Post · 18 days ago

Active parameter ratio table for LLMs

R to @rasbt: The table in HTML format for easier (and non-truncated) viewing: https://sebastianraschka.com/llm-architecture-gallery/active-parameter-ratio/

›Sebastian Raschka shares the active-parameter-ratio table for LLMs in HTML format so it can be viewed in full without truncation.

#LLM #Model parameters

Sebastian Raschka · X · Post · 18 days ago

Observation: DeepSeek is still king of the active-parameter ratio

Meta observation: DeepSeek is still king of the active-parameter ratio

›DeepSeek leads other models in active-parameter ratio.
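
For reference, the active-parameter ratio of a mixture-of-experts model is the number of parameters used per token divided by the total parameter count. The sketch below computes it for made-up models; the names and sizes are placeholders, not figures from the linked table.

```python
def active_parameter_ratio(active_params, total_params):
    """Share of a model's parameters that is used for each token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


# Hypothetical (active, total) parameter counts in billions.
models = {
    "dense_model": (8, 8),      # dense: every parameter is active
    "moe_model_a": (10, 100),
    "moe_model_b": (5, 200),
}

# Sort from the sparsest (lowest ratio) to the densest model.
ranking = sorted(models, key=lambda name: active_parameter_ratio(*models[name]))
```

A lower ratio means a sparser model: more total capacity for the same per-token compute.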

#LLM #DeepSeek #Model Efficiency

Sebastian Raschka · X · Post · 19 days ago

Lessons from building LLM architectures from scratch in Python and PyTorch

A little talk on what we can learn from implementing LLM architectures from scratch in Python and PyTorch. And how I approach new open-weight models, compare them against reference implementations etc: https://www.youtube.com/watch?v=TXzQ7PGpO6w

›Studying LLM architectures by coding them from scratch gives a deeper understanding of their internal mechanisms.

#LLM #PyTorch #Model Architecture

Sebastian Raschka · X · Post · 19 days ago

Lighthouse Attention: a low-cost attention modification for efficient training

Interesting paper. What I like about this is that it is a relatively low-commitment attention modification. I.e., one can use it during most of training, switch back to vanilla attention near the end, and recover roughly the same modeling performance as if full attention had been used the whole time.
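
The "use it for most of training, then switch back" idea can be pictured as a simple schedule. This is only a sketch of the scheduling logic: the 90 percent switch point, the mode labels, and the function name are assumptions, not values from the paper.

```python
def attention_mode(step, total_steps, switch_fraction=0.9):
    """Return which attention variant to use at a given training step."""
    # Steps before the switch point use the cheaper modified attention;
    # the remaining steps fall back to vanilla full attention.
    switch_step = int(total_steps * switch_fraction)
    return "modified" if step < switch_step else "full"


# Hypothetical 10-step run that switches halfway through.
schedule = [attention_mode(s, 10, switch_fraction=0.5) for s in range(10)]
```

In a real training loop, the returned mode would select which attention implementation the model's forward pass uses for that step.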

›Lighthouse Attention wraps standard attention in a subquadratic compression layer, reducing compute cost.

#Attention mechanism #LLM #Training Optimization

Sebastian Raschka · X · Post · 22 days ago

Overview of recently released LLM architecture components

Back from a little family break! Lots has happened, and I’m planning to do a deeper dive into the most interesting architectural components (soon). Btw, are there any major architectures I missed below?

›Many new LLMs and architectures have been released with interesting improvements.

#LLM #Model Architecture #Roundup

Sebastian Raschka · X · Post · 29 days ago

Access the detailed LLM architecture gallery in high resolution

R to @rasbt: As always, more details and higher-res versions at https://sebastianraschka.com/llm-architecture-gallery/

›The LLM Architecture Gallery page provides comprehensive details and high-resolution versions of the architecture figures.

#LLM #Model Architecture #Resources

Sebastian Raschka · X · Post · 29 days ago

Second batch of new April LLMs: Ant Ling, Minimax, Xiaomi, and more

Here is a 2nd batch of April architecture drops. What a month!
- Ant Ling 2.6 1T
- Minimax M2.7
- Xiaomi MiMo V2.5
- Poolside Laguna XS.2
- Tencent Hy3-preview
- IBM Granite 4.1

›April saw a large wave of new models: Ant Ling 2.6 1T, Minimax M2.7, Xiaomi MiMo V2.5.

#LLM #Model Architecture #New Models

Sebastian Raschka · X · Post · about 1 month ago

Higher-resolution figures and summaries in the LLM architecture gallery

R to @rasbt: Higher res figures (and summaries) in the LLM architecture gallery: https://sebastianraschka.com/llm-architecture-gallery/#card-deepseek-v4-pro

›The LLM architecture gallery was updated with higher-resolution figures that show more detail.

#LLM #Model Architecture #Visualization

Sebastian Raschka · X · Post · about 1 month ago

April: a strong LLM release season with Gemma 4, GLM-5.1, Qwen3.6, Kimi K2.6, DeepSeek V4

April was a pretty strong month for LLM releases:
- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4
All are now added to the LLM Architecture Gallery. More details once I am fully back in May!

›Five major LLMs were released in April: Gemma 4, GLM-5.1, Qwen3.6, Kimi K2.6, DeepSeek V4.

#LLM #Model Releases #Architecture

Sebastian Raschka · Blog · Article · about 1 month ago

My Workflow for Understanding LLM Architectures

›The workflow starts with the official technical report, although papers today often contain less detail.
›If the model is shared on the Hugging Face Model Hub and supported by the transformers library, you can inspect the config and reference implementation to understand the architecture in detail.
›"Working" code never lies, so it is the most reliable source of information.
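
As a small illustration of the "inspect the config" step, the sketch below summarizes a few architecture facts from a config dictionary. The field names follow the Llama-style `config.json` convention used by many models on the Hub; other model families may use different keys, and the example values are made up rather than taken from a specific model.

```python
import json


def summarize_config(config):
    """Derive a few architecture facts from a Llama-style config dict."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    # Configs without this key use one KV head per query head.
    kv_heads = config.get("num_key_value_heads", heads)
    head_dim = config.get("head_dim", hidden // heads)
    if kv_heads == heads:
        attention = "MHA"   # multi-head attention
    elif kv_heads == 1:
        attention = "MQA"   # multi-query attention
    else:
        attention = "GQA"   # grouped-query attention
    return {
        "layers": config["num_hidden_layers"],
        "head_dim": head_dim,
        "query_heads_per_kv_head": heads // kv_heads,
        "attention": attention,
    }


# Made-up example; in practice this would be the contents of config.json.
example = json.loads(
    '{"hidden_size": 4096, "num_hidden_layers": 32,'
    ' "num_attention_heads": 32, "num_key_value_heads": 8}'
)
summary = summarize_config(example)
```

Cross-checking values like these against the reference implementation is what reveals details a technical report leaves out.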

#LLM #Transformers #Model Hub #Model Architecture