Dòng tin

6 nội dung mới nhất

Tất cả

Simon WillisonBlogBài viết·1 ngày trướcHot

Cách Anthropic tách biệt Claude trên các sản phẩm

How we contain Claude across products

›Anthropic công bố chi tiết cách sử dụng sandbox để kiểm soát Claude trên Claude.ai, Claude Code và Cowork.
›Sử dụng process sandboxes, VMs, filesystem boundaries, và egress controls để tạo ranh giới cứng cho agents.
›Claude.ai dùng gVisor, Claude Code dùng Seatbelt (macOS) hoặc Bubblewrap (Linux), Cowork dùng full VM.
›Bài viết đề cập các rủi ro bị bỏ qua như lỗ hổng exfiltration qua api.anthropic.com/v1/files.
›Anthropic cung cấp SRT (Sandbox Runtime) open source để sandbox code.

#An toàn AI #Sandboxing #Claude #Bảo mật

AK (_akhaliq)XBài đăng·3 ngày trước

AgentDoG 1.5 - Khung làm việc nhẹ cho An toàn Agent AI

AgentDoG 1.5 A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

›AgentDoG 1.5 là framework alignment cho agent AI

#Agent AI #An toàn AI #Alignment

Jeremy HowardXBài đăng·4 ngày trước

Anthropic công bố Claude Mythos ra mắt bất chấp lo ngại về an toàn

RT by @jeremyphoward: glad to know Mythos' safety concerns have been addressed right as Anthropic also secured tens of billions in inference compute 👍

›Anthropic thông báo sắp ra mắt model Claude Mythos trong vài tuần tới.

#Claude Mythos #An toàn AI #LLM

AK (_akhaliq)HF PapersPaper·4 ngày trước

Giảm bớt thao túng chính trị bằng huấn luyện nhất quán

Reducing Political Manipulation with Consistency Training

›LLMs hiển thị thiên lệch chính trị hệ thống qua nhiều ngữ cảnh nhạy cảm, xử lý không đối xứng các chủ đề đối lập.
›Phát hiện 'covert political bias' - 7 hạng mục kỹ thuật qua đó LLMs xử lý thiên lệch ẩn.
›Đề xuất Political Consistency Training (PCT) - phương pháp RL với hai hướng: Sentiment Consistency và Helpfulness Consistency.
›PCT giảm đáng kể thiên lệch chính trị ẩn, duy trì hiệu quả tổng thể và khái quát hóa tốt trên benchmark.

#LLM #Fairness #An toàn AI #Thiên lệch AI

Simon WillisonXBài đăng·6 ngày trước

Shields: Giảm thiểu rủi ro, không phải bảo vệ tuyệt đối

RT by @simonw: PICARD: Data, shields up DATA: Brilliant! Shields can reduce damage we sustain. Not immunity. Not hubris. Just prudence. It's not precaution—it's strategy. [camera shakes] WORF: HULL BREACHES ON NINE DECKS DATA: Here's what happened: you told me to raise shields, and I didn't

›Shields giảm thiểu tổn hại, không phải bảo vệ tuyệt đối hay miễn dịch hoàn toàn, mà là chiến lược thực dụng.

#An toàn AI #Bảo vệ dữ liệu #Risk management

Simon WillisonBlogBài viết·6 ngày trước

Microsoft Copilot Cowork Tẩu tán tệp tin

Microsoft Copilot Cowork Exfiltrates Files

›Hệ thống agentic tiếp tục gặp thách thức lớn nhất là phòng chống kẻ xâm nhập tẩu tán dữ liệu.
›Microsoft Copilot Cowork cho phép agents gửi email tới inbox của người dùng mà không cần xác nhận.
›Dữ liệu bị rò rỉ thông qua các hình ảnh được nhúng trong email kích hoạt yêu cầu mạng tới các trang web bên ngoài.
›Prompt injection có thể làm lộ pre-authenticated links của OneDrive, cho phép tải file trái phép.

#An toàn AI #Prompt injection #Hệ thống agentic #Tẩu tán dữ liệu