Feed

11 latest items

AK (_akhaliq) · HF Papers · Paper · 4 days ago

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

›Most robot learning systems use visual encoders trained for static recognition, which neglects motion understanding.
›DynaFLIP is a multimodal pretraining framework that pushes motion understanding into upstream perception.
›It uses image-language-3D flow triplets from human and robot videos to form representations focused on control-relevant regions.
›It achieves a +22.5% gain in out-of-distribution scenarios, improving robot generalization across different policies.

#Robotics #Computer Vision #Representation Learning #Robot Learning

Demis Hassabis · X · Post · 6 days ago

Omni continues to amaze me

RT by @demishassabis: omni continues to blow my mind (left original / right generated) waymo looks sick in matte black

›Google Omni produces impressive content through its image and video generation capabilities.

#Google Omni #Waymo #Computer Vision

AK (_akhaliq) · HF Papers · Paper · 8 days ago

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

›Generates high-fidelity 3D head avatars from random 2D images without multi-view data or 3D supervision.
›Introduces MVCHead, a single-shot state space model that enforces multi-view consistency directly in the 3D Gaussian representation.
›Proposes Hierarchical State Space (HiSS) and Hierarchical Bi-directional State Scan (HiBiSS) to capture long-range dependencies.
›Releases FaceGS-10K, the first large-scale dataset with 10K ready-to-use 3D Gaussian head assets for training.

#3D Generation #Computer Vision #Avatar #Gaussian Splatting

Demis Hassabis · X · Post · 10 days ago

Gemini Omni generates first-person view video from a map

RT by @demishassabis: Can't believe we're getting this before GTA 6

›A user uploads a Google Maps screenshot with a route drawn on it to Gemini Omni.

#Gemini Omni #Video Generation #Computer Vision

Demis Hassabis · X · Post · 10 days ago

Project Genie partners with Google Maps: turning the real world into interactive worlds

RT by @demishassabis: Project Genie 🤝 @GoogleMaps Street View You can now take real U.S. places and transform them into new, interactive worlds. 🌍

›Project Genie lets users transform real U.S. places into interactive 3D worlds.

#AI Content Generation #Computer Vision #Creative AI

Andrew Ng · YouTube · Video · 10 days ago

AI Dev 26 x SF | Thierry Damiba: Edge to Cloud Video Anomaly Detection

›Processes video across edge (local) devices and the cloud to detect anomalies.
›Combines edge and cloud computation to optimize performance, latency, and cost.
›Applies to security surveillance, threat detection, and real-time video analytics systems.

#Computer Vision #Edge Computing #Video AI

Demis Hassabis · X · Post · 10 days ago

Gemini Omni generates a driver's-view video from a map image

RT by @demishassabis: I uploaded a screenshot of Google Maps to Gemini Omni with a route drawn on it. Then I prompted it to create a first person view of someone driving a taxi cab along the route in the reference image. Pretty close to the real thing.

›Upload a Google Maps screenshot with a route drawn on it to Gemini Omni.

#Gemini Omni #Video Generation #Computer Vision

Demis Hassabis · Blog · Article · 14 days ago

Simulate real-world places with Project Genie and Street View

›Project Genie can simulate real-world places based on Street View data.
›Expands Google AI Ultra access to users worldwide.
›The tool supports a deeper understanding of space and environments through multimodal AI models.

#AI Simulation #Computer Vision #Google AI

Andrej Karpathy · X · Post · 21 days ago

Human-AI input/output interaction: from text to vision

This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.

More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision, it is the 10-lane superhighway of information into brain. As AI improves, I think we'll see a progression that takes advantage:

1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations

Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral https://x.com/zan2434/status/2046982383430496444

There are also improvements necessary and pending at the input. Audio nor text nor video alone are not enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.

TLDR The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip try ask for HTML.

›Asking an LLM to respond in HTML and viewing the result in a browser works better than markdown.
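
The tip in the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `with_html_request`, `save_html`, and `view_in_browser` are hypothetical, and the post names no particular LLM client, so the step that actually sends the prompt to a model is left out.

```python
import tempfile
import webbrowser
from pathlib import Path

# The instruction quoted in the post, appended to the end of the query.
HTML_SUFFIX = "Structure your response as HTML."


def with_html_request(query: str) -> str:
    """Return the query with the HTML formatting request appended (hypothetical helper)."""
    return f"{query.rstrip()}\n\n{HTML_SUFFIX}"


def save_html(response_text: str, path: Path) -> Path:
    """Write the model's HTML response to a file (hypothetical helper)."""
    path.write_text(response_text, encoding="utf-8")
    return path


def view_in_browser(path: Path) -> bool:
    """Open the saved file in the default browser; returns False if none could be launched."""
    return webbrowser.open(path.resolve().as_uri())
```

A typical use would be to build the prompt with `with_html_request(...)`, send it through whichever LLM client is at hand, pass the returned text to `save_html(text, Path(tempfile.gettempdir()) / "llm_response.html")`, and then call `view_in_browser` on the result.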

#AI Interfaces #Computer Vision #UX/Interaction

Fei-Fei Li · X · Post · about 2 months ago

See around corners with Marble 1.1 Plus

RT by @drfeifei: see around corners with Marble 1.1 Plus 👀

›Marble 1.1 Plus can see around corners.

#Computer Vision #3D Models

Fei-Fei Li · arXiv · Paper · more than 9 years ago

RGB-D Salient Object Detection Based on Discriminative Cross-modal Transfer Learning

›Proposes using a CNN to improve salient object detection based on depth information.
›Addresses the shortage of labeled data for the depth modality by transferring from RGB images.
›Leverages auxiliary data from the source modality for more effective training.

#Computer Vision #RGB-D #Transfer Learning