News feed

2 latest items

AK (_akhaliq) · HF Papers · Paper · 4 days ago

EarlyTom: Early Token Compression Completes Fast Video Understanding

›Video-LLMs process a large number of visual tokens, which makes them inefficient; vision encoding accounts for most of the time-to-first-token (TTFT).
›EarlyTom compresses visual tokens without any training, inside the vision encoder rather than only after it, which reduces TTFT and lowers the cost of the vision encoder itself.
›It introduces a decoupled spatial token selection strategy that improves overall compression effectiveness.
›It reduces TTFT by up to 2.65x and FLOPs by up to 61% on an NVIDIA A100 for LLaVA-OneVision-7B while maintaining baseline accuracy.
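
To make the idea of training-free token compression concrete, here is a minimal sketch of top-k token selection. It is not EarlyTom's method: the summary does not describe the paper's selection criterion, so the scoring rule (L2 norm of each token vector), the function name `select_tokens`, and the `keep_ratio` parameter are all illustrative assumptions.

```python
import math


def select_tokens(tokens, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    ASSUMPTION: the score is the L2 norm of each token vector. This is a
    placeholder; it is not the criterion used by EarlyTom.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []

    # Number of tokens to keep, at least one.
    k = max(1, math.ceil(len(tokens) * keep_ratio))

    # Score every token (hypothetical salience measure).
    scores = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]

    # Pick the k best indices, then restore positional order so that
    # downstream layers still see tokens in their original sequence.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept_indices = sorted(top)
    return [tokens[i] for i in kept_indices], kept_indices


# Toy example: four 2-dimensional "visual tokens", keep half of them.
tokens = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.3], [1.5, 1.5]]
kept, kept_indices = select_tokens(tokens, keep_ratio=0.5)
```

Applying a step like this inside the encoder, rather than after it, means the later encoder layers also run on fewer tokens, which is where the summary locates the TTFT savings. For scale, a 2.65x reduction corresponds to TTFT falling to roughly 38% of the baseline (1 / 2.65 ≈ 0.377).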

AK (_akhaliq) · HF Papers · Paper · 4 days ago

Towards Consistent Video Geometry Estimation

›ViGeo is a transformer foundation model for recovering dense, temporally consistent spatial geometry from video sequences.
›Dynamic chunking attention exposes the model to both bidirectional and causal context during training, so it can adapt at test time without retraining.
›A completion-based data refinement framework trains a video depth completion teacher from sparse annotations to improve supervision quality.
›ViGeo predicts depth, surface normals, and point maps within a single framework, with state-of-the-art performance.
›It performs well on online, offline, and long-video depth estimation while being trained only on public datasets.
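
One way to picture how a single attention scheme can cover both bidirectional and causal context is a chunk-wise mask: frames attend freely within their own chunk and causally to earlier chunks. The sketch below is an interpretation under that assumption, not ViGeo's documented design; the function name `chunked_attention_mask` and the fixed `chunk_size` parameter are hypothetical.

```python
def chunked_attention_mask(num_frames, chunk_size):
    """Build a boolean mask where mask[q][k] is True if query frame q
    may attend to key frame k.

    ASSUMPTION: frames are grouped into consecutive chunks of
    `chunk_size`; attention is bidirectional inside a chunk and causal
    across chunks. This is an illustrative reading of "chunking
    attention", not the paper's specification.
    """
    if num_frames < 1 or chunk_size < 1:
        raise ValueError("num_frames and chunk_size must be positive")

    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        # A key is visible if it lies in the same chunk or an earlier one.
        row = [(k // chunk_size) <= q_chunk for k in range(num_frames)]
        mask.append(row)
    return mask


# chunk_size == 1 gives a purely causal mask (online setting);
# chunk_size >= num_frames gives full bidirectional attention (offline).
causal = chunked_attention_mask(4, 1)
mixed = chunked_attention_mask(4, 2)
full = chunked_attention_mask(4, 4)
```

Under this reading, varying the chunk size during training would expose the model to every regime between the two extremes, which matches the summary's claim that the same weights serve online, offline, and long-video inference.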