NVlabs/VILA
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.