Alibaba has launched Wan2.2, a suite of large video generation models designed to help creators and developers produce cinematic-quality video content with greater ease and control. The new models are the first open-source video generation tools built on the Mixture-of-Experts (MoE) architecture, offering improved efficiency and creative flexibility.
Enhancing video generation with MoE architecture
The Wan2.2 release includes three distinct models: Wan2.2-T2V-A14B (text-to-video), Wan2.2-I2V-A14B (image-to-video), and Wan2.2-TI2V-5B (a hybrid text/image-to-video model). All three models are built on the MoE architecture and trained using carefully curated aesthetic datasets, allowing them to deliver cinematic-style video outputs with fine-tuned artistic control.
The models let users customise visual elements such as lighting, colour tone, camera angle, composition, focal length, and time of day. They also render complex motion realistically, including facial expressions, hand gestures, and dynamic movement, while adhering more closely to physical laws and user instructions.
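For orientation, here is a minimal sketch of how the text-to-video model might be driven from Python, assuming the Hugging Face Diffusers WanPipeline integration; the repository id and generation settings shown are assumptions, so check the official model card for the exact values.

```python
# Minimal sketch: text-to-video with a Wan2.2 checkpoint via Diffusers.
# The model id and generation settings below are assumptions; consult the
# official model card on Hugging Face / ModelScope for the exact values.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"  # assumed Diffusers-format repo id
pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Cinematic controls (lighting, colour tone, camera, focal length, time of day)
# are expressed directly in the prompt text.
prompt = (
    "A fishing boat at dawn, warm golden-hour lighting, soft teal colour grade, "
    "low-angle tracking shot, 35mm lens, gentle waves, cinematic composition"
)

video = pipe(prompt=prompt, height=720, width=1280, num_frames=81).frames[0]
export_to_video(video, "boat_dawn.mp4", fps=16)
```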
To address the typically high computational cost of video generation, Wan2.2-T2V-A14B and Wan2.2-I2V-A14B use a two-expert design in their diffusion denoising process: one expert handles the high-noise stages to establish the overall scene layout, while the other takes over at low-noise stages to refine detail and texture. Although each model contains 27 billion parameters in total, only 14 billion are active at any generation step, which reduces the computational load by up to 50%.
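Conceptually, the hand-off between the two experts is driven by the noise level of the current denoising step rather than by a learned per-token gate. The sketch below illustrates that idea only; the class names, threshold value, and expert modules are hypothetical stand-ins, not Wan2.2's actual implementation.

```python
# Conceptual sketch of two-expert denoising routed by noise level.
# The expert modules and boundary value are hypothetical; they illustrate
# that only one ~14B expert runs per step, even though both experts
# together total ~27B parameters.
import torch
import torch.nn as nn

class TwoExpertDenoiser(nn.Module):
    def __init__(self, high_noise_expert: nn.Module, low_noise_expert: nn.Module,
                 boundary: float = 0.9):
        super().__init__()
        self.high_noise_expert = high_noise_expert  # shapes overall scene layout
        self.low_noise_expert = low_noise_expert    # refines detail and texture
        self.boundary = boundary                    # switch point on the noise schedule

    def forward(self, latents: torch.Tensor, t: torch.Tensor, cond: torch.Tensor):
        # t is the normalised noise level in [0, 1]: early (noisy) steps go to
        # the high-noise expert, late (nearly clean) steps to the low-noise expert.
        if float(t.max()) >= self.boundary:
            return self.high_noise_expert(latents, t, cond)
        return self.low_noise_expert(latents, t, cond)
```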
Improved performance and creative capabilities
Wan2.2 builds on the foundation of its predecessor, Wan2.1, with a significantly expanded dataset and improved generation capabilities. Its training data includes 65.6% more images and 83.2% more video than the previous version, allowing Wan2.2 to generate complex scenes, capture nuanced motion, and produce a wider range of creative styles.
The models support a cinematic prompt system, which categorises and refines user input across key aesthetic dimensions such as lighting, colour, and composition. This system ensures a higher level of interpretability and alignment with users’ visual intentions during video creation.
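As an illustration only, a prompt structured along those aesthetic dimensions might be assembled as below; the field names and example vocabulary are assumptions, not the system's actual taxonomy.

```python
# Illustrative only: composing a prompt from aesthetic dimensions.
# The field names and example values are assumptions, not Wan2.2's taxonomy.
from dataclasses import dataclass

@dataclass
class CinematicPrompt:
    subject: str
    lighting: str = "soft window light"
    colour: str = "muted warm palette"
    composition: str = "rule-of-thirds framing"
    camera: str = "slow dolly-in, 50mm lens"
    time_of_day: str = "late afternoon"

    def render(self) -> str:
        # Concatenate the dimensions into a single text prompt for the model.
        return ", ".join([self.subject, self.lighting, self.colour,
                          self.composition, self.camera, self.time_of_day])

prompt = CinematicPrompt(subject="an elderly potter shaping clay in a studio").render()
print(prompt)
```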
Compact hybrid model enables scalability
In addition to the primary MoE models, Alibaba has introduced Wan2.2-TI2V-5B, a compact hybrid model that accepts both text and image input. It uses a dense architecture paired with a high-compression 3D VAE that compresses video by a factor of 4 temporally and 16x16 spatially, raising the overall information compression rate to 64 and enabling it to generate a five-second 720p video in just a few minutes on a standard consumer-grade GPU.
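For a rough sense of what those compression factors mean for the latent tensor the diffusion model actually works on, the arithmetic below divides a five-second 720p clip by the stated 4x16x16 ratios; the frame rate is an illustrative assumption.

```python
# Back-of-the-envelope latent size for a 4 (time) x 16 x 16 (space) VAE.
# The frame rate is an illustrative assumption.
frames, height, width = 5 * 24, 720, 1280   # five seconds at an assumed 24 fps
t_ratio, h_ratio, w_ratio = 4, 16, 16

latent_frames = frames // t_ratio    # 30
latent_height = height // h_ratio    # 45
latent_width = width // w_ratio      # 80

print(f"pixel grid:  {frames} x {height} x {width}")
print(f"latent grid: {latent_frames} x {latent_height} x {latent_width}")
```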
The Wan2.2 models are available for download through Hugging Face, GitHub, and Alibaba Cloud’s open-source platform ModelScope. Alibaba has been a regular contributor to the global open-source AI community, previously releasing four Wan2.1 models in February 2025 and the Wan2.1-VACE (Video All-in-one Creation and Editing) model in May 2025. Collectively, these models have been downloaded over 5.4 million times from Hugging Face and ModelScope.