Recently, the world’s first unified multimodal video and image creation tool, “Kling O1,” officially launched. Built on a new video-and-image model, Kling O1 uses natural language as a semantic skeleton, combined with multimodal inputs such as videos, images, and subjects. It integrates all generation and editing tasks into one all-purpose engine, building a brand-new multimodal creation workflow that gives users a one-stop closed loop from inspiration to finished product.
Unified Model, Solving All Challenges in Video Creation

As the first unified multimodal video model, Kling O1 is built on the MVL (Multi-modal Visual Language) concept and breaks the boundaries of traditional models designed for single video-generation tasks. It integrates a wide range of tasks, such as reference-based video generation, text-to-video, first-and-last-frame-to-video, video content addition and deletion, video modification and transformation, style redrawing, and shot extension, into the same all-purpose engine. This lets users complete the entire creation process, from generation to modification, in one place, without switching between multiple models and tools.
Leveraging the Kling Video O1 model’s deep semantic understanding, Kling O1 interprets the images, videos, subjects, and text that users upload all as instructions. The model breaks through modality limitations: it can comprehensively understand a photo, a video clip, a subject, or even different perspectives of a character, and accurately generate the corresponding details.
Kling O1’s multimodal instruction input area transforms tedious editing and post-production into simple conversation. Users no longer need to manually mask objects or set keyframes; they just need to input instructions like “remove pedestrians,” “change day to dusk,” or “replace the main character’s clothing.” The AI model can understand the visual logic and automatically perform pixel-level semantic reconstruction, from local subject replacement to overall video style redrawing. Additionally, it supports capabilities like image/subject reference; instruction-based transformation (adding/deleting video content, switching shots/perspectives, video modification tasks, etc.); video reference; first-and-last frame generation; and text-to-video.
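The conversational editing flow described above can be pictured as a small structured request that pairs a natural-language instruction with multimodal references. The sketch below is purely illustrative: the article documents no public API, so the function name and field names here are assumptions, not a real Kling O1 schema.

```python
import json


def build_edit_request(instruction, video_path=None, image_refs=None):
    """Assemble a hypothetical multimodal edit request.

    All field names are illustrative (not a real Kling O1 API): a text
    instruction is bundled with optional video and image references,
    mirroring the conversational editing flow described above.
    """
    request = {"instruction": instruction, "references": []}
    if video_path:
        request["references"].append({"type": "video", "source": video_path})
    for ref in image_refs or []:
        request["references"].append({"type": "image", "source": ref})
    return request


# Example: one sentence stands in for manual masking and keyframing.
req = build_edit_request(
    "remove pedestrians and change day to dusk",
    video_path="street_scene.mp4",
)
print(json.dumps(req, indent=2))
```

The point of the sketch is that the user supplies only the instruction and the media; any masking or keyframe logic lives entirely on the model side.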
Addressing a core pain point of putting AI video into practice, the consistency of characters and scenes, Kling O1’s underlying technology strengthens its understanding of input images and videos. It can “remember” the main characters, props, and scenes like a human director: no matter how the shot flows, the subject’s characteristics remain stable and consistent.
Furthermore, the model demonstrates powerful multi-subject fusion capabilities. Users can freely combine multiple different subjects or mix subjects with reference images. Even in complex group scenes or interactive scenarios, the model can independently identify and maintain the characteristics of each character or prop, ensuring industrial-grade feature consistency for “main characters” across different shots.
Unlock Skill Combinations: Break Free from Single Tasks

Users can ask Kling O1 to “add a subject to the video while modifying the background,” or “modify the style while generating from an image reference.” This ability to combine multiple creative operations in a single request greatly expands creative freedom, making creative synergy possible.
Narrative duration is freely definable, giving each story its own rhythm. Kling O1 returns the power to define time to the creator, supporting freely chosen durations of 3-10 seconds. Whether it’s a short burst of visual impact or a longer story buildup, the pacing is entirely under the user’s control. Notably, as part of the unified model, Kling O1’s first-and-last-frame generation capability will also support 3-10 second duration options (coming soon), further enhancing narrative pacing.
Also making its debut is the Kling Image O1 model, which enables seamless connection from basic image generation to advanced detailed editing. Users can generate images from pure text or upload up to 10 reference images for fusion and re-creation.
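The stated “up to 10 reference images” limit lends itself to a simple client-side check before submission. The snippet below is a hypothetical validation helper; the constant reflects the limit stated in this article, but the function and its behavior are assumptions, not part of any documented Kling API.

```python
MAX_REFERENCE_IMAGES = 10  # per the article: up to 10 reference images


def validate_image_refs(refs):
    """Hypothetical client-side check of the stated reference-image limit."""
    refs = list(refs)
    if len(refs) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"Kling Image O1 accepts at most {MAX_REFERENCE_IMAGES} "
            f"reference images; got {len(refs)}"
        )
    return refs
```

Rejecting an oversized batch locally avoids a round trip that the service would refuse anyway.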
This model has four core advantages: highly consistent feature preservation, keeping subject elements stable without drift; precise response to detail modifications, ensuring every adjustment meets expectations; accurate control over style and tone, keeping the visual atmosphere unified; and exceptionally rich imagination, making creative presentations more impactful, truly achieving “what you think is what you get.”
One Model, Endless Creation
The new Kling O1 integrates generation and editing, making it widely applicable to various scenarios such as film/TV production, social media content creation, and advertising/e-commerce. Whether it’s narrative generation from scratch or deep reshaping of existing materials, Kling O1 can flexibly utilize its reference and editing capabilities according to different needs, easily completing creative work.
In the field of film and TV creation, Kling O1’s highly consistent image (subject) reference, combined with its subject library function, can precisely lock in characters, costumes, props, and scenery for each storyboard shot, easily generating multiple coherent film/TV shots. For video post-production and social media creators, simple conversational prompts like “delete pedestrians in the background” or “make the sky blue” can instruct Kling O1 to automatically perform pixel-level intelligent repair and reconstruction.
Kling O1 also addresses the high costs and long production cycles of traditional offline advertising shoots. Users only need to upload product images, model photos, and scene pictures, along with simple instructions, to quickly generate multiple eye-catching product showcase advertisements, significantly reducing the cost of live shoots. It likewise tackles the hassle of scheduling model shoots and the need for re-shoots when changing backgrounds or clothing: with Kling O1, you can build a never-ending virtual runway by uploading model and clothing photos and entering instructions, faithfully reproducing the texture and details of the garments and batch-producing high-quality Lookbook videos.
According to the latest reports, Kling O1’s powerful and comprehensive functionality stems from deep innovation in its technical foundation. The new Kling Video O1 model breaks down the functional fragmentation between generation, editing, and understanding in video models, establishing a novel generative foundation. By combining a Multimodal Transformer for multimodal understanding with multimodal long-context processing, it achieves deep integration and unification of multiple tasks.