Sora can generate 60-second videos from text instructions, creating complex scenes with multiple characters, specific types of motion, and accurate details of subjects and backgrounds. It can also create multiple shots within a single generated video, accurately preserving characters and visual style.
OpenAI is teaching AI to understand and simulate the physical world in motion, with the goal of training models that “help people solve problems that require real-world interaction.” However, Sora sometimes confuses left and right in spatial details and struggles to capture specific instances of cause and effect.

Following the success of its popular chatbot ChatGPT, OpenAI continues to push forward in generative AI. On February 16th, OpenAI introduced a new AI model, Sora, which generates “realistic” and “imaginative” 60-second videos from brief text prompts.
OpenAI states that Sora can generate videos up to 60 seconds long while maintaining visual quality and adhering to the user’s prompt. Sora can create complex scenes with multiple characters, specific types of motion, and accurate details of subjects and backgrounds. The model has a deep understanding of language, allowing it to interpret prompts accurately and generate compelling characters. Sora can also create multiple shots within a single generated video, accurately preserving characters and visual style.

“This model not only understands what users ask for in a prompt, but also how those things exist in the real world,” OpenAI says. The company is teaching artificial intelligence to understand and simulate the physical world in motion, aiming to train models that “help people solve problems that require real-world interaction.”
Beyond generating videos purely from text prompts, the model can also animate an existing static image, bringing its content to life with accurate detail, or extend an existing video and fill in missing frames.

However, Sora is still in development and has obvious weaknesses, particularly in spatial details, where it confuses left and right, and in capturing specific instances of cause and effect. For example, it might generate a video of a person taking a bite of a cookie, yet the cookie shows no bite mark afterward.

In videos generated by Sora, animals or people can appear out of nowhere, especially in scenes containing many entities. Sora also struggles to precisely describe events that unfold over time, such as following a specific camera trajectory.

Video generated by Sora: an ordinary plastic chair is unearthed in the desert as people carefully dig it out and brush away the sand. In this example, Sora fails to model the chair as a rigid object, resulting in inaccurate physical interactions.
On model safety, OpenAI plans to work with a team of domain experts to adversarially test the latest model, watching closely for misinformation, hateful content, and bias, among other issues. OpenAI also says it is developing tools to detect misleading content, such as classifiers that can determine when a video was generated by Sora. Its text classifier can review and reject input prompts that violate usage policies, such as those requesting extreme violence, sexual content, hateful imagery, or celebrity likenesses. “We have also developed robust image classifiers that review every frame of each generated video to help ensure it complies with our usage policies before it is shown to the user.”
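OpenAI has not published how this safety pipeline is implemented, so the following is only a minimal illustrative sketch of the gating logic described above: a hypothetical text classifier screens the prompt before generation, and a hypothetical image classifier reviews every rendered frame before the video is shown. All function names, policy categories, and return values here are assumptions made for illustration, not OpenAI’s actual API.

```python
from __future__ import annotations

# Illustrative sketch only: the classifiers and video generator below are
# hypothetical stand-ins, not OpenAI's real implementation.

BLOCKED_TOPICS = {"extreme violence", "sexual content", "hate imagery", "celebrity likeness"}


def text_classifier_flags(prompt: str) -> set[str]:
    """Hypothetical: return the policy categories a prompt appears to violate."""
    return {topic for topic in BLOCKED_TOPICS if topic in prompt.lower()}


def image_classifier_ok(frame: bytes) -> bool:
    """Hypothetical: return True if a rendered frame passes policy review."""
    return True  # placeholder decision


def generate_video(prompt: str) -> list[bytes]:
    """Hypothetical stand-in for the video model; returns rendered frames."""
    return [b"frame-0", b"frame-1"]


def safe_generate(prompt: str) -> list[bytes] | None:
    # 1. Reject prompts that violate usage policies before any generation.
    flags = text_classifier_flags(prompt)
    if flags:
        print(f"Prompt rejected: {sorted(flags)}")
        return None
    # 2. Generate, then review every frame before showing the result.
    frames = generate_video(prompt)
    if all(image_classifier_ok(f) for f in frames):
        return frames
    print("Video withheld: a frame failed policy review.")
    return None


if __name__ == "__main__":
    safe_generate("a corgi surfing at sunset")
```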
OpenAI states that Sora will initially be made available to red teamers to assess potential harms and risks. A number of visual artists, designers, and filmmakers will also gain access to Sora so OpenAI can gather feedback on how creative professionals use it.

Reece Hayden, a senior analyst at market research firm ABI Research, noted that while multimodal large models are not new and text-to-video models already exist, OpenAI claims Sora stands out for the length and accuracy of its videos. Hayden believes such AI models could have a significant impact on the digital entertainment market, with new personalized content spreading across various channels. “An obvious use case is television: creating short scenes to support storytelling.”