OpenAI announced Sora, a new text-to-video AI program that can turn short prompts into photo-realistic video. It still doesn’t quite know what to do with hands, though.
Sora is a diffusion model, which OpenAI explains generates a video by starting with one that looks like static noise and gradually transforming it by removing the noise over many steps.
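Conceptually, that denoising loop looks something like the sketch below. This is a generic illustration of how diffusion sampling works, not OpenAI’s actual code; the `denoise_model` network, the update rule, and the tensor shapes are all stand-ins.

```python
import numpy as np

def sample_video(denoise_model, num_frames=48, height=64, width=64, channels=3, steps=50):
    """Generic reverse-diffusion sampling loop (illustrative only).

    Start from pure Gaussian noise shaped like a video clip and repeatedly
    ask the model to predict and remove a little noise, ending with a clean clip.
    """
    # Begin with "static": every frame is random noise.
    video = np.random.randn(num_frames, height, width, channels)

    for t in reversed(range(steps)):
        # A trained network would predict the noise present at this step;
        # here `denoise_model` is just a placeholder.
        predicted_noise = denoise_model(video, t)
        # Remove a fraction of the predicted noise (a heavily simplified update rule).
        video = video - predicted_noise / steps

    return video

# A dummy "model" so the sketch runs end to end: it guesses that the
# remaining noise is proportional to the current sample.
dummy_model = lambda x, t: 0.1 * x

clip = sample_video(dummy_model)
print(clip.shape)  # (48, 64, 64, 3)
```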
“Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily,” OpenAI says.
The video above was generated with the prompt: “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.”
According to OpenAI, Sora can generate a complex scene with multiple moving objects or characters and replicate specific types of motion along with background detail because, the company says, it not only understands simple text prompts but also how the things it is asked to create exist in the real, physical world.
“The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions,” OpenAI says. “Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.”
The video above, from which the header image for this article was taken, was generated with a very simple prompt: “The story of a robot’s life in a cyberpunk setting.”
Sora is new and, as such, it’s imperfect. OpenAI recognizes this and says that the current model has weaknesses, including difficulty simulating the physics of complex scenes and a failure to fully understand specific instances of cause and effect.
“For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark,” OpenAI explains.
“The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.”
The example above was made using the prompt: “Extreme close up of a 24 year old woman’s eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic.” The example below was created using the prompt: “A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.”
It also struggles with hands. Even in still images, hands have been perhaps the biggest hurdle for AI image generators, and many still can’t quite figure them out. In video, that issue persists, as shown by Drew Harwell of The Washington Post on Threads:
While the camera movement is believable and some of the background detail is too, the main character sits in the uncanny valley in a way that is somewhat disturbing to watch, and the hands of the women to her right are definitely not rendered correctly.
OpenAI says it takes safety very seriously and is working with domain experts in areas like misinformation, hateful content, and bias who will be “adversarially testing” the model.