Alibaba Group Holding is working on a video-generating tool called Tora, similar to OpenAI’s Sora, in the Chinese tech giant’s latest attempt to develop artificial intelligence (AI) video tools.
Tora, a video generation framework that uses OpenSora as its core model, was described in a paper published last week by a group of five researchers from Alibaba, which owns the South China Morning Post.
The Tora framework builds on the Diffusion Transformer (DiT) architecture, the same architecture underpinning Sora, the text-to-video model OpenAI unveiled in February, according to the paper posted on the research repository arXiv.
The researchers claim to have developed the first “trajectory-oriented DiT framework for video generation”: one that ensures generated motions precisely follow specified trajectories while mimicking the dynamics of the physical world.
“We adapted OpenSora’s workflow to convert raw videos into high-quality video-text pairs and leverage an optical flow estimator for trajectory extraction,” the researchers said.
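For illustration only, the sketch below shows one way an optical flow estimator can be used to extract a motion trajectory from a video, using OpenCV’s Farneback dense-flow implementation. The function, its parameters and the choice of estimator are assumptions made for this example and are not taken from Alibaba’s paper, which does not detail the exact pipeline.

```python
# Illustrative sketch only: following a single point through a video by
# accumulating dense optical flow (OpenCV's Farneback estimator). This is
# not Alibaba's actual trajectory-extraction pipeline.
import cv2
import numpy as np

def extract_trajectory(video_path, start_point):
    """Return the (x, y) positions of a tracked point in each frame."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise ValueError(f"Could not read {video_path}")
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    x, y = start_point
    trajectory = [(x, y)]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow: an (H, W, 2) field of per-pixel displacements.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        h, w = gray.shape
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        dx, dy = flow[yi, xi]          # displacement at the tracked point
        x, y = x + float(dx), y + float(dy)
        trajectory.append((x, y))
        prev_gray = gray

    cap.release()
    return trajectory
```

A trajectory like the list returned above, paired with the corresponding video-text data, is the kind of motion signal a trajectory-conditioned model could be trained to follow.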
The Alibaba booth at the World Artificial Intelligence Conference (WAIC) in Shanghai, July 6, 2023. Photo: Reuters
The paper refers to a series of videos showing different objects — from a wooden sailboat on a river to men riding bicycles on a highway — moving along designated trajectories. Tora is able to generate videos guided by trajectories, images, text or a combination of the three, the researchers said.
The researchers, who described the project as “ongoing,” did not indicate when the new tool would be available for public use.
Alibaba’s move marks the latest attempt by the Hangzhou-based tech giant to launch Sora-like video generation tools as Chinese companies rush to gain a foothold in the AI video sector.
In July, Chinese startup Shengshu AI introduced its text-to-video tool Vidu, which allows registered users to generate four- or eight-second clips. It is the latest player in the country to offer such services to the public, following Zhipu AI and Kuaishou Technology.
This came a few days after Zhipu AI, one of China’s four new “AI Tigers,” introduced its Ying video generation model, which accepts both text and image prompts to generate six-second video clips in about 30 seconds.
Tora is not Alibaba’s first foray into AI video generation, however. In February, the company unveiled an AI video generation model called Emote Portrait Alive, or EMO.
The model, described as an “expressive audio-driven portrait video generation framework,” can turn a single still reference image and an audio voice sample into an animated avatar video with facial expressions and poses.
The research report does not mention whether Tora will be paired with EMO or Tongyi Qianwen, Alibaba’s self-developed family of large language models.
This article originally appeared in the South China Morning Post (SCMP). Copyright © 2024 South China Morning Post Publishers Ltd. All rights reserved.