Last month, when OpenAI released its latest video generation model, Sora, it invited some filmmakers to give it a try.
This week, the company revealed the results: seven surreal short films. There is no doubt that the future of video generation is approaching fast.
At the end of 2022, companies such as Meta, Google, and the video technology startup Runway launched the first models that could convert text into video.
The idea was ingenious, but the results were rough: the clips were glitchy and only a few seconds long.
18 months later, Sora's high-definition, realistic videos have astonished everyone, to the point that some excited observers are predicting the demise of Hollywood.

Meanwhile, the latest models from Runway can produce short clips that rival the work of major animation studios. Midjourney and Stability AI, the companies behind the two most popular text-to-image models, are also developing video generation models.
Many companies are racing to try to commercialize these cutting-edge technologies, but most are still pondering what their business models will be.
"When I use these tools, I often find myself saying, 'Wow, this is so cool,'" says Gary Lipkowitz, CEO of Vyond, a company that produces short animated videos. "But how do you use this at work?"
No matter what the answer to this question is, it could disrupt many forms of business and change the roles of many professionals, including animators and advertisers.
At the same time, concerns about misuse of the technology are growing. The ability to fabricate video has improved dramatically, making it easier than ever for false propaganda and deepfake pornography to spread online.

These challenges are foreseeable. The problem is that, so far, no one has found good ways to solve them.
As we stumble forward one step at a time, both good and bad things will happen. If you want to understand the future of video generation technology, here are four things worth thinking hard about.
Sora is just the beginning.
OpenAI's Sora is currently far ahead of its competitors in video generation, but other companies are working hard to catch up. In the coming months, more and more of them will upgrade their technology and launch rivals to Sora, making the field extremely lively.
The British startup Haiper officially emerged from stealth mode this month. It was founded in 2021 by former Google DeepMind and TikTok researchers who worked mainly on a technology called NeRF, which can reconstruct 3D virtual environments from 2D images.
They believed a tool that turns images into scenes a user can step into would be very useful for making games.
However, six months ago, Haiper pivoted from virtual environments to video generation. The shift reflected the latest thinking of its CEO, Yishu Miao, and targeted a market far larger than games.
Yishu Miao said, "We realized that video generation is the better field, and demand for it will be very high."

Like Sora, Haiper's video generation technology uses a diffusion model to handle the visuals and a Transformer (the component in large language models such as GPT-4 that is very good at predicting what comes next) to manage consistency between video frames.
Miao said: "A video is essentially a sequence of data, and the Transformer is the best model for learning sequences."
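To make the sequence idea concrete, here is a minimal, hypothetical sketch (not Haiper's actual code) of causal self-attention over per-frame latent vectors, the core Transformer operation that lets each frame be conditioned on the frames before it. The function name and random weights are purely illustrative:

```python
import numpy as np

def causal_self_attention(frames, seed=0):
    """Toy causal self-attention over a sequence of per-frame latents.

    frames: (T, d) array, one latent vector per video frame.
    Each frame attends only to itself and to earlier frames, which is
    how a Transformer can keep later frames consistent with what came
    before. Weights are random: this shows the mechanism, not a
    trained model.
    """
    T, d = frames.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # hide future frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (T, d): each frame is a mix of itself and its past
```

Because of the causal mask, editing a later frame cannot change the representation of earlier ones, which is one reason Transformers are a natural fit for keeping a video's frames temporally consistent.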
Consistency is a major challenge in the field of video generation and is the main reason why existing tools can only generate a few seconds of video at a time.
Transformers used for video generation can improve the quality and length of the output, but they come with a side effect: they "make things up," producing what are known as hallucinations.
In text, hallucinations are not always obvious. In video, they can be glaring: something that defies common sense, such as a person with multiple heads, stands out immediately. Keeping a Transformer on track also requires enormous amounts of training data and computing power.

This is why Irreverent Labs, founded by former Microsoft researchers, is taking a different approach.
Like Haiper, Irreverent Labs also focused on generating virtual environments for games before turning to video generation. However, the company does not want to follow the practices of OpenAI and other companies.
"Because this is a battle of computing power, a battle of GPUs," said David Raskino, co-founder and chief technology officer of Irreverent Labs. "In that kind of fight there is only one winner. Hint: he always wears a leather jacket." (That would be Jensen Huang, CEO of the trillion-dollar chip giant Nvidia.)
Irreverent Labs' technology does not use Transformers. Instead, it combines a diffusion model with a model that predicts the next frame based on physical common sense, such as how a ball bounces or how water splashes on the ground.

Raskino said this approach reduces both training costs and the frequency of hallucinations. The model still produces glitches, he noted, but they are mostly physical ones, such as a bouncing ball not following the correct trajectory, and they can be corrected with math applied after the video is generated.
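As a toy illustration (not Irreverent Labs' actual method; all names and parameters here are hypothetical), here is the kind of post-hoc physical correction Raskino describes: compute the trajectory a real ball would follow under gravity with elastic bounces, then snap a generated height sequence onto it.

```python
def physical_heights(h0, dt, n_frames, g=9.8):
    """Heights of a ball dropped from h0 metres, bouncing elastically.

    Semi-implicit Euler integration: update velocity, then position,
    reflecting off the ground on impact.
    """
    heights, h, v = [], h0, 0.0
    for _ in range(n_frames):
        heights.append(h)
        v -= g * dt
        h += v * dt
        if h < 0:          # hit the ground: reflect position and velocity
            h, v = -h, -v
    return heights

def correct_trajectory(generated_heights, h0, dt, g=9.8):
    """Replace a (possibly unphysical) generated height sequence with
    a physically consistent one of the same length."""
    return physical_heights(h0, dt, len(generated_heights), g)
```

A real system would presumably blend the generated frames with the physical prediction rather than replace them outright, but the principle is the same: physics supplies a cheap, reliable check on what the model drew.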
It remains to be seen which approach will win out. Miao compared today's video generation technology to large language models in the GPT-2 era.
Five years ago, that groundbreaking early model from OpenAI wowed people by showing what was possible. But it took several more years before the technology truly changed the game.
Miao said, "We are all still at the foot of the mountain."

How will people use generated videos?
Video is the most common medium on the internet. YouTube, TikTok, news clips, advertisements: wherever video can play, AI-generated video will appear.
The marketing industry is one of the most enthusiastic industries to adopt generative technology. A recent survey by Adobe in the United States showed that two-thirds of professional marketers have tried generative AI in their work, and more than half of them said they have used this technology to create images.
AI-generated videos will be the next hot spot. Some marketing companies have already produced short films to showcase the potential of this technology.
The latest example is "Somme Requiem," a 2.5-minute short created by the advertising agency Myles. "Somme Requiem" depicts soldiers trapped by snow during the Christmas truce of World War I, in 1914.
The film is composed of dozens of different shots generated with Runway's video model, which Myles' editors then stitched together, color-corrected, and set to music.
Myles founder and CEO Josh Kahn said, "The future of storytelling (in film) will be a hybrid workflow."
Kahn chose the war scene to illustrate his point. He noted that the Apple TV+ series "Masters of the Air," which cost $250 million to produce, tells the story of a group of World War II pilots.
The team behind Peter Jackson's World War I documentary "They Shall Not Grow Old" spent four years curating and restoring more than 100 hours of archival footage.

Kahn said: "Most filmmakers can only dream of getting the chance to tell a story in this genre."
"Independent filmmaking has been on the decline," he added, "I think this (technology) will create an incredible resurgence."
Raskino hopes so. "Horror is the genre where people like to test and try new things, to push the limits," he said. "I think we will see a sensational horror movie made by four people in a basement using AI."
So, will video generation kill Hollywood? Not yet. The scenes in "Somme Requiem," the empty woods and desolate barracks, all look great, but the people in it still show misaligned fingers and distorted faces, the telltale flaws of this technology.

Video generation is best at wide panning or close-up shots that set a mood but contain almost no action. A feature-length film in the style of "Somme Requiem" would quickly become tedious.
Feature films, however, are full of such establishing shots. Most last only a few seconds, but they can take hours to shoot.
Raskino suggests that video generation models could soon be used to produce those shots at a fraction of today's cost. They could even be generated during post-production, with no reshoots required.
Michal Pechoucek, CTO of Gen Digital, agrees with this. Gen Digital is a cybersecurity giant with a range of antivirus software, including Norton and Avast.
"I think this is the direction the technology is heading," he said. "We will see many different models, each specifically trained for a particular part of film production. They will just be tools used by talented video production teams."

We are not there yet. A major problem with video generation is users' lack of control over the output. Getting a static image right is already a struggle with existing tools; a few seconds of video is a far greater challenge.
Miao said: "Right now it is still at the fun stage, and you hit exciting moments. But generating exactly the video you want is a very hard technical problem. We still have a long way to go before a single prompt can reliably produce a video of sufficient length and consistency."
This is why Vyond's Lipkowitz believes the technology is not yet ready for most business customers. These users, he said, want far more control over a video's final form than current tools can offer.
Thousands of companies worldwide, including about 65% of the Fortune 500, have used Vyond's platform to create animated shorts for internal communication, training, marketing, and other purposes.

Vyond draws on a range of generative models, including text-to-image and text-to-speech, but it offers a simple drag-and-drop interface that lets users assemble a video piece by piece, rather than generating a complete clip with one click.
Lipkowitz said that running a generative model is like rolling dice. "For most video production teams, that's a dealbreaker, especially in the corporate world, where every pixel must be flawless and on-brand."
"If the video comes out wrong, say the character has too many fingers or the company logo is the wrong color, well, unfortunately, that's just how generative AI works (for now)," he said.
As for the fix? More data, more training, rinse and repeat. Miao said: "I wish I could point to some sophisticated algorithm as the solution. But there isn't one; it just needs more learning."

Deepfakes Could Make Things Worse
Over the years, misinformation online has been eroding our trust in media, institutions, and each other. Some are concerned that fake videos on an already chaotic internet could further undermine our trust in what we see.
Pechoucek said, "We are replacing trust with distrust, confusion, fear, and hatred. A society that cannot distinguish the truth will fall."
Pechoucek is particularly worried about the malicious use of deepfakes in elections. During last year's elections in Slovakia, for example, people shared a fake video showing the leading candidate discussing a plan to manipulate voters.
The video was low quality and clearly identifiable as a deepfake. Even so, Pechoucek believes it was enough to tilt the result toward the other candidate.

John Wissinger leads the strategy and innovation team at Blackbird AI, a company that tracks and manages the spread of misinformation online. He believes fake video is at its most deceptive when it blends real and fake footage.
Imagine combining two videos of President Joe Biden walking across a stage. In one, he falls, and in the other, he doesn't. Who can tell which one is real?
Wissinger said, "Say an event really happened, but the way it's presented to me is subtly different. That affects my emotional response to it."
As Pechoucek pointed out, fake video doesn't need to be high quality to have an impact. Wissinger said a bad fake that fits existing biases can do more damage than a polished fake that doesn't.
That is why Blackbird AI focuses on tracking who is sharing what with whom. In a sense, Wissinger said, whether something is true or false matters less than where it came from and how it spreads.

His company has long tracked low-tech misinformation, such as social media posts that present real images out of context.
He said that generative technologies have made things worse, but people displaying media in misleading ways, whether intentionally or unintentionally, is nothing new.
If you also consider bots and their role in sharing and promoting misinformation on social networks, things get even worse.
The mere existence of fake media can be used to sow seeds of doubt in bad-faith arguments. Wissinger said, "Pretty soon, you can see, people won't be able to tell synthetic from real (information)."

We are facing a new reality of the internet.
Fake content will soon be everywhere, from disinformation campaigns to advertisements, to Hollywood blockbusters. So, what can we do to figure out what is real and what is fake?
We have a range of solutions, but they must be used in combination, complementing each other; going it alone will not be effective.
The tech industry is addressing this issue. Most generation tools try to enforce certain terms of use, such as preventing people from creating videos of public figures. However, there are ways to bypass these filters, and the open-source versions of these tools may have more lenient policies.
Companies are also developing standards and detection tools for watermarking AI-generated content, but not every tool will add watermarks by default, and a watermark carried in a video's metadata can simply be stripped out.

No detection tool is 100% reliable, and even the ones that work are locked in a cat-and-mouse game, trying to keep pace with the generative models they police.
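The metadata point is easy to see with a toy model (purely illustrative; real provenance standards such as C2PA are more elaborate, but the weakness is similar): the watermark travels alongside the pixels rather than inside them, so any re-encode that copies only the pixel data silently discards it.

```python
# Toy model of a metadata watermark. The "watermark" lives beside the
# frame data, not inside it, so a pixels-only re-encode drops it.

def reencode_pixels_only(video):
    """Simulate a re-encode that keeps frames and discards metadata."""
    return {
        "frames": [frame[:] for frame in video["frames"]],
        "metadata": {},  # provenance info is gone
    }

original = {
    "frames": [[0, 1], [1, 0]],          # stand-in for pixel data
    "metadata": {"ai_generated": True},  # the watermark
}

stripped = reencode_pixels_only(original)
```

Watermarks embedded in the pixels themselves are harder to remove, but as noted above, none of these schemes is fully reliable.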
Online platforms like X and Facebook cannot moderate comprehensively as it is; we should not pin our hopes on them as the problem gets harder.
Miao once worked at TikTok, where he helped build a moderation tool to detect video uploads that violate the platform's terms of use. Even he is wary of what's coming: "The online environment is genuinely dangerous. Don't trust what you see on your laptop."
Blackbird AI has developed a tool called Compass, which allows you to fact-check articles and social media posts.
Paste a link into the tool, and a large language model generates a summary drawn from credible online sources, giving context for the linked content. Those sources, Wissinger said, are always open to review.

The result reads much like the community notes that platforms such as X, Facebook, and Instagram now sometimes attach to controversial posts. The company wants Compass to be able to generate such notes for any information. "We are working on it," said Wissinger.
But the people who know to reach for fact-checking tools are already the savvy ones; many others don't know such tools exist, or won't trust them. And misinformation typically travels far wider than any later correction.
Meanwhile, there is still disagreement about whose problem this is. Pechoucek said technology companies need to open up their software to allow more competition around safety and trust. That means letting cybersecurity firms like his develop third-party software to police the technology.
He said that this is what happened 30 years ago when Windows had a malware problem: "Microsoft let antivirus companies come in to help protect Windows. So the online world became safer."
But Pechoucek is not optimistic. "Technology developers need to build their tools with safety as the first goal," he said. "But more people are thinking about how to make the technology more powerful than about how to make it safer."

There is a fatalistic refrain common in the tech industry: change is coming, deal with it.
Raskino said: "Generative AI is not going to get uninvented. This may be unpopular to say, but I believe it's true: I don't think tech companies can carry all the responsibility.
Ultimately, the best defense against any technology is a well-educated public. There are no shortcuts."
Miao agreed. "We will inevitably adopt generative technology widely," he said, "but it is also the responsibility of society as a whole. We need to educate people."
He added: "Technology will keep moving forward, and we need to be ready for the change. We need to remind our parents and friends that what they see on a screen may not be real."

This is especially true for the older generation, he said: "Our parents need to be aware of the danger. I believe everyone should work together."
And that cooperation needs to happen fast. When Sora launched a month ago, even the tech industry was stunned by how quickly video generation had advanced.
Yet the vast majority of people have no idea this technology even exists, said Wissinger: "They clearly don't grasp where the technology is headed. I think it's going to sweep the globe."