AI Video Generation (Text-To-Video Translation) – Zbigatron



There have been a number of moments in my career in AI when I’ve been taken aback by the progress mankind has made in the field. I recall the first time I saw object detection/recognition being performed at a near-human level of accuracy by Convolutional Neural Networks (CNNs). I’m pretty sure it was this picture from Google’s MobileNet (mid 2017) that affected me so much that I needed to catch my breath and immediately afterwards exclaim “No way!” (insert expletive in that phrase, too):


When I first started out in Computer Vision way back in 2004 I was adamant that object recognition at this level of expertise and speed would be simply impossible for a machine to achieve because of the inherent level of complexity involved. I was truly convinced of this. There were just too many parameters for a machine to handle! And yet, there I was being proven wrong. It was an incredible moment of awe, one which I frequently recall to my students when I lecture on AI.

Since then, I’ve learnt not to underestimate the power of science. But I still get caught out from time to time. Well, maybe not caught out (because I really did learn my lesson) but more like taken aback.

The second memorable moment in my career when I pushed my swivel chair away from my desk and once more exclaimed “No way!” (insert expletive there again) was when I saw text-to-image translation (you provide a text prompt and a machine creates images based on it) being performed by DALL-E in January of 2021. For example:



I wrote about DALL-E’s initial capabilities at the end of this post on GPT3. Since then, OpenAI has released DALL-E 2, which is even more awe-inspiring. But that initial moment in January of last year will forever be ingrained in my mind, because a machine creating images from scratch based on text input is something truly remarkable.

This year, we’ve seen text-to-image translation become mainstream. It’s been on the news, John Oliver made a video about it, various open source implementations have been released to the general public (e.g. DeepAI, try it out yourself!), and it has achieved some milestones. For example, Cosmopolitan magazine used a DALL-E 2 generated image as the cover of one of its special issues:


That does look groovy, you have to admit.

My third “No way!” moment (with expletive, of course) occurred only a few weeks ago. It happened when I realised that text-to-video translation (you provide a text prompt and a machine creates a series of videos based on it) is likewise on its way to potentially becoming mainstream. Four weeks ago (Oct 2022) Google presented ImagenVideo and a short time later also published another solution called Phenaki. A month before that, Meta announced its text-to-video translation application called Make-A-Video (Sep 2022), which in turn was preceded by CogVideo from Tsinghua University (May 2022).

All of these solutions are in their infancy. Apart from Phenaki, videos generated after providing an initial text input/instruction are only a few seconds in length. No generated videos have audio. Results aren’t perfect, with distortions (aka artefacts) clearly visible. And the videos that we’ve seen have undoubtedly been cherry-picked (CogVideo, however, has been released as open source to the public so one can try it out oneself). But hey, the videos are not bad either! You have to start somewhere, right?

Let’s take a look at some examples generated by these four models. Remember, this is a machine creating videos purely from text input, nothing else.

CogVideo from Tsinghua University

Text prompt: “A happy dog” (video source)


Here is an entire series of videos created by the model that is presented on the official GitHub site (you may need to press “play” to see the videos in motion):

As I mentioned earlier, CogVideo is available as open source software, so you can download the model yourself and run it on your machine if you have an A100 GPU. You can also play around with an online demo here. The one downside of this model is that it only accepts simplified Chinese as text input, so you’ll need to get your Google Translate up and running, too, if you’re not familiar with the language.

Make-A-Video from Meta

Some example videos generated from text input:

Text prompt: “A teddy bear painting a portrait”
An example media generated by meta's application
Text prompt: A young couple walking in heavy rain

An example image generated by meta
Text prompt: A dog wearing a Superhero outfit with red cape flying through the sky

The other amazing features of Make-A-Video are that you can provide a still image and get the application to give it motion, or you can provide 2 still images and the application will “fill in” the motion between them, or you can provide a video and request different variations of this video to be produced.

Example: the left image is the input image, the right image shows the generated motion for it:

Input diagram to be transformed to a video  

It’s hard not to be impressed by this. However, as I mentioned earlier, these results are obviously cherry-picked. We do not have access to any API or code to produce our own creations.

ImagenVideo from Google

Google’s first solution attempts to build on the quality of Meta’s and Tsinghua University’s releases. Firstly, the resolution of videos has been upscaled to 1024×768 at 24 fps (frames per second). Meta’s videos are created at 256×256 resolution by default. Meta mentions, however, that the maximum resolution can be set to 768×768 at 16 fps. CogVideo’s generated videos have similar limitations.
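
To put those numbers side by side, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above; the `VideoSpec` class and its field names are my own illustrative choices, not anything taken from the Google or Meta releases, and raw pixel throughput is of course only a crude proxy for actual video quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSpec:
    """Hypothetical container for the output figures quoted in this post."""
    name: str
    width: int
    height: int
    fps: int

    @property
    def pixels_per_frame(self) -> int:
        return self.width * self.height

    @property
    def pixels_per_second(self) -> int:
        return self.pixels_per_frame * self.fps


# Figures as quoted in the paragraph above (assumed, not independently verified).
imagen_video = VideoSpec("ImagenVideo", 1024, 768, 24)
make_a_video_max = VideoSpec("Make-A-Video (max)", 768, 768, 16)

# Make-A-Video's default resolution; no frame rate is quoted for it,
# so only the per-frame pixel count is compared.
make_a_video_default_pixels = 256 * 256

throughput_ratio = imagen_video.pixels_per_second / make_a_video_max.pixels_per_second
frame_ratio = imagen_video.pixels_per_frame / make_a_video_default_pixels

print(f"{imagen_video.name}: {imagen_video.pixels_per_second:,} pixels/second")
print(f"{make_a_video_max.name}: {make_a_video_max.pixels_per_second:,} pixels/second")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
print(f"Pixels per frame vs. 256x256 default: {frame_ratio:.0f}x")
```

On these figures, ImagenVideo pushes twice as many pixels per second as Make-A-Video at its maximum setting, and each frame holds twelve times as many pixels as Make-A-Video’s default output.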

Here are some examples released by Google from ImagenVideo:

ImagenVideo example
Text prompt: Flying through an intense battle between pirate ships in a stormy ocean
ImagenVideo example
Text prompt: An astronaut riding a horse
ImagenVideo example
Text prompt: A panda eating bamboo on a rock

Google claims that the videos generated surpass those of other state-of-the-art models. Supposedly, ImagenVideo has a better understanding of the 3D world and can also process much more complex text inputs. If you look at the examples presented by Google on their project’s page, it appears as if their claim is not unfounded.

Phenaki by Google

This is a solution that really blew my mind.

While ImagenVideo had its focus on quality, Phenaki, which was developed by a different team of Google researchers, focussed on coherency and length. With Phenaki, a user can present a long list of prompts (rather than just one), which the system then turns into a film of arbitrary length. Similar kinds of glitches and jitteriness are exhibited in these generated clips, but the fact that videos of two-minute plus length can be created is just astounding (although they are of lower resolution). Truly.

Here are some examples:

Phenaki example
Text prompts: A photorealistic teddy bear is swimming in the ocean at San Francisco. The teddy bear goes under water. The teddy bear keeps swimming under the water with colorful fishes. A panda bear is swimming under water
Phenaki example
Text prompts: Side view of an astronaut walking through a puddle on mars. The astronaut is dancing on mars. The astronaut walks his dog on mars. The astronaut and his dog watch fireworks

Phenaki can also generate videos from single images, and these images can additionally be accompanied by text prompts. The following example uses the input image as its first frame and then builds on that by following the text prompt:

Phenaki example
Accompanying text prompt: A white cat touches the camera with the paw

For more amazing examples like this (including a few 2+ minute videos), I would encourage you to view the project’s page.

Furthermore, word on the street is that the teams behind ImagenVideo and Phenaki are combining strengths to produce something even better. Watch this space!


A few months ago I wrote two posts on this blog discussing why I think AI is starting to slow down (part 2 here) and that there’s evidence that we’re slowly beginning to hit the ceiling of AI’s possibilities (unless new breakthroughs occur). I still stand by those posts because of the sheer amount of time and money that is required to train any of these large neural networks performing these feats. This is the main reason I was so astonished to see text-to-video models being released so quickly after only just getting used to their text-to-image counterparts. I thought we would be a long way away from this. But science found a way, didn’t it?

So, what’s next in store for us? What will cause another “No way!” moment for me? Text-to-music generation and text-to-video with audio would be nice, wouldn’t it? I’ll try to research these and see how far we are from them and present my findings in a future post.
