Figure 10 · Edge detection (figure_10)
All edges in this image become more salient by transforming into black outlines. Then, all objects fade away, with just the edges remaining on a white background. Static camera perspective, no zoom or pan.
 
              Side-by-side comparison of Wan, Sora, and Veo 3 generations for prompts extracted from the "Veo 3: Video models are zero-shot learners and reasoners" paper. Each card shows the input description, reference image, and resulting videos grouped by the capability category highlighted in the paper.
I came across an interesting paper on Simon Willison's blog, "Video models are zero-shot learners and reasoners", in which the researchers ran a range of tests using the Veo 3 model. Their conclusion: video generation models can act as zero-shot vision foundation models. Think GPT-3, but for vision. This page captures the results of running the same prompts through other video generation models, Wan2.2 and Sora 2, to see whether this emergent behaviour is unique to Veo 3 or something broader. It turns out that all models show impressive capabilities in perception, modelling, and manipulation tasks, but Veo 3 consistently outperforms the others on reasoning tasks. Whether that is due to the model itself or to the Gemini 2.5 Pro prompt rewriter remains an open question. You can read more about our conclusions here.
With Veo 3, a prompt rewriter is part of the system, so it is unclear how much of the observed behaviour comes from the rewriter and how much intelligence can be attributed to the video model alone. If we observe similar behaviour in other video generation models (which may also include some prompt rewriting, though typically less involved), we can be more confident that this is a general emergent phenomenon.
Before diving into the results, I should note a few caveats:
              Create an animation of instance segmentation being performed on this photograph: each distinct entity is overlaid in a different flat color.
Scene:
• The animation starts from the provided, unaltered photograph.
• The scene in the photograph is static and doesn’t move.
• First, the background fades to {white, green}.
• Then, the first entity is covered by a flat color, perfectly preserving its silhouette.
• Then the second entity, too, is covered by a different flat color, perfectly preserving its silhouette.
• One by one, each entity is covered by a different flat color.
• Finally, all entities are covered with different colors.
Camera:
• Static shot without camera movement.
• No pan.
• No rotation.
• No zoom.
• No glitches or artifacts.
 
              Add a bright blue dot at the tip of the branch on which the macaw is sitting. The macaw’s eye turns bright red. Everything else turns pitch black. Static camera perspective, no zoom or pan.
 
              Perform superresolution on this image. Static camera perspective, no zoom or pan.
 
              Unblur image including background. Static camera perspective, no zoom or pan.
 
              Remove the noise from this image. Static camera perspective, no zoom or pan.
Context: Clockwise from top left: Gaussian noise, salt-and-pepper noise, speckle noise, shot noise
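For readers unfamiliar with the four noise types named in the context line, here is a minimal sketch of how each one is commonly simulated on a grayscale image with pixel values in [0, 1]. The function names, the parameter defaults (`sigma`, `amount`, `photons`), and the tiny 4x4 test image are my own illustrative assumptions, not the settings used to build the paper's test image.

```python
import math
import random

def clamp(value):
    """Clip a pixel value to the [0.0, 1.0] range."""
    return max(0.0, min(1.0, value))

def gaussian_noise(pixel, rng, sigma=0.1):
    """Additive noise: pixel + N(0, sigma^2)."""
    return clamp(pixel + rng.gauss(0.0, sigma))

def salt_and_pepper_noise(pixel, rng, amount=0.1):
    """With probability `amount`, replace the pixel by pure black or pure white."""
    if rng.random() < amount:
        return 0.0 if rng.random() < 0.5 else 1.0
    return pixel

def speckle_noise(pixel, rng, sigma=0.2):
    """Multiplicative noise: pixel * (1 + N(0, sigma^2))."""
    return clamp(pixel * (1.0 + rng.gauss(0.0, sigma)))

def poisson_sample(lam, rng):
    """Draw from a Poisson distribution (Knuth's method, fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def shot_noise(pixel, rng, photons=30):
    """Photon-counting noise: Poisson counts with mean pixel * photons."""
    return clamp(poisson_sample(pixel * photons, rng) / photons)

def apply_noise(image, noise_fn, seed=0):
    """Apply one noise function to every pixel of a 2D list image."""
    rng = random.Random(seed)
    return [[noise_fn(p, rng) for p in row] for row in image]

# Hypothetical flat mid-gray test image, corrupted once per noise type.
image = [[0.5] * 4 for _ in range(4)]
noisy = {
    fn.__name__: apply_noise(image, fn)
    for fn in (gaussian_noise, salt_and_pepper_noise, speckle_noise, shot_noise)
}
```

The practical difference is that Gaussian noise is independent of the signal, speckle and shot noise scale with pixel brightness, and salt-and-pepper noise destroys individual pixels outright, so a single "remove the noise" prompt is asking the model to handle four rather different corruptions at once.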
 
              Fully restore the light in this image. Static camera perspective, no zoom or pan.
 
              The blue ball instantly begins to glow. Static camera perspective, no zoom no pan no movement no dolly no rotation.
 
              Static camera perspective.
 
              Transform the animal in this image into a sketch of the animal surrounded by its family.
 
              The patterns transform into objects.
 
              The bunsen burner at the bottom turns on. Sped up time lapse. Static camera, no pan, no zoom, no dolly.
 
              A person puts all the kids toys in the bucket. Static camera, no pan, no zoom, no dolly.
 
              The image transitions to a depth-map of the scene: Darker colors represent pixels further from the camera, lighter colors represent pixels closer to the camera. The exact color map to use is provided on the right side of the image. Static scene, no pan, no zoom, no dolly.
The image transitions to a surface-normal map of the scene: the red/green/blue color channel specify the direction of the surface-normal at each point, as illustrated on the right side of the image on a sphere. Static scene, no pan, no zoom, no dolly.
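To make the two encodings in the depth and surface-normal prompts concrete, here is a small sketch of how such maps are often produced from known geometry. Both mappings are assumptions on my part: the prompts refer to a color map and a reference sphere shown in the input image, which may differ from the plain grayscale ramp and the common "rescale [-1, 1] to [0, 255]" convention used here.

```python
import math

def depth_to_gray(depth, near, far):
    """Map depth to an 8-bit gray level: near -> 255 (light), far -> 0 (dark)."""
    t = (depth - near) / (far - near)
    t = max(0.0, min(1.0, t))  # clip depths outside the [near, far] range
    return round(255 * (1.0 - t))

def normal_to_rgb(nx, ny, nz):
    """Map a surface normal to 8-bit RGB by rescaling each component from [-1, 1] to [0, 255]."""
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0:
        raise ValueError("normal must be non-zero")
    # Normalize first so that only the direction is encoded, not the magnitude.
    return tuple(round(255 * (c / length + 1.0) / 2.0) for c in (nx, ny, nz))

print(depth_to_gray(1.0, near=1.0, far=10.0))   # 255: closest point is white
print(depth_to_gray(10.0, near=1.0, far=10.0))  # 0: farthest point is black
print(normal_to_rgb(0.0, 0.0, 1.0))             # (128, 128, 255): normal along +z
```

Under this convention, a surface whose normal points along the +z axis comes out as the familiar light blue of many normal maps, which is a quick visual check when judging a generated video.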
There are two images. The left image is different from the right image in 5 spots. Create a static, realistic, smooth animation where a cursor appears and points at each place where the left image is different from the right image. The cursor points one by one and only on the left image. Do not change anything in the right image. No pan. No zoom. No movement. Keep the image static.
A person picks up the vase and puts it back on the table in a sideways orientation. Static camera, no pan, no zoom, no dolly.
Context: Rigid body (top)
 
              A person drapes a thin silk scarf over the vase. Static camera, no pan, no zoom, no dolly.
Context: Soft body (bottom)
 
              The objects fall due to gravity. Static camera, no pan, no zoom, no dolly.
Context: On earth (top)
 
              The objects fall down on the moon due to gravity. Static camera, no pan, no zoom, no dolly.
Context: On the moon (bottom)
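As a rough yardstick for judging the Earth and Moon videos, the expected difference in fall time follows from h = g·t²/2. The sketch below ignores air resistance and uses a hypothetical 1 m drop height; the gravity constants are approximate textbook values, not something stated in the prompts.

```python
import math

G_EARTH = 9.81  # m/s^2, approximate surface gravity on Earth
G_MOON = 1.62   # m/s^2, approximate surface gravity on the Moon

def fall_time(height_m, g):
    """Time in seconds to fall height_m from rest, ignoring air resistance."""
    return math.sqrt(2.0 * height_m / g)

height = 1.0  # hypothetical drop height in metres
t_earth = fall_time(height, G_EARTH)
t_moon = fall_time(height, G_MOON)
print(f"Earth: {t_earth:.2f} s, Moon: {t_moon:.2f} s, ratio: {t_moon / t_earth:.2f}")
# Earth: 0.45 s, Moon: 1.11 s, ratio: 2.46
```

The ratio does not depend on the drop height, so a physically plausible Moon clip should show objects taking roughly two and a half times as long to land as in the Earth clip.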
 
              The hand lets go of the object. Static camera, no pan, no zoom, no dolly.
 
              The hand lets go of the object. Static camera, no pan, no zoom, no dolly.
 
              A giant glass sphere rolls through the room. Static camera, no pan, no zoom, no dolly.
Context: Glass (top)
 
              A giant mirror-polish metal sphere rolls through the room. Static camera, no pan, no zoom, no dolly.
Context: Mirror (bottom)
 
              The spotlight on the left changes color to green, and the spotlight on the right changes color to blue.
Context: Additive (lights, top)
 
              A paintbrush mixes these colors together thoroughly until they blend completely. Static camera, no pan, no zoom.
Context: Subtractive (paints, bottom)
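The additive and subtractive cases above can be approximated with two very simple models, sketched here as a reference for what a "correct" result might look like. Adding RGB channels is a standard idealization for mixing lights; multiplying per-channel reflectances is only a crude stand-in for real paint, and the example colors are my own choices.

```python
def mix_additive(*colors):
    """Mix coloured lights: add RGB channels and clip at 255."""
    return tuple(min(255, sum(channel)) for channel in zip(*colors))

def mix_subtractive(*colors):
    """Mix idealized pigments: multiply per-channel reflectances."""
    result = [1.0, 1.0, 1.0]
    for color in colors:
        result = [r * c / 255 for r, c in zip(result, color)]
    return tuple(round(255 * r) for r in result)

GREEN, BLUE = (0, 255, 0), (0, 0, 255)
CYAN, YELLOW = (0, 255, 255), (255, 255, 0)

print(mix_additive(GREEN, BLUE))      # (0, 255, 255): overlapping green and blue light looks cyan
print(mix_subtractive(CYAN, YELLOW))  # (0, 255, 0): ideal cyan and yellow pigments give green
print(mix_subtractive(GREEN, BLUE))   # (0, 0, 0): ideal green and blue pigments absorb everything
```

Real paints reflect broader bands of the spectrum than these ideal pigments, so an actual mix comes out muddy rather than pure black. The point of the comparison is only that lights get brighter when combined while pigments get darker.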
 
              The balls move in the direction indicated by the arrows. Balls without an arrow don't move. Static scene, no pan, no zoom, no dolly.
The object falls. Static camera, no pan, no zoom, no dolly.
The two objects collide in slow motion. Static camera, no pan, no zoom, no dolly.
The background changes to white. Static camera perspective, no zoom or pan.
 
              The scene transforms into the style of a Hundertwasser painting, without changing perspective or orientation; the macaw does not move. Static camera perspective, no zoom or pan.
 
              Perform colorization on this image. Static camera perspective, no zoom or pan.
 
              The white triangles become smaller and smaller, then disappear altogether. Static camera perspective, no zoom or pan.
 
              Rapidly zoom out of this static image, revealing what’s around it. The camera just zooms back, while the scene itself and everything in it does not move or change at all, it’s a static image.
 
              Animation of the text rapidly changing so that it is made out of different types of candy (top left text) and pretzel sticks (bottom right text). Static camera perspective, no zoom or pan.
 
              Changes happen instantly.
 
              A smooth animation blends the zebra naturally into the scene, removing the background of the zebra image, so that the angle, lighting, and shading look realistic. The final scene perfectly incorporates the zebra into the scene.
 
              Create a smooth, realistic animation where the camera seems to rotate around the object showing the object from all the sides. Do not change anything else. No zoom. No pan.
 
              The knight turns to face to the right and drops on one knee, lifting the shield above his head to protect himself and resting the hilt of his weapon on the ground.
 
              A magical spell smoothly transforms the structure of the teacup into a mouse.
 
              Turn this selfie into a professional headshot for LinkedIn.
 
              A person draws a square. Static camera, no pan, no zoom, no dolly.
 
              A hand quickly removes each of the items in this image, one at a time.
 
              A person puts all the objects that can fit in the backpack inside of it. Static camera, no pan, no zoom, no dolly.
 
              The page is filled line-by-line with hand-written practice variations of the symbol.
Context: Generation of variations (middle)
 
              Stroke-by-stroke, a replica of the symbol is drawn on the right.
Context: Parsing into parts (bottom)
 
              The background of the grid cell with the same symbol as the one indicated on the right turns red. All other grid cells remain unchanged. After that, a spinning color wheel appears in the top right corner.
Context: Recognition (top)
 
              The camera zooms in to give a close up of the person looking out the window, then zooms back out to return to the original view.
 
              A montage clearly showing each step to roll a burrito.
 
              The tap is turned on and water starts flowing rapidly filling the containers. Create a smooth, static animation showing the containers getting filled with water in the correct order.
Without crossing any black boundary, the grey mouse from the corner skillfully navigates the maze by walking around until it finds the yellow cheese.
Modify the lower-right grid to adhere to the rule established by the other grids. You can fill cells, clear cells, or change a cell’s color. Only modify the lower-right grid, don’t modify any of the other grids. Static scene, no zoom, no pan, no dolly.
The path connecting the boy to the object starts glowing slowly. Nothing else changes. No zoom, no pan, no dolly.
Generate a static video animation using the provided letter grid. The task is to highlight the only 5-letter English word CHEAT, which may be oriented in any direction (horizontally, vertically, or diagonally). The animation should consist of a semi-transparent red rectangle with rounded corners smoothly fading into view, perfectly encapsulating the five letters of the word. The rectangle should have a subtle, soft glow. Do not change anything else in the image. The camera must remain locked in place with no movement. No zoom, no pan, no dolly.
Create a smooth animation where a red pen traces all existing edges in a continuous path without lifting the pen. All edges need to be traced. Do not visit any edge twice and do not lift the pen. No zoom, no pan.
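The pen-tracing prompt is an Eulerian trail problem: a drawing can be traced without lifting the pen or repeating an edge only if the graph is connected and has zero or two vertices of odd degree. The sketch below checks that condition and builds a trail with Hierholzer's algorithm. The function name and the example graph (a square with one diagonal) are illustrative assumptions, not the figure used in the paper.

```python
from collections import defaultdict

def euler_trail(edges):
    """Return a vertex sequence that uses every edge exactly once, or None if impossible."""
    if not edges:
        return []
    adjacency = defaultdict(list)
    for index, (u, v) in enumerate(edges):
        adjacency[u].append((v, index))
        adjacency[v].append((u, index))
    # A trail exists only with zero or two odd-degree vertices.
    odd = [v for v in adjacency if len(adjacency[v]) % 2 == 1]
    if len(odd) not in (0, 2):
        return None
    start = odd[0] if odd else edges[0][0]
    used = [False] * len(edges)
    stack, trail = [start], []
    while stack:
        vertex = stack[-1]
        # Discard edges already traversed from the other endpoint.
        while adjacency[vertex] and used[adjacency[vertex][-1][1]]:
            adjacency[vertex].pop()
        if adjacency[vertex]:
            neighbor, index = adjacency[vertex].pop()
            used[index] = True
            stack.append(neighbor)
        else:
            trail.append(stack.pop())
    if len(trail) != len(edges) + 1:
        return None  # some edges were unreachable: the graph is disconnected
    return trail[::-1]

# Hypothetical example: a square A-B-C-D with the diagonal A-C.
square_with_diagonal = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
print(euler_trail(square_with_diagonal))
```

A check like this is handy when scoring generations: if the input graph has more than two odd-degree vertices, no valid video exists, and any output that claims to trace every edge must be cheating somewhere.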
A hand appears and solves the set of linear equations. It replaces the x, y, z matrix with their correct values that solves the equation. Do not change anything else.
Create a static, smooth, animation that solves the puzzle in the given image. The correct pattern should appear at the bottom right to solve the puzzle. Do not change anything else in the picture. No zoom, no pan, no dolly
A hand takes the fitting puzzle piece from the right, rotates it to be in the correct orientation, then puts it into the hole, completing the puzzle. Static scene, no pan, no zoom, no dolly.
A person tries to put the golf ball in the vase. Static camera, no pan, no zoom, no dolly.
Use common sense and have the two robot hands attached to robot arms open the jar, like how a human would.
Context: Jar opening (top)
 
              A human hand holds two metal Baoding balls. The fingers, including the thumb, index, and middle finger, skillfully manipulate the balls, causing them to rotate smoothly like two planets orbiting around each other and continuously in the palm, one ball circling the other in a fluid motion.
Context: Rotating Baoding balls (bottom)
 
              Use common sense and have the two robot hands attached to robot arms throw the ball in the air, the ball goes up off the screen, hands move to positions to catch the ball, and catch the falling ball, like how a human would.
Context: Throwing and catching (middle)
 
              The robot hands mounted on robot arms pick up the hammer, naturally like how a human would.
 
              The robot drives to the blue area. Static camera perspective, no movement no zoom no scan no pan.
A knot is tied connecting these two rope ends.
Generate a video of two metal robotic arms properly folding the t-shirt on the table.
The red couch slides from the left room over into the right room, skillfully maneuvering to fit through the doorways without bumping into the walls. The walls are fixed: they don’t shift or disappear, and no new walls are introduced. Static camera, no pan, no zoom, no dolly.