Wan vs Sora vs Veo 3 - Prompt Comparison Gallery

Side-by-side comparison of Wan, Sora, and Veo 3 generations for prompts extracted from the Veo 3 paper "Video models are zero-shot learners and reasoners". Each card shows the input prompt, the reference image (where one was used), and the resulting videos. Cards are grouped by the capability category highlighted in the paper.


TL;DR

I came across an interesting paper on Simon Willison's blog, "Video models are zero-shot learners and reasoners", in which the researchers ran multiple tests on the Veo 3 model. Their conclusion: video generation models can act as zero-shot vision foundation models. Think GPT-3, but for vision. This page captures the results of running the same prompts through two other video generation models, Wan2.2 and Sora2, to see whether this emergent behaviour is unique to Veo 3 or something broader. It turns out that all three models show impressive capabilities in perception, modelling, and manipulation tasks, but Veo 3 consistently outperforms the others on reasoning tasks. Whether that's due to the model itself or the Gemini-2.5-pro prompt rewriter remains an open question. You can read more about our conclusions here.

A short note on the experiments:

Veo 3 includes a prompt rewriter as part of the system, so it's unclear how much of the observed behaviour comes from the rewriter and how much intelligence can be attributed to the video model alone. If other video generation models show similar behaviour (they may also rewrite prompts to varying degrees, though typically less extensively), then we can be more confident that this is a general emergent phenomenon.

Before diving into the results, I should note a few caveats:

  • We didn't follow the rigorous process of 12 prompts per task that the paper authors used. This was mostly due to cost and time constraints.
  • In some cases we couldn't generate Sora2 output, for reasons such as API failures. Those cards are marked "No Sora generation available."
  • The Wan2.2 and Sora2 videos are shorter than the 8-second Veo 3 videos, again mostly due to cost constraints.
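
For readers who want to work with the gallery programmatically, here is a minimal sketch of how the cards could be represented and grouped. The `Card` class, its field names, and the video file paths are all hypothetical and not taken from the page's actual code; only the figure ids, titles, categories, and page numbers come from the cards themselves.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Card:
    """One gallery card. Field names are illustrative assumptions."""
    figure_id: str
    title: str
    category: str
    page: int                  # page of the paper the prompt came from
    wan_video: Optional[str]   # hypothetical file path
    sora_video: Optional[str]  # None when no Sora generation is available
    veo_video: Optional[str]   # hypothetical file path


def group_by_category(cards):
    """Group cards by capability category, preserving first-seen order."""
    groups = defaultdict(list)
    for card in cards:
        groups[card.category].append(card)
    return dict(groups)


def missing_sora(cards):
    """Return the figure ids of cards that have no Sora output."""
    return [card.figure_id for card in cards if card.sora_video is None]


# Three cards from the gallery; the .mp4 paths are placeholders.
cards = [
    Card("figure_10", "Edge detection", "Perception", 17,
         "wan/figure_10.mp4", "sora/figure_10.mp4", "veo/figure_10.mp4"),
    Card("figure_14", "Blind deblurring", "Perception", 18,
         "wan/figure_14.mp4", None, "veo/figure_14.mp4"),
    Card("figure_57", "Maze solving", "Reasoning & Puzzles", 30,
         "wan/figure_57.mp4", "sora/figure_57.mp4", "veo/figure_57.mp4"),
]

for category, group in group_by_category(cards).items():
    print(category, [card.figure_id for card in group])
print("No Sora generation:", missing_sora(cards))
```

Keeping the Sora slot optional makes the gaps explicit, which matters when comparing models: a missing video is not the same as a failed task.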

Perception

Figure 10 · Edge detection(figure_10)

All edges in this image become more salient by transforming into black outlines. Then, all objects fade away, with just the edges remaining on a white background. Static camera perspective, no zoom or pan.

Page 17

Input reference for figure_10
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 11 · Segmentation(figure_11)

Create an animation of instance segmentation being performed on this photograph: each distinct entity is overlaid in a different flat color.
Scene:
  • The animation starts from the provided, unaltered photograph.
  • The scene in the photograph is static and doesn’t move.
  • First, the background fades to {white, green}.
  • Then, the first entity is covered by a flat color, perfectly preserving its silhouette.
  • Then the second entity, too, is covered by a different flat color, perfectly preserving its silhouette.
  • One by one, each entity is covered by a different flat color.
  • Finally, all entities are covered with different colors.
Camera:
  • Static shot without camera movement.
  • No pan.
  • No rotation.
  • No zoom.
  • No glitches or artifacts.

Page 17

Input reference for figure_11
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 12 · Keypoint localization(figure_12)

Add a bright blue dot at the tip of the branch on which the macaw is sitting. The macaw’s eye turns bright red. Everything else turns pitch black. Static camera perspective, no zoom or pan.

Page 17

Input reference for figure_12
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 13 · Super-resolution(figure_13)

Perform superresolution on this image. Static camera perspective, no zoom or pan.

Page 17

Input reference for figure_13
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 14 · Blind deblurring(figure_14)

Unblur image including background. Static camera perspective, no zoom or pan.

Page 18

Input reference for figure_14
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 15 · Blind denoising. Each quadrant was corrupted with a different type of noise(figure_15_clockwise-from-top-left-gaussian-noise-salt-and-pepper-noise-speckle-noise-shot-noise)

Remove the noise from this image. Static camera perspective, no zoom or pan.

Page 18

Context: Clockwise from top left: Gaussian noise, salt-and-pepper noise, speckle noise, shot noise

Input reference for figure_15_clockwise-from-top-left-gaussian-noise-salt-and-pepper-noise-speckle-noise-shot-noise
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference
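
To make the Figure 15 setup concrete, here is a rough sketch of how an input with four differently corrupted quadrants could be built. It is not the paper's procedure: the noise strengths (`sigma`, `amount`, `photons`), the function names, and the grayscale list-of-lists image format are all assumptions chosen for illustration. Only the quadrant order (clockwise from top left: Gaussian, salt-and-pepper, speckle, shot) comes from the figure context.

```python
import math
import random


def clamp(v):
    """Keep a pixel value inside [0, 1]."""
    return min(1.0, max(0.0, v))


def gaussian(v, rng, sigma=0.1):
    """Additive zero-mean Gaussian noise (sigma is an assumed value)."""
    return clamp(v + rng.gauss(0.0, sigma))


def salt_and_pepper(v, rng, amount=0.1):
    """Replace a fraction of pixels with pure black or pure white."""
    r = rng.random()
    if r < amount / 2:
        return 0.0
    if r < amount:
        return 1.0
    return v


def speckle(v, rng, sigma=0.2):
    """Multiplicative noise: the pixel is scaled by a random factor."""
    return clamp(v * (1.0 + rng.gauss(0.0, sigma)))


def poisson(lam, rng):
    """Draw a Poisson sample with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def shot(v, rng, photons=30):
    """Shot noise: simulate counting a Poisson number of photons."""
    return clamp(poisson(v * photons, rng) / photons)


def corrupt_quadrants(image, seed=0):
    """Apply a different noise type to each quadrant of a grayscale image.

    Clockwise from top left: Gaussian, salt-and-pepper, speckle, shot.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    out = []
    for y, row in enumerate(image):
        new_row = []
        for x, v in enumerate(row):
            top, left = y < h // 2, x < w // 2
            if top and left:
                noise = gaussian
            elif top:
                noise = salt_and_pepper
            elif not left:
                noise = speckle
            else:
                noise = shot
            new_row.append(noise(v, rng))
        out.append(new_row)
    return out


# A flat mid-gray 8x8 test image, corrupted quadrant by quadrant.
image = [[0.5] * 8 for _ in range(8)]
noisy = corrupt_quadrants(image, seed=1)
```

The four noise types have different statistics (additive, impulsive, multiplicative, and signal-dependent), which is why removing all of them from a single prompt counts as "blind" denoising.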

Figure 16 · Low-light enhancing(figure_16)

Fully restore the light in this image. Static camera perspective, no zoom or pan.

Page 18

Input reference for figure_16
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 17 · Conjunctive search / binding problem(figure_17)

The blue ball instantly begins to glow. Static camera perspective, no zoom no pan no movement no dolly no rotation.

Page 18

Input reference for figure_17
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 18 · Dalmatian illusion understanding(figure_18)

Static camera perspective.

Page 19

Input reference for figure_18
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 19 · Shape (cue-conflict) understanding(figure_19)

Transform the animal in this image into a sketch of the animal surrounded by its family.

Page 19

Input reference for figure_19
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 20 · Rorschach blot interpretation(figure_20)

The patterns transform into objects.

Page 19

Input reference for figure_20
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 21 · Material properties(figure_21)

The bunsen burner at the bottom turns on. Sped up time lapse. Static camera, no pan, no zoom, no dolly.

Page 19

Input reference for figure_21
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 29 · Categorizing objects(figure_29)

A person puts all the kids toys in the bucket. Static camera, no pan, no zoom, no dolly.

Page 22

Input reference for figure_29
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 62 · Monocular depth estimation(figure_62)

The image transitions to a depth-map of the scene: Darker colors represent pixels further from the camera, lighter colors represent pixels closer to the camera. The exact color map to use is provided on the right side of the image. Static scene, no pan, no zoom, no dolly.

Page 42

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 63 · Monocular surface normal estimation(figure_63)

The image transitions to a surface-normal map of the scene: the red/green/blue color channel specify the direction of the surface-normal at each point, as illustrated on the right side of the image on a sphere. Static scene, no pan, no zoom, no dolly.

Page 42

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 70 · Spot the difference(figure_70)

There are two images. The left image is different from the right image in 5 spots. Create a static, realistic, smooth animation where a cursor appears and points at each place where the left image is different from the right image. The cursor points one by one and only on the left image. Do not change anything in the right image. No pan. No zoom. No movement. Keep the image static.

Page 44

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Physics & Materials

Figure 22 · Physics body transform. Rigid body (top)(figure_22_rigid-body-top)

A person picks up the vase and puts it back on the table in a sideways orientation. Static camera, no pan, no zoom, no dolly.

Page 20

Context: Rigid body (top)

Input reference for figure_22_rigid-body-top
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 22 · Physics body transform. Soft body (bottom)(figure_22_soft-body-bottom)

A person drapes a thin silk scarf over the vase. Static camera, no pan, no zoom, no dolly.

Page 20

Context: Soft body (bottom)

Input reference for figure_22_soft-body-bottom
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 23 · Gravity and air resistance. On earth (top)(figure_23_on-earth-top)

The objects fall due to gravity. Static camera, no pan, no zoom, no dolly.

Page 20

Context: On earth (top)

Input reference for figure_23_on-earth-top
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 23 · Gravity and air resistance. On the moon (bottom)(figure_23_on-the-moon-bottom)

The objects fall down on the moon due to gravity. Static camera, no pan, no zoom, no dolly.

Page 20

Context: On the moon (bottom)

Input reference for figure_23_on-the-moon-bottom
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 24 · Buoyancy(figure_24)

The hand lets go of the object. Static camera, no pan, no zoom, no dolly.

Page 21

Input reference for figure_24
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 24 · Buoyancy(figure_24a)

The hand lets go of the object. Static camera, no pan, no zoom, no dolly.

Page 21

Input reference for figure_24a
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 27 · Material optics. Glass (top)(figure_27_glass-top)

A giant glass sphere rolls through the room. Static camera, no pan, no zoom, no dolly.

Page 22

Context: Glass (top)

Input reference for figure_27_glass-top
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 27 · Material optics. Mirror (bottom)(figure_27_mirror-bottom)

A giant mirror-polish metal sphere rolls through the room. Static camera, no pan, no zoom, no dolly.

Page 22

Context: Mirror (bottom)

Input reference for figure_27_mirror-bottom
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 28 · Color mixing. Additive (lights, top)(figure_28_additive-lights-top)

The spotlight on the left changes color to green, and the spotlight on the right changes color to blue.

Page 22

Context: Additive (lights, top)

Input reference for figure_28_additive-lights-top
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 28 · Color mixing. Subtractive (paints, bottom)(figure_28_subtractive-paints-bottom)

A paintbrush mixes these colors together thoroughly until they blend completely. Static camera, no pan, no zoom.

Page 22

Context: Subtractive (paints, bottom)

Input reference for figure_28_subtractive-paints-bottom
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 64 · Force & motion prompting, inspired by [91, 92]. Force prompting (top)(figure_64)

The balls move in the direction indicated by the arrows. Balls without an arrow don't move. Static scene, no pan, no zoom, no dolly.

Page 42

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 72 · Glass falling(figure_72)

The object falls. Static camera, no pan, no zoom, no dolly.

Page 44

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 73 · Collisions(figure_73)

The two objects collide in slow motion. Static camera, no pan, no zoom, no dolly.

Page 44

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Editing & Generation

Figure 32 · Background removal(figure_32)

The background changes to white. Static camera perspective, no zoom or pan.

Page 23

Input reference for figure_32
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 33 · Style transfer(figure_33)

The scene transforms into the style of a Hundertwasser painting, without changing perspective or orientation; the macaw does not move. Static camera perspective, no zoom or pan.

Page 24

Input reference for figure_33
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 34 · Colorization(figure_34)

Perform colorization on this image. Static camera perspective, no zoom or pan.

Page 24

Input reference for figure_34
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 35 · Inpainting(figure_35)

The white triangles become smaller and smaller, then disappear altogether. Static camera perspective, no zoom or pan.

Page 24

Input reference for figure_35
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 36 · Outpainting(figure_36)

Rapidly zoom out of this static image, revealing what’s around it. The camera just zooms back, while the scene itself and everything in it does not move or change at all, it’s a static image.

Page 24

Input reference for figure_36
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 37 · Text manipulation(figure_37)

Animation of the text rapidly changing so that it is made out of different types of candy (top left text) and pretzel sticks (bottom right text). Static camera perspective, no zoom or pan.

Page 25

Input reference for figure_37
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 38 · Image editing with doodles(figure_38)

Changes happen instantly.

Page 25

Input reference for figure_38
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 39 · Scene composition(figure_39)

A smooth animation blends the zebra naturally into the scene, removing the background of the zebra image, so that the angle, lighting, and shading look realistic. The final scene perfectly incorporates the zebra into the scene.

Page 25

Input reference for figure_39
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 40 · Single-image novel view synthesis(figure_40)

Create a smooth, realistic animation where the camera seems to rotate around the object showing the object from all the sides. Do not change anything else. No zoom. No pan.

Page 25

Input reference for figure_40
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 41 · 3D-aware reposing(figure_41)

The knight turns to face to the right and drops on one knee, lifting the shield above his head to protect himself and resting the hilt of his weapon on the ground.

Page 26

Input reference for figure_41
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 42 · Transfiguration(figure_42)

A magical spell smoothly transforms the structure of the teacup into a mouse.

Page 26

Input reference for figure_42
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 43 · Professional headshot generation(figure_43)

Turn this selfie into a professional headshot for LinkedIn.

Page 26

Input reference for figure_43
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 46 · Drawing(figure_46)

A person draws a square. Static camera, no pan, no zoom, no dolly.

Page 27

Input reference for figure_46
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Reasoning & Puzzles

Figure 25 · Visual Jenga, inspired by [51](figure_25)

A hand quickly removes each of the items in this image, one at a time.

Page 21

Input reference for figure_25
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 26 · Object packing(figure_26)

A person puts all the objects that can fit in the backpack inside of it. Static camera, no pan, no zoom, no dolly.

Page 21

Input reference for figure_26
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 30 · Character recognition, generation, and parsing, inspired by the Omniglot dataset [52](figure_30_generation-of-variations-middle)

The page is filled line-by-line with hand-written practice variations of the symbol.

Page 23

Context: Generation of variations (middle)

Input reference for figure_30_generation-of-variations-middle
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 30 · Character recognition, generation, and parsing, inspired by the Omniglot dataset [52](figure_30_parsing-into-parts-bottom)

Stroke-by-stroke, a replica of the symbol is drawn on the right.

Page 23

Context: Parsing into parts (bottom)

Input reference for figure_30_parsing-into-parts-bottom
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 30 · Character recognition, generation, and parsing, inspired by the Omniglot dataset [52](figure_30_recognition-top)

The background of the grid cell with the same symbol as the one indicated on the right turns red. All other grid cells remain unchanged. After that, a spinning color wheel appears in the top right corner.

Page 23

Context: Recognition (top)

Input reference for figure_30_recognition-top
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 31 · Memory of world states(figure_31)

The camera zooms in to give a close up of the person looking out the window, then zooms back out to return to the original view.

Page 23

Input reference for figure_31
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 47 · Visual instruction generation(figure_47)

A montage clearly showing each step to roll a burrito.

Page 28

Input reference for figure_47
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 56 · Water puzzle solving(figure_56)

The tap is turned on and water starts flowing rapidly filling the containers. Create a smooth, static animation showing the containers getting filled with water in the correct order.

Page 30

Wan Output

Sora Output

Veo 3 Reference

Figure 57 · Maze solving(figure_57)

Without crossing any black boundary, the grey mouse from the corner skillfully navigates the maze by walking around until it finds the yellow cheese.

Page 30

Wan Output

Sora Output

Veo 3 Reference

Figure 59 · Rule extrapolation inspired by ARC-AGI [84](figure_59)

Modify the lower-right grid to adhere to the rule established by the other grids. You can fill cells, clear cells, or change a cell’s color. Only modify the lower-right grid, don’t modify any of the other grids. Static scene, no zoom, no pan, no dolly.

Page 30

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 66 · Connect the path puzzle(figure_66)

The path connecting the boy to the object starts glowing slowly. Nothing else changes. No zoom, no pan, no dolly.

Page 43

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 67 · Five letter word search(figure_67)

Generate a static video animation using the provided letter grid. The task is to highlight the only 5-letter English word CHEAT, which may be oriented in any direction (horizontally, vertically, or diagonally). The animation should consist of a semi-transparent red rectangle with rounded corners smoothly fading into view, perfectly encapsulating the five letters of the word. The rectangle should have a subtle, soft glow. Do not change anything else in the image. The camera must remain locked in place with no movement. No zoom, no pan, no dolly.

Page 43

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 68 · Eulerian path(figure_68)

Create a smooth animation where a red pen traces all existing edges in a continuous path without lifting the pen. All edges need to be traced. Do not visit any edge twice and do not lift the pen. No zoom, no pan.

Page 43

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 69 · Solving system of linear equations(figure_69)

A hand appears and solves the set of linear equations. It replaces the x, y, z matrix with their correct values that solves the equation. Do not change anything else.

Page 44

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 71 · Visual IQ test(figure_71)

Create a static, smooth, animation that solves the puzzle in the given image. The correct pattern should appear at the bottom right to solve the puzzle. Do not change anything else in the picture. No zoom, no pan, no dolly

Page 44

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 74 · Tiling puzzles. Jigsaw puzzle (top)(figure_74)

A hand takes the fitting puzzle piece from the right, rotates it to be in the correct orientation, then puts it into the hole, completing the puzzle. Static scene, no pan, no zoom, no dolly.

Page 45

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 75 · Bottleneck(figure_75)

A person tries to put the golf ball in the vase. Static camera, no pan, no zoom, no dolly.

Page 45

Wan Output

Sora Output

Veo 3 Reference

Robotics & Dexterity

Figure 44 · Dexterous manipulation. Jar opening (top)(figure_44_jar-opening-top)

Use common sense and have the two robot hands attached to robot arms open the jar, like how a human would.

Page 27

Context: Jar opening (top)

Input reference for figure_44_jar-opening-top
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 44 · Dexterous manipulation. Rotating Baoding balls (bottom)(figure_44_rotating-baoding-balls-bottom)

A human hand holds two metal Baoding balls. The fingers, including the thumb, index, and middle finger, skillfully manipulate the balls, causing them to rotate smoothly like two planets orbiting around each other and continuously in the palm, one ball circling the other in a fluid motion.

Page 27

Context: Rotating Baoding balls (bottom)

Input reference for figure_44_rotating-baoding-balls-bottom
Input reference

Wan Output

Sora Output

Veo 3 Reference

Figure 44 · Dexterous manipulation. Throwing and catching (middle)(figure_44_throwing-and-catching-middle)

Use common sense and have the two robot hands attached to robot arms throw the ball in the air, the ball goes up off the screen, hands move to positions to catch the ball, and catch the falling ball, like how a human would.

Page 27

Context: Throwing and catching (middle)

Input reference for figure_44_throwing-and-catching-middle
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 45 · Affordance recognition(figure_45)

The robot hands mounted on robot arms pick up the hammer, naturally like how a human would.

Page 27

Input reference for figure_45
Input reference

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 58 · Robot navigation(figure_58)

The robot drives to the blue area. Static camera perspective, no movement no zoom no scan no pan.

Page 30

Wan Output

Sora Output

Veo 3 Reference

Figure 65 · Tying the knot(figure_65)

A knot is tied connecting these two rope ends.

Page 43

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 76 · Laundry folding(figure_76)

Generate a video of two metal robotic arms properly folding the t-shirt on the table.

Page 45

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference

Figure 77 · Motion planning; inspired by the piano mover’s problem(figure_77)

The red couch slides from the left room over into the right room, skillfully maneuvering to fit through the doorways without bumping into the walls. The walls are fixed: they don’t shift or disappear, and no new walls are introduced. Static camera, no pan, no zoom, no dolly.

Page 46

Wan Output

Sora Output

No Sora generation available.

Veo 3 Reference