Computer vision (for robots)
405 words · 3 min read · 2 sources
Computer vision is how a robot makes sense of what its camera sees. It turns pixels into objects, distances, and decisions — and it's done a lot more of the heavy lifting in modern robotics than you'd guess.
Difficulty 3/5 · Classroom
💡 Think of it like…
Think of it like your own sight: the camera plays the role of the eye, and computer vision plays the role of the part of the brain that turns what the eye sees into objects and distances.
🇮🇳 In India
Cropnosis, an Indian startup, uses computer vision on drone footage to detect crop diseases before they spread — saving lakhs of rupees per farm.
Why it matters
Without computer vision, many robotic systems simply couldn't work: a robot that can't interpret its camera feed can't recognise objects, judge distances, or follow things that move.
🤯 On suitable hardware, a state-of-the-art vision model can assign an image to one of 1,000 object categories in under 1 millisecond. A human takes ~100 ms to consciously identify an object.
🎯 Quick challenge
What neural network architecture revolutionised computer vision in 2012?
Computer vision is how a robot makes sense of what its camera sees. It turns a flat grid of pixels into "there's a person 3 metres ahead, facing me, walking forward at 1 m/s." It's done a lot more of the heavy lifting in modern robotics than you'd guess.
The five jobs computer vision does for robots
Classification. "What is this thing?" — given a picture, output a label: dog, traffic light, screwdriver.
Detection. "Where is the thing?" — given a picture, output a bounding box around every object of interest plus its label.
Segmentation. "Which pixels are the thing?" — given a picture, output a precise outline of every object, pixel-perfect.
Depth estimation. "How far is the thing?" — given a picture (or two from stereo cameras), output a depth map.
Tracking. "Where is the same thing going?" — given a sequence of pictures, follow each object frame by frame.
Most robot vision pipelines chain these together: detect objects → estimate their depth → track their motion → decide what to do.
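The chain above can be sketched in a few lines of Python. This is a toy illustration, not a real robot stack: the detector output is made up by hand, and the names (Detection, stereo_depth_m, match_tracks, decide), the camera numbers (focal_px, baseline_m), and the thresholds are all assumptions chosen for the example. Depth uses the standard stereo relation depth = focal length × baseline / disparity, and tracking pairs boxes between frames by how much they overlap (intersection over union).

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple           # (x1, y1, x2, y2) in pixels
    disparity_px: float  # stereo disparity at the box centre (made-up input)


def stereo_depth_m(disparity_px, focal_px=700.0, baseline_m=0.12):
    """Stereo relation: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_tracks(prev, curr, min_iou=0.3):
    """Greedy tracking: pair each current box with the best-overlapping previous box."""
    pairs, used = [], set()
    for c in curr:
        best, best_score = None, min_iou
        for i, p in enumerate(prev):
            score = iou(p.box, c.box)
            if i not in used and p.label == c.label and score >= best_score:
                best, best_score = i, score
        if best is not None:
            used.add(best)
            pairs.append((prev[best], c))
    return pairs


def decide(pairs, dt_s, stop_distance_m=2.0):
    """Stop if any tracked person is closer than the limit and still approaching."""
    for before, after in pairs:
        z0 = stereo_depth_m(before.disparity_px)
        z1 = stereo_depth_m(after.disparity_px)
        closing_speed = (z0 - z1) / dt_s  # positive means the object is approaching
        if after.label == "person" and z1 < stop_distance_m and closing_speed > 0:
            return "stop"
    return "continue"


# Two frames taken one second apart, with hand-written detector output.
frame_a = [Detection("person", (100, 50, 200, 300), disparity_px=28.0)]  # about 3.0 m away
frame_b = [Detection("person", (105, 50, 210, 305), disparity_px=44.0)]  # about 1.9 m away
action = decide(match_tracks(frame_a, frame_b), dt_s=1.0)
print(action)  # stop
```

In a real system each stage would be a learned model rather than a hand-written function, but the data handed from stage to stage (boxes, depths, matched tracks) looks much like this.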
How it actually works today
Until ~2012, computer vision was hand-engineered: humans wrote rules ("if the pixels here are dark and edge-aligned, it's probably a road"). It worked in controlled settings but mostly failed in the messy real world.
After 2012 (the AlexNet moment), the field switched to convolutional neural networks (CNNs). You feed millions of labelled pictures to a network. The network learns the features on its own.
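To make "learns the features" concrete, here is a minimal sketch of the one operation a CNN repeats over and over: sliding a small grid of numbers (a kernel) across the image. The kernel below is a hand-designed edge detector of the kind people wrote before 2012; in a CNN the numbers in the kernel are not written by a person but adjusted during training. The function name conv2d and the toy image are assumptions for illustration, and training itself is not shown.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    This is the cross-correlation that deep-learning libraries call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# A 4x6 toy image: dark (0) on the left half, bright (9) on the right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Hand-designed vertical-edge kernel (Sobel-style).
# A CNN would start from random numbers here and learn useful ones from data.
edge_kernel = [[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]]

response = conv2d(image, edge_kernel)
print(response)  # [[0, 36, 36, 0], [0, 36, 36, 0]]
```

The output is large only where the image changes from dark to bright, which is what "detecting a feature" means at the lowest level. A CNN stacks many learned kernels like this in layers.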
From around 2021, transformer-based vision models (ViT, DINO, SAM) became state of the art. These can be trained on much larger datasets that need less hand-labelling, and they generalise better to scenes they haven't seen before.
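One idea behind ViT-style models is easy to show in code: instead of sliding kernels over pixels, the image is cut into small square patches, and each patch is flattened into a "token" that the transformer treats like a word in a sentence. The sketch below shows only that first step on a tiny 4x4 image. The function name patchify is an assumption for illustration, and the transformer itself is not shown.

```python
def patchify(image, patch):
    """Split a 2D image into non-overlapping patch x patch squares, each flattened to a list."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch)
                           for j in range(patch)])
    return tokens


# A 4x4 toy image whose pixel values are simply 0..15, row by row.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

tokens = patchify(image, 2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]

# The same arithmetic at a common real-world size:
# a 224x224 image cut into 16x16 patches gives (224 // 16) ** 2 tokens.
print((224 // 16) ** 2)  # 196
```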
A modern robot like Optimus or Figure 03 is built around a single neural network that does detection, segmentation, depth, and tracking in one pass. At 30 frames per second that leaves roughly 33 ms to process each frame, on an onboard computer drawing a few hundred watts.
Tesla vs. Waymo — the classic vision debate
Tesla uses vision-only self-driving: just cameras, no lidar, no radar.
Waymo uses sensor fusion: cameras plus lidar plus radar.
Tesla's argument: humans drive with just two cameras (eyes), so cars should be able to. Waymo's argument: cars don't need to be limited to human senses — give them lidar too and they're safer.
Neither approach is finished. The argument is one of the biggest open questions in modern robotics.
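To see what "sensor fusion" means in its simplest form, here is a sketch that combines two distance estimates for the same object, one from a camera and one from lidar, by trusting the less noisy sensor more (an inverse-variance weighted average). This is a textbook technique used purely for illustration, not a description of how Waymo or Tesla build their systems, and the function name fuse and the numbers are made up.

```python
def fuse(estimates):
    """Inverse-variance weighted average of (value, std_dev) pairs.

    Sensors with a smaller standard deviation get a larger weight.
    Returns the fused value and its standard deviation.
    """
    weights = [1.0 / (std * std) for _, std in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    fused_std = (1.0 / total) ** 0.5
    return value, fused_std


# Hypothetical readings for one object, in metres: (distance, uncertainty).
camera = (10.0, 1.0)  # camera depth estimate, noisier
lidar = (9.0, 0.5)    # lidar range, more precise

fused_value, fused_std = fuse([camera, lidar])
print(round(fused_value, 2))  # 9.2
print(round(fused_std, 3))    # 0.447
```

The fused estimate sits closer to the lidar reading and is less uncertain than either sensor alone. That is the core of the fusion argument; the vision-only counterargument is that a good enough camera model makes the extra sensors unnecessary.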
For many people, the first robot with computer vision in their home is a Roomba with a navigation camera. Read How a Roomba decides where to clean.
Last updated · 2026-05-19