Computer vision (for robots)

हिंदी: कंप्यूटर विज़न (दृष्टि प्रणाली)


Computer vision is how a robot makes sense of what its camera sees. It turns pixels into objects, distances, and decisions — and it's done a lot more of the heavy lifting in modern robotics than you'd guess.


Difficulty 3/5 · Classroom

A robot's camera delivers a flat grid of pixels. Computer vision turns that grid into a statement the robot can act on, such as "there's a person 3 metres ahead, facing me, walking forward at 1 m/s."

💡 Think of it like…

Think of it like the box a phone camera draws around a face: the camera is not just recording pixels, it is finding something in them. Robot vision applies the same idea to everything the robot needs to notice.

🇮🇳 In India

Cropnosis, an Indian startup, uses computer vision on drone footage to detect crop diseases before they spread — saving lakhs of rupees per farm.

Why it matters

Without computer vision, many robotic systems simply couldn't work: a robot that cannot interpret its camera feed cannot find objects, avoid obstacles, or navigate by sight.

Real robots: Perseverance Rover, Da Vinci Surgical, Amazon warehouse robots

Used in: healthcare, agriculture, retail, security, autonomous vehicles

🤯 A state-of-the-art vision model running on a modern GPU can assign an image to one of 1,000 object categories in about a millisecond. A human takes roughly 100 ms to consciously identify an object.

🎯 Quick challenge

What neural network architecture revolutionised computer vision in 2012?

The five jobs computer vision does for robots

Classification. "What is this thing?" — given a picture, output a label: dog, traffic light, screwdriver.

Detection. "Where is the thing?" — given a picture, output a bounding box around every object of interest plus its label.

Segmentation. "Which pixels are the thing?" Given a picture, output a mask for every object that marks exactly which pixels belong to it.

Depth estimation. "How far is the thing?" — given a picture (or two from stereo cameras), output a depth map.

Tracking. "Where is the same thing going?" — given a sequence of pictures, follow each object frame by frame.

Most robot vision pipelines chain these together: detect objects → estimate their depth → track their motion → decide what to do.
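The chain above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular robot does it: the Detection class, the camera values (focal_px, baseline_m), the nearest-neighbour tracker, and the stop rule are all made up for the example. Detections are assumed to arrive from some detector; depth uses the standard stereo relation depth = focal length × baseline / disparity.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Detection:
    label: str           # what the detector thinks the object is
    cx: float            # bounding-box centre in pixels
    cy: float
    disparity_px: float  # horizontal shift between left and right camera images


def stereo_depth_m(disparity_px, focal_px=800.0, baseline_m=0.125):
    """Stereo relation: depth = f * B / d. Camera values are hypothetical."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def match_tracks(previous, current, max_jump_px=50.0):
    """Greedy nearest-neighbour tracking: pair each current detection with the
    closest unused previous detection that has the same label."""
    pairs, used = [], set()
    for cur in current:
        best, best_dist = None, max_jump_px
        for i, prev in enumerate(previous):
            if i in used or prev.label != cur.label:
                continue
            dist = hypot(cur.cx - prev.cx, cur.cy - prev.cy)
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            used.add(best)
            pairs.append((previous[best], cur))
    return pairs


def decide(pairs, dt_s, stop_distance_m=2.0):
    """Return 'stop' if any tracked object is both close and getting closer."""
    for prev, cur in pairs:
        z_prev = stereo_depth_m(prev.disparity_px)
        z_cur = stereo_depth_m(cur.disparity_px)
        closing_speed = (z_prev - z_cur) / dt_s  # metres per second
        if z_cur < stop_distance_m and closing_speed > 0:
            return "stop"
    return "continue"


# Two consecutive frames, 1/30 s apart: the same person moves closer.
frame_a = [Detection("person", 320, 240, 40.0)]
frame_b = [Detection("person", 322, 241, 80.0)]
print(decide(match_tracks(frame_a, frame_b), dt_s=1 / 30))
```

A real tracker would also handle objects that appear, disappear, or cross paths; the greedy matching here is the simplest thing that shows the idea.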

How it actually works today

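Until about 2012, computer vision was mostly hand-engineered: humans wrote rules ("if the pixels here are dark and edge-aligned, it's probably a road"). This worked in controlled settings but broke down under real-world changes in lighting, weather, and viewpoint.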
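To make the "dark and edge-aligned" rule concrete, here is a toy version of it. The function name and both thresholds are invented for this sketch; real hand-engineered systems were more elaborate, but they shared the same weakness: the numbers were chosen by a person for one set of conditions.

```python
def looks_like_road(patch, dark_threshold=80, edge_threshold=40):
    """Hand-written rule: a patch is 'road' if it is dark on average and
    contains a strong brightness jump between horizontal neighbours.
    patch is a list of rows of grayscale values (0-255)."""
    pixels = [p for row in patch for p in row]
    mean_brightness = sum(pixels) / len(pixels)
    strongest_edge = max(
        abs(row[i + 1] - row[i]) for row in patch for i in range(len(row) - 1)
    )
    return mean_brightness < dark_threshold and strongest_edge > edge_threshold


# Dark asphalt with a bright lane marking: the rule fires.
shaded_road = [[30, 30, 30, 200]] * 4
# The same road in bright sunlight: the rule fails, although nothing else changed.
sunlit_road = [[120, 120, 120, 250]] * 4

print(looks_like_road(shaded_road), looks_like_road(sunlit_road))
```

Every new failure needs another hand-tuned rule, which is why this approach scaled so poorly.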

After 2012 (the AlexNet moment), the field switched to convolutional neural networks (CNNs). You feed millions of labelled pictures to a network. The network learns the features on its own.
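The core operation inside a CNN is small enough to write out by hand. The sketch below slides a 3×3 kernel over a tiny grayscale image. Here the kernel is a hand-written vertical-edge detector chosen for illustration; in a trained CNN the nine numbers start out random and are adjusted during training, which is what "learns the features on its own" means.

```python
def conv2d(image, kernel):
    """2D cross-correlation with no padding: at each position, multiply the
    kernel against the image patch underneath it and sum the results."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Responds strongly where brightness increases from left to right.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

# A 4x6 image: dark on the left half, bright on the right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

print(conv2d(image, vertical_edge))  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The output is zero over the flat regions and large where the window straddles the dark-to-bright boundary. A CNN stacks many such kernels in layers, so later layers can respond to corners, textures, and eventually whole objects.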

From around 2021, transformer-based vision models (ViT, DINO, SAM) became state of the art. They can be trained on much larger datasets with far fewer human-written labels, and they generalise better to scenes they were not trained on.
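One idea behind ViT-style models is easy to show: instead of sliding a kernel over the image, the image is cut into fixed-size patches, and each patch becomes one "token" for the transformer. The helper below (a hypothetical name, written for this sketch) does only that cutting step; the transformer itself is not shown.

```python
def image_to_patches(image, patch_size):
    """Cut a 2D image into non-overlapping square patches, read left to right
    and top to bottom, and flatten each patch into one list of pixel values."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append(
                [
                    image[top + dy][left + dx]
                    for dy in range(patch_size)
                    for dx in range(patch_size)
                ]
            )
    return patches


# A 4x4 image with pixel values 0..15, cut into 2x2 patches.
tiny = [[row * 4 + col for col in range(4)] for row in range(4)]
print(image_to_patches(tiny, 2))
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

With the common 224×224 input and 16×16 patches, this step yields 14 × 14 = 196 tokens per image.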

Modern robots such as Optimus or Figure 03 are moving toward a single neural network that handles detection, segmentation, depth, and tracking in one pass, at around 30 frames per second, on an onboard computer drawing a few hundred watts.
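Running at 30 frames per second sets a hard time limit: everything the vision system does for one frame must finish before the next frame arrives. The arithmetic is below. The per-stage timings are hypothetical numbers chosen for the example, not measurements from any robot.

```python
def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def fits_budget(stage_ms, fps):
    """True if the stages, run one after another, finish within one frame."""
    return sum(stage_ms.values()) <= frame_budget_ms(fps)


# Hypothetical timings for four separate stages (milliseconds).
stages = {"detect": 12.0, "segment": 9.0, "depth": 6.0, "track": 2.0}

print(round(frame_budget_ms(30), 1))  # 33.3
print(fits_budget(stages, fps=30))    # True: 29 ms fits in 33.3 ms
print(fits_budget(stages, fps=60))    # False: 29 ms does not fit in 16.7 ms
```

This is one reason a shared network is attractive: the expensive early layers run once per frame instead of once per task.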

Tesla vs. Waymo — the classic vision debate

Tesla uses vision-only self-driving: just cameras, no lidar, no radar.

Waymo uses sensor fusion: cameras plus lidar plus radar.

Tesla's argument: humans drive with just two cameras (eyes), so cars should be able to. Waymo's argument: cars don't need to be limited to human senses — give them lidar too and they're safer.

Neither approach is finished. The argument is one of the biggest open questions in modern robotics.
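The statistical intuition behind sensor fusion can be sketched with one formula: when two sensors measure the same distance, weight each by how much you trust it (the inverse of its variance). This is a textbook technique shown for illustration, not a description of Waymo's or Tesla's actual software, and the sensor readings and error figures below are invented.

```python
def fuse(estimates):
    """Inverse-variance weighted average.
    estimates: list of (value, standard_deviation) pairs.
    Returns (fused_value, fused_standard_deviation)."""
    weights = [1.0 / (std * std) for _, std in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, (1.0 / total) ** 0.5


# Hypothetical readings of the distance to the car ahead, in metres.
camera = (20.0, 2.0)  # camera depth estimate, less certain
lidar = (21.0, 0.5)   # lidar range, more certain

distance, uncertainty = fuse([camera, lidar])
print(round(distance, 2), round(uncertainty, 2))
```

The fused estimate sits close to the more trusted sensor, and its uncertainty is lower than either sensor's alone. The counter-argument is cost and complexity: every extra sensor must be bought, calibrated, and kept in agreement with the others.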

One of the first practical uses of computer vision many people have had at home is the navigation camera on camera-equipped Roomba models. Read How a Roomba decides where to clean.

Last updated · 2026-05-19
