For humans, identifying objects in a scene—whether it’s an avocado or an Aventador, a pile of mashed potatoes or an alien mothership—is as simple as looking at them. But for AI and computer vision systems, developing a high-fidelity understanding of the surrounding environment takes considerably more effort. Around 800 hours of manually labeled training images’ worth of effort, if we’re being specific. To help machines better understand what they’re seeing, a team of researchers at MIT CSAIL, in collaboration with Cornell University and Microsoft, developed STEGO, an algorithm that recognizes images down to the individual pixel.
Typically, creating CV training data involves humans drawing boxes around specific objects in an image—for example, a box around a dog sitting on grass—and labeling those boxes with what’s inside (“dog”), so that an AI trained on the data learns to distinguish the dog from the grass. STEGO (Self-supervised Transformer with Energy-based Graph Optimization) instead uses a technique called semantic segmentation, which applies a class label to every pixel in an image to give the AI a more accurate understanding of the world around it.
A labeled box contains the object plus whatever else happens to fall within its bounds, whereas semantic segmentation labels only the pixels that make up the object: all you get is dog pixels, not dog pixels plus some grass. It’s the machine learning equivalent of Photoshop’s Smart Lasso versus its Rectangular Marquee tool.
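To make the difference concrete, here’s a minimal sketch in Python with NumPy (our illustration, not code or a data format from the STEGO paper): a bounding-box label is just a class name and four coordinates that inevitably sweep in background pixels, while a segmentation mask assigns a class to every pixel, so only the dog’s own pixels carry the “dog” label.

```python
import numpy as np

# A hypothetical 256x256 photo of a dog sitting on grass (all numbers are made up).
H, W = 256, 256

# Bounding-box annotation: one class name plus four corner coordinates.
# Everything inside the box, grass included, gets swept up in the label.
box_label = {"class": "dog", "x0": 60, "y0": 80, "x1": 190, "y1": 230}

# Semantic-segmentation annotation: a class ID for every single pixel.
# 0 = grass/background, 1 = dog; only the dog's own pixels carry the dog label.
seg_mask = np.zeros((H, W), dtype=np.uint8)
seg_mask[100:220, 80:170] = 1   # the (made-up) region of pixels the dog actually covers

print("box label:", box_label)
print("dog pixels in the mask:", int((seg_mask == 1).sum()), "of", seg_mask.size)
```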
The problem with this technique is one of scale. Conventional multi-shot supervised systems typically require thousands, if not hundreds of thousands, of labeled images to train an algorithm. Multiply that by the 65,536 individual pixels that make up even a single 256×256 image, all of which would now need to be labeled individually, and the amount of work required quickly spirals toward the impossible.
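The back-of-the-envelope arithmetic makes the point; the dataset size below is an assumed, illustrative figure, not one from the paper.

```python
# Rough labeling-cost arithmetic; the dataset size is an illustrative assumption.
pixels_per_image = 256 * 256            # 65,536 pixels in a single 256x256 image
images_in_dataset = 100_000             # "hundreds of thousands" of training images
total_pixel_labels = pixels_per_image * images_in_dataset
print(f"{total_pixel_labels:,} individual pixel labels to assign by hand")
# -> 6,553,600,000 pixel labels
```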
Instead, “STEGO looks for similar objects that appear throughout the dataset,” the CSAIL team wrote in a Thursday release. “It then links these similar objects together to build a consistent view of the world across all the images it learns from.”
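STEGO’s actual training objective is more involved (it distills feature correspondences from a frozen self-supervised vision transformer into a compact segmentation head), but a heavily simplified, hypothetical sketch of the underlying idea, clustering per-patch features from many images so that similar-looking things share a label across the whole dataset, could look like the following. None of the names or numbers below come from the authors’ code.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_dataset_features(patch_features, n_classes=20):
    """Group per-patch features from many images into dataset-wide pseudo-classes.

    patch_features: a list with one array per image, each shaped (n_patches, feat_dim),
    e.g. frozen features from a self-supervised backbone such as DINO (an assumption here).
    Returns one pseudo-label per patch per image, consistent across the dataset.
    """
    all_feats = np.concatenate(patch_features, axis=0)
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(all_feats)

    labels, start = [], 0
    for feats in patch_features:
        labels.append(km.labels_[start:start + len(feats)])
        start += len(feats)
    return labels  # patches that look alike across images end up in the same cluster

# Toy usage with random numbers standing in for real backbone features:
fake_features = [np.random.rand(196, 64) for _ in range(10)]   # 10 images, 14x14 patches each
pseudo_labels = cluster_dataset_features(fake_features)
print(pseudo_labels[0].shape)  # (196,): one pseudo-class per patch of the first image
```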
“If you’re looking at tumor scans, planetary surfaces, or high-resolution biological images, it’s hard to know what object to look for without expertise. In emerging fields, sometimes even human experts don’t know what the right object should be,” said Mark Hamilton, a CSAIL doctoral student at MIT, Microsoft software engineer and the paper’s lead author. “In these types of situations, you want to design a method that operates at the boundaries of science, and you can’t rely on humans to figure it out before machines do.”
Trained on a variety of image domains—from home interiors to high-altitude aerial photography—STEGO doubled the performance of previous semantic segmentation schemes, aligning closely with human evaluations of the same images. What’s more, “when applied to the driverless car dataset, STEGO successfully segmented roads, people and street signs with higher resolution and granularity than previous systems. On images from space, the system broke down every square foot of the Earth’s surface into roads, vegetation and buildings,” the MIT CSAIL team wrote.
“In making a general tool for understanding potentially complex datasets, we hope this algorithm can automate the scientific process of discovering objects from images,” Hamilton said. “There are many different fields where human labeling is very expensive, or where the specific structure is simply not known to humans, for example in some biological and astrophysical fields. We hope that future work can apply this to a very wide range of datasets. Since you don’t need any human labels, we can now start applying ML tools more broadly.”
Although STEGO outperforms previous systems, it does have limitations. For example, it can identify both pasta and grits as “food,” but can’t distinguish between them very well. It also gets confused by nonsensical images, like a banana sitting on a telephone receiver. Is it food? Is it a phone? STEGO can’t tell. The team hopes to build a little more flexibility into future iterations, allowing the system to recognize objects under multiple classes.