Most business applications of computer vision involve identifying what’s actually inside an image, and because visual information is so dense, there’s a lot of work to be done. Rather than replacing human beings, though, computer vision most commonly augments them. To put it another way: computer vision projects are generally about creating a workflow where people handle the complicated, creative, or uniquely human tasks while machines handle the rest. But what’s the end goal? What are these projects actually for? How can we think about them at a high level? While you can break them down by industry or by the annotation tools themselves, it’s more interesting to divide them thematically.
You can put the vast majority of image processing tasks into the following four buckets: creativity, social connection, efficiency, and surveillance.
Let’s start with creativity.
Often we think that the goals of commercial computer vision have to do with helping people do more things or do them faster/more effectively. But in some cases, organizations are actually more interested in inspiring customers or sparking their imagination.
For example, at a practical level, Shutterstock helps people find the right images in its enormous database, but it also suggests particular kinds of edits based on what’s in the photo a customer selects. This comes from the fact that people want to do different things with, say, portraits than they do with landscapes. So while we could call this a use case around image search, it’s probably better to focus on what those searches are for, since knowing the content of an image lets Shutterstock be more specific about what it helps users do.
An adjacent use case leverages user behavior to uncover intent and inspire creativity as well. For example, people who post a lot of kitchen pictures to Pinterest may be planning a remodel. In fact, Lowe’s has been exploring how to take a person’s whole page of pins, match it against the Lowe’s catalog to find similar objects, and then assemble all these pieces together visually. The business intent is to sell a whole new kitchen, but its success is connected to how well Lowe’s helps users dream and turn those dreams into reality.
Caption: Computer vision lets you recognize items a user may post on social media and find similar things in your catalogue; here Lowe’s helps assemble a kitchen with Microsoft HoloLens
Staying with Pinterest, their “Shop the Outfit” capability could be placed in both the creative and social buckets—people want to look good, so they click on a piece of clothing and find various versions of it.
The ability for computers to pick out similarity on a range of axes can power all kinds of recommendations. Which is to say: models that can identify what makes a blue hightop a blue hightop can use that information to recommend products that are visually similar.
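Under the hood, visual similarity is usually computed over embedding vectors produced by a model. As a minimal sketch (the embeddings and catalog here are hypothetical stand-ins, not any particular retailer’s system), recommending “visually similar” products can be as simple as ranking catalog items by cosine similarity to a query item:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_vec, catalog_vecs, k=3):
    """Return indices of the k catalog items most visually similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in catalog_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In practice the vectors would come from a trained network and the ranking from an approximate nearest-neighbor index, but the core idea is this ranking step.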
The flip side of similarity is unusualness—for example, Edvard Munch’s “The Scream” is completely unlike things that came before it but portended what might come next. In the commercial realm, this is the ability to pick out standout images or novel style. Something unusual often exists at odd places in a product map or visual space and that may tell you something new and important about a customer or even start to identify trends as those unusual items become more popular or less unusual over time.
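One simple way to operationalize “unusualness” in that product map or visual space is outlier detection over the same embeddings: items that sit far from the bulk of the data stand out. The sketch below (an assumed approach, using distance from the centroid, not a specific vendor’s method) flags items whose distance exceeds the mean by a few standard deviations:

```python
import numpy as np

def find_unusual(embeddings, n_std=2.0):
    """Flag items that sit far from the rest of the embedding space.

    Returns the indices of points whose distance from the centroid exceeds
    the mean distance by more than n_std standard deviations.
    """
    emb = np.asarray(embeddings, dtype=float)
    centroid = emb.mean(axis=0)
    dists = np.linalg.norm(emb - centroid, axis=1)
    threshold = dists.mean() + n_std * dists.std()
    return [i for i, d in enumerate(dists) if d > threshold]
```

Tracking how flagged items drift toward the cluster over time is one way to watch a novelty become a trend.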
Humans are social animals, so a lot of projects that seem to be about search/retrieval are really about how we relate to others.
This is most obvious with Facebook’s facial recognition, which finds and suggests photos that have friends and family in them. Meanwhile, Apple uses computer vision models to help search your photos for, say, a dog. Even though you haven’t annotated any of the images like you might on Facebook, the model will help find your favorite pooches in your albums.
Now, since sharing is at the heart of why people are looking for a photo, you want to make sharing a prominent and easy part of the product design. You can also learn, over time, the qualities of the photos that users individually and collectively tend to share. That is, you can build in feedback loops that make user actions part of the training data so that your systems get smarter over time.
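As a minimal sketch of that feedback loop (the class and its interface are hypothetical, just to make the idea concrete): each share or skip becomes a labeled example that can later be fed back into training.

```python
class ShareFeedbackLog:
    """Hypothetical log that turns user share/skip actions into labeled rows."""

    def __init__(self):
        self.events = []

    def record(self, image_id, shared):
        """Log a single user action: shared=True becomes a positive label."""
        self.events.append((image_id, 1 if shared else 0))

    def training_rows(self):
        """Collapse events into one labeled row per image.

        The latest action wins, since a user may share an image they
        initially skipped.
        """
        latest = {}
        for image_id, label in self.events:
            latest[image_id] = label
        return sorted(latest.items())
```

A production pipeline would add timestamps, sampling, and guards against feedback bias, but the core pattern is the same: user actions become labels.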
In this vein, Trulia logs how long people look at various photos of homes for sale in their app and they use this to intuit what a user likes as well as to understand long-range themes across users, down to the level of which enamels and finishes correlate the best with attention and house sales. This is a particularly good spot to be in: individual user behaviors enrich data at a higher, more general level, over a longer stretch of time.
Monitoring what’s inside photographs also lets companies identify products in social media. While relatively simple text analytics can tell you how people are mentioning your products, vision models can help you know how often products are appearing on Instagram or any other platform. Understanding a person’s style in clothing and cars can also help target them for related products. You can also match people with places—if you’re visiting a new city, where is it that people who look like you go? As you can imagine, a lot of computer vision applications require careful ethical reflection about their implications for privacy and social segregation.
Caption: If you’re a brand, the emotion in the bottom right corner is far preferred to the bottom left
Lastly, there are also a number of companies like Affectiva that work to identify emotions from photos and video. How much smiling or frowning is happening in a chain of retail stores? Or think about something like video chat with a customer service agent. While an agent attends to the particular needs of a customer, computer vision applications can look across interactions to see how much frustration and relief there is on customers’ faces. Understanding how people feel in stores or on help calls can help brands connect with their customers and, ultimately, provide better products and experiences.
The easiest business applications for computer vision come from helping people be more efficient; instead of fixing problems, just help people know where to look. For example, in content moderation, you can train a model to find offensive images without necessarily having to traumatize a bunch of human content reviewers with the more disturbing things that get posted on websites.
Efficiency is also behind a lot of medical/healthcare applications of computer vision. The goal isn’t to replace diagnosticians but to direct their efforts. Instead of looking at hundreds of, say, radiology scans where nothing is out of the ordinary, show the experts only the images the model finds problematic or ones where the algorithm isn’t very confident.
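That triage logic is straightforward to sketch. In this illustrative example (the thresholds and the idea of a “review band” are assumptions for the sketch, not any deployed system’s settings), scans the model is unsure about go to the experts, high-probability scans are flagged, and the rest are auto-cleared:

```python
def triage(predictions, flag_threshold=0.7, review_band=(0.3, 0.7)):
    """Split model outputs into auto-clear, expert-review, and flagged queues.

    predictions: list of (scan_id, probability_of_abnormality) pairs.
    Scans inside review_band (where the model isn't confident) go to
    human experts; scans above flag_threshold are flagged as problematic;
    everything else is auto-cleared.
    """
    auto_clear, review, flagged = [], [], []
    for scan_id, p in predictions:
        if review_band[0] <= p <= review_band[1]:
            review.append(scan_id)
        elif p > flag_threshold:
            flagged.append(scan_id)
        else:
            auto_clear.append(scan_id)
    return auto_clear, review, flagged
```

The point of the design is that experts spend their time on the ambiguous middle, not on the hundreds of scans the model confidently clears.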
If there are water rationing provisions in place but you see a bunch of emerald green lawns around mansions in Beverly Hills, you can be confident there are violations. It’s a lot easier for a machine to churn through acres upon acres of satellite photos than for a human to do that, or to go door-to-door. And models like the one created by OmniEarth can distinguish between a pond and a pool in satellite photos.
There are a variety of uses for aerial and satellite photos; perhaps the most wide-reaching concern deforestation and urbanization. The MIT Media Lab has worked to find safer and less safe parts of cities as well as to understand what makes cities thrive. Likewise, using aerial photos to detect logging roads that are a precursor to logging in the rainforest makes enforcement far more efficient.
Caption: Maybe that’s too green for a drought?
In some cases, like dealing with drought, surveillance serves the public good. Most people also consider that to be true of facial recognition at security checkpoints.
But at what point does this become overly intrusive? For example, what about when we’re just walking down the street? Or how about facial scanning to get toilet paper at a public restroom?
There’s a Black Mirror/Minority Report scenario here where facial recognition is used to categorize people who haven’t done anything wrong. And while there’s work on keeping people private when cameras are looking (face paint and clothing that could eventually lead to some really amazing fashion statements in addition to privacy), this is a question we’ll hear about more and more in the coming years.
The availability of GPUs, the ability to annotate and score images at scale, and steady improvements in models have all combined to make computer vision more viable than it has ever been. And while the list above isn’t completely exhaustive, you can sort most computer vision projects into those four buckets.
If you missed our previous blog post about scoping and designing computer vision projects, please do check that out. We’ll be concluding this series next Thursday with the third in our computer vision trilogy. For now? Thanks for reading.