Artificial Intelligence

How Facebook is using AI to improve photo descriptions for people who are blind or visually impaired

January 19, 2021

Update on November 2, 2021: Today we announced a significant reduction in our use of facial recognition technology, including the deletion of face recognition templates that we previously used to identify faces in photos. This means automatic alt text will still be able to recognize when a person is in a photo, but it will no longer include names of people.

When Facebook users scroll through their News Feed, they find all kinds of content — articles, friends’ comments, event invitations, and of course, photos. Most people are able to instantly see what’s in these images, whether it’s their new grandchild, a boat on a river, or a grainy picture of a band onstage. But many users who are blind or visually impaired (BVI) can also experience that imagery, provided it’s tagged properly with alternative text (or “alt text”). A screen reader can describe the contents of these images using a synthetic voice and enable people who are BVI to understand images in their Facebook feed.

Where we started

The concept of alt text dates back to the early days of the internet, providing slow dial-up connections with a text alternative to downloading bandwidth-intensive images. Of course, alt text also helped people who are blind or visually impaired navigate the internet, since it can be used by screen reader software to generate spoken image descriptions. Unfortunately, faster internet speeds made alt text less of a priority for many users. And since these descriptions needed to be added manually by whoever uploaded an image, many photos began to feature no alt text at all — with no recourse for the people who had relied on it.

Nearly five years ago, we leveraged Facebook’s computer vision expertise to help solve this problem with automatic alt text (AAT). The first version of AAT was developed using human-labeled data: we trained a deep convolutional neural network on millions of examples in a supervised fashion. Our completed AAT model could recognize 100 common concepts, like “tree,” “mountain,” and “outdoors.” And since people who use Facebook often share photos of friends and family, our AAT descriptions used facial recognition models that identified people (as long as those people gave explicit opt-in consent). For people who are BVI, this was a giant step forward.

Seeing more of the world

But we knew there was more that AAT could do, and the next logical step was to expand the number of recognizable objects and refine how we described them.

To achieve this, we moved away from fully supervised learning with human-labeled data. While this method delivers precision, the time and effort involved in labeling data are extremely high — and that’s why our original AAT model reliably recognized only 100 objects. Recognizing that this approach would not scale, we needed a new path forward.

For our latest iteration of AAT, we leveraged a model trained on weakly supervised data in the form of billions of public Instagram images and their hashtags. To make our models work better for everyone, we fine-tuned them on data sampled from images across all geographies, using translations of hashtags in many languages. We also evaluated our concepts along gender, skin tone, and age axes. The resulting models are both more accurate and more culturally and demographically inclusive. For instance, they can identify weddings around the world based (in part) on traditional apparel instead of labeling only photos featuring white wedding dresses.

This approach also gave us the ability to more readily repurpose machine learning models as the starting point for training on new tasks, a process known as transfer learning. This enabled us to create models that identified concepts such as national monuments, food types (like fried rice and french fries), and selfies. This entire process wouldn’t have been possible in the past.

To get richer information like position and counts, we also trained a two-stage object detector, Faster R-CNN, using Detectron2, an open source platform for object detection and segmentation developed by Facebook AI Research. We trained the models to predict locations and semantic labels of the objects within an image. Multilabel, multi-dataset training techniques helped make our model more reliable with the larger label space.

The improved AAT reliably recognizes over 1,200 concepts, more than 10 times as many as the original version we launched in 2016. As we consulted with screen reader users regarding AAT and how best to improve it, they made it clear that accuracy is paramount. To that end, we’ve included only those concepts for which we could ensure well-trained models that met a certain high threshold of precision. There is still a margin for error, which is why we start every description with “May be,” but we’ve set the bar very high and have intentionally omitted concepts that we couldn’t reliably identify.

We want to give our users who are blind or visually impaired as much information as possible about a photo’s contents — but only correct information.
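
The precision gate described above can be illustrated with a minimal Python sketch. It is not the production AAT code: the function names, both threshold values, and the fallback string are hypothetical, and the article does not state the actual precision bar. The sketch assumes each concept has a precision measured on held-out data and that the model emits a per-image score for each concept.

```python
from typing import Dict, Set

# Hypothetical thresholds, chosen only for illustration.
MIN_PRECISION = 0.9   # precision a concept must reach on held-out data
MIN_SCORE = 0.8       # per-image score required to mention a concept


def select_concepts(validation_precision: Dict[str, float],
                    min_precision: float = MIN_PRECISION) -> Set[str]:
    """Keep only concepts whose measured precision meets the bar."""
    return {c for c, p in validation_precision.items() if p >= min_precision}


def describe(scores: Dict[str, float], allowed: Set[str],
             min_score: float = MIN_SCORE) -> str:
    """Build a hedged description from allowed, high-scoring concepts."""
    kept = [c for c, s in scores.items() if c in allowed and s >= min_score]
    kept.sort(key=lambda c: -scores[c])  # most confident concept first
    if not kept:
        # Hypothetical fallback: say nothing rather than guess.
        return "No description available"
    # The article confirms only the "May be" prefix; the rest is illustrative.
    return "May be an image of " + ", ".join(kept)


allowed = select_concepts({"tree": 0.97, "dog": 0.95, "sadness": 0.60})
print(describe({"tree": 0.90, "dog": 0.85, "sadness": 0.99}, allowed))
```

Note that “sadness” is dropped even though its per-image score is high: in this sketch a concept that failed the precision bar is never mentioned, which mirrors the choice to omit concepts that cannot be identified reliably.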

Delivering details

Having increased the number of objects recognized while maintaining a high level of accuracy, we turned our attention to figuring out how to best describe what we found in a photo.

We asked users who depend on screen readers how much information they wanted to hear and when they wanted to hear it. They wanted more information when an image is from friends or family, and less when it’s not. We designed the new AAT to provide a succinct description for all photos by default but offer an easy way to get more detailed descriptions about photos of specific interest.

When users select that latter option, a panel is presented that provides a more comprehensive description of a photo’s contents, including a count of the elements in the photo, some of which may not have been mentioned in the default description. Detailed descriptions also include simple positional information (top/middle/bottom or left/center/right) and a comparison of the relative prominence of objects, described as “primary,” “secondary,” or “minor.” These words were specifically chosen to minimize ambiguity. Feedback on this feature during development showed that using a word like “big” to describe an object could be confusing because it’s unclear whether the reference is to its actual size or its size relative to other objects in an image. Even a Chihuahua looks large if it’s photographed up close!
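
One way to derive counts, positions, and prominence from detector output is sketched below in Python. This is an illustration under stated assumptions, not the shipped logic: the `Detection` type, the split of the frame into equal thirds, and the area cutoffs for “primary,” “secondary,” and “minor” are all hypothetical, since the article does not say how these labels are computed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """Hypothetical detector output with a box normalized to [0, 1]."""
    label: str
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin top-left


def _third(value: float, names: Tuple[str, str, str]) -> str:
    # Map a coordinate in [0, 1] to one of three equal bands.
    return names[min(int(value * 3), 2)]


def position(det: Detection) -> Tuple[str, str]:
    """Return (vertical, horizontal) position of the box center."""
    x0, y0, x1, y1 = det.box
    vertical = _third((y0 + y1) / 2, ("top", "middle", "bottom"))
    horizontal = _third((x0 + x1) / 2, ("left", "center", "right"))
    return vertical, horizontal


def prominence(det: Detection, primary: float = 0.25,
               secondary: float = 0.05) -> str:
    """Classify by the fraction of the frame the box covers (assumed cutoffs)."""
    x0, y0, x1, y1 = det.box
    area = (x1 - x0) * (y1 - y0)
    if area >= primary:
        return "primary"
    if area >= secondary:
        return "secondary"
    return "minor"


def count_labels(dets: List[Detection]) -> Dict[str, int]:
    """Count how many times each label was detected."""
    return dict(Counter(d.label for d in dets))


dog = Detection("dog", (0.1, 0.1, 0.9, 0.9))
ball = Detection("ball", (0.8, 0.8, 0.9, 0.9))
for det in (dog, ball):
    print(det.label, position(det), prominence(det))
```

Because prominence here depends only on how much of the frame a box covers, a Chihuahua photographed up close would be labeled “primary” without any claim about its real-world size.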

AAT uses simple phrasing for its default description rather than a long, flowy sentence. It’s not poetic, but it is highly functional. Our users can read and understand the description quickly — and it lends itself to translation so all the alt text descriptions are available in 45 different languages, ensuring that AAT is useful to people around the world.

Facebook is for everyone

Every day, our users share billions of photos. The ubiquity of inexpensive cameras in mobile phones, fast wireless connections, and social media products like Instagram and Facebook have made it easy to capture and share photos, and have helped make photography one of the most popular ways to communicate, including for individuals who are blind or visually impaired. While we wish everyone who uploaded a photo would include an alt text description, we recognize that this often doesn’t happen. We built AAT to bridge this gap, and the impact it’s had on those who need it is immeasurable. AI promises extraordinary advances, and we’re excited to have the opportunity to bring these advances to communities that are so often underserved.
