At any given moment, the web is a treasure trove of shopping inspiration — brimming with fashionable trends, seasonal tablescapes, and artful shelfies. But how many times have you seen something you want and not been able to figure out how to check it out, let alone buy it?
Product recognition is among the most important ways to make it easier for people to shop online today. If AI can predict and understand exactly what’s in any given virtual frame, then people could — one day — choose to make any image or video shoppable. People would more easily find exactly what they’re looking for, and sellers could make their products more discoverable.
Facebook AI is building the world’s largest shoppable social media platform, where billions of items can be bought and sold in one place. As a key milestone toward this goal, we’re sharing details on how we’re expanding GrokNet, our breakthrough product recognition system, to new applications on Facebook and Instagram. GrokNet identifies what products are in an image and predicts their categories, like “sofa,” and attributes, like color and style. Unlike previous systems, which required separate models for each vertical, GrokNet is a first-of-its-kind, all-in-one model that scales to billions of photos across vastly different verticals, including fashion, auto, and home decor. GrokNet started as a fundamental AI research project with its first few applications on Marketplace, where AI analyzes search queries like “midcentury modern sofa” and predicts matches against the search index so that the more than a billion people who visit Marketplace each month get the most relevant results when searching for products.
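As a rough illustration of the all-in-one idea, the sketch below shows a single shared image trunk feeding separate heads for category and attribute predictions, so one model can serve every vertical. The layer sizes, label counts, and names here are illustrative assumptions, not the production GrokNet architecture.

```python
import torch
import torch.nn as nn

class UnifiedProductRecognizer(nn.Module):
    """Toy sketch of an all-in-one product recognizer (illustrative only)."""

    def __init__(self, num_categories=1000, num_attributes=500, embed_dim=256):
        super().__init__()
        # Shared trunk: one model serves every vertical (fashion, auto, home decor)
        # instead of a separate model per vertical.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Task-specific heads share the same image embedding.
        self.category_head = nn.Linear(embed_dim, num_categories)   # e.g., "sofa"
        self.attribute_head = nn.Linear(embed_dim, num_attributes)  # e.g., "blue", "midcentury"

    def forward(self, images):
        emb = self.trunk(images)
        return {
            "embedding": emb,                              # reusable for product matching
            "category_logits": self.category_head(emb),    # single-label categories
            "attribute_logits": self.attribute_head(emb),  # multi-label attributes
        }

# One forward pass yields categories, attributes, and a search embedding.
model = UnifiedProductRecognizer()
outputs = model(torch.randn(2, 3, 224, 224))
print(outputs["category_logits"].shape, outputs["attribute_logits"].shape)
```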
Since 2020, we’ve expanded this technology to new applications that make posts more shoppable across Facebook. Right now, when a seller posts an image on their Facebook page, our AI-powered shopping system helps identify untagged items and suggests tags based on their product catalog — so that instead of taking several minutes to manually tag their items, a seller can create and post their photo in just seconds. And when a shopper is viewing an untagged post from a seller, the system instantly suggests similar products from that seller’s product catalog below the post.
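As a hedged sketch of how tag suggestions could work, the snippet below embeds a seller’s photo and returns the closest items from that seller’s catalog embeddings. The similarity measure, threshold, and function names are assumptions for illustration, not the production matching pipeline.

```python
import torch
import torch.nn.functional as F

def suggest_catalog_tags(photo_embedding, catalog_embeddings, catalog_names,
                         top_k=3, min_sim=0.5):
    """Return the top-k catalog items whose embeddings best match the photo."""
    # Cosine similarity between the photo and every item in the seller's catalog.
    sims = F.cosine_similarity(photo_embedding.unsqueeze(0), catalog_embeddings)
    scores, idx = sims.topk(min(top_k, len(catalog_names)))
    # Only suggest items above a confidence threshold.
    return [(catalog_names[int(i)], float(s)) for s, i in zip(scores, idx) if s >= min_sim]

# Toy usage with random embeddings standing in for model outputs.
catalog = ["midcentury sofa", "walnut coffee table", "linen throw pillow"]
catalog_emb = F.normalize(torch.randn(3, 256), dim=1)
photo_emb = F.normalize(torch.randn(256), dim=0)
print(suggest_catalog_tags(photo_emb, catalog_emb, catalog))
```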
With billions of images uploaded to Shops on Facebook and Instagram by sellers, predicting just the right product at any given moment is an extremely hard, open AI challenge. Today we’re sharing details on our newest state-of-the-art advancements, which are making our AI systems remarkably smarter at recognizing products — from multimodal understanding to learning deeper, more nuanced attributes.
To help shoppers find exactly what they’re looking for, it’s important that product recognition systems excel at recognizing specific product characteristics — also known as attributes. But there are thousands of possibilities, and each one can apply to a range of categories. For example, you can have blue skirts, blue pants, blue cars, or even a blue sky. The most accurate AI systems today rely on labeled examples, known as supervised learning, to learn these attributes, but with near-infinite possibilities, this is not scalable. Even just 1,000 objects and 1,000 attributes would mean manually labeling more than a million combinations. Plus, some combinations occur far more frequently in data than others. For example, there might be many blue cars, but few blue cheetah-print clothing items.
How can we make our systems work even on rare occurrences?
We built a new model that learns from some attribute-object pairs and adapts to entirely new, uncommon combinations. So, if you train on blue skirts, blue cars, and blue skies, you’d still be able to recognize blue pants even if your model never saw them during training. We built this new compositional framework on top of our previous foundational research that uses deep learning to achieve state-of-the-art image recognition. This approach uses “weakly supervised learning,” where the model learns from associated hashtags on 78M public Instagram images rather than relying entirely on manually labeled examples. Notably, we added a new compositional module that makes it possible to predict combinations of objects and attributes that aren’t in the labeled example set. Each object can be modified with many attributes, increasing the fine-grained space of classes by a few orders of magnitude. This means we can scale to millions of images and hundreds of thousands of fine-grained class labels in ways that were not possible before. And we can quickly spin up predictions for new verticals to cover the range of products in our Facebook catalog, or even recognize those blue cheetah-print clothing items should we ever come across them.
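The snippet below is a minimal sketch of the compositional idea: learn separate attribute and object embeddings, compose them into a classifier for each pair, and score pairs the model never saw together, such as “blue pants.” The composition module, dimensions, and names are illustrative assumptions rather than the actual model.

```python
import torch
import torch.nn as nn

class CompositionalScorer(nn.Module):
    """Score (image, attribute-object pair) matches, including unseen pairs."""

    def __init__(self, num_attributes, num_objects, dim=128):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attributes, dim)
        self.obj_emb = nn.Embedding(num_objects, dim)
        # Composition module: maps an (attribute, object) embedding pair to a
        # single classifier vector for that combination.
        self.compose = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim),
        )

    def forward(self, image_features, attr_ids, obj_ids):
        # image_features: (batch, dim) from a vision backbone.
        pair = self.compose(
            torch.cat([self.attr_emb(attr_ids), self.obj_emb(obj_ids)], dim=-1)
        )
        # One score per (image, attribute-object pair): higher means a better match.
        return (image_features.unsqueeze(1) * pair.unsqueeze(0)).sum(-1)

# Even if "blue pants" never appeared in training, we can still build its
# classifier at inference time by composing the "blue" and "pants" embeddings.
scorer = CompositionalScorer(num_attributes=1000, num_objects=1000)
scores = scorer(torch.randn(4, 128), torch.tensor([3, 7]), torch.tensor([12, 40]))
print(scores.shape)  # 4 images scored against 2 candidate attribute-object pairs
```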
While collecting the training data for these models, we sampled objects and attributes from geographies all around the world. This helps us reduce the potential for bias in recognizing concepts like “wedding dress,” which is often white in Western cultures but is likely to be red in South Asian cultures, for instance. As part of our ongoing efforts to improve the algorithmic fairness of the models we build, we trained and evaluated our AI models across subgroups, including 15 countries and four age buckets. By continuously collecting annotations for these subgroups, we can evaluate and flag cases where a model recognizes some attributes better for one group than another. For instance, if we didn’t have enough training data of men wearing V-neck shirts, the model might identify the neckline (V-neck, square, crew, etc.) on shirts more reliably for women than for men. Although the AI field is just beginning to understand the challenges of fairness in AI, we’re continuously working to understand and improve the way our products work for everyone across the world.
This model is now live on Marketplace, and as a next step, we’re exploring and deploying these models to strengthen AI-assisted tagging and product matches across our apps. We’re also working on using this technique to support more flexible searches, like: “Find a scarf with the same pattern and material as this skirt.”
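One way such a query could be served, sketched below under assumptions of our own: read the attribute predictions for the reference item, keep only the requested attribute groups (here, pattern and material), and rank candidates by how closely they match on just those dimensions. The attribute groups and scoring are illustrative, not the deployed system.

```python
import torch

def find_matching_items(ref_attr_probs, candidate_attr_probs, attr_groups,
                        wanted=("pattern", "material")):
    """Rank candidates by agreement with the reference on selected attribute groups."""
    # Indices of the attributes we want to match on (e.g., pattern + material).
    idx = torch.tensor([i for group in wanted for i in attr_groups[group]])
    ref = ref_attr_probs[idx]
    cand = candidate_attr_probs[:, idx]
    # Score each candidate by similarity of its attribute predictions to the
    # reference, restricted to the selected dimensions.
    scores = cand @ ref
    return torch.argsort(scores, descending=True)

# Toy usage: 10 attributes grouped into pattern / material / color.
attr_groups = {"pattern": [0, 1, 2], "material": [3, 4, 5], "color": [6, 7, 8, 9]}
ref = torch.rand(10)            # attribute probabilities for the skirt
candidates = torch.rand(5, 10)  # attribute probabilities for 5 candidate scarves
print(find_matching_items(ref, candidates, attr_groups))
```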
In the Facebook family of apps, images almost always come with associated text, such as metadata or product descriptions. So building vision-only models leaves critical signals on the table. We’re already pushing state-of-the-art multimodal advancements to improve content understanding across our platform. And we’ve now seen that signals from associated text significantly improve the accuracy of product categorization.
We first tested a multimodal understanding framework on a clothing attributes data set that pairs catalog images with text. A key challenge with multimodal understanding, however, is that the text data itself can sometimes be misleading. For example, a product description might read, “Here is the perfect sequined top to wear with your favorite pair of black skinny jeans.” An AI model might incorrectly predict that the top is black when in fact it is silver. We also needed to prepare for cases where images of fashion items come with no description or related text at all.
To address this challenge, we combined visual signals from the image with the related text description to guide the final model prediction. Our recipe for the multimodal model draws on several Facebook AI frameworks and tools: an early-fusion architecture based on Facebook AI’s Multimodal Bitransformer, generalized as the MMF Transformer in Facebook AI’s Multimodal Framework (MMF), and a Transformer text encoder pretrained on public Facebook posts. Through this test, we found that the multimodal models perform better even when some of the text is occasionally missing.
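The sketch below shows the early-fusion pattern in miniature: image features are projected into the same token space as the text embeddings, concatenated, and passed through one Transformer encoder so both modalities attend to each other. The dimensions, vocabulary, and pooling are assumptions for illustration, not the MMF Transformer implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Toy early-fusion image+text classifier (illustrative only)."""

    def __init__(self, vocab_size=30000, dim=256, num_classes=50, num_image_tokens=4):
        super().__init__()
        self.num_image_tokens = num_image_tokens
        self.text_emb = nn.Embedding(vocab_size, dim)
        # Project a pooled image feature into a handful of "visual tokens".
        self.image_proj = nn.Linear(2048, num_image_tokens * dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image_feature, text_ids):
        # image_feature: (batch, 2048) pooled CNN feature; text_ids: (batch, seq_len).
        img_tokens = self.image_proj(image_feature).view(
            image_feature.size(0), self.num_image_tokens, -1
        )
        tokens = torch.cat([img_tokens, self.text_emb(text_ids)], dim=1)
        fused = self.encoder(tokens)  # both modalities attend to each other
        # Mean-pool the fused tokens and classify. If a product has little or no
        # text, the model can still predict from the visual tokens alone.
        return self.classifier(fused.mean(dim=1))

# Usage: predict a product category from an image feature plus description tokens.
model = EarlyFusionClassifier()
logits = model(torch.randn(2, 2048), torch.randint(0, 30000, (2, 16)))
print(logits.shape)  # torch.Size([2, 50])
```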
The advancements we’ve shared not only make shopping easier today, but they also represent the building blocks of future experiences. For example, on Instagram, shopping begins with visual discovery. Every day, people scroll through the app and see thumb-stopping inspiration — whether that’s a floral dress for summer or the perfect wedding dress. With AI-powered visual search, people can find similar dresses just by tapping on an image they see within Instagram. While it’s still early, we think visual search will enhance mobile shopping by making even more images on Instagram shoppable.
Still, AI-powered shopping today is in its infancy: to machines, photos of products are still just collections of pixels. While some attributes can be straightforward, like “short sleeves,” others are more subjective, like “formal wear” or “warm weather.” Training AI models that can flexibly make use of the right information in each situation requires solving scientific and engineering challenges. With each year, we’re building smarter AI systems that are fine-tuned to understand shopping-related images and text with state-of-the-art accuracy. All of these advancements are collectively pushing us toward smarter product understanding systems that connect consumers with exactly what they want as soon as it catches their eye.
In the future, this technology could fuel more immersive experiences. With millions of pieces of multimedia content posted on public Facebook pages every day, we hope to eventually build AI models that learn varieties and styles to match people with their taste in music, travel, and other interests. Imagine watching a livestream video of your favorite artist performing at a concert. You could instantly browse outfits and accessories inspired by the artist, shop hashtags associated with the song, and even automatically surface product reviews from your friends and family who are watching the livestream with you. Today’s latest AI advancements bring us one step closer to the future of AI-powered shopping.
We’d like to acknowledge the contributions of Sami Alsheikh, Yina Tang, Pratik Dubal, Wenwen Jiang, Yanping Xie, Animesh Sinha, Jun Chen, Filip Radenovic, Dhruv Mahajan, Sridhar Rao, Naveen Adibhatla, Dillon Stuart, Faizan Bhat, Tao Xiang, Shawn Tzeng, Grigorios Antonellis, Omkar Parkhi, and Licheng Yu, as well as researchers, engineers, and other teammates who worked on Connected Commerce, Product Clustering Platform, IG Shopping, and Catalog Quality Inference.