Artificial Intelligence

One-track minds: Using AI for music source separation

Facebook AI researchers’ Demucs project helps machines listen more like people do.
March 6, 2020

If you have ever stumbled across those online videos of Freddie Mercury singing what sounds like an a cappella rendition of “Another One Bites the Dust” or a version of Alanis Morissette’s “You Oughta Know” featuring only Flea’s distinctive slapped bass, then you’re already familiar with the concept of music source separation.

Simply put, music source separation is the use of technology to break a song into its constituent contributions, such as the vocals, bass, and drums. It is easy to achieve if you own the original multitrack studio recordings: You just adjust the mix to isolate a single track. But if you’re starting with a regular CD or MP3 audio file (where all the instruments and vocals have been mixed into a single stereo recording), even the most sophisticated software programs would struggle to precisely pluck out a single part. That is, until now. 

Facebook AI researchers have developed a system that can do just that – with an uncanny level of accuracy. Created by Alexandre Defossez, a research scientist in Facebook AI’s Paris lab, the system, called Demucs, uses AI to analyze a tune and quickly split it into its component tracks. Demucs works by detecting complex patterns in sound waves, building a high-level understanding of which waveform patterns belong to each instrument or voice, and then neatly separating them. It’s only a research project for now, but Defossez hopes it will have real-world benefits. He says technology like Demucs won’t just help musicians learn a tricky guitar riff or drum fill; it could also one day make it easier for AI assistants to hear voice commands in a noisy room and enhance technology such as hearing aids and noise-canceling headphones. Defossez recently published a research paper explaining his work, and he has released the code so that other AI researchers can experiment with and build on Demucs.

Defossez, who is part of a Ph.D. program run by Facebook AI and France’s National Institute for Research in Computer Science and Automation (INRIA), says his goal is to make AI systems adept at recognizing the components of an audio source, just as they can now accurately separate out the different objects in a single photograph. “We haven’t reached the same level with audio,” he says. 

We used Demucs, a Facebook AI research project for music source separation, to isolate the drums in the song “From the Sky,” which is part of the Facebook Music Initiative. It was written and recorded by the Brooklyn-based band AirLands.

A better way to break up a sound wave

Sound source separation has long fascinated scientists. In 1953, the British cognitive scientist Colin Cherry coined the phrase “cocktail party effect” to describe the human ability to zero in on a single conversation in a crowded, noisy room. Engineers first tried to isolate a song’s vocals or guitars by adjusting the left and right channels in a stereo recording or fiddling with the equalizer settings to boost or cut certain frequencies. They began experimenting with AI to separate sounds, including those in musical recordings, in the early 2000s. 

Today, the most commonly used AI-powered music source-separation techniques work by analyzing spectrograms, which are heat map-like visualizations of a song’s different audio frequencies. “They are made by humans for other humans, so they are technically easy to create and visually easy to understand,” says Defossez. Spectrograms may be nice to look at, but the AI models that use them have several important limitations. They struggle in particular to separate drum and bass tracks, and they also tend to omit important information about the original multitrack recording (such as when the frequencies of a saxophone and guitar cancel each other out). This is principally because they attempt to corral sounds into a predetermined matrix of frequency and time, rather than dealing with them as they actually are. 
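
To make that concrete, here is a minimal sketch of the spectrogram approach in Python, assuming SciPy. The function name and the predicted_mask array (which a trained model would normally supply for one instrument) are hypothetical and purely illustrative, not taken from any particular system.

```python
from scipy import signal

def separate_with_spectrogram(mixture, sr, predicted_mask):
    """Illustrative spectrogram-domain separation of one source.

    `mixture` is a mono waveform, `sr` its sample rate, and `predicted_mask`
    a (frequencies x frames) array of values in [0, 1] that a trained model
    would assign to the target instrument.
    """
    # Short-time Fourier transform: the song is forced into a fixed grid
    # of frequency bins and time frames.
    freqs, times, spec = signal.stft(mixture, fs=sr, nperseg=2048)

    # Keep only the time-frequency cells attributed to the target source.
    # The mixture's phase is reused as-is, one reason separated drums and
    # bass can come out sounding muddy.
    masked = spec * predicted_mask

    # Invert the masked spectrogram back into a waveform estimate.
    _, estimate = signal.istft(masked, fs=sr, nperseg=2048)
    return estimate
```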

Spectrogram-based AI systems are relatively effective in separating out the notes of instruments that ring or resonate at a single frequency at any given point in time, such as mezzo piano or legato violin melodies. These show up on a spectrogram as distinct, unbroken horizontal lines running from left to right. But isolating percussive sounds that produce residual noise, such as a drum kit, bass slapping, or even staccato piano, is a much tougher task. Like a flash of lightning, a drumbeat feels like a single, whole event in real time, but it actually contains various parts. For a drum, this includes an initial attack that covers a broad range of higher frequencies, followed by a pitchless decay in a smaller range of low frequencies. The average snare drum “is all over the place in terms of frequency,” says Defossez.

Spectrograms, which can only represent sound waves as a montage of time and frequency, cannot capture such nuances. Consequently, a spectrogram renders a drumbeat or a slapped bass note as several noncontiguous vertical lines rather than as one neat, seamless sound. That is why drum and bass tracks that have been separated via spectrogram often sound muddy and indistinct.

Facebook AI researchers used Demucs to isolate the bass guitar in the song as well.

A system smart enough to reconstruct what’s missing

AI-based waveform models avoid these problems because they do not attempt to push a song into a rigid structure of time and frequency. Defossez explains that waveform models work in a similar way to computer vision, the AI research field that aims to enable computers to learn to identify patterns from digital images so they can gain a high-level understanding of the visual world. 

Computer vision uses neural networks to detect basic patterns – analogous to spotting corners and edges in an image – before inferring higher-level or more complex ones. “The way a waveform model works is very similar,” says Defossez. “It detects patterns in the waveforms and then adds higher-scale structure.” He explains how a waveform model needs a few seconds to cotton on to the standout frequencies in a song – the vocals, bass, drums, or guitar – and generate separate waveforms for each of those elements. Then it begins to infer higher-scale structure to add nuance and finely sculpt each waveform.
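
As a rough illustration of that hierarchy, the sketch below (in PyTorch, with an invented class name and layer sizes that are not the actual Demucs configuration) stacks strided 1-D convolutions over a raw waveform: the first layer reacts to short, local patterns, and each later layer sees an ever longer stretch of audio.

```python
import torch
from torch import nn

class TinyWaveformEncoder(nn.Module):
    """Stacked strided 1-D convolutions over a raw waveform (illustrative)."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=4), nn.ReLU(),   # local waveform patterns
            nn.Conv1d(32, 64, kernel_size=8, stride=4), nn.ReLU(),  # mid-scale structure
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(), # longer-range structure
        )

    def forward(self, waveform):              # waveform: (batch, 1, samples)
        return self.layers(waveform)

# One second of (random) audio at 44.1 kHz, shaped (batch, channels, samples).
features = TinyWaveformEncoder()(torch.randn(1, 1, 44100))
print(features.shape)  # fewer time steps, more channels: higher-scale structure
```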

Defossez says his system can also be likened to the seismographic tools that detect and record earthquakes. During an earthquake, the base of the seismograph moves but the weight hanging above it does not, which allows a pen attached to that weight to draw a waveform that records the ground’s motion. An AI model can detect several different earthquakes happening at the same time and then infer detail about each one’s seismic magnitude and intensity. Likewise, Defossez’s system analyzes and separates a song as it actually is, rather than chopping it up according to the preconceived structure of a spectrogram.

This sample is the original mix of the song "From the Sky." Facebook AI researchers took this audio file and separated out vocals, drums, and bass using Demucs.

That is a snapshot of how Defossez’s Demucs waveform model works. (The name Demucs is a portmanteau derived from “deep extractor for music sources.”) Defossez explains that building the system required overcoming a series of complex technical challenges. He started from the underlying architecture of Wave-U-Net, an earlier AI-powered waveform model developed for music source separation, but he had plenty of work to do, since spectrogram models were still outperforming Wave-U-Net. He reworked the layers in Wave-U-Net that analyze patterns, adding gated linear units, which let the network control how much of each detected pattern flows on to the next layer. He also added long short-term memory, the architecture that allows a network to process an entire sequence of data, such as a passage of music or a section of video, rather than just a single data point, such as an image. Finally, he improved Wave-U-Net’s speed and memory usage.
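
As a hedged sketch of those two modifications, the fragment below (PyTorch, with invented class names and illustrative sizes rather than the published Demucs hyperparameters) shows a convolutional block whose output is gated by a gated linear unit, followed by a bidirectional LSTM that reads the resulting sequence.

```python
import torch
from torch import nn

class GatedEncoderBlock(nn.Module):
    """One strided convolution whose output is gated by a GLU (illustrative)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=8, stride=4)
        # The 1x1 convolution doubles the channels so the GLU can split them
        # into a content half and a gate half.
        self.expand = nn.Conv1d(out_ch, 2 * out_ch, kernel_size=1)
        self.glu = nn.GLU(dim=1)

    def forward(self, x):
        return self.glu(self.expand(torch.relu(self.conv(x))))

encoder = GatedEncoderBlock(in_ch=1, out_ch=64)
# The LSTM runs over the sequence of frames, so the model can use a whole
# passage of music rather than one frame at a time; bidirectional means it
# also looks ahead.
lstm = nn.LSTM(input_size=64, hidden_size=64, num_layers=2,
               bidirectional=True, batch_first=True)

x = torch.randn(1, 1, 44100)                # one second of mono audio
feats = encoder(x)                          # (batch, channels, frames)
context, _ = lstm(feats.permute(0, 2, 1))   # (batch, frames, 2 * hidden)
print(context.shape)
```

The real system stacks several such encoder blocks and mirrors them with a decoder that upsamples back to a waveform; this fragment is only meant to show where the gated linear units and the LSTM fit in.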

The modifications helped Demucs outperform Wave-U-Net in important ways, such as how it dealt with the problem of having one sound overpowering another. “If you imagine a plane taking off: The engine noise can drown out a person’s voice,” says Defossez. 

Previous waveform models dealt with the problem by simply removing parts of the original audio source file, but they then could not reconstruct important parts of the missing material. Defossez beefed up the capacity of Demucs’s decoder, the part of the network that rebuilds the separated waveforms and that has done enough deep learning to fill in the blanks. “Demucs can re-create the audio that it thinks is there but got lost in the mix,” he says. In practice, this means the model can resynthesize a soft piano note that might have been lost to a loud crash cymbal, because it understands what sounds should be present.

This ability to reconstruct as well as separate gave Demucs an edge over other waveform models. Defossez says Demucs already matches the very best waveform techniques and is “way beyond” state-of-the-art spectrogram models. In blind listening tests, 38 participants listened to random 8-second extracts from 50 test tracks separated by three models: Demucs and the leading waveform and spectrogram techniques. The listeners rated Demucs as the best performer in terms of quality and absence of artifacts, such as background noise or distortion. Demucs was judged to be on par with the two other models in terms of contamination, which is when another source bleeds into the separated track.

Demucs has already generated interest from AI enthusiasts such as Jaime Altozano, a popular Spanish YouTuber who focuses on technology and music. Technically savvy readers can download the code for Demucs from GitHub. If you prefer to listen, you can hear a demo version of Demucs on the Brooklyn-based band AirLands’ song, “From the Sky,” which is part of the Facebook Music Initiative.

This audio sample isolates the vocals on the same song by AirLands.

Defossez, who worked on the project under the supervision of Facebook AI’s Leon Bottou and Nicolas Usunier and INRIA researcher Francis Bach, hopes that releasing the code will allow users to build new layers of complexity and sophistication into Demucs. “I’d like to be able to separate two kinds of guitars, like rhythm and lead, or even different brands of guitar,” he says. “And although we’re pretty far from being able to separate an entire orchestra, I’d like to add instruments such as piano, ukulele, and flute.” 

Defossez explains that as Demucs develops, it will bring sonic authenticity to the digital audio workstations that people use to create music at home. Those workstations offer synthesized instruments that evoke a certain era or style, which usually requires extensive digital modeling of the original hardware. But imagine if music source-separation technology were able to perfectly capture the sound of a vintage hollow-body electric guitar played through a tube amplifier on a 1950s rock and roll song. Demucs brings music fans and musicians one step closer to that capability – and it will help AI researchers get closer to building machines that can focus on just one element of a complex audio source, just as people do.
