On the rise of Machine Learning through the lens of Music Source Separation

Douglas Adams famously quipped that we treat with skepticism any technology invented after our 35th birthday, while anything invented before seems unremarkable. It happens that I joined the University of Melbourne mere weeks after I turned 35. I am ashamed to say that up until that time (and for a little while after) I had been far too skeptical and dismissive towards machine learning research and technology. Since then, I’ve shed much of my ignorance about ML (which, naturally, was the source of my skepticism), though I remain an ML novice.

This post documents some fun I’ve had over the summer break playing with ML models for music source separation. I’ll try to explain why I (and many others) find this technology so exciting, and use it to explain some conclusions about ML that I’ve slowly come around to over the past few years, especially relevant to those of us who work on topics related to software engineering, formal specification, and verification.

What is Music Source Separation?

Imagine you have a recording of your favourite band or music artist. Since the mid-1950s, such recordings have been made by recording multiple parts and then mixing those recordings together to produce the finished product. Each individual part is called a track or stem, and this process overall is known as multitrack recording. Music source separation is akin to trying to reverse this process: taking a finished recording and separating it out into its individual parts.

(Pedants may quibble that tracks and stems are not always the same thing: a drum kit may be recorded by placing one microphone on each individual drum, producing a multi-track recording of the drum kit. When recording a rock band, the multi-track drum part is typically mixed together to form a single stem. However, these distinctions are not important for this post.)

It’s difficult to convey just how challenging this problem is. Imagine you wanted to write an algorithm to do it. Perhaps you would start by trying to decompose the audio into different waveforms (perhaps using some kind of FFT), and then try to group together the waveforms that correspond to the different parts, perhaps because they exhibit some kind of commonality. Another approach, commonly employed to try to separate out vocals before the rise of machine learning, was to subtract from a stereo recording the signal that was common to both the left and right channels. This would remove from the recording anything in the centre, and so would work (only) when the vocals were in the centre of the recording but all other instruments were panned either to the left or right. (The vocal part could then be obtained by subtracting the result from the original recording.) But this approach is very unreliable. Honestly, I can’t imagine how to solve this problem well in general by designing an algorithm to do it.
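To make that old trick concrete, here is a minimal sketch of the centre-cancellation idea in Python, using NumPy and the soundfile library (the file names are placeholders, and, as noted above, the result is crude at best):

```python
import numpy as np
import soundfile as sf  # assumed dependencies: numpy, soundfile

# Load a stereo mix; `audio` has shape (num_samples, 2).
# "mix.wav" is a placeholder file name.
audio, sample_rate = sf.read("mix.wav")
left, right = audio[:, 0], audio[:, 1]

# Anything panned dead-centre (often the lead vocal) appears identically in
# both channels, so subtracting one channel from the other cancels it out,
# leaving a rough, mono "instrumental". Everything panned left or right
# survives, but only when the mix really is arranged that way.
instrumental_estimate = np.clip(left - right, -1.0, 1.0)

sf.write("instrumental_estimate.wav", instrumental_estimate, sample_rate)
```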

As someone who has played with music and recording for decades, reliable music source separation feels like being able to take baking mix and get back the individual dry flour, eggs, and milk that constitute it.

I was blown away recently to learn that, as of a couple of years ago, we now have passable machine learning methods for performing this task. Check out the SoundCloud playlist below from Alexandre Défossez, maintainer of Demucs, one of the leading open-source ML music source separation tools, to see what I mean.

Here he compares Demucs against other music source separation tools (though other tools have since emerged that this playlist doesn’t include, some of which I’ll mention below).

Why is Music Source Separation so Exciting?

We have seen impressive machine learning applications for years. Since Krizhevsky, Sutskever, and Hinton’s success on ImageNet image classification in 2012, it should have been clear to all of us that there are some kinds of problems that we know no way to solve by computer except by training machine learning models (as opposed to, say, designing and programming a traditional algorithm). (Of course, it would take me years to reach this conclusion.)

However, music source separation seems different to many other machine learning problems. Humans can recognise images, after all. For many tasks that machine learning has gotten good at performing, there exists a human who can carry out that same task (perhaps more slowly and at much greater cost). However, just like separating baking mix into its individual ingredients, no human can perform music source separation—except by playing and re-recording the individual music parts from scratch, which is akin to throwing away the baking mix and buying fresh flour, eggs and milk.

It’s not just that ML methods can do music source separation better than humans. Humans cannot perform this task at all.

Music source separation is surely not the first problem that humans cannot solve at all but that machines have learned to approximate good-enough solutions to. I’ll leave it to the machine learning experts to comment further on the appropriate precedents. However, it is certainly the first time I have personally encountered a problem of this nature.

(Of course, when I played and explained the above SoundCloud playlist to my kids, neither of them were impressed, each telling me that of course computers can perform this task. Which only goes to prove Douglas Adams’ adage that I referenced at the beginning of this post.)

Having fun with Music Source Separation

But you don’t need to be a computer scientist, interested in the limits of the problems that computers can and cannot solve, to appreciate this technology.

Not a year goes by without some 30-odd-year-old album being re-released by an aging rock band, proudly sporting the designator “remastered”. Once a multi-track music recording is mixed to produce a single track or mixdown, the resulting recording is then mastered, which involves adjusting things like dynamics and EQ to produce the best sound overall. Remastering repeats the mastering process, and can help improve the quality of an old recording. But only so much.

Even better is if audio engineers have access to the original, individual recorded tracks in the multi-track recording. With these in hand they can not only re-master the recording but also re-mix it, including adjusting dynamics and EQ of individual tracks before mixing them together to ensure the resulting recording overall is as good as it can be. This happened, for instance, for the 50th anniversary edition of Sgt. Pepper’s.

But what if the original stems no longer exist? Or what about recordings made before the rise of multi-track recording? Reliable music source separation would allow old recordings to be re-mixed and re-mastered.

What might that sound like? To find out, I took an old rock band recording from 2000 and used another leading music source separation model, SCNet, to separate it into drums, bass, vocals, and the remainder (in this case, rhythm and lead guitar). I then re-mixed the resulting 4-track recording in GarageBand.

The result (to my ears) is a much improved sound overall. Check out the recording below. It begins with the new, improved mix, then switches to the original mix at approx. 15 seconds. At 25 seconds, it switches back to the improved mix. At 41 seconds it again switches to the original mix, before switching back to the improved mix at approx. 50 seconds.

I should confess: this recording was made by me and my brother, Alex, when we were second-year undergraduates, of the post-grunge band we formed in high school together. While we had somehow hung on to the overall recording, the individual tracks were long lost. Even though SCNet did not produce perfect stems, the resulting re-mix is still promising.

We should expect music source separation technology to improve in coming years. The implications for enhancing and preserving old recordings I find fascinating.

New forms of creativity

However, re-mixing old recordings is just the tip of the iceberg.

There are strong precedents for new technologies giving rise to new forms of creativity. Indeed, the arrival of music sampling technology is often credited as being critical to spawning hiphop and electronic dance music in the 70s and 80s.

We should expect that music source separation will do likewise.

Computer scientist Alan Perlis once wrote:

“If art interprets our dreams, the computer executes them in the guise of programs!”

I recall being struck by the poetry of this quote more than 20 years ago when I first read Structure and Interpretation of Computer Programs.

As a music fan, I’ve spent many idle moments wondering how music might have turned out had history played out differently.

Perhaps you’ve heard Johnny Cash’s famous cover of Nine Inch Nails’ song Hurt, released in 2002, a year before Cash’s death?

Have you heard the original, released eight years prior?

Both are haunting and beautiful, and each renders the pain of the respective artist with gravity and sincerity. But what if Cash and Trent Reznor (the sole member of Nine Inch Nails at the time of Hurt) had had the opportunity to collaborate on this song? What might that sound like?

I spent a couple of hours exploring this idea using Demucs. A sample of the result (only a sample, for obvious copyright reasons) is below:

Here, I applied Demucs to separate both recordings into two tracks: vocals and everything else. I was interested in the idea of taking Cash’s vocal performance and pairing that with Reznor’s original instrumentation, which is darker and—in my opinion—more dynamic than Cash’s. However, the sample above also uses Reznor’s vocals (in the right channel) in the pre-chorus and chorus, starting at approx. 20 seconds, to simulate communion with Cash; while employing the instrumentation of Cash’s version (in the left channel) in the chorus, starting at approx. 41 seconds, to add the sweetness of Cash’s performance.
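For the technically curious, the separation step looks roughly like the sketch below, which uses Demucs’s documented Python entry point; the flags, model name, file name, and output path shown here are illustrative and may differ across Demucs versions:

```python
# Sketch: separate a track into "vocals" and "no_vocals" stems with Demucs.
import demucs.separate

demucs.separate.main([
    "--two-stems", "vocals",   # produce a vocals stem plus everything else
    "-n", "htdemucs",          # model to use (the default in recent versions)
    "hurt.mp3",                # placeholder input file
])

# Roughly equivalent shell command:
#   demucs --two-stems vocals -n htdemucs hurt.mp3
# The resulting stems are written under ./separated/htdemucs/hurt/
```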

Music nerds will have guessed that I had to adjust the pitch and speed of the original to match Cash’s performance. (I also had to manually adjust the timing throughout Cash’s vocal performance which, unlike Reznor’s, did not use a click track to ensure a consistent tempo throughout.) Because I had to transpose Reznor’s version down, the result renders his vocals overly effected on the first line of the chorus. However, that was the lesser of two evils (the alternative being to transpose Cash’s performance up throughout, which would have left it sounding trivial, thereby undermining the song’s emotional intensity).
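The kind of pitch-and-tempo matching involved can be sketched with the librosa library (an illustration only, not a record of the exact tools or values used; the file names, semitone shift, and stretch rate are placeholders):

```python
# Sketch: transpose a recording down and stretch it slightly to match another
# performance's key and tempo. The values here are placeholders.
import librosa
import soundfile as sf

y, sr = librosa.load("hurt_original.wav", sr=None)  # placeholder file, loaded mono

# Shift down by two semitones (placeholder amount) to match the other key...
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=-2)

# ...and slow it down slightly (rate < 1.0) to match the other tempo.
y_matched = librosa.effects.time_stretch(y_shifted, rate=0.97)

sf.write("hurt_matched.wav", y_matched, sr)
```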

I can only imagine what teens today will make of this tech as it matures, and the new music it will help to spawn. As I said on BlueSky:

“I see no reason why emerging ML methods for audio and music processing won’t be just as influential in the coming decade on the creation of new kinds of music as was sampling technology in the 70s and 80s that led to the rise of hiphop and electronic dance music.”

What has this to do with formal specification and verification?

I promised I would touch on formal methods. If it isn’t clear from the discussion above, music source separation falls into the category of problems for which approximate, good-enough solutions now exist for the first time. However, it seems plain to me that writing a formal specification of what it means for a solution to be good enough is infeasible.

Given a finished recording for which we have no stems, it seems infeasible to formally specify what the resulting stems should be and when a solution is “close enough” to them. Music source separation, much like image recognition, appears to be a problem for which the solution is “I’ll know it when I see it”—or, in the case of music source separation, hear it.

Even MNIST, a problem whose input and output domains are small and easy to comprehend, seems beyond the reach of formal specification. MNIST is a handwritten digit classification problem: given a 28x28 pixel grayscale image of a handwritten digit “0” through “9” inclusive, classify it into one of the ten classes “0” through “9”. This problem is one that we can literally visualise. Solutions are functions f of type char[28][28] -> float[10]:

[Figure: the MNIST problem]
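Rendered as Python (a sketch only, with a placeholder body), a solution is simply a function of this shape:

```python
import numpy as np

def classify_digit(image: np.ndarray) -> np.ndarray:
    """Map a 28x28 grayscale image (pixel values 0-255) to ten scores, one
    per digit class "0" through "9"."""
    assert image.shape == (28, 28)
    # Placeholder body: a real solution would compute the scores from the
    # pixels, e.g. with a trained neural network.
    return np.zeros(10, dtype=np.float32)
```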

Yet writing a formal specification for what it would mean for a handwritten-digit recogniser to be “good enough” on arbitrary 28x28 grayscale images seems totally infeasible.

This isn’t a question about the skill of the person writing the formal specification, or the time it might take to write it. Skilled formal methods experts, given sufficient time, have formally specified the functional correctness of software that performs tasks far more interesting than the MNIST problem. MNIST seems beyond formal specification even for the most skilled formal methods experts with unlimited time.

That such a simple-looking problem as MNIST totally eludes formal specification is as fascinating as it is disturbing.

The arrival of highly useful computer programs that appear beyond the reach of formal correctness specification (let alone verification) raises interesting questions for this formal methods researcher. Perhaps I’ll have more to say on that in future.

For now, the best we might hope to do is to specify weaker properties for such algorithms. An obvious one for music source separation would be that, having taken a finished recording and performed music source separation, the sum of the resulting stems should yield a recording close to the original. However, that alone is not a sufficient correctness specification because it would be satisfied by the trivial algorithm that produces two stems: one containing a duplicate of the original recording and a second containing only silence.
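To make the point concrete, here is that (too-weak) property as an executable check in Python with NumPy, along with the trivial “solution” that satisfies it; random samples stand in for real audio:

```python
import numpy as np

def reconstruction_error(original: np.ndarray, stems: list[np.ndarray]) -> float:
    """Relative difference between the original mix and the sum of the stems."""
    mixture = np.sum(stems, axis=0)
    return float(np.linalg.norm(original - mixture) / np.linalg.norm(original))

rng = np.random.default_rng(0)
original = rng.standard_normal(44_100)  # one second of stand-in "audio"

# The trivial "separation" described above: one stem that duplicates the
# original plus one stem of silence. It satisfies the property perfectly,
# which is exactly why the property alone is not a correctness specification.
trivial_stems = [original.copy(), np.zeros_like(original)]
assert reconstruction_error(original, trivial_stems) < 1e-9
```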

Software engineers, I’d love your thoughts on these questions.

For now, I look forward to the future of machine-learning enabled music creation.

Acknowledgements

Thank you to Matthew Fernandez who provided very useful comments on an early version of this post.