ChimeraNet

This page provides simple examples and an API for our paper, Deep Clustering and Conventional Networks for Music Separation: Stronger Together. The paper is now available on arXiv.

If you are also interested in single-channel speech separation, see our other paper, Deep Attractor Network for Single-microphone Speaker Separation.

A first glimpse: try it yourself!

We provide an API so you can try it yourself. Upload a music recording, and the API will separate it and return the recovered singing voice and accompaniment. Make sure your file is monophonic (single-channel) and that the mixture contains singing voice; otherwise the API may fail. Because of limited computational resources, the maximum accepted recording length is 10 seconds, and we always resample the mixture to 16 kHz.
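The constraints above (single channel, at most 10 seconds, 16 kHz) can be met before uploading. The following is a minimal sketch, not the server's actual preprocessing: the function name `prepare_mixture` is hypothetical, and the resampling uses plain linear interpolation with no anti-aliasing filter, which is only adequate for illustration.

```python
import numpy as np

TARGET_SR = 16_000    # the page states that mixtures are resampled to 16 kHz
MAX_SECONDS = 10.0    # the page states a 10-second limit


def prepare_mixture(samples, sample_rate):
    """Downmix to mono, keep the first 10 s, and resample to 16 kHz.

    samples: float array shaped (num_samples,) or (num_samples, num_channels).
    Returns a 1-D float array sampled at TARGET_SR.
    """
    x = np.asarray(samples, dtype=np.float64)
    if x.ndim == 2:
        x = x.mean(axis=1)                       # average channels into one
    x = x[: int(round(MAX_SECONDS * sample_rate))]  # keep the first 10 seconds
    if sample_rate == TARGET_SR or x.size == 0:
        return x
    n_out = int(round(x.size * TARGET_SR / sample_rate))
    t_in = np.arange(x.size) / sample_rate       # input sample times (s)
    t_out = np.arange(n_out) / TARGET_SR         # output sample times (s)
    # Assumption: linear interpolation stands in for a proper band-limited
    # resampler; a real pipeline would low-pass filter before downsampling.
    return np.interp(t_out, t_in, x)
```

For example, a 12-second stereo recording at 44.1 kHz comes out as a 1-D array of 160,000 samples, that is, 10 seconds at 16 kHz.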

Please wait a few seconds after uploading your file for the separation to finish. It typically takes no longer than 10 seconds.

This server is extremely small, so we delete your files once you close the result page. This can sometimes lead to errors, and we apologize for the inconvenience. If an error occurs, reload the homepage and upload your file again.

Going deeper: ChimeraNet

Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known about its effectiveness in other challenging situations such as music source separation. Moreover, since deep clustering does not have the advantage of end-to-end training, we further combine it with a conventional mask-inference network via an approach akin to multi-task learning.
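To illustrate how deep clustering accommodates an arbitrary number of sources, here is a minimal sketch of the separation step: each time-frequency bin has an embedding vector, and clustering those vectors yields one binary mask per source. The function name `kmeans_masks`, the farthest-point initialisation, and the iteration count are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def kmeans_masks(embeddings, num_sources, num_iters=20):
    """Cluster time-frequency embeddings and return one binary mask per source.

    embeddings: array shaped (T, F, D), one D-dimensional vector per T-F bin.
    Returns a boolean array shaped (num_sources, T, F).
    """
    T, F, D = embeddings.shape
    V = np.asarray(embeddings, dtype=np.float64).reshape(T * F, D)

    def assign(centers):
        # Squared Euclidean distance from every bin to every center.
        dists = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [V[0]]
    for _ in range(1, num_sources):
        d = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.stack(centers)

    for _ in range(num_iters):
        labels = assign(centers)
        for k in range(num_sources):
            members = V[labels == k]
            if len(members):                 # leave empty clusters unchanged
                centers[k] = members.mean(axis=0)

    labels = assign(centers)                 # final assignment
    return np.stack([(labels == k).reshape(T, F) for k in range(num_sources)])
```

The number of sources is chosen only at clustering time, so the same embeddings can be split into two, three, or more masks. Each mask would then be applied to the mixture spectrogram to estimate one source.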

The name "ChimeraNet" comes from the idea that this multi-task learning network has two heads (like a Chimera): a deep clustering head and a mask-inference head. The figure on the right shows the network structure.
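The two-head idea can be sketched as a weighted sum of two training objectives computed from a shared network body. In the sketch below, the deep clustering term is the standard affinity-matching loss ||VV^T - YY^T||_F^2, while the weight `alpha`, the squared-error mask objective, and all function names are illustrative assumptions rather than the exact settings used in the paper.

```python
import numpy as np


def deep_clustering_loss(V, Y):
    """||V V^T - Y Y^T||_F^2 without forming the (N, N) affinity matrices.

    V: (N, D) embeddings, one per time-frequency bin.
    Y: (N, C) one-hot source assignment for each bin.
    """
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))


def mask_inference_loss(masks, mixture_mag, source_mags):
    """Squared error between masked mixture and reference magnitudes.

    masks, source_mags: (C, N); mixture_mag: (N,).
    """
    return np.sum((masks * mixture_mag[None, :] - source_mags) ** 2)


def chimera_loss(V, Y, masks, mixture_mag, source_mags, alpha=0.5):
    """Multi-task objective: alpha weights the clustering head (assumed form)."""
    return (alpha * deep_clustering_loss(V, Y)
            + (1.0 - alpha) * mask_inference_loss(masks, mixture_mag, source_mags))
```

With `alpha=1.0` only the deep clustering head is trained and with `alpha=0.0` only the mask-inference head is trained; values in between train both heads jointly.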

Using only the deep clustering head, we won first place in the Singing Voice Separation task at MIREX 2016. Using both heads, however, we achieve even better results, even when the training and test data are mismatched.

We provide some example separation results. These examples are generated from the public iKala dataset, which is completely mismatched with our training dataset.

	54236_chorus	71726_chorus
Mixture:
Original vocals:
Original accompaniment:
Recovered vocals:
Recovered accompaniment:

Deep Clustering and Conventional Networks for Music Separation: Stronger Together
Yi Luo (yl3364@columbia.edu) Zhuo Chen (zc2204@columbia.edu)
John R. Hershey (hershey@merl.com) Jonathan Le Roux (leroux@merl.com) Nima Mesgarani (nima@ee.columbia.edu)
