This webpage provides simple examples and an API for our paper, Deep Attractor Network for Single-microphone Speaker Separation. The paper is now available on arXiv.

If you are also interested in singing voice separation, see our other paper: Deep Clustering and Conventional Networks for Music Separation: Stronger Together.

A first glimpse: try it yourself!

We provide an API so that you can try the system yourself. Upload a speech mixture file, and the API will separate it and return the recovered clean speaker files. Make sure your file is monophonic, contains exactly two speakers, and that the speakers overlap for at least 50% of the time. Otherwise the API may not work, since we trained it only on two-speaker mixtures. Also, due to limited computational resources, the maximum acceptable length of the speech mixture is 5 seconds, and we always resample the mixture to 8 kHz.
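The input requirements above (monophonic, 8 kHz, at most 5 seconds) can be met by preprocessing a file before uploading it. The following is a minimal sketch using only the Python standard library, not the server's actual pipeline. The function names are hypothetical, the linear-interpolation resampler is a crude stand-in with no anti-aliasing filter, and truncating to the first 5 seconds is an assumption about how you might choose to shorten a longer recording.

```python
TARGET_RATE = 8000   # the page states mixtures are always resampled to 8 kHz
MAX_SECONDS = 5.0    # the page states a 5-second maximum mixture length


def to_mono(frames):
    """Average the channels of each frame (each frame is a tuple of channel samples)."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample by linear interpolation (illustrative only: no anti-aliasing)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def prepare_mixture(frames, src_rate):
    """Downmix to mono, resample to 8 kHz, and keep at most the first 5 seconds."""
    mono = to_mono(frames)
    resampled = resample_linear(mono, src_rate)
    return resampled[: int(MAX_SECONDS * TARGET_RATE)]


# Example: 6 seconds of stereo audio at 16 kHz becomes 5 seconds of mono at 8 kHz.
stereo = [(1.0, 3.0)] * (16000 * 6)
prepared = prepare_mixture(stereo, 16000)
```

In practice a dedicated audio library with a proper low-pass resampler would be preferable; the sketch only illustrates the three constraints.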

Please wait a few seconds after uploading your file for the separation process to finish. It typically takes no longer than 10 seconds.

This server is extremely small, so we clean up your files once you close the result page. This may sometimes lead to errors, and we apologize for the inconvenience. If an error occurs, try reloading the homepage and re-uploading your file.

Going deeper: DANet

DANet, which stands for Deep Attractor Network, is a novel deep learning framework for the general source separation problem. The network forms attractor points in a high-dimensional embedding space of the signal, and the similarity between the attractors and the time-frequency embeddings is then converted into a soft separation mask.
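The step from similarities to a soft mask can be illustrated with a small sketch. It assumes the attractors are already given, that similarity is measured with a dot product, and that a softmax across sources produces the mask; these choices and the function name `soft_masks` are illustrative assumptions, not a statement of the exact formulation used in the paper.

```python
import math


def soft_masks(embeddings, attractors):
    """Return, for each T-F bin, one mask value per source (values sum to 1).

    embeddings: list of K-dimensional vectors, one per time-frequency bin.
    attractors: list of K-dimensional vectors, one per source.
    """
    masks = []
    for v in embeddings:
        # Similarity of this T-F bin to each attractor (dot product, assumed).
        sims = [sum(a_k * v_k for a_k, v_k in zip(a, v)) for a in attractors]
        # Numerically stable softmax across sources.
        peak = max(sims)
        exps = [math.exp(s - peak) for s in sims]
        total = sum(exps)
        masks.append([e / total for e in exps])
    return masks


# Toy example: three T-F bins in a 2-dimensional embedding space, two sources.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attractors = [[2.0, 0.0], [0.0, 2.0]]
masks = soft_masks(embeddings, attractors)
```

A bin whose embedding lies close to one attractor receives a mask value near 1 for that source, while a bin equidistant from both attractors (the third one here) is split evenly. Multiplying each source's mask with the mixture spectrogram would then give that source's estimate.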

We first provide some separation examples.

Audio examples for a same-gender mixture and a different-gender mixture: speech mixture, original speaker 1, original speaker 2, recovered speaker 1, recovered speaker 2.

These examples were generated from speakers that are completely unknown to the network.

The figures below show examples of the attractor points formed and the locations of T-F bins in the embedding space.


Fig. 1. Location of T-F bins in the embedding space. Each dot visualizes the first three principal components of one T-F bin, where colors distinguish the relative power of the speakers, and the locations of the attractors are marked with X.


Fig. 2. Location of attractor points in the embedding space. Each dot corresponds to one of the 10,000 mixture sounds, visualized using the first three principal components. Two distinct attractor pairs are visible (denoted by A1 and A2).