We explore the use of sparse representations for separating a monaural mixture signal, where by a sparse representation we mean one in which the number of non-zero elements is smaller than might be expected. This is a surprisingly powerful idea: the ability to express a signal sparsely in some known, and potentially overcomplete, basis constitutes a strong model, while also lending itself to efficient algorithms. In the framework we explore, the representation of the signal is linear in a vector of coefficients. However, because many different coefficient vectors could represent the same signal, the mapping from signal to coefficients is nonlinear, with the coefficients being chosen to simultaneously represent the signal and maximize a measure of sparsity. This conversion of the signal into the coefficients using L1-optimization is viewed not as a pre-processing step performed before the data reaches the heart of the algorithm, but rather as itself the heart of the algorithm: after the coefficients have been found, only trivial processing remains to be done. We show how, by suitable choice of overcomplete basis, this framework can use a variety of cues (e.g., speaker identity, differential filtering, differential attenuation) to accomplish monaural separation. We also discuss two radically different algorithms for finding the required overcomplete dictionaries: one based on non-negative matrix factorization of isolated sources, and the other based on end-to-end optimization using automatic differentiation.
Draft manuscript (150KB, PDF)
See also Asari et al. (2006) and my dissertation (Chapter 2 and technical notes).
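As a rough illustration of the framework summarized in the abstract, the sketch below represents a mixture linearly in an overcomplete dictionary, finds sparse coefficients by L1-regularized least squares, and then separates the sources by the "trivial" step of splitting the coefficients by sub-dictionary. Everything concrete here is an assumption made for illustration and is not taken from the manuscript: the random dictionaries `D1` and `D2` stand in for learned per-source dictionaries, the solver is plain iterative soft thresholding (ISTA), and the penalty weight `lam` and iteration count are arbitrary.

```python
import numpy as np

def ista(D, y, lam=0.05, n_iter=2000):
    """Approximately minimize 0.5*||y - D a||^2 + lam*||a||_1 via ISTA.

    ISTA is an illustrative choice of L1 solver, not necessarily the one
    used in the manuscript.
    """
    L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - y) / L       # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
n, k = 64, 96                               # signal length, atoms per source

# Hypothetical per-source dictionaries with unit-norm columns (random stand-ins).
D1 = rng.standard_normal((n, k))
D1 /= np.linalg.norm(D1, axis=0)
D2 = rng.standard_normal((n, k))
D2 /= np.linalg.norm(D2, axis=0)
D = np.hstack([D1, D2])                     # overcomplete: 64 x 192

# Synthetic mixture: each source uses only a few atoms of its own dictionary.
a_true = np.zeros(2 * k)
support = rng.choice(2 * k, size=6, replace=False)
a_true[support] = rng.uniform(1.0, 2.0, size=6)
y = D @ a_true                              # observed single-channel mixture

# The nonlinear step: signal -> sparse coefficients.
a = ista(D, y)

# The remaining trivial step: split coefficients by sub-dictionary.
s1_hat = D1 @ a[:k]
s2_hat = D2 @ a[k:]
```

The point of the design is that all of the modelling lives in the dictionary and the sparsity penalty. Once `a` is found, each source estimate is just a linear reconstruction from its own block of coefficients, and the two estimates sum to the dictionary's reconstruction of the mixture.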