Oral
in
Workshop: Machine Learning for Audio
Self-Supervised Speech Enhancement using Multi-Modal Data
Yu-Lin Wei · Rajalaxmi Rajagopalan · Bashima Islam · Romit Roy Choudhury
We consider the problem of speech enhancement in earphones. While microphones are the classical speech sensors, motion sensors embedded in modern earphones also pick up faint components of the user’s speech. While this faint motion data has generally been ignored, we show that it can serve as a pathway for self-supervised speech enhancement. Our proposed model is an iterative framework in which the motion data offers a hint to the microphone (in the form of an estimated posterior); the microphone SNR improves from the hint, which in turn helps the motion data refine its next hint. Results show that this alternating self-supervision converges even in the presence of strong ambient noise, and the performance is comparable to that of supervised denoisers. When only a small amount of training data is available, our model outperforms the same denoisers.
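A minimal sketch of the alternating self-supervision loop described in the abstract, under our own assumptions: the `Branch` networks, feature shapes, learning rates, and MSE pseudo-target losses are illustrative placeholders, not the authors' implementation. Both the motion and microphone signals are assumed to be projected into same-dimensional spectrogram-like features.

```python
# Hypothetical sketch of the alternating self-supervision scheme; all model
# and training choices below are assumptions for illustration only.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Tiny stand-in for either the motion-to-speech or mic-enhancement network."""
    def __init__(self, dim=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        return self.net(x)

def alternating_self_supervision(mic_feat, imu_feat, n_iters=5):
    """mic_feat, imu_feat: (batch, frames, dim) tensors with matching last dim."""
    dim = mic_feat.shape[-1]
    motion_branch = Branch(dim)   # produces a speech "hint" from the faint motion data
    mic_branch = Branch(dim)      # enhances the noisy microphone signal
    opt_motion = torch.optim.Adam(motion_branch.parameters(), lr=1e-3)
    opt_mic = torch.optim.Adam(mic_branch.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    hint = motion_branch(imu_feat).detach()  # initial (crude) hint from motion
    for _ in range(n_iters):
        # Step 1: use the motion hint as a pseudo-target to train the mic enhancer.
        enhanced = mic_branch(mic_feat)
        loss_mic = loss_fn(enhanced, hint)
        opt_mic.zero_grad(); loss_mic.backward(); opt_mic.step()

        # Step 2: use the (now cleaner) mic output as a pseudo-target to refine
        # the motion branch, which yields a better hint for the next round.
        target = mic_branch(mic_feat).detach()
        loss_motion = loss_fn(motion_branch(imu_feat), target)
        opt_motion.zero_grad(); loss_motion.backward(); opt_motion.step()

        hint = motion_branch(imu_feat).detach()
    return mic_branch(mic_feat).detach()

# Example usage with random features (batch=2, frames=100, dim=257):
if __name__ == "__main__":
    mic = torch.randn(2, 100, 257)
    imu = torch.randn(2, 100, 257)
    out = alternating_self_supervision(mic, imu)
    print(out.shape)  # torch.Size([2, 100, 257])
```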