Efficient Voice Conversion at High Sampling Rates
Demo page and audio examples
Anders R. Bargum, Simon Lajboschitz, Cumhur Erkut
Abstract
Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the identity of the input speaker to that of a target speaker without changing the linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments, or streamability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Based on objective evaluations, we show that the proposed solution can achieve levels of naturalness and intelligibility comparable to a state-of-the-art solution while significantly reducing inference time. Nonetheless, while the converted output contains target speaker characteristics, actual speaker similarity remains an area of concern.
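As a rough illustration of what "conditioning on external speaker information" can look like, the sketch below appends a fixed target-speaker embedding to every frame of a content latent before decoding. This is one common conditioning scheme and an assumption on our part, not necessarily how S-RAVE is implemented; the function name `condition_frames` and all sizes are hypothetical.

```python
def condition_frames(latent_frames, speaker_embedding):
    """Append the same speaker embedding to every latent frame.

    latent_frames: list of frames, each a list of floats (content latent).
    speaker_embedding: list of floats describing the target speaker.
    Returns a new list of frames of length len(frame) + len(speaker_embedding).
    """
    speaker = list(speaker_embedding)
    return [list(frame) + speaker for frame in latent_frames]


# Hypothetical sizes: 3 frames of a 4-dimensional content latent
# and a 2-dimensional target-speaker embedding.
content_latent = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
target_speaker = [1.0, -1.0]

conditioned = condition_frames(content_latent, target_speaker)
print(len(conditioned), len(conditioned[0]))  # 3 6
```

Swapping `target_speaker` for another speaker's embedding while keeping `content_latent` fixed is the basic idea behind converting the same utterance to different target voices.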
Real-Time Example
Below is a real-time demonstration of the S-RAVE model exported to Max MSP. The video shows an any-to-many scenario where only the target speakers are seen during training.
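To give a sense of what real-time operation demands at 48 kHz, the sketch below computes the time budget for one audio block and the resulting real-time factor. The block size of 2048 samples and the 10 ms processing time are hypothetical values chosen for illustration, not measurements of S-RAVE.

```python
def block_duration_ms(block_size, sample_rate=48_000):
    """Duration of one audio block in milliseconds."""
    return 1000.0 * block_size / sample_rate


def real_time_factor(processing_ms, block_size, sample_rate=48_000):
    """Processing time divided by audio duration; below 1.0 keeps up in real time."""
    return processing_ms / block_duration_ms(block_size, sample_rate)


# Hypothetical block size of 2048 samples at 48 kHz.
budget_ms = block_duration_ms(2048)
print(round(budget_ms, 2))  # 42.67

# Hypothetical processing time of 10 ms per block.
print(real_time_factor(10.0, 2048) < 1.0)  # True
```

A model that needs longer than the block duration to process one block cannot run as a live audio effect without dropouts, which is why inference time matters for streaming use.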
Audio Examples
The remaining content contains voice conversion samples for the S-RAVE model. We provide samples in unseen-to-seen and unseen-to-unseen scenarios using data from the VCTK dataset. Lastly, we provide a few cherry-picked examples of conversions carried out on out-of-domain data, namely input data from the LibriSpeech dataset. All examples are generated and rendered at 48 kHz.
Navigation
- Unseen-to-Seen Voice Conversion
- Unseen-to-Unseen Voice Conversion
- LibriSpeech-to-Seen Voice Conversion
Unseen-to-Seen Voice Conversion
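Source Speaker | Target Speaker | Conversion |
---|---|---|
p334 (Male) | p225 (Female) | (audio) |
p334 (Male) | p228 (Female) | (audio) |
p334 (Male) | p245 (Male) | (audio) |
p334 (Male) | p254 (Male) | (audio) |
p343 (Female) | p225 (Female) | (audio) |
p343 (Female) | p228 (Female) | (audio) |
p343 (Female) | p245 (Male) | (audio) |
p343 (Female) | p254 (Male) | (audio) |
p360 (Male) | p225 (Female) | (audio) |
p360 (Male) | p228 (Female) | (audio) |
p360 (Male) | p245 (Male) | (audio) |
p360 (Male) | p254 (Male) | (audio) |
p362 (Female) | p225 (Female) | (audio) |
p362 (Female) | p228 (Female) | (audio) |
p362 (Female) | p245 (Male) | (audio) |
p362 (Female) | p254 (Male) | (audio) |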
Unseen-to-Unseen Voice Conversion
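Source Speaker | Target Speaker | Conversion |
---|---|---|
p334 (Male) | p334 (Male) | (audio) |
p334 (Male) | p343 (Female) | (audio) |
p334 (Male) | p360 (Male) | (audio) |
p334 (Male) | p362 (Female) | (audio) |
p343 (Female) | p334 (Male) | (audio) |
p343 (Female) | p343 (Female) | (audio) |
p343 (Female) | p360 (Male) | (audio) |
p343 (Female) | p362 (Female) | (audio) |
p360 (Male) | p334 (Male) | (audio) |
p360 (Male) | p343 (Female) | (audio) |
p360 (Male) | p360 (Male) | (audio) |
p360 (Male) | p362 (Female) | (audio) |
p362 (Female) | p334 (Male) | (audio) |
p362 (Female) | p343 (Female) | (audio) |
p362 (Female) | p360 (Male) | (audio) |
p362 (Female) | p362 (Female) | (audio) |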
Out-of-Domain Data (LibriSpeech-to-Seen Conversion)
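Source Speaker | Target Speaker | Conversion |
---|---|---|
1089 (Male) | p225 (Female) | (audio) |
1089 (Male) | p228 (Female) | (audio) |
1089 (Male) | p241 (Male) | (audio) |
1089 (Male) | p245 (Male) | (audio) |
8230 (Male) | p225 (Female) | (audio) |
8230 (Male) | p228 (Female) | (audio) |
8230 (Male) | p241 (Male) | (audio) |
8230 (Male) | p245 (Male) | (audio) |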