Results on the resynthesis task for three unsupervised models and one LogMel baseline, each at three unit sizes. Bitrates are in bit/s. PER is computed with a pretrained phone recognition model without lexicon or LM, and CER with a full ASR model (lower is better for both). Human MOS (higher is better) and human CER (computed from human transcriptions, lower is better) are also provided; the 95% confidence interval was on average 0.32 for MOS and 1.8 for human CER.
S2u architecture | Nb units | Bit-rate (bit/s) | ASR PER↓ (LJ) | ASR PER↓ (LS) | ASR CER↓ (LJ) | ASR CER↓ (LS) | Human MOS↑ (LJ) | Human MOS↑ (LS) | Human CER↓ (LJ) | Human CER↓ (LS) |
---|---|---|---|---|---|---|---|---|---|---|
Toplines | | | | | | | | | | |
original wav | – | – | – | – | – | – | 4.83 | 4.30 | 8.88 | 6.73 |
orig text+TTS | – | – | 7.78 | 7.92 | 8.87 | 5.14 | 4.02 | 4.03 | 13.25 | 10.73 |
ASR + TTS | 27 | – | 9.45 | 8.18 | 9.48 | 5.30 | 4.04 | 4.06 | 15.98 | 11.56 |
Baselines | | | | | | | | | | |
LogMel | 50 | 214.8 | 27.72 | 49.38 | 27.73 | 52.05 | 2.41 | 2.07 | 43.78 | 66.75 |
LogMel | 100 | 292.7 | 25.83 | 45.58 | 24.88 | 48.71 | 2.65 | 2.01 | 37.39 | 62.72 |
LogMel | 200 | 373.8 | 19.78 | 45.16 | 17.86 | 46.12 | 2.96 | 2.16 | 23.33 | 62.6 |
Unsupervised | | | | | | | | | | |
CPC | 50 | 159.4 | 10.87 | 17.16 | 10.68 | 12.06 | 3.63 | 3.51 | 13.97 | 19.92 |
CPC | 100 | 213.1 | 10.75 | 15.82 | 9.84 | 9.46 | 3.42 | 3.68 | 13.53 | 14.73 |
CPC | 200 | 279.4 | 8.74 | 14.23 | 9.20 | 8.29 | 3.85 | 3.54 | 9.36 | 14.33 |
HuBERT-L6 | 50 | 125.7 | 11.45 | 16.68 | 11.02 | 11.85 | 3.69 | 3.49 | 14.54 | 13.14 |
HuBERT-L6 | 100 | 168.1 | 9.53 | 13.24 | 9.31 | 7.19 | 3.84 | 3.68 | 13.02 | 11.43 |
HuBERT-L6 | 200 | 211.3 | 8.87 | 11.06 | 8.88 | 5.35 | 4.00 | 3.85 | 11.67 | 10.84 |
wav2vec-L14 | 50 | 141.3 | 24.95 | 33.69 | 25.42 | 32.91 | 2.45 | 2.87 | 46.82 | 54.9 |
wav2vec-L14 | 100 | 182.1 | 14.58 | 22.07 | 13.72 | 17.22 | 3.50 | 3.32 | 23.76 | 28.1 |
wav2vec-L14 | 200 | 226.8 | 10.65 | 16.34 | 10.21 | 10.50 | 3.83 | 3.51 | 13.14 | 15.27 |
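
Because the metrics point in different directions (lower is better for PER and CER, higher is better for MOS), it can help to read the table programmatically. The following is a minimal sketch, not part of the original evaluation: it copies the LS columns of the nine unsupervised rows, picks the best system per metric, and computes a Spearman rank correlation between ASR-based CER and human CER. The names `ROWS`, `FIELDS`, `best_by`, and `spearman` are hypothetical helpers introduced only for illustration.

```python
# Unsupervised rows copied from the table above (LS columns only).
# Fields: system, units, bitrate, ASR CER (LS), human MOS (LS), human CER (LS).
ROWS = [
    ("CPC", 50, 159.4, 12.06, 3.51, 19.92),
    ("CPC", 100, 213.1, 9.46, 3.68, 14.73),
    ("CPC", 200, 279.4, 8.29, 3.54, 14.33),
    ("HuBERT-L6", 50, 125.7, 11.85, 3.49, 13.14),
    ("HuBERT-L6", 100, 168.1, 7.19, 3.68, 11.43),
    ("HuBERT-L6", 200, 211.3, 5.35, 3.85, 10.84),
    ("wav2vec-L14", 50, 141.3, 32.91, 2.87, 54.9),
    ("wav2vec-L14", 100, 182.1, 17.22, 3.32, 28.1),
    ("wav2vec-L14", 200, 226.8, 10.50, 3.51, 15.27),
]

# Hypothetical field names mapped to tuple positions.
FIELDS = {"bitrate": 2, "asr_cer_ls": 3, "mos_ls": 4, "human_cer_ls": 5}


def best_by(rows, field, lower_is_better=True):
    """Return (system, units) of the best row for the given metric."""
    pick = min if lower_is_better else max
    row = pick(rows, key=lambda r: r[FIELDS[field]])
    return row[0], row[1]


def spearman(xs, ys):
    """Spearman rank correlation; this simple form assumes no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    print(best_by(ROWS, "asr_cer_ls"))
    print(best_by(ROWS, "mos_ls", lower_is_better=False))
    asr = [r[FIELDS["asr_cer_ls"]] for r in ROWS]
    human = [r[FIELDS["human_cer_ls"]] for r in ROWS]
    print(round(spearman(asr, human), 3))
```

On these nine rows, HuBERT-L6 with 200 units comes out best on both ASR CER and human MOS for LS, and the rank correlation between ASR CER and human CER is about 0.9. This covers only the copied rows and columns, so it should be read as an illustration of how to query the table rather than as a result reported by the source.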