
GPT‐SoVITS‐v3‐features (新特性)


1-v1v2v3情况对比 (v1/v2/v3 Comparison)

| | 语种支持(可互相跨语种合成) | GPT训练集时长 | SoVITS训练集时长 | 推理速度 | 参数量 | 功能 |
| --- | --- | --- | --- | --- | --- | --- |
| v1 | 中日英 | 约2k小时 | 约2k小时 | baseline | 90M+77M | baseline |
| v2 | 中日英韩粤 | 约2.5k小时 | vq encoder约2k小时(v1冻结),一共5k小时 | 翻倍 | 90M+77M | 新增语速调节,无参考文本模式,更好的混合语种切分 |
| v3 | 中日英韩粤 | 约7k小时 | vq encoder约2k小时(v1冻结),一共7k小时 | 约等于v2 | 330M+77M | 大幅增加zero shot相似度;情绪表达、微调性能提升 |
| | Language Support (cross-language synthesis) | GPT Training Dataset Duration | SoVITS Training Dataset Duration | Inference Speed | Number of Parameters | Features |
| --- | --- | --- | --- | --- | --- | --- |
| v1 | Chinese, Japanese, English | about 2k hours | about 2k hours | baseline | 90M+77M | baseline |
| v2 | Chinese, Japanese, English, Korean, Cantonese | about 2.5k hours | vq encoder about 2k hours (frozen from v1), 5k hours in total | doubled | 90M+77M | Added speed control, reference-free mode, better mixed-language slicing |
| v3 | Chinese, Japanese, English, Korean, Cantonese | about 7k hours | vq encoder about 2k hours (frozen from v1), 7k hours in total | roughly same as v2 | 330M+77M | Significantly higher zero-shot similarity; improved emotional expression and fine-tuning performance |

v2对比v3 (v2 vs. v3)

(1)音色相似度更像,需要更少训练集来逼近本人(不训练直接使用底模的模式下音色相似性提升更大)

Timbre similarity is higher, and less training data is needed to approximate the target speaker (the gain in timbre similarity is even larger when the base model is used directly without fine-tuning).

(2)GPT合成更稳定,重复漏字(根据测试集实验指标)更少,也更容易跑出丰富情感

GPT synthesis is more stable, with fewer repetitions and omitted characters (per test-set metrics), and it more readily produces richly emotional speech.

(3)比v2更忠实于参考音频。微调场景下,v2比v3更受训练集整体平均影响,然后带一些参考音频的引导。

Compared to v2, v3 is more faithful to the reference audio. In fine-tuning scenarios, v2 is more influenced by the overall average of the training set, with only some guidance from the reference audio.

如果你的训练集质量比较糟糕,也许“更受训练集整体平均影响”的v2vits版本更适合你。

If your training set is of poor quality, the v2 (VITS) version, which is "more influenced by the overall average of the training set," may suit you better.

2-SeedTTS ZeroShot TTS eval testset CN

| | CER | SIM |
| --- | --- | --- |
| v1 | 0.025 | 0.526 |
| v2 | 0.017 | 0.549 |
| v3 | 0.014 | 0.702 |
| GT | 0.013 | 0.760 |
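
For reference, numbers like these are usually obtained by transcribing the synthesized audio with an ASR model and scoring CER against the target text, and by taking the cosine similarity between speaker embeddings of the synthesized and ground-truth audio for SIM. Below is a minimal sketch of that evaluation loop; `transcribe` and `embed_speaker` are hypothetical placeholders for an ASR model and a speaker-embedding model, not part of GPT-SoVITS.

```python
# Sketch of how CER / SIM numbers like the table above are typically computed.
# `transcribe()` and `embed_speaker()` are hypothetical stand-ins for an ASR model
# and a speaker-verification embedding model; they are not part of GPT-SoVITS.
import numpy as np
from jiwer import cer  # character error rate; pip install jiwer

def speaker_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def evaluate(pairs, transcribe, embed_speaker):
    """pairs: list of (synthesized_wav, ground_truth_wav, reference_text)."""
    cers, sims = [], []
    for synth_wav, gt_wav, ref_text in pairs:
        cers.append(cer(ref_text, transcribe(synth_wav)))
        sims.append(speaker_similarity(embed_speaker(synth_wav), embed_speaker(gt_wav)))
    return float(np.mean(cers)), float(np.mean(sims))
```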

3-技术变更点 (Technical Updates)

(1)训练集增加至7k小时 (MOS分音质过滤、标点停顿校验)

The training dataset has been expanded to 7,000 hours (with MOS-based audio quality filtering and punctuation pause verification).

只使用7k小时优选训练集,更大的想象空间留给各位看官们发挥~

Only a curated 7k-hour training set was used, leaving plenty of room for the community to push further.
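
As an illustration of the MOS-based quality filtering mentioned above, the sketch below keeps only clips whose predicted MOS clears a threshold. The `predict_mos` callable is a hypothetical stand-in for any non-intrusive MOS predictor, and the 3.5 threshold is an arbitrary example, not the value used to build the actual 7k-hour set.

```python
# Hypothetical illustration of MOS-based quality filtering for a training corpus.
# predict_mos() stands in for any non-intrusive MOS predictor; the 3.5 threshold
# is an arbitrary example, not the value used for the real 7k-hour dataset.
from pathlib import Path

def filter_by_mos(wav_dir: str, predict_mos, threshold: float = 3.5):
    """Return the subset of wav files whose predicted MOS is at or above threshold."""
    kept = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        score = predict_mos(str(wav_path))  # estimated 1-5 quality score
        if score >= threshold:
            kept.append((str(wav_path), score))
    return kept
```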

(2)s2结构变更为:shortcut Conditional Flow Matching Diffusion Transformers (shortcut-CFM-DiT)

The S2 architecture has been changed to a shortcut Conditional Flow Matching Diffusion Transformer (shortcut-CFM-DiT).

由于s2占整体延时比例太低,s2变复杂对于整体耗时影响不大。

Since the proportion of S2 in the overall latency is minimal, increasing the complexity of S2 has little impact on the total processing time.

音质最佳:采样步数32

Best Audio Quality: Sampling steps set to 32.

速度快:4/8步 (zero shot这档配置没啥瑕疵,少量样本微调可能需要提升步数)

Faster Speed: 4/8 steps (in zero-shot mode this setting shows no obvious artifacts; fine-tuning on a small amount of data may require more steps).
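
To make the step-count trade-off concrete, here is a generic Euler sampler for a conditional flow-matching model: the loop count is exactly the sampling-steps knob discussed above (32 for best quality, 4 or 8 for speed). The `velocity_model` and `cond` interfaces are generic stand-ins, not the actual GPT-SoVITS s2 code.

```python
# Generic Euler sampler for a (conditional) flow-matching model.
# `velocity_model(x, t, cond)` is a stand-in for the DiT that predicts the flow
# velocity field; it is NOT the actual GPT-SoVITS s2 interface.
import torch

@torch.no_grad()
def cfm_sample(velocity_model, cond, shape, sample_steps: int = 32, device="cpu"):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (mel) in `sample_steps` Euler steps."""
    x = torch.randn(shape, device=device)  # start from Gaussian noise
    dt = 1.0 / sample_steps
    for i in range(sample_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = velocity_model(x, t, cond)     # predicted velocity at (x, t)
        x = x + v * dt                     # Euler update toward the data distribution
    return x                               # e.g. a predicted mel-spectrogram

# Fewer steps (4/8) trade a little fidelity for speed; 32 steps gives the best quality.
```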

s2原理的变更(基于参考音频扩散补全)导致音色相似度大幅提升。

The change in the S2 principle (diffusion-based completion conditioned on the reference audio) yields a large improvement in timbre similarity.

由于没用端到端合成,使用了开源的24k的BigVGANv2参数从mel谱得到波形。

Since synthesis is not end-to-end, the open-source 24 kHz BigVGANv2 weights are used to generate waveforms from mel-spectrograms.
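
For reference, running a mel-spectrogram through the open-source BigVGAN v2 vocoder typically looks like the sketch below. It assumes NVIDIA's BigVGAN repository and its public 24 kHz checkpoint; the exact checkpoint and loading path used inside GPT-SoVITS may differ.

```python
# Sketch of mel -> waveform with an open-source BigVGAN v2 vocoder.
# Assumes NVIDIA's BigVGAN repo (https://github.com/NVIDIA/BigVGAN) is installed;
# the checkpoint below is the public 24 kHz one and may differ from what GPT-SoVITS ships.
import torch
import bigvgan

model = bigvgan.BigVGAN.from_pretrained("nvidia/bigvgan_v2_24khz_100band_256x", use_cuda_kernel=False)
model.remove_weight_norm()
model.eval()

with torch.no_grad():
    mel = torch.randn(1, 100, 200)  # (batch, n_mels, frames) placeholder mel-spectrogram
    wav = model(mel)                # (batch, 1, samples) waveform at 24 kHz
```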

(3)s1结构不变,更新了一版参数

The S1 architecture remains unchanged, with the parameters updated.