Abstract: *In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models.*
## TODO List (2021.07.03)
- [x] Enable GTA finetuning
- [ ] Upload pre-trained weight
## Requirements
- To reproduce the results from our paper, you need to download the training datasets (LibriTTS and VCTK) and prepare the configuration files described below.

In `config/`, all files with names other than `default.yaml` are ignored by git (see `.gitignore`):
- `config/global`: Global configs used for training both Cotatron and the VC decoder.
  - Fill in the blanks of: `speakers`, `train_dir`, `train_meta`, `val_dir`, `val_meta`, `f0s_list_path`.
  - An example speaker id list is shown in `datasets/metadata/libritts_vctk_speaker_list.txt`.
  - When replicating the two-stage training process from our paper (training with LibriTTS first, then LibriTTS+VCTK), put the speaker ids from both LibriTTS and VCTK in the global config.
  - `f0s_list_path` is set to `f0s.txt` by default.
- `config/cota`: Configs for training Cotatron.
  - You may want to change `batch_size` for GPUs other than a 32GB V100, or change `chkpt_dir` to save checkpoints on another disk.
  - You can also set `use_attn_loss`, which controls whether guided attention loss is used.
- `config/vc`: Configs for training the VC decoder.
  - Fill in the blank of: `cotatron_path`.
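As a concrete starting point, here is a minimal sketch of the configuration step. The copied file name `mine.yaml` is only an example; any name other than `default.yaml` works and stays untracked by git.

```bash
# Keep the tracked templates intact and edit local copies instead;
# anything not named default.yaml is gitignored.
cp config/global/default.yaml config/global/mine.yaml
cp config/cota/default.yaml   config/cota/mine.yaml
cp config/vc/default.yaml     config/vc/mine.yaml

# Then open config/global/mine.yaml and fill in the blanks listed above:
#   speakers, train_dir, train_meta, val_dir, val_meta, f0s_list_path
# and set cotatron_path in config/vc/mine.yaml after Cotatron is trained.
```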
The optional checkpoint argument is also available for the VC decoder.
### 3. GTA finetuning HiFi-GAN
Once the VC decoder is trained, finetune HiFi-GAN with GTA (ground-truth aligned) finetuning.

First, extract GTA mel-spectrograms from the VC decoder.
The GTA mel-spectrogram computed from each audio file is saved as `**.wav.gta` on the first pass,
and then loaded from disk afterwards.
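A rough sketch of what this step might look like, assuming a command-line extractor script; the script name, flags, and checkpoint path below are assumptions rather than the documented interface, so consult the repository's scripts for the actual invocation.

```bash
# Assumed sketch only: the script name and flags are not taken from the
# repository docs. The idea is to run the trained VC decoder over the training
# data so that each audio file gets a matching **.wav.gta mel-spectrogram.
python gta_extractor.py -c config/global/mine.yaml config/vc/mine.yaml \
                        -p <path_to_trained_VC_decoder_checkpoint>
```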
Train/validation metadata for the GTA mels will be saved as `datasets/gta_metadata/gta_<original_metadata_name>.txt`.
Use these metadata files when finetuning HiFi-GAN.
After extracting the GTA mels, move into the `hifi-gan` directory and follow the instructions in [hifi-gan/README.md](https://github.com/wookladin/hifi-gan/blob/master/README.md):
```bash
cd hifi-gan
```
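For orientation only, a finetuning command might look roughly like the sketch below. The flags follow the upstream HiFi-GAN `train.py` interface and may differ in this fork, and the metadata file names are placeholders for the `gta_<original_metadata_name>.txt` files produced above; defer to hifi-gan/README.md for the real procedure.

```bash
# Sketch under assumptions: upstream-style HiFi-GAN flags, placeholder metadata
# names, and paths given relative to the hifi-gan directory.
python train.py --fine_tuning True \
    --config config_v1.json \
    --input_training_file ../datasets/gta_metadata/gta_train_metadata.txt \
    --input_validation_file ../datasets/gta_metadata/gta_val_metadata.txt
```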
### Monitoring via Tensorboard
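A minimal sketch, assuming training logs are written under `logs/` (substitute the log/checkpoint directory set in your config):

```bash
# Assumed log location; replace logs/ with the directory your runs write to.
tensorboard --logdir logs/ --bind_all
```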
You can convert the source speech to any speaker contained in the training set, which is any-to-many voice conversion.
Note that `speaker_id` has no effect, whether or not it is in the training set.
3. Convert `datasets/inference_source/metadata_origin.txt` into ARPABET.
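A hypothetical sketch of this conversion step using the `g2p_en` package is shown below; the output file name and the `wav_path|text|speaker_id` metadata layout are assumptions for illustration, not the repository's documented tool or format.

```bash
# Hypothetical sketch (not the repository's documented tool): convert the
# transcript column of the inference metadata to ARPABET with g2p_en.
# Assumes a "wav_path|text|speaker_id" layout and an arbitrary output name.
python - <<'EOF'
from g2p_en import G2p

g2p = G2p()
src = "datasets/inference_source/metadata_origin.txt"
dst = "datasets/inference_source/metadata_g2p.txt"  # assumed output name
with open(src) as fin, open(dst, "w") as fout:
    for line in fin:
        wav_path, text, speaker_id = line.rstrip("\n").split("|")
        arpabet = " ".join(g2p(text))  # list of ARPABET symbols -> string
        fout.write("|".join([wav_path, arpabet, speaker_id]) + "\n")
EOF
```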
**Authors:** Kang-wook Kim, Seung-won Park, Myun-chul Joe @ [MINDsLab Inc.](https://maum.ai), SNU