
Commit c93571e

Merge pull request #8 from mindslab-ai/canary
Enable GTA finetuning & Bugfix
2 parents 544dc03 + f2c96e9 commit c93571e

File tree

11 files changed: +110 -37 lines changed

.gitignore (+3)

@@ -8,6 +8,9 @@ config/*/*.yaml
 # f0 information
 f0s.txt
 
+# GTA metadatas
+datasets/gta_metadata/
+
 # logs, checkpoints
 chkpt/
 logs/

README.md (+91 -17)
@@ -3,16 +3,15 @@
 ![](./docs/images/overall.png)
 
 **Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques**<br>
-Kang-wook Kim, Seung-won Park, Myun-chul Joe @ [MINDsLab Inc.](https://mindslab.ai), SNU
+Kang-wook Kim, Seung-won Park, Myun-chul Joe @ [MINDsLab Inc.](https://maum.ai/), SNU
 
 Paper: https://arxiv.org/abs/2104.00931 <br>
 Audio Samples: https://mindslab-ai.github.io/assem-vc/ <br>
 
 Abstract: *In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models.*
 
-## TODO List (2021.07.01)
-- [ ] Enable GTA finetuning
-- [ ] Upload loss curves
+## TODO List (2021.07.03)
+- [x] Enable GTA finetuning
 - [ ] Upload pre-trained weight
 
 ## Requirements
@@ -37,8 +36,14 @@ cd assem-vc
 - To reproduce the results from our paper, you need to download:
   - LibriTTS train-clean-100 split [tar.gz link](http://www.openslr.org/resources/60/train-clean-100.tar.gz)
   - [VCTK dataset (Version 0.80)](https://datashare.ed.ac.uk/handle/10283/2651)
-- Unzip each files.
+- Unzip each files, and clone them in `datasets/`.
 - Resample them into 22.05kHz using `datasets/resample.py`.
+```bash
+python datasets/resample.py
+```
+Note that `dataset/resample.py` was hard-coded to remove original wavfiles in `datasets/` and replace them into resampled wavfiles,
+and their filename will be the same as the original filename.
+
 
 ### Preparing Metadata
 
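For readers following the resampling step added above, here is a minimal sketch of the in-place behavior the note describes, assuming the standard librosa/soundfile APIs. It is only an illustration of the described behavior, not the repository's actual `datasets/resample.py`.

```python
# Illustrative sketch only (NOT the repo's datasets/resample.py): resample every
# wav under datasets/ to 22.05 kHz and overwrite it, keeping the same filename.
import glob

import librosa
import soundfile as sf

TARGET_SR = 22050

for path in glob.glob('datasets/**/*.wav', recursive=True):
    audio, _ = librosa.load(path, sr=TARGET_SR)  # load and resample in one step
    sf.write(path, audio, TARGET_SR)             # replace the original wav in place
```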
@@ -76,11 +81,13 @@ cp config/vc/default.yaml config/vc/config.yaml
 Here, all files with name other than `default.yaml` will be ignored from git (see `.gitignore`).
 
 - `config/global`: Global configs that are both used for training Cotatron & VC decoder.
-  - Fill in the blanks of: `speakers`, `train_dir`, `train_meta`, `val_dir`, `val_meta`.
+  - Fill in the blanks of: `speakers`, `train_dir`, `train_meta`, `val_dir`, `val_meta`, `f0s_list_path`.
   - Example of speaker id list is shown in `datasets/metadata/libritts_vctk_speaker_list.txt`.
   - When replicating the two-stage training process from our paper (training with LibriTTS and then LibriTTS+VCTK), please put both list of speaker ids from LibriTTS and VCTK at global config.
+  - `f0s_list_path` is set to `f0s.txt` by default
 - `config/cota`: Configs for training Cotatron.
   - You may want to change: `batch_size` for GPUs other than 32GB V100, or change `chkpt_dir` to save checkpoints in other disk.
+  - You can also modify `use_attn_loss`, whether guided attention loss is used or not.
 - `config/vc`: Configs for training VC decoder.
   - Fill in the blank of: `cotatron_path`.
 
@@ -128,7 +135,22 @@ The optional checkpoint argument is also available for VC decoder.
 
 ### 3. GTA finetuning HiFi-GAN
 
-TBD
+Once the VC decoder is trained, finetune the HiFi-GAN with GTA finetuning.
+First, you should extract GTA mel-spectrograms from VC decoder.
+```bash
+python gta_extractor.py -c <path_to_global_config_yaml> <path_to_vc_config_yaml> \
+                        -p <checkpoint_path>
+```
+The GTA mel-spectrograms calculated from audio file will be saved as `**.wav.gta` at first,
+and then loaded from disk afterwards.
+
+Train/validation metadata of GTA mels will be saved in `datasets/gta_metadata/gta_<orignal_metadata_name>.txt`.
+You should use those metadata when finetuning HiFi-GAN.
+
+After extracting GTA mels, get into hifi-gan and follow manuals in [hifi-gan/README.md](https://github.com/wookladin/hifi-gan/blob/master/README.md)
+```bash
+cd hifi-gan
+```
 
 ### Monitoring via Tensorboard
 
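The `**.wav.gta` caching behavior described in the added lines above can be pictured with the following sketch. The function name and signature are hypothetical stand-ins, not the API of the repository's `gta_extractor.py`.

```python
import os

import torch


def get_gta_mel(wav_path, compute_gta_mel):
    """Compute-or-reuse pattern for GTA mels: the mel predicted for `x.wav`
    is cached next to it as `x.wav.gta` and loaded from disk on later runs.
    `compute_gta_mel` is a placeholder for the teacher-forced Cotatron +
    VC decoder forward pass."""
    gta_path = wav_path + '.gta'
    if os.path.exists(gta_path):
        return torch.load(gta_path)      # reuse the cached GTA mel
    mel = compute_gta_mel(wav_path)      # run the trained model once
    torch.save(mel, gta_path)            # cache for subsequent reads
    return mel
```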
@@ -152,7 +174,7 @@ You can convert it to speaker contained in trainset: which is any-to-many voice
 Note that speaker_id has no effect whether or not it is in the training set.
 3. Convert `datasets/inference_source/metadata_origin.txt` into ARPABET.
 ```bash
-python3 datasets/g2p.py -i datasets/inference_source/metadata_origin.txt \
+python datasets/g2p.py -i datasets/inference_source/metadata_origin.txt \
                         -o datasets/inference_source/metadata_g2p.txt
 ```
 4. Run [inference.ipynb](./inference.ipynb)
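As a rough picture of what the ARPABET conversion step produces, here is a generic example using the `g2p_en` package; the repository's `datasets/g2p.py` and its metadata format may differ, so treat this purely as an illustration.

```python
# Generic G2P illustration; the repo's datasets/g2p.py may use a different
# backend and output format, so this only shows the idea of the conversion.
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("voice conversion")
print(phonemes)  # a list of ARPABET symbols with stress markers, e.g. ['V', 'OY1', 'S', ...]
```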
@@ -168,11 +190,10 @@ Hence, the quality of the result may differ from the paper.*
 
 Here are some noteworthy details of implementation, which could not be included in our paper due to the lack of space:
 
-- Guided attention loss
-
-  We applied guided attention loss proposed in [DC-TTS](https://arxiv.org/abs/1710.08969).
-  It helped Cotatron's alignment learning stable and faster convergence.
-  See [utils/alignment_loss.py](./utils/alignment_loss.py).
+- Guided attention loss <br>
+  We applied guided attention loss proposed in [DC-TTS](https://arxiv.org/abs/1710.08969).
+  It helped Cotatron's alignment learning stable and faster convergence.
+  See [modules/alignment_loss.py](./modules/alignment_loss.py).
 
 ## License
 
@@ -193,8 +214,61 @@ If you have a question or any kind of inquiries, please contact Kang-wook Kim at
 
 
 ## Repository structure
-
-TBD
+```
+.
+├── LICENSE
+├── README.md
+├── cotatron.py
+├── cotatron_trainer.py      # Trainer file for Cotatron
+├── gta_extractor.py         # GTA mel spectrogram extractor
+├── inference.ipynb
+├── preprocess.py            # Extracting speakers' pitch range
+├── requirements.txt
+├── synthesizer.py
+├── synthesizer_trainer.py   # Trainer file for VC decoder (named as "synthesizer")
+├── config
+│   ├── cota
+│   │   └── default.yaml     # configuration template for Cotatron
+│   ├── global
+│   │   └── default.yaml     # configuration template for both Cotatron and VC decoder
+│   └── vc
+│       └── default.yaml     # configuration template for VC decoder
+├── datasets                 # TextMelDataset and text preprocessor
+│   ├── __init__.py
+│   ├── g2p.py               # Using G2P to convert metadata's transcription into ARPABET
+│   ├── resample.py          # Python file for audio resampling
+│   └── text_mel_dataset.py
+│   ├── inference_source
+│   │   (omitted)            # custom source speechs and transcriptions for inference.ipynb
+│   ├── metadata
+│   │   (ommited)            # Refer to README.md within the folder.
+│   └── text
+│       ├── __init__.py
+│       ├── cleaners.py
+│       ├── cmudict.py
+│       ├── numbers.py
+│       └── symbols.py
+├── docs                     # Audio samples and code for https://mindslab-ai.github.io/assem-vc/
+│   (omitted)
+├── hifi-gan                 # Modified HiFi-GAN vocoder (https://github.com/wookladin/hifi-gan)
+│   (omitted)
+├── modules                  # All modules that compose model, including mel.py
+│   ├── __init__.py
+│   ├── alignment_loss.py    # Guided attention loss
+│   ├── attention.py         # Implementation of DCA (https://arxiv.org/abs/1910.10288)
+│   ├── classifier.py
+│   ├── cond_bn.py
+│   ├── encoder.py
+│   ├── f0_encoder.py
+│   ├── mel.py               # Code for calculating mel-spectrogram from raw audio
+│   ├── tts_decoder.py
+│   ├── vc_decoder.py
+│   └── zoneout.py           # Zoneout LSTM
+└── utils                    # Misc. code snippets, usually for logging
+    ├── loggers.py
+    ├── plotting.py
+    └── utils.py
+```
 
 ## References
 
@@ -210,7 +284,7 @@ This implementation uses code from following repositories:
 This README was inspired by:
 - [Tips for Publishing Research Code](https://github.com/paperswithcode/releasing-research-code)
 
-The audio samples on our [webpage](https://mindslab-ai.github.io/cotatron/) are partially derived from:
+The audio samples on our [webpage](https://mindslab-ai.github.io/assem-vc/) are partially derived from:
 - [LibriTTS](https://arxiv.org/abs/1904.02882): Dataset for multispeaker TTS, derived from LibriSpeech.
-- [VCTK](https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html): 46 hours of English speech from 108 speakers.
+- [VCTK](https://datashare.ed.ac.uk/handle/10283/2651): 46 hours of English speech from 108 speakers.
 - [KSS](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset): Korean Single Speaker Speech Dataset.

config/cota/default.yaml (+2 -1)

@@ -11,7 +11,8 @@ train:
     end: 50000
   grad_clip: 1.0 # 0 for no gradient clipping
   teacher_force:
-    rate: 0.5
+    rate: 0.5 # 0.5 is the most stable value
+  use_attn_loss: True # Using guided attention loss (See README.md)
 ###########################
 log:
   chkpt_dir: 'chkpt/cota'

config/global/default.yaml (-6)

@@ -8,9 +8,6 @@ data:
   val_meta: '' # relative path of metadata file from val_dir
   f0s_list_path: '' # preprocessed f0 list
 ###########################
-train:
-  use_attn_loss: True
-###########################
 audio: # WARNING: this can't be changed.
   n_mel_channels: 80
   filter_length: 1024
@@ -30,9 +27,6 @@ chn:
   speaker:
     cnn: [32, 32, 64, 64, 128, 128]
     token: 256
-  # residual encoder
-  residual: [32, 32, 64, 64, 128, 128]
-  residual_out: 1
   # f0 encoder
   prenet_f0: 1
   # TTS decoder

cotatron.py (+7 -4)

@@ -13,7 +13,7 @@
 from modules import TextEncoder, TTSDecoder, SpeakerEncoder, SpkClassifier
 from datasets import TextMelDataset, text_mel_collate
 from datasets.text import Language
-from utils.alignment_loss import GuidedAttentionLoss
+from modules.alignment_loss import GuidedAttentionLoss
 
 
 class Cotatron(pl.LightningModule):
@@ -35,7 +35,10 @@ def __init__(self, hparams):
         self.is_val_first = True
 
         self.use_attn_loss = hp.train.use_attn_loss
-        self.attn_loss = GuidedAttentionLoss(20000, 0.25, 1.00025)
+        if self.use_attn_loss:
+            self.attn_loss = GuidedAttentionLoss(20000, 0.25, 1.00025)
+        else:
+            self.attn_loss = None
 
     def forward(self, text, mel_target, speakers, input_lengths, output_lengths, max_input_len,
                 prenet_dropout=0.5, no_mask=False, tfrate=True):
@@ -59,7 +62,7 @@ def inference(self, text, mel_target):
         decoder_input = torch.cat((text_encoding, speaker_emb_rep), dim=2)
         _, mel_postnet, alignment = \
             self.decoder(mel_target, decoder_input, in_len, out_len, in_len,
-                         prenet_dropout=0.0, no_mask=True, tfrate=False)
+                         prenet_dropout=0.5, no_mask=True, tfrate=False)
         return mel_postnet, alignment
 
     def training_step(self, batch, batch_idx):
@@ -85,7 +88,7 @@ def validation_step(self, batch, batch_idx):
         text, mel_target, speakers, input_lengths, output_lengths, max_input_len, _ = batch
         speaker_emb, mel_pred, mel_postnet, alignment = \
             self.forward(text, mel_target, speakers, input_lengths, output_lengths, max_input_len,
-                         prenet_dropout=0.0, tfrate=False)
+                         prenet_dropout=0.5, tfrate=False)
         speaker_prob = self.classifier(speaker_emb)
         classifier_loss = F.nll_loss(speaker_prob, speakers)
 
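For context on the `GuidedAttentionLoss` that this hunk makes optional: the README points to DC-TTS style guided attention, which penalizes attention mass far from the text-audio diagonal. Below is a minimal, generic sketch of that idea in PyTorch. The actual `modules/alignment_loss.py`, and the meaning of the constructor arguments `(20000, 0.25, 1.00025)`, are not shown in this diff, so this function is an assumption-labeled illustration rather than the project's implementation.

```python
import torch


def guided_attention_loss(attention, input_lengths, output_lengths, sigma=0.25):
    """Generic DC-TTS style guided attention penalty (not the repo's class).

    attention: (batch, T_out, T_in) soft alignment from the TTS decoder.
    The weight W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2 * sigma^2)) is near zero
    on the diagonal and grows as the alignment drifts away from it.
    """
    batch_size = attention.size(0)
    loss = attention.new_zeros(())
    for b in range(batch_size):
        N = int(input_lengths[b])
        T = int(output_lengths[b])
        n = torch.arange(N, device=attention.device, dtype=torch.float32) / N  # (N,)
        t = torch.arange(T, device=attention.device, dtype=torch.float32) / T  # (T,)
        # soft diagonal mask over the valid (T, N) region of this sample
        w = 1.0 - torch.exp(-((n.unsqueeze(0) - t.unsqueeze(1)) ** 2) / (2.0 * sigma ** 2))
        loss = loss + (attention[b, :T, :N] * w).mean()
    return loss / batch_size
```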
docs/index.html (+2 -2)

@@ -3,9 +3,9 @@
 <title>Assem-VC Demo</title>
 </head>
 <body><h2>Audio Samples from "Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques"</h2>
-<p><b>Paper(will updated):</b> <a href="https://arxiv.org/abs/2104.00931">arXiv:2104.00931</a> (Submitted to INTERSPEECH 2021)</p>
+<p><b>Paper:</b> <a href="https://arxiv.org/abs/2104.00931">arXiv:2104.00931</a></p>
 <p><b>Repository:</b> <a href="https://github.com/mindslab-ai/assem-vc">mindslab-ai/assem-vc @ GitHub<iframe src="https://ghbtns.com/github-btn.html?user=mindslab-ai&repo=assem-vc&type=star&count=true" frameborder="0" scrolling="0" width="150" height="20" title="GitHub"></iframe></a>
-<p><strong>Authors: </strong>Kang-wook Kim, Seung-won Park, Myun-chul Joe @<a href="https://mindslab.ai">MINDsLab Inc.</a>, SNU</p>
+<p><strong>Authors: </strong>Kang-wook Kim, Seung-won Park, Myun-chul Joe @<a href="https://maum.ai">MINDsLab Inc.</a>, SNU</p>
 <p><strong>Abstract: </strong>In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models.</p>
 <hr>
 
gta_extractor.py (+1 -2)

@@ -65,11 +65,10 @@ def extract_and_write_meta(self, mode):
             temp_meta = self.extract_gta_mels(batch, mode)
             meta_list.extend(temp_meta)
 
-        root_dir = self.hp.data.train_dir if mode == 'train' else self.hp.data.val_dir
         meta_path = self.hp.data.train_meta if mode == 'train' else self.hp.data.val_meta
         meta_filename = os.path.basename(meta_path)
         new_meta_filename = 'gta_' + meta_filename
-        new_meta_path = os.path.join(root_dir, META_DIR, new_meta_filename)
+        new_meta_path = os.path.join('datasets', META_DIR, new_meta_filename)
 
         os.makedirs(os.path.join('datasets', META_DIR), exist_ok=True)
         with open(new_meta_path, 'w', encoding='utf-8') as f:
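This hunk is likely the "Bugfix" half of the merge title: the removed line rooted the metadata path at `train_dir`/`val_dir`, while the `os.makedirs` call just below creates `datasets/gta_metadata/`, so the metadata file could be written into a directory that was never created. With the new line, both paths agree. A tiny sketch of the resulting path (the metadata filename here is hypothetical):

```python
import os

META_DIR = 'gta_metadata'
meta_filename = os.path.basename('metadata/train_metadata.txt')  # hypothetical metadata file
new_meta_path = os.path.join('datasets', META_DIR, 'gta_' + meta_filename)
print(new_meta_path)  # datasets/gta_metadata/gta_train_metadata.txt
```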
File renamed without changes.

synthesizer.py (+3 -3)

@@ -73,7 +73,7 @@ def inference(self, text, mel_source, mel_reference, f0_padded):
         decoder_input = torch.cat((text_encoding, z_s_repeated), dim=2)
         _, _, alignment = \
             self.cotatron.decoder(mel_source, decoder_input, in_len, out_len, in_len,
-                                  prenet_dropout=0.0, no_mask=True, tfrate=False)
+                                  prenet_dropout=0.5, no_mask=True, tfrate=False)
         ling_s = torch.bmm(alignment, text_encoding)
         ling_s = ling_s.transpose(1, 2)
 
@@ -95,7 +95,7 @@ def inference_from_z_t(self, text, mel_source, z_t):
         decoder_input = torch.cat((text_encoding, z_s_repeated), dim=2)
         _, _, alignment = \
             self.cotatron.decoder(mel_source, decoder_input, in_len, out_len, in_len,
-                                  prenet_dropout=0.0, no_mask=True, tfrate=False)
+                                  prenet_dropout=0.5, no_mask=True, tfrate=False)
         ling_s = torch.bmm(alignment, text_encoding)
         ling_s = ling_s.transpose(1, 2)
 
@@ -147,7 +147,7 @@ def validation_step(self, batch, batch_idx):
 
         if self.is_val_first:
             self.is_val_first = False
-            self.logger.log_figures(mel_source, mel_s_s, mel_s_t, alignment, residual, self.global_step)
+            self.logger.log_figures(mel_source, mel_s_s, mel_s_t, alignment, f0_padded, self.global_step)
 
         return {'loss_rec': loss_rec}
 
utils/plotting.py (-1)

@@ -93,7 +93,6 @@ def plot_residual(residual):
     for feat in residual:
         plt.plot(range(len(feat)), feat)
     plt.xlim(0, len(residual[0]))
-    #plt.ylim(-1.0, 1.0)
     plt.xlabel('time frames')
     plt.ylabel('residual info')
     plt.subplots_adjust(bottom=0.1, right=0.88, top=0.9)
