
Commit c93571e

Merge pull request #8 from mindslab-ai/canary
Enable GTA finetuning & Bugfix
2 parents 544dc03 + f2c96e9 commit c93571e

File tree

11 files changed: +110 -37 lines changed

.gitignore (+3)

@@ -8,6 +8,9 @@ config/*/*.yaml
 # f0 information
 f0s.txt
 
+# GTA metadatas
+datasets/gta_metadata/
+
 # logs, checkpoints
 chkpt/
 logs/

README.md (+91 -17)
@@ -3,16 +3,15 @@
 ![](./docs/images/overall.png)
 
 **Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques**<br>
-Kang-wook Kim, Seung-won Park, Myun-chul Joe @ [MINDsLab Inc.](https://mindslab.ai), SNU
+Kang-wook Kim, Seung-won Park, Myun-chul Joe @ [MINDsLab Inc.](https://maum.ai/), SNU
 
 Paper: https://arxiv.org/abs/2104.00931 <br>
 Audio Samples: https://mindslab-ai.github.io/assem-vc/ <br>
 
 Abstract: *In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models.*
 
-## TODO List (2021.07.01)
-- [ ] Enable GTA finetuning
-- [ ] Upload loss curves
+## TODO List (2021.07.03)
+- [x] Enable GTA finetuning
 - [ ] Upload pre-trained weight
 
 ## Requirements
@@ -37,8 +36,14 @@ cd assem-vc
 - To reproduce the results from our paper, you need to download:
   - LibriTTS train-clean-100 split [tar.gz link](http://www.openslr.org/resources/60/train-clean-100.tar.gz)
   - [VCTK dataset (Version 0.80)](https://datashare.ed.ac.uk/handle/10283/2651)
-- Unzip each files.
+- Unzip each files, and clone them in `datasets/`.
 - Resample them into 22.05kHz using `datasets/resample.py`.
+```bash
+python datasets/resample.py
+```
+Note that `dataset/resample.py` was hard-coded to remove original wavfiles in `datasets/` and replace them into resampled wavfiles,
+and their filename will be the same as the original filename.
+
 
 ### Preparing Metadata
 
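For readers following the resampling step added above, here is a minimal sketch of the in-place behavior the note describes, assuming the standard librosa/soundfile APIs. It is only an illustration of the described behavior, not the repository's actual `datasets/resample.py`.

```python
# Illustrative sketch only (NOT the repo's datasets/resample.py): resample every
# wav under datasets/ to 22.05 kHz and overwrite it, keeping the same filename.
import glob

import librosa
import soundfile as sf

TARGET_SR = 22050

for path in glob.glob('datasets/**/*.wav', recursive=True):
    audio, _ = librosa.load(path, sr=TARGET_SR)  # load and resample in one step
    sf.write(path, audio, TARGET_SR)             # replace the original wav in place
```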
@@ -76,11 +81,13 @@ cp config/vc/default.yaml config/vc/config.yaml
 Here, all files with name other than `default.yaml` will be ignored from git (see `.gitignore`).
 
 - `config/global`: Global configs that are both used for training Cotatron & VC decoder.
-  - Fill in the blanks of: `speakers`, `train_dir`, `train_meta`, `val_dir`, `val_meta`.
+  - Fill in the blanks of: `speakers`, `train_dir`, `train_meta`, `val_dir`, `val_meta`, `f0s_list_path`.
   - Example of speaker id list is shown in `datasets/metadata/libritts_vctk_speaker_list.txt`.
   - When replicating the two-stage training process from our paper (training with LibriTTS and then LibriTTS+VCTK), please put both list of speaker ids from LibriTTS and VCTK at global config.
+  - `f0s_list_path` is set to `f0s.txt` by default
 - `config/cota`: Configs for training Cotatron.
   - You may want to change: `batch_size` for GPUs other than 32GB V100, or change `chkpt_dir` to save checkpoints in other disk.
+  - You can also modify `use_attn_loss`, whether guided attention loss is used or not.
 - `config/vc`: Configs for training VC decoder.
   - Fill in the blank of: `cotatron_path`.
 
@@ -128,7 +135,22 @@ The optional checkpoint argument is also available for VC decoder.
 
 ### 3. GTA finetuning HiFi-GAN
 
-TBD
+Once the VC decoder is trained, finetune the HiFi-GAN with GTA finetuning.
+First, you should extract GTA mel-spectrograms from VC decoder.
+```bash
+python gta_extractor.py -c <path_to_global_config_yaml> <path_to_vc_config_yaml> \
+                        -p <checkpoint_path>
+```
+The GTA mel-spectrograms calculated from audio file will be saved as `**.wav.gta` at first,
+and then loaded from disk afterwards.
+
+Train/validation metadata of GTA mels will be saved in `datasets/gta_metadata/gta_<orignal_metadata_name>.txt`.
+You should use those metadata when finetuning HiFi-GAN.
+
+After extracting GTA mels, get into hifi-gan and follow manuals in [hifi-gan/README.md](https://github.com/wookladin/hifi-gan/blob/master/README.md)
+```bash
+cd hifi-gan
+```
 
 ### Monitoring via Tensorboard
 
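The `**.wav.gta` caching behavior described in the added lines above can be pictured with the following sketch. The function name and signature are hypothetical stand-ins, not the API of the repository's `gta_extractor.py`.

```python
import os

import torch


def get_gta_mel(wav_path, compute_gta_mel):
    """Compute-or-reuse pattern for GTA mels: the mel predicted for `x.wav`
    is cached next to it as `x.wav.gta` and loaded from disk on later runs.
    `compute_gta_mel` is a placeholder for the teacher-forced Cotatron +
    VC decoder forward pass."""
    gta_path = wav_path + '.gta'
    if os.path.exists(gta_path):
        return torch.load(gta_path)      # reuse the cached GTA mel
    mel = compute_gta_mel(wav_path)      # run the trained model once
    torch.save(mel, gta_path)            # cache for subsequent reads
    return mel
```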
@@ -152,7 +174,7 @@ You can convert it to speaker contained in trainset: which is any-to-many voice
 Note that speaker_id has no effect whether or not it is in the training set.
 3. Convert `datasets/inference_source/metadata_origin.txt` into ARPABET.
 ```bash
-python3 datasets/g2p.py -i datasets/inference_source/metadata_origin.txt \
+python datasets/g2p.py -i datasets/inference_source/metadata_origin.txt \
                         -o datasets/inference_source/metadata_g2p.txt
 ```
 4. Run [inference.ipynb](./inference.ipynb)
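As a rough picture of what the ARPABET conversion step produces, here is a generic example using the `g2p_en` package; the repository's `datasets/g2p.py` and its metadata format may differ, so treat this purely as an illustration.

```python
# Generic G2P illustration; the repo's datasets/g2p.py may use a different
# backend and output format, so this only shows the idea of the conversion.
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("voice conversion")
print(phonemes)  # a list of ARPABET symbols with stress markers, e.g. ['V', 'OY1', 'S', ...]
```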
@@ -168,11 +190,10 @@ Hence, the quality of the result may differ from the paper.*
 
 Here are some noteworthy details of implementation, which could not be included in our paper due to the lack of space:
 
-- Guided attention loss
-
-  We applied guided attention loss proposed in [DC-TTS](https://arxiv.org/abs/1710.08969).
-  It helped Cotatron's alignment learning stable and faster convergence.
-  See [utils/alignment_loss.py](./utils/alignment_loss.py).
+- Guided attention loss <br>
+  We applied guided attention loss proposed in [DC-TTS](https://arxiv.org/abs/1710.08969).
+  It helped Cotatron's alignment learning stable and faster convergence.
+  See [modules/alignment_loss.py](./modules/alignment_loss.py).
 
 ## License
 
@@ -193,8 +214,61 @@ If you have a question or any kind of inquiries, please contact Kang-wook Kim at
 
 
 ## Repository structure
-
-TBD
+```
+.
+├── LICENSE
+├── README.md
+├── cotatron.py
+├── cotatron_trainer.py      # Trainer file for Cotatron
+├── gta_extractor.py         # GTA mel spectrogram extractor
+├── inference.ipynb
+├── preprocess.py            # Extracting speakers' pitch range
+├── requirements.txt
+├── synthesizer.py
+├── synthesizer_trainer.py   # Trainer file for VC decoder (named as "synthesizer")
+├── config
+│   ├── cota
+│   │   └── default.yaml     # configuration template for Cotatron
+│   ├── global
+│   │   └── default.yaml     # configuration template for both Cotatron and VC decoder
+│   └── vc
+│       └── default.yaml     # configuration template for VC decoder
+├── datasets                 # TextMelDataset and text preprocessor
+│   ├── __init__.py
+│   ├── g2p.py               # Using G2P to convert metadata's transcription into ARPABET
+│   ├── resample.py          # Python file for audio resampling
+│   └── text_mel_dataset.py
+│   ├── inference_source
+│   │   (omitted)            # custom source speechs and transcriptions for inference.ipynb
+│   ├── metadata
+│   │   (ommited)            # Refer to README.md within the folder.
+│   └── text
+│       ├── __init__.py
+│       ├── cleaners.py
+│       ├── cmudict.py
+│       ├── numbers.py
+│       └── symbols.py
+├── docs                     # Audio samples and code for https://mindslab-ai.github.io/assem-vc/
+│   (omitted)
+├── hifi-gan                 # Modified HiFi-GAN vocoder (https://github.com/wookladin/hifi-gan)
+│   (omitted)
+├── modules                  # All modules that compose model, including mel.py
+│   ├── __init__.py
+│   ├── alignment_loss.py    # Guided attention loss
+│   ├── attention.py         # Implementation of DCA (https://arxiv.org/abs/1910.10288)
+│   ├── classifier.py
+│   ├── cond_bn.py
+│   ├── encoder.py
+│   ├── f0_encoder.py
+│   ├── mel.py               # Code for calculating mel-spectrogram from raw audio
+│   ├── tts_decoder.py
+│   ├── vc_decoder.py
+│   └── zoneout.py           # Zoneout LSTM
+└── utils                    # Misc. code snippets, usually for logging
+    ├── loggers.py
+    ├── plotting.py
+    └── utils.py
+```
 
 ## References
 
@@ -210,7 +284,7 @@ This implementation uses code from following repositories:
 This README was inspired by:
 - [Tips for Publishing Research Code](https://github.com/paperswithcode/releasing-research-code)
 
-The audio samples on our [webpage](https://mindslab-ai.github.io/cotatron/) are partially derived from:
+The audio samples on our [webpage](https://mindslab-ai.github.io/assem-vc/) are partially derived from:
 - [LibriTTS](https://arxiv.org/abs/1904.02882): Dataset for multispeaker TTS, derived from LibriSpeech.
-- [VCTK](https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html): 46 hours of English speech from 108 speakers.
+- [VCTK](https://datashare.ed.ac.uk/handle/10283/2651): 46 hours of English speech from 108 speakers.
 - [KSS](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset): Korean Single Speaker Speech Dataset.

config/cota/default.yaml (+2 -1)

@@ -11,7 +11,8 @@ train:
     end: 50000
   grad_clip: 1.0 # 0 for no gradient clipping
   teacher_force:
-    rate: 0.5
+    rate: 0.5 # 0.5 is the most stable value
+  use_attn_loss: True # Using guided attention loss (See README.md)
 ###########################
 log:
   chkpt_dir: 'chkpt/cota'

config/global/default.yaml (-6)

@@ -8,9 +8,6 @@ data:
   val_meta: '' # relative path of metadata file from val_dir
   f0s_list_path: '' # preprocessed f0 list
 ###########################
-train:
-  use_attn_loss: True
-###########################
 audio: # WARNING: this can't be changed.
   n_mel_channels: 80
   filter_length: 1024
@@ -30,9 +27,6 @@ chn:
   speaker:
     cnn: [32, 32, 64, 64, 128, 128]
     token: 256
-  # residual encoder
-  residual: [32, 32, 64, 64, 128, 128]
-  residual_out: 1
   # f0 encoder
   prenet_f0: 1
   # TTS decoder

cotatron.py (+7 -4)

@@ -13,7 +13,7 @@
 from modules import TextEncoder, TTSDecoder, SpeakerEncoder, SpkClassifier
 from datasets import TextMelDataset, text_mel_collate
 from datasets.text import Language
-from utils.alignment_loss import GuidedAttentionLoss
+from modules.alignment_loss import GuidedAttentionLoss
 
 
 class Cotatron(pl.LightningModule):
@@ -35,7 +35,10 @@ def __init__(self, hparams):
         self.is_val_first = True
 
         self.use_attn_loss = hp.train.use_attn_loss
-        self.attn_loss = GuidedAttentionLoss(20000, 0.25, 1.00025)
+        if self.use_attn_loss:
+            self.attn_loss = GuidedAttentionLoss(20000, 0.25, 1.00025)
+        else:
+            self.attn_loss = None
 
     def forward(self, text, mel_target, speakers, input_lengths, output_lengths, max_input_len,
                 prenet_dropout=0.5, no_mask=False, tfrate=True):
@@ -59,7 +62,7 @@ def inference(self, text, mel_target):
         decoder_input = torch.cat((text_encoding, speaker_emb_rep), dim=2)
         _, mel_postnet, alignment = \
             self.decoder(mel_target, decoder_input, in_len, out_len, in_len,
-                         prenet_dropout=0.0, no_mask=True, tfrate=False)
+                         prenet_dropout=0.5, no_mask=True, tfrate=False)
         return mel_postnet, alignment
 
     def training_step(self, batch, batch_idx):
@@ -85,7 +88,7 @@ def validation_step(self, batch, batch_idx):
         text, mel_target, speakers, input_lengths, output_lengths, max_input_len, _ = batch
         speaker_emb, mel_pred, mel_postnet, alignment = \
             self.forward(text, mel_target, speakers, input_lengths, output_lengths, max_input_len,
-                         prenet_dropout=0.0, tfrate=False)
+                         prenet_dropout=0.5, tfrate=False)
         speaker_prob = self.classifier(speaker_emb)
         classifier_loss = F.nll_loss(speaker_prob, speakers)
 
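For context on the `GuidedAttentionLoss` that this hunk makes optional: the README points to DC-TTS style guided attention, which penalizes attention mass far from the text-audio diagonal. Below is a minimal, generic sketch of that idea in PyTorch. The actual `modules/alignment_loss.py`, and the meaning of the constructor arguments `(20000, 0.25, 1.00025)`, are not shown in this diff, so this function is an assumption-labeled illustration rather than the project's implementation.

```python
import torch


def guided_attention_loss(attention, input_lengths, output_lengths, sigma=0.25):
    """Generic DC-TTS style guided attention penalty (not the repo's class).

    attention: (batch, T_out, T_in) soft alignment from the TTS decoder.
    The weight W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2 * sigma^2)) is near zero
    on the diagonal and grows as the alignment drifts away from it.
    """
    batch_size = attention.size(0)
    loss = attention.new_zeros(())
    for b in range(batch_size):
        N = int(input_lengths[b])
        T = int(output_lengths[b])
        n = torch.arange(N, device=attention.device, dtype=torch.float32) / N  # (N,)
        t = torch.arange(T, device=attention.device, dtype=torch.float32) / T  # (T,)
        # soft diagonal mask over the valid (T, N) region of this sample
        w = 1.0 - torch.exp(-((n.unsqueeze(0) - t.unsqueeze(1)) ** 2) / (2.0 * sigma ** 2))
        loss = loss + (attention[b, :T, :N] * w).mean()
    return loss / batch_size
```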
docs/index.html (+2 -2)

@@ -3,9 +3,9 @@
 <title>Assem-VC Demo</title>
 </head>
 <body><h2>Audio Samples from "Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques"</h2>
-<p><b>Paper(will updated):</b> <a href="https://arxiv.org/abs/2104.00931">arXiv:2104.00931</a> (Submitted to INTERSPEECH 2021)</p>
+<p><b>Paper:</b> <a href="https://arxiv.org/abs/2104.00931">arXiv:2104.00931</a></p>
 <p><b>Repository:</b> <a href="https://github.com/mindslab-ai/assem-vc">mindslab-ai/assem-vc @ GitHub<iframe src="https://ghbtns.com/github-btn.html?user=mindslab-ai&repo=assem-vc&type=star&count=true" frameborder="0" scrolling="0" width="150" height="20" title="GitHub"></iframe></a>
-<p><strong>Authors: </strong>Kang-wook Kim, Seung-won Park, Myun-chul Joe @<a href="https://mindslab.ai">MINDsLab Inc.</a>, SNU</p>
+<p><strong>Authors: </strong>Kang-wook Kim, Seung-won Park, Myun-chul Joe @<a href="https://maum.ai">MINDsLab Inc.</a>, SNU</p>
 <p><strong>Abstract: </strong>In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models.</p>
 <hr>
 
gta_extractor.py (+1 -2)

@@ -65,11 +65,10 @@ def extract_and_write_meta(self, mode):
             temp_meta = self.extract_gta_mels(batch, mode)
             meta_list.extend(temp_meta)
 
-        root_dir = self.hp.data.train_dir if mode == 'train' else self.hp.data.val_dir
         meta_path = self.hp.data.train_meta if mode == 'train' else self.hp.data.val_meta
         meta_filename = os.path.basename(meta_path)
         new_meta_filename = 'gta_' + meta_filename
-        new_meta_path = os.path.join(root_dir, META_DIR, new_meta_filename)
+        new_meta_path = os.path.join('datasets', META_DIR, new_meta_filename)
 
         os.makedirs(os.path.join('datasets', META_DIR), exist_ok=True)
         with open(new_meta_path, 'w', encoding='utf-8') as f:
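This hunk is likely the "Bugfix" half of the merge title: the removed line rooted the metadata path at `train_dir`/`val_dir`, while the `os.makedirs` call just below creates `datasets/gta_metadata/`, so the metadata file could be written into a directory that was never created. With the new line, both paths agree. A tiny sketch of the resulting path (the metadata filename here is hypothetical):

```python
import os

META_DIR = 'gta_metadata'
meta_filename = os.path.basename('metadata/train_metadata.txt')  # hypothetical metadata file
new_meta_path = os.path.join('datasets', META_DIR, 'gta_' + meta_filename)
print(new_meta_path)  # datasets/gta_metadata/gta_train_metadata.txt
```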
File renamed without changes.

synthesizer.py (+3 -3)

@@ -73,7 +73,7 @@ def inference(self, text, mel_source, mel_reference, f0_padded):
         decoder_input = torch.cat((text_encoding, z_s_repeated), dim=2)
         _, _, alignment = \
             self.cotatron.decoder(mel_source, decoder_input, in_len, out_len, in_len,
-                                  prenet_dropout=0.0, no_mask=True, tfrate=False)
+                                  prenet_dropout=0.5, no_mask=True, tfrate=False)
         ling_s = torch.bmm(alignment, text_encoding)
         ling_s = ling_s.transpose(1, 2)
 
@@ -95,7 +95,7 @@ def inference_from_z_t(self, text, mel_source, z_t):
         decoder_input = torch.cat((text_encoding, z_s_repeated), dim=2)
         _, _, alignment = \
             self.cotatron.decoder(mel_source, decoder_input, in_len, out_len, in_len,
-                                  prenet_dropout=0.0, no_mask=True, tfrate=False)
+                                  prenet_dropout=0.5, no_mask=True, tfrate=False)
         ling_s = torch.bmm(alignment, text_encoding)
         ling_s = ling_s.transpose(1, 2)
 
@@ -147,7 +147,7 @@ def validation_step(self, batch, batch_idx):
 
         if self.is_val_first:
             self.is_val_first = False
-            self.logger.log_figures(mel_source, mel_s_s, mel_s_t, alignment, residual, self.global_step)
+            self.logger.log_figures(mel_source, mel_s_s, mel_s_t, alignment, f0_padded, self.global_step)
 
         return {'loss_rec': loss_rec}
 
utils/plotting.py (-1)

@@ -93,7 +93,6 @@ def plot_residual(residual):
     for feat in residual:
         plt.plot(range(len(feat)), feat)
     plt.xlim(0, len(residual[0]))
-    #plt.ylim(-1.0, 1.0)
     plt.xlabel('time frames')
     plt.ylabel('residual info')
     plt.subplots_adjust(bottom=0.1, right=0.88, top=0.9)
