
Commit 3484fe6 ("Initial")

0 parents, 15 files changed, +14293 -0 lines

LICENSE

+21
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Jungil Kong

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

LJSpeech-1.1/training.txt

+12,950
Large diffs are not rendered by default.

LJSpeech-1.1/validation.txt

+150
Large diffs are not rendered by default.

README.md

+85
@@ -0,0 +1,85 @@
# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

### Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our [paper](https://arxiv.org/abs/2010.05646),
we proposed HiFi-GAN: a GAN-based model capable of generating high-fidelity speech efficiently.<br/>
We provide our implementation and pretrained models as open source in this repository.

**Abstract:**
Several recent studies on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
Although such methods improve the sampling efficiency and memory usage,
their sample quality has not yet reached that of autoregressive and flow-based generative models.
In this study, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality.
A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method
demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than
real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen
speakers and to end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times
faster than real-time on CPU with quality comparable to an autoregressive counterpart.

Visit our [demo website](https://jik876.github.io/hifi-gan-demo/) for audio samples.


## Pre-requisites
1. Python >= 3.6
2. Clone this repository.
3. Install Python requirements. Please refer to [requirements.txt](requirements.txt).
4. Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/),
and move all wav files to `LJSpeech-1.1/wavs` (see the setup sketch below).
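
A typical setup sequence is sketched below. The repository URL and the LJSpeech archive name are assumptions about the project's public hosting and the dataset download; adjust them to your environment.
```
git clone https://github.com/jik876/hifi-gan
cd hifi-gan
pip install -r requirements.txt
# Download LJSpeech-1.1 from the dataset page above, then extract it here;
# the extracted archive already places the audio under LJSpeech-1.1/wavs.
tar -xjf LJSpeech-1.1.tar.bz2
```
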
## Training
```
python train.py --config config_v1.json
```
To train the V2 or V3 generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.<br>
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.<br>
You can change the path by adding the `--checkpoint_path` option.
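For example, to train the V2 generator and keep its checkpoints in a separate directory (the directory name below is only an illustration):
```
python train.py --config config_v2.json --checkpoint_path cp_hifigan_v2
```
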
## Pretrained Model
You can also use the pretrained models we provide.<br/>
[Download pretrained models](https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y?usp=sharing)<br/>
Details of each folder are as follows:

|Folder Name|Generator|Dataset|Fine-Tuned|
|------|---|---|---|
|LJ_V1|V1|LJSpeech|No|
|LJ_V2|V2|LJSpeech|No|
|LJ_V3|V3|LJSpeech|No|
|LJ_FT_T2_V1|V1|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V2|V2|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V3|V3|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|VCTK_V1|V1|VCTK|No|
|VCTK_V2|V2|VCTK|No|
|VCTK_V3|V3|VCTK|No|

## Inference from wav file
1. Make a `test_files` directory and copy wav files into the directory.
2. Run the following command.
```
python inference.py --checkpoint_file [generator checkpoint file path]
```
Generated wav files are saved in `generated_files` by default.<br>
You can change the path by adding the `--output_dir` option.
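If you use one of the pretrained generators listed above, note that `inference.py` reads the model configuration from a `config.json` located in the same directory as the checkpoint file, so keep the downloaded folder together. A hedged example; the checkpoint file name inside the folder is illustrative:
```
python inference.py --checkpoint_file LJ_FT_T2_V2/generator --output_dir generated_files_lj
```
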
## Inference for end-to-end speech synthesis
1. Make a `test_mel_files` directory and copy generated mel-spectrogram files into the directory.<br>
You can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2),
[Glow-TTS](https://github.com/jaywalnut310/glow-tts) and so forth.
2. Run the following command.
```
python inference_e2e.py --checkpoint_file [generator checkpoint file path]
```
Generated wav files are saved in `generated_files_from_mel` by default.<br>
You can change the path by adding the `--output_dir` option.
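If you just want to sanity-check this step without a TTS model, you can write a mel file computed from ground-truth audio using this repository's own `mel_spectrogram`. A minimal sketch, assuming `inference_e2e.py` accepts mel arrays saved with numpy (`.npy`); check that script for the exact format it expects, and note that the input wav path is illustrative:
```
# Sketch (not part of the repo): write a ground-truth mel to test_mel_files as a
# quick pipeline check. Assumes inference_e2e.py reads .npy mel arrays.
import json
import os
import numpy as np
import torch
from env import AttrDict
from meldataset import mel_spectrogram, load_wav, MAX_WAV_VALUE

with open('config_v1.json') as f:
    h = AttrDict(json.load(f))

wav, sr = load_wav('LJSpeech-1.1/wavs/LJ001-0001.wav')  # illustrative input file
wav = torch.FloatTensor(wav / MAX_WAV_VALUE).unsqueeze(0)
mel = mel_spectrogram(wav, h.n_fft, h.num_mels, h.sampling_rate,
                      h.hop_size, h.win_size, h.fmin, h.fmax)

os.makedirs('test_mel_files', exist_ok=True)
np.save('test_mel_files/LJ001-0001.npy', mel.squeeze(0).cpu().numpy())
```
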
## Acknowledgements
We referred to [WaveGlow](https://github.com/NVIDIA/waveglow), [MelGAN](https://github.com/descriptinc/melgan-neurips)
and [Tacotron2](https://github.com/NVIDIA/tacotron2) to implement this.

config_v1.json

+37
@@ -0,0 +1,37 @@
{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 16,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,

    "upsample_rates": [8,8,2,2],
    "upsample_kernel_sizes": [16,16,4,4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],

    "segment_size": 8192,
    "num_mels": 80,
    "num_freq": 1025,
    "n_fft": 1024,
    "hop_size": 256,
    "win_size": 1024,

    "sampling_rate": 22050,

    "fmin": 0,
    "fmax": 8000,
    "fmax_for_loss": null,

    "num_workers": 4,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}
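
One consistency constraint worth knowing when editing these configs: the generator expands each mel frame into `hop_size` waveform samples, so `upsample_rates` must multiply out to `hop_size` (here 8*8*2*2 = 256; config_v3 uses 8*8*4 = 256). A minimal sanity-check sketch, not part of the repository:
```
# Minimal check (not part of the repo): the product of upsample_rates must
# equal hop_size, because one mel frame becomes hop_size waveform samples.
import json
from functools import reduce
from operator import mul

for name in ('config_v1.json', 'config_v2.json', 'config_v3.json'):
    with open(name) as f:
        h = json.load(f)
    assert reduce(mul, h['upsample_rates']) == h['hop_size'], name
    print(name, 'ok:', h['upsample_rates'], '->', h['hop_size'])
```
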

config_v2.json

+37
@@ -0,0 +1,37 @@
{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 16,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,

    "upsample_rates": [8,8,2,2],
    "upsample_kernel_sizes": [16,16,4,4],
    "upsample_initial_channel": 128,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],

    "segment_size": 8192,
    "num_mels": 80,
    "num_freq": 1025,
    "n_fft": 1024,
    "hop_size": 256,
    "win_size": 1024,

    "sampling_rate": 22050,

    "fmin": 0,
    "fmax": 8000,
    "fmax_for_loss": null,

    "num_workers": 4,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}

config_v3.json

+37
@@ -0,0 +1,37 @@
{
    "resblock": "2",
    "num_gpus": 0,
    "batch_size": 16,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,

    "upsample_rates": [8,8,4],
    "upsample_kernel_sizes": [16,16,8],
    "upsample_initial_channel": 256,
    "resblock_kernel_sizes": [3,5,7],
    "resblock_dilation_sizes": [[1,2], [2,6], [3,12]],

    "segment_size": 8192,
    "num_mels": 80,
    "num_freq": 1025,
    "n_fft": 1024,
    "hop_size": 256,
    "win_size": 1024,

    "sampling_rate": 22050,

    "fmin": 0,
    "fmax": 8000,
    "fmax_for_loss": null,

    "num_workers": 4,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}

env.py

+15
@@ -0,0 +1,15 @@
import os
import shutil


class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self


def build_env(config, config_name, path):
    t_path = os.path.join(path, config_name)
    if config != t_path:
        os.makedirs(path, exist_ok=True)
        shutil.copyfile(config, os.path.join(path, config_name))
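
A brief usage sketch, not part of the repository: `AttrDict` lets the rest of the code read config values as attributes, and `build_env` copies the chosen config into the checkpoint directory, which is how a `config.json` ends up next to the checkpoints for `inference.py` to find. File and directory names below are illustrative.
```
# Usage sketch (illustrative paths): read a config as attributes and copy it
# into the checkpoint directory as config.json.
import json
from env import AttrDict, build_env

with open('config_v1.json') as f:
    h = AttrDict(json.load(f))

print(h.sampling_rate, h.hop_size)  # attribute access instead of h['sampling_rate']
build_env('config_v1.json', 'config.json', 'cp_hifigan')  # creates cp_hifigan/config.json
```
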

inference.py

+95
@@ -0,0 +1,95 @@
from __future__ import absolute_import, division, print_function, unicode_literals

import glob
import os
import argparse
import json
import torch
from scipy.io.wavfile import write
from env import AttrDict
from meldataset import mel_spectrogram, MAX_WAV_VALUE, load_wav
from models import Generator

h = None
device = None


def load_checkpoint(filepath, device):
    assert os.path.isfile(filepath)
    print("Loading '{}'".format(filepath))
    checkpoint_dict = torch.load(filepath, map_location=device)
    print("Complete.")
    return checkpoint_dict


def get_mel(x):
    return mel_spectrogram(x, h.n_fft, h.num_mels, h.sampling_rate, h.hop_size, h.win_size, h.fmin, h.fmax)


def scan_checkpoint(cp_dir, prefix):
    pattern = os.path.join(cp_dir, prefix + '*')
    cp_list = glob.glob(pattern)
    if len(cp_list) == 0:
        return ''
    return sorted(cp_list)[-1]


def inference(a):
    generator = Generator(h).to(device)

    state_dict_g = load_checkpoint(a.checkpoint_file, device)
    generator.load_state_dict(state_dict_g['generator'])

    filelist = os.listdir(a.input_wavs_dir)

    os.makedirs(a.output_dir, exist_ok=True)

    generator.eval()
    generator.remove_weight_norm()
    with torch.no_grad():
        for i, filname in enumerate(filelist):
            wav, sr = load_wav(os.path.join(a.input_wavs_dir, filname))
            wav = wav / MAX_WAV_VALUE
            wav = torch.FloatTensor(wav).to(device)
            x = get_mel(wav.unsqueeze(0))
            y_g_hat = generator(x)
            audio = y_g_hat.squeeze()
            audio = audio * MAX_WAV_VALUE
            audio = audio.cpu().numpy().astype('int16')

            output_file = os.path.join(a.output_dir, os.path.splitext(filname)[0] + '_generated.wav')
            write(output_file, h.sampling_rate, audio)
            print(output_file)


def main():
    print('Initializing Inference Process..')

    parser = argparse.ArgumentParser()
    parser.add_argument('--input_wavs_dir', default='test_files')
    parser.add_argument('--output_dir', default='generated_files')
    parser.add_argument('--checkpoint_file', required=True)
    a = parser.parse_args()

    config_file = os.path.join(os.path.split(a.checkpoint_file)[0], 'config.json')
    with open(config_file) as f:
        data = f.read()

    global h
    json_config = json.loads(data)
    h = AttrDict(json_config)

    torch.manual_seed(h.seed)
    global device
    if torch.cuda.is_available():
        torch.cuda.manual_seed(h.seed)
        device = torch.device('cuda')
    else:
        device = torch.device('cpu')

    inference(a)


if __name__ == '__main__':
    main()
