WeDataset: List of (OpenSource data) + (Crawler Resources) #2094


Closed
xingchensong opened this issue Oct 31, 2023 · 6 comments
Labels
future plan help wanted Extra attention is needed Stale

Comments

@xingchensong
Member

xingchensong commented Oct 31, 2023

A running list of open-source datasets and crawler sources, continuously updated. Additions and edits are welcome.

@xingchensong
Member Author

xingchensong commented Oct 31, 2023

Datasets for Speech Recognition (ASR) / Speech Translation (ST)

Chinese

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| THCHS-30 | 30 | https://openslr.org/18/ | |
| Aishell | 150 | https://openslr.org/33/ | |
| ST-CMDS | 110 | https://openslr.org/38/ | |
| Primewords | 99 | https://openslr.org/47/ | |
| aidatatang | 200 | https://openslr.org/62/ | |
| MagicData | 755 | https://openslr.org/68/ | |
| ASR&SD | 160 | http://ncmmsc2021.org/competition2.html | if available |
| Aishell2 | 1000 | http://www.aishelltech.com/aishell_2 | if available |
| TAL ASR | 100 | https://ai.100tal.com/dataset | |
| Common Voice | 63 | https://commonvoice.mozilla.org/zh-CN/datasets | Common Voice Corpus 7.0 |
| ASRU2019 ASR | 500 | https://www.datatang.com/competition | if available |
| 2021 SLT CSRC | 398 | https://www.data-baker.com/csrc_challenge.html | if available |
| aidatatang_1505zh | 1505 | https://datatang.com/opensource | if available |
| WenetSpeech | 10000 | https://github.com/wenet-e2e/WenetSpeech | |
| KeSpeech | 1542 | https://openreview.net/forum?id=b3Zoeq2sCLq | speech recognition, speaker verification, subdialect identification, voice conversion |
| MagicData-RAMC | 180 | https://arxiv.org/pdf/2203.16844.pdf | conversational speech data recorded from native speakers of Mandarin Chinese |
| Mandarin Heavy Accent Conversational Speech Corpus | 58.78 | https://magichub.com/datasets/mandarin-heavy-accent-conversational-speech-corpus/ | |
| Free ST Chinese Mandarin Corpus | - | https://openslr.org/38/ | |
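
For a rough sense of scale, the hours listed in the Chinese table above can be tallied with a few lines of Python. This is only a sketch: the names `CHINESE_ASR` and `total_hours` are made up for illustration, the rows are copied by hand from the table, and the sum is a nominal figure that ignores overlap between corpora, licensing, and the "if available" caveats.

```python
# Rows copied by hand from the Chinese ASR table: (name, duration in hours).
# "-" marks a corpus whose duration is not listed in the table.
CHINESE_ASR = [
    ("THCHS-30", "30"),
    ("Aishell", "150"),
    ("ST-CMDS", "110"),
    ("Primewords", "99"),
    ("aidatatang", "200"),
    ("MagicData", "755"),
    ("ASR&SD", "160"),
    ("Aishell2", "1000"),
    ("TAL ASR", "100"),
    ("Common Voice", "63"),
    ("ASRU2019 ASR", "500"),
    ("2021 SLT CSRC", "398"),
    ("aidatatang_1505zh", "1505"),
    ("WenetSpeech", "10000"),
    ("KeSpeech", "1542"),
    ("MagicData-RAMC", "180"),
    ("Mandarin Heavy Accent Conversational Speech Corpus", "58.78"),
    ("Free ST Chinese Mandarin Corpus", "-"),
]


def total_hours(rows):
    """Sum the listed durations, skipping rows whose duration is "-"."""
    total = 0.0
    for _name, duration in rows:
        if duration != "-":
            total += float(duration)
    return total


if __name__ == "__main__":
    print(f"Nominal total: {total_hours(CHINESE_ASR):.2f} h")
```

The same helper works for any of the other tables once their rows are copied into a list of `(name, duration)` pairs.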

English

| name | duration/h | speakers | address | remark |
| --- | --- | --- | --- | --- |
| Common Voice | 2015 | - | https://commonvoice.mozilla.org/zh-CN/datasets | Common Voice Corpus 7.0, Narrated Wikipedia; CC0-1.0 |
| LibriSpeech | 960 | 2480 | https://openslr.org/12/ | Audiobooks; CC-BY-4.0 |
| ST-AEDS-20180100 | 4.7 | - | http://www.openslr.org/45/ | |
| TED-LIUM Release 3 | 430 | 2030 | https://openslr.org/51/ | TED talks; CC-BY-NC-ND 3.0 |
| Multilingual LibriSpeech | 44659 | - | https://openslr.org/94/ | limited supervision |
| SPGISpeech | 5000 | - | https://datasets.kensho.com/datasets/scribe | if available |
| Speech Commands | 10 | - | https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data | |
| 2020AESRC | 160 | - | https://datatang.com/INTERSPEECH2020 | if available |
| GigaSpeech | 10000 | - | https://github.com/SpeechColab/GigaSpeech | Audiobook, podcast, YouTube; apache-2.0 |
| The People’s Speech | 31400 | - | https://openreview.net/pdf?id=R8CwidgJ0yT | Government, interviews; CC-BY-SA-4.0 |
| Earnings-21 | 39 | - | https://arxiv.org/abs/2104.11348 | |
| VoxPopuli | 24100+543 | 1310 | https://arxiv.org/pdf/2101.00390.pdf, github | 24100 (unlabeled), 543 (transcribed), European Parliament; CC0 |
| CMU Wilderness Multilingual Speech Dataset | 13 | - | http://festvox.org/cmu_wilderness/ | Multilingual |
| How-2 Dataset | 2000 | - | https://github.com/srvk/how2-dataset | 2000 (English ASR), 300 (English->Portuguese ST); Creative Commons BY-SA 4.0 |
| AMI | 100 | - | https://openslr.org/16/ | meetings; CC-BY-4.0 |
| SwitchBoard | 260 | 540 | https://catalog.ldc.upenn.edu/LDC97S62 | Telephone conversations; LDC |
| Fisher | 1960 | 11917 | https://catalog.ldc.upenn.edu/LDC2004T19 | telephone conversations; LDC |

Chinese-English

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| SEAME | 30 | https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2010/i10_1986.pdf | |
| TAL CSASR | 587 | https://ai.100tal.com/dataset | |
| ASRU2019 CSASR | 200 | https://www.datatang.com/competition | if available |
| ASCEND | 10.62 | https://arxiv.org/pdf/2112.06223.pdf | |

Japanese (ja-JP)

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| Common Voice | 26 | https://commonvoice.mozilla.org/zh-CN/datasets | Common Voice Corpus 7.0 |
| Japanese_Scripted_Speech_Corpus_Daily_Use_Sentence | 18 | https://magichub.io/cn/datasets/japanese-scripted-speech-corpus-daily-use-sentence/ | |
| LaboroTVSpeech | 2000 | https://arxiv.org/pdf/2103.14736.pdf | |
| CSJ | 650 | https://github.com/kaldi-asr/kaldi/tree/master/egs/csj | |
| JTubeSpeech | 1300 | https://arxiv.org/pdf/2112.09323.pdf | |

Korean (ko-KR)

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| korean-scripted-speech-corpus-daily-use-sentence | 4.3 | https://magichub.io/cn/datasets/korean-scripted-speech-corpus-daily-use-sentence/ | |
| korean-conversational-speech-corpus | 5.22 | https://magichub.io/cn/datasets/korean-conversational-speech-corpus/ | |

Russian (ru-RU)

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| Common Voice | 148 | https://commonvoice.mozilla.org/zh-CN/datasets | Common Voice Corpus 7.0 |
| OpenSTT | 20000 | https://arxiv.org/pdf/2006.08274.pdf | limited supervision |

French (fr-FR)

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| MediaSpeech | 10 | https://arxiv.org/pdf/2103.16193.pdf | ASR system evaluation dataset |

Spanish (es-ES)

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| MediaSpeech | 10 | https://arxiv.org/pdf/2103.16193.pdf | ASR system evaluation dataset |

Turkish (tr-TR)

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| MediaSpeech | 10 | https://arxiv.org/pdf/2103.16193.pdf | ASR system evaluation dataset |

Arabic (ar)

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| MediaSpeech | 10 | https://arxiv.org/pdf/2103.16193.pdf | ASR system evaluation dataset |

Noise & Nonspeech

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| MUSAN | - | https://openslr.org/17/ | |
| Room Impulse Response and Noise Database | - | https://openslr.org/28/ | |
| AudioSet | - | https://ieeexplore.ieee.org/document/7952261 | |

@xingchensong
Member Author

Datasets for Speech Synthesis

Chinese

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| Aishell3 | 85 | https://openslr.org/93/ | |
| Opencpop | - | https://wenet.org.cn/opencpop/download/ | Singing Voice Synthesis |

English

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| Hi-Fi Multi-Speaker English TTS Dataset | 291.6 | https://openslr.org/109/ | |
| LibriTTS corpus | 585 | https://openslr.org/60/ | |
| Speechocean762 | - | https://www.openslr.org/101/ | |
| RyanSpeech | 10 | http://mohammadmahoor.com/ryanspeech/ | |

@xingchensong
Member Author

Datasets for Speech Recognition & Speaker Diarization

Chinese

| name | duration/h | address | remark | application |
| --- | --- | --- | --- | --- |
| Aishell4 | 120 | https://openslr.org/111/ | 8-channel, conference scenarios | speech recognition, speaker diarization |
| ASR&SD | 160 | http://ncmmsc2021.org/competition2.html | if available | speech recognition, speaker diarization |
| zhijiangcup | - | https://zhijiangcup.zhejianglab.com/zhijiang/match/details/id/6.html | if available | speech recognition, speaker diarization |
| M2MET | 120 | https://arxiv.org/pdf/2110.07393.pdf | 8-channel, conference scenarios | speech recognition, speaker diarization |

English

| name | duration/h | address | remark | application |
| --- | --- | --- | --- | --- |
| CHiME-6 | - | https://chimechallenge.github.io/chime6/download.html | if available | speech recognition, speaker diarization |

@xingchensong
Member Author

Datasets for Speaker Recognition

Chinese

| name | duration/h | address | remark | application |
| --- | --- | --- | --- | --- |
| CN-Celeb | - | https://openslr.org/82/ | | |
| KeSpeech | 1542 | https://openreview.net/forum?id=b3Zoeq2sCLq | | speech recognition, speaker verification, subdialect identification, voice conversion |
| MTASS | 55.6 | https://github.com/Windstudent/Complex-MTASSNet | | |
| THCHS-30 | 40 | http://www.openslr.org/18/ | | |

English

| name | duration/h | address | remark |
| --- | --- | --- | --- |
| VoxCeleb Data | - | http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ | |

@xingchensong
Member Author

xingchensong commented Oct 31, 2023

Crawler Resources

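| name | type | address | remark | application |
| --- | --- | --- | --- | --- |
| voicetube | video | https://tw.voicetube.com/ | A Taiwanese online English-learning platform; every video comes with subtitles in English and in the user's native language (usually Chinese) | |
| Chinese-Podcasts | collection of video & podcast | https://github.com/alaskasquirrel/Chinese-Podcasts | A curated collection of Chinese-language videos, podcasts, radio stations, and more | |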

@Mddct
Collaborator

Mddct commented Dec 26, 2023

github-actions bot added the Stale label Feb 25, 2024
github-actions bot closed this as completed Mar 4, 2024
Mddct reopened this Mar 4, 2024
github-actions bot removed the Stale label Mar 5, 2024
github-actions bot added the Stale label May 4, 2024
Mddct reopened this May 11, 2024
github-actions bot removed the Stale label May 12, 2024
github-actions bot added the Stale label Jul 11, 2024