WeDataset: List of (OpenSource data) + (Crawler Resources) #2094

xingchensong · 2023-10-31T07:57:35Z

A running list of open-source datasets and crawler sources, continuously updated. Additions and edits are welcome.

xingchensong · 2023-10-31T08:02:14Z

Datasets for Speech Recognition (ASR) / Speech Translation (ST)

Chinese

name	duration/h	address	remark
THCHS-30	30	https://openslr.org/18/
Aishell	150	https://openslr.org/33/
ST-CMDS	110	https://openslr.org/38/
Primewords	99	https://openslr.org/47/
aidatatang	200	https://openslr.org/62/
MagicData	755	https://openslr.org/68/
ASR&SD	160	http://ncmmsc2021.org/competition2.html	if available
Aishell2	1000	http://www.aishelltech.com/aishell_2	if available
TAL ASR	100	https://ai.100tal.com/dataset
Common Voice	63	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0
ASRU2019 ASR	500	https://www.datatang.com/competition	if available
2021 SLT CSRC	398	https://www.data-baker.com/csrc_challenge.html	if available
aidatatang_1505zh	1505	https://datatang.com/opensource	if available
WenetSpeech	10000	https://github.com/wenet-e2e/WenetSpeech
KeSpeech	1542	https://openreview.net/forum?id=b3Zoeq2sCLq	speech recognition, speaker verification, subdialect identification, voice conversion
MagicData-RAMC	180	https://arxiv.org/pdf/2203.16844.pdf	conversational speech data recorded from native speakers of Mandarin Chinese
Mandarin Heavy Accent Conversational Speech Corpus	58.78	https://magichub.com/datasets/mandarin-heavy-accent-conversational-speech-corpus/
Free ST Chinese Mandarin Corpus	-	https://openslr.org/38/

English

name	duration/h	speakers	address	remark
Common Voice	2015	-	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0, Narrated Wikipedia; CC0-1.0
LibriSpeech	960	2480	https://openslr.org/12/	Audiobooks; CC-BY-4.0
ST-AEDS-20180100	4.7	-	http://www.openslr.org/45/
TED-LIUM Release 3	430	2030	https://openslr.org/51/	TED talks; CC-BY-NC-ND 3.0
Multilingual LibriSpeech	44659	-	https://openslr.org/94/	limited supervision
SPGISpeech	5000	-	https://datasets.kensho.com/datasets/scribe	if available
Speech Commands	10	-	https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data
2020AESRC	160	-	https://datatang.com/INTERSPEECH2020	if available
GigaSpeech	10000	-	https://github.com/SpeechColab/GigaSpeech	Audiobook, podcast, YouTube; apache-2.0
The People’s Speech	31400	-	https://openreview.net/pdf?id=R8CwidgJ0yT	Government, interviews; CC-BY-SA-4.0
Earnings-21	39	-	https://arxiv.org/abs/2104.11348
VoxPopuli	24100+543	1310	https://arxiv.org/pdf/2101.00390.pdf, github	24100(unlabeled), 543(transcribed), European Parliament; CC0
CMU Wilderness Multilingual Speech Dataset	13	-	http://festvox.org/cmu_wilderness/	Multilingual
How-2 Dataset	2000	-	https://github.com/srvk/how2-dataset	2000(english asr) 300(english->portuguese st); Creative Commons BY-SA 4.0
AMI	100	-	https://openslr.org/16/	meetings; CC-BY-4.0
SwitchBoard	260	540	https://catalog.ldc.upenn.edu/LDC97S62	Telephone conversations; LDC
Fisher	1960	11917	https://catalog.ldc.upenn.edu/LDC2004T19	telephone conversations; LDC

Chinese-English

name	duration/h	address	remark
SEAME	30	https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2010/i10_1986.pdf
TAL CSASR	587	https://ai.100tal.com/dataset
ASRU2019 CSASR	200	https://www.datatang.com/competition	if available
ASCEND	10.62	https://arxiv.org/pdf/2112.06223.pdf

Japanese (ja-JP)

name	duration/h	address	remark
Common Voice	26	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0
Japanese_Scripted_Speech_Corpus_Daily_Use_Sentence	18	https://magichub.io/cn/datasets/japanese-scripted-speech-corpus-daily-use-sentence/
LaboroTVSpeech	2000	https://arxiv.org/pdf/2103.14736.pdf
CSJ	650	https://github.com/kaldi-asr/kaldi/tree/master/egs/csj
JTubeSpeech	1300	https://arxiv.org/pdf/2112.09323.pdf

Korean (ko-KR)

name	duration/h	address	remark
korean-scripted-speech-corpus-daily-use-sentence	4.3	https://magichub.io/cn/datasets/korean-scripted-speech-corpus-daily-use-sentence/
korean-conversational-speech-corpus	5.22	https://magichub.io/cn/datasets/korean-conversational-speech-corpus/

Russian (ru-RU)

name	duration/h	address	remark
Common Voice	148	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0
OpenSTT	20000	https://arxiv.org/pdf/2006.08274.pdf	limited supervision

French (fr-FR)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

Spanish (es-ES)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

Turkish (tr-TR)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

Arabic (ar)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

Noise & Non-speech

name	duration/h	address
MUSAN	-	https://openslr.org/17/
Room Impulse Response and Noise Database	-	https://openslr.org/28/
AudioSet	-	https://ieeexplore.ieee.org/document/7952261
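
The tables above are tab-separated, with `-` for unknown durations and some rows missing trailing cells. As a minimal sketch (not part of the original list), the snippet below loads such a table and totals the known hours. The helper names `parse_hours` and `load_table` are hypothetical, and the sample rows are copied from the Chinese table above.

```python
import csv
import io

# Three rows copied from the Chinese ASR table above (tab-separated).
TABLE = (
    "name\tduration/h\taddress\tremark\n"
    "THCHS-30\t30\thttps://openslr.org/18/\n"
    "Aishell\t150\thttps://openslr.org/33/\n"
    "Free ST Chinese Mandarin Corpus\t-\thttps://openslr.org/38/\n"
)


def parse_hours(cell):
    """Return hours as a float, or None when the cell is '-' or empty.

    Cells written as a sum, such as '24100+543' (VoxPopuli), are added up.
    """
    cell = cell.strip()
    if cell in ("", "-"):
        return None
    return sum(float(part) for part in cell.split("+"))


def load_table(text):
    """Parse a tab-separated table into a list of dicts keyed by the header."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = []
    for fields in reader:
        # Pad short rows: trailing cells such as 'remark' are often missing.
        fields = fields + [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


rows = load_table(TABLE)
hours = [parse_hours(row["duration/h"]) for row in rows]
total = sum(h for h in hours if h is not None)
print(f"{total} hours across {len(rows)} entries")
```

Rows with an unknown duration are skipped rather than counted as zero, so the total is a lower bound on the listed data.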

xingchensong · 2023-10-31T08:02:40Z

Datasets for Speech Synthesis

Chinese

name	duration/h	address	remark
Aishell3	85	https://openslr.org/93/
Opencpop	-	https://wenet.org.cn/opencpop/download/	Singing Voice Synthesis

English

name	duration/h	address
Hi-Fi Multi-Speaker English TTS Dataset	291.6	https://openslr.org/109/
LibriTTS corpus	585	https://openslr.org/60/
Speechocean762	-	https://www.openslr.org/101/
RyanSpeech	10	http://mohammadmahoor.com/ryanspeech/

xingchensong · 2023-10-31T08:03:04Z

Datasets for Speech Recognition & Speaker Diarization

Chinese

name	duration/h	address	remark	application
Aishell4	120	https://openslr.org/111/	8-channel, conference scenarios	speech recognition, speaker diarization
ASR&SD	160	http://ncmmsc2021.org/competition2.html	if available	speech recognition, speaker diarization
zhijiangcup	-	https://zhijiangcup.zhejianglab.com/zhijiang/match/details/id/6.html	if available	speech recognition, speaker diarization
M2MET	120	https://arxiv.org/pdf/2110.07393.pdf	8-channel, conference scenarios	speech recognition, speaker diarization

English

name	duration/h	address	remark	application
CHiME-6	-	https://chimechallenge.github.io/chime6/download.html	if available	speech recognition, speaker diarization

xingchensong · 2023-10-31T08:03:13Z

Datasets for Speaker Recognition

Chinese

name	duration/h	address	application
CN-Celeb	-	https://openslr.org/82/
KeSpeech	1542	https://openreview.net/forum?id=b3Zoeq2sCLq	speech recognition, speaker verification, subdialect identification, voice conversion
MTASS	55.6	https://github.com/Windstudent/Complex-MTASSNet
THCHS-30	40	http://www.openslr.org/18/

English

name	duration/h	address	remark
VoxCeleb Data	-	http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

xingchensong · 2023-10-31T08:13:35Z

Crawler Resources

name	type	address	remark	application
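voicetube	video	https://tw.voicetube.com/	A Taiwan-based online English-learning platform; each video comes with subtitles in English and in the user's native language (usually Chinese)
Chinese-Podcasts	collection of video & podcast	https://github.com/alaskasquirrel/Chinese-Podcasts	A curated collection of Chinese videos, podcasts, radio stations, and similar sources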

Mddct · 2023-12-26T07:20:19Z

https://www.atr-p.com/products/sdb.html#DIGI

xingchensong assigned xingchensong, Mddct, pengzhendong, whiteshirt0429 and robin1001 Oct 31, 2023

xingchensong added the help wanted label Oct 31, 2023

xingchensong pinned this issue Oct 31, 2023

xingchensong mentioned this issue Nov 1, 2023

Chinese Open-Source Large Speech Model Plan #2097

Closed

14 tasks

xingchensong added the future plan label Nov 1, 2023

xingchensong unpinned this issue Nov 1, 2023

github-actions bot added the Stale label Feb 25, 2024

github-actions bot closed this as completed Mar 4, 2024

Mddct reopened this Mar 4, 2024

github-actions bot removed the Stale label Mar 5, 2024

github-actions bot added the Stale label May 4, 2024

github-actions bot closed this as completed May 11, 2024

Mddct reopened this May 11, 2024

github-actions bot removed the Stale label May 12, 2024

github-actions bot added the Stale label Jul 11, 2024

github-actions bot closed this as completed Jul 19, 2024
