Off-the-Shelf Datasets


Our licensable datasets to jumpstart your AI projects




Product Catalog



While open or public datasets are convenient, we offer an extensive catalog of more than 250 licensable, off-the-shelf datasets covering 80 languages and multiple dialects for a variety of common AI use cases. We are excited to announce more than 30 new datasets for 2020 that deliver immediate value to our customers. Our offerings include datasets for speech recognition and training datasets for machine learning algorithms, all created using advanced data science techniques.





Speed



Available immediately to support your AI/ML projects today



Cost Effective



Licensed data sets are more economical than custom data collection



Expertise



20+ years’ data collection experience



Support All Data Types



Image, video, speech, audio, and text



Scale



Provide the right amount of data to train your models effectively


Quality



Improve quality and minimize bias in your AI models






Dataset Name | Product Type | Common Use Cases | Recording Device | Unit
138
Albanian (Albania) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 12,000 words
sqi_ALB_PHON | Appen Global | Pronunciation Dictionary | Albanian | Albania | N/A | N/A | N/A | N/A | 12,000 | N/A | text
139
Amharic (Ethiopia) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 45,000 words
amh_ETH_PHON | Appen Global | Pronunciation Dictionary | Amharic | Ethiopia | N/A | N/A | N/A | N/A | 45,000 | N/A | text
144
Arabic (Algeria) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 11,000 words
ara_DZA_PHON | Appen Global | Pronunciation Dictionary | Arabic | Algeria | N/A | N/A | N/A | N/A | 11,000 | N/A | text
20
Arabic (Eastern Algeria) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 29 hours
EAR_ASR001 | Appen Global | Conversational Speech | Arabic | Algeria | Low background noise (home/office) | 496 | 2 | Available on request | 11,327 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
140
Arabic (Egypt) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words
ara_EGY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Egypt | N/A | N/A | N/A | N/A | 40,000 | N/A | text
114
Arabic (Egypt) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 352 hours
ARE_ASR001_CN | Appen China | Scripted Speech | Arabic | Egypt | Low background noise (home/office) | 627 | 1 | 128,908 | 207,576 | 16 | wav
Dataset is fully transcribed
142
Arabic (Iraq) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 13,000 words
ara_IRQ_POS | Appen Global | Part of Speech Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 13,000 | N/A | text
141
Arabic (Iraq) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 15,000 words
ara_IRQ_PHON | Appen Global | Pronunciation Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 15,000 | N/A | text
Person names
143
Arabic (Libya) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 48,000 words
ara_LBY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Libya | N/A | N/A | N/A | N/A | 48,000 | N/A | text
65
Arabic (Modern Standard Arabic) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 12 hours
MSA_ASR001 | Global Phone | Scripted Speech | Arabic | Tunisia | Low background noise (home/office) | 78 | 1 | 4,908 | Available on request | 16 | wav
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
112
Arabic (Morocco) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 33 hours
ARY_ASR001 | Appen Global | Conversational Speech | Arabic | Morocco | Low background noise | 180 | 2 | 80,544 | 23,836 | 8 | alaw
Each speaker participated in 1 to 4 conversations. Speakers are identified by a unique 4-digit speaker ID which is recorded in the demographic file
Transcription is available in original script and fully reversible Romanised version with accompanying pronunciation lexicon
English translation of product transcription is available (ARY_MT001, ARY_ASRMT001)
113
Arabic (Morocco) conversational telephony translation
Text | MT, Chatbot, Conversational AI | N/A | 80,544 utterances
ARY_MT001 | Appen Global | Conversational Translation | Arabic | Morocco | N/A | 180 | N/A | 80,430 | 23,844 | N/A | text
Corresponding audio, transcription, fully reversible romanised transcription and pronunciation lexicon data are available (ARY_ASR001, ARY_ASRMT001)
146
Arabic (Morocco) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 60,000 words
ara_MAR_PHON | Appen Global | Pronunciation Dictionary | Arabic | Morocco | N/A | N/A | N/A | N/A | 60,000 | N/A | text
147
Arabic (N/A) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words
arb_N/A_PHON | Appen Global | Pronunciation Dictionary | Arabic | N/A | N/A | N/A | N/A | N/A | 40,000 | N/A | text
115
Arabic (Saudi Arabia) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 322 hours
ARS_ASR001_CN | Appen China | Scripted Speech | Arabic | Saudi Arabia | Low background noise (home/office) | 227 | 1 | 104,574 | 156,282 | 16 | wav
Dataset is fully transcribed
149
Arabic (Sudan) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 17,000 words
ara_SDN_PHON | Appen Global | Pronunciation Dictionary | Arabic | Sudan | N/A | N/A | N/A | N/A | 17,000 | N/A | text
148
Arabic (United Arab Emirates) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 75,000 words
ara_ARE_PHON | Appen Global | Pronunciation Dictionary | Arabic | United Arab Emirates | N/A | N/A | N/A | N/A | 75,000 | N/A | text
122
Arabic (United Arab Emirates) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 170 hours
ARU_ASR001_CN | Appen China | Scripted Speech | Arabic | United Arab Emirates | Low background noise (home/office) | 133 | 1 | 42,352 | 85,775 | 16 | wav
Dataset is fully transcribed
70
Arabic (United Arab Emirates) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 48 hours
OrienTel United Arab Emirates MCA (Modern Colloquial Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates | Low background noise | 880 | 1 | 43,000 | Available on request | 8 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
71
Arabic (United Arab Emirates) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 31 hours
OrienTel United Arab Emirates MSA (Modern Standard Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates | Low background noise | 500 | 1 | 24,500 | Available on request | 8 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
9
Arabic (United Arab Emirates/ Saudi Arabia) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 86 hours
CGA_ASR001 | Appen Global | Scripted Speech | Arabic | United Arab Emirates; Saudi Arabia | Low background noise (home/office) | 150 | 4 | 42,000 | 19,245 | 16 | alaw
Complete transcriptions of the content of the speech files at a word level
All acoustic events have been tagged using conventions derived from the SpeechDAT model
All transcriptions fully vowelized
280 prompts per speaker including 30 Person names (first name and family name) from a set of 15, 10 single isolated digits 0-10, 8-digit sequences (randomly generated), 200 phonetically balanced sentences, 30 x 10-word phonetically balanced word strings
130
Arabic NER news text
Text | NER, Content Classification, Search Engines | N/A | 20,774 sentences
ARB_NER001 | Appen Global | News NER | Standard Arabic | N/A | N/A | N/A | N/A | 20,774 | Available on request | N/A | text
150
Assamese (India) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words
asm_IND_PHON | Appen Global | Pronunciation Dictionary | Assamese | India | N/A | N/A | N/A | N/A | 40,000 | N/A | text
124
Baby crying audio
Audio | Baby Monitor, Security & Other Consumer Applications | Mobile phone | 3 hours
CRY_ASR001 | Appen China | Human Sound | N/A | China | Low background noise (home/office) | 100 | 1 | NA | NA | 16 | wav
Crying sound of babies 0-3 years old, each lasting around 2 minutes.
4
Bahasa Indonesia conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 31 hours
BAH_ASR001 | Appen Global | Conversational Speech | Indonesian | Indonesia | Low background noise | 1,002 | 2 | Available on request | 11,480 | 8 | wav
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For a large proportion of calls, only one half of the conversation was collected and transcribed
153
Basque (Spain) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 10,000 words
eus_ESP_PHON | Appen Global | Pronunciation Dictionary | Basque | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text
6
Bengali (Bangladesh) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 47 hours
BEN_ASR001 | Appen Global | Conversational Speech | Bengali | Bangladesh | Mixed (in-car, roadside, home/office) | 1,000 | 2 | Available on request | 17,922 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
154
Bengali (India) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 29,000 words
ben_IND_PHON | Appen Global | Pronunciation Dictionary | Bengali | India | N/A | N/A | N/A | N/A | 29,000 | N/A | text
7
Bulgarian (Bulgaria) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 38 hours
BUL_ASR001 | Appen Global | Conversational Speech | Bulgarian | Bulgaria | Low background noise (home/office) | 217 | 2 | Available on request | 22,342 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
155
Bulgarian (Bulgaria) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 55,000 words
bul_BGR_PHON | Appen Global | Pronunciation Dictionary | Bulgarian | Bulgaria | N/A | N/A | N/A | N/A | 55,000 | N/A | text
111
Bulgarian (Bulgaria) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 22 hours
BUL_ASR002 | Global Phone | Scripted Speech | Bulgarian | Bulgaria | Low background noise (home/office) | 77 | 1 | 8,674 | Available on request | 16 | wav
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
158
Cantonese (China) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 10,000 words
yue_HKG_POS | Appen Global | Part of Speech Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 10,000 | N/A | text
Traditional
156
Cantonese (China) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 37,000 words
yue_CHN_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 37,000 | N/A | text
Simplified
157
Cantonese (China) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words
yue_CHN_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 40,000 | N/A | text
Traditional
159
Catalan (Spain) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 10,000 words
cat_ESP_PHON | Appen Global | Pronunciation Dictionary | Catalan | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text
160
Cebuano (Philippines) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 20,000 words
ceb_PHL_PHON | Appen Global | Pronunciation Dictionary | Cebuano | Philippines | N/A | N/A | N/A | N/A | 20,000 | N/A | text
10
Croatian (Croatia) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 39 hours
CRO_ASR001 | Appen Global | Conversational Speech | Croatian | Croatia | Low background noise (home/office) | 200 | 2 | Available on request | 23,919 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
161
Croatian (Croatia) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 20,000 words
hrv_HRV_PHON | Appen Global | Pronunciation Dictionary | Croatian | Croatia | N/A | N/A | N/A | N/A | 20,000 | N/A | text
11
Croatian (Croatia) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 11 hours
CRO_ASR002 | Global Phone | Scripted Speech | Croatian | Croatia | Low background noise (home/office) | 94 | 1 | 4,499 | Available on request | 16 | wav
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
116
Croatian (Croatia) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 263 hours
CRO_ASR003_CN | Appen China | Scripted Speech | Croatian | Croatia | Low background noise (home/office) | 243 | 1 | 73,467 | 136,140 | 16 | wav
Dataset is fully transcribed
162
Czech (Czech Republic) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 50,000 words
ces_CZE_PHON | Appen Global | Pronunciation Dictionary | Czech | Czech Republic | N/A | N/A | N/A | N/A | 50,000 | N/A | text
12
Czech (Czech Republic) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 31 hours
CZE_ASR001 | Global Phone | Scripted Speech | Czech | Czech Republic | Low background noise (home/office) | 102 | 1 | 12,425 | Available on request | 16 | wav
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
13
Czech (Czech Republic) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Landline only | 93 hours
Czech SpeechDat(E) Dataset | Nuance | Scripted Speech | Czech | Czech Republic | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, and phonetically rich words and sentences
164
Danish (Denmark) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 100,000 words
dan_DNK_POS | Appen Global | Part of Speech Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 100,000 | N/A | text
163
Danish (Denmark) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 107,000 words
dan_DNK_PHON | Appen Global | Pronunciation Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 107,000 | N/A | text
90
Danish (Denmark) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 53 hours
Speecon Danish | Nuance | Scripted Speech | Danish | Denmark | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
15
Dari (Afghanistan) broadcast data
Audio | ASR, Automatic Captioning, Keyword Spotting | Microphone | 51 hours
DAR_BRC001 | Appen Global | Broadcast Speech | Dari | Afghanistan | Low background noise (studio) | N/A | 1 | Available on request | Available on request | N/A | wav
Dataset is fully transcribed and timestamped
Dataset is largely speech only and does not include music or advertisements
Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors)
14
Dari (Afghanistan) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 40 hours
DAR_ASR001 | Appen Global | Conversational Speech | Dari | Afghanistan | Low background noise | 500 | 2 | Available on request | 11,168 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is largely speech only and does not include music or advertisements
165
Dari (Afghanistan) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 30,000 words
prs_AFG_PHON | Appen Global | Pronunciation Dictionary | Dari | Afghanistan | N/A | N/A | N/A | N/A | 30,000 | N/A | text
166
Dholuo (Kenya) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 20,000 words
luo_KEN_PHON | Appen Global | Pronunciation Dictionary | Dholuo | Kenya | N/A | N/A | N/A | N/A | 20,000 | N/A | text
91
Dutch (Belgium) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 47 hours
Speecon Dutch from Belgium | Nuance | Scripted Speech | Dutch | Belgium | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
33
Dutch (Belgium) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Microphone | 80 hours
Flemish SpeechDat(II) FDB-1000 (FIXED1FL) | Nuance | Scripted Speech | Dutch | Belgium | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
19
Dutch (Netherlands & Belgium) scripted in-car
Audio | ASR, Virtual Assistant, In Car HMI & Entertainment | Microphone and mobile phone | 27 hours
Dutch and Flemish SpeechDat-Car | Nuance | Scripted Speech | Dutch | Netherlands; Belgium | Mixed (in-car) | 302 | 5 | 15,100 | Available on request | 16 and 8 | alaw
Dataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report
125 prompts per adult speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech
66
Dutch (Netherlands) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 36 hours
NLD_ASR001 | Appen Global | Conversational Speech | Dutch | Netherlands | Low background noise | 200 | 2 | Available on request | 14,964 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
167
Dutch (Netherlands) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 45,000 words
nld_NLD_PHON | Appen Global | Pronunciation Dictionary | Dutch | Netherlands | N/A | N/A | N/A | N/A | 45,000 | N/A | text
92
Dutch (Netherlands) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 68 hours
Speecon Dutch from the Netherlands | Nuance | Scripted Speech | Dutch | Netherlands | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
125
East African facial images
Image | Facial Recognition | Camera | 13,500 images
IMG_FACE_KEN_CN | Appen China | Human Face | N/A | Kenya | Mixed background and lighting conditions | 100 | NA | NA | NA | NA | jpg
21
English (Arabic - Levant/Egypt) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 28 hours
ENA_ASR001 | Appen Global | Conversational Speech | English | Egypt | Low background noise | 250 | 2 | Available on request | 5,619 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
169
English (Australia) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 157,000 words
eng_AUS_PHON | Appen Global | Pronunciation Dictionary | English | Australia | N/A | N/A | N/A | N/A | 157,000 | N/A | text
2
English (Australia) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 92 hours
AUS_ASR001 | Appen Global | Scripted Speech | English | Australia | Low background noise (home/office) | 500 | 1 | 82,500 | 35,137 | 8 | alaw
Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
162 prompts (read speech) per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words
3
English (Australia) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 118 hours
AUS_ASR002 | Appen Global | Scripted Speech | English | Australia | Mixed | 1,000 | 1 | 75,000 | 19 | 8 | alaw
Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
75 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
The prompts are a mixture of 'read' and 'elicited' items where 5 prompts per script are 'spontaneous free speech'
171
English (Canada) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 3,000 words
eng_CAN_POS | Appen Global | Part of Speech Dictionary | English | Canada | N/A | N/A | N/A | N/A | 3,000 | N/A | text
170
English (Canada) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 50,000 words
eng_CAN_PHON | Appen Global | Pronunciation Dictionary | English | Canada | N/A | N/A | N/A | N/A | 50,000 | N/A | text
22
English (Canada) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 144 hours
ENC_ASR001 | Appen Global | Scripted Speech | English | Canada | Mixed | 1,000 | 1 | 99,000 | 12,483 | 8 | alaw or wav
Fully transcribed to SALA II/SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
99 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
173
English (Hong Kong) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 18,000 words
eng_HKG_PHON | Appen Global | Pronunciation Dictionary | English | Hong Kong | N/A | N/A | N/A | N/A | 18,000 | N/A | text
25
English (India) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 67 hours
ENI_ASR002 | Appen Global | Conversational Speech | English | India | Low background noise | 540 | 2 | 77,565 | 11,646 | 8 | alaw
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
271 telephony conversations are recorded for this project
175
English (India) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 13,000 words
eng_IND_POS | Appen Global | Part of Speech Dictionary | English | India | N/A | N/A | N/A | N/A | 13,000 | N/A | text
174
English (India) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 60,000 words
eng_IND_PHON | Appen Global | Pronunciation Dictionary | English | India | N/A | N/A | N/A | N/A | 60,000 | N/A | text
24
English (India) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 217 hours
ENI_ASR001 | Appen Global | Scripted Speech | English | India | Mixed | 2,358 | 1 | 117,900 | 9,190 | 8 | alaw
Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
49 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
176
English (Ireland) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 12,000 words
eng_IRL_PHON | Appen Global | Pronunciation Dictionary | English | Ireland | N/A | N/A | N/A | N/A | 12,000 | N/A | text
177
English (NZ) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 50,000 words
eng_NZL_PHON | Appen Global | Pronunciation Dictionary | English | NZ | N/A | N/A | N/A | N/A | 50,000 | N/A | text
23
English (Philippines) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 53 hours
ENF_ASR001 | Appen Global | Conversational Speech | English | Philippines | Low background noise | 450 | 2 | 41,602 | 7,272 | 8 | alaw or wav
Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
172
English (Philippines) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 5,000 words
eng_PHL_PHON | Appen Global | Pronunciation Dictionary | English | Philippines | N/A | N/A | N/A | N/A | 5,000 | N/A | text
168
English (United Arab Emirates) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 5,000 words
eng_ARE_PHON | Appen Global | Pronunciation Dictionary | English | United Arab Emirates | N/A | N/A | N/A | N/A | 5,000 | N/A | text
67
English (United Arab Emirates) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 33 hours
OrienTel English as spoken in the United Arab Emirates | Nuance | Scripted Speech | English | United Arab Emirates | Low background noise | 500 | 1 | 25,500 | Available on request | 8 | alaw
Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
99
English (United Kingdom)
Audio | TTS | Headset microphone | 10 hours
TC-STAR female baseline voice Laura | Nuance | Scripted Speech | English | United Kingdom | Low background noise (studio) | 1 | 1 | Available on request | Available on request | 96 | Available on request
Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked)
Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription
100
English (United Kingdom)
Audio TTSHeadset microphone10 hours Add QuoteTC-STAR male baseline voice IanNuanceScripted SpeechEnglishUnited KingdomLow background noise (studio)11Available on requestAvailable on request96Available on requestDataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked)
Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription
259
English (United Kingdom) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline50 hours Add QuoteUKE_ASR001BAppen GlobalConversational SpeechEnglishUnited KingdomLow background noise1,1502Available on request13,1928wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
104
English (United Kingdom) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline150 hours Add QuoteUKE_ASR001Appen GlobalConversational SpeechEnglishUnited KingdomLow background noise1,1502298,56224,1938wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
179
English (United Kingdom) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A155,000 words Add Quoteeng_GBR_POSAppen GlobalPart of Speech DictionaryEnglishUnited KingdomN/AN/AN/AN/A155,000N/AtextEnglish (United Kingdom) Part of Speech Dictionary
178
English (United Kingdom) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A195,000 words Add Quoteeng_GBR_PHONAppen GlobalPronunciation DictionaryEnglishUnited KingdomN/AN/AN/AN/A195,000N/AtextEnglish (United Kingdom) Pronunciation Dictionary
107
English (United States) conversational smartphone
Audio ASR, Conversational AI, Speech AnalyticsMobile phone1000 hours Add QuoteUSE_ASR003Appen GlobalConversational SpeechEnglishUnited StatesLow background noise2,0001500,00052,58616wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Conversations cover a wide variety of topics, including study/major/work, hometown, living arrangements, weather and seasons, punctuality, and TV programs/film
181
English (United States) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A263,000 words Add Quoteeng_USA_POSAppen GlobalPart of Speech DictionaryEnglishUnited StatesN/AN/AN/AN/A263,000N/AtextEnglish (United States) Part of Speech Dictionary
180
English (United States) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A330,000 words Add Quoteeng_USA_PHONAppen GlobalPronunciation DictionaryEnglishUnited StatesN/AN/AN/AN/A330,000N/AtextEnglish (United States) Pronunciation Dictionary
93
English (United States) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone53 hours Add QuoteSpeecon English (USA) databaseNuanceScripted SpeechEnglishUnited StatesMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)4170,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
106
English (United States) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone62 hours Add QuoteUSE_ASR001Appen GlobalScripted SpeechEnglishUnited StatesLow background noise (studio)200280,00018,31848alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Each speaker read 400 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words
131
English NER news text
Text NER, Content Classification, Search EnginesN/A22,768 sentences Add QuoteENG_NER001Appen GlobalNews NEREnglishN/AN/AN/AN/A22,768Available on requestN/AtextEnglish NER news text
32
Farsi/Persian (Iran) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline30 hours Add QuoteFAR_ASR002Appen GlobalConversational SpeechIranian PersianIranMixed1,0002Available on request12,3588wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
31
Farsi/Persian (Iran) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline85 hours Add QuoteFAR_ASR001Appen GlobalScripted SpeechIranian PersianIranMixed789138,4008,7168alawFully transcribed to OrienTel type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
48 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
135
Farsi/Persian NER news text
Text NER, Content Classification, Search EnginesN/A19,584 sentences Add QuoteFAR_NER001Appen GlobalNews NERIranian PersianIranN/AN/AN/A19,584Available on requestN/AtextFarsi/Persian NER news text
185
Finnish (Finland) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotefin_FIN_POSAppen GlobalPart of Speech DictionaryFinnishFinlandN/AN/AN/AN/A10,000N/AtextFinnish (Finland) Part of Speech Dictionary
128
Finnish (Finland) printed text OCR
Image Document Processing, Document SearchCamera7293 images Add QuoteIMG_OCR_FIN_CNAppen ChinaDocument OCRFinnishFinlandMixed lighting conditions4NANANANAjpgImages containing text, such as billboards / outer packaging / signage / magazines / menus, etc.
184
Finnish (Finland) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A85,000 words Add Quotefin_FIN_PHONAppen GlobalPronunciation DictionaryFinnishFinlandN/AN/AN/AN/A85,000N/AtextFinnish (Finland) Pronunciation Dictionary
145
French (Algeria) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A4,000 words Add Quotefra_DZA_PHONAppen GlobalPronunciation DictionaryFrenchAlgeriaN/AN/AN/AN/A4,000N/AtextArabic scriptFrench (Algeria) Pronunciation Dictionary
5
French (Belgium) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only76 hours Add QuoteBelgian French SpeechDat(II) FDB-1000 (FIXED1BF)NuanceScripted SpeechFrenchBelgiumLow background noise1,000153,000Available on request8alawDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
36
French (Canada) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline9 hours Add QuoteFRC_ASR003Appen GlobalConversational SpeechFrenchCanadaMixed682Available on request6,0228alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
For the majority of calls, only one half of the conversation was collected and transcribed; for a smaller number of calls, both speakers (in-line/out-line) were collected and transcribed
186
French (Canada) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A67,000 words Add Quotefra_CAN_PHONAppen GlobalPronunciation DictionaryFrenchCanadaN/AN/AN/AN/A67,000N/AtextFrench (Canada) Pronunciation Dictionary
35
French (Canada) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone46 hours Add QuoteFRC_ASR002Appen GlobalScripted SpeechFrenchCanadaLow background noise (home/office)150122,50010,75516alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
150 prompts per speaker including digits, digit strings (randomly generated), addresses and phonetically rich sentences and words
34
French (Canada) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone131 hours Add QuoteFRC_ASR001Appen GlobalScripted SpeechFrenchCanadaMixed1,0001100,00011,6978alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
100 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
40
French (France) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline25 hours Add QuoteFRF_ASR001Appen GlobalConversational SpeechFrenchFranceLow background noise5632Available on request11,9228alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; for a smaller number of calls, only one half of the conversation was collected and transcribed
39
French (France) In-Car
Audio ASR, Virtual Assistant, In Car HMI & EntertainmentMicrophone and mobile phone Add QuoteFrench SpeechDat-CarNuanceScripted SpeechFrenchFranceMixed (in-car)300537,500Available on request16 and 8Available on requestDataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report
Approximately 125 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech
188
French (France) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A95,000 words Add Quotefra_FRA_POSAppen GlobalPart of Speech DictionaryFrenchFranceN/AN/AN/AN/A95,000N/AtextFrench (France) Part of Speech Dictionary
187
French (France) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A112,000 words Add Quotefra_FRA_PHONAppen GlobalPronunciation DictionaryFrenchFranceN/AN/AN/AN/A112,000N/AtextFrench (France) Pronunciation Dictionary
41
French (France) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone26 hours Add QuoteFRF_ASR003Global PhoneScripted SpeechFrenchFranceLow background noise (home/office)98110,273Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
37
French (France) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only41 hours Add QuoteFrench SpeechDat(II) FDB-1000NuanceScripted SpeechFrenchFranceLow background noise (home/office)1,017148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
38
French (France) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only305 hours Add QuoteFrench SpeechDat(II) FDB-5000NuanceScripted SpeechFrenchFranceLow background noise5,0401237,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
47 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
60
French (Luxembourg) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only45 hours Add QuoteLuxembourgish French SpeechDat(II) FDB-500 (FIXED1LF)NuanceScripted SpeechFrenchLuxembourgLow background noise614132,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
189
German (Germany) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A146,000 words Add Quotedeu_DEU_PHONAppen GlobalPronunciation DictionaryGermanGermanyN/AN/AN/AN/A146,000N/AtextGerman (Germany) Pronunciation Dictionary
16
German (Germany) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone16 hours Add QuoteDEU_ASR001Appen GlobalScripted SpeechGermanGermanyLow background noise (studio)127212,7006,82616alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Each speaker read 100 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words
18
German (Germany) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone25 hours Add QuoteDEU_ASR003Global PhoneScripted SpeechGermanGermanyLow background noise (home/office)77110,085Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
42
German (Germany) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only31 hours Add QuoteGerman SpeechDat (II) FDB-1000NuanceScripted SpeechGermanGermanyLow background noise (home/office)988143,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
43
German (Germany) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only268 hours Add QuoteGerman SpeechDat(II) FDB-4000NuanceScripted SpeechGermanGermanyLow background noise (home/office)4,0001160,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
61
German (Luxembourg) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only33 hours Add QuoteLuxembourgish German SpeechDat(II) FDB-500 (FIXED1LG)NuanceScripted SpeechGermanLuxembourgLow background noise500126,500Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
190
German (Switzerland) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotedeu_CHE_PHONAppen GlobalPronunciation DictionaryGermanSwitzerlandN/AN/AN/AN/A15,000N/AtextGerman (Switzerland) Pronunciation Dictionary
94
German (Switzerland) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone53 hours Add QuoteSpeecon German (Switzerland) databaseNuanceScripted SpeechGermanSwitzerlandMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)4170,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
68
German (Turkey) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline31 hours Add QuoteOrienTel German Spoken by TurkishNuanceScripted SpeechGermanTurkeyLow background noise300115,600Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
191
Greek (Greece) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A5,000 words Add Quoteell_GRC_PHONAppen GlobalPronunciation DictionaryGreekGreeceN/AN/AN/AN/A5,000N/AtextGreek (Greece) Pronunciation Dictionary
118
Greek (Greece) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone191 hours Add QuoteGRE_ASR001_CNAppen ChinaScripted SpeechGreekGreeceLow background noise (home/office)287154,11368,27116wavDataset is fully transcribed
192
Guarani (Paraguay) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotegrn_PRY_PHONAppen GlobalPronunciation DictionaryGuaraniParaguayN/AN/AN/AN/A35,000N/AtextGuarani (Paraguay) Pronunciation Dictionary
194
Haitian Creole (Haiti) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotehat_HTI_PHONAppen GlobalPronunciation DictionaryHaitian CreoleHaitiN/AN/AN/AN/A15,000N/AtextHaitian Creole (Haiti) Pronunciation Dictionary
45
Hausa (Nigeria) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone33 hours Add QuoteHAU_ASR002Appen GlobalConversational SpeechHausaNigeriaLow background noise2002Available on request7,9498alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations were recorded for this project: 100 speakers each made 2 calls (1 from landline, 1 from mobile) to a pool of 100 call receivers
195
Hausa (Nigeria) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A11,000 words Add Quotehau_NGA_PHONAppen GlobalPronunciation DictionaryHausaNigeriaN/AN/AN/AN/A11,000N/AtextHausa (Nigeria) Pronunciation Dictionary
44
Hausa scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone20 hours Add QuoteHAU_ASR001Global PhoneScripted SpeechHausaMultipleLow background noise (home/office)10317,895Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
46
Hebrew (Israel) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline34 hours Add QuoteHEB_ASR001Appen GlobalConversational SpeechHebrewIsraelLow background noise2002Available on request19,2508alaw or wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations were recorded for this project: 100 speakers each made 2 calls (1 from landline, 1 from mobile) to a pool of 100 call receivers
196
Hebrew (Israel) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A31,000 words Add Quoteheb_ISR_PHONAppen GlobalPronunciation DictionaryHebrewIsraelN/AN/AN/AN/A31,000N/AtextHebrew (Israel) Pronunciation Dictionary
48
Hindi (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline32 hours Add QuoteHIN_ASR002Appen GlobalConversational SpeechHindiIndiaMixed9962Available on request12,2668wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; for a smaller number of calls, only one half of the conversation was collected and transcribed
197
Hindi (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotehin_IND_PHONAppen GlobalPronunciation DictionaryHindiIndiaN/AN/AN/AN/A35,000N/AtextHindi (India) Pronunciation Dictionary
47
Hindi (India) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone224 hours Add QuoteHIN_ASR001Appen GlobalScripted SpeechHindiIndiaLow background noise1,920196,0009,8538alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
50 prompts per speaker including digits, natural numbers, personal, business and place names, web addresses, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
129
Human body movement
Video Fitness Applications, Action Classification, Gesture RecognitionMobile phone2000 videos Add QuoteVED_HUMAN_BODY_CNAppen ChinaHuman BodyN/AChinaMixed background and lighting conditions1000NANANANAmp4Video clips are approximately 10-20 seconds long
198
Hungarian (Hungary) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A500 words Add Quotehun_HUN_PHONAppen GlobalPronunciation DictionaryHungarianHungaryN/AN/AN/AN/A500N/AtextHungarian (Hungary) Pronunciation Dictionary
119
Hungarian (Hungary) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone286 hours Add QuoteHUN_ASR001_CNAppen ChinaScripted SpeechHungarianHungaryLow background noise (home/office)254194,031201,92116wavDataset is fully transcribed
49
Hungarian (Hungary) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only65 hours Add QuoteHungarian SpeechDat(E)NuanceScripted SpeechHungarianHungaryLow background noise1,000148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
199
Igbo (Nigeria) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quoteibo_NGA_PHONAppen GlobalPronunciation DictionaryIgboNigeriaN/AN/AN/AN/A30,000N/AtextIgbo (Nigeria) Pronunciation Dictionary
152
Indonesian (Indonesia) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quoteind_IDN_POSAppen GlobalPart of Speech DictionaryIndonesianIndonesiaN/AN/AN/AN/A10,000N/AtextIndonesian (Indonesia) Part of Speech Dictionary
151
Indonesian (Indonesia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A95,000 words Add Quoteind_IDN_PHONAppen GlobalPronunciation DictionaryIndonesianIndonesiaN/AN/AN/AN/A95,000N/AtextIndonesian (Indonesia) Pronunciation Dictionary
183
Iranian Persian (Iran) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A1,400,000 words Add Quotepes_IRN_POSAppen GlobalPart of Speech DictionaryIranian PersianIranN/AN/AN/AN/A1,400,000N/AtextIranian Persian (Iran) Part of Speech Dictionary
182
Iranian Persian (Iran) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A80,000 words Add Quotepes_IRN_PHONAppen GlobalPronunciation DictionaryIranian PersianIranN/AN/AN/AN/A80,000N/AtextIranian Persian (Iran) Pronunciation Dictionary
52
Italian (Italy) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline36 hours Add QuoteITA_ASR003Appen GlobalConversational SpeechItalianItalyLow background noise2002Available on request18,9748alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations were recorded for this project: 100 speakers each made 2 calls (1 from landline, 1 from mobile) to a pool of 100 call receivers
201
Italian (Italy) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A147,000 words Add Quoteita_ITA_POSAppen GlobalPart of Speech DictionaryItalianItalyN/AN/AN/AN/A147,000N/AtextItalian (Italy) Part of Speech Dictionary
200
Italian (Italy) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A197,000 words Add Quoteita_ITA_PHONAppen GlobalPronunciation DictionaryItalianItalyN/AN/AN/AN/A197,000N/AtextItalian (Italy) Pronunciation Dictionary
50
Italian (Italy) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone44 hours Add QuoteITA_ASR001Appen GlobalScripted SpeechItalianItalyMixed200440,0007,31622alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences
51
Italian (Italy) scripted microphone
Audio ASR, Virtual Assistant, In Car HMI & EntertainmentMicrophone47 hours Add QuoteITA_ASR002Appen GlobalScripted SpeechItalianItalyMixed (in-car)103435,87510,36648alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
350 prompts per speaker including digits, street names, generic command and control items, phonetically rich sentences and words
Each speaker recorded 1 or 2 sessions including Session 1 in a parked vehicle with the engine running and Session 2 in a vehicle travelling at 60 mph (100 km/h)
53
Italian (Italy) scripted microphone
Audio TTSMicrophone3 hours Add QuoteITA_TTS001Appen GlobalScripted SpeechItalianItalyLow background noise (studio)113,300Available on request22alawDataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset
3,300 prompts per speaker including phonetically rich sentences
54
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only38 hours Add QuoteItalian Fixed Network Speech SpeechDat(M) CorpusNuanceScripted SpeechItalianItalyLow background noise (home/office)1,000139,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
39 prompts per speaker including isolated and connected digits, natural numbers, money amounts, spelled words, time and date phrases, yes/no questions, city names, common application words, application words in phrases and phonetically rich sentences
55
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only228 hours Add QuoteItalian SpeechDat(II) FDB-3000NuanceScripted SpeechItalianItalyLow background noise (home/office)3,0401134,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
56
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone103 hours Add QuoteItalian SpeechDat(II) MDB-250NuanceScripted SpeechItalianItalyLow background noise (home/office)375119,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
89
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone13 hours Add QuoteSpeechDat(M) Italian Mobile Network Speech DatabaseNuanceScripted SpeechItalianItalyLow background noise (home/office)342113,500Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
203
Japanese (Japan) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A265,000 words Add Quotejpn_JPN_POSAppen GlobalPart of Speech DictionaryJapaneseJapanN/AN/AN/AN/A265,000N/AtextJapanese (Japan) Part of Speech Dictionary
202
Japanese (Japan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A262,000 words Add Quotejpn_JPN_PHONAppen GlobalPronunciation DictionaryJapaneseJapanN/AN/AN/AN/A262,000N/AtextJapanese (Japan) Pronunciation Dictionary
57
Japanese (Japan) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone33 hours Add QuoteJPN_ASR001Global PhoneScripted SpeechJapaneseJapanLow background noise (home/office)144113,067Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
95
Japanese (Japan) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone57 hours Add QuoteSpeecon JapaneseNuanceScripted SpeechJapaneseJapanMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)4170,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Japanese (Japan) scripted microphone
136
Japanese NER news text
Text NER, Content Classification, Search EnginesN/A20,629 sentences Add QuoteJPY_NER001Appen GlobalNews NERJapaneseJapanN/AN/AN/A20,629Available on requestN/AtextJapanese NER news text
204
Javanese (Indonesia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A20,000 words Add Quotejav_IDN_PHONAppen GlobalPronunciation DictionaryJavaneseIndonesiaN/AN/AN/AN/A20,000N/AtextJavanese (Indonesia) Pronunciation Dictionary
58
Kannada (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline15 hours Add QuoteKAN_ASR001Appen GlobalConversational SpeechKannadaIndiaMixed1782Available on request15,6608alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Kannada (India) conversational telephony
109
Kannada (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline57 hours Add QuoteKAN_ASR001AAppen GlobalConversational SpeechKannadaIndiaMixed1,0002Available on request15,6608alawApprox. 25% of the dataset sessions are transcribed and time stamped - full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Kannada (India) conversational telephony
205
Kannada (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotekan_IND_PHONAppen GlobalPronunciation DictionaryKannadaIndiaN/AN/AN/AN/A35,000N/AtextKannada (India) Pronunciation Dictionary
206
Kazakh (Kazakhstan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotekaz_KAZ_PHONAppen GlobalPronunciation DictionaryKazakhKazakhstanN/AN/AN/AN/A30,000N/AtextKazakh (Kazakhstan) Pronunciation Dictionary
123
Khmer (Cambodia) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone90 hours Add QuoteKHM_ASR001_CNAppen ChinaScripted SpeechCentral KhmerCambodiaLow background noise (home/office)94124,61852,15716wavDataset is fully transcribedKhmer (Cambodia) scripted smartphone
208
Korean (South Korea) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotekor_KOR_POSAppen GlobalPart of Speech DictionaryKoreanSouth KoreaN/AN/AN/AN/A100,000N/AtextKorean (South Korea) Part of Speech Dictionary
207
Korean (South Korea) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotekor_KOR_PHONAppen GlobalPronunciation DictionaryKoreanSouth KoreaN/AN/AN/AN/A100,000N/AtextKorean (South Korea) Pronunciation Dictionary
59
Korean (South Korea) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone20 hours Add QuoteKOR_ASR001Global PhoneScripted SpeechKoreanSouth KoreaLow background noise (home/office)10018,107Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Korean (South Korea) scripted microphone
132
Korean NER news text
Text NER, Content Classification, Search EnginesN/A25,830 sentences Add QuoteKOR_NER001Appen GlobalNews NERKoreanSouth KoreaN/AN/AN/A25,830Available on requestN/AtextKorean NER news text
209
Kurmanji (Turkey) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A60,000 words Add Quotekur_TUR_PHONAppen GlobalPronunciation DictionaryKurmanjiTurkeyN/AN/AN/AN/A60,000N/AtextKurmanji (Turkey) Pronunciation Dictionary
210
Lao (Laos) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A9,000 words Add Quotelao_LAO_PHONAppen GlobalPronunciation DictionaryLaoLaosN/AN/AN/AN/A9,000N/AtextLao (Laos) Pronunciation Dictionary
211
Lithuanian (Lithuania) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A60,000 words Add Quotelit_LTU_PHONAppen GlobalPronunciation DictionaryLithuanianLithuaniaN/AN/AN/AN/A60,000N/AtextLithuanian (Lithuania) Pronunciation Dictionary
212
Malayalam (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A4,000 words Add Quotemal_IND_PHONAppen GlobalPronunciation DictionaryMalayalamIndiaN/AN/AN/AN/A4,000N/AtextMalayalam (India) Pronunciation Dictionary
213
Malaysian (Malaysia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotemsa_MYS_PHONAppen GlobalPronunciation DictionaryMalaysianMalaysiaN/AN/AN/AN/A10,000N/AtextMalaysian (Malaysia) Pronunciation Dictionary
214
Mandarin (Simplified) (China) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotezho_CHN_PHONAppen GlobalPronunciation DictionaryMandarin (Simplified)ChinaN/AN/AN/AN/A35,000N/AtextMandarin (Simplified) (China) Pronunciation Dictionary
215
Mandarin (Traditional) (Taiwan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A50,000 words Add Quotezho_TWN_PHONAppen GlobalPronunciation DictionaryMandarin (Traditional)TaiwanN/AN/AN/AN/A50,000N/AtextMandarin (Traditional) (Taiwan) Pronunciation Dictionary
63
Mandarin Chinese (China) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone26 hours Add QuoteMAC_ASR002Global PhoneScripted SpeechMandarin ChineseChinaLow background noise (home/office)132110,225Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Mandarin Chinese (China) scripted microphone
62
Mandarin Chinese (China) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline323 hours Add QuoteMAC_ASR001Appen GlobalScripted SpeechMandarin ChineseChinaMixed2,0001200,0007,1458alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
98 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words
Mandarin Chinese (China) scripted telephony
134
Mandarin NER news text
Text NER, Content Classification, Search EnginesN/A17,313 sentences Add QuoteMAC_NER001Appen GlobalNews NERMandarin ChineseChinaN/AN/AN/A17,313Available on requestN/AtextMandarin NER news text
64
Marathi (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline15 hours Add QuoteMAR_ASR001Appen GlobalConversational SpeechMarathiIndiaMixed1802Available on request11,9088alawApprox. 29% of the dataset sessions are transcribed and time stamped - full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Marathi (India) conversational telephony
110
Marathi (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline52 hours Add QuoteMAR_ASR001AAppen GlobalConversational SpeechMarathiIndiaMixed1,0002Available on request11,9088alawPortion of the dataset sessions are transcribed and time stamped - full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Marathi (India) conversational telephony
216
Marathi (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotemar_IND_PHONAppen GlobalPronunciation DictionaryMarathiIndiaN/AN/AN/AN/A30,000N/AtextMarathi (India) Pronunciation Dictionary
217
Mongolian (Mongolia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotemon_MNG_PHONAppen GlobalPronunciation DictionaryMongolianMongoliaN/AN/AN/AN/A30,000N/AtextMongolian (Mongolia) Pronunciation Dictionary
219
Norwegian (Norway) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A3,000 words Add Quotenor_NOR_POSAppen GlobalPart of Speech DictionaryNorwegianNorwayN/AN/AN/AN/A3,000N/AtextNorwegian (Norway) Part of Speech Dictionary
218
Norwegian (Norway) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A115,000 words Add Quotenor_NOR_PHONAppen GlobalPronunciation DictionaryNorwegianNorwayN/AN/AN/AN/A115,000N/AtextNorwegian (Norway) Pronunciation Dictionary
220
Oriya (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quoteori_IND_PHONAppen GlobalPronunciation DictionaryOriyaIndiaN/AN/AN/AN/A15,000N/AtextOriya (India) Pronunciation Dictionary
80
Panjabi (Pakistan) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline20 hours Add QuotePAP_ASR001Appen GlobalConversational SpeechPanjabiPakistanLow background noise2052Available on request7,2988alawDataset is fully transcribed and time-stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For 71% of calls, both speakers (in-line/out-line) were collected and transcribed; for the remaining 29% of calls, only one half of the conversation was collected and transcribed
Panjabi (Pakistan) conversational telephony
74
Pashto (Afghanistan) broadcast
Audio ASR, Automatic Captioning, Keyword SpottingMicrophone51 hours Add QuotePAS_BRC001Appen GlobalBroadcast SpeechNorthern Pashto; Southern PashtoAfghanistanLow background noise (studio)N/A1Available on requestAvailable on requestN/AwavDataset is fully transcribed and timestamped
Dataset is largely speech only and does not include music or advertisements
Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors)
Pashto (Afghanistan) broadcast
73
Pashto (Afghanistan) conversational microphone
Audio ASR, Conversational AI, Speech AnalyticsMicrophone39 hours Add QuotePAS_ASR002Appen GlobalConversational SpeechNorthern Pashto; Southern PashtoAfghanistanLow background noise402Available on request9,48016wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
A full translation of the transcripts into French is also available as an optional additional purchase
Average length of calls: 120 mins; one speaker acts as the interviewer and the other as the interviewee, in scenarios similar to TransTAC style (e.g. civil affairs, checkpoints, etc.)
The interviewer appears in more than one set of dialogues but the interviewee is unique for each set
Pashto (Afghanistan) conversational microphone
72
Pashto (Afghanistan) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline55 hours Add QuotePAS_ASR001Appen GlobalConversational SpeechNorthern Pashto; Southern PashtoAfghanistanLow background noise9672Available on request13,6338wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; for a smaller number of calls, only one half of the conversation was collected and transcribed
Pashto (Afghanistan) conversational telephony
221
Pashto (Afghanistan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A65,000 words Add Quotepus_AFG_PHONAppen GlobalPronunciation DictionaryPashtoAfghanistanN/AN/AN/AN/A65,000N/AtextPashto (Afghanistan) Pronunciation Dictionary
223
Polish (Poland) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A4,000 words Add Quotepol_POL_POSAppen GlobalPart of Speech DictionaryPolishPolandN/AN/AN/AN/A4,000N/AtextPolish (Poland) Part of Speech Dictionary
222
Polish (Poland) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A40,000 words Add Quotepol_POL_PHONAppen GlobalPronunciation DictionaryPolishPolandN/AN/AN/AN/A40,000N/AtextPolish (Poland) Pronunciation Dictionary
75
Polish (Poland) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone25 hours Add QuotePOL_ASR001Global PhoneScripted SpeechPolishPolandLow background noise (home/office)99110,130Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Polish (Poland) scripted microphone
120
Polish (Poland) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone293 hours Add QuotePOL_ASR002_CNAppen ChinaScripted SpeechPolishPolandLow background noise (home/office)3531106,674168,54416wavDataset is fully transcribedPolish (Poland) scripted smartphone
76
Polish (Poland) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only78 hours Add QuotePolish SpeechDat(E) DatabaseNuanceScripted SpeechPolishPolandLow background noise1,000148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Polish (Poland) scripted telephony
78
Portuguese (Brazil) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline33 hours Add QuotePTB_ASR002Appen GlobalConversational SpeechPortugueseBrazilLow background noise2002Available on request11,2878alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Portuguese (Brazil) conversational telephony
77
Portuguese (Brazil) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone26 hours Add QuotePTB_ASR001Global PhoneScripted SpeechPortugueseBrazilLow background noise (home/office)102110,417Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Portuguese (Brazil) microphone
225
Portuguese (Brazil) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotepor_BRA_POSAppen GlobalPart of Speech DictionaryPortugueseBrazilN/AN/AN/AN/A100,000N/AtextPortuguese (Brazil) Part of Speech Dictionary
224
Portuguese (Brazil) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A102,000 words Add Quotepor_BRA_PHONAppen GlobalPronunciation DictionaryPortugueseBrazilN/AN/AN/AN/A102,000N/AtextPortuguese (Brazil) Pronunciation Dictionary
79
Portuguese (Portugal) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline36 hours Add QuotePTP_ASR001Appen GlobalConversational SpeechPortuguesePortugalLow background noise2002Available on request16,3398alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
Portuguese (Portugal) conversational telephony
227
Portuguese (Portugal) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotepor_PRT_POSAppen GlobalPart of Speech DictionaryPortuguesePortugalN/AN/AN/AN/A100,000N/AtextPortuguese (Portugal) Part of Speech Dictionary
226
Portuguese (Portugal) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A112,000 words Add Quotepor_PRT_PHONAppen GlobalPronunciation DictionaryPortuguesePortugalN/AN/AN/AN/A112,000N/AtextPortuguese (Portugal) Pronunciation Dictionary
81
Romanian (Romania) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline37 hours Add QuoteROM_ASR001Appen GlobalConversational SpeechRomanianRomaniaLow background noise2002Available on request16,6588alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
Romanian (Romania) conversational telephony
228
Romanian (Romania) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quoteron_ROU_PHONAppen GlobalPronunciation DictionaryRomanianRomaniaN/AN/AN/AN/A15,000N/AtextRomanian (Romania) Pronunciation Dictionary
82
Russian (Russia) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline37 hours Add QuoteRUS_ASR001Appen GlobalConversational SpeechRussianRussiaLow background noise2002Available on request28,2848alaw or wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
Russian (Russia) conversational telephony
230
Russian (Russia) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quoterus_RUS_POSAppen GlobalPart of Speech DictionaryRussianRussiaN/AN/AN/AN/A100,000N/AtextRussian (Russia) Part of Speech Dictionary
229
Russian (Russia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A115,000 words Add Quoterus_RUS_PHONAppen GlobalPronunciation DictionaryRussianRussiaN/AN/AN/AN/A115,000N/AtextRussian (Russia) Pronunciation Dictionary
83
Russian (Russia) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone31 hours Add QuoteRUS_ASR002Global PhoneScripted SpeechRussianRussiaLow background noise (home/office)115112,205Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Russian (Russia) scripted microphone
96
Russian (Russia) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone46 hours Add QuoteSpeecon Russian DatabaseNuanceScripted SpeechRussianRussiaMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)4170,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Russian (Russia) scripted microphone
84
Russian (Russia) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only180 hours Add QuoteRussian SpeechDat(E) DatabaseNuanceScripted SpeechRussianRussiaLow background noise2,5001112,000Available on request8alawDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Russian (Russia) scripted telephony
133
Russian NER news text
Text NER, Content Classification, Search EnginesN/A29,888 sentences Add QuoteRUS_NER001Appen GlobalNews NERRussianRussiaN/AN/AN/A29,888Available on requestN/AtextRussian NER news text
231
Serbian (Serbia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotesrp_SRB_PHONAppen GlobalPronunciation DictionarySerbianSerbiaN/AN/AN/AN/A15,000N/AtextSerbian (Serbia) Pronunciation Dictionary
126
Simplified Chinese printed text OCR
Image Document Processing, Document SearchCamera200 images Add QuoteIMG_OCR_MAC_CNAppen ChinaDocument OCRN/AChinaMixed lighting conditions30NANANANAjpgText in each image is labeled with bounding boxes by the line
Images containing heavy text in Chinese, including books, publications, posters, receipts, PPT, printed paper, etc.
Simplified Chinese printed text OCR
85
Slovak (Slovakia) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only65 hours Add QuoteSlovak SpeechDat(E) DatabaseNuanceScripted SpeechSlovakSlovakiaLow background noise1,000148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Slovak (Slovakia) scripted telephony
86
Slovenian (Slovenia) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only76 hours Add QuoteSlovenian SpeechDat(II) FDB-1000NuanceScripted SpeechSlovenianSloveniaLow background noise (home/office)1,000140,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
Approximately 40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Slovenian (Slovenia) telephony
87
Somali (Somalia) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline50 hours Add QuoteSOM_ASR001Appen GlobalConversational SpeechSomaliSomaliaLow background noise1,0002Available on request23,2178alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Somali (Somalia) conversational telephony
232
Somali (Somalia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A76,000 words Add Quotesom_SOM_PHONAppen GlobalPronunciation DictionarySomaliSomaliaN/AN/AN/AN/A76,000N/AtextSomali (Somalia) Pronunciation Dictionary
233
Sorani (Iraq) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A25,000 words Add Quotekur_IRQ_PHONAppen GlobalPronunciation DictionarySoraniIraqN/AN/AN/AN/A25,000N/AtextSorani (Iraq) Pronunciation Dictionary
88
Sorani (Kurdish) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline5 hours Add QuoteSOR_ASR001Appen GlobalConversational SpeechCentral Kurdish (Iran)IranLow background noise1702Available on request7,9248alaw or wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For a large proportion of calls, only one half of the conversation was collected and transcribed
Sorani (Kurdish) conversational telephony
234
Spanish (Argentina) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_ARG_PHONAppen GlobalPronunciation DictionarySpanishArgentinaN/AN/AN/AN/A15,000N/AtextSpanish (Argentina) Pronunciation Dictionary
236
Spanish (Chile) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_CHL_PHONAppen GlobalPronunciation DictionarySpanishChileN/AN/AN/AN/A15,000N/AtextSpanish (Chile) Pronunciation Dictionary
237
Spanish (Colombia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_COL_PHONAppen GlobalPronunciation DictionarySpanishColombiaN/AN/AN/AN/A15,000N/AtextSpanish (Colombia) Pronunciation Dictionary
27
Spanish (Latin America - Chile and Colombia) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline22 hours Add QuoteESL_ASR002Appen GlobalConversational SpeechSpanishChile; ColombiaMixed842Available on requestAvailable on request8wavDataset is fully transcribed and time-stamped
Call Centre style conversations (by 64 customers, 14 agents) in banking and telco domains, primarily using mobile phone
Spanish (Latin America - Chile and Colombia) conversational telephony
26
Spanish (Latin America) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone17 hours Add QuoteESL_ASR001Global PhoneScripted SpeechSpanishCosta RicaLow background noise (home/office)10016,898Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Spanish (Latin America) scripted microphone
238
Spanish (Peru) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_PER_PHONAppen GlobalPronunciation DictionarySpanishPeruN/AN/AN/AN/A15,000N/AtextSpanish (Peru) Pronunciation Dictionary
235
Spanish (Spain) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotespa_ESP_PHONAppen GlobalPronunciation DictionarySpanishSpainN/AN/AN/AN/A100,000N/AtextSpanish (Spain) Pronunciation Dictionary
28
Spanish (Spain) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone39 hours Add QuoteESP_ASR001Appen GlobalScripted SpeechSpanishSpainMixed200440,0006,36722alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences
Spanish (Spain) scripted microphone
30
Spanish (Spain) scripted microphone
Audio TTSMicrophone1 hour Add QuoteESP_TTS001Appen GlobalScripted SpeechSpanishSpainLow background noise (studio)111,7873,61422alawDataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset
1,787 prompts per speaker including phonetically rich sentences
Spanish (Spain) scripted microphone
97
Spanish (Spain) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone46 hours Add QuoteSpeecon Spanish DatabaseNuanceScripted SpeechSpanishSpainMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)4170,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Spanish (Spain) scripted microphone
117
Spanish (Spain) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone540 hours Add QuoteESP_ASR002_CNAppen ChinaScripted SpeechSpanishSpainLow background noise (home/office)3471258,395134,93916wavDataset is fully transcribedSpanish (Spain) scripted smartphone
239
Spanish (United States) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A90,000 words Add Quotespa_USA_PHONAppen GlobalPronunciation DictionarySpanishUnited StatesN/AN/AN/AN/A90,000N/AtextSpanish (United States) Pronunciation Dictionary
240
Spanish (Venezuela) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_VEN_PHONAppen GlobalPronunciation DictionarySpanishVenezuelaN/AN/AN/AN/A15,000N/AtextSpanish (Venezuela) Pronunciation Dictionary
241
Swahili (Kenya) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A66,000 words Add Quoteswa_KEN_PHONAppen GlobalPronunciation DictionarySwahiliKenyaN/AN/AN/AN/A66,000N/AtextSwahili (Kenya) Pronunciation Dictionary
243
Swedish (Sweden) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A105,000 words Add Quoteswe_SWE_POSAppen GlobalPart of Speech DictionarySwedishSwedenN/AN/AN/AN/A105,000N/AtextSwedish (Sweden) Part of Speech Dictionary
242
Swedish (Sweden) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quoteswe_SWE_PHONAppen GlobalPronunciation DictionarySwedishSwedenN/AN/AN/AN/A100,000N/AtextSwedish (Sweden) Pronunciation Dictionary
98
Swedish (Sweden/ Finland) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone30 hours Add QuoteSWE_ASR001Global PhoneScripted SpeechSwedishSweden; FinlandLow background noise (home/office)98111,816Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Swedish (Sweden/ Finland) microphone
244
Sylheti (Bangladesh- India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A22,000 words Add Quotesyl_BGD;IND_PHONAppen GlobalPronunciation DictionarySylhetiBangladesh; IndiaN/AN/AN/AN/A22,000N/AtextSylheti (Bangladesh- India) Pronunciation Dictionary
245
Tagalog (Philippines) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotetgl_PHL_PHONAppen GlobalPronunciation DictionaryTagalogPhilippinesN/AN/AN/AN/A30,000N/AtextTagalog (Philippines) Pronunciation Dictionary
247
Tamil (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A105,000 words Add Quotetam_IND_PHONAppen GlobalPronunciation DictionaryTamilIndiaN/AN/AN/AN/A105,000N/AtextTamil (India) Pronunciation Dictionary
246
Telugu (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A50,000 words Add Quotetel_IND_PHONAppen GlobalPronunciation DictionaryTeluguIndiaN/AN/AN/AN/A50,000N/AtextTelugu (India) Pronunciation Dictionary
101
Thai (Thailand) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone28 hours Add QuoteTHA_ASR001Global PhoneScripted SpeechThaiThailandLow background noise (home/office)98114,039Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Thai (Thailand) microphone
127
Thai (Thailand) printed text OCR
Image Document Processing, Document SearchCamera1219 images Add QuoteIMG_OCR_THA_CNAppen ChinaDocument OCRThaiThailandMixed lighting conditions10NANANANAjpgImages containing text, Shopping receipts / tickets / invoices / taxi slips, etc.Thai (Thailand) printed text OCR
248
Thai (Thailand) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A55,000 words Add Quotetha_THA_PHONAppen GlobalPronunciation DictionaryThaiThailandN/AN/AN/AN/A55,000N/AtextThai (Thailand) Pronunciation Dictionary
249
Tok Pisin (Papua New Guinea) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotetpi_PNG_PHONAppen GlobalPronunciation DictionaryTok PisinPapua New GuineaN/AN/AN/AN/A10,000N/AtextTok Pisin (Papua New Guinea) Pronunciation Dictionary
102
Turkish (Turkey) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline41 hours Add QuoteTUR_ASR001Appen GlobalConversational SpeechTurkishTurkeyLow background noise2002Available on request32,3868alaw or wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
Turkish (Turkey) conversational telephony
103
Turkish (Turkey) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone17 hours Add QuoteTUR_ASR002Global PhoneScripted SpeechTurkishTurkeyLow background noise (home/office)10016,950Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Turkish (Turkey) microphone
251
Turkish (Turkey) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A257,000 words Add Quotetur_TUR_POSAppen GlobalPart of Speech DictionaryTurkishTurkeyN/AN/AN/AN/A257,000N/AtextTurkish (Turkey) Part of Speech Dictionary
250
Turkish (Turkey) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A255,000 words Add Quotetur_TUR_PHONAppen GlobalPronunciation DictionaryTurkishTurkeyN/AN/AN/AN/A255,000N/AtextTurkish (Turkey) Pronunciation Dictionary
121
Turkish (Turkey) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone739 hours Add QuoteTUR_ASR003_CNAppen ChinaScripted SpeechTurkishTurkeyLow background noise (home/office)6641185,706215,13516wavDataset is fully transcribedTurkish (Turkey) scripted smartphone
69
Turkish (Turkey) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline118 hours Add QuoteOrienTel Turkish DatabaseNuanceScripted SpeechTurkishTurkeyLow background noise1,700176,500Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Turkish (Turkey) telephony
252
Ukrainian (Ukraine) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A5,000 words Add Quoteukr_UKR_PHONAppen GlobalPronunciation DictionaryUkrainianUkraineN/AN/AN/AN/A5,000N/AtextUkrainian (Ukraine) Pronunciation Dictionary
105
Urdu (India/ Pakistan) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline47 hours Add QuoteURD_ASR001Appen GlobalConversational SpeechUrduIndia; PakistanMixed1,0002Available on request10,8718wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Urdu (India/ Pakistan) conversational telephony
254
Urdu (Pakistan) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A12,000 words Add Quoteurd_PAK_POSAppen GlobalPart of Speech DictionaryUrduPakistanN/AN/AN/AN/A12,000N/AtextUrdu (Pakistan) Part of Speech Dictionary
253
Urdu (Pakistan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A40,000 words Add Quoteurd_PAK_PHONAppen GlobalPronunciation DictionaryUrduPakistanN/AN/AN/AN/A40,000N/AtextUrdu (Pakistan) Pronunciation Dictionary
137
Urdu NER news text
Text NER, Content Classification, Search EnginesN/A20,634 sentences Add QuoteURD_NER001Appen GlobalNews NERUrduPakistanN/AN/AN/A20,634Available on requestN/AtextUrdu NER news text
108
Vietnamese (Vietnam) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone47 hours Add QuoteVIE_ASR001Global PhoneScripted SpeechVietnameseVietnamLow background noise (home/office)129118,842Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Vietnamese (Vietnam) microphone
255
Vietnamese (Vietnam) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A8,000 words Add Quotevie_VNM_PHONAppen GlobalPronunciation DictionaryVietnameseVietnamN/AN/AN/AN/A8,000N/AtextVietnamese (Vietnam) Pronunciation Dictionary
256
Wu (China) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotewuu_CHN_PHONAppen GlobalPronunciation DictionaryWuChinaN/AN/AN/AN/A10,000N/AtextWu (China) Pronunciation Dictionary
257
Xiang (China) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotehsn_CHN_PHONAppen GlobalPronunciation DictionaryXiangChinaN/AN/AN/AN/A10,000N/AtextXiang (China) Pronunciation Dictionary
258
Zulu (South Africa) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A75,000 words Add Quotezul_ZAF_PHONAppen GlobalPronunciation DictionaryZuluSouth AfricaN/AN/AN/AN/A75,000N/AtextZulu (South Africa) Pronunciation Dictionary





Use Cases


Whether you are working on a text-to-speech system, a voice recognition system or another solution that relies on natural language, high-quality licensed speech and language datasets allow you to go to market faster and reach more potential customers.