Off the Shelf Datasets


Our licensable datasets to jumpstart your AI projects


Product Catalog



We offer an extensive catalog of more than 250 licensable, off-the-shelf datasets covering 80 languages and multiple dialects for a variety of common AI use cases. We are excited to announce 30+ new datasets for 2020 that deliver immediate value to our customers.




Speed



Available immediately to support your AI/ML projects today



Cost Effective



Licensed datasets are more economical than custom data collection



Expertise



20+ years’ data collection experience



Support All Data Types



Image, video, speech, audio, and text


Scale



Provide the right amount of data to train your models effectively


Quality



Improve quality and minimize bias in your AI models






Dataset Name | Product Type | Common Use Cases | Recording Device | Unit
138
Albanian (Albania) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 12,000 words | sqi_ALB_PHON | Appen Global | Pronunciation Dictionary | Albanian | Albania | N/A | N/A | N/A | N/A | 12,000 | N/A | text
139
Amharic (Ethiopia) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 45,000 words | amh_ETH_PHON | Appen Global | Pronunciation Dictionary | Amharic | Ethiopia | N/A | N/A | N/A | N/A | 45,000 | N/A | text
144
Arabic (Algeria) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 11,000 words | ara_DZA_PHON | Appen Global | Pronunciation Dictionary | Arabic | Algeria | N/A | N/A | N/A | N/A | 11,000 | N/A | text
20
Arabic (Eastern Algeria) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 29 hours | EAR_ASR001 | Appen Global | Conversational Speech | Arabic | Algeria | Low background noise (home/office) | 496 | 2 | Available on request | 11,327 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
140
Arabic (Egypt) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words | ara_EGY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Egypt | N/A | N/A | N/A | N/A | 40,000 | N/A | text
114
Arabic (Egypt) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 352 hours | ARE_ASR001_CN | Appen China | Scripted Speech | Arabic | Egypt | Low background noise (home/office) | 627 | 1 | 128,908 | 207,576 | 16 | wav | Dataset is fully transcribed
142
Arabic (Iraq) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 13,000 words | ara_IRQ_POS | Appen Global | Part of Speech Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 13,000 | N/A | text
141
Arabic (Iraq) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 15,000 words | ara_IRQ_PHON | Appen Global | Pronunciation Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 15,000 | N/A | text | Person names
143
Arabic (Libya) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 48,000 words | ara_LBY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Libya | N/A | N/A | N/A | N/A | 48,000 | N/A | text
65
Arabic (Modern Standard Arabic) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 12 hours | MSA_ASR001 | Global Phone | Scripted Speech | Arabic | Tunisia | Low background noise (home/office) | 78 | 1 | 4,908 | Available on request | 16 | wav | Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available on the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
112
Arabic (Morocco) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 33 hours | ARY_ASR001 | Appen Global | Conversational Speech | Arabic | Morocco | Low background noise | 180 | 2 | 80,544 | 23,836 | 8 | alaw | Each speaker participated in 1 to 4 conversations. Speakers are identified by a unique 4-digit speaker ID which is recorded in the demographic file
Transcription is available in original script and fully reversible Romanised version with accompanying pronunciation lexicon
English translation of product transcription is available (ARY_MT001, ARY_ASRMT001)
113
Arabic (Morocco) conversational telephony translation
Text | MT, Chatbot, Conversational AI | N/A | 80,544 utterances | ARY_MT001 | Appen Global | Conversational Translation | Arabic | Morocco | N/A | 180 | N/A | 80,430 | 23,844 | N/A | text | Corresponding audio, transcription, fully reversible romanised transcription and pronunciation lexicon data are available (ARY_ASR001, ARY_ASRMT001)
146
Arabic (Morocco) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 60,000 words | ara_MAR_PHON | Appen Global | Pronunciation Dictionary | Arabic | Morocco | N/A | N/A | N/A | N/A | 60,000 | N/A | text
147
Arabic (N/A) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words | arb_N/A_PHON | Appen Global | Pronunciation Dictionary | Arabic | N/A | N/A | N/A | N/A | N/A | 40,000 | N/A | text
115
Arabic (Saudi Arabia) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 322 hours | ARS_ASR001_CN | Appen China | Scripted Speech | Arabic | Saudi Arabia | Low background noise (home/office) | 227 | 1 | 104,574 | 156,282 | 16 | wav | Dataset is fully transcribed
149
Arabic (Sudan) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 17,000 words | ara_SDN_PHON | Appen Global | Pronunciation Dictionary | Arabic | Sudan | N/A | N/A | N/A | N/A | 17,000 | N/A | text
148
Arabic (United Arab Emirates) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 75,000 words | ara_ARE_PHON | Appen Global | Pronunciation Dictionary | Arabic | United Arab Emirates | N/A | N/A | N/A | N/A | 75,000 | N/A | text
122
Arabic (United Arab Emirates) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 170 hours | ARU_ASR001_CN | Appen China | Scripted Speech | Arabic | United Arab Emirates | Low background noise (home/office) | 133 | 1 | 42,352 | 85,775 | 16 | wav | Dataset is fully transcribed
70
Arabic (United Arab Emirates) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 48 hours | OrienTel United Arab Emirates MCA (Modern Colloquial Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates | Low background noise | 880 | 1 | 43,000 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
71
Arabic (United Arab Emirates) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 31 hours | OrienTel United Arab Emirates MSA (Modern Standard Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates | Low background noise | 500 | 1 | 24,500 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
9
Arabic (United Arab Emirates/ Saudi Arabia) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 86 hours | CGA_ASR001 | Appen Global | Scripted Speech | Arabic | United Arab Emirates; Saudi Arabia | Low background noise (home/office) | 150 | 4 | 42,000 | 19,245 | 16 | alaw | Complete transcriptions of the content of the speech files at a word level
All acoustic events have been tagged using conventions derived from the SpeechDAT model
All transcriptions fully vowelized
280 prompts per speaker including 30 Person names (first name and family name) from a set of 15, 10 single isolated digits 0-10, 8-digit sequences (randomly generated), 200 phonetically balanced sentences, 30 x 10-word phonetically balanced word strings
130
Arabic NER news text
Text | NER, Content Classification, Search Engines | N/A | 20,774 sentences | ARB_NER001 | Appen Global | News NER | Standard Arabic | N/A | N/A | N/A | N/A | 20,774 | Available on request | N/A | text
150
Assamese (India) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words | asm_IND_PHON | Appen Global | Pronunciation Dictionary | Assamese | India | N/A | N/A | N/A | N/A | 40,000 | N/A | text
124
Baby crying audio
Audio | Baby Monitor, Security & Other Consumer Applications | Mobile phone | 3 hours | CRY_ASR001 | Appen China | Human Sound | N/A | China | Low background noise (home/office) | 100 | 1 | N/A | N/A | 16 | wav | Crying sound of babies 0-3 years old, each lasting around 2 minutes.
4
Bahasa Indonesia conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 31 hours | BAH_ASR001 | Appen Global | Conversational Speech | Indonesian | Indonesia | Low background noise | 1,002 | 2 | Available on request | 11,480 | 8 | wav | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For a large proportion of calls, only one half of the conversation was collected and transcribed
153
Basque (Spain) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 10,000 words | eus_ESP_PHON | Appen Global | Pronunciation Dictionary | Basque | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text
6
Bengali (Bangladesh) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 47 hours | BEN_ASR001 | Appen Global | Conversational Speech | Bengali | Bangladesh | Mixed (in-car, roadside, home/office) | 1,000 | 2 | Available on request | 17,922 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
154
Bengali (India) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 29,000 words | ben_IND_PHON | Appen Global | Pronunciation Dictionary | Bengali | India | N/A | N/A | N/A | N/A | 29,000 | N/A | text
7
Bulgarian (Bulgaria) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 38 hours | BUL_ASR001 | Appen Global | Conversational Speech | Bulgarian | Bulgaria | Low background noise (home/office) | 217 | 2 | Available on request | 22,342 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
155
Bulgarian (Bulgaria) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 55,000 words | bul_BGR_PHON | Appen Global | Pronunciation Dictionary | Bulgarian | Bulgaria | N/A | N/A | N/A | N/A | 55,000 | N/A | text
111
Bulgarian (Bulgaria) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 22 hours | BUL_ASR002 | Global Phone | Scripted Speech | Bulgarian | Bulgaria | Low background noise (home/office) | 77 | 1 | 8,674 | Available on request | 16 | wav | Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available on the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
158
Cantonese (China) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 10,000 words | yue_HKG_POS | Appen Global | Part of Speech Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Traditional
156
Cantonese (China) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 37,000 words | yue_CHN_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 37,000 | N/A | text | Simplified
157
Cantonese (China) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 40,000 words | yue_CHN_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Traditional
159
Catalan (Spain) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 10,000 words | cat_ESP_PHON | Appen Global | Pronunciation Dictionary | Catalan | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text
160
Cebuano (Philippines) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 20,000 words | ceb_PHL_PHON | Appen Global | Pronunciation Dictionary | Cebuano | Philippines | N/A | N/A | N/A | N/A | 20,000 | N/A | text
10
Croatian (Croatia) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 39 hours | CRO_ASR001 | Appen Global | Conversational Speech | Croatian | Croatia | Low background noise (home/office) | 200 | 2 | Available on request | 23,919 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
161
Croatian (Croatia) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 20,000 words | hrv_HRV_PHON | Appen Global | Pronunciation Dictionary | Croatian | Croatia | N/A | N/A | N/A | N/A | 20,000 | N/A | text
11
Croatian (Croatia) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 11 hours | CRO_ASR002 | Global Phone | Scripted Speech | Croatian | Croatia | Low background noise (home/office) | 94 | 1 | 4,499 | Available on request | 16 | wav | Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available on the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
116
Croatian (Croatia) scripted smartphone
Audio | ASR, Virtual Assistant, Chatbot | Mobile phone | 263 hours | CRO_ASR003_CN | Appen China | Scripted Speech | Croatian | Croatia | Low background noise (home/office) | 243 | 1 | 73,467 | 136,140 | 16 | wav | Dataset is fully transcribed
162
Czech (Czech Republic) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 50,000 words | ces_CZE_PHON | Appen Global | Pronunciation Dictionary | Czech | Czech Republic | N/A | N/A | N/A | N/A | 50,000 | N/A | text
12
Czech (Czech Republic) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 31 hours | CZE_ASR001 | Global Phone | Scripted Speech | Czech | Czech Republic | Low background noise (home/office) | 102 | 1 | 12,425 | Available on request | 16 | wav | Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available on the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
13
Czech (Czech Republic) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Landline only | 93 hours | Czech SpeechDat(E) Dataset | Nuance | Scripted Speech | Czech | Czech Republic | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, and phonetically rich words and sentences
164
Danish (Denmark) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 100,000 words | dan_DNK_POS | Appen Global | Part of Speech Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 100,000 | N/A | text
163
Danish (Denmark) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 107,000 words | dan_DNK_PHON | Appen Global | Pronunciation Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 107,000 | N/A | text
90
Danish (Denmark) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 53 hours | Speecon Danish | Nuance | Scripted Speech | Danish | Denmark | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
15
Dari (Afghanistan) broadcast data
Audio | ASR, Automatic Captioning, Keyword Spotting | Microphone | 51 hours | DAR_BRC001 | Appen Global | Broadcast Speech | Dari | Afghanistan | Low background noise (studio) | N/A | 1 | Available on request | Available on request | N/A | wav | Dataset is fully transcribed and timestamped
Dataset is largely speech only and does not include music or advertisements
Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors)
14
Dari (Afghanistan) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 40 hours | DAR_ASR001 | Appen Global | Conversational Speech | Dari | Afghanistan | Low background noise | 500 | 2 | Available on request | 11,168 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is largely speech only and does not include music or advertisements
165
Dari (Afghanistan) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 30,000 words | prs_AFG_PHON | Appen Global | Pronunciation Dictionary | Dari | Afghanistan | N/A | N/A | N/A | N/A | 30,000 | N/A | text
166
Dholuo (Kenya) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 20,000 words | luo_KEN_PHON | Appen Global | Pronunciation Dictionary | Dholuo | Kenya | N/A | N/A | N/A | N/A | 20,000 | N/A | text
91
Dutch (Belgium) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 47 hours | Speecon Dutch from Belgium | Nuance | Scripted Speech | Dutch | Belgium | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
33
Dutch (Belgium) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Microphone | 80 hours | Flemish SpeechDat(II) FDB-1000 (FIXED1FL) | Nuance | Scripted Speech | Dutch | Belgium | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
19
Dutch (Netherlands & Belgium) scripted in-car
Audio | ASR, Virtual Assistant, In Car HMI & Entertainment | Microphone and mobile phone | 27 hours | Dutch and Flemish SpeechDat-Car | Nuance | Scripted Speech | Dutch | Netherlands; Belgium | Mixed (in-car) | 302 | 5 | 15,100 | Available on request | 16 and 8 | alaw | Dataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report
125 prompts per adult speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech
66
Dutch (Netherlands) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 36 hours | NLD_ASR001 | Appen Global | Conversational Speech | Dutch | Netherlands | Low background noise | 200 | 2 | Available on request | 14,964 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
167
Dutch (Netherlands) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 45,000 words | nld_NLD_PHON | Appen Global | Pronunciation Dictionary | Dutch | Netherlands | N/A | N/A | N/A | N/A | 45,000 | N/A | text
92
Dutch (Netherlands) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 68 hours | Speecon Dutch from the Netherlands | Nuance | Scripted Speech | Dutch | Netherlands | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
125
East African facial images
Image | Facial Recognition | Camera | 13,500 images | IMG_FACE_KEN_CN | Appen China | Human Face | N/A | Kenya | Mixed background and lighting conditions | 100 | N/A | N/A | N/A | N/A | jpg
21
English (Arabic - Levant/Egypt) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 28 hours | ENA_ASR001 | Appen Global | Conversational Speech | English | Egypt | Low background noise | 250 | 2 | Available on request | 5,619 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
169
English (Australia) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 157,000 words | eng_AUS_PHON | Appen Global | Pronunciation Dictionary | English | Australia | N/A | N/A | N/A | N/A | 157,000 | N/A | text
2
English (Australia) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 92 hours | AUS_ASR001 | Appen Global | Scripted Speech | English | Australia | Low background noise (home/office) | 500 | 1 | 82,500 | 35,137 | 8 | alaw | Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
162 prompts (read speech) per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words
3
English (Australia) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 118 hours | AUS_ASR002 | Appen Global | Scripted Speech | English | Australia | Mixed | 1,000 | 1 | 75,000 | 19 | 8 | alaw | Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
75 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
The prompts are a mixture of 'read' and 'elicited' items where 5 prompts per script are 'spontaneous free speech'
171
English (Canada) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 3,000 words | eng_CAN_POS | Appen Global | Part of Speech Dictionary | English | Canada | N/A | N/A | N/A | N/A | 3,000 | N/A | text
170
English (Canada) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 50,000 words | eng_CAN_PHON | Appen Global | Pronunciation Dictionary | English | Canada | N/A | N/A | N/A | N/A | 50,000 | N/A | text
22
English (Canada) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 144 hours | ENC_ASR001 | Appen Global | Scripted Speech | English | Canada | Mixed | 1,000 | 1 | 99,000 | 12,483 | 8 | alaw or wav | Fully transcribed to SALA II/SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
99 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
173
English (Hong Kong) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 18,000 words | eng_HKG_PHON | Appen Global | Pronunciation Dictionary | English | Hong Kong | N/A | N/A | N/A | N/A | 18,000 | N/A | text
25
English (India) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 67 hours | ENI_ASR002 | Appen Global | Conversational Speech | English | India | Low background noise | 540 | 2 | 77,565 | 11,646 | 8 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
271 telephony conversations are recorded for this project
175
English (India) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 13,000 words | eng_IND_POS | Appen Global | Part of Speech Dictionary | English | India | N/A | N/A | N/A | N/A | 13,000 | N/A | text
174
English (India) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 60,000 words | eng_IND_PHON | Appen Global | Pronunciation Dictionary | English | India | N/A | N/A | N/A | N/A | 60,000 | N/A | text
24
English (India) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 217 hours | ENI_ASR001 | Appen Global | Scripted Speech | English | India | Mixed | 2,358 | 1 | 117,900 | 9,190 | 8 | alaw | Fully transcribed to SpeechDAT type conventions.
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
49 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
176
English (Ireland) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 12,000 words | eng_IRL_PHON | Appen Global | Pronunciation Dictionary | English | Ireland | N/A | N/A | N/A | N/A | 12,000 | N/A | text
177
English (NZ) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 50,000 words | eng_NZL_PHON | Appen Global | Pronunciation Dictionary | English | NZ | N/A | N/A | N/A | N/A | 50,000 | N/A | text
23
English (Philippines) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 53 hours | ENF_ASR001 | Appen Global | Conversational Speech | English | Philippines | Low background noise | 450 | 2 | 41,602 | 7,272 | 8 | alaw or wav | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
172
English (Philippines) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 5,000 words | eng_PHL_PHON | Appen Global | Pronunciation Dictionary | English | Philippines | N/A | N/A | N/A | N/A | 5,000 | N/A | text
168
English (United Arab Emirates) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 5,000 words | eng_ARE_PHON | Appen Global | Pronunciation Dictionary | English | United Arab Emirates | N/A | N/A | N/A | N/A | 5,000 | N/A | text
67
English (United Arab Emirates) scripted telephony
Audio | ASR, Call Centre, Virtual Assistant | Mobile phone and landline | 33 hours | OrienTel English as spoken in the United Arab Emirates | Nuance | Scripted Speech | English | United Arab Emirates | Low background noise | 500 | 1 | 25,500 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
99
English (United Kingdom)
Audio | TTS | Headset microphone | 10 hours | TC-STAR female baseline voice Laura | Nuance | Scripted Speech | English | United Kingdom | Low background noise (studio) | 1 | 1 | Available on request | Available on request | 96 | Available on request | Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked)
Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription
100
English (United Kingdom)
Audio | TTS | Headset microphone | 10 hours | TC-STAR male baseline voice Ian | Nuance | Scripted Speech | English | United Kingdom | Low background noise (studio) | 1 | 1 | Available on request | Available on request | 96 | Available on request | Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked)
Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription
259
English (United Kingdom) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 50 hours | UKE_ASR001B | Appen Global | Conversational Speech | English | United Kingdom | Low background noise | 1,150 | 2 | Available on request | 13,192 | 8 | wav | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
104
English (United Kingdom) conversational telephony
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone and landline | 150 hours | UKE_ASR001 | Appen Global | Conversational Speech | English | United Kingdom | Low background noise | 1,150 | 2 | 298,562 | 24,193 | 8 | wav | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
179
English (United Kingdom) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 155,000 words | eng_GBR_POS | Appen Global | Part of Speech Dictionary | English | United Kingdom | N/A | N/A | N/A | N/A | 155,000 | N/A | text
178
English (United Kingdom) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 195,000 words | eng_GBR_PHON | Appen Global | Pronunciation Dictionary | English | United Kingdom | N/A | N/A | N/A | N/A | 195,000 | N/A | text
107
English (United States) conversational smartphone
Audio | ASR, Conversational AI, Speech Analytics | Mobile phone | 1,000 hours | USE_ASR003 | Appen Global | Conversational Speech | English | United States | Low background noise | 2,000 | 1 | 500,000 | 52,586 | 16 | wav | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Conversations cover a wide variety of topics, including study/major/work, hometown, living arrangements, weather and seasons, punctuality, and TV programs/film
181
English (United States) Part of Speech Dictionary
Text | ASR, TTS, Language Modelling | N/A | 263,000 words | eng_USA_POS | Appen Global | Part of Speech Dictionary | English | United States | N/A | N/A | N/A | N/A | 263,000 | N/A | text
180
English (United States) Pronunciation Dictionary
Text | ASR, TTS, Language Modelling | N/A | 330,000 words | eng_USA_PHON | Appen Global | Pronunciation Dictionary | English | United States | N/A | N/A | N/A | N/A | 330,000 | N/A | text
93
English (United States) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 53 hours | Speecon English (USA) database | Nuance | Scripted Speech | English | United States | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
106
English (United States) scripted microphone
Audio | ASR, Virtual Assistant, Chatbot | Microphone | 62 hours | USE_ASR001 | Appen Global | Scripted Speech | English | United States | Low background noise (studio) | 200 | 2 | 80,000 | 18,318 | 48 | alaw | Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Each speaker read 400 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words
131
English NER news text
Text NER, Content Classification, Search EnginesN/A22,768 sentences Add QuoteENG_NER001Appen GlobalNews NEREnglishN/AN/AN/AN/A22,768Available on requestN/Atext
32
Farsi/Persian (Iran) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline30 hours Add QuoteFAR_ASR002Appen GlobalConversational SpeechIranian PersianIranMixed1,0002Available on request12,3588wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
31
Farsi/Persian (Iran) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline85 hours Add QuoteFAR_ASR001Appen GlobalScripted SpeechIranian PersianIranMixed789138,4008,7168alawFully transcribed to OrienTel type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
48 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
135
Farsi/Persian NER news text
Text NER, Content Classification, Search EnginesN/A19,584 sentences Add QuoteFAR_NER001Appen GlobalNews NERIranian PersianIranN/AN/AN/A19,584Available on requestN/Atext
185
Finnish (Finland) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotefin_FIN_POSAppen GlobalPart of Speech DictionaryFinnishFinlandN/AN/AN/AN/A10,000N/Atext
128
Finnish (Finland) printed text OCR
Image Document Processing, Document SearchCamera7293 images Add QuoteIMG_OCR_FIN_CNAppen ChinaDocument OCRFinnishFinlandMixed lighting conditions4NANANANAjpgImages containing text, such as billboards / outer packaging / signage / magazines / menus, etc.
184
Finnish (Finland) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A85,000 words Add Quotefin_FIN_PHONAppen GlobalPronunciation DictionaryFinnishFinlandN/AN/AN/AN/A85,000N/Atext
145
French (Algeria) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A4,000 words Add Quotefra_DZA_PHONAppen GlobalPronunciation DictionaryFrenchAlgeriaN/AN/AN/AN/A4,000N/AtextArabic script
5
French (Belgium) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only76 hours Add QuoteBelgian French SpeechDat(II) FDB-1000 (FIXED1BF)NuanceScripted SpeechFrenchBelgiumLow background noise1,000153,000Available on request8alawDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
36
French (Canada) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline9 hours Add QuoteFRC_ASR003Appen GlobalConversational SpeechFrenchCanadaMixed682Available on request6,0228alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
For the majority of calls, only one half of the conversation was collected and transcribed; however, for a smaller number of calls, both speakers (in-line/out-line) were collected and transcribed
186
French (Canada) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A67,000 words Add Quotefra_CAN_PHONAppen GlobalPronunciation DictionaryFrenchCanadaN/AN/AN/AN/A67,000N/Atext
35
French (Canada) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone46 hours Add QuoteFRC_ASR002Appen GlobalScripted SpeechFrenchCanadaLow background noise (home/office)150122,50010,75516alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
150 prompts per speaker including digits, digit strings (randomly generated), addresses and phonetically rich sentences and words
34
French (Canada) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone131 hours Add QuoteFRC_ASR001Appen GlobalScripted SpeechFrenchCanadaMixed1,00011,00,00011,6978alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
100 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
40
French (France) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline25 hours Add QuoteFRF_ASR001Appen GlobalConversational SpeechFrenchFranceLow background noise5632Available on request11,9228alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
39
French (France) In-Car
Audio ASR, Virtual Assistant, In Car HMI & EntertainmentMicrophone and mobile phone Add QuoteFrench SpeechDat-CarNuanceScripted SpeechFrenchFranceMixed (in-car)300537,500Available on request16 and 8Available on requestDataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report
Approximately 125 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech
188
French (France) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A95,000 words Add Quotefra_FRA_POSAppen GlobalPart of Speech DictionaryFrenchFranceN/AN/AN/AN/A95,000N/Atext
187
French (France) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A112,000 words Add Quotefra_FRA_PHONAppen GlobalPronunciation DictionaryFrenchFranceN/AN/AN/AN/A112,000N/Atext
41
French (France) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone26 hours Add QuoteFRF_ASR003Global PhoneScripted SpeechFrenchFranceLow background noise (home/office)98110,273Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
37
French (France) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only41 hours Add QuoteFrench SpeechDat(II) FDB-1000NuanceScripted SpeechFrenchFranceLow background noise (home/office)1,017148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
38
French (France) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only305 hours Add QuoteFrench SpeechDat(II) FDB-5000NuanceScripted SpeechFrenchFranceLow background noise5,04012,37,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
47 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
60
French (Luxembourg) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only45 hours Add QuoteLuxembourgish French SpeechDat(II) FDB-500 (FIXED1LF)NuanceScripted SpeechFrenchLuxembourgLow background noise614132,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
189
German (Germany) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A146,000 words Add Quotedeu_DEU_PHONAppen GlobalPronunciation DictionaryGermanGermanyN/AN/AN/AN/A146,000N/Atext
16
German (Germany) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone16 hours Add QuoteDEU_ASR001Appen GlobalScripted SpeechGermanGermanyLow background noise (studio)127212,7006,82616alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Each speaker read 100 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words
18
German (Germany) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone25 hours Add QuoteDEU_ASR003Global PhoneScripted SpeechGermanGermanyLow background noise (home/office)77110,085Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
42
German (Germany) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only31 hours Add QuoteGerman SpeechDat (II) FDB-1000NuanceScripted SpeechGermanGermanyLow background noise (home/office)988143,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
43
German (Germany) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only268 hours Add QuoteGerman SpeechDat(II) FDB-4000NuanceScripted SpeechGermanGermanyLow background noise (home/office)4,00011,60,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
61
German (Luxembourg) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only33 hours Add QuoteLuxembourgish German SpeechDat(II) FDB-500 (FIXED1LG)NuanceScripted SpeechGermanLuxembourgLow background noise500126,500Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
190
German (Switzerland) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotedeu_CHE_PHONAppen GlobalPronunciation DictionaryGermanSwitzerlandN/AN/AN/AN/A15,000N/Atext
94
German (Switzerland) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone53 hours Add QuoteSpeecon German (Switzerland) databaseNuanceScripted SpeechGermanSwitzerlandMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)41,70,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
68
German (Turkey) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline31 hours Add QuoteOrienTel German Spoken by TurkishNuanceScripted SpeechGermanTurkeyLow background noise300115,600Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
191
Greek (Greece) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A5,000 words Add Quoteell_GRC_PHONAppen GlobalPronunciation DictionaryGreekGreeceN/AN/AN/AN/A5,000N/Atext
118
Greek (Greece) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone191 hours Add QuoteGRE_ASR001_CNAppen ChinaScripted SpeechGreekGreeceLow background noise (home/office)287154,11368,27116wavDataset is fully transcribed
192
Guarani (Paraguay) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotegrn_PRY_PHONAppen GlobalPronunciation DictionaryGuaraniParaguayN/AN/AN/AN/A35,000N/Atext
194
Haitian Creole (Haiti) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotehat_HTI_PHONAppen GlobalPronunciation DictionaryHaitian CreoleHaitiN/AN/AN/AN/A15,000N/Atext
45
Hausa (Nigeria) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone33 hours Add QuoteHAU_ASR002Appen GlobalConversational SpeechHausaNigeriaLow background noise2002Available on request7,9498alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
195
Hausa (Nigeria) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A11,000 words Add Quotehau_NGA_PHONAppen GlobalPronunciation DictionaryHausaNigeriaN/AN/AN/AN/A11,000N/Atext
44
Hausa scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone20 hours Add QuoteHAU_ASR001Global PhoneScripted SpeechHausaMultipleLow background noise (home/office)10317,895Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
46
Hebrew (Israel) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline34 hours Add QuoteHEB_ASR001Appen GlobalConversational SpeechHebrewIsraelLow background noise2002Available on request19,2508alaw or wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
196
Hebrew (Israel) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A31,000 words Add Quoteheb_ISR_PHONAppen GlobalPronunciation DictionaryHebrewIsraelN/AN/AN/AN/A31,000N/Atext
48
Hindi (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline32 hours Add QuoteHIN_ASR002Appen GlobalConversational SpeechHindiIndiaMixed9962Available on request12,2668wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
197
Hindi (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotehin_IND_PHONAppen GlobalPronunciation DictionaryHindiIndiaN/AN/AN/AN/A35,000N/Atext
47
Hindi (India) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone224 hours Add QuoteHIN_ASR001Appen GlobalScripted SpeechHindiIndiaLow background noise1,920196,0009,8538alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
50 prompts per speaker including digits, natural numbers, personal, business and place names, web addresses, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
129
Human body movement
Video Fitness Applications, Action Classification, Gesture RecognitionMobile phone2000 videos Add QuoteVED_HUMAN_BODY_CNAppen ChinaHuman BodyN/AChinaMixed background and lighting conditions1000NANANANAmp4Video clips are approximately 10-20 seconds long
198
Hungarian (Hungary) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A500 words Add Quotehun_HUN_PHONAppen GlobalPronunciation DictionaryHungarianHungaryN/AN/AN/AN/A500N/Atext
119
Hungarian (Hungary) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone286 hours Add QuoteHUN_ASR001_CNAppen ChinaScripted SpeechHungarianHungaryLow background noise (home/office)254194,0312,01,92116wavDataset is fully transcribed
49
Hungarian (Hungary) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only65 hours Add QuoteHungarian SpeechDat(E)NuanceScripted SpeechHungarianHungaryLow background noise1,000148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
199
Igbo (Nigeria) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quoteibo_NGA_PHONAppen GlobalPronunciation DictionaryIgboNigeriaN/AN/AN/AN/A30,000N/Atext
152
Indonesian (Indonesia) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quoteind_IDN_POSAppen GlobalPart of Speech DictionaryIndonesianIndonesiaN/AN/AN/AN/A10,000N/Atext
151
Indonesian (Indonesia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A95,000 words Add Quoteind_IDN_PHONAppen GlobalPronunciation DictionaryIndonesianIndonesiaN/AN/AN/AN/A95,000N/Atext
183
Iranian Persian (Iran) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A1,400,000 words Add Quotepes_IRN_POSAppen GlobalPart of Speech DictionaryIranian PersianIranN/AN/AN/AN/A1,400,000N/Atext
182
Iranian Persian (Iran) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A80,000 words Add Quotepes_IRN_PHONAppen GlobalPronunciation DictionaryIranian PersianIranN/AN/AN/AN/A80,000N/Atext
52
Italian (Italy) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline36 hours Add QuoteITA_ASR003Appen GlobalConversational SpeechItalianItalyLow background noise2002Available on request18,9748alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
201
Italian (Italy) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A147,000 words Add Quoteita_ITA_POSAppen GlobalPart of Speech DictionaryItalianItalyN/AN/AN/AN/A147,000N/Atext
200
Italian (Italy) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A197,000 words Add Quoteita_ITA_PHONAppen GlobalPronunciation DictionaryItalianItalyN/AN/AN/AN/A197,000N/Atext
50
Italian (Italy) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone44 hours Add QuoteITA_ASR001Appen GlobalScripted SpeechItalianItalyMixed200440,0007,31622alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences
51
Italian (Italy) scripted microphone
Audio ASR, Virtual Assistant, In Car HMI & EntertainmentMicrophone47 hours Add QuoteITA_ASR002Appen GlobalScripted SpeechItalianItalyMixed (in-car)103435,87510,36648alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
350 prompts per speaker including digits, street names, generic command and control items, phonetically rich sentences and words
Each speaker recorded 1 or 2 sessions including Session 1 in a parked vehicle with the engine running and Session 2 in a vehicle travelling at 60 mph (100 km/h)
53
Italian (Italy) scripted microphone
Audio TTSMicrophone3 hours Add QuoteITA_TTS001Appen GlobalScripted SpeechItalianItalyLow background noise (studio)113,300Available on request22alawDataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset
3,300 prompts per speaker including phonetically rich sentences
54
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only38 hours Add QuoteItalian Fixed Network Speech SpeechDat(M) CorpusNuanceScripted SpeechItalianItalyLow background noise (home/office)1,000139,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
39 prompts per speaker including isolated and connected digits, natural numbers, money amounts, spelled words, time and date phrases, yes/no questions, city names, common application words, application words in phrases and phonetically rich sentences
55
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only228 hours Add QuoteItalian SpeechDat(II) FDB-3000NuanceScripted SpeechItalianItalyLow background noise (home/office)3,04011,34,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
56
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone103 hours Add QuoteItalian SpeechDat(II) MDB-250NuanceScripted SpeechItalianItalyLow background noise (home/office)375119,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
89
Italian (Italy) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone13 hours Add QuoteSpeechDat(M) Italian Mobile Network Speech DatabaseNuanceScripted SpeechItalianItalyLow background noise (home/office)342113,500Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
203
Japanese (Japan) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A265,000 words Add Quotejpn_JPN_POSAppen GlobalPart of Speech DictionaryJapaneseJapanN/AN/AN/AN/A265,000N/Atext
202
Japanese (Japan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A262,000 words Add Quotejpn_JPN_PHONAppen GlobalPronunciation DictionaryJapaneseJapanN/AN/AN/AN/A262,000N/Atext
57
Japanese (Japan) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone33 hours Add QuoteJPN_ASR001Global PhoneScripted SpeechJapaneseJapanLow background noise (home/office)144113,067Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
95
Japanese (Japan) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone57 hours Add QuoteSpeecon JapaneseNuanceScripted SpeechJapaneseJapanMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)41,70,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
136
Japanese NER news text
Text NER, Content Classification, Search EnginesN/A20,629 sentences Add QuoteJPY_NER001Appen GlobalNews NERJapaneseJapanN/AN/AN/A20,629Available on requestN/Atext
204
Javanese (Indonesia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A20,000 words Add Quotejav_IDN_PHONAppen GlobalPronunciation DictionaryJavaneseIndonesiaN/AN/AN/AN/A20,000N/Atext
58
Kannada (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline15 hours Add QuoteKAN_ASR001Appen GlobalConversational SpeechKannadaIndiaMixed1782Available on request15,6608alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
109
Kannada (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline57 hours Add QuoteKAN_ASR001AAppen GlobalConversational SpeechKannadaIndiaMixed1,0002Available on request15,6608alawApprox. 25% of the dataset sessions are transcribed and time stamped - full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
205
Kannada (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotekan_IND_PHONAppen GlobalPronunciation DictionaryKannadaIndiaN/AN/AN/AN/A35,000N/Atext
206
Kazakh (Kazakhstan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotekaz_KAZ_PHONAppen GlobalPronunciation DictionaryKazakhKazakhstanN/AN/AN/AN/A30,000N/Atext
123
Khmer (Cambodia) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone90 hours Add QuoteKHM_ASR001_CNAppen ChinaScripted SpeechCentral KhmerCambodiaLow background noise (home/office)94124,61852,15716wavDataset is fully transcribed
208
Korean (South Korea) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotekor_KOR_POSAppen GlobalPart of Speech DictionaryKoreanSouth KoreaN/AN/AN/AN/A100,000N/Atext
207
Korean (South Korea) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotekor_KOR_PHONAppen GlobalPronunciation DictionaryKoreanSouth KoreaN/AN/AN/AN/A100,000N/Atext
59
Korean (South Korea) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone20 hours Add QuoteKOR_ASR001Global PhoneScripted SpeechKoreanSouth KoreaLow background noise (home/office)10018,107Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
132
Korean NER news text
Text NER, Content Classification, Search EnginesN/A25,830 sentences Add QuoteKOR_NER001Appen GlobalNews NERKoreanSouth KoreaN/AN/AN/A25,830Available on requestN/Atext
209
Kurmanji (Turkey) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A60,000 words Add Quotekur_TUR_PHONAppen GlobalPronunciation DictionaryKurmanjiTurkeyN/AN/AN/AN/A60,000N/Atext
210
Lao (Laos) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A9,000 words Add Quotelao_LAO_PHONAppen GlobalPronunciation DictionaryLaoLaosN/AN/AN/AN/A9,000N/Atext
211
Lithuanian (Lithuania) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A60,000 words Add Quotelit_LTU_PHONAppen GlobalPronunciation DictionaryLithuanianLithuaniaN/AN/AN/AN/A60,000N/Atext
212
Malayalam (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A4,000 words Add Quotemal_IND_PHONAppen GlobalPronunciation DictionaryMalayalamIndiaN/AN/AN/AN/A4,000N/Atext
213
Malaysian (Malaysia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotemsa_MYS_PHONAppen GlobalPronunciation DictionaryMalaysianMalaysiaN/AN/AN/AN/A10,000N/Atext
214
Mandarin (Simplified) (China) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A35,000 words Add Quotezho_CHN_PHONAppen GlobalPronunciation DictionaryMandarin (Simplified)ChinaN/AN/AN/AN/A35,000N/Atext
215
Mandarin (Traditional) (Taiwan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A50,000 words Add Quotezho_TWN_PHONAppen GlobalPronunciation DictionaryMandarin (Traditional)TaiwanN/AN/AN/AN/A50,000N/Atext
63
Mandarin Chinese (China) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone26 hours Add QuoteMAC_ASR002Global PhoneScripted SpeechMandarin ChineseChinaLow background noise (home/office)132110,225Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
62
Mandarin Chinese (China) scripted telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline323 hours Add QuoteMAC_ASR001Appen GlobalScripted SpeechMandarin ChineseChinaMixed2,00012,00,0007,1458alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
98 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words
134
Mandarin NER news text
Text NER, Content Classification, Search EnginesN/A17,313 sentences Add QuoteMAC_NER001Appen GlobalNews NERMandarin ChineseChinaN/AN/AN/A17,313Available on requestN/Atext
64
Marathi (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline15 hours Add QuoteMAR_ASR001Appen GlobalConversational SpeechMarathiIndiaMixed1802Available on request11,9088alawApprox. 29% of the dataset sessions are transcribed and time stamped - full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
110
Marathi (India) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline52 hours Add QuoteMAR_ASR001AAppen GlobalConversational SpeechMarathiIndiaMixed1,0002Available on request11,9088alawA portion of the dataset sessions is transcribed and time stamped - full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
216
Marathi (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotemar_IND_PHONAppen GlobalPronunciation DictionaryMarathiIndiaN/AN/AN/AN/A30,000N/Atext
217
Mongolian (Mongolia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotemon_MNG_PHONAppen GlobalPronunciation DictionaryMongolianMongoliaN/AN/AN/AN/A30,000N/Atext
219
Norwegian (Norway) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A3,000 words Add Quotenor_NOR_POSAppen GlobalPart of Speech DictionaryNorwegianNorwayN/AN/AN/AN/A3,000N/Atext
218
Norwegian (Norway) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A115,000 words Add Quotenor_NOR_PHONAppen GlobalPronunciation DictionaryNorwegianNorwayN/AN/AN/AN/A115,000N/Atext
220
Oriya (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quoteori_IND_PHONAppen GlobalPronunciation DictionaryOriyaIndiaN/AN/AN/AN/A15,000N/Atext
80
Panjabi (Pakistan) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline20 hours Add QuotePAP_ASR001Appen GlobalConversational SpeechPanjabiPakistanLow background noise2052Available on request7,2988alawDataset is fully transcribed and time-stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For 71% of calls, both speakers (in-line/out-line) were collected and transcribed; for the remaining 29% of calls, only one half of the conversation was collected and transcribed
74
Pashto (Afghanistan) broadcast
Audio ASR, Automatic Captioning, Keyword SpottingMicrophone51 hours Add QuotePAS_BRC001Appen GlobalBroadcast SpeechNorthern Pashto; Southern PashtoAfghanistanLow background noise (studio)N/A1Available on requestAvailable on requestN/AwavDataset is fully transcribed and timestamped
Dataset is largely speech only and does not include music or advertisements
Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors)
73
Pashto (Afghanistan) conversational microphone
Audio ASR, Conversational AI, Speech AnalyticsMicrophone39 hours Add QuotePAS_ASR002Appen GlobalConversational SpeechNorthern Pashto; Southern PashtoAfghanistanLow background noise402Available on request9,48016wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
A full translation of the transcripts into French is also available as an optional additional purchase
Average length of calls: 120 mins. One speaker acts as the interviewer and the other as the interviewee, in scenarios similar to TransTAC style (e.g. civil affairs, checkpoints, etc.)
The interviewer appears in more than one set of dialogues but the interviewee is unique for each set
72
Pashto (Afghanistan) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline55 hours Add QuotePAS_ASR001Appen GlobalConversational SpeechNorthern Pashto; Southern PashtoAfghanistanLow background noise9672Available on request13,6338wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; for a smaller number of calls, however, only one half of the conversation was collected and transcribed
221
Pashto (Afghanistan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A65,000 words Add Quotepus_AFG_PHONAppen GlobalPronunciation DictionaryPashtoAfghanistanN/AN/AN/AN/A65,000N/Atext
223
Polish (Poland) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A4,000 words Add Quotepol_POL_POSAppen GlobalPart of Speech DictionaryPolishPolandN/AN/AN/AN/A4,000N/Atext
222
Polish (Poland) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A40,000 words Add Quotepol_POL_PHONAppen GlobalPronunciation DictionaryPolishPolandN/AN/AN/AN/A40,000N/Atext
75
Polish (Poland) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone25 hours Add QuotePOL_ASR001Global PhoneScripted SpeechPolishPolandLow background noise (home/office)99110,130Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
120
Polish (Poland) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone293 hours Add QuotePOL_ASR002_CNAppen ChinaScripted SpeechPolishPolandLow background noise (home/office)35311,06,6741,68,54416wavDataset is fully transcribed
76
Polish (Poland) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only78 hours Add QuotePolish SpeechDat(E) DatabaseNuanceScripted SpeechPolishPolandLow background noise1,000148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
78
Portuguese (Brazil) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline33 hours Add QuotePTB_ASR002Appen GlobalConversational SpeechPortugueseBrazilLow background noise2002Available on request11,2878alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
77
Portuguese (Brazil) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone26 hours Add QuotePTB_ASR001Global PhoneScripted SpeechPortugueseBrazilLow background noise (home/office)102110,417Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
225
Portuguese (Brazil) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotepor_BRA_POSAppen GlobalPart of Speech DictionaryPortugueseBrazilN/AN/AN/AN/A100,000N/Atext
224
Portuguese (Brazil) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A102,000 words Add Quotepor_BRA_PHONAppen GlobalPronunciation DictionaryPortugueseBrazilN/AN/AN/AN/A102,000N/Atext
79
Portuguese (Portugal) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline36 hours Add QuotePTP_ASR001Appen GlobalConversational SpeechPortuguesePortugalLow background noise2002Available on request16,3398alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
227
Portuguese (Portugal) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotepor_PRT_POSAppen GlobalPart of Speech DictionaryPortuguesePortugalN/AN/AN/AN/A100,000N/Atext
226
Portuguese (Portugal) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A112,000 words Add Quotepor_PRT_PHONAppen GlobalPronunciation DictionaryPortuguesePortugalN/AN/AN/AN/A112,000N/Atext
81
Romanian (Romania) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline37 hours Add QuoteROM_ASR001Appen GlobalConversational SpeechRomanianRomaniaLow background noise2002Available on request16,6588alawDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
228
Romanian (Romania) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quoteron_ROU_PHONAppen GlobalPronunciation DictionaryRomanianRomaniaN/AN/AN/AN/A15,000N/Atext
82
Russian (Russia) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline37 hours Add QuoteRUS_ASR001Appen GlobalConversational SpeechRussianRussiaLow background noise2002Available on request28,2848alaw or wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
230
Russian (Russia) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quoterus_RUS_POSAppen GlobalPart of Speech DictionaryRussianRussiaN/AN/AN/AN/A100,000N/Atext
229
Russian (Russia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A115,000 words Add Quoterus_RUS_PHONAppen GlobalPronunciation DictionaryRussianRussiaN/AN/AN/AN/A115,000N/Atext
83
Russian (Russia) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone31 hours Add QuoteRUS_ASR002Global PhoneScripted SpeechRussianRussiaLow background noise (home/office)115112,205Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
96
Russian (Russia) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone46 hours Add QuoteSpeecon Russian DatabaseNuanceScripted SpeechRussianRussiaMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)41,70,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
84
Russian (Russia) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only180 hours Add QuoteRussian SpeechDat(E) DatabaseNuanceScripted SpeechRussianRussiaLow background noise2,50011,12,000Available on request8alawDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
133
Russian NER news text
Text NER, Content Classification, Search EnginesN/A29,888 sentences Add QuoteRUS_NER001Appen GlobalNews NERRussianRussiaN/AN/AN/A29,888Available on requestN/Atext
231
Serbian (Serbia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotesrp_SRB_PHONAppen GlobalPronunciation DictionarySerbianSerbiaN/AN/AN/AN/A15,000N/Atext
126
Simplified Chinese printed text OCR
Image Document Processing, Document SearchCamera200 images Add QuoteIMG_OCR_MAC_CNAppen ChinaDocument OCRN/AChinaMixed lighting conditions30NANANANAjpgText in each image is labeled with bounding boxes by the line
Images containing heavy text in Chinese, including books, publications, posters, receipts, PPT, printed paper, etc.
85
Slovak (Slovakia) scripted telephony
Audio ASR, Call Centre, Virtual AssistantLandline only65 hours Add QuoteSlovak SpeechDat(E) DatabaseNuanceScripted SpeechSlovakSlovakiaLow background noise1,000148,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
86
Slovenian (Slovenia) telephony
Audio ASR, Call Centre, Virtual AssistantLandline only76 hours Add QuoteSlovenian SpeechDat(II) FDB-1000NuanceScripted SpeechSlovenianSloveniaLow background noise (home/office)1,000140,000Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
Approximately 40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
87
Somali (Somalia) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline50 hours Add QuoteSOM_ASR001Appen GlobalConversational SpeechSomaliSomaliaLow background noise1,0002Available on request23,2178alawDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
232
Somali (Somalia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A76,000 words Add Quotesom_SOM_PHONAppen GlobalPronunciation DictionarySomaliSomaliaN/AN/AN/AN/A76,000N/Atext
233
Sorani (Iraq) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A25,000 words Add Quotekur_IRQ_PHONAppen GlobalPronunciation DictionarySoraniIraqN/AN/AN/AN/A25,000N/Atext
88
Sorani (Kurdish) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline5 hours Add QuoteSOR_ASR001Appen GlobalConversational SpeechCentral Kurdish (Iran)IranLow background noise1702Available on request7,9248alaw or wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For a large proportion of calls, only one half of the conversation was collected and transcribed
234
Spanish (Argentina) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_ARG_PHONAppen GlobalPronunciation DictionarySpanishArgentinaN/AN/AN/AN/A15,000N/Atext
236
Spanish (Chile) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_CHL_PHONAppen GlobalPronunciation DictionarySpanishChileN/AN/AN/AN/A15,000N/Atext
237
Spanish (Colombia) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_COL_PHONAppen GlobalPronunciation DictionarySpanishColombiaN/AN/AN/AN/A15,000N/Atext
27
Spanish (Latin America - Chile and Colombia) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline22 hours Add QuoteESL_ASR002Appen GlobalConversational SpeechSpanishChile; ColombiaMixed842Available on requestAvailable on request8wavDataset is fully transcribed and time-stamped
Call Centre style conversations (by 64 customers, 14 agents) in banking and telco domains, primarily using mobile phone
26
Spanish (Latin America) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone17 hours Add QuoteESL_ASR001Global PhoneScripted SpeechSpanishCosta RicaLow background noise (home/office)10016,898Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
238
Spanish (Peru) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_PER_PHONAppen GlobalPronunciation DictionarySpanishPeruN/AN/AN/AN/A15,000N/Atext
235
Spanish (Spain) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quotespa_ESP_PHONAppen GlobalPronunciation DictionarySpanishSpainN/AN/AN/AN/A100,000N/Atext
28
Spanish (Spain) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone39 hours Add QuoteESP_ASR001Appen GlobalScripted SpeechSpanishSpainMixed200440,0006,36722alawFully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences
30
Spanish (Spain) scripted microphone
Audio TTSMicrophone1 hour Add QuoteESP_TTS001Appen GlobalScripted SpeechSpanishSpainLow background noise (studio)111,7873,61422alawDataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset
1,787 prompts per speaker including phonetically rich sentences
97
Spanish (Spain) scripted microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone46 hours Add QuoteSpeecon Spanish DatabaseNuanceScripted SpeechSpanishSpainMixed (office, entertainment, car, public place)600 (550 adult speakers and 50 child speakers)41,70,000Available on request16Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
117
Spanish (Spain) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone540 hours Add QuoteESP_ASR002_CNAppen ChinaScripted SpeechSpanishSpainLow background noise (home/office)34712,58,3951,34,93916wavDataset is fully transcribed
239
Spanish (United States) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A90,000 words Add Quotespa_USA_PHONAppen GlobalPronunciation DictionarySpanishUnited StatesN/AN/AN/AN/A90,000N/Atext
240
Spanish (Venezuela) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A15,000 words Add Quotespa_VEN_PHONAppen GlobalPronunciation DictionarySpanishVenezuelaN/AN/AN/AN/A15,000N/Atext
241
Swahili (Kenya) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A66,000 words Add Quoteswa_KEN_PHONAppen GlobalPronunciation DictionarySwahiliKenyaN/AN/AN/AN/A66,000N/Atext
243
Swedish (Sweden) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A105,000 words Add Quoteswe_SWE_POSAppen GlobalPart of Speech DictionarySwedishSwedenN/AN/AN/AN/A105,000N/Atext
242
Swedish (Sweden) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A100,000 words Add Quoteswe_SWE_PHONAppen GlobalPronunciation DictionarySwedishSwedenN/AN/AN/AN/A100,000N/Atext
98
Swedish (Sweden/ Finland) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone30 hours Add QuoteSWE_ASR001Global PhoneScripted SpeechSwedishSweden; FinlandLow background noise (home/office)98111,816Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
244
Sylheti (Bangladesh- India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A22,000 words Add Quotesyl_BGD;IND_PHONAppen GlobalPronunciation DictionarySylhetiBangladesh; IndiaN/AN/AN/AN/A22,000N/Atext
245
Tagalog (Philippines) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A30,000 words Add Quotetgl_PHL_PHONAppen GlobalPronunciation DictionaryTagalogPhilippinesN/AN/AN/AN/A30,000N/Atext
247
Tamil (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A105,000 words Add Quotetam_IND_PHONAppen GlobalPronunciation DictionaryTamilIndiaN/AN/AN/AN/A105,000N/Atext
246
Telugu (India) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A50,000 words Add Quotetel_IND_PHONAppen GlobalPronunciation DictionaryTeluguIndiaN/AN/AN/AN/A50,000N/Atext
101
Thai (Thailand) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone28 hours Add QuoteTHA_ASR001Global PhoneScripted SpeechThaiThailandLow background noise (home/office)98114,039Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
127
Thai (Thailand) printed text OCR
Image Document Processing, Document SearchCamera1219 images Add QuoteIMG_OCR_THA_CNAppen ChinaDocument OCRThaiThailandMixed lighting conditions10NANANANAjpgImages containing text, Shopping receipts / tickets / invoices / taxi slips, etc.
248
Thai (Thailand) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A55,000 words Add Quotetha_THA_PHONAppen GlobalPronunciation DictionaryThaiThailandN/AN/AN/AN/A55,000N/Atext
249
Tok Pisin (Papua New Guinea) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotetpi_PNG_PHONAppen GlobalPronunciation DictionaryTok PisinPapua New GuineaN/AN/AN/AN/A10,000N/Atext
102
Turkish (Turkey) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline41 hours Add QuoteTUR_ASR001Appen GlobalConversational SpeechTurkishTurkeyLow background noise2002Available on request32,3868alaw or wavDataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
103
Turkish (Turkey) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone17 hours Add QuoteTUR_ASR002Global PhoneScripted SpeechTurkishTurkeyLow background noise (home/office)10016,950Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
251
Turkish (Turkey) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A257,000 words Add Quotetur_TUR_POSAppen GlobalPart of Speech DictionaryTurkishTurkeyN/AN/AN/AN/A257,000N/Atext
250
Turkish (Turkey) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A255,000 words Add Quotetur_TUR_PHONAppen GlobalPronunciation DictionaryTurkishTurkeyN/AN/AN/AN/A255,000N/Atext
121
Turkish (Turkey) scripted smartphone
Audio ASR, Virtual Assistant, ChatbotMobile phone739 hours Add QuoteTUR_ASR003_CNAppen ChinaScripted SpeechTurkishTurkeyLow background noise (home/office)66411,85,7062,15,13516wavDataset is fully transcribed
69
Turkish (Turkey) telephony
Audio ASR, Call Centre, Virtual AssistantMobile phone and landline118 hours Add QuoteOrienTel Turkish DatabaseNuanceScripted SpeechTurkishTurkeyLow background noise1,700176,500Available on request8Available on requestDataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
252
Ukrainian (Ukraine) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A5,000 words Add Quoteukr_UKR_PHONAppen GlobalPronunciation DictionaryUkrainianUkraineN/AN/AN/AN/A5,000N/Atext
105
Urdu (India/ Pakistan) conversational telephony
Audio ASR, Conversational AI, Speech AnalyticsMobile phone and landline47 hours Add QuoteURD_ASR001Appen GlobalConversational SpeechUrduIndia; PakistanMixed1,0002Available on request10,8718wavDataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
254
Urdu (Pakistan) Part of Speech Dictionary
Text ASR, TTS, Language ModellingN/A12,000 words Add Quoteurd_PAK_POSAppen GlobalPart of Speech DictionaryUrduPakistanN/AN/AN/AN/A12,000N/Atext
253
Urdu (Pakistan) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A40,000 words Add Quoteurd_PAK_PHONAppen GlobalPronunciation DictionaryUrduPakistanN/AN/AN/AN/A40,000N/Atext
137
Urdu NER news text
Text NER, Content Classification, Search EnginesN/A20,634 sentences Add QuoteURD_NER001Appen GlobalNews NERUrduPakistanN/AN/AN/A20,634Available on requestN/Atext
108
Vietnamese (Vietnam) microphone
Audio ASR, Virtual Assistant, ChatbotMicrophone47 hours Add QuoteVIE_ASR001Global PhoneScripted SpeechVietnameseVietnamLow background noise (home/office)129118,842Available on request16wavDataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with a large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
255
Vietnamese (Vietnam) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A8,000 words Add Quotevie_VNM_PHONAppen GlobalPronunciation DictionaryVietnameseVietnamN/AN/AN/AN/A8,000N/Atext
256
Wu (China) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotewuu_CHN_PHONAppen GlobalPronunciation DictionaryWuChinaN/AN/AN/AN/A10,000N/Atext
257
Xiang (China) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A10,000 words Add Quotehsn_CHN_PHONAppen GlobalPronunciation DictionaryXiangChinaN/AN/AN/AN/A10,000N/Atext
258
Zulu (South Africa) Pronunciation Dictionary
Text ASR, TTS, Language ModellingN/A75,000 words Add Quotezul_ZAF_PHONAppen GlobalPronunciation DictionaryZuluSouth AfricaN/AN/AN/AN/A75,000N/Atext





Use Cases


Whether you are working on a text-to-speech system, a voice recognition system or another solution that relies on natural language, high-quality licensed speech and language datasets allow you to go to market faster and reach more potential customers.