Off-the-shelf machine learning dataset repository from Appen. Find 250+ datasets across 80 languages and dialects for a variety of common AI and ML use cases.
This dataset consists of more than four hundred thousand handwritten names collected through charity projects to support disadvantaged children around the world.
Optical Character Recognition (OCR) uses image processing to convert characters in scanned documents into digital form. It typically performs well on machine-printed fonts. Handwritten characters, however, remain a difficult challenge for machines to recognize because individual writing styles vary so widely.
There are 206,799 first names and 207,024 surnames, for 413,823 images in total. The data is divided into a training set (331,059 images), a testing set (41,382 images), and a validation set (41,382 images), which is roughly an 80/10/10 split.
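The counts above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures stated in this description; the variable names are illustrative and are not part of the dataset.

```python
# Counts as stated in the dataset description.
first_names = 206_799
surnames = 207_024
splits = {"training": 331_059, "testing": 41_382, "validation": 41_382}

# The name counts and the split sizes should describe the same set of images.
total = first_names + surnames
assert total == sum(splits.values())

# Share of the full dataset that falls into each split.
fractions = {name: count / total for name, count in splits.items()}
for name, fraction in fractions.items():
    print(f"{name}: {splits[name]:,} images ({fraction:.1%})")
```

The two totals agree, and the shares come out close to 80%, 10%, and 10%.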
Labels for all images, created via human-in-the-loop annotation on the Appen platform, are also provided, enabling you to extend the dataset with your own data.
The input data for this job consists of hundreds of thousands of images of handwritten names. In the “Data” tab above, you’ll find the transcribed images split into training, testing, and validation sets.
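As a rough illustration of how one split's transcriptions might be paired with its images, here is a minimal sketch. It assumes each split ships with a CSV label file that has a header row and two columns named `filename` and `label`. Those column names, and the `load_labels` helper itself, are hypothetical; check the actual files in the “Data” tab and adjust accordingly.

```python
import csv
from pathlib import Path


def load_labels(csv_path, image_dir):
    """Return (image_path, transcription) pairs from a label CSV.

    Assumes hypothetical columns 'filename' and 'label'; rename them
    to match the real files.
    """
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            label = (row.get("label") or "").strip()
            if not label:
                # Skip images that have no transcription.
                continue
            pairs.append((Path(image_dir) / row["filename"], label))
    return pairs
```

Calling it once per split (training, testing, validation) keeps the three sets separate, so the provided partition is preserved when you add your own data.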