The Project
To preserve the Larrakia language, linguist Dr. Mark Harvey has teamed up with the Larrakia Nation Aboriginal Corporation of People and Appen, with the goal of improving the database of usable text and audio samples of the Larrakia language. This database is a major step in preserving and reviving the language, as the last fluent speaker died more than 20 years ago. At the beginning of the project, a digitized audio and text database of Larrakia words, sentences, and utterances already existed, but it had limitations. Because this database will eventually be used to learn and teach the Larrakia language, resolving data discrepancies and filling in data gaps are critical to safeguarding the integrity of the language data.
The Challenge
The challenge at the start of this project was that the two databases, one with text and one with audio, were not linked. The audio and the text could each be accessed independently via loose time alignments, but there was no easy way to isolate particular sentences or particular speakers, or to distinguish between passages of English and Larrakia. Additionally, the text database contained many errors and needed significant editing work. This led Dr. Harvey to reach out to Appen. With Appen’s knowledge and experience in working with large amounts of data and creating easy-to-use, seamless databases, Dr. Harvey has been able to build a better, longer-lasting language database.
Another challenge of this project was making sure that the data and database remain usable far into the future. As Dr. Harvey noted in our interview, “One thing people don’t appreciate is that software and computers actually make archiving much worse than paper. Have you tried to access a Word document from the 1980s?” While a piece of paper from the 80s would be easily readable for any person, a digital document from the 80s would be incompatible with most modern software and computers. By working with Appen to create a sustainable, usable database, Dr. Harvey is ensuring the longevity of the Larrakia database while also making sure that it’s usable in a variety of formats.
The Solution
Appen was brought into this project to further align the two databases, enrich the associated metadata, and provide acoustic measurements to help describe Larrakia vowels and consonants. In the first phase, Appen linguists provided supplementary English transcription and introduced more granular time-stamping by inserting markers at relevant sense units (phrases, sentences, or single words). Each sense unit was then labeled by speaker role and by the language being spoken.
In the second phase of the project, this granularity allowed Appen to easily isolate particular parts of the text and collaborate with Dr. Harvey on making corrections and adding labeling that could then be slotted back into the database.
In the final phase of the project, subsets of vowels and consonants were extracted from the data. Appen specialists supervised the phonetic annotation of the extracted subsets and performed acoustic measurements, which will help describe and better understand the phonetic inventory (that is, the vowels and consonants) of Larrakia.
Partnering with Appen seemed inevitable to Dr. Harvey, as Appen was one of the few companies he knew of that could work with such a large amount of unique data. He went on to say that the “quality of the staff and the expertise has been what I’ve needed. And people have generally delivered. They’ve achieved the timelines and deadlines that they set, which is unusual in my experience of data processing. And data processing is rarely timetable.”
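The labeling scheme described in this section (time-stamped sense units, each tagged with a speaker role and a language) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `Segment` class, its field names, the `select_segments` helper, the role and language labels, and the placeholder texts are all hypothetical and do not reflect the actual schema or tooling of the Larrakia database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    """One time-stamped sense unit (a phrase, sentence, or single word).

    All field names are hypothetical, chosen only for illustration.
    """
    start_s: float       # start time in the audio, in seconds
    end_s: float         # end time in the audio, in seconds
    speaker_role: str    # e.g. "speaker" or "interviewer" (assumed labels)
    language: str        # e.g. "larrakia" or "english" (assumed labels)
    text: str            # transcription of the unit

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def select_segments(
    segments: List[Segment],
    language: Optional[str] = None,
    speaker_role: Optional[str] = None,
) -> List[Segment]:
    """Return the segments matching the given language and/or speaker role.

    A filter left as None is not applied.
    """
    return [
        s for s in segments
        if (language is None or s.language == language)
        and (speaker_role is None or s.speaker_role == speaker_role)
    ]


# Placeholder data: the texts are stand-ins, not real transcriptions.
segments = [
    Segment(0.0, 2.5, "interviewer", "english", "<English prompt>"),
    Segment(2.5, 4.0, "speaker", "larrakia", "<Larrakia utterance 1>"),
    Segment(4.0, 6.0, "speaker", "english", "<English gloss>"),
    Segment(6.0, 7.5, "speaker", "larrakia", "<Larrakia utterance 2>"),
]

# Isolate only the Larrakia passages, with their time alignments.
larrakia_only = select_segments(segments, language="larrakia")
for seg in larrakia_only:
    print(f"{seg.start_s:.1f}-{seg.end_s:.1f}s {seg.speaker_role}: {seg.text}")
```

Because every unit carries its own time stamps and labels, isolating a particular speaker or language becomes a simple filter, and a corrected unit can be slotted back in at the same time span.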
The Result
The Larrakia language database project is an ongoing effort. Aligning the two databases and studying vowels and consonants is just the beginning. The next steps will be to preserve and teach the language. As a partner, Appen has been helpful in creating a usable, sustainable database. Although the project is continuing, Dr. Harvey has been able to define what success looks like for this initiative: in the end, he hopes to have a usable database with a good shelf life, meaning one that is widely available and widely used.
“So, 20 years from now, hopefully, somebody will be at least able to get into the database and know where they’re going.” – Linguist Dr. Mark Harvey