The field of artificial intelligence (AI) continues to evolve at a rapid pace as companies learn from their challenges and successes. More and more, companies are aligning on the foundational role that data plays in AI implementation. As a consequence, numerous exciting trends are emerging within the data subdomain. These trends illustrate both the barriers companies pursuing AI still face and the potential pathways to success.
Here are five of the most pronounced trends we’re seeing in the AI market this year:
1. High-quality Data Remains a Major Roadblock for AI
The majority of AI projects still don’t make it to production, a sign that companies continue to face significant challenges in implementing AI successfully. One of the largest hurdles remains obtaining high-quality data. In a recent survey, respondents cited the lack of skilled people and the lack of data (or data quality issues) as the top two bottlenecks for launching AI. This isn’t surprising: data scientists devote most of their project time to collecting and preparing data, and the quality of that data has a tremendous impact on how well the model performs. Getting this part right is critical, but resource-intensive.
Any company that solves the data challenge will unlock major competitive opportunities. Indeed, according to the latest State of AI report, companies are exploring expanded solutions, such as engaging external data providers to access the expertise and tooling they need.
2. AI Use Cases are Becoming Narrower
Selecting a narrow business problem to solve is a top tip for building high-performing AI solutions. Companies are starting to learn that this is a critical first step, and are choosing more specific, tightly scoped problems to focus on. To illustrate, here are a few projects Appen has recently worked on with a very narrow focus:
Biz-speak: The company was building an AI model that would suggest improvements or alternative phrases to common “biz-speak” (or business jargon). These terms were often very nuanced, presenting us with the challenge of capturing sufficient data.
Body movements: A company working on a model to automate personal training faced an obstacle: movement profiles change with age. To help solve this, they asked us to capture and annotate video of seniors doing somersaults.
Long-tail languages: Covid-19 information needs to be shared globally, but translation technology doesn’t support all languages. We were tasked with collecting and annotating data for low-resource languages such as Dari, Dinka, and Hausa.
These examples demonstrate how companies are narrowing their focus to very specific use cases, which has implications both for how training data is being used and what type of data is being collected.
3. Shift from Model-centric to Data-centric AI
Is it better to improve the code or the training data? That question has been at the forefront of AI practitioners’ minds in recent years. Several well-known experiments suggest that the magic is in the data, and we’re seeing a shift from model-centric development to data-centric development. With model-centric AI, the idea is to take the available data as given and develop models that compensate for noise and inaccuracies. Data-centric AI, on the other hand, holds the model fixed and focuses on improving the volume and quality of the data.
A well-known AI practitioner ran an experiment with a computer vision model that detects defects in steel sheets. He split his team in two: one group worked only on improving the model’s code, while the other worked only on improving the data fed to the model. Improving the code had virtually no impact on performance, while improving the data provided a huge uplift, from 76% to 93% accuracy. With data improvements alone, the model even beat human inspectors, who achieved only 90% accuracy.
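The effect described above can be reproduced in miniature. The sketch below is a toy illustration, not the original experiment; the task, the noise level, and all names are invented. It trains the same simple 1-nearest-neighbour classifier twice, once on noisily labeled data and once on clean labels, and compares accuracy on a clean held-out set:

```python
import random

random.seed(0)

def make_data(n, label_noise=0.0):
    """Toy task: points below 0.5 are class 0, above are class 1.
    label_noise flips that fraction of labels, simulating bad annotations."""
    data = []
    for _ in range(n):
        x = random.random()
        y = 0 if x < 0.5 else 1
        if random.random() < label_noise:
            y = 1 - y  # a mislabeled training example
        data.append((x, y))
    return data

def predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, test):
    return sum(predict(train, x) == y for x, y in test) / len(test)

test_set = make_data(1000)                     # clean held-out set
noisy_train = make_data(200, label_noise=0.3)  # 30% of labels flipped
clean_train = make_data(200)                   # same task, clean labels

print(f"noisy labels: {accuracy(noisy_train, test_set):.2f}")
print(f"clean labels: {accuracy(clean_train, test_set):.2f}")
```

Holding the model fixed and fixing only the labels recovers most of the lost accuracy, which is the data-centric argument in a nutshell.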
4. Emerging Need for Training Data Operations
As companies acknowledge the importance of training data to the success of their AI models, there is naturally a growing need for ways to manage that data. After all, developing training data involves many tasks: collection, ingestion, exploration, labeling, validation, and preparation. A governance framework covering these tasks will serve AI teams well going forward. A functional data governance framework may include the following key features:
- Version control for traceability
- Data security protocols
- Access controls
- Data pipeline monitoring
- Collaboration protocols
In any case, data governance frameworks set the foundation for building data pipelines, which lead to greater scalability. Companies that master training data operations will be well-positioned to scale their AI solutions.
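As a concrete illustration of the version-control and traceability items above, here is a minimal sketch of a dataset registry. It is a hypothetical design invented for this article (`DatasetRegistry` and its methods are not part of any real tool): every committed snapshot of a labeled dataset is kept, with a checksum for auditing.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class DatasetVersion:
    version: int   # monotonically increasing version number
    records: list  # the labeled examples in this snapshot
    checksum: str  # content hash, for traceability and audits

class DatasetRegistry:
    """Minimal version control for training data: every commit is kept,
    so any model can be traced back to the exact data it was trained on."""

    def __init__(self):
        self._versions = []

    def commit(self, records):
        payload = json.dumps(records, sort_keys=True).encode()
        snapshot = DatasetVersion(
            version=len(self._versions) + 1,
            records=list(records),
            checksum=hashlib.sha256(payload).hexdigest(),
        )
        self._versions.append(snapshot)
        return snapshot

    def get(self, version):
        return self._versions[version - 1]

registry = DatasetRegistry()
v1 = registry.commit([{"text": "hello", "label": "greeting"}])
v2 = registry.commit([{"text": "hello", "label": "greeting"},
                      {"text": "bye", "label": "farewell"}])
```

In production this role is usually filled by dedicated data-versioning tooling rather than hand-rolled code, but the core idea is the same: immutable snapshots plus checksums make training runs reproducible and auditable.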
5. AI-assisted Annotations Increasing
AutoML solutions are on the rise in response to the challenging nature of AI development. We’re a long way off from full automation, but in the meantime companies are leveraging AI-assisted annotation to streamline the time-consuming data labeling process. There are three main buckets of automation here:
- An AI model makes an initial best guess at the annotation. The annotator then checks and, if needed, corrects the hypothesis. This alone significantly reduces annotation time while maintaining high quality.
- AI assists the annotator during the labeling process, much like an auto-complete function, to save time.
- AI verifies the annotator’s output and notifies the annotator if it falls outside an expected threshold. This improves both the annotator’s performance and the quality of the annotations.
A company may choose to leverage any one, or all, of the above AI-assisted annotation methods. Regardless, more automation offers greater time savings and cost reduction, as long as quality is maintained.
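The buckets above can be sketched as a single workflow. The code below is a hypothetical illustration (the `toy_model` and all function names are invented, not a real annotation API): the model pre-labels each item, the annotator reviews it, and a verification step flags surprising overrides.

```python
# A toy "model": flags any text containing "refund" as a complaint,
# with a made-up confidence score (purely illustrative).
def toy_model(text):
    if "refund" in text:
        return "complaint", 0.95
    return "other", 0.60

def preannotate(model, text):
    """Bucket 1: the model takes a best guess for the annotator to review."""
    label, confidence = model(text)
    return {"text": text, "suggested": label, "confidence": confidence}

def review(suggestion, correction=None):
    """The annotator accepts the model's guess, or supplies a correction."""
    final = correction if correction is not None else suggestion["suggested"]
    return {**suggestion, "final": final}

def needs_recheck(annotation, threshold=0.9):
    """Bucket 3: flag cases where the annotator overrode a confident model."""
    overridden = annotation["final"] != annotation["suggested"]
    return overridden and annotation["confidence"] >= threshold

accepted = review(preannotate(toy_model, "please issue a refund"))
overridden = review(preannotate(toy_model, "refund was great, thanks!"),
                    correction="other")
```

Here the first review simply accepts the model’s guess (bucket 1), while the second is flagged for re-checking because the annotator overrode a high-confidence suggestion (bucket 3).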
While data remains a huge bottleneck, more AI teams are figuring out how to make data work for them: embracing data-centric AI, investing in training data operations, and reducing labeling time through automation. Narrowing the problem scope also makes it easier to understand data needs and to collect and train accordingly. With time, these approaches could increase the number of projects that make it to production. Without doubt, data trends are the ones to watch as this industry continues to develop.