How to Collect Data Efficiently and Responsibly for Your AI Initiatives
Data collection continues to be a major bottleneck for teams building artificial intelligence (AI). The reasons vary: a use case may lack sufficient data, newer machine learning (ML) techniques like deep learning may demand more of it, or teams may not have the right processes in place to get the data they need. Whatever the case, there's a growing need for accurate and scalable data solutions.

Best Practices for Collecting High-Quality Data
As an AI practitioner, you need to ask the right questions when developing a plan for data collection.

What kind of data do I need?
The problem you choose to solve will indicate what kind of data you need. For a speech recognition model, for example, you'll want speech data from speakers that represent the full range of customers you expect to have. This means speech data that covers all of the languages, accents, ages, and characteristics of your target customers.
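One way to make that requirement concrete is to track coverage of your collected data against a target customer profile. The sketch below is a minimal, hypothetical example: the metadata fields (language, accent, age_band) and the target sets are stand-ins for whatever attributes your collection actually captures.

```python
# A minimal sketch of a coverage check for a speech dataset. The record
# fields and target sets below are hypothetical placeholders.
from collections import Counter

# Metadata for collected speech samples (illustrative records).
samples = [
    {"language": "en", "accent": "us", "age_band": "18-30"},
    {"language": "en", "accent": "uk", "age_band": "31-50"},
    {"language": "es", "accent": "mx", "age_band": "18-30"},
]

# Coverage targets derived from the customers you expect to serve.
target_languages = {"en", "es", "fr"}
target_age_bands = {"18-30", "31-50", "51+"}

language_counts = Counter(s["language"] for s in samples)
age_counts = Counter(s["age_band"] for s in samples)

missing_languages = target_languages - set(language_counts)
missing_age_bands = target_age_bands - set(age_counts)

print("Language coverage:", dict(language_counts), "missing:", missing_languages)
print("Age-band coverage:", dict(age_counts), "missing:", missing_age_bands)
```

Running a check like this as data arrives shows early which customer segments are underrepresented, before the gaps surface as model errors.

Where can I source data from?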
First, understand what data you already have available internally, and whether it's usable for the problem you're trying to solve. If you need more data, there are many publicly available online sources. You can also work with a data partner to generate data through crowdsourcing, or create synthetic data to fill gaps in your dataset. Also keep in mind that you'll need a steady source of data long after your model launches to production, so make sure your data source can provide continuous data for retraining.
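When data flows in from several sources, it helps to tag each record with its provenance at ingestion time. The sketch below is one possible approach, with made-up record structures and source names; the point is that a source tag and timestamp let you audit the mix later and keep post-launch retraining pulls traceable.

```python
# A minimal sketch of tagging records with their source as they enter a
# shared training pool. Record fields and source names are illustrative.
from collections import Counter
from datetime import datetime, timezone

def ingest(records, source):
    """Attach provenance metadata (source name, ingestion time) to each record."""
    stamp = datetime.now(timezone.utc).isoformat()
    return [{**r, "source": source, "ingested_at": stamp} for r in records]

pool = []
pool += ingest([{"text": "internal support ticket"}], source="internal")
pool += ingest([{"text": "public corpus sentence"}], source="public")
pool += ingest([{"text": "generated paraphrase"}], source="synthetic")

# Later, audit how much of the training set came from each source.
print(Counter(r["source"] for r in pool))
```

How much data do I need?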
This will depend on the problem you're trying to solve as well as your budget, but the general answer is: as much as possible. There's rarely such a thing as too much data when it comes to building machine learning models. You need enough data to cover all of your model's potential use cases, including edge cases.
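One practical way to gauge whether more data is likely to help is to plot a learning curve: train on progressively larger subsets and watch the validation score. The sketch below uses scikit-learn's learning_curve with a synthetic dataset and a logistic-regression model as stand-ins for your own data and estimator; if validation scores are still climbing at the largest training sizes, collecting more data will probably pay off.

```python
# A rough learning-curve sketch to estimate whether more data would help.
# The synthetic dataset and logistic-regression model are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# If the mean validation score keeps rising with training size, the model
# is likely still data-limited.
for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size} training examples -> mean validation accuracy {score:.3f}")
```

How do I ensure my data is high-quality?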
Clean your datasets before using them to train your model. This means removing irrelevant or incomplete data as a first step (and checking that you weren't counting on that data for use case coverage). Your next step is to accurately label your data. Many companies turn to crowdsourcing for access to large numbers of annotators; the greater the variety of people annotating your data, the more inclusive your labels will be. If your data requires specific domain knowledge, leverage experts in that field for your labeling needs. A sketch of these cleaning and labeling steps follows at the end of this section.

By answering these questions, you can start to build a data pipeline that enables you to collect high-quality, accurately-labeled data efficiently. Ultimately, having a repeatable, consistent data pipeline will help you scale.
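The sketch below illustrates the cleaning and labeling steps described above, under simplified assumptions: field names, the empty-text completeness check, and the 2/3 agreement threshold are all illustrative choices, not a prescribed workflow. It drops incomplete records, resolves each item's label by majority vote across annotators, and flags low-agreement items for expert review.

```python
# A minimal sketch of cleaning and label aggregation. Field names and the
# agreement threshold are illustrative, not a prescribed workflow.
from collections import Counter

raw_items = [
    {"id": 1, "text": "great product", "labels": ["positive", "positive", "neutral"]},
    {"id": 2, "text": "", "labels": ["negative", "negative", "negative"]},  # incomplete
    {"id": 3, "text": "arrived late", "labels": ["negative", "neutral", "positive"]},
]

# Step 1: remove incomplete records (here, items with empty text).
clean_items = [item for item in raw_items if item["text"].strip()]

# Step 2: aggregate annotator labels by majority vote; flag items whose
# agreement falls below the threshold for expert review.
AGREEMENT_THRESHOLD = 2 / 3
for item in clean_items:
    votes = Counter(item["labels"])
    label, count = votes.most_common(1)[0]
    agreement = count / len(item["labels"])
    item["label"] = label
    item["needs_review"] = agreement < AGREEMENT_THRESHOLD

print(clean_items)
```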