You may have heard the buzzword “big data” in the context of AI, but what about small data? Whether you’re aware of it or not, small data is all around you: it powers online shopping experiences, airline recommendations, weather reports, and more. Small data is data in an accessible, actionable format that humans can easily comprehend. Data scientists often leverage small data to analyze current situations.
The growth of small data in machine learning (ML) is likely due to the greater availability of data in general, as well as experimentation with new data mining techniques. As the AI industry evolves, data scientists are increasingly turning to small data for its low computing requirements and ease of use.
Small Data vs. Big Data
How exactly is big data different from small data? Big data consists of large volumes of both structured and unstructured data. Given its size, it’s much harder to understand and analyze than small data, and it requires a lot of computer processing power to interpret.
Small data enables companies to achieve actionable insights without needing the complex algorithms required for big data analysis. As a result, companies don’t have to invest as much in data mining processes. Big data can be converted into small data through the application of computer algorithms that change the data into smaller, actionable chunks that represent components of the larger dataset.
One example of big-to-small data conversion is monitoring social media during a brand launch. A huge number of social media posts are created every second. A data scientist would need to filter that data by platform, time period, keyword, and any other relevant feature. This process converts the big data into smaller, more manageable chunks from which to draw insight.
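To make the filtering step concrete, here is a minimal Python sketch. The `Post` structure, the sample posts, and the "AcmePhone" brand are hypothetical stand-ins for a real social media feed, so treat this as an illustration of the idea rather than a production pipeline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    platform: str
    created_at: datetime
    text: str


def filter_posts(posts, platform, start, end, keyword):
    """Keep posts from one platform, inside a time window, that mention a keyword."""
    keyword = keyword.lower()
    return [
        p for p in posts
        if p.platform == platform
        and start <= p.created_at < end
        and keyword in p.text.lower()
    ]


# Hypothetical sample data standing in for a much larger stream of posts.
posts = [
    Post("twitter", datetime(2023, 5, 1, 9, 0), "Loving the new AcmePhone!"),
    Post("twitter", datetime(2023, 5, 1, 9, 5), "Traffic is terrible today"),
    Post("facebook", datetime(2023, 5, 1, 9, 7), "AcmePhone launch event is live"),
    Post("twitter", datetime(2023, 4, 20, 12, 0), "Rumors about the acmephone"),
]

launch_day = filter_posts(
    posts,
    platform="twitter",
    start=datetime(2023, 5, 1),
    end=datetime(2023, 5, 2),
    keyword="acmephone",
)
print(len(launch_day), "of", len(posts), "posts kept")
```

Each condition in the filter removes a slice of the original data, and what remains is small enough to read and reason about directly.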
The Benefits of Small Data
We’ve already hinted at the benefits of using small data versus big data, but there are several worth highlighting.
Big data is harder to manage: Using big data at scale is a massive effort, one that requires tremendous computer power for analysis purposes.
Small data is easier: Analyzing small chunks of data can be done very efficiently, without much investment of time and effort. This means that small data is more actionable than big data.
Small data is everywhere: Small data is already widely available in many industries. For example, social media provides a large amount of actionable data that can be used for a wide variety of purposes, marketing or otherwise.
Small data focuses on the end user: With small data, researchers can target the end user and their needs first. Small data provides the why behind end user behavior.
In many use cases, small data is a fast, efficient approach to analysis and can help inform powerful insights about customers across industries.
Approaches to Small Data in ML
Under the most traditional machine learning method, supervised learning, models are trained on large amounts of labeled training data. But there are numerous other methods for model training, many of which are gaining in popularity due to the cost efficiencies and time savings they offer. Because these methods often rely on small data, data quality becomes paramount.
Data scientists use small data when a model requires only a small amount of data or when there isn’t enough data available to train it conventionally. In these cases, data scientists can use any of the following ML techniques.
With few-shot learning, data scientists provide an ML model with a small amount of training data. We see this approach commonly in computer vision, where the model may not need many examples to identify an object. For example, if a face recognition algorithm unlocks your smartphone, your phone doesn’t require thousands of pictures of you to enable it. It needs only a few to add the security feature.
This technique is both low cost and low effort, making it appealing in situations where there may not be sufficient data to train a model under fully supervised learning.
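One simple way to classify from only a handful of examples is to average the examples of each class into a "prototype" and assign a new input to the nearest one. The sketch below assumes the inputs have already been turned into feature vectors (in practice these would typically come from a pretrained embedding model); the two-dimensional vectors and the "owner"/"stranger" labels are made up for illustration.

```python
import math


def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_prototypes(support_set):
    """Map each label to the mean of its few labeled examples."""
    return {label: centroid(examples) for label, examples in support_set.items()}


def classify(prototypes, query):
    """Return the label whose prototype is closest to the query vector."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], query))


# Hypothetical 2-D embeddings: only three labeled examples per class.
support_set = {
    "owner": [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]],
    "stranger": [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support_set)
print(classify(prototypes, [0.85, 0.15]))
```

This is only one of several few-shot strategies, but it shows why a few good examples can be enough when the features already separate the classes well.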
Knowledge graphs are secondary datasets, as they’re formed through filtering original, larger data. They consist of a set of data points or labels that have defined meanings and describe a specific domain. For instance, a knowledge graph could include data points of names of famous actresses, with lines (known as edges) connecting actresses who have worked together before. Knowledge graphs are a very useful tool for organizing knowledge in a way that’s highly explainable and reusable.
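The actress example can be sketched as a small undirected graph in Python. The class and method names here are our own illustration, and the placeholder names ("Actress A" and so on) are deliberately not real people.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Undirected graph: nodes are entities, edges mean 'have worked together'."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        # Store the link in both directions so lookups work from either node.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node):
        return sorted(self.edges.get(node, set()))

    def common_neighbors(self, a, b):
        """Entities linked to both a and b."""
        return sorted(self.edges.get(a, set()) & self.edges.get(b, set()))


kg = KnowledgeGraph()
kg.add_edge("Actress A", "Actress B")
kg.add_edge("Actress B", "Actress C")
kg.add_edge("Actress A", "Actress C")
kg.add_edge("Actress C", "Actress D")

print(kg.neighbors("Actress C"))
print(kg.common_neighbors("Actress A", "Actress D"))
```

Because every answer can be traced back to explicit edges, queries like "who has worked with both of these people?" stay easy to explain and reuse.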
Transfer learning is when an ML model is used as a starting point for another model that needs to accomplish a related task. It’s essentially a knowledge transfer from one model to the other. With the original model as a starting point, additional data can be used to further train the model toward handling the new task. Components of the original model can also be pruned if they aren’t needed for the new task.
Transfer learning is particularly useful in fields like natural language processing and computer vision, which require a lot of computing power and data. This method, if achievable, can offer a shortcut to obtaining results with less effort.
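As a rough illustration of the idea, the sketch below reuses a "pretrained" feature layer unchanged and trains only a small new output layer on four labeled examples. This frozen-layer-plus-new-head setup is one common form of transfer learning, not the only one. The weights, data, and hyperparameters are invented for the example; a real project would load an actual trained model through an ML framework.

```python
import math

# Assumed to come from a model already trained on a related task (made-up values).
PRETRAINED_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen layer reused from the original model: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=500, lr=0.5):
    """Fit a new logistic-regression head on top of the frozen features."""
    feats = [extract_features(x) for x in samples]
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def predict(head, x):
    """True if the new head assigns the input to the positive class."""
    w, b = head
    f = extract_features(x)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5


# Small labeled dataset for the new task: label 1 when the first value is larger.
samples = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
labels = [1, 1, 0, 0]

head = train_head(samples, labels)
print([predict(head, x) for x in samples])
```

Only the head's few parameters are learned here, which is why so little new data is needed; the reused layer carries over what the original model already knew.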
The idea behind self-supervised learning is for the model to gather supervisory signals from the data it has available. The model uses available data to make predictions on the unobserved or hidden data. For instance, in natural language processing data scientists may give a model a sentence with missing words and have the model predict which words are missing. With enough context clues from the unhidden words, the model learns to identify the remainder.
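The masked-word idea can be shown with a toy Python sketch: every unlabeled sentence is turned into training pairs by hiding one word at a time, and a simple count-based model learns which word tends to appear between a given pair of neighbors. Real systems use neural networks and far more context; the tiny corpus and the `<mask>` token below are assumptions made for the example.

```python
from collections import Counter, defaultdict

MASK = "<mask>"


def make_masked_examples(sentence):
    """Turn one unlabeled sentence into (masked_words, hidden_word) pairs."""
    words = sentence.lower().split()
    return [(words[:i] + [MASK] + words[i + 1:], word) for i, word in enumerate(words)]


def context(words):
    """Left and right neighbors of the mask, with markers at sentence edges."""
    i = words.index(MASK)
    left = words[i - 1] if i > 0 else "<s>"
    right = words[i + 1] if i + 1 < len(words) else "</s>"
    return left, right


def train(corpus):
    """Count which word fills each (left neighbor, right neighbor) context."""
    model = defaultdict(Counter)
    for sentence in corpus:
        for masked, target in make_masked_examples(sentence):
            model[context(masked)][target] += 1
    return model


def fill_mask(model, masked_sentence):
    """Predict the hidden word from its neighbors; None if the context is unseen."""
    candidates = model.get(context(masked_sentence.lower().split()))
    return candidates.most_common(1)[0][0] if candidates else None


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat slept on the mat",
]
model = train(corpus)
print(fill_mask(model, "the bird sat <mask> the floor"))
```

No one labeled this data by hand: the supervisory signal (the hidden word) comes from the sentences themselves.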
Synthetic data may be leveraged when a given dataset has gaps that are difficult to fill with existing data. A popular example is facial recognition models. These models need image data of faces that covers the full range of human skin tones; the problem is that images of darker-skinned individuals are rarer in available data than images of lighter-skinned individuals. Rather than create a model that has trouble identifying darker-skinned people, a data scientist can artificially create images of darker-skinned faces to achieve equal representation. However, machine learning specialists must test these models more thoroughly in the real world and plan to add training data where the computer-generated datasets are not sufficient.
The approaches noted here aren’t an exhaustive list, but they give a promising picture of the directions in which machine learning is progressing. Generally, data scientists are moving away from relying solely on supervised learning with large labeled datasets and are instead experimenting with approaches that rely on small data.
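Generating realistic face images requires specialized generative models, which is beyond a short example. To illustrate the balancing idea on simpler data, the sketch below tops up an under-represented group of numeric feature vectors with synthetic points created by interpolating between pairs of real samples (loosely in the spirit of SMOTE-style oversampling). The group names and values are hypothetical.

```python
import random


def synthesize(minority, n_new, rng):
    """Create n_new points, each on the line segment between two real samples.

    Requires at least two real samples whenever n_new > 0.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


def balance(groups, seed=0):
    """Top up every group with synthetic points until all match the largest group."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in groups.values())
    return {
        name: samples + synthesize(samples, target - len(samples), rng)
        for name, samples in groups.items()
    }


# Hypothetical feature vectors; group_b is under-represented.
groups = {
    "group_a": [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [0.2, 0.2], [0.3, 0.1]],
    "group_b": [[0.2, 0.4], [0.4, 0.8]],
}
balanced = balance(groups)
print({name: len(samples) for name, samples in balanced.items()})
```

Synthetic points created this way never leave the range of the real samples they are built from, which is one reason real-world testing remains necessary: the generated data cannot add variation the original data did not contain.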
Expert Insight from Rahul Parundekar – Director of Data Science
It’s important to clarify that “small” data doesn’t mean a small amount of data. It means the right kind of data needed to create a model that generates business insight or automates decisions. Often we see someone who has been overpromised on what AI can deliver share a few images and expect a production-quality model. That’s not what we are talking about here. We’re talking about identifying the data that is most appropriate for creating a model that gives the output you need when deployed in practice.
Here are a few things to keep in mind while creating your “small” dataset:
Make a conscious choice about what data goes into your dataset. You should ensure that it contains only the kind of data you will see when you use your model in practice (i.e., in production). For example, if you were doing defect detection on a manufacturing conveyor line that carries one type of manufactured part at a time, the data in your set would be images of that part, with and without defects, taken by a camera mounted on the line, plus images of the empty conveyor when no objects are present.
Data Diversity vs. Repetition
It’s important to cover all the different cases of data that your model will see in practice and to strike a good balance of variety within those cases. Avoid over-stuffing your dataset with data that’s already covered. In the defect detection example, you want to make sure you capture objects without defects and objects with different types of defects, under the different lighting conditions found on the factory floor and in various rotations and positions on the belt, and maybe even throw in a few examples in maintenance mode. Since a manufactured object without a defect is identical to others that have no defects, you don’t need many copies of it. Another example of unnecessary repetition is video frames with little or no change.
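As a rough sketch of how repeated video frames might be pruned, the Python below keeps a frame only when it differs noticeably from the last frame that was kept. Frames are represented as flat lists of grayscale pixel values, and the threshold of 5.0 is an arbitrary assumption you would tune for your own camera and lighting.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def drop_near_duplicates(frames, threshold=5.0):
    """Keep a frame only if it differs enough from the last frame that was kept."""
    kept = []
    for frame in frames:
        if not kept or mean_abs_diff(kept[-1], frame) > threshold:
            kept.append(frame)
    return kept


# Hypothetical 4-pixel grayscale frames (values 0-255).
frames = [
    [10, 10, 10, 10],   # empty belt
    [11, 10, 10, 9],    # almost identical: sensor noise only
    [200, 10, 10, 10],  # a part enters the frame
    [201, 11, 10, 10],  # same part, barely moved
    [10, 10, 10, 10],   # belt is empty again
]
unique_frames = drop_near_duplicates(frames)
print(len(unique_frames), "of", len(frames), "frames kept")
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents slow drift from slipping through as a long series of tiny changes.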
Build with Robust Techniques
The approaches to small data listed above are a great place to start. Perhaps you can benefit from transfer learning: take another model you’ve already trained in a similar domain that’s giving you good results, then tune it with your small data. For the defect detection example, this might be another defect detection model you’ve previously trained, rather than a model trained on the MS COCO dataset, which is dissimilar to your conveyor-line defect detection use case.
Data-centric AI vs. Model-Centric AI
The latest learnings from the AI industry show that finding the right data to train a model with has a much greater impact on its performance than adjusting the model itself. Finding edge cases and variations can yield better results than trying multiple hyperparameters or different model architectures or, generally speaking, assuming that competent data scientists will “figure it out”. If your defect detection model can’t detect certain types of defects well, invest in getting more images of those types instead of trying different model architectures or hyperparameter tuning.
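One lightweight way to decide where to collect more data is to measure how often the model gets each class right and rank the weak ones. The sketch below computes per-class recall from true and predicted labels; the defect names, the predictions, and the 0.8 cutoff are all hypothetical.

```python
from collections import Counter


def recall_by_class(y_true, y_pred):
    """Fraction of examples of each true class that the model labeled correctly."""
    totals, hits = Counter(y_true), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}


def classes_needing_data(y_true, y_pred, min_recall=0.8):
    """Classes whose recall falls below the bar, weakest first."""
    recalls = recall_by_class(y_true, y_pred)
    weak = [label for label, r in recalls.items() if r < min_recall]
    return sorted(weak, key=recalls.get)


# Hypothetical validation results for a defect detection model.
y_true = ["ok"] * 4 + ["scratch"] * 4 + ["dent"] * 4
y_pred = [
    "ok", "ok", "ok", "ok",
    "scratch", "scratch", "scratch", "ok",
    "dent", "ok", "ok", "scratch",
]

print(recall_by_class(y_true, y_pred))
print(classes_needing_data(y_true, y_pred))
```

In this made-up run, dents are detected far less reliably than scratches, so dent images would be the first thing to collect more of.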
Collaborate with Training Data Experts
With data-centric AI, you also want to focus your debugging efforts on the data, which domain experts understand best, rather than on the model, which is the data scientists’ strength. Work with the domain experts to identify patterns in the cases where the model is failing and hypothesize why it might be failing. This will help you determine the right data you need to collect. For example, an engineer who is an expert in the object’s defects can help you prioritize the data your model needs, clean out the noisy or unwanted data mentioned above, and maybe even point out nuances that the data scientist can use to choose a better model architecture.
To summarize, think of small data as also being more “dense” than big data. You want the highest quality of data in the smallest possible dataset, which makes it cost-effective and easy to use with one of the approaches above to create your “Champion” model.
What We Can Do For You
Appen provides data collection and annotation services on our platform to improve machine learning at scale. As a global leader in our field, we give our clients the capability to quickly obtain large volumes of high-quality data across multiple data types, including image, video, speech, audio, and text, for their specific AI program needs. We offer several data solutions and services to best fit your needs. With more than 25 years of expertise, we’ll work with you to make your data pipeline as efficient as possible.
To discuss your training data needs, contact us.