Everything You Need to Know About Synthetic Data
Companies launching artificial intelligence (AI) face a major hurdle in collecting sufficient data for their models. For many use cases, the right data simply isn’t available, or is very difficult and expensive to acquire. Missing or incomplete data isn’t desirable when creating AI models, and even larger tech companies misstep in this area. For example, researchers in 2018 discovered that top facial recognition software could easily identify white male faces, but showed error rates up to 34% higher when identifying individuals with darker skin. The data used to train these models lacked representation for an entire subset of the population. So what can companies do in a situation like this? Synthetic data provides a compelling solution.
Synthetic data is data that’s been artificially generated by a computer program, rather than produced by real-life events. Companies may augment their training data with synthetic data to fill out all potential use and edge cases, to save money on data collection, or to accommodate privacy requirements. With the rise in computing power and data storage options like the cloud, synthetic data is more accessible than ever. This is unquestionably a positive development: synthetic data advances the development of AI solutions that work better for all end users.
Why Should You Use Synthetic Data?
Say you have an AI problem you’re trying to solve, and you’re not sure whether you should invest in synthetic data to either partially or fully cover your data needs. The following are a few reasons why synthetic data could be a great fit for your project:
Improve model robustness
Access more diverse data for your models without having to collect it. With synthetic data, you can train your model on variations of the same person with different hairstyles, facial hair, glasses, and head poses, as well as a wide range of skin tones, facial features, bone structures, and freckles, creating a diverse set of faces that makes the model more robust.
It’s faster to acquire than “real” data
Teams can produce large volumes of synthetic data in a short period of time. This is especially helpful when the real-life data is reliant on events that rarely occur. When collecting data for a self-driving car, for example, teams may struggle to capture sufficient real-life data of extreme road conditions due to their rarity.
Additionally, data scientists can set up algorithms to automatically label the synthetic data as it’s created, cutting down on the time-consuming annotation process.
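Automatic labeling works because the label is known by construction: the program that generates each sample also decides its class. A minimal sketch (the class names and cluster parameters here are purely illustrative):

```python
import random

random.seed(0)

def generate_labeled_points(n):
    """Generate synthetic 2-D points whose labels are known by construction,
    so no manual annotation is needed."""
    data = []
    for _ in range(n):
        label = random.choice(["class_a", "class_b"])
        # Each class is drawn from its own Gaussian cluster.
        cx, cy = (0.0, 0.0) if label == "class_a" else (5.0, 5.0)
        point = (random.gauss(cx, 1.0), random.gauss(cy, 1.0))
        data.append((point, label))
    return data

dataset = generate_labeled_points(1000)
print(len(dataset))
```

Every record arrives already annotated, which is what removes the usual labeling bottleneck.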
It accounts for edge cases
Machine learning algorithms prefer a well-balanced dataset. Recall our facial recognition example. Had those companies created synthetic data of darker-skinned faces to fill in their data gaps, not only would the accuracy of their models have improved (and in fact, this is exactly what several of these companies did), but they would also have produced a more ethical model. Synthetic data assists teams in covering all use cases, including edge cases where the data is less widely available or doesn’t exist at all.
It protects private user data
Depending on the industry and type of data, companies may face security challenges when working with sensitive data. In healthcare, for instance, patient data often includes protected health information (PHI), which is subject to strict security requirements. Synthetic data mitigates privacy concerns because it doesn’t reference information about a real person. If your team needs to meet certain data privacy requirements, consider incorporating synthetic data as an alternative.
Use Cases for Synthetic Data
From a business perspective, synthetic data has numerous applications: model validation, model training, test data for new products, and more. Several industries have pioneered its usage in machine learning, a few of which we’ll highlight:
Organizations developing self-driving cars often rely on simulations to test performance. Real road data can be difficult or dangerous to acquire under certain conditions, such as cases of extreme weather. In general, there are far too many variables to account for in all of the possible driving experiences to rely on live tests with real cars on the roads. Synthetic data is a safer and faster alternative to manual collection.
The healthcare industry is a prime candidate for synthetic data adoption due to the sensitive nature of its data. Teams can leverage synthetic data to capture physiologies for all possible patient types, ultimately helping to diagnose conditions more quickly and precisely. One exciting example of this is Google’s melanoma detection model, which uses synthetic data of darker-skinned individuals (an area of clinical data that’s unfortunately underrepresented) to equip the model with the ability to perform well for all skin types.
Synthetic data adds greater security for organizations. Returning to our facial recognition example, you may have heard of the term “deep fakes,” referring to artificially created images or video. Companies can create deep fakes to test their own security systems and facial recognition platforms.
Video surveillance also takes advantage of synthetic data to train models at lower cost and greater speed.
Organizations need reliable and secure ways to share their training data with others. Another interesting use case for synthetic data is hiding personally identifiable information (PII) before making a dataset available to others. This is called privacy-preserving synthetic data and is useful for sharing scientific research datasets, medical data, sociological data, and other data that might contain PII.
How to Create Synthetic Data
Teams create synthetic data programmatically with machine learning techniques. Generally, they’ll use a sample set of data from which to create it; the synthetic data must retain the statistical properties of the sample data. Synthetic data itself can be binary, numerical, or categorical. It can be generated at whatever volume the project requires, and it should be robust enough to cover the required use cases. There are several techniques for generating synthetic data; the most common are described below:
Drawing Numbers from a Distribution
If you don’t have real data, but understand what the dataset distribution would look like, you can generate synthetic data by distribution. In this technique, you would generate a random sample of any distribution (normal, exponential, etc.) to create the fake data.
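As a minimal sketch of this technique (using NumPy; the distributions and parameters are illustrative assumptions, not values from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(7)

# No real data needed: we assume each field follows a known distribution.
# Here, heights in cm ~ Normal(170, 8) -- an illustrative choice.
heights = rng.normal(loc=170, scale=8, size=5000)

# Other fields can come from other distributions, e.g. exponential
# inter-arrival times with a mean of 2 minutes.
wait_times = rng.exponential(scale=2.0, size=5000)

print(heights.mean(), wait_times.mean())
```

The sample means land close to the chosen distribution parameters, which is the whole point: the synthetic column has the statistical shape you specified.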
Fit Real Data to a Distribution
If you do have real data, you can use techniques like the Monte Carlo method to find the best fit distribution for that data and generate synthetic data using that.
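For simple parametric families, the fit can be as small as estimating the distribution’s parameters from the sample and resampling. A hedged sketch assuming the real column is roughly Gaussian (libraries such as SciPy offer fitting routines for many more distribution families):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend this is a column of real measurements.
real = rng.normal(loc=50.0, scale=5.0, size=2000)

# Fit a normal distribution to the real data: the maximum-likelihood
# estimates for a Gaussian are simply the sample mean and std.
mu, sigma = real.mean(), real.std()

# Generate synthetic data from the fitted distribution.
synthetic = rng.normal(loc=mu, scale=sigma, size=2000)

print(mu, sigma)
```

The synthetic column never copies a real record, yet it preserves the statistical properties of the original.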
Use Deep Learning Models
Deep learning models can also generate synthetic data. For example:
- Variational Autoencoder (VAE) model: This unsupervised model learns a compressed representation of the initial dataset through an encoder, then uses a decoder to reconstruct data from that representation. Sampling new points from the learned representation yields synthetic data that resembles the original dataset.
- Generative Adversarial Network (GAN) model: A GAN consists of two competing networks. The generator takes random noise as input and outputs synthetic data; the discriminator tries to distinguish that synthetic data from real data. The two are trained against each other iteratively, pushing the generator to produce increasingly realistic output.
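The adversarial training loop above can be sketched at toy scale. This is a hedged, 1-D illustration only (an affine generator and a logistic-regression discriminator, nothing like a production GAN), but it shows the alternating generator/discriminator updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: a 1-D Gaussian the generator should learn to imitate.
real = rng.normal(4.0, 1.0, size=1000)

# Generator: affine map from noise, x = a*z + b.
a, b = 1.0, 0.0
# Discriminator: logistic regression, D(x) = sigmoid(w*x + c).
w, c = 0.0, 0.0
lr = 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(500):
    z = rng.normal(size=64)
    fake = a * z + b
    x_real = rng.choice(real, size=64)

    # Discriminator step: ascend log D(real) + log(1 - D(fake)).
    p_real = sigmoid(w * x_real + c)
    p_fake = sigmoid(w * fake + c)
    w += lr * (np.mean((1 - p_real) * x_real) - np.mean(p_fake * fake))
    c += lr * (np.mean(1 - p_real) - np.mean(p_fake))

    # Generator step: ascend log D(fake) (the non-saturating GAN loss).
    g = (1 - sigmoid(w * fake + c)) * w  # gradient of log D w.r.t. fake samples
    a += lr * np.mean(g * z)
    b += lr * np.mean(g)

# Draw synthetic samples from the trained generator.
synthetic = a * rng.normal(size=1000) + b
print(synthetic.shape)
```

Real GANs replace the affine map and logistic regression with deep networks and train for far longer, but the structure of the loop is the same.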
Using a combination of the above methods may be most beneficial depending on how much real data you’re starting with and what you’re using your synthetic data for.
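For categorical fields, the simplest end-to-end version of “retain the statistical properties of the sample data” is to estimate empirical frequencies and resample from them. A small stdlib-only sketch (the category names and counts are made up for illustration):

```python
import random
from collections import Counter

random.seed(42)

# A small "real" sample of a categorical field.
real = ["red"] * 60 + ["green"] * 30 + ["blue"] * 10

# Estimate empirical frequencies, then sample synthetic values from them,
# preserving the statistical properties of the original column.
counts = Counter(real)
categories = list(counts)
weights = [counts[cat] / len(real) for cat in categories]

synthetic = random.choices(categories, weights=weights, k=10_000)
print(Counter(synthetic))
```

The synthetic column reproduces the 60/30/10 split (up to sampling noise) without containing any of the original records.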
The Future of Synthetic Data
In the last decade, we’ve seen a huge acceleration in the usage of synthetic data. While it saves time and money for organizations, it’s not without its challenges. It lacks outliers, which occur naturally in real data and are crucial to accuracy for some models. Note also that its quality often depends on the input data used for generation; biases present in the input data can easily propagate into the synthetic data, so the importance of starting from high-quality data can’t be overstated. Finally, it requires additional output control: namely, comparing the synthetic data with human-annotated real data to ensure no inconsistencies are introduced.
Despite these challenges, synthetic data remains an exciting area of opportunity. It enables us to produce innovative AI solutions even when real-life data isn’t accessible. Most importantly, it helps organizations create products that are more inclusive and representative of the diversity of their end users.
Expert Insight From an Appen Director of Data Science
Remember that synthetic data is a technique for data augmentation, not a replacement for data collection and annotation. It is important to understand that you’re not going to be able to create a model that works exceptionally well in the real world WITHOUT any real-world data. You might get the majority of cases covered, but there will be a long tail of edge cases your model fails at (e.g., for our face recognition case, there may be rare lighting conditions, rare facial features, plastic surgery, and other cases you haven’t considered, which you wouldn’t know about if you started only with synthetic data, no matter how photorealistic those faces are).
Apart from that, here are some things to be mindful of when creating and using synthetic data:
- Understand the robustness requirements of your model to define what synthetic data you need: Even before you start generating synthetic data, figure out what the model really needs and create a set of functional requirements for the type of synthetic data you need to generate. Building synthetic data similar to what you already have is useless for the model. Instead, you might want to augment your data to improve diversity (e.g., faces with different facial features for a face recognition use case) and variation (e.g., slight deviations of the same person). You might also want to think of rare or edge cases and prioritize those for synthetic data generation. An alternative approach is to derive the requirements for the synthetic data from your false positives and false negatives on real-world train, validation, and test set predictions, in order to reduce their occurrence.
- Understand what synthetic data can and cannot do for your dataset and model: Data augmentation gives a bump to the accuracy of your model – it doesn’t take the model to perfection. Since the distribution of your synthetic data is close to the real-world data you already know about, the model isn’t magically going to understand significantly different data occurring in the real world, or produce a prediction or outcome that the training data didn’t guide it toward. Also consider the origin of the data and the conditions under which it was captured (e.g., the faces generated on ThisPersonDoesNotExist.com are generated from profile headshots; these won’t help your model recognize faces indoors when the sky is cloudy and the room is dark).
- Know the various synthetic data tools at your disposal and those rapidly becoming available: Common existing methods for synthetic data involve either partially cloning data from the real world and superimposing it on other real-world data, or using Unity or another 3D environment capable of generating photorealistic data. But this area is evolving fast thanks to advances in GAN and VAE technology. Instead of creating completely new data, you can create variations of real-world data by compositing new artifacts onto it (e.g., adding freckles to a real person’s face, altering a shadow angle, etc.). Another example is refining the superimposed data to make it more realistic. There are many other tools at your disposal that you may want to become aware of.
- Version your data: As you start generating synthetic data, your ability to generate better synthetic data will also grow. An image you generated last month may be made obsolete by a newer version that looks more like the real world (e.g., you found a better skin texture for the face, or a new GPU lets you produce more detailed ray-traced results). You don’t want to train your new model with the older image. Versioning helps you know which data you replaced with new data, and lets you verify your model’s improvement as you add different synthetic data or update old data.
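The false-positive/false-negative-driven prioritization mentioned above can be sketched in a few lines: tally errors on your validation set per condition, and generate synthetic data for the worst conditions first. A hedged illustration (the labels, condition tags, and records here are entirely hypothetical):

```python
from collections import Counter

# Hypothetical validation results: (true_label, predicted_label, condition).
# The condition tag (e.g. lighting) is assumed metadata for illustration.
results = [
    ("face", "face", "daylight"),
    ("face", "no_face", "low_light"),
    ("face", "no_face", "low_light"),
    ("no_face", "face", "backlit"),
    ("face", "face", "daylight"),
    ("face", "no_face", "backlit"),
]

# Count errors (false positives and false negatives) per condition to see
# where synthetic data generation should be prioritized.
errors = Counter(cond for true, pred, cond in results if true != pred)
print(errors.most_common())
```

Here low-light and backlit images account for all the errors, so those conditions would be the first targets for synthetic generation.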
To summarize, your synthetic data exists to improve your model’s performance in the real world. Any approach you take and any data you generate must make your model more robust and help improve that performance. Well-defined requirements for what your model needs, based on where it falls short, will help you focus your efforts by choosing the right tools and generating the right data.
What Appen Can Do For You
At Appen, we have over 25 years of experience supporting our customers with their data collection and annotation needs. Our experts can work with your team to analyze if synthetic data is a viable option for your model, and then leverage our services to get your AI solutions off the ground and out to market with speed.