Data Engineering Your AI Model

Once we start training and testing a model, we will most likely find that some types of data are missing from the training set. How much data should we generate and add to the training set to address the problem? It depends on the goal we are trying to achieve. As with architecture engineering, choosing the sample types and their distribution defines the problem our model will eventually be able to solve. It also determines whether the model will be biased or unbiased with respect to any of the attributes we can control when we generate the data. We call this task data engineering.

The way to find the right amount of data is to iterate on the data itself. We can’t know ahead of time how much data will be needed, but we can increase the amount of data step by step and watch how model performance responds. However, when we try to improve performance on a specific problem we have identified, we need to keep an eye on general model performance and make sure we are not creating new problems through new biases in our training data. This requires several solid test datasets: one to measure standard-case performance on in-domain data, others to measure performance on edge-case data, and others to validate that the model is not biased. Creating these test datasets is the first step in this process.
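The iteration described above can be sketched in a few lines of Python. This is a minimal sketch and not a definitive recipe: the `evaluate` callback, the test-set names, and both thresholds (`min_gain`, `max_general_drop`) are assumptions introduced for illustration, and the numbers in the toy `fake_evaluate` function are invented.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # test-set name -> metric, higher is better


def grow_training_set(
    evaluate: Callable[[int], Scores],
    batch_size: int,
    max_batches: int,
    target_test: str,
    general_test: str,
    min_gain: float = 0.005,
    max_general_drop: float = 0.01,
) -> List[Scores]:
    """Add generated data batch by batch and record the scores of each step.

    `evaluate(n)` is a hypothetical callback that retrains the model with `n`
    extra generated samples and returns its score on every test set.
    """
    history = [evaluate(0)]  # baseline: no generated data added
    for step in range(1, max_batches + 1):
        scores = evaluate(step * batch_size)
        baseline, previous = history[0], history[-1]
        # Reject the step if the general test set degraded too much.
        if baseline[general_test] - scores[general_test] > max_general_drop:
            break
        history.append(scores)
        # Stop once the targeted test set no longer improves meaningfully.
        if scores[target_test] - previous[target_test] < min_gain:
            break
    return history


# Toy stand-in for a real train-and-evaluate run (invented numbers).
def fake_evaluate(n_added: int) -> Scores:
    return {
        "general": 0.90 - 0.00001 * n_added,
        "low_light": min(0.80, 0.60 + 0.0005 * n_added),
    }


history = grow_training_set(
    fake_evaluate,
    batch_size=100,
    max_batches=10,
    target_test="low_light",
    general_test="general",
)
print(len(history) - 1, "batches evaluated after the baseline")
```

Tracking the general test set next to the targeted one is what lets the loop notice when a fix for one problem starts to hurt the standard case.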

Read Part 1 of the Big Book of Synthetic Data: Why is my model not working? 

Below is an example of suggested in-domain and edge-case test sets for a computer vision task:

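| Test | Details | Purpose | Size |
|---|---|---|---|
| General test set – target domain | General large test set from target domain | Represents the prior probabilities of our domain | 1000 |
| Low light | Unit test containing low light samples (may be synthetic or real) | Represents important edge cases which may not be represented in the test set | 100 |
| Extreme pose | Unit test containing extreme pose samples (may be synthetic or real) | Represents important edge cases which may not be represented in the test set | 100 |
| Occlusions | Unit test containing occlusion samples (may be synthetic or real) | Represents important edge cases which may not be represented in the test set | 100 |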
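One way to keep such test sets separate in code is to score each of them on its own, so an edge-case regression is not averaged away by the much larger general set. The sketch below is illustrative only: `EvalSet`, `evaluate_suite`, and the toy `predict` function are hypothetical names, and the tiny sample lists stand in for the 1000-sample and 100-sample sets suggested above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[float, bool]  # (model input, expected label) in this toy setup


@dataclass
class EvalSet:
    name: str
    purpose: str
    samples: Sequence[Sample]


def evaluate_suite(
    predict: Callable[[float], bool], suite: List[EvalSet]
) -> Dict[str, float]:
    """Return accuracy per test set so edge cases are reported separately."""
    report = {}
    for eval_set in suite:
        correct = sum(1 for x, y in eval_set.samples if predict(x) == y)
        report[eval_set.name] = correct / len(eval_set.samples)
    return report


# Toy model and data, invented purely to exercise the function.
def predict(x: float) -> bool:
    return x > 0


suite = [
    EvalSet("general", "prior probabilities of the target domain",
            [(1.0, True), (2.0, True), (-1.0, False), (-2.0, True)]),
    EvalSet("low_light", "edge case: low light",
            [(-1.0, True), (3.0, True)]),
]
report = evaluate_suite(predict, suite)
for name, accuracy in report.items():
    print(f"{name}: {accuracy:.2f}")
```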

We suggest adopting the following principles: 

  • If the data we are generating represents edge cases, we should add only a limited number of samples to the training set. We need to verify that we do not substantially degrade performance on the regular in-distribution samples while improving the results on the edge cases. This may be a trade-off on which we will need to make a clear decision.
  • If the data we are generating consists of in-distribution samples meant to enrich our data, we expect the new samples to improve our model. We suggest adding more data iteratively until the model stops improving.
  • If the data we are generating represents valid subgroups that are under-represented for some reason, we should generate enough samples to bring each of them to around the same size as the other subgroups, to avoid unintentionally biasing the data distribution.
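For the last principle, the number of samples to generate per subgroup can be estimated directly from the current subgroup counts. The sketch below assumes that "around the same as other subgroups" can be approximated by the median subgroup size. That choice, the function name `samples_to_generate`, and the subgroup labels and counts are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median_high
from typing import Dict, Iterable


def samples_to_generate(subgroup_labels: Iterable[str]) -> Dict[str, int]:
    """Return how many samples each under-represented subgroup is missing.

    The target size is the (high) median of the current subgroup sizes, an
    assumed stand-in for "around the same as other subgroups".
    """
    counts = Counter(subgroup_labels)
    target = median_high(list(counts.values()))
    return {group: target - n for group, n in counts.items() if n < target}


# Hypothetical subgroup labels attached to the training samples.
labels = ["a"] * 500 + ["b"] * 480 + ["c"] * 60
print(samples_to_generate(labels))
```

Using the median rather than the largest subgroup keeps a single oversized subgroup from inflating the amount of data we generate for all the others.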


Orly Zvitia is Datagen’s Director of Artificial Intelligence. She has over 15 years of experience in computer vision and machine learning, both as a manager and a researcher. Orly has a track record of leading challenging computer vision projects, from initial ideation to full productization, for world-leading corporations and startup companies. She works to bring Datagen’s synthetic data revolution to the worldwide computer vision community.