Data Engineering Your AI Model

Ofir Zuk (Chakon)

2/02/2023


Once we’ve started training and testing a model, we will most likely find that some types of data are missing. How much data should we generate and add to our training set to address the problem? Well, it depends on the goal we are trying to achieve. Similar to architecture engineering, by choosing the sample types and their distribution, we actually define the problem our model will eventually be able to solve, and whether our model will be biased or unbiased with respect to any of the attributes we can control when we generate the data. We call this task data engineering.

The way to find the right amount of data is to iterate on the data itself. We can’t know ahead of time how much data will be needed, but we can increase the amount of data step by step for as long as model performance keeps improving. However, when trying to improve performance on a specific problem we have identified, we need to keep an eye on the general model performance and make sure we are not creating new problems through new biases in our training data. This requires multiple solid test datasets: to measure the standard-case model performance on in-domain data, to measure the model performance on edge-case data, and to validate that the model is not biased. Creating these test datasets is the first step in this process.
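The iteration described above can be sketched as a small loop. This is a minimal illustration, not a prescribed procedure: `train_and_eval` is a hypothetical callback that trains with `n` generated samples and returns one score per test set, and the thresholds `min_gain` and `max_drop` are assumed values you would tune for your own task.

```python
from typing import Callable, Dict

Scores = Dict[str, float]  # test-set name -> metric (higher is better)


def grow_training_set(
    train_and_eval: Callable[[int], Scores],  # hypothetical: train with n generated samples, score every test set
    step: int,
    max_generated: int,
    target_test: str,
    min_gain: float = 0.005,  # assumed: smallest improvement worth another step
    max_drop: float = 0.01,   # assumed: largest tolerated drop on any other test set
) -> int:
    """Return how many generated samples to keep in the training set."""
    baseline = train_and_eval(0)  # scores with no generated data added
    best_n, best_scores = 0, baseline

    n = step
    while n <= max_generated:
        scores = train_and_eval(n)
        # Improvement on the problem we are trying to fix.
        gain = scores[target_test] - best_scores[target_test]
        # Did any other test set get noticeably worse than the baseline?
        regressed = any(
            baseline[name] - scores[name] > max_drop
            for name in baseline
            if name != target_test
        )
        if regressed or gain < min_gain:
            break  # stop growing: new problems appeared or the gains flattened
        best_n, best_scores = n, scores
        n += step
    return best_n
```

The loop stops on either of two conditions: the targeted test set has stopped improving, or another test set has degraded beyond the tolerance. Both checks depend on the test datasets existing before the first iteration.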

Read Part 1 of the Big Book of Synthetic Data: Why is my model not working?

Below is an example of suggested in-domain and edge-case test sets for a certain computer vision task:

Test	Details	Purpose	Size
General test set – target domain	General large test set from target domain	Represents the prior probabilities of our domain	1000
Low light	Unit test containing low light samples (may be synthetic or real)	Represents important edge cases which may not be represented in the test set	100
Extreme pose	Unit test containing extreme pose samples (may be synthetic or real)	Represents important edge cases which may not be represented in the test set	100
Occlusions	Unit test containing occlusion samples (may be synthetic or real)	Represents important edge cases which may not be represented in the test set	100
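One way to work with a suite like this is to keep each test set as a named collection and report a score per set rather than a single aggregate number. The sketch below assumes a classification-style task scored by accuracy; `EvalSet`, `evaluate_suite` and the `predict` callable are illustrative names, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Sequence, Tuple

Sample = Tuple[Any, Any]  # (model input, expected label)


@dataclass
class EvalSet:
    name: str                  # e.g. "general", "low_light", "extreme_pose"
    samples: Sequence[Sample]  # real or synthetic labeled samples


def evaluate_suite(
    predict: Callable[[Any], Any],  # hypothetical model wrapper: input -> predicted label
    suite: Iterable[EvalSet],
) -> Dict[str, float]:
    """Return accuracy per test set, keyed by test-set name."""
    report: Dict[str, float] = {}
    for eval_set in suite:
        if not eval_set.samples:
            raise ValueError(f"test set {eval_set.name!r} is empty")
        correct = sum(1 for x, y in eval_set.samples if predict(x) == y)
        report[eval_set.name] = correct / len(eval_set.samples)
    return report
```

Keeping the edge-case sets separate from the general set means a gain on, say, low light cannot hide a drop on the target domain inside one averaged figure.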

We suggest adopting the following principles:

If the data we are generating represents edge cases, we should use a limited number of samples in the training set. We need to verify that we do not substantially degrade performance on the regular in-distribution samples while improving the results on the edge cases. This may be a trade-off on which we will need to make a clear decision.
If the data we are generating are in-distribution samples to enrich our data, we expect the addition of new samples will improve our model. We suggest adding more data iteratively until the model does not further improve.
If the data we are generating represents valid subgroups of the data which are under-represented for some reason, we should generate enough samples to bring them to around the same size as the other subgroups in the data, to avoid biasing the data distribution unintentionally.
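For the third principle, the number of samples to generate per subgroup can be estimated from the current subgroup counts. The sketch below is one possible reading of "around the same as other subgroups": it uses the median subgroup size as the target and ignores groups already within a tolerance of it. Both the choice of the median and the `tolerance` value are assumptions made for illustration.

```python
from collections import Counter
from statistics import median
from typing import Dict, Hashable, Iterable


def samples_to_generate(
    subgroup_labels: Iterable[Hashable],  # one subgroup label per training sample
    tolerance: float = 0.1,               # assumed: groups within 10% of the target are left alone
) -> Dict[Hashable, int]:
    """Return how many samples to add to each under-represented subgroup."""
    counts = Counter(subgroup_labels)
    if not counts:
        return {}
    # Assumed target: the median subgroup size, so one very large group does not inflate it.
    target = int(median(counts.values()))
    return {
        group: target - count
        for group, count in counts.items()
        if count < target * (1 - tolerance)
    }


# Example: subgroup "c" is far smaller than "a" and "b".
labels = ["a"] * 100 + ["b"] * 90 + ["c"] * 20
print(samples_to_generate(labels))  # {'c': 70}
```

After adding the generated samples, the same per-test-set evaluation should be rerun to confirm the rebalancing did not shift performance on the general test set.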


Orly Zvitia is Datagen’s Director of Artificial Intelligence. She has over 15 years of experience in computer vision and machine learning, both as a manager and a researcher. Orly has a track record of leading challenging computer vision projects starting from initial ideation to full productization for world leading corporates and startup companies. She works to bring Datagen’s synthetic data revolution to the world-wide computer vision community.