AI/ML Model Training: Data, Models, and Metrics

What Is Model Training?

Model training is a core step in machine learning: it is the process by which an algorithm learns from data to produce a model that can then be tested and deployed in artificial intelligence and machine learning (AI/ML) applications. The training phase has a major impact on a model’s performance. The quality and quantity of data fed into the model at the training stage largely determine the quality and efficacy of applications built with that model.

Two important factors in the model training phase are:

  • The training dataset—labeled data samples that contain typical inputs the algorithm is expected to handle, together with the correct model output (the labels). The full dataset is usually partitioned into training, validation, and testing subsets. The training subset is used in initial training; validation data is used to evaluate the model and tune hyperparameters; and testing data is used to measure performance of the final model.
  • The training algorithm—there are multiple algorithms available for most machine learning tasks. There are trade-offs in determining the best algorithm, such as model complexity, interpretability, performance, and computational requirements. Each algorithm has its own training process and may have different training data requirements.

 

What You Should Do Before Training Your Model

Define Your Task

The first phase of any machine learning project is developing an understanding of the business requirements. You need to know what problem you’re trying to solve. Work with the owner of the project and make sure you understand its objectives and requirements. The goal is to convert this knowledge into a suitable problem definition for the machine learning project and devise a preliminary plan for achieving the objectives.

In order to build a successful model, you will need good data, sufficient volumes of data, and the ability to clean the data and prepare it to the model’s training requirements. Depending on the nature and size of the dataset, this can be a formidable task.

At the start of the project, identify your data needs, what data is available, and whether it is in an appropriate format and shape for the machine learning project. You’ll need to go through the following stages:

  • Identify the problem and candidate algorithms
  • Identify data required to train the algorithm/s
  • Collect initial data
  • Identify its quality and suitability for the task
  • Plan what is needed to make the dataset suitable for the project

Finally, it is important to identify differences between training data and real-world data and determine your approach for evaluating model performance.

Define Required Data and Annotations

Some organizations have no problem collecting data for machine learning because they have accumulated a large collection of information over the years, in some cases already in digital form. If you haven’t collected your own data, you can start with a public or commercial reference dataset.

As you assemble your dataset, make sure the data is in the right format. You might collect structured or unstructured data: structured data could be a CSV or XLS file with columns for each data attribute, while unstructured data might include text, image, and video files. The type of data you collect depends on the business use case.

There are several ways you can find a dataset online: 

  • Google Dataset Search—lets you search via keywords and filter results based on data type or whether the dataset is free.
  • GitHub—provides a list of popular public datasets based on various categories.
  • Kaggle—lets you find datasets with usability ratings based on desired categories (it is a valuable resource for independent ML projects). 
  • OpenML—lets data scientists share data and collaborate to solve problems. 
  • UCI Machine Learning Repository—lists various data generators and datasets.

If you don’t have enough data, you can use a data generator. Synthetic data can be valuable for projects that require massive datasets, but you need to understand how the data was generated in order to use it appropriately.

Data generators can help provide supplementary training or testing data when you cannot find enough real-world data. Synthetic data is also helpful for protecting privacy and data confidentiality, especially if you process sensitive data (e.g., medical or personal data).

There are two main types of data generators:

  • Distribution model-based generators—these might use distributions based on real data or a data scientist’s expertise. 
  • Deep Learning-based generators—these include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which use neural networks to create data and require special expertise.
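
As a minimal sketch of the first type, a distribution model-based generator, the example below fits a normal distribution to a few real measurements and samples synthetic values from it. The measurements, the choice of a normal distribution, and the function name generate_synthetic are assumptions made for illustration; a real project would pick the distribution based on the actual data.

```python
import random
import statistics


def generate_synthetic(real_values, n_samples, seed=0):
    """Sample synthetic values from a normal distribution fitted to real data."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


# Hypothetical real measurements (for example, heights in meters).
real_heights = [1.62, 1.75, 1.68, 1.80, 1.71, 1.66, 1.77]
synthetic_heights = generate_synthetic(real_heights, n_samples=1000)

print(len(synthetic_heights))
print(round(statistics.mean(synthetic_heights), 2))
```

The synthetic sample should have roughly the same mean and spread as the real one, which is also a quick way to check that the generator reflects the data it was fitted to.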


Choose the Right Model

Once the data is available and you know the problem you need to solve, select the machine learning model that is best suited for the task. This involves the following steps:

  • Choose the right algorithm based on your learning goals and data requirements.
  • Identify the model variant, architecture, or options that will give you the best results.
  • Within the chosen model variant, learn to configure and tune hyperparameters for optimal performance (for example, in a neural network this can be the number of layers, activation function, and learning rate).
  • Benchmark models and hyperparameters to prove which works best for your use case.
  • Decide if the model provides the required level of explainability.
  • If relevant, consider an ensemble approach, running multiple models in parallel and combining their outputs to get a “best of many” result.
  • Determine your model’s operational and deployment requirements.
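
One simple way to combine models in an ensemble is majority voting. The sketch below assumes three hypothetical threshold classifiers (model_a, model_b, model_c) that stand in for trained models; it only illustrates the combination step, not how the individual models are trained.

```python
from collections import Counter


def majority_vote(models, sample):
    """Return the label predicted by most models for one sample."""
    predictions = [model(sample) for model in models]
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical classifiers that disagree near the decision boundary.
def model_a(x):
    return "high" if x > 4 else "low"


def model_b(x):
    return "high" if x > 5 else "low"


def model_c(x):
    return "high" if x > 6 else "low"


ensemble = [model_a, model_b, model_c]
print(majority_vote(ensemble, 5.5))  # two of the three models vote "high"
print(majority_vote(ensemble, 4.5))  # two of the three models vote "low"
```

Using an odd number of models for a two-class problem avoids tied votes.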

Set an Achievable Performance Level

Establish a line of communication with project owners and stakeholders and create clear expectations about the result of the machine learning project. Talk to them about the performance level a model can realistically achieve and whether that performance will provide value or solve the business problem. Jointly define a minimum performance threshold that will be considered a success.

Choose One Model Performance Metric

An AI/ML model should have one primary performance metric you can assign prior to training, and use to evaluate the model throughout its lifecycle. For instance, for a regression task, you can use the root mean squared error (RMSE) as a performance metric. For a classification task, the performance metric could be classification accuracy against a labeled dataset.

Once you select your performance metric, use it to compare potential models for your business problem, and track each model through testing stages. Use the same metric in production to evaluate the model’s success at generating real-life predictions.
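
The two example metrics mentioned above, RMSE for regression and accuracy for classification, can be computed in a few lines. This is a minimal sketch with made-up values; the function names rmse and accuracy are chosen here for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a regression task."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def accuracy(labels, predictions):
    """Fraction of predictions that match the ground truth labels."""
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# Hypothetical regression targets and predictions.
regression_score = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])

# Hypothetical classification labels and predictions (3 of 4 correct).
classification_score = accuracy(
    ["cat", "dog", "cat", "dog"],
    ["cat", "dog", "dog", "dog"],
)

print(round(regression_score, 3), classification_score)
```

Lower is better for RMSE and higher is better for accuracy, so keep the direction of the metric in mind when comparing models.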

How to Train a Machine Learning Model

Parameters, Hyperparameters, and Variables

When preparing a new model, you will work with three key elements:

  • Parameters—model parameters are values that machine learning models automatically learn from input data during training. For example, in a neural network, these are the neuron weights. In a logistic regression, these are the regression coefficients.
  • Hyperparameters—these are values defined outside the machine learning model, which specify how it operates. Different hyperparameters can result in different levels of performance for the same model and dataset, so tuning hyperparameters is critical. While it is common to tune hyperparameters manually, there are several approaches for automatically searching through hyperparameter values to discover the optimal settings.
  • Inputs—it is important to define exactly what inputs the model will receive, their format, and the shape of the data. For example, computer vision models can process images of a certain size, quality level, in color or black and white, and can require that the images be vectorized before inputting them into the model.
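
To make the distinction between parameters and hyperparameters concrete, here is a minimal sketch of a one-variable linear model trained with gradient descent. The data and the function name train_linear_model are illustrative assumptions. The weight w and bias b are parameters learned from the data, while learning_rate and epochs are hyperparameters set before training.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b with batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: updated automatically during training
    n = len(xs)
    for _ in range(epochs):  # epochs: a hyperparameter fixed in advance
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w  # learning_rate: another hyperparameter
        b -= learning_rate * grad_b
    return w, b


# Inputs: one numeric feature per sample, generated here from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))
```

Changing learning_rate or epochs changes how well, and how quickly, the same model fits the same data, which is why hyperparameters are tuned separately from training.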

Training, Validation, and Testing

Machine learning algorithms typically require three stages before they are ready for inference: 

  • Training—running the model with a set of examples used to tune the initial model parameters. In many machine learning models, training is iterative and may be repeated hundreds or thousands of times.
  • Validation—a validation dataset contains unseen examples, which are used to evaluate model performance and tune hyperparameters. 
  • Testing—a testing dataset is another group of unseen examples, used after training and validation to test the real-life performance of the model. 

The initial dataset is partitioned randomly (to avoid bias) into these three groups. The training dataset is the largest group, because the model needs as many examples as possible to learn the subject domain. It is common to split a dataset into training, validation, and testing sets, respectively, using a ratio of 80/10/10 or 60/20/20.
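
A random 80/10/10 partition can be sketched as follows. The function name split_dataset and the fixed seed are assumptions for illustration; libraries such as scikit-learn provide comparable utilities.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle samples and partition them into training, validation, and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # random order avoids ordering bias
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


# Hypothetical dataset of 100 sample IDs.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Every sample lands in exactly one of the three sets, so the model is never evaluated on examples it was trained on.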

Training, validation, and testing can be performed many times in the lifecycle of a model. If the end result of testing is not sufficient for the use case, data scientists can experiment with different hyperparameters, fine-tune the dataset, or even switch to an entirely different model, until they achieve the right level of performance.

Training, validation, and testing datasets require ground truth labels.

Ongoing Maintenance of AI Models

There is growing awareness in the data science community that model training is not a one-time effort. Because model inputs constantly change, the performance of a model is not static. Model performance may be different in real life than it was on a limited example dataset.

Model performance might also deteriorate over time as the data the model sees in production shifts away from the data it was trained on, which is known in the industry as data drift. Data drift can reduce the accuracy of AI models, so early warnings are important for identifying and solving problems.

Continuous monitoring is essential to validate model performance in real life tasks, identify issues, and repeat the training, validation, and testing process with new data if necessary.
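
A very simple early-warning check for data drift compares a feature's mean in recent production data with its mean in the training data. The sketch below is an illustration under stated assumptions: the feature values, the function name mean_shift_score, and the threshold of 2.0 are all hypothetical, and production systems often use more robust statistical tests.

```python
import statistics


def mean_shift_score(training_values, production_values):
    """Shift of the production mean, measured in training standard deviations."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    return abs(statistics.mean(production_values) - train_mean) / train_std


DRIFT_THRESHOLD = 2.0  # hypothetical alert threshold

# Hypothetical values of one input feature.
training_feature = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8]
stable_batch = [10.1, 9.9, 10.3, 9.7]
drifted_batch = [14.0, 15.2, 13.8, 14.6]

for name, batch in [("stable", stable_batch), ("drifted", drifted_batch)]:
    score = mean_shift_score(training_feature, batch)
    status = "drift detected" if score > DRIFT_THRESHOLD else "ok"
    print(name, status)
```

When a check like this fires, it is a signal to investigate and, if needed, repeat training, validation, and testing with newer data.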

Tuning Hyperparameters

Hyperparameters are high-level properties that data science teams set before training a model. A model can learn many properties from the training data, but its own hyperparameters are “exogenic”—they are defined outside the model and determine the way it operates.

For example, in a deep neural network, hyperparameters include the number of layers, the dropout rate, and the activation function used. In a random forest algorithm, key hyperparameters include the maximum depth of each decision tree and the total number of trees to use.

When training a machine learning model, you will try different hyperparameters to see which combination delivers the best result with a given set of training data.
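
Trying different hyperparameter values can be automated with a simple search. The sketch below uses a k-nearest-neighbors classifier on made-up one-dimensional data, where k is the hyperparameter, and picks the value with the best validation accuracy. The data (including one deliberately noisy training label), the candidate values, and the function names are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train_points, k, x):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train_points, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def validation_accuracy(train_points, val_points, k):
    """Share of validation points classified correctly for a given k."""
    hits = sum(1 for x, label in val_points if knn_predict(train_points, k, x) == label)
    return hits / len(val_points)


# Hypothetical (feature, label) pairs; the point at 2.2 carries a noisy label.
train_points = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.2, "b"), (2.5, "a"),
                (6.0, "b"), (6.5, "b"), (7.0, "b"), (7.5, "b")]
val_points = [(2.15, "a"), (2.3, "a"), (6.2, "b"), (7.2, "b")]

candidate_ks = [1, 3, 5]
scores = {k: validation_accuracy(train_points, val_points, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)  # first candidate with the highest score
print(scores, best_k)
```

The search is scored on the validation set rather than the training set, so the chosen value reflects performance on unseen examples.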

Model Training Best Practices

Here are some best practices to help you train your ML model successfully:

Build Your Datasets Carefully

Successful model training relies on the quality of your annotated datasets. Skilled modelers should carefully build and annotate your training datasets. When choosing the training data, consider the defects and defect labels. For example, in image-based training projects, you might look at the size and quality of the image and its defects.

Having ML engineers tag defects themselves helps them understand the dataset better. This understanding is essential for training models and analyzing errors efficiently. It can also help you isolate errors. Monitor the model to ensure it uses the right training images.

Start with Small Datasets

Start testing your training pipeline by deliberately overfitting the model on a small set of data samples and evaluating the results. This approach helps you verify the pipeline’s efficacy quickly and easily. For example, you can train the model on a handful of random images as a sanity check to catch errors before feeding more data to the model.
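
The same sanity check can be sketched without images. The example below trains a perceptron on four hypothetical samples and confirms that it fits them perfectly; the model, the data, and the function names are assumptions chosen to keep the example small. If a pipeline cannot fit a handful of samples, the problem is likely in the code or the data rather than in the amount of data.

```python
def train_perceptron(samples, epochs=50):
    """Train a perceptron on (features, label) pairs with labels 0 or 1."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            error = label - (1 if activation > 0 else 0)
            # Update only when the prediction is wrong (error is then +1 or -1).
            weights = [w + error * x for w, x in zip(weights, features)]
            bias += error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0


# Hypothetical tiny, linearly separable sample (logical AND).
tiny_sample = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]

weights, bias = train_perceptron(tiny_sample)
train_accuracy = sum(predict(weights, bias, f) == y for f, y in tiny_sample) / len(tiny_sample)
assert train_accuracy == 1.0, "pipeline cannot fit four samples; check data and code"
```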

Use Correctly Labeled Datasets

Label accuracy is important in training data, and improperly labeled data can degrade your model’s performance. It is important to use image annotation when generating the training datasets for computer vision models. Collecting correctly labeled datasets can be challenging, but an ML-assisted data labeling system can help.

Ensure You Have Enough Data

Sufficient training data is key to the success of an ML model. In general, the more representative training data you have, the more accurate your model’s predictions will be. Training data requirements differ depending on the model type and use case. For example, deep learning (DL) models typically need larger, higher-quality datasets than simpler models.

Augment Data with Transformations

Data augmentation expands the data available for training, which can improve model accuracy. You can use image transformations such as flips, rotations, and crops to broaden your training distribution. Using image transformations in training datasets helps you build more robust models. However, augmentation applies only to the training data; it should not be applied to data used for production inference or performance monitoring.
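
As a minimal sketch of an image transformation, the example below mirrors an image left to right and adds the flipped copy to the training data with the same label. The image is represented as a plain list of pixel rows, and the pixel values, the label, and the function names are hypothetical; real projects typically rely on an image processing library for this.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment(dataset):
    """Return the original (image, label) pairs plus a flipped copy of each."""
    augmented = list(dataset)
    for image, label in dataset:
        augmented.append((horizontal_flip(image), label))
    return augmented


# Hypothetical 2x3 grayscale image with its label.
image = [[0, 50, 255],
         [10, 60, 200]]
training_data = [(image, "defect")]

augmented_data = augment(training_data)
print(len(augmented_data))
print(augmented_data[1][0])
```

A flip is only a valid augmentation when it does not change the correct label, so choose transformations that make sense for your use case.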