Creating an Image Dataset for Machine Learning: Quick Tutorial

What Are Image Datasets?

A dataset is a curated collection of data for a machine learning project. An image dataset contains digital images used to test, train, and evaluate the performance of computer vision algorithms.

Image datasets help algorithms learn how to identify and recognize information in images and perform relevant cognitive activities. For example, an AI algorithm could be trained to tag photos, read license plates, or identify tumors in medical images. Computer vision algorithms can also be used in a variety of real-world applications to transform images or create entirely new ones.

This is part of a series of articles about image datasets.

Why Do We Need an Image Dataset for Machine Learning?

A dataset is a collection of examples used to train and test a model. This can be a selection of examples that pertain to a particular subject or domain—however, datasets are usually designed to serve multiple applications. Datasets are often labeled, making them ideal for training and testing supervised models. However, there are also unlabeled datasets for training unsupervised models.

The model makes predictions based on what it has already learned, so it’s important not to test it on examples already used in training. A model that has memorized its training examples rather than learning general patterns (a problem known as “overfitting”) would score well on that data while performing poorly on new images. To detect this, datasets should be split into training and testing portions. This process involves selecting a portion of a dataset (for example, 70% of the data) and allowing a machine learning algorithm to train using this data. Data scientists can then take the remaining unseen examples from the dataset (the remaining 30%) and use them to test the trained model.
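The 70/30 split described above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed method: the function name `train_test_split`, the fixed seed, and the placeholder file names are all assumptions made for the example.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle examples and split them into training and testing portions."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for a real dataset.
image_paths = [f"image_{i:03d}.jpg" for i in range(100)]
train_paths, test_paths = train_test_split(image_paths)
print(len(train_paths), len(test_paths))
```

Shuffling before cutting matters when files are stored grouped by class; without it, the test portion could end up containing only the last few classes.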

Datasets can also be used as tools to measure the performance of machine learning techniques. Models designed for the same purpose must be compared in a fair benchmark. This is done by running multiple algorithms on the same image datasets, making it possible to accurately compare their performance.

How to Create an Image Dataset for Classification Problems

For many computer vision projects, standard datasets will not be enough, and you will need to build your own dataset consisting of labeled images. An image classification dataset consists of labeled images that you collect and use to train your model for a specific problem.

A common way to construct a dataset of images for classification is to create a list of search queries corresponding to your target labels. Using that list, you then execute an image search on Google or Bing and scrape the results. To construct your image classifier dataset, follow this brief Python tutorial.

Installing packages

You can use the Bing Image Downloader library to scrape images for the dataset from the internet rather than building a scraper from scratch, which would take more time. The library is available on PyPI and can be installed using the following command:

pip3 install bing-image-downloader

or can be cloned directly from its GitHub repository:

git clone https://github.com/gurugaurav/bing_image_downloader

Image scraping

Import the downloader module from the bing_image_downloader package in a Python file. Next, create a function called download_images() with one argument—the current request. It downloads and saves the first 20 results into a directory specified by output_dir.

from bing_image_downloader import downloader

def download_images(request):
    """Download images matching a search query and store them in a directory.

    Parameters:
        request: search query string to look up on Bing

    Returns:
        None. Matching images are saved under output_dir.
    """
    print(request)
    downloader.download(request,
                        limit=20,
                        output_dir='images',
                        adult_filter_off=False,
                        force_replace=False,
                        timeout=60)

Construct a list of images to download

To illustrate the approach, let’s build a scraper to collect images that can help us detect plant diseases. Therefore, the list below contains the search terms of the plant disease images we want to download. You can also construct a list of search terms of your choice.

plant_diseases = [
    'Bell Pepper Bacterial Spot',
    'Bell Pepper Late Blight',
    'Bell Pepper Leaf Mold',
    'Potato Early Blight',
    'Potato Late Blight',
    'Tomato Bacterial Spot',
    'Tomato Early Blight',
    'Tomato Late Blight',
    'Tomato Leaf Mold',
    'Tomato Septoria Leaf Spot',
    'Tomato Spider Mites',
    'Tomato Target Spot',
    'Tomato Mosaic Virus',
    'Tomato Yellow Virus',
    'Apple Rust',
    'Corn Rust',
    'Grape Black Rot',
    'Squash Powdery Mildew']

Now, iterating over each plant disease in the list, we call the download_images() function, passing it the name of the plant disease as an argument. The function searches Bing for images of each plant disease and scrapes them. The results are then saved in output_dir.

for disease in plant_diseases:
    print('Downloading images of the following plant disease', disease)
    download_images(disease)

After the images are saved, cross-check the search term used against the images retrieved. Further steps would include the resizing and augmentation of the images and splitting them into train and test datasets for your model.
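As a first step in that cross-check, it can help to count how many images were actually saved for each search term. The sketch below assumes that each search term ends up in its own subfolder of the output directory (here `images`); verify that layout against your own download before relying on it. The helper name `count_images_per_class` and the extension list are illustrative choices.

```python
from pathlib import Path

# File extensions treated as images; adjust to match your downloads.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def count_images_per_class(root):
    """Return {subfolder name: number of image files} for each subfolder of root."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # nothing downloaded yet
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts

# Assumed layout: images/<search term>/<downloaded files>
for label, n in count_images_per_class("images").items():
    print(f"{label}: {n} images")
```

A class with far fewer images than the requested limit is a hint that the search term was too narrow; the images themselves still need to be inspected by eye.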

Image Dataset Best Practices for Machine Learning

Ensure the Dataset Has Enough Images 

Determining the right number of images for your image dataset can be challenging. A good rule of thumb is to start with 1,000 images per class, but this does not suit every use case. With transfer learning, for example, far fewer images per class may be enough.

One way to check if a training dataset has the right amount of data is the Sample-Size Determination Methodology (Balki et al.), which is easy to understand and effective, although it applies only to medical imaging. It involves training models on incrementally larger proportions of the dataset, starting with 5%, then 10%, and so on. You record the mean accuracy and standard deviation at each step and use them to predict the optimal amount of data needed.
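The incremental loop can be sketched as follows. This is only an outline of the idea of training on growing proportions and recording mean and standard deviation, not the published procedure itself. The `train_and_evaluate` callable is hypothetical: you would supply a function that trains your model on a subset and returns its accuracy. The `placeholder_evaluate` function below is a stand-in with no real meaning, included only so the loop can run.

```python
import random
import statistics

def learning_curve(examples, train_and_evaluate,
                   fractions=(0.05, 0.10, 0.20, 0.50, 1.0),
                   repeats=3, seed=0):
    """For each fraction, train on random subsets and record mean/stdev accuracy."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to compute a standard deviation")
    rng = random.Random(seed)
    results = []
    for fraction in fractions:
        size = max(1, round(len(examples) * fraction))
        accuracies = [train_and_evaluate(rng.sample(examples, size))
                      for _ in range(repeats)]
        results.append((fraction, size,
                        statistics.mean(accuracies),
                        statistics.stdev(accuracies)))
    return results

def placeholder_evaluate(subset):
    """Stand-in for real training; returns a made-up score, NOT a measurement."""
    return len(subset) / (len(subset) + 50)

# Hypothetical dataset of 200 example IDs.
for fraction, size, mean_acc, std_acc in learning_curve(list(range(200)),
                                                        placeholder_evaluate):
    print(f"{fraction:.0%} ({size} examples): mean={mean_acc:.3f}, stdev={std_acc:.3f}")
```

With a real training function, a curve that has flattened by the largest fractions suggests that more data of the same kind is unlikely to help much.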

Filter the Images

When downloading images in bulk, some are likely to be low-resolution, duplicates, or otherwise irrelevant. Constructing a useful image dataset requires filtering out these images:

  • Delete duplicates: use a tool to remove duplicates automatically (e.g., a ResNet-18 model or a perceptual hashing algorithm). This approach won’t be fully accurate and might also delete some images that are not duplicates, but it can help save time.
  • Delete images of unexpected size: images that are smaller or larger than the model expects can result in poor performance. You can use scripts to automatically cycle through dataset images and filter out those that are not of the appropriate size or resolution.
  • Manually prune images: while this approach is the most time-consuming, it is sometimes necessary to improve dataset quality. Automated processes cannot identify quality issues as accurately as humans, especially for tightly defined classes.
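As a simple starting point for the duplicate step, the sketch below finds exact byte-for-byte duplicates using SHA-256 from the standard library. Note that this is a deliberately simpler technique than the perceptual hashing or ResNet-based approaches mentioned above: it will not catch resized or re-encoded copies of the same picture. The function name and the `images` folder are assumptions for the example.

```python
import hashlib
from pathlib import Path

def find_exact_duplicates(folder):
    """Return paths whose file contents match an earlier file in the folder tree."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    seen = {}        # content digest -> first path seen with that content
    duplicates = []
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)   # same bytes as an earlier file
        else:
            seen[digest] = path
    return duplicates

# Report duplicates rather than deleting them, so they can be reviewed first.
for duplicate in find_exact_duplicates("images"):
    print("duplicate:", duplicate)
```

Because the search covers every subfolder, the same file downloaded under two different search terms is also reported, which is worth checking since it may indicate a mislabeled image.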

Label or Tag Occluded Objects

An occluded object is a partially blocked object in an image. Attach proper labels to occluded objects as if they are in full view. Don’t draw bounding boxes on only the visible parts of an occluded object—it is okay if the boxes for different objects overlap. These practices must be consistent across all annotators to be effective.

Be Specific When Labeling Images

Use specific names when labeling objects. It is better to err on the side of being overly specific to avoid having to relabel the dataset. For example, if the general object class is dogs, you might specify sub-classes for different dog breeds (e.g., Labrador, Pug, Poodle). If these labels are too specific for the use case, you can combine them into one category (dog), which is easier than relabeling the whole dataset. As above, it is critical that all annotators use the same practices and terminology.
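Combining specific labels into a general category takes only a lookup table, which is why over-specific labeling is the cheaper mistake. A minimal sketch, where the mapping and the sample annotations are hypothetical:

```python
# Hypothetical fine-grained labels mapped to a coarser class.
FINE_TO_COARSE = {"Labrador": "dog", "Pug": "dog", "Poodle": "dog"}

def coarsen_labels(labels, mapping):
    """Replace each fine label with its coarse class; keep unmapped labels unchanged."""
    return [mapping.get(label, label) for label in labels]

annotations = ["Labrador", "Pug", "cat", "Poodle"]
print(coarsen_labels(annotations, FINE_TO_COARSE))
# prints ['dog', 'dog', 'cat', 'dog']
```

The reverse direction has no such shortcut: splitting "dog" back into breeds requires looking at every image again.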
