Top 10 Face Datasets for Facial Recognition and Analysis

Ofir Zuk (Chakon)

25/07/2022

5 Min read

In This Article

What is a Face Recognition Dataset?

Most people can recognize about 5,000 faces, and it takes a human 0.2 seconds to recognize a specific one. We also interpret facial expressions and detect emotions automatically. In other words, we’re naturally good at facial recognition and analysis. But, in recent years, Computer Vision (CV) has been catching up and in some cases outperforming humans in facial recognition. Advances in CV and Machine Learning have created solutions that can handle tasks more efficiently and accurately than humans. The applications of this technology are wide-ranging and exciting. One example is in marketing and retail. Saks Fifth Avenue uses facial recognition technology in their stores both to check against criminal databases and prevent theft, but also to identify which displays attract attention and to analyze in-store traffic patterns. Powering all these advances are numerous large datasets of faces, with different features and focuses. Sifting through the datasets to find the best fit for a given project can take time and effort.

To help teams find the best datasets for their needs, we provide a quick guide to some popular and high-quality, public datasets focused on human faces.

Read our Whitepaper on Facial Landmark Detection Using Synthetic Data. Download here.

Digi-Face 1M

Released – 2022
Description – Digi-Face 1M is the largest scale synthetic dataset for face recognition that is free from privacy violations and lack of consent.
Face Images – 1.2 million
Identities – 110,000
Licensing – The Digi-Face 1M dataset is available for non-commercial research purposes only.
Download the dataset here.

CelebFaces Attributes Dataset (CelebA)

Released – 2015
Description – CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter.
Face Images – 202,599
Identities – 10,177
Licensing – The CelebA dataset is available for non-commercial research purposes only.
Download the dataset here.

Check out our new whitepaper, Facial Landmark Detection Using Synthetic Data, to learn how we used a synthetic face dataset to train a facial landmark detection model and achieved results comparable to training with real data only.

VGG Face2

Released – 2018
Description – The dataset contains 3.31 million images with large variations in pose, age, illumination, ethnicity and professions.
Face Images – 3,310,000
Identities – 9,131
Licensing – The VGG Face2 dataset is available for non-commercial research purposes only.
Download the dataset here.

CelebA dataset

UMDFaces

Released – 2017
Description – UMDFaces has 367,888 annotated faces of 8,277 subjects. We discuss how a large dataset can be collected and annotated using human annotators and deep networks
Face Images – 22,000 videos + 367,888 images
Identities – 8,277 in images + 3,100 in video
Licensing – The UMDFaces dataset is available for non-commercial research purposes only.
Download the dataset here.

MS-Celeb-1M

Released – 2016
Description – “This training dataset was prepared in two main steps. First, we select the top 100K entities from our one-million celebrity list in terms of their web appearance frequency. Then, we leverage popular search engines to provide approximately 100 images per celebrity.”
Face Images – 10,000,000
Identities – 100,000
Licensing – This Dataset is under the Open Data Commons Public Domain Dedication and License.
Download the dataset here.

PaSc

Released – 2013
Description – The challenge includes 9,376 still images and 2,802 videos of 293 people. The images are balanced with respect to distance to the camera, alternative sensors, frontal versus not-frontal views, and different locations. Verification results are presented for public baseline algorithms and a commercial algorithm for three cases: comparing still images to still images, videos to videos, and still images to videos.
Face Videos – 2,802
Identities – 293
Licensing – This is strictly licensed, so should be checked before use.
Download the dataset here.

iQIYI-VID

Released – 2018
Description – iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, TV series, to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%.
Face Videos – 600,000
Identities – 5,000
Licensing – a simple and permissive license with conditions only requiring preservation of copyright and license notices that enables commercial use.
Download the dataset here.

Wider Face

Released – 2018
Description – we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion,
Images – 32,203 images
Identities – 393,703
Licensing – The Wider Face dataset is available for non-commercial research purposes only.
Download the dataset here.

MALF

Released – 2015
Description – MALF is the first face detection dataset that supports fine-gained evaluation.
Images – 5,250
Identities – 11,931
Licensing – The MALF dataset is available for non-commercial research purposes only.
Download the dataset here.

IMDB-Wiki

Released – 2015
Description – “We crawled 0.5 million images of celebrities from IMDb and Wikipedia that we make public on this website. This is the largest public dataset for age prediction to date.”
Images – 524,230
Identities – 100,000
Licensing – This dataset is made available for academic research purposes only.
Download the dataset here.