Top 10 Face Datasets for Facial Recognition and Analysis

By Datagen, July 25, 2022

What is a Face Recognition Dataset?

Most people can recognize about 5,000 faces, and it takes a human 0.2 seconds to recognize a specific one. We also interpret facial expressions and detect emotions automatically. In other words, we’re naturally good at facial recognition and analysis. But, in recent years, Computer Vision (CV) has been catching up and in some cases outperforming humans in facial recognition. Advances in CV and Machine Learning have created solutions that can handle tasks more efficiently and accurately than humansThe applications of this technology are wide-ranging and exciting. One example is in marketing and retail. Saks Fifth Avenue uses facial recognition technology in their stores both to check against criminal databases and prevent theft, but also to identify which displays attract attention and to analyze in-store traffic patterns. Powering all these advances are numerous large datasets of faces, with different features and focuses. Sifting through the datasets to find the best fit for a given project can take time and effort. 

To help teams find the best datasets for their needs, we provide a quick guide to some popular and high-quality, public datasets focused on human faces. 

 

facial recognition

Read our Whitepaper on Facial Landmark Detection Using Synthetic Data. Download here.

CelebFaces Attributes Dataset (CelebA)

  • Released – 2015 
  • Description – CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. 
  • Face Images  – 202,599
  • Identities – 10,177
  • Licensing – The CelebA dataset is available for non-commercial research purposes only.
  • Download the dataset here.

 

Check out our new whitepaper, Facial Landmark Detection Using Synthetic Data, to learn how we used a synthetic face dataset to train a facial landmark detection model and achieved results comparable to training with real data only.

VGG Face2

  • Released – 2018
  • Description – The dataset contains 3.31 million images with large variations in pose, age, illumination, ethnicity and professions.
  • Face Images  – 3,310,000
  • Identities – 9,131
  • Licensing – The VGG Face2 dataset is available for non-commercial research purposes only.
  • Download the dataset here.

CelebA dataset

UMDFaces

  • Released – 2017 
  • Description – UMDFaces has 367,888 annotated faces of 8,277 subjects. We discuss how a large dataset can be collected and annotated using human annotators and deep networks
  • Face Images  – 22,000 videos + 367,888 images
  • Identities – 8,277 in images + 3,100 in video
  • Licensing – The UMDFaces dataset is available for non-commercial research purposes only.
  • Download the dataset here.

MS-Celeb-1M

  • Released – 2016 
  • Description – “This training dataset was prepared in two main steps. First, we select the top 100K entities from our one-million celebrity list in terms of their web appearance frequency. Then, we leverage popular search engines to provide approximately 100 images per celebrity.”
  • Face Images  – 10,000,000
  • Identities – 100,000
  • Licensing –  This Dataset is under the Open Data Commons Public Domain Dedication and License.
  • Download the dataset here.

YoutubeFace

  • Released – 2011 
  • Description – This training dataset follows the identical preparation steps to MS-Celeb-1M
  • Face Videos  – 3,425 videos
  • Identities – 1,595
  • Licensing – YouTubeFacesDB is licensed under the MIT License a simple and permissive license with conditions only requiring preservation of copyright and license notices that enables commercial use.
  • Download the dataset here.

PaSc

  • Released – 2013
  • Description – The challenge includes 9,376 still images and 2,802 videos of 293 people. The images are balanced with respect to distance to the camera, alternative sensors, frontal versus not-frontal views, and different locations. Verification results are presented for public baseline algorithms and a commercial algorithm for three cases: comparing still images to still images, videos to videos, and still images to videos.
  • Face Videos  – 2,802
  • Identities – 293
  • Licensing – This is strictly licensed, so should be checked before use. 
  • Download the dataset here.

iQIYI-VID

  • Released – 2018
  • Description – iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, TV series, to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%.
  • Face Videos  – 600,000
  • Identities – 5,000
  • Licensing –  a simple and permissive license with conditions only requiring preservation of copyright and license notices that enables commercial use.
  • Download the dataset here.

Wider Face

  • Released – 2018
  • Description – we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion,
  • Images  – 32,203 images
  • Identities – 393,703
  • Licensing – The Wider Face dataset is available for non-commercial research purposes only.
  • Download the dataset here.

MALF

  • Released – 2015
  • Description – MALF is the first face detection dataset that supports fine-gained evaluation.
  • Images – 5,250
  • Identities – 11,931
  • Licensing – The MALF dataset is available for non-commercial research purposes only.
  • Download the dataset here.

IMDB-Wiki

  • Released – 2015
  • Description – “We crawled 0.5 million images of celebrities from IMDb and Wikipedia that we make public on this website. This is the largest public dataset for age prediction to date.”
  • Images  – 524,230
  • Identities – 100,000
  • Licensing – This dataset is made available for academic research purposes only.
  • Download the dataset here.

Generate Synthetic Data with Our New Free Trial. Start now!