Body Landmarks: Methods, Libraries and Datasets to Get You Started

What Is a Body Landmark? 

A body landmark refers to a specific point or feature on a human body that can be used for identification or tracking. These landmarks can be used in various fields such as computer vision, computer graphics, and human-computer interaction.

For example, in computer vision, body landmarks can be used to track the movement of a person in a video, or to identify specific points on a person’s face for facial recognition. In computer graphics, body landmarks can be used to create more realistic animations of human movement. Another application of body landmarks is to control devices or interfaces through gestures or movements.

Some common body landmarks include facial features, such as the eyes, nose, and mouth, as well as points on the limbs, joints, and spine. They can be defined either by their location or by their shape.

This is part of a series of articles about body segmentation.

What Are the Use Cases for Body Landmarks?

There are several use cases for body landmarks in various fields, including:

  • Computer vision: Body landmarks can be used to detect and track people in images and videos. This can be used for applications such as surveillance, security, and human-computer interaction. Body landmarks can also be used for facial recognition, which can be used for applications such as security, marketing, and entertainment.
  • Computer graphics: Body landmarks can be used to create realistic animations of human movement. This can be used for applications such as video games, movies, and virtual reality.
  • Human-computer interaction: Body landmarks can be used to control devices or interfaces through gestures or movements. This can be used for applications such as gaming, virtual reality, and assistive technology for people with disabilities.
  • Robotics: Body landmarks can be used to control the movement of robots, especially in the field of human-robot interaction.
  • Medical and healthcare: Body landmarks can be used to track and monitor the movement of patients, such as those with Parkinson’s Disease or multiple sclerosis, to help diagnose and treat them.
  • Sports: Body landmarks can be used to analyze the performance of athletes and to provide feedback to coaches. 
  • Fashion and retail: Body landmarks can be used to create virtual try-on experiences, and also to improve the fitting of clothing and shoes.

Concepts Related to Body Landmarks

Semantic Labeling

Semantic labeling is the process of assigning a label or class to each pixel or region in an image or video to represent the underlying objects or scenes. The goal of semantic labeling is to produce a dense label map, where each pixel or region is assigned a class label such as “person,” “car,” “road,” etc. It is used in computer vision for tasks such as object recognition, scene understanding, and image segmentation. It is also a crucial step in many applications such as autonomous driving, video surveillance, and image-based search.


Descriptors

In body landmarking, a descriptor is a compact and distinctive representation of a set of landmarks or features that describes the shape or appearance of an object, such as a face or a body. Descriptors are used in computer vision for tasks such as object recognition, face recognition, and body tracking.
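
As a rough illustration, one simple descriptor is the list of distances between every pair of landmarks, divided by a reference length so that it does not change when the body appears larger or smaller in the image. The sketch below assumes 2D landmarks given as (x, y) tuples; the function name, the sample coordinates, and the choice of normalization are all illustrative rather than a standard.

```python
import math
from itertools import combinations

def pairwise_distance_descriptor(landmarks):
    """Return scale-normalized distances between every pair of landmarks.

    landmarks: list of (x, y) tuples. Dividing by the largest pairwise
    distance is an assumption chosen for illustration.
    """
    dists = [math.dist(a, b) for a, b in combinations(landmarks, 2)]
    scale = max(dists) if dists else 0.0
    if scale == 0.0:
        return [0.0 for _ in dists]
    return [d / scale for d in dists]

# Hypothetical shoulder, elbow, and wrist positions in pixels.
arm = [(100.0, 100.0), (100.0, 160.0), (140.0, 160.0)]
# The same arm drawn twice as large and shifted in the image.
arm_scaled = [(2 * x + 30, 2 * y + 10) for x, y in arm]

d1 = pairwise_distance_descriptor(arm)
d2 = pairwise_distance_descriptor(arm_scaled)
```

Because the descriptor depends only on relative distances, `d1` and `d2` match up to floating-point error even though the raw coordinates differ.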

Human Body Landmark Models 

Human body landmark models aim to identify key points or regions on a 3D or 2D representation of the human body. Here are a few methods commonly used to do so.

Graphs and Markov Networks

A Markov network can be used to predict human body landmarks by modeling the relationships between the landmarks and other features of the body, such as body parts, joints, and pose. The network can represent these relationships as nodes and edges, where the nodes represent the landmarks and the edges represent the dependencies between the landmarks and other features.

The model's parameters (the scores attached to each node and edge) can be estimated from training data, which consists of images or videos of people with the landmarks manually annotated. These parameters can then be used to make predictions about the landmarks in new images, by computing the most likely joint configuration of landmarks given the observed image data.
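
To make "the most likely configuration" concrete, here is a minimal sketch for the simplest case: a chain of landmarks (for example shoulder, elbow, wrist), where each landmark has a few candidate positions. It assumes per-candidate log-scores from the image and a Gaussian-style penalty on limb length; the function name, the penalty form, and the numbers are hypothetical. Real models use richer graphs and learned potentials.

```python
import math

def map_chain(candidates, unary, expected_len, sigma=10.0):
    """Most likely candidate per landmark for a chain-structured model.

    candidates[i]: list of (x, y) candidate positions for landmark i
    unary[i][k]: log-score of candidate k for landmark i (image evidence)
    expected_len[i]: expected distance between landmark i and i + 1
    Returns the chosen candidate index for each landmark.
    """
    def pairwise(i, a, b):
        # Log-potential that penalizes unusual limb lengths (assumed form).
        d = math.dist(candidates[i][a], candidates[i + 1][b])
        return -((d - expected_len[i]) ** 2) / (2 * sigma ** 2)

    best = [list(unary[0])]  # best[i][k]: best score of a path ending at k
    back = []                # back[i - 1][k]: best predecessor of k
    for i in range(1, len(candidates)):
        scores, ptrs = [], []
        for b in range(len(candidates[i])):
            options = [best[i - 1][a] + pairwise(i - 1, a, b)
                       for a in range(len(candidates[i - 1]))]
            a_best = max(range(len(options)), key=options.__getitem__)
            scores.append(options[a_best] + unary[i][b])
            ptrs.append(a_best)
        best.append(scores)
        back.append(ptrs)

    # Trace back from the best final candidate.
    k = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for ptrs in reversed(back):
        k = ptrs[k]
        path.append(k)
    return path[::-1]

# Hypothetical candidates: the second elbow and wrist candidates score
# slightly better on image evidence but imply implausible limb lengths.
candidates = [[(0, 0)], [(0, 50), (0, 200)], [(40, 50), (300, 200)]]
unary = [[0.0], [-1.0, -0.5], [-0.5, -0.4]]
path = map_chain(candidates, unary, expected_len=[50, 40])
```

Picking each landmark independently would choose the second elbow and wrist candidates; the joint search prefers the first ones because the edges between landmarks rule out the stretched limbs.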

Convolutional Neural Networks

Deep learning can be used to locate body landmarks by training a convolutional neural network (CNN) to predict the position of landmarks in an image or a video. This is typically done by training the network on a large dataset of labeled images, where the landmarks are annotated by hand or using computer algorithms. The network learns to identify the shape and structure of the body by processing the image or video through multiple layers of convolution and pooling operations.

Once trained, the network can be used to predict the position of landmarks in new images or videos. The predicted landmarks can then be used for tasks such as pose estimation, face alignment, and body tracking. This approach has been shown to be highly effective in locating landmarks accurately and robustly, even in challenging conditions such as changes in lighting, pose, and appearance. 
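
Many CNN-based landmark detectors do not output coordinates directly. Instead they produce one heatmap per landmark, and the position is read off the strongest response. The sketch below assumes that design (the text above does not specify one) and uses plain nested lists in place of a real network output; the function name and sample values are illustrative.

```python
def decode_heatmaps(heatmaps, image_width, image_height):
    """Convert per-landmark heatmaps into (x, y, score) in image pixels.

    heatmaps: list of 2D lists (rows of floats), one per landmark.
    """
    keypoints = []
    for hm in heatmaps:
        rows, cols = len(hm), len(hm[0])
        # Find the cell with the highest activation.
        score, r, c = max((hm[r][c], r, c)
                          for r in range(rows) for c in range(cols))
        # Map the center of that cell back to image coordinates.
        x = (c + 0.5) * image_width / cols
        y = (r + 0.5) * image_height / rows
        keypoints.append((x, y, score))
    return keypoints

# A hypothetical 4x4 heatmap for one landmark on a 64x64 image.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.1],
    [0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
keypoints = decode_heatmaps([heatmap], image_width=64, image_height=64)
```

The score of the peak can serve as a confidence value, for instance to discard landmarks that are hidden or outside the frame.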

Body Landmark Detection and Estimation Libraries

MediaPipe Pose

MediaPipe Pose is a machine learning solution developed by Google for real-time pose estimation and body tracking in images and videos. It detects and tracks the positions of 33 keypoints, such as the joints and other body regions. MediaPipe Pose is designed to be flexible and scalable, allowing developers to use it for a variety of applications, including human-computer interaction, virtual reality, and sports analysis. It is also designed to be highly efficient, allowing for real-time processing even on mobile devices. It is part of the open-source MediaPipe framework (the mediapipe package in Python), which can be integrated into existing systems and pipelines.
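
A minimal sketch of running MediaPipe Pose on a single image is shown below. It assumes the legacy Python "Solutions" interface (`mp.solutions.pose`) together with OpenCV for image loading; newer MediaPipe releases also offer a separate Tasks-based interface, so check the documentation for the version you install. The helper names are illustrative, and any image path you pass in is your own.

```python
def to_pixels(normalized_landmarks, width, height):
    """Scale normalized (x, y) landmark coordinates to pixel coordinates."""
    return [(round(x * width), round(y * height))
            for x, y in normalized_landmarks]

def detect_pose(image_path):
    """Return pixel coordinates of the pose landmarks, or [] if none found.

    Requires the third-party packages mediapipe and opencv-python.
    """
    import cv2
    import mediapipe as mp

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    height, width = image.shape[:2]

    # MediaPipe expects RGB input, while OpenCV loads images as BGR.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    with mp.solutions.pose.Pose(static_image_mode=True) as pose:
        results = pose.process(rgb)

    if results.pose_landmarks is None:
        return []
    normalized = [(lm.x, lm.y) for lm in results.pose_landmarks.landmark]
    return to_pixels(normalized, width, height)

# Landmarks come back normalized to [0, 1]; this converts one by hand.
example = to_pixels([(0.5, 0.25)], width=640, height=480)
```

The coordinates returned by the model are relative to the image size, which is why the conversion step is kept as a separate helper.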


OpenPose

OpenPose can detect human body landmarks in real time. It was developed at Carnegie Mellon University and is based on deep learning techniques such as convolutional neural networks (CNNs). OpenPose can estimate the pose, including body, face, and hand keypoints, in a given image or video stream. It can be used for a variety of applications such as action recognition, human-computer interaction, and virtual reality.

OpenPose uses a multi-person framework that detects everyone in an image or video frame and estimates their poses in a single pass. It has been trained on a large dataset of annotated images and detects body landmarks accurately in various poses and under different lighting conditions. It can also track a single person, and in addition to the main body joints it detects keypoints on the feet, hands, and face.

OpenPose is designed to be fast and efficient, making it well suited for real-time applications, and it can run on both CPU and GPU hardware.
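
OpenPose can write its results as one JSON file per frame. The sketch below parses such a file, assuming the commonly documented layout in which each entry of `people` carries a flat `pose_keypoints_2d` list of (x, y, confidence) triples. The sample JSON and the confidence threshold are made up for illustration.

```python
import json

def parse_openpose_people(json_text, min_confidence=0.1):
    """Return, per person, a list with (x, y, confidence) or None per keypoint."""
    data = json.loads(json_text)
    people = []
    for person in data.get("people", []):
        flat = person.get("pose_keypoints_2d", [])
        keypoints = []
        # Walk the flat list three values at a time.
        for i in range(0, len(flat) - 2, 3):
            x, y, c = flat[i:i + 3]
            keypoints.append((x, y, c) if c >= min_confidence else None)
        people.append(keypoints)
    return people

# Hypothetical output for one person with two keypoints, one undetected.
sample = '{"people": [{"pose_keypoints_2d": [10.0, 20.0, 0.9, 0.0, 0.0, 0.0]}]}'
people = parse_openpose_people(sample)
```

Keypoints that fall below the threshold are returned as `None`, which keeps the list positions aligned with the keypoint order of the model.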

4 Body Landmark Datasets 


COCO

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It contains over 200,000 labeled images, including 250,000 person instances annotated with keypoints. It is widely used for training and evaluating computer vision models, particularly in the areas of object detection, instance segmentation, keypoint detection, and caption generation.

The COCO dataset was created to provide a more challenging and realistic evaluation of computer vision models compared to other datasets, such as PASCAL VOC and ImageNet. The images in the COCO dataset were collected from the Internet and include a wide range of objects and scenes in different contexts, lighting conditions, and scales.
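
COCO stores the keypoints of each person as a flat list of 17 (x, y, v) triples, where v is 0 for a keypoint that was not labeled, 1 for labeled but not visible, and 2 for labeled and visible. The sketch below turns one such annotation into a dictionary; the sample annotation is invented, and only the `keypoints` field is used.

```python
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def labeled_keypoints(annotation):
    """Map keypoint name -> (x, y, visible) for labeled keypoints only."""
    flat = annotation["keypoints"]
    result = {}
    for index, name in enumerate(COCO_KEYPOINT_NAMES):
        x, y, v = flat[3 * index: 3 * index + 3]
        if v > 0:  # v == 0 means the keypoint was not labeled
            result[name] = (x, y, v == 2)
    return result

# Hypothetical annotation: only the nose and left shoulder are labeled.
flat = [0] * (3 * len(COCO_KEYPOINT_NAMES))
flat[0:3] = [120, 80, 2]     # nose, visible
flat[15:18] = [100, 140, 1]  # left_shoulder, labeled but occluded
parsed = labeled_keypoints({"keypoints": flat})
```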


COCO-WholeBody

COCO-WholeBody is a large-scale human body landmark detection dataset created for computer vision research. It is an extension of the popular COCO dataset. COCO-WholeBody provides annotations for whole-body landmarks covering the body, feet, face, and hands (133 keypoints in total). The dataset includes a diverse range of images that cover various scenarios, such as sports, daily activities, and artistic performances.

COCO-WholeBody is used as a benchmark for evaluating the performance of computer vision algorithms in detecting and tracking human bodies. It provides a common evaluation platform for researchers and practitioners to compare and evaluate different methods for human body landmark detection and analysis.
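
When working with whole-body predictions, it is often convenient to split the 133 keypoints into their groups: 17 for the body, 6 for the feet, 68 for the face, and 21 for each hand. The sketch below assumes the keypoints arrive as one list in the order body, feet, face, left hand, right hand. That ordering is an assumption, so verify it against the tooling you use.

```python
WHOLEBODY_PARTS = [
    ("body", 17), ("feet", 6), ("face", 68),
    ("left_hand", 21), ("right_hand", 21),
]

def split_wholebody(keypoints):
    """Split a list of 133 keypoints into named groups (assumed ordering)."""
    total = sum(count for _, count in WHOLEBODY_PARTS)
    if len(keypoints) != total:
        raise ValueError(f"expected {total} keypoints, got {len(keypoints)}")
    groups, start = {}, 0
    for name, count in WHOLEBODY_PARTS:
        groups[name] = keypoints[start:start + count]
        start += count
    return groups

# Placeholder keypoints: integers stand in for (x, y, score) entries.
groups = split_wholebody(list(range(133)))
```

Evaluating each group separately makes it easier to see whether a model struggles with small parts such as hands and faces rather than with the main body joints.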

MPII Human Pose

MPII Human Pose is a publicly available benchmark dataset for human pose estimation in computer vision. It was created by the Max Planck Institute for Informatics in Germany and contains over 25,000 images of people in various poses, along with the corresponding annotations for the position of the joints in the body. The dataset covers a wide range of poses, from simple standing poses to complex actions such as running, jumping, and playing sports.


OCHuman

The OCHuman dataset focuses on images of occluded humans. It provides a challenging testbed for computer vision algorithms as the occlusions can significantly affect the visibility of the body and make it difficult to estimate the pose or track the body accurately.

The dataset includes roughly 13,000 annotated instances of occluded humans across about 5,000 images. All instances have a bounding box, but not all have masks or keypoint annotations. By evaluating algorithms on this dataset, researchers can assess their robustness to occlusions and develop new methods to handle these challenging scenarios.
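
One simple way to estimate how heavily a person is occluded by others is the largest overlap (intersection over union, or IoU) between their bounding box and any other person's box in the same image. The sketch below assumes boxes in (x, y, width, height) form and treats this maximum IoU as a rough occlusion score; both the box format and the sample boxes are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def max_iou_per_person(boxes):
    """For each box, the largest IoU with any other box in the same image."""
    return [max((box_iou(box, other)
                 for j, other in enumerate(boxes) if j != i), default=0.0)
            for i, box in enumerate(boxes)]

# Hypothetical image: two overlapping people and one standing apart.
boxes = [(0, 0, 100, 200), (50, 0, 100, 200), (400, 0, 50, 100)]
scores = max_iou_per_person(boxes)
```

A score near zero means the person stands alone, while higher values indicate that another person covers a large share of the same region.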
