The Complete Guide to OpenPose in 2024

The OpenPose library uses neural networks to perform real-time human body pose estimation for single- and multi-person video analysis.

This article provides a guide to the OpenPose library for real-time multi-person keypoint detection. We will review its architecture, its features, and how it compares with other human pose estimation methods.

In this article, we will cover the following:

  • Pose Estimation in Computer Vision
  • What is OpenPose? How does it work?
  • How to Use OpenPose? (research, commercial)
  • OpenPose Alternatives
  • Getting started

In the era of AI, more and more computer vision and machine learning (ML) applications need 2D human pose estimation as an input. Its output often feeds subsequent tasks in image recognition and AI-based video analytics. Single- and multi-person pose estimation is an important computer vision task with applications in domains such as action recognition, security, and sports.

Pose estimation is still a relatively young computer vision technology. However, in recent years, human pose estimation accuracy has improved dramatically with the emergence of Convolutional Neural Networks (CNNs).

Pose Estimation with OpenPose

A human pose skeleton represents the orientation of a person in a particular format. Fundamentally, it is a set of data points that can be connected to describe a person’s pose. Each data point in the skeleton is also called a part, coordinate, or point. A relevant connection between two points is known as a limb or pair. Note, however, that not every combination of data points forms a relevant pair.
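To make "parts" and "pairs" concrete, here is a small Python sketch of a skeleton. It assumes the 18-keypoint COCO layout used by OpenPose's COCO model and that model's 19 limb pairs as we understand them; `skeleton_edges` is a hypothetical helper for illustration, not part of any OpenPose API.

```python
# Sketch: an 18-keypoint COCO-style skeleton (assumed layout, see lead-in).
COCO_KEYPOINTS = [
    "Nose", "Neck",
    "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist",
    "RHip", "RKnee", "RAnkle",
    "LHip", "LKnee", "LAnkle",
    "REye", "LEye", "REar", "LEar",
]

# Limbs ("pairs") as index pairs into COCO_KEYPOINTS. Only these 19 of the
# 18 * 17 / 2 = 153 possible point combinations are treated as relevant limbs.
LIMB_PAIRS = [
    (1, 2), (1, 5), (2, 3), (3, 4), (5, 6), (6, 7),
    (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
    (1, 0), (0, 14), (14, 16), (0, 15), (15, 17),
    (2, 16), (5, 17),
]


def skeleton_edges(keypoints):
    """Return the limbs of one person whose two endpoints were both detected.

    `keypoints` maps a keypoint index to an (x, y) position; undetected
    keypoints are simply absent from the dict.
    """
    edges = []
    for a, b in LIMB_PAIRS:
        if a in keypoints and b in keypoints:
            edges.append((COCO_KEYPOINTS[a], COCO_KEYPOINTS[b], keypoints[a], keypoints[b]))
    return edges


# A partially detected person: only neck, right shoulder and right elbow.
person = {1: (120, 80), 2: (100, 85), 3: (95, 130)}
for name_a, name_b, pa, pb in skeleton_edges(person):
    print(f"{name_a} -> {name_b}: {pa} -> {pb}")
```

For this partial person, only the Neck to RShoulder and RShoulder to RElbow limbs are returned, since every other pair is missing at least one endpoint.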

Human Pose Keypoints

Knowing a person’s orientation paves the way for many real-life applications, many of them in sports and fitness. Many approaches to human pose estimation have been proposed over the years. The earliest techniques typically estimated the pose of a single person in an image containing only that person. OpenPose provides a more efficient and robust approach that makes pose estimation applicable to images of crowded scenes.

Keypoint estimation human pose with OpenPose – Source

What is OpenPose?

OpenPose is a real-time multi-person human pose detection library and the first to demonstrate joint detection of human body, foot, hand, and facial keypoints on single images. In total, OpenPose can detect 135 keypoints.

The method won the COCO 2016 Keypoints Challenge and is popular for its quality and its robustness in multi-person settings.

Keypoints detected by OpenPose on the COCO dataset.

Who created OpenPose?

OpenPose was created by Ginés Hidalgo, Yaser Sheikh, Zhe Cao, Yaadhav Raaj, Tomas Simon, Hanbyul Joo, and Shih-En Wei. It is maintained by Yaadhav Raaj and Ginés Hidalgo.

What are the features of OpenPose?

The OpenPose human pose detection library has many features; some of the most notable are:

  • Real-time 3D single-person keypoint detections
    • 3D triangulation with multiple camera views
    • Flir camera compatibility
  • Real-time 2D multi-person keypoint detections
    • 15-, 18-, or 25-keypoint body/foot keypoint estimation
    • 21-keypoint hand estimation (per hand)
    • 70-keypoint face estimation
  • Single-person tracking to speed up detection and smooth the visual output
  • Calibration toolbox for estimating extrinsic, intrinsic, and distortion camera parameters

Costs of OpenPose for commercial purposes

OpenPose is released under a license that allows free non-commercial use and redistribution under certain conditions. Commercial use (a non-exclusive commercial license) requires a non-refundable annual fee of $25,000 USD.

How to Use OpenPose

Lightweight OpenPose

Pose estimation algorithms usually require significant computational resources and rely on heavy models with large model sizes. This makes them unsuitable for real-time applications (video analytics) and for deployment on resource-constrained hardware (edge devices in edge computing). Hence, there is a need for lightweight real-time human pose estimators that can be deployed to devices to perform on-device edge machine learning.

Lightweight OpenPose is a heavily optimized OpenPose implementation that performs real-time inference on CPU with minimal accuracy loss. It detects a skeleton of keypoints and the connections between them to determine the pose of every person in the image. The detected keypoints include ankles, ears, knees, eyes, hips, nose, wrists, neck, elbows, and shoulders.

Hardware and Camera

OpenPose accepts input from images, videos, and camera streams: webcams, Flir/Point Grey cameras, IP cameras (CCTV), and custom input sources (such as depth cameras or stereo cameras).

Hardware-wise, OpenPose provides versions for Nvidia GPUs (CUDA), AMD GPUs (OpenCL), and CPU-only (non-GPU) computing. It runs on Ubuntu, Windows, Mac, and Nvidia Jetson TX2.

How to use OpenPose?

The fastest and easiest way to use OpenPose is through a platform like Viso Suite. This end-to-end solution provides everything needed to build, deploy, and scale OpenPose applications. With Viso Suite, you can apply OpenPose to common cameras (CCTV, IP, webcams, etc.), implement multi-camera systems, and run workloads on different AI hardware at the edge or in the cloud.

  • Find the official installation guide of OpenPose here.
  • Tutorials for using the Lightweight implementation version can be found here.
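Once OpenPose is installed, its demo can save per-frame results as JSON files when run with the `--write_json` flag. The sketch below reads one such file using only Python's standard library. The field names `people` and `pose_keypoints_2d` (a flat list of x, y, confidence triplets) reflect OpenPose's output format as we understand it; `load_people`, its threshold, and the sample values are hypothetical.

```python
import json


def load_people(json_text, min_confidence=0.1):
    """Parse one frame of OpenPose JSON output into per-person keypoints.

    Assumes the layout written by the --write_json option: a top-level
    "people" list whose entries hold a flat "pose_keypoints_2d" list of
    x, y, confidence triplets. Keypoints with confidence below
    `min_confidence` (undetected ones are typically reported as 0, 0, 0)
    are dropped.
    """
    frame = json.loads(json_text)
    people = []
    for person in frame.get("people", []):
        flat = person.get("pose_keypoints_2d", [])
        keypoints = {}
        for idx in range(len(flat) // 3):
            x, y, c = flat[3 * idx : 3 * idx + 3]
            if c >= min_confidence:
                keypoints[idx] = (x, y, c)
        people.append(keypoints)
    return people


# A tiny hand-written sample with one person and three keypoints
# (the second keypoint was not detected).
sample = '{"version": 1.3, "people": [{"pose_keypoints_2d": [320.5, 110.2, 0.91, 0, 0, 0, 300.1, 150.7, 0.84]}]}'
print(load_people(sample))  # [{0: (320.5, 110.2, 0.91), 2: (300.1, 150.7, 0.84)}]
```

Keying keypoints by their index keeps the mapping to the model's keypoint layout intact even after low-confidence points are dropped.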

How Does OpenPose Work?

OpenPose first extracts features from an image using the first few layers of a convolutional network. The extracted features are then fed into two parallel branches of convolutional layers. The first branch predicts a set of 18 confidence maps, each representing a specific part of the human pose skeleton. The second branch predicts a set of 38 Part Affinity Field (PAF) maps that encode the degree of association between parts; together they form 19 two-dimensional vector fields, one per limb, each with an x and a y component.
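To make confidence maps concrete, here is a minimal pure-Python sketch. It builds a map the way the OpenPose paper describes its training targets (a Gaussian peak around each person's keypoint, merged with a max so nearby people keep separate peaks) and then extracts candidate part locations as local maxima. The `sigma` and `threshold` values are arbitrary illustrative assumptions, and in a real system the peak search runs on the network's predicted maps rather than a synthetic one.

```python
import math


def confidence_map(width, height, keypoints, sigma=2.0):
    """Confidence map for one body part, built like a training target.

    Each person's keypoint (kx, ky) contributes a Gaussian peak
    exp(-||p - (kx, ky)||^2 / sigma^2); contributions are merged with max,
    so two nearby people keep two distinct peaks instead of one blurred blob.
    """
    cmap = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                d2 = (x - kx) ** 2 + (y - ky) ** 2
                cmap[y][x] = max(cmap[y][x], math.exp(-d2 / sigma ** 2))
    return cmap


def find_peaks(cmap, threshold=0.1):
    """Part candidates: (x, y, score) 3x3 local maxima above `threshold`."""
    height, width = len(cmap), len(cmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            v = cmap[y][x]
            if v < threshold:
                continue
            neighbours = (
                cmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (nx, ny) != (x, y)
            )
            if all(v >= n for n in neighbours):
                peaks.append((x, y, v))
    return peaks


# Two people whose (say) right wrists sit at (4, 5) and (15, 3).
cmap = confidence_map(20, 10, [(4, 5), (15, 3)])
print(find_peaks(cmap))  # [(15, 3, 1.0), (4, 5, 1.0)]
```

Using max rather than a sum to combine people is what keeps two close keypoints from merging into a single peak located between them.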

Later stages refine the predictions made by the branches. Peaks in the confidence maps give candidate locations for each body part, and candidates of two part types that can form a limb are connected in a bipartite graph. The PAF values score each connection, and weaker links in the bipartite graphs are pruned. Combining these steps, human pose skeletons can be estimated and assigned to every person in the image.
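The PAF scoring and pruning step can be sketched as follows. Following the line-integral idea from the OpenPose paper, a candidate limb between two part candidates is scored by sampling the PAF along the segment and averaging its dot product with the segment direction; candidates are then matched greedily, highest score first. This is a simplified sketch: `paf_at` is a hypothetical lookup standing in for the predicted PAF maps, the sample count and `min_score` are illustrative, and greedy matching is a common simplification rather than necessarily the exact procedure used in the OpenPose library.

```python
import math


def paf_score(paf_at, p1, p2, num_samples=10):
    """Approximate the PAF line integral for a candidate limb p1 -> p2.

    Samples the field at `num_samples` (>= 2) evenly spaced points on the
    segment and averages the dot product with the segment's unit direction.
    `paf_at(x, y)` is a hypothetical lookup returning the (vx, vy) PAF vector.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(num_samples):
        u = i / (num_samples - 1)
        vx, vy = paf_at(p1[0] + u * dx, p1[1] + u * dy)
        total += vx * ux + vy * uy
    return total / num_samples


def greedy_match(cands_a, cands_b, paf_at, min_score=0.2):
    """Match candidates of part A to candidates of part B for one limb type.

    Every A-B pair is an edge of a bipartite graph weighted by its PAF score.
    Edges below `min_score` are pruned; the rest are accepted highest score
    first, skipping any edge whose endpoints are already used.
    Returns (index_a, index_b, score) tuples.
    """
    edges = []
    for i, a in enumerate(cands_a):
        for j, b in enumerate(cands_b):
            score = paf_score(paf_at, a, b)
            if score >= min_score:
                edges.append((score, i, j))
    edges.sort(reverse=True)
    used_a, used_b, matches = set(), set(), []
    for score, i, j in edges:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((i, j, score))
    return matches


# Toy PAF: unit vectors pointing in +x along the rows y = 0 and y = 10
# (two horizontal neck -> shoulder limbs), zero everywhere else.
def toy_paf(x, y):
    return (1.0, 0.0) if min(abs(y), abs(y - 10)) < 1 else (0.0, 0.0)


necks = [(0, 0), (0, 10)]
shoulders = [(5, 0), (5, 10)]
print(greedy_match(necks, shoulders, toy_paf))  # [(1, 1, 1.0), (0, 0, 1.0)]
```

The diagonal "cross" connections, such as the upper neck to the lower shoulder, pass mostly through the zero region of the toy field, so their scores fall below `min_score` and they are pruned as weak links.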

How OpenPose Works – Source

Overview of the Pipeline

The OpenPose Pipeline consists of multiple sequential tasks:

  • a) Acquisition of the entire image as input (image or video frame)
  • b) Two-branch CNNs jointly predict confidence maps for body part detection
  • c) Estimate the Part Affinity Fields (PAF) for parts association
  • d) Set of bipartite matchings to associate body part candidates
  • e) Assemble them into full-body poses for all people in the image
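Step e) can be illustrated with a simplified grouping routine. It takes the per-limb matches produced in step d) and grows person skeletons by attaching each new limb to whichever person already owns one of its endpoints, merging two partial skeletons when a limb links them. This is a sketch under stated assumptions: `assemble_people` is hypothetical, candidate IDs are assumed to be unique across the image, and connection scores are ignored here.

```python
def assemble_people(limb_pairs, connections):
    """Group per-limb connections into per-person skeletons (simplified).

    limb_pairs[k] = (part_a, part_b) names the two part types of limb k, and
    connections[k] lists the (cand_a, cand_b) candidate IDs matched for limb k.
    Candidate IDs are assumed unique across the whole image. Returns a list
    of {part_name: candidate_id} dicts, one per person.
    """
    people = []
    for (part_a, part_b), matches in zip(limb_pairs, connections):
        for cand_a, cand_b in matches:
            owner_a = next((p for p in people if p.get(part_a) == cand_a), None)
            owner_b = next((p for p in people if p.get(part_b) == cand_b), None)
            if owner_a is None and owner_b is None:
                # Neither endpoint belongs to anyone yet: start a new person.
                people.append({part_a: cand_a, part_b: cand_b})
            elif owner_b is None:
                # Extend the person owning cand_a (keep any existing part_b).
                owner_a.setdefault(part_b, cand_b)
            elif owner_a is None:
                owner_b.setdefault(part_a, cand_a)
            elif owner_a is not owner_b and not (owner_a.keys() & owner_b.keys()):
                # The limb links two partial skeletons with no shared parts: merge.
                owner_a.update(owner_b)
                people = [p for p in people if p is not owner_b]
    return people


limbs = [("Neck", "RShoulder"), ("RShoulder", "RElbow"), ("Neck", "LShoulder")]
connections = [[(0, 2), (1, 3)], [(2, 4), (3, 5)], [(0, 6), (1, 7)]]
for person in assemble_people(limbs, connections):
    print(person)
# {'Neck': 0, 'RShoulder': 2, 'RElbow': 4, 'LShoulder': 6}
# {'Neck': 1, 'RShoulder': 3, 'RElbow': 5, 'LShoulder': 7}
```

Because every limb only references candidate IDs, the routine never needs to know how many people are in the image; people emerge from the connections themselves, which is the essence of the bottom-up approach.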

OpenPose vs. Alpha-Pose vs. Mask R-CNN

OpenPose is one of the most renowned bottom-up approaches for real-time multi-person body pose estimation, partly because of its well-written GitHub implementation. Like other bottom-up approaches, OpenPose first detects the parts (keypoints) belonging to every person in the image and then assigns those keypoints to specific individuals.

OpenPose vs. Alpha-Pose

RMPE, or Alpha-Pose, is a well-known top-down pose estimation technique. Its creators point out that top-down methods depend on the precision of the person detector, since pose estimation is performed on the region where the person is located. As a result, localization errors and duplicate bounding box predictions can cause the pose extraction algorithm to perform sub-optimally.

To solve this issue, the creators introduced a Symmetric Spatial Transformer Network (SSTN) to extract a high-quality person region from an inaccurate bounding box. A Single Person Pose Estimator (SPPE) is applied to this extracted region to estimate the pose skeleton of that individual. A Spatial De-Transformer Network (SDTN) then remaps the estimated pose back to the original image coordinate system. The authors also introduced a parametric pose Non-Maximum Suppression (NMS) method to eliminate redundant pose detections.
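The idea behind pose NMS can be illustrated with a simplified greedy version. Note that this swaps in a much simpler criterion than Alpha-Pose's: instead of RMPE's parametric pose distance, it suppresses a pose when its mean keypoint distance to a higher-scoring kept pose falls below a fixed, illustrative threshold. The function name and threshold are hypothetical.

```python
import math


def pose_nms(poses, scores, dist_threshold=10.0):
    """Greedy pose NMS with a simple distance criterion (not RMPE's parametric one).

    poses[i] is a list of (x, y) keypoints in a fixed order (all poses have
    the same length) and scores[i] is that pose's confidence. A pose is
    suppressed if its mean keypoint distance to an already kept, higher-scoring
    pose is below `dist_threshold` pixels. Returns kept indices, best first.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        is_duplicate = any(
            sum(math.dist(p, q) for p, q in zip(poses[i], poses[k])) / len(poses[i])
            < dist_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


poses = [
    [(10, 10), (20, 30)],      # person 1
    [(12, 11), (21, 29)],      # near-duplicate of person 1 from an overlapping box
    [(100, 100), (110, 120)],  # person 2
]
print(pose_nms(poses, scores=[0.9, 0.8, 0.7]))  # [0, 2]
```

Comparing whole poses rather than bounding boxes is the point of pose NMS: two boxes around the same person can overlap only partially, yet the poses estimated inside them may still be near-identical.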

In addition, the authors proposed a Pose-Guided Proposals Generator that augments the training samples to better train the SPPE and SSTN networks. The most important feature of Alpha-Pose is that it can be extended to any combination of a person detection algorithm and an SPPE.

OpenPose vs. Mask R-CNN

Last but not least, Mask R-CNN is a well-known architecture for instance segmentation. It predicts both the bounding box locations of the objects in the image and a mask that segments each object (image segmentation). The Mask R-CNN architecture can be easily extended to human pose estimation.

It first extracts feature maps from an image with a Convolutional Neural Network (CNN). A Region Proposal Network (RPN) uses these feature maps to propose bounding box candidates for objects. Each bounding box candidate selects a region of the feature map. Since the candidates can have different sizes, the RoIAlign layer resamples the extracted features to a fixed size so that they are uniform.
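What distinguishes RoIAlign from earlier RoI pooling is that it does not round box coordinates to the feature grid; it samples the feature map at fractional positions using bilinear interpolation and averages the samples in each output bin. Below is a minimal single-channel sketch in pure Python, assuming box coordinates are already in feature-map units and integer coordinates are sample centres. The function names, `out_size`, and `samples` are illustrative; framework implementations handle channels, batching, out-of-bounds samples, and sampling ratios differently.

```python
import math


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2D feature map (list of rows) at continuous (x, y)."""
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the map for simplicity
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = feature[y0][x0] * (1 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1 - fx) + feature[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def roi_align(feature, box, out_size=2, samples=2):
    """Pool box = (x1, y1, x2, y2) to an out_size x out_size grid.

    Box and bin edges are never rounded; each bin averages
    samples x samples bilinearly interpolated points.
    """
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            acc = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    acc += bilinear(feature, px, py)
            row.append(acc / samples ** 2)
        out.append(row)
    return out


# A linear 8x8 feature map f(x, y) = x + 10 * y makes the result easy to check.
feature = [[x + 10 * y for x in range(8)] for y in range(8)]
print(roi_align(feature, (1.0, 1.0, 5.0, 5.0)))  # [[22.0, 24.0], [42.0, 44.0]]
```

Whatever the box size, the output is always `out_size x out_size`, which is what lets the subsequent box, mask, or keypoint branches operate on uniformly sized inputs.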

The extracted features are then passed into parallel CNN branches for the final prediction of the bounding boxes and the segmentation masks. The detector can be trained to localize the regions containing people. By combining each person’s location with their set of keypoints, we obtain the human pose skeleton for every individual in the image.

This technique is very similar to the top-down method, but the person detection step is performed alongside the part detection step. Put simply, the keypoint detection phase and the person detection phase run independently of each other.

Mask R-CNN Architecture

The Bottom Line

Real-time multi-person pose estimation is an important element in enabling machines to visually comprehend and analyze humans and their interactions. OpenPose is one of the most popular detection libraries for pose estimation and is capable of real-time multi-person pose analysis.

The lightweight variant makes it possible to apply OpenPose in Edge AI applications and to deploy it for on-device Edge ML Inference.

To develop, deploy, maintain and scale pose estimation applications effectively, a wide range of tools is needed. The Viso Suite platform provides all those capabilities in one end-to-end solution. Get in touch and request a demo for your organization.

What’s next for OpenPose?

Looking ahead, OpenPose represents a significant advancement in artificial intelligence and computer vision. It also paves the way for future advancements, sparking new research and applications that could transform how we interact with technology.
