Introduction to modalities

Datagen’s datapoints are rendered images of synthetic human beings. Each image is accompanied by annotation files that highlight different aspects of the underlying ground truth. Because the images are computer-generated, we can provide pixel-perfect ground truth data.

The annotations are designed to bring different parts of the ground truth to the forefront: the locations of facial landmarks; a normal map that reconstructs the contours of the face; the direction of eye gaze; and so on. Each of these ground truths is called a modality.

This section of the documentation details the structure and format of each modality, so you can process the data properly and integrate it into your training set.

Table of Contents

The documentation provides two complementary guides through the modalities:

  • Per file: The File structure page goes folder by folder and file by file through a downloaded dataset, explaining what data each file contains.

  • Per modality: Each of the following pages describes a single modality and lists the files that contain data relevant to it:

Visual modalities

These modalities are primarily made up of image files. They include the original rendered image itself as well as several recolored versions of that image, each serving a different purpose (a short loading sketch follows the list):

  • Rendered image: The original rendered image and information about the lighting environment

  • Depth map: A recolored image that encodes, at each pixel, the distance from the camera to the surface visible at that pixel

  • Normal map: A recolored image that encodes the direction of the surface normal at each pixel

  • HDRI map: A low-resolution copy of the background image in the scene

  • Semantic segmentation: A recolored image that identifies semantic objects and parts of objects in the scene

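To get oriented, here is a minimal sketch of loading these images in Python. The file names, the PNG format, and the segmentation color used below are assumptions for illustration; the File structure page and the per-modality pages document the actual names, formats, and color mappings.

```python
import numpy as np
from PIL import Image

# Hypothetical file names -- see the File structure page for the actual
# names and formats in a downloaded dataset.

# Depth map: each pixel encodes a distance from the camera.
depth = np.asarray(Image.open("depth.png"), dtype=np.float32)

# Normal map: each pixel stores a unit vector remapped from [-1, 1] into
# the [0, 255] color range; undo the remapping to recover the vectors.
rgb = np.asarray(Image.open("normal_map.png"), dtype=np.float32)
normals = rgb / 255.0 * 2.0 - 1.0  # shape (H, W, 3)

# Semantic segmentation: one color per semantic class. Comparing pixels
# against a class color yields a boolean mask for that class (the color
# below is a placeholder -- see the Semantic segmentation page).
seg = np.asarray(Image.open("semantic_segmentation.png").convert("RGB"))
face_mask = np.all(seg == np.array([255, 0, 0]), axis=-1)
```
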
Keypoint modalities

These modalities primarily consist of JSON files that list the 2D and 3D coordinates of landmarks in the scene (a sketch for reading these files follows the list):

  • About our coordinate systems: An explanation of the 3D and 2D coordinate systems that we use in our keypoint files

  • Facial keypoints (iBUG): The 68 landmarks that make up the iBUG facial landmark standard

  • Facial keypoints (MediaPipe): The 468 landmarks that make up the MediaPipe facial landmark standard

  • Body keypoints: Keypoints that identify body landmarks according to a standard developed by Datagen

  • Ear keypoints: The 55 landmarks for each ear that make up the iBUG ear landmark standard

  • Eye keypoints: Keypoints that identify eye landmarks according to a standard developed by Datagen

  • Foot keypoints: The 3 landmarks for each foot that make up the CMU Perceptual Computing Lab foot landmark standard

  • Hand keypoints: The 21 landmarks for each hand that make up the MediaPipe hand landmark standard

  • Head keypoints: Keypoints that identify non-facial head landmarks according to a standard developed by Datagen

  • Bounding box: The coordinates of a bounding box marking the region of the image that contains a human face

  • Center of geometry: The coordinates of the center of geometry for important objects in the scene

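To make the format concrete, here is a minimal sketch that overlays 2D keypoints on a rendered image. The file names and the "pixel_2d" key are assumptions; each modality's page documents its real file names and JSON schema, and About our coordinate systems explains the coordinate conventions.

```python
import json

from PIL import Image, ImageDraw

# Hypothetical paths and schema -- each keypoint record is assumed to
# carry 2D pixel coordinates under a "pixel_2d" key. Check the
# per-modality pages for the actual file names and JSON layout.
with open("face_keypoints.json") as f:
    keypoints = json.load(f)

image = Image.open("rendered_image.png")
draw = ImageDraw.Draw(image)
for point in keypoints:
    x, y = point["pixel_2d"]
    # Mark each landmark with a small dot.
    draw.ellipse([x - 2, y - 2, x + 2, y + 2], fill="red")
image.save("keypoints_overlay.png")
```
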
Other modalities

These modalities contain miscellaneous data about the actors and cameras:

  • Actor metadata: Information about the identity and behavior of the actor(s) in the scene

  • Camera metadata: Intrinsic and extrinsic parameters for the camera in the scene (see the projection sketch below)
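
Since the camera metadata supplies intrinsic and extrinsic parameters, a typical use is projecting 3D keypoints into pixel coordinates with a pinhole camera model. The sketch below is a minimal example under assumed field names; the Camera metadata page documents the actual schema.

```python
import json

import numpy as np

# Hypothetical field names -- see the Camera metadata page for the
# actual schema of the metadata file.
with open("camera_metadata.json") as f:
    cam = json.load(f)

K = np.array(cam["intrinsic_matrix"])  # 3x3 intrinsic matrix
R = np.array(cam["rotation_matrix"])   # 3x3 world-to-camera rotation
t = np.array(cam["translation"])       # camera translation, 3-vector

def project(point_world):
    """Project a 3D world-space point to 2D pixel coordinates."""
    p_cam = R @ np.asarray(point_world) + t  # world -> camera space
    u, v, w = K @ p_cam                      # camera -> image plane
    return u / w, v / w                      # perspective divide

# Example: project the world origin into the image.
print(project([0.0, 0.0, 0.0]))
```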