Gaze Estimation

Introduction

Gaze estimation is a computer vision task in which a model predicts the direction in which a person is looking based on an image of the person’s face. 3D applications of gaze estimation involve predicting the direction that a person is looking in a 3D space – for example, conducting attention analysis on people who are driving. 2D applications of gaze estimation involve predicting the target of a person’s gaze on a 2D surface – for example, placing a mouse cursor at the appropriate location on a screen.

../_images/gazesample.png

In this playbook we will describe how to use synthetic data from Datagen to solve gaze estimation tasks: what you need to know and define, and how to get started with synthetic data.

Defining the visual domain

Before you approach this task, you should carefully consider the domain in which your model will operate and the potential pitfalls that you want to avoid.

  1. Configurations: What are my data needs?

    Examples:

    • Camera

      • Will I be using visual spectrum or near-infrared images?

      • Will I be using temporal data (sequences) or individual frames?

      • What are my camera parameters (resolution, location, orientation, etc.)?

    • Subject

      • What is the expected range of head poses – the location and orientation of the subject’s head relative to the camera?

      • What is the expected range of gaze directions?

      • Which types of facial expressions do I expect the model to handle?

      • Will there be occlusions between the camera and the eye, such as glasses?

    • Surroundings

      • What is the lighting in the scene going to be like?

  2. Edge cases: Where might my model encounter difficulties?

    Examples:

    • Extreme head pose angles

    • Extreme gaze angles

    • Mostly closed eyes

    • Glasses

    • Particularly dark or bright lighting

  3. Potential Bias Scenarios: What biases might I see in real-world data that synthetic data can help me address?

    Examples:

    • Small or unrepresentative range of ethnicities

    • Few examples of people with their eyes partially closed

    • Low variance in glasses shapes, sizes, and opacities

Defining a gaze vector

Before we begin working, we must define exactly how to translate gaze vectors from the real world into data.

From anatomy to vector

There are two vectors that can define a person’s gaze: the optical axis vector and the visual axis vector.

  • The optical axis vector describes the direction that the eyeball appears to be pointed. It is based on the line between the apex of the cornea and the center of the pupil:

    optical_axis_vector = apex_of_cornea - center_of_pupil

  • The visual axis vector describes the direction that the person is actually looking. It is based on the line between the center of rotation and the fovea:

    visual_axis_vector = center_of_rotation_point - fovea_point

These four points (the apex of the cornea, the center of the pupil, the eyeball’s center of rotation, and the fovea) are not precisely on the same line. As a result, the two vectors are slightly offset from one another – in other words, where the eye appears to be looking is not exactly where the eye is actually looking. See here for more details.
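As a minimal sketch of the two definitions above, the snippet below turns a pair of 3D points into a unit direction vector. The helper name unit_direction and all coordinates are hypothetical and chosen only for illustration; they are not Datagen values.

```python
import math


def unit_direction(tip, tail):
    """Return the unit vector pointing from `tail` to `tip` (3D points as x, y, z)."""
    diff = [t - s for t, s in zip(tip, tail)]
    norm = math.sqrt(sum(c * c for c in diff))
    if norm == 0:
        raise ValueError("points coincide; the direction is undefined")
    return tuple(c / norm for c in diff)


# Hypothetical eye landmarks (arbitrary units), for illustration only.
apex_of_cornea = (0.0, -0.012, 0.0)
center_of_pupil = (0.0, -0.0085, 0.0)
center_of_rotation = (0.0, 0.0, 0.0)
fovea = (0.001, 0.011, -0.0005)

# Same subtraction order as in the definitions above.
optical_axis = unit_direction(apex_of_cornea, center_of_pupil)
visual_axis = unit_direction(center_of_rotation, fovea)

print("optical axis:", optical_axis)
print("visual axis: ", visual_axis)
```

Normalizing both vectors makes them directly comparable, which is useful when measuring the offset between the two axes.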

../_images/eyeanatomy.png

An accurate gaze detection model should ideally predict the visual axis vector, which more accurately reflects the direction the subject is looking. However, the two points that define this vector are on the inside of the eyeball and are obviously not visible in real-world images. This makes the vector both harder to predict and harder to provide as part of the annotations to your training set.

It is much easier to predict the optical axis vector, because that only requires the model to estimate the locations of the apex of the cornea and the center of the pupil – both of which are visible from the outside. And the optical axis vector does provide sufficient accuracy for many applications of gaze estimation models.

However, using synthetic data it is possible to provide perfectly accurate optical and visual axis vectors, enabling you to provide the model with data that demonstrates the offset between the two – a necessary step towards accurately predicting the visual axis vector using the optical axis vector as a bootstrap.

Using 3D gaze vectors

The primary benefit of using Datagen’s synthetic data is perfectly accurate annotations. We provide both visual axis and optical axis vectors for the eyes of every subject you generate in our platform.

The vectors are provided in JSON format in the actor_metadata.json file found in each Scene folder in your dataset, and can be accessed through the datagen-tech Python package available through PyPI.

Because these annotations provide a one-to-one relationship between an image and the ground truth, they are ideal for training your model.

Here is an example of a generated subject’s gaze vectors:

"eye_gaze": {
   "axis_directions": {
      "right_eye": {
         "axis_directions": {
            "visual_axis_direction": {
               "x": 0.2186808735546518,
               "y": -0.9488713367468006,
               "z": 0.22764415617738634
            },
            "optical_axis_direction": {
               "x": 0.18227809777132098,
               "y": -0.9533525853570715,
               "z": 0.2406107708849639
            }
         }
      },
      "left_eye": {
         "axis_directions": {
            "visual_axis_direction": {
               "x": 0.20972485836741495,
               "y": -0.9505072147881017,
               "z": 0.22924117958720472
            },
            "optical_axis_direction": {
               "x": 0.2648305457623606,
               "y": -0.938850700652277,
               "z": 0.22005486569475924
            }
         }
      }
   }
}
../_images/paintedgaze.png

An example of a gaze vector painted directly onto the rendered image.
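The offset between the two axes can be measured directly from these annotations. The following is a minimal sketch that uses only the Python standard library rather than the datagen-tech package. It assumes the excerpt above has been wrapped in braces to form valid JSON; the helper names are hypothetical.

```python
import json
import math

# The example excerpt from above, wrapped in braces so that it parses as JSON.
EXAMPLE = """
{"eye_gaze": {"axis_directions": {
  "right_eye": {"axis_directions": {
    "visual_axis_direction":
      {"x": 0.2186808735546518, "y": -0.9488713367468006, "z": 0.22764415617738634},
    "optical_axis_direction":
      {"x": 0.18227809777132098, "y": -0.9533525853570715, "z": 0.2406107708849639}}},
  "left_eye": {"axis_directions": {
    "visual_axis_direction":
      {"x": 0.20972485836741495, "y": -0.9505072147881017, "z": 0.22924117958720472},
    "optical_axis_direction":
      {"x": 0.2648305457623606, "y": -0.938850700652277, "z": 0.22005486569475924}}}
}}}
"""


def as_tuple(vector):
    """Convert an {"x": .., "y": .., "z": ..} mapping into an (x, y, z) tuple."""
    return (vector["x"], vector["y"], vector["z"])


def angle_between_deg(a, b):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(p * q for p, q in zip(a, b))
    norms = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    cosine = max(-1.0, min(1.0, dot / norms))  # guard against rounding drift
    return math.degrees(math.acos(cosine))


eyes = json.loads(EXAMPLE)["eye_gaze"]["axis_directions"]
offsets = {}
for eye_name, eye in eyes.items():
    axes = eye["axis_directions"]
    offsets[eye_name] = angle_between_deg(
        as_tuple(axes["visual_axis_direction"]),
        as_tuple(axes["optical_axis_direction"]),
    )
    print(f"{eye_name}: visual/optical offset = {offsets[eye_name]:.2f} degrees")
```

For the example values, the offset in each eye comes out at a few degrees, which illustrates why the two vectors should not be treated as interchangeable.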

Using 2D gaze vectors

After you train your model, you will of course need to test it. While you can conduct testing using synthetic data, you will at some point need to conduct testing using actual examples of the real-world data the model will encounter in the wild. This test data must be 2D real-world data that is manually annotated by humans. Therefore, any testing pipeline that you build for gaze estimation will need to validate quality using 2D annotations.

Manual 2D annotations of real-world data are possible to obtain but are usually not accurate, especially because of heavy occlusion and the visual variance in eye shape. Datagen’s 2D annotations are pixel-perfect and can be combined with human annotations.

Datagen therefore provides the 2D location data for all of the following:

  • Apex of cornea - “apex_of_cornea_point”

  • Center of eye rotation - “center_of_rotation_point”

  • Center of iris - “center_of_iris_point”

  • Center of pupil - “center_of_pupil_point”

  • Iris circle points - “iris_circle”

  • Pupil circle points - “pupil_circle”

We provide coordinates in both 2D (location in the rendered image) and 3D (location in global coordinates):

"apex_of_cornea_point": {
   "2d": {
      "camera_1": {
         "right_eye": {
            "x": 408,
            "y": 358
         },
         "left_eye": {
            "x": 401,
            "y": 618
         }
      }
   },
   "3d": {
      "right_eye": {
         "x": -0.03276697173714638,
         "y": -0.03935128450393677,
         "z": 0.14207574725151062
      },
      "left_eye": {
         "x": 0.022525761276483536,
         "y": -0.03960025683045387,
         "z": 0.1436464786529541
      }
   }
},
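As a hedged illustration of how such a keypoint entry might be consumed, the sketch below parses the excerpt above with the standard json module and separates the per-camera pixel coordinates from the global 3D coordinates. The excerpt is wrapped in braces to make it valid JSON, and the variable names are hypothetical.

```python
import json
import math

# The keypoint excerpt from above, wrapped in braces so that it parses as JSON.
EXAMPLE = """
{"apex_of_cornea_point": {
  "2d": {"camera_1": {
    "right_eye": {"x": 408, "y": 358},
    "left_eye": {"x": 401, "y": 618}}},
  "3d": {
    "right_eye": {"x": -0.03276697173714638, "y": -0.03935128450393677, "z": 0.14207574725151062},
    "left_eye": {"x": 0.022525761276483536, "y": -0.03960025683045387, "z": 0.1436464786529541}}
}}
"""

keypoint = json.loads(EXAMPLE)["apex_of_cornea_point"]

# 2D: pixel location in the rendered image, stored per camera and per eye.
pixels = {eye: (p["x"], p["y"]) for eye, p in keypoint["2d"]["camera_1"].items()}

# 3D: location in global coordinates, stored per eye.
points_3d = {eye: (p["x"], p["y"], p["z"]) for eye, p in keypoint["3d"].items()}

# Example use: distance between the two corneal apexes, in the dataset's own units.
apex_separation = math.dist(points_3d["right_eye"], points_3d["left_eye"])

print("pixels:", pixels)
print("distance between corneal apexes:", apex_separation)
```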

Using 2D and 3D annotation: a summary

| Learning phase | Data type | Annotation type | Required annotation |
|---|---|---|---|
| Training | Real / Synthetic | 2D | Two human-annotated points in each eye: (1) apex of cornea 2D keypoint, (2) center of pupil 2D keypoint |
| Training | Synthetic | 3D | Datagen provides the 3D gaze vector directly |
| Test | Real | 2D | Two human-annotated points in each eye: (1) apex of cornea 2D keypoint, (2) center of pupil 2D keypoint |
| Test | Synthetic | 2D / 3D | 3D gaze vector OR 2D coordinates of eye keypoints |

For 2D training annotations, you can calculate the 3D gaze vector by comparing the apparent distance between the two points in the 2D plane with the assumption that they are approximately 12 mm apart in 3D space.
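One way to sketch the 12 mm calculation for 2D training annotations is shown below. It is only an approximation under stated assumptions: an orthographic (weak-perspective) projection, a known mm_per_pixel scale at the eye (a hypothetical parameter that would have to come from your camera setup), and a camera coordinate system with x to the right, y down, and z pointing away from the camera. The function name is hypothetical.

```python
import math

ASSUMED_DISTANCE_MM = 12.0  # assumed 3D distance between the two keypoints (from the text)


def gaze_from_2d(pupil_px, apex_px, mm_per_pixel, distance_mm=ASSUMED_DISTANCE_MM):
    """Estimate a unit 3D gaze vector from two 2D keypoints.

    Assumes orthographic projection and an eye that faces the camera,
    so the depth component is negative (towards the camera).
    """
    # In-plane offset from pupil center to corneal apex, converted to millimeters.
    dx = (apex_px[0] - pupil_px[0]) * mm_per_pixel
    dy = (apex_px[1] - pupil_px[1]) * mm_per_pixel
    in_plane_sq = dx * dx + dy * dy

    # Whatever part of the assumed 3D length is not visible in the image plane
    # must lie along the depth axis. Clamp at zero because annotation noise can
    # make the 2D offset longer than the assumed 3D distance.
    depth_sq = max(distance_mm ** 2 - in_plane_sq, 0.0)
    dz = -math.sqrt(depth_sq)

    norm = math.sqrt(in_plane_sq + dz * dz)
    return (dx / norm, dy / norm, dz / norm)


# Keypoints that coincide in the image mean the eye looks straight at the camera.
print(gaze_from_2d((100, 100), (100, 100), mm_per_pixel=0.1))

# A 60-pixel horizontal offset at 0.1 mm per pixel is a 6 mm in-plane offset.
print(gaze_from_2d((100, 100), (160, 100), mm_per_pixel=0.1))
```

The result is sensitive to both the scale and the assumed distance, so vectors derived this way are noisier than the 3D vectors supplied with synthetic data.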

Input and output formats

Based on the above, your model’s input and output might be as follows:

| Input | Output |
|---|---|
| 512×512×1 image | left_x, left_y, left_z, right_x, right_y, right_z |
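With a six-value output like this, one commonly used way to score a prediction is the angle between the predicted and ground-truth vector for each eye. The sketch below assumes that metric and the output ordering shown above; the function names are hypothetical and no particular training framework is implied.

```python
import math


def split_output(values):
    """Split [left_x, left_y, left_z, right_x, right_y, right_z] into two 3-vectors."""
    if len(values) != 6:
        raise ValueError("expected six values: one 3D vector per eye")
    return tuple(values[:3]), tuple(values[3:])


def angular_error_deg(predicted, truth):
    """Angle in degrees between a predicted and a ground-truth 3D vector."""
    dot = sum(p * t for p, t in zip(predicted, truth))
    norms = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(sum(t * t for t in truth))
    if norms == 0:
        raise ValueError("zero-length vector has no direction")
    cosine = max(-1.0, min(1.0, dot / norms))  # guard against rounding drift
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(predicted6, truth6):
    """Average the per-eye angular error of one model output."""
    pairs = zip(split_output(predicted6), split_output(truth6))
    errors = [angular_error_deg(p, t) for p, t in pairs]
    return sum(errors) / len(errors)


# Left eye is off by 90 degrees, right eye is exact, so the mean is 45 degrees.
print(mean_angular_error_deg([1, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 1]))
```

Because the metric depends only on direction, the model’s output does not need to be normalized before scoring.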

Congratulations! Now you know exactly what data you need, what annotations you need, and how the model will input and output the data. Time to start working with data.

Defining your primary test set

You should define at least one clear test set that represents the real world as best you can, so that you can benchmark your models and track your progress.

Recommendations:

  • Your test set should include a minimum of 5000 real-world datapoints, annotated manually using the annotation methodology defined above.

  • Distribution

    • Should provide uniform coverage of the target visual domain

    • Make sure to include edge cases (such as extreme head angles) and bias scenarios (such as infrequently encountered ethnicities)

In addition to the primary test set, we will also be creating small test sets using synthetic data for specific purposes; more on this later.

Generating synthetic data

Your first dataset

The goal of the first training set is simply to extract an initial signal from the data. Do not worry about the settings on Datagen’s Faces platform; your goal here is simply to validate that the model is correctly set up to process the synthetic data. Check that the input formats are correct, that the losses are in fact pushing the model in the right direction, and that the inference of the model outputs the correct data. It is also important to check that the model can be tested on the test set and that the performance metrics are working as expected.

| Training set name | Size | Goal | Parameters |
|---|---|---|---|
| standard_initial_training_set | 1000 | Initial Signal | Default |

Your second dataset

The second training set should provide uniform coverage of the target visual domain: sufficient examples of each type of visual data that your model will be expected to process in the real world. You should aim for uniform distribution of the various high-level variables that make up the data. In the case of gaze estimation this means:

  • Uniform coverage of different yaw angles relative to the camera

  • Uniform coverage of different pitch angles relative to the camera

  • Uniform coverage of different roll angles relative to the camera

You should keep the platform’s default uniform distribution of gaze directions.
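Before training, it can be worth checking that a generated dataset really covers each rotation axis evenly. The sketch below bins a list of angles and reports how even the bins are. The angle range of ±30°, the bin count, and the randomly drawn sample values are all hypothetical stand-ins for angles you would read from your own dataset’s metadata.

```python
import random


def coverage_histogram(angles_deg, low, high, bins):
    """Count how many angles fall into each of `bins` equal-width bins on [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for angle in angles_deg:
        if low <= angle <= high:
            # The upper edge belongs to the last bin.
            index = min(int((angle - low) / width), bins - 1)
            counts[index] += 1
    return counts


def uniformity_ratio(counts):
    """Smallest bin count divided by the largest: 1.0 is perfectly even, 0.0 means an empty bin."""
    largest = max(counts)
    return min(counts) / largest if largest else 0.0


# Stand-in data: 1000 yaw angles drawn uniformly from a hypothetical range.
rng = random.Random(0)
yaw_angles = [rng.uniform(-30.0, 30.0) for _ in range(1000)]

counts = coverage_histogram(yaw_angles, low=-30.0, high=30.0, bins=6)
print("yaw bin counts:", counts)
print("uniformity ratio:", round(uniformity_ratio(counts), 2))
```

The same check can be repeated for pitch and roll; a ratio far below 1.0 points to a part of the range that is underrepresented.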

| Training set name | Size | Goal | Parameters |
|---|---|---|---|
| uniform_gaze_uniform_head_pose | 1000 | Initial Signal | Default except for camera and face positions (see below) |

As with your first dataset, you should leave the age, gender, ethnicity, and expression settings alone. But for your second dataset, there are two parameters to which you should give careful thought: camera placement and NIR.

Camera placement

By default, our platform points the camera directly at the subject’s face from 1.6 meters away. You should move the camera so that it has the same orientation towards the subject that the real-world camera will have.

Do not focus the camera too narrowly on where you expect the eye to be. Not all faces are the same height, so not all eyes will be in the same place. In addition, if your real-world camera will be able to see each subject’s entire face, this should be reflected in your training set. Therefore, you should make sure your camera is far enough from the subject or has a wide enough field of view that it can capture the entire face.

Our platform includes a simulator that enables you to see where the subject will appear in your images. Use the simulation to set the following:

  • Position and orientation of the camera

  • A range of positions and orientations of the subject’s head, so the subject can appear in a variety of places in the image

NIR parameters

../_images/NIRsample.jpg

The majority of real-world data for the task of eye gaze detection is in near infrared (NIR). Datagen’s platform enables you to generate simulated NIR images and even define the following NIR spotlight parameters:

| Parameter | Definition | Range of values |
|---|---|---|
| Beam Angle | The angle covered by the NIR light beam. Outside of this angle there is complete darkness. | 3° – 90° |
| Light fade toward the beam edge | A measurement of how quickly light fades as it approaches the edges of the spotlight. | 0 – 100%, where 100% means the light is at full strength only at its center and gradually gets weaker as you move towards the edge, and 0% means the light is at full strength throughout the beam, with a sharp cutoff between light and dark at the beam’s edge. |
| Brightness | The amount of light energy that is output by the spotlight (the spotlight’s intensity). | 1 – 1000 W |

The NIR spotlight shares the camera’s location and orientation. As a result, you should take the camera position into consideration when defining the NIR parameters. For example, if the camera is very close to the face, you should choose a lower brightness level to avoid over-exposure.
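As a rough starting point for choosing a brightness, the sketch below scales a reference brightness with the square of the camera-to-face distance, following the inverse-square law for light. This is a physical rule of thumb and not a description of how the platform computes exposure. The reference values are hypothetical, and the result is clamped to the brightness range listed above.

```python
MIN_BRIGHTNESS_W = 1.0     # lower end of the brightness range listed above
MAX_BRIGHTNESS_W = 1000.0  # upper end of the brightness range listed above


def scaled_brightness(reference_w, reference_distance_m, new_distance_m):
    """Scale brightness so the light reaching the face stays roughly constant.

    Light intensity on the subject falls off with the square of the distance,
    so halving the distance calls for a quarter of the brightness.
    """
    if reference_distance_m <= 0 or new_distance_m <= 0:
        raise ValueError("distances must be positive")
    raw = reference_w * (new_distance_m / reference_distance_m) ** 2
    return max(MIN_BRIGHTNESS_W, min(MAX_BRIGHTNESS_W, raw))


# Hypothetical reference: 100 W looked well exposed at 1.6 m.
print(scaled_brightness(100.0, 1.6, 0.8))   # camera moved to half the distance
print(scaled_brightness(100.0, 1.6, 16.0))  # far away: clamped to the maximum
```

Treat the output as a first guess and confirm it visually on a small generated sample.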

Data-centric iterations

After you perform training on a combination of your real-world dataset and your second synthetic dataset, it is time to analyze errors and gaps. Where does the model go wrong?

There are a couple of ways to do this:

  • Perform inference on your validation set, then explore the datapoints with the worst prediction score. What do they have in common? Where does your model struggle? This analysis can be qualitative (visual exploration) or quantitative, using methods such as heat maps and correlation maps.

  • Generate small, highly specific datasets on the Datagen platform to test various edge cases (extreme lighting conditions, for example)
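The first option can be sketched as a simple ranking of per-image errors followed by a count of shared attributes. The record fields below (id, error_deg, glasses, lighting) are hypothetical; in practice they would come from your own inference results and dataset metadata.

```python
from collections import Counter


def worst_samples(records, k):
    """Return the k records with the largest prediction error."""
    return sorted(records, key=lambda r: r["error_deg"], reverse=True)[:k]


def common_attributes(samples, keys):
    """For each attribute, report its most frequent value and how often it occurs."""
    return {key: Counter(s[key] for s in samples).most_common(1)[0] for key in keys}


# Hypothetical validation results: one record per image.
records = [
    {"id": "img_001", "error_deg": 2.1, "glasses": False, "lighting": "normal"},
    {"id": "img_002", "error_deg": 11.4, "glasses": True, "lighting": "dark"},
    {"id": "img_003", "error_deg": 3.0, "glasses": False, "lighting": "bright"},
    {"id": "img_004", "error_deg": 9.8, "glasses": True, "lighting": "normal"},
    {"id": "img_005", "error_deg": 1.7, "glasses": True, "lighting": "normal"},
]

worst = worst_samples(records, k=2)
shared = common_attributes(worst, keys=["glasses", "lighting"])

print("worst ids:", [r["id"] for r in worst])
print("most common attribute values among them:", shared)
```

An attribute value that dominates the worst samples, but not the dataset as a whole, is a good candidate for a dedicated test set.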

Let’s discuss the second option, in which we create various test datasets containing ~500 datapoints each. We will specify certain parameters while leaving the others at their default values. Checking the losses on these small test datasets should help uncover where the network is weak.

To make the process easier, you can use our clone feature to prepare a new dataset with the same settings as an earlier one, and then make a single change before sending it to be generated. This is particularly useful when you are iterating over a single parameter (for example, keeping all of the parameters constant except for ethnicity).

../_images/howtoclone.png

Here are some examples of gap analysis:

Different head poses with a forward gaze

Let’s check the effects of yaw, pitch, and roll on the model’s performance. We will generate the following test sets:

| Test set name | Size | Goal | Parameters |
|---|---|---|---|
| forward_gaze_uniform_head_yaw | 1000 | Effect of Yaw | Forward gaze, uniform distribution of yaw |
| forward_gaze_uniform_head_pitch | 1000 | Effect of Pitch | Forward gaze, uniform distribution of pitch |
| forward_gaze_uniform_head_roll | 1000 | Effect of Roll | Forward gaze, uniform distribution of roll |
| forward_gaze_uniform_head_yaw_pitch_roll | 1000 | Effect of Head Pose (Combined) | Forward gaze, uniform distribution of head rotations |

Different gaze directions independent of head pose

Let’s check the effects of gaze direction on the model’s performance. We will generate the following test sets:

| Test set name | Size | Goal | Parameters |
|---|---|---|---|
| uniform_gaze_forward | 500 | Effect of gaze forward | Gaze forward, all else uniform |
| uniform_gaze_left | 500 | Effect of gaze left | Gaze left, all else uniform |
| uniform_gaze_right | 500 | Effect of gaze right | Gaze right, all else uniform |

Etc.

Ethnic biases

Let’s check the effects of different ethnicities on the model’s performance. We will generate the following test sets:

| Test set name | Type | Parameters |
|---|---|---|
| african_eyes_looking_forward | Ethnicity Bias Analysis | 100% African, 100% Looking forward |
| hispanic_eyes_looking_forward | Ethnicity Bias Analysis | 100% Hispanic, 100% Looking forward |
| mediterranean_eyes_looking_forward | Ethnicity Bias Analysis | 100% Mediterranean, 100% Looking forward |
| southeast_asian_eyes_looking_forward | Ethnicity Bias Analysis | 100% Southeast Asian, 100% Looking forward |
| north_european_eyes_looking_forward | Ethnicity Bias Analysis | 100% North European, 100% Looking forward |
| south_asian_eyes_looking_forward | Ethnicity Bias Analysis | 100% South Asian, 100% Looking forward |

As you can see, the idea behind these test sets is to zero in, one at a time, on specific variables – gaze direction, ethnicity, etc. – and to assess each variable’s impact on your model.

Each of these test sets should be generated only once, so that they can be used as benchmarks to assess how performance improves when you make adjustments or you train a new model. Never train on these test sets!
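Because the test sets stay fixed, two model versions can be compared set by set. The sketch below assumes you have already computed one mean error per test set for each model; the numbers and the function name are hypothetical.

```python
def benchmark_deltas(baseline, candidate):
    """Per-test-set change in mean error (negative means the candidate improved)."""
    mismatched = set(baseline) ^ set(candidate)
    if mismatched:
        raise KeyError(f"test sets differ between runs: {sorted(mismatched)}")
    return {name: candidate[name] - baseline[name] for name in baseline}


# Hypothetical mean errors (in degrees) of two models on the same fixed test sets.
baseline = {"forward_gaze_uniform_head_yaw": 6.0, "uniform_gaze_left": 8.5}
candidate = {"forward_gaze_uniform_head_yaw": 5.0, "uniform_gaze_left": 9.0}

deltas = benchmark_deltas(baseline, candidate)
regressions = [name for name, delta in deltas.items() if delta > 0]

print("deltas:", deltas)
print("regressions:", regressions)
```

Refusing to compare runs with different test sets is deliberate: a benchmark is only meaningful if both models saw exactly the same data.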

Generating more training data based on gap analysis

When you have identified your model’s weaknesses, that means you know which gaps in the training data need to be filled.

For example, if your test on the ‘african_eyes_looking_forward’ test set didn’t go so well, you should consider adding more examples of African ethnicities looking forward to your training set.

For every such gap you find in your training data, we recommend that you generate at least 1000 more datapoints, and add them to your combination synthetic/real-world training set. Then retrain, and repeat your benchmarking and gap analysis!
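This loop can be sketched as a small planning step: flag every test set whose error exceeds a threshold, and schedule new training data for it. The threshold, the error values, and the function name are hypothetical; only the figure of 1000 datapoints per gap comes from the recommendation above.

```python
DATAPOINTS_PER_GAP = 1000  # minimum number of new datapoints recommended per gap


def plan_extra_training_data(errors_by_test_set, threshold_deg):
    """Map each underperforming test set to the number of new training datapoints to generate."""
    gaps = [name for name, error in errors_by_test_set.items() if error > threshold_deg]
    return {name: DATAPOINTS_PER_GAP for name in sorted(gaps)}


# Hypothetical mean errors (in degrees) from the latest benchmarking round.
errors = {
    "african_eyes_looking_forward": 9.2,
    "hispanic_eyes_looking_forward": 4.1,
    "north_european_eyes_looking_forward": 3.8,
}

plan = plan_extra_training_data(errors, threshold_deg=6.0)
print(plan)
```

Each planned entry stands for a freshly generated dataset with the same parameters as the failing test set. The test set itself is never added to the training data.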

Reminder: Never use your test sets to enhance your training set!

Summary

We have provided a step-by-step manual of how to incorporate synthetic data in the task of gaze estimation:

  1. Define your target visual domain and annotations

  2. Define the input and output formats for your models

  3. Configure the Datagen platform to produce training sets with the right parameters

  4. Generate your first and second training sets to get your initial signal

  5. Iterate using gap analysis

    1. Find your errors by creating test sets

    2. Generate training data to bridge your gaps

    3. Repeat

Our ever-evolving platform enables you to constantly think up new edge cases, test on them, and then generate new datapoints according to the results.

We are more than happy to offer additional advice and hear about your challenges and success stories in training gaze detection models on synthetic data!

Ping us at support@datagen.tech.

  • The Datagen team