Gaze Estimation#
Introduction#
Gaze estimation is a computer vision task in which a model predicts the direction in which a person is looking based on an image of the person’s face. 3D applications of gaze estimation involve predicting the direction that a person is looking in a 3D space – for example, conducting attention analysis on people who are driving. 2D applications of gaze estimation involve predicting the target of a person’s gaze on a 2D surface – for example, placing a mouse cursor at the appropriate location on a screen.

In this playbook we will describe how to use synthetic data from Datagen to solve gaze estimation tasks: what you need to know and define, and how to get started with synthetic data.
Defining the visual domain#
Before you approach this task, you should carefully consider the domain in which your model will operate and the potential pitfalls that you want to avoid.
- Configurations – What are my data needs? Examples:
  - Camera
    - Will I be using visual spectrum or near-infrared images?
    - Will I be using temporal data (sequences) or individual frames?
    - What are my camera parameters (resolution, location, orientation, etc.)?
  - Subject
    - What is the expected range of head poses – the location and orientation of the subject’s head relative to the camera?
    - What is the expected range of gaze directions?
    - Which types of facial expressions do I expect the model to handle?
    - Will there be occlusions between the camera and the eye, such as glasses?
  - Surroundings
    - What is the lighting in the scene going to be like?
- Edge cases – Where might my model encounter difficulties? Examples:
  - Extreme head pose angles
  - Extreme gaze angles
  - Mostly closed eyes
  - Glasses
  - Particularly dark or bright lighting
- Potential Bias Scenarios – What biases might I see in real-world data that synthetic data can help me address? Examples:
  - Small or unrepresentative range of ethnicities
  - Few examples of people with their eyes partially closed
  - Low variance in glasses shapes, sizes, and opacities
Defining a gaze vector#
Before we begin working, we must define exactly how to translate gaze vectors from the real world into data.
From anatomy to vector#
There are two vectors that can define a person’s gaze: the optical axis vector and the visual axis vector.
The optical axis vector describes the direction that the eyeball appears to be pointed. It is based on the line between the apex of the cornea and the center of the pupil:
optical_axis_vector = apex_of_cornea - center_of_pupil
The visual axis vector describes the direction that the person is actually looking. It is based on the line between the center of rotation and the fovea:
visual_axis_vector = center_of_rotation_point - fovea_point
These four points (the apex of the cornea, the center of the pupil, the eyeball’s center of rotation, and the fovea) are not precisely on the same line. As a result, the two vectors are slightly offset from one another – in other words, where the eye appears to be looking is not exactly where the eye is actually looking.

Ideally, a gaze estimation model should predict the visual axis vector, since it reflects where the subject is actually looking. However, the two points that define this vector are inside the eyeball and are not visible in real-world images, which makes the vector both harder to predict and harder to provide as part of your training set’s annotations.
The optical axis vector is much easier to predict, because it only requires the model to estimate the locations of the apex of the cornea and the center of the pupil – both of which are visible from the outside – and it provides sufficient accuracy for many gaze estimation applications.
With synthetic data, however, it is possible to provide perfectly accurate optical and visual axis vectors, enabling you to give the model data that demonstrates the offset between the two – a necessary step towards accurately predicting the visual axis vector using the optical axis vector as a bootstrap.
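To make the relationship concrete, here is a minimal sketch (using NumPy, with made-up point coordinates purely for illustration) that builds both axes from the four anatomical points defined above and measures the angular offset between them:

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical 3D point locations in meters (illustrative values only)
apex_of_cornea     = np.array([0.0325, -0.0390, 0.1420])
center_of_pupil    = np.array([0.0318, -0.0388, 0.1395])
center_of_rotation = np.array([0.0300, -0.0385, 0.1320])
fovea              = np.array([0.0295, -0.0383, 0.1200])

# Optical axis: from the center of the pupil towards the apex of the cornea
optical_axis = unit(apex_of_cornea - center_of_pupil)

# Visual axis: from the fovea towards the eyeball's center of rotation
visual_axis = unit(center_of_rotation - fovea)

# Angular offset between the two axes, in degrees
offset = np.degrees(np.arccos(np.clip(np.dot(optical_axis, visual_axis), -1.0, 1.0)))
print(f"Offset between optical and visual axis: {offset:.2f} degrees")
```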
Using 3D gaze vectors#
The primary benefit of using Datagen’s synthetic data is perfectly accurate annotations. We provide both visual axis and optical axis vectors for the eyes of every subject you generate in our platform.
The vectors are provided in JSON format in the actor_metadata.json file found in each datapoint folder in your dataset, and can be accessed through the datagen-tech Python package, available on PyPI.
Because these annotations provide a one-to-one relationship between an image and the ground truth, they are ideal for training your model.
Here is an example of a generated subject’s gaze vectors:
"eye_gaze": {
"axis_directions": {
"right_eye": {
"axis_directions": {
"visual_axis_direction": {
"x": 0.2186808735546518,
"y": -0.9488713367468006,
"z": 0.22764415617738634
},
"optical_axis_direction": {
"x": 0.18227809777132098,
"y": -0.9533525853570715,
"z": 0.2406107708849639
}
}
},
"left_eye": {
"axis_directions": {
"visual_axis_direction": {
"x": 0.20972485836741495,
"y": -0.9505072147881017,
"z": 0.22924117958720472
},
"optical_axis_direction": {
"x": 0.2648305457623606,
"y": -0.938850700652277,
"z": 0.22005486569475924
}
}
}
}
}

An example of a gaze vector painted directly onto the rendered image.#
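If you are not using the datagen-tech package, the same values can be read with the standard library. The sketch below assumes the JSON layout shown above and a hypothetical file path; the exact nesting may differ between platform versions, so adjust the keys to match your own metadata files:

```python
import json
import numpy as np

# Hypothetical path to one datapoint folder in a generated dataset
with open("my_dataset/datapoint_0001/actor_metadata.json") as f:
    metadata = json.load(f)

def gaze_vectors(eye):
    """Return (optical, visual) axis directions for one eye as unit NumPy vectors."""
    axes = metadata["eye_gaze"]["axis_directions"][eye]["axis_directions"]
    optical = np.array([axes["optical_axis_direction"][k] for k in ("x", "y", "z")])
    visual = np.array([axes["visual_axis_direction"][k] for k in ("x", "y", "z")])
    return optical / np.linalg.norm(optical), visual / np.linalg.norm(visual)

right_optical, right_visual = gaze_vectors("right_eye")
left_optical, left_visual = gaze_vectors("left_eye")
print("Right eye visual axis:", right_visual)
```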
Using 2D gaze vectors#
After you train your model, you will of course need to test it. While you can conduct testing using synthetic data, you will at some point need to conduct testing using actual examples of the real-world data the model will encounter in the wild. This test data must be 2D real-world data that is manually annotated by humans. Therefore, any testing pipeline that you build for gaze estimation will need to validate quality using 2D annotations.
Manual 2D annotations of real-world data are possible to obtain, but they are usually not very accurate (especially because of occlusions and the high visual variance in eye shape). Datagen’s 2D annotations, by contrast, are pixel-perfect and can be used alongside human annotations.
Datagen therefore provides the 2D location data for all of the following:
- Apex of cornea – “apex_of_cornea_point”
- Center of eye rotation – “center_of_rotation_point”
- Center of iris – “center_of_iris_point”
- Center of pupil – “center_of_pupil_point”
- Iris circle points – “iris_circle”
- Pupil circle points – “pupil_circle”
We provide coordinates in both 2D (location in the rendered image) and 3D (location in global coordinates):
"apex_of_cornea_point": {
"2d": {
"camera_1": {
"right_eye": {
"x": 408,
"y": 358
},
"left_eye": {
"x": 401,
"y": 618
}
}
},
"3d": {
"right_eye": {
"x": -0.03276697173714638,
"y": -0.03935128450393677,
"z": 0.14207574725151062
},
"left_eye": {
"x": 0.022525761276483536,
"y": -0.03960025683045387,
"z": 0.1436464786529541
}
}
},
Using 2D and 3D annotation: a summary#
| Learning phase | Data Type | Annotation type | Required annotation |
|---|---|---|---|
| Training | Real / Synthetic | 2D | Two human-annotated points in each eye: (1) apex of cornea 2D keypoint, (2) center of pupil 2D keypoint. By comparing the apparent distance between these two points in the 2D plane with the assumption that they are approximately 12 mm apart in 3D space, you can calculate the 3D gaze vector. |
| Training | Synthetic | 3D | Datagen provides the 3D gaze vector directly |
| Test | Real | 2D | Two human-annotated points in each eye: (1) apex of cornea 2D keypoint, (2) center of pupil 2D keypoint |
| Test | Synthetic | 2D / 3D | 3D gaze vector OR 2D coordinates of eye keypoints |
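The 2D training row can be turned into a rough geometric recipe. The sketch below is only an approximation, and everything beyond the two keypoints is an assumption on our part: a pinhole camera with a known focal length (in pixels), a rough estimate of the eye’s depth, image axes aligned with the camera’s x/y axes, and a depth-component sign that presumes the subject is roughly facing the camera.

```python
import numpy as np

def gaze_from_2d_keypoints(apex_px, pupil_px, focal_px, depth_mm, separation_mm=12.0):
    """Approximate a 3D gaze direction in camera coordinates (+z pointing away
    from the camera) from the apex-of-cornea and center-of-pupil 2D keypoints."""
    # Pixel displacement -> metric displacement in the image plane at the eye's depth
    du, dv = np.subtract(apex_px, pupil_px)
    dx = du * depth_mm / focal_px
    dy = dv * depth_mm / focal_px

    # Recover the out-of-plane component from the assumed 3D separation (~12 mm per
    # the table above). The cornea apex sits in front of the pupil, i.e. closer to
    # the camera, so the z component is negative when the subject faces the camera.
    dz = -np.sqrt(max(separation_mm**2 - dx**2 - dy**2, 0.0))

    gaze = np.array([dx, dy, dz])
    return gaze / np.linalg.norm(gaze)

# Made-up example: keypoints a few pixels apart, 600 px focal length, eye ~600 mm away
print(gaze_from_2d_keypoints(apex_px=(410, 356), pupil_px=(408, 358),
                             focal_px=600, depth_mm=600))
```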
Input and output formats#
Based on the above, your model’s input and output might be as follows:
| INPUT | OUTPUT |
|---|---|
| 512×512×1 image | left_x, left_y, left_z, right_x, right_y, right_z |
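As one possible interpretation of this contract, a toy PyTorch sketch might look like the following; the architecture itself is arbitrary and only illustrates the 512×512×1 input and six-value output:

```python
import torch
import torch.nn as nn

class GazeNet(nn.Module):
    """Toy network: a 512x512x1 image in, six values out
    (left_x, left_y, left_z, right_x, right_y, right_z)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 6)  # two 3D gaze vectors, concatenated

    def forward(self, x):  # x: (batch, 1, 512, 512)
        return self.head(self.features(x).flatten(1))

model = GazeNet()
print(model(torch.randn(2, 1, 512, 512)).shape)  # torch.Size([2, 6])
```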
Congratulations! Now you know exactly what data you need, what annotations you need, and how the model will input and output the data. Time to start working with data.
Defining your primary test set#
You should define at least one clear test set that represents the real world as best you can, so that you can benchmark your models and track your progress.
Recommendations:
- Your test set should include a minimum of 5000 real-world datapoints, annotated manually using the annotation methodology defined above.
- Distribution:
  - Should provide uniform coverage of the target visual domain
  - Make sure to include edge cases (such as extreme head angles) and bias scenarios (such as infrequently encountered ethnicities)
In addition to the primary test set, we will also be creating small test sets using synthetic data for specific purposes; more on this later.
Generating synthetic data#
Your first dataset#
The goal of the first training set is simply to extract an initial signal from the data. Do not worry about the settings on Datagen’s Faces platform; your goal here is to validate that the model is correctly set up to process the synthetic data. Check that the input formats are correct, that the losses are in fact pushing the model in the right direction, and that inference outputs data in the expected format. It is also important to check that the model can be evaluated on the test set and that the performance metrics work as expected.
| Training Set Name | Size | Goal | Parameters |
|---|---|---|---|
| standard_initial_training_set | 1000 | Initial Signal | Default |
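One quick way to perform this validation is to try to overfit a handful of datapoints: if the loss does not drop towards zero, something in the pipeline (input formatting, loss wiring, or inference code) is broken. A minimal sketch, with random tensors standing in for your actual dataloading and a throwaway stand-in model:

```python
import torch
import torch.nn as nn

# Random stand-ins for a few (image, gaze-target) pairs from the initial dataset
images = torch.randn(8, 1, 512, 512)
targets = torch.randn(8, 6)

# Any small model with the right input/output shapes will do for this check
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5, stride=4, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(201):
    optimizer.zero_grad()
    loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
# The loss should fall steadily; if it does not, debug the pipeline before scaling up.
```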
Your second dataset#
The second training set should provide uniform coverage of the target visual domain: sufficient examples of each type of visual data that your model will be expected to process in the real world. You should aim for uniform distribution of the various high-level variables that make up the data. In the case of gaze estimation this means:
- Uniform coverage of different yaw angles relative to the camera
- Uniform coverage of different pitch angles relative to the camera
- Uniform coverage of different roll angles relative to the camera
You should keep the platform’s default uniform distribution of gaze directions.
| Training Set Name | Size | Goal | Parameters |
|---|---|---|---|
| uniform_gaze_uniform_head_pose | 1000 | Uniform coverage of the visual domain | Default except for camera and actor positions (see below) |
As with your first dataset, you should leave the age, gender, ethnicity, and expression settings alone. But for your second dataset, there are two parameters to which you should give careful thought: camera placement and NIR.
Camera placement#
By default, our platform points the camera directly at the actor’s face from 1.6 meters away. You should move the camera so that it has the same orientation towards the subject that the real-world camera will have.
Do not focus the camera too narrowly on where you expect the eye to be. Not all actors’ heads are the same height, so not all eyes will be in the same place. In addition, if your real-world camera will be able to see the actor’s entire face or even their entire body, this should be reflected in your training set. Therefore, you should make sure your camera is far enough from the subject or has a wide enough field of view that it can capture all of the information that is necessary.
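As a quick sanity check on camera placement, you can estimate the horizontal field of view needed to keep the whole head (or upper body) in frame at a given distance. The subject width and distance below are placeholder values you would replace with your own:

```python
import math

def required_fov_deg(subject_width_m, distance_m):
    """Horizontal field of view needed to fit a subject of the given width at the given distance."""
    return math.degrees(2 * math.atan(subject_width_m / (2 * distance_m)))

# e.g. a ~0.5 m wide head-and-shoulders region at the platform's default 1.6 m distance
print(f"{required_fov_deg(0.5, 1.6):.1f} degrees")  # ~17.8 degrees
```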
Our platform includes a simulator that enables you to see where the subject will appear in your images. Use the simulation to set the following:
- Position and orientation of the camera
- A range of positions and orientations of the subject’s head, so the subject can appear in a variety of places in the image
NIR parameters#

The majority of real-world data for the task of eye gaze detection is in near infrared (NIR). Datagen’s platform enables you to generate simulated NIR images and even define the following NIR spotlight parameters:
| Parameter | Definition | Range of Values |
|---|---|---|
| Beam Angle | The angle covered by the NIR light beam. Outside of this angle there is complete darkness. | 3° – 90° |
| Falloff | A measurement of how quickly light fades as it approaches the edges of the spotlight. | 0–100%, where 100% means the light is at full strength only at its center and gradually gets weaker towards the edge, and 0% means the light is at full strength throughout the beam, with a sharp cutoff between light and dark at the beam’s edge. |
| Brightness | The amount of light energy that is output by the spotlight (the spotlight’s intensity). | 1 – 1000 W |
The NIR spotlight shares the camera’s location and orientation. As a result, you should take the camera position into consideration when defining the NIR parameters. For example, if the camera is very close to the face, you should choose a lower brightness level to avoid over-exposure.
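For intuition about how brightness interacts with distance and beam angle, you can roughly model the irradiance on the face as the spotlight’s power spread over the cone it illuminates. This is a crude approximation that ignores falloff, reflections, and exposure settings, and the power, beam angle, and distances below are placeholder values:

```python
import math

def irradiance_w_per_m2(power_w, beam_angle_deg, distance_m):
    """Rough irradiance on a surface facing the spotlight: power divided by
    the area its cone covers at that distance (solid angle * distance^2)."""
    half_angle = math.radians(beam_angle_deg / 2)
    solid_angle = 2 * math.pi * (1 - math.cos(half_angle))  # steradians covered by the cone
    return power_w / (solid_angle * distance_m ** 2)

# The same power is ~25x more intense at 0.3 m than at 1.5 m, so lower the
# brightness when the camera (and co-located spotlight) is close to the face.
for d in (0.3, 1.5):
    print(f"{d} m: {irradiance_w_per_m2(power_w=100, beam_angle_deg=60, distance_m=d):.0f} W/m^2")
```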
Data-centric iterations#
After you perform training on a combination of your real-world dataset and your second synthetic dataset, it is time to analyze errors and gaps. Where does the model go wrong?
There are a couple of ways to do this:
- Perform inference on your validation set, then explore the datapoints with the worst prediction score. What do they have in common? Where does your model struggle? This analysis can be qualitative (visual exploration) or quantitative, using methods such as heat maps and correlation maps (see the sketch after this list).
- Generate small, highly specific datasets on the Datagen platform to test various edge cases (extreme lighting conditions, for example)
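For the first option, a common per-datapoint score is the angular error between the predicted and ground-truth gaze vectors; sorting by it surfaces the datapoints worth inspecting. A minimal sketch, with random arrays standing in for the predictions and ground truth you would load from your own validation run:

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Per-sample angle in degrees between predicted and ground-truth gaze vectors (shape (N, 3))."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Placeholder data standing in for a real validation run
pred = np.random.randn(1000, 3)
gt = np.random.randn(1000, 3)

errors = angular_error_deg(pred, gt)
worst = np.argsort(errors)[::-1][:20]  # indices of the 20 worst predictions
print(f"Mean error: {errors.mean():.1f} degrees; worst datapoints: {worst[:5]}")
# Inspect the images and metadata behind these indices for shared traits
# (extreme head pose, closed eyes, glasses, lighting, specific ethnicities, ...).
```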
Let’s discuss the second option, in which we create various test datasets containing ~500 datapoints each. We will specify a few parameters of interest while leaving the others at their default values. When we check the losses on these small test datasets, they should help uncover where the network is weak.
To make the process easier, you can use our clone feature to prepare a new dataset with the same settings as an earlier one, and then make a single change before sending it to be generated. This is particularly useful when you are iterating over a single parameter (for example, keeping all of the parameters constant except for ethnicity).

Here are some examples of gap analysis:
Different head poses with a forward gaze#
Let’s check the effects of yaw, pitch, and roll on the model’s performance. We will generate the following test sets:
| Test Set Name | Size | Goal | Parameters |
|---|---|---|---|
| forward_gaze_uniform_head_yaw | 1000 | Effect of Yaw | Forward gaze, uniform distribution of yaw |
| forward_gaze_uniform_head_pitch | 1000 | Effect of Pitch | Forward gaze, uniform distribution of pitch |
| forward_gaze_uniform_head_roll | 1000 | Effect of Roll | Forward gaze, uniform distribution of roll |
| forward_gaze_uniform_head_yaw_pitch_roll | 1000 | Effect of Head Pose (Combined) | Forward gaze, uniform distribution of head rotations |
Different gaze directions independent of head pose#
Let’s check the effects of gaze direction on the model’s performance. We will generate the following test sets:
| Test Set Name | Size | Goal | Parameters |
|---|---|---|---|
| uniform_gaze_forward | 500 | Effect of gaze forward | Gaze forward, all else uniform |
| uniform_gaze_left | 500 | Effect of gaze left | Gaze left, all else uniform |
| uniform_gaze_right | 500 | Effect of gaze right | Gaze right, all else uniform |
| Etc. | | | |
Ethnic biases#
Let’s check the effects of different ethnicities on the model’s performance. We will generate the following test sets:
| Test Set Name | Type | Parameters |
|---|---|---|
| african_eyes_looking_forward | Ethnicity Bias Analysis | 100% African, 100% Looking forward |
| hispanic_eyes_looking_forward | Ethnicity Bias Analysis | 100% Hispanic, 100% Looking forward |
| mediterranean_eyes_looking_forward | Ethnicity Bias Analysis | 100% Mediterranean, 100% Looking forward |
| southeast_asian_eyes_looking_forward | Ethnicity Bias Analysis | 100% Southeast Asian, 100% Looking forward |
| north_european_eyes_looking_forward | Ethnicity Bias Analysis | 100% North European, 100% Looking forward |
| south_asian_eyes_looking_forward | Ethnicity Bias Analysis | 100% South Asian, 100% Looking forward |
As you can see, the idea behind these test sets is to zero in, one at a time, on specific variables – gaze direction, ethnicity, and so on – and to assess each variable’s impact on your model.
Each of these test sets should be generated only once, so that they can be used as benchmarks to assess how performance improves when you make adjustments or you train a new model. Never train on these test sets!
Generating more training data based on gap analysis#
When you have identified your model’s weaknesses, that means you know which gaps in the training data need to be filled.
For example, if your model performs poorly on the african_eyes_looking_forward test set, you should consider adding more examples of African subjects looking forward to your training set.
For every such gap you find in your training data, we recommend that you generate at least 1000 more datapoints, and add them to your combination synthetic/real-world training set. Then retrain, and repeat your benchmarking and gap analysis!
Reminder: Never use your test sets to enhance your training set!
Summary#
We have provided a step-by-step guide to incorporating synthetic data into the task of gaze estimation:
- Define your target visual domain and annotations
- Define the input and output formats for your models
- Configure the Datagen platform to produce training sets with the right parameters
- Generate your first and second training sets to get your initial signal
- Iterate using gap analysis:
  - Find your errors by creating test sets
  - Generate training data to bridge your gaps
  - Repeat
Our ever-evolving platform enables you to constantly think up new edge cases, test on them, and then generate new datapoints according to the results.
We are more than happy to offer additional advice and hear about your challenges and success stories in training gaze detection models on synthetic data!
Ping us at support@datagen.tech.
The Datagen team