Data Challenges in Gaze Analysis
When analyzing gaze data, missing values are a common obstacle. In my recent project, I worked with a dataset containing 99 target points, each associated with 1000 gaze points[1], collected during an eye-tracking study. To handle missing gaze points, I applied cluster mean imputation within the gaze points of each target group. This approach ensured that the imputed values respected the local structure of the gaze data.
The Dataset: Organized by Target Points
The dataset consisted of 99 target points, each corresponding to 1000 gaze points. These gaze points represented where participants focused while following a randomized target in the Random Saccade Task (RAN).
Key characteristics:
- Target Points: 99 positions across a visual field.
- Gaze points: 1000 samples per target, collected over time.
- Missing Data: Random gaps due to participant behavior or tracking errors.
To handle missing values, I treated the gaze points of each target group as one cluster and used that cluster’s mean to impute the missing data.
Step-by-Step Methodology
Step 1: Clustering Gaze Points by Target Group
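The first step was to process each group of 1000 gaze points independently. Each target's gaze points form one cluster, so grouping the rows by the target coordinates (xT, yT) and averaging x and y yields one cluster mean per target.
import pandas as pd
# Step 1: handle missing values by imputation
# Group by target point (xT, yT) and compute the group means of x and y
# (mean() skips NaN values by default)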
group_means = gaze_data.groupby(['xT', 'yT'])[['x','y']].mean()
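To make the result of Step 1 concrete, here is a minimal sketch on a made-up toy frame. The values are invented for illustration and are not taken from GazeBase; only the column names (x, y, xT, yT) follow the post. It shows that the group means are indexed by the (xT, yT) pair, which is what allows a tuple lookup later on.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two targets, a few gaze samples each
toy = pd.DataFrame({
    "xT": [0.0, 0.0, 0.0, 5.0, 5.0],
    "yT": [0.0, 0.0, 0.0, 2.0, 2.0],
    "x": [0.1, np.nan, 0.3, 5.2, np.nan],
    "y": [-0.1, np.nan, 0.1, 2.1, np.nan],
})

# One row of means per target; NaN samples are ignored by mean()
toy_means = toy.groupby(["xT", "yT"])[["x", "y"]].mean()
print(toy_means)

# The index is a MultiIndex of (xT, yT), so a tuple selects one target
x_mean_first_target = toy_means.loc[(0.0, 0.0), "x"]
print(x_mean_first_target)
```

Because the missing samples are skipped, each mean is computed only from the gaze points that were actually recorded for that target.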
Step 2: Create a Function to Impute with Group Means
I wrote a function that replaces any NaN with the mean of its group. The impute_missing_values function fills missing values (NaN) in the dataset’s x and y columns. These columns hold the gaze coordinates, which are occasionally missing due to data collection issues. Instead of dropping these rows or using a global mean, the function imputes each missing value with the mean of its own target group.
# Define a function to impute missing values in x and y using group means
def impute_missing_values(df, group_means):
    for idx, row in df.iterrows():
        if pd.isna(row['x']) or pd.isna(row['y']):
            target_group = (row['xT'], row['yT'])
            if target_group in group_means.index:
                # Impute with the group mean
                if pd.isna(row['x']):
                    df.at[idx, 'x'] = group_means.loc[target_group, 'x']
                if pd.isna(row['y']):
                    df.at[idx, 'y'] = group_means.loc[target_group, 'y']
    return df
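Looping with iterrows() can be slow on a dataset of this size. As an alternative, the same idea can be written in vectorized form. This is only a sketch, not the code used in the project: the function name impute_with_group_means and the toy values are my own assumptions, while groupby(...).transform("mean") and fillna are standard pandas calls.

```python
import numpy as np
import pandas as pd

def impute_with_group_means(df, group_cols=("xT", "yT"), value_cols=("x", "y")):
    """Return a copy of df with NaNs in value_cols replaced by per-group means."""
    out = df.copy()
    cols = list(value_cols)
    # transform("mean") returns the group mean repeated for every row,
    # aligned with the original index
    row_means = out.groupby(list(group_cols))[cols].transform("mean")
    # fillna only touches cells that are NaN; observed values stay as they are
    out[cols] = out[cols].fillna(row_means)
    return out

# Toy data (hypothetical values) with the same column names as the post
toy = pd.DataFrame({
    "xT": [0.0, 0.0, 0.0, 5.0, 5.0],
    "yT": [0.0, 0.0, 0.0, 2.0, 2.0],
    "x": [0.1, np.nan, 0.3, 5.2, np.nan],
    "y": [-0.1, np.nan, 0.1, 2.1, np.nan],
})
filled = impute_with_group_means(toy)
print(filled)
```

Like the loop version, a target whose samples are all missing has no mean to fall back on, so its NaNs would remain.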
Step 3: Check Before and After Imputation
To evaluate the imputation process, I print the number of NaN values per column with isna().sum() for both the original and the imputed datasets. This confirms that the imputation filled every missing x and y value.
# Make a copy of the original data and impute missing values
imputed_data = gaze_data.copy()
imputed_data = impute_missing_values(imputed_data, group_means)
print(gaze_data.isna().sum())
print(imputed_data.isna().sum())
Output (original data first, then imputed data):
n 0
x 1338
y 1338
val 0
xT 0
yT 0
dP 1338
lab 0
dtype: int64
n 0
x 0
y 0
val 0
xT 0
yT 0
dP 1338
lab 0
dtype: int64
As we can see, the imputed data has 0 missing values in the x and y columns. Since I don't intend to use the dP value, I haven’t imputed those NaNs.
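Counting NaNs shows that the gaps were filled, but it does not show that the recorded values were left alone. A small extra check can cover that. The helper below is a hypothetical sketch (the name observed_values_unchanged and the toy values are assumptions, not part of the project code).

```python
import numpy as np
import pandas as pd

def observed_values_unchanged(original, imputed, cols=("x", "y")):
    """True if every value that was observed in `original` is identical in `imputed`."""
    cols = list(cols)
    observed = original[cols].notna()
    # Cells that were NaN in the original are allowed to differ
    same = (original[cols] == imputed[cols]) | ~observed
    return bool(same.all().all())

# Toy data (hypothetical values): one missing sample, filled by hand
original = pd.DataFrame({"x": [0.1, np.nan, 0.3], "y": [1.0, np.nan, 1.2]})
imputed = original.copy()
imputed.loc[1, ["x", "y"]] = [0.2, 1.1]

ok = observed_values_unchanged(original, imputed)
print(ok)

# A frame where an observed value was overwritten should fail the check
tampered = imputed.copy()
tampered.loc[0, "x"] = 9.9
tampered_ok = observed_values_unchanged(original, tampered)
print(tampered_ok)
```

On the real data, the call would take gaze_data and imputed_data as its two arguments.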
Key Takeaways
- Target-Based Clustering: Treating each target's gaze points as one cluster kept every imputed value within the spatial context of its own target.
- Cluster-Aware Imputation: By imputing missing values with cluster means, each new value reflects the typical gaze position for that target.
- Enhanced Accuracy: Compared with a global mean, this keeps imputed points close to their own target instead of pulling them toward the center of the visual field, although mean imputation still reduces the variance within each group.
Challenges and Future Directions
While cluster mean imputation worked well for this dataset, future improvements could include:
- Dynamic Clustering: Adjusting the number of clusters dynamically based on the density of gaze points.
- Temporal Modeling: Integrating temporal information to better capture gaze trajectories over time.
- Validation of Real-World Data: Comparing the results with known ground truth in experimental settings.
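One way the validation idea could be approached is to hide values that are actually known, impute them, and compare the imputed values with the hidden ones. The sketch below does this on synthetic data only: the three targets, the noise level, and the 10% masking rate are all assumptions chosen for illustration, and nothing here is a result from the GazeBase data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic gaze data (assumption): gaze = target position + Gaussian noise
targets = [(-10.0, 0.0), (0.0, 5.0), (10.0, -5.0)]
frames = []
for xT, yT in targets:
    frames.append(pd.DataFrame({
        "xT": xT,
        "yT": yT,
        "x": xT + rng.normal(0, 0.5, 200),
        "y": yT + rng.normal(0, 0.5, 200),
    }))
truth = pd.concat(frames, ignore_index=True)

# Hide roughly 10% of the rows to simulate missing samples
mask = rng.random(len(truth)) < 0.1
masked = truth.copy()
masked.loc[mask, ["x", "y"]] = np.nan

# Candidate fill values: per-target mean vs. one global mean
group_fill = masked.groupby(["xT", "yT"])[["x", "y"]].transform("mean")
global_fill = masked[["x", "y"]].mean()

def rmse(diff):
    """Root mean squared error over all cells of a difference frame."""
    return float(np.sqrt((diff.to_numpy() ** 2).mean()))

actual = truth.loc[mask, ["x", "y"]]
rmse_group = rmse(group_fill.loc[mask] - actual)
rmse_global = rmse(actual - global_fill)
print("group-mean RMSE:", rmse_group)
print("global-mean RMSE:", rmse_global)
```

On data shaped like this, the group mean should land much closer to the hidden values than the global mean, because the targets are far apart relative to the noise. Whether the same holds for real recordings is exactly what such a check would need to show.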
References
[1] H. Griffith, D. Lohr, E. Abdulin, and O. Komogortsev, “GazeBase, a large-scale, multi-stimulus, longitudinal eye movement dataset,” Sci Data, vol. 8, no. 1, p. 184, Jul. 2021, doi: 10.1038/s41597-021-00959-y.