Filling the Gaps: Imputing Missing Gaze Data Using Cluster Means

yahdiinformatika

Data Challenges in Gaze Analysis

When analyzing gaze data, missing values are a common obstacle. In a recent project, I worked with a dataset containing 99 target points, each associated with 1000 gaze points[1], collected during an eye-tracking study. To handle missing gaze points, I applied cluster mean imputation within each target group: a missing sample is replaced by the mean of the observed gaze points recorded for the same target. This keeps the imputed values consistent with the local structure of the gaze data.

The Dataset: Organized by Target Points

Each of the 99 target points corresponds to 1000 gaze points. These gaze points record where participants were looking while following a randomized target in the Random Saccade Task (RAN).

Key characteristics:

Target Points: 99 positions across a visual field.
Gaze points: 1000 samples per target, collected over time.
Missing Data: Random gaps due to participant behavior or tracking errors.
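Before imputing, it can help to see how the gaps are spread across targets. The following is a minimal sketch, not the code used in the project: it assumes the column names that appear later in this post (x, y, xT, yT) and uses a small made-up frame named toy in place of the real gaze_data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for gaze_data; all values are invented for illustration.
toy = pd.DataFrame({
    "xT": [0.0, 0.0, 0.0, 5.0, 5.0, 5.0],
    "yT": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    "x":  [0.1, np.nan, -0.1, 5.2, 4.8, np.nan],
    "y":  [1.1, np.nan, 0.9, 2.1, 1.9, np.nan],
})

# Flag rows with a missing x sample, then count the flags per target position.
missing_per_target = (
    toy.assign(missing=toy["x"].isna())
       .groupby(["xT", "yT"])["missing"]
       .sum()
)
print(missing_per_target)
```

A target whose samples are all missing would have no group mean to fall back on, so a count like this is a quick way to spot such cases early.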

To handle missing values, I clustered the gaze points within each target group and used the cluster’s mean to impute the missing data.

Step-by-Step Methodology

Step 1: Clustering Gaze Points by Target Group

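The first step was to process each group of 1000 gaze points independently. Gaze points were clustered by target: all samples that share the same target coordinates (xT, yT) form one cluster, and the mean x and y of each cluster is computed.

# Step 1: Handle missing values by imputation
# Group by target points (xT, yT) and calculate group means for x and y
# (mean() skips NaN by default, so only observed samples contribute)
group_means = gaze_data.groupby(['xT', 'yT'])[['x','y']].mean()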
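To make the shape of group_means concrete, here is a small sketch on invented data (the frame toy and its values are assumptions, not part of the real dataset). It shows that the result is indexed by (xT, yT) pairs and that missing samples do not affect the mean.

```python
import numpy as np
import pandas as pd

# Toy stand-in for gaze_data; values are invented for illustration.
toy = pd.DataFrame({
    "xT": [0.0, 0.0, 0.0, 5.0, 5.0],
    "yT": [1.0, 1.0, 1.0, 2.0, 2.0],
    "x":  [0.2, np.nan, 0.4, 5.0, 5.5],
    "y":  [1.0, np.nan, 1.2, 2.0, 2.5],
})

group_means = toy.groupby(["xT", "yT"])[["x", "y"]].mean()

# mean() skips NaN, so the (0.0, 1.0) group averages only 0.2 and 0.4.
mean_x_first_target = group_means.loc[(0.0, 1.0), "x"]
print(mean_x_first_target)

# The index is a MultiIndex of (xT, yT) pairs, so a tuple can be looked up directly.
has_second_target = (5.0, 2.0) in group_means.index
print(has_second_target)
```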

Step 2: Create a Function to Impute Cluster with Group Means

Next, I created a function that replaces any NaN with its group's mean. The impute_missing_values function fills missing values (NaN) in the dataset's x and y columns. These columns hold the gaze points, which may occasionally be missing due to data collection issues. Instead of dropping these rows or using a global mean, the function imputes each missing value with the mean of its own target group.

import pandas as pd

# Define a function to impute missing values in x and y using group means
def impute_missing_values(df, group_means):
    """Fill NaNs in x and y with the mean of the row's (xT, yT) group.

    Modifies df in place and also returns it.
    """
    for idx, row in df.iterrows():
        if pd.isna(row['x']) or pd.isna(row['y']):
            target_group = (row['xT'], row['yT'])
            # Rows whose target has no entry in group_means are left as they are
            if target_group in group_means.index:
                # Impute with the group mean
                if pd.isna(row['x']):
                    df.at[idx, 'x'] = group_means.loc[target_group, 'x']
                if pd.isna(row['y']):
                    df.at[idx, 'y'] = group_means.loc[target_group, 'y']
    return df
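On a dataset of this size, a row-by-row loop can be slow. As an alternative sketch (not the function used in this post), the same idea can be written with groupby(...).transform("mean") and fillna. The function name impute_with_group_means, its keys and cols parameters, and the toy frame are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_with_group_means(df, keys=("xT", "yT"), cols=("x", "y")):
    """Return a copy of df with NaNs in cols replaced by the per-group mean."""
    out = df.copy()
    cols = list(cols)
    # transform("mean") returns a frame aligned row by row with out, where
    # each row holds the mean of its own group (NaNs are skipped).
    row_means = out.groupby(list(keys))[cols].transform("mean")
    # fillna only touches cells that are NaN; observed samples stay as they are.
    out[cols] = out[cols].fillna(row_means)
    return out

# Toy stand-in for gaze_data; values are invented for illustration.
toy = pd.DataFrame({
    "xT": [0.0, 0.0, 0.0, 5.0, 5.0],
    "yT": [1.0, 1.0, 1.0, 2.0, 2.0],
    "x":  [1.0, np.nan, 3.0, 10.0, np.nan],
    "y":  [2.0, np.nan, 4.0, np.nan, 20.0],
})

filled = impute_with_group_means(toy)
print(filled)
```

Unlike the loop version, this sketch returns a new frame and leaves its input untouched. If every sample in a group is missing, the group mean is itself NaN and the gap stays unfilled.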

Step 3: Check Before and After Imputation

To evaluate the imputation, I print the number of NaN values per column with isna().sum() for both the original and the imputed datasets. This confirms that the gaps in x and y have been filled.

# Make a copy of the original data and impute missing values
imputed_data = gaze_data.copy()
imputed_data = impute_missing_values(imputed_data, group_means)
print(gaze_data.isna().sum())
print(imputed_data.isna().sum())

Output:

n         0
x      1338
y      1338
val       0
xT        0
yT        0
dP     1338
lab       0
dtype: int64
n         0
x         0
y         0
val       0
xT        0
yT        0
dP     1338
lab       0
dtype: int64

As the second block of the output shows, the imputed data has 0 missing values in the x and y columns, down from 1338 each. Since I don't intend to use the dP column, I have left its NaNs unimputed.
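Counting NaNs shows that the gaps are gone, but not that the observed samples were left alone. A few extra checks could look like the sketch below. The frames original and imputed are invented stand-ins, and the imputation here uses transform("mean") rather than the function defined in Step 2.

```python
import numpy as np
import pandas as pd

# Toy stand-in for gaze_data; values are invented for illustration.
original = pd.DataFrame({
    "xT": [0.0, 0.0, 0.0, 5.0, 5.0, 5.0],
    "yT": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    "x":  [0.1, np.nan, -0.1, 5.2, 4.8, np.nan],
    "y":  [1.1, np.nan, 0.9, 2.1, 1.9, np.nan],
})

# Impute with per-group means (assumed equivalent to the approach in this post).
imputed = original.copy()
means = original.groupby(["xT", "yT"])[["x", "y"]].transform("mean")
imputed[["x", "y"]] = imputed[["x", "y"]].fillna(means)

was_missing = original[["x", "y"]].isna()

# Check 1: no gaps remain in x or y.
no_gaps_left = not imputed[["x", "y"]].isna().any().any()

# Check 2: blanking out the imputed cells again must give back the original,
# which means every observed sample is unchanged.
observed_unchanged = (
    imputed[["x", "y"]].where(~was_missing).equals(original[["x", "y"]])
)

# Check 3: how many cells were filled in each column.
filled_counts = was_missing.sum()

print(no_gaps_left, observed_unchanged)
print(filled_counts)
```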

Key Takeaways

Target-Based Clustering: Grouping gaze points by target preserved the spatial context of each target's samples.
Cluster-Aware Imputation: Imputing missing values with cluster means kept each new value close to the observed gaze points of its own group.
Enhanced Accuracy: Compared with a single global mean, this method keeps imputed points near their own target and so avoids the bias a global mean would introduce, although any mean imputation still reduces the variance within each group.
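The difference between a global mean and a group mean is easy to see on a tiny example. The sketch below uses invented values with two targets at x = -10 and x = 10 and one missing sample in the first group. It is an illustration only, not a result from the real dataset.

```python
import numpy as np
import pandas as pd

# Two targets far apart; one sample for the first target is missing.
toy = pd.DataFrame({
    "xT": [-10.0, -10.0, -10.0, 10.0, 10.0, 10.0],
    "x":  [-10.5, np.nan, -9.5, 9.0, 11.0, 10.0],
})

# Global imputation: one mean over every observed sample.
global_fill = toy["x"].fillna(toy["x"].mean())

# Group imputation: each gap gets the mean of its own target group.
group_fill = toy["x"].fillna(toy.groupby("xT")["x"].transform("mean"))

comparison = pd.DataFrame({
    "xT": toy["xT"],
    "global_mean_fill": global_fill,
    "group_mean_fill": group_fill,
})
# Show only the row that was originally missing.
print(comparison[toy["x"].isna()])
```

Here the global mean of the observed samples is 2.0, so the gap would be filled 12 units away from its target, while the group mean fills it with -10.0.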

Challenges and Future Directions

While cluster mean imputation worked well for this dataset, future improvements could include:

Dynamic Clustering: Adjusting the number of clusters dynamically based on the density of gaze points.
Temporal Modeling: Integrating temporal information to better capture gaze trajectories over time.
Validation of Real-World Data: Comparing the results with known ground truth in experimental settings.
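One way to approach the validation and temporal ideas above is to hide samples whose values are known, impute them, and measure the error. The sketch below does this on synthetic data only. The noise level, the 20% masking rate, the assumption that row order equals time order, and the helper rmse are all choices made for illustration, and it compares group-mean imputation with linear interpolation inside each target group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic gaze: 3 targets, 50 samples each, gaze = target + small noise.
# Row order is assumed to be time order.
targets = np.repeat([-5.0, 0.0, 5.0], 50)
truth = pd.DataFrame({
    "xT": targets,
    "x": targets + rng.normal(0.0, 0.3, size=targets.size),
})

# Hide about 20% of the samples; their true values serve as ground truth.
mask = rng.random(targets.size) < 0.2
observed = truth.copy()
observed.loc[mask, "x"] = np.nan

# Candidate 1: fill each gap with the mean of its target group.
group_fill = observed["x"].fillna(
    observed.groupby("xT")["x"].transform("mean")
)

# Candidate 2: linear interpolation over time within each target group.
interp_fill = observed.groupby("xT")["x"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

def rmse(pred, true):
    """Root-mean-square error between two aligned series."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))

rmse_group = rmse(group_fill[mask], truth.loc[mask, "x"])
rmse_interp = rmse(interp_fill[mask], truth.loc[mask, "x"])
print(f"group-mean RMSE:    {rmse_group:.3f}")
print(f"interpolation RMSE: {rmse_interp:.3f}")
```

Which method scores better depends on how the gaze actually moves. The noise in this sketch has no temporal structure, so the comparison says nothing about real recordings.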

References

[1] H. Griffith, D. Lohr, E. Abdulin, and O. Komogortsev, “GazeBase, a large-scale, multi-stimulus, longitudinal eye movement dataset,” Sci Data, vol. 8, no. 1, p. 184, Jul. 2021, doi: 10.1038/s41597-021-00959-y.
