1. Participant Screening

Ensure that participants with normal or corrected-to-normal vision are recruited. Because participants will be asked to wear a mobile eye-tracker, they can wear contact lenses but not regular eyeglasses.

2. Preparation for the Experiment

Eye-tracking devices
Use any mobile eye-tracker capable of capturing eye movement in real-world environments.
NOTE: The mobile eye-trackers used here were two Tobii Pro Glasses 2 (see Table of Materials). In addition to specialized cameras that can track eye movements, the glasses are also equipped with an HD scene camera and a microphone so that the gaze can be visualized in the context of the user's visual field. These glasses capture gaze data 50 times per second. Other researchers have used ASL Mobile Eye [5], SMI [6], or Pupil-labs [7], all of which provide video streams from the scene camera and eye-tracking coordinates at varying sampling rates (30–120 Hz). The procedure below might vary slightly with other eye-tracking devices.

Fiducial markers
The two steps below (i.e., temporal and spatial alignments) require the use of fiducial markers. There are several computer vision libraries that provide researchers with these markers and algorithms to detect them on an image or video feed. The protocol described uses the Chilitag library [8].

Temporal alignment
Because the eye-tracking data are recorded on two separate units, ensure that the data are properly synchronized (Figure 1). Two main methods can be used. This manuscript only covers the first method, because server synchronization works differently with each brand of mobile eye-tracker.
Briefly show a fiducial marker on a computer screen to mark the beginning and the end of a session. This is similar to a visual "hand clap" (Figure 2).
Alternatively, use a server to synchronize the clocks of the two data collection units. This method is slightly more accurate and recommended if a higher temporal accuracy is required.
Spatial alignment
To determine whether two participants are looking at the same place at the same time, map their gazes to a common plane. This plane can be a picture of the experimental setting (see the left side of Figure 3). Carefully design this image before the experiment.
Size of the fiducial markers: The general size of the fiducial markers depends on the algorithm used to detect them from the eye-tracking video. Surfaces close to the participants can have smaller fiducial markers, while surfaces farther away need larger ones, so that the markers look similar in size from the participants' perspective. Try different sizes beforehand to make sure that they can be detected from the eye-tracking video.
Number of fiducial markers: To make the process of mapping gaze points onto a common plane successful, make sure that several fiducial markers are visible from the participants' point of view at any given time.
Location of the fiducial markers: Frame relevant areas of interest with strips of fiducial markers (e.g., see the laptop screen in Figure 3).
Finally, run pilots to test the synchronization procedure and determine the optimal location, size, and number of fiducial markers. Eye-tracking videos can be processed through a computer vision algorithm to see if the fiducial markers are reliably detected.

3. Running the experiment

Instructions
Instruct participants to put on the eye-tracking glasses as they would a normal pair of glasses. Based on the participants' distinct facial features, nose pieces of different heights may need to be used to preserve data quality.
After turning on the eye-tracker, have the participants clip the recording unit to themselves to allow for natural body movement.

Calibration
Instruct the participants to look at the center of the calibration marker provided by Tobii while the calibration function of the software is enabled. Once the calibration is complete, recording can be started from within the software.
Instruct participants not to move the mobile eye-trackers after calibration. If they do, the data are likely to be inaccurate and the calibration procedure will need to be performed again.

Data monitoring
Monitor the data collection process during the study and ensure that the eye-tracking data are being collected properly. Most mobile eye-trackers can provide a live stream on a separate device (e.g., a tablet) for this purpose.

Data export
After the recording session is complete, instruct the participant to remove the eye-tracking glasses and the data collection unit. Turn off the unit.
Extract the data using a separate software package, Tobii Pro Lab, by removing the SD card from the data collection unit and importing the session data. Tobii Pro Lab can be used to replay the video, create visualizations, and export the eye-tracking data as comma-separated (.csv) or tab-separated (.tsv) files.

4. Preprocessing the dual eye-tracking data

Sanity checking eye-tracking data
Check the eye-tracking data visually after data collection. It is not uncommon for some participants to have missing data. For example, some particular eye physiology can pose problems to eye-tracking algorithms, the glasses might shift during the experiment, or the data collection software might crash.
Use descriptive statistics to check how much data were lost during each session and exclude sessions that have significant amounts of missing or noisy data.

Temporal alignment
Trim the data from each mobile eye-tracker to only include interactions between the participants. This can be achieved by using the method described above (i.e., presenting two special fiducial markers to participants at the start and the end of the session). These fiducial markers can then be detected from the eye-tracking video to trim the datasets.
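The trimming and missing-data checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by the authors: the function names are hypothetical, the timestamps of the start and end markers are assumed to have already been obtained from the scene video, and missing gaze samples are assumed to be encoded as NaN in the exported data.

```python
import numpy as np

def trim_to_session(timestamps, gaze_xy, start_time, end_time):
    """Keep only the samples recorded between the two sync markers.

    timestamps: 1-D sequence of sample times in seconds (assumed layout).
    gaze_xy:    N x 2 sequence of gaze coordinates, NaN where data are missing.
    start_time, end_time: times at which the start and end fiducial markers
    were seen in the scene video (assumed to be known already).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    keep = (timestamps >= start_time) & (timestamps <= end_time)
    # Re-zero the clock so both recordings share the same time origin.
    return timestamps[keep] - start_time, gaze_xy[keep]

def missing_fraction(gaze_xy):
    """Fraction of samples in which at least one gaze coordinate is NaN."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    if len(gaze_xy) == 0:
        return 1.0
    return float(np.isnan(gaze_xy).any(axis=1).mean())

# Hypothetical six-sample recording at 50 Hz with one lost sample.
ts = [0.00, 0.02, 0.04, 0.06, 0.08, 0.10]
gaze = [[1, 1], [2, 2], [np.nan, np.nan], [4, 4], [5, 5], [6, 6]]
t_trim, g_trim = trim_to_session(ts, gaze, start_time=0.02, end_time=0.08)
print(len(t_trim), missing_fraction(g_trim))
```

Applying the same two functions to each participant's recording yields two trimmed streams that start at time zero, along with a per-session loss figure that can be compared against whatever exclusion threshold the study adopts.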
Spatial alignment
NOTE: To detect whether two participants are looking at the same place at the same time, it is necessary to remap the participants' gaze onto a common plane (i.e., an image of the experimental setting). A computational method for achieving this goal is a homography (i.e., a perspective transformation of a plane). From a technical perspective, two images of the same planar surface in space are related by a homography matrix. Based on a common set of points, this matrix can be used to infer the location of additional points between two planes. In Figure 3, for example, if a computer vision algorithm knows where the fiducial markers are on the handout, it can remap the gaze of the participant onto the common plane on the left side. The white lines connect the two sets of points shared by the video feed of each participant and the scene, which are then used for building the homography to remap the green and blue dots on the left side.
Use, for example, the Python version of OpenCV (or any other suitable library in your language of choice) to compute the homography matrix from the fiducial markers and then to remap the eye-tracking data to the scene of the experimental setting. OpenCV provides two useful functions: findHomography() to get the homography matrix, and perspectiveTransform() to transform points from one perspective to the other.
Call findHomography() with two arguments: the X,Y coordinates of the source points (i.e., the fiducial markers detected from the participants' scene video, shown on the right in Figure 3) and the corresponding destination points (i.e., the same fiducial markers detected on the scene image, shown on the left in Figure 3).
Feed the resulting homography matrix into the perspectiveTransform() function, along with a new point that needs to be mapped from the source image to the destination image (e.g., the eye-tracking data shown as a blue/green dot on the right side of Figure 3).
The perspectiveTransform() function returns the new coordinates of the same point on the scene image (i.e., the blue/green dots shown on the left side of Figure 3).
NOTE: For more information, the OpenCV official documentation provides sample code and examples to implement the homography: docs.opencv.org/master/d1/de0/tutorial_py_feature_homography.html.

Sanity checking the homography
Complete section 4.3 for the entire session, and perform a homography on each frame of the mobile eye-tracking video to check the quality of the homography. While there are no automated ways to estimate the accuracy of the resulting eye-tracking data, videos like the one shown in Figure 4 should be used to manually sanity check each session.
If the quality is lower than expected, consider additional parameters to improve the results of the homography:
Number of fiducial markers detected: Only perform the homography if enough fiducial markers can be detected from the video stream. This number can be determined by examining the video produced above.
Location of the fiducial markers: If different markers are at different depths and orientations, the quality of the homography usually increases when the markers closest to the gaze coordinates are selected, provided that there are enough markers to build a robust homography.
Orientation of the fiducial markers: Combining fiducial markers that have different orientations (e.g., horizontal and vertical) will produce inaccurate homographies. It is recommended to first detect which plane or area of interest (AOI) the participant is looking at (e.g., the computer screen, the cheat sheet, the table; see Figure 3) and then use the fiducial markers on this plane for the homography.
Quality of the video stream: Sudden head movements can blur video frames and make the data unusable, because fiducial markers cannot be reliably detected (Figure 4).
The methodology of this paper is not appropriate for experiments that involve a lot of sudden head movements.

5. Analyzing the dual eye-tracking data

Missing data
To make sure that the data were properly remapped onto the reference image, produce visualization graphs (e.g., Figure 5, Figure 6) and descriptive statistics to check how much data are missing.

Cross-recurrence graphs
Use cross-recurrence graphs [9] to represent visual synchronization between two participants (Figure 6), where the X-axis represents time for the first participant, and the Y-axis represents time for the second participant. Black squares indicate that participants are looking at the same area, a black diagonal line indicates that the two subjects are looking at the same thing at exactly the same time, and black squares off the diagonal indicate that the two subjects are looking at the same thing with a time lag. Finally, differentiating between missing data (white squares) and existing data with no JVA (gray squares) helps identify problematic sessions. This provides researchers with a visual sanity check.

Computing JVA
After filtering for missing data, compute a metric for JVA by counting the number of times that the participants' gazes fall within the same radius in the scene (defined below) in a ±2 s time window. Divide this number by the number of valid data points that can be used to compute JVA. The result of the division represents the percentage of time that two subjects were jointly looking at the same place. This last step is necessary to avoid inflating the scores of groups with more data after the homography.
NOTE: Two parameters need to be set before JVA can be computed: the distance between two gaze points and the time window between them (Figure 7).
1) Time window: An early foundational study [10] used a single eye-tracker to measure JVA between a listener and a speaker.
The researchers asked a first set of participants ("speakers") to talk about a television show whose characters were displayed in front of them. A second set of participants ("listeners") then watched the same show while listening to an audio recording of the speakers. The eye movements of the speakers and listeners were compared, and it was found that a listener's eye movements closely matched a speaker's eye movements with a delay of 2 s. In subsequent work [11], researchers analyzed live dialogues and found that a delay of 3 s best captured moments of JVA. Overall, it is common to look for JVA in a ±2 s or ±3 s time window, depending on the experimental task. Because each task is unique and might exhibit different time lags, it is also suggested to explore how different time lags affect the results of a given experiment.
2) Distance between gazes: There is no empirically defined distance between two gazes for them to count as JVA. This distance depends on the research questions defined by the researchers, which should inform the size of the targets of interest. In the example seen in Figure 7, a radius of 100 pixels on the scene image (blue/green circles) was chosen for the analysis because it is sufficient to capture when participants are looking at the robot in the maze, as well as at similar user interface elements on the computer screen, which are the two main areas of interest for this experimental task.
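The cross-recurrence matrix and the JVA metric described in this section can be sketched as follows. This is an illustration under several assumptions that are not taken from the protocol: both gaze streams have already been remapped onto the scene image and resampled to the same timestamps, missing samples are encoded as NaN, and a sample counts as "valid" when the first participant's gaze is available and the second participant has at least one available sample inside the time window. The function names and default values are hypothetical.

```python
import numpy as np

def cross_recurrence(gaze_a, gaze_b, radius_px=100.0):
    """Entry [i, j] is 1 if A at time i and B at time j are within the radius,
    0 if they are not, and NaN if either sample is missing.
    The matrix is N x N, so downsample long sessions before calling this."""
    a = np.asarray(gaze_a, dtype=float)
    b = np.asarray(gaze_b, dtype=float)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    rec = (dist <= radius_px).astype(float)
    rec[np.isnan(dist)] = np.nan
    return rec

def joint_visual_attention(gaze_a, gaze_b, sample_rate_hz=50.0,
                           window_s=2.0, radius_px=100.0):
    """Share of valid samples of A for which B looked within radius_px of
    the same location at some point within +/- window_s."""
    a = np.asarray(gaze_a, dtype=float)
    b = np.asarray(gaze_b, dtype=float)
    lag = int(round(window_s * sample_rate_hz))
    joint, valid = 0, 0
    for i in range(len(a)):
        if np.isnan(a[i]).any():
            continue  # A has no gaze here: not a valid data point
        window = b[max(0, i - lag):min(len(b), i + lag + 1)]
        window = window[~np.isnan(window).any(axis=1)]
        if len(window) == 0:
            continue  # B has no gaze in the window: not a valid data point
        valid += 1
        if (np.linalg.norm(window - a[i], axis=1) <= radius_px).any():
            joint += 1
    return joint / valid if valid else float("nan")

# Hypothetical four-sample streams at 1 Hz with a +/- 1 s window.
gaze_a = [[0, 0], [0, 0], [500, 500], [np.nan, np.nan]]
gaze_b = [[1000, 1000], [10, 0], [1000, 1000], [1000, 1000]]
rec = cross_recurrence(gaze_a, gaze_b)
jva = joint_visual_attention(gaze_a, gaze_b, sample_rate_hz=1.0, window_s=1.0)
print(rec, jva)
```

Re-running joint_visual_attention with different values of window_s and radius_px is one simple way to explore how sensitive the results are to the two parameters discussed above.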