Gaze Control for Face Learning and Recognition by Humans and Machines

 

 

 

 

John M. Henderson1,4, Richard Falk1,4, Silviu Minut2, Fred C. Dyer3,4, and Sridhar Mahadevan2,4

 

1Department of Psychology

2Department of Computer Science

3Department of Zoology

4Cognitive Science Program

 

 

 

 

 

 

 

 

 

 

 

Draft of chapter to appear in: T. Shipley and P. Kellman (Eds.), From Fragments to Objects: Segmentation and Grouping in Vision.

 

8/26/00

 

 

 

 

Please address correspondence to:

John M. Henderson

129 Psychology Research Building

Michigan State University

East Lansing, MI 48824-1117

Email: john@eyelab.msu.edu


 

Abstract

 

In this chapter we describe an ongoing project designed to investigate gaze control in face perception, a problem of central importance in both human and machine vision. The project uses converging evidence from behavioral studies of human observers and computational studies in machine vision. The research is guided by a formal framework for understanding gaze control based on Markov decision processes (MDPs). Behavioral data from human observers provide new insight into gaze control in a complex task, and are used to motivate an artificial gaze control system using the Markov approach. Furthermore, the efficacy of a Markov approach to gaze control for face recognition in machine vision is tested. The general goal of the project is to uncover key principles of gaze control that cut across the specific implementation of the system (biological or machine).

 

Introduction

 

The Problem of Gaze Control

 

            The majority of work in human and machine vision to date has made the simplifying assumption that visual acuity during stimulus input is equally good across the image to be processed. A property of human perception, though, is that high acuity vision is restricted to a small (2º) foveal region surrounding fixation, with acuity dropping off precipitously from the fixation point (Anstis, 1974; Riggs, 1965). The human visual system takes advantage of this high-acuity region by rapidly reorienting the eyes via very fast (saccadic) eye movements (Buswell, 1935; Henderson & Hollingworth, 1998, 1999; Rayner, 1998; Yarbus, 1967). Recent work in computer vision and robotics (Kuniyoshi et al., 1995; Brooks et al., 1998) suggests that outfitting artificial vision systems with a central high-acuity region can similarly provide important computational advantages in computer vision. However, foveated vision systems require that the direction of gaze be controlled so that the foveal region is appropriately directed within the image based on the properties of the stimulus and the goals of the agent (human or machine), a complex real-time learning and control problem. The interdisciplinary project described in this chapter is an attempt to integrate the study of human and machine gaze control during face learning and recognition, with the ultimate goal of shedding light on the underlying principles and properties of gaze control within the important context of face perception.

 

            Gaze Control in Human Vision. The human visual system takes advantage of the high resolving power of the fovea by reorienting the fixation point around the viewed scene an average of three times each second via saccadic eye movements. Saccades are ballistic, very fast sweeps (velocities of up to 900º/s; Carpenter, 1988) of gaze position across the scene during which visual information acquisition is severely limited (Matin, 1974; Volkmann, 1986). Fixations are brief epochs (averaging about 300 ms; Henderson & Hollingworth, 1998) in which the fovea is directed at a point of interest, gaze position remains relatively still, and pattern information is acquired from the scene. Given the importance of foveation in human vision, the control of fixation placement over time (gaze control) is a critical sequential decision-making problem. Furthermore, gaze control appears to have important consequences for other cognitive processes beyond the timely and efficient acquisition of visual information. For example, Ballard and colleagues have suggested that fixation is necessary for enabling computations that require the binding of cognitive and motor processes to external objects (Ballard, 1996; Ballard et al., in press; also Henderson, 1996; Milner & Goodale, 1995). An important issue in the study of vision and visual cognition therefore is the nature of the process that controls sequential decision-making for saccadic eye movements during dynamic visual and cognitive tasks.

 

      Gaze Control in Computer Vision. Most classical computer vision methods process images at constant resolution. In contrast, in human vision the magnitude and complexity of the input is reduced by a decrease in the resolution of the visual field from the fovea to the periphery. This decrease in resolution across the retina leads to a loss of information, but the human visual system compensates for the loss by employing an efficient gaze control mechanism, directing the fovea to different points in the image to gather more detailed information as it is needed. Thus, rather than analyzing an enormous amount of detailed visual information at once, a computationally expensive proposition, the brain processes detailed information sequentially, turning vision in part into a sequential decision-making process. The importance and potential of foveated vision have only recently been recognized in computer vision, and work in this direction is at a very early stage; currently very few artificial foveated vision systems exist (Kuniyoshi et al., 1995; van der Spiegel et al., 1989; Bandera et al., 1996). As these foveated vision systems develop, algorithms will be required to control the timely placement of the artificial fovea over the external scene.

      A central component of a successful general theory of gaze control will be an account of how the perceiving agent, whether human or machine, can decide where to fixate at each point in time. This problem of sequential decision-making for gaze control is formidable. The world offers a dizzying array of stimuli to which the agent could direct the fovea. Compounding the problem, the consequences of particular alternative actions (foveating Feature A rather than Feature B, C, D... N) may not become apparent until a sequence of related fixations has been executed, and yet the perceiving agent may need to estimate the likely payoff of a particular fixation (or sequence of fixations) in advance. Furthermore, for a perceiving agent, decisions about fixation position need to be made in quick succession, leaving little time to sort through the space of all possible fixation placements and associated outcomes. Finally, a well-designed agent should be able to modify its fixation behavior through learning, raising the problems of how to assign credit to the correct decisions and how to store all of this information in memory.

 

The Problem of Face Perception

 

      Face Perception in Human Vision. Faces are arguably the most important and salient visual stimulus a human ever encounters. Faces are central in human social interaction, providing critical information about the age, gender, emotional state, intention, and identity of another. There is substantial evidence that the perception of faces by human observers may be “special” in at least two ways. First, the computational processes responsible for face perception appear to be qualitatively different from those involved in the recognition of other kinds of objects or complex scenes. This view is supported by behavioral data showing that faces are preferentially attended by young infants when compared to similarly complex visual stimuli or even scrambled faces (Bruce, 1988). When placed in complex scenes, faces preferentially draw the attention of adult viewers (Yarbus, 1967). Inverting a face disrupts face recognition to a greater degree than does inverting other types of objects (Yin, 1969). Second, the neural systems that underlie face perception appear to be partially independent of those systems that underlie object perception more generally. For example, single cell recording from neurons in the temporal cortex of monkeys has revealed a population of cells that respond selectively to faces, and in some cases to specific faces (Desimone, 1991). Evidence from human neuropsychology has provided evidence for a double dissociation between object and face recognition, with some patients showing intact face recognition and impaired object recognition (prosopagnosia), and other patients showing the opposite pattern (visual object agnosia; e.g., Moscovitch et al., 1997; Newcombe et al., 1994). Recent functional neuroimaging studies similarly provide evidence that face recognition is supported by a neural module in the fusiform gyrus of the human cortex that is not active during perception of other types of objects (Ishai et al., 1997; Kanwisher et al., 1997; McCarthy et al., 1997), though it has been suggested that this area may be more generally devoted to the analysis of exemplars of a well-learned object class (Diamond & Carey, 1986; Gauthier et al., 1997). Together, these converging sources of evidence strongly support the hypothesis that the human visual system contains specialized neural and computational systems that are devoted to face perception. However, the specific nature of those computations remains very much unresolved.

 

            Face Perception in Computer Vision. Face recognition is a well-studied problem in computer vision, but it is not completely solved. The problem is typically posed in the following way: Suppose we have a database of images of N people. In general, all images are the same size, and there are several images for each person. Given a test image (i.e., an image of a person whose identity is not known), one must detect whether or not the person is in the database and, if so, correctly identify that person. The traditional approach is to treat each image as a (high-dimensional) vector by concatenating the rows of pixels that compose it (e.g., see Pentland et al., 1994). The dimensionality of the vector, K, equals the total number of pixels in the image; each dimension is a measure of the gray-scale level (or color) of a particular pixel. Analyzed in this way, each image in the database defines a point in a K-dimensional Euclidean space. If images of the same person lie closer to each other in this space than they do to images of other people, a clustering algorithm can be used to form decision boundaries. Thus, to classify a test image it suffices to determine the class of its nearest neighbors, that is, the cluster to which that image belongs. Although appealing, this basic scheme is difficult to implement because of the dimensionality involved. An image as small as 60 x 80 pixels yields a vector of dimension 4800. If the database contains a few hundred images, it is difficult if not impossible to find the nearest neighbors in reasonable time.
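To make the scheme concrete, the sketch below (in Python, with toy data) flattens each image into a vector and classifies a test image by its nearest neighbor in pixel space. The image sizes, names, and random "images" are illustrative assumptions only, not the actual database described in this chapter.

```python
# A minimal sketch of the "image as K-dimensional vector" scheme, using a
# 1-nearest-neighbor rule. Data and sizes are illustrative assumptions.
import numpy as np

def image_to_vector(img):
    """Concatenate the rows of a grayscale image into one long vector."""
    return np.asarray(img, dtype=float).ravel()

def nearest_neighbor_identity(test_img, gallery_imgs, gallery_labels):
    """Return the label of the gallery image closest (Euclidean) to the test image."""
    test_vec = image_to_vector(test_img)
    dists = [np.linalg.norm(test_vec - image_to_vector(g)) for g in gallery_imgs]
    return gallery_labels[int(np.argmin(dists))]

# Toy example: three 60 x 80 "images" (random here) belonging to two people.
rng = np.random.default_rng(0)
gallery = [rng.random((60, 80)) for _ in range(3)]
labels = ["person_A", "person_A", "person_B"]
probe = gallery[1] + 0.01 * rng.random((60, 80))   # a slightly perturbed copy
print(nearest_neighbor_identity(probe, gallery, labels))  # -> "person_A"
```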

            A considerable number of techniques for reducing the dimensionality of the problem are known, each with its own strengths and weaknesses. These include principal component analysis (Pentland et al., 1994), neural networks (Mitchell, 1997), and Markov models (Samaria & Young, 1994). From a purely computational point of view, any method is acceptable as long as it produces high classification rates (e.g., see recent work on probabilistic face matching; Moghaddam et al., 1998). Such methods do not have to imitate nature. However, there must be a good reason that nature chose, through evolution, certain algorithms over others. From a theoretical point of view, it is of interest to discover the nature of the algorithms used by the human brain, and to come to understand why those algorithms were selected. Furthermore, the superiority of the algorithms used by the human visual system (whatever they may be) over standard face recognition (and, more generally, computer vision) algorithms is obvious. Of the frameworks used to reduce the dimensionality of the face recognition problem, the one that appears most promising for uncovering the algorithm used by nature, given the characteristics of the human foveal vision system, is a sequential decision-making framework. Our work thus asks the following fundamental question: How important is foveal vision in face recognition?
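As one illustration of such dimensionality reduction, the following sketch applies principal component analysis in the spirit of the eigenface work cited above. The data, image sizes, and number of components are illustrative assumptions, not the parameters of any system described here.

```python
# A hedged sketch of PCA-based dimensionality reduction: project the long
# image vectors onto the top few principal components before matching.
import numpy as np

def fit_pca(X, n_components):
    """X: (n_images, n_pixels). Returns the mean face and top principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data matrix gives the principal components directly.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, Vt[:n_components]

def project(X, mean, components):
    """Map image vectors into the low-dimensional PCA space."""
    return (X - mean) @ components.T

rng = np.random.default_rng(1)
faces = rng.random((100, 60 * 80))           # 100 toy images, 4800 pixels each
mean, comps = fit_pca(faces, n_components=20)
coords = project(faces, mean, comps)          # each face is now a 20-d point
print(coords.shape)                           # (100, 20)
```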

 

An Investigation of Gaze Control in Face Learning and Recognition

 

            We have recently undertaken an initial investigation designed to integrate the study of gaze control for face learning and recognition across humans and artificial agents. This investigation entailed a behavioral study of gaze control with human participants, and the implementation of a face learning and recognition system within a Hidden Markov Model (HMM) framework. In this section we will summarize the work so far and the promise it holds for the future.

 

Behavioral Study: Human Gaze Control during Face Learning and Recognition

 

            In an initial behavioral study (Falk, Hollingworth, Henderson, Mahadevan, & Dyer, 2000), we asked 16 participants to view full-color images of human faces in two phases, a study phase in which the participants were asked to learn the faces of twenty previously unknown undergraduate women, and a recognition phase in which the participants were asked to distinguish these learned faces from distracter faces, drawn from the same pool, that had not been learned. In addition, during the recognition phase, half of the previously learned and half of the new (distracter) faces were presented upright, and half were presented upside-down. The purpose of this latter manipulation was to examine whether differences in gaze control to upright and inverted faces might be related to fixation position or scanning sequence.

            The method of the experiment was as follows: In the Study Phase, 16 participants viewed a series of 20 upright faces for 10 sec each, with the order of faces and conditions randomized within the block. The participants then viewed a series of 10 pictures of naturalistic scenes for 10 sec each, with the order of scenes randomized within the block. This scene viewing phase constituted a period of time over which the learned faces had to be remembered. In the Recognition Phase, the same 16 participants viewed 10 new upright faces, 10 new inverted faces, 10 previously learned upright faces, and 10 previously learned inverted faces. Each face was presented until the participant responded, or for 20 sec maximum. All four types of face stimuli were presented in a single block, with order of stimuli (and hence condition) randomized within that block. Assignment of particular faces to particular learning and recognition conditions (learned or distracter, upright or inverted) was counterbalanced over participants.

The face stimuli used in the study were generated from photographs of undergraduate women with dark hair. The photographs were scanned into a computer and cropped so that each picture included only the face and hair. Hair was also cropped so that its style was relatively uniform across the set of stimuli (See Figure 1 for an example). Faces that included distinctive features such as eyeglasses, jewelry, moles, and so on were not used. Each cropped face was pasted onto a uniform gray background. All images were presented at a resolution of 800 by 600 pixels by 15 bit color (32,768 colors) and appeared photographic in quality. The faces subtended 7.56º horizontally by 10.47º vertically on average at a distance of about 1 meter; thus, they appeared at a natural size for a human face at this viewing distance.

            The viewer's eye position over the course of face learning and recognition was precisely tracked using a Generation 5.5 Stanford Research Institute Dual Purkinje Image Eyetracker (Crane, 1994; Crane & Steele, 1985). Eye position was sampled at 1000 Hz, with spatial resolution better than 10' of arc.

Figure 1. Example scan pattern of a human observer in the behavioral experiment. Dots represent fixations, and lines represent saccades.

Figure 2. Scoring regions on an example face stimulus from the behavioral experiment.

            Summary of Results. Study Phase. Figure 1 presents the scan pattern (the sequence of fixations and saccades) of a single participant viewing a single face during the study phase. The straight lines represent saccades, and the dots represent fixations. This scan pattern is typical: participants tended to focus on the salient features of the faces, including the eyes, the nose, and the mouth.

To quantitatively examine the data, each face was divided into ten scoring regions, consisting of the two eyes, the nose, the mouth, the two ears, the chin, the two cheeks, and the forehead, as shown in Figure 2. All fixations for each participant were then assigned to one of these regions. Figure 3 shows the mean percentage of trials on which each facial region was fixated at least once. As can be seen, both eyes were fixated at least once in over 95% of all trials. The nose, mouth, and forehead were also examined with a high degree of regularity, whereas the cheeks, ears, and chin were rarely fixated over the course of 10 seconds of viewing time. Thus, consistent with the qualitative results of prior face viewing studies, participants generally distributed the majority of their fixations on or near important facial features (e.g., Buswell, 1935; Yarbus, 1967).
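The region-scoring step can be illustrated with a minimal sketch: each fixation coordinate is assigned to the named region whose boundary contains it. The rectangular boundaries below are hypothetical placeholders; the actual scoring regions were defined on each face image as shown in Figure 2.

```python
# A minimal sketch of assigning fixations (x, y) to named face regions.
# The region rectangles here are made-up placeholders for illustration.
def assign_region(x, y, regions):
    """regions: dict mapping name -> (x_min, y_min, x_max, y_max)."""
    for name, (x0, y0, x1, y1) in regions.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return "other"

regions = {"L_Eye": (250, 200, 350, 260), "R_Eye": (450, 200, 550, 260),
           "Nose": (350, 260, 450, 360), "Mouth": (340, 370, 460, 430)}
fixations = [(300, 230), (500, 225), (400, 300)]
print([assign_region(x, y, regions) for (x, y) in fixations])
# -> ['L_Eye', 'R_Eye', 'Nose']
```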

 

                                        Target Region
Source
Region    L_Eye   R_Eye   Nose    Mouth   L_Ear   R_Ear   Chin    L_Chk   R_Chk   F_Head
L_Eye       +       +                               -       -               -
R_Eye               +       -               -               -       -
Nose                -       +       +               -                               -
Mouth               -                                       +                       -
L_Ear               -       -               +
R_Ear       -               -                       +
Chin        -       -                                       +                       -
L_Chk               -
R_Chk       -                                                                       -
F_Head      -               -       -                                               +

Table 1. Deviations of fixation transition frequencies observed in the first-order Markov matrix from the frequencies predicted by the zero-order Markov matrix. Rows indicate the source (currently fixated) region and columns the target (next fixated) region. Positive (+) indicates more observed transitions than predicted; negative (-) indicates fewer observed transitions than predicted.

 

Fixation sequences (an example of which is shown in Figure 1) were analyzed using zero- and first-order Markov transition matrices defined over the pre-defined face regions shown in Figure 2. A Markov analysis quantitatively characterizes the distribution of gaze transitions from one fixation position to the next. The goal of the Markov analysis was to determine whether the location of fixation n, and hence perceptual and cognitive processing during fixation n, influences the spatial placement of fixation n+1. The zero-order matrix captures the probability of fixating each region, and the first-order matrix captures the probability of moving to a given region from another given region. If the location of the current fixation does significantly influence where subsequent fixations are made, then the first-order matrix should deviate from the matrix predicted by the base probabilities represented in the zero-order matrix. The method used for computing the zero- and first-order matrices was modified from that given by Liu (1998).
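The following sketch illustrates the general form of this analysis; it is a simplification of our own (the exact computation followed Liu, 1998), and the toy fixation sequences are illustrative assumptions. Zero-order probabilities and first-order transition counts are estimated from region-labeled fixation sequences, and observed transitions are compared with those expected from the marginal probabilities.

```python
# A simplified sketch of the zero-/first-order Markov analysis of fixation
# sequences. Toy sequences only; real data came from the eyetracker.
import numpy as np

REGIONS = ["L_Eye", "R_Eye", "Nose", "Mouth", "L_Ear",
           "R_Ear", "Chin", "L_Chk", "R_Chk", "F_Head"]
IDX = {r: i for i, r in enumerate(REGIONS)}

def markov_matrices(sequences):
    """Estimate zero-order probabilities and first-order transition counts."""
    n = len(REGIONS)
    zero_counts = np.zeros(n)
    first_counts = np.zeros((n, n))
    for seq in sequences:
        for region in seq:
            zero_counts[IDX[region]] += 1
        for a, b in zip(seq, seq[1:]):
            first_counts[IDX[a], IDX[b]] += 1
    return zero_counts / zero_counts.sum(), first_counts

def deviation_signs(zero_probs, first_counts):
    """+1 / -1 where observed transitions exceed / fall short of expectation."""
    expected = first_counts.sum(axis=1, keepdims=True) * zero_probs
    return np.sign(first_counts - expected)

seqs = [["L_Eye", "R_Eye", "L_Eye", "Nose", "Mouth"],
        ["R_Eye", "R_Eye", "Nose", "Nose", "F_Head"]]
zero_probs, first_counts = markov_matrices(seqs)
print(deviation_signs(zero_probs, first_counts))
```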

Table 1 provides a summary of the Markov matrix analysis. Chi-squared tests of the deviations of observed fixation transitions from those expected on the basis of the marginal (zero-order) matrices showed systematic departures from base probabilities, suggesting a degree of regularity in the fixation sequences. A non-empty cell in the difference matrix in Table 1 indicates a reliable difference between the predicted first-order matrix and the observed first-order matrix as determined by the chi-square analysis. A plus sign means that there were more observed transitions in that cell than would be predicted from the zero-order matrix; a minus sign indicates fewer transitions than would be predicted. As Table 1 demonstrates, in several instances the position of fixation n significantly affected the selection of the position of fixation n+1. For example, notice that along the diagonal there is an almost uniform influence of current fixation position: the tendency to refixate the currently fixated facial region was higher than that predicted by baseline fixation rates. In addition, there were a large number of transitions from one eye to the other, suggesting a greater likelihood of moving to an eye if the other eye was currently fixated. These data provide additional evidence that the eyes are particularly important or salient facial features for gaze targeting during face learning.

 

            Recognition Phase. In the Recognition Phase, participants viewed the 20 previously learned faces and 20 new faces drawn from the same pool. Half of the new and old faces were presented upright, and half were presented upside-down. Participants viewed each face freely and indicated, via button press, whether the current face was new or was one of the faces viewed during the learning phase. When the participant pressed the response button, the current face was removed from the computer screen and, following a brief calibration screen, the next face was displayed.

Overall accuracy in the Recognition Phase, collapsed over new and old faces, was about 73% correct. As expected from the face recognition literature, accuracy was influenced by the orientation of the face in the Recognition Phase, with accuracy of about 79% in the upright condition and 66% in the inverted condition. Average viewing time during the Recognition Phase was about 2.3 seconds, with participants taking about 560 ms longer to respond to new than to learned faces, and about 144 ms longer to respond to inverted than to upright faces.

Figure 4. Proportion of total time spent on the major facial features in the Study Phase and the four Recognition conditions.

            Interestingly, although viewing time in the Recognition Phase was much shorter than in the Study Phase (about 2.3 sec versus 10 sec), the distribution of fixations over the faces in the two phases was remarkably similar (Falk, Hollingworth, Henderson, Mahadevan, & Dyer, 2000). This similarity can be seen in Figure 4, which shows the proportion of total time spent on the major features defined in Figure 2. As can be seen, the same features received the same proportion of total fixation time in the Study and Recognition Phases. Also clear in Figure 4 is the fact that inversion in the Recognition Phase had little influence on the distribution of fixation time over the faces. A similar pattern is observed when the proportion of discrete fixations, rather than the proportion of fixation time, is used as the dependent measure. These data suggest that, at least insofar as the overt selection of facial features for visual analysis is concerned, the face inversion effect is not due to a transition from more wholistic face processing to more local, feature-based processing.

            We can draw four main conclusions from this behavioral study. First, selection of facial features through overt orienting of the eyes is observed during both face learning and recognition. Although we don't have evidence here that gaze control is strictly necessary for face learning and recognition, it is striking that when allowed to view faces freely, human observers moved their eyes over the face, making clear choices about which features to orient to and which to ignore. Second, as shown by the results of the Markov analysis, the selection of a fixation site during learning is driven, in part, by the specific feature that is currently under fixation. Thus, there is some sequential dependency to gaze control decisions. Third, the facial features selected for fixation during recognition are very similar to those selected for fixation during learning. It is tempting to conclude that feature processing during learning influences this selection during recognition, though we do not yet have direct evidence for this proposition. Fourth, the facial features selected for fixation during recognition of an upright face are very similar to those selected for fixation during recognition of an inverted face. This finding is intriguing, because one might have expected that recognition of upright faces, hypothesized to be supported by wholistic pattern processing, would lead to less feature-specific analysis than would inverted faces, which are hypothesized to be supported by feature-based analysis (Farah, Tanaka, & Drain, 1995; Tanaka & Farah, 1993).

 

Formal Framework: A Probabilistic Model of Sequential Decision Making

            We have seen in the previous section that human gaze control is regular and efficient. It is deeply puzzling how to endow machines with the apparent fluidity and accuracy of human gaze control. What is needed is a formal framework that both accounts for patterns of human gaze and allows the development of algorithms for gaze control of an artificial visual system. The long-term goal of our project is to investigate whether Markov decision processes (MDPs) can be the basis of such a framework.

MDPs are a well-developed formal framework for studying sequential decision making, originally proposed in the operations research literature (Bellman, 1957; Howard, 1960; Puterman, 1994). In recent years, MDPs have become a unifying formalism for studying a range of problems in artificial intelligence and robotics, from planning in uncertain environments (Kaelbling et al., 1998) to learning from delayed reward (Mahadevan & Connell, 1992). The MDP framework comprises a range of models: Markov chains, in which states are observable and there is no choice of action; hidden Markov models (HMMs), in which states are not observable and there is no choice of action; Markov decision processes (MDPs) proper, in which states are observable and the agent chooses actions; and, most complex, partially observable MDPs (POMDPs), in which states are not observable and the agent has a choice of action in any state. In our project to date, we have focused first on the HMM case, since the underlying states during the process of face recognition are not themselves observable, but have to be inferred from fixation points during recognition. Our goal is to use the simpler HMM case as a jumping-off point to the more complex POMDP problem of allowing the agent to choose what the next fixation point should be. One advantage of starting with the HMM case is that the principal algorithms for learning an HMM model (the well-known Baum-Welch, forward-backward, and Viterbi algorithms; Rabiner, 1989) all extend naturally to the POMDP case.
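As a concrete illustration of the machinery involved, the sketch below implements the HMM forward recursion, the likelihood computation on which Baum-Welch training and recognition rest. It uses discrete observation symbols and toy parameters; our implementation used continuous observation densities, so the sketch is illustrative only.

```python
# A minimal sketch of the HMM forward algorithm with per-step scaling.
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """
    pi: (n_states,) initial state probabilities
    A:  (n_states, n_states) transitions, A[i, j] = P(next state j | state i)
    B:  (n_states, n_symbols) emissions, B[i, k] = P(symbol k | state i)
    obs: sequence of symbol indices
    Returns log P(obs | model).
    """
    alpha = pi * B[:, obs[0]]
    log_like = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_like += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_like

# Toy 2-state, 3-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_log_likelihood(pi, A, B, [0, 1, 2, 2]))
```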

 

Computational Study: Gaze Control for Face Recognition using Hidden Markov Models

In a recent study we tested the potential of the HMM formalism in face recognition (Minut, Mahadevan, Henderson, & Dyer, 2000). HMMs are in one important respect not a good model of human gaze control, since humans clearly do not follow a fixed, predetermined sequence of action when scanning faces.  However, HMMs are appealing because they do not require us to specify the nature of the underlying states of the system (which are not known for human gaze control), only the actions available to it (shifts of gaze), and the (probabilistic) effects of the actions on the information available to the system.  Furthermore, it should eventually be possible to build on the HMM framework to introduce more flexible selection of action, hence turning it into a POMDP.

            We used a database of face images of 27 women and 19 men, with 6 images per person; example images are shown in Figure 5.

 

Figure 5. Face stimuli used in the HMM study.

 

Each image was 512 x 512 pixels. We defined 10 regions of interest for each face, loosely based on the human fixation patterns we observed in the behavioral study, and we built a left-to-right HMM corresponding to these regions (see Figure 6).

Figure 6. Regions of interest for the HMM study.

            The goal of this experiment was to build an HMM for each class (person) in the database, using input from a foveated vision system. To this end we produced for each image an observation sequence consisting of 30 observation vectors, obtained by foveating the same image multiple times: the system fixated 3 times in each of the 10 regions and then moved to the next region in sequence. This order was of course arbitrarily imposed by us and does not agree with the patterns that human subjects seem to follow. The HMM was considered built when all of its parameters had been learned (i.e., all the state transition probabilities and all the observation density functions). The parameters were learned (updated) incrementally, as new observation sequences were processed. The HMMs for different classes had the same states, but different parameters. Each HMM was built from 5 images of its class, with the 6th image reserved for testing recognition performance. Thus, recognition was tested using images that had not been seen during learning.
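The logic of the per-person HMM scheme can be sketched as follows: one model is trained on the observation sequences from each person's training images, and a test sequence is assigned to the model giving it the highest likelihood. We assume here, purely for illustration, the third-party hmmlearn package, Gaussian observation densities, and toy data; the actual system used its own left-to-right implementation with the foveated observation vectors described below.

```python
# A hedged sketch of per-person HMM training and maximum-likelihood recognition.
# hmmlearn and the toy data are assumptions for illustration only.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_person_model(obs_sequences, n_states=7):
    """obs_sequences: list of (30, d) arrays, one per training image."""
    X = np.vstack(obs_sequences)
    lengths = [len(s) for s in obs_sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=25)
    model.fit(X, lengths)
    return model

def recognize(test_sequence, models):
    """Return the person whose HMM assigns the test sequence highest log-likelihood."""
    scores = {name: m.score(test_sequence) for name, m in models.items()}
    return max(scores, key=scores.get)

# Toy data: 2 people, 5 training sequences each, 30 observation vectors of dim 8.
rng = np.random.default_rng(2)
models = {p: train_person_model([rng.normal(loc=i, size=(30, 8)) for _ in range(5)])
          for i, p in enumerate(["person_A", "person_B"])}
test = rng.normal(loc=1, size=(30, 8))
print(recognize(test, models))  # expected: "person_B"
```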

            In this study, our simulated fovea was a software-defined square patch centered at the desired fixation point, of the same resolution as the original unfoveated image. We surrounded this patch with rings of "superpixels," which double in size as one moves from the center towards the periphery of the image. A superpixel is simply a small square patch whose size is a few (physical) pixels; all the pixels that make up a superpixel are assigned the same graylevel value, namely the average of the pixels in the corresponding region of the original image. Within a ring, all superpixels have the same size; their size doubles only when one moves from one ring to the next. Furthermore, each ring contains the same number of layers of superpixels. (Figure 7 shows a foveated 512 x 512 image at 1:4 scale, with fixation on the left eye.) A second dimensionality reduction was then achieved by applying a standard compression technique, the discrete cosine transform (DCT), to the image (Figure 8 shows the reduced image obtained by mapping each superpixel to a single physical pixel).
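A simplified sketch of this foveation step is given below. It keeps full resolution near the fixation point and replaces more peripheral pixels with averages over progressively larger blocks, doubling the block size from ring to ring. The exact ring geometry, band widths, and the subsequent DCT step are simplified or omitted, so the parameters shown are illustrative assumptions rather than those of the actual system.

```python
# A simplified software fovea: full resolution near fixation, block-averaged
# "superpixels" that double in size with distance. Parameters are illustrative.
import numpy as np

def foveate(img, fx, fy, fovea_radius=32, ring_width=32, max_block=16):
    img = np.asarray(img, dtype=float)
    out = img.copy()
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            d = max(abs(x - fx), abs(y - fy))     # Chebyshev distance to fixation
            if d <= fovea_radius:
                continue                           # inside the fovea: keep full resolution
            ring = 1 + (d - fovea_radius) // ring_width
            block = min(2 ** ring, max_block)      # superpixel size doubles per ring
            y0, x0 = (y // block) * block, (x // block) * block
            out[y, x] = img[y0:y0 + block, x0:x0 + block].mean()
    return out

rng = np.random.default_rng(3)
image = rng.random((128, 128))
foveated = foveate(image, fx=40, fy=60)
print(foveated.shape)
```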

            HMMs have been used recently to model faces (Nefian & Hayes, 1999; Samaria & Young, 1994). However, those implementations did not make use of foveated vision. Instead, the observations for each image were produced by first dividing the image into horizontal (overlapping) stripes and then computing the DCT coefficients in each stripe. The images used were already small (about 100 x 100 pixels), 25 times smaller than our images; any attempt to run the HMM algorithms on raw observations from 512 x 512 images would be futile. To compare our method with Samaria's, we therefore also reduced the size of our images through subsampling (rather than foveation), producing images of constant resolution.

            In this study ten regions of each face were foveated (three fixations in each region), and we varied the number of states in the HMMs from 3 to 10, comparing foveated vision with subsampling. Interestingly, peak performance for foveated vision was observed at about 6-7 states, rather than at 10 as we had expected (see Figure 9). Since the fixation points are spread rather uniformly over the ten regions of the image, it is unlikely that the states arose from spatial clustering, and it is intriguing that the HMM algorithms, which are purely mathematical in nature, arrived at an optimum of 6-7 states. This may suggest that the states of the recognition process do not correspond to regions in the image. It is also interesting to note that the optimal number of states for subsampling is different from (and higher than) the optimal number of states for the HMMs built through foveation.

Figure 9. Recognition performance with foveation versus subsampling.

            Although our purpose was not necessarily to produce a better recognizer, but rather to determine if the HMM framework using foveated vision could produce at least reasonable results, it is interesting that we achieved higher accuracy using foveation than with subsampling (see Figure 9). In addition, the sequential decision-making framework represented by the HMM approach has other advantages. For example, it provides a method that allows incremental learning of new classes. That is, new faces can be added to the database without recomputing the representations of all other learned faces. We are encouraged that performance benefits may be derived from incorporating foveal vision and sequential information acquisition into artificial recognition systems.

 

            The MDP Approach to Gaze Control for Face Perception. The MDP approach to decision making provides an exciting framework within which to study human gaze control for face perception. These models provide a formalism that can be used to generate specific hypotheses about the decision processes supporting gaze control. Importantly, these models can be formulated at the appropriate temporal and spatial grain size, with an action conceptualized as a “macro”  that specifies the goal of the saccadic system in a hierarchical control structure. Conceptualizing gaze control at this grain size is consistent with current cortical control models (Pierrot-Deseilligny et al., 1995), and makes the decision process far more tractable (Precup et al., 1997). In addition, recent work on foveated vision and gaze control in AI and robotics suggests that the MDP approach may provide new insights into decision-making for fixation placement and the manner in which gaze control is learned and optimized to support new perceptual tasks (Bandera et al., 1996; Darrell, 1995; Rimey & Brown, 1991).

            Although there has been some work on the applicability of Markov models to human scan patterns, this work has predominantly been descriptive and motivated by a theory of perceptual memory (the scan path theory, Noton & Stark, 1971) that has not been strongly supported by the behavioral data (Groner et al., 1984; Stark & Ellis, 1981). Furthermore, little work has specifically attempted to integrate Markov analyses of gaze control with face perception and recognition (though see Althoff & Cohen, 1999). Finally, prior work applying Markov models to human gaze control has focused on the sequence of fixation placement. Another important aspect of gaze control is the duration of each fixation. Fixation durations in scene viewing have a mean of about 300 ms, but there is considerable variability around this mean (Henderson & Hollingworth, 1998). A substantial proportion of this variability is accounted for by the fact that an important function of gaze control is to gain information about the world, as suggested by the findings that fixation durations are shorter for semantically constrained objects (Friedman, 1979; Henderson et al., 1999; Loftus & Mackworth, 1978) and objects that have been fixated previously (Friedman, 1979).

Human eye movements are examples of actions whose primary purpose is to gain information, not to change the state of the world. Such information-gathering actions naturally arise in partially observable environments, where the agent does not have complete knowledge of the underlying state. POMDPs extend the MDP model to handle incomplete perception, and naturally accommodate actions that both change the state of the world and gather information (Kaelbling et al., 1998). Here, in addition to states and actions, there is a finite set of observations O (much smaller than the set of states), where an individual observation may be a stochastic function of the underlying state. POMDPs can be viewed as MDPs over a much larger set of belief states, each of which is a vector of estimates of the probability that the agent is in each of the underlying real states. Given a belief state, the agent can update it on the basis of new observations and actions in a purely local manner. However, finding optimal policies in a known POMDP model is intractable, even for problems with as few as 30 states (Littman, 1996). The ultimate goal of our work is to determine how humans are able to solve POMDPs in restricted situations, such as face recognition, by exploiting particular properties of the image (e.g., faces are symmetric, and the main features of a face remain invariant over age or ethnicity). We assume that careful investigations of human gaze control will provide insights into how artificial gaze control systems can meet these challenges. These insights will not only lead to a better theoretical understanding of gaze control in humans and machines, but also result in practical algorithms for robots.
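The purely local belief-state update mentioned above can be written down compactly. The sketch below applies Bayes' rule to a toy two-state problem; the transition and observation models are made-up assumptions for illustration.

```python
# A minimal sketch of the POMDP belief-state update (Bayes' rule).
import numpy as np

def update_belief(belief, action, observation, T, O):
    """
    belief: (n_states,) current probability over states
    T: (n_actions, n_states, n_states), T[a, s, s'] = P(s' | s, a)
    O: (n_actions, n_states, n_obs),    O[a, s', o] = P(o | s', a)
    Returns the normalized posterior belief over states.
    """
    predicted = belief @ T[action]                 # P(s' | belief, a)
    posterior = predicted * O[action][:, observation]
    return posterior / posterior.sum()

# Two states, one action, two observations (toy numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
belief = np.array([0.5, 0.5])
print(update_belief(belief, action=0, observation=1, T=T, O=O))
```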

 

Conclusion

 

The ability to control the direction of gaze in order to properly orient the fovea to important regions of the external world is important for all mammals, including humans. This ability is especially critical in the case of face perception, where the need to quickly determine the identity, kinship, intention, and emotional state of another is central to all social interaction and, presumably, to survival. In human vision, evidence suggests that the ability to learn and recognize faces uses dedicated neural and cognitive systems that have evolved to support it. In machine vision, face perception is an important test-bed for computational theories of vision, is necessary for constructing robotic systems that can interact socially with humans, and has important practical applications in the construction of computer security systems that operate via person identification. Research on foveated vision systems with appropriate gaze control for artificial vision is in its infancy, but holds great promise.

Our current work is designed to extend our study of gaze control during face learning and recognition. We are particularly interested in finding answers to three specific questions: First, how are potential fixation targets selected by human observers, what stimulus factors determine which target is fixated next, and what advantages are conferred by choosing these fixation sites over others? Second, to what degree is the specific ordinal sequence of fixations important in the perceptual learning and recognition of faces? Third, to what degree is the ordinal fixation sequence over faces affected by factors such as the viewing task and the stimulus set? We are addressing these questions by using converging methods from behavioral studies of human gaze control and MDP studies using computational modeling and implementation of artificial gaze control.


References

 

Althoff, R. R., & Cohen, N. J. (1999). Eye-movement-based memory effect: A reprocessing effect in face perception. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 997-1010.

Anstis, S. M. (1974). A chart demonstrating variations in acuity with retinal position. Vision Research, 14, 589-592.

Ballard, D. H. (1996). In K. Akins (Ed.), Perception: Vancouver Studies in Cognitive Science (pp. 111-131). Oxford: Oxford University Press.

Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (in press). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences.

Bandera, C., Vico, F., Bravo, J., Harmon, M., & Baird, L. (1996). Residual Q‑learning Applied to Visual Attention, Proceedings of the 13th International Conference on Machine Learning, July 3rd‑6th, Bari, Italy, 20‑27.

Bellman, R. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.

Brooks, R. A. et al. (1998). Alternative Essences for Artificial Intelligence, Proceedings of the AAAI Conference, Madison, Wisconsin, 1998.

Bruce, V. (1988). Recognizing faces. Hove, England: Erlbaum.

Buswell, G. T. (1935). How people look at pictures. Chicago: University of Chicago Press.

Carpenter, R. H. S. (1988). Movements of the Eyes. London: Pion.

Crane, H. D. (1994). The Purkinje image eyetracker, image stabilization, and related forms of stimulus manipulation. In D. H. Kelley (Ed.), Visual science and engineering: Models and applications (pp. 15-89). New York: Macel Dekker.

Crane, H. D., & Steele, C. M. (1985). Generation-V dual-Purkinje-image eyetracker. Applied Optics, 24, 527-537.

Darrell, T. (1995). Reinforcement Learning of Active Recognition Behaviors. Advances in Neural Information Processing Systems, 8, pp. 858-864.

Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1-8.

Diamond, R., & Carey, S. (1986). Why faces are and are not special: An effect of expertise. Journal of Experimental Psychology: General, 115, 107-117.

Falk, R. J., Henderson, J. M., Hollingworth, A., Mahadevan, S., & Dyer, F. C. (2000, August). Eye movements in human face learning and recognition. Presented at the Annual Meeting of the Cognitive Science Society, Philadelphia, PA.

Farah, M.J., Tanaka, J.W., & Drain, H.M. (1995). What causes the face inversion effect? Journal of Experimental Psychology: Human Perception and Performance, 21, 628-634.

Friedman, A. (1979). Framing pictures: The role of knowledge in automatized encoding and memory for gist. Journal of Experimental Psychology: General, 108, 316-355.

Groner, R., Walder, F., & Groner, M. (1984). Looking at faces: Local and global aspects of scanpaths. In A. G. Gale & F. Johnson (Eds.), Theoretical and applied aspects of eye movement research. North Holland: Elsevier Science Publishers.

Henderson, J. M. (1996). Visual attention and the attention-action interface. In K. Akins (Ed.), Perception: Vancouver Studies in Cognitive Science (Vol. V). Oxford: Oxford University Press.

Henderson, J. M., & Hollingworth, A. (1998). Eye movements during scene viewing: An overview. In G. W. Underwood (Ed.), Eye guidance while reading and while watching dynamic scenes (pp. 269-295). Amsterdam: Elsevier.

Henderson, J. M., & Hollingworth, A. (1999). High-level scene perception. Annual Review of Psychology, 50, 243-271.

Henderson, J. M., Weeks, P. A., Jr., & Hollingworth, A. (1999). Eye movements during scene viewing: Effects of semantic consistency. Journal of Experimental Psychology: Human Perception and Performance, 25, 210-228.

Howard, R. A. (1960). Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press.

?Ishai et al. 1997

Kaelbling, L., Littman, M., & Cassandra, A. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.

Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17, 4302-4311.

Kuniyoshi, Y., et al. (1995). A foveated wide angle lens for active vision. IEEE International Conference on Robotics and Automation, Japan, pp. 2982-2985.

Liu, A. (1998). What the driver's eye tells the car's brain. In G.J. Underwood, Ed., Eye Guidance in Reading and Scene Perception, Oxford: Elsevier, pp. 431-452.

Littman, M. (1996). Algorithms for Sequential Decision Making, Ph.D. dissertation, Brown University, Department of Computer Science, Providence, RI, March.

Loftus, G. R., & Mackworth, N. H. (1978). Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance, 4, 565-572.

Mahadevan, S., & Connell, J. (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55, 311-365.

Matin, E. (1974). Saccadic suppression: A review and an analysis. Psychological Bulletin, 81, 899-917.

McCarthy, G., Puce, A., Gore, J., & Allison, T. (1997). Face-specific processing in the fusiform gyrus. Journal of Cognitive Neuroscience, 9, 605-610.

Milner, A., & Goodale, M. (1995). The visual brain in action. Oxford: Oxford University Press.

Minut, S., Mahadevan, S., Henderson, J. M., & Dyer, F. C. (2000). Face recognition using foveal vision. In. S-W. Lee, H. H. Bulthoff, & T. Poggio (Eds.), Biologically motivated computer vision. Berlin: Springer-Verlag.

Mitchell, T. (1997). Machine Learning. McGraw-Hill.

Moghaddam, B. et al. (1998). Beyond Eigenfaces: Probabilistic Matching for Face Recognition, International Conference on Automatic Face and Gesture Recognition.

Moscovitch, M., Winocur, G., & Behrmann, M. (1997). What is special about face recognition? Nineteen experiments on a person with visual object agnosia and dyslexia but normal face recognition. Journal of Cognitive Neuroscience, 9, 555-604.

Nefian, A., & Hayes, M. (1999). Face recognition using an embedded HMM. IEEE Conference on Audio and Video-based Biometric Person Authentication, Washington, D.C.

Newcombe, F., Mehta, Z., & de Haan, E. H. F. (1994). Category specificity in visual recognition. In M. J. Farah & G. Ratcliff (Eds.), The neuropsychology of high-level vision (pp. 103-132). Hillsdale, NJ: Erlbaum.

Noton, D., & Stark, L. (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308-311.

Pentland, A. et al. (1994). View-based and Modular Eigenfaces for Face Recognition, IEEE Conference on Computer Vision and Pattern Recognition.

Pierrot-Deseilligny, C., Rivaud, S., Gaymard, B., Muri, R., & Vermersch, A. (1995). Cortical control of saccades. Annals of Neurology, 37, 557-567.

Precup, D., Sutton, R., & Singh, S. (1997). Planning with closed‑loop macro actions, Working Notes of the AAAI Fall Symposium on Model‑Directed Autonomous Systems.

Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-286.

Rayner, K. (1998). Eye movements in reading, visual search and scene perception: 20 years of research. Psychological Bulletin, 124, 372-422.

Riggs, L. A. (1965). Eye movements. In C. H. Graham (Ed). Vision and Visual Perception (pp. 321-349). New York: Wiley.

Rimey, R. D., & Brown, C. M.  (1991). Controlling eye movements with hidden Markov models.  International Journal of Computer Vision, November, 47-65.

Samaria, F., & Young, S. (1994). HMM based architecture for face identification. Image and Vision Computing, 12, 537-543.

Stark, L., & Ellis, S. R. (1981). Scanpaths revisited: Cognitive models direct active looking. In D. F. Fisher, R. A. Monty, & J. W. Senders (Eds.), Eye movements: Cognition and visual perception. Hillsdale, NJ: Erlbaum.

Tanaka, J. W., & Farah, M.J. (1993). Parts and whole relationships in face recognition. Quarterly Journal of Experimental Psychology,  46A, 225-245.

Van der Spiegel, J. et al. (1989). A Foveated Retina-like Sensor using CCD Technology. In Mead, C. and Ismail, M. Analog VLSI Implementations of Neural Systems, Kluwer Academic Publishers, pp. 189-212.

Volkmann, F. C. (1986). Human visual suppression. Vision Research, 26, 1401-1416.

Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum Press.

Yin, R. K. (1969). Looking at upside-down faces. Journal of Experimental Psychology, 81, 141-145.