When is eye-tracking not eye-tracking?

29 November 2016 / Tim Holmes

The influence of fake information has been in the news quite a lot lately. Did strategically written un-truths, or at least un-verified truths, affect the outcome of the Brexit vote and the American election? Did encountering these reports through trusted sites like Facebook lend them weight and credibility? Why didn’t more people question the truth in the first place? These are all great questions to ask, and ones anyone involved with scientific research and the dissemination of results is already very familiar with. After all, scientific results are frequently over-inflated, or worse still falsely reported, by a media eager to generate hits for websites and potential click-throughs for their advertisers, and that’s assuming the science was good and honest in the first place! Our cognitive biases ensure that we frequently make decisions based on insufficient information and that trusted, or trustworthy-looking, sources of information are rarely questioned. So today I want to talk about fake eye-tracking, and I’ll explain what I mean by that in a minute. In particular, I want to talk about the risks associated with fake eye-tracking at a time when technologies like Virtual Reality provide new opportunities for making assumptions about visual attention that are not based on eye-movements at all.

So what do I mean by fake eye-tracking? Well, it actually comes in more than one form, but the one I want to concentrate on today is something I recently heard described as “predictive eye-tracking”, but which in my world is called salience mapping. This technique uses computational models to evaluate images, typically static ones such as supermarket shelves or websites, in order to identify the regions that contain the most visual information. What do I mean by visual information? Here I am talking about the low-level attributes of an image such as brightness, contrast, edges, colour intensity and, in the case of non-static visuals, motion. These attributes all have one thing in common: they are mapped out by regions of the visual cortex and feed into other brain areas associated with language, object recognition, motor planning and, most importantly for us today, eye-movements. The outputs of these models often come in the form of a heat-map that is easily mistaken for one you would get from an eye-tracking study, but in this case it simply shows where the most visual information is.
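To make that concrete, here is a minimal sketch of the kind of computation such a model performs, assuming a photo of a shelf as input. The filename, the crude colour-opponency channel and the two blur scales are illustrative choices of mine; real models such as Itti & Koch’s use many more channels, scales and normalisation steps.

```python
# Simplified salience sketch: centre-surround differences on intensity and a
# crude red-green colour-opponency channel, summed into a single "heat map".
# The input filename and the scale choices are illustrative, not from any
# published model.
import numpy as np
import imageio.v3 as iio
from scipy.ndimage import gaussian_filter

img = iio.imread("shelf_photo.png").astype(float) / 255.0  # hypothetical image
r, g, b = img[..., 0], img[..., 1], img[..., 2]

intensity = (r + g + b) / 3.0   # brightness channel
rg_opponency = r - g            # crude red-green colour contrast

def centre_surround(channel, scales=((1, 4), (2, 8))):
    """Sum absolute differences between a fine (centre) and coarse (surround) blur."""
    out = np.zeros_like(channel)
    for centre_sigma, surround_sigma in scales:
        out += np.abs(gaussian_filter(channel, centre_sigma)
                      - gaussian_filter(channel, surround_sigma))
    return out

salience = centre_surround(intensity) + centre_surround(rg_opponency)
salience = (salience - salience.min()) / (salience.max() - salience.min() + 1e-9)

# 'salience' is the heat-map-like output: high values mark regions with the
# most low-level visual information, not where anyone actually looked.
```

The point to notice is that nothing in this pipeline knows anything about the viewer: it is arithmetic over pixels.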

One of the most cited models is that of Itti & Koch (2001). It was subsequently used to suggest that these regions of concentrated visual information automatically attract our attention: the brain has evolved to know that in order to process the gist of a scene we need to scan it quickly with our eyes, and by looking at these regions we optimise the information gathered whilst minimising the time spent scanning. It is, of course, this correlation between information and attention that makes these models commercially interesting. Now, as I discussed here, there are lots of these models around and their ability to predict the attention of naïve participants varies considerably, but in general such models do provide a reasonably high correlation with eye-movements. There are, however, some important qualifiers:

  1. The correlation is high for the first fixation and weakens with subsequent fixations, meaning that the longer the viewer spends looking at the scene, the less accurate the model’s predictions become (see the sketch after this list for one way to measure that decay). This is because voluntary attention, on the part of the viewer, starts to kick in within 2 seconds of viewing a scene, and each viewer responds differently to the information gathered from those first few fixations depending on their visual acuity, past experience or reason for looking at the scene in the first place. It is worth noting that in the real world this scene scanning happens on the approach to the fixture and doesn’t necessarily form part of the much-touted “2 seconds to make a purchase decision”.
  2. The correlation relies heavily on the absence of a task or motivation, meaning that if the viewer is searching for something specific in a scene, the model has no way of knowing this, and if there’s one thing we know about eye-movements it’s that they are highly susceptible to the effects of task. Put simply, we mostly look at whatever is most likely to fulfil our needs at that precise moment. One person viewing the same image with two different questions in mind will produce two distinct scan paths.
  3. In the real-world, scenes do not remain static for long. People move, grass sways, products are removed from supermarket shelves or put back in a different place. Moreover, the viewer doesn’t remain static either and so their field of view also changes. The prediction of attention from any of these models is only valid while the scene exactly matches the one given to the model and for the point-of-view represented by that image.
  4. The models cannot account for the viewing environment. I’ve done my fair share of eye-tracking in the real world and, in particular, in a few galleries lately, where we have shown that the pattern of eye-movements differs between original artworks viewed in a museum and images of the artwork shown on a screen. This is because things like lighting, image size and viewing distance all change the way we scan a scene. The same is true for these models: a planogram of a supermarket shelf will generate a very different salience map from a photograph of the real thing. In most cases the very image being used is simply not representative of the situation you are trying to predict behaviour in.
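As flagged in point 1 above, one common way to quantify the match between a salience map and recorded eye-movements is normalised scanpath saliency (NSS); computing it separately for the 1st, 2nd, 3rd… fixation makes the decay in predictive power visible. Here is a minimal sketch, assuming you already have a salience map and, for each viewer, an ordered list of fixation coordinates; the variable names and data layout are hypothetical.

```python
# Hedged sketch: normalised scanpath saliency (NSS) computed per fixation index,
# to show how quickly a salience map's predictive power falls off.
# 'salience' is a 2-D array; 'scanpaths' is a list of per-viewer fixation lists,
# each fixation an (x, y) pixel coordinate in scan order -- both assumed inputs.
import numpy as np

def nss_per_fixation_index(salience, scanpaths, max_index=5):
    z = (salience - salience.mean()) / (salience.std() + 1e-9)  # z-score the map
    scores = []
    for i in range(max_index):
        vals = []
        for path in scanpaths:
            if len(path) > i:
                x, y = path[i]
                vals.append(z[int(y), int(x)])  # salience at the i-th fixation
        scores.append(float(np.mean(vals)) if vals else float("nan"))
    return scores  # typically highest for fixation 1, dropping thereafter
```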

So, hopefully now you can understand why I am using the term “fake eye-tracking”, because although such models might provide quick and cheap evaluations of visual salience, they are absolutely not a guarantee of how anyone will actually view the scene in the real world.

An additional quick aside about these models. Most commercial users seem to believe their application lies in designing products or messaging that will stand out quickly at the point-of-sale, and certainly salience models can predict that a big red SALE sign attracts attention. But here’s the small print that commercial users frequently miss, and yet it massively impacts the way these models work. In attention there is a process called inhibition of return (IOR), which basically means that when scanning a visual scene we do not keep returning to the same part of that scene, and it is this very mechanism that allows attention to move rapidly around a scene in those first few seconds of viewing. ALL predictive models rely heavily on this concept, particularly if they provide any sequence information about where attention is likely to go first, second, third, etc. In a supermarket we know you have very limited time available to get a shopper’s attention, and whilst a tool like this might help you to be one of the first items looked at, it provides NO help whatsoever in determining a product’s ability to hold attention, because the models assume that once looked at, your product will be suppressed so that attention can move on!
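To see why, here is a hedged sketch of the winner-take-all-plus-IOR loop at the heart of most of these predictive models; the suppression radius and number of steps are illustrative values of mine, not taken from any published model. The most salient location wins, gets recorded, and is then zeroed out so the next pick is forced to be somewhere else. By construction, the output can tell you nothing about a product’s ability to hold a gaze.

```python
# Hedged sketch of winner-take-all with inhibition of return (IOR):
# repeatedly pick the most salient location, then suppress a neighbourhood
# around it so attention is forced to move on. Radius and step count are
# illustrative parameters.
import numpy as np

def predicted_scanpath(salience, n_fixations=5, ior_radius=40):
    sal = salience.astype(float).copy()
    h, w = sal.shape
    yy, xx = np.mgrid[0:h, 0:w]
    path = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)  # winner takes all
        path.append((int(x), int(y)))
        # inhibition of return: zero out a disc around the attended location
        sal[(yy - y) ** 2 + (xx - x) ** 2 <= ior_radius ** 2] = 0.0
    return path  # predicted 1st, 2nd, 3rd... attended locations
```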

Now, I am not suggesting these models are useless. Far from it, in fact. I use them myself because they are great at generating hypotheses for testing with actual eye-tracking studies. What I am suggesting is that they are not a substitute, and the financial risks associated with making commercial decisions based on such naïve methods are far greater than the cost of running a small, appropriate eye-tracking study to confirm, or otherwise, your hypothesis.

I mentioned there is more than one type of fake eye-tracking. In the next post I’ll talk more about this, and in particular about head-tracking methods and why real eye-tracking is crucial to understanding attention in Virtual Reality.

