The Promise of Detecting Emotions From Facial Expressions, and a Drawback

In everyday life, when people interact face-to-face (in person or by video chat), they try to gauge from the facial expressions of the other person how he or she feels, what emotion that person may be experiencing, in order to conduct themselves better (e.g., sympathise, choose their words more carefully so as not to offend). The prospect that technology-based algorithms (AI) can smartly and automatically detect emotions from expressions in photo or video images of human faces has lately been gaining appeal in many companies. Yet interpreting emotions merely on the basis of facial expressions, whether by humans or by technologies, is not as deterministic and straightforward a process as it may seem; rather, the interpretation involves nuance and variability. That variability may arise from contextual, personal and cultural factors, but it may also be an outcome of the way the human central nervous system operates, which is more dynamic and agile than has commonly been accepted. Companies could therefore be relying on inferences that are much less accurate and dependable than they wish to believe.

In talking about ‘facial expressions’, we should clarify that the term physically refers to the movements of facial muscles, which an observer would try to trace in order to infer a person’s emotional state. Such movements can be intricate, delicate or compound (i.e., a combination of different muscles changing posture simultaneously). The expressions generated on one’s face may appear more or less consistent or ‘predictable’ to an observer; the less consistent they are, the more troublesome or elusive it becomes to relate a prototypical facial expression to an emotion category (e.g., sadness, happiness, anger).

Multiple kinds of applications of emotion detection from facial expressions are emerging, spanning marketing, advertising, service and user experience. The application is usually employed during an interaction with a consumer or customer, via a webcam at a desktop screen or a camera on one’s mobile phone; sometimes, however, it can also take place through cameras installed in physical sites (e.g., a store, a service centre). Managers’ motivation for taking this route is often the claim that consumers cannot or will not reliably disclose their true emotions in verbal responses or via familiar survey tools (e.g., rating or semantic scales); in particular, consumers may be unable to report their emotions reliably and voluntarily during service or product usage experiences.

  • Marketers and product developers may be interested in tracking how consumers emotionally respond to the visual design and operation of prototype or semi-final models of new products.
  • Another much-talked-about area of application is advertising: testing consumers’ emotional responses to new ad copy (still images) or video clips (TV or online commercials).
  • Service encounters with customers are another important field of application: tracking changes in the emotional states of customers during a video interaction with a service agent (a live chat with a human agent or a chat with a ‘robotic’ virtual agent). Such an application might help agents respond more promptly and agreeably to customers’ complaints or requests in real time.
  • In retailing, there are suggestions to use the technology to track facial expressions and detect the emotions of shoppers during their journey in a physical store, as they walk along shelf displays, browse and search for products (possibly in co-ordination with beacon technology for sending messages). In another kind of application, the fashion retailer Uniqlo installed, in select clothing stores of its chain, AI-powered UMood kiosks that show a variety of products to customers and measure their reaction to colour and style with the aid of ‘neurotransmitters’; upon inferring how customers feel, the kiosks make recommendations (Blake Morgan, Forbes, 4 March 2019; how ‘neurotransmitters’ relate to movements of facial muscles is not explicated).
  • The methodology can also be applied to online shopping on store websites (e.g., research companies already combine methods of measuring facial expressions, for inferring emotions, with eye tracking, for measuring aspects of consumer-shopper attention).
  • Regarding user experience, there seems to be increased interest in tracing facial expressions during gaming sessions, possibly so as to modify the ‘story’ or flow of a game according to the mood and emotional responses of gamers at every stage of their play.

Briefly, the common approach to tracing facial expressions is known as ‘facial coding’, whereby software identifies and marks selected points on the image of a human face, associated with relevant facial muscles, and connects the points (i.e., depicting a kind of network). This makes it possible to detect changes in the configuration of that ‘network’ on the person’s face and thereby infer the emotion likely to be associated with the observed pattern of facial expression. A number of research firms (e.g., iMotions and Affectiva-Smart Eye, separately and in collaboration) and well-known technology companies (e.g., Microsoft, Google, Amazon) are engaged in developing and applying AI-based methods for detecting emotions from facial expressions.
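To make the facial-coding idea concrete, here is a minimal sketch in Python of the general logic: locate facial points, describe the geometry of their ‘network’ as a feature vector, and assign the emotion whose prototype configuration is nearest. Everything in it (the number of points, the synthetic ‘training’ data, the nearest-prototype rule) is an illustrative assumption, not the actual pipeline of any of the vendors mentioned.

```python
# Illustrative sketch only: a toy 'facial coding' classifier. A real system would
# feed landmark coordinates from a trained face-landmark detector and learn its
# model from large sets of labelled images; here both are simulated.
import numpy as np

rng = np.random.default_rng(0)
N_POINTS = 20  # number of tracked facial points (hypothetical)
EMOTIONS = ["anger", "sadness", "disgust", "fear", "happiness", "surprise"]


def configuration_features(points: np.ndarray) -> np.ndarray:
    """Summarise the 'network' of facial points as normalised pairwise distances.

    points: array of shape (N_POINTS, 2) holding (x, y) landmark coordinates.
    Normalising makes the description roughly invariant to face size in the image.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = dists[np.triu_indices(len(points), k=1)]
    return upper / (np.linalg.norm(upper) + 1e-9)


# Stand-in 'training': an average feature vector per emotion from synthetic examples.
prototypes = {}
for emotion in EMOTIONS:
    base = rng.normal(size=(N_POINTS, 2))  # pretend canonical configuration
    examples = [base + rng.normal(scale=0.05, size=base.shape) for _ in range(30)]
    prototypes[emotion] = np.mean([configuration_features(p) for p in examples], axis=0)


def infer_emotion(points: np.ndarray) -> tuple[str, float]:
    """Return the emotion whose prototype configuration is closest to the observed one."""
    feats = configuration_features(points)
    label = min(prototypes, key=lambda e: np.linalg.norm(feats - prototypes[e]))
    return label, float(np.linalg.norm(feats - prototypes[label]))


# Example: classify a new (synthetic) facial configuration.
observed = rng.normal(size=(N_POINTS, 2))
print(infer_emotion(observed))
```

Commercial systems typically replace the synthetic prototypes and the nearest-distance rule with machine-learning classifiers trained on labelled images, but the underlying premise is the same: that a given configuration of facial movements maps onto an emotion category. It is precisely this premise that the research discussed below calls into question.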

The approach of inferring emotional states from facial expressions is grounded in a conception that some basic or natural-kind emotions exist, and that each basic emotion is associated with the occurrence of a prototypical facial expression. The conception is largely drawn from the extensive research work and findings of psychologist Paul Ekman [1]. He concentrated on six emotion categories for which repeatedly occurring typical ‘expressions’ could be identified: anger, sadness, disgust, fear, happiness, and surprise. Ekman argued that there is a sounder basis for making inferences about these emotion categories, and much subsequent research continues to focus on these basic emotions. It is generally held, for example, that a smile signifies happiness, a scowl occurs in a state of anger, and a frown accompanies sadness. Less simplistically, a whole set of muscles may move in any emotional episode, creating a more compound pattern or expression that includes some primary ‘prototypical’ muscle movements and secondary, more versatile facial muscle movements.

However, the ability to make generalisations about the facial expressions typically linked to specific emotions could be rather limited, unstable and even flimsy. Advocates argue along the lines of ‘you know the expression of a [happy / angry / frightened, etc.] person when you see one’, and that this should give people, as well as AI-powered algorithms, enough confidence to infer emotional states from facial expressions. But are movements of facial muscles really so universal, typical and telling about underlying emotions? Ekman’s theory, received as part of the classical view, is consistent with a locationist view of brain activity: a stimulus perceived as ’emotion-laden’ triggers a reaction (i.e., activation) of a brain region or structure that generates, or perceives, a particular type of emotion (the most familiar linkage cited is between the amygdala and fear, but other structures, such as the insula and the anterior cingulate cortex (ACC), are also involved in emotions). The locationist view has, however, been contested.

The constructionist view of emotions, as developed by psychologist Lisa Feldman Barrett, contends that emotional states are formed in real time: they are constructed rather than automatically triggered upon exposure to an external stimulus (e.g., an object, an event, the facial expression of another person). Her theory of constructed emotions suggests that the construction of an emotion is facilitated by simulating its build-up in mind and body as it occurs, informed by past experiences and learned concepts. Barrett objects to the term ‘reaction’ insofar as it implies a fixed, pre-determined activation of a brain region and the ensuing or associated responses, such as facial expression, other physiological sensations (‘feelings’), and the subjective experience of emotion. The construction of an emotional state and experience is more fluid, and its outcome can be more diversified. Hence, Barrett also argues against treating facial expressions, and other physical changes, as ‘fingerprints’ that exist a priori for each type or category of emotion. She rejects the assertion that one can easily tell an emotion from an observed facial expression as if it were a fingerprint embedded in the emotional state. People may respond differently, with accompanying physical changes, even when seemingly feeling the same emotion; when confronted with critical and even insulting remarks, for example, some people fume in anger, others cry, yet others become withdrawn, and some turn quiet and cunning. Their apparent expressions of ‘anger’ vary [2].

According to a constructionist view, a structure like the amygdala is involved in more than just producing or perceiving fear, and a number of structures may engage with a given type of emotion, together or alternately. First, the activation of brain regions is not pre-determined or ‘pre-wired’ and may vary from one instance to another (i.e., the experience and the observed response are not automatically the same). Second, a structure like the amygdala can be engaged with emotions other than fear. Third, brain regions involved in emotional processes are likely to be involved in other types of processes as well (which provides a basis for greater integration between, for example, cognitive and emotional processes).

  • Note: The concept of constructed emotions is similar to another psychological concept from a more cognitive context: constructed preferences and choice processes [3]. In that context, the argument is that choices are not necessarily derived by retrieving an established preference ordering stored in long-term memory; in many cases preferences are learned and formed ad hoc, and a choice is constructed as the consumer progresses through the decision process (i.e., depending on what set of rules one uses and how).

There are variants within each ’emotion category’ that should not be overlooked, and the question is how well a person can distinguish between them when viewing another person’s facial expression: being elated, delighted, joyful or content as variants of ‘happiness’, for example, or being irritated, aggravated, infuriated or vengeful as variants of ‘anger’. A different tone or intensity of a general type of emotion in one person may call for a different response from the observing person. And what happens with emotions that do not even seem to have distinct enough facial expressions (e.g., guilt, love, contempt)? There are also ‘neutral’ expressions that simply mean nothing emotionally and should not be given significance as such. All of these comments remain valid when the ‘observer’ is a camera and the detection of the emotional state is performed by an AI algorithm.

Barrett led a team of researchers that examined a large pool of previous studies (performing a meta-analysis of their results) in the hope of deciphering this very issue of interpreting facial expressions; their paper, titled “Emotional Expressions Reconsidered”, is dedicated to the “Challenges to Inferring Emotion From Human Facial Movements” [4]. The researchers consider specificity and reliability in making inferences from facial expressions about emotional states, with regard to the six emotion categories mentioned above. They distinguish between the production and the perception of facial expressions, and conclude that scientific findings in both types of studies are “failing to strongly support the common view” (i.e., that “humans around the world reliably express and recognise certain emotions in specific configurations of facial movements”).

Barrett and her colleagues agree that humans do sometimes express instances of emotion from the six categories with the hypothesised configurations of facial expression (see their Figure 4); that is, those configurations sometimes express the emotions proposed. With respect to perception, this nevertheless raises the question of what level of agreement about the emotion perceived from an expression would be satisfactory for accepting the judgement: 60%? 80%? It is problematic, however, when rates of agreement on the emotions range mostly between 20% and 40% (highest for happiness, at nearly 50%, followed by anger at about 40%; see their Figure 11; people interpret emotions from faces within their own culture more successfully). The authors conclude that the reliability of findings on the perception of emotional states is weak and that “there is evidence that the strength of support for the common view varies systematically with research methods used”. They criticise the use of technology that relies on facial expressions as ‘fingerprints’ to make inferences about underlying emotional states, as applying mistaken scientific ‘facts’.
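As a side note on how such agreement rates are read, the short sketch below computes the share of observers whose label matches the emotion proposed for a posed configuration. The observer responses are entirely hypothetical and were chosen only to land in the same ballpark as the ranges reported above, not taken from the paper’s data.

```python
# Hypothetical ratings: the 'agreement rate' for a posed configuration is simply
# the proportion of observers whose label matches the emotion proposed for it.
from collections import Counter

ratings = {
    "smile (proposed: happiness)": ["happiness"] * 5
        + ["surprise", "contempt", "amusement", "neutral", "joy"],
    "scowl (proposed: anger)": ["anger"] * 4
        + ["disgust", "disgust", "contempt", "frustration", "neutral", "sadness"],
}

for config, labels in ratings.items():
    proposed = config.split("proposed: ")[1].rstrip(")")
    agreement = labels.count(proposed) / len(labels)
    modal = Counter(labels).most_common(1)[0][0]
    print(f"{config}: {agreement:.0%} agreement; most frequent label: {modal}")
```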

  • In an article in the Financial Times, Madhumita Murgia (FT.com, 12 May 2021) refers to the critical study of Barrett et al. She emphasises the role of context (e.g., the occasion in which, and the person by whom, an expression is made) in making more accurate assessments of human instances of emotional expression. Paul Ekman, commenting for the article, sounded reserved and distanced himself from technological models purporting to predict emotions on the basis of their relationship to facial muscle movements. Overall, Murgia is particularly concerned with the technological applications of emotion detection, considering also the possible violation of individuals’ privacy, an issue being discussed by AI regulators in the EU.

When a person tries to detect the emotion of his or her partner in a conversation, for instance, he or she is likely to consider additional contextual factors about the situation and its background: causes, goals or motivation, social or cultural circumstances, time and location, and so on. If one is also more familiar with the other person, the latter’s personality traits may be taken into account in assessing what type of emotion to expect (e.g., is the other person amused and kidding, or upset and cynical?). In some cases it can be difficult to discern emotions from a facial expression alone (e.g., with eyes wide open, it may be hard to tell whether one is afraid or surprised). The observer may attempt to notice more parts of the facial expression to make a better judgement, but considering contextual factors as well can improve one’s inference about the emotion considerably. Nevertheless, we should not lean on context too heavily, lest we neglect the information in the facial expression itself; we can use context and personality information to contemplate (or simulate) what emotions to expect, and especially to resolve ambiguities when the emotion inferred or detected from a facial expression does not seem to fit the situation involved. These issues are just as true and relevant when emotion detection is done by an AI-powered algorithm.
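One simple way to picture this combination of cues, purely as an illustration (neither Barrett nor the vendors mentioned prescribe this method), is to treat expression-based scores as evidence, let context supply a prior over plausible emotions, and combine the two, so that an ambiguous wide-eyed expression is resolved differently in different situations. All scores below are hypothetical.

```python
# Sketch of combining expression-based evidence with a context-based prior.
EMOTIONS = ["anger", "sadness", "fear", "happiness", "surprise"]

def combine(expression_scores: dict, context_prior: dict) -> dict:
    """Multiply expression evidence by a context prior and renormalise (a Bayes-like update)."""
    combined = {e: expression_scores.get(e, 0.0) * context_prior.get(e, 0.0) for e in EMOTIONS}
    total = sum(combined.values()) or 1.0
    return {e: round(v / total, 2) for e, v in combined.items()}

# Wide-open eyes alone are ambiguous between fear and surprise...
expression_scores = {"fear": 0.45, "surprise": 0.45, "happiness": 0.05, "anger": 0.03, "sadness": 0.02}
# ...but the context (say, a surprise birthday party) makes fear far less plausible.
context_prior = {"surprise": 0.50, "happiness": 0.40, "fear": 0.04, "anger": 0.03, "sadness": 0.03}

print(combine(expression_scores, context_prior))
# surprise now dominates, while fear drops to a marginal probability
```

The point of the sketch is only that context should temper, not replace, the facial evidence: the expression scores remain in the calculation, and context mainly helps to resolve ambiguity.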

There is no intention here to deny that people can, on some occasions, figure out pretty well how others feel, the kind of emotion another person seems to experience, from that person’s facial expression. But people also often misinterpret observed signs of emotional expression, facial expressions included. Human inferences lack accuracy: people may jump to conclusions based on partial cues, and the facial expression itself could be misleading. People may also face difficulties when their perception is not refined enough to infer variants of emotions. Yet humans do get a chance to backtrack and correct their errors. The verb ‘detecting’ itself could be misguided; ‘perceiving’ or ‘inferring’ may be more suitable terms. Applying additional contextual information, as described above, can help one infer more closely the emotion experienced by another person. It can also prove vital to ask the other person how he or she feels; the answer may still provide a valuable point of reference or comparison for validating one’s inference. Companies that utilise AI-powered algorithms to infer emotions from facial expressions may find that making mistakes can be damaging, especially when the computer system hurries to act upon the AI’s prediction: it can hurt consumers’ attitudes or deteriorate rather than improve relationships with customers. AI algorithms can be just as fallible as humans in this regard.

Ron Ventura, Ph.D. (Marketing)

Notes:

[1] See for example this early publication of the research of Ekman and colleagues in the book: Emotion in the Human Face: Guidelines for Research and an Integration of Findings; Paul Ekman, Wallace V. Friesen, & Phoebe Ellsworth, 1972; Pergamon Press.

[2] How Emotions Are Made: The Secret Life of the Brain; Lisa Feldman Barrett, 2017; UK: Macmillan.

[3] See for example: Constructive Consumer Choice Processes; James R. Bettman, Mary Frances Luce, & John W. Payne, 1998; Journal of Consumer Research, 25 (December), pp. 187-217.

[4] Emotional Expressions Reconsidered: Challenges to Inferring Emotion From Human Facial Movements; Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M. Martinez, & Seth D. Pollak, 2019; Psychological Science in the Public Interest, 20 (1), pp. 1-68; available for reading online at Sage Publications.