Statistics – A Gateway to Consumer and Customer Behaviour

The marketing research sector seems enthusiastic lately about the use of synthetic data in consumer and marketing studies. More precisely, this post refers to the use of synthetic responses in surveys, generated by machine learning (ML) models in the framework of artificial intelligence (AI); in the more comprehensive form of application, surveys are based completely on data generated on behalf of synthetic respondents as a substitute for human respondents (i.e., the sample as a whole is ‘human-like’). For a researcher of the late 20th century, the strong advocacy and willingness to rely heavily on data that mimics human responses, smart and informed as such data may be, might seem odd at best, and at worst perilous to the trustworthiness or credibility of those consumer studies.

In the context discussed here, it is necessary to clarify that synthetic data comprise two main parts: the profiles of synthetic respondents included in a sample and the synthetic responses generated on their behalf. It should be noted, nonetheless, that in many applications common today synthetic responses may be assigned to human participants in a survey: this is the case when imputing missing values for the responses of some participants to particular questions (or for their background data). Another, more extensive approach is enhancing a human sample with synthetic respondents and their expected responses (known as sample ‘boosting’). To be clear, imputing missing values is quite an old practice; the difference now is the replacement of ‘naive’ and traditional statistical techniques with more sophisticated techniques based on ML models. ‘Boosting’ is usually intended for ‘completing’ a sample with synthetic respondents to represent niche groups of the research population that are hard to reach (i.e., human participants are expensive in time and money to recruit); this approach produces a mixed sample of human and synthetic respondents.
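The contrast drawn above between ‘naive’ imputation and model-based imputation can be made concrete with a small sketch. The example is illustrative only: the data rows, the column meanings (age, satisfaction, purchase intent) and the function names `mean_impute` and `knn_impute` are assumptions made for this illustration, and a simple nearest-neighbour rule stands in for the more sophisticated ML models that may be applied in practice.

```python
import math

def mean_impute(rows, col):
    """'Naive' imputation: fill missing values in `col` with the observed mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = sum(observed) / len(observed)
    return [r[col] if r[col] is not None else fill for r in rows]

def knn_impute(rows, col, k=2):
    """Fill missing values in `col` with the mean of the k most similar
    respondents, where similarity is Euclidean distance on the other columns.
    Assumes the other columns are complete and on comparable scales."""
    donors = [r for r in rows if r[col] is not None]
    other = [i for i in range(len(rows[0])) if i != col]
    filled = []
    for r in rows:
        if r[col] is not None:
            filled.append(r[col])
            continue
        nearest = sorted(
            donors,
            key=lambda d: math.dist([r[i] for i in other], [d[i] for i in other]),
        )[:k]
        filled.append(sum(d[col] for d in nearest) / len(nearest))
    return filled

# Hypothetical respondents: (age, satisfaction 1-5, purchase intent 1-5)
rows = [
    (25, 2, 1),
    (27, 2, 2),
    (61, 5, 5),
    (64, 4, 5),
    (26, 2, None),  # purchase intent left unanswered
]

print(mean_impute(rows, col=2)[-1])       # 3.25: the overall average
print(knn_impute(rows, col=2, k=2)[-1])   # 1.5: borrowed from the two most similar respondents
```

In this toy data the two methods give clearly different fills for the same respondent. Either way, the value remains a calculated guess rather than an observed answer.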

The critical criterion for the acceptance of these techniques is the extent to which researchers rely on filling in absent human responses with synthetic, AI-generated responses. For instance, ‘boosting’ is preferred by its advocates to sample ‘weighting’, which might distort data distributions; but calculated guesses of how participants would answer, based on learning from previous knowledge (i.e., in training data), can cause other forms of distortion, perhaps even more substantial ones. The looming problem is the illusion of relying on real-world responses when the data actually involve too many assumptions or calculated guesses, reasonable as they may sound.

The more daring approach, however, whereby a sample is composed wholly of AI-driven synthetic respondents, attracts the most attention and excitement in the field. Devout advocates describe the advantages of this methodology and its outcomes (explained further below), proclaim the demise of traditional surveys based on human samples, and conclude that in coming years the surveys based on samples of synthetic respondents will become the norm, while those based on human samples will be rendered redundant. That is a bit of a strange argument because the AI-oriented models are founded (mostly) on real-world data of human consumers in general and as participants in surveys in particular.

Firstly, training data is required to provide background on demographic characteristics, personality traits, attitudes, and other attributes of consumers [1] to construct personas with profiles of the synthetic respondents in the sample. Secondly, knowledge of past behaviours of consumers (e.g., shopping & purchase behaviour, response behaviour) makes it possible to devise meaningful responses that fit the personas of synthetic respondents. Models based on ML algorithms are used for numeric responses (structured questions), whereas verbal responses (free text) can be composed by drawing on Large Language Models (LLMs) under the branch of Generative AI (Gen AI), much talked about in the past three years. It must be acknowledged, nevertheless, that all of the new applications of synthetic responses depend on the reliability, accuracy and relevance of the real-world training data (i.e., about humans) that models learn from to make inferences about missing responses (partially or completely).

The use of synthetic data is hardly new to statistical analyses. For instance, in order to solve difficult estimation problems, multiple datasets (‘samples’) are drawn to simulate distributions of the relevant variables and from them compute the expected value and variance of estimates of model parameters, following the methodology of Monte Carlo simulations. The more recent technologies and methods are giving rise to new objectives and possible uses in marketing and consumer research. Directions are already being conceived for possible purposes and uses of synthetic response data.
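As an aside on the Monte Carlo methodology mentioned above, the following minimal sketch shows the basic idea: many simulated datasets are drawn, a model parameter is estimated on each, and the mean and variance of those estimates are computed. The model (a straight line through noisy data), the parameter values and the function name `monte_carlo_slope` are assumptions chosen for illustration only.

```python
import random
import statistics

def monte_carlo_slope(n_datasets=2000, n_obs=50, true_slope=2.0,
                      noise_sd=1.0, seed=42):
    """Simulate many datasets from y = true_slope * x + noise, estimate the
    least-squares slope on each, and return the mean and variance of the
    estimates across simulations."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_datasets):
        x = [rng.uniform(0, 10) for _ in range(n_obs)]
        y = [true_slope * xi + rng.gauss(0, noise_sd) for xi in x]
        mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        estimates.append(sxy / sxx)  # ordinary least-squares slope
    return statistics.fmean(estimates), statistics.variance(estimates)

mean_estimate, variance_estimate = monte_carlo_slope()
print(f"mean of slope estimates: {mean_estimate:.3f}")
print(f"variance of slope estimates: {variance_estimate:.5f}")
```

The point of the exercise is that the simulated ‘samples’ serve a statistical purpose (studying the behaviour of an estimator) and are never presented as answers given by real people.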

For example, in survey research, samples of synthetic respondents can be utilised to pre-test questionnaires and examine the coherence of questions and the plausibility of results before fielding a survey study. It is argued, with reason, that at an early stage the synthetic response patterns are adequately accurate and realistic for testing purposes and thus one can save the cost and time of conducting pre-test or pilot studies with human respondents. In the area of new product development (NPD), early tests can be performed for prototype versions of the product in focus with synthetic respondents. Additionally, researchers can simulate the reactions of consumers in the market to a new product based on knowledge of past behaviours with similar products. A careful and sensible utilisation of these methods may save high costs in the NPD research process. It is critical to note that these applications are meant for preliminary and testing purposes, to corroborate hypotheses and plan the next moves, not for replacing human consumer-respondents in all stages of research.

Undeniably, the motivation to turn to solutions of synthetic response data arises from the increasing difficulties and intrusions that jeopardise the quality of data in surveys, a problem that has been escalating over the past two decades. Difficulties start with recruiting engaged participants. The threat to the quality of data in quantitative survey research may come from bots (‘click farms’), tech-enabled fraudsters, and hyperactive (‘professional’) human respondents (usually focused on getting benefits rather than on answering) [2]. Respondent fatigue (e.g., with questionnaires that take longer than fifteen minutes to complete, due to repetitive types of questions), an increasingly frequent problem, may lead to quitting a survey prematurely and, probably even worse, to less reliable responses by respondents who stay on but lose concentration and no longer think through their answers (‘automatic’ response patterns, such as ‘straight-lining’ in response to items on Likert-type scales). Whether fake, inauthentic or less reliable, such responses all put in question the quality of data and thereby the reliability of survey results.
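To illustrate how a pattern such as ‘straight-lining’ might be screened for in practice, here is a minimal sketch. The rule used (flag a respondent when a single scale point accounts for at least 90% of the answers in a grid of Likert-type items), the threshold and the function name `is_straight_lining` are assumptions for illustration, not an established standard.

```python
from collections import Counter

def is_straight_lining(ratings, max_share=0.9):
    """Return True if the most frequent answer makes up at least `max_share`
    of a respondent's ratings on a grid of Likert-type items."""
    if not ratings:
        return False
    most_common_count = Counter(ratings).most_common(1)[0][1]
    return most_common_count / len(ratings) >= max_share

# Hypothetical answers to ten items on a 1-5 agreement scale
attentive = [4, 2, 5, 3, 4, 1, 2, 4, 3, 5]
suspect = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

print(is_straight_lining(attentive))  # False
print(is_straight_lining(suspect))    # True
```

A flag of this kind is only a prompt for closer inspection: a respondent may hold uniform opinions genuinely, so such rules are typically combined with other checks (e.g., completion time) before a case is removed.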

Proponents of methods associated with the use of synthetic response data point to some solutions. Imputation processes based on ML models may be applied to infer the responses to questions left unanswered by respondents who abandoned the survey due to fatigue or boredom. But the advocates go further to suggest that samples consisting fully of synthetic respondents can bypass the problem of fatigue and allow utilisation of long questionnaires. They also propose that surveys based on synthetic respondents can avoid various types of inattentive and faulty response patterns by human participants. In an intriguing interpretation of imputing, new questionnaires would be artificially introduced to existing human respondents, that is, anticipating or inferring how those respondents might react to or rate additional brands, products or other kinds of stimuli [2]. (Note: In practice that may imply the creation of synthetic ‘replicas’ of the human respondents based on knowledge of their background characteristics and central tendencies in order to impute or predict responses that were never prompted before.)

Crucially, the quality of synthetic data generated at any level depends on the quality of real-world data employed for learning in the first place. As in traditional surveys, where researchers need to ‘clean’ their data of fraudulent cases, biases and flaws before analyses, it is likewise necessary to do so before using primary data as input to learning algorithms. If the models that generate synthetic data are trained on data sets replete with fraud, biases or unreliable, faulty responses, the synthetic data will not just replicate errors but perpetuate and amplify them [2]. Learning ‘response behaviour’ means learning the substance of information in consumer responses while singling out or ‘unlearning’ flawed forms of responses that should not be replicated. However, applying synthetic data leaves an uncomfortable feeling, which is entangled with an ethical issue: ‘studies’ are carried out by creating ‘clones’ of human respondents and inferring or imputing their responses without really giving consumers the opportunity to express themselves [1].

Colin Strong (Ipsos, MRS/Delphi, [1]) argues that researchers also need to consider behavioural aspects of responding to surveys. In-depth interviews, but also open-ended questions in survey questionnaires, give consumer-participants more room to express themselves verbally, make sense of things and construct their answers as in a conversation (i.e., behave more in a ‘constructivist’ way). However, Strong posits that even with regard to structured survey questionnaires (with mostly close-ended questions), it would be mistaken to treat responding to them as ‘computer-like’. The cognitive process may involve how consumer-participants feel about their responses (e.g., confidence) and the ease with which they retrieve information from memory as they choose the appropriate answer for them (‘felt-fluency’); this can affect their self-perception and decision-making behaviour while responding to questions [1]. (Note: Strong refers to the research work of Norbert Schwarz, professor of psychology & marketing, on metacognition, processing fluency and ‘feelings-as-information’.)

In the context of NPD, there is often concern, and some agitation, among researchers that consumers do not truly understand new innovative products (e.g., their properties, uses, and potential benefits & flaws) when asked to react to representations of them, particularly in early stages of development. Consumers are also less likely to hold well-defined preferences in view of advanced technological developments. Hence, NPD researchers need to treat the responses of participants with caution. Yet, synthetic data that relies on past reactions to previous new products may not be relevant enough (e.g., because products are not adequately similar) to transfer to an innovative product in focus. The use of synthetic (response) data thus has limitations of its own in this regard: such data can replicate known patterns but struggle to extrapolate in less-explored categories or predict future preferences, in particular when novel features or niche markets are involved [4].

Mark Ritson (a professor of marketing, consultant and celebrated speaker) strongly advocates the use of synthetic data to substitute for human survey responses. He recommends its use particularly in B2B research, where the sought-after respondents are high-ranking managerial decision-makers who are very difficult and expensive to recruit for interviews. However, Ritson suggests that employing synthetic data is due to take hold in marketing research (MR) in general [4]. Notably, researchers are likely to come across hard-to-reach respondents in various segments of professionals (white collar, technical), academics, and consumers (e.g., mothers of recently born babies, elderly people aged 70+, ethnic minorities). Ritson cites the claim of Evidenza (an analytics firm he supports) that synthetic responses achieve a 90%-95% match (correlation) with human survey responses on the same survey questions [4]. Somewhat provocatively, he promotes the belief that human respondents may not be necessary in the future because their likely responses or reactions might be inferred very accurately (‘as good as real’) by other means and sources of knowledge; that sounds like quite an inconvenient projection. Yet, Ritson notes that the MR field is still in an early stage of developing such capabilities.
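For readers who wish to see what a ‘match (correlation)’ between synthetic and human responses could mean computationally, the sketch below computes a Pearson correlation between two sets of question-level results. How Evidenza actually computes its figure is not described in the source; the data, the function names `pearson_r` and `mean_abs_diff`, and the choice of comparing per-question mean scores are all assumptions made for illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)

def mean_abs_diff(a, b):
    """Average absolute gap between paired values (sensitive to level shifts)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical mean scores per question (1-5 scale), human vs synthetic sample
human = [3.8, 2.1, 4.4, 3.0, 2.7, 4.1]
synthetic = [4.3, 2.7, 4.8, 3.6, 3.1, 4.7]

print(f"correlation: {pearson_r(human, synthetic):.3f}")
print(f"mean absolute difference: {mean_abs_diff(human, synthetic):.2f}")
```

A correlation coefficient captures whether the two sets of results rise and fall together, but it is insensitive to a systematic shift in level. For that reason a high correlation alone does not show that synthetic scores equal human scores, and a complementary measure such as the mean absolute difference is worth checking alongside it.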

Ritson refers to a different approach taken in this area: drawing the necessary knowledge from online sources, including websites of different types (e.g., news, companies & brands), social media networks, and various text documents (e.g., articles, books) that can be retrieved from those sources. That approach relies on learning with LLMs; subsequently, Gen AI conversational agents can be utilised to produce verbal responses. The approach raises some critical concerns and leaves some open questions: How relevant and focused are the online sources for devising answers in surveys on any particular topic? How does this approach practically help in predicting responses to structured questions in surveys? There are furthermore ethical difficulties (e.g., privacy, lack of consent) in applying open and public online sources, including the conversations of people on social media, in order to generate responses that mimic consumers’ own responses.

Steven Snell (Rep Data, [2]) recommends four actions to be taken in a multipronged approach for responsible use of synthetic data in surveys: (1) Start with quality respondents (i.e., founded on reliable training data from past studies); (2) Combine synthetic and real-time human responses (i.e., supplement, not replace real-world consumer feedback); (3) Increase transparency in data training (e.g., data sources, validation methodologies); and (4) Monitor for unintended distortions (i.e., validate and calibrate against real-world survey results). The second and fourth actions highlight in particular the importance of integrating human response data (prior & current) with synthetic response data. Stephan Basson (Factworks, [3]) also stresses that synthetic data should be viewed, at least for now, “as a complement, not replacement for traditional research methods”; he explains: “By combining real and synthetic response data, researchers can leverage the strengths of both approaches, using synthetic data to address challenges of fatigue and fraud, while still grounding insights in real-world behaviors and attitudes”.

Using synthetic response data in surveys, which includes employing synthetic respondents (as ‘replicas’ of, or similar to, actual human respondents), can help researchers resolve known problems and overcome flaws in surveys with human participants. We may find, however, differing interpretations of this practice (e.g., how far do researchers allow ‘imputing’ to extend? what constitutes ‘synthetic respondents’?) and of the methodologies used (e.g., sources and forms of training data). Yet, it is the line of thinking that is troubling: What would be the point in conducting marketing surveys if researchers actually impute, infer or predict the responses anticipated from human consumers or customers without truly asking them about their perceptions, attitudes and intentions? This could be critical when synthetic responses are used as the ground for drawing marketing conclusions and making managerial decisions. The correct and essential direction to follow seems, therefore, to be combining real-world (prior and real-time) human response data with synthetic AI-generated response data, and it may very well be necessary and advisable to continue on this course from now and into the future.

Ron Ventura, Ph.D. (Marketing)

References:

[1] “Using Synthetic Participants for Market Research”, MRS Delphi Group – Report, 2024. MRS is the Market Research Society in the UK and Delphi Group is its subdivision specialising in decision making; Colin Strong is Head of Behavioural Science at Ipsos & Chair of MRS Delphi Group.

[2] “Is Research Ready for Synthetic Data?”, Steven Snell, Quirk’s Media, 17 June 2025 (Snell is SVP & Head of Research at Rep Data).

[3] “4 Trends Shaping Market Research in 2025”, Stephan Basson, Greenbook, 19 December 2024 (Basson is Content Marketing Manager at Factworks, see Trend 1 on Synthetic Data).

[4] “Synthetic Data Is As Good As Real — Next Comes Synthetic Strategy”, Mark Ritson, Marketing Week, 13 June 2024 (Ritson received his PhD in Marketing from Lancaster University and is more recently the founder of a MiniMBA course programme at its Management School).