Conversation analysis and conversation design: Where the moonshots are
Conversation has been described as “the primordial site of human sociality”. Every turn during conversation is designed and used by humans for other humans to get different facets of life accomplished. We build, maintain, and end our personal and professional relationships through conversation. We buy and sell. We give and receive help. We are excited, persuaded, irritated, embarrassed, and consoled in response to things others say to us. Conversation is the tool we use to do all these things.
‘Conversation analysis’ is both a field of study and a research method for analyzing naturally occurring conversation, founded by Harvey Sacks, Emanuel Schegloff, and Gail Jefferson. For more than 50 years, its cumulative science (spanning sociology, psychology, linguistics, communication, and anthropology) has examined hundreds of thousands of cases of real conversation across myriad settings. It has shown both that and how conversation is systematic (not messy); that there is ‘order at all points’ (even in one single turn at talk); and that much of its core machinery is universal across languages.
The concept of ‘conversation design’ has become ubiquitous in the world of chatbots, virtual assistants, and conversational user interfaces (CUIs) — in human-computer interaction (HCI) research in academia and industry alike. For some, the aim of conversation design is to enable devices “to mimic human, turn-based conversation and to use natural language in written or spoken form”. For us, a more straightforward aim is to leverage human, turn-based conversation in order to allow users to do tasks with their technology. Yet, despite huge investment, massive technical progress, and the proliferation of smart assistants in phones and smart speakers, many studies cite a mismatch between user expectations and performance in conversational systems. This may reflect a long-standing tendency for naïve users to over-apply social rules and categories onto even relatively simple technologies. However, a recent research review argues that “no system is able to lead a half-decent coherent conversation”. Depending on one’s definition of ‘conversation’, then, there is something important missing from our interactions with automated agents, especially when they range across multiple turns and sequences, and in ‘open domain’ rather than more task-specific conversations.
Conversation analysis and conversation design seem like they should have a close and productive relationship. But with some notable exceptions, there has been surprisingly little interaction between the two domains. While Google’s ‘Conversation Design’ manual for its Assistant uses some aspects of linguistics, it draws especially on the language philosophy of Paul Grice and his notion of ‘conversational implicature’ to account for various forms of tacit meaning in everyday talk. This approach lets designers make broad assumptions (e.g., about the ‘cooperative’ intentions of speakers) that enable dialogue systems to infer the user’s goals in relation to specific tasks. However, this approach does not work well when applied to the fundamentally open-ended nature of conversation as it happens ‘in the wild’. Computational approaches tend to underpin their models with Herbert Clark and Susan Brennan’s notion of ‘grounding in communication’ — that communication is the process through which mental states are socially distributed. The driving assumption is that each individual’s objective and exogenous intentions and meanings need to be shared so that common ground can be achieved. However, conversation analysts argue that this — like Gricean pragmatics — provides a “confusing metaphor rather than a useful explanatory mechanism” — particularly when it comes to getting traction on how real conversation works.
Getting the basics right is an ever-relevant issue for conversation designers, now and into the future. In this article, we describe three fundamentals from conversation analysis that create moonshots for the future of conversation design.
1. Designing for recipients
‘Recipient design’ is the conversation analytic term for the communicative practices people use to tailor the design of their talk for — and show their ‘orientation and sensitivity’ to — the person they are talking to. Given the infinitely extendible ways in which people can put together every conversational turn — from how they say ‘hello’ to how they make a request — every turn reveals something about recipient design. People select words and discuss topics all “with an eye to who the recipient is and what the recipient knows about the reference”. So, if a speaker says, ‘Jane came for dinner’, they are treating their recipient as knowing who ‘Jane’ is. If a speaker knows that their recipient does not know Jane, they might refer to ‘Jane from the office’ or use a recognizable category in saying that ‘a colleague’ came for dinner. Not only do people design turns for what they take their recipients to know, they also orient to recipient design by correcting what they say mid-utterance (e.g., “I was talking to Jane and she said — the woman at my office — and she said”).
The notion that conversations are designed for recipients is also embedded in conversation design, through the construction of data-driven user ‘personas’ such as ‘Nerdy Nina’ or ‘Facilities Manager Fred’. Developing personas can help conversation designers to focus their language choices for a likely recipient. However, the personas, once created, are static. They can backfire by becoming monotonous or stereotypical. And, since the CUI also has a persona — such as Apple’s Siri — the very consistency in tone of voice required to convey Siri’s persona and its insensitivity to minor recipient-designed variations in the prosody, timing and intonation of the user’s “hey Siri” means that basic actions like greetings are treated as unvarying and identical by both Nina and Fred. The beauty of recipient design is out of the window from the opening “hey”.
We can see the importance and pervasiveness of recipient design by examining almost any natural conversational data — for example, how teleservice workers interact with real customers and clients. Consider these two differently designed turns soliciting contact information drawn from two different British CX conversations:
Example A: And if I can have your email address please.
Example B: And uh do you have an email address at all.
In both cases, the turn performs a request for information. Both start with an ‘and-preface’, used to embed the question within an ongoing sequence of straightforward-to-answer, ‘no problem’ questions. Both are designed, grammatically, as yes/no interrogatives. And, in terms of their pragmatic functions, it would be perfectly acceptable for an email address to be supplied in response to either. However, the questions differ fundamentally in terms of recipient design. While Example A builds in an assumption that the recipient will have no problem providing the information requested, Example B builds in a contingency that the recipient might not have an email address ‘at all’.
Example A is drawn from a call to a university contact centre for potential undergraduate students during the annual application cycle. It is reasonable to assume that most callers will be 18 to 20 years old and will be users of technology and email, since the application system used by universities in the UK is online. Not all callers fit this profile, however: parents, for instance, often call on behalf of their children. Meanwhile, Example B is from a call to a holiday company that conveys, tacitly, through the design and wording of its website, that it specializes in holidays for a ‘senior’ market segment. The call-takers use the many types of information that are available both explicitly and tacitly, from the phonetic qualities of speakers’ voices to what may be revealed elsewhere in the call, to shape the delivery of their requests using the principles of recipient design.
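As a minimal sketch of how a CUI might approximate this kind of recipient design, the Python below selects between the two request formats from Examples A and B. The `RecipientEvidence` class, the `design_email_request` function, and the 0.8 threshold are all hypothetical illustrations, not part of any existing system or a model of what the call-takers actually do. A real system would need to revise such evidence turn by turn rather than fix it in advance.

```python
from dataclasses import dataclass


@dataclass
class RecipientEvidence:
    """Hypothetical running estimate built from cues gathered so far in a call."""

    # Assumed probability (0.0 to 1.0) that this recipient has an email address.
    likely_has_email: float


def design_email_request(evidence: RecipientEvidence, threshold: float = 0.8) -> str:
    """Pick a request format whose built-in assumptions fit the recipient.

    The threshold is an arbitrary illustrative value, not an empirical finding.
    """
    if evidence.likely_has_email >= threshold:
        # Example A format: presumes the information can simply be supplied.
        return "And if I can have your email address please."
    # Example B format: builds in the contingency that there may be no email.
    return "And uh do you have an email address at all."


if __name__ == "__main__":
    print(design_email_request(RecipientEvidence(likely_has_email=0.95)))
    print(design_email_request(RecipientEvidence(likely_has_email=0.40)))
```

The point of the sketch is only that the wording of a request encodes assumptions about its recipient, so the choice of format should follow from evidence about that recipient rather than from a fixed persona.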
2. Designing compound conversations
For a truly human interaction, CUIs need to embed what we know about ordinary yet potentially complex conversational structures. One example is what conversation analysts call ‘compound turns’, in which two or more speakers jointly produce one syntactic unit together across two or more turns. The final product is something that just one speaker could produce, but it is done chorally (data are in US English):
01 Rich: if you bring it into them
02 Carol: it don’t cost yuh nothing
In this case, Rich’s first turn is the first part of an if-then syntactical structure which is completed by Carol in the second turn. Here is another example, in which Geri endorses the way that Shirley has completed the first part of her turn.
01 Geri: An’ it doesn’t matter at this point I’ve waited this long I c’n (wait)
02 Shirley: C’n wait another three weeks.
03 Geri: Yeah.
It seems almost magical that speakers can predict what someone was about to say so accurately that they can finish off each other’s turns without creating friction — that they can know each other’s worlds so well. In interpersonal relationships, ‘finishing another’s sentences’ is conventionally seen as a display of intimacy, whereas in organizational settings, we can see how this ability is the routine outcome of conversational structure rather than the product of some kind of ‘mind reading’.
Consider how ‘recipient design’ and ‘compound turns’ can play out in combination to produce this outcome. The next example is from a sales call in which the salesperson is about to elicit the name, address, and other details of a new customer. Data are in British English.
01 S: Let me take some details. You’re mister…?
02 C: Mister Cain.
The seller provides the first part, and the customer completes the compound turn. In this case, the salesperson does not already know the name of his interlocutor but has discerned enough about him to propose that this is a ‘mister’ — presumably by the pitch of his voice. Given technical developments in incremental speech processing, this kind of sequence seems relatively straightforward to implement and, from a Gricean perspective, it conforms to a certain notion of efficiency.
However, detailed analysis of this conversational construction highlights important social constraints on how it is used in practice. Our research found that compound turns in which the customer supplies their surname occurred only with hearably male customers. That is, sellers did not routinely ask hearably female customers, “You’re missus…?”. Rather, they asked women customers to provide a title by selecting from options: “Is it miss, missus, or ms?” ‘Mister’, as a generic title for men, conveys no further information about sexual orientation or marital status: it is used without risk of ‘error’. For a woman, however, a title conveys various possibilities about her biography. So, the example demonstrates how conversational minutiae actively constitute a social context in practice, and can therefore render that context amenable to empirical and, potentially, computational forms of analysis.
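The constraint described above can be sketched as a small selection rule: use the designedly incomplete “You’re mister…?” format only where a single title carries no further biographical implication, and otherwise offer options. The Python below is an illustrative sketch only. The function name `title_prompt`, the `heard_as` labels, and the fallback wording are hypothetical and are not drawn from the sales-call data; classifying a caller from their voice is itself error-prone, so the sketch falls back to an open request when no label is available.

```python
from typing import Optional


def title_prompt(heard_as: Optional[str]) -> str:
    """Choose how to elicit a customer's title and surname.

    `heard_as` is a hypothetical label ("male", "female", or None when
    the system has no usable evidence from the call so far).
    """
    if heard_as == "male":
        # First part of a compound turn: the customer completes it with a surname.
        return "You’re mister…?"
    if heard_as == "female":
        # A title for a woman implies biography, so offer options instead.
        return "Is it miss, missus, or ms?"
    # Assumed fallback wording (not from the data): presume no title at all.
    return "Can I take your name please?"


if __name__ == "__main__":
    for label in ("male", "female", None):
        print(label, "->", title_prompt(label))
```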
3. Designing the routine to achieve the extraordinary
Conversation analysts are sometimes accused of studying the mundane in too much depth; after all, this is just talk. Emanuel Schegloff (1986) wrote that, “if talk on the telephone may initially appear unworthy of sustained scrutiny, the beginnings of its episodes may intensify this sense of unworthiness” (p. 112). It is in these beginnings that reciprocal greetings (“Hello!”, “Hi!”) and ‘how-are-yous’ are exchanged. Schegloff describes these highly systematic components of conversational openings as mundane, ‘ritual’, ‘virtually automatic’ and almost ‘pre-scripted’, but notes that their presence achieves the routine. How do you convey that you are angry with the person you are calling, or are in a rush, or that something is an emergency? You dispense with greetings and ‘how-are-yous’, or do not reciprocate them. When any components are missing, we can infer that this conversation is not routine.
In the call below, from a British dataset collected by Emma Richardson, we see how the routine can be exploited to achieve actions that are the opposite of mundane. It comes from a corpus of emergency 999 calls to the police about domestic violence. The police call-taker is returning a call. We do not know the circumstances of the preceding call, but it may have been a follow-up to an original ‘active silent call’. The original caller answers the phone.
Conversations often start with reciprocal greetings and ‘how-are-yous’. By saying “hiya y’all right” at the top of the call, in overlap, and with precision timing so that it appears exactly when return greetings and ‘how-are-yous’ are due, the caller uses the sequential and structural norms of interaction to begin to convey to the call-taker that she is in danger. The call-taker quickly understands the situation and begins to ask yes/no questions, including one establishing that the woman is not “free to speak”. These systematic, structural components of conversation enable the caller to sound, to whoever may be overhearing, as though she is talking to someone other than the police.
Similarly, in the following 911 call, the caller designs two actions for two different recipients simultaneously: the potentially overhearing perpetrator of violence hears that she “would like to order a pizza for delivery” while the police dispatcher hears that she is in danger — and is requesting help. Notice how the caller adopts the same practice of starting her turns in overlap as we saw in the previous call with the precise placement of “hiya y’all right”.
CUIs have a long way to go before they can work with the empirical reality of conversation. The human participants achieve mutual understanding within a few turns. If you tell Google Assistant that you would like to order a pizza, it brings up a list of pizza restaurants and asks which one you’d like, gives their contact information and an option to call. Asking Alexa produces a list of cooking utensils for pizza (e.g., a crisper tray), and not a pizza delivery service. Neither agent covertly offers to call the police.
But what about the overall context of the conversation? One explanation for the comprehensibility of the women’s actions is the fact they are calling the police — not a pizza restaurant. This conceptualization of ‘context’ as a propositional variable that is either present, or lacking, is often cited in NLP and CUI research:
“[A]ny conversation has a specific context within which that conversation takes place. Even when humans have a conversation, context needs to be established first. If not, it leads to misunderstandings. Context within a conversation is the parts of the conversation that immediately precedes the current passage or words, allowing for clarity and meaning. Context assists both the user and the conversational Interface to construct meaning from the conversation.”
However, both of our emergency police calls show that such ‘bucket’ theories of context — that treat it as just another informational variable — are fundamentally flawed. This is because they reverse the order of inference that participants must follow to make sense of the call. It is not because the context is somehow pre-determined as a police emergency service that the parties understand that this is a genuine emergency, even though the caller dials an emergency number and asks for pizza. The context always has to be achieved, turn-by-turn, using interactional resources and methods such as recipient design and compound turn-taking. Indeed, many calls are legitimately categorized as ‘nuisance’ and sifting these from genuine requests is a key interactional challenge for emergency call-takers.
Shooting at the moon
Our aim in this article is to prompt discussion about what is needed in the conversational AI and conversation design domains to produce CUIs that are able to have a “half decent coherent conversation” — and to think about which of the ordinary yet complex things that people do might be a goal for the future. We have argued that Gricean assumptions may facilitate task-specific communication, but not moment-to-moment conversational inference. Conversation design already goes some way to achieving recipient design, but it could be embraced much more radically.
We conclude by returning to the issue of ‘context’ and ‘grounding’. We have seen that compound turns are tools, shaped by the action of recipient design to generate social contexts from the minutiae of talk. To start designing away the apparently intractable complexity of conversational ‘context’, we must start by recognizing how context is actively constituted turn-by-turn, in everyday talk. Conversation analysts ‘ground’ their inductive interpretations in a principle called the “next-turn proof procedure”; that is, how people show, in the way they respond to any prior turn, their own analysis of what they took it to be doing. When problems of speaking, hearing or understanding occur — and when they are relevant for the current interaction — evidence for those troubles also appears in the next turn. For instance, when one person says “Huh?”, the previous speaker often repeats or reformulates their prior turn.
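A minimal sketch of how a dialogue manager might treat the next turn as evidence of trouble: if the user's next turn is a bare repair initiator such as “Huh?”, the system redoes its prior turn rather than moving on. The `REPAIR_INITIATORS` set and both function names are hypothetical, and real repair initiation is far more varied than a keyword list can capture, so this illustrates the principle rather than a workable detector.

```python
import string
from typing import Optional

# Hypothetical, deliberately tiny list of open-class repair initiators.
REPAIR_INITIATORS = {"huh", "what", "sorry", "pardon"}


def is_repair_initiator(user_turn: str) -> bool:
    """Return True if the whole turn is a bare repair initiator like 'Huh?'."""
    normalized = user_turn.lower().strip(string.punctuation + string.whitespace)
    return normalized in REPAIR_INITIATORS


def next_system_turn(
    prior_system_turn: str,
    user_turn: str,
    planned_next: str,
    reformulation: Optional[str] = None,
) -> str:
    """Use the user's next turn as the 'proof' of how the prior turn landed."""
    if is_repair_initiator(user_turn):
        # Trouble displayed in next turn: reformulate if possible, else repeat.
        return reformulation or prior_system_turn
    # No trouble displayed: treat the prior turn as understood and move on.
    return planned_next


if __name__ == "__main__":
    asked = "And do you have an email address at all?"
    print(next_system_turn(asked, "Huh?", "Thanks, and your postcode?"))
    print(next_system_turn(asked, "Yes I do.", "Thanks, and your postcode?"))
```

The design choice here is that the system does not try to verify what the user 'really' understood. It acts only on what the user's next turn displays, which is the same evidence that human participants rely on.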
Schegloff describes these kinds of ‘repair procedures’ as a “structurally provided defence of intersubjectivity”. What he means by this is that people harness what Derek Edwards refers to as “the rich surface of language and social interaction” to recover sufficiently shared understandings to continue with a turn-by-turn conversation. This is where Gricean pragmatics and ‘common ground’ fall down. The amount of ‘cooperation’ and ‘grounding’ in a turn cannot be measured directly, operational definitions of these concepts are contested and, in general, the position of conversation analysis is that “[c]ommon ground cannot and should not be treated as an empirical fact.” Computational models that pursue fixing and clarifying what each person ‘objectively’ means often end up chasing something that does not actually occur in real conversation. Rather, the routine maintenance of turn-by-turn talk itself is the central objective of the conversation, over and above any information exchange task or some assumed overarching goal.
Reaching these conversational moonshots may better be achieved by targeting design priorities at what people are doing in conversation rather than the information they are exchanging through it — and augmenting natural language processing with natural action processing.