1 Introduction
Conversational information retrieval, also known as
conversational search, refers to the process of retrieving relevant information in response to a natural language conversation or query. The primary goal of a conversational search system is to satisfy the user’s information need by retrieving relevant information from a given collection. To successfully do so, the system needs to have a clear understanding of the underlying user need. Since users’ queries are often under-specified and vague, a mixed-initiative paradigm of conversational search allows the system to take the initiative in the conversation and ask the user clarifying questions or issue other requests. Clarifying the user information need has been shown to benefit both the user and the conversational search system [
4,
32,
77], providing a solid motivation for such mixed-initiative systems.
However, evaluating such mixed-initiative conversational search systems remains challenging [
47]. The challenge arises from expensive and time-consuming user studies required for holistic evaluation of conversational systems [
20]. Such studies require real users to interact with the search system for several conversational turns and provide answers to potential clarifying questions prompted by the system. A relatively simple solution is to conduct offline corpus-based evaluation [
4]. However, this limits the system to selecting clarifying questions from a pre-defined set of questions, which does not transfer well to a real-world scenario. Moreover, such offline evaluation remains limited to single-turn interaction, as the pre-defined questions are associated with answers that are unaware of previous interactions. User simulation has been proposed to tackle the shortcomings of corpus-based and user-based evaluation methodologies. A simulated user aims to capture the behaviour of a real user, i.e., being capable of multi-turn interactions on unseen data, while remaining scalable and inexpensive like other offline evaluation methods [
7,
56,
78].
In this article, we extend our conversational
User Simulator (USi), proposed in Sekulić et al. [
61], to explore utterance formulation of
large language model (LLM)–based user simulators. Given an initial information need,
USi interacts with the conversational system by accurately answering clarifying questions prompted by the system. The answers are in line with the underlying information need and help elucidate the user’s intent. Moreover,
USi generates answers in fluent and coherent natural language, making its responses comparable to real users.
We experiment with two LLM-based approaches to simulate users. First, we base our proposed user simulator on a large-scale transformer-based language model. We fine-tune GPT-2 [
49] to generate answers to posed clarifying questions. This method was presented in our recent paper [
61]. Second, we use in-context learning, that is, prompting, a few-shot technique made possible with the next generation of LLMs, such as GPT-3 [
12], LLaMa [
71], and Chinchilla [
28]. A GPT-3-based method,
ConvSim, was recently proposed by Owoicho et al. [
44]. Both methods generate answers to clarifying questions in line with the initial information need, simulating the behaviour of a real user. In the first case, we ensure this through a specific training procedure, resulting in a semantically controlled language model. With the GPT-3-based simulator, we utilise in-context learning (i.e., prompting) to guide the model into following specific steps to answer the posed questions.
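To make the prompting-based setup concrete, the following is a minimal sketch in the spirit of ConvSim; the prompt wording, the few-shot example, and the `complete` callable are illustrative assumptions rather than the exact prompt or API used in Owoicho et al. [44].

```python
# Minimal sketch of a prompting-based user simulator.
# The prompt template and the `complete` function are illustrative assumptions.

FEW_SHOT_EXAMPLES = """\
Information need: Find cheap accommodation for students in Dublin.
Question: Are you looking for a hotel or a hostel?
Answer: Neither, I am looking for cheap student housing in Dublin.
"""

def build_prompt(information_need: str, history: list, question: str) -> str:
    """Assemble the in-context prompt from the facet, past turns, and the new clarifying question."""
    turns = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in history)
    return (
        "You are a search-engine user. Answer each clarifying question truthfully, "
        "based only on the stated information need.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Information need: {information_need}\n"
        f"{turns}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def simulate_answer(information_need, history, question, complete):
    """`complete` is any text-completion callable backed by an LLM (e.g., a GPT-3-style API)."""
    return complete(build_prompt(information_need, history, question)).strip()
```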
We evaluate the feasibility of our approaches with an exhaustive set of experiments, including automated metrics and human judgements. First, we compare the quality of the answers generated by our methods and several competitive sequence-to-sequence baselines by computing several automated
natural language generation (NLG) metrics. In Sekulić et al. [
61], we found that the GPT-2-based model significantly outperforms the baselines. This work extends the experiments to
ConvSim [
44], a GPT-3-based model, and finds even stronger performance. However, as automated NLG metrics are often unreliable indicators of answer quality, we further analyse a crowdsourcing study conducted in Owoicho et al. [
44] to assess how natural and useful the generated answers are compared with answers produced by humans. Furthermore, we extend this evaluation setting to a multi-turn conversational scenario. The crowdsourcing judgements show significant differences both in the
naturalness and
usefulness of answers generated by
USi and those generated by
ConvSim. The
ConvSim model outperforms USi, especially in the multi-turn setting, while performance relative to human responses remains comparable. Next, we perform a qualitative analysis of utterance reformulations generated by our LLM-based approaches in response to clarifying questions. We map our findings to recently proposed patterns for conversational recommender systems [
79] and find that user simulators tend to rewrite the original query to further explain the underlying information need. However, we note that the types of such reformulations depend heavily on the training data and the prompts given to the models. Finally, we discuss applications and future directions of LLM-based user simulators.
In summary, our contributions are the following.
—
We compare two streams of LLM-based user simulators using automated NLG metrics.
—
We analyse the types of utterances generated by the LLM-based methods.
—
We discuss in detail the potential applications of LLM-based user simulators, their cost, and their limitations. Moreover, we outline potential future work in the space of user simulation, aimed at going beyond answering clarifying questions.
The rest of the article is organised as follows. Section
2 reviews related work on the topic. Section
3 describes a user’s role in conversational search system evaluation and the desirable characteristics of a simulated user. In Section
4, we motivate and describe in detail the implementation of the two approaches to user simulation, covering both
USi [
61] and
ConvSim [
44]. In Section
5, we construct several experiments to answer key research questions on the feasibility of the proposed methods. In Section
6, we then analyse patterns identified in the simulators’ responses, compare them to human-generated answers, and extend an existing set of patterns for utterance reformulations. We present the results in Section
7. In Section
8, we discuss the advantages versus shortcomings of the approaches and outline our future work aspirations. In Section
9, we present our conclusions.
5 Evaluation Methodology
In this section, we describe our methodology for evaluating the proposed user simulation methods. We compare text generated by our simulators to human-generated text with regard to multiple aspects. First, we use automated NLG metrics to assess the differences between the two simulation methods. Second, we employ crowdsourcing to evaluate the
usefulness and
naturalness of the generated answers. Finally, we perform a qualitative analysis of the simulator’s utterance reformulations and map them into recently identified patterns (see [
79]). All of the comparisons are performed both in single- and multi-turn settings.
5.1 Single-Turn Conversational Data
For training and evaluating our proposed simulated user
USi, we utilise two publicly available datasets, Qulac [
4], and ClariQ [
3]. Both datasets aim to foster research in asking clarifying questions in open-domain conversational search. Qulac was created on top of the TREC Web Track 2009-12 collection. The Web Track collection contains ambiguous and faceted queries, often requiring clarification when addressed in a conversational setting. Given a topic from the dataset, clarifying questions were collected via crowdsourcing. Then, given a topic and a specific facet of that topic, workers were asked to provide answers to these clarifying questions. This results in tuples of (topic, facet, clarifying question, answer). Most of the topics in the dataset are multi-faceted and ambiguous, meaning that the clarifying questions and answers must align with the actual facet. ClariQ is an extension of Qulac created for the ConvAI3 challenge [
3] and contains additional non-ambiguous topics. Relevant statistics of the datasets are presented in Table
1.
We utilise these datasets by feeding the corresponding elements to Equation (
4). Specifically,
facet from Qulac and ClariQ represents the underlying information need, as it describes in detail the intent behind the issued
query. Q represents the currently posed clarifying question, whereas
answer is our language modelling target.
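To make the training setup concrete, the following is a minimal sketch of how such a tuple can be serialised into a single training sequence for the GPT-2-based simulator; the separator tokens and maximum length are illustrative assumptions rather than the exact configuration used for USi.

```python
# Illustrative serialisation of a (topic, facet, question, answer) tuple from Qulac/ClariQ
# into a causal-LM training example. Separator tokens are assumptions for illustration.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|need|>", "<|question|>", "<|answer|>"]}
)  # the model's embedding matrix would need resizing accordingly

def build_example(topic: str, facet: str, question: str, answer: str) -> dict:
    """The facet (information need) and question condition the model; the answer is the target."""
    text = f"<|need|> {topic}: {facet} <|question|> {question} <|answer|> {answer}"
    enc = tokenizer(text, truncation=True, max_length=256)
    # Standard causal-LM fine-tuning: labels mirror the input ids; the loss on the
    # conditioning part can additionally be masked out if desired.
    enc["labels"] = enc["input_ids"].copy()
    return enc
```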
5.2 Multi-turn Conversational Data
A significant drawback of Qulac and ClariQ is that they are both built for single-turn offline evaluation. A conversational search system will likely engage in a multi-turn dialogue to elucidate user needs. To bridge the gap between single- and multi-turn interactions, we construct multi-turn data that resembles a more realistic interaction between a user and the system. Our user simulator USi is then further fine-tuned on this data.
To acquire the multi-turn data, we set up a crowdsourcing-based human-to-human interaction. At each conversational turn, one crowdsourcing worker is tasked to behave as a search system by asking a clarifying question on the topic of the conversation. Then, another worker is tasked to provide the answer to that question, considering the underlying information need and the conversation history, imitating a real user’s behaviour. We collect 500 conversations up to a depth of three, i.e., three sequential question–answer pairs for a topic and its facet.
We construct several edge cases to further study the effects of specific clarifying questions on the search experience. In these cases, the clarifying question prompted by the search system is considered faulty, as it is either a repetition, off-topic, unnecessary, or completely ignores the user’s previous answers. We obtain answers to these questions to provide more realistic training data for our model, making our simulated user as human-like as possible. These clarifying questions are intended to simulate a conversational search system of poor quality and provide insight into users’ responses to such questions. We employ workers to provide answers to an additional 500 clarifying questions of poor quality, up to a depth of two. The specific edge cases and their descriptions with examples are presented in Table
2. We publicly release the acquired multi-turn datasets in Sekulić et al. [
61]. In this work, we use the multi-turn data to evaluate both fine-tuning-based and prompting-based approaches to generating answers to clarifying questions in a conversational setting.
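For illustration, a record of the collected multi-turn data could be represented as follows; the field names are assumptions chosen for exposition and do not necessarily match the released dataset’s schema.

```python
# Minimal sketch of a multi-turn conversation record; field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class ClarificationTurn:
    question: str       # clarifying question posed by the "system" worker
    answer: str         # answer provided by the "user" worker
    question_type: str  # e.g., "natural", "repeat", "off-topic", "unnecessary"

@dataclass
class Conversation:
    topic: str
    facet: str                                                    # underlying information need
    turns: list = field(default_factory=list)                     # up to depth three ClarificationTurns
```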
5.3 Research Questions
We aim to evaluate whether our proposed simulated user can replace real users in answering clarifying questions of conversational search systems, which would make evaluating such systems significantly less troublesome. Overall, we aim to answer four main research questions, extending the list from Sekulić et al. [
61]:
RQ1:
To what extent are the answers generated by the two simulation methods in line with the underlying information need?
RQ2:
How coherent and natural is the language of the generated answers?
RQ3:
How do LLM-based simulators behave in multi-turn interactions?
RQ4:
What are the advantages and disadvantages of either simulation methodology?
In order to address these questions, we first compute several NLG metrics to compare the generated answers to the oracle human answers from ClariQ. As several NLG metrics have received criticism from the NLP community, notably because they do not correlate well with the coherence of the generated text, we perform a crowdsourcing study to evaluate the naturalness of the generated answers. To evaluate whether the generated answers align with the underlying information need, we conduct an additional crowdsourcing study and assess the usefulness of the answers. Finally, we perform a qualitative analysis of the generated answers by identifying certain patterns in utterance formulations.
As it was done in Sekulić et al. [
61], we compare our LLM-based user simulators with two sequence-to-sequence baselines. The first baseline is a multi-layer bidirectional
long short-term memory (LSTM) encoder–decoder network for sequence-to-sequence tasks [
68].
The second baseline is a transformer-based encoder–decoder network, based on Vaswani et al. [
73]. We perform a hyperparameter search to select the models’ learning rate, number of layers, and hidden dimension. Both baselines are trained with the same input as our primary model.
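As an illustration of the hyperparameter search, the sketch below enumerates a small grid over learning rate, number of layers, and hidden dimension; the concrete values are assumptions, not the exact grid we searched.

```python
# Illustrative hyperparameter grid for the seq2seq baselines; values are assumptions.
from itertools import product

GRID = {
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "num_layers": [2, 4],
    "hidden_dim": [256, 512],
}

def configurations(grid):
    """Yield every combination of hyperparameter values in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```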
5.4 Automated NLG Metrics
We first study the language-generation ability of
USi and the previous baselines. We compute several standard metrics for evaluating the generated language. We use two widely adopted metrics based on n-gram overlap between the generated and the reference text. These are BLEU [
45] and ROUGE [
37]. Next, we compute the EmbeddingAverage and SkipThought metrics to capture the semantics of the generated text. EmbeddingAverage is based on the word embeddings of each token in the generated and the target text and is defined as the cosine similarity between the means of the word embeddings in the two texts [
35]. The models are trained on the ClariQ training set and evaluated on the unseen ClariQ development set. We evaluate on ClariQ’s development set because the test set does not contain question–answer pairs; a small portion of the training set is instead held out for model development. The answers generated by
USi and the baselines are compared against oracle answers from ClariQ, generated by humans.
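The following sketch shows how such scores could be computed per generated–reference answer pair, assuming NLTK for BLEU, the rouge-score package for ROUGE, and a pretrained word-embedding lookup (e.g., GloVe) for EmbeddingAverage; it is illustrative and omits SkipThought, which requires a pretrained sentence encoder.

```python
# Sketch of per-pair automated scoring under the stated assumptions.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
smooth = SmoothingFunction().method1

def embedding_average(generated_tokens, reference_tokens, word_vectors):
    """Cosine similarity between the mean word embeddings of the two token sequences."""
    gen = np.mean([word_vectors[t] for t in generated_tokens if t in word_vectors], axis=0)
    ref = np.mean([word_vectors[t] for t in reference_tokens if t in word_vectors], axis=0)
    return float(np.dot(gen, ref) / (np.linalg.norm(gen) * np.linalg.norm(ref)))

def score_pair(generated: str, reference: str, word_vectors) -> dict:
    gen_tok, ref_tok = generated.lower().split(), reference.lower().split()
    return {
        "bleu": sentence_bleu([ref_tok], gen_tok, smoothing_function=smooth),
        "rougeL": rouge.score(reference, generated)["rougeL"].fmeasure,
        "emb_avg": embedding_average(gen_tok, ref_tok, word_vectors),
    }
```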
5.5 Response Naturalness and Usefulness
To simulate a real user, the generated responses by our model need to be fluent and coherent. Thus, we study the
naturalness of the generated answers. We define
naturalness as an answer being natural, fluent, and likely produced by a human. Similarly, fluency [
14] and humanness [
57] have been used for evaluating generated text. We also assess the
usefulness of the answers generated by our simulated user. We define
usefulness as an answer that aligns with the underlying information need and guides the conversation towards the topic of the information need. This definition of usefulness can be related to similar metrics in previous work, such as adequacy [
65] and informativeness [
16].
We perform a crowdsourcing study to assess the naturalness and usefulness of generated answers to clarifying questions. We use Amazon Mechanical Turk to recruit workers based in the United States with at least a 95% task approval rate. The study was done in a pair-wise setting, i.e., each worker was presented with several answer pairs. One answer in each pair was generated by our model, and the other was written by a human, taken from the ClariQ collection. The workers’ task was to judge which answer was more natural or more useful, depending on the study. Workers were provided with the context, i.e., the initial query, the facet description, and the clarifying question.
We annotate 230 answer pairs for naturalness and 230 answer pairs for usefulness, each judged by two crowd workers. We define a win for our model if both annotators voted our generated answer as more natural/useful, and a loss if both voted the human-generated answer as more natural/useful. If the two workers voted differently on an answer pair, we count it as a tie. With this study, we aim to shed light on research questions RQ1 and RQ2, i.e., whether the generated answers are natural and in line with the underlying information need compared with human-generated answers. Additionally, we compare Transformer-seq2seq to USi.
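The aggregation rule can be summarised by the following short sketch, which mirrors the win/loss/tie definition above.

```python
# Pairwise aggregation: a "win" requires both annotators to prefer the simulator's
# answer, a "loss" requires both to prefer the human answer, disagreement is a "tie".
from collections import Counter

def aggregate(votes_per_pair):
    """`votes_per_pair` is an iterable of (vote_1, vote_2), each vote in {"simulator", "human"}."""
    outcome = Counter()
    for v1, v2 in votes_per_pair:
        if v1 == v2 == "simulator":
            outcome["win"] += 1
        elif v1 == v2 == "human":
            outcome["loss"] += 1
        else:
            outcome["tie"] += 1
    return outcome

# Example: aggregate([("simulator", "simulator"), ("simulator", "human")])
# -> Counter({"win": 1, "tie": 1})
```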
We also compare the two LLM-based simulation approaches in a multi-turn conversational setting. The results of both single- and multi-turn comparisons are presented in Section
7.2.
6 Responses to Clarifying Questions
In this section, we analyse human- and simulator-generated answers to posed clarifying questions. Specifically, we conduct expert annotation to identify patterns in the given answers, grounding our findings in prior work. To this end, we analyse the answers in light of patterns identified by Krasakis et al. [
34], focusing on the Qulac dataset [
4]. Krasakis et al. [
34] find that users’ answers vary in polarity and length. For example, the user can answer with a short negative answer, such as
“No”, but also potentially provide a longer answer, e.g.,
“No, I’m looking for X instead”. Naturally, the answer can also be of positive polarity, depending on the information need and the prompted clarifying question. Furthermore, we compare the generated answers to patterns identified by Zhang et al. [
79]. Although Zhang et al. [
79] focus on query reformulations in conversational recommender systems, we find a high overlap with our findings. Thus, we map their proposed query reformulation types to answers in mixed-initiative conversational search. Finally, we analyse answers to the faulty clarifying questions introduced in Sekulić et al. [
61].
6.1 Response Patterns
We analyse answers to prompted clarifying questions in light of previously identified utterance reformulation types [
79]. In other words, we map and expand the existing utterance reformulation ontology for conversational recommender systems to answer formulations in conversational search. While certain differences exist between recommender and search systems, our initial analysis suggested that the shared conversational setting elicits similar user behaviours. In their study, Zhang et al. [
79] analyse how users reformulate their utterances in subsequent turns when the conversational recommendation agent signals a lack of understanding of the user’s needs. Similarly, in conversational search, we have the user’s initial query, the clarifying question prompted by the search system, and the user’s answer. Thus, we analyse these answers through the lens of reformulations of the user’s initial query.
Zhang et al. [
79] identify seven utterance reformulation behaviours: (1)
start/restart – user starts to present their need; (2)
repeat – user repeats previous utterance without significant change; (3)
repeat/rephrase – user repeats last turn with different wording; (4)
repeat/simplify – user repeats the previous utterance with a simpler expression, reducing complexity; (5)
clarify/refine – user clarifies or refines the expression of an information need; (6)
change – user changes the information need (topic shift); (7)
stop – user ends the search session. We encourage an interested reader to refer to Zhang et al. [
79] for a more elaborate explanation of the reformulation types. In our analysis, we focus specifically on answers to clarifying questions; thus, some utterance types are not observed in our setting. By the design of our research setting, described in Section
4, we do not deal with utterance types (1)
start/restart, (6)
change, or (7)
stop. However, we add two additional categories, mostly to deal with edge cases: (8)
hallucination – when the provided answer is not in line with the underlying information need and (9)
short answer – when the answer is just “no” or “yes”. Examples of the observed utterance types are presented in Table
4.
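For clarity, the resulting label set can be written as a small schema; the identifiers below are for exposition only and are not part of the original annotation tooling.

```python
# Labels used in our expert annotation, expressed as a small enum for clarity.
# Types (1), (6), and (7) from Zhang et al. [79] are omitted because they do not
# occur in our setting; (8) and (9) are the categories we add.
from enum import Enum

class AnswerType(Enum):
    REPEAT = "repeat"                    # (2) repeats previous utterance without significant change
    REPEAT_REPHRASE = "repeat/rephrase"  # (3) same content, different wording
    REPEAT_SIMPLIFY = "repeat/simplify"  # (4) same content, simpler expression
    CLARIFY_REFINE = "clarify/refine"    # (5) clarifies or refines the information need
    HALLUCINATION = "hallucination"      # (8) answer not in line with the information need
    SHORT_ANSWER = "short answer"        # (9) bare "yes" or "no"
```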
6.2 Responses to Faulty Clarifying Questions
In order to gain further insight into designing a reliable user simulator for conversational search evaluation, we must adapt it to be resilient to unexpected system responses. For example, if a conversational search system responds with an off-topic clarifying question or an unrelated passage, the simulated user needs to react in a natural, human-like manner. However, to design such a simulator, we first need to learn how real users would react to incorrect responses from the search system. To this end, we acquired a dataset of human responses when prompted with faulty clarifying questions. The published dataset is multi-turn and can thus be used to improve our multi-turn user simulator model.
Examples from the acquired dataset are presented in Table
2. The dataset contains several scenarios in which a conversational search system asks follow-up clarifying questions. We acquired a dataset of 1,000 conversations, with crowd workers assuming the user role and responding to clarifying questions. Initial analysis of the crowd workers’ answers offers several insights. In the case of appropriate clarifying questions (
Natural), users tend to respond naturally by refining their information needs, as expected. However, in the case of faulty clarifying questions (
repeat,
off-topic, or
similar), users either repeat their previous answer (20% of analysed answers), expand their last reply with more details on their information need (23%), or rephrase the previous answer with different wording (37%). Next, we aim to evaluate the resilience of our proposed
USi to such faulty questions by analysing how closely its responses correspond to human-generated answers.
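As a purely illustrative aid (not the annotation procedure used in the study), a simple token-overlap heuristic along the following lines can separate the three dominant strategies observed above: repeating, expanding, or rephrasing the previous answer.

```python
# Illustrative heuristic for labelling a user's response to a faulty clarifying
# question relative to their previous answer; thresholds are assumptions.

def classify_response(previous_answer: str, new_answer: str) -> str:
    prev, new = set(previous_answer.lower().split()), set(new_answer.lower().split())
    overlap = len(prev & new) / max(len(prev), 1)
    if new == prev:
        return "repeat"
    if prev <= new and len(new) > len(prev):
        return "expand"     # previous answer plus additional detail
    if overlap >= 0.5:
        return "rephrase"   # substantial overlap but different wording
    return "other"
```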