


Results in Engineering 17 (2023) 100856


A comparison of chatbot platforms with the state-of-the-art sentence BERT for answering online student FAQs

Kevin Peyton a,*, Saritha Unnikrishnan a,b

a Faculty of Engineering and Design, Atlantic Technological University, Sligo, Ireland
b Mathematical Modelling and Intelligent Systems for Health and Environment (MISHE), Atlantic Technological University, Sligo, Ireland
* Corresponding author. E-mail addresses: kevin.peyton@atu.ie (K. Peyton), saritha.unnikrishnan@atu.ie (S. Unnikrishnan).

https://doi.org/10.1016/j.rineng.2022.100856
Received 27 October 2022; Received in revised form 13 December 2022; Accepted 17 December 2022; Available online 21 December 2022.
2590-1230/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: Chatbots; SBERT; Natural language understanding; FAQs; Online learning

Abstract

Online learning enables academic institutions to accommodate increased student numbers at scale. With this scale comes a high demand on support staff for help in dealing with general questions relating to qualifications and registration. Chatbots that implement Frequently Asked Questions (FAQs) can be a valuable part of this support process. A chatbot can provide constant availability in answering common questions, allowing support staff to engage in higher-value one-to-one communication with prospective students. A variety of approaches can be used to create these chatbots, including vertical platforms, frameworks, and direct model implementation. A comparative analysis is required to establish which approach provides the most accuracy for an existing, available dataset.

This paper compares the intent classification results of two popular chatbot frameworks to a state-of-the-art Sentence BERT (SBERT) model that can be used to build a robust chatbot. A methodology is outlined which includes the preparation of a university FAQ dataset into a chatbot-friendly format for upload and training of each implementation. Results from the framework-based implementations are generated using their published Application Programming Interfaces (APIs). This enables intent classification using testing phrases and, finally, comparison of F1 scores.

Using ten intents comprising 284 training phrases and 85 testing phrases, it was found that an SBERT model outperformed all others with an F1-score of 0.99. Initial comparison with the literature suggests that the F1-scores obtained for Google Dialogflow (0.96) and Microsoft QnA Maker (0.95) are very similar to other benchmarking exercises where NLU (Natural Language Understanding) has been compared.

1. Introduction

Chatbots are intelligent conversational agents which have become an essential element of the customer management process for dealing with large numbers of online queries and support tasks. Customer service has evolved to the point where traditional voice and email interactions are now being replaced by self-service channels such as chatbots. According to van der Goot and Pilgrim [1], such channels are acceptable to customers once queries are answered in a fast, easy and convenient manner.

While chatbots are being widely implemented in sectors such as telecoms, finance and online commerce, Yang and Evans [2] note the absence of implementation in the educational sector. With online learning becoming ever more popular, it seems that the ‘24/7’ availability highlighted by Cunningham-Nelson et al. [3] can be extremely advantageous. By its very nature, online learning does not necessarily confine a student to synchronous engagement or to a shared geography or time zone. However, it does place a considerable burden on academic institutions as to how they scale their support to prospective students with questions relating to qualifications, fees, and other administrative queries.

Using chatbots to provide a dependable solution for answering common questions and queries allows support staff to be better utilised in interacting with prospective students on a personal, one-to-one level. There are many approaches that can be used to support the delivery of a chatbot that can deliver FAQs, and these include vertical platforms relevant to the education sector such as Mainstay¹ and more well-known platforms such as Dialogflow² from Google and LUIS³ from Microsoft. Direct implementation of a state-of-the-art (SOTA) technique for Natural Language Processing (NLP) may also be an option. While these approaches may differ in terms of market segment, technical approach and user environment, they all share a common factor: how effective and accurate the results are in terms of Natural Language Understanding (NLU).

This paper provides a comparative analysis of popular chatbot platforms with a SOTA SBERT implementation to establish which approach provides the highest NLU accuracy for an FAQ dataset associated with queries from prospective students applying to study online. Under Related work, approaches to NLU accuracy already reported in the literature are outlined. These appear to be mainly from generic, open domains such as travel, for example using station names along with arrival and departure times, which potentially enables easier intent identification. In this work, the dataset being evaluated is specific to a subset of queries within an educational domain for prospective students applying to study online. A comparative approach using this type of dataset has not been found in the literature. In the Methodology section, the frameworks and models used are briefly described. A detailed account is also given of dataset creation and of the implementation of a test harness for generating results. In the Results and Discussion section, results are discussed and compared to studies from the literature.

2. Related work

A review of publications that discuss benchmarking and analyse the results of NLU accuracy suggests several approaches. The comparison of results and features from platforms like Watson⁴ from IBM, Dialogflow, LUIS and RASA⁵ is common and suggests that general question answering services are well served by these platforms. It was found that the datasets being compared are primarily from open domains utilising relatively common intents such as those for restaurant booking and travel reservations.

This approach of using open domains was illustrated by Braun et al. [4], who investigated LUIS using queries from travel (206 queries) and online computing forums (290 queries). Wisniewski et al. [5] compared several systems including LUIS for building chatbots, using an open domain dataset comprising 328 queries. By contrast, Liu et al. [6] performed a cross-domain evaluation of NLU using 21 domains comprising 25k queries, whilst noting the difficulty the user might have in choosing between platforms.

A different approach was taken by Malamas et al. [7], who used the RASA platform within a healthcare domain and modified internal model settings. This experimentation used a hand-designed dataset of 142 questions, with the authors commenting that the collection, review and addition of new data is quite normal within the domain. Intriguingly, they also noted that intent similarity can be an issue where sentences may differ by only one or two words.

3. Methodology

3.1. Frameworks and models used

This study compared the NLU results of chatbot platforms from Microsoft and Google to those from a traditional Feedforward model and a SOTA Sentence BERT⁶ (SBERT) model. QnA Maker⁷ from Microsoft and Dialogflow from Google are both cloud-based NLP services that allow for the creation of conversational client applications, including chatbots. They are suitable for use when static, non-changing information such as FAQs is being used to return answers to the user. Both enable the importation of structured content (in the form of question and answer pairs) as well as semi-structured content (FAQs, manuals, documents) into a knowledge base. It is through this knowledge base that the user can easily manipulate and improve the information by adding different forms of how a question might be asked, as well as metadata tags associated with each question and answer pair.

Both QnA Maker and Dialogflow provide rich environments for authoring, building and publishing conversational clients by both developers and non-developers. In this study, these environments were used for the upload and training of our data on both platforms. In the case of QnA Maker, NLU is provided by direct integration with LUIS.

Along with QnA Maker and Dialogflow, a traditional Feedforward neural network model was utilised. Fig. 1 shows an implementation of one of these networks with one input, eight hidden layers and one output. The implementation in this study is based on the work by Loeber [8]. It used Python 3.7 with the PyTorch and nltk libraries and utilised hyperparameters including a batch size of 8, a learning rate of 0.001, a hidden size of 8 and 1000 epochs of training.

Fig. 1. A Feedforward neural network with one input, eight hidden layers and one output.
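For illustration, the following minimal sketch (not the authors' code) shows a Feedforward intent classifier of this kind in PyTorch, using an nltk bag-of-words input and the quoted hidden size, learning rate and epoch count. The vocabulary, example phrases and the use of two hidden layers are illustrative assumptions, and full-batch training is used here for brevity rather than the stated batch size of 8.

# Illustrative sketch only: a bag-of-words Feedforward intent classifier
# in the spirit of the Loeber [8] tutorial; not the authors' exact code.
import nltk
import numpy as np
import torch
import torch.nn as nn
from nltk.stem.porter import PorterStemmer

# nltk.download("punkt")  # required once for word_tokenize
stemmer = PorterStemmer()

def bag_of_words(sentence, vocabulary):
    # Binary bag-of-words vector for a tokenised, stemmed sentence.
    tokens = [stemmer.stem(w.lower()) for w in nltk.word_tokenize(sentence)]
    return np.array([1.0 if w in tokens else 0.0 for w in vocabulary],
                    dtype=np.float32)

class FeedforwardNet(nn.Module):
    # Input layer -> hidden layers of size 8 -> output layer over intents.
    def __init__(self, input_size, hidden_size, num_intents):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, num_intents),
        )

    def forward(self, x):
        return self.net(x)

# Hyperparameters as reported in the paper.
hidden_size, learning_rate, num_epochs = 8, 0.001, 1000

# Hypothetical toy data: bag-of-words vectors and their intent indices.
vocabulary = ["apply", "fee", "springboard", "start", "course"]
X_train = torch.from_numpy(np.stack([
    bag_of_words("How do I apply?", vocabulary),
    bag_of_words("What are the fees?", vocabulary),
]))
y_train = torch.tensor([0, 1])

model = FeedforwardNet(len(vocabulary), hidden_size, num_intents=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()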
Finally, an SBERT pre-trained transformer framework developed by Reimers and Gurevych [9] was implemented to evaluate the dataset in this study. Fig. 2 shows an example of the SBERT architecture computing a similarity score between two sentences to determine how similar in meaning they are.

Fig. 2. SBERT architecture showing Sentence A and Sentence B passed through the network, yielding embeddings U and V, with cosine similarity then applied [10].

At the core of the SBERT model is the pretrained BERT network developed by Devlin et al. [11], which enables bidirectional training to be applied to a word sequence using a transformer. A transformer uses an encoder-decoder architecture with self-attention, a mechanism that enables the model to make sense of the input it receives. This ability to make sense of language sequences has allowed BERT to be used for different types of tasks including classification, regression and sentence similarity. However, it performs poorly with semantic similarity search, a task that can be used to measure the similarity of texts and sentences in terms of a defined metric. Whereas the construction of BERT makes it unsuitable for semantic similarity search, Reimers and Gurevych [9] demonstrated that SBERT could perform this more efficiently by deriving semantically meaningful sentence embeddings, or vectors, that can be compared. Cosine similarity is used for this comparison process as it enables a measure of similarity to be ascertained for documents or words in a document. It offers the potential for improved text retrieval by understanding the content of the supplied text. This model is particularly interesting to the current study as a human may pose a question in a variety of ways; while the words may be different, the questions may be semantically similar in terms of meaning.

In this implementation of SBERT, Python 3.7 was used, with the sentence-transformers library utilising the "bert-base-nli-mean-tokens" pre-trained model. Training of the models and all testing for the platforms and models outlined here were conducted on a MacBook with an Intel Core i5 and 8 GB RAM.
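As an illustration of this setup, the sketch below (not the authors' code) embeds a few hypothetical training phrases with the "bert-base-nli-mean-tokens" model and ranks them against an incoming query by cosine similarity using the sentence-transformers util helpers.

# Illustrative sketch: semantic matching of a user query against FAQ
# training phrases with SBERT; example phrases and intents are hypothetical.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")

training_phrases = [
    "How do I apply for an online course?",           # intent: Applications
    "What fees do I pay for a Springboard+ course?",  # intent: sb_fees
    "When does the semester start?",                  # intent: Dates
]

query = "What is the closing date for applications?"

phrase_embeddings = model.encode(training_phrases, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every stored training phrase.
scores = util.cos_sim(query_embedding, phrase_embeddings)[0]
best = int(scores.argmax())
print(training_phrases[best], float(scores[best]))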
3.2. Dataset

The data used for this study originates in question and answer format from the online FAQs that are available for prospective online students at a Technological University. Fig. 3 shows an example of how this material was presented to the user on the website, in a collapsible format that allowed easy access to answers to common questions.

Fig. 3. Original format of FAQs on the website, which used an accordion for collapsible content.

This material was originally written to be quickly scanned by the student for answers to commonly asked queries received by administrative staff via phone and email. The dataset was divided into three categories, entitled About, Applications and Studying Online. A fourth category encompassing a special purpose Government funded training initiative known as Springboard+ [12] was also included. These four categories comprised a total of 41 question answer pairs. Table 1 outlines the original publishing details of the question answer pairs, showing an overall minimum question length of three words and an overall maximum question length of sixteen words.

Table 1
Question answer pairs of original website text.

Section           Question answer pairs   Question min length   Question max length
About             3                       5                     7
Applications      8                       5                     12
Studying Online   14                      5                     15
Springboard+      15                      3                     16
TOTAL             41

The existing structure and content of the data was well suited to easy assimilation by a human scanning for relevant information. It was recognised, however, that this format was not optimal for chatbot usage. For example, some answers were quite long, and others contained information not directly relevant to the question being asked from the perspective of a potential chatbot response.

For this reason, the initial content was used as the basis for a revised set of question answer pairs, optimised for use in a generic storage format. Fig. 4 shows a revised, generic question answer pair format utilising JavaScript Object Notation (JSON) that allows for the storage of the potential questions that will be asked. Each question, or utterance, and its associated data is stored in a node using a key/value, individual value or array of values [13].

Fig. 4. JSON snippet showing node structure of sample question answer pair.

Table 2 describes the specific structure and data related to an utterance. This structure allows multiple utterances in the form of an FAQ to be easily integrated and tested with disparate chatbot frameworks and models.

Table 2
Specific structure and data related to an utterance.

Tag: Unique name for the utterance.
Category: A category may contain multiple utterances and equates to an intent.
Tests: Utterance(s) provided to test the effectiveness/accuracy of a chatbot.
Patterns: Utterance(s) used to train the chatbot. A pattern shows that an utterance can be phrased in several different ways, comparable to how a real user might ask a question.
Responses: Each utterance normally contains just one response. A salutation utterance might randomly choose from a number of responses (e.g., pattern: 'Hello', response: "Hi" or "Hi there").

An individual utterance should have a unique tag, but a category will normally contain multiple, related utterances. A test utterance is never used to train the chatbot; it is used only to test the effectiveness of a chatbot trained on the utterances provided as patterns. Finally, depending on how the chatbot operates, a response is generated for the user and this response will have an associated confidence score.

Table 3 shows the reformatted and rewritten question answer pairs used for chatbot purposes. The total number of training phrases, or questions, is 294. This allows for better tailored, more specific responses to potential user questions and also ensures that more training data is available. The total number of question answer pairs has grown from the original 41 to a total of 85. The question parts of these 85 pairs are the test questions used in the testing of the models.

Table 3
Reformatted and rewritten question answer pairs.

Category/Intent   Question answer pairs   Original section
Basics            9                       About/Studying Online
Dates             5                       About
Applications      12                      Applications
Study             10                      Studying Online
exam_ca           6                       About/Studying Online
Fees              6                       About
sb_intro          17                      Springboard+
sb_apply          9                       Springboard+
sb_fees           6                       Springboard+
sb_details        5                       Springboard+
TOTAL             85                      (294 training phrases)
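Fig. 4 itself is not reproduced here; purely as an indicative example, an utterance node following the Tag/Category/Tests/Patterns/Responses structure of Table 2 might be written as follows (all field values are invented):

# Indicative example only: an utterance node following the Tag/Category/
# Patterns/Tests/Responses structure of Table 2 (all values are invented).
import json

utterance = {
    "tag": "sb_fees_cost",
    "category": "sb_fees",
    "patterns": [
        "How much does a Springboard+ course cost?",
        "What are the fees for Springboard+?",
    ],
    "tests": [
        "Do I have to pay for a Springboard+ place?",
    ],
    "responses": [
        "Details of Springboard+ fees and funding are available on the course page.",
    ],
}

print(json.dumps(utterance, indent=2))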
3.3. Implementation

A highly focused and complete dataset, as described in Section 3.2, was used to evaluate the NLU performance of the Dialogflow and QnA Maker platforms and of the Feedforward and SBERT models. Fig. 5 outlines the details of the simple test harness that was built around common processes that could be applied across all four approaches.

Fig. 5. Simple test harness using common processes for import, testing and output of results.

The key processes included importing the training data, evaluating test data, and generating a standard output of results. The training data was stored using a standard JSON format, as discussed in Section 3.2. Similarly, to extract data from this training set and run the test data through the implementations, a bespoke Python library was developed to read the dataset and generate output results.

Regarding the import of training data, both Dialogflow and QnA Maker offered straightforward import options. Python scripts were written to transform the JSON dataset into CSV (comma separated) and TSV (tab separated) files respectively, with the training process on both platforms being initiated manually.
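As an indication of what this transformation step might look like, the following standard-library sketch flattens pattern/response pairs into CSV and TSV files. The input file name, the top-level "utterances" key and the two-column layout are assumptions rather than the platforms' documented import schemas.

# Sketch: flattening the JSON dataset into question/answer rows for upload.
# The two-column layout is an assumption; each platform's import format
# should be checked against its own documentation.
import csv
import json

with open("faq_dataset.json", encoding="utf-8") as f:   # hypothetical file name
    utterances = json.load(f)["utterances"]             # hypothetical top-level key

def rows(nodes):
    # One (training phrase, answer) row per pattern in each utterance node.
    for node in nodes:
        answer = node["responses"][0]
        for pattern in node["patterns"]:
            yield pattern, answer

# CSV for one platform, TSV for the other.
with open("training.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows(utterances))

with open("training.tsv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(rows(utterances))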
As both the Feedforward and SBERT implementations were bespoke, full control of the automation and associated scripting was possible, with import of the JSON data and training of the model occurring in one manual cycle. Whereas the former was straightforward in terms of implementation, the latter was more complex. A pre-trained SBERT model (bert-base-nli-mean-tokens) was implemented for semantic search. To store the resulting trained embeddings and to enable future semantic search, an Elasticsearch⁸ container running version 7.9.0 was provisioned using Docker⁹.
contained within the sb_intro and sb_apply intents of this dataset. Each
of these intents contain seventeen and nine intents respectively and have
a slightly lower F1 score when compared with other intents across the
implementations. This may relate to an observation noted by Malamas Table 3
et al. [7] who stated that intent similarity can be an issue. In this Reformatted and rewritten question answer pairs.
particular case, it is the fact that the phrase “springboard” must be used Category/Intent Question Answer pairs Original Section
in every question relating to these specific intents. Basics 9 About/Studying Online
While the results outlined indicate that SBERT with this particular Dates 5 About
dataset perform well, it is worthy to consider the performance of Dia­ Applications 12 Applications
Study 10 Studying Online
logflow and QnA Maker in the literature. Braun et al. [4] considered
exam_ca 6 About/Studying Online
LUIS as part of their evaluation and found it to give their highest overall Fees 6 About
F1 score of 0.916 between compared platforms. Liu et al. [6] considered sb_intro 17 Springboard+
both LUIS and Dialogflow as part of their benchmarking and reported sb_apply 9 Springboard+
overall F1 scores of 0.821 and 0.811 respectively. Broad comparisons of sb_fees 6 Springboard+
sb_details 5 Springboard+
our F1 scores with those in the literature are however less relevant for
TOTAL 85 294 training phrases
several reasons. Firstly, online platforms and underlying models may

Fig. 4. JSON snippet showing node structure of sample question answer pair.

4
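As an illustration of the final step, the expected and predicted intent columns of such a log can be turned into the reported metrics with scikit-learn; the sketch below is not the authors' harness and the log rows are invented.

# Illustrative sketch: computing precision/recall/F1 and a confusion matrix
# from the "intent expected" / "intent predicted" columns of the results log.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical log rows: (intent expected, intent predicted).
log_rows = [
    ("applications", "applications"),
    ("sb_intro", "sb_apply"),
    ("fees", "fees"),
    ("dates", "dates"),
]

expected = [row[0] for row in log_rows]
predicted = [row[1] for row in log_rows]

precision, recall, f1, _ = precision_recall_fscore_support(
    expected, predicted, average="weighted", zero_division=0)
print(f"Precision {precision:.2f}  Recall {recall:.2f}  F1 {f1:.2f}")

labels = sorted(set(expected) | set(predicted))
print(confusion_matrix(expected, predicted, labels=labels))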
4. Results and discussion

Table 4 shows a comparison of the evaluated model results with the relevant Precision, Recall and F1-scores. Looking at these results in isolation, it appears that SBERT gives the most promising results, with an F1 score of 0.99 on this particular dataset. This was followed closely by Dialogflow with an F1 score of 0.96 and QnA Maker with an F1 score of 0.95. The Feedforward model comes in, unsurprisingly, at the end with an F1 score of 0.56.

Table 4
Comparison of model test results.

Model          Precision   Recall   F1-score
Feed Forward   0.64        0.60     0.56
Dialogflow     0.96        0.96     0.96
QnA Maker      0.96        0.96     0.95
SBERT          0.99        0.99     0.99

SBERT, Dialogflow and QnA Maker appear to have similar issues within the sb_intro and sb_apply intents of this dataset. These intents contain seventeen and nine question answer pairs respectively and have a slightly lower F1 score when compared with other intents across the implementations. This may relate to an observation noted by Malamas et al. [7], who stated that intent similarity can be an issue. In this particular case, it is the fact that the phrase "springboard" must be used in every question relating to these specific intents.

While the results outlined indicate that SBERT performs well with this particular dataset, it is worth considering the performance of Dialogflow and QnA Maker in the literature. Braun et al. [4] considered LUIS as part of their evaluation and found it to give their highest overall F1 score of 0.916 between compared platforms. Liu et al. [6] considered both LUIS and Dialogflow as part of their benchmarking and reported overall F1 scores of 0.821 and 0.811 respectively. Broad comparisons of our F1 scores with those in the literature are, however, less relevant for several reasons. Firstly, online platforms and underlying models may undergo changes and upgrades in the period since experimentation was completed. Secondly, datasets and the intents contained within them may be from completely different domains which may not be directly comparable. Finally, the dataset in this study, while complete for the specific use case, only contained 294 training phrases and 86 testing phrases. This appears to be very small in comparison with other studies where open domains have been utilised.

Notwithstanding the small dataset, the results of this study are very encouraging and suggest that further study is required using bigger datasets and multiple cross-validation techniques. This study would include investigations into newer, improved pre-trained models rather than the "bert-base-nli-mean-tokens" model, which has recently been deprecated. There are currently three models on the SBERT website that appear to have been specifically tuned for semantic search, including the "multi-qa-mpnet-base-dot-v1", "multi-qa-distilbert-cos-v1" and "multi-qa-MiniLM-L6-cos-v1" models. Consideration should also be given to using AI ecosystems like that provided by Hugging Face¹⁰ which, along with access to the models just mentioned as well as models from other users, also offers direct integration with testing tools from Weights and Biases¹¹.
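As a pointer for that follow-up work, swapping in one of the newer semantic-search checkpoints is largely a one-line change in recent versions of sentence-transformers, as the short sketch below suggests (model availability should be confirmed on the SBERT and Hugging Face model pages; the corpus phrase is hypothetical):

# Sketch: replacing the deprecated checkpoint with a semantic-search model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

corpus = ["How do I apply for an online course?"]   # hypothetical FAQ phrase
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("application process", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(hits)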
5. Conclusions

This study sought to establish where the highest NLU accuracy would be achieved by carrying out a comparative analysis between two popular chatbot frameworks, a Feedforward model and an SBERT model in answering FAQs. A methodology was outlined whereby an FAQ dataset, associated with queries from prospective students applying to study online, was prepared and formatted for evaluation. Finally, an implementation was described which utilised a simple test harness that optimised and streamlined the creation of results.

It is intriguing to note the performance of the SBERT model compared to the Dialogflow and QnA Maker platforms in this work. Existing platforms offer excellent environments for the development and delivery of a chatbot solution, while SOTA models can offer the potential of improved NLU accuracy. The user may wish to consider what the trade-off might be for their specific use case.

As the performance of the SBERT model is encouraging, further investigation would be required in a number of distinct areas. The fact that AI ecosystems are providing fast-evolving environments for the evaluation and comparison of language models has already been mentioned. Much of the comparative analysis already carried out uses datasets that are generic and limited in size. Since this study was completed, access has been gained to an online chat corpus from a Technological University. Comprising many thousands of interactions, this dataset may have dual potential in allowing the refinement of the existing FAQs as well as potentially enabling further testing of these FAQs with real questions.

Credit author statement

Kevin Peyton: Investigation, Writing – original draft, Writing – review & editing, Project administration. Saritha Unnikrishnan: Writing – review & editing, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Footnotes

1 https://mainstay.com/
2 https://cloud.google.com/dialogflow/
3 https://www.luis.ai/
4 https://www.ibm.com/cloud/ai
5 https://rasa.com/
6 https://sbert.net
7 https://www.qnamaker.ai/
8 https://www.elastic.co/
9 https://www.docker.com/
10 https://huggingface.co
11 https://wandb.ai

References

[1] M.J. van der Goot, T. Pilgrim, Exploring age differences in motivations for and acceptance of chatbot communication in a customer service context, in: A. Følstad, T. Araujo, S. Papadopoulos, E.L.-C. Law, O.-C. Granmo, E. Luger, P.B. Brandtzaeg (Eds.), Chatbot Research and Design, Springer International Publishing, Cham, 2020, pp. 173–186.
[2] S. Yang, C. Evans, Opportunities and challenges in using AI chatbots in higher education, in: Proceedings of the 2019 3rd International Conference on Education and E-Learning, Association for Computing Machinery, Barcelona, Spain, 2019, pp. 79–83, https://doi.org/10.1145/3371647.3371659.
[3] S. Cunningham-Nelson, W. Boles, L. Trouton, E. Margerison, A Review of Chatbots in Education: Practical Steps Forward, Engineers Australia, 2019.
[4] D. Braun, A.H. Mendez, F. Matthes, M. Langen, Evaluating natural language understanding services for conversational question answering systems, in: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 174–185.
[5] C. Wisniewski, C. Delpuech, D. Leroy, F. Pivan, J. Dureau, Benchmarking Natural Language Understanding Systems, 2017.
[6] X. Liu, A. Eshghi, P. Swietojanski, V. Rieser, Benchmarking Natural Language Understanding Services for Building Conversational Agents, 2019, arXiv preprint arXiv:1903.05566.
[7] N. Malamas, K. Papangelou, A.L. Symeonidis, Upon improving the performance of localized healthcare virtual assistants, Healthcare, MDPI, 2022, p. 99.
[8] P. Loeber, available: https://www.python-engineer.com/courses/pytorchbeginner/13-feedforward-neural-network/, 2020 (Accessed 17 May 2021).
[9] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks, 2019, arXiv preprint arXiv:1908.10084.
[10] N. Reimers, SentenceTransformers Documentation, 2016, available: https://sbert.net (Accessed 18 January 2021).
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018, arXiv preprint arXiv:1810.04805.
[12] HEA, HEA - Springboard+, 2022, available: https://springboardcourses.ie/ (Accessed 18 August 2022).
[13] Technical Committee 39, ECMA-404 - The JSON Data Interchange Syntax, 2017, available: https://www.ecma-international.org/publications-and-standards/standards/ecma-404/ (Accessed 18 August 2022).

