Results in Engineering
journal homepage: www.sciencedirect.com/journal/results-in-engineering
Keywords: Chatbots; SBERT; Natural language understanding; FAQs; Online learning

Abstract

Online learning enables academic institutions to accommodate increased student numbers at scale. With this scale comes high demand on support staff for help in dealing with general questions relating to qualifications and registration. Chatbots that implement Frequently Asked Questions (FAQs) can be a valuable part of this support process. A chatbot can provide constant availability in answering common questions, allowing support staff to engage in higher-value, one-to-one communication with prospective students. A variety of approaches can be used to create these chatbots, including vertical platforms, frameworks, and direct model implementation. A comparative analysis is required to establish which approach provides the most accuracy for an existing, available dataset.

This paper compares the intent classification results of two popular chatbot frameworks with a state-of-the-art Sentence BERT (SBERT) model that can be used to build a robust chatbot. A methodology is outlined which includes the preparation of a university FAQ dataset into a chatbot-friendly format for upload and training of each implementation. Results from the framework-based implementations are generated using their published Application Programming Interfaces (APIs). This enables intent classification using testing phrases and, finally, comparison of F1-scores.

Using ten intents comprising 294 training phrases and 85 testing phrases, it was found that an SBERT model outperformed all others with an F1-score of 0.99. Initial comparison with the literature suggests that the F1-scores obtained for Google Dialogflow (0.96) and Microsoft QnA Maker (0.95) are very similar to other benchmarking exercises where NLU (Natural Language Understanding) has been compared.
1. Introduction

Chatbots are intelligent conversational agents which have become an essential element of the customer management process for dealing with large numbers of online queries and support tasks. Customer service has evolved to the point where traditional voice and email interactions are now being replaced by self-service channels such as chatbots. According to van der Goot and Pilgrim [1], such channels are acceptable to customers once queries are answered in a fast, easy and convenient manner.

While chatbots are being widely implemented in sectors such as telecoms, finance and online commerce, Yang and Evans [2] note the absence of implementation in the educational sector. With online learning becoming ever more popular, the '24/7' availability highlighted by Cunningham-Nelson et al. [3] can be extremely advantageous. By its very nature, online learning does not necessarily confine a student to synchronous engagement or to a shared geography or time zone. However, it does place a considerable burden on academic institutions as to how they scale their support to prospective students with questions relating to qualifications, fees, and other administrative queries.

Using chatbots to provide a dependable solution for answering common questions and queries allows support staff to be better utilised in interacting with prospective students on a personal, one-to-one level. There are many approaches that can be used to support the delivery of a chatbot that can deliver FAQs, and these include vertical platforms
relevant to the education sector, such as Mainstay (https://mainstay.com/), and more well-known platforms such as Dialogflow (https://cloud.google.com/dialogflow/) from Google and LUIS (https://www.luis.ai/) from Microsoft. Direct implementation of a state-of-the-art (SOTA) technique for Natural Language Processing (NLP) may also be an option. While these approaches may differ in terms of market segment, technical approach, and user environment, they all share a common factor: how effective and accurate the results are in terms of Natural Language Understanding (NLU).

This paper provides a comparative analysis of popular chatbot platforms against a SOTA SBERT implementation to establish which approach provides the highest NLU accuracy for an FAQ dataset associated with queries from prospective students applying to study online. Under Related work, approaches to NLU accuracy that have already been carried out are outlined from the literature. These appear to be mainly from generic, open domains such as travel – for example, using station names along with arrival and departure times, which potentially enables easier intent identification. In this work, the dataset being evaluated is specific to a subset of queries within an educational domain for prospective students studying online. A comparative approach using this type of dataset has not been found in the literature. In the Methodology section, the frameworks and models used are briefly described. A detailed account is also given of dataset creation and the implementation of a test harness for generating results. In the Results and discussion section, results are discussed and compared to studies from the literature.

2. Related work

A review of publications that discuss benchmarking and analyse the results of NLU accuracy suggests several approaches. The comparison of results and features from platforms like Watson (https://www.ibm.com/cloud/ai) from IBM, Dialogflow, LUIS, and RASA (https://rasa.com/) is common and suggests that general question answering services are well served by these platforms. The datasets being compared are primarily from open domains, utilising relatively common intents such as those for restaurant booking and travel reservations.

This approach of using open domains was illustrated by Braun et al. [4], who investigated LUIS using queries from travel (206 queries) and online computing forums (290 queries). Wisniewski et al. [5] compared several systems, including LUIS, for building chatbots, using an open domain dataset comprising 328 queries. By contrast, Liu et al. [6] performed a cross-domain evaluation of NLU using 21 domains comprising 25k queries, whilst noting the difficulty the user might have in choosing between platforms.

A different approach was taken by Malamas et al. [7], who used the RASA platform within a healthcare domain and modified internal model settings. This experimentation used a hand-designed dataset of 142 questions, with the authors commenting that the collection, review and addition of new data is quite normal within the domain. Intriguingly, they also noted that intent similarity can be an issue where sentences may differ by only one or two words.

3. Methodology

The approaches evaluated in this study comprise two chatbot frameworks, a traditional Feedforward neural network, and a SOTA Sentence BERT (SBERT; https://sbert.net) model. QnA Maker (https://www.qnamaker.ai/) from Microsoft and Dialogflow from Google are both cloud-based NLP services that allow for the creation of conversational client applications, including chatbots. These are suitable for use when static, non-changing information such as FAQs is being used to return answers to the user. Both enable the importation of structured content (in the form of question and answer pairs) as well as semi-structured content (FAQs, manuals, documents) into a knowledge base. Through this knowledge base, the user can easily manipulate and improve the information by adding different forms of how a question might be asked, as well as metadata tags associated with each question and answer pair.

Both QnA Maker and Dialogflow provide rich environments for authoring, building and publishing conversational clients by both developers and non-developers. In this study, these environments were used for the upload and training of our data on both platforms. In the case of QnA Maker, NLU is provided by direct integration with LUIS.

Along with QnA Maker and Dialogflow, a traditional Feedforward neural network model was utilised. Fig. 1 shows an implementation of one of these networks with one input, eight hidden layers and one output. The implementation in this study is based on the work of Loeber [8]. It used Python 3.7 with the PyTorch and nltk libraries, and utilised hyperparameters including a batch size of 8, a learning rate of 0.001, a hidden size of 8 and 1000 epochs for training.

Fig. 1. A Feedforward neural network with one input, eight hidden layers and one output.
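As an illustration of the general shape such a classifier and training loop might take with the hyperparameters reported above, a minimal PyTorch sketch is given below. It is not the authors' code: the vocabulary size and the random placeholder bag-of-words tensors are assumptions made purely for illustration.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed sizes: a small bag-of-words vocabulary and the study's ten intents.
VOCAB_SIZE, HIDDEN_SIZE, NUM_CLASSES = 120, 8, 10

class FeedforwardNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.l1 = nn.Linear(input_size, hidden_size)
        self.l2 = nn.Linear(hidden_size, hidden_size)
        self.l3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.l1(x))
        out = self.relu(self.l2(out))
        return self.l3(out)  # raw logits; CrossEntropyLoss applies log-softmax

# Random placeholders stand in for the 294 bag-of-words training phrases.
X = torch.rand(294, VOCAB_SIZE)
y = torch.randint(0, NUM_CLASSES, (294,))
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

model = FeedforwardNet(VOCAB_SIZE, HIDDEN_SIZE, NUM_CLASSES)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(1000):
    for words, labels in loader:
        loss = criterion(model(words), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()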
Finally, an SBERT pre-trained transformer framework developed by Reimers and Gurevych [9] was implemented to evaluate the dataset in this study. Fig. 2 shows an example of the SBERT architecture computing a similarity score between two sentences to see how similar in meaning the sentences are.

Fig. 2. SBERT architecture showing Sentence A and Sentence B passed through the network, yielding embeddings U and V, with cosine similarity then applied [10].

At the core of the SBERT model is the pretrained BERT network developed by Devlin et al. [11], which enables bidirectional training to be applied to a word sequence using a transformer. A transformer uses an encoder-decoder architecture with self-attention, a mechanism that enables the model to make sense of the input that it receives. This ability to make sense of language sequences has allowed BERT to be used for different types of tasks, including classification, regression, and sentence similarity. However, it performs poorly with semantic similarity search, a task that measures the similarity of texts and sentences in terms of a defined metric. Whereas the construction of BERT makes it unsuitable for semantic similarity search, Reimers and Gurevych [9] demonstrated that SBERT could perform this more efficiently by deriving semantically meaningful sentence embeddings, or vectors, that can be compared. Cosine similarity is used for this comparison process, as it enables a measure of similarity to be ascertained about documents or words in a document. It offers the potential for improved text retrieval by understanding the content of the supplied text. This model is particularly interesting to the current study, as a human may pose a question in a variety of ways. While the words may be different, they may be semantically similar in terms of meaning.
In this implementation of SBERT, Python 3.7 was used, with the sentence-transformers library utilising the "bert-base-nli-mean-tokens" pre-trained model. Training of the models and all testing for the platforms and models outlined here were conducted on a MacBook with an Intel Core i5 and 8 GB of RAM.
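The minimal sketch below shows how this pre-trained model can be loaded with the sentence-transformers library and used to score a query against stored patterns with cosine similarity; the example sentences are invented for illustration and are not from the study's dataset.

from sentence_transformers import SentenceTransformer, util

# The pre-trained model named above (since deprecated, see Section 4).
model = SentenceTransformer("bert-base-nli-mean-tokens")

patterns = ["How do I apply for a Springboard+ course?",      # illustrative
            "What fees are payable for studying online?"]
query = "What is the application process for Springboard?"

pattern_emb = model.encode(patterns, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity: cos(u, v) = u.v / (|u||v|); the best-scoring
# pattern identifies the intent whose response would be returned.
scores = util.pytorch_cos_sim(query_emb, pattern_emb)
best = scores.argmax().item()
print(patterns[best], float(scores[0, best]))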
3.2. Dataset

The data used for this study originates in question and answer format from the online FAQs that are available to prospective online students at a Technological University. Fig. 3 shows an example of how this material was presented to the user on the website in a collapsible format that allowed easy access to answers to common questions.

Fig. 3. Original format of FAQs on the website, using an accordion for collapsible content.

This material was originally written to be quickly scanned by the student for answers to commonly asked queries received by administrative staff via phone and email. The dataset was divided into three categories, entitled About, Applications and Studying Online. A fourth category encompassing a special purpose Government-funded training initiative known as Springboard+ [12] was also included. These four categories comprised a total of 41 question answer pairs. Table 1 outlines the original publishing details of the question answer pairs, showing an overall minimum question length of three words and an overall maximum question length of sixteen words.

Table 1
Question answer pairs of original website text.

Section            Question answer pairs    Question min length    Question max length
About              3                        5                      7
Applications       8                        5                      12
Studying Online    14                       5                      15
Springboard+       15                       3                      16
TOTAL              41
The existing structure and content of the data was perfect for easy assimilation by a human scanning for relevant information. It was recognised, however, that this format was not optimal for chatbot usage. For example, some answers were quite long, and others contained information not directly relevant to the question being asked from the perspective of a potential chatbot response.

For this reason, the initial content was used as the basis for a revised set of question answer pairs, optimised for use in a generic storage format. Fig. 4 shows a revised, generic question answer pair format utilising JavaScript Object Notation (JSON) that allows for storage of the potential questions that will be asked. Each question or utterance and its associated data is stored in a node using a key/value, individual value or array of values [13].

Fig. 4. JSON snippet showing node structure of sample question answer pair.
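Since Fig. 4 itself is not reproduced here, the snippet below illustrates the kind of node structure described, with invented field values; the field names follow the structure set out in Table 2, and the response text is elided rather than reconstructed.

{
  "tag": "sb_fees_cost",
  "category": "sb_fees",
  "tests": ["How much does a Springboard+ course cost?"],
  "patterns": [
    "What are the fees for Springboard+?",
    "Do I have to pay for a Springboard+ course?"
  ],
  "responses": ["..."]
}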
Table 2 describes the specific structure and data related to an utterance. This structure allows multiple utterances in the form of an FAQ to be easily integrated and tested with disparate chatbot frameworks and models.

Table 2
Specific structure and data related to an utterance.

Name         Description
Tag          Unique name for the utterance.
Category     A category may contain multiple utterances and equates to intent.
Tests        Utterance(s) provided to test the effectiveness/accuracy of a chatbot.
Patterns     Utterances used to train the chatbot. A pattern shows that an utterance can be phrased in several different ways, comparable to how a real user might ask a question.
Responses    Each utterance normally contains just one response. A salutation utterance might randomly choose from a number of responses (e.g., pattern: "Hello", response: "Hi" or "Hi there").

An individual utterance should have a unique tag, but a category will normally contain multiple, related utterances. A test utterance is never used to train the chatbot – it is used only to test the effectiveness of a chatbot trained on the utterances provided as patterns. Finally, depending on how the chatbot operates, a response is generated for the user, and this response will have an associated confidence score.

Table 3 shows the reformatted and rewritten question answer pairs used for chatbot purposes. The total number of training phrases, or questions, is 294. This allows better tailored and more specific responses to potential user questions and also ensures that more tailored training data is available. The total number of question answer pairs has grown from the original 41 to a total of 85. The question parts of these 85 pairs are the test questions used in testing the models.

Table 3
Reformatted and rewritten question answer pairs.

Category/Intent    Question answer pairs    Original section
Basics             9                        About/Studying Online
Dates              5                        About
Applications       12                       Applications
Study              10                       Studying Online
exam_ca            6                        About/Studying Online
Fees               6                        About
sb_intro           17                       Springboard+
sb_apply           9                        Springboard+
sb_fees            6                        Springboard+
sb_details         5                        Springboard+
TOTAL              85                       (294 training phrases)
3.3. Implementation

A highly focused and complete dataset, as described in Section 3.2, was used to evaluate NLU performance on the Dialogflow and QnA Maker platforms and on the Feedforward and SBERT models. Fig. 5 outlines the details of the simple test harness that was built to utilise common processes applicable across all four approaches.

Fig. 5. Simple test harness using common processes for import, testing and output of results.

The key processes included the import of training data, the evaluation of test data, and the generation of a standard output of results. The training data was stored using a standard JSON format, as discussed in Section 3.2. Similarly, a bespoke Python library was developed to read the dataset, run the test data through the implementations, and generate output results.

For the import of training data, both Dialogflow and QnA Maker offered straightforward import options. Python scripts were written to transform the JSON dataset into CSV (comma separated) and TSV (tab separated) files respectively, with the training process on both platforms being initiated manually.
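A minimal sketch of this flattening step is given below. The file names, the top-level JSON key and the column layout are assumptions for illustration; the exact import schema expected by each platform differs.

import csv
import json

# Read the JSON dataset described in Section 3.2 (assumed top-level key).
with open("faq_dataset.json", encoding="utf-8") as f:
    utterances = json.load(f)["utterances"]

# Write one row per training pattern; the "\t" delimiter yields the TSV variant.
with open("training.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["question", "answer", "intent"])
    for utt in utterances:
        for pattern in utt["patterns"]:
            writer.writerow([pattern, utt["responses"][0], utt["category"]])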
As both the Feedforward and SBERT implementations were bespoke, full control of the automation and associated scripting was possible, with import of the JSON data and training of the model occurring in one manual cycle. Whereas the former was straightforward in terms of implementation, the latter was more complex. A pre-trained SBERT model (bert-base-nli-mean-tokens) was implemented for semantic search. To store the resulting trained embeddings and to enable subsequent semantic search, an Elasticsearch (https://www.elastic.co/) container running version 7.9.0 was provisioned using Docker (https://www.docker.com/).
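A sketch of how such an index might be set up and queried is shown below. The index and field names are assumptions, while the dense_vector mapping and the cosineSimilarity script function are standard Elasticsearch 7.x features.

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")
es = Elasticsearch("http://localhost:9200")

# One dense_vector field holds the 768-dimensional SBERT embedding.
es.indices.create(index="faq", body={
    "mappings": {"properties": {
        "tag": {"type": "keyword"},
        "pattern": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 768}}}})

# Index a training pattern together with its embedding (illustrative text).
pattern = "What are the fees for Springboard+?"
es.index(index="faq", body={"tag": "sb_fees", "pattern": pattern,
                            "embedding": model.encode(pattern).tolist()})

# Rank stored patterns by cosine similarity to a query embedding.
query_vec = model.encode("How much does Springboard+ cost?").tolist()
response = es.search(index="faq", body={"size": 1, "query": {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.v, 'embedding') + 1.0",
            "params": {"v": query_vec}}}}})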
Running test data through the implementations followed a similar convention to that described for the training process. Both Dialogflow and QnA Maker offer well-documented APIs, which were scripted using a bespoke Python library to query the trained models with test questions. The Feedforward and SBERT models were evaluated in a similar fashion. Finally, all implementations outputted their results in a standard format. This format was again generated across all implementations using a combination of Python scripts and libraries to produce generic logging data, F1-scores and a graphical confusion matrix. The generic logging data included the question asked, answer expected, answer predicted, confidence, intent expected, and intent predicted. A sample of this logging data is shown in Fig. 6.
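By way of illustration, the sketch below shows how a trained Dialogflow agent can be queried with a test phrase through the published Python client library; the project and session identifiers are placeholders, and this is not the authors' harness code.

from google.cloud import dialogflow

def classify(project_id, session_id, text):
    client = dialogflow.SessionsClient()
    session = client.session_path(project_id, session_id)
    query_input = dialogflow.QueryInput(
        text=dialogflow.TextInput(text=text, language_code="en"))
    response = client.detect_intent(
        request={"session": session, "query_input": query_input})
    result = response.query_result
    # The predicted intent and its confidence are what the harness logs.
    return result.intent.display_name, result.intent_detection_confidence

intent, confidence = classify("my-project", "test-session",
                              "How do I apply to study online?")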
4. Results and discussion

Table 4 shows a comparison of the evaluated model results with the relevant Precision, Recall and F1-scores. Looking at these results in isolation, it appears that SBERT gives the most promising results, with an F1-score of 0.99 on this particular dataset. This was followed closely by Dialogflow with an F1-score of 0.96 and QnA Maker with an F1-score of 0.95. The Feedforward model comes in, unsurprisingly, at the end with an F1-score of 0.56.
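Metrics of this kind can be generated from the logged results with standard tooling; a sketch is given below, where the log file name and column names (taken from the logging fields described in Section 3.3) are assumptions, as is the choice of averaging.

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

log = pd.read_csv("sbert_results.csv")  # one row per test question
y_true, y_pred = log["intent_expected"], log["intent_predicted"]

# Per-intent precision, recall and F1, plus averaged scores.
print(classification_report(y_true, y_pred, digits=2))
print(confusion_matrix(y_true, y_pred, labels=sorted(y_true.unique())))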
SBERT, Dialogflow and QnA Maker appear to have similar issues with the sb_intro and sb_apply intents of this dataset. These intents contain seventeen and nine question answer pairs respectively, and have a slightly lower F1-score when compared with other intents across the implementations. This may relate to an observation noted by Malamas et al. [7], who stated that intent similarity can be an issue. In this particular case, the phrase "springboard" must be used in every question relating to these specific intents.

While the results outlined indicate that SBERT performs well with this particular dataset, it is worth considering the performance of Dialogflow and QnA Maker in the literature. Braun et al. [4] considered LUIS as part of their evaluation and found it to give their highest overall F1-score of 0.916 among the compared platforms. Liu et al. [6] considered both LUIS and Dialogflow as part of their benchmarking and reported overall F1-scores of 0.821 and 0.811 respectively. Broad comparisons of our F1-scores with those in the literature are, however, less relevant for several reasons. Firstly, online platforms and underlying models may undergo changes and upgrades in the interim period since experimentation was completed. Secondly, datasets and the intents contained within them may be from completely different domains, which may not be directly comparable. Finally, the dataset in this study, while complete for the specific use case, only contained 294 training phrases and 85 testing phrases. This appears to be very small in comparison with other studies where open domains have been utilised.

Notwithstanding the small dataset, the results of this study are very encouraging and suggest that further study is warranted using bigger datasets and multiple cross-validation techniques. This further study would include investigations into newer, improved pre-trained models rather than the "bert-base-nli-mean-tokens" model, which has recently been deprecated. There are currently several models on the SBERT website that appear to have been specifically tuned for semantic search.

5. Conclusion

This study sought to establish where the highest NLU accuracy would be achieved by carrying out a comparative analysis of two popular chatbot frameworks, a Feedforward model, and an SBERT model in answering FAQs. A methodology was outlined whereby an FAQ dataset, associated with queries from prospective students applying to study online, was prepared and formatted for evaluation. Finally, an implementation was described which utilised a simple test harness that optimised and streamlined the creation of results.
It is intriguing to note the performance of the SBERT model compared to the Dialogflow and QnA Maker platforms in this work. Existing platforms offer excellent environments for the development and delivery of a chatbot solution, while SOTA models can offer the potential of improved NLU accuracy. The user may wish to consider what the trade-off might be for their specific use case.

As the performance of the SBERT model is encouraging, further investigation would be required in a number of distinct areas. The fact that AI ecosystems are providing fast-evolving environments for the evaluation and comparison of language models has already been mentioned. Much of the comparative analysis already carried out uses datasets that are generic and limited in size. Since this study was completed, access has been gained to an online chat corpus from a Technological University. Comprising many thousands of interactions, this dataset may have dual potential, allowing refinement of the existing FAQs as well as enabling further testing of these FAQs with real questions.

Credit author statement

Kevin Peyton: Investigation, Writing – original draft, Writing – review & editing, Project administration. Saritha Unnikrishnan: Writing – review & editing, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] M.J. van der Goot, T. Pilgrim, Exploring age differences in motivations for and acceptance of chatbot communication in a customer service context, in: A. Følstad, T. Araujo, S. Papadopoulos, E.L.-C. Law, O.-C. Granmo, E. Luger, P.B. Brandtzaeg (Eds.), Chatbot Research and Design, Springer International Publishing, Cham, 2020, pp. 173–186.
[2] S. Yang, C. Evans, Opportunities and challenges in using AI chatbots in higher education, in: Proceedings of the 2019 3rd International Conference on Education and E-Learning, Association for Computing Machinery, Barcelona, Spain, 2019, pp. 79–83. https://doi.org/10.1145/3371647.3371659.
[3] S. Cunningham-Nelson, W. Boles, L. Trouton, E. Margerison, A Review of Chatbots in Education: Practical Steps Forward, Engineers Australia, 2019.
[4] D. Braun, A.H. Mendez, F. Matthes, M. Langen, Evaluating natural language understanding services for conversational question answering systems, in: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 174–185.
[5] C. Wisniewski, C. Delpuech, D. Leroy, F. Pivan, J. Dureau, Benchmarking Natural Language Understanding Systems, 2017.
[6] X. Liu, A. Eshghi, P. Swietojanski, V. Rieser, Benchmarking Natural Language Understanding Services for Building Conversational Agents, 2019. arXiv preprint arXiv:1903.05566.
[7] N. Malamas, K. Papangelou, A.L. Symeonidis, Upon improving the performance of localized healthcare virtual assistants, Healthcare, MDPI, 2022, p. 99.
[8] P. Loeber, 2020. Available: https://www.python-engineer.com/courses/pytorchbeginner/13-feedforward-neural-network/ (accessed 17 May 2021).
[9] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks, 2019. arXiv preprint arXiv:1908.10084.
[10] N. Reimers, SentenceTransformers Documentation, 2016. Available: https://sbert.net (accessed 18 January 2021).
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018. arXiv preprint arXiv:1810.04805.
[12] HEA, HEA – Springboard+, 2022. Available: https://springboardcourses.ie/ (accessed 18 August 2022).
[13] Technical Committee 39, ECMA-404 – The JSON Data Interchange Syntax, 2017. Available: https://www.ecma-international.org/publications-and-standards/standards/ecma-404/ (accessed 18 August 2022).