
Question-Attentive Review-Level Explanation for Neural Rating Regression

Published: 13 December 2024

Abstract

Explanations help improve the acceptance of recommendations by end users. Explanations come in many different forms. One of interest here is presenting an existing review of the recommended item as the explanation. The challenge lies in selecting a suitable review, which is customarily addressed by assessing the relative importance or “attention” of each review to the recommendation objective. Our focus is improving review-level explanation by leveraging additional information in the form of questions and answers (QA). The proposed framework employs QA in an attention mechanism that aligns reviews to the various QAs of an item and assesses their joint contribution to the recommendation objective. The benefits are two-fold. For one, QA aids in selecting more useful reviews. For another, QA itself could accompany a well-aligned review in an expanded form of explanation. Experiments on datasets of 10 product categories showcase the efficacy of our method, as compared to comparable baselines, in identifying useful reviews and QAs while maintaining parity in recommendation performance.

1 Introduction

A ubiquitous feature of Web applications and e-commerce marketplaces is a recommender system that aids users in navigating the multitude of options available, whether they are products to purchase, social media posts to view, movies to watch, and so on. The most common framework is that of collaborative filtering [18], predicting ratings or adoptions based on users’ past interactions with various items.
Earlier in the evolution of recommender systems, the concern was predominantly on achieving higher accuracy [14, 38]. Of late, the concern has shifted to greater interpretability and explainability, as ultimately the goal is to get users to adopt the recommendations. This has given rise to a plethora of explainable recommendation models [52], which seek to produce not only recommendations but also accompanying explanations. There are diverse forms of explanations, leveraging different types of information associated with either users or items.
For a pertinent instance, we allude to review-level explanation, whereby the explanation to a recommendation takes the form of a review, selected from the existing reviews of the product. An insightful review, when presented with a recommended product, allows the recipient of the recommendation to empathize with the hands-on experience of the reviewer, thus anticipating what her own experience with the product would be. For instance, on Amazon.com, Canon EOS Rebel T7 Bundle1 has more than 2,800 ratings, more than 300 of which have reviews. One of these reviews is illustrated in Figure 1, relating to the quality of the starter kit. That popular products may have many reviews (some to the tune of tens of thousands) is a double-edged sword. With a rich corpus for selection comes the problem of selecting which review to present as the explanation. One existing paradigm [3, 28] is to weigh the contribution of various reviews to the recommendation.
Fig. 1. An example product on Amazon.com with reviews, questions, and answers (QA).
Given the abundance of reviews, there is a proclivity to employ reviews to aid recommendations. Most of these works are intent on improving recommendation accuracy rather than on using reviews directly as explanations. These include content-based methods based on topic models [43], sentiments [8], and social networks [39]. Using a convolutional neural network (CNN), Zheng et al. [55] encode all reviews of an item to represent that item and all reviews written by a user to represent that user, to enhance rating prediction. Tay et al. [45] learn to focus on a few reviews of users and items, optimizing for rating prediction. In contrast to works that see reviews as content to improve recommendation accuracy, we focus on the role of reviews as explanations.
In this work, we propose to go beyond reviews and incorporate other information associated with a product. One that is a focus of this work is a question posted by a user that in turn attracts answers from other users, hereinafter referred to in short form as questions and answers (QA). For instance, the same product Canon EOS Rebel T7 bundle featured in Figure 1 has more than 200 questions. Among them are whether the camera has Wi-Fi ability (answer: yes), whether there is a port for an external microphone (answer: no, but another model T7i does), and whether it is suitable for indoor sports (answer: yes, it has a sport mode). Similarly to reviews, QAs could also receive votes from users.
Interestingly, QAs present distinct yet complementary information to reviews. Where reviews tend to be subjective and replete with opinions, questions tend to be objective and inquisitive of factual concerns. Where a single review tends to be multi-faceted and comprehensive, each question tends to be concise and narrowly focused on a single aspect. Given this complementarity, we postulate that both QA and review could collectively serve as recommendation explanations. The former notifies the recommendee of relevant factual concern(s), while the latter gives the recommendee insights from a reviewer’s experience.
QA as a feature is also increasingly prevalent across many platforms, with Amazon.com and Tripadvisor.com being a couple of prominent examples. For instance, across the 10 product categories in our datasets (see Section 4), between 13% and 56% of products have QA information. Given the anticipated further increase in QA data over time, it is timely to consider how to leverage QA in addition to reviews for more informative recommendation explanations.
Problem.
Let \(\mathcal{U}\) be a set of users, and \(\mathcal{P}\) be a set of products. A user \(i\in\mathcal{U}\) assigns to a product \(j\in\mathcal{P}\) a rating \(r_{ij}\in\mathbb{R}_{+}\) along with a review \(t_{ij}\). We denote the collection of ratings as \(\mathcal{R}\), all reviews as \(\mathcal{T}\), the subset of reviews concerning a product \(j\) as \(\mathcal{T}_{j}\). Product \(j\) may have a set of questions \(\mathcal{Q}_{j}=\{q_{j1},q_{j2},...,q_{jK}\}\subset\mathcal{Q}\), where \(K\) is the total number of questions of product \(j\). Each question \(q_{jk}\) has a collection of answers \(\mathcal{A}_{jk}=\{a_{jk1},a_{jk2},\dots,a_{jkL}\}\), where \(L\) is the total number of answers of question \(q_{jk}\). Table 1 lists the notations (some to be introduced later). The problem can thus be stated as follows. Receiving as input users \(\mathcal{U}\), products \(\mathcal{P}\), ratings \(\mathcal{R}\), reviews \(\mathcal{T}\), and QA pairs \(\mathcal{Q}\), we seek a model capable of predicting a missing rating by a user \(i\) on product \(j\) for recommendation (rating regression), as well as identifying a QA pair (selected from \(\mathcal{Q}_{j}\)) along with a review (selected from \(\mathcal{T}_{j}\)) to serve collectively as explanations accompanying the recommendation.
Symbol | Description
\(\mathcal{U},\mathcal{P}\) | Set of all users and products
\(\mathcal{T},\mathcal{Q},\mathcal{A}\) | Set of all reviews, questions, and answers
\(t_{ij}\in\mathcal{T}\) | A review of user \(i\) on product \(j\)
\(\mathcal{Q}_{j}\) | A set of all questions on product \(j\)
\(q_{jk}\in\mathcal{Q}_{j}\) | A question \(k\) of product \(j\)
\(\mathcal{A}_{jk}\) | A set of all answers on question \(q_{jk}\)
\(a_{jkl}\in\mathcal{A}_{jk}\) | An answer \(l\) of a question \(k\) on product \(j\)
\(\xi(t_{ij}),\xi(q_{jk}),\xi(a_{jkl})\) | Embedded matrices of \(t_{ij}\), \(q_{jk}\), and \(a_{jkl}\)
\(\zeta_{u}(i),\zeta_{p}(j)\) | Latent features of user \(i\) and product \(j\)
\(O_{t_{ij}},O_{q_{jk}},O_{a_{jkl}}\) | Feature vectors extracted from \(t_{ij}\), \(q_{jk}\), and \(a_{jkl}\)
\(u_{i},p_{j}\) | Rating-based representation of user \(i\) and product \(j\)
\(\alpha_{ij}\) | Attention weight for \(O_{t_{ij}}\)
\(\beta_{ijk}\) | Attention weight of review \(t_{ij}\) on question \(q_{jk}\)
\(\delta_{jkl}\) | Attention weight of answer \(a_{jkl}\) on question \(q_{jk}\)
\(\omega_{jk}\) | QA representation of question \(q_{jk}\) after infusing answers
\(d_{jk}\) | Document representation with respect to \(q_{jk}\) after infusing reviews
\(\gamma_{jk}\) | Attention weight of document \(d_{jk}\)
\(b_{u},b_{i},\mu\) | User bias, item bias, and global bias, respectively
Table 1. Main Notations
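To make the notation concrete, the following is a minimal sketch (ours, not the authors' code) of how the inputs in the problem statement could be organized; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Question:
    text: str                                           # q_jk
    answers: List[str] = field(default_factory=list)    # A_jk = {a_jk1, ..., a_jkL}

@dataclass
class Dataset:
    ratings: Dict[Tuple[str, str], float]   # (user i, product j) -> r_ij
    reviews: Dict[Tuple[str, str], str]     # (user i, product j) -> t_ij
    questions: Dict[str, List[Question]]    # product j -> Q_j

    def reviews_of(self, product: str) -> List[str]:
        """T_j: all reviews written on a given product."""
        return [t for (_, j), t in self.reviews.items() if j == product]
```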
Due to the differing yet complementary natures of QA and reviews, we design a neural attention model, called QUESTion-attentive review-level Explanation for neural rating Regression (QuestER), which operates at two levels. First, the concise QAs serve as focal points of attention representing salient aspects of a product recommendation. Second, the multi-faceted nature of reviews means that they could be relevant to multiple aspects, and we model their relative importance to each QA. Together, QA and reviews serve dual roles in a hand-in-hand manner: to contribute content features to aid recommendation and to serve as explanations.
Contribution. We make several contributions. First, we incorporate product questions into an attention mechanism on reviews for recommendation. Second, we develop a neural model called QuestER, which considers questions as a source of alignment to textual review. An important question would help to identify important reviews. Third, we conduct comprehensive experiments on 10 product categories against comparable baselines. Importantly, we find that not only do QAs help in identifying useful reviews, but the expanded explanation that is the combination of QA and review also has value.
This article is an extension of the conference version [22], and it differs in the following additional contributions:
Instead of treating answers as part of the question text, the model architecture of QuestER has been extended to include another attention layer through which answers contribute to the question representation. This replaces the question encoding used previously.
We include extensive experimental comparisons on 10 product categories (Home, Health, Sport, Toy, Grocery, Baby, Office, Automotive, Patio, and Musical). In contrast, [22] reports only three product categories (Home, Sport, and Musical). These additional results provide comprehensive coverage of how the method applies across a wide range of product domains.
We expand the experiments with discussions on additional metrics new to this article, including the effect of the number of answers per question on the final rating predictions, as well as the performance of both review-level and question-level explanations. These give a more well-rounded coverage of the performance of the proposed method.
In addition to quantitative experiments, we now include user studies that examine the quality of both review-level and question-level explanations, as well as a comparison to top-rated reviews. For illustration, we also present additional case studies across more domains. The resulting analyses thoroughly examine the effectiveness of the proposed methods.

2 Related Work

We survey related work that deals with questions or reviews in the context of recommendations.
QA-Based Recommendation. The use of QA for recommendation is still relatively rare in the literature. One line of work detects a user’s propensity to purchase a product based on the question that the user has submitted [4]. This is a distinct scenario from ours, where the question does not have to be posed by the recipient of recommendations. Rather, we see questions as additional product information that may be relevant as explanation. QA-based recommendation is also orthogonal to the question-answering task. Zhao et al. [54] select relevant sentences in product reviews to answer a question. Chen et al. [5] identify answers to user questions from product reviews via multi-task attentive networks. Yu and Lam [51] incorporate aspects of reviews to predict the answer to a yes-no question. Our goal is not to answer questions, but rather to select QAs appropriate for recommendation explanations.
Review-Based Recommendation. Given the abundance of reviews, there is a proclivity to employ reviews to aid recommendations. Most of these works are intent on improving recommendation accuracy rather than on using reviews directly as explanations. These include content-based methods based on similarity metrics [35], topic models [43], sentiments [8], and social networks [39]. Using CNN, Zheng et al. [55] encode all reviews of an item to represent that item and all reviews written by a user to represent that user, to enhance rating prediction. Tay et al. [45] learn to focus on a few reviews of users and items, optimizing for rating prediction. Liu et al. [27] treat reviews with different polarities differently for rating prediction. In contrast to works that see reviews as content to improve recommendation accuracy, we focus on the role of reviews as explanations.
Review-Based Recommendation Explanation. Our work belongs to a group that uses a whole review as explanation. We identify a few in this group and compare to them as baselines. NARRE [3] uses attention to weigh each individual review toward the user and item representations and uses the most useful review(s) as review-level explanation. HRDR [28] uses a multi-layer perceptron (MLP) to encode a user’s ratings (respectively, an item’s ratings) as user features (respectively, item features) and uses them as the query for an attention layer to weigh the contribution of each review to rating prediction. HFT [32] could select the review whose topic distribution is closest to the item’s topic distribution. Our key distinction from these baselines is our unique incorporation of QA, both for review selection and as explanation. Another work uses three-tier attention [48] at the word level, sentence level, and review level for learning user and item text representations toward the final rating prediction; [49] extends three-tier attention with graphs.
Rather than relying on review-level explanations, some works extract segments [29, 34, 42] or aspect-level sentiments [13, 21, 47, 53]. Another formulation is to select personalized reviews [2, 7, 10, 15]: the review selected for a given item may vary from user to user, which is orthogonal to selecting useful reviews, where the selected review for a given item is the same for all users. Cong et al. [7] use GRUs as text encoders to learn word-level and review-level representations and learn the contribution of each word/review to the rating prediction. Huang et al. [15] select personalized reviews based on extracted aspects. Dong et al. [10] employ a bi-directional LSTM [37] to learn embeddings for user and item sentences in textual reviews, then apply asymmetric attentive modules in which the text on the item side contributes to the text on the user side.
Review Generation. A few works try to predict ratings and generate reviews in a multi-task learning manner [6, 23–25, 46, 50]. Li et al. [25] use the predicted rating as sentiment, along with user and item factors as context, to generate explanation text. Chen et al. [6] extend this with attention on concepts from an oracle.2 Truong and Lauw [46] further attend to visual aspects. Li et al. [23] explicitly use aspect keywords to generate explanations. Li et al. [24] use the Transformer, a well-known language modeling technique, for personalized review generation. For more informative and factual explanation generation, Xie et al. [50] augment the review generator with external knowledge from a personalized retriever model that estimates the personalized review embedding for each user. We are concerned with the selection of existing reviews, rather than their generation.
Review Quality. We focus on the recommendation scenario. There are other formulations that seek to predict helpful reviews [9, 30, 31, 41, 44]. In those cases, the concern is with objective review quality. In contrast, this work concerns how a review is aligned to a recommendation, and thus could serve as an explanatory device.

3 Methodology

Our formulation in having a pair of QA and review to accompany recommendation based on rating regression is novel. We hypothesize that the concise questions could serve as an attention mechanism in weighing the importance of reviews. This achieves an alignment between questions and reviews, potentially allowing expanded explanations that are more comprehensive and coherent.
The overall architecture of our proposed QuestER model is shown in Figure 2. Below we describe its various components.
Fig. 2. The architecture of QuestER model.
Text Encoder. We use a widely adopted CNN text processor [2, 3, 28, 55], named TextCNN, to extract semantic features from text. TextCNN consists of a CNN followed by max pooling and a fully connected layer (see Figure 3). Particularly, we have a word embedding function \(\xi:M\rightarrow\mathbb{R}^{D}\) that maps each word in the text \(t\) into a \(D\)-dimensional vector, forming an embedded matrix \(\xi(t)\) with fixed length \(W\) (zero-padded for text with length \(<W\)). Following this embedding layer is a convolutional layer with \(m\) neurons, each associated with a filter \(F_{k}\in\mathbb{R}^{w\times D}\); the \(k\)th neuron produces features by applying the convolution operator to the embedded matrix \(\xi(t)\):
\begin{align}z_{k}=ReLU(\xi(t)*F_{k}+b_{z}),\end{align}
(1)
where \(ReLU(x)=\max(x,0)\) is a non-linear activation function and \(*\) is the convolution operation. With sliding window \(w\), the produced features are \(z^{1}_{k},z^{2}_{k},...,z^{W-w+1}_{k}\), which are passed to a max pooling layer to capture the most important feature having the highest value, defined as
\begin{align}o_{k}=\max\left(z^{1}_{k},z^{2}_{k},...,z^{W-w+1}_{k}\right).\end{align}
(2)
Fig. 3. The CNN text processor (TextCNN) architecture.
We get the final output of the convolutional layer by concatenating all output from \(m\) neurons, \(O=[o_{1},o_{2},...,o_{m}]\). A simple approach to get the final representation of the input text \(t\) is to pass \(O\) into a fully connected layer as follows:
\begin{align}X=WO+b.\end{align}
(3)
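To make the encoder concrete, below is a minimal PyTorch sketch of Equations (1)-(3); the hyperparameter values mirror those reported in Section 4, but the module itself is our illustrative code rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Sketch of Equations (1)-(3): embedding -> convolution -> max pooling -> linear."""

    def __init__(self, vocab_size: int, D: int = 100, m: int = 64, w: int = 3, out_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D, padding_idx=0)              # xi(.)
        self.conv = nn.Conv1d(in_channels=D, out_channels=m, kernel_size=w)  # filters F_k
        self.fc = nn.Linear(m, out_dim)                                      # Eq. (3): X = WO + b

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, W) integer ids, zero-padded to the fixed length W
        emb = self.embed(tokens).transpose(1, 2)   # (batch, D, W)
        z = torch.relu(self.conv(emb))             # Eq. (1): (batch, m, W - w + 1)
        o = z.max(dim=2).values                    # Eq. (2): (batch, m)
        return self.fc(o)                          # Eq. (3): (batch, out_dim)

# Example: encode two zero-padded token sequences of length W = 128.
encoder = TextCNN(vocab_size=5000)
features = encoder(torch.randint(1, 5000, (2, 128)))
print(features.shape)  # torch.Size([2, 64])
```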
Besides TextCNN, other deep learning-based text processing methods have been proposed and have claimed advantages over traditional methods, such as fastText [16], RNNs, and paragraph vectors [20]. In particular, recent large-scale pre-trained language models [11], such as BERT, perform well on a variety of natural language processing tasks. An analysis of both TextCNN and BERT as text encoders is reported in Section 4.7.
Rating Encoder. Ratings are explicit features provided by users to indicate their interest in given items. The user ratings \(r_{i:}\) form a rating pattern for user \(i\), and the item ratings \(r_{:j}\) form a rating pattern for item \(j\). A reasonable choice is to use an MLP network to learn the representation of the rating pattern [28] (see Figure 4). Specifically,
\begin{align}\begin{split} h_{i1} & =\tanh(W_{r_{i:}1}r_{i:}+b_{r_{ i:}1}),\\h_{i2} & =\tanh(W_{r_{i:}2}h_{i1}+b_{r_{i:}2}),\\...\\u_{i} & =\tanh(W_{r_{i:}k}h_{i(k-1)}+b_{r_{i:}k}).\end{split}\end{align}
(4)
Fig. 4. MLP for user ratings and item ratings encoder.
The output \(u_{i}\) is the final rating-based representation of user \(i\), and \(h_{ik}\) is the hidden representation at layer \(k\) of the MLP. We can also obtain the rating-based representation \(p_{j}\) of product \(j\) from its input ratings \(r_{:j}\) in a similar manner. We use \(\tanh\) as the activation function to project the learned rating-based representation into the same range as the text-based representations discussed in the following paragraphs.
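A minimal sketch of the rating encoder in Equation (4) follows, assuming the \(\{128,64,m\}\) hidden sizes reported in Section 4; this is illustrative code, not the released implementation.

```python
import torch
import torch.nn as nn

class RatingEncoder(nn.Module):
    """Sketch of Equation (4): an MLP over a raw rating vector with tanh activations."""

    def __init__(self, input_dim: int, hidden=(128, 64, 64)):
        super().__init__()
        layers, in_dim = [], input_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]  # tanh keeps outputs in [-1, 1]
            in_dim = h
        self.mlp = nn.Sequential(*layers)

    def forward(self, rating_vec: torch.Tensor) -> torch.Tensor:
        # rating_vec: (batch, input_dim), e.g., r_i: over all items for a user;
        # a separate instance over r_:j (all users) yields the item representation p_j.
        return self.mlp(rating_vec)  # u_i (or p_j)
```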
User Attention-Based Review Pooling. Equation (3) presumes that each review contributes equally to the final representation. Instead, the importance of each individual review's contribution to the user's final representation is learned as follows:
\begin{align}\rho_{ij} & =\tanh(W_{O_{t}}(O_{t_{ij}}\odot u_{i})+b_{\rho}), \end{align}
(5a)
\begin{align}\theta_{ij} & =W_{\rho}\rho_{ij}+b_{\theta},\end{align}
(5b)
\begin{align}\alpha_{ij} & =\frac{e^{\theta_{ij}}}{\sum_{j}{e^{\theta_{ij}}}},\end{align}
(5c)
where \(\odot\) is the element-wise multiplication operator, \(u_{i}\) is the rating-based representation of user \(i\), \(O_{t_{ij}}\) is the feature vector extracted from review text \(t_{ij}\) by TextCNN, and \(\alpha_{ij}\) is the normalized attention score of the review \(t_{ij}\), which can be interpreted as the contribution of that review to the feature profile \(O_{i}\) of user \(i\), aggregated as follows:
\begin{align}O_{i}=\sum_{j}{\alpha_{ij}O_{t_{ij}}}.\end{align}
(6)
The final representation of user \(i\) is computed as follows:
\begin{align}X_{i}=W_{O_{i}}O_{i}+b_{X}.\end{align}
(7)
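The user-side attention pooling of Equations (5)-(7) can be sketched as follows (illustrative PyTorch code under the notation above):

```python
import torch
import torch.nn as nn

class UserReviewAttention(nn.Module):
    """Sketch of Equations (5)-(7): rating-guided attention pooling over a user's reviews."""

    def __init__(self, feat_dim: int = 64, att_dim: int = 64, out_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, att_dim)   # Eq. (5a)
        self.score = nn.Linear(att_dim, 1)         # Eq. (5b)
        self.fc = nn.Linear(feat_dim, out_dim)     # Eq. (7)

    def forward(self, O_t: torch.Tensor, u_i: torch.Tensor) -> torch.Tensor:
        # O_t: (num_reviews, feat_dim) TextCNN features of the user's reviews
        # u_i: (feat_dim,) rating-based representation of the user
        rho = torch.tanh(self.proj(O_t * u_i))         # Eq. (5a): element-wise interaction
        theta = self.score(rho).squeeze(-1)            # Eq. (5b): (num_reviews,)
        alpha = torch.softmax(theta, dim=0)            # Eq. (5c)
        O_i = (alpha.unsqueeze(-1) * O_t).sum(dim=0)   # Eq. (6)
        return self.fc(O_i)                            # Eq. (7): X_i
```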
Item Question-Attentive Review-Level Explanations. Of particular importance is our modeling of product questions. A naive approach to modeling questions on the item side is to apply the same approach as for reviews. However, the connection between reviews and questions would then be overlooked. Here we presume that a product review may contain information relevant to a question. We add another attention layer based on item questions, which helps us incorporate reviews according to their contribution toward the item's questions.
First, we use TextCNN to encode reviews and QAs. Let \(O_{t_{ij}}\) be the review encoding, \(O_{q_{jk}}\) be the encoding of question \(k\) on product \(j\), and \(O_{a_{jkl}}\) be the encoding of answer \(l\) of question \(k\). With respect to each question representation \(O_{q_{jk}}\), we learn the attention weight \(\delta_{jkl}\) for answer representation \(O_{a_{jkl}}\) by projecting both the question and answer representations onto an attention space followed by a non-linear activation function; the outputs are \(\phi_{jk}\) and \(\psi_{jkl}\), respectively. We use the \(\tanh\) activation function to scale \(O_{q_{jk}}\) and \(O_{a_{jkl}}\) to the same range of values, so that neither component dominates the other. We let the question projection \(\phi_{jk}\) interact with the answer projection \(\psi_{jkl}\) in two ways: element-wise multiplication and summation. The learned vector \(V\) plays the role of global attention context. This produces an attention value \(\upsilon_{jkl}\), which is normalized using softmax to obtain \(\delta_{jkl}\):
\begin{align}\phi_{jk} & =\tanh\left(W_{O_{q}}O_{q_{jk}}+b_{\phi}\right), \\\end{align}
(8a)
\begin{align}\psi_{jkl} & =\tanh\left(W_{O_{a}}O_{a_{jkl}}+b_{\psi}\right), \\\end{align}
(8b)
\begin{align}\upsilon_{jkl} & =V^{T}\left(\psi_{jkl}\odot\phi_{jk}+\psi_{jkl}\right),\end{align}
(8c)
\begin{align}\delta_{jkl} & =\frac{e^{\frac{\upsilon_{jkl}}{\tau}}}{\sum_{l}{e^{\frac{ \upsilon_{jkl}}{\tau}}}},\end{align}
(8d)
where \(\tau\) is a temperature parameter to adjust the probabilities in the softmax. We aggregate the answer representations \(O_{a_{jkl}}\) into each question representation \(\omega_{jk}\) using the learned attention \(\delta_{jkl}\):
\begin{align}\omega_{jk}=\sum_{l}{\delta_{jkl}O_{a_{jkl}}}.\end{align}
(9)
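A sketch of the answer-to-question attention of Equations (8)-(9) follows; the temperature \(\tau\) and context vector \(V\) follow the description above, while the module itself is our illustration.

```python
import torch
import torch.nn as nn

class AnswerAttention(nn.Module):
    """Sketch of Equations (8)-(9): pool a question's answers into omega_jk."""

    def __init__(self, feat_dim: int = 64, att_dim: int = 64, tau: float = 0.01):
        super().__init__()
        self.q_proj = nn.Linear(feat_dim, att_dim)   # Eq. (8a)
        self.a_proj = nn.Linear(feat_dim, att_dim)   # Eq. (8b)
        self.V = nn.Parameter(torch.randn(att_dim))  # global attention context
        self.tau = tau                               # softmax temperature

    def forward(self, O_q: torch.Tensor, O_a: torch.Tensor) -> torch.Tensor:
        # O_q: (feat_dim,) encoding of question q_jk
        # O_a: (num_answers, feat_dim) encodings of its answers
        phi = torch.tanh(self.q_proj(O_q))                # Eq. (8a)
        psi = torch.tanh(self.a_proj(O_a))                # Eq. (8b)
        upsilon = (psi * phi + psi) @ self.V              # Eq. (8c)
        delta = torch.softmax(upsilon / self.tau, dim=0)  # Eq. (8d)
        return (delta.unsqueeze(-1) * O_a).sum(dim=0)     # Eq. (9): omega_jk
```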
Analogously, we learn the attention weight \(\beta_{ijk}\) for review representation \(O_{t_{ij}}\) by projecting both question representation \(\omega_{jk}\) and review representation onto an attention space followed by a non-linear activation function; the outputs are \(\chi_{jk}\) and \(\rho^{\prime}_{ij}\), respectively. To learn the question-specific attention weight of a review, we let the question projection \(\chi_{jk}\) interact with the review projection \(\rho^{\prime}_{ij}\) in two ways: element-wise multiplication and summation. The learned vector \(E\) plays the role of global attention context. This produces an attention value \(\eta_{ijk}\), which is normalized using softmax to obtain \(\beta_{ijk}\):
\begin{align}\chi_{jk} & =\tanh\left(W_{\omega}\omega_{jk}+b_{\chi}\right), \\\end{align}
(10a)
\begin{align}\rho^{\prime}_{ij} & =\tanh\left(W_{O_{t}}(O_{t_{ij}}\odot p_{j})+b_{\rho^{\prime}}\right), \\\end{align}
(10b)
\begin{align}\eta_{ijk} & =E^{T}\left(\chi_{jk}\odot\rho^{\prime}_{ij}+\rho^{\prime}_{ij}\right), \end{align}
(10c)
\begin{align}\beta_{ijk} & =\frac{e^{\frac{\eta_{ijk}}{\tau}}}{\sum_{i}{e^{\frac{\eta_{ijk}} {\tau}}}}.\end{align}
(10d)
Using the question-specific attention weights \(\beta_{ijk}\), we aggregate the review representations \(O_{t_{ij}}\)’s into a question-specific representation \(d_{jk}\) as follows:
\begin{align}d_{jk}=\sum_{i}{\beta_{ijk}O_{t_{ij}}}.\end{align}
(11)
For a document (a product question with all of its reviews), we apply this attention mechanism to every product question, yielding a set of question-specific document representations \(d_{jk}\), \(k\in[1,|\mathcal{Q}_{j}|]\). All the \(d_{jk}\)'s need to be aggregated into the final document representation \(O_{j}\) before incorporating it into the product representation. Thus, we seek to learn the importance weight \(\gamma_{jk}\), signifying how much each question-specific representation \(d_{jk}\) contributes to \(O_{j}\):
\begin{align}\kappa_{jk} & =K^{T}\tanh(W_{d_{jk}}d_{jk}+b_{\kappa}), \end{align}
(12a)
\begin{align}\gamma_{jk} & =\frac{e^{\frac{\kappa_{jk}}{\tau}}}{\sum_{k}{e^{\frac{\kappa_{jk }}{\tau}}}}.\end{align}
(12b)
The question-specific representation \(d_{jk}\) is projected into the attention space through a layer of neurons with the non-linear activation function \(\tanh\). The scalar \(\kappa_{jk}\) indicates the importance of \(d_{jk}\), obtained by multiplying with the global attention context vector \(K\) (randomly initialized and learned during training). The representations \(d_{jk}\) due to the various questions are aggregated into the final product representation \(O_{j}\) using soft attention pooling with attention weights \(\gamma_{jk}\):
\begin{align} O_{j} & =\sum_{k}{\gamma_{jk}d_{jk}}, \end{align}
(13a)
\begin{align} Y_{j} & =W_{O_{j}}O_{j}+b_{Y}.\end{align}
(13b)
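Equations (10)-(13) can be sketched in one module: reviews are pooled per question into \(d_{jk}\), and the \(d_{jk}\)'s are pooled across questions into \(Y_{j}\) (again illustrative code, not the authors' implementation).

```python
import torch
import torch.nn as nn

class QuestionAttentiveReviewPooling(nn.Module):
    """Sketch of Equations (10)-(13): question-specific review pooling, then pooling over questions."""

    def __init__(self, feat_dim: int = 64, att_dim: int = 64, out_dim: int = 64, tau: float = 0.01):
        super().__init__()
        self.q_proj = nn.Linear(feat_dim, att_dim)   # Eq. (10a)
        self.t_proj = nn.Linear(feat_dim, att_dim)   # Eq. (10b)
        self.E = nn.Parameter(torch.randn(att_dim))  # context vector in Eq. (10c)
        self.d_proj = nn.Linear(feat_dim, att_dim)   # Eq. (12a)
        self.K = nn.Parameter(torch.randn(att_dim))  # context vector in Eq. (12a)
        self.fc = nn.Linear(feat_dim, out_dim)       # Eq. (13b)
        self.tau = tau

    def forward(self, omega: torch.Tensor, O_t: torch.Tensor, p_j: torch.Tensor):
        # omega: (num_questions, feat_dim) QA representations omega_jk of the item
        # O_t:   (num_reviews, feat_dim) encodings of the item's reviews
        # p_j:   (feat_dim,) rating-based item representation
        chi = torch.tanh(self.q_proj(omega))            # Eq. (10a): (K, att_dim)
        rho = torch.tanh(self.t_proj(O_t * p_j))        # Eq. (10b): (N, att_dim)
        eta = (chi.unsqueeze(1) * rho + rho) @ self.E   # Eq. (10c): (K, N)
        beta = torch.softmax(eta / self.tau, dim=1)     # Eq. (10d): normalize over reviews
        d = beta @ O_t                                  # Eq. (11): (K, feat_dim)
        kappa = torch.tanh(self.d_proj(d)) @ self.K     # Eq. (12a): (K,)
        gamma = torch.softmax(kappa / self.tau, dim=0)  # Eq. (12b)
        O_j = (gamma.unsqueeze(-1) * d).sum(dim=0)      # Eq. (13a)
        Y_j = self.fc(O_j)                              # Eq. (13b)
        return Y_j, gamma, beta                         # the weights are reused for explanation selection
```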
Prediction Layer. The latent factors of user \(i\) and product \(j\) are mapped to a shared hidden space as follows:
\begin{align}h_{ij}=[u_{i};X_{i};\zeta_{u}(i)]\odot[p_{j};Y_{j};\zeta_{p}(j)],\end{align}
(14)
where \(\zeta_{u}(\cdot)\) and \(\zeta_{p}(\cdot)\) are embedding functions that map each user and each product into their respective embedding spaces, \(X_{i}\) captures user preferences obtained from the user's reviews, \(Y_{j}\) captures item features obtained from the product's reviews and questions, and \([u_{i};X_{i};\zeta_{u}(i)]\) is the concatenation of the user rating-based representation \(u_{i}\), the user attention-based review pooling \(X_{i}\), and the user embedding \(\zeta_{u}(i)\). The final rating prediction is computed as follows:
\begin{align}\hat{r}_{ij}=W^{T}h_{ij}+b_{i}+b_{j}+\mu.\end{align}
(15)
Learning. Similar to prior works on the rating prediction task [3, 28, 43], which is a regression problem, we adopt the squared loss function:
\begin{align}\mathcal{L}=\sum_{i,j\in\Omega}{(\hat{r}_{ij}-r_{ij})^{2}},\end{align}
(16)
where \(\Omega\) denotes the set of all training instances and \(r_{ij}\) is the ground-truth rating that user \(i\) assigned to product \(j\).
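A minimal sketch of the prediction layer and loss in Equations (14)-(16), keeping the bias terms and embeddings as described (illustrative code):

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Sketch of Equations (14)-(15): interaction of concatenated user/item factors plus biases."""

    def __init__(self, num_users: int, num_items: int, dim: int):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)  # zeta_u
        self.item_emb = nn.Embedding(num_items, dim)  # zeta_p
        self.user_bias = nn.Embedding(num_users, 1)   # b_i
        self.item_bias = nn.Embedding(num_items, 1)   # b_j
        self.mu = nn.Parameter(torch.zeros(1))        # global bias
        self.W = nn.Linear(3 * dim, 1, bias=False)    # W in Eq. (15)

    def forward(self, i, j, u_i, X_i, p_j, Y_j):
        h_user = torch.cat([u_i, X_i, self.user_emb(i)], dim=-1)
        h_item = torch.cat([p_j, Y_j, self.item_emb(j)], dim=-1)
        h_ij = h_user * h_item                                              # Eq. (14)
        r_hat = self.W(h_ij) + self.user_bias(i) + self.item_bias(j) + self.mu
        return r_hat.squeeze(-1)                                            # Eq. (15)

# Eq. (16): squared error over the observed training ratings.
loss_fn = nn.MSELoss(reduction="sum")
```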
The most important question \(\mathbb{Q}\) is selected by computing \(\mathbb{Q}=\operatorname{argmax}_{k}(\gamma_{jk})\), and the most useful review is selected by \(\operatorname{argmax}_{i}(\beta_{ij\mathbb{Q}})\). We use the selected question with its answer and the selected review collectively as the explanation for a given recommendation.
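Reading off the explanation from the learned attention weights could look like the following sketch, where gamma and beta are the tensors produced by the attention layers above (the function name is ours):

```python
import torch

def select_explanation(gamma: torch.Tensor, beta: torch.Tensor, questions, reviews):
    # gamma: (num_questions,) attention weights gamma_jk over the item's questions
    # beta:  (num_questions, num_reviews) question-specific attention over the item's reviews
    q_idx = int(torch.argmax(gamma))         # most important question Q
    r_idx = int(torch.argmax(beta[q_idx]))   # most useful review with respect to Q
    return questions[q_idx], reviews[r_idx]  # shown together as the explanation
```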
A limitation of relying only on the questions found within a product is that product features may not be captured completely, because some products do not have sufficient questions to cover all their important aspects. As a result, an important review may be overlooked because it does not correspond to any question. To address this limitation, in addition to the questions found for a product, we include one more global “General Question,” which allows those important reviews to still be aligned. This additional question plays the role of a “global” aspect, and also helps our model to potentially generalize to products without questions.

4 Experiments

As this work is primarily about recommendation explanations, rather than rating prediction per se, and the two objectives are not necessarily directionally equivalent, our orientation is to improve explanations while maintaining parity in accuracy performance. In particular, our core contribution is in incorporating QA for review-level explanation. The experimental objectives revolve around the utility of QA as part of explanation, the effectiveness of QA to aid the selection of review-level explanation, and the alignment of QA and review that are part of an explanation. Source code is available for reproducibility.3
Datasets. Toward reproducibility, we work with publicly available sources. While QA is a feature on many platforms, not many such datasets have both reviews and QA information. One that does is the Amazon Product Review Dataset4 [12]. We experiment on 10 product categories from this source as separate instances. These categories are selected for significant availability of QA information. Consistent performance across multiple categories with different statistics bolsters the analysis. Table 2 summarizes basic statistics of the 10 datasets.
Dataset | #Item | #User | #Review (Rating) | #Question | #Answer | \(\frac{{\text{#Item with Question}}}{{\text{#Item}}}\) | \(\frac{{\text{#Answer}}}{{\text{#Question}}}\)
Home | 28,169 | 66,295 | 549,895 | 368,904 | 1,079,983 | 0.3193 | 2.93
Health | 18,464 | 38,416 | 344,888 | 105,814 | 207,330 | 0.1731 | 1.96
Sport | 18,301 | 35,447 | 295,074 | 123,119 | 237,845 | 0.1940 | 1.93
Toy | 11,870 | 19,322 | 166,821 | 35,520 | 75,276 | 0.1463 | 2.12
Grocery | 8,690 | 14,632 | 150,802 | 18,134 | 42,779 | 0.1301 | 2.36
Baby | 7,039 | 19,418 | 160,521 | 32,507 | 58,345 | 0.1301 | 1.79
Office | 2,414 | 4,892 | 53,143 | 68,864 | 165,623 | 0.4544 | 2.41
Automotive | 1,810 | 2,892 | 20,203 | 40,477 | 79,034 | 0.3470 | 1.95
Patio | 951 | 1,667 | 13,133 | 22,454 | 53,550 | 0.3049 | 2.38
Musical | 893 | 1,416 | 10,163 | 22,409 | 47,357 | 0.5622 | 2.11
Table 2. Data Statistics
For greater coverage, we collect item questions and acquire their helpful voting scores from the Amazon.com Web site.5 These question data complement yet are distinct from [33], which does not include helpful voting scores for every QA. Reviews that are too short (fewer than three words), as well as users and items with fewer than five reviews, are filtered out. To aggregate overlapping questions, we cluster the questions in each category with k-means, keeping questions from the big clusters that cover \(80\%\) of the questions. For the smaller clusters, we keep the question nearest to each cluster centroid and combine them into a single text, called the General Question (all products have this by default). This is used solely for modeling, to generalize to items without questions, and is not used as a recommendation explanation. Moreover, a question is always associated with at least one answer (when available). For questions without any answer, the question content is used as its own answer. In the subsequent experiments, we investigate QuestER, which includes only one answer, and QuestER+, which includes a maximum of five answers (an analysis of the maximum number of answers is reported in Section 4.4).
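A hedged sketch of this question-aggregation step, assuming TF-IDF features and scikit-learn's k-means; the feature choice and helper name are ours, not necessarily the authors' preprocessing code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def build_general_question(questions, k=50, coverage=0.8):
    """Keep questions from big clusters covering `coverage` of all questions;
    merge the centroid-nearest questions of the remaining clusters into one text."""
    X = TfidfVectorizer().fit_transform(questions)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=k)
    order = np.argsort(-sizes)                                # clusters, largest first
    cum = np.cumsum(sizes[order]) / len(questions)
    big = set(order[: int(np.searchsorted(cum, coverage)) + 1])
    kept = [q for q, c in zip(questions, km.labels_) if c in big]
    dists = km.transform(X)                                   # distances to each centroid
    general = [questions[int(np.argmin(dists[:, c]))] for c in range(k) if c not in big]
    return kept, " ".join(general)                            # the "General Question" text
```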
Baselines. We evaluate our proposed QuestER and QuestER+ against the following baselines in terms of useful review and QA selection. Comparisons between methods are tested with a one-tailed paired-sample Student’s t-test at the 0.05 level.
HRDR [28] uses an attention mechanism with the rating-based representation as features to weigh the contribution of each individual review toward the user/item final representation.
NARRE [3] learns to predict ratings and the usefulness of each review by applying an attention mechanism over reviews on user/item embeddings.
HFT [32] models the latent factors from user or item reviews by employing topic distributions. In this work, we employ item reviews and apply their proposed useful-review retrieval approach for selecting useful reviews. The number of topics is \(K=50\).
Among the three selected baselines, HRDR and NARRE use a similar TextCNN for learning text representations. There are other works that use other text processors [48, 49] (discussed more in Section 2), which we do not consider as direct baselines in this work. Note that our key distinction from the above-mentioned baselines is that we further incorporate product questions. As there is no prior work on predicting ratings along with selecting useful questions, when the evaluative task involves selecting questions (question retrieval and question similarity tasks, see Sections 4.1 and 4.3), we apply a similar approach for each baseline such that the item text consists of item questions instead of item reviews.
Training Details. Each item’s reviews are split randomly into train, validation, and test sets with ratio \(0.8:0.1:0.1\). Unknown users are excluded from the validation and test sets. Reviews in the validation and test sets are excluded from training and are not used for rating prediction on validation/test data. Answers are appended as additional text to the corresponding question. We employ pre-trained word embeddings from GloVe [36] to initialize the text embedding matrix with dimensionality \(100\); the embedding matrix is shared between reviews and questions. We use separate TextCNNs for user reviews, item reviews, and item QAs. The maximum number of tokens for each text \(W\) is \(128\), the number of neurons in the convolutional layer \(m\) is \(64\), and the window size \(w\) is \(3\). The number of latent factors was tested in \(k\in\{8,16,32,64\}\). After tuning, we set \(k=8\) for memory efficiency, as using larger \(k\) does not improve the performance significantly. The dropout ratio is \(0.5\) as in [3], and \(\tau\) is \(0.01\). We apply a 3-layer MLP for rating-based representation modeling as in [28], with the number of neural units in the hidden layers being \(\{128,64,m\}\). Using the Adam optimizer [17] with an initial learning rate of \(10^{-3}\) and a mini-batch size of \(64\), we see models tend to converge before \(20\) epochs. We set a maximum of \(20\) epochs and report the test result from the best performing model (the lowest MSE) on validation, a uniform practice across methods.
Brief Comment on Running Time. Our focus in this work is recommendation explanation, rather than computational efficiency. The models can be run offline. For a sense of the running times, Table 3 reports the training time and testing time of all models on a machine with an AMD EPYC 7742 64-Core Processor and an NVIDIA Quadro RTX 8000. Increasing the maximum number of answers to 5 (QuestER+) slows down training by approximately 1.3 \(\sim\) 1.5 times compared to training with only one answer. The inference times of all models are similarly fast.
Model | Home | Health | Sport | Toy | Grocery | Baby | Office | Automotive | Patio | Musical
QuestER | 19,707/4.7 | 11,892/2.9 | 9,805/2.7 | 5,453/1.5 | 5,014/1.3 | 5,360/1.4 | 2,011/0.4 | 646/0.2 | 448/0.1 | 334/0.1
QuestER+ | 27,493/4.8 | 16,560/2.8 | 13,966/2.7 | 7,741/1.5 | 7,202/1.2 | 7,690/1.4 | 2,651/0.4 | 931/0.2 | 649/0.2 | 503/0.1
HRDR | 13,603/4.3 | 10,906/3.8 | 9,770/2.5 | 3,199/1.5 | 3,048/1.2 | 3,704/1.4 | 1,158/0.4 | 424/0.2 | 267/0.1 | 180/0.1
NARRE | 9,855/5.1 | 6,254/3.1 | 5,034/2.8 | 2,093/1.6 | 2,755/1.3 | 2,855/1.5 | 1,067/0.4 | 329/0.2 | 249/0.1 | 172/0.1
HFT | 9,399/4.0 | 5,806/2.3 | 5,452/2.2 | 3,305/1.2 | 2,508/1.0 | 2,665/1.2 | 1,052/0.4 | 395/0.2 | 460/0.2 | 253/0.1
Table 3. Running Time (Train (Seconds)/Test (Seconds))

4.1 Question and Review Alignment

Our proposed recommendation explanation consists of a QA and a review. Ideally, these two components, QA on one hand, and review on the other hand, are well-aligned for a more coherent explanation. We measure this alignment using ROUGE [26] and METEOR [1], two well-known metrics for text matching and text summarization. To cater to words as well as phrases, we report F-Measure of ROUGE-1 measuring the overlapping unigrams, ROUGE-2 measuring the overlapping bigrams, and ROUGE-L measuring the longest common subsequence between the reference summary and evaluated summary. We compute ROUGE and METEOR scores for the top-1 selected question and review and report them in Table 4.
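The alignment computation could be reproduced roughly as follows, using the rouge-score package and NLTK's METEOR implementation; these tooling choices are ours and may differ from the authors' evaluation scripts.

```python
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score  # requires the NLTK wordnet data

def alignment_scores(question_with_answer: str, review: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(question_with_answer, review)
    return {
        "ROUGE-1": rouge["rouge1"].fmeasure,
        "ROUGE-2": rouge["rouge2"].fmeasure,
        "ROUGE-L": rouge["rougeL"].fmeasure,
        "METEOR": meteor_score([question_with_answer.split()], review.split()),
    }
```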
Data | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR
Home | QuestER | 15.68\({}^{\rm a}\) | 0.88\({}^{\rm a}\) | 7.73\({}^{\rm a}\) | 9.56\({}^{\rm a}\)
 | QuestER+ | 15.65\({}^{\rm a}\) | 0.89\({}^{\rm a}\) | 7.71\({}^{\rm a}\) | 9.55\({}^{\rm a}\)
 | HRDR | 14.85 | 0.75 | 7.08 | 8.36
 | NARRE | 14.66 | 0.72 | 6.57 | 7.39
 | HFT | 13.55 | 0.66 | 6.40 | 7.53
Health | QuestER | 19.54 | 1.59 | 7.99\({}^{\rm a}\) | 9.89\({}^{\rm a}\)
 | QuestER+ | 19.58 | 1.58 | 8.01\({}^{\rm a}\) | 9.90\({}^{\rm a}\)
 | HRDR | 19.59 | 1.59 | 7.88 | 9.65
 | NARRE | 17.97 | 1.33 | 6.45 | 7.31
 | HFT | 17.13 | 1.28 | 6.59 | 7.93
Sport | QuestER | 15.52\({}^{\rm a}\) | 0.72\({}^{\rm a}\) | 7.33\({}^{\rm a}\) | 9.04\({}^{\rm a}\)
 | QuestER+ | 15.56\({}^{\rm a}\) | 0.73\({}^{\rm a}\) | 7.35\({}^{\rm a}\) | 9.07\({}^{\rm a}\)
 | HRDR | 15.25 | 0.64 | 7.14 | 8.35
 | NARRE | 14.52 | 0.56 | 6.21 | 7.00
 | HFT | 13.88 | 0.56 | 6.09 | 7.29
Toy | QuestER | 15.80\({}^{\rm a}\) | 1.17\({}^{\rm a}\) | 7.84\({}^{\rm a}\) | 9.41\({}^{\rm a}\)
 | QuestER+ | 15.80\({}^{\rm a}\) | 1.17\({}^{\rm a}\) | 7.83\({}^{\rm a}\) | 9.41\({}^{\rm a}\)
 | HRDR | 15.20 | 1.08 | 7.18 | 8.12
 | NARRE | 15.08 | 1.03 | 7.05 | 7.86
 | HFT | 14.05 | 0.96 | 6.53 | 7.39
Grocery | QuestER | 16.82\({}^{\rm a}\) | 0.74\({}^{\rm a}\) | 7.04\({}^{\rm a}\) | 8.15\({}^{\rm a}\)
 | QuestER+ | 16.80\({}^{\rm a}\) | 0.74\({}^{\rm a}\) | 7.05\({}^{\rm a}\) | 8.13\({}^{\rm a}\)
 | HRDR | 16.18 | 0.67 | 6.45 | 7.35
 | NARRE | 15.22 | 0.56 | 5.51 | 5.85
 | HFT | 14.68 | 0.57 | 5.71 | 6.46
Baby | QuestER | 18.82\({}^{\rm a}\) | 1.23\({}^{\rm a}\) | 7.84\({}^{\rm a}\) | 10.59\({}^{\rm a}\)
 | QuestER+ | 18.80\({}^{\rm a}\) | 1.22\({}^{\rm a}\) | 7.81\({}^{\rm a}\) | 10.54\({}^{\rm a}\)
 | HRDR | 18.51 | 1.15 | 7.39 | 9.75
 | NARRE | 17.64 | 1.04 | 6.79 | 8.50
 | HFT | 15.93 | 0.88 | 6.14 | 7.61
Office | QuestER | 18.00\({}^{\rm a}\) | 0.99\({}^{\rm a}\) | 7.89\({}^{\rm a}\) | 12.44\({}^{\rm a}\)
 | QuestER+ | 17.82\({}^{\rm a}\) | 0.99\({}^{\rm a}\) | 7.76\({}^{\rm a}\) | 12.27\({}^{\rm a}\)
 | HRDR | 17.53 | 0.76 | 7.36 | 11.36
 | NARRE | 17.14 | 0.70 | 6.76 | 9.13
 | HFT | 15.07 | 0.61 | 6.32 | 8.93
Automotive | QuestER | 17.94 | 1.22 | 7.98\({}^{\rm a}\) | 10.36
 | QuestER+ | 17.79 | 1.19 | 7.85 | 10.36
 | HRDR | 17.72 | 1.16 | 7.65 | 10.28
 | NARRE | 16.35 | 0.91 | 6.16 | 7.36
 | HFT | 15.27 | 0.88 | 6.41 | 8.12
Patio | QuestER | 18.93 | 1.74 | 8.96 | 13.19
 | QuestER+ | 18.91 | 1.76 | 9.07 | 13.29
 | HRDR | 18.55 | 1.73 | 8.94 | 13.29
 | NARRE | 16.90 | 1.32 | 7.12 | 9.42
 | HFT | 15.53 | 1.24 | 7.13 | 10.48
Musical | QuestER | 16.42\({}^{\rm a}\) | 0.96\({}^{\rm a}\) | 7.44\({}^{\rm a}\) | 11.16\({}^{\rm a}\)
 | QuestER+ | 16.11\({}^{\rm a}\) | 0.91\({}^{\rm a}\) | 7.37\({}^{\rm a}\) | 10.71\({}^{\rm a}\)
 | HRDR | 14.81 | 0.70 | 6.63 | 9.75
 | NARRE | 13.94 | 0.48 | 5.64 | 6.90
 | HFT | 12.98 | 0.55 | 5.96 | 8.73
Table 4. Performance in Question and Review Alignment
\({}^{\rm a}\)Denotes statistically significant improvements. Highest values are in bold.
The results show that the proposed QuestER and QuestER+ consistently and significantly outperform the baselines across virtually all the datasets. This shows that QuestER's QAs and reviews that are part of a collective explanation are better aligned with each other, as compared to the respective pairings identified by the baselines. Note that HRDR, NARRE, and HFT were designed solely to select helpful reviews. To be able to compare with these models, we ran each model twice, once with reviews and another time replacing item reviews with QAs. This approach essentially treats review and question in a disjoint manner, which contributes to why they underperform compared to our proposed QuestER, which jointly selects a review and a question that are well aligned with each other.

4.2 Review-Level Explanation

Here we assess whether incorporating questions helps in selecting reviews for the explanation. We take the reviews with the greatest positive helpfulness voting scores on every product as the ground truth to study the performance of selecting useful reviews. We use Precision at \(5\) (Prec@5), Recall at \(5\) (Rec@5), and F1@5 for evaluation. As reported in Table 5 (left), our proposed QuestER and QuestER+ are the better-performing methods overall. Their outperformance over the baseline models is statistically significant in the majority of cases. Even in the remaining cases, QuestER still significantly outperforms NARRE (on the Automotive, Patio, and Musical categories) and HFT (on the Automotive category).
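For reference, the ranking metrics can be computed with a few lines; the function below is an illustrative sketch, not the paper's evaluation code.

```python
def precision_recall_f1_at_k(ranked_ids, relevant_ids, k=5):
    # ranked_ids: review ids ordered by the model's attention weights
    # relevant_ids: ids of the ground-truth most-voted reviews for the product
    top_k = ranked_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    prec = hits / k
    rec = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1
```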
Data | Model | Prec@5 | Rec@5 | F1@5 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR
Home | QuestER | 0.145\({}^{\rm a}\) | 0.634\({}^{\rm a}\) | 0.231\({}^{\rm a}\) | 34.27\({}^{\rm a}\) | 18.48\({}^{\rm a}\) | 24.56\({}^{\rm a}\) | 27.93\({}^{\rm a}\)
 | QuestER+ | 0.145\({}^{\rm a}\) | 0.632\({}^{\rm a}\) | 0.231\({}^{\rm a}\) | 34.34\({}^{\rm a}\) | 18.58\({}^{\rm a}\) | 24.66\({}^{\rm a}\) | 27.97\({}^{\rm a}\)
 | HRDR | 0.136 | 0.588 | 0.216 | 32.00 | 16.21 | 22.28 | 25.46
 | NARRE | 0.129 | 0.557 | 0.204 | 27.53 | 11.84 | 17.79 | 21.06
 | HFT | 0.141 | 0.613 | 0.224 | 28.87 | 14.36 | 19.99 | 23.35
Health | QuestER | 0.152\({}^{\rm a}\) | 0.645\({}^{\rm a}\) | 0.239\({}^{\rm a}\) | 34.13\({}^{\rm a}\) | 19.16\({}^{\rm a}\) | 24.93\({}^{\rm a}\) | 27.99
 | QuestER+ | 0.152\({}^{\rm a}\) | 0.645\({}^{\rm a}\) | 0.239\({}^{\rm a}\) | 34.20\({}^{\rm a}\) | 19.22\({}^{\rm a}\) | 25.00\({}^{\rm a}\) | 28.03
 | HRDR | 0.142 | 0.601 | 0.224 | 33.17 | 17.65 | 23.62 | 28.02
 | NARRE | 0.137 | 0.574 | 0.215 | 26.46 | 11.63 | 17.27 | 20.62
 | HFT | 0.149 | 0.635 | 0.236 | 28.69 | 14.70 | 20.18 | 23.86
Sport | QuestER | 0.157 | 0.663\({}^{\rm a}\) | 0.247 | 34.65\({}^{\rm a}\) | 19.60\({}^{\rm a}\) | 25.37\({}^{\rm a}\) | 28.50
 | QuestER+ | 0.157 | 0.663\({}^{\rm a}\) | 0.248 | 34.64\({}^{\rm a}\) | 19.60\({}^{\rm a}\) | 25.36\({}^{\rm a}\) | 28.50
 | HRDR | 0.151 | 0.633 | 0.237 | 34.24 | 19.04 | 24.87 | 28.73
 | NARRE | 0.141 | 0.591 | 0.222 | 27.84 | 12.63 | 18.41 | 22.24
 | HFT | 0.155 | 0.656 | 0.245 | 29.63 | 15.48 | 20.95 | 24.83
Toy | QuestER | 0.158\({}^{\rm a}\) | 0.682\({}^{\rm a}\) | 0.250\({}^{\rm a}\) | 36.72\({}^{\rm a}\) | 21.04\({}^{\rm a}\) | 26.68\({}^{\rm a}\) | 29.99\({}^{\rm a}\)
 | QuestER+ | 0.158\({}^{\rm a}\) | 0.681\({}^{\rm a}\) | 0.250\({}^{\rm a}\) | 36.74\({}^{\rm a}\) | 21.08\({}^{\rm a}\) | 26.74\({}^{\rm a}\) | 30.02\({}^{\rm a}\)
 | HRDR | 0.143 | 0.611 | 0.226 | 31.67 | 15.20 | 21.14 | 25.72
 | NARRE | 0.143 | 0.611 | 0.226 | 30.35 | 14.13 | 19.98 | 24.19
 | HFT | 0.149 | 0.642 | 0.236 | 30.18 | 15.48 | 20.81 | 24.58
Grocery | QuestER | 0.165\({}^{\rm a}\) | 0.695\({}^{\rm a}\) | 0.260\({}^{\rm a}\) | 36.31\({}^{\rm a}\) | 21.36\({}^{\rm a}\) | 27.22\({}^{\rm a}\) | 30.23\({}^{\rm a}\)
 | QuestER+ | 0.165\({}^{\rm a}\) | 0.697\({}^{\rm a}\) | 0.261\({}^{\rm a}\) | 36.13\({}^{\rm a}\) | 21.13\({}^{\rm a}\) | 27.00\({}^{\rm a}\) | 30.01\({}^{\rm a}\)
 | HRDR | 0.155 | 0.649 | 0.244 | 32.49 | 16.83 | 22.90 | 28.08
 | NARRE | 0.152 | 0.635 | 0.239 | 28.66 | 13.33 | 19.26 | 23.03
 | HFT | 0.162 | 0.681 | 0.255 | 30.43 | 16.05 | 21.70 | 25.73
Baby | QuestER | 0.138\({}^{\rm a}\) | 0.578\({}^{\rm a}\) | 0.217\({}^{\rm a}\) | 35.05\({}^{\rm a}\) | 18.00\({}^{\rm a}\) | 24.15\({}^{\rm a}\) | 27.70\({}^{\rm a}\)
 | QuestER+ | 0.139\({}^{\rm a}\) | 0.583\({}^{\rm a}\) | 0.218\({}^{\rm a}\) | 34.96\({}^{\rm a}\) | 17.87\({}^{\rm a}\) | 24.03\({}^{\rm a}\) | 27.70\({}^{\rm a}\)
 | HRDR | 0.123 | 0.509 | 0.192 | 31.55 | 13.85 | 20.21 | 25.27
 | NARRE | 0.119 | 0.496 | 0.187 | 28.23 | 11.15 | 17.22 | 21.16
 | HFT | 0.128 | 0.537 | 0.201 | 27.77 | 12.49 | 18.00 | 21.39
Office | QuestER | 0.144\({}^{\rm a}\) | 0.597\({}^{\rm a}\) | 0.222\({}^{\rm a}\) | 35.19\({}^{\rm a}\) | 18.40\({}^{\rm a}\) | 24.12\({}^{\rm a}\) | 28.62
 | QuestER+ | 0.145\({}^{\rm a}\) | 0.601\({}^{\rm a}\) | 0.224\({}^{\rm a}\) | 35.96\({}^{\rm a}\) | 19.32\({}^{\rm a}\) | 24.97\({}^{\rm a}\) | 29.81
 | HRDR | 0.135 | 0.548 | 0.207 | 33.22 | 15.70 | 21.69 | 28.90
 | NARRE | 0.124 | 0.500 | 0.189 | 26.35 | 9.83 | 15.28 | 19.71
 | HFT | 0.126 | 0.516 | 0.193 | 27.04 | 12.00 | 17.04 | 21.48
Automotive | QuestER | 0.176 | 0.745 | 0.278 | 36.75 | 22.28 | 27.91 | 31.11
 | QuestER+ | 0.174 | 0.740 | 0.275 | 36.25 | 21.78 | 27.48 | 30.41
 | HRDR | 0.173 | 0.731 | 0.273 | 35.79 | 20.59 | 26.62 | 31.94
 | NARRE | 0.156 | 0.651 | 0.245 | 26.89 | 12.09 | 17.69 | 21.28
 | HFT | 0.168 | 0.710 | 0.265 | 29.94 | 15.95 | 21.44 | 25.04
Patio | QuestER | 0.166 | 0.694 | 0.256 | 38.10 | 21.94 | 27.58 | 32.27
 | QuestER+ | 0.164 | 0.685 | 0.253 | 37.14 | 20.87 | 26.61 | 31.07
 | HRDR | 0.165 | 0.679 | 0.252 | 37.01 | 20.13 | 25.93 | 32.72
 | NARRE | 0.152 | 0.629 | 0.233 | 28.25 | 11.65 | 17.23 | 22.38
 | HFT | 0.168 | 0.704 | 0.260 | 33.01 | 17.97 | 23.26 | 27.74
Musical | QuestER | 0.145 | 0.617 | 0.230 | 35.40\({}^{\rm a}\) | 20.18\({}^{\rm a}\) | 25.88\({}^{\rm a}\) | 30.13\({}^{\rm a}\)
 | QuestER+ | 0.149 | 0.633 | 0.236 | 36.21\({}^{\rm a}\) | 21.26\({}^{\rm a}\) | 26.82\({}^{\rm a}\) | 30.69\({}^{\rm a}\)
 | HRDR | 0.144 | 0.611 | 0.228 | 32.38 | 16.07 | 22.05 | 27.38
 | NARRE | 0.132 | 0.563 | 0.210 | 25.53 | 10.16 | 15.87 | 19.41
 | HFT | 0.144 | 0.613 | 0.228 | 27.40 | 12.56 | 18.08 | 22.32
Table 5. Performance in Review-Level Explanation Task
\({}^{\rm a}\)Denotes statistically significant improvements over the baselines. Highest values are in bold.
To further assess the quality of the top-ranked reviews against the top-rated helpful reviews, we again use ROUGE and METEOR as metrics. The results in Table 5 consistently show that our proposed QuestER and QuestER+ outperform all baseline models significantly in the majority of cases, i.e., the top-ranked reviews from QuestER and QuestER+ are more similar to the top-rated helpful reviews than those of HRDR, NARRE, and HFT. Overall, QuestER and QuestER+ use product QA in addition to reviews and achieve better results than the baseline methods that only use reviews, suggesting that using QA aids in selecting more useful reviews.

4.3 Question-Level Explanation

The novelty of the proposed QuestER and QuestER+ is in producing question-level explanations along with review-level explanations. We conduct a quantitative evaluation homologous to the Review-Level Explanation one above, but now with question votes as the ground truth, and measure Prec@5, Rec@5, and F1@5. In addition, we measure the similarity between the top-ranked question by QuestER (or QuestER+) and the top-voted useful question using ROUGE and METEOR; only the questions are evaluated here. As shown in Table 6, QuestER and QuestER+ are significantly better than the other baselines throughout. This result further highlights the improvement of the current version of QuestER (this work) over the previous version [22], achieving quantitatively better results for question-level explanation.
Data | Model | Prec@5 | Rec@5 | F1@5 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR
Home | QuestER | 0.097\({}^{\rm a}\) | 0.360\({}^{\rm a}\) | 0.146\({}^{\rm a}\) | 21.09\({}^{\rm a}\) | 10.97\({}^{\rm a}\) | 17.82\({}^{\rm a}\) | 20.26\({}^{\rm a}\)
 | QuestER+ | 0.097\({}^{\rm a}\) | 0.365\({}^{\rm a}\) | 0.147\({}^{\rm a}\) | 20.95\({}^{\rm a}\) | 10.88\({}^{\rm a}\) | 17.67\({}^{\rm a}\) | 20.26\({}^{\rm a}\)
 | HRDR | 0.082 | 0.307 | 0.124 | 17.47 | 7.52 | 13.22 | 16.51
 | NARRE | 0.082 | 0.307 | 0.124 | 17.69 | 7.77 | 13.52 | 16.75
 | HFT | 0.082 | 0.309 | 0.125 | 17.72 | 8.14 | 14.91 | 16.33
Health | QuestER | 0.115\({}^{\rm a}\) | 0.447\({}^{\rm a}\) | 0.177\({}^{\rm a}\) | 23.45\({}^{\rm a}\) | 14.36\({}^{\rm a}\) | 20.51\({}^{\rm a}\) | 22.98\({}^{\rm a}\)
 | QuestER+ | 0.114\({}^{\rm a}\) | 0.439\({}^{\rm a}\) | 0.175\({}^{\rm a}\) | 23.65\({}^{\rm a}\) | 14.24\({}^{\rm a}\) | 20.72\({}^{\rm a}\) | 22.87\({}^{\rm a}\)
 | HRDR | 0.091 | 0.347 | 0.139 | 16.74 | 7.25 | 12.00 | 16.64
 | NARRE | 0.089 | 0.342 | 0.136 | 17.62 | 8.18 | 13.70 | 16.73
 | HFT | 0.092 | 0.353 | 0.140 | 18.36 | 8.95 | 15.63 | 17.33
Sport | QuestER | 0.114\({}^{\rm a}\) | 0.443\({}^{\rm a}\) | 0.175\({}^{\rm a}\) | 24.03\({}^{\rm a}\) | 14.04\({}^{\rm a}\) | 20.89\({}^{\rm a}\) | 23.24\({}^{\rm a}\)
 | QuestER+ | 0.116\({}^{\rm a}\) | 0.447\({}^{\rm a}\) | 0.178\({}^{\rm a}\) | 23.40\({}^{\rm a}\) | 13.35\({}^{\rm a}\) | 20.18\({}^{\rm a}\) | 22.81\({}^{\rm a}\)
 | HRDR | 0.085 | 0.329 | 0.131 | 13.13 | 3.65 | 7.79 | 12.71
 | NARRE | 0.088 | 0.335 | 0.134 | 18.14 | 8.08 | 13.83 | 17.04
 | HFT | 0.090 | 0.343 | 0.138 | 20.13 | 10.03 | 17.26 | 18.69
Toy | QuestER | 0.130\({}^{\rm a}\) | 0.485\({}^{\rm a}\) | 0.197\({}^{\rm a}\) | 23.80\({}^{\rm a}\) | 14.82\({}^{\rm a}\) | 20.77\({}^{\rm a}\) | 23.74\({}^{\rm a}\)
 | QuestER+ | 0.126\({}^{\rm a}\) | 0.468\({}^{\rm a}\) | 0.191\({}^{\rm a}\) | 23.85\({}^{\rm a}\) | 14.02\({}^{\rm a}\) | 20.70\({}^{\rm a}\) | 23.70\({}^{\rm a}\)
 | HRDR | 0.106 | 0.392 | 0.161 | 14.50 | 5.27 | 9.21 | 15.61
 | NARRE | 0.107 | 0.394 | 0.162 | 19.15 | 10.00 | 15.10 | 19.69
 | HFT | 0.110 | 0.404 | 0.166 | 21.16 | 11.80 | 18.49 | 20.79
Grocery | QuestER | 0.125\({}^{\rm a}\) | 0.503\({}^{\rm a}\) | 0.194\({}^{\rm a}\) | 26.92\({}^{\rm a}\) | 18.08\({}^{\rm a}\) | 24.08\({}^{\rm a}\) | 26.11\({}^{\rm a}\)
 | QuestER+ | 0.124\({}^{\rm a}\) | 0.504\({}^{\rm a}\) | 0.193\({}^{\rm a}\) | 23.32 | 14.01 | 20.11 | 22.12\({}^{\rm a}\)
 | HRDR | 0.105 | 0.427 | 0.164 | 20.16 | 10.79 | 15.79 | 19.17
 | NARRE | 0.103 | 0.425 | 0.161 | 17.66 | 8.28 | 13.18 | 17.48
 | HFT | 0.105 | 0.437 | 0.166 | 21.70 | 12.28 | 18.93 | 19.37
Baby | QuestER | 0.110\({}^{\rm a}\) | 0.399\({}^{\rm a}\) | 0.166\({}^{\rm a}\) | 23.70\({}^{\rm a}\) | 13.21\({}^{\rm a}\) | 20.16\({}^{\rm a}\) | 22.62
 | QuestER+ | 0.104\({}^{\rm a}\) | 0.384\({}^{\rm a}\) | 0.157\({}^{\rm a}\) | 22.52 | 11.43 | 18.72 | 21.30
 | HRDR | 0.085 | 0.317 | 0.129 | 15.07 | 4.22 | 9.78 | 15.32
 | NARRE | 0.086 | 0.327 | 0.132 | 20.63 | 9.58 | 16.28 | 20.18
 | HFT | 0.085 | 0.314 | 0.129 | 19.34 | 9.88 | 16.57 | 17.45
Office | QuestER | 0.101\({}^{\rm a}\) | 0.399\({}^{\rm a}\) | 0.155\({}^{\rm a}\) | 21.85\({}^{\rm a}\) | 11.98\({}^{\rm a}\) | 18.60\({}^{\rm a}\) | 20.67\({}^{\rm a}\)
 | QuestER+ | 0.107\({}^{\rm a}\) | 0.415\({}^{\rm a}\) | 0.164\({}^{\rm a}\) | 21.63\({}^{\rm a}\) | 11.56\({}^{\rm a}\) | 18.13\({}^{\rm a}\) | 20.75\({}^{\rm a}\)
 | HRDR | 0.075 | 0.291 | 0.115 | 14.08 | 4.00 | 8.61 | 12.84
 | NARRE | 0.072 | 0.273 | 0.109 | 13.57 | 3.78 | 8.53 | 12.72
 | HFT | 0.075 | 0.290 | 0.115 | 17.36 | 7.44 | 14.54 | 15.57
Automotive | QuestER | 0.106\({}^{\rm a}\) | 0.416\({}^{\rm a}\) | 0.163\({}^{\rm a}\) | 26.50\({}^{\rm a}\) | 15.76\({}^{\rm a}\) | 23.23\({}^{\rm a}\) | 25.39\({}^{\rm a}\)
 | QuestER+ | 0.107\({}^{\rm a}\) | 0.417\({}^{\rm a}\) | 0.164\({}^{\rm a}\) | 28.61\({}^{\rm a}\) | 18.39\({}^{\rm a}\) | 25.58\({}^{\rm a}\) | 27.49\({}^{\rm a}\)
 | HRDR | 0.063 | 0.251 | 0.097 | 14.57 | 3.65 | 10.36 | 12.25
 | NARRE | 0.063 | 0.253 | 0.098 | 16.15 | 5.31 | 11.01 | 14.82
 | HFT | 0.060 | 0.242 | 0.093 | 15.82 | 5.79 | 13.18 | 13.26
Patio | QuestER | 0.094\({}^{\rm a}\) | 0.384\({}^{\rm a}\) | 0.147\({}^{\rm a}\) | 23.29\({}^{\rm a}\) | 13.05\({}^{\rm a}\) | 20.10\({}^{\rm a}\) | 21.32\({}^{\rm a}\)
 | QuestER+ | 0.104\({}^{\rm a}\) | 0.422\({}^{\rm a}\) | 0.162\({}^{\rm a}\) | 21.25\({}^{\rm a}\) | 10.59\({}^{\rm a}\) | 17.84\({}^{\rm a}\) | 20.17\({}^{\rm a}\)
 | HRDR | 0.051 | 0.198 | 0.079 | 14.67 | 4.07 | 10.40 | 12.16
 | NARRE | 0.055 | 0.210 | 0.084 | 11.41 | 1.74 | 6.57 | 9.43
 | HFT | 0.054 | 0.212 | 0.083 | 14.55 | 5.42 | 12.27 | 10.83
Musical | QuestER | 0.118\({}^{\rm a}\) | 0.446\({}^{\rm a}\) | 0.179\({}^{\rm a}\) | 23.95\({}^{\rm a}\) | 13.01\({}^{\rm a}\) | 20.57\({}^{\rm a}\) | 23.43\({}^{\rm a}\)
 | QuestER+ | 0.111\({}^{\rm a}\) | 0.427\({}^{\rm a}\) | 0.170\({}^{\rm a}\) | 22.58\({}^{\rm a}\) | 11.82\({}^{\rm a}\) | 19.38\({}^{\rm a}\) | 20.74\({}^{\rm a}\)
 | HRDR | 0.075 | 0.293 | 0.116 | 18.48 | 7.66 | 13.84 | 16.79
 | NARRE | 0.087 | 0.339 | 0.134 | 12.86 | 2.19 | 7.00 | 12.29
 | HFT | 0.086 | 0.352 | 0.134 | 17.65 | 6.88 | 14.61 | 15.11
Table 6. Performance in Question-Level Explanation Task
\({}^{\rm a}\)Denotes statistically significant improvements over the baselines. Highest values are in bold.

4.4 Rating Prediction

As previously established, our main focus in this work is on recommendation explanations, with an eye on improving the selection of reviews and incorporating questions in that endeavor. Nevertheless, while recommendation accuracy is not the main focus, we find that QuestER still maintains parity in this regard with the other methods.
We report the MSE averaged across users for each category in Table 7. Our proposed QuestER and QuestER+ achieve comparable results to the neural models HRDR and NARRE. HFT, which is based on a graphical model, varies from the neural models: depending on the domain, it is lower in some cases and higher in others. Such variation in performance between simpler models and more complex neural models in terms of rating prediction is expected and has also been reported in [40].
Data | HFT | NARRE | HRDR | QuestER | QuestER+
Home | 1.2775 | 1.2654 | 1.2677 | 1.2670 | 1.2666
Health | 1.2712 | 1.2853 | 1.2878 | 1.2862 | 1.2861
Sport | 1.0251 | 1.0054 | 1.0072 | 1.0053 | 1.0047
Toy | 0.9136 | 0.9971 | 0.9973 | 0.9974 | 0.9979
Grocery | 1.2007 | 1.1987 | 1.1988 | 1.2011 | 1.2027
Baby | 1.3719 | 1.3622 | 1.3639 | 1.3613 | 1.3614
Office | 0.8948 | 0.9248 | 0.9267 | 0.9245 | 0.9250
Automotive | 0.9570 | 0.9248 | 0.9250 | 0.9258 | 0.9236
Patio | 1.1173 | 1.1537 | 1.1594 | 1.1588 | 1.1564
Musical | 0.8846 | 0.8136 | 0.8102 | 0.8174 | 0.8155
Table 7. Rating Prediction Performance: MSE
In any case, as we see from the previous experiments as well, QuestER and QuestER+ stand out in having the better review-level and question-level explanations, which are the main focal points of this work.
Effect of the Number of Answers in Each Question. We now analyze the effect of the maximum number of answers used for each question. We report the MSE averaged across users on each category when varying the maximum number of answers in the set \(\{\)1, 3, 5, 10\(\}\) in Table 8. We observe relatively minor differences in rating prediction performance among the variants. The proposed method achieves the best MSE with five answers (QuestER+) in the majority of cases, which motivates us to further evaluate this variant in the preceding experiments.
Data | MSE (1 Answer) | MSE (3 Answers) | MSE (5 Answers) | MSE (10 Answers)
Home | 1.2670 | 1.2667 | 1.2666 | 1.2669
Health | 1.2862 | 1.2865 | 1.2861 | 1.2862
Sport | 1.0053 | 1.0051 | 1.0047 | 1.0051
Toy | 0.9974 | 0.9976 | 0.9979 | 0.9972
Grocery | 1.2011 | 1.2006 | 1.2027 | 1.2001
Baby | 1.3613 | 1.3624 | 1.3614 | 1.3622
Office | 0.9245 | 0.9239 | 0.9250 | 0.9248
Automotive | 0.9258 | 0.9253 | 0.9236 | 0.9238
Patio | 1.1588 | 1.1560 | 1.1564 | 1.1551
Musical | 0.8174 | 0.8123 | 0.8155 | 0.8178
Table 8. Rating Prediction Performance (MSE) of QuestER w.r.t Different Maximum Number of Answers

4.5 Case Studies

To investigate the usefulness of the recommendation explanation consisting of a QA as well as a review, we show a few case studies that benchmark QuestER against the most voted question and the most voted review:
Figure 5 shows five sets of explanations for a sanding pad product of the Meguiar's brand. The first set (in the gray box, above) comprises a QA and a review based on Top_Rated_Useful votes. The second set (in the green box) comprises those selected by our QuestER. While both QuestER and Top_Rated_Useful provide useful information about the product, QuestER's explanation is notable in two respects. For one, QuestER's question with its answer is more aligned with its review than that of Top_Rated_Useful; the ROUGE-L F-Measures for QuestER and Top_Rated_Useful are \(10.61\) and \(8.37\), respectively. For another, Top_Rated_Useful is based on explicit votes, which are not found on many products and are therefore not universally available or applicable. The following three blue boxes comprise the explanations produced by the baseline methods. While NARRE selects the same review explanation as QuestER, it produces a different QA explanation.
Figure 6 shows the explanation for a breast pump product of the Medela brand. Both QuestER and Top_Rated_Useful provide further useful information about the product. QuestER's question with its answer is more aligned with its review than that of Top_Rated_Useful; the ROUGE-L F-Measures are \(12.59\) and \(9.02\), respectively. In this case, the baselines HRDR and HFT pick the same question.
Figure 7 shows the explanation for a guitar rest. Notably, the pairing by Top_Rated_Useful is not so coherent, as the QA discusses its use for guitars, while the review discusses its use for ukuleles. In contrast, the QA and the review by QuestER concentrate on the key issue of how well the item could hold a guitar at rest. QuestER's QA is more aligned with its review than that of Top_Rated_Useful; the ROUGE-L F-Measures are \(14.71\) and \(6.64\), respectively.
Fig. 5. Example explanation: Meguiar’s Sanding Pad (explanation by Top_Rated_Useful is in gray, that by QuestER is in green, and those by other baselines are in blue).
Fig. 6. Example explanation: Medela’s Breast Pump (explanation by Top_Rated_Useful is in gray, that by QuestER is in green, and those by other baselines are in blue).
Fig. 7. Example explanation: Planet Waves Guitar Rest (explanation by Top_Rated_Useful is in gray, that by QuestER is in green, and those by other baselines are in blue).

4.6 User Studies

To evaluate the quality of questions and reviews selected by QuestER and Top_Rated_Useful (based on user votes on Amazon.com), we conduct a couple of user studies.
Reviews vs. QAs. In the first study, we seek to investigate whether users find questions and reviews helpful as part of a recommendation explanation. We conduct user studies covering 30 examples (3 products from each category). We split these examples into 3 surveys, each containing 10 examples from different domains, generated by QuestER. Each survey is done by \(5\) annotators, for a total of \(15\) annotators who are neither the authors nor aware of the objective of the study. Each product is presented with both a question and a review in random order (the review and the question can be either group A or group B). We ask annotators to assess the pairwise quality with four options:
I.
A is more useful than B.
II.
B is more useful than A.
III.
A and B are almost the same, both useful.
IV.
A and B are almost the same, both useless.
The Fleiss’ kappa [19] for agreement on categorical ratings, \(\kappa=0.2955\), implies fair agreement.
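For reference, the agreement statistic can be computed as in the following sketch, using statsmodels on a toy ratings matrix (the data here are made up for illustration, not the study's responses):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[s][r] = option (I-IV coded 0-3) chosen by rater r for survey item s (toy data).
ratings = np.array([[0, 0, 2, 2, 0],
                    [2, 2, 2, 1, 2],
                    [3, 0, 0, 0, 1]])
table, _ = aggregate_raters(ratings)          # items x categories count table
print(fleiss_kappa(table, method="fleiss"))
```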
Pairwise evaluation results are shown in Figure 8. As the key proposal is to have both a review and a question as part of an expanded explanation, it is gratifying that the most popular option is that both are useful, attaining 39.3%. While the percentage that finds reviews more useful is slightly higher than the percentage that finds questions more useful, this is less important, as we are not seeking to replace reviews with questions. Excluding “both useless,” 96% find at least one of the two useful. We repeat the same study with explanations coming from Top_Rated_Useful and the conclusion still holds, i.e., the most popular option is that both the review and the QA are useful.
Fig. 8. Review vs. QA annotation results.
QuestER vs. Top_Rated_Useful. In the second user study, we investigate the quality of the proposed combined explanation form consisting of a QA and a review. With the same set of examples and annotators, we split the examples into \(3\) other surveys, each containing \(10\) products from different categories. We present the explanations blindly by ordering the surveys' questions and explanations randomly (group A and group B are now either QuestER or Top_Rated_Useful). We ask similar questions as in the first study. Figure 9 shows the pairwise evaluation results between QuestER and Top_Rated_Useful. The Fleiss’ kappa score is \(0.217\), indicating fair agreement. In summary, when combining question and review as explanation, the explanations of both QuestER and Top_Rated_Useful are overall considered useful (\(96.67\%\)). Among those, the question and review selected by QuestER are considered slightly more useful (\(26.7\%\)) than those of Top_Rated_Useful (\(25.3\%\)).
Fig. 9. QuestER vs. Top_Rated_Useful annotation results.
As important as the slight outperformance of QuestER over Top_Rated_Useful, or perhaps more so, is that QuestER is a more widely applicable method. In contrast, Top_Rated_Useful relies on the existence of helpfulness votes, which are relatively rare; it therefore stands more as a benchmark than as a practical method for selecting reviews and QAs as explanations.

4.7 Discussion

Robust Rating Prediction Layer. Here we further explore the cold-start scenario by removing reviews as well as QAs. Keeping the available ratings, we randomly remove reviews at ratios in the range [0,1] with step size 0.1. The results in Figure 10 consistently show, across all datasets, that the rating prediction performance of the proposed QuestER remains quite stable regardless of the amount of reviews. This can be explained by Equation (15): missing reviews only discard the contribution of the user and item representations constructed from reviews and QAs, while the rating-based representations as well as the latent factors \(\zeta_{u}\) and \(\zeta_{p}\) remain available. We further note that Equation (15) can produce rating predictions for users/items with only ratings (\(u_{i}\) and \(p_{j}\)), only content (\(X_{i}\) and \(Y_{j}\)), or known latent factors (\(\zeta_{u}(i)\) and \(\zeta_{p}(j)\)). In addition, we investigate the overall rating prediction when varying the number of available questions, by varying the threshold on the fraction of questions to be covered by big clusters in the range [0,1] with step size 0.1 (in the main experiment, this threshold is 0.8).6 We observe a similar trend: the rating prediction performance remains quite stable (see Figure 11).
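To make the fallback behavior concrete, below is a minimal sketch of a prediction layer with this property. It is not the exact form of Equation (15); it merely assumes, as described above, that the rating-based interaction, the review/QA-based interaction, and the latent factors contribute additively, so that a missing component can simply be dropped without preventing a prediction. All variable names are hypothetical.

import numpy as np

def predict_rating(u_i, p_j, x_i=None, y_j=None, zeta_u_i=0.0, zeta_p_j=0.0, mu=0.0):
    # Rating-based interaction is available once the user/item has ratings.
    score = float(np.dot(u_i, p_j))
    # Review/QA-based representations contribute only when reviews/QAs exist;
    # missing content drops this term rather than breaking the prediction.
    if x_i is not None and y_j is not None:
        score += float(np.dot(x_i, y_j))
    # Latent (bias-like) factors and a global offset remain usable in cold-start cases.
    return score + zeta_u_i + zeta_p_j + mu

# Hypothetical usage: an item whose reviews have been removed (content-free prediction).
d = 8
rng = np.random.default_rng(0)
u_i, p_j = rng.normal(size=d), rng.normal(size=d)
print(predict_rating(u_i, p_j, x_i=None, y_j=None, zeta_u_i=0.1, zeta_p_j=-0.05, mu=3.9))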
Fig. 10. Rating prediction performance (MSE) when removing reviews.
Fig. 11. Rating prediction performance (MSE) when varying the threshold on questions covered by big clusters.
Using BERT as Text Encoder. Here we investigate whether using another text encoder, such as BERT, can further enhance the overall performance. Table 9 reports the performance of variants of QuestER that differ in their text encoder (the default is TextCNN), including 6 versions of the small BERT model from TF Hub with 128 hidden dimensions, from L-2 (2 Transformer blocks) to L-12 (12 Transformer blocks); a sketch of how such an encoder could be plugged in follows the table. Unsurprisingly, a larger text encoder consumes more training time. Using BERT as the text encoder does enhance the rating prediction performance. However, it does not clearly enhance the explanation performance in terms of text alignment, review-level explanation, or question-level explanation.
Text Encoder | Train (Seconds) | MSE    | Text Alignment     | Review-Level Explanation  | Question-Level Explanation
             |                 |        | ROUGE-L | METEOR   | F1@5  | ROUGE-L | METEOR  | F1@5  | ROUGE-L | METEOR
TextCNN      | 334             | 0.8174 | 7.44    | 11.16    | 0.230 | 25.88   | 30.13   | 0.179 | 20.57   | 23.43
BERT (L-2)   | 61,164          | 0.7861 | 7.27    | 10.75    | 0.237 | 26.72   | 30.78   | 0.175 | 19.18   | 22.11
BERT (L-4)   | 66,291          | 0.7915 | 7.29    | 10.73    | 0.240 | 25.77   | 30.06   | 0.171 | 18.94   | 21.71
BERT (L-6)   | 73,332          | 0.7909 | 7.32    | 10.91    | 0.233 | 25.47   | 29.68   | 0.178 | 17.98   | 21.14
BERT (L-8)   | 74,587          | 0.7929 | 7.33    | 10.89    | 0.240 | 25.96   | 30.45   | 0.176 | 20.16   | 22.26
BERT (L-10)  | 76,953          | 0.7767 | 7.48    | 11.08    | 0.235 | 23.96   | 28.52   | 0.176 | 18.14   | 21.06
BERT (L-12)  | 79,146          | 0.7892 | 7.31    | 10.69    | 0.233 | 25.96   | 30.14   | 0.174 | 15.45   | 17.63
Table 9. The Overall Performance of QuestER Using BERT as Text Encoder on Musical Data
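As a minimal sketch (assuming the small BERT checkpoints and the matching preprocessing model published on TF Hub; this is not our exact training code), a pooled BERT representation could stand in for the TextCNN document vector roughly as follows.

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the BERT preprocessing model

# Assumed TF Hub handles for the smallest (L-2, H-128) small BERT variant.
PREPROCESS_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_URL = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2"

def build_text_encoder():
    # Maps a batch of raw review/question strings to fixed-size text embeddings.
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
    preprocess = hub.KerasLayer(PREPROCESS_URL, name="preprocessing")
    encoder = hub.KerasLayer(ENCODER_URL, trainable=True, name="bert_encoder")
    outputs = encoder(preprocess(text_input))
    # The pooled output (128-dimensional here) replaces the TextCNN document vector.
    return tf.keras.Model(text_input, outputs["pooled_output"])

text_encoder = build_text_encoder()
embeddings = text_encoder(tf.constant(["Does this stand fit a classical guitar?"]))
print(embeddings.shape)  # (1, 128)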

5 Conclusion

QuestER is a framework for incorporating QA pairs into review-based recommendation explanations. We model QA in an attention mechanism to identify more useful reviews. Through joint modeling, we can collectively form an explanation in terms of a QA and a review. Comprehensive experiments on various product categories show that the QA and the review that are part of a collective explanation are more coherent with each other than the pairings found by the baselines. Review-level and question-level explanations identified by QuestER are also more consistent with top-rated ones based on helpfulness votes than those identified by the baselines. User studies further support that incorporating questions as part of a recommendation explanation is useful.

Footnotes

5. The collected data are available at https://github.com/PreferredAI/QuestER
6. When keeping all clusters (the threshold is 1), the General Question consists of all the centroid questions.

References

[1]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, USA, 65–72.
[2]
Rose Catherine and William Cohen. 2017. TransNets: Learning to transform for recommendation. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys ’17). ACM, New York, NY, 288–296. DOI:
[3]
Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1583–1592.
[4]
Long Chen, Ziyu Guan, Qibin Xu, Qiong Zhang, Huan Sun, Guangyue Lu, and Deng Cai. 2020. Question-driven purchasing propensity analysis for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 35–42. DOI:
[5]
Long Chen, Ziyu Guan, Wei Zhao, Wanqing Zhao, Xiaopeng Wang, Zhou Zhao, and Huan Sun. 2019. Answer identification from product reviews for user questions by multi-task attentive networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI ’19/IAAI ’19/EAAI ’19), Vol. 33. AAAI Press, 45–52. DOI:
[6]
Zhongxia Chen, Xiting Wang, Xing Xie, Tong Wu, Guoqing Bu, Yining Wang, and Enhong Chen. 2019. Co-attentive multi-task learning for explainable recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI ’19). International Joint Conferences on Artificial Intelligence Organization, 2137–2143. DOI:
[7]
Dawei Cong, Yanyan Zhao, Bing Qin, Yu Han, Murray Zhang, Alden Liu, and Nat Chen. 2019. Hierarchical attention based neural network for explainable recommendation. In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR ’19). ACM, New York, NY, 373–381. DOI:
[8]
Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’14). ACM, New York, NY, 193–202. DOI:
[9]
Gerardo Ocampo Diaz and Vincent Ng. 2018. Modeling and prediction of online product review helpfulness: A survey. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 698–708.
[10]
Xin Dong, Jingchao Ni, Wei Cheng, Zhengzhang Chen, Bo Zong, Dongjin Song, Yanchi Liu, Haifeng Chen, and Gerard de Melo. 2020. Asymmetrical hierarchical networks with attentive interactions for interpretable review-based recommendation. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (Apr. 2020), 7667–7674. DOI:
[11]
Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, et al. 2021. Pre-trained models: Past, present and future. AI Open 2 (2021), 225–250.
[12]
Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, 507–517.
[13]
Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM ’15). ACM, New York, NY, 1661–1670. DOI:
[14]
Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
[15]
Chunli Huang, Wenjun Jiang, Jie Wu, and Guojun Wang. 2020. Personalized review recommendation based on users’ aspect sentiment. ACM Transactions on Internet Technology 20, 4, Article 42 (Oct. 2020), 26 pages. DOI:
[16]
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics 2 (2017), 427–431. Retrieved from https://aclanthology.org/E17-2068
[17]
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations. Retrieved from http://arxiv.org/abs/1412.6980
[18]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[19]
J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174. Retrieved from http://www.jstor.org/stable/2529310
[20]
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. PMLR, 1188–1196.
[21]
Trung-Hoang Le and Hady W. Lauw. 2021. Explainable recommendation with comparative constraints on product aspects. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM ’21). ACM, New York, NY, 967–975. DOI:
[22]
Trung-Hoang Le and Hady W. Lauw. 2022. Question-attentive review-level recommendation explanation. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data). IEEE, 756–761.
[23]
Lei Li, Yongfeng Zhang, and Li Chen. 2020. Generate neural template explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM ’20). ACM, New York, NY, 755–764. DOI:
[24]
Lei Li, Yongfeng Zhang, and Li Chen. 2021. Personalized transformer for explainable recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1: Long Papers. Association for Computational Linguistics, Online, 4947–4957. DOI:
[25]
Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). ACM, New York, NY, 345–354. DOI:
[26]
Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL ’03), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, 71–78. DOI:
[27]
Han Liu, Yangyang Guo, Jianhua Yin, Zan Gao, and Liqiang Nie. 2022. Review polarity-wise recommender. IEEE Transactions on Neural Networks and Learning Systems (2022). Retrieved from https://arxiv.org/abs/2106.04155
[28]
Hongtao Liu, Yian Wang, Qiyao Peng, Fangzhao Wu, Lin Gan, Lin Pan, and Pengfei Jiao. 2020. Hybrid neural recommendation with joint deep representation learning of ratings and reviews. Neurocomputing 374 (2020), 77–85.
[29]
Yichao Lu, Ruihai Dong, and Barry Smyth. 2018. Coevolutionary recommendation model: Mutual learning between ratings and reviews. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 773–782. DOI:
[30]
Yue Lu, Panayiotis Tsaparas, Alexandros Ntoulas, and Livia Polanyi. 2010. Exploiting social context for review quality prediction. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10). ACM, New York, NY, 691–700. DOI:
[31]
Lionel Martin and Pearl Pu. 2014. Prediction of helpful reviews using emotions extraction. In Proceedings of the AAAI Conference on Artificial Intelligence 28 (2014), 1551–1557.
[32]
Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, NY, 165–172. DOI:
[33]
Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web (WWW ’16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 625–635. DOI:
[34]
Sicheng Pan, Dongsheng Li, Hansu Gu, Tun Lu, Xufang Luo, and Ning Gu. 2022. Accurate and explainable recommendation via review rationalization. In Proceedings of the ACM Web Conference 2022 (WWW ’22). ACM, New York, NY, 3092–3101. DOI:
[35]
Michael J. Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The Adaptive Web: Methods and Strategies of Web Personalization. P. Brusilovsky, A. Kobsa, W. Nejdl (Eds.), Vol. 4321. Springer, Berlin, Heidelberg, 325–341. DOI:
[36]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing (EMNLP), 1532–1543.
[37]
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237. DOI:
[38]
Pearl Pu, Li Chen, and Rong Hu. 2011. A user-centric evaluation framework for recommender systems. In Proceedings of the 5th ACM Conference on Recommender Systems, 157–164.
[39]
Zhaochun Ren, Shangsong Liang, Piji Li, Shuaiqiang Wang, and Maarten de Rijke. 2017. Social collaborative viewpoint regression with explainable recommendations. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, 485–494. DOI:
[40]
Noveen Sachdeva and Julian McAuley. 2020. How useful are reviews for recommendation? A critical review and potential improvements. ACM, New York, NY, 1845–1848. DOI:
[41]
Sunil Saumya, Jyoti Prakash Singh, and Yogesh K. Dwivedi. 2020. Predicting the helpfulness score of online reviews using convolutional neural network. Soft Computing 24, 15 (2020), 10989–11005.
[42]
Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys ’17). ACM, New York, NY, 297–305. DOI:
[43]
Yunzhi Tan, Min Zhang, Yiqun Liu, and Shaoping Ma. 2016. Rating-boosted latent topics: Understanding users and items with ratings and reviews. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16). AAAI Press, 2640–2646.
[44]
Jiliang Tang, Huiji Gao, Xia Hu, and Huan Liu. 2013. Context-aware review helpfulness rating prediction. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, NY, 1–8. DOI:
[45]
Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Multi-pointer co-attention networks for recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, New York, NY, 2309–2318. DOI:
[46]
Quoc-Tuan Truong and Hady Lauw. 2019. Multimodal review generation for recommender systems. In Proceedings of the World Wide Web Conference (WWW ’19). ACM, New York, NY, 1864–1874. DOI:
[47]
Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. 2018. Explainable recommendation via multi-task learning in opinionated text data. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, 165–174. DOI:
[48]
Chuhan Wu, Fangzhao Wu, Junxin Liu, and Yongfeng Huang. 2019. Hierarchical user and item representation with three-tier attention for recommendation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers), 1818–1826.
[49]
Chuhan Wu, Fangzhao Wu, Tao Qi, Suyu Ge, Yongfeng Huang, and Xing Xie. 2019. Reviews meet graphs: Enhancing user and item representations for recommendation with hierarchical attentive graph neural network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 4884–4893. DOI:
[50]
Zhouhang Xie, Sameer Singh, Julian McAuley, and Bodhisattwa Prasad Majumder. 2023. Factual and informative review generation for explainable recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 13816–13824.
[51]
Qian Yu and Wai Lam. 2018. Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, New York, NY, 691–699. DOI:
[52]
Yongfeng Zhang and Xu Chen. 2020. Explainable recommendation: A survey and new perspectives. Foundations and Trends in Information Retrieval 14, 1 (2020), 1–101.
[53]
Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’14). ACM, New York, NY, 83–92. DOI:
[54]
Jie Zhao, Ziyu Guan, and Huan Sun. 2019. Riker: Mining rich keyword representations for interpretable product question answering. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, 1389–1398. DOI:
[55]
Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, 425–434. DOI:
