
Question-Attentive Review-Level Explanation for Neural Rating Regression

Published: 13 December 2024

Abstract

Explanations help improve the acceptance of recommendations by end users. Explanations come in many different forms. One of interest here is presenting an existing review of the recommended item as the explanation. The challenge lies in selecting a suitable review, which is customarily addressed by assessing the relative importance or “attention” of each review to the recommendation objective. Our focus is improving review-level explanation by leveraging additional information in the form of questions and answers (QA). The proposed framework employs QA in an attention mechanism that aligns reviews to the various QAs of an item and assesses their joint contribution to the recommendation objective. The benefits are two-fold. For one, QA aids in selecting more useful reviews. For another, QA itself could accompany a well-aligned review in an expanded form of explanation. Experiments on datasets of 10 product categories showcase the efficacy of our method, as compared to comparable baselines, in identifying useful reviews and QAs while maintaining parity in recommendation performance.

1 Introduction

A ubiquitous feature of Web applications and e-commerce marketplaces is a recommender system that aids users in navigating the multitude of options available, whether they are products to purchase, social media posts to view, movies to watch, and so on. The most common framework is that of collaborative filtering [18], predicting ratings or adoptions based on users’ past interactions with various items.
Earlier in the evolution of recommender systems, the concern was predominantly on achieving higher accuracy [14, 38]. Of late, the concern has shifted to greater interpretability and explainability, as ultimately the goal is to get users to adopt the recommendations. This has given rise to a plethora of explainable recommendation models [52], which seek to produce not only recommendations but also accompanying explanations. There are diverse forms of explanations, leveraging different types of information associated with either users or items.
For a pertinent instance, we allude to review-level explanation, whereby the explanation to a recommendation takes the form of a review, selected from the existing reviews of the product. An insightful review, when presented with a recommended product, allows the recipient of the recommendation to empathize with the hands-on experience of the reviewer, thus anticipating what her own experience with the product would be. For instance, on Amazon.com, Canon EOS Rebel T7 Bundle1 has more than 2,800 ratings, more than 300 of which have reviews. One of these reviews is illustrated in Figure 1, relating to the quality of the starter kit. That popular products may have many reviews (some to the tune of tens of thousands) is a double-edged sword. With a rich corpus for selection comes the problem of selecting which review to present as the explanation. One existing paradigm [3, 28] is to weigh the contribution of various reviews to the recommendation.
Fig. 1. An example product on Amazon.com with reviews, questions, and answers (QA).
Given the abundance of reviews, there is a proclivity to employ reviews to aid recommendations. Most of these works are intent on improving recommendation accuracy rather than on using reviews directly as explanations. These include content-based methods based on topic models [43], sentiments [8], and social networks [39]. Using a convolutional neural network (CNN), Zheng et al. [55] encode all reviews of an item to represent that item and all reviews written by a user to represent that user, to enhance rating prediction. Tay et al. [45] learn to focus on a few reviews of users and items, optimizing for rating prediction. In contrast to works that see reviews as content to improve recommendation accuracy, we focus on the role of reviews as explanations.
In this work, we propose to go beyond reviews and incorporate other information associated with a product. One that is a focus of this work is a question posted by a user that in turn attracts answers from other users, hereinafter referred to in short form as questions and answers (QA). For instance, the same product Canon EOS Rebel T7 bundle featured in Figure 1 has more than 200 questions. Among them are whether the camera has Wi-Fi ability (answer: yes), whether there is a port for an external microphone (answer: no, but another model T7i does), and whether it is suitable for indoor sports (answer: yes, it has a sport mode). Similarly to reviews, QAs could also receive votes from users.
Interestingly, QAs present distinct yet complementary information to reviews. Where reviews tend to be subjective and replete with opinions, questions tend to be objective and inquisitive of factual concerns. Where a single review tends to be multi-faceted and comprehensive, each question tends to be concise and narrowly focused on a single aspect. Given this complementarity, we postulate that both QA and review could collectively serve as recommendation explanations. The former notifies the recommendee of relevant factual concern(s), while the latter gives the recommendee insights from a reviewer’s experience.
QA as a feature is also increasingly prevalent across many platforms, with Amazon.com and Tripadvisor.com being a couple of prominent examples. For instance, across the 10 product categories in our datasets (see Section 4), between 13% and 56% of products have QA information. Given the anticipated further increase in QA data over time, it is timely to consider how to leverage QA in addition to reviews for more informative recommendation explanations.
Problem.
Let \(\mathcal{U}\) be a set of users, and \(\mathcal{P}\) be a set of products. A user \(i\in\mathcal{U}\) assigns to a product \(j\in\mathcal{P}\) a rating \(r_{ij}\in\mathbb{R}_{+}\) along with a review \(t_{ij}\). We denote the collection of ratings as \(\mathcal{R}\), all reviews as \(\mathcal{T}\), the subset of reviews concerning a product \(j\) as \(\mathcal{T}_{j}\). Product \(j\) may have a set of questions \(\mathcal{Q}_{j}=\{q_{j1},q_{j2},...,q_{jK}\}\subset\mathcal{Q}\), where \(K\) is the total number of questions of product \(j\). Each question \(q_{jk}\) has a collection of answers \(\mathcal{A}_{jk}=\{a_{jk1},a_{jk2},\dots,a_{jkL}\}\), where \(L\) is the total number of answers of question \(q_{jk}\). Table 1 lists the notations (some to be introduced later). The problem can thus be stated as follows. Receiving as input users \(\mathcal{U}\), products \(\mathcal{P}\), ratings \(\mathcal{R}\), reviews \(\mathcal{T}\), and QA pairs \(\mathcal{Q}\), we seek a model capable of predicting a missing rating by a user \(i\) on product \(j\) for recommendation (rating regression), as well as identifying a QA pair (selected from \(\mathcal{Q}_{j}\)) along with a review (selected from \(\mathcal{T}_{j}\)) to serve collectively as explanations accompanying the recommendation.
Symbol | Description
\(\mathcal{U},\mathcal{P}\) | Set of all users and products
\(\mathcal{T},\mathcal{Q},\mathcal{A}\) | Set of all reviews, questions, and answers
\(t_{ij}\in\mathcal{T}\) | A review of user \(i\) on product \(j\)
\(\mathcal{Q}_{j}\) | A set of all questions on product \(j\)
\(q_{jk}\in\mathcal{Q}_{j}\) | A question \(k\) of product \(j\)
\(\mathcal{A}_{jk}\) | A set of all answers on question \(q_{jk}\)
\(a_{jkl}\in\mathcal{A}_{jk}\) | An answer \(l\) of a question \(k\) on product \(j\)
\(\xi(t_{ij}),\xi(q_{jk}),\xi(a_{jkl})\) | Embedded matrices of \(t_{ij}\), \(q_{jk}\), and \(a_{jkl}\)
\(\zeta_{u}(i),\zeta_{p}(j)\) | Latent features of user \(i\) and product \(j\)
\(O_{t_{ij}},O_{q_{jk}},O_{a_{jkl}}\) | Feature vectors extracted from \(t_{ij}\), \(q_{jk}\), and \(a_{jkl}\)
\(u_{i},p_{j}\) | Rating-based representation of user \(i\) and product \(j\)
\(\alpha_{ij}\) | Attention weight for \(O_{t_{ij}}\)
\(\beta_{ijk}\) | Attention weight of review \(t_{ij}\) on question \(q_{jk}\)
\(\delta_{jkl}\) | Attention weight of answer \(a_{jkl}\) on question \(q_{jk}\)
\(\omega_{jk}\) | QA representation of question \(q_{jk}\) after infusing answers
\(d_{jk}\) | Document representation with respect to \(q_{jk}\) after infusing reviews
\(\gamma_{jk}\) | Attention weight of document \(d_{jk}\)
\(b_{u},b_{i},\mu\) | User bias, item bias, and global bias, respectively
Table 1. Main Notations
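To make the notation concrete, the following is a minimal sketch (ours, not the authors' code) of how the inputs in the problem statement could be organized; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Question:
    text: str                                           # q_jk
    answers: List[str] = field(default_factory=list)    # A_jk = {a_jk1, ..., a_jkL}

@dataclass
class Dataset:
    ratings: Dict[Tuple[str, str], float]   # (user i, product j) -> r_ij
    reviews: Dict[Tuple[str, str], str]     # (user i, product j) -> t_ij
    questions: Dict[str, List[Question]]    # product j -> Q_j

    def reviews_of(self, product: str) -> List[str]:
        """T_j: all reviews written on a given product."""
        return [t for (_, j), t in self.reviews.items() if j == product]
```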
Due to the differing yet complementary natures of QA and reviews, we design a neural attention model, called QUESTion-attentive review-level Explanation for neural rating Regression (QuestER), which operates at two levels. First, the concise QAs serve as focal points of attention representing salient aspects of a product recommendation. Second, the multi-faceted nature of reviews means that they could be relevant to multiple aspects, and we model their relative importance to each QA. Together, QA and reviews serve dual roles in a hand-in-hand manner: to contribute content features to aid recommendation and to serve as explanations.
Contribution. We make several contributions. First, we incorporate product questions into an attention mechanism on reviews for recommendation. Second, we develop a neural model called QuestER, which considers questions as a source of alignment to textual review. An important question would help to identify important reviews. Third, we conduct comprehensive experiments on 10 product categories against comparable baselines. Importantly, we find that not only do QAs help in identifying useful reviews, but the expanded explanation that is the combination of QA and review also has value.
This article is an extension of the conference version [22], and it differs in the following additional contributions:
Instead of treating answers as part of the question text, the model architecture of QuestER has been extended to include another attention layer through which answers contribute to the question representation. This replaces the question encoding used previously.
We include extensive experimental comparisons on 10 product categories (Home, Health, Sport, Toy, Grocery, Baby, Office, Automotive, Patio, and Musical). In contrast, [22] reports only three product categories (Home, Sport, and Musical). These additional results provide comprehensive coverage of how the method applies across a wide range of product domains.
We expand the experiments with discussions on additional metrics new to this article, including the effect of the number of answers per question on the final rating predictions, as well as the performance of both review-level and question-level explanations. These give a more well-rounded coverage of the performance of the proposed method.
In addition to quantitative experiments, we now include user studies that examine the quality of both review-level and question-level explanations, as well as a comparison to top-rated reviews. For illustration, we also present additional case studies across more domains. The resulting analyses thoroughly examine the effectiveness of the proposed methods.

2 Related Work

We survey related work that deals with questions or reviews in the context of recommendations.
QA-Based Recommendation. The use of QA for recommendation is still relatively rare in the literature. One line of work detects a user’s propensity to purchase a product based on the question that the user has submitted [4]. This is a distinct scenario from ours, where the question does not have to be posed by the recipient of recommendations. Rather, we see questions as additional product information that may be relevant as explanation. QA-based recommendation is also orthogonal to the question-answering task. Zhao et al. [54] select relevant sentences in product reviews to answer a question. Chen et al. [5] identify answers to user questions from product reviews via multi-task attentive networks. Yu and Lam [51] incorporate aspects of reviews to predict the answer to a yes-no question. Our goal is not to answer questions, but rather to select QAs appropriate for recommendation explanations.
Review-Based Recommendation. Given the abundance of reviews, there is a proclivity to employ reviews to aid recommendations. Most of these works are intent on improving recommendation accuracy rather than on using reviews directly as explanations. These include content-based methods based on similarity metrics [35], topic models [43], sentiments [8], and social networks [39]. Using CNN, Zheng et al. [55] encode all reviews of an item to represent that item and all reviews written by a user to represent that user, to enhance rating prediction. Tay et al. [45] learn to focus on a few reviews of users and items, optimizing for rating prediction. Liu et al. [27] treat reviews with different polarities differently for rating prediction. In contrast to works that see reviews as content to improve recommendation accuracy, we focus on the role of reviews as explanations.
Review-Based Recommendation Explanation. Our work belongs to a group that uses a whole review as explanation. We identify a few in this group and compare to them as baselines. NARRE [3] uses attention to weigh each individual review toward the user and item representations and uses the most useful review(s) as review-level explanation. HRDR [28] uses a multi-layer perceptron (MLP) to encode a user’s ratings (respectively, an item’s ratings) as user features (respectively, item features) and uses them as the query for an attention layer to weigh the contribution of each review to rating prediction. HFT [32] could select the review whose topic distribution is closest to the item’s topic distribution. Our key distinction from these baselines is our unique incorporation of QA, both for review selection and as explanation. Another work uses three-tier attention [48] at the word level, sentence level, and review level for learning user and item text representations toward the final rating prediction; [49] extends three-tier attention with graphs.
Rather than relying on review-level explanations, some works extract segments [29, 34, 42] or aspect-level sentiments [13, 21, 47, 53]. Another formulation is to select personalized reviews [2, 7, 10, 15]: the review selected for a given item may vary from user to user, which is orthogonal to selecting useful reviews, where the selected review for a given item is the same for all users. Cong et al. [7] use GRUs as text encoders to learn word-level and review-level representations and learn the contribution of each word/review to the rating prediction. Huang et al. [15] select personalized reviews based on extracted aspects. Dong et al. [10] employ a bi-directional LSTM [37] to learn embeddings for user and item sentences in textual reviews, then apply asymmetric attentive modules in which the text on the item side contributes to the text on the user side.
Review Generation. A few works try to predict ratings and generate reviews in a multi-task learning manner [6, 23–25, 46, 50]. Li et al. [25] use the predicted rating as sentiment, along with user and item factors as context, to generate explanation text. Chen et al. [6] extend this with attention on concepts from an oracle.2 Truong and Lauw [46] further attend to visual aspects. Li et al. [23] explicitly use aspect keywords to generate explanations. Li et al. [24] use the Transformer, a well-known language modeling technique, for personalized review generation. For more informative and factual explanation generation, Xie et al. [50] augment the review generator with external knowledge from a personalized retriever model that estimates the personalized review embedding for each user. We are concerned with the selection of existing reviews, rather than their generation.
Review Quality. We focus on the recommendation scenario. There are other formulations that seek to predict helpful reviews [9, 30, 31, 41, 44]. In those cases, the concern is with objective review quality. In contrast, this work concerns how a review is aligned to a recommendation, and thus could serve as an explanatory device.

3 Methodology

Our formulation in having a pair of QA and review to accompany recommendation based on rating regression is novel. We hypothesize that the concise questions could serve as an attention mechanism in weighing the importance of reviews. This achieves an alignment between questions and reviews, potentially allowing expanded explanations that are more comprehensive and coherent.
The overall architecture of our proposed QuestER model is shown in Figure 2. Below we describe its various components.
Fig. 2. The architecture of QuestER model.
Text Encoder. We use a widely adopted CNN text processor [2, 3, 28, 55], named TextCNN, to extract semantic features from text. TextCNN consists of a CNN followed by max pooling and a fully connected layer (see Figure 3). Particularly, we have a word embedding function \(\xi:M\rightarrow\mathbb{R}^{D}\) that maps each word in the text \(t\) into a \(D\)-dimensional vector, forming an embedded matrix \(\xi(t)\) with fixed length \(W\) (zero-padded for text with length \(<W\)). Following this embedding layer is a convolutional layer with \(m\) neurons, each associated with a filter \(F_{k}\in\mathbb{R}^{w\times D}\); the \(k\)th neuron produces features by applying the convolution operator to the embedded matrix \(\xi(t)\):
\begin{align}z_{k}=ReLU(\xi(t)*F_{k}+b_{z}),\end{align}
(1)
where \(ReLU(x)=\max(x,0)\) is a non-linear activation function and \(*\) is the convolution operation. With sliding window \(w\), the produced features are \(z^{1}_{k},z^{2}_{k},...,z^{W-w+1}_{k}\), which are passed to a max pooling layer to capture the most important feature having the highest value, defined as
\begin{align}o_{k}=\max\left(z^{1}_{k},z^{2}_{k},...,z^{W-w+1}_{k}\right).\end{align}
(2)
Fig. 3. The CNN text processor (TextCNN) architecture.
We get the final output of the convolutional layer by concatenating all output from \(m\) neurons, \(O=[o_{1},o_{2},...,o_{m}]\). A simple approach to get the final representation of the input text \(t\) is to pass \(O\) into a fully connected layer as follows:
\begin{align}X=WO+b.\end{align}
(3)
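To make the encoder concrete, below is a minimal PyTorch sketch of Equations (1)-(3); the hyperparameter values mirror those reported in Section 4, but the module itself is our illustrative code rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Sketch of Equations (1)-(3): embedding -> convolution -> max pooling -> linear."""

    def __init__(self, vocab_size: int, D: int = 100, m: int = 64, w: int = 3, out_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D, padding_idx=0)              # xi(.)
        self.conv = nn.Conv1d(in_channels=D, out_channels=m, kernel_size=w)  # filters F_k
        self.fc = nn.Linear(m, out_dim)                                      # Eq. (3): X = WO + b

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, W) integer ids, zero-padded to the fixed length W
        emb = self.embed(tokens).transpose(1, 2)   # (batch, D, W)
        z = torch.relu(self.conv(emb))             # Eq. (1): (batch, m, W - w + 1)
        o = z.max(dim=2).values                    # Eq. (2): (batch, m)
        return self.fc(o)                          # Eq. (3): (batch, out_dim)

# Example: encode two zero-padded token sequences of length W = 128.
encoder = TextCNN(vocab_size=5000)
features = encoder(torch.randint(1, 5000, (2, 128)))
print(features.shape)  # torch.Size([2, 64])
```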
Besides TextCNN, other deep learning-based text processing methods have been proposed and have claimed advantages over traditional methods, such as fastText [16], RNNs, and paragraph vectors [20]. In particular, recent large-scale pre-trained language models [11], such as BERT, perform well on a variety of natural language processing tasks. An analysis of both TextCNN and BERT as text encoders is reported in Section 4.7.
Rating Encoder. Ratings are explicit features provided by users to indicate their interest in given items. The user ratings \(r_{i:}\) form a rating pattern for user \(i\), and the item ratings \(r_{:j}\) form a rating pattern for item \(j\). A reasonable choice is to use an MLP network to learn the representation of the rating pattern [28] (see Figure 4). Specifically,
\begin{align}\begin{split} h_{i1} & =\tanh(W_{r_{i:}1}r_{i:}+b_{r_{ i:}1}),\\h_{i2} & =\tanh(W_{r_{i:}2}h_{i1}+b_{r_{i:}2}),\\...\\u_{i} & =\tanh(W_{r_{i:}k}h_{i(k-1)}+b_{r_{i:}k}).\end{split}\end{align}
(4)
Fig. 4. MLP for user ratings and item ratings encoder.
The output \(u_{i}\) is the final rating-based representation of user \(i\), and \(h_{ik}\) is the hidden representation at layer \(k\) of the MLP. We can also obtain the rating-based representation \(p_{j}\) of product \(j\) from its input ratings \(r_{:j}\) in a similar manner. We use \(\tanh\) as the activation function to project the learned rating-based representation into the same range as the text-based representations discussed in the following paragraphs.
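A minimal sketch of the rating encoder in Equation (4) follows, assuming the \(\{128,64,m\}\) hidden sizes reported in Section 4; this is illustrative code, not the released implementation.

```python
import torch
import torch.nn as nn

class RatingEncoder(nn.Module):
    """Sketch of Equation (4): an MLP over a raw rating vector with tanh activations."""

    def __init__(self, input_dim: int, hidden=(128, 64, 64)):
        super().__init__()
        layers, in_dim = [], input_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]  # tanh keeps outputs in [-1, 1]
            in_dim = h
        self.mlp = nn.Sequential(*layers)

    def forward(self, rating_vec: torch.Tensor) -> torch.Tensor:
        # rating_vec: (batch, input_dim), e.g., r_i: over all items for a user;
        # a separate instance over r_:j (all users) yields the item representation p_j.
        return self.mlp(rating_vec)  # u_i (or p_j)
```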
User Attention-Based Review Pooling. Equation (3) presumes that each review contributes equally to the final representation. Instead, the importance of each individual review's contribution to the user's final representation is learned as follows:
\begin{align}\rho_{ij} & =\tanh(W_{O_{t}}(O_{t_{ij}}\odot u_{i})+b_{\rho}), \end{align}
(5a)
\begin{align}\theta_{ij} & =W_{\rho}\rho_{ij}+b_{\theta},\end{align}
(5b)
\begin{align}\alpha_{ij} & =\frac{e^{\theta_{ij}}}{\sum_{j}{e^{\theta_{ij}}}},\end{align}
(5c)
where \(\odot\) is the element-wise multiplication operator, \(u_{i}\) is the rating-based representation of user \(i\), \(O_{t_{ij}}\) is the feature vector extracted from review text \(t_{ij}\) by TextCNN, and \(\alpha_{ij}\) is the normalized attention score of the review \(t_{ij}\), which can be interpreted as the contribution of that review to the feature profile \(O_{i}\) of user \(i\), aggregated as follows:
\begin{align}O_{i}=\sum_{j}{\alpha_{ij}O_{t_{ij}}}.\end{align}
(6)
The final representation of user \(i\) is computed as follows:
\begin{align}X_{i}=W_{O_{i}}O_{i}+b_{X}.\end{align}
(7)
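The user-side attention pooling of Equations (5)-(7) can be sketched as follows (illustrative PyTorch code under the notation above):

```python
import torch
import torch.nn as nn

class UserReviewAttention(nn.Module):
    """Sketch of Equations (5)-(7): rating-guided attention pooling over a user's reviews."""

    def __init__(self, feat_dim: int = 64, att_dim: int = 64, out_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, att_dim)   # Eq. (5a)
        self.score = nn.Linear(att_dim, 1)         # Eq. (5b)
        self.fc = nn.Linear(feat_dim, out_dim)     # Eq. (7)

    def forward(self, O_t: torch.Tensor, u_i: torch.Tensor) -> torch.Tensor:
        # O_t: (num_reviews, feat_dim) TextCNN features of the user's reviews
        # u_i: (feat_dim,) rating-based representation of the user
        rho = torch.tanh(self.proj(O_t * u_i))         # Eq. (5a): element-wise interaction
        theta = self.score(rho).squeeze(-1)            # Eq. (5b): (num_reviews,)
        alpha = torch.softmax(theta, dim=0)            # Eq. (5c)
        O_i = (alpha.unsqueeze(-1) * O_t).sum(dim=0)   # Eq. (6)
        return self.fc(O_i)                            # Eq. (7): X_i
```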
Item Question-Attentive Review-Level Explanations. Of particular importance is our modeling of product questions. A naive approach to modeling questions on the item side is to apply the same approach as for reviews. However, the connection between reviews and questions would then be overlooked. Here we presume that a product review may contain information relevant to a question. We add another attention layer based on item questions, which helps us incorporate reviews according to their contribution toward the item's questions.
First, we use TextCNN to encode reviews and QAs. Let \(O_{t_{ij}}\) be the review encoding, \(O_{q_{jk}}\) be the encoding of question \(k\) on product \(j\), and \(O_{a_{jkl}}\) be the encoding of answer \(l\) of question \(k\). With respect to each question representation \(O_{q_{jk}}\), we learn the attention weight \(\delta_{jkl}\) for answer representation \(O_{a_{jkl}}\) by projecting both the question and answer representations onto an attention space followed by a non-linear activation function; the outputs are \(\phi_{jk}\) and \(\psi_{jkl}\), respectively. We use the \(\tanh\) activation function to scale \(O_{q_{jk}}\) and \(O_{a_{jkl}}\) to the same range of values, so that neither component dominates the other. We let the question projection \(\phi_{jk}\) interact with the answer projection \(\psi_{jkl}\) in two ways: element-wise multiplication and summation. The learned vector \(V\) plays the role of global attention context. This produces an attention value \(\upsilon_{jkl}\), which is normalized using softmax to obtain \(\delta_{jkl}\):
\begin{align}\phi_{jk} & =\tanh\left(W_{O_{q}}O_{q_{jk}}+b_{\phi}\right), \\\end{align}
(8a)
\begin{align}\psi_{jkl} & =\tanh\left(W_{O_{a}}O_{a_{jkl}}+b_{\psi}\right), \\\end{align}
(8b)
\begin{align}\upsilon_{jkl} & =V^{T}\left(\psi_{jkl}\odot\phi_{jk}+\psi_{jkl}\right),\end{align}
(8c)
\begin{align}\delta_{jkl} & =\frac{e^{\frac{\upsilon_{jkl}}{\tau}}}{\sum_{l}{e^{\frac{ \upsilon_{jkl}}{\tau}}}},\end{align}
(8d)
where \(\tau\) is a temperature parameter to adjust the probabilities in the softmax. We aggregate the answer representations \(O_{a_{jkl}}\) into each question representation \(\omega_{jk}\) using the learned attention \(\delta_{jkl}\):
\begin{align}\omega_{jk}=\sum_{l}{\delta_{jkl}O_{a_{jkl}}}.\end{align}
(9)
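A sketch of the answer-to-question attention of Equations (8)-(9) follows; the temperature \(\tau\) and context vector \(V\) follow the description above, while the module itself is our illustration.

```python
import torch
import torch.nn as nn

class AnswerAttention(nn.Module):
    """Sketch of Equations (8)-(9): pool a question's answers into omega_jk."""

    def __init__(self, feat_dim: int = 64, att_dim: int = 64, tau: float = 0.01):
        super().__init__()
        self.q_proj = nn.Linear(feat_dim, att_dim)   # Eq. (8a)
        self.a_proj = nn.Linear(feat_dim, att_dim)   # Eq. (8b)
        self.V = nn.Parameter(torch.randn(att_dim))  # global attention context
        self.tau = tau                               # softmax temperature

    def forward(self, O_q: torch.Tensor, O_a: torch.Tensor) -> torch.Tensor:
        # O_q: (feat_dim,) encoding of question q_jk
        # O_a: (num_answers, feat_dim) encodings of its answers
        phi = torch.tanh(self.q_proj(O_q))                # Eq. (8a)
        psi = torch.tanh(self.a_proj(O_a))                # Eq. (8b)
        upsilon = (psi * phi + psi) @ self.V              # Eq. (8c)
        delta = torch.softmax(upsilon / self.tau, dim=0)  # Eq. (8d)
        return (delta.unsqueeze(-1) * O_a).sum(dim=0)     # Eq. (9): omega_jk
```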
Analogously, we learn the attention weight \(\beta_{ijk}\) for review representation \(O_{t_{ij}}\) by projecting both question representation \(\omega_{jk}\) and review representation onto an attention space followed by a non-linear activation function; the outputs are \(\chi_{jk}\) and \(\rho^{\prime}_{ij}\), respectively. To learn the question-specific attention weight of a review, we let the question projection \(\chi_{jk}\) interact with the review projection \(\rho^{\prime}_{ij}\) in two ways: element-wise multiplication and summation. The learned vector \(E\) plays the role of global attention context. This produces an attention value \(\eta_{ijk}\), which is normalized using softmax to obtain \(\beta_{ijk}\):
\begin{align}\chi_{jk} & =\tanh\left(W_{\omega}\omega_{jk}+b_{\chi}\right), \\\end{align}
(10a)
\begin{align}\rho^{\prime}_{ij} & =\tanh\left(W_{O_{t}}(O_{t_{ij}}\odot p_{j})+b_{\rho^{\prime}}\right), \\\end{align}
(10b)
\begin{align}\eta_{ijk} & =E^{T}\left(\chi_{jk}\odot\rho^{\prime}_{ij}+\rho^{\prime}_{ij}\right), \end{align}
(10c)
\begin{align}\beta_{ijk} & =\frac{e^{\frac{\eta_{ijk}}{\tau}}}{\sum_{i}{e^{\frac{\eta_{ijk}} {\tau}}}}.\end{align}
(10d)
Using the question-specific attention weights \(\beta_{ijk}\), we aggregate the review representations \(O_{t_{ij}}\)’s into a question-specific representation \(d_{jk}\) as follows:
\begin{align}d_{jk}=\sum_{i}{\beta_{ijk}O_{t_{ij}}}.\end{align}
(11)
For a document (a product question with all of its reviews), we apply this attention mechanism to every product question, yielding a set of question-specific document representations \(d_{jk}\), \(k\in[1,|\mathcal{Q}_{j}|]\). All the \(d_{jk}\)'s need to be aggregated into the final document representation \(O_{j}\) before incorporating it into the product representation. Thus, we seek to learn the importance weight \(\gamma_{jk}\), signifying how much each question-specific representation \(d_{jk}\) contributes to \(O_{j}\):
\begin{align}\kappa_{jk} & =K^{T}\tanh(W_{d_{jk}}d_{jk}+b_{\kappa}), \end{align}
(12a)
\begin{align}\gamma_{jk} & =\frac{e^{\frac{\kappa_{jk}}{\tau}}}{\sum_{k}{e^{\frac{\kappa_{jk }}{\tau}}}}.\end{align}
(12b)
The question-specific representation \(d_{jk}\) is projected into the attention space through a layer of neurons with the non-linear activation function \(\tanh\). The scalar \(\kappa_{jk}\) indicates the importance of \(d_{jk}\), obtained by multiplying with the global attention context vector \(K\) (randomly initialized and learned during training). The representations \(d_{jk}\) due to the various questions are aggregated into the final product representation \(O_{j}\) using soft attention pooling with attention weights \(\gamma_{jk}\):
\begin{align} O_{j} & =\sum_{k}{\gamma_{jk}d_{jk}}, \end{align}
(13a)
\begin{align} Y_{j} & =W_{O_{j}}O_{j}+b_{Y}.\end{align}
(13b)
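Equations (10)-(13) can be sketched in one module: reviews are pooled per question into \(d_{jk}\), and the \(d_{jk}\)'s are pooled across questions into \(Y_{j}\) (again illustrative code, not the authors' implementation).

```python
import torch
import torch.nn as nn

class QuestionAttentiveReviewPooling(nn.Module):
    """Sketch of Equations (10)-(13): question-specific review pooling, then pooling over questions."""

    def __init__(self, feat_dim: int = 64, att_dim: int = 64, out_dim: int = 64, tau: float = 0.01):
        super().__init__()
        self.q_proj = nn.Linear(feat_dim, att_dim)   # Eq. (10a)
        self.t_proj = nn.Linear(feat_dim, att_dim)   # Eq. (10b)
        self.E = nn.Parameter(torch.randn(att_dim))  # context vector in Eq. (10c)
        self.d_proj = nn.Linear(feat_dim, att_dim)   # Eq. (12a)
        self.K = nn.Parameter(torch.randn(att_dim))  # context vector in Eq. (12a)
        self.fc = nn.Linear(feat_dim, out_dim)       # Eq. (13b)
        self.tau = tau

    def forward(self, omega: torch.Tensor, O_t: torch.Tensor, p_j: torch.Tensor):
        # omega: (num_questions, feat_dim) QA representations omega_jk of the item
        # O_t:   (num_reviews, feat_dim) encodings of the item's reviews
        # p_j:   (feat_dim,) rating-based item representation
        chi = torch.tanh(self.q_proj(omega))            # Eq. (10a): (K, att_dim)
        rho = torch.tanh(self.t_proj(O_t * p_j))        # Eq. (10b): (N, att_dim)
        eta = (chi.unsqueeze(1) * rho + rho) @ self.E   # Eq. (10c): (K, N)
        beta = torch.softmax(eta / self.tau, dim=1)     # Eq. (10d): normalize over reviews
        d = beta @ O_t                                  # Eq. (11): (K, feat_dim)
        kappa = torch.tanh(self.d_proj(d)) @ self.K     # Eq. (12a): (K,)
        gamma = torch.softmax(kappa / self.tau, dim=0)  # Eq. (12b)
        O_j = (gamma.unsqueeze(-1) * d).sum(dim=0)      # Eq. (13a)
        Y_j = self.fc(O_j)                              # Eq. (13b)
        return Y_j, gamma, beta                         # the weights are reused for explanation selection
```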
Prediction Layer. The latent factors of user \(i\) and product \(j\) are mapped to a shared hidden space as follows:
\begin{align}h_{ij}=[u_{i};X_{i};\zeta_{u}(i)]\odot[p_{j};Y_{j};\zeta_{p}(j)],\end{align}
(14)
where \(\zeta_{u}(\cdot)\) and \(\zeta_{p}(\cdot)\) are embedding functions that map each user and each product into their respective embedding spaces, \(X_{i}\) captures user preferences obtained from the user's reviews, \(Y_{j}\) captures item features obtained from the product's reviews and questions, and \([u_{i};X_{i};\zeta_{u}(i)]\) is the concatenation of the user rating-based representation \(u_{i}\), the user attention-based review pooling \(X_{i}\), and the user embedding \(\zeta_{u}(i)\). The final rating prediction is computed as follows:
\begin{align}\hat{r}_{ij}=W^{T}h_{ij}+b_{i}+b_{j}+\mu.\end{align}
(15)
Learning. Similar to prior works on the rating prediction task [3, 28, 43], which is a regression problem, we adopt the squared loss function:
\begin{align}\mathcal{L}=\sum_{i,j\in\Omega}{(\hat{r}_{ij}-r_{ij})^{2}},\end{align}
(16)
where \(\Omega\) denotes the set of all training instances and \(r_{ij}\) is the ground-truth rating that user \(i\) assigned to product \(j\).
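A minimal sketch of the prediction layer and loss in Equations (14)-(16), keeping the bias terms and embeddings as described (illustrative code):

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Sketch of Equations (14)-(15): interaction of concatenated user/item factors plus biases."""

    def __init__(self, num_users: int, num_items: int, dim: int):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)  # zeta_u
        self.item_emb = nn.Embedding(num_items, dim)  # zeta_p
        self.user_bias = nn.Embedding(num_users, 1)   # b_i
        self.item_bias = nn.Embedding(num_items, 1)   # b_j
        self.mu = nn.Parameter(torch.zeros(1))        # global bias
        self.W = nn.Linear(3 * dim, 1, bias=False)    # W in Eq. (15)

    def forward(self, i, j, u_i, X_i, p_j, Y_j):
        h_user = torch.cat([u_i, X_i, self.user_emb(i)], dim=-1)
        h_item = torch.cat([p_j, Y_j, self.item_emb(j)], dim=-1)
        h_ij = h_user * h_item                                              # Eq. (14)
        r_hat = self.W(h_ij) + self.user_bias(i) + self.item_bias(j) + self.mu
        return r_hat.squeeze(-1)                                            # Eq. (15)

# Eq. (16): squared error over the observed training ratings.
loss_fn = nn.MSELoss(reduction="sum")
```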
The most important question \(\mathbb{Q}\) is selected by computing \(\mathbb{Q}=\operatorname{argmax}_{k}(\gamma_{jk})\), and the most useful review is selected by \(\operatorname{argmax}_{i}(\beta_{ij\mathbb{Q}})\). We use the selected question with its answer and the selected review collectively as the explanation for a given recommendation.
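Reading off the explanation from the learned attention weights could look like the following sketch, where gamma and beta are the tensors produced by the attention layers above (the function name is ours):

```python
import torch

def select_explanation(gamma: torch.Tensor, beta: torch.Tensor, questions, reviews):
    # gamma: (num_questions,) attention weights gamma_jk over the item's questions
    # beta:  (num_questions, num_reviews) question-specific attention over the item's reviews
    q_idx = int(torch.argmax(gamma))         # most important question Q
    r_idx = int(torch.argmax(beta[q_idx]))   # most useful review with respect to Q
    return questions[q_idx], reviews[r_idx]  # shown together as the explanation
```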
A limitation of relying only on the questions found within a product is that product features may not be captured completely, because some products do not have sufficient questions to cover all their important aspects. As a result, an important review may be overlooked because it does not correspond to any question. To address this limitation, in addition to the questions found for a product, we include one more global “General Question,” which allows those important reviews to still be aligned. This additional question plays the role of a “global” aspect, and also helps our model to potentially generalize to products without questions.

4 Experiments

As this work is primarily about recommendation explanations, rather than rating prediction per se, and the two objectives are not necessarily directionally equivalent, our orientation is to improve explanations while maintaining parity in accuracy performance. In particular, our core contribution is in incorporating QA for review-level explanation. The experimental objectives revolve around the utility of QA as part of explanation, the effectiveness of QA to aid the selection of review-level explanation, and the alignment of QA and review that are part of an explanation. Source code is available for reproducibility.3
Datasets. Toward reproducibility, we work with publicly available sources. While QA is a feature on many platforms, not many such datasets have both reviews and QA information. One that does is the Amazon Product Review Dataset4 [12]. We experiment on 10 product categories from this source as separate instances. These categories are selected for significant availability of QA information. Consistent performance across multiple categories with different statistics bolsters the analysis. Table 2 summarizes basic statistics of the 10 datasets.
Dataset | #Item | #User | #Review (Rating) | #Question | #Answer | \(\frac{{\text{#Item with Question}}}{{\text{#Item}}}\) | \(\frac{{\text{#Answer}}}{{\text{#Question}}}\)
Home | 28,169 | 66,295 | 549,895 | 368,904 | 1,079,983 | 0.3193 | 2.93
Health | 18,464 | 38,416 | 344,888 | 105,814 | 207,330 | 0.1731 | 1.96
Sport | 18,301 | 35,447 | 295,074 | 123,119 | 237,845 | 0.1940 | 1.93
Toy | 11,870 | 19,322 | 166,821 | 35,520 | 75,276 | 0.1463 | 2.12
Grocery | 8,690 | 14,632 | 150,802 | 18,134 | 42,779 | 0.1301 | 2.36
Baby | 7,039 | 19,418 | 160,521 | 32,507 | 58,345 | 0.1301 | 1.79
Office | 2,414 | 4,892 | 53,143 | 68,864 | 165,623 | 0.4544 | 2.41
Automotive | 1,810 | 2,892 | 20,203 | 40,477 | 79,034 | 0.3470 | 1.95
Patio | 951 | 1,667 | 13,133 | 22,454 | 53,550 | 0.3049 | 2.38
Musical | 893 | 1,416 | 10,163 | 22,409 | 47,357 | 0.5622 | 2.11
Table 2. Data Statistics
For greater coverage, we collect item questions and acquire their helpful voting scores from the Amazon.com Web site.5 These question data complement yet are distinct from [33], which does not include helpful voting scores for every QA. Reviews that are too short (fewer than three words), as well as users and items with fewer than five reviews, are filtered out. To aggregate overlapping questions, we cluster the questions in each category with k-means, keeping questions from the big clusters that cover \(80\%\) of the questions. For the smaller clusters, we keep the question nearest to each cluster centroid and combine them into a single text, called the General Question (all products have this by default). This is used solely for modeling, to generalize to items without questions, and is not used as a recommendation explanation. Moreover, a question is always associated with at least one answer (when available). For questions without any answer, the question content is used as its own answer. In the subsequent experiments, we investigate QuestER, which includes only one answer, and QuestER+, which includes a maximum of five answers (an analysis of the maximum number of answers is reported in Section 4.4).
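A hedged sketch of this question-aggregation step, assuming TF-IDF features and scikit-learn's k-means; the feature choice and helper name are ours, not necessarily the authors' preprocessing code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def build_general_question(questions, k=50, coverage=0.8):
    """Keep questions from big clusters covering `coverage` of all questions;
    merge the centroid-nearest questions of the remaining clusters into one text."""
    X = TfidfVectorizer().fit_transform(questions)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=k)
    order = np.argsort(-sizes)                                # clusters, largest first
    cum = np.cumsum(sizes[order]) / len(questions)
    big = set(order[: int(np.searchsorted(cum, coverage)) + 1])
    kept = [q for q, c in zip(questions, km.labels_) if c in big]
    dists = km.transform(X)                                   # distances to each centroid
    general = [questions[int(np.argmin(dists[:, c]))] for c in range(k) if c not in big]
    return kept, " ".join(general)                            # the "General Question" text
```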
Baselines. We evaluate our proposed QuestER and QuestER+ against the following baselines in terms of useful review and QA selection. Comparisons between methods are tested with a one-tailed paired-sample Student’s t-test at the 0.05 level.
HRDR [28] uses an attention mechanism with the rating-based representation as features to weigh the contribution of each individual review toward the user/item final representation.
NARRE [3] learns to predict ratings and the usefulness of each review by applying an attention mechanism over reviews on user/item embeddings.
HFT [32] models the latent factors from user or item reviews by employing topic distributions. In this work, we employ item reviews and apply their proposed useful-review retrieval approach for selecting useful reviews. The number of topics is \(K=50\).
Among the three selected baselines, HRDR and NARRE use a similar TextCNN for learning text representations. There are other works that use other text processors [48, 49] (discussed more in Section 2), which we do not consider as direct baselines in this work. Note that our key distinction from the above-mentioned baselines is that we further incorporate product questions. As there is no prior work on predicting ratings along with selecting useful questions, when the evaluative task involves selecting questions (question retrieval and question similarity tasks, see Sections 4.1 and 4.3), we apply a similar approach for each baseline such that the item text consists of item questions instead of item reviews.
Training Details. Each item’s reviews are split randomly into train, validation, and test sets with ratio \(0.8:0.1:0.1\). Unknown users are excluded from the validation and test sets. Reviews in the validation and test sets are excluded from training and are not used for rating prediction on validation/test data. Answers are appended as additional text to the corresponding question. We employ pre-trained word embeddings from GloVe [36] to initialize the text embedding matrix with dimensionality \(100\); the embedding matrix is shared between reviews and questions. We use separate TextCNNs for user reviews, item reviews, and item QAs. The maximum number of tokens for each text \(W\) is \(128\), the number of neurons in the convolutional layer \(m\) is \(64\), and the window size \(w\) is \(3\). The number of latent factors was tested in \(k\in\{8,16,32,64\}\). After tuning, we set \(k=8\) for memory efficiency, as using larger \(k\) does not improve the performance significantly. The dropout ratio is \(0.5\) as in [3], and \(\tau\) is \(0.01\). We apply a 3-layer MLP for rating-based representation modeling as in [28], with the number of neural units in the hidden layers being \(\{128,64,m\}\). Using the Adam optimizer [17] with an initial learning rate of \(10^{-3}\) and a mini-batch size of \(64\), we see models tend to converge before \(20\) epochs. We set a maximum of \(20\) epochs and report the test result from the best performing model (the lowest MSE) on validation, a uniform practice across methods.
Brief Comment on Running Time. Our focus in this work is recommendation explanation, rather than computational efficiency. The models can be run offline. For a sense of the running times, Table 3 reports the training time and testing time of all models on a machine with an AMD EPYC 7742 64-Core Processor and an NVIDIA Quadro RTX 8000. Increasing the maximum number of answers to 5 (QuestER+) slows down training by approximately 1.3 \(\sim\) 1.5 times compared to training with only one answer. The inference times of all models are similarly fast.
Model | Home | Health | Sport | Toy | Grocery | Baby | Office | Automotive | Patio | Musical
QuestER | 19,707/4.7 | 11,892/2.9 | 9,805/2.7 | 5,453/1.5 | 5,014/1.3 | 5,360/1.4 | 2,011/0.4 | 646/0.2 | 448/0.1 | 334/0.1
QuestER+ | 27,493/4.8 | 16,560/2.8 | 13,966/2.7 | 7,741/1.5 | 7,202/1.2 | 7,690/1.4 | 2,651/0.4 | 931/0.2 | 649/0.2 | 503/0.1
HRDR | 13,603/4.3 | 10,906/3.8 | 9,770/2.5 | 3,199/1.5 | 3,048/1.2 | 3,704/1.4 | 1,158/0.4 | 424/0.2 | 267/0.1 | 180/0.1
NARRE | 9,855/5.1 | 6,254/3.1 | 5,034/2.8 | 2,093/1.6 | 2,755/1.3 | 2,855/1.5 | 1,067/0.4 | 329/0.2 | 249/0.1 | 172/0.1
HFT | 9,399/4.0 | 5,806/2.3 | 5,452/2.2 | 3,305/1.2 | 2,508/1.0 | 2,665/1.2 | 1,052/0.4 | 395/0.2 | 460/0.2 | 253/0.1
Table 3. Running Time (Train (Seconds)/Test (Seconds))

4.1 Question and Review Alignment

Our proposed recommendation explanation consists of a QA and a review. Ideally, these two components, QA on one hand, and review on the other hand, are well-aligned for a more coherent explanation. We measure this alignment using ROUGE [26] and METEOR [1], two well-known metrics for text matching and text summarization. To cater to words as well as phrases, we report F-Measure of ROUGE-1 measuring the overlapping unigrams, ROUGE-2 measuring the overlapping bigrams, and ROUGE-L measuring the longest common subsequence between the reference summary and evaluated summary. We compute ROUGE and METEOR scores for the top-1 selected question and review and report them in Table 4.
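The alignment computation could be reproduced roughly as follows, using the rouge-score package and NLTK's METEOR implementation; these tooling choices are ours and may differ from the authors' evaluation scripts.

```python
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score  # requires the NLTK wordnet data

def alignment_scores(question_with_answer: str, review: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(question_with_answer, review)
    return {
        "ROUGE-1": rouge["rouge1"].fmeasure,
        "ROUGE-2": rouge["rouge2"].fmeasure,
        "ROUGE-L": rouge["rougeL"].fmeasure,
        "METEOR": meteor_score([question_with_answer.split()], review.split()),
    }
```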
Data | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR
Home | QuestER | 15.68\({}^{\rm a}\) | 0.88\({}^{\rm a}\) | 7.73\({}^{\rm a}\) | 9.56\({}^{\rm a}\)
 | QuestER+ | 15.65\({}^{\rm a}\) | 0.89\({}^{\rm a}\) | 7.71\({}^{\rm a}\) | 9.55\({}^{\rm a}\)
 | HRDR | 14.85 | 0.75 | 7.08 | 8.36
 | NARRE | 14.66 | 0.72 | 6.57 | 7.39
 | HFT | 13.55 | 0.66 | 6.40 | 7.53
Health | QuestER | 19.54 | 1.59 | 7.99\({}^{\rm a}\) | 9.89\({}^{\rm a}\)
 | QuestER+ | 19.58 | 1.58 | 8.01\({}^{\rm a}\) | 9.90\({}^{\rm a}\)
 | HRDR | 19.59 | 1.59 | 7.88 | 9.65
 | NARRE | 17.97 | 1.33 | 6.45 | 7.31
 | HFT | 17.13 | 1.28 | 6.59 | 7.93
Sport | QuestER | 15.52\({}^{\rm a}\) | 0.72\({}^{\rm a}\) | 7.33\({}^{\rm a}\) | 9.04\({}^{\rm a}\)
 | QuestER+ | 15.56\({}^{\rm a}\) | 0.73\({}^{\rm a}\) | 7.35\({}^{\rm a}\) | 9.07\({}^{\rm a}\)
 | HRDR | 15.25 | 0.64 | 7.14 | 8.35
 | NARRE | 14.52 | 0.56 | 6.21 | 7.00
 | HFT | 13.88 | 0.56 | 6.09 | 7.29
Toy | QuestER | 15.80\({}^{\rm a}\) | 1.17\({}^{\rm a}\) | 7.84\({}^{\rm a}\) | 9.41\({}^{\rm a}\)
 | QuestER+ | 15.80\({}^{\rm a}\) | 1.17\({}^{\rm a}\) | 7.83\({}^{\rm a}\) | 9.41\({}^{\rm a}\)
 | HRDR | 15.20 | 1.08 | 7.18 | 8.12
 | NARRE | 15.08 | 1.03 | 7.05 | 7.86
 | HFT | 14.05 | 0.96 | 6.53 | 7.39
Grocery | QuestER | 16.82\({}^{\rm a}\) | 0.74\({}^{\rm a}\) | 7.04\({}^{\rm a}\) | 8.15\({}^{\rm a}\)
 | QuestER+ | 16.80\({}^{\rm a}\) | 0.74\({}^{\rm a}\) | 7.05\({}^{\rm a}\) | 8.13\({}^{\rm a}\)
 | HRDR | 16.18 | 0.67 | 6.45 | 7.35
 | NARRE | 15.22 | 0.56 | 5.51 | 5.85
 | HFT | 14.68 | 0.57 | 5.71 | 6.46
Baby | QuestER | 18.82\({}^{\rm a}\) | 1.23\({}^{\rm a}\) | 7.84\({}^{\rm a}\) | 10.59\({}^{\rm a}\)
 | QuestER+ | 18.80\({}^{\rm a}\) | 1.22\({}^{\rm a}\) | 7.81\({}^{\rm a}\) | 10.54\({}^{\rm a}\)
 | HRDR | 18.51 | 1.15 | 7.39 | 9.75
 | NARRE | 17.64 | 1.04 | 6.79 | 8.50
 | HFT | 15.93 | 0.88 | 6.14 | 7.61
Office | QuestER | 18.00\({}^{\rm a}\) | 0.99\({}^{\rm a}\) | 7.89\({}^{\rm a}\) | 12.44\({}^{\rm a}\)
 | QuestER+ | 17.82\({}^{\rm a}\) | 0.99\({}^{\rm a}\) | 7.76\({}^{\rm a}\) | 12.27\({}^{\rm a}\)
 | HRDR | 17.53 | 0.76 | 7.36 | 11.36
 | NARRE | 17.14 | 0.70 | 6.76 | 9.13
 | HFT | 15.07 | 0.61 | 6.32 | 8.93
Automotive | QuestER | 17.94 | 1.22 | 7.98\({}^{\rm a}\) | 10.36
 | QuestER+ | 17.79 | 1.19 | 7.85 | 10.36
 | HRDR | 17.72 | 1.16 | 7.65 | 10.28
 | NARRE | 16.35 | 0.91 | 6.16 | 7.36
 | HFT | 15.27 | 0.88 | 6.41 | 8.12
Patio | QuestER | 18.93 | 1.74 | 8.96 | 13.19
 | QuestER+ | 18.91 | 1.76 | 9.07 | 13.29
 | HRDR | 18.55 | 1.73 | 8.94 | 13.29
 | NARRE | 16.90 | 1.32 | 7.12 | 9.42
 | HFT | 15.53 | 1.24 | 7.13 | 10.48
Musical | QuestER | 16.42\({}^{\rm a}\) | 0.96\({}^{\rm a}\) | 7.44\({}^{\rm a}\) | 11.16\({}^{\rm a}\)
 | QuestER+ | 16.11\({}^{\rm a}\) | 0.91\({}^{\rm a}\) | 7.37\({}^{\rm a}\) | 10.71\({}^{\rm a}\)
 | HRDR | 14.81 | 0.70 | 6.63 | 9.75
 | NARRE | 13.94 | 0.48 | 5.64 | 6.90
 | HFT | 12.98 | 0.55 | 5.96 | 8.73
Table 4. Performance in Question and Review Alignment
\({}^{\rm a}\)Denotes statistically significant improvements. Highest values are in bold.
The results show that the proposed QuestER and QuestER+ consistently and significantly outperform the baselines across virtually all the datasets. This shows that QuestER's QAs and reviews that are part of a collective explanation are better aligned with each other, as compared to the respective pairings identified by the baselines. Note that HRDR, NARRE, and HFT were designed solely to select helpful reviews. To be able to compare with these models, we ran each model twice, once with reviews and another time replacing item reviews with QAs. This approach essentially treats review and question in a disjoint manner, which contributes to why they underperform compared to our proposed QuestER, which jointly selects a review and a question that are well aligned with each other.

4.2 Review-Level Explanation

Here we assess whether incorporating questions helps in selecting reviews for the explanation. We take the reviews with the greatest positive helpfulness voting scores on every product as the ground truth to study the performance of selecting useful reviews. We use Precision at \(5\) (Prec@5), Recall at \(5\) (Rec@5), and F1@5 for evaluation. As reported in Table 5 (left), our proposed QuestER and QuestER+ are the better-performing methods overall. Their outperformance over the baseline models is statistically significant in the majority of cases. Even in the remaining cases, QuestER still significantly outperforms NARRE (on the Automotive, Patio, and Musical categories) and HFT (on the Automotive category).
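For reference, the ranking metrics can be computed with a few lines; the function below is an illustrative sketch, not the paper's evaluation code.

```python
def precision_recall_f1_at_k(ranked_ids, relevant_ids, k=5):
    # ranked_ids: review ids ordered by the model's attention weights
    # relevant_ids: ids of the ground-truth most-voted reviews for the product
    top_k = ranked_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    prec = hits / k
    rec = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1
```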
Data | Model | Prec@5 | Rec@5 | F1@5 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR
Home | QuestER | 0.145\({}^{\rm a}\) | 0.634\({}^{\rm a}\) | 0.231\({}^{\rm a}\) | 34.27\({}^{\rm a}\) | 18.48\({}^{\rm a}\) | 24.56\({}^{\rm a}\) | 27.93\({}^{\rm a}\)
 | QuestER+ | 0.145\({}^{\rm a}\) | 0.632\({}^{\rm a}\) | 0.231\({}^{\rm a}\) | 34.34\({}^{\rm a}\) | 18.58\({}^{\rm a}\) | 24.66\({}^{\rm a}\) | 27.97\({}^{\rm a}\)
 | HRDR | 0.136 | 0.588 | 0.216 | 32.00 | 16.21 | 22.28 | 25.46
 | NARRE | 0.129 | 0.557 | 0.204 | 27.53 | 11.84 | 17.79 | 21.06
 | HFT | 0.141 | 0.613 | 0.224 | 28.87 | 14.36 | 19.99 | 23.35
Health | QuestER | 0.152\({}^{\rm a}\) | 0.645\({}^{\rm a}\) | 0.239\({}^{\rm a}\) | 34.13\({}^{\rm a}\) | 19.16\({}^{\rm a}\) | 24.93\({}^{\rm a}\) | 27.99
 | QuestER+ | 0.152\({}^{\rm a}\) | 0.645\({}^{\rm a}\) | 0.239\({}^{\rm a}\) | 34.20\({}^{\rm a}\) | 19.22\({}^{\rm a}\) | 25.00\({}^{\rm a}\) | 28.03
 | HRDR | 0.142 | 0.601 | 0.224 | 33.17 | 17.65 | 23.62 | 28.02
 | NARRE | 0.137 | 0.574 | 0.215 | 26.46 | 11.63 | 17.27 | 20.62
 | HFT | 0.149 | 0.635 | 0.236 | 28.69 | 14.70 | 20.18 | 23.86
Sport | QuestER | 0.157 | 0.663\({}^{\rm a}\) | 0.247 | 34.65\({}^{\rm a}\) | 19.60\({}^{\rm a}\) | 25.37\({}^{\rm a}\) | 28.50
 | QuestER+ | 0.157 | 0.663\({}^{\rm a}\) | 0.248 | 34.64\({}^{\rm a}\) | 19.60\({}^{\rm a}\) | 25.36\({}^{\rm a}\) | 28.50
 | HRDR | 0.151 | 0.633 | 0.237 | 34.24 | 19.04 | 24.87 | 28.73
 | NARRE | 0.141 | 0.591 | 0.222 | 27.84 | 12.63 | 18.41 | 22.24
 | HFT | 0.155 | 0.656 | 0.245 | 29.63 | 15.48 | 20.95 | 24.83
Toy | QuestER | 0.158\({}^{\rm a}\) | 0.682\({}^{\rm a}\) | 0.250\({}^{\rm a}\) | 36.72\({}^{\rm a}\) | 21.04\({}^{\rm a}\) | 26.68\({}^{\rm a}\) | 29.99\({}^{\rm a}\)
 | QuestER+ | 0.158\({}^{\rm a}\) | 0.681\({}^{\rm a}\) | 0.250\({}^{\rm a}\) | 36.74\({}^{\rm a}\) | 21.08\({}^{\rm a}\) | 26.74\({}^{\rm a}\) | 30.02\({}^{\rm a}\)
 | HRDR | 0.143 | 0.611 | 0.226 | 31.67 | 15.20 | 21.14 | 25.72
 | NARRE | 0.143 | 0.611 | 0.226 | 30.35 | 14.13 | 19.98 | 24.19
 | HFT | 0.149 | 0.642 | 0.236 | 30.18 | 15.48 | 20.81 | 24.58
Grocery | QuestER | 0.165\({}^{\rm a}\) | 0.695\({}^{\rm a}\) | 0.260\({}^{\rm a}\) | 36.31\({}^{\rm a}\) | 21.36\({}^{\rm a}\) | 27.22\({}^{\rm a}\) | 30.23\({}^{\rm a}\)
 | QuestER+ | 0.165\({}^{\rm a}\) | 0.697\({}^{\rm a}\) | 0.261\({}^{\rm a}\) | 36.13\({}^{\rm a}\) | 21.13\({}^{\rm a}\) | 27.00\({}^{\rm a}\) | 30.01\({}^{\rm a}\)
 | HRDR | 0.155 | 0.649 | 0.244 | 32.49 | 16.83 | 22.90 | 28.08
 | NARRE | 0.152 | 0.635 | 0.239 | 28.66 | 13.33 | 19.26 | 23.03
 | HFT | 0.162 | 0.681 | 0.255 | 30.43 | 16.05 | 21.70 | 25.73
Baby | QuestER | 0.138\({}^{\rm a}\) | 0.578\({}^{\rm a}\) | 0.217\({}^{\rm a}\) | 35.05\({}^{\rm a}\) | 18.00\({}^{\rm a}\) | 24.15\({}^{\rm a}\) | 27.70\({}^{\rm a}\)
 | QuestER+ | 0.139\({}^{\rm a}\) | 0.583\({}^{\rm a}\) | 0.218\({}^{\rm a}\) | 34.96\({}^{\rm a}\) | 17.87\({}^{\rm a}\) | 24.03\({}^{\rm a}\) | 27.70\({}^{\rm a}\)
 | HRDR | 0.123 | 0.509 | 0.192 | 31.55 | 13.85 | 20.21 | 25.27
 | NARRE | 0.119 | 0.496 | 0.187 | 28.23 | 11.15 | 17.22 | 21.16
 | HFT | 0.128 | 0.537 | 0.201 | 27.77 | 12.49 | 18.00 | 21.39
Office | QuestER | 0.144\({}^{\rm a}\) | 0.597\({}^{\rm a}\) | 0.222\({}^{\rm a}\) | 35.19\({}^{\rm a}\) | 18.40\({}^{\rm a}\) | 24.12\({}^{\rm a}\) | 28.62
 | QuestER+ | 0.145\({}^{\rm a}\) | 0.601\({}^{\rm a}\) | 0.224\({}^{\rm a}\) | 35.96\({}^{\rm a}\) | 19.32\({}^{\rm a}\) | 24.97\({}^{\rm a}\) | 29.81
 | HRDR | 0.135 | 0.548 | 0.207 | 33.22 | 15.70 | 21.69 | 28.90
 | NARRE | 0.124 | 0.500 | 0.189 | 26.35 | 9.83 | 15.28 | 19.71
 | HFT | 0.126 | 0.516 | 0.193 | 27.04 | 12.00 | 17.04 | 21.48
Automotive | QuestER | 0.176 | 0.745 | 0.278 | 36.75 | 22.28 | 27.91 | 31.11
 | QuestER+ | 0.174 | 0.740 | 0.275 | 36.25 | 21.78 | 27.48 | 30.41
 | HRDR | 0.173 | 0.731 | 0.273 | 35.79 | 20.59 | 26.62 | 31.94
 | NARRE | 0.156 | 0.651 | 0.245 | 26.89 | 12.09 | 17.69 | 21.28
 | HFT | 0.168 | 0.710 | 0.265 | 29.94 | 15.95 | 21.44 | 25.04
Patio | QuestER | 0.166 | 0.694 | 0.256 | 38.10 | 21.94 | 27.58 | 32.27
 | QuestER+ | 0.164 | 0.685 | 0.253 | 37.14 | 20.87 | 26.61 | 31.07
 | HRDR | 0.165 | 0.679 | 0.252 | 37.01 | 20.13 | 25.93 | 32.72
 | NARRE | 0.152 | 0.629 | 0.233 | 28.25 | 11.65 | 17.23 | 22.38
 | HFT | 0.168 | 0.704 | 0.260 | 33.01 | 17.97 | 23.26 | 27.74
Musical | QuestER | 0.145 | 0.617 | 0.230 | 35.40\({}^{\rm a}\) | 20.18\({}^{\rm a}\) | 25.88\({}^{\rm a}\) | 30.13\({}^{\rm a}\)
 | QuestER+ | 0.149 | 0.633 | 0.236 | 36.21\({}^{\rm a}\) | 21.26\({}^{\rm a}\) | 26.82\({}^{\rm a}\) | 30.69\({}^{\rm a}\)
 | HRDR | 0.144 | 0.611 | 0.228 | 32.38 | 16.07 | 22.05 | 27.38
 | NARRE | 0.132 | 0.563 | 0.210 | 25.53 | 10.16 | 15.87 | 19.41
 | HFT | 0.144 | 0.613 | 0.228 | 27.40 | 12.56 | 18.08 | 22.32
Table 5. Performance in Review-Level Explanation Task
\({}^{\rm a}\)Denotes statistically significant improvements over the baselines. Highest values are in bold.
To further assess the quality of the top-ranked reviews against the top-rated helpful reviews, we again use ROUGE and METEOR as metrics. The results in Table 5 consistently show that our proposed QuestER and QuestER+ outperform all baseline models significantly in the majority of cases, i.e., the top-ranked reviews from QuestER and QuestER+ are more similar to the top-rated helpful reviews than those of HRDR, NARRE, and HFT. Overall, QuestER and QuestER+ use product QA in addition to reviews and achieve better results than the baseline methods that only use reviews, suggesting that using QA aids in selecting more useful reviews.

4.3 Question-Level Explanation

The novelty of the proposed QuestER and QuestER+ is in producing question-level explanations along with review-level explanations. We conduct a quantitative evaluation homologous to the Review-Level Explanation one above, but now with question votes as the ground truth, and measure Prec@5, Rec@5, and F1@5. In addition, we measure the similarity between the top-ranked question by QuestER (or QuestER+) and the top-voted useful question using ROUGE and METEOR; only the questions are evaluated here. As shown in Table 6, QuestER and QuestER+ are significantly better than the other baselines throughout. This result further highlights the improvement of the current version of QuestER (this work) over the previous version [22], achieving quantitatively better results for question-level explanation.
Data | Model | Prec@5 | Rec@5 | F1@5 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR
Home | QuestER | 0.097\({}^{\rm a}\) | 0.360\({}^{\rm a}\) | 0.146\({}^{\rm a}\) | 21.09\({}^{\rm a}\) | 10.97\({}^{\rm a}\) | 17.82\({}^{\rm a}\) | 20.26\({}^{\rm a}\)
 | QuestER+ | 0.097\({}^{\rm a}\) | 0.365\({}^{\rm a}\) | 0.147\({}^{\rm a}\) | 20.95\({}^{\rm a}\) | 10.88\({}^{\rm a}\) | 17.67\({}^{\rm a}\) | 20.26\({}^{\rm a}\)
 | HRDR | 0.082 | 0.307 | 0.124 | 17.47 | 7.52 | 13.22 | 16.51
 | NARRE | 0.082 | 0.307 | 0.124 | 17.69 | 7.77 | 13.52 | 16.75
 | HFT | 0.082 | 0.309 | 0.125 | 17.72 | 8.14 | 14.91 | 16.33
Health | QuestER | 0.115\({}^{\rm a}\) | 0.447\({}^{\rm a}\) | 0.177\({}^{\rm a}\) | 23.45\({}^{\rm a}\) | 14.36\({}^{\rm a}\) | 20.51\({}^{\rm a}\) | 22.98\({}^{\rm a}\)
 | QuestER+ | 0.114\({}^{\rm a}\) | 0.439\({}^{\rm a}\) | 0.175\({}^{\rm a}\) | 23.65\({}^{\rm a}\) | 14.24\({}^{\rm a}\) | 20.72\({}^{\rm a}\) | 22.87\({}^{\rm a}\)
 | HRDR | 0.091 | 0.347 | 0.139 | 16.74 | 7.25 | 12.00 | 16.64
 | NARRE | 0.089 | 0.342 | 0.136 | 17.62 | 8.18 | 13.70 | 16.73
 | HFT | 0.092 | 0.353 | 0.140 | 18.36 | 8.95 | 15.63 | 17.33
Sport | QuestER | 0.114\({}^{\rm a}\) | 0.443\({}^{\rm a}\) | 0.175\({}^{\rm a}\) | 24.03\({}^{\rm a}\) | 14.04\({}^{\rm a}\) | 20.89\({}^{\rm a}\) | 23.24\({}^{\rm a}\)
 | QuestER+ | 0.116\({}^{\rm a}\) | 0.447\({}^{\rm a}\) | 0.178\({}^{\rm a}\) | 23.40\({}^{\rm a}\) | 13.35\({}^{\rm a}\) | 20.18\({}^{\rm a}\) | 22.81\({}^{\rm a}\)
 | HRDR | 0.085 | 0.329 | 0.131 | 13.13 | 3.65 | 7.79 | 12.71
 | NARRE | 0.088 | 0.335 | 0.134 | 18.14 | 8.08 | 13.83 | 17.04
 | HFT | 0.090 | 0.343 | 0.138 | 20.13 | 10.03 | 17.26 | 18.69
Toy | QuestER | 0.130\({}^{\rm a}\) | 0.485\({}^{\rm a}\) | 0.197\({}^{\rm a}\) | 23.80\({}^{\rm a}\) | 14.82\({}^{\rm a}\) | 20.77\({}^{\rm a}\) | 23.74\({}^{\rm a}\)
 | QuestER+ | 0.126\({}^{\rm a}\) | 0.468\({}^{\rm a}\) | 0.191\({}^{\rm a}\) | 23.85\({}^{\rm a}\) | 14.02\({}^{\rm a}\) | 20.70\({}^{\rm a}\) | 23.70\({}^{\rm a}\)
 | HRDR | 0.106 | 0.392 | 0.161 | 14.50 | 5.27 | 9.21 | 15.61
 | NARRE | 0.107 | 0.394 | 0.162 | 19.15 | 10.00 | 15.10 | 19.69
 | HFT | 0.110 | 0.404 | 0.166 | 21.16 | 11.80 | 18.49 | 20.79
Grocery | QuestER | 0.125\({}^{\rm a}\) | 0.503\({}^{\rm a}\) | 0.194\({}^{\rm a}\) | 26.92\({}^{\rm a}\) | 18.08\({}^{\rm a}\) | 24.08\({}^{\rm a}\) | 26.11\({}^{\rm a}\)
 | QuestER+ | 0.124\({}^{\rm a}\) | 0.504\({}^{\rm a}\) | 0.193\({}^{\rm a}\) | 23.32 | 14.01 | 20.11 | 22.12\({}^{\rm a}\)
 | HRDR | 0.105 | 0.427 | 0.164 | 20.16 | 10.79 | 15.79 | 19.17
 | NARRE | 0.103 | 0.425 | 0.161 | 17.66 | 8.28 | 13.18 | 17.48
 | HFT | 0.105 | 0.437 | 0.166 | 21.70 | 12.28 | 18.93 | 19.37
Baby | QuestER | 0.110\({}^{\rm a}\) | 0.399\({}^{\rm a}\) | 0.166\({}^{\rm a}\) | 23.70\({}^{\rm a}\) | 13.21\({}^{\rm a}\) | 20.16\({}^{\rm a}\) | 22.62
 | QuestER+ | 0.104\({}^{\rm a}\) | 0.384\({}^{\rm a}\) | 0.157\({}^{\rm a}\) | 22.52 | 11.43 | 18.72 | 21.30
 | HRDR | 0.085 | 0.317 | 0.129 | 15.07 | 4.22 | 9.78 | 15.32
 | NARRE | 0.086 | 0.327 | 0.132 | 20.63 | 9.58 | 16.28 | 20.18
 | HFT | 0.085 | 0.314 | 0.129 | 19.34 | 9.88 | 16.57 | 17.45
Office | QuestER | 0.101\({}^{\rm a}\) | 0.399\({}^{\rm a}\) | 0.155\({}^{\rm a}\) | 21.85\({}^{\rm a}\) | 11.98\({}^{\rm a}\) | 18.60\({}^{\rm a}\) | 20.67\({}^{\rm a}\)
 | QuestER+ | 0.107\({}^{\rm a}\) | 0.415\({}^{\rm a}\) | 0.164\({}^{\rm a}\) | 21.63\({}^{\rm a}\) | 11.56\({}^{\rm a}\) | 18.13\({}^{\rm a}\) | 20.75\({}^{\rm a}\)
 | HRDR | 0.075 | 0.291 | 0.115 | 14.08 | 4.00 | 8.61 | 12.84
 | NARRE | 0.072 | 0.273 | 0.109 | 13.57 | 3.78 | 8.53 | 12.72
 | HFT | 0.075 | 0.290 | 0.115 | 17.36 | 7.44 | 14.54 | 15.57
Automotive | QuestER | 0.106\({}^{\rm a}\) | 0.416\({}^{\rm a}\) | 0.163\({}^{\rm a}\) | 26.50\({}^{\rm a}\) | 15.76\({}^{\rm a}\) | 23.23\({}^{\rm a}\) | 25.39\({}^{\rm a}\)
 | QuestER+ | 0.107\({}^{\rm a}\) | 0.417\({}^{\rm a}\) | 0.164\({}^{\rm a}\) | 28.61\({}^{\rm a}\) | 18.39\({}^{\rm a}\) | 25.58\({}^{\rm a}\) | 27.49\({}^{\rm a}\)
 | HRDR | 0.063 | 0.251 | 0.097 | 14.57 | 3.65 | 10.36 | 12.25
 | NARRE | 0.063 | 0.253 | 0.098 | 16.15 | 5.31 | 11.01 | 14.82
 | HFT | 0.060 | 0.242 | 0.093 | 15.82 | 5.79 | 13.18 | 13.26
Patio | QuestER | 0.094\({}^{\rm a}\) | 0.384\({}^{\rm a}\) | 0.147\({}^{\rm a}\) | 23.29\({}^{\rm a}\) | 13.05\({}^{\rm a}\) | 20.10\({}^{\rm a}\) | 21.32\({}^{\rm a}\)
 | QuestER+ | 0.104\({}^{\rm a}\) | 0.422\({}^{\rm a}\) | 0.162\({}^{\rm a}\) | 21.25\({}^{\rm a}\) | 10.59\({}^{\rm a}\) | 17.84\({}^{\rm a}\) | 20.17\({}^{\rm a}\)
 | HRDR | 0.051 | 0.198 | 0.079 | 14.67 | 4.07 | 10.40 | 12.16
 | NARRE | 0.055 | 0.210 | 0.084 | 11.41 | 1.74 | 6.57 | 9.43
 | HFT | 0.054 | 0.212 | 0.083 | 14.55 | 5.42 | 12.27 | 10.83
Musical | QuestER | 0.118\({}^{\rm a}\) | 0.446\({}^{\rm a}\) | 0.179\({}^{\rm a}\) | 23.95\({}^{\rm a}\) | 13.01\({}^{\rm a}\) | 20.57\({}^{\rm a}\) | 23.43\({}^{\rm a}\)
 | QuestER+ | 0.111\({}^{\rm a}\) | 0.427\({}^{\rm a}\) | 0.170\({}^{\rm a}\) | 22.58\({}^{\rm a}\) | 11.82\({}^{\rm a}\) | 19.38\({}^{\rm a}\) | 20.74\({}^{\rm a}\)
 | HRDR | 0.075 | 0.293 | 0.116 | 18.48 | 7.66 | 13.84 | 16.79
 | NARRE | 0.087 | 0.339 | 0.134 | 12.86 | 2.19 | 7.00 | 12.29
 | HFT | 0.086 | 0.352 | 0.134 | 17.65 | 6.88 | 14.61 | 15.11
Table 6. Performance in Question-Level Explanation Task
\({}^{\rm a}\)Denotes statistically significant improvements over the baselines. Highest values are in bold.

4.4 Rating Prediction

As previously established, our main focus in this work is on recommendation explanations, with an eye on improving the selection of reviews and incorporating questions in that endeavor. Nevertheless, while recommendation accuracy is not the main focus, we find that QuestER still maintains parity in this regard with the other methods.
We report the MSE averaged across users for each category in Table 7. Our proposed QuestER and QuestER+ achieve comparable results to the neural models HRDR and NARRE. HFT, which is based on a graphical model, varies from the neural models: depending on the domain, it is lower in some cases and higher in others. Such variation in performance between simpler models and more complex neural models in terms of rating prediction is expected and has also been reported in [40].
Data | HFT | NARRE | HRDR | QuestER | QuestER+
Home | 1.2775 | 1.2654 | 1.2677 | 1.2670 | 1.2666
Health | 1.2712 | 1.2853 | 1.2878 | 1.2862 | 1.2861
Sport | 1.0251 | 1.0054 | 1.0072 | 1.0053 | 1.0047
Toy | 0.9136 | 0.9971 | 0.9973 | 0.9974 | 0.9979
Grocery | 1.2007 | 1.1987 | 1.1988 | 1.2011 | 1.2027
Baby | 1.3719 | 1.3622 | 1.3639 | 1.3613 | 1.3614
Office | 0.8948 | 0.9248 | 0.9267 | 0.9245 | 0.9250
Automotive | 0.9570 | 0.9248 | 0.9250 | 0.9258 | 0.9236
Patio | 1.1173 | 1.1537 | 1.1594 | 1.1588 | 1.1564
Musical | 0.8846 | 0.8136 | 0.8102 | 0.8174 | 0.8155
Table 7. Rating Prediction Performance: MSE
In any case, as we see from the previous experiments as well, QuestER and QuestER+ stand out in having the better review-level and question-level explanations, which are the main focal points of this work.
Effect of the Number of Answers in Each Question. We now analyze the effect of the maximum number of answers used for each question. We report the MSE averaged across users on each category when varying the maximum number of answers in the set \(\{\)1, 3, 5, 10\(\}\) in Table 8. We observe relatively minor differences in rating prediction performance among the variants. The proposed method achieves the best MSE with five answers (QuestER+) in the majority of cases, which motivates us to further evaluate this variant in the preceding experiments.
Data | MSE (1 Answer) | MSE (3 Answers) | MSE (5 Answers) | MSE (10 Answers)
Home | 1.2670 | 1.2667 | 1.2666 | 1.2669
Health | 1.2862 | 1.2865 | 1.2861 | 1.2862
Sport | 1.0053 | 1.0051 | 1.0047 | 1.0051
Toy | 0.9974 | 0.9976 | 0.9979 | 0.9972
Grocery | 1.2011 | 1.2006 | 1.2027 | 1.2001
Baby | 1.3613 | 1.3624 | 1.3614 | 1.3622
Office | 0.9245 | 0.9239 | 0.9250 | 0.9248
Automotive | 0.9258 | 0.9253 | 0.9236 | 0.9238
Patio | 1.1588 | 1.1560 | 1.1564 | 1.1551
Musical | 0.8174 | 0.8123 | 0.8155 | 0.8178
Table 8. Rating Prediction Performance (MSE) of QuestER w.r.t Different Maximum Number of Answers

4.5 Case Studies

To investigate the usefulness of the recommendation explanation consisting of a QA as well as a review, we show a few case studies that benchmark QuestER against the most voted question and the most voted review:
Figure 5 shows five sets of explanations for a sanding pad product of the Meguiar's brand. The first set (in the gray box, above) comprises a QA and a review based on Top_Rated_Useful votes. The second set (in the green box) comprises those selected by our QuestER. While both QuestER and Top_Rated_Useful provide useful information about the product, QuestER's explanation is notable in two respects. For one, QuestER's question with its answer is more aligned with its review than that of Top_Rated_Useful; the ROUGE-L F-Measures for QuestER and Top_Rated_Useful are \(10.61\) and \(8.37\), respectively. For another, Top_Rated_Useful is based on explicit votes, which are not found on many products and are therefore not universally available or applicable. The following three blue boxes comprise the explanations produced by the baseline methods. While NARRE selects the same review explanation as QuestER, it produces a different QA explanation.
Figure 6 shows the explanation for a breast pump product of the Medela brand. Both QuestER and Top_Rated_Useful provide further useful information about the product. QuestER's question with its answer is more aligned with its review than that of Top_Rated_Useful; the ROUGE-L F-Measures are \(12.59\) and \(9.02\), respectively. In this case, the baselines HRDR and HFT pick the same question.
Figure 7 shows the explanation for a guitar rest. Notably, the pairing by Top_Rated_Useful is not so coherent, as the QA discusses its use for guitars, while the review discusses its use for ukuleles. In contrast, the QA and the review by QuestER concentrate on the key issue of how well the item could hold a guitar at rest. QuestER's QA is more aligned with its review than that of Top_Rated_Useful; the ROUGE-L F-Measures are \(14.71\) and \(6.64\), respectively.
Fig. 5. Example explanation: Meguiar’s Sanding Pad (explanation by Top_Rated_Useful is in gray, that by QuestER is in green, and those by other baselines are in blue).
Fig. 6. Example explanation: Medela’s Breast Pump (explanation by Top_Rated_Useful is in gray, that by QuestER is in green, and those by other baselines are in blue).
Fig. 7. Example explanation: Planet Waves Guitar Rest (explanation by Top_Rated_Useful is in gray, that by QuestER is in green, and those by other baselines are in blue).

4.6 User Studies

To evaluate the quality of questions and reviews selected by QuestER and Top_Rated_Useful (based on user votes on Amazon.com), we conduct a couple of user studies.
Reviews vs. QAs. In the first study, we seek to investigate whether users find questions and reviews helpful as part of a recommendation explanation. We conduct user studies covering 30 examples (3 products from each category). We split these examples into 3 surveys, each containing 10 examples from different domains, generated by QuestER. Each survey is done by \(5\) annotators, for a total of \(15\) annotators who are neither the authors nor aware of the objective of the study. Each product is presented with both a question and a review in random order (the review and the question can be either group A or group B). We ask annotators to assess the pairwise quality with four options:
I.
A is more useful than B.
II.
B is more useful than A.
III.
A and B are almost the same, both useful.
IV.
A and B are almost the same, both useless.
The Fleiss’ kappa [19] for agreement on categorical ratings, \(\kappa=0.2955\), implies fair agreement.
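For reference, the agreement statistic can be computed as in the following sketch, using statsmodels on a toy ratings matrix (the data here are made up for illustration, not the study's responses):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[s][r] = option (I-IV coded 0-3) chosen by rater r for survey item s (toy data).
ratings = np.array([[0, 0, 2, 2, 0],
                    [2, 2, 2, 1, 2],
                    [3, 0, 0, 0, 1]])
table, _ = aggregate_raters(ratings)          # items x categories count table
print(fleiss_kappa(table, method="fleiss"))
```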
Pairwise evaluation results are shown in Figure 8. As the key proposal is to have both a review and a question as part of an expanded explanation, it is gratifying that the most popular option is that both are useful, attaining 39.3%. While the percentage that finds reviews more useful is slightly higher than the percentage that finds questions more useful, this is less important, as we are not seeking to replace reviews with questions. Excluding “both useless,” 96% find at least one of the two useful. We repeat the same study with explanations coming from Top_Rated_Useful and the conclusion still holds, i.e., the most popular option is that both the review and the QA are useful.
Fig. 8. Review vs. QA annotation results.
QuestER vs. Top_Rated_Useful. In the second user study, we investigate the quality of the proposed combined explanation form consisting of a QA and a review. With the same set of examples and annotators, we split the examples into \(3\) other surveys, each containing \(10\) products from different categories. We present the explanations blindly by ordering the surveys' questions and explanations randomly (group A and group B are now either QuestER or Top_Rated_Useful). We ask similar questions as in the first study. Figure 9 shows the pairwise evaluation results between QuestER and Top_Rated_Useful. The Fleiss’ kappa score is \(0.217\), indicating fair agreement. In summary, when combining question and review as explanation, the explanations of both QuestER and Top_Rated_Useful are overall considered useful (\(96.67\%\)). Among those, the question and review selected by QuestER are considered slightly more useful (\(26.7\%\)) than those of Top_Rated_Useful (\(25.3\%\)).
Fig. 9. QuestER vs. Top_Rated_Useful annotation results.
As important as the slight outperformance of QuestER over Top_Rated_Useful, or perhaps more so, is that QuestER is a more widely applicable method. In contrast, Top_Rated_Useful relies on the existence of helpfulness votes, which are relatively rare; it therefore stands more as a benchmark than as a practical method for selecting reviews and QAs as explanations.

4.7 Discussion

Robust Rating Prediction Layer. Here we further explore the cold-start scenario by removing reviews as well as QAs. Keeping the available ratings, we randomly remove reviews at ratios in the range [0,1] with step size 0.1. The results in Figure 10 consistently show, across all datasets, that the rating prediction performance of the proposed QuestER remains quite stable regardless of the amount of reviews. This can be explained by Equation (15): missing reviews only discard the contribution of the user and item representations constructed from reviews and QAs, while the rating-based representations as well as the latent factors \(\zeta_{u}\) and \(\zeta_{p}\) remain available. We further note that Equation (15) can produce rating predictions for users/items with only ratings (\(u_{i}\) and \(p_{j}\)), only content (\(X_{i}\) and \(Y_{j}\)), or known latent factors (\(\zeta_{u}(i)\) and \(\zeta_{p}(j)\)). In addition, we investigate the overall rating prediction when varying the number of available questions, by varying the threshold on the fraction of questions to be covered by big clusters in the range [0,1] with step size 0.1 (in the main experiment, this threshold is 0.8).6 We observe a similar trend: the rating prediction performance remains quite stable (see Figure 11).
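To make the fallback behavior concrete, below is a minimal sketch of a prediction layer with this property. It is not the exact form of Equation (15); it merely assumes, as described above, that the rating-based interaction, the review/QA-based interaction, and the latent factors contribute additively, so that a missing component can simply be dropped without preventing a prediction. All variable names are hypothetical.

import numpy as np

def predict_rating(u_i, p_j, x_i=None, y_j=None, zeta_u_i=0.0, zeta_p_j=0.0, mu=0.0):
    # Rating-based interaction is available once the user/item has ratings.
    score = float(np.dot(u_i, p_j))
    # Review/QA-based representations contribute only when reviews/QAs exist;
    # missing content drops this term rather than breaking the prediction.
    if x_i is not None and y_j is not None:
        score += float(np.dot(x_i, y_j))
    # Latent (bias-like) factors and a global offset remain usable in cold-start cases.
    return score + zeta_u_i + zeta_p_j + mu

# Hypothetical usage: an item whose reviews have been removed (content-free prediction).
d = 8
rng = np.random.default_rng(0)
u_i, p_j = rng.normal(size=d), rng.normal(size=d)
print(predict_rating(u_i, p_j, x_i=None, y_j=None, zeta_u_i=0.1, zeta_p_j=-0.05, mu=3.9))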
Fig. 10. Rating prediction performance (MSE) when removing reviews.
Fig. 11. Rating prediction performance (MSE) when varying the threshold on questions covered by big clusters.
Using BERT as Text Encoder. Here we investigate whether using another text encoder, such as BERT, can further enhance the overall performance. Table 9 reports the performance of variants of QuestER that differ in their text encoder (the default is TextCNN), including 6 versions of the small BERT model from TF Hub with 128 hidden dimensions, from L-2 (2 Transformer blocks) to L-12 (12 Transformer blocks); a sketch of how such an encoder could be plugged in follows the table. Unsurprisingly, a larger text encoder consumes more training time. Using BERT as the text encoder does enhance the rating prediction performance. However, it does not clearly enhance the explanation performance in terms of text alignment, review-level explanation, or question-level explanation.
Text Encoder | Train (Seconds) | MSE    | Text Alignment     | Review-Level Explanation  | Question-Level Explanation
             |                 |        | ROUGE-L | METEOR   | F1@5  | ROUGE-L | METEOR  | F1@5  | ROUGE-L | METEOR
TextCNN      | 334             | 0.8174 | 7.44    | 11.16    | 0.230 | 25.88   | 30.13   | 0.179 | 20.57   | 23.43
BERT (L-2)   | 61,164          | 0.7861 | 7.27    | 10.75    | 0.237 | 26.72   | 30.78   | 0.175 | 19.18   | 22.11
BERT (L-4)   | 66,291          | 0.7915 | 7.29    | 10.73    | 0.240 | 25.77   | 30.06   | 0.171 | 18.94   | 21.71
BERT (L-6)   | 73,332          | 0.7909 | 7.32    | 10.91    | 0.233 | 25.47   | 29.68   | 0.178 | 17.98   | 21.14
BERT (L-8)   | 74,587          | 0.7929 | 7.33    | 10.89    | 0.240 | 25.96   | 30.45   | 0.176 | 20.16   | 22.26
BERT (L-10)  | 76,953          | 0.7767 | 7.48    | 11.08    | 0.235 | 23.96   | 28.52   | 0.176 | 18.14   | 21.06
BERT (L-12)  | 79,146          | 0.7892 | 7.31    | 10.69    | 0.233 | 25.96   | 30.14   | 0.174 | 15.45   | 17.63
Table 9. The Overall Performance of QuestER Using BERT as Text Encoder on Musical Data
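As a minimal sketch (assuming the small BERT checkpoints and the matching preprocessing model published on TF Hub; this is not our exact training code), a pooled BERT representation could stand in for the TextCNN document vector roughly as follows.

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the BERT preprocessing model

# Assumed TF Hub handles for the smallest (L-2, H-128) small BERT variant.
PREPROCESS_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_URL = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2"

def build_text_encoder():
    # Maps a batch of raw review/question strings to fixed-size text embeddings.
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
    preprocess = hub.KerasLayer(PREPROCESS_URL, name="preprocessing")
    encoder = hub.KerasLayer(ENCODER_URL, trainable=True, name="bert_encoder")
    outputs = encoder(preprocess(text_input))
    # The pooled output (128-dimensional here) replaces the TextCNN document vector.
    return tf.keras.Model(text_input, outputs["pooled_output"])

text_encoder = build_text_encoder()
embeddings = text_encoder(tf.constant(["Does this stand fit a classical guitar?"]))
print(embeddings.shape)  # (1, 128)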

5 Conclusion

QuestER is a framework for incorporating QA pairs into review-based recommendation explanations. We model QA in an attention mechanism to identify more useful reviews. Through joint modeling, we can collectively form an explanation in terms of a QA and a review. Comprehensive experiments on various product categories show that the QA and the review that are part of a collective explanation are more coherent with each other than the pairings found by the baselines. Review-level and question-level explanations identified by QuestER are also more consistent with top-rated ones based on helpfulness votes than those identified by the baselines. User studies further support that incorporating questions as part of a recommendation explanation is useful.

Footnotes

5. The collected data are available at https://github.com/PreferredAI/QuestER
6. When keeping all clusters (the threshold is 1), the General Question consists of all the centroid questions.

References

[1]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, USA, 65–72.
[2]
Rose Catherine and William Cohen. 2017. TransNets: Learning to transform for recommendation. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys ’17). ACM, New York, NY, 288–296. DOI:
[3]
Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1583–1592.
[4]
Long Chen, Ziyu Guan, Qibin Xu, Qiong Zhang, Huan Sun, Guangyue Lu, and Deng Cai. 2020. Question-driven purchasing propensity analysis for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 35–42. DOI:
[5]
Long Chen, Ziyu Guan, Wei Zhao, Wanqing Zhao, Xiaopeng Wang, Zhou Zhao, and Huan Sun. 2019. Answer identification from product reviews for user questions by multi-task attentive networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI ’19/IAAI ’19/EAAI ’19), Vol. 33. AAAI Press, 45–52. DOI:
[6]
Zhongxia Chen, Xiting Wang, Xing Xie, Tong Wu, Guoqing Bu, Yining Wang, and Enhong Chen. 2019. Co-attentive multi-task learning for explainable recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI ’19). International Joint Conferences on Artificial Intelligence Organization, 2137–2143. DOI:
[7]
Dawei Cong, Yanyan Zhao, Bing Qin, Yu Han, Murray Zhang, Alden Liu, and Nat Chen. 2019. Hierarchical attention based neural network for explainable recommendation. In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR ’19). ACM, New York, NY, 373–381. DOI:
[8]
Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’14). ACM, New York, NY, 193–202. DOI:
[9]
Gerardo Ocampo Diaz and Vincent Ng. 2018. Modeling and prediction of online product review helpfulness: A survey. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 698–708.
[10]
Xin Dong, Jingchao Ni, Wei Cheng, Zhengzhang Chen, Bo Zong, Dongjin Song, Yanchi Liu, Haifeng Chen, and Gerard de Melo. 2020. Asymmetrical hierarchical networks with attentive interactions for interpretable review-based recommendation. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (Apr. 2020), 7667–7674. DOI:
[11]
Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, et al. 2021. Pre-trained models: Past, present and future. AI Open 2 (2021), 225–250.
[12]
Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, 507–517.
[13]
Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM ’15). ACM, New York, NY, 1661–1670. DOI:
[14]
Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
[15]
Chunli Huang, Wenjun Jiang, Jie Wu, and Guojun Wang. 2020. Personalized review recommendation based on users’ aspect sentiment. ACM Transactions on Internet Technology 20, 4, Article 42 (Oct. 2020), 26 pages. DOI:
[16]
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics 2 (2017), 427–431. Retrieved from https://aclanthology.org/E17-2068
[17]
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations. Retrieved from http://arxiv.org/abs/1412.6980
[18]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[19]
J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174. Retrieved from http://www.jstor.org/stable/2529310
[20]
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. PMLR, 1188–1196.
[21]
Trung-Hoang Le and Hady W. Lauw. 2021. Explainable recommendation with comparative constraints on product aspects. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM ’21). ACM, New York, NY, 967–975. DOI:
[22]
Trung-Hoang Le and Hady W. Lauw. 2022. Question-attentive review-level recommendation explanation. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data). IEEE, 756–761.
[23]
Lei Li, Yongfeng Zhang, and Li Chen. 2020. Generate neural template explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM ’20). ACM, New York, NY, 755–764. DOI:
[24]
Lei Li, Yongfeng Zhang, and Li Chen. 2021. Personalized transformer for explainable recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1: Long Papers. Association for Computational Linguistics, Online, 4947–4957. DOI:
[25]
Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). ACM, New York, NY, 345–354. DOI:
[26]
Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL ’03), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, 71–78. DOI:
[27]
Han Liu, Yangyang Guo, Jianhua Yin, Zan Gao, and Liqiang Nie. 2022. Review polarity-wise recommender. IEEE Transactions on Neural Networks and Learning Systems (2022). Retrieved from https://arxiv.org/abs/2106.04155
[28]
Hongtao Liu, Yian Wang, Qiyao Peng, Fangzhao Wu, Lin Gan, Lin Pan, and Pengfei Jiao. 2020. Hybrid neural recommendation with joint deep representation learning of ratings and reviews. Neurocomputing 374 (2020), 77–85.
[29]
Yichao Lu, Ruihai Dong, and Barry Smyth. 2018. Coevolutionary recommendation model: Mutual learning between ratings and reviews. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 773–782. DOI:
[30]
Yue Lu, Panayiotis Tsaparas, Alexandros Ntoulas, and Livia Polanyi. 2010. Exploiting social context for review quality prediction. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10). ACM, New York, NY, 691–700. DOI:
[31]
Lionel Martin and Pearl Pu. 2014. Prediction of helpful reviews using emotions extraction. In Proceedings of the AAAI Conference on Artificial Intelligence 28 (2014), 1551–1557.
[32]
Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, NY, 165–172. DOI:
[33]
Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web (WWW ’16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 625–635. DOI:
[34]
Sicheng Pan, Dongsheng Li, Hansu Gu, Tun Lu, Xufang Luo, and Ning Gu. 2022. Accurate and explainable recommendation via review rationalization. In Proceedings of the ACM Web Conference 2022 (WWW ’22). ACM, New York, NY, 3092–3101. DOI:
[35]
Michael J. Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The Adaptive Web: Methods and Strategies of Web Personalization. P. Brusilovsky, A. Kobsa, W. Nejdl (Eds.), Vol. 4321. Springer, Berlin, Heidelberg, 325–341. DOI:
[36]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing (EMNLP), 1532–1543.
[37]
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237. DOI:
[38]
Pearl Pu, Li Chen, and Rong Hu. 2011. A user-centric evaluation framework for recommender systems. In Proceedings of the 5th ACM Conference on Recommender Systems, 157–164.
[39]
Zhaochun Ren, Shangsong Liang, Piji Li, Shuaiqiang Wang, and Maarten de Rijke. 2017. Social collaborative viewpoint regression with explainable recommendations. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, 485–494. DOI:
[40]
Noveen Sachdeva and Julian McAuley. 2020. How useful are reviews for recommendation? A critical review and potential improvements. ACM, New York, NY, 1845–1848. DOI:
[41]
Sunil Saumya, Jyoti Prakash Singh, and Yogesh K. Dwivedi. 2020. Predicting the helpfulness score of online reviews using convolutional neural network. Soft Computing 24, 15 (2020), 10989–11005.
[42]
Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys ’17). ACM, New York, NY, 297–305. DOI:
[43]
Yunzhi Tan, Min Zhang, Yiqun Liu, and Shaoping Ma. 2016. Rating-boosted latent topics: Understanding users and items with ratings and reviews. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16). AAAI Press, 2640–2646.
[44]
Jiliang Tang, Huiji Gao, Xia Hu, and Huan Liu. 2013. Context-aware review helpfulness rating prediction. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, NY, 1–8. DOI:
[45]
Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Multi-pointer co-attention networks for recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, New York, NY, 2309–2318. DOI:
[46]
Quoc-Tuan Truong and Hady Lauw. 2019. Multimodal review generation for recommender systems. In Proceedings of the World Wide Web Conference (WWW ’19). ACM, New York, NY, 1864–1874. DOI:
[47]
Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. 2018. Explainable recommendation via multi-task learning in opinionated text data. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, 165–174. DOI:
[48]
Chuhan Wu, Fangzhao Wu, Junxin Liu, and Yongfeng Huang. 2019. Hierarchical user and item representation with three-tier attention for recommendation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers), 1818–1826.
[49]
Chuhan Wu, Fangzhao Wu, Tao Qi, Suyu Ge, Yongfeng Huang, and Xing Xie. 2019. Reviews meet graphs: Enhancing user and item representations for recommendation with hierarchical attentive graph neural network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 4884–4893. DOI:
[50]
Zhouhang Xie, Sameer Singh, Julian McAuley, and Bodhisattwa Prasad Majumder. 2023. Factual and informative review generation for explainable recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 13816–13824.
[51]
Qian Yu and Wai Lam. 2018. Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, New York, NY, 691–699. DOI:
[52]
Yongfeng Zhang and Xu Chen. 2020. Explainable recommendation: A survey and new perspectives. Foundations and Trends in Information Retrieval 14, 1 (2020), 1–101.
[53]
Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’14). ACM, New York, NY, 83–92. DOI:
[54]
Jie Zhao, Ziyu Guan, and Huan Sun. 2019. Riker: Mining rich keyword representations for interpretable product question answering. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, 1389–1398. DOI:
[55]
Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, 425–434. DOI:
