
DIRECT: Dual Interpretable Recommendation with Multi-aspect Word Attribution

Published: 17 October 2024

Abstract

Recommending products to users with intuitive explanations helps improve system transparency, persuasiveness, and user satisfaction. Existing interpretation techniques include post hoc methods and interpretable modeling. The former can quantitatively analyze input contributions to model predictions but has limited interpretation faithfulness, while the latter can explain model internal mechanisms but may not directly attribute model predictions to input features. In this study, we propose a novel Dual Interpretable Recommendation model called DIRECT, which integrates ideas from the two interpretation categories to inherit their advantages and avoid their limitations. Specifically, DIRECT makes use of item descriptions as explainable evidence for recommendation. First, similar to post hoc interpretation, DIRECT can attribute the prediction of a user preference score to textual words of the item descriptions. The attribution of each word is related to its sentiment polarity and word importance, where a word is important if it corresponds to an item aspect that the user is interested in. Second, to improve the interpretability of the embedding space, we propose to extract high-level concepts from embeddings, where each concept corresponds to an item aspect. To learn discriminative concepts, we employ a concept bottleneck layer and maximize the coding rate reduction on word-aspect embeddings by leveraging a word–word affinity graph extracted from a pre-trained language model. In this way, DIRECT simultaneously achieves faithful attribution and a usable interpretation of the embedding space. We also show that DIRECT achieves linear inference time complexity with respect to the length of item reviews. We conduct experiments, including ablation studies, on five real-world datasets. Quantitative analysis, visualizations, and case studies verify the interpretability of DIRECT. Our code is available at: https://github.com/JacksonWuxs/DIRECT.

1 Introduction

Recommender systems help users access items matching their preferences. While deep learning significantly improves recommendation accuracy, increasing the transparency of recommender systems to users has also become a recent trend [10]. Interpretable recommender systems [58, 64] attempt to generate both accurate predictions and intuitive explanations. However, the two goals often sit on opposite sides of a seesaw: by representing features in uninterpretable high-dimensional spaces to pursue better accuracy, deep learning models sacrifice system transparency.
There are two main categories of techniques for creating interpretable recommender systems. The first category is post hoc explanation, which seeks to understand how predictions are made after a model is trained. Typical techniques include gradient-based [43, 45, 57], path-based [41, 60], and perturbation-based methods [31, 50]. However, it has been pointed out that post hoc methods may not always faithfully explain the exact inference mechanism [20, 64]. The second category is to build inherently interpretable models. Typical techniques are attention models [8, 11, 40, 51, 52] and disentangled representation learning [12, 25, 32, 54]. Attention weights shed light on how information propagates over user–item interaction graphs, while disentangled factors unravel the global distribution of representations. The attention weights or disentangled factors can help understand certain aspects of the model inference process, but this intermediate information is not directly associated with the output, such as prediction scores. This differs from post hoc interpretations like Shapley Values [31] or Counterfactual Analysis [45], which directly decompose the output and quantitatively attribute it back to input features. To satisfy both performance and transparency, recent studies [23, 68] suggest letting neural networks map inputs into a human-understandable latent space and then applying a linear transformation from this space to the target label set, an approach known as the Generalized Additive Model (GAM) [16]. However, GAM-based approaches often require the involvement of experts to define the latent space, which can limit the learning capabilities of deep learning models.
Considering the natural readability of textual user–item reviews, we propose using reviews as a latent space for interpretable recommendation: this space is easily understood by humans and carries rich semantic information for predicting user preferences. There are several challenges in building interpretable recommendation models with review information. First, how can we design a model that preserves the advantages of both post hoc interpretation and inherently interpretable modeling? Second, how can we utilize the capacity of advanced language models while preserving recommendation interpretability? Third, since many users do not write reviews, how can we overcome the sparsity issue?
To address the challenges, we propose a novel review-based interpretable recommendation model for user–item rating prediction. First, we design the rating function as the summation of attribution scores of review words. This allows both quantitative and intuitive attribution to the input reviews as post hoc interpretations. Second, we employ a concept bottleneck layer [21, 23] to map review words into interpretable features before the output, where each feature corresponds to an aspect of items. Different from existing concept bottleneck models that require domain experts to design the features, our model automatically discovers these features from data.
The above designs consider the attributable model predictions and human-understandable model mechanisms, so we name our model Dual Interpretable RECommendaTion (DIRECT).
Third, we extract a word–word affinity graph from Pre-Trained Language Models (PLMs) and design an end-to-end solution that leverages the implicit community structure in the graph to guarantee that non-trivial aspects are discovered from word representations. This allows us to effectively utilize PLMs without harming model interpretability. Fourth, we jointly model user reviews and shopping history to merge the two information modalities. This allows our model to express user interests with reviews written by other customers. Furthermore, we show how to reduce the time complexity of online model inference. The contributions of this work are summarized below.
We propose a novel review-based interpretable recommender system called DIRECT. It inherits the advantages of both attribution-based interpretation and interpretable modeling mechanism to achieve a transparent decision-making process.
We propose a novel objective that encourages the model to learn discriminative aspects.
Experiments on real-world datasets validate the effectiveness of DIRECT. Visualization and case studies demonstrate the interpretability of our model.

2 Problem Statement

Notations. In this work, we use boldface lowercase letters (e.g., \(\boldsymbol{x}\)) to denote vectors, boldface uppercase letters (e.g., \(\boldsymbol{A}\)) to denote matrices, and calligraphic capital letters (e.g., \(\mathcal{D}\)) to denote sets. Specifically, we use \(\mathcal{U}\) and \(\mathcal{I}\) to denote the user set and item set, respectively. The interactions between users and items are stored in a rating matrix \(\boldsymbol{R}\in\mathbb{R}^{|\mathcal{U}|\times|\mathcal{I}|}\), where each element \(r_{u,i}\) indicates the rating score of user \(u\in\mathcal{U}\) for item \(i\in\mathcal{I}\). In review-based recommendation systems, we also assume that an \(M\)-word review is available for some rating actions, denoted by \(\mathcal{T}_{u,i}=[w_{u,i}^{1},...,w_{u,i}^{m},...,w_{u,i}^{M}]\), where \(w_{u,i}^{m}\in\mathcal{V}\) indicates the \(m\)th word of the review posted by user \(u\) for item \(i\), and \(\mathcal{V}\) is a pre-defined vocabulary. Besides the rating-level review, we also represent each user \(u\) (or item \(i\)) with a summarized document \(\mathcal{D}_{u}=[w_{u}^{1},...,w_{u}^{l},...,w_{u}^{L}]\) (or \(\mathcal{D}_{i}=[w_{i}^{1},...,w_{i}^{l},...,w_{i}^{L}]\)), where \(L\) is the document length. In practice, each document \(\mathcal{D}_{u}\) (or \(\mathcal{D}_{i}\)) is obtained by concatenating all the observed reviews written by the user (or written about the item). These settings have been widely adopted in existing review-based recommender systems [9, 67].
Problem Definition. The goal of this work is to build an interpretable model \(f\) to predict the preference score \(\hat{r}_{u,i}=f(u,i,\mathcal{D}_{u},\mathcal{D}_{i})\) of user \(u\) on item \(i\). In real-world scenarios, the review \(\mathcal{T}_{u,i}\) is not available for a target user–item pair \((u,i)\), so we always delete the current review from the user and item document during training.

3 Proposed Method

We now introduce the proposed interpretable recommendation model with user reviews. First, in Section 3.1, we formally define the “interpretability” considered in this work. Then, we describe the architecture design of our multi-aspect recommendation model in Sections 3.2 and 3.3. After that, in Section 3.4, we propose a training loss to guarantee the distinction among different aspects. We then introduce the overall objective to train our model in Section 3.5. Finally, we discuss the time complexity of model inference in Section 3.6.

3.1 Interpretability of Recommender Systems

We consider two levels of interpretability in designing the proposed recommender system. First, similar to the post hoc methods, we want prediction scores to be attributable, where several requirements are as below.
Attributable prediction: Let \(\boldsymbol{x}\) denote the input, \(x_{i}\) denote the \(i\)th feature, and \(f(\boldsymbol{x})\) be the prediction for \(\boldsymbol{x}\). The prediction is attributable if we can design an interpretation method \(intp()\), where \(intp(f,\boldsymbol{x},x_{i})\) returns the attribution score of \(x_{i}\) for \(f(\boldsymbol{x})\). We regard \(x_{i}\) as more important than \(x_{j}\) if \(|intp(f,\boldsymbol{x},x_{i})| {\gt} |intp(f,\boldsymbol{x},x_{j})|\). Commonly used attribution methods include raw gradient interpretation [43] and attention scores [8, 11, 51, 52].
Measurable attribution: An attribution is measurable if
\begin{align}f(\boldsymbol{x})\approx\sum\nolimits_{i}intp(f,\boldsymbol{x},x_{i}).\end{align}
(1)
The measurability further requires attributions to compose the prediction value. It thus makes interpretation a quantitative analytic tool. Commonly used measurable attribution methods include integrated gradients [45] and GAMs [3]. Interpretation methods in the previous category, such as attention scores, do not have this property.
Comprehensible attribution: An attribution is comprehensible if it is easy for humans to understand the meaning of each \(x_{i}\). For example, each pixel in an image is hardly comprehensible, while objects are more comprehensible [26]. In recommender systems, we regard words in user reviews as comprehensible.
The second level of interpretability refers to a more inherently understandable model mechanism. State-of-the-art neural networks are typically designed to map inputs (e.g., nodes and texts) through complicated interactions to targets (e.g., labels). Such a prediction scheme, however, does not match human cognitive habits, which rely on high-level concepts or aspects. To bridge the gap, we introduce the idea of the concept bottleneck [23] for building interpretable recommendation models. Specifically, the model consists of two parts. The first part \(f_{1}:\mathcal{X}\rightarrow\mathcal{K}\) maps input to the concept space. The second part \(f_{2}:\mathcal{K}\rightarrow\mathcal{Y}\) makes the final predictions based on the concepts. Thus, given an input \(\boldsymbol{x}\), its concept activation is denoted as \(f_{1}(\boldsymbol{x})\in\mathbb{R}^{K}\), where \(K\) is the number of concepts. Then, a prediction is made as \(\hat{y}=f_{2}(f_{1}(\boldsymbol{x}))\). In traditional concept bottleneck models [21, 23, 68], the concepts are provided by domain experts, and the models are trained to fit both the concept labels and the prediction label \(y\). In our problem, however, the concepts are not pre-defined and must be discovered from data. It is also worth noting that, although some neural recommendation models use linear functions or factorization machines [38] as their scoring functions, they do not fully match our definition of interpretability. For example, DeepCoNN [67] uses a dot product as its scoring function over user embeddings and item embeddings. However, the embeddings are generated by a TextCNN model [7], whose latent space dimensions are not interpretable.
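To make this two-stage design concrete, the following minimal PyTorch sketch illustrates a concept bottleneck with a linear prediction head. The class name, layer sizes, and sigmoid activation are our own illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Minimal sketch: map input x to K concept activations, then predict linearly."""
    def __init__(self, input_dim: int, num_concepts: int):
        super().__init__()
        # f1: X -> K, concept activations squashed into [0, 1]
        self.f1 = nn.Sequential(nn.Linear(input_dim, num_concepts), nn.Sigmoid())
        # f2: K -> Y, linear so each concept's contribution to the output is readable
        self.f2 = nn.Linear(num_concepts, 1)

    def forward(self, x):
        concepts = self.f1(x)
        return self.f2(concepts), concepts

model = ConceptBottleneck(input_dim=512, num_concepts=5)
y_hat, concepts = model(torch.randn(2, 512))  # inspect `concepts` to interpret y_hat
```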

3.2 Model Architecture

We introduce the architecture of our model in this section. Figure 1 presents the overall framework of our method. The general idea is to predict user preference for an item by attributing the preference score to each of the words in item reviews. The score of each word is the product of two factors: (1) the sentiment of the word and (2) the degree of user interest in the item's aspect described by the word. Specifically, let \(w^{l}_{i}\in\mathcal{D}_{i}\) denote the \(l\)th word in the reviews of the \(i\)th item. The preference of user \(u\) for item \(i\) is predicted as
\begin{align}\hat{r}_{u,i}=\frac{1}{L}\sum_{l=1}^{L}sentiment(w^{l}_{i})\times gate_{u}(w^{ l}_{i})+b_{i}+b_{u}+b_{g},\end{align}
(2)
where \(b_{i}\), \(b_{u}\), and \(b_{g}\) are trainable item, user, and global bias terms, respectively, and \(L=|\mathcal{D}_{i}|\). Here we design a sentiment analysis module \(sentiment:\mathbb{R}^{h_{1}}\rightarrow[-1,1]\) indicating the sentiment of word \(w^{l}_{i}\), and an aspect-interest gate function \(gate_{u}:\mathbb{R}^{h_{1}}\rightarrow[0,1]\) computing the probability that word \(w^{l}_{i}\) describes an item aspect that user \(u\) is interested in. Meanwhile, a word representation module serves as one of the foundations of our review-based system. The details of each module are introduced below.
Fig. 1.
Fig. 1. The overall framework of the proposed interpretable review-based recommendation system.
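As a minimal illustration of Equation (2), the sketch below assumes the per-word sentiment scores and gate values have already been produced by the modules described in the following subsections; all tensor names are illustrative.

```python
import torch

def predict_rating(sentiment, gate, b_i, b_u, b_g):
    """Equation (2): mean of per-word sentiment x gate, plus bias terms.

    sentiment: (L,) values in [-1, 1]; gate: (L,) values in [0, 1]; biases are scalars.
    """
    return (sentiment * gate).mean() + b_i + b_u + b_g

# toy example with a three-word item document
sentiment = torch.tensor([0.9, -0.4, 0.2])
gate = torch.tensor([0.8, 0.1, 0.5])
r_hat = predict_rating(sentiment, gate, b_i=0.1, b_u=-0.05, b_g=3.5)
```

Because the score is a plain sum over words, the contribution of each word (its sentiment times its gate value, divided by \(L\)) is directly readable from the model output.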

3.2.1 Word Representation Module.

To obtain high-quality word embeddings with rich semantic information to support the other modules, we collect contextual word embeddings using PLMs. This module takes reviews as input and returns an \({h_{1}}\)-dimensional embedding \(\boldsymbol{e}\) for each word token. Formally,
\begin{align}[\boldsymbol{e}^{0},\boldsymbol{e}^{1},\boldsymbol{e}^{2},...,\boldsymbol{e}^{L}]=\text{PLM}([w^{0},w^{1},w^{2},...,w^{L}]),\end{align}
(3)
where \(\boldsymbol{e}^{l}\) denotes the contextual embedding of word \(w^{l}\). We insert a special token \(w^{0}=[CLS]\) at the beginning of each input word sequence. The contextual embedding of \([CLS]\) is used as the representation of the whole sequence. If the input is a user review document \(\mathcal{D}_{u}\) or an item review document \(\mathcal{D}_{i}\), then \(\boldsymbol{e}^{0}\) can be treated as the overall representation of the user or item from the perspective of their review information.
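A minimal sketch of this module using the HuggingFace transformers library is shown below; the checkpoint name is one public BERT-small model and is our own assumption, used here only as a stand-in for the PLM.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "prajjwal1/bert-small" is one public BERT-small checkpoint (hidden size 512);
# treat it as an illustrative stand-in for the PLM used in the paper.
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-small")
plm = AutoModel.from_pretrained("prajjwal1/bert-small")

review = "Their pizzas are full of flavor and have a crispy crust."
inputs = tokenizer(review, return_tensors="pt")   # prepends [CLS] automatically
with torch.no_grad():
    hidden = plm(**inputs).last_hidden_state      # (1, sequence length, h1)
e_cls, e_words = hidden[:, 0], hidden[:, 1:]      # sequence-level vs. word-level embeddings
```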

3.2.2 Sentiment Analysis Module.

The sentiment analysis module predicts the sentiment polarity of each word in item reviews. Considering that word-level sentiment analysis is a relatively simple task given high-quality pre-trained contextual word embeddings, we implement it as a lightweight Multi-Layer Perceptron (MLP) that maps each word embedding to a scalar
\begin{align}sentiment(w^{l}_{i})=\text{tanh}(\text{MLP}_{1}(\boldsymbol{e}_{i}^{l})),\end{align}
(4)
where \(\boldsymbol{e}_{i}^{l}\) denotes the embedding of word \(w^{l}_{i}\), and \(\text{MLP}_{1}:\mathbb{R}^{h_{1}}\rightarrow\mathbb{R}\).
Here, the tanh function returns a value between \(-1\) and 1, where a greater value indicates a stronger positive signal. Since the final estimated score is a weighted sum of these sentiment scores, the estimation errors between the ground-truth and predicted ratings directly guide the learning of the sentiment analysis module.
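A minimal sketch of this module, assuming the one-layer MLP from the complexity analysis in Section 3.6:

```python
import torch.nn as nn

class SentimentModule(nn.Module):
    """Equation (4): contextual word embedding -> sentiment polarity in [-1, 1]."""
    def __init__(self, h1: int):
        super().__init__()
        self.mlp1 = nn.Linear(h1, 1)  # one-layer MLP, per Section 3.6's assumption

    def forward(self, e):             # e: (L, h1) contextual word embeddings
        return self.mlp1(e).squeeze(-1).tanh()
```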

3.2.3 Aspect-Guided Interest Gate.

A user usually evaluates a product from several aspects. A review segment is useful to the user if it describes the aspects that the user is interested in. Otherwise, the content of that segment will be ignored by the user. For example, consider a restaurant review: “Their pizzas are full of flavor and have a crispy crust, but it is far away, by the way.” Here the word “pizza” relates to the Food aspect of the restaurant with positive sentiment, while “far away” reflects the Location aspect with negative sentiment. This review will only be noticed by users who care about these aspects of a restaurant. For example, a gourmet will focus on the sentiment score of the pizza in this review and ignore the comments on the restaurant's location, while a student without a car may not choose this restaurant due to the negative comment on the location.
Following this intuition, we design the Aspect-Guided Interest Gate (AGIG) to quantify the degree to which a word \(w_{i}^{l}\in\mathcal{D}_{i}\) falls in the aspects of user \(u\)'s interest. We assume all items in a domain share \(K\) aspects that users might care about, such as Price, Service, Location, and Food in restaurant recommendation. The AGIG module plays the same role as the concept bottleneck layer introduced before, where each concept corresponds to an aspect of items. Formally, we assign each aspect \(k\) an \(h_{2}\)-dimensional vector \(\boldsymbol{a}_{k}\in\mathbb{R}^{h_{2}}\) and stack the \(K\) aspects into an aspect matrix \(\boldsymbol{A}\in\mathbb{R}^{K\times h_{2}}\). Suppose we have generated a user embedding \(\boldsymbol{z}_{u}\in\mathbb{R}^{h_{3}}\) for user \(u\) (see details in Section 3.3); then we define the AGIG as
\begin{align}gate_{u}(\boldsymbol{e}_{i}^{l})=\sum_{k=1}^{K}P(\boldsymbol{a}_{k}|\boldsymbol{e}_{i}^{l})\times P(\boldsymbol{a}_{k}|\boldsymbol{z}_{u}),\end{align}
(5)
where
\begin{align}0\leq P(\boldsymbol{a}_{k}|\boldsymbol{z}_{u})\leq 1,\quad\sum_{k=1}^{K}P(\boldsymbol{a}_{k}|\boldsymbol{e}_{i}^{l})=1.\end{align}
(6)
Here, \(P(\boldsymbol{a}_{k}|\boldsymbol{e}_{i}^{l})\) is the probability that word \(w_{i}^{l}\) mentions the aspect \(\boldsymbol{a}_{k}\) with its contextual word embedding \(\boldsymbol{e}_{i}^{l}\), and \(P(\boldsymbol{a}_{k}|\boldsymbol{z}_{u})\) is the probability that user \(u\) is interested in aspect \(\boldsymbol{a}_{k}\). The former probability is user-independent, while the latter is user-specific. For a word to contribute to the recommendation, it must express sufficient information on the aspects that the user cares about.
We estimate the distribution \(P(\boldsymbol{a}_{k}|\boldsymbol{e}_{i}^{l})\) of a word \(w_{i}^{l}\in\mathcal{D}_{i}\) on different aspects \(\{\boldsymbol{a}_{k}\}^{K}_{k=1}\) as below
\begin{align}P(\boldsymbol{a}_{k}|\boldsymbol{e}_{i}^{l})=\frac{\exp{(\boldsymbol{q}_{i}^{l}\cdot\boldsymbol{a}_{k}^{\top}) }}{\sum^{K}_{k^{\prime}=1}\exp{(\boldsymbol{q}_{i}^{l}\cdot\boldsymbol{a}_{k^{\prime}}^{\top}) }},\end{align}
(7)
where \(\boldsymbol{q}_{i}^{l}=\text{MLP}_{2}(\boldsymbol{e}_{i}^{l}),\) and \(\text{MLP}_{2}:\mathbb{R}^{h_{1}}\rightarrow\mathbb{R}^{h_{2}}\) bridges between the word embedding space and the aspect embedding space. Here, \(\boldsymbol{q}_{i}^{l}\in\mathbb{R}^{h_{2}}\) is called word-aspect embedding.
We compute the distribution \(P(\boldsymbol{a}_{k}|\boldsymbol{z}_{u})\) of user interests on different aspects \(\{\boldsymbol{a}_{k}\}^{K}_{k=1}\) as below
\begin{align}P(\boldsymbol{a}_{k}|\boldsymbol{z}_{u})=\sigma(\text{MLP}_{3}(\boldsymbol{z}_{u})\cdot\boldsymbol{a}_{k}^{ \top}),\end{align}
(8)
where \(\text{MLP}_{3}:\mathbb{R}^{h_{3}}\rightarrow\mathbb{R}^{h_{2}}\) aligns the user embedding space with the aspect embedding space, and \(\sigma\) is the sigmoid activation function.
In the above, Equation (7) measures the correlation between the word-aspect embeddings \(\{\boldsymbol{q}_{i}^{l}\}\) and the aspect embeddings \(\{\boldsymbol{a}_{k}\}^{K}_{k=1}\). This is similar to K-means clustering in gradient-descent form [1, 39], where each centroid is optimized to minimize its distances to nearby data points. Since one of the main requirements of DIRECT for good recommendation quality is to correctly predict the aspects reflected by each word, the distance between words and their closest aspects can be minimized. However, different from traditional K-means algorithms, where the input data samples are fixed, we aim to identify topical aspects from word-aspect embeddings that are themselves trainable. Directly optimizing both \(\{\boldsymbol{q}_{i}^{l}\}\) and \(\{\boldsymbol{a}_{k}\}^{K}_{k=1}\) via gradient descent could cause model collapse. To avoid this problem, in Section 3.4, we introduce further constraints on the aspect distribution to produce diverse aspects.
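Putting Equations (5)–(8) together, a minimal PyTorch sketch of the AGIG module might look as follows; parameter initialization and the residual aspect of Section 3.4.3 are simplified, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectGuidedInterestGate(nn.Module):
    """Equations (5)-(8): gate_u(e) = sum_k P(a_k | e) * P(a_k | z_u)."""
    def __init__(self, h1, h2, h3, num_aspects):
        super().__init__()
        self.aspects = nn.Parameter(torch.randn(num_aspects, h2))  # aspect matrix A
        self.mlp2 = nn.Linear(h1, h2)  # word embedding space -> aspect embedding space
        self.mlp3 = nn.Linear(h3, h2)  # user embedding space -> aspect embedding space

    def forward(self, e, z_u):         # e: (L, h1) word embeddings; z_u: (h3,) user embedding
        q = self.mlp2(e)                                           # (L, h2) word-aspect embeddings
        p_word = F.softmax(q @ self.aspects.t(), dim=-1)           # Eq. (7): sums to 1 over aspects
        p_user = torch.sigmoid(self.mlp3(z_u) @ self.aspects.t())  # Eq. (8): each value in [0, 1]
        return p_word @ p_user                                     # Eq. (5): (L,) gate values
```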

3.3 Learning User Representations

Effective user representations \(\{\boldsymbol{z}_{u}\}\) are the key to personalized recommendation in our system. Modeling user–item interactions is a popular direction for generating user embeddings [24, 61, 53]. However, the user interests hidden in historical interactions are implicit. A straightforward alternative is generating user embeddings from their reviews [27, 63, 67], where user preferences are explicit. But this raises another issue: many reviews are biased, sparse, and incomprehensive, because most users only write reviews when they feel items are particularly bad or beyond expectations.
To fill the gap, we combine user shopping histories with the review information. The shopping history of user \(u\) is denoted as \(\mathcal{I}_{u}=\{i_{1},...,i_{t},...,i_{T}\}\), \(i_{t}\in\mathcal{I}\), a \(T\)-size set storing the items purchased by the user. We then design a fusion network to generate the final user representations.

3.3.1 Representing Users with Sequences.

Both user reviews and shopping history are processed as sequences of \(h_{1}\)-dimensional embeddings. We use a self-attention module to aggregate the two sequences into two vectors \(\boldsymbol{z}_{u,d}\) and \(\boldsymbol{z}_{u,h}\), and then use the fusion network to merge them into \(\boldsymbol{z}_{u}\). First, given the user review document \(\mathcal{D}_{u}\) and its contextual word embeddings \([\boldsymbol{e}_{u}^{1},...,\boldsymbol{e}_{u}^{l},...,\boldsymbol{e}_{u}^{L}]\) from the PLM, we aggregate them into a single embedding \(\boldsymbol{z}_{u,d}\in\mathbb{R}^{h_{1}}\)
\begin{align}\boldsymbol{z}_{u,d}=\text{AGGR}([\boldsymbol{e}_{u}^{1},...,\boldsymbol{e}_{u}^{L}],\boldsymbol{e}_{u}^{0})= \sum_{l=1}^{L}\alpha_{l}\boldsymbol{e}_{u}^{l},\end{align}
(9)
where \(\boldsymbol{e}_{u}^{0}\) acts as the query for self-attention
\begin{align}\alpha_{l}=\frac{\exp(\tilde{\boldsymbol{e}}_{l})}{\sum_{l^{\prime}=1}^{L }\exp(\tilde{\boldsymbol{e}}_{l^{\prime}})}, \tilde{\boldsymbol{e}}_{l}=\lambda_{0}\cdot \tanh(\boldsymbol{e}_{u}^{0}\cdot FC(\boldsymbol{e}_{u}^{l})^{\top}).\end{align}
(10)
Then, given the user shopping history \(\mathcal{I}_{u}\) and the history embedding sequence \([\boldsymbol{e}_{i_{1}}^{0},...,\boldsymbol{e}_{i_{t}}^{0},...,\boldsymbol{e}_{i_{T}}^{0}]\), we adopt a similar aggregation function to generate a single user history embedding \(\boldsymbol{z}_{u,h}\in\mathbb{R}^{h_{1}}\), where \(\boldsymbol{z}_{u,h}=\text{AGGR}([\boldsymbol{e}_{i_{1}}^{0},...,\boldsymbol{e}_{i_{T}}^{0}],\bar{\boldsymbol{e}}_{u}^{0})\), and \(\bar{\boldsymbol{e}}_{u}^{0}=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{e}_{i_{t}}^{0}\) is the mean history embedding.
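A minimal sketch of the AGGR function of Equations (9) and (10); here the scaling factor \(\lambda_{0}\) is treated as a fixed hyper-parameter, which is our simplifying assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aggregator(nn.Module):
    """Equations (9)-(10): attention pooling of a sequence guided by a query embedding."""
    def __init__(self, h1, lambda0: float = 1.0):
        super().__init__()
        self.fc = nn.Linear(h1, h1)
        self.lambda0 = lambda0

    def forward(self, seq, query):     # seq: (L, h1); query: (h1,), e.g., the [CLS] embedding
        scores = self.lambda0 * torch.tanh(self.fc(seq) @ query)   # Eq. (10): (L,)
        alpha = F.softmax(scores, dim=0)
        return alpha @ seq             # Eq. (9): weighted sum, (h1,)
```

The same module can aggregate the shopping history by passing the mean history embedding \(\bar{\boldsymbol{e}}_{u}^{0}\) as the query.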

3.3.2 Embedding Fusion Network.

As discussed earlier, the user document embedding \(\boldsymbol{z}_{u,d}\) captures explicit but biased user preferences, while the history embedding \(\boldsymbol{z}_{u,h}\) contains more general but implicit information. We design a fusion network that filters the information from each embedding, guided by the other source, to generate the final user embedding \(\boldsymbol{z}_{u}\)
\begin{align}\boldsymbol{z}_{u} & =[\boldsymbol{z}_{u,h}\odot\boldsymbol{s}_{h};\boldsymbol{z}_{u,d}\odot\boldsymbol{s}_{d}], \\\boldsymbol{s}_{h} & =\sigma(\text{MLP}_{4}(\boldsymbol{z}_{u,d})), \boldsymbol{s}_{d}=\sigma(\text{MLP}_{5}(\boldsymbol{z}_{u,h})), \end{align}
(11)
where \(\odot\) stands for element-wise multiplication, \(\text{MLP}_{4}:\mathbb{R}^{h_{1}}\rightarrow\mathbb{R}^{h_{1}}\) and \(\text{MLP}_{5}:\mathbb{R}^{h_{1}}\rightarrow\mathbb{R}^{h_{1}}\). If user reviews are not available, we can simply let \(\boldsymbol{z}_{u}=\boldsymbol{z}_{u,h}\). Here, \(\boldsymbol{s}_{h}\) and \(\boldsymbol{s}_{d}\) are gates that filter out redundant information. Inspired by SE-Net [19], each gate first represents its input embedding in a lower-dimensional hidden space and then maps it back to the original space with a gated activation function (e.g., the sigmoid function). With this design, each gate identifies the essential information in one embedding and uses it as guidance to filter the other.
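A minimal sketch of Equation (11), assuming one-layer gating MLPs:

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Equation (11): cross-gated fusion of history and document embeddings."""
    def __init__(self, h1):
        super().__init__()
        self.mlp4 = nn.Linear(h1, h1)  # derives the gate over z_{u,h} from z_{u,d}
        self.mlp5 = nn.Linear(h1, h1)  # derives the gate over z_{u,d} from z_{u,h}

    def forward(self, z_h, z_d):
        s_h = torch.sigmoid(self.mlp4(z_d))
        s_d = torch.sigmoid(self.mlp5(z_h))
        return torch.cat([z_h * s_h, z_d * s_d], dim=-1)  # z_u in R^{2*h1}, so h3 = 2*h1
```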

3.4 Learning Discriminative Aspect Representations

Learning diverse and comprehensive aspects is crucial for the interpretability of our recommendation model, and it is a non-trivial task. There are two categories of methods for learning aspect-based embeddings. The first category relies on a two-step procedure [12, 29], i.e., it first conducts clustering to find the aspects and then learns embeddings. While this approach can produce human-understandable clusters, it is difficult to guarantee the quality of the clustering results used for embedding.
The second category jointly conducts aspect discovery and embedding learning in an end-to-end manner [32, 35, 36], aiming to learn word embeddings and interpretable aspects simultaneously. However, such models can suffer from mode collapse [22, 32], since there is no explicit constraint controlling the diversity of the embedding distribution. In our setting, mode collapse arises from jointly learning the aspect and word-aspect embeddings during training, which differs from K-means clustering [1], where the input entries (here, the word-aspect embeddings) are fixed. This can lead to a trivial solution that maps every aspect and word-aspect embedding to the same point while still minimizing the K-means objective, i.e., the distances between words and their closest aspects are zero. To tackle these challenges, we propose a new end-to-end approach with an explicit objective for learning discriminative representations of words in different aspects. Specifically, we leverage the idea of Maximizing Coding Rate Reduction (MCR\({}^{2}\)) of representations [62], which encourages words sharing similar semantic concepts to be represented closer together and pushes the representations of semantically different words further apart. This constraint can be regarded as a prior on the word-aspect space that prevents the model from collapsing into trivial solutions.

3.4.1 Maximization of Coding Rate Reduction.

In information theory, the coding rate [33] is defined as the minimum number of binary bits needed to encode a set of data instances with a prescribed precision \(\epsilon{\,\gt\,}0\). Intuitively, the coding rate of a dataset is large if its instances are scattered in a broad spatial region. In supervised learning, if a dataset contains multiple classes, where each class is cohesive but instances in different classes are uncorrelated, then we can reduce the coding rate of the whole dataset by coding each subset separately and summing them up. Thus, to learn discriminative word representations distributed over multiple aspects, we want to maximize the coding rate reduction.

3.4.2 Unsupervised MCR \({}^{2}\) .

In recommender systems, we treat each item aspect as a class and aim to learn word representations that are discriminative between different aspects. However, in our problem, labels assigning words to aspects are not available. To overcome this, we build a word–word affinity graph with an adjacency matrix \(\boldsymbol{G}\), where \(\boldsymbol{G}_{i,j}\) denotes the semantic similarity between word \(i\) and word \(j\), and leverage the group information implicitly contained in \(\boldsymbol{G}\) [15]. Words sharing similar semantic meanings form a group in the graph, and words with different meanings fall into different groups. Each group plays the role of a class. Formally, given an item document \(\mathcal{D}_{i}\) with \(L\) distinct words, the adjacency matrix is \(\boldsymbol{G}=[\boldsymbol{g}^{1},...,\boldsymbol{g}^{l},...,\boldsymbol{g}^{L}]\in\mathbb{R}^{L\times L}\), where \(\boldsymbol{g}^{l}\in\mathbb{R}^{L}\). In addition, the word-aspect embeddings \(\{\boldsymbol{q}_{i}^{l}\}_{l=1}^{L}\) form the matrix \(\boldsymbol{Q}\in\mathbb{R}^{L\times h_{2}}\). The objective for learning discriminative word-aspect representations \(\boldsymbol{Q}\) is thus formulated as
\begin{align}\Omega_{d}=R(\boldsymbol{Q},\epsilon)-R^{c}(\boldsymbol{Q},\epsilon|\boldsymbol{G}),\end{align}
(12)
where \(R(\boldsymbol{Q},\epsilon)\) is the coding rate of the entire representations, and \(R^{c}(\boldsymbol{Q},\epsilon|\boldsymbol{G})\) denotes the summation of the coding rates over groups. Since the word-aspect matrix \(\boldsymbol{Q}\) and the aspect embeddings \(\boldsymbol{A}\) share the same latent space, optimizing \(\Omega_{d}\) indirectly controls the distribution of aspect embeddings. Specifically,
\begin{align}R(\boldsymbol{Q},\epsilon) & =\frac{1}{2\beta}\log\det\left(\boldsymbol{I}+\frac{\beta h_{2}}{L\epsilon^{2} }\boldsymbol{Q}^{\top}\boldsymbol{Q}\right), \\R^{c}(\boldsymbol{Q},\epsilon|\boldsymbol{G}) & =\sum_{l=1}^{L}\frac{\text{tr}(\boldsymbol{G}^{l})}{2L}\log\det\left(\boldsymbol{I}+ \frac{h_{2}}{\text{tr}(\boldsymbol{G}^{l})\epsilon^{2}}\boldsymbol{Q}^{\top}\boldsymbol{G}^{l}\boldsymbol{Q}\right), \end{align}
(13)
where \(\boldsymbol{I}\in\mathbb{R}^{h_{2}\times h_{2}}\) is an identity matrix, \(\boldsymbol{G}^{l}=\text{diag}(\boldsymbol{g}^{l})\in\mathbb{R}^{L\times L}\) diagonalizes the word similarity vector, and \(\beta\in\mathbb{R}\) is a hyper-parameter controlling the compactness of grouped word representations. Note that \(\Omega_{d}\) trivially increases with the norm of \(\boldsymbol{Q}\), so we normalize its columns into unit vectors. Intuitively, maximizing \(\Omega_{d}\) amounts to maximizing \(R(\boldsymbol{Q},\epsilon)\) and minimizing \(R^{c}(\boldsymbol{Q},\epsilon|\boldsymbol{G})\). The former encourages \(\{\boldsymbol{q}_{i}^{l}\}_{l=1}^{L}\) to be mutually independent, while the latter encourages the \(\{\boldsymbol{q}_{i}^{l}\}\) within the same group to be correlated. Using \(\Omega_{d}\) as a regularization term thus separates word representations across different groups in \(\boldsymbol{G}\) and squeezes word representations within the same group.
In this work, we use the cosine similarity between pre-trained word embeddings \(\boldsymbol{v}^{l_{1}},\boldsymbol{v}^{l_{2}}\) to measure the semantic similarity of two words \(w^{l_{1}}\) and \(w^{l_{2}}\). The pre-trained embeddings are obtained from the first layer of the PLM. We let \(\boldsymbol{G}_{l_{1},l_{2}}=\boldsymbol{G}_{l_{2},l_{1}}=1\) if \(\cos(\boldsymbol{v}^{l_{1}},\boldsymbol{v}^{l_{2}})=\frac{\langle\boldsymbol{v}^{l_{1}},\boldsymbol{v}^{l_{2}}\rangle}{\|\boldsymbol{v}^{l_{1}}\|_{2}\|\boldsymbol{v}^{l_{2}}\|_{2}}\) is greater than a threshold \(T\). Otherwise, we let \(\boldsymbol{G}_{l_{1},l_{2}}=\boldsymbol{G}_{l_{2},l_{1}}=0\), which yields a sparse adjacency matrix.
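A minimal sketch of the objective in Equation (13); it assumes \(\boldsymbol{Q}\) has already been normalized as described above and that every word has at least one neighbor in \(\boldsymbol{G}\) (self-similarity guarantees this under the cosine threshold).

```python
import torch

def coding_rate_reduction(Q, G, eps=0.5, beta=1.0):
    """Equation (13): Omega_d = R(Q, eps) - R^c(Q, eps | G).

    Q: (L, h2) word-aspect embeddings (normalized as described in the text).
    G: (L, L) binary word-word affinity matrix.
    """
    L, h2 = Q.shape
    I = torch.eye(h2)
    # R(Q, eps): coding rate of the whole set of representations
    R = torch.logdet(I + (beta * h2) / (L * eps ** 2) * Q.t() @ Q) / (2 * beta)
    # R^c(Q, eps | G): sum of per-group coding rates, one group per affinity row
    Rc = 0.0
    for l in range(L):
        Gl = torch.diag(G[l])
        tr = G[l].sum()                 # tr(G^l); nonzero because of self-similarity
        Rc = Rc + tr / (2 * L) * torch.logdet(I + h2 / (tr * eps ** 2) * Q.t() @ Gl @ Q)
    return R - Rc  # maximize this quantity (or add its negative to the training loss)
```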

3.4.3 Residual Aspect.

Forcing every word to reflect an item aspect contradicts the fact that many words are not related to item aspects. Take the review “My parents love this restaurant so much!” as an example: parents and love are non-trivial words, but they are not related to any aspect of restaurants. To tackle this problem, we add an additional aspect called the residual aspect, denoted as \(\boldsymbol{a}_{0}\in\mathbb{R}^{h_{2}}\), to the aspect embedding matrix \(\boldsymbol{A}\), so the matrix finally has the shape \(K^{\prime}\times h_{2}\), where \(K^{\prime}=K+1\). Moreover, we fix the user interest probability \(P(\boldsymbol{a}_{0}|\boldsymbol{z}_{u})=0\) for the residual aspect to minimize the influence of residual-aspect words on the recommendation.

3.5 Objective Function

In this section, we introduce the overall objective function, which consists of several terms, and the training method of our model.

3.5.1 Prediction Loss.

The main objective of our model is predicting user preference scores. We introduce a prediction loss \(\mathcal{L}_{p}\) to measure the difference between predicted scores and user rating scores. Following previous studies [9, 42], we measure the accuracy of the proposed model using the MSE over the set \(\mathcal{X}\) of observed user–item pairs
\begin{align}\mathcal{L}_{p}=\frac{1}{|\mathcal{X}|}\sum_{(u,i)\in\mathcal{X}}(r_{u,i}-\hat {r}_{u,i})^{2}.\end{align}
(14)

3.5.2 Contrastive Loss.

Inspired by the idea of contrastive learning on graphs [5, 17], we consider the user and item history documents \(\mathcal{D}_{u}\)/\(\mathcal{D}_{i}\) and the current review \(\mathcal{T}_{u,i}\) as two views of the same user interests and item aspects at different moments. Assuming that user interests and item aspects do not change within a short time period, the preference scores estimated from the history documents and from the current review should be similar. Thus, given the recommender \(f\), we set up a contrastive loss to aid model training
\begin{align}\mathcal{L}_{c}&=\frac{1}{|\mathcal{X}|}\sum_{(u,i)\in\mathcal{X}}(\hat{r}_{u,i }-\hat{r}^{\prime}_{u,i})^{2}, \end{align}
(15)
\begin{align}\hat{r}_{u,i}&=f(u,i,\mathcal{D}_{u},\mathcal{D}_{i}), \\\hat{r}^{\prime}_{u,i}&=f(u,i,\mathcal{T}_{u,i},\mathcal{T}_{u,i}).\end{align}
(16)
Following previous studies [6], we drop the gradients coming from calculating \(\hat{r}^{\prime}_{u,i}\) to prevent the collapsing issue (i.e., constantly predicting the same result regardless of inputs).
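A minimal sketch of Equations (15) and (16) with the stop-gradient; the recommender's call signature is illustrative.

```python
import torch

def contrastive_loss(model, u, i, D_u, D_i, T_ui):
    """Equations (15)-(16): align the two views of the same user-item pair."""
    r_hat = model(u, i, D_u, D_i)          # view 1: history documents
    with torch.no_grad():                  # stop-gradient on view 2 to avoid collapse
        r_prime = model(u, i, T_ui, T_ui)  # view 2: the current review
    return (r_hat - r_prime) ** 2
```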

3.5.3 Training Loss.

The final objective function for training is
\begin{align}\mathcal{L}=\mathcal{L}_{p}+\gamma_{1}\mathcal{L}_{c}+\gamma_{2}\Omega_{d},\end{align}
(17)
where \(\gamma_{1}\), \(\gamma_{2}\) are hyper-parameters to balance the losses. Here, \(\Omega_{d}\) denotes the objective of MCR\({}^{2}\) for word representation learning, as introduced in the previous subsection.

3.6 Analysis of Inference Complexity

In this part, we analyze the time complexity of model inference, assuming one-layer architectures for all MLPs in our model. Given a review document of length \(L\), a \(P\)-layer Transformer-based PLM requires \(O(P\cdot h_{1}\cdot L^{2})\) time to generate word embeddings. Then, MLP\({}_{1}\) takes \(O(L\cdot h_{1})\) time to process \(L\) words; MLP\({}_{2}\) and MLP\({}_{3}\) take \(O(L\cdot h_{1}\cdot h_{2})\) and \(O(h_{3}\cdot h_{2})\) time, respectively; MLP\({}_{4}\) and MLP\({}_{5}\) both take \(O(h_{1}^{2})\) time; and the user review AGGR function takes \(O(L\cdot h_{1})\) time. Given a \(T\)-length shopping history, the AGGR function takes \(O(T\cdot h_{1})\) time. Finally, mapping \(L\) words to \(K^{\prime}\) aspects takes \(O(L\cdot K^{\prime}\cdot h_{2})\) time, and computing the prediction score based on \(K^{\prime}\) aspects takes \(O(K^{\prime}L)\) time. In total, since \(h_{3}=2h_{1}\), the time complexity is \(O(P\cdot h_{1}\cdot L^{2}+L\cdot h_{1}\cdot h_{2}+h_{1}^{2}+L\cdot K^{\prime}\cdot h_{2})\).
The above computation is costly for online systems. Thus, to reduce online computation, we propose to cache some intermediate quantities, including word-aspect mentions (i.e., \(P(\boldsymbol{a}_{k}|\boldsymbol{e}_{i}^{l})\)), user-aspect affiliations (i.e., \(P(\boldsymbol{a}_{k}|\boldsymbol{z}_{u})\)), and word sentiments. Under this setting, the time complexity of our model is reduced to \(O(L\cdot K^{\prime})\). In practice, \(K^{\prime}\) is small (e.g., the optimal \(K^{\prime}\approx 5\) as shown in Section 4.4), and \(L\) could be reduced by pre-processing reviews to select useful words. Empirically, the inference time of our model is comparable to that of matrix factorization [24].
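A minimal sketch of the cached online scoring; the cached tensors are assumed to be precomputed offline as described above.

```python
import torch

def cached_score(p_word, p_user, sentiment, b_i, b_u, b_g):
    """O(L * K') online scoring with cached intermediate quantities.

    p_word:    (L, K') cached word-aspect probabilities P(a_k | e_i^l)
    p_user:    (K',)   cached user-aspect affiliations P(a_k | z_u)
    sentiment: (L,)    cached word sentiment scores
    """
    gate = p_word @ p_user                       # (L,) aspect-guided interest gates
    return (sentiment * gate).mean() + b_i + b_u + b_g
```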

4 Experiment

We try to answer four research questions through experiments. Q1: How effective is DIRECT compared with other state-of-the-art baselines? Q2: How does each component contribute to the performance of DIRECT? Q3: How will DIRECT react to different numbers of aspects? Q4: How effective is DIRECT in learning interest aspects and generating interpretable recommendations?

4.1 Dataset

We evaluate DIRECT on five benchmarks: the “Toys and Games” (Toys), “Video Games” (Games), “Clothing, Shoes and Jewelry” (Clothing), and “CDs and Vinyl” (CDs) subsets of the Amazon Review Dataset [34], and the popular Yelp dataset restricted to the year 2019 (Yelp2019). We use the five-core versions of these datasets, where each user and each item has at least five reviews. We divide 70%, 10%, and 20% of each user's reviews into the training, validation, and test sets, respectively. The data statistics are summarized in Table 1, in which density is defined as \(\frac{2\times\#Reviews}{\#Users\times\#Items}\).
Table 1.
Dataset | #Users | #Items | #Reviews | Density
Toys | 19,412 | 11,924 | 167,597 | 0.1448%
Games | 24,303 | 10,672 | 231,780 | 0.1787%
Clothing | 39,387 | 23,033 | 278,677 | 0.0614%
Yelp2019 | 19,936 | 14,587 | 84,370 | 0.0580%
CDs | 75,258 | 64,443 | 1,097,592 | 0.0453%
Table 1. Statistics of Datasets

4.2 Comparison with Baseline Methods

To answer Q1, we compare DIRECT with 13 state-of-the-art recommendation baselines below.
Baseline Methods. To have a rigorous and fair comparison, we include standard matrix factorization methods (BiasMF [66] and NeuMF [18]), language model enhanced methods (DeepCoNN [67], NARRE [4], and DAML [27]), aspect-aware methods (EMF [66], ANR [9], CARP [25], AARM [14], and UARM [44]), and graph-based review systems (SSG [13], RMG [55], and RGCL [42]).
Experimental Settings. For all baseline methods, we use their publicly available source code and tune their hyper-parameters on the validation set. We train our model for 50 epochs with the AdamW [30] optimizer and early stopping, which is triggered after the learning rate has decayed twice. We adopt a learning rate decay strategy with a decay factor of 0.1 and an initial learning rate of \(1e-3\). We set \(\gamma_{1}=5e^{-3}\), \(\gamma_{2}=1e^{-6}\), \(K=5\), and \(h_{2}=64\) by default, and the batch size is fixed at 32. The dropout rates for the contextual word embedding and all MLP modules are 0.3 and 0.5, respectively. For the language model, we use the pre-trained BERT-small [47] with embedding size 512 to initialize the contextual word embeddings in Equation (3). Following common practice [42], our text preprocessing strategies include (1) removing HTML tags, special characters, and stopwords and (2) recovering abbreviated spellings and truncating user–item documents to a maximum length of 512 words. For a fair comparison, we replace the static word embedding table in several baselines (DeepCoNN, NARRE, DAML, ANR, and UARM) with the pre-trained BERT, which proved effective in our preliminary experiments.
Results. Table 2 reports the MSE results averaged over five random seeds. In general, DIRECT performs very competitively with the best baselines in all scenarios. Specifically, it performs significantly better than both the matrix factorization and language model enhanced methods. It also generally outperforms the aspect-aware baselines. Moreover, DIRECT surpasses two of the three graph-based methods and achieves comparable results with the strongest baseline. These results verify the effectiveness of DIRECT in terms of accuracy.
Table 2.
Model | Toys | Clothing | Games | CDs | Yelp2019 | A.R.
BiasMF | \(1.054_{\pm 0.061}\) | \(1.497_{\pm 0.054}\) | \(1.339_{\pm 0.019}\) | \(1.024_{\pm 0.007}\) | \(1.339_{\pm 0.012}\) | 13.8
NeuMF | \(0.935_{\pm 0.006}\) | \(1.324_{\pm 0.004}\) | \(1.225_{\pm 0.012}\) | \(0.949_{\pm 0.006}\) | \(1.174_{\pm 0.004}\) | 10.4
DeepCoNN | \(0.911_{\pm 0.001}\) | \(1.297_{\pm 0.010}\) | \(1.216_{\pm 0.013}\) | \(0.990_{\pm 0.013}\) | \(1.172_{\pm 0.006}\) | 10.0
NARRE | \(0.952_{\pm 0.028}\) | \(1.314_{\pm 0.022}\) | \(1.236_{\pm 0.012}\) | \(0.999_{\pm 0.013}\) | \(1.232_{\pm 0.031}\) | 12.2
DAML | \(0.897_{\pm 0.007}\) | \(1.275_{\pm 0.011}\) | \(1.204_{\pm 0.014}\) | \(0.965_{\pm 0.005}\) | \(1.160_{\pm 0.011}\) | 8.8
EMF | \(0.906_{\pm 0.005}\) | \(1.201_{\pm 0.004}\) | \(1.196_{\pm 0.003}\) | OOM\({}^{\rm a}\) | \(1.322_{\pm 0.007}\) | 11.0
ANR | \(0.824_{\pm 0.009}\) | \(1.126_{\pm 0.023}\) | \(1.190_{\pm 0.097}\) | \(0.918_{\pm 0.002}\) | \(1.116_{\pm 0.026}\) | 5.4
CARP | \(0.845_{\pm 0.009}\) | \(1.081_{\pm 0.012}\) | \(1.195_{\pm 0.019}\) | \(1.021_{\pm 0.027}\) | \(1.143_{\pm 0.007}\) | 6.6
AARM | \(0.848_{\pm 0.001}\) | \(1.150_{\pm 0.008}\) | \(1.184_{\pm 0.003}\) | \(0.951_{\pm 0.005}\) | \(1.128_{\pm 0.008}\) | 6.8
UARM | \(0.810_{\pm 0.001}\) | \(1.108_{\pm 0.002}\) | \(1.118_{\pm 0.003}\) | \(0.886_{\pm 0.002}\) | \(1.075_{\pm 0.007}\) | 3.8
SSG | \(0.828_{\pm 0.002}\) | \(1.129_{\pm 0.012}\) | \(1.144_{\pm 0.005}\) | \(0.869_{\pm 0.006}\) | \(1.205_{\pm 0.005}\) | 6.4
RMG | \(0.808_{\pm 0.002}\) | \(1.111_{\pm 0.010}\) | \(1.110_{\pm 0.003}\) | \(0.859_{\pm 0.004}\) | \(1.187_{\pm 0.004}\) | 4.4
RGCL | \(0.803_{\pm 0.003}\) | \(1.103_{\pm 0.009}\) | \(1.109_{\pm 0.006}\) | \(0.844_{\pm 0.003}\) | \(1.179_{\pm 0.004}\) | 3.0
DIRECT | \(0.804_{\pm 0.002}\) | \(1.100_{\pm 0.010}\) | \(1.115_{\pm 0.001}\) | \(0.885_{\pm 0.009}\) | \(1.063_{\pm 0.011}\) | 2.4
Table 2. Recommendation Performance Comparison
\({}^{\rm a}\)The model raises an OOM error during training on a 24 GB GPU. A.R.: Average Rank.
Discussions. We notice that UARM achieves performance comparable to DIRECT. UARM first learns the aspect distributions of users and items with contrastive learning and then integrates these aspect distributions with user–item representations for making recommendations. The strong performance of both UARM and DIRECT validates the idea of modeling aspect distributions for users and items. However, UARM slightly underperforms DIRECT, which we attribute to DIRECT's concurrent learning of users, items, and aspects, where coding rate reduction is introduced to prevent training collapse and to guarantee the interpretability and distinctiveness of the learned aspects. Meanwhile, although UARM learns aspect distributions of users and items, it treats them as additional features that are mapped to a latent space and concatenated with user–item embeddings, and it does not provide explicit attribution of its predictions. Thus, unlike DIRECT, UARM does not offer an interpretable decision-making process.

4.3 Ablation Study

To study Q2, we conduct experiments to examine the contributions of (1) using user reviews, (2) the fusion network for the final user embedding, and (3) the contrastive loss in Section 3.5.2 that captures the shared interests between history reviews and the target review. Specifically, we introduce three DIRECT variants: “w/o Review,” “w/o Fusion,” and “w/o CL.” “w/o Review” excludes user reviews from DIRECT. “w/o Fusion” replaces the fusion network in Section 3.3.2 with a concatenation operation. “w/o CL” excludes the contrastive loss from DIRECT. Table 3 summarizes their results on the five benchmark datasets.
Table 3.
Variant | Toys | Clothing | Games | CDs | Yelp2019 | Average
w/o Review | 0.8109 | 1.1132 | 1.1199 | 0.8943 | 1.0718 | 1.0020
w/o Fusion | 0.8091 | 1.0939 | 1.1172 | 0.8867 | 1.0683 | 0.9950
w/o CL | 0.8077 | 1.0954 | 1.1164 | 0.8847 | 1.0674 | 0.9943
DIRECT | 0.8044 | 1.1004 | 1.1152 | 0.8854 | 1.0628 | 0.9936
Table 3. Ablation Study of DIRECT
From Table 3, we make three observations. First, DIRECT improves over w/o Review in all cases. This result verifies our motivation to capture user interests from their reviews. Second, compared with w/o Fusion, DIRECT performs better on four of the five datasets. This is reasonable because our fusion network can adaptively combine users' posts and shopping behaviors in a learnable fashion. Third, by enforcing the alignment between review documents and target reviews, DIRECT outperforms w/o CL on Toys and Clothing while performing comparably on the others. These observations validate the effectiveness of the three crucial components of the proposed model.

4.4 Sensitivity Analysis on Aspect Number

Aspect embeddings are the key to achieving word-level explanations in our system. In this section, we analyze the sensitivity of our model to the number of aspects (Q3). Specifically, we follow the same experimental setup as above and search for the optimal number of aspects \(K\) in the set \(\{1,3,5,7,9,11\}\).
Table 4 reports the results over three random seeds. In general, the best \(K\) value varies from one dataset to another within a small range. For example, the optimal \(K\) values for Toys, Clothing, Games, CDs, and Yelp2019 are 3, 3, 7, 7, and 3, respectively. That is, the best \(K\) value falls between 3 and 7. This observation echoes the findings in ANR [9]. Given that DIRECT performs relatively stably when \(K\) is between 3 and 7, we set \(K=5\) for all datasets unless otherwise specified.
Table 4.
\(K\) | Toys | Clothing | Games | CDs | Yelp2019
1 | 0.8083 | 1.0867 | 1.1122 | 0.8852 | 1.0774
3 | 0.8053 | 1.0859 | 1.1114 | 0.8843 | 1.0759
5 | 0.8060 | 1.0863 | 1.1113 | 0.8797 | 1.0769
7 | 0.8062 | 1.0860 | 1.1112 | 0.8766 | 1.0791
9 | 0.8059 | 1.0859 | 1.1120 | 0.8792 | 1.0766
11 | 0.8066 | 1.0864 | 1.1127 | 0.8863 | 1.0790
Table 4. Sensitivity Analysis on Hyper-Parameter \(K\)
Bold number indicates the best performance on a dataset.

4.5 Interpretability Analysis

To study Q4, we first analyze the performance of our model in learning aspects via visualization and verbalization (Section 4.5.1). Then, we quantitatively analyze whether the proposed DIRECT could provide interpretations that reflect the user preferences (Section 4.5.2). Finally, we demonstrate the transparent decision-making process of DIRECT with some cases (Section 4.5.3).

4.5.1 Understanding Learned Aspects.

To check whether DIRECT learns discriminative interest aspects, we visualize the words and their aspect associations in Figure 2. Specifically, each word \(w\) is represented by a word-aspect embedding vector \(\boldsymbol{q}_{w}\), obtained by averaging its word-aspect representations over the entire training set. After obtaining the word-aspect embeddings, we assign word \(w\) to the \(k\)th aspect if \(k=\arg\max_{k^{\prime}}\boldsymbol{q}_{w}\cdot\tilde{\boldsymbol{a}}_{k^{\prime}}^{\top}\), where \(\tilde{\boldsymbol{a}}_{k}\) is the normalized embedding of the \(k\)th aspect. Since there are tens of thousands of words in the vocabulary, we only visualize the top 50 most frequently mentioned words for each aspect.
Fig. 2.
Fig. 2. Aspect visualization on Toys dataset with t-SNE [48].
Figure 2 shows the aspect distributions of our model under three different settings, where different colors denote different aspects. We observe that, in Figure 2(b), the four aspects (including one residual aspect) are discriminative and linearly separable. However, this property does not hold if we remove the constraint term \(\Omega_{d}\) (Figure 2(a)) or put too much weight on it (Figure 2(c)).
Moreover, we report the top 10 popular words under each learned aspect to examine whether they conceptually make sense. In particular, we use the pre-trained PLM to encode words in each user review \(\mathcal{T}_{u,i}\) and then estimate their aspect associations with Equation (7). Here, we use the checkpoints trained for the performance comparison. Tables 5 and 6 show the results on the Clothing and Toys datasets. Each column indicates a potential aspect identified by our model. We summarize each aspect in the first line and omit quantifiers, simple sentiment polarity adjectives (e.g., good, bad), and intensity adverbs (e.g., bit, much). As the tables show, our model can effectively cluster words into different aspects that customers may care about. For example, our model identifies five crucial factors, i.e., Gift, Texture, Environment, LowerBody, and Material, for the clothing domain. Furthermore, the top popular words in each aspect are closely related. Taking the “Gift” aspect as an example, words like daughter and son are common gift recipients in real life.
Table 5.
Gift | Texture | Environment | LowerBody | Material
year | cold | little | shirt | den
watch | soft | day | pair | synthetic
bag | water | old | socks | summer
ear | dark | house | feet | cotton
daughter | second | watch | sole | rubber
sand | strong | socks | run | fan
day | light | pair | pocket | tin
son | thick | wash | side | accent
small | fast | light | bra | cap
gift | gray | warm | back | composite
Table 5. Top Frequent Words for Aspects in Clothing Dataset
Table 6.
Quality | Texture | Puzzle | Doll | BoardGame
new | set | piece | doll | game
quality | plastic | make | different | year
collection | hard | game | thing | card
build | train | work | size | car
come | long | time | large | set
beautiful | learn | set | pretty | figure
challenge | big | use | color | pretty
wood | sturdy | together | amazing | look
additional | old | puzzle | heavy | player
grand | young | put | cool | daughter
Table 6. Top Frequent Words for Aspects in Toys Dataset
In summary, DIRECT can not only identify informative semantic aspects for different domain products but also assign words to their most appropriate aspects automatically.

4.5.2 Quantitative Analysis.

We quantitatively assess whether our system provides explanations that reflect user preferences. In particular, given a user–item pair, we treat the target review written by the user for the item as the ground truth of the user's preference. We measure the similarity between the explanations generated by DIRECT and the target reviews. In this experiment, we take the top-\(K\) sentences from the item document with the greatest maximum interest scores, according to Equation (5), as the explanations generated by DIRECT, where we set \(K=3\). The semantic similarities between the explanation sentences and the user reviews are estimated by a semantic similarity estimator fine-tuned from RoBERTa [37]. This model returns a value between 0 and 1, where a value closer to 1 indicates a stronger semantic similarity between the explanation and the user's target review. We further normalize and average the scores across the explanation sentences. To ensure the item document covers the user's interests, we ignore user–item pairs with fewer than ten sentences in the item document. Meanwhile, to guarantee that item properties are clearly expressed, we ignore verbose reviews with more than five sentences from a user. For comparison, we also report the similarities between the target review and the last-\(K\) sentences. Under this design, a more interpretable recommender should receive a greater average semantic similarity between the top-\(K\) sentences and the target review, while the similarity between the last-\(K\) sentences and the target review should be lower. We also calculate the growth percentage of the average similarity of top-\(K\) over last-\(K\), denoted as “Diff.” In addition to DIRECT, we implement an inherently interpretable baseline for our interpretability analysis. This baseline first constructs item feature vectors by counting the frequencies of keywords in item reviews. To perform personalized recommendation, it further estimates which item keywords may interest the given user. The final user rating is predicted with a linear function over the selected keyword embeddings. Since this baseline is built on the bag-of-words assumption, with a fully transparent and human-understandable decision-making process, it can be considered an oracle in our interpretability analysis. We denote this baseline as BoW.
Table 7 reports the results derived from 5,000 user–item pairs randomly sampled from each dataset. The results reveal that the explanations generated by DIRECT exhibit a greater similarity to target reviews than the unselected sentences. This pattern is consistent across all five datasets, demonstrating the efficacy of DIRECT in accurately capturing user interests from item documents, in line with its design objectives. Comparing the interpretability of DIRECT and the BoW baseline, we observe that the Diff score of DIRECT is comparable with that of this ideal interpretable baseline, emphasizing the strong interpretability of DIRECT. However, it is crucial to recognize that BoW achieves its transparent design while sacrificing recommendation quality: its MSE is significantly greater than DIRECT's. Putting these together, we conclude that DIRECT simultaneously improves the performance and the transparency of recommender systems.
Table 7.
Model | Metric | Toys | Clothing | Games | CDs | Yelp2019
Baseline-BoW | MSE \(\downarrow\) | \(0.936_{\pm 0.013}\) | \(1.240_{\pm 0.026}\) | \(1.315_{\pm 0.034}\) | \(1.027_{\pm 0.006}\) | \(1.244_{\pm 0.032}\)
Baseline-BoW | Top-K \(\uparrow\) | \(0.546_{\pm 0.304}\) | \(0.517_{\pm 0.320}\) | \(0.583_{\pm 0.331}\) | \(0.580_{\pm 0.346}\) | \(0.500_{\pm 0.270}\)
Baseline-BoW | Last-K \(\downarrow\) | \(0.391_{\pm 0.276}\) | \(0.412_{\pm 0.324}\) | \(0.357_{\pm 0.330}\) | \(0.378_{\pm 0.346}\) | \(0.379_{\pm 0.269}\)
Baseline-BoW | Diff \(\uparrow\) | 39.5% | 25.5% | 63.3% | 53.4% | 31.9%
DIRECT | MSE \(\downarrow\) | \(0.804_{\pm 0.002}\) | \(1.100_{\pm 0.010}\) | \(1.115_{\pm 0.001}\) | \(0.885_{\pm 0.009}\) | \(1.063_{\pm 0.011}\)
DIRECT | Top-K \(\uparrow\) | \(0.526_{\pm 0.282}\) | \(0.492_{\pm 0.289}\) | \(0.593_{\pm 0.293}\) | \(0.525_{\pm 0.321}\) | \(0.452_{\pm 0.241}\)
DIRECT | Last-K \(\downarrow\) | \(0.372_{\pm 0.293}\) | \(0.399_{\pm 0.313}\) | \(0.303_{\pm 0.301}\) | \(0.412_{\pm 0.339}\) | \(0.382_{\pm 0.273}\)
DIRECT | Diff \(\uparrow\) | 41.4% | 23.3% | 95.7% | 27.4% | 18.3%
Table 7. Quantitative Analysis to Explanation Quality

4.5.3 Case Study.

We provide case studies to show whether DIRECT improves the transparency of recommendation systems via interpretable features. To this end, we trace the activated user aspects, popular words of activated aspects, and the word sentiments predicted by our model.
Table 8 displays two good recommendation samples (Case 1 and Case 2) and a “bad” one (Case 3). For each case, i.e., each user–item prediction, we report not only its ground-truth information, such as the user and item IDs, rating score \(r\), and target review, but also all the related explainable features extracted by DIRECT, including the aspect distribution, the activated frequent words and their sentiment polarities, the predicted rating score \(\hat{r}\), and the predicted preference score \(pref\) and bias score \(bias\). To better visualize these cases, we omit contexts that are irrelevant to the target review and highlight the top 20 segments ranked by their activation scores (defined in Equation (5)) over the entire item document. We render the highlighted segments in different colors to emphasize their sentiment polarities (i.e., positive and negative) and underline segments of the target review that indicate the potential concerns of the anchor user.
Table 8.
Case 1: userID=A3KHRW6ZC2EQIL, itemID=B006H30KAE (ASICS Men’s GEL-Nimbus 14 Running Shoe)
Prediction: \(r=5.0\), \(\hat{r}=4.89\), \(pref=0.41\), \(bias=4.48\)
Interest Aspects: \(Aspect_{1}=0.5622\), \(Aspect_{2}=0.5594\), \(Aspect_{3}=0.5676\), \(Aspect_{4}=0.5585\), \(Aspect_{5}=0.5567\)
Item Document: … Similar to the New Balance 1080 and better than the Brooks Ravena. I am a 192 pound, 51 year old runner. I am a neutral runner and mid foot striker …. Gel Nimbus may be it, especially as a road training and long distance racing shoe. Heavier runners will really like the plush and cushioned…
Target Review: My wife hated the color of the white/blue Nimbus 13s I had… I’m a neutral shoe guy and I have had multiple heel spur surgeries….
Case 2: userID=AOMEH9W6LHC4S, itemID=B006H30KAE (ASICS Men’s GEL-Nimbus 14 Running Shoe)
Prediction: \(r=5.0\), \(\hat{r}=4.64\), \(pref=0.32\), \(bias=4.32\)
Interest Aspects: \(Aspect_{1}=0.4850\), \(Aspect_{2}=0.3980\), \(Aspect_{3}=0.3982\), \(Aspect_{4}=0.4692\), \(Aspect_{5}=0.4155\)
Item Document: … Similar to the New Balance 1080 and better than the Brooks Ravena. I am a 192 pound, 51 year old runner. I am a neutral runner and mid foot striker …. Gel Nimbus may be it, especially as a road training and long distance racing shoe. Heavier runners will really like the plush and cushioned…
Target Review: … but I’m quite confident in the fit of ASICs …. It’s neutral (the wrong shoe if you over-pronate) with good lateral stiffness….
Case 3: userID=A2DXFI46OKWC8G, itemID=630508985X (Blue Oyster Cult - Live 1976)
Prediction: \(r=5.0\), \(\hat{r}=4.06\), \(pref=-0.05\), \(bias=4.10\)
Interest Aspects: \(Aspect_{1}=0.3172\), \(Aspect_{2}=0.9756\), \(Aspect_{3}=0.1492\), \(Aspect_{4}=0.4661\), \(Aspect_{5}=0.9886\)
Item Document: Bad picture, bad sound, bad performance. Not entirely true. I found the performance to be very good/typical and the picture pretty watchable. I sure wish the sound was better though!… I do feel a little sorry for people who pay $60-$70 for this disc. I was lucky enough to get it for around $20….
Target Review: The sound on this isn’t bad but its not the greatest so… its Blue Öyster Cult back in the day, not New Blue Öyster Cult nowadays playing old songs!…
Table 8. Case Study of Three User–Item Pairs from the Amazon Datasets
Good Case. Case 1 and Case 2 list the recommendations of two users on the same item. Since the two cases report prediction results on the same item, our model highlights several common phrases in the item document, such as neutral runner, runner, and mid foot striker. However, some phrases are activated only for the second user; for example, our model additionally activates heavier runners and road training and long distance racing shoe in the second case. These results seem unreasonable at first glance, as the two users target the same item. However, tracing back, we find that the second user bought another shoe earlier and posted the comment: “These might be the perfect shoes for some runners or race-walkers (perhaps those with slender builds).” Considering the two posts jointly, the reason our model activates heavier runners becomes clear and reasonable: the second user might be a heavier runner. The difference in the activated phrases between the two users sheds light on the effectiveness of our model in capturing users’ personal interests and making interpretable recommendations by extracting human-understandable review words.
“Bad” Case. No recommendation model makes correct predictions all the time, and ours is no exception. We report a failure case, Case 3 in Table 8. For this case, our model gives a negative preference score (i.e., \(-0.05\)) to the item because it believes the user dislikes the sound quality, based on the negative sentiments of highlighted phrases such as bad picture, bad sound, and bad performance. In fact, this prediction is not entirely wrong, as the user admits in the target review that the sound of the product could be improved. However, the model ignores implicit facts, such as that the user is a big fan of the band and can therefore tolerate the imperfect sound quality. This conjecture is plausible since the user gives the product five stars. This case indicates the limitation of DIRECT in capturing users’ fine-grained interests. In the future, we will explore more advanced aspect learning strategies to fill this gap.
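To make the trace above concrete, the sketch below shows how the reported quantities fit together: the prediction decomposes as \(\hat{r}=bias+pref\), the preference score is an importance-weighted average of segment sentiments, and the highlighted evidence is the top 20 segments by activation score. All names here are our own illustration, not the released implementation.

import numpy as np

def explain_case(bias, importance, sentiment, segments, top_k=20):
    """Decompose r_hat = bias + pref and select the segments to highlight."""
    importance = np.asarray(importance, dtype=float)  # Equation (5) activation scores
    sentiment = np.asarray(sentiment, dtype=float)    # predicted polarities in [-1, 1]
    # Preference score: importance-weighted average of segment sentiments.
    pref = float((importance * sentiment).sum() / importance.sum())
    r_hat = bias + pref
    order = np.argsort(-importance)[:top_k]           # top-20 highlighted segments
    highlights = [(segments[i], "positive" if sentiment[i] >= 0 else "negative")
                  for i in order]
    return r_hat, pref, highlights

# E.g., Case 1 of Table 8 reports bias = 4.48 and pref = 0.41, so r_hat = 4.89.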

5 Related Work

Recent studies on review-based neural recommendation mainly focus on two topics: (1) improving the accuracy of predicting user preferences and (2) enhancing the interpretability of recommenders.

5.1 Review-Based Neural Recommender Systems

The earliest successful attempt at a review-based neural recommender is DeepCoNN [67], which uses a dual-TextCNN [7] architecture to build user and item embeddings from their reviews. TransNet [2] extends DeepCoNN by forcing the document representations to be similar to the representation of the target user–item review. Inspired by the success of the Transformer [49], MPCN [46], D-Attn [40], and NARRE [4] apply self-attention mechanisms to user and item reviews. To aggregate information across user and item reviews, DAML [27], CARL [56], and AHN [11] exploit attention mechanisms spanning the two sources. CARP [25] develops a confidence matrix to keep only the embeddings of high-confidence reviews. AENAR [63] first measures the difference between the current review embedding and a global review embedding, then treats the difference as a gate to filter the review embedding. To better capture the interactions between users and items, RMG [55], SSG [13], and RGCL [42] cast user–item preference prediction as an edge classification problem and apply graph learning methods to aggregate user and item embeddings. Recently, researchers [28, 59, 65] have directly used PLMs to process reviews and other textual user–item resources for recommendation, leveraging their strong in-context learning ability.

5.2 Explainable Review-Based Recommender Systems

D-Attn [40] designs a local attention module and a global attention module to identify essential words in reviews. Similarly, CAML [8] first designs a multi-pointer co-attention selector to collect a user embedding, an item embedding, and a concept embedding, and then uses these embeddings to make recommendations and generate textual explanations. AHN [11] designs an asymmetric attention method to identify important words in reviews: the user-side attention extracts words related to the target item, while the item-side attention extracts words that best reflect the current item. ANR [9] and CARP [25] are the only two aspect-based end-to-end learning methods in this line of work. ANR [9] is the first model to integrate aspect detection into the training process; it represents the item and the user with several aspect embeddings and importance scores, and the final score is the sum of the similarities between corresponding aspect embeddings, weighted by the importance scores. CARP [25] extracts a fixed number of aspects from the user and item reviews, combines pairs of user and item aspects, and finally uses a capsule network to obtain positive and negative scores.
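For concreteness, the aspect-based scoring of ANR described above can be written as follows; this is our paraphrase with our own notation (\(K\) aspects, importance scores \(\beta\)), not a formula copied from the cited paper:
\[
\hat{r}_{u,i}=\sum_{k=1}^{K}\beta_{u,k}\,\beta_{i,k}\cdot\mathrm{sim}\big(\mathbf{a}_{u,k},\,\mathbf{a}_{i,k}\big),
\]
where \(\mathbf{a}_{u,k}\) and \(\mathbf{a}_{i,k}\) are the \(k\)-th aspect embeddings of the user and item, and \(\beta_{u,k}\) and \(\beta_{i,k}\) are their learned importance scores.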

6 Conclusion

We propose DIRECT, a novel self-interpretable review-based recommender system. DIRECT predicts user preferences by averaging the sentiment polarities of words weighted by their importance, assigning larger weights to words that express the aspects a user is interested in. We also leverage the idea of MCR\({}^{2}\) to encourage the learned aspects to be more discriminative, diverse, and explainable. In an online serving setup, by caching intermediate information such as word-aspect affiliations, DIRECT achieves linear time complexity with respect to document length. Experimental results on real-world datasets show that DIRECT outperforms traditional baseline methods and is comparable to state-of-the-art methods. Quantitative analysis, visualizations, and case studies verify the interpretability of DIRECT.
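To illustrate why this caching yields linear-time inference, consider the sketch below: if each word’s aspect-affinity vector and sentiment polarity are precomputed and cached, then scoring a document of length \(L\) is a single pass with one lookup per word, i.e., \(O(L)\). The class and attribute names are assumptions for illustration, not the released implementation.

import numpy as np

class CachedScorer:
    def __init__(self, word_aspect, word_sentiment):
        self.word_aspect = word_aspect        # word -> cached aspect-affinity vector
        self.word_sentiment = word_sentiment  # word -> cached sentiment polarity

    def preference(self, user_aspects, document):
        """One pass over the document's words: O(L) cache lookups."""
        num = den = 0.0
        for w in document:
            affinity = self.word_aspect.get(w)
            if affinity is None:              # skip out-of-vocabulary words
                continue
            imp = float(np.dot(user_aspects, affinity))  # word importance for this user
            num += imp * self.word_sentiment[w]          # sentiment weighted by importance
            den += imp
        return num / den if den else 0.0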
Future work includes (1) exploring more effective user representation learning methods to further improve model performance, (2) developing more effective graph construction methods for describing word–word relationships to generate better aspect embeddings, and (3) introducing expert knowledge to construct more controllable representations.

Footnote

1. Yelp Open Dataset: https://www.yelp.com/dataset

References

[1]
Leon Bottou and Yoshua Bengio. 1994. Convergence properties of the k-means algorithms. Advances in Neural Information Processing Systems 7 (1994), 585–592.
[2]
Rose Catherine and William Cohen. 2017. TransNets: Learning to transform for recommendation. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys’17). ACM, New York, NY, 288–296.
[3]
Chun-Hao Chang, Sarah Tan, Ben Lengerich, Anna Goldenberg, and Rich Caruana. 2021. How interpretable and trustworthy are GAMs? In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD’21). ACM, New York, NY, 95–105.
[4]
Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the Web Conference 2018 (WWW’18). ACM, New York, NY.
[5]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). PMLR, 1597–1607.
[6]
Xinlei Chen and Kaiming He. 2021. Exploring simple Siamese representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 15745–15753.
[7]
Yahui Chen. 2014. Convolutional Neural Network for Sentence Classification. Master’s thesis.
[8]
Zhongxia Chen, Xiting Wang, Xing Xie, Tong Wu, Guoqing Bu, Yining Wang, and Enhong Chen. 2019. Co-attentive multi-task learning for explainable recommendation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’19). 2137–2143.
[9]
Jin Yao Chin, Kaiqi Zhao, Shafiq Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM’18). ACM, New York, NY, 147–156.
[10]
J. Shane Culpepper, Fernando Diaz, and Mark D. Smucker. 2018. Research frontiers in information retrieval: Report from the third strategic workshop on information retrieval in lorne (swirl 2018). In ACM SIGIR Forum. Vol. 52. ACM, New York, NY, 34–90.
[11]
Xin Dong, Jingchao Ni, Wei Cheng, Zhengzhang Chen, Bo Zong, Dongjin Song, Yanchi Liu, Haifeng Chen, and Gerard De Melo. 2020. Asymmetrical hierarchical networks with attentive interactions for interpretable review-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 7667–7674.
[12]
Alessandro Epasto and Bryan Perozzi. 2019. Is a single embedding enough? learning node representations that capture multiple social contexts. In The World Wide Web Conference (WWW’19). 394–404.
[13]
Jingyue Gao, Yang Lin, Yasha Wang, Xiting Wang, Zhao Yang, Yuanduo He, and Xu Chu. 2020. Set-sequence-graph: A multi-view approach towards exploiting reviews for recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM’20). ACM, New York, NY, 395–404.
[14]
Xinyu Guan, Zhiyong Cheng, Xiangnan He, Yongfeng Zhang, Zhibo Zhu, Qinke Peng, and Tat-Seng Chua. 2019. Attentive aspect modeling for review-aware recommendation. ACM Transactions on Information Systems (TOIS) 37, 3 (2019), 1–27.
[15]
Xiaotian Han, Zhimeng Jiang, Ninghao Liu, Qingquan Song, Jundong Li, and Xia Hu. 2022. Geometric graph representation learning via maximizing rate reduction. In Proceedings of the ACM Web Conference (WWW’22).
[16]
Trevor J. Hastie. 2017. Generalized additive models. In Statistical models in S. Routledge, 249–307.
[17]
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 9726–9735.
[18]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW’17). 173–182.
[19]
Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7132–7141.
[20]
Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL’19). Vol. 1, 3543–3556.
[21]
Dmitry Kazhdan, Botty Dimanov, Mateja Jamnik, Pietro Lio, and Adrian Weller. 2020. Now you see me (CME): Concept-based model extraction. arXiv:2010.13233.
[22]
Yunji Kim and Jung-Woo Ha. 2021. Contrastive fine-grained class clustering via generative adversarial networks. In Proceedings of International Conference on Learning Representations (ICLR’21).
[23]
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). PMLR.
[24]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42 (2009), 30–37.
[25]
Chenliang Li, Cong Quan, Li Peng, Yunwei Qi, Yuming Deng, and Libing Wu. 2019. A capsule network for recommendation and explaining what you like and dislike. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19).
[26]
Jiahui Li, Kun Kuang, Lin Li, Long Chen, Songyang Zhang, Jian Shao, and Jun Xiao. 2021. Instance-wise or class-wise? A tale of neighbor shapley for concept-based explanation. In Proceedings of the 29th ACM International Conference on Multimedia (ACMMM’21). 3664–3672.
[27]
Donghua Liu, Jing Li, Bo Du, Jun Chang, and Rong Gao. 2019a. DAML: Dual attention mutual learning between ratings and reviews for item recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19). 344–352.
[28]
Junling Liu, Chao Liu, Peilin Zhou, Renjie Lv, Kang Zhou, and Yan Zhang. 2023. Is ChatGPT a good recommender? A preliminary study. arXiv:2304.10149.
[29]
Ninghao Liu, Qiaoyu Tan, Yuening Li, Hongxia Yang, Jingren Zhou, and Xia Hu. 2019b. Is a single vector enough? Exploring node polysemy for network embedding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19). 1435–1445.
[30]
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR’19).
[31]
Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS’17). 4768–4777.
[32]
Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. 2019. Learning disentangled representations for recommendation. In Annual Conference on Neural Information Processing Systems (NeurIPS’19).
[33]
Yi Ma, Harm Derksen, Wei Hong, and John Wright. 2007. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 9 (2007), 1546–1562.
[34]
Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’15). 785–794.
[35]
Deng Pan, Xiangrui Li, Xin Li, and Dongxiao Zhu. 2021. Explainable recommendation via interpretable feature mapping and evaluation of explainability. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI’20).
[36]
Chanyoung Park, Carl Yang, Qi Zhu, Donghyun Kim, Hwanjo Yu, and Jiawei Han. 2020. Unsupervised differentiable multi-aspect network embedding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’20). 1435–1445.
[37]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the Ninth International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3982–3992.
[38]
Steffen Rendle. 2010. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM). IEEE.
[39]
David Sculley. 2010. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web. 1177–1178.
[40]
Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys’17). 297–305.
[41]
Yucheng Shi, Qiaoyu Tan, Xuansheng Wu, Shaochen Zhong, Kaixiong Zhou, and Ninghao Liu. 2024. Retrieval-enhanced knowledge editing for multi-hop question answering in language models. arXiv:2403.19631.
[42]
Jie Shuai, Kun Zhang, Le Wu, Peijie Sun, Richang Hong, Meng Wang, and Yong Li. 2022. A Review-aware graph contrastive learning framework for recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’22). 1283–1293.
[43]
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034.
[44]
Peijie Sun, Le Wu, Kun Zhang, Yu Su, and Meng Wang. 2021. An unsupervised aspect-aware recommendation model with explanation text generation. ACM Transactions on Information Systems (TOIS) 40, 3 (2021), 1–29.
[45]
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML’17). PMLR, 3319–3328.
[46]
Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Multi-pointer co-attention networks for recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18). 2309–2318.
[47]
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv:1908.08962.
[48]
Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
[49]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). 6000–6010.
[50]
Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law and Technology 31 (2017), 841–887.
[51]
Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2018. RippleNet: Propagating user preferences on the knowledge graph for recommender systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM’18). 417–426.
[52]
Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019a. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19). 950–958.
[53]
Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019b. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information (SIGIR’19). 165–174.
[54]
Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled graph collaborative filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1001–1010.
[55]
Chuhan Wu, Fangzhao Wu, Tao Qi, Suyu Ge, Yongfeng Huang, and Xing Xie. 2019b. Reviews meet graphs: Enhancing user and item representations for recommendation with hierarchical attentive graph neural network. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). 4886–4895.
[56]
Libing Wu, Cong Quan, Chenliang Li, Qian Wang, Bolong Zheng, and Xiangyang Luo. 2019a. A context-aware user-item representation learning for item recommendation. ACM Transactions on Information Systems (TOIS) 37, 2 (2019), 1–29.
[57]
Xuansheng Wu, Wenlin Yao, Jianshu Chen, Xiaoman Pan, Xiaoyang Wang, Ninghao Liu, and Dong Yu. 2023. From language modeling to instruction following: Understanding the behavior shift in llms after instruction tuning. arXiv:2310.00492.
[58]
Xuansheng Wu, Haiyan Zhao, Yaochen Zhu, Yucheng Shi, Fan Yang, Tianming Liu, Xiaoming Zhai, Wenlin Yao, Jundong Li, Mengnan Du, and Ninghao Liu. 2024a. Usable XAI: 10 strategies towards exploiting explainability in the LLM era. arXiv:2403.08946.
[59]
Xuansheng Wu, Huachi Zhou, Yucheng Shi, Wenlin Yao, Xiao Huang, and Ninghao Liu. 2024b. Could small language models serve as recommenders? Towards data-centric cold-start recommendations. In The Web Conference (WWW).
[60]
Fan Yang, Ninghao Liu, Suhang Wang, and Xia Hu. 2018. Towards interpretation of recommender systems with sorted explanation paths. In Proceedings of the IEEE International Conference on Data Mining (ICDM’18). IEEE.
[61]
Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18). 974–983.
[62]
Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, and Yi Ma. 2020. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS’20). 9422–9434.
[63]
Tianwei Zhang, Chuanhou Sun, Zhiyong Cheng, and Xiangjun Dong. 2022. AENAR: An aspect-aware explainable neural attentional recommender model for rating prediction. Expert Systems With Applications 198 (2022).
[64]
Yongfeng Zhang and Xu Chen. 2020. Explainable recommendation: A survey and new perspectives. Foundations and Trends in Information Retrieval 14, 1 (2020), 1–101.
[65]
Yuhui Zhang, Hao Ding, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, and Hao Wang. 2021. Language models as recommender systems: Evaluations and limitations. In I (Still) Can’t Believe It’s Not Better! NeurIPS 2021 Workshop.
[66]
Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 83–92.
[67]
Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM’17). 425–434.
[68]
Honglei Zhuang, Xuanhui Wang, Michael Bendersky, Alexander Grushetsky, Yonghui Wu, Petr Mitrichev, Ethan Sterling, Nathan Bell, Walker Ravina, and Hai Qian. 2021. Interpretable ranking with generalized additive models. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM’21). 499–507.
