In the following, we introduce the proposed hybrid prompt learning strategy and the fine-tuning approaches for pre-trained LMs.
3.2.1 Hybrid Prompt Learning for Automation Rules.
An initial method to generate justifications for harmful rules is to exploit discrete prompt learning. Typically, when using a pre-trained LM, the embedding layer (denoted as \(E\)) is used to convert the input token sequence, represented as \(X=(x_{1},\ldots,x_{N})\), into a set of embeddings \(\{E({x}_{1}),\ldots,E({x}_{N})\}\). In a downstream task, \(X\) is conditioned by a specific context \(CO\), which influences the model’s behavior in producing the target sentence \(J\). In contrast, discrete prompt learning uses a prompting function that arranges \(CO\), \(J\), and a prompt \(P\) into a template \(T\). In our scenario, the prompt \(P\) is made up of additional natural language tokens, while the context \(CO\) consists of the rule information \(R\) and the associated harm \(H\). These are the key elements that guide the model in generating the justification \(J\), which represents the target sentence.
Hard Prompt. To make prompts clear, we define them in a format similar to how a human would explain a problem. In particular, we consider two different prefix prompts [25], where the input is placed entirely before the slot [\(J\)] to be filled. The first is named span-infilling prompt and is a straightforward sequence of natural language tokens following the encoded rule information:
On the other hand, it has been proven that using the question-answering format can be very useful in transferring linguistic or other knowledge learned from one task to a related task [18]. For this reason, we also propose a question-answering prompt, as it is possible to convert a non-question-answering task into an equivalent question-answering form [28]. Again, we deal with a prefix prompt since it is necessary to continue a string prefix, but the encoded rule information is placed in the middle of the prompt. Here, the prompting function receives as input the encoded rule \(R\) and harm \(H\), and outputs the sentence:
In both cases, prompts are fixed, and training and test samples share the same prompt. The hypothesis is that the additional natural language tokens force the model to compute similarities between the input text and the words in the human-written templates, suggesting the subset of the vocabulary that provides better information in generating justifications.
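To make the two template shapes concrete, the following sketch expresses them as prompting functions. The template wording below is a hypothetical stand-in for the paper’s actual templates (which are not reproduced here); only the prefix structure, with the slot [\(J\)] at the end, is taken from the text.

```python
# A sketch of the two hard-prompt formats. Both are prefix prompts: the
# input is placed entirely before the slot [J], so the LM generates the
# justification J as a continuation. The template strings are illustrative.

def span_infilling_prompt(rule: str, harm: str) -> str:
    # Encoded rule information followed by a plain sequence of natural
    # language tokens; [J] is generated right after the prompt.
    return f"{rule} {harm} This rule is harmful because"

def question_answering_prompt(rule: str, harm: str) -> str:
    # Question-answering form: the encoded rule information R and harm H
    # sit in the middle of the prompt, and [J] follows the answer cue.
    return f"Question: why does the rule '{rule}' cause the harm '{harm}'? Answer:"

# Training and test samples share the same fixed template.
print(span_infilling_prompt(
    "IF I publish a new post with a specific hashtag, THEN turn off the lights",
    "personal harm",
))
```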
Upon closer examination of rule characteristics, we found that, despite their simplicity, hard prompts have weaknesses. Specifically, representing channel names with discrete tokens may result in the loss of semantic information. In fact, channels are fundamental components in the TAP domain, helping determine a rule’s execution context. For instance, if a rule uses “Philips Hue” as a channel for a trigger or action, it is clear that the rule is affecting a physical device. On the other hand, if the rule uses “Facebook,” we can deduce that software-based operations will occur.
Pre-trained models often use subword tokenization to effectively deal with both common and infrequent words [30]. However, this method may break channel names into meaningless segments, thereby exacerbating the challenge of comprehending the context in which the rules are executed. For example, the channel “500px” could be tokenized into [“500,” “p,” “x”], making it even more challenging to understand that it refers to a global online photo-sharing platform. A straightforward remedy would be to treat channel names as distinct words and incorporate them into the model’s vocabulary. However, as the number of channels provided by TAPs grows rapidly, this method can cause a significant increase in the vocabulary size, resulting in a substantial computational load on the LM during the softmax operation used to generate a word at each step. Furthermore, pre-trained models utilize contextualized embeddings to represent words [37, 49], which capture contextual information in the text. This method is more efficient than using static word embeddings because it can generate different representations for words with multiple meanings. For instance, the word “mouse” can refer to a small rodent or a computer peripheral, and a pre-trained model can distinguish between these meanings based on the context. However, identifying the functionalities of a channel with this approach can be challenging in the TAP domain, as certain words may obscure the context. Indeed, since channels might be words unknown to the pre-trained LM, there is no guarantee that their representation is inferred only from words consistent with the exposed functionalities. For instance, consider the rule “IF I publish a new post with a specific hashtag, THEN turn off the lights in the bedroom,” which enables a user to turn off the lights with a goodnight post. This rule can be defined using Facebook and Philips Hue as trigger and action channels, respectively. In this case, since the word “post,” recalling a social media post, falls within the context of the Philips Hue channel, it may be considered in the definition of the channel’s embedding, potentially undermining the rule’s execution context. Ideally, the representation of Philips Hue should only include words such as “lights” and “turn off,” as it enables effective management of smart bulbs.
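As a concrete illustration of the fragmentation issue discussed above, the following sketch runs a few channel names through a standard subword tokenizer. It assumes the Hugging Face transformers package and a GPT-2 tokenizer; the exact splits depend on the tokenizer’s learned vocabulary.

```python
# Illustration of how subword tokenization can fragment channel names,
# assuming the Hugging Face transformers package and a GPT-2 tokenizer.
# The observed fragments may differ from the ["500", "p", "x"] example
# above, since each tokenizer has its own learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for channel in ["500px", "Philips Hue", "Facebook"]:
    print(channel, "->", tokenizer.tokenize(channel))
# Rare channel names tend to break into meaningless pieces, while common
# words are more likely to survive as a single token.
```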
Hybrid Prompt. The goal is to create embeddings for channels that are similar if they offer the same functionalities, such as camera or light services. In this way, during the generation phase, the model can more accurately deduce the context in which a rule is executed by considering channels similar to those involved in the rule. This is achieved by using soft prompts, where channels are represented as continuous tokens rather than discrete tokens. As an example, in Figure 2 we represent rule trigger, action, harm, justification, and prompt tokens as discrete tokens, obtaining their embeddings through the pre-trained LM embedding layer. Additionally, we consider rule channels as continuous tokens and derive their embeddings through neural word-embedding models. This results in a hybrid prompt that enables us to grasp both the rule’s execution context and its behavior.
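The following sketch shows how such a hybrid prompt could be assembled, assuming a PyTorch-based pre-trained LM with the standard Hugging Face get_input_embeddings() accessor; c2v is a hypothetical mapping from channel names to the Channel2Vec vectors introduced below, whose dimension must match the LM’s embedding size.

```python
# A sketch of hybrid-prompt assembly, assuming a PyTorch-based pre-trained
# LM. `get_input_embeddings()` is the standard Hugging Face accessor;
# `c2v` is a hypothetical dict mapping channel names to Channel2Vec
# vectors of dimension N, which must equal the LM's embedding size.
import torch

def hybrid_prompt_embeddings(lm, tokenizer, c2v, trigger_ch, action_ch, text):
    embed = lm.get_input_embeddings()
    # Discrete tokens (rule trigger/action, harm, prompt words) pass
    # through the pre-trained LM embedding layer.
    ids = tokenizer(text, return_tensors="pt").input_ids
    discrete = embed(ids)[0]                               # (num_tokens, N)
    # Channels are continuous tokens: their embeddings come from the
    # neural word-embedding model (Channel2Vec), not from the LM.
    continuous = torch.stack([
        torch.as_tensor(c2v[trigger_ch], dtype=discrete.dtype),
        torch.as_tensor(c2v[action_ch], dtype=discrete.dtype),
    ])                                                     # (2, N)
    # The channel embeddings are prepended here for simplicity; the actual
    # interleaving follows the template of Figure 2.
    return torch.cat([continuous, discrete], dim=0)
```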
Training soft prompts requires addressing two challenges [27]: (1) they cannot be randomly initialized and then optimized using stochastic gradient descent, as this may lead to sub-optimal results [1], and (2) their values should be dependent on each other. Our approach addresses the interdependence among channels by considering the functionalities they expose, and avoids the issue of random initialization by generating their embeddings using a strategy inspired by the Skip-Gram model of Word2Vec [31]. The latter is a family of shallow, two-layer neural networks that learn word embeddings for a vocabulary, ensuring that similar words have similar representations. The key concept is that a word can be represented based on its context, i.e., the words surrounding it in documents; therefore, words with similar contexts should have similar meanings.
The Skip-Gram model starts by assigning randomly generated embeddings to each word in the vocabulary. The goal is to improve these embeddings by predicting the words that appear in the context of a designated target word. To train the model, a dataset of pairs is used, where each pair consists of a target word and a context word. These pairs are created by analyzing existing documents, and the context window size determines how many words are considered before and after the target word. Specifically, the Skip-Gram model is fed with pairs (target word, context word), along with a label indicating whether the context word is contextually relevant (Positive Input Sample (PS)) or irrelevant (Negative Input Sample (NS)) to the target word. During the training process, the model works to improve its accuracy through multiple iterations and backpropagation steps. As a result, the embeddings become increasingly precise. Ultimately, the model’s predictions are discarded, and only the optimized embeddings are used.
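For reference, the following sketch reproduces the classic pair-generation procedure just described: positive samples come from the context window, and negative samples are drawn at random from the vocabulary. Function and parameter names are illustrative.

```python
# A sketch of classic Skip-Gram pair generation: positive samples (PS)
# pair a target word with the words inside its context window, and
# negative samples (NS) draw random words from the vocabulary.
import random

def skipgram_pairs(tokens, window=4, negatives_per_positive=1, seed=0):
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs = []  # (target, context, label): 1 = PS, 0 = NS
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            pairs.append((target, tokens[j], 1))
            for _ in range(negatives_per_positive):
                pairs.append((target, rng.choice(vocab), 0))
    return pairs

text = "yesterday i posted a photo of my new philips_hue bulb kit on facebook"
print(skipgram_pairs(text.split())[:4])
```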
In the considered problem, the vocabulary \(V\) contains the channel names of a TAP, and we aim to model their embeddings based on their functionalities. However, generating a labeled dataset, as with the Skip-Gram model, is not practical in this scenario: channels appearing in the same context do not necessarily share functionalities, leading to unreliable samples. An example is depicted in Figure 3, where, using a context window size of four and the source text “Yesterday I posted a photo of my new Philips Hue bulb kit on Facebook,” we get the words “photo,” “of,” “my,” “new,” “bulb,” “kit,” “on,” and “Facebook” as the context of “Philips Hue.” The pairs (Philips Hue, Facebook) and (Philips Hue, photo) are labeled as PS, even though Facebook is a social network and Philips Hue is a solution for managing lights. Similarly, when using Facebook as the target channel, the words “Philips Hue,” “bulb,” “kit,” and “on” are found in its context, generating the pair (Facebook, bulb) as a PS even though Facebook does not manage bulbs.
To categorize channels by functionality, we introduce the notion of channel context. In particular, a channel is contextual to a target channel if they share functionalities, yielding similar embeddings. Unfortunately, it is difficult to automatically determine the correct and relevant words from text descriptions, such as “post” for Facebook and “bulb” for Philips Hue. Manually defining these words is impractical due to the high number of channels and platform-specific variations. To address these issues, we propose Channel2Vec, a novel automated solution that defines trustworthy positive and negative samples (PS and NS) according to a threshold.
Channel2Vec. Figure 4 shows the architecture of Channel2Vec. It consists of four components: a module to query a virtual store, an embedding module, a similarity function paired with a threshold, and the Skip-Gram module. The concept behind the process is that channels with comparable functionalities should have similar descriptions on the digital marketplaces where they can be downloaded (such as the Google Play Store or App Store). As an example, Figure 5 shows a portion of the descriptions of the Facebook and Instagram channels available on the Google Play Store, with the similarities in text highlighted. We focus on a specific virtual store and gather information about the channels offered by a TAP, generating a collection of pairs \((n_{C_{i}},d_{C_{i}})\), where \(n_{C_{i}}\) represents the name of the TAP channel and \(d_{C_{i}}\) represents the textual description of the channel’s functionalities obtained from the virtual store.
To train the Skip-Gram model, we create \(PS\) with pairs of similar channels and \(NS\) with pairs of dissimilar channels. To this end, the descriptions are first processed through an embedding module, resulting in vector representations \(W(d_{C_{1}}),W(d_{C_{2}}),\ldots,W(d_{C_{\lvert V\rvert}})\) containing the channels’ functional semantic information. Then, the similarity of a channel pair (\(n_{C_{i}},n_{C_{j}}\)) is evaluated by means of a similarity function \(SF\) and a threshold \(\phi\in[-1,1]\). If \(SF(W(d_{C_{i}}),W(d_{C_{j}}))\) is greater than or equal to \(\phi\), the label \({y}_{n_{C_{i}},n_{C_{j}}}\) is set to 1 and the pair (\(n_{C_{i}},n_{C_{j}}\)) is added to \(PS\); otherwise, the label is set to 0 and the pair is added to \(NS\). Higher \(\phi\) values yield more reliable labels, but overly high thresholds may miss similar functionalities due to slight description differences, causing false negatives.
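A minimal sketch of this labeling step is shown below. The embedding module is instantiated with a sentence encoder from the sentence-transformers package and the similarity function \(SF\) with cosine similarity; both choices are assumptions for illustration, as the text does not fix a specific encoder.

```python
# A sketch of the PS/NS construction. The embedding module is assumed to
# be a sentence encoder (sentence-transformers) and SF is assumed to be
# cosine similarity; the text leaves both components open.
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def build_samples(channels, phi=0.5):
    """channels: list of (n_C, d_C) pairs from the virtual store; phi in [-1, 1]."""
    names = [n for n, _ in channels]
    # W(d_C): one vector per channel description.
    W = SentenceTransformer("all-MiniLM-L6-v2").encode([d for _, d in channels])
    sims = cosine_similarity(W)
    PS, NS = [], []
    for i, j in combinations(range(len(names)), 2):
        # Label y = 1 if SF(W(d_i), W(d_j)) >= phi, else y = 0.
        (PS if sims[i, j] >= phi else NS).append((names[i], names[j]))
    return PS, NS
```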
The Skip-Gram model predicts context words (channels) for a target word (channel) while modeling the vector representation of the latter. In particular, it learns an \(N\)-dimensional word embedding for the \(\lvert V\rvert\) channels, with \(N\) constrained by the pre-trained LM. The model’s input is a pair of words: one target channel \(n_{C_{t}}\) and one context channel \(n_{C_{c}}\). This pair feeds into an embedding layer of size \((\lvert V\rvert\times N)\), producing a dense embedding of size \((1\times N)\) for each channel. A merge layer computes the dot product of the two embeddings and sends the result to a dense layer, which produces the prediction \(y^{\prime}_{n_{C_{t}},n_{C_{c}}}\) as 1 or 0, depending on whether or not the channel pair is contextually relevant.
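The architecture just described can be sketched as follows, assuming PyTorch as the framework (the text does not mandate one): an embedding layer of size \((\lvert V\rvert\times N)\), a dot-product merge, and a dense layer producing \(y^{\prime}_{n_{C_{t}},n_{C_{c}}}\).

```python
# A sketch of the Channel2Vec Skip-Gram architecture, assuming PyTorch:
# an embedding layer of size (|V| x N), a dot-product merge layer, and a
# dense layer with a sigmoid producing the relevance prediction y'.
import torch
import torch.nn as nn

class Channel2VecModel(nn.Module):
    def __init__(self, vocab_size: int, n_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, n_dim)   # (|V| x N)
        self.dense = nn.Linear(1, 1)                       # dense output layer

    def forward(self, target_idx, context_idx):
        t = self.embedding(target_idx)                     # (batch, N)
        c = self.embedding(context_idx)                    # (batch, N)
        dot = (t * c).sum(dim=-1, keepdim=True)            # merge: dot product
        # y' in (0, 1); thresholding it gives the 1/0 relevance decision.
        return torch.sigmoid(self.dense(dot)).squeeze(-1)
```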
The Skip-Gram model calculates the sum of log probabilities over all pairs (\(n_{C_{t}}\), \(n_{C_{c}}\)), resulting in the binary cross-entropy loss function:
\[
\mathcal{L}(\Theta)=-\sum_{(n_{C_{t}},\,n_{C_{c}})}\Big[y_{n_{C_{t}},n_{C_{c}}}\log y^{\prime}_{n_{C_{t}},n_{C_{c}}}+\big(1-y_{n_{C_{t}},n_{C_{c}}}\big)\log\big(1-y^{\prime}_{n_{C_{t}},n_{C_{c}}}\big)\Big],
\]
where \(\Theta\) represents the model parameters. We want to maximize the likelihood of the PS, such as (Facebook, X (formerly known as Twitter)), while minimizing the likelihood of the NS, like (Facebook, Philips Hue). To this end, we leverage the sigmoid function to map the dot product of the two channel embeddings to the prediction \(y^{\prime}_{n_{C_{t}},n_{C_{c}}}\in(0,1)\), which the loss pushes toward 1 for PS and toward 0 for NS.
We compare the prediction with the actual label, compute the loss, and backpropagate the errors to adjust the weights in the embedding layer. This process is repeated cyclically on all pairs (\(n_{C_{t}}\), \(n_{C_{c}}\)) for multiple epochs, yielding a set of channel embeddings \(C2V\in\mathbb{R}^{\lvert V\rvert\times N}\). In this way, the model can learn the contextually relevant channel pairs and generate similar embeddings for channels with similar functionalities. Moreover, in the inference phase, we only need the output embeddings and can discard the model.
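The corresponding training loop, reusing the Channel2VecModel sketched above, could look as follows; hyperparameters are illustrative.

```python
# A sketch of the training loop for the Channel2VecModel sketched above;
# hyperparameters are illustrative. Only the learned embedding matrix
# C2V (|V| x N) is kept after training.
import torch
import torch.nn as nn

def train_channel2vec(model, pairs, labels, epochs=20, lr=1e-2):
    """pairs: LongTensor (num_pairs, 2) of (target, context) channel indices;
    labels: FloatTensor of 1.0 (PS) / 0.0 (NS)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()  # binary cross-entropy between y' and y
    for _ in range(epochs):
        opt.zero_grad()
        y_pred = model(pairs[:, 0], pairs[:, 1])
        loss_fn(y_pred, labels).backward()  # backpropagate into the embeddings
        opt.step()
    # Predictions are discarded; only the embeddings are used downstream.
    return model.embedding.weight.detach()  # C2V in R^{|V| x N}
```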