1 Introduction
Language is strongly dependent on both the characteristics of its writers/speakers and its context of use (e.g., time, place, scenario, intent). Although humans naturally take these factors into account, Artificial Intelligence systems may struggle to handle them properly. As a result, the development of Natural Language Processing (NLP) tools that are capable of controlling the characteristics of the generated text has become particularly appealing.
Text Style Transfer (TST) is a well-known NLP task. It focuses on changing the style attributes of a piece of text from the source style to a target one (e.g., from an informal version to its formal one) while preserving the original message conveyed by the text. Changing the text style is relevant to a wide range of real-life applications ranging from online content moderation to intelligent writing assistants [
13]. TST solutions may improve the user experience by enhancing the intelligibility and pertinence of the generated text as well as adapting the language to the current situation and writer/speaker's intent [
9]. Importantly, style transfer must be achieved with minimal changes to the text to preserve the original content as much as possible.
In this work, we address TST in an unsupervised scenario, i.e., we assume that there is a lack of parallel annotated data to train sequence-to-sequence models [
36]. The key challenges in unsupervised TST are (1) the preservation of the original content of the source text and (2) the correct identification and replacement of the stylistic elements present in the textual content. In the absence of parallel training data, disentangling style and content is known to be particularly challenging [
16]. On the other hand, unsupervised TST approaches are, broadly speaking, more resource-efficient as they do not require labor-intensive data annotation [
47].
We propose a new architecture for TST relying on
Cycle-consistent Generative Adversarial Networks (CycleGANs). CycleGANs exploit the cycle-consistency principle for self-supervised adversarial learning. In the context of TST, they have recently emerged as promising sequence-to-sequence approaches to disentangle text style and content [
12].
Existing CycleGAN-based approaches to TST face the following issues [
12]:
I1)
Self-supervision in the latent space: they encode/decode the input/output text and employ fully-connected neural networks to implement the generator and discriminator models. This makes content and style information tightly coupled in the text embedding representation, which causes issues when coping with mixed-style content, i.e., input textual sequences that are partly in the original style (e.g., informal) and partly in the target one (e.g., formal).
I2)
Recurrent networks: generators and discriminators consist of
Long Short-Term Memory (LSTM) networks, which are known to be suboptimal for coping with long-term text dependencies. Although the interest of the NLP community has already shifted toward the use of Transformer encoder-decoder architectures [
38], to the best of our knowledge, existing CycleGAN-based TST approaches do not rely on Transformers yet.
I3)
Weak enforcement of the target style in adversarial learning: since in the adversarial learning process, the discriminator distinguishes between real and fake sentences without explicitly taking the style of the generated text into account, the target style is weakly enforced.
Our approach overcomes the limitations of existing approaches by introducing the following innovative features:
—
Self-supervision at the sequence level: to overcome issue I1, it applies self-supervision, based on the cycle-consistency principle [
47], directly to the raw input sequences. During the training process, the adversarial loss ensures that the generated text is indistinguishable from the target text, whereas the cycle-consistency loss ensures that the mapping between the source and target text styles is invertible.
—
CycleGANs using Transformers: to overcome issue I2, it adopts a self-supervised approach based on CycleGANs [
47], which automatically learns the mapping between the original and target styles without the need for paired data. The proposed framework consists of two generators and two discriminators. All of them are based on the Transformer architecture [
38].
—
Classifier-guided text generation: to overcome issue I3, the CycleGAN generators leverage a pre-trained classifier performing text style prediction. The classification loss returned by the classifier is integrated into the generators’ loss functions to guide the text generation process. The style classifier is meant to guide the generators to produce text with the desired style attributes while preserving the meaning of the original content.
The empirical results achieved on benchmark TST datasets for sentiment and formality transfer show the superior performance of the proposed approach:
—
Against state-of-the-art unsupervised TST models: we compare the performance of our approach with that of recently proposed unsupervised approaches to TST, including Transformer-based and CycleGAN-based ones [
12]. The presented architecture outperforms all the tested competitors, e.g., +6.8 points of SacreBLEU on the GYAFC dataset (see Tables
4 and
5).
—
On mixed-style inputs: we run extensive experiments on TST tests suited to a mixed-style scenario. The results, exemplified in
Figure 1, confirm the superior performance of our approach while coping with mixed-style inputs.
—
Against Large Language Models (LLM): we also compare our approach with a state-of-the-art open-source LLM with a similar number of parameters, i.e., Llama2-7B [
37]. The results show that our approach performs better on average on benchmark data and is more robust than the tested LLM to mixed-style inputs.
—
In a human evaluation: we carried out a human evaluation to qualitatively assess the quality of a sample of TST outcomes. The results are consistent with the quantitative performance metrics.
As an example, the results summarized in
Figure 1 show that CycleGAN (our approach) generates output sequences that are the most syntactically similar to the expected outcome (the higher the ref-BLEU, the better) on all the tested configurations of mixed-style inputs. The mixing ratio
\(X\)%-
\(Y\)% indicates the percentage ratio of the original (
\(X\)%) and target (
\(Y\)%) style in the input. The performance of the LLM (Llama2) is closer to that of CycleGAN when there is no mix (e.g.,
\(X\approx 0\%\) or
\(Y\approx 0\%\)), whereas it is significantly worse in a mixed scenario (e.g.,
\(X=Y=50\%\)).
In summary, the novelty of our TST approach lies in (1) The application of cycle-consistency directly to the input sequence, making the approach more robust for content preservation, particularly when coping with mixed-style inputs (see the results in
Section 5.9); (2) The adoption of Transformers in a CycleGAN TST approach (see the empirical comparisons in
Section 5.5); and (3) The use of a style classifier to encourage the generators to produce text in the target style (see
Section 4.1.1 for further details).
2 Related Works
According to a recently proposed categorization [
9], existing TST methods can be classified as (1)
Parallel supervised, if they are trained on known pairs of text with different styles; (2)
Non-parallel supervised, if the style labels are available but the matching between text pairs is missing; (3)
Purely unsupervised, if the style labels are not available.
Parallel supervised or semi-supervised approaches (e.g., Wang et al. [
39], Xu et al. [
43], Shang et al. [
33]) require large-scale style-to-style parallel data, i.e., examples of parallel sentences conveying the same message with different style attributes. However, their generation is extremely labor-intensive. Conversely, non-parallel supervised approaches are trained on large text corpora annotated with style labels. Relaxing the constraint of having style-to-style text pairs makes the problem challenging yet more tractable in real scenarios. This paper falls into the latter category.
Non-parallel supervised methods need to address the following issues:
—
Content preservation: it involves maintaining the original textual content while transforming the text style. Preserving the underlying meaning, semantic information, and structural characteristics of the input text is essential to ensure the coherence and fidelity of the generated output. However, achieving effective content preservation while simultaneously changing the text style is a non-trivial task.
—
Style-content disentanglement: it refers to the process of correctly separating the style attributes from the content in the text. This disentanglement is challenging because style and content are inherently intertwined and strongly related to each other. Modifying the style of a text without altering its content requires the model to accurately identify and manipulate the style-specific attributes while keeping the underlying content intact [
16].
Style-content disentanglement can be achieved through different strategies:
—
Explicit disentanglement [
18,
42,
44]: it entails directly replacing the pieces of text that carry the original style attributes with new ones that exhibit the desired target style attribute. This approach explicitly separates style and content. However, it can be applied only when style and content can easily be separated and the style transfer can be realized by changing only a few selected words.
—
Implicit disentanglement [
5,
8,
26]: it learns two distinct latent representations, one for the content and the other for the style. By manipulating these separate representations, the model can ideally modify the style while preserving the content. Different techniques such as back-translation, attribute control generation, and adversarial training are usually adopted to realize this approach.
—
Without disentanglement [
3,
7,
23]: style and content are not explicitly separated, and the model does not distinguish between them during the style transfer process. This approach aims at seamlessly transforming the style attributes while implicitly capturing and preserving the underlying content.
In our method, we adopt a strategy without disentanglement. Recent approaches to TST without disentanglement have explored the combination of linguistic graph structures and Transformer-based architectures [
35]. An extensive review of existing techniques can be found in [
9].
Adversarial learning has already been successfully employed to model style-content disentanglement and achieved fairly good content preservation. Recent works [
1,
12,
21,
46] have already adopted
Generative Adversarial Networks (GANs) and cycle-consistency for non-parallel supervised TST. The key differences from the present work are summarized below.
—
Zhao et al. [
46] propose an encoder-decoder framework where text style and content are encoded into two distinct latent vectors (i.e., implicit disentanglement). The encoding and decoding functions are coupled with a style discrepancy loss, which models the style shift from the original domain to the target one, and with a cycle-consistency loss, which ensures content preservation. Unlike Zhao et al. [
46], our approach adopts CycleGANs [
47] and is without disentanglement.
—
Chen et al. [
1] present a GAN framework that leverages optimal transport and uses the feature mover's distance [
41] as training loss. Unlike the present work, they adopt the cycle-consistency principle only for addressing the task of unsupervised deciphering in the latent feature space, relying on LSTM networks for text generation and convolutional networks as sentence feature extractors.
—
Huang et al. [
12] adopt CycleGANs by imposing the cycle-consistent constraint in the continuous latent space. They rely on the LSTM architecture to encode/decode the input/output sequence and employ a two-layer fully connected neural network to implement the generator and discriminator models. In contrast, our proposed approach performs adversarial training on the raw text sequences and computes the cycle-consistency loss at the text level, allowing for more fine-grained control of the text style attributes.
—
Lorandi et al. [
21] focus on sentiment transfer using CycleGANs and LSTMs. In contrast, our approach explores multiple style attributes, utilizes Transformer architectures, and integrates a style classifier to enhance style transfer quality and fidelity.
Recently, some research has explored the use of LLMs to address TST. For instance, Reif et al. [
30] propose an augmented zero-shot learning strategy showing promising results on various TST tasks without requiring fine-tuning or exemplars in the target style. An empirical comparison between our method and an LLM can be found in
Section 5.6.
3 Preliminaries
In this section, we introduce the preliminary concepts and formally state the problem under consideration. For the sake of readability, the notation used throughout the section is summarized in
Table 1.
Text Style Transfer. TST aims to learn a mapping function
\(\mathcal{F}\) that transforms an input text
\(x_{A}\) with source style
\(A\) into its transferred version
\(x_{B}\) with target style
\(B\). Similarly, function
\(\mathcal{G}\) applies the reverse transformation, i.e., from
\(x_{B}\) to
\(x_{A}\). Unlike style-conditioned text generation [
14], in TST the transformation preserves the original content while transferring the style from
\(A\) to
\(B\).
Hereafter, we will consider the level of formality (i.e., formal or informal) or the sentiment score (i.e., positive or negative) as style attributes. The main TST complexity lies in the tight connection between content and style. For example, the level of formality of a piece of text is often determined not only by a particular linguistic register but also by other characteristics such as syntax and orthography. For the sake of simplicity, we also assume a binary style transfer scenario.
CycleGANs. Our goal is to address TST by leveraging CycleGANs. They are a class of GANs that can learn the mapping function between two domains without the need for parallel data [
47]. Although they were originally introduced in the field of Computer Vision, CycleGANs are general-purpose architectures that can be used to accomplish a variety of tasks, including TST. The use of CycleGANs enables the adoption of a self-supervised paradigm, relaxing the constraint on the availability of parallel textual data.
CycleGAN architectures typically comprise four models, including two generators and two discriminators. The generators learn the mapping functions while the discriminators ensure the quality of the generated outputs. In the following, we outline the general formulation of CycleGAN training objectives. For a detailed description of both the architecture and the training process specific to the task under consideration, please refer to
Section 4.
Let
\(X\) and
\(Y\) be two domains with training examples
\(x_{i}\in X\) and
\(y_{j}\in Y\). The corresponding data distributions can be denoted as
\(x\sim p_{data}(x)\) and
\(y\sim p_{data}(y)\). The generators
\(\mathcal{F}\) and \(\mathcal{G}\) aim to learn the mappings \(\mathcal{F}:X\rightarrow Y\) and \(\mathcal{G}:Y\rightarrow X\), respectively. The discriminator \(D_{Y}\) aims to distinguish between real samples \(y\) and generated samples \(\mathcal{F}(x)\). Similarly, \(D_{X}\) discriminates between \(x\) and \(\mathcal{G}(y)\).
CycleGAN training involves two objectives: adversarial losses and a cycle-consistency loss. Adversarial losses try to match the distribution of generated samples with the data distribution in the target domain. Specifically, for the mapping function \(\mathcal{F}:X\rightarrow Y\), the adversarial loss can be expressed as follows:
\[\mathcal{L}_{GAN}(\mathcal{F},D_{Y},X,Y)=\mathbb{E}_{y\sim p_{data}(y)}\left[\log D_{Y}(y)\right]+\mathbb{E}_{x\sim p_{data}(x)}\left[\log\left(1-D_{Y}(\mathcal{F}(x))\right)\right]\tag{2}\]
It is an adversarial loss since \(\mathcal{F}\) aims to minimize it against an adversary \(D_{Y}\) that tries to maximize it. Similarly, it is possible to define an adversarial loss for the mapping function \(\mathcal{G}:Y\rightarrow X\).
The cycle-consistency loss constrains the mapping functions to ensure that their sequential application to the input sample \(x\) allows for its reconstruction. Additionally, it addresses the mode collapse problem [47]. By combining the reconstruction constraint of both mapping functions, it can be defined as follows:
\[\mathcal{L}_{cyc}(\mathcal{F},\mathcal{G})=\mathbb{E}_{x\sim p_{data}(x)}\left[\left\|\mathcal{G}(\mathcal{F}(x))-x\right\|_{1}\right]+\mathbb{E}_{y\sim p_{data}(y)}\left[\left\|\mathcal{F}(\mathcal{G}(y))-y\right\|_{1}\right]\tag{3}\]
This general formulation of CycleGAN training objectives can easily be adapted to the TST task. The main difference lies in defining the two domains, \(X\) and \(Y\), as the input and target styles \(A\) and \(B\).
4 Method
Figure 2 shows a sketch of the proposed method. The objective is to learn the mapping between the two styles using two generators,
\(G_{A\rightarrow B}\) and
\(G_{B\rightarrow A}\), and two discriminators,
\(D_{A}\) and
\(D_{B}\). These components work together to learn the mapping between the source and target styles. A detailed description of the generator and discriminator characteristics is given below.
In addition to the generators and discriminators, we also use an external, pre-trained style classifier, hereafter denoted by \(SC\). This model aims to classify the style of a given text sample. During the training process of the CycleGAN model, the generators receive feedback from the style classification model on the style of the generated content. This feedback is exploited by the generators to effectively produce text pieces with the desired style attribute.
4.1 Generator
The purpose of the generators \(G_{A\rightarrow B}\) and \(G_{B\rightarrow A}\) is to learn the transformation between the source and target pieces of texts. A modification of a specific text attribute, such as style or sentiment, must preserve the original content. The generator \(G_{A\rightarrow B}\) takes a sequence of tokens, \(x_{A}=(x_{A,1},x_{A,2},\ldots,x_{A,N})\), as input, where \(x_{A,n}\) is the \(n\)-th token in the sequence. The output of the generator is a sequence of tokens, \(y_{B}=(y_{B,1},y_{B,2},\ldots,y_{B,M})\), where \(y_{B,m}\) is the \(m\)-th token in the output sequence. It is worth noting that the lengths of the input and output sequences may be different (i.e., \(N\neq M\)). The generator \(G_{B\rightarrow A}\) handles a similar process, transforming the piece of text written in style B into a text written in style A. The key point is that the input and output sequences are not paired, and the model cannot be trained based on prior knowledge of the expected output.
The proposed method involves two cycles, \(A\rightarrow B\rightarrow A\) and \(B\rightarrow A\rightarrow B\), which operate as follows. For the cycle \(A\rightarrow B\rightarrow A\), the generator \(G_{A\rightarrow B}\) is trained to predict the output sequence \(y_{B}\) given the input sequence \(x_{A}\). The output of this generator is then fed to the discriminator \(D_{B}\), which aims at distinguishing between samples drawn from the original distribution and those generated by the generator \(G_{A\rightarrow B}\), which transfers the input text's style to the target style. The output of the generator is also fed back to the generator \(G_{B\rightarrow A}\), which transforms the style of the generated text back to the original style. The output of the second generator, \(y_{A}=G_{B\rightarrow A}(y_{B})\), corresponds to the reconstructed text that should be as close as possible to the original input text.
The generators
\(G_{A\rightarrow B}\) and
\(G_{B\rightarrow A}\) are trained to minimize the following loss functions:
\[\mathcal{L}_{G_{A\rightarrow B}}=\lambda_{gen}\,\mathcal{L}_{G_{D_{B}}}+\lambda_{cyc}\,\mathcal{L}_{cyc_{A\rightarrow B\rightarrow A}}+\lambda_{style}\,\mathcal{L}_{style_{B}}\tag{4}\]
\[\mathcal{L}_{G_{B\rightarrow A}}=\lambda_{gen}\,\mathcal{L}_{G_{D_{A}}}+\lambda_{cyc}\,\mathcal{L}_{cyc_{B\rightarrow A\rightarrow B}}+\lambda_{style}\,\mathcal{L}_{style_{A}}\tag{5}\]
Here,
\(\mathcal{L}_{G_{D_{B}}}\) (illustrated in point 1 in
Figure 2) and
\(\mathcal{L}_{G_{D_{A}}}\) are the adversarial losses (see Equation
(2)) and represent the feedback from the corresponding discriminator (i.e., the extent to which the generator is able to generate text that is indistinguishable from the target), whereas
\(\mathcal{L}_{cyc_{A\rightarrow B\rightarrow A}}\) (illustrated in point 3 in
Figure 2) and
\(\mathcal{L}_{cyc_{B\rightarrow A\rightarrow B}}\) are the cycle-consistency losses (see Equation
(3)) computed at the end of the corresponding cycle by comparing the output of the second generator to the input sequence. More formal definitions of the discriminator and cycle losses are available in
Sections 4.2 and
4.3, respectively.
\(\mathcal{L}_{style_{B}}\) and
\(\mathcal{L}_{style_{A}}\) are the style classifier losses that are computed using the pre-trained style classifier. These components of the loss function (represented in point 2 in
Figure 2) aim at ensuring that the generators are able to generate text that is consistent with the target style.
\(\mathcal{L}_{style_{B}}\) and
\(\mathcal{L}_{style_{A}}\) correspond to the binary cross-entropy loss between the predicted style and the target style (known according to the transformation being learned). The classifier-guided loss is computed using the pre-trained style classifier, but only the generator is updated using this loss (i.e., the pre-trained style classifier is not trained during the adversarial training process). The classifier-guided loss can be formalized as follows:
\[\mathcal{L}_{style}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log SC(x_{i})+\left(1-y_{i}\right)\log\left(1-SC(x_{i})\right)\right]\]
where
\(N\) is the number of samples in the batch,
\(x_{i}\) is the input sequence,
\(y_{i}\) is the target label (i.e., 1 for style B and 0 for style A), and
\(SC(x_{i})\) is the output of the style classifier (see
Section 4.1.1 for further details) for the input sequence
\(x_{i}\) classified using the
[CLS] token. The loss is calculated by taking the average over all samples in the batch.
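To make this concrete, the following sketch (a minimal PyTorch/Hugging Face example under our own naming; the checkpoint identifier is a placeholder for any fine-tuned binary style classifier) freezes the classifier and computes the BCE between its [CLS]-based prediction and the target style:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint: any BERT-like model fine-tuned for binary style
# classification can play the role of SC.
sc = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
sc.eval()
for p in sc.parameters():
    p.requires_grad_(False)  # SC is frozen: only the generators are updated by this loss

def classifier_guided_loss(input_ids, attention_mask, target_label):
    """BCE between the style predicted from the [CLS] token and the target style.

    target_label: 1.0 when the generator should produce style B, 0.0 for style A.
    """
    logits = sc(input_ids=input_ids, attention_mask=attention_mask).logits.squeeze(-1)
    targets = torch.full_like(logits, target_label)
    return F.binary_cross_entropy_with_logits(logits, targets)
```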
A different hyperparameter is associated with each of the three loss components in Equations
(4) and (
5): specifically,
\(\lambda_{gen}\),
\(\lambda_{cyc}\), and
\(\lambda_{style}\) respectively control the relative importance of
\(\mathcal{L}_{G_{D_{B}}}\) (
\(\mathcal{L}_{G_{D_{A}}}\)),
\(\mathcal{L}_{cyc_{A\rightarrow B\rightarrow A}}\) (
\(\mathcal{L}_{cyc_{B\rightarrow A\rightarrow B}}\)) and
\(\mathcal{L}_{style_{B}}\) (
\(\mathcal{L}_{style_{A}}\)).
The reverse cycle \(B\rightarrow A\rightarrow B\) proceeds analogously: the generators and discriminators swap roles while learning the transformation between the source and target styles.
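Putting the pieces together, the following PyTorch-style sketch outlines one generator update for the cycle \(A\rightarrow B\rightarrow A\) (Equation (4)). It is illustrative only: `G_ab` and `G_ba` are assumed to be Hugging Face seq2seq models, `D_b` and `SC` callables returning one logit per sequence, and propagating gradients through the discrete `generate` step requires additional care that is omitted here.

```python
import torch
import torch.nn.functional as F

def generator_loss_a_to_b(x_a, G_ab, G_ba, D_b, SC,
                          lam_gen=1.0, lam_cyc=10.0, lam_style=1.0):
    """Combined generator loss for one A -> B -> A cycle (cf. Equation (4))."""
    # Step 1: transfer A -> B with the first generator.
    y_b = G_ab.generate(x_a, max_length=64)

    # Adversarial term (point 1 in Figure 2): D_B should judge y_b as real.
    d_logits = D_b(y_b).squeeze(-1)
    loss_gen = F.binary_cross_entropy_with_logits(d_logits, torch.ones_like(d_logits))

    # Style term (point 2 in Figure 2): SC should classify y_b as style B.
    s_logits = SC(y_b).squeeze(-1)
    loss_style = F.binary_cross_entropy_with_logits(s_logits, torch.ones_like(s_logits))

    # Cycle term (point 3 in Figure 2): transform y_b back to style A; the
    # original input x_a acts as the expected output (teacher-forced CE).
    loss_cyc = G_ba(input_ids=y_b, labels=x_a).loss

    return lam_gen * loss_gen + lam_cyc * loss_cyc + lam_style * loss_style
```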
4.1.1 Style Classifier.
The aim of the style classifier loss is to ensure the alignment with the target style, complementing content preservation and style transfer produced by the cycle consistency loss. It provides tailored guidance for accurate style transformation. Importantly, the discriminator, described in
Section 4.2, distinguishes between real and fake sequences without taking the style of the generated output into account. Although it identifies out-of-distribution samples, the adversarial learning process weakly enforces the target style of the output text. To overcome this issue, the style classifier aims to provide explicit feedback to the generator on the style quality of the generated texts, thus mitigating the limitations of adversarial learning in TST.
4.2 Discriminator
The discriminators
\(D_{A}\) and
\(D_{B}\) are responsible for distinguishing between real and generated text. In line with the original GAN framework [
6], the discriminators are trained to maximize the probability of correctly classifying real and generated text. Specifically,
\(D_{A}\) is trained to distinguish between the source texts and the output of the generator
\(G_{B\rightarrow A}\), where the source texts are samples drawn from the source distribution. Similarly,
\(D_{B}\) is trained to distinguish between the target texts, which are samples drawn from the target distribution, and the output of the generator
\(G_{A\rightarrow B}\). The discriminators are trained to minimize the following loss functions:
\[\mathcal{L}_{D_{A}}=\mathcal{L}_{D_{A}}^{real}+\mathcal{L}_{D_{A}}^{fake},\qquad\mathcal{L}_{D_{B}}=\mathcal{L}_{D_{B}}^{real}+\mathcal{L}_{D_{B}}^{fake}\]
where
\(\mathcal{L}_{D_{A}}^{real}\) and
\(\mathcal{L}_{D_{B}}^{real}\) denote the losses computed using data sampled from the source domain and the target domain, respectively, whereas
\(\mathcal{L}_{D_{A}}^{fake}\) and
\(\mathcal{L}_{D_{B}}^{fake}\) denote the losses computed using the output sequences of the generators
\(G_{B\rightarrow A}\) and
\(G_{A\rightarrow B}\), respectively. The weight of the discriminator losses in the overall objective function is controlled by a hyperparameter
\(\lambda_{dis}\).
Each term of the discriminator loss (i.e.,
\(\mathcal{L}_{D_{A}}^{real}\),
\(\mathcal{L}_{D_{A}}^{fake}\),
\(\mathcal{L}_{D_{B}}^{real}\), and
\(\mathcal{L}_{D_{B}}^{fake}\)) is defined as a Binary Cross-Entropy loss which, for a given discriminator
\(D\), is given by:
\[\mathcal{L}_{D}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log D(x_{i})+\left(1-y_{i}\right)\log\left(1-D(x_{i})\right)\right]\]
where
\(N\) is the number of samples in the batch,
\(x_{i}\) is the input sequence,
\(y_{i}\) is the target label (i.e., 1 for real text and 0 for generated text), and
\(D(x_{i})\) is the output of the discriminator for the input sequence
\(x_{i}\) classified using the
[CLS] token. The loss is calculated by taking the average over all samples in the batch.
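A minimal sketch of one discriminator term follows (assuming, as above, a discriminator that returns one logit per sequence from its [CLS] token):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_ids, fake_ids):
    """L_D = L_D^real + L_D^fake for one discriminator (e.g., D_B).

    real_ids: sequences sampled from the corresponding data distribution;
    fake_ids: token ids produced by the opposite-direction generator (no
    gradient flows back through them, so only the discriminator is updated).
    """
    real_logits = D(real_ids).squeeze(-1)
    fake_logits = D(fake_ids).squeeze(-1)
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake
```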
The adversarial losses computed using the output of the discriminators are back-propagated to the generators, allowing them to learn to generate text that is consistent with the data sampled from the target domain. By utilizing Transformer-based models as discriminators, we can efficiently learn the text style consistency within the target domain, thus improving the overall effectiveness of the training process.
4.3 Cycle Consistency
The goal of the proposed method is to learn the mapping between the source and target domains. In
Figure 2 we illustrate the process for the case in which the source domain is
\(A\) while
\(B\) is the target domain; the process is analogous in the opposite direction. Given an input sequence
\(x_{A}\) in the source domain, the generator
\(G_{A\rightarrow B}\) is trained to generate a sequence
\(y_{B}\) in the target domain. However, due to the lack of parallel annotated data in the target domain during training, the generator
\(G_{A\rightarrow B}\) is unable to directly learn the mapping between
\(x_{A}\) and
\(y_{B}\). To address this issue, the cyclic architecture first generates a sequence
\(y_{B}\) in the target domain and then transforms it back to the source domain using the generator
\(G_{B\rightarrow A}\). The output of such a generator,
\(y_{A}\), is then compared to the input sequence
\(x_{A}\) using a cycle-consistency loss. This loss is computed using the cross-entropy loss between the output of the second generator and the input sequence and is used to train the generator
\(G_{B\rightarrow A}\).
Specifically, each generator is a sequence-to-sequence model that is trained to minimize the cross-entropy loss between the generated sequence and the target sequence, defined as follows:
\[\mathcal{L}_{cyc}=-\frac{1}{T_{\text{total}}}\sum_{n=1}^{N}\sum_{t=1}^{T}\log p_{t|t-1}(y_{nt})\]
where
\(N\) is the number of samples in the batch,
\(T_{\text{total}}\) is the total number of tokens across all samples in the batch,
\(T\) is the length of the sequence,
\(y_{nt}\) is the target token at position
\(t\) in the sequence
\(n\), and
\(p_{t|t-1}\) is the probability of the token at position
\(t\) given the previous tokens in the sequence. The loss is calculated by taking the average over all samples in the batch.
Given the self-supervised nature of our method, the target sequence is not available during the initial transformation from the source domain to the target domain (i.e., \(A\rightarrow B\)). The subsequent transformation from the target domain back to the source domain (i.e., \(B\rightarrow A\)) aims to reconstruct the original input sequence. At this stage, the expected output is the input sequence \(x_{A}\), which can be used to compute the cycle-consistency loss. Therefore, the cycle-consistency loss is computed using both the output of the generator \(G_{B\rightarrow A}(y_{B})=y_{A}\) and the input sequence \(x_{A}\) (for the cycle \(A\rightarrow B\rightarrow A\)). A similar process occurs for the cycle \(B\rightarrow A\rightarrow B\).
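In code, the cycle-consistency loss reduces to a standard token-level cross-entropy in which the original input plays the role of the target; the sketch below (our own helper, assuming logits from the second generator and padded batches) masks padding so that the normalization matches \(T_{\text{total}}\):

```python
import torch.nn.functional as F

def cycle_consistency_loss(logits, target_ids, pad_token_id):
    """Cross-entropy between the reconstruction and the original input.

    logits: (batch, seq_len, vocab) scores from G_{B->A} in the A -> B -> A cycle;
    target_ids: (batch, seq_len) token ids of the original input x_A.
    """
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.reshape(-1, vocab_size),  # (batch * seq_len, vocab)
        target_ids.reshape(-1),          # (batch * seq_len,)
        ignore_index=pad_token_id,       # padding does not count toward T_total
        reduction="mean",                # average over the T_total real tokens
    )
```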
4.4 Objective Function
The full objective function is a combination of various loss functions, with each component contributing to a specific aspect of the TST task. The loss functions include the generator loss, the cycle-consistency loss, the style loss, and the discriminator loss, each weighted by a hyperparameter
\(\lambda\). The final formulation can be expressed as follows:
\[\mathcal{L}=\lambda_{gen}\left(\mathcal{L}_{G_{D_{A}}}+\mathcal{L}_{G_{D_{B}}}\right)+\lambda_{dis}\left(\mathcal{L}_{D_{A}}+\mathcal{L}_{D_{B}}\right)+\lambda_{cyc}\left(\mathcal{L}_{cyc_{A\rightarrow B\rightarrow A}}+\mathcal{L}_{cyc_{B\rightarrow A\rightarrow B}}\right)+\lambda_{style}\left(\mathcal{L}_{style_{A}}+\mathcal{L}_{style_{B}}\right)\]
It includes the adversarial losses for both generators (\(G_{A\rightarrow B}\) and \(G_{B\rightarrow A}\)), and discriminators (\(D_{A}\) and \(D_{B}\)), the cycle-consistency losses for both style transfer directions, and the style losses for both domains. Additionally, weighting factors (\(\lambda_{gen}\), \(\lambda_{dis}\), \(\lambda_{cyc}\) and \(\lambda_{style}\)) are used to balance the importance of each component in the overall objective.
It is worth noting that while it is possible to have separate weighting factors for each direction, we employ identical weighting hyperparameters for both style transfer directions to maintain simplicity and minimize the complexity of configuration options. This choice allows us to avoid the need for justifying or making any prior assumption concerning distinct values for each direction. By ensuring uniformity in the weighting factors, we establish a balanced optimization process that treats both directions equally. The proposed implementation can easily be extended to accommodate different weighting factors for each style transfer direction if required by the specific use case.
Finally, we formulate the overall optimization problem as follows:
\[G_{A\rightarrow B}^{*},\,G_{B\rightarrow A}^{*}=\arg\min_{G_{A\rightarrow B},\,G_{B\rightarrow A}}\;\max_{D_{A},\,D_{B}}\;\mathcal{L}\]
which expresses the min-max game played between each pair of generator-discriminator models [
47].
4.5 Extension to Multiple Styles
The current approach is designed to handle only a specific pair of source and target styles. A straightforward method to handle more than two styles is to train separate pairwise TST architectures. However, this leads to scalability issues. An alternative, more efficient solution is to prompt the generator with specific instructions [
2] indicating the desired style transformation. For example, by using purposefully crafted prompt tokens like
[A -> B] for converting text from style A to style B, or
[A -> C] for converting text from style A to style C, each generator can be trained to handle multiple style conversions. This approach maintains the self-supervised nature of the architecture, enabling generators to convert from any style to any other style. However, implementing this method requires careful consideration of model training and architecture adjustments, which are beyond the scope of the current work (see
Section 6 for a discussion of future research lines).
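For illustration, such direction prompts could be prepended to the source text before tokenization (a hypothetical helper; the exact token format is a design choice):

```python
def add_direction_prompt(text: str, src_style: str, tgt_style: str) -> str:
    """Prefix a direction token so a single generator can serve multiple style pairs."""
    return f"[{src_style} -> {tgt_style}] {text}"

print(add_direction_prompt("hey, what's up?", "A", "B"))
# [A -> B] hey, what's up?
```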
5 Experimental Evaluation
We evaluate the performance of the proposed method and compare it against recent TST approaches on benchmark data. We also perform various ablation studies to evaluate the following aspects: cycle-consistency in the latent space, impact of the cycle-consistency loss coefficient, and effect of the pre-trained style classifier.
To foster the reproducibility of our results, the models and code used for the implementation of the proposed framework are publicly available, for research purposes only, at
https://github.com/gallipoligiuseppe/TST-CycleGAN under the license CC BY-NC-SA.
5.1 Datasets
We consider three benchmark datasets related to two different TST tasks, i.e., sentiment transfer and formality transfer.
Sentiment Transfer. The Yelp dataset [
34] collects restaurant reviews. Based on their rating, reviews are labeled as positive or negative (a rating of 4 or 5 corresponds to a positive label, whereas a rating below 3 is negative). The dataset includes a test set with four human references per sentence. Train and validation sets are suited to non-parallel supervised TST as they are annotated with style attributes, but the matching between text pairs is missing. For the sake of reproducibility, we use the same train/validation/test splits as in Li et al. [
18] (see
Table 2).
Formality Transfer.
Grammarly's Yahoo Answers Formality Corpus (GYAFC) [
28] is a parallel dataset consisting of informal-to-formal sentence pairs. It comprises sentences from two different domains, i.e., Family & Relationships (family, in short) and Music & Entertainment (music, in short). Although the dataset includes parallel sentences, to simulate the scenario of self-supervised style transfer, we select only the source sentences from the train set. The validation and test sets, on the other hand, include annotated sentences for both domains and are used to evaluate the performance of the proposed method (see
Table 3).
5.2 Metrics
We evaluate our model using a suite of established evaluation metrics [
7,
23,
31]. Specifically, to quantify content preservation, we compute the SacreBLEU score [
25] between the system outputs and the four human references.
To evaluate the effectiveness of our approach for TST, we fine-tune a BERT-base binary classifier [
4] to compute the style accuracy metric. To distinguish it from the style classifier used during model training, hereafter, we will refer to it as the
oracle classifier. Its purpose is to accurately classify the style of the input text and provide a reliable evaluation metric for the quality of the generated text's style transfer. On the analyzed datasets, the oracle classifier achieves accuracies of 98.5% (Yelp), 94.0% (GYAFC-family), and 94.6% (GYAFC-music). To compute the style accuracy, in line with prior works, we also train a TextCNN [
15] as an oracle classifier (in addition to the BERT-base classifier). Its classification performance is generally satisfactory on all the tested datasets (96.5% on Yelp, 93.2% on GYAFC-family, 93.8% on GYAFC-music) and comparable to that of the BERT-base model. Both the BERT-base and TextCNN classifiers were trained and tested on the training and test splits of the same TST datasets under analysis.
Finally, we also provide a comprehensive performance score by computing the geometric mean (GM) and harmonic mean (HM) of the SacreBLEU and style accuracy scores.
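As a sketch of how these metrics are combined (using the sacrebleu package; the aggregation mirrors the description above, although our exact evaluation scripts may differ):

```python
from statistics import geometric_mean  # Python 3.8+
import sacrebleu

def overall_scores(system_outputs, reference_sets, style_accuracy):
    """Combine content preservation and style accuracy into GM and HM.

    system_outputs: list of generated sentences;
    reference_sets: list of reference lists, one per human reference set
                    (e.g., four parallel lists for the Yelp test set);
    style_accuracy: share of outputs the oracle classifier assigns to the
                    target style, on the same 0-100 scale as SacreBLEU.
    """
    bleu = sacrebleu.corpus_bleu(system_outputs, reference_sets).score
    gm = geometric_mean([bleu, style_accuracy])
    hm = 2 * bleu * style_accuracy / (bleu + style_accuracy)
    return bleu, gm, hm
```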
5.3 Configuration Settings
We implemented the proposed architecture using the Hugging Face Transformers library [
40]. As reference models for the generators and discriminators, we consider BART-base [
17] (140M parameters) and DistilBERT [
32] (66M parameters), respectively. For the Yelp dataset, we use the case-insensitive variant of DistilBERT, whereas for GYAFC we use the case-sensitive version as the input data contains case-sensitive text. We also run experiments with larger generator models, prioritizing the use of more powerful models for the most challenging (and resource-demanding) text generation task. Specifically, for the generators, we also consider BART-large (400M parameters) and T5 [
27] (with 60M, 220M, and 770M parameters for the small, base, and large versions, respectively). As style classifier, we use a fine-tuned BERT-base (110M parameters) model. Note that although we use the same model as for the
oracle classifier, this choice is not mandatory. The proposed TST architecture can be trained end-to-end, enabling the simultaneous learning of the style transfer functions in both directions.
We use the validation SacreBLEU score as the reference metric to identify the best-performing training configurations and the optimal model checkpoints. Then, we evaluate them on the test set. The SacreBLEU score is calculated separately for each TST direction, and the average of these two values is used as an overall score. Note that for the Yelp dataset human references are not available for the validation set. To overcome this issue, we optimize the GM of the SacreBLEU score calculated between the system outputs and the source sentences, as well as the style accuracy.
Training Details. Similar to [
12], we train the model in a self-supervised setting for both tasks, even though the GYAFC dataset is a parallel corpus. Thus, for our purposes, the alignments between sentence pairs are neglected.
Based on our preliminary experiments, we observed that the impact of the hyperparameters \(\lambda_{gen}\), \(\lambda_{dis}\), and \(\lambda_{style}\) is negligible. Therefore, for the sake of simplicity, hereafter, we will set \(\lambda_{gen}=\lambda_{dis}=\lambda_{style}=1\). We tune the following two hyperparameters: the learning rate and the loss scaling factor \(\lambda_{cyc}\). The learning rate, which controls the magnitude of the weight updates during training, is updated using a linear scheduler, which linearly decreases the learning rate from a maximum value, as reported in the Appendix, to zero during the training process. Meanwhile, the \(\lambda_{cyc}\) hyperparameter controls the weight of the cycle-consistency loss in the overall objective function.
Given the computational demands of training such models and to reduce the number of configurations to be tested, we explore the hyperparameter space by considering values in the range \(\left[10^{-5},10^{-3}\right]\) for the learning rate and \(\left\{0.1,1,10\right\}\) for the loss scaling factor \(\lambda_{cyc}\). The optimal hyperparameter values used throughout the experiments are reported in the Appendix. It is worth noting that the selection of appropriate hyperparameters may affect the performance of the model. Moreover, these hyperparameters were found to be optimal for the specific datasets and models used in our experiments, and they may not necessarily generalize to other datasets or models. The optimal values were determined through a combination of manual tuning and grid search by evaluating the model's performance under various hyperparameter combinations on the validation sets.
To balance computational efficiency and model performance, we set the maximum input sequence length to 64 since the average number of tokens is 8.88
\(\pm\) 3.64 and 10.68
\(\pm\) 4.12 for the Yelp and GYAFC datasets, respectively, and the batch size to 128 for BART-base, 32 for BART-large, 128 for T5-small, 64 for T5-base, and 8 for T5-large. We employ the AdamW optimizer [
22] with
\(\beta_{1}\) 0.9,
\(\beta_{2}\) 0.999, and weight decay
\(10^{-2}\).
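A minimal instantiation of this setup with the Hugging Face library might look as follows (checkpoint names are the public hub identifiers; the learning rate is one point in the tuned range, and the number of training steps is a placeholder):

```python
import torch
from transformers import (AutoModelForSeq2SeqLM,
                          AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

# Generators: BART-base; discriminators: DistilBERT (cased variant for GYAFC).
gen_ab = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
gen_ba = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
disc_a = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=1)
disc_b = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=1)

params = (list(gen_ab.parameters()) + list(gen_ba.parameters())
          + list(disc_a.parameters()) + list(disc_b.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-4,  # within the tuned range [1e-5, 1e-3]
                              betas=(0.9, 0.999), weight_decay=1e-2)

total_steps = 10_000  # placeholder: depends on dataset size, batch size, and epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=total_steps)  # linear decay to zero
```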
To ensure consistent experimental conditions and hardware utilization, we utilize a single NVIDIA® V100 GPU with 32 GB of VRAM for both training and inference of all models.
5.4 Baselines
Existing Unsupervised TST Methods. We test the following methods: RetrieveOnly, DeleteOnly, DeleteAndRetrieve, and TemplateBased [
18], BackTranslation [
26], StyleEmbedding and MultiDecoder [
5], CrossAlignment [
34], UnpairedTranslation [
42], UnsupervisedMT [
45], DualRL [
23], NASTLatentLearn [
11], DeepLatent [
7], ReinfRewards [
31], MixAndMatch [
24], MultiClass [
3], FineGrainedST [
19], LevenshteinEdit [
29], GTAE [
35], CycleAutoEncoder [
12], and TextGANPG [
21].
Notice that we disregard existing supervised approaches to formality transfer because we deem their comparison with unsupervised methods unfair.
Cycle-Consistency in the Latent Space. A key property of our approach is that it performs style transfer directly at the sequence level. Conversely, previous CycleGAN-based TST approaches apply transformations on the latent space. To evaluate our method's effectiveness against this approach, we explore two variants that conduct style transfer in the latent space. In these variants, we leverage the embedding space for style transfer. We decompose the generator network into encoder
\(E\) and decoder
\(D\) components, introducing two baseline models:
(1)
Sentence-level: this approach focuses on aligning representations generated by the encoders \(E_{A}\) and \(E_{B}\) with each other. Considering the case \(A\rightarrow B\rightarrow A\), this is achieved by minimizing the L1 loss between the embeddings of the input sequence encoded by \(E_{A}\) and its corresponding version, predicted in the target style, and then encoded by \(E_{B}\). To obtain sentence representations from token representations, we use average pooling. The rationale behind this approach is to ensure that the semantic meaning of the input sequence is preserved while transferring its style.
(2)
Token-level: in this baseline model, the focus is on preserving the content of the text at the token level by maximizing the similarity between the original input and the reconstructed output. To achieve this, it minimizes the L1 loss between the token embeddings of the input sequence and its reconstruction. The token-level embeddings for the input sequence are obtained immediately after tokenization, before being fed into the model. The reconstructed tokens are taken from the output of the decoder after completing the cycle (i.e., \(A\rightarrow B\rightarrow A\)) in the CycleGAN architecture. A sketch of both latent-space losses is given below.
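The two latent-space objectives can be sketched as follows (assuming Hugging Face-style encoders exposing last_hidden_state, an embedding layer `embed`, and batches padded to a common length; all names are ours):

```python
import torch.nn.functional as F

def sentence_level_loss(E_a, E_b, x_a_ids, y_b_ids):
    """L1 between mean-pooled sentence embeddings of the input (encoded by E_A)
    and of its predicted target-style version (encoded by E_B)."""
    h_a = E_a(x_a_ids).last_hidden_state.mean(dim=1)  # average pooling over tokens
    h_b = E_b(y_b_ids).last_hidden_state.mean(dim=1)
    return F.l1_loss(h_a, h_b)

def token_level_loss(embed, x_a_ids, y_a_ids):
    """L1 between the token embeddings of the input and of its reconstruction
    after the full A -> B -> A cycle (embeddings taken right after tokenization)."""
    return F.l1_loss(embed(x_a_ids), embed(y_a_ids))
```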
To assess the performance of the two described latent-based variants of our approach, we present experimental results in
Section 5.7.
5.5 Evaluation and Comparison
Here, we summarize the main results of the empirical evaluations and performance comparisons separately for each style transfer domain. We also conduct a qualitative analysis of the generated outputs, whose results are provided in the Appendix.
Formality Transfer. Tables
4 and
5 report the performance of our method variants (denoted by the prefix name
CycleGAN) and the baselines on the family and music domains of the GYAFC dataset, respectively. The music domain has been shown to be more challenging and, in general, less explored by previous TST studies than the family one. In both domains, our approach based on
T5 large performs best in terms of SacreBLEU scores compared to all the tested prior works. More specifically, CycleGAN outperforms the other approaches in terms of ref-BLEU score (+6.8 vs. the best-performing competitor), showing a higher capability of content preservation and a better fluency of the generated text. Conversely, models achieving the highest accuracy scores significantly perturb the original content as the corresponding ref-BLEU scores are fairly low, resulting in a less faithful reproduction of the original meaning. Instead, the proposed approach achieves the best balance between content preservation and style transfer. Among the tested CycleGAN variants, those relying on larger generator models produce, as expected, consistently better ref-BLEU results than the other ones.
More detailed results on the most common formality transfer case, i.e., from informal to formal style, are given in the Appendix. The results confirm the superior performance of CycleGAN T5 large compared to all the other methods (e.g., ref-BLEU +34.1 against ReinfRewards on GYAFC-music).
Sentiment Transfer.
Table 6 reports the results of our method variants (denoted by the prefix name
CycleGAN) and the baselines on the Yelp dataset. Our method outperforms all the other methods in terms of the average SacreBLEU metric using the BART large model (CycleGAN 56.5 vs. 54.9 for the best-performing competitor). The better ability to preserve the original content is partly offset by a lower style accuracy, which is, however, less critical for the sentiment transfer task (e.g., CycleGAN
\(\approx\)75% accuracy in sentiment transfer vs.
\(\approx\)50% in formality transfer). In fact, sentiment transfer commonly requires minimal modifications of the text to change its polarity.
In general, we claim that our model is able to achieve better content preservation thanks to the cycle-consistent structure of our architecture which is instrumental in preventing inappropriate or unnecessary modifications to the input text.
5.6 Formality and Sentiment Transfer with an LLM
We perform an empirical comparison between our approach and a state-of-the-art open-source LLM, i.e., the Llama2 model [
37]. Specifically, we consider the 7B version to ensure a fair comparison in terms of model size with the proposed architecture.
We employ it in both zero-shot and few-shot settings: in the latter case, we experiment with a varying number of examples
\(k\in\{1,3,5,10\}\) provided as input to the model. Few-shot examples consist of both the input sentence and the corresponding expected output in the target style. Examples are randomly selected from the parallel training sets for the formality transfer datasets, whereas in the case of sentiment transfer, since no parallel data is available, we manually annotate the expected outputs for the selected examples.
Based on preliminary experiments, we set the model's temperature hyperparameter at \(0.6\). We provide the LLM with the following prompt:
Transform the following sentence from [SRC] style to [TGT] style.
Apply only minimal changes and preserve the meaning of the sentence.
Here you can find some examples of sentences in [SRC] style and corresponding
sentences in [TGT] style:
Input ([SRC] style): [SRC_EXi]
Output ([TGT] style): [TGT_EXi]
...
Input ([SRC] style): [SRC_INPUT]
Output ([TGT] style):
where we replace [SRC] and [TGT] with the actual source and target styles, [SRC_EXi] and [TGT_EXi] with the source and target sentences for each of the \(k\) examples (for \(k \gt 0\)), and [SRC_INPUT] with the current test sample to be transferred.
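For reference, the prompt can be assembled with a straightforward helper reproducing the template above (variable names are ours):

```python
def build_llm_prompt(src_style, tgt_style, examples, src_input):
    """examples: list of (source, target) sentence pairs; empty for zero-shot."""
    lines = [
        f"Transform the following sentence from {src_style} style to {tgt_style} style.",
        "Apply only minimal changes and preserve the meaning of the sentence.",
    ]
    if examples:
        lines.append(f"Here you can find some examples of sentences in {src_style} style "
                     f"and corresponding sentences in {tgt_style} style:")
        for src_ex, tgt_ex in examples:
            lines.append(f"Input ({src_style} style): {src_ex}")
            lines.append(f"Output ({tgt_style} style): {tgt_ex}")
    lines.append(f"Input ({src_style} style): {src_input}")
    lines.append(f"Output ({tgt_style} style):")
    return "\n".join(lines)
```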
Table 7 reports the results achieved for both the formality and sentiment transfer tasks, while more detailed results for the informal-to-formal transfer can be found in the Appendix. In both tasks, the ref-BLEU and accuracy performance generally increase as more input examples are provided, until reaching a steady state. This plateau is probably due to the fact that providing numerous examples may introduce noise, potentially misleading the model. Surprisingly, the ref-BLEU results on the Yelp dataset for
\(k=1,3\) are worse than in the zero-shot setting. One possible explanation is that, since in the sentiment transfer task style can often be transferred by modifying only a few words, in the zero-shot setting the model may tend to apply fewer modifications, resulting in a higher ref-BLEU score but lower accuracy. In the 1- or 3-shot settings, the accuracy increases at the expense of a lower ref-BLEU score. This is likely because the model requires more examples to adhere to the requirement of applying only minimal changes to the input sentences.
By comparing LLM results with those of our best models, we can state that our method consistently outperforms Llama2 in terms of content preservation on both tasks (i.e., +8.2 on GYAFC-family, +5.5 on GYAFC-music, +3.1 on Yelp). In contrast, Llama2 achieves the highest accuracy scores, even when compared to the other baselines in the formality transfer task. The results highlight that the TST performance of Llama2 is fair without ad hoc fine-tuning. It is also worth noting that model fine-tuning would require parallel data and thus is out of the scope of the present work. In conclusion, our proposed approach confirms its superior performance in content preservation, even when compared to a larger and more powerful LLM.
5.7 Cycle-Consistency in the Latent Space: Sentence-Level vs. Token-Level
In this section, we present the results of an ablation study conducted to compare the performance of the latent-based versions of our approach (described in
Section 5.4). The purpose is to quantitatively compare these model variants with the proposed solution, which enforces the cycle-consistency constraint directly on the raw input sequences. To better isolate the effect of the cycle-consistency loss, we exclude the pre-trained style classifier and its corresponding loss term. Additionally, to ensure a fair comparison, we use the same generator model that achieved the best results in the corresponding task (i.e., T5 large for formality transfer and BART large for sentiment transfer).
The results of both tasks are shown in
Table 8. As can be seen, the non-latent version (i.e., applying cycle-consistency on the raw input sequence) significantly outperforms both latent-based versions in terms of GM and HM on both tasks. Upon closer inspection, in the formality transfer task, our approach achieves the best ref-BLEU scores, exhibiting an improvement of more than +35.0 points on both domains. Considering style accuracy, the latent token-level version performs the best on the GYAFC-music dataset. However, it must be noted that the corresponding ref-BLEU score is extremely low. In the sentiment transfer task, the sentence-level and token-level versions achieve the best results in ref-BLEU and style accuracy, respectively. Nonetheless, they show remarkably low results in the other metric of interest (i.e., sentence-level accuracy
\(=\) 1.6, token-level ref-BLEU
\(=\) 0.7).
After manually inspecting the generated outputs, we observed that in the sentence-level version, in most cases, the input is simply copied to the output. The loss function used to train the model aims at minimizing the discrepancy between the embeddings of the input and transferred sentences, therefore preserving the meaning. However, in the formality transfer task, where the output often contains multiple copies of the input, there is a low overlap with the target sentence. In contrast, in the sentiment transfer task, the input is copied to the output without repetitions. Given that sentiment transfer typically involves modifying only a few words, the sentence-level version achieves a high ref-BLEU score. Considering accuracy scores, since the input text is not modified in the sentence-level version, its performance is low, especially in the sentiment transfer task (i.e., accuracy = 1.6). For the token-level version, the outputs are degenerate, i.e., the model almost always generates the same sentence. Consequently, the ref-BLEU scores are particularly low (e.g., 0.7 on the Yelp dataset), while the accuracy is often very high (e.g., 99.7 on the Yelp dataset) when the (degenerate) sentences are classified as belonging to the target style.
Overall, these results demonstrate the significant advantage of directly enforcing the cycle-consistency constraint on the raw sequences, highlighting one of the main contributions of our work.
5.8 Human Evaluation
Similar to [
12,
23], we conducted a human evaluation to get qualitative feedback on the TST results. We recruited 12 volunteers, each of whom meets the following criteria: they hold an MSc or PhD degree, are proficient in English, and have a sufficient background in the TST task. We randomly picked 50 test samples per dataset and style transfer direction (300 samples overall). For each source sample and target style, annotators were asked to evaluate the quality of the outputs generated by different systems. The outputs were presented in random order and without disclosing the model each output was generated by. Specifically, for each task and dataset, annotators evaluated the outputs produced by the following models: CycleGAN (ours), CycleGAN latent (i.e., the latent sentence-based version of CycleGAN), Llama2, and the two corresponding best baselines.
The output sentences were evaluated using a 5-point Likert scale based on three criteria: (1) Style accuracy, measuring the extent to which the generated sentence aligns with the target style; (2) Content preservation, assessing how effectively the content of the input sentence is preserved; and (3) Fluency, considering the overall fluency and linguistic correctness of the output text. Similar to [
23] and [
18], we also calculate the average across the three criteria and denote a generated output as “successful” if it receives a rating of 4 or 5 on all three criteria.
Table 9 reports the results achieved for both tasks, including a t-test for statistical significance. In the formality transfer task, our approach excels in content preservation and fluency, achieving the best performance. Moreover, it yields the highest average score and success rate on the GYAFC-music dataset. Conversely, Llama2 demonstrates the highest style transfer score for both domains and excels in terms of average score and success rate on the GYAFC-family dataset. Notably, our approach and Llama2 outperform the other systems across all metrics by a substantial margin (e.g., +2.9 and +2.0 on content preservation and style accuracy, respectively); the latent sentence-based version of our approach, in particular, exhibits the lowest performance among them. In the sentiment transfer task, our model outperforms all baselines, including Llama2, on all metrics, thus confirming the superior quality of the generated outputs. Broadly speaking, the human evaluations are mostly aligned with the quantitative results and confirm the superior performance of our approach, particularly on content preservation. Notably, our approach achieved exceptionally high scores in the formality transfer task (i.e., 4.9 and 4.8 on the family and music domains, respectively), highlighting its superior capability in preserving the input content, which is known to be the most challenging constraint in TST. In compliance with [
13], we also report Krippendorff's alpha inter-rater agreement coefficient, which equals
\(\alpha=0.76\). This score indicates a substantial level of agreement among raters, reinforcing the consistency of the conducted human evaluation.
5.9 Results on Mixed-Style Inputs
We evaluated the models’ ability to preserve content while transferring style on datasets with inputs composed of mixed-style text segments. The mixed-style text versions are generated by proportionally appending pieces of text of different styles.
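As an illustration of how such inputs can be built (a simplified construction; the exact segmentation behind Table 10 may differ):

```python
def mix_styles(src_tokens, tgt_tokens, src_ratio):
    """Append a src_ratio share of source-style tokens to a (1 - src_ratio)
    share of target-style tokens, e.g., src_ratio=0.5 for a 50%-50% mix."""
    n_src = round(len(src_tokens) * src_ratio)
    n_tgt = round(len(tgt_tokens) * (1 - src_ratio))
    return src_tokens[:n_src] + tgt_tokens[:n_tgt]

mixed = mix_styles("i would appreciate your assistance .".split(),
                   "hey man gimme a hand !".split(), 0.5)
```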
Table 10 shows the results on the GYAFC-family dataset with a mix of formal/informal text segments with varying mixing ratios. Hereafter, we will focus on formality transfer because it is more likely to have mixed-style text than in sentiment transfer cases. Additional results are available in the Appendix.
Overall, the proposed model achieves the best balance of style accuracy and content preservation across different mixing ratios. On GYAFC-family, it obtains the highest GM and HM in 5 out of 6 configurations. Notably, the performance remains relatively strong even in more balanced settings such as 50%–50%, demonstrating its ability to effectively disentangle and transfer style at the segment level rather than just averaging effects across the full input.
The baseline models, namely CrossAlignment and MultiDecoder, exhibit consistently lower performance, with a notable decrease in overall effectiveness in mixed settings. The latent-space variant of CycleGAN also lags behind our approach, highlighting the benefit of applying the cycle-consistency constraint directly to the raw input sequences. Llama2 consistently comes second to our proposed approach; however, its performance degrades more significantly than CycleGAN's as the mixture becomes more balanced. This suggests that CycleGAN may have an advantage in more ambiguous scenarios where the overall style is unclear.
These results demonstrate that the proposed methodology is highly effective at style transfer even when the input text contains mixtures of different styles, outperforming prior work, especially on more balanced mixtures. This underscores its ability to perform style transfer at a fine-grained segment level.
5.10 Ablation Studies
In this section, we delve into the results of two complementary ablation studies. The first experiment explores the impact of the \(\lambda_{cyc}\) scaling factor in the cycle-consistency loss, whereas the second one analyzes the effect of the pre-trained style classifier, considering both the additional loss term and the model used.
Cycle-Consistency Loss. In this ablation study, we investigate the impact on performance when varying the cycle-consistency loss coefficient \(\lambda_{cyc}\). To better analyze and isolate the effect of this loss component, we conduct this analysis by excluding the pre-trained style classifier and its corresponding loss term. Consequently, we set \(\lambda_{gen}=\lambda_{dis}=1\), \(\lambda_{style}=0\) and experiment with different values of \(\lambda_{cyc}\in\{0,0.1,1,10,50,100\}\). Additional results for the BART-base model on the GYAFC-family dataset can be found in the Appendix.
In general, the ref-BLEU and style accuracy metrics increase as the value of \(\lambda_{cyc}\) increases. However, the ref-BLEU gains appear to be quite limited for large \(\lambda_{cyc}\) values, while accuracy still exhibits some room for improvement. Notably, disabling the cycle-consistency loss (i.e., setting \(\lambda_{cyc}=0\)) results in a significant performance drop in terms of ref-BLEU (i.e., \(-\)4.2 compared to \(\lambda_{cyc}=0.1\)). The performance drop is even more pronounced in terms of style accuracy (\(-\)6.1). These results confirm the importance of the cycle-consistency loss.
Pre-Trained Style Classifier. In this ablation study, we aim to analyze the impact of the pre-trained style classifier. By enabling/disabling the classifier component, we introduce or eliminate the classifier loss contribution (see Equations
(4) and (
5)). To ensure a complete overview of the classifier contribution, we averaged the evaluation metrics reported in
Figure 3 across all the models trained on each dataset. The results reported in
Figure 3 show that the introduction of the pre-trained classifier in the training process has a positive impact on all four evaluation metrics. In terms of BLEU scores, the improvements are negligible, apart from a +1.0 BLEU gain in the music domain of the GYAFC dataset. The limited effect on BLEU can be explained by the fact that the pre-trained style classifier's objective is to improve the style transfer accuracy, and thus it does not necessarily affect the BLEU scores. On the contrary, we observe remarkable improvements in terms of style accuracy on the two domains of the GYAFC dataset. Specifically, we achieve absolute gains of +4.0 and +7.9 points in accuracy scores, which correspond to relative improvements of +8.8% and +18.4% over the classifier-free counterparts. Finally, by analyzing the impact on the GM and HM of BLEU and style accuracy, we observe an overall improvement of up to +4.1 and +3.7 points, respectively. The GM and HM provide a more comprehensive evaluation of the overall performance of the approach, taking into account the trade-off between the two separate metrics. These results, therefore, confirm the effectiveness of the pre-trained classifier in enhancing the quality of the generated text.
The evaluation results show a surprising lack of performance improvement on the Yelp dataset. One possible explanation for this phenomenon is that the sentiment transfer task already achieves high style accuracy scores, even without the pre-trained classifier. This may suggest that the model's pre-existing capability to perform style transfer is already sufficient to achieve high accuracy scores, making the pre-trained classifier's contribution negligible in this case. Also, the larger size of the Yelp dataset may already provide the model with a sufficient amount of training data to effectively capture style transfer patterns. Moreover, as described in
Section 5.3, since the Yelp dataset does not include human references for the validation set, style accuracy is already taken into account when performing the hyperparameter tuning and selecting the best checkpoints. This may be another possible explanation for the limited impact of the pre-trained classifier on this dataset.
As the quality of the pre-trained style classifier may affect the overall performance of our proposed architecture, we extend this ablation study by also testing other style classifiers. In addition to the BERT-base model used in our architecture, we test the following models: BERT-large, RoBERTa-base [
20], RoBERTa-large, and DistilBERT-base. More detailed results for the BART-base model on the GYAFC-family dataset can be found in the Appendix.
Overall, we observe that the specific style classifier chosen does not have a significant impact on performance. Specifically, the differences among the various classifiers in ref-BLEU are negligible (i.e., \(\pm 0.1\)), and similarly for style accuracy, where fluctuations range from \(\pm 0.4\) to \(\pm 2.5\). The largest differences in performance are observed with the DistilBERT-base model, showing a drop of \(-\)1.0 and \(-\)4.6 in ref-BLEU and accuracy scores, respectively. This result is expected, given that the DistilBERT-base model is the lightest among those tested. Nevertheless, all tested models perform generally well, indicating that the quality of the pre-trained style classifier has a limited impact on the final performance.
6 Conclusions and Future Work
In this paper, we presented a new approach to self-supervised TST using CycleGANs. Thanks to the joint use of cycle-consistency and a pre-trained style classification loss, our method is able to effectively transfer the style of a source text to a target text without the need for labeled data. The experimental results, achieved on three benchmark datasets and two different TST tasks, show that our method performs better than state-of-the-art approaches in terms of the quality of the generated text and the ability to preserve the content of the source text, particularly when mixed-style inputs are processed.
Limitations. The application of the proposed approach to real case studies should consider the following potential limitations. (1) We made the assumption that the source and target domains have approximately the same distributions. As a result, the model could be unable to learn the correct mapping between the two style attributes if this assumption does not hold true. (2) The presented approach may be misused to maliciously manipulate the text style or sentiment. For example, by transferring the style of credible sources to untruthful content, the proposed method might be employed to automatically generate fake news or propaganda. (3) The currently proposed architecture handles one specific pair of source and target styles. This leads to scalability issues, as it would require training a separate architecture for each new pair of styles.
Despite the aforementioned limitations, the usability of the proposed method is quite promising in various real-world scenarios. With a responsible deployment and careful consideration of the main ethical concerns, our approach can relevantly contribute to the advancement of the TST research field and enable innovative NLP applications in fields such as marketing, content generation, and digital storytelling.
Future Work. We plan to extend our work across multiple directions. (1) We aim to expand the capabilities of the proposed architecture by investigating its performance in a multilingual setting. Transferring the style attributes across languages is potentially challenging as it entails not only capturing stylistic nuances but also handling language-specific characteristics. By considering this aspect, we can evaluate the model's ability to generalize and adapt to diverse linguistic contexts. (2) The flexibility of our method allows us to explore its applicability to new domains and tasks. For instance, we would like to explore the following two related tasks further:
Aspect-level style transfer [
13] and
Controllable text generation [
10]. In both cases, the goal is to selectively transfer specific attributes or aspects of the writing style while preserving the rest. (3) We also envisage extending our approach to handle more than a single pair of styles simultaneously (see
Section 4.5).