
Neural Methods for Data-to-text Generation

Published: 17 October 2024

Abstract

The neural boom that has sparked natural language processing (NLP) research throughout the last decade has similarly led to significant innovations in data-to-text (D2T) generation. This survey offers a consolidated view of the neural D2T paradigm with a structured examination of the approaches, benchmark datasets, and evaluation protocols. It draws boundaries separating D2T from the rest of the natural language generation (NLG) landscape, provides an up-to-date synthesis of the literature, and highlights the stages of technological adoption from within and outside the greater NLG umbrella. With this holistic view, we highlight promising avenues for D2T research that focus not only on the design of linguistically capable systems but also on systems that exhibit fairness and accountability.

1 Introduction

Textual representations of information: “A picture is worth a thousand words—isn’t it? And hence graphical representation is by its nature universally superior to text—isn’t it? Why then isn’t the anecdote itself represented graphically?” With this, Petre [246], in his advocacy for textual representation of information, challenges the notion that graphical representations of information are inherently more memorable, comprehensible, and accessible than their textual counterparts. Gershon and Page [104] note that the transformation of information from the textual to the visual domain, in certain instances, requires the addition of further information, rendering textual representations more economical. Similarly, knowing where to look may not be obvious in visual representations of information—as validated through reading comprehension experiments [31, 115, 116] in which participants were significantly slower at interpreting visual representations of the nested conditional structures within a program than their textual counterparts. That said, these studies do not intend to dissuade the use of visual representations but rather establish the importance of textual representations of information. Often, the interplay of these paradigms brings out the best of both [289]. Having established the importance of textual representations of information, we next explore how these notions tie into data-to-text (D2T) generation.

1.1 Defining D2T Generation and Scope of the Survey

Textual representations of information, for easier assimilation, are often presented as annotations outlining different behaviours of the underlying data stitched together. These stitched annotations, as showcased in Figure 1, are referred to as narratives. The automated generation of such narratives, although serving several niches (see below), is most prevalent in the public eye through the practice of robo-journalism [66, 176]. Bloomberg News generates a third of its content with Cyborg, their in-house automation system that can dissect tedious financial reports and churn out news articles within seconds [243]. Also prevalent is the use of such systems in business intelligence settings with prominent commercial frameworks,1 such as Arria NLG,2 Narrative Science,3 and Automated Insights.4
Fig. 1. Illustration of D2T: Narration of time series data (COVID-19 progression in the United Kingdom at the top, carbon monoxide emissions in the state of Kansas, United States, at the bottom) with the large language model (LLM)-based framework \(T^{3}\) (T-Cube) [291]. The framework consumes a time series as input and generates narratives that highlight the progression and points of interest (regimes, trends, and peaks) in the data.
The practice of automating the translation of data into user-consumable narratives through such systems is known as D2T generation, as depicted in Figure 1. Although encompassed by the general umbrella of natural language generation (NLG), the nuance that differentiates D2T from the rest of the NLG landscape is that the input to the system has to qualify as a data instance. Reiter and Dale [270] describe this instance as a non-linguistic representation of information, and although the narration of images and videos [70] has garnered interest in the NLG community, the definition of D2T employed by this survey follows that established by the seminal works prior [100, 270]: an entity that is not exclusively linguistic—tabular databases, graphs and knowledge bases, time series, and charts. Using this clause, we limit the scope of our analysis and exclude all other NLG systems that either ingest and produce exclusively linguistic entities for downstream tasks, such as machine translation (MT) [145, 350] and summarization [188, 228], or ingest non-conventional data, such as images [131] and videos [328].
Outside of dataset specific tasks, practical applications of D2T include, but are not limited to:
Weather forecasts [20, 271]
Sport summaries [15, 279, 316]
Healthcare [241, 249]
Virtual dietitians [9]
Stock market comments [10, 220]
Video game dialogues [149] and driving feedback [35]
With our scope defined, below we outline the rationale for this survey, followed by a structured examination of approaches, benchmark datasets, and evaluation protocols that constitute the D2T landscape with the intent to outline promising avenues for further research.

1.2 Survey Rationale

Following the seminal work by Reiter and Dale [270], the most comprehensive survey on D2T to date has been that by Gatt and Krahmer [100]. Although several articles have closely examined NLG sub-fields such as dialogue systems [282], poetry generation [234], persuasive text generation [76], and social robotics [94], or focus exclusively on issues central to NLG, such as faithfulness [181] and hallucination [142], a detailed breakdown of the last half-decade of innovations has been missing since the last exhaustive body of work. The need for a close and consolidated examination of developments in neural D2T is more pertinent now than ever. Further, D2T distinguishes itself from other NLG tasks as it blends the generation of narratives with numerical reasoning between data points. Outside of the D2T niche, there are research communities focused on solving these individual problems—NLG [100, 235, 320] and numerical reasoning [295, 317, 334, 349]. Thus, neural D2T is uniquely positioned: it must either incorporate innovations from these seemingly disparate niches or jointly innovate on both fronts. We believe this provides added justification for D2T requiring its own comprehensive literature review.
Neural D2T borrows heavily from advances in other facets of NLG, such as neural MT (NMT) [11, 350] and spoken dialogue systems (SDS) [79, 342, 343]. The pertinence of such a survey therefore also extends to highlighting the stages of technological adoption in the D2T paradigm and drawing distinctions between D2T and its NMT and SDS neighbors. Further, the adoption of such technologies brings with it shared pitfalls—inconsistencies in evaluation metrics [268] and difficulties in meaningful inter-model comparisons [265]. Thus, in addition to an exhaustive examination of neural D2T frameworks, a consolidated resource on approaches to its evaluation is also necessary. Also crucial is the discussion of benchmark datasets across shared tasks. The above considerations motivate our survey on the neural D2T paradigm, intended to serve the following goals:
Structured examination of innovations in neural D2T in the last half-decade spanning relevant frameworks, datasets, and evaluation measures.
Outlining the technological adoptions in D2T from within and outside of the greater NLG umbrella with the distinctions and shared pitfalls that lie therein.
Highlighting promising avenues for further D2T research and exploration that promote fairness and accountability along with linguistic prowess.

2 Datasets for D2T Generation

The first set of technological adoptions from natural language processing (NLP) takes the form of dataset design: parallel corpora that align the data to their respective narratives are crucial for end-to-end learning, analogous to any neural-based approach to text processing. The initial push toward building such datasets began with database–text pairs of weather forecasts [20, 271] and sport summaries [15]. These datasets, and the convention that currently follows, use semi-structured data that deviates from the raw numeric signals initially used for D2T systems [266]. The statistics for prominent datasets among the ones discussed below are detailed in Table 2.

2.1 Meaning Representations

Mooney [217] defines a meaning representation (MR) language as a formal unambiguous language that allows for automated inference and processing, wherein natural language is mapped to its respective MR through semantic parsing [101]. RoboCup [42], among the pioneering MR-to-text datasets, offers 1,539 temporally ordered pairs of MRs (pass, kick, turnover) and their respective human commentary from simulated soccer games. To mitigate the cost of building large-scale MR datasets, Liang et al. [183] use grounded language acquisition to construct WeatherGov—a weather forecasting dataset with 29,528 MR–text pairs, each consisting of 36 different weather states. Abstract MR (AMR) [13], similarly, is a linguistically grounded semantic formalism representing the meaning of a sentence as a directed graph, as depicted in Figure 2(a). The LDC repository5 hosts various AMR-based corpora. Following this, using simulated dialogues between their statistical dialogue manager [357] and an agenda-based user simulator [283], Mairesse et al. [204] offer BAGEL—an MR–text collection of 202 Cambridge-based restaurant descriptions, each accompanied by two inform and reject dialogue types. Wen et al. [343], through crowdsourcing, offer an enriched dataset consisting of 5,192 instances of 6 additional dialogue act types, such as confirm and inform only (8 total), for hotels and restaurants in San Francisco. Novikova et al. [232] show that crowdsourcing with the aid of pictorial stimuli yields better-phrased references compared to textual MRs. Following this, they released the E2E dataset6 as a part of the E2E challenge [230]. With 50,602 instances of MR–text pairs of restaurant descriptions, its lexical richness and syntactic complexity provide new challenges for D2T systems. Table 1 showcases comparative snapshots of the aforementioned datasets.
Fig. 2. Abstract MR (AMR) [169] and knowledge graph [274] snapshots, representing variants of graph-based inputs to D2T systems.
RoboCup [42]
MR: badPass(arg1=pink11,…), ballstopped(), ballstopped(), kick(arg1=pink11), turnover(arg1=pink11,…)
Text: pink11 makes a bad pass and was picked off by purple3
WeatherGov [183]
MR: rainChance(time=26-30,…), temperature(time=17-30,…), windDir(time=17-30,…), windSpeed(time=17-30,…), precipPotential(time=17-30,…), rainChance(time=17-30,…)
Text: Occasional rain after 3am. Low around 43. South wind between 11 and 14 mph. Chance of precipitation is 80%. New rainfall amounts between a quarter and half of an inch possible.
BAGEL [204]
MR: inform(name(the Fountain) near(the Arts Picture House) area(centre), pricerange(cheap))
Text: There is an inexpensive restaurant called the Fountain in the centre of town near the Arts Picture House
SF Hotels and Restaurants [343]
MR: inform(name=“red door cafe”, goodformeal=“breakfast”, area=“cathedral hill”, kidsallowed=“no”)
Text: red door cafe is a good restaurant for breakfast in the area of cathedral hill and does not allow children.
E2E [232]
MR: name[Loch Fyne], eatType[restaurant], food[French], priceRange[less than £20], familyFriendly[yes]
Text: Loch Fyne is a family-friendly restaurant providing wine and cheese at a low cost.
Table 1. Comparative Showcase of Sample MRs (and Their Corresponding Narratives) from the RoboCup, WeatherGov, BAGEL, SF Hotels and Restaurants, and E2E Datasets
Benchmark | Format | Size | Tokens
E2E | MR | 50,602 | 65,710
LDC2017T10 | AMR | 39,260 | -
WebNLG (en, ru) | RDF | 27,731 | 8,886
Data-record-to-text | RDF | 82,191 | 33,200
WikiBio | Record | 728,321 | 400,000
RotoWire | Record | 4,853 | 11,300
TabFact | Record | 16,573 | -
ToTTo | Record | 120,000 | 136,777
LogicNLG | Record | 37,000 | 52,700
WikiTableT | Record | 1.5M | 169M
Table 2. Highlights from Prominent D2T Datasets: Format, Number of Samples (Size), the Number of Linguistic Tokens Across the Dataset (Tokens), and Availability of Non-Anglo-Centric Variants

2.2 Graph Representations

Graph-to-text translation is not only central to D2T; its applications also carry over to numerous NLG fields such as question answering [74, 130], summarization [86], and dialogue generation [190, 216]. Further, D2T frameworks for graph-to-text borrow heavily from the theoretical formulations offered by the graph neural network literature, as will be discussed in Section 5.1.5. The domain-specific benchmark datasets discussed above (see Section 2.1) inherently train models to generate stereotypical domain-specific text. By crowdsourcing annotations for DBPedia [211] graphs spanning 15 domains, Gardent et al. [99] introduce the WebNLG dataset.7 The data instances are encoded as Resource Description Framework (RDF) triples of the form (subject, property, object), as depicted in Figure 2(b): (Apollo 12, operator, NASA). With 27,731 multi-domain graph–text pairs, WebNLG offers more semantic and linguistic diversity than previous datasets twice its size [342]. The abstract generation dataset [168], built with knowledge graphs extracted from articles in the proceedings of AI conferences [6] using SciIE [199], offers 40,000 graph–text pairs of the article abstracts. To further promote generation challenges and cross-domain generalization, Nan et al. [224] merge the E2E and WebNLG datasets with large heterogeneous collections of diverse predicates from Wikipedia tables annotated with tree ontologies to generate the data-record-to-text corpus. With 82,191 samples, the resulting open-domain corpus is almost quadruple the size of WebNLG.

2.3 Tabular Representations

Information represented in large tables can be difficult to comprehend at a glance; thus, table-to-text (T2T) generation aims to produce narratives highlighting crucial elements of a tabular data instance through summarization and logical inference over the table—as showcased in Figure 3. Similar to graph-to-text, the underpinnings of tabular representation learning are also shared with fields outside of NLG, such as the generation of synthetic network traffic [276, 353].
Fig. 3. Showcasing the intent of T2T: the statistics of a basketball match between the Atlanta Hawks and the Miami Heat (left) are to be translated into easily consumable narratives (right). Snapshot from the RotoWire dataset [347].
WikiBio [174], as an initial foray toward a large-scale T2T dataset, offers 700k table–text pairs of Wikipedia info-boxes with the first paragraph of the associated article as the narrative. With a vocabulary of 400k tokens and 700k instances, WikiBio offers a substantially larger benchmark than the pioneering WeatherGov and RoboCup datasets, which have fewer than 30k data–text pairs. For neural systems, as the length of the output sequence increases, the generated summary diverges from the reference. As such, the RotoWire dataset [347] (Figure 3), consisting of verbose descriptions of NBA game statistics, brings forth new challenges in long-form narrative generation, as the average reference length of RotoWire is 337 words compared to 28.7 for WikiBio. Similarly, with the observation that only 60% of the content in RotoWire narratives can be traced back to the data records, Wang [335] introduces RotoWire-FG, a refined version of the original dataset aimed at tackling divergence (see Section 3.2), where narrative instances not grounded in their respective tables are removed.
TabFact [50] contains annotated sentences that are either supported or refuted by the tables extracted from Wikipedia. Similar to RotoWire-FG, Chen et al. [47] offer a filtered version of TabFact by retaining only those narratives that can be logically inferred from the table.
For controlled generation, Parikh et al. [239] propose ToTTo, in which a single-sentence description of a table is generated on the basis of a set of highlighted cells and annotators ensure that the target summary contains only the specified subset of information. With over 120k training samples, ToTTo establishes an open-domain challenge for D2T in controlled settings. Similarly, to evaluate narrative generation in open-domain settings with sentences that can be logically inferred from mathematical operations over the input table, Chen et al. [47] modify the reference narratives of the TabFact dataset to construct LogicNLG with 7,392 tables. Following this, with tables and their corresponding descriptions extracted from scientific articles, Moosavi et al. [218] introduce SciGen, where the narratives include arithmetic reasoning over the tabular numeric entries. Building upon the long-form generation premise of RotoWire, Chen et al. [45] construct WikiTableT, a multi-domain table–text dataset with 1.5 million instances pairing Wikipedia descriptions with their corresponding info-boxes along with additional hyperlinks, named entities, and article metadata. The majority of these datasets are available in a unified framework through TabGenie.8

2.4 Data Collection and Enrichment

The majority of the prominent datasets discussed in Sections 2.1–2.3 are either collected by merging aligned data–narrative pairs that occur naturally in the “wild” [174, 347] or collected through dedicated crowd-sourcing approaches [99, 230]. However, there are notable works that employ a hybrid approach to data collection. CACAPO [324], an MR-style multi-domain dataset, follows a collection process inspired by Oraby et al. [236] wherein naturally occurring narratives are first scraped from the internet and later manually annotated to generate attribute–value pairs. The chart-to-text dataset [153] follows a similar mechanism wherein candidate narratives for each chart are first automatically generated via a heuristic-based approach and then rated by crowd-sourced workers. Along similar lines, the ToTTo dataset [239] discussed in Section 2.3 uses crowd-sourced annotators as data “cleaners”—iteratively improving upon the automatically scraped narratives rather than annotating them from scratch—thus greatly reducing the cost of data acquisition. In addition to innovations in data collection, efforts from the D2T community have also focused on the enrichment of existing datasets. As such, Ferreira et al. [90] augment the WebNLG dataset with intermediate representations for discourse ordering and referring expression generation. By manually delexicalizing (see Section 4.1) the narratives, Ferreira et al. were able to automatically extract a collection of referring expressions by tokenizing the original and delexicalized texts and finding the non-overlapping tokens between them. Similarly, the authors extracted the order of the arguments in the text by referring to the order of the general tags in the delexicalized texts. This work has also been extended to enrich the E2E dataset [92].

3 D2T Generation Fundamentals and Notations

3.1 What to Say and How to Say It

The data instance typically contains more information than we would intend for the resulting narrative to convey—verbose narratives that detail every attribute of the data instance contradict the premise of consolidation. Thus, to figure out what to say, a subset of the original information content is selected based on the target audience through the process of content selection (CS). Starting from data-driven approaches such as clustering [75] and the use of hidden Markov models [16], the attention of the research community has recently shifted to learning alignments between the data instance and its narrative [183]. Bisazza and Federico [30] note that pre-reordering the source words to better resemble the target narrative yields significant improvements in NMT. Prior to neural explorations, learning this alignment had been explored with log-linear models [8] and tree representations [170, 171]. With what to say determined, the next step lies in figuring out how to say it, that is, the construction of words, phrases, and paragraphs—this realization of the narrative structure is known as surface realization. While the processes of CS and surface realization [148, 270] traditionally act as discrete parts of the generation pipeline, the neural sequence-to-sequence (seq2seq) paradigm jointly learns both aspects. For a peripheral view of the articles discussed in this section, Table 4 highlights prominent papers categorized based on their D2T tasks and the benchmark datasets used. Similarly, Figure 5 outlines the organization of the remainder of this survey.

3.2 Hallucinations and Omissions

Apart from the importance of coherence and linguistic diversity in surface realization, data fidelity is a crucial aspect of D2T systems—the narrative should neither hallucinate content absent from the data instance nor omit content present in it. Often, the divergence present in benchmark training datasets, wherein the narrative may contain data absent from the source or may not cover the entirety of the data instance, is the culprit behind hallucination tendencies in the model [280]. The need for both linguistic diversity and data fidelity frequently turns into a balancing act between conflicting optimization objectives, leading to novel challenges [128]. While almost all of the D2T approaches discussed below engage in balancing coherence and diversity with data fidelity (besides Section 5.1.4), the approaches to balancing these conflicting objectives broadly take two forms:
Architectural interventions: Sections 5.1.1–5.1.3, 5.1.5, 5.1.6, and 5.1.10 suggest modifications or augmentations to the seq2seq architecture such that it fosters data-fidelity tendencies.
Loss-function interventions: An alternative avenue to achieving a balance between conflicting optimization objectives is to directly model the objective functions to perform multi-task learning: as such Sections 5.1.7 and 5.1.8 suggest modifications or augmentations to the seq2seq loss functions.

3.3 Establishing Notation and Revisiting Seq2Seq

For the consistency and readability of this survey, the notation outlining the basic encoder–decoder seq2seq paradigm [11, 54, 314, 326] in D2T (Figure 4), as defined below and compiled in Table 3, will remain valid throughout unless stated otherwise. However, the namespace for additional variable definitions in the individual sections will be limited to their mentions. Let \(S=\{x_{j},y_{j}\}_{j=1}^{N}\) be a dataset of \(N\) data instances \(x\) accompanied with their natural language narratives \(y\). Based on the construction of \(S\), \(x\) can be a set of \(K\) data records \(x=\{r_{j}\}_{j=1}^{K}\), with each entry \(r\) comprising its respective entity \(r.e\) and value \(r.m\) attributes, or \(x\) can be an instance of a directed graph \(x=(V,E)\) with vertices \(v\in V\) and edges \((u,v)\in E\). In the RotoWire instance (Figure 3), for the Heat record \(r_{j}\), the \(r_{j}.e\) = WIN attribute would have value \(r_{j}.m\) = 11. Given pairs \((x,y)\), the seq2seq model \(f_{\theta}\) is trained end-to-end to maximize the conditional probability of generation \(P(y|x)=\prod_{t=1}^{T}P(y_{t}|y_{ < t},x)\). The parameterization of \(f_{\theta}\) is usually carried out through RNNs, such as LSTMs [29, 133] and GRUs [54], or transformer9 architectures [326]. For attention-based RNN architectures with hidden states \(h_{t}\) and \(s_{t}\) for the encoder and decoder, respectively, the context vector \(c_{t}=\sum_{i}\alpha_{t,i}h_{i}\) weighs the encoder hidden states with attention weights \(\alpha_{t,i}\). While Bahdanau et al. [11] use a multi-layer perceptron (MLP) to model \(\alpha_{t,i}\), several alternatives for modeling the attention weights have been proposed [201, 331].
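To make the attention computation above concrete, the following minimal PyTorch sketch (an illustration, not any specific paper's implementation) scores each encoder hidden state \(h_{i}\) against the decoder state \(s_{t}\) with an MLP to obtain the weights \(\alpha_{t,i}\) and the context vector \(c_{t}\); tensor shapes and module names are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style (MLP) attention: c_t = sum_i alpha_{t,i} * h_i."""
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)   # projects encoder states
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)   # projects decoder state
        self.v = nn.Linear(attn_dim, 1, bias=False)           # scores each source position

    def forward(self, h: torch.Tensor, s_t: torch.Tensor):
        # h: (batch, src_len, enc_dim) encoder hidden states
        # s_t: (batch, dec_dim) current decoder hidden state
        scores = self.v(torch.tanh(self.W_h(h) + self.W_s(s_t).unsqueeze(1)))  # (batch, src_len, 1)
        alpha = torch.softmax(scores, dim=1)                                   # attention weights
        c_t = (alpha * h).sum(dim=1)                                           # (batch, enc_dim) context vector
        return c_t, alpha.squeeze(-1)

# Usage sketch: context, weights = AdditiveAttention(256, 256, 128)(encoder_states, decoder_state)
```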
Fig. 4. Attention-based seq2seq framework: The encoder consumes the sequential input, translating it into a weighted hidden representation that is then decoded into linguistic tokens by the decoder.
Fig. 5. D2T generation taxonomy corresponding to sections in the survey design.
Notation | Description
\(S\) | Dataset
\((x,y)\in S\) | Data instance \(x\) and its natural language representation \(y\)
\(r=(r.e,r.m)\) | Data record \(r\) with its entity \(r.e\) and value \(r.m\) attributes
\(G=(V,E)\) | Graph instance \(G\) with vertices \(V\) and edges \(E\)
\((u,v)\in E\) | Nodes \(u\) and \(v\) of an edge \(E\)
\(f_{\theta}\in\{f_{1},...,f_{n}\}\) | Model \(f_{\theta}\) that may belong to an ensemble \(\{f_{1},...,f_{n}\}\)
\(P(y|x)\) | Conditional probability of sequence \(y\) given \(x\)
\(h_{t},s_{t}\) | Encoder \(h_{t}\) and decoder \(s_{t}\) hidden state representations
\(c_{t},\alpha_{t,i}\) | Context vector \(c_{t}\) weighing \(h_{t}\) with attention weights \(\alpha_{t,i}\)
\(p_{gen},p_{copy}\) | Token generation \(p_{gen}\) or copying \(p_{copy}\) probabilities
\(z_{t}\in\{0,1\}\) | Binary variable that selects either \(p_{gen}\) or \(p_{copy}\)
\(W_{i\in\mathbb{N}}\), \(b_{i\in\mathbb{N}}\) | Arbitrary weights and biases parameterizing \(f_{\theta}\)
Table 3. Notation Descriptions
Dataset | Publication Highlights | Framework | Human Evaluation
MR-to-Text
RoboCup and WeatherGov | [210] Coarse-to-fine aligner and penalty based on learned priors | LSTM \(\rightarrow\) LSTM + regularization | N
Recipe and SF H&R | [160] Neural agenda-checklist modeling | GRU \(\rightarrow\) GRU + agenda encoders | Y
BAGEL | [79] Reranking beam outputs w/ RNN-based reranker | LSTM \(\rightarrow\) LSTM + reranker | N
Restaurant Ratings | [227] Nondelexicalized inputs w/ data augmentation | LSTM \(\rightarrow\) LSTM | Y
WikiData | [52] Complementary text-to-data translation | GRU \(\rightarrow\) GRU | Y
E2E | [81] Comparative evaluation of 62 systems | Seq2Seq + Data-driven + Templated | Y
| [256] MLP encoder attuned to the dataset | MLP \(\rightarrow\) GRU | Y
| [360] Two-level hierarchical encoder | CAEncoder \(\rightarrow\) GRU | N
| [150] Ensemble w/ heuristic reranking | Ensemble w/ LSTM + CNN | N
| [310] Hierarchical decoding with POS tags | GRU \(\rightarrow\) GRU | N
| [95] Unsupervised DTG with DAEs | LSTM \(\rightarrow\) LSTM + DAEs | Y
| [103] Comparative evaluations w/ ensembling & penalties | Ensemble w/ LSTM + T | N
| [62] Syntactic controls with SC-LSTM | SC-LSTM | Y
| [297] Computational pragmatics based DTG | GRU \(\rightarrow\) GRU | N
| [59] Extensive anonymization | LSTM \(\rightarrow\) LSTM | Y
| [78] Semantic correctness in neural DTG | LSTM \(\rightarrow\) LSTM | Y
| [159] Self-training w/ noise injection sampling | GRU \(\rightarrow\) GRU | Y
| [277] Char-level GRU w/ input reconstruction | GRU \(\rightarrow\) GRU | N
| [97] CRFs w/ Gumbel categorical sampling | CRF | N
Graph-to-Text
AGENDA | [168] Graph-centric Transformer and AGENDA dataset | T \(\rightarrow\) LSTM + LSTM encoding | Y
LDC2015E25 | [88] Phrase vs Neural MR-text w/ preprocessing analysis | LSTM \(\rightarrow\) LSTM + Phrase-based | N
LDC2015E86 | [169] Unlabeled pre-training and linearization agnosticism | LSTM \(\rightarrow\) LSTM | N
| [273] Dual encoding for hybrid traversal | GNN \(\rightarrow\) LSTM | Y
LDC2017T10 | [12] Graph reconstruction w/ node and edge projection | T \(\rightarrow\) T + reconstruction loss | Y
| [203] Fine-tuning GPT-2 on AMR-text joint distribution | GPT-2 | Y
WebNLG | [207] Graph encoding with GCNs | GCN \(\rightarrow\) LSTM | N
| [68] LSTM based triple encoder | LSTM \(\rightarrow\) LSTM | Y
| [91] Discrete neural pipelines and comparisons to end-to-end | GRU \(\rightarrow\) GRU + T | Y
| [219] Sentence planning with ordered trees | LSTM \(\rightarrow\) LSTM | Y
| [275] Complementary graph contextualization | GAT \(\rightarrow\) T | Y
| [306] Detachable multi-view reconstruction | T \(\rightarrow\) T | N
| [364] Dual encoder for structure and planning | GCN \(\rightarrow\) LSTM | Y
| [274] Task-adaptive pretraining for PLMs | BART + T5 | Y
| [2] Knowledge enhanced language models and KeLM dataset | T5 | Y
| [158] Graph-text joint representations and pretraining strategies | BART + T5 | Y
Record-to-Text (Table-to-text)
WikiBio | [174] Tabular positional embeddings and WikiBio dataset | LSTM \(\rightarrow\) LSTM + Kneser-Ney | N
| [14] Encoding tabular attributes and WikiTableText dataset | GRU \(\rightarrow\) GRU | N
| [193] Field information through modified LSTM gating | LSTM \(\rightarrow\) LSTM | N
| [292] Link-based and content-based attention | LSTM \(\rightarrow\) LSTM | N
| [244] Multi-instance learning w/ alignment-based rewards | LSTM \(\rightarrow\) LSTM | Y
| [202] Key fact identification and data augment for few shot | LSTM + T | N
| [191] Hierarchical encoding w/ supervised auxiliary learning | LSTM \(\rightarrow\) LSTM | Y
| [192] Forced attention for omission control | LSTM \(\rightarrow\) LSTM | Y
| [46] External contextual information w/ knowledge graphs | GRU \(\rightarrow\) GRU | Y
| [318] Confidence priors for hallucination control | BERT + Pointer Networks | Y
| [51] Soft copy switching policy for few-shot learning | GPT-2 | Y
| [356] Variational auto-encoders for template induction | VAE modified to VTM | Y
| [337] Autoregressive modeling with iterative text-editing | Pointer networks + Text editing | Y
| [365] Reinforcement learning with adversarial networks | GAN | N
| [105] Linearly combined multi-reward policy | Pointer networks | N
| [354] Source-target disagreement auxiliary loss | T | N
| [311] BERT-based IR system for contextual examples | T5 + BERT | Y
RotoWire | [347] Classification-based metrics and RotoWire dataset | LSTM \(\rightarrow\) LSTM + Templated | N
| [229] Numeric operations and operation-result encoding | GRU \(\rightarrow\) GRU + operation encoders | Y
| [253] Dynamic hierarchical entity-modeling and MLB dataset | LSTM \(\rightarrow\) LSTM | Y
| [252] Content selection and planning w/ gating and IE | LSTM \(\rightarrow\) LSTM | Y
| [107] Contextualized numeric representations | LSTM \(\rightarrow\) LSTM | Y
| [139] Dynamic salient record tracking w/ stylized generation | GRU \(\rightarrow\) GRU | N
| [261] Two-tier hierarchical input encoding | T \(\rightarrow\) LSTM | N
| [180] Auxiliary supervision w/ reasoning over entity graphs | LSTM + GAT | Y
| [255] Paragraph-centric macro planning | LSTM \(\rightarrow\) LSTM | Y
| [254] Interweaved plan and generation w/ variational models | LSTM \(\rightarrow\) LSTM | Y
TabFact | [47] Coarse-to-fine two-stage generation | LSTM + T + GPT-2 + BERT | Y
WikiPerson | [339] Disagreement loss w/ optimal transport matching loss | T | Y
Humans, Books and Songs | [109] Attribute prediction-based reconstruction loss | GPT-2 | Y
ToTTo, LogicNLG, and NumericNLG | [189] Contextual examples through k nearest neighbors | GPT-3 | Y
| [312] Targeted table cell representation | GPT-2 | Y
| [49] Semantic confounders w/ Pearl’s do-calculus | DCVED + GPT | Y
| [187] Table-to-logic pretraining for logic text generation | T5 + BART | Y
| [223] Faithful generation with unlikelihood and replacement detection | T5 | Y
| [44] Table serialization and structural encoding | T \(\rightarrow\) GPT-2 | Y
| [7] T5 infused with tabular embeddings | T5 | N
Cross-domain
E2E, WebNLG, DART, WikiBio, RotoWire, and WITA | [140] Char-based vs word-based seq2seq | GRU \(\rightarrow\) GRU | Y
| [348] Template induction w/ neural HSMM decoder | HSMM | N
| [98] Training w/ partially aligned dataset | T \(\rightarrow\) T + supportiveness | Y
| [157] Iteratively editing templated text | GPT-2 + LaserTagger | N
| [127] RoBERTa-based semantic fidelity classifier | GPT-2 + RoBERTa | Y
| [48] Knowledge-grounded pre-training and KGTEXT dataset | T \(\rightarrow\) T + GAT | N
| [186] Hybrid attention-copy for stylistic imitation | LSTM + T | Y
| [351] Disambiguation and stitching with PLMs | GPT3 + T5 | N
| [77] Unified learning of D2T and T2D | T5 + VAE | N
| [144] Search and learn in a few-shot setting | T5 + Search and Learn | Y
Time series-to-text
WebNLG and DART | [294] Open-domain transfer learning for time-series narration | BART + T5 + Time series analysis | Y
Chart-to-text
Chart2Text | [233] Preprocessing w/ variable substitution | T \(\rightarrow\) T | Y
Chart-to-text | [153] Neural baselines for Chart-to-text dataset | LSTM + T + BART + T5 | Y
Table 4. Task and Dataset Based Summarization of Noted D2T Frameworks over the Last Half-Decade
For the handling of out-of-vocabulary tokens, Gu et al. [119] attempt to model the rote memorization process of human learning, where a language model conditioned on a binary variable \(z_{t}\in\{0,1\}\) can either generate the next token (with probability \(p_{gen}\)) or copy it from the source (with probability \(p_{copy}\)). While Gu et al. [119] and Yang et al. [355] parameterize the joint distribution over \(y_{t}\) and \(z_{t}\) directly (Equation (1)), Gulçehre et al. [121] decompose the joint probability (Equation (2)), using an MLP to model \(p(z_{t}|y_{ < t},x)\):
\begin{align}P(y_{t},z_{t}|y_{<t},x)\propto\begin{cases}p_{gen}(y_{t},y_{<t},x) & z_{t}=0\\ p_{copy}(y_{t},y_{<t},x) & z_{t}=1,\ y_{t}\in x\\ 0 & z_{t}=1,\ y_{t}\notin x\end{cases}\end{align}
(1)
\begin{align}P(y_{t},z_{t}|y_{<t},x)=\begin{cases}p_{gen}(y_{t}|z_{t},y_{<t},x)\,p(z_{t}|y_{<t},x) & z_{t}=0\\ p_{copy}(y_{t}|z_{t},y_{<t},x)\,p(z_{t}|y_{<t},x) & z_{t}=1\end{cases}\end{align}
(2)
Similar to the greater NLG paradigm, different strategies for modeling the conditional probability of generation \(P(y|x)\), the attention mechanisms \(\{\alpha_{t,i},c_{t}\}\), and the copy mechanisms \(\{p_{gen},p_{copy}\}\), as discussed below, often form the basis for D2T innovations. In addition to this, variations in training strategies such as teacher-forcing [346], reinforcement learning (RL) [315], and auto-encoder-based reconstruction [41] open up further avenues for D2T innovation.
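To illustrate the decomposition in Equation (2), the sketch below mixes a vocabulary (generation) distribution with a copy distribution over source tokens using a switch probability \(p(z_{t}|y_{ < t},x)\) modeled by an MLP over the decoder state; the tensor shapes, the precomputed copy distribution, and the scatter-based vocabulary mapping are simplifying assumptions rather than the exact formulation of any single cited system.

```python
import torch
import torch.nn as nn

class CopySwitch(nn.Module):
    """Mixes generation and copy distributions via p(z_t | y_<t, x), cf. Equation (2)."""
    def __init__(self, dec_dim: int, vocab_size: int):
        super().__init__()
        self.gen_proj = nn.Linear(dec_dim, vocab_size)  # scores p_gen over the vocabulary
        self.switch = nn.Linear(dec_dim, 1)             # models p(z_t = 1 | y_<t, x)

    def forward(self, s_t, copy_dist, src_to_vocab):
        # s_t: (batch, dec_dim) decoder state; copy_dist: (batch, src_len) attention over source tokens
        # src_to_vocab: (batch, src_len) int64 vocabulary indices of the source tokens
        p_gen = torch.softmax(self.gen_proj(s_t), dim=-1)   # (batch, vocab)
        p_switch = torch.sigmoid(self.switch(s_t))          # (batch, 1), probability of copying
        # Scatter the copy distribution into vocabulary space and mix the two distributions.
        p_copy = torch.zeros_like(p_gen).scatter_add_(1, src_to_vocab, copy_dist)
        return (1.0 - p_switch) * p_gen + p_switch * p_copy
```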

4 Innovations in Data Preprocessing

Contrary to other facets of NLG, such as chatbots, for which large-scale data can be harvested [1, 198], D2T datasets are often smaller in scale and task-specific. Ferreira et al. [88] note that phrase-based translation models [166] can outperform neural models under such data sparsity. As such, delexicalization, noise reduction, linearization, and data augmentation are preprocessing techniques often employed to tackle said sparsity of training data.

4.1 Delexicalization and Noise Reduction

Delexicalization, often referred to as anonymization, is a common practice in D2T [79, 204] wherein the slot–value pairs for the entities and their attributes in training utterances are replaced with placeholder tokens such that weights between similar utterances can be shared [227]—as illustrated in Figure 6(a). These placeholder tokens are later replaced with tokens copied from the input data instance [174]. In comparison to copy-based methods for handling rare entities, delexicalization has been shown to yield better results on constrained datasets [300].
Fig. 6. Illustrations of delexicalization in MRs [300] and linearization of graphs [274].
From the notion that delexicalization of the data instance may cause the loss of vital information that can aid seq2seq models in sentence planning, where some data instance slots may even be deemed nondelexicalizable [343], Nayak et al. [227] explore different nondelexicalized input representations (mention representations) along with grouping representations as a form of sentence planning (plan representations). The authors note improvements over delexicalized seq2seq baselines when input mentions are concatenated with each slot–value pair representing a unique embedding. The efficacy of such concatenation is also corroborated by Freitag and Roy [95]. Further, the addition of positional tokens representing intended sentence position to the input sequence offers further improvements. Addressing this, in addition to delexicalizing categorical slots, Juraska et al. [150] employ hand-crafted tokens for values that require different treatment in their verbalization: for the slot food, the value Italian is replaced by slot\(\_\)vow\(\_\)cuisine\(\_\)food indicating that the respective utterance should start with a vowel and the value represents a cuisine—an Italian restaurant. Perez-Beltrachini and Lapata [244] delexicalize numerical expressions, such as dates, using tokens created with the attribute name and position of the delexicalized token. Colin and Gardent [59] note performance improvements with an extensive anonymization scheme wherein all lemmatized content words (except adverbs) are delexicalized, as compared to restricting delexicalization to named entities.
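As a minimal sketch of the delexicalization–relexicalization cycle described above (the slot names, placeholder scheme, and string matching below are illustrative assumptions, not the convention of any particular dataset):

```python
# Delexicalize slot values into placeholders; relexicalize after generation.

def delexicalize(utterance: str, mr: dict, slots=("name", "near")) -> tuple[str, dict]:
    """Replace slot values in the utterance with placeholder tokens (e.g., __NAME__)."""
    mapping = {}
    for slot in slots:
        value = mr.get(slot)
        if value and value in utterance:
            placeholder = f"__{slot.upper()}__"
            utterance = utterance.replace(value, placeholder)
            mapping[placeholder] = value
    return utterance, mapping

def relexicalize(generated: str, mapping: dict) -> str:
    """Copy the original values back over the placeholder tokens after generation."""
    for placeholder, value in mapping.items():
        generated = generated.replace(placeholder, value)
    return generated

mr = {"name": "the Fountain", "near": "the Arts Picture House", "pricerange": "cheap"}
text = "There is an inexpensive restaurant called the Fountain near the Arts Picture House"
delex, mapping = delexicalize(text, mr)
# delex: "There is an inexpensive restaurant called __NAME__ near __NEAR__"
```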
The presence of narratives that fail to convey vital attributes of data instances leads to semantic noise in the dataset [84, 136]. Dušek et al. [78] employ slot matching [263] to clean the E2E corpus for semantic correctness and explore the impact of semantic noise on neural model performance. Similarly, Obeid and Hoque [233] substitute data mentions in the narrative, identified through named entity recognition (NER), with a predefined set of tokens which are later replaced through a look-up operation. Liu et al. [194] focus on generating faithful narratives and truncate the reference narratives by retaining only the first few sentences, since the latter are prone to have been inferred from the table.

4.2 Linearization

Frameworks for MR (and graph) narration that shy away from dedicated graph encoders rely on effective linearization techniques—the representation of graphs as linear sequences, as illustrated in Figure 6(b). While Ferreira et al. [88] note improvements in neural models with the adoption of a two-step classifier [177] that maps AMRs to the target text, Konstas et al. [169] showcase agnosticism to linearization orders by grouping and anonymizing graph entities for delexicalization with the Stanford NER [93]. The resulting reduction in graph complexity, and the subsequent mitigation of the challenge brought forth by data sparsity, makes any depth-first traversal of the graph an effective linearization approach. Moryossef et al. [219] append text plans modeled as ordered trees [308] to the WebNLG training set and use an off-the-shelf NMT system [121] for plan-to-text generation. However, the authors note that the restriction of requiring single entity mentions in a sentence renders their approach dataset dependent.
For pretrained language models, such as GPT-2 [257] and T5 [258], Zhang et al. [361] and Gong et al. [109] represent tables as a linear sequence of attribute–value pairs and use a special token as the separator between the table data and the reference text. It should be noted that T5, while performing the best on automated metrics, fails to generate good summaries when numerical calculations are involved. Chen et al. [47] traverse the table horizontally, each row at a time, where each element is represented by its corresponding field and cell value separated by the keyword is. For scientific tables, Suadaa et al. [312] view a table \(T_{D}\) as a set of cells with their corresponding row and column headers \(h=[rh:ch]\) with \(th\) for overlapping tokens, numerical value \(val\), and metric-type \(m\). The cells are marked with target flag \(tgt\) which is set to 1 for targeted cells and 0 otherwise respective to the content plan. The linearization of the resulting tables is done with templates that consist of concatenation \(T_{D}=[h:th:val:m:tgt]\), filtration based on \(tgt\), pre-computed mathematical operations, and their respective combinations.
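The sketch below illustrates two simple linearization strategies in the spirit of those described above: attribute–value serialization with a separator token for PLM inputs, and a triple-by-triple traversal. The separator tokens and the "is" keyword convention loosely follow the descriptions and are otherwise assumptions for illustration.

```python
# Two linearization sketches: flat attribute-value serialization and triple traversal.

def linearize_table(table: dict[str, str], sep_token: str = "<SEP>") -> str:
    """Flatten attribute-value pairs into a single sequence for a PLM encoder."""
    pairs = [f"{attribute} is {value}" for attribute, value in table.items()]
    return f" {sep_token} ".join(pairs)

def linearize_triples(triples: list[tuple[str, str, str]]) -> str:
    """Linearize (subject, property, object) triples in input order."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)

print(linearize_table({"name": "Loch Fyne", "eatType": "restaurant", "food": "French"}))
# name is Loch Fyne <SEP> eatType is restaurant <SEP> food is French
print(linearize_triples([("Apollo 12", "operator", "NASA")]))
# <S> Apollo 12 <P> operator <O> NASA
```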

4.3 Data Augmentation

Often, appending contextual examples from outside sources to the training set, or permuting the training samples themselves to add variation, helps mitigate data sparsity. This is known as data augmentation. Nayak et al. [227] propose the creation of pseudo-samples by permuting the slot orderings of the MRs while keeping the utterances intact. Juraska et al. [150], however, take an utterance-oriented approach where pseudo-samples are built by breaking training MRs into single-sentence utterances. For the shared surface realization task [213], Elder and Hokamp [83] augment the training set with sentences from the WikiText corpus [212] parsed using UDPipe [309]. Following this premise, Kedzie and Mckeown [159] curate a collection of utterances from novel MRs using a vanilla seq2seq model with noise injection sampling [53]. The validity of the MRs associated with these utterances is computed through a CNN-based parser [162] and the valid entries are added to the training set. However, it is worth noting that performance gains from augmenting the training set with out-of-domain instances tend to saturate after a certain point [95]. Further, caution is advised when augmenting with synthetic data, as the inclusion of such data may reinforce the mistakes of the model [126].
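A minimal sketch of pseudo-sample creation by permuting MR slot orderings while keeping the utterance intact, in the spirit of the approach of Nayak et al. [227]; the cap on the number of permutations is an assumption for the example.

```python
import itertools
import random

def permute_mr_slots(mr: list[tuple[str, str]], utterance: str, max_samples: int = 3):
    """Create pseudo-samples by permuting the slot ordering of an MR
    while keeping the reference utterance intact."""
    permutations = list(itertools.permutations(mr))
    random.shuffle(permutations)
    return [(list(p), utterance) for p in permutations[:max_samples]]

mr = [("name", "Loch Fyne"), ("eatType", "restaurant"), ("food", "French")]
augmented = permute_mr_slots(mr, "Loch Fyne is a French restaurant.")
# Each pseudo-sample pairs a reordered MR with the same reference utterance.
```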
Chen et al. [46] append knowledge graphs representing external context to the table–text pairs and quantify its efficacy through their metric KBGain—the ratio of tokens unique to the external context to the total number of tokens in the narrative. Similarly, Ma et al. [202] augment the limited training data for table–text pairs by assigning part-of-speech (POS) tags for each word in the reference and further increase the robustness of their model with adversarial examples created by randomly adding and removing words from the input. In contrast, Chen et al. [47] create adversarial examples by randomly swapping entities in the narrative with ones that appear in the table. Following this, Liu et al. [194] use an augmented plan consisting of table records and entities recognized from the reference narrative which eliminates the inclusion of information not present in the table. For few-shot learning, Liu et al. [189] observed that the performance of a GPT-3 model [37] improved upon providing in-context examples computed based on their k-nearest neighbor (\(k=2\)) embeddings.

5 Innovations in the Seq2Seq Framework

Seq2Seq models (see Section 3.3) serve as the basis for neural NLG [54, 314, 326]. As such, to compare the efficacy of neural architectures for long-form D2T, Wiseman et al. [347] compare the performance of various seq2seq models to their templated counterparts on the RotoWire dataset. Based on their observations, the conditional copy model [121] performs the best on both word-overlap and extractive metrics (see Section 6.2.1) compared to the standard attention-based seq2seq model [11] and its joint copy variant [119]. Similarly, in an evaluation of 62 seq2seq, data-driven, and templated systems for the E2E shared task, Dušek et al. [81] note that seq2seq systems dominate in terms of both automated word-based metrics and naturalness in human judgment. Wiseman et al. [347], however, note that the traditional templated generation models outperform seq2seq models on extractive metrics although they score poorly on word-overlap metrics. Thus, the adaptation of seq2seq models to D2T for richer narratives with less omissions and hallucinations still remains an active focus of the research community.
It is worth noting that all seq2seq models discussed below operate at the word level. Models operating at the character level [3, 112, 277] have shown reasonable efficacy with the added computational savings of forgoing the preprocessing steps of delexicalization and tokenization. However, the attention they have garnered from the research community is slim. From their comparative analysis, Jagfeld et al. [140] note that, since character-based models perform better on the E2E dataset while word-based models perform better on the more linguistically challenging WebNLG dataset, it is hard to draw conclusions on the framework most suited for generic D2T. In the sections that follow, we detail notable innovations over the last half-decade in seq2seq modeling, branched on the basis of their training strategies—supervised and unsupervised learning.

5.1 Supervised Learning

5.1.1 Entity Encoders.

Centering theory [118], as well as many other noted linguistic frameworks [40, 57, 125, 156, 172, 251], highlights the critical importance of entity mentions to the coherence of the generated narrative. The ordering of these entities (\(r.e\) in Section 3.3) is crucial for such narratives to be considered entity coherent [154]. Unlike typical language models, which are conditioned solely on previously generated tokens \(c_{t}\), Lebret et al. [174] provide additional context \(\{z_{c_{t}},g_{f},g_{w}\}\) to the generation, where \(z_{c_{t}}\) represents table entity \(c_{t}\) as a triplet of its corresponding field name, start, and end positions, and \(\{g_{f},g_{w}\}\) are one-hot encoded vectors where each element indicates the presence of table entities from the fixed field and word vocabularies—illustrated in Figure 7. Similarly, Bao et al. [14] encode the table cell \(c\) and attributes \(a\) as the concatenation \([e_{i}^{c}:e_{i}^{a}]\), where the decoder uses this vector to compute the attention weights. Liu et al. [193] modify the LSTM unit with a field gate that updates the cell memory, indicating the amount of entity field information to be retained. Following [174], Ma et al. [202] use a Bi-LSTM to encode the concatenation of word, attribute, and position embeddings. However, to indicate whether an entity is a key fact, an MLP classifier is used on said representation for binary classification. Inspired by Liu and Lapata [195], Gong et al. [108] construct a historical timeline by sorting each table record with respect to its date field. Three encoders encode a table entity separately in the row \(r_{i,j}^{r}\), column \(r_{i,j}^{c}\), and time \(r_{i,j}^{t}\) dimensions. The concatenation of these representations is fed to an MLP to obtain a general representation \(r_{i,j}^{gen}\), over which specialized attention weights are computed to obtain the final record representation \(\hat{r}_{i,j}=\alpha_{r}r_{i,j}^{r}+\alpha_{c}r_{i,j}^{c}+\alpha_{t}r_{i,j}^{t}\). Exploiting the attributes of the E2E dataset—the set number of unique MR attributes and the limited diversity in lexical instantiations of their values—Puzikov and Gurevych [256] employ a simple approach wherein the recurrent encoder is replaced with one dense layer that takes in MR representations through embedding lookup. Similarly, to keep track of entity mentions in the SF dataset for long-form text generation, Kiddon et al. [160] introduce a checklist vector \(a_{t}\) that aids two additional encoders to track used (mentioned in the resulting narrative) and new (not mentioned as of time step \(t\)) items on the defined agenda. The output hidden state is modeled as a linear interpolation between the three encoder states—\(c_{t}^{gru}\) of the base GRU and \(\{c_{t}^{new},c_{t}^{used}\}\) from the agenda models—weighted by a probabilistic classifier. Extending this concept of entity encoding to transformer-based architectures, Chen et al. [44] adapt the multi-headed attention layer architecture [326] to encode serialized table attributes, which are then fed to a GPT-based decoder. Similarly, as a unified text-to-text alternative to [44], Andrejczuk et al. [7] include the row and column embeddings \(\hat{r}_{i,j}\) of the input table on top of the token embeddings for table-structure learning in a T5 model.
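The following is a hedged sketch of field-aware record encoding in the spirit of Lebret et al. [174] and follow-up work, where word, field (attribute), and position embeddings are concatenated before a recurrent encoder; the embedding dimensions and the Bi-LSTM choice are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FieldAwareRecordEncoder(nn.Module):
    """Encodes table tokens as [word ; field ; position] embeddings fed to a Bi-LSTM."""
    def __init__(self, vocab, n_fields, max_pos, d_word=128, d_field=64, d_pos=32, d_hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, d_word)
        self.field_emb = nn.Embedding(n_fields, d_field)   # attribute name of each cell
        self.pos_emb = nn.Embedding(max_pos, d_pos)        # token position within its field
        self.encoder = nn.LSTM(d_word + d_field + d_pos, d_hidden,
                               batch_first=True, bidirectional=True)

    def forward(self, words, fields, positions):
        # words, fields, positions: (batch, seq_len) integer ids
        x = torch.cat([self.word_emb(words),
                       self.field_emb(fields),
                       self.pos_emb(positions)], dim=-1)
        outputs, _ = self.encoder(x)   # (batch, seq_len, 2 * d_hidden)
        return outputs
```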
Fig. 7. Entity encoding scheme—Lebret et al. [174].
Certain D2T tasks, such as sport commentaries [15, 279, 316], require reasoning over numeric entities present in the input data instance. Although numeracy in language modeling is a prominent niche of its own [295, 317, 334], notable D2T-specific approaches include that of Nie et al. [229]: precomputing the results of numeric operations \(op_{i}\in\{minus,argmax\}\) on the RotoWire dataset, the authors propose the combination of dedicated operation and operation-result encoders, the latter utilizing a quantization layer for mapping lexical choices to data values, in addition to a record encoder. In similar fashion to [355], the concatenated embeddings \(\{r.idx,r.e,r.m\}\) fed to a bi-directional GRU generate record representations, while the concatenated embeddings of \(op_{i}\) attributes fed to a non-linear layer yield operation representations. To address the difficulty in establishing lexical choices on sparse numeric values [271, 303], the authors add quantization to the operation-result encoder, which maps the results of scalar operations \(e\) (minus) to \(l\in L\) possible bins through a weighted representation (\(h_{i}=\sum_{l}\mu_{i,l}\ e\)) using softmax scores \(\mu_{i,l}\) of each individual result. Following this body of work, to contextualize numeric representations and thus understand their logical relationships, Gong et al. [107] feed raw numeric embeddings for all numerals corresponding to the same table attributes to a transformer-based encoder to obtain their contextualized representations. Through a ranking scheme based on a fully connected layer, these contextualized representations are further trained to favor larger numbers.
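One plausible instantiation of the quantization idea described above is sketched below: a scalar operation result is softly assigned to \(L\) bins via softmax scores \(\mu_{i,l}\), and the quantized representation is the \(\mu\)-weighted sum of learned bin embeddings. The scoring network and bin embeddings are assumptions, not the exact parameterization of Nie et al. [229].

```python
import torch
import torch.nn as nn

class ScalarQuantizer(nn.Module):
    """Soft-quantizes a scalar operation result into L bins; one plausible
    instantiation of the quantization layer described above."""
    def __init__(self, n_bins: int = 16, d_bin: int = 64):
        super().__init__()
        self.scorer = nn.Linear(1, n_bins)          # softmax scores mu over the bins
        self.bin_emb = nn.Embedding(n_bins, d_bin)  # learned embedding per bin

    def forward(self, e: torch.Tensor):
        # e: (batch, 1) scalar results of operations such as minus/argmax
        mu = torch.softmax(self.scorer(e), dim=-1)   # (batch, n_bins)
        return mu @ self.bin_emb.weight              # (batch, d_bin) weighted representation
```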

5.1.2 Hierarchical Encoders.

The intuition behind the use of hierarchical encoders, in the context of D2T, is to model input representations at different granularities, either through dedicated modules [191, 261, 360] or attention schemes [193, 253, 355]. As such, Zhang et al. [360] leverage their CAEncoder [359] to incorporate precomputed future representations \(h_{i+1}\) into current representation \(h_{i}\) through a two-level hierarchy. Similarly, Rebuffel et al. [261] propose a two-tier encoder to preserve the data structure hierarchy—the first tier encodes each entity \(e_{i}\) based on its associated record embeddings \(r_{i,j}\) while the second tier encodes the data structure based on its entity representation \(h_{i}\) obtained through the individual embeddings \(r_{i,j}\). On the other hand, Liu et al. [191], as illustrated in Figure 8(b), propose a word-level \(h_{r.e}^{r.m}\) and an attribute-level \(H^{r.e}\) two-encoder setup to capture the attribute–value hierarchical structure in tables. The attribute-level encoder takes in the last hidden state \(h_{last}^{r.e}\) for attribute \(r.e\) from the word level LSTM as its input. Using these hierarchical representations, fine-grained attention \(\beta_{r.e}^{r.m}\propto g(h_{r.e}^{r.m},s_{t})\) and coarse-grained attention \(\gamma^{r.e}\propto g(H^{r.e},s_{t})\) are used for decoding where \(g\) represents a softmax function. Similarly, based on hierarchical attention [355], Liu et al. [193] employ an attention scheme that attends to both word level and field level tokens. Following this, Puduppully et al. [253] propose language modeling conditioned on both the data instance and a dynamically updated entity representation. At each time-step \(t\), a gate \(\gamma_{t}\) is used to decide whether an update is necessary for the entity memory representation \(u_{k}\) and a parameter \(\delta_{t,k}\) decides the impact of said update (Equation (3))
\begin{align}\gamma_{t}=\sigma(W_{1}s_{t}+b_{1}),\qquad\delta_{t,k}=\gamma_{t}\odot\sigma(W_{2}s_{t}+W_{3}u_{t-1,k}+b_{3}).\end{align}
(3)
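A sketch of the gating in Equation (3) follows; the interpolated entity-memory update in the final line is an assumption added only to complete the example and is not spelled out by the equation itself.

```python
import torch
import torch.nn as nn

class EntityMemoryGate(nn.Module):
    """Gating as in Equation (3); the memory interpolation below is an illustrative assumption."""
    def __init__(self, d_dec: int, d_mem: int):
        super().__init__()
        self.W1 = nn.Linear(d_dec, 1)                 # gamma_t: should any entity be updated?
        self.W2 = nn.Linear(d_dec, d_mem, bias=False)
        self.W3 = nn.Linear(d_mem, d_mem)             # combines with u_{t-1,k}
        self.cand = nn.Linear(d_dec, d_mem)           # candidate content for the update (assumption)

    def forward(self, s_t, u_prev):
        # s_t: (batch, d_dec) decoder state; u_prev: (batch, K, d_mem) entity memories
        gamma = torch.sigmoid(self.W1(s_t)).unsqueeze(1)                            # (batch, 1, 1)
        delta = gamma * torch.sigmoid(self.W2(s_t).unsqueeze(1) + self.W3(u_prev))  # (batch, K, d_mem)
        u_new = (1.0 - delta) * u_prev + delta * self.cand(s_t).unsqueeze(1)        # assumed update
        return u_new
```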
Fig. 8. Illustrations for plan [252] and hierarchical [191] encoding schemes.

5.1.3 Plan Encoders and Auto-Encoders.

Traditionally, the what to say aspect of D2T (see Section 3.1) constituted its own module in the generation pipeline [100, 270], thus offering flexibility in planning the narrative structure. However, the end-to-end learning paradigm often models CS and surface realization as a shared task [174, 210, 347]. Although convenient, without explicitly modeling the planning of the narrative (Figure 8(a)), language models struggle to maintain coherence in long-form generation tasks. As such, Puduppully et al. [252] model the generation probability \(P(y|r)\) as the joint probability of narrative \(y\) and content plan \(z\) given a record \(r\) such that \(P(y|r)=\sum_{z}P(z|r)P(y|r,z)\). Similar to their prior work [253], a CS gate operates over the record representation \(r_{i}\), giving an information-controlled representation \(r_{i}^{cs}\). The elements of \(z\) are extracted using an information extraction system [347] and correspond to entities in \(y\), while pointer networks [331] are used to align elements in \(z\) to \(r\) during training. Iso et al. [139], on the other hand, avoid precomputing content plans \(z\) by dynamically choosing data records during decoding—an additional memory state \(h^{ent}\) remembers mentioned entities and updates the language model state \(h^{lm}\) accordingly. The authors propose two representations for an entity—static embedding \(e\) based on row \(r_{i}\) and aggregated embedding \(\bar{e}\) based on all rows where the entity appears. In the context of the RotoWire dataset, the aggregate embedding \(\bar{e}\) is supposed to represent how entity \(e\) played in the game. For \(h_{t}=\{h_{t}^{lm},h_{t}^{ent}\}\), \(P(z_{t}=1|h_{t-1})\) (Equation (4)) models the transition probability, and based on whether \(e\) belongs to the set of entities \(\epsilon_{t}\) that have already appeared at time step \(t\), \(P(e_{t}=e|h_{t-1})\) (Equation (5)) computes the next probable entity \(e\) to mention. The authors note that such discrete tracking dramatically suppresses the generation of redundant relations in the narrative
\begin{align}P(z_{t}=1|h_{t-1})=\sigma\big(W_{1}(h_{t-1}^{lm}\oplus h_{t-1}^{ent})\big)\end{align}
(4)
\begin{align}P(e_{t}=e|h_{t-1})\propto\begin{cases}\exp\big(h_{s}^{ent}W_{1}h_{t-1}^{lm}\big) & e\in\epsilon_{t-1}\\ \exp\big(\bar{e}W_{2}h_{t-1}^{lm}\big) & \text{otherwise}\end{cases}\end{align}
(5)
With the premise that paragraphs are the smallest sub-categorization where coherence and topic are defined [358], Puduppully and Lapata [255] propose a paragraph-based macro planning framework specific to the design of the MLB [253] and RotoWire [347] datasets, where the inputs to the seq2seq framework are predicted macro plans (sequences of paragraphs). Building upon this, in contrast to precomputing global macro plans, Puduppully et al. [254] interweave the macro planning process with narrative generation, where latent plans are sequentially inferred through a structured variational model as the narrative is generated conditioned on the plans so far and the previously generated paragraphs. Similarly, to establish order in the generation process, Sha et al. [292] incorporate link-based attention [114] in addition to content-based attention [11] into their framework. Similar to transitions in Markov chains [155], a link matrix \(\mathbb{L}\in\mathbb{R}^{n_{f}\times n_{f}}\) for \(n_{f}\) tabular attributes defines the likelihood of transitioning from the mention of attribute \(i\) to \(j\) as \(\mathbb{L}(f_{j},f_{i})\). Wang et al. [337] propose combining autoregressive modeling [287] to generate skeletal plans with an iterative text-editing-based non-autoregressive decoder [120] that generates narratives constrained by said skeletal plans. The authors note that this approach reduces the hallucination tendencies of the model. Similarly, motivated by the strong correlation observed between entity-centric metrics for record coverage and hallucinations, Liu et al. [194] adopt a two-stage generation process where a plan generator first transforms the input table records into serialized plans \(R\rightarrow R+P\) based on the separator token \(SEP\) and then translates the plans into narratives with the help of appended auxiliary entity information extracted through NER.
Handcrafted templates traditionally served as pre-defined structures where entities computed through CS would be plugged in. However, even in the neural D2T paradigm, inducing underlying templates helps capture the narrator voicing and stylistic representations present in the training set. As such, Ye et al. [356] extend the use of variational auto-encoders (VAEs) [163] for template induction with their variational template machine, which disentangles the latent representation of the template \(z\) from that of the content \(c\). In essence, the model can be trained to follow specific templates by sampling from \(z\). Inspired by stylistic encoders [137], the authors further promote template learning by anonymizing entities in the input table, thus effectively masking the CS process. Similarly, to mitigate the strong model biases in standard conditional VAEs [319], Chen et al. [49] estimate semantic confounders \(z_{c}\)—entities linguistically similar to the target tokens that confound the logic of the narrative. Compared to the standard formulation \(p(y|x)\), the authors employ Pearl’s do-calculus [242] to learn the objective \(p(y|\mathrm{do}(x))\), which asserts that confounder \(z_{c}\) is no longer determined by instance \(x\), thus ensuring logical consistency in the narrative. To ensure that the estimated confounders are meaningful, they are grounded through proxy variables \(c\) such that confounding generation \(p(c|z_{m})\) can be minimized. Recently, modeling D2T and T2D as complementary tasks, Doung et al. [77] leverage the VAE formulation with the underlying architecture of a pre-trained T5 model to offer a unified multi-domain framework for the dual task. To combat the lack of parallel corpora for the back-translation (T2D) training, the authors introduce latent variables to model the marginal probabilities of back-translation through an iterative learning process.
Likewise, for approaches beyond the use of auto-encoders, Chen et al. [47] take inspiration from practices in semantic parsing [71] and propose a coarse-to-fine two-stage generation scheme. In the first stage, a template \(Y_{T}\) containing placeholder tokens \(ENT\) is generated, representing the global logical structure of the narrative. The entities are then copied over from the input data instance to replace the \(ENT\) tokens in the second stage, yielding the final narrative \(\hat{Y}\). Suadaa et al. [312], similarly, follow template-guided generation [151] (see Section 4.2), where the precomputed results of numeric operations are copied over to the template to replace the placeholder tokens. When using pre-trained language models (PLMs), the authors incorporate this copying action into the fine-tuning stage.
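As a toy illustration of the second, entity-copying stage of such coarse-to-fine schemes, the snippet below fills \(ENT\) placeholders in a (here hard-coded) first-stage template with entities copied from the input instance in order; the template and entities are invented for the example.

```python
# A toy sketch of the coarse-to-fine scheme: a first-stage template with [ENT]
# placeholders (hard-coded here) is completed by copying entities from the input
# data instance in order. In the cited work the template itself is generated by
# a neural model; only the filling step is shown.
def fill_template(template, entities):
    out, queue = [], list(entities)
    for token in template.split():
        out.append(queue.pop(0) if token == "[ENT]" else token)
    return " ".join(out)

template = "[ENT] scored [ENT] points in the win over [ENT] ."
entities = ["LeBron James", "35", "the Celtics"]
print(fill_template(template, entities))
# LeBron James scored 35 points in the win over the Celtics .
```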

5.1.4 Stylistic Encoders.

In addition to the traits of coherence, fluency, and fidelity, stylistic variation is crucial to NLG [307]. It is interesting to note that the n-gram entropy of generated texts in seq2seq-based NLG systems is significantly lower than that of their training data—leading to the conclusion that these systems adhere to only a handful of dominant patterns observed in the training set [237]. Thus, introducing control measures to text generation has recently garnered significant attention from the NLG community [137, 344]. As such, the semantically conditioned LSTM proposed by Wen et al. [343] extends the LSTM cell to incorporate a one-hot encoded MR vector \(d\) that takes the form of a sentence planner. Following this, Deriu and Cieliebak [62] append additional syntactic control measures to the MR vector \(d\) (such as the first token to appear in the utterances and expressions for different entity–value pairs) by simply appending one-hot vector representations of these control mechanisms to \(d\). Similarly, Lin et al. [186] tackle the lack of a template-based parallel dataset with style imitation—as illustrated in Figure 9, for each instance \((x,y)\), an exemplar narrative \(y_{e}\) is retrieved from the training set based on the field-overlap distance \(D(x,x_{e})\) and an additional encoder is used to encode \(y_{e}\). The model is trained with competing objectives for content determination \(P(y|x,y_{e})\) and style embodiment \(P(y_{e}|x_{e},y_{e})\) with an additional content coverage constraint for better generation fidelity.
Fig. 9.
Fig. 9. Style imitation with exemplar narratives—Lin et al. [186].
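A simple way to picture the retrieval step is the Jaccard-style field-overlap distance sketched below, which selects the training exemplar whose attribute set best overlaps that of the input instance; the distance definition and toy instances are illustrative stand-ins rather than the exact formulation of Lin et al. [186].

```python
# Sketch of exemplar retrieval by field overlap: the exemplar whose attribute
# set is closest to the input's (Jaccard distance here, an assumed stand-in for
# the authors' field-overlap distance) provides the style to imitate.
def field_overlap_distance(x_attrs, xe_attrs):
    x, xe = set(x_attrs), set(xe_attrs)
    return 1 - len(x & xe) / len(x | xe)

def retrieve_exemplar(x_attrs, training_set):
    return min(training_set, key=lambda ex: field_overlap_distance(x_attrs, ex["attrs"]))

training_set = [
    {"attrs": {"name", "food", "area"}, "narrative": "Aromi serves Italian food in the city centre."},
    {"attrs": {"name", "near"}, "narrative": "Aromi is located near the Crowne Plaza."},
]
exemplar = retrieve_exemplar({"name", "food", "priceRange"}, training_set)
print(exemplar["narrative"])   # the first exemplar is the closer stylistic match
```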

5.1.5 Graph Encoders.

The use of explicit graph encoders in D2T stems from the intuition that neural graph encoders such as graph convolutional networks (GCNs) [164] have strong relational inductive biases that produce better representations of input graphs [18] as an effective alternative to linearization. This entails generating representations for the nodes \(v\in V\) and edges \((u,v)\in E\) in the input graph.
GCNs and graph-RNNs: Marcheggiani and Titov [208] compute node representations \(h_{v}^{\prime}\) (Equation (6)) through explicit modeling of edge labels \(lab(u,v)\) and directions \(dir(u,v)\in\{in,out,loop\}\) for each neighboring node \(u\in N(v)\) in their GCN parameterization, where learned scalar gates \(g_{u,v}\) weigh the importance of each edge. With residual (\(h_{v}^{r}=h_{v}^{\prime}+h_{v}\)) [129] and dense (\(h_{v}^{d}=[h_{v}^{\prime};h_{v}]\)) [138] skip connections, Marcheggiani and Perez-Beltrachini [207] adopt the above-mentioned encoder with an LSTM decoder [201] for graph-to-text generation. Differing from previous iterations of Graph LSTMs [184], Distiawan et al. [68] compute the hidden states of graph entities with consideration of the edges pointing to the entity from the previous entities, allowing their GTR-LSTM framework to handle non-predefined relationships. The ordering of the vertices fed into the LSTM is based on a combination of topological sort and breadth-first traversal. Inspired by hybrid traversal techniques [226, 298], Ribeiro et al. [273] propose a dual graph encoder—the first operates on a top-down traversal of the input graph where the predicate \(p\) between two nodes is used to transform labeled edges \((u_{i},p,u_{j})\) into two unlabeled edges \((u_{i},p)\) and \((p,u_{j})\), while the second operates on a bottom-up traversal where the directions of edges are reversed \((u_{i},u_{j})\rightarrow(u_{j},u_{i})\)
\begin{align}h_{v}^{{}^{\prime}}=\mathrm{ReLU}\left(\sum_{u\in N(v)}g_{u,v}(W_{dir(u,v)}h_{u}+b_ {lab(u,v)})\right).\end{align}
(6)
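For concreteness, the numpy sketch below applies the gated, direction- and label-aware update of Equation (6) to a toy three-node graph; the dimensions, the sigmoid form of the gate, and the edge labels are illustrative assumptions.

```python
# A minimal numpy sketch (under assumed shapes, not the original implementation)
# of the gated, direction- and label-aware GCN update in Equation (6).
import numpy as np

def gcn_layer(h, edges, W_dir, b_lab, gate):
    """h: (num_nodes, d) node states; edges: list of (u, v, direction, label) tuples."""
    h_new = np.zeros_like(h)
    for u, v, direction, label in edges:
        g = gate(h[u], h[v])                              # scalar edge gate g_{u,v}
        h_new[v] += g * (W_dir[direction] @ h[u] + b_lab[label])
    return np.maximum(h_new, 0.0)                         # ReLU

d = 4
rng = np.random.default_rng(0)
h = rng.normal(size=(3, d))
W_dir = {"in": rng.normal(size=(d, d)), "out": rng.normal(size=(d, d)), "loop": np.eye(d)}
b_lab = {"capital_of": rng.normal(size=d), "self": np.zeros(d)}
gate = lambda hu, hv: 1.0 / (1.0 + np.exp(-hu @ hv))      # sigmoid gate (assumed form)
edges = [(0, 1, "out", "capital_of"), (1, 0, "in", "capital_of"),
         (0, 0, "loop", "self"), (1, 1, "loop", "self"), (2, 2, "loop", "self")]
print(gcn_layer(h, edges, W_dir, b_lab, gate).shape)      # (3, 4)
```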
Damonte and Cohen [61] note that GCNs can assist LSTMs in capturing re-entrant structures and long-term dependencies. As such, to bridge the gap between the GCN [19, 285] and the linearized LSTM encoders in graph-to-text translation, Zhao et al. [364] propose DualEnc, which uses both to capture their complementary effects. The first GCN models the graph, retaining its structural integrity, while the second GCN serializes and re-orders the graph nodes, resembling a planning stage, and feeds them to the LSTM decoder.
GATs and graph transformers: To address the shortcomings of RNN-based sequential computing, Koncel-Kedziorski et al. [168] extend the transformer architecture [326] to graph-structured inputs with the GraphWriter. The distinction of GraphWriter from graph attention networks (GATs) [327] is made through the contextualization of each node representation \(v_{i}\) (Equation (7)) with respect to its neighbors \(u_{j}\in N(v_{i})\) through attention mechanism \(a_{n}\) for the \(\mathcal{N}\) attention heads. In contrast, Ribeiro et al. [275] focus on capturing complementary graph contexts through distinct global \(h_{v}^{global}\) and local \(h_{v}^{local}\) message passing using GATs. Their approach to graph modeling also differs in its token-level treatment of node representations, with positional embeddings injected to preserve the sequential order of the tokens
\begin{align}\hat{v_{i}}=v_{i}+\Arrowvert_{n=1}^{\mathcal{N}}\sum_{u_{j}\in N(v_{i})}\alpha_{ij}^{n}W_{v}^{n}u_{j}\quad\alpha_{ij}^{n}=a^{n}(v_{i},u_{j}).\end{align}
(7)
Song et al. [306] enrich the training signal to the relation-aware transformer model [367] through additional multi-view auto-encoding losses [264]. This detachable multi-view framework deconstructs the input graph into triple sets for the first view, reconstructed with a deep biaffine model [73], and linearizes the graph through a depth-first traversal for the second view. In contrast, Ke et al. [158] obtain the entity and relation embeddings through contextual semantic representations with their structure-aware semantic aggregation module added to each transformer layer—the module consists of a mean pooling layer for entity and relation representations, a structure-aware self-attention layer [296], and finally a residual layer that fuses the semantic and structural representations of entities.

5.1.6 Reconstruction and Hierarchical Decoders.

Input reconstruction: Conceptualized from auto-encoders [34, 304, 330], reconstruction-based models quantify the faithfulness of an encoded representation by correlating the decoded representation with the original input. As such, Wiseman et al. [347] adopt decoder reconstruction [321] to the D2T paradigm by segmenting the decoder hidden states \(h_{t}\) into \(\frac{T}{B}\) contiguous blocks \(b_{i}\) of size at most \(B\). The prediction of record \(r\) from such a block \(b_{i}\), \(p(r.e,r.m|b_{i})\), is modeled as softmax(\(f(b_{i})\)), where \(f\) is a convolutional layer followed by an MLP. To replicate the actions of an auto-encoder, Chisholm et al. [52] train a seq2seq-based reverse re-encoding text-to-data model along with a forward seq2seq D2T model. Similarly, Roberti et al. [277] propose a character-level GRU implementation where the recurrent module is passed as a parameter to either the encoder or the decoder depending on the forward \(\hat{y}=f(x)\) or reverse \(\hat{x}=g(y)\) direction. Following the mechanics of back-translation (text-to-data) [290, 321], Bai et al. [12] extend the standard transformer decoder [326] to reconstruct the input graph by jointly predicting the node and edge labels while predicting the next token. The standard training objective of minimizing the negative log-likelihood of the conditional word probabilities \(l_{std}\) is augmented with a node prediction loss \(l_{node}\) that minimizes the word-to-node attention distance and an edge prediction loss \(l_{edge}\) that minimizes the negative log-likelihood over the projected edges. For table-structure reconstruction, Gong et al. [109] define the reconstruction loss based on attribute prediction and content matching, similar to the optimal transport distance [339]. It should be noted that these auxiliary tasks improve model performance in few-shot settings.
Hierarchical decoding: Similar to hierarchical encoding (see Section 5.1.2), hierarchical decoding intends to designate granular roles to each decoder in the hierarchy. Serban et al. [291] show that injecting variations at the conditional output distribution does not capture high-level variations. As such, to model both high- and low-level variations, Shao et al. [293] propose their planning-based hierarchical variational model (PHVM) based on the conditional VAE [305]. PHVM follows a hierarchical multi-step encoder–decoder setup where a plan decoder first generates a subset \(g\) of the input items \(\{d_{1},\ldots,d_{n}\}\in x\). Then, in the hierarchical generation process, a sentence decoder and a word decoder generate the narrative conditioned on plan \(g\). To distribute the decoder responsibilities in the seq2seq paradigm, Su et al. [310] propose a four-layer hierarchical decoder where each layer is responsible for learning different parts of the output speech. The training instances are appended with POS tags such that each layer in the decoder hierarchy is responsible for decoding words associated with a specific set of POS patterns.
Hierarchical attention-based decoding: To alleviate omissions in narrative generation, Liu et al. [192] propose forced attention—with word-level coverage \(\theta_{t}^{i}\) and attribute-level coverage \(\gamma_{t}^{e}\), a new context vector \(\hat{c_{t}}=\pi c_{t}+(1-\pi)v_{t}\) is defined with a learnable vector \(\pi\) and a compensation vector \(v_{t}=f(\theta_{t}^{i},\gamma_{t}^{e})\) for low-coverage attributes \(e\). To enforce this at a global scale, similar to Xu et al. [352], a loss function \(\mathbb{L}_{FA}\) based on \(\gamma_{t}^{e}\) is appended to the seq2seq loss function.

5.1.7 Regularization Techniques.

Similar to regularization in the greater deep learning landscape [110], regularization practices in D2T append additional constraints to the loss function to enhance generation fidelity. As such, Mei et al. [210] introduce a coarse-to-fine aligner to the seq2seq framework that uses a pre-selector and refiner to modulate the standard aligner [11]. The pre-selector assigns each record a probability \(p_{i}\) of being selected, based on which the refiner re-weighs the standard aligner’s likelihood \(w_{ti}\) to \(\alpha_{ti}\). The weighted average \(z_{t}=\sum_{i}\alpha_{ti}m_{i}\) is used as a soft approximation to maintain the differentiability of the architecture. Further, the authors regularize the model with a summation of the learned priors \(\sum_{i=1}^{N}p_{i}\) as an approximation of the number of selected records. Similarly, Perez-Beltrachini and Lapata [244] precompute binary alignment labels for each token in the output sequence indicating its alignment with some attribute in the input record. The prediction of this binary variable is used as an auxiliary training objective for the D2T model. For tabular datasets, Liu et al. [191] propose a two-level hierarchical encoder that breaks the learning of semantic tabular representations into three auxiliary tasks incorporated into the loss function of the model. The auxiliary sequence labeling task \(L_{SL}\), learnt in unison with seq2seq learning, predicts the attribute name for each table cell. Similarly, the auto-encoder supervision \(L_{AE}\) penalizes the distance between the table \(z_{t}\) and the narrative \(z_{b}\) representations, while the multi-label supervision task \(L_{ML}\) predicts all the attributes in the given table. The individual losses, along with the language modeling loss, define the loss function of the framework. To mitigate information hallucination and avoid the high variance exhibited by the use of policy gradients in the reinforcement-learning paradigm, Wang et al. [339] compute two losses in addition to the language modeling loss—the first checks the disagreement between the source table and the corresponding narrative through the L2 loss between their embeddings, similar to Yang et al. [354], while the second uses optimal transport [43]-based maximum flow between the narrative and input distributions \(\mu\) and \(\nu\). Tian et al. [318] propose the use of confidence priors to mitigate hallucination tendencies in T2T generation through learned confidence scores. At each decoding step \(y_{t}\), instead of concatenating all the previous attention weights, only the antecedent attention weight \(a_{t-1}\) is fed back to the RNN, such that an attention score \(A_{t}\) can be used to compute how much \(a_{t}\) affects the context vector \(c_{t}\)—as all the source information in \(c_{t}\) comes from \(a_{t}\). The confidence score \(C_{t}(y_{t})\) is then used to sample target sub-sequences faithful to the source using a variational Bayes scheme [167]. Similarly, inspired by Liu et al. [191], Li et al. [180] propose two auxiliary supervision tasks incorporated into the training loss—number ranking and importance ranking, both crucial to sport summaries, modeled with pointer networks on the outputs of the row and column encoders, respectively.
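The common structure of these objectives is a weighted sum of the language modeling loss and the auxiliary terms; the sketch below, with arbitrary weights and a squared-distance stand-in for the auto-encoder supervision, is schematic rather than the loss of any specific system.

```python
# Schematic composition of a regularized D2T objective: language modeling loss
# plus weighted auxiliary terms. The weights and the L2 auto-encoder term are
# illustrative; the auxiliary losses follow the roles described in the text.
import numpy as np

def autoencoder_supervision(z_table, z_narrative):
    """L_AE: penalize the distance between table and narrative representations."""
    return float(np.sum((z_table - z_narrative) ** 2))

def total_loss(l_lm, l_seq_label, l_ae, l_multilabel, w=(0.5, 0.5, 0.5)):
    w_sl, w_ae, w_ml = w
    return l_lm + w_sl * l_seq_label + w_ae * l_ae + w_ml * l_multilabel

rng = np.random.default_rng(0)
z_table, z_narrative = rng.normal(size=16), rng.normal(size=16)
print(round(total_loss(2.1, 0.4, autoencoder_supervision(z_table, z_narrative), 0.3), 3))
```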

5.1.8 RL.

In the D2T premise, language-conditional RL [200] often aids in model optimization through its role in auxiliary loss functions. While traditionally the BLEU (see Section 6.1) and TF-IDF [260] scores of generated texts were used as the basis for RL [192], Perez-Beltrachini and Lapata [244] use alignment scores of the generated text with the target text. Similarly, Gong et al. [107] use four entity-centric metrics that center around entity importance and mention. Rebuffel et al. [262] propose a model-agnostic RL framework, PARENTing, which uses a combination of a language model loss and an RL loss computed from the PARENT F-score [65] to alleviate hallucinations and omissions in T2T generation. To avoid model overfitting on weaker training samples and to ensure that the rewards reflect improvement made over pretraining, the self-critical training protocol [272] is applied using the REINFORCE algorithm [345]. The improvement in PARENT score of a randomly sampled candidate \(y_{c}\) over a baseline sequence \(y_{b}\) generated using greedy decoding is used as the reward. In contrast, Zhao et al. [365] use generative adversarial networks [111] where the generator is modeled as a policy with the current state being the generated tokens and the action defined as the next token to select. The reward for the policy is a combination of two values—the discriminator probability of the sentence being real and the correspondence between the generated narrative and the input table based on the BLEU score. As RL frameworks based on singular metrics make it difficult to simultaneously tackle the multiple facets of generation, Ghosh et al. [105] linearly combine metrics for recall, repetition, and reconstruction, along with the BLEU score, to form a composite reward function. The policy is adapted from Wang et al. [338] and trained using Maximum Entropy Inverse RL [368].
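To make the self-critical reward concrete, the sketch below scores a sampled candidate against a greedy baseline using a toy fidelity metric standing in for PARENT; the metric, texts, and log-probability are invented for illustration.

```python
# A minimal sketch of self-critical sequence training: the reward is the
# improvement of a sampled candidate over a greedy baseline under some fidelity
# metric (a toy stand-in here, not the actual PARENT score).
def toy_fidelity(narrative, table_values):
    """Fraction of table values mentioned verbatim in the narrative (toy stand-in)."""
    return sum(v in narrative for v in table_values) / len(table_values)

def self_critical_loss(log_prob_sample, sampled, greedy, table_values):
    # Reward: improvement of the sampled candidate over the greedy baseline.
    reward = toy_fidelity(sampled, table_values) - toy_fidelity(greedy, table_values)
    return -reward * log_prob_sample      # REINFORCE with a self-critical baseline

loss = self_critical_loss(
    log_prob_sample=-3.2,                              # log-probability of the sampled text
    sampled="The Raptors improved to 24 wins.",
    greedy="The Raptors won again.",
    table_values=["Raptors", "24", "7"],
)
print(round(loss, 3))   # 1.067: positive reward, so minimizing the loss reinforces the sample
```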

5.1.9 Fine-tuning Pretrained Language Models.

PLMs [64, 257] have been successful in numerous text generation tasks [288, 363]. The extensive pretraining grants these models certain worldly knowledge [247] such that, at times, the models refuse to generate nonfactual narratives even when fed deliberately corrupted inputs [274]. As such, Mager et al. [203] propose an alternate approach to fine-tuning GPT-2 for AMR-to-text generation where the fine-tuning is done on the joint distribution of the AMR \(x_{j}\) and the text \(y_{i}\) as \(\prod_{i}^{N}p(y_{i}|y_{<i},x_{1:M})\cdot\prod_{j}^{M}p(x_{j}|x_{<j})\). On the other hand, inspired by task-adaptive pretraining strategies for text classification [122], Ribeiro et al. [274] introduce supervised and unsupervised task-adaptive pretraining stages as intermediaries between the original pretraining and the fine-tuning for graph-to-text translation. Interestingly, the authors note good performance of the task-adapted PLMs even when trained on shuffled graph representations. Chen et al. [51] note the few-shot learning capabilities of GPT-2 for T2T generation when appended with a soft switching policy for copying tokens [287]. Similarly, as a light-weight alternative to fine-tuning the entire model, Li and Liang [182] take inspiration from prompting [37] and propose prefix-tuning, which freezes the model parameters and optimizes only the prefix, a task-specific vector prepended to the input. The authors note significant improvements in low-data settings when the prefix is initialized with embeddings from task-specific words such as T2T. For avenues that exploit the worldly knowledge of PLMs even without fine-tuning, Xiang et al. [351] leverage a combination of prompting GPT-3 for disambiguation and T5 for sentence fusion, leading to a domain-agnostic framework for D2T generation.
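A conceptual PyTorch sketch of the prefix-tuning idea follows: the backbone and embeddings are frozen, and only a short sequence of prefix vectors prepended to the input embeddings receives gradients. The tiny encoder, dimensions, and objective are placeholders rather than an actual PLM setup.

```python
# Conceptual sketch of prefix-tuning: freeze the pretrained weights and train
# only a small set of prefix vectors prepended to the input embeddings. The
# "backbone" here is a toy stand-in, not an actual pretrained language model.
import torch
import torch.nn as nn

d_model, prefix_len, vocab = 64, 5, 100

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Embedding(vocab, d_model)
for p in list(backbone.parameters()) + list(embed.parameters()):
    p.requires_grad = False                            # freeze the pretrained weights

prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)   # only trainable part
optimizer = torch.optim.AdamW([prefix], lr=1e-3)

tokens = torch.randint(0, vocab, (2, 10))              # a toy batch of token ids
x = torch.cat([prefix.expand(2, -1, -1), embed(tokens)], dim=1)
out = backbone(x)                                      # gradients flow only to `prefix`
loss = out.pow(2).mean()                               # placeholder objective
loss.backward()
optimizer.step()
print(prefix.grad is not None, embed.weight.grad)      # True None
```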
Inspired by practices in unlikelihood learning [323, 341], Nan et al. [223] model T5 as both a generator and a faithfulness discriminator with two additional learning objectives for unlikelihood and replacement detection. To train the model with said objectives, \(n\) contradictory sentences \(Y^{(i,j)}_{False}\) are generated for each entailed sentence \(Y^{(i)}_{True}\), wherein the discrimination probability is computed at every step of token generation. Similarly, to address omissions in D2T (Section 3.2), Jolly et al. [144] adapt the search-and-learn formulation [178] to a few-shot setting through a two-step fine-tuning process wherein a T5 model fine-tuned on the D2T task is fine-tuned a second time with omitted attributes \(r.e/r.m\) reinserted into the narratives as pseudo-groundtruths.

5.1.10 Supplemental Frameworks.

Supplementary modules: Fu et al. [98] propose an adaptation of the seq2seq framework for their partially aligned dataset WITA using a supportiveness adaptor and a rebalanced beam search. The pre-trained adaptor calculates supportiveness scores for each word in the generated text with respect to the input. This score is incorporated into the loss function of the seq2seq module and used to rebalance the probability distributions in the beam search. Framing the generation of narratives as a sentence fusion [17] task, Kasner and Dušek [157] use the pre-trained LaserTagger text editor [205] to iteratively improve a templated narrative. Su et al. [311] adopt a BM25 [278]-based information retrieval system in their prototype-to-generate framework, which, aided by their BERT-based prototype selector, retrieves contextual samples for the input data instance from Wikipedia, allowing for successful few-shot learning in T5. For reasoning over tabulated sport summaries, Li et al. [180] propose a variation on GATs named GatedGAT that operates over an entity graph modeled after the source table to aid the generation model in entity-based reasoning.
Re-ranking and pruning: Dušek and Jurcicek [79] augment the seq2seq paradigm with an RNN-based re-ranker to penalize narratives with missing and/or irrelevant attributes from the beam search output. Based on the Hamming distance between two one-hot vectors representing the presence of slot-value pairs, the classifier employs a logistic layer for a binary classification decision. Their framework, TGen, was the baseline for the E2E challenge [81]. Following this, Juraska et al. [150] first compute slot-alignment scores with a heuristic-based slot aligner, which is used to augment the probability score from the seq2seq model. The aligner consists of a gazetteer that searches for overlapping content between the MR and its respective utterance, WordNet [87] to account for semantic relationships, and hand-crafted rules to cover the outliers. Noting that even copy-based seq2seq models tend to omit values from the input data instance, Gehrmann et al. [103] incorporate the coverage \(cp\) (Equation (8)) and length \(lp\) (Equation (9)) penalties of Wu et al. [350]. With tunable parameters \(\alpha\) and \(\beta\), \(cp\) increases when too many generated words attend to the same input \(a_{i}^{t}\) and \(lp\) increases with the length of the generated text. In contrast to Tu et al. [322], however, the penalties are only used during inference to re-rank the beams. Similar to Paulus et al. [240], the authors prune beams that start with the same bi-gram to promote syntactic variation in the generated text. Similar to natural language inference (NLI)-based approaches, Harkous et al. [127] append a RoBERTa [196]-based semantic fidelity classifier that re-ranks the beam output from a fine-tuned GPT-2 model
\begin{align}cp(x,y) & =\beta\cdot\sum_{i=1}^{|x|}\mathrm{log}\left(\mathrm{min}\left(\sum_{t=1}^{ |y|}a_{i}^{t},1\right)\right)\end{align}
(8)
\begin{align} lp(y) & =\frac{(5+|y|)^{\alpha}}{(5+1)^{\alpha}}.\end{align}
(9)
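Transcribed directly into numpy, the two penalties and a GNMT-style rescoring combination (log-probability divided by the length penalty, plus the coverage penalty; the combination is an assumption for illustration) look as follows; the attention matrix, \(\alpha\), \(\beta\), and hypothesis score are illustrative.

```python
# Direct numpy transcription of the coverage (8) and length (9) penalties used
# to re-rank beams at inference time; alpha, beta, the attention matrix, and the
# rescoring combination are illustrative choices.
import numpy as np

def coverage_penalty(attn, beta=0.2):
    # attn: (|y|, |x|) attention weights a_i^t of generated tokens over inputs
    return beta * np.log(np.minimum(attn.sum(axis=0), 1.0)).sum()

def length_penalty(y_len, alpha=0.6):
    return (5 + y_len) ** alpha / (5 + 1) ** alpha

attn = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.8, 0.1, 0.1]])         # three generated tokens over three inputs
score = -2.4                                # log-probability of the hypothesis
reranked = score / length_penalty(3) + coverage_penalty(attn)
print(round(reranked, 3))                   # -2.362
```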

5.1.11 Ensemble Learning.

Juraska et al. [150] propose SLUG, an ensemble of three neural encoders—two LSTMs and one CNN [175]—individually trained for MR-to-text generation. The authors note that selecting tokens with the maximum log-probability by averaging over the different encoders at each time step results in incoherent narratives; thus, the output candidate is selected based on the rankings among the top 10 candidates from each model. SLUG’s ensemble, along with its data preprocessing and ranking schemes detailed in the sections above, was crowned winner of the 2017 E2E challenge [81]. Similarly, to prompt the models \(f_{1},...,f_{n}\) in the ensemble to learn distinct sentence templates, Gehrmann et al. [103] adopt diverse ensembling [124] where an unobserved random variable \(w\sim\mathrm{Cat}(1/n)\) assigns a weight to each model for each input. Constraining \(w\) to \(\{0,1\}\) trains each model \(f_{i}\) on a subset of the training set, thus leading to each model learning distinct templates. The final narrative is generated with the single model \(f\) that has the best perplexity on the validation set.

5.2 Unsupervised Learning

5.2.1 D2T Specific Pretraining.

Following the successful applications of knowledge-grounded language models [4, 197], Konstas et al. [169] propose a domain-specific pretraining strategy inspired by Sennrich et al. [290] to combat the challenges of data sparsity, wherein self-training is used to bootstrap an AMR parser from the large unlabeled Gigaword corpus [225], which is in turn used to pretrain an AMR generator. Both the generator and parser adopt the stacked-LSTM architecture [350] with a global attention decoder [201]. Similarly, following the success brought forth by the suite of PLMs [37, 64, 257, 259], Chen et al. [48] propose a knowledge-grounded pretraining framework trained on 1.8 million graph-text pairs of their knowledge-grounded dataset KGTEXT, built by matching Wikipedia hyperlinks to WikiData [332]. The framework consists of a graph attention network [327]-based encoder and transformer [326]-based encoders and decoders. Ke et al. [158] propose three graph-specific pretraining strategies based on the KGTEXT dataset—reconstructing masked narratives based on the input graph, conversely, reconstructing masked graph entities based on the narrative, and matching the graph and narrative embeddings with optimal transport. Similarly, Agarwal et al. [2] verbalize the entirety of the Wikidata corpus [332] with two-step fine-tuning for T5 [259] to construct their KeLM corpus. The authors utilize this corpus to train knowledge-enhanced language models for downstream NLG tasks, with significant improvements shown in the performance of REALM [123] and LAMA [247] for both retrieval and question answering tasks. Similarly, specifically geared toward logical inference from tables, Liu et al. [187] propose PLoG, wherein a PLM is first pre-trained on table-to-logic conversion intended to aid logical T2T generation for downstream datasets such as LogicNLG.

5.2.2 Auto-Encoders.

For a D2T framework based solely on unlabeled text, Freitag and Roy [95] adapt the training procedure of a denoising auto-encoder [329] to the seq2seq framework with the notion of reconstructing each training example from a partially destroyed input. For each training instance \(x_{i}\), a percentage \(p\) (sampled from a Gaussian distribution) of words is removed at random to obtain a partially destroyed version \(\hat{x}_{i}\). However, the authors note that the carry-over of this unsupervised approach to further D2T tasks, such as the WebNLG challenge [99], could be limited by the fact that the slot names in WebNLG contribute to the MR.
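The noising step itself is simple to sketch: a corruption rate is drawn per example and that fraction of words is dropped before the model is asked to reconstruct the original; the Gaussian parameters and example sentence below are illustrative.

```python
# Sketch of the noising step of a denoising auto-encoder for text: a corruption
# rate p is drawn from a Gaussian (illustrative mean/std) and that fraction of
# words is dropped at random to form the corrupted input x_hat.
import random

def corrupt(sentence, mean=0.4, std=0.1, rng=random.Random(13)):
    words = sentence.split()
    p = min(max(rng.gauss(mean, std), 0.0), 1.0)       # per-example corruption rate
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else words[0]

print(corrupt("alimentum is a family friendly restaurant near the riverside"))
```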

5.3 Innovations Outside of the Seq2Seq Framework

5.3.1 Template Induction.

For enhanced interpretability and control in D2T, Wiseman et al. [348] propose a neural parameterization of the hidden semi-Markov model (HSMM) [221] that jointly learns latent templates with generation. With discrete latent states \(z_{t}\), length variable \(l_{t}\), and a deterministic binary variable \(f_{t}\) that indicates whether a segment ends at time \(t\), the HSMM is modeled as a joint likelihood (Equation (10)). Further, associated phrases from \(x\) can be mapped to latent states \(z_{t}\) such that common templates can be extracted from the training dataset as a sequence of latent states \(z^{i}=\{z_{1}^{i},...,z_{S}^{i}\}\). Thus, the model can then be conditioned on \(z^{i}\) to generate text set to the template. Following this, Fu et al. [97] propose template induction by combining the expressive capacity of probabilistic models [222] with graphical models in an end-to-end fashion using a conditional random field model, with Gumbel-softmax used to relax the categorical sampling process [141]. The authors note performance gains over HSMM-based baselines while also noting that neural seq2seq models fare better than both
\begin{align} P(y,z,l,f|x)=\prod_{t=0}^{T-1}P(z_{t+1},l_{t+1}|z_{t},l_{t},x)^{ f_{t}}\times\prod_{t=1}^{T}P(y_{t-l_{t}+1:t}|z_{t},l_{t},x)^{f_{t}}.\end{align}
(10)

5.3.2 Discrete Neural Pipelines.

While the traditional D2T pipeline observes discrete modeling of the content planning and linguistic realization stages [270], neural methods consolidate these discrete steps into end-to-end learning. With their neural referring expressions generator NeuralREG [89] appended to discrete neural pipelines, and the GRU [54] and transformer [326] as base models for both, Ferreira et al. [91] compare neural implementations of these discrete pipelines to end-to-end learning. In their findings, the authors note that the neural pipeline methods generalize better to unseen domains than end-to-end methods, thus alleviating hallucination tendencies. This is also corroborated by findings from Elder et al. [82].

5.3.3 Computational Pragmatics.

Pragmatic approaches to linguistics naturally correct under-informativeness problems [117, 134] and are often employed in grounded language learning [206, 215]. Shen et al. [297] adopt the reconstructor-based [96] and distractor-based [58] models of pragmatics to MR-to-text generation. These models extend the base speaker model \(S_{0}\) using reconstructor \(R\) and distractor \(D\) based listener models \(L(x|y)\in\{L^{R},L^{D}\}\) to derive pragmatic speakers \(S_{1}(y|x)\in\{S_{1}^{R},S_{1}^{D}\}\) (Equations (11) and (12)), where \(\lambda\) and \(\alpha\) are rationality parameters controlling how much the model optimizes for discriminative outputs
\begin{align}S_{1}^{R}(y|x) & =L^{R}(x|y)^{\lambda}\cdot S_{0}(y|x)^{1-\lambda}\end{align}
(11)
\begin{align}S_{1}^{D}(y|x) & \propto L^{D}(x|y)^{\alpha}\cdot S_{0}(y|x).\end{align}
(12)
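A toy reranking example under the reconstructor-based formulation (Equation (11)) is given below; the candidate narratives and the base speaker and listener probabilities are illustrative placeholders rather than model outputs.

```python
# Toy sketch of reconstructor-based pragmatic reranking (Equation (11)):
# candidates are scored by combining the base speaker probability S0(y|x) with a
# listener's reconstruction probability L(x|y). The probabilities below are
# illustrative placeholders, not actual model outputs.
def pragmatic_score(s0, listener, lam=0.6):
    return (listener ** lam) * (s0 ** (1 - lam))

candidates = {
    "a cheap riverside pub":             {"s0": 0.50, "listener": 0.30},
    "a cheap pub in the riverside area": {"s0": 0.35, "listener": 0.80},
}
best = max(candidates, key=lambda y: pragmatic_score(**candidates[y]))
print(best)   # the more informative candidate wins despite a lower S0
```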

6 Evaluation of D2T Systems

In this section, we take a deeper look into the specifics of evaluation for D2T systems. Traditionally, the evaluation of D2T systems is compartmentalized into either intrinsic or extrinsic measures [26]. The former either uses automated metrics to compare the generated narrative to a reference text or employs human judgment [146]—both evaluating the properties of the system output. The latter focuses on the ability of the D2T system to fulfill its intended purpose of imparting information—to what degree the system achieves the overarching task for which it was developed. From an analysis of 79 papers spanning 2005–2014, Gkatzia and Mahamood [106] note the overwhelming prevalence of intrinsic evaluation, with 75.7% of articles reporting it compared to 15.1% that report an extrinsic measure. This is unsurprising, as intrinsic evaluation can be automated and is often convenient, not requiring additional crowd-sourced human labor or the collection of feedback from deployed systems. As such, Reiter [267] notes the importance of extrinsic (pragmatic) evaluation and its absence in the field. This scarcity of literature on extrinsic evaluation measures leads us to focus on innovations in improving the quality of intrinsic evaluation metrics (Section 6.2). For a broader view of evaluation in the greater NLG landscape, we refer the readers to the recent survey of evaluation practices for text generation systems by Celikyilmaz et al. [39].

6.1 BLEU: The False Prophet for D2T

With the abundance of paired datasets where each data instance is accompanied by a human-generated reference text, often referred to as the gold standard, the NLG community has sought quick, cheap, and effective metrics for the evaluation of D2T systems. The automated metrics adopted by the MT community, such as BLEU, NIST, and ROUGE, by virtue of their correlation with human judgment [69, 185, 238], similarly carried over to the D2T community. Among them, Belz and Gatt [22] note that NIST best correlates with human judgments on D2T texts, when compared against judgments from 9 domain experts and 21 non-experts. However, they note that these n-gram-based metrics perform worse in D2T than in MT due to the domain-specific nature of D2T systems, wherein the generated texts are judged better by humans than human-written texts.
From a review of 284 correlations reported in 34 papers, Reiter [268] notes that the correlations between BLEU and human evaluations are inconsistent—even in similar tasks. While automated metrics can aid in the diagnostic evaluation of MT systems, the author showcases the weakness of BLEU in the evaluation of D2T systems. This notion has been echoed several times [269, 286]. On top of this, undisclosed parameterization of these metrics and variability in the tokenization and normalization schemes applied to the references can alter the score by up to 1.8 BLEU points for the same framework [250]. Similarly, it has also been shown that ROUGE tends to favor systems that produce longer summaries [313]. Further complicating the evaluation of D2T is the fact that modern frameworks are neural—comparing score distributions, even with the aid of statistical significance tests, is not as meaningful due to the non-deterministic nature of neural approaches and the accompanying randomized training procedures [265].

6.2 Innovations in Intrinsic Evaluation

Noting the shortcomings of prevalent word-overlap metrics (Section 6.1), alternative automated metrics for intrinsic evaluation have been proposed (Sections 6.2.1 and 6.2.2). To account for divergence in reference texts, Dhingra et al. [65] propose PARENT—a metric that computes precision and recall of the generated narrative \(\hat{y}\) with both the gold narrative \(y\) and its entailment to the semi-structured tabular input \(x\).

6.2.1 Extractive Metrics.

With dialogue generation models adopting classification-backed automated metrics [152, 179], Wiseman et al. [347] propose a relation extraction system similar to those of [60, 72], wherein the record type \(r.t\) is predicted using its corresponding entity \(r.e\) and value \(r.m\) as \(p(r.t|e,m;\theta)\). With such a relation extraction system, the authors propose three metrics for automated evaluation:
CS is represented by the precision and recall of unique relations extracted from \(\hat{y}_{1:t}\) that are also extracted from \(y_{1:t}\).
Relation generation (RG) is represented by the precision and number of unique relations extracted from \(\hat{y}_{1:t}\) that can be traced to \(x\).
Content ordering (CO), similarly, is represented by the normalized Damerau-Levenshtein distance [36] between the sequences of records extracted from \(y_{1:t}\) and \(\hat{y}_{1:t}\).
The authors note that, given the two facets of D2T, CS pertains to what to say and CO to how to say it, while RG pertains to both (factual correctness).
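Given records already extracted from the generated and reference texts (the extraction model itself is assumed), CS and CO reduce to set overlap and a normalized edit distance, as sketched below; RG would additionally check each extracted record against the input table. The toy records are invented for illustration.

```python
# Self-contained sketch of the extraction-based metrics, assuming the records
# have already been extracted from the generated and reference texts; CO uses a
# plain normalized Damerau-Levenshtein distance (OSA variant).
def cs(pred_records, gold_records):
    pred, gold = set(pred_records), set(gold_records)
    overlap = len(pred & gold)
    return overlap / len(pred), overlap / len(gold)   # precision, recall

def dam_lev(a, b):
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[-1][-1]

def co(pred_seq, gold_seq):
    return 1 - dam_lev(pred_seq, gold_seq) / max(len(pred_seq), len(gold_seq))

gold = [("Raptors", "WINS", "24"), ("Raptors", "LOSSES", "7")]
pred = [("Raptors", "LOSSES", "7"), ("Raptors", "WINS", "24")]
print(cs(pred, gold))            # (1.0, 1.0): same relations selected
print(round(co(pred, gold), 2))  # 0.5: ordering differs by one transposition
```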

6.2.2 Contextualized Metrics.

Böhm et al. [32] note that while modern frameworks for text generation compete for higher scores on automated word-overlap metrics, the quality of the generation leaves a lot to be desired. As such, the adoption of metrics based on continuous representations shifts the focus from surface-form matching to semantic matching. Zhang et al. [362] introduce BERTScore, which computes similarity scores for tokens in the system and reference texts based on their BERT [64] embeddings, while Mathur et al. [209] devise supervised and unsupervised metrics for NMT based on the same BERT embeddings—both having substantially higher correlation with human judgment than standard word-overlap metrics (see Section 6.1). Following this, Clark et al. [56] extend the word mover’s distance [173] to multi-sentence evaluation using ELMo representations [245]. Zhao et al. [366] propose MoverScore, which uses contextualized embeddings from BERT where the aggregated representations are computed based on power means [281]. Dušek and Kasner [80] employ RoBERTa [196] for NLI, where, for a given hypothesis and premise, the model computes scores for entailment between the two. While lower scores for forward entailment can point to omissions, backward entailment scores correspondingly point to hallucinations. Similarly, Chen et al. [47] propose parsing-based and adversarial metrics to evaluate model correctness in logical reasoning.
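As a simplified illustration of such embedding-based metrics, the numpy sketch below performs BERTScore-style greedy matching over (hypothetical) contextual token embeddings; real implementations obtain the embeddings from a PLM such as BERT and typically add importance weighting.

```python
# Simplified sketch of BERTScore-style greedy matching over contextual token
# embeddings; the random embeddings stand in for PLM outputs.
import numpy as np

def greedy_f1(cand_emb, ref_emb):
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                       # pairwise cosine similarities
    precision = sim.max(axis=1).mean()       # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()          # each reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(7)
cand_emb, ref_emb = rng.normal(size=(6, 768)), rng.normal(size=(8, 768))
print(round(float(greedy_f1(cand_emb, ref_emb)), 3))
```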

6.2.3 Human Judgment.

While human judgments are often considered the ultimate D2T evaluation measure, they are subject to a high degree of inconsistency (even for the same utterance), which may be attributed to the judges’ individual preferences [63, 333]—an issue that could be circumvented through a larger sample size, though such an endeavor is accompanied by a proportional increase in the cost of data acquisition. As such, there have been several recommendations on the proper usage of ratings and Likert scales [143, 146, 165]. From an analysis of 135 papers from specialized NLG conferences, Amidei et al. [5] note that several studies employ the Likert scale on an item-by-item basis in contrast to its design as an aggregate scale, and that the analyses performed on these scales with parametric statistics do not disclose assumptions about the underlying population distribution. Howcroft et al. [135], on the other hand, note that the definitions of fluency and accuracy for which these scales are employed lack consistency among the papers investigated. To mitigate these inconsistencies through good experimental design, Novikova et al. [231] propose RankME, a relative-ranking-based magnitude estimation method that combines the use of continuous scales [23, 113], magnitude estimation [301], and relative assessment [38]. Further, to address the quadratic growth of data required for cross-system comparisons, the authors adopt TrueSkill [132], a Bayesian data-efficient ranking algorithm used in MT evaluation [33], into RankME. The HumEval workshop [21, 24, 25] has been an invaluable resource in investigating the shortcomings of human evaluations and in mapping out better practices.

6.3 Emphasis on Reproducibility

The last half-decade has seen the ML community place significant emphasis on the reproducibility of academic results [248, 302]. However, the focus of these reproducibility efforts is placed on automated metrics (Sections 6.1, 6.2.1, and 6.2.2), with the reproducibility of human evaluation results receiving far less attention. As human evaluation is often considered the ultimate measure of D2T, Belz et al. [27] initiate ReproGen, a shared task focused on reproducing the results of human evaluations—intended to shed better light on their reproducibility and on possible interventions in the design and execution of human evaluations to make them more reproducible. The authors note the (expected) inconsistency of evaluators across different studies and hence point toward metadata standardization through datasheets such as HEDS [299]. Providing a complementary view, van der Lee et al. [325] review human evaluation practices from 304 publications in the International Conference on Natural Language Generation and the Annual Meeting of the Association for Computational Linguistics from 2018 to 2019, outline the severe discrepancies in evaluator demographics, sample sizes, design practices, and evaluation criteria, and put forth some common ground through a set of best practices for conducting human evaluations.

7 Conclusion and Future Directions

As delineated in Sections 3 and 4, innovations in D2T take inspiration from several facets of NLG and ML. From alterations to the seq2seq, pretraining-fine-tuning, auto-encoding, ensemble-learning, and reinforcement-learning paradigms, to domain-specific data preprocessing and data encoding strategies, the prospects for innovation in D2T appear as grand as those for the NLG landscape itself. Alongside, progress in non-anglocentric datasets, datacards that reinforce accountability, and metrics that offer heuristic evaluation aid in elevating D2T standards. As NLG research evolves, so will D2T, and vice versa. In the following sections, we impart our thoughts on future directions for each facet of D2T—the desiderata for D2T dataset design (Section 7.1), a forward look at the possibilities for approaches and architectures for D2T (Section 7.2), and, finally, closing thoughts on the future of D2T evaluation (Section 7.3).

7.1 Desiderata for D2T Datasets

In Section 2, we outlined the development of parallel corpora with data-narrative pairs alongside dominant benchmark datasets in each task category. In addition to these benchmarks, it is equally crucial to acknowledge niche datasets—Obeid and Hoque [233] compile a collection of 8,305 charts with their respective narratives, followed by Chart-to-Text [153], which encompasses 44,096 multi-domain charts. The shared gains and pitfalls in dataset design across the D2T task categories, as discussed in Section 2, offer insights that can aid the construction of future datasets with the potential to challenge the current paradigm:
Domain agnosticism: Although domain-specific datasets allow the models to learn and leverage domain-specific conventions for performance gains in niche tasks, the resulting models are less malleable to unseen domains. To be adaptable and deployable for unseen niche tasks that may vary based on user requirements, it is crucial that the datasets used to train D2T models are not restricted to a single domain to avoid over-fitting on domain-specific keywords.
Dataset consistency: Often, the greatest challenges for D2T systems, namely hallucination and omission (see Section 3.2), can be traced back to the datasets. Datasets facing divergence (as outlined in Section 2.3), wherein the narratives are not consistent with the data instances or vice versa, often lead to models that hallucinate or omit important aspects of the data [280].
Human-crafted references: Often, to replicate human linguistics, datasets in NLP/NLG contain human annotations (narratives), considered as gold references. Reiter [267] notes that D2T datasets, unintentionally, may contain machine-generated annotations, such as those for WeatherGov, and urges the community to focus on human-centered narratives.
Linguistic diversity: It is vital to acknowledge that the majority of the D2T benchmark datasets are anglo-centric. Joshi et al. [147] note that models built on non-anglo-centric datasets, which are few and far between, have the potential to impact many more people than models built on highly resourced languages. The WebNLG 2020 challenge,10 for instance, encourages submissions for both English and Russian parsing.

7.2 Approaches to D2T Generation: Looking Forward

In Sections 4 and 5 above, we have extensively outlined the recent innovations in D2T both inside and outside of seq2seq modeling. Looking forward, with the emergence of highly capable large language models (LLMs) such as ChatGPT [235], we discuss below the reconciliation of these emergent technologies with the current D2T paradigm:
Few-shot learning: D2T generation is a task that requires extrapolation beyond general linguistic understanding and commonsense reasoning; thus, general LLM prompting strategies [340] may not be suited for this endeavor. Given the data sparsity prevalent in D2T, extensions of prefix-tuning [182] and in-context sample search [189] may be especially favorable for building a strong subset of samples for few-shot learning in LLMs.
Deviation from task-specific architectures: The paradigm for D2T generation, as it stands now, prefers custom architectures, and rightfully so—they allow focused modeling of entities (Section 5.1.1) and dedicated attention mechanisms (Sections 5.1.2 and 5.1.6) to combat the data infidelity that occurs as a consequence of RNNs’ lack of long-form coherence. Transformer-based LLMs, however, may inadvertently model these dependencies and attention mechanisms as a function of their self-attention modules, thus allowing convergence toward a universal architecture.
Effective linearization: In line with the point above, the preference for plan-based approaches to D2T (Section 5.1.3) stems from issues in coherence. Extensive work on linearizing graphs and tables (Section 4.2) shows that linearization for LLMs can be as effective as, if not more effective than, dedicated encoders designed to capture inductive biases (Sections 5.1.1, 5.1.2, and 5.1.5), and recent work suggests that LLMs can handle unexpected tasks simply through their linear transcription into linguistic sentences (the LIFT framework) [67]. This line of work has extensive potential for modeling plans through their transcription into sentences.
Numeracy for D2T generation: The NLP niche of building LLMs capable of quantitative reasoning (often referred to as Math-AI or Math-NLP) has garnered significant interest from the research community [295, 317, 334]. Although there are works in D2T that incorporate this aspect [107], the two research niches are often disparate. D2T, a field that aims to combat hallucination of data points, has a lot to gain from the advances in Math-NLP that enable models to better reason about said data points.
Interfacing with external APIs: In line with the above point, interfacing LLMs with computational APIs (Wolfram Alpha [349] and Toolformer [284]) has showcased significant enhancement potential for these already capable models. This paves a path in D2T where we deviate from viewing the generation of narratives as a sequential input-to-output mapping toward a more involved loop comprising numerical and logical reasoners, computational engines, pattern matchers, and validators that combine to form the greater NLG pipeline.

7.3 The Future of D2T Evaluation

While NLP systems generally have benchmark datasets that closely resemble the target tasks for which these systems are intended to be deployed (summarization, sentiment analysis, and language translation), D2T systems are highly specialized to the incoming data stream, which differs from user to user. Thus, a one-size-fits-all approach to benchmarking, especially with automated metrics on benchmark datasets, cannot showcase the utility of these systems in the real world, leading to an urgency for practical tools for extrinsic evaluation. Additionally, besides the need for fluency and fidelity, systems placed in the real world require accountability [214]. The current leaderboard system poses the risk of blind metric optimization with disregard for model size and fairness [85]. For a holistic approach to evaluation, Gehrmann et al. [102] propose a living benchmark, GEM, similar to the likes of Dynabench [161], providing challenge sets (curated test sets intended to be challenging) and benchmark datasets accompanied by their D2T-specific data cards [28].
Further, as unified D2T frameworks become more decentralized with growing user bases, the designers of these systems can utilize user-interaction logs as measures for extrinsic evaluation, similar to the likes of ChatGPT [235]. While the D2T community places greater emphasis on measures that evaluate the quality of the generated narrative, the utility of these narratives can be evaluated through task effectiveness with and without their presence [55, 336].

Footnotes

1
This survey exclusively focuses on academic innovations for D2T generation as the technologies underlying commercial frameworks are often proprietary.
9
Though the base transformer architecture is oblivious to input structures, we assume positionally encoded transformers to fall into the seq2seq paradigm.

References

[1]
Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. 2016. Internet Argument Corpus 2.0: An Sql Schema for Dialogic Social Media and the Corpora to go with it. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 4445–4452.
[2]
Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-Training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3554–3565.
[3]
Shubham Agarwal and Marc Dymetman. 2017. A Surprisingly Effective Out-of-the-Box Char2char Model on the E2E NLG Challenge Dataset. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. 158–163.
[4]
Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A Neural Knowledge Language Model. arXiv:1608.00318.
[5]
Jacopo Amidei, Paul Piwek, and Alistair Willis. 2019. The Use of Rating and Likert Scales in Natural Language Generation Human Evaluation Tasks: A Review and Some Recommendations. In Proceedings of the 12th International Conference on Natural Language Generation. 397–402.
[6]
Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the Literature Graph in Semantic Scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 3, Industry Papers. 84–91.
[7]
Ewa Andrejczuk, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, and Yasemin Altun. 2022. Table-To-Text Generation and Pre-Training with TabT5. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 6758–6766. Retrieved from https://aclanthology.org/2022.findings-emnlp.503
[8]
Gabor Angeli, Percy Liang, and Dan Klein. 2010. A Simple Domain-Independent Probabilistic Approach to Generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. 502–512.
[9]
Luca Anselma and Alessandro Mazzei. 2018. Designing and Testing the Messages Produced by a Virtual Dietitian. In Proceedings of the 11th International Conference on Natural Language Generation (INLG’18). Association for Computational Linguistics, 244–253.
[10]
Tatsuya Aoki, Akira Miyazawa, Tatsuya Ishigaki, Keiichi Goshima, Kasumi Aoki, Ichiro Kobayashi, Hiroya Takamura, and Yusuke Miyao. 2018. Generating Market Comments Referring to External Resources. In Proceedings of the 11th International Conference on Natural Language Generation. Association for Computational Linguistics, 135–139.
[11]
Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the Third International Conference on Learning Representations (ICLR’15).
[12]
Xuefeng Bai, Linfeng Song, and Yue Zhang. 2020. Online Back-Parsing for AMR-to-Text Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1206–1219.
[13]
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for Sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. 178–186.
[14]
Junwei Bao, Duyu Tang, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2018. Text Generation from Tables. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 2 (2018), 311–320.
[15]
Regina Barzilay and Mirella Lapata. 2005. Collective Content Selection for Concept-to-Text Generation. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 331–338.
[16]
Regina Barzilay and Lillian Lee. 2004. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL’04). Association for Computational Linguistics, 113–120.
[17]
Regina Barzilay and Kathleen R. McKeown. 2005. Sentence Fusion for Multidocument News Summarization. Computational Linguistics 31, 3 (2005), 297–328.
[18]
Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. 2018. Relational Inductive Biases, Deep Learning, and Graph Networks. arXiv:1806.01261
[19]
Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-Sequence Learning using Gated Graph Neural Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 273–283.
[20]
Anja Belz. 2008. Automatic Generation of Weather Forecast Texts using Comprehensive Probabilistic Generation-Space Models. Natural Language Engineering 14, 4 (2008), 431–455.
[21]
Anja Belz, Shubham Agarwal, Yvette Graham, Ehud Reiter, and Anastasia Shimorina. 2021a. Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval). In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval).
[22]
Anja Belz and Albert Gatt. 2008. Intrinsic vs. Extrinsic Evaluation Measures for Referring Expression Generation. In Proceedings of ACL-08: HLT Short Papers. 197–200.
[23]
Anja Belz and Eric Kow. 2011. Discrete vs. Continuous Rating Scales for Language Evaluation in NLP. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 230–235.
[24]
Anja Belz, Maja Popović, Ehud Reiter, and Anastasia Shimorina. 2022. Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval). In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval).
[25]
Anya Belz, Maja Popović, Ehud Reiter, Craig Thomson, and João Sedoc (Eds.). 2023. Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems. INCOMA Ltd., Shoumen, Bulgaria; Varna, Bulgaria. Retrieved from https://aclanthology.org/2023.humeval-1.0
[26]
Anja Belz and Ehud Reiter. 2006. Comparing Automatic and Human Evaluation of NLG Systems. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. 313–320.
[27]
Anja Belz, Anastasia Shimorina, Shubham Agarwal, and Ehud Reiter. 2021b. The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results. In Proceedings of the 14th International Conference on Natural Language Generation. 249–258.
[28]
Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics 6 (2018), 587–604.
[29]
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
[30]
Arianna Bisazza and Marcello Federico. 2016. A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena. Computational Linguistics 42, 2 (2016), 163–205.
[31]
Thomas G. Moher, David C. Mak, Brad Blumenthal, and Laura Marie Leventhal. 1993. Comparing the Comprehensibility of Textual and Graphical Programs: The Case of Petri Nets. In Empirical Studies of Programmers: Fifth Workshop. Ablex Publishing Corporation, 137–161.
[32]
Florian Böhm, Yang Gao, Christian M. Meyer, Ori Shapira, Ido Dagan, and Iryna Gurevych. 2019. Better Rewards Yield Better Summaries: Learning to Summarise Without References. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3110–3120.
[33]
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation, Vol. 2, Shared Task Papers. 131–198.
[34]
Hervé Bourlard and Yves Kamp. 1988. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics 59, 4 (1988), 291–294.
[35]
Daniel Braun, Ehud Reiter, and Advaith Siddharthan. 2018. SaferDrive: An NLG-Based Behaviour Change Support System for Drivers. Natural Language Engineering 24, 4 (2018), 551–588.
[36]
Eric Brill and Robert C. Moore. 2000. An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics. 286–293.
[37]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[38]
Chris Callison-Burch, Cameron Shaw Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) Evaluation of Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation. 136–158.
[39]
Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of Text Generation: A Survey. arXiv:2006.14799.
[40]
Wallace Chafe. 1976. Givenness, contrastiveness, definiteness, subjects, topics, and point of view. In Subject and Topic. C. N. Li (Ed.), Academic Press, New York, NY, 25–56.
[41]
Sarath Chandar A. P., Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An Autoencoder Approach to Learning Bilingual Word Representations. Advances in Neural Information Processing Systems 27 (2014).
[42]
David L. Chen and Raymond J. Mooney. 2008. Learning to Sportscast: A Test of Grounded Language Acquisition. In Proceedings of the 25th International Conference on Machine Learning. 128–135.
[43]
Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2018. Improving Sequence-to-Sequence Learning via Optimal Transport. In International Conference on Learning Representations.
[44]
Miao Chen, Xinjiang Lu, Tong Xu, Yanyan Li, Jingbo Zhou, Dejing Dou, and Hui Xiong. 2022. Towards Table-to-Text Generation with Pretrained Language Model: A Table Structure Understanding and Text Deliberating Approach. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates.
[45]
Mingda Chen, Sam Wiseman, and Kevin Gimpel. 2021b. WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections. In Findings of the Association for Computational Linguistics (ACL-IJCNLP’21). 193–209.
[46]
Shuang Chen, Jinpeng Wang, Xiaocheng Feng, Feng Jiang, Bing Qin, and Chin-Yew Lin. 2019b. Enhancing Neural Data-to-Text Generation Models with External Background Knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3022–3032.
[47]
Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and William Yang Wang. 2020a. Logical Natural Language Generation from Open-Domain Tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7929–7942.
[48]
Wenhu Chen, Yu Su, Xifeng Yan, and William Yang Wang. 2020c. KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 8635–8648.
[49]
Wenqing Chen, Jidong Tian, Yitian Li, Hao He, and Yaohui Jin. 2021a. De-Confounded Variational Encoder-Decoder for Logical Table-to-Text Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1, Long Papers. 5532–5542.
[50]
Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019a. Tabfact: A Large-Scale Dataset for Table-Based Fact Verification. arXiv:1909.02164.
[51]
Zhiyu Chen, Harini Eavani, Wenhu Chen, Yinyin Liu, and William Yang Wang. 2020b. Few-Shot NLG with Pre-Trained Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 183–190.
[52]
Andrew Chisholm, Will Radford, and Ben Hachey. 2017. Learning to Generate One-Sentence Biographies from Wikidata. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1, Long Papers. 633–642.
[53]
Kyunghyun Cho. 2016. Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model. arXiv:1605.03835.
[54]
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 103–111.
[55]
Arjun Choudhry, Mandar Sharma, Pramod Chundury, Thomas Kapler, Derek W. S. Gray, Naren Ramakrishnan, and Niklas Elmqvist. 2020. Once Upon a Time in Visualization: Understanding the Use of Textual Narratives for Causality. IEEE Transactions on Visualization and Computer Graphics 27, 2 (2020), 1332–1342.
[56]
Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2748–2760.
[57]
Herbert H. Clark. 1977. Comprehension and the Given-New Contract. In Discourse Production and Comprehension. Discourse Processes: Advances in Research and Theory, Roy O. Freedle (Ed.). Ablex Publishing, Norwood, NJ, 1–40.
[58]
Reuben Cohn-Gordon, Noah Goodman, and Christopher Potts. 2018. Pragmatically Informative Image Captioning with Character-Level Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, Short Papers. 439–443.
[59]
Emilie Colin and Claire Gardent. 2019. Generating Text from Anonymised Structures. In Proceedings of the 12th International Conference on Natural Language Generation. 112–117.
[60]
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
[61]
Marco Damonte and Shay B. Cohen. 2019. Structural Neural Encoders for AMR-to-Text Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long and Short Papers. 3649–3658.
[62]
Jan Milan Deriu and Mark Cieliebak. 2018. Syntactic Manipulation for Generating More Diverse and Interesting Texts. In 11th International Conference on Natural Language Generation (INLG’18). Association for Computational Linguistics, 22–34.
[63]
Nina Dethlefs, Heriberto Cuayáhuitl, Helen Hastie, Verena Rieser, and Oliver Lemon. 2014. Cluster-Based Prediction of User Ratings for Stylistic Surface Realisation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. 702–711.
[64]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long and Short Papers. 4171–4186.
[65]
Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4884–4895.
[66]
Nicholas Diakopoulos. 2019. Automating the News. Harvard University Press.
[67]
Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. 2022. LIFT: Language-Interfaced Fine-Tuning for Non-language Machine Learning Tasks. In Advances in Neural Information Processing Systems.
[68]
Bayu Distiawan, Jianzhong Qi, Rui Zhang, and Wei Wang. 2018. GTR-LSTM: A Triple Encoder for Sentence Generation from RDF Data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 1627–1637.
[69]
George Doddington. 2002. Automatic Evaluation of Machine Translation Quality using n-Gram Co-Occurrence Statistics. In Proceedings of the Second International Conference on Human Language Technology Research. 138–145.
[70]
Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, and Min Yang. 2021. A Survey of Natural Language Generation. arXiv:2112.11739.
[71]
Li Dong and Mirella Lapata. 2018. Coarse-to-Fine Decoding for Neural Semantic Parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 731–742.
[72]
Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Vol. 1, Long Papers. 626–634.
[73]
Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the 5th International Conference on Learning Representations.
[74]
Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question Generation for Question Answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 866–874.
[75]
Pablo Ariel Duboue and Kathleen R. McKeown. 2003. Statistical Acquisition of Content Selection Rules for Natural Language Generation. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. 121–128.
[76]
Sebastian Duerr and Peter A. Gloor. 2021. Persuasive Natural Language Generation–A Literature Review. arXiv:2101.05786.
[77]
Song Duong, Alberto Lumbreras, Mike Gartrell, and Patrick Gallinari. 2023. Learning from Multiple Sources for Data-to-Text and Text-to-Data. In International Conference on Artificial Intelligence and Statistics (AISTATS).
[78]
Ondřej Dušek, David M. Howcroft, and Verena Rieser. 2019. Semantic Noise Matters for Neural Natural Language Generation. In Proceedings of the 12th International Conference on Natural Language Generation. 421–426.
[79]
Ondřej Dušek and Filip Jurcicek. 2016. Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 2, Short Papers. 45–51.
[80]
Ondřej Dušek and Zdeněk Kasner. 2020. Evaluating Semantic Accuracy of Data-to-Text Generation with Natural Language Inference. In Proceedings of the 13th International Conference on Natural Language Generation. 131–137.
[81]
Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2018. Findings of the E2E NLG Challenge. In Proceedings of the 11th International Conference on Natural Language Generation. 322–328.
[82]
Henry Elder, Jennifer Foster, James Barry, and Alexander O’Connor. 2019. Designing a Symbolic Intermediate Representation for Neural Surface Realization. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. 65–73.
[83]
Henry Elder and Chris Hokamp. 2018. Generating High-Quality Surface Realizations Using Data Augmentation and Factored Sequence Models. In Proceedings of the First Workshop on Multilingual Surface Realisation. Association for Computational Linguistics, 49–53.
[84]
Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh Kumar, Anuj Goyal, Peter Ku, and Dilek Hakkani-Tur. 2020. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. In Proceedings of the 12th Language Resources and Evaluation Conference. 422–428.
[85]
Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the Eye of the User: A Critique of NLP Leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4846–4853.
[86]
Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2019. Using Local Knowledge Graph Construction to Scale Seq2Seq Models to Multi-Document Inputs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing.
[87]
Christiane Fellbaum. 1998. A Semantic Network of English: The Mother of All WordNets. In EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Springer, 137–148.
[88]
Thiago Castro Ferreira, Iacer Calixto, Sander Wubben, and Emiel Krahmer. 2017. Linguistic Realisation as Machine Translation: Comparing Different MT Models for AMR-to-Text Generation. In Proceedings of the 10th International Conference on Natural Language Generation. 1–10.
[89]
Thiago Castro Ferreira, Diego Moussallem, Ákos Kádár, Sander Wubben, and Emiel Krahmer. 2018a. NeuralREG: An End-to-End Approach to Referring Expression Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 1959–1969.
[90]
Thiago Castro Ferreira, Diego Moussallem, Emiel Krahmer, and Sander Wubben. 2018b. Enriching the WebNLG Corpus. In Proceedings of the 11th International Conference on Natural Language Generation. 171–176.
[91]
Thiago Castro Ferreira, Chris van der Lee, Emiel Van Miltenburg, and Emiel Krahmer. 2019. Neural Data-to-Text Generation: A Comparison Between Pipeline and End-to-End Architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 552–562.
[92]
Thiago Castro Ferreira, Helena Vaz, Brian Davis, and Adriana Pagano. 2021. Enriching the E2E Dataset. In Proceedings of the 14th International Conference on Natural Language Generation. 177–183.
[93]
Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2005. Incorporating Non-Local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). 363–370.
[94]
Mary Ellen Foster. 2019. Natural Language Generation for Social Robotics: Opportunities and Challenges. Philosophical Transactions of the Royal Society B 374, 1771 (2019), 20180027.
[95]
Markus Freitag and Scott Roy. 2018. Unsupervised Natural Language Generation with Denoising Autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3922–3929.
[96]
Daniel Fried, Jacob Andreas, and Dan Klein. 2018. Unified Pragmatic Models for Generating and Following Instructions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long Papers. 1951–1963.
[97]
Yao Fu, Chuanqi Tan, Bin Bi, Mosha Chen, Yansong Feng, and Alexander Rush. 2020b. Latent Template Induction with Gumbel-CRFs. Advances in Neural Information Processing Systems 33 (2020), 20259–20271.
[98]
Zihao Fu, Bei Shi, Wai Lam, Lidong Bing, and Zhiyuan Liu. 2020a. Partially-Aligned Data-to-Text Generation with Distant Supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 9183–9193.
[99]
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. Association for Computational Linguistics, 179–188.
[100]
Albert Gatt and Emiel Krahmer. 2018. Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation. Journal of Artificial Intelligence Research 61 (2018), 65–170.
[101]
Ruifang Ge and Raymond Mooney. 2005. A Statistical Semantic Parser that Integrates Syntax and Semantics. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL’05). 9–16.
[102]
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM’21). 96–120.
[103]
Sebastian Gehrmann, Falcon Dai, Henry Elder, and Alexander M. Rush. 2018. End-to-End Content and Plan Selection for Data-to-Text Generation. In Proceedings of the 11th International Conference on Natural Language Generation. 46–56.
[104]
Nahum Gershon and Ward Page. 2001. What Storytelling can do for Information Visualization. Communications of the ACM 44, 8 (2001), 31–37.
[105]
Sayan Ghosh, Zheng Qi, Snigdha Chaturvedi, and Shashank Srivastava. 2021. How Helpful is Inverse Reinforcement Learning for Table-to-Text Generation?. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 2, Short Papers. 71–79.
[106]
Dimitra Gkatzia and Saad Mahamood. 2015. A Snapshot of NLG Evaluation Practices 2005-2014. In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG). 57–60.
[107]
Heng Gong, Wei Bi, Xiaocheng Feng, Bing Qin, Xiaojiang Liu, and Ting Liu. 2020a. Enhancing Content Planning for Table-to-Text Generation with Data Understanding and Verification. In Findings of the Association for Computational Linguistics: EMNLP’20. 2905–2914.
[108]
Heng Gong, Xiaocheng Feng, Bing Qin, and Ting Liu. 2019. Table-to-Text Generation with Effective Hierarchical Encoder on Three Dimensions (Row, Column and Time). In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3143–3152.
[109]
Heng Gong, Yawei Sun, Xiaocheng Feng, Bing Qin, Wei Bi, Xiaojiang Liu, and Ting Liu. 2020b. Tablegpt: Few-Shot Table-to-Text Generation with Table Structure Reconstruction and Content Matching. In Proceedings of the 28th International Conference on Computational Linguistics. 1978–1988.
[110]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Regularization for Deep Learning. In Deep Learning. MIT Press, 216–261.
[111]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. Advances in Neural Information Processing Systems 27 (2014).
[112]
Raghav Goyal, Marc Dymetman, and Eric Gaussier. 2016. Natural Language Generation through Character-based RNNs with Finite-state Prior Knowledge. In Proceedings of the 26th International Conference on Computational Linguistics (COLING’16). 1083–1092.
[113]
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. 33–41.
[114]
Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016. Hybrid Computing Using a Neural Network with Dynamic External Memory. Nature 538, 7626 (2016), 471–476.
[115]
Thomas R. G. Green and Marian Petre. 1992. When Visual Programs are Harder to Read than Textual Programs. In Human-Computer Interaction: Tasks and Organisation, Proceedings ECCE-6 (6th European Conference Cognitive Ergonomics), Vol. 57. Citeseer.
[116]
Thomas R. G. Green, Marian Petre, and Rachel K. E. Bellamy. 1991. Comprehensibility of Visual and Textual Programs: A Test of Superlativism Against the ‘Match-Mismatch’ Conjecture. Open University, Computer Assisted Learning Research Group.
[117]
Herbert P. Grice. 1975. Logic and conversation. In Speech Acts. Brill, 41–58.
[118]
Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. 1995. Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics 21, 2 (1995), 203–225.
[119]
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 1631–1640.
[120]
Jiatao Gu, Changhan Wang, and Junbo Zhao. 2019. Levenshtein Transformer. Advances in Neural Information Processing Systems 32 (2019).
[121]
Çağlar Gulçehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the Unknown Words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 140–149.
[122]
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 8342–8360.
[123]
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119. Hal Daumé III and Aarti Singh (Eds.)). PMLR, 3929–3938.
[124]
Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. 2012. Multiple Choice Learning: Learning to Produce Multiple Structured Outputs. Advances in Neural Information Processing Systems 25 (2012).
[125]
Michael Alexander Kirkwood Halliday and Ruqaiya Hasan. 2014. Cohesion in English. Routledge.
[126]
Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu, Chunlin Zhong, Zhangjun Zhou, and He Tang. 2024. Synthetic Data in AI: Challenges, Applications, and Ethical Implications. arXiv:2401.01629.
[127]
Hamza Harkous, Isabel Groves, and Amir Saffari. 2020. Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity. In Proceedings of the 28th International Conference on Computational Linguistics. 2410–2424.
[128]
Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying Human and Statistical Evaluation for Natural Language Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long and Short Papers. 1689–1701.
[129]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[130]
Shizhu He, Cao Liu, Kang Liu, and Jun Zhao. 2017. Generating Natural Answers by Incorporating Copying and Retrieving Mechanisms in Sequence-to-Sequence Learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 199–208.
[131]
Xiaodong He and Li Deng. 2017. Deep Learning for Image-to-Text Generation: A Technical Overview. IEEE Signal Processing Magazine 34, 6 (2017), 109–116.
[132]
Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill\({}^{\text{TM}}\): A Bayesian Skill Rating System. Advances in Neural Information Processing Systems 19 (2006).
[133]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[134]
Laurence Horn. 1984. Toward a New Taxonomy for Pragmatic Inference: Q-Based and R-Based Implicature. Meaning, Form, and Use in Context: Linguistic Applications 11 (1984), 42.
[135]
David M. Howcroft, Anja Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel Van Miltenburg, Sashank Santhanam, and Verena Rieser. 2020. Twenty Years of Confusion in Human Evaluation: NLG needs Evaluation Sheets and Standardised Definitions. In Proceedings of the 13th International Conference on Natural Language Generation. 169–182.
[136]
David M. Howcroft, Dietrich Klakow, and Vera Demberg. 2017. The Extended SPaRKy Restaurant Corpus: Designing a Corpus with Variable Information Density. In INTERSPEECH. 3757–3761.
[137]
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward Controlled Generation of Text. In International Conference on Machine Learning. PMLR, 1587–1596.
[138]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[139]
Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, and Hiroya Takamura. 2020. Learning to Select, Track, and Generate for Data-to-Text. Journal of Natural Language Processing 27, 3 (2020), 599–626.
[140]
Glorianna Jagfeld, Sabrina Jenne, and Ngoc Thang Vu. 2018. Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word-vs. Character-based Processing and Output Diversity. In Proceedings of the 11th International Conference on Natural Language Generation. 221–232.
[141]
Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR’17).
[142]
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of Hallucination in Natural Language Generation. arXiv:2202.03629 (2022).
[143]
Robert L. Johnson and Grant B. Morgan. 2016. Survey Scales: A Guide to Development, Analysis, and Reporting. Guilford Publications.
[144]
Shailza Jolly, Zi Xuan Zhang, Andreas Dengel, and Lili Mou. 2022. Search and Learn: Improving Semantic Coverage for Data-to-Text Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 10858–10866.
[145]
Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-Based Machine Translation with Hyperedge Replacement Grammars. In Proceedings of COLING 2012. 1359–1376.
[146]
Ankur Joshi, Saket Kale, Satish Chandel, and D. Kumar Pal. 2015. Likert Scale: Explored and Explained. British Journal of Applied Science & Technology 7, 4 (2015), 396.
[147]
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6282–6293.
[148]
Dan Jurafsky and James H. Martin. 2014. Speech and Language Processing, Vol. 3. Prentice Hall.
[149]
Juraj Juraska, Kevin Bowden, and Marilyn Walker. 2019. ViGGO: A Video Game Corpus for Data-To-Text Generation in Open-Domain Conversation. In Proceedings of the 12th International Conference on Natural Language Generation. 164–172.
[150]
Juraj Juraska, Panagiotis Karagiannis, Kevin Bowden, and Marilyn Walker. 2018. A Deep Ensemble Model with Slot Alignment for Sequence-to-Sequence Natural Language Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long Papers. 152–162.
[151]
Mihir Kale and Abhinav Rastogi. 2020. Template Guided Text Generation for Task-Oriented Dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6505–6520.
[152]
Anjuli Kannan and Oriol Vinyals. 2016. Adversarial Evaluation of Dialogue Models. In Workshop on Adversarial Training at Neural Information Processing Systems.
[153]
Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. 2022. Chart-to-Text: A Large-Scale Benchmark for Chart Summarization. arXiv:2203.06486.
[154]
Nikiforos Karamanis, Massimo Poesio, Chris Mellish, and Jon Oberlander. 2004. Evaluating Centering-Based Metrics of Coherence. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL’04). 391–398.
[155]
Samuel Karlin. 2014. A First Course in Stochastic Processes. Academic Press.
[156]
Lauri Karttunen. 1976. Discourse Referents. In Notes from the Linguistic Underground. Brill, 363–385.
[157]
Zdeněk Kasner and Ondřej Dušek. 2020. Data-to-Text Generation with Iterative Text Editing. In Proceedings of the 13th International Conference on Natural Language Generation. 60–67.
[158]
Pei Ke, Haozhe Ji, Yu Ran, Xin Cui, Liwei Wang, Linfeng Song, Xiaoyan Zhu, and Minlie Huang. 2021. JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs. In Findings of the Association for Computational Linguistics: ACL-IJCNLP’21. 2526–2538.
[159]
Chris Kedzie and Kathleen Mckeown. 2019. A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models. In Proceedings of the 12th International Conference on Natural Language Generation. 584–593.
[160]
Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally Coherent Text Generation with Neural Checklist Models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 329–339.
[161]
Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4110–4124.
[162]
Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 1746–1751.
[163]
Diederik P. Kingma and Max Welling. 2014. Stochastic Gradient VB and the Variational Auto-Encoder. In Second International Conference on Learning Representations (ICLR), Vol. 19. 121.
[164]
Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.
[165]
Thomas R. Knapp. 1990. Treating Ordinal Scales as Interval Scales: An Attempt to Resolve the Controversy. Nursing Research 39, 2 (1990), 121–123.
[166]
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. 177–180.
[167]
Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
[168]
Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text Generation from Knowledge Graphs with Graph Transformers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long and Short Papers. 2284–2293.
[169]
Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-Sequence Models for Parsing and Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 146–157.
[170]
Ioannis Konstas and Mirella Lapata. 2012. Unsupervised Concept-to-Text Generation with Hypergraphs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 752–761.
[171]
Ioannis Konstas and Mirella Lapata. 2013. Inducing Document Plans for Concept-to-Text Generation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1503–1514.
[172]
Susumu Kuno. 1972. Functional Sentence Perspective: A Case Study from Japanese and English. Linguistic Inquiry 3, 3 (1972), 269–320.
[173]
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From Word Embeddings to Document Distances. In International Conference on Machine Learning. PMLR, 957–966.
[174]
Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1203–1213.
[175]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[176]
Leo Leppänen, Myriam Munezero, Mark Granroth-Wilding, and Hannu Toivonen. 2017. Data-Driven News Generation for Automated Journalism. In Proceedings of the 10th International Conference on Natural Language Generation. 188–197.
[177]
Uri Lerner and Slav Petrov. 2013. Source-Side Classifier Preordering for Machine Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 513–523.
[178]
Jingjing Li, Zichao Li, Lili Mou, Xin Jiang, Michael Lyu, and Irwin King. 2020. Unsupervised Text Generation by Learning from Search. Advances in Neural Information Processing Systems 33 (2020), 10820–10831.
[179]
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
[180]
Liang Li, Can Ma, Yinliang Yue, and Dayong Hu. 2021. Improving Encoder by Auxiliary Supervision Tasks for Table-to-Text Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1, Long Papers. 5979–5989.
[181]
Wei Li, Wenhao Wu, Moye Chen, Jiachen Liu, Xinyan Xiao, and Hua Wu. 2022. Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods. arXiv:2203.05227.
[182]
Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1, Long Papers. 4582–4597.
[183]
Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning Semantic Correspondences with Less Supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. 91–99.
[184]
Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. 2016. Semantic Object Parsing with Graph LSTM. In European Conference on Computer Vision. Springer, 125–143.
[185]
Chin-Yew Lin and Eduard Hovy. 2003. Automatic Evaluation of Summaries using n-Gram Co-Occurrence Statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 150–157.
[186]
Shuai Lin, Wentao Wang, Zichao Yang, Xiaodan Liang, Frank F. Xu, Eric Xing, and Zhiting Hu. 2020. Data-to-Text Generation with Style Imitation. In Findings of the Association for Computational Linguistics: EMNLP’20. 1589–1598.
[187]
Ao Liu, Haoyu Dong, Naoaki Okazaki, Shi Han, and Dongmei Zhang. 2022. PLOG: Table-to-Logic Pretraining for Logical Table-to-Text Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5531–5546. Retrieved from https://aclanthology.org/2022.emnlp-main.373
[188]
Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward Abstractive Summarization Using Semantic Representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1077–1086.
[189]
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021a. What Makes Good In-Context Examples for GPT-3? arXiv:2101.06804.
[190]
Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu, and Dawei Yin. 2018a. Knowledge Diffusion for Neural Dialogue Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 1489–1498.
[191]
Tianyu Liu, Fuli Luo, Qiaolin Xia, Shuming Ma, Baobao Chang, and Zhifang Sui. 2019a. Hierarchical Encoder with Auxiliary Supervision for Neural Table-to-Text Generation: Learning Better Representation for Tables. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6786–6793.
[192]
Tianyu Liu, Fuli Luo, Pengcheng Yang, Wei Wu, Baobao Chang, and Zhifang Sui. 2019b. Towards comprehensive description generation from factual attribute-value tables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5985–5996.
[193]
Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018b. Table-to-Text Generation by Structure-Aware seq2seq Learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[194]
Tianyu Liu, Xin Zheng, Baobao Chang, and Zhifang Sui. 2021b. Towards Faithfulness in Open Domain Table-to-Text Generation from an Entity-Centric View. arXiv:2102.08585.
[195]
Yang Liu and Mirella Lapata. 2018. Learning Structured Text Representations. Transactions of the Association for Computational Linguistics 6 (2018), 63–75.
[196]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. Roberta: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
[197]
Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. 2019. Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5962–5971.
[198]
Ryan Lowe, Nissan Pow, Iulian Vlad Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 285–294.
[199]
Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3219–3232.
[200]
Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob N. Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. 2019. A Survey of Reinforcement Learning Informed by Natural Language. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI).
[201]
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1412–1421.
[202]
Shuming Ma, Pengcheng Yang, Tianyu Liu, Peng Li, Jie Zhou, and Xu Sun. 2019. Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2047–2057.
[203]
Manuel Mager, Ramón Fernandez Astudillo, Tahira Naseem, Md Arafat Sultan, Young-Suk Lee, Radu Florian, and Salim Roukos. 2020. GPT-too: A Language-Model-First Approach for AMR-to-Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1846–1852.
[204]
François Mairesse, Milica Gasic, Filip Jurcicek, Simon Keizer, Blaise Thomson, Kai Yu, and Steve Young. 2010. Phrase-Based Statistical Language Generation using Graphical Models and Active Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. 1552–1561.
[205]
Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, and Aliaksei Severyn. 2019. Encode, Tag, Realize: High-Precision Text Editing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 5054–5065.
[206]
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. 2016. Generation and Comprehension of Unambiguous Object Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11–20.
[207]
Diego Marcheggiani and Laura Perez-Beltrachini. 2018. Deep Graph Convolutional Encoders for Structured Data to Text Generation. In Proceedings of the 11th International Conference on Natural Language Generation. 1–9.
[208]
Diego Marcheggiani and Ivan Titov. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1506–1515.
[209]
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2799–2808.
[210]
Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective Generation using LSTMs with Coarse-to-Fine Alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 720–730.
[211]
Pablo N. Mendes, Max Jakob, and Christian Bizer. 2012. DBpedia: A Multilingual Cross-Domain Knowledge Base. European Language Resources Association (ELRA).
[212]
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. arXiv:1609.07843.
[213]
Simon Mille, Anja Belz, Bernd Bohnet, Yvette Graham, Emily Pitler, and Leo Wanner. 2018. The First Multilingual Surface Realisation Shared Task (SR’18): Overview and Evaluation Results. In Proceedings of the First Workshop on Multilingual Surface Realisation. 1–12.
[214]
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 220–229.
[215]
Will Monroe, Robert X. D. Hawkins, Noah D. Goodman, and Christopher Potts. 2017. Colors in Context: A Pragmatic Neural Model for Grounded Language Understanding. Transactions of the Association for Computational Linguistics 5 (2017), 325–338.
[216]
Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. Opendialkg: Explainable Conversational Reasoning with Attention-Based Walks over Knowledge Graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 845–854.
[217]
Raymond J. Mooney. 2007. Learning for Semantic Parsing. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 311–324.
[218]
Nafise Sadat Moosavi, Andreas Rücklé, Dan Roth, and Iryna Gurevych. 2021. Learning to Reason for Text Generation from Scientific Tables. arXiv:2104.08296.
[219]
Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-Step: Separating Planning from Realization in Neural Data-to-Text Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long and Short Papers. 2267–2277.
[220]
Soichiro Murakami, Akihiko Watanabe, Akira Miyazawa, Keiichi Goshima, Toshihiko Yanase, Hiroya Takamura, and Yusuke Miyao. 2017. Learning to Generate Market Comments from Stock Prices. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. Association for Computational Linguistics, 1374–1384.
[221]
Kevin P. Murphy. 2002. Hidden Semi-Markov Models (HSMMs). Technical Report, Massachusetts Institute of Technology.
[222]
Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
[223]
Linyong Nan, Lorenzo Jaime Flores, Yilun Zhao, Yixin Liu, Luke Benson, Weijin Zou, and Dragomir Radev. 2022. R2D2: Robust Data-to-Text with Replacement Detection. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 6903–6917. Retrieved from https://aclanthology.org/2022.emnlp-main.464
[224]
Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. 2021. DART: Open-Domain Structured Data Record to Text Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 432–447.
[225]
Courtney Napoles, Matthew R. Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX). 95–100.
[226]
Shashi Narayan and Claire Gardent. 2012. Structure-Driven Lexicalist Generation. In Proceedings of COLING 2012. 2027–2042.
[227]
Neha Nayak, Dilek Hakkani-Tür, Marilyn A. Walker, and Larry P. Heck. 2017. To Plan or not to Plan? Discourse Planning in Slot-Value Informed Sequence to Sequence Models for Language Generation. In Conference of the International Speech Communication Association. 3339–3343.
[228]
Ani Nenkova and Kathleen McKeown. 2011. Automatic Summarization. Now Publishers Inc.
[229]
Feng Nie, Jinpeng Wang, Jin-ge Yao, Rong Pan, and Chin-Yew Lin. 2018. Operation-Guided Neural Networks for High Fidelity Data-To-Text Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3879–3889.
[230]
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E Dataset: New Challenges For End-to-End Generation. In Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, 201–206.
[231]
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable Human Ratings for Natural Language Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, Short Papers. 72–78.
[232]
Jekaterina Novikova, Oliver Lemon, and Verena Rieser. 2016. Crowd-Sourcing NLG Data: Pictures Elicit Better Data. In Proceedings of the 9th International Natural Language Generation Conference. 265–273.
[233]
Jason Obeid and Enamul Hoque. 2020. Chart-to-Text: Generating Natural Language Descriptions for Charts by Adapting the Transformer Model. In Proceedings of the 13th International Conference on Natural Language Generation. 138–147.
[234]
Hugo Gonçalo Oliveira. 2017. A Survey on Intelligent Poetry Generation: Languages, Features, Techniques, Reutilisation and Evaluation. In Proceedings of the 10th International Conference on Natural Language Generation. 11–20.
[235]
OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. Retrieved from https://openai.com/blog/chatgpt/
[236]
Shereen Oraby, Vrindavan Harrison, Abteen Ebrahimi, and Marilyn Walker. 2019. Curate and Generate: A Corpus and Method for Joint Control of Semantics and Style in Neural NLG. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
[237]
Shereen Oraby, Lena Reed, Shubhangi Tandon, TS Sharath, Stephanie Lukin, and Marilyn Walker. 2018. Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue. 180–190.
[238]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[239]
Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A Controlled Table-To-Text Generation Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1173–1186.
[240]
Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A Deep Reinforced Model for Abstractive Summarization. In International Conference on Learning Representations.
[241]
Steffen Pauws, Albert Gatt, Emiel Krahmer, and Ehud Reiter. 2019. Making Effective Use of Healthcare Data using Data-to-Text Technology. In Data Science for Healthcare. Springer, 119–145.
[242]
Judea Pearl. 2010. On Measurement Bias in Causal Inference. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI’10). AUAI Press, Arlington, VA, USA, 425–432.
[243]
Jaclyn Peiser. 2019. The Rise of the Robot Reporter. The New York Times (Feb 2019).
[244]
Laura Perez-Beltrachini and Mirella Lapata. 2018. Bootstrapping Generators from Noisy Data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long Papers. 1516–1527.
[245]
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long Papers. Association for Computational Linguistics, 2227–2237.
[246]
Marian Petre. 1995. Why Looking isn’t Always Seeing: Readership Skills and Graphical Programming. Communications of the ACM 38, 6 (1995), 33–44.
[247]
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language Models as Knowledge Bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2463–2473.
[248]
Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. 2021. Improving Reproducibility in Machine Learning Research (a Report from the NeurIPS 2019 Reproducibility Program). The Journal of Machine Learning Research 22, 1 (2021), 7459–7478.
[249]
François Portet, Ehud Reiter, Albert Gatt, Jim Hunter, Somayajulu Sripada, Yvonne Freer, and Cindy Sykes. 2009. Automatic Generation of Textual Summaries from Neonatal Intensive Care Data. Artificial Intelligence 173, 7–8 (2009), 789–816.
[250]
Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. 186–191.
[251]
Ellen F. Prince. 1981. Towards a taxonomy of given-new information. In Radical Pragmatics. P. Cole (Ed.), Academic Press, New York, NY, USA, 223–256.
[252]
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019a. Data-to-Text Generation with Content Selection and Planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6908–6915.
[253]
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019b. Data-to-text Generation with Entity Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2023–2035.
[254]
Ratish Puduppully, Yao Fu, and Mirella Lapata. 2022. Data-to-text Generation with Variational Sequential Planning. arXiv:2202.13756.
[255]
Ratish Puduppully and Mirella Lapata. 2021. Data-to-text Generation with Macro Planning. Transactions of the Association for Computational Linguistics 9 (2021), 510–527.
[256]
Yevgeniy Puzikov and Iryna Gurevych. 2018. E2E NLG Challenge: Neural Models vs. Templates. In Proceedings of the 11th International Conference on Natural Language Generation. 463–471.
[257]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog 1, 8 (2019), 9.
[258]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683.
[259]
[259] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21 (2020), 1–67.
[260] Juan Ramos. 2003. Using TF-IDF to Determine Word Relevance in Document Queries. In Proceedings of the First Instructional Conference on Machine Learning, Vol. 242. Citeseer, 29–48.
[261] Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020a. A Hierarchical Model for Data-to-Text Generation. In European Conference on Information Retrieval. Springer, 65–80.
[262] Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020b. Parenting via Model-Agnostic Reinforcement Learning to Correct Pathological Behaviors in Data-to-Text Generation. arXiv:2010.10866.
[263] Lena Reed, Shereen Oraby, and Marilyn Walker. 2018. Can Neural Generators for Dialogue Learn Sentence Planning and Discourse Structuring? In Proceedings of the 11th International Conference on Natural Language Generation. 284–295.
[264] Marek Rei. 2017. Semi-Supervised Multitask Learning for Sequence Labeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 2121–2130.
[265] Nils Reimers and Iryna Gurevych. 2017. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-Networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 338–348.
[266] Ehud Reiter. 2007. An Architecture for Data-to-Text Systems. In Proceedings of the 11th European Workshop on Natural Language Generation (ENLG’07). 97–104.
[267] Ehud Reiter. 2017. You Need to Understand Your Corpora! The Weathergov Example. Retrieved from https://ehudreiter.com/2017/05/09/weathergov/
[268] Ehud Reiter. 2018. A Structured Review of the Validity of BLEU. Computational Linguistics 44, 3 (2018), 393–401.
[269] Ehud Reiter and Anja Belz. 2009. An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems. Computational Linguistics 35, 4 (2009), 529–558.
[270] Ehud Reiter and Robert Dale. 1997. Building Applied Natural Language Generation Systems. Natural Language Engineering 3, 1 (1997), 57–87.
[271] Ehud Reiter, Somayajulu Sripada, Jim Hunter, Jin Yu, and Ian Davy. 2005. Choosing Words in Computer-Generated Weather Forecasts. Artificial Intelligence 167, 1–2 (2005), 137–169.
[272] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-Critical Sequence Training for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7008–7024.
[273] Leonardo F. R. Ribeiro, Claire Gardent, and Iryna Gurevych. 2019. Enhancing AMR-to-Text Generation with Dual Graph Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3183–3194.
[274] Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2021. Investigating Pretrained Language Models for Graph-to-Text Generation. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI. 211–227.
[275] Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent, and Iryna Gurevych. 2020. Modeling Global and Local Node Contexts for Text Generation from Knowledge Graphs. Transactions of the Association for Computational Linguistics 8 (2020), 589–604.
[276] Markus Ring, Daniel Schlör, Dieter Landes, and Andreas Hotho. 2019. Flow-Based Network Traffic Generation using Generative Adversarial Networks. Computers & Security 82 (2019), 156–172.
[277] Marco Roberti, Giovanni Bonetta, Rossella Cancelliere, and Patrick Gallinari. 2019. Copy Mechanism and Tailored Training for Character-Based Data-to-Text Generation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 648–664.
[278] Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333–389.
[279] Jacques Pierre Robin. 1995. Revision-Based Generation of Natural Language Summaries Providing Historical Background: Corpus-Based Analysis, Design, Implementation and Evaluation. Ph.D. Dissertation. Columbia University.
[280] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object Hallucination in Image Captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 4035–4045.
[281] Andreas Rücklé, Steffen Eger, Maxime Peyrard, and Iryna Gurevych. 2018. Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations. arXiv:1803.01400.
[282] Sashank Santhanam and Samira Shaikh. 2019. A Survey of Natural Language Generation Techniques with a Focus on Dialogue Systems - Past, Present and Future Directions. arXiv:1906.00500.
[283] Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-Based User Simulation for Bootstrapping a POMDP Dialogue System. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume, Short Papers. 149–152.
[284] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761.
[285] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. In European Semantic Web Conference. Springer, 593–607.
[286] Donia Scott and Johanna Moore. 2007. An NLG Evaluation Competition? Eight Reasons to be Cautious. In Proceedings of the NSF Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation.
[287] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. Association for Computational Linguistics, 1073–1083.
[288] Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D. Manning. 2019. Do Massively Pretrained Language Models Make Better Storytellers? In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). 843–861.
[289] Edward Segel and Jeffrey Heer. 2010. Narrative Visualization: Telling Stories with Data. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1139–1148.
[290] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 86–96.
[291] Iulian Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[292] Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, and Zhifang Sui. 2018. Order-Planning Neural Text Generation from Structured Data. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[293] Zhihong Shao, Minlie Huang, Jiangtao Wen, Wenfei Xu, and Xiaoyan Zhu. 2019. Long and Diverse Text Generation with Planning-based Hierarchical Variational Model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3257–3268.
[294] Mandar Sharma, John S. Brownstein, and Naren Ramakrishnan. 2021. T^3: Domain-Agnostic Neural Time-series Narration. In IEEE International Conference on Data Mining (ICDM’21). IEEE, 1324–1329.
[295] Mandar Sharma, Nikhil Muralidhar, and Naren Ramakrishnan. 2022. Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic. arXiv:2211.02098.
[296] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, Short Papers. 464–468.
[297] Sheng Shen, Daniel Fried, Jacob Andreas, and Dan Klein. 2019. Pragmatically Informative Text Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long and Short Papers. 4060–4067.
[298] Stuart Shieber, Gertjan Van Noord, Fernando C. N. Pereira, and Robert C. Moore. 1990. Semantic-Head-Driven Generation. Computational Linguistics 16, 1 (1990), 30–41.
[299] Anastasia Shimorina and Anja Belz. 2022. The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP. In Proceedings of the Second Workshop on Human Evaluation of NLP Systems (HumEval). 54–75.
[300] Anastasia Shimorina and Claire Gardent. 2018. Handling Rare Items in Data-to-Text Generation. In Proceedings of the 11th International Conference on Natural Language Generation. 360–370.
[301] Advaith Siddharthan and Napoleon Katsos. 2012. Offline Sentence Processing Measures for Testing Readability with Users. In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations. 17–24.
[302] Koustuv Sinha, Joelle Pineau, Jessica Forde, Rosemary Nan Ke, and Hugo Larochelle. 2020. NeurIPS 2019 Reproducibility Challenge. ReScience C 6, 2 (2020), 11.
[303] Charese Smiley, Vassilis Plachouras, Frank Schilder, Hiroko Bretz, Jochen L. Leidner, and Dezhao Song. 2016. When to Plummet and When to Soar: Corpus Based Verb Selection for Natural Language Generation. In Proceedings of the 9th International Natural Language Generation Conference. 36–39.
[304] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 151–161.
[305] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning Structured Output Representation using Deep Conditional Generative Models. Advances in Neural Information Processing Systems 28 (2015).
[306] Linfeng Song, Ante Wang, Jinsong Su, Yue Zhang, Kun Xu, Yubin Ge, and Dong Yu. 2020. Structural Information Preserving for Graph-to-Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7987–7998.
[307] Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating Evaluation Methods for Generation in the Presence of Variation. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 341–351.
[308] Amanda Stent, Rashmi Prasad, and Marilyn Walker. 2004. Trainable Sentence Planning for Complex Information Presentations in Spoken Dialog Systems. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL’04). 79–86.
[309] Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. 88–99.
[310] Shang-Yu Su, Kai-Ling Lo, Yi-Ting Yeh, and Yun-Nung Chen. 2018. Natural Language Generation by Hierarchical Decoding with Linguistic Patterns. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, Short Papers. 61–66.
[311] Yixuan Su, Zaiqiao Meng, Simon Baker, and Nigel Collier. 2021. Few-Shot Table-to-Text Generation with Prototype Memory. In Findings of the Association for Computational Linguistics: EMNLP’21. 910–917.
[312] Lya Hulliyyatus Suadaa, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura, and Hiroya Takamura. 2021. Towards Table-to-Text Generation with Numerical Reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1, Long Papers. 1451–1465.
[313] Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. 21–29.
[314] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. 2011. Generating Text with Recurrent Neural Networks. In Proceedings of the 28th International Conference on Machine Learning.
[315] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[316] Kumiko Tanaka-Ishii, Kôiti Hasida, and Itsuki Noda. 1998. Reactive Content Selection in the Generation of Real-Time Soccer Commentary. In COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics.
[317] Avijit Thawani, Jay Pujara, Filip Ilievski, and Pedro Szekely. 2021. Representing Numbers in NLP: A Survey and a Vision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 644–656.
[318] Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P. Parikh. 2019. Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation. arXiv:1910.08684.
[319] Jakub Tomczak and Max Welling. 2018. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics. PMLR, 1214–1223.
[320] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
[321] Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural Machine Translation with Reconstruction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
[322] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, Long Papers. 76–85.
[323] Yui Uehara, Tatsuya Ishigaki, Kasumi Aoki, Hiroshi Noji, Keiichi Goshima, Ichiro Kobayashi, Hiroya Takamura, and Yusuke Miyao. 2020. Learning with Contrastive Examples for Data-to-Text Generation. In Proceedings of the 28th International Conference on Computational Linguistics. 2352–2362.
[324] Chris van der Lee, Chris Emmery, Sander Wubben, and Emiel Krahmer. 2020. The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation. In Proceedings of the 13th International Conference on Natural Language Generation. 68–79.
[325] Chris van der Lee, Albert Gatt, Emiel van Miltenburg, and Emiel Krahmer. 2021. Human Evaluation of Automatically Generated Text: Current Trends and Best Practice Guidelines. Computer Speech & Language 67 (2021), 101151.
[326] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. Advances in Neural Information Processing Systems 30 (2017).
[327] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations.
[328] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to Sequence - Video to Text. In Proceedings of the IEEE International Conference on Computer Vision. 4534–4542.
[329] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning. 1096–1103.
[330] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. 2010. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11, 12 (2010), 3371–3408.
[331] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. Advances in Neural Information Processing Systems 28 (2015).
[332] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM 57, 10 (2014), 78–85.
[333] Marilyn A. Walker, Amanda Stent, François Mairesse, and Rashmi Prasad. 2007. Individual and Domain Adaptation in Sentence Planning for Dialogue. Journal of Artificial Intelligence Research 30 (2007), 413–456.
[334] Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP Models Know Numbers? Probing Numeracy in Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 5307–5315.
[335] Hongmin Wang. 2019. Revisiting Challenges in Data-to-Text Generation with Fact Grounding. In Proceedings of the 12th International Conference on Natural Language Generation. 311–322.
[336] Jun Wang and Klaus Mueller. 2015. The Visual Causality Analyst: An Interactive Interface for Causal Reasoning. IEEE Transactions on Visualization and Computer Graphics 22, 1 (2015), 230–239.
[337] Peng Wang, Junyang Lin, An Yang, Chang Zhou, Yichang Zhang, Jingren Zhou, and Hongxia Yang. 2021. Sketch and Refine: Towards Faithful and Informative Table-to-Text Generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP’21. 4831–4843.
[338] Qingyun Wang, Xiaoman Pan, Lifu Huang, Boliang Zhang, Zhiying Jiang, Heng Ji, and Kevin Knight. 2018. Describing a Knowledge Base. In Proceedings of the 11th International Conference on Natural Language Generation. 10–21.
[339] Zhenyi Wang, Xiaoyang Wang, Bang An, Dong Yu, and Changyou Chen. 2020. Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1072–1086.
[340] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems.
[341] Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural Text Generation With Unlikelihood Training. In International Conference on Learning Representations.
[342] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve J. Young. 2016. Multi-Domain Neural Network Language Generation for Spoken Dialogue Systems. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[343] Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1711–1721.
[344] Lilian Weng. 2021. Controllable Neural Text Generation. lilianweng.github.io. Retrieved from https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/
[345] Ronald J. Williams and Jing Peng. 1991. Function Optimization using Connectionist Reinforcement Learning Algorithms. Connection Science 3, 3 (1991), 241–268.
[346] Ronald J. Williams and David Zipser. 1989. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1, 2 (1989), 270–280.
[347] Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in Data-to-Document Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
[348] Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning Neural Templates for Text Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3174–3187.
[349] Stephen Wolfram. 2023. Wolfram|Alpha as the Way to Bring Computational Knowledge Superpowers to ChatGPT. Retrieved from https://writings.stephenwolfram.com/2023/01/wolframalpha-as-the-way-to-bring-computational-knowledge-superpowers-to-chatgpt/
[350] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144.
[351] Jiannan Xiang, Zhengzhong Liu, Yucheng Zhou, Eric Xing, and Zhiting Hu. 2022. ASDOT: Any-Shot Data-to-Text Generation with Pretrained Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, 1886–1899.
[352] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning. PMLR, 2048–2057.
[353] Shengzhe Xu, Manish Marwah, Martin Arlitt, and Naren Ramakrishnan. 2021. STAN: Synthetic Network Traffic Generation with Generative Neural Models. In Deployable Machine Learning for Security Defense: Second International Workshop, MLHat 2021, Virtual Event, Proceedings 2. Springer, 3–29.
[354] Yang Yang, Juan Cao, Yujun Wen, and Pengzhou Zhang. 2021. Table to Text Generation with Accurate Content Copying. Scientific Reports 11, 1 (2021), 1–12.
[355] Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-Aware Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1850–1859.
[356] Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, and Lei Li. 2020. Variational Template Machine for Data-to-Text Generation. arXiv:2002.01127.
[357] Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The Hidden Information State Model: A Practical Framework for POMDP-Based Spoken Dialogue Management. Computer Speech & Language 24, 2 (2010), 150–174.
[358] Wlodek Zadrozny and Karen Jensen. 1991. Semantics of Paragraphs. Computational Linguistics 17, 2 (1991), 171–210.
[359] Biao Zhang, Deyi Xiong, Jinsong Su, and Hong Duan. 2017. A Context-Aware Recurrent Encoder for Neural Machine Translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25, 12 (2017), 2424–2432.
[360] Biao Zhang, Jing Yang, Qian Lin, and Jinsong Su. 2018. Attention Regularized Sequence-to-Sequence Learning for E2E NLG Challenge. In Proceedings of the E2E NLG Challenge System Descriptions.
[361] Shuo Zhang, Zhuyun Dai, Krisztian Balog, and Jamie Callan. 2020a. Summarizing and Exploring Tabular Data in Conversational Search. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1537–1540.
[362] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
[363] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B. Dolan. 2020b. DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 270–278.
[364] Chao Zhao, Marilyn Walker, and Snigdha Chaturvedi. 2020. Bridging the Structural Gap Between Encoding and Decoding for Data-to-Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2481–2491.
[365] Jianyu Zhao, Zhiqiang Zhan, Tong Li, Rang Li, Changjian Hu, Siyun Wang, and Yang Zhang. 2021. Generative Adversarial Network for Table-to-Text Generation. Neurocomputing 452 (2021), 28–36.
[366] Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 563–578.
[367] Jie Zhu, Junhui Li, Muhua Zhu, Longhua Qian, Min Zhang, and Guodong Zhou. 2019. Modeling Graph Structure in Transformer for Better AMR-to-Text Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 5459–5468.
[368] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum Entropy Inverse Reinforcement Learning. In AAAI, Vol. 8. 1433–1438.

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 5 (October 2024), 719 pages. EISSN: 2157-6912. DOI: 10.1145/3613688. Editor: Huan Liu.
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 17 October 2024
Online AM: 08 May 2024
Accepted: 28 March 2024
Revised: 26 March 2024
Received: 12 April 2023
Published in TIST Volume 15, Issue 5

Author Tags

  1. Narration
  2. data-to-text
  3. data-to-text generation
  4. natural language generation

Qualifiers

  • Research-article

Funding Sources

  • US National Science Foundation
