Improving semantic coverage of data-to-text generation model using dynamic memory networks

Abstract This paper proposes a sequence-to-sequence model for data-to-text generation, called DM-NLG, to generate natural language text from structured nonlinguistic input. Specifically, by adding a dynamic memory module to the attention-based sequence-to-sequence model, DM-NLG can store the information that led to generating previous output words and use it to generate the next word. In this way, the decoder part of the model is aware of all previous decisions, and as a result, the generation of duplicate words or incomplete semantic concepts is prevented. To improve the quality of the sentences generated by the DM-NLG decoder, a postprocessing step is performed using pretrained language models. To prove the effectiveness of the DM-NLG model, we performed experiments on five different datasets and observed that our proposed model is able to reduce the slot error rate by 50% and improve the BLEU score by 10% compared to state-of-the-art models.

Neural sequence-to-sequence models have made it possible to achieve significant results in generating descriptive text from nonlinguistic data (Wen et al., 2015a, 2015b, 2015c, 2016; Lebret, Grangier, and Auli, 2016; Mei, Bansal, and Walter, 2016; Nayak et al., 2017; Riou et al., 2017; Lee, Krahmer, and Wubben, 2018; Liu et al., 2018; Deriu and Cieliebak, 2018; Sha et al., 2018; Tran and Nguyen, 2019; Qader, Portet, and Labbe, 2019; Shen et al., 2020). However, these systems also have limitations. One of their most important problems is the inability to express all the necessary input information in the output text (Tran and Nguyen, 2019). The input MR may include meaning labels with a binary value (true, false, yes, no) or a value that cannot be expressed directly in words (none, don't-care) and can only be identified by its concept. For example, consider a meaning label that indicates whether a hotel allows dogs, whose value can be yes, no, none, or don't-care. The value none means that information about allowing dogs in the hotel is not available. This concept is expressed in sentences by phrases such as dogs are allowed (when its value is yes), dogs are not allowed (when its value is no), or I do not know if it allows dogs. In these cases, the model should learn the relationship between these values and the words in the text, and also how to express their meaning in the output. Another problem, which is related to the nature of RNN networks, is the inability to retain information about the distant past. Although LSTM is able to retain this information to some extent, the information that leads to the generation of the first words of the text is still forgotten as the output text lengthens. This causes problems such as duplicate generated words and missing or redundant meaning labels in the output text.
So far, many models have been proposed to address these deficiencies, among which models with an encoder-attention-decoder structure have achieved relative but not complete success (Wen et al., 2015a, 2015b, 2015c, 2016; Dusek and Jurcicek, 2016; Tran and Nguyen, 2017; Qader et al., 2019; Shen et al., 2020).
In this paper, we have enabled the basic attention-based sequence-to-sequence model to store the information leading to the generation of previous words and to generate new words based not only on the current state but also on previous information. For this purpose, we use a dynamic memory network (DMN) (Kumar et al., 2016). The DMN, introduced in 2016 for the question answering (Q&A) task, has two modules for reading and writing; at each step, the memory content is updated and rewritten based on its previous value and the query, and the output tokens are generated based on the content read from memory. Using this type of memory along with the sequence-to-sequence model, each output word is generated based not only on the current state but also on the information and history of decisions that led to the generation of previous words. In addition, since word generation in the proposed model, unlike previous models, is done through two different levels of attention, between encoder and decoder as well as between decoder and memory, the meanings of the words and their relation to the input MR are well learned, and as a result, all meaning labels with binary or special values are expressed in the output. Improving the sentence structure is another goal, which was achieved by applying a postprocessing step to the proposed model's outputs using pretrained language models.
We performed experiments to evaluate the performance of the proposed model and compare it with the baseline models. These experiments are performed on the datasets and evaluation metrics used by RNN-based, autoencoder-based, and transformer-based baseline models to evaluate the effect of memory usage compared to the case where there is no memory. The results of the experiments show clear improvements in the quality of the sentences, both grammatically and semantically, compared to previous models. The main contributions of our work are summarized below:

[...] table and the generated text. After that, Sha et al. (2018) extended this approach to learning the order of meaning labels in the corresponding text by adding a link matrix, so their model was able to generate more fluent output.
For cross-domain dialog generation, Tseng et al. (2018) used a semantically conditioned variational autoencoder architecture. This model first encoded the input MRs and their corresponding sentences into a latent variable; then, conditioned on this latent variable, the output sentences for a given MR were generated. Tran and Nguyen (2018a) proposed a variational neural language generator that was trained adversarially, first on source-domain data and then fine-tuned on a target domain under the guidance of text-similarity and domain critics. Furthermore, to deal with low-resource domains, they integrated variational inference into an encoder-decoder generator (Tran and Nguyen, 2018b).
The E2E dataset, a large dataset in the restaurant domain, was produced by Novikova et al. (2017) for use in the E2E challenge (Dusek, Novikova, and Rieser, 2018). Most of the submissions in that challenge were end-to-end sequence-to-sequence models (Juraska et al., 2018; Gehrmann, Dai, and Elder, 2018; Zhang et al., 2018; Gong, 2018; Deriu and Cieliebak, 2018), which were able to obtain better results than the baseline approach of Dusek and Jurcicek (2016). Working on the same dataset, Qader et al. (2019) proposed a combination of an NLG and an NLU sequence-to-sequence model in the form of an autoencoder structure that can learn from annotated and nonannotated data. Their model was able to achieve results close to the winner of the E2E challenge (Juraska et al., 2018). Also, Shen et al. (2020) proposed to automatically extract the segmental structures of texts and learn to align them with their data correspondences. More precisely, the input records are first encoded; then, at each time step, the decoder generates tokens based on the attention weights of the input records and the records already expressed in the output sentence. To reduce hallucination, they used a constraint that each record must be used exactly once.
Recently, pretrained language models like GPT-2 (Radford et al., 2019) have been used for data-to-text generation in few-shot or zero-shot settings. For the E2E dataset (Dusek et al., 2018), Kasner and Dušek (2020) propose a few-shot approach for D2T based on iterative text editing using LASERTAGGER (Malmi et al., 2019), a sequence tagging model based on the Transformer architecture (Vaswani et al., 2017) with the BERT (Devlin et al., 2019) pretrained language model as the encoder, together with the pretrained GPT-2 language model. This model first transforms data items into text using trivial templates, and then iteratively improves the resulting text with a neural model trained for the sentence-fusion task. The output of the model is then filtered by a simple heuristic and re-ranked with the GPT-2 language model. Peng et al. (2020) introduce a GPT-2-based model called semantically conditioned generative pretraining. This model is first pretrained on a large amount of publicly available dialog data and then fine-tuned on the target D2T dataset with few training instances. Harkous et al. (2020) introduce the DATATUNER model, an end-to-end, domain-independent data-to-text system that uses the GPT-2 pretrained language model and a weakly supervised semantic fidelity classifier to detect and avoid generation errors such as hallucination and omission. Chen et al. (2020) proposed a few-shot learning approach that uses GPT-2 with a copy mechanism. Kale and Rastogi (2020) used templates to improve the semantic correctness of the generated responses. In a zero-shot setting, their model first generates semantically correct but possibly incoherent responses based on the slots, under the constraint of templates; then, using the T5 pretrained text-to-text model (Raffel et al., 2020) as a reorganizer, the generated utterances are transformed into coherent ones. Chang et al. (2021) use data augmentation methods to improve few-shot text generation results with the GPT-2 language model.
Lee (2021) presents a simple one-stage approach to generating sentences from MRs using the GPT-2 language model. Juraska and Walker (2021) proposed SEA-GUIDE, a semantic attention-guided decoding method for reducing semantic errors in the output text, and applied it to fine-tuned T5 and BART (Lewis et al., 2020). This decoding method extracts interpretable information from the cross-attention between encoder and decoder to infer which meaning labels are mentioned in the generated text. The use of memory as an entity modifier (Puduppully, Dong, and Lapata, 2019) or entity tracker (Iso et al., 2019) in the D2T task has already been suggested. In the model proposed by Puduppully et al. (2019), after each output token is generated by the decoder, the MR vectors of the input entities obtained by the encoder are updated using the memory module, and these updated vectors are used to select the next entity to describe in the output text. The model proposed by Iso et al. (2019) uses a memory module along with the decoder that at each time step selects the appropriate entity to be expressed in the output text, updates the encoder's hidden state of the selected entity, and generates output tokens to describe it. However, our proposal differs from theirs in that ours uses memory to help the decoder store the information it used to generate previous tokens. In this way, each output word is generated based not only on the current decoder state but also on the information and history of decisions that led to the generation of previous words.

DM-NLG model
The proposed text generator in this paper is based on the sequence-to-sequence encoder-decoder architecture. The encoder part of the model, which is an RNN, takes MR = {x_0, x_1, ..., x_L} as input and converts it into hidden vectors. The hidden vectors of the encoder are used by the attention mechanism to produce the context vector, and the final hidden state is used to initialize the hidden state of the decoder. Since the text generator must be able to express all the necessary input information in the generated output text, we propose to use dynamic memory alongside the sequence-to-sequence model. In this new model, which we name DM-NLG, the output tokens {w_0, w_1, ..., w_N} are generated by the decoder based on the content of the memory cell. After each token is generated, the content of the memory cell is updated. Finally, as a postprocessing stage, the generated sentences are given to a fine-tuned pretrained language model to improve their quality. The general outline of this model is shown in Figure 1.

Encoder
The encoder, which is a stack of GRU units, takes the input MR. At each step of the recurrence, a meaning label x_l of the input and the hidden state s_{l-1} of the previous step are given to the GRU unit, and a new hidden state s_l is generated. This process continues until all the meaning labels of the input MR have been processed and their information is stored in the final hidden state. The formulas for this process are as follows:

r_l = σ(W_r x_l + U_r s_{l-1})
z_l = σ(W_z x_l + U_z s_{l-1})
s̃_l = tanh(W_s x_l + U_s (r_l ⊙ s_{l-1}))
s_l = (1 - z_l) ⊙ s_{l-1} + z_l ⊙ s̃_l

where W_r, W_z, W_s, U_r, U_z, and U_s are weight parameters that are learned during the model training process and ⊙ denotes the element-wise product.
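The GRU recurrence above can be sketched in a few lines of numpy; weight shapes, scaling, and the zero initial state are illustrative, not the paper's training setup:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, s_prev, W_r, W_z, W_s, U_r, U_z, U_s):
    # reset and update gates
    r = sigmoid(W_r @ x + U_r @ s_prev)
    z = sigmoid(W_z @ x + U_z @ s_prev)
    # candidate state uses the reset-gated previous state
    s_tilde = np.tanh(W_s @ x + U_s @ (r * s_prev))
    # interpolate between the previous and candidate states
    return (1.0 - z) * s_prev + z * s_tilde

def encode(mr_embeddings, weights):
    """Run the GRU over all meaning-label embeddings; return every hidden state."""
    s = np.zeros(weights[0].shape[0])
    states = []
    for x in mr_embeddings:
        s = gru_step(x, s, *weights)
        states.append(s)
    return np.stack(states)
```

The full stack of states is kept because the attention mechanism needs all of them, while the last row serves as the decoder's initial state.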

Decoder
The final hidden state of the encoder is used as the initial value for the hidden state of the decoder. The decoder, consisting of GRU units, at each step takes the previously generated token w_{t-1} and the hidden state h_{t-1} of the previous step and generates the new hidden state h_t:

r_t = σ(W_r w_{t-1} + U_r h_{t-1})
z_t = σ(W_z w_{t-1} + U_z h_{t-1})
h̃_t = tanh(W_s w_{t-1} + U_s (r_t ⊙ h_{t-1}))
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

The generated hidden state is used to compute a discrete probability distribution over the vocabulary and then to predict the next word. Here, h_0 = s_L is the initial value of the decoder's hidden state, and W_r, W_z, W_s, U_r, U_z, and U_s are weight parameters that are learned during the model training process.

Attention mechanism
The output of an NLG system is a descriptive text realizing the input MR; therefore, there should be some alignment between each output text token and the input labels. This alignment is modeled through the attention mechanism. Hence, at each step t, the relevance of the decoder hidden state h_t and the previously generated token w_{t-1} to each encoder hidden state is measured, and based on that, a score is assigned to the hidden state of each meaning label. These scores indicate which input labels are most responsible for generating token w_t and are calculated as follows:

e_{t,l} = V^T tanh(W_a h_t + U_a s_l + Z_a w_{t-1})
α_{t,l} = exp(e_{t,l}) / Σ_{l'} exp(e_{t,l'})
c_t = Σ_l α_{t,l} s_l

where V, W_a, U_a, and Z_a are learned weight parameters. The context vector c_t, which is the weighted sum of the encoder hidden states, denotes the contribution of the input meaning labels to the production of w_t.
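This additive attention can be sketched directly from the score-softmax-sum pattern above (dimensions are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(h_t, w_prev, enc_states, V, W_a, U_a, Z_a):
    """Score each encoder hidden state against the current decoder state and
    the previously generated token, then form the context vector c_t."""
    scores = np.array([V @ np.tanh(W_a @ h_t + U_a @ s_l + Z_a @ w_prev)
                       for s_l in enc_states])
    alpha = softmax(scores)      # attention weights over meaning labels
    c_t = alpha @ enc_states     # weighted sum of encoder hidden states
    return alpha, c_t
```

Each row of `enc_states` is one meaning label's hidden state s_l, so `alpha` directly tells which labels drive the next token.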

Dynamic memory
As mentioned in Equations (2) and (3), both the generation of words by the decoder and the calculation of attention weights by the attention mechanism are performed based on the last generated word embedding vector (w_{t-1}) and the last hidden vector of the decoder (h_t). Therefore, the only information from past steps that affects the attention weights and word generation at the current step t is the content of the h_t vector, which is a compact representation of the sequence of words produced from step 0 to t-1. As the generated word sequence lengthens, the initially generated words carry less weight in this representation than the recently generated words. As a result, the information that led to the generation of the words at the beginning of the sentence is lost, which causes the text generator to produce repetitive words or to omit part of the input meaning labels from the output text. Therefore, we propose to use dynamic memory to store the history of previous information. In this way, to generate the output word w_t, the content of the memory, which is initialized with the encoder outputs, is first updated and rewritten based on the hidden state h_t. The information in the updated memory is then read by the attention mechanism and used to generate the output word. In this way, all the past information that led to generating words is always available to the decoder.
In this work, two types of dynamic memory with different sizes and initialization values are proposed. In the first case, the memory M^t at time step t consists of a single slot m_0^t that is initialized with the last encoder hidden state s_L and thus contains the compressed information of the input MR. In the second case, the number of slots of M^t equals the number of input meaning labels, M^t = {m_0^t, m_1^t, ..., m_L^t}. These slots are initialized with the hidden states generated by the encoder, {s_0, s_1, ..., s_L}.
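The two initialization schemes differ only in which encoder states seed the memory; a minimal sketch:

```python
import numpy as np

def init_memory(enc_states, multislot=True):
    """enc_states: array of shape (L, d) holding the encoder hidden states.
    One-slot memory keeps only the final state s_L; multislot memory keeps
    one slot per meaning label, initialized with its encoder hidden state."""
    if multislot:
        return enc_states.copy()       # M^0 = {s_0, ..., s_L}
    return enc_states[-1:].copy()      # M^0 = {s_L}, shape (1, d)
```

Both variants feed into the same writing and reading modules, so the slot count is the only structural difference between the two models.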

Memory writing module
As mentioned before, at each step t, the hidden vector h_{t-1} and the previously generated word w_{t-1} are given to the decoder to generate the hidden vector h_t. This vector, together with the context vector generated by the attention mechanism, is then given to the writing module. The writing module, which is made of GRU units, updates and rewrites the memory; here W_f, W_m, V_f, U_f, and U_m are learned weight parameters and ⊕ is the concatenation operation. If memory M^t has multiple slots, the writing module operates recursively on m_0^t, m_1^t, ..., m_L^t; otherwise, it acts as a single GRU unit on slot m_0^t.
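A sketch of the write step, assuming a GRU-style gated update driven by the concatenation h_t ⊕ c_t; the exact gate wiring is an assumption, not the paper's published equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def write_memory(memory, h_t, c_t, W_f, U_f, W_m, U_m):
    """Apply a gated update to every slot, driven by the decoder state
    and the context vector (assumed wiring, for illustration only)."""
    u = np.concatenate([h_t, c_t])            # h_t ⊕ c_t
    updated = []
    for m in memory:                          # recurse over slots
        f = sigmoid(W_f @ u + U_f @ m)        # write gate per slot
        m_tilde = np.tanh(W_m @ u + U_m @ m)  # candidate slot content
        updated.append((1.0 - f) * m + f * m_tilde)
    return np.stack(updated)
```

With a one-slot memory the loop body runs once, matching the single-GRU-unit case described above.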

Memory reading module
At time step t, after the memory M^t has been updated and rewritten into M^{t+1}, its information is read by the reading module and used to generate the next word w_{t+1}. The reading is done by an attention mechanism over the memory slots:

e_l^t = V_r^T tanh(W_r h_t + U_r m_l^{t+1})
β_l^t = exp(e_l^t) / Σ_{l'} exp(e_{l'}^t)
r_t = Σ_l β_l^t m_l^{t+1}

where V_r, W_r, and U_r are learned weight parameters. The output distribution of each token is then defined by using a softmax function:

p(w_{t+1}) = softmax(W_o r_t)

where W_o is a learned weight parameter.
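The read step mirrors the encoder-decoder attention, only over memory slots instead of encoder states; the output projection shape is an assumption for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_memory(h_t, memory, V_r, W_r, U_r):
    """Attention read over the updated memory slots; returns the read vector r_t."""
    scores = np.array([V_r @ np.tanh(W_r @ h_t + U_r @ m) for m in memory])
    beta = softmax(scores)       # weights over memory slots
    return beta @ memory         # weighted sum of slot contents

def output_distribution(r_t, W_o):
    """Project the read vector to vocabulary logits; W_o has shape (|V|, d)."""
    return softmax(W_o @ r_t)
```

Because `beta` is renormalized at every step over the rewritten slots, information written early in decoding stays reachable for the whole sequence.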

Training
The model is trained to minimize the negative log-likelihood cost function using backpropagation. Accordingly, J(θ) is formulated as follows:

J(θ) = - Σ_l ŷ_l log(p_l)

where ŷ_l is the actual word distribution and p_l is the predicted word distribution. During training, the ground-truth token of the previous time step is used as the input of the next step; at inference, a simple beam search is used to generate the output text.
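With one-hot gold distributions, the loss reduces to summing the negative log-probability of each gold token (teacher forcing); a minimal sketch:

```python
import numpy as np

def nll(pred_dists, gold_ids):
    """Sequence negative log-likelihood: pred_dists[l] is the predicted
    distribution at step l, gold_ids[l] the index of the ground-truth token."""
    return -sum(np.log(p[y]) for p, y in zip(pred_dists, gold_ids))
```

For a uniform distribution over a vocabulary of size 4, each step contributes log 4 to the loss, which is a quick sanity check when wiring up training.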

Postprocessing
At this stage, a pretrained language model that generates text autoregressively, such as GPT-2 or Transformer-XL (Dai et al., 2019), is used to improve the structural quality of the sentences produced by the decoder. For this purpose, the equivalent sentences for the MRs of the training data are first generated by the DM-NLG model. These sentences are then used along with the ground truths as input-output pairs to fine-tune the pretrained language models, in the form <BOS> generated text <SEP> ground-truth text <EOS>. Finally, at inference time, each generated sentence is given to the fine-tuned model in the form <BOS> generated text <SEP>, and the model produces the polished sentence as a continuation of the input. The impact of each of these pretrained models on the quality of the output sentences is discussed in the next section.
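The two prompt formats described above can be captured by two small helpers (the special-token strings follow the paper's notation):

```python
def finetune_example(generated: str, reference: str) -> str:
    """One training pair for fine-tuning the postprocessing language model."""
    return f"<BOS> {generated} <SEP> {reference} <EOS>"

def inference_prompt(generated: str) -> str:
    """At inference time, the fine-tuned model continues this prompt
    to produce the polished sentence."""
    return f"<BOS> {generated} <SEP>"
```

The inference prompt is simply the training pair with the reference half removed, so the fine-tuned model learns to fill in the polished text after <SEP>.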

Experiments
To evaluate the performance of the proposed model, we conducted experiments. The used datasets, experiment settings, evaluation metrics, results of the experiments, and their analysis are described below.

Datasets
We used five publicly available datasets to examine the quality of the proposed models: finding a hotel, finding a restaurant, buying a TV, and buying a laptop, published by Wen et al. (2015c); the E2E challenge dataset provided by Dusek et al. (2018); and the Personage dataset published by Oraby et al. (2018), which provides multiple reference outputs with various personality types (agreeable, disagreeable, conscientious, unconscientious, and extrovert) for each E2E MR. These datasets contain a set of scenarios, each containing an MR and an equivalent sentence. Details of these datasets are given in Table 1. In preprocessing, the values of meaning labels that appear exactly in the text (such as the name, address, phone, postal code, etc.) are replaced with a specific token both in the text and in the input MR. In this way, while reducing the size of the vocabulary, the information that influences the text generation process becomes more compact. During postprocessing, these values are put back in place.
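This delexicalization step can be sketched as a pair of string substitutions; the placeholder token format and slot names are illustrative, not the paper's actual tokens:

```python
import re

def delexicalize(text: str, mr: dict) -> str:
    """Replace slot values that appear verbatim in the text with
    placeholder tokens (token format is illustrative)."""
    for slot, value in mr.items():
        text = re.sub(re.escape(value), f"__{slot.upper()}__", text,
                      flags=re.IGNORECASE)
    return text

def relexicalize(text: str, mr: dict) -> str:
    """Postprocessing step: put the original slot values back in place."""
    for slot, value in mr.items():
        text = text.replace(f"__{slot.upper()}__", value)
    return text
```

Training on the delexicalized text shrinks the vocabulary because rare names, phone numbers, and addresses all collapse onto a handful of placeholder tokens.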

Experimental setups
The proposed models are implemented using the TensorFlow library and trained on Google Colaboratory with one Tesla P100-PCIE-16 GB GPU. The hidden layer size and embedding dimensions are set to 80, the batch size is set to 100, and the generators were trained with a dropout keep rate of 70%. The models were initialized with pretrained GloVe word vectors (Pennington, Socher, and Manning, 2014) and optimized using the Adam optimizer with a learning rate of 1e-3. Training is terminated by early stopping based on the validation loss. Moreover, every five epochs, L2-regularization with λ = 1e-5 is added to the loss. In the inference phase, beam search with width 10 is used; for each MR, 20 candidate sentences are over-generated, and the top five sentences are then selected based on their negative log-likelihood values. For fine-tuning GPT-2 and Transformer-XL as a postprocessing module, we used the gpt2 and transfo-xl-wt103 pretrained models in the Hugging Face library (Wolf et al., 2020) and trained them for five epochs with a learning rate of 5e-5 and batch size 16. The output sentences are then regenerated with beam width 10 and temperature 0.9 using the fine-tuned models.
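The over-generate-and-rerank step amounts to sorting candidates by their negative log-likelihood and keeping the best few; a minimal sketch:

```python
def select_best(candidates, k=5):
    """Keep the k candidates with the lowest negative log-likelihood.
    candidates: list of (sentence, nll) pairs produced by beam search."""
    ranked = sorted(candidates, key=lambda c: c[1])
    return [sentence for sentence, _ in ranked[:k]]
```

Over-generating 20 candidates and keeping 5 trades a little decoding time for a better chance that a semantically complete candidate survives.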
To compare the performance of the proposed DM-NLG model with the pretrained encoder-decoder models, we fine-tuned the T5 and BART models on all five datasets. For this purpose, the t5-base and bart-base models available in the Hugging Face library were used and trained for five epochs with a learning rate of 5e-5 and batch size 16. The output sentences are then generated using these fine-tuned models with a beam width of 10.

Evaluation metrics
Performance of the proposed model on the Restaurant, Hotel, TV, and Laptop datasets was evaluated using BLEU-4 (Papineni et al., 2002), BERTScore (Zhang et al., 2020), and SER metrics. SER is defined as

SER = (p + q) / N

where N is the number of slots in the MR, and p and q are the numbers of missing and redundant slots in the generated sentence, respectively (Wen et al., 2015c). For the E2E and Personage datasets, the BLEU-4, NIST (Martin and Przybocki, 2000), METEOR (Lavie, Sagae, and Jayaraman, 2004), ROUGE-L (Lin, 2004), SER, and BERTScore metrics are used. For evaluating the diversity of the generated text, Shannon entropy is used as follows:

H = - Σ_{w ∈ S} (freq(w) / total) log2(freq(w) / total)

where S is the set of unique words in all sentences generated by the model, freq(w) is the number of times word w occurs in all generated sentences, and total is the number of words in all reference sentences (Oraby et al., 2018).

Code and data will be published at https://github.com/seifossadat/DM-NLG; the E2E evaluation scripts are available at https://github.com/tuetschek/e2e-metrics.

We also ran a human evaluation to assess the faithfulness (how many of the semantic units in the given sentence can be found in the given MR), coverage (how many of the given MR's slot values can be found in the given sentence), and fluency (whether the given sentence is clear, natural, grammatically correct, and understandable) of the sentences generated by our proposed model. For fluency, we asked the judges to evaluate the given sentence and give it a score: 1 (many errors, hardly understandable), 2 (a few errors, but mostly understandable), or 3 (clear, natural, grammatically correct, completely understandable). To do this, 50 test MRs were selected randomly from each dataset, and 20 native English speakers served as judges (Fleiss's κ = 0.51, Krippendorff's α = 0.59). To avoid bias, judges were shown one randomly selected MR at a time, together with its gold sentence and the corresponding sentence generated by our model.
Judges were not aware of which sentences were generated by our model.
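The two automatic metrics defined above, SER and Shannon entropy, can be computed directly from their formulas; a small sketch:

```python
import math
from collections import Counter

def slot_error_rate(n_slots: int, missing: int, redundant: int) -> float:
    """SER = (p + q) / N, with p missing and q redundant slots."""
    return (missing + redundant) / n_slots

def shannon_entropy(generated_sentences, total_reference_words):
    """H = -sum over unique generated words of (freq/total) * log2(freq/total),
    where total is the word count of the reference sentences."""
    freq = Counter(w for s in generated_sentences for w in s.split())
    return -sum((f / total_reference_words)
                * math.log2(f / total_reference_words)
                for f in freq.values())
```

A higher entropy means the generated outputs spread their probability mass over more distinct words, i.e., more linguistic variation.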

Baselines
We compared our proposed models against strong baselines, including:

• Gating-based models. HLSTM (Wen et al., 2015c), which uses a heuristic gate to ensure that all information is accurately expressed in the generated text; SCLSTM (Wen et al., 2015b), which uses a sigmoid control gate to keep track of meaning labels while generating text; and SRGRU-Context, ESRGRU-MUL, ESRGRU-INNER, and RALSTM (Tran and Nguyen, 2017, 2019), which use different attention mechanisms to represent the input MR and then refine the input token based on this representation using a sigmoid gate.

• Attention-based models. ENCDEC (Wen et al., 2015a), which applies the attention mechanism to an LSTM encoder-decoder; CONTEXT (Oraby et al., 2018), which uses a simple sequence-to-sequence model with attention; Fine-Control (Harrison et al., 2019), which uses a sequence-to-sequence model with attention and incorporates side-constraint information into the generation process by adding additional inputs to the LSTM decoder; and Seg&Align (Shen et al., 2020), which uses an LSTM encoder-attention-decoder to first segment the text into small fragments and then learn the alignment between data and text segments.

• Autoencoders. SCVAE (Tseng et al., 2018), which uses a conditional variational autoencoder architecture to generate dialog sentences from semantic representations; VRALSTM and VNLG (Tran and Nguyen, 2018a, 2018b), which use variational encoder-decoder-based models designed to deal with scarce in-domain data; and the work of Qader et al. (2019), which jointly learns two sequence-to-sequence models for understanding and generation to compensate for the lack of annotated data.

• Transformers. Chen et al. (2020), who use GPT-2 as the generator and enable it to switch between copying input and generating tokens; Chang et al. (2021), who use data augmentation methods with the GPT-2 language model; Lee (2021), who uses GPT-2 to generate sentences given the input MRs; and SEA-GUIDE-T5-small, SEA-GUIDE-T5-base, and SEA-GUIDE-BART-base (Juraska and Walker, 2021), which use a semantic attention-guided decoding method along with beam search to reduce semantic errors in the output text.

Results and analysis
Tables 2, 3, 4, and 5 contain the results of our proposed DM-NLG model, with and without postprocessing, along with the aforementioned baseline models on the Restaurant, Hotel, TV, Laptop, E2E, and Personage datasets. (In these tables, * and ** indicate statistically significant best results at p ≤ 0.05 and p ≤ 0.01, respectively, for a paired t-test; the best and second-best models are highlighted in bold and underlined, respectively.) As can be seen, by adding memory to the attentional sequence-to-sequence model, the SER significantly decreases and the BERTScore increases compared with attentional encoder-decoder, autoencoder-based, and transformer-based models, indicating the improvement in the semantic and structural quality of the sentences generated by our proposed model. However, although our proposed model achieves a high BERTScore and surpasses all the baseline models in terms of semantics and structure, its BLEU values are either lower than or only slightly better than the baselines, except for the Personage dataset, whose baseline models were small and simple compared to our proposed model. This difference is due to the fact that n-gram-based metrics compare the hypothesis with a reference sentence based on phrase, word, syntax, and length similarity. As a result, if the hypothesis sentence is completely correct in terms of semantics and structure but not similar in terms of the words and phrases used, it gets a lower BLEU score. Based on Tables 2-5, using the GPT-2 model as a booster of the quality of the generated sentences leads to a slight improvement in the BERTScore evaluation metric. However, the Transformer-XL model reduces the quality of the output sentences: it focuses more on learning long dependencies between words and sentence segments, and given the number of model parameters (approximately twice as many as GPT-2) and the length of the training sentences (30-40 tokens), it tends to shorten sentences by expressing only the existing meaning labels. It should be noted that, since the sentences generated by DM-NLG are used as training data to fine-tune the GPT-2 and Transformer-XL models, any semantic error due to missing or redundant meaning labels in these sentences will also appear in the output sentences of GPT-2 and Transformer-XL and does not change the SER.
Examples of sentences generated by the decoder part of DM-NLG and then regenerated by GPT-2 and Transformer-XL are shown in Table 6. Moreover, the performance of the DM-NLG model without memory is close to that of other attention-based encoder-decoder models; but with the addition of memory, the BERTScore improves and the SER decreases significantly. As mentioned earlier, the DM-NLG model with one-slot memory is initialized with the final hidden state of the encoder, which contains a compressed representation of the whole input MR. As a result, the decoder is aware of the history of previous alignments when generating the hidden state for the next token, resulting in a further reduction in SER and an improvement in BERTScore. When the information of each input meaning label is stored in a different memory slot, the history of previous alignments is kept separately. Hence, when reading memory, the weights are distributed more evenly among the labels, resulting in a further reduction of SER and improvement of BERTScore compared to the single-slot case. Tables 7-9 show the results of comparing the performance of the transformer-based encoder-decoder models with the proposed DM-NLG model without the postprocessing step. For this purpose, we fine-tuned the T5-base and BART-base models on the training datasets and examined the quality of the sentences generated by these models with automated evaluation metrics. As can be seen, the SER values of the proposed model are lower and its BERTScore values higher than those of the fine-tuned models, indicating that the proposed model is well able to control the flow of input MR information into the output text. Moreover, although the transformer-based models are trained on large volumes of text and then fine-tuned on the training data, the values obtained for the n-gram-based evaluation metrics are close to each other.
Therefore, it can be concluded that the proposed model has a good ability to generate text from structured data, especially considering its smaller number of trainable parameters. For all datasets, we evaluated the amount of variation in the training set, as well as in the output text generated by the proposed model and the fine-tuned BART and T5, using entropy (Equation 9), where a larger entropy value indicates a larger amount of linguistic variation in the generated outputs. The results are shown in Table 10. The highest entropy values belong to the training sets, and none of the models is able to match their variability. However, the small difference between the entropy values of the generated and training sentences across the different datasets shows the ability of the proposed model to produce texts with appropriate variation.

https://doi.org/10.1017/S1351324923000207 Published online by Cambridge University Press
To evaluate the performance of the proposed model on unseen data, we designed an experiment in which the DM-NLG model is trained on one dataset and tested on another. Because of the shared meaning labels between the Hotel and Restaurant datasets, the TV and Laptop datasets, and the E2E and Personage datasets, we performed this test individually on these pairs, using one as training data and the other as test data. The performance comparisons are shown in Table 11. As observed, for all pairs of data, BLEU and BERTScore decreased and SER increased. These differences are larger when the Hotel and Laptop domains are used [...] MRs, and therefore, the BLEU and BERTScore decrease only slightly. These results show that our proposed model can also adapt to a new, unseen dataset.
As a subjective evaluation, we compared the output generated by our best-performing model, DM-NLG multislot with postprocessing, against its gold references. Table 12 shows the scores of each human evaluation metric for each dataset; the references are the gold-truth sentences of each dataset, and * and ** indicate statistically significant best results at p ≤ 0.05 and p ≤ 0.01, for a paired t-test and ANOVA evaluation, respectively. The sentences generated by our model received

Reference: The Punter is a restaurant providing Indian food in the moderate price range. It is located in the city center. It is near Express by Holiday Inn. Its customer rating is 1 out of 5. familyFriendly[no]

Juraska et al. (2018): The Punter is a moderately priced Indian restaurant in the city center near Express by Holiday Inn. It is not kid friendly and has a customer rating of 1 out of 5.

One-slot memory: The punter is a restaurant providing Indian food in the moderate price range. It is located in the city center near Express by Holiday Inn. Its customer rating is 1 out of 5 and is not family friendly.

Multislot memory: The Punter is a moderately priced restaurant providing Indian food. It is located in the city center near Express by Holiday Inn. Its customer rating is 1 out of 5 and is not family friendly.

The missed and redundant meaning labels for each generated sentence are shown in red and orange, respectively.
high scores from the judges on all human evaluation metrics. It should be noted that the E2E dataset has a slot called food, which specifies the kind of food served in the restaurant (Chinese, English, French, Indian, Italian, Japanese, and Fast Food). However, for some MRs, the reference sentences mention names of foods, such as "wines and cheeses," "cheeses, fruits and wines," "snacks, wine and coffee," "fish and chips," and "pasta," instead of food types.
Considering that the nationality of a food cannot be determined with certainty from its name, the judges treated such cases as a missing slot. For this reason, this dataset did not receive a full score on the coverage and faithfulness factors. Moreover, the fluency scores of the proposed model are about 0.1 points lower than those of the reference sentences, which is an acceptable difference considering that the reference sentences were written by humans. Examples of input MRs and sentences generated by our proposed model are given in Tables 13-15.

Case study and visualization
As an intuitive way to show the performance of the DM-NLG model, Tables 13-15 show samples of texts generated by our model, without postprocessing, for selected MRs from the Laptop, E2E, and Personage datasets. The input MR selected from the Laptop domain relates to the comparison of two different laptop models, so it includes duplicate labels. The model should be able to express all these labels in the output sentence in a way that makes the concept of comparison clearly recognizable in the generated sentence.

DM-NLG multi-slot: oh gosh i don't know. nameVariable is a retaurant with an average rating, also it is damn kid friendly in city center, expensive near nearVariable and an English place, i mean.

Personality: Extravert
Reference: nameVariable has an average rating, it is expensive, also it's an English place and family friendly, buddy, it is a restaurant and it is near nearVariable in city center, you know!
DM-NLG multi-slot: nameVariable is a restaurant with an average rating, also nameVariable is expensive in city center, it is an English place and it is near nearVariable, also it is kid friendly. buddy, you know!

The missed, redundant, and duplicated meaning labels for each generated sentence are shown in red, orange, and blue, respectively.

As shown in Table 13, considering the concept
of comparison, it can be said that the text generated by our proposed model is close to the reference text, producing words such as "also" and "while." The input MR selected from the E2E dataset has a label with a binary value, so the model should be able to express this concept in the output sentence. The output generated by our model successfully expresses all labels. Finally, the input MR selected from the Personage dataset also has a label with a binary value, but the main task is generating sentences that describe the input MRs with different types of personality. As can be seen in Table 15, our proposed model is able to produce the equivalent text of each personality label correctly. The heat map in Figure 2 shows the average value of the forget gate of the memory writing module (Equation 4) and the attention weights (Equation 3) for a sentence from the Laptop dataset generated by the DM-NLG one-slot model. The horizontal and vertical axes show the generated sentence and the input MR, respectively; darker colors indicate higher values and lighter colors lower values. Figure 2 shows that, at the beginning of the word generation process, the value of the forget gate is smaller, meaning that the contents of the memory are changing. As the generation approaches the end of the sentence, with all the meaning labels already expressed, the value of the forget gate becomes larger, meaning that the contents of the memory no longer change. The attention weights assigned to the input labels after generating each word in the output also reflect the desirable effect of memory usage.
The average value of the forget gate of the memory writing module, the attention weights of the memory reading module (Equation 5), and the attention weights for a sentence from the Laptop dataset generated by the DM-NLG multislot model are shown in Figure 3. Here too, the horizontal and vertical axes show the generated sentence and the input MR, respectively. In this model, the value of the forget gate of each memory cell's writing module changes as its corresponding meaning label is expressed in the text. As can be seen in Figure 3a, the contents of the memory slots corresponding to the name, type, and price range labels, which appear at the beginning and middle of the text, change more than those of the drive and platform labels, which are expressed at the end of the sentence. Also, the uniform value of the forget gate from the middle of the sentence onward indicates that the content of each memory slot is updated in a balanced way from its previous-step value and the new information. Furthermore, as observed in Figure 3b, at the beginning of the output generation, the weights that the memory reading module assigns to the slots have a nonuniform distribution, indicating that some memory slots receive more attention weight. By the end, when the memory changes very little, these weights are evenly distributed, indicating that the contents of the memory slots corresponding to the meaning labels are read in a balanced way. The weights computed by the attention mechanism also show the same favorable effect of memory usage.
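The gated write and attentive read described above can be sketched as follows. This is a hypothetical NumPy parameterization for illustration only, not the paper's exact Equations 3-5: the weight matrices `Wf`, `Wc`, and the read vector `wr` are invented names, and the precise gating form may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_step(M, h, Wf, Wc, wr):
    """One hypothetical write/read step of the dynamic memory.

    M      : (slots, d) memory, one slot per meaning label (multislot case)
    h      : (d,) decoder hidden state for the current word
    Wf, Wc : (d, 2d) projections for the forget gate and candidate content
    wr     : (d,) projection producing read-attention scores
    All parameter names are illustrative, not the paper's parameterization.
    """
    out = np.empty_like(M)
    for i, m in enumerate(M):
        x = np.concatenate([m, h])
        f = sigmoid(Wf @ x)             # forget gate: f -> 1 keeps the slot unchanged
        c = np.tanh(Wc @ x)             # candidate new slot content
        out[i] = f * m + (1.0 - f) * c  # gated rewrite of the slot
    alpha = softmax(out @ wr)           # read weights over slots
    read = alpha @ out                  # memory summary returned to the decoder
    return out, read, alpha
```

In this sketch, a forget gate near 1 freezes a slot, mirroring the behavior seen in the heat maps once a meaning label has been expressed, while the read weights `alpha` flatten toward a uniform distribution as the memory stops changing.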

Conclusion
In this paper, we presented our DM-NLG model, which improves on basic sequence-to-sequence models by fully expressing the input information in the output. This model uses a dynamic memory that is initialized with the encoded input MR. The content of the memory is rewritten and read based on the decoder's hidden state for each generated word; in this way, the memory at any time step contains the history of the information used to produce the previous words, which prevents the generation of sentences with duplicate words or incomplete information. In this work, we considered two types of memory. In the first case, the memory has only one slot, which is initialized with the last encoder hidden state vector. In the second case, the number of memory slots is equal to the number of input meaning labels, and each slot is initialized with the encoder hidden state of its corresponding label. To improve the quality of the sentences generated by the DM-NLG decoder, we used pretrained language models that are first fine-tuned on the training sentences and on the sentences generated by DM-NLG for the training set; each sentence generated for the test data is then regenerated by these fine-tuned models. To evaluate the performance of the proposed model against the baseline models, we performed experiments on well-known datasets in the field of D2T. The subjective and objective metrics used for evaluation confirmed the superior performance of our proposed model. We also compared the model without memory to the variants with one-slot and multislot memory; this experiment showed that the quality of the generated text is better when multislot memory is used.