How to quickly select good in-context examples in large language models for data-to-text tasks?

Published online by Cambridge University Press:  14 October 2025

Yulong Li
Affiliation:
Hefei University of Technology, Hefei, China
Jiaoyun Yang*
Affiliation:
Hefei University of Technology, Hefei, China
Lili Jiang
Affiliation:
Department of Computing Science, Umeå University, Umeå, Sweden
Shuo Liu
Affiliation:
Hefei University of Technology, Hefei, China
Ning An
Affiliation:
Hefei University of Technology, Hefei, China
Corresponding author: Jiaoyun Yang; Email: jiaoyun@hfut.edu.cn

Abstract

In the realm of data-to-text generation tasks, the use of large language models (LLMs) has become common practice, yielding fluent and coherent outputs. Existing literature highlights that the quality of in-context examples significantly influences the empirical performance of these models, making the efficient selection of high-quality examples crucial. We hypothesize that the quality of these examples is primarily determined by two properties: their similarity to the input data and their diversity from one another. Based on this insight, we introduce a novel approach, Double Clustering-based In-Context Example Selection, specifically designed for data-to-text generation tasks. Our method involves two distinct clustering stages. The first stage aims to maximize the similarity between the in-context examples and the input data. The second stage ensures diversity among the selected in-context examples. Additionally, we have developed a batched generation method to enhance the token usage efficiency of LLMs. Experimental results demonstrate that, compared to traditional methods of selecting in-context learning samples, our approach significantly improves both time efficiency and token utilization while maintaining accuracy.

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press
Figure 1. Flowchart of different in-context learning methods for data-to-text. Different colored icons represent the semantic representations of data/text. The similarity-based method chooses demonstrations that are more similar to the input data and achieves better output results, but it has higher time complexity ($M$ is the size of the training set, $n$ is the size of the test set, $D$ is the size of the representation vector). In contrast, the diversity-based method has lower time complexity but produces poorer output results.
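The similarity-based retrieval the caption describes can be illustrated with a short sketch: every training example is scored against the input embedding, which is exactly the $O(nMD)$ cost noted above. This is a minimal illustration under our own assumptions (unit-normalized embeddings, hypothetical function name), not the authors' implementation:

```python
import numpy as np

def select_by_similarity(train_emb: np.ndarray, x_emb: np.ndarray, k: int):
    """Return indices of the k training examples most similar to the input.

    train_emb: (M, D) unit-normalized embeddings of the training set.
    x_emb: (D,) unit-normalized embedding of one test input.
    Scoring every training example costs O(M * D) per test input,
    which is the source of the higher time complexity noted above.
    """
    scores = train_emb @ x_emb           # cosine similarity (vectors are unit-normalized)
    return np.argsort(-scores)[:k]       # indices of the k most similar examples
```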

Figure 2. The schematic representation of DCCS. The training set is initially stratified into $K$ primary clusters based on the embeddings of the data. Each primary cluster subsequently undergoes sub-clustering into $m$ categories based on the embeddings of the reference texts. The $m$ centroid samples are selected as candidate in-context examples for each primary cluster. During the inference phase, the test data $x$ is encoded and assessed for similarity against the $K$ cluster centers, and the candidate in-context examples from the proximal category are selected.
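The two clustering stages in the caption can be sketched with scikit-learn's KMeans. This is a minimal illustration assuming precomputed data and reference-text embeddings; the function names are our own, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def dccs_preprocess(data_emb, text_emb, K, m, seed=0):
    """Stage 1: cluster training *data* embeddings into K primary clusters.
    Stage 2: within each primary cluster, cluster the *reference text*
    embeddings into m sub-clusters, and keep the sample closest to each
    sub-cluster centroid as a candidate in-context example."""
    primary = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(data_emb)
    candidates = {}
    for c in range(K):
        idx = np.where(primary.labels_ == c)[0]
        sub = KMeans(n_clusters=min(m, len(idx)), n_init=10,
                     random_state=seed).fit(text_emb[idx])
        picks = []
        for centroid in sub.cluster_centers_:
            d = np.linalg.norm(text_emb[idx] - centroid, axis=1)
            picks.append(int(idx[np.argmin(d)]))   # sample nearest the sub-centroid
        candidates[c] = picks
    return primary.cluster_centers_, candidates

def dccs_select(x_emb, centers, candidates):
    """Inference: pick the primary cluster whose center is nearest to x
    and return its precomputed candidate in-context examples."""
    c = int(np.argmin(np.linalg.norm(centers - x_emb, axis=1)))
    return candidates[c]
```

At inference time only the $K$ cluster centers are compared against the test embedding, instead of the whole training set, which is where the time savings over per-example retrieval come from.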

Algorithm 1: Preprocess of DCCS

Algorithm 2: Inference of DCCS

Algorithm 3: Inference of DCCS-Batch

Table 1. Instruction and data format in prompt text

Figure 3. Left: Single Generation, where text is generated for one structured data instance at a time. Right: Batched Generation, where text is generated for five instances simultaneously.
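A batched prompt of the kind shown on the right can be assembled by sharing one set of demonstrations across several numbered inputs, so the demonstration tokens are paid for once per batch rather than once per instance. The prompt layout below is a hypothetical sketch, not the paper's exact format:

```python
def build_batched_prompt(instruction, examples, batch_inputs):
    """Assemble one prompt that asks the model to verbalize several
    structured inputs at once, amortizing the shared in-context examples.

    instruction: task description string.
    examples: list of (data, text) demonstration pairs.
    batch_inputs: list of structured-data strings to verbalize together.
    """
    parts = [instruction]
    for data, text in examples:                     # shared demonstrations
        parts.append(f"Data: {data}\nText: {text}")
    for i, data in enumerate(batch_inputs, 1):      # numbered batch items
        parts.append(f"Data {i}: {data}")
    parts.append("Write one text per numbered data item, in order.")
    return "\n\n".join(parts)
```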

Figure 4. Prompt for single generation.

Figure 5. Prompt for batched generation.

Table 2. Comparison of Retrieval Time Saved and Token Saved (%) between DCCS and KATE across datasets

Table 3. Comparative results of data-to-text generation using GPT-3.5 on the E2E, DART, and WebNLG (100 test samples) in a 5-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table 4. Comparative results of data-to-text generation using GPT-3.5 on the ToTTo (100 test samples) in a 5-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table 5. Comparative results of data-to-text generation using GLM-3 on the E2E, DART, and WebNLG (100 test samples) in a 5-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table 6. Comparative results of data-to-text generation using Llama-3.1 on the E2E, DART, and WebNLG in a 5-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table 7. BLEU and PARENT on the ToTTo using Llama-3.1 in a 5-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table 8. BLEU for batched generation on the E2E, WebNLG, DART and ToTTo using GPT-3.5 in a 5-shot setting. The best performance per dataset is shown in bold

Table 9. BLEU for batched generation on the E2E, WebNLG, DART and ToTTo using GLM-3 in a 5-shot setting. The best performance per dataset is shown in bold

Table 10. BLEU scores for batched generation on the E2E, DART, WebNLG, and ToTTo using Llama-3.1 in a 5-shot setting. The best performance per dataset is shown in bold

Table 11. Comparison of 5-shot prompt generation time (ms)

Table 12. Comparative results of data-to-text generation using Llama-3.1 on the E2E, DART, and WebNLG datasets in a 5-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table 13. BLEU and PARENT on the ToTTo using Llama-3.1 in a 5-shot setting. The best performance per metric is shown in bold

Table 14. Statistical comparison of the DCCS and Random methods on E2E, DART, and WebNLG datasets in 5-shot setting using GPT-3.5. Metrics include BLEU, ROUGE-L, and BERTScore. The t-test results (t-value and p-value) indicate that DCCS significantly outperforms Random across most metrics (p < 0.05)

Figure 6. Average Silhouette Coefficient across varying cluster counts for the (a) E2E (top left), (b) DART (top right), (c) WebNLG (bottom left), and (d) ToTTo (bottom right) datasets. The number of clusters with the highest coefficient is chosen for the first clustering stage.
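Selecting the cluster count by the average silhouette coefficient, as in Figure 6, can be sketched as follows. This is a minimal illustration using scikit-learn; the candidate range of $K$ values is an assumption for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_cluster_count(emb, k_range, seed=0):
    """Fit KMeans for each candidate K and return the K with the
    highest average silhouette coefficient over the embeddings."""
    best_k, best_s = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(emb)
        s = silhouette_score(emb, labels)   # mean silhouette over all samples
        if s > best_s:
            best_k, best_s = k, s
    return best_k, best_s
```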

Table 15. A test sample with three groups of in-context examples selected by Data-based Centroid, Nearest Cluster, and DCCS from the WebNLG dataset in 5-shot setting

Figure 7. Semantic representation of in-context examples selected by different methods. (a) Blue points represent the DCCS method, (b) Green points represent the Data-based Centroid method, (c) Orange points represent the Nearest Cluster method, and Red points indicate reference text.

Figure 8. The change in Generation Failure Rate and the Average Number of Tokens per Instance (TpI) with increasing batch size.

Table 16. Average human evaluation scores (in percentage) for DCCS-generated outputs compared to KATE and Random across Fluency, Informativeness, and Relevance dimensions

Figure 9. Human evaluation of factual consistency for GPT-3.5 outputs on the E2E, DART, and WebNLG datasets (Hallucination $\downarrow$, Missing Fact $\downarrow$, Accurate $\uparrow$).

Table 17. Comparison of BLEU scores between fine-tuned state-of-the-art (SOTA) models and our DCCS-based in-context learning (ICL) method using 5-shot prompts across different base models and datasets

Table A1. Comparative results of data-to-text generation using GPT-3.5 on the E2E, DART, and WebNLG (100 test samples) in a 10-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table A2. Comparative results of data-to-text generation using GPT-3.5 on the ToTTo (100 test samples) in a 10-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table A3. Comparative results of data-to-text generation using GLM-3 on the E2E, DART, and WebNLG (100 test samples) in a 10-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table A4. Comparative results of data-to-text generation using Llama-3.1 on the E2E, DART, and WebNLG in a 10-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table A5. BLEU and PARENT on ToTTo using Llama-3.1 in a 10-shot setting. The best performance per metric is shown in bold and the second-best result is underlined

Table B1. BLEU for batched generation on the E2E, WebNLG, DART and ToTTo using GPT-3.5 in a 10-shot setting. The best performance per dataset is shown in bold

Table B2. BLEU for batched generation on the E2E, WebNLG, DART and ToTTo using GLM-3 in a 10-shot setting

Table B3. BLEU for batched generation on the E2E, DART, WebNLG and ToTTo using Llama-3.1 in a 10-shot setting. The best performance per dataset is shown in bold

Table C1. Tukey–HSD pairwise comparisons on BLEU ($\Delta$ = G2 − G1). Bold indicates $p \lt 0.05$

Figure D1. BLEU scores on the E2E, DART, and WebNLG as a function of the number of in-context examples ($m$).