
Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites

Published online by Cambridge University Press:  25 July 2025

Lucas Busta*
Affiliation:
Department of Chemistry and Biochemistry, University of Minnesota, Duluth, USA
Alan R. Oyler
Affiliation:
Department of Chemistry and Biochemistry, University of Minnesota, Duluth, USA
Corresponding author: Lucas Busta; Email: bust0037@d.umn.edu

Abstract

Transformer-based large language models are receiving considerable attention because of their ability to analyse scientific literature. Small language models (SLMs), however, also have potential in this area as they have smaller compute footprints and allow users to keep data in-house. Here, we quantitatively evaluate the ability of SLMs to: (i) score references according to project-specific relevance and (ii) extract and structure data from unstructured sources (scientific abstracts). By comparing SLMs’ outputs against those of a human on hundreds of abstracts, we found (i) that SLMs can effectively filter literature and extract structured information relatively accurately (error rates as low as 10%), but not with perfect yield (as low as 50% in some cases), (ii) that there are tradeoffs between accuracy, model size and computing requirements and (iii) that clearly written abstracts are needed to support accurate data extraction. We recommend advanced prompt engineering techniques, full-text resources and model distillation as future directions.

Information

Type
Original Research Article
Creative Commons
CC BY-NC-SA 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press in association with John Innes Centre

Figure 1. Comparison of SciFinder® versus PubMed® as a data source and schematic of the small language model workflow for retrieving compound–species associations from literature. (a) Structures, common names and CAS Registry® numbers for the six triterpenoid compounds used as test cases in our small language model development and evaluation work. (b) Bar plot comparing the number of references (x-axis) found by SciFinder® and PubMed® (y-axis) for the six different triterpenoids (vertically arranged panels) studied in this work. Each bar represents the number of references found by the indicated search tool for a particular triterpenoid. The absolute number of references found is shown in text to the right of each bar. Bars are colour coded according to search tool (SciFinder® in purple and PubMed® in blue). SciFinder® searches were conducted using CAS Registry® numbers, while PubMed® (which does not generally use these registry numbers) searches were conducted using compound common names. (c) Schematic for the workflow we developed to extract compound occurrence data from information in the literature. Files or information are shown in green bubbles, while steps or actions are shown as arrows. The workflow consists of searching the literature with SciFinder® based on CAS Registry® numbers, then creating a repository of references and associated full-text PDF files in an EndNote™ database; then filtering references for those of highest task-specific relevance (SLM Task A); and finally extracting compound occurrence data in either a targeted (SLM Task B1) or untargeted (SLM Task B2) fashion. Abbreviations: SLM: small language model.


Figure 2. Performance of small language models on a reference relevance ranking task. (a, b) Violin plot showing the score (BART small language model score, y-axis) assigned to references by the bart-large-mnli small language model. Scores range from zero (low relevance) to one (high relevance) and indicate the relevance of a given reference to a user-defined natural language criterion. In panel a, the score is derived from two chemical compound-specific criteria (full details in methods section), while in panel b, the score is derived from a single, generic criterion (‘chemical compounds are found in plants’). In both panels, scores are broken out according to whether the reference was labelled by a human as ‘reporting an occurrence’, ‘maybe reporting an occurrence’ or ‘not reporting an occurrence’ of a specific chemical compound in a specific species (x-axis). The number of references belonging to each group is shown above each violin. In panel a, the dotted line represents a threshold of 0.85, and in panel b, the dotted line represents a threshold of 0.9 (details of thresholds discussed in main text). (c, d) Column plot showing the proportion of references (y-axis) from each human-labelled category (‘reporting an occurrence’, ‘maybe reporting an occurrence’ or ‘not reporting an occurrence’; x-axis) that would be retained if a threshold small language model score was used for filtering references. The proportion of each column in the positive y space indicates the fraction of references that would pass the filter and be retained, while the proportion of each column in the negative y space indicates the fraction of references that would be rejected by the filter and eliminated. Exact proportions are shown in numbers above and below each column. In panel c, the threshold is 0.85, based on two-prompt scoring, while in panel d, the threshold is 0.9, based on single, general-prompt scoring (details in main text and methods section).
For example, if a score of 0.85 were used as a threshold with which to filter references that had been scored using the two-prompt small language model scoring system, then 86% of references reporting occurrences would be retained while 14% of such references would be rejected; 35% of references maybe reporting occurrences would be retained while 65% of such references would be rejected; and 20% of references not reporting occurrences would be retained while 80% of such references would be rejected. In all panels a–d, colours correspond to the three human label categories (‘reporting an occurrence’, ‘maybe reporting an occurrence’, ‘not reporting an occurrence’). BART refers to the bart-large-mnli small language model.
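The threshold-based retention described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code; the labels, scores and function name are hypothetical.

```python
from collections import defaultdict

def retention_by_label(scored_refs, threshold):
    """For each human-label category, return the fraction of references
    whose SLM relevance score meets or exceeds the threshold.

    scored_refs: iterable of (human_label, slm_score) pairs.
    """
    kept = defaultdict(int)
    total = defaultdict(int)
    for label, score in scored_refs:
        total[label] += 1
        if score >= threshold:
            kept[label] += 1
    return {label: kept[label] / total[label] for label in total}

# Toy data with made-up scores (the real scores come from bart-large-mnli):
refs = [("reporting", 0.95), ("reporting", 0.70),
        ("not reporting", 0.40), ("not reporting", 0.90),
        ("maybe", 0.86)]
retention_by_label(refs, threshold=0.85)
# -> {'reporting': 0.5, 'not reporting': 0.5, 'maybe': 1.0}
```

With real data, the returned fractions correspond to the positive-y portions of the columns in panels c and d.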


Figure 3. Performance of language models on a targeted compound occurrence data extraction task. (a) Bar plot showing various metrics (y-axis in each row of panels) for different language models (x-axis). The first row shows model size in billions of parameters, the second row shows model resolution in bits, the third row shows the speed with which a model processes references (using the prompt shown in the methods section) in units of 1000 references per hour. (b) Bar plot showing the raw performance metrics of each model (false negative, false positive, true negative and true positive rates). False negatives arise when a model erroneously marks a real compound occurrence as not being real. False positives arise when a model erroneously marks a simple textual co-occurrence of a compound name and species name as an occurrence data point. True negatives arise when a model correctly marks a simple textual co-occurrence of a compound name and species name as such and not as an occurrence data point. True positives arise when a model correctly marks a compound occurrence as such. According to human evaluation of the 500 putative occurrences used to test the models, 71% of the putative occurrences were real (i.e. ‘positives’) and 29% of the putative occurrences were just textual co-occurrence (i.e. ‘negatives’). Thus, a perfect model would have, in this experiment, a 71% true positive rate and a 29% true negative rate. Bars are coloured according to true/false positive/negative. (c) Bar plot showing the processed performance metrics of each model. In the first row, the precision of each model is shown (the ratio of true positives to the sum of true positives and false positives). In the second row, the recall of each model is shown (the ratio of true positives to the sum of true positives and false negatives). In the third row, the F1 score is shown, which is the harmonic mean of the precision and recall.
In (a–c), models are organized into columns of panels by type (large: > 20 B parameters, medium: 1–20 B parameters, small: 0–1 B parameters and optimized: 4-bit resolution models).
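The processed metrics in panel c follow directly from the raw counts in panel b. A minimal sketch (the counts below are toy values, not the paper's results):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positive, false positive
    and false negative counts, as defined in the Figure 3 caption."""
    precision = tp / (tp + fp)          # fraction of reported occurrences that are real
    recall = tp / (tp + fn)             # fraction of real occurrences that were reported
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Toy counts for a hypothetical model evaluated on 500 putative occurrences:
p, r, f1 = precision_recall_f1(tp=300, fp=50, fn=55)
```

The F1 score penalizes models that trade one metric for the other, which is why it is reported alongside precision and recall in the third row of panel c.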


Figure 4. Performance of language models on an untargeted compound occurrence data extraction task. (a) Heat map showing the per cent of outputs that contain valid python dictionaries (encoded with colour and written inside each box) from each language model (y-axis) in response to each prompt (x-axis). The marginal (i.e. top and right) plots show the mean per cent valid responses across all models for each prompt or across all prompts for each model. (b) Heat map showing the rate (in 1000 references per hour) of processing by each language model (y-axis) in response to each prompt (x-axis). The marginal (i.e. top and right) plots show the mean processing rate across all models for each prompt or across all prompts for each model. (c) Guide describing how to interpret panels (d–k). (d–k) Evaluation of occurrence data reported by language models (d/e/f/g: phi-4 and, in darkest bars, phi-4 in agreement with qwen-2.5-7B-instruct; h/i/j/k: phi-4-mini-instruct and, in darkest bars, phi-4-mini-instruct in agreement with qwen-2.5-7B-instruct). (d) and (h) show the number of correct occurrences (true positives, positive y-axis) and incorrect occurrences (false positives, negative y-axis) reported, as indicated in panel c. (e) and (i) show the number of correct occurrences missed by the models (false negatives, negative y-axis), as indicated in panel c. (f) and (j) show the number of correct occurrences (true positives, positive y-axis) and incorrect occurrences (false positives, negative y-axis) reported after filtering for occurrences whose compounds are in PubChem and were agreed upon by the two models. (g) and (k) show the number of correct occurrences missed by the models after PubChem and agreement filtering (false negatives, negative y-axis).
In (d–k), bar orientation emphasizes desired model behaviour: bars pointing upwards indicate correct model responses (desired behaviour), while bars pointing down indicate incorrect model responses or correct answers not reported by the model (undesired behaviour).
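The validity check underlying panel a can be sketched as follows. This is a plausible implementation, not necessarily the authors' exact check; the function name and example responses are hypothetical.

```python
import ast

def is_valid_dict(response: str) -> bool:
    """Return True if a model's raw text response parses as a Python dictionary."""
    try:
        return isinstance(ast.literal_eval(response.strip()), dict)
    except (ValueError, SyntaxError):
        return False

# Toy model outputs: one well-formed dictionary, one conversational response.
responses = [
    "{'compound': 'friedelin', 'species': 'Maytenus ilicifolia'}",
    "Sure! Here is the data: compound = friedelin",
]
percent_valid = 100 * sum(map(is_valid_dict, responses)) / len(responses)
# -> 50.0
```

Using `ast.literal_eval` rather than `eval` means only literal structures are accepted, so arbitrary code in a model response cannot execute during validation.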

Supplementary material: File

Busta and Oyler supplementary material

Download Busta and Oyler supplementary material (File)
File 2.6 MB

Author comment: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R0/PR1

Comments

Dear editors, we are pleased to submit an original manuscript entitled “Small language model enhances literature processing workflow: An example with plants and their secondary metabolites” for consideration for publication in Quantitative Plant Biology. Language models are rapidly transforming many aspects of science and technology. Large language models are quite popular and capable of a wide variety of general tasks at a large computational expense. In contrast, small language models are also highly capable but require far fewer computing resources and have a narrower set of capabilities. Here we test the ability of one of the most popular small language models to assist in a scientific literature processing workflow. We made several major findings, including the ability of the small language model to enhance literature processing based on recall and precision statistics, as well as effective means by which the small language model can be integrated into workflows that include other major and popular commercial software. This manuscript is not under consideration elsewhere. Please do not hesitate to contact me if you require any further information.

Sincerely yours,

Lucas Busta

Review: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

The manuscript by Oyler and Busta describes a workflow based on the use of small language models to identify species–compound connections from the scientific literature. Rapidly evolving language models provide an excellent opportunity to facilitate screening of large numbers of scientific articles and can thereby enhance our phytochemical knowledge in a (semi)automatic fashion. As such, this is a very timely and relevant article. Overall, the manuscript was very clear and well presented. The workflow and research data were presented in a highly transparent and reproducible fashion, which is laudable. While I enjoyed reading the manuscript very much, several improvements should be made before it can be published.

My main scientific question/concern is how the small language model handles subtle language differences. The authors picked a particularly difficult test case for language models in my opinion: The alpha/beta of their compounds might be fully spelt out or written as Greek characters; sometimes there is a space instead of a hyphen (e.g., alpha amyrin); alternative spellings are found (amyrine); compound names are very similar (amyrin vs amyrone). Even worse, many of these compounds tend to co-occur, i.e., many plants possess both alpha- and beta-amyrin, and plants that contain alpha-/beta-amyrone will probably also contain the biosynthetic precursors alpha-/beta-amyrin. Have the authors made any observations regarding how robust their language model was? Can it clearly distinguish between different spellings? How does it handle Greek characters? Can the authors be confident that they determined a correlation of a single compound with a species and not of a blur of compounds with similar names?

The compound–species analysis and comparison with LOTUS showed clear differences. The authors discuss some potential reasons (line 498). I think it would be nice if the authors could investigate this in a bit more depth to get a better understanding of this discrepancy, at least for a small and simple test case (e.g., dammarenediol; what about the 4 associations present in LOTUS and absent in their data?). In my opinion, this would provide valuable information on how such analyses might be improved in the future.

In many cases, it wasn’t fully clear to me when the authors only used the abstract or the full text of publications in their workflow. Maybe this could be clarified better.

Figure 2A: The authors used different y axis ranges for the two subplots (0.0–1.0 vs. 0.5–1.0). Wouldn’t it be better for comparability if the ranges were the same? Or is comparability not given anyway between one and two classifier phrases?

Figure 2A: I am not a statistician, but I was surprised by the different appearance of the violin plots in the upper and lower plots. Particularly the red violins look much thinner at the bottom (e.g., for alpha-amyrone). Shouldn’t the overall density be similar? Do the authors have an explanation for this?

Figure 2B: Are the classification scores from one or from two classifier phrases? It wasn’t fully clear to me how this panel is connected to panel 2A.

Figure 3: Why did the authors not include alpha-/beta-amyrin in this comparison? Is it because they did not do a full manual classification? Maybe it could be interesting to include a comparison solely based on the SLM results even without the manual classification? This could give insights into how realistic it would be to expand this workflow to more automated literature processing.

Minor points:

- title: The authors might want to think about replacing “secondary metabolites” by “specialized metabolites”, which seems to become more widespread (and is more suitable in my opinion)

- Abstract and introduction: The authors argue that the use of SLMs is a good alternative to LLMs because they use less resources. Maybe this could be slightly expanded to provide a better justification for their approach, considering that LLMs are also easily accessible for researchers and have also been used by the authors previously for a similar purpose (Plant Journal 2024).

- I found the use of “Workflows A/B/C/D” a bit confusing. To me, this sounded more like parallel workflows for the same task and not like consecutive steps of a single workflow. The authors might want to think about this nomenclature.

- Table 1: “Other name” what does “9CI ACI” and similar abbreviations mean? Sometimes, there is also a closing bracket missing after this.

- Table 1: “Formula” numbers should be subscript

- Figure 2: The use of red/green colors is potentially problematic for colorblind people. Maybe a different way to encode this information or clearer labels of the violins could be used.

- Figure 3: Maybe the visual presentation could be improved. First, it was difficult to see which of the circles represents LOTUS and which “this work”; the labels might be more clear or different colors could be used. Second, the dark grey color in the intersection of alpha-amyrone makes the “8” very difficult to read. Third, the colored circles in panels B/C are extremely small and difficult to assess. Fourth, it wasn’t obvious what the difference between panels B and C is (LOTUS vs this work); maybe this information could be included in the figure for more clarity.

- Line 224: “Previous work ….” sounds like references should be provided

Typos, language, etc.:

- Triterpenoid names: The authors might want to use a proper Greek character alpha/beta for their compound names. Also, (-)-friedelin should be written with a proper minus and no hyphen

- References Busta et al 2024 Biorxiv vs. Plant J; isn’t this the same? Maybe the preprint could be removed.

- Inconsistent capitalization: EXCEL vs. Excel; Scifinder Substance vs Scifinder SUBSTANCE

- Figure 1 contains many typos, please check carefully: “plant.” dot should be after quote; “WORDKFLOW”; “citatons”; “citaations”; “(from text).” check quote; “into onr row”; “specis”; “compuond”; “ffriedelin” in legend

- Line 224: “Pervious”

- Lines 252-254: “the references in that final output file were evaluated using manually classification (Section 2.2.1), then the SLM scores were evaluated against the manual classifications (Sections 2.2.2).” reads confusing, maybe rephrase.

- Line 264: “friedeilin”

- Line 362: “of-versus” what do you mean?

- Line 395: “Dammerenediol”

- Line 501: “is not be complete”

- Line 539: “suggest” should be “suggests”?

- Line 544: “compound species” add hyphen

- Line 563: “POTENTIALL-OF-INTEREST”

- Line 584: “(e.g.,see (e.g., see”

- Line 649/650: “Microsoft Corporation …”?

- Line 673: “plan”

Review: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

In the present manuscript, the authors developed workflows to obtain, organize, and filter references to scientific articles relevant to a specific project, leveraging small language models. These workflows are then applied to a plant triterpenoids case study. Overall, the paper is well written and addresses an important and relevant problem in the field. I support the publication of the manuscript, provided the authors address/discuss the following issues.

Major comments:

- Although the article’s title is centred around small language models (SLMs), they are used only in workflow C to filter/refine references gathered essentially manually. I very much appreciate the effort of using SLMs to speed up the overall process, but I wonder whether there are also other ways to streamline this process through automation. For example, could an API be used for retrieving references from SciFinder (workflow A)? In case such an API exists, a custom script could be written to take a set of CAS numbers as input and return the corresponding references as output. Similarly, do the authors think some of the EndNote features used for reference organization (workflow B) could be automated with custom scripting?

- One EndNote feature I am curious about is “its ability to retrieve full PDF files for scientific articles in a semi-automated fashion”. Can EndNote retrieve full-text PDFs for any reference included in it? Or is it limited to certain publishers? Is this process done automatically or manually? Additionally, does EndNote enable searching for text strings (e.g., specific species or compound names) within the full-text PDFs once they are downloaded?

- One major concern I have is that the full workflow proposed by the authors cannot be reproduced without subscriptions to commercial software (EndNote and SciFinder). However, I understand that these commercial products offer important features not necessarily available in free alternatives; the authors show a good example of this when comparing SciFinder vs PubMed. I think these aspects could be discussed in more detail in the paper, perhaps in a section titled “Strengths and weaknesses” or “Current limitations and future improvements”. It might also be helpful for readers to know the approximate cost of this workflow (i.e., subscription fees for academic users)?

- While I appreciate the high recall (indicating that the model does not filter out relevant references), the low precision suggests that the SLM-based filter does not effectively filter out non-relevant references, even with a high score threshold (0.89). Do the authors know why this is the case? Do they have any ideas on how performance could be improved in the future? I assume that the situation will improve as SLMs in general continue to advance. Would large language models (LLMs) provide significantly better performance? Perhaps this could be discussed in more detail in the “Current Limitations” section suggested above.

- Connected to the previous comment, it is not clear to me whether the precision and recall metrics are calculated only on the references manually classified as OF-INTEREST, or also include those classified as POTENTIALLY-OF-INTEREST.

Minor comments:

- It is unclear to me what the authors consider the difference between large language models (LLMs) and small language models (SLMs). They refer to Bart-Large-MNLI as a small language model, but from my quick search, it appears to fall under the LLM category, even though it has fewer parameters compared to larger models (around 400 million versus billions).

- In the Results section, I find the statement “set of small language model-enabled workflows” (line 87) somewhat misleading, as only workflow C appears to involve an SLM.

- Lines 131–133: “In SciFinder®, we found over 1,340 and more than 1,850 hits for these two compounds, respectively, compared to fewer than 500 and 1,000 hits in PubMed®.” Did the authors search PubMed by compound name? Is it possible to search PubMed using other compound identifiers (e.g., InChIKey, PubChem ID)?

- What are “tagged text files” (line 148)? I assume this is specific terminology from SciFinder, but not all readers may be familiar with it. Could the authors briefly explain what it refers to or provide an example file in the Supporting Information?

- Line 149: “If the number of filtered references was greater than 400, the word ‘plant’ was entered into the ‘search within results.’” Why was this done? Isn’t the goal of the workflow to streamline the processing of large amounts of literature? Was this step intended to limit the number of references for manual review to evaluate the workflow’s performance?

- I’m not sure I fully understand what the authors mean by “POTENTIALLY-OF-INTEREST references (abstracts implying but not explicitly stating the compound’s derivation from a specific plant source)” (lines 266–269). Is this classification based on the authors' subjective judgment during manual curation? Perhaps an example could be provided in the Methods section for clarity?

Recommendation: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R0/PR4

Comments

No accompanying comment.

Decision: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R0/PR5

Comments

No accompanying comment.

Author comment: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R1/PR6

Comments

No accompanying comment.

Review: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R1/PR7

Conflict of interest statement

Reviewer declares none.

Comments

The authors have thoroughly addressed all of my comments. I recommend the manuscript for publication.

Review: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R1/PR8

Conflict of interest statement

Reviewer declares none.

Comments

I thank the authors for their thorough revisions of their manuscript. I appreciate their extensive efforts to improve the paper by rewriting the manuscript and carrying out many additional experiments. However, this reviewer found it quite difficult to assess the revisions because of the drastic changes and complete rewriting of the manuscript. I would recommend that the authors keep their revisions less drastic in the future.

Nonetheless, in my opinion, the manuscript has now indeed strongly improved and benefitted very much from the revisions. Still, some minor improvements are in my opinion necessary before this manuscript can be accepted:

- I appreciate the additional explanations regarding alternative spellings (amyrin/amirine, friedelin/friedeline, amyrone/amyrenone). Nonetheless, the authors still haven’t commented on the models’ handling of the Greek characters alpha/beta with otherwise identical names. I don’t expect a very sophisticated comparison (which is difficult to achieve in my opinion), but considering that both alpha- and beta-amyrin (and alpha- and beta-amyrone) are prominently presented in Figure 1, this feels somewhat inconsistent to me and should be mentioned/discussed at least.

- Figure 1A: Please double-check all of your chemical structures. Several of these appear to be drawn as enantiomers of the naturally occurring compounds. Normal triterpenes have a beta configuration at C-3 (hydroxy group pointing up). Also, the way some stereochemistry is drawn within the ring systems (wedged/hashed C-C bonds instead of C-H bonds) is unusual and might be changed.

- Lines 211-215: The authors performed a manual evaluation of “maybe” references based on full text articles, which is a very good idea. I was wondering if this manual reannotation revealed any trends of the scores and their distribution - were the average scores of the correct “maybe” references higher than the average scores of the incorrect “maybe” references?

- Line 275: the authors state that “run times generally varied inversely with size”. I think the term “run time” is misleading here. This sounds like large models have low run times (i.e., time it takes to finish the job). I think what the authors mean is speed (articles / hour), not run time (time requirement per job), as shown in Fig. 3A.

- Figure 4C: Wouldn’t it be more logical if the arrows pointed to the coloured areas, not to the border between them?

- Figure 4E/G/I/K: Is there any rationale why the bars point downwards? In my opinion, it would be more consistent with panels D/F/H/J if the correct occurrences would point in the same direction in both cases.

Recommendation: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R1/PR9

Comments

While both reviewers greatly appreciated the authors’ efforts to address their comments and find the manuscript much improved, one of the reviewers still has some remaining issues to be addressed.

Decision: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R1/PR10

Comments

No accompanying comment.

Author comment: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R2/PR11

Comments

No accompanying comment.

Recommendation: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R2/PR12

Comments

No accompanying comment.

Decision: Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites — R2/PR13

Comments

No accompanying comment.