
The architecture of language: Understanding the mechanics behind LLMs

Published online by Cambridge University Press:  06 January 2025

Andrea Filippo Ferraris
Affiliation:
Department of Law, University of Bologna, Bologna, Italy; DIKE and the Law Faculty, Vrije Universiteit Brussel, Belgium
Davide Audrito
Affiliation:
Computer Science Department, University of Torino, and Legal Studies Department, University of Bologna, Bologna, Italy
Luigi Di Caro
Affiliation:
Computer Science Department, University of Torino, Torino, Italy
Cristina Poncibò*
Affiliation:
Department of Law, University of Turin, Torino, Italy
*
Corresponding author: Cristina Poncibò; Email: cristina.poncibo@unito.it

Abstract

Large language models (LLMs) have significantly advanced artificial intelligence (AI) and natural language processing (NLP) by excelling in tasks like text generation, machine translation, question answering and sentiment analysis, often rivaling human performance. This paper reviews LLMs’ foundations, advancements and applications, beginning with the transformative transformer architecture, which improved on earlier models like recurrent neural networks and convolutional neural networks through self-attention mechanisms that capture long-range dependencies and contextual relationships. Key innovations such as masked language modeling and causal language modeling underpin leading models like Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) series. The paper highlights scaling laws, model size increases and advanced training techniques that have driven LLMs’ growth. It also explores methodologies to enhance their precision and adaptability, including parameter-efficient fine-tuning and prompt engineering. Challenges like high computational demands, biases and hallucinations are addressed, with solutions such as retrieval-augmented generation to improve factual accuracy. By discussing LLMs’ strengths, limitations and transformative potential, this paper provides researchers, practitioners and students with a comprehensive understanding. It underscores the importance of ongoing research to improve efficiency, manage ethical concerns and shape the future of AI and language technologies.

Information

Type
Position Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press.

Figure 1. The softmax activation function as used in large language models (LLMs). Each raw output (logit) is exponentiated and then normalized by dividing by the sum of the exponentiated outputs, ensuring the resulting probabilities lie between 0 and 1 and sum to 1.
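To make the caption concrete, here is a minimal NumPy sketch of the softmax step (the logits are toy values, not taken from any model in the paper):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw model outputs (logits) into a probability distribution."""
    # Subtract the max for numerical stability; it does not change the result.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Example: three raw scores over a toy vocabulary.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # e.g. [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```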


Figure 2. The image illustrates how positional encoding works in transformers. Word embeddings (blue boxes) are created from inputs like “The” and “quick,” while positional information (pink boxes) tracks word order. These are combined and then passed to the transformer model (green box), enabling it to understand word order in sequence processing.
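The figure does not specify which positional-encoding scheme is used; the sketch below assumes the sinusoidal encoding from the original transformer paper (Vaswani et al., 2017) and uses toy embeddings in place of learned ones:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings as defined in Vaswani et al. (2017)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions use cosine
    return pe

# Toy embeddings for "The quick" (2 tokens, model dimension 8); values are placeholders.
embeddings = np.random.randn(2, 8)
# As in the figure, word embeddings and positional information are summed before the model.
model_input = embeddings + sinusoidal_positional_encoding(2, 8)
```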


Figure 3. This image represents the core components of the transformer architecture, focusing on the multi-head attention mechanism. The left side shows the stacked multi-head attention and feedforward layers, which are applied to both the input and output sequences. Positional encoding is added to account for word order in the sequence. On the right, a zoom-in shows how scaled dot-product attention combines query, key and value matrices, normalized through a softmax function, to compute attention scores. This enables transformers to efficiently capture relationships between words regardless of their position (Vaswani et al., 2017).
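A small NumPy sketch of the scaled dot-product attention shown in the zoom-in; the matrices below are random stand-ins for the query, key and value projections of three tokens:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

# Toy example: 3 tokens, head dimension 4 (random values for illustration only).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))   # each row of attention weights sums to 1
```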


Figure 4. Diagram of an encoder–decoder transformer model demonstrating sequence-to-sequence translation.
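As an illustration of such a model in practice, the sketch below assumes the Hugging Face transformers library and a publicly available English-to-German translation checkpoint (an example choice, not a model discussed in the paper): the encoder reads the source sentence and the decoder generates the translation token by token.

```python
# Requires: pip install transformers sentencepiece torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"   # example seq2seq translation checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)   # encoder encodes, decoder generates
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```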


Figure 5. Attention heatmap from a decoder-only model, illustrating how each token in the sequence attends only to itself and previous tokens. The triangular structure results from masked self-attention, ensuring the model generates text autoregressively by relying solely on past context.
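A minimal NumPy sketch of the masked (causal) self-attention that produces the triangular pattern in the heatmap; the attention scores here are random toy values:

```python
import numpy as np

seq_len = 5
# Lower-triangular mask: token i may attend only to tokens 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.randn(seq_len, seq_len)         # raw attention scores (toy values)
scores = np.where(causal_mask, scores, -np.inf)    # block attention to future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # upper triangle is 0, giving the triangular heatmap in the figure
```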


Figure 6. This image demonstrates gender bias in LLMs. The model assumes a female nurse and a male doctor in its translations, reflecting common stereotypes embedded in training data.


Figure 7. The diagram showcases the workflow of retrieval-augmented generation (RAG). It starts with a prompt and query, which are used to search external knowledge sources for relevant information. The retrieved information enhances the context, which is then passed along with the original query to the large language model (LLM). This improved context helps the LLM generate a more accurate and relevant text response.
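A schematic Python sketch of the retrieval step in this workflow, using a toy in-memory knowledge base and placeholder embeddings; a real system would use a learned embedding model and a vector store, and would pass the augmented prompt to an LLM for generation:

```python
import numpy as np

# Toy knowledge base; documents and vectors are hypothetical stand-ins.
documents = {
    "doc1": "The transformer architecture was introduced in 2017.",
    "doc2": "Retrieval-augmented generation grounds answers in external sources.",
}
doc_vectors = {k: np.random.randn(8) for k in documents}   # stand-in embeddings

def retrieve(query_vector: np.ndarray, top_k: int = 1) -> list[str]:
    """Rank documents by cosine similarity to the query embedding."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(documents, key=lambda k: cosine(query_vector, doc_vectors[k]), reverse=True)
    return [documents[k] for k in ranked[:top_k]]

query = "When was the transformer introduced?"
query_vector = np.random.randn(8)            # placeholder for a real query embedding
context = "\n".join(retrieve(query_vector))
augmented_prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(augmented_prompt)                      # this enriched prompt is what the LLM would receive
```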