5.1 Introduction
Identifying appropriate methods to pinpoint potential paradata in existing datasets and data documentation is essential for data reuse. Such methods complement the existing formal documentation of practices and processes, which, as discussed earlier in this book, is never fully complete. Data reuse refers to secondary data analysis and use in which researchers or other stakeholders use data collected by others to address new research questions or for other novel purposes.
Data reusers often aggregate multiple existing datasets to address broader questions. They can also approach previously collected data from a new perspective in an attempt to solve problems other than those previously addressed. While many data reusers are researchers, data is also reused for education, societal decision-making and development of new products and services. Additionally, data reuse is crucial for reproducing earlier research and the validation of its results.
Data reuse, in its various forms – including secondary data analysis, meta-analysis, and validation – plays an important role in advancing scientific knowledge, particularly in data-driven research. While explicit ‘reuse of data’ is less common outside of this paradigm, it can be broadly understood as a reuse of earlier collected resources. This includes the use and analysis of public documents, archival records and material from cultural collections. Data reuse enables researchers to build upon the foundations laid by previous studies, optimising resources and avoiding duplication of efforts (Faniel et al., 2019; Gregory et al., 2020; Liu et al., 2023). Further, data reuse enhances methodological transparency by allowing researchers to examine and understand past research practices and processes, thus ensuring the validity and reliability of research findings across studies (Edwards et al., 2017; Huvila and Sinnamon, 2022), and enhancing the reproducibility of findings by facilitating the replication and verification of results (Deeks et al., 2023).
Previous research suggests that one of the important factors affecting data reuse behaviour is the availability of contextual information about the data, including data description, data attributes and documentation of research methods (Faniel et al., 2019; Gregory and Koesten, 2022; Murillo, 2022). This applies to all data reuse, independent of field (e.g., Faniel et al., 2019; Pickering, 1995; Rheinberger, 2023; Zimmerman, 2008). Paradata in particular is a key facet of contextual information because it documents the practices and processes relating to the creation, management and use of the data.
Paradata, despite its critical role in data reuse, is frequently not explicitly documented or structured as such. As discussed in Chapters 2 and 3, this type of information is often interwoven with various forms of primary and secondary research documentation and embedded within the research data itself. Moreover, the perspectives of data creators and reusers may differ regarding what specific information is critical for understanding practices and processes (Huvila et al., 2025). Consequently, the most important paradata from the reusers’ perspective does not necessarily find its way into the formal description of a particular procedure. Therefore, even when creators, managers and previous users do their best to provide comprehensive documentation of how they worked with a particular dataset, data reusers often need to seek additional information. To mitigate the risk of misinterpreting data, data reusers also need to be adept at identifying paradata, and be able to grasp and mobilise the resources required to access and utilise it (cf. Chapter 3). This applies not only to researchers but to everyone working with data.
A recent analysis conducted in the CAPTURE project by Juneström and Huvila distinguished several retrospective methods for identifying and using paradata in support of data reuse. These methods are concerned with identifying chains of activities described in the data, analysing data with qualitative and quantitative approaches to discern the practices and processes used to produce and process the data, and assessing the trustworthiness of digital records to ensure their authenticity.
The methods introduced in this chapter aim to support researchers interested in secondary data analysis in identifying and extracting paradata from datasets and secondary documentation. They are examples of approaches that can be applied to identifying and extracting paradata where it does not exist as formal ‘core paradata’ but can be derived from other information, discussed later in this volume as potential paradata (see Chapter 6).
5.2 Methods Descriptions
The methods described in this chapter were chosen based on a scoping review of paradata-related practices in research activities from various disciplines. A preliminary framework of paradata generation developed at the beginning of the CAPTURE project formed a baseline for identifying methods (Huvila, 2022). It was complemented by reviewing a large number of articles sourced from the project team members throughout the first four years of the project. Additional texts were identified in the reference lists of the material uncovered during the reviewing process, with the focus on including relevant complementary and contrasting descriptions of the methods and how they have been used in practice. Major categories of methods (qualitative and quantitative backtracking, data forensics and diplomatics) for post hoc identification of paradata were developed through an iterative reviewing process. This was used to develop an understanding of how documentation and paradata can be identified in different settings and how different approaches might be applicable for identifying different types of information relevant to understanding data creation-, management- and use-related practices and processes.
The methods selected for this chapter include approaches that are relatively broad and thus potentially applicable across disciplines. Some techniques specific to particular disciplines and study contexts are briefly described to exemplify an approach with potential wider relevance but are otherwise omitted to keep the chapter’s focus on general principles and widely applicable approaches. Disciplinary specificity does not, however, always mean that a method has no wider relevance. Some of the approaches stemming from specific disciplinary contexts, such as natural language processing for the quantitative processing of textual material in the health domain, have clear potential for guiding paradata practices far beyond their original context.
In the following, three categories of methods are introduced and discussed: 1) qualitative and 2) quantitative methods of backtracking, as well as 3) data forensics and diplomatics.
5.2.1 Qualitative Backtracking
Qualitative backtracking refers to a category of qualitative methods of analysing data for discerning practices and processes used to produce and process the data. Broadly, qualitative backtracking qualifies as an umbrella term to describe the use of any conceivable form of qualitative data analysis to identify and create paradata. There are, however, certain methods that are specifically focused on practices and processes rather than creating new knowledge on, for example, objects and their attributes.
Close Reading and Thematic Analysis
The CAPTURE project has conducted a series of qualitative studies to understand where and what types of paradata can already be found in diverse data-related artefacts and datasets (see also Chapter 3). A major difficulty of generating paradata ‘by extraction’ (Börjesson et al., 2022) is that datasets and accompanying documentation are often geared towards primary analysis and knowledge-making rather than secondary analysis or aggregation. This means that much paradata is scattered across research documentation, while formal documentation in metadata, readme and field documentation files is sparse or sometimes non-existent.
While an ideal approach for extracting as much paradata as possible would be to conduct a comprehensive walkthrough of all data and documentation, this is not always feasible (Börjesson et al., 2022). In such cases, it is reasonable to focus on artefacts with the greatest likelihood of containing relevant practice or process information. Such pieces of documentation extend from datasets (Börjesson et al., 2022) to research reports (Huvila et al., 2021b), citations (Huvila et al., 2022), instruction manuals and handbook literature (Huvila and Sköld, 2023).
A qualitative analysis based on iterative close reading (DuBois, 2003) of an archaeological fieldwork dataset conducted by Börjesson et al. (2022) showed that a structured datafile, especially if it has not been heavily cleaned of all anomalies, preliminary observations and interpretations, can provide a lot of information on how it was created and processed. The approach is based on careful analysis of the data from a paradata perspective, that is, keeping in mind that everything can potentially be informative of practices and processes relating to the data, marking such information in the dataset, iteratively developing a structured understanding of it, and finally visualising it in diagrams or narratives. The study showed that conducting the analysis requires understanding of both knowledge organisation (how databases and metadata schemas work, and how people generally use them) and subject expertise (in this particular study, of archaeological fieldwork). Both are needed to understand where and how paradata can be found and extracted, to comprehend what information qualifies as paradata, and to judge what limitations it is likely to have. After the analysis, the authors found that an additional step, reaching out to the original data creators to verify interpretations and fill in gaps, is highly desirable where possible. At the same time, the work also clearly showed that a dataset itself can contain enough information to allow the reader to gain a reasonably good understanding of its earlier life.
In other studies within the CAPTURE project, the same general approach of close reading and iterative coding, combined with variants of thematic analysis inspired by the constant comparative method, was applied to other research artefacts. These included research reports and instruction manuals that prescribe data generation practices and processes. The analysis started with repeated iterative reading of the material, followed by generating categories from the material, coding the material according to these categories, writing summaries, and developing narrative descriptions of the identified themes.
The categorisation was informed by the (research) questions underpinning the analysis. For example, in a study of what paradata could be extracted from archaeological field reports (Huvila et al., 2021b), the categories related to different types of information (including narrative descriptions of practices and processes, photographs, and information sources) that proved potentially relevant as paradata. In the study that focused on the analysis of a dataset (Börjesson et al., 2022), the categories typified different types of paradata (including knowledge organisation and presentation paradata).
The general approach is applicable also to close reading of diagrams, drawings and photographs (Huvila and Sköld, 2023). An overall observation from this work is that data, secondary research documentation and diverse artefacts that are used in data creation, management and use – including data management infrastructures (Börjesson, 2021) – contain many traces of practices and processes (cf. Chapter 3), which makes it possible to extract a lot of paradata. Doing so requires considerable work, however, and the varying quality and level of detail of different artefacts considerably affects the effort of backtracking paradata. One dataset or research report might contain a lot of extractable paradata while another can be relatively spartan and too ‘cleaned’ to reveal much about what happened, even if analysed in detail. Another limitation of the approach is that the understanding of practices and processes generated from the analysis of heterogeneous data is as diverse as the data itself. Different accounts can also be difficult to compare, and they do not necessarily provide descriptions systematic enough for stepwise reproduction of practices or processes.
Key References and Further Reading
Börjesson, L., Sköld, O., Friberg, Z., Löwenborg, D., Pálsson, G., and Huvila, I. (2022). Re-purposing excavation database content as paradata: An explorative analysis of paradata identification challenges and opportunities. KULA: Knowledge Creation, Dissemination, and Preservation Studies, 6(3), 1–18. The article describes a study of an archaeological fieldwork dataset and discusses the opportunities and limitations of generating paradata ‘by extraction’ from research data.
Rainey J., Macfarlane S., Puussaar A., Vlachokyriakos V., Burrows R., Smeddinck J. D., Briggs P. and Montague K. (2022) Exploring the role of paradata in digitally supported qualitative co-research. In CHI Conference on Human Factors in Computing Systems. New York: ACM, 1–16. https://doi.org/10.1145/3491102.3502103. This article illustrates how coding processes of qualitative data can be studied using thematic analysis.
Narrative Inquiry and Object Biography
Narrative inquiry is a type of qualitative analysis method that uses stories to describe and understand human action (Polkinghorne, 1995). Narrative inquiry is different from other forms of narrative analysis in that it focuses on identifying or constructing narratives for analytical purposes instead of analysing existing narratives, for example, diverse types of stories found in the literature or narrated orally (Sharp et al., 2018). Its focus on human action makes it apposite for qualitative backtracking of paradata. Polkinghorne (1995) notes that ‘narrative is the type of discourse composition that draws together diverse events, happenings, and actions of human lives into thematically unified goal-directed processes’ (p. 5). For narrative inquiry, actions, events and happenings form the building blocks from which narratives are generated and that make the individual steps of activities meaningful.
Phoenix et al. (2017) have used narrative analysis to investigate marginal comments written on paper questionnaires (i.e., marginalia) to understand the practices of interviewers and their struggle with the multiplicity of possible interpretations of the data they generate, their obligations to senior researchers, and their own emotions regarding the interview process, the participants, and their role in the research project they were involved in.
Carpentieri et al.’s (2023) narrative analysis of the open-ended questions from the first British Birth Cohort Study aimed at reusing existing data to study social mobility in post-war Britain. At the same time, their study shows how narrative analysis and the construction of ‘pen portraits’ of individual study participants also produced new knowledge on the data collection processes in a cohort study (Carpentieri et al., 2023). Gaps, anomalies and trends in data creation are sometimes difficult to discern unless individual pieces of information are put together in an attempt to form a coherent whole. These two examples illustrate how paradata linked to survey studies can be identified and repurposed to address research questions beyond those originally posed for the dataset, providing insights into the interaction between data creators and study participants.
Object biography is a method that has affinities with narrative inquiry in how it can improve understanding of dynamic relations between people and artefacts. The idea of writing life stories of objects in the manner of biographies of human beings was introduced by Kopytoff (1986). The approach has become popular especially in material culture studies and archaeology in the analysis of a large variety of smaller and larger artefacts (Joy, 2009). Friberg and Huvila’s (2019) object biographical study of an archaeological collection showcases how the approach can be applied to assemblages, and to larger and more heterogeneous artefacts than individual material objects. While Joy (2009) argues for keeping biographical analysis focused on individual objects, the key question is rather how to define the object that forms the unit of analysis than to limit inquiry to individual physical things. The use of the metaphorical notion of biography has also faced critique. The alternative metaphor of an itinerary has been suggested as a possibly more neutral substitute for biography. Biographies have been criticised for the risk of leading researchers to think of non-human matters as if they were human beings. Biography also comes with a strong connotation that a trajectory is historical and not only has a beginning but also an ending, which is seldom fully applicable to material objects or, in the context of paradata, to practices or processes (Bauer, 2019; Fontijn, 2013).
Object biography has obvious affinities with other biographical approaches to research, including the chaîne opératoire discussed later in this chapter. Another related technique is life history research, which has tended to focus on spatially and temporally larger-scale interactions relating to technology and material objects (Joy, 2009). Object biography and its underpinning concept of biography, by contrast, are premised on the idea of the idiosyncrasy and uniqueness of every individual life story (Dannehl, 2017).
Narratives, on the other hand, open up more explicitly to a priori multiplicity (Schofield et al., 2020). In contrast to narrative inquiry, which focuses on narrativising human action, the common denominator of biographical approaches is the relationship between people and objects (Gosden and Marshall, 1999). Their common feature is a parallel focus on change that brings practices and processes into the frame. A major limitation of narrative inquiry and object biography is that there are not always enough ingredients available to construct complete narratives.
As with close reading and thematic analysis, narrative inquiry and biographical research are time-consuming. At the same time, however, their advantage lies in how they help to weave people and artefacts together and, through narratives, verbalise their intermingling across time. Object biographies can be compared to identify norms and standard procedures (cf. Joy, 2009), as well as to describe the variety of practices and processes in a given context. A parallel benefit emphasised both in narrative inquiries of survey data and in object biographies is how the very act of trying to construct a narrative reveals absences, invisibilities and breaks in what is known about practices and processes. A limitation of narrative inquiry and biographical approaches is that even if the narratives are well grounded in the available evidence, they remain subjective. Also, while narratives are useful for conveying an understanding of a particular practice or process to a human reader, they are difficult for computers to process, which limits their usability as paradata in computational analysis and replication of practices and processes.
Key References and Further Reading
Bauer, A. A. (2019). Itinerant objects. Annual Review of Anthropology, 48(1), 335–352. A review of recent theoretical discussion relating to object biographies and itineraries.
Dannehl, K. (2017). Object biographies: From production to consumption. In History and Material Culture, 2nd ed., Routledge. The book chapter compares the object biography method with the life cycle model, providing useful insights to inform the choice of specific methods for life historical inquiry.
Edwards R. (2017) Working with Paradata, Marginalia and Fieldnotes: The Centrality of By-products of Social Research. Edward Elgar Publishing. The edited volume contains multiple chapters that not only illustrate through case studies how to analyse paradata, marginalia and fieldnotes in social science research but also provide insights into how the underpinning research processes can be backtracked in datasets and research documentation.
Phoenix A., Boddy J., Edwards R. and Elliott H. (2017) ‘Another long and involved story’: Narrative themes in the marginalia of the Poverty in the UK survey. In Edwards R., Goodwin J., O’Connor H., and Phoenix A. (eds.), Working with Paradata, Marginalia and Fieldnotes. Edward Elgar Publishing. The book chapter exemplifies how narrative inquiry can be used to analyse marginal notes in research documentation.
Chaîne Opératoire
As well as methods focused on proximally close analysis – literally close reading – of data, there are multiple approaches applicable to qualitative backtracking of paradata in research materials that focus on larger scales of inquiry. Chaîne opératoire (operational chain or sequence) is ‘a method of documenting technical activities in the field’ (Coupaye, 2022, p. 45, emphasis in original) developed and extensively used in archaeology and anthropology (Audouze and Karlin, 2017). Its focus on explicating social practices and technical processes, especially chains of producing, using and discarding artefacts, has obvious affinities with the ambitions of generating paradata.
Coupaye (2022) illustrates the use of chaîne opératoire as a descriptive and interpretive tool to analyse and make visible the dynamics, elements and levels of detail in technical activities. He exemplifies the use of chaîne opératoire by contrasting the operational sequences of his morning activities and of yam cultivation in Papua New Guinea, showcasing the versatility of the approach to represent both contemporary and past practices at different scales, both large (agriculture) and small (morning routines). Chaînes opératoires are typically visualised using flow diagrams to depict the sequential and structural dimensions of the portrayed activities (see Figure 5.1 for an example). The level of detail and the steps included in individual sequences vary, and as Coupaye (2022) notes, a specific chaîne opératoire is only a ‘skeleton key’ (p. 54) that cannot possibly incorporate everything about a specific process.

Figure 5.1 A simple chaîne opératoire representing a data collection, research and data archiving process with major operations and actors represented.
Long description: The diagram starts with the formulation of a research question by a researcher, leading to the planning of a study by the same researcher. This splits into directing data collection by the researcher and collecting data by a technician. Both lead to analysing data by a data analyst, which is then divided into reporting results by the researcher and archiving data by a data archivist.
Instead of being complete representations, they are rather ‘recordings of particular itineraries’ as observed by particular individuals (p. 54). Rösch’s (2021) analysis of an archaeological excavation process and Opgenhaffen’s (2022) extensive work on analysing, modelling and documenting artefact production, together with Coupaye’s (2022) illustrative example of using chaîne opératoire in a contemporary everyday-life context, provide useful examples and templates for applying the concept to extracting and structuring information on past activities also in domains outside archaeology.
The retroactive process modelling involved in constructing chaînes opératoires has obvious similarities with the prospective design of workflows (Chapter 4) but also fundamental differences. The gaze backwards and the (re)construction of a past process on the basis of its diverse material and immaterial traces call for particular caution in determining what steps to include in and exclude from the operational chain, and what remains invisible between them. Chaînes opératoires are not visible in the wild; they must be recognised as analytical constructs. Similarly, those engaged in creating, managing and using data hardly consider their undertakings to be composed of a series of discrete steps but rather experience them as a flow of practice.
From the perspective of qualitative backtracking, the method is primarily one of articulating and structuring observations of the key steps in a process, rather than modelling an operational chain as a whole. An operational chain should not be confused with a recipe or step-wise procedural code that allows rerunning a specific process. However, in spite of the evident incompleteness of the paradata that can find its way into a chaîne opératoire, it can still be highly useful in making elements of practices and steps of processes visible and in facilitating critical reflection on them (Coupaye, 2022). What needs to be kept in mind is that the shape of every individual operational chain depends on what is observed and on what questions guide the identification of its steps and of the sequence as a whole.
Key References and Further Reading
Brysbaert A. (2012) People and their things: Integrating archaeological theory into prehistoric Aegean museum displays. In Narrating Objects, Collecting Stories. Routledge. This book chapter describes the use of the concept of the chaîne opératoire together with the notion of cross-craft interaction to provide insights into how people interact with material objects in the museum context.
Coupaye L. (2022) Making ‘Technology’ visible: Technical activities and the Chaîne Opératoire. In Bruun M. H., Wahlberg A., Douglas-Jones R., Hasse C., Hoeyer K., Kristensen D. B., and Winthereik B. R. (eds.), Palgrave Handbook of the Anthropology of Technology. Basingstoke: Palgrave Macmillan, 37–60. A book chapter that provides an approachable introduction to the chaîne opératoire method.
Rösch F. (2021) From drawing into digital: On the transformation of knowledge production in postexcavation processing. Open Archaeology 7(1), 1506–1528. https://doi.org/10.1515/opar-2020-0211. This article demonstrates how chaînes opératoires can be utilised to trace the steps of data transformation and interpretation in the context of archaeological work.
Conversation Analysis
In contrast to most of the methods discussed in this chapter so far, there are also more structured and formal approaches applicable to identifying and extracting paradata. Conversation Analysis (CA) is an approach for studying social interaction that focuses on the details of action. It originates from the work of Sacks, Schegloff and Jefferson in the 1960s on studying casual conversations in everyday-life situations (Goodwin and Heritage, 1990). As a naturalistic approach to studying social interactions, CA focuses on analysing naturally occurring activities as they unfold in human interactions by recording and analysing actual situated activities (Mondada, 2012). In CA, recordings of naturally occurring activities, such as telephone calls, family dinner talk and doctor–patient communication, are intensively analysed to shed light on the social rules underpinning communication.
Based on this premise, several comprehensive data transcription schemes have been developed within CA to capture the details of social interaction, including nuances of speech, turn-taking and non-verbal actions. The following short conversation exemplifies some features of the popular Jefferson Transcription System (Jefferson, 2004):
A. Which one of the spectrometers did he use to take the measurements?
B. Did use what?
A. SPECTROMETERS?
B. I don’t know, really. We probably have to check_
A. I’ll go and see if there’s something in the notebook_
B. O::k:, sounds good. We’ll probably have to talk about it at the meeting tomorrow morning,
A. Alright (.) I can write it down on the agenda
To draw attention to some of the features of the transcription system: CAPITAL LETTERS signify loudly spoken passages, an underscore (_) unchanging pitch, a comma (,) a slightly rising pitch, colons (:) prolonged sounds, and a full stop in parentheses (.) a noticeable pause.
Conversation analysts emphasise not only the content of what is said but also the manner in which it is said, including the visible verbal and non-verbal behaviours of the participants, such as the temporal and sequential relationships and aspects of speech delivery like changes in pitch, loudness and tempo (Hepburn and Bolden, 2012). As such, CA research requires a deep engagement with recorded data, highlighting the importance of the researcher’s participation in the manual transcription process and the close integration of transcription and analysis (Bolden, 2015). Nonetheless, the interactional details captured by CA transcription practices rely on the overhearer’s perspective to piece together a plausible version of the participants’ actual experiences (ten Have, 2002).
As a qualitative backtracking approach, CA is a highly specific method for understanding practices and processes in naturalistic settings. When analysing recorded data and transcripts, CA does not assume that such aspects of context as social categories (race, gender, power, class, etc.) have inherent relevance (Joyce et al., 2023). The starting point is to record the naturally occurring activities relevant to the specific practices or processes of interest and to look for what in the analysed material indicates specific types of social interactions. For example, conversation analysts have examined turn-taking as a fundamental structure in everyday conversations, together with adjacency pairs as a basic element of sequence organisation. Adjacency pairs are sets of actions where, if one speaker performs an initial action of a certain type, the recipient is expected to respond with a corresponding action (Drew, 2004). The analysis of the organisation of sequences in conversations can facilitate our understanding of the social rules enacted in specific contexts of everyday interactions. As conversation analysts turn to the study of talk in specific institutional contexts, also known as institutional talk, one of the key objectives is to inquire into ‘what kinds of institutional practices, actions, stances, ideologies and identities are being enacted in the talk, and to what ends?’ (Heritage, 2004, p. 109).
When approached as a form of qualitative trace analysis, CA can be especially useful in identifying conversational practices related to data creation, management and use. The approach could be especially fruitful in the analysis of diverse types of recordings of practices and processes, including video and audio transcripts. Earlier CA-based studies have also examined, for example, responses in survey studies from a conversational perspective. CA could be similarly utilised, for example, to analyse interviewers’ conversational practices, or standardisation of, and deviations from, standardised ‘talk’ with a database schema when data is entered into a database system. Arminen and colleagues have investigated the practices of enacting and utilising practical know-how and institutionalised expertise in social interactions (Arminen, 2017; Arminen and Simonen, 2021), providing an example that could be transposed to studying how these play out in data-related practices and processes.
Overall, CA provides a potentially powerful lens for understanding the specifics of human interactions in data creation and reuse processes and practices by meticulously analysing moment-to-moment interactions in conversations. Its apparent drawback is that it requires detailed recordings of actual conversations, which are not always available. Another downside is that it is, like most qualitative methods, very time-consuming. CA also comes with specific theoretical and practical commitments that differ from many other approaches developed for analysing discourse and conversations (ten Have, 2006). These include its focus on individual conversations, which can limit its applicability in identifying and accounting for the impact of broader societal discourses and the sociocultural underpinnings of the conversations. It is also emphatically data-driven down to minute detail, and its interest lies in explicating interaction and how it is organised rather than in what drives the described practices or processes. However, as such it can – as the brief example above demonstrates – help to track minute details of practices and processes, and how they are talked about.
Key References and Further Reading
Arminen I. and Simonen M. (2021) Expertise as a domain in interaction. Discourse Studies 23(5), 577–596. https://doi.org/10.1177/14614456211016797. This journal article shows how CA can be used to analyse how know-how and expertise are enacted in social interactions.
Hepburn A. and Bolden G. B. (2012) The conversation analytic approach to transcription. In The Handbook of Conversation Analysis. 57–76. https://doi.org/10.1002/9781118325001.ch4. This authoritative book chapter introduces the conventions of the CA approach to transcribing conversations and discusses methodological issues of epistemological and practical concern.
McIlvenny P. and Davidsen J. (2023) Beyond video: Using practice-based VolCap analysis to understand analytical practices volumetrically. In Haddington P., Eilittä T., Kamunen A., Kohonen-Aho L., Oittinen T., Rautiainen I., and Vatanen A. (eds.), Ethnomethodological Conversation Analysis in Motion. London: Routledge, 221–244. https://doi.org/10.4324/9781003424888-15. This book chapter demonstrates the use of the CA method for enhancing the transparency of analytical processes by examining how digital tools make sense to participants in a collaborative research setting.
5.2.2 Quantitative Backtracking
Quantitative backtracking refers to methods that can be used for quantitative analysis of data and diverse forms of secondary documentation and evidence for extracting paradata. Similarly to qualitative backtracking, we use the concept to describe a broad variety of approaches that apply quantitative analysis to identify, summarise and excerpt paradata, including both statistical and machine learning techniques.
Quantitative Trace Analysis
A variety of quantitative methods can be used to analyse datasets to identify patterns in how they have come into being. The work of Börjesson et al. (2022) on close reading an archaeological fieldwork dataset provides multiple cues to how this can be done in practice with structured data. For example, using information on the points in time when specific data points were entered in a database, it is possible to reconstruct a sequence of actions showing how a dataset came into being. Depending on the routines of the database creators, the sequence is likely to relate in one way or another to the procedures of creating, managing and using the data.
In cases where the data is entered directly at the moment of creation (cf. Huvila, 2012), the sequence extractable from databases corresponds well with the work process. However, in other cases, when, for example, data entry is done in batches after a certain amount of time has passed or at a particular time of the day or week, the sequence is a less accurate and detailed representation of the data practice. In addition to identifying temporal sequences of actions, it is possible to identify patterns and follow changes in how vocabulary, descriptors or documentation of measurements evolve from the inception of a dataset to the point when it is finalised. If a dataset incorporates fields for preliminary and final interpretations, like the one analysed by Börjesson et al. (2022), it is possible to trace the progress of the work of interpreting data points either as an individual or a collaborative undertaking.
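As a simple illustration, assuming a database export with hypothetical created_at, recorded_by and context_id columns, entry sequences and batch-entry sessions can be reconstructed in Python along the following lines:

import pandas as pd

# Hypothetical export of a structured dataset; all column names are assumptions.
records = pd.read_csv("excavation_db_export.csv", parse_dates=["created_at"])
records = records.sort_values("created_at")

# The chronological order of entries approximates the sequence of data entry work.
sequence = records[["created_at", "recorded_by", "context_id"]]

# Gaps of more than an hour between consecutive entries suggest separate work
# sessions; many records with near-identical timestamps suggest batch entry.
gaps = records["created_at"].diff()
records["session"] = (gaps > pd.Timedelta(hours=1)).cumsum()
print(records.groupby(["session", "recorded_by"]).size())

How well such a reconstructed sequence reflects the actual work process depends, as noted above, on whether the data was entered at the moment of creation or in batches.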
The surge of digital data collection in both social research and the sciences has brought new opportunities to collect and backtrack trace data. Many devices used by data creators log a lot of data that can be purposefully collected as paradata, as discussed in Chapter 4. However, as exemplified in Chapter 3, even where such data has not been collected purposefully and remains a residue rather than a part of formal documentation, it still provides opportunities for post hoc analyses and paradata generation.
Many digital cameras stamp photographs not only with information on the device and its technical characteristics and calibration but also with the time when a photograph was taken and the geographical coordinates of the place where it was taken. This information can be used for reconstructing the spatial and temporal sequences of data generation. Comparable information can also be extracted from other measurement devices, including 3D laser scanners and various types of laboratory equipment, or collected separately using a GPS device.
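For example, capture times and GPS positions can be read from a photograph’s EXIF metadata with a few lines of Python. The sketch below uses the Pillow library; the file name is hypothetical and tag availability varies between cameras:

from PIL import Image, ExifTags

def photo_trace(path):
    """Extract capture time and GPS position from a photo's EXIF metadata."""
    exif = Image.open(path).getexif()
    taken = exif.get(0x0132)                # DateTime tag: 'YYYY:MM:DD HH:MM:SS'
    gps_ifd = exif.get_ifd(0x8825)          # GPSInfo IFD; empty if no GPS data
    gps = {ExifTags.GPSTAGS.get(t, t): v for t, v in gps_ifd.items()}
    return taken, gps.get("GPSLatitude"), gps.get("GPSLongitude")

print(photo_trace("site_photo_042.jpg"))

Sorting a directory of such photographs by capture time yields a coarse spatio-temporal itinerary of the fieldwork during which they were taken.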
The automatic generation of potential paradata also applies to many software packages used in data collection and analysis. In social science survey research such trace data is explicitly termed paradata and occasionally collected and preserved on purpose for future use. Computer-assisted survey software collects a lot of secondary documentation, a part of which qualifies as formal, intentionally generated and collected paradata and is termed as such in the context of survey research (Durrant and Kreuter, 2013). Other parts may require more processing to explicate the understanding of data creation and management procedures. The latter applies especially to data embedded in the survey data itself.
In interview research, interviewers’ movements in the field, collected using a GPS device, have been used to analyse to what degree sampling protocols have been followed and what has happened during the data collection process (Choumert-Nkolo et al., 2019). The analysis of response times from web-based surveys has similarly been used as an indication of whether respondents have difficulties understanding individual questions or of the amount of effort they invest (Kunz et al., 2024). Much of the focus of the auxiliary data captured from survey and interview studies has been on enhancing the validity of the collected data by addressing non-response biases in data sampling. However, diverse forms of (potential) paradata can also have many other uses in informing the understanding of data collection procedures. They can also facilitate the study of participant and researcher behaviour for a greater understanding and more diverse transparency of data creation, management and use practices and processes.
Depending on the desired granularity and type of insights into practices and processes, different quantitative analysis methods can be applicable for making sense of quantitative trace data. Trend analyses, regression and straightforward correlation analyses can offer valuable insights into trace data, such as changes in vocabulary or patterns of documenting data points. Examining the terms used to describe observations, or the frequency of measurements taken at different times during a field study, can help identify changes in data collection practices and the types of decisions made throughout the process.
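As a sketch of such an analysis, the following example (with assumed file, column and term names) counts how often selected descriptor terms occur in free-text observation records per week of a field study, making shifts in recording vocabulary visible:

import pandas as pd

obs = pd.read_csv("observations.csv", parse_dates=["recorded_at"])  # hypothetical
terms = ["posthole", "feature", "disturbance"]   # descriptors of interest

# Weekly counts of each term in the free-text description field.
weekly = obs.set_index("recorded_at").resample("W")["description"]
for term in terms:
    counts = weekly.apply(
        lambda texts: texts.fillna("").str.contains(term, case=False).sum())
    print(term, counts.to_dict())

A sudden change in such counts can mark, for instance, the point at which a team switched to a new recording convention, which can then be followed up qualitatively.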
Visualisation of spatial or temporal movements and changes can also be helpful in the interpretation of trace data. To this end, Ekström (2022) developed a tailored application to visualise the spatial and temporal aspects of citizen scientists’ data collection practices on a map. In another study, Pentland et al. (2020) developed a method for extracting contextual information from the trace data of the audit trails of electronic medical records and used the open-source ThreadNet tool to visualise the process data. Visualisation can make especially quantitative paradata easier to understand, reveal patterns and help to obtain an overview of larger sets of traces.
Meta-analysis provides another potentially useful method for trace analysis, specifically as a framework for the comparative analysis of practices or processes. It has been extensively used in the reuse of clinical trials data and involves defining the criteria for including studies, searching for and selecting studies, collecting data about a study (e.g., details of methods, participants, setting, context, interventions, outcomes and results), extracting data from reports, and then statistically combining findings from multiple distinct studies (Deeks et al., 2023). Techniques used in meta-analysis can be helpful in aggregating paradata from parallel practices and processes for comparison and a broader understanding of wider constellations of how data is created, managed and used.
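At its statistical core, the combination step can be illustrated with a minimal fixed-effect meta-analysis that pools study estimates by inverse-variance weighting; the study values below are invented for illustration:

import math

# (effect estimate, variance) pairs from hypothetical studies
studies = [(0.42, 0.04), (0.31, 0.02), (0.55, 0.09)]

weights = [1.0 / v for _, v in studies]            # inverse-variance weights
pooled = sum(w * y for (y, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))                 # standard error of pooled effect
print(f"pooled effect = {pooled:.3f} +/- {1.96 * se:.3f} (95% CI)")

Real meta-analyses add considerations such as heterogeneity and random-effects models, for which the Deeks et al. (2023) handbook chapter is the authoritative guide.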
There are also other methods for revisiting and assessing earlier data that can be applied for quantitative backtracking. Evidence review, including integrity checks, data extraction, transformation and sense-making, has been used to identify common outcome measures in data harmonisation (the process of integrating data from different sources for comparative purposes) tasks (Deeks et al., 2023; Liu et al., 2023). For example, similar to how Goldsmith and colleagues reviewed patient-reported and expert-identified scales for measuring pain using evidence mapping, identifying research gaps and multiple challenges in synthesising data (Goldsmith et al., 2018), the method can be utilised in comparing and synthesising parallel sets of quantitative traces of practices and processes.
Despite the increasing interest in collecting and analysing additional documentation on survey procedures, acquiring and integrating such data presents challenges (Sakshaug and Struminskaya, 2023), many of which are also relevant to other contexts and traces of data creation, management and use. Diverse behavioural cues, including temporal sequences and movements, are not always straightforward to link to a particular practice or process. Their meaning and implications for the generated data can be difficult to interpret, especially in secondary data analysis when the data was collected by other researchers.
In spite of the downsides of increased normalisation of paradata discussed earlier in this chapter, quantitative trace analysis would undoubtedly benefit from increased standardisation of trace data. Such standardisation might be feasible in contexts like structured survey research but less so in other contexts of data generation that lack standardised procedures and shared data structures. This applies to many branches of qualitative research but also to other domains where data practices are highly contextual and difficult to standardise due to local circumstances.
Many of the apparent problems and limitations can be mitigated by adjusting the procedures of how trace data is sourced, following appropriate sampling strategies. A crucial step in this direction is to ensure that the trace data covers the relevant participants and aspects of the practice of interest. For example, if there are traces of the decisions made by only one member of a research team, the understanding of the work of the team as a whole remains limited. Some potential problems can also be managed by selecting robust analysis methods that work for the specific types of trace data and have the potential to shed light on the specific data creation, management and use procedures at hand. To this end, there is a plethora of statistical methods that help to mitigate problems with, for example, skewed sample distributions and missing data points. The key point is that identifying and selecting trace data, as well as determining workable approaches, remains complex. This process requires a combination of methodological and domain expertise, which is essential for the successful quantitative analysis of trace data.
Key References and Further Reading
Deeks J. J., Higgins J. P., Altman D. G. and the Cochrane Statistical Methods Group (2023) Chapter 10: Analysing data and undertaking meta-analyses. Cochrane Handbook for Systematic Reviews of Interventions. https://training.cochrane.org/handbook. An authoritative handbook that introduces the principles and methods of conducting meta-analysis in healthcare.
Kocar S. and Biddle N. (2023) The power of online panel paradata to predict unit nonresponse and voluntary attrition in a longitudinal design. Quality & Quantity 57(2), 1055–1078. https://doi.org/10.1007/s11135-022-01385-x. This journal article demonstrates how to analyse trace data for identifying the predictors of panel participation in survey research.
Venturini T., Bounegru L., Gray J. and Rogers R. (2018) A reality check(list) for digital methods. New Media & Society 20(11), 4195–4217. https://doi.org/10.1177/1461444818769236. This journal article reviews conundrums relating to the use of online trace data for the analysis of collective action and provides a checklist of major issues to take into consideration.
Natural Language Processing
As a method of quantitative backtracking, natural language processing (NLP) techniques can be used to identify process and practice information in human language data. NLP is a field of research that focuses on the computational analysis and manipulation of human language.

Different types of NLP techniques exist. Symbolic NLP is based on processing textual or speech data using a set of rules. An example of a rule-based approach to identifying paradata in a research report is to define a list of conditions under which a particular phrase is interpreted as a description of a process. If the phrase ‘was measured’ appears in the ‘Methods’ section of a research report, it is considered paradata on research data creation, whereas if the same phrase appears in the historical background, it presumably relates to a historical practice.
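A minimal sketch of such a rule, assuming the report text has already been split into titled sections, could look as follows; the cue phrases and section names are illustrative:

import re

# Hypothetical rule: passive process phrases count as data creation paradata
# only when they occur in the Methods section.
PROCESS_CUES = re.compile(r"\b(was|were)\s+(measured|sampled|recorded|surveyed)\b", re.I)

def extract_paradata(sections):
    """sections: dict mapping section titles to their text."""
    hits = []
    for title, text in sections.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if PROCESS_CUES.search(sentence):
                label = ("data creation paradata" if title.lower() == "methods"
                         else "possible historical practice")
                hits.append((title, label, sentence.strip()))
    return hits

report = {"Methods": "Soil acidity was measured at each sampling point.",
          "Historical background": "Land use was recorded in parish registers."}
print(extract_paradata(report))

Such rules are transparent and easy to audit but must be crafted and maintained by hand for each new document genre.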
Statistical NLP is based on finding patterns in large masses of text. In the previous example, a statistical NLP approach could be used to find patterns in how methods sections in research reports are written and, by searching for similar patterns in other texts, to determine whether they contain methods descriptions, or at least passages that resemble methods descriptions.
Since the early 2000s, NLP has increasingly been based on the use of neural networks. In contrast to rule-based systems, which are hand-crafted, and statistical NLP, which relies on features specified by the researcher, neural networks can be trained to learn features of human language automatically, provided that large enough quantities of text are available as input. More recently, large language models (LLMs) have been applied to the task of process and procedure extraction for business process models, going beyond existing rule-based approaches (e.g., Bellan et al., 2024; Neuberger et al., 2024). Nonetheless, to enhance the transparency and fairness of the developed systems, it is necessary to address language biases in the internal knowledge of LLMs (Salinas et al., 2023), as these biases can impact downstream applications, such as process and procedure extraction for paradata generation.
Several commonly used NLP techniques are relevant to the extraction of process and procedure information. For example, Named Entity Recognition (NER) can be employed to detect instances of named entities (such as persons, organisations, locations, dates and times, and events) in text as part of an information extraction task (Bird et al., 2009). Identifying, linking and tracing, for example, persons or organisations in language data can provide insights into the practices and processes they have been engaged in. Tracking dates, times and events can help (re)construct temporal sequences and spatial locations.
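As an illustration, the following sketch applies NLTK’s off-the-shelf tokenizer, part-of-speech tagger and named-entity chunker to an invented methods-style sentence (resource names vary slightly between NLTK versions):

import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)   # models required by the steps below

sentence = ("Samples were collected by the Department of Geology "
            "in Uppsala between May and June 2019.")
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Collect the recognised entity chunks and their labels (e.g. ORGANIZATION, GPE).
entities = [(" ".join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != "S"]
print(entities)

The stock NLTK chunker recognises entity types such as organisations and places; dates and times would require an additional tagger or rules.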
Relation Extraction (RE) can be employed to identify and classify relationships between entities within a text. In the sentence ‘The committee approved the proposal’, Relation Extraction identifies the entities ‘committee’ and ‘proposal’ with the relationship ‘approved’. In addition, Entity Resolution (ER) can identify semantically equivalent entities that refer to the same information object across different data sources. For instance, ER can identify that ‘IBM’ and ‘International Business Machines Corporation’ refer to the same company across different documents. To address the problem of rule-based methods being optimised for a specific domain, one approach involves extracting the text and location of process elements (NER), resolving them into collections of unique entities (ER), and extracting entity arguments and relation types (RE) to extract business process information (Neuberger et al., 2024). Extracting process elements and their interrelations can facilitate the automated generation of research process models.
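A naive relation extraction pass can be sketched with dependency parsing, here using the spaCy library (an assumption for illustration; the work cited above uses trained, domain-specific extraction pipelines). Verbs with an explicit subject and object yield candidate (subject, relation, object) triples:

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def subject_verb_object(text):
    """Extract naive (subject, verb, object) triples as candidate relations."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [w for w in token.children if w.dep_ in ("nsubj", "nsubjpass")]
                objects = [w for w in token.children if w.dep_ == "dobj"]
                if subjects and objects:
                    triples.append((subjects[0].text, token.lemma_, objects[0].text))
    return triples

print(subject_verb_object("The committee approved the proposal."))
# [('committee', 'approve', 'proposal')]

Such triples are only candidates; resolving them into unique entities (ER) and typed relations (RE) requires the further steps described above.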
In the application of NLP to paradata identification, a paradata extraction approach that involves close iterative reading can be assisted by NLP techniques (Börjesson et al., 2022). The CAPTURE project tested this in a short pilot project with promising results (Huvila et al., 2022). To promote the semantic integration of datasets, different textual patterns for temporal expressions in archaeological datasets have been identified as part of a standardisation process, noting the importance of keeping the original context (and provenance) of the dating information (Binding and Tudhope, 2023). As an application of NLP techniques to assessing document similarity, Sakahira et al. (2023) analysed excavation report texts concerning buried cultural artefacts, demonstrating that similarities between texts based on sentence embedding (transforming sentences into numerical vectors) of excavation reports can reflect similarities among archaeological sites. However, the application of NLP techniques to large amounts of data, involving various steps of data standardisation and processing, requires a high level of technical skill.
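As a minimal sketch of the sentence-embedding idea, the following example uses the sentence-transformers library and a generic pre-trained model (both assumptions for illustration; the study cited above used its own pipeline) to compare short report passages:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # an assumed, generic encoder
passages = [
    "The trench was excavated in 10 cm spits and sieved through a 4 mm mesh.",
    "Excavation proceeded in 10 cm layers; all soil was dry-sieved (4 mm).",
    "Pottery sherds were photographed and catalogued in the finds database.",
]
embeddings = model.encode(passages, convert_to_tensor=True)
print(util.cos_sim(embeddings, embeddings))   # pairwise cosine similarities

The first two passages, which describe the same practice in different words, should score markedly more similar to each other than to the third.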
For developing applications that perform simple NLP tasks, such as counting the number of words, creating a list of words, tracking word positions, and counting word frequencies in a text, it is possible to use an out-of-the-box toolkit. The Natural Language Toolkit (NLTK) is an example of a popular programming library (www.nltk.org) that can be used directly to perform simple NLP work and to develop one’s own more complex NLP applications using the Python programming language (Bird et al., 2009). Many other toolkits exist for multiple programming languages and platforms for developing NLP applications.
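For instance, a few lines of NLTK suffice for the simple tasks just mentioned; the input file name is hypothetical:

import nltk

nltk.download("punkt", quiet=True)   # tokenizer model

with open("fieldwork_report.txt", encoding="utf-8") as f:   # hypothetical input
    words = [w.lower() for w in nltk.word_tokenize(f.read()) if w.isalpha()]

freq = nltk.FreqDist(words)
print(len(words), "tokens,", len(freq), "distinct words")
print(freq.most_common(10))   # the ten most frequent words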
The Clinical Trial Risk Tool (https://app.clinicaltrialrisk.org) exemplifies how NLP toolkits can be used to develop user-friendly tools to extract process information from textual data. It takes as input a clinical trial protocol in PDF format, extracts information on the key facets of the reported trial, compares it to quality norms, and produces a report assessing the risk that the reported trial is uninformative.
The major drawback of NLP approaches to quantitative backtracking is that an NLP algorithm never understands its input as a human being does. The generated outputs are potential paradata rather than a definitive list of all relevant information. Both false positive and false negative results pose a risk, making the results useful mainly as an initial step towards more in-depth analysis.
Another obstacle to scaling up such applications is the scarcity of extensive training datasets for extracting process information (Bellan et al., 2024; Neuberger et al., 2024). In spite of these shortcomings, NLP techniques can provide strong support for identifying potential paradata in text corpora that would otherwise be impractical to analyse by hand. Moreover, when combined with related techniques for analysing, for example, static and moving images (object analysis), NLP-based backtracking can be extended from text and speech to trace data in other media formats and their combinations.
Key References and Further Reading
Bach R. L., Kern C., Bonnay D. and Kalaora L. (2022) Understanding political news media consumption with digital trace data and natural language processing. Journal of the Royal Statistical Society Series A: Statistics in Society 185(Supplement_2), S246–S269. https://doi.org/10.1111/rssa.12846. An article that exemplifies how NLP and statistical techniques can be used to elicit information on news media consumption practices from web browsing data.
Bird S., Klein E. and Loper E. (2009) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media. An approachable book-length hands-on introduction to NLP techniques, with detailed documentation available online at www.nltk.org.
Neuberger J., Ackermann L. and Jablonski S. (2024) Beyond rule-based named entity recognition and relation extraction for process model generation from natural language text. In Sellami M., Vidal M.-E., van Dongen B., Gaaloul W., and Panetto H. (eds.), Cooperative Information Systems. Cham: Springer Nature Switzerland, 179–197. https://doi.org/10.1007/978-3-031-46846-9_10. This conference paper proposes an approach to process information extraction combining the tasks of named entity recognition, entity resolution and relation extraction.
Data Forensics and Diplomatics
In addition to discrete qualitative and quantitative data analysis methods that can be applied for backtracking practices and processes in primary and secondary material, there are also broader methodological frameworks developed for inquiring into data and its contexts, including paradata. In this section we briefly discuss the two parallel approaches of data forensics and diplomatics. They represent two distinct but prospectively complementary approaches to analysing documents and their characteristics (Duranti, Reference Duranti2009a). The focus of both forensic and diplomatic analysis is on assessing the authenticity, reliability and completeness of records, and their ‘ability to proof facts at issue’ (Duranti, Reference Duranti2009a, p. 64), albeit from two different methodological starting points.
Data forensics refers to the analysis of digital data and how it is created and used (Pandey et al., Reference Pandey, Husain and Khan2020). It is sometimes categorised as a branch of digital forensics (sometimes called computer forensics), that is, the forensic study of digital information and records. Its roots are in the study of digital information to support investigations of crimes committed with the help of computers (Pollitt, Reference Pollitt, Chow and Shenoi2010). Much of the work in this area focuses on analysing digital data in legal contexts (Arshad et al., Reference Arshad, Jantan and Abiodun2018), for example, collecting electronic evidence to support criminal investigations and law enforcement. It is also guided by principles derived from forensic science, including the crucial importance of not relying on a single source of evidence and of corroborating and consolidating findings from multiple sources (Ries, Reference Ries2018). Sub-branches of data forensics focus on forensic data analysis in specific contexts. For example, educational data forensics investigates what can be termed potential paradata in test takers’ response data to detect indications of test fraud (De Klerk et al., Reference De Klerk, Van Noord, Van Ommering, Veldkamp and Sluijter2019).
Forensic techniques can be used in diverse digital contexts. Forensic analysis of media content shared via social media and web platforms has been used to verify the sources and integrity of shared media, to identify the platforms through which content has passed, and to assess the credibility of digital objects consisting of both text and audiovisual media (Pasquini et al., Reference Pasquini2021). Content sharing on social networks leaves digital traces that enable the identification of processing platforms, the reconstruction of sharing history, and the extraction of upload system details (Pasquini et al., Reference Pasquini2021), that is, information that effectively functions as paradata. Hodges (Reference Hodges2021) demonstrates in a study of biomedical device maintenance work how forensic analysis can ‘constitute a valuable approach to recovering knowledge about behaviors that have already taken place, or that have taken place in contexts where efforts at observation could encounter problems related to access, intellectual property, privacy, or safety’ (Hodges, Reference Hodges2021, p. 1404).
Many tasks in digital forensics rely on technical methods for recovering data and on scientific and computational, often quantitative, approaches to analysing it. The forensic analysis procedure consists of identifying and recovering digital evidence, prioritising the most promising data for closer inspection, analysing it, and finally evaluating and interpreting the findings (Duranti, Reference Duranti2009a). Computational analysis can help especially in forensic analyses of large data resources in the context of what has been termed ‘big data forensics’ (Zawoad and Hasan, Reference Zawoad and Hasan2015).
Hodges’ (Reference Hodges2021) work and forensic analyses in media studies (e.g., Kirschenbaum, Reference Kirschenbaum2008; Reference Kirschenbaum2014; Ries, Reference Ries2018) and digital preservation exemplify how forensics can also benefit from the use of qualitative methods, including what can be described as the close reading of data files. Hodges applies an analysis method that draws on digital forensics and trace ethnography, a method developed for identifying and tracing actors and events that often remain invisible in digital data (Geiger, Reference Geiger2016; Geiger and Ribes, Reference Geiger, Ribes and Sprague2011). The approach follows the ethnographic logic of developing rich descriptions of activities, not necessarily by participating in them in the same physical location but rather by being present in the networks where activities take place, gathering and analysing documentary evidence.
Earlier uses of qualitative forensic analysis illustrate the approach. Geiger and Ribes’s (Reference Geiger, Ribes and Sprague2011) study shows how the method can be used to inquire into the practices of vandals on Wikipedia by tracing their activities on the MediaWiki software platform running the encyclopaedia and on external software tools. Hodges (Reference Hodges2021) analyses traces of labour in a corpus of repair manual files in PDF format. While only a handful of the analysed files contain formal metadata, the manuals contain handwritten page numbers indicating their users’ need to refer and go back to specific pages in the document, evidence that they have been digitised from non-digital originals, wear of the original documents before their digitisation, diverse marginalia (including underlining and circling of content), and added pages. All such traces evince how the documents have been managed and used during their lifetime.
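Where formal metadata is present in such files, it can be read programmatically. A minimal sketch, assuming the pypdf library and a hypothetical file name:

```python
# A hedged sketch: reading the formal metadata embedded in a PDF file.
from pypdf import PdfReader  # pip install pypdf

reader = PdfReader("manual.pdf")  # hypothetical file name
meta = reader.metadata            # may be None if no metadata block exists
if meta is not None:
    # Producer and creator strings often reveal the scanning or authoring
    # software; the dates hint at when the file was created or modified.
    print(meta.title, meta.author, meta.creator, meta.producer)
    print(meta.creation_date, meta.modification_date)
```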
A typical problem in data forensics is that data is stored in a file format that is unknown or for which no readily available tools exist. Ries’ (Reference Ries2018) study exemplifies how all, even unknown, types of binary data files (i.e. files not coded in plain text) can be read for close analysis and compared using generic hex editors (a type of file editor capable of showing the contents of binary files). In the legal domain, data forensics is also complicated by the diverse anti-forensic measures used by criminals to hinder forensic analyses.
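The kind of inspection a generic hex editor affords can also be scripted. The following minimal sketch, using only the Python standard library and a hypothetical file name, prints offsets, hexadecimal byte values and an ASCII rendering of the start of an arbitrary binary file:

```python
# Hex-editor-style inspection of an unknown binary file.
def hexdump(path: str, length: int = 128) -> None:
    with open(path, "rb") as f:
        data = f.read(length)
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        print(f"{offset:08x}  {hex_part:<47}  {ascii_part}")

# The first bytes ('magic numbers') often identify the format even when the
# file extension is missing or misleading, e.g. b'%PDF' or b'\x89PNG'.
hexdump("unknown.bin")  # hypothetical file name
```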
Diplomatics is a methodology that was developed in the seventeenth century for verifying the authenticity of current and archival records (Duranti, Reference Duranti, Ambrosio, Barret and Vogeler2014). Classic diplomatics is based heavily on the analysis of the physical characteristics of records, that is, the form and format of documents written on, for example, parchment or paper. It aims to shed light on the contexts and reasons for record creation, the persons and other agents involved in the process, and the relation of records to other documents. Diplomatics of digital records, or digital diplomatics, refers to applying the approach in the digital realm, utilising and benefitting from methods developed within digital forensic practice (Duranti, Reference Duranti2009a).
Duranti (Reference Duranti2009a) has proposed that an amalgam of digital diplomatics and digital forensics could be termed digital record forensics. In contrast to the diplomatic analysis of the human-readable aspects of physical documents, digital record forensics and computational digital forensics involve analysing trace data in machine-readable formats and documenting both the output data and the derivation method (Niu, Reference Niu2013).
A digital diplomatic analysis starts with a description of the digital environment in which the analysed data exist, and of their digital and logical structure and form. Applying the methodology for extracting paradata does not necessarily require that the analysed data or documents fulfil all the criteria of formal (digital) records (including an identifiable context of creation, originator, action, links to other records, fixed form and stable content). However, many of these details are clearly informative of practices and processes relating to the record and, as such, are useful as paradata. Moreover, the focus of diplomatics on establishing the trustworthiness and authenticity of evidence can provide direction to the work of identifying and extracting paradata. It allows the researcher to consider whether and to what extent the extracted paradata is authentic and trustworthy enough for the planned purposes.
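As a rough illustration of how the ‘fixed form, stable content’ criteria can be documented in practice, a checksum and basic filesystem attributes can be recorded and later re-computed to verify that a file’s content has remained stable. This sketch uses only the Python standard library; the choice of attributes is an assumption for illustration, not a formal diplomatic checklist:

```python
# Record a file's fixity: a SHA-256 checksum plus basic attributes.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def describe(path: str) -> dict:
    p = Path(path)
    stat = p.stat()
    return {
        "file": p.name,
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        # Re-computing the digest later verifies that content is stable.
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
    }

print(describe("record.pdf"))  # hypothetical file name
```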
The major difference between data forensics and diplomatics lies in their underpinnings. Diplomatics builds on a long tradition of historical and linguistic research, whereas many forensic techniques build on methods borrowed from the sciences, medicine and engineering (Duranti, Reference Duranti2009a). Conducting digital forensic analysis requires some degree of technical skill, whereas diplomatics requires an in-depth understanding of the analysed materials, their context and mechanisms of creation, and of diplomatic criticism, a method related to historical source criticism.
A comprehensive forensic or diplomatic analysis can be time-consuming compared to many other approaches to paradata extraction. The focus of both data forensics and diplomatics is on assessing the trustworthiness of digital records and ensuring their authenticity rather than on producing complete accounts of any particular practice or process. Both methodologies, alone and combined, nevertheless provide a practical framework to guide paradata extraction. Diplomatics offers a model and guidance for identifying how documents and records relate to their originating practices and processes, whereas forensics offers a systematic framework for the technical work of identifying and recovering, prioritising and analysing, and evaluating and interpreting evidence.
Key References and Further Reading
Duranti, L. (Reference Duranti2009a). From digital diplomatics to digital records forensics. Archivaria, 68, 39–66. An approachable introduction to diplomatics, digital forensics and digital records forensics.
Hamouda H. A. (Reference Hamouda2023) Authenticating citizen journalism videos by incorporating the view of archival diplomatics into the verification processes of open-source investigations (OSINT). In 2023 IEEE International Conference on Big Data (BigData). Sorrento, Italy: IEEE, 2036–2046. https://doi.org/10.1109/BigData59044.2023.10386935. This conference paper demonstrates how archival diplomatics can be applied to the analysis of citizen journalism videos and their authenticity through explicating their processual underpinnings.
Pasquini C., Amerini I. and Boato G. (Reference Pasquini2021) Media forensics on social media platforms: A survey. EURASIP Journal on Information Security 2021(1), 4. https://doi.org/10.1186/s13635-021-00117-2. This journal article provides an extensive review of digital forensic methods for analysing media content shared via social networks.
Rogers R. (Reference Rogers2023) Tracker analysis: Detection techniques for data journalism research. In Doing Digital Methods, 2nd ed. SAGE, 239–258. This book chapter introduces digital forensics techniques for media and social research projects, applicable as guidance for paradata extraction.
5.3 Discussion
Many types of methods can be useful for extracting paradata retrospectively from secondary information relating to practices and processes, even when that information does not itself qualify as paradata. The approaches differ in the level of detail of the analysis, in their aims regarding what types of information and insights are produced, and in how practices and/or processes are represented. They also have diverging theoretical underpinnings. Some, including formal metadata, are based on objectivist representation of practices and processes, whereas others, like close reading, are firmly based on interpretivist theorising.
The key practical difference between qualitative and quantitative approaches lies in the former’s focus on close, interpretative, in-depth analysis of typically small quantities of information, and the latter’s focus on developing explanations or predictions from relatively large amounts of data. Both general approaches require time and effort, but the craft-like nature of qualitative analysis means that it does not scale as well as quantitative methods.
This means in practice that qualitative methods work better when the aim is to develop an in-depth understanding of particular practices or processes using a finite amount of material. Quantitative analysis is better suited for identifying broader patterns of activity based on larger quantities of data. This is not, however, the only difference between many of the methods discussed above and others applicable for extracting paradata.
The epistemological and ontological underpinnings of the approach used have implications as to what kind of information the method generates, and correspondingly, how the identified activity stands out, for example, as a practice, process, sequence of steps or flow of action. For example, using chaîne opératoire to understand practices or processes frames them as operational sequences with the ontological consequence that the described activity essentially becomes a sequence of discrete steps. Narrative inquiry leads to a very different outcome where a practice or process is both framed as and turned into a story.
Qualitative backtracking methods can be useful for discerning practices and processes used to produce and process data. One of the key steps in secondary data analysis involves data interpretation based on the contextual information about the data. There are guidelines available for writing and analysing fieldnotes in ethnographic studies (Copland, Reference Copland, Phakiti, De Costa, Plonsky and Starfield2018; Emerson et al., Reference Emerson, Fretz and Shaw2011) that are useful for extracting information on both the practices and processes of generating the notes and those described in them.
In contrast, despite the long interest in paradata, there is still a lack of established traditions and consistent approaches in the social sciences for analysing comparable information relating to survey data (Goodwin et al., Reference Goodwin, O’Connor, Phoenix, Edwards, Edwards, Goodwin, O’Connor and Phoenix2017). Part of this difference may be traced back to the epistemological debate around the relationship between the researcher and the data in survey research, and whether or to what degree the research process and data are separable from each other (Joyce et al., Reference Joyce, Douglass, Benwell, Rhys, Parry, Simmons and Kerrison2023). Specifically, since fieldnotes and findings are considered inseparable from the observational process in ethnographic studies, ethnographic documentation and approaches to tracing practices and processes are based on the tenet that such documentation incorporates rich evidence of the multiple, situational realities of fieldwork (cf. Emerson et al., Reference Emerson, Fretz and Shaw2011).
In contrast, various branches of research, including survey studies, often treat research findings and evidence of the research process as distinct entities – a perspective frequently criticised by constructivist researchers and theorists. In such quantitative studies, identifying and analysing evidence linked to, rather than embedded in, findings can comparably enrich the understanding of the research process (e.g. Fahmy and Bell, Reference Fahmy, Bell, Edwards, Goodwin, O’Connor and Phoenix2017; Phoenix et al., Reference Phoenix, Boddy, Edwards, Elliott, Edwards, Goodwin, O’Connor and Phoenix2017). Such differences underline the importance of reflecting on one’s own epistemological position and the significance of choosing and using different paradata creation methods in alignment with each other.
Pairing methods is also possible, and using a combination of methods can help to generate more comprehensive and nuanced information on practices and processes. For instance, trace ethnography (discussed briefly above in conjunction with data forensics and diplomatics) combines participant observation with the analysis of extensive data found in computer logs to reconstruct user patterns and practices within online communities (Geiger and Ribes, Reference Geiger, Ribes and Sprague2011). Similarly, combining computational analysis of digital traces with ethnographic observation in online ethnographic research can provide a more nuanced understanding of the investigated community (Barkhatova, Reference Barkhatova2023).
Further, some of the prospective and in-situ methods of paradata generation discussed in Chapter 4 can also be applied to retrospective data on practices and processes. The presence of core paradata (cf. Chapter 6) is helpful not only for directly conveying an adequate understanding of practices and processes for data reuse but also as a starting point for closer examination of secondary sources. With rudimentary core paradata in place, it becomes easier to start knitting diverse forms of secondary descriptions and traces together to form a richer account of how a dataset was created, and how it is managed and used. Moreover, it can help in assessing eventual constraints on the secondary use of data, which are themselves informative of practices and processes (Johns et al., Reference Johns, Meurers, Wirth, Haber, Müller, Halilovic, Balzer and Prasser2023).
While formal metadata, data modelling and ontologies are typically used prospectively to prescribe data generation, they can also be used retroactively. The work of Thomer and colleagues on geobiology fieldwork (Thomer et al., Reference Thomer, Wickett, Baker, Fouke and Palmer2018), discussed in Chapter 4, demonstrates some of these possibilities.
It is also possible to combine different prospective and retrospective methods. For example, the CIDOC CRM (a formal ontology for the documentation of cultural heritage), PROV-DM (the W3C provenance data model), and named graphs can be employed in combination to represent objects and their related practices and processes (Shoilee et al., 2023).
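A minimal sketch of the idea using the rdflib library in Python follows; the URIs, the example resources and the property choices are invented for illustration and do not reproduce Shoilee et al.’s actual model:

```python
# Combining PROV-O terms and a CIDOC CRM property in a named graph.
from rdflib import Dataset, Namespace, URIRef  # pip install rdflib
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/")  # illustrative namespace

ds = Dataset()
# Statements about one digitisation event go into their own named graph,
# so that the provenance of this description can itself be tracked.
g = ds.graph(URIRef("http://example.org/graph/digitisation"))

g.add((EX.scan001, RDF.type, PROV.Entity))
g.add((EX.digitisation, RDF.type, PROV.Activity))
g.add((EX.scan001, PROV.wasGeneratedBy, EX.digitisation))  # process link
g.add((EX.scan001, CRM.P70_documents, EX.object42))        # documented artefact

print(ds.serialize(format="trig"))
```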
All methods discussed in this chapter aim at what Lund (Reference Lund2024) calls a diachronic analysis of materials with a potential to function or be appropriated (cf. Chapter 7) as paradata. They aim at explicating and understanding, in Lund’s terms, different phases of a particular process in a given situation or, if framed in terms of practices, the enactment of an unfolding practice. By using different methods, it is possible not only to extract different kinds of information on diverse practices and processes but, in effect, to extract different practices and processes out of the available primary and secondary traces.
In this sense, the choice of methods for identifying and extracting paradata goes beyond the simple question of choosing a method applicable to analysing a smaller or larger corpus of traces consisting of specific types of data. It is also, in a very fundamental sense, a question of choosing a method for extracting or, more correctly, constructing and enacting a specific kind of practice or process. A chaîne opératoire enacts an operational chain whereas narrative inquiry constructs a story, much as following GPS coordinates enacts a journey in space rather than a rich description of a complex practice in its entirety.
Finally, the present brief review of a small sample of methods applicable for identifying and extracting paradata also shows how paradata not only adds to our understanding of data creation, management and use practices and processes to enable reuse (Goodwin et al., Reference Golub and Liu2017) but can also generate new perspectives on datasets and how they can be used (Carpentieri et al., Reference Carpentieri, Carter and Jeppesen2023). When collecting data for meta-analysis in biomedical research, paradata associated with clinical trials data can be useful for ensuring the integrity of datasets (Li et al., Reference Li, Higgins and Deeks2023) and for reducing the publication bias effects of not including unpublished, difficult-to-find studies (Borenstein et al., Reference Borenstein, Hedges, Higgins and Rothstein2009). Forensic analysis of digital footprints or traces of activities on social networks can similarly help to establish the trustworthiness and authenticity of digital records in, for example, specific legal contexts (Duranti, Reference Duranti2009a; Pasquini et al., Reference Pasquini2021). The multiple uses and usabilities of the different methods underline their diversity. They also demonstrate the malleability of paradata discussed throughout this volume and how it can be bent to diverse uses.
5.4 Conclusions
The effective identification of paradata during data creation processes is important for enabling and guiding data reuse. Retrospective methods of extracting paradata, including qualitative and quantitative backtracking, data forensics and diplomatics, provide clues not only for discerning past activities but also for ensuring data integrity, authenticity and trustworthiness. Since contextual information about data (including data descriptions, attributes and research methods) significantly influences data reuse across disciplines, data reusers can mitigate the risk of data misinterpretation by familiarising themselves with methods for identifying paradata related to data creation practices and processes.
The selection of methods introduced in this chapter provides researchers with guidance on effectively identifying and extracting paradata for secondary data analysis. Such analysis not only enriches our understanding of the research process but also generates new perspectives on datasets, independent of research discipline and domain of practice. Qualitative backtracking methods enable the analysis of data to discern practices and processes, offering valuable insights, for example, into fieldwork dynamics, such as interviewer–participant interaction in survey studies and data generation in the field sciences. Quantitative backtracking methods, including meta-analysis and natural language processing techniques, offer means to identify and extract practice and process information across large sets of data and secondary documentation. Data forensics and diplomatics are examples of methodologies that extend beyond individual methods. Both provide guidance on how to think and act regarding evidence of practices and processes and the extraction of potential paradata. They also exemplify the benefits of systematicity in the work of identifying and extracting paradata-like information. Data forensics provides a tentative template for how to proceed with paradata analysis and extraction, and digital diplomatics a lens to direct attention to specific aspects of documentation, as records, pertaining to practices and processes.