Longitudinal pregnancy and birth cohort studies are powerful resources for exploring the developmental origins of health and disease (DOHaD). They provide the opportunity to investigate how parental and environmental factors occurring during early life (preconception, pregnancy, infancy, and childhood) influence fetal and child growth, developmental trajectories, and long-term susceptibility to disease. Reference Wadhwa, Buss, Entringer and Swanson1–Reference Godfrey, Lillycrop, Burdge, Gluckman and Hanson3 However, the ability to address these questions depends on access to data sets with large sample sizes, varied and heterogeneous exposure information, and long-term repeated follow-up. To meet some of these requirements, the scientific community has increasingly begun to combine data from existing cohort studies. A prerequisite for co-analysis of individual participant data (IPD) across studies is that the data formats and meanings are comparable, which requires harmonizing study-specific data where possible, i.e., transforming collected data to a common format. Reference Fortier, Raina and Van den Heuvel4–Reference Granda and Blasczyk6 Indeed, the number of such harmonization initiatives has increased exponentially during the past two decades, also driven by the call from the scientific communities and funders to make data findable, accessible, interoperable, and reusable (FAIR). Reference Wilkinson, Dumontier and Aalbersberg7
The types of retrospective harmonization initiatives focusing on DOHaD research vary extensively. Risk factors (e.g., genetic background, mother’s stress, air pollution), outcomes (e.g., birth weight, cognitive development, cancer) and life stages of interest differ from one initiative to the other. While most of these focus on specific research questions, initiatives like Lifecycle, Reference Jaddoe, Felix and Andersen8 ENRIECO, Reference Gehring, Casas and Brunekreef9 RECAP-Preterm, Reference Zeitlin, Sentenac and Morgan10 ReACH, Reference Bergeron, Massicotte and Atkinson11 and Global Pregnancy CoLab COLLECT database Reference Myatt, Roberts and Redman12 aim to address a broad range of objectives. Each of these initiatives also differs in magnitude, with the number of participating cohort studies varying from two or three Reference Tollånes, Strandberg-Larsen and Forthun13 to over 20. Reference Gehring, Casas and Brunekreef9,Reference Voerman, Santos and Inskip14,Reference Bousquet, Anto and Sunyer15 Finally, various governance, data warehouse, and data-sharing infrastructures can be adopted, while different methodological and operational approaches are used to handle data access, cleaning, harmonization, documentation, and co-analysis.
Although each initiative is unique, all are confronted by similar issues. Having to collate, understand, process, host, and co-analyze IPD from individual cohort studies is challenging. While this is the case for DOHaD research, it is also true for any initiative harmonizing and co-analyzing existing data across individual studies. First, it is often difficult to access structured documentation or obtain comprehensive information from local study teams related to cohort designs, participant follow-ups, and specific data items or samples collected/available. This can lead to important challenges in selecting appropriate data sources and ensuring the optimal use of data. Reference Butters, Wilson and Burton16 Second, organizational, ethical, and legal requirements typically restrict access to individual participant data or allow access, but only under specific conditions, often differing from one study to another. Therefore, the time required to understand these rules and complete data access procedures can be significant. Third, because of the heterogeneity of the information collected and the number and timing of study-specific data collection events, comparison and/or integration of data across studies present major methodological challenges. Fourth, data harmonization and co-analysis require access to secure and potentially sophisticated data sharing frameworks, methodological expertise, and specialized tools (e.g., standards, software), fundamentals that are not always accessible. Finally, for large-scale harmonization initiatives, maintaining the personnel, infrastructure and documentation required to support optimal long-term use of the harmonized data can be difficult.
While achieving optimal harmonization is and will remain challenging, scientific success and timely management of the initiatives can be facilitated by an ensemble of factors. In the following paper, we aim to provide an overview of the logistics and key elements to be considered from the inception to the end of collaborative epidemiologic projects requiring harmonizing existing data.
The paper content was generated using a consensus approach bringing together information from different sources. First, the experience of the authors in leading, collaborating in, or supporting over 50 harmonization initiatives from research networks in a broad range of research areas helped to build the core elements of the paper. Second, scans of the literature were used to identify additional challenges faced and solutions implemented by other harmonization initiatives. Third, the authors conducted a survey on a subset of 20 initiatives to answer specific questions raised and gather concrete examples of harmonization in practice. The survey included information about the challenges faced, the variables harmonized, the personnel and time required to achieve different tasks, and the data infrastructure implemented. Over 60 initiatives that, at various levels, informed the development of the paper are listed in Supplementary material S1. Consensus on paper content was achieved through a series of topic-specific meetings coordinated by Maelstrom Research, 17 ReACH (Research Advancement through Cohort Cataloguing and Harmonization), Reference Bergeron, Massicotte and Atkinson11 EUCANconnect 18 and DataSHIELD 19 initiatives from 2018 to 2021. These meetings included cohort investigators and experts in various domains (e.g., epidemiologists, software architects, computer scientists, data analysts, statisticians, ethicists, lawyers, physicians, project coordinators).
Life course of harmonization initiatives
Harmonization initiatives can pursue divergent goals. Some have broad scientific objectives and engage numerous collaborators from various disciplines. Others are set up to answer very specific research questions and harmonize a limited number of variables across a limited number of studies. While each initiative is unique, investigators must generally develop and finance their research plan, implement a working environment adapted to their needs, and generate and preserve harmonized data to ultimately achieve statistical analysis. Figure 1 provides an overview of the conceptual workflow undertaken by harmonization initiatives. The workflow presented is complementary to the iterative harmonization steps proposed by the Maelstrom guidelines for retrospective harmonization. Reference Fortier, Raina and Van den Heuvel4 To simplify reading, the workflow is described as linear. However, it needs to be adapted to the reality of each initiative, and a back-and-forth process is generally required to continually improve procedures and outputs generated based on the experience gained and results observed. An example of a specific harmonization initiative, the Prenatal Alcohol Exposure (PAE) project, is provided in Supplementary material S2.
Conceive the project idea and proposal
The research questions addressed and research plan proposed by harmonization initiatives need to be Feasible, Interesting, Novel, Ethical, and Relevant (FINER). Reference Cummings, Browner, Hulley, Hulley, Cummings, Browner, Grady and Newman20 As in any research project, this involves defining elements including the objectives to be pursued and research questions addressed; the specific exposures and outcomes required to answer the research questions; and the suitable population size and characteristics (e.g., mothers, children, age range, area of residence, primipara). In addition, the logistical, operational, technical, methodological, ethical, and legal elements specific to harmonization and co-analysis of IPD across independent cohort studies generally need to be outlined. Relevant elements to be considered depend on the objectives and scale of the initiative but can comprise defining the proposed governance model; the criteria used to select participating studies; the data infrastructure to be implemented; the operational and methodological approach to harmonization; and the statistical methods foreseen to validate and analyze harmonized data. Ideally, the protocol should also include a first evaluation of the harmonization potential across participating studies to outline the true potential of the project to answer the research questions addressed. Specialized catalogs documenting the design and content of mother-and-child studies are available to the research community to facilitate such evaluation. Reference Jaddoe, Felix and Andersen8,Reference Zeitlin, Sentenac and Morgan10,Reference Bergeron, Massicotte and Atkinson11,Reference Larsen, Kamper-Jørgensen and Adamson21
Apply for funding
Harmonization initiatives can require significant investment in time and expertise, both from the participating cohort studies and from the team coordinating the project, and the budget should reflect this reality. Costs relate to many factors including the scope of the initiative, the complexity of the governance and data infrastructure, the number of cohort studies involved, the quality of study-specific data and metadata (information about data), and the number and type of harmonized variables to be generated across studies. Based on the survey results, relatively small-scale initiatives (e.g., aiming to harmonize 10–15 variables across four studies) can require 2–6 person-months (number of months, for the equivalent of one person working full time) to generate a validated harmonized data set, while large-scale initiatives (e.g., aiming to harmonize 150 variables across more than 10 studies) can require from 15 to over 80 person-months. Generally, most of the staff's time is dedicated to data inventory, cleaning, management, and processing.
For many initiatives, the time and costs required to obtain data should also be considered. Depending on the context, data access procedures (from submitting an access request to being ready to initiate harmonization) can take from a week to more than a year per study. In addition, study-specific data access fees might be applicable and may easily exceed 2,000€. Implementing a complex data infrastructure (e.g., distributed across several studies) could also be required by large initiatives. Setting up such an infrastructure demands time from technical experts (e.g., in data management and security) and can take several months. An overview of the timeline and costs of the PAE project is provided in Supplementary material S2. Finally, as in individual cohort studies, large-scale harmonization initiatives might also need long-term funding to ensure a sustainable platform and the maintenance and management of access to the harmonized data sets generated.
Initiate activities and organize the operational framework
The success of large long-term harmonization initiatives often depends on building a collaborative, interdisciplinary team of experts and implementing flexible but efficient operational and governance models. Large initiatives bring together data users (investigators requesting harmonized data to achieve their research goals), data producers (stakeholders from participating cohort studies), and experts from specialized domains (e.g., longitudinal data analysts, ethicists, computer scientists, epidemiologists, clinicians). Team members generally come from different research groups, and each member brings their own professional background and level of expertise. To optimize project operations and research outcomes, efforts might thus be required to build a unified approach and common understanding of various concepts.
Building consensus is not always necessary (e.g., in a small initiative with a narrow research question). However, to ensure efficient launch of activities, the team needs to rapidly delineate the practical requirements and operational details related to the research agenda, the data infrastructure to be implemented, and the data harmonization and analysis framework. Table 1 provides various examples of questions that could be addressed by the team.
Assemble information on studies
It is generally important to gather precise information about the characteristics of the studies actually enrolled so as to ensure the quality of the harmonized data set. Data comparability is affected by heterogeneity of the study-specific populations and data content. Access to comprehensive information on study designs, population characteristics, data collected, duration and timing of data collection events, and the standard operating procedures used can be required to confirm the eligibility of studies and estimate harmonization potential. Study-specific inclusion criteria are different for each research question addressed, but could include study-specific design (e.g., cohort studies), number of participants (e.g., at least 500 mothers recruited at baseline), sampling/recruitment frame (e.g., representative sample of pregnant women in a geographic area), years of recruitment (e.g., mothers recruited after 2010), number and frequency of data collection events (e.g., at least two data collection events during pregnancy), data/samples collected (e.g., smoking status, cord blood), specific time of collection (e.g., fasting glucose collected before 12 weeks of pregnancy), and potential to access IPD (e.g., IPD can be transferred to a central repository; or IPD can be analyzed remotely but cannot physically be shared/transferred or copied). Harmonization initiatives generally select studies before initiating the project. However, large-scale ones can address a broad range of research questions, each requiring inclusion of different subsets of studies presenting specific characteristics.
Define variables to be generated and evaluate harmonization potential
A DataSchema, or list of core variables (e.g., outcomes, risk factors) to be generated using study-specific data items, generally needs to be outlined. Selecting and defining these variables is probably the most scientifically challenging step of the harmonization process. It can require participation of researchers with specific domain expertise (e.g., nutrition, mental health), investigators or data managers from member studies, and personnel with technical expertise in data harmonization. The information collected across cohort studies is generally not standardized, the wording of questions and measures used to evaluate the same constructs (e.g., level of physical activity, alcohol consumption) typically differ, and there is variation in the format, structure, and naming conventions of variables. In addition, the research questions addressed often require the analysis of longitudinal data (e.g., several data collection events during pregnancy or throughout the life of the child), but the collection events are likely to differ between and within studies in key ways that affect compatibility. As an example, Table 2 outlines information about mothers’ binge drinking during pregnancy collected by five Canadian cohorts. Various DataSchema variables could be created using the data collected by these studies. These include, but are not limited to, a unique “binge drinking status during pregnancy” variable defined as binge drinking at least once during pregnancy (yes/no) or a “current binge drinking status” (yes/no) variable paired with the “time when binge drinking status was collected” (number of weeks of pregnancy). While there is rarely a unique or perfect solution, it is important to implement a rigorous and transparent decision-making process to select and define the DataSchema variables. The process should be guided by the scientific needs of the project, including specific requirements related to the statistical analysis planned.
3D: 3D Study – Design, Develop, Discover; AOF: All Our Families; APrON: Alberta Pregnancy Outcomes and Nutrition; FAMILY: Family Atherosclerosis Monitoring in Early Life; OBS: Ontario Birth Study.
Various elements can be used to define the DataSchema variables. These include the: nature of the variable (e.g., smoking status, highest completed level of education); value type (e.g., integer, text, decimal); format, including the specific units (e.g., kg) or list and description of the response options (e.g., 0 = Never; 1 = Almost never; 2 = Sometimes; 3 = Often; 4 = Very often); targeted individual or entity (e.g., the information is about the mother, father, neighborhood); targeted time period (e.g., first trimester, last 30 days, at birth); interdependence with other information needed to interpret the variable (e.g., birthweight and duration of pregnancy); acceptable sources of information (e.g., information obtained from questionnaire, registry, medical files); acceptable informants or who can provide the information (e.g., participant or proxy); acceptable time of collection (e.g., smoking status during first trimester of pregnancy can be collected at birth); acceptable question wording (e.g., binge drinking defined as 5 drinks or more); and acceptable procedures or devices used to generate the measure (e.g., weight needs to be measured, not self-reported by the participant).
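The defining elements listed above can be captured in a small, structured record. The following Python sketch is one illustrative way to do so; the class name, field names, and the example variable are assumptions made for illustration and do not correspond to a format prescribed by the Maelstrom guidelines or by any specific software.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataSchemaVariable:
    """One DataSchema variable; all field names are illustrative assumptions."""
    name: str
    label: str
    value_type: str                       # e.g. "integer", "text", "decimal"
    unit: Optional[str] = None            # e.g. "kg" for continuous measures
    categories: dict = field(default_factory=dict)  # code -> description
    target_entity: str = "mother"         # whom the information is about
    target_period: Optional[str] = None   # e.g. "first trimester"
    acceptable_sources: tuple = ("questionnaire",)
    acceptable_informants: tuple = ("participant",)
    notes: str = ""                       # wording or device requirements


def is_valid_value(variable: DataSchemaVariable, value) -> bool:
    """Check a value against the response options, when options are defined."""
    if variable.categories:
        return value in variable.categories
    return True


# Hypothetical definition of a binge drinking variable.
binge = DataSchemaVariable(
    name="binge_drinking_pregnancy",
    label="Binge drinking at least once during pregnancy",
    value_type="integer",
    categories={0: "No", 1: "Yes"},
    target_period="pregnancy",
    acceptable_sources=("questionnaire", "interview"),
    notes="Binge drinking defined as 5 drinks or more",
)
```

Keeping the definition in a machine-readable form of this kind makes it possible to check harmonized values against the agreed response options automatically.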
Following the selection and definition of each DataSchema variable, it is possible to evaluate the potential (or not) for each study to generate it. According to the Maelstrom Research guidelines for retrospective data harmonization, Reference Fortier, Raina and Van den Heuvel4 the harmonization potential is considered complete (fully achievable) if study-specific variables can directly generate the DataSchema variables (identical) or could be transformed to do so (compatible). The harmonization potential is however deemed impossible if study-specific variables cannot generate the DataSchema variables that have been defined (incompatible) or if the information is simply not collected (unavailable). It is also possible to define the harmonization potential as partial when it is possible to generate the variable but with an unavoidable loss of information. Evaluating the harmonization potential will often lead to adjustments in the initial DataSchema variable definition proposed (e.g., response options for binge drinking categories are adjusted to allow harmonization of more studies). Once finalized, the process will provide a clear overview of the harmonization potential across studies (which variables can be generated by which studies) and the study-specific data required to generate the DataSchema variables. Such processing and documentation can be generated using simple tools (e.g., Excel) or specialized resources.
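As a minimal sketch of how such an overview could be recorded programmatically, the Python example below maps the pairing statuses described above (identical, compatible, partial, incompatible, unavailable) to an overall harmonization potential and counts the results per DataSchema variable. The study names, variable names, and assessments are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter

# Mapping from the status of a study/variable pairing to the overall
# harmonization potential, following the categories described in the text.
POTENTIAL_BY_STATUS = {
    "identical": "complete",
    "compatible": "complete",
    "partial": "partial",
    "incompatible": "impossible",
    "unavailable": "impossible",
}


def harmonization_potential(status: str) -> str:
    """Return the overall potential for one study/variable pairing."""
    try:
        return POTENTIAL_BY_STATUS[status]
    except KeyError:
        raise ValueError(f"unknown status: {status!r}") from None


def summarize(assessments: dict) -> dict:
    """Count potentials per DataSchema variable across studies."""
    summary = {}
    for (variable, _study), status in assessments.items():
        counts = summary.setdefault(variable, Counter())
        counts[harmonization_potential(status)] += 1
    return summary


# Hypothetical assessments for three studies and two variables.
assessments = {
    ("binge_drinking", "study_A"): "identical",
    ("binge_drinking", "study_B"): "compatible",
    ("binge_drinking", "study_C"): "unavailable",
    ("maternal_education", "study_A"): "partial",
    ("maternal_education", "study_B"): "compatible",
    ("maternal_education", "study_C"): "incompatible",
}
summary = summarize(assessments)
```

The same table could equally be kept in a spreadsheet; the point of the sketch is only that each pairing carries one status and one derived potential.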
Develop the data processing infrastructure
In parallel to documenting cohort studies and exploring harmonization potential, it is essential to determine the operating model and build the infrastructure required to host, manage, and analyze the data. For small-scale initiatives, the operating model may be simple and the data infrastructure rudimentary. For example, it could be limited to study-specific data sets uploaded on a server accessible by a single user who generates the DataSchema variables to answer a specific research question and never shares or reuses the harmonized data. However, a more sophisticated approach is often required.
Define the data harmonization and analysis operating models
While various ethical, legal, methodological, and operating factors must be considered, data access and location are fundamental to inform the operating models to be implemented to harmonize and analyze data (Fig. 2). If transfer of study-specific IPD to external third parties is acceptable, data may be transferred to a central server and the harmonization process centralized. Reference Fortier, Dragieva, Saliba, Craig and Robson22,Reference Wey, Doiron and Wissa23 But this is not always possible and may be unsuitable. Alternatively, study-specific data may remain on study-specific servers, and harmonized data generated by study-specific teams. Reference de Moira, Haakma and Strandberg-Larsen24 Each approach presents advantages and challenges (Table 3) and directly impacts operational decisions (e.g., number of servers required, distribution of personnel) and the data infrastructure required (e.g., type of access required, level of security, and computing capacities).
In turn, the possible operating models for statistical analysis are informed by the level of access to the harmonized IPD. Study-specific (analysis performed by studies followed by a meta-analysis of study-level estimates), pooled (data hosted on a central server and analyzed as a collective whole), or federated (centralized analysis, but the individual-level participant data remain on local servers) IPD analysis can be achieved (Fig. 2). Again, each approach presents advantages and challenges. Reference Blettner, Sauerbrei, Schlehofer, Scheuchenpflug and Friedenreich25–Reference Carter, Francis and Carter27 Study-specific IPD analyses followed by a meta-analysis of aggregate data (i.e., two-step IPD meta-analysis Reference Riley, Lambert and Abo-Zaid28 ) is often the approach selected. Reference Taylor, Elhakeem and Nader29 The approach may reduce efforts to obtain and analyze data, as only aggregate data are required for combined analysis and as meta-analytical methods for aggregate data are well established. However, standardizing analyses among studies may require substantial effort, and statistical power and flexibility to explore interactive or heterogeneous effects (for example, across studies or subgroups) can be limited. In contrast, a pooled analysis approach (i.e., one-step IPD meta-analysis Reference Riley, Lambert and Abo-Zaid28 ) typically offers statistical power and flexibility, with the potential for greater insights into interactive or heterogeneous effects and interpretation of results (such as of pooled estimates). Reference Voerman, Santos and Inskip14,Reference Benet, Albang and Pinart30 However, it may necessitate high-performance processing environments to allow analysis of large amounts of data, and it often comes with substantial efforts to obtain access to IPD. The trade-offs between these first two approaches and strategies for choosing an approach have been discussed in detail elsewhere. 
Reference Riley, Lambert and Abo-Zaid28,Reference Stewart, Altman, Askie, Duley, Simmonds and Stewart31 Finally, federated data analysis can represent a valid option. Reference Gaye, Marcon and Isaeva26,Reference Doiron, Burton and Marcon32 The approach may support one- and two-stage meta-analyses, but it requires implementation of a distributed and interoperable data infrastructure supporting unified co-analysis of the harmonized data across studies. Additional information on these approaches is provided in Supplementary material S3.
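To make the second step of a two-step IPD meta-analysis concrete, the Python sketch below pools study-level estimates with inverse-variance (fixed-effect) weights and computes Cochran's Q as a simple heterogeneity statistic. This is one common choice among several (random-effects models are often preferred when heterogeneity is expected) and is not presented as the method used by any initiative cited here. The estimates and standard errors are invented for illustration.

```python
import math


def pool_fixed_effect(estimates):
    """Inverse-variance pooling of (estimate, standard_error) pairs.

    Returns the pooled estimate, its standard error, and Cochran's Q.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, (b, _) in zip(weights, estimates))
    return pooled, pooled_se, q


# Hypothetical study-level regression coefficients and standard errors,
# each obtained from a separate analysis run within one study.
study_estimates = [(0.10, 0.05), (0.20, 0.10), (0.05, 0.05)]
pooled, pooled_se, q = pool_fixed_effect(study_estimates)
```

In this setup, only the aggregate pairs leave each study, which is why the approach can reduce the effort needed to obtain access to IPD.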
Implement the data infrastructure
The data infrastructure provides the physical environment required to access, manage, process, document, and analyze data securely, but efficiently. As mentioned above, the infrastructure may be extremely simple, but large-scale initiatives often require implementation of complex computational environments. The nature of the infrastructure to be implemented is informed by factors including the type and volume of data needed (e.g., questionnaire data, genotypes, images), the statistical analyses foreseen, the location of study-specific and harmonized data, the type of access to IPD required by the various users, the hardware and software resources available to the initiative (and costs if needed to be acquired), the technical skills of the participating teams, the security requirements, and the need (or not) for long-term maintenance and potential scaling up of the infrastructure.
Data harmonization generally requires relatively limited computational power compared to statistical analysis. If statistical analysis is achieved on pooled data, the (internal or external) users analyzing data will require sufficient storage, memory, and processing power to deal with harmonized data from all studies. On the other hand, if analysis is performed by individual studies, computational requirements will be governed by the characteristics of each study data set. Obviously, all aspects of data security should be carefully considered. Proper access control to the data should be in place, and availability and integrity of the data should be ensured by backups, regular system maintenance, and proper monitoring. Where required, static data sets and backups should be encrypted, and there should be documented and auditable procedures for granting access to and/or transfer of data.
Ask for access to, or usage of, relevant study-specific data
Obtaining access or permission to use study-specific data is a prerequisite for initiating data processing. This is true even if data remain on local servers and are processed by the study-specific teams. The goals of ethical, bureaucratic, and technical procedures for data access governance are to protect the cohort study participants (ensuring study-consent stipulations are maintained), data producers (in some cases intellectual property rights), and the study itself (to mitigate against reputational risk Reference Murtagh, Turner, Minion, Fay and Burton33 ). Data access committees are also responsible for maintaining adherence to supra-study regulations (e.g., the European Union’s General Data Protection Regulation). Differences in regulatory environments and study-specific policies often mean that access procedures vary from study to study. Procedures may include submission of the project protocol, exchanges with members of scientific or data access committees, and completion of data transfer or privacy agreements. As harmonization initiatives need to access data from more than one study and few integrated (multistudy) access governance systems exist, significant delays are often encountered. Reference Shabani, Thorogood, Murtagh, Laurie, Dove and Ganguli-Mitra34 This is particularly true if independent applications for access need to be submitted to each study for each research question addressed. The data access process should thus be initiated as soon as possible and careful attention to the study-specific requirements and procedures is highly recommended. Providing a list of the exact variables required is often requested by the data access committees; to address the principle of data minimization, data access committees may check the variable list against the proposed research question for coherence. 
Preparation of this list can be informed by the result of the harmonization potential outlined above and needs to include all study-specific variables required to generate, understand, and validate the DataSchema variables and achieve the statistical analysis foreseen (e.g., all required confounders).
Explore study-specific data
Once access is granted, data are generally prepared (preprocessing under a defined format might be required) and explored to ensure quality and deepen proper understanding of each study-specific data set. For example, the completeness, content, and format of the study-specific data can be verified. Issues observed at this stage often lead to adjustment of the harmonization potential estimated. Poor data quality may also lead to the exclusion of a data set.
If data are processed under the DataSchema format by the study teams, this step may be facilitated and limited to extracting the required study-specific data. However, if harmonization is achieved by a central team, the study-specific data must be rendered accessible to the team and generally explored in close communication with study teams. Ensuring consistent quality and validation procedures across study-specific data sets can be challenging and must be adapted to each project. Different standard operating procedures, methodological approaches, and tools have been proposed by the research community Reference Cai and Zhu35–Reference Schmidt, Colvin, Hohlfeld and Leon37 to support quality assessment of study-specific data. An example of minimal procedures that can be used is outlined in Supplementary material S4.
Process study-specific data under the harmonized format
To enable analysis, it is necessary to convert the heterogeneous study-specific data items into the DataSchema variables format. Where appropriate (when harmonization is deemed possible), data processing is accomplished through algorithmic recoding or statistical modeling of study-specific data. The approach selected for each variable will depend on the scientific objectives of the project, the nature and format of the DataSchema variable, the study-specific data items available, the potential to access study-specific IPD, and whether the data processing is achieved centrally or by study-specific teams. Supplementary material S5 provides an overview of possible approaches (algorithms and statistical models) and considerations in their use. Figure 3 illustrates possible algorithmic processing applied to generate a variable on binge drinking status using the variables outlined in Table 2.
Establishing an efficient processing and quality assurance workflow and ensuring accuracy and consistency in decision making is challenging, especially for large-scale initiatives or if harmonization is achieved by different teams. Processing should be guided by the DataSchema variable definitions, and decision making (e.g., treatment of missing values) and quality assurance should be consistent across all data sets.
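The algorithmic recoding described above can be sketched as a set of study-specific functions that all return the same DataSchema coding. In the Python example below, the three source formats (a yes/no item coded 1/2, a count of episodes, and a frequency category) are hypothetical and are not those of the cohorts listed in Table 2. The target coding (1 = yes, 0 = no, None = missing) and the rule that unrecognized source values become missing are also assumptions that a real project would need to agree on and document.

```python
def recode_study_a(value):
    """Hypothetical source item coded 1 = yes, 2 = no."""
    return {1: 1, 2: 0}.get(value)  # any other code is treated as missing


def recode_study_b(episodes):
    """Hypothetical source item: number of binge episodes during pregnancy."""
    if episodes is None:
        return None
    return 1 if episodes >= 1 else 0


def recode_study_c(frequency):
    """Hypothetical source item: frequency category as free text."""
    if frequency is None:
        return None
    mapping = {"never": 0, "less than monthly": 1, "monthly": 1, "weekly": 1}
    return mapping.get(frequency.strip().lower())


# One recoding function per study, all producing the same target variable.
RECODERS = {
    "study_A": recode_study_a,
    "study_B": recode_study_b,
    "study_C": recode_study_c,
}


def harmonize(study, value):
    """Apply the study-specific algorithm to one source value."""
    return RECODERS[study](value)
```

Keeping one documented function per study, with a shared rule for missing values, is one way to make decisions consistent and auditable across data sets. Note that the frequency-based recoding loses information, which corresponds to a partial harmonization potential.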
Estimate quality of the harmonized data
Once the harmonization process is finalized, it is often essential to explore the data sets generated to understand variable quality. This can include generating basic quality control checks (e.g., validating processing algorithms) and descriptive statistics (e.g., participant distributions, proportion of missing values) to evaluate the consistency across cohort studies (Supplementary material S6). When relevant, assessments of heterogeneity can be performed (e.g., testing for a statistical effect of study-specific question formats on the harmonized data generated). However, in practice it can be difficult to distinguish heterogeneity due to harmonization assumptions as opposed to population differences. A more comprehensive examination of relevant heterogeneity (and how to account for it) thus often needs to be performed at the stage of analysis.
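A basic version of the descriptive checks mentioned above can be sketched in a few lines of Python. The example computes, per study, the number of participants, the proportion of missing values, and the distribution of observed values for one harmonized variable. The record layout and the data are hypothetical; real projects would typically rely on their statistical software of choice.

```python
from collections import Counter, defaultdict


def describe(records, variable):
    """Per-study sample size, missing proportion, and value distribution."""
    by_study = defaultdict(list)
    for record in records:
        by_study[record["study"]].append(record.get(variable))
    report = {}
    for study, values in by_study.items():
        observed = [v for v in values if v is not None]
        report[study] = {
            "n": len(values),
            "missing_prop": 1 - len(observed) / len(values),
            "distribution": dict(Counter(observed)),
        }
    return report


# Hypothetical harmonized records (None marks a missing value).
records = [
    {"study": "study_A", "binge_drinking": 0},
    {"study": "study_A", "binge_drinking": 1},
    {"study": "study_A", "binge_drinking": 0},
    {"study": "study_A", "binge_drinking": None},
    {"study": "study_B", "binge_drinking": 1},
    {"study": "study_B", "binge_drinking": None},
]
report = describe(records, "binge_drinking")
```

Large differences between studies in such summaries do not by themselves indicate a harmonization problem, since they may reflect true population differences, but they flag variables that deserve a closer look.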
Preserve the harmonized data sets and related documentation
Once validated and deemed of acceptable quality, the harmonized data set and its related documentation can be made available, ideally in adherence to the FAIR principles. Reference Wilkinson, Dumontier and Aalbersberg7 Complying with the FAIR principles involves making data and metadata accessible to the scientific community, enabling long-term access, ensuring their interoperability, and providing sufficient information to enable optimal use and reuse of the harmonized data. Documentation provided could include the harmonization protocol, selected information about cohort study designs and standard operating procedures, the DataSchema variable definitions, the harmonization potential across studies, the processing scripts or statistical models applied to generate harmonized data, and summary statistics on participant distributions or missing values. For large initiatives, creating a centralized metadata portal can provide user-friendly access to such information. However, maintenance of such a portal, as well as long-term preservation of the harmonized data sets in one or multiple secured data warehouses, can be challenging (e.g., to retain competent staff, and to maintain and, when required, scale up the infrastructure).
Analyze data to answer specific research questions
Using harmonized data often involves working with an infrastructure where data are not available across all studies (values are missing where harmonization was considered impossible), co-analyzing data available at different time points across studies, managing the heterogeneity of effects across studies, and using data that are not as precise as the study-specific data collected. In effect, harmonizing heterogeneous data often results in data reduction (e.g., transforming continuous variables into dichotomous ones) and subsequently in a potential loss of precision and reduction of power, leading to underestimation of effects. Reference Cohen38
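The cost of dichotomization can be illustrated with a small simulation. The sketch below is not taken from any of the initiatives discussed; it assumes a normally distributed exposure, a linear effect of 0.3 on the outcome, and a median split, all of which are arbitrary illustrative choices. Under bivariate normality, a median split is expected to shrink the correlation to roughly 80% of its original value (a factor of the square root of 2/π).

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 20000

# Simulated continuous exposure and an outcome that depends on it linearly.
exposure = [rng.gauss(0, 1) for _ in range(n)]
outcome = [0.3 * e + rng.gauss(0, 1) for e in exposure]

# "Harmonize" the exposure into a yes/no variable using a median split.
cutoff = sorted(exposure)[n // 2]
exposure_binary = [1 if e > cutoff else 0 for e in exposure]

r_continuous = pearson(exposure, outcome)
r_binary = pearson(exposure_binary, outcome)
attenuation = r_binary / r_continuous

print(f"continuous r = {r_continuous:.3f}")
print(f"dichotomized r = {r_binary:.3f}")
print(f"ratio = {attenuation:.2f}")
```

Cut-points further from the median, or exposures that are not normally distributed, would give a different degree of attenuation.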
Substantial time can be necessary to explore the harmonized data and conduct preliminary analyses. Reference Avraam, Wilson and Burton39,Reference Raab, Nowok and Dibben40 It might be necessary to explore how the harmonization potential of each DataSchema variable affects the sample size and/or the diversity of variables that can be included in the analysis. Harmonized data sets might have complex or limiting patterns of missing data that need to be examined; for example, it may be difficult to obtain complete harmonized data across the same studies for the same DataSchema variables, leading to a trade-off between including more studies or more covariates in an analysis. Further exploration of the heterogeneity existing across studies and the potential effect of the harmonization process on the variables generated may also be warranted. Reference Friedenreich41,Reference Curran and Hussong42 Various approaches are then available to analyze the data, and the choice of analytical model is determined by the research questions addressed, the data infrastructure, and the variable content.
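The trade-off between studies and covariates described above can be made concrete with a simple lookup over the harmonization potential. In the sketch below, the studies, the variable names, and the function are hypothetical; the point is only that each covariate added to a model can reduce the set of studies able to contribute.

```python
# Hypothetical harmonization potential: for each study, the DataSchema
# variables that could be generated from its study-specific data.
potential = {
    "study_a": {"birth_weight", "maternal_age", "maternal_smoking", "maternal_education"},
    "study_b": {"birth_weight", "maternal_age", "maternal_smoking"},
    "study_c": {"birth_weight", "maternal_age", "maternal_education"},
    "study_d": {"birth_weight", "maternal_age"},
}


def eligible_studies(potential, required):
    """Return the studies for which every required variable was harmonized."""
    required = set(required)
    return sorted(
        study for study, available in potential.items() if required <= available
    )


base_model = ["birth_weight", "maternal_age"]
candidate_covariate_sets = [
    [],
    ["maternal_smoking"],
    ["maternal_smoking", "maternal_education"],
]

for extra in candidate_covariate_sets:
    studies = eligible_studies(potential, base_model + extra)
    print(f"covariates added: {extra or 'none'} -> studies retained: {studies}")
```

In practice the same comparison would usually be made on participant counts rather than study counts, since studies differ greatly in size.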
With the growing emphasis on a FAIR Reference Wilkinson, Dumontier and Aalbersberg7 approach to science, retrospective data harmonization is increasingly used to support research. However, to be FAIR, in addition to being accessible, data and associated documentation need to be of high quality. While the advantages of retrospective harmonization are significant, the limitations of harmonized data must also be recognized. Harmonizing existing data may not always generate data that are as useful and of as high quality as hoped or expected. First, the quality of the study-specific data is not always as good as anticipated. Second, defining DataSchema variables is a balancing act between generating very homogeneous harmonized data at the cost of limiting the number of contributing cohort studies, and allowing more heterogeneity to retain more studies. Thus, there is often an unavoidable loss of precision in the harmonized data generated. While a broad range of categories may be used by a given study to define, for example, the highest level of education, generating the variable across all studies could involve limiting the categories to “having completed secondary school or higher education (yes/no)”. Third, as it is rarely possible to generate all DataSchema variables across all study-specific data sets, the harmonized data set will often only support sub-analyses across selected variables and/or studies. Specialized statistical models working around missing values could help to overcome the problem but are not always applicable or suitable. Fourth, major complexities are introduced by the differing numbers and timing of data collection events across studies (before, during, and after pregnancy, as well as through the life of the participants). As the DataSchema variables defined can be time-dependent (i.e., need to be measured at a specific time point), this limits the harmonization potential. Fifth, factors related to the study-specific designs (e.g., population characteristics, sampling frames) can introduce biases.
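The education example given above can be sketched as a recoding step. The study names, category codes, and function below are hypothetical and do not reproduce any initiative's actual processing scripts; they only show how detailed study-specific categories might collapse into a single yes/no DataSchema variable.

```python
# Hypothetical study-specific codings of highest level of education,
# each mapped to the harmonized variable:
# 1 = completed secondary school or higher education, 0 = did not.
RECODING_RULES = {
    # study_a uses five ordered numeric categories.
    "study_a": {1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    # study_b uses four text labels.
    "study_b": {"none": 0, "primary": 0, "secondary": 1, "university": 1},
}


def harmonize_education(study, raw_value):
    """Map a study-specific education code to the harmonized yes/no variable.

    Returns None when the raw value is missing or has no recoding rule,
    so that unmapped values stay visible as missing rather than being guessed.
    """
    return RECODING_RULES[study].get(raw_value)


print(harmonize_education("study_a", 4))          # detailed category 4 -> 1
print(harmonize_education("study_b", "primary"))  # "primary" -> 0
print(harmonize_education("study_b", None))       # missing stays missing
```

The distinctions between categories 3, 4, and 5 in the first study are lost in the harmonized variable, which is the loss of precision referred to above.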
Given these challenges, what motivates harmonization efforts? Might using published results to perform meta-analyses be a more sensible approach to synthesizing information? Doing so is certainly easier and faster than selecting, exploring, harmonizing, integrating, documenting, and co-analyzing individual participant data across multiple cohort studies. However, scientifically founded harmonization initiatives present important advantages. First, harmonization allows study-specific information to be processed to create more similar data. Second, for a given construct (e.g., familial income or level of physical activity), different variables can be generated and modified, providing flexibility during analysis. Third, following harmonization, different approaches are available to support statistical analysis: independent analysis-by-study followed by a meta-analysis of study-level estimates can be performed, or harmonized data can be analyzed as a collective whole. Fourth, access to harmonized individual participant data provides flexibility in the selection of the specific variables, participants, and covariates to be included in the statistical analysis. Fifth, it increases the ability to examine heterogeneity and handle missing values. Sixth, it helps to limit the significant and intractable publication bias that generally affects observational results in the published literature. Finally, it facilitates exploring statistical interactions between risk factors and conducting subgroup analyses.
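The first analysis route mentioned above (analysis-by-study followed by a meta-analysis of study-level estimates) can be sketched with a fixed-effect, inverse-variance pooling step. This is only one of several possible pooling methods, chosen here for brevity; a random-effects model may be more appropriate when heterogeneity is expected. The estimates and standard errors are invented for illustration.

```python
import math


def fixed_effect_pool(estimates):
    """Pool study-level estimates with inverse-variance weights.

    estimates: list of (beta, standard_error) pairs, one per study.
    Returns the pooled estimate, its standard error, and Cochran's Q,
    a simple statistic describing between-study heterogeneity.
    """
    weights = [1 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    cochran_q = sum(w * (beta - pooled) ** 2 for w, (beta, _) in zip(weights, estimates))
    return pooled, pooled_se, cochran_q


# Hypothetical regression coefficients estimated separately in two studies.
study_estimates = [(0.10, 0.05), (0.20, 0.05)]

pooled, pooled_se, cochran_q = fixed_effect_pool(study_estimates)
print(f"pooled estimate = {pooled:.3f}, SE = {pooled_se:.4f}, Q = {cochran_q:.2f}")
```

With individual participant data in hand, the study-level models can all be specified identically (same covariates, same coding), which is not possible when pooling published results.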
Addressing DOHaD research questions fundamentally involves exploring the interaction of multiple individual and environmental factors to often explain relatively subtle effects. Large initiatives such as ENRIECO and more recently LifeCycle are good examples of successful harmonization efforts leading to innovative research outputs. If well organized and scientifically founded, small and large retrospective harmonization initiatives can generate valuable harmonized data sets to support research, increase the scientific impact of individual cohort studies, and minimize duplication of research efforts.
For supplementary material for this article, please visit https://doi.org/10.1017/S2040174422000460
We would like to thank the Maelstrom Research and EUCANconnect teams for their contribution to the development of concepts and approaches described in the current article. Additional thanks to the Prenatal Alcohol Exposure research team (including but not limited to: A Bocking, R Wissa, R Schmidt, K McDonald, Nika Zahedi).
This work received funding from the European Commission (EUCANconnect, a federated FAIR platform enabling large-scale analysis of high-value cohort data connecting Europe and Canada in personalized health, Grant Agreement No 824989) with its Canadian project partners being funded by the Canadian Institutes of Health Research (CIHR) and the Fonds de la Recherche du Québec (FRQ). The RECAP Preterm project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 733280. R Wilson is a UKRI Innovation Fellow with HDR UK [MR/S003959/1].
Conflicts of interest