Measuring quality and outcomes of research collaborations: An integrative review

Introduction: Although the science of team science is no longer a new field, the measurement of team science and its standardization remain in relatively early stages of development. To describe the current state of team science assessment, we conducted an integrative review of measures of research collaboration quality and outcomes. Methods: Collaboration measures were identified using both a literature review based on specific keywords and an environmental scan. Raters abstracted details about the measures using a standard tool. Measures related to collaborations with clinical care, education, and program delivery were excluded from this review. Results: We identified 44 measures of research collaboration quality, which included 35 measures with reliability and some form of statistical validity reported. Most scales focused on group dynamics. We identified 89 measures of research collaboration outcomes; 16 had reliability and 15 had a validity statistic. Outcome measures often only included simple counts of products; publications rarely defined how counts were delimited, obtained, or assessed for reliability. Most measures were tested in only one venue. Conclusions: Although models of collaboration have been developed, in general, strong, reliable, and valid measurements of such collaborations have not been conducted or accepted into practice. This limitation makes it difficult to compare the characteristics and impacts of research teams across studies or to identify the most important areas for intervention. To advance the science of team science, we provide recommendations regarding the development and psychometric testing of measures of collaboration quality and outcomes that can be replicated and broadly applied across studies.


Introduction
Translating basic science discoveries into demonstrated improvements in public health requires research teams whose members come from diverse backgrounds [1][2][3]. The US National Institutes of Health National Center for Advancing Translational Sciences recognized this need by establishing a strategic goal to advance translational team science by fostering innovative partnerships and diverse collaborations [4]. In the health sciences, there is significant interest in translational research and moving more quickly from single-study efficacy trials to effective, generalizable interventions in health care practice. Foundational to this body of literature is the assumption that cross-disciplinary research teams speed the process of translational research [5].
Analyses of trends in scientific publications suggest that major advances in biological, physical, and social science are produced by research teams; that the work of these teams is cited more often than the work of individual researchers; and that, in the long term, the work has greater scientific impact [6][7][8][9]. In addition, cross-disciplinary diversity is assumed to lead to greater innovation [10]. These observations have become the cornerstone of the translational science movement in the health sciences.
Implementing team science can be challenging. Multiple authors have noted that working in collaboration can be more expensive and labor intensive than working alone [11,12]. Noted trade-offs include added time and effort to communicate with diverse collaborators, conflicts arising from different goals and assumptions, and increased start-up time with its resulting delay in productivity [13][14][15][16][17]. These opportunity costs may be acceptable if the outcomes of research collaborations can accelerate knowledge or answer the complex health questions faced by today's society.
To test the assumption that research collaboration leads to greater productivity, we need to accurately measure the characteristics of research teams and their outcomes and be able to compare results across teams [6,12,15,18-27]. Although different measures have so far shown that collaborations are beneficial, operational definitions of variables that may influence conclusions (construct validity) are varied, complicating interpretation of results. Despite some exceptions [12,19,23,28], there is a lack of attention to the development and psychometric testing of reliable and valid measures of collaboration. As an initial step, it would be useful to have an overview of the current state of the science in the measurement of research collaborations. In this article, we report the results of an integrative review of the literature, looking for reliable and valid measures that describe the quality and outcomes of research collaborations.

Materials and Methods
We conducted two reviews. The first focused on measures of collaboration quality, defined as measures of interactions or processes of the team during the collaboration. The second review focused on outcomes of the collaboration (e.g., publications, citations). We used an integrative review approach. An integrative review is a specific type of review that applies a comprehensive methodology involving a combination of different approaches to summarize past research related to a particular topic, including both experimental and non-experimental studies, and reach conclusions [29,30].
Our research team brainstormed keyword combinations and, based on expert opinion, agreed on final sets of keywords that were comprehensive enough to cover the topics fully but not so broad as to include non-relevant literature. For the review of collaboration quality, these keywords were "measure/measurement" combined with the following terms: community engagement, community engaged research, collaboration, community academic partnership, team science, regulatory collaboration, industry collaboration, public-private partnership (focus on research). For the review associated with collaboration outcomes, the word "outcomes" was added to the above search terms. Our intention was to include all types of research collaborations, including partnerships between academic and other community, governmental, and industry partners. The following keywords were considered, tested in preliminary searches, and eliminated by group consensus as being too broad for our purpose: consortium collaboration, public health and medicine collaboration, patient advocacy group collaboration, and coalition. Measures of collaboration related to clinical care, education, and program delivery collaborations were excluded from this review.
Quality and outcome measures were identified using both a literature review and an environmental scan. We conducted searches using the standard databases PubMed, the Cumulative Index to Nursing and Allied Health Literature, and PsycINFO; we also searched EMBASE, Google Scholar, Scopus, and websites recommended by members of the research team. After duplicates and articles that were not focused on a specific scale or measure of research collaboration were eliminated, team members reviewed a final list of 25 publications for the measures of collaboration quality, including 4 articles describing social network analyses, and 42 publications for measures of collaboration outcome. All publications were published prior to 2017. Figs. 1 and 2 provide flow diagrams of how articles were selected to be included in both reviews.
At least two members of the research team reviewed each article using a standard data abstraction form that included the name of the measure/outcome; construct being measured; sample; and details about the measure, including operational definition, number of items, response options, reliability, validity, and other evidence for supporting its use. Reviewers were also asked to make a judgment as to whether the article included a measure of the collaboration quality (or outcomes or products) of the scientific/research collaborations; both reviews had a rater agreement of 99%. Differences in reviews were resolved through consensus after discussions with a third reviewer.
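Rater agreement of the kind reported above can be computed directly from paired judgments. The sketch below is illustrative only: the ratings are hypothetical, and the chance-corrected statistic (Cohen's kappa) is an addition for comparison, not a statistic reported in this review.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters gave the same judgment."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical include/exclude judgments for five articles (not data from this review).
rater_a = [True, True, False, True, False]
rater_b = [True, True, False, False, False]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Percent agreement is simple to report, but it can look high when one judgment dominates; a chance-corrected statistic helps readers interpret it.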

Quality Measures
We identified 44 measures of research collaboration quality from the 15 publications included in the final summary analyses (see Fig. 1). The specifics of each measure are detailed in Table 1. Three articles were not included in Table 1 because they all used social network analysis [31][32][33]. Four articles covered 80% of the measures identified [12,19,23,34]. The number of items per measure ranged from 1 to 48, with 77% having fewer than 10 items per measure. A few articles reported on measures that covered several domains. As shown in Table 1, we have included each domain measure separately if it was reported as an independent scale with its own individual psychometric properties.
Reliability was reported for 35 measures, not reported for four measures, and not applicable for five measures (single-item, self-reported frequency counts, or qualitative responses). Reliability measures were most frequently Cronbach's alphas for internal consistency reliability, but also included intraclass correlation coefficients, inter-rater correlations, and, when Rasch analysis was used, person separation reliability. Test-retest reliability was never reported. Cronbach's alpha statistics were >0.70 for 86% of the measures using that metric. Some form of validity was reported on 40 measures and typically included exploratory (n = 8) and/or confirmatory factor analysis (n = 26). Convergent or discriminant validity was evident for 38 measures but was based on study results, as interpreted by our reviewers, rather than identified by the authors as a labeled multitrait-multimethod matrix analysis of construct validity. Twelve measures had convergent or discriminant validity only, without any further exploration of validity. Face validity and content validity were reported for five measures, along with other analyses of validity.
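For readers less familiar with the most frequently reported statistic, Cronbach's alpha for a k-item scale is k/(k - 1) multiplied by (1 minus the ratio of the summed item variances to the variance of the total score). The following is a minimal sketch of that standard formula; the response matrix is hypothetical and is not drawn from any measure in Table 1.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items matrix of scale responses."""
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(scores[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's summed scale score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five respondents, four items on a 5-point scale.
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Alpha describes internal consistency in one sample only; it says nothing about stability over time, which is why the absence of test-retest reliability noted above matters.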

Outcome Measures
We identified 89 outcome measures from the 24 publications included in the final summary analyses (see Fig. 2). Characteristics of each measure are detailed in Table 2. Three publications included over 44 (49%) of the measures identified [17,23,35]. However, only two of those [17,23] included measures tested in actual studies; the remaining article [35] included only recommendations for specific measures.
Measures were broadly classified into one of six categories, reflected in Table 2: (1) counts or numerical representations of products (e.g., number of publications; 38 measures); (2) quality indicators of counted products (e.g., journal impact factor; 7 measures); (3) self-reported perceptions of outcomes (e.g., perceived productivity; 32 measures); (4) peer-reviewed perceptions of outcomes (e.g., progress on the development of interventions; 5 measures); (5) qualitative descriptions of outcomes (e.g., descriptive data collected by interview; 6 measures); and (6) health indicators/outcomes (e.g., life expectancy; 1 overall measure with 60 different indicators). The number of items per measure ranged from a single count to a 99-item scale, with over 50% of the measures composed of a single count, number, or rating of a single item.
Twenty-three of the 89 measures were recommendations only and, as would be expected, had no reported reliability or validity [35]. For the remaining 66 measures, only 16 reported assessments of reliability. Nine of 24 measures in the self-reported perceptions category included Cronbach's alpha >0.70, showing internal consistency reliability. Six measures (3 of 24 in the counts of products category and 3 of 4 in the peer-reviewed category) had inter-rater agreement described; all were over 80%. One measure in the peer-reviewed category reported inter-rater reliability of r = 0.24-0.69. Of these 16 measures with reported reliability, nine had some form of validity described: confirmatory factor analysis (6 measures) and convergent validity (3 measures). Of the remaining 50 measures without reliability data, five had some type of convergent validity described and one was supported by principal component analysis. Once again, convergent validity was not formally labeled as such but was evident in terms of correlations between the measure under study and other relevant variables.
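Inter-rater reliability expressed as a correlation, as in the r = 0.24-0.69 range above, answers a different question than percent agreement. The sketch below uses hypothetical ratings on a 10-point scale to illustrate the distinction; it assumes a plain Pearson correlation and is not a reconstruction of any reviewed study's analysis.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two raters' scores for the same items."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a rater's scores do not vary")
    return covariation / (spread_x * spread_y)


# Hypothetical scores for five proposals; rater 2 is consistently two points higher.
rater_1 = [2, 5, 7, 8, 10]
rater_2 = [4, 7, 9, 10, 12]

r = pearson_r(rater_1, rater_2)
exact_matches = sum(a == b for a, b in zip(rater_1, rater_2))
print(f"r = {r:.2f}, exact matches = {exact_matches} of {len(rater_1)}")
```

In this example the two raters rank the proposals identically, so the correlation is perfect even though they never assign the same score. Reports should therefore state which statistic was used.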

Quality Measures
Overall, there are a relatively large number of scales, some of them robust, that have been used to measure the quality or process of research collaborations (e.g., trust, frequency of collaboration). However, many scales have not been extensively used and have been subjected to relatively little repeated psychometric study and analysis. Most have been developed in support of a particular research project rather than with the intent of becoming a standard indicator or scale for the field. Although calculated across multiple organizations, estimates of reliability and/or validity were often study specific as well. Reports of effect sizes (sensitivity or responsiveness) were rare and limited to correlations, and construct validity has not been explored beyond exploratory or confirmatory factor analyses. Given this dearth of replicated psychometric data, it is not surprising that widely accepted, standard scales have not emerged to date. Wide-scale testing of measures of collaboration is essential to establish reliability, validity, and sensitivity or responsiveness across settings and samples.
Scales developed to date have been primarily focused on group dynamics (including the quality of interpersonal interactions, trust, and communication). Although these are important factors, few measurements have been made of how well a team functions (such as leadership styles) and the degree to which the team's work is viewed as synergistic, integrative, or otherwise more valuable than would occur in a more siloed setting. Oetzel et al.'s [23] initial psychometric work provides an example of some of these types of measures. This is in contrast to the numerous available (or under development) scales to measure attitudes toward collaborations and quality of collaborations that exist at specific institutions.
Despite these limitations, two sets of measures deserve note. First, those reported by Hall et al. [12] and Mâsse et al. [19] as measures of collaborations in National Cancer Institute-funded Transdisciplinary Tobacco Use Research Centers have been used more extensively than many of the other scales in this review as indicators of collaboration quality among academic partners (although relatively little additional psychometric data have been reported beyond initial publications). Second, the measures reported by Oetzel et al. [23] are unique in that they are scales to assess research quality involving collaborations between academics and communities, agencies, and/or community-based organizations. They are also unique in representing responses from over 200 research partnerships across the USA. This review did not distinguish between partnerships (e.g., involving just two partnering organizations) and coalitions (involving multiple organizations).

Table entry (Hall et al. [12])
Construct: interpersonal collaborative process at their center.
Number of items: 8.
Response options: 5-point Likert scale ranging from either "very poor" to "excellent" or "strongly agree" to "strongly disagree," with central "neither agree nor disagree."
Reliability: Cronbach's alpha = 0.92.
Validity: convergent validity; the better the interpersonal collaboration, the more collaboration satisfaction, the more confidence in completion of deliverables, and the more perceived institutional resources for collaboration.

Table entry: Written Products Protocol
Construct: the integrative (transdisciplinary) aspects of written research protocols, disciplines represented, levels of analysis, and type of cross-disciplinary integration.
Sample: 21 center developmental project proposals from four NCI TREC centers.
Number of items: 37-item protocol used to evaluate proposals.
Response options: items describing the proposal with various response formats; one item rates whether the proposal is "unidisciplinary," "multidisciplinary," "interdisciplinary," or "transdisciplinary"; two items regarding transdisciplinary integration and scope of the proposal use a 10-point Likert scale ranging from "none" to "substantial."
Reliability: inter-rater reliabilities based on Pearson's correlations from 0.24 to 0.69; highest reliability for rating experimental types (0.69), number of analytic levels (0.59), disciplines (0.59), and scope (0.52). Lower reliability in attempts to name the cross-disciplinary integration in the proposal.
Validity: convergent validity; the higher the number of disciplines in a proposal, the broader its integrative score and the larger its number of analytic levels. The higher the type of disciplinarity, the broader its overall scope.

Table entry (measure name not recoverable from the source)
Operational definition: rate the collaboration within your center: productivity of collaborative meetings, overall productivity of center (rated on a 5-point scale from "very poor" to "excellent"); in general, collaboration has improved your research productivity (5-point scale from "strongly disagree" to "strongly agree"). Unclear whether three items were summed, summed and averaged, or if some other calculation was used to determine the final scale value.
Number of items: unclear; appears to be three items.
Response options: 5-point Likert scale ranging from "very poor" to "excellent." Respondents were also asked to respond to a statement about collaboration and research productivity, rating on a 5-point scale from "strongly disagree" to "strongly agree" with a central "neither" response option.

Outcome Measures
Similar to measures of collaboration quality, little agreement exists as to how to best measure outcomes of research collaborations. By far, the most common type of measurement is a simple count of products over a set period of time (e.g., publications, grants, and/or patents). Interestingly, the procedures used for counting or calculating these products are rarely reported and therefore are not replicable. In addition, published reports infrequently include any type of verification of counts, leaving the reliability of such counts or calculations in question.
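A count becomes replicable once its rules are stated explicitly: which product types are counted, over what time window, and how duplicates are handled. The sketch below shows one way such rules could be written down; the record fields, identifiers, and dates are hypothetical and do not describe the procedure of any reviewed study.

```python
from datetime import date


def count_products(records, start, end, kinds):
    """Count unique products of the given kinds dated within [start, end]."""
    unique_ids = set()
    for rec in records:
        if rec["kind"] in kinds and start <= rec["date"] <= end:
            # Normalize identifiers so the same product is not counted twice.
            unique_ids.add(rec["id"].strip().lower())
    return len(unique_ids)


# Hypothetical product records for one research team.
records = [
    {"id": "doi:10.0000/example-1", "kind": "publication", "date": date(2015, 3, 1)},
    {"id": "DOI:10.0000/EXAMPLE-1", "kind": "publication", "date": date(2015, 3, 1)},  # duplicate entry
    {"id": "doi:10.0000/example-2", "kind": "publication", "date": date(2017, 6, 1)},  # outside the window
    {"id": "grant-0001", "kind": "grant", "date": date(2016, 1, 15)},
]

window_start, window_end = date(2014, 1, 1), date(2016, 12, 31)
n_publications = count_products(records, window_start, window_end, {"publication"})
n_products = count_products(records, window_start, window_end, {"publication", "grant"})
print(n_publications, n_products)
```

Reporting the window, the included product types, and the de-duplication rule alongside the count would let a second team verify it independently.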
The second most common type of measure is the use of self-reported scales to quantify the researchers' perceptions of collaboration outcomes. These include measures of perceived productivity or progress, changes in relationships with partners, increased capacity, and sustainability. Few of these measures, with the exception of the psychometric works of Hall et al. [12] and Oetzel et al. [23], have documented reliability and validity. In general, despite a relatively large number of scales, most of these were not developed for the purpose of becoming standard indicators or measures and most have had little psychometric study or replication.
Efforts to measure the quality of counted products, such as consideration of citation percentiles, journal impact factors, or field performance indicators, offer important alternatives in the quantity versus quality debate and may be useful for evaluating the long-term scientific impact of collaborative outcomes. Likewise, peer-reviewed ratings of outcomes based on reviews of proposals or progress reports could provide more neutral and standardized measures of collaboration impact. Both of these categories of measures are used infrequently but could have significant influence if applied more widely in the evaluation of collaborative work. However, further work on a reliable rating scale for use in peer review is needed before it is able to provide comparable results across studies.

Recommendations
Remarkably, the results of this review, which defines research collaborations to include different types of collaborative partnerships, are very similar to reviews of measures of community coalitions [60] and community-based participatory research [61] conducted 15 and 7 years ago, respectively. Both of those studies concluded that there are few reliable and valid measures. In the intervening years, some progress has been made as noted [see Refs. 12,19,23 as examples]. Based on this observation and our findings in this study, we offer six recommendations to advance the field of team science: (1) We must pay careful attention and devote resources to the development and psychometric testing of measures of research collaboration quality and outcomes that can be replicated and broadly applied. Measures listed in this review with solid initial reliability and validity indicators provide reasonable starting points for continued development; however, measures of other constructs will also be necessary. (2) To establish validity for use in different populations and settings, designed measures should be tested across various research partner and stakeholder relationships (e.g., academia, industry, government, patient, community, and advocacy groups). (3) When evaluating outcomes, it is critical that we focus on both the quality and quantity of products and the use of rating scales for peer review. (4) The sensitivity and responsiveness of measures to interventions should be evaluated as an additional psychometric property. (5) Publications reporting on assessments of collaborations should include a clear description of the measures used; the reliability, validity, and sensitivity or responsiveness of the measures; and a statement on their generalizability. (6) Reports incorporating the use of narrowly applicable measures should include a justification for not using a more broadly applicable measure.

Conclusions
Although a few studies have conducted exemplary psychometric analyses of some measures of both collaboration quality and outcomes, most existing measures are not well-defined; do not have well-documented reliability, validity, or sensitivity or responsiveness (quality measures); and have not been replicated. Construct validity, in particular, requires further exploration. Most of the reported measures were developed for a single project and were not tested across projects or types of teams. Published articles do not use consistent measures and often do not provide operational definitions of the measures that were used. As a result of all of these factors, it is difficult to compare the characteristics and impact of research collaborations across studies.
Team science and the study of research collaborations are becoming better and more rigorous fields of inquiry; however, to truly understand the reasons that some teams succeed and others fail, and to develop effective interventions to facilitate team effectiveness, accurate and precise measurements of the characteristics and the outcomes of the collaborations are needed to further translational science and the concomitant improvements in public health.