The last two decades represent an unprecedented period in the history of data analysis. As the cost of technology has steadily decreased, access to sophisticated data tools has increased, expanding the audience for data-informed research and decision making. At the same time, new areas of research and research methodologies are now possible with the rapid growth of online data produced as a byproduct of digital commerce, file sharing and social media. Together, this confluence of inexpensive computing, plentiful data and accessible tools has created a new interdisciplinary area of research that harnesses the traditional disciplinary expertise of statisticians and computer scientists to explore a wide range of data-related questions. As more researchers and companies embrace data-driven approaches, the phrase ‘data science’ has become an increasingly popular term to describe this growing area of research.
Defining data science
Strict disciplinary boundaries for ‘data science’ remain elusive. While many practitioners have attempted to craft conceptual definitions of this research space, others have challenged the notion that data science is a new field (Conway, 2013; Boykis, 2019). Donoho (2017) makes a compelling argument that statisticians have engaged in data science since the 1960s, and others, such as Press (2013), note that many disciplines have explored data science concepts for decades.
Despite these disagreements over the disciplinary origins of data science, its methodologies and scope and the professional role of data scientists, the persistence of data science as a topic over the last few decades indicates that it is not a passing trend. Indeed, an increasing focus on data science by learned societies (National Academies of Sciences, Engineering, and Medicine, 2018), the growing number of data science courses and environments (http://msdse.org/) and the proliferation of data scientist job postings indicate that the data science movement is a powerful force in academia.
In its current form, the term ‘data science’ is easier to define by its application than by theory. It is widely understood to include a diverse range of computational, data-driven approaches to research and business analytics. In academia, this interdisciplinary space comprises a broad range of methodologies, including machine learning, social media analysis, spatial analytics, text analysis and web analytics to name a few.
As interest in data science and computing increases, educators are presented with an opportunity to introduce students to quantitative studies through the lens of data science. Data science courses and programs are proliferating rapidly, yet open questions remain: how can we effectively and efficiently teach data science to students with little to no background in computing and statistical thinking? How can we equip students with the skills and tools for reasoning with data? Finally, and most importantly, how can we ensure students leave a data science course wanting to learn more? This chapter describes an introductory data science course that provides an answer to these questions.
Many university curricula require that students take at least one quantitative course, and most students fulfill this requirement with an introductory statistics course. While many of these courses incorporate real data sets, the datasets tend to be small and clean, unlike real datasets caught in the wild. These courses also focus primarily on statistical inference for small to medium-sized data, and few provide guidance for what to do when those conditions don't hold (which is true for most real data). Moreover, the heavy focus on inference means that little time is spent on other important data analysis steps like importing data, cleaning data and performing thorough exploratory data analysis.
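The import, clean and explore steps that introductory courses often skip can be sketched in a few lines. The following Python example (the dataset, column names and cleaning rules are hypothetical, chosen only to illustrate the cycle) simulates reading a messy CSV, normalizing inconsistent labels, handling a missing value and computing a simple group summary:

```python
import io
import pandas as pd

# A tiny "caught in the wild" dataset: an inconsistent label
# ("Stats" vs "stats") and a missing score.
raw = io.StringIO(
    "student,score,section\n"
    "A,85,stats\n"
    "B,,stats\n"
    "C,92,Stats\n"
)

# Import
df = pd.read_csv(raw)

# Clean: normalize the section labels, drop rows with no score
df["section"] = df["section"].str.lower()
df = df.dropna(subset=["score"])

# Explore: a summary statistic by group
summary = df.groupby("section")["score"].mean()
print(summary)
```

Even this toy example surfaces decisions (how to treat missing values, how to reconcile labels) that pure inference-focused coursework rarely addresses.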
The increased availability of data and the recent emergence of the field of data science are two of the main influences behind the most recent modifications to the Guidelines for Assessment and Instruction in Statistics Education (GAISE) (Carver et al., 2016). These modifications include increased emphasis on teaching statistical thinking in introductory statistics classes. Specifically, the guidelines emphasize that introductory statistics should be taught as an investigative cycle of asking questions and obtaining answers, particularly those involving the relationships between multiple variables. The recommendations also stress using technology to explore concepts and analyze data. Data analysis comprises so much more than just inference and modeling. As Grolemund and Wickham (2018) suggest in their book R for Data Science, data analysis comprises a full lifecycle from importing data sources to communicating results (https://r4ds.had.co.nz/introduction.html).
Each of the chapters in this book explores the opportunities and challenges of providing data science instruction and services in academic institutions. While the context and setting vary, the chapters share a common perspective of applying library expertise with data tools, methods and statistics to the growing field of data science. As the contributors note, the conceptual breadth of data science offers ample opportunities for establishing new partnerships, training students and staff and expanding the range of library services. This chapter builds on these case studies and offers a set of key elements for developing a successful library data science service.
All libraries face three main challenges in developing a new service. First, no matter the size or context, libraries must define the scope of their service to reflect local needs and the library's capacity to engage with those needs. Second, libraries should endeavor to identify potential partners in developing and implementing their service. Finally, libraries must plan for the sustainability of their service beyond the initial launch. Libraries that address these three concerns can create the proper conditions for a successful service that addresses the inherent dynamism of data science while attending to local opportunities and needs.
Scope
Defining the parameters and core focus of a library data science service should be the first step in planning a new program. Given the wide range of tools and methodologies associated with data science, the importance of defining the scope of a service, communicating its mission, and assessing its impact cannot be overstated. A library could create a data science service that provides expertise on any number of topics, including machine learning, text analysis, data visualization or reproducible research. Choosing the primary areas in which the library commits both services and staffing provides clarity in the library and on campus about the intended focus of the service. This focus can offer several benefits. First, defining the range of the service encourages the library to deepen expertise within its chosen scope while reducing the temptation to expand service areas before the program is established. Second, a well-defined service lends itself to an effective marketing plan that can increase awareness of the service on campus while establishing clear expectations about the nature and level of support offered.
Libraries have a long-standing tradition in handling data. Back in the 1960s, with the rise of national data centers as custodians of digital social surveys, social science data libraries emerged within university libraries. These pioneering data services aimed to support social scientists in accessing the newly available secondary data sources. In recent years, libraries have also undertaken mass digitization projects, which have resulted in large collections of data that need careful and specialized handling for both dissemination and preservation. Moreover, libraries have adopted a leading role in the research data management space by focusing on primary data and supporting researchers across disciplines to manage them from the moment of creation, providing means to capture, store, document and share their data.
Currently, academic and research libraries have important collections of digital data as well as the expertise to curate them. But should libraries exploit their data in other ways? Can data science methods bring new forms of working with and making available library data? Could this expertise be used to apply analytics to support the organization's decision making?
Against this background of libraries and data, this chapter presents an exploratory data science library unit within a cultural organization: the DataLab of the Library of the Fundación Juan March (FJM). The main objective of the chapter is to present a case study of the inception and development of the structure and services provided by the DataLab and provide a useful example for others interested in exploring the use of data science in libraries.
The chapter starts by setting the context of the FJM and its core activities, which include research and libraries. It goes on to describe the evolution of the DataLab from a social science data library into a data science unit with a focus on data curation and analytics projects. The interdisciplinary nature of the unit will be described together with its core objectives, the core members of the team and the technical infrastructure setup needed to run a wide range of dissemination, preservation and analytics projects.
In order to show concrete aspects of the DataLab, the chapter will report on a variety of projects and activities undertaken in the last few years.
The University of Washington has been active in data-intensive science efforts for over a decade, both through the establishment of the eScience Institute (http://escience.washington.edu) as well as the University's participation in the Moore-Sloan Data Science Environment grant (http://msdse.org). These efforts have impacted the university environment in the areas of research, education and support services such as libraries. This chapter will explore the University of Washington Libraries in particular, considering the changes to services, strategies and infrastructure required to support students, staff and faculty pursuing data-intensive work.
There are three major takeaways:
1 Networking and collaboration are essential to provide support for data science.
2 Support for data science can be provided with a variety of staffing, expertise and funding models – as long as there is administrative support as well as a culture of networking and collaboration.
3 The evolution in library services and interdepartmental relationships will continue, making agility and responsiveness essential to the continued relevance and success of libraries.
Background
The University of Washington (UW) is well known for its computer science and engineering programs (www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings) and is regularly ranked as one of the US's premier research-intensive universities (www.cs.washington.edu/about_us). The growth of massive data collection and the study of those collections led to the founding of the UW's eScience Institute in 2008. Founders included former UW provost Mark Emmert, Ed Lazowska from Computer Science and Engineering, Tom Daniel from Biology and Werner Stuetzle from Statistics. Although the key personnel have changed over time, the commitment from a variety of departments remains strong.
The goal of the eScience Institute was to provide a new home for researchers from varied UW departments to work together on data-intensive research. As stated in the Institute's mission:
The eScience Institute empowers researchers and students in all fields to answer fundamental questions through the use of large, complex and noisy data. As the hub of data-intensive discovery on campus, we lead a community of innovators in the techniques, technologies and best practices of data science and the fields that depend on them (https://escience.washington.edu/about-us).
The Institute brings together expertise and people from around UW to collaborate, educate and inform each other, while advancing the field of data science – all without allegiance to one particular discipline or department.
In recent years, as the growth of Data Science programs has proliferated, academic libraries have converged on reproducible research practices as a framework for extending and shaping data services for researchers and scholars at all levels. In addition to many service offerings, such as consulting on data management and data sharing plans for grant applications and data curation, library data service units are increasingly supporting, contributing and collaborating on services such as: open source programming languages; software and data documentation; robust project management and versioning; computational infrastructure and analytic environments; data/software repositories and archives; and critical instruction to promote information literacy in the classroom.
This chapter shares examples of how University of California, Berkeley library staff are collaborating with the Division of Data Sciences, Research Information Technology and other campus partners to support data science initiatives around the theme of reproducible research. The chapter provides some ideas for how reproducibility, as a professional orientation and practice, can pave the way for future services and collaboration between libraries and data science practitioners.
Research is inherently messy. Making research reproducible can be an added burden to a researcher's already complex workflow. The more libraries can help to minimize the labor of reproducible research practices, the greater impact they can have in the academy. At a high level, libraries can mean different things to different people, but at their core, libraries are community hubs for access and creation of knowledge. As researchers naturally gravitate to new forms of knowledge creation, libraries are tasked with a unique responsibility to facilitate these changes.
Traditionally, knowledge is captured in print and its digital counterparts. On the link between scholarship and scholarly articles, Claerbout and Karrenbach (1992) put it succinctly: ‘An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.’ This well-traveled quote raises an important point about research papers: that tables, figures and prose alone are simply not sufficient for auditable, verifiable and falsifiable science. While the reproducible science call to action must be met by those doing science, it does take a village to support those actions. Sayre and Riegelman (2018) and Stodden et al. (2013) both wrote about libraries being a catalyst for culture change ‘toward reproducible research’.
Having worked in academic libraries for more than 25 years, I have observed many changes in library services and support. I was educated at the Royal School of Library and Information Science in Copenhagen, Denmark, in 1994 as a librarian, but, during my education, I never heard any of my otherwise great teachers talk about ‘the Internet’. While in school, I focused on topics like Information Policies, EU Copyright Policies, Scholarly Communications and Information Searching and Retrieval. For the majority of my career, I have supported and trained students and faculty with their scholarly and education needs. My work focuses on user support, information literacy, outreach, training and teaching. This has taken many forms ranging from an introduction to HTML in the late 1990s to information retrieval, reference management, systematic reviews and open science issues in recent years.
The transformation from dealing with paper-based materials 25 years ago to dealing with electronic, digital and online access has resulted in a shift in the skills needed in libraries. Fortunately, libraries around the world have always embraced that change. The skills needed for moving from paper to digital access are similar to those needed for data services in libraries today. The re-skilling and up-skilling needed to deal with data-related questions requires both data librarians and ‘front desk’ colleagues with just enough expertise to engage with what is actually being asked (Rice and Southall, 2016, 16). Front desk librarians need to be able to place a data reference question in the right context and help point the patron in the right direction. Not everyone working in today's academic libraries will become a data expert, but having the awareness and motivation to learn new data skills and place them in the right context is crucial. A learning mindset is essential.
In the late 1990s, the library management at the Technical University of Denmark (DTU) Library made a bold decision. DTU stopped subscribing to printed journal collections and began building a digital collection, almost from one day to the next (Butler, 1999). At the same time, the library management launched an in-house skills project (Project JULIA) requiring every employee to acquire the digital skills needed to be ready for the future.
Data science, in its stubborn refusal to be defined or constrained around a coherent conceptual node, can be seen as a collection of skills, approaches and methods that have become relevant across a variety of research domains. While the departmental home for data science within the academic context continues to be debated (Donoho, 2017), the need for data science-related skills within the broader scientific research community has grown across domains and those researchers are actively searching for help (Osborne et al., 2014).
Data science is also a quickly growing academic degree and certificate area. Undergraduate degree programs and academic units hosting these programs are being strongly urged to develop faculty specializing in data science research and education (National Academies of Sciences, Engineering, and Medicine, 2018).
As much as the field of statistics may lament not becoming data science's de facto home, the reality is that this new domain has grown beyond the boundaries of any single department. This ownership debate will likely continue unabated as many academic disciplines recognize missed strategic opportunities and attempt to assert political control on their campuses. Looking beyond organizational chart intrigue, these new students and scholars will exist no matter how contentious their position is. Waiting for the debates to settle before acting puts them at serious risk of being unserved or underserved for technical, data and other information services.
This entirely new subject domain and service population represent an exciting engagement opportunity for librarianship and information services. Not only are there patrons working directly in this new subject area, like undergraduates majoring in data science, but there are also indirect members sitting as affiliated faculty and other researchers seeking out data science-aligned training. They have unique research needs around data discovery, technical services, research data management and scientific reproducibility. Libraries resistant to engaging with these new scholars and students face a similar missed strategic opportunity.
Given how varied and bespoke the placement of data science research, educational and service units is across academic campuses, hosting campus-level data science training opportunities and consultations within a university's library, or within a research unit of the library, can be one of the most efficient methods of distributing that service.
The University of California, Los Angeles (UCLA) Library Data Science Center (DSC) is a research and education unit supporting faculty, researchers and students through consultation, instruction, co-curricular programming and data infrastructure. It provides a wide range of researcher support and development in data and computationally intensive scholarship, geospatial analysis and emerging technologies. Since 2018, the DSC has developed services that provide education and support for the increasingly complex research landscape.
This chapter outlines the process used to create the new services. It gives context to the Center's origins as the Social Science Data Archive (SSDA) that provided social science data services at UCLA from the 1970s. The chapter examines how integrating the SSDA into the Library in 2014 led to a shift of focus toward a service that supports data creation, interpretation and publication regardless of discipline or methodology. It articulates the drivers for change on the UCLA campus that led to the redesign of service offerings and describes how the DSC's involvement with the Carpentries movement expanded its ability to teach data and coding skills. The chapter also reflects on the challenges faced in establishing a service profile that is non-traditional for a library while focusing on building an inclusive community that democratizes data science tools and their research applications.
UCLA: context
UCLA is a public research institution located in Los Angeles, California. UCLA has a diverse community of scholars that encompasses nearly 30,000 undergraduates pursuing 125 majors, 13,000 graduate students in 59 research programs and over 7,000 faculty members. To support its research activities, the University deployed a department-based research support infrastructure. Research data support has been heavily siloed across campus, depending on when and where departments can access resources to support these endeavors. Several distinct groups have emerged that provide varying layers of support for particular disciplines. For example, researchers in STEM (science, technology, engineering and mathematics) fields have ready access to course-integrated resources in campus units such as the UCLA Collaboratory in the Institute for Quantitative and Computational Biology and the Office of Advanced Research Computing. These institutes have large staffs and support thousands of researchers annually.
In contrast, departments in social sciences, humanities and arts lack access to similar institutes or infrastructure. However, data-intensive research is a part of nearly every discipline's research workflow.
Combinatorial samplers are algorithmic schemes devised for the approximate- and exact-size generation of large random combinatorial structures, such as context-free words, various tree-like data structures, maps, tilings and RNA molecules. They can be adapted to combinatorial specifications with additional parameters, allowing more flexible control over the output profile of parametrised combinatorial patterns. One can control, for instance, the number of leaves, the profile of node degrees in trees or the number of certain sub-patterns in generated strings. However, such flexible control requires an additional and nontrivial tuning procedure. Using techniques of convex optimisation, we present an efficient tuning algorithm for multi-parametric combinatorial specifications. Our algorithm works in polynomial time in the system description length, the number of tuning parameters, the number of combinatorial classes in the specification, and the logarithm of the total target size. We demonstrate the effectiveness of our method on a series of practical examples, including rational, algebraic, and so-called Pólya specifications. We show how our method can be adapted to a broad range of less typical combinatorial constructions, including symmetric polynomials, labelled sets and cycles with cardinality lower bounds, simple increasing trees or substitutions. Finally, we discuss some practical aspects of our prototype tuner implementation and provide its benchmark results.
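The tuning problem described above can be illustrated on the simplest possible specification. The sketch below (a minimal illustration, not the paper's polynomial-time convex-optimisation tuner) considers plane binary trees counted by leaves, which satisfy B(x) = x + B(x)², and uses one-dimensional bisection to choose the Boltzmann parameter x so that the expected output size matches a target:

```python
import math
import random

def B(x):
    # Generating function of binary trees counted by leaves:
    # B(x) = x + B(x)^2, hence B(x) = (1 - sqrt(1 - 4x)) / 2 for 0 < x <= 1/4.
    return (1 - math.sqrt(1 - 4 * x)) / 2

def boltzmann_tree(x, rng):
    """Free (unbounded-size) Boltzmann sampler; returns the leaf count."""
    if rng.random() < x / B(x):
        return 1                                   # emit a leaf
    return boltzmann_tree(x, rng) + boltzmann_tree(x, rng)  # internal node

def tune(target):
    """Pick x so the expected leaf count equals `target`.

    E[size] = x B'(x) / B(x) = x / (B(x) * sqrt(1 - 4x)); it grows
    monotonically from 1 (x -> 0) to infinity (x -> 1/4), so bisection
    suffices in this one-parameter case."""
    lo, hi = 0.0, 0.25
    for _ in range(60):
        mid = (lo + hi) / 2
        expected = mid / (B(mid) * math.sqrt(1 - 4 * mid))
        if expected < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = tune(10.0)
rng = random.Random(42)
print(x, boltzmann_tree(x, rng))
```

With several tuning parameters (leaves, node degrees, sub-patterns), the expected-value equations become a coupled system, which is where the paper's convex-optimisation formulation replaces this naive bisection.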