The Legality and Ethics of Web Scraping in Archaeology

Web scraping, the practice of automating the collection of data from websites, is a key part of how the internet functions, and it is an increasingly important part of the research tool kit for scientists, cultural resources professionals, and journalists. There are few resources intended to train archaeologists in how to develop web scrapers. Perhaps more importantly, there are also few resources that outline the normative, ethical, and legal frameworks within which scraping of archaeological data is situated. This article is intended to introduce archaeologists to web scraping as a research method, as well as to outline the norms concerning scraping that have evolved since the 1990s, and the current state of US legal frameworks that touch on the practice. These norms and legal frameworks continue to evolve, representing an opportunity for archaeologists to become more involved in how scraping is practiced and how it should be regulated in the future.

The open science and open data philosophies that began to mature in the early 2000s influenced the development of web scraping for scientific research. This period saw efforts to organize alternatives to publishing systems funded by institutional subscriptions and to make research products and data publicly available. The products of these efforts include public copyright licenses such as the Creative Commons License, the Public Library of Science and other open-access research publications, formalized definitions of "open data" and "open access," and public digital repositories for research products including datasets (Harnad 2005; Laakso et al. 2011). Those efforts, in part, resulted in changes to federal policies to encourage data sharing (Sheehan 2015). In the field of archaeology, repositories such as the Digital Archaeological Record (tDAR) developed to fill a need for a public repository for publicly funded archaeological research (McManamon et al. 2017). As these repositories and similar products of archaeological research become more deeply entrenched in the practice of archaeology, and as the appetite for large-scale comparative and synthetic work based on archaeological data increases (Ortman and Altschul 2023; Perreault 2019), the potential for web scrapers and similar computational methods to serve as important research tools expands significantly (McManamon et al. 2017; Ortman and Altschul 2023).

WHAT SCRAPING IS AND HOW TO DO IT
Archaeologists and other researchers interested in studying how the human past is viewed, studied, and commercialized increasingly rely on web scraping and similar computational tools as a method of gathering data (Daems 2020; Graham et al. 2020; Hashemi and Waddell 2022; Kintigh 2015; Marwick 2014; Richardson 2019; Wilson et al. 2022). Many of these approaches focus on studying how issues relating to archaeological practice are discussed on social media platforms (Marwick 2014; Richardson 2019), while others focus on investigating the online trade of illicit antiquities (Hashemi and Waddell 2022). However, as more and more archaeological data are hosted on websites, web scraping holds greater and greater potential as a powerful data collection tool. Still, despite the power of this method, archaeologists rarely incorporate web scraping into their research.
Scraping involves the automated collection of data by a computer program from websites where that data was intended to be read or collected by humans. People read PDFs, copy and paste elements of spreadsheets, or browse images: all forms of data presentation that are tailored for human consumption. These kinds of formats are unlike the data structures used to exchange information between computers, which are tightly structured, as unambiguous as possible, and read as nonsense to most people (Wiley 2021). Website data transmission happens through an Application Programming Interface (API; Jünger 2021). APIs represent sets of guidelines that structure how data are requested by one party, and how that request is fulfilled by the second party. Access to a website's API is the ideal means of collecting data from that website because it provides more computationally direct access to data. For example, the online Digital Index of North American Archaeology (DINAA) has guides for requesting data through API calls, making web scraping unnecessary (Wells et al. 2014).
However, not all websites have public APIs that researchers can use. Scraping is a fallback option in such cases.
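To make the contrast between an API and a scraped page concrete, the following is a minimal Python sketch of the request-and-response pattern described above. The endpoint URL, the parameter names, and the shape of the JSON response are all hypothetical and do not describe DINAA or any real service. No network request is sent; a sample string stands in for the server's reply.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.org/api/sites"


def build_request_url(base_url, **params):
    """Return base_url with the parameters encoded as a query string."""
    return f"{base_url}?{urlencode(sorted(params.items()))}"


def parse_site_records(response_text):
    """Extract (id, period) pairs from a JSON response body."""
    payload = json.loads(response_text)
    return [(rec["id"], rec["period"]) for rec in payload["results"]]


# Stand-in for the text a server might return (assumed structure).
SAMPLE_RESPONSE = (
    '{"results": ['
    '{"id": "site-001", "period": "Archaic"}, '
    '{"id": "site-002", "period": "Woodland"}]}'
)

url = build_request_url(BASE_URL, state="TX", format="json")
records = parse_site_records(SAMPLE_RESPONSE)
```

Because the response is already structured for machines, the parsing step is a few lines long. A scraper has to recover the same structure from markup that was laid out for human readers.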
Although diverse kinds of computer languages can be used to create websites, all website pages are transmitted in HyperText Markup Language (HTML) from a server to a client's browser (Mowery and Simcoe 2002). HTML is used to build the structure and content of a web page. The HTML representation of this website structure is called a "parse tree," which is made up of hierarchically organized "elements," each of which is bracketed by a tag. For example, "<p> </p>" would contain a paragraph of text. The information contained within a tag is the "content" of that element. Elements often have attributes associated with them to further specify exactly what element they refer to, such as "id='paragraph-2'" or "name = img3." The simplest methods of web scraping involve navigating this parse tree, identifying strings or tags associated with the data we want, and extracting those associated data-be they images, strings, tables, a single number, or other structures. Using regular expressions to identify tags and elements of interest and then copying material from that element is one popular strategy.
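The idea of walking a parse tree to pull out one element's content can be sketched in a few lines of Python. This example uses the standard library's html.parser module so that it runs without extra installations; a dedicated scraping package would usually be more convenient. The sample page, the class name, and the helper function are illustrative assumptions, not part of any cited tool.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementTextExtractor(HTMLParser):
    """Collect the text inside the first element with a given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1  # a nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # left the target element

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text_by_id(html, target_id):
    """Return the whitespace-normalized text of the element with target_id."""
    parser = ElementTextExtractor(target_id)
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())


# Made-up page that reuses the attribute examples from the text.
SAMPLE_PAGE = """
<html><body>
  <p id="paragraph-1">Site overview.</p>
  <p id="paragraph-2">Excavated in <b>1998</b> and 2004.</p>
  <img name="img3" src="plan.png">
</body></html>
"""

text = extract_text_by_id(SAMPLE_PAGE, "paragraph-2")
```

Tracking depth rather than matching strings lets the extractor follow nested tags such as the bold year inside the second paragraph, which is one reason tree-based navigation tends to be more robust than pattern matching on raw markup.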
Archaeologists interested in web scraping and who already have programming experience are most likely to have skills in R or Python (Marwick et al. 2017). Both programming languages have useful packages that support web scraping. Organizations such as The Programming Historian and Data Carpentry offer workshops and online tutorials that are designed to train researchers with no prior programming experience. In Python, the Beautiful Soup package is commonly used for scraping (Richardson 2023). The package was developed in 2004, and since then, there have been many tutorials published online, and there is an active community of users posting coding issues and solutions. R is the most widely used programming language in the field of archaeology, although it is less commonly used for web scraping. Recently, however, there has been an expansion of packages designed for web scraping, such as rvest, which was released in 2014 (Wickham 2023). Both rvest and Beautiful Soup contain similar functions that can automate the navigation of an HTML parse tree.
The parse tree, in many instances, may not have the data we want to scrape. Clicking on a website link to access a data table, for example, may not result in navigation to another HTML page, with its parse tree containing the table itself. Instead, websites often incorporate calls to objects stored elsewhere to be displayed within the web browser without changing the parse tree. In such situations, we need other kinds of packages with which either rvest or Beautiful Soup can interact. One solution is to navigate the website itself as a user would-that is, clicking links, typing search queries, and copying data through commands submitted to a web browser. Selenium in Python (Muthukadan 2018) and RSelenium in R (Harrison and Kim 2022) are good packages for automating the navigation of websites from an R or Python session, and they allow us to scrape data that are not represented within an HTML parse tree. Such an approach is likely required for most websites with modern interfaces.

SCRAPING NORMS
Because scraping is an automated process, and because of the generally open nature of the internet, an incredible amount of data can be extracted from websites very quickly. Web scrapers form the foundation of how search engines, such as Google, develop a map of the internet. However, there are many ways in which web scraping can be misused. Some web scrapers are designed to copy entire websites and rehost pages under new domains for ad revenue. Some web scrapers also can have the same effect as a distributed denial of service (DDoS) attack by automating repeated and very rapid requests on a website, overwhelming its servers. Other scraping techniques may be designed to extract sensitive information about individuals, which then may be sold to a third party (Krotov et al. 2020). Any researcher who can make a web scraper can also make a web scraper that does similar kinds of intrusive damage. Because scrapers can be used for good or ill, they have become the target of regulation not only through legal systems and policymaking but through the development of norms within communities of web designers and scrapers.
One long-standing norm in website design is the Robots Exclusion Protocol, which was developed in the mid-1990s as a means for website owners to communicate to web scraping programs which pages could be and which should not be scraped, as well as information about which web scraping programs are or are not welcome to scrape those pages (Elmer 2008). A website's Robots Exclusion Protocol is provided as a stand-alone page on a website under the address "/robots.txt." The information in a robots.txt may or may not be wholly consistent with a site's terms of service. For example, a site could have no information about web scraping in its terms of service, but the robots.txt could have instructions that are intended to prohibit most forms of web scraping. Adherence to the exclusion protocol is voluntary, and many scrapers do ignore it, although these exclusion standards and the violation of or adherence to them have been cited in legal cases involving web scraping (John F. Tamburo, et al., Plaintiffs, v. Steven Dworkin, et al., Defendants. No. 04 C 3317 [N.D. Ill. Nov. 17, 2010]).
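Checking a robots.txt before scraping can be automated. The sketch below uses Python's standard urllib.robotparser module to evaluate a small, made-up robots.txt; the file contents, the user agent names, and the example.org URLs are assumptions for illustration only. The rules are parsed from a string so that no request is sent. In practice the parser's set_url and read methods can fetch a live file instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one named scraper is banned outright, and all
# other scrapers are asked to stay out of /private/ and to wait
# 5 seconds between requests.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)

# "research-scraper" is a hypothetical user agent name.
allowed_public = rules.can_fetch(
    "research-scraper", "https://example.org/sites/index.html")
allowed_private = rules.can_fetch(
    "research-scraper", "https://example.org/private/data.csv")
allowed_badbot = rules.can_fetch(
    "BadBot", "https://example.org/sites/index.html")
delay = rules.crawl_delay("research-scraper")
```

A check like this only reports what the site owner has requested. Whether to honor the request remains the researcher's decision, since adherence to the protocol is voluntary.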
Robots.txt files are highly variable, and they have become generally more complex over time given that more and more programs and services have developed to gather data from websites. For example, the robots.txt file associated with eBay in 1998, the early days of that website, included only four lines prohibiting scraping originating from one source (Figures 1 and 2).
As of summer 2023, eBay has a far stricter robots.txt. This page prohibits scraping except by search engines-or scrapers that help generate advertising revenue (Figure 3). Additionally, it supplies a description of the site's philosophy on web scrapers that is written in prose within the robots.txt document. Nonetheless, eBay, in particular, is one of the sites that has seen the most research on cultural heritage and antiquities markets through web scraping, and there are publicly available tools specifically designed to scrape the online auction house. This has helped us to gain compelling insights about what kinds of antiquities are most popular on the site and, as a result, to better understand the role the antiquities market plays in looting and the preservation of the archaeological record (Altaweel 2019; Altaweel and Hadjitofi 2020).
Other norms have developed over the course of the history of the internet that help determine the design and use of web scrapers. For example, attempting to obfuscate one's identity or misrepresenting the reasons why the data are being scraped-especially if doing so is necessary to obtain access to the data-is ethically dubious. Instead, researchers should consider transparent practices. This is also an opportunity to explain to the domain owner that the scraping process was designed to avoid harm. Journalists, for example, have developed different norms and practices around how to scrape data from websites (Wiley 2021). One common practice to ensure transparency is to include identifying information within the scraper code-mainly a user agent string that outlines the identity of who is enacting the scraping-along with some information about the reason behind the scraping, contact information, and steps taken to ensure that the scraper does no damage to the functioning of a website. Other steps include throttling the frequency of requests sent to a website (Densmore 2017; Wiley 2021). These minimally invasive and transparent practices are formalized in the R package "polite," which has built-in functions for scanning the robots.txt file of a website to identify if scraping is allowed, for requesting permission to scrape, and for throttling requests to one every few seconds (Perepolkin 2023).
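Two of these practices, a self-identifying user agent string and request throttling, can be sketched in Python as follows. This is not the implementation used by the polite package. The function and class names, the format of the user agent string, the project name, and the contact address are all hypothetical, and the 5-second interval is an arbitrary example value.

```python
import time


def build_user_agent(project, purpose, contact):
    """Compose a user agent string saying who is scraping, why, and how to reach them."""
    return f"{project} ({purpose}; contact: {contact})"


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Pause until min_interval seconds have passed since the previous call.

        Returns the number of seconds slept.
        """
        waited = 0.0
        if self._last_request is not None:
            remaining = self.min_interval - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request = self._clock()
        return waited


# Hypothetical project name and contact address.
headers = {
    "User-Agent": build_user_agent(
        "ExampleArchSurvey",
        "academic research on site reports",
        "researcher@example.org",
    )
}
throttle = Throttle(min_interval=5.0)

# Intended use inside a scraping loop (request code omitted):
# for url in urls:
#     throttle.wait()
#     ... send the request with `headers` using an HTTP library ...
```

Calling wait() before every request means the delay is enforced in one place, regardless of how many functions in the scraper send requests.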

LEGAL FRAMEWORKS
Researchers, before performing a data-scraping project, should familiarize themselves with federal, state, and local regulations that relate to web scraping, copyright law, and digital trespassing. Currently, there are no laws that are tailored specifically to web scraping. Instead, a patchwork of relevant laws, regulations, and decisions is cited in cases involving web scraping. Until reforms occur, the legal landscape will remain murky (Christensen 2020; Landers et al. 2016; Sellars 2018; Sobel 2021). One caveat for the discussion below: I am not a lawyer. For a fuller discussion of the laws and regulations surrounding web scraping, many of the references in this section are a good start.
Site terms and conditions are one method of proposing the "gatekeeper rights" of the owners of a website (Kadri 2020). Those terms and conditions may explicitly note that automated data collection is not allowed, even if their website makes that data publicly available (Wickham et al. 2023). Terms of use violations, such as scraping publicly available information, could be enough for a company to send a cease-and-desist letter (Kadri 2020). In such cases, there is little evidence for successful criminal cases brought against groups who scraped publicly available information, even though it was against the terms of use of that website. However, more precautions should be taken if access to the data to be scraped requires setting up an account and actively agreeing to a website's terms of service that explicitly ban web scraping (Landers et al. 2016). Any information that is behind a log-in screen or that requires an account and agreement to a terms of service to access is more likely to have stronger protections under federal laws (Landers et al. 2016; Macapinlac 2019).
In the United States, the main law that gives shape to government policy regarding web scraping is the Computer Fraud and Abuse Act (CFAA; Christensen 2020; Sellars 2018). The CFAA was passed in 1986, well before public use of the internet was commonplace. It outlines the legal framework surrounding how computers are accessed and used. Mainly, it was intended to protect commercial property rights and sensitive data hosted on the computers of government agencies. For example, if a disgruntled employee released a computer virus on a company's network, destroying financially valuable data, that would be a crime under the CFAA. In contrast, if that released virus did no economic damage, then whether or not that action would constitute a crime is less clear under the CFAA (Roach and Michiels 2006). The CFAA also vaguely delineates web-scraping practices that could be illegal. For example, if private data not intended for the public were somehow scraped from a website, that could constitute "unauthorized access" and fall under the CFAA (Krotov et al. 2020; Sobel 2021). The notoriously vague wording of the CFAA also means that acts such as lying about one's age on the internet could constitute a federal crime, and even scraping of publicly accessible data could be construed as a federal crime (Macapinlac 2019).
The uncertainty around how scraping relates to legal frameworks leads to a lack of predictability about what kinds of actions will be charged as federal crimes. This uncertainty has led to calls to reform the CFAA and other laws. One of the major historical events that illustrates this was the federal legal action brought against Aaron Swartz in 2011. Swartz was a computer scientist who helped develop the RSS feed, Markdown, and the Creative Commons license. He was also a leading advocate for the open availability of scientific data and research on the internet. Swartz was indicted under the Computer Fraud and Abuse Act under allegations that he used the free access of MIT's institutional JSTOR subscription on the public MIT network to download scientific papers en masse using a scraping program. Although neither MIT nor JSTOR pushed for his prosecution, and although Swartz had not then shared those files with the public, federal prosecutors brought charges that included wire fraud and computer fraud. Swartz took his own life in 2013. There was significant bipartisan backlash against the Justice Department's handling of the case, and the case also further galvanized calls for open access to scientific data both broadly and within the field of archaeology (Kansa et al. 2013). Although reforms to the CFAA were drafted after the Swartz case, none were passed into law. Nonetheless, this case likely had an impact on future applications of the CFAA. Interpretation of the law has continued to evolve, and subsequently, there have been few other instances of federal charges being brought against researchers and individuals who built web scrapers to collect publicly available data for research or scientific purposes. Instead, CFAA cases involving web scraping tend to revolve around business disputes (Macapinlac 2019).

FIGURE 2. RoverBot.com website as it stood in December 1996. This was the only web-scraping program that eBay.com disallowed from scraping data on its website in 1998.
The reach of the CFAA has become slightly shorter as legal cases continue to be decided (Christensen 2020). For example, one of the recent and higher-profile web-scraping cases brought before the Ninth Circuit Court of Appeals was hiQ Labs Incorporated v. LinkedIn Corporation (938 F.3d 985 [9th Cir. 2019]). The Ninth Circuit evaluated whether LinkedIn could cite the CFAA in a case against the company hiQ, which had been scraping information LinkedIn users had placed on their public profiles. In the end, the Ninth Circuit held that scraping publicly available information likely does not constitute access "without authorization" under the CFAA.
The Digital Millennium Copyright Act (DMCA) of 1998 is one example of a federal copyright law intended to afford companies with copyrighted digital work protections against others republishing or repurposing their works, especially if that reuse is for profit (Lawrence and Ehle 2019). Reuse and reproduction of online materials for the sake of research is more likely to fall under "fair use" exclusions to copyright law (Myers 2022). However, if scraping involves gathering massive amounts of data and rehosting that data in some way with minimal modification, or if scraping is performed in such a way that it has a negative economic impact on a website or company, this could increase the likelihood of successful legal action (Lawrence and Ehle 2019; Liu and Davis 2015).
Common law, or tort law, also can provide a basis for civil cases that could be brought against people who implement web scrapers. A tort refers to some action that causes a claimant some loss or harm. Trespass to chattels is one example of such a tort in civil law (Sobel 2021). Historically, trespass to chattels is a portion of tort law that serves as a basis to bring civil action against individuals who interfere with another's possessions-or "chattel"-through, for example, taking those possessions or inhibiting access to them (Quilter 2002). Trespass to chattels is often used to bring civil cases against sources of spam on the internet. One of the first such cases-CompuServe Inc. v. Cyber Promotions Inc.-involved CompuServe arguing that the bulk digital contact that originated from Cyber Promotions was sufficiently damaging to constitute a trespass to chattels (Graham 1997; Quilter 2002). The courts ruled in favor of CompuServe and opened the door for trespass to chattels cases to be brought against others, even if there were only indirect costs or damages that resulted from the trespass (Quilter 2002). Trespass to chattels is cited in several web-scraping cases as well (O'Reilly 2007), and courts generally appeared willing to rule in favor of companies that bring trespass to chattels claims, even without strong evidence of economic damages caused by scraping (Quilter 2002). This provides another viable avenue for website owners to restrict access to publicly available data that otherwise are more difficult to restrict through either copyright law, such as the DMCA, or digital trespass and fraud laws, such as the CFAA (O'Reilly 2007; Quilter 2002).
In summary, in the case of most research projects that involve scraping of publicly available information, there is a low risk of criminal liability but some risk of civil liability. Given the legal ambiguity surrounding the practice, one strategy is to avoid scraping altogether. However, being uninvolved in scraping as a field also leaves archaeologists and heritage professionals in a place where they cannot influence how the practice is employed and regulated in the future. That same ambiguity has also not stopped web scraping from becoming widespread in business, journalism, and other scientific fields (Baranetsky 2018; Kirkpatrick 2015; Wiley 2021). This is largely because the method can provide economic, scientific, and public benefits that arguably outweigh risks that stem from the ambiguity in the legal framework.

ARCHAEOLOGICAL ETHICS
Over the course of the twentieth century in the United States, there was a transition from the archaeological record being unprotected and unmanaged by public entities to the modern condition where archaeological sites are protected and managed to abide by not only legal obligations but professional norms and ethics focused on site preservation (King and Lyneis 1978; Society for American Archaeology 2016). Since the passage of the Antiquities Act in 1906, the federal government has taken an explicit role in the management of archaeological resources on public lands (Colwell-Chanthaphonh 2005; King and Lyneis 1978). The Archaeological Resources Protection Act of 1979 further outlined the role the government must play in protecting sites or subjecting them to minimal damage during analysis (Northey 1982). Among those new protections were provisions to prevent site location data from being accessed by the broader public. This kind of sensitive locational information can be easily collected en masse if an archaeologist has access to websites that store it. As digital methods advance, it is important to continually revisit how our practices relate to our ethical and legal obligations (Dennis 2020; Richardson 2018). These kinds of legal, ethical, and normative obligations should be kept in mind during any attempt to scrape large amounts of data about the archaeological record.
Precautions must be taken when using a web scraper to gather locational information or any other information that could make it much easier to locate-and damage-archaeological sites. One strategy is to engage in some form of obfuscation of the true site locations (Anderson and Horak 1995; Robinson et al. 2019; Smith 2020). A popular strategy is to summarize the locations of sites analytically based on which county they fall within. Another viable strategy is to resample a new site location from within a certain radius of the reported site location (Smith 2020). These steps are best performed at the same time as the scraping to ensure that only the obfuscated location, rather than the true site location, is ever stored. This prevents any researcher from having thousands of precise site locations on a personal or work computer. Even if those raw data are never meant to be shared widely, it is not good practice to retain precise locational data for no good reason, given that it could be leaked, unintentionally shared, or hacked. Strategies such as saving only obfuscated site coordinates throughout scraping help to mitigate that risk.
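The resampling strategy can be sketched in Python as follows. This is an illustrative approach and not the specific procedure of any cited study. It draws a replacement point uniformly from a disk around the reported location, using a flat-earth approximation that is reasonable for radii of a few kilometers away from the poles. The record fields ("site_id", "lat", "lon"), the coordinates, and the 5 km radius are made-up example values.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0


def jitter_location(lat, lon, radius_m, rng=random):
    """Return a point drawn uniformly from a disk of radius_m around (lat, lon)."""
    # The square root makes the draw uniform over the disk's area
    # instead of clustering points near the center.
    distance = radius_m * math.sqrt(rng.random())
    bearing = 2 * math.pi * rng.random()
    d_north = distance * math.cos(bearing)
    d_east = distance * math.sin(bearing)
    new_lat = lat + math.degrees(d_north / EARTH_RADIUS_M)
    new_lon = lon + math.degrees(
        d_east / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return new_lat, new_lon


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    d_phi = p2 - p1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def obfuscate_record(record, radius_m, rng=random):
    """Return a copy of the record with the true coordinates replaced."""
    lat, lon = jitter_location(record["lat"], record["lon"], radius_m, rng)
    safe = {k: v for k, v in record.items() if k not in ("lat", "lon")}
    safe["lat_obfuscated"] = lat
    safe["lon_obfuscated"] = lon
    return safe


# Made-up record; in a scraper this step would run before anything is saved.
scraped = {"site_id": "example-001", "lat": 31.0, "lon": -99.0}
stored = obfuscate_record(scraped, radius_m=5000.0)
```

Applying obfuscate_record to each record as it is scraped, and writing only the returned dictionary to disk, keeps the true coordinates in memory only for the moment they are needed.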
Furthermore, archaeologists interested in public perceptions of archaeology-or other questions that involve gathering data from or about living people-must also ensure that the rights and welfare of those people are protected. Web scraping, as outlined above, can allow one individual to collect massive amounts of data about individuals from public websites, much of which may result in individuals being indirectly identified even if names and other sensitive forms of personally identifiable information are not collected. In 2013, the Department of Health and Human Services updated its recommendations and guidance specifically for internet-based research, including data gathered through web scraping (Secretary's Advisory Committee on Human Research Protections 2013). This updated guidance outlined a framework relating potentially sensitive information hosted on websites to the "basic ethical principles" of human subject research outlined in the Belmont Report: Respect for Persons, Beneficence, and Justice (National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research 1979). The boundary between public information (data that individuals should not expect to be kept private) and private information (such as medical, educational, and financial records) is hazy on the internet. Many users, for example, may not be fully aware that the information they provide on a public website is likely to be observed and recorded for scientific research, even if that data does ostensibly qualify as "public." The beneficence principle in the Belmont Report serves to temper a broad treatment of all publicly available information on the internet as ethically "public" given that many users may be operating under an assumption that their data will not be widely spread. In some online communities, there may be a stronger shared expectation of privacy, and the Advisory Committee's recommendation in such an instance is to be aware of and respect those expectations (Secretary's Advisory Committee on Human Research Protections 2013).
The 2013 guidance also discusses concerns about the kinds of research studies that qualify as interventions, the kinds of observations that qualify as observations of public behavior, and the argued characteristics of sites that should be considered analogous to public places. A full discussion of these ideas is beyond the scope of this article, but they should be carefully considered when scraping intersects with human subjects research. In the United States, any scraping work that involves studying human subjects must be discussed with Institutional Review Boards (IRBs), regardless of the investigators' perception of risk to those human subjects. As is the case with sensitive site locational information, sensitive information-including personally identifiable information-should not be stored unless necessary. When it must be, such data should be encrypted, stored on a secure server, or both.

DISCUSSION AND CONCLUSIONS
Web scraping is a useful method of sampling from an ever-deepening pool of data hosted on the internet. However, there are many factors to consider when deciding whether and how to build and implement a web scraper. In some cases, we may be more likely to argue that scraping publicly available information from a website is in the public interest and of scientific value, even if the site is explicit in requesting that no web scraping be performed at all (Luscombe et al. 2022). We might, for example, be interested in systematically assessing how discussions about the archaeological record and prehistory have changed over the past few decades by performing a textual analysis of posts on White-supremacist message boards. Although this may be prohibited in the site terms of use or in the robots.txt associated with the message board, such a study may have scientific value and could be in the public interest. It would help us better understand the long-identified relationship between archaeological findings and White nationalism (Hakenbeck 2019). So should researchers always adhere to company requests? Researchers should consider both their justifications for proceeding-whether the data are public, whether the users of the website may expect privacy, and whether scraping will hurt those websites-and input from IRBs.
In contrast, many websites that contain archaeological data do not have information in their terms of use or a robots.txt to give guidance about expectations for the use of web scrapers. For example, the Texas Historical Commission site atlas does not have a robots.txt, nor does it have any guidance in its terms of use that directly relates to the use of web scrapers. The lack of guidance from websites should not be considered a license to scrape in whatever way one wishes. In cases like this, researchers should still-at a minimum-identify themselves, focus on targeted collection, and find ways to obfuscate more sensitive information, especially site locational data. Heritage professionals who are building digital repositories and digital interfaces that provide large amounts of digital data-including State Historic Preservation Offices that provide database access to professional archaeologists-should keep in mind the use of web scrapers in the design of those websites. An appropriate robots.txt and discussion in the terms of service surrounding the use of web scrapers should help provide an additional layer of protection for more sensitive data.
In summary, there are three broad ways to look at the practice of web scraping. The first is to avoid the practice out of an abundance of caution. This circumvents any potential legal or ethical issues outlined above. The second is a no-holds-barred approach, where entire datasets are extracted through whatever means necessary and hosted with no modification, including sensitive data. The third is a middle path between these two: researchers should take into account all the issues raised above, including norms surrounding scraper design, archaeological ethics, the legal system, the stated desires of website owners, and the expectations of the users of those sites. Researchers should use their best judgment while being aware of the current and changing legal and ethical climates within which this kind of work is situated (Landers et al. 2016; Luscombe et al. 2022). By engaging in this practice, archaeologists and other cultural heritage professionals can then become a part of the community that makes decisions about how scraping should be performed and how it should be regulated in the future.

FIGURE 3. The first fraction of 475 lines of the robots.txt file associated with eBay.com as of July 2023. Most kinds of web scraping are disallowed.