
Semantic Segmentation of Archaeological Features on Public Lands: Case Study of Historical Cotton Terraces within the Piedmont National Wildlife Refuge, Georgia, USA

Published online by Cambridge University Press:  08 July 2025

Claudine Gravel-Miguel*
Affiliation:
Cultural Resource Sciences Program, New Mexico Consortium, Los Alamos, NM, USA Department of Anthropology, University of New Mexico, Albuquerque, NM, USA
Grant Snitker
Affiliation:
Cultural Resource Sciences Program, New Mexico Consortium, Los Alamos, NM, USA Department of Anthropology, University of New Mexico, Albuquerque, NM, USA
Jayde N. Hirniak
Affiliation:
Institute of Human Origins, School of Human Evolution and Social Change, Arizona State University, Tempe, AZ, USA
Katherine Peck
Affiliation:
Cultural Resource Sciences Program, New Mexico Consortium, Los Alamos, NM, USA Department of Anthropology, University of New Mexico, Albuquerque, NM, USA
Alex Fetterhoff
Affiliation:
Cultural Resource Sciences Program, New Mexico Consortium, Los Alamos, NM, USA Department of Anthropology, University of New Mexico, Albuquerque, NM, USA
Corresponding author: Claudine Gravel-Miguel; Email: cgravelmiguel@newmexicoconsortium.org

Abstract

The logistics, costs, and capacity needed to complete extensive archaeological pedestrian surveys to inventory cultural resources present challenges to public land managers. To address these issues, we developed a workflow combining lidar-derived imagery and deep learning (DL) models tailored for cultural resource management (CRM) programs on public lands. It combines Python scripts that fine-tune models to recognize archaeological features in lidar-derived imagery with QGIS denoising steps that improve the predictions’ performance and applicability. We present this workflow through an applied case study focused on detecting historic agricultural terraces in the Piedmont National Wildlife Refuge, Georgia, USA. For this project, we fine-tuned pretrained U-Net models to recognize agricultural terraces in imagery, identified the parameter settings that led to the highest recall for detecting terraces, and used those settings to train models on incremental dataset sizes, which allowed us to identify the minimum training size necessary to obtain satisfactory models. The results demonstrate effective models that detect most terraces even when trained on small datasets. This study provides a robust methodology that requires only basic proficiency in Python coding and expands DL applications in federal CRM by advancing lidar and machine learning methods for archaeological inventorying, monitoring, and preservation.

Resumen

The logistics, costs, and capacity needed to complete pedestrian archaeological surveys to inventory cultural resources present challenges to public land managers. To address these issues, we developed a workflow that combines lidar-derived imagery and deep learning models tailored for cultural resource management (CRM) programs on public lands. This workflow combines Python scripts that fine-tune models to recognize archaeological features in lidar-derived imagery with QGIS denoising filters that improve the predictions’ performance and applicability. We present this workflow through a case study focused on detecting historic agricultural terraces in the Piedmont National Wildlife Refuge, Georgia, USA. To carry it out, we spent several weeks geolocating terraces visible in lidar-derived imagery. To fine-tune pretrained U-Net models and teach them to recognize agricultural terraces in the imagery, we tested different combinations of parameters. We identified the settings that achieved the highest recall for detecting terraces and used them to train models on incrementally larger datasets. This process allowed us to determine the minimum training size necessary to obtain satisfactory model performance. The results present highly effective models that are sensitive enough to detect most terraces, even when trained on small datasets.
This study therefore provides a robust methodology that requires basic proficiency in Python coding but expands deep learning applications in federal CRM by advancing lidar and machine learning methods for the inventorying, monitoring, and preservation of archaeological resources.

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Society for American Archaeology.

Figure 1. (Left) Example of terraces visible on an SLRM (20 m radius) tile and (right) its associated binary mask showing the location of terraces (white) versus the background (black).
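The tile-and-mask pairing in Figure 1 can be sketched as a simple thresholding step. Assuming the terrace annotations have already been rasterized into a label raster (the tiny `labels` array below is a hypothetical stand-in, not the authors' data), any nonzero pixel becomes white (terrace) and the rest black (background):

```python
import numpy as np

# Hypothetical label raster: nonzero pixels mark annotated terraces,
# with different integers distinguishing individual terrace polygons.
labels = np.array([[0, 2, 2],
                   [0, 0, 1],
                   [3, 0, 0]])

# Collapse all terrace labels into a single white class (255) on black (0).
mask = (labels > 0).astype(np.uint8) * 255
print(mask)
```

In a real workflow the label raster would come from rasterizing annotated polygons over each SLRM tile, but the thresholding step is the same.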


Figure 2. Study area of the PNWR and the coverage of the lidar-derived DEM (in grayscale) within its broader spatial context (the white arrow points to the white dot that shows its location). Basemap: ESRI satellite imagery.


Figure 3. Workflow from raw data to mapped terraces.


Figure 4. Diagrams of the different U-Net structures (drawn by CGM, based on the visualization in Ronneberger et al. 2015). Top: VGG16 and VGG19 (the gray boxes show the convolutions that are not present in VGG16); bottom: ResNet18.


Figure 5. Set of terraces visible in three one-band visualization maps (SLRM with 10 m radius, PosOp, and Slope), as well as the RGB image they produce when combined as a three-band tile (SLRM in the red channel, PosOp in the green channel, and Slope in the blue channel).
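As a rough illustration of the band combination described in Figure 5 (a minimal sketch, not the authors' actual code), each one-band visualization can be rescaled to 0–255 and stacked into a three-band RGB tile with NumPy. The `normalize_band` helper and the random toy bands are assumptions for the example:

```python
import numpy as np

def normalize_band(band):
    """Rescale a single visualization band to the 0-255 uint8 range."""
    lo, hi = band.min(), band.max()
    scaled = (band - lo) / (hi - lo) if hi > lo else np.zeros_like(band)
    return (scaled * 255).astype(np.uint8)

def make_three_band_tile(slrm, posop, slope):
    """Stack SLRM (red), PosOp (green), and Slope (blue) into one RGB tile."""
    return np.dstack([normalize_band(b) for b in (slrm, posop, slope)])

# Toy 4x4 arrays standing in for real lidar-derived rasters.
rng = np.random.default_rng(0)
slrm, posop, slope = (rng.random((4, 4)) for _ in range(3))
tile = make_three_band_tile(slrm, posop, slope)
print(tile.shape, tile.dtype)  # (4, 4, 3) uint8
```

In practice the three input rasters would be co-registered tiles exported from the same DEM, so they share shape and georeferencing.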


Table 1. Augmentations Used in This Research.


Table 2. Parameters Tested in the Order They Were Tested.


Figure 6. Extent of the different training datasets against the extent of the lidar data (black) and the annotated terraces (white lines).


Table 3. Number of Tiles Used for Each Training on Smaller Datasets.


Figure 7. Example of the object-by-object metric workflow. Windows A–C are from a different area than windows D and E: (A) predicted terraces shown as black pixels; (B) polygons labeled with their area (m²); each polygon’s color shows whether it will be automatically deleted (dark gray) or kept (white) based on the 1,600 m² denoising threshold; (C) denoised predicted polygons; (D) example of annotated terraces; (E) superposition of the annotated terraces and the predicted terraces. The fill colors show how they are used to calculate the object-by-object metrics: TP and FN are annotations that do and do not overlap with predictions, respectively, whereas FP are predictions that do not overlap with annotations.
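The denoising and object-by-object counting logic in Figure 7 can be sketched in pure Python. This is a simplified stand-in for the real QGIS vector workflow: axis-aligned boxes `(xmin, ymin, xmax, ymax)` replace arbitrary polygons, and all function names and toy coordinates are hypothetical:

```python
def area(box):
    """Area of an axis-aligned box (xmin, ymin, xmax, ymax), in m2."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) * (ymax - ymin)

def overlaps(a, b):
    """True if two axis-aligned boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def denoise(predictions, min_area=1600):
    """Drop predicted polygons smaller than the denoising threshold (m2)."""
    return [p for p in predictions if area(p) >= min_area]

def object_metrics(annotations, predictions):
    """TP/FN count annotations by overlap; FP counts unmatched predictions."""
    tp = sum(any(overlaps(a, p) for p in predictions) for a in annotations)
    fn = len(annotations) - tp
    fp = sum(not any(overlaps(p, a) for a in annotations) for p in predictions)
    recall = tp / (tp + fn) if annotations else 0.0
    return {"TP": tp, "FN": fn, "FP": fp, "recall": recall}

# Toy example: two annotated terraces, three raw predictions.
annotated = [(0, 0, 100, 40), (200, 0, 300, 40)]
predicted = [(10, 5, 90, 35),       # overlaps the first annotation -> TP
             (500, 500, 520, 510),  # tiny noise blob, removed by denoising
             (400, 0, 480, 40)]     # survives denoising, matches nothing -> FP
cleaned = denoise(predicted)
metrics = object_metrics(annotated, cleaned)
print(metrics)  # {'TP': 1, 'FN': 1, 'FP': 1, 'recall': 0.5}
```

With real polygon geometries, the `area` and `overlaps` helpers would be replaced by the equivalent QGIS (or Shapely) geometry operations; the counting logic is unchanged.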


Figure 8. Smoothed metrics per epoch for the 200-epoch training of the best parameters (smoothing span 0.2). The gray represents the 95% confidence interval around the smoothed value.


Table 4. Object-by-Object Performance Metrics after Automatic Denoising of the Predicted Vector Map.


Figure 9. Examples of true positives (a), false positives (b), and false negatives (c) identified by the model: (a) a small and indistinct terrace (left) and more regular terraces that are likely modern (right); (b) a pedestrian path that was incorrectly flagged; (c) an annotated terrace, not predicted by the model, that is likely a pedestrian path. The basemaps show slope overlaid on a hillshade map for increased topographic visibility. All white scale bars represent 50 m.


Figure 10. Metrics per epoch for each model trained on different dataset sizes. The gray represents the 95% confidence interval around the smoothed value (span 0.2). Refer to Figure 6 for the extent of the training dataset for each model.


Figure 11. Values extracted at the location of terrace centroids. Top: median slope within a 100 m radius. Bottom: number of terrace centroids within a 250 m radius. The notches show the 95% confidence interval around the median.
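The neighbor-count half of Figure 11 (terrace centroids within a 250 m radius) can be sketched with NumPy broadcasting. This is an illustrative assumption of how such counts might be computed, not the authors' code; the `neighbors_within` name and toy coordinates are invented for the example:

```python
import numpy as np

def neighbors_within(centroids, radius=250.0):
    """For each centroid, count the other centroids within `radius` meters."""
    pts = np.asarray(centroids, dtype=float)
    # Pairwise Euclidean distance matrix via broadcasting: (n, n).
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    within = d <= radius
    np.fill_diagonal(within, False)  # a centroid is not its own neighbor
    return within.sum(axis=1)

# Toy centroids in projected meters: two close together, one isolated.
pts = [(0, 0), (100, 0), (1000, 1000)]
counts = neighbors_within(pts)
print(counts)  # [1 1 0]
```

The pairwise matrix is O(n²) in memory, so for refuge-scale centroid sets a spatial index (e.g., a k-d tree) would be the practical choice; the counting logic is the same.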


Figure 12. Comparison of selected test predictions from two models trained on southern terraces.


Table 5. Object-by-Object Metrics Calculated on Denoised Predictions (Using Threshold 2,000) Filtered to Focus Only on the Northeastern Region That Was Used for Testing in All Models.


Figure 13. Terraces predicted by our best model, cleaned using the 2,000 m² threshold and clipped to the extent of the PNWR.