Machine Learning–Based Identification of Lithic Microdebitage

Published online by Cambridge University Press:  16 February 2023

Markus Eberl*
Affiliation:
Department of Anthropology, Vanderbilt University, Nashville, TN, USA
Charreau S. Bell
Affiliation:
Data Science Institute, Vanderbilt University, Nashville, TN, USA
Jesse Spencer-Smith
Affiliation:
Data Science Institute, Vanderbilt University, Nashville, TN, USA
Mark Raj
Affiliation:
Data Science Institute, Vanderbilt University, Nashville, TN, USA
Amanda Sarubbi
Affiliation:
Data Science Institute, Vanderbilt University, Nashville, TN, USA
Phyllis S. Johnson
Affiliation:
Department of Anthropology, University of Kentucky, Lexington, KY, USA
Amy E. Rieth
Affiliation:
Department of Anthropology, Vanderbilt University, Nashville, TN, USA
Umang Chaudhry
Affiliation:
Data Science Institute, Vanderbilt University, Nashville, TN, USA
Rebecca Estrada Aguila
Affiliation:
Department of Anthropology, Vanderbilt University, Nashville, TN, USA
Michael McBride
Affiliation:
Independent Scholar, Plano, TX, USA
* (markus.eberl@vanderbilt.edu, corresponding author)

Abstract

Archaeologists tend to produce slow data that is contextually rich but often difficult to generalize. An example is the analysis of lithic microdebitage, or knapping debris, that is smaller than 6.3 mm (0.25 in.). So far, scholars have relied on manual approaches that are prone to intra- and interobserver errors. Here we present a machine learning–based alternative that combines experimental archaeology with dynamic image analysis. We use a dynamic image particle analyzer to measure each particle in experimentally produced lithic microdebitage (N = 5,299) as well as in an archaeological soil sample (N = 73,313). We have developed four machine learning models based on Naïve Bayes, glmnet (generalized linear regression), random forest, and XGBoost (“Extreme Gradient Boost[ing]”) algorithms. Hyperparameter tuning optimized each model. A random forest model performed best, with a sensitivity of 83.5%. It misclassified only 28, or 0.9%, of lithic microdebitage. XGBoost models reached a sensitivity of 67.3%, whereas Naïve Bayes and glmnet models stayed below 50%. Except for glmnet models, transparency proved to be the most critical variable for distinguishing microdebitage. Our approach objectifies and standardizes microdebitage analysis. Machine learning makes it possible to study much larger sample sizes. Algorithms differ, though, and a random forest model offers the best performance so far.

Archaeologists tend to produce “slow data,” that is, complex data from local contexts that are often difficult to generalize. A good example is the analysis of lithic microdebitage, or knapping debris smaller than 6.3 mm (0.25 in.). Until now, researchers have used manual approaches that are prone to intra- and interobserver errors. Below, we present an alternative based on machine learning, experimental archaeology, and dynamic image analysis. We use a dynamic image particle analyzer to measure each particle in a sample of experimentally produced lithic microdebitage (N = 5,299) as well as in an archaeological soil sample (N = 73,313). We developed four machine learning models based on the Naïve Bayes, glmnet (generalized linear regression), random forest, and XGBoost (“Extreme Gradient Boost[ing]”) algorithms. Hyperparameter tuning optimized each model. A random forest model performed best. It has a sensitivity of 83.5% and misclassified only 28, or 0.9%, of the lithic microdebitage. The XGBoost models reach a sensitivity of 67.3%, whereas the Naïve Bayes and glmnet models remain below 50%. With the exception of the glmnet models, transparency proved to be the most critical variable for distinguishing microdebitage from soil. Our approach objectifies and standardizes microdebitage analysis. Machine learning makes it possible to study much larger sample sizes. However, algorithms differ, and a random forest model offers the best performance so far.
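The abstract compares the models by sensitivity, the share of true microdebitage particles that a model labels as microdebitage. The following minimal Python sketch shows how that metric, along with specificity and precision, can be derived from a binary confusion matrix. The class labels "exp" and "site" follow the Figure 4 caption; the `ConfusionMatrix` class, the `tally` helper, and all counts in the example are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix; microdebitage ("exp") is the positive class."""

    tp: int  # microdebitage correctly labeled as microdebitage
    fn: int  # microdebitage mislabeled as soil
    fp: int  # soil mislabeled as microdebitage
    tn: int  # soil correctly labeled as soil

    def sensitivity(self) -> float:
        # Share of actual microdebitage that the model recovers.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of actual soil particles that the model leaves as soil.
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # Share of predicted microdebitage that really is microdebitage.
        return self.tp / (self.tp + self.fp)


def tally(actual: Iterable[str], predicted: Iterable[str],
          positive: str = "exp") -> ConfusionMatrix:
    """Count outcomes from paired true and predicted labels."""
    tp = fn = fp = tn = 0
    for truth, guess in zip(actual, predicted):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        elif guess == positive:
            fp += 1
        else:
            tn += 1
    return ConfusionMatrix(tp=tp, fn=fn, fp=fp, tn=tn)


if __name__ == "__main__":
    # Hypothetical counts chosen for illustration only (not the paper's data).
    toy = ConfusionMatrix(tp=80, fn=20, fp=10, tn=890)
    print(f"sensitivity: {toy.sensitivity():.1%}")
    print(f"specificity: {toy.specificity():.1%}")
    print(f"precision:   {toy.precision():.1%}")
```

Because soil particles vastly outnumber microdebitage in a sample like the one described, overall accuracy can look high even when most microdebitage is missed, which is one reason to report sensitivity alongside it.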

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press on behalf of Society for American Archaeology
FIGURE 1. Dynamic image particle analyzer photos of samples of chert microdebitage (top row) and soil particles (bottom row).

FIGURE 2. Flow chart showing data split, pre-processing, and cross-validation.

FIGURE 3. Violin plots for 39 variables.

TABLE 1. Selected Performance Metrics of the Best Naïve Bayes, glmnet, Random Forest, and XGBoost Models.

FIGURE 4. Confusion matrices for the best-performing models of the four machine learning approaches (“exp” refers to experimentally produced microdebitage, and “site” refers to the archaeological soil sample): (a) Naïve Bayes, (b) glmnet, (c) random forest, (d) XGBoost.

FIGURE 5. Distribution of mean cross-validation performance across 20 model candidates for each of the four machine learning algorithms (metrics are explained in Table 1; pr_auc refers to the area under the precision-recall curve, and roc_auc refers to the area under the ROC curve): (a) Naïve Bayes, (b) glmnet, (c) random forest, (d) XGBoost.

FIGURE 6. Comparison of the performance of the four machine learning approaches. Red dots mark outliers (metrics are explained in Table 1; pr_auc refers to the area under the precision-recall curve, and roc_auc refers to the area under the ROC curve).

Supplementary material: File

Eberl et al. supplementary material
