Machine learning approaches for the prediction of serious fluid leakage from hydrocarbon wells

Published online by Cambridge University Press:  19 May 2023

Mehdi Rezvandehy*
Affiliation:
Department of Chemical and Petroleum Engineering, University of Calgary, Calgary, AB, Canada
Bernhard Mayer
Affiliation:
Department of Geoscience, University of Calgary, Calgary, AB, Canada
*Corresponding author: Mehdi Rezvandehy; Email: mehdi.rezvandehy@ucalgary.ca

Abstract

The exploitation of hydrocarbon reservoirs may lead to contamination of soils and shallow water resources and to greenhouse gas emissions. Fluids such as methane or CO2 may in some cases migrate toward the groundwater zone and atmosphere through and along imperfectly sealed hydrocarbon wells. Field tests in hydrocarbon-producing regions are routinely conducted to detect serious leakage and prevent environmental pollution. The challenge is that testing is costly, time-consuming, and sometimes labor-intensive. In this study, machine learning approaches were applied to predict serious leakage with uncertainty quantification for wells that have not been field tested in Alberta, Canada. An improved imputation technique was developed by Cholesky factorization of the covariance matrix between features, where missing data are imputed by conditioning on the available values. The uncertainty in imputed values was quantified and incorporated into the final prediction to improve decision-making. Next, a wide range of predictive algorithms and various performance metrics were considered to achieve the most reliable classifier. However, a highly skewed distribution of field tests toward the negative class (nonserious leakage) forces predictive models to unrealistically underestimate the minority class (serious leakage). To address this issue, a combination of oversampling, undersampling, and ensemble learning was applied. By evaluating all the models on previously unseen data, an optimum classifier with minimal false negative prediction was determined. The developed methodology can be applied to identify the wells with the highest likelihood of serious fluid leakage within producing fields. This information is of key importance for optimizing field test operations to achieve economic and environmental benefits.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Location map of Alberta Energy Regulator (AER) classification for test results (serious and nonserious) of surface casing vent flow (SCVF) and gas migration (GM) for energy wells in Alberta, Canada. The majority of the wells are classified as nonserious (83.8%).

Table 1. Twenty-two physical properties for each well in Figure 1 retrieved from geoSCOUT (2022).

Figure 2. Schematic illustration of the workflow used in this article.

Figure 3. AER classification shown in Figure 1 is separated into training, validation, and test sets. Location map (top) and pie chart (bottom) for training (left), validation (middle), and test sets (right).

Figure 4. (a) Cholesky decomposition of correlation matrix for $ n $ features (well properties). (b) LU unconditional simulation. (c) LU conditional simulation.
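
The conditional step in (c) can be sketched as follows. This is a minimal pure-Python illustration of one common formulation of LU conditional simulation, assuming the features are already standard-normal scores and are ordered with the known values first. The function names `cholesky` and `lu_conditional` and the `noise` argument are hypothetical and not taken from the article, whose implementation may differ.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with a = L L^T (a must be symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def lu_conditional(corr, known, noise=None, rng=random):
    """Simulate the trailing (missing) features given the leading (known) ones.

    corr  : correlation matrix, known features ordered first (assumption)
    known : standard-normal scores of the known features
    noise : optional fixed N(0, 1) draws for the missing features (hypothetical
            argument, useful for reproducing a result)
    """
    n, k = len(corr), len(known)
    L = cholesky(corr)
    # Forward substitution: solve L11 w1 = known for the conditioning part.
    w = []
    for i in range(k):
        s = sum(L[i][j] * w[j] for j in range(i))
        w.append((known[i] - s) / L[i][i])
    # Fresh independent draws drive the unknown part.
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in range(n - k)]
    w.extend(noise)
    # Missing values: y2 = L21 w1 + L22 w2.
    return [sum(L[i][j] * w[j] for j in range(i + 1)) for i in range(k, n)]
```

Calling `lu_conditional` repeatedly with fresh noise gives multiple imputation realizations for the same row, which is one way the spread of imputed values could be carried into later predictions.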

Figure 5. (a) Schematic illustration of four features with four rows of data with missing and nonmissing values. (b) Features of the raw data reordered in each row so that nonmissing values come first, followed by missing data. (c) Correlation matrix for each row in (b).

Figure 6. (a) Synthetic example of four features with 10,000 data. NaNs are missing data. (b) Correlation matrix between features (below diagonal elements) and percentage of missing data for each bivariate feature (above diagonal elements). Maximum percentage of missing data is 51% for bivariate distribution of features 1 and 2.

Figure 7. Scatter plot matrix for synthetic example of four features with 10,000 data before imputation (a) and after imputation (b). Histograms of each feature are shown on diagonal elements. n is the number of nonmissing values for each univariate and bivariate distribution and $ {\rho}_{x,y} $ is the correlation coefficient for each bivariate distribution. $ \mu $ is the mean and $ \sigma $ is the standard deviation.

Figure 8. Schematic illustration of resampling within 5-fold cross-validation that leads to five models (models 1–5). Resampling is applied only on the training folds. A model is trained for the resampled training folds of each split. The trained model is used to predict the test fold which preserves the percentage of samples for each class in the original data set.
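
The scheme in this caption can be sketched in a few lines. The example below is a minimal pure-Python illustration, assuming binary labels (1 for the minority class) and using plain random oversampling to a 1:1 ratio as a stand-in for the article's combination of oversampling and undersampling. The names `stratified_folds`, `oversample_minority`, and `resampled_cv_splits` are hypothetical.

```python
import random

def stratified_folds(y, k, rng):
    """Split indices into k folds that each keep the class proportions of y."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def oversample_minority(train_idx, y, rng):
    """Duplicate randomly chosen minority indices until both classes match in size.

    Assumes both classes are present in train_idx.
    """
    pos = [i for i in train_idx if y[i] == 1]
    neg = [i for i in train_idx if y[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra

def resampled_cv_splits(y, k=5, seed=0):
    """Yield (resampled training indices, untouched test indices) for each split."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, rng)
    for f in range(k):
        test_idx = folds[f]  # left as-is, so it keeps the original class ratio
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield oversample_minority(train_idx, y, rng), test_idx
```

Because duplicates are drawn only from the training folds, no copy of a test-fold sample can appear in training, which is the leakage that resampling before the split would introduce.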

Figure 9. Correlation matrix (below diagonal elements) for 22 well properties of an AER classification (SCVF/GM test results), before imputation (a) and after imputation (b). The percentage of missing data for each bivariate distribution is shown above the diagonal elements. Cross plots between surface-casing depth (m) and production-casing depth (m) before and after imputation are shown at the bottom. n is the number of nonmissing values for each univariate and bivariate distribution and $ {\rho}_{x,y} $ is the correlation coefficient for each bivariate distribution. $ \mu $ is the mean and $ \sigma $ is the standard deviation.

Figure 10. Performance of predictive algorithms for training set (a), validation set (b), and test set (c) based on the metrics accuracy, sensitivity, precision, and specificity derived from the confusion matrix. Specificity is the highest and sensitivity is the lowest for almost all classifiers.
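
For reference, the four metrics named in this caption follow directly from the confusion-matrix counts. The sketch below uses the standard definitions and assumes binary labels with 1 as the positive (serious) class; the function name `confusion_metrics` and the toy labels are illustrative only.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, precision, and specificity from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return None instead of dividing by zero when a class is absent.
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall on the positive class
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),   # recall on the negative class
    }

# Toy imbalanced example: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = confusion_metrics(y_true, y_pred)
```

In this toy case specificity (6/7) is well above sensitivity (1/3) even though accuracy is 0.7, the same pattern the caption describes for imbalanced classes.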

Figure 11. ROC curves with calculated AUC for training set (a), validation set (b), and test set (c). TP, true positive; TN, true negative; FP, false positive; FN, false negative. Random Forest has the highest AUC without overfitting.

Figure 12. One hundred realizations of Random Forest, each run on one of the 100 imputation realizations, used to quantify the importance of each feature, with uncertainty, for predicting the target (AER classification). The feature Month Well Spudded (age of the wells in months) has the highest importance; Deviated Hole and Horizontal Hole have the lowest importance.

Figure 13. Resampling before (a) and within (b) 5-fold cross-validation with the Random Forest algorithm on the training set, for a range of $ \frac{\mathrm{Class}\hskip0.22em 1}{\mathrm{Class}\hskip0.22em 0} $ ratios. Resampling before cross-validation (a) is incorrect because it leads to overoptimistic estimates and overfitting. The metrics are accuracy, specificity, sensitivity, and AUC (area under the curve).

Figure 14. Performance of the classifiers for validation set (a) and test set (b) for the sampling ratio of $ \frac{\mathrm{Class}\hskip0.22em 1}{\mathrm{Class}\hskip0.22em 0}=1.27 $ within 5-fold cross-validation. All classifiers for the validation and test sets perform reasonably better than the Dummy Classifier. There is slight overfitting in the validation set because of fine-tuning of hyperparameters. Soft voting, which integrates Logistic Regression and Random Forest, is the most reliable classifier with high sensitivity and AUC (area under the curve), and reasonable accuracy and specificity.

Figure 15. (a) Location map of 1,000 random wells without field test in Alberta, Canada. (b) Bar chart of missing values of the wells in (a) for 22 well properties retrieved from the geoSCOUT database.

Figure 16. (a) Mean of 100 predicted probabilities of serious fluid leakage for 1,000 wells. (b) Location map for 488 wells with a probability higher than 0.6 in (a). (c) Location map for 136 wells with a probability higher than 0.8 in (a).
