A framework for scalable ambient air pollution concentration estimation

Published online by Cambridge University Press:  03 March 2025

Liam J. Berrisford*
Affiliation:
BioComplex Laboratory, Department of Computer Science, University of Exeter, Exeter, UK Department of Mathematics, University of Exeter, Exeter, UK UKRI Centre for Doctoral Training in Environmental Intelligence, University of Exeter, Exeter, UK
Lucy S. Neal
Affiliation:
Met Office, Exeter, UK
Helen J. Buttery
Affiliation:
Met Office, Exeter, UK
Benjamin R. Evans
Affiliation:
Met Office, Exeter, UK
Ronaldo Menezes
Affiliation:
BioComplex Laboratory, Department of Computer Science, University of Exeter, Exeter, UK Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
* Corresponding author: Liam J. Berrisford; Email: liam.j.berrisford@bath.edu

Abstract

Ambient air pollution remains a global challenge, with adverse impacts on health and the environment. Addressing air pollution requires reliable data on pollutant concentrations, which form the foundation for interventions aimed at improving air quality. However, in many regions, including the United Kingdom, air pollution monitoring networks are characterized by spatial sparsity, heterogeneous placement, and frequent temporal data gaps, often due to issues such as power outages. We introduce a scalable, data-driven, supervised machine learning framework designed to address temporal and spatial data gaps by filling missing measurements within the United Kingdom. The framework uses LightGBM, a gradient boosting algorithm based on decision trees, for efficient and scalable modeling. This approach provides a comprehensive dataset for England throughout 2018 at a 1 km² hourly resolution. Leveraging machine learning techniques and real-world data from the sparsely distributed monitoring stations, we generate 355,827 synthetic monitoring stations across the study area. Validation was conducted to assess the model’s performance in forecasting, estimating missing locations, and capturing peak concentrations. The resulting dataset is of particular interest to a diverse range of stakeholders engaged in downstream assessments supported by outdoor air pollution concentration data for nitrogen dioxide (NO2), ozone (O3), particulate matter with a diameter of 10 μm or less (PM10), particulate matter with a diameter of 2.5 μm or less (PM2.5), and sulphur dioxide (SO2), at a higher resolution than was previously possible.

Information

Type
Application Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Leominster AURN monitoring station NO2 measurements. (a) shows how the peak NO2 reading at the Leominster station dramatically exceeds the 24-hour limit, and even more so the annual limit, showing that there can be periods of quite extreme pollution in the context of the annual limits. (b) shows how there can be extended periods where the air pollution levels are below or above the designated limits, and how the monitoring station peak relates to all available data for the station in 2014.

Table 1. AURN monitoring station counts by environmental classification per air pollutant

Figure 2. Example feature vector dataset from each dataset family. From left to right, the example datasets are the majority land use classification for each grid (geographic family, discussed in Supplementary Section S1.8), Sentinel 5P NO2 measurements (remote sensing family, discussed in Supplementary Section S1.6), 100 m U component of wind (meteorological family, discussed in Supplementary Section S1.5), NAEI SNAP sector 7 (road transport) NOx emissions (emissions family, discussed in Supplementary Section S1.7), road infrastructure distance from the nearest motorway and total length of residential road per grid (transport infrastructure structural properties family, discussed in Supplementary Section S1.3), and the car and taxis score (transport use family, discussed in Supplementary Section S1.4).

Figure 3. Spearman correlation coefficients overall mean for all pollutants. The mean Spearman correlation coefficients for NOx and O3 across all the environmental classifications of the AURN network are shown for the 10 most extreme feature vectors, both positive and negative. The sources and sinks of the air pollutants are different, aligning with the scientific literature (Section 3.2), with NOx being highly positively correlated with emission features, whereas O3 exhibits such a relationship mainly with meteorological features, such as wind gusts. Regarding negative correlations, the two air pollutants exhibit counter relationships, with NOx having a negative correlation with the meteorological features. The analysis highlights how the relationships between a particular phenomenon and a given air pollutant can be widely different in strength.
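
As an illustration of the statistic reported in this caption, the following is a minimal sketch of a Spearman rank correlation in plain Python: both series are converted to ranks (ties share the average rank) and a Pearson correlation is taken on the ranks. The function names and the six hourly values are hypothetical and are not taken from the paper's data or code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for one station over six hours (illustration only).
road_emissions = [1.0, 2.5, 3.0, 4.2, 6.0, 9.5]
wind_gust = [12.0, 9.0, 9.5, 6.0, 4.0, 3.0]
nox = [10.0, 14.0, 19.0, 30.0, 80.0, 400.0]

rho_emissions = spearman(road_emissions, nox)  # monotone increasing relationship
rho_wind = spearman(wind_gust, nox)            # mostly decreasing relationship
```

Because only ranks enter the calculation, the single very large NOx value in the toy series has no more influence than any other point, which is one reason a rank-based measure suits skewed concentration data.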

Figure 4. Spearman correlation coefficients for NOx monitoring station environmental subclassification locations, Rural Background and Urban Traffic. While Figure 3 highlights the difference between phenomena and air pollutants, there exists a further difference between environmental subclassifications. For the Urban Traffic monitoring stations, it can be seen that the primary positive correlations are related to road transport as would be expected (the strong relationship with solvent use is likely an artefact of the scaling performed and discussed in Section 3.2 and Supplementary Section S1.7, alongside a limited sample size of 41 stations). In contrast, the Rural Background monitoring stations show a strong relationship with emissions from the residential sector, highlighting that the sources and sinks for an air pollutant depend on the air pollutant itself and the location of interest.

Figure 5. Spearman correlation heatmap between all feature vectors. The grey lines throughout the heatmap show the data points missing from the dataset, that is, phenomena with no monitoring stations across all pollutants, including four geographic features and nine emissions features.

Figure 6. Dendrogram depicting hierarchical clustering of feature vectors. The lower the linkage distance between feature vectors, the more correlated the features are, indicating that they provide similar information. Supplementary Table S7 details the number of clusters for different linkage distances.

Table 2. R2 scores depicting forecasting performance (2014–2016 train score, 2017 validation score, 2018 test score)

Figure 7. Chesterfield Loundsley Green NO2 concentrations augmented dataset, with missing AURN measurements filled with model predictions. This figure shows that the station’s measurements (green) started in early 2015 with three clear periods of long-term missing data. The model predictions (yellow) fill these gaps to create a complete augmented time series.

Table 3. R2 scores depicting forecasting performance for 5-fold leave-one-out validation

Table 4. R2 scores for missing monitoring stations performance summary statistics for 5-fold leave-one-out validation

Figure 8. Full spatial map of England for all pollutants for 8AM on 19/01/2018, chosen arbitrarily as a typical working day away from national holidays in England. Plotted on a log scale to help highlight the differences within regions in the map.

Figure 9. Prediction of peak values for NO2 monitoring stations. In (a), it is evident that the model failed to capture the peak concentration for the Leominster monitoring station. However, there is a noticeable uptick in the concentration prediction at the correct time, raising concerns about a consistent underestimation by the model. Conversely, (b) illustrates the peak prediction for the Stanford-le-Hope monitoring station. The model not only captures the peak but also yields a magnitude considerably higher than that for Leominster, offering an initial indication that the model may not be systematically underpredicting concentrations.

Table 5. Average peak concentrations prediction difference

Table 6. Mean, max, and minimum values for bias, correlation and MSE for each air pollutant across all air pollution monitoring stations

Table 7. Repeat experiments results of Tables 2 and 4 for models trained on individual dataset families (Section 3.2) for NO2

Figure 10. Count of times that a grid exceeded the outlined thresholds for NO2 in 2018. (a) shows the 10 μg/m3 threshold where one grid exceeds the threshold for every hour of the year, with 99.6% of grids exceeding the threshold at least once. (b) depicts the counts for the 25 μg/m3 where the max count was 8,656 exceedances across the year, with 63% of grids exceeding the threshold at least once. (c) uses a threshold of 40 μg/m3 where the max count for exceedances was 8,086 across the year, with 26.2% of grids exceeding the threshold at least once. (d) denotes a 200 μg/m3 threshold, where only a single grid exceeded the threshold twice across the year. Latitude 51.5, longitude −0.15 was the location that exceeded the threshold, a central London location with the postcode W1G 6JA.

Figure 11. 24-hour mean (UK-AIR, DEFRA, 2023d) exceedance counts example. The threshold used is a mean of 25 μg/m3. As the hourly level is the most common high-resolution temporal level mentioned in air quality legislation, pursuing data at this level allows for a more coarse temporal level to be calculated from the input data, resulting in the dataset providing complete legislation coverage no matter the resolution of interest.
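
The aggregation described in this caption, from hourly values to a coarser 24-hour mean, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: days are taken as calendar days rather than rolling 24-hour windows, an exceedance is a mean strictly above the threshold, and the three-day series is invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

DAILY_MEAN_THRESHOLD = 25.0  # ug/m3, the threshold quoted in the caption


def daily_mean_exceedances(hourly, threshold=DAILY_MEAN_THRESHOLD):
    """Count calendar days whose mean hourly concentration exceeds threshold.

    `hourly` is a sequence of (datetime, concentration) pairs for one grid cell.
    Calendar-day grouping is an assumption made for this sketch.
    """
    by_day = defaultdict(list)
    for timestamp, concentration in hourly:
        by_day[timestamp.date()].append(concentration)
    return sum(1 for values in by_day.values() if mean(values) > threshold)


def hourly_exceedances(hourly, threshold):
    """Count individual hours whose concentration exceeds threshold."""
    return sum(1 for _, concentration in hourly if concentration > threshold)


# Hypothetical three-day series for one grid cell (illustration only):
# day 0 is flat at 10, day 1 is flat at 30, day 2 is flat at 20 with one
# 70 ug/m3 spike at 08:00.
start = datetime(2018, 1, 19)
series = []
for hour in range(72):
    day = hour // 24
    base = [10.0, 30.0, 20.0][day]
    spike = 50.0 if day == 2 and hour % 24 == 8 else 0.0
    series.append((start + timedelta(hours=hour), base + spike))

n_daily = daily_mean_exceedances(series)     # only day 1 has a mean above 25
n_hourly = hourly_exceedances(series, 25.0)  # 24 hours on day 1 plus the spike
```

The toy series shows why the hourly level is the useful starting point: the spike on the third day counts as an hourly exceedance but is averaged away in the 24-hour mean, and either statistic can be derived from the same hourly input.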

Supplementary material: File

Berrisford et al. supplementary material

File 5.2 MB

Author comment: A framework for scalable ambient air pollution concentration estimation — R0/PR1

Comments

Dear Editor-in-Chief,

I am writing to submit our manuscript entitled “A Framework for Scalable Ambient Air Pollution Estimation” for consideration for publication in the Environmental Data Science Journal. This research presents a novel, data-driven, supervised machine learning framework for estimating ambient air pollution concentrations, addressing significant temporal and spatial data coverage gaps.

Our work leverages advanced machine learning techniques to generate synthetic monitoring stations, providing a comprehensive dataset for England in 2018 at a 1km² hourly resolution. This innovative approach enhances the understanding of air pollution patterns and offers a valuable tool for stakeholders involved in air quality management and policy-making.

We believe that our manuscript aligns well with the scope and interests of the Environmental Data Science Journal, contributing to the advancement of environmental monitoring techniques through data science. The findings of our study have the potential to significantly impact the way ambient air pollution is estimated, offering a scalable and cost-effective solution.

All authors have approved the manuscript and agree with its submission to your journal. We confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication by another journal.

Thank you for considering our manuscript for publication. We look forward to the opportunity to contribute to the Environmental Data Science Journal.

Sincerely,

Liam Berrisford on behalf of the authors

Review: A framework for scalable ambient air pollution concentration estimation — R0/PR2

Conflict of interest statement

None

Comments

I should preface my remarks by saying I’m someone who works in statistics/machine learning who has done a small amount of air pollution modelling, rather than being an air pollution expert.

Overall, this paper makes a useful contribution to the literature. It has collated a number of useful datasets and shown the ability of a particular form of model to predict air pollution. I think it should be published. However, it needs substantial revision, particularly in its presentation. Some specific comments:

- The abstract and introduction say the paper uses a machine learning model without saying what (or at least what family of model it is from). It’s a bit like saying ‘in this paper we use science to ….’. We don’t get to find out that ML means random forest here until deep into the paper (possibly not until section 5 on page 10). The early sections are at risk of presenting the ML models as a form of magic where we keep the trick hidden from the reader.

- The abstract (and discussion - but it bothers me less there) make a ludicrous claim that the data are worth $70 billion. I understand where the figure has come from, but it’s not a realistic assessment of the value - it shouldn’t be allowed in the abstract. The phrasing in the discussion is slightly less silly.

- Various mentions are made of ‘heterogeneous’ placement of the sensors. I’m not sure what this means. Would it be better to say they are placed non-uniformly, i.e., they are not spread out but appear in clusters in some places (e.g., London)? This could be a standard usage of ‘heterogeneous’ that I’m not aware of, but googling it suggests otherwise.

- Line 10 page 5 - ‘the England’

- There isn’t a single equation or bit of notation in the entire paper. Adding some would help clarify what was done. Essentially you are learning a function f that maps from X to y. X consists of what is called ‘feature vectors’ and y are the various AP data. You could be more explicit about what X is (see below), and the form for f.

- You use the abbreviation KDE without defining it. In the legend for fig S3 it uses the phrase ‘kernel distribution estimation plots’ - it should be kernel ‘density’ estimate etc.

- The discussion in the paragraph starting line 13 on p6 is confusing. Perhaps start the paragraph by saying you considered removing outliers but decided against it for these reasons…. That you don’t remove any points is only revealed at the end of the paragraph.

- I was also confused by how the features (spatial and temporal) are used. This is where some notation would help. If X is a feature map, i.e., a vector of values related to different spatial locations, do all values get passed into f to predict y at some specific location, or just the value of X at that location? Same with the temporal aspects.

- P9 discusses removing some features, but I got lost in the detail. It would be good to state clearly whether you did or didn’t remove features, and then explain why.

- What distance metric and what linkage method (single linkage, group average, complete linkage etc) are used for the dendrogram?

- Re the ML model, LightGBM. We usually think of decision trees as being used for classification problems. Here they are used for regression. It would be good to explain how. Were other ML models considered?

- What is ‘tabular prediction’ and why is this a tabular prediction problem?

- Page 12 mentions ‘homogeneous instances of data points’ - what does homogeneous mean here? I’ve not seen the phrase used in this way before.

- Throughout, there are lots of decisions made (e.g. MSE vs MAE, outlier removal, variable removal etc.) and a rationale is given. Did you try different things and see if it improved predictions or not? That could help justify some of the choices.

- I would have liked to understand the resulting model structure. What are the main effects for f? I.e., as we vary an input how does the prediction change - does this conform to what we expect for AP? What inputs is the prediction most sensitive to, or equivalently, what inputs are most valuable when predicting AP? Are the model predictions robust to small errors in the inputs? Is the model robust to small changes in its structure?

- I couldn’t follow what was done on p14 in the section discussing randomized grid search of 40 parameter sets. What are these parameters? They presumably are distinct from the LightGBM parameters.

- Re model performance, it would be good to report the MSEs directly alongside the R^2 values. Are the errors small enough to make the predictions useful? We can’t know this from R^2 alone. Some of the predictions have low R^2 values (R^2=0.3 for PM2.5 etc.). Is there any value in these predictions? It’s hard to know without discussion of the size of the errors.

- Figure 7 - the model filled predictions look very different to the real AURN measurements - they are noticeably less variable. The text doesn’t comment on this. Figure 9b is used as evidence that the model isn’t failing to report peaks, but the particular time-series chosen has no peak - it is a relatively flat series and so I’m not sure it supports the case that values aren’t under predicted.

- Why is the data used from 2014-2018? Is there not more recent data available?
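
On the question raised above of how decision trees are used for regression, the following is a minimal sketch of the general idea rather than LightGBM's actual implementation (which, among other differences, uses histogram-based split finding and leaf-wise tree growth). Each leaf predicts the mean target of the training points that fall in it, the split is chosen to minimise squared error, and boosting with squared loss fits each new tree to the residuals of the current ensemble. All names and the toy data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature.

    Returns (split, left_value, right_value); each leaf predicts the mean
    target of the training points that land in it. Assumes xs contains at
    least two distinct values.
    """
    best = None
    for split in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < split]
        right = [y for x, y in zip(xs, ys) if x >= split]
        left_value, right_value = mean(left), mean(right)
        # Sum of squared errors if each side predicts its own mean.
        sse = sum((y - left_value) ** 2 for y in left) + sum(
            (y - right_value) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, split, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    split, left_value, right_value = stump
    return left_value if x < split else right_value


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Gradient boosting with squared loss: each stump fits the residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical data: hour of day against a concentration (illustration only).
hours = [0, 3, 6, 8, 9, 12, 15, 17, 18, 21]
conc = [12.0, 10.0, 18.0, 41.0, 38.0, 25.0, 24.0, 44.0, 40.0, 20.0]

model = fit_boosted(hours, conc)
overall_mean = mean(conc)
mse_mean_only = mean((y - overall_mean) ** 2 for y in conc)
mse_boosted = mean((y - predict_boosted(model, x)) ** 2 for x, y in zip(hours, conc))
```

The output of such a model is a piecewise-constant function of the inputs, so the prediction is a continuous quantity even though every individual tree only makes threshold decisions.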

Review: A framework for scalable ambient air pollution concentration estimation — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

The authors used air pollution data together with other datasets to predict missing values across several stations and provide a spatiotemporal model prediction in England. The manuscript is interesting and to some extent novel; however, it is very long and difficult to follow throughout.

My general major concerns are:

The manuscript is very long; I’d recommend focusing on either the missing values or the spatiotemporal analysis.

The structure of the manuscript does not help the reader either. I’d suggest restructuring with the following in mind as a golden rule: introduction, methods, results, discussion, conclusions.

Section 1 reads more like a background than an introduction. Typically, an introduction combines background, what has been done in the literature, the gaps that exist, and what the manuscript tries to address. Here that is a combination of Sections 1 and 2; however, they are relatively long and need to be shorter.

The authors mention that the first stage is to predict the missing values in each station; however, they do not provide a table of the data completeness in each station. Furthermore, when they validate their model performance they do not provide a station-specific comparison or a station-categorised comparison (i.e. urban, sub-urban, rural etc.) but a general R2. They could also add additional statistics such as the Root Mean Square Error, Mean Bias, etc.
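
The station-categorised comparison suggested here could be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the records are invented, the class labels are borrowed from the AURN environmental classifications mentioned in the figure captions, and mean bias is assumed to be predicted minus observed.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def summarise(observed, predicted):
    """Return mean bias, RMSE, and R2 for paired observations and predictions."""
    # Sign convention (an assumption): positive bias means over-prediction.
    errors = [p - o for o, p in zip(observed, predicted)]
    obs_mean = mean(observed)
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return {
        "n": len(observed),
        "mean_bias": mean(errors),
        "rmse": sqrt(ss_res / len(errors)),
        "r2": 1 - ss_res / ss_tot,
    }


def summarise_by_class(records):
    """Group (station_class, observed, predicted) records and summarise each."""
    grouped = defaultdict(lambda: ([], []))
    for station_class, obs, pred in records:
        grouped[station_class][0].append(obs)
        grouped[station_class][1].append(pred)
    return {cls: summarise(obs, pred) for cls, (obs, pred) in grouped.items()}


# Hypothetical observed and predicted concentrations in ug/m3 (illustration only).
records = [
    ("Urban Traffic", 60.0, 50.0),
    ("Urban Traffic", 80.0, 70.0),
    ("Urban Traffic", 40.0, 38.0),
    ("Rural Background", 8.0, 9.0),
    ("Rural Background", 12.0, 11.0),
    ("Rural Background", 10.0, 10.0),
]

report = summarise_by_class(records)
```

In this toy example the two classes have nearly identical R2 (about 0.75) while the RMSE differs by roughly a factor of ten and only the Urban Traffic class shows a negative bias, which is the kind of difference a single overall R2 would hide.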

For the feature selection, despite correctly pointing out the multi-collinearity aspects, the authors do not seem to use a feature/variable selection process such as Feature Importance Ranking, Principal Component Analysis, or related methods that have been used in the literature for high-dimensional datasets.

The authors tried only one ML algorithm, and if they tried more, they are not mentioned in the manuscript. Different algorithms might have better predictions at specific locations, while if you use multiple and want to provide an overall estimate, an ensemble approach is typically followed.

For the predictors, the authors combine, among others, remote sensing data and meteorological data. However, they do not include Aerosol Optical Depth (AOD) or visibility variables, which are good predictors for PM2.5 and PM10 and have been used in previous ML models with good predictions. This could also be the reason for the relatively low R2 that the model gives for both PM2.5 and PM10.

Another variable that can improve the model predictions is chemical transport model outputs (concentrations of pollutants of choice at 1x1 km or 3x3 km). Simulations can be found on NASA websites or in relevant papers, e.g.:

Danesh Yazdi, M., Kuang, Z., Dimakopoulou, K., Barratt, B., Suel, E., Amini, H., Lyapustin, A., Katsouyanni, K. and Schwartz, J., 2020. Predicting fine particulate matter (PM2.5) in the Greater London area: An ensemble approach using machine learning methods. Remote Sensing, 12(6), p.914.

Specific comments:

The manuscript needs to be written in either British English or American English. Currently there is a mixture of the two, with words such as “specialized” and “optimised” both appearing.

Abstract:

line 52-53: Spell out the names of the pollutants (first time mentioning)

line 51-53: This sentence is repeated verbatim in the impact statement and the abstract. I suggest rewording it in one of the two.

Introduction

line 4-5: Check grammar

line 8-11: This is a long sentence; I suggest splitting it in two.

line 13: spell out NO2 first time mentioning in text

line 30: Check typo in the dates

line 54-55 add a reference to the numbers of the stations if possible

line 58: spell out O3

section 2.2

line 52-53: check grammar

Section 3.1

line 34 correct “particles” with “Particulate Matter”

Page 6 line 11: spell out KDE

Line 17: spell out IQR

page 7 line 30: The SO2 is no longer a pollutant that comes from the exhaust in the UK. The de-sulphurisation of the fleet happened in the 80’s & 90’s.

line 31: PM2.5 and PM10 also come from vehicles, through brake, tyre and road-dust emissions; see:

Matthaios, V.N., Lawrence, J., Martins, M.A., Ferguson, S.T., Wolfson, J.M., Harrison, R.M. and Koutrakis, P., 2022. Quantifying factors affecting contributions of roadway exhaust and non-exhaust emissions to ambient PM10–2.5 and PM2.5–0.2 particles. Science of The Total Environment, 835, p.155368.

line 41: add “speed and direction” next to the word wind

line 44: temperature does not enhance the removal of air pollutants by plants. Plants have a cooling effect that drops temperature. Not sure what the authors are trying to say here

line 44 page 8: do the authors mean disentangle?

section 4.2

line 37-38: Not sure what the authors are trying to say here. However, it would be better if they could elaborate on how they expect that engagement.

Did the authors do any sensitivity analysis in the modelling or in the feature selection?

line 8 page 14: What is SHAP??

line 38 page 16: Typo in PM10?

Section 5.5

Can the authors explain what are the test scores?

line 57-58: Can the authors rephrase the sentence here and in the supplementary? The term worldwide is confusing. I’d suggest just naming the databases that are not local to England.

Recommendation: A framework for scalable ambient air pollution concentration estimation — R0/PR4

Comments

No accompanying comment.

Decision: A framework for scalable ambient air pollution concentration estimation — R0/PR5

Comments

No accompanying comment.

Author comment: A framework for scalable ambient air pollution concentration estimation — R1/PR6

Comments

No accompanying comment.

Review: A framework for scalable ambient air pollution concentration estimation — R1/PR7

Conflict of interest statement

Reviewer declares none.

Comments

I’m happy with the changes made. My comments from my original review still hold: The paper makes a useful contribution in terms of collating air pollution data, and showing the skill of decision trees in predicting pollution levels. The paper raises many additional questions, but given the paper is already long, these are probably best left to future works.

Recommendation: A framework for scalable ambient air pollution concentration estimation — R1/PR8

Comments

Some of the links are broken. The authors are asked to verify that all links are correct, e.g., the link to OpenAir is broken: https://davidcarslaw.github.io/openair/

Decision: A framework for scalable ambient air pollution concentration estimation — R1/PR9

Comments

No accompanying comment.