Using Gaussian processes for spatial prediction of PM2.5 concentration based on calibrated data from distributed low-cost sensor networks

Lillian Muyama; Richard Sserunjogi; Deo Okure; Engineer Bainomugisha

doi:10.1017/eds.2025.10026

Published online by Cambridge University Press: 12 December 2025

Lillian Muyama*: Affiliation:
AirQo, Department of Computer Science, Makerere University, Uganda
Richard Sserunjogi: Affiliation:
AirQo, Department of Computer Science, Makerere University, Uganda
Deo Okure: Affiliation:
AirQo, Department of Computer Science, Makerere University, Uganda
Engineer Bainomugisha: Affiliation:
AirQo, Department of Computer Science, Makerere University, Uganda
*: Corresponding author: Lillian Muyama; Email: muyamalillian@gmail.com

Abstract

Air pollution is a major environmental and public health risk globally, leading to millions of premature deaths annually and negative economic effects. One of the key challenges in managing air quality is the availability of actionable spatial air quality data. The sparse networks or absence of air quality monitoring stations in many places means that there are limited data and information on air pollution in places without coverage. The spatial prediction of air quality can contribute to increasing data access for locations without air quality monitoring, ultimately improving awareness of the risk of air pollution exposure for vulnerable people. In this study, we investigated the air quality prediction task in two cities in Uganda (i.e., Jinja and Kampala), with unique geographic and economic contexts. Primarily, we used Gaussian processes to predict the PM $ {}_{2.5} $ levels in the two cities, selected because of their relative importance in the country and their varying characteristics. We achieved promising results with an average root-mean-square error (RMSE) of 18.32 μg/m³ and 16.88 μg/m³ in Kampala and Jinja, respectively. These results provide valuable insights into the air quality profiles of two urban sub-Saharan cities with different demographics, which can in turn aid in decision-making for targeted actions at different levels.

Keywords

air quality; Gaussian processes; particulate matter; prediction

Information

Type: Application Paper
Information: Environmental Data Science, Volume 4, 2025, e52

DOI: https://doi.org/10.1017/eds.2025.10026
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices: Open data Open materials
Copyright: © The Author(s), 2025. Published by Cambridge University Press

Impact Statement

In many low-income countries, air quality monitors are few and far between. Even with the development of an air quality monitoring network, many locations remain unmonitored, resulting in unknown air quality due to the absence of air quality sensors. The methodology in this study provides an approach to understanding exposure to particulate matter in a given geographical area with a sparse distribution of air quality sensors. This study is especially important because very limited studies have so far been done in this geographic area, and it can be replicated in other areas with similar conditions. Additionally, it provides insightful information that has a broader impact on health, environmental sustainability, and government policy in the region.

1. Introduction

Air pollution is one of the world’s biggest environmental health problems (Ritchie and Roser, Reference Ritchie and Roser2017), and ambient (outdoor) air pollution is estimated to account for around 4.2 million annual deaths worldwide (World Health Organization, 2022). It has also been shown that low- and middle-income countries are disproportionately affected by the effects of air pollution (OECD iLibrary, 2017), and yet, in many of these countries, air quality monitoring networks are sparse or nonexistent. In sub-Saharan Africa, for example, only a handful of countries, such as South Africa (The World Air Quality Index Project, 2008), Uganda (AIRQO Africa, 2018), Kenya (Nairobi City County, 2023), and Rwanda (Rwanda Air Quality, 2021), have established continuous air quality networks, mostly driven by research-led initiatives in recent years. Many African urban areas exhibit significant variations in pollution profiles (Dobson et al., Reference Dobson, Siddiqi, Ferdous, Huque, Lesosky, Balmes and Semple2021; Green et al., Reference Green, Okure, Adong, Sserunjogi and Bainomugisha2022), which necessitates high-resolution monitoring, yet resource constraints prohibit establishing continuous monitoring networks. Spatial predictions have the potential to close the data gaps in areas with limited monitoring networks.

In this article, we present a novel case study of spatial prediction for two cities (Kampala and Jinja) in Uganda, a country in Eastern Africa. We selected Uganda primarily due to the presence of a recently established continuous air quality network of PM $ {}_{2.5} $ and PM $ {}_{10} $ sensors, developed and managed by AirQo (Sserunjogi et al., Reference Sserunjogi, Ssematimba, Okure, Ogenrwot, Adong, Muyama, Nsimbe, Bbaale and Bainomugisha2022; AIRQO Africa, 2018). This network employs calibrated low-cost sensors (Adong et al., Reference Adong, Bainomugisha, Okure and Sserunjogi2022) to contribute to a more precise understanding of the air quality in the country. This information could, in turn, be used to inform decision-making and policy. In addition, Uganda is a low-income sub-Saharan African country with a highly variable pollution profile (Kirenga et al., Reference Kirenga, Meng, Van Gemert, Aanyu-Tukamuhebwa, Chavannes, Katamba, Obai, Van der Molen, Schwander and Mohsenin2015; Okure et al., Reference Okure, Ssematimba, Sserunjogi, Gracia, Soppelsa and Bainomugisha2022), which may be due to differences in activity rates, population levels, and traffic levels, among others. Since urban centers are more polluted than rural areas, we chose the two most populated and highly urbanized cities in the country for this study. Kampala is Uganda’s capital city, while Jinja is a historically large industrial hub in the country (Uganda Investment Authority, 2022). In our study, we specifically look at particulate matter (PM) with a diameter ≤2.5 μm, that is, PM $ {}_{2.5} $ , because it has been shown to have adverse health effects (Curtis et al., Reference Curtis, Rea, Smith-Willis, Fenyves and Pan2006; Laumbach, Reference Laumbach2010; Zheng et al., Reference Zheng, Pozzer, Cao and Lelieveld2015). We utilize the existing network to predict PM $ {}_{2.5} $ concentrations in locations without sensors and demonstrate the potential to use spatial predictions to close the data gaps in other low- and middle-income countries.

Previous work has largely focused on forecasting the air quality of a location by relying on historical data from the same location (Vong et al., Reference Vong, Ip, Wong and Yang2012; Petelin et al., Reference Petelin, Grancharova and Kocijan2013; Jaiswal et al., Reference Jaiswal, Samuel and Kadabgaon2018; Lei et al., Reference Lei, Monjardino, Mendes, Gonçalves and Ferreira2019), or by employing weather parameters to predict the air quality of an area (Yu et al., Reference Yu, Yang, Yang, Han and Move2016; Athira et al., Reference Athira, Geetha, Vinayakumar and Soman2018; Liu et al., Reference Liu, Yang, Huang, Wang and Yoo2018; Wang and Song, Reference Wang and Song2018; Khurram and Lim, Reference Khurram and Lim2024; Lim et al., Reference Lim, Owusu, Thongrod, Khurram, Pongsiri, Ingviya and Buya2024). These studies employ a range of approaches, encompassing deterministic approaches, such as chemical transport models (CTMs), which are known for their computational intensity and substantial data requirements. Machine learning (ML) techniques, such as decision trees, support vector machines, and neural networks, have also been leveraged. For instance, long short-term memory (LSTM) is used in (Mao et al., Reference Mao, Wang, Jiao, Zhao and Liu2021) to predict the future air quality of an area based on previous air quality readings, while (Khurram and Lim, Reference Khurram and Lim2024) used lag-dependent Gaussian process (GP) models to predict and forecast PM $ {}_{2.5} $ , PM $ {}_{10} $ , O $ {}_3 $ , NO $ {}_2 $ , and CO. Additional approaches that are used in previous works are more extensively discussed in Section 2.

In contrast to the above works, this article aims to investigate whether historical data from multiple air quality sensors can be used to accurately predict the air quality in another location. Consequently, this method is more applicable in areas lacking an air quality sensor or historical air quality data. This is a somewhat different task from the previous studies. We therefore propose to use GPs because of their non-linearity and their ability to provide uncertainty estimates. Additionally, the nature of the data, which is time-dependent and periodic, makes GPs the most appropriate solution for this problem.

The main contributions of this article include:

• We propose an air quality prediction approach for a location that uses only historical air quality observations from other locations, applied in two urban settings in sub-Saharan Africa. We exploit the similarity of air quality levels at points that are close to each other in distance and/or time.
• We demonstrate the use of GP Regression (GPR) for PM $ {}_{2.5} $ prediction, and we show that there is a relationship between the quality of the prediction and the rate of change in PM $ {}_{2.5} $ concentration, regardless of distance and time.
• We evaluate our approach using leave-one-out cross-validation, whereby we predict PM $ {}_{2.5} $ levels of a location using data from the rest of the locations. Our approach performed well in both use cases, providing useful predictions that may be used to guide interventions and policy.

The rest of the article is structured as follows: Section 2 presents the previous studies that are related to this work; Section 3 introduces the methods that were used to conduct the study; and Section 4 shows the results of these methods. In Section 5, the implication of these results and the challenges faced are discussed, while Section 6 presents the conclusions made and the direction of future works.

2. Related work

Previous approaches to the problem of air quality prediction can be divided into two broad categories, that is, deterministic and statistical approaches. One such deterministic approach is the operational street pollution model (OSPM) (Hertel et al., Reference Hertel, Berkowicz, Larssen, van Dop and Steyn1991), which has been used to model air pollution in various cities. Hung et al. (Reference Hung, Ketzel, Jensen and Oanh2010) used OSPM to estimate the air pollution levels for five streets in Hanoi, Vietnam. In addition to existing pollutants including various nitrogen oxides, carbon monoxide, sulfur dioxide, and benzene, OSPM also requires, as part of its input, the street and building configuration, hourly traffic emissions data, hourly meteorological data, hourly urban background concentration data, and so forth. All these data are subsequently used to calculate the street pollution level. Similarly, Ketzel et al. (Reference Ketzel, Jensen, Brandt, Ellermann, Olesen, Berkowicz and Hertel2012) used OSPM to predict air quality measurements over a multiyear period in Copenhagen, Denmark. They used measurements of various nitrogen oxides as well as PM $ {}_{2.5} $ and PM $ {}_{10} $ , in conjunction with traffic emission data and the street and building configurations. However, this model is specific to traffic-based emissions. It also requires various nontrivial input data that may be complicated to obtain, especially at an hourly time resolution, and is therefore difficult to use.

CTMs (Seigneur and Dennis, Reference Seigneur, Dennis, Hidy, Brook, Demerjian, Molina, Pennell and Scheffe2011) are another type of deterministic approach for air quality prediction. Uno et al. (Reference Uno, Carmichael, Streets, Tang, Yienger, Satake, Wang, Woo, Guttikunda, Uematsu, Matsumoto, Tanimoto, Yoshioka and Iida2003) provided a detailed description of a CTM, that is, Chemical Weather Forecast System, which was used in the prediction of several pollutants such as carbon monoxide, sulfate, radon, and mineral dust. Finardi et al. (Reference Finardi, De Maria, D’Allura, Cascone, Calori and Lollobrigida2008) developed a forecasting system for the prevention and management of air pollution episodes for PM $ {}_{10} $ , NO $ {}_2 $ , and O $ {}_3 $ partially based on a CTM simulation. Likewise, Stroud et al. (Reference Stroud, Moran, Makar, Gong, Gong, Zhang, Slowik, Abbatt, Lu, Brook, Mihele, Li, Sills, Strawbridge, McGuire and Evans2012) used a CTM to make predictions of primary organic aerosol, black carbon, and carbon monoxide. In a study by Roozitalab et al. (Reference Roozitalab, Carmichael and Guttikunda2021), a CTM was used in the prediction of an extreme pollution event in the Indo-Gangetic Plain. However, these models are very computationally expensive. They also require up-to-date emissions data for the area for which the prediction is to be made; such data are not always readily available, which in turn can affect the accuracy of the results.

Statistical approaches have also been utilized in the prediction of air quality levels. Lei et al. (Reference Lei, Monjardino, Mendes, Gonçalves and Ferreira2019) predicted air quality levels for the next day using multiple linear regression, and classification and regression trees. Historical air quality data over a 5-year period, as well as meteorological data, were utilized in this study. Similarly, using annual concentration levels of various pollutants, Jaiswal et al. (Reference Jaiswal, Samuel and Kadabgaon2018) applied the autoregressive integrated moving average model to predict future annual concentrations. This was done for CO, NO $ {}_2 $ , SO $ {}_2 $ , and PM.

Traditional ML methods (Mitchell, Reference Mitchell1997) have been applied to the air quality prediction task as well. Vong et al. (Reference Vong, Ip, Wong and Yang2012) used a support vector machine (SVM) to forecast daily ambient air pollution using meteorological (temperature, humidity, rainfall, wind direction, wind speed, and precipitation) and pollutant (NO $ {}_2 $ , SO $ {}_2 $ , suspended PM, and O $ {}_3 $ ) features. Data from the previous day and the current day were used to forecast the pollution level of the different pollutants for the next day. Similarly, Song et al. (Reference Song, Pang, Longley, Olivares and Sarrafzadeh2014) used support vector regression for PM $ {}_{2.5} $ prediction. Likewise, Yu et al. (Reference Yu, Yang, Yang, Han and Move2016) employed random forest (RF) to predict the air quality concentration given meteorological data (temperature, humidity, barometric pressure, wind speed, visibility), the length of the road, the traffic congestion status of the road, and the point of interest distribution, which shows the land use of the area. By applying RF to these hourly data, they were able to predict the Air Quality Index (AQI) level of specific locations. In a study by Zhang et al. (Reference Zhang, Wang, Gao, Ma, Zhao, Zhang, Wang and Huang2019), light gradient-boosting machine (LightGBM) was used to predict PM $ {}_{2.5} $ levels in Beijing over the next 24 hours. They used meteorological data such as temperature and humidity, temporal features such as the day of the week and the hour of day, air quality data from other pollutants, namely, CO, SO $ {}_2 $ , NO $ {}_2 $ , O $ {}_3 $ , and PM $ {}_{10} $ , as well as the weather forecast for the next 24 hours. Additional studies have also analyzed and compared the performance of various ML methods for this problem (Doreswamy et al., Reference Doreswamy, Harishkumar, Yogesh and Gad2020; Liang et al., Reference Liang, Maimury, Chen and Juarez2020).

More recently, with the exponential increase in data available, there has been a resurgence in the usage of deep learning methods (LeCun et al., Reference LeCun, Bengio and Hinton2015) to accomplish a multitude of tasks. Wang and Song (Reference Wang and Song2018) built an ensemble model comprised of three components to predict air quality. The predictor component was based on LSTM, and the features considered were meteorological as well as spatial–temporal features. Similarly, Mao et al. (Reference Mao, Wang, Jiao, Zhao and Liu2021) used deep learning to predict the air quality for the next 24 hours using historical hourly air quality measurements. Likewise, using meteorological data such as temperature and humidity as well as spatial features, Athira et al. (Reference Athira, Geetha, Vinayakumar and Soman2018) forecast PM $ {}_{10} $ values utilizing three deep learning architectures, namely, recurrent neural network (RNN), LSTM, and gated recurrent unit. Other papers that use LSTM include (Kim et al., Reference Kim, Park, Song, Lee, Yun, Kim, Jeon, Lee and Han2019; Qin et al., Reference Qin, Yu, Zou, Yong, Zhao and Zhang2019; Kalajdjieski et al., Reference Kalajdjieski, Mirceva and Kalajdziski2020; Lin et al., Reference Lin, Chen, Yang, Xu and Fang2020). However, these studies are all mainly based on historical air quality and weather data, and/or features of a specific location, and yet in many cases, such data are not readily available due to a multitude of reasons such as the absence of an air quality monitoring station.

Iyer et al. (Reference Iyer, Balashankar, Aeberhard, Bhattacharyya, Rusconi, Jose, Soans, Sudarshan, Pande and Subramanian2022) used message-passing RNNs (MPRNNs) to model air pollution maps for Delhi, India, using a network of 60 air quality monitors, including both low-cost and reference-grade air quality monitors. PM $ {}_{2.5} $ data collected at an hourly resolution over a two-year period were used to train the models and make predictions up to an hour in advance. In contrast, in our study, we aim to use readings from only low-cost air quality monitors. Also, Delhi may have a different air quality profile from the two Ugandan cities because of different climatic conditions and different sources of air pollution, among other reasons.

In addition, a few studies have employed GPs (Rasmussen, Reference Rasmussen, Bousquet, von Luxburg and Rätsch2003), such as in Liu et al. (Reference Liu, Yang, Huang, Wang and Yoo2018) where a GPR model was used to model the indoor air quality of a subway. Seven indoor pollutants and two meteorological variables, that is, temperature and humidity, were used in this study. Similarly, in Petelin et al. (Reference Petelin, Grancharova and Kocijan2013), various first and high-order GP models were used to predict the ozone concentration in the air of Bourgas in Bulgaria, utilizing hourly measurements of ozone, sulfur dioxide, nitrogen dioxide, phenol, and benzene plus several meteorological parameters. In our study, we focus on PM $ {}_{2.5} $ prediction while using only air quality measurements from a network of low-cost monitoring devices.

In this study, we propose to use GPR to predict the outdoor PM $ {}_{2.5} $ concentration of a geographical location without any historical or current air quality information using only the PM $ {}_{2.5} $ data collected from air quality monitoring devices in other areas, as well as the spatial (latitude and longitude) and temporal (time) features of those readings.

3. Methodology

3.1. Study areas

This study considered two cities in Uganda, namely Kampala and Jinja, which are located in the Central and Eastern regions of the country. Figure 1 depicts a map of Uganda showing the location of the two cities. We focused on two cities because it is demonstrated in Cross (Reference Cross2021) that air quality levels are generally worse in urban areas than in rural areas. It is also shown in Mcdonald (Reference Mcdonald2012) that areas with high commercial activity experience higher pollution.

Figure 1.

A map of Uganda showing the locations of Kampala and Jinja cities.

Kampala is the capital city of Uganda and is situated in the central region of the country, just North of Lake Victoria, at an average altitude of about 1190 m (The Editors of Encyclopaedia Britannica, 2024). Over the past years, Kampala’s air quality has been found to be up to 11 times the World Health Organization (WHO) health guidelines (Adong et al., Reference Adong, Bainomugisha, Okure and Sserunjogi2022; Okure et al., Reference Okure, Ssematimba, Sserunjogi, Gracia, Soppelsa and Bainomugisha2022). This is to be expected given the high population density as the Kampala metropolitan area is estimated to be home to around 10% of the entire country’s population (The World Bank, 2018).

On the other hand, Jinja, which has the second largest economy, lies about 81 km to the East of Kampala along the Northern shore of Lake Victoria. It lies at an estimated altitude of about 1140 meters (The Editors of Encyclopaedia Britannica, 2023) and also hosts the Nile River. Since Jinja only acquired its city status in 2020, the total population is uncertain, but estimates put it at about 300,000 (Kazungu, Reference Kazungu2020). It is also a big industrial hub hosting over 100 manufacturing industries (Uganda Investment Authority, 2022). Maps showing some of the locations of the air quality monitors used in this study for Kampala and Jinja are shown in Figures 2 and 3, respectively. More contextual details on Kampala and Jinja can be found in Vermeiren et al. (Reference Vermeiren, Van Rompaey, Loopmans, Serwajja and Mukwaya2012) and McQuaid et al. (Reference McQuaid, Vanderbeck, Valentine, Liu, Chen, Zhang and Diprose2018) respectively.

Figure 2.

A map of Kampala showing some of the sensor locations.

Figure 3.

A map of Jinja showing the sensor locations used in this study.

3.2. Dataset description

We consider a dataset from a distributed network of low-cost air quality sensors (Sserunjogi et al., Reference Sserunjogi, Ssematimba, Okure, Ogenrwot, Adong, Muyama, Nsimbe, Bbaale and Bainomugisha2022), which currently has over 100 devices installed in Uganda. With the aim of providing air quality data for locations with diverse features, the sites in which these devices are installed are selected based on various factors, such as population density, land use, traffic levels, and nearness to known pollution sources such as dusty roads. The devices primarily measure PM $ {}_{2.5} $ and PM $ {}_{10} $ . However, as shown in Figures 2 and 3, there exist several locations that do not have an air quality monitor. To understand the air quality in these areas, we use the data from the existing network devices to predict the air quality there.

In addition, for our experiments, we use calibrated data because low-cost sensors can be affected by weather conditions, notably, temperature and relative humidity, such that there may be some inconsistency between their measurements and the ground truth (Adong et al., Reference Adong, Bainomugisha, Okure and Sserunjogi2022; Davda, Reference Davda2023). Low-cost sensors are portable and affordable air quality monitors usually costing between $100 and $2000 USD that provide high temporal and spatial resolution data, though they often require calibration to ensure measurement accuracy and reliability (Adong et al., Reference Adong, Bainomugisha, Okure and Sserunjogi2022; Bainomugisha et al., Reference Bainomugisha, Ssematimba, Okedi, Nsubuga, Banda, Settala and Lubisia2023). Therefore, device calibration ensures that our results are more reliable and accurate. The calibration function is derived using data from reference-grade air quality monitors, which in our case are Met One Beta Attenuation Monitors Model 1022 (Met One Instruments, 2023), as well as data from low-cost monitors that are collocated with these reference monitors (Adong et al., Reference Adong, Bainomugisha, Okure and Sserunjogi2022).

We considered the data for a three-month period, that is, from 1 September 2021 to 30 November 2021. We chose this time period because, although both cities had established air quality networks at that point, several areas within the cities still lacked air quality monitors, and thus were not well covered by the networks. The data were collected from 34 and 10 air quality monitors for Kampala and Jinja, respectively, and a total of 48,112 and 13,653 records were used for the two locations. The data were recorded at an hourly interval. The features considered were latitude, longitude, and time. Meteorological features, that is, temperature, relative humidity, wind speed, and wind direction, were also considered, but these did not have any significant effect on the model performance and are therefore not included in the final results. During the study period, the average meteorological conditions in Kampala and Jinja were generally warm and humid. Kampala exhibited a mean temperature of 22.09 °C, mean relative humidity of 82.16%, and a modest mean wind speed of 0.81 m/s. In contrast, Jinja experienced slightly higher temperatures, averaging 22.57 °C, slightly lower mean relative humidity at 77.34%, and stronger wind speeds averaging 3.42 m/s. Comparable observations are reported in a 3-year study (Alaran et al., Reference Alaran, Natasha, Lambed, Sserunjogi and Okello2024), which explored seasonal and spatial PM $ {}_{2.5} $ and the meteorological influence on it across Kampala and Jinja. A summary of the meteorological conditions during this period is provided in Appendix A.

During data preprocessing, all rows containing missing PM $ {}_{2.5} $ values were dropped. Additionally, the datetime column was converted to a timestamp and divided by 3600, which is the number of seconds in an hour.
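The preprocessing step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline: the record layout, the field names (`datetime`, `latitude`, `longitude`, `pm2_5`), the example values, and the choice to interpret timestamps as UTC are all assumptions made for the example.

```python
from datetime import datetime, timezone

def to_hours(ts: str) -> float:
    """Convert an ISO 8601 datetime string to hours since the Unix epoch."""
    # Assumption: timestamps are interpreted as UTC.
    dt = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return dt.timestamp() / 3600.0  # 3600 seconds in an hour

def preprocess(rows):
    """Drop rows with missing PM2.5 and build (longitude, latitude, time) features."""
    features, targets = [], []
    for row in rows:
        if row["pm2_5"] is None:  # missing reading: drop the whole row
            continue
        features.append((row["longitude"], row["latitude"], to_hours(row["datetime"])))
        targets.append(row["pm2_5"])
    return features, targets

# Hypothetical records; coordinates and readings are invented for illustration.
rows = [
    {"datetime": "2021-09-01T00:00:00", "latitude": 0.3476, "longitude": 32.5825, "pm2_5": 41.0},
    {"datetime": "2021-09-01T01:00:00", "latitude": 0.3476, "longitude": 32.5825, "pm2_5": None},
    {"datetime": "2021-09-01T02:00:00", "latitude": 0.3476, "longitude": 32.5825, "pm2_5": 38.5},
]
X, y = preprocess(rows)
```

Expressing time in hours means that consecutive hourly readings differ by exactly one unit, so a temporal lengthscale given in hours can be read directly against the sampling interval.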

3.3. Methods

For this task, we propose to use GPR because it is an efficient nonparametric method for modeling nonlinear problems, which, in addition to providing the predictions, also provides uncertainty estimates over those predictions and hence is useful for quantifying their reliability. GPR models also have a variety of kernels which are able to model almost any task and are thus very flexible (Rasmussen, Reference Rasmussen, Bousquet, von Luxburg and Rätsch2003; Seeger, Reference Seeger2004).

In (Rasmussen, Reference Rasmussen, Bousquet, von Luxburg and Rätsch2003), a GP is defined as a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions. A GP is defined by its mean and covariance function as shown in Equation 3.1, where $ \mu $ and $ \Sigma $ denote the mean and covariance function or kernel, respectively.

(3.1)

$$ f\sim \mathcal{N}\left(\mu, \Sigma \right) $$

In GPR, it is assumed that the observations are drawn from a noisy function as shown in Equation 3.2, where the additive noise, $ \varepsilon $ , is defined by a mean of 0 and a variance of $ {\sigma}_n^2 $ .

(3.2)

$$ y=f(X)+\varepsilon, \quad \varepsilon \sim \mathcal{N}\left(0,{\sigma}_n^2\right) $$

Consider a dataset, $ D=\left(X,y\right) $ , whereby $ X=\left\{{x}_1,{x}_2,\dots, {x}_n\right\} $ with each $ {x}_i\in {\mathbb{R}}^m $ , and $ y=\left\{{y}_1,{y}_2,\dots, {y}_n\right\} $ , where $ m $ and $ n $ represent the number of features and the number of samples in the dataset, respectively. Assuming a mean of 0 for our prior distribution, to predict $ {y}_{\ast}\in \mathbb{R} $ for a new input $ {X}_{\ast}\in {\mathbb{R}}^m $ using GPR, we need to learn the nonlinear mapping between $ X $ and $ y $ as shown in Equation 3.3. Evaluating the covariance function $ k $ on all pairs of training inputs gives an $ n $ × $ n $ covariance matrix, which defines the correlation between the input data points.

(3.3)

$$ y\sim \mathcal{N}\left(0,k\left(x,{x}^{\prime}\right)\right) $$

For this study, we used the radial basis function (RBF) kernel, which is defined as shown in Equation 3.4, where $ {\sigma}_f^2 $ is the signal variance and $ \mathrm{\ell} $ is the lengthscale parameter. The RBF kernel is also sometimes known as the Gaussian kernel or the squared exponential kernel.

(3.4)

$$ k\left(x,{x}^{\prime}\right)={\sigma}_f^2\exp \left(-\frac{\parallel x-{x}^{\prime }{\parallel}^2}{2{\mathrm{\ell}}^2}\right) $$
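As an illustration of Equation 3.4, the sketch below computes the RBF kernel matrix with NumPy. It is not taken from the authors' code. The function name `rbf_kernel` is hypothetical, and allowing one lengthscale per input dimension is an assumed generalization of Equation 3.4 (which has a single $ \mathrm{\ell} $ ), chosen here because separate lengthscales for longitude, latitude, and time are natural for this problem.

```python
import numpy as np

def rbf_kernel(X1, X2, signal_variance=1.0, lengthscale=1.0):
    """RBF (squared exponential) kernel matrix between the rows of X1 and X2.

    lengthscale may be a scalar (Equation 3.4) or, as an assumed extension,
    one value per input dimension.
    """
    lengthscale = np.asarray(lengthscale, dtype=float)
    A = np.atleast_2d(X1) / lengthscale
    B = np.atleast_2d(X2) / lengthscale
    # Pairwise squared Euclidean distances between the scaled inputs.
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return signal_variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0))

# Toy inputs in (longitude, latitude, time) order; values are illustrative only.
X = np.array([[0.00, 0.0, 0.0],
              [0.08, 0.0, 0.0],
              [0.00, 0.0, 1.0]])
K = rbf_kernel(X, X, signal_variance=625.0, lengthscale=[0.08, 0.08, 1.0])
```

Two points that are exactly one lengthscale apart along a single dimension have covariance $ {\sigma}_f^2\exp \left(-1/2\right) $ , which is how the lengthscale controls how quickly similarity decays with distance or time.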

Model training involves minimizing the negative of the log marginal likelihood shown in Equation 3.5, where $ K $ is the covariance matrix of size $ N $ × $ N $ (with $ N $ the number of training samples), $ I $ is the identity matrix, and $ {\sigma}_n^2 $ is the noise variance.

(3.5)

$$ \mathrm{\mathcal{L}}=\frac{-N}{2}\log 2\pi -\frac{1}{2}\log \mid K+{\sigma}_n^2I\mid -\frac{1}{2}{y}^T{\left(K+{\sigma}_n^2I\right)}^{-1}y $$
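Equation 3.5 can be evaluated numerically as in the sketch below. Using a Cholesky factorization instead of an explicit matrix inverse is a common implementation choice and an assumption of this example, not something stated in the text; the function name and the toy values are likewise illustrative.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_variance):
    """Evaluate Equation 3.5 for a kernel matrix K (N x N) and targets y (N,)."""
    N = y.shape[0]
    Ky = K + noise_variance * np.eye(N)
    L = np.linalg.cholesky(Ky)                           # Ky = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |Ky|
    return -0.5 * N * np.log(2.0 * np.pi) - 0.5 * log_det - 0.5 * y @ alpha

# Toy example with a hand-written 2 x 2 kernel matrix.
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
y = np.array([1.0, -1.0])
lml = log_marginal_likelihood(K, y, noise_variance=0.1)
```

Training then amounts to choosing the kernel and noise parameters that make this quantity as large as possible, or equivalently its negative as small as possible.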

Thus, after training, for our new input data matrix, $ {X}_{\ast } $ , the predictive Gaussian distribution $ \mathcal{N}\left({\mu}_{\ast },{\Sigma}_{\ast}\right) $ has the mean and covariance shown in Equations 3.6 and 3.7, respectively.

(3.6)

$$ {\mu}_{\ast }=K\left({X}_{\ast },X\right){\left(K\left(X,X\right)+{\sigma}_n^2I\right)}^{-1}y $$

(3.7)

$$ {\Sigma}_{\ast }=K\left({X}_{\ast },{X}_{\ast}\right)-K\left({X}_{\ast },X\right){\left(K\left(X,X\right)+{\sigma}_n^2I\right)}^{-1}K\left(X,{X}_{\ast}\right) $$
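Equations 3.6 and 3.7 translate almost line by line into NumPy, as the following sketch shows. It assumes a zero prior mean and the RBF kernel of Equation 3.4; the helper names `rbf` and `gp_predict` and the one-dimensional toy data are hypothetical and serve only to make the example self-contained.

```python
import numpy as np

def rbf(A, B, variance, lengthscale):
    """RBF kernel matrix between the rows of A and B."""
    lengthscale = np.asarray(lengthscale, dtype=float)
    diff = (A[:, None, :] - B[None, :, :]) / lengthscale
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def gp_predict(X, y, X_star, variance, lengthscale, noise_variance):
    """Predictive mean (Eq. 3.6) and covariance (Eq. 3.7) at the rows of X_star."""
    Ky = rbf(X, X, variance, lengthscale) + noise_variance * np.eye(len(X))
    K_s = rbf(X_star, X, variance, lengthscale)        # K(X*, X)
    K_ss = rbf(X_star, X_star, variance, lengthscale)  # K(X*, X*)
    mean = K_s @ np.linalg.solve(Ky, y)
    cov = K_ss - K_s @ np.linalg.solve(Ky, K_s.T)
    return mean, cov

# One-dimensional toy data: three training points and two query points.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 2.0, 3.0])
X_star = np.array([[1.0], [10.0]])
mean, cov = gp_predict(X, y, X_star, variance=1.0, lengthscale=1.0,
                       noise_variance=1e-6)
```

In this toy case the prediction at the query point that coincides with a training input stays close to the observed value with little variance, while the prediction far from all training inputs falls back to the zero prior mean with variance close to $ {\sigma}_f^2 $ . The same behavior is what lets the predictive variance signal low confidence at locations far from any sensor.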

In this work, we used a full GPR model from the GPflow Python package (Matthews et al., Reference Matthews, van der Wilk, Nickson, Fujii, Boukouvalas, León-Villagrá, Ghahramani and Hensman2017) to predict the PM $ {}_{2.5} $ levels in the locations of interest. The model hyperparameters were optimized using a random search strategy. For Kampala, the lengthscales were set to 0.08°, 0.08°, and 1 hour for longitude, latitude, and time, respectively, and were fixed during the training process. The signal variance and likelihood variance were initialized to 625 and 400, respectively. For Jinja, the likelihood variance was fixed to 400, and the lengthscales were initialized to 0.008°, 0.008°, and 2 hours for longitude, latitude, and time, respectively. Due to the model’s computational complexity, we sampled only a subset of rows from the training dataset for some device locations.

3.4. Experiment setup and evaluation

For each city, the model was trained using data from all devices in that city except one, which we call the hold-out device. The data from this hold-out device served as the test set to evaluate the model’s performance. Thereafter, various performance metrics were computed by comparing the model’s predicted PM $ {}_{2.5} $ concentration with the actual measurements from the test data. This process was repeated across multiple experiments, each time selecting a different device in the city as the hold-out device. Take Jinja, for instance, where 10 air quality monitors were used. A total of 10 experiments were conducted, whereby for each experiment, a different device was used as the hold-out device, while data from the 9 remaining devices were used to train the model.
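The hold-out procedure described above can be sketched as a short loop. To keep the example self-contained, the predictor is a deliberately trivial placeholder that returns the training mean; it stands in for the GPR model and is not the method used in the study. The record layout, device identifiers, and readings are hypothetical.

```python
from collections import defaultdict
import math

def leave_one_device_out(records, fit_predict):
    """Return {device_id: RMSE}, holding out each device in turn.

    records: iterable of (device_id, features, pm25) tuples.
    fit_predict: callable(train_X, train_y, test_X) -> list of predictions.
    """
    by_device = defaultdict(list)
    for device_id, x, y in records:
        by_device[device_id].append((x, y))

    results = {}
    for held_out in by_device:
        # Train on every device except the hold-out device.
        train = [(x, y) for d, rows in by_device.items() if d != held_out
                 for x, y in rows]
        test = by_device[held_out]
        preds = fit_predict([x for x, _ in train], [y for _, y in train],
                            [x for x, _ in test])
        sq_err = [(y - p) ** 2 for (_, y), p in zip(test, preds)]
        results[held_out] = math.sqrt(sum(sq_err) / len(sq_err))
    return results

def mean_baseline(train_X, train_y, test_X):
    """Placeholder predictor (NOT a GP): predicts the training mean everywhere."""
    mu = sum(train_y) / len(train_y)
    return [mu] * len(test_X)

# Hypothetical records: (device_id, (longitude, latitude, hour), PM2.5).
records = [
    ("A", (32.58, 0.31, 0.0), 10.0),
    ("A", (32.58, 0.31, 1.0), 12.0),
    ("B", (32.60, 0.33, 0.0), 20.0),
    ("C", (32.62, 0.35, 0.0), 30.0),
]
results = leave_one_device_out(records, mean_baseline)
```

With three devices this produces three RMSE values, one per hold-out device, mirroring the 10 experiments for Jinja and 34 for Kampala. Any model exposing the same fit-then-predict interface could be passed in place of `mean_baseline`.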

For model evaluation, we used the root-mean-square error (RMSE) as our primary metric, which we calculated as shown in Equation 3.8, where $ {y}_{\mathrm{actual}} $ represents the actual (test) PM $ {}_{2.5} $ concentration, and $ {y}_{\mathrm{predicted}} $ represents the predicted PM $ {}_{2.5} $ concentration.

(3.8)

$$ \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum \limits_{i=1}^n{\left({y}_{\mathrm{actual},i}-{y}_{\mathrm{predicted},i}\right)}^2} $$

Additionally, to provide a more comprehensive assessment of model performance, we computed the normalized RMSE (nRMSE), derived as shown in Equation 3.9, where $ {\overline{y}}_{\mathrm{actual}} $ is the mean of the actual PM $ {}_{2.5} $ concentrations, together with the coefficient of determination (R²) and the prediction bias.

(3.9)

$$ \mathrm{nRMSE}=\frac{\mathrm{RMSE}}{{\overline{y}}_{\mathrm{actual}}}\times 100 $$
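The four metrics can be sketched in plain Python as follows. RMSE and nRMSE follow Equations 3.8 and 3.9. The formulas for R² and bias are not printed in this section, so the sketch assumes the standard definition of R² and takes bias to be the mean of predicted minus actual; both are assumptions, as are the function names and the example values.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error (Equation 3.8)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def nrmse(actual, predicted):
    """RMSE as a percentage of the mean actual value (Equation 3.9)."""
    mean_actual = sum(actual) / len(actual)
    return rmse(actual, predicted) / mean_actual * 100


def r_squared(actual, predicted):
    """Coefficient of determination (assumed standard definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def bias(actual, predicted):
    """Mean of (predicted - actual); this sign convention is an assumption."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical PM2.5 values in micrograms per cubic metre.
actual = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 28.0, 43.0, 47.0]

print(f"RMSE={rmse(actual, predicted):.2f}, nRMSE={nrmse(actual, predicted):.2f}%")
print(f"R2={r_squared(actual, predicted):.3f}, bias={bias(actual, predicted):.2f}")
```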

4. Results

4.1. Model predictions

Our aim is to provide predictions for the PM $ {}_{2.5} $ concentration in locations where there is no air quality monitor available using data from the monitors installed elsewhere. For each experiment, the performance metrics were computed as described in Section 3.4, and as a result, we obtained multiple values for each metric. For instance, we had 10 and 34 different RMSEs for Jinja and Kampala, respectively. The mean and standard deviation values of the RMSE, nRMSE, R², and bias for both cities are shown in Table 1.
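As an illustration of how the per-experiment metrics are reduced to the mean and standard deviation reported for each city, consider the following sketch. The RMSE values are hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the text does not state which estimator was used.

```python
import statistics


def summarize(values):
    """Return the mean and sample standard deviation of per-device metric values."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical RMSEs from five hold-out experiments (micrograms per cubic metre).
per_device_rmse = [12.0, 15.0, 18.0, 21.0, 24.0]

mean_rmse, sd_rmse = summarize(per_device_rmse)
print(f"RMSE: {mean_rmse:.2f} ± {sd_rmse:.2f}")
```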

Table 1.

Summary of model performance for the two cities

Figures 4 and 5 show the actual versus predicted PM $ {}_{2.5} $ concentration for one sample location in each city: Civic Center in Kampala and Jinja Main Street in Jinja. Figure 4 shows the predictions obtained when the hold-out device is the one installed at Civic Center. The blue line shows the actual PM $ {}_{2.5} $ concentration during this period, while the orange line shows the predictions made by a model trained on data from all the other devices. The two lines follow a similar trend, and the overall RMSE for this experiment was 15.84 μg/m³. Similarly, Figure 5 shows the results of the experiment in which the test data come from the device installed at Jinja Main Street; the RMSE for this experiment was 12.64 μg/m³. The shaded region in both figures represents the 95% confidence interval of the model’s predictions.

Figure 4.

A graph plot showing actual vs predicted PM $ {}_{2.5} $ concentration for the Civic Center device location in Kampala.

Figure 5.

A graph plot showing actual vs predicted PM $ {}_{2.5} $ concentration for the Jinja Main Street device location in Jinja.
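A 95% band such as the shaded region in Figures 4 and 5 can be derived from a Gaussian predictive mean and variance as mean ± 1.96 standard deviations. The sketch below assumes this construction; the text does not state whether the plotted band includes the observation noise, and the helper names and numbers are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def interval_95(mean, variance):
    """Lower and upper bounds of a 95% interval for a Gaussian prediction."""
    sd = math.sqrt(variance)
    return mean - Z_95 * sd, mean + Z_95 * sd


def coverage(actual, means, variances):
    """Fraction of actual values that fall inside their 95% interval."""
    inside = 0
    for y, m, v in zip(actual, means, variances):
        lower, upper = interval_95(m, v)
        if lower <= y <= upper:
            inside += 1
    return inside / len(actual)


# Hypothetical hold-out measurements and predictions (micrograms per cubic metre).
actual = [30.0, 45.0, 80.0]
pred_mean = [32.0, 40.0, 50.0]
pred_var = [25.0, 25.0, 25.0]

print(interval_95(pred_mean[0], pred_var[0]))
print(f"empirical coverage: {coverage(actual, pred_mean, pred_var):.2f}")
```

Comparing the empirical coverage on hold-out data with the nominal 95% is one simple way to check whether the predicted uncertainty is well calibrated.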

4.2. Comparison with other methods

We compared our model’s performance with that of well-known state-of-the-art algorithms used in the prediction of time-series data, namely, kriging, support vector machine (SVM), random forest (RF), eXtreme Gradient Boosting (XGBoost), a feed-forward neural network (FFNN), an LSTM network, a Bayesian neural network (BNN), and a deep GPR (DGPR) model. We incorporated dropout regularization as appropriate for the neural network-based models. We chose these models because they have been used in previous studies and performed well on the task. Additionally, we included DGPR because it extends GPR by incorporating deep neural networks into the GP framework. We used the same evaluation procedure as in Section 4.1 and summarized the results in the same way. Table 2 shows the results for Kampala, while Table 3 shows those for Jinja. The individual RMSEs for all the locations in Kampala and Jinja are shown in Figures B1 and B2 in Appendix B.

Table 2.

Summary of performance metrics for various models in predicting PM $ {}_{2.5} $ concentration across different locations in Kampala. Bold values indicate the best-performing model for each metric.

The values represent the mean and standard deviation computed from individual location-specific results.

Table 3.

Summary of performance metrics for various models in predicting PM $ {}_{2.5} $ concentration across different locations in Jinja. Bold values indicate the best-performing model for each metric.

The values represent the mean and standard deviation computed from individual location-specific results.

For Kampala, our model had the lowest mean RMSE and nRMSE and the highest R² value, although this R² was still low. XGBoost had the second-best performance after GPR. The neural network-based models, SVM, and kriging performed worse and had negative R² values. This may be because they were not able to adequately capture the nonlinear relationships in the data. The size of the dataset may also have affected the performance of the neural network-based models.

Similarly, for Jinja, our GPR model had the best value for the mean RMSE, nRMSE, and R². Again, all the other models had a negative R² value.
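The negative R² values reported above are possible because R² is not bounded below by zero. Assuming the standard definition (the formula is not printed in this article), and using the symbols of Equations 3.8 and 3.9,

$$ {R}^2=1-\frac{\sum \limits_{i=1}^n{\left({y}_{\mathrm{actual},i}-{y}_{\mathrm{predicted},i}\right)}^2}{\sum \limits_{i=1}^n{\left({y}_{\mathrm{actual},i}-{\overline{y}}_{\mathrm{actual}}\right)}^2} $$

so R² is negative whenever the sum of squared prediction errors exceeds the sum of squared deviations of the actual values from their mean. In other words, a model with a negative R² at a hold-out location predicts worse than a constant equal to that location's mean measured concentration.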

5. Discussion

In this article, we presented the prediction of PM $ {}_{2.5} $ levels at a geographical location using air quality data from sensors installed elsewhere, for two major Ugandan cities. We also showed the uncertainty of the predictions and demonstrated that GPR outperformed the other methods we compared on this task.

Overall, there is a clear difference between the performance of the two GPR models, that is, the one for Kampala and the one for Jinja. Although the mean errors are not far apart, the Jinja model’s mean RMSE (16.88 μg/m³) is lower than that of the Kampala model (18.32 μg/m³). Moreover, as shown by the standard deviation of the RMSEs, the Kampala results are more varied than those of Jinja. The highest RMSE for a Kampala experiment (40.32 μg/m³) is also almost twice the highest RMSE obtained for a Jinja experiment (23.04 μg/m³). Looking at the performance summaries of the two cities, it can be inferred that, in general, Kampala has a higher and more variable PM $ {}_{2.5} $ concentration level than Jinja. This can be attributed to many factors, such as population density, traffic levels, and level of commercial activity, all of which are higher in Kampala than in Jinja.

Additionally, we observed that a location with more “spikes” in its data tends to have a higher RMSE. Here, we define a spike as a sharp increase in the PM $ {}_{2.5} $ concentration of an area between two consecutive time points. A graph of the location with the second highest RMSE in Kampala (36.94 μg/m³) is shown in Figure 6, and the sharp peaks in the graph are the “spikes.” While the model is able to predict the general trend of the air quality, it is not able to reliably predict the spikes at the location, which leads to a higher error in the predictions. It should also be noted that the device whose data and predictions are shown in Figure 6 is located beside a road with heavy traffic in a busy commercial area. We believe that these spikes are due to air pollution from local sources in the area, which explains the difficulty in predicting these PM $ {}_{2.5} $ concentrations. Addressing this challenge will be pivotal in further improving our model’s performance.

Figure 6.

A graph plot showing actual vs predicted PM $ {}_{2.5} $ concentration for the location with the highest RMSE (Kiwatule) in Kampala as well as the spikes in the location. It should be noted that data are missing for most of the month of September.
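The definition of a spike given above (a sharp increase between two consecutive time points) can be turned into a simple detector. The sketch below is illustrative only: the threshold of 30 μg/m³ and the readings are hypothetical, and the article does not specify a numerical threshold.

```python
def find_spikes(series, threshold):
    """Return indices i where series[i] exceeds series[i - 1] by more than threshold."""
    return [
        i for i in range(1, len(series))
        if series[i] - series[i - 1] > threshold
    ]


# Hypothetical hourly PM2.5 readings (micrograms per cubic metre).
readings = [35.0, 38.0, 90.0, 60.0, 42.0, 41.0, 120.0, 70.0]
SPIKE_THRESHOLD = 30.0  # hypothetical cut-off for a "sharp" increase

spike_indices = find_spikes(readings, SPIKE_THRESHOLD)
print(spike_indices)
```

Counting such indices per location would give one way to relate the number of spikes to the RMSE obtained for that location.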

Furthermore, we believe that the GPR models perform better than the other models because they are well suited to this task. With the lengthscale parameter in the RBF kernel, we are able to control how strongly features at the different locations correlate with those at the location of interest. Therefore, data from locations nearer in latitude and longitude to the location of interest, and from timestamps closer to the timestamp of interest, are more relevant to the prediction. This is not possible with most of the other models. The ability of LSTM to model long-term dependencies, which is why it is used with sequential data such as time series, is most helpful when the features of the data change with time. In this case, other than the time feature, the two remaining features are latitude and longitude, which are time-invariant. This may explain why LSTM, like the other models, performs poorly on this task. The ability of GPs to learn a distribution over the data and to quantify the uncertainty of their predictions is also a powerful feature that makes GPR the most suitable method for this study.
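The two properties discussed above, nearby data dominating the prediction and uncertainty growing away from the data, can be illustrated with a minimal exact GP regression in plain Python. This is a simplified sketch and not the GPflow model used in the study: it uses a single input dimension, a zero mean function on centred targets, and hypothetical hyperparameters and data.

```python
import math


def rbf(a, b, variance, lengthscale):
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_lower_transposed(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(x_train, y_train, x_star, variance, lengthscale, noise):
    """Posterior mean and latent variance at x_star for a zero-mean GP."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j], variance, lengthscale)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))
    k_star = [rbf(x, x_star, variance, lengthscale) for x in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(x_star, x_star, variance, lengthscale) - sum(t * t for t in v)
    return mean, var


# Hypothetical centred observations at three positions along one axis.
x_train = [0.0, 1.0, 5.0]
y_train = [10.0, 12.0, -8.0]

mean_near, var_near = gp_predict(x_train, y_train, 0.5, 25.0, 1.0, 1.0)
mean_far, var_far = gp_predict(x_train, y_train, 10.0, 25.0, 1.0, 1.0)
print(f"near the data: mean={mean_near:.2f}, variance={var_near:.2f}")
print(f"far from data: mean={mean_far:.4f}, variance={var_far:.2f}")
```

At the point between the first two observations the prediction is driven by those two values and the variance is small, while far from all observations the mean reverts to zero and the variance approaches the signal variance.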

5.1. Challenges and limitations

However, GPs have their own challenges, the most significant of which is their computational complexity. GPR has $ \mathcal{O}\left({n}^3\right) $ training computational complexity and $ \mathcal{O}\left({n}^2\right) $ memory complexity, where $ n $ is the number of samples in the training dataset. This cost arises primarily from the Cholesky decomposition used to invert the $ n\times n $ covariance matrix (Rasmussen, 2003). During our experiments, we had to manually reduce the size of the training datasets in order to stay within the available computational resources. As a result, a significant amount of data was left out of the training process, and this could have affected performance. Nonetheless, even with this measure, the GP models had the second longest training time, as shown in Table 4. This is expected because GPR has a higher computational complexity than all the other models (Rasmussen, 2003) besides DGPR, which has a higher complexity due to its multi-layered neural network architecture (Damianou and Lawrence, 2013).

Table 4.

The duration of training and testing processes for the different algorithms for one location (Jinja Main Street in Jinja city)

We show the mean $ \pm $ standard deviation of 7 runs. It should be noted that all algorithms were run on the same computer except DGPR because it was unable to run on that particular machine. The training and testing times shown for the DGPR are therefore from another computer.
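To give a sense of why the training data had to be reduced, the sketch below estimates the cost of exact GPR for different training set sizes and shows one simple way to cap the number of rows per device. It assumes a dense covariance matrix of 8-byte floats and the leading-order n³/3 operation count of a Cholesky factorization; the function names, the row cap, and the uniform random sampling are hypothetical and are not taken from the study's code.

```python
import random


def exact_gpr_cost(n, bytes_per_float=8):
    """Rough memory (bytes) and Cholesky operation count for n training rows."""
    memory_bytes = n * n * bytes_per_float  # dense n x n covariance matrix
    cholesky_flops = n ** 3 / 3             # leading-order term only
    return memory_bytes, cholesky_flops


def subsample_rows(rows, max_rows, seed=0):
    """Keep at most max_rows rows, chosen uniformly at random without replacement."""
    if len(rows) <= max_rows:
        return list(rows)
    return random.Random(seed).sample(rows, max_rows)


for n in (1_000, 10_000, 100_000):
    memory_bytes, flops = exact_gpr_cost(n)
    print(f"n={n:>7}: {memory_bytes / 1e9:8.3f} GB, {flops:.2e} operations")

# Hypothetical device with 5,000 rows, capped at 500 rows.
device_rows = list(range(5_000))
kept = subsample_rows(device_rows, max_rows=500)
print(len(kept))
```

Because the operation count grows with the cube of n, halving the number of training rows cuts the factorization cost by roughly a factor of eight, which is why subsampling is effective even though it discards data.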

Additionally, some of the locations whose data were used had a few gaps, which may have affected the performance of our models. However, given the limited size of these gaps, their impact is assumed to be minimal. Moreover, the study period spanned only 3 months, which is insufficient to capture full annual seasonal trends and variations, as well as their effect on PM $ {}_{2.5} $ concentrations and model predictions.

6. Conclusion and future work

In this study, we developed air quality prediction models for two cities in Uganda, namely, Kampala and Jinja. We used data from devices installed elsewhere to predict the air quality of a particular location at a given time. This capability is vital for estimating air quality in areas without air quality sensors. In sub-Saharan Africa especially, where readily available air quality information is scarce, this model could help provide timely information to inform decisions. For instance, in a region with devices installed in only a few locations, their data can be used to create an air quality prediction heatmap over the entire region. Additionally, we demonstrated the aptness of GPs for this problem, and we believe that this solution can be replicated across different cities on the continent. The main limitations of the study are a few gaps in the data, where the air quality monitors did not record the relevant measurements, and the computational complexity of GPs, which meant that some of the available data were not used.

For future work, we intend to leverage sparse approximation in our methodology to optimize resource utilization and be able to use data from a multiyear period to capture seasonal changes and trends. Additionally, we shall work on incorporating spike detection in the model since spikes have a significant effect on the error rate. The assumption is that improving the ability of the model to predict spikes will positively affect the quality of the predictions made. Moreover, instead of eliminating data with missing values, the missing values can also be imputed using GPR. Furthermore, GPR could be used to aid in the optimal placement of air quality sensors based on model uncertainty.

Open peer review

To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10026.

Acknowledgements

We would like to thank Michael T. Smith for his invaluable insights that helped advance this work.

Author contribution

Conceptualization: E.B. and L.M. Project administration: E.B. and D.O. Funding acquisition: E.B. Data curation: R.S. and L.M. Methodology: L.M. Software: L.M. and R.S. Data visualization: L.M. Writing – original draft: L.M. Writing – review & editing: L.M., R.S., D.O., and E.B. Supervision: E.B. All authors approved the final submitted draft.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability statement

Replication data and code can be found at https://doi.org/10.5281/zenodo.17607881.

Ethics statement

The research meets all ethical guidelines, including adherence to the legal requirements of the study country.

Funding statement

This research was supported by grants from Google.org, the Global Challenges Research Fund, and the Kingdom of Belgium through the Wehubit program implemented by Enabel.

A. Appendix. Meteorological summary of Kampala and Jinja

Table A1.

Descriptive statistics for weather parameters in Kampala during the study period

Table A2.

Monthly weather summary for Kampala during the study period

Table A3.

Descriptive statistics for weather parameters in Jinja during the study period

Table A4.

Monthly weather summary for Jinja during the study period

B. Appendix. More results

Figures B1 and B2 show the device locations in Kampala and Jinja that were used in this study alongside their individual RMSE values.

Figure B1.

The device locations in Kampala with their respective RMSE values.

Figure B2.

The device locations in Jinja and their respective RMSE values.

Footnotes

This research article was awarded Open Data and Open Materials badges for transparent practices. See the Data Availability Statement for details.

References

Adong, P, Bainomugisha, E, Okure, D and Sserunjogi, R (2022) Applying machine learning for large scale field calibration of low-cost PM 2.5 and PM 10 air pollution sensors. Applied AI Letters 3(3), e76.

AIRQO Africa (2018) AirQo. Available at https://airqo.net/ (accessed 30 November 2023).

Alaran, A, Natasha, O, Lambed, T, Sserunjogi, R and Okello, G (2024) Air pollution (PM 2.5) and its meteorology predictors in Kampala and Jinja cities, in Uganda. Environmental Science: Atmospheres 4(10), 1145–1156.

Athira, V, Geetha, P, Vinayakumar, R and Soman, K (2018) DeepAirNet: Applying recurrent networks for air quality prediction. Procedia Computer Science 132, 1394–1403.

Bainomugisha, E, Ssematimba, J, Okedi, D, Nsubuga, A, Banda, M, Settala, G and Lubisia, G (2023) AirQo sensor kit: A particulate matter air quality sensing kit custom designed for low-resource settings. HardwareX 16, e00482. doi:10.1016/j.ohx.2023.e00482

Cross, DT (2021, October 18) Air pollution is a growing problem in Africa, requiring long-term solutions. Sustainability Times. Available at https://www.sustainability-times.com/clean-cities/air-pollution-is-a-growing-problem-in-africa-requiring-long-term-solutions/ (accessed 12 June 2023).

Curtis, L, Rea, W, Smith-Willis, P, Fenyves, E and Pan, Y (2006) Adverse health effects of outdoor air pollutants. Environment International 32(6), 815–830.

Damianou, A and Lawrence, ND (2013) Deep Gaussian processes. 16th International Conference on Artificial Intelligence and Statistics. PMLR, pp. 207–215.

Davda, K (2023, October 26) How does Air Quality Sensor Calibration improve data accuracy? Oizom. Available at https://oizom.com/how-does-air-quality-sensor-calibration-improve-data-accuracy (accessed 01 February 2024).

Dobson, R, Siddiqi, K, Ferdous, T, Huque, R, Lesosky, M, Balmes, J and Semple, S (2021) Diurnal variability of fine particulate pollution concentrations: Data from 14 low-and middle-income countries. The International Journal of Tuberculosis and Lung Disease 25(3), 206–214. doi:10.5588/ijtld.20.0704

Doreswamy, N, Harishkumar, K, Yogesh, K and Gad, I (2020) Forecasting air pollution particulate matter (PM 2.5) using machine learning regression models. Procedia Computer Science 171, 2057–2066. doi:10.1016/j.procs.2020.04.221

Finardi, S, De Maria, R, D’Allura, A, Cascone, C, Calori, G and Lollobrigida, F (2008) A deterministic air quality forecasting system for Torino urban area, Italy. Environmental Modelling & Software 23(3), 344–355. doi:10.1016/j.envsoft.2007.04.001

Green, P, Okure, D, Adong, P, Sserunjogi, R and Bainomugisha, E (2022) Exploring PM2.5 variations from calibrated low-cost sensor network in Greater Kampala, during COVID-19 imposed lockdown restrictions: Lessons for policy. Clean Air Journal 32(1), 1–14. doi:10.17159/caj/2022/32/1.10906

Hertel, O, Berkowicz, R and Larssen, S (1991) The operational street pollution model (OSPM). In van Dop, H., Steyn, D.G. (eds.), Air Pollution Modeling and Its Application VIII. NATO · Challenges of Modern Society, vol 15. Boston, MA: Springer.

Hung, NT, Ketzel, M, Jensen, SS and Oanh, NTK (2010) Air pollution modeling at road sides using the operational street pollution model—A case study in Hanoi, Vietnam. Journal of the Air & Waste Management Association 60(11), 1315–1326.

Iyer, SR, Balashankar, A, Aeberhard, WH, Bhattacharyya, S, Rusconi, G, Jose, L, Soans, N, Sudarshan, A, Pande, R and Subramanian, L (2022) Modeling fine-grained spatio-temporal pollution maps with low-cost sensors. npj Climate and Atmospheric Science 5(1), 76. doi:10.1038/s41612-022-00293-z

Jaiswal, A, Samuel, C and Kadabgaon, V (2018) Statistical trend analysis and forecast modeling of air pollutants. Global Journal of Environmental Science and Management 4(4), 427–438.

Kalajdjieski, J, Mirceva, G and Kalajdziski, S (2020) Attention models for PM 2.5 prediction. 2020 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT). IEEE, pp. 1–8.

Kazungu, D (2020, October 15) UG DECIDES 2021: Kyemba nominated for Jinja city woman MP, pledges to make Jinja city great. PML Daily. Available at https://www.pmldaily.com/news/politics/2020/10/ug-decides-2021-kyemba-nominated-for-jinja-city-woman-mp-pledges-to-make-jinja-city-great.html (accessed 10 August 2023).

Ketzel, M, Jensen, S, Brandt, J, Ellermann, T, Olesen, H, Berkowicz, R and Hertel, O (2012) Evaluation of the street pollution model OSPM for measurements at 12 streets stations using a newly developed and freely available evaluation tool. Journal of Civil & Environmental Engineering 1, 1–11.

Kim, HS, Park, I, Song, CH, Lee, K, Yun, JW, Kim, HK, Jeon, M, Lee, J and Han, KM (2019) Development of a daily PM 10 and PM 2.5 prediction system using a deep long short-term memory neural network model. Atmospheric Chemistry and Physics 19(20), 12935–12951. doi:10.5194/acp-19-12935-2019

Kirenga, BJ, Meng, Q, Van Gemert, F, Aanyu-Tukamuhebwa, H, Chavannes, N, Katamba, A, Obai, G, Van der Molen, T, Schwander, S and Mohsenin, V (2015) The state of ambient air quality in two Ugandan cities: A pilot cross-sectional spatial assessment. International Journal of Environmental Research and Public Health 12(7), 8075–8091.

Khurram, H and Lim, A (2024) Analyzing and forecasting air pollution concentration in the capital and southern Thailand using a lag-dependent Gaussian process model. Environmental Monitoring and Assessment 196, 1106. doi:10.1007/s10661-024-13275-w

Laumbach, RJ (2010) Outdoor air pollutants and patient health. American Family Physician 81(2), 175.

LeCun, Y, Bengio, Y and Hinton, G (2015) Deep learning. Nature 521(7553), 436–444.

Lei, MT, Monjardino, J, Mendes, L, Gonçalves, D and Ferreira, F (2019) Macao air quality forecast using statistical methods. Air Quality, Atmosphere & Health 12(9), 1049–1057. doi:10.1007/s11869-019-00721-9

Liang, Y-C, Maimury, Y, Chen, AH-L and Juarez, JRC (2020) Machine learning-based prediction of air quality. Applied Sciences 10(24), 9151. doi:10.3390/app10249151

Lin, L, Chen, C-Y, Yang, H-Y, Xu, Z and Fang, S-H (2020) Dynamic system approach for improved PM 2.5 prediction in Taiwan. IEEE Access 8, 210910–210921. doi:10.1109/ACCESS.2020.3038853

Lim, A, Owusu, B, Thongrod, T, Khurram, H, Pongsiri, N, Ingviya, T and Buya, S (2024) Trend and association between particulate matters and meteorological factors: A prospect for prediction of PM2.5 in southern Thailand. Polish Journal of Environmental Studies 34(5), 5215–5223. doi:10.15244/pjoes/190787

Liu, H, Yang, C, Huang, M, Wang, D and Yoo, C (2018) Modeling of subway indoor air quality using Gaussian process regression. Journal of Hazardous Materials 359, 266–273. doi:10.1016/j.jhazmat.2018.07.034

Mao, W, Wang, W, Jiao, L, Zhao, S and Liu, A (2021) Modeling air quality prediction using a deep learning approach: Method optimization and evaluation. Sustainable Cities and Society 65, 102567. doi:10.1016/j.scs.2020.102567

Matthews, AG d G, van der Wilk, M, Nickson, T, Fujii, K, Boukouvalas, A, León-Villagrá, P, Ghahramani, Z and Hensman, J (2017) GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research 18(40), 1–6.

Mcdonald, K (2012) Air pollution in the urban atmosphere: Sources and consequences. Metropolitan Sustainability, 231–259. doi:10.1533/9780857096463.3.231

McQuaid, K, Vanderbeck, RM, Valentine, G, Liu, C, Chen, L, Zhang, M and Diprose, K (2018) Urban climate change, livelihood vulnerability and narratives of generational responsibility in Jinja, Uganda. Africa 88(1), 11–37. doi:10.1017/S0001972017000547

Met One Instruments (2023, July 24) BAM 1022 Beta Attenuation Mass Monitor. Met One Instruments. Available at https://metone.com/products/bam-1022/ (accessed 24 November 2023).

Mitchell, TM (1997) Machine Learning, Vol. 1. New York: McGraw-Hill.

Nairobi City County (2023) Nairobi Air Quality. Available at https://nairobi.go.ke/nairobi-air-quality/ (accessed 24 November 2023).

OECD iLibrary (2017) Breathing Clean Air in All the Cities of the World (SDG 11). Available at https://www.oecd-ilibrary.org/sites/65a765da-en/index.html?itemId=/content/component/65a765da-en (accessed 12 June 2023).

Okure, D, Ssematimba, J, Sserunjogi, R, Gracia, NL, Soppelsa, ME and Bainomugisha, E (2022) Characterization of ambient air quality in selected urban areas in Uganda using low-cost sensing and measurement technologies. Environmental Science & Technology 56(6), 3324–3339. doi:10.1021/acs.est.1c01443

Petelin, D, Grancharova, A and Kocijan, J (2013) Evolving Gaussian process models for prediction of ozone concentration in the air. Simulation Modelling Practice and Theory 33, 68–80. doi:10.1016/j.simpat.2012.04.005

Qin, D, Yu, J, Zou, G, Yong, R, Zhao, Q and Zhang, B (2019) A novel combined prediction scheme based on CNN and LSTM for urban PM 2.5 concentration. IEEE Access 7, 20050–20059.

Rasmussen, CE (2003) Gaussian processes in machine learning. In Bousquet, O., von Luxburg, U., Rätsch, G. (eds.), Advanced Lectures on Machine Learning. ML 2003. Lecture Notes in Computer Science (LNCS), vol 3176. Berlin, Heidelberg: Springer, pp. 63–71. doi:10.1007/978-3-540-28650-9_4

Ritchie, H and Roser, M (2017) Air pollution. Our World in Data. Available at https://ourworldindata.org/air-pollution (accessed 12 June 2023).

Roozitalab, B, Carmichael, GR and Guttikunda, SK (2021) Improving regional air quality predictions in the Indo-Gangetic plain–case study of an intensive pollution episode in November 2017. Atmospheric Chemistry and Physics 21(4), 2837–2860. doi:10.5194/acp-21-2837-2021

Rwanda Air Quality (2021) Available at https://aq.rema.gov.rw/ (accessed 08 September 2023).

Seeger, M (2004) Gaussian processes for machine learning. International Journal of Neural Systems 14(02), 69–106. doi:10.1142/S0129065704001899

Seigneur, C and Dennis, R (2011) Atmospheric modeling. In Hidy, G., Brook, J., Demerjian, K., Molina, L., Pennell, W., Scheffe, R. (eds.), Technical Challenges of Multipollutant Air Quality Management. Dordrecht: Springer, pp. 299–337. doi:10.1007/978-94-007-0304-9_9

Song, L, Pang, S, Longley, I, Olivares, G and Sarrafzadeh, A (2014) Spatio-temporal PM 2.5 prediction by spatial data aided incremental support vector regression. 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 623–630. doi:10.1109/IJCNN.2014.6889521

Sserunjogi, R, Ssematimba, J, Okure, D, Ogenrwot, D, Adong, P, Muyama, L, Nsimbe, N, Bbaale, M and Bainomugisha, E (2022) Seeing the air in detail: Hyperlocal air quality dataset collected from spatially distributed AirQo network. Data in Brief 44, 108512. doi:10.1016/j.dib.2022.108512

Stroud, C, Moran, M, Makar, P, Gong, S, Gong, W, Zhang, J, Slowik, J, Abbatt, J, Lu, G, Brook, J, Mihele, C, Li, Q, Sills, D, Strawbridge, K, McGuire, M and Evans, G (2012) Evaluation of chemical transport model predictions of primary organic aerosol for air masses classified by particle component-based factor analysis. Atmospheric Chemistry and Physics 12(18), 8297–8321. doi:10.5194/acp-12-8297-2012

The Editors of Encyclopaedia Britannica (2023, November 15) Jinja. Encyclopedia Britannica. Available at https://www.britannica.com/place/Jinja-Uganda (accessed 01 February 2024).

The Editors of Encyclopaedia Britannica (2024, January 27) Kampala. Encyclopedia Britannica. Available at https://www.britannica.com/place/Kampala (accessed 01 February 2024).

The World Air Quality Index Project (2008) Air pollution in Africa: Real-time air quality index visual map. aqicn.org. Available at https://aqicn.org/map/africa/ (accessed 08 December 2023).

The World Bank (2018) Great Kampala Metropolitan Area — Quick facts. The World Bank. Available at https://thedocs.worldbank.org/en/doc/595971521054661269-0010022018/original/GreatKampalaMetropolitanAreaQuickFacts.pdf (accessed 12 June 2023).

Uganda Investment Authority (2022) Uganda’s Industrial Journey So Far: Progress, achievements, and prospects. Uganda Investment Authority - Your Investment Is Our Business.

Uno, I, Carmichael, G, Streets, D, Tang, Y, Yienger, J, Satake, S, Wang, Z, Woo, J-H, Guttikunda, S, Uematsu, M, Matsumoto, K, Tanimoto, H, Yoshioka, K and Iida, T (2003) Regional chemical weather forecasting system CFORS: Model descriptions and analysis of surface observations at Japanese island stations during the ACE-Asia experiment. Journal of Geophysical Research: Atmospheres 108(D23), 8668. doi:10.1029/2002JD002845

Vermeiren, K, Van Rompaey, A, Loopmans, M, Serwajja, E and Mukwaya, P (2012) Urban growth of Kampala, Uganda: Pattern analysis and scenario development. Landscape and Urban Planning 106(2), 199–206. doi:10.1016/j.landurbplan.2012.03.006

Vong, C-M, Ip, W-F, Wong, P-k and Yang, J-y (2012) Short-term prediction of air pollution in Macau using support vector machines. Journal of Control Science and Engineering, 1–11. doi:10.1155/2012/518032

Wang, J and Song, G (2018) A deep spatial-temporal ensemble model for air quality prediction. Neurocomputing 314, 198–206. doi:10.1016/j.neucom.2018.06.049

World Health Organization (2022, December 19) Ambient (outdoor) air pollution. Available at https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health (accessed 12 June 2023).

Yu, R, Yang, Y, Yang, L, Han, G and Move, OA (2016) RAQ–a random forest approach for predicting air quality in urban sensing systems. Sensors 16(1), 86. doi:10.3390/s16010086

Zhang, Y, Wang, Y, Gao, M, Ma, Q, Zhao, J, Zhang, R, Wang, Q and Huang, L (2019) A predictive data feature exploration-based air quality prediction approach. IEEE Access 7, 30732–30743. doi:10.1109/ACCESS.2019.2897754

Zheng, S, Pozzer, A, Cao, C and Lelieveld, J (2015) Long-term (2001–2012) concentrations of fine particulate matter (PM 2.5) and the impact on human health in Beijing, China. Atmospheric Chemistry and Physics 15(10), 5715–5725.
