Impact Statement
This work advances forest monitoring using deep learning on aerial imagery time series. By leveraging phenological information and taxonomic hierarchies, our proposed methods improve tree species segmentation performance. The introduction of a compact spatio-temporal feature extraction module enables the use of pretrained models for this task. Our findings highlight the importance of incorporating temporal data and hierarchical knowledge in forest monitoring, and we hope our work will offer valuable insights for biodiversity conservation and climate change mitigation efforts.
1. Introduction
Climate change and biodiversity loss in forests are closely intertwined, with each potentially exacerbating the other. As the climate changes, the suitable habitat for many tree species shifts geographically, with ranges expanding in some regions while contracting or disappearing in others, leading to changes in forest composition and potential biodiversity loss (Lenoir et al., Reference Lenoir, Gégout, Marquet, de Ruffray and Brisse2008; Allen et al., Reference Allen, Macalady, Chenchouni, Bachelet, McDowell, Vennetier, Kitzberger, Rigling, Breshears, Hogg, Gonzalez, Fensham, Zhang, Castro, Demidova, Lim, Allard, Running, Semerci and Cobb2010; Mahecha et al., Reference Mahecha, Bastos, Bohn, Eisenhauer, Feilhauer, Hickler, Kalesse-Los, Migliavacca, Otto, Peng, Sippel, Tegen, Weigelt, Wendisch, Wirth, Al-Halbouni, Deneke, Doktor, Dunker, Duveiller, Ehrlich, Foth, García-García, Guerra, Guimarães-Steinicke, Hartmann, Henning, Herrmann, Hu, Ji, Kattenborn, Kolleck, Kretschmer, Kühn, Luttkus, Maahn, Mönks, Mora, Pöhlker, Reichstein, Rüger, Sánchez-Parra, Schäfer, Stratmann, Tesche, Wehner, Wieneke, Winkler, Wolf, Zaehle, Zscheischler and Quaas2024). Conversely, biodiversity loss in forests can reduce their ability to absorb and store carbon, further contributing to climate change. Different tree species have varying tolerances to changes in temperature, precipitation, and other environmental factors. As a result, climate change can cause variable phenological changes (Visser and Gienapp, Reference Visser and Gienapp2019), shifts in species distribution (Babst et al., Reference Babst, Bouriaud, Poulter, Trouet, Girardin and Frank2019) and differential growth responses due to increased atmospheric
$ {\mathrm{CO}}_2 $ (Bonan, Reference Bonan2008; Anderegg et al., Reference Anderegg, Kane and Leander2012). Phenology in trees refers to the timing of seasonal events such as leaf emergence, color change, and leaf fall. These cyclical changes are influenced by environmental factors like temperature and day length, and often vary between tree species. Understanding phenological patterns can potentially enhance our ability to distinguish between tree species and monitor their responses to environmental changes.
Increasingly, deep learning–based methods, alongside remote sensing applications (e.g., land-use and land-cover mapping (Hamdi et al., Reference Hamdi, Brandmeier and Straub2019; Helber et al., Reference Helber, Bischke, Dengel and Borth2019; Vali et al., Reference Vali, Comai and Matteucci2020; Hamedianfar et al., Reference Hamedianfar, Mohamedou, Kangas and Vauhkonen2022) and change detection (Khelifi and Mignotte, Reference Khelifi and Mignotte2020)), have helped advance the field of forest monitoring in tree species classification (Fricker et al., Reference Fricker, Ventura, Wolf, North, Davis and Franklin2019), biomass estimation (Zhang et al., Reference Zhang, Shao, Liu and Cheng2019), and tree crown semantic segmentation (Schiefer et al., Reference Schiefer, Kattenborn, Frick, Frey, Schall, Koch and Schmidtlein2020; Weinstein et al., Reference Weinstein, Marconi, Aubry-Kientz, Vincent, Senyondo and White2020).
The use of temporal data as inputs to these methods has also shown success in other tasks such as crop mapping (Sainte Fare Garnot et al., Reference Sainte Fare Garnot, Landrieu, Giordano and Chehata2020; Cai et al., Reference Cai, Bi, Nicholl and Sterritt2023; Tarasiou et al., Reference Tarasiou, Chavez and Zafeiriou2023) and forest health mapping (Hamdi et al., Reference Hamdi, Brandmeier and Straub2019). Semantic segmentation of tree crowns is a crucial task in forest monitoring, as it provides valuable information about forest composition and health. It could be further improved by leveraging time-series inputs to learn the species-specific phenological changes that occur across seasons and years.
In this work, we evaluate multiple models on the task of tree crown semantic segmentation using a rich dataset recorded in the Laurentides region of Québec, Canada (Cloutier et al., Reference Cloutier, Germain and Laliberté2024). Among the numerous datasets available for tree crown semantic segmentation (Ouaknine et al., Reference Ouaknine, Kattenborn, Laliberté and Rolnick2025), we chose this one for its unique characteristics: high-resolution time-series data and a number of closely related classes. This allows us to investigate the impact of phenological (seasonal) changes on tree species identification and assess the ability of the model to distinguish between closely related species.
To this end, we employ state-of-the-art models in semantic segmentation for single-image and time-series segmentation. Additionally, we introduce a lightweight module to extract spatio-temporal features from a time-series input, allowing time series to be used with backbones that typically operate on single images. The dataset we use lacks fine-grained species-level labels for all trees, as it is challenging to accurately identify tree species at a granular level; it is often easier to identify them at a coarser (genus or family) level. To address this, we propose a custom hierarchical loss function that incorporates labels from all three levels (species, genus, and family) and penalizes incorrect predictions at each level. Overall, our work can be summarized as follows:
• We introduce a simple yet effective module for extracting spatio-temporal features, enabling the use of pretrained models for segmenting tree crowns with time series.
• We find that time-series data improves species identification performance, particularly for deciduous trees.
• We demonstrate that models achieve better accuracy when leveraging taxonomic hierarchies through our proposed loss function.
2. Related Work
2.1. Semantic segmentation
Deep learning applications for computer vision have been widely explored over the years, including various methods based on convolutional neural networks (CNNs) such as Fully Convolutional Networks (FCNs) (Long et al., Reference Long, Shelhamer and Darrell2015), U-Net (Ronneberger et al., Reference Ronneberger, Fischer and Brox2015), and DeepLab (Chen et al., Reference Chen, Zhu, Papandreou, Schroff, Adam, Ferrari, Hebert, Sminchisescu and Weiss2018a).
The “dilated” (also named “atrous”) convolution (Yu and Koltun, Reference Yu, Koltun, Bengio and LeCun2016; Chen et al., Reference Chen, Zhu, Papandreou, Schroff, Adam, Ferrari, Hebert, Sminchisescu and Weiss2018a) has been introduced to increase the receptive field of CNNs, while attention mechanisms (Oktay et al., Reference Oktay, Schlemper, Le Folgoc, Matthew, Heinrich, Misawa, Mori, McDonagh, Hammerla, Kainz, Glocker and Rueckert2018; Fu et al., Reference Fu, Liu, Tian, Li, Bao, Fang and Lu2019) have been incorporated to focus on relevant regions. Multi-scale and pyramid pooling approaches, such as PSPNet (Zhao et al., Reference Zhao, Shi, Qi, Wang and Jia2017) and DeepLabV3+ (Chen et al., Reference Chen, Zhu, Papandreou, Schroff and Adam2018b), have been employed to capture context at different scales. Specific methods have also been designed to exploit temporal information for semantic segmentation, e.g., with 3D U-Net (Çiçek et al., Reference Çiçek, Abdulkadir, Lienkamp, Brox and Ronneberger2016) and V-Net (Milletari et al., Reference Milletari, Navab and Ahmadi2016).
Recently, transformer-based models have gained popularity in semantic segmentation, showing impressive results. For example, Mask2Former (Cheng et al., Reference Cheng, Misra, Schwing, Kirillov and Girdhar2022) combines the strengths of CNN-based and transformer-based architectures: it employs a hybrid approach with a CNN backbone for feature extraction and a transformer decoder for capturing global context and generating high-resolution segmentation masks. Other transformer-based models, such as SETR (Zheng et al., Reference Zheng, Lu, Zhao, Zhu, Luo, Wang, Fu, Feng, Xiang, Torr and Zhang2021), TransUNet (Chen et al., Reference Chen, Mei, Li, Lu, Yu, Wei, Luo, Xie, Adeli, Wang, Lungren, Zhang, Xing, Lu, Yuille and Zhou2024), and SegFormer (Xie et al., Reference Xie, Wang, Yu, Anandkumar, Alvarez, Luo, Ranzato, Beygelzimer, Dauphin, Liang and Vaughan2021), leverage the self-attention mechanism to capture long-range dependencies and global context effectively. These methods have demonstrated competitive or improved performance on various semantic segmentation benchmarks compared to traditional CNN-based models.
2.2. Satellite image time series (SITS)
Leveraging temporal information in satellite and aerial imagery provides insight into land dynamics and phenology. Researchers have used convolutional neural networks (CNNs) with temporal convolutions for land cover mapping (Lucas et al., Reference Lucas, Pelletier, Schmidt, Webb and Petitjean2021) and crop classification (Rußwurm and Körner, Reference Rußwurm and Körner2018). Attention-based methods, which have proven well-suited for satellite imagery, have been used for encoding time series (Garnot and Landrieu, Reference Garnot and Landrieu2021; Sainte Fare Garnot et al., Reference Sainte Fare Garnot, Landrieu, Giordano and Chehata2020; Rußwurm et al., Reference Rußwurm, Courty, Emonet, Lefèvre, Tuia and Tavenard2023). More recently, transformer-based methods have proven their merit on satellite image time series (SITS) with self-supervised learning, exploiting unlabeled data to improve performance on downstream tasks (Cong et al., Reference Cong, Khanna, Meng, Liu, Rozi, He, Burke, Lobell, Ermon, Koyejo, Mohamed, Agarwal, Belgrave, Cho and Oh2022; Tarasiou et al., Reference Tarasiou, Chavez and Zafeiriou2023; Tseng et al., Reference Tseng, Cartuyvels, Zvonkov, Purohit, Rolnick and Kerner2023; Reed et al., Reference Reed, Gupta, Li, Brockman, Funk, Clipp, Keutzer, Candido, Uyttendaele and Darrell2023).
A recent method has also proposed a new encoding scheme for SITS in order to fit popular pretrained backbones rather than creating task-specific architectures (Cai et al., Reference Cai, Bi, Nicholl and Sterritt2023).
2.3. Forest monitoring
Deep learning methods have helped advance the field of vegetation monitoring using remote sensing, including both satellite and aerial imagery (Kattenborn et al., Reference Kattenborn, Leitloff, Schiefer and Hinz2021), enabling progress in forest monitoring for accurate and efficient analysis at scale (Bae et al., Reference Bae, Levick, Heidrich, Magdon, Leutner, Wöllauer, Serebryanyk, Nauss, Krzystek, Gossner, Schall, Heibl, Bässler, Doerfler, Schulze, Krah, Culmsee, Jung, Heurich, Fischer, Seibold, Thorn, Gerlach, Hothorn, Weisser and Müller2019; Reichstein et al., Reference Reichstein, Camps-Valls, Stevens, Jung and Denzler2019; Beloiu et al., Reference Beloiu, Heinzmann, Rehush, Gessler and Griess2023; Nguyen et al., Reference Nguyen, Rußwurm, Lenczner and Tuia2024). Such models have achieved state-of-the-art performance in classifying tree species from high-resolution remote sensing imagery (Fricker et al., Reference Fricker, Ventura, Wolf, North, Davis and Franklin2019; Onishi and Ise, Reference Onishi and Ise2021).
Mapping deforestation at a large scale using satellite imagery has also been explored (Adarme et al., Reference Adarme, Feitosa, Happ, De Almeida and Gomes2020; Maretto et al., Reference Maretto, Fonseca, Jacobs, Körting, Bendini and Parente2021). Computer vision and remote sensing have also been leveraged in applications to plant phenology (Katal et al., Reference Katal, Rzanny, Mäder and Wäldchen2022). Global vegetation phenology has been modeled with satellite imagery alongside meteorological variables as inputs of a 1D CNN (Zhou et al., Reference Zhou, Xin, Dai and Li2021). Automated monitoring of forests has also been investigated to accurately identify key phenological events (Cao et al., Reference Cao, Sun, Jiang, Li and Xin2021; Song et al., Reference Song, Wu, Calvin, Serbin, Wolfe, Ng, Ely, Bogonovich, Wang, Lin, Saleska, Nelson, Rogers and Wu2022; Wang et al., Reference Wang, Song, Liddell, Morellato, Calvin, Yang, Alberton, Detto, Ma, Zhao, Henry, Zhang, Ng, Nelson, Huete and Wu2023).
Deep learning–based segmentation methods have been applied to automatically delineate individual tree crowns from high-resolution remote sensing imagery (Brandt et al., Reference Brandt, Tucker, Kariryaa, Rasmussen, Abel, Small, Chave, Rasmussen, Hiernaux, Diouf, Kergoat, Mertz, Igel, Gieseke, Schöning, Li, Melocik, Meyer, Sinno, Romero, Glennie and Montagu2020; Schiefer et al., Reference Schiefer, Kattenborn, Frick, Frey, Schall, Koch and Schmidtlein2020; Weinstein et al., Reference Weinstein, Marconi, Aubry-Kientz, Vincent, Senyondo and White2020; Li et al., Reference Li, Brandt, Fensholt, Kariryaa, Igel, Gieseke, Nord-Larsen, Oehmcke, Carlsen, Junttila, Tong, d’Aspremont and Ciais2023). In a similar vein, a U-Net architecture has been used for fine-grained segmentation of plant species using aerial imagery (Kattenborn et al., Reference Kattenborn, Eichel and Fassnacht2019). A foundation model trained on datasets from multiple sources is also able to perform decently on a variety of downstream tasks for forest monitoring, including classification, detection, and semantic segmentation (Bountos et al., Reference Bountos, Ouaknine, Papoutsis and Rolnick2025).
2.4. Hierarchical losses
Hierarchical loss functions have been extensively explored in various tasks to leverage the inherently hierarchical structure of object classes. By incorporating information from different levels of granularity, such loss functions aim to improve the ability of the model to make fine-grained distinctions and enhance overall performance. For classification tasks, a curriculum-based hierarchical loss, gradually increasing the specificity of the target class, was explored by Goyal and Ghosh (Reference Goyal and Ghosh2021). Similarly, a loss function evaluated at multiple operating points within the class hierarchy has helped to capture information at various levels of this hierarchy (Valmadre, Reference Valmadre2022). In contrast, one may encourage the model to make better mistakes by assigning different weights to the misclassified samples based on their position in the hierarchy, promoting more semantically meaningful errors (Bertinetto et al., Reference Bertinetto, Mueller, Tertikas, Samangooei and Lord2020).
Hierarchical loss functions have also been applied to object detection (Katole et al., Reference Katole, Yellapragada, Bedi, Kalra and Chaitanya2015; Zwemer et al., Reference Zwemer, Rob and Peter2022) and semantic segmentation (Sharma et al., Reference Sharma, Tuzel and Jacobs2015; Muller and Smith, Reference Muller and Smith2020; Li et al., Reference Li, Zhou, Wang, Li and Yang2022), demonstrating the effectiveness of incorporating a more structured and informative signal during the learning process.
3. Dataset
The dataset used in our work (Cloutier et al., Reference Cloutier, Germain and Laliberté2024) consists of high-resolution RGB imagery from unmanned aerial vehicles (UAVs) at seven different acquisition dates over a temperate-mixed forest in the Laurentides region of Québec, Canada, during the year 2021. The acquisitions were conducted monthly from May to August, with three additional acquisitions in September and October to capture color changes during autumn. The dataset contains a total of 23,000 individual tree crowns that were segmented and annotated, mostly at the species level, with 1,956 trees annotated only at the genus level due to the difficulty in accurately identifying species-level labels. This dataset offers a unique combination of time-series data and a large number of fine-grained tree species. This allows us to leverage the temporal information to investigate the impact of phenological changes on tree species identification. An example of this dataset is shown in Figure 1.

Figure 1. Example of an annotated sample from the studied dataset. The image in 1a shows a scene captured on September 2, while 1b overlays the tree species labels on the same scene. Each tree species is represented by a distinct color, as seen in Table 1.
We perform three-fold cross-validation using spatially separated splits for training, validation, and test, while ensuring balanced distribution of tree species classes across splits. The spatial separation between splits, with a consistent test region across all folds, allows us to evaluate how well our models generalize to new geographic areas, a critical requirement for real-world applications. An example of one cross-validation fold is illustrated in Figure 2.

Figure 2. Spatial splits of the dataset. The image on the left depicts the entire region where the aerial imagery was captured, while the image on the right shows the different subregions used to train, validate, and test models for one fold of cross-validation. The training, validation, and test regions are each shown in a distinct color. To prevent data leakage between the subsets, a buffer tile is omitted between adjacent regions. This spatial partitioning ensures that the model’s performance is assessed on geographically distinct areas, simulating real-world scenarios where the model would be applied to unseen locations.
For our three-fold cross-validation, we maintain a consistent test region across all folds to ensure reliable performance comparisons. In the remaining area, we create train and validation splits by systematically shifting their positions from left to right. Both training and validation regions maintain approximately equal sizes while their positions shift in each fold. Buffer tiles separate all regions (test, train, and validation) to prevent spatial autocorrelation, which is crucial for aerial imagery where neighboring pixels typically share similar characteristics. This method ensures no data leakage between splits while preserving the distribution of tree species across the heterogeneous forest ecosystem.
We ensure that each split maintains approximately the same proportion of tree species as the overall dataset, addressing potential sampling biases while preserving the natural spatial patterns of the forest.
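To make the fold construction concrete, the toy sketch below assigns tile columns to splits with a fixed test block, a per-fold shifting validation block, and buffer columns between regions. It is a simplified one-dimensional illustration with assumed sizes; the real dataset uses a 2D tile grid whose exact geometry is not restated here.

```python
def spatial_split(columns, fold):
    """Toy sketch: fixed test block on the right, validation block shifting
    left-to-right per fold, one buffer column dropped between regions."""
    n = len(columns)
    test = columns[-n // 5:]             # fixed ~20% test region
    rest = columns[: -(n // 5) - 1]      # drop one buffer column before test
    val_size = max(1, len(rest) // 5)
    start = fold * val_size              # validation block shifts per fold
    val = rest[start:start + val_size]
    train = [c for i, c in enumerate(rest)              # one buffer column on
             if i < start - 1 or i > start + val_size]  # each side of val
    return train, val, test

train, val, test = spatial_split(list(range(25)), fold=1)
# train: columns 0-1 and 7-18; val: 3-5; test: 20-24; buffers at 2, 6, 19.
```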
For our experiments, we use an image size of $ 768\times 768\times 3 $, providing sufficient spatial context to include multiple tree crowns and to learn relationships between different regions in the image. The labels are annotated using recordings from September 2 as reference (representing a date before most leaves change colour), which is also used as the input for our single-image models. For the models that take time series as input, we select one image from June, two from September, and one from October to reduce redundant information, as most phenological changes occur between September and October.
As a design choice, from the initial 28 classes, we merged those with fewer than 50 occurrences (mostly species with fewer than 10 samples) into the background class, leaving a total of 15 classes, excluding the background class. This ensures that the selected classes have sufficient samples in each split to effectively train and evaluate each model. The tree species distribution is illustrated in Figure 3.

Figure 3. Distribution of the selected classes in the dataset. We observe that there is a substantial difference in the frequency of occurrence of each tree species. The common and scientific names used for the abbreviations are detailed in Table 1.
The dataset is split into train, validation, and test sets with 64%, 16%, and 20% of the samples, respectively. We opted for a larger test set (20%) compared to conventional splits to ensure robust evaluation across all tree species classes, particularly given the class imbalance in our dataset. This split ratio maintains adequate representation of less frequent species in the test set while preserving sufficient training data. The validation set (16%) remains large enough for effective model selection and hyperparameter tuning. This is kept approximately consistent across all three folds of cross-validation. Given that this dataset has a mix of coarse (genus) and fine-grained (species) labels, we leverage this information to create a complete taxonomy of the classes used, as seen in Figure 4. This taxonomic hierarchy is incorporated in our proposed loss function as detailed in Section 4.3.

Figure 4. Taxonomic hierarchy of tree species. The hierarchical structure is visually represented using a tree diagram. Blue nodes represent the species level, the most fine-grained classification in the hierarchy. Red nodes denote the genus level, which groups together closely related species. Finally, green nodes group the higher-level taxon, the broadest classification level, which encompasses multiple genera and families. This structure of labels allows the models to learn more comprehensive relationships between different tree species at multiple levels of granularity. The full names of each abbreviation are detailed in Table 1.
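To make this label structure concrete, a hypothetical fragment of the child-to-parent mapping might look as follows. The genus assignments follow standard botanical taxonomy for the Table 1 abbreviations; the higher-level taxon names here are illustrative placeholders, not necessarily the labels used in Figure 4.

```python
# Hypothetical fragment of the three-level taxonomy (species -> genus -> taxon).
SPECIES_TO_GENUS = {
    "ACRU": "Acer",    # Red Maple
    "ACSA": "Acer",    # Sugar Maple
    "BEAL": "Betula",  # Yellow Birch
    "BEPA": "Betula",  # Paper Birch
    "PIST": "Pinus",   # Eastern White Pine
    "TSCA": "Tsuga",   # Eastern Hemlock
}
GENUS_TO_TAXON = {
    "Acer": "Broadleaf",
    "Betula": "Broadleaf",
    "Pinus": "Conifer",
    "Tsuga": "Conifer",
}
```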
Table 1. Tree species names and their abbreviations

Note. The color we use to depict each species is highlighted in the second column and is consistent for all the plots and figures.
4. Methods
In this section, we provide more details on the methods used to perform semantic segmentation, either with single-image or time-series inputs. We also describe the proposed hierarchical loss used to exploit the tree label taxonomy.
4.1. Single image semantic segmentation
The single-image semantic segmentation experiments are conducted with diverse methods detailed in the following sections.
4.1.1. U-Net
U-Net (Ronneberger et al., Reference Ronneberger, Fischer and Brox2015) is a widely adopted convolutional neural network (CNN) architecture (Dong et al., Reference Dong, Yang, Liu, Mo and Guo2017; Falk et al., Reference Falk, Mai, Bensch, Ciçek, Abdulkadir, Marrakchi, Böhm, Deubner, Jäckel, Seiwald, Dovzhenko, Tietz, Bosco, Walsh, Saltukoglu, Tay, Prinz, Palme, Simons, Diester, Brox and Ronneberger2018; Li et al., Reference Li, Chen, Qi, Dou, Fu and Heng2018) designed for efficient image segmentation tasks. The architecture consists of an encoder path and a decoder path, which together form a U-shaped structure. The encoder path follows the typical structure of a CNN, consisting of successive convolutional layers, rectified linear units (ReLU), and max-pooling operations, which gradually reduce the spatial dimensions while increasing the number of feature maps. The decoder path utilizes transposed convolutions to upsample the feature maps, enabling the network to construct segmentation maps at the original input resolution. The U-Net architecture uses skip connections (He et al., Reference He, Zhang, Ren and Sun2016) to concatenate feature maps from the encoder path with the corresponding upsampled feature maps in the decoder path.
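A minimal two-level PyTorch sketch of this encoder–decoder pattern with a single skip connection is shown below; the depth, layer widths, and class count are illustrative assumptions, not the configuration used in our experiments.

```python
import torch
from torch import nn

class TinyUNet(nn.Module):
    """Two-level U-Net sketch: encode, downsample, upsample, concatenate
    the skip connection, and predict per-pixel class scores."""
    def __init__(self, in_ch: int = 3, n_classes: int = 15):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                        # full-resolution features
        s2 = self.enc2(self.pool(s1))            # downsampled features
        u = self.up(s2)                          # upsample back (even sizes)
        u = self.dec(torch.cat([u, s1], dim=1))  # skip connection via concat
        return self.head(u)

out = TinyUNet()(torch.randn(1, 3, 64, 64))      # -> (1, 15, 64, 64)
```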
4.1.2. DeepLabv3+
The DeepLabv3+ architecture (Chen et al., Reference Chen, Zhu, Papandreou, Schroff and Adam2018b) is an image segmentation method built upon the strengths of pyramid pooling with an encoder–decoder structure (Chen et al., Reference Chen, Zhu, Papandreou, Schroff, Adam, Ferrari, Hebert, Sminchisescu and Weiss2018a). The encoder module of DeepLabv3+ utilizes “dilated” (also named “atrous”) convolutions to extract dense feature maps at multiple scales with larger receptive fields while keeping computational costs low. The encoder incorporates atrous spatial pyramid pooling (ASPP), which applies atrous convolutions with different dilation rates in parallel to further capture multi-scale context (Chen et al., Reference Chen, Zhu, Papandreou, Schroff, Adam, Ferrari, Hebert, Sminchisescu and Weiss2018a).
The decoder module of DeepLabv3+ combines the upsampled encoder output with low-level features from early encoder layers. This information is refined with $ 3\times 3 $ convolutions to produce the final output segmentation maps.
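To illustrate the ASPP idea, a minimal sketch with parallel dilated branches is shown below; the dilation rates and channel counts are illustrative, not DeepLabv3+’s exact configuration.

```python
import torch
from torch import nn

class MiniASPP(nn.Module):
    """Parallel dilated 3x3 convolutions at several rates, concatenated and
    fused with a 1x1 convolution. Setting padding equal to the dilation rate
    keeps the spatial size unchanged."""
    def __init__(self, in_ch: int = 256, out_ch: int = 64, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, dilation=r, padding=r) for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = MiniASPP()(torch.randn(1, 256, 48, 48))   # -> (1, 64, 48, 48)
```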
4.1.3. Mask2Former
The Mask2Former architecture (Cheng et al., Reference Cheng, Misra, Schwing, Kirillov and Girdhar2022) is a versatile method that applies binary masks to focus attention only on the areas with foreground features. The architecture consists of three parts: a backbone network, a pixel decoder, and a transformer decoder. Universal backbones (ResNet (He et al., Reference He, Zhang, Ren and Sun2016) or Swin Transformer (Liu et al., Reference Liu, Lin, Cao, Hu, Wei, Zhang, Lin and Guo2021)) are used to extract features from the input image. The low-resolution features are then used in a pixel decoder and upsampled to higher resolution. The masked attention is finally applied to the pixel embeddings in the transformer decoder.
To reduce the computational burden of using high-resolution masks, the transformer decoder processes the multi-scale features one resolution at a time. The Mask2Former architecture performs well across a variety of tasks, such as semantic, instance, and panoptic segmentation, which makes it a popular choice.
4.2. Time-series semantic segmentation
We compare various methods for semantic segmentation with time-series data, including 3D-UNet (Çiçek et al., Reference Çiçek, Abdulkadir, Lienkamp, Brox and Ronneberger2016), specialized for medical images, and U-Net with temporal attention encoder (U-TAE) (Garnot and Landrieu, Reference Garnot and Landrieu2021), specialized for SITS. Additionally, we propose a simple yet effective module composed of 3D convolutional layers, referred to as “Processor”, to preliminarily process the time series and use its representation as input for mainstream single-image segmentation methods.
4.2.1. 3D-UNet
The 3D-UNet method (Çiçek et al., Reference Çiçek, Abdulkadir, Lienkamp, Brox and Ronneberger2016) is composed of successive 3D convolutions with a $ 3\times 3\times 3 $ kernel, followed by batch normalization and a leaky ReLU activation. The 3D-UNet downsampling part is composed of five blocks, separated by spatial downsampling after the second and fourth blocks. The upsampling part consists of five blocks with transposed convolutions, while features from the downsampling part are concatenated similarly to U-Net (Ronneberger et al., Reference Ronneberger, Fischer and Brox2015).
4.2.2. U-TAE
The U-TAE architecture (Garnot and Landrieu, Reference Garnot and Landrieu2021) has been introduced for panoptic segmentation of SITS. It consists of three main parts: a multi-scale spatial encoder, a temporal encoder, and a convolutional decoder that produces a single feature map with the same spatial resolution as the input. The sequence of images is processed in parallel by the spatial encoder, and the temporal attention encoder (TAE) is applied at the lowest resolution features to generate attention masks. These masks are interpolated and applied to each feature map, allowing the extraction of spatial and temporal information at multiple scales. The decoder uses a series of transposed convolutions, ReLU, and batch normalization layers to produce the final feature map.
4.2.3. Processor module
Our proposed Processor module is composed of 3D convolutions and is designed to extract spatio-temporal features from time-series data, enabling the use of pretrained models for semantic segmentation. The motivation behind the Processor architecture is to capture spatio-temporal patterns while maintaining the spatial resolution to fit established models pretrained on single-image datasets. This approach differs from task-specific models relying on specialized architectures for processing time-series data in particular contexts, such as land use and land cover mapping (Garnot and Landrieu, Reference Garnot and Landrieu2021; Tarasiou et al., Reference Tarasiou, Chavez and Zafeiriou2023).
The module is composed of two 3D convolutional layers. The first layer has a kernel size of $ 3\times 3\times 3 $, followed by a second layer with a kernel size of $ 2\times 3\times 3 $. The padding in these layers is set to $ \left(0,1,1\right) $, and the number of output channels is set to 32 and 64, respectively. This configuration collapses the temporal dimension of the input while simultaneously increasing the number of channels. Since the kernel sizes are designed for a specific time-series length, they must be adjusted for a different application; however, our lightweight module is easily trainable from scratch.
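As a concrete illustration, the following minimal PyTorch sketch implements the two-layer configuration above for a time series of length four; only the kernel sizes, padding, and channel counts come from the description above, while the ReLU activations and the channel-first tensor layout are added assumptions.

```python
import torch
from torch import nn

class Processor(nn.Module):
    """Sketch of the Processor module for a time series of length T=4."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        # Kernel 3x3x3, padding (0,1,1): T 4 -> 2, H and W unchanged.
        self.conv1 = nn.Conv3d(in_channels, 32, kernel_size=(3, 3, 3),
                               padding=(0, 1, 1))
        # Kernel 2x3x3, padding (0,1,1): T 2 -> 1, channels 32 -> 64.
        self.conv2 = nn.Conv3d(32, 64, kernel_size=(2, 3, 3),
                               padding=(0, 1, 1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); permute first if the time axis precedes the
        # channel axis, as in the notation below.
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))   # (B, 64, 1, H, W)
        return x.squeeze(2)           # (B, 64, H, W), single-image-like

x = torch.randn(2, 3, 4, 768, 768)
print(Processor()(x).shape)           # torch.Size([2, 64, 768, 768])
```

The collapsed feature map can then be fed to a 2D segmentation model, with the caveat that the model’s first layer must accept 64 input channels rather than 3.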
Formally, let $ \mathbf{x}\in {\mathrm{\mathbb{R}}}^{T\times C\times H\times W} $ be an input time series, where $ T $ is the length of the time series, $ C $ the number of channels of each image, and $ H $ and $ W $ their respective height and width dimensions. Our Processor module $ {p}_{\Theta}\left(\cdot \right) $, parameterized by $ \Theta $, can be used prior to any semantic segmentation model $ {f}_{\theta } $ parameterized by $ \theta $, via $ {f}_{\theta}\left({p}_{\Theta}\left(\mathbf{x}\right)\right) $. To evaluate the effectiveness of our approach, we use the Processor alongside U-Net and DeepLabv3+. The results of our experiments are detailed in Section 6.
4.3. Hierarchical loss
This section details the proposed hierarchical loss that leverages information about taxonomic hierarchies of tree species, genus, and families.
The dataset detailed in Section 3 contains a mix of finer (species-level) and coarser (genus-level) labels.
The taxonomic structure of these labels offers an opportunity to train a model while benefiting from such a hierarchical structure.
To exploit this hierarchy, we extend each label to multiple levels: species, genus, and higher-level taxon. The taxonomic hierarchy is illustrated in Figure 4, and a visual example of these labels is illustrated in Figure 5.

Figure 5. Example of the proposed three-level hierarchical label structure. The labels are concatenated to form semantic segmentation masks where each channel corresponds to a specific taxonomic level: species (5b), genus (5c), and higher-level taxon (5d). Each image represents an area of approximately 20.1 m $ \times $ 20.1 m. In this example, there are three classes at the species and genus levels. However, the higher-level taxon only has two classes due to the aggregation of different trees under one class. Note that the colors used in this image do not conform to the color code shown in Table 1.
During training, the model predicts only the species-level labels for each pixel. These softmax probabilities at the species level are then aggregated according to our knowledge of the label taxonomies (see Figure 4) to generate first the genus level predictions (see Equation 4.3) and second the higher-level predictions (see Equation 4.5).
Note that our implementation of the hierarchical loss differs from certain related work presented in Section 2, where classes at all levels are predicted separately to compute the loss (Turkoglu et al., Reference Turkoglu, D’Aronco, Perich, Liebisch, Streit, Schindler and Wegner2021).
Formally, let $ \mathbf{x}\in {\mathrm{\mathbb{R}}}^{C\times H\times W} $ be a training example, $ {\mathbf{y}}_S\in {\left\{0,1\right\}}^{S\times H\times W} $ its one-hot ground truth, where $ S $ is the number of classes at the species level, and $ {f}_{\theta}\left(\mathbf{x}\right)={\mathbf{p}}_S $ the associated predictions. The cross-entropy loss function at the species level is defined as usual via

$$ {\boldsymbol{L}}_{\mathrm{species}}=-\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{s=1}^{S}{\mathbf{y}}_S^{\left(s,h,w\right)}\log {\mathbf{p}}_S^{\left(s,h,w\right)}, $$

where $ {\mathbf{p}}_S $ denotes the softmax probabilities over the species classes. The cross-entropy loss function at the genus level is then computed using the ground truth and predictions at the species level, aggregated over each genus, as

$$ {\boldsymbol{L}}_{\mathrm{genus}}=-\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{g=1}^{G}\left(\sum_{s\in {S}_g}{\mathbf{y}}_S^{\left(s,h,w\right)}\right)\log \left(\sum_{s\in {S}_g}{\mathbf{p}}_S^{\left(s,h,w\right)}\right), $$

where $ G $ is the number of classes at the genus level and $ {S}_g $ is the set of species-level classes corresponding to a given genus class $ g $. In the same vein, the cross-entropy loss function at the higher-level taxon is also obtained via the ground truth and predictions at the species level, as

$$ {\boldsymbol{L}}_{\mathrm{taxon}}=-\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{t=1}^{T}\left(\sum_{g\in {G}_t}\sum_{s\in {S}_g}{\mathbf{y}}_S^{\left(s,h,w\right)}\right)\log \left(\sum_{g\in {G}_t}\sum_{s\in {S}_g}{\mathbf{p}}_S^{\left(s,h,w\right)}\right), $$

where $ T $ is the number of classes at the higher-level taxon and $ {G}_t $ the set of genus-level classes corresponding to a given higher-level class $ t $.
The hierarchical loss function is formulated as

$$ {\boldsymbol{L}}_{\mathrm{hier}}={\lambda}_1{\boldsymbol{L}}_{\mathrm{species}}+{\lambda}_2{\boldsymbol{L}}_{\mathrm{genus}}+{\lambda}_3{\boldsymbol{L}}_{\mathrm{taxon}}, $$

where $ {\lambda}_1 $, $ {\lambda}_2 $, and $ {\lambda}_3 $ are the weights for the species, genus, and higher-level taxon losses, respectively, and $ {\boldsymbol{L}}_{\mathrm{species}} $, $ {\boldsymbol{L}}_{\mathrm{genus}} $, and $ {\boldsymbol{L}}_{\mathrm{taxon}} $ are the corresponding cross-entropy losses.
We set $ {\lambda}_1=1 $, $ {\lambda}_2=0.3 $, and $ {\lambda}_3=0.1 $ empirically, since we observed that giving more weight to the species-level loss helps the model prioritize fine-grained predictions while still benefiting from the hierarchical information. However, we have not attempted to fully optimize these values.
5. Experiments
5.1. Experimental setup
All methods detailed in Section 4 have been trained with normalized input data, either with the means and standard deviations of our dataset to train models from scratch, or with statistics of the datasets used for pretraining for models based on MS-COCO and ImageNet weights. All these experiments are performed on three-fold cross-validation sets to get a better understanding of model performance.
We employ the Adam optimizer (Kingma and Ba, Reference Kingma and Ba2015) for all models except Mask2Former, which is trained with the AdamW optimizer (Loshchilov and Hutter, Reference Loshchilov and Hutter2019) to maintain consistency with the original training methodology. We trained all models with a learning rate of $ 1\times {10}^{-4} $ with exponential learning rate decay for 300 epochs.
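A minimal sketch of this optimization setup is shown below, with a stand-in model; the exponential decay factor (gamma) is an assumption, as only the decay schedule type and epoch count are stated above.

```python
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Conv2d(3, 16, kernel_size=1)           # stand-in segmentation model
optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = ExponentialLR(optimizer, gamma=0.98)  # decay factor assumed

for epoch in range(300):
    # ... one pass over the training tiles: forward, loss, backward, step ...
    scheduler.step()                               # decay the learning rate
```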
We included rotation (in multiples of 90°) with horizontal flips as data augmentation to enhance the diversity of the training data. The batch sizes used for each model are detailed in Figure 6; these were set to the largest size that could fit within an NVIDIA RTX 8000 GPU. We train our models either using our proposed hierarchical loss, noted HLoss and described in Section 4.3, or using a combination of dice and cross-entropy losses, noted Dice+CE.

Figure 6. Batch sizes used for training.
The latter is a popular choice for segmentation tasks since the dice loss measures the overlap between the predicted and ground truth masks, while the cross-entropy loss quantifies the dissimilarity between the predicted and true class probabilities. We trained the Mask2Former model with the loss function proposed by its authors (Cheng et al., Reference Cheng, Misra, Schwing, Kirillov and Girdhar2022). This loss function improves the training efficiency by randomly sampling a fixed number of points in the labels and predictions.
The loss weighting scheme and other implementation details are kept consistent with the original implementation to ensure a fair comparison. Note that we did not run Mask2Former with HLoss or the Dice+CE loss, as the training would be much more computationally expensive, resulting in a smaller batch size.
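For reference, a common form of the Dice+CE combination described above is sketched below; equal weighting of the two terms and the smoothing constant are assumptions rather than our exact configuration.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-6):
    """Dice + cross-entropy: CE penalizes per-pixel class probabilities,
    while the dice term penalizes poor per-class mask overlap."""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))               # per-class overlap
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1 - ((2 * inter + eps) / (denom + eps)).mean()
    return ce + dice

loss = dice_ce_loss(torch.randn(2, 16, 64, 64),
                    torch.randint(0, 16, (2, 64, 64)))
```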
The performance of our models is evaluated with the Intersection over Union (IoU) metric, also known as the Jaccard index, which measures the overlap between the predicted and ground truth masks. Letting $ A $ and $ B $ be two sets, the IoU score is defined as

$$ \mathrm{IoU}\left(A,B\right)=\frac{\mid A\cap B\mid }{\mid A\cup B\mid }. $$
The mean IoU (mIoU) is computed by averaging the IoU scores across all classes. This metric provides a comprehensive assessment of the segmentation performance of a model, taking into account both the precision and recall.
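A straightforward way to compute these quantities from integer label maps is sketched below; whether classes absent from both prediction and ground truth are skipped or counted is an implementation choice, and here they are skipped.

```python
import torch

def per_class_iou(pred, target, num_classes):
    """Per-class IoU and its mean (mIoU) from integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().item()
        union = ((pred == c) | (target == c)).sum().item()
        if union > 0:                 # skip classes absent from both masks
            ious.append(inter / union)
    return ious, sum(ious) / len(ious)

pred = torch.randint(0, 16, (768, 768))
target = torch.randint(0, 16, (768, 768))
ious, miou = per_class_iou(pred, target, num_classes=16)
```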
5.2. Experiment configuration
We conduct a comprehensive set of experiments to thoroughly evaluate the performance of the considered methods:
• We compare models using either single-image or time-series inputs to evaluate the contribution of phenological information to the tree species segmentation task. The time series is composed of images from four different periods of the year (see Section 3). Note that both types of methods predict segmentation masks corresponding to a single image.
• We compare models with two different loss functions to demonstrate the value of leveraging taxonomic information through the HLoss against a standard combination of loss functions (Dice+CE).
• We conduct ablation studies to investigate the impact of different pretrained backbones on segmentation performance. For the CNN-based models, we experiment with ResNet-34, ResNet-50, and ResNet-101 backbones, whereas for the Mask2Former model, we use Swin-T and Swin-S backbones (Liu et al., Reference Liu, Lin, Cao, Hu, Wei, Zhang, Lin and Guo2021).
The results of these experiments are discussed in Section 6 where we compare results both quantitatively and qualitatively.
6. Results
6.1. Single-image input for semantic segmentation
For the single-image segmentation model, we compare the performance of DeepLabv3+ and U-Net architectures with ResNet backbones of varying depths (ResNet-34, ResNet-50, ResNet-101) and the Mask2Former architecture with the Swin-T and Swin-S backbones.
As seen in Table 2, both DeepLabv3+ and U-Net architectures show a consistent increase in performance with increasing backbone size, where the ResNet-101 model achieves the highest mIoU score. Comparing the loss functions, the proposed HLoss generally leads to higher mean IoU scores than the Dice+CE loss across most backbones and architectures. For instance, with the U-Net and ResNet-101 backbone, HLoss achieves a statistically significant improvement over Dice+CE (55.15 $ \pm $ 0.29 vs. 54.6 $ \pm $ 0.21). However, for some configurations, such as DeepLabv3+ with ResNet-101, the performance difference between HLoss and Dice+CE is smaller and not statistically significant, given the overlapping error margins. This suggests that while leveraging taxonomic information via HLoss is often beneficial, its impact can vary.
Table 2. Comparison of single image methods with different losses and backbones

Note. Performances are compared with IoU averaged over all the classes of the dataset (mIoU) for single-image models.
a Indicates models trained from scratch without using ImageNet weights (Deng et al., Reference Deng, Dong, Socher, Li, Li and Fei-Fei2009).
b Indicates Swin-based models using weights from the MS-COCO dataset (Lin et al., Reference Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick2014). All results are averaged across three-fold cross-validation, and the best result for each backbone is shown in bold text. The best model overall is highlighted in red.
We also observe in Table 2 that training models from scratch results in significantly lower mIoU scores compared to using pretrained ImageNet weights, highlighting the importance of transfer learning. When trained from scratch, both U-Net and DeepLabv3+ with ResNet-50 backbones achieved comparable results using either HLoss or Dice+CE, with all differences falling within the error margins. The Mask2Former models, trained with the loss of the original implementation and with pretrained weights from the MS-COCO dataset, perform better than the models trained from scratch; however, their performance does not match that of the CNN-based architectures.
While mIoU provides insights into spatial segmentation accuracy, we also evaluated the models using classification metrics (F1-score, precision, and recall). These results follow similar trends to the mIoU scores and are detailed in Appendix A.2.
6.2. Time-series input for semantic segmentation
For time-series inputs, we make use of the Processor module, detailed in Section 4.2.3, to extract spatio-temporal features, and evaluate its performance with the DeepLabv3+ and U-Net architectures. Among the time-series models incorporating the Processor module, the use of HLoss often results in mean IoU scores similar to those from the Dice+CE loss (Table 3). For instance, with the U-Net+Processor architecture (ResNet-101 backbone), HLoss (55.97 $ \pm $ 0.48) provides only a marginal and likely nonsignificant improvement compared to Dice+CE (55.04 $ \pm $ 0.47). Similarly, for DeepLabv3+ with the Processor and the identical backbone, the mean IoU achieved with HLoss is not statistically distinguishable from that of Dice+CE when accounting for their respective standard deviations.
Table 3. Comparison of time-series methods with different losses and backbones

Note. Performances are compared with IoU averaged over all the classes of the dataset (mIoU) for time-series models.
a Indicates models trained from scratch. All results are averaged across three-fold cross-validation, and the best result for each backbone is shown in bold text. The best model overall is highlighted in red.
Qualitative results comparing HLoss with the Dice+CE loss are illustrated in Figure 7, where HLoss demonstrates a better ability to discriminate between classes. Models trained using the Dice+CE loss exhibit some confusion among classes. HLoss reduces confusion among classes that do not belong to the same genus or higher-level taxon, as the model is penalized for incorrect predictions at all levels. The U-Net+Processor with a ResNet-101 backbone trained with HLoss achieves the best mIoU score among all models. Furthermore, the time-series models slightly outperform their single-image counterparts, indicating the importance of leveraging phenological patterns by incorporating temporal information for tree species segmentation. The classification metrics further support the advantages of temporal information, with the Processor+U-Net models showing balanced performance across F1-score, precision, and recall. A detailed analysis of these classification metrics is provided in Appendix A.2.

Figure 7. Qualitative results of the Dice+CE loss versus HLoss. This example compares the best-performing Processor+U-Net (ResNet-101) models trained with the Dice+CE loss and the proposed hierarchical loss (HLoss). Each image represents an area of approximately 20.1 m $ \times $ 20.1 m. First, 7a shows a sample image from the sequence, while 7b displays the corresponding ground truth annotation. Then, 7c depicts the segmentation output obtained by the model trained with the Dice+CE loss, and finally, 7d illustrates the output from the model trained with HLoss. The colors of the labels and predicted segments correspond to specific tree species, as indicated by the legend in Table 1. Upon closer inspection of the regions highlighted by the cyan circle, the model trained with the Dice+CE loss exhibits some confusion among classes, whereas the model trained with HLoss demonstrates improved discrimination between classes.
To gain a deeper understanding of how leveraging time-series data affects the performance of our models for individual species, we conduct a detailed analysis of the class-wise results for our best-performing single-image and time-series models. For the single-image model, we select the U-Net architecture with a ResNet-101 backbone, while for the time-series model, we choose the Processor+U-Net architecture, also with a ResNet-101 backbone. This allows for a fair comparison between the two approaches, as the main difference lies in the incorporation of temporal information through the Processor module. Table 4 presents the class-wise Intersection over Union (IoU) scores for both models, with the classes grouped into non-coniferous and coniferous categories. Note that we omit a class from this analysis: “Acer sp.,” a class composed of trees belonging to Striped Maple (ACPE), Red Maple (ACRU), or Sugar Maple (ACSA) that have not been assigned a fine-grained label by the annotators due to low confidence.
Table 4. The table shows the IoU for individual classes for our best-performing Processor + U-Net and U-Net models, both with ResNet-101 as backbone

Note. All results are averaged across three-fold cross-validation. The classes are grouped into non-coniferous and coniferous categories, with the color shown for each class corresponding to the color code in Table 1. The last row presents the metrics from Table 2 and Table 3 as a reference. These metrics represent the average performance across all classes over the three folds, not the average of the values shown in this table. We observe that incorporating time-series data improves the segmentation performance for most of the individual tree species. This performance gain is more pronounced for non-coniferous trees.
While the overall mIoU shows a statistically significant advantage for the time-series approach, the class-wise results reveal a more complex picture with considerable variability. This class-level analysis reveals where the overall statistically significant mIoU improvement for the time-series model originates. While the performance advantage was statistically significant for specific classes like Red Maple (ACRU) and Eastern Hemlock (TSCA), the time-series model achieved comparable performance to the single-image model for the majority of other species (e.g., Populus, ACPE (Striped Maple), ACSA (Sugar Maple), BEAL (Yellow Birch), BEPA (Paper Birch), FAGR (American Beech), PIST (Eastern White Pine), Picea, THOC (Eastern White Cedar), LALA (Tamarack)), with differences not being statistically significant based on our analysis.
This indicates that for many classes, the temporal information allowed the model to maintain a high level of accuracy similar to the strong single-image baseline. Although the single-image model did perform better for Balsam Fir (ABBA) and the DEAD tree class, the overall significant mIoU improvement for the time-series approach stems from the combination of specific significant gains and comparable performance across most other classes, which supports the value of incorporating temporal data for this task.
An example of the results comparing single-image and time-series models is illustrated in Figure 8, where using temporal information helps the model differentiate between tree species that undergo senescence at slightly different times. Red maple trees are among the earliest trees to show color changes in the fall, and the single-image model misclassifies a Swamp Birch as a Red Maple. This misclassification can be attributed to the lack of temporal context, which is necessary to understand the correlation between tree species and the timing of their senescence.
We also test the generalization capability of our best-performing time-series model on a dataset from a different region of Quebec, which has similar ecological characteristics as the training area. An in-depth explanation has been provided in Appendix A, and the results can be seen in Figure A1.

Figure 8. Qualitative results of the single-image versus time-series inputs. This example compares the best-performing models with single-image (SI) and time-series (TS) inputs for tree species segmentation. Each image represents an area of approximately 20.1 m $ \times $ 20.1 m. First, 8a shows a sample image from the sequence, while 8b displays the corresponding ground truth annotation. Then 8c depicts the segmentation output obtained by the single-image model, and finally 8d illustrates the output from the time-series model. The colors of the labels and predicted segments correspond to specific tree species, as indicated by the legend in Table 1. Upon comparing the results, we observe that here the time-series model outperforms the single-image model in correctly predicting the classes. In the instance highlighted by the cyan circle, the time-series model accurately identifies the Swamp Birch, while the single-image model misclassifies it as Red Maple.
7. Discussion
This work advances the field of forest monitoring through several contributions that build upon and extend previous research in tree semantic segmentation using a time series of images. The slight yet statistically significant gain in performance observed for U-Net models with HLoss when comparing our best time-series model to its single-image counterpart validates the importance of incorporating phenological information (Zhou et al., Reference Zhou, Xin, Dai and Li2021; Wang et al., Reference Wang, Song, Liddell, Morellato, Calvin, Yang, Alberton, Detto, Ma, Zhao, Henry, Zhang, Ng, Nelson, Huete and Wu2023). Previous studies have achieved success in tree species classification using single high-resolution images (Fricker et al., Reference Fricker, Ventura, Wolf, North, Davis and Franklin2019; Zhang et al., Reference Zhang, Xia, Feng, Yang and Du2020). Our results, however, demonstrate that incorporating temporal data can improve discrimination, with this enhancement being particularly significant for specific species such as Red Maple (ACRU) and Eastern Hemlock (TSCA). The ability to capture distinct seasonal changes makes this approach especially valuable for deciduous trees like Red Maple.
The lightweight Processor module offers a practical solution to a key challenge in remote sensing: the need for specialized architectures that can handle temporal data while leveraging pretrained models (Cao et al., Reference Cao, Sun, Jiang, Li and Xin2021; Kattenborn et al., Reference Kattenborn, Leitloff, Schiefer and Hinz2021). The significant performance gap between models trained from scratch versus those using pretrained weights reinforces the value of transfer learning in forest monitoring applications (Bountos et al., Reference Bountos, Ouaknine, Papoutsis and Rolnick2025).
The effectiveness of our hierarchical loss function, which often led to performance gains compared to a standard Dice+CE loss, builds upon the previous work in hierarchical classification (Bertinetto et al., Reference Bertinetto, Mueller, Tertikas, Samangooei and Lord2020; Muller and Smith, Reference Muller and Smith2020; Valmadre, Reference Valmadre2022), addressing the specific challenges in forest monitoring where species-level identification may not always be possible or necessary. This observed tendency for improvement suggests that the HLoss approach could be particularly valuable for large-scale forest monitoring applications.
Our work could enable more accurate forest inventories and better monitoring of species distribution changes in response to climate change. However, some manual intervention is still required, particularly in the initial data acquisition phase, where high-quality aerial imagery must be collected at specific temporal intervals to capture phenological changes. The collection of ground truth data for model training also remains a labor-intensive process, requiring expert knowledge for accurate annotation for species identification.
While our results demonstrate promising capabilities for automated forest monitoring, several practical challenges remain. Crown delineation accuracy can vary significantly with canopy density and image quality (Weinstein et al., Reference Weinstein, Marconi, Aubry-Kientz, Vincent, Senyondo and White2020). We used a uniform learning rate across architectures for experimental consistency; however, an exhaustive ablation study exploring architecture-specific learning rates could potentially yield improved results, particularly for transformer-based models. The computational costs of such a study, coupled with the need to establish reliable standard deviations, led us to adopt our current approach, though we acknowledge this may result in conservative performance estimates. The Processor module’s fixed time-step requirement, while effective for our dataset, may limit applicability to regions with different temporal sampling frequencies or irregular acquisition patterns (Rußwurm et al., Reference Rußwurm, Courty, Emonet, Lefèvre, Tuia and Tavenard2023). Future work could explore integrating attention mechanisms to better handle longer time series (Garnot and Landrieu, Reference Garnot and Landrieu2021; Sainte Fare Garnot et al., Reference Sainte Fare Garnot, Landrieu, Giordano and Chehata2020), extending the hierarchical loss approach to incorporate additional ecological relationships beyond taxonomic structure, and developing more flexible temporal processing architectures (Cai et al., Reference Cai, Bi, Nicholl and Sterritt2023; Tarasiou et al., Reference Tarasiou, Chavez and Zafeiriou2023). Such improvements would enhance the model’s ability to handle variable-length time series and irregular sampling patterns, making it more adaptable to different forest monitoring scenarios and geographic regions.
Despite these limitations, the Processor module offers a simple yet effective approach to leveraging temporal information in tree species segmentation. Moreover, the compact design of the Processor module allows for efficient computation and reduces the overall complexity of the model, making it suitable for resource-constrained scenarios.
8. Conclusion
In this work, we developed a comprehensive approach for tree species segmentation using aerial image time series, demonstrating the advantages of incorporating temporal information and taxonomic knowledge. By combining a lightweight temporal processing module with a hierarchical loss, our approach often improved species discrimination, achieving statistically significant gains in key comparisons while maintaining the benefits of existing pretrained models. The framework’s ability to effectively leverage phenological changes and taxonomic relationships provides a robust foundation for large-scale forest monitoring applications.
Climate change affects different tree species in varying ways, from altered phenological patterns to shifts in habitat suitability, making it essential to track changes at both species and broader taxonomic levels to understand ecosystem-wide responses. Our framework’s ability to work with both species-level and higher taxonomic classifications enables monitoring at multiple scales, supporting both detailed species-specific studies and broader assessments of forest composition change. The proposed methods have significant implications for forest monitoring and biodiversity conservation, enabling accurate mapping of tree species composition, crucial for understanding forest ecosystems, monitoring changes over time, and informing conservation strategies. Future research could explore the incorporation of additional data modalities, addressing the limitations of the Processor module mentioned in Section 7, and the extension of the methods to other applications in forest ecology and management.
This work opens new possibilities for integrating remote sensing with ecological research. The combination of temporal analysis and hierarchical classification could serve as a foundation for studying species distribution shifts, phenological changes, and ecosystem responses to environmental stressors. These capabilities will be essential for developing evidence-based conservation strategies and understanding the ongoing impacts of climate change on forest ecosystems.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10013.
Acknowledgments
The authors are grateful for support from M. Cloutier for her guidance in understanding the intricacies of the dataset. In addition, we acknowledge material support from the Mila Quebec AI Institute and from NVIDIA Corporation in the form of computational resources.
Author contribution
Conceptualization: V.R.; A.O.; D.R. Methodology: V.R.; A.O.; D.R. Data curation: V.R. Data visualization: V.R. Writing original draft: V.R. Writing—Review and Editing: V.R.; A.O.; D.R. All authors approved the final submitted draft.
Competing interests
The authors declare none.
Data availability statement
The dataset used in our work is published by the original authors here: https://doi.org/10.5281/zenodo.8148479. Our code can be found at https://github.com/RolnickLab/Forest-Monitoring.
Ethics statement
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Funding statement
This work was funded through the IVADO program on “AI, Biodiversity and Climate Change” and the Canada CIFAR AI Chairs program.
A. Appendix
A.1. Spatial transferability
To assess the geographic generalization capabilities of our model, we conducted additional experiments using aerial imagery from the municipality of Stornoway, located in Quebec’s Le Granit regional county municipality within the administrative region of Estrie. This location was specifically chosen for evaluation as it shares similar ecological characteristics with our training site in the Laurentides region, particularly in terms of tree species composition and forest structure typical of Quebec’s temperate-mixed forests.

Figure A1. Evaluation of spatial transferability in Stornoway, Quebec. The figure presents paired comparisons of original input imagery (left) and corresponding model predictions (right) from a geographically distinct test location. Each image represents an area of approximately 20.1 m $ \times $ 20.1 m. Here, we used the time-series model with a single image replicated across four time steps due to the scarcity of relevant time-series datasets for tree species segmentation. Tree species are color-coded according to the scheme established in Table 1. The model is effective in segmenting and delineating tree crowns in this new location, particularly for distinguishing between neighboring trees with different species compositions. This transfer to a new geographic location, while within a similar ecological zone, suggests the model’s potential for broader regional application in forests of similar tree composition.
For generating the predictions, we chose the best-performing Processor+U-Net model with a ResNet-101 backbone. Given the scarcity of high-resolution time-series datasets for tree species segmentation and the difficulty of collecting them, we replicated a single image across four time steps to form the input to our time-series model. The results of these predictions are shown in Figure A1. While a comprehensive quantitative evaluation was not possible due to the absence of ground-truth labels for this region, qualitative assessment shows that the model robustly detects individual trees and accurately delineates crown boundaries, even in areas with dense canopy cover and varying lighting conditions.
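Concretely, this replication step amounts to tiling one image along a new temporal axis before passing it to the model. The following is a minimal PyTorch sketch, assuming the time-series model expects input of shape (B, T, C, H, W); the tile size, the `replicate_timesteps` name, and the commented model call are illustrative.

```python
import torch

def replicate_timesteps(image, n_steps=4):
    """Turn a single image (C, H, W) into a pseudo time series (1, T, C, H, W)."""
    series = image.unsqueeze(0).repeat(n_steps, 1, 1, 1)  # (T, C, H, W)
    return series.unsqueeze(0)                            # (1, T, C, H, W)

x = torch.rand(3, 768, 768)             # a single RGB tile (illustrative size)
batch = replicate_timesteps(x, n_steps=4)
# prediction = model(batch)             # Processor+U-Net, loaded elsewhere
```

Because every time step is identical, the Processor module sees no phenological variation here; the experiment therefore probes spatial generalization only, not the temporal benefits reported in the main text.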
The model’s ability to maintain consistent performance in distinguishing between different tree species suggests that the learned features are sufficiently generalizable across similar forest ecosystems within Quebec. However, it is important to acknowledge that this evaluation is limited to regions with comparable ecological conditions.
A.2. Classification metrics
In addition to the mIoU metric presented in the main text, Tables A1 and A2 show F1-score, precision, and recall metrics for our single-image and time-series models, respectively. These metrics provide complementary insights into our models’ classification performance.
Table A1. Comparison of single image methods with different classification metrics

Note. Performances are compared using F1-score, Precision, and Recall averaged over all the classes of the dataset. For DeepLabv3+ and U-Net models, each backbone shows two rows of results: HLoss metrics in the first row and Dice Loss metrics in the second row.
Table A2. Comparison of time-series methods with different classification metrics

Note. Performances are compared using F1-score, Precision, and Recall averaged over all the classes of the dataset. For each model and backbone combination, the first row shows results using HLoss metrics and the second row shows results using Dice Loss metrics.
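For reference, the class-averaged scores reported in both tables correspond to computing per-class precision, recall, and F1 from the per-pixel prediction counts and averaging over classes. Below is a minimal numpy sketch under that assumption; the function name is illustrative.

```python
import numpy as np

def macro_classification_metrics(y_true, y_pred, n_classes):
    """Class-averaged precision, recall, and F1 from flattened pixel labels."""
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)
```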
For single-image models (Table A1), the U-Net architecture with a ResNet101 backbone trained with HLoss achieved the highest mean scores (F1: 71.09 $ \pm $ 0.24, Precision: 69.43 $ \pm $ 0.63, Recall: 71.89 $ \pm $ 0.67). Compared to the same architecture trained with Dice+CE loss (F1: 70.64 $ \pm $ 0.18, Precision: 69.49 $ \pm $ 0.68, Recall: 71.83 $ \pm $ 0.73), the HLoss model showed a likely statistically significant improvement in F1-score. However, the differences in mean Precision and Recall were small relative to their standard deviations, with overlapping error ranges suggesting these metrics were comparable between the two loss functions for this model. The performance degradation observed in models trained from scratch (denoted by a) was consistent across all classification metrics, reinforcing the importance of transfer learning.
In time-series approaches (Table A2), the integration of temporal information via our Processor module generally led to strong classification performance, particularly when combined with U-Net and HLoss. The Processor+U-Net model with a ResNet101 backbone and HLoss achieved the highest mean scores across all metrics (F1: 71.77 $ \pm $ 0.39, Precision: 72.28 $ \pm $ 1.20, Recall: 71.27 $ \pm $ 1.17). Compared to the same model trained with Dice+CE (F1: 71.00 $ \pm $ 0.39, Precision: 71.66 $ \pm $ 1.00, Recall: 70.35 $ \pm $ 0.96), the HLoss version showed slight improvements in F1 and Recall, while Precision was comparable. Specialized time-series architectures like UNet 3D and U-TAE, while designed to capture temporal patterns, achieved lower classification scores (F1-scores around 55 $ \pm $ 0.55 and 55 $ \pm $ 0.37), likely due to the lack of pretrained weights and the challenge of training such architectures from scratch on limited data.
Overall, the analysis of classification metrics aligns with the mIoU findings. The benefits of the Processor module and HLoss are evident. However, the improvements are not uniformly significant across all metrics or all model comparisons when considering the error margins from the three-fold cross-validation.
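To illustrate how such comparisons can be assessed, the sketch below applies a paired t-test to hypothetical per-fold F1 scores. The values are invented for demonstration only; with three folds any such test has limited power, which is why overlapping error ranges are interpreted cautiously above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold F1 scores from three-fold cross-validation.
hloss_f1 = np.array([71.3, 70.9, 71.1])
dice_f1 = np.array([70.8, 70.5, 70.6])

mean_h, std_h = hloss_f1.mean(), hloss_f1.std(ddof=1)
mean_d, std_d = dice_f1.mean(), dice_f1.std(ddof=1)
print(f"HLoss: {mean_h:.2f} +/- {std_h:.2f}, Dice+CE: {mean_d:.2f} +/- {std_d:.2f}")

# Paired t-test across the same folds; a small p-value would indicate that
# the fold-to-fold improvement is unlikely to be due to chance alone.
t, p = stats.ttest_rel(hloss_f1, dice_f1)
print(f"t = {t:.2f}, p = {p:.3f}")
```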
Comments
Dear Editors,
We are pleased to submit our manuscript titled “Tree semantic segmentation from aerial image time-series” for consideration in the Environmental Data Science journal.
Climate change and forest biodiversity loss are closely intertwined, with each exacerbating the other. As the climate changes, the suitable habitat for many tree species shifts, shrinks, or disappears altogether, leading to changes in forest composition and potential biodiversity loss. Accurate identification of tree species is crucial for effective forest monitoring and biodiversity conservation efforts.
Our work addresses this challenge by leveraging deep learning techniques and aerial image time-series data to perform semantic segmentation of individual tree species. By incorporating temporal information and taxonomic hierarchies, our proposed methods significantly improve tree species identification accuracy. We also introduce a compact spatio-temporal feature extraction module that enables the use of pretrained models, making the approach more accessible and efficient for researchers and practitioners.
The methods presented in this paper have the potential to advance forest monitoring practices and contribute to a better understanding of forest ecosystems in the face of global environmental challenges. Accurate mapping of tree species composition can inform conservation strategies, help monitor changes in forest health over time, and provide valuable insights into the impacts of climate change on biodiversity.
We believe our findings will be of interest to a wide audience, including researchers in the fields of remote sensing, ecology, and environmental science, as well as practitioners involved in forest management and conservation efforts. The proposed techniques could also be extended to other applications in forest ecology and natural resource management.