
Tree semantic segmentation from aerial image time series

Published online by Cambridge University Press:  29 July 2025

Venkatesh Ramesh*
Affiliation:
Mila, Quebec AI Institute, Montréal, QC, Canada; Département d’informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC, Canada
Arthur Ouaknine
Affiliation:
Mila, Quebec AI Institute, Montréal, QC, Canada; School of Computer Science, McGill University, Montréal, QC, Canada
David Rolnick
Affiliation:
Mila, Quebec AI Institute, Montréal, QC, Canada; School of Computer Science, McGill University, Montréal, QC, Canada
Corresponding author: Venkatesh Ramesh; Email: venka97@gmail.com

Abstract

Earth’s forests play an important role in the fight against climate change and are in turn negatively affected by it. Effective monitoring of different tree species is essential to understanding and improving the health and biodiversity of forests. In this work, we address the challenge of tree species identification by performing tree crown semantic segmentation using an aerial image dataset spanning over a year. We compare models trained on single images versus those trained on time series to assess the impact of tree phenology on segmentation performance. We also introduce a simple convolutional block for extracting spatio-temporal features from image time series, enabling the use of popular pretrained backbones and methods. We leverage the hierarchical structure of tree species taxonomy by incorporating a custom loss function that refines predictions at three levels: species, genus, and higher-level taxa. Our best model achieves a mean Intersection over Union (mIoU) of 55.97%, outperforming single-image approaches, particularly for deciduous trees, where phenological changes are most noticeable. Our findings highlight the benefit of exploiting the time series modality via our Processor module. Furthermore, leveraging taxonomic information through our hierarchical loss function often, and in key cases significantly, improves semantic segmentation performance.

Information

Type
Application Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Example of an annotated sample from the studied dataset. One image shows a scene captured on September 2nd, while the other overlays the tree species labels on the same scene. Each tree species is represented by a distinct color, as seen in Table 1.


Figure 2. Spatial splits of the dataset. The image on the left depicts the entire region where the aerial imagery was captured, while the image on the right shows the different subregions used to train, evaluate, and test models from one fold of cross-validation. The training, validation, and test regions are each shown in a distinct color. To prevent data leakage between the subsets, a buffer tile is omitted between adjacent regions. This spatial partitioning ensures that the model’s performance is assessed on geographically distinct areas, simulating real-world scenarios where the model would be applied to unseen locations.


Figure 3. Distribution of the selected classes in the dataset. We observe that there is a substantial difference in the frequency of occurrence of each tree species. The common and scientific names used for the abbreviations are detailed in Table 1.


Figure 4. Taxonomic hierarchy of tree species. The hierarchical structure is visually represented using a tree diagram. Blue nodes represent the species level, the most fine-grained classification in the hierarchy. Red nodes denote the genus level, which groups together closely related species. Finally, green nodes group the higher-level taxon, the broadest classification level, which encompasses multiple genera and families. This structure of labels allows the models to learn more comprehensive relationships between different tree species at multiple levels of granularity. The full names of each abbreviation are detailed in Table 1.


Table 1. Tree species names and their abbreviations


Figure 5. Example of the proposed three-level hierarchical label structure. The labels are concatenated to form semantic segmentation masks where each channel corresponds to a specific taxonomic level: species (5b), genus (5c), and higher-level taxon (5d). Each image represents an area of approximately 20.1 m × 20.1 m. In this example, there are three classes at the species and genus levels. However, the higher-level taxon only has two classes due to the aggregation of different trees under one class. Note that the colors used in this image do not conform to the color code shown in Table 1.


Figure 6. Batch sizes used for training.


Table 2. Comparison of single image methods with different losses and backbones


Table 3. Comparison of time-series methods with different losses and backbones


Figure 7. Qualitative results of the Dice+CE loss versus HLoss. This example compares the best-performing Processor+UNet (ResNet101) models trained with the Dice+CE loss and the proposed hierarchical loss (HLoss). Each image represents an area of approximately 20.1 m × 20.1 m. First, 7a shows a sample image from the sequence, while 7b displays the corresponding ground truth annotation. Then, 7c depicts the segmentation output obtained by the model trained with the Dice+CE loss, and finally, 7d illustrates the output from the model trained with HLoss. The colors of the labels and predicted segments correspond to specific tree species, as indicated by the legend in Table 1. Upon closer inspection of the regions highlighted by the cyan circle, the model trained with the Dice+CE loss exhibits some confusion among classes, whereas the model trained with HLoss demonstrates improved discrimination between classes.


Table 4. The table shows the IoU for individual classes for our best-performing Processor + U-Net and U-Net models, both with ResNet-101 as backbone


Figure 8. Qualitative results of the single-image versus time-series inputs. This example compares the best-performing models with single-image (SI) and time-series (TS) inputs for tree species segmentation. Each image represents an area of approximately 20.1 m × 20.1 m. First, 8a shows a sample image from the sequence, while 8b displays the corresponding ground truth annotation. Then 8c depicts the segmentation output obtained by the single-image model, and finally 8d illustrates the output from the time-series model. The colors of the labels and predicted segments correspond to specific tree species, as indicated by the legend in Table 1. Upon comparing the results, we observe that here the time-series model outperforms the single-image model in correctly predicting the classes. In the instance highlighted by the cyan circle, the time-series model accurately identifies the Swamp Birch, while the single-image model misclassifies it as Red Maple.


Figure A1. Evaluation of spatial transferability in Stornoway, Quebec. The figure presents paired comparisons of original input imagery (left) and corresponding model predictions (right) from a geographically distinct test location. Each image represents an area of approximately 20.1 m × 20.1 m. Here, we used the time-series model with a single image replicated across four time steps due to the scarcity of relevant time-series datasets for tree species segmentation. Tree species are color-coded according to the scheme established in Table 1. The model is effective in segmenting and delineating tree crowns in this new location, particularly for distinguishing between neighboring trees with different species compositions. This transfer to a new geographic location, while within a similar ecological zone, suggests the model’s potential for broader regional application in forests of similar tree composition.


Table A1. Comparison of single image methods with different classification metrics


Table A2. Comparison of time-series methods with different classification metrics

Author comment: Tree semantic segmentation from aerial image time series — R0/PR1

Comments

Dear Editors,

We are pleased to submit our manuscript titled “Tree semantic segmentation from aerial image time-series” for consideration in the Environmental Data Science journal.

Climate change and forests are closely intertwined, with each issue exacerbating the other. As the climate changes, the suitable habitat for many tree species shifts, shrinks, or disappears altogether, leading to changes in forest composition and potential biodiversity loss. Accurate identification of tree species is crucial for effective forest monitoring and biodiversity conservation efforts.

Our work addresses this challenge by leveraging deep learning techniques and aerial image time-series data to perform semantic segmentation of individual tree species. By incorporating temporal information and taxonomic hierarchies, our proposed methods significantly improve tree species identification accuracy. We also introduce a compact spatio-temporal feature extraction module that enables the use of pretrained models, making the approach more accessible and efficient for researchers and practitioners.

The methods presented in this paper have the potential to advance forest monitoring practices and contribute to a better understanding of forest ecosystems in the face of global environmental challenges. Accurate mapping of tree species composition can inform conservation strategies, help monitor changes in forest health over time, and provide valuable insights into the impacts of climate change on biodiversity.

We believe our findings will be of interest to a wide audience, including researchers in the fields of remote sensing, ecology, and environmental science, as well as practitioners involved in forest management and conservation efforts. The proposed techniques could also be extended to other applications in forest ecology and natural resource management.

Review: Tree semantic segmentation from aerial image time series — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

The approach to quantifying the impact of different methodologies/architectures leveraging temporal context for tree species segmentation in orthophoto time series is interesting and very relevant.

The paper is well-written and has a clear train of thought.

However, a major concern is the validation strategy in this study. The model performance results cannot be interpreted with confidence because there is only one split of the dataset into train, test, and validation subsets. The current split may be a lucky or an unlucky choice, severely inflating or underestimating the model performance. This is especially relevant given that the training dataset only covers one location. The authors need to perform a k-fold cross-validation across the k segments of the orthophoto with k>=3. Under/overrepresentations of species in splits can be addressed through over/undersampling.

Given that the orthophotos also only cover one location, it is also important to discuss the limited representativeness of the performance, regardless of the methodology. It may be advisable to visually sanity check how the model(s) perform in other orthophotos, possibly with a similar tree species composition.

Recommendation: Tree semantic segmentation from aerial image time series — R0/PR3

Comments

Dear Authors,

I have sent your paper out for review, and one of the reviewers has provided prompt feedback, raising a key concern that I also share: the validation strategy in your paper appears to be flawed. As it stands, I must reject the paper. However, if you are able to address and clarify this issue, you are welcome to resubmit the paper for reconsideration.

Best regards,

Miguel Mahecha

Decision: Tree semantic segmentation from aerial image time series — R0/PR4

Comments

No accompanying comment.

Author comment: Tree semantic segmentation from aerial image time series — R1/PR5

Comments

Dear editors and reviewer,

We sincerely thank you for your constructive feedback on our manuscript “Tree semantic segmentation from aerial image time series” (EDS-2024-0080). We have carefully considered all the comments from you, Dr. Mahecha, and the reviewer, and have implemented the suggested improvements to our manuscript.

We thank the reviewer for acknowledging the relevance of our work and its clear presentation. The primary concern raised was about our validation strategy, and we have addressed it comprehensively through the following changes:

Implementation of Cross-Validation

Completely restructured the experimental validation using three-fold cross-validation, as suggested.

Updated Section 3 (Dataset) to detail our cross-validation methodology. For three-fold cross-validation, the test region remains consistent across folds while the training and validation regions are carefully selected to preserve the distribution of tree species. We create the train and validation splits by dividing the remaining area either vertically or horizontally, with a buffer tile between regions to prevent spatial autocorrelation and data leakage.
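As a concrete illustration of this splitting scheme, the sketch below partitions a grid of tile columns into a fixed test region and train/validation regions separated by buffer tiles. This is a hedged sketch, not the authors' code: the column indexing, the fixed test region at one edge, and the 80/20 train/val ratio (consistent with the 64%/16%/20% overall split mentioned in the reviews) are assumptions.

```python
def spatial_split(n_cols: int, test_cols: int, buffer: int = 1):
    """Return (train, val, test) tile-column indices with buffer tiles omitted.

    The test region (the last `test_cols` columns) stays fixed across folds;
    the remaining columns are divided between train and validation, dropping
    `buffer` columns at each boundary to limit leakage between subsets.
    """
    test = list(range(n_cols - test_cols, n_cols))
    # Columns available for train/val, leaving a buffer before the test region.
    usable = list(range(0, n_cols - test_cols - buffer))
    split = int(0.8 * len(usable))
    train = usable[:split - buffer]  # drop a buffer tile before the val region
    val = usable[split:]
    return train, val, test
```

Swapping which side of the usable area becomes validation (or splitting horizontally instead of vertically) would yield the other cross-validation folds while the test region stays fixed.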

Updated Results and Analysis

Revised Tables 2, 3, and 4 to reflect results across all cross-validation folds.

Added standard deviations to demonstrate result stability across folds.

Updated all relevant figure captions to reflect the cross-validation approach.

Geographical Generalization

Added a new section in the Appendix (A.1 Spatial Transferability).

Included evaluation results from a different geographical location (Stornoway, Quebec).

Added qualitative analysis showing the generalization capacity of our best performing time-series model on this new region.

Discussed limitations and potential for broader regional application.

We thank you again for the opportunity to revise our manuscript and look forward to your feedback on these improvements.

Sincerely,

Venkatesh Ramesh, Arthur Ouaknine, and David Rolnick

Detailed Response to Reviewer Comments:

Editor (Dr. Mahecha):

“The validation strategy in your paper appears to be flawed.”

We have redesigned our validation strategy using three-fold cross-validation with spatially separated splits. This provides a more robust evaluation of our methods and addresses the core concern about validation reliability. We also provide additional qualitative results on a new region showing the generalization capacity of the model.

Reviewer 1:

“The model performance results cannot be interpreted with confidence because there is only one split of the dataset”

We agree with this observation and have implemented k-fold cross-validation (k=3) across spatially separated segments of the orthophoto as detailed in Section 3 of the manuscript. Our updated results now include standard deviations across folds, providing a more reliable measure of model performance (see Tables 2, 3 and 4 of the manuscript). The consistent superior performance of our proposed approach across all cross-validation folds reinforces the robustness of our methodological contributions.

“Given that the orthophotos also only cover one location it is also important to discuss the limited representativeness of the performance, regardless of the methodology. It may be advisable to visually sanity check how the model(s) perform in other orthophotos, possibly with a similar tree species composition.”

We have addressed this limitation by adding a new section testing our models on orthophotos from Stornoway, Quebec, sharing similar ecological characteristics with our training location. In Appendix A of the manuscript, we provide qualitative analysis of performance in this new region while discussing the geographical scope and limitations of our approach.

For this evaluation, we employed our best-performing architecture (U-Net + Processor with ResNet-101 backbone). Due to the difficulty and cost of collecting aerial imagery over multiple time periods, no true time-series datasets of tree segmentation from aerial photos currently exist apart from the one we used for training our models. As a workaround, we replicated single aerial images across different time points to simulate a time series, so that we could apply our model trained on time series data. Despite using replicated data rather than true temporal data, we still chose to use the time-series model because it significantly outperformed our single-image model in quantitative evaluations. Even without ground truth labels for quantitative assessment, visual analysis demonstrates that our model effectively segments and delineates individual tree crowns in this new geographical context. This qualitative evaluation provides encouraging evidence of our model’s generalization capabilities within comparable ecological zones.
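The replication workaround described above can be sketched as follows. This is a hedged illustration: the (T, C, H, W) tensor layout and T = 4 are assumptions about the model's expected input, matching the four time steps mentioned in the Figure A1 caption.

```python
import numpy as np

def replicate_as_timeseries(image: np.ndarray, t: int = 4) -> np.ndarray:
    """Stack a single (C, H, W) image into a pseudo time series (T, C, H, W).

    Every time step holds an identical copy of the input, allowing a model
    trained on true time series to be applied where only one image exists.
    """
    return np.repeat(image[np.newaxis, ...], t, axis=0)
```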

We believe these revisions have substantially improved the scientific rigor of our work while maintaining its original contributions to the field.

Review: Tree semantic segmentation from aerial image time series — R1/PR6

Conflict of interest statement

Reviewer declares none.

Comments

The authors analyze the potential of orthophoto time series, hierarchical loss functions, and different model architectures for tree species segmentation in high-resolution orthophotos in a region in Quebec.

The main criticism is that the authors frequently argue that some configurations outperform others when the compared performance results lie within each other’s reported error margins. This is the case for about half of the model or loss function comparisons. Most other comparisons show marginal differences. Hence, the claims are not sufficiently backed up by the measured metrics, even though the findings are within expectations.

Given existing literature and the additional information that multiple orthophotos include, time series models should outperform single-image models. I am confident that, with correct preprocessing and training methodology, significant performance differences can be measured. This is especially true as the single-image models process images before browning, which should make prediction much harder compared to images during senescence, as different species have different browning patterns, as also pointed out by the authors.

If, surprisingly, time series models do not provide performance improvements over single-image models, then this should be the result of the paper. However, this would be a finding that very much contradicts existing literature and intuition.

### Major

- The term “tree crown segmentation” is a general phrase that would typically be associated with only differentiating tree from non-tree pixels. I recommend emphasizing in the Title, Abstract, and Introduction that you perform “hierarchical tree species segmentation.”

- Page 10, Line 49: Normally, the same learning rate is not applicable to different batch sizes. Additionally, transformer architectures are commonly trained with a learning rate warm-up, as seen in the Swin Transformer paper. The resulting model scores may be underestimates. This needs to be at least mentioned.

- Page 3, Line 5ff: The sentence indicates that takeaways (i.e., findings) are described. Only the second bullet point could be counted as a finding. The other two points describe what has been done in the paper. Please adjust this.

- Please ensure that each paragraph consists of more than one sentence.

- Please visualize and/or detail the exact cross-validation splits and how buffer tiles work. This could be part of Figure 2.

- Given that the hierarchical loss function computes a different loss depending on whether the genus is predicted correctly, it may make sense to analyze the models in this regard. That is, does the genus classification accuracy improve with the HLoss function compared to the Dice+CE loss function?

### Minor

- This study needs to be cited: Beloui et al. 2023, Individual Tree-Crown Detection and Species Identification in Heterogeneous Forests Using Aerial RGB Imagery and Deep Learning

- Page 2, Line 19: “As the climate changes, the suitable habitat for many tree species shifts, shrinks, or disappears altogether”: With global warming, suitable growing locations for trees also expand in some regions.

- Page 2, Line 20: This more recent study could be cited in this context: Mahecha, Miguel D., et al. “Biodiversity and climate extremes: known interactions and research gaps.” Earth’s Future 12.6 (2024): e2023EF003963.

- Page 2, Line 27ff: Why is the understanding phenologically motivated in the first paragraph when the paper focuses on tree species segmentation?

- Page 2, Line 39: Semantic segmentation of tree crowns does not yield information about forest health or composition. See first major comment.

- Page 4, Line 38: Given multiple flights, the orthophotos will at least have slight differences in real-world resolution. What steps were taken for resampling to align the images in space and time to yield a standardized tensor?

- Page 5, Line 33: Given the small area and homogeneity of the dataset, it is likely impossible to “prevent” spatial autocorrelation.

- Page 5, Line 38: How large exactly is the mentioned sufficient spatial context?

- Page 5, Line 45: How many classes are ignored and how are these handled? Are these merged into a single class or merged into the background class?

- Page 7, Figure 3: Unit for y-axis should be “Count [#],” not “Frequency”

- Page 9, Line 23: What is the rationale or trade-off between having each level predicted separately or in one go?

- Figure 4: It would be nice to be able to see the names of each level, i.e., genus, high-level taxon, species.

- Suggestion: subsection 5.2 could potentially be the introduction for Section 6.

Review: Tree semantic segmentation from aerial image time series — R1/PR7

Conflict of interest statement

Reviewer declares none.

Comments

Review: Tree semantic segmentation from aerial image time series

This paper focuses on semantic segmentation of tree species using aerial image time series, aiming to improve forest monitoring. The authors evaluate single-image and time-series models, highlighting the role of phenological patterns (e.g., seasonal changes) in distinguishing tree species. They propose a lightweight Processor module for spatio-temporal feature extraction, enabling the use of pretrained single-image models with time-series data, and introduce a hierarchical loss function that leverages taxonomic relationships for better classification accuracy.

The study demonstrates that time-series models outperform single-image models, with the Processor+U-Net architecture (using a ResNet-101 backbone) achieving the highest overall mIoU score of 55.97 ± 0.48, compared to 55.15 ± 0.29 for U-Net alone. Temporal data incorporation significantly enhances segmentation accuracy, particularly for non-coniferous trees, which display distinct seasonal patterns.

The methods provide a robust framework for forest monitoring. The approach also shows promise for generalizing across geographically similar ecosystems.

Abstract:

It would be beneficial for the reader to receive quantitative information, if suitable, please specify some of the accuracy values for the approaches.

Methods:

In standard k-fold cross-validation, the dataset is split into k subsets (or folds), and each subset takes turns being the validation set while the remaining k−1 folds are used for training.

In the setup shown in the figure from the manuscript:

• The test site remains fixed throughout the process.

• Only the training and validation subsets change.

This setup is useful for assessing the model’s performance on a specific unseen area (e.g., spatial generalization). A test set is commonly used in spatial modeling to assess how well the model generalizes to new, unseen areas or regions. Hence, it would be beneficial to the reader to mention this. For future papers, please note that k = 5 or 10 is more common and yields more reliable results.

https://doi.org/10.1007/s41664-018-0068-2

https://doi.org/10.1007/s10994-021-05972-1

L.48: Often the split is 70-20-10; is there a reason for the split used here?

“The dataset is split into train, validation and testing sets with 64%, 16%, and 20% of the samples, respectively.”

It is called a test set, not a testing set.

Results:

The original paper by Cloutier et al. (2023) reported an F1-score of 0.72 in September, the best-performing month, with declines (e.g., 0.61 in peak autumn) due to leaf fall and visual heterogeneity. In contrast, this study achieved a mIoU of 55.97 ± 0.48. While mIoU and F1-score measure different aspects of model performance, making them not directly comparable, it remains unclear why one should prefer time-series approaches (as such data is rarely available in high resolution and obtaining it is time-consuming and expensive) when single-image models from September appear to yield better results in classification terms. Adding F1-scores (including Recall, Precision, and mAP) to this paper would help clarify the advantages of time-series data and offer a more balanced comparison of segmentation versus classification metrics.

It is clear that mIoU focuses on spatial accuracy, ensuring the predicted segmentation masks match the ground truth and F1-score focuses on classification performance, ensuring that the model is making correct classifications for individual objects or regions, regardless of their exact spatial location.

I suggest keeping the current tables with the results and add additional columns with F1-score, for instance for tree species, Table 3.

Figs. 5 and 7: missing scale.

Discussion

In the Discussion section, it’s important to contextualize your results by comparing them with prior work, citing relevant research (currently this is missing), and explaining how your findings improve or contribute to existing knowledge. Highlight the significance of your results for real-world applications and discuss the practical challenges in data acquisition and model development. Reflect on which aspects of your approach are more automated and which still require manual intervention. Acknowledge limitations and suggest future work to overcome challenges, extending the current approach or improving its scalability.

Recommendation: Tree semantic segmentation from aerial image time series — R1/PR8

Comments

Dear authors,

I see that you have made a great effort to address the reviewer’s comments. The same reviewer is still asking for major revisions, but I think the major issues have been addressed. I have now also asked a new reviewer to step in, and this reviewer is suggesting minor revisions. Reading their comments, I have the impression that you can address them. Therefore, I kindly ask you to carefully consider all this feedback and submit a revised version of your paper.

Best wishes,

Miguel Mahecha

Decision: Tree semantic segmentation from aerial image time series — R1/PR9

Comments

No accompanying comment.

Author comment: Tree semantic segmentation from aerial image time series — R2/PR10

Comments

Dear editors and reviewers,

We sincerely thank you for your constructive feedback on our manuscript “Tree semantic segmentation from aerial image time series” (EDS-2024-0080). We have carefully considered all the comments of the reviewers, and have implemented the suggested improvements to our manuscript.

We thank the reviewers for acknowledging the relevance of our work and its clear presentation.

The major changes that we have implemented in this version are summarized as follows:

Addition of classification results

Added Tables 4 and 5 in the Appendix to show the classification results (F1-score, precision, recall) of our models.

This offers a more complete picture of our models and further showcases the advantages of our methods.

Updated Discussion and Conclusion

Revised the limitations and conclusion sections of the paper to compare our work against prior work and help contextualize its significance.

Discussed the significance of the results for real-world applications.

We thank you again for the opportunity to revise our manuscript and look forward to your feedback on these improvements.

Sincerely,

Venkatesh Ramesh, Arthur Ouaknine, and David Rolnick

Detailed Response to Reviewer Comments:

Reviewer 1:

“The term “tree crown segmentation” is a general phrase that would typically be associated with only differentiating tree from non-tree pixels. I recommend emphasizing in the Title, Abstract, and Introduction that you perform “hierarchical tree species segmentation.”

We agree that more precise terminology is needed. While ‘tree crown segmentation’ is common in forestry, computer vision has distinct definitions for semantic (classifying pixels into categories) versus instance segmentation (distinguishing individual objects). We propose using ‘tree crown semantic segmentation’ throughout the manuscript to accurately reflect our pixel-level species classification approach. We will update the Title, Abstract, and Introduction accordingly.

“Page 10, Line 49: Normally, the same learning rate is not applicable to different batch sizes. Additionally, transformer architectures are commonly trained with a learning rate warm-up, as seen in the Swin Transformer paper. The resulting model scores may be underestimates. This needs to be at least mentioned.”

We acknowledge that using a uniform learning rate across architectures may not be optimal, particularly for transformer-based models, which often benefit from warm-up schedules as shown in the Swin Transformer paper. While we chose this approach for experimental consistency and computational feasibility, we recognize this may have led to underestimated performance for some architectures. We have clarified this limitation in the second-to-last paragraph of the discussion section.

“Page 3, Line 5ff: The sentence indicates that takeaways (i.e., findings) are described. Only the second bullet point could be counted as a finding. The other two points describe what has been done in the paper. Please adjust this.”

We agree that the current bullet points mix contributions and findings in a way that could be confusing for readers. We have revised the points such that the readers can clearly distinguish between our methodological contribution and the key findings.

“Please ensure that each paragraph consists of more than one sentence.”

We have formatted the paragraphs in our paper to reflect this.

“Given that the hierarchical loss function computes a different loss depending on whether the genus is predicted correctly, it may make sense to analyze the models in this regard. That is, does the genus classification accuracy improve with the HLoss function compared to the Dice+CE loss function?”

While analyzing performance by taxonomic level (species, genus, higher-level taxon) may seem intuitive given the hierarchical loss, it could actually be misleading since not all trees have specific level-wise annotations in our dataset. As shown in Table 4, the HLoss improves performance across individual species compared to Dice+CE, which inherently implies improved performance at higher taxonomic levels since they are nested (a correct species prediction is automatically a correct genus and family prediction).
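The nesting argument above can be made concrete: each species maps to exactly one genus, so a correct species prediction is automatically a correct genus prediction. The sketch below assumes a toy species-to-genus table for illustration, not the paper's full taxonomy from Table 1.

```python
# Toy species-to-genus mapping (an assumption for illustration only).
SPECIES_TO_GENUS = {
    "red_maple": "Acer",
    "sugar_maple": "Acer",
    "swamp_birch": "Betula",
}

def to_genus(species_labels):
    """Aggregate a sequence of species labels to the genus level."""
    return [SPECIES_TO_GENUS[s] for s in species_labels]

def genus_accuracy(pred, target):
    """Pixel accuracy after aggregating predictions and targets to genus."""
    pg, tg = to_genus(pred), to_genus(target)
    return sum(p == t for p, t in zip(pg, tg)) / len(pg)
```

Under this mapping, species-level gains translate directly into genus-level gains, and even some species-level errors (e.g., confusing two maples) remain correct at the genus level.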

“This study needs to be cited: Beloiu et al. 2023, Individual Tree-Crown Detection and Species Identification in Heterogeneous Forests Using Aerial RGB Imagery and Deep Learning”

Added the citation in the first paragraph of Section 2.3.

“As the climate changes, the suitable habitat for many tree species shifts, shrinks, or disappears altogether: With global warming, suitable growing locations for trees also expand in some regions.”

We have changed the wording to reflect that climate change shifts the geographical ranges of tree species rather than only causing suitable habitat to disappear.

“This more recent study could be cited in this context: Mahecha, Miguel D., et al. Biodiversity and climate extremes: known interactions and research gaps. Earth’s Future 12.6 (2024): e2023EF003963.”

Added the citation in line 4 of the Introduction.

“Page 2, Line 27ff: Why is the understanding phenologically motivated in the first paragraph when the paper focuses on tree species segmentation?”

The phenological focus in the introduction directly motivates our core methodological approach. Different species exhibit distinct temporal patterns in leaf emergence, color change, and leaf fall, and our method explicitly leverages these patterns through time series inputs. Understanding phenology is therefore fundamental to our approach of using temporal data for species segmentation.

“Page 2, Line 39: Semantic segmentation of tree crowns does not yield information about forest health or composition. See first major comment.”

Our semantic segmentation approach directly informs forest composition by creating species-level cover maps. Additionally, our method identifies dead trees (one of our categories), which is crucial for forest health assessment and currently a significant research topic (https://nph.onlinelibrary.wiley.com/doi/full/10.1111/nph.20407). The ability to map both species distribution and tree mortality at scale provides valuable indicators of forest health and composition. The spatial information about species distribution patterns and mortality can also help monitor changes in forest structure over time.

“Page 4, Line 38: Given multiple flights, the orthophotos will at least have slight differences in real-world resolution. What steps were taken for resampling to align the images in space and time to yield a standardized tensor?”

The recordings used georeferenced stations as reference points in x, y, z coordinates, ensuring consistent elevation and spatial coordinates between flights. This approach maintains the same resolution across time frames (see Cloutier et al., 2023 for technical details). While minor pixel-level shifts may occur between frames due to wind conditions, we only evaluate predictions against labels created on a specific reference date. We assume that the neural networks can handle small shift corrections of a few pixels to align corresponding trees, thanks to both phenological patterns and the receptive fields of the architectures we used.

“Page 5, Line 33: Given the small area and homogeneity of the dataset, it is likely impossible to “prevent” spatial autocorrelation.”

While it is challenging to completely prevent spatial autocorrelation in forest datasets, our results demonstrate good generalization capabilities. As shown in Figure 9, our model successfully transfers to a different forest region in Quebec, suggesting it did not overfit to local spatial patterns. While more extensive validation across other forests would be valuable for future work, these initial transfer results indicate that spatial autocorrelation did not substantially impair model generalization.

“Page 5, Line 45: How many classes are ignored and how are these handled? Are these merged into a single class or merged into the background class?”

We initially had 28 classes in total. Classes with fewer than 50 occurrences (most of which had fewer than 10 samples) were merged into the background class. This left us with 15 classes plus background, all having sufficient samples for effective model training. We have updated Page 5, Line 45 to provide these details.
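For concreteness, this merging step can be sketched as follows; the class ids and counts below are illustrative, not the dataset’s actual values:

```python
import numpy as np

def merge_rare_classes(label_map, counts, min_count=50, background=0):
    """Fold classes with fewer than `min_count` annotated samples into
    the background class. `counts` maps class id -> number of samples."""
    rare = [cls for cls, n in counts.items() if n < min_count]
    out = label_map.copy()
    out[np.isin(out, rare)] = background
    return out

labels = np.array([[1, 2], [3, 1]])        # toy label map
counts = {1: 120, 2: 7, 3: 60}             # class 2 is rare
print(merge_rare_classes(labels, counts))  # class 2 is folded into background
```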

“Page 7, Figure 3: Unit for y-axis should be ‘Count [#],’ not ‘Frequency.’”

We have updated Figure 3 to show “Count [#]” in place of “Frequency.”

“Page 5, Line 38: How large exactly is the mentioned sufficient spatial context?”

The image size of 768×768 pixels at zoom level 22, given the latitude of the dataset, corresponds to a ground area with sides of approximately 20.12 meters and a total area of approximately 404.9 m². For context, the largest tree crown in our dataset has an area of approximately 180.32 m², which is well within our tile size. Therefore, this provides sufficient spatial context for our model to operate effectively.
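The figures above follow from the standard Web Mercator ground-resolution formula; the sketch below reproduces the arithmetic, with the latitude an assumed value for the Quebec study area rather than a figure taken from the paper:

```python
import math

def ground_resolution(lat_deg, zoom):
    """Web Mercator ground resolution in meters per pixel at a given
    latitude and zoom level (256-pixel base tiles, WGS84 radius)."""
    equator_res = 2 * math.pi * 6378137 / 256  # ~156543.03 m/px at zoom 0
    return equator_res * math.cos(math.radians(lat_deg)) / (2 ** zoom)

res = ground_resolution(45.4, 22)  # ~0.026 m/px at an assumed latitude
side_m = 768 * res                 # tile side length in meters (~20.1 m)
area_m2 = side_m ** 2              # tile ground area (~405 m^2)
print(round(side_m, 2), round(area_m2, 1))
```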

“Page 9, Line 23: What is the rationale or trade-off between having each level predicted separately or in one go?”

Our approach predicts only at the species level and derives higher-level predictions through aggregation, rather than predicting each taxonomic level separately. This design choice leverages the inherent hierarchical structure of tree taxonomy while ensuring prediction consistency across levels. The genus-level and higher-level taxon predictions are derived directly from species-level softmax probabilities according to the taxonomic hierarchy (see Equations 4.3 and 4.5). While we still utilize supervision at all taxonomic levels through our hierarchical loss function (Equation 4.6), maintaining a single prediction head enforces taxonomic consistency by design, since higher-level predictions are derived from species predictions rather than being independently predicted.
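A minimal sketch of this aggregation follows: genus-level probabilities are obtained by summing species-level softmax outputs within each genus, mirroring the nesting described above. The species codes and mapping are illustrative, not the paper’s full taxonomy:

```python
import numpy as np

# Illustrative species -> genus mapping (example codes, not the full taxonomy).
GENUS_OF = {"ACPE": "Acer", "ACRU": "Acer", "ACSA": "Acer", "BEPA": "Betula"}
SPECIES = list(GENUS_OF)

def genus_probs(species_probs):
    """Derive genus-level probabilities from species-level softmax outputs
    by summing over species that share a genus."""
    out = {}
    for sp, p in zip(SPECIES, species_probs):
        out[GENUS_OF[sp]] = out.get(GENUS_OF[sp], 0.0) + float(p)
    return out

probs = np.array([0.1, 0.3, 0.4, 0.2])  # softmax over the four species
print(genus_probs(probs))  # the three maple species aggregate under Acer
```

Because every genus probability is a sum of species probabilities, the derived predictions are taxonomically consistent by construction.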

Reviewer 2:

“Abstract: It would be beneficial for the reader to receive quantitative information, if suitable, please specify some of the accuracy values for the approaches.”

We have revised the abstract to include quantitative results for our best-performing models.

“A test set is commonly used in spatial modeling to assess how well the model generalizes to new, unseen areas or regions. Hence, it would be beneficial to the reader to mention this. For further papers, please remember that a k = 5 or 10 is more common and the results are more reliable.”

We have already mentioned this in the second paragraph of the dataset section but have changed the wording to emphasize it further. Regarding the k-fold cross-validation, we agree that k=5 or k=10 typically provides more reliable results. However, our choice of a lower k value was driven by the specific challenges of our dataset: the spatial distribution of tree species in our study area is heterogeneous, making it difficult to maintain consistent class distributions across folds. This spatial clustering of species could lead to highly imbalanced training sets with higher k values, potentially introducing bias in our model evaluation.

“Often the split is 70-20-10, hence, is there a reason for this split here?

The dataset is split into train, validation and testing sets with 64%, 16%, and 20% of the samples, respectively.”

Our choice of a larger test set (20%) compared to conventional splits was driven by the need to ensure robust evaluation metrics across all tree species classes, particularly given the dataset’s inherent class imbalance and uneven geographical distribution of tree species. The larger test set provides statistical robustness in evaluating model performance across all species classes, while maintaining sufficient data for training (64%) and validation (16%). A discussion of this dataset partitioning strategy has been added to the final paragraph of Section 3. Dataset.

“It is called a test, not testing set.”

We have fixed this throughout the manuscript.

“The original paper by Cloutier et al. (2023) reported an F1-score of 0.72 in September, the best-performing month, with declines (e.g., 0.61 in peak autumn) due to leaf fall and visual heterogeneity. In contrast, this study achieved a mIoU of 55.97 ± 0.48. While mIoU and F1-score measure different aspects of model performance, making them not directly comparable, it remains unclear why one should prefer time-series approaches (as such data is rarely available in high resolution and obtaining it is time-consuming and expensive) when single-image models from September appear to yield better results in classification terms. Adding F1-scores (including Recall, Precision, and mAP) to this paper would help clarify the advantages of time-series data and offer a more balanced comparison of segmentation versus classification metrics.”

We have added detailed classification metrics (F1-score, Precision, and Recall) in Tables 5 and 6 under Appendix A.2. Comparing Table 5 with Table 6 shows that our time-series models consistently outperform the single-image models, validating the benefits of incorporating temporal information.

Direct performance comparisons with Cloutier et al. (2023) remain challenging due to several methodological differences. Because our study uses different geographical splits for the train, validation, and test sets than the original work, the quantitative results cannot be compared directly. As detailed in Table 4 and Section 6.2, we include an ‘Acer sp.’ class for trees that could be ACPE, ACRU, or ACSA but lack species-level annotation due to low confidence. While Cloutier et al. treated these as background, we include them in our metric calculations because they represent a significant portion of the maple trees in the dataset and should not be disregarded; treating them as background would give an incomplete representation of the forest composition. These methodological differences explain the apparent performance variations between the studies.

“Fig. 5. and 7. Missing scale”

We have added the scale in Fig. 5, Fig. 7 and Fig. 8.

“In the Discussion section, it’s important to contextualize your results by comparing them with prior work, citing relevant research (currently this is missing), and explaining how your findings improve or contribute to existing knowledge. Highlight the significance of your results for real-world applications and discuss the practical challenges in data acquisition and model development. Reflect on which aspects of your approach are more automated and which still require manual intervention. Acknowledge limitations and suggest future work to overcome challenges, extending the current approach or improving its scalability.”

We agree with the reviewer’s assessment and have thoroughly addressed these points in our revised Discussion section (Section 7). The section now better contextualizes our findings within existing research through a comparison of our time series models with prior single-image approaches. We have expanded our discussion of the practical significance of our results for forest monitoring applications, while acknowledging the challenges in data acquisition. These challenges include the labor-intensive process of collecting ground truth data and the difficulties in achieving consistent crown delineation accuracy across varying canopy densities. Additionally, we have mentioned the approach’s limitations, particularly regarding the Processor module’s fixed time-step requirement and aspects requiring manual intervention. The section concludes with concrete future research directions, proposing the integration of attention mechanisms for handling longer time series and the development of more flexible temporal processing architectures.

We appreciate the thoughtful and constructive feedback provided by the reviewers, which has helped strengthen our manuscript significantly. The implemented changes have enhanced both the content and clarity of our work.

Review: Tree semantic segmentation from aerial image time series — R2/PR11

Conflict of interest statement

Reviewer declares none.

Comments

The authors did not respond to the main point of criticism in the previous review iteration: that the margins by which one method outperforms the other are very small, often negligible given the provided error ranges. However, given that one proposed method somewhat consistently outperforms the other, even by exceptionally small margins, you can potentially argue that it is the superior method. Please comment on this in the manuscript.

Recommendation: Tree semantic segmentation from aerial image time series — R2/PR12

Comments

Dear Authors,

Please accept my apologies for the delay in getting back to you. I have asked one of the reviewers to re-evaluate your manuscript, and in doing so, a point previously raised has come up again and still remains unaddressed:

The performance differences between the methods you compare appear very small relative to the reported error margins. In fact, given the provided uncertainty ranges, these differences may well be negligible. I kindly ask you to clarify this issue and to justify the claim that one method outperforms the other. But it would be also absolutely fine if you would conclude that certain methods perform almost equally well—we are not looking for a new “winner,” but are simply aiming for scientific accuracy.

We look forward to receiving your revised manuscript.

Best regards!

Miguel Mahecha

Decision: Tree semantic segmentation from aerial image time series — R2/PR13

Comments

No accompanying comment.

Author comment: Tree semantic segmentation from aerial image time series — R3/PR14

Comments

Dear Dr. Mahecha and Reviewer 1,

Thank you for your time and valuable feedback on our manuscript, “Tree crown semantic segmentation from aerial image time series” (EDS-2024-0080.R2). We appreciate the recommendation for publication with minor revisions and the constructive comments aimed at improving the scientific accuracy of our work.

We have now carefully revised the manuscript to specifically address the interpretation of performance differences relative to error margins, focusing on a more nuanced analysis of statistical significance, as requested.

In response to the comments from the Editor and Reviewer 1, we have made the following key changes throughout the manuscript:

Re-evaluated Performance Claims: We re-examined all comparisons between methods (e.g., HLoss vs. Dice+CE, Time Series vs. Single Image) presented in Tables 2, 3, and 4 (mIoU) and Tables 5 and 6 (classification metrics).

Incorporated Statistical Significance Analysis: We have revised the text in Section 6 (Results) and Appendix A.2 (Classification Metrics) to explicitly discuss the performance differences in the context of the reported error margins.

Moderated Language: We have replaced strong or potentially overstated claims (e.g., “consistently outperforms,” “superior performance across all metrics”) with more precise and nuanced phrasing (e.g., “statistically significant improvement,” “often led to performance gains,” “comparable performance”) based on the significance analysis. This applies to the Results section, Discussion, Conclusion, and Abstract.

Specific Section Revisions:

Abstract: The concluding sentences regarding the benefits of our methods have been rephrased to reflect that improvements were often observed, and significantly so in key cases, but not necessarily universally.

Section 6.1 & 6.2 (Results - mIoU): The text interpreting Tables 2, 3, and 4 has been rewritten to detail the significance analysis for key comparisons, acknowledging where differences are significant and where performance is comparable. We clarified the comparison between the best time series and single-image models and provided a more balanced interpretation of the class-wise results in Table 4.

Figure Captions (Fig 7 & 8): Captions have been revised to remove strong generalizations and better align with the nuanced quantitative findings.

Discussion & Conclusion: Statements regarding the advantages of the time-series approach and the hierarchical loss have been moderated to accurately reflect the significance analysis.

Appendix A.2 (Classification Metrics): The text now includes a discussion of statistical significance for F1-score, Precision, and Recall based on the error margins in Tables 5 and 6, replacing previous generalized claims with specific analysis.

Review: Tree semantic segmentation from aerial image time series — R3/PR15

Conflict of interest statement

Reviewer declares none.

Comments

The authors have adequately addressed all concerns. The revised manuscript demonstrates improved scientific rigor and more accurate interpretation of results. Congratulations! The work is now ready for publication.

Recommendation: Tree semantic segmentation from aerial image time series — R3/PR16

Comments

No accompanying comment.

Decision: Tree semantic segmentation from aerial image time series — R3/PR17

Comments

No accompanying comment.