With all the new projects, startups, and investment going into the acquisition and processing of satellite remote sensing data, it’s important to remember that satellite data is just that…data points. A variable in a model. A tool to be used carefully and nothing more.
Of course, data collected from space sounds exciting and cool and new. It is. However, these data come with numerous issues, and I wanted to write this article for those who are new to the earth observation sphere. If you've been using spatial remote sensing data for a while, none of this should be new information. If you’re new to this realm, though, the following points are factors to consider when working with any type of remote sensing data.
I focus primarily on areas that create room for error.
Spatial autocorrelation
Spatial autocorrelation is one of the most important statistical properties to account for when working with spatial data in models, not just remote sensing data. It measures the degree to which values at nearby locations resemble each other. Positive spatial autocorrelation indicates similar values among points that are closer together; negative spatial autocorrelation indicates dissimilar values among nearby points (like a checkerboard). It is most commonly measured using a statistic called Moran’s I. Positive spatial autocorrelation is prevalent across environmental variables, and if you don’t account for it in your model-building workflow, you end up with artificially inflated estimates of predictive performance.
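To make this concrete, here’s a minimal sketch of Moran’s I computed in plain NumPy from a hand-built spatial weights matrix. The values and the tiny 2x2 grid are made up for illustration; in practice you’d likely build the weights with a library such as libpysal and compute the statistic with esda.

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I for a 1-D array of values and an n x n spatial weights matrix."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    z = x - x.mean()                   # deviations from the mean
    num = np.sum(w * np.outer(z, z))   # sum_ij w_ij * z_i * z_j
    den = np.sum(z ** 2)               # sum_i z_i^2
    return (n / w.sum()) * (num / den)

# Example: 4 pixels on a 2x2 grid with rook (edge-sharing) neighbours
values = [1.0, 2.0, 1.5, 2.5]
weights = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])
print(f"Moran's I: {morans_i(values, weights):.3f}")
```

Values near +1 suggest strong positive spatial autocorrelation, values near -1 suggest the checkerboard pattern, and values near the expected value under randomness (roughly zero for large n) suggest little spatial structure.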
Next week I’ll be writing a post on what spatial autocorrelation is, and how to measure it, and then I’ll be reviewing three papers on the importance of spatial autocorrelation in your dataset. If you’re working with EO data, learn what this is and how to measure it - it’s important in every model.
Missing values
If you work with satellite remote sensing data, you know quality assurance checks will often result in a lot of missing data. The big question here is: how are you going to account for this within your project or experiment? If you live in a climate with extended periods of cloudy weather, it will be important to consider how you’ll account for the lack of data points. Are you going to average the dataset by month? Are you going to apply a gap-filling technique? And if so, what will it be? How will that impact the model results? Does it even make sense to average between points to fill in the gap?
These are all questions you need to think about when preparing input datasets for your models.
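As a concrete illustration of two of these choices, here’s a minimal sketch of a cloud-masked pixel time series handled two ways with pandas: monthly averaging and linear interpolation in time. The dates and NDVI values are invented for the example, and whether either approach is appropriate depends entirely on your application.

```python
import numpy as np
import pandas as pd

# Hypothetical NDVI time series for one pixel; NaN marks observations
# removed by the QA / cloud mask.
dates = pd.date_range("2023-06-01", periods=8, freq="8D")
ndvi = pd.Series([0.62, np.nan, np.nan, 0.55, 0.58, np.nan, 0.49, 0.47], index=dates)

# Option 1: average to monthly composites (smooths over the gaps)
monthly = ndvi.resample("MS").mean()

# Option 2: linear interpolation in time (fills gaps between valid observations)
filled = ndvi.interpolate(method="time")

print(monthly)
print(filled)
```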
Ground-truthing
I would love to hear your thoughts on this, but I think it’s important to ground-truth all models developed using environmental datasets. Too often, we build regression / ML models using a whole host of environmental data and publish the results without any ground-truthing!
How could you be confident that your results represent real-world conditions? Even if the validation results turn out to be terrible, I think the exercise is worth the effort.
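Even a back-of-the-envelope validation can be as simple as pairing model predictions with in-situ measurements at the same locations and times and computing a few error statistics. A minimal sketch below, with entirely made-up numbers:

```python
import numpy as np

# Hypothetical paired values: model predictions vs. in-situ measurements
# collected at the same locations and times (units are arbitrary here).
predicted = np.array([12.1, 8.4, 15.0, 9.7, 11.3])
observed  = np.array([11.5, 9.0, 13.8, 10.2, 12.0])

residuals = predicted - observed
bias = residuals.mean()                  # mean error (systematic over/under-prediction)
rmse = np.sqrt(np.mean(residuals ** 2))  # typical magnitude of the error
mae = np.mean(np.abs(residuals))         # mean absolute error

print(f"Bias: {bias:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}")
```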
EO data products are modelled products
And this means they already have inherent errors embedded in them. Of course, this is always the case with any input variable you use in a model, but with EO data there may be room to improve the data product before you use it. For example, Griffin et al. (2018) reduced the bias in the TROPOMI nitrogen dioxide data product by 15-30% by altering the air mass factors used in the TM5-MP chemical transport model. Is there a method through which you can improve the accuracy of your EO product before you use it?
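The Griffin et al. correction is specific to the retrieval itself, but as a simpler illustration of the general idea, here’s a sketch of a linear bias correction fitted against a small set of co-located reference measurements (all values are hypothetical). Regressing trusted reference data on the raw EO product and applying the fitted relationship is one common, if crude, way to reduce systematic error before the product goes into your model.

```python
import numpy as np

# Hypothetical co-located values: raw EO product vs. trusted reference data
eo_product = np.array([18.0, 22.5, 30.1, 27.4, 35.2, 40.0])
reference  = np.array([15.2, 19.0, 26.5, 24.0, 31.8, 36.1])

# Fit a simple linear correction: reference ~ slope * eo_product + intercept
slope, intercept = np.polyfit(eo_product, reference, deg=1)

# Apply the correction to the EO values before using them as a model input
corrected = slope * eo_product + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(corrected.round(2))
```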
What other issues should people think about? Let me know in the comments!
Thanks for reading Earthbound. If you haven’t already, feel free to subscribe.