AutoML4HybridModels: AutoML for creating hybrid Earth science models
Due to the availability of large sets of satellite data, an increasing number of Earth system science problems are tackled by applying machine learning. In general, two types of methods are used for Earth system science problems: "data-driven" methods and "theory-driven" methods. Data-driven methods involve the use of a large training dataset to train a machine learning model. In the context of remote sensing tasks, a machine learning model is trained by using a large set of "in situ" training data (ground truth measurements) coupled with satellite observations, where the satellite observations provide the input features and the in situ training dataset contains the target values to predict. However, in many scenarios the amount of available in situ data is limited. Theory-driven methods rely on the use of existing domain knowledge instead of large sets of training data. An example of such a method is the use of simulation models to create simulated training data. On the downside, these models typically require extensive domain knowledge to tune correctly.
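As a rough illustration of the two kinds of training data described above, the sketch below builds a small "in situ" set and a larger simulated set, then fits a one-feature regressor on each. Everything in it is an assumption made for illustration: `toy_forward_model` is a hypothetical linear stand-in for a simulation model (it is not PROSAIL or HYDROPT), both datasets are synthetic, and the least-squares fit is a placeholder for the machine learning models used in the thesis.

```python
import random

def toy_forward_model(lai, rng, noise_sd):
    """Hypothetical stand-in for a simulation model (NOT PROSAIL or HYDROPT):
    maps a leaf area index value to one noisy reflectance-like feature."""
    return 0.05 + 0.08 * lai + rng.gauss(0.0, noise_sd)

def fit_line(x, y):
    """Closed-form least squares for y = slope * x + intercept (one feature)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Data-driven: a handful of in situ targets paired with satellite features.
# Both are synthetic here; real data would come from field campaigns and imagery.
in_situ_y = [rng.uniform(0.0, 6.0) for _ in range(8)]
satellite_x = [toy_forward_model(t, rng, noise_sd=0.02) for t in in_situ_y]

# Theory-driven: sample the target variable, then run the simulation model
# forward to obtain as many simulated feature/target pairs as needed.
sim_y = [rng.uniform(0.0, 6.0) for _ in range(500)]
sim_x = [toy_forward_model(t, rng, noise_sd=0.01) for t in sim_y]

slope_data, intercept_data = fit_line(satellite_x, in_situ_y)
slope_sim, intercept_sim = fit_line(sim_x, sim_y)
print(f"in situ fit:    y = {slope_data:.2f} * x + {intercept_data:.2f}")
print(f"simulation fit: y = {slope_sim:.2f} * x + {intercept_sim:.2f}")
```

The in situ fit depends on only eight samples, while the simulation fit is only as good as the forward model it was generated from. That trade-off is the one the paragraph above describes.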
A novel perspective on data science aims to combine these data-driven and theory-driven methods: "theory-guided" data science. In this thesis, we introduce a theory-guided framework that incorporates both simulation models and available in situ data within a modelling pipeline. We implement this framework as an extension of the existing automated machine learning (AutoML) framework Auto-sklearn. We compare the performance of the new framework to several commonly used data-driven baselines: random forest, multilayer perceptron, Gaussian process regression and vanilla Auto-sklearn. To facilitate this comparison, we introduce a benchmark dataset consisting of four distinct Earth system science tasks with preprocessed, ready-to-use in situ, simulation and remote sensing data for each task. From our experiments with this benchmark dataset, we conclude that for one task (leaf area index estimation), the theory-guided framework outperforms all baselines. In this task, the proposed method improves on vanilla Auto-sklearn by 0.01 to 0.02 in R² for training sizes of up to 250 in situ samples. For the other tasks, vanilla Auto-sklearn consistently ranks as the best model.
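For readers unfamiliar with the metric, the following minimal sketch shows how the coefficient of determination (R²) is computed and how two models can be compared by their difference in R². The prediction values are invented purely for illustration and are not results from the thesis; `pred_baseline` and `pred_candidate` are hypothetical names.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Invented numbers for illustration only; these are not results from the thesis.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_baseline = [1.2, 1.9, 3.3, 3.8, 5.1]
pred_candidate = [1.1, 2.0, 3.2, 3.9, 5.0]

r2_baseline = r2_score(y_true, pred_baseline)
r2_candidate = r2_score(y_true, pred_candidate)
print(f"baseline R2:  {r2_baseline:.3f}")
print(f"candidate R2: {r2_candidate:.3f}")
print(f"difference:   {r2_candidate - r2_baseline:.3f}")
```

A gap of 0.01 to 0.02 in R² is small in absolute terms, so in practice such a comparison would be repeated over several random splits and training sizes before drawing conclusions.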
The AutoML system that this framework extends is Auto-sklearn.
In this project, we compiled a benchmark dataset of preprocessed, ready-to-use in situ data, satellite data and simulation data. The dataset is available on GitHub. It combines data and tools from the following sources:
- Ground-Based Observations for Validation (GBOV) of Copernicus Global Land Products - Reference Measurements
- Swedish National Forest Inventory - Sample plot data 2007-2020
- Valente, A et al. (2016): A compilation of global bio-optical in situ data for ocean-colour satellite applications.
- ESA Philab WorldCrops - Bavaria Yield
- Google Earth Engine Data Catalog - Sentinel Collections
- PROSAIL (Python library)
- HYDROPT: a Python framework for fast inverse modelling of multi- and hyperspectral ocean color data
Author: Victor Neuteboom (supervisors: dr. Mitra Baratchi, prof. Peter van Bodegom, drs. Nuno C. de Sá and dr. Michael Marszalek)
AutoML for creating hybrid Earth science models. Master's Thesis in Computer Science at Leiden Institute of Advanced Computer Science, Leiden University, 2021.