Overview
Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT.
Motivation
Self-supervised multimodal pretraining promises to address grand challenges in Earth observation. Crucial applications, such as monitoring progress toward the SDGs, must rely on training data that are limited and sparse (e.g., field measurements) and geographically biased (i.e., labels are missing for large parts of the world). Furthermore, models conditioned on multiple modalities may resolve ambiguities inherent in modeling biophysical quantities with remotely sensed data.
Explorer
Explore the MMEarth-Bench dataset interactively. Zoom in to view pixel-level data, hover over tiles to see tile-level data, and filter by task, split, and species.
Species
The species task in MMEarth-Bench contains 100 terrestrial mammals, shown below in order of their prevalence in the dataset.
Benchmarking
We benchmark 7 pretrained models on MMEarth-Bench:
- 3 RGB models: Scale-MAE, DINOv3 Web, DINOv3 Sat
- 2 Sentinel-2 models: SatlasNet, MPMAE
- 2 multimodal models: TerraMind, Copernicus-FM
We rank these models after finetuning on all training data.
| Split | Rank | All tasks | Biomass | Soil N | Soil OC | Soil pH | Species |
|---|---|---|---|---|---|---|---|
| Random | 1 | Copernicus-FM | MPMAE | TerraMind | Copernicus-FM | TerraMind | Copernicus-FM |
| | 2 | TerraMind | Copernicus-FM | Copernicus-FM | MPMAE | Copernicus-FM | TerraMind |
| | 3 | MPMAE | TerraMind | MPMAE | TerraMind | MPMAE | MPMAE |
| | 4 | DINOv3 Sat | DINOv3 Sat | DINOv3 Sat | SatlasNet | DINOv3 Web | SatlasNet |
| | 5 | SatlasNet | SatlasNet | DINOv3 Web | DINOv3 Web | DINOv3 Sat | DINOv3 Sat |
| | 6 | DINOv3 Web | DINOv3 Web | SatlasNet | Scale-MAE | SatlasNet | DINOv3 Web |
| | 7 | Scale-MAE | Scale-MAE | Scale-MAE | DINOv3 Sat | Scale-MAE | Scale-MAE |
| Geographic | 1 | Copernicus-FM | MPMAE | DINOv3 Sat | Copernicus-FM | TerraMind | Copernicus-FM |
| | 2 | MPMAE | TerraMind | Copernicus-FM | MPMAE | SatlasNet | TerraMind |
| | 3 | TerraMind | Copernicus-FM | MPMAE | SatlasNet | MPMAE | MPMAE |
| | 4 | SatlasNet | DINOv3 Sat | TerraMind | TerraMind | DINOv3 Web | DINOv3 Sat |
| | 5 | DINOv3 Sat | SatlasNet | DINOv3 Web | DINOv3 Web | DINOv3 Sat | DINOv3 Web |
| | 6 | DINOv3 Web | DINOv3 Web | SatlasNet | Scale-MAE | Copernicus-FM | SatlasNet |
| | 7 | Scale-MAE | Scale-MAE | Scale-MAE | DINOv3 Sat | Scale-MAE | Scale-MAE |
Method
The geospatial machine learning community has embraced multimodal data for self-supervised pretraining of geospatial foundation models. Leveraging multimodal data at inference time as well provides the model with more context when making a prediction. A pretrained model can use multimodal data at inference time by taking it as input, but this typically requires a large multimodal encoder trained on large amounts of data. Inspired by both the multi-pretext pretraining paradigm employed by MMEarth and the test-time adaptation framework, we propose using multiple modalities as auxiliary tasks at test time. In particular, we use multiple modalities as reconstruction targets to provide a test-time adaptation signal for the encoder. The reconstruction tasks can be solved with lightweight, linear decoders, and they do not require the encoder to accept the modalities as input. In the animation below, we illustrate our method for test-time training with multimodal reconstruction (TTT-MMR).
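The adaptation loop can be sketched as follows. This is a minimal, model-agnostic sketch, not the paper's exact implementation: the encoder, task head, optimizer settings, and the assumption that each auxiliary modality is a tile-level target vector are all illustrative.

```python
import torch
import torch.nn as nn

def ttt_mmr(encoder, head, x, aux_targets, steps=10, lr=1e-4):
    """Test-time training with multimodal reconstruction (sketch).

    encoder     pretrained feature extractor, adapted at test time
    head        task head producing the final prediction
    x           batch of test inputs the encoder accepts (e.g. Sentinel-2)
    aux_targets dict: modality name -> target tensor of shape (B, C_m);
                these modalities are never fed to the encoder as input,
                they serve only as reconstruction targets
    """
    feat_dim = encoder(x).shape[-1]
    # One lightweight linear decoder per auxiliary modality.
    decoders = nn.ModuleDict({
        name: nn.Linear(feat_dim, target.shape[-1])
        for name, target in aux_targets.items()
    })
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoders.parameters()), lr=lr
    )
    for _ in range(steps):
        z = encoder(x)
        # Sum of reconstruction losses over all available modalities.
        loss = sum(
            nn.functional.mse_loss(decoders[name](z), target)
            for name, target in aux_targets.items()
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return head(encoder(x))
```

Because the decoders are linear and discarded after adaptation, the per-batch overhead is small compared to running a full multimodal encoder.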
Applying TTT to batches of test tiles acts as a regularizer since it results in less noisy gradients. However, this also allows for less specialization to any particular tile, which could limit its benefits. To balance regularization and specialization, rather than forming batches randomly as in TTT-MMR, we also propose geographic batching, in which test tiles are grouped into non-overlapping batches that are contiguous geographic regions. Our TTT-MMR-Geo method batches the test tiles based on geographic proximity as a proxy for tile similarity using recursive spatial partitioning.
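Geographic batching by recursive spatial partitioning can be illustrated as below. The median split along the coordinate axis with the larger extent is an assumed partitioning rule for this sketch, not necessarily the exact procedure used by TTT-MMR-Geo.

```python
import numpy as np

def geo_batches(coords, batch_size):
    """Partition tile coordinates into non-overlapping, spatially
    contiguous batches of at most `batch_size` tiles.

    coords: array of shape (N, 2) with (lon, lat) per tile
    returns: list of index arrays, one per batch
    """
    def split(indices):
        if len(indices) <= batch_size:
            return [indices]
        pts = coords[indices]
        # Split along the axis with the larger spatial extent,
        # at the median, so each half stays geographically compact.
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        order = indices[np.argsort(pts[:, axis])]
        mid = len(order) // 2
        return split(order[:mid]) + split(order[mid:])

    return split(np.arange(len(coords)))
```

Each resulting batch covers a compact region, so tiles within a batch tend to be similar, which is the intended trade-off between regularization (batched gradients) and specialization (locally homogeneous batches).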
Download
Run the following command in the command line to download the MMEarth-Bench data.
```sh
mkdir -p mmearth-bench-data/{biomass,soil_nitrogen,soil_organic_carbon,soil_pH,species}
for task in biomass soil_nitrogen soil_organic_carbon soil_pH species; do
  wget -c -P "mmearth-bench-data/$task" \
    "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/$task/$task.h5" \
    "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/$task/${task}_split_data.json"
done
wget -c -P mmearth-bench-data/species \
  "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/species/species_labels.json"
wget -c -P mmearth-bench-data \
  "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/no_data_values.json"
```
This will take about 2 hours and yield the following folder structure, occupying 59 GB:
```
mmearth-bench-data/
├── biomass/
│   ├── biomass_split_data.json
│   └── biomass.h5
├── soil_nitrogen/
│   ├── soil_nitrogen_split_data.json
│   └── soil_nitrogen.h5
├── soil_organic_carbon/
│   ├── soil_organic_carbon_split_data.json
│   └── soil_organic_carbon.h5
├── soil_pH/
│   ├── soil_pH_split_data.json
│   └── soil_pH.h5
├── species/
│   ├── species_labels.json
│   ├── species_split_data.json
│   └── species.h5
└── no_data_values.json
```
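The `.h5` files can be read lazily with `h5py`, which avoids loading the full 59 GB into memory. The dataset key names inside the MMEarth-Bench files are not documented on this page, so the sketch below uses a stand-in file with hypothetical keys (`sentinel2`, `label`) purely to show the access pattern.

```python
import h5py
import numpy as np

# Build a small stand-in file with hypothetical keys; the real
# MMEarth-Bench key names may differ.
with h5py.File("example.h5", "w") as f:
    f.create_dataset("sentinel2",
                     data=np.zeros((2, 12, 128, 128), dtype="float32"))
    f.create_dataset("label", data=np.zeros((2,), dtype="float32"))

with h5py.File("example.h5", "r") as f:
    print(list(f.keys()))     # inspect what the file actually contains
    tile = f["sentinel2"][0]  # per-tile slicing reads only that tile
    print(tile.shape)         # (12, 128, 128)
```

Listing `f.keys()` on the downloaded files is the quickest way to discover each task's actual layout.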
BibTeX
```bibtex
@misc{gordon2026mmearthbench,
  title={MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training},
  author={Lucia Gordon and Serge Belongie and Christian Igel and Nico Lang},
  year={2026},
  eprint={2602.06285},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.06285}
}
```