Overview
Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT.
Motivation
Self-supervised multimodal pretraining promises to address grand challenges in Earth observation. Crucial applications, such as monitoring progress toward the SDGs, must rely on training data that are limited and sparse (e.g., field measurements) and geographically biased (i.e., labels are missing for large parts of the world). Furthermore, models conditioned on multiple modalities may resolve ambiguities inherent in modeling biophysical quantities with remotely sensed data.
Explorer
Explore the MMEarth-Bench dataset interactively. Zoom in to view pixel-level data, hover over tiles to see tile-level data, and filter by task, split, and species.
Species
The species task in MMEarth-Bench contains 100 terrestrial mammals, shown below in order of their prevalence in the dataset.
Benchmarking
We benchmark 7 pretrained models on MMEarth-Bench:
- 3 RGB models: Scale-MAE, DINOv3 Web, DINOv3 Sat
- 2 Sentinel-2 models: SatlasNet, MPMAE
- 2 multimodal models: TerraMind, Copernicus-FM
We rank these models after finetuning on all training data.
| Split | Rank | All tasks | Biomass | Soil N | Soil OC | Soil pH | Species |
|---|---|---|---|---|---|---|---|
| Random | 1 | Copernicus-FM | MPMAE | TerraMind | Copernicus-FM | TerraMind | Copernicus-FM |
| | 2 | TerraMind | Copernicus-FM | Copernicus-FM | MPMAE | Copernicus-FM | TerraMind |
| | 3 | MPMAE | TerraMind | MPMAE | TerraMind | MPMAE | MPMAE |
| | 4 | DINOv3 Sat | DINOv3 Sat | DINOv3 Sat | SatlasNet | DINOv3 Web | SatlasNet |
| | 5 | SatlasNet | SatlasNet | DINOv3 Web | DINOv3 Web | DINOv3 Sat | DINOv3 Sat |
| | 6 | DINOv3 Web | DINOv3 Web | SatlasNet | Scale-MAE | SatlasNet | DINOv3 Web |
| | 7 | Scale-MAE | Scale-MAE | Scale-MAE | DINOv3 Sat | Scale-MAE | Scale-MAE |
| Geographic | 1 | Copernicus-FM | MPMAE | DINOv3 Sat | Copernicus-FM | TerraMind | Copernicus-FM |
| | 2 | MPMAE | TerraMind | Copernicus-FM | MPMAE | SatlasNet | TerraMind |
| | 3 | TerraMind | Copernicus-FM | MPMAE | SatlasNet | MPMAE | MPMAE |
| | 4 | SatlasNet | DINOv3 Sat | TerraMind | TerraMind | DINOv3 Web | DINOv3 Sat |
| | 5 | DINOv3 Sat | SatlasNet | DINOv3 Web | DINOv3 Web | DINOv3 Sat | DINOv3 Web |
| | 6 | DINOv3 Web | DINOv3 Web | SatlasNet | Scale-MAE | Copernicus-FM | SatlasNet |
| | 7 | Scale-MAE | Scale-MAE | Scale-MAE | DINOv3 Sat | Scale-MAE | Scale-MAE |
Method
The geospatial machine learning community has embraced multimodal data for self-supervised pretraining of geospatial foundation models. Leveraging multimodal data at inference time as well provides the model with more context when making a prediction. A pretrained model can use multimodal data at inference time by taking it as input, but this typically requires a large multimodal encoder trained on large amounts of data. Inspired by both the multi-pretext pretraining paradigm employed by MMEarth and the test-time adaptation framework, we propose using multiple modalities as auxiliary tasks at test time. In particular, we use multiple modalities as reconstruction targets to provide a test-time adaptation signal for the encoder. The reconstruction tasks can be solved with lightweight, linear decoders, and they do not require the encoder to accept the modalities as input. In the animation below, we illustrate our method for test-time training with multimodal reconstruction (TTT-MMR).
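The adaptation loop can be sketched as follows. This is a minimal, model-agnostic sketch, not the paper's exact implementation: the encoder, task head, optimizer settings, and the assumption that each auxiliary modality is a tile-level target vector are all illustrative.

```python
import torch
import torch.nn as nn

def ttt_mmr(encoder, head, x, aux_targets, steps=10, lr=1e-4):
    """Test-time training with multimodal reconstruction (sketch).

    encoder     pretrained feature extractor, adapted at test time
    head        task head producing the final prediction
    x           batch of test inputs the encoder accepts (e.g. Sentinel-2)
    aux_targets dict: modality name -> target tensor of shape (B, C_m);
                these modalities are never fed to the encoder as input,
                they serve only as reconstruction targets
    """
    feat_dim = encoder(x).shape[-1]
    # One lightweight linear decoder per auxiliary modality.
    decoders = nn.ModuleDict({
        name: nn.Linear(feat_dim, target.shape[-1])
        for name, target in aux_targets.items()
    })
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoders.parameters()), lr=lr
    )
    for _ in range(steps):
        z = encoder(x)
        # Sum of reconstruction losses over all available modalities.
        loss = sum(
            nn.functional.mse_loss(decoders[name](z), target)
            for name, target in aux_targets.items()
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return head(encoder(x))
```

Because the decoders are linear and discarded after adaptation, the per-batch overhead is small compared to running a full multimodal encoder.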
Applying TTT to batches of test tiles acts as a regularizer since it results in less noisy gradients. However, this also allows for less specialization to any particular tile, which could limit its benefits. To balance regularization and specialization, rather than forming batches randomly as in TTT-MMR, we also propose geographic batching, in which test tiles are grouped into non-overlapping batches that are contiguous geographic regions. Our TTT-MMR-Geo method batches the test tiles based on geographic proximity as a proxy for tile similarity using recursive spatial partitioning.
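Geographic batching by recursive spatial partitioning can be illustrated as below. The median split along the coordinate axis with the larger extent is an assumed partitioning rule for this sketch, not necessarily the exact procedure used by TTT-MMR-Geo.

```python
import numpy as np

def geo_batches(coords, batch_size):
    """Partition tile coordinates into non-overlapping, spatially
    contiguous batches of at most `batch_size` tiles.

    coords: array of shape (N, 2) with (lon, lat) per tile
    returns: list of index arrays, one per batch
    """
    def split(indices):
        if len(indices) <= batch_size:
            return [indices]
        pts = coords[indices]
        # Split along the axis with the larger spatial extent,
        # at the median, so each half stays geographically compact.
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        order = indices[np.argsort(pts[:, axis])]
        mid = len(order) // 2
        return split(order[:mid]) + split(order[mid:])

    return split(np.arange(len(coords)))
```

Each resulting batch covers a compact region, so tiles within a batch tend to be similar, which is the intended trade-off between regularization (batched gradients) and specialization (locally homogeneous batches).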
Download
Run the following command in the command line to download the MMEarth-Bench data.
```sh
mkdir -p mmearth-bench-data/{biomass,soil_nitrogen,soil_organic_carbon,soil_pH,species}
for task in biomass soil_nitrogen soil_organic_carbon soil_pH species; do
  wget -c -P "mmearth-bench-data/$task" \
    "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/$task/$task.h5" \
    "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/$task/${task}_split_data.json"
done
wget -c -P mmearth-bench-data/species \
  "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/species/species_labels.json"
wget -c -P mmearth-bench-data \
  "https://sid.erda.dk/share_redirect/cbMhbwV1yP/mmearth-bench-data/no_data_values.json"
```
This will take about 2 hours and yield the following folder structure, occupying 59 GB:
```
mmearth-bench-data/
├── biomass/
│   ├── biomass_split_data.json
│   └── biomass.h5
├── soil_nitrogen/
│   ├── soil_nitrogen_split_data.json
│   └── soil_nitrogen.h5
├── soil_organic_carbon/
│   ├── soil_organic_carbon_split_data.json
│   └── soil_organic_carbon.h5
├── soil_pH/
│   ├── soil_pH_split_data.json
│   └── soil_pH.h5
├── species/
│   ├── species_labels.json
│   ├── species_split_data.json
│   └── species.h5
└── no_data_values.json
```
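The `.h5` files can be read lazily with `h5py`, which avoids loading the full 59 GB into memory. The dataset key names inside the MMEarth-Bench files are not documented on this page, so the sketch below uses a stand-in file with hypothetical keys (`sentinel2`, `label`) purely to show the access pattern.

```python
import h5py
import numpy as np

# Build a small stand-in file with hypothetical keys; the real
# MMEarth-Bench key names may differ.
with h5py.File("example.h5", "w") as f:
    f.create_dataset("sentinel2",
                     data=np.zeros((2, 12, 128, 128), dtype="float32"))
    f.create_dataset("label", data=np.zeros((2,), dtype="float32"))

with h5py.File("example.h5", "r") as f:
    print(list(f.keys()))     # inspect what the file actually contains
    tile = f["sentinel2"][0]  # per-tile slicing reads only that tile
    print(tile.shape)         # (12, 128, 128)
```

Listing `f.keys()` on the downloaded files is the quickest way to discover each task's actual layout.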
BibTeX
```bibtex
@misc{gordon2026mmearthbench,
  title={MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training},
  author={Lucia Gordon and Serge Belongie and Christian Igel and Nico Lang},
  year={2026},
  eprint={2602.06285},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.06285}
}
```