
Seeing the Heat: Vision Transformers for Spatiotemporal Temperature Prediction / Russo, Paolo; Di Ciaccio, Fabiana. - ELETTRONICO. - 16169 (2026), pp. 365-376. (Workshops and competitions hosted by the 23rd International Conference on Image Analysis and Processing, ICIAP 2025, Rome, Italy, 2025) [10.1007/978-3-032-11317-7_31].

Seeing the Heat: Vision Transformers for Spatiotemporal Temperature Prediction

Di Ciaccio, Fabiana
2026

Abstract

Forecasting land surface temperature (LST) is a key task in environmental monitoring. While Transformer-based architectures have recently shown promise in modeling temperature time series, it remains unclear whether representing geophysical signals as raw sequences or as visual encodings yields better predictive accuracy. In this work, we investigate this question by comparing two autoregressive architectures: a standard sequence-based Transformer operating on raw LST values, and a video-based model that processes visual encodings of temperature fields through a pre-trained encoder. To this end, we convert hourly LST grids from the Copernicus Land Monitoring Service and ERA5 datasets into color-coded heatmaps using OpenCV, which are then used to train a TimeSformer-based model that performs spatiotemporal attention across the resulting video-like sequences. The visual backbone extracts structured features from both space and time, which are passed to an autoregressive decoder to forecast the next 72 h of temperature evolution. Our framework is evaluated on a multi-year dataset covering the region of Florence (Italy) and compared against a previously validated Transformer model trained directly on numerical signals. Experimental results show that the vision-based model achieves competitive performance with respect to the numeric baseline. This study highlights the potential of vision transformers for environmental forecasting tasks, bridging computer vision and climate modeling.
Lecture Notes in Computer Science
Workshops and competitions hosted by the 23rd International Conference on Image Analysis and Processing, ICIAP 2025
Rome, Italy
2025
Russo, Paolo; Di Ciaccio, Fabiana
Files in this product:
There are no files associated with this product.

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1451995
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science (ISI): ND