Seeing the Heat: Vision Transformers for Spatiotemporal Temperature Prediction / Russo, Paolo; Di Ciaccio, Fabiana. - ELETTRONICO. - 16169:(2026), pp. 365-376. (Workshops and competitions hosted by the 23rd International Conference on Image Analysis and Processing, ICIAP 2025, Rome, Italy, 2025) [10.1007/978-3-032-11317-7_31].
Seeing the Heat: Vision Transformers for Spatiotemporal Temperature Prediction
Di Ciaccio, Fabiana
2026
Abstract
Forecasting land surface temperature (LST) is a key task in environmental monitoring. While Transformer-based architectures have recently shown promise in modeling temperature time series, it remains unclear whether representing geophysical signals as raw sequences or as visual encodings offers better predictive accuracy. In this work, we investigate this question by comparing two autoregressive architectures: a standard sequence-based Transformer operating on raw LST values, and a video-based model that leverages visual encodings of temperature fields processed by a pre-trained encoder. To this end, we convert hourly LST grids from the Copernicus Land Monitoring Service and the ERA5 datasets into color-coded heatmaps using OpenCV, and use them to train a TimeSformer-based model that performs spatiotemporal attention across the resulting video-like sequences. The visual backbone extracts structured features from both space and time, which are then passed to an autoregressive decoder to forecast the next 72 h of temperature evolution. Our framework is evaluated on a multi-year dataset covering the region of Florence (Italy), and compared against a previously validated Transformer model trained directly on numerical signals. Experimental results show that the vision-based model achieves competitive performance with respect to the numeric baseline. The study highlights the potential of vision transformers for environmental forecasting tasks, bridging computer vision and climate modeling.



