GridFormer-VSR: A Multi-Attention Vision Transformer for Video Super-Resolution
Paper ID: 1102-ICEEM2025 (R1)
Authors
Anas M. Ali*, El-Sayed M. El-Rabaie, Khalil F. Ramadan, Walid El-Shafai, Fathi E. Abd El-Samie
Department of Electronics and Electrical Communications Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf 32952, Egypt
Abstract
Video super-resolution (VSR) is essential for enhancing the visual quality of low-resolution video sequences, particularly in applications such as surveillance, remote sensing, and environmental monitoring. We present GridFormer-VSR, a multi-attention Vision Transformer designed to capture both fine-grained spatial details and long-range temporal dependencies across video frames. The model employs a dual-attention mechanism that combines local window-based attention with grid-based global attention, while a HaloMBConv module improves computational efficiency and preserves fine detail. To evaluate the method, we constructed a new benchmark dataset by generating temporally consistent video sequences from the AID remote sensing dataset. In addition, we introduce a comprehensive evaluation protocol that covers both visual quality and temporal consistency metrics, enabling a more thorough assessment of VSR methods. Experimental results show that GridFormer-VSR achieves state-of-the-art performance on multiple benchmarks, outperforming existing VSR models in PSNR, SSIM, and LPIPS while maintaining inference speeds suitable for real-time applications. Its scalability further extends its applicability to a wide range of practical scenarios, including aerial surveillance and large-scale environmental monitoring. These findings demonstrate the potential of GridFormer-VSR as a robust and versatile solution for high-quality video super-resolution.
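
To make the dual-attention idea concrete, the following is a minimal PyTorch sketch of a single block that combines local window attention with dilated grid-based global attention, in the style of MaxViT-type partitioning. All names (window_partition, grid_partition, DualAttentionBlock) and the window/grid sizes are illustrative assumptions, not GridFormer-VSR's actual implementation, which also includes the HaloMBConv module and temporal modeling not shown here.

import torch
import torch.nn as nn

def window_partition(x, p):
    # Split (B, H, W, C) into non-overlapping p x p windows -> (B*nWin, p*p, C).
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x, g):
    # Group tokens on a dilated g x g grid -> (B*nCell, g*g, C). Each group
    # takes one token from every cell, so attention within it is sparse but global.
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

class DualAttentionBlock(nn.Module):
    # Local window attention (fine spatial detail) followed by grid attention
    # (long-range dependencies), each with a residual connection.
    def __init__(self, dim, heads=4, p=8, g=8):
        super().__init__()
        self.p, self.g = p, g
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by p and g
        B, H, W, C = x.shape
        w = window_partition(self.norm1(x), self.p)
        w, _ = self.local_attn(w, w, w)
        w = w.view(B, H // self.p, W // self.p, self.p, self.p, C)
        x = x + w.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        gr = grid_partition(self.norm2(x), self.g)
        gr, _ = self.global_attn(gr, gr, gr)
        gr = gr.view(B, H // self.g, W // self.g, self.g, self.g, C)
        x = x + gr.permute(0, 3, 1, 4, 2, 5).reshape(B, H, W, C)
        return x

x = torch.randn(1, 32, 32, 64)          # toy feature map
print(DualAttentionBlock(64)(x).shape)  # torch.Size([1, 32, 32, 64])

The key design point is that the two attention passes complement each other: window attention keeps cost linear in image size while modeling local texture, and the dilated grid attention lets every token exchange information across the whole frame without full quadratic attention.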
Keywords
Video Super-Resolution, Vision Transformer, Multi-Attention, Remote Sensing, Temporal Consistency, AID Video Dataset, Deep Learning
Status: Accepted