Leakage-Aware Research Design for Interpretable PM2.5 Prediction Across Urban Air-Quality Monitoring Sites

Hongzhi Lu; Hongxue Lu

Authors

Hongzhi Lu, School of Industrial Technology, Universiti Sains Malaysia, Gelugor 11800, Malaysia
Hongxue Lu, University of Malaya, Jalan Universiti, Kuala Lumpur 50603, Malaysia

Keywords:

PM2.5 prediction, research design, temporal validation, air quality, machine learning, reproducibility

Abstract

Air-quality prediction studies often report overly optimistic model performance when temporal and spatial dependencies are not handled explicitly. This study develops a leakage-aware research design for next-hour PM2.5 prediction using the public Beijing Multi-Site Air Quality dataset. Lagged pollutant, meteorological, calendar, wind-direction, and station features were evaluated with a persistence baseline and with Ridge regression, random forest, and histogram gradient boosting models. The design compared random-split, chronological, rolling-origin, station-holdout, multi-horizon, feature-ablation, leakage-stress, and stratified-error analyses, reporting MAE, RMSE, R², median absolute error, and bootstrap confidence intervals. On 408,172 feature rows, the best model under chronological validation was HistGradientBoosting, with an MAE of 15.61 µg/m³ and an RMSE of 29.01 µg/m³, whereas the best random-split MAE was 14.90 µg/m³. At the 24-hour horizon, the HistGradientBoosting MAE rose to 53.03 µg/m³, and a deliberately leaky target feature reduced the apparent MAE to 0.85 µg/m³, demonstrating why leakage diagnostics are necessary. The workflow provides a reproducible blueprint for environmental machine-learning studies rather than a new forecasting algorithm.
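
The contrast between chronological and random splits, and the effect of a leaky feature, can be illustrated with a minimal sketch. It is not the study's pipeline: it uses a synthetic hourly series instead of the Beijing dataset, only the Python standard library, and a persistence baseline in place of the fitted models. All function names (make_lagged_rows, chronological_split, random_split, persistence_mae) and the parameter values are hypothetical and chosen for illustration only.

```python
import math
import random


def make_lagged_rows(series, n_lags=3, horizon=1):
    """Build (t, lags, target) rows; lags use only values at or before hour t."""
    rows = []
    for t in range(n_lags - 1, len(series) - horizon):
        lags = [series[t - k] for k in range(n_lags)]  # y[t], y[t-1], ...
        rows.append((t, lags, series[t + horizon]))    # target is y[t+horizon]
    return rows


def chronological_split(rows, train_frac=0.8):
    """Train on the earliest rows, test on the latest; no shuffling."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


def random_split(rows, train_frac=0.8, seed=0):
    """Shuffle before splitting; test hours end up interleaved with training hours."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def persistence_mae(rows):
    """Persistence baseline: predict y[t+horizon] with the latest observed y[t]."""
    return mae([target for _, _, target in rows], [lags[0] for _, lags, _ in rows])


# Synthetic hourly series (assumption: a daily cycle plus Gaussian noise).
rng = random.Random(42)
series = [60 + 30 * math.sin(2 * math.pi * h / 24) + rng.gauss(0, 5)
          for h in range(500)]

rows = make_lagged_rows(series, n_lags=3, horizon=1)
chrono_train, chrono_test = chronological_split(rows)
rand_train, rand_test = random_split(rows)

baseline_mae = persistence_mae(chrono_test)

# Leakage stress test: append the target itself as a feature and "predict" with it.
leaky_rows = [(t, lags + [target], target) for t, lags, target in rows]
_, leaky_test = chronological_split(leaky_rows)
leaky_mae = mae([target for _, _, target in leaky_test],
                [feats[-1] for _, feats, _ in leaky_test])

print("chronological persistence MAE:", round(baseline_mae, 2))
print("MAE with leaky feature:", leaky_mae)
```

In this sketch the chronological split guarantees that every test hour follows every training hour, whereas the random split does not. The leaky feature drives the error to zero without any modelling, which is the kind of implausibly low error that a leakage-stress check is meant to expose.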