Because air quality significantly affects human health, it is becoming increasingly important to predict the Air Quality Index (AQI) accurately and in a timely manner. To this end, this paper proposes a new federated learning-based aerial-ground air quality sensing framework for fine-grained 3D air quality monitoring and forecasting. Specifically, in the air, the framework leverages a lightweight Dense-MobileNet model to achieve energy-efficient end-to-end learning from the haze features of images taken by Unmanned Aerial Vehicles (UAVs) in order to predict the AQI scale distribution. Furthermore, the federated learning framework not only allows various organizations or institutions to collaboratively train a global model for AQI monitoring without compromising privacy, but also expands the monitoring scope of UAV swarms. For the ground sensing system, we propose a Graph Convolutional neural network-based Long Short-Term Memory (GC-LSTM) model to achieve accurate inference of both real-time and future AQI. The GC-LSTM model uses the topological structure of the ground monitoring stations to capture the spatio-temporal correlation of historical observation data, which helps the aerial-ground sensing system achieve accurate AQI inference. Extensive case studies on a real-world dataset show that the proposed framework achieves accurate and energy-efficient AQI sensing without compromising the privacy of raw data.
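To make the collaborative training step concrete, the following is a minimal sketch of how a server might aggregate locally trained model parameters. It assumes the standard federated averaging rule (a sample-size-weighted mean of client parameters); the abstract does not state which aggregation rule the framework uses, and the names `federated_average`, `Weights`, and `client_updates` are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: each model is a dict mapping a layer name
# to a flat list of parameters.
Weights = Dict[str, List[float]]


def federated_average(client_updates: List[Tuple[Weights, int]]) -> Weights:
    """Aggregate client models with a sample-size-weighted mean.

    Assumption: standard federated averaging. Each entry of
    client_updates is (local_weights, number_of_local_samples).
    Only parameters are shared with the server, never raw images.
    """
    total_samples = sum(n for _, n in client_updates)
    if total_samples <= 0:
        raise ValueError("at least one client with data is required")

    # Initialise the global model with zeros, shaped like the first client.
    first_weights, _ = client_updates[0]
    global_weights: Weights = {
        name: [0.0] * len(values) for name, values in first_weights.items()
    }

    # Accumulate each client's contribution, scaled by its data share.
    for weights, n_samples in client_updates:
        share = n_samples / total_samples
        for name, values in weights.items():
            for i, value in enumerate(values):
                global_weights[name][i] += share * value

    return global_weights


# Usage: two organisations with different amounts of local data.
org_a = ({"dense": [1.0, 2.0]}, 1)
org_b = ({"dense": [3.0, 4.0]}, 3)
global_model = federated_average([org_a, org_b])
```

Under this assumption, an organization with more local samples has proportionally more influence on the global model, while its raw haze images never leave its own devices.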