This survey paper covers the breadth and depth of time-series and spatiotemporal causality methods and their applications in Earth science. The paper first introduces the concepts of causal discovery and causal inference, followed by the underlying causal assumptions, evaluation techniques and key terminologies of the two domain areas. It then reviews the state-of-the-art methods introduced for time-series and spatiotemporal causal analysis, along with their strengths and limitations. The paper further describes existing applications of several methods to specific Earth science questions, such as extreme weather events, sea level rise and teleconnections. Our survey will benefit the Earth science community interested in taking an AI-driven approach to studying the causality of different dynamic and thermodynamic processes, as we present the open challenges and opportunities in performing causality-based Earth science studies. It will also serve as a primer for data science researchers interested in data-driven causal studies, as we share a holistic list of resources, such as Earth science datasets (synthetic, simulated and observational data) and open-source tools for causal analysis.
Title: Synthetic, simulated, and real-world datasets used for causal analysis
Dataset | Description | Data Type | Causal Category | Dataset Type |
---|---|---|---|---|
CausalWorld | This open-source platform for generating causal structure learning benchmark data contains a robotic environment manipulation dataset for different tasks. The generated datasets represent different causal structures of interacting objects, with variables such as robot and object masses, colors and sizes. Different causal studies, such as do-interventions, counterfactual situations, structure learning and inference, can be performed and evaluated using this platform. | Time series | Causal Discovery/Inference | Realistic Simulated |
Harvard Dataverse | Contains six synthetic datasets representing different causal structures. The time series datasets are generated using a nonlinear function of cause variables, linear self-causation and additive Gaussian noise. | Time series | Causal Discovery | Synthetic |
FLAIRS | This resource contains 22 simulated time series datasets. All datasets contain 20 continuous variables and 1000 time points with a lag of 1 and 3 time units. | Time series | Causal Discovery | Synthetic |
Diffusion Data | This dataset contains 4000 samples of diffusion-based spatiotemporal images. The dataset contains 3 variables including treatment, time-varying confounder, and (factual and counterfactual) outcomes. | Spatiotemporal | Causal Inference | Synthetic |
North American Mesoscale (NAM) | Generated by the National Centers for Environmental Prediction (NCEP) using the WRF Non-Hydrostatic Mesoscale Model. This is a spatiotemporal dataset at 12 km resolution covering the continental United States, with data available every 6 hours from 2012-01-01 00:00 to 2023-10-15 18:00. The simulation provides different properties of Air Temperature, Geopotential Height, Humidity, Sea Level Pressure, Snow, Surface Pressure and Upper Level Winds. | Spatiotemporal | Causal Discovery | Realistic Simulated |
NCEP-DOE Reanalysis 2 product | The US National Centers for Environmental Prediction (NCEP) and the Department of Energy (DOE) provide this dataset from 1979 to the present. All available observational data are fed into a complex climate model to generate reanalysis data for unobserved locations and missing time steps. The reanalysis dataset contains almost 40 atmospheric variables. It covers 90N-90S, 0E-357.5E on a global grid of 2.5 degrees latitude x 2.5 degrees longitude (144x73 grid points). | Spatiotemporal | Causal Discovery | Realistic Simulated |
Beijing Multi-Site Air-Quality Dataset | This observational dataset was collected by the Beijing Municipal Environmental Monitoring Center and contains hourly observations of 6 air pollutants (CO, PM2.5, PM10, O3, NO2 and SO2) and 6 meteorological variables (air temperature, wind direction, wind speed, pressure, dew point temperature and precipitation). The data were collected from 2013 to 2017. | Time series | Causal Discovery/Inference | Real-world |
ERA5 | The European Centre for Medium-Range Weather Forecasts (ECMWF) maintains this global climate and weather dataset. Hourly data for different atmospheric, land and oceanic variables are available from 1940 to the present day for the whole globe, and the dataset is updated daily. | Spatiotemporal | Causal Discovery/Inference | Real-world |
Sea Ice Data | This data collection is a polar sea ice observational dataset maintained by the National Snow and Ice Data Center (NSIDC). This dataset is collected from the Scanning Multichannel Microwave Radiometer (SMMR) instrument on the Nimbus-7 satellite and the Special Sensor Microwave/Imager (SSM/I). Several observational variables like sea ice concentration and extent, sea surface temperatures, wind stress, snow cover, rainfall rates, etc. are recorded in this dataset from 1978 to the present. | Spatiotemporal | Causal Discovery/Inference | Real-world |
Metropolit Cohort | The Metropolit Cohort dataset contains data on 11,532 men born in 1953 and alive in 1968 in the Copenhagen Metropolitan area, Denmark. The dataset comprises physical, medical, mental, social and diagnostic information from different stages of these men's lives, collected from nationwide social and health registers. The data are considered highly reliable, with minimal measurement error and strong validity. | Time series | Causal Discovery | Real-world |
Lalonde | This is a popular observational dataset collected from the National Supported Work Demonstration. The study examined how a work training program (the treatment) affected participants' actual wages a few years after the program's conclusion. Besides the treatment indicator, the dataset provides demographic variables such as age, race, academic background and previous real earnings for 260 control and 185 treated subjects (445 in total), along with the response (real earnings in 1978). The treatment assignment indicator takes the value 1 for treated subjects and 0 for control (untreated) subjects. | Time series | Causal Inference | Real-world |
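
To make the synthetic entries in the table more concrete, the following minimal Python sketch generates time series in the spirit of the Harvard Dataverse description: each variable depends linearly on its own past value, nonlinearly on the past values of its causes, and on additive Gaussian noise. It is not the generator used for any dataset listed above. The function name `simulate_series`, the `tanh` nonlinearity, the lag of 1, the coefficients and the three-variable chain are all illustrative assumptions.

```python
import math
import random


def simulate_series(parents, n_steps=500, self_coef=0.5, cause_coef=0.8,
                    noise_std=0.1, seed=0):
    """Simulate lag-1 time series from a hypothetical causal graph.

    parents[i] lists the indices of the variables that cause variable i.
    All coefficients and the tanh nonlinearity are assumptions made for
    illustration only.
    """
    rng = random.Random(seed)
    n_vars = len(parents)
    series = [[0.0] * n_vars]  # every variable starts at zero
    for _ in range(1, n_steps):
        prev = series[-1]
        row = []
        for i in range(n_vars):
            # Linear self-causation from the variable's own previous value.
            value = self_coef * prev[i]
            # Nonlinear contribution of each cause at lag 1.
            value += sum(cause_coef * math.tanh(prev[j]) for j in parents[i])
            # Additive Gaussian noise.
            value += rng.gauss(0.0, noise_std)
            row.append(value)
        series.append(row)
    return series


# Hypothetical ground-truth chain: X0 -> X1 -> X2.
parents = [[], [0], [1]]
data = simulate_series(parents)
print(len(data), len(data[0]))  # 500 time points, 3 variables
```

Because the ground-truth graph (`parents`) is known by construction, a causal discovery method run on `data` can be scored against it, which is the main appeal of synthetic benchmarks of this kind.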