Post: [Demo] A Benchmark for Data-Driven Climate Projections (http://maanqi4.github.io/posts/climatebench/)

This is an analytic walk-through of the following paper: ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections, https://doi.org/10.1029/2021MS002954 (Watson-Parris et al., 2022).

conda environment

The analysis was conducted in a conda virtual environment (esem). I will attach the .yml file of the environment together with the assignment; you can install it with conda env create -f esem.yml. If you prefer to install it manually, here are the commands I used:

conda create --name esem python scipy numpy pandas xarray matplotlib cartopy seaborn
pip install netcdf4   # better than using conda here, less likely to hit .DLL errors
pip install "dask[dataframe]"
pip install xskillscore
# you probably don't need the following if you don't want to train the emulators yourself
pip install eofs
pip install gpflow
pip install -U scikit-learn
pip install esem

The last command installs the emulation package used in the paper (https://github.com/duncanwp/ESEm/tree/v1.1.0); it also automatically installs its dependency tensorflow. Other packages such as keras and scikit-learn are needed if you want to retrain the emulators from the paper, because the original code specifies hyperparameters through functions from these packages. For precisely reproducing the emulators and generating the data, it is better to use their settings.

dataset from the paper for building emulators (a small download sketch follows the file list below)

test dataset: https://zenodo.org/record/7064308/files/test.tar.gz?download=1
training dataset: https://zenodo.org/record/7064308/files/train_val.tar.gz?download=1

data that I used to evaluate the emulators (the important part of the paper is evaluating the results generated by the emulators, not building the emulators; however, the authors did not provide the emulators' results, only the training and test datasets listed above, so I downloaded the data, built the emulators myself, and generated the emulators' results using the default templates and the same packages as the paper):

test dataset: https://zenodo.org/record/7064308/files/test.tar.gz?download=1 (the same file as above, used for comparing the emulation with the truth generated by the climate model)
outputs_ssp245_prediction_RF.nc; outputs_ssp245_predict_pr90.nc; outputs_ssp245_predict_pr.nc; outputs_ssp245_predict_tas.nc; outputs_ssp245_predict_dtr.nc (the emulator results I generated using the default template settings of the packages)
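If you want to fetch the two Zenodo archives programmatically rather than through a browser, a minimal Python sketch is shown below. It only assumes the two URLs listed above; the local data/ directory and file names are my choices, not part of the paper's code.

import tarfile
import urllib.request
from pathlib import Path

# Zenodo archives listed above; the local layout (data/) is my own choice.
urls = {
    "train_val.tar.gz": "https://zenodo.org/record/7064308/files/train_val.tar.gz?download=1",
    "test.tar.gz": "https://zenodo.org/record/7064308/files/test.tar.gz?download=1",
}

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

for name, url in urls.items():
    archive = data_dir / name
    if not archive.exists():
        print(f"downloading {name} ...")
        urllib.request.urlretrieve(url, archive)
    # extract next to the archive
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(data_dir)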
ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections

1. Introduction

There is a wide range of emission pathways consistent with climate goals. Traditionally, scientists have used Earth System Models (ESMs) to simulate the Shared Socioeconomic Pathways (SSPs) and explore what will happen under future scenarios. Potential problem with the traditional approach: Earth System Models are expensive to run, and it is impractical to use such expensive models to fully explore the space of possibilities (O'Neill et al., 2016). Solution proposed in the paper: ClimateBench, a set of baseline machine/statistical learning models that emulate the response to a variety of forcers, potentially exploring a wider range of SSPs at much lower cost. Idea and assumptions behind this approach: state-of-the-art statistical/machine learning models have shown potential to capture nonlinear relationships in long-term climate responses (Mansfield et al., 2020), and these methods are much less computationally costly than the ESMs traditionally used to simulate climate projections. In the paper, these statistical models are employed as "emulators", trained on CMIP6 output. The emulators can then be used as a cheaper but robust alternative to explore a wide range of emission pathways and to predict annual mean global distributions of temperature and precipitation under unexplored scenarios.

The analytic goals of the paper:

- Providing evidence that statistical emulators are robust in capturing nonlinear relationships in the Earth system (i.e. an emulator can reproduce the target output traditionally generated through ESMs)
- Comparing the different statistical/machine learning methods used to build the emulators (i.e. benchmarking their performance in emulating climate projections)
- Evaluating the emulators by assessing whether the emulation results fulfill physical constraints (i.e. the emulated outputs are meaningful and are likely not the result of over-fitting)

2. Data and emulator

2.1 Data

Input variables: global mean emissions of carbon dioxide ($CO_{2}$) and methane ($CH_4$), and the first 5 principal components of sulfur dioxide ($SO_2$) and black carbon ($BC$) emissions.

Output variables: temperature (surface air temperature); precipitation (total precipitation); diurnal temperature range (difference between daily maximum and minimum surface air temperature); extreme precipitation (90th percentile of daily precipitation).

Task for the emulators: predict the output variables using only the input variables under the chosen test scenario, ssp245. The emulators are evaluated on their prediction skill over 2080-2100.

Training dataset: the training data come from the following scenarios: the historical simulation; ssp126; ssp370; ssp585; and the historical simulations forced by aerosol only (hist-aer) and greenhouse gases only (hist-GHG). These scenarios are summarized in the table below:

Protocol      Experiment   Period      Notes
ScenarioMIP   ssp126       2015-2100   A high-ambition scenario designed to produce significantly less than 2° of warming by 2100
              ssp245       2015-2100   Designed to represent a medium forcing future scenario; this is the test scenario held back for evaluation
              ssp370       2015-2100   A medium-high forcing scenario with high emissions of near-term climate forcers (NTCF) such as methane and aerosol
              ssp585       2015-2100   Represents the high end of the range of future pathways in the IAM literature, leading to a very large forcing of 8.5 W m−2 in 2100
CMIP6         historical   1850-2014   A simulation using historical emissions of all forcing agents, designed to recreate the historically observed climate
DAMIP         hist-aer     1850-2014   A historical simulation forced only by changes in anthropogenic aerosol
              hist-GHG     1850-2014   A historical simulation with varying concentrations of CO2 and other long-lived greenhouse gases (only)

In the training process, the input and output variables from each of the selected scenarios form the training dataset, and the hyperparameters of each emulator are optimized through the optimisation algorithms available in the ESEm package.
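To make the input description above concrete, here is a minimal sketch of how the forcing inputs could be assembled: the global CO2 and CH4 series plus the leading five principal components of the SO2 and BC emission maps. It uses scikit-learn's PCA for illustration; the paper's own pipeline uses the eofs/ESEm utilities, and the file path and variable names below are assumptions on my part.

import numpy as np
import xarray as xr
from sklearn.decomposition import PCA

# Assumed file layout and variable names; adjust to the actual ClimateBench training files.
ds = xr.open_dataset("data/train_val/inputs_historical.nc")

def first_pcs(field, n=5):
    """Flatten a (time, lat, lon) emission field and return its first n principal components."""
    arr = field.values.reshape(field.shape[0], -1)
    return PCA(n_components=n).fit_transform(arr)

X = np.concatenate(
    [
        ds["CO2"].values[:, None],   # global CO2 series, one value per year (assumed 1-D)
        ds["CH4"].values[:, None],   # global CH4 emission series (assumed 1-D)
        first_pcs(ds["SO2"]),        # first 5 PCs of the SO2 emission maps
        first_pcs(ds["BC"]),         # first 5 PCs of the BC emission maps
    ],
    axis=1,
)
print(X.shape)  # (n_years, 12)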
2.2 Methods for building emulators

2.2.1 Random Forest

A tree-based method repeatedly splits the data into subsets according to its features, such that the within-subset variance is low and the between-subset variance is high. A random forest trains different trees on different subsets of the data, or holds back some of the data dimensions for each individual tree, and makes its prediction by averaging over the predictions of all individual trees.

Training setting:

- training data: global mean emissions of $CO_{2}$ and $CH_4$, and the first 5 principal components of $SO_2$ and $BC$
- the following hyperparameters are tuned by random search over the training data without replacement: number of trees, tree depth, the number of samples required to split a node, and the number of samples required at each leaf node (see the sketch after this list)
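As an illustration of that kind of hyperparameter search, the sketch below uses scikit-learn's RandomForestRegressor with RandomizedSearchCV. The paper builds its random forest through ESEm's wrappers with its own search ranges, so the ranges and estimator here are placeholders, not the authors' settings.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# X: (n_samples, 12) forcing inputs; y: flattened output fields (placeholder shapes)
param_distributions = {
    "n_estimators": [50, 100, 150, 200, 250],    # number of trees
    "max_depth": [5, 10, 20, 40, None],          # tree depth
    "min_samples_split": [2, 5, 10, 15],         # samples required to split a node
    "min_samples_leaf": [1, 2, 4, 8, 12],        # samples required at a leaf node
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=30,
    cv=3,
    n_jobs=-1,
)
# search.fit(X, y)            # uncomment once X and y are loaded
# print(search.best_params_)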
2.2.2 Neural Network

The chosen architecture consists of a CNN followed by an LSTM, built with the Keras library. The CNN part includes one convolutional layer with 20 filters, a filter size of 3, and a ReLU activation function. The 3 × 3 pixel filters scan the input fields to detect spatial patterns and feed these patterns to the next layers, which are average pooling layers that reduce the spatial dimensionality ahead of the LSTM layer.

Training setting:

- the training time series is segmented into 10-year chunks using a moving window in one-year increments, leading to 754 training samples of shape (10, 96, 144, 4), corresponding to the number of years, latitude, longitude and number of input variables
- four different emulators are trained, one for each of the four output variables
- no hyperparameter optimization is performed; all parameters were chosen manually (a sketch of this kind of architecture follows this list)
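For concreteness, a minimal Keras sketch of such a CNN-LSTM is shown below. The input shape and the 20-filter, 3 × 3 convolution follow the description above; the pooling sizes, LSTM width and output head are placeholders I chose to make the model runnable, not the values used in the paper.

from tensorflow import keras
from tensorflow.keras import layers

# Input: a 10-year window of 96x144 global fields with 4 input variables.
inputs = keras.Input(shape=(10, 96, 144, 4))
x = layers.TimeDistributed(layers.Conv2D(20, (3, 3), activation="relu"))(inputs)
x = layers.TimeDistributed(layers.AveragePooling2D((2, 2)))(x)    # reduce spatial dimensionality
x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)    # one feature vector per year
x = layers.LSTM(25)(x)                                            # temporal dependence across the window
x = layers.Dense(96 * 144)(x)
outputs = layers.Reshape((96, 144))(x)                            # one predicted global field per sample

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()

One such model would be trained per output variable, matching the "four different emulators" point above.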
3. Walk-through analysis of 2 emulators

Here I reproduce the evaluation process for 2 of the baseline emulators described in the paper (and above). The methods used in the paper to evaluate the emulators are:

- t-test
- RMSE
- comparison against physically constrained relationships

These are demonstrated below.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import xarray as xr
from glob import glob
import seaborn as sns
import numpy as np
import pandas as pd

# read the model data and the emulation data I generated
data_path = "C:\\Users\\maq\\scripts\\gg606\\data\\"
# to rerun the script, change the above to wherever the data is stored
variables = ['tas', 'dtr', 'pr', 'pr90']

output_ssp245 = xr.open_dataset(data_path + 'outputs_ssp245.nc').rename(diurnal_temperature_range="dtr").mean('member')
rf_predictions = xr.open_dataset(data_path + 'outputs_ssp245_prediction_RF.nc').swap_dims(sample='time')
rf_predictions = rf_predictions.assign_coords(time=rf_predictions.time+1).rename(diurnal_temperature_range="dtr")
nn_predictions = xr.merge([{v: xr.open_dataarray(data_path + "outputs_ssp245_predict_{}.nc".format(v))} for v in variables])

# convert the precipitation values to mm/day; only the RF output already has this applied
output_ssp245["pr"] *= 86400
output_ssp245["pr90"] *= 86400
nn_predictions["pr"] *= 86400
nn_predictions["pr90"] *= 86400

# now we have 3 datasets: the original truth, and the outputs predicted by the 2 emulators
models = [output_ssp245, rf_predictions, nn_predictions]
model_names = ['output_ssp245', 'Random Forest', 'Neural Network']
var_names = labels = ["Temperature (K)", "Diurnal temperature range (K)", "Precipitation (mm/day)", "Extreme precipitation (mm/day)"]
#rf_predictions

# t-test
def ttest(diff_mean, diff_std, diff_num):
    from scipy.stats import distributions
    z = diff_mean / np.sqrt(diff_std ** 2 / diff_num)
    # use np.abs to get the upper tail,
    # then multiply by two as this is a two-tailed test
    p = distributions.t.sf(np.abs(z), diff_num - 1) * 2
    return z, p

p_level = 0.05
proj = ccrs.Robinson()  # this projection is very slow to draw...
kwargs = [dict(cmap="coolwarm", vmin=-1, vmax=1),
          dict(cmap="coolwarm", vmin=-0.5, vmax=0.5),
          dict(cmap="coolwarm", vmin=-1, vmax=1),
          dict(cmap="coolwarm", vmin=-2, vmax=2)]

with sns.plotting_context("talk"):
    fig, axes = plt.subplots(4, 2, subplot_kw=dict(projection=proj), figsize=(16, 16), constrained_layout=True)
    for fig_axes, var, var_name, kws in zip(axes, variables, var_names, kwargs):
        for ax, model, model_name in zip(fig_axes, models[1:], model_names[1:]):
            ax.set_title(model_name)
            diff = (model[var] - models[0][var]).sel(time=slice(2080, 2100))
            mean_diff = diff.mean('time')
            _, p = ttest(mean_diff, diff.std('time'), diff.count('time'))
            if model_name == 'Neural Network':
                mean_diff.where(p < p_level).plot(ax=ax, add_labels=False, transform=ccrs.PlateCarree(),
                                                  cbar_kwargs={"label": var_name, "orientation": 'vertical'}, **kws)
            else:
                mean_diff.where(p < p_level).plot(ax=ax, add_labels=False, transform=ccrs.PlateCarree(),
                                                  add_colorbar=False, **kws)
            ax.coastlines()

Through the t-test we produce maps of the mean difference between the ClimateBench target output variables and the 2 baseline emulators under the test ssp245 scenario, averaged between 2080 and 2100. Differences insignificant at the p < 5% level are masked from the plots. The figure shows a similar bias structure for the 2 selected emulators. For example, both emulators tend to over-predict temperature in the northern hemisphere and under-predict temperature over the southern hemisphere ocean. The colorbar indicates the actual difference between emulation and simulation. Overall, the Random Forest emulator tends to produce larger differences in the target variables, while the neural network performs best in predicting them.

# RMSE
from xskillscore import rmse

# weight by latitude
weights = np.cos(np.deg2rad(output_ssp245['tas'].lat)).expand_dims(lon=144).assign_coords(lon=output_ssp245.lon)

def global_mean(ds):
    weights = np.cos(np.deg2rad(ds.lat))
    return ds.weighted(weights).mean(['lat', 'lon'])

RMSE_S = pd.DataFrame({
    name: {var: rmse(output_ssp245[var].sel(time=slice(2080, None)).mean('time'),
                     model[var].sel(time=slice(2080, None)).mean('time'),
                     weights=weights).data /
                np.abs(global_mean(output_ssp245[var].sel(time=slice(2080, None)).mean('time')).data)
           for var in variables}
    for name, model in zip(model_names[1:], models[1:])
})

RMSE_G = pd.DataFrame({
    name: {var: rmse(global_mean(output_ssp245[var].sel(time=slice(2080, None))),
                     global_mean(model[var].sel(time=slice(2080, None)))).data /
                np.abs(global_mean(output_ssp245[var].sel(time=slice(2080, None)).mean('time')).data)
           for var in variables}
    for name, model in zip(model_names[1:], models[1:])
})

RMSE_T = RMSE_S + 5*RMSE_G
merge_rmse = pd.concat([RMSE_S, RMSE_G, RMSE_T], keys=['Spatial', 'Global', 'Total'])[model_names[1:]].T.swaplevel(axis=1)[variables]
merge_rmse.style.highlight_min(axis=0, props='font-weight: bold').format("{:.3f}")

(In the styled output, the smallest value in each column is highlighted in bold.)

                  tas                      dtr                      pr                       pr90
                  Spatial Global  Total    Spatial Global  Total    Spatial Global  Total    Spatial Global  Total
Random Forest     0.108   0.058   0.400    9.195   2.652   22.457   2.524   0.502   5.035    2.682   0.543   5.399
Neural Network    0.102   0.043   0.316    8.493   1.679   16.885   2.191   0.209   3.235    2.765   0.318   4.356

The primary metric used to evaluate the emulators is the normalized, global mean root-mean-square error (NRMSE). It quantifies the deviation between the emulated output variables and the target output variables in the test scenario: the smaller the NRMSE, the smaller the error produced by the emulator.

In the paper, the spatial, global and total NRMSE are defined as follows:

Spatial NRMSE: $$NRMSE_s = \sqrt{\left\langle \left( |x_{i,j,t}|_t - |y_{i,j,t,n}|_{n,t} \right)^2 \right\rangle} \; / \; \left| \langle y_{i,j} \rangle \right|_{t,n}$$

Global NRMSE: $$NRMSE_g = \sqrt{\left| \left( \langle x_{i,j,t} \rangle - \langle |y_{i,j,t,n}|_{n} \rangle \right)^2 \right|_{t}} \; / \; \left| \langle y_{i,j} \rangle \right|_{t,n}$$

Total NRMSE: $$NRMSE_t = NRMSE_s + \alpha \cdot NRMSE_g$$

Here $|\cdot|_t$ and $|\cdot|_{n}$ denote averaging over time and ensemble members, and $\langle x_{i,j} \rangle$ denotes the global mean with a weighting function that accounts for the decreasing grid-cell area toward the poles:

$$\langle x_{i,j} \rangle = \frac{1}{N_{lat}N_{lon}} \sum_{i}^{N_{lat}} \sum_{j}^{N_{lon}} \cos(lat(i)) \, x_{i,j}$$

$\alpha$ is a defined weighting coefficient, and $\alpha = 5$ in this paper.
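To make those definitions concrete, here is a small NumPy sketch of the spatial and global NRMSE for a single variable, written directly from the formulas rather than via xskillscore. The array names and shapes are assumptions for illustration; the member dimension is assumed to be averaged out already, and the area weights are normalised by their sum (the constant prefactor cancels in the ratio).

import numpy as np

def nrmse_components(x, y, lat, alpha=5.0):
    """x: emulated field (time, nlat, nlon); y: target field, same shape; lat in degrees."""
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones((len(lat), x.shape[-1]))  # (nlat, nlon) area weights

    def gmean(field):
        # <.>: area-weighted global mean over the last two (lat, lon) axes
        return (field * w).sum(axis=(-2, -1)) / w.sum()

    denom = np.abs(gmean(y.mean(axis=0)))                               # |<y>| over time
    nrmse_s = np.sqrt(gmean((x.mean(axis=0) - y.mean(axis=0)) ** 2)) / denom   # compare time-mean maps
    nrmse_g = np.sqrt(np.mean((gmean(x) - gmean(y)) ** 2)) / denom             # compare global-mean time series
    return nrmse_s, nrmse_g, nrmse_s + alpha * nrmse_g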
As shown in the table above, the neural network emulator seems to perform better at predicting temperature and precipitation changes. For most target variables and most metrics, the neural network emulator scores a smaller NRMSE, i.e. a smaller difference between its emulations and the original model simulations.

# Clausius-Clapeyron relationship and energy conservation considerations
baseline_precip = 2.335

with sns.plotting_context("talk"):
    x = np.linspace(0, 2.5, 100)
    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    for model, model_name in zip(models, model_names):
        smooth = global_mean(model).coarsen(time=5, boundary='pad').mean().dropna('time')
        x_, y_ = smooth['tas'], smooth['pr'] / baseline_precip * 100
        s = ax.scatter(x_.sel(time=slice(2030, None)), y_.sel(time=slice(2030, None)), label=model_name)
    plt.plot(x, x*6, label="Clausius-Clapeyron (6%/K)", ls='-', c='k')
    plt.plot(x, x*2-1, label="Energetic constraint (2%/K)", ls='--', c='r')
    plt.setp(plt.gca(), xlabel="Temperature change (K)", ylabel="Precipitation change (%)", xlim=[0, 2.5], ylim=[0, 4])
    plt.legend(loc='upper left')

One common concern about applying statistical emulators to Earth system modelling is their ability to capture and reproduce physical relationships, i.e. whether the emulators are over-fitted to the existing dataset. In the paper, the authors demonstrate the trustworthiness of the emulators by assessing whether the emulations capture the same relationships implied by physical constraints. In particular, they chose the Clausius-Clapeyron relationship, the relative change in global mean precipitation as a function of global mean temperature change. In simple atmospheric physics terms, as the air temperature increases, more latent heat exchange is needed for water vapour to complete a phase transition (precipitation); the atmosphere can hold more water at higher temperatures, and atmospheric water content increases by between 6 and 7% per 1 K of warming if we only consider precipitation changes in a local air mass.

However, in the real Earth system we also need to consider energy constraints when assessing global precipitation changes. Global precipitation changes are balanced by radiative cooling from clouds (Pendergrass & Hartmann, 2013), and the relative change of global precipitation under warming is around 2% per 1 K.

The figure above compares the predicted temperature changes against the relative global precipitation changes in the emulations and the simulations. The slope of the red dashed line indicates the expected relationship between temperature and precipitation under the energetic constraint. If the slope of the scattered points is close to the slope of the red line, the changes predicted by the emulations are close to the physically expected relationship. The figure indicates that the neural network emulator performs better at capturing the physical relationship between changes in global precipitation and temperature, and its predicted relationship is close to both the expected value and the model simulation results under the target ssp245 scenario.
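If you want a number rather than a visual comparison, a least-squares slope of the scattered points can be compared directly with the 6%/K and 2%/K reference lines. This small sketch reuses the smoothed global means computed above; it is my addition, not part of the paper's evaluation.

# Fit precipitation change (%) against temperature change (K) for each dataset
for model, model_name in zip(models, model_names):
    smooth = global_mean(model).coarsen(time=5, boundary='pad').mean().dropna('time').sel(time=slice(2030, None))
    dT = smooth['tas'].values
    dP = (smooth['pr'] / baseline_precip * 100).values
    slope, intercept = np.polyfit(dT, dP, 1)
    print(f"{model_name:>15s}: {slope:.2f} %/K (energetic constraint ~2 %/K, Clausius-Clapeyron ~6-7 %/K)")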
4. Critique

4.1 Data

The data used in the paper to construct the emulators come from validated, peer-reviewed projects that use Earth system modelling to perform climate projections under given future scenarios. These model simulation data are quite accessible, and the data quality is very good.

However, the trained emulators and the emulation results analysed in the paper are not provided. To reproduce the paper, I had to use the ML algorithms in the given packages with the authors' hyperparameters, together with the same model simulation data, to train the emulators myself. This creates a bit of a barrier to reproducing the paper. It is still possible to reproduce it, and the results from the paper are validated by my reproduction.

4.2 Method

4.2.1 Emulator

Since the goal of this paper is benchmarking, i.e. assessing different methods for constructing an emulator, the selection of the specific methods is important. The assessed emulators are all constructed using popular and powerful machine learning methods that have previously been applied to emulating nonlinear relationships in the Earth system, which makes the emulators used in this research well suited to the task.

The process of constructing the emulators also follows established approaches in the field, where the relationships between emitted greenhouse gases and temperature/precipitation changes are used to construct the emulators. The hyperparameters selected for each emulator are reasonable and listed in the appendix, and the packages used are well constructed and accessible from GitHub, which makes the reproduction process easier.

In terms of the emulation itself, the emulators are employed in an interpolation problem: a broader range of SSPs is used in training and only one target SSP (ssp245) within that range is used for testing. Using statistical/machine learning models for interpolation is a safer approach than extrapolation and can help avoid overfitting. However, the selected SSPs span quite a small range, and the data used in training are not fully independent. This is partly due to the limited number of model simulations available to explore the SSPs, and partly due to the time-dependent nature of the Earth system itself, where a variable at one timestep depends on its previous timesteps. Therefore, although the authors used several SSP simulations and their data at different timesteps for training, the dataset is not entirely independent, which might affect the emulators' results.
4.2.2 Evaluation

The methods used for evaluation are trustworthy and useful.

The authors employ Student's t-test to assess whether there are significant differences between emulations and simulations. By showing maps of the differences, readers can easily see the spatial bias structure of each emulator, which helps deliver a better illustration of the emulators' biases. The authors do not include a detailed explanation of the t-test (and its normal-distribution approximation) but point to a statistics textbook instead; this is probably because the t-test is widely used in atmospheric science and it is more efficient to assume readers understand this common method.

The RMSE metrics are a clear and straightforward way to quantify the errors produced by the emulators; they provide a numeric value for evaluation and complement the spatial maps of t-tested differences. However, the authors do not provide a detailed explanation of each of the RMSE variants they construct, nor do they explain why they favour one variant over another. The spatial and global NRMSE used in the paper differ only in when the average over the time dimension is performed (before or after calculating the difference), i.e. whether the averaging neglects time variability within the selected timeframe. Since the task for the emulators in this paper is to predict the target temperature and precipitation changes over a given period, time variability in the atmosphere is not the first priority here; nevertheless, an explanation of the different RMSE metrics would still have been useful.

Lastly, assessing the relationship between relative global precipitation change and temperature change is a good way to probe the emulators' ability to capture physical relationships under energy constraints. Given that the target outputs are either precipitation or temperature, the selected relationship is a useful check, and demonstrating it through a scatterplot fully illustrates the relationship between the two variables.

4.3 Structure and layout

The figure style in the paper is clear and appealing, and meets the common requirements of scientific publication. Apart from benchmarking different emulators, the paper also serves as an introduction for readers who are not specialized in emulation and calibration. In the first half of the paper, the authors therefore spend considerable effort introducing the background of the research, the dataset acquired and the emulators employed, and they illustrate the characteristics of the dataset with several figures. These figures are not directly used for analysis but for introducing the background and data, which makes for a clearer introduction but also pushes the first analytic figure back to Figure 4 of the paper. In this analytic demo I therefore selected only the figures that are useful for benchmarking and evaluating the emulators, in a concise way. Overall, the original paper is informative and provides a nice demonstration of benchmarking and evaluating selected emulators for climate projection.

References

Mansfield, L. A., Nowack, P. J., Kasoar, M., Everitt, R. G., Collins, W. J., & Voulgarakis, A. (2020). Predicting global patterns of long-term climate change from short-term simulations using machine learning. npj Climate and Atmospheric Science, 3(1), 44. https://doi.org/10.1038/s41612-020-00148-5

O'Neill, B. C., Tebaldi, C., van Vuuren, D. P., Eyring, V., Friedlingstein, P., Hurtt, G., et al. (2016). The Scenario Model Intercomparison Project (ScenarioMIP) for CMIP6. Geoscientific Model Development, 9, 3461–3482. https://doi.org/10.5194/gmd-9-3461-2016

Pendergrass, A. G., & Hartmann, D. L. (2013). The atmospheric energy constraint on global-mean precipitation change. Journal of Climate, 27(2). https://doi.org/10.1175/jcli-d-13-00163.1
Watson-Parris, D., Williams, A., Deaconu, L., & Stier, P. (2021). Model calibration using ESEm v1.1.0 – an open, scalable Earth system emulator. Geoscientific Model Development, 14(12), 7659–7672. https://doi.org/10.5194/gmd-14-7659-2021

Watson-Parris, D., Rao, Y., Olivié, D., Seland, Ø., Nowack, P., Camps-Valls, G., Stier, P., Bouabid, S., Dewey, M., Fons, E., Gonzalez, J., Harder, P., Jeggle, K., Lenhardt, J., Manshausen, P., Novitasari, M., Ricard, L., & Roesch, C. (2022). ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections. Journal of Advances in Modeling Earth Systems, 14(10), e2021MS002954. https://doi.org/10.1029/2021MS002954


Post: Work with Canadian Census data (http://maanqi4.github.io/posts/cancensus/)

This is a course assignment demo (GG606 Scientific Data Wrangling).

cancensus is an R package that can access Statistics Canada Census data for the Census years 1996, 2001, 2006, 2011, 2016 and 2021. The datasets present information from the Census of Population for various levels of geography, including provinces and territories, census metropolitan areas, communities and census tracts.

1. Installation and retrieving the data vector list (API key required)

install.packages("cancensus")
library(cancensus)
#view available Census data
list_census_datasets()
#view regions and vectors in a given dataset
list_census_regions("CA21")
list_census_vectors("CA21")

#retrieve the data needed
#Warning: Cached regions list may be out of date. Set `use_cache = FALSE` to update it.
library(geojsonsf)
dataset <- "CA21"
#set census api
set_cancensus_api_key('ENTER API KEY')
set_cancensus_cache_path(here("data"))
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.0
## ✔ tibble  3.1.8     ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
regions.list <- list_census_regions(dataset, use_cache = FALSE) %>%
  filter(level == "PR", name == "Alberta") %>%
  as_census_region_list
vec <- find_census_vectors("housing", dataset = dataset, query_type = "semantic")
2. Housing situation in Alberta

2.1 Dwelling value

# getting 2021 data
# selecting vectors in need
# Median dwelling value
dwelling.21.cost <- get_census(dataset = "CA21", regions = regions.list,
                               vectors = c("median.dwelling"="v_CA21_4311"),
                               level = "CSD", labels = 'short', geo_format = 'sf')
dwelling.21.value <- get_census(dataset = "CA21", regions = regions.list,
                                vectors = c("median.dwelling"="v_CA21_4311"),
                                level = "CSD", labels = 'short') %>%
  slice_max(median.dwelling, prop = .05)

ggplot(dwelling.21.cost) +
  geom_sf(aes(fill=median.dwelling)) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        strip.text.x = element_text(size=12)) +
  coord_sf(datum=NA) +
  scale_fill_viridis_c("Median Dwellings Value", labels = scales::dollar)

To look at the general housing situation, I first started with the price of dwellings. The first figure shows the distribution of median dwelling value in Alberta. In a large number of districts, the value of dwellings is around $500,000 and above. Roughly, dwellings are more expensive in southern Alberta than in the north, which might be related to the kind of climate people want to live in. There is a distinctive pattern around Calgary: in the centre of Calgary the price of dwellings is far lower than in its surrounding area. A similar, but less obvious, pattern can be found around Edmonton. The reason may be that people prefer to live in suburban areas rather than downtown, but the gap in dwelling values around Calgary is astonishing. Moreover, the ring of districts just outside Calgary is the most expensive area in the province in terms of dwelling value. The next figure looks into this in more detail.

options(scipen = 999)
ggplot(dwelling.21.value, aes(x=median.dwelling, y=fct_reorder(`Region Name`, median.dwelling))) +
  geom_point() +
  theme_minimal() +
  theme(axis.title.y = element_blank()) +
  xlab("Median Dwellings Value") +
  ggtitle("Districts with Top 5% Median Dwellings Value in Alberta")

The second figure looks at dwelling value in more detail, showing the districts with the top 5% of median dwelling values. 25 out of 423 districts in Alberta are selected as having higher median dwelling values; in particular, all districts with a median dwelling value above $500,000 are included. The districts around Calgary are all included, as are the districts to the west of Calgary. By Googling these district names, I realized that they are located in good spots for relaxation and vacation: they are often beside a lake (e.g. Birchcliff, Jarvis Bay) or near a national park (e.g. Banff, Jasper).
The unusually high dwelling values may be related to these good locations and natural surroundings.

2.2 Monthly cost

# Median monthly shelter cost
mon.cost.21 <- get_census(dataset = "CA21", regions = regions.list,
                          vectors = c("median.cost.rent"="v_CA21_4317",
                                      "median.cost.own"="v_CA21_4309"),
                          level = "CSD", geo_format = 'sf', labels = 'short')
mon.cost.21 <- mon.cost.21 %>%
  pivot_longer(cols = starts_with("median.cost"),
               names_to = "status", names_prefix = "median.cost.",
               values_to = "month.cost")
mon.cost.21$status <- factor(mon.cost.21$status, levels = c("own","rent"),
                             labels = c("Owner Households", "Renter Households"))

ggplot(mon.cost.21) +
  geom_sf(aes(fill=month.cost)) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        strip.text.x = element_text(size=12)) +
  coord_sf(datum=NA) +
  scale_fill_viridis_c("Median Monthly cost", labels = scales::dollar) +
  facet_wrap(~status)

The next question I had about housing is whether the monthly cost people spend on housing shares the same distribution as the dwelling price.

The figure above shows the median monthly cost people spend per household. In general, the monthly cost shares the same characteristics as the dwelling price: where dwelling values are low, the monthly cost people spend on housing also tends to be low.

There are 2 interesting differences. First, the ring of higher prices around Calgary disappears, so people who live in high-value dwellings do not actually spend significantly more per month. This might be because those high-value dwellings are owned rather than rented, and the cost of maintaining them as an owner is not significantly higher. We can also notice this in the right-hand panel of figure 3, where the area with high dwelling values actually has average or even lower monthly costs for renter households.

The second interesting difference is between owner households and renter households. The monthly cost distributions tend to be similar; however, in the northeastern area, owner households have an unusually high monthly cost.
From a Google search, the northeastern area is agricultural, where owners might spend more to maintain farm houses that serve as more than just living space.

2.3 Difference between large cities and elsewhere

# Number of renter and owner households
household <- get_census(dataset = "CA21", regions = regions.list,
                        vectors = c("status.Renter"="v_CA21_4239",
                                    "status.Owner"="v_CA21_4238",
                                    "total.house"="v_CA21_4237"),
                        level = "CSD", labels = 'short')
household <- household %>% slice_max(total.house, n=5)

ggplot(household, aes(x=total.house, y=fct_reorder(`Region Name`, total.house))) +
  geom_col(fill="cornflowerblue") +
  theme_minimal() +
  theme(axis.title.y = element_blank()) +
  xlab("Number of Households") +
  ggtitle("Districts with Top 5 Number of Household in Alberta")

While looking at the household distribution, I realized that households are distributed very unevenly within Alberta. The figure above shows the districts with the top 5 numbers of households in Alberta. As shown in the figure, the numbers of households in Calgary and Edmonton are far larger than in the rest of the province. Calgary has around 500,000 households, with about 400,000 in Edmonton, while the third-ranked district has fewer than 50,000, i.e. less than a tenth of Calgary's count. This means the majority of households are concentrated in merely 2 districts in the province.

This leads to the last question: does the housing situation differ greatly between these 2 districts (Calgary and Edmonton) and elsewhere in Alberta?

#suitable housing data
vec_suit <- list_census_vectors("CA21") %>%
  filter(vector == "v_CA21_4293") %>%
  child_census_vectors(TRUE, keep_parent = TRUE)
accept <- get_census(dataset = "CA21", regions = regions.list,
                     vectors = vec_suit$vector, level = "CSD", labels = "short")
regions.main <- list_census_regions(dataset, use_cache = FALSE) %>%
  filter(level == "CSD", name %in% c("Calgary", "Edmonton")) %>%
  as_census_region_list

##calgary & edmonton vs elsewhere
rate <- accept %>%
  mutate(urban = case_when(
    GeoUID %in% regions.main$CSD ~ "Calgary and Edmonton",
    TRUE ~ "Elsewhere"))

## no identified problems
rest.select <- rate %>% select(contains("v_CA21"), -v_CA21_4293)
rate$rest <- rowSums(rest.select)
rate <- rate %>% mutate(v_CA21_4293=Households-rest)

##factor order
categories <- vec_suit %>% pull("vector")
cat_list <- factor(categories, ordered = TRUE)
cat_name <- vec_suit %>% pull("label")
cat_name[1]="No identified problems"

##change column names to human readable
names(rate)[grepl("v_", names(rate))] <- cat_name

##pivot_longer
plot_data <- rate %>%
  pivot_longer(cols = cat_name[1]:cat_name[8],
               names_to = "Categories", values_to = "Count")
plot_data$Categories <- factor(plot_data$Categories, levels = cat_name, ordered = TRUE)
plot_data <- plot_data %>%
  group_by(Categories, urban) %>%
  summarise(Count=mean(Count, na.rm = TRUE))

# Make plots wider
#knitr::opts_chunk$set(fig.width=8, fig.height=6)
ggplot(plot_data, aes(x="", Count, group=Categories, fill=Categories)) +
  geom_bar(stat = "identity", position="fill", width = 1) +
  coord_polar("y", start = 0) +
  facet_wrap(~urban) +
  theme_void() +
  theme(legend.position = "bottom",
        axis.title = element_blank(),
        strip.text.x = element_text(size=15),
        legend.text = element_text(size=10),
        plot.title = element_text(size=20)) +
  guides(fill=guide_legend(nrow = 4)) +
  ggtitle("Housing problems: ") +
  scale_fill_brewer(palette = "Set2")

When assessing the housing situation, 3 criteria are used:

- whether people spend 30% or more of their income on shelter costs; if so, the cost of housing is considered too high
- whether there is more than one person per room in the dwelling; if so, people are living in overcrowded dwellings and the housing is considered "not suitable"
- whether the dwelling needs major repairs

The pie charts above show the housing problems in the 2 large cities (Calgary and Edmonton) versus elsewhere in Alberta. Overall, a larger percentage of households in Calgary and Edmonton have identified problems. Specifically, more people spend more than 30% of their income on housing, and more people live in "not suitable" housing. In these 2 cities, people suffer from the high cost of housing: they either spend more than 30% of their income on it or share a room with others to reduce the cost. Elsewhere in Alberta, the housing situation is slightly better, but more households need major repairs, which indicates that outside Calgary and Edmonton a larger percentage of households are not in good condition and need repairs that the people living in them cannot afford.

Comments on figures

All the data downloading and pre-processing code can be found in the first 2 chunks of the .rmd file, while figures are shown in separate chunks aligned with the analysis paragraphs.

There are 3 types of figures used in the analysis:

The spatial distribution of a given value: I used this to show the distribution of dwelling values and of the monthly cost people spend on housing. This type of figure shows how the value/cost changes in space and can lead to interesting findings, such as the ring of high dwelling values around Calgary. When assessing how the price of housing differs by area, this type of figure is useful. In my figures, I selected the colour palette so that higher prices stand out in bright yellow while lower prices are in deep blue, which highlights the areas with higher prices. I also changed the legend scale to dollar signs, which makes more sense for the questions analysed.
The ranking of districts by value: I used geom_point and geom_col to generate the 2 figures about the highest dwelling values and the largest numbers of households. When there are many observations to show, e.g. the dwelling values by district, geom_point makes the figure more readable; when there are just 5 values to show, i.e. the number of households, geom_col really makes the tallest bars stand out and attracts the reader's attention. In both figures, I reordered the data from high to low, which increases readability. Besides, this type of figure requires the reader to refer to the x-axis to find the exact value, so I used theme_minimal to keep light grid lines that help indicate the exact value on the x-axis.

Pie chart, which I used to show the percentage of housing problems in the different areas. ggplot does not actually have a pie-chart function, so I adapted geom_bar by adjusting the function and reshaping the data. I selected the colour palette to clearly distinguish the different types of problems. (For some reason the figure displays perfectly in the Rmd but does not show fully in the knitted html file.)


Post: Earthquakes data analysis demo (http://maanqi4.github.io/posts/earthquakes/)

This is a course assignment for analyzing and visualizing earthquake data.

1. Read the data in and clean it for analysis, using the readr package functions for reading and parsing data. [5 marks]

My answer is written here and explains what I did and why.

#code
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.0
## ✔ tibble  3.1.8     ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

#read data
data <- read_csv("https://jjvenky.github.io/GG606AW23/database.csv",
                 col_types = cols(Date = col_date(format = "%m/%d/%Y"),
                                  Time = col_time(format = "%H:%M:%S")))
#problems(data)
#data[problems(data)$row,]
#as.POSIXct(problems(data)$actual, "%Y-%m-%dT%H:%M:%S", tz="UTC")

#check data and omit N/A
if(any(is.na(data$Date))){
  data <- filter(data, !is.na(data$Date))
}
data
## # A tibble: 23,409 × 21
##    Date       Time     Latitude Longitude Type     Depth Depth…¹ Depth…² Magni…³
##    <date>     <time>      <dbl>     <dbl> <chr>    <dbl>   <dbl>   <dbl>   <dbl>
##  1 1965-01-02 13:44:18    19.2      146.  Earthqu…  132.      NA      NA     6
##  2 1965-01-04 11:29:49     1.86     127.  Earthqu…   80       NA      NA     5.8
##  3 1965-01-05 18:05:58   -20.6     -174.  Earthqu…   20       NA      NA     6.2
##  4 1965-01-08 18:49:43   -59.1      -23.6 Earthqu…   15       NA      NA     5.8
##  5 1965-01-09 13:32:50    11.9      126.  Earthqu…   15       NA      NA     5.8
##  6 1965-01-10 13:36:32   -13.4      167.  Earthqu…   35       NA      NA     6.7
##  7 1965-01-12 13:32:25    27.4       87.9 Earthqu…   20       NA      NA     5.9
##  8 1965-01-15 23:17:42   -13.3      166.  Earthqu…   35       NA      NA     6
##  9 1965-01-16 11:32:37   -56.5      -27.0 Earthqu…   95       NA      NA     6
## 10 1965-01-17 10:43:17   -24.6      178.  Earthqu…  565       NA      NA     5.8
## # … with 23,399 more rows, 12 more variables: `Magnitude Type` <chr>,
## #   `Magnitude Error` <dbl>, `Magnitude Seismic Stations` <dbl>,
## #   `Azimuthal Gap` <dbl>, `Horizontal Distance` <dbl>,
## #   `Horizontal Error` <dbl>, `Root Mean Square` <dbl>, ID <chr>, Source <chr>,
## #   `Location Source` <chr>, `Magnitude Source` <chr>, Status <chr>, and
## #   abbreviated variable names ¹`Depth Error`, ²`Depth Seismic Stations`, ³Magnitude

Here col_types in the read_csv function is defined with a format matching the format of Date and Time in the .csv file.

Note: there are 3 rows in which the Date and Time were written in a different format from the rest of the dataset. Although it is possible to convert them to an accessible format, for example using as.POSIXct(problems(data)$actual, "%Y-%m-%dT%H:%M:%S", tz="UTC"), the current method already returns the correct Date, the following questions do not need the exact time of the earthquakes, and I am not sure about the timezone of the dates in those 3 rows. So I simply dropped those 3 rows, as this warning does not affect the following analysis.

2. Did more earthquakes happen on weekends or weekdays? [5 marks]

#code
#select Earthquake
Earthquake <- subset(data, Type == "Earthquake")
#convert Date to day of the week
Earthquake$wday <- weekdays(Earthquake$Date)
#reorder days of the week (so the figure's legend is in a sensible order)
Earthquake$wday <- factor(Earthquake$wday,
                          levels = c("Monday","Tuesday","Wednesday","Thursday",
                                     "Friday","Saturday","Sunday"))
#separate weekdays and weekends
Earthquake <- mutate(Earthquake,
                     wday_ID = case_when(wday == "Sunday" ~ "Weekends",
                                         wday == "Saturday" ~ "Weekends",
                                         TRUE ~ "Weekdays"))
#check dataset before plotting
#Earthquake
#plot
ggplot(data = Earthquake) +
  geom_bar(aes(wday_ID, fill=wday), alpha=1) +
  theme_classic() +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Amount of Earthquake happened on Weekdays and Weekends",
       x="", fill="Day in week")

The above analysis shows the number of earthquakes that happened on weekdays and weekends in the dataset. Significantly more earthquakes happened on weekdays; however, in terms of earthquakes per day there is no preference between weekdays and weekends. Five days of the week count as weekdays while only 2 are weekend days, so if earthquakes are distributed equally across days, the totals will simply be higher for weekdays because there are more of them.
3. Has there been any change in the frequency of earthquakes? [5 marks]

#code
#create new variable Year
Earthquake <- mutate(Earthquake, Year = format(Date, "%Y"))
#(for setting figure axes)
#find the min/max Year and the maximum yearly count of earthquakes
minyear <- min(Earthquake$Year)
maxyear <- max(Earthquake$Year)
counteq <- Earthquake %>% count(Earthquake$Year)
maxeq <- max(counteq$n)
maxy <- ceiling(maxeq/100)*100
#plot
ggplot(data = Earthquake) +
  geom_bar(aes(Year), fill="cornflowerblue") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +
  labs(title = paste0("Amount of earthquake each year (", minyear, "-", maxyear, ")")) +
  scale_x_discrete(breaks = seq(minyear,maxyear,by=5)) +
  scale_y_continuous(breaks = seq(0,maxy,by=100), expand = c(0,0)) +
  coord_cartesian(ylim = c(0,maxy))

The above analysis shows the trend in earthquakes from 1965 to 2016. The number of earthquakes per year shows a steady increase over the given time period. Therefore, we conclude that the yearly frequency of earthquakes increased steadily between 1965 and 2016.

4. Where were there more earthquakes in the 1980s, South America or North America? [5 marks]

#code
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
##     map

#select earthquakes in the 1980s
eq.1980s <- filter(Earthquake, Year >= 1980 & Year <= 1989)
#eq.1980s
#First, plot a scatterplot map with all the earthquakes in the 1980s
world_map <- map_data("world")
ggplot(NULL) +
  geom_polygon(data = world_map, aes(x=long,y=lat,group=group), fill="azure3") +
  geom_point(data = eq.1980s, aes(x=Longitude, y=Latitude), color = "cornflowerblue", size=.5) +
  theme_void() +
  ggtitle("Earthquakes in 1980s")

## Notice a lot of earthquakes are located in the ocean, some on land
##define South/North America
##only count earthquakes that happened on land
##convert (lat,lon) to continent (point-in-polygon question)
##https://stackoverflow.com/questions/21708488/get-country-and-continent-from-longitude-and-latitude-point-in-r
library(sp)
library(rworldmap)
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')
latlon2Con <- function(lat,lon){
  sPDF <- getMap()
  points = data.frame(lon=lon,lat=lat)
  pointsSP = SpatialPoints(points, proj4string = CRS(proj4string(sPDF)))
  indices = over(pointsSP, sPDF)
  return(indices$REGION)
  #return(indices$ADMIN)
}
eq.1980s.land <- mutate(eq.1980s, Continent = latlon2Con(eq.1980s$Latitude,eq.1980s$Longitude))
eq.1980s.land <- filter(eq.1980s.land, !is.na(Continent))
ggplot(data = eq.1980s.land) +
  geom_bar(aes(Continent), fill="cornflowerblue") +
  theme_classic() +
  ggtitle("Earthquakes in 1980s (on land)")
##2. consider earthquakes both on land and in the ocean
##if an earthquake is located in the ocean, find the nearest continent
##calculate centroids of each country
#library(sf)
#world_map_sf <- st_as_sf(world_map, coords = c("long","lat"), crs=4326)
#crs = 4326 means lat lon are interpreted as WGS 84 coordinates
#world_cens <- st_centroid(world_map_sf)
#create index to speed up the nearest function
#grid_index <- st_make_grid(world_cens)
#find nearest centroids
#point <- data.frame(lon=100,lat=23)
#point_sf <- st_as_sf(point, coords = c("lon","lat"), crs=4326, agr="constant")
#nearest_points <- st_nearest_points(world_cens,point_sf,index=grid_index)
#still takes too much time and memory, ignore this method!
#nearest_points

The above analysis shows the geospatial distribution of earthquakes in the 1980s and the number of earthquakes on each continent.

First, I show the raw distribution of earthquakes in the 1980s. The earthquakes mostly took place in the major earthquake zones; some happened on land, but a large number happened in ocean areas. When identifying the origin location (continent) of earthquakes, I excluded those in the ocean.

Ideally, there would be 2 algorithms for identifying the origin location of the earthquakes: point-in-polygon and finding nearest points. After experimenting, I realized it is not feasible to compute nearest points for such a large dataset on my laptop. Here I adapted the function from the rworldmap package and this Stack Overflow solution. The latlon2Con function in the code determines whether a location (indexed by latitude and longitude) falls inside a certain country's polygon and returns the continent that country belongs to; for earthquakes in the ocean it returns an N/A value.

With the above function, we can assign a Continent value to each earthquake on land in the 1980s and plot a bar chart of the number of earthquakes on each continent. The bar plot shows that there were more earthquakes in South America than in North America in the 1980s.

5. Has there been any geographic shift in the distribution of earthquakes? [10 marks]

#code
eq.land <- mutate(Earthquake, Continent = latlon2Con(Earthquake$Latitude,Earthquake$Longitude))
eq.land <- filter(eq.land, !is.na(Continent))
eq.land.num <- eq.land %>% count(Year, Continent) %>% group_by(Continent)
ggplot(eq.land.num, aes(x=Year, y=Continent, fill=n)) +
  geom_tile() +
  theme_classic() +
  labs(title = paste0("Earthquakes on land (", minyear, "-", maxyear, ")"), fill="Num") +
  scale_x_discrete(breaks = seq(minyear,maxyear,by=5)) +
  scale_y_discrete(expand = c(0,0)) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1)

The above analysis shows the number of earthquakes on each continent from 1965 to 2016. As before, I only considered earthquakes that happened on land, using the same function as in the previous question. According to the figure, there is no major shift in the location (continent of origin) of earthquakes over the given time period. Asia and South America consistently have more earthquakes than the other continents.
6. Comment on how lessons from Wilke's Fundamentals of Data Visualization were applied to each figure, with specific reference to book sections [5 marks]

Overall, I tried to produce clean and readable figures, with clear coordinates and readable titles, legends and axes.

For the figure illustrating the number of earthquakes on weekdays and weekends: this is a figure for visualizing amounts, so I chose stacked bar plots. In this way, the figure displays the number of earthquakes on weekdays and weekends while also comparing the earthquakes on each day, showing more information at the same time. I also assigned an order to the weekday column so that it is displayed in a more readable way (i.e. following the order of the days of the week). Reference: Section 6.2

For the third question, I displayed the number of earthquakes each year to visualize how the frequency of earthquakes changed over the given time period. Similar to the second question, I again visualized amounts, but with more data points. Another option would have been visualizing the trend with a smoothed line plot; however, I wanted readers to have a clearer sense of the specific numbers without omitting too much information, so I still used a bar plot. This time, to give a clearer reference for the numbers, I switched to theme_minimal() to display light grid lines on the figure. This design lets readers locate the value of each bar more easily while still maintaining a clean visual style. Reference: Section 25, Section 28.3

For the 4th question, I used 2 figures:

Global distribution of earthquakes in the 1980s. Although it is not directly asked for, for questions involving further analysis I prefer to show the raw data first. This gives the reader (and myself) a more general idea of the data and avoids misunderstandings or mistakes later in the analysis. Here, I displayed the raw distribution of earthquakes globally, showing where earthquakes usually take place, which leads to the further point that many earthquakes happen in ocean areas while I only consider earthquakes on land in the following analysis. This also pre-empts questions from readers, e.g. why the number of earthquakes counted in the next figure is much smaller than the numbers shown in the 2nd and 3rd questions. Here I used theme_void() to display only the global landmass and the earthquake locations. The colours of the landmass and the earthquakes were chosen carefully so that earthquakes are clearly distinguishable from the landmass. It might be better to use a Robinson projection so that high-latitude areas are less distorted, but I had problems finding this projection in coord_map(), and since we are not focused on high latitudes in this question I used the default settings. Reference: Section 4.1

Number of earthquakes on each continent. This again requires displaying "amounts", so I used simple bar plots to keep the figure clear.

For the 5th question, I used a heatmap to illustrate how the number of earthquakes changes on each continent in each year. If there were a geographic shift, say from one continent to another, it would be easily spotted on the heatmap. Besides, I selected a colour palette so that the fill colours represent the number values in a more customized way. Reference: Section 6.3, Section 4.2
Read the data in and clean it for analysis, using the readr package functions for reading and parsing data. [5 marks] My answer is written here and explains what I did and why.\n#code library(tidyverse) ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── ## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0 ## ✔ tibble 3.","title":"Earthquakes data analysis demo"},{"content":"1. Main Settings Experiment name: f.e12.FAMIPCN.f19_f19\n f F compset;\n“F” compsets use CAM, CLM, CICE (prescribed thermodynamics), and DOCN (prescribed SST). e12 CESM 1.2.2 (model version) FAMIPCN AMIP run for the CMIP5 protocol with CLM/CN:\nAMIP_CAM4_CLM40%CN_CICE%PRES_DOCN%DOM_RTM_SGLC_SWAV\nAMIP: time period, AMIP run\nCAM4: atmosphere model, CAM4\nCLM40: land model, CLM4.0\nCN: carbon-nitrogen model version, a biogeochemistry model that simulates the carbon and nitrogen cycles\nCICE%PRES: prescribed sea ice\nDOCN%DOM: data ocean in data-ocean (DOM) mode\nRTM: river transport model, land river runoff\nSGLC: land ice, stub glacier (land ice) component\nSWAV: wave, stub wave component f19_f19 1.9x2.5_1.9x2.5\natm_grid, lnd_grid, ice_grid, ocn_grid: 1.9x2.5\natm_grid_type, ocn_grid_type: finite volume 1.1 output frequency 1.2 enable the -cosp option In user_nl_cam:\n&cospsimulator_nl docosp = .true. cosp_amwg = .true. cosp_cfmip_da = .true. #daily simulator 2. Long Simulations 3. Short Simulations 3.1 Initial Conditions 3.1.1 generate initial conditions Base case: f.e12.FAMIPCN.f19_f19.short.Parent Modify env_run.xml:\n./xmlchange RUN_STARTDATE=1978-10-01 ./xmlchange STOP_OPTION=nmonths ./xmlchange STOP_N=3 ./xmlchange RESUBMIT=3 Modify user_nl_cam:\ninithist = 'MONTHLY' inithist_all = .true. Also remember to set the relevant perturbed parameters in user_nl_cam:\ncldfrc_rhminl = 0.91 cldopt_rliqocean = 14 hkconv_cmftau = 1800 cldfrc_rhminh = 0.80 zmconv_tau = 3600 After the initial run plus 3 resubmits (4 consecutive 3-month segments), $ARCHIVE/rest will contain 4 directories with initial/restart files for 1979-01-01, 1979-04-01, 1979-07-01, and 1979-10-01; a sketch of the end-to-end command sequence is given below, followed by an example file list.
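\nFor reference, the following is a minimal sketch of the standard CESM 1.2 command sequence these settings fit into. The machine name is a placeholder, and the actual runs here were driven by the csh.init.* scripts listed in the next section, so treat this only as an outline of the steps rather than the exact workflow used.\n#create the parent case (machine name is a placeholder)\n./create_newcase -case f.e12.FAMIPCN.f19_f19.short.Parent -compset FAMIPCN -res f19_f19 -mach yourmachine\ncd f.e12.FAMIPCN.f19_f19.short.Parent\n#apply the env_run.xml changes listed above\n./xmlchange RUN_STARTDATE=1978-10-01\n./xmlchange STOP_OPTION=nmonths\n./xmlchange STOP_N=3\n./xmlchange RESUBMIT=3\n#edit user_nl_cam (inithist, inithist_all and the perturbed parameters above), then set up, build and submit\n./cesm_setup\n./f.e12.FAMIPCN.f19_f19.short.Parent.build\n./f.e12.FAMIPCN.f19_f19.short.Parent.submit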
Example file list in 1979-01-01:\nf.e12.FAMIPCN.f19_f19.short.Parent.cam.h0.1978-12.nc f.e12.FAMIPCN.f19_f19.short.Parent.cam.h1.1978-10-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.cam.h2.1978-12-30-10800.nc f.e12.FAMIPCN.f19_f19.short.Parent.cam.i.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.cam.r.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.cam.rh0.1979-10-31-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.cam.rs.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.cice.r.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.clm2.h0.1978-12.nc f.e12.FAMIPCN.f19_f19.short.Parent.clm2.r.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.clm2.rh0.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.cpl.r.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.docn.rs1.1979-01-01-00000.bin f.e12.FAMIPCN.f19_f19.short.Parent.rtm.h0.1978-12.nc f.e12.FAMIPCN.f19_f19.short.Parent.rtm.r.1979-01-01-00000.nc f.e12.FAMIPCN.f19_f19.short.Parent.rtm.rh0.1979-01-01-00000.nc rpointer.atm rpointer.drv rpointer.ice rpointer.lnd rpointer.ocn rpointer.rof scripts: csh.init.diff.niagara generates initial conditions for each parameter setting (modify and submit 50 cases: f.e12.FAMIPCN.f19_f19.init.diff.*)\n3.2 start initial runs (same initial conditions for each parameter setting) scripts:\n bash init.short.parent.niagara to generate the parent run files [change runstartdate] sbatch csh.init.short.child.niagara to run the 50 ensemble members for a given start date [change runstartdate]: f.e12.FAMIPCN.f19_f19.short.op.$i.$runstartdate manually run the 4 parent runs in f.e12.FAMIPCN.f19_f19.init.op.$runstartdate generated by init.short.parent.niagara [manually modify the parameters in user_nl_cam] Point to the initial files in user_nl_cam (use the initial files generated in Section 3.1):\nncdata = 'f.e12.FAMIPCN.f19_f19.short.Parent.cam.i.1979-01-01-00000.nc' #initial file for CAM By default this file should be soft-linked into the $RUN directory; setting RUN_REFCASE and GET_REFCASE=TRUE ensures this. One can also give an explicit path to the initial file, but if the initial-condition files are not in the $RUN directory, make sure the other components' restart (.r.) files are also set correctly to avoid errors. I prefer to point all initial files to $RUN for simpler operation. Modify env_run.xml:\n./xmlchange RUN_TYPE=hybrid #default is hybrid for FAMIPCN, but set it explicitly anyway ./xmlchange RUN_STARTDATE=1979-01-01 ./xmlchange RUN_REFCASE=f.e12.FAMIPCN.f19_f19.short.Parent #which REFCASE to look for in /inputdata/ccsm_init ./xmlchange RUN_REFDATE=1979-01-01 #which date/directory in the REFCASE to use ./xmlchange GET_REFCASE=TRUE #soft-link the initial files into the $RUN directory One can have a case start from 1979-01-01 while using the initial conditions from 1979-04-01 by setting RUN_STARTDATE=1979-01-01 and RUN_REFDATE=1979-04-01 (provided a REFCASE restart set dated 1979-04-01 exists). Then build a new case or rebuild the original case.\n3.3 initial runs with different initial conditions for each parameter setting scripts: csh.init.diff.short.child.parent\n","permalink":"http://maanqi4.github.io/posts/short-long-experiment-outline/","summary":"1. Main Settings Experiment name: f.e12.FAMIPCN.f19_f19\n f F compset;\n“F” compsets use CAM, CLM, CICE (prescribed thermodynamics), DOCN (prescribed SST).
e12 cesm1.2.2 (model version) FAMIPCN AMIP run for CMIP5 protocol with CLM/CN:\nAMIP_CAM4_CLM40%CN_CICE%PRES_DOCN%DOM_RTM_SGLC_SWAV\nAMIP: time, AMIP runs\nCAM4: atmosphere model CAM4\nCLM40: land model, clm4.0\nCN (carbon-nitrogen) model version: a biogeochemistry model that simulates the carbon and nitrogen cycles\nCICE%PRES: prescribed sea ice\nDOCN%DOM: DOCN data ocean mode\nRTM: river transport model, land river runoff","title":"Generate Short/Long Perturbed Parameter Ensembles (PPE) with CESM"}]