
Creating a dataset

Joaquin Bedia edited this page Mar 3, 2015 · 12 revisions

Definition of a dataset

Besides netCDF, downscaleR can handle many other binary file formats, such as HDF, GRIB and NEXRAD, and makes the data accessible through a standard API thanks to the NetCDF-Java / CDM library. It can also read remote datasets through OPeNDAP and other remote access protocols. As we shall see, NcML also makes it possible to create virtual datasets by modifying and aggregating other datasets.

As a representative example in climate applications, we will show how to access a collection of netCDF files containing a number of climate variables by means of an NcML file. The NetCDF Markup Language (NcML) is an XML dialect for defining datasets. An NcML document is an XML document that uses NcML and defines a dataset. The purpose of NcML is to allow:

  1. Metadata to be added, deleted, and changed.
  2. Variables to be renamed, added, deleted and restructured.
  3. Data from multiple CDM files to be combined (a.k.a. "aggregated").

In downscaleR, the NcML representation of a dataset is often referred to simply as a "dataset".

Extended examples of how to create and modify NcML datasets can be found in the NcML tutorial and the NcML cookbook.
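To make item 3 of the list above (aggregation) more concrete, the following base-R sketch writes a minimal NcML "union" aggregation by hand. It is only an illustration: `build_union_ncml` is a hypothetical helper, not a downscaleR function, and the file names are placeholders. Paths containing XML special characters (`&`, `<`, `"`) would need escaping, which this sketch does not do.

```r
# Sketch only: builds a minimal NcML "union" aggregation with base R.
# build_union_ncml() is a hypothetical helper, not part of downscaleR.
build_union_ncml <- function(files) {
  c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">',
    '  <aggregation type="union">',
    # One <netcdf> element per input file
    sprintf('    <netcdf location="%s"/>', files),
    '  </aggregation>',
    '</netcdf>'
  )
}

# Placeholder file names; replace them with real netCDF paths
files <- c("NCEP_2T.nc", "NCEP_pr.nc")
ncml <- build_union_ncml(files)

# Write the document to a temporary file and print it back
ncml_path <- tempfile(fileext = ".ncml")
writeLines(ncml, ncml_path)
cat(readLines(ncml_path), sep = "\n")
```

In practice there is no need to write this file manually, since downscaleR generates it, as described in the next section.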

Creation of a dataset using downscaleR

Climate data (reanalysis, GCM, gridded observations) are often stored as collections of netCDF files in a directory, with or without subdirectories. The directory may contain just one file per variable (as in this example), or follow a more complex layout, such as several files per variable partitioned by time period, or one subdirectory per variable. All these situations are handled automatically by the function makeAggregatedDataset.
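The layouts described above can be explored with base R before building a dataset. The sketch below creates a throwaway directory tree (one subdirectory per variable, with one variable split by time period) and lists its netCDF files recursively. The directory and file names are invented for illustration, and the code is not meant to describe how makeAggregatedDataset works internally.

```r
# Build a throwaway directory tree with one subdirectory per variable
# (file names and layout are made up for illustration)
root <- file.path(tempdir(), "nc_layout_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "pr"), recursive = TRUE)
dir.create(file.path(root, "T"))
file.create(file.path(root, "pr", c("pr_1961-1985.nc", "pr_1986-2010.nc")))
file.create(file.path(root, "T", "T_1961-2010.nc"))
file.create(file.path(root, "README.txt"))  # not a netCDF file, should be ignored

# Recursive search restricted to file names ending in ".nc"
nc.files <- list.files(root, pattern = "\\.nc$", recursive = TRUE, full.names = TRUE)

# Group the files by the name of their parent directory (here, the variable)
by.var <- split(basename(nc.files), basename(dirname(nc.files)))
print(by.var)
```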

With the downscaleR installation, there is a built-in netCDF dataset that serves as a first example:

library(downscaleR)
dir <- file.path(find.package("downscaleR"), "datasets/reanalysis/Iberia_NCEP/")
list.files(dir, pattern = "\\.nc$")

This returns:

[1] "NCEP_2T.nc"  "NCEP_pr.nc"  "NCEP_Q.nc"   "NCEP_SLP.nc" "NCEP_T.nc"   "NCEP_Z.nc"  

The directory contains 6 netCDF files. To aggregate them into a single dataset, we can just type:

makeAggregatedDataset(source.dir = dir, ncml.file = "/home/user/temp/ncml_test.ncml", verbose = TRUE)
[2014-09-02 14:27:09] Creating dataset from 6 files
[2014-09-02 14:27:10] Scanning file 1 out of 6
[2014-09-02 14:27:10] Scanning file 2 out of 6
[2014-09-02 14:27:10] Scanning file 3 out of 6
[2014-09-02 14:27:10] Scanning file 4 out of 6
[2014-09-02 14:27:10] Scanning file 5 out of 6
[2014-09-02 14:27:10] Scanning file 6 out of 6
[2014-09-02 14:27:11] NcML file "/home/user/temp/ncml_test.ncml" created from 6 files corresponding to 6 variables
Use 'dataInventory' to obtain a description of the dataset

The new dataset has been created at the location given by the ncml.file argument. It is now possible to make an inventory of the dataset, or to load any or all of its variables, as indicated in the section Accessing Gridded Datasets. For instance:

di <- dataInventory(dataset = "/home/user/temp/ncml_test.ncml")
[2014-09-02 14:28:56] Doing inventory ...
[2014-09-02 14:28:56] Retrieving info for '2T' (5 vars remaining)
[2014-09-02 14:28:56] Retrieving info for 'pr' (4 vars remaining)
[2014-09-02 14:28:56] Retrieving info for 'Q' (3 vars remaining)
[2014-09-02 14:28:57] Retrieving info for 'SLP' (2 vars remaining)
[2014-09-02 14:28:57] Retrieving info for 'T' (1 vars remaining)
[2014-09-02 14:28:57] Retrieving info for 'Z' (0 vars remaining)
[2014-09-02 14:28:57] Done.
names(di)
[1] "2T"  "pr"  "Q"   "SLP" "T"   "Z"  
str(di$Q)
List of 4
 $ Description: chr "Specific humidity"
 $ DataType   : chr "float"
 $ Units      : chr "kg kg**-1"
 $ Dimensions :List of 4
  ..$ time :List of 4
  .. ..$ Type      : chr "Time"
  .. ..$ TimeStep  : chr "1.0 days"
  .. ..$ Units     : chr "days since 1950-01-01 00:00:00"
  .. ..$ Date_range: chr "1961-01-01T00:00:00Z - 2010-12-31T00:00:00Z"
  ..$ level:List of 3
  .. ..$ Type  : chr "Pressure"
  .. ..$ Units : chr "Pa"
  .. ..$ Values: num [1:4] 50000 70000 85000 100000
  ..$ lat  :List of 3
  .. ..$ Type  : chr "Lat"
  .. ..$ Units : chr "degrees north"
  .. ..$ Values: num [1:6] 35 37.5 40 42.5 45 47.5
  ..$ lon  :List of 3
  .. ..$ Type  : chr "Lon"
  .. ..$ Units : chr "degrees east"
  .. ..$ Values: num [1:9] -15 -12.5 -10 -7.5 -5 -2.5 0 2.5 5
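The inventory is an ordinary nested list, so its elements can be extracted with `$`. The sketch below rebuilds the `Q` entry by hand from the `str()` output above, so that it runs without downscaleR; with the package loaded, the real object `di$Q` can be used instead of the mock `q.info`.

```r
# Mock of the 'Q' entry, copied from the str() output above
# (with downscaleR loaded, use the real object: q.info <- di$Q)
q.info <- list(
  Description = "Specific humidity",
  DataType = "float",
  Units = "kg kg**-1",
  Dimensions = list(
    time = list(Type = "Time", TimeStep = "1.0 days",
                Units = "days since 1950-01-01 00:00:00",
                Date_range = "1961-01-01T00:00:00Z - 2010-12-31T00:00:00Z"),
    level = list(Type = "Pressure", Units = "Pa",
                 Values = c(50000, 70000, 85000, 100000)),
    lat = list(Type = "Lat", Units = "degrees north",
               Values = seq(35, 47.5, by = 2.5)),
    lon = list(Type = "Lon", Units = "degrees east",
               Values = seq(-15, 5, by = 2.5))
  )
)

# Pressure levels converted from Pa to hPa
levels.hPa <- q.info$Dimensions$level$Values / 100

# Grid size and bounding box
n.lat <- length(q.info$Dimensions$lat$Values)
n.lon <- length(q.info$Dimensions$lon$Values)
lat.range <- range(q.info$Dimensions$lat$Values)
lon.range <- range(q.info$Dimensions$lon$Values)

# Start and end of the time axis, split from the Date_range string
dates <- strsplit(q.info$Dimensions$time$Date_range, " - ", fixed = TRUE)[[1]]
```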

The resulting NcML file looks like this. Note that the file paths will differ from one computer to another.

cat(readLines("/home/user/temp/ncml_test.ncml"), sep = "\n")
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
	<aggregation type="union">
	<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_2T.nc" ncoords="18262"/>
	<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_pr.nc" ncoords="18262"/>
	<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_Q.nc" ncoords="18262"/>
	<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_SLP.nc" ncoords="18262"/>
	<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_T.nc" ncoords="18262"/>
	<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_Z.nc" ncoords="18262"/>
	</aggregation>
</netcdf>
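Because the NcML file stores absolute paths, a dataset copied to another computer may point to files that no longer exist. The base-R sketch below extracts the `location` attributes with a regular expression and flags the missing files. It is a simple text-based check rather than a full XML parser, and it uses a shortened copy of the listing above written to a temporary file; in practice, `readLines()` would be applied directly to the real .ncml file.

```r
# Shortened copy of the listing above, written to a temporary file
ncml.text <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">',
  '<aggregation type="union">',
  '<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_2T.nc" ncoords="18262"/>',
  '<netcdf location="/home/juaco/R/i686-pc-linux-gnu-library/3.1/downscaleR/datasets/reanalysis/Iberia_NCEP/NCEP_pr.nc" ncoords="18262"/>',
  '</aggregation>',
  '</netcdf>'
)
ncml.file <- tempfile(fileext = ".ncml")
writeLines(ncml.text, ncml.file)

# Pull out the value of every location="..." attribute
lines <- readLines(ncml.file)
m <- regmatches(lines, regexpr('location="[^"]*"', lines))
locations <- sub('^location="', "", sub('"$', "", m))

# Referenced files that cannot be found on this machine
not.found <- locations[!file.exists(locations)]
if (length(not.found) > 0) {
  message(length(not.found), " referenced file(s) not found: ",
          paste(basename(not.found), collapse = ", "))
}

# Number of daily steps implied by the Date_range shown in the inventory
n.days <- as.numeric(difftime(as.Date("2010-12-31"), as.Date("1961-01-01"),
                              units = "days")) + 1
```

The value of `n.days` coincides with the `ncoords="18262"` attribute in the listing, which is consistent with a daily time axis running from 1961-01-01 to 2010-12-31.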