ghcnd_read expects the wrong file format for .dly files. #383

Open
jonathan-g opened this issue Jan 29, 2021 · 3 comments

@jonathan-g

Bug description

ghcnd_read fails with an error because it expects a .dly file to be a .csv file, but it's a fixed-width file with no delimiters between columns.

Reprex

library(rnoaa)
download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/USW00013897.dly", "USW00013897.dly")
ghcnd_read("USW00013897.dly")
#> Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
#> cols = 1 != length(data) = 128
#> Error: Columns 2, 3, 4, 5, 6, and 122 more must be named.

Created on 2021-01-29 by the reprex package (v0.3.0)

This is what the first several lines of `USW00013897.dly` look like:

USC00111577192802TMAX-9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999      39  0-9999   -9999   
USC00111577192802TMIN-9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999     -28  0-9999   -9999   
USC00111577192802PRCP-9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999       0T 0-9999   -9999   

And this is the file format, as described in the readme-1.txt at the FTP site ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt:

III. FORMAT OF DATA FILES (".dly" FILES)

Each ".dly" file contains data for one station.  The name of the file
corresponds to a station's identification code.  For example, "USC00026481.dly"
contains the data for the station with the identification code USC00026481).

Each record in a file contains one month of daily data.  The variables on each
line include the following:

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
YEAR         12-15   Integer
MONTH        16-17   Integer
ELEMENT      18-21   Character
VALUE1       22-26   Integer
MFLAG1       27-27   Character
QFLAG1       28-28   Character
SFLAG1       29-29   Character
VALUE2       30-34   Integer
MFLAG2       35-35   Character
QFLAG2       36-36   Character
SFLAG2       37-37   Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------
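
The layout above can be turned into a small base-R reader. This is only a sketch under stated assumptions, not what rnoaa does internally: `read_dly_fwf`, `dly_widths`, and `dly_names` are hypothetical names, and the column names are chosen here for illustration. It slices each record at the documented column positions, which yields 4 + 31 * 4 = 128 columns, the same count the warning in the reprex reports.

```r
# Column widths from the readme table: ID (11), YEAR (4), MONTH (2),
# ELEMENT (4), then 31 repeats of VALUE (5) + MFLAG, QFLAG, SFLAG (1 each).
dly_widths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), times = 31))
dly_names <- c(
  "id", "year", "month", "element",
  paste0(rep(c("VALUE", "MFLAG", "QFLAG", "SFLAG"), times = 31),
         rep(1:31, each = 4))
)

# Hypothetical helper (not part of rnoaa): read one raw .dly file.
read_dly_fwf <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # drop empty lines, if any

  # Start/end character positions of every field (1-based, inclusive).
  ends <- cumsum(dly_widths)
  starts <- ends - dly_widths + 1

  # Cut every line at the same positions, one character vector per column.
  fields <- lapply(seq_along(dly_widths), function(i) {
    substring(lines, starts[i], ends[i])
  })
  names(fields) <- dly_names
  dat <- as.data.frame(fields, stringsAsFactors = FALSE)

  # YEAR, MONTH and VALUE1..VALUE31 are integers per the readme;
  # the flag columns stay as single-character strings.
  int_cols <- c("year", "month", grep("^VALUE", dly_names, value = TRUE))
  dat[int_cols] <- lapply(dat[int_cols], function(x) as.integer(trimws(x)))
  dat
}
```

Values of -9999 are left as they are in this sketch; converting them to NA would be a separate step.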

Session Info

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rnoaa_1.3.0          revealjg_0.9.9006    rprojroot_2.0.2      reticulate_1.18      lubridate_1.7.9.2    forcats_0.5.0       
 [7] stringr_1.4.0        dplyr_1.0.3          purrr_0.3.4          readr_1.4.0          tidyr_1.1.2          tibble_3.0.5        
[13] ggplot2_3.3.3        tidyverse_1.3.0      yaml_2.2.1           rmarkdown_2.6.6.9000 knitr_1.30           pacman_0.5.1        

loaded via a namespace (and not attached):
 [1] fs_1.5.0           usethis_2.0.0      RColorBrewer_1.1-2 httr_1.4.2         tools_4.0.3        backports_1.2.1   
 [7] bslib_0.2.3.9000   utf8_1.1.4         R6_2.5.0           DBI_1.1.1          colorspace_2.0-0   withr_2.4.0       
[13] tidyselect_1.1.0   gridExtra_2.3      processx_3.4.5     curl_4.3           compiler_4.0.3     cli_2.2.0         
[19] rvest_0.3.6        xml2_1.3.2         triebeard_0.3.0    sass_0.3.0.9000    scales_1.1.1       callr_3.5.1       
[25] askpass_1.1        rappdirs_0.3.1     digest_0.6.27      pkgconfig_2.0.3    htmltools_0.5.1.1  dbplyr_2.0.0      
[31] rlang_0.4.10       readxl_1.3.1       rstudioapi_0.13    httpcode_0.3.0     jquerylib_0.1.3    generics_0.1.0    
[37] jsonlite_1.7.2     magrittr_2.0.1     credentials_1.3.0  Matrix_1.3-2       Rcpp_1.0.6         munsell_0.5.0     
[43] fansi_0.4.2        clipr_0.7.1        lifecycle_0.2.0    stringi_1.5.3      whisker_0.4        grid_4.0.3        
[49] crayon_1.3.4       slider_0.1.5       lattice_0.20-41    haven_2.3.1        hms_1.0.0          sys_3.4           
[55] ps_1.5.0           pillar_1.4.7       crul_1.0.0         reprex_0.3.0       XML_3.99-0.5       glue_1.4.2        
[61] evaluate_0.14      hoardr_0.5.2       data.table_1.13.6  remotes_2.2.0      modelr_0.1.8       vctrs_0.3.6       
[67] urltools_1.7.3     cellranger_1.1.0   gtable_0.3.0       openssl_1.4.3      assertthat_0.2.1   xfun_0.20         
[73] broom_0.7.3.9000   warp_0.2.0         ellipsis_0.3.1     here_1.0.1        
@sckott sckott added this to the v1.4 milestone Jan 29, 2021
@sckott
Contributor

sckott commented Jan 29, 2021

thanks for catching that!

up for helping out by sending a PR?

I added the ghcnd_read function as an afterthought, without thinking it through all the way. When we download the files with ghcnd() they are in the fixed-width format you describe, but before writing them to disk we process them to make them more digestible (see https://github.com/ropensci/rnoaa/blob/master/R/ghcnd.R#L202-L223), and only THEN are they written to disk in comma-separated format.

So ideally we change ghcnd_read() to read both a file straight from NOAA in fixed-width format AND a file in comma-separated format (i.e., one written by a call to ghcnd()). Sound good?

So, I think we:

  • factor out the code in ghcnd_GET that processes a fwf file into a function, e.g., process_fwf
  • use process_fwf in ghcnd_GET to replace the code just factored out
  • use process_fwf in ghcnd_read if the file is fwf, or simply read it as comma-separated if it is already a csv
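
The last step above could be sketched roughly like this. Everything here is hypothetical: `looks_like_fwf` and `read_ghcnd_file` are made-up names, `process_fwf` is passed in as an argument because it does not exist yet, and the detection rule rests on the assumption that raw .dly records never contain a comma while the processed files always do.

```r
# Hypothetical helper: guess whether a file is NOAA's raw fixed-width layout.
# Assumption: a raw .dly record has no commas, whereas the processed file
# written to disk is comma-separated.
looks_like_fwf <- function(path) {
  first <- readLines(path, n = 1L)
  length(first) == 1L && !grepl(",", first, fixed = TRUE)
}

# Hypothetical wrapper showing the dispatch. `process_fwf` stands in for the
# function proposed above and is supplied by the caller.
read_ghcnd_file <- function(path, process_fwf) {
  if (looks_like_fwf(path)) {
    process_fwf(path)                               # raw file from NOAA
  } else {
    utils::read.csv(path, stringsAsFactors = FALSE) # already processed
  }
}
```

Checking only the first line keeps the test cheap, but it would misclassify an empty file as csv, so a real implementation would probably want an explicit error for that case.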

@jonathan-g
Author

I will be happy to submit a PR to fix this. It may take me a little while to get to it, as long as you're not in a hurry.

@sckott
Contributor

sckott commented Feb 1, 2021

Great, not in a hurry unless CRAN maintainers get in touch about any failures, e.g. #382

@sckott sckott modified the milestones: v1.4, v2.0 May 13, 2021
@sckott sckott removed this from the v2.0 milestone Jun 4, 2021