
update PBMTL,PGTML,PB0(bads),resample obs #8

Merged · 8 commits · Jun 26, 2020

Conversation

jordansread

Fixes #5 and part of #4.

@jordansread (Author)

@aappling-usgs key addition is here


@aappling-usgs left a comment


Added some thoughts on potential fragilities, but I didn't spot anything that I think would be creating real problems right now.

},
depth_diff = abs(depth - new_depth)) %>%
# after approx(), trash any values at new_depth >= 0.5 m from the nearest observation
filter(depth_diff < 0.5) %>%
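The hunk above is only a fragment of the pipeline. A self-contained sketch of the same idea, using hypothetical toy data (`obs`, `new_depths`, and the nearest-observation `depth_diff` are illustrative names, not the project's actual variables):

```r
library(dplyr)

# hypothetical toy data: temperatures observed at irregular depths
obs <- tibble(depth = c(0.3, 1.1, 2.4), temp = c(20.1, 18.5, 15.2))
new_depths <- seq(0, 2.5, by = 0.5)  # regular resampling grid

approx(obs$depth, obs$temp, xout = new_depths) %>%
  as_tibble() %>%
  rename(new_depth = x, new_temp = y) %>%
  # distance from each interpolated depth to the nearest real observation
  mutate(depth_diff = sapply(new_depth, function(d) min(abs(obs$depth - d)))) %>%
  # drop extrapolation NAs and any interpolated value >= 0.5 m from an observation
  filter(!is.na(new_temp), depth_diff < 0.5)
```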


What do you think about replacing 0.5 with sample_res? It won't matter for this project, but it would help if we ever transfer this code to another project where sample_res is a higher value...
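The suggested change, assuming `sample_res` is the resampling resolution already defined upstream in the pipeline, would look like:

```r
library(dplyr)

# hypothetical resampling resolution, assumed defined upstream
sample_res <- 0.5

tibble(new_depth = c(0, 0.5, 1.0), depth_diff = c(0.1, 0.7, 0.3)) %>%
  filter(depth_diff < sample_res)  # threshold tracks the grid spacing
```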

filter(depth_diff < 0.5) %>%
# only keep one estimate for each new_depth
group_by(new_depth) %>%
filter(depth_diff == min(depth_diff)) %>%


Is there any chance of ending up with two observations with the same new_depth and depth_diff? Lines 190-191 would weed out identical original depths, but they wouldn't catch, say, two observations at 0.4 and 0.6 m that tie for the same new_depth. If there were such duplicates, you would probably notice when writing the observations out in the date x depth matrix format, at least if using pivot_wider... but even so, it might be a good idea to ensure a single unique value here. You could do that by adding a call to slice(1) right after this filter call and before the ungroup.

PS - here's why I think you'd notice when using pivot_wider: it puts a list in each cell if there are any duplicates and you don't specify values_fn:

> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp)
# A tibble: 1 x 5
  `1`       `2`       `3`       `4`       `5`      
  <list>    <list>    <list>    <list>    <list>   
1 <int [1]> <int [1]> <int [3]> <int [2]> <int [1]>
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 
> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp, values_fn=mean)
# A tibble: 1 x 5
    `1`   `2`   `3`   `4`   `5`
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     4   6.5     8
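A sketch of the slice(1) suggestion above, using hypothetical observations at 0.4 and 0.6 m that tie for the same new_depth with equal depth_diff:

```r
library(dplyr)

obs <- tibble(new_depth = c(0.5, 0.5),
              depth     = c(0.4, 0.6),
              temp      = c(18.2, 18.0)) %>%
  mutate(depth_diff = abs(depth - new_depth))

obs %>%
  group_by(new_depth) %>%
  filter(depth_diff == min(depth_diff)) %>%  # a tie keeps both rows
  slice(1) %>%                               # forces a single row per new_depth
  ungroup()
```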

feather::read_feather(these_files$source_filepath[i]) %>% rename(depth = index) %>%
pivot_longer(-depth, names_to = 'date', values_to = 'temp') %>%
mutate(date = as.Date(date)) %>%
pivot_wider(names_from = depth, values_from = temp, names_prefix = 'temp_') %>%


Here's the pivot_wider I was imagining above. So I do think you would have caught it already if any duplicate site-date-new_depth combos were occurring, because write_csv for a tibble with list columns gives an error:

> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp, values_fn=mean) %>% write_csv('test.csv')
## ^works
> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp) %>% write_csv('test.csv')
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) : 
  Don't know how to handle vector of type list.
In addition: Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 

But note that even with the error, a file (just the column names) does get written...

@jordansread (Author)


Ouch! An empty file when we hit an error? That's not good.

Yes, I will add this other code. Your original code does this with a summarize, which I realized after the fact... I thought it was redundant with what was done above, but clearly it isn't (e.g., the case where observations at 0.4 and 0.6 m both exist).

# to see my work as columns, print out the result up to this point, e.g. by uncommenting
# tail(20) %>% print(n=20)
# now we clean up the columns
select(site_id, date, depth=new_depth, temp=new_temp, source) %>%
@jordansread (Author)


@aappling-usgs how about adding this here to deal with the duplication issue?

group_by(site_id, date, depth) %>%
    summarize(temp = first(temp), source = first(source)) %>%
    ungroup() %>% 
    saveRDS(file = outfile)


Good call - I forgot we were grouping by source above, so this deduplication does indeed need to happen outside of that do(). Does it make sense to prefer a specific source, e.g., the one with the most observations, instead of just picking the first one?

@jordansread (Author)


:) It may in the future, but I think this is such an edge case that I don't want to try to code it in for this experiment. Also, some high-observation sources are worse in quality than ones with fewer observations... so it may not return the "best" data after all (?). As long as I am tracking the source used all the way into the model_matched_obs data (which we are), I am OK making this simplification.


Sounds totally reasonable.

@jordansread jordansread merged commit f33438f into DOI-USGS:master Jun 26, 2020

Successfully merging this pull request may close these issues.

add PB-MTL evaluation