update PBMTL,PGTML,PB0(bads),resample obs #8
Conversation
@aappling-usgs key addition is here
Added some thoughts on potential fragilities, but I didn't spot anything that I think would be creating real problems right now.
},
  depth_diff = abs(depth - new_depth)) %>%
  # after approx(), trash any values at new_depth >= 0.5 m from the nearest observation
  filter(depth_diff < 0.5) %>%
What do you think about replacing 0.5 with sample_res? It won't matter for this project, but in case we ever transfer this code to another project where sample_res is a higher value...
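Concretely, something like this (a minimal sketch; it assumes sample_res is already defined in scope as the resampling resolution, which may not match the actual variable name in the function):

```r
# hypothetical: sample_res holds the resampling resolution used earlier in the pipeline
filter(depth_diff < sample_res) %>%
```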
filter(depth_diff < 0.5) %>%
  # only keep one estimate for each new_depth
  group_by(new_depth) %>%
  filter(depth_diff == min(depth_diff)) %>%
Is there any chance of ending up with two observations with the same new_depth and depth_diff? Lines 190-191 would weed out identical original depths but wouldn't catch it if those two observations were at, say, 0.4 and 0.6 m. If there were such duplicates, I think you probably would notice when writing out the observations in the date x depth matrix format, at least if using pivot_wider...but anyway, even recognizing all that, it might still be a good idea to ensure we get a single unique value here. You could do it by adding a call to slice(1) right after this filter call and before the ungroup, I think.
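Something like this, i.e. (a sketch only; the surrounding pipeline steps are as quoted above, and slice(1) arbitrarily keeps the first of any tied rows):

```r
group_by(new_depth) %>%
  filter(depth_diff == min(depth_diff)) %>%
  slice(1) %>%   # guarantee exactly one row per new_depth even when two obs tie on depth_diff
  ungroup() %>%
```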
PS - here's why I think you'd notice when using pivot_wider: it puts a list in each cell if there are any duplicates and you don't specify values_fn:
> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp)
# A tibble: 1 x 5
`1` `2` `3` `4` `5`
<list> <list> <list> <list> <list>
1 <int [1]> <int [1]> <int [3]> <int [2]> <int [1]>
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp, values_fn=mean)
# A tibble: 1 x 5
`1` `2` `3` `4` `5`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 4 6.5 8
feather::read_feather(these_files$source_filepath[i]) %>% rename(depth = index) %>%
  pivot_longer(-depth, names_to = 'date', values_to = 'temp') %>%
  mutate(date = as.Date(date)) %>%
  pivot_wider(names_from = depth, values_from = temp, names_prefix = 'temp_') %>%
Here's the pivot_wider I was imagining above. So I do think you would have caught it already if any duplicate site-date-new_depth combos were occurring, because write_csv for a tibble with list columns gives an error:
> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp, values_fn=mean) %>% write_csv('test.csv')
## ^works
> tibble(depth=c(1,2,3,3,3,4,4,5), temp=1:8) %>% pivot_wider(names_from=depth, values_from=temp) %>% write_csv('test.csv')
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) :
Don't know how to handle vector of type list.
In addition: Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
But note that even with the error, a file (just the column names) does get written...
Ouch! An empty file when we have an error? That's not good.
Yes, I will add this other code. Your original code does this with a summarize, which I realized after the fact... I thought it was redundant with what was done above, but clearly it isn't (e.g., the case where observations at 0.4 and 0.6 m both exist).
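A toy illustration of that 0.4/0.6 case (a minimal sketch using dplyr; made-up values, column names as in the snippet above): both original depths sit exactly 0.1 m from new_depth = 0.5, so the min(depth_diff) filter keeps both rows and something like slice(1) or summarize(first(...)) is still needed to dedupe.

```r
library(dplyr)

# two distinct original depths (0.4 and 0.6 m) map to the same new_depth
# with identical depth_diff, so filtering on min(depth_diff) retains both
tibble(depth = c(0.4, 0.6), temp = c(10, 12), new_depth = 0.5) %>%
  mutate(depth_diff = abs(depth - new_depth)) %>%
  group_by(new_depth) %>%
  filter(depth_diff == min(depth_diff))
# both rows survive this filter
```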
# to see my work as columns, print out the result up to this point, e.g. by uncommenting
# tail(20) %>% print(n=20)
# now we clean up the columns
select(site_id, date, depth=new_depth, temp=new_temp, source) %>%
@aappling-usgs how about adding this here to deal w/ the duplication issue?
group_by(site_id, date, depth) %>%
summarize(temp = first(temp), source = first(source)) %>%
ungroup() %>%
saveRDS(file = outfile)
Good call - I forgot we were grouping by source above, so this deduplication does indeed need to happen outside of that do(). Does it make sense to prefer a specific source, e.g., the one with the most observations, instead of just picking the first one?
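If we ever wanted that, a rough sketch might look like the following (hedged: n_source_obs is a made-up helper column, and this assumes the column names from the suggestion above; arrange() here sorts the whole frame, so within each site-date-depth group the highest-volume source comes first):

```r
group_by(site_id, source) %>%
  mutate(n_source_obs = n()) %>%            # hypothetical: count of obs contributed by each source
  arrange(desc(n_source_obs)) %>%           # highest-volume source sorts first
  group_by(site_id, date, depth) %>%
  summarize(temp = first(temp), source = first(source)) %>%
  ungroup()
```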
:) it may in the future, but I think this is such an edge case that I don't want to try to code it in for this experiment. Also, some high-obs sources are worse in quality than ones with fewer...so it may not return the "best" data after all (?) As long as I am tracking the source used all the way into the model_matched_obs data (which we are), I am ok making this simplification.
Sounds totally reasonable.
Fixes #5 and some of #4.