Remove errant data (e.g., "LAB" values) from evaluation dataset #12
Comments
These three manual files are flawed based on the earlier analysis:
pgmtl_matched_to_observations %>% filter(site_id == "nhdhr_120020307") %>% group_by(source) %>% summarize(n = length(source), rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE)))
# A tibble: 4 x 3
source n rmse
<chr> <int> <dbl>
1 7a_temp_coop_munge/tmp/MN_fisheries_all_temp_data_Jan2018.rds 65 2.08
2 7a_temp_coop_munge/tmp/MPCA_temp_data_all.rds 1413 1.50
3 7a_temp_coop_munge/tmp/South_Center_DO_2018_09_11_All.rds 853 10.8
4 wqp 114 1.14
pgmtl_matched_to_observations %>% filter(site_id == "nhdhr_120020567") %>% group_by(source) %>% summarize(n = length(source), rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE)))
# A tibble: 6 x 3
source n rmse
<chr> <int> <dbl>
1 7a_temp_coop_munge/tmp/aitkin_anoka_becker_cook_mnlakedata_historicalfiles_manualentry.rds 63 3.97
2 7a_temp_coop_munge/tmp/Greenwood_DO_2018_09_14_All.rds 1043 12.0
3 7a_temp_coop_munge/tmp/MN_fisheries_all_temp_data_Jan2018.rds 681 3.75
4 7a_temp_coop_munge/tmp/MPCA_temp_data_all.rds 452 3.58
5 7a_temp_coop_munge/tmp/Water_Temp.rds 28151 3.08
6 wqp 101 5.11
pgmtl_matched_to_observations %>% filter(site_id == "nhdhr_58125241") %>% group_by(source) %>% summarize(n = length(source), rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE)))
# A tibble: 5 x 3
source n rmse
<chr> <int> <dbl>
1 7a_temp_coop_munge/tmp/Carlos_DO_2018_11_05_All.rds 996 11.1
2 7a_temp_coop_munge/tmp/MN_fisheries_all_temp_data_Jan2018.rds 116 3.31
3 7a_temp_coop_munge/tmp/MPCA_temp_data_all.rds 1852 2.85
4 7a_temp_coop_munge/tmp/Water_Temp.rds 46818 2.96
5 wqp 592 2.77
Sites prefixed with "wqp_IL_EPA" only appear in:
feather::read_feather('../lake-temperature-model-prep/7b_temp_merge/out/merged_temp_data_daily.feather') %>% filter(str_detect(source, "wqp_IL_EPA"), site_id %in% target_expansion_ids) %>% pull(site_id) %>% unique()
[1] "nhdhr_109982172" "nhdhr_109984628" "nhdhr_109986464" "nhdhr_109986912" "nhdhr_109987472" "nhdhr_109989384" "nhdhr_109989482"
[8] "nhdhr_109990726" "nhdhr_121207127" "nhdhr_121207134" "nhdhr_121207285" "nhdhr_121624992" "nhdhr_121625003" "nhdhr_121625323"
[15] "nhdhr_121627799" "nhdhr_121628055" "nhdhr_121628955" "nhdhr_121650552" "nhdhr_121650572" "nhdhr_121650602" "nhdhr_121650613"
[22] "nhdhr_121650633" "nhdhr_121650643" "nhdhr_145607036" "nhdhr_145608202" "nhdhr_145757037" "nhdhr_156039648" "nhdhr_83837813"
[29] "nhdhr_85083102" "nhdhr_90588560" "nhdhr_109992116" "nhdhr_121650562" "nhdhr_121650592" "nhdhr_109986638"
Not including these sites, the worst PB0 RMSE is 8.82° (n=2187); of these 34 sites, the best-performing has an RMSE of 5.33° and the worst 18.2°:
pb0_matched_to_observations %>% filter(site_id %in% bad_EPA) %>% group_by(site_id) %>% summarize(rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE))) %>% arrange(desc(rmse)) %>% print(n=100)
# A tibble: 34 x 2
site_id rmse
<chr> <dbl>
1 nhdhr_121207127 18.2
2 nhdhr_109986912 17.6
3 nhdhr_109986464 14.6
4 nhdhr_109984628 14.5
5 nhdhr_121627799 14.5
6 nhdhr_109990726 13.8
7 nhdhr_109989482 13.5
8 nhdhr_145608202 13.0
9 nhdhr_121650552 12.9
10 nhdhr_121650602 12.8
11 nhdhr_121628955 12.7
12 nhdhr_85083102 12.4
13 nhdhr_121207134 12.2
14 nhdhr_109986638 12.1
15 nhdhr_121650633 11.7
16 nhdhr_109987472 11.2
17 nhdhr_109982172 11.0
18 nhdhr_121650592 10.9
19 nhdhr_121650613 10.7
20 nhdhr_121625003 10.3
21 nhdhr_109989384 10.1
22 nhdhr_121625323 9.72
23 nhdhr_145607036 8.98
24 nhdhr_90588560 8.53
25 nhdhr_121207285 8.41
26 nhdhr_83837813 8.38
27 nhdhr_109992116 8.23
28 nhdhr_121624992 7.70
29 nhdhr_121650643 7.64
30 nhdhr_156039648 7.46
31 nhdhr_121650562 7.07
32 nhdhr_121650572 6.99
33 nhdhr_121628055 6.24
34 nhdhr_145757037 5.33
For this, I have only used information on these specific sources that came from two non-pred related findings: 1) the .rds files were flipped, and 2) these EPA sites had "LAB" temperature data intermingled within actual field temperature readings.
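The comparison quoted above (worst RMSE among sites outside the flagged set) could be reproduced along the lines of the sketch below. It runs on a small invented table; `matched` stands in for `pb0_matched_to_observations`, and `bad_sites` is a hypothetical stand-in for `bad_EPA`. Per the rest of this issue, a summary like this is a diagnostic only and is not a basis for deciding which records to remove.

```r
library(dplyr)

# Toy stand-in for pb0_matched_to_observations (all values invented)
matched <- data.frame(
  site_id = c("lake_1", "lake_1", "lake_2", "lake_2", "lake_3", "lake_3"),
  pred    = c(10, 12, 20, 22, 15, 15),
  obs     = c(11, 13, 5, 7, 15, 19),
  stringsAsFactors = FALSE
)

# Hypothetical list of sites flagged from non-prediction evidence
bad_sites <- c("lake_2")

# Per-site RMSE with the number of matched observations
rmse_by_site <- matched %>%
  group_by(site_id) %>%
  summarize(n = n(), rmse = sqrt(mean((pred - obs)^2, na.rm = TRUE))) %>%
  mutate(flagged = site_id %in% bad_sites)

# Worst-performing site among those that were not flagged
worst_unflagged <- rmse_by_site %>%
  filter(!flagged) %>%
  arrange(desc(rmse)) %>%
  slice(1)

print(worst_unflagged)
```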
Completed in #10.
We accepted a certain number of known flaws in the dataset, as we knew we weren't going to have a comprehensive QA/QC effort and there will always be some degree of observation error.
But the "LAB" values and other flaws from this issue are in the current observations dataset and should be removed if possible. If these data issues are constrained to the extended test lakes, that is fine, since those observations are only used in evaluation for the exported value's RMSE. If they impact the 305 test lakes, that is bad because the RMSE tables used to evaluate the MTL performance would then be impacted. Also, these data should not be removed for any reason based on model performance. We should only remove them if they are clearly errant without the aid of any predictions to tell us this.
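A minimal sketch of the rule described above, assuming the errant records can be identified by source file or site alone. The table, `flagged_sources`, `flagged_sites`, and `test_lake_ids` are all hypothetical placeholders, not names from the project. Records are marked for removal without consulting any prediction column, and the sketch also reports whether any of the test lakes would be affected.

```r
library(dplyr)

# Toy stand-in for the observations table (rows invented for illustration)
observations <- data.frame(
  site_id = c("lake_1", "lake_1", "lake_2", "lake_3"),
  source  = c("wqp", "flagged_file.rds", "wqp_IL_EPA_example", "wqp"),
  obs     = c(20.1, 8.3, 21.0, 18.4),
  stringsAsFactors = FALSE
)

# Hypothetical inputs: sources and sites flagged from non-prediction evidence,
# plus the IDs of the lakes used in the RMSE evaluation tables
flagged_sources <- c("flagged_file.rds")
flagged_sites   <- c("lake_2")
test_lake_ids   <- c("lake_1", "lake_3")

# Mark records as errant using source and site only; no predictions involved
flagged <- observations %>%
  mutate(errant = source %in% flagged_sources | site_id %in% flagged_sites)

# Test lakes that currently contain errant observations
affected_test_lakes <- flagged %>%
  filter(errant, site_id %in% test_lake_ids) %>%
  distinct(site_id) %>%
  pull(site_id)

# Observations that remain after dropping the errant records
cleaned <- flagged %>%
  filter(!errant) %>%
  select(-errant)

print(affected_test_lakes)
print(cleaned)
```

Keeping the flagging step separate from the filtering step makes it easy to review which test lakes are touched before anything is dropped.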