As a tester, I want to test the event based evaluation solution implemented in #130 #382
As my initial evaluation, I used observations for ABRN1 streamflow (part of the HEFS Test A evaluations), and simulations for its NWM feature id (acquired from WRDS), and came up with this:

label: Testing Event Based
observed:
  label: OBS Streamflow
  sources: /home/ISED/wres/wresTestData/issue92087/inputs/ABRN1_QME.xml
  variable: QME
  feature_authority: nws lid
  type: observations
  time_scale:
    function: mean
    period: 24
    unit: hours
predicted:
  label: "19161749 RetroSim CSVs"
  sources:
    - /home/ISED/wres/nwm_3_0_retro_simulations/wfo/OAX/19161749_nwm_3_0_retro_wres.csv.gz
  variable: streamflow
  feature_authority: nwm feature id
  type: simulations
features:
  - {observed: ABRN1, predicted: '19161749'}
time_scale:
  function: mean
  period: 24
  unit: hours
event_detection: observed

It appears as though 64 events were identified with standard statistics output; here is a sampling of the last few pools listed:
I don't have a good way to view them graphically at the moment. Let me see if I can spin up a quick-and-dirty spreadsheet to support viewing the XML observations and CSV simulations. Hank |
That 7 month event in 1987 is kind of odd: "1987-03-22T06:00:00Z, Latest valid time: 1987-10-03T06:00:00Z". Again, I need to visualize the time series so that I can understand where the events are coming from. Hank |
Here is a plot of the observations and simulation for ABRN1 streamflow, with the NWM retrosim being averaged to 24-hours ending at the times of the 24-hour observations (I believe that's how WRES would rescale it by default; observations are blue): It's crude, but the spreadsheet should allow me to focus in on individual events identified by the WRES to see if they make sense. I'll start by examining the data for Mar 22, 1987, through Oct 3, 1987, which the WRES identified as one, long event. Hank |
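For anyone wanting to reproduce the spreadsheet averaging, here is a minimal sketch of the rescaling described above: each 24-hour mean is the average of the simulated values whose valid times fall in the 24 hours ending at an observation time. The function name `rescale_to_mean`, the window convention (start exclusive, end inclusive), and the sample values are assumptions for illustration, not the WRES implementation.

```python
from datetime import datetime, timedelta


def rescale_to_mean(series, end_times, period=timedelta(hours=24)):
    """Average the values whose valid times fall in (end - period, end].

    `series` maps valid time -> value; `end_times` are the valid times of
    the coarser (e.g., 24-hour) observations. Hypothetical helper.
    """
    rescaled = {}
    for end in end_times:
        window = [v for t, v in series.items() if end - period < t <= end]
        if window:  # Skip observation times with no simulated values
            rescaled[end] = sum(window) / len(window)
    return rescaled


# Made-up hourly "simulation": 48 values starting 1987-03-21T07:00Z
start = datetime(1987, 3, 21, 7)
hourly = {start + timedelta(hours=i): float(i) for i in range(48)}

# Made-up 24-hour observation times, ending at 06Z
obs_times = [datetime(1987, 3, 22, 6), datetime(1987, 3, 23, 6)]
daily = rescale_to_mean(hourly, obs_times)
```

With these made-up values, the first 24-hour mean covers hours 0 through 23 and the second covers hours 24 through 47.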
For what it's worth... The events themselves won't always make a ton of sense and you will find that they are rather sensitive to the parameters. It's the algorithm we have for now, but I am pretty certain it's an accurate implementation and it has a decent set of unit tests, including those ported across from python. There is probably no algorithm that produces completely satisfactory results, though. Spending time looking at the events may lead to a different/better set of default parameter values, but it probably won't. On the whole, it produces vaguely sensible looking events for synthetic time-series with strong peak signals. It starts to look more questionable for (some) real time-series. I don't want to sway your UAT but, TBH, I am personally more concerned about the range of interactions between event detection and various other features and whether it produces any/sensible results across all of those various possibilities - there's only so much that can be covered with integration tests. |
James: Thanks. I was about to post the same conclusion that the parameters are probably just not optimal for this location, given the number of multi-week/month events I'm seeing, and I'm not going to spend time trying to optimize them. Next step is to generate outputs that make some sense to me. I'm going to add graphics to the evaluation to help visualize what the WRES produces. Hank |
Oh, and understood about wanting me to look at the interactions between features. I'll start working on that once I can make sense of a "simple" case using real data. Hank |
On the visualization, as I noted in #130, if there were a quick win for visualizing these events, I would've implemented it, but there really isn't. The quickest way would be a sidecar in the event detection package that generated visualizations, but that is pretty yucky as it bypasses our graphics clients. The best way would be to add event detection as a metric as well as a declaration option in itself. That way, you could calculate and visualize the detected events alone using the normal workflow, but that was going to be a lot of work as it would require a new metric/statistics class with output composed of two instants (start and end times). Will have to wait but, honestly, it would probably be better for event detection to be a separate web service. |
Understood. That's why I'm trying to come up with something, myself, that I can do quickly enough to perform some basic checks of the output. Here are the Pearson correlation coefficients for the various events: Okay. I think I need to just test out some of the different options provided in the wiki to see what happens. Hank |
Oh, and if there are other metrics I should be looking at, let me know. I was thinking the time-to-peak metric probably won't be meaningful until I add forecasts to the evaluation, which I do plan to do. For now, since I'm evaluating simulations, I just looked at a couple of single-valued metrics and shared only the correlation. Hank |
The example in the wiki uses simulations. In general, event detection won't work for forecast datasets because they are too short and will only capture partial events, unless it's a long-range forecast, perhaps. Anyway, see the example in the wiki, which is supposed to be exemplary as much as illustrative. I think that is the best/most reasonable application of event detection as it stands, i.e., detecting events for both observations and predictions simultaneously and then comparing the detected events in terms of peak timing error (or possibly other hydrologic signatures that we could add in future). Anything with forecasts is going to be much more dubious, unless you use a non-forecast dataset for detection. Anything with traditional, pool-based, statistics is going to be somewhat dubious too, IMHO. |
Can I compute the average correlation across the identified events? I don't know how useful that would be; I just figured I'd give it a shot. Looking at the wiki, I don't think I can. I know you can summarize across feature pools, but I don't think we can summarize across reference date/time pools, right? I'll check the wiki. For now, I just started an evaluation using the more complex declaration in https://github.com/NOAA-OWP/wres/wiki/Event-detection#how-do-i-declare-event-detection just to see what happens. Hank |
No, that would need to be added as an aggregation option for summary statistics. I think I speculated in #130 about this, so probably worth a ticket for a future release. But again, I am personally a bit doubtful about traditional statistics generated for event periods. In most cases, this sort of analysis probably makes more sense with thresholds, like an analysis for flows above a flood threshold. |
Anyway, I am going to mostly leave you alone now unless you have questions as I have probably already swayed you too much. I just wanted to emphasize that the events are quite sensitive to the parameters and the timing error type analysis probably makes most sense for this new feature, but users will do what they do and you are probably representing their thought process too... |
The run using the parameters in the aforementioned section of the wiki does yield significantly different events: So, yeah, sensitive to parameters. I'm going to try to work up a checklist of things to look at as part of this testing now. I still have questions, but they should be answered as I work through the tests. Hank |
I have a checklist as a starting point. I'm sure that some of the items are nonsensical, but I want to see what happens when I combine different options. As I work through the list, I will likely take previously successful evaluations and add the new, specified feature, in order to see how the results are impacted. I'm probably overlooking tests to perform; I'll add those when I discover the oversight. Thanks, Hank |
FYI... all test declarations will be kept in the standard Hank |
I'm not sure I'm going to get to this today, except perhaps during the office hours (if no one is on). I've been working on the QPR slide deck and dealing with the proxy outage. Hank |
Just talked with James some during the office hours. I ran an evaluation of single valued forecasts from WRDS against observations used for HEFS, with default event-based evaluation parameters, and obtained results for the Pearson correlation coefficient, time to peak error, and sample size. First, we noticed that the Second, James explained that each event period will yield time-to-peak errors for each single valued forecast overlapping that period, and that each such error will be stored in the Each time to peak error is presented as an issued time and value pair on two rows. The output image would then look like: Since I'm zoomed out so far, it appears that the points line up vertically, but that is actually not the case. If you pay close attention to the 0-line at the top, you'll see that they do not exactly overlap. I think this is reasonable output given the declaration I employed. I'll do a bit more testing, though, before checking the single-valued forecast box. Hank |
I reported #385 for the CSV issue. I'll pick up testing again when I can tomorrow. I'm not making as much progress as I had hoped. Hank |
My single valued example run used ABRN1 RFC forecasts from WRDS. I opened the As an aside, ABRN1 appears to be one of those forecast points (presumably in ABRFC) where forecasts are only generated when needed. So, for example, WRES identified an event spanning Jun 5 - Aug 5, 2014. The forecasts for that point that overlap all have issued times of Jun 4, meaning that the RFC generated the forecast only when the event was on the horizon. As for the event being two months long, that was likely due to the parameter options as discussed before. I believe there are summary statistic options for time series metrics. Let me see if I can find those and give it a shot. Hank |
The first thing I found gave me one overall set of summary statistics for the time to peak error instead of one set per event. Let me revisit the event based wiki to see how I'm supposed to do it. Hank |
No, I did it right. That number is intended to report the average time to peak error across all events. In other words, it answers the question: when an event occurs, what is the average time to peak error I can expect for forecasts of those events? If I want statistics for a single event, then, I guess I would modify the declaration to focus on the single event of interest and run it again. James: If that sounds wrong, please let me know. I'm checking the single valued forecast evaluation. It works in a simple evaluation, which is the point of the checkbox, 'Evaluating single value forecasts for events (including time series metrics)'. More complicated stuff comes later. Hank |
Yeah, there is one "time to peak" for one "peak" aka one "event", so the "raw" numbers of the time-to-peak are the "per event" values and the summary statistics aggregate across all events. |
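To make the "raw" versus "summary" distinction concrete, here is a minimal sketch under stated assumptions: one time-to-peak error per event, computed as predicted peak time minus observed peak time, and a mean taken across events. The sign convention, the helper names, and the sample values are all assumptions for illustration; this is not the WRES implementation.

```python
from datetime import datetime, timedelta


def peak_time(series):
    """Return the valid time of the largest value in a {time: value} series."""
    return max(series, key=series.get)


def time_to_peak_error(observed, predicted):
    """One error per event: predicted peak time minus observed peak time.

    The sign convention is an assumption made for this sketch.
    """
    return peak_time(predicted) - peak_time(observed)


def mean_error(errors):
    """Summary statistic: the mean error across all events."""
    return sum(errors, timedelta()) / len(errors)


t0 = datetime(2014, 6, 5)
t1 = datetime(2014, 9, 1)
h = timedelta(hours=1)

# Two made-up events, each a pair of (observed, predicted) series
events = [
    ({t0: 1.0, t0 + 6 * h: 5.0, t0 + 12 * h: 2.0},
     {t0: 1.0, t0 + 6 * h: 3.0, t0 + 12 * h: 6.0}),
    ({t1: 2.0, t1 + 2 * h: 3.0, t1 + 4 * h: 9.0},
     {t1: 2.0, t1 + 2 * h: 8.0, t1 + 4 * h: 4.0}),
]

raw_errors = [time_to_peak_error(obs, pred) for obs, pred in events]
summary = mean_error(raw_errors)
```

Here `raw_errors` holds the per-event values (six hours late and two hours early) and `summary` is their average, which matches the reading above: to study one event, restrict the evaluation to that event rather than expecting a per-event summary.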
Thanks, James! The next checkbox is for a basic ensemble forecast test. So I guess I'll point the declaration to the HEFS data for ABRN1 and see what happens. Hank |
Ensemble forecasts don't allow for time to peak error:
Makes sense: there is one different peak per member. Anyway, I'll use more traditional metrics, even if they aren't really as interesting in this case. Hank |
Odd, the |
Ticket posted. I acknowledged there your comment indicating that "UNKNOWN" may actually be reasonable if the reference time type is actually unknown by the WRES. In this case it shouldn't be given that we're using WRDS. I'll rerun my original single valued forecast event based evaluation to see if the time to peak error plot had an unknown domain axis. Hank |
Continuing with testing... I believe the evaluation documented in #390 has a messed up
I'm going to declare the instantaneous time scale of the Hank |
There was no change in the results and I know why: the retrosim data and WRDS AHPS forecasts are probably clarifying their own time scale. That is, in both cases, the instantaneous time scale is either in the data (retrosim CSVs) or (perhaps?) known by the reader (in the case of WRDS). Okay. I now want to see the impact of rescaling the variable used for event detection. Specifically, if the evaluation is performed at the scale of a 24-hour mean, as in my cases (because of HEFS observations being used), is event detection performed at that scale? I think the easiest way to do this may be to attempt one of these evaluations at the scale of a 7-day maximum (or mean, or whatever) and see if a different set of events is identified. Hank |
It appears that event detection is performed at the Here is after declaring a 168 hour mean time scale: Definitely different. Here it is with a 168 hour maximum time scale: There are differences in the events for the mean and maximum scales, as well. Seems reasonable to me. Hank |
Putting this on-hold for the weekend. I plan to test #384 and that will probably be it. Hank |
Note to self... Run some of these evaluations using a database next week and benchmark the results for UAT during deployment later. All tests so far have been run in-memory. Thanks, Hank |
Also, I checked the box for use of covariates. It worked fine in my testing. The only issue to arise was how the domain axis was labeled in one situation, but I see no reason to leave the box unchecked for that. With that, I think all of the relatively simple tests have passed. I'll be running more complicated evaluations next week including multiple features, sampling uncertainty, thresholds, more complicated temporal pools (I've already evaluated using lead time pools in one instance), etc. We'll see how it goes. Have a great weekend! Hank |
Thanks, you too! |
Back to this... I'm going to start by running some of the evaluations using a database instead of in-memory. Hank |
There is such a big difference in performance: one HEFS evaluation took over 10 minutes using a database, and only 1m 3s in-memory. Despite that, the results look consistent, so that's good. Looking at events from two sources, probably using HEFS observations and NWM retrosim. Hank |
James: I'm still reading through the wiki for clarity, but maybe you can save me some time... If I declare two Thanks, Hank EDIT: Here is the covariates:

  - label: OBS streamflow covariate
    sources: /home/ISED/wres/wresTestData/issue92087/inputs/ABRN1_QME.xml
    variable: QME
    feature_authority: nws lid
    type: observations
    time_scale:
      function: mean
      period: 24
      unit: hours
    purpose: detect
  - label: "19161749 RetroSim CSVs"
    sources:
      - /home/ISED/wres/nwm_3_0_retro_simulations/wfo/OAX/19161749_nwm_3_0_retro_wres.csv.gz
    variable: streamflow
    feature_authority: nwm feature id
    type: simulations
    purpose: detect

|
I think you're asking what the default method of |
That's what I was asking, and I just confirmed it by playing around with the options. Thanks! I've run tests of HEFS observations used for events, NWM retrosim, both together, and both together with an Hank |
Yeah, I guess multiple covariates is a bit of an edge case w/r to those I suppose the way to test this would be consistency across the identified events for multiple datasets declared as |
The consistency check is what I'm planning to do, and an initial look seems "reasonable" (I always feel I need to quote that). Reviewing the wiki, it's still not clear to me that the default operation is Hank |
Ah, yes, there is a subtle difference between there being no default |
Adjusted the language there. |
Note, first, that I performed this evaluation using Here are the correlations for HEFS observations: Here are the correlations for NWM retrosim: Here are the default union results: And here are the intersection results: I'm guessing that one or two of the HEFS observation events are overlapping two NWM retrosim events, forcing them to be combined into one big event when doing the intersection. At least, that's the only reason I can think of that the number of events for the intersection (5) would be fewer than for the NWM retrosim data alone (7). I am a bit surprised that I can't find the ~0.6 correlation value in 2013 for NWM retrosim anywhere in the union result. Instead, all correlations in 2013 are over 0.9. I think I'll need to look at the raw numbers to understand fully. Doing that now, Hank |
Wiki looks better. Thanks! Hank |
Raw number examination may have to wait until Thursday (I'm out tomorrow). We got a high priority security issue from ITSG that I need to review and that probably won't leave much time after. Hank |
An It's important to understand/visualize what is meant by an intersection in this context, which is a temporal intersection. By default, there is no |
Okay. Thanks. I ran the Anyway, I plan to scan the evaluation CSVs tomorrow to concretely identify the events and check what happened, Hank |
It will certainly do that, eliminating events that do not overlap across all detection datasets and using the overall period spanned by the overlapping events as the new event. |
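As a way to visualize the temporal intersection described above, here is a minimal sketch under stated assumptions: events are `(start, end)` pairs, overlapping events are chained into clusters, a cluster survives only if every detection dataset contributed at least one event to it, and the new event is the overall period spanned by the cluster. The function name `intersect_events`, the chaining rule, the treatment of touching events as overlapping, and the day-offset sample values are all assumptions for illustration, not the WRES implementation.

```python
def intersect_events(*datasets):
    """Combine events that overlap across ALL detection datasets.

    Each dataset is a list of (start, end) pairs. Hypothetical helper.
    """
    # Tag every event with the index of the dataset it came from
    tagged = sorted(
        (start, end, source)
        for source, events in enumerate(datasets)
        for start, end in events
    )
    clusters = []
    for start, end, source in tagged:
        if clusters and start <= clusters[-1]["end"]:
            # Overlaps the current cluster: extend it and record the source
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
            clusters[-1]["sources"].add(source)
        else:
            clusters.append({"start": start, "end": end, "sources": {source}})
    # Keep only clusters to which every dataset contributed an event
    return [
        (c["start"], c["end"])
        for c in clusters
        if len(c["sources"]) == len(datasets)
    ]


# Made-up events, expressed as day offsets for brevity
observed = [(0, 10), (20, 25)]
simulated = [(2, 4), (8, 12), (30, 35)]
combined = intersect_events(observed, simulated)
```

In this made-up case, one observed event overlaps two simulated events, so they collapse into a single combined event spanning days 0 to 12, while the non-overlapping events are eliminated. That is consistent with an intersection producing fewer events than either dataset alone.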
At this point, it's pretty clear I'm not going to get to do any more testing today as I had hoped to do. Next week, I need to do the QPR and perhaps help with a deployment of the COWRES and WRES GUI late in the week. Hopefully, that will leave at least some time to do this testing. Apologies for the delays, Hank |
Oh, no problem. It goes out when it goes out. Have a nice weekend! |
You too! Hank |
Back to this one, but note that a CHPS/GraphGen ticket is taking up some of my attention. Hank |
I've gathered the results to look at the impact of the Hank |
See #130. In this ticket, I'll track testing of the new capability. I'll start by building the latest code and running a very basic example. I'll then work up a test plan with different tests to perform, tracked via checkboxes in the description of this ticket (i.e., below). Note that testing will make use of the standalone, both in-memory and using a database. COWRES testing will come later.
Let me pull the code and make sure I can build it.
Thanks,
Hank
==========
Tests to be performed are below. As I work down the list, in some cases, tests higher in the list will be updated to include the mentioned capability being tested. I'll essentially be throwing the kitchen sink at the capability to see if/when something "breaks". Other tests may be added as I progress through this list.