Develop evaluation methods for matching models #23

Open
not-the-fish opened this issue Sep 28, 2017 · 4 comments

@not-the-fish (Contributor)

We will want to compare, select, and evaluate matching models. This requires generating and storing metrics (see dssg/pgdedupe#20 for some possibilities) and perhaps comparing Type I and Type II error rates on labeled pairs held out from the training data (see #20).

This will likely entail storing metrics in a metrics table, plus a notebook/methods/workflow for conducting comparisons and evaluations.
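A minimal sketch of the held-out-pairs comparison, assuming the labeled pairs from #20 are available as a pandas DataFrame with a model match score and a ground-truth label per pair (the `holdout`, `match_score`, and `is_same_entity` names are hypothetical):

```python
# Sketch only: Type I (false match) and Type II (missed match) rates on
# labeled pairs held out from training. Column names are assumptions.
import pandas as pd

def error_rates(holdout: pd.DataFrame, threshold: float) -> dict:
    predicted = holdout["match_score"] >= threshold
    actual = holdout["is_same_entity"].astype(bool)

    false_pos = int((predicted & ~actual).sum())   # Type I: matched, but distinct people
    false_neg = int((~predicted & actual).sum())   # Type II: not matched, but same person
    negatives = int((~actual).sum())
    positives = int(actual.sum())

    return {
        "type_i_rate": false_pos / negatives if negatives else float("nan"),
        "type_ii_rate": false_neg / positives if positives else float("nan"),
    }

# toy holdout set: two true matches, two true non-matches
holdout = pd.DataFrame({
    "match_score":    [0.95, 0.40, 0.80, 0.10],
    "is_same_entity": [1,    1,    0,    0],
})
print(error_rates(holdout, threshold=0.5))  # {'type_i_rate': 0.5, 'type_ii_rate': 0.5}
```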

@not-the-fish (Contributor Author)

Many of these metrics will have cluster score thresholds (see #26), so the metrics table should be similar in shape to the triage results evaluations table.
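One possibility is a long/narrow table with one row per (model, metric, threshold), sketched below; the `results.match_evaluations` name, the columns, the example rows, and the connection string are all assumptions, not an agreed schema:

```python
# Sketch only: a long/narrow metrics table, one row per (model, metric,
# threshold), loosely mirroring the shape of triage's evaluations table.
import pandas as pd
from sqlalchemy import create_engine

rows = [  # illustrative example rows only
    {"model_id": 1, "metric": "n_clusters",        "threshold": 0.5, "value": 1043},
    {"model_id": 1, "metric": "avg_cluster_size",  "threshold": 0.5, "value": 2.7},
    {"model_id": 1, "metric": "precision@holdout", "threshold": 0.5, "value": 0.91},
]

engine = create_engine("postgresql:///dedupe")  # hypothetical connection string
pd.DataFrame(rows).to_sql(
    "match_evaluations", engine,
    schema="results", if_exists="append", index=False,
)
```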

@not-the-fish (Contributor Author) commented Oct 4, 2017

Metrics:

  • Number of clusters @ threshold
  • Number of unmatched records @ threshold
  • Number of exact matches
  • Average size of cluster @ threshold
  • Maximum size of cluster @ threshold
  • Percentage of clusters of size 2 @ threshold
  • Number of blocks
  • Average size of blocks
  • Maximum size of block
  • Minimum size of block
  • Precision and recall on holdout labels @ threshold (see Store labeled pairs in a table #20)
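A minimal sketch of computing the cluster- and block-level metrics above at a single threshold, assuming the model output at that threshold can be expressed as a record-id → cluster-id mapping (singletons = unmatched records) and blocking as a record-id → block-id mapping; the helper names are hypothetical:

```python
# Sketch only: cluster and block summary metrics at one score threshold.
from collections import Counter
from statistics import mean

def cluster_metrics(record_to_cluster: dict) -> dict:
    sizes = Counter(record_to_cluster.values())     # cluster id -> number of records
    multi = [s for s in sizes.values() if s >= 2]   # clusters containing an actual match
    return {
        "n_clusters": len(multi),
        "n_unmatched_records": sum(1 for s in sizes.values() if s == 1),
        "avg_cluster_size": mean(multi) if multi else 0.0,
        "max_cluster_size": max(multi, default=0),
        "pct_clusters_size_2": (sum(1 for s in multi if s == 2) / len(multi)
                                if multi else 0.0),
    }

def block_metrics(record_to_block: dict) -> dict:
    sizes = Counter(record_to_block.values())
    return {
        "n_blocks": len(sizes),
        "avg_block_size": mean(sizes.values()),
        "max_block_size": max(sizes.values()),
        "min_block_size": min(sizes.values()),
    }

# toy example: records a-e after applying a threshold upstream
print(cluster_metrics({"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}))
# 2 clusters, 1 unmatched record, avg/max cluster size 2, 100% of clusters are pairs
```

Precision and recall on the holdout labels could be produced the same way as the error-rate sketch in the first comment, or with scikit-learn's precision_score/recall_score on the thresholded pair predictions.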

@thcrock (Contributor) commented Apr 19, 2018

@nanounanue here are ideas for metrics

@thcrock (Contributor) commented Apr 19, 2018

From Joe:

  • Recall
  • Number of unique persons identified
    This is one way to check whether the model is not matching enough people. For example, if we don't match anyone -- i.e. we assume every event is for a separate person -- we'll probably get a ridiculous number of people in the data. We might even get more people than live in the jurisdiction.
  • Measure of variation in the number of persons identified
  • Maximum number of events per person
    To understand what I mean, think of the extreme case where we say all records belong to a single person. That person would have more events than is reasonable, e.g. 1 person with 10,000 jail bookings. This can help provide a check on the quality of the matches.
  • Number of times the user says the model made a mistake
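A minimal sketch of these person-level sanity checks, assuming each source record is a single event and the matcher assigns each record a person id; reading the "variation" item as the spread in events per person is only one possible interpretation, and the user-reported-mistake count would need a separate feedback mechanism:

```python
# Sketch only: sanity checks on matched output. `record_to_person` and the
# jurisdiction population figure are hypothetical inputs.
from collections import Counter
from statistics import pstdev

def person_level_checks(record_to_person: dict, jurisdiction_population: int) -> dict:
    events_per_person = Counter(record_to_person.values())
    n_persons = len(events_per_person)
    return {
        # too many persons suggests under-matching (every event its own person)
        "n_unique_persons": n_persons,
        "exceeds_jurisdiction_population": n_persons > jurisdiction_population,
        # one person with implausibly many events suggests over-matching
        "max_events_per_person": max(events_per_person.values()),
        "stdev_events_per_person": pstdev(events_per_person.values()),
    }

print(person_level_checks(
    {"booking_1": "p1", "booking_2": "p1", "booking_3": "p2"},
    jurisdiction_population=1_000_000,
))
# 2 unique persons, max 2 events per person, within the jurisdiction population
```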
