Hello,
First, I just want to give tremendous thanks for developing this helpful library and all the support you all provide.
I am trying to link multiple tables of records that describe live events from different ticketing providers. Each table is a different source, such as SeatGeek, StubHub, or Ticketmaster, and from these sources I extract the columns name, venue_name, timezone_local_date, timezone_local_time, and event type.
I have a few questions regarding this problem:
Since I expect each event source to have a single record for each event, is it possible to tell Splink that at most one record from each table is expected to match? Splink currently just gives match weights for all blocked comparisons. I think working this constraint into the algorithm would likely improve match accuracy.
My comparison for name and venue_name uses a string similarity measure (Token Set Ratio) that treats the words as a set. Would it be appropriate (or even necessary) to apply the array tf computation described in "A possible methodology for combining array fields with term frequency adjustments" (#2022), even though these are not array fields?
Some of my event sources give the ID of the matching record in other event sources. For instance, some records in the SeatGeek table give IDs for StubHub. Only a few of the tables provide this, and it covers only part of the records. These provided mappings are mostly accurate but sometimes erroneous. Would it be appropriate to use them to estimate m-values?
Is there a recommended general resource or guide on how to tune string comparison measures and comparison levels through data exploration or other empirical methods?
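To make the first question more concrete, here is a minimal sketch of the kind of one-to-one constraint I have in mind, applied as a post-processing step on pairwise scores rather than inside the model. It is plain Python and does not call any Splink API. The field names (`source_l`, `id_l`, `source_r`, `id_r`, `match_weight`), the example weights, and the greedy strategy are all my own assumptions for illustration.

```python
from typing import Any, Dict, List

Pair = Dict[str, Any]


def greedy_one_to_one(pairs: List[Pair]) -> List[Pair]:
    """Keep at most one match per record per other source, highest weight first."""
    kept: List[Pair] = []
    used = set()  # (record, other_source) slots that already have a match
    for p in sorted(pairs, key=lambda p: p["match_weight"], reverse=True):
        left = (p["source_l"], p["id_l"])
        right = (p["source_r"], p["id_r"])
        slot_l = (left, p["source_r"])   # left record's slot for the right source
        slot_r = (right, p["source_l"])  # right record's slot for the left source
        if slot_l in used or slot_r in used:
            continue  # one of the two records is already matched to that source
        used.add(slot_l)
        used.add(slot_r)
        kept.append(p)
    return kept


# Hypothetical scored pairs (made-up IDs and weights, not real output).
pairs = [
    {"source_l": "seatgeek", "id_l": "sg1", "source_r": "stubhub", "id_r": "sh1", "match_weight": 9.2},
    {"source_l": "seatgeek", "id_l": "sg1", "source_r": "stubhub", "id_r": "sh2", "match_weight": 4.1},
    {"source_l": "seatgeek", "id_l": "sg2", "source_r": "stubhub", "id_r": "sh1", "match_weight": 3.0},
    {"source_l": "seatgeek", "id_l": "sg2", "source_r": "stubhub", "id_r": "sh2", "match_weight": 6.5},
    {"source_l": "seatgeek", "id_l": "sg1", "source_r": "ticketmaster", "id_r": "tm1", "match_weight": 7.7},
]

kept = greedy_one_to_one(pairs)
for p in kept:
    print(p["id_l"], p["id_r"], p["match_weight"])
```

A greedy pass like this is not guaranteed to find the best overall assignment, and it only filters the output rather than changing the match weights themselves, which is why I am asking whether the constraint can be taken into account by the model.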