Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enriched metadata with date, game_week and game_id #340

Merged
merged 2 commits into from
Oct 22, 2024

Conversation

SportsDynamicsDS
Copy link
Contributor

Hello everyone,
Happy to contribute for the first time to the package!

Working with the Metadata class, we noticed there is a few missing pieces of information that could be useful :

  • The date of the game.
  • The game week of the game.
  • The game id of the game.

In this PR :

  • We added the three fields for the class Metadata (kloppy/domain/models/common.py).
  • We added methods to extract the information when available for the following providers : Statsperform/ma1_json, Statsperform/f24_xml, Secondspectrum, Sportec and Tracab.
  • We also implemented the method to extract the score for Statsperform/ma1_json (kloppy/infra/serializers/event/statsperform/parsers/ma1_json.py)

Feel free to contact us if anything is unclear or need to be updated.

Thanks,
The SportsDynamics team

PS : I started with a PR instead of creating an issue, but I have checked if no issue mentioned this topic.

@koenvo
Copy link
Contributor

koenvo commented Jul 18, 2024

Thanks a lot for this PR!

Some questions/feedback:

  1. Can you elaborate on the game_week? Does this always start with 1, and is it an int type?
  2. Please use date or datetime type for game_date
  3. Please add tests for each provider

happy to help where needed.

@SportsDynamicsDS
Copy link
Contributor Author

@koenvo Thanks as well for the quick feedback!

The game_week is information on the number of games played by a team, within a league, including the current one. For instance, the first gameweek in Premier League will start on the 16th of August at 7:30pm. So it is mainly an integer from 1 to 38 in the PL, 1 to 34 in the Ligue 1 (since there are only 18 teams now), and 30 in the Jupiler Pro League. Moreover, if the game is happening during a cup or during a play-off, then the information will be "8th Finals".
The majority of providers are including this information, and it is useful to map your data with this information. Ex: I want to get only the first five games of the team. And this is why it has to be a str type.
I'm on it for 2. and 3.

@SportsDynamicsDS
Copy link
Contributor Author

@koenvo

  1. I explicited the doc in the class Metadata.
  2. The date is now in format datetime, with the timezone defined as UTC.
  3. There is also a test for each provider.

Please let me know if anything is still unclear

@UnravelSports
Copy link
Contributor

It would probably be helpful to keep track of all the providers / deserializers we are supporting this behavior for.

This PR supports:
Event Providers

  • StatsPerform F24 XML
  • StatsBomb
  • WyScout V2
  • WyScout V3
  • Sportec
  • Metrica
  • Datafactory

Tracking Providers

  • StatsPerform MA1
  • SecondSpectrum
  • Tracab
  • Sportec
  • SkillCorner
  • Metrica

Am I missing anything?

@SportsDynamicsDS
Copy link
Contributor Author

@UnravelSports This is correct yes.
Since "Event Providers / Sportec" is using the same metadata file as "Tracking Providers / Sportec", it could also be implemented.
For the other providers, I didn't have enough samples to implement and test.

@SportsDynamicsDS
Copy link
Contributor Author

SportsDynamicsDS commented Jul 18, 2024

@UnravelSports I just pushed the implementation for Events/sportec (the test is similar to the implementation for Tracking/sportec).

So the updated coverage is :

This PR supports:

Event Providers

  • StatsPerform F24 XML
  • StatsBomb
  • WyScout V2
  • WyScout V3
  • Sportec
  • Metrica
  • Datafactory

Tracking Providers

  • StatsPerform MA1
  • SecondSpectrum
  • Tracab
  • Sportec
  • SkillCorner
  • Metrica

@UnravelSports
Copy link
Contributor

UnravelSports commented Jul 18, 2024

I've been looking at the existing test files to see if we can learn anything from them for the currently unsupported providers. But, I'm not 100% sure if the values that I'm listing here are the correct ones.

I personally don't have access to any of the required files, except for SkillCorner and StatsBomb, but I think we can find most of them in the test files.

Event Providers

  • StatsPerform F24 XML
  • StatsBomb
    • Cannot find either of the 3 values inside event.json, lineup.json or 360.json. Not sure, I'm seeing this right though. Can also not find them in my own file.
  • WyScout V2
    • date: does not seem to be supported
    • week: does not seem to be supported
    • game id. Every item in "events" has a "matchId" key
  • WyScout V3
    • date: Inside "match" and then "dateutc"
    • week: Inside "match" and then "gameweek"
    • game id: Inside "match" and then "wyId" (not 100% sure this is correct)
  • Sportec
  • Metrica
    • date: FileDate, but not sure this is correct (in meta_data)
    • week: does not seem to be supported
    • game id: does not seem to be supported
  • Datafactory
    • date: in "match" -> "date" but seems to be a weird format, so might need a different file.
    • week: in "match" -> "week"
    • game id: "match" -> "matchId"

Tracking Providers

  • StatsPerform MA1
  • SecondSpectrum
  • Tracab
  • Sportec
  • SkillCorner
    • date: in match_data.json -> "date_time"
    • week: does not seem to be supported
    • game id: in match_data.json -> "id"
  • Metrica
    • See events

@SportsDynamicsDS SportsDynamicsDS force-pushed the feature-enriched-metadata branch 2 times, most recently from 82fa271 to 8eaeb0b Compare July 19, 2024 06:52
@SportsDynamicsDS
Copy link
Contributor Author

Thanks @UnravelSports , this is really helpful!

Based on your investigation, I was able to implement enrichment for :

  • WyScout V2
  • WyScout V3
  • Datafactory (The date format is recognised using the dateutil.parser/parse method. The parser will be able to automatically detect the date format, and I also added a try/except condition to set the date as None if the format is not recognised.)
  • SkillCorner

Metrica's "FileDate" seems to represent the last time the file has been updated. Which is not the information we are looking for.

The updated coverage is then :

Event Providers

  • StatsPerform F24 XML
    • date
    • game_week
    • game_id
  • StatsBomb
    • date
    • game_week
    • game_id
  • WyScout V2
    • date
    • game_week
    • game_id
  • WyScout V3
    • date
    • game_week
    • game_id
  • Sportec
    • date
    • game_week
    • game_id
  • Metrica
    • date
    • game_week
    • game_id
  • Datafactory
    • date
    • game_week
    • game_id

Tracking Providers

  • StatsPerform MA1
    • date
    • game_week
    • game_id
  • SecondSpectrum
    • date
    • game_week
    • game_id
  • TracabDAT
    • date
    • game_week
    • game_id
  • Sportec
    • date
    • game_week
    • game_id
  • SkillCorner
    • date
    • game_week
    • game_id
  • Metrica

@UnravelSports
Copy link
Contributor

Awesome additions @SportsDynamicsDS!

Okay, so for StatsBomb the matchIds are not provided in the event files, rather they are provided in the match files. Unfortunately, statsbomb.load() does not support passing the matches file.

I don't think it's feasible to pass the matches file to statsbomb.load() because we still won't know what file/game we're dealing with.

@koenvo @probberechts @JanVanHaaren do you think it would be a solution to add match_id and match_date as optional parameters directly to statsbomb.load()? If the answer is yes, the only small issue is that we also have kick_off_time as a separate value in the matches files.

  "match_id" : 3888787,
  "match_date" : "1979-09-07",
  "kick_off" : "19:00:00.000",
  "competition" : {
    "competition_id" : 1470,
    "country_name" : "International",
    "competition_name" : "FIFA U20 World Cup"
  }```

@SportsDynamicsDS
Copy link
Contributor Author

We also checked for Hawkeye, and the only available information is the game_id.

@probberechts talking about Hawkeye, we were able to run successfully your deserializer (Issue 324) on a few Champions League games, and it worked as expected 👍

@koenvo
Copy link
Contributor

koenvo commented Aug 9, 2024

Awesome additions @SportsDynamicsDS!

Okay, so for StatsBomb the matchIds are not provided in the event files, rather they are provided in the match files. Unfortunately, statsbomb.load() does not support passing the matches file.

I don't think it's feasible to pass the matches file to statsbomb.load() because we still won't know what file/game we're dealing with.

@koenvo @probberechts @JanVanHaaren do you think it would be a solution to add match_id and match_date as optional parameters directly to statsbomb.load()? If the answer is yes, the only small issue is that we also have kick_off_time as a separate value in the matches files.

  "match_id" : 3888787,
  "match_date" : "1979-09-07",
  "kick_off" : "19:00:00.000",
  "competition" : {
    "competition_id" : 1470,
    "country_name" : "International",
    "competition_name" : "FIFA U20 World Cup"
  }```

Yes, maybe best to just pass additional_metadata of type dict downstream. And in the deserializer do something like **additional_metadata to populate all attributes. What do you think @JanVanHaaren ?

@JanVanHaaren
Copy link
Collaborator

Yes, maybe best to just pass additional_metadata of type dict downstream. And in the deserializer do something like **additional_metadata to populate all attributes. What do you think @JanVanHaaren ?

Yes, that solution sounds good to me!

I can think of use cases where a user might want to include additional meta data such as the identity of the coach.

@SportsDynamicsDS
Copy link
Contributor Author

Thanks for the feedback guys!
I added "additional_metadata" for statsbomb.load(), and it includes tests.

Moreover, as @JanVanHaaren suggested, "home_coach" and "away_coach" are now parameters of the Metadata class. For WyScout and SkillCorner, the information is automatically extracted.

So the updated coverage is :

This PR supports:

Event Providers

  • StatsPerform F24 XML
    • date
    • game_week
    • game_id
    • coach
  • StatsBomb (Could be included manually)
    • date
    • game_week
    • game_id
    • coach
  • WyScout V2
    • date
    • game_week
    • game_id
    • coach
  • WyScout V3
    • date
    • game_week
    • game_id
    • coach
  • Sportec
    • date
    • game_week
    • game_id
    • coach
  • Metrica
    • date
    • game_week
    • game_id
    • coach
  • Datafactory
    • date
    • game_week
    • game_id
    • coach

Tracking Providers

  • StatsPerform MA1
    • date
    • game_week
    • game_id
    • coach
  • SecondSpectrum
    • date
    • game_week
    • game_id
    • coach
  • TracabDAT
    • date
    • game_week
    • game_id
    • coach
  • Sportec
    • date
    • game_week
    • game_id
    • coach
  • SkillCorner
    • date
    • game_week
    • game_id
    • coach
  • Metrica

@SportsDynamicsDS
Copy link
Contributor Author

Hi @UnravelSports @koenvo,

Any update on this PR ?

Thanks 🙏

@koenvo koenvo merged commit b99d1ac into PySport:master Oct 22, 2024
4 of 19 checks passed
@koenvo koenvo added this to the 3.16.0 milestone Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants