Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more info to the dataframe containing the list of players on the teamsheet of a game #793

Open
Gibranium opened this issue Aug 9, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Gibranium
Copy link

Gibranium commented Aug 9, 2024

(I'm referring to the WhoScored data source and not Statsbomb, as mentioned in the notebook 1)

Hello,

• I think it would be fantastic if the "players" file created in notebook 1, instead of only including the number of minutes played, also included the time interval during which the player was on the field, specifying the start and end minute of their match. This would allow the substitution information to be incorporated directly into the file, enabling those who need it to normalize the players' data based only on the minutes they were actually on the field.

• I’m not sure if this is possible within the framework or the data source, but it would be interesting to add a column with the players' date of birth in the file. This would make it easier to filter by age in a scouting function without having to gather data from other sources and then cross-reference it. Obviously, it would also be useful to include other personal data of the players, such as nationality or other relevant information, if possible.

@probberechts
Copy link
Member

  1. Ideally, it should be a log of each player's position during the game. This can still be easily aggregated if one does not care about the specific position. It would be quite a bit of work to implement this, however, and I don't have plans to implement it myself as long as I don't need it.
match_id player_id team_id position_id start_period start_timestamp end_period end_timestamp duration
... ... ... ... ... ... ... ... ...
  1. For some data sources we have it (e.g, Wyscout). Others might not provide that data, or we might not have bothered parsing it. If some data source provides it and we didn't parse it, feel free to create a pull request.

@probberechts probberechts changed the title Improvements for the players file creation in Public Notebook 1 Add more info to the dataframe containing the list of players on the teamsheet of a game Aug 23, 2024
@probberechts probberechts added the enhancement New feature or request label Aug 23, 2024
@Gibranium
Copy link
Author

I understand.

For the dof and all part it was only a desire because I hate to think I'll need to manually create a file in which every player_id has to be connected with a transfermakrt/fbref link, but I don't really see in the raw Whoscored's file games the needed data.

For the other problem I've actually done it myself, the code is pretty ugly but it seems to do what I need it to do, so I'll share it here so that if someone needs it too, or wants to improve it so it can eventually be implemented inside the package is here.

But first, I found that in my conversion from events to actions and atomic_actions format all the 2nd halves didn't start from 0.0 but from 180.0 or 240.0 so I've implemented this, I don't know if it's only my problem or not, anyway:

actions.loc[actions['period_id'] == 2, 'time_seconds'] = actions[actions['period_id'] == 2].groupby(['game_id', 'period_id'])['time_seconds'].transform(lambda x: x - x.min())

For the other "problem", this would be the result of my work:

game_id team_id player_id player_name is_starter minutes_played game_duration jersey_number starting_position season_id competition_id start_period start_second end_period end_second
1703195 901 18208 Mariano Andújar True 99:59 99:59 21 GK 2023 ARG-Liga Profesional 1.0 00:00 2.0 51:59
1703195 901 125564 Luciano Lollo True 88:48 99:59 26 DC 2023 ARG-Liga Profesional 1.0 00:00 2.0 40:48
1703195 901 303860 Jorge Rodríguez True 99:59 99:59 30 DC 2023 ARG-Liga Profesional 1.0 00:00 2.0 51:59
1703195 901 389191 Zaid Romero True 99:59 99:59 2 DC 2023 ARG-Liga Profesional 1.0 00:00 2.0 51:59
1703195 901 144146 Leonardo Godoy True 99:59 99:59 29 DMR 2023 ARG-Liga Profesional 1.0 00:00 2.0 51:59
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1702085 9293 450274 Jeong Sang-Bin False 20:32 95:06 11 Sub 2023 USA-Major League Soccer 2.0 26:31 2.0 47:03
1702085 9293 141087 Franco Fragapane False 00:00 95:06 7 Sub 2023 USA-Major League Soccer 1.0 00:00 2.0 00:00
1702085 9293 125130 Ismael Tajouri-Shradi False 00:00 95:06 18 Sub 2023 USA-Major League Soccer 1.0 00:00 2.0 00:00
1702085 9293 126741 Clint Irwin False 00:00 95:06 1 Sub 2023 USA-Major League Soccer 1.0 00:00 2.0 00:00
1702085 9293 445293 Devin Padelford False 00:00 95:06 2 Sub 2023 USA-Major League Soccer 1.0 00:00 2.0 00:00

This is the code I've written, again, sorry for the chaos but I'm self taught with coding:
(final_subs is a concat of the events of every game)

# Select substitutions of type 'player off' or 'player on'
Z = final_subs[(final_subs['type_name'] == 'player off') | (final_subs['type_name'] == 'player on')]
Z1 = Z[["game_id", "team_id", "player_id", "period_id", "timestamp", "minute", "second", "type_name"]]
substitutions = Z1.drop_duplicates(["game_id", "team_id", "player_id", "period_id", "type_name"])

# Define a function to merge timestamp data based on game_id and specific conditions
def merge_timestamps(players_df, subs_df, period_id, type_name, new_column_name):
    filtered_subs = subs_df[(subs_df['type_name'] == type_name) & (subs_df['period_id'] == period_id)]
    return players_df.merge(
        filtered_subs[['game_id', 'timestamp']].rename(columns={'timestamp': new_column_name}),
        on='game_id',
        how='left'
    )

# Merge start and end timestamps for each period
players1 = players.copy()
players1 = merge_timestamps(players1, final_subs, period_id=1, type_name='start', new_column_name='start_timestamp')
players1 = merge_timestamps(players1, final_subs, period_id=1, type_name='end', new_column_name='1st_end_timestamp')
players1 = merge_timestamps(players1, final_subs, period_id=2, type_name='start', new_column_name='2nd_start_timestamp')
players1 = merge_timestamps(players1, final_subs, period_id=2, type_name='end', new_column_name='end_timestamp')

players1 = merge_timestamps(players1, final_subs, period_id=3, type_name='start', new_column_name='ot_start_timestamp')
players1 = merge_timestamps(players1, final_subs, period_id=3, type_name='end', new_column_name='1st_ot_end_timestamp')
players1 = merge_timestamps(players1, final_subs, period_id=4, type_name='start', new_column_name='2nd_ot_start_timestamp')
players1 = merge_timestamps(players1, final_subs, period_id=4, type_name='end', new_column_name='ot_end_timestamp')

# Remove duplicate entries
players1 = players1.drop_duplicates(["game_id", "team_id", "player_id"])

# Copy original DataFrame to avoid modifying the original data
substitutions0 = substitutions.copy()
substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=1, type_name='start', new_column_name='start_timestamp')
substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=1, type_name='end', new_column_name='1st_end_timestamp')
substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=2, type_name='start', new_column_name='2nd_start_timestamp')
substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=2, type_name='end', new_column_name='end_timestamp')

substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=3, type_name='start', new_column_name='ot_start_timestamp')
substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=3, type_name='end', new_column_name='1st_ot_end_timestamp')
substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=4, type_name='start', new_column_name='2nd_ot_start_timestamp')
substitutions0 = merge_timestamps(substitutions0, final_subs, period_id=4, type_name='end', new_column_name='ot_end_timestamp')

# Remove duplicate entries
substitutions0 = substitutions0.drop_duplicates(['game_id', 'player_id', 'type_name'])

# Define a function to calculate durations
def calculate_durations(df):
    """Calculate the 1st half, 2nd half, and total game durations."""
    df['1st_half_duration'] = df['1st_end_timestamp'] - df['start_timestamp']
    df['2nd_half_duration'] = df['end_timestamp'] - df['2nd_start_timestamp']
    df['1st_ot_duration'] = (df['1st_ot_end_timestamp'] - df['ot_start_timestamp']).fillna(pd.Timedelta(0))
    df['2nd_ot_duration'] = (df['ot_end_timestamp'] - df['2nd_ot_start_timestamp']).fillna(pd.Timedelta(0))
    
    df['game_duration'] = (df['1st_half_duration'] + df['2nd_half_duration'] + df['1st_ot_duration'] + df['2nd_ot_duration'])
    return df

# Calculate durations for substitutions0
substitutions0 = calculate_durations(substitutions0)

# Calculate durations for players1
players1 = calculate_durations(players1)

# Define the conversion function to convert timedelta to "mm:ss"
def convert_timedelta_to_minutes_seconds(td):
    # Check if the value is NaT (Not a Time), which is pandas' NaN for timedelta
    if pd.isna(td):
        return None  # or you can return None or any placeholder you prefer
    
    # Total number of seconds in the timedelta
    total_seconds = int(td.total_seconds())
    
    # Calculate minutes and seconds
    minutes = total_seconds // 60
    seconds = total_seconds % 60
    
    return f"{minutes:02}:{seconds:02}"

# Apply the function to the column
substitutions0['1st_half_duration'] = substitutions0['1st_half_duration'].apply(convert_timedelta_to_minutes_seconds)
substitutions0['2nd_half_duration'] = substitutions0['2nd_half_duration'].apply(convert_timedelta_to_minutes_seconds)
substitutions0['1st_ot_duration'] = substitutions0['1st_ot_duration'].apply(convert_timedelta_to_minutes_seconds)
substitutions0['2nd_ot_duration'] = substitutions0['2nd_ot_duration'].apply(convert_timedelta_to_minutes_seconds)
substitutions0['game_duration'] = substitutions0['game_duration'].apply(convert_timedelta_to_minutes_seconds)

players1['1st_half_duration'] = players1['1st_half_duration'].apply(convert_timedelta_to_minutes_seconds)
players1['2nd_half_duration'] = players1['2nd_half_duration'].apply(convert_timedelta_to_minutes_seconds)
players1['1st_ot_duration'] = players1['1st_ot_duration'].apply(convert_timedelta_to_minutes_seconds)
players1['2nd_ot_duration'] = players1['2nd_ot_duration'].apply(convert_timedelta_to_minutes_seconds)
players1['game_duration'] = players1['game_duration'].apply(convert_timedelta_to_minutes_seconds)

# Example DataFrame modification based on 'minutes_played'
players1.loc[players1['minutes_played'] == 0, ['start_period', 'end_period']] = [1, 2]
players1.loc[(players1['minutes_played'] == 0) & (players1['ot_end_timestamp'].notna()), 'end_period'] = 4

# Assuming start_second and end_second should start from 0 seconds
players1.loc[players1['minutes_played'] == 0, 'start_second'] = pd.to_timedelta(0, unit='s')
players1.loc[players1['minutes_played'] == 0, 'end_second'] = pd.to_timedelta(0, unit='s')

# Apply the conversion function to start_second and end_second
players1['start_second'] = players1['start_second'].apply(convert_timedelta_to_minutes_seconds)
players1['end_second'] = players1['end_second'].apply(convert_timedelta_to_minutes_seconds)

# Find the maximum 'minutes_played' for each 'game_id'
max_minutes_played = players1.groupby('game_id')['minutes_played'].transform('max')

# Replace the highest 'minutes_played' value with the 'game_duration' for each 'game_id'
players1['minutes_played'] = players1['minutes_played'].mask(players1['minutes_played'] == max_minutes_played, players1['game_duration'])

# Convert 'minutes_played' of 0 to "00:00" using the same function for consistency
players1['minutes_played'] = players1['minutes_played'].apply(lambda x: "00:00" if x == 0 else x)

# Update 'start_period' and 'end_period' where 'minutes_played' is equal to 'game_duration'
players1.loc[players1['minutes_played'] == players1['game_duration'], ['start_period', 'end_period']] = [1, 2]
players1.loc[(players1['minutes_played'] == players1['game_duration']) & (players1['ot_end_timestamp'].notna()), 'end_period'] = 4

# Update 'start_second' to "00:00" where 'minutes_played' is equal to 'game_duration'
players1.loc[players1['minutes_played'] == players1['game_duration'], 'start_second'] = "00:00"

# Update 'end_second' based on the value of 'end_period' when 'minutes_played' is equal to 'game_duration'
# If 'end_period' is 2, set 'end_second' to '2nd_half_duration'
players1.loc[
    (players1['minutes_played'] == players1['game_duration']) & (players1['end_period'] == 2), 
    'end_second'] = players1['2nd_half_duration']

# If 'end_period' is 4, set 'end_second' to '2nd_ot_duration'
players1.loc[
    (players1['minutes_played'] == players1['game_duration']) & (players1['end_period'] == 4), 
    'end_second'] = players1['2nd_ot_duration']

players2 = players1[["game_id", "team_id", "player_id", "player_name", "is_starter", "minutes_played", "jersey_number", "starting_position",
                     "season_id", "competition_id", "1st_half_duration", "2nd_half_duration", "1st_ot_duration", "2nd_ot_duration", "start_period", "end_period", "start_second",
                     "end_second"]]

substitutions1 = substitutions0[substitutions0['type_name'] == 'player on']
substitutions1['start_period'] = substitutions1['period_id']
substitutions1a = substitutions1[substitutions1['period_id'] == 1]
substitutions1a['start_second'] = substitutions1a['timestamp']-substitutions1a['start_timestamp']
substitutions1b = substitutions1[substitutions1['period_id'] == 2]
substitutions1b['start_second'] = substitutions1b['timestamp']-substitutions1b['2nd_start_timestamp']
substitutions1c = substitutions1[substitutions1['period_id'] == 3]
substitutions1c['start_second'] = substitutions1c['timestamp']-substitutions1c['ot_start_timestamp']
substitutions1d = substitutions1[substitutions1['period_id'] == 4]
substitutions1d['start_second'] = substitutions1d['timestamp']-substitutions1d['2nd_ot_start_timestamp']
substitutions1 = pd.concat([substitutions1a, substitutions1b, substitutions1c, substitutions1d])
substitutions1['start_second'] = substitutions1['start_second'].apply(convert_timedelta_to_minutes_seconds)

substitutions2 = substitutions0[substitutions0['type_name'] == 'player off']
substitutions2['end_period'] = substitutions2['period_id']
substitutions2a = substitutions2[substitutions2['period_id'] == 1]
substitutions2a['end_second'] = substitutions2a['timestamp']-substitutions2a['start_timestamp']
substitutions2b = substitutions2[substitutions2['period_id'] == 2]
substitutions2b['end_second'] = substitutions2b['timestamp']-substitutions2b['2nd_start_timestamp']
substitutions2c = substitutions2[substitutions2['period_id'] == 3]
substitutions2c['end_second'] = substitutions2c['timestamp']-substitutions2c['ot_start_timestamp']
substitutions2d = substitutions2[substitutions2['period_id'] == 4]
substitutions2d['end_second'] = substitutions2d['timestamp']-substitutions2d['2nd_ot_start_timestamp']
substitutions2 = pd.concat([substitutions2a, substitutions2b, substitutions2c, substitutions2d])
substitutions2['end_second'] = substitutions2['end_second'].apply(convert_timedelta_to_minutes_seconds)

substitutions1x = substitutions1[["game_id", "team_id", "player_id", "timestamp", "type_name", "1st_half_duration", "2nd_half_duration", "1st_ot_duration", "2nd_ot_duration", "game_duration",
                                  "start_period", "start_second"]]
substitutions2x = substitutions2[["game_id", "team_id", "player_id", "timestamp", "type_name", "1st_half_duration", "2nd_half_duration", "1st_ot_duration", "2nd_ot_duration", "game_duration",
                                  "end_period", "end_second"]]

import numpy as np

# Merge players2 with substitutions1x and substitutions2x on the matching columns
merged_subs1 = pd.merge(players2, substitutions1x, on=["game_id", "team_id", "player_id", "1st_half_duration", "2nd_half_duration", "1st_ot_duration", "2nd_ot_duration"], how="left", suffixes=('', '_subs1'))
merged_subs2 = pd.merge(players2, substitutions2x, on=["game_id", "team_id", "player_id", "1st_half_duration", "2nd_half_duration", "1st_ot_duration", "2nd_ot_duration"], how="left", suffixes=('', '_subs2'))

# Update 'start_period' and 'start_second' using np.where for conditional update
players2["start_period"] = np.where(merged_subs1["start_period_subs1"].notna(), merged_subs1["start_period_subs1"], players2["start_period"])
players2["start_second"] = np.where(merged_subs1["start_second_subs1"].notna(), merged_subs1["start_second_subs1"], players2["start_second"])

# Update 'end_period' and 'end_second' using np.where for conditional update
players2["end_period"] = np.where(merged_subs2["end_period_subs2"].notna(), merged_subs2["end_period_subs2"], players2["end_period"])
players2["end_second"] = np.where(merged_subs2["end_second_subs2"].notna(), merged_subs2["end_second_subs2"], players2["end_second"])

# Optionally, drop any temporary columns created during the merge
players2.drop(columns=[col for col in players2.columns if col.endswith('_subs1') or col.endswith('_subs2')], inplace=True)

# Identify rows where 'is_starter' is True and 'start_period' and 'start_second' are NaN
condition = (players2["is_starter"] == True) & (players2["start_period"].isna()) & (players2["start_second"].isna())

# Update 'start_period' to 1 where the condition is met
players2.loc[condition, "start_period"] = 1

# Update 'start_second' to '0' (or "00:00" if it's a time string) where the condition is met
players2.loc[condition, "start_second"] = "00:00"

# Define the condition to check for NaN values in 'end_period' and 'end_second'
condition = (players2["2nd_ot_duration"] == "00:00") & (players2["end_period"].isna()) & (players2["end_second"].isna())
# New condition: Check where 'end_period' is NaN and '2nd_ot_duration' is not NaN
condition_ot = (players2["2nd_ot_duration"] != "00:00") & (players2["end_period"].isna()) & (players2["end_second"].isna())

# Update 'end_period' to 2 or 4 where the initial condition is met
players2.loc[condition, "end_period"] = 2
players2.loc[condition_ot, "end_period"] = 4

# Update 'end_second' to '2nd_half_duration' where the initial condition is met
players2.loc[condition, "end_second"] = players2.loc[condition, "2nd_half_duration"]
players2.loc[condition_ot, "end_second"] = players2.loc[condition_ot, "2nd_ot_duration"].values

cols = list(players2.columns)

# Switch third-to-last with the second-to-last column
# Third-to-last index is -3 and second-to-last index is -2
cols[-3], cols[-2] = cols[-2], cols[-3]

# Reorder DataFrame using the modified column order
players2 = players2[cols]

# Function to convert to MM:SS format
def format_minutes_played(value):
    if pd.isna(value):
        return value  # Return as is if value is None or NaN
    value = str(value)  # Convert the value to string
    if ':' in value:
        minutes, seconds = value.split(':')
        return f"{int(minutes):02}:{int(seconds):02}"  # Format with leading zeros
    else:
        # Convert MM to MM:SS format by appending ":00" with leading zeros for minutes
        return f"{int(value):02}:00"

# Apply the function to 'minutes_played' column
players2['minutes_played'] = players2['minutes_played'].apply(format_minutes_played)

def convert_to_total_seconds(value):
    if pd.isna(value):
        return np.nan  # Return NaN if value is None or NaN
    value = str(value)  # Convert the value to string
    if ':' in value:
        minutes, seconds = value.split(':')
        total_seconds = int(minutes) * 60 + int(seconds)
    else:
        total_seconds = int(value)  # In case value is already in seconds
    return total_seconds

# Convert 'end_second' and 'start_second' columns to total seconds
players2['minutes_played_seconds'] = players2['minutes_played'].apply(convert_to_total_seconds)
players2['end_second_seconds'] = players2['end_second'].apply(convert_to_total_seconds)
players2['start_second_seconds'] = players2['start_second'].apply(convert_to_total_seconds)
players2['1st_half_duration_seconds'] = players2['1st_half_duration'].apply(convert_to_total_seconds)
players2['2nd_half_duration_seconds'] = players2['2nd_half_duration'].apply(convert_to_total_seconds)
players2['1st_ot_duration_seconds'] = players2['1st_ot_duration'].apply(convert_to_total_seconds)
players2['2nd_ot_duration_seconds'] = players2['2nd_ot_duration'].apply(convert_to_total_seconds)

def calculate_minutes(row):
    if 0 < row['minutes_played_seconds'] < (row['1st_half_duration_seconds'] + row['2nd_half_duration_seconds'] + row.get('1st_ot_duration_seconds', 0) + row.get('2nd_ot_duration_seconds', 0)):
        # Handle periods 1 and 2 (regular time)
        if row['start_period'] == 1:
            if row['end_period'] == 1:
                return row['end_second_seconds'] - row['start_second_seconds']
            elif row['end_period'] == 2:
                if row['end_second_seconds'] == 0:
                    return row['1st_half_duration_seconds'] - row['start_second_seconds']
                else:
                    return row['1st_half_duration_seconds'] + row['end_second_seconds']
            elif row['end_period'] == 3:
                return row['1st_half_duration_seconds'] + row['2nd_half_duration_seconds'] + row['end_second_seconds']
            elif row['end_period'] == 4:
                return (row['1st_half_duration_seconds'] + row['2nd_half_duration_seconds'] +
                        row.get('1st_ot_duration_seconds', 0) + row['end_second_seconds'])

        elif row['start_period'] == 2:
            if row['end_period'] == 2:
                if row['start_second_seconds'] == 0:
                    return row['end_second_seconds']
                else:
                    return row['end_second_seconds'] - row['start_second_seconds']
            elif row['end_period'] == 3:
                return row['2nd_half_duration_seconds'] - row['start_second_seconds'] + row['end_second_seconds']
            elif row['end_period'] == 4:
                return (row['2nd_half_duration_seconds'] - row['start_second_seconds'] +
                        row.get('1st_ot_duration_seconds', 0) + row['end_second_seconds'])

        # Handle overtime periods (3 and 4)
        elif row['start_period'] == 3:
            if row['end_period'] == 3:
                return row['end_second_seconds'] - row['start_second_seconds']
            elif row['end_period'] == 4:
                if row['start_second_seconds'] == 0:
                    return row.get('1st_ot_duration_seconds', 0) + row['end_second_seconds']
                else:
                    return (row.get('1st_ot_duration_seconds', 0) - row['start_second_seconds'] +
                            row['end_second_seconds'])

        elif row['start_period'] == 4:
            if row['start_second_seconds'] == 0:
                return row['end_second_seconds']
            else:
                return row['end_second_seconds'] - row['start_second_seconds']

    return None  # Return None if conditions are not met

# Apply the function to create a new column for the calculated minutes
players2['calculated_minutes'] = players2.apply(calculate_minutes, axis=1)

# Function to convert total seconds to MM:SS format
def convert_to_mmss(seconds):
    if pd.isna(seconds) or seconds is None:
        return np.nan  # Return NaN if the input is None or NaN
    seconds = int(seconds)  # Ensure the input is an integer
    minutes = seconds // 60  # Calculate the minutes
    seconds = seconds % 60   # Calculate the remaining seconds
    return f"{minutes:02}:{seconds:02}"  # Format as MM:SS with leading zeros

# Apply the function to the 'duration_seconds' column to create a new column for the calculated minutes
players2['calculated_minutes'] = players2['calculated_minutes'].apply(convert_to_mmss)

players2.loc[~players2['calculated_minutes'].isna(), 'minutes_played'] = players2['calculated_minutes']

players2['game_duration'] = players2['1st_half_duration_seconds']+players2['2nd_half_duration_seconds']+players2['1st_ot_duration_seconds']+players2['2nd_ot_duration_seconds']

players2['game_duration'] = players2['game_duration'].apply(convert_to_mmss)

players3 = players2[["game_id", "team_id", "player_id", "player_name", "is_starter", "minutes_played", "game_duration", "jersey_number", "starting_position",
                                  "season_id", "competition_id", "start_period", "start_second", "end_period", "end_second"]]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants