removed some columns in 0.5.1 #152

Closed
koftezz opened this issue Dec 28, 2024 · 1 comment

koftezz commented Dec 28, 2024

Hey, I was going to upgrade from 0.4 to a newer version; however, the get_df method no longer returns the words, weekday, and hour columns. I was wondering why that is.

0.5.1
100%|█████████████████████████████████| 8340/8340 [00:00<00:00, 14199.79it/s]
28.12.2024 16:49:47 INFO Finished parsing raw messages.
Index(['timestamp', 'author', 'message'], dtype='object')

0.5.0
100%|█████████████████████████████████| 8340/8340 [00:00<00:00, 14126.80it/s]
28.12.2024 16:50:47 INFO Finished parsing raw messages.
Index(['timestamp', 'author', 'message', 'weekday', 'hour', 'words',
'letters'],
dtype='object')

joweich (Owner) commented Jan 4, 2025

Hey @koftezz, we decided to remove these aggregations from the default calculation because they can be derived later from the timestamp and message fields (see commit cbd31ba). This keeps the dataframe leaner.

You can simply use this polars snippet to bring them back:

import polars as pl

df = df.with_columns([
    # dt.weekday() is ISO-numbered (Monday=1 ... Sunday=7), so format the
    # day name directly instead of mapping integers 0-6 to names.
    pl.col("timestamp").dt.strftime("%A").alias("weekday"),
    # Hour of day (0-23).
    pl.col("timestamp").dt.hour().alias("hour"),
    # Number of space-separated tokens in the message.
    pl.col("message").str.split(" ").list.len().alias("words"),
    # Number of characters in the message.
    pl.col("message").str.len_chars().alias("letters"),
])

If you prefer pandas, use this instead:

# Only needed if df is a polars DataFrame; skip if it is already pandas.
df = df.to_pandas()
df["weekday"] = df["timestamp"].dt.day_name()
df["hour"] = df["timestamp"].dt.hour
df["words"] = df["message"].apply(lambda s: len(s.split(" ")))
df["letters"] = df["message"].apply(len)
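
For reference, here is a minimal self-contained sketch of the pandas approach, wrapped in a reusable function and run on a small made-up dataframe. The function name `add_derived_columns` and the sample rows are hypothetical and not part of the library; the sketch only assumes a datetime `timestamp` column and a string `message` column, as in the column listing above.

```python
import pandas as pd


def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with weekday, hour, words, and letters columns.

    Hypothetical helper: assumes a datetime 'timestamp' column and a
    string 'message' column.
    """
    out = df.copy()
    out["weekday"] = out["timestamp"].dt.day_name()
    out["hour"] = out["timestamp"].dt.hour
    # Space-based split, as in the snippet above; consecutive spaces
    # therefore count as empty tokens.
    out["words"] = out["message"].str.split(" ").str.len()
    out["letters"] = out["message"].str.len()
    return out


# Made-up sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-12-28 16:49:47", "2024-12-29 09:05:00"]
        ),
        "author": ["alice", "bob"],
        "message": ["hello there", "good morning everyone"],
    }
)

result = add_derived_columns(sample)
print(result[["weekday", "hour", "words", "letters"]])
```

Returning a copy leaves the original dataframe untouched, so the derived columns can be recomputed or dropped without re-parsing the chat export.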

Let me know if you have any further concerns!

joweich closed this as completed Jan 4, 2025