Describe the bug
Applying ops.GroupBy(...) after ops.Filter(...) produces incorrect output. Some rows are filled with lists of nans, and rows are not grouped correctly. The problem seems to be with the index. A bug related to #1767.
Expected behavior
Expected output_df should look like this:
  session event_id_list category_list event_type_list  category_count
0       a        [0, 1]        [3, 3]  [start, start]               2
1       b           [3]           [4]         [start]               1
The event with event_id == 3 should be assigned to session b, not a.
The columns event_id_list and category_list should contain lists of ints, not floats.
Environment details (please complete the following information):
Additional context
Related issue #1767 was about a TypeError. In output_df you can see that the category_list column contains lists of floats (categories should be ints after ops.Categorify), so they were converted in order to avoid the TypeError.
I believe that only the symptom of the bug was fixed there, not the cause. I think the TypeError was an indirect result of the bug I describe in this issue. Since GroupBy causes some rows to be nans, there was a type conflict between the original values (ints) and the nans (floats). But the real problem is that GroupBy after Filter messes up the indexing and creates some empty rows.
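To make the suspected mechanism concrete, here is a minimal sketch in plain pandas. It is not NVTabular's code and not the original reproducer. The data and the names positional, misaligned and aligned are hypothetical. It only assumes that, somewhere after filtering, positionally built values get combined with a frame that still carries its pre-filter index. That alone is enough to produce nans and turn ints into floats.

```python
import pandas as pd

# Hypothetical input, loosely modelled on the report (not the original data).
df = pd.DataFrame({
    "session": ["a", "a", "a", "b"],
    "event_id": [0, 1, 2, 3],
    "category": [3, 3, 5, 4],
    "event_type": ["start", "start", "stop", "start"],
})

# Filtering keeps the original labels: the index is now [0, 1, 3], not [0, 1, 2].
filtered = df[df["event_type"] == "start"]

# Values rebuilt positionally get a fresh 0..n-1 index.
positional = pd.Series(filtered["category"].to_numpy())

# Assignment aligns on index labels: label 3 has no partner, so that row
# becomes NaN and the whole column is upcast from int to float.
misaligned = filtered.assign(category_copy=positional)

# Resetting the index after the filter makes labels and positions agree again.
aligned = filtered.reset_index(drop=True).assign(category_copy=positional)

# With a clean index, a list-style groupby gives the grouping the report expects.
grouped = (
    aligned.groupby("session")
    .agg(
        event_id_list=("event_id", list),
        category_list=("category", list),
        category_count=("category", "count"),
    )
    .reset_index()
)
print(misaligned)
print(grouped)
```

If something like this happens inside the GroupBy op, resetting the index between Filter and GroupBy would be the natural place to look, but that is a guess until someone confirms it against the actual implementation.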
I'm also really interested in this issue. I've seen that @oliverholworthy was involved in the linked problem! Dear @oliverholworthy, do you think the above could be considered a pressing issue? Do you have any idea what could be happening here?