Faceting bug for categorical columns #3588
Comments
https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Categorical.html
@wirhabenzeit you'll need to use
import altair as alt
import polars as pl
from vega_datasets import data
df = pl.DataFrame(data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)
alt.Chart(df).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders") |
@dangotbanned Hmmm I think you misunderstood the issue. The issue is not that the order of facets is not lexicographical. The issue is that for categorical columns the resulting plot simply puts data points in the wrong facets. If you look at the example above, the blue points should all be in the USA facet, irrespective of the ordering of the rows. In fact, when I encountered this issue I used categorical encoding precisely to be able to specify an order, but then the plot just becomes erratic. |
@wirhabenzeit Could you explain the difference between these two? I'm more than happy to reopen the issue if I've misunderstood, but they look the same to me?
[Two images compared: "What you would like to happen" and "Output in #3588 (comment)"] |
@dangotbanned There is no difference. Maybe I explained it poorly. My bug report is that faceting with categorical columns which are not |
Thanks for raising this issue @wirhabenzeit! This is a very interesting issue you are raising. I can reproduce the issue you are describing, but I'm not sure exactly what is going on. Will investigate a bit more what changed with the categorical definition. The usage that you describe sounds solid to me. Maybe this is a regression with 5.4? Anyway, it is reproducible! Thanks again for your time to raise this issue! |
@mattijn I have looked around more and I think this goes back to vega/vega-lite#5937 |
I'm still unsure how this isn't explained by the nondeterministic ordering in |
Might also be related to vega/vega-lite#8675 which was reported in Altair here #3481. |
@mattijn Thanks for investigating! As far as I can see the issue arises on a group level, e.g. when grouping by facet, but also specifying encodings such as color or shape, then data gets misplaced as soon as any group (like row a, column b, color c, shape d) contains no data points. Could there be an automatic way of detecting this on the Altair side, and issuing a warning? Probably that’s difficult in case categories are derived using transformations etc? |
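As a rough idea of what such a check could look like, here is a minimal sketch on plain Python records. `find_empty_groups` is a hypothetical helper written for illustration, not an Altair function, and it assumes the categories can be read directly from the data (it does not cover categories derived through transformations).

```python
from itertools import product


def find_empty_groups(records, fields):
    """Return every combination of observed values in `fields` that has no rows."""
    # Observed values per field, kept in order of first appearance.
    levels = [list(dict.fromkeys(r[f] for r in records)) for f in fields]
    # Combinations that actually occur in the data.
    present = {tuple(r[f] for f in fields) for r in records}
    return [combo for combo in product(*levels) if combo not in present]


records = [
    {"Origin": "USA", "Cylinders": "8"},
    {"Origin": "USA", "Cylinders": "4"},
    {"Origin": "Japan", "Cylinders": "4"},
]
missing = find_empty_groups(records, ["Origin", "Cylinders"])
print(missing)  # [('Japan', '8')]
```

A non-empty result would be the condition under which a warning might be issued.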
Thanks for your response. You mean you can introduce this behavior without a |
No, I think without rows/columns the issue is not there. What I meant is that problems arise as soon as some color/shape group in some facet has no data points. So for the workaround one would need to fill in nulls not only for empty facets but also for empty groups within a facet. In my original example above the two plots differ not just in that some entire facets are in the wrong place; the individual facets are also different. I can try to produce a more minimal example showing this. |
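A minimal sketch of that padding step, on plain Python records rather than a dataframe: `pad_empty_groups` is a hypothetical helper, and whether padding like this fully avoids the misplaced points is an assumption that this thread does not confirm.

```python
from itertools import product


def pad_empty_groups(records, group_fields):
    """Append one placeholder row (other fields set to None) per empty group."""
    levels = [list(dict.fromkeys(r[f] for r in records)) for f in group_fields]
    present = {tuple(r[f] for f in group_fields) for r in records}
    other_fields = [k for k in records[0] if k not in group_fields]
    padded = list(records)
    for combo in product(*levels):
        if combo not in present:
            row = dict(zip(group_fields, combo))
            row.update({k: None for k in other_fields})
            padded.append(row)
    return padded


records = [
    {"month": "jan", "choice": "A", "value": 5},
    {"month": "feb", "choice": "A", "value": 5},
    {"month": "feb", "choice": "B", "value": 5},
    {"month": "mar", "choice": "B", "value": 5},
]
padded = pad_empty_groups(records, ["month", "choice"])
print(padded[4:])  # placeholder rows for (jan, B) and (mar, A)
```

After padding, every month/choice combination has at least one row, so no facet or group within a facet is empty.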
There seems to be something going on with polars too. First, if I do
import polars as pl
import vega_datasets
df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)
The order is correct in the chart, but when doing:
df['Cylinders'].cat.get_categories().to_list()
I get
['8', '4', '6', '3', '5']
So it is not really clear to me how the chart specification can know the right order. But if I try to force the categorical order using an Enum:
import vega_datasets
import polars as pl
df = pl.from_pandas(vega_datasets.data.cars()).with_columns(
    pl.col("Origin"),
    pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
)
uniq_cylinders = df['Cylinders'].unique().to_list()
print('cast Enum', sorted(uniq_cylinders))
df_sort = df.sort(pl.col('Cylinders').cast(pl.Enum(sorted(uniq_cylinders))))  # ['3', '4', '5', '6', '8']
df_catg = df_sort.with_columns(pl.col('Cylinders').cast(pl.Categorical))
df_catg['Cylinders'].cat.get_categories().to_list()
It returns
cast Enum ['3', '4', '5', '6', '8']
['8', '4', '6', '3', '5']
And a wrongly sorted chart. |
@mattijn I can help but could you add some comments - explaining the intention behind each action you've taken please?
Code block
import vega_datasets
import polars as pl
df = pl.from_pandas(vega_datasets.data.cars()).with_columns(
    pl.col("Origin"),
    pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
)
uniq_cylinders = df['Cylinders'].unique().to_list()
print('cast Enum', sorted(uniq_cylinders))
df_sort = df.sort(pl.col('Cylinders').cast(pl.Enum(sorted(uniq_cylinders))))  # ['3', '4', '5', '6', '8']
df_catg = df_sort.with_columns(pl.col('Cylinders').cast(pl.Categorical))
df_catg['Cylinders'].cat.get_categories().to_list()
I'm having trouble understanding as this reads more like
My immediate thoughts are:
I should have elaborated in #3588 (comment) but to me the issue seems to be wanting some explicit behavior - without using any of the explicit features of So one way to look at this, is if you tell Maybe this section of their user guide would be helpful? Also https://docs.pola.rs/user-guide/concepts/data-types/categoricals/ |
I notice one thing that is different. If I define the dataframe as you suggested using a
df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)
chart = alt.Chart(df).mark_point().properties(width=100, height=100).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color="Origin",
).facet(row="Origin", column="Cylinders")
print(df.get_column("Cylinders").cat.get_categories())
print(chart.to_dict()['facet'])
shape: (5,)
Series: 'Cylinders' [str]
[
    "8"
    "4"
    "6"
    "3"
    "5"
]
{'column': {'field': 'Cylinders', 'type': 'nominal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}
As you can see, there is no
Where in my ugly (no fun indeed!) defined DataFrame it actually includes the
{'column': {'field': 'Cylinders', 'sort': ['8', '4', '6', '3', '5'], 'type': 'ordinal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}
Long story short, how does the inference work for a polars column cast as a lexical categorical? Is it correct that there is no |
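To restate the two facet dictionaries above in code form, here is a minimal sketch of the inference rule as it appears from the outputs in this thread: a column reported as an ordered categorical yields 'ordinal' plus a `sort` list copied from its categories, and anything else yields 'nominal' with no `sort`. `infer_channel` is a hypothetical helper written for illustration; it is not Altair's or Narwhals' actual implementation.

```python
def infer_channel(field, categories, is_ordered):
    """Build a facet channel dict from a column's categories (illustrative only)."""
    if is_ordered:
        # The categories are copied verbatim into `sort`, in whatever
        # order the dataframe library reports them.
        return {"field": field, "sort": list(categories), "type": "ordinal"}
    return {"field": field, "type": "nominal"}


cats = ["8", "4", "6", "3", "5"]
print(infer_channel("Cylinders", cats, is_ordered=True))
print(infer_channel("Cylinders", cats, is_ordered=False))
```

Under this rule, the `sort` list is only as good as the order in which the categories are returned.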
Thanks @mattijn for the detail! So this part can be answered (I think) with Lines 712 to 729 in a171ce8
From what I'm understanding of https://github.com/narwhals-dev/narwhals/blob/aed2d515a2e26465a6edecf8d7aa560353cbdfa2/narwhals/utils.py#L401-L407 The type will be
cc @MarcoGorelli to double check EditMisunderstood that
|
Thanks for adding more info on the table! But I'm not sure if I can read an answer in this already. |
@mattijn no worries, yeah you've understood that correctly |
Thanks for the ping!
I think they should be the same? As in,
Regarding
In [22]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('lexical')).sort()
Out[22]:
shape: (3,)
Series: '' [cat]
[
    "a"
    "b"
    "c"
]
In [23]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('physical')).sort()
Out[23]:
shape: (3,)
Series: '' [cat]
[
    "b"
    "a"
    "c"
]
but they both return the same output for
if dtype == Categorical:
    categories = self._col.cat.get_categories()
    is_ordered = dtype.ordering == "physical"  # type: ignore[attr-defined]
elif dtype == Enum:
    categories = dtype.categories  # type: ignore[attr-defined]
    is_ordered = True
else:
    msg = "`describe_categorical` only works on categorical columns"
    raise TypeError(msg)
The interchange protocol definition is a bit vague here, it just says "whether the ordering of dictionary indices is semantically meaningful" |
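To make the two orderings concrete without depending on Polars, here is a small library-free sketch. It assumes that "physical" order can be modeled as order of first appearance and "lexical" order as plain string sorting, which matches the two `sort()` outputs above but is not a description of Polars internals. `physical_order` and `lexical_order` are hypothetical names.

```python
def physical_order(values):
    """Unique values in order of first appearance (assumed model of 'physical')."""
    return list(dict.fromkeys(values))


def lexical_order(values):
    """Unique values sorted as strings (assumed model of 'lexical')."""
    return sorted(set(values))


values = ["b", "a", "c", "a"]
print(physical_order(values))  # ['b', 'a', 'c']
print(lexical_order(values))   # ['a', 'b', 'c']

# String sorting is not numeric sorting: '12' sorts before '2'.
print(lexical_order(["12", "4", "2"]))  # ['12', '2', '4']
```

The last line is worth keeping in mind for columns like Cylinders that hold numbers cast to strings.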
Thanks @MarcoGorelli
So I goofed on this one 🤦♂️ In #3588 (comment) I was trying to explain this bit where AFAIK Lines 712 to 722 in a171ce8
Maybe I should've wrote |
I think the issue reproduces with pandas ordered categoricals too, both on Altair 5.3.0 and Altair 5.4.1
code:
import altair as alt
import pandas as pd
df_catg2 = pd.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)
df_catg2["month"] = df_catg2["month"].astype(pd.CategoricalDtype(ordered=True))
chart_catg2 = (
    alt.Chart(df_catg2, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_catg2 |
Pff, complicated. Altair assumes that the returned categories are in sorted order when it is defined as ordered, but this is an assumption that does not always hold.
# Imports assumed here; the original comment did not show them.
import altair as alt
import narwhals as nw
import polars as pl

# Case 1: Enum with an explicit category order
my_order = ["k", "z", "b", "a"]
df = pl.from_dict({"cats": ['z', 'z', 'k', 'a', 'b'], "vals": [3, 1, 2, 2, 3]})
df = df.with_columns(pl.col("cats").cast(pl.Enum(my_order)))
nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())
chart = alt.Chart(df, title='pl.Enum(my_order)').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart

# Case 2: Categorical with the default ('physical') ordering
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical()))  # 'physical'
nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())
chart = alt.Chart(df, title='pl.Categorical()').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart

# Case 3: Categorical with 'lexical' ordering
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical('lexical')))
nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())
chart = alt.Chart(df, title="pl.Categorical('lexical')").mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
To support lexical categorical, it should
Current implementation of
Apparently the situation is different for pandas ordered categorical, since it does not always return the sorted physical ordered categorical. Btw, when trying OP as you did in #3588 (comment), I get this:
import polars as pl
import vega_datasets
import altair as alt
alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin").cast(pl.Categorical),
        pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders").to_dict()['facet']
With |
Right, sorry about that, I just did
Are you sure about this? It seems to me that anything which is auto-inferred to be "ordinal" (as opposed to "nominal") is subject to issues.
For example, if we start with
import altair as alt
import pandas as pd
import polars as pl
df_cat = pd.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)
then:
pandas ordered categorical: 'ordinal', incorrect data
df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=True))
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat
pandas unordered categorical: 'nominal', correct data (but wrong ordering)
df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=False))
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat
Polars physical categorical: 'ordinal', incorrect data
df_cat = pl.from_pandas(df_cat).with_columns(
    pl.col('month').cast(pl.Categorical('physical'))
)
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat
Polars lexical categorical: 'nominal', correct data (but wrong ordering)
I've tried doing this, but then the output from the example above becomes incorrect for both physical and lexical |
Not sure about anything anymore, but I think we have identified at least four issues/anomalies by now:
pd.Series(['4', '2', '12'], dtype=pd.CategoricalDtype(ordered=True)).cat.categories.to_list()
pl.Series(['4', '2', '12']).cast(pl.Categorical('lexical')).cat.get_categories().to_list()
s1 = pd.Series(['4', '2', '12'], dtype='category')
s2 = pd.Series(['4', '2', '12'])
pl_s1 = pl.from_pandas(s1).cast(pl.Categorical('physical')).cat.get_categories().to_list()
pl_s2 = pl.from_pandas(s2).cast(pl.Categorical('physical')).cat.get_categories().to_list()
pl_s1, pl_s2
For clarity, data without defined categorical is returning its categories sorted in physical order when cast to physical categorical in polars (
Regarding my comment, a few clarification notes in [italic]:
So basically, for proper dataframe inference of ordered categoricals then:
Also meaning that this will currently lead to data being placed in wrong subplots if there are panels without data for both lexical and physical ordered categoricals, since there will be a |
Pinging @c-peters for additional context, as they may have the best understanding of |
Forgive me if there is something I am misunderstanding, but it seems like all the issues reported here could stem from VegaLite not handling the
Outside row and column faceting, all the scenarios with pd and pl categories work as expected as far as I can see (i.e. the order of the color scale matches the order of the categories in each of these examples):
pd ordered
Identified as ordinal as expected:
Changing the categorical order changes the color scale order:
pd unordered
Identified as nominal as expected:
pl physical
Identified as ordinal as expected:
pl lexical
As already pointed out above, it seems that identification of lexical categories as ordinal data is not yet supported since it is not indicated as categorical data by narwhals and thus we get back unsorted nominal data:
Which would be the same as if we used the pl physical data frame and explicitly encoded the data type as nominal: |
I'm not too familiar with what happens in Altair / Narwhals, but indeed the call to Would it be possible to call |
What happened?
Faceting by pl.Categorical columns results in wrong facets
I am not exactly sure what is going wrong, but suddenly all American cars are in the Europe facet, some European cars are in the Japan facet, the Japanese cars are in the correct facet, the 4-Cylinder cars are in the 5 and 6-Cylinder facets, etc. (There is probably some obvious pattern here which I am missing)
I checked the Vega-Lite output and I think the issue is the sort parameter of the resulting spec file.
What would you like to happen instead?
The same code with pl.String columns works as expected:
Which version of Altair are you using?
5.4.1