
readCsv replaces missing values in last column with 0 #19

Closed
smurgle opened this issue Dec 9, 2021 · 6 comments · Fixed by #20


smurgle commented Dec 9, 2021

It is expected to use NaN, as it currently does when missing values happen in inner columns.


smurgle commented Dec 10, 2021

Apparently there are other situations where the unexpected 0 shows up, not only in the last column.
For instance, if I load a CSV like this into df:
w,x,y,z
1,10,0.1,100
2,ERR,inf,200
NaN,N/A,0.3,300
4,40,0.4,400

echo df.pretty()

Dataframe with 4 columns and 4 rows:
       Idx         w         x         y         z
    dtype:    object    object    object       int
         0         1        10       0.1       100
         1         2         0       inf       200
         2       NaN       N/A       0.3       300
         3         4        40       0.4       400
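(For completeness, the load step could look like this; a minimal sketch, assuming the CSV above is saved under the hypothetical file name data.csv:)

```nim
import datamancer

# "data.csv" is a hypothetical file holding the four-row CSV shown above
let df = readCsv("data.csv")
echo df.pretty()
```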

i.e. ERR gets replaced with 0 (or rather, it stays at the default-initialized int value, if I got your point correctly), while NaN and N/A are "just fine".
So possibly the issue to address is a bit broader and not related only to missing values in the last column. If some values, for whatever reason, are not mapped (ERR is a "special" case for the parser, I guess... possibly making something fail?), they are kept at... 0. I ran into this issue while trying to understand how to perform data cleaning with datamancer.
I know how to replace inf or ERR with NaN in pandas (ERR is maybe uncommon, but 'inf' and 'NaN' are not).
To remap ERR to NaN in the 'x' column I would do something like this:

df = df.replace("ERR", "NaN")    # or np.nan directly
df["x"] = df["x"].astype(float)  # NaN is a float in Python, unless using the nullable Int64 dtype (capital "I")

I still have to figure out IF and HOW I can do the same in datamancer.
Thank you.


Vindaar commented Dec 12, 2021

So, I've addressed the things that are broken in my opinion.

What I mean is: explicit appearances of NaN and Inf should be parsed correctly into NaN or Inf. Missing values are parsed as NaN (note: this implies that an integer column with missing values will be turned into a float column!).

However, the ERR and N/A cases you bring up are irrelevant in my opinion. These are just human ways of indicating something; they are turned into string values in the corresponding entries.

Note that parsing of the value ERR in particular was indeed buggy, due to it starting with E (as E can indicate an exponent in a float). That bug is fixed, so your given DF is now turned into:

Dataframe with 4 columns and 4 rows:
       Idx         w         x         y         z
    dtype:     float    object     float       int
         0         1        10       0.1       100
         1         2       ERR       inf       200
         2       nan       N/A       0.3       300
         3         4        40       0.4       400

I think this is reasonable.

To be honest, I'm not the biggest fan of having to convert the first column here into a float column, but it is what it is.

And sure, data cleaning is a very common task. But keep in mind that there is a reason that in the R community there is a whole package just for this, tidyr:
https://tidyr.tidyverse.org/

In any case, with the DF as it stands now, you can apply rules to the object column to clean it up as you see fit.


smurgle commented Dec 15, 2021

Hello Vindaar, that's excellent (it's more than reasonable). The new layout you show is exactly the one I expected, and much the same as what Python/pandas (I'm not an R guy... shame on me) would have produced.
NaN and Inf imply float in pandas too (pandas introduced the Int64 "nullable integer" type at some point... but even pandas' read_csv doesn't dare to use it as the default, and I understood that using it could risk breaking some code or causing weird behavior).
ERR and N/A are definitely expected to force the whole DF column (in pandas it is called a Series instead of a Tensor) to an object datatype (basically a string); pandas does the same. Here I have one newbie question on datamancer: what would be the most idiomatic / efficient way to replace (mutate in place, if that makes sense) e.g. 'N/A' and 'ERR' with '0' and then turn the whole DF column from string to int or float? Thank you. You are doing a great job.


Vindaar commented Dec 16, 2021

then what would be the most idiomatic / efficient way to replace (mutate in place if it makes sense) e.g. 'N/A' and 'ERR' with '0' and then turn the whole DF column from str to int or float?

What I would personally do here is the following:

df = df.mutate(f{Value -> int: "x" ~
  (if `x` == %~ "ERR" or `x` == %~ "N/A":
    0
  else:
    `x`.toInt)})

An explanation:

  • mutate takes a formula that either adds a new column or overwrites an existing one
  • we give it information about the type Value -> int, meaning input columns are read as Value (that's what an object column actually is) and the output shall be int
  • we overwrite the x column, hence "x" is left of the ~
  • ~ indicates we have a formula that creates a new full column
  • then comes the actual body of the formula. Important note here: because we have a more complex expression, we need the parentheses around the if. Otherwise the compiler will complain
  • in the body we simply refer to the column we access using back ticks
  • because it's a formula with ~, it means the code will run for each element of the column, hence the back ticked x is each element of the column x
  • we compare using the %~ operator, which is the (maybe a bit weird) operator to convert a regular type into a Value
  • if it's either of our "bad" values we return 0
  • else we return the existing value, but take out the existing integer from the Value by using toInt

See the documentation of the Value here:
https://scinim.github.io/Datamancer/value.html

There are other ways to do this of course, but this would be the most "idiomatic", if you will. Given the use case, though, it could be more intuitive to do it manually: get the tensor via df["x", Value], apply map_inline or something similar, and overwrite the column via df["x"] = df["x", Value].map_inline(...).
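That manual route might be sketched roughly as follows; this is untested and assumes Arraymancer's map_inline template (which injects the current element as x) works on a tensor of Value as described above:

```nim
import datamancer

# Pull the "x" column out as a Tensor[Value], clean each element,
# and write the result back. `%~` lifts a native value into a `Value`.
df["x"] = df["x", Value].map_inline:
  if x == %~ "ERR" or x == %~ "N/A":
    %~ 0
  else:
    %~ x.toInt
```

Note that the resulting column stays an object column here; a follow-up conversion would be needed if a plain int column is wanted.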


smurgle commented Dec 16, 2021

Thank you Vindaar. Illuminating. I tried some initial experiments in the same direction, but I struggled with the type to use for object columns (Value... I see... not string), and I would never have thought to use ~ or %~ that way. Explained examples like this are a goldmine for datamancer newcomers. To me this one is worth a further chapter in the datamancer data wrangling tutorial (maybe it's there and I've just missed it), especially because replacement (even if, in cases like this, "filtering out" could often be more appropriate) and data type conversion are routine operations when cleaning data.


Vindaar commented Dec 16, 2021

Yes, I understand that.

The documentation is, on the one hand, clearly still lacking, and on the other hand everything related to object columns is not the most intuitive. The name Value is older than the object column (a name I adopted for familiarity with pandas & numpy). And %~ comes from its similarity to the % operator for Nim's JsonNode.
Neither is great. I'm not sure about better ways, though. I'm all ears for proposals. toValue / toObject etc. could of course be added, but they are more verbose.

For the specific use case of comparisons, I suppose I could add overloads of == for Value with native types. That would at least hide some of the behind-the-scenes stuff. I didn't want to jump headlong into using converters etc., though, because I might regret it later.

It's a good idea for an additional section in the data wrangling tutorial, for sure! I've opened an issue for that:

SciNim/getting-started#37
