Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add pandas_compat.table_to_frame(tab) #3180

Merged
merged 2 commits into from
Aug 3, 2018
Merged

[ENH] Add pandas_compat.table_to_frame(tab) #3180

merged 2 commits into from
Aug 3, 2018

Conversation

apetrov
Copy link
Contributor

@apetrov apetrov commented Aug 2, 2018

Feature

Convert Orange.data.Table instance to pandas dataframe

Includes
  • Code changes
  • Tests
  • Documentation

@CLAassistant
Copy link

CLAassistant commented Aug 2, 2018

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@astaric astaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Could you add some tests?

Current tests for table_from_frame can be found here:
https://github.com/biolab/orange3/blob/master/Orange/data/tests/test_pandas.py

pandas.DataFrame
"""
def _column_to_series(col, vals):
print(col.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this was used for debugging. Please remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

def _column_to_series(col, vals):
print(col.name)
if col.is_discrete:
labels = [col.values[i] for i in np.vectorize(int)(vals)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vals.astype(int) avoids a call into python.

But what if some values are NaN?

Alternative:

return col.name, pd.Categorical.from_codes(codes=pd.Series(vals).fillna(-1).astype(int),
                                           categories=col.values, ordered=col.ordered)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion, thanks, fixed.

labels = [col.values[i] for i in np.vectorize(int)(vals)]
return (col.name, pd.Series(labels).astype('category'))
elif col.is_time:
return (col.name, pd.to_datetime(vals,unit='s').to_series())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces after commas.

if domain.attributes:
x = _columns_to_series(domain.attributes,tab.X)
if domain.class_vars:
y = _columns_to_series(domain.class_vars,tab.Y.reshape(tab.Y.shape[0],1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reshaping Y here doesn't work with multiple target columns. Perhaps:

table.Y.reshape(table.Y.shape[0], len(domain.class_vars))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@codecov-io
Copy link

codecov-io commented Aug 2, 2018

Codecov Report

Merging #3180 into master will increase coverage by 0.01%.
The diff coverage is 65.57%.

@@            Coverage Diff             @@
##           master    #3180      +/-   ##
==========================================
+ Coverage   82.59%   82.61%   +0.01%     
==========================================
  Files         340      340              
  Lines       58767    58847      +80     
==========================================
+ Hits        48541    48616      +75     
- Misses      10226    10231       +5

@apetrov
Copy link
Contributor Author

apetrov commented Aug 2, 2018

@astaric sure, added basic tests and looks good on Iris data set.
checking for datasets with datetime columns so I could test datetime columns properly.

python -m unittest Orange/data/tests/test_pandas.py

@apetrov
Copy link
Contributor Author

apetrov commented Aug 2, 2018

update: fixed pylinting.

@apetrov
Copy link
Contributor Author

apetrov commented Aug 2, 2018

ok, this function works across all datasets.

import os
from Orange.data import Table
from Orange.data.pandas_compat import table_to_frame
datasets = [ f.split('.')[0] for f in os.listdir("Orange/datasets") if '.tab' in f]
for ds in datasets:
    print(ds)
    table = Table(ds)
    table_to_frame(table)

I don't want to introduce it as a part of test suits since it's pretty slow:

python dataset_test.py  4.64s user 0.49s system 110% cpu 4.643 total

Convert Orange.data.Table instance to pandas dataframe

[FIX] pandas_compat.table_to_frame NaN values

[FIX] table_to_frame persist column order

[ENH] add tests for pandas_compat.table_to_frame

[FIX] pandas_compat.table_to_frame commas

[FIX] pandas_compat.table_to_frame support multipe target columns
@astaric
Copy link
Member

astaric commented Aug 3, 2018

I'd add it as a test, but mark it as @Skip, so it can be uncommented when needed.

You wrote in an earlier comment that the function did not work on a couple of datasets. Could you test on them as well? (not sure what the problem was)

@apetrov
Copy link
Contributor Author

apetrov commented Aug 3, 2018

@astaric ok, that's a good point.

Yes, it helped me to troubleshoot a problem with datetime column. works on any Orange build-in dataset

@apetrov
Copy link
Contributor Author

apetrov commented Aug 3, 2018

@astaric added my script as a part of test suit. actually looks pretty neat as if something fails it pinpoints the problem and exact dataset that function failed on.

@lanzagar lanzagar merged commit c30fa4c into biolab:master Aug 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants