Schema transform #79

Closed
16 tasks done
rolyp opened this issue Sep 23, 2020 · 2 comments

Comments

@rolyp
Collaborator

rolyp commented Sep 23, 2020

See notebook. To do:

  • set the scene better for the error when calling lm.fit
  • have Ptype report on how it automates this step (i.e. summarise anomalous/erroneous rows)
  • rewrite bullet points at beginning into proper text
  • interweave explanatory text into Part 2

Done/dropped:

  • Part 2 seems to improve on Part 1 only in saving the user from converting ‘?’ to pd.NA
  • maybe duplication between Part 2 and Part 1 isn’t necessary – maybe present Ptype usage within a larger scenario?
  • better name for use case
  • rename transform_schema
  • show_schema should be a method on schema object – see Move cols property to Schema object #118
  • Ptype should describe inferred type as Int64, not integer – see Use Pandas names for Pandas datatypes #117
  • remove unnecessary imports
  • move summary above to beginning of notebook?
  • move Ptype imports to where they’re first used
  • rather than showing df.loc[130], point out that it’s not obvious where the problem is in the dataset
  • run query to show occurrences of ‘?’
  • remove metrics such as mean_absolute_error
  • instead of build_table_schema, why not just show dtypes again?
  • dtypes do not change automatically after replacing ‘?’ in Part 1; add an explicit step to convert them
  • explain dtype='str' step
  • inline scatter_plot helper
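
Two of the items above (querying for occurrences of ‘?’, and dtypes not changing after the replacement) can be illustrated with a minimal pandas sketch. The inline dataset and its column names are made up for illustration and are not taken from the notebook.

```python
import io

import pandas as pd

# Tiny illustrative dataset; column names and values are assumptions.
csv_text = "horsepower,mpg\n130,18.0\n?,25.0\n95,24.0\n?,21.0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Query the rows in which any cell is the literal string '?'.
mask = df.eq("?").any(axis=1)
print(df[mask])

# Replacing '?' with pd.NA leaves the column with a non-numeric dtype...
replaced = df.replace("?", pd.NA)
dtype_after_replace = replaced["horsepower"].dtype
print(replaced.dtypes)

# ...so an explicit conversion step is still needed.
replaced["horsepower"] = pd.to_numeric(replaced["horsepower"]).astype("Int64")
print(replaced.dtypes)
```

The nullable `Int64` dtype is used here so that the missing entries can stay in an integer column.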
@rolyp rolyp changed the title Use case: Infer column type Infer column type Sep 23, 2020
@rolyp rolyp changed the title Infer column type Use column type inferred by ptype Sep 23, 2020
@rolyp rolyp changed the title Use column type inferred by ptype Correct column type inference Sep 23, 2020
@rolyp rolyp mentioned this issue Sep 23, 2020
12 tasks
@tahaceritli
Collaborator

fixed some of the items above in #99

@rolyp
Collaborator Author

rolyp commented Oct 7, 2020

Summary:

Part 1 (solve data cleaning problem without Ptype):

  • import dataset using Pandas read_csv
  • run linear regression on data
  • error occurs because of missing data: could not convert string to float: ‘?’
  • inspect dtypes property of dataframe to see the problem
  • use Pandas to change encoding of missing data, remove relevant rows
  • run linear regression again (no errors), plot results
  • use dtypes to verify that we now have appropriate column types
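
A rough sketch of the Part 1 steps, under some assumptions: the inline dataset and column names are invented, and `np.polyfit` stands in for the notebook's `lm.fit` call so that the example needs only NumPy and pandas. Plotting is omitted.

```python
import io

import numpy as np
import pandas as pd

# Illustrative data only; '?' marks a missing horsepower value.
csv_text = "horsepower,mpg\n130,18.0\n165,15.0\n?,25.0\n95,24.0\n88,27.0\n"
df = pd.read_csv(io.StringIO(csv_text))


def fit_line(frame):
    """Least-squares fit of mpg against horsepower; returns (slope, intercept)."""
    x = np.asarray(frame["horsepower"], dtype=float)
    y = np.asarray(frame["mpg"], dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


# First attempt: the '?' string cannot be converted to a float.
try:
    fit_line(df)
    first_attempt_failed = False
except ValueError as err:
    first_attempt_failed = True
    print("fit failed:", err)

# Inspecting dtypes shows that horsepower was not read as a numeric column.
print(df.dtypes)

# Re-encode the missing data, drop the affected rows, then fix the dtype.
clean = df.replace("?", pd.NA).dropna()
clean = clean.astype({"horsepower": "float64"})

# Second attempt succeeds; dtypes confirm the column types.
slope, intercept = fit_line(clean)
print(clean.dtypes)
```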

Part 2 (how Ptype makes this problem easier):

  • import dataset using Pandas read_csv, but this time with dtype='str'
  • instantiate Ptype
  • ask Ptype to infer schema; show inferred types
  • ask Ptype to adjust type of dataframe to match schema
  • inspect transformed dataframe to verify types as expected
  • then proceed as in Part 1: remove missing data and continue
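
The shape of the Part 2 workflow can be sketched without Ptype itself. The code below is a pandas-only stand-in: `infer_schema`, `apply_schema` and `MISSING` are hypothetical names invented for this illustration and are not Ptype's API, and the inference rule is a crude regex check rather than Ptype's probabilistic model.

```python
import io

import pandas as pd

csv_text = "horsepower,mpg\n130,18.0\n?,25.0\n95,24.0\n"
# Read every column as a string so nothing is interpreted prematurely.
df = pd.read_csv(io.StringIO(csv_text), dtype="str")

MISSING = {"?"}  # assumed set of missing-data encodings


def infer_schema(frame):
    """Hypothetical helper: guess a pandas dtype name for each string column."""
    schema = {}
    for name in frame.columns:
        values = frame[name][~frame[name].isin(MISSING)]
        if values.str.fullmatch(r"[+-]?\d+").all():
            schema[name] = "Int64"
        elif pd.to_numeric(values, errors="coerce").notna().all():
            schema[name] = "Float64"
        else:
            schema[name] = "string"
    return schema


def apply_schema(frame, schema):
    """Hypothetical helper: blank out missing encodings, then cast each column."""
    out = frame.mask(frame.isin(MISSING))
    for name, dtype in schema.items():
        if dtype in ("Int64", "Float64"):
            out[name] = pd.to_numeric(out[name]).astype(dtype)
        else:
            out[name] = out[name].astype("string")
    return out


schema = infer_schema(df)
print(schema)

typed = apply_schema(df, schema)
print(typed.dtypes)

# From here on, proceed as in Part 1: drop rows with missing data.
complete = typed.dropna()
```

The point of the stand-in is only that the user no longer re-encodes ‘?’ by hand: the missing-data handling and the type conversion happen in a single schema-driven step.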

@rolyp rolyp changed the title Correct column type inference Schema transform Oct 7, 2020
@rolyp rolyp changed the title Schema transform Transform dataframe according to inferred schema Oct 7, 2020
@rolyp rolyp changed the title Transform dataframe according to inferred schema Schema transform Oct 7, 2020
@rolyp rolyp closed this as completed Oct 7, 2020