Classify data as anomalous #88

Open
1 of 8 tasks
rolyp opened this issue Sep 23, 2020 · 4 comments
@rolyp
Collaborator

rolyp commented Sep 23, 2020

See notebook. Summary:

Part 1 (False Negatives):

  • create a toy example
  • read dataset again and look at first few rows
  • instantiate ptype and fit schema
  • let the user change an_values
  • re-run inference and show the fixed predictions

Part 2 (False Positives):

  • load dataset using pandas.read_csv
  • read dataset again and look at first few rows
  • define an analytical task (visualization of counts per state?)
  • instantiate ptype and fit schema
  • show misclassified entries
  • let the user handle (potentially change the string alphabet?)
  • continue the visualization
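
A minimal sketch of the Part 2 workflow, assuming a toy CSV in place of the notebook's dataset. The column names, the values, and the `flagged` set are hypothetical; in the notebook the flagged entries would come from ptype's inference rather than being hard-coded. It only illustrates how a false positive changes the counts per state when flagged values are dropped before the analysis.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the notebook's dataset.
csv_text = """state,amount
CA,10
NY,20
N.Y.,5
CA,7
"""

# Read everything as raw text so no value is coerced before type inference.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(df.head())

# Hypothetical: values that an inference step wrongly flagged as anomalous.
flagged = {"N.Y."}

# Counts per state when flagged entries are dropped versus kept.
counts_dropped = df.loc[~df["state"].isin(flagged), "state"].value_counts()
counts_kept = df["state"].value_counts()

print(counts_dropped)
print(counts_kept)
```

If the flagged value is in fact legitimate, the first count silently loses a row, which is the kind of downstream effect the use case is meant to show.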

To do:

  • new title that fits the use-cases
  • mention that Part 1 shows a false negative
  • intro text for Part 2; mention that it shows a false positive
  • Part 1 should be a “use case” (i.e. show why it matters that the inference is incorrect and how fixing it helps)
  • this example shouldn’t show how ptype can’t process parentheses in strings (does this apply to ‘?)
  • remove comment about making this column specific rather than global (see Column-specific missing/anomalous values #135)
  • need to subtract the anomalous probabilities for PFSM calculate_probability
  • is Part 2 really about reclassifying a value as “normal”, or extending the alphabet of a type (specifically, string)? And what does extending the alphabet of a type actually entail, more generally? Does it work for any type, or only for string?

tahaceritli self-assigned this Oct 29, 2020
rolyp changed the title from "Incorrect anomalous data inference" to "Classify data as anomalous" Nov 2, 2020
@BenjaminFraser

BenjaminFraser commented Jan 28, 2021

It's difficult to define precisely what should be classed as anomalous by default. One thing that might be worth considering, though, is including @ and / in the string alphabet; otherwise all email addresses and URLs will be classified as anomalous.

Not a huge problem, since, as you suggest, the user can simply add these characters to the alphabet like so:

str_alphabet = ptype.get_string_alphabet()
str_alphabet.extend(["#", "+", "/"])
ptype.set_string_alphabet(str_alphabet)
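
To make the effect of the alphabet concrete, here is a rough sketch in plain Python. It is not how ptype scores values (ptype uses probabilistic state machines); the starting alphabet below is an assumption and the helper `outside_alphabet` is hypothetical. It only shows which characters in an email address or URL would fall outside a letters-and-digits alphabet, and that adding them removes the mismatch.

```python
import string


def outside_alphabet(value, alphabet):
    """Return the characters of value not in alphabet, in order of first appearance."""
    seen = []
    for ch in value:
        if ch not in alphabet and ch not in seen:
            seen.append(ch)
    return seen


# Assumed starting alphabet: letters, digits and space. ptype's default may differ.
alphabet = set(string.ascii_letters + string.digits + " ")
values = ["alice", "alice@example.com", "https://example.com/a"]

before = {v: outside_alphabet(v, alphabet) for v in values}

# Extend the alphabet with the characters that email addresses and URLs need.
alphabet.update(["@", "/", ".", ":"])
after = {v: outside_alphabet(v, alphabet) for v in values}

print(before)
print(after)
```

Under these assumptions an email address needs "." as well as "@", and a URL needs ":" as well as "/", so it is worth checking which of these a given default alphabet already contains.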

Just wondering if email addresses and URLs should work straight out of the box (given how common they are in datasets), without any user adjustment required?

@tahaceritli
Collaborator

tahaceritli commented Jan 28, 2021

Hi @BenjaminFraser,

Thanks for the question! I assume that this issue occurs when a data column contains values that are not supported by the string type (e.g., email addresses). If so, there may be an alternative solution!

You can let ptype treat such values as normal values by modifying the data types it considers. For example, the following initialization lets ptype take into account the email address type (although I should say that we haven't made extensive experiments with it):

types = ["integer", "string", "float", "boolean", "date-iso-8601", "date-eu", "date-non-std-subtype", "date-non-std", "EmailAddress"]
ptype = Ptype(_types=types)  # assumes the constructor accepts the type list as `_types`

With this, you should be able to annotate data columns with the "EmailAddress" label when appropriate, and treat such values as normal rather than anomalous. Does this sound helpful? Note that the "EmailAddress" type is already supported, but we'd need to create a new PFSM for URLs.
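
As a rough illustration of why adding a type changes which values count as anomalous, the sketch below uses regular expressions as stand-ins for the per-type machines. This is an assumption made for brevity, not ptype's model: the patterns, the `MATCHERS` table and the `annotate` helper are all hypothetical, and the column is labelled by a simple majority of matches rather than by probabilities.

```python
import re

# Hypothetical stand-ins for per-type recognisers (ptype uses PFSMs instead).
MATCHERS = {
    "integer": re.compile(r"[+-]?\d+"),
    "string": re.compile(r"[A-Za-z0-9 ]+"),
    "EmailAddress": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def annotate(column, types):
    """Label the column with the type matching most values; the rest are anomalies."""
    def hits(type_name):
        return sum(bool(MATCHERS[type_name].fullmatch(v)) for v in column)

    best = max(types, key=hits)  # ties go to the earliest type in the list
    anomalies = [v for v in column if not MATCHERS[best].fullmatch(v)]
    return best, anomalies


column = ["a@x.org", "b@y.com", "unknown"]

without_email = annotate(column, ["integer", "string"])
with_email = annotate(column, ["integer", "string", "EmailAddress"])

print(without_email)
print(with_email)
```

In this toy setting the email addresses stop being anomalies once the email type is available, and the placeholder "unknown" becomes the odd one out, without touching the string alphabet at all.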

@BenjaminFraser

BenjaminFraser commented Jan 28, 2021

Hi @tahaceritli ,

Thanks for the speedy response! The solution you provided is great and a neater one than manually adding @ to the string type alphabet!

I hadn't realised the EmailAddress field was already supported (apologies, I should have looked more diligently!).

@tahaceritli
Collaborator

No worries. Hope it works (please let me know if you run into a problem). Perhaps I should prepare another notebook to demonstrate that.

3 participants