Classify data as anomalous #88

rolyp · 2020-09-23T16:57:11Z

See notebook. Summary:

Part 1 (False Negatives):

created a toy example
read dataset again and look at first few rows
instantiate ptype and fit schema
let the user change an_values
re-run inference and show the fixed predictions
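
The Part 1 steps can be illustrated without ptype itself. The sketch below is a simplified stand-in, not ptype's inference: `label_values` and the sample column are hypothetical, and only the name `an_values` comes from the list above. It shows a false negative (an error code read as a normal value) disappearing once the user adds the code to `an_values` and the labelling is re-run.

```python
def label_values(values, an_values):
    """Label a value 'anomalous' if it is listed in an_values, else 'normal'."""
    return ["anomalous" if v in an_values else "normal" for v in values]


# Hypothetical toy column: "-1" is an error code, not a real measurement.
column = ["12", "7", "-1", "30"]

an_values = ["error"]                      # the initial list misses "-1"
before = label_values(column, an_values)   # "-1" is labelled normal: a false negative

an_values.append("-1")                     # the user's correction
after = label_values(column, an_values)    # re-run with the corrected list
```

In ptype the labels come from a probabilistic model rather than a plain membership test, so this only mirrors the shape of the workflow (inspect, correct, re-run).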

Part 2 (False Positives):

load dataset using pandas.read_csv
read dataset again and look at first few rows
define an analytical task (visualization of counts per state?)
instantiate ptype and fit schema
show misclassified entries
let the user handle (potentially change the string alphabet?)
continue the visualization

To do:

new title that fits the use-cases
mention that Part 1 shows a false negative
intro text for Part 2; mention that it shows a false positive
Part 1 should be a “use case” (i.e. show why it matters that the inference is incorrect and how fixing it helps)
this example shouldn’t show how ptype can’t process parentheses in strings (does this apply to ‘?)
remove comment about making this column specific rather than global (see Column-specific missing/anomalous values #135)
need to subtract the anomalous probabilities for PFSM calculate_probability
is Part 2 really about reclassifying a value as “normal”, or extending the alphabet of a type (specifically, string)? And what does extending the alphabet of a type actually entail, more generally? Does it work for any type, or only for string?


BenjaminFraser · 2021-01-28T12:55:36Z

It's difficult to define precisely what should be classed as anomalous by default. One thing that might be worth considering, though, is including @ and / in the string alphabet; otherwise all email addresses and URLs will be classified as anomalous.

Not a huge problem, since, as you suggest, the user can simply add such characters to the string alphabet like so:

str_alphabet = ptype.get_string_alphabet()
str_alphabet.extend(["#", "+", "/"])
ptype.set_string_alphabet(str_alphabet)

Just wondering whether email addresses and URLs should work out of the box (given how often they occur in datasets), without any user adjustment?
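
To make the effect of the alphabet concrete, here is a minimal sketch that does not use ptype. The `outside_alphabet` helper and the default alphabet are assumptions made for illustration (ptype's actual string alphabet may differ). It lists the characters in a value that fall outside the alphabet, before and after "@" and "/" are added.

```python
import string


def outside_alphabet(value, alphabet):
    """Return the sorted characters of value that are not in alphabet."""
    return sorted({ch for ch in value if ch not in alphabet})


# Hypothetical default alphabet: letters, digits, space and a few punctuation marks.
alphabet = set(string.ascii_letters + string.digits + " .-_")

email = "jane.doe@example.org"
url = "example.org/data"

email_before = outside_alphabet(email, alphabet)   # "@" is not covered
url_before = outside_alphabet(url, alphabet)       # "/" is not covered

alphabet.update(["@", "/"])                        # extend the alphabet

email_after = outside_alphabet(email, alphabet)
url_after = outside_alphabet(url, alphabet)
```

Under this simplified view, a value is a candidate for the string type only when nothing is left outside the alphabet, which is why a single unsupported character is enough to flag every email address or URL.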

tahaceritli · 2021-01-28T13:36:00Z

Hi @BenjaminFraser,

Thanks for the question! I assume that this issue occurs when a data column contains values that are not supported by the string type (e.g., email addresses). If so, there may be an alternative solution!

You can let ptype treat such values as normal by modifying the data types it considers. For example, the following initialization lets ptype take the email address type into account (although I should say that we haven't experimented with it extensively):

types = ["integer", "string", "float", "boolean", "date-iso-8601", "date-eu", "date-non-std-subtype", "date-non-std", "EmailAddress"]
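# Pass the list to the constructor; the keyword name `_types` is an assumption, so check the Ptype signature.
ptype = Ptype(_types=types)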

With this, you should be able to annotate data columns with the "EmailAddress" label when appropriate, and treat such values as normal rather than anomalous. Does this sound helpful? Note that the "EmailAddress" type is already supported, but we'd need to create a new PFSM for URLs.
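
The idea of widening the set of candidate types can be sketched as follows. This is a deliberately simplified stand-in, not ptype's probabilistic model: the regular expressions, the `annotate` function and the majority-vote rule are all assumptions made for illustration. Each type is a pattern, the column gets the type that matches the most values, and values that do not match that type are reported as anomalous.

```python
import re

# Hypothetical recognisers; ptype uses PFSMs, not these regular expressions.
RECOGNISERS = {
    "integer": re.compile(r"[+-]?\d+"),
    "string": re.compile(r"[A-Za-z0-9 ._-]+"),
    "EmailAddress": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def annotate(values, types):
    """Pick the type matching the most values; return it with the non-matching values."""
    def matches(t):
        return sum(1 for v in values if RECOGNISERS[t].fullmatch(v))

    best = max(types, key=matches)  # ties go to the earliest type in the list
    anomalous = [v for v in values if not RECOGNISERS[best].fullmatch(v)]
    return best, anomalous


column = ["alice@example.org", "bob@example.org", "carol"]

without_email = annotate(column, ["integer", "string"])
with_email = annotate(column, ["integer", "string", "EmailAddress"])
```

Without the email type the column is labelled "string" and both addresses are reported as anomalous; with it, the column is labelled "EmailAddress" and only "carol" is reported.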

BenjaminFraser · 2021-01-28T13:58:25Z

Hi @tahaceritli ,

Thanks for the speedy response! The solution you provided is great and a neater one than manually adding @ to the string type alphabet!

I hadn't realised the EmailAddress field was already supported (apologies, I should have looked more diligently!).

tahaceritli · 2021-01-28T14:07:44Z

No worries. I hope it works (please let me know if you run into a problem). Perhaps I should prepare another notebook to demonstrate that.

rolyp added the task:use-cases label Sep 23, 2020

tahaceritli self-assigned this Oct 29, 2020

rolyp changed the title ~~Incorrect anomalous data inference~~ Classify data as anomalous Nov 2, 2020
