Issue 22: huge values are valid JSON #81

aliamcami · 2019-03-23T03:18:43Z

The question that originated this analyse is: "Are all big values valid JSON?"

Overview

All the greatest values are JSON, but they represent very little percentage of the whole data.

Most of the data have small value_len

(mean = 1356 for the 10% sample)

95,58% of the data have value_len smaller than the mean
4,42% are bigger than the mean
9.35% are valid JSON

Values above the mean:

61,54% are NOT valid JSON
38,46% are valid JSON

Values that are 1 standard deviation (std) above the mean

(std = 26310 for 10% sample):

0,11% are NOT valid JSON
99,88% are valid JSON
The bigger the value the greater the chance of being a valid JSON

Values 4 std above the mean

100% are valid JSON
The biggest non-JSON value have the length of 104653

The top 46745 gratest value_len are valid JSONs, that is 9.35% of the filtered sample (value_len > mean) and 0,41% of the original 10% sample.

aliamcami · 2019-03-24T04:49:58Z

I was questioned by @birdsarah:

"what are your next questions? i'm keen to see from you what questions this work has thrown up for you - are there groups / themes to these questions? if you were concerned about tracking / privacy what would you look at next?"

So, I organized some of my questioning in groups/themes and what I got is the following:

About JSONs:

The JSON values are always from the same location or related domains?
Are there a set of location domains that always produces a JSON?
Does the JSON values follow a structure pattern? What pattern?
What data does the JSON hold? Is there any pattern on content?
Do they have nested JSON? Css? Html? Javascript? Recursive study on JSON properties.
Is a JSON's structure for a single script_url domain always the same?
Is every JSON with the same structure produced by the same script_url domain?

General

I'm think some things here maybe a crawler investigation or just wiki reading, since someone may have already described and explained. I just need to find, read and understand it.

Are there other valid data types like html, css... in the values column or just JSON?
Where does the value comes from? What is it used for?

Smal: value_len < mean

What are the small values?
Does the smaller values have any pattern?
What the majority data type?

Medium: mean < value_len < (mean + std)

How many rows are there in the intersection of “no JSON” and “everything is JSON” ?
What are they? Are they from a specific script_url domain? Or realated domains?

Big: value_len > (mean + std)

What are the big non-JSON values?

Security and data sharing:

Do the value columns have any javascript? nested javascript?
Do the javascripts in the dataset contain known malicious behaviors?
Can they collect data that threatens user's privacy?

if you were concerned about tracking / privacy what would you look at next?

I would love to deeper analyze the javascripts, but that’s a whole other area of knowledge. I think I can study common patterns of privacy intrusion and malicious behavior in javacript and try to correlate with the scripts present in the dataset. A related analysis to what was done in the medium article with cryptocoin mining scripts.

Statistical knowledge / coincidence:

The mean of the original 10% sample is pretty similar to the std of the sample taken after filtering for values above the mean

why?
Is it a coincidence?
Is it always like this?
Is it a statistical pattern?

birdsarah

This is a great start.

As I mentioned before I'd really like to see some visualizations. In particular:

a histogram, or something like it, of the value_len column. I think this will help you answer some of your own questions about the mean, standard deviation, etc. It's important to think about how the shape of our data can affect summary statistics like the mean.
a plot of % json compared to, say, minimum value_len. You could start with your subset of everything over total mean (1,356) to get a feel for it.

Your follow-on questions that you posted are great. I'd like to see them included in the PR - perhaps in the README - it's likely that I'll want to turn many of them into issues later. But we'll cross that bridge when we come to it.

I think the most interesting questions which are the natural from where you are that I'd love to see you tackle one or two of are:

The JSON values are always from the same location or related domains?
Are there a set of location domains that always produces a JSON?
Does the JSON values follow a structure pattern? What pattern?
What data does the JSON hold? Is there any pattern on content?
What are the big non-JSON values?

Also, if you want to output samples of these values and save them as text files as part of your folder, or gists, that might help give the future reader context.

There are some notebook cleanups that I would eventually like to be done before merging, but they are not as important. Things like using: values (such as the mean) wherever possible rather than manually copying, and making sure the narrative all makes sense when reading the whole notebook.

aliamcami · 2019-03-27T02:32:47Z

Thank you for your incredible review.
I updated to WIP and I'll leave it until the follow is ready:

Study and implement how to best plot the requested graphs
Make a readme with those questions
Cleanup the notebook

About the values hardcoded, I actually left them hardcoded to eliminate the need to recalculate them every time I started the notebook, since it does take quite some time for me. Should I have a file with this saved then? Or variables holding the hardcoded value? Or leave it to be calculated every time?

About the follow up questions, should I open a new PR specifically for each of them or increment this one when I start to tackle them?

birdsarah · 2019-03-27T16:31:44Z

"I actually left them hardcoded to eliminate the need to recalculate them every time I started the notebook" I understand. There are trade-offs. In the case you continue hard coding, you can still reduce - there's perhaps one or two locations where you need to set that value. Places where you are just writing text to document your result, use string formatting to print out the text you want to say with the value embedded. The downside of hard coding is that when you get new data you need to remember to update those values and people coming new to run your code may not know where the number came from (which data, which field) I would suggest it's better to not hard code, but to save a derived dataset e.g. the data with only values greater than the mean. Then you can start again from that point half way down your code with those values. And yes, as you said, if it's necessary perhaps to save a file with the values stored. That way you can run the notebook and repopulate from fresh data easily. There are lots of judgement calls here and no right answers just thinking through trade-offs of maintainability, and readability. While you don't generally check in data. I think small datasets (e.g. the means) can be checked in (and I often do this). Hope this helps. Again, no right answers here. Just craft.

…

On March 26, 2019 9:32:47 PM CDT, Camila Oliveira ***@***.***> wrote: Thank you for your incredible review. I updated to WIP and I'll leave it until the follow is ready: - Study and implement how to best plot the requested graphs - Make a readme with those questions - Cleanup the notebook About the values hardcoded, I actually left them hardcoded to eliminate the need to recalculate them every time I started the notebook, since it does take quite some time for me. Should I have a file with this saved then? Or variables holding the hardcoded value? Or leave it to be calculated every time? -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: #81 (comment)

…ue columns

…2019_03_aliamcami_value_analyses'

…ch other

aliamcami · 2019-03-31T23:04:50Z

@birdsarah I have included the following:

Graphs for visualisation for the previous analysis
- (notebook updated to: "isJson_Quantitative_Comparasion.ipynb")
Research/analysis on how the location domain correlates to the value column
- (new notebook named: "isJson_correlation_domain_and_value.ipynb")
Readme with the future questions and overview update
Notebook cleanup

birdsarah

Great work!

isJson_dataPrep:

still needs clean-up - if I tried to run it in order it actually wouldn't give the same results as you currently have.
i'm suspicious about your notebook because it's not showing a warning for the rows df['location_domain'] = df.location.apply(extract_domain) dask should complain about the lack of specified meta attribute df['location_domain'] = df.location.apply(extract_domain, meta='O') is what you need
pro dask tip (1) - whenever you're significantly reducing the size of your data use df.repartition() to put your data into a proportionately smaller number of partitions. this will make future computation a little quicker as there's overhead associated with opening and closing each partition
pro dask tip (2) - seeing a lot of nanny memory error - sigh working with the value column is hard. in this case i have found it generally works for me to not use the distributed client. this makes the processing go slower but it is generally very reliable. To do this never run the client = Client() cell. Instead:

from dask.diagnostics import ProgressBar
# set up your dataframe
with ProgressBar():
    df.to_parquet(.....)

You should get a progress bar to see how it's going. Because you're not using dask distributed you won't get any other kind of insight on progress.

isJsoncorrelationDomain:

pie charts.....not my favorite :) some people say they should never be used. They do have at least one specific application - showing parts of a whole comparsions. But there use is definitely not needed here. A simple bar chart is probably the most appropriate viz here. People find it easy to compare heights.
Overall this notebook is hard for me to follow. The code is very compact in a way that's not necessarily bad but is hard to scan - using more verbose variable names would increase readability. Perhaps a little more text along the way, perhaps in markdown columns to explain each plot.
I like your use of md5 to figure out more efficiently if values are exactly the same. I can think of some potential shortcomings but it's a fine start: in particular, you're hashing a string. but the string can be different even if the json data is the same e.g. different key order. or what if one is a subset of the other
you have chosen to examine location_domain. location is where the action was happening, but script_url is the script that actually did the getting or setting of the json.
consider using the "operation" column to see if these are multiple reads of the same data or the same data being set over and over.

isJson_QuantitativeComparison:

I'm curious about the variable name cdf that you chose - what does it stand for in your mind - computed dataframe?
plot in cell 5 - nice. - consider using a log axis on the y axis so you can pick up more detail on the right side of the graph
plot in cell 7 - right idea - a few tweaks could be a lot more illuminating. firstly consider the fact that you're now comparing two populations of different sizes so the absolute frequency is less interesting than changing them all into % so you can see that, say 40% of is_json=True is at value x1, an 40% of is_json=False is at value x2 (a quick search found this reference https://www.stat.auckland.ac.nz/~ihaka/787/lectures-distrib.pdf which looks good but is r focused). A very quick rework of your histogram, with and without log axes looks like this:

plot in cell 15 - definitely should be a bar chart
cell 16 and 17, super excited to see you getting stuck in on statistics - for submission of the final analysis only include these where you have a specific link back to a property of the dataset / point you're trying to make.
I still want to see the plot x axis = value_len (really this value len or lower) and y axis = % valid json

birdsarah · 2019-04-01T01:20:08Z

I still want to see the plot x axis = value_len (really this value len or lower) and y axis = % valid json

This would ideally be across all values not just above the mean.

aliamcami · 2019-04-01T01:37:53Z

Thank you for the amazing review, I have a better idea what to do (and how). Thank you!

…aPrep final sample output

…r valid jsons

aliamcami added 3 commits March 23, 2019 00:02

is JSON data preparation

5d5b186

Quantitative analysts for json values

cd0ac0c

Readme with overview of the findings about the quantitative analysts

970ea0e

birdsarah suggested changes Mar 26, 2019

View reviewed changes

aliamcami changed the title ~~Issue 22: huge values are valid JSON~~ [WIP] Issue 22: huge values are valid JSON Mar 27, 2019

birdsarah changed the title ~~[WIP] Issue 22: huge values are valid JSON~~ Issue 22: huge values are valid JSON [WIP] Mar 28, 2019

aliamcami added 10 commits March 31, 2019 13:57

Sample comparasions for quantity of valid json values

0272b1c

Data prep saving other samples

c5ec9b9

Update readme with future questions

c3fb738

Add of 'domain' column to data prep

1a5bcdb

Update jsJson_dataPrep to include an extra column with the md5 of val…

efc051e

…ue columns

Rename 'isJson_Sample_Comparasion' to 'isJson_Quantitative_Comparasion'

2b617de

Rename folder from ''2019_03_aliamcami_greatest_values_are_json' to '…

68700ec

…2019_03_aliamcami_value_analyses'

Removal of outdated notebook

0820cea

Add analyse for the correlation the domain and the value have with ea…

327429a

…ch other

Readme update - Quantitative_Comparasion overview

4e18a11

aliamcami changed the title ~~Issue 22: huge values are valid JSON [WIP]~~ Issue 22: huge values are valid JSON Mar 31, 2019

birdsarah suggested changes Apr 1, 2019

View reviewed changes

aliamcami changed the title ~~Issue 22: huge values are valid JSON~~ Issue 22: huge values are valid JSON [WIP] Apr 1, 2019

aliamcami added 5 commits April 3, 2019 23:14

DataPrep cleanup and new 'json_keys' and 'json_schema' columns to dat…

a509bff

…aPrep final sample output

Remove Quantitative comparison and Add value distribution notebook

9e48a03

Fix typo

699b066

Removed fixed names, session organization, removed false positives fo…

46c31d0

…r valid jsons

Value distribution with new data that filtered json false positives

df6d843

aliamcami added 5 commits April 17, 2019 19:40

Add new notebookt 'isJson_Occurrence_of_operation_symbols_domains.ipynb'

b77dccf

Clean run of the dataPrep with all columns

2be179c

Add isJson_Identify_Source.ipynb

f30f68a

Remove isJson_correlation_domain_and_value.ipynb

e1ee1f2

Add isJson_Script_Domain_Output.ipynb and update readme

3b80915

aliamcami changed the title ~~Issue 22: huge values are valid JSON [WIP]~~ Issue 22: huge values are valid JSON Apr 22, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Issue 22: huge values are valid JSON #81

Issue 22: huge values are valid JSON #81

Uh oh!

aliamcami commented Mar 23, 2019

Uh oh!

aliamcami commented Mar 24, 2019

Uh oh!

birdsarah left a comment

Uh oh!

aliamcami commented Mar 27, 2019 •

edited

Loading

Uh oh!

birdsarah commented Mar 27, 2019 via email

Uh oh!

aliamcami commented Mar 31, 2019

Uh oh!

birdsarah left a comment

Uh oh!

birdsarah commented Apr 1, 2019

Uh oh!

aliamcami commented Apr 1, 2019

Uh oh!

Uh oh!

Issue 22: huge values are valid JSON #81

Are you sure you want to change the base?

Issue 22: huge values are valid JSON #81

Uh oh!

Conversation

aliamcami commented Mar 23, 2019

Overview

Most of the data have small value_len

Values above the mean:

Values that are 1 standard deviation (std) above the mean

Values 4 std above the mean

Uh oh!

aliamcami commented Mar 24, 2019

About JSONs:

General

Smal: value_len < mean

Medium: mean < value_len < (mean + std)

Big: value_len > (mean + std)

Security and data sharing:

Statistical knowledge / coincidence:

Uh oh!

birdsarah left a comment

Choose a reason for hiding this comment

Uh oh!

aliamcami commented Mar 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

birdsarah commented Mar 27, 2019 via email

Uh oh!

aliamcami commented Mar 31, 2019

Uh oh!

birdsarah left a comment

Choose a reason for hiding this comment

Uh oh!

birdsarah commented Apr 1, 2019

Uh oh!

aliamcami commented Apr 1, 2019

Uh oh!

Uh oh!

aliamcami commented Mar 27, 2019 •

edited

Loading