This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Issue 22: huge values are valid JSON #81

Open: wants to merge 23 commits into base: master
5d5b186
is JSON data preparation
aliamcami Mar 23, 2019
cd0ac0c
Quantitative analysts for json values
aliamcami Mar 23, 2019
970ea0e
Readme with overview of the findings about the quantitative analysts
aliamcami Mar 23, 2019
0272b1c
Sample comparasions for quantity of valid json values
aliamcami Mar 31, 2019
c5ec9b9
Data prep saving other samples
aliamcami Mar 31, 2019
c3fb738
Update readme with future questions
aliamcami Mar 31, 2019
1a5bcdb
Add of 'domain' column to data prep
aliamcami Mar 31, 2019
efc051e
Update jsJson_dataPrep to include an extra column with the md5 of val…
aliamcami Mar 31, 2019
2b617de
Rename 'isJson_Sample_Comparasion' to 'isJson_Quantitative_Comparasion'
aliamcami Mar 31, 2019
68700ec
Rename folder from ''2019_03_aliamcami_greatest_values_are_json' to '…
aliamcami Mar 31, 2019
0820cea
Removal of outdated notebook
aliamcami Mar 31, 2019
327429a
Add analyse for the correlation the domain and the value have with ea…
aliamcami Mar 31, 2019
4e18a11
Readme update - Quantitative_Comparasion overview
aliamcami Mar 31, 2019
a509bff
DataPrep cleanup and new 'json_keys' and 'json_schema' columns to dat…
aliamcami Apr 4, 2019
9e48a03
Remove Quantitative comparison and Add value distribution notebook
aliamcami Apr 8, 2019
699b066
Fix typo
aliamcami Apr 8, 2019
46c31d0
Removed fixed names, session organization, removed false positives fo…
aliamcami Apr 8, 2019
df6d843
Value distribution with new data that filtered json false positives
aliamcami Apr 8, 2019
b77dccf
Add new notebookt 'isJson_Occurrence_of_operation_symbols_domains.ipynb'
aliamcami Apr 17, 2019
2be179c
Clean run of the dataPrep with all columns
aliamcami Apr 17, 2019
f30f68a
Add isJson_Identify_Source.ipynb
aliamcami Apr 22, 2019
e1ee1f2
Remove isJson_correlation_domain_and_value.ipynb
aliamcami Apr 22, 2019
3b80915
Add isJson_Script_Domain_Output.ipynb and update readme
aliamcami Apr 22, 2019
87 changes: 87 additions & 0 deletions analyses/2019_03_aliamcami_value_analyses/README.md
@@ -0,0 +1,87 @@
# Overview

## JSON
All of the largest values are JSON, but they make up only a very small percentage of the whole data.

### Most of the data have a small value_len
(mean = 1356 for the 10% sample)
- 95.58% of the data have a value_len smaller than the mean
- 4.42% are bigger than the mean
- 9.35% are valid JSON

### Values above the mean:
- 61.54% are NOT valid JSON
- 38.46% are valid JSON

### Values that are 1 standard deviation (std) above the mean
(std = 26310 for 10% sample):
- 0.11% are NOT valid JSON
- 99.88% are valid JSON
- The bigger the value, the greater the chance that it is valid JSON

### Values 4 std above the mean
- 100% are valid JSON
- The biggest non-JSON value has a length of 104653

##
The 46745 values with the greatest value_len are all valid JSON; that is 9.35% of the filtered sample (value_len > mean) and 0.41% of the original 10% sample.
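
The threshold comparisons above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code from the notebooks: the helper names `is_valid_json` and `json_share_above` are hypothetical, treating only JSON objects and arrays as valid is an assumption about how false positives were filtered, and the choice of the sample standard deviation is also an assumption.

```python
import json
import statistics


def is_valid_json(value):
    """Return True if value parses as a JSON object or array."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return False
    # Assumption: bare numbers, strings, booleans and null are treated as
    # false positives, so only containers count as JSON here.
    return isinstance(parsed, (dict, list))


def json_share_above(values, threshold):
    """Fraction of values longer than threshold that are valid JSON."""
    selected = [v for v in values if len(v) > threshold]
    if not selected:
        return None
    return sum(is_valid_json(v) for v in selected) / len(selected)


# Toy data standing in for the value column of the sample.
values = ["abc", "42", "x=1;y=2", '{"a": 1}', '{"items": [1, 2, 3], "name": "example"}']
lengths = [len(v) for v in values]
mean = statistics.mean(lengths)
std = statistics.stdev(lengths)  # sample std; an assumption

share_above_mean = json_share_above(values, mean)
share_above_one_std = json_share_above(values, mean + std)
```

On the real sample, `mean` and `std` would be computed from the value_len column, and the same function could be called with `mean + 4 * std` for the last threshold.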

---
## Correlation of location_domain and value

- Some domains produce only a single type of output (31%).
- 99% of the domains with a single type of output do not produce JSON.


- 31% of all domains can produce JSON.
- Only 0.016% of all domains always have JSON as output, and fewer than half of those always produce the same JSON.


- One JSON is usually (83.09%) produced by a single script domain.
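
Both kinds of figures in this section ("domains with a single type of output" and "JSON produced by a single script domain") reduce to the same question: for each key, how many distinct partners does it appear with? A minimal sketch under assumptions: the function name `share_with_single_partner`, the `"json"`/`"text"` output labels, and the example domains are all hypothetical and do not come from the notebooks.

```python
from collections import defaultdict


def share_with_single_partner(pairs):
    """Fraction of keys that appear with exactly one distinct partner.

    pairs: iterable of (key, partner) tuples.
    """
    partners = defaultdict(set)
    for key, partner in pairs:
        partners[key].add(partner)
    if not partners:
        return None
    single = sum(1 for seen in partners.values() if len(seen) == 1)
    return single / len(partners)


# (location domain, output type) pairs; the labels are illustrative only.
domain_outputs = [
    ("a.example", "json"),
    ("a.example", "json"),
    ("b.example", "text"),
    ("b.example", "json"),
    ("c.example", "text"),
]
single_type_share = share_with_single_partner(domain_outputs)

# (JSON value, script domain) pairs.
json_sources = [
    ('{"id": 1}', "cdn.example"),
    ('{"id": 1}', "cdn.example"),
    ('{"id": 2}', "cdn.example"),
    ('{"id": 2}', "ads.example"),
]
single_source_share = share_with_single_partner(json_sources)
```

For large samples, hashing each value first (the data prep commits mention an md5 column) would keep the sets small.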


---

# Future questions

## About JSONs:
- **Are the JSON values always from the same location or related domains?***
- **Is there a set of location domains that always produces JSON?***
- Do the JSON values follow a structural pattern? What pattern?
- What data does the JSON hold? Is there any pattern on content?
- Do they contain nested JSON? CSS? HTML? JavaScript? This calls for a recursive study of the JSON properties.

- Is a JSON's structure for a single script_url domain always the same?
- Is every JSON with the same structure produced by the same script_url domain?

<sub> *See notebook 'isJson_Quantitative_Comparasion.ipynb' for more information</sub>
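
One way to approach the structure questions is to reduce each JSON value to a description of its shape and compare shapes rather than contents. The sketch below is only an illustration: `json_schema` and `same_structure` are hypothetical helpers, and this definition of "structure" (key names plus leaf types, with list elements collapsed to their distinct shapes) is an assumption, not necessarily what the 'json_schema' column in the data prep contains.

```python
import json


def json_schema(node):
    """Describe the shape of a parsed JSON value, ignoring leaf contents."""
    if isinstance(node, dict):
        return {key: json_schema(child) for key, child in node.items()}
    if isinstance(node, list):
        # Keep each distinct element shape once, in first-seen order.
        shapes = []
        for child in node:
            shape = json_schema(child)
            if shape not in shapes:
                shapes.append(shape)
        return shapes
    if node is None:
        return "null"
    return type(node).__name__


def same_structure(value_a, value_b):
    """True if two JSON strings share the same shape."""
    return json_schema(json.loads(value_a)) == json_schema(json.loads(value_b))


matching = same_structure('{"a": 1, "b": [1, 2]}', '{"b": [7], "a": 99}')
different = same_structure('{"a": 1}', '{"a": "1"}')
```

Grouping rows by the shape (for example, by a serialized form of it) and then by script_url domain would show whether one shape maps to one domain and vice versa.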

## General
I think some of these questions may call for investigating the crawler or simply reading the wiki, since someone may have already described and explained them. I just need to find, read, and understand that material.

- Are there other valid data types like html, css... in the values column or just JSON?
- Where does the value come from? What is it used for?

## Small: value_len < mean
- What are the small values?
- Do the smaller values have any pattern?
- What is the majority data type?

## Medium: mean < value_len < (mean + std)
- How many rows are there in the intersection of *“no JSON”* and *“everything is JSON”*?
- What are they? Are they from a specific script_url domain? Or related domains?

## Big: value_len > (mean + std)
- What are the big non-JSON values?
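
The three size groups above could be assigned with a small helper like the one below. This is a sketch: `size_bucket` is a hypothetical name, and placing values exactly equal to a boundary in the medium group is an assumption, since the headings leave the boundaries open. The mean and std are the figures reported for the 10% sample.

```python
from collections import Counter

MEAN = 1356   # mean value_len reported for the 10% sample
STD = 26310   # std reported for the 10% sample


def size_bucket(value_len, mean=MEAN, std=STD):
    """Classify a value length as 'small', 'medium' or 'big'."""
    if value_len < mean:
        return "small"
    if value_len <= mean + std:
        # Assumption: boundary values fall into the medium group.
        return "medium"
    return "big"


# Toy lengths; 104653 is the biggest non-JSON length reported above.
lengths = [12, 980, 2000, 27000, 104653]
bucket_counts = Counter(size_bucket(n) for n in lengths)
```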

## Security and data sharing:
- Do the value columns contain any JavaScript? Nested JavaScript?
- Do the JavaScript snippets in the dataset contain known malicious behaviors?
- Can they collect data that threatens users' privacy?

## Statistical knowledge / coincidence:
The **mean** of the original 10% sample is pretty similar to the **std** of the sample taken after filtering for values above the mean.
- Why?
- Is it a coincidence?
- Is it always like this?
- Is it a statistical pattern?
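
To check whether this holds on other samples, the two quantities can be computed side by side. A minimal sketch on toy data: `mean_and_filtered_std` is a hypothetical helper, and using the sample standard deviation is an assumption.

```python
import statistics


def mean_and_filtered_std(lengths):
    """Return the mean of lengths and the std of the lengths above that mean."""
    mean = statistics.mean(lengths)
    above = [n for n in lengths if n > mean]
    if len(above) < 2:
        # The sample std needs at least two data points.
        return mean, None
    return mean, statistics.stdev(above)


# Toy value_len data, not taken from the real sample.
lengths = [1, 2, 3, 4, 100, 200]
mean, filtered_std = mean_and_filtered_std(lengths)
```

Running this over several independent samples of the dataset would show whether the two numbers stay close or whether the match was specific to one sample.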