#Fake Data
This data is available simply for visualization practice because it presents some interesting challenges. The clusterID in the last column of line_counts_clusterID.csv aligns with the results; however, the results do not contain all of the same keys.

Huge thanks to @zanstrong for the drawings and brainstorming! Also, thanks to @s1nelson, @micahstubbs, @shirleyxywu and @enjalot for all of the thinking with me!
###Fake Data Structure
- line_counts_clusterID.csv - the last column is the cluster label. The preceding 14 columns are boolean values that represent the presence of some features.
- cluster#_results - additional information for each cluster that provides details on the following:
  - Gender
  - Language
  - Interests
  - TV Genre
  - TV Show
  - Location: Country
  - Location: Region
  - Location: Metro
  - Device Category
  - Device Wireless Network
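
As a minimal sketch of the layout described above, the snippet below splits each CSV row into its 14 boolean features and its cluster label. The helper names (`parse_rows`, `load_line_counts`), the header row, and the way booleans are encoded (`1`/`true`) are assumptions for illustration, not something the files confirm.

```python
import csv
import io

N_FEATURES = 14  # from the description above: 14 boolean columns precede the label

# Assumption: booleans are stored as 1/0 or true/false; adjust to the real file.
TRUTHY = {"1", "true", "t", "yes"}


def parse_rows(rows, n_features=N_FEATURES):
    """Split each row into (features, cluster_id).

    The last column is the cluster label and the n_features columns before it
    are parsed as booleans. Any columns further to the left are ignored.
    """
    parsed = []
    for row in rows:
        if len(row) < n_features + 1:
            continue  # skip blank or short lines
        cluster_id = row[-1].strip()
        raw = row[-(n_features + 1):-1]
        features = [cell.strip().lower() in TRUTHY for cell in raw]
        parsed.append((features, cluster_id))
    return parsed


def load_line_counts(path, has_header=True):
    """Read the CSV from disk. has_header is an assumption about the file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # drop the header row
        return parse_rows(reader)


# Made-up row, for illustration only: 14 feature flags followed by a label.
sample = io.StringIO("1,0,1,0,0,0,0,0,0,0,0,0,0,1,3\n")
example = parse_rows(csv.reader(sample))
```

With the real file this would be called as `load_line_counts("line_counts_clusterID.csv")`, passing `has_header=False` if the file has no header row.
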
###Fake Data Visualization & Analysis
What is possible?

Analysis:
- Multinomial Logistic Regression (thanks @s1nelson)
- t-SNE (thanks @micahstubbs)
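
To make the multinomial logistic regression idea concrete, here is a small from-scratch sketch that predicts the cluster label from the boolean features using plain gradient descent. It is not the approach anyone above actually ran; the function names, learning rate, and epoch count are assumptions, and a library implementation would be the practical choice on the full data.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, features):
    """Class probabilities for one row; weights[k] = [bias, w_1, ..., w_d]."""
    x = [1.0] + [float(v) for v in features]
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(scores)


def fit_softmax_regression(X, y, epochs=200, lr=0.5):
    """Fit multinomial logistic regression with batch gradient descent.

    X: list of feature lists (0/1 or booleans); y: list of cluster labels.
    Returns (classes, weights).
    """
    classes = sorted(set(y))
    index = {label: k for k, label in enumerate(classes)}
    d = len(X[0])
    weights = [[0.0] * (d + 1) for _ in classes]
    for _ in range(epochs):
        grads = [[0.0] * (d + 1) for _ in classes]
        for features, label in zip(X, y):
            x = [1.0] + [float(v) for v in features]
            probs = predict_proba(weights, features)
            for k in range(len(classes)):
                # Gradient of cross-entropy: (predicted - actual) * input
                error = probs[k] - (1.0 if k == index[label] else 0.0)
                for j in range(d + 1):
                    grads[k][j] += error * x[j]
        for k in range(len(classes)):
            for j in range(d + 1):
                weights[k][j] -= lr * grads[k][j] / len(X)
    return classes, weights


def predict(classes, weights, features):
    """Return the most probable label for one row of features."""
    probs = predict_proba(weights, features)
    return classes[probs.index(max(probs))]
```

The fitted weights are the interesting part for visualization: a large positive weight for a feature under one label suggests that feature pushes rows toward that cluster.
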
Visualization:
- Percent presence of each feature per label.
- Heatmap (thanks @zanstrong)
- click merge on feature selection
- option to remove feature
- separation between heatmap rows to indicate distinctions
- How do the users arrive at the label?
- Node path ()
- Similar to a decision tree to showcase paths by which tweets receive a label.
- Parallel Coordinates (thanks @shirleyxywu and @enjalot)
- Turn on/off labels
- Foreground/background colors instead of 14 different colors
- Opacity for selected labels
- randomly arrange the feature ordering.
- Crossfilter (thanks @enjalot).
- The @zanstrong technique:
- Difference from the mean count of each item per cluster.
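
Two of the ideas in this list reduce to simple aggregations, sketched below: the percent presence of each feature per label (the numbers a heatmap would show) and the difference from the mean count of each item per cluster. The input shapes, the function names, and the reading of "mean" as the average of an item's count across clusters are assumptions. Treating an item that is missing from a cluster as a count of 0 is also an assumption, made because the results do not all share the same keys.

```python
from collections import defaultdict


def percent_presence(rows):
    """rows: iterable of (features, cluster_id) pairs with boolean features.

    Returns {cluster_id: [percent of rows where feature i is present, ...]}.
    """
    totals = defaultdict(int)
    hits = {}
    for features, cluster_id in rows:
        totals[cluster_id] += 1
        counts = hits.setdefault(cluster_id, [0] * len(features))
        for i, present in enumerate(features):
            counts[i] += bool(present)
    return {
        cid: [100.0 * c / totals[cid] for c in counts]
        for cid, counts in hits.items()
    }


def difference_from_mean(counts_by_cluster):
    """counts_by_cluster: {cluster_id: {item: count}}.

    Returns {cluster_id: {item: count minus that item's mean across clusters}}.
    Items missing from a cluster are counted as 0 (an assumption).
    """
    clusters = list(counts_by_cluster)
    items = sorted({item for c in counts_by_cluster.values() for item in c})
    means = {
        item: sum(counts_by_cluster[c].get(item, 0) for c in clusters) / len(clusters)
        for item in items
    }
    return {
        c: {item: counts_by_cluster[c].get(item, 0) - means[item] for item in items}
        for c in clusters
    }
```

Positive differences mark items that are over-represented in a cluster relative to the others, which maps naturally onto a diverging color scale.
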