Label the darn axes, NO BAD IDEAS #12

Open
jwzimmer-zz opened this issue Sep 19, 2021 · 12 comments
jwzimmer-zz commented Sep 19, 2021

From trying to come up with what visuals I want in the paper, it has become clear I absolutely can't avoid labeling the axes anymore. I keep not doing it because I'm worried I'll do it wrong. So this is the No Bad Ideas version. If it's stupid I'm sure Dodds will let me know.

Basic idea: Which traits are most important to each "dimension"? Those are the traits with the most extreme weights in each ROW of V. Which characters best exemplify each "dimension"? Those are the characters with the most extreme weights in each COLUMN of U. How much more important is the first "dimension" than the second? That is given by the relevant WEIGHT in Sigma.
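A rough numpy sketch of this reading of U, Sigma, and V (synthetic data; the trait and character names here are made up, not from the real ratings matrix):

```python
import numpy as np

# Hypothetical tiny ratings matrix: rows = characters, columns = traits.
traits = ["kind", "brave", "tidy", "loud"]
characters = ["A", "B", "C"]
M = np.array([[70.0, 20.0, 55.0, 40.0],
              [30.0, 80.0, 45.0, 60.0],
              [50.0, 50.0, 90.0, 10.0]])

# SVD with the overall mean removed
U, sigma, Vt = np.linalg.svd(M - M.mean(), full_matrices=False)

# Traits most important to dimension 0: extreme weights in ROW 0 of V^T
top_trait = traits[int(np.argmax(np.abs(Vt[0, :])))]

# Characters best exemplifying dimension 0: extreme weights in COLUMN 0 of U
top_char = characters[int(np.argmax(np.abs(U[:, 0])))]

# Relative importance of dimension 0 vs dimension 1: ratio of sigma weights
ratio = sigma[0] / sigma[1]
```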

@jwzimmer-zz
Copy link
Owner Author

jwzimmer-zz commented Sep 20, 2021

Want to make: lists/word clouds based on the traits which have the most positive, most neutral, and most negative weights in each of the first 3 dimensions -- this should lead to a D&D-style alignment chart (3x3) which will hopefully show a clear pattern? Maybe also do the same with characters?

Pseudocode:

  • Need the SVD results (means removed).
  • Specifically, need rows of V.
  • For the first three rows:
    • For each row, sort traits by most positive; most neutral; most negative
    • Save the sorted lists so they can be truncated at different points
  • Make a wordcloud for each of those categories
    • Assemble into 3x3 grid
    • Look for patterns, if it's legible
    • If it's not legible, make the list shorter
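The pseudocode above could be sketched like this (synthetic stand-ins for V2 and the trait names, which in practice come from the SVD step):

```python
import numpy as np

# Stand-ins: the real V2 (V^T from the SVD) and trait names would come from the data
rng = np.random.default_rng(0)
trait_names = ["trait_%d" % i for i in range(12)]
V2 = rng.normal(size=(12, 12))

sorted_rows = []
for row in V2[:3]:                       # first three rows of V^T
    order = np.argsort(row)              # ascending: most negative first
    # save the full sorted list so it can be truncated at different points later
    sorted_rows.append([(trait_names[i], row[i]) for i in order])

for row_traits in sorted_rows:
    most_negative = row_traits[:3]
    most_positive = row_traits[-3:]
    # "most neutral" = smallest absolute weight
    most_neutral = sorted(row_traits, key=lambda t: abs(t[1]))[:3]
```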

Using this tutorial: https://towardsdatascience.com/how-to-make-word-clouds-in-python-that-dont-suck-86518cdcb61f

Saving visualizations here as I make them so I can hopefully tell/ remember what they are in the future: https://docs.google.com/presentation/d/1_kc36iI6B2OmsZlbMxLB0xiT0ePaQQ2qh7NykKqsefI/edit?usp=sharing

I'm using the function below (in nextstep.py) to make very basic word clouds. It passes the scores for each trait in each row of V to the wordcloud Python package as if they were "frequencies" (even though they are not) and lets the built-in generate_from_frequencies method interpret them however it will.

# assumes: from wordcloud import WordCloud; import matplotlib.pyplot as plt
def simple_wordcloud(matrix_array, num_row, item_names):
    # assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row, :]
    matrix_dict = dict(zip(item_names, matrix_row))

    # render every word in near-black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0, 100%, 1%)"

    # white background, max_words=500, higher-quality 3000 x 2000 canvas
    wordcloud = WordCloud(background_color="white", width=3000, height=2000,
                          max_words=500).generate_from_frequencies(matrix_dict)
    # recolor all the words to black
    wordcloud.recolor(color_func=black_color_func)

    plt.imshow(wordcloud)
    return wordcloud

This seems to crash Spyder pretty often, so I made a lower-quality version that takes fewer words:

# assumes: from wordcloud import WordCloud; import matplotlib.pyplot as plt
def simple_wordcloud(matrix_array, num_row, item_names):
    # assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row, :]
    matrix_dict = dict(zip(item_names, matrix_row))

    # render every word in near-black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0, 100%, 1%)"

    # white background, default size, fewer words
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # recolor all the words to black
    wordcloud.recolor(color_func=black_color_func)

    plt.imshow(wordcloud)
    return wordcloud

For the first 6 rows of V, that results in:
Screen Shot 2021-09-20 at 3 04 55 PM

For reference, the weights in the relevant sigma matrix are as follows; they get pretty small by around the 15th value: [4571.60069027, 3977.77079978, 3148.95421275, 2330.72490479,
1863.71976093, 1422.81288847, 1389.55887554, 1311.97892059,
1024.52207029, 924.99578844, 890.04169256, 774.10750006,
728.56171378, 665.35637138, 607.12742436, 592.00578495,
567.349024 , 517.6459256 , 504.80145181, 496.07731617,
482.19264758, 476.40215009, 450.10001708, 430.57116746,
419.4340081 , 409.40418421, 406.91557611, 394.68135566,
385.86736628, 377.25202325, 372.82985457, 356.41834577,
351.72495156, 347.74900228, 339.38741564, 333.79399487,
326.55896904, 324.55117824, 318.1295393 , 315.10346038,
308.25490266, 299.26762091, 295.0152497 , 289.96992691,
288.85032287, 281.9690584 , 276.58233643, 272.43157464,
271.90675683, 266.18190642, 263.66959715, 259.37314041,
256.60410545, 254.70809149, 252.68163905, 247.4687916 ,
245.9929314 , 244.85792913, 241.82939261, 239.93695879,
235.41983321, 231.65486434, 230.66525203, 226.50847219,
226.02510515, 224.1026829 , 221.54346153, 218.66355964,
216.47694732, 215.91077504, 215.1921562 , 213.26036446,
211.09757665, 208.67226747, 206.33412318, 203.46306598,
202.00129305, 198.56407551, 197.94191631, 196.92701143,
195.12457197, 192.86726501, 190.96810407, 189.87703712,
189.53880978, 189.06950455, 188.50868644, 185.43954795,
182.61832651, 181.50209107, 179.99456989, 178.44065033,
177.4464131 , 176.71970982, 175.55113009, 174.82536567,
172.5053995 , 171.26372319, 170.70552398, 168.26458816,
167.98707707, 165.43564178, 165.08935084, 164.83722953,
162.13498 , 161.42803178, 160.30528848, 159.49239512,
159.21142423, 158.28515706, 157.13243679, 155.45907298,
153.86700243, 153.59045706, 152.35155954, 150.58777948,
149.58254526, 149.01649307, 148.12937946, 146.70195903,
145.70874362, 144.44385711, 143.42005057, 142.91038791,
141.60808627, 141.4631097 , 140.21726391, 139.21397298,
137.97307267, 137.48926772, 136.25779283, 135.36367309,
134.63421905, 133.13706912, 132.23788945, 130.99681122,
130.49813038, 129.36471842, 129.21269304, 128.19432229,
126.64128126, 126.28955773, 125.6550039 , 124.83269046,
124.14202611, 122.74294555, 120.90210089, 120.42513441,
119.67430339, 119.42790495, 119.24266103, 117.64425693,
117.36405301, 116.18928162, 115.39920124, 114.76582936,
113.78433957, 113.4101737 , 112.08557423, 111.10900704,
110.64820963, 110.17308651, 109.86539458, 107.9222643 ,
107.68943644, 106.66859772, 105.97305812, 105.54842185,
104.91505923, 103.6099165 , 102.85102213, 102.2707889 ,
101.34768192, 101.09396798, 100.60069694, 99.77079538,
98.96423342, 98.31576173, 97.81404952, 96.80348288,
96.30542631, 95.57328745, 95.13722686, 93.98035775,
93.2769668 , 92.75871829, 92.42652245, 91.92542246,
91.05170611, 90.01036083, 89.7513737 , 89.22258541,
88.78020526, 88.65292871, 87.40167041, 86.49717578,
85.02984127, 84.81686455, 84.40993647, 82.99396525,
82.35567233, 81.60991198, 81.36376152, 79.95434487,
79.39810207, 79.08318183, 77.83822367, 77.22776508,
76.30862441, 75.47880711, 74.9228648 , 74.77301107,
73.84800751, 73.60236366, 72.87570326, 72.38778495,
71.67473456, 70.54334797, 69.59775162, 69.28375765,
68.05775428, 67.04052598, 65.98883931, 64.82344704,
64.49868912, 64.02335036, 63.06458598, 62.70576773,
62.01116387, 60.33286102, 59.29124147, 58.27457299,
56.17688565, 55.46391112, 53.41931643, 48.49385579]

Moving on to a more D&D-like chart...

def simple_wordcloud(matrix_dict):
    # render every word in near-black
    def black_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return "hsl(0, 100%, 1%)"
    # white background
    wordcloud = WordCloud(background_color="white", max_words=300).generate_from_frequencies(matrix_dict)
    # recolor all the words to black
    wordcloud.recolor(color_func=black_color_func)
    plt.imshow(wordcloud)
    return wordcloud

def make_dd_wordcloud_dicts(matrix_array, num_row, item_names):
    # assuming np array, and interested in rows (not columns)
    matrix_row = matrix_array[num_row, :]
    matrix_dict = dict(zip(item_names, matrix_row))
    # sorted ascending, from most negative to most positive
    sorted_md = {k: v for k, v in sorted(matrix_dict.items(), key=lambda item: item[1])}
    traits_list = list(sorted_md.keys())
    scores_list = list(sorted_md.values())

    # 236 traits split into ordered thirds: 89 / 89 / 88
    dict1 = dict(zip(traits_list[:89], scores_list[:89]))
    dict2 = dict(zip(traits_list[89:178], scores_list[89:178]))
    dict3 = dict(zip(traits_list[178:], scores_list[178:]))
    return dict1, dict2, dict3

Call it like this for, e.g., the 3rd row of V with means removed: d1_2, d2_2, d3_2 = make_dd_wordcloud_dicts(V2, 2, col2), then do simple_wordcloud(d3_2) (etc.).

This produces the following for the first 3 rows of V with means removed, dividing the traits into ordered thirds: the first 89 (most negative/least positive), then the middle 89, then the final 88 most positive. Those get put into 3 dicts by the function above; in the chart they are ordered the opposite way, with the MOST POSITIVE at the top, etc. Maybe that isn't a good way to do it, because the spread isn't the same for every row of V, so I should come up with a new strategy?
Screen Shot 2021-09-20 at 4 02 08 PM
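One possible spread-aware alternative to fixed equal thirds (a sketch, not what the code above does): call the traits with the smallest absolute weights "neutral", and split the rest by sign, so the cut adapts to each row's spread:

```python
import numpy as np

def split_by_value(names, values, neutral_frac=1/3):
    """Split traits into negative / neutral / positive dicts by value,
    treating the neutral_frac of traits with the smallest |value| as neutral
    (rather than always cutting the sorted list into equal thirds)."""
    values = np.asarray(values, dtype=float)
    k = int(len(values) * neutral_frac)
    neutral_idx = set(np.argsort(np.abs(values))[:k])
    neg, mid, pos = {}, {}, {}
    for i, (name, v) in enumerate(zip(names, values)):
        if i in neutral_idx:
            mid[name] = v
        elif v < 0:
            neg[name] = v
        else:
            pos[name] = v
    return neg, mid, pos
```

Then, e.g., neg, mid, pos = split_by_value(col2, V2[2, :]) would feed the same simple_wordcloud calls as before.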


jwzimmer-zz commented Sep 21, 2021

Other visualization ideas

  • Rank turbulence, kind of like the allotaxonographs, but simpler?
  • Just a list of the top N traits by magnitude in the rows of V
  • Can I find the character that is best described by the first N rows of V, by comparing the rank-N rebuilt matrix approximation against that character's actual scores?
  • What about... which traits' scores are most closely approximated by the rank-N reconstructed matrix? That would indicate that most of the information about that trait is captured by the first N dimensions, I think?
  • Split each row of V into positive- and negative-magnitude traits, rank them, and then do allotaxonographs to see how the traits move around from one dimension to the next?
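The reconstruction idea in the middle two bullets could be sketched like this (synthetic matrix; per-row residuals of the rank-N approximation, where a small residual means that character is well described by the first N dimensions):

```python
import numpy as np

def rank_n_residuals(M, n):
    """Per-row (per-character) error of the rank-n SVD reconstruction."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    approx = U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]
    return np.linalg.norm(M - approx, axis=1)

rng = np.random.default_rng(1)
M = rng.normal(size=(6, 8))          # stand-in: characters x traits
res_full = rank_n_residuals(M, 6)    # full rank: residuals ~ 0
res_low = rank_n_residuals(M, 2)
best_char = int(np.argmin(res_low))  # character best captured by 2 dimensions
```

The same function applied to M.T would give per-trait residuals for the other bullet.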


jwzimmer-zz commented Sep 21, 2021

Meeting with Dodds notes

  • dot product of character vector with the row of v, then sort by cos
  • change this: subtract the mean of all the scores, then re-run SVD --> svd output saved in eb8244e
  • look at the distribution of the means --> Label the darn axes, NO BAD IDEAS #12 (comment)
  • what is the mean? about 49.65
  • do it both ways
  • also try subtracting just 50? --> svd output saved in ea4f717
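The two centering variants from the notes ("do it both ways") could be compared like this (synthetic ratings on a 0-100 scale; the ~49.65 empirical mean is from the real data):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.uniform(0, 100, size=(20, 10))   # stand-in ratings on a 0-100 scale

# Variant 1: subtract the overall empirical mean (~49.65 in the real data)
U1, s1, Vt1 = np.linalg.svd(M - M.mean(), full_matrices=False)

# Variant 2: subtract the theoretical midpoint of the scale
U2, s2, Vt2 = np.linalg.svd(M - 50.0, full_matrices=False)

# Compare the leading trait vectors across the two centerings (sign-invariant)
cos = abs(Vt1[0] @ Vt2[0]) / (np.linalg.norm(Vt1[0]) * np.linalg.norm(Vt2[0]))
```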


jwzimmer-zz commented Sep 25, 2021

Looking at how the values in the rows of V2 (V^T) are distributed (per above comment/ conversation), using the version of V2 from running SVD with the overall mean (49.65 ish) removed via e.g. plt.scatter(range(1,237),V2[21,:]).

First row of V
Screen Shot 2021-09-25 at 4 35 54 PM

Second row of V
Screen Shot 2021-09-25 at 4 38 02 PM

Third row of V
Screen Shot 2021-09-25 at 4 38 35 PM

Actually I think this might be easier to see as a scatter plot?

First row of V
Screen Shot 2021-09-25 at 4 42 04 PM

Second row of V
Screen Shot 2021-09-25 at 4 42 38 PM

Third row of V
Screen Shot 2021-09-25 at 4 43 25 PM

Fourth row of V
Screen Shot 2021-09-25 at 4 43 57 PM

Fifth row of V
Screen Shot 2021-09-25 at 4 44 33 PM

Sixth row of V
Screen Shot 2021-09-25 at 4 45 17 PM

Seventh row of V
Screen Shot 2021-09-25 at 4 45 57 PM

Eighth row of V
Screen Shot 2021-09-25 at 4 46 52 PM

Ninth row of V
Screen Shot 2021-09-25 at 4 47 20 PM

Tenth row of V
Screen Shot 2021-09-25 at 4 47 59 PM

11th row of V
Screen Shot 2021-09-25 at 4 54 55 PM

17th row of V
Screen Shot 2021-09-25 at 4 55 35 PM

22nd row of V
Screen Shot 2021-09-25 at 4 56 05 PM

27th row of V
Screen Shot 2021-09-25 at 4 56 42 PM

52nd row of V
Screen Shot 2021-09-25 at 4 57 16 PM

77th row of V
Screen Shot 2021-09-25 at 4 57 43 PM

102nd row of V
Screen Shot 2021-09-25 at 4 58 14 PM

202nd row of V
Screen Shot 2021-09-25 at 4 58 44 PM

236th row of V (last row)
Screen Shot 2021-09-25 at 5 00 04 PM

My Interpretation

  • There's a change in how the weights are distributed between rows 1-3 and rows 4+ of V.
    • Most of the traits are relevant to the first 3 dimensions (their weights are scattered fairly widely)
    • After that, most of the traits cluster around 0, and a few have outlying values by comparison
    • Maaaybe there's a trend that as the dimension index increases, dimensions are more likely to be dominated by a single trait or a few traits (tighter clustering around 0 for most, larger values for the outliers)?
  • It might be a good idea to go through some of the dimensions that ARE dominated by only a few traits, to see which traits those are and whether there's a pattern to them.
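The "dominated by a few traits" impression could be quantified with a simple concentration measure (a sketch; applied to the real rows of V2, a value near 1 would flag the dominated dimensions):

```python
import numpy as np

def top_k_mass(row, k=3):
    """Fraction of a row's total |weight| carried by its k largest-|weight| traits.
    Close to 1 = the dimension is dominated by a few traits."""
    mags = np.sort(np.abs(row))[::-1]
    return mags[:k].sum() / mags.sum()

spread_out = np.ones(100)            # every trait equally relevant
dominated = np.full(100, 0.1)        # mostly near zero...
dominated[0] = 50.0                  # ...with one huge outlier
low = top_k_mass(spread_out)
high = top_k_mass(dominated)
```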


jwzimmer-zz commented Sep 26, 2021

Looking at trait magnitude in the rows of V (V2), with the overall mean removed --> 1c5699d

Using this code to render the bar charts:

# assumes: import pandas as pd; import seaborn as sns
def vector_barchart(vector_names, vector, n, style="by_mag", ascending=False):
    """ vector_names should be the labels for the values in the vector
        vector should be the vector (ndarray)
        n should be the number of values you want displayed in the chart
        style should be the format of the chart
        ascending=False will be most relevant traits by magnitude,
        ascending=True will be least relevant traits by magnitude"""
    n = min(n, len(vector_names))
    vectordf = pd.DataFrame()
    vectordf["Trait"] = vector_names  # was hard-coded to the global col2
    vectordf["Values"] = vector

    if style == "by_mag":
        vectordf["Magnitude"] = vectordf["Values"].abs()
        sorteddf = vectordf.sort_values(by="Magnitude", ascending=ascending)
        plotguy = sorteddf.iloc[0:2*n]
    sns.barplot(x=plotguy["Values"], y=plotguy["Trait"])
    return vectordf, plotguy

Screen Shot 2021-09-27 at 8 16 28 PM

Screen Shot 2021-09-26 at 3 50 06 PM

Screen Shot 2021-09-26 at 4 13 00 PM

Screen Shot 2021-09-26 at 4 27 20 PM

Interpretation: there do seem to be some patterns among the "outlying" traits in the later dimensions, e.g. a sort of leadership-style component (captain<->first-mate seems to come up a lot) and a physical component (thick<->thin and tall<->short). Also, in the later dimensions, gender, sexuality, and procreation seem to come up a lot. To make a chart for a specific row of V, use vector_barchart(col2, V2[26,:], 10, style="by_mag", ascending=False) (that's the 27th row with 10 traits shown).


Meeting with Dodds

  • subtract mean of 50
  • redo with larger font


jwzimmer-zz commented Oct 3, 2021

Subtracting the mean of 50, rather than the overall mean.

Screen Shot 2021-10-03 at 3 47 42 PM

Screen Shot 2021-10-03 at 3 57 07 PM

Screen Shot 2021-10-03 at 4 29 54 PM


Only the traits from the indicated side (positive on the left, negative on the right) with the largest magnitudes, for the first 3 dimensions (theoretical mean removed)

dim3


jwzimmer-zz commented Oct 4, 2021

To do:


Relative size of dimensions (sigma values, theoretical mean removed, first 20 dimensions)
Screen Shot 2021-10-05 at 7 42 20 PM


Notes from talking with Dodds Oct 12

  • Rerun SVD after using some minimum threshold for number of ratings
  • What happens if you're missing data?
  • What happens by work/ storyverse?
  • What happens if you "translate" first -- take the traits most important to a storyverse (or time period or whatever) first, then do SVD
  • Inner products to see how close the dimensions are
  • Number of traits removed vs. vector 1 from whole svd vs. vector 1 from the modified one
  • Note which works are still present
  • Most-answered traits
  • Most-answered characters
  • Most-answered storyverses
  • Other idioms? Heroes and zeroes, saints and sinners


jwzimmer-zz commented Nov 4, 2021

What can columns of V^T and rows of U tell us (as opposed to rows of V^T and columns of U)?

The columns of V^T tell us how a single trait contributes to each dimension of traits. We can look through the values of the columns to see which dimension that trait is most important to. The rows of U tell us how well a single character is described by each dimension. We can look through the values in the row to see which dimension best describes that character.
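A small sketch of both lookups (synthetic matrix; trait and character indices are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(15, 9))            # stand-in: characters x traits
U, s, Vt = np.linalg.svd(M, full_matrices=False)

trait_j = 4    # a column of V^T: one trait's weight in every dimension
char_i = 7     # a row of U: one character's weight in every dimension

# Dimension this trait contributes to most strongly (by |weight|)
best_dim_for_trait = int(np.argmax(np.abs(Vt[:, trait_j])))

# Dimension that best describes this character
best_dim_for_char = int(np.argmax(np.abs(U[char_i, :])))
```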

Plotting the 3 traits -- diligent<->lazy, competent<->incompetent, disorganized<->self-disciplined -- with the highest magnitudes in dimension 1 (row 1 of V^T), tracked across the first 15 dimensions:

plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(V2[:,74][:15])

image

"Reversing" the last trait that is backwards from the other two:

plt.plot(V2[:,134][:15]); plt.plot(V2[:,31][:15]); plt.plot(-1*V2[:,74][:15])

image

We would not expect similar traits to track each other perfectly unless they had identical meanings. But seeing where they converge and diverge can perhaps help us pinpoint how some dimensions differ from each other.

Finding the order of the traits:
bap_map[bap_map["low/left anchor"]=="hard"]

"hard<->soft" and "hard<->soft 2" track each other for the first 15 dimensions (good -- sanity check)
plt.plot(V2[:,97][:15]); plt.plot(V2[:,182][:15])
image

It looks like as the dimensions progress they track each other more poorly, maybe indicating that the dimensions get less meaningful as they progress -- at some point they represent quirks of our specific dataset rather than underlying structure; replicating our exact initial data isn't meaningful or important.
image

Where do they diverge?
Dimensions 15 - 30:
image
Dimensions 0 - 30:
image
Dimensions 5 - 10:
image
Dimensions 0 - 10:
image

So they look like they diverge after the 7th or 8th dimension -- maybe evidence to focus on the first few?
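The "where do they diverge" question could be made precise with a small helper (a sketch; in practice the inputs would be sign-aligned columns of V2 like V2[:,97] and V2[:,182]):

```python
import numpy as np

def first_divergence(col_a, col_b, threshold=0.2):
    """First dimension where two trait columns of V^T differ by more than
    `threshold` in absolute value; -1 if they never do.
    Signs are aligned first, so a globally flipped trait still counts as tracking."""
    col_a = np.asarray(col_a, dtype=float)
    col_b = np.asarray(col_b, dtype=float)
    if col_a @ col_b < 0:
        col_b = -col_b
    over = np.nonzero(np.abs(col_a - col_b) > threshold)[0]
    return int(over[0]) if over.size else -1

a = np.array([0.5, 0.4, 0.3, 0.05, -0.3])
b = np.array([0.5, 0.4, 0.3, 0.40, 0.30])
```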
