SVD/ PCA on "character space" #6

Open · jwzimmer-zz opened this issue Feb 10, 2021 · 11 comments

@jwzimmer-zz
https://openpsychometrics.org/_rawdata/

From the tropes meeting with Peter & Phil.


jwzimmer-zz commented Jun 28, 2021

Making a plan:

  • Set up an Overleaf doc for a correctly formatted draft
  • Redo the SVD/ PCA analysis, since it's entirely plausible we made mistakes -- if we get results consistent with what we found the first time, great; if not, that's a red flag to make sure we understand what we are doing.
    • With and without removing the mean of each "eigentrait"
    • Maybe re-orient the original matrix to see what happens with "eigencharacters"?
  • Better visualizations?
  • Transfer any worthwhile content from the older draft to the new template
  • Decide to what extent PDS should be mentioned, if at all
  • Clean up the draft and get feedback from D&D
  • Make a new plan based on feedback


jwzimmer-zz commented Jul 6, 2021

Codebook/ documenting what things are

Original data artifacts

@jwzimmer-zz

Cleaned data artifacts


jwzimmer-zz commented Jul 7, 2021

Rerunning the SVD without removing any means

Artifacts: all the artifacts created as part of the SVD process are saved to files named after the runSVD output variable they hold, in this commit: 28845f7

This is how I reran the SVD:

```python
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg  # makes sp.linalg.diagsvd available

df_traits = pd.read_json("July2021_df_traits.json")

# get_json/write_json are the repo's JSON read/write helpers
clean_column_dict = get_json("July2021_cleaned_column_dict.json")

def runSVD(df1, dropcols=['unnamed.1', 'name', 'work'], n=None):
    # Drop the non-numeric metadata columns before decomposing
    for x in dropcols:
        df1 = df1.drop(x, axis=1)
    if n is None:
        n = df1.shape[1] - 1  # note: n is currently unused below
    X = df1.to_numpy()
    # Decompose
    U, D, V = np.linalg.svd(X)
    # Get dim of X
    M, N = X.shape
    # Construct the sigma matrix in the SVD (it simply adds null row vectors to match the dim of X)
    Sig = sp.linalg.diagsvd(D, M, N)
    # Now you can get X back (allclose compares element-wise, so signed errors can't cancel out):
    remakeX = np.dot(U, np.dot(Sig, V))
    assert np.allclose(remakeX, X)
    return df1, U, D, V, Sig, X, remakeX

df1, U, D, V, Sig, X, remakeX = runSVD(df_traits)

df1.to_json("July2021_SVD_df.json")
write_json(U.tolist(), "July2021_SVD_U.json")
write_json(D.tolist(), "July2021_SVD_D.json")
write_json(V.tolist(), "July2021_SVD_V.json")
write_json(Sig.tolist(), "July2021_SVD_Sig.json")
write_json(X.tolist(), "July2021_SVD_X.json")
write_json(remakeX.tolist(), "July2021_SVD_remakeX.json")
```
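(get_json and write_json aren't defined anywhere in this thread; presumably they're thin wrappers around the json module -- a sketch of what's assumed, not the repo's actual helpers:)

```python
import json

# Assumed implementations of the JSON helpers used above (not shown in the thread).
def get_json(path):
    with open(path) as f:
        return json.load(f)

def write_json(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f)
```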

@jwzimmer-zz

Rerunning the SVD, this time removing the mean of each trait

Artifacts: all the artifacts created as part of the SVD process are saved to files named after the runSVD output variable they hold, in this commit: 0ed3f65

The code for rerunning it:

```python
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg  # makes sp.linalg.diagsvd available

df_bap = pd.read_json("July2021_df_bap.json")
df_traits = pd.read_json("July2021_df_traits.json")

clean_column_dict = get_json("July2021_cleaned_column_dict.json")

# Same runSVD as in the previous comment
def runSVD(df1, dropcols=['unnamed.1', 'name', 'work'], n=None):
    for x in dropcols:
        df1 = df1.drop(x, axis=1)
    if n is None:
        n = df1.shape[1] - 1  # note: n is currently unused below
    X = df1.to_numpy()
    # Decompose
    U, D, V = np.linalg.svd(X)
    M, N = X.shape
    # Construct the sigma matrix in the SVD (it simply adds null row vectors to match the dim of X)
    Sig = sp.linalg.diagsvd(D, M, N)
    # Now you can get X back:
    remakeX = np.dot(U, np.dot(Sig, V))
    assert np.allclose(remakeX, X)
    return df1, U, D, V, Sig, X, remakeX

# Output from SVD without removing means
df1, U, D, V, Sig, X, remakeX = runSVD(df_traits)

# Remove the average of each trait
df1_means = df1.mean()
df1_normed = df1 - df1_means

# Output from SVD WITH removing means
df2, U2, D2, V2, Sig2, X2, remakeX2 = runSVD(df1_normed, dropcols=[])

df2.to_json("July2021_normed_trait_df.json")
write_json(U2.tolist(), "July2021_SVD_normed_U.json")
write_json(D2.tolist(), "July2021_SVD_normed_D.json")
write_json(V2.tolist(), "July2021_SVD_normed_V.json")
write_json(Sig2.tolist(), "July2021_SVD_normed_Sig.json")
write_json(X2.tolist(), "July2021_SVD_normed_X.json")
write_json(remakeX2.tolist(), "July2021_SVD_normed_remakeX.json")
```

@jwzimmer-zz

Sanity check

Since there are dataframes that should differ only by their column headers (one with the original BAP trait labels and one with the anchor words: July2021_df_bap.json and July2021_df_traits.json), I can sanity-check that the output from SVD is the same for each of these dataframes, since SVD doesn't know about the column headers.

When the code below was run, the assert statements passed successfully (which is good). To rerun, use the script saved here: d8d7c4c#diff-bb5be8dc2521f069449811f33a63824ae9dd7b3b0391c62d8fbdd7ab495809f8

How I sanity-checked:

```python
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg  # makes sp.linalg.diagsvd available

df_bap = pd.read_json("July2021_df_bap.json")
df_traits = pd.read_json("July2021_df_traits.json")

clean_column_dict = get_json("July2021_cleaned_column_dict.json")

# Same runSVD as in the previous comments
def runSVD(df1, dropcols=['unnamed.1', 'name', 'work'], n=None):
    for x in dropcols:
        df1 = df1.drop(x, axis=1)
    if n is None:
        n = df1.shape[1] - 1  # note: n is currently unused below
    X = df1.to_numpy()
    # Decompose
    U, D, V = np.linalg.svd(X)
    M, N = X.shape
    # Construct the sigma matrix in the SVD (it simply adds null row vectors to match the dim of X)
    Sig = sp.linalg.diagsvd(D, M, N)
    # Now you can get X back:
    remakeX = np.dot(U, np.dot(Sig, V))
    assert np.allclose(remakeX, X)
    return df1, U, D, V, Sig, X, remakeX

# Output from SVD without removing means
df1, U, D, V, Sig, X, remakeX = runSVD(df_traits)

# Remove the average of each trait
df1_means = df1.mean()
df1_normed = df1 - df1_means

# Output from SVD WITH removing means
df2, U2, D2, V2, Sig2, X2, remakeX2 = runSVD(df1_normed, dropcols=[])

# Run SVD on the BAP df as the sanity check
df3, U3, D3, V3, Sig3, X3, remakeX3 = runSVD(df_bap)

# Remove the average of each trait
df3_means = df3.mean()
df3_normed = df3 - df3_means
df4, U4, D4, V4, Sig4, X4, remakeX4 = runSVD(df3_normed, dropcols=[])

# allclose compares element-wise, so positive and negative differences can't cancel out
assert np.allclose(U4, U2)
assert np.allclose(D4, D2)
assert np.allclose(V4, V2)
assert np.allclose(remakeX4, remakeX2)
```

@jwzimmer-zz

Transposing the matrix

We've been using the matrix in the form where each character is a row and the traits are the columns. While I'm redoing the SVD anyway, I made a version with the matrix transposed, so that the characters are the columns and the traits are the rows, just in case we want to do something with that.

The artifacts (outputs from SVD and transposed dataframes) and the script are saved in this commit: 2560f87
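(The committed script is authoritative; a minimal sketch of the transposition, assuming the metadata columns are dropped and the name column becomes the new header, would be something like:)

```python
# Hypothetical sketch: traits as rows, characters as columns.
df_transposed = df_traits.drop(columns=["unnamed.1", "work"]).set_index("name").T
```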


jwzimmer-zz commented Jul 13, 2021

(In nextstep.py)

To get just characters from a certain work, e.g. Pride and Prejudice (rows 259 - 268 in df_traits):
df_traits.loc[df_traits["work"]=="Pride and Prejudice"]

The means for each trait have been removed from df2, as well as the "extra" columns (name, work, etc.), so we can select the matching rows by index:
df2.iloc[259:269]

Then we can run SVD on a single work (not for any results per se, but to get a smaller artifact that's easier to understand as an intermediate step):

df2pp = df2.iloc[259:269]
df2pp, U2pp, D2pp, V2pp, Sig2pp, X2pp, remakeX2pp = runSVD(df2pp, dropcols=[])

To get a really small matrix to use as a toy model

Using just the rows corresponding to characters from Pride and Prejudice, we can see which traits contribute most by taking the absolute value of all scores and then summing per column:
df2pp.abs().sum()

Then we can get the top n=15 traits with the largest sums using: df2pp.abs().sum().nlargest(15).
This gives us:
gossiping<->confidential 309.5000
judgemental<->accepting 289.6970
independent<->codependent 280.2340
scandalous<->proper 279.9235
selfish<->altruistic 279.9000
cunning<->honorable 277.7000
trash<->treasure 272.2000
young<->old 271.2345
arrogant<->humble 270.3045
wholesome<->salacious 268.9000
sheltered<->street-smart 267.9560
rich<->poor 267.3715
quarrelsome<->warm 263.9230
masculine<->feminine 263.3865
rude<->respectful 261.8985

We can put those traits into a list:

pplist = ['gossiping<->confidential',
 'judgemental<->accepting',
 'independent<->codependent',
 'scandalous<->proper',
 'selfish<->altruistic',
 'cunning<->honorable',
 'trash<->treasure',
 'young<->old',
 'arrogant<->humble',
 'wholesome<->salacious',
 'sheltered<->street-smart',
 'rich<->poor',
 'quarrelsome<->warm',
 'masculine<->feminine',
 'rude<->respectful']

Then df2pp[pplist] gives us a 10 character x 15 trait matrix with the means already removed (the means per trait over all 800 characters, not over these 10 characters specifically). That df is saved in this commit: 4638346

We can run SVD on this toy matrix: DF, u, d, v, sig, x, remake_x = runSVD(df2pp[pplist],dropcols=[])

This yields:

u=array([[-0.08026365,  0.28342991, -0.45526078, -0.0664045 ,  0.50874406,
        -0.33335   ,  0.18430594, -0.0057889 ,  0.44852539, -0.31053718],
       [-0.08396192,  0.15645385,  0.14053663, -0.70715547,  0.2271066 ,
        -0.2511955 , -0.02011789, -0.39023386, -0.36354705,  0.22163698],
       [ 0.25954173, -0.31034307, -0.35522931,  0.15628852, -0.04000741,
        -0.60984357, -0.04676285,  0.18937151, -0.12989473,  0.50722832],
       [ 0.40650185, -0.11318425,  0.36149316,  0.35004533,  0.29471143,
        -0.15955976, -0.32223942, -0.55967292,  0.18397752, -0.07417983],
       [-0.33411562, -0.286026  , -0.16265535,  0.38160353,  0.27866994,
         0.00105568,  0.28555704, -0.23232173, -0.59800975, -0.25236469],
       [-0.3715037 , -0.43635952, -0.10259882, -0.05714874,  0.22120278,
         0.34615965,  0.06656464, -0.232405  ,  0.42110127,  0.50324595],
       [ 0.34982716, -0.45739007,  0.21871478, -0.28809451,  0.48994575,
         0.15302199,  0.07844722,  0.48020423, -0.063472  , -0.18259064],
       [-0.41318364, -0.42683061, -0.00921618, -0.1974537 , -0.20780707,
        -0.29966454, -0.54353286,  0.03654642,  0.11218081, -0.4070519 ],
       [ 0.30147215, -0.34947551, -0.08825894, -0.23787739, -0.43740969,
        -0.079282  ,  0.55174215, -0.34925464,  0.17587574, -0.26012067],
       [ 0.34679823, -0.01795822, -0.65071337, -0.14527374,  0.00109131,
         0.43483593, -0.40942224, -0.1915533 , -0.18289531, -0.08956964]])
d=array([273.45005938, 157.11930121, 131.97187477,  81.04634391,
        70.76989585,  52.32877291,  32.73332522,  21.88512639,
        13.30087534,   7.50968549])
v=array([[-0.33985495, -0.29612671,  0.04303773, -0.19430756, -0.33997662,
        -0.32236686, -0.31278163,  0.1514476 , -0.32808836,  0.3097413 ,
         0.06290999,  0.12897509, -0.30557215, -0.04094033, -0.31257376],
       [ 0.30564872, -0.14496602, -0.52284552, -0.07679449,  0.09119113,
         0.02914019,  0.18982058,  0.19440876, -0.10905841, -0.00706552,
         0.48225663, -0.09168561, -0.16293022, -0.49084418, -0.05395979],
       [-0.02769439,  0.24078822,  0.03745725, -0.57672513,  0.02246244,
        -0.13091967,  0.07329089, -0.40214836,  0.10072198,  0.01570634,
         0.34182379,  0.52299393,  0.1187455 , -0.00601891,  0.08720821],
       [-0.08025258,  0.29517503,  0.37569212,  0.01795614, -0.05874584,
        -0.07298374, -0.30536557,  0.03718126,  0.07454569,  0.1948677 ,
         0.05683033, -0.24349984,  0.24578564, -0.67877057,  0.18537411],
       [ 0.1150476 , -0.10383944, -0.15196138, -0.0683828 , -0.08962658,
         0.12041794, -0.09952601, -0.78580216, -0.21228544, -0.0574279 ,
        -0.27112323, -0.29399761, -0.08990117, -0.19494395, -0.20951912],
       [-0.19131476,  0.38035801, -0.2932324 , -0.23653336, -0.07076555,
        -0.34197986,  0.10189438,  0.01179161,  0.09773657,  0.14230296,
         0.12327661, -0.62524381,  0.06220088,  0.32312917, -0.01013698],
       [-0.26004231, -0.22143183,  0.41447416, -0.47112382,  0.05220788,
         0.22821447,  0.45998814,  0.13790597, -0.03164187, -0.28475731,
        -0.04056089, -0.28285624, -0.13686034, -0.13176831, -0.06595748],
       [ 0.17358445,  0.27223799, -0.10482308, -0.2015069 ,  0.04509023,
         0.0990094 ,  0.29456974,  0.13610467, -0.03820593,  0.54587192,
        -0.54264616,  0.14601149, -0.28873467, -0.123614  ,  0.11472584],
       [ 0.62152663,  0.1666971 ,  0.36819711, -0.19534219, -0.01392521,
         0.19094972, -0.29921406,  0.0971171 ,  0.02161291,  0.06792114,
         0.21542289, -0.16411461, -0.24248052,  0.25467636, -0.27293712],
       [ 0.07840771, -0.42320563,  0.01940182, -0.16340421,  0.13880765,
        -0.06200609, -0.23698661, -0.1033672 ,  0.0482833 ,  0.13047996,
         0.05377752, -0.16884177, -0.23921517,  0.13632374,  0.75574347],
       [-0.0297133 ,  0.11938014,  0.33357639,  0.41703516,  0.06961158,
        -0.08548349,  0.41994792, -0.24354082, -0.44763044,  0.26838829,
         0.36398154, -0.03359123, -0.15364751,  0.09755973,  0.11274173],
       [ 0.21963318,  0.11255698,  0.07419373, -0.09983337,  0.42250793,
        -0.58157332, -0.06552255,  0.11094783, -0.41730944, -0.38047638,
        -0.25779172,  0.03189939,  0.00139478, -0.06049719,  0.0152392 ],
       [ 0.32877213, -0.37094867,  0.0552218 , -0.15893589, -0.18810026,
        -0.07471512,  0.18716202,  0.06698879, -0.21039446,  0.28156536,
        -0.06489546, -0.0591665 ,  0.71045387,  0.09594379, -0.01754533],
       [-0.15203352,  0.28918159, -0.18128653, -0.14865689, -0.12596379,
         0.49125909, -0.22555514,  0.15525241, -0.62053863, -0.1523707 ,
         0.03275004,  0.01742755,  0.13756812,  0.11159349,  0.25509985],
       [ 0.25201017,  0.11200719,  0.04172258,  0.06420225, -0.77477492,
        -0.21603036,  0.19136847,  0.02583857,  0.03211352, -0.34964351,
        -0.07113749,  0.0454031 , -0.16798456, -0.0573666 ,  0.26820543]])
sig=array([[273.45005938,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        , 157.11930121,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        , 131.97187477,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,  81.04634391,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
         70.76989585,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,  52.32877291,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,  32.73332522,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,  21.88512639,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
         13.30087534,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   7.50968549,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ]])

Note dimensions:
u is 10 x 10 (the number of characters/ rows), v is 15 x 15 (the number of traits/ columns), sig is 10 x 15 with 10 non-zero diagonal values.
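Those dimensions are easy to confirm directly from the returned arrays:

```python
assert u.shape == (10, 10)    # one column per character
assert v.shape == (15, 15)    # one row per trait
assert sig.shape == (10, 15)  # same shape as X
assert len(d) == 10           # np.linalg.svd returns only min(M, N) singular values
```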


jwzimmer-zz commented Jul 19, 2021

An even smaller toy matrix

dfsmall = df2pp[pplist[:5]] gives us the 10 characters from Pride and Prejudice x the top 5 traits from the list mentioned above:

|     | gossiping<->confidential | judgemental<->accepting | independent<->codependent | scandalous<->proper | selfish<->altruistic |
|-----|-----|-----|-----|-----|-----|
| 259 | 32.5799 | -25.7742 | -24.1915 | 33.4618 | 8.39938 |
| 260 | 19.5799 | -19.5742 | -34.3915 | -4.43825 | 13.1994 |
| 261 | -32.6201 | -33.7742 | 39.7085 | 24.5618 | -27.2006 |
| 262 | -40.7201 | -16.1742 | 23.6085 | -39.6382 | -42.5006 |
| 263 | 9.27988 | 31.5258 | 28.8085 | 31.2618 | 22.9994 |
| 264 | 14.9799 | 38.1258 | 25.1085 | 25.6618 | 25.8994 |
| 265 | -50.4201 | -15.6742 | 26.0085 | -37.3382 | -40.3006 |
| 266 | 25.9799 | 39.4258 | 24.0085 | 40.6618 | 34.2994 |
| 267 | -50.4201 | -28.1742 | 39.6085 | -9.43825 | -28.8006 |
| 268 | -32.9201 | -41.4742 | -14.7915 | 33.4618 | -36.3006 |

We can then run SVD: dfs, u, d, v, sig, x, rex = runSVD(dfsmall, dropcols=[]).

To make the matrices easier to read, we can make dataframes from the arrays returned by runSVD, e.g. for the matrix U, dfu = pd.DataFrame.from_records(u). Then we can get a GitHub markdown table with print(dfu.to_markdown()):

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.174965 | 0.326296 | 0.429562 | 0.401283 | 0.442693 | 0.0303303 | 0.361543 | -0.103124 | 0.355525 | 0.229617 |
| 1 | -0.095925 | 0.408267 | 0.000175068 | 0.3096 | -0.458768 | 0.343455 | 0.0980603 | 0.353716 | -0.419526 | 0.2985 |
| 2 | 0.289264 | -0.254787 | 0.51681 | 0.402896 | 0.130014 | 0.0265614 | -0.150021 | -0.0534204 | -0.485991 | -0.377936 |
| 3 | 0.433256 | -0.107442 | -0.227372 | 0.0490339 | 0.522595 | 0.195355 | -0.316602 | 0.271603 | -0.0631081 | 0.507333 |
| 4 | -0.217048 | -0.425549 | 0.14686 | -0.0537437 | -0.104094 | -0.308657 | 0.15884 | -0.330346 | -0.33884 | 0.626152 |
| 5 | -0.256956 | -0.40636 | 0.0442409 | -0.0868463 | 0.0124899 | 0.840803 | 0.0552487 | -0.192599 | 0.107295 | -0.0240949 |
| 6 | 0.458136 | -0.157342 | -0.200635 | -0.0711835 | 0.0267243 | 0.0436022 | 0.830354 | 0.093688 | -0.0994286 | -0.0962616 |
| 7 | -0.358611 | -0.409782 | 0.174044 | -0.0122618 | 0.0631441 | -0.160897 | 0.0860923 | 0.786754 | 0.116035 | -0.0574674 |
| 8 | 0.412107 | -0.263198 | 0.149876 | 0.321032 | -0.530391 | -0.0257513 | -0.107634 | 0.00337257 | 0.554674 | 0.184753 |
| 9 | 0.259505 | 0.205468 | 0.616565 | -0.677998 | -0.0808574 | 0.107604 | -0.0126528 | 0.116771 | 0.00958342 | 0.136392 |

The matrix Sigma:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 167.959 | 0 | 0 | 0 | 0 |
| 1 | 0 | 103.103 | 0 | 0 | 0 |
| 2 | 0 | 0 | 85.5536 | 0 | 0 |
| 3 | 0 | 0 | 0 | 26.6988 | 0 |
| 4 | 0 | 0 | 0 | 0 | 14.3684 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |

The matrix V:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -0.608822 | -0.421068 | 0.192499 | -0.332048 | -0.55202 |
| 1 | 0.243136 | -0.482655 | -0.81904 | -0.17509 | -0.0802911 |
| 2 | -0.0560139 | -0.447946 | 0.0686642 | 0.883551 | -0.104065 |
| 3 | 0.434501 | -0.603099 | 0.512022 | -0.279439 | 0.327457 |
| 4 | 0.615055 | 0.159256 | 0.158861 | 0.0184046 | -0.755493 |


jwzimmer-zz commented Jul 19, 2021

Continuing with the above toy model, just trying to understand the SVD...

If you dot U with Sigma, you get a 10 x 5 matrix, which is the first 5 columns of U each multiplied by the corresponding weight from Sigma: column 1 of U is multiplied by 167.959 (weight 1 from Sigma), column 2 of U by 103.103 (weight 2 from Sigma), etc.

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -29.387 | 33.642 | 36.7505 | 10.7138 | 6.36082 |
| 1 | -16.1115 | 42.0935 | 0.0149777 | 8.26593 | -6.59179 |
| 2 | 48.5845 | -26.2692 | 44.215 | 10.7568 | 1.86811 |
| 3 | 72.7694 | -11.0776 | -19.4525 | 1.30915 | 7.50888 |
| 4 | -36.4552 | -43.8753 | 12.5644 | -1.43489 | -1.49566 |
| 5 | -43.1581 | -41.8969 | 3.78496 | -2.31869 | 0.179461 |
| 6 | 76.9483 | -16.2224 | -17.1651 | -1.90051 | 0.383987 |
| 7 | -60.232 | -42.2497 | 14.8901 | -0.327374 | 0.907282 |
| 8 | 69.2172 | -27.1365 | 12.8224 | 8.57115 | -7.62089 |
| 9 | 43.5864 | 21.1843 | 52.7493 | -18.1017 | -1.1618 |

If you dot Sigma with V, you get a 10 x 5 matrix in which the new first row is the first row of V multiplied by the first weight in Sigma, the second row is the second row of V multiplied by the second weight in Sigma, and so on; rows 5-9 come out all zero, since Sigma only has 5 weights for its 10 rows.

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -102.257 | -70.7223 | 32.332 | -55.7706 | -92.7169 |
| 1 | 25.068 | -49.7631 | -84.4453 | -18.0523 | -8.27824 |
| 2 | -4.79219 | -38.3234 | 5.87447 | 75.591 | -8.90313 |
| 3 | 11.6006 | -16.102 | 13.6704 | -7.46067 | 8.74271 |
| 4 | 8.83739 | 2.28826 | 2.28258 | 0.264445 | -10.8553 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |

Note: V is actually V^T; it has already been transposed when np.linalg.svd returns it. We know this is the case because you do not need to transpose it in order to get back your original data matrix (remakeX = np.dot(U, np.dot(Sig, V)) in the runSVD function).

The matrix product of Sigma dot V is what we will dot with U in order to get back our original data matrix. So the weighted rows of V and the columns of U are what describe our original matrix.
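These observations are easy to check numerically against the toy outputs (u, d, v, sig, x from runSVD above):

```python
# Column j of U·Sigma is column j of U scaled by weight j.
US = u @ sig
assert np.allclose(US[:, 0], u[:, 0] * d[0])

# Row i of Sigma·V is row i of V scaled by weight i; rows 5-9 come out all zero.
SV = sig @ v
assert np.allclose(SV[0], d[0] * v[0])
assert np.allclose(SV[5:], 0)

# No transpose is needed to get X back, confirming v is really V^T.
assert np.allclose(x, u @ SV)
```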

To approximate our original matrix:

We can tune how good an approximation we want by choosing how many non-zero weights to keep in Sigma. Since the weights are in descending order of importance, let's say we don't want to use all 5 rows of V in reconstructing our matrix; let's use the first three. We keep only the first 3 weights of Sigma, and therefore use only the first 3 rows of V, like this:

[Screenshot: newsig, a copy of Sigma with only the first 3 weights kept and the rest zeroed]
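(In code, that construction presumably amounts to copying Sigma and zeroing out all but the first 3 weights -- a sketch, not the exact script:)

```python
k = 3
newsig = sig.copy()
newsig[k:, k:] = 0.0  # only the diagonal is non-zero, so this drops weights 4 and 5
```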

We can then approximate our original data matrix with newx = np.dot(u, np.dot(newsig, v)):

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 24.0125 | -20.3258 | -30.6877 | 36.3385 | 9.69663 |
| 1 | 20.0426 | -13.5393 | -37.5767 | -2.00711 | 5.51259 |
| 2 | -38.443 | -27.5843 | 33.904 | 27.5332 | -29.3117 |
| 3 | -45.9073 | -16.5805 | 21.7453 | -39.4106 | -37.2564 |
| 4 | 10.8233 | 30.8986 | 29.7808 | 30.8883 | 22.3393 |
| 5 | 15.877 | 36.6988 | 26.2672 | 25.0105 | 26.7942 |
| 6 | -49.8305 | -16.8816 | 26.9206 | -37.8764 | -39.3882 |
| 7 | 25.5641 | 39.0838 | 24.032 | 40.5536 | 35.092 |
| 8 | -49.457 | -21.7913 | 36.4305 | -6.90288 | -37.3648 |
| 9 | -24.3404 | -52.2064 | -5.33847 | 28.4248 | -31.2508 |

This isn't a great approximation of our original data, but it isn't totally insane looking... to sanity check, if we use 4 weights instead of 3, the approximation should improve, so let's check that that actually happens:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 28.6676 | -26.7872 | -25.202 | 33.3447 | 13.2049 |
| 1 | 23.6342 | -18.5245 | -33.3443 | -4.31693 | 8.21933 |
| 2 | -33.7691 | -34.0718 | 39.4117 | 24.5274 | -25.7893 |
| 3 | -45.3385 | -17.3701 | 22.4156 | -39.7764 | -36.8277 |
| 4 | 10.1998 | 31.7639 | 29.0461 | 31.2893 | 21.8694 |
| 5 | 14.8695 | 38.0972 | 25.08 | 25.6584 | 26.035 |
| 6 | -50.6563 | -15.7354 | 25.9475 | -37.3453 | -40.0105 |
| 7 | 25.4218 | 39.2813 | 23.8644 | 40.6451 | 34.9848 |
| 8 | -45.7329 | -26.9606 | 40.8192 | -9.29799 | -34.5582 |
| 9 | -32.2056 | -41.2892 | -14.6069 | 33.4831 | -37.1784 |

And this approximation is indeed closer to the original matrix (dfsmall) that we started out with earlier. Great!
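"Closer" can be made precise with the Frobenius norm of the reconstruction error, which should shrink as weights are added (a quick sketch using the outputs above):

```python
def rank_k_error(u, d, v, x, k):
    """Frobenius norm of X minus its rank-k approximation."""
    d_k = np.r_[d[:k], np.zeros(len(d) - k)]
    sig_k = sp.linalg.diagsvd(d_k, *x.shape)
    return np.linalg.norm(x - u @ sig_k @ v)

for k in range(1, 6):
    print(k, rank_k_error(u, d, v, x, k))  # error decreases; ~0 at k=5
```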

So now we should get into the details of what happens when we take the dot product of U with the matrix we get from dotting Sigma with V. The number of non-zero weights we include in Sigma determines how many rows of V are used to approximate our original matrix. When we take the dot product of U with this other matrix, SigmadotV, the rows of U are combined with the columns of SigmadotV; those columns only have as many nonzero values as we've chosen to include in our approximation, so the last few entries of EVERY row of U are multiplied by 0 (and disregarded). Therefore, the last few COLUMNS of U have no impact on the values in our final approximation. So when we approximate our original matrix using U, Sigma, and V, we are choosing the first N weights of Sigma, the first N rows of V, and the first N columns of U to be combined into the final result.

The product of our new Sigma (only 3 weights) dotted with V gives us a 10 x 5 matrix containing the first 3 rows of V weighted by the corresponding weight in Sigma:

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -102.257 | -70.7223 | 32.332 | -55.7706 | -92.7169 |
| 1 | 25.068 | -49.7631 | -84.4453 | -18.0523 | -8.27824 |
| 2 | -4.79219 | -38.3234 | 5.87447 | 75.591 | -8.90313 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |


So, when we approximate with N dimensions, we will use the first N columns of U, the first N weights of Sigma, and the first N rows of what I've been calling V (but which is really the first N rows of V^T/ the first N columns of V).
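Equivalently, instead of zeroing weights we can just slice, which makes the "first N of everything" reading explicit:

```python
N = 3
# First N columns of U, first N weights of Sigma, first N rows of V^T.
approx = u[:, :N] @ np.diag(d[:N]) @ v[:N, :]
assert np.allclose(approx, u @ newsig @ v)  # matches the zero-padded version above
```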


jwzimmer-zz commented Jul 19, 2021

Interpretation

The columns of U must be "eigencharacters" in terms of the fictional characters; the rows of V must be "eigentraits" in terms of fictional traits. That is the only way I can understand the dimensions of the relevant objects.

Therefore, what I want to look at is the characters that comprise the first few columns of U as linear combinations, and the traits that comprise the first rows of V as linear combinations.

Which traits are most important to each "dimension"? Those are the traits with the most extreme weights in each ROW of V. Which characters best exemplify each "dimension"? Those are the characters with the most extreme weights in each COLUMN of U. How much more important is the first "dimension" compared to the second? That is given by the relevant WEIGHT in Sigma.
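A sketch of how to read those off with pandas, using the small toy outputs (dfs, u, d, v from above) -- taking absolute values before nlargest surfaces the most extreme weights in either direction:

```python
i = 0  # the first "dimension"

# Traits driving eigentrait i: most extreme weights in ROW i of V (really V^T).
print(pd.Series(v[i], index=dfs.columns).abs().nlargest(3))

# Characters best exemplifying dimension i: most extreme weights in COLUMN i of U.
print(pd.Series(u[:, i], index=dfs.index).abs().nlargest(3))

# Relative importance of dimension i versus i+1:
print(d[i] / d[i + 1])
```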
