SVD/PCA on "character space" #6
Making a plan:
- Codebook / documenting what things are
- Original data artifacts
- Cleaned data artifacts
Rerunning the SVD without removing any means
Artifacts: all the artifacts created as part of the SVD process are saved to files with names matching the variables they were assigned to as output of runSVD, in this commit: 28845f7. This is how I reran the SVD:
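Roughly, something like the following, assuming runSVD wraps numpy.linalg.svd and that the non-numeric columns are called name and work (both assumptions; the commit above has the exact code):

```python
import numpy as np
import pandas as pd

def runSVD(matrix):
    """Assumed wrapper: full SVD, with Sigma expanded to an m x n matrix."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=True)
    Sigma = np.zeros(matrix.shape)
    np.fill_diagonal(Sigma, s)
    return U, Sigma, Vt

# Load the trait scores (file layout and column names are assumptions).
df_traits = pd.read_json("July2021_df_traits.json")
numeric = df_traits.drop(columns=["name", "work"], errors="ignore")

# SVD on the raw scores, with no mean removal.
U, Sigma, V = runSVD(numeric.to_numpy(dtype=float))

# Save each output under the name of the variable it was assigned to
# (file format here is an assumption).
for name, arr in [("U", U), ("Sigma", Sigma), ("V", V)]:
    np.savetxt(f"{name}.csv", arr, delimiter=",")
```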
Rerunning SVD with the mean removed from each trait
Artifacts: all the artifacts created as part of the SVD process are saved to files with names matching the variables they were assigned to as output of runSVD, in this commit: 0ed3f65. The code for rerunning it:
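A sketch of the mean-removed version, under the same assumptions as above (the committed code is authoritative):

```python
import numpy as np
import pandas as pd

# File layout and column names are assumptions; see the cleaned-data artifacts.
df_traits = pd.read_json("July2021_df_traits.json")
numeric = df_traits.drop(columns=["name", "work"], errors="ignore").astype(float)

# Remove the mean of each trait (column) before decomposing.
df2 = numeric - numeric.mean(axis=0)

# runSVD presumably wraps this call.
U, s, Vt = np.linalg.svd(df2.to_numpy(), full_matrices=True)
```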
Sanity check
Since there are dataframes that should differ only by their column headers (one with the original BAP trait labels and one with the anchor words: July2021_df_bap.json and July2021_df_traits.json), I can sanity check that the output from SVD is the same for each of these dataframes (since SVD doesn't know about the column headers). When the code below is run, the assert statements pass (which is good). To rerun, use the script saved here: d8d7c4c#diff-bb5be8dc2521f069449811f33a63824ae9dd7b3b0391c62d8fbdd7ab495809f8 How I sanity-checked:
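A sketch of that check (the committed script has the real version); it assumes both JSON files load into dataframes whose only non-numeric columns are the label columns:

```python
import numpy as np
import pandas as pd

# The two dataframes should differ only in their column labels.
df_bap = pd.read_json("July2021_df_bap.json")
df_traits = pd.read_json("July2021_df_traits.json")

# Drop the non-numeric columns before decomposing (column names are assumptions).
A = df_bap.drop(columns=["name", "work"], errors="ignore").to_numpy(dtype=float)
B = df_traits.drop(columns=["name", "work"], errors="ignore").to_numpy(dtype=float)

# SVD ignores column labels, so the factorizations should match.
U_a, s_a, Vt_a = np.linalg.svd(A, full_matrices=True)
U_b, s_b, Vt_b = np.linalg.svd(B, full_matrices=True)

assert np.allclose(U_a, U_b)
assert np.allclose(s_a, s_b)
assert np.allclose(Vt_a, Vt_b)
```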
Transposing the matrix
We've been using the matrix in the form where each character is a row and the traits are the columns. While I'm redoing the SVD anyway, I made a version with the matrix transposed, so that the characters are the columns and the traits are the rows, just in case we want to do something with that. The artifacts (outputs from SVD and the transposed dataframes) and the script are saved in this commit: 2560f87
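For reference, a sketch of the transposed run under the same assumptions as the earlier snippets:

```python
import numpy as np
import pandas as pd

# File layout and column names are assumptions.
df_traits = pd.read_json("July2021_df_traits.json")
numeric = df_traits.drop(columns=["name", "work"], errors="ignore").astype(float)

# Transpose: traits become the rows, characters become the columns.
df_transposed = numeric.T

U_t, s_t, Vt_t = np.linalg.svd(df_transposed.to_numpy(), full_matrices=True)
```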
(In nextstep.py) To get just the characters from a certain work, e.g. Pride and Prejudice (rows 259-268 in df_traits): the means for each trait have already been removed from df2, as well as the "extra" columns (name, work, etc.), so we can select the matching rows by index. Then we can run SVD on a single work (not for any results per se, but to get a smaller artifact that's easier to understand as an intermediate step).
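A sketch of that selection and the single-work SVD; df2 is rebuilt inline here, and the column names and the inclusiveness of the row range are assumptions:

```python
import numpy as np
import pandas as pd

# Rebuild the mean-removed matrix (column names are assumptions).
df_traits = pd.read_json("July2021_df_traits.json")
numeric = df_traits.drop(columns=["name", "work"], errors="ignore").astype(float)
df2 = numeric - numeric.mean(axis=0)

# Rows 259-268 are the Pride and Prejudice characters; .iloc's end index is
# exclusive, so 269 is used on the assumption the range is inclusive.
pp = df2.iloc[259:269]

# SVD on the single-work slice.
U_pp, s_pp, Vt_pp = np.linalg.svd(pp.to_numpy(), full_matrices=True)
```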
To get a really small matrix to use as a toy model: using just the rows corresponding to characters from Pride and Prejudice, we can see which traits contribute most by taking the absolute value of all scores and then summing per column. Then we can get the top n = 15 traits with the largest sums and put those traits into a list (see the sketch after the next paragraph).
We can run SVD on this toy matrix and note the dimensions of what it yields.
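A sketch of building the toy matrix and decomposing it; the variable names here are placeholders:

```python
import numpy as np
import pandas as pd

# Rebuild the mean-removed Pride and Prejudice slice from the earlier sketches.
df_traits = pd.read_json("July2021_df_traits.json")
numeric = df_traits.drop(columns=["name", "work"], errors="ignore").astype(float)
pp = (numeric - numeric.mean(axis=0)).iloc[259:269]

# Sum of absolute scores per trait, as a rough measure of contribution.
col_weight = pp.abs().sum(axis=0)

# The n = 15 traits with the largest sums, as a plain list.
top_traits = col_weight.nlargest(15).index.tolist()

# The toy matrix: 10 characters x 15 traits.
df_toy = pp[top_traits]

# SVD of the toy: U is 10 x 10, there are 10 singular values, Vt is 15 x 15.
U, s, Vt = np.linalg.svd(df_toy.to_numpy(), full_matrices=True)
print(U.shape, s.shape, Vt.shape)
```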
An even smaller toy matrix
We can then run SVD on it. To make the matrices easier to read, we can make dataframes from the arrays returned by runSVD: one for the matrix U, one for the matrix Sigma, and one for the matrix V.
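A sketch of that, taking the top 5 traits instead of 15 (an assumption, chosen to get a 10 x 5 matrix like the one discussed below) and wrapping the outputs in labelled dataframes:

```python
import numpy as np
import pandas as pd

# Rebuild the Pride and Prejudice slice and keep only the top 5 traits.
df_traits = pd.read_json("July2021_df_traits.json")
numeric = df_traits.drop(columns=["name", "work"], errors="ignore").astype(float)
pp = (numeric - numeric.mean(axis=0)).iloc[259:269]
dfsmall = pp[pp.abs().sum(axis=0).nlargest(5).index]

U, s, Vt = np.linalg.svd(dfsmall.to_numpy(), full_matrices=True)
Sigma = np.zeros(dfsmall.shape)
np.fill_diagonal(Sigma, s)

# Labelled dataframes are easier to read than raw arrays.
dfU = pd.DataFrame(U)                               # 10 x 10
dfSigma = pd.DataFrame(Sigma)                       # 10 x 5, weights on the diagonal
dfV = pd.DataFrame(Vt, columns=dfsmall.columns)     # 5 x 5, columns = original traits
```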
Continuing with the above toy model, just trying to understand the SVD... If you dot U with Sigma, you get a 10 x 5 matrix, which is the first 5 columns of U each multiplied by the corresponding weight from Sigma, so column 1 of U is multiplied by 167.959, weight 1 from Sigma; column 2 of U is multiplied by 103.103, weight 2 from Sigma, etc.
If you dot Sigma with V, you get a 10 x 5 matrix in which the new first row is the first row of V multiplied by the first weight in Sigma; the second row is the second row of V multiplied by the second weight in Sigma, etc.
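A self-contained illustration of those two products, using a random 10 x 5 matrix as a stand-in for dfsmall:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 5))          # stand-in for the 10 x 5 toy matrix

U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

# U @ Sigma is 10 x 5: column i of U scaled by weight i.
US = U @ Sigma
assert np.allclose(US[:, 0], U[:, 0] * s[0])

# Sigma @ Vt is 10 x 5: row i of Vt scaled by weight i.
SV = Sigma @ Vt
assert np.allclose(SV[0], Vt[0] * s[0])
```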
Note: V is actually V^T; it has already been transposed when it is returned by runSVD. The matrix product of Sigma dot V is what we will dot with U in order to get back our original data matrix, so the weighted rows of V and the columns of U are what describe our original matrix. We can tune how good an approximation we want by choosing how many non-zero weights to keep in Sigma. Since they are in descending order of importance, let's say we don't want to use all 5 rows of V in reconstructing our matrix; let's use the first three. We can keep only the first 3 weights of Sigma, and therefore use only the first 3 rows of V, and then approximate our original data matrix by dotting U with this truncated Sigma dot V.
This isn't a great approximation of our original data, but it isn't totally insane looking... to sanity check, if we use 4 weights instead of 3, the approximation should improve, so let's check that that actually happens:
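A runnable version of that check, again on a random stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 5))          # stand-in for dfsmall

U, s, Vt = np.linalg.svd(A, full_matrices=True)

def approx(k):
    """Reconstruct A keeping only the first k weights of Sigma."""
    s_k = s.copy()
    s_k[k:] = 0.0
    Sigma_k = np.zeros(A.shape)
    np.fill_diagonal(Sigma_k, s_k)
    return U @ Sigma_k @ Vt

err3 = np.linalg.norm(A - approx(3))
err4 = np.linalg.norm(A - approx(4))
assert err4 < err3            # more weights -> closer to the original
```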
And this approximation is indeed closer to the original matrix (dfsmall) that we started out with earlier. Great! So now we should get into the details of what is happening when we take the dot product of U and the matrix we get from taking the dot product of Sigma with V. The number of non-zero weights we include in Sigma determines how many rows of V will be used to approximate our original matrix. When you take the dot product of U and this other matrix, SigmadotV, the rows of U are combined with the columns of SigmadotV, which only have as many nonzero values as we've chosen to include in our approximation, so the last few entries of EVERY row of U are multiplied by 0 (and disregarded). Therefore, the last few COLUMNS of U have no impact on the values in our final approximation. So when we approximate our original matrix using U, Sigma, and V, we choose the first N weights of Sigma, the first N rows of V, and the first N columns of U to be combined into the final result. The product of our new Sigma (only 3 weights) dotted with V gives us a 10 x 5 matrix containing the first 3 rows of V, each weighted by the corresponding weight in Sigma.
So, when we approximate with N dimensions, we will use the first N columns of U, the first N weights of Sigma, and the first N rows of what I've been calling V (but which is really V^T, so these are really the first N columns of V).
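A small check that this "first N of everything" view matches zeroing out the trailing weights:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 5))          # stand-in for dfsmall
U, s, Vt = np.linalg.svd(A, full_matrices=True)

N = 3
# Full-shape reconstruction with the trailing weights zeroed out...
s_cut = s.copy()
s_cut[N:] = 0.0
Sigma_cut = np.zeros(A.shape)
np.fill_diagonal(Sigma_cut, s_cut)
full_form = U @ Sigma_cut @ Vt

# ...equals the truncated form using only the first N columns of U,
# the first N weights, and the first N rows of Vt.
truncated = U[:, :N] @ np.diag(s[:N]) @ Vt[:N, :]
assert np.allclose(full_form, truncated)
```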
Interpretation
The columns of U must be "eigencharacters" in terms of the fictional characters; the rows of V must be "eigentraits" in terms of the fictional traits. That is the only way I can make sense of the dimensions of the relevant objects. Therefore, what I want to look at is the characters that comprise the first few columns of U as linear combinations, and the traits that comprise the first few rows of V as linear combinations. Which traits are most important to each "dimension"? Those are the traits with the most extreme weights in each ROW of V. Which characters best exemplify each "dimension"? Those are the characters with the most extreme weights in each COLUMN of U. How much more important is the first "dimension" compared to the second? That is given by the relevant WEIGHT in Sigma.
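A sketch of how those look-ups might go, with random stand-in data and hypothetical character and trait labels (the real run would use the labelled dataframes built earlier):

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for the real character and trait names.
rng = np.random.default_rng(0)
characters = [f"character_{i}" for i in range(10)]
traits = [f"trait_{j}" for j in range(5)]
A = pd.DataFrame(rng.normal(size=(10, 5)), index=characters, columns=traits)

U, s, Vt = np.linalg.svd(A.to_numpy(), full_matrices=True)
dfU = pd.DataFrame(U, index=characters)       # columns = "eigencharacters"
dfV = pd.DataFrame(Vt, columns=traits)        # rows = "eigentraits"

dim = 0
# Traits with the most extreme weights in row `dim` of V.
print(dfV.iloc[dim].abs().nlargest(3))
# Characters with the most extreme weights in column `dim` of U.
print(dfU[dim].abs().nlargest(3))
# Relative importance of dimension 0 versus dimension 1.
print(s[0] / s[1])
```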
https://openpsychometrics.org/_rawdata/
from tropes meeting with peter & phil