
Developing "coupling graphs" #169

Open
andkov opened this issue Feb 3, 2017 · 22 comments

Comments
@andkov
Member

andkov commented Feb 3, 2017

@ampiccinin @knighttime
Objective:
[image]

andkov added a commit that referenced this issue Feb 3, 2017
@andkov
Member Author

andkov commented Feb 3, 2017

See the script ./sandbox/coupling-graphs/coupling-graphs-1.R for the implementation of the graphs.

Graph 1


The graph shows 48 random individuals (males) from the MAP study. Red indicates process A (physical), black represents process B (cognitive). If a person has fewer than two data points, no lines are drawn.

on more similar scale than original metric

Not really sure what this means. Could you please elaborate?

Graph 2


The graph shows 48 random ids from the MAP study.

would it make sense to connect the dots?

here's a version of this graph with connected dots.

You can adjust or further develop graph in the ./sandbox/coupling-graphs/coupling-graphs-1.R script. I have prepared the functions to extract the data. Let me know if you have any questions!

@ampiccinin
Member

@andkov - wow! That was fast! Thanks!

Graph 1 is exactly what I meant. By more similar metric I meant that both would hover around zero, whereas their original raw metrics might be 0-44 versus 300-600.

Too bad they don't all look like person 198!!

I'm surprised that cognitive is black. I would have expected the physical measurements to be more variable.

For Graph 2, I wondered about a single summary graph that contained all individuals' data. That was why I suggested joining points within person - to have a sense of which points went together. Looking at person 198 again, though, shouldn't all their points fall between -1 and 1?

@andkov
Member Author

andkov commented Feb 3, 2017

By more similar metric I meant that both would hover around zero, whereas their original raw metrics might be 0-44 versus 300-600.

ah. of course. this explains why the black (cognitive, in this case boston story immediate) is all over the place, while red (physical, in this case fev) is relatively stable. Here's the descriptives for the observed measures:

> summary(ds_wide$observed_A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.420   1.720   2.145   2.134   2.560   3.930     749 
> summary(ds_wide$observed_B)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   8.000  10.000   9.609  11.000  12.000     395 

Residualizing helps reduce the span, but the two remain substantially different:

> summary(ds_wide$residual_A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-1.0970 -0.0893  0.0033 -0.0001  0.1033  0.6330     749 
> summary(ds_wide$residual_B)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-7.3840 -0.9522  0.0680 -0.0001  0.9649  3.6590     395 

And of course, with other measures this gap would be even greater. I'll turn these residual scores into z-scores, which should help with this issue.

@ampiccinin
Member

Thanks @andkov - I'm not sure we want z-scores. Our model is not standardized, and seeing how much a variable jumps around should be informative. When I said "more similar metric" I was indicating that it would be easy to plot with the same axis (both would have scores around 0).

My surprise about the cognitive residuals seeming larger than the physical was based on my memory of the observed raw trajectories. My sense had been that the physical measures jumped around quite a bit.

Are you able to resolve the apparent inconsistencies between Graphs 1 & 2 for person 198? Is it just my misinterpretation that is confusing me? Thanks!!

@ampiccinin
Member

@andkov ...are you saying that the physical looks like smaller variability because the original range is wider?

@andkov
Member Author

andkov commented Feb 3, 2017

a slight diversion.

I have been troubled by such patterns as 153 (see graph 1), so I started looking at the individual level data. Here's the data for this person:

> ds_wide %>% 
+   dplyr::filter(id == 153) %>% 
+   dplyr::select(id,observed_A, observed_B, predicted_A, predicted_B, residual_A, residual_B)
   id observed_A observed_B predicted_A predicted_B residual_A residual_B
1 153         NA          8          NA    9.556000         NA -1.1525397
2 153         NA         10          NA    9.466136         NA  0.3954367
3 153         NA          8          NA    9.370914         NA -1.0154452
4 153         NA         12          NA    9.282460         NA  2.0129001
5 153       2.29          9    2.255416    9.184888  0.2008373 -0.1369478

Even though there is only one observation on A, the model still produces estimates. I wonder if the peculiar patterns we see in the correlations may be due to the fact that many factor scores come from cases like these. I've looked at this study (MAP) to get these counts:

count_miss <- ds_wide %>% 
  dplyr::group_by(id) %>% 
  dplyr::mutate(
    n_nonmiss_a = sum(!is.na(observed_A)),
    n_nonmiss_b = sum(!is.na(observed_B))
  ) %>% 
  dplyr::ungroup() %>% 
  dplyr::distinct(id,n_nonmiss_a, n_nonmiss_b) 

count_miss %>% 
  dplyr::group_by(n_nonmiss_a) %>% 
  dplyr::count() %>% 
  dplyr::mutate(
    cumulative = cumsum(n)
  )
# A tibble: 6 × 3
  n_nonmiss_a     n cumulative
        <int> <int>      <int>
1           0    26         26
2           1    74        100
3           2    54        154
4           3    53        207
5           4    55        262
6           5    59        321

As you see, out of 321 individuals, 26 had no observations on process A and 100 had at most one. For process B (cognitive) the story is similar:

> count_miss %>% 
+   dplyr::group_by(n_nonmiss_b) %>% 
+   dplyr::count() %>% 
+   dplyr::mutate(
+     cumulative = cumsum(n)
+   )
# A tibble: 5 × 3
  n_nonmiss_b     n cumulative
        <int> <int>      <int>
1           1    42         42
2           2    39         81
3           3    34        115
4           4    42        157
5           5   164        321

In fact, there are even cases in which we have 0 observed points on A and only 1 observed point on B, and we still get an estimate for the factor score!

> count_miss %>% 
+   dplyr::group_by(n_nonmiss_a, n_nonmiss_b) %>% 
+   dplyr::count() %>% 
+   print(n=100)
Source: local data frame [21 x 3]
Groups: n_nonmiss_a [?]

   n_nonmiss_a n_nonmiss_b     n
         <int>       <int> <int>
1            0           1     9
2            0           2     6
3            0           3     5
4            0           4     3
5            0           5     3
6            1           1    33
7            1           2    19
8            1           3     4
9            1           4     3
10           1           5    15
11           2           2    14
12           2           3    14
13           2           4    10
14           2           5    16
15           3           3    11
16           3           4    16
17           3           5    26
18           4           4     9
19           4           5    46
20           5           4     1
21           5           5    58

Which is, I guess, because of the FIML option? To clarify, I'm using the observations THAT HAVE BEEN USED BY MPLUS during the estimation, recovering them from the output.

@ampiccinin
Member

@andkov - damn. I'm sure there are people who would be fine with this. I am not one of them. Thanks for checking it!!! In theory (at least with MAR), the S-S correlation should be the same with and without these people who have data on only one or the other variable. Might be worth checking, though. What happens if you restrict to only folk who have at least 2 of each variable?

@andkov
Member Author

andkov commented Feb 3, 2017

@ampiccinin , I can certainly do that. It will bring down the sample size, but not catastrophically.
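A minimal sketch of that restriction, assuming the `count_miss` logic shown earlier in this thread (the toy data below is purely illustrative, not the MAP data):

```r
library(dplyr)
# Sketch: keep only persons with at least two non-missing observations on
# EACH process. Toy data stands in for the real ds_wide object.
ds_wide <- data.frame(
  id         = rep(1:3, each = 3),
  observed_A = c(1, NA, 2,   NA, NA, 3,   1, 2, 3),
  observed_B = c(5, 6, NA,   7, NA, NA,   5, 6, 7)
)
count_miss <- ds_wide %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(
    n_nonmiss_a = sum(!is.na(observed_A)),
    n_nonmiss_b = sum(!is.na(observed_B))
  )
ids_keep <- count_miss %>%
  dplyr::filter(n_nonmiss_a >= 2, n_nonmiss_b >= 2) %>%
  dplyr::pull(id)
# Persons 1 and 3 survive; person 2 has only one observation on each process.
ds_restricted <- ds_wide %>% dplyr::filter(id %in% ids_keep)
```

The restricted data could then be refit, and the S-S correlation compared against the full-sample estimate.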

@andkov
Member Author

andkov commented Feb 3, 2017

Looking at person 198 again, though, shouldn't all their points fall between -1 and 1?

I wonder if it's just hard to see on this graph. Here's the data for person 198. They are almost all between -1 and 1. Does that resolve the inconsistency you thought there was in this case?

> ds_wide %>% 
+   dplyr::filter(id == 198) %>% 
+   dplyr::select(id,observed_A, observed_B, predicted_A, predicted_B, residual_A, residual_B)
   id observed_A observed_B predicted_A predicted_B residual_A residual_B
1 198      2.480         12    2.319000    11.09000   0.161000   0.910000
2 198      2.420         12    2.237832    11.01248   0.182168   0.987520
3 198      1.550         11    2.147230    10.92595  -0.597230   0.074050
4 198      2.285         12    2.060188    10.84282   0.224812   1.157180
5 198         NA         11          NA    10.75417         NA   0.245835

andkov added a commit that referenced this issue Feb 3, 2017
standardized residual, versions of graphs
@andkov
Member Author

andkov commented Feb 3, 2017

...are you saying that the physical looks like smaller variability because the original range is wider?

yes, exactly.

I'm not sure we want z-scores. Our model is not standardized, and seeing how much a variable jumps around should be informative. When I said "more similar metric" I was indicating that it would be easy to plot with the same axis (both would have scores around 0).

IMHO, I don't see a problem with a linear transformation of the residuals (dividing by the respective sd) for the sake of residual diagnostics. Not to do any computation on them, but just to adjust them so we can VIEW them better. I agree, the fluctuation on the raw scale should be informative, but if our objective is simply to compare the residuals visually, such standardization seems permissible.
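The rescaling described here can be sketched as follows; the toy values echo the residual summaries printed earlier in this thread and stand in for the real `ds_wide` columns:

```r
# Sketch of the proposed rescaling: divide each residual series by its own
# standard deviation so both processes can share one axis. This is a linear
# transformation only; within-person patterns are preserved.
ds_wide <- data.frame(
  residual_A = c(-1.0970, -0.0893, 0.0033, 0.1033, 0.6330),
  residual_B = c(-7.3840, -0.9522, 0.0680, 0.9649, 3.6590)
)
ds_wide$residual_A_scaled <- ds_wide$residual_A / sd(ds_wide$residual_A, na.rm = TRUE)
ds_wide$residual_B_scaled <- ds_wide$residual_B / sd(ds_wide$residual_B, na.rm = TRUE)
# Both scaled series now have unit standard deviation; the raw scale is
# recoverable by multiplying back by the respective sd.
```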

andkov added a commit that referenced this issue Feb 3, 2017
@andkov
Member Author

andkov commented Feb 3, 2017

@ampiccinin , here are the modifications based on your comments:

Graph 1

Here, residual scores have been divided by the respective sd of the process.

Graph 2

Here, all individuals' residuals have been placed on the same XY canvas. Drawing lines through the points is, of course, possible, but I'm afraid it is not practical: after 10+ individuals it becomes a mess. Here's the same 48 persons, with color mapped onto id.

A more illuminating display, IMHO, would be to get rid of the lines and map the value of a covariate onto color, like we did in the enhanced anatomy plots (e.g.). Here's a sketch of such a graph using all individuals and mapping the binary SMOKE covariate:
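In ggplot2 terms, a sketch of that display might look like the following; the simulated data and the `smoke` flag are illustrative assumptions, not the project's actual variables:

```r
library(ggplot2)
# Sketch: all persons' residual pairs on one canvas, with a binary covariate
# mapped to color. Simulated values stand in for the extracted residuals.
set.seed(169)
ds_plot <- data.frame(
  residual_A = rnorm(200, sd = 0.3),   # physical residuals (narrower spread)
  residual_B = rnorm(200, sd = 1.5),   # cognitive residuals (wider spread)
  smoke      = factor(sample(c("yes", "no"), 200, replace = TRUE))
)
g2 <- ggplot(ds_plot, aes(x = residual_A, y = residual_B, color = smoke)) +
  geom_point(alpha = 0.5) +
  coord_fixed() +   # one unit on x equals one unit on y
  theme_minimal()
# print(g2)  # render the plot
```

Coupling would show up here as an elliptical tilt in the cloud, overall or within a color group.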

@andkov
Member Author

andkov commented Feb 3, 2017

@ampiccinin

That was fast!

The graphing itself is relatively easy once the data is extracted and shaped into the needed form. It's the data wrangling that takes 80-90% of the time in a normal graphing project. In this case, however, I've spent so much time learning these structures and developing custom functions to process them (and failing most of the time) that it became possible to turn this around so fast.

Portland is a metaphor for the industrial age coming to data science: massive initial investment, but cheap, fast, and reliable production once the machinery is set up and calibrated (and the engineers are trained!). You are starting to see that initial investment (giving me time and bringing @wibeasley along) pay off. I wish it were faster than that, but it seems that none of us had a clear idea of the magnitude of the engineering challenge that coordinated analysis presents on the scale you've envisioned. Well, retrospectively, I'm glad I was overly optimistic: not sure I'd have the heart to jump into something like that with full disclosure, given the time lag in publications that it brings. :)

The bottom line: here's to your patience and my perseverance! 👍

@ampiccinin
Member

@andkov :)

you are absolutely right -

  • 198 is fine. My confusion was due to failing eyesight...I thought the -0.6 was -6.0!! Might make sense to force the same axis for x & y?
  • the lines in graph 2 are a mess.

...Presumably the example we are looking at does not show (statistically significant) coupling (right?) and that we would see something more of an elliptical cloud if there were. I find it interesting that there seem to be cases with a positive association, and others with negative (and a bunch with no association or [mainly] not enough data on both variables). I suspect we would be going down a rabbit hole to chase whether there was something different about the + and - coupling people, but maybe when the Portland papers are submitted, if we are bored and looking for something to do, we could play around with this (with a strict time limit so we don't end up in too long a tea party with the Mad Hatter).

@ampiccinin
Member

@knighttime @jru7sh - what do you think of the latest "Graph2" to represent the coupling?

the only thing that bothers me ( @andkov ) is having multiple points per person in a single graph that does not differentiate among people...but I can live with this.

Can we / would it be useful to use this for Jamie's BSIT project? Is it easy enough to modify the syntax for her variables?

@andkov
Member Author

andkov commented Feb 3, 2017

Might make sense to force the same axis for x & y?

yes. good point. will adopt in the next iteration.

Is it easy enough to modify the syntax for her variables?

Yes, in terms of graphing, these functions are transferable (I'm assuming @knighttime is dealing with bivariate growth models). However, if there are multiple studies + multiple models, it might take some time to set up the production line (I'm talking 2-3 hours on my part + some time by the analyst to learn the setup).

@andkov
Member Author

andkov commented Feb 3, 2017

@ampiccinin

Presumably the example we are looking at does not show (statistically significant) coupling (right?) and that we would see something more of an elliptical cloud if there were.

If my understanding of "coupling" is correct (association between the residuals of the two processes), then that's exactly my interpretation of these patterns (latest version of graph 2).

@ampiccinin
Member

@andkov : For now, @knighttime 's purpose is plots for a single study.

I hope we are then talking 2-3 minutes, rather than hours...

@ampiccinin
Member

@andkov - we could plot a significant association, such as OCTO female pulmonary block or mmse.

...but don't let me take you too far away from what you were otherwise planning to do today!! this can wait.

andkov added a commit that referenced this issue Feb 3, 2017
@andkov
Member Author

andkov commented Feb 3, 2017

@ampiccinin

I hope we are then talking 2-3 minutes, rather than hours...

That's not an unreasonable estimate. However, it all boils down to the options @knighttime used for modeling and whether they permit easy extraction. That's easy to remedy, but 2-3 minutes is the estimate only if everything is already set up as needed.

@knighttime

@andkov
Thanks Andrey. This looks great!

My coupling model is using MAP.

I understand the code for the graph but it is pulling the model info from the helper functions - those I can't figure out.

Just looking at the code for the graph, though, it looks like I can substitute my own residual for cognition for "residual_A" and my residual for BSIT for "residual_B", and then it should work. Is there a shorthand to get that? resid(ds$cogn_ep) maybe?
`geom_line(aes(y=residual_A), color="red") +`

I will try it out tomorrow and see how it goes

@ampiccinin
Member

@knighttime @andkov - possibly important detail: Jamie's models are in R. Presumably she can save model implied residuals and use them in your syntax that grabs them from Mplus? (The fact that you are running the models in different software just struck me now)

@andkov
Member Author

andkov commented Feb 5, 2017

@ampiccinin @knighttime
Yes, different model objects (R vs Mplus) require different plumbing. But that shouldn't affect the graphing function.

@knighttime, I think you are on the right track. But I doubt it would be as simple as that because the model shape is different (is it?). Best strategy is to study the data structure that goes into the graph and try to replicate that. Let me know if you get stuck.

I'm pulling the models from Mplus outputs; you can disregard what happens before the graphing.
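For what it's worth, the `ds_wide` printouts earlier in this thread suggest the graphing functions expect roughly one row per person-occasion with paired residual columns. A hypothetical target shape (column names taken from those printouts; the values below are copied from the person-198 excerpt) would be:

```r
# Hypothetical target shape for the graphing code, inferred from the ds_wide
# printouts in this thread. Residuals from an R model (e.g. via resid())
# would need to be reshaped into this form: one row per person-occasion.
ds_graph <- data.frame(
  id         = c(198, 198, 198, 198),
  residual_A = c(0.161000, 0.182168, -0.597230, 0.224812),  # physical
  residual_B = c(0.910000, 0.987520,  0.074050, 1.157180)   # cognitive
)
```

Replicating this structure from the R model objects, rather than editing the plotting calls, is likely the cleaner substitution path.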

andkov added a commit that referenced this issue Feb 7, 2017
but the flexibility of data selection leaves something to be desired. Needs more work.
andkov added a commit that referenced this issue Feb 7, 2017
andkov added a commit that referenced this issue Feb 7, 2017
andkov added a commit that referenced this issue Feb 7, 2017
The `forestplot::forestplot` function is a suboptimal solution. Can't control graph size, font size, and many other things. Choose ggplot-based graphics next time.
andkov added a commit that referenced this issue Feb 7, 2017