
Sankey diagram


What’s this?

Macro for drawing a Sankey diagram using SAS GRAPH.

A Sankey diagram is a visualization that depicts flow from one domain of values to another. Sankey diagrams are useful for showing flows and relationships among multiple categories.

In pharmacoepidemiology studies, this diagram is used to show changes in a patient's state over time, such as medication switching, disease stage progression, or movement through treatment pathways. (1)

Elements of a Sankey diagram

A Sankey diagram consists of three elements.

_images/elements_of_sankey_diagram.png

  • domain

    A group of data, placed along the x-axis.

  • node

    A category within a domain, displayed as a rectangle.

  • link (flow)

    A sigmoid shape that connects a node (the source node) to a node in the next domain (the target node). The amount of flow is displayed as the width of the sigmoid shape. Links connect nodes in order from the left domain to the right domain.

Rules

  • Nodes are stacked in node category order.

  • All records of the input data are displayed in every domain.

  • Nodes in the first domain should be source nodes only, not target nodes.

  • Nodes in the last domain should be target nodes only, not source nodes.

  • Right-to-left links (circular links) are not allowed.

Input data

Set the node categories in each domain variable, as shown below. Applying the node category format to each domain variable is recommended. The domain variables should be numeric.

Null values are allowed. A null value is treated as a null node and handled the same as other nodes.

domain1    domain2    domain3
node A     node B     node C
node B     node A     node C
node D     node A     node B
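
As a minimal sketch of the layout above, the following steps build an input dataset with one record per subject and numeric domain variables. The dataset name example, the format name nodefmt and the codes 1 to 4 are assumptions made for illustration; they are not defined by the macro.

```sas
/* Hypothetical node format: the numeric codes 1-4 are assumed for this sketch. */
proc format;
  value nodefmt
    1="node A"
    2="node B"
    3="node C"
    4="node D"
  ;
run;

/* One record per subject; each domain variable holds a numeric node code. */
data example;
  input domain1 domain2 domain3;
  format domain1 domain2 domain3 nodefmt.;
  datalines;
1 2 3
2 1 3
4 1 2
;
run;
```

Because the variables are numeric and carry the node format, the formatted values match the three rows of the table.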

The domain format is required.

code

proc format;
  value domainfmt
    1="Domain 1"
    2="Domain 2"
    3="Domain 3"
  ;
run;

Syntax

ods graphics / < graphics option > ;
ods listing gpath=< output path >;

%macro sankey(
    data=,
    focus=None,

    domain=,
    domainfmt=,
    domaintextattrs=(color=black size=11),

    gap=5,
    nodefmt=auto,
    nodewidth=0.2,
    nodeattrs=auto,
    nodename=true,
    nodetextattrs=(color=black size=9),

    linktext=true,
    linktext_offset=0.05,
    linkattrs=auto,
    linktextattrs=(color=black size=8),
    stat=both,
    unit=,
    reverse=false,

    legend=false,
    palette=sns);

Parameters

  • data : dataset name (required)

    Input data. The rename option is available, but the keep option is not. The input data is copied to the WORK library, keeping only the domain variables set in the domain parameter.

Domain settings

  • domain : variable name (required)

    Domain variables, listed in link order. For example, if the domain parameter is set to “var1 var2 var3”, var1 is connected to var2 and var2 is connected to var3.

  • domainfmt : format name (required)

    Format for the domains. The format catalog should be saved in the WORK library and the format values should be numeric. The interval between domains depends on the values of the format. The labels of the format are displayed on the plot.

  • domaintextattrs : text appearance (optional)

    The appearance of the domain text. Default is “(color=black size=11)”.
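
A minimal sketch of the two required domain parameters follows. The dataset example, the variables var1 to var3 and the format visitf are hypothetical names chosen for illustration; only the parameter names come from the syntax above.

```sas
/* Hypothetical domain format: one numeric value per domain, in link order. */
proc format;
  value visitf
    1="Baseline"
    2="Month 1"
    3="Month 2"
  ;
run;

/* var1 is linked to var2, and var2 to var3, following the order given in domain=. */
%sankey(
    data=example,
    domain=var1 var2 var3,
    domainfmt=visitf
);
```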

Node settings

  • gap : numeric (optional)

    The gap between nodes. Default is 5.

  • nodefmt : keyword or format name (optional)

    Format for the nodes. If “auto” is set, the format is taken from the domain variables in the input data. To display in the legend a node that does not exist in the input data, set the node format in this parameter. The format catalog should be saved in the WORK library. Default is “auto”.

  • nodewidth : numeric (optional)

    The width of the node shapes. Default is 0.2.

  • nodeattrs : keyword or fill appearance (optional)

    Node appearance. If “auto” is set, the fill color of each node is set based on its node category. To give all nodes the same color, set a fill attribute in this parameter.

    For example, the fill appearance below sets all nodes to blue.

    nodeattrs=(color=blue)

    Default is “auto”.

  • nodename : bool (optional)

    Displays the node name (node category name). Default is “true”.

  • nodetextattrs : text appearance (optional)

    The appearance (font size, font color and font weight) of the node text. Default is “(color=black size=9)”.
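
The node settings above can be combined in one call, sketched below. The dataset graph, the domain variables and the formats domainf and nodef are assumed to exist already (they are placeholders here), and the specific values are illustrative rather than recommended.

```sas
/* Assumes a dataset GRAPH and the formats DOMAINF and NODEF already exist. */
%sankey(
    data=graph,
    domain=day0 day30 day60,
    domainfmt=domainf,
    gap=2,                               /* narrower gap than the default of 5 */
    nodewidth=0.1,                       /* thinner nodes than the default of 0.2 */
    nodefmt=nodef,                       /* take node categories from the format, not the data */
    nodetextattrs=(color=black size=10)
);
```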

Other settings

  • focus : where statement (optional)

    Specifies the focused node. Links whose source node is not the node set by the focus parameter are colored grey, and their link text is not displayed. The focused node is specified with a condition, as in a subsetting IF statement. For example, to focus on the “type A” node of the second domain, set the parameter as below, where the value of the type A node is 1.

    focus=(node=1 and domain=2)

    Default is “none” (not focused).

  • stat : keyword (optional)

    Displays the frequency or percentage of nodes and links. The following keywords are available.

    • FREQ: frequency

    • PCT: percentage (displayed to the second decimal place)

    • BOTH: frequency and percentage

    • NONE: not displayed

    The percentage is calculated with the following equation.

    percentage = frequency of the node or link / number of observations in the input data * 100

  • unit : text (optional)

    The unit of frequency (suffix). If the “FREQ” or “BOTH” keyword is set in the stat parameter, the unit is displayed in the node text and link text.

    Default is “” (null).

  • legend : bool (optional)

    If “true”, the legend of node categories is displayed. Default is “false”.

  • palette : keyword (optional)

    Color palette for fills, lines and markers. The palettes below are available. See the color palette section of the introduction page. Default is “SNS” (Seaborn default palette).

    • SAS

    • SNS (Seaborn)

    • STATA

    • TABLEAU
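
To make the percentage equation of the stat parameter concrete, the sketch below reproduces the arithmetic with PROC FREQ on a small made-up dataset. The dataset and its values are assumptions for illustration, and PROC FREQ is used only to show the calculation; how the macro computes the values internally is not described here.

```sas
/* Four hypothetical records with two numeric domain variables. */
data example;
  input domain1 domain2;
  datalines;
1 2
1 2
1 1
2 1
;
run;

/* Node frequencies in the first domain: node 1 has 3 of 4 records, i.e. 3 / 4 * 100 = 75 percent. */
proc freq data=example;
  tables domain1 / nocum;
run;

/* Link frequencies between the two domains, using the same denominator:
   the link from node 1 to node 2 has 2 of 4 records, i.e. 50 percent. */
proc freq data=example;
  tables domain1*domain2 / list nocum;
run;
```

Both node and link percentages use the total number of observations as the denominator, so the percentages of the links leaving a node add up to that node's percentage.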

Example

Basic sankey diagram

The regimen changes of patients are displayed as a Sankey diagram using this macro. The variables (day0, day30, day60, day120) hold the regimen selected for each patient at 0, 30, 60 and 120 days. Each variable is set to a regimen type.

code

/* domain format (labels on the x-axis) and node format (regimen names) */
proc format;
  value domainf
    1="day0"
    2="day30"
    3="day60"
    4="day120";

  value nodef
    0="Regimen A"
    1="Regimen B"
    2="Regimen C"
    3="Regimen D"
    4="Regimen E"
    5="Regimen F"
    6="Regimen G"
    7="Regimen H"
    8="Regimen I"
  ;
run;

/* apply the node format to each domain variable */
data graph;
  set raw;
  format day0 day30 day60 day120 nodef.;
run;

ods graphics / height=15cm width=20cm imagefmt=svg imagename="sankey_basic" noborder;
ods listing gpath="<output path>";
%sankey(
    data=graph,
    domain=day0 day30 day60,
    domainfmt=domainf
);

_images/sankey_basic.svg

Adjust the domain interval

The interval between domains is defined by the values of the domain format. When the domain format values are changed, the interval between domains changes accordingly.

If the format is changed as below, the plot from the previous section “Basic sankey diagram” becomes as follows.

code

proc format;
  value domainf
    1="day0"
    2="day30"
    4="day60"
    8="day120";
run;

_images/sankey_change_interval.svg
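
As a further sketch of the same idea, the format values can be chosen so that the spacing follows the actual time between visits. The values 1, 2, 3 and 5 below are an assumption for illustration: they treat 30 days as one unit, so the 60-day step from day60 to day120 is two units. This has not been checked against the macro's output.

```sas
/* Hypothetical spacing proportional to elapsed days (30 days = 1 unit). */
proc format;
  value domainf
    1="day0"
    2="day30"
    3="day60"
    5="day120";
run;
```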

Node format and legend

If the code of the previous section “Basic sankey diagram” is changed as below, the colors of “Regimen E”, “Regimen G” and “Regimen I” differ from the previous section.

By default, node colors are defined based on the input data and the domain variables. When the input dataset is modified, the color of a node might change even though the node is the same.

code

%sankey(
    data=raw,
    domain=day0 day30,
    domainfmt=domainf,
    legend=true
);

_images/sankey_not_set_nodefmt.svg

If the nodefmt parameter is set, node colors are defined based on the node format, so the node colors are independent of the dataset. All elements of the format are displayed in the legend if the legend parameter is true.

code

%sankey(
    data=raw,
    domain=day0 day30,
    domainfmt=domainf,
    nodefmt=nodef,
    legend=true
);

_images/sankey_set_nodefmt.svg

Change the fill appearance

By default, the node color and the link color are defined based on the node category. If nodeattrs or linkattrs is set, the fill color of all nodes or all links is changed. The nodefmt parameter is then ignored.

code

%sankey(
    data=raw,
    domain=day0 day30 day60,
    domainfmt=domainf,
    nodeattrs=(color=grey)
);

_images/sankey_set_nodeattrs.svg

code

%sankey(
    data=raw,
    domain=day0 day30 day60,
    domainfmt=domainf,
    linkattrs=(color=skyblue transparency=0.7)
);

_images/sankey_set_linkattrs.svg

Change the text appearance

The domaintextattrs, nodetextattrs and linktextattrs parameters modify the text appearance of domains, nodes and links.

code

%sankey(
    data=raw,
    domain=cyl vs gear,
    domainfmt=domainf,
    nodeattrs=(color=grey),
    linkattrs=(color=skyblue transparency=0.7),
    domaintextattrs=(color=red size=12),
    nodetextattrs=(color=blue size=12),
    linktext_offset=0.15,
    linktextattrs=(color=green size=8)
);

_images/sankey_set_textattrs.svg

Focus parameter

The focus parameter is useful for highlighting specified nodes. Nodes that are not highlighted, and links whose source node is not highlighted, are colored grey, and their link text is not displayed.

code

proc format;
  value domainf
    1="day 0"
    2="day 30"
    3="day 60"
    4="day 90";

  value nodef
    1="Drug A"
    2="Drug B"
    3="Drug C"
    4="Drug D"
    5="Lost to Follow-up"
  ;
run;

data graph;
  set raw;
  format var1-var4 nodef.;
run;

options mprint;
ods graphics / height=15cm width=25cm imagefmt=svg imagename="sankey_focus" noborder;
ods listing gpath="<output path>";

%sankey(
    data=raw,
    focus=(node=1),
    domain=var1 var2 var3 var4,
    domainfmt=domainf,
    gap=1,
    reverse=true,
    nodewidth=0.1,
    nodename=false,
    stat=freq,
    legend=true,
    linktext_offset=0.03,
    palette=sns);

_images/sankey_focus.svg
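
The example above focuses on node 1 in every domain. As a hedged variation, the condition can also name a domain, in the form given in the description of the focus parameter. The sketch below assumes that node and domain in the condition refer to the node value and the domain format value, and reuses the dataset, variables and formats of the example above; the choice of “Drug B” at “day 60” is arbitrary.

```sas
/* Focus only on "Drug B" (node value 2) at "day 60" (domain value 3). */
%sankey(
    data=raw,
    focus=(node=2 and domain=3),
    domain=var1 var2 var3 var4,
    domainfmt=domainf,
    stat=freq,
    legend=true
);
```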