Visualisations for standard names data (includes POC)? #110

Closed
@sadielbartholomew

There are various aspects of the standard names that I find interesting & I suspect would be informative to others in a short summary form, notably:

  • the total number of names in the official table & how that has evolved over time;
  • the names that have been added on a per-version basis, in particular the set of new names that has been added relative to the most-recent version.

Such data lends itself well to plots or visualisations, & given the systematic XML encoding of the names in the per-version hierarchical directory structure under Data/cf-standard-names, it is possible to write a script that grabs the relevant data & generates those visualisations, & that can be re-run at any time to pick up updates, including new versions.

I think a few such visualisations could be useful to have on the site, either directly within the http://cfconventions.org/standard-names.html page, or on a page linked off it. They would mean that anyone can see at a glance how the name table has grown & what has been added, whereas at the moment I think someone would have to trawl through the version directories, or do some analysis via coding or using some tool, to work this out.

I raise this in particular because back in February I wrote a script to plot such aspects for interest, & with the CF Workshop coming up I revisited it. As outlined above, I now have (contained in my personal branch here) a Python script to:

  • grab the string names given canonically per-version in this repo (though probably not in the most efficient way, using a basic regular expression search);
  • calculate e.g. the total count by version & by date, & differences across versions, &
  • visualise those (by plot & word cloud, see below).
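
As a rough illustration of the first two steps listed above (not the actual script in the branch), a minimal sketch might look like the following. It assumes each version directory under Data/cf-standard-names has a purely numeric name & contains a file called `cf-standard-name-table.xml` somewhere beneath it, with each name encoded as `<entry id="...">`. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: each standard name appears as <entry id="some_name"> in the XML.
ENTRY_PATTERN = re.compile(r'<entry\s+id="([^"]+)"')


def extract_names(xml_text):
    """Return the set of standard names found in one XML table (as text)."""
    return set(ENTRY_PATTERN.findall(xml_text))


def names_by_version(root):
    """Map version number -> set of names, for numerically named subdirectories.

    Assumption: the table file is called cf-standard-name-table.xml and sits
    somewhere below each version directory.
    """
    result = {}
    for version_dir in Path(root).iterdir():
        if not (version_dir.is_dir() and version_dir.name.isdigit()):
            continue
        for xml_file in version_dir.rglob("cf-standard-name-table.xml"):
            text = xml_file.read_text(encoding="utf-8")
            result[int(version_dir.name)] = extract_names(text)
            break  # use the first matching file only
    return result


def totals_and_additions(names):
    """Return (version, total, names added since the previous version) rows.

    For the earliest version, every name counts as an addition.
    """
    rows = []
    previous = set()
    for version in sorted(names):
        current = names[version]
        rows.append((version, len(current), sorted(current - previous)))
        previous = current
    return rows
```

Calling `totals_and_additions(names_by_version("Data/cf-standard-names"))` from the repo root would then give one row per version, which can be fed to whatever plotting code is preferred.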

It is designed so it can be re-run at any time to re-generate updated visualisations based on the current state of the repo, without any editing (though some minor re-formatting may be required over time to e.g. tweak the axes bounds on a totals plot to optimise the display).

So I already have a means to build, & re-build as necessary, some visualisations (see the examples below). If it is agreed that it would be good to put up some visualisations on the site, I am happy to adapt the script as you all see fit & put up a PR to incorporate it into this repo so it can be used for the site to generate those, or something similar, to display on a page.

What do you think: would you like to see something like this on the site, & if so what do you think about the visualisations I generated with my script (perhaps as a starting point, I am happy to amend them to fit as you see best with the site)? @japamment I believe you are in charge of the Standard Names, so it would be good to hear from you especially.

Proof of concept visualisations

Generated directly using the script in my branch here (on a state of the repo from a few months back, but I will update it shortly to the current state).

Plot of names per date by version:

Produced by passing the extracted data to matplotlib. (With thanks to @davidhassell, who suggested, after I wrote the initial script, plotting by date rather than by version & showing the version with indicative markers instead.)
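
For anyone wanting to try something similar, here is a minimal sketch of a totals-by-date plot with the version shown as a label on each marker. It is not the code from the branch. The `(version, date, total)` record layout & the function names are assumptions made for illustration, & any numbers passed in would need to come from the real tables.

```python
from datetime import date


def totals_series(records):
    """Sort (version, date, total) records by date & split them into columns."""
    ordered = sorted(records, key=lambda record: record[1])
    versions = [record[0] for record in ordered]
    dates = [record[1] for record in ordered]
    totals = [record[2] for record in ordered]
    return versions, dates, totals


def plot_totals(records, outfile="totals-plot.png"):
    """Plot total names against release date, labelling each point by version."""
    # Imported here so the data handling above works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
    import matplotlib.pyplot as plt

    versions, dates, totals = totals_series(records)
    fig, ax = plt.subplots()
    ax.plot(dates, totals, marker="o")
    for version, when, total in zip(versions, dates, totals):
        ax.annotate(f"v{version}", (when, total),
                    textcoords="offset points", xytext=(0, 6), ha="center")
    ax.set_xlabel("Release date")
    ax.set_ylabel("Total number of standard names")
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    fig.savefig(outfile)
    plt.close(fig)
    return outfile
```

The dates are plain `datetime.date` objects, which matplotlib places on a date axis directly, so the spacing between points reflects the real time between releases.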

totals-and-diffs-plot

Word clouds of names present in version A but not version B:

A nice way to show all the new names by version. My script has some utility functions that determine the new names as strings & pass them to the word_cloud library. Here are examples for the versions that give the spikes in the plot above, i.e. the largest differences in total relative to the previous version (these seemed most interesting).

(I think word_cloud is parsing & grouping them by items of one or two words by default, judging by the outputs, but that can be tweaked I am sure if longer phrases or lone words would be more useful to show.)
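
To make the grouping explicit rather than relying on the defaults, one option is to compute the frequencies yourself & hand them to the library. The sketch below is an illustration under assumptions, not the code in the branch. The helper names are hypothetical, & it uses `WordCloud.generate_from_frequencies` from the `wordcloud` package, which draws the keys of the mapping as given. As far as I know, the one-or-two-word grouping comes from the `collocations` option, which is on by default when generating from raw text.

```python
from collections import Counter


def new_names(newer, older):
    """Names present in the newer version but not in the older one."""
    return set(newer) - set(older)


def word_frequencies(names):
    """Count lone words across names, splitting each name on underscores."""
    return Counter(word for name in names for word in name.split("_"))


def render_cloud(frequencies, outfile):
    """Draw a word cloud from a {text: weight} mapping & save it as an image."""
    # Imported here so the helpers above work without the wordcloud package.
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(outfile)
    return outfile
```

Passing `word_frequencies(added)` shows lone words sized by how often they occur, whereas passing `{name: 1 for name in added}` shows each full name once at equal weight, so either style is available from the same set of new names.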

New additions for v.12:

wordcloud_diff_12_and_11

And similarly, new additions for v.49:

wordcloud_diff_49_and_48
