+ Summary
+ Back in 2017, a special issue on the topic of brain
+ parcellation and segmentation was published in the journal
+ NeuroImage. We acted as guest editors for this special issue, and
+ wrote an editorial (Craddock et al., 2018) providing an overview
+ of all papers, sorted into categories. The categories were
+ generated using a data-driven parcellation analysis, based on the
+ words contained in the abstracts of the articles. This Jupyter
+ Book allows interested readers to reproduce this analysis, as a
+ proof of concept for reproducible publications using Jupyter
+ Books and the NeuroLibre preprint server.
+
+
+ Acknowledgements
+ NeuroLibre is sponsored by the Canadian Open Neuroscience Platform
+ (CONP), Brain Canada, Cancer Computers, the Courtois Foundation, the
+ Quebec Bioimaging Network, and Healthy Brains for Healthy Lives.
+
+
+ NOTE: The following section in this
+ document repeats the narrative content exactly as found in the
+ corresponding
+ NeuroLibre Reproducible Preprint (NRP). The content was
+ automatically incorporated into this PDF using the NeuroLibre
+ publication workflow
+ (Karakuzu
+ et al., 2022) to credit the referenced resources. The
+ submitting author of the preprint has verified and approved the
+ inclusion of this section through a GitHub pull request made to the
+ source
+ repository from which this document was built. Please
+ note that the figures and tables have been excluded from this
+ (static) document. To interactively explore such outputs and
+ re-generate them, please visit the corresponding
+ NRP.
+ For more information on integrated research objects (e.g., NRPs)
+ that bundle narrative and executable content for reproducible and
+ transparent publications, please refer to DuPre et al.
+ (2022).
+ NeuroLibre is sponsored by the Canadian Open Neuroscience Platform
+ (CONP)
+ (Harding
+ et al., 2023).
+
+
+
+ Text mining
+
+ List of papers
+ We first assembled the title, the name of the corresponding
+ author, and the abstract of each article into a tab-separated
+ values (TSV) file, which we publicly archived on Figshare. We use
+ the Repo2Data tool, developed by the NeuroLibre team, to download
+ these data and include them in our reproducible computational
+ environment.
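+ As a minimal sketch of this step, the data can be loaded with
+ pandas; the file name and column names below are illustrative
+ assumptions, not necessarily those used in the repository.
+ 
+ ```python
+ # Load the article metadata archived on Figshare (fetched locally
+ # by Repo2Data). File and column names are assumptions.
+ import pandas as pd
+ 
+ papers = pd.read_csv("special_issue_papers.tsv", sep="\t")
+ print(papers.shape)  # expected: 38 rows, one per article
+ ```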
+
+
+ Word features
+ For each paper, we used scikit-learn (Kramer, 2016) to extract a
+ bag-of-words representation of each abstract, keeping the 300 most
+ important terms seen across all articles based on a term
+ frequency-inverse document frequency (tf-idf) index. Following
+ that, a singular value decomposition was used to further reduce
+ the dimensionality of the abstracts to 10 components. We ended up
+ with a component matrix of dimension 38 (articles) by 10 (abstract
+ text components). The distribution of each of the 38 articles
+ across the 10 components is represented below. Note how some
+ articles have particularly high loadings on specific components,
+ suggesting these may capture particular topics. Rather than
+ visually inspecting the component loadings to group papers
+ ourselves, we resort to an automated parcellation (clustering)
+ technique.
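+ The sketch below illustrates this pipeline with scikit-learn,
+ assuming the abstracts are available in the `papers` table loaded
+ earlier; parameter choices other than the 300 terms and 10
+ components are assumptions.
+ 
+ ```python
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.decomposition import TruncatedSVD
+ 
+ abstracts = papers["abstract"].tolist()  # column name is an assumption
+ 
+ # tf-idf bag of words, restricted to the 300 most important terms
+ vectorizer = TfidfVectorizer(max_features=300, stop_words="english")
+ tfidf = vectorizer.fit_transform(abstracts)  # shape: (38, 300)
+ 
+ # singular value decomposition down to 10 components
+ svd = TruncatedSVD(n_components=10, random_state=0)
+ components = svd.fit_transform(tfidf)  # shape: (38, 10)
+ ```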
+
+
+ Parcellate the papers
+ Now that the content of each paper has been condensed into only
+ 10 (hopefully informative) numbers, we can feed these features into
+ a trusted, classic parcellation algorithm: Ward’s agglomerative
+ hierarchical clustering, as implemented in the scipy library. We cut
+ the hierarchy to extract 7 “paper parcels”, and also use the
+ hierarchy to re-order the papers, such that similar papers sit close
+ together, as illustrated in a dendrogram representation.
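+ A minimal sketch of this step with scipy, using the component
+ matrix computed above:
+ 
+ ```python
+ from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list, linkage
+ 
+ # Ward's agglomerative hierarchical clustering on the (38, 10) features
+ hierarchy = linkage(components, method="ward")
+ 
+ # cut the tree into 7 "paper parcels"
+ labels = fcluster(hierarchy, t=7, criterion="maxclust")
+ 
+ # leaf order of the dendrogram, used to re-order the papers
+ order = leaves_list(hierarchy)
+ dendrogram(hierarchy)  # tree representation of paper similarity
+ ```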
+
+
+ Similarity matrix
+ To get a better feel for the similarity between papers that
+ was fed into the clustering procedure, we extracted the 38x38
+ (papers x papers) correlation matrix across features. Papers are
+ re-ordered in the matrix according to the above hierarchy. Each
+ “paper parcel” is indicated by a white square along the
+ diagonal, which encloses the similarity measures between papers
+ falling into the same parcel.
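+ Such a matrix can be computed and re-ordered as follows, building
+ on the variables defined in the sketches above:
+ 
+ ```python
+ import numpy as np
+ 
+ # correlation between papers across the 10 components
+ similarity = np.corrcoef(components)  # shape: (38, 38)
+ 
+ # re-order rows and columns following the dendrogram leaves
+ similarity = similarity[np.ix_(order, order)]
+ ```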
+
+
+ Word cloud
+ Each paper of the special issue has now been assigned to one and
+ only one of 7 possible “paper parcels”. For each paper parcel, we
+ can evaluate which words contribute most to the dominant component
+ associated with that parcel.
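+ One possible way to render such a cloud, assuming the `wordcloud`
+ package; the component index below is illustrative, and absolute
+ loadings are used so that the word frequencies stay non-negative.
+ 
+ ```python
+ import numpy as np
+ from wordcloud import WordCloud
+ 
+ terms = vectorizer.get_feature_names_out()
+ dominant = 0  # index of the parcel's dominant component (illustrative)
+ weights = dict(zip(terms, np.abs(svd.components_[dominant])))
+ cloud = WordCloud(background_color="white").generate_from_frequencies(weights)
+ cloud.to_file("parcel_wordcloud.png")
+ ```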
+
+
+ Categories
+ Thanks to the word clouds, these simple data-driven categories
+ turned out to be fairly easy to interpret. For example, the word
+ cloud of category number 4 prominently features words like
+ “white”, “matter” and “bundles”. If we examine the exact list of
+ papers included in this category, we see that it is composed of four
+ papers, which all considered parcels derived from white matter
+ bundles with diffusion imaging. We can also check the distribution
+ of component loadings for this category alone. As expected, there is
+ a certain similarity in the component loadings for these papers, in
+ particular along component 4.
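+ The list of papers in a given parcel can be recovered from the
+ cluster labels computed earlier (a sketch; the “title” column name
+ and the numbering of the parcels are assumptions):
+ 
+ ```python
+ category = 4  # parcel numbering from fcluster starts at 1
+ for title in papers.loc[labels == category, "title"]:
+     print(title)
+ ```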
+
+
+
+