There are a number of relatively easy things that could and should be done next:
- Extract document frequencies from all publication years (from ca. 1800 until 2020) to provide all data for TFIDF and allow time-matched TFIDF normalization.
- Move more hard-coded constants to function header, defining sensible defaults.
- Accelerate the plotting by vectorizing loop constructs or refactoring part of the code in C++ using Rcpp.
- Match KNIME workflow output to ChEBI plot input (additional filters).
- Interactive HTML5 visualization with tooltip revealing and linking to most abundant ChEBI entry in (log P/mass) bin.