Sam Borms edited this page Mar 12, 2018 · 7 revisions

Background

The R package sentometrics is a recently created toolbox for textual sentiment computation, aggregation of the sentiment into time series and sparse regression-based prediction. It was developed during Google Summer of Code 2017 and released on CRAN in November 2017. It is unique in the way it covers sentiment analysis, aggregation and prediction based on the sentiment time series in one integrated framework. The package can benefit from a Google Summer of Code 2018 project in three ways.

First, the sentometrics package needs to be integrated more closely with the prominent R text mining packages. The goal is to form a natural extension to any text mining workflow, from text retrieval to cleaning and analysis. People who process texts for several purposes, one of which is sentiment analysis, should be able to switch smoothly back and forth between sentometrics, for sentiment analysis and related applications, and packages that handle other text mining tasks.

A second extension entails improving the speed and breadth of the textual sentiment computation. Computing sentiment for a large number of documents is a cumbersome task. Most of the text mining packages offer a way to calculate sentiment, but two problems persist: it is often not straightforward (multiple manipulations are required), and none of the implementations is optimized for speed in a lower-level programming language. An Rcpp implementation of the textual sentiment calculation would therefore be a clear contribution. Furthermore, there is considerable flexibility in how sentiment can be computed, which hardly any R package accounts for. The current sentiment calculation in sentometrics takes a lexicon-based approach and considers several options in terms of document-level aggregation. To go further, it needs (i) an improved integration of linguistic complexities (such as valence shifters and n-grams) in the lexicon-based approach, and (ii) a way to train and apply machine learning sentiment classifiers.

Third, the modelling component of the package has to be expanded. Sparse regression is useful in many setups, but not in all. An easy interface needs to be added for simple linear and logistic regression, univariate and cross-correlation sentiment time series analysis, optimization tools across textual sentiment dimensions (features, aggregation options and computation methods) for specific objective functions, and ensemble machine learning algorithms.

The reference to the current vignette: Ardia, Bluteau, Borms, and Boudt, "The R Package Sentometrics to Compute, Aggregate and Predict with Textual Sentiment" (November 9, 2017). Available at SSRN: https://ssrn.com/abstract=3067734.

Related work

The quanteda package is used as the text mining backend in sentometrics. It offers many great tools, but a clear-cut approach to textual sentiment analysis is not among them.

The sentimentr package has to date implemented the most complex textual sentiment calculation in R, accounting for several linguistic intricacies. Its downside, however, is that it becomes slow for the number of documents sentometrics aims to handle (tens to hundreds of thousands).

The meanr package provides a bare-bones implementation in C of a lexicon-based sentiment calculation. It allows for only one (default) lexicon and has no other options. It can serve as the basis for an optimized sentiment calculation using Rcpp.

Details of the coding project

The task of the student is to extend the R package sentometrics along three lines.

The first aspect is to strengthen the link of sentometrics with the R text mining universe. This comprises:

  1. A better-defined workflow of corpus construction and manipulation starting from at least the quanteda package (but preferably also other packages, such as tm and cleanNLP) up to a sentocorpus object;
  2. The inclusion of a topic modelling functionality to add features to either an existing sentocorpus object, or directly from the other text mining packages mentioned.

The second aspect is to speed up the current sentiment calculation and add further sentiment calculation engines. This encompasses:

  1. An Rcpp implementation of lexicon-based sentiment calculation;
  2. Integration of more linguistic features in the lexicon-based approach, à la sentimentr;
  3. A framework for training and application of machine learning algorithms for textual sentiment classification and lexicon construction.

The implementations should account for proper parallelization and memory management.
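
As a rough illustration of items 1 and 2, the sketch below scores a text against a lexicon in plain C++ and applies a simple valence-shifter rule. It is not the sentometrics, sentimentr or meanr implementation: the function names (`tokenize`, `compute_sentiment`), the `Lexicon` alias, the one-token look-back rule and the optional normalization by token count are all assumptions made for this example. Exposing such a function to R through Rcpp, and parallelizing it over documents, is left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical alias: maps a word to a polarity score (lexicon)
// or to a multiplier (valence shifters).
using Lexicon = std::unordered_map<std::string, double>;

// Naive tokenizer: lowercase runs of ASCII letters; everything else
// (spaces, digits, punctuation, non-ASCII bytes) acts as a separator.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalpha(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Sum the polarity of every lexicon word in the text. If the token
// immediately before a lexicon word is a valence shifter, the polarity
// is multiplied by the shifter's weight (assumed rule for this sketch).
// With normalize = true, the sum is divided by the number of tokens.
double compute_sentiment(const std::string& text,
                         const Lexicon& lexicon,
                         const Lexicon& shifters,
                         bool normalize) {
    const std::vector<std::string> tokens = tokenize(text);
    double score = 0.0;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const auto hit = lexicon.find(tokens[i]);
        if (hit == lexicon.end()) continue;  // word carries no polarity
        double polarity = hit->second;
        if (i > 0) {
            const auto shift = shifters.find(tokens[i - 1]);
            if (shift != shifters.end()) polarity *= shift->second;
        }
        score += polarity;
    }
    if (normalize && !tokens.empty()) {
        score /= static_cast<double>(tokens.size());
    }
    return score;
}
```

With a toy lexicon {good: 1, bad: -1} and shifters {not: -1, very: 2}, the text "The food was not good" scores -1 without normalization, whereas "The food was good" scores 1. A fuller implementation would need a wider shifter window, n-gram matching and several lexicons at once.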

The third aspect is the addition of relevant econometric tools which integrate textual sentiment time series. This has to cover:

  1. A simple interface to non-sparse linear and logistic models, using the lm() and glm() functions from the stats package (or faster Rcpp counterparts);
  2. ARIMA-type analysis of the many sentiment time series, as well as correlation analysis of the time series, to more easily uncover relationships within the sentiment;
  3. Quadratic and non-quadratic optimizers to link the prediction of a response variable with specific constraints or optimization objectives on the sentiment time series, using the quadprog and Rdonlp2 packages;
  4. Prediction based on ensemble machine learning algorithms, such as random forest.

At the same time, the sento_model() function needs to be reconsidered to accommodate the modelling alternatives added.
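
To make the correlation analysis in item 2 more concrete, the following C++ sketch computes a sample cross-correlation between two equally long series at a non-negative lag, for instance a sentiment index and a response variable. It uses a common estimator based on full-sample means and variances. The function names (`series_mean`, `cross_correlation`) and the choice to return NaN for invalid input are assumptions for illustration and not part of sentometrics.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Arithmetic mean of a non-empty series.
double series_mean(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Sample cross-correlation between x[t + lag] and y[t], lag >= 0.
// The cross-covariance is divided by n (not n - lag) and scaled by the
// full-sample standard deviations of both series.
// Returns NaN if the series are empty, differ in length, the lag is
// not smaller than the series length, or either series is constant.
double cross_correlation(const std::vector<double>& x,
                         const std::vector<double>& y,
                         std::size_t lag) {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const std::size_t n = x.size();
    if (n == 0 || y.size() != n || lag >= n) return nan;

    const double mean_x = series_mean(x);
    const double mean_y = series_mean(y);

    double var_x = 0.0, var_y = 0.0, cov = 0.0;
    for (std::size_t t = 0; t < n; ++t) {
        var_x += (x[t] - mean_x) * (x[t] - mean_x);
        var_y += (y[t] - mean_y) * (y[t] - mean_y);
    }
    for (std::size_t t = 0; t + lag < n; ++t) {
        cov += (x[t + lag] - mean_x) * (y[t] - mean_y);
    }

    const double denom = std::sqrt(var_x * var_y);
    if (denom == 0.0) return nan;  // at least one series is constant
    return cov / denom;            // the 1/n factors cancel out
}
```

Evaluating the function over a range of lags for each pair of sentiment series, or for each series against the response variable, would give a first view of lead-lag relationships before any model is fitted.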

In addition to the programmatic implementation of the three aspects, both the documentation of the package and its vignette need to be updated accordingly.

Expected impact

The R package sentometrics, initiated during last year's Google Summer of Code, has the ambition to become the go-to package for textual sentiment calculation, aggregation and modelling. Together, the three proposed enhancements give the R community a user-friendly, fast and flexible package for extracting informative sentiment insights from large collections of texts.

Mentors

David Ardia, Assistant Professor of Finance, University of Neuchâtel and Laval University

Keven Bluteau, Researcher, University of Neuchâtel and Vrije Universiteit Brussel

Kris Boudt, Associate Professor of Finance and Econometrics, Vrije Universiteit Brussel and Vrije Universiteit Amsterdam

Tests

Applicants have to be able to show that they have:

  • Familiarity with the sentometrics R package;
  • Familiarity with textual sentiment analysis;
  • Familiarity with packages quanteda and sentimentr;
  • A good working knowledge of programming in R, Rcpp and C++;
  • A good working knowledge of devtools for package development;
  • A good working knowledge of LaTeX for the vignette;
  • Good coding standards (Google's C++ and R style guides).

Students should show their motivation by completing the tasks below:

  • Easy: Load the sentometrics package, create a corpus with a set of self-collected texts, add several features, construct a few sentiment measures and plot the results;
  • Medium: Take the built-in corpus from the sentometrics package and apply a topic model to it using one of the existing text mining packages. Report on the methodology chosen, the parameters you had to deal with and the topics obtained.
  • Hard: Write a simple textual sentiment calculator in Rcpp, possibly based on the C implementation for the same purpose in the meanr package.

Solution to tests

Students, please post a link to your test results here.
