amin saied edited this page Sep 21, 2017 · 2 revisions

The Cornetto Classification Library

Welcome to cornetto! The aim of this library is to build a mathematics classifier. More specifically, we build a variety of machine learning models, including:

  • neural networks (fully connected/RNNs),
  • SVMs,
  • naive Bayes models,

designed to read a mathematics text, such as the abstract of a paper, and to predict its subject. We use the standard Mathematics Subject Classification (henceforth MSC) to enumerate the subjects.
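To make the task concrete, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is not cornetto's implementation: the class name `NaiveBayes`, the crude tokenizer, and the four toy abstracts are all made up for illustration. The labels are two-digit top-level MSC areas (55 is algebraic topology, 35 is partial differential equations).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:             # ignore unseen words
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
train_texts = [
    "We compute the homotopy groups of a class of manifolds",
    "A cohomology spectral sequence for fibre bundles",
    "Existence of weak solutions to a nonlinear parabolic equation",
    "Regularity of solutions of elliptic partial differential equations",
]
train_labels = ["55", "55", "35", "35"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("homotopy and cohomology of bundles"))
print(model.predict("solutions of a parabolic equation"))
```

A real model would be trained on many thousands of abstracts and would typically use a library implementation rather than hand-written counting.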

Our Data

We built a database from the mathematics arXiv (https://arxiv.org/archive/math). Each record consists of:

  • Title
  • Authors
  • Abstract
  • Date
  • MSC code (this is an optional field in arXiv!)
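One way to picture a record with these fields is the sketch below. The `Paper` dataclass and the `top_level_msc` helper are hypothetical names, not part of cornetto, and the sample values are invented. The helper assumes the MSC field is free text whose first code has the usual shape of two digits followed by a letter or hyphen (for example `57M25`), and returns the two-digit top-level area.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paper:
    """One database record (hypothetical layout for illustration)."""
    title: str
    authors: List[str]
    abstract: str
    date: str
    msc: Optional[str] = None  # optional on arXiv, so it may be missing


def top_level_msc(code):
    """Return the two-digit area of the first MSC code found, or None."""
    if not code:
        return None
    match = re.search(r"\b(\d{2})[A-Z-]", code)
    return match.group(1) if match else None


# Invented sample record.
paper = Paper(
    title="An example title",
    authors=["A. Author"],
    abstract="An example abstract.",
    date="2017-09-21",
    msc="Primary 57M25; Secondary 20F36",
)
print(top_level_msc(paper.msc))
```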

We found that only about half of the papers on the arXiv have been labelled with an MSC code. We use this labelled half both to train and to test our models. We have also built a website where you can use our models to classify your own texts and abstracts. Check it out here (please note that this is still under construction).
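The idea of keeping only the labelled papers and dividing them into training and test sets can be sketched as follows. This is an illustration under assumptions, not the library's code: records are represented as plain dicts with an `"msc"` key, and the function name `split_labelled`, the 20% test fraction, and the fixed seed are arbitrary choices.

```python
import random


def split_labelled(records, test_fraction=0.2, seed=0):
    """Drop records without an MSC code, then split the rest into (train, test)."""
    labelled = [r for r in records if r.get("msc")]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    n_test = int(len(labelled) * test_fraction)
    return labelled[n_test:], labelled[:n_test]


# Invented records: every second one has no MSC code.
records = [
    {"title": f"paper {i}", "msc": "57M25" if i % 2 == 0 else None}
    for i in range(20)
]
train, test = split_labelled(records)
print(len(train), len(test))
```

Holding the test set out before any training keeps the reported accuracy honest, since the model never sees those abstracts while fitting.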

Our code is written in Python 3. We have written a series of modules and scripts to facilitate:

  • scraping data from the web
  • processing data
  • handling data
  • training models
  • using trained models

Libraries we use heavily: numpy, pandas, tensorflow, scikit-learn.
