PyStemmer

What is PyStemmer?

PyStemmer is a Python interface to the stemming algorithms from the Snowball project (https://snowballstem.org/).

Snowball can generate pure-Python stemmer code, but if you want to stem a lot of words this can be rather slow.

PyStemmer instead wraps the "libstemmer_c" library which is built from C code generated by Snowball.

An alternative to using PyStemmer directly is to use the snowballstemmer module from Snowball, which will automatically use PyStemmer if available, falling back to the pure Python implementations if not. This allows your users to choose between the convenience of only dealing with pure Python code and the significantly better performance of PyStemmer.
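As a rough sketch of that arrangement, the snippet below checks which of the two packages can be imported and then stems a few words through snowballstemmer. The helper names `stemmer_backend` and `english_stemmer` are made up for this illustration. The detection only looks at what is importable; it does not inspect which implementation snowballstemmer actually selected internally.

```python
import importlib.util


def stemmer_backend():
    """Report which stemming package is importable, preferring PyStemmer.

    Hypothetical helper: it only checks importability of the two modules.
    """
    candidates = (
        ("Stemmer", "PyStemmer"),                # C extension module
        ("snowballstemmer", "snowballstemmer"),  # pure Python package
    )
    for module_name, label in candidates:
        if importlib.util.find_spec(module_name) is not None:
            return label
    return None


def english_stemmer():
    """Return an English stemmer from snowballstemmer, or None if it is absent."""
    if importlib.util.find_spec("snowballstemmer") is None:
        return None
    import snowballstemmer
    return snowballstemmer.stemmer("english")


stemmer = english_stemmer()
if stemmer is not None:
    print(stemmer.stemWords(["connection", "connected", "connecting"]))
print("fastest importable backend:", stemmer_backend())
```

Code written against snowballstemmer in this way does not need to change when PyStemmer is later installed alongside it.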

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only contain the other forms.

This stem is often a word itself, but not always: text search systems, which are the intended field of use, do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form, or to get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.
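To make the search use case concrete, here is a minimal sketch of a stem-based index. It is an illustration only, not how any particular search engine works: `build_stem_index` and `search` are hypothetical helpers, the documents are invented, and tokenisation is a plain whitespace split. The stemming function is passed in as a parameter, so the same code works with PyStemmer's `stemWord` method or any other callable.

```python
from collections import defaultdict


def build_stem_index(docs, stem_word):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem_word(word)].add(doc_id)
    return index


def search(index, query, stem_word):
    """Return the sorted ids of documents sharing a stem with the query word."""
    return sorted(index.get(stem_word(query.lower()), set()))


# Invented sample documents for the illustration.
docs = {
    1: "the connection was lost",
    2: "connecting the two wires",
    3: "an unrelated sentence",
}

try:
    import Stemmer  # provided by PyStemmer
except ImportError:
    print("PyStemmer is not installed; skipping the demonstration")
else:
    stem_word = Stemmer.Stemmer("english").stemWord
    index = build_stem_index(docs, stem_word)
    # The query and the indexed words are stemmed the same way, so a
    # query for one form matches documents containing the other forms.
    print(search(index, "connected", stem_word))
```

Because both the indexed text and the query go through the same stemmer, it does not matter that a stem may not be a dictionary word; it only has to be consistent.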

Requirements

You need a working C compiler.

Python header files should be installed.

This version of PyStemmer has been CI tested using Python series 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, pypy and pypy3.

We no longer actively support Python 2 as the Python developers stopped supporting it at the start of 2020. PyStemmer 2.2.0.1 was the final version which we tested with Python 2.

PyStemmer can use a system install of libstemmer_c (from a package manager or an install you've previously done by hand). To do this, make sure that the development headers are installed (these may be in a separate binary package with a -dev or -devel suffix) and set the environment variable PYSTEMMER_SYSTEM_LIBSTEMMER to a non-empty value.

Otherwise PyStemmer will do a private build of libstemmer_c and use that. It looks for a tarball of the corresponding libstemmer_c release in the top level directory, and will attempt to automatically download it if not present (with a checksum check). If you want to avoid the downloading step (for example, to build in an environment which doesn't allow internet access, or to avoid build failures due to connectivity problems) you can make sure that the tarball is already present before building.
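For scripted builds, the system-library option can be sketched as below. The helper names `system_libstemmer_env` and `pip_install_command` are hypothetical, and the value "1" is just one example of a non-empty value. The sketch also assumes that a source build is wanted: if pip selects a prebuilt wheel, build-time environment variables have no effect, so the command passes pip's `--no-binary` option to request a build from source.

```python
import os
import sys


def system_libstemmer_env(base_env=None):
    """Return a copy of the environment that requests the system libstemmer_c."""
    env = dict(os.environ if base_env is None else base_env)
    # Any non-empty value enables the option; "1" is an arbitrary choice.
    env["PYSTEMMER_SYSTEM_LIBSTEMMER"] = "1"
    return env


def pip_install_command(target="pystemmer", from_source=True):
    """Build a pip command line, optionally forcing a build from source."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if from_source:
        # Without this, pip may install a prebuilt wheel and skip the build.
        cmd += ["--no-binary", "pystemmer"]
    cmd.append(target)
    return cmd


print(" ".join(pip_install_command()))
```

Passing the command and the environment to `subprocess.run(cmd, env=env, check=True)` would then run the build; that call is left out here so that the sketch has no side effects.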

Installation

PyStemmer uses distutils, so all that is necessary to build and install PyStemmer is the usual distutils invocation:

python setup.py install

You can also install using pip:

  • from PyPI: pip install pystemmer
  • from a local copy of the code: pip install .
  • from git: pip install git+https://github.com/snowballstem/pystemmer

If Python doesn't find your C compiler, you can set environment variable CC to the C compiler to use, for example:

CC=gcc-14 python setup.py install

or:

CC=/opt/bin/cc pip install pystemmer

API

PyStemmer's API is documented by documentation comments.

A brief overview can be found in docs/quickstart.txt
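As a short, non-authoritative taste of the API, the sketch below lists the available algorithms, creates a stemmer, and stems a list of words. The wrapper `stem_with_pystemmer` is a hypothetical helper written for this example; it returns None when PyStemmer is not importable so that the snippet runs either way. The documentation comments and docs/quickstart.txt remain the reference.

```python
def stem_with_pystemmer(words, language="english"):
    """Stem a list of words with PyStemmer, or return None if it is not installed."""
    try:
        import Stemmer  # module provided by PyStemmer
    except ImportError:
        return None
    # Stemmer.algorithms() lists the names of the available algorithms.
    if language not in Stemmer.algorithms():
        raise ValueError("no stemming algorithm named %r" % language)
    stemmer = Stemmer.Stemmer(language)
    # stemWords() stems a whole list; stemWord() handles a single word.
    return stemmer.stemWords(words)


print(stem_with_pystemmer(["connection", "connections", "connected"]))
```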

License

PyStemmer is copyright (c) 2006, Richard Boulton, and is licensed under the MIT license: see the file "LICENSE" for the full text of this. It was inspired by an earlier implementation (which was copyright (c) 2001, Andreas Jung, and also licensed under the MIT license, but no portions of which remain in this package, and had a different API).

The snowball algorithms, and the snowball library, are copyright (c) 2001-2006, Dr Martin Porter and Richard Boulton, and are licensed under the BSD license.