tom-de-smedt edited this page Oct 27, 2014 · 31 revisions

The aim of this fork is to make Pattern compatible with Python 3.

Pattern (www.clips.ua.ac.be/pattern) is a Python 2.7 package for data mining, natural language processing and machine learning.

The Python Software Foundation (PSF) has granted $1,200 to support the compatibility update. Please read the grant proposal (PDF) for an overview of our objectives. The Computational Psycholinguistics Group (CLiPS) of the University of Antwerp has granted $250. The Experimental Media Research Group (EMRG) of the St Lucas University College of Art & Design Antwerp has granted $250. Tom De Smedt has granted $150.

If you like Pattern and you like Python 3, you can help by joining the development team or by donating. To join the development team, simply leave an issue or a pull request. To donate, go to www.clips.ua.ac.be/pattern, click the blue Donate button and mention "Pattern 3". Thanks!


General remarks for developers

1. Task: support Python 2.7 + Python 3.3.

The Pattern package is organized in modules (pattern.web, pattern.text, ...). Some of these should be relatively easy to port (pattern.metrics, pattern.graph) and some harder. In particular, pattern.web has HTML parsers and attempts to convert all text downloaded from the web to Unicode, and pattern.text and pattern.vector are both lengthy and full of string / text handling functions.

Submodules of pattern.text such as pattern.en and pattern.nl are nearly identical: they import everything from pattern.text, with language-specific settings. Some of these have an inflect.py submodule that contains older code with weird-looking imports that may not work in 3.

2. Pull requests

Pattern 3 is a "public" fork on GitHub. By "public" we mean that anyone can join the development team. Even so, try to use pull requests instead of directly altering code so we have a history log. We should probably work module by module: finish one module before starting the next. But if someone is more comfortable with fixing things here and there, that is okay too.

In any case we should clearly mark pull requests and discussions with a pattern.<module> title, a took me x hours tagline and a description of the update. We can use this information to divide the available budget.

3. Travis

People have suggested Travis continuous integration as a starting point. If you know how that works, please go ahead and set it up!

4. Try importing

Simply trying to import a module in Python 3.4 (e.g., from pattern import metrics) will yield revealing errors. When you fix one of these, you can do a find/replace in the entire project to fix the issue in other modules too. For example, I previously used the TextMate editor to search for "print", and then used a regular expression to replace all (?) print x statements with print(x) calls.
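A rough sketch of what such a regular expression pass could look like. The function name convert_print is made up for illustration, and the pattern only covers simple one-line print x statements: it deliberately skips lines that already use parentheses and print >>f redirects, and it knows nothing about trailing commas or multi-line statements. The 2to3 tool that ships with Python handles those cases more reliably.

```python
import re

# Leading indent, the keyword "print", then the rest of the line,
# unless the rest starts with "(" or with a ">>" redirect.
PRINT_STATEMENT = re.compile(r"^(\s*)print[ \t]+(?![ \t]*(?:[(]|>>))(.+?)[ \t]*$")

def convert_print(line):
    """Rewrite a simple one-line `print x` statement as `print(x)`.

    Hypothetical helper: lines that do not match are returned unchanged.
    """
    return PRINT_STATEMENT.sub(r"\1print(\2)", line)

converted = convert_print("    print x, y")  # "    print(x, y)"
```

Note that in Python 2, print(x, y) prints a tuple unless the module starts with from __future__ import print_function, so that import should probably be added wherever a print with several arguments is converted.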

5. Run unit tests

Running unit tests will also yield revealing errors. Note that not all functionality is covered by a unit test. As an estimate, about 70-80% of the source code is covered. Some important things like saving and loading to file don't always have a unit test.

6. Use six?

Do we use six or not? An argument against six is that the source code will become cluttered with six.<some_function>(), which makes it less interesting to people looking to learn from the source code or copy portions. But if you think we can work faster with six, go ahead. We can always try to remove it later on.

At least each module could have some PY2 and PY3 (booleans, True or False) constants for if-statements and, for example, a STRING and UNICODE constant (not sure yet how these would work).
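One way these constants could look, as a minimal sketch without six. The names STRING, UNICODE and BASESTRING and the helper is_string() are assumptions for illustration, not existing Pattern code; here STRING is taken to mean "byte string" and UNICODE "text string".

```python
import sys

PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] >= 3

if PY3:
    STRING = bytes               # byte string type
    UNICODE = str                # text type
    BASESTRING = (str, bytes)    # anything string-like
else:
    STRING = str
    UNICODE = unicode            # this branch is only evaluated on Python 2
    BASESTRING = basestring

def is_string(value):
    """Replacement for isinstance(value, basestring) that works on 2 and 3."""
    return isinstance(value, BASESTRING)
```

With something like this in place, checks such as isinstance(s, unicode) could become isinstance(s, UNICODE) with a plain find/replace.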

7. Unicode conversion and string testing

Most modules have a decode_utf8() and encode_utf8() function. These are mostly for the users' convenience and not used consistently internally. Sometimes they have a short u() and s() alias for use in the source code (this is mainly the case in pattern.web, e.g., here).
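As a sketch of how such a pair of helpers could behave identically on 2 and 3 (this is not the existing Pattern implementation; the fallback to "replace" on undecodable bytes and the names text_type and binary_type are assumptions):

```python
import sys

PY3 = sys.version_info[0] >= 3
text_type = str if PY3 else unicode    # `unicode` is only evaluated on Python 2
binary_type = bytes if PY3 else str

def decode_utf8(value, encoding="utf-8"):
    """Return value as a Unicode string; byte strings are decoded."""
    if isinstance(value, binary_type):
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            # Assumption: prefer a lossy result over raising an error.
            return value.decode(encoding, "replace")
    if isinstance(value, text_type):
        return value
    return text_type(value)

def encode_utf8(value, encoding="utf-8"):
    """Return value as a byte string; Unicode strings are encoded."""
    if isinstance(value, binary_type):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding)
    return text_type(value).encode(encoding)
```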

Most of the time, isinstance(s, basestring), isinstance(s, unicode), isinstance(s, str) or some variation is used to figure out the type of string input (e.g., here). These are all good candidates for find/replace. Also, the __str__ and __unicode__ methods are used abundantly with classes. Not sure yet how to deal with those.
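One possible approach for __str__ and __unicode__, similar in spirit to the python_2_unicode_compatible decorator found in six and Django: each class defines only a __str__ that returns text, and a class decorator rearranges the methods on Python 2. The decorator name unicode_compatible and the Word class below are hypothetical.

```python
import sys

PY2 = sys.version_info[0] == 2

def unicode_compatible(cls):
    """Class decorator for classes whose __str__ returns Unicode text.

    On Python 3 the class is returned unchanged. On Python 2, __str__ is
    moved to __unicode__ and a new __str__ returns UTF-8 encoded bytes.
    """
    if PY2:
        cls.__unicode__ = cls.__str__
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    return cls

@unicode_compatible
class Word(object):
    # Hypothetical class, only here to show the decorator in use.
    def __init__(self, string):
        self.string = string

    def __str__(self):
        return self.string
```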

Finally, some classes will save data to a file, usually through a .save() and .load() method. Some of these explicitly convert Unicode to a UTF8-encoded byte string and write this string to file preceded by a BOM_UTF8 header. So looking for code near open() calls is also a good idea (e.g., here).
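A minimal sketch of a save/load pair that behaves the same on 2 and 3 by always opening the file in binary mode. The function names and the temporary file are for illustration only; the actual .save() and .load() methods in Pattern may be organized differently.

```python
import codecs
import io
import os
import tempfile

def save(path, text):
    """Write Unicode text as UTF-8 bytes, preceded by a BOM."""
    with io.open(path, "wb") as f:
        f.write(codecs.BOM_UTF8)
        f.write(text.encode("utf-8"))

def load(path):
    """Read UTF-8 bytes and return Unicode text, dropping a leading BOM."""
    with io.open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
save(path, u"caf\xe9")
restored = load(path)
```

An alternative is io.open(path, encoding="utf-8-sig"), which writes the BOM when writing and strips it when reading, on both 2 and 3.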

8. Other stuff

Some modules use __getslice__ (e.g., here), which is not supported in 3. map() returns an iterator in Python 3 instead of a list. A good overview of differences between 2 and 3 can be found here: http://python3porting.com/differences.html
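A sketch of how a __getslice__ method can be folded into __getitem__, which receives a slice object for s[i:j] in Python 3 (and also in Python 2 when the class defines no __getslice__). The Sentence class below is a made-up stand-in, not the class from pattern.text.

```python
class Sentence(object):
    # Hypothetical container, only here to illustrate slicing.
    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)

    def __getitem__(self, index):
        # A slice returns a new Sentence, an integer returns a single word.
        if isinstance(index, slice):
            return Sentence(self.words[index])
        return self.words[index]

s = Sentence(["the", "cat", "sat"])
tail = s[1:]

# map() returns an iterator in Python 3; wrap it in list() where a list is needed.
lengths = list(map(len, s.words))
```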


Module pattern.web

Most of the source code is in __init__.py, along with some submodules. Some of these submodules are external projects (BeautifulSoup, feedparser, simplejson, pdfminer and python-docx). The most important of these, BeautifulSoup, has a more recent version for 3. We should bundle both versions. Depending on whether Pattern runs in 2 or 3, it should switch which version of BeautifulSoup is imported. Right now we don't care too much about the other projects; these can be dealt with later. For example, pdfminer is a large and complex project; if the developers have no port to 3 yet we should not attempt it ourselves. PDF parsing support in 3 could then raise a NotImplementedError.

Some other (native) submodules like cache and oauth are essential to pattern.web, but they are shorter and should be easy to port.

The __init__.py submodule

The module uses sgmllib.py to implement an HTMLParser class (see here), but sgmllib.py is no longer supported in 3. So the HTMLParser needs to be rewritten using (presumably) html.parser.
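To give an idea of what a parser built on html.parser could look like, here is a small sketch that strips tags from an HTML string. The class name PlainText and its strip() method are invented for this example; the real HTMLParser class in pattern.web has a different interface that would need to be preserved.

```python
try:
    from html.parser import HTMLParser    # Python 3
except ImportError:
    from HTMLParser import HTMLParser     # Python 2

class PlainText(HTMLParser):
    """Collects the text content of an HTML string, dropping the tags."""

    def __init__(self):
        # Not super(): HTMLParser is an old-style class in Python 2.
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def strip(self, html):
        self.reset()
        self.chunks = []
        self.feed(html)
        self.close()    # flush any text still buffered by the parser
        return "".join(self.chunks)

text = PlainText().strip("<p>Hello <b>world</b>!</p>")
```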

Most functionality (e.g., the Google or Wikipedia class) builds on URL.download() (here), so URL is an important class to check. Classes such as Google and Wikipedia all look alike. If you fix one of these, it might be interesting to check the others in one go.

The module has a Document class (here) that parses an HTML string to a tree of Python objects, using BeautifulSoup. It contains a BeautifulSoup object as a private property (._beautifulSoup, inherited from the Node class). A Document contains Element objects. These should be checked, so that they continue to work correctly with the new BeautifulSoup.

Expect other, unforeseen issues: pattern.web is probably the hardest module to tackle.


Module pattern.text

The __init__.py submodule

This is a base module that provides classes and functions for building pattern.en, pattern.de, etc. All functions and classes are allowed to take byte strings as well as Unicode strings as input. These are (should be) silently converted to Unicode output with try ... except if byte strings don't work. There are lots of isinstance() checks here. There is a project on GitHub called TextBlob that wraps an (older) version of pattern.text ported to 3. It probably contains many useful hints.

Classes such as Lexicon and Frequency read from file (check for a .load() method). These files should already be Unicode.

The pattern.en|es|de|... submodule

The pattern.en submodule bundles WordNet through the old PyWordNet project. It might not be possible to port it anymore. If so, pattern.en.wordnet should fail silently on 3 until there is an opportunity to rewrite it from scratch. Perhaps we can try an automatic conversion with 2to3. Or, if someone wants to have a go at it: there is no standalone Python package for WordNet so this could be a useful side-project with high visibility.

The inflect.py script is one of the oldest and may contain weird stuff (imports etc., see this).

Other submodules like pattern.es and pattern.nl will have an __init__.py that is almost identical to the one in pattern.en.

The search.py submodule

Tools in this module will generally take the output string of the parsetree() function in pattern.en (or some other pattern.xx) and search it for patterns, comparable to regular expressions.

The tree.py submodule

An older helper module that takes the output of the parse() function in pattern.en and transforms it to a tree of nested Python objects (sentences contain chunks that contain words). As such, it is the basis of the parsetree() function. If an input string is not Unicode, it should silently be converted to Unicode. A parse tree can be exported to an XML-file.


Module pattern.vector

Contains machine learning algorithms, with lots of math. All divisions should already be float divisions. The module has a stemmer.py submodule, with string functions that should cause few problems. The module also bundles the Python bindings to LIBSVM. These are essential. We don't know if the LIBSVM people have made them Python 3-ready, but the source code is not that lengthy, so if not, we should try to convert these ourselves (2to3?), because pattern.vector is not that useful without a fast SVM algorithm.

The __init__.py submodule

Lots of complex source code. This is probably the second hardest module to update. In general, check for isinstance(), load(), save(), BOM_UTF8 and StringIO. In 2, the StringIO module accepts both byte strings and Unicode strings (cStringIO is meant for byte strings), while io.StringIO in 3 only accepts Unicode; byte strings need io.BytesIO there. The Document class has a long initializer to deal with all sorts of input, from strings to dictionaries.
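The io module exists in both 2.7 and 3 and keeps text and bytes strictly apart, so switching to it is one way to get the same behavior everywhere. A short sketch (variable names are for illustration):

```python
import io

# io.StringIO holds Unicode text; writing a byte string raises TypeError.
text_buffer = io.StringIO()
text_buffer.write(u"caf\xe9")
text_value = text_buffer.getvalue()

# io.BytesIO holds byte strings; text must be encoded first.
byte_buffer = io.BytesIO()
byte_buffer.write(u"caf\xe9".encode("utf-8"))
byte_value = byte_buffer.getvalue()

try:
    io.StringIO().write(b"bytes")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```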


Module pattern.db

This module has wrappers for SQLite and MySQL databases. MySQL relies on the MySQL-python (aka MySQLdb) project, which is not bundled. This is not a high priority. Support for SQLite should be okay since the sqlite3 module is part of Python's standard library.

However, the pattern.db module also has a Datasheet class, which is an Excel-like interface to CSV-files. It is essential in Pattern, and Python's csv library is old. The version in 2 does not support Unicode input and works with byte strings; the version in 3 works with Unicode strings. In any case, Excel (still) has problems with Unicode CSV-files. Because of this, right now in Pattern 2.6 CSV-files are exported as UTF8-encoded byte strings (see here).
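A sketch of how a CSV export could produce the same UTF-8 encoded file on 2 and 3. The write_csv() function is hypothetical (it is not the Datasheet implementation) and assumes that every cell is already a Unicode string.

```python
import csv
import io
import os
import sys
import tempfile

PY3 = sys.version_info[0] >= 3

def write_csv(path, rows):
    """Write rows of Unicode strings to a UTF-8 encoded CSV file."""
    if PY3:
        # csv works with text in 3; newline="" leaves line endings to csv.
        with io.open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(rows)
    else:
        # csv in 2 expects byte strings and a file opened in binary mode.
        with open(path, "wb") as f:
            csv.writer(f).writerows(
                [[cell.encode("utf-8") for cell in row] for row in rows])

path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_csv(path, [[u"na\xefve", u"1"]])
with io.open(path, "rb") as f:
    raw = f.read()
```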


Module pattern.graph

No special remarks. This module contains mostly math. It might be worthwhile to check for integer divisions that need to be written as x // y, though in general all divisions in Pattern should already be float divisions. There are some .export() methods and functions that write a string to a file.
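To illustrate the difference: x / y between two integers truncates in 2 but returns a float in 3, while x // y floors in both. The two functions below are made-up examples, not code from pattern.graph.

```python
def mean(values):
    # float() keeps this a true division in 2 as well as in 3.
    return sum(values) / float(len(values))

def middle_index(items):
    # Floor division: an integer result is intended here, in 2 and in 3.
    return len(items) // 2

average = mean([1, 2])              # 1.5 on both versions
middle = middle_index([1, 2, 3])    # 1 on both versions
```

Alternatively, from __future__ import division at the top of a module makes / behave like Python 3 everywhere in that module, which may be less work than adding float() calls.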


Module pattern.server

This module is the latest addition and a work in progress. It bundles the CherryPy project, which should be compatible with 3, so everything built on top of CherryPy should be okay. Some initial effort was made to make this module compatible with 3, but this has slowed down lately. Some classes like Template read strings from file.


Module pattern.metrics

Lots of assorted math and string functions, probably no major problems here.


Examples, tests, docs

pattern/examples/

All examples should use print(). And they should work in 3 of course.

pattern/tests/

All unit tests should use print(). And they should work in 3 of course.

Online documentation

Needs to be checked and updated. This work can be done at the end by Tom De Smedt, but if anyone is willing, reading through the docs to spot errors and old spelling mistakes is very helpful.


Project timeline

Below are rough estimates, assuming that people will be working in short intervals when they have time to spare. If anyone has a full free week or month, things can go faster. If it looks like we are running into problems around February, we can attempt to get more funding (e.g., from the University of Antwerp).

  • November: getting started + setup + simple trial: pattern.metrics
  • December: holiday month, spread the word, start on pattern.web
  • January: start on pattern.text
  • February: start on pattern.vector
  • March: pattern.graph + pattern.server
  • April: pattern.db + examples + tests
  • May: docs + beta release, write blog article about the pitfalls

Progress reports to PSF (task for Tom De Smedt):

  • January
  • May

Get paid for your contributions

There is a $1,850 budget. PSF will donate upon completion, unless we set up a progress plan for partial payments (do or don't?). $1,200 is provided by PSF, $650 by private funding. The $650 can be paid immediately upon completion of single tasks. Below are rough estimates. People should report if a task is simpler or harder than we initially thought so we can adjust the budget.

The grant proposal suggests the following breakdown by task type:

  • $800 for general conversion work (presumably mainly Unicode-related)
  • $600 for programming work: updating sgmllib, csv, BeautifulSoup, LIBSVM, ...
  • $250 for promotion of clean code, refactoring, tidying up & improvements
  • $200 for updating examples, tests, docs.

Another estimate is by module:

  • $450 for pattern.web
  • $450 for pattern.text
  • $400 for pattern.vector
  • $200 for pattern.db
  • $150 for pattern.server
  • $100 for pattern.graph
  • $100 for pattern.metrics

We can use these figures as a guideline to divide the available budget among contributors, taking into account their pull requests and the hours spent on a pull request (please describe each pull request: the module you worked on, the issue you fixed, the time it took you to finish the work).

We can keep an overview per collaborator at the bottom of this document. The budget will be divided in general consensus, with the project initiator (Tom De Smedt) having the final word. Tom De Smedt will work for free.

Finally, feel free to start an issue to discuss a task in more detail.