Home
The aim of this fork is to make Pattern compatible with Python 3.
Pattern (www.clips.ua.ac.be/pattern) is a Python 2.7 package for data mining, natural language processing and machine learning.
The Python Software Foundation (PSF) has granted $1,200 to support the compatibility update. Please read the grant proposal (PDF) for an overview of our objectives. The Computational Psycholinguistics Group (CLiPS) of the University of Antwerp has granted $250. The Experimental Media Research Group (EMRG) of the St Lucas University College of Art & Design Antwerp has granted $250. Tom De Smedt has granted $150.
If you like Pattern and you like Python 3, you can help by joining the development team or by donating. To join the development team, simply leave an issue or a pull request. To donate, go to www.clips.ua.ac.be/pattern, click the blue Donate button and mention "Pattern 3". Thanks!
1. Task: support Python 2.7 + Python 3.3.
The Pattern package is organized in modules (pattern.web, pattern.text, ...). Some of these should be relatively easy to port (pattern.metrics, pattern.graph) and some harder. In particular, pattern.web has HTML parsers and attempts to convert all text downloaded from the web to Unicode, and pattern.text and pattern.vector are both lengthy and full of string / text handling functions.
Submodules of pattern.text such as pattern.en and pattern.nl are nearly identical: they import everything from pattern.text, with language-specific settings. Some of these have an inflect.py submodule that contains older code with weird-looking imports that may not work in 3.
2. Pull requests
Pattern 3 is a "public" fork on GitHub. By "public" we mean that anyone can join the development team. Even so, try to use pull requests instead of directly altering code, so we have a history log. We should probably work module by module: finish one module before starting the next. But if someone is more comfortable with fixing things here and there, that is okay too.
In any case we should clearly mark pull requests and discussions with a pattern.<module> title, a "took me x hours" tagline and a description of the update. We can use this information to divide the available budget.
3. Travis
People have suggested Travis continuous integration as a starting point. If you know how that works, please go ahead and set it up!
4. Try importing
Simply trying to import a module in Python 3.4 (e.g., from pattern import metrics) will yield revealing errors. When you fix one of these, you can do a find/replace in the entire project to fix the issue in other modules too. For example, I previously used the Textmate editor to find/replace on "print", and then used a regular expression to replace all (?) print x statements with print(x) function calls.
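The import check can be scripted so that every module is tried in one run. This is only a minimal sketch: the helper name try_import and the list of module names are assumptions for illustration, not part of Pattern.

```python
import importlib


def try_import(names):
    """Try to import each module name.

    Returns a dict that maps each name to None (import succeeded)
    or to a short "ErrorType: message" string (import failed).
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as e:  # SyntaxError, ImportError, NameError, ...
            results[name] = "%s: %s" % (type(e).__name__, e)
    return results


if __name__ == "__main__":
    # Hypothetical selection of modules to check.
    modules = ["pattern.metrics", "pattern.graph", "pattern.web"]
    for name, error in sorted(try_import(modules).items()):
        print("%-20s %s" % (name, error or "ok"))
```

Only the first error per module is reported, so the script needs to be rerun after each fix.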
5. Run unit tests
Running unit tests will also yield revealing errors. Note that not all functionality is covered by a unit test. As an estimate, about 70-80% of the source code is covered. Some important things like saving and loading to file don't always have a unit test.
6. Use six?
Do we use six or not? An argument against six is that the source code will become cluttered with six.<some_function>(), which makes it less interesting to people looking to learn from the source code or copy portions. But if you think we can work faster with six, go ahead. We can always try to remove it later on.
At least each module could have some PY2 and PY3 constants (booleans, True or False) for if-statements and, for example, a STRING and UNICODE constant (not sure yet how these would work).
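One possible way to define these constants without six is sketched below. The meaning given to STRING and UNICODE here is an assumption (UNICODE as the text type, STRING as "anything string-like" for isinstance() checks), and the helpers is_string and is_unicode are hypothetical.

```python
import sys

PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] >= 3

if PY3:
    UNICODE = str           # the text type
    STRING = (str, bytes)   # text or byte strings, stands in for basestring
else:
    UNICODE = unicode       # this branch is only evaluated on Python 2
    STRING = basestring


def is_string(v):
    """True for text and byte strings on both Python 2 and 3."""
    return isinstance(v, STRING)


def is_unicode(v):
    """True for Unicode text only."""
    return isinstance(v, UNICODE)
```

With constants like these, isinstance(s, basestring) could become isinstance(s, STRING) through find/replace.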
7. Unicode conversion and string testing
Most modules have a decode_utf8() and encode_utf8() function. These are mostly for the users' convenience and not used consistently internally. Sometimes they have a short u() and s() alias for use in the source code (this is mainly the case in pattern.web, e.g., here).
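A version of these two helpers that behaves the same in 2 and 3 might look as follows. This is a sketch under assumptions: the signatures and the fallback to replacement characters are illustrative and may differ from the functions currently in Pattern.

```python
import sys

if sys.version_info[0] >= 3:
    text_type, binary_type = str, bytes
else:
    text_type, binary_type = unicode, str  # only evaluated on Python 2


def decode_utf8(v, encoding="utf-8"):
    """Return byte strings as Unicode text; return anything else unchanged."""
    if isinstance(v, binary_type):
        try:
            return v.decode(encoding)
        except UnicodeDecodeError:
            # Assumption: undecodable bytes become U+FFFD instead of failing.
            return v.decode(encoding, "replace")
    return v


def encode_utf8(v, encoding="utf-8"):
    """Return Unicode text as a byte string; return anything else unchanged."""
    if isinstance(v, text_type):
        return v.encode(encoding)
    return v


# Short aliases for internal use.
u, s = decode_utf8, encode_utf8
```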
Most of the time, isinstance(s, basestring), isinstance(s, unicode), isinstance(s, str) or some variation is used to figure out the type of string input (e.g., here). These are all good candidates for find/replace. Also, the __str__ and __unicode__ methods are used abundantly with classes. Not sure yet how to deal with those.
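One option for the __str__ / __unicode__ question is a small class decorator, similar in spirit to the python_2_unicode_compatible helper that six provides. The sketch below is an assumption about how this could be done, not a decision; the names unicode_compatible and Example are hypothetical.

```python
import sys

PY2 = sys.version_info[0] == 2


def unicode_compatible(cls):
    """Class decorator: define __str__ once, returning Unicode text.

    On Python 3 the class is returned unchanged. On Python 2 the method
    is moved to __unicode__ and __str__ returns UTF-8 encoded bytes.
    """
    if PY2:
        cls.__unicode__ = cls.__str__
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    return cls


@unicode_compatible
class Example(object):

    def __init__(self, string):
        self.string = string

    def __str__(self):
        return self.string
```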
Finally, some classes will save data to a file, usually through a .save() and .load() method. Some of these explicitly convert Unicode to a UTF8-encoded byte string and write this string to file preceded by a BOM_UTF8 header. So looking for code near open() calls is also a good idea (e.g., here).
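Opening files in binary mode and encoding or decoding explicitly gives the same result in 2 and 3. The sketch below assumes module-level save() and load() functions for brevity; in Pattern these are usually methods on a class.

```python
import io
from codecs import BOM_UTF8


def save(path, text):
    """Write Unicode text as UTF-8 encoded bytes, preceded by a BOM."""
    with io.open(path, "wb") as f:
        f.write(BOM_UTF8 + text.encode("utf-8"))


def load(path):
    """Read the file back as Unicode text, dropping a leading BOM if present."""
    with io.open(path, "rb") as f:
        data = f.read()
    if data.startswith(BOM_UTF8):
        data = data[len(BOM_UTF8):]
    return data.decode("utf-8")
```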
8. Other stuff
Some modules use __getslice__ (e.g., here), which is not supported in 3. map() returns an iterator instead of a list in Python 3. A good overview of differences between 2 and 3 can be found here: http://python3porting.com/differences.html
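Both issues have a fix that works in 2 and 3. The class below is a hypothetical stand-in for the classes in Pattern that define __getslice__: slicing is handled in __getitem__, and __getslice__ is kept only as a thin wrapper for Python 2.

```python
class Sequence(object):
    """Hypothetical container, used only to illustrate slice handling."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Python 3 passes a slice object here instead of calling __getslice__.
        if isinstance(index, slice):
            return Sequence(self._items[index])
        return self._items[index]

    def __getslice__(self, i, j):
        # Only called by Python 2, for simple slices such as s[i:j].
        return self[slice(i, j)]


# map() returns an iterator in Python 3; wrap it in list() where a list is needed.
squares = list(map(lambda x: x * x, [1, 2, 3]))
```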
Most of the source code is in __init__.py
, along with some submodules. Some of these submodules are external projects (BeautifulSoup, feedparser, simplejson, pdfminer and python-docx). The most important of these, BeautifulSoup, has a more recent version for 3. We should bundle both versions. Depending whether Pattern runs in 2 or 3, it should switch what version of BeautifulSoup is imported. Right now we don't care too much about the other projects - these can be dealt with later. For example, pdfminer is a large and complex project; if the developers have no port to 3 yet we should not attempt it ourselves. PDF parsing support in 3 could then raise a NotImplementedError
.
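The version switch and the PDF fallback could be sketched as follows. The function names soup_module_name and parse_pdf are hypothetical, and the module names assume the upstream packages ("BeautifulSoup" for the older series that runs on Python 2, "bs4" for BeautifulSoup 4); the bundled copies in Pattern may live under different paths.

```python
import sys

PY3 = sys.version_info[0] >= 3


def soup_module_name():
    """Return the name of the BeautifulSoup module to import on this interpreter."""
    # Assumption: upstream module names, not the paths of the bundled copies.
    return "bs4" if PY3 else "BeautifulSoup"


def parse_pdf(path):
    """Hypothetical entry point for PDF parsing."""
    if PY3:
        # No pdfminer port is assumed here, so fail loudly and clearly.
        raise NotImplementedError("PDF parsing is not yet supported in Python 3")
    # The Python 2 code path (pdfminer) is left out of this sketch.
    raise RuntimeError("Python 2 code path omitted in this sketch")
```

The chosen name can then be passed to importlib.import_module() at the top of the module that needs it.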
Some other (native) submodules like cache and oauth are essential to pattern.web, but they are shorter and should be easy to port.
The __init__.py submodule
The pattern.web module uses sgmllib.py to implement an HTMLParser class (see here), but sgmllib.py was removed from the standard library in 3. So the HTMLParser needs to be rewritten using (presumably) html.parser.
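As a starting point, a parser built on html.parser could look like this. The sketch only collects text content; the names TextExtractor and strip_tags are hypothetical and do not mirror the interface of the existing HTMLParser class in pattern.web.

```python
try:
    from html.parser import HTMLParser as _BaseParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser as _BaseParser   # Python 2


class TextExtractor(_BaseParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        _BaseParser.__init__(self)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```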
Most functionality (e.g., the Google or Wikipedia class) builds on URL.download() (here), so URL is an important class to check. Classes such as Google and Wikipedia all look alike. If you fix one of these, it might be interesting to check the others in one go.
The module has a Document class (here) that parses an HTML string to a tree of Python objects, using BeautifulSoup. It contains a BeautifulSoup object as a private property (._beautifulSoup, inherited from the Node class). A Document contains Element objects. These should be checked, so that they continue to work correctly with the new BeautifulSoup.
Expect other, unforeseen issues: pattern.web is probably the hardest module to tackle.
The __init__.py submodule
pattern.text is a base module that provides classes and functions for building pattern.en, pattern.de, etc. All functions and classes are allowed to take byte strings as well as Unicode strings as input. These are (should be) silently converted to Unicode output with try ... except if byte strings don't work. Lots of isinstance() checks here. There is a project on GitHub called TextBlob that wraps an (older) version of pattern.text ported to 3. It probably contains many useful hints.
Classes such as Lexicon and Frequency read from file (check for a .load() method). These files should already be Unicode.
The pattern.en|es|de|... submodule
The pattern.en submodule bundles WordNet through the old PyWordNet project. It might not be possible to port it anymore. If so, pattern.en.wordnet should fail silently on 3 until there is an opportunity to rewrite it from scratch. Perhaps we can try an automatic conversion with 2to3. Or, if someone wants to have a go at it: there is no standalone Python package for WordNet so this could be a useful side-project with high visibility.
The inflect.py script is one of the oldest and may contain weird stuff (imports etc., see this).
Other submodules like pattern.es and pattern.nl will have an __init__.py that is almost identical to the one in pattern.en.
The search.py submodule
Tools in this module will generally take the output string of the parsetree() function in pattern.en (or some other pattern.xx) and search it for patterns, comparable to regular expressions.
The tree.py submodule
An older helper module that takes the output of the parse() function in pattern.en and transforms it to a tree of nested Python objects (sentences contain chunks that contain words). As such, it is the basis of the parsetree() function. If an input string is not Unicode, it should silently be converted to Unicode. A parse tree can be exported to an XML-file.
pattern.vector contains machine learning algorithms, with lots of math. All divisions should already be float divisions. The module has a stemmer.py submodule, with string functions that should cause few problems. The module also bundles the Python bindings to LIBSVM. These are essential. Don't know if the LIBSVM people have made them Python 3-ready, but the source code is not that lengthy, so if not, we should try to convert these ourselves (2to3?), because pattern.vector is not that useful without a fast SVM algorithm.
The __init__.py submodule
Lots of complex source code. This is probably the second hardest module to update. In general, check for isinstance(), load(), save(), BOM_UTF8 and StringIO. In 2, StringIO only works with byte strings (?) while StringIO in 3 only works with Unicode. The Document class has a long initializer to deal with all sorts of input, from strings to dictionaries.
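The io module offers StringIO and BytesIO classes with the same behavior in 2.7 and 3, which avoids the differences between the older StringIO modules. A sketch with two hypothetical helpers, text_buffer and byte_buffer:

```python
import io


def text_buffer(v, encoding="utf-8"):
    """Return an in-memory text stream; byte strings are decoded first."""
    if isinstance(v, bytes):
        v = v.decode(encoding)
    return io.StringIO(v)


def byte_buffer(v, encoding="utf-8"):
    """Return an in-memory byte stream; Unicode text is encoded first."""
    if not isinstance(v, bytes):
        v = v.encode(encoding)
    return io.BytesIO(v)
```

io.StringIO only accepts Unicode text and io.BytesIO only accepts byte strings, so the conversion has to happen before the buffer is created.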
pattern.db has wrappers for SQLite and MySQL databases. MySQL relies on the python-mysql aka MySQLdb project, which is not bundled. This is not a high priority. Support for SQLite should be okay since the sqlite3 module is part of Python's standard library.
However, the pattern.db module also has a Datasheet class, which is an Excel-like interface to CSV-files. It is essential in Pattern, and Python's csv library is old. I don't think the version in 2 works with Unicode (in 3?). In any case, Excel (still) has problems with Unicode CSV-files. Because of this, right now in Pattern 2.6 CSV-files are exported as UTF8-encoded byte strings (see here).
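In Python 3 the csv module reads and writes Unicode text directly, provided the file is opened in text mode with newline="". The sketch below covers the Python 3 side only (the Python 2 csv module needs a separate code path); save_csv and load_csv are hypothetical names, not the Datasheet API. The "utf-8-sig" codec writes a BOM before the data, which may help Excel recognize the encoding.

```python
import csv
import io


def save_csv(path, rows):
    """Write rows of Unicode strings to a UTF-8 CSV file with a BOM (Python 3)."""
    with io.open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)


def load_csv(path):
    """Read a UTF-8 CSV file, with or without BOM, into a list of rows (Python 3)."""
    with io.open(path, "r", encoding="utf-8-sig", newline="") as f:
        return [row for row in csv.reader(f)]
```

All values come back as strings, so numeric columns need to be converted after loading.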
No special remarks for pattern.graph. This module contains mostly math. It might be worthwhile to check for integer divisions that need to be written as x // y, though in general all divisions in Pattern should already be float divisions. There are some .export() methods and functions that write a string to a file.
pattern.server is the latest addition and a work in progress. It bundles the CherryPy project, which should be compatible with 3, so everything built on top of CherryPy should be okay. Some initial effort was made to make this module compatible with 3, but this has slackened lately. Some classes like Template read strings from file.
pattern.metrics has lots of assorted math and string functions, probably no major problems here.
pattern/examples/
All examples should use print(). And they should work in 3 of course.
pattern/tests/
All unit tests should use print(). And they should work in 3 of course.
Online documentation
Needs to be checked and updated. This work can be done at the end by Tom De Smedt, but if anyone is willing, reading through the docs to spot errors and old spelling mistakes is very helpful.
Below are rough estimates, assuming that people will be working in short intervals when they have time to spare. If anyone has a full free week or month, things can go faster. If it looks like we are running into problems around February, we can attempt to get more funding (e.g., from the University of Antwerp).
- November: getting started + setup + simple trial: pattern.metrics
- December: holiday month, spread the word, start on pattern.web
- January: start on pattern.text
- February: start on pattern.vector
- March: pattern.graph + pattern.server
- May: pattern.db + examples + tests
- April: docs + beta release, write blog article about the pitfalls
Progress reports to PSF (task for Tom De Smedt):
- January
- May
There is a
The grant proposal suggests the following breakdown by task type:
- $800 for general conversion work (presumably mainly Unicode-related)
- $600 for programming work: updating sgmllib, csv, BeautifulSoup, LIBSVM, ...
- $250 for promotion of clean code, refactoring, tidying up & improvements
- $200 for updating examples, tests, docs.
Another estimate is by module:
- $450 for pattern.web
- $450 for pattern.text
- $400 for pattern.vector
- $200 for pattern.db
- $150 for pattern.server
- $100 for pattern.graph
- $100 for pattern.metrics
We can use these figures as a guideline to divide the available budget among contributors, taking into account their pull requests and the hours spent on a pull request (please describe each pull request: the module you worked on, the issue you fixed, the time it took you to finish the work).
We can keep an overview per collaborator at the bottom of this document. The budget will be divided in general consensus, with the project initiator (Tom De Smedt) having the final word. Tom De Smedt will work for free.
Finally, feel free to start an issue to discuss a task in more detail.