Initial import from Google Code. This is Penelope v2.0.0

pettarin · Jun 30, 2014 · da36f47
1 parent 182e1d9
commit da36f47
Showing 27 changed files with 5,907 additions and 5 deletions.
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2014 Alberto Pettarin
+Copyright (c) 2012-2014 Alberto Pettarin ([email protected])
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -18,4 +18,5 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+SOFTWARE.
+
diff --git a/README.md b/README.md
@@ -1,4 +1,100 @@
-penelope
-========
+# Penelope
+
+**Penelope** is a multi-tool for creating, editing and converting dictionaries, especially for eReader devices.
+
+* Version: 2.0.0
+* Date: 2014-06-30
+* Developer: [Alberto Pettarin](http://www.albertopettarin.it/) ([contact](http://www.albertopettarin.it/contact.html))
+
+With the current version you can:
+
+* convert a dictionary FROM/TO the following formats:
+    * Bookeen Cybook Odyssey (R/W)
+    * Kobo (R index only, W unencrypted/unobfuscated only)
+    * StarDict (R/W)
+    * XML (R/W)
+    * CSV (R/W)
+* merge multiple dictionaries (of the same type) into a single dictionary
+* define your own parser for each word/definition
+* define your own collation function when outputting to Bookeen Cybook Odyssey format
+* generate an EPUB file containing the index of a given dictionary (e.g., to cope with the lack of a search function on your eReader)
+
+Please note that Penelope needs substantial code refactoring.
+Unfortunately, I no longer have time to do that.
+Please fork and improve.
+
+Many people have asked for PRC/MOBI support.
+Again, I no longer have time to do that.
+
+
+### IMPORTANT UPDATE (2013-04-27)
+
+Kobo issued a new firmware 2.5.1 (thanks!), which allows you to use unencrypted/unobfuscated dictionaries again, including those produced by Penelope. Some minor bugs in the UI/UX are still present, but at least the custom dictionaries are back!
+
+
+### UPDATE (2013-04-23)
+
+It seems that Kobo, with firmware 2.5.0, requires the dictionaries to be encrypted/obfuscated. Hence, the dictionaries output by Penelope no longer work on Kobo devices. I contacted Kobo staff via Twitter, and they forwarded the notice to their development team. I hope they will fix the issue with a new firmware release soon. Meanwhile, if you need your custom-made dictionaries, you must stay with or revert to firmware 2.4.0.
+
+
+## Usage
+
+```
+$ python penelope.py -h
+$ python penelope.py           -p foo -f en -t en
+$ python penelope.py           -p bar -f en -t it
+$ python penelope.py           -p "bar,foo,zam" -f en -t it
+$ python penelope.py --xml     -p foo -f en -t en
+$ python penelope.py --xml     -p foo -f en -t en --output-sd
+$ python penelope.py           -p bar -f en -t it --output-kobo
+$ python penelope.py           -p bar -f en -t it --output-xml -i
+$ python penelope.py --kobo    -p bar -f it -t it --output-epub
+$ python penelope.py --odyssey -p bar -f en -t en --output-epub
+$ python penelope.py           -p bar -f en -t it --title "My EN->IT dictionary" --year 2012 --license "CC-BY-NC-SA 3.0"
+$ python penelope.py           -p foo -f en -t en --parser foo_parser.py --title "Custom EN dictionary"
+$ python penelope.py           -p foo -f en -t en --collation custom_collation.py
+$ python penelope.py --xml     -p foo -f en -t en --output-csv --fs "\t\t" --ls "\n" 
+```
+
+Please have a look at this web page for details:
+http://www.albertopettarin.it/penelope.html
+
+## License
+
+**Penelope** is released under the MIT License since version 2.0.0 (2014-06-30).
+
+Previous versions, hosted in a [Google Code repo](http://code.google.com/p/penelope-dictionary-converter/),
+were released under the GNU GPL 3 License.
+
+
+## Technical Notes
+
+The current version runs under both Python 2 and Python 3,
+and it has been tested under Linux (Debian, Fedora) and Windows (XP, 7).
+Unfortunately, since I do not have any financial support for the project,
+I cannot offer support for all the possible
+values of the tuple (OS, Python version, console encoding).
+Therefore, only problems running Penelope in a Linux environment
+will receive full priority.
+
+
+## Acknowledgments 
+
+Many thanks to:
+
+* _uwelovesdonna_ for contributing ideas for improving the code and for setting up many pages of the project wiki;
+* _Jens Sadowski_ for pointing out a bug with Unicode file names and for suggesting using multiset `dict()` instead of set `dict()`;
+* _oldnat_ for pointing out a bug under Windows and Python 3;
+* _Wolfgang Miller-Reichling_ for providing the code for reading CSV dictionaries;
+* _branok_ for providing the idea and initial code for the German collation function;
+* _pal_ for suggesting passing the `-l` switch to `MARISA_BUILD`;
+* _Lukas Brückner_ for suggesting escaping `& < >` when outputting in XML format;
+* _Stephan Lichtenhagen_ for suggesting forcing UTF-8 encoding on Python 3.
+
+
+## Limitations and Missing Features 
+
+* No support for PRC/MOBI dictionaries 
+* Input files are assumed to be Unicode UTF-8 encoded
+* CWDIR dependent
 
-Penelope is a multi-tool for creating, editing and converting dictionaries, especially for eReader devices
diff --git a/dictionary_index_epub/Chambers1908.epub b/dictionary_index_epub/Chambers1908.epub
diff --git a/dictionary_index_epub/Websters1913.epub b/dictionary_index_epub/Websters1913.epub
diff --git a/dictionary_index_epub/dicthtml-de.epub b/dictionary_index_epub/dicthtml-de.epub
diff --git a/dictionary_index_epub/dicthtml-en.epub b/dictionary_index_epub/dicthtml-en.epub
diff --git a/dictionary_index_epub/dicthtml-es.epub b/dictionary_index_epub/dicthtml-es.epub
diff --git a/dictionary_index_epub/dicthtml-fr.epub b/dictionary_index_epub/dicthtml-fr.epub
diff --git a/dictionary_index_epub/dicthtml-it.epub b/dictionary_index_epub/dicthtml-it.epub
diff --git a/dictionary_index_epub/dicthtml-ja.epub b/dictionary_index_epub/dicthtml-ja.epub
diff --git a/dictionary_index_epub/dicthtml-nl.epub b/dictionary_index_epub/dicthtml-nl.epub
diff --git a/dictionary_index_epub/dicthtml-pt.epub b/dictionary_index_epub/dicthtml-pt.epub
diff --git a/src/collation_de.py b/src/collation_de.py
@@ -0,0 +1,49 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__     = 'MIT'
+__author__      = 'Alberto Pettarin (alberto albertopettarin.it)'
+__copyright__   = '2012-2014 Alberto Pettarin (alberto albertopettarin.it)'
+__version__     = 'v2.0.0'
+__date__        = '2014-06-30'
+__description__ = 'German collation function for penelope.py'
+
+### BEGIN collate_function ###
+# collate_function(string1, string2)
+# compare string1 to string2
+# return  0 if string1 == string2
+#        -1 if string1 < string2
+#         1 if string1 > string2
+def collate_function(string1, string2):
+    # conversion to unicode and lower case (only for Python 2)
+    #Python2#
+    b1 = string1.decode('utf-8')
+    #Python3#    b1 = string1
+    #Python2#
+    b2 = string2.decode('utf-8')
+    #Python3#    b2 = string2
+    b1 = b1.lower()
+    b2 = b2.lower()
+    # store strings with original accents for 2nd level collation
+    c1 = b1
+    c2 = b2
+
+    # replace german accent characters by base characters for 1st level collation
+    #Python2#
+    for f in [ [u'ä', u'a'], [u'ö', u'o'], [u'ü', u'u'], [u'ß', u'ss'] ]:
+    #Python3#    for f in [ ['ä', 'a'], ['ö', 'o'], ['ü', 'u'], ['ß', 'ss'] ]:
+        b1 = b1.replace(f[0], f[1])
+        b2 = b2.replace(f[0], f[1])
+
+    # 1st level collation
+    if b1.encode('utf-16') == b2.encode('utf-16'):
+        # 2nd level collation
+        if c1.encode('utf-16') == c2.encode('utf-16'):
+            return 0
+        else:
+            return -1 if c1.encode('utf-16') < c2.encode('utf-16') else 1
+    # 1st level collation
+    else:
+        return -1 if b1.encode('utf-16') < b2.encode('utf-16') else 1
+### END collate_function ###
+
diff --git a/src/collation_de3.py b/src/collation_de3.py
@@ -0,0 +1,50 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__     = 'MIT'
+__author__      = 'Alberto Pettarin (alberto albertopettarin.it)'
+__copyright__   = '2012-2014 Alberto Pettarin (alberto albertopettarin.it)'
+__version__     = 'v2.0.0'
+__date__        = '2014-06-30'
+__description__ = 'German collation function for penelope.py (Python 3)'
+
+### BEGIN collate_function ###
+# collate_function(string1, string2)
+# compare string1 to string2
+# return  0 if string1 == string2
+#        -1 if string1 < string2
+#         1 if string1 > string2
+def collate_function(string1, string2):
+    # conversion to unicode and lower case (only for Python 2)
+    #Python2#    b1 = string1.decode('utf-8')
+    #Python3#
+    b1 = string1
+    #Python2#    b2 = string2.decode('utf-8')
+    #Python3#
+    b2 = string2
+    b1 = b1.lower()
+    b2 = b2.lower()
+    # store strings with original accents for 2nd level collation
+    c1 = b1
+    c2 = b2
+
+    # replace german accent characters by base characters for 1st level collation
+    #Python2#    for f in [ [u'ä', u'a'], [u'ö', u'o'], [u'ü', u'u'], [u'ß', u'ss'] ]:
+    #Python3#
+    for f in [ ['ä', 'a'], ['ö', 'o'], ['ü', 'u'], ['ß', 'ss'] ]:
+        b1 = b1.replace(f[0], f[1])
+        b2 = b2.replace(f[0], f[1])
+
+    # 1st level collation
+    if b1.encode('utf-16') == b2.encode('utf-16'):
+        # 2nd level collation
+        if c1.encode('utf-16') == c2.encode('utf-16'):
+            return 0
+        else:
+            return -1 if c1.encode('utf-16') < c2.encode('utf-16') else 1
+    # 1st level collation
+    else:
+        return -1 if b1.encode('utf-16') < b2.encode('utf-16') else 1
+### END collate_function ###
+
+
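The two-level comparison in `collation_de.py` and `collation_de3.py` (first compare with umlauts folded to base letters, then break ties on the accented form) can also be written as a sort key. The following is a minimal Python 3 sketch and is not part of this commit. `GERMAN_FOLDS` and `german_sort_key` are hypothetical names, and the sketch compares strings by code point rather than by UTF-16 bytes, so its ordering may differ from the committed function for some characters.

```python
# Hypothetical helper, not part of Penelope: a Python 3 sort key that
# mirrors the two-level German collation idea from the files above.
GERMAN_FOLDS = (("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "ss"))


def german_sort_key(word):
    """Return a (folded, lowered) tuple usable as a sort key."""
    lowered = word.lower()
    folded = lowered
    for accented, base in GERMAN_FOLDS:
        folded = folded.replace(accented, base)
    # 1st level: form with accents folded away
    # 2nd level: lower-cased form with the original accents kept
    return (folded, lowered)


words = ["Zebra", "Äpfel", "Apfel", "Straße", "Strasse"]
result = sorted(words, key=german_sort_key)
print(result)
```

Words that differ only by an umlaut or `ß` end up adjacent, with the unaccented spelling first.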
diff --git a/src/default_collation.py b/src/default_collation.py
@@ -0,0 +1,25 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__     = 'MIT'
+__author__      = 'Alberto Pettarin (alberto albertopettarin.it)'
+__copyright__   = '2012-2014 Alberto Pettarin (alberto albertopettarin.it)'
+__version__     = 'v2.0.0'
+__date__        = '2014-06-30'
+__description__ = 'Default collation function for penelope.py'
+
+### BEGIN collate_function ###
+# collate_function(string1, string2)
+# compare string1 to string2
+# return  0 if string1 == string2
+#        -1 if string1 < string2
+#         1 if string1 > string2
+def collate_function(string1, string2):
+    b1 = bytearray(string1, 'utf-8').lower()
+    b2 = bytearray(string2, 'utf-8').lower()
+    if (b1 == b2):
+        return 0
+    else:
+        return -1 if (b1 < b2) else 1
+### END collate_function ###
+
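A `collate_function` with this return convention (`-1`, `0`, `1`) is a classic comparison function. As a rough illustration only, the sketch below shows how such a function could be used to sort words in Python 3 via `functools.cmp_to_key`. How `penelope.py` itself invokes the function is not shown in this diff, so the sorting call here is an assumption, and the function body is copied from `src/default_collation.py` to keep the example self-contained.

```python
from functools import cmp_to_key


def collate_function(string1, string2):
    # Same logic as src/default_collation.py:
    # compare the lower-cased UTF-8 bytes of the two strings
    b1 = bytearray(string1, "utf-8").lower()
    b2 = bytearray(string2, "utf-8").lower()
    if b1 == b2:
        return 0
    return -1 if b1 < b2 else 1


# Assumed usage: wrap the comparison function so sorted() can use it
words = ["pear", "Apple", "banana", "apple"]
ordered = sorted(words, key=cmp_to_key(collate_function))
print(ordered)
```

Because `bytearray.lower()` only lower-cases ASCII letters, non-ASCII characters are compared by their raw UTF-8 bytes.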
diff --git a/src/default_parser.py b/src/default_parser.py
@@ -0,0 +1,33 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__     = 'MIT'
+__author__      = 'Alberto Pettarin (alberto albertopettarin.it)'
+__copyright__   = '2012-2014 Alberto Pettarin (alberto albertopettarin.it)'
+__version__     = 'v2.0.0'
+__date__        = '2014-06-30'
+__description__ = 'Parse the given definition list for penelope.py'
+
+### BEGIN parse ###
+# parse(data, type_sequence, ignore_case)
+# parse the given list of pairs
+# data = [ [word, definition] ]
+# with type_sequence and ignore_case options,
+# and output the following list of lists:
+# parsed = [ [ word, include, synonyms, substitutions, definition ] ]
+#
+# where:
+#        word is the sorting key
+#        include is a boolean saying whether the word should be included
+#        synonyms is a list of alternative strings for word
+#        substitutions is a list of pairs [ word_to_replace, replacement ]
+#        definition is the definition of word
+
+# default implementation, just copy the content of the stardict dictionary
+def parse(data, type_sequence, ignore_case):
+    parsed_data = []
+    for d in data:
+        parsed_data.append([d[0], True, [], [], d[1]])
+    return parsed_data
+### END parse ###
+
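The interface documented in `default_parser.py` suggests what a custom parser passed via `--parser` might look like. The sketch below is a hypothetical example, not part of this commit. It keeps the `parse(data, type_sequence, ignore_case)` signature and the five-element output entries, strips surrounding whitespace, and marks entries with an empty word or definition as excluded. It assumes that `include = False` is enough for Penelope to skip an entry, and it accepts but does not use `type_sequence` and `ignore_case`.

```python
def parse(data, type_sequence, ignore_case):
    # type_sequence and ignore_case are accepted for interface
    # compatibility but are not used in this sketch
    parsed_data = []
    for word, definition in data:
        word = word.strip()
        definition = definition.strip()
        # exclude entries that have no usable word or definition
        include = bool(word) and bool(definition)
        # no synonyms and no substitutions in this sketch
        parsed_data.append([word, include, [], [], definition])
    return parsed_data


sample = [[" cat ", "a small feline"], ["dog", "   "]]
parsed = parse(sample, None, False)
print(parsed)
```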