Pydurma is a fast and modular collation engine that:
- uses a very fast collation mechanism (diff-match-patch instead of Needleman-Wunsch)
- makes it easy to define language-specific tokenization and normalization, necessary for languages like Tibetan
- keeps track of the character position in the original files (even XML files where markup is removed in normalization)
- has a configurable engine to select the best reading, based on reading frequency among versions, OCR confidence index, language-specific knowledge, etc.
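The tabular, diff-based approach can be illustrated with a minimal sketch. Pydurma itself uses diff-match-patch; the stdlib `difflib` below only mirrors the idea, and the function name and data layout are invented for illustration:

```python
from difflib import SequenceMatcher

def align(tokens_a, tokens_b):
    """Pairwise tabular alignment of two token sequences.

    Returns rows of (op, slice_of_a, slice_of_b); 'replace' rows hold
    the variant readings, 'equal' rows the shared text."""
    sm = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [(op, tokens_a[i1:i2], tokens_b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()]

# Two witnesses of the same sentence, tokenized on whitespace.
table = align("the quick brown fox".split(),
              "the quiet brown fox jumps".split())
for row in table:
    print(row)
```

In a real workflow the tokenizer would be language-specific (e.g. syllable-based for Tibetan) rather than a whitespace split.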
It does not:
- use any non-tabular (graph) representation (à la CollateX)
- implement readjustment of alignment based on the distance between tokens (à la CollateX)
- detect transpositions
It does not yet:
- use subword tokenizers à la SentencePiece, potentially more robust on dirty (OCR) data than tokenizers based on linguistic features (spaces, punctuation, etc.)
- allow a configurable token distance function based on language-specific knowledge (graphical distance, phonetic distance) to readjust alignment (à la CollateX's `near_match=True` option)
- implement a STAR algorithm to find the best "base" among the different editions
- have a clearly defined export format for the collated text or critical apparatus
We intend Pydurma to be used in large scale projects to automate:
- merging different OCR outputs of the same images, selecting the best reading at each point
- creating an edition that averages all other editions automatically
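The frequency-based selection behind this merging can be sketched as a toy majority vote over already-aligned columns (the function name and data layout here are made up for illustration; Pydurma's actual selection engine also weighs OCR confidence and language-specific knowledge):

```python
from collections import Counter

def pick_readings(aligned_columns):
    """For each aligned column (one token per witness, None for a gap),
    keep the most frequent reading -- a simple majority vote."""
    result = []
    for column in aligned_columns:
        reading, _count = Counter(t for t in column if t is not None).most_common(1)[0]
        result.append(reading)
    return result

# Three hypothetical OCR outputs of the same line, already aligned column-wise.
columns = [
    ("the", "the", "tne"),
    ("quick", "quick", "quick"),
    ("brown", "brovn", "brown"),
]
print(" ".join(pick_readings(columns)))
```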
The name Pydurma is a combination of:
- Python
- Pedurma དཔེ་བསྡུར་མ།, Tibetan for critical or diplomatic edition
While Pydurma can be used in a classical Lachmannian critical edition process, its innovative design allows it to automate the process of variant selection.
Using this automated selection directly may raise eyebrows, but we want to defend the concept. The type of editions Pydurma can produce:
- are fully automatic and thus uncritical (uncritical editions?), but can be produced on a large scale (industrial editions?)
- do not try to reproduce one variant in particular as their base (in that sense are not diplomatic editions, perhaps undiplomatic editions?)
- are intended to be similar to the concept of a vulgate ("a commonly accepted text or reading", Merriam-Webster)
- can be regenerated from different inputs (improved OCR, more versions, etc.) giving a different output, and are thus temporary (or at least, not meant to be definitive)
- optimize measurable linguistic soundness
- as a result, are a good base for an AI-ready corpus
A workflow using Pydurma can work on a very large scale with minimal human intervention, which we hope can be a game changer for under-resourced literary traditions like the Tibetan tradition.
Pydurma operates in three steps:
- Preprocessing
- Collation
- Variant Selection
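The three steps can be caricatured in a few lines of stdlib Python. This is a toy stand-in, not Pydurma's actual API: `difflib` replaces diff-match-patch, and the normalizer is invented.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

def normalize(text):
    # Hypothetical preprocessing step: lower-case and collapse whitespace.
    return re.sub(r"\s+", " ", text.lower()).strip().split()

def merge(base, *others):
    """Align each witness against a base, then vote on each base token."""
    base_toks = normalize(base)
    votes = [[t] for t in base_toks]
    for other in others:
        toks = normalize(other)
        sm = SequenceMatcher(None, base_toks, toks, autojunk=False)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            # Collect same-length aligned spans; skip insertions/deletions.
            if op in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    votes[i].append(toks[j])
    return " ".join(Counter(v).most_common(1)[0][0] for v in votes)

print(merge("The quick brown fox", "the quiok brown fox", "the quick brown f0x"))
```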
Related tools and resources:
- text_alignment_tool
- CollateX
- MEDITE
- Juxta (sources, uses difflib)
- hypercollate
- helayo (paper)
- Versioning Machine
- nmergec ("en-merge-see")
- ocromore
- This was apparently also done by Oliver Hellwig (research to be done)
- Sequence Alignment (Wikipedia)
- MSA: Multiple Sequence Alignment (Wikipedia)
- Smith–Waterman algorithm
- Needleman–Wunsch algorithm
- Spencer 2004: Spencer, M. and Howe, C. J., 2004. Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities, 38/2004, 253–270.
Pydurma is licensed under the Apache license.
Pydurma is a creation of Elie Roux and Tenzin Kaldan.
The intended use of Pydurma at the Buddhist Digital Resource Center was presented at the Digital Humanities Workshop & Symposium organized in January 2023 at the University of Hamburg (see summary of the symposium, slide selection available here).
@software{Roux_Pydurma_2023,
  author = {Roux, Elie and Tenzin Kaldan},
  month = jan,
  title = {{Pydurma}},
  url = {https://github.com/openpecha/pydurma},
  version = {0.1.0},
  year = {2023}
}