Generalized Text::Hyphen #5

PhilterPaper · 2020-11-19T01:27:51Z

Please take a look at Pull Request #1, in case you've overlooked it. The idea is that, rather than releasing and maintaining a whole bunch of Text::Hyphen::XX packages, we release just the one Text::Hyphen, which can either be updated manually with the desired language files from the CTAN library or fetch them itself, given a language option in new(). At this point I'm not sure whether there are any issues with where the cache or library of patterns and exceptions should go, with respect to permissions across a wide range of platforms. That is, can a random user trigger an action that adds files to the library in their Perl module collection?
Add: Perhaps Hyphen.pm should have a clearly marked and easily changed "where the cache is" setting, and/or an option to new(), naming the directory where it writes (and reads) all the pattern and exception files. This would get around worrying about some users not having permission to write to certain directories.
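To make the "where the cache is" idea concrete, here is a minimal sketch of how a cache directory might be chosen. It is written in Python purely for illustration and is not Text::Hyphen code; the function name `resolve_cache_dir`, the `TEXT_HYPHEN_CACHE` environment variable, and the fallback locations are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_dir(explicit=None, env_var="TEXT_HYPHEN_CACHE"):
    """Return the first writable directory for pattern/exception files.

    Order tried: explicit argument, environment variable, a per-user
    directory, then the system temp directory. Every name here is
    hypothetical and only illustrates the fallback idea.
    """
    candidates = []
    if explicit:
        candidates.append(Path(explicit))
    if os.environ.get(env_var):
        candidates.append(Path(os.environ[env_var]))
    candidates.append(Path(os.path.expanduser("~")) / ".cache" / "text-hyphen")
    candidates.append(Path(tempfile.gettempdir()) / "text-hyphen")

    for path in candidates:
        try:
            path.mkdir(parents=True, exist_ok=True)
        except OSError:
            continue  # cannot create it here, try the next candidate
        if os.access(path, os.W_OK):
            return path
    raise RuntimeError("no writable cache directory found")
```

With something like this, a user without write access to the Perl module tree would still end up with a usable per-user cache, and an explicit option would always win.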

Then there's the issue of how you keep this library updated, should CTAN refresh a file. Certainly one way to do it is just as it's done now, which is to have separate Text::Hyphen::XX packages and use the normal CPAN update mechanism. However, this means that someone will need to take on building and releasing all these packages in the first place, and keeping them all up to date! I suspect that this was the impetus for PR #1 in the first place: users wouldn't have to wait for someone to get around to creating a package. By the way, one very useful package would be Latin! How many packages use Lorem Ipsum text for examples, and would like a way to properly hyphenate it?
Add: Perhaps there's a way that I've overlooked, but there doesn't seem to be a way to subscribe to CTAN so that your local system is told to update its cache or library of hyphenation data. Maybe the best way would be to periodically run a utility to check last-modified dates on the page, and pull down anything that needs an update? On Linuxy systems, at least, this could be a cron job (dunno about Windows). Worst case, whenever Text::Hyphen is run, it could check the date/time and run the utility for you? Anyway, there probably aren't many changes to such files once they've settled down.
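The last-modified check could be as small as the following sketch. It is in Python for illustration only and assumes the server reports an HTTP-style Last-Modified value for each file, which I haven't verified for CTAN; `needs_refresh` is a hypothetical helper, and the actual download step is left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_refresh(local_mtime, remote_last_modified):
    """Decide whether a cached pattern file should be fetched again.

    local_mtime: POSIX timestamp of the cached file, or None if absent.
    remote_last_modified: an HTTP-date string such as
        "Thu, 19 Nov 2020 01:27:51 GMT" (assumed to be available).
    """
    if local_mtime is None:
        return True  # nothing cached yet
    remote = parsedate_to_datetime(remote_last_modified)
    if remote.tzinfo is None:
        # Treat a zone-less date as UTC so the comparison is well defined.
        remote = remote.replace(tzinfo=timezone.utc)
    local = datetime.fromtimestamp(local_mtime, tz=timezone.utc)
    return remote > local
```

A cron job (or a check at startup) would call this once per cached file and only download the ones for which it returns true.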

Anyway, I think it's time to discuss better ways of getting hyphenation support out for a wide range of languages, and CTAN appears to have done much of the work already, if we could just directly read those files (and import them easily).

PhilterPaper · 2020-11-26T19:50:51Z

Playing around with Text::Hyphen, I see a couple more needs:

A supplementary pattern (pat) file for a given language. For example, Text::Hyphen won't properly split up "sesquicentennial". It keeps "sesqui" as a single syllable, when it should be "ses-qui". One could either manually update the English pattern file whenever it's downloaded, or add a supplement file to be appended to the current official list. Or wait until the CTAN maintainer gets around to updating the files! Addendum: I understand that some splits are declined if they might lead to ambiguity in how the first fragment is to be pronounced. Thus, Knuth-Liang will decline to do some reasonable splits.
A supplemental exceptions (hyp) file might be desirable, to supplement and override the built-in hyp file. Two examples off the top of my head: "project" and "present" are both listed as not-to-be-split, because where you split them into syllables depends on the exact meaning. "record" isn't in the hyp file, but it isn't split either. "I went to the studio to re-cord some music for my new rec-ord." Perhaps the best way to deal with such words is to manually insert a soft hyphen (&shy;), since without a lot of deep language understanding there's no way for Text::Hyphen to figure out the appropriate syllabification. However, a supplemental hyp file could still be useful if you choose, say, to spell "recognizance" in a more British (?) style, such as "recognisance".
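A rough sketch of the override behavior for exceptions, in Python for illustration only. It assumes the exception lists are whitespace-separated words with hyphens marking the allowed breaks (as in TeX-style hyphenation lists); the function names and the sample split of "recognisance" are made up for the example.

```python
def parse_exceptions(text):
    """Map each unhyphenated word to its hyphenated form.

    Assumes whitespace-separated entries such as "as-so-ciate";
    an entry with no hyphen means "never split this word".
    """
    table = {}
    for entry in text.split():
        entry = entry.lower()
        table[entry.replace("-", "")] = entry
    return table


def merge_exceptions(builtin_text, supplement_text):
    """Supplement entries are added to, and override, the built-in ones."""
    merged = parse_exceptions(builtin_text)
    merged.update(parse_exceptions(supplement_text))
    return merged


# Illustrative data only, not taken from the real hyp files.
merged = merge_exceptions(
    "project present as-so-ciate",
    "pres-ent rec-og-ni-sance",
)
```

Here the supplement changes "present" from not-to-be-split to "pres-ent" and adds a new spelling, while untouched built-in entries such as "project" stay as they were.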

PhilterPaper · 2021-01-24T15:02:27Z

Another improvement: If a fragment (or entire word) exceeds some maximum size (say, 8 characters), force it to be split anyway, to avoid ridiculous cases where a very, very long string might not even fit on a line, or leaves an enormous "hole" in a line when it moves to the next line.

"Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaah," he yelled as he went over the cliff.

Even if that long Aa...ah fits on one line, if it nearly fit at the end of the previous line, it will take a ridiculous amount of stretch (Knuth-Plass) to right-justify that previous line. Even if this text is a bit contrived, you can easily end up with long unsplittable runs from things like passwords and MD5 hashes in your text. Even foreign words may be a problem if your language selection can't recognize them and there's no means to use a different Knuth-Liang pattern list on demand.

First try splitting at reasonable points, such as after hyphens/dashes or between a lowercase letter and an uppercase letter (camelCase text), then between a letter and a digit (or vice-versa), then within runs of digits, and then after (or before?) punctuation. Obey the minimum prefix and suffix lengths (such as 2/3), and keep chopping until nothing is longer than 8 characters. The exact length could be a new parameter in case you want to suppress this behavior.
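Here is a minimal sketch of that fallback, in Python for illustration only (not Text::Hyphen code). The priority order follows the list above; the choice of the candidate nearest the middle of the fragment, the "split anywhere" last resort, and the names `force_split`, `max_len`, `left_min`, and `right_min` are my own assumptions.

```python
def _break_classes(s):
    """Yield lists of break positions, highest priority first.

    A position i means a break between s[i-1] and s[i].
    """
    n = range(1, len(s))
    # After hyphens/dashes (ASCII hyphen, U+2010, en dash, em dash).
    yield [i for i in n if s[i - 1] in "-\u2010\u2013\u2014"]
    # Between a lowercase and an uppercase letter (camelCase).
    yield [i for i in n if s[i - 1].islower() and s[i].isupper()]
    # Between a letter and a digit, or vice versa.
    yield [i for i in n
           if (s[i - 1].isalpha() and s[i].isdigit())
           or (s[i - 1].isdigit() and s[i].isalpha())]
    # Within runs of digits.
    yield [i for i in n if s[i - 1].isdigit() and s[i].isdigit()]
    # After punctuation (any non-alphanumeric character).
    yield [i for i in n if not s[i - 1].isalnum()]
    # Last resort (assumption): anywhere at all.
    yield list(n)


def force_split(s, max_len=8, left_min=2, right_min=3):
    """Chop s until no piece is longer than max_len characters."""
    if len(s) <= max_len:
        return [s]
    mid = len(s) / 2
    for positions in _break_classes(s):
        valid = [i for i in positions
                 if left_min <= i <= len(s) - right_min]
        if valid:
            # Prefer the candidate closest to the middle of the fragment.
            cut = min(valid, key=lambda i: abs(i - mid))
            return (force_split(s[:cut], max_len, left_min, right_min)
                    + force_split(s[cut:], max_len, left_min, right_min))
    return [s]  # too short to split under the prefix/suffix minimums
```

For example, `force_split("camelCaseText")` gives `["camel", "CaseText"]`, and the long yell from the cliff comes back as pieces of at most 8 characters each. Passing a very large `max_len` effectively suppresses the behavior.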
