Additional titlecasing amendments #96

Open
robinwhittleton opened this issue Dec 26, 2023 · 6 comments

@robinwhittleton
Contributor

robinwhittleton commented Dec 26, 2023

The original plan was to lift the regexes directly, but I’d forgotten that Standard Ebooks is a GPL3 codebase, while python-titlecase is MIT. We can’t copy everything over directly, so the new plan is that I’ll copy over my own original contributions, plus anything that other contributors agree can be contributed.


At Standard Ebooks we use python-titlecase to format a bunch of stuff throughout our productions (thanks!) but we also have some additional rules and changes to meet our specific needs. These start at [redacted]; the comments as a list give you a good overview:

  • Uppercase Roman numerals, but only if they are valid Roman numerals and they are not MIX (which is much more likely to be an English word than a Roman numeral) or DI which may be an Italian word
  • Lowercase and, or even if preceded by punctuation
  • pip_titlecase capitalizes all prepositions preceded by parenthesis; we only want to capitalize ones that aren't the first word of a subtitle. OK: From Sergeant Bulmer (of the Detective Police) to Mr. Pendril. OK: Three Men in a Boat (To Say Nothing of the Dog)
  • Uppercase words preceded by en or em dash
  • Lowercase and, if it's not the very first word, and not preceded by an em-dash
  • Lowercase the, if preceded by a dash (like Puss-in-Boots or Jack-in-the-Box)
  • Lowercase "in", if followed by a semicolon (but not words like "inheritance")
  • Lowercase th’, sometimes used poetically
  • Lowercase o’
  • Uppercase words that begin compound words, like to-night (which might appear in poetry)
  • Lowercase from, with, as long as they're not the first word and not preceded by a parenthesis
  • Capitalise the first word after an opening quote or italicisation that signifies a work (this relies on SE-specific markup)
  • Lowercase the if preceded by vs.
  • Lowercase de, von, van, le, du as in Charles de Gaulle, Wernher von Braun, etc., but only if not the first word and not preceded by an opening quotation mark (“)
  • Uppercase word following "Or,", since it is probably a subtitle
  • Uppercase word following ":", except "or,", which indicates a kind of subtitle
  • Uppercase words after an initial contraction, like O'Keefe or L'Affaire. But only if there's at least 3 letters after, to prevent catching things like I'm or E're
  • Uppercase letter after Mc
  • Uppercase first letter after beginning contraction
  • Uppercase first letter
  • Lowercase by
  • Lowercase leading d’, as in Marie d’Elle
  • Uppercase l’ as in l’Affaire, but not if it's the first letter
  • Uppercase leading A- as in A-Breaking
  • Uppercase some known initialisms
  • Lowercase À (as in À La Carte) unless it's the first word
  • Uppercase initialisms
  • Uppercase No. as in Number
  • Lowercase V. as in versus in a legal case
  • Lowercase mm (millimeters, as in 50 mm gun) unless it's followed by a period, in which case it's likely Mm. (Messieurs)
  • Lowercase al- (as in the Arabic definite article) unless it’s the first word
  • …and some special cases
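
As an illustration of how one item from this list could be approached from scratch, here is a minimal sketch of the Roman numeral rule. It is written from the description above only and is not the Standard Ebooks implementation. The function name `uppercase_roman` and the constants are hypothetical; the `(word, **kwargs)` signature is chosen on the assumption that the rule would be plugged in as a python-titlecase callback, returning `None` when the rule does not apply.

```python
import re

# Strict Roman numeral pattern (1 to 3999); rejects malformed strings such as "IIII".
_ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    re.IGNORECASE,
)

# Exceptions named in the list above: these are valid numerals on paper,
# but far more likely to be ordinary words in a title.
_NOT_NUMERALS = {"MIX", "DI"}


def uppercase_roman(word, **kwargs):
    """Return the word uppercased if it is a valid Roman numeral, else None."""
    # The pattern also matches the empty string, so guard against it explicitly.
    if not word or word.upper() in _NOT_NUMERALS:
        return None
    if _ROMAN.fullmatch(word):
        return word.upper()
    return None
```

A sketch like this still uppercases short words that happen to be valid numerals (for example "liv" or "mi"), so a real rule would probably need a longer exception list.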

Would any of these be things that python-titlecase are interested in? I’d be happy to upstream them as PRs.

@MinchinWeb

Personally, I think these would all be great additions!

Regarding À La Carte, should the la (literally, "the") also be lowercase, so that it becomes à la Carte?

Should that also be extended to lowercase au (à + le) and aux (à + les) as well? (these are the masculine and plural forms of à la, which is feminine).

Regarding "Lowercase de, von, van, le, du": should this list be extended to des (de + les; du is de + le; de la is written out; all meaning "of the")? And also to les and la (the plural and feminine forms of le, meaning "the")? (I realize la is sometimes used as a music note, so its inclusion may cause more false positives than is helpful.)
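
To make the trade-off concrete, here is a rough sketch of a particle-lowercasing pass with the extended word list as an opt-in set. Everything here (`PARTICLES`, `PROPOSED`, `lowercase_particles`) is a hypothetical name for illustration, and the sketch simplifies the rule from the list: it only checks for "not the first word" and ignores the opening-quote condition.

```python
# Particles from the original rule in the list.
PARTICLES = {"de", "von", "van", "le", "du"}

# Extensions suggested in this comment; whether to include them is an open question.
PROPOSED = {"des", "les", "la", "au", "aux"}


def lowercase_particles(title, particles=PARTICLES):
    """Lowercase known particles unless they are the first word of the title."""
    words = title.split()
    result = []
    for index, word in enumerate(words):
        if index > 0 and word.lower() in particles:
            result.append(word.lower())
        else:
            result.append(word)
    return " ".join(result)
```

With the extended set, a title such as "Do Re Mi Fa Sol La" would have its final "La" lowercased, which is the kind of false positive mentioned above.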

@ppannuto
Owner

These all look reasonable to me, and happy to take a PR (or possibly better several PRs as these look like a large number of rules?).

My biggest ask would be to make sure that each new rule adds a test case or two which demonstrates (and validates) when it is and is not supposed to trigger, and that it's operating correctly. Should be easy to just add a phrase-per-rule ish to the tests.py.
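
A phrase-per-rule check of that kind might look roughly like the sketch below, which pairs one rule from the list ("Uppercase letter after Mc") with cases where it should and should not trigger. The rule function, the case table, and the `check_cases` helper are all hypothetical and written only from the one-line description in the list; the project's actual tests.py may be organised differently.

```python
import re


def capitalize_after_mc(word, **kwargs):
    """Hypothetical rule: 'mcdonald' -> 'McDonald'; None when it does not apply."""
    match = re.fullmatch(r"mc([a-z])([a-z]+)", word, flags=re.IGNORECASE)
    if match is None:
        return None
    return "Mc" + match.group(1).upper() + match.group(2).lower()


# Each entry is (input word, expected result); None means "rule must not trigger".
CASES = [
    ("mcdonald", "McDonald"),
    ("MCGREGOR", "McGregor"),
    ("machine", None),  # starts with "ma", not "mc"
    ("mc", None),       # nothing follows the prefix
]


def check_cases(rule, cases):
    """Return (word, actual, expected) for every case the rule gets wrong."""
    return [(word, rule(word), expected)
            for word, expected in cases
            if rule(word) != expected]
```

Calling `check_cases(capitalize_after_mc, CASES)` returns an empty list when every positive and negative case behaves as expected.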

@robinwhittleton
Contributor Author

So, something I forgot was that Standard Ebooks’ tooling is GPL3 which isn’t compatible with MIT. That makes the way forwards a little difficult and comes down to a couple of options.

  1. I could check which of them I added and leave it at that. Potentially I could check in with other contributors to see if they’d be happy having their contributions reused in an MIT codebase. But I’ve checked with one of the bigger contributors and they’re not.
  2. Alternatively I could leave the list here, but remove the link. Then other people could do a cleanroom implementation of the functionality without reference to a GPL3 codebase.

Sorry about that, it honestly didn’t cross my mind until I sat down to actually implement it.

@ppannuto
Owner

ppannuto commented Apr 6, 2024

:/, that's unfortunate. I definitely can't do any kind of license change on this end in good conscience. I'm just a steward really of a project which has had many owners over the years.

I guess the best path forward is to pull over any changes you can, and then leave this open with the link removed as you suggested. It's a great todo list for anyone looking for some simple OSS contributions at least.

@robinwhittleton
Contributor Author

OK, I’ll try to get around to this at some point over the next week.

@robinwhittleton
Contributor Author

robinwhittleton commented May 12, 2024

I went through the blame to see who’d contributed which rules, and it turns out that all but two were written by a contributor who would (reasonably, of course) prefer that their code remain GPL3 rather than MIT. The other two were written by me, but are not useful in the more general context.

So I think I’ve done as much as I can here. I know the original code so I don’t want to attempt a black-box reimplementation as MIT. If anyone else who hasn’t read the GPL3 code wants to take this list as the starting point for python-titlecase improvements then go for it, but otherwise let’s close this issue.

Thanks for the time anyway, and sorry that I hadn’t been more careful about licensing when I proposed this.

3 participants