Add downloader for AVCX #202

Open
wants to merge 8 commits into
base: main

afontenot
Contributor

Adds a downloader for AVCX (American Values Club Crosswords). These are popular crosswords from a variety of creators; see https://avxwords.com/about-us/.

This is a subscription-only crossword series, and requires authentication. This is handled in exactly the same way as NYT.

This downloader may not seem to serve an obvious purpose, given that AVCX emails subscribers an AcrossLite compatible .puz file for every new release. However, I'm thinking it will be useful for the following features:

  • Automatic downloading of new crosswords, e.g. using a crontab.
  • Would allow downstream software like Gnome Crosswords to fetch AVCX automatically.
  • I do a few fixups on the .puz files and include difficulty metadata and the puzzle notes.
  • Some crosswords are not available in .puz from the AVCX website, e.g. barred crosswords like https://avxwords.com/puzzles/1621. As these are available in JPZ I think it would be nice to do a very minimal translation into a close .puz approximation. (This is currently TODO.)

@afontenot
Contributor Author

Just noticed there's sort of a JPZ parser already in compilerdownloader.py, but there are subtle differences. The compiler parser doesn't handle when the clue text is inside a <span> element, as it is in AVCX, and it also appears to have no handling at all for barred crosswords. It would have to be extended if it were to work for AVCX, but it's certainly a starting point.
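To illustrate the <span> problem described above, here is a minimal sketch using only the standard library. The XML fragment and the clue_text helper are made up for illustration; real JPZ files use XML namespaces and a richer structure, and this is not the code in compilerdownloader.py. The idea is simply that reading an element's .text misses text held in child elements, while itertext() collects it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified JPZ-like fragment (real files are namespaced).
# The second clue keeps its text inside a <span>, as described for AVCX.
SAMPLE = """
<clues>
  <clue word="1" number="1">Plain clue text</clue>
  <clue word="2" number="5"><span>Clue with <i>markup</i> inside</span></clue>
</clues>
""".strip()


def clue_text(clue):
    """Return all text under a clue element, including nested elements."""
    return "".join(clue.itertext()).strip()


root = ET.fromstring(SAMPLE)
clues = {c.get("number"): clue_text(c) for c in root.iter("clue")}

# Reading .text directly finds nothing for the wrapped clue, because the
# text belongs to the child <span>, not to the <clue> element itself.
wrapped_direct_text = root.findall("clue")[1].text
```

Barred grids would need separate handling; this sketch only covers the clue-text case.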


    def find_solver(self, url):
        if "puzzles" in url:
            url = url.removesuffix("/")
Owner

str.removesuffix is 3.9+, and 3.8 is not quite EOL yet, but I'm okay with that.

Contributor Author

Yep, normally I'd be all for backward compatibility, but 3.8 will probably be EOL before the next release of xword-dl, and at this point even Debian oldstable has 3.9. Still, willing to change it if you'd prefer.
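For reference, if 3.8 support were wanted, a version-independent equivalent is short. This is only a sketch; strip_trailing_slash is a hypothetical helper name, not something in the PR.

```python
def strip_trailing_slash(url: str) -> str:
    """Remove one trailing "/" if present, like url.removesuffix("/").

    Works on Python versions older than 3.9, where str.removesuffix
    does not exist.
    """
    if url.endswith("/"):
        return url[:-1]
    return url
```

Note that url.rstrip("/") is not the same thing: it removes every trailing slash, while removesuffix (and the helper above) removes at most one.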

@thisisparker
Owner

At a glance, this is great! I want to poke around at it some and test it out, but as an AVCX subscriber I would totally use this.

@afontenot
Contributor Author

@thisisparker Question about using puzzle.notes in a downloader: the saved file lacks the newline characters of the original string. Is this something that xword_dl is stripping out (e.g. perhaps treating the notes field as HTML?), or do I need to chase down an issue in the puzpy library?

@thisisparker
Owner

It's likely my cleanup function being a little overzealous. These are \n characters getting stripped? I will take a look and confirm.

@afontenot
Contributor Author

afontenot commented Aug 3, 2024

> It's likely my cleanup function being a little overzealous. These are \n characters getting stripped? I will take a look and confirm.

Yep, my AVCX code slaps several bits of metadata into the notes with "\n\n".join(self.descriptions), but there are no \n characters at all in the resulting file.

@afontenot
Contributor Author

I had a look myself: this is an issue with running html2text on the notes. Whitespace is not significant in HTML, so this is correct behavior from the html2text library, but we should probably only be calling it on fields that contain HTML.

Also, what's the intended purpose of using this library? It converts HTML to a Markdown equivalent, but does the AcrossLite PUZ specification support Markdown text representation? Are there specific programs that display it correctly? I tried putting the HTML markup directly in puzzle.notes but the resulting document contained a bunch of Markdown links which made the notes hard to read in Gnome Crosswords.

@thisisparker
Owner

The intention behind html2text is to convert from something that looks "marked up" to something that doesn't, because some clients don't render html and e.g. <em>foreign phrase</em> probably looks worse than the same thing in _s. (In other words, I'm actually just looking for a "plaintext" representation of formatted text, and for formatting elements markdown is pretty good, but it's not great for links as you note.) This is kind of orthogonal to the puz spec itself, which is afaict silent on markup questions, though it's possible the "observed spec" has moved a bit in the direction of HTML if AcrossLite now supports it; I actually don't know whether that's the case.

That's all probably a matter of opinion! Which is why I added the --preserve-html flag, which should skip the invocation of html2text entirely. Again sorry, writing this quickly, but does that happen to do the right thing for you?

@afontenot
Contributor Author

> Again sorry, writing this quickly, but does that happen to do the right thing for you?

Yes, that fixes the issue with removing new lines.

@thisisparker
Owner

> Yes, that fixes the issue with removing new lines.

Alright! Then one option is to pass it at runtime each time, or another would be to put a preserve_html line in your settings file (under the general section or a specific outlet). I'm not inclined to change this behavior in the short term because I personally use a client that doesn't render the HTML and I prefer the look of unformatted markdown, but I am aware that's probably increasingly idiosyncratic.

@afontenot
Contributor Author

Hmm, well me not liking the look of it is one thing, but it removing any new lines in the notes string is another. That seems like it should be avoided. Should downloaders that have plain text notes replace \n with <br> to get the correct output?

Seems like this ought to affect the Puzzle Society downloader too, although that one is currently disabled.

@thisisparker
Owner

I think that using <br> in these notes instead of \n is the right solution. By default they'll be converted, and if you're saving for a context that will render HTML, you'll be using the preserve flag and you'll still get the newlines.

Semantically it's probably even a touch better to just wrap paragraphs in <p> tags, which should have the same effect after html2text, but which might not be quite as concise as just using '<br><br>'.join(). (I can contrive a scenario where <p> rendering is cleaner than <br>s, but that's fully speculative.)
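The two options discussed above can be compared side by side. This is a sketch with made-up description strings and hypothetical helper names; it only builds the notes string and does not call html2text.

```python
def join_with_breaks(descriptions):
    """Separate entries with two <br> tags (the concise option)."""
    return "<br><br>".join(descriptions)


def join_as_paragraphs(descriptions):
    """Wrap each entry in its own <p> element (the semantic option)."""
    return "\n\n".join(f"<p>{d}</p>" for d in descriptions)


# Made-up sample data for illustration only.
descriptions = ["Difficulty: 3/5", "Edited by A. Editor."]

notes_br = join_with_breaks(descriptions)
notes_p = join_as_paragraphs(descriptions)
```

Either string keeps the paragraph boundaries in markup rather than in bare \n characters, which is what lets them survive an HTML-to-text conversion.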

            self.descriptions.append(f"Edited by {parts[2]}.")

        if self.descriptions:
            puzzle.notes = "\n\n".join(self.descriptions)
Owner

Per my last comment, I think you could do this like

puzzle.notes = "\n\n".join([f"<p>{d}</p>" for d in self.descriptions])

Contributor Author

Okay, I've mostly done that. There's a little more tinkering to get a nice plain text rendering; I'm replacing any links with just the text when preserve-html is off, hopefully you think that's a reasonable compromise if no one is actually rendering the Markdown at present.
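Roughly, the link handling could look like the sketch below. This is an illustration under assumptions, not the code in the PR: format_notes and its preserve_html parameter are hypothetical names, and a regex is used only because the notes markup is assumed to be simple (no nested anchors).

```python
import re

# Matches an <a ...>...</a> element and captures its inner content.
_LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def format_notes(notes_html: str, preserve_html: bool) -> str:
    """Keep links when HTML is preserved; otherwise keep only the link text."""
    if preserve_html:
        return notes_html
    return _LINK_RE.sub(r"\1", notes_html)


sample = '<p>See <a href="https://avxwords.com/about-us/">the AVCX site</a>.</p>'
plain = format_notes(sample, preserve_html=False)
kept = format_notes(sample, preserve_html=True)
```

With the anchors unwrapped first, a later conversion to plain text has no link to turn into Markdown link syntax.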
