Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numerical Commas Fix #853

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

panda01
Copy link

@panda01 panda01 commented Mar 21, 2024

Added a fix to not count commas inside of things with prices like 5,99 etc to give a more accurate content score.

One example of this causing problems is on this page https://www.krefel.be/nl/c/airfryers. However I think you also have to lower the char threshold for it to work

@panda01 panda01 marked this pull request as ready for review March 21, 2024 20:13
@cmkm
Copy link

cmkm commented Mar 22, 2024

It'd be helpful to get a little more detail here. Is there a specific part of this page (such as the description) you'd like to be included in the results? What's the character threshold value to see the expected results?

@panda01
Copy link
Author

panda01 commented Mar 26, 2024

@cmkm Sorry had a hectic past few days.

  • The character threshold to see the content is 300 characters.
  • I'm trying to see the first paragraph that starts with "Wil je graag een airfryer kopen?" Just lowering the threshold didn't help.

When Readability tries to grab the content it only grabs the product list and the descriptions below it. In my investigations I found that the comma detection was giving the product list extra points (like 55 was the score) because of the commas it was detecting in the prices like 109,00.

I just am assuming here that detecting commas in prices, numbers, etc was not the original purpose of that code, and we really don't care about those anyway if we're looking for readable content. So i just changed the Regex to only look for commas surrounded by non digits. Which def helped with the extraction for the specific page above, and gave me that first paragraph and all of the content below. Which is def closer I would say to what a user might expect when using readability on that page

@cmkm
Copy link

cmkm commented Apr 5, 2024

Thank you! This seems reasonable, but before we merge it, would you mind please adding a test case by using the instructions here? If you have an example of a page with slightly more readerable content, that would be ideal. Thanks! :)

@panda01
Copy link
Author

panda01 commented Apr 11, 2024

@cmkm Awesome, I definitely would love to contribute! However I did try and run that script to no avail, and even went so far as adding some more console.log's on it to see where it's getting hung up.

Below is a screenshot of my modifications and the output when I try and run the command.
Screenshot 2024-04-11 at 3 19 47 PM

I'm not sure what I'm doing wrong here, so any guidance is greatly appreciated!

@cmkm
Copy link

cmkm commented Apr 18, 2024

Thanks for your patience here! I think what you're running into is the same thing that I was, which is that this website may be blocking certain user agents, so your requests are timing out. Is there another website test case that your patch fixes that you might be able to try instead?

@panda01
Copy link
Author

panda01 commented May 29, 2024

Busy couple of weeks there! I can def try and find some new source to check this. Just was thinking what happened with this PR

@gijsk
Copy link
Contributor

gijsk commented Jun 20, 2024

So I tried to generate a testcase for this locally to test this. The problem is that it seems to do the opposite of what the comments suggest.

Specifically, before this patch, the "expected" HTML includes the "Wil je graag een airfryer kopen?" text. After the patch, that paragraph is removed.

@panda01
Copy link
Author

panda01 commented Jul 1, 2024

So I tried to generate a testcase for this locally to test this. The problem is that it seems to do the opposite of what the comments suggest.

Specifically, before this patch, the "expected" HTML includes the "Wil je graag een airfryer kopen?" text. After the patch, that paragraph is removed.

@gijsk This is an image of the website on my ff v127.0 64bit apple intel. I don't see the above mentioned section "Wil je graag een airfryer kopen?"
Screenshot 2024-07-01 at 9 31 54 AM

Also, with this particular website, I believe you also have to lower the character threshold in order for this to work. Either way, the product list shouldn't be counted so high in the algorithm.

The reason why it is counted so high is because of the commas being counted in the prices, which is why this PR works.

@panda01
Copy link
Author

panda01 commented Sep 3, 2024

Hello, I would really like to contribute to this repo. And apparently this is still relevant, however this seems stuck. Is there something I'm missing here as to why this can't be merged? It's a very tiny PR, and I'm just having a hard time understanding what the hold up is, and what I can do differently.

@gijsk
Copy link
Contributor

gijsk commented Sep 26, 2024

Hello, I would really like to contribute to this repo. And apparently this is still relevant, however this seems stuck. Is there something I'm missing here as to why this can't be merged? It's a very tiny PR, and I'm just having a hard time understanding what the hold up is, and what I can do differently.

Well, we'd really like to make sure that the patch works and that we don't break it in the future. In order to do that it'd be helpful to have a testcase as part of the pull request that checks the output for a site where this PR makes a difference. As I said, I tried to generate one based on the site linked in the initial comment but the outcome of the changes here is that the produced text is worse than before the changes.

It sounds like you're confident that this is a positive change so it would be helpful if you could include a testcase that demonstrates this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants