Numerical Commas Fix #853

panda01 · 2024-03-21T20:13:30Z

Added a fix to not count commas inside of things with prices like 5,99 etc to give a more accurate content score.

One example of this causing problems is on this page https://www.krefel.be/nl/c/airfryers. However I think you also have to lower the char threshold for it to work

…things

cmkm · 2024-03-22T18:03:59Z

It'd be helpful to get a little more detail here. Is there a specific part of this page (such as the description) you'd like to be included in the results? What's the character threshold value to see the expected results?

panda01 · 2024-03-26T13:42:21Z

@cmkm Sorry had a hectic past few days.

The character threshold to see the content is 300 characters.
I'm trying to see the first paragraph that starts with "Wil je graag een airfryer kopen?" Just lowering the threshold didn't help.

When Readability tries to grab the content it only grabs the product list and the descriptions below it. In my investigations I found that the comma detection was giving the product list extra points (like 55 was the score) because of the commas it was detecting in the prices like 109,00.

I just am assuming here that detecting commas in prices, numbers, etc was not the original purpose of that code, and we really don't care about those anyway if we're looking for readable content. So i just changed the Regex to only look for commas surrounded by non digits. Which def helped with the extraction for the specific page above, and gave me that first paragraph and all of the content below. Which is def closer I would say to what a user might expect when using readability on that page

cmkm · 2024-04-05T16:46:35Z

Thank you! This seems reasonable, but before we merge it, would you mind please adding a test case by using the instructions here? If you have an example of a page with slightly more readerable content, that would be ideal. Thanks! :)

panda01 · 2024-04-11T19:22:35Z

@cmkm Awesome, I definitely would love to contribute! However I did try and run that script to no avail, and even went so far as adding some more console.log's on it to see where it's getting hung up.

Below is a screenshot of my modifications and the output when I try and run the command.

I'm not sure what I'm doing wrong here, so any guidance is greatly appreciated!

cmkm · 2024-04-18T21:23:11Z

Thanks for your patience here! I think what you're running into is the same thing that I was, which is that this website may be blocking certain user agents, so your requests are timing out. Is there another website test case that your patch fixes that you might be able to try instead?

panda01 · 2024-05-29T15:03:33Z

Busy couple of weeks there! I can def try and find some new source to check this. Just was thinking what happened with this PR

gijsk · 2024-06-20T17:17:19Z

So I tried to generate a testcase for this locally to test this. The problem is that it seems to do the opposite of what the comments suggest.

Specifically, before this patch, the "expected" HTML includes the "Wil je graag een airfryer kopen?" text. After the patch, that paragraph is removed.

panda01 · 2024-07-01T13:36:19Z

So I tried to generate a testcase for this locally to test this. The problem is that it seems to do the opposite of what the comments suggest.

Specifically, before this patch, the "expected" HTML includes the "Wil je graag een airfryer kopen?" text. After the patch, that paragraph is removed.

@gijsk This is an image of the website on my ff v127.0 64bit apple intel. I don't see the above mentioned section "Wil je graag een airfryer kopen?"

Also, with this particular website, I believe you also have to lower the character threshold in order for this to work. Either way, the product list shouldn't be counted so high in the algorithm.

The reason why it is counted so high is because of the commas being counted in the prices, which is why this PR works.

panda01 · 2024-09-03T17:02:57Z

Hello, I would really like to contribute to this repo. And apparently this is still relevant, however this seems stuck. Is there something I'm missing here as to why this can't be merged? It's a very tiny PR, and I'm just having a hard time understanding what the hold up is, and what I can do differently.

gijsk · 2024-09-26T16:33:10Z

Hello, I would really like to contribute to this repo. And apparently this is still relevant, however this seems stuck. Is there something I'm missing here as to why this can't be merged? It's a very tiny PR, and I'm just having a hard time understanding what the hold up is, and what I can do differently.

Well, we'd really like to make sure that the patch works and that we don't break it in the future. In order to do that it'd be helpful to have a testcase as part of the pull request that checks the output for a site where this PR makes a difference. As I said, I tried to generate one based on the site linked in the initial comment but the outcome of the changes here is that the produced text is worse than before the changes.

It sounds like you're confident that this is a positive change so it would be helpful if you could include a testcase that demonstrates this.

[price commas fix] Adding a fix to not count commas in the prices of …

61eff12

…things

panda01 marked this pull request as ready for review March 21, 2024 20:13

gijsk mentioned this pull request Apr 8, 2024

Changing class and id check to node name check to follow more semantic HTML #848

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numerical Commas Fix #853

Numerical Commas Fix #853

panda01 commented Mar 21, 2024 •

edited

Loading

cmkm commented Mar 22, 2024

panda01 commented Mar 26, 2024 •

edited

Loading

cmkm commented Apr 5, 2024

panda01 commented Apr 11, 2024

cmkm commented Apr 18, 2024

panda01 commented May 29, 2024

gijsk commented Jun 20, 2024

panda01 commented Jul 1, 2024

panda01 commented Sep 3, 2024

gijsk commented Sep 26, 2024

Numerical Commas Fix #853

Are you sure you want to change the base?

Numerical Commas Fix #853

Conversation

panda01 commented Mar 21, 2024 • edited Loading

cmkm commented Mar 22, 2024

panda01 commented Mar 26, 2024 • edited Loading

cmkm commented Apr 5, 2024

panda01 commented Apr 11, 2024

cmkm commented Apr 18, 2024

panda01 commented May 29, 2024

gijsk commented Jun 20, 2024

panda01 commented Jul 1, 2024

panda01 commented Sep 3, 2024

gijsk commented Sep 26, 2024

panda01 commented Mar 21, 2024 •

edited

Loading

panda01 commented Mar 26, 2024 •

edited

Loading