fix: less strict filtering of headers
inhumantsar committed May 4, 2024
1 parent c1f2ca6 commit 886a5f2
Showing 18 changed files with 3,317 additions and 74 deletions.
6 changes: 3 additions & 3 deletions Readability.js
@@ -122,10 +122,10 @@ Readability.prototype = {
REGEXPS: {
// NOTE: These two regular expressions are duplicated in
// Readability-readerable.js. Please keep both copies in sync.
- unlikelyCandidates: /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i,
+ unlikelyCandidates: /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i,
okMaybeItsACandidate: /and|article|body|column|content|main|shadow/i,

- positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i,
+ positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story|header/i,
negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i,
extraneous: /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility/i,
byline: /byline|author|dateline|writtenby|p-author/i,
@@ -2278,7 +2278,7 @@ Readability.prototype = {
if (!articleContent)
return null;

- this.log("Grabbed: " + articleContent.innerHTML);
+ //this.log("Grabbed: " + articleContent.innerHTML);

this._postProcessContent(articleContent);

3 changes: 2 additions & 1 deletion test/test-pages/buzzfeed-1/expected-metadata.json
@@ -1,9 +1,10 @@
{
"title": "Student Dies After Diet Pills She Bought Online \"Burned Her Up From Within\"",
- "byline": null,
+ "byline": "Mark Di Stefano\n BuzzFeed News Reporter",
"dir": null,
"lang": "en",
"excerpt": "An inquest into Eloise Parry's death has been adjourned until July.",
"siteName": "BuzzFeed",
"publishedTime": null,
"readerable": true
}
85 changes: 53 additions & 32 deletions test/test-pages/buzzfeed-1/expected.html
@@ -1,40 +1,61 @@
<div id="readability-page-1" class="page">
<div id="buzz_sub_buzz">
<div id="superlist_3758406_5547137" rel:buzz_num="1">
<h2>The mother of a woman who took suspected diet pills bought online has described how her daughter was “literally burning up from within” moments before her death.</h2>
<p> <span>West Merica Police</span></p>
</div>
<div id="superlist_3758406_5547213" rel:buzz_num="2">
<p>Eloise Parry, 21, was taken to Royal Shrewsbury hospital on 12 April after taking a lethal dose of highly toxic “slimming tablets”. </p>
<p>“The drug was in her system, there was no anti-dote, two tablets was a lethal dose – and she had taken eight,” her mother, Fiona, <a href="https://www.westmercia.police.uk/article/9501/A-tribute-to-Eloise-Aimee-Parry-written-by-her-mother-Fiona-Parry">said in a statement</a> yesterday.</p>
<p>“As Eloise deteriorated, the staff in A&amp;E did all they could to stabilise her. As the drug kicked in and started to make her metabolism soar, they attempted to cool her down, but they were fighting an uphill battle.</p>
<p>“She was literally burning up from within.”</p>
<p>She added: “They never stood a chance of saving her. She burned and crashed.”</p>
</div>
<div id="superlist_3758406_5547140" rel:buzz_num="3">
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" height="412" width="203" /></p>
</div>
<p>Facebook</p>
<div rel:bf_bucket="track" track="{&quot;c&quot;:&quot;7FNW2J7&quot;,&quot;u&quot;:&quot;7717MJ7&quot;,&quot;buzz&quot;:&quot;diet-pills-burns-up&quot;,&quot;user&quot;:&quot;markdistefano&quot;,&quot;types&quot;:[100],&quot;queries&quot;:[]}">
<header id="post-3758406" rel:bf_bucket="track" track="{&quot;c&quot;:&quot;7FNW2J7&quot;,&quot;u&quot;:&quot;7717MJ7&quot;,&quot;buzz&quot;:&quot;diet-pills-burns-up&quot;,&quot;user&quot;:&quot;markdistefano&quot;,&quot;types&quot;:[100],&quot;queries&quot;:[]}" rel:ptool="true" rel:ptool_code="0.0.1.2.0.0" rel:owner="markdistefano" rel:advertiser="0" rel:partner="0" rel:data="{&quot;buzz_id&quot;:&quot;3758406&quot;,&quot;type&quot;:&quot;super&quot;,&quot;uri&quot;:&quot;diet-pills-burns-up&quot;,&quot;form_id&quot;:&quot;20&quot;,&quot;category&quot;:&quot;UKNews&quot;}" rel:ptool_stats="{&quot;impressions&quot;:&quot;653,817&quot;,&quot;email_shares&quot;:&quot;81&quot;,&quot;pinterest_count&quot;:&quot;&quot;,&quot;twitter_count&quot;:&quot;251&quot;,&quot;viral_lift&quot;:&quot;1.7X&quot;,&quot;facebook_count&quot;:&quot;665&quot;}">
<div id="buzz_header" rel:gt_cat="[ttp]:header">
<hgroup>
<a name="post-title"></a>
<p>
<b>An inquest into Eloise Parry’s death has been adjourned until July.</b>
</p>
<span>
<span id="update_posted_time_3758406">posted on April 21, 2015, at 11:29 a.m.</span>
</span>
</hgroup>
</div>
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" height="412" width="412" /></p>
</header>
<div data-print="body" rel:gt_cat="[ttp]:content">
<div id="buzz_sub_buzz">
<div id="superlist_3758406_5547137" rel:buzz_num="1">
<h2>The mother of a woman who took suspected diet pills bought online has described how her daughter was “literally burning up from within” moments before her death.</h2>
<p> <span>West Merica Police</span></p>
</div>
<div id="superlist_3758406_5547213" rel:buzz_num="2">
<p>Eloise Parry, 21, was taken to Royal Shrewsbury hospital on 12 April after taking a lethal dose of highly toxic “slimming tablets”. </p>
<p>“The drug was in her system, there was no anti-dote, two tablets was a lethal dose – and she had taken eight,” her mother, Fiona, <a href="https://www.westmercia.police.uk/article/9501/A-tribute-to-Eloise-Aimee-Parry-written-by-her-mother-Fiona-Parry">said in a statement</a> yesterday.</p>
<p>“As Eloise deteriorated, the staff in A&amp;E did all they could to stabilise her. As the drug kicked in and started to make her metabolism soar, they attempted to cool her down, but they were fighting an uphill battle.</p>
<p>“She was literally burning up from within.”</p>
<p>She added: “They never stood a chance of saving her. She burned and crashed.”</p>
</div>
<div id="superlist_3758406_5547140" rel:buzz_num="3">
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" height="412" width="203" /></p>
</div>
<p>Facebook</p>
</div>
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" height="412" width="412" /></p>
</div>
<p>Facebook</p>
</div>
</div>
<div id="superlist_3758406_5547284" rel:buzz_num="4">
<p>West Mercia police <a href="https://www.westmercia.police.uk/article/9500/Warning-Issued-As-Shrewsbury-Woman-Dies-After-Taking-Suspected-Diet-Pills">said the tablets were believed to contain dinitrophenol</a>, known as DNP, which is a highly toxic industrial chemical. </p>
<p>“We are undoubtedly concerned over the origin and sale of these pills and are working with partner agencies to establish where they were bought from and how they were advertised,” said chief inspector Jennifer Mattinson from the West Mercia police.</p>
<p>The Food Standards Agency warned people to stay away from slimming products that contained DNP.</p>
<p>“We advise the public not to take any tablets or powders containing DNP, as it is an industrial chemical and not fit for human consumption,” it said in a statement.</p>
</div>
<div id="superlist_3758406_5547219" rel:buzz_num="5">
<h2>Fiona Parry issued a plea for people to stay away from pills containing the chemical.</h2>
<p>“[Eloise] just never really understood how dangerous the tablets that she took were,” she said. “Most of us don’t believe that a slimming tablet could possibly kill us.</p>
<p>“DNP is not a miracle slimming pill. It is a deadly toxin.”</p>
</div>
<p>Facebook</p>
</div>
<p><a href="http://buzzfeed.com/"><b>Check out more articles on BuzzFeed.com!</b></a></p>
</div>
<div id="superlist_3758406_5547284" rel:buzz_num="4">
<p>West Mercia police <a href="https://www.westmercia.police.uk/article/9500/Warning-Issued-As-Shrewsbury-Woman-Dies-After-Taking-Suspected-Diet-Pills">said the tablets were believed to contain dinitrophenol</a>, known as DNP, which is a highly toxic industrial chemical. </p>
<p>“We are undoubtedly concerned over the origin and sale of these pills and are working with partner agencies to establish where they were bought from and how they were advertised,” said chief inspector Jennifer Mattinson from the West Mercia police.</p>
<p>The Food Standards Agency warned people to stay away from slimming products that contained DNP.</p>
<p>“We advise the public not to take any tablets or powders containing DNP, as it is an industrial chemical and not fit for human consumption,” it said in a statement.</p>
</div>
<div id="superlist_3758406_5547219" rel:buzz_num="5">
<h2>Fiona Parry issued a plea for people to stay away from pills containing the chemical.</h2>
<p>“[Eloise] just never really understood how dangerous the tablets that she took were,” she said. “Most of us don’t believe that a slimming tablet could possibly kill us.</p>
<p>“DNP is not a miracle slimming pill. It is a deadly toxin.”</p>
<div>
<p>Mark di Stefano is a breaking news reporter for BuzzFeed News and is based in Sydney, Australia. </p>
</div>
</div>
</div>
4 changes: 4 additions & 0 deletions test/test-pages/cnet/expected.html
@@ -25,5 +25,9 @@
<p>It wasn't clear why these strategies didn't work on Snapchat CEO Evan Spiegel, who <a href="https://www.cnet.com/news/snapchat-said-to-rebuff-3-billion-offer-from-facebook/">famously rebuffed</a> a $3 billion takeover offer from Facebook in 2013.</p>
<p><em><strong>Tech Enabled:</strong> CNET chronicles tech's role in providing new kinds of accessibility. Check it out <a href="https://www.cnet.com/tech-enabled/">here</a>.</em><em><strong><br /></strong></em></p>
<p><em><strong>Technically Literate:</strong> Original works of short fiction with unique perspectives on tech, exclusively on CNET. <a href="https://www.cnet.com/technically-literate/">Here</a>.</em></p>
<div id="taboola-zuckerberg-offers-peek-at-facebooks-acquisition-strategies-below-article-thumbnails-article-redesign" data-component="taboola" data-taboola-options="{&quot;mode&quot;:&quot;thumbnails-f&quot;,&quot;container&quot;:&quot;taboola-zuckerberg-offers-peek-at-facebooks-acquisition-strategies-below-article-thumbnails-article-redesign&quot;,&quot;canonicalUrl&quot;:&quot;https:\/\/www.cnet.com\/news\/zuckerberg-offers-peek-at-facebooks-acquisition-strategies\/&quot;,&quot;placement&quot;:&quot;Below Article Thumbnails Article Redesign&quot;,&quot;width&quot;:&quot;col-8&quot;,&quot;isXhr&quot;:false,&quot;target_type&quot;:&quot;mix&quot;}" data-placement-name="article desktop Below Article Thumbnails Article Redesign">
<p><span><span>YOU</span> MAY ALSO LIKE</span>
</p>
</div>
</div>
</div>
20 changes: 20 additions & 0 deletions test/test-pages/engadget/expected.html
@@ -11,6 +11,24 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
</div>
</section>
<div>
<h4>Engadget Score <div>
<figure>
<div data-rating-from="1" data-rating-to="55">
<p><span>Poor</span></p>
</div>
<div data-rating-from="55" data-rating-to="70">
<p><span>Uninspiring</span></p>
</div>
<div data-rating-from="70" data-rating-to="85">
<p><span>Good</span></p>
</div>
<div data-rating-from="85" data-rating-to="100">
<p><span>Excellent</span></p>
</div>
<figcaption>Key</figcaption>
</figure>
</div>
</h4>
<div>
<div>
<ul>
@@ -20,6 +38,7 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
</ul>
</div>
<div>
<h5>Cons</h5>
<ul>
<li>Expensive </li>
<li>Not worth it if you don’t have a 4K TV </li>
@@ -28,6 +47,7 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
</div>
</div>
<div>
<h4>Summary</h4>
<p>As promised, the Xbox One X is the most powerful game console ever. In practice, though, it really just puts Microsoft on equal footing with Sony’s PlayStation 4 Pro. 4K/HDR enhanced games look great, but it’s lack of VR is disappointing in 2017.</p>
</div>
</div>