[Feat] Strip non-content tags, headers, footers #1

oliviermills · 2024-04-16T18:46:07Z

The markdown would be much more useful if you stripped headers, footers, and other elements such as filters that are not core content (i.e. low value for RAG/context). This could be done with tag- or class-based removal from the HTML, with something like Mozilla's Readability, or both! Highly opinionated class-based removal is risky, but it produces higher-value content with less noise.

For example, a language selector in a header ends up in the output and should be stripped:

[Skip to main content](#main-content)

Select LanguageEnglishAfrikaansAlbanianArabicArmenianAzerbaijaniBasqueBelarusianBengaliBosnianBulgarianCatalanCebuanoChinese (Simplified)Chinese (Traditional)CroatianCzechDanishDutchEsperantoEstonianFilipinoFinnishFrenchGalicianGeorgianGermanGreekGujaratiHaitian CreoleHausaHebrewHindiHmongHungarianIcelandicIgboIndonesianIrishItalianJapaneseJavaneseKannadaKhmerKoreanLaoLatinLatvianLithuanianMacedonianMalayMalteseMaoriMarathiMongolianNepaliNorwegianPersianPolishPortuguesePunjabiRomanianRussianSerbianSlovakSlovenianSomaliSpanishSwahiliSwedishTamilTeluguThaiTurkishUkrainianUrduVietnameseWelshYiddishYorubaZulu

Here is a starter list. It should probably be tested against a couple thousand random pages, using an LLM like Haiku with vision as the judge.

const exclude = [
  'header', '.header', '.top', '.navbar', '#header',
  'footer', '.footer', '.bottom', '#footer',
  '.sidebar', '.side', '.aside', '#sidebar',
  '.modal', '.popup', '#modal', '.overlay',
  '.ad', '.ads', '.advert', '#ad',
  '.lang-selector', '.language', '#language-selector',
  '.social', '.social-media', '.social-links', '#social',
  '.menu', '.navigation', 'nav', '#nav',
  '.breadcrumbs', '#breadcrumbs',
  '.form', 'form', '#search-form',
  'script', 'noscript'
];
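
To make the idea concrete, here is a minimal sketch of selector-based pruning. It does not use a real DOM or any project API: the node shape (`tag`, `id`, `classes`, `text`, `children`) and the helper names `matchesSelector` and `extractText` are assumptions made up for illustration, and only plain tag, `.class`, and `#id` selectors like the ones in the list above are handled.

```javascript
// Hypothetical node shape: { tag, id, classes, text, children }.
// Only simple selectors are supported: 'tag', '.class', '#id'.
function matchesSelector(node, selector) {
  if (selector.startsWith('.')) {
    return (node.classes || []).includes(selector.slice(1));
  }
  if (selector.startsWith('#')) {
    return node.id === selector.slice(1);
  }
  return node.tag === selector;
}

// Collect text from the tree, skipping any subtree whose root
// matches one of the exclude selectors.
function extractText(node, exclude) {
  if (exclude.some((sel) => matchesSelector(node, sel))) {
    return '';
  }
  const own = node.text || '';
  const nested = (node.children || []).map((child) => extractText(child, exclude));
  return [own, ...nested].filter(Boolean).join('\n');
}

// Toy page: a header with a language selector, a main section, a footer.
const page = {
  tag: 'body',
  children: [
    {
      tag: 'header',
      children: [{ tag: 'div', classes: ['lang-selector'], text: 'Select Language' }],
    },
    {
      tag: 'main',
      id: 'main-content',
      children: [{ tag: 'p', text: 'Core article text.' }],
    },
    { tag: 'footer', text: 'Copyright notice' },
  ],
};

const sampleExclude = ['header', 'footer', '.lang-selector', 'nav', '#sidebar'];
const cleaned = extractText(page, sampleExclude);
console.log(cleaned); // prints "Core article text."
```

With a real HTML parser, the same list could be applied by selecting each entry and removing the matched elements before the Markdown conversion. Broad class names such as `.top` or `.side` are where the risk lies, since a site may use them on genuine content.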

calebpeffer · 2024-04-16T19:38:56Z

So, we've defaulted towards removing less because (like you said) highly opinionated removal is risky, and it's easy to do further cleaning on the output with regex.

I like the idea of Readability as an option. Great suggestion!
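
As a rough illustration of the regex post-cleaning mentioned above, the sketch below drops the two noise lines from the sample output in this issue. The patterns and the `cleanMarkdown` helper are assumptions written for this example, not part of the project, and would need tuning per site.

```javascript
// Assumed noise patterns, based only on the sample output in this issue.
const noisePatterns = [
  /^\[Skip to main content\]\(#[^)]*\)\s*$/i, // skip-navigation link
  /^Select Language[A-Z]/, // flattened language dropdown (no space before the first option)
];

// Remove lines matching any noise pattern, then tidy leftover blank lines.
function cleanMarkdown(markdown) {
  return markdown
    .split('\n')
    .filter((line) => !noisePatterns.some((re) => re.test(line.trim())))
    .join('\n')
    .replace(/\n{3,}/g, '\n\n') // collapse runs of blank lines
    .trim();
}

const raw = [
  '[Skip to main content](#main-content)',
  '',
  'Select LanguageEnglishAfrikaansAlbanian',
  '',
  '# Page title',
  '',
  'Body text.',
].join('\n');

console.log(cleanMarkdown(raw));
```

This keeps the scraper conservative and pushes the opinionated decisions to the caller, at the cost of each caller maintaining its own patterns.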

nickscamara · 2024-04-18T01:45:36Z

@oliviermills thank you for this. Just merged an option to remove non-content tags (#14).

This is just a start and I think there is room for other improvements here.

nickscamara · 2024-04-18T01:47:24Z

Let me know if you have any feedback!

oliviermills · 2024-04-18T03:40:52Z

I suggest a cleaner function, per my PR #16. It's slightly less aggressive but needs integration testing (#15) to see if it affects the Markdown conversion. I checked turndown and the customizations in this code base, and they don't use style, so that should be OK.

nickscamara · 2024-04-18T16:59:08Z

Awesome, thanks @oliviermills! Will be checking it out soon.

rafaelsideguide · 2024-06-14T12:59:09Z

Closing this one (#273 solves this issue).

oliviermills changed the title ~~Strip non-content tags, headers, footers~~ [Feat] Strip non-content tags, headers, footers Apr 16, 2024

oliviermills mentioned this issue Apr 18, 2024

[Feat] issue #1 exclude tags (html clean-up) #16

Closed

nickscamara pushed a commit that referenced this issue May 24, 2024

Merge pull request #1 from mendableai/main

9663015

Fix FIRECRAWL_API_URL bug, also various PyLint fixes

mattjoyce mentioned this issue Jun 10, 2024

[BUG][SELF-HOST] Crawl requests generate Supabase error. #261

Closed

rafaelsideguide mentioned this issue Jun 12, 2024

[Feat] Add pageOptions.removeTags #273

Closed

nickscamara mentioned this issue Jun 13, 2024

Added pageOptions.removeTags #275

Merged

rafaelsideguide closed this as completed Jun 14, 2024
