Handle embedded in HTML files #11

obsessedcake · 2024-07-01T19:37:46Z

@PxINKY found that Gumroad lets creators embed files (videos/images) directly into the page. When that happens, download_url is removed from the item object.

To overcome this we can extract download URLs from raw HTML (like we did before in f30e7ef) and do something with them.

I've made a small utility method to extract all files from raw HTML.

    def _get_all_files(self, soup: BeautifulSoup) -> dict[str, str]:
        # Map "<name>.<type>" to the absolute download URL of each file entry.
        raw_files: dict[str, str] = {}
        for file in soup.find_all("div", attrs={"role": "treeitem"}, class_="js-file-list-element"):
            file_type = file.select_one("li:nth-child(1)").string.lower()
            file_name = file.select_one("h4").string
            # select_one() takes a CSS selector, so the href requirement goes in the selector.
            file_url = self._session.base_url + file.select_one("a[href]")["href"]

            raw_files[f"{file_name}.{file_type}"] = file_url

        return raw_files

{file_name}.{file_type} is not the best key here, because several files can have the same name in different folders, and a later entry would overwrite an earlier one in the dict.

If I remember correctly, each file should have a unique id in JSON. If they use this "uid" in HTML then it's an easy go, otherwise meh...
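
Until it's clear whether the HTML exposes that id, one possible stopgap is to keep a list of URLs per name, so duplicates are kept instead of overwritten. This is only a rough sketch: index_files and the sample entries are hypothetical, and it takes already-extracted (name, type, url) tuples rather than a real Gumroad page.

```python
from collections import defaultdict


def index_files(entries: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """Group download URLs by "<name>.<type>" so same-named files are all kept."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for file_name, file_type, file_url in entries:
        grouped[f"{file_name}.{file_type}"].append(file_url)
    return dict(grouped)


# Hypothetical entries: two files named "cover.png" living in different folders.
entries = [
    ("cover", "png", "/r/abc"),
    ("cover", "png", "/r/def"),
    ("notes", "pdf", "/r/ghi"),
]
files = index_files(entries)
```

Here files["cover.png"] holds both URLs in page order. That still doesn't say which folder each one came from, so a real id would be the better key if one exists.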
