New source: Noveldeglace #2275

Closed
wants to merge 69 commits
Commits
0f01815
- there is no webp package
pmosko-fp Jan 31, 2024
6144e33
Merge pull request #2248 from pmosko/update_termux_doc
dipu-bd Feb 9, 2024
2195337
Generate source index
dipu-bd Dec 27, 2023
85f1196
Update VERSION
dipu-bd Dec 27, 2023
0501131
Generate source index
dipu-bd Dec 27, 2023
2a76b73
new domain for 69shuba
nd2024 Jan 1, 2024
68fb6cf
Update 69shuba.py
dipu-bd Jan 1, 2024
f431000
Generate source index
dipu-bd Jan 1, 2024
28be9f5
Fix source "Coffeemanga"
CryZFix Jan 11, 2024
ce0ae3f
changed links
CryZFix Jan 11, 2024
0301e60
Fixed Bato and etc
CryZFix Jan 11, 2024
0715cab
Generate source index
dipu-bd Jan 14, 2024
06432e8
add tigertranslations.py as new english source
Campiotti Oct 23, 2022
4df5842
update tigertranslations.py: improve metadata, remove annoying texts,…
Jan 16, 2024
712b055
Generate source index
dipu-bd Jan 21, 2024
8971930
update lightnovelreader.py URL for readlightnovel.app to readlitenove…
Jan 21, 2024
d1229cc
Generate source index
dipu-bd Feb 9, 2024
3b5ab67
add faqwiki.py source (https://faqwiki.us/)
Jan 21, 2024
160a099
faqwiki: fix downloads for novels with missing cover img
Feb 1, 2024
cde1d91
faqwiki: fix downloads for novels with chapters in chronological order
Feb 4, 2024
e44b931
faqwiki: remove conditional chapter reversal & improve cover image do…
Feb 6, 2024
c9de252
Generate source index
dipu-bd Feb 9, 2024
95f5fad
add webfic multilingual source (https://www.webfic.com)
Jan 22, 2024
9d1c9cd
define crawler language dynamically
Jan 28, 2024
bf3bb20
Generate source index
dipu-bd Feb 9, 2024
73c6eff
fixed updated
SirGryphin Feb 1, 2024
6b2f226
removed commented code
SirGryphin Feb 1, 2024
275f606
Generate source index
dipu-bd Feb 9, 2024
0c51e4f
Update royalroad.py
needKVAS Feb 2, 2024
9afdc7f
Update royalroad.py
needKVAS Feb 3, 2024
d088355
Generate source index
dipu-bd Feb 9, 2024
81132de
fix tw.m.ixdzs.com & www.aixdzs.com sources (now redirect to new domain)
Feb 4, 2024
fdafc98
Generate source index
dipu-bd Feb 9, 2024
e103fae
cleanup and fix 69shuba / 69shu / 69xinshu
Feb 5, 2024
ca9f56f
69shuba: auto-fix chapter indexing, fix issue with getting > 4.3k cha…
Feb 5, 2024
b0dfcc8
Update 69shuba.py
dipu-bd Feb 9, 2024
a6d8c32
bato: fix empty chapters
Feb 5, 2024
9b1bde5
add luminarynovels as new source (based on MadaraTemplate)
Feb 5, 2024
d95a5d7
fix mangabuddy chapter downloading
Feb 6, 2024
0650775
Generate source index
dipu-bd Feb 9, 2024
505cf49
fix logging call missing string template in msg in app.py
Feb 10, 2024
834059b
replace all deprecated logger.warn calls with logger.warning
Feb 10, 2024
dc3c5c9
Generate source index
dipu-bd Feb 12, 2024
5052306
fix syosetu's new pages
NilanEkanayake Jan 26, 2024
5641aec
fix fanstrans
NilanEkanayake Jan 26, 2024
f98adc6
removed niche use-case
NilanEkanayake Jan 26, 2024
eca156c
Update syosetu.py
dipu-bd Feb 9, 2024
4ab7086
Fix lint errors
dipu-bd Feb 9, 2024
a13a346
Fix lint errors
dipu-bd Feb 9, 2024
e8c6bf7
Fix duplicate URL
NilanEkanayake Feb 9, 2024
3379035
Generate source index
dipu-bd Feb 12, 2024
8cdd268
add wtrlab multilingual source (https://wtr-lab.com/)
Jan 28, 2024
f28f785
move wtrlab into multilingual sources, add dynamic language assignment
Jan 28, 2024
4e0f56b
cleanup wtrlab
Feb 9, 2024
419451f
Generate source index
dipu-bd Feb 12, 2024
0097b7d
UukanshuOnline: fix URL & rename file
Feb 9, 2024
64d485f
UukanshuOnline: add support for www and tw subdomains (traditional & …
Feb 9, 2024
9ba5d83
Generate source index
dipu-bd Feb 12, 2024
c307213
add nyxtranslation as a new en source
Feb 12, 2024
e22e590
Generate source index
dipu-bd Feb 16, 2024
9925bdb
First non working WIP version of NDG scraper
Vuizur Feb 21, 2024
65a8a31
Try to extract novel meta details
Vuizur Feb 21, 2024
8662b58
Better approach (WIP)
Vuizur Feb 21, 2024
20527e9
Working prototype
Vuizur Feb 21, 2024
03049ae
Remove unneeded stuff + fix title and author
Vuizur Feb 22, 2024
cff6e7f
Fix tomes split by arcs
Vuizur Feb 22, 2024
1fd1af9
Delete unneeded comments
Vuizur Feb 22, 2024
dbd0980
Fix crash with future chapters and unneeded titles
Vuizur Feb 22, 2024
1fc2d8b
Remove mistake caption
Vuizur Feb 22, 2024
12 changes: 11 additions & 1 deletion .github/contribs.json
@@ -89,5 +89,15 @@
"Neory Dominise": null,
"[email protected]": null,
"HeliosLHC": "HeliosLHC",
"[email protected]": "HeliosLHC"
"[email protected]": "HeliosLHC",
"alzamer2": "alzamer2",
"[email protected]": "alzamer2",
"Unknown404": null,
"[email protected]": null,
"ACA": null,
"[email protected]": null,
"Campiotti": null,
"[email protected]": null,
"Nilan Ekanayake": null,
"[email protected]": null
}
866 changes: 463 additions & 403 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion lncrawl/VERSION
@@ -1 +1 @@
3.4.1
3.4.2
2 changes: 1 addition & 1 deletion lncrawl/bots/telegram/__init__.py
@@ -131,7 +131,7 @@ def start(self):

async def error_handler(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Log Errors caused by Updates."""
logger.warn("Error: %s\nCaused by: %s", context.error, update)
logger.warning("Error: %s\nCaused by: %s", context.error, update)

async def show_help(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
await update.message.reply_text("Send /start to create new session.\n")
2 changes: 1 addition & 1 deletion lncrawl/core/app.py
@@ -240,7 +240,7 @@ def compress_books(self, archive_singles=False):
format="zip",
root_dir=root_dir,
)
logger.info("Compressed:", os.path.basename(archived_file))
logger.info("Compressed: %s", os.path.basename(archived_file))

if archived_file:
self.archived_outputs.append(archived_file)
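
The logging fixes above follow two rules of the standard logging module: messages use lazy %-style placeholders so arguments are only interpolated when a record is actually emitted, and logger.warning() replaces the deprecated logger.warn() alias. A minimal standalone sketch, not part of the diff, with an illustrative path:

    import logging
    import os

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    archived_file = "/tmp/example.zip"  # hypothetical path, for illustration only

    # Broken: no placeholder, so the extra argument is never interpolated and
    # the logging module prints a "--- Logging error ---" traceback instead.
    logger.info("Compressed:", os.path.basename(archived_file))

    # Fixed: the %s placeholder is filled in lazily when the record is emitted.
    logger.info("Compressed: %s", os.path.basename(archived_file))

    # logger.warn() is a deprecated alias; logger.warning() is the supported name.
    logger.warning("Could not download latest index. Error: %s", "timeout")
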
10 changes: 5 additions & 5 deletions lncrawl/core/sources.py
@@ -137,7 +137,7 @@ def __load_latest_index():
except Exception as e:
if "crawlers" not in __current_index:
raise LNException("Could not fetch sources index")
logger.warn("Could not download latest index. Error: %s", e)
logger.warning("Could not download latest index. Error: %s", e)
__latest_index = __current_index


@@ -223,7 +223,7 @@ def __download_sources():
try:
__save_source_data(sid, data)
except Exception as e:
logger.warn("Failed to save source file. Error: %s", e)
logger.warning("Failed to save source file. Error: %s", e)


# --------------------------------------------------------------------------- #
@@ -248,7 +248,7 @@ def __import_crawlers(file_path: Path) -> List[Type[Crawler]]:
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
except Exception as e:
logger.warn("Module load failed: %s | %s", file_path, e)
logger.warning("Module load failed: %s | %s", file_path, e)
return []

language_code = ""
@@ -296,7 +296,7 @@ def __add_crawlers_from_path(path: Path):
return

if not path.exists():
logger.warn("Path does not exists: %s", path)
logger.warning("Path does not exists: %s", path)
return

if path.is_dir():
@@ -312,7 +312,7 @@ def __add_crawlers_from_path(path: Path):
for url in getattr(crawler, "base_url"):
crawler_list[url] = crawler
except Exception as e:
logger.warn("Could not load crawlers from %s. Error: %s", path, e)
logger.warning("Could not load crawlers from %s. Error: %s", path, e)


# --------------------------------------------------------------------------- #
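
The __import_crawlers code touched above loads each downloaded source file as a Python module at runtime. A hedged sketch of that importlib pattern, using a placeholder path rather than lncrawl's real source directory:

    import importlib.util
    from pathlib import Path

    file_path = Path("sources/en/f/faqwiki.py")  # placeholder path for illustration

    # Build a module spec from the file location and execute it, as sources.py does.
    spec = importlib.util.spec_from_file_location(file_path.stem, str(file_path))
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception as e:
        # Mirrors the logger.warning("Module load failed: ...") call in the diff.
        print("Module load failed: %s | %s" % (file_path, e))
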
8 changes: 4 additions & 4 deletions lncrawl/templates/browser/general.py
@@ -25,13 +25,13 @@ def read_novel_info_in_scraper(self) -> None:
try:
self.novel_cover = self.parse_cover(soup)
except Exception as e:
logger.warn("Failed to parse novel cover | %s", e)
logger.warning("Failed to parse novel cover | %s", e)

try:
authors = set(list(self.parse_authors(soup)))
self.novel_author = ", ".join(authors)
except Exception as e:
logger.warn("Failed to parse novel authors | %s", e)
logger.warning("Failed to parse novel authors | %s", e)

for item in self.parse_chapter_list(soup):
if isinstance(item, Chapter):
@@ -51,13 +51,13 @@ def read_novel_info_in_browser(self) -> None:
try:
self.novel_cover = self.parse_cover_in_browser()
except Exception as e:
logger.warn("Failed to parse novel cover | %s", e)
logger.warning("Failed to parse novel cover | %s", e)

try:
authors = set(list(self.parse_authors_in_browser()))
self.novel_author = ", ".join(authors)
except Exception as e:
logger.warn("Failed to parse novel authors | %s", e)
logger.warning("Failed to parse novel authors | %s", e)

for item in self.parse_chapter_list_in_browser():
if isinstance(item, Chapter):
4 changes: 2 additions & 2 deletions lncrawl/templates/soup/general.py
@@ -23,13 +23,13 @@ def read_novel_info(self) -> None:
try:
self.novel_cover = self.parse_cover(soup)
except Exception as e:
logger.warn("Failed to parse novel cover | %s", e)
logger.warning("Failed to parse novel cover | %s", e)

try:
authors = set(list(self.parse_authors(soup)))
self.novel_author = ", ".join(authors)
except Exception as e:
logger.warn("Failed to parse novel authors | %s", e)
logger.warning("Failed to parse novel authors | %s", e)

for item in self.parse_chapter_list(soup):
if isinstance(item, Chapter):
2 changes: 1 addition & 1 deletion lncrawl/utils/pbincli.py
@@ -23,7 +23,7 @@ class PBinCLIException(Exception):


def PBinCLIError(message):
logger.warn("PBinCLI Error: {}".format(message))
logger.warning("PBinCLI Error: {}".format(message))


def path_leaf(path):
2 changes: 1 addition & 1 deletion sources/_index.json

Large diffs are not rendered by default.

17 changes: 12 additions & 5 deletions sources/en/b/bato.py
@@ -139,10 +139,10 @@ def read_novel_info(self):

def download_chapter_body(self, chapter):
soup = self.get_soup(chapter["url"])
soup = soup.find("script", string=re.compile(r"const imgHttpLis = \["))
soup = soup.find("script", string=re.compile(r"const imgHttps = \["))

img_list = json.loads(
re.search(r"const imgHttpLis = (.*);", soup.text).group(1)
re.search(r"const imgHttps = (.*);", soup.text).group(1)
)

bato_pass = decode_pass(
@@ -151,10 +151,17 @@ def download_chapter_body(self, chapter):

bato_word = re.search(r"const batoWord = (.*);", soup.text).group(1).strip('"')

# looks like some kind of "access" GET args that may be necessary, not always though
query_args = json.loads(decrypt(bato_word, bato_pass).decode())

image_urls = [
f'<img src="{img}?{args}">' for img, args in zip(img_list, query_args)
]
# so if it ends up empty or mismatches, just ignore it and return the img list instead
if len(query_args) != len(img_list):
image_urls = [
f'<img src="{img}" alt="img">' for img in img_list
]
else:
image_urls = [
f'<img src="{img}?{args}">' for img, args in zip(img_list, query_args)
]

return "<p>" + "</p><p>".join(image_urls) + "</p>"
8 changes: 4 additions & 4 deletions sources/en/c/coffeemanga.py
@@ -3,13 +3,12 @@
from lncrawl.core.crawler import Crawler

logger = logging.getLogger(__name__)
search_url = "https://coffeemanga.com/?s=%s&post_type=wp-manga"
chapter_list_url = "https://coffeemanga.com/wp-admin/admin-ajax.php"
search_url = "https://coffeemanga.io/?s=%s&post_type=wp-manga"


class CoffeeManga(Crawler):
has_manga = True
base_url = "https://coffeemanga.com/"
base_url = ["https://coffeemanga.io/"]

def search_novel(self, query):
query = query.lower().replace(" ", "+")
@@ -53,7 +52,8 @@ def read_novel_info(self):
)
logger.info("%s", self.novel_author)

for a in reversed(soup.select("ul.main li.wp-manga-chapter a")):
soup = self.post_soup(f"{self.novel_url}ajax/chapters/")
for a in reversed(soup.select("li.wp-manga-chapter a")):
chap_id = len(self.chapters) + 1
vol_id = len(self.chapters) // 100 + 1
if len(self.chapters) % 100 == 0:
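
The Coffeemanga fix fetches the chapter list from the Madara-style ajax/chapters/ endpoint instead of parsing it from the novel page. A rough standalone sketch of that request using requests and BeautifulSoup rather than lncrawl's get_soup/post_soup helpers; the novel URL below is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    novel_url = "https://coffeemanga.io/manga/example-title/"  # placeholder URL

    # The theme returns the chapter list when POSTing to <novel_url>ajax/chapters/.
    response = requests.post(novel_url + "ajax/chapters/", timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Chapters are listed newest-first, so reverse them into reading order.
    for index, a in enumerate(reversed(soup.select("li.wp-manga-chapter a")), start=1):
        print(index, a.text.strip(), a["href"])
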
2 changes: 1 addition & 1 deletion sources/en/d/dobelyuwai.py
@@ -32,7 +32,7 @@ def read_novel_info(self):
# try:
# self.novel_author = soup.select_one('div.entry-content > p:nth-child(2)').text.strip()
# except Exception as e:
# logger.warn('Failed to get novel auth. Error: %s', e)
# logger.warning('Failed to get novel auth. Error: %s', e)
# logger.info('%s', self.novel_author)

# Removes none TOC links from bottom of page.
3 changes: 3 additions & 0 deletions sources/en/f/fanstrans.py
@@ -26,6 +26,8 @@ def initialize(self) -> None:
r"^Get on Patreon",
r"^Check out other novels on Fan’s Translation~",
r"^to get Notification for latest Chapter Releases",
r"^Can’t wait to read more? Want to show your support? Click",
r"^to be a sponsor and get additional chapters ahead of time!",
]
)
self.cleaner.bad_tags.update(["a"])
@@ -36,6 +38,7 @@ class FansTranslations(Crawler):

def initialize(self) -> None:
self.cleaner.bad_tags.update(["h3"])
self.init_executor(4)

def search_novel(self, query):
query = query.lower().replace(" ", "+")
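
The Fan's Translation changes extend a blacklist of regular expressions that the cleaner uses to strip promotional lines from chapter text. A simplified sketch of that filtering idea, independent of lncrawl's cleaner API and with made-up paragraph content:

    import re

    # Two of the patterns added in the diff above.
    bad_text_patterns = [
        re.compile(r"^Get on Patreon"),
        re.compile(r"^to be a sponsor and get additional chapters ahead of time!"),
    ]

    paragraphs = [
        "Chapter text goes here.",            # hypothetical chapter content
        "Get on Patreon for early access!",   # promotional line to be dropped
    ]
    kept = [p for p in paragraphs if not any(rx.search(p) for rx in bad_text_patterns)]
    print(kept)  # -> ['Chapter text goes here.']
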
143 changes: 143 additions & 0 deletions sources/en/f/faqwiki.py
@@ -0,0 +1,143 @@
# -*- coding: utf-8 -*-
import logging

from bs4.element import Tag
from lncrawl.core.crawler import Crawler
from lncrawl.models import Volume, Chapter, SearchResult

logger = logging.getLogger(__name__)


class FaqWiki(Crawler):
base_url = ["https://faqwiki.us/"]
has_manga = False
has_mtl = True

def initialize(self) -> None:
# There's about 4+ ads as img tags within each chapter.
# Have not yet seen an img be part of any chapter, worst case we'll miss out on it.
self.cleaner.bad_tags.add("img")

def read_novel_info(self):
soup = self.get_soup(self.novel_url)

content = soup.select_one(".entry-content")

entry_title = soup.select_one("h1.entry-title")
assert isinstance(entry_title, Tag) # this must be here, is part of normal site structure/framework
self.novel_title = entry_title.text.strip()
# remove suffix from completed novels' title
if self.novel_title.endswith(" – All Chapters"):
self.novel_title = self.novel_title[0:self.novel_title.find(" – All Chapters")]
self.novel_author = "FaqWiki"
cover = content.select_one('.wp-block-image img')
# is missing in some rarer cases
if cover:
src = str(cover['src'])
# may be replaced with JS after load, in such case try and get the real img hidden in data-values
if src.startswith("data:"):
try:
src = cover["data-ezsrc"]
except KeyError:
pass
self.novel_cover = self.absolute_url(src)
# remove any optimized image size GET args from novel cover URL
if self.novel_cover and "?" in self.novel_cover:
self.novel_cover = self.novel_cover[0:self.novel_cover.find("?")]

metadata_container = soup.select_one("div.book-review-block__meta-item-value")
keywords = {
"desc": "Description:",
"alt_name": "Alternate Names:",
"genre": "Genre:",
"author": "Author(s):",
"status": "Status:",
"original_pub": "Original Publisher:"
}

if metadata_container:
metadata = metadata_container.text # doesn't have line breaks anyway so not splitting here
pos_dict = {}
for key, sep in keywords.items():
pos_dict[key + "_start"] = metadata.find(sep)
pos_dict[key] = metadata.find(sep) + len(sep)

self.novel_synopsis = metadata[pos_dict["desc"]:pos_dict["alt_name_start"]].strip()
self.novel_tags = metadata[pos_dict["genre"]:pos_dict["author_start"]].strip().split(" ")
self.novel_author = metadata[pos_dict["author"]:pos_dict["status_start"]].strip()

logger.info("Novel title: %s", self.novel_title)
logger.info("Novel synopsis: %s", self.novel_synopsis)
logger.info("Novel tags: %s", ",".join(self.novel_tags))
logger.info("Novel author: %s", self.novel_author)
logger.info("Novel cover: %s", self.novel_cover)

chap_list = soup.select_one('#lcp_instance_0').select("li>a")

for idx, a in enumerate(chap_list):
if "chapter" not in a.text.lower():
continue
chap_id = 1 + idx
vol_id = 1 + len(self.chapters) // 100
vol_title = f"Volume {vol_id}"
if chap_id % 100 == 1:
self.volumes.append(
Volume(
id=vol_id,
title=vol_title
))

# chapter name is only (sometimes) present in chapter page, not in overview
entry_title = f"Chapter {chap_id}"

self.chapters.append(
Chapter(
id=chap_id,
url=self.absolute_url(a["href"]),
title=entry_title,
volume=vol_id,
volume_title=vol_title
),
)

def download_chapter_body(self, chapter):
soup = self.get_soup(chapter.url)

contents_html = soup.select_one("div.entry-content")
contents_html = self.cleaner.clean_contents(contents_html)
contents_str = self.cleaner.extract_contents(contents_html)

return contents_str

def search_novel(self, query: str):
novel_selector = "article > div > header > h3.entry-title > a"
next_selector = "div.nav-links > a.next"

soup = self.get_soup(f"https://faqwiki.us/?s={query.replace(' ','+')}&post_type=page")
empty = "nothing found" in soup.select_one("h1.page-title").text.strip().lower()
if empty:
return []

novels = soup.select(novel_selector)

# loop over all pages via next button and get all novels
next_page = soup.select_one(next_selector)
while next_page:
page_soup = self.get_soup(self.absolute_url(next_page["href"]))
novels += page_soup.select(novel_selector)
next_page = page_soup.select_one(next_selector)

results = []
for novel in novels:
# filter out "fake" novels (links to All, completed & ongoing pages)
if "novels" in novel.text.lower():
pass
# simple but at least won't taint results
if query.lower() in novel.text.lower():
results.append(
SearchResult(
title=novel.text,
url=novel["href"]
)
)
return results
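
FaqWiki's metadata block is a single unbroken run of text, so read_novel_info above slices it by the offsets of each label. A self-contained sketch of that approach, with a made-up metadata string standing in for the scraped block:

    # Hypothetical metadata text, shaped like FaqWiki's single-line metadata block.
    metadata = (
        "Description: A fallen noble rebuilds his house. "
        "Alternate Names: Example Novel "
        "Genre: Action Fantasy "
        "Author(s): Example Author "
        "Status: Ongoing "
        "Original Publisher: Example Press"
    )
    keywords = {
        "desc": "Description:",
        "alt_name": "Alternate Names:",
        "genre": "Genre:",
        "author": "Author(s):",
        "status": "Status:",
        "original_pub": "Original Publisher:",
    }

    pos = {}
    for key, label in keywords.items():
        pos[key + "_start"] = metadata.find(label)    # where the label itself begins
        pos[key] = metadata.find(label) + len(label)  # where the value begins

    synopsis = metadata[pos["desc"]:pos["alt_name_start"]].strip()
    tags = metadata[pos["genre"]:pos["author_start"]].strip().split(" ")
    author = metadata[pos["author"]:pos["status_start"]].strip()
    print(synopsis)  # -> A fallen noble rebuilds his house.
    print(tags)      # -> ['Action', 'Fantasy']
    print(author)    # -> Example Author
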
7 changes: 3 additions & 4 deletions sources/en/i/isotls.py
@@ -26,7 +26,7 @@ def read_novel_info(self):
if possible_novel_author:
self.novel_author = possible_novel_author['content']

for a in soup.select('main section div:nth-child(2) ul li a'):
for a in soup.select('main section:nth-child(3) nav ul li a'):
chap_id = len(self.chapters) + 1
vol_id = len(self.chapters) // 100 + 1
if len(self.chapters) % 100 == 0:
@@ -41,6 +41,5 @@

def download_chapter_body(self, chapter):
soup = self.get_soup(chapter['url'])
contents = soup.select('article p')
body = [str(p) for p in contents if p.text.strip()]
return '<p>' + '</p><p>'.join(body) + '</p>'
contents = soup.select_one("div.content")
return self.cleaner.extract_contents(contents)
2 changes: 1 addition & 1 deletion sources/en/l/lightnovelreader.py
@@ -19,7 +19,7 @@ class LightnovelReader(Crawler):
"https://lnreader.org/",
"https://www.lnreader.org/",
"http://readlightnovel.online/",
"https://readlightnovel.app/",
"https://readlitenovel.com/",
]

def initialize(self) -> None: