RegExp not works #107

Open
ZeusFSX opened this issue Apr 27, 2024 · 2 comments


ZeusFSX commented Apr 27, 2024

Hi, I'm using your service and followed all the steps, but when I run extract, the URL regexes I wrote in the config file do not match the URLs. The logs show the error below, yet when I match the same pattern manually in Python, everything works:

My config file:

{
    "extractors_path": "./extractors",
    "routes": [
        {
            "regexes": ["^https://\\w*\\.{0,1}rozetka\\.com\\.ua/[^/]+/p\\d+/$", "^https://\\w*\\.{0,1}rozetka\\.com\\.ua/ua/[^/]+/p\\d+/$"],
            "extractors": [{
                "name": "rozetka_extractor",
                "since": "2023-01-01"
            }]
        }
    ]
}

Here are the logs:

2024-04-27 14:13:23,071 - synchronized.py:64 : ERROR - Failed to process https://rozetka.com.ua/88779405/p88779405/ with No route found for url: https://rozetka.com.ua/88779405/p88779405/ -> ADD_INFO: filename='crawl-data/CC-MAIN-2022-33/segments/1659882572043.2/warc/CC-MAIN-20220814143522-20220814173522-00500.warc.gz' url='https://rozetka.com.ua/88779405/p88779405/' offset=448027058 length=51078 digest='6VJW4LQ4VNDCUXRSKSYATPGJDRNHBJG' encoding='UTF-8' timestamp=datetime.datetime(2022, 8, 14, 15, 29, 3)

But when I test it manually in Python, the pattern matches:

>>> import re
>>> re.match(r"^https://\w*\.{0,1}rozetka\.com\.ua/[^/]+/p\d+/$", "https://rozetka.com.ua/88779405/p88779405/")
<re.Match object; span=(0, 42), match='https://rozetka.com.ua/88779405/p88779405/'>

ZeusFSX commented Apr 27, 2024

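Oh, I see my mistake: the regex matches, but the date does not. The record's timestamp (2022-08-14) is earlier than the extractor's "since" date (2023-01-01). Could you update the log message for this case? "No route found" is not informative when the URL matched and only the date check failed.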
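
To illustrate the cause, here is a minimal sketch. It assumes a route is accepted only when a regex matches the URL and the record's timestamp is not earlier than the extractor's `since` date. The `ROUTE` dict and the `explain_route` helper are hypothetical names made up for this example; they are not the project's actual code or API.

```python
import re
from datetime import datetime

# Hypothetical, simplified model of one route from the config above.
ROUTE = {
    "regexes": [r"^https://\w*\.{0,1}rozetka\.com\.ua/[^/]+/p\d+/$"],
    "since": datetime(2023, 1, 1),
}


def explain_route(url: str, timestamp: datetime, route: dict = ROUTE) -> str:
    """Return 'ok', or a message saying which check rejected the record."""
    if not any(re.match(pattern, url) for pattern in route["regexes"]):
        return "no regex matched the URL"
    if timestamp < route["since"]:
        return (
            f"URL matched, but timestamp {timestamp:%Y-%m-%d} "
            f"is before since={route['since']:%Y-%m-%d}"
        )
    return "ok"


# The record from the log: the URL matches, but it was crawled in 2022.
print(explain_route(
    "https://rozetka.com.ua/88779405/p88779405/",
    datetime(2022, 8, 14, 15, 29, 3),
))
```

A message that names the failed check, as in this sketch, would have pointed at the date rather than the regex.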

hynky1999 (Owner) commented

Great that you managed to resolve your issue. I will take a look at logging, and see what is possible to do to prevent this problem from happening :)
