Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move dns checking to dedicated class and add concurrency #92

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

nicolasassi
Copy link

implemented ideas discussed in #91.

Also moved all dns checking for find_urls and has_urls so all found urls could be check concurrently if the user needs.
I kept all intances of dns checking in the abstract methods for backwards compatibility but marked them as DEPRECATED and removed its effects. Maybe we could remove it all together in the future.

@lipoja lipoja self-requested a review June 19, 2021 20:59
@nicolasassi
Copy link
Author

I'm not really sure how depedencies work on this project, but I tried to add Pebble both to setup.py and to requirements.txt. Hope it fix the problem

@lipoja
Copy link
Owner

lipoja commented Jun 20, 2021

You should be able to run tox locally to test if everything works.
requirements.txt should be used to install all dependencies for development
setup.py should contain all basic dependencies necessary to run urlextract

urlextract/dns_check.py Outdated Show resolved Hide resolved
urlextract/dns_check.py Outdated Show resolved Hide resolved
urlextract/urlextract_core.py Outdated Show resolved Hide resolved
urlextract/dns_check.py Show resolved Hide resolved
@@ -745,13 +717,18 @@ def gen_urls(self, text, check_dns=False, get_indices=False):
# move cursor right after found TLD
tld_pos += len(tld) + offset

def find_urls(self, text, only_unique=False, check_dns=False, get_indices=False):
def find_urls(self, text, only_unique=False, check_dns=False, get_indices=False, timeout=None,
accept_on_timeout=False, max_workers=None, max_tasks=None):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am thinking here is, isn't there already too much parameters for this find function?
Do you think that it is not sufficient to let user set these parameters ahead by modifying properties of DNSCheck class

What is you opinion?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could write some setter methods for the dsn_checker class in the URLExtract class and the user could set then before running find_urls. But I would add to the docs of find_urls the default values for dns_checker in case the user sets check_dns = True and is not aware of it. What do you think?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for setter methods in dns_check class. Ideally all methods related to DNS checks should be in that class. (not directly in URLExtract class)
And yes, everything should be documented. And also everything should have reasonable default values.

urlextract/urlextract_core.py Outdated Show resolved Hide resolved
@lipoja
Copy link
Owner

lipoja commented Jun 20, 2021

Thank you for your time and effort working on this issue. I really appreciate that!
Let's have a discussion about this code and I believe that withing few iteration we can merge the code.

I did not had chance to review it all, I want to go deeper once I have more time.

FYI: Do you think we could also add some test for it? I did not thought about it much yet. But I think there should be some.

@nicolasassi
Copy link
Author

nicolasassi commented Jun 21, 2021

@lipoja I fixed most of what was asked. As some changes are still in discussion (like i find_urls should have some many args etc...) there are still changes to be made. Also, the tests are not passing.

For some reason with my alterations now the results of dns checking are not being saved in the cache (that's way the tests are not passing now) and I can't figure out why. Could you please give it a look?

Thanks!

@lipoja
Copy link
Owner

lipoja commented Jul 8, 2021

@nicolasassi Sorry for the delay, lack of time it is ... family comes first these days.

What I was able to determine is that your second commit is breaking the tests. So maybe we can dig deeper around that.
It is about moving cache to one function. But at the first glance I do not see anything wrong. It needs more debugging.

I will look on when I save some time (might be on weekend, but no promises).

@lipoja
Copy link
Owner

lipoja commented Jul 11, 2021

@nicolasassi OK I found it and fixed it:
Method socket.gethostbyname(host) accept hosts ... unfortunately there should be some pre-processing since you are working with URLs and not with hosts only. You have to modify _get_hots() to this:

    def _get_host(self, host: str):
        """
         Get the IP address from a given host
        :param str host: the host to get IP from
        :return: A tuple with the given host and its IP address (a string of the form '255.255.255.255') if found
        (e.g: host.com, '255.255.255.255')
        :rtype: tuple
        """
        tmp_url = host
        scheme_pos = host.find('://')
        if scheme_pos == -1:
            tmp_url = 'http://' + host

        url_parts = uritools.urisplit(tmp_url)
        tmp_host = url_parts.gethost()

        if isinstance(tmp_host, ipaddress.IPv4Address):
            return host, tmp_host

        try:
            return host, socket.gethostbyname(tmp_host)
        ...

Thank you for your work, it looks like the parallel processing might work well... I want to review and test it more once this fix is in place.

@nicolasassi
Copy link
Author

@lipoja sorry for the delay I'm also kind busy these days. As soon as I have some time I'm gonna implement the changes you've suggested and add more tests. Thank you for your time!

@lipoja
Copy link
Owner

lipoja commented Aug 24, 2021

@nicolasassi Do you think I can take this over and finish that PR in case I have some time?

@nicolasassi
Copy link
Author

@lipoja sure! Life has been crazy and unfortunatelly time is sort... I still hope I can take some time to focus on this project and finish it, but just in case, if you have some time, feel free to take over.

Hope my contribuition already helped somehow and feel free to @ me if you need some help on this or anything in the future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants