move dns checking to dedicated class and add concurrency #92
base: master
I'm not really sure how dependencies work in this project, but I tried adding Pebble to both setup.py and requirements.txt. I hope that fixes the problem.
You should be able to run |
@@ -745,13 +717,18 @@ def gen_urls(self, text, check_dns=False, get_indices=False):
                 # move cursor right after found TLD
                 tld_pos += len(tld) + offset
 
-    def find_urls(self, text, only_unique=False, check_dns=False, get_indices=False):
+    def find_urls(self, text, only_unique=False, check_dns=False, get_indices=False, timeout=None,
+                  accept_on_timeout=False, max_workers=None, max_tasks=None):
What I am wondering here is whether this find function already has too many parameters.
Would it not be sufficient to let the user set these parameters ahead of time by modifying properties of the DNSCheck class?
What is your opinion?
We could write some setter methods for the dns_checker class in the URLExtract class, and the user could set them before running find_urls. But I would add the default values for dns_checker to the docs of find_urls, in case the user sets check_dns=True and is not aware of them. What do you think?
I vote for setter methods in the dns_check class. Ideally, all methods related to DNS checks should be in that class (not directly in the URLExtract class).
And yes, everything should be documented, and everything should have reasonable default values.
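To make the setter idea concrete, here is a minimal sketch of what such a settings holder could look like. The class name `DNSCheckSettings`, its validation rules and the helper `describe_defaults` are hypothetical and are not taken from this PR; the default values simply mirror the keyword defaults in the diff above.

```python
from typing import Optional


class DNSCheckSettings:
    """Hypothetical settings holder; not the class implemented in this PR."""

    def __init__(self) -> None:
        # Assumed defaults, copied from the keyword defaults shown in the diff.
        self._timeout: Optional[float] = None
        self._accept_on_timeout: bool = False
        self._max_workers: Optional[int] = None

    @property
    def timeout(self) -> Optional[float]:
        """Seconds to wait for one DNS lookup; None means no limit."""
        return self._timeout

    @timeout.setter
    def timeout(self, value: Optional[float]) -> None:
        if value is not None and value <= 0:
            raise ValueError("timeout must be positive or None")
        self._timeout = value

    @property
    def accept_on_timeout(self) -> bool:
        """Whether a URL is kept when its lookup times out."""
        return self._accept_on_timeout

    @accept_on_timeout.setter
    def accept_on_timeout(self, value: bool) -> None:
        self._accept_on_timeout = bool(value)

    @property
    def max_workers(self) -> Optional[int]:
        """Upper bound on parallel lookups; None lets the pool decide."""
        return self._max_workers

    @max_workers.setter
    def max_workers(self, value: Optional[int]) -> None:
        if value is not None and value < 1:
            raise ValueError("max_workers must be at least 1 or None")
        self._max_workers = value

    def describe_defaults(self) -> str:
        """Return the current values as one line, e.g. for a docstring."""
        return (
            f"timeout={self.timeout}, "
            f"accept_on_timeout={self.accept_on_timeout}, "
            f"max_workers={self.max_workers}"
        )
```

With something like this, the user would configure the checker once (for example `settings.timeout = 2.5`) and `find_urls` would only need the existing `check_dns` flag.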
Thank you for your time and effort working on this issue. I really appreciate that! I have not had a chance to review it all yet; I want to go deeper once I have more time. FYI: do you think we could also add some tests for it? I have not thought about it much yet, but I think there should be some.
@lipoja I fixed most of what was asked. As some changes are still in discussion (like i For some reason, with my alterations the results of DNS checking are no longer being saved in the cache (that's why the tests are not passing now), and I can't figure out why. Could you please take a look? Thanks!
@nicolasassi Sorry for the delay, lack of time it is ... family comes first these days. What I was able to determine is that your second commit is breaking the tests, so maybe we can dig deeper around that. I will look into it when I find some time (might be on the weekend, but no promises).
@nicolasassi OK I found it and fixed it:

    def _get_host(self, host: str):
        """
        Get the IP address from a given host

        :param str host: the host to get IP from
        :return: A tuple with the given host and its IP address (a string of the form '255.255.255.255') if found
        (e.g: host.com, '255.255.255.255')
        :rtype: tuple
        """
        tmp_url = host
        scheme_pos = host.find('://')
        if scheme_pos == -1:
            tmp_url = 'http://' + host
        url_parts = uritools.urisplit(tmp_url)
        tmp_host = url_parts.gethost()
        if isinstance(tmp_host, ipaddress.IPv4Address):
            return host, tmp_host
        try:
            return host, socket.gethostbyname(tmp_host)
        ...

Thank you for your work, it looks like the parallel processing might work well... I want to review and test it more once this fix is in place.
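For anyone who wants to try the idea outside the library, below is a self-contained sketch of the same steps (add a scheme if missing, extract the host, short-circuit IP literals, otherwise resolve). It is not the code from this PR: it uses the standard library's `urllib.parse.urlsplit` instead of `uritools`, the name `resolve_host` and the injectable `resolver` argument are made up for illustration, and returning `None` on a failed lookup is an assumption, because the error handling is elided in the snippet above.

```python
import ipaddress
import socket
from typing import Callable, Optional, Tuple
from urllib.parse import urlsplit


def resolve_host(
    url: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Tuple[str, Optional[str]]:
    """Return (url, ip) for the host in `url`, or (url, None) if it cannot be resolved."""
    # urlsplit only recognises the host part when a scheme is present.
    tmp_url = url if "://" in url else "http://" + url
    hostname = urlsplit(tmp_url).hostname
    if not hostname:
        return url, None

    try:
        # An IP literal needs no DNS lookup at all.
        return url, str(ipaddress.ip_address(hostname))
    except ValueError:
        pass  # not an IP literal, fall through to the lookup

    try:
        return url, resolver(hostname)
    except (OSError, UnicodeError):
        # Assumption: a failed lookup is reported as None rather than raised.
        return url, None
```

Passing the resolver as an argument keeps the function testable without network access; production code would simply rely on the `socket.gethostbyname` default.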
@lipoja sorry for the delay, I'm also kind of busy these days. As soon as I have some time I'm going to implement the changes you've suggested and add more tests. Thank you for your time!
@nicolasassi Do you think I can take this over and finish that PR in case I have some time? |
@lipoja sure! Life has been crazy and unfortunately time is short... I still hope I can take some time to focus on this project and finish it, but just in case, if you have some time, feel free to take over. I hope my contribution has already helped somehow, and feel free to @ me if you need help with this or anything else in the future.
Implemented the ideas discussed in #91.
Also moved all DNS checking for find_urls and has_urls so that all found URLs can be checked concurrently if the user needs it. I kept all instances of DNS checking in the abstract methods for backwards compatibility, but marked them as DEPRECATED and removed their effects. Maybe we could remove them altogether in the future.
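As a rough illustration of the concurrent check described here, the sketch below resolves a batch of hosts in parallel. It is not this PR's implementation: the PR depends on Pebble, while this example swaps in the standard library's `concurrent.futures.ThreadPoolExecutor`. The function name `check_hosts` and the injectable `resolver` are hypothetical; `timeout`, `accept_on_timeout` and `max_workers` borrow their names from the new `find_urls` signature, and their exact semantics here are an assumption.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Iterable, Optional


def check_hosts(
    hosts: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
    timeout: Optional[float] = None,
    accept_on_timeout: bool = False,
    max_workers: Optional[int] = None,
) -> Dict[str, bool]:
    """Map each host to True (accepted) or False (rejected)."""
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every lookup first so they run in parallel.
        futures = {host: pool.submit(resolver, host) for host in hosts}
        for host, future in futures.items():
            try:
                # The timeout is counted from this call, not from submission.
                future.result(timeout=timeout)
                results[host] = True
            except FutureTimeout:
                # Must come before OSError: on Python 3.11+ this timeout
                # class is the builtin TimeoutError, an OSError subclass.
                results[host] = accept_on_timeout
            except OSError:
                # Lookup failed (socket.gaierror is an OSError subclass).
                results[host] = False
    # Threads cannot be killed, so leaving the `with` block still waits for
    # slow lookups to finish; the timeout only decides the verdict.
    return results
```

The last comment is the main design trade-off: a thread pool cannot abort a hanging lookup, which is presumably one reason to prefer a pool that supports per-task timeouts.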