Some URLs can no longer be found (404 Not Found). #3
Comments
@fwang91 The URLs return "404". Would you be willing to provide the original images? |
I have also found this problem. The URLs were crawled last year, and some of them may have been blocked since then. We cannot send the original images directly to the face recognition community because of copyright issues. If there is a better way to release the dataset and avoid the copyright issues, please contact me. |
Many links cannot be retrieved: sometimes 404 Not Found, sometimes a timeout. |
@fwang91 So why don't you upload the whole clean dataset (as .jpg files) and the corresponding landmarks to Baidu Netdisk? |
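To tell the two failure modes mentioned above apart (404 versus timeout), a download script could label each failed request. The following is only a minimal sketch using Python's standard `urllib`; the function names `classify_failure` and `check_url`, the status labels, and the 10-second timeout are assumptions for illustration and not part of the dataset's tooling.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception raised while fetching a URL to a coarse label."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return "not_found" if exc.code == 404 else "http_error"
    if isinstance(exc, urllib.error.URLError):
        # A connection timeout is usually wrapped in URLError.reason.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    # A timeout while reading the body is raised directly.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    return "other"


def check_url(url, timeout=10.0):
    """Try to fetch `url`; return 'ok' or one of the failure labels."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return "ok"
    except (urllib.error.URLError, OSError, ValueError) as exc:
        return classify_failure(exc)
```

Links labelled `timeout` may be worth retrying later, whereas `not_found` links are more likely gone for good.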
@hustzeyu we do not have the copyright of the original images. |
@fwang91 I think it's OK if we only use it for academic purposes? |
Hmmm ... we have already used the MS1M dataset for academic purposes, so is the other part a problem? |
It is definitely not okay just because it is for academic purposes. Remember "academic" is another word for "commercial" since universities use research to get grant money, sell licenses, etc. Even if it weren't recognized as "commercial", IMDb states that anything non-personal needs explicit permission. The IMDb Conditions of Use explicitly state:
And further:
And to get consent:
So that really means that using this dataset (which requires you to write a script to download the images from the scraped URLs) for anything non-personal (including academic and commercial use) requires explicit written permission if you want to stay above board. Otherwise you are open to a lawsuit from Amazon (which owns IMDb). In other words, none of us can rehost the images without committing copyright infringement, and fixing the URLs likely requires more effort than I think it is worth for @fwang91 et al., even if the URLs have simply changed. |
@seesemichaelj thanks for your detailed explanation. |
Looking forward to updated URLs. |
@zhongyy Hi, could you tell me the number of invalid URLs? |
@ Thanks for your attention. We have downloaded most of the dataset and we are making a list of the invalid URLs. May I email it to you at "[email protected]"? |
Yes. Thanks a lot. |
Here you can find the list of URLs that returned a 404 error for me: |
Unfortunately one year later out of 1,662,888 URLs I was able to download 1,180,173, so there are some 482k invalid URLs... I could generate a list of the invalid URLs, but given that it's almost a third of the total this may no longer be useful. As an example, most of the Tom_Paolino images are not available. That's subject ID nm0660057. |
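The counts quoted above can be checked with a few lines of arithmetic. This is just a quick sketch; the variable names are made up for illustration, and the two totals are the ones reported in the comment.

```python
# Totals as reported in the comment above.
total_urls = 1_662_888
downloaded = 1_180_173

# URLs that could not be downloaded, and their share of the whole list.
invalid = total_urls - downloaded
fraction = invalid / total_urls

print(f"{invalid:,} invalid URLs ({fraction:.1%} of the total)")
# prints: 482,715 invalid URLs (29.0% of the total)
```

So roughly 29% of the list is unavailable, which matches the "almost a third" estimate.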
I found about 300 images returning "404 Not Found" when downloading the first 2000 images in the .csv file. Besides, some links are duplicated. We are interested in your database and are trying to use it in research. Would you be willing to provide the original images?
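For anyone who wants to check the duplicated links before downloading, here is a minimal sketch. It assumes the URLs have already been read out of the .csv file into a list of strings (the column layout of the file is not described in this thread), and the helper name `find_duplicates` is hypothetical.

```python
from collections import Counter


def find_duplicates(urls):
    """Return a dict mapping each repeated URL to how often it occurs."""
    # Strip surrounding whitespace so trailing newlines do not hide repeats.
    counts = Counter(url.strip() for url in urls)
    return {url: n for url, n in counts.items() if n > 1}


# Made-up sample URLs, for illustration only.
sample = [
    "http://example.invalid/a.jpg",
    "http://example.invalid/b.jpg",
    "http://example.invalid/a.jpg\n",
]
print(find_duplicates(sample))
# prints: {'http://example.invalid/a.jpg': 2}
```

Downloading each distinct URL only once also avoids counting the same 404 several times.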