Question about --log, --no-dupes and --skip #585
-
First of all, amazing tool! After you told me about the missing ffmpeg, it worked like a charm. Kudos and thanks to you! I am using: `python -m bdfr download ./lolo --subreddit "MemeVideos" --no-dupes --time "day" --skip "jpeg, gif, png" -L 50 --file-scheme "{POSTID}"`. What I am doing, though, is manually deleting some videos, and I need a way to prevent them from re-downloading once I've deleted them. Is there a way to use `--log` to prevent re-downloading items which I downloaded at some point? Because if I have to keep the files around just to prevent them from re-downloading, my HDD would fill up quite quickly :(
Appreciate you, thanks!
-
For just such an occasion, we have a folder of scripts that will read the logfile generated by each run and extract all IDs from that run: either all successfully downloaded IDs (so they aren't retried) or the failed ones (so you don't waste time reattempting those, if that's what you want). This is the best way to prevent duplicates. The scripts are available here and come in bash and PowerShell, so they'll work on whatever system you have. Use them with the `--exclude-id-file` option.
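If you're curious what those scripts do, here's a minimal Python sketch of the same idea (not the repo's actual script). It assumes the logfile contains lines along the lines of `Downloaded submission <ID> from <subreddit>`; check your own log for the exact wording before relying on it:

```python
import re
import sys

# Hypothetical sketch, not the repo's actual script: pull the IDs of
# successfully downloaded submissions out of a BDFR logfile.
# Assumes log lines like "... Downloaded submission abc123 from ..."
PATTERN = re.compile(r"Downloaded submission (\w+) from")

def extract_ids(log_path: str) -> list[str]:
    ids = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                ids.append(match.group(1))
    return ids

if __name__ == "__main__":
    # Usage: python extract_ids.py log_output.txt > downloaded_ids.txt
    for submission_id in extract_ids(sys.argv[1]):
        print(submission_id)
```

You'd then pass the resulting file to the next run so those IDs are skipped, e.g. `python -m bdfr download ./lolo --subreddit "MemeVideos" --exclude-id-file downloaded_ids.txt ...`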
So `--log` only specifies where the logfile should be saved. This is only really useful if you're running multiple instances of the BDFR at the same time, since multiple processes accessing the same file will crash on Windows, and write gibberish on Linux and Mac.

The `--no-dupes` option is only there to prevent saving the same image on the same run. What happens is that we calculate the hash of the image once it's been downloaded, and check to see if the hash exists in memory. This only gets hashes from the current run, so it's of limited use by itself. It was originally paired with the `--search-existing` option, which will scan the destination and compute all the hashes for every file, but t…
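To make the "hashes in memory" point concrete, here's a minimal sketch of that dedup technique. This is not BDFR's actual code; the MD5 choice and the function names are just for illustration:

```python
import hashlib
from pathlib import Path

# Minimal sketch of in-memory, per-run dedup; not BDFR's actual code.
seen_hashes: set[str] = set()

def is_duplicate(file_path: Path) -> bool:
    """Hash a downloaded file and report whether we've seen it this run."""
    digest = hashlib.md5(file_path.read_bytes()).hexdigest()
    if digest in seen_hashes:
        return True  # same content already downloaded during this run
    seen_hashes.add(digest)
    return False

def seed_from_destination(destination: Path) -> None:
    """A --search-existing-style pre-scan: seed the set from files on disk."""
    for existing in destination.rglob("*"):
        if existing.is_file():
            seen_hashes.add(hashlib.md5(existing.read_bytes()).hexdigest())
```

Since `seen_hashes` lives only in memory, it starts empty on every run, which is why `--no-dupes` alone can't prevent re-downloads across runs unless something like `--search-existing` seeds it from disk first.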