Add download progress bar #697

Open: wants to merge 3 commits into base: development
35 changes: 32 additions & 3 deletions README.md
@@ -110,6 +110,8 @@ bdfr archive ./path/to/output --user reddituser --submitted --all-comments --com
bdfr archive ./path/to/output --subreddit all --format yaml -L 500 --folder-scheme ""
```

### YAML options

Alternatively, you can pass options through a YAML file.

```bash
@@ -128,16 +130,40 @@ subreddit:
- CityPorn
```

-would be equilavent to (take note that in YAML there is `file_scheme` instead of `file-scheme`):
would be equivalent to:

```bash
bdfr download ./path/to/output --skip mp4 --skip avi --file-scheme "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}" -L 10 -S top --subreddit EarthPorn --subreddit CityPorn
```

Any option that can be specified multiple times should be formatted like subreddit is above.
When the same option is specified both in the YAML file and as a command-line argument, the command-line
argument takes priority.
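
The precedence rule can be illustrated with a minimal sketch. It is not BDFR's actual implementation: `merge_options` and the dictionary representation are hypothetical, and it assumes that an option left unset on the command line arrives as `None` (or as an empty tuple for repeatable options).

```python
from typing import Any


def merge_options(yaml_opts: dict[str, Any], cli_opts: dict[str, Any]) -> dict[str, Any]:
    """Combine two option sources; values given on the command line win."""
    merged = dict(yaml_opts)  # start from the YAML file's values
    for key, value in cli_opts.items():
        # Assumption: None or an empty tuple means "not given on the command line".
        if value is None or value == ():
            continue
        merged[key] = value
    return merged


yaml_opts = {"limit": 10, "sort": "top", "subreddit": ["EarthPorn", "CityPorn"]}
cli_opts = {"limit": 50, "sort": None, "subreddit": ()}
print(merge_options(yaml_opts, cli_opts))
```

Here only `limit` is overridden, because it is the only option that was actually supplied on the (simulated) command line.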

### Progress bar

-In case when the same option is specified both in the YAML file and in as a command line
-argument takes priority
When you run BDFR interactively in a terminal, passing `--progress-bar` (or `-p`)
displays a progress bar with a live summary of the results.
Each processed post is printed with its status (✅ on success, ❌ if skipped or failed),
its score, and its title, truncated to 60 characters. For example:

```bash
python -m bdfr download ./example -S top -L 50 -s DataIsBeautiful --progress-bar
```

```text
✅ 162712🔼 [OC] Trending Google Searches by State Between 2018 and 2020
✅ 122725🔼 I analysed 70 years of baby names in the US to decide what t...
✅ 120972🔼 For everyone asking why i didn't include the Spanish Flu and...
✅ 111357🔼 Let's hear it for the lurkers! The vast majority of Reddit u...
❌ 109991🔼 US College Tuition & Fees vs. Overall Inflation [OC]
✅ 106921🔼 A wish for election night data visualization [OC]
✅ 104746🔼 [OC] u/IHateTheLetterF is a mad lad
✅ 104517🔼 Area of land burnt in Australia and area of smoke coverage s...
✅ 101616🔼 Light Speed – fast, but slow [OC]
Subreddits: 0%| | 0/2 [00:39<?, ?subreddit/s]
dataisbeautiful/top: 18%|███▊ | 9/50 [00:39<01:56, 2.84s/post]
```
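
The per-post lines in the sample output follow the format built in `Progress.post_done` in this PR. A standalone sketch of that formatting, assuming a hypothetical helper name `format_post_line` and plain arguments in place of a PRAW submission:

```python
def format_post_line(score: int, title: str, success: bool, max_len: int = 60) -> str:
    """Build one summary line: status icon, score, and a possibly truncated title."""
    # title[max_len:] is "" for short titles, so "..." is appended only when text was cut.
    title_short = title[:max_len] + (title[max_len:] and "...")
    icon = "✅" if success else "❌"
    # {score:5d} right-aligns scores below 10000 to a width of five characters.
    return f"{icon} {score:5d}🔼 {title_short}"


print(format_post_line(106921, "A wish for election night data visualization [OC]", True))
print(format_post_line(42, "x" * 70, False))
```

The `and` idiom relies on an empty string being falsy, which keeps the truncation to a single expression.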

## Options

@@ -176,6 +202,9 @@ The following options are common between both the `archive` and `download` comma
- `--log`
- This allows one to specify the location of the logfile
- This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below
- `-p, --progress-bar`
- Displays a progress bar in the terminal
- Prints a simplified log line for each processed post (status, upvotes, title)
- `--saved`
- This option will make the BDFR use the supplied user's saved posts list as a download source
- This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me`
1 change: 1 addition & 0 deletions bdfr/__main__.py
@@ -39,6 +39,7 @@
click.option("-L", "--limit", default=None, type=int),
click.option("-l", "--link", multiple=True, default=None, type=str),
click.option("-m", "--multireddit", multiple=True, default=None, type=str),
click.option("-p", "--progress-bar", is_flag=True, default=None),
click.option(
"-S", "--sort", type=click.Choice(("hot", "top", "new", "controversial", "rising", "relevance")), default=None
),
7 changes: 7 additions & 0 deletions bdfr/archiver.py
@@ -19,6 +19,7 @@
from bdfr.configuration import Configuration
from bdfr.connector import RedditConnector
from bdfr.exceptions import ArchiverError
from bdfr.progress_bar import Progress
from bdfr.resource import Resource

logger = logging.getLogger(__name__)
@@ -29,7 +30,9 @@ def __init__(self, args: Configuration, logging_handlers: Iterable[logging.Handl
super().__init__(args, logging_handlers)

def download(self) -> None:
progress = Progress(self.args.progress_bar, len(self.reddit_lists))
for generator in self.reddit_lists:
progress.subreddit_new(generator)
try:
for submission in generator:
try:
@@ -40,18 +43,22 @@ def download(self) -> None:
f"Submission {submission.id} in {submission.subreddit.display_name} skipped due to"
f" {submission.author.name if submission.author else 'DELETED'} being an ignored user"
)
progress.post_done(submission, False)
continue
if submission.id in self.excluded_submission_ids:
logger.debug(f"Object {submission.id} in exclusion list, skipping")
progress.post_done(submission, False)
continue
logger.debug(f"Attempting to archive submission {submission.id}")
self.write_entry(submission)
progress.post_done(submission, True)
except prawcore.PrawcoreException as e:
logger.error(f"Submission {submission.id} failed to be archived due to a PRAW exception: {e}")
except prawcore.PrawcoreException as e:
logger.error(f"The submission after {submission.id} failed to download due to a PRAW exception: {e}")
logger.debug("Waiting 60 seconds to continue")
sleep(60)
progress.subreddit_done()

def get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
supplied_submissions = []
8 changes: 7 additions & 1 deletion bdfr/cloner.py
@@ -9,6 +9,7 @@
from bdfr.archiver import Archiver
from bdfr.configuration import Configuration
from bdfr.downloader import RedditDownloader
from bdfr.progress_bar import Progress

logger = logging.getLogger(__name__)

@@ -18,15 +19,20 @@ def __init__(self, args: Configuration, logging_handlers: Iterable[logging.Handl
super().__init__(args, logging_handlers)

def download(self) -> None:
progress = Progress(self.args.progress_bar, len(self.reddit_lists))
for generator in self.reddit_lists:
progress.subreddit_new(generator)
try:
for submission in generator:
try:
-self._download_submission(submission)
success = self._download_submission(submission)
self.write_entry(submission)
except prawcore.PrawcoreException as e:
logger.error(f"Submission {submission.id} failed to be cloned due to a PRAW exception: {e}")
success = False
progress.post_done(submission, success)
except prawcore.PrawcoreException as e:
logger.error(f"The submission after {submission.id} failed to download due to a PRAW exception: {e}")
logger.debug("Waiting 60 seconds to continue")
sleep(60)
progress.subreddit_done()
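
The loop above tries each submission, treats an exception as a failure, and then reports one outcome per submission. A minimal standard-library sketch of that shape follows; all names are hypothetical, and `RuntimeError` stands in for `prawcore.PrawcoreException` so the example runs without PRAW.

```python
from typing import Callable, Iterable


def process_all(
    items: Iterable[str],
    handler: Callable[[str], bool],
    report: Callable[[str, bool], None],
) -> int:
    """Run handler on every item and report exactly one outcome per item."""
    n_ok = 0
    for item in items:
        try:
            success = handler(item)
        except RuntimeError:  # stand-in for prawcore.PrawcoreException
            success = False
        report(item, success)  # called once per item, whether or not handler raised
        n_ok += int(success)
    return n_ok


def handler(item: str) -> bool:
    if item == "b":
        raise RuntimeError("simulated API failure")
    return item != "c"  # "c" is skipped, so it counts as unsuccessful


outcomes: list[tuple[str, bool]] = []
print(process_all(["a", "b", "c"], handler, lambda item, ok: outcomes.append((item, ok))))
print(outcomes)
```

Placing the `report` call after the `try`/`except` rather than inside the `try` is what guarantees that failed items are still counted.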
1 change: 1 addition & 0 deletions bdfr/configuration.py
@@ -34,6 +34,7 @@ def __init__(self) -> None:
self.max_wait_time = None
self.multireddit: list[str] = []
self.no_dupes: bool = False
self.progress_bar: bool = False
self.saved: bool = False
self.search: Optional[str] = None
self.search_existing: bool = False
41 changes: 24 additions & 17 deletions bdfr/downloader.py
@@ -18,6 +18,7 @@
from bdfr import exceptions as errors
from bdfr.configuration import Configuration
from bdfr.connector import RedditConnector
from bdfr.progress_bar import Progress
from bdfr.site_downloaders.download_factory import DownloadFactory

logger = logging.getLogger(__name__)
@@ -42,54 +43,59 @@ def __init__(self, args: Configuration, logging_handlers: Iterable[logging.Handl
self.master_hash_list = self.scan_existing_files(self.download_directory)

def download(self) -> None:
progress = Progress(self.args.progress_bar, len(self.reddit_lists))
for generator in self.reddit_lists:
progress.subreddit_new(generator)
try:
for submission in generator:
try:
-self._download_submission(submission)
success = self._download_submission(submission)
except prawcore.PrawcoreException as e:
logger.error(f"Submission {submission.id} failed to download due to a PRAW exception: {e}")
success = False
progress.post_done(submission, success)
except prawcore.PrawcoreException as e:
logger.error(f"The submission after {submission.id} failed to download due to a PRAW exception: {e}")
logger.debug("Waiting 60 seconds to continue")
sleep(60)
progress.subreddit_done()

-def _download_submission(self, submission: praw.models.Submission) -> None:
def _download_submission(self, submission: praw.models.Submission) -> bool:
if submission.id in self.excluded_submission_ids:
logger.debug(f"Object {submission.id} in exclusion list, skipping")
-return
return False
elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
logger.debug(f"Submission {submission.id} in {submission.subreddit.display_name} in skip list")
-return
return False
elif (submission.author and submission.author.name in self.args.ignore_user) or (
submission.author is None and "DELETED" in self.args.ignore_user
):
logger.debug(
f"Submission {submission.id} in {submission.subreddit.display_name} skipped"
f" due to {submission.author.name if submission.author else 'DELETED'} being an ignored user"
)
-return
return False
elif self.args.min_score and submission.score < self.args.min_score:
logger.debug(
f"Submission {submission.id} filtered due to score {submission.score} < [{self.args.min_score}]"
)
-return
return False
elif self.args.max_score and self.args.max_score < submission.score:
logger.debug(
f"Submission {submission.id} filtered due to score {submission.score} > [{self.args.max_score}]"
)
-return
return False
elif (self.args.min_score_ratio and submission.upvote_ratio < self.args.min_score_ratio) or (
self.args.max_score_ratio and self.args.max_score_ratio < submission.upvote_ratio
):
logger.debug(f"Submission {submission.id} filtered due to score ratio ({submission.upvote_ratio})")
-return
return False
elif not isinstance(submission, praw.models.Submission):
logger.warning(f"{submission.id} is not a submission")
-return
return False
elif not self.download_filter.check_url(submission.url):
logger.debug(f"Submission {submission.id} filtered due to URL {submission.url}")
-return
return False

logger.debug(f"Attempting to download submission {submission.id}")
try:
@@ -98,15 +104,15 @@ def _download_submission(self, submission: praw.models.Submission) -> None:
logger.debug(f"Using {downloader_class.__name__} with url {submission.url}")
except errors.NotADownloadableLinkError as e:
logger.error(f"Could not download submission {submission.id}: {e}")
-return
return False
if downloader_class.__name__.lower() in self.args.disable_module:
logger.debug(f"Submission {submission.id} skipped due to disabled module {downloader_class.__name__}")
-return
return False
try:
content = downloader.find_resources(self.authenticator)
except errors.SiteDownloaderError as e:
logger.error(f"Site {downloader_class.__name__} failed to download submission {submission.id}: {e}")
-return
return False
for destination, res in self.file_name_formatter.format_resource_paths(content, self.download_directory):
if destination.exists():
logger.debug(f"File {destination} from submission {submission.id} already exists, continuing")
@@ -121,12 +127,12 @@ def _download_submission(self, submission: praw.models.Submission) -> None:
f"Failed to download resource {res.url} in submission {submission.id} "
f"with downloader {downloader_class.__name__}: {e}"
)
-return
return False
resource_hash = res.hash.hexdigest()
if resource_hash in self.master_hash_list:
if self.args.no_dupes:
logger.info(f"Resource hash {resource_hash} from submission {submission.id} downloaded elsewhere")
-return
return False
elif self.args.make_hard_links:
destination.parent.mkdir(parents=True, exist_ok=True)
try:
@@ -137,7 +143,7 @@ def _download_submission(self, submission: praw.models.Submission) -> None:
f"Hard link made linking {destination} to {self.master_hash_list[resource_hash]}"
f" in submission {submission.id}"
)
-return
return False
destination.parent.mkdir(parents=True, exist_ok=True)
try:
with destination.open("wb") as file:
@@ -146,12 +152,13 @@ def _download_submission(self, submission: praw.models.Submission) -> None:
except OSError as e:
logger.exception(e)
logger.error(f"Failed to write file in submission {submission.id} to {destination}: {e}")
-return
return False
creation_time = time.mktime(datetime.fromtimestamp(submission.created_utc).timetuple())
os.utime(destination, (creation_time, creation_time))
self.master_hash_list[resource_hash] = destination
logger.debug(f"Hash added to master list: {resource_hash}")
logger.info(f"Downloaded submission {submission.id} from {submission.subreddit.display_name}")
return True

@staticmethod
def scan_existing_files(directory: Path) -> dict[str, Path]:
45 changes: 45 additions & 0 deletions bdfr/progress_bar.py
@@ -0,0 +1,45 @@
import logging
from typing import Optional

from tqdm import tqdm

logger = logging.getLogger()


class Progress:
    def __init__(self, progress_bar: bool, n_subreddits: int):
        self.progress_bar = progress_bar
        self.bar_outer: Optional[tqdm] = None
        self.bar_inner: Optional[tqdm] = None

        if self.progress_bar:
            logger.setLevel(logging.CRITICAL)
            self.bar_outer = tqdm(total=n_subreddits, initial=0, desc="Subreddits", unit="subreddit", colour="green")

    def subreddit_new(self, generator):
        if self.progress_bar:
            # generator is a ListingGenerator or a (usually empty) list
            try:
                desc = generator.url
            except:
                desc = "Posts"

            try:
                total = generator.limit
            except:
                total = 1

            self.bar_inner = tqdm(total=total, initial=0, desc=desc, unit="post", colour="green", leave=False)

    def subreddit_done(self):
        if self.progress_bar:
            self.bar_outer.update(1)
            self.bar_inner.close()

    def post_done(self, submission, success: bool):
        if self.progress_bar:
            self.bar_inner.update(1)
            title_short = submission.title[:60] + (submission.title[60:] and "...")
            log_str = f"{submission.score:5d}🔼 {title_short}"
            icon = "✅" if success else "❌"
            self.bar_outer.write(f"{icon} {log_str}")
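
`subreddit_new` reads `url` and `limit` from whatever object it receives and falls back to defaults when the attribute lookup fails. The same fallback can be written with `getattr` defaults, which avoids a bare `except:` clause. This is an alternative sketch, not the code in this PR; `describe_source` and `FakeListing` are hypothetical names used only for illustration.

```python
from typing import Any, Optional


class FakeListing:
    """Stand-in for an object that exposes `url` and `limit` attributes."""

    def __init__(self, url: str, limit: Optional[int]) -> None:
        self.url = url
        self.limit = limit


def describe_source(source: Any) -> tuple[str, Optional[int]]:
    """Return (description, total) for a progress bar, with defaults for plain lists."""
    desc = getattr(source, "url", "Posts")  # plain lists have no `url`
    total = getattr(source, "limit", 1)  # plain lists have no `limit`
    return desc, total


print(describe_source(FakeListing("r/dataisbeautiful/top", 50)))
print(describe_source([]))
```

Unlike a bare `except:`, `getattr` with a default only covers a missing attribute, so unrelated errors are not silently swallowed.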
1 change: 1 addition & 0 deletions pyproject.toml
@@ -31,6 +31,7 @@ dependencies = [
"praw>=7.2.0",
"pyyaml>=5.4.1",
"requests>=2.28.2",
"tqdm>=4.64.1",
"yt-dlp>=2023.2.17",
]
dynamic = ["version"]