Improve documentation #185

Open
wants to merge 237 commits into
base: master
Changes from 1 commit
03c6708
Update .gitignore
cebtenzzre May 29, 2020
eea8680
tumblr_backup: Fix shebang and encoding
cebtenzzre Aug 29, 2020
13d0f61
tumblr_backup: Support Python 3
cebtenzzre Dec 27, 2019
feefb65
tumblr_backup: Make video embedding more robust to missing player
cebtenzzre Dec 13, 2018
8515c04
tumblr_backup: Better handling of integer options and cookiefile
cebtenzzre Jan 11, 2020
aa834b6
tumblr_backup: Fix excessive memory usage while building the index
cebtenzzre Jan 12, 2020
c89b45a
tumblr_backup: Better padding in log function
cebtenzzre May 26, 2020
62f8490
tumblr_backup: Replace http with https in URLs
cebtenzzre May 28, 2020
c5a80db
tumblr_backup: Avoid TOCTOU when checking self.queue.qsize()
cebtenzzre May 28, 2020
be270ff
tumblr_backup: Use log instead of print or sys.std{out,err}.write
cebtenzzre Aug 22, 2020
fdad7b3
tumblr_backup: Linting and cleanup
cebtenzzre May 27, 2020
10cf897
tumblr_backup: Migrate from optparse to argparse
cebtenzzre May 26, 2020
edefd50
tumblr_backup: --save-notes
cebtenzzre Dec 15, 2018
5ed50e0
note_scraper: Make typing optional
cebtenzzre Jan 17, 2021
4c419b5
note_scraper: Use multiprocessing
cebtenzzre Aug 27, 2020
6be1a92
tumblr_backup: Fix this log message in particular
cebtenzzre Sep 18, 2020
9086f93
tumblr_backup: Don't print status after final message
cebtenzzre Sep 17, 2020
2a0703c
tumblr_backup: --filter
cebtenzzre Dec 2, 2019
6e09e35
tumblr_backup: --json-dirs
cebtenzzre Dec 3, 2019
5e20a61
tumblr_backup: --json-dirs -> --prev-archives
cebtenzzre Jun 25, 2020
7567ca2
tumblr_backup: --copy-notes
cebtenzzre Jun 27, 2020
9411b43
tumblr_backup: Show remaining posts immediately and more frequently
cebtenzzre Aug 13, 2020
2cb67f2
tumblr_backup: Bail on non-200 API status, retry on 429
cebtenzzre Aug 22, 2020
e7dffb1
tumblr_backup: Use before parameter to implement --period
cebtenzzre Sep 20, 2020
6bb9ec7
tumblr_backup: Fix URL path logic, get_media_url and get_filename
cebtenzzre Sep 28, 2020
d380d09
tumblr_backup: Support backup of over 1000 likes
cebtenzzre Sep 26, 2020
a01e0a1
tumblr_backup: --likes: Support new API behavior
cebtenzzre Sep 30, 2020
585141e
tumblr_backup: Sort liked posts by liked_timestamp
cebtenzzre Sep 30, 2020
2f07acd
tumblr_backup: Clearly explain why the backup stopped
cebtenzzre Oct 16, 2020
989bb65
tumblr_backup: Disallow --prev-archives on outdir
cebtenzzre Oct 21, 2020
e50b082
tumblr_backup: Correctly sort responses read from disk
cebtenzzre Oct 21, 2020
0628eb3
tumblr_backup: Support "before" parameter in --prev-archives mode
cebtenzzre Oct 21, 2020
9c8142e
tumblr_backup: --no-post-clobber
cebtenzzre Aug 28, 2020
d59e628
tumblr_backup: Suppress early creation of post_dir
cebtenzzre Jan 6, 2021
119d074
tumblr_backup: Implement wget module
cebtenzzre May 22, 2020
13ca77d
tumblr_backup: Prevent a deadlock in ThreadPool.cancel
cebtenzzre Oct 20, 2020
93fa32d
wget: Support chunked Transfer-Encoding with non-identity Content-Enc…
cebtenzzre Jan 3, 2021
5d8b51b
tumblr_backup: Fix requests import (part 1)
cebtenzzre Jan 17, 2021
4f274ab
wget: Make it work on Windows
cebtenzzre Jan 17, 2021
8b132b1
Merge branch 'wget' into cebtenzzre
cebtenzzre Apr 9, 2021
9be9059
tumblr_backup: Use the internal SVC API to access dashboard-only blogs
cebtenzzre Sep 15, 2020
d023cb4
tumblr_backup: svc API returns string id, sort numerically
cebtenzzre Nov 2, 2020
5da5f23
tumblr_backup: Fix requests import (part 2)
cebtenzzre Jan 17, 2021
f032208
Merge branch 'svc' into cebtenzzre
cebtenzzre Apr 9, 2021
4453257
tumblr_backup: Performance optimizations for short runs
cebtenzzre Jul 13, 2020
c8bd5e0
tumblr_backup: Busy waiting -> proper synchronization
cebtenzzre Aug 28, 2020
d7472a8
tumblr_backup: Worker early-stop and graceful SIGTERM/SIGHUP
cebtenzzre Sep 18, 2020
34a2abc
tumblr_backup: --continue and related new behavior
cebtenzzre Sep 21, 2020
3c304de
tumblr_backup: Better --count 0
cebtenzzre Sep 26, 2020
6fafb0e
tumblr_backup: --no-get
cebtenzzre Sep 27, 2020
e2c50cd
tumblr_backup: Use .first_run_options of --prev-archives
cebtenzzre Sep 28, 2020
7d051eb
tumblr_backup: Flag to allow continue with different options
cebtenzzre Sep 30, 2020
510d7a2
tumblr_backup: Better logging of what posts are being backed up
cebtenzzre Sep 30, 2020
c2aaa0c
tumblr_backup: Lazy save_folder mkdir, disallow continue if empty
cebtenzzre Sep 30, 2020
d5f4512
tumblr_backup: save_post now calls get_content, JSON first
cebtenzzre Sep 30, 2020
1336a2e
tumblr_backup: Better rate limit handling
cebtenzzre Sep 30, 2020
60305d3
tumblr_backup: --reuse-json
cebtenzzre Oct 1, 2020
f8ee556
tumblr_backup: Do not create .complete if the backup could not finish
cebtenzzre Oct 13, 2020
775ef1a
tumblr_backup: Make fatal errors stand out
cebtenzzre Oct 13, 2020
cf34845
tumblr_backup: Refine urllib3 retry configuration
cebtenzzre Oct 14, 2020
9bfcd78
tumblr_backup: Print time as a string in log messages
cebtenzzre Oct 15, 2020
4b7189e
tumblr_backup: Update skip and count when using --continue
cebtenzzre Oct 15, 2020
5e9ec68
wget: *.media.tumblr.com ignores If-Modified-Since, don't log about it
cebtenzzre Oct 15, 2020
5035f1e
wget: Handle certain Cloudflare status codes like connection errors
cebtenzzre Oct 16, 2020
7c4b1e2
tumblr_backup: Defer .first_run_options creation
cebtenzzre Oct 16, 2020
73a6f7a
tumblr_backup: --continue only downloads new posts, act like it
cebtenzzre Oct 16, 2020
9a6aab7
note_scraper: Detect dashboard-only blogs
cebtenzzre Oct 17, 2020
4680a75
tumblr_backup: Don't signal dead threads
cebtenzzre Oct 19, 2020
5d2caca
tumblr_backup: Assert on before instead of period, -i implies -nc
cebtenzzre Oct 21, 2020
0ffee2b
tumblr_backup: Disable note scraping for dashboard-only blogs
cebtenzzre Oct 17, 2020
e5ffad5
tumblr_backup: Make it work on Windows
cebtenzzre Jan 17, 2021
b178443
tumblr_backup: Declare wget_retrieve early
cebtenzzre Jan 17, 2021
4048ec6
tumblr_backup: Remove thread killing code
cebtenzzre Jan 30, 2021
290a734
tumblr_backup: Fix a few "type: ignore" comments
cebtenzzre Feb 11, 2021
c6fb901
tumblr_backup: Fix image_names usage in get_filename
cebtenzzre Feb 24, 2021
7ba4651
wget: Handle certain Cloudflare status codes like connection errors
cebtenzzre Oct 16, 2020
20f64e9
Merge branch 'wget' into cebtenzzre
cebtenzzre Apr 9, 2021
c744bba
note_scraper: Detect dashboard-only blogs
cebtenzzre Oct 17, 2020
fed765a
tumblr_backup: Disable note scraping for dashboard-only blogs
cebtenzzre Oct 17, 2020
85d78e7
--internet-archive: Internet Archive fallback for Tumblr media
cebtenzzre Feb 24, 2021
36e9f90
tumblr_backup: Declare wget_retrieve early
cebtenzzre Jan 17, 2021
0af57c6
wget: Handle certain Cloudflare status codes like connection errors
cebtenzzre Oct 16, 2020
84b71eb
tumblr_backup: Insert query strings into media filenames
cebtenzzre Feb 24, 2021
8c325da
tumblr_backup: More informed media timestamps
cebtenzzre Feb 25, 2021
579b8ec
tumblr_backup: Remove --timestamping and related options
cebtenzzre Feb 25, 2021
e0922eb
wget: Typing fixups
cebtenzzre Feb 25, 2021
64984fb
tumblr_backup: Use oldest known media timestamp
cebtenzzre Feb 26, 2021
fceb8c6
wget: Call IO.flush() before os.utime() to avoid clobbering mtime
cebtenzzre Mar 24, 2021
ef31f29
Merge branch 'wget' into cebtenzzre
cebtenzzre Apr 9, 2021
da2621e
Merge branch 'cebtenzzre' into experimental
cebtenzzre Apr 9, 2021
e65e541
tumblr_backup: Allow --no-get with --reuse-json
cebtenzzre Apr 9, 2021
500b827
wget: Improve SSL alternative injection code
cebtenzzre Apr 21, 2021
d89e9cd
Merge branch 'wget' into cebtenzzre
cebtenzzre Apr 21, 2021
d05d593
Merge branch 'cebtenzzre' into experimental
cebtenzzre Apr 21, 2021
95ea710
Update README.md
cebtenzzre Apr 22, 2021
a6462af
Merge branch 'cebtenzzre' into experimental
cebtenzzre Apr 22, 2021
9879181
wget: pyopenssl.inject_into_urllib3 can raise ImportError
cebtenzzre Apr 23, 2021
373254a
Merge branch 'wget' into cebtenzzre
cebtenzzre Apr 23, 2021
8c9fa5d
tumblr_backup: HAS_SNI doesn't necessarily exist
cebtenzzre Apr 23, 2021
137544f
Merge branch 'cebtenzzre' into experimental
cebtenzzre Apr 23, 2021
69886a2
wget: Fix incorrect attribute name in --internet-archive code
cebtenzzre Oct 1, 2021
486cf74
tumblr_login: Update to use the new API so it works again
cebtenzzre Oct 1, 2021
fbd20e0
Merge branch 'svc' into cebtenzzre
cebtenzzre Oct 1, 2021
99c42d9
note_scraper: Skip duplicate "original post" notes
cebtenzzre Oct 1, 2021
6e7a042
Merge branch 'cebtenzzre' into experimental
cebtenzzre Oct 1, 2021
bc43e85
tumblr_backup: Do not allow slashes in blog names
cebtenzzre Aug 31, 2021
9ec09ee
tumblr_backup: Use copy_file_range for reflinks and server-side copy
cebtenzzre Jul 3, 2021
0fec63f
tumblr_backup: Honor --no-get in get_avatar and get_style
cebtenzzre Aug 31, 2021
d185e95
Merge branch 'cebtenzzre' into experimental
cebtenzzre Oct 14, 2021
d509eb8
tumblr_backup: Cleanup lint
cebtenzzre Dec 17, 2021
dcafee7
note_scraper: Better rate limit handling
cebtenzzre Dec 17, 2021
e0f3e3c
tumblr_backup: Implement log levels
cebtenzzre Nov 5, 2021
6e2a4f1
tumblr_backup: Print which blogs failed at exit
cebtenzzre Dec 5, 2021
8df232d
tumblr_backup: Replace pyjq dependency with jq.py
cebtenzzre Dec 15, 2021
06d062e
tumblr_backup: Prefer yt_dlp if available
cebtenzzre Dec 15, 2021
c07eff2
Merge branch 'cebtenzzre' into experimental
cebtenzzre Dec 19, 2021
a432950
tumblr_backup: Cleanup
cebtenzzre Dec 17, 2021
c57892b
tumblr_backup: Avoid print() and print_exc() for logging
cebtenzzre Nov 19, 2021
c342c3f
tumblr_backup: Fix blog title
cebtenzzre Dec 4, 2021
6594748
tumblr_backup: Preload bs4
cebtenzzre Dec 5, 2021
a54b63b
tumblr_backup: Improve youtube_dl import
cebtenzzre Dec 15, 2021
ab26d7e
tumblr_backup: Improve --reuse-json
cebtenzzre Dec 17, 2021
be07d70
tumblr_backup: Fix --no-get
cebtenzzre Dec 19, 2021
9ed6539
tumblr_backup: Fix --period
cebtenzzre Jan 6, 2022
4f7355e
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jan 6, 2022
6ba4c14
tumblr_backup: Add missing LF to a warning
cebtenzzre Jan 26, 2022
3582d8d
note_scraper: Fix rate limiting
cebtenzzre Jan 26, 2022
51effe9
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jan 26, 2022
701211d
tumblr_backup: Never write index if --blosxom
cebtenzzre Dec 4, 2021
11f8501
tumblr_backup: Default --copy-notes to True sometimes
cebtenzzre Jan 25, 2022
733dc1e
tumblr_backup: update BACKUP_CHANGING_OPTIONS
cebtenzzre Feb 2, 2022
e059f39
tumblr_backup: make a mutually exclusive group
cebtenzzre Feb 2, 2022
f15ed7c
wget: Work around fsync issues on macOS
cebtenzzre Mar 19, 2022
08b36bf
util: Fix F_FULLFSYNC typo
cebtenzzre Mar 19, 2022
21f1698
Merge branch 'cebtenzzre' into experimental
cebtenzzre Mar 19, 2022
ddf32ce
tumblr_backup: drop Python 2.7 support
cebtenzzre Jul 12, 2022
ada89d7
note_scraper: Always use small icons
cebtenzzre Jan 26, 2022
8e4a6f4
tumblr_backup: cannot use 'before' with svc API
cebtenzzre Feb 2, 2022
5c92ee3
tumblr_backup: downgrade 'scraping disabled' to info
cebtenzzre Feb 2, 2022
c733a9b
tumblr_backup: --only-reblog option
cebtenzzre Mar 30, 2022
332f2af
tumblr_backup: More flexible --period
cebtenzzre Mar 30, 2022
86d12ca
tumblr_backup: Refine urllib3 retry configuration
cebtenzzre Oct 14, 2020
72a3432
wget: Retry on HTTP 500 Internal Server Error
cebtenzzre Mar 30, 2022
babf2ce
tumblr_backup: Handle jq StopIteration
cebtenzzre Mar 30, 2022
d7634d2
tumblr_backup: Shut up about media timestamps
cebtenzzre Mar 30, 2022
3a54428
wget: Handle redirect connection errors better
cebtenzzre Apr 4, 2022
f4db4c3
tumblr_backup: better failure handling for likes
cebtenzzre May 23, 2022
c0ea269
tumblr_backup: Never overwrite in maybe_copy_media
cebtenzzre May 23, 2022
dcb14b6
tumblr_backup: Print a message after cancelling
cebtenzzre May 23, 2022
570ecb8
misc typing fixes
cebtenzzre Mar 27, 2022
bda4c7d
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jul 20, 2022
bef8694
tumblr_backup: Allow --continue of complete backup
cebtenzzre Feb 2, 2022
b66db61
tumblr_backup: Guard against potential races in download_media
cebtenzzre Mar 30, 2022
0728c44
tumblr_backup: Only search all posts for timestamp if --likes
cebtenzzre Mar 30, 2022
e03d2e2
tumblr_backup: record orig_options AFTER setting implied options
cebtenzzre May 23, 2022
dce1b8a
tumblr_backup: misc bugfixes
cebtenzzre Jul 19, 2022
aaebc81
note_scraper: update HTTP 429 handling
cebtenzzre May 11, 2023
1881fae
tumblr_backup: don't abuse copyfile
cebtenzzre Feb 17, 2023
d15e9d4
wget: add HTTP 502 'Bad Gateway' to retry list
cebtenzzre May 11, 2023
2cf5b30
tumblr_backup: remove python2-style super() calls
cebtenzzre May 11, 2023
df62fd8
tumblr_backup: don't log '0 remaining posts'
cebtenzzre May 11, 2023
9c08339
wget: improve readability of logging messages
cebtenzzre May 12, 2023
608f3c2
tumblr_backup: fix import-related typing issues
cebtenzzre May 16, 2023
c0e6b81
tumblr_backup: fix BeautifulSoup typing issues
cebtenzzre May 16, 2023
f78efef
tumblr_backup: remove unused globals
cebtenzzre May 16, 2023
5bf0310
util: remove explicit brotlipy detection
cebtenzzre May 16, 2023
f89c663
tumblr_backup: support filetype as an imghdr alternative
cebtenzzre May 16, 2023
c442dbd
tumblr_backup: fix pytype warnings
cebtenzzre May 19, 2023
76510b6
tumblr_backup: handle HTTP 420 'Enhance Your Calm'
cebtenzzre May 11, 2023
fc85196
note_scraper: log RequestException errors less verbosely
cebtenzzre May 20, 2023
5b91b05
wget: fix incorrect use of retry_counter
cebtenzzre Jun 7, 2023
f819cce
note scraper: fix safe mode check
cebtenzzre Jun 8, 2023
e3be708
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jun 8, 2023
2035715
tumblr_backup: make optional imports typecheck better
cebtenzzre May 17, 2023
a008915
util: fix a pytype warning
cebtenzzre May 19, 2023
e5d80ff
note_scraper: more retries for HTTP 420
cebtenzzre Jul 3, 2023
a0081c1
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jul 3, 2023
91348e7
tumblr_backup: fix ModuleNotFoundError on Windows
cebtenzzre Jul 11, 2023
9ddfe7d
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jul 11, 2023
91d872a
tumblr_backup: --skip-dns-check option
cebtenzzre Jul 13, 2023
b2aecc4
wget: typing fixup for pytype
cebtenzzre Jul 13, 2023
73007b3
wget: remove current_url hack
cebtenzzre Jul 13, 2023
91ae5cf
add basic requirements.txt
cebtenzzre Jul 13, 2023
82d84b6
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jul 13, 2023
941e410
tumblr_backup: more typing fixups
cebtenzzre Jul 13, 2023
d573510
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jul 13, 2023
9184ad4
tumblr_backup: avoid using glob() to match file paths
cebtenzzre Jul 21, 2023
52f0ac7
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jul 21, 2023
27b6c44
tumblr_backup: fail fast if lxml is missing
cebtenzzre Jul 29, 2023
8be7342
Merge branch 'cebtenzzre' into experimental
cebtenzzre Jul 29, 2023
c4102c6
tumblr_backup: --media-list option
cebtenzzre May 31, 2020
f95a3d7
tumblr_backup: --id-file option
cebtenzzre Jun 25, 2020
150cce1
tumblr_backup: Wait for keypress on ENOSPC
cebtenzzre Sep 18, 2020
134ae08
tumblr_backup: --json-info option
cebtenzzre Sep 16, 2020
475a48b
tumblr_backup: Change EXIT_NOPOSTS value to reduce ambiguity
cebtenzzre Feb 8, 2021
581f9fb
tumblr_backup: Use id= parameter for --id-file option
cebtenzzre Jan 6, 2022
1b9858e
tumblr_backup: Check for reblogs with new module is_reblog.py
cebtenzzre Jan 30, 2022
e984891
tumblr_backup: fix race condition in record_media
cebtenzzre Jul 29, 2023
495859e
tumblr_backup: make mypy happy
cebtenzzre Oct 21, 2023
6b8388d
Merge branch 'experimental'
cebtenzzre Oct 21, 2023
a2e3254
tumblr_backup: make post saving more robust/atomic
cebtenzzre Mar 30, 2022
50abbe1
tumblr_backup: make --prev-archives work with --likes
cebtenzzre Nov 23, 2023
563d6f1
wget: fix exception copying for Internet Archive fallback
cebtenzzre Dec 11, 2023
9db5c3a
maint: remove untouched scripts from upstream
cebtenzzre Feb 17, 2024
9ed86cf
tumblr_backup: store settings as JSON
cebtenzzre Feb 17, 2024
e099533
make all modules submodules of tumblr_backup
cebtenzzre Feb 17, 2024
d68f6c1
add pyproject.toml
cebtenzzre Feb 17, 2024
3c29ec5
maint: update .gitignore
cebtenzzre Feb 18, 2024
de8b058
properly implement required and optional dependencies
cebtenzzre Feb 18, 2024
9b4db3a
remove encoding specification from .py files
cebtenzzre Feb 18, 2024
bce8486
tumblr_backup: refactor entry point
cebtenzzre Feb 18, 2024
baa761b
refactor login entry point
cebtenzzre Feb 18, 2024
2348b45
maint: add dist/ to .gitignore
cebtenzzre Feb 18, 2024
af63181
docs: make tumblr_backup README the main one
cebtenzzre Feb 18, 2024
ceca203
documentation fixups
cebtenzzre Feb 18, 2024
3505524
linting
cebtenzzre Feb 18, 2024
2c3d76f
project: release version 1.0.0
cebtenzzre Feb 18, 2024
cc51b41
doc: fix README formatting
cebtenzzre Feb 21, 2024
7becc2b
project: release version 1.0.0.post1
cebtenzzre Feb 21, 2024
4a8d90e
main: fix config.json load/save
cebtenzzre Feb 21, 2024
aab9319
project: release version 1.0.1
cebtenzzre Feb 21, 2024
fa4a45c
implement urllib3 2.x compatibility
cebtenzzre Mar 8, 2024
6eb713d
use less globals
cebtenzzre Mar 8, 2024
312689d
project: release version 1.0.2
cebtenzzre Mar 8, 2024
06aa3f3
project: drop py36 because of dataclasses import
cebtenzzre Mar 8, 2024
49f623f
util: address TODO comment
cebtenzzre Mar 17, 2024
714e792
linting: run isort
cebtenzzre Mar 17, 2024
2a1e67b
maint: integrate style checking config and apply fixes
cebtenzzre Mar 23, 2024
6853175
maint: modernize type annotations
cebtenzzre Mar 23, 2024
7b1ab31
maint: fix lint warnings
cebtenzzre Dec 4, 2024
0677786
wget: fix urllib3 v2.2.2 compatibility
cebtenzzre Dec 4, 2024
28f4777
project: release version 1.0.4
cebtenzzre Dec 4, 2024
2a9d5e2
make the need for notes (now bs4) component clearer
cebtenzzre Dec 4, 2024
2866441
wget: fix ConnectTimeoutError handling and make it more future-proof
cebtenzzre Dec 4, 2024
754f2ad
project: release version 1.0.5
cebtenzzre Dec 4, 2024
b1ec755
Update extra name in README.md
cenodis Dec 4, 2024
8e6a067
Merge pull request #34 from cenodis/update_readme
cebtenzzre Dec 13, 2024
tumblr_backup: Remove --timestamping and related options
This code never worked well with Tumblr media because the Last-Modified
timestamp is hit-or-miss. This only removes the conditional get code;
files are still given mtime based on the Last-Modified header.
cebtenzzre committed Apr 9, 2021
commit 579b8ec3425968ec3da4343c792a32b43c913e0c
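
For readers skimming the diff, the behavior this commit keeps (giving a downloaded file an mtime taken from the Last-Modified header) can be sketched roughly as below. This is a minimal illustration, not the code in wget.py: the function names `parse_last_modified` and `apply_server_mtime` and the `fallback` parameter are hypothetical, and preferring the header over the fallback is an assumption. Only `parsedate_tz`, `mktime_tz`, and `os.utime` are real standard-library calls.

```python
import os
from email.utils import mktime_tz, parsedate_tz


def parse_last_modified(value):
    """Return a Last-Modified header as a Unix timestamp, or None if unusable."""
    if value is None:
        return None  # header missing
    parsed = parsedate_tz(value)
    if parsed is None:
        return None  # header present but not a valid HTTP date
    return mktime_tz(parsed)


def apply_server_mtime(path, last_modified, fallback=None):
    """Set the mtime of `path` from the header, else from `fallback`.

    Hypothetical helper: `fallback` stands in for something like a post
    timestamp. Returns True if a timestamp was applied, False otherwise.
    """
    remote_time = parse_last_modified(last_modified)
    if remote_time is None:
        remote_time = fallback
    if remote_time is None:
        return False  # nothing trustworthy to apply; leave the file alone
    os.utime(path, (remote_time, remote_time))
    return True
```

No conditional request is involved here: the file is always fetched with a plain GET, and the header is used only to label the result afterwards.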
22 changes: 4 additions & 18 deletions tumblr_backup.py
@@ -24,7 +24,7 @@
from posixpath import basename as urlbasename, join as urlpathjoin, splitext as urlsplitext
from xml.sax.saxutils import escape

from util import ConnectionFile, LockedQueue, PY3, no_internet, nullcontext, path_is_on_vfat, to_bytes, to_unicode
from util import ConnectionFile, LockedQueue, PY3, no_internet, nullcontext, to_bytes, to_unicode
from wget import HTTPError, WGError, WgetRetrieveWrapper, set_ssl_verify, urlopen

try:
@@ -416,13 +416,8 @@ def get_avatar(prev_archive):
avatar_dest = avatar_fpath = open_file(lambda f: f, (theme_dir, avatar_base))

# Remove old avatars
old_avatars = glob(join(theme_dir, avatar_base + '.*'))
if len(old_avatars) > 1:
for old_avatar in old_avatars:
os.unlink(old_avatar)
elif len(old_avatars) == 1:
# Use the old avatar for timestamping
avatar_dest, = old_avatars
if glob(join(theme_dir, avatar_base + '.*')):
return # Do not clobber

def adj_bn(old_bn, f):
# Give it an extension
@@ -1217,7 +1212,7 @@ def download_media(self, url, filename):
path_parts.insert(1, hostdir)

cpy_res = maybe_copy_media(self.prev_archive, path_parts)
if not cpy_res:
if not (cpy_res or os.path.exists(path_to(*path_parts))):
assert wget_retrieve is not None
try:
wget_retrieve(url, open_file(lambda f: f, path_parts), post_timestamp=self.post['timestamp'])
@@ -1561,14 +1556,8 @@ def __call__(self, parser, namespace, values, option_string=None):
parser.add_argument('--prev-archives', action=CSVListCallback, default=[], metavar='DIRS',
help='comma-separated list of directories (one per blog) containing previous blog archives')
parser.add_argument('--no-post-clobber', action='store_true', help='Do not re-download existing posts')
parser.add_argument('-M', '--timestamping', action='store_true',
help="don't re-download files if the remote timestamp and size match the local file")
parser.add_argument('--no-if-modified-since', action='store_false', dest='if_modified_since',
help="timestamping: don't send If-Modified-Since header")
parser.add_argument('--no-server-timestamps', action='store_false', dest='use_server_timestamps',
help="don't set local timestamps from HTTP headers")
parser.add_argument('--mtime-postfix', action='store_true',
help="timestamping: work around low-precision mtime on FAT filesystems")
parser.add_argument('--hostdirs', action='store_true', help='Generate host-prefixed directories for media')
parser.add_argument('blogs', nargs='*')
options = parser.parse_args()
@@ -1644,9 +1633,6 @@ def __call__(self, parser, namespace, values, option_string=None):
if os.path.realpath(pa) == os.path.realpath(blogdir):
parser.error("--prev-archives: Directory '{}' is also being written to. Use --reuse-json instead if "
"you want this, or specify --outdir if you don't.".format(pa))
if not options.mtime_postfix and path_is_on_vfat.works and path_is_on_vfat('.'):
print('Warning: FAT filesystem detected, enabling --mtime-postfix', file=sys.stderr)
options.mtime_postfix = True

if not API_KEY:
sys.stderr.write('''\
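
The tumblr_backup.py hunks above replace the timestamp comparison with a plain "do not clobber" rule: if a matching file already exists, it is left alone and nothing is fetched. A rough sketch of that rule follows. The names `already_saved` and `fetch_if_missing` are hypothetical, and the use of `glob.escape` is an addition for safety that the diff itself does not contain.

```python
import glob
import os


def already_saved(directory, base_name):
    """Return True if any file named `base_name.<ext>` exists in `directory`."""
    # Escape both parts so literal brackets or asterisks in names are not
    # interpreted as glob metacharacters.
    pattern = os.path.join(glob.escape(directory), glob.escape(base_name) + '.*')
    return bool(glob.glob(pattern))


def fetch_if_missing(directory, base_name, fetch):
    """Call `fetch()` only when no matching file exists.

    Returns True if `fetch` was called, False if the existing file was kept.
    """
    if already_saved(directory, base_name):
        return False  # do not clobber
    fetch()
    return True
```

The trade-off is that a changed remote file is never picked up once a local copy exists, which is acceptable if the Last-Modified values are as unreliable as the commit message says.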
27 changes: 0 additions & 27 deletions util.py
@@ -141,33 +141,6 @@ def is_dns_working(timeout=None):
return True


def rstrip_slashes(path):
return path.rstrip(b'\\/' if isinstance(path, bytes) else u'\\/')


class _Path_Is_On_VFat(object):
works = _PATH_IS_ON_VFAT_WORKS

def __call__(self, path):
if not self.works:
raise RuntimeError('This function must not be called unless PATH_IS_ON_VFAT_WORKS is True')

if os.name == 'nt':
# Compare normalized absolute path of volume
getdev = rstrip_slashes
path_dev = rstrip_slashes(_getvolumepathname(path))
else:
# Compare device ID
def getdev(mount): return os.stat(mount).st_dev
path_dev = getdev(path)

return any(part.fstype == 'vfat' and getdev(part.mountpoint) == path_dev
for part in psutil.disk_partitions(all=True))


path_is_on_vfat = _Path_Is_On_VFat()


class WaitOnMainThread(object):
def __init__(self):
self.cond = None # type: Optional[threading.Condition]
116 changes: 13 additions & 103 deletions wget.py
@@ -12,7 +12,6 @@
import warnings
from email.utils import mktime_tz, parsedate_tz
from tempfile import NamedTemporaryFile
from wsgiref.handlers import format_date_time

from util import PY3, is_dns_working, no_internet

@@ -94,16 +93,13 @@

# Document type flags
RETROKF = 0x2 # retrieval was OK
HEAD_ONLY = 0x4 # only send the HEAD request
IF_MODIFIED_SINCE = 0x80 # use If-Modified-Since header


# Error statuses
class UErr(object):
RETRUNNEEDED = 0
RETRINCOMPLETE = 1 # Not part of wget
RETRFINISHED = 2
HEADUNSUPPORTED = 3


class HttpStat(object):
@@ -118,9 +114,6 @@ def __init__(self):
self.statcode = 0 # status code
self.dest_dir = None # handle to the directory containing part_file
self.part_file = None # handle to local file used to store in-progress download
self.orig_file_exists = False # if there is a local file to compare for time-stamping
self.orig_file_size = 0 # size of file to compare for time-stamping
self.orig_file_tstamp = 0 # time-stamp of file to compare for time-stamping
self.remote_encoding = None # the encoding of the remote file
self.enc_is_identity = None # whether the remote encoding is identity
self.decoder = None # saved decoder from the HTTPResponse
@@ -261,36 +254,31 @@ def gethttp(url, hstat, doctype, options, logger, retry_counter):
hstat.remote_time = None

# Initialize the request
meth = 'GET'
if doctype & HEAD_ONLY:
meth = 'HEAD'
request_headers = {}
if doctype & IF_MODIFIED_SINCE:
request_headers['If-Modified-Since'] = format_date_time(hstat.orig_file_tstamp)
if hstat.restval:
request_headers['Range'] = 'bytes={}-'.format(hstat.restval)

doctype &= ~RETROKF

resp = urlopen(url, method=meth, headers=request_headers, preload_content=False, enforce_content_length=False)
resp = urlopen(url, headers=request_headers, preload_content=False, enforce_content_length=False)
url = hstat.current_url = urljoin(url, resp.current_url)

try:
err, doctype = process_response(url, hstat, doctype, options, logger, retry_counter, meth, resp)
err, doctype = process_response(url, hstat, doctype, options, logger, retry_counter, resp)
finally:
resp.release_conn()

return err, doctype


def process_response(url, hstat, doctype, options, logger, retry_counter, meth, resp):
def process_response(url, hstat, doctype, options, logger, retry_counter, resp):
# RFC 7233 section 4.1 paragraph 6:
# "A server MUST NOT generate a multipart response to a request for a single range [...]"
conttype = resp.headers.get('Content-Type')
if conttype is not None and conttype.lower().split(';', 1)[0].strip() == 'multipart/byteranges':
raise WGBadResponseError(logger, url, 'Sever sent multipart response, but multiple ranges were not requested')

contlen = resp.get_content_length(meth)
contlen = resp.get_content_length('GET')

crange_header = resp.headers.get('Content-Range')
crange_parsed = parse_content_range(crange_header)
@@ -333,11 +321,6 @@ def norm_enc(enc):
except ValueError:
hstat.statcode = 0

# HTTP 500 Internal Server Error
# HTTP 501 Not Implemented
if hstat.statcode in (500, 501) and (doctype & HEAD_ONLY):
return UErr.HEADUNSUPPORTED, doctype

# HTTP 20X
# HTTP 207 Multi-Status
if 200 <= hstat.statcode < 300 and hstat.statcode != 207:
@@ -348,21 +331,6 @@ def norm_enc(enc):
hstat.bytes_read = hstat.restval = 0
return UErr.RETRFINISHED, doctype

if doctype & IF_MODIFIED_SINCE:
# HTTP 304 Not Modified
if hstat.statcode == 304:
# File not modified on server according to If-Modified-Since, not retrieving.
doctype |= RETROKF
return UErr.RETRUNNEEDED, doctype
if (hstat.statcode == 200
and contlen in (None, hstat.orig_file_size)
and hstat.remote_time is not None
and hstat.remote_time <= hstat.orig_file_tstamp
):
logger.log(url, 'If-Modified-Since was ignored (file not actually modified), not retrieving.')
return UErr.RETRUNNEEDED, doctype
logger.log(url, 'Retrieving remote file because If-Modified-Since response indicates it was modified.')

if not (doctype & RETROKF):
e = WGWrongCodeError(logger, url, hstat.statcode, resp.reason, resp.headers)
# Cloudflare-specific errors
@@ -382,8 +350,8 @@ def norm_enc(enc):
shrunk = False
if hstat.statcode == 416:
shrunk = True # HTTP 416 Range Not Satisfiable
elif hstat.statcode != 200 or options.timestamping or contlen == 0:
pass # Only verify contlen if 200 OK (NOT 206 Partial Contents), not timestamping, and contlen is nonzero
elif hstat.statcode != 200 or contlen == 0:
pass # Only verify contlen if 200 OK (NOT 206 Partial Contents) and contlen is nonzero
elif contlen is not None and contrange == 0 and hstat.restval >= contlen:
shrunk = True # Got the whole content but it is known to be shorter than the restart point

@@ -414,7 +382,7 @@ def norm_enc(enc):
if hstat.contlen is not None:
hstat.contlen += contrange

if (doctype & HEAD_ONLY) or not (doctype & RETROKF):
if not (doctype & RETROKF):
hstat.bytes_read = hstat.restval = 0
return UErr.RETRFINISHED, doctype

@@ -625,7 +593,6 @@ def _retrieve_loop(hstat, url, dest_file, post_timestamp, adjust_basename, optio
raise WGUnreachableHostError(logger, url, 'Host {} is ignored.'.format(hostname))

doctype = 0
got_head = False # used for time-stamping
dest_dirname, dest_basename = os.path.split(dest_file)

flags = os.O_RDONLY
@@ -640,32 +607,6 @@ def _retrieve_loop(hstat, url, dest_file, post_timestamp, adjust_basename, optio
lambda pfx, dir_: NamedTemporaryFile('wb', prefix=pfx, dir=dir_, delete=False),
'.{}.'.format(dest_basename), dest_dirname,
))
send_head_first = False

if options.timestamping:
st = None
try:
if PY3 and os.stat in os.supports_dir_fd:
st = os.stat(dest_basename, dir_fd=hstat.dest_dir)
else:
st = os.stat(dest_file)
except EnvironmentError as e:
if getattr(e, 'errno', None) != errno.ENOENT:
raise # Not unusual

if st is not None:
# Timestamping is enabled and the local file exists
hstat.orig_file_exists = True
hstat.orig_file_size = st.st_size
hstat.orig_file_tstamp = int(st.st_mtime)
if options.mtime_postfix:
hstat.orig_file_tstamp += 1

if options.if_modified_since:
doctype |= IF_MODIFIED_SINCE # Send a conditional GET request
else:
send_head_first = True # Send a preliminary HEAD request
doctype |= HEAD_ONLY

# THE loop

@@ -701,44 +642,8 @@ def _retrieve_loop(hstat, url, dest_file, post_timestamp, adjust_basename, optio
continue # Non-fatal error, try again
if err == UErr.RETRUNNEEDED:
return
if err == UErr.HEADUNSUPPORTED:
# Fall back to GET if HEAD is unsupported.
send_head_first = False
doctype &= ~HEAD_ONLY
retry_counter.reset()
continue
assert err == UErr.RETRFINISHED

# Did we get the time-stamp?
if not got_head:
got_head = True # no more time-stamping

if (options.timestamping or options.use_server_timestamps) and hstat.remote_time is None:
logger.log(url, 'Warning: Last-Modified header is {}'
.format('missing' if hstat.last_modified is None
else 'invalid: {}'.format(hstat.last_modified)))

if send_head_first:
if hstat.orig_file_exists and hstat.remote_time is not None:
# Now time-stamping can be used validly. Time-stamping means that if the sizes of the local and
# remote file match, and local file is newer than the remote file, it will not be retrieved.
# Otherwise, the normal download procedure is resumed.
if hstat.remote_time > hstat.orig_file_tstamp:
logger.log(url, 'Retrieving remote file because its mtime ({}) is newer than what we have ({}).'
.format(format_date_time(hstat.remote_time),
format_date_time(hstat.orig_file_tstamp)))
elif hstat.enc_is_identity and hstat.contlen not in (None, hstat.orig_file_size):
logger.log(url,
'Retrieving remote file because its size ({}) is does not match what we have ({}).'
.format(hstat.contlen, hstat.orig_file_size))
else:
# Remote file is no newer and has the same size, not retrieving.
return

doctype &= ~HEAD_ONLY
retry_counter.reset()
continue

if hstat.contlen is not None and hstat.bytes_read < hstat.contlen:
# We lost the connection too soon
retry_counter.increment(url, hstat, 'Server closed connection before Content-Length was reached.')
@@ -756,7 +661,12 @@ def _retrieve_loop(hstat, url, dest_file, post_timestamp, adjust_basename, optio
else:
os.chmod(hstat.part_file.name, 0o644)

# Set the timestamp
if options.use_server_timestamps and hstat.remote_time is None:
logger.log(url, 'Warning: Last-Modified header is {}'
.format('missing' if hstat.last_modified is None
else 'invalid: {}'.format(hstat.last_modified)))

# Set the timestamp on the local file
if (options.use_server_timestamps
and (hstat.remote_time is not None or post_timestamp is not None)
and hstat.contlen in (None, hstat.bytes_read)