- Change `configure` defaults to `output_az_paths=True` and `save_access_token_to_disk=True`
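A minimal sketch of restoring the previous defaults, assuming both options remain ordinary `bf.configure()` keyword arguments:

```python
import blobfile as bf

# Restore the earlier behavior: keep https:// output and skip the on-disk token cache.
bf.configure(output_az_paths=False, save_access_token_to_disk=False)
```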
- Azure timestamp parsing is now slightly faster
- Replace `xmltodict` with `lxml` as it is slightly faster
- Use lazy import for `pycryptodome`, from @hauntsaninja
- Print request ids when azure auth fails
- Added new `configure` option, `multiprocessing_start_method`, that defaults to `"spawn"` due to issues with `fork`. To get the original behavior back, call `bf.configure(multiprocessing_start_method="fork")`
- Fix to default value for `use_azure_storage_account_key_fallback`, from @hauntsaninja
- Add support for anonymous auth for GCS
- Clarify relationship with TensorFlow's `gfile`
- `parallel` option for `bf.copy` will now use a faster copy for azure when copying between two different storage accounts
- fix a bug when reading a file from an account with read-only access and also using `cache_dir`
- better error messages when using remote copy
- `parallel` option for `rmtree`
- fix trailing semicolon with `bf.join`
- add `save_access_token_to_disk` option to save access tokens locally
- fix bug preventing parallel copy of empty file
- calculate md5 for individual uploaded blocks
- more consistent glob behavior on local paths
- don't use global random state
- Add `shard_prefix_length` to `scanglob`
- Support using access tokens from new versions of azure-cli, thanks @hauntsaninja for adding this.
- Change to use oauth v2 for azure, thanks @cjgibson for adding this.
- Fix bug in `scanglob` that marked files as directories, thanks @jacobhilton for fixing this
- Fix rare crash when writing files, thanks @unixpickle for fixing this
- `connection_pool_max_size` and `max_connection_pool_count` are no longer supported due to issues with `ProcessPoolExecutor`; instead you may specify `get_http_pool` if you need to control these settings.
- Add `file_size` argument to `BlobFile` to skip reading the size of the file on creation.
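A hedged sketch of the `file_size` argument; the path is hypothetical and the size here comes from an earlier `bf.stat()` call:

```python
import blobfile as bf

path = "gs://my-bucket/data.bin"  # hypothetical path
size = bf.stat(path).size

# Passing file_size skips the size lookup that BlobFile would otherwise do on open.
with bf.BlobFile(path, "rb", file_size=size) as f:
    header = f.read(16)
```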
- Change `default_buffer_size` to `8 * 2**20`
- Increased the chunk size for `readall()` to `8 * 2**20`
- Fix GCS file writes to respect `google_write_chunk_size`
- Added an option, `use_streaming_read=False`, to disable streaming reads since azure doesn't handle them well. Streaming reads are now disabled by default.
- Added an option to set the default buffer size, `default_buffer_size`, which is important to set if you disable streaming reads.
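A sketch combining the two options above, under the assumption that both are plain `bf.configure()` keyword arguments:

```python
import blobfile as bf

# Assumed to be configure() options: leave streaming reads disabled (the new default)
# and size the read buffer explicitly, since non-streaming reads go through this buffer.
bf.configure(use_streaming_read=False, default_buffer_size=8 * 2**20)
```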
- When uploading to a file with `streaming=True` (not the default), avoid an extra copy of the data being uploaded. This is mostly an optimization for when you do a single large `f.write()`.
- Set `use_azure_storage_account_key_fallback` to `False` by default. This is a backwards-incompatible change if you rely on storage account keys. To go back to the previous behavior, call `bf.configure(use_azure_storage_account_key_fallback=True)`.
- Support pagination of azure management pages.
- Don't log connection aborted errors on first try.
- Use slightly different backoff jitter algorithm
- Fix `tell()` for streaming write files; this fixes a bug where zip files would not be written correctly when written to a `streaming=True` file while using the `zipfile` library.
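The kind of usage the `tell()` fix unblocks, with a hypothetical destination path:

```python
import zipfile
import blobfile as bf

# zipfile calls tell() on the underlying file while writing, so a correct
# tell() is needed for an archive written to a streaming file to be valid.
with bf.BlobFile("az://account/container/archive.zip", "wb", streaming=True) as f:
    with zipfile.ZipFile(f, "w") as zf:
        zf.writestr("hello.txt", "hello world")
```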
- Add `create_context()` function to create `blobfile` instances with different configurations (see the sketch below)
- Retry azure remote copy if a copy is already in progress
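A minimal sketch of `create_context()`, assuming it accepts the same keyword arguments as `configure()` and that the returned context exposes the same operations as the module-level API:

```python
import blobfile as bf

# A context carries its own configuration instead of mutating the global one.
ctx = bf.create_context(output_az_paths=True)  # option shown is illustrative

with ctx.BlobFile("az://account/container/file.txt", "rb") as f:  # hypothetical path
    data = f.read()
```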
- Fix bug in parallel azure upload
- Remove `BLOBFILE_BACKENDS` environment variable
- Various internal refactorings
- By default, common errors will not be logged unless a request has been retried enough times
- Attempt to query for subscriptions even less often
- Allow configuring connect and read timeouts through `bf.configure`
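A sketch under the assumption that the options are named `connect_timeout` and `read_timeout` and take seconds (the entry above does not name them):

```python
import blobfile as bf

# Assumed option names; values are illustrative.
bf.configure(connect_timeout=10, read_timeout=60)
```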
- Add configure option `output_az_paths`, set this to `True` to output `az://` paths instead of the `https://` ones
- Add configure option `use_azure_storage_account_key_fallback`, set this to `False` to disable falling back to storage account keys. This is recommended because the storage key fallback confuses users and can result in 429s from the Azure management endpoints.
- Remove generated `.pyi` files. These were for use by `pyright`, but confused PyCharm and are no longer necessary for `pyright`.
- Rename non-public modules to make it clear to `pyright` which symbols are exported.
- Fix for azure credentials when using service principals through azure cli, thanks to @hauntsaninja for the PR
- Fix for `bf.listdir` on `az://` paths in the presence of explicit directories, thanks to @WuTheFWasThat for the PR
- Add support for `az://` urls, thanks to @joschu for the PR. All azure urls output by `blobfile` are still in the `https://` format.
- Fix from @hauntsaninja for `bf.isdir()`, which didn't work on some unusual azure directories.
- Fewer calls to azure API to reduce chance of hitting rate limits, thanks to @hauntsaninja for the PR
- Tokens were being expired at the wrong time, thanks to @hauntsaninja for the PR
- Sleep when checking copy status, thanks to @hauntsaninja for the PR
- New version to work around pypi upload failure
- Better error message for bad refresh token, thanks @hauntsaninja for reporting this
- Include more error information when a request fails
- Fix `bf.copy(..., parallel=True)` logic: versions `1.0.0` and `0.17.3` could upload the wrong data when requests are retried internally by `bf.copy`. Also azure paths were not properly escaped.
- Remove deprecated functions `LocalBlobFile` (use `BlobFile` with `streaming=False`) and `set_log_callback` (use `configure` with `log_callback=<fn>`)
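A migration sketch for the removed functions; the path is hypothetical and the callback is assumed to receive a single message string:

```python
import blobfile as bf

# Before: f = LocalBlobFile("gs://my-bucket/file.txt")
with bf.BlobFile("gs://my-bucket/file.txt", "rb", streaming=False) as f:  # hypothetical path
    data = f.read()

# Before: set_log_callback(my_log_fn)
bf.configure(log_callback=lambda msg: print("blobfile:", msg))
```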
- Change default write block size to 8 MB
- Add `parallel` option to `bf.copy` to do some operations in parallel, as well as a `parallel_executor` argument to set the executor to be used (see the sketch below).
- Fix `bf.copy` between multiple azure storage accounts, thanks @hauntsaninja for reporting this
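A sketch of the `parallel` copy option mentioned above; the paths are hypothetical:

```python
import blobfile as bf

# parallel=True performs parts of the copy concurrently; the optional
# parallel_executor argument lets you supply the executor to use.
bf.copy(
    "gs://my-bucket/big-file.bin",          # hypothetical source
    "az://account/container/big-file.bin",  # hypothetical destination
    parallel=True,
)
```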
- Allow seeking past end of file
- Allow anonymous access for azure containers. Try anonymous access if other methods fail and allow blobfile to work if the user has no valid azure credentials.
- Fixed GCS cloud copy for large files, from @hauntsaninja
- Added workaround for `TextIOWrapper` to buffer the same way when reading in text or binary mode
- Don't clear block blobs when starting to write to them; instead clear only the uncommitted blocks.
- Log all request failures by default rather than just errors after the first one; this can now be set with the `retry_log_threshold` argument to `configure()`. To get the previous behavior, use `bf.configure(retry_log_threshold=1)`
- Use block blobs instead of append blobs in Azure Storage; the block size can be set via the `azure_write_chunk_size` option to `configure()`. Writing a block blob will delete any existing file before starting the writing process, and writing may raise a `ConcurrentWriteFailure` in the event of multiple processes writing to the same file at the same time. If this happens, either avoid writing concurrently to the same file, or retry after some period.
- Make service principals fall back to storage account keys and improve detection of when to fall back
- Added `set_mtime` function to set the modified time for an object
- Added `md5` to the stat object, which will be the md5 hexdigest if present on a remote file. Also added `version` which, for remote objects, represents some unique id that is changed when the file is changed. (See the sketch below.)
- Improved error descriptions
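A sketch of the `set_mtime` and stat additions, assuming `set_mtime` takes the path and a POSIX timestamp; the path is hypothetical:

```python
import time
import blobfile as bf

path = "gs://my-bucket/file.txt"  # hypothetical path
bf.set_mtime(path, time.time())   # assumed signature: (path, mtime)

st = bf.stat(path)
print(st.md5)      # md5 hexdigest if the backend stored one, otherwise None
print(st.version)  # opaque id that changes whenever the file changes
```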
- Require keyword arguments to `configure()`
- Add `scanglob`, which is `glob` but returns `DirEntry` objects instead of strings
- Add `scandir`, which is `listdir` but returns `DirEntry` objects instead of strings (see the sketch below)
- `listdir` entries for local paths are no longer returned in sorted order
- Add ability to set max count of connection pools; this may be useful for Azure where each storage account has its own connection pool.
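A sketch of `scandir`, assuming each `DirEntry` exposes `name` and `is_dir` attributes; the path is hypothetical:

```python
import blobfile as bf

# Like listdir, but each entry carries metadata alongside the name.
for entry in bf.scandir("gs://my-bucket/some-dir/"):
    kind = "dir" if entry.is_dir else "file"
    print(kind, entry.name)
```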
- Handle `:` with `join`
- More robust checking for azure account-does-not-exist errors
- Handle exceptions during `close()` for `_ProxyFile`
- Use `storage.googleapis.com` instead of `www.googleapis.com` for google api endpoint
- Add support for `NO_GCE_CHECK=true` environment variable used by colab notebooks
- Remove use of `copy.copy()` due to odd behavior during interpreter shutdown which could cause write-mode `BlobFile`s to not finish their final upload
s to not finish their final upload - Support azure "login" method instead of just "key" method, corresponding to "AZURE_STORAGE_AUTH_MODE". Automatically fallback to key method if login method doesn't work.
- Skip subscriptions we don't have permissions to access when getting storage keys.
- Use environment variable `AZURE_STORAGE_KEY` instead of `AZURE_STORAGE_ACCOUNT_KEY` by default
- Add support for environment variable `AZURE_STORAGE_CONNECTION_STRING`
- Don't return connections used for reading files to the connection pool to avoid a rare issue with a -1 file descriptor
- no longer allow `:` in remote paths used with `join` except for the first path provided
- add `BLOBFILE_BACKENDS` environment variable to set what backends will be available for use with `BlobFile`; it should be set to `local,google,azure` to get the default behavior of allowing all backends
- reopen streaming read files when an error is encountered in case `urllib3` does not do this
- reduce `readall()` default chunk size to fix performance regression, thanks @jpambrun for reporting this!
- Added `configure()` to replace `set_log_callback` and add a configurable max connection pool size.
- Make `pip install` work without having to have extra tools installed
- Fix bug in response reading where requests would be retried, but reading the response body would not be
- Added `topdown=False` support to `walk()` (see the sketch below)
- Added `copytree()` example
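A sketch of bottom-up traversal with `topdown=False`; the path is hypothetical:

```python
import blobfile as bf

# topdown=False yields deeper directories before their parents, which is
# the order you want if you are removing things as you walk.
for dirpath, dirnames, filenames in bf.walk("gs://my-bucket/logs/", topdown=False):
    print(dirpath, len(filenames))
```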
- Creating a write-mode `BlobFile` for a local path will automatically create intermediate directories to be consistent with blob storage, see blobfile#48
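For illustration, writing to a nested local path whose directories do not exist yet:

```python
import blobfile as bf

# Intermediate directories ("/tmp/a/b" here) are created automatically,
# matching how paths behave on blob storage.
with bf.BlobFile("/tmp/a/b/c.txt", "wb") as f:
    f.write(b"hello")
```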