"Golden spike" PR #488

Draft
wants to merge 47 commits into
base: main
Commits
a859b58
Stub out index_delta(), index_lance(), index_parquet().
knighton Oct 28, 2023
bd0208a
index_backend().
knighton Oct 28, 2023
69575bc
Fix.
knighton Oct 28, 2023
42b59f1
task.py for benchmarking.
knighton Oct 28, 2023
2fb1b09
generate_datasets.py.
knighton Oct 28, 2023
11dd673
Fix.
knighton Oct 28, 2023
82737e0
Organize/divide streaming/base/util.py:
knighton Oct 28, 2023
3212f66
Completely rip out and rewrite pretty args handling:
knighton Oct 28, 2023
eb93bea
Layer several new storage APIs wrapping/complementing streaming/base/…
knighton Oct 28, 2023
23554ac
Use those APIs to index a Parquet dataset (single-threaded).
knighton Oct 28, 2023
c711567
Add cli/index_parquet.py.
knighton Oct 28, 2023
4ea01b2
Rename get_list_arg() -> parse_strs() in keeping with parse_str2str()…
knighton Oct 28, 2023
157381a
Rename parse_(args stuff) -> unpack_(args stuff).
knighton Oct 28, 2023
c72127f
Long lines.
knighton Oct 28, 2023
d2be6a0
Populate streaming/examples/ with SD subclasses, also streaming/bench…
knighton Oct 28, 2023
da6f4af
Fix.
knighton Oct 28, 2023
b0fa3d7
Move benchmarks up and out.
knighton Oct 28, 2023
4a22638
Fix.
knighton Oct 28, 2023
cb80865
Now, rename streaming/base/... -> streaming/....
knighton Oct 28, 2023
4851888
Update paths accordingly.
knighton Oct 28, 2023
1051474
Update more paths.
knighton Oct 28, 2023
65ef0de
Formatting.
knighton Oct 28, 2023
408999a
Fix.
knighton Oct 28, 2023
b38f8a3
Move examples/ to top level.
knighton Oct 28, 2023
ff90826
Update multimodal.
knighton Oct 28, 2023
a7808ae
Update vision dataset examples -> kwargs.
knighton Oct 28, 2023
c857ed6
Update vision datasets to use kwargs (save us from bitrot, o kwargs).
knighton Oct 28, 2023
89d5719
Generalize `keep_zip` argument to `keep_packed`.
knighton Oct 28, 2023
c09248c
Add graceful migration from keep_zip to keep_packed.
knighton Oct 28, 2023
9befaa6
First take on a MDS write_dataset().
knighton Oct 29, 2023
c4a5094
Add enough column inference to keep going.
knighton Oct 29, 2023
48dce5c
Writing all given samples as one indexless MDS shard, returning its …
knighton Oct 29, 2023
b0d1543
Naming.
knighton Oct 29, 2023
99ad0c0
Fixes.
knighton Oct 29, 2023
b38fce0
cli/hash.py.
knighton Oct 29, 2023
7a9fc90
walk_prefix() including local fs.
knighton Oct 29, 2023
5247bfe
generate_datasets.py: Tabulator.
knighton Nov 5, 2023
6dc5e22
Fix (passing `left`, and spacing).
knighton Nov 5, 2023
18f6474
Switch to box-drawing chars in Tabulator. Example:
knighton Nov 5, 2023
52af2cb
Rewrite task.py.
knighton Nov 5, 2023
bc125b4
Fixes.
knighton Nov 5, 2023
a2ff86f
Fix.
knighton Nov 5, 2023
57e7571
Misc.
knighton Nov 5, 2023
52dcb42
Merge branch 'main' into james/proto
knighton Nov 5, 2023
f1e10bb
Split out Tabulator.
knighton Nov 5, 2023
56674e8
Merge branch 'james/proto' of github.com:mosaicml/streaming into jame…
knighton Nov 5, 2023
cbfcab3
Refactor.
knighton Nov 5, 2023
cli/hash.py.
knighton committed Oct 29, 2023

commit b38fce03977244a81fd9337480882fd4d4e5002c
Empty file removed examples/__init__py
Empty file.
42 changes: 42 additions & 0 deletions streaming/cli/hash.py
@@ -0,0 +1,42 @@
# Copyright 2023 MosaicML Streaming authors
# SPDX-License-Identifier: Apache-2.0

"""Calculate one or more hashes of the given file."""

from argparse import ArgumentParser, Namespace

from streaming.hashing import get_hash, get_hashes
from streaming.util.pretty import unpack_strs


def parse_args() -> Namespace:
    """Parse command-line arguments.

    Returns:
        Namespace: Command-line arguments.
    """
    supported = sorted(get_hashes())
    args = ArgumentParser()
    args.add_argument('--file', type=str, required=True, help='Path to file to hash.')
    args.add_argument('--hash',
                      type=str,
                      required=True,
                      help='Comma-delimited names of hash algorithms. Must be in this list: ' +
                      f'{supported}. Names and hex digests will be listed one per line.')
    return args.parse_args()


def main(args: Namespace) -> None:
    """Calculate one or more hashes of the data of the given file.

    Args:
        args (Namespace): Command-line arguments.
    """
    # Read the whole file once; the context manager closes the handle promptly.
    with open(args.file, 'rb') as f:
        data = f.read()
    for algo in unpack_strs(args.hash):
        hex_digest = get_hash(algo, data)
        print(f'{algo} {hex_digest}')


if __name__ == '__main__':
    main(parse_args())
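
For readers without the rest of the PR checked out, here is a minimal standalone sketch of the same idea: hashing one file with several algorithms named in a comma-delimited string. It is not the code in this commit. It assumes Python's standard `hashlib` as a stand-in for `streaming.hashing`, whose internals are not shown in this diff, and `split_names()` and `hash_file()` are hypothetical helpers, with `split_names()` only approximating what `unpack_strs()` appears to do.

```python
import hashlib
from typing import Dict, List


def split_names(text: str) -> List[str]:
    """Hypothetical stand-in for unpack_strs(): split a comma-delimited string."""
    return [name.strip() for name in text.split(',') if name.strip()]


def hash_file(path: str, algos: List[str], chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Hash a file with several hashlib algorithms in a single pass.

    Assumption: every name in ``algos`` is a fixed-length hashlib algorithm
    (e.g. 'sha1', 'sha256'). Variable-length ones such as 'shake_128' need a
    digest length and are not handled here.
    """
    # One hasher per requested algorithm, all fed from the same read loop.
    hashers = {algo: hashlib.new(algo) for algo in algos}
    with open(path, 'rb') as f:
        # Read in chunks so the file never has to fit in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {algo: hasher.hexdigest() for algo, hasher in hashers.items()}
```

Calling `hash_file(path, split_names('sha1,sha256'))` returns a mapping from algorithm name to hex digest, which can be printed one pair per line as `main()` does. Unlike the read-everything approach in the commit, this sketch streams the file in chunks, a trade-off that may matter for large shards.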