
Crunch 100+ GB Strings in Python with ease, leveraging Arm Neon and x86 AVX2 SIMD Assembly to sort, split, and search through billions of tokens with StringZilla 🦖


davvard/Stringzilla


StringZilla: The Godzilla of String Libraries 🦖

Welcome to StringZilla, where we don't just handle strings, we devour them! 🍽️ If you've been on the hunt for a string library that's not just fast but freakishly fast, you've hit the jackpot. 🎰 StringZilla is the Godzilla of string libraries, stomping through your text faster than you can say "Tokyo Tower"! 🗼

Unleash the Beast: Performance 🚀

StringZilla uses a heuristic so simple, it's almost stupid. But don't be fooled! This bad boy matches the first few letters of words with hyper-scalar code to achieve ludicrous speed. 🏎️💨 It's practical, easy to implement with different flavors of SIMD, and even SWAR for those less fortunate platforms. If you're haunted by open(...).readlines() and str().splitlines() taking forever, then StringZilla is your dream come true. 🌈
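To make the "first few letters" idea concrete, here is a minimal pure-Python sketch. It is not StringZilla's implementation (that lives in C with SIMD and SWAR), and it will be slow in Python. The function name and the choice of a four-byte prefix are assumptions made purely for illustration.

```python
def find_with_prefix_filter(haystack: bytes, needle: bytes) -> int:
    """Return the first index of needle in haystack, or -1 (hypothetical helper)."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    k = min(m, 4)  # assumed prefix width: at most the first four bytes
    prefix = int.from_bytes(needle[:k], "little")
    for i in range(n - m + 1):
        # Cheap check: a single integer comparison on the prefix.
        if int.from_bytes(haystack[i:i + k], "little") != prefix:
            continue
        # Full comparison only for candidates that passed the cheap check.
        if haystack[i:i + m] == needle:
            return i
    return -1
```

The point of the design is that most positions fail the cheap prefix comparison, so the full comparison runs rarely. A SIMD version could test many positions per instruction in the same spirit.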

The Speed Showdown 🏁

| Algorithm / Metric          | IoT                    | Laptop                   | Server                    |
| --------------------------- | ---------------------- | ------------------------ | ------------------------- |
| Speed Comparison            |                        |                          |                           |
| Python for loop 🐌          | 4 MB/s                 | 14 MB/s                  | 11 MB/s                   |
| C++ for loop 🏍️             | 520 MB/s               | 1.0 GB/s                 | 900 MB/s                  |
| C++ string.find 🚗          | 560 MB/s               | 1.2 GB/s                 | 1.3 GB/s                  |
| Scalar StringZilla 🚀       | 2 GB/s                 | 3.3 GB/s                 | 3.5 GB/s                  |
| Hyper-Scalar StringZilla 🛸 | 4.3 GB/s               | 12 GB/s                  | 12.1 GB/s                 |
| Efficiency Metrics 📊       |                        |                          |                           |
| CPU Specs                   | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
| Performance/Core 💪         | 2.1 - 3.3 GB/s         | 11 GB/s                  | 10.5 GB/s                 |
| Bytes/Joule ⚡              | 4.2 GB/J               | 2 GB/J                   | 1.6 GB/J                  |
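
The Bytes/Joule row appears to be the per-core throughput divided by the per-core power draw. The short check below redoes that arithmetic with the table's numbers. Taking the lower bound of the IoT range (2.1 GB/s) is an assumption, as is the helper name.

```python
# Per-core throughput (GB/s) and per-core power (W), copied from the table.
# For IoT the lower bound of the 2.1 - 3.3 GB/s range is used (an assumption).
platforms = {
    "IoT": (2.1, 0.5),
    "Laptop": (11.0, 5.6),
    "Server": (10.5, 6.3),
}

def gigabytes_per_joule(gb_per_second: float, watts: float) -> float:
    # 1 W = 1 J/s, so (GB/s) / (J/s) = GB/J.
    return gb_per_second / watts

efficiency = {name: gigabytes_per_joule(*spec) for name, spec in platforms.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} GB/J")
```

This yields roughly 4.20, 1.96, and 1.67 GB/J, which lines up with the table's 4.2, 2, and 1.6 GB/J up to rounding.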

Quick Start: Python 🐍

1️⃣ Install via pip: pip install stringzilla
2️⃣ Import classes: from stringzilla import Str, File, Strs
3️⃣ Unleash the beast with built-in methods for string operations. 🎉

Basic Usage 🛠️

StringZilla offers two interchangeable classes for your string and file munching needs:

from stringzilla import Str, File

text1 = Str('some-string')
text2 = File('some-file.txt')

Basic Operations 📝

  • Length: len(text) -> int
  • Substring check: 'substring' in text -> bool
  • Indexing: text[42] -> str
  • Slicing: text[42:46] -> str

Advanced Operations 🧠

  • text.contains('substring', start=0, end=9223372036854775807) -> bool
  • text.find('substring', start=0, end=9223372036854775807) -> int
  • text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
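
The allowoverlap flag is easiest to see on a tiny input. The sketch below is a plain-Python reference for the semantics the signature suggests. The helper count_occurrences is hypothetical and not part of stringzilla. The default end value in the signatures above equals sys.maxsize on 64-bit CPython.

```python
import sys

def count_occurrences(text: str, sub: str, start: int = 0,
                      end: int = sys.maxsize, allowoverlap: bool = False) -> int:
    """Reference semantics only; hypothetical helper, not part of stringzilla."""
    if not sub:
        raise ValueError("empty substring")
    end = min(end, len(text))
    count, pos = 0, start
    while True:
        pos = text.find(sub, pos, end)
        if pos == -1:
            return count
        count += 1
        # Step one character to allow overlaps, or a full match length otherwise.
        pos += 1 if allowoverlap else len(sub)

non_overlapping = count_occurrences("aaaa", "aa")
overlapping = count_occurrences("aaaa", "aa", allowoverlap=True)
```

Here non_overlapping is 2 (matches at 0 and 2), while overlapping is 3 (matches at 0, 1, and 2).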

Splitting and Line Operations 🍕

  • text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
  • text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
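
The README does not spell out what keepseparator does. One plausible reading is that each piece keeps its trailing separator, much like str.splitlines(keepends=True). The sketch below implements that assumed behavior on plain Python strings. split_reference is a hypothetical helper and returns a list, not a Strs.

```python
import sys
from typing import List

def split_reference(text: str, separator: str = " ",
                    maxsplit: int = sys.maxsize,
                    keepseparator: bool = False) -> List[str]:
    """Hypothetical reference for the assumed semantics; returns a plain list."""
    if not separator:
        raise ValueError("empty separator")
    parts, pos, splits = [], 0, 0
    while splits < maxsplit:
        hit = text.find(separator, pos)
        if hit == -1:
            break
        # Assumed behavior: the separator stays attached to the piece before it.
        stop = hit + len(separator) if keepseparator else hit
        parts.append(text[pos:stop])
        pos = hit + len(separator)
        splits += 1
    parts.append(text[pos:])  # the tail after the last separator
    return parts

plain = split_reference("a b c")
kept = split_reference("a b c", keepseparator=True)
```

Under this reading, plain is ['a', 'b', 'c'] and kept is ['a ', 'b ', 'c'].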

Collection-Level Operations 🎲

Once split into a Strs object, you can sort, shuffle, and more:

lines = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)

Sorted or shuffled copies? No problemo!

sorted_copy = lines.sorted()
shuffled_copy = lines.shuffled(seed=42)

Appending and extending? Easy peasy!

lines.append('Pythonic string')
lines.extend(shuffled_copy)
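
The naming mirrors Python's own in-place versus copying convention. The sketch below shows that convention with a built-in list and the random module as stand-ins, so it runs without stringzilla installed. It is an analogy and makes no claim about how Strs is implemented.

```python
import random

lines = ["pear", "apple", "fig"]

# Copying variants leave the original untouched (analogous to sorted()/shuffled()).
sorted_copy = sorted(lines)
shuffled_copy = random.Random(42).sample(lines, k=len(lines))
assert lines == ["pear", "apple", "fig"]

# In-place variants mutate the collection (analogous to sort()/shuffle()).
lines.sort()
random.Random(42).shuffle(lines)

# A fixed seed makes the shuffle reproducible from the same starting order.
again = sorted_copy.copy()
random.Random(42).shuffle(again)
```

Because both shuffles start from the same sorted order and the same seed, again ends up equal to lines.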

So what are you waiting for? Unleash the Godzilla of string libraries on your code today! 🦖🔥

Quick Start: C 🛠️🔥

Building a database, an operating system, or a runtime for your new fancy programming language? Why settle for LibC when you can unleash the Godzilla of string libraries? 🦖

#include "stringzilla.h"

// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};

// Count occurrences of a character like a boss 😎
size_t count = strzl_naive_count_char(haystack, 'a');

// Find a character like you're searching for treasure 🏴‍☠️
size_t position = strzl_naive_find_char(haystack, 'a');

// Find a substring like it's Waldo 🕵️‍♂️
size_t substring_position = strzl_naive_find_substr(haystack, needle);

// Sort an array of strings like you're Marie Kondo 🗂️
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);
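
The field names in strzl_array_t (your_order, your_get_begin, your_get_length) suggest an indirect sort: the library permutes an array of indices and fetches each string through callbacks, so the strings themselves never move. That is a reading of the names, not documented behavior. The Python sketch below illustrates the general technique with hypothetical storage and helper names.

```python
from typing import Callable, List

def sort_order(order: List[int], get_string: Callable[[int], bytes]) -> None:
    """Sort `order` in place so that get_string(order[i]) is ascending."""
    order.sort(key=get_string)

# Hypothetical storage: strings packed into one buffer with offsets and lengths.
buffer = b"pearapplefig"
begins = [0, 4, 9]
lengths = [4, 5, 3]

def get_string(index: int) -> bytes:
    begin = begins[index]
    return buffer[begin:begin + lengths[index]]

order = list(range(len(begins)))
sort_order(order, get_string)
sorted_strings = [get_string(i) for i in order]
```

Only order changes (it becomes [1, 2, 0]). The packed buffer stays as it was, which is what makes this style attractive for databases and other systems that own their string storage.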

Contributing: Be a Part of the Monster Squad! 👾

Ready to contribute? Here's how you can set up your dev environment and run some tests.

Development Scripts 📜

# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x

# Install without dependencies
pip install -e . --no-index --no-deps

Benchmarking 🏋️‍♂️

To benchmark on some custom file and pattern combinations:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
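
The synthetic setup above pairs a repeating "abcd" haystack with the needle "abce", which shares a three-byte prefix with the pattern yet never matches, so any search has to scan the whole input. The sketch below reproduces that idea at a much smaller scale using bytes.find. It is not what scripts/bench.py does internally, and the helper names are assumptions.

```python
import time

def make_haystack(pattern: bytes, length: int) -> bytes:
    """Repeat `pattern` until the result is exactly `length` bytes long."""
    repeats = length // len(pattern) + 1
    return (pattern * repeats)[:length]

def throughput_mb_per_s(haystack: bytes, needle: bytes) -> float:
    """Time one full scan with bytes.find and report megabytes per second."""
    started = time.perf_counter()
    position = haystack.find(needle)
    elapsed = time.perf_counter() - started
    assert position == -1, "the needle is expected to be absent"
    return len(haystack) / max(elapsed, 1e-9) / 1e6

haystack = make_haystack(b"abcd", 1_000_000)  # far smaller than the 1e9 used above
rate = throughput_mb_per_s(haystack, b"abce")
print(f"bytes.find: {rate:.0f} MB/s")
```

A single timed run on a 1 MB input is noisy, so treat the number as a smoke test and use larger inputs with repeated runs for anything you plan to quote.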

Packaging 📦

To validate packaging:

cibuildwheel --platform linux

Compiling C++ Tests 🧪

# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

So, are you ready to join the Monster Squad and make StringZilla even more epic? Let's do this! 🦖🚀

License 📜

Feel free to use the project under the Apache 2.0 or the three-clause BSD license, whichever you prefer.

