Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rules for \0\0 #57

Open
faassen opened this issue Feb 11, 2025 · 4 comments
Open

rules for \0\0 #57

faassen opened this issue Feb 11, 2025 · 4 comments

Comments

@faassen
Copy link
Collaborator

faassen commented Feb 11, 2025

I've been reading your changes with interest. One thing looks confusing:

test_sais_too_many_trailing_zero confirms that you cannot have \0\0 at the end of a string. This made me wonder whether "empty strings" (nothing between \0\0) is forbidden. But then I found that test_sais_too_many_trailing_zero explicitly allows this!

What is the rule?

@ajalab
Copy link
Owner

ajalab commented Feb 12, 2025

#[test]
#[should_panic]
fn test_sais_too_many_trailing_zero() {
let text = "toomanyzeros\0\0".to_string().into_bytes();
let converter = IdConverter::with_size(std::mem::size_of::<u8>() as u64);
build_suffix_array(&text, &converter);
}

Ah, providing \0\0 at the end of a text is prohibited now (see #[should_panic]). Maybe I should have written a comment about that.

@faassen
Copy link
Collaborator Author

faassen commented Feb 12, 2025

So why is \0\0 forbidden at the end but not elsewhere?

@ajalab
Copy link
Owner

ajalab commented Feb 14, 2025

This restriction originates from the SA-IS algorithm to build suffix arrays.

SA-IS categorizes every character in a text into three types: S-Type, L-Type, and LMS-type. This algorithm expects that the given text must end with an LMS-type character, which is usually the end-marker zero. Appending more zeros to the text breaks this assumption.

@faassen
Copy link
Collaborator Author

faassen commented Feb 20, 2025

So if I get this right:

\0\0foo\0

is allowed. this is one empty string and one string foo.

Yet

\0foo\0\0

is not, you can't have foo and then an empty string.

So this means an empty string at the end is forbidden, which seems like an odd exception. I think the rule should be that either empty strings are completely forbidden or not at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants