Half/full width-insensitive regular expressions #23028

Open
jidanni opened this issue Feb 26, 2025 · 4 comments

@jidanni
Member

jidanni commented Feb 26, 2025

The year was 19xx. The /i case-insensitivity regexp modifier was born.

Fast forward. The year is 2025. More insensitivity modifiers are needed.
E.g., one that breaks down the barrier between

    $ unicode ４ 4|grep U+
    U+FF14 FULLWIDTH DIGIT FOUR
    U+0034 DIGIT FOUR

Also, a way to make /i break this barrier:

    $ unicode Ａ ａ|grep U+
    U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
    U+FF41 FULLWIDTH LATIN SMALL LETTER A
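
For reference, here is a small sketch of how these characters behave in stock Perl regexes today. The variable names and the `%result` hash are only for illustration; nothing here is a proposed API.

```perl
use strict;
use warnings;

my $fw_four = "\x{FF14}";   # FULLWIDTH DIGIT FOUR
my $fw_A    = "\x{FF21}";   # FULLWIDTH LATIN CAPITAL LETTER A
my $fw_a    = "\x{FF41}";   # FULLWIDTH LATIN SMALL LETTER A

my %result = (
    # A literal "4" in a pattern is U+0034 only, so this does not match.
    literal_4        => ( $fw_four =~ /4/      ? 1 : 0 ),
    # \d covers every Unicode decimal digit unless /a is in effect.
    digit_class      => ( $fw_four =~ /\d/     ? 1 : 0 ),
    # /i already folds case within the fullwidth letters...
    fullwidth_case   => ( $fw_A    =~ /$fw_a/i ? 1 : 0 ),
    # ...but it does not cross from fullwidth to ASCII.
    cross_width_case => ( $fw_A    =~ /a/i     ? 1 : 0 ),
);
```

So the gap is the cross-width case: `/i` handles Ａ versus ａ, but neither matches a plain `a` or `A`.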

https://unix.stackexchange.com/questions/791654/how-to-make-perl-half-full-width-insensitive-regular-expressions

@guest20

guest20 commented Feb 26, 2025

Your code example:

    if (/大茅埔段32(7|8|9).地/ ||
        /大茅埔段３２(７|８|９).地/)     {...}

Is there a reason to not modify it to be more like this?

    if ( double_width_to_single_width($_) =~ /大茅埔段32(7|8|9).地/ ) { ... }

Your version will also capture double-width 7, 8 and 9, so you'll have to either normalize the capture after the match or keep doubling your code throughout the rest of your script wherever you use those captured values (say, if you look one up in a hash or pass it to a sub).
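
A minimal sketch of what such a helper might look like. The name `double_width_to_single_width` comes from the snippet above, but the body is an assumption: it only maps the fullwidth ASCII block (U+FF01..U+FF5E) and the ideographic space (U+3000) onto their ASCII counterparts, and ignores everything else (halfwidth katakana, for instance).

```perl
use strict;
use warnings;
use utf8;   # this source contains literal non-ASCII characters

# Hypothetical helper: shift U+FF01..U+FF5E down to U+0021..U+007E
# and turn U+3000 IDEOGRAPHIC SPACE into an ordinary space.
sub double_width_to_single_width {
    my ($s) = @_;
    $s =~ tr/\x{FF01}-\x{FF5E}\x{3000}/\x{21}-\x{7E}\x{20}/;
    return $s;
}

my @captured;
for my $line ( "大茅埔段328-地", "大茅埔段３２８－地" ) {
    # Match against a normalized copy, so $1 is always an ASCII digit.
    my $ascii = double_width_to_single_width($line);
    push @captured, $1 if $ascii =~ /大茅埔段32(7|8|9).地/;
}
```

Because the match runs on the normalized copy, the captured digit can go straight into a hash lookup or a sub call without a second normalization step.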

It's worth noting that "width" is not binary. Some characters are width-ambiguous and some have Neutral width. There are also different names depending on whether a character was originally half-width.
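
To make the "not binary" point concrete, here is a rough sketch that reports which East_Asian_Width class a character falls in, using the `\p{East_Asian_Width=...}` regex property. The sub name `east_asian_width` and the lookup table are made up for this example.

```perl
use strict;
use warnings;

# One anchored pattern per East_Asian_Width value (six in total).
my @EAW_CLASSES = (
    [ Fullwidth => qr/\A\p{East_Asian_Width=Fullwidth}\z/ ],
    [ Halfwidth => qr/\A\p{East_Asian_Width=Halfwidth}\z/ ],
    [ Wide      => qr/\A\p{East_Asian_Width=Wide}\z/      ],
    [ Narrow    => qr/\A\p{East_Asian_Width=Narrow}\z/    ],
    [ Ambiguous => qr/\A\p{East_Asian_Width=Ambiguous}\z/ ],
    [ Neutral   => qr/\A\p{East_Asian_Width=Neutral}\z/   ],
);

# Hypothetical helper: return the class name for a single character,
# or undef if the argument is not exactly one character.
sub east_asian_width {
    my ($char) = @_;
    for my $entry (@EAW_CLASSES) {
        my ( $name, $re ) = @$entry;
        return $name if $char =~ $re;
    }
    return;
}

my $fullwidth_four = east_asian_width("\x{FF14}");   # "Fullwidth"
my $ascii_four     = east_asian_width("4");          # "Narrow"
```

Note that an ordinary CJK ideograph such as 大 is "Wide" rather than "Fullwidth", and halfwidth katakana are "Halfwidth" rather than "Narrow", so a width-insensitive modifier would have to decide which of these pairs it bridges.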

I propose the modifier for a width-insensitive match be both /₩‡ and /￦†, because the Won Sign looks like a crossed-out W, which comes with a built-in mnemonic.

__
‡. U+20A9 ₩ WON SIGN
†. U+FFE6 ￦ FULLWIDTH WON SIGN

@khwilliamson
Contributor

There are lots more ways of wanting to match things than just by width. Making it all hang together cleanly is a huge task. I know of no language that implements this fully, though Raku is much further along than, I think, any other. In Perl 5, this is accomplished by normalizing both operands before operating on them. Unicode::Normalize is furnished for this purpose. You need a compatibility normalization (NFKC or NFKD).
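
A short sketch of the "normalize both operands" approach, using `NFKC` from Unicode::Normalize. The helper name `nfkc_contains` is hypothetical. Keep in mind that NFKC folds more than width (superscripts, ligatures and other compatibility characters are also mapped), so this is broader than a pure width-insensitive match.

```perl
use strict;
use warnings;
use utf8;   # this source contains literal non-ASCII characters
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: true if $needle occurs in $haystack once both
# have been put through compatibility normalization.
sub nfkc_contains {
    my ( $haystack, $needle ) = @_;
    return index( NFKC($haystack), NFKC($needle) ) >= 0 ? 1 : 0;
}

# Fullwidth digits in the subject, ASCII digits in the needle.
my $found = nfkc_contains( "大茅埔段３２７－地", "段327" );

# The same idea for a regex: normalize the subject, write the pattern in ASCII.
my ($digit) = NFKC("大茅埔段３２７－地") =~ /大茅埔段32(7|8|9).地/;
```

Here the capture in `$digit` is already an ASCII digit, since it was taken from the normalized string.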

You can also use Unicode::UCD and the NFKC_Casefold property to accomplish this kind of task.
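
As a rough stand-in for that, the sketch below combines `NFKC` with the built-in `fc()` case fold. This is an assumption-laden approximation, not the NFKC_Casefold property itself: it does not go through Unicode::UCD, and the two can differ for some characters. The helper name `fold_width_and_case` is made up for this example.

```perl
use strict;
use warnings;
use feature 'fc';                    # fc() needs Perl 5.16 or later
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: compatibility-normalize, then case-fold.
# Approximates, but is not identical to, the NFKC_Casefold mapping.
sub fold_width_and_case {
    my ($s) = @_;
    return fc( NFKC($s) );
}

# FULLWIDTH LATIN CAPITAL LETTER A compares equal to ASCII "a" after folding.
my $same = fold_width_and_case("\x{FF21}") eq fold_width_and_case("a") ? 1 : 0;
```

Comparing the folded forms of both operands gives a match that ignores both width and case for the Latin letters and digits discussed in this issue.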

@jidanni
Member Author

jidanni commented Feb 27, 2025

Thanks. I hope all this gets implemented one day. By the way, I think all instruction characters should stick with ASCII. I.e., sorry about your won sign.

@guest20

guest20 commented Feb 27, 2025

@jidanni well, you can't won them all
