This package contains a standalone Ruby executable source file, bin/ruby_unicode_prop, a command to be run from a terminal. It outputs (to STDOUT) the Unicode characters and/or their hexadecimal codepoints that satisfy one or more given expressions defined for Regexp in Ruby. Specifically, they are \p{XXX}-type expressions (e.g., \p{Katakana}) for Unicode, as well as [[:blank:]]-type expressions for the POSIX representation.
Some supplementary files are found in the top and test directories, none of which is essential to run the command.
The help doc is viewable with the -h (or --help) option, which covers all the basics:
% /YOUR/INSTALLED/PATH/ruby_unicode_prop -h
USAGE: ruby_unicode_prop [options] Property1 [Property2, ...]
Print all the characters and/or their hex-codepoints that have
the given "Unicode property" used in Ruby Regexp like \p{Currency_Symbol}
(or POSIX expression like [[:blank:]] if -p option is given).
Options:
-c, --[no-]without-codepoint Print characters only? (Def: false)
-n, --[no-]without-char Print codepoints only? (Def: false)
-d, --delimiter=CHAR Delimeter in output.
-l, --[no-]lowercase Lower cases alphabets are used for Hex in codepoints (Def: false)
-p, --[no-]posix Use POSIX expression instead of Unicode (Def: false)
--[no-]list-property Print all the Ruby Unicode properties and exit.
Note1: Delimeter means one
(1) between multiple characters and codepoints if either of -n or -c is specified
(Default: Null for -c (characters only) and a new line for -n.
(2) between the number and character of each pair if both are specified
(Def: a whitespace), whereas the delimeter between pairs is always a newline.
To specify a newline as a delimiter, give 'NL'
Note2: Properties differ for '-p', 'ascii' in POSIX and 'ASCII' in Unicode.
The reference file (used by the --list-property option) is dynamically retrieved from
https://github.com/k-takata/Onigmo/blob/master/doc/UnicodeProps.txt
The definition file in the Ruby source tree is at /enc/unicode/name2ctype.h
The output of this command is generated by the Ruby that runs it, and hence is fully consistent with the Regexp matching results for the same property names in any application run with the same Ruby. That also means the output can depend on the version of Ruby you run, because the Unicode table has expanded over the years (with emojis, for example) and will keep doing so.
Currently, the searches by this command are limited to codepoints up to the end of the second Supplementary Plane (Supplementary Ideographic Plane), which should be enough in most practical cases as of 2019 and probably for some time to come.
In fact, in many practical cases, searching over only the Basic Multilingual Plane (up to 0xFFFF) is probably sufficient, though the second Supplementary Plane does include groups of CJK characters, some of which are still in occasional use today. The maximum codepoint to search for is defined in the constant MAX_UNICODE_HEX near the beginning of the source code. Setting it to a lower value can speed up the processing considerably.
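The effect of the search ceiling can be illustrated with a minimal sketch. This is not the command's actual implementation: the constant MAX_CODEPOINT and the method chars_with_property are hypothetical names, and the value 0x2FFFF is an assumption based on the stated limit (the end of the Supplementary Ideographic Plane). The sketch tests every codepoint up to the ceiling against a \p{...} Regexp, so the running time grows with the ceiling.

```ruby
# Illustrative sketch only; the names here are hypothetical and are not
# taken from bin/ruby_unicode_prop.
MAX_CODEPOINT = 0x2FFFF  # assumed: last codepoint of the Supplementary Ideographic Plane

# Returns [codepoint, character] pairs, in codepoint order, for every
# character up to +max+ that matches \p{name}.
def chars_with_property(name, max = MAX_CODEPOINT)
  re = Regexp.new("\\p{#{name}}")  # raises RegexpError for an unknown name
  pairs = []
  (0..max).each do |cp|
    next if cp.between?(0xD800, 0xDFFF)  # surrogates are not valid characters in UTF-8
    ch = [cp].pack("U")                  # codepoint -> one-character UTF-8 String
    pairs << [cp, ch] if re =~ ch
  end
  pairs
end

# A lower ceiling (here 0x03FF) scans far fewer codepoints than the default.
chars_with_property("Greek", 0x03FF).first(3).each do |cp, ch|
  puts format("%04X %s", cp, ch)  # upper-case hex, a space, then the character
end
```

Passing a smaller second argument plays the same role as lowering MAX_UNICODE_HEX in the real script: fewer codepoints are tested, at the cost of missing characters above the ceiling.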
A typical example is as follows:
% bin/ruby_unicode_prop Greek
0370 Ͱ
0371 ͱ
0372 Ͳ
……(snipped)……
0391 Α
0392 Β
0393 Γ
0394 Δ
……(snipped)
For some, POSIX (bracket) expressions are supported:
% bin/ruby_unicode_prop -p -d '___ ' punct
0021___ !
0022___ "
0023___ #
0024___ $
0025___ %
0026___ &
0027___ '
0028___ (
0029___ )
002A___ *
002B___ +
……(snipped)
Note that the corresponding property name for the Unicode \p{} form (a backslash followed by p and curly brackets) is Punct; it is capitalized, unlike the POSIX expression name.
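For reference, the two notations look as follows in plain Ruby. This is only a small illustration of the Regexp syntax, independent of the command itself; the variable names are made up for the example.

```ruby
# Illustrative only: the same character class written in the two notations.
unicode_form = /\p{Punct}/     # Unicode-property notation
posix_form   = /[[:punct:]]/   # POSIX bracket notation

["!", "a"].each do |ch|
  puts "#{ch.inspect}: \\p{Punct}=#{!!(unicode_form =~ ch)}, [[:punct:]]=#{!!(posix_form =~ ch)}"
end
```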
You can also specify multiple properties. The order of the arguments does not matter, and the result is always in codepoint order. No duplicates are produced, even if the character ranges of the specified properties overlap. For example:
% bin/ruby_unicode_prop -c Number Terminal_Punctuation
!,.0123456789:;?²³¹¼½¾;……(snipped)
% bin/ruby_unicode_prop -c Number Terminal_Punctuation Close_Punctuation
!),.0123456789:;?]}²³¹¼½¾;……(snipped)
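One way to obtain this behaviour (codepoint order, no duplicates, independent of argument order) is to walk the codepoints once and test each against the union of the property expressions. The following is a minimal sketch under that assumption; it is not necessarily how the command does it, chars_with_any_property is a hypothetical name, and the default ceiling of 0xFF is chosen only to keep the example fast.

```ruby
# Illustrative sketch only; chars_with_any_property is a hypothetical name.
def chars_with_any_property(names, max = 0xFF)
  # One Regexp that matches a character having any of the given properties.
  union = Regexp.union(names.map { |n| Regexp.new("\\p{#{n}}") })
  (0..max).each_with_object([]) do |cp, found|
    next if cp.between?(0xD800, 0xDFFF)  # skip surrogates
    ch = [cp].pack("U")
    found << ch if union =~ ch           # each codepoint is tested exactly once
  end
end

puts chars_with_any_property(%w[Number Terminal_Punctuation]).join
```

Because each codepoint is visited once and in ascending order, a character that has several of the requested properties is still listed only once, and swapping the arguments cannot change the result.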
This script requires Ruby Version 2.0 or above.
If you install it as a standard Ruby Gem package, the executable bin/ruby_unicode_prop should be placed automatically in your command-line search path.
If not, copy it into any directory in your command-line search path. It is a self-contained single file and needs no external library other than the standard library that comes by default with Ruby 2.0.
You may need to modify the first line (shebang line) of the script to suit your environment (this should be unnecessary on Linux and macOS), or run it explicitly with your Ruby command as follows:
Prompt% /YOUR/ENV/ruby /YOUR/INSTALLED/ruby_unicode_prop
The master of this README file, as well as of the entire package, is found at RubyGems/ruby_unicode_prop
The source code is also maintained on GitHub
The Ruby files under the directory test/ are the test scripts. You can run them from the top directory as ruby test/test_****.rb or simply run make test.
None.
- Author
- Masa Sakano < info a_t wisebabel dot com >
- Versions
- The versions of this package follow Semantic Versioning (2.0.0) http://semver.org/
- License
- MI