what's equals regex for control range in chinese #19
Comments
Good question. I've opened an issue for you, nitely/nim-unicodedb#4. I can probably get it done today or tomorrow 😄 |
Glad to hear that. Once it's ready I'll check the performance, since re split takes 60-70 percent of the running time in my package's test case. |
Well, split may create many small substrings/allocations. Maybe you can get rid of it somehow, along these lines:
import re

let reHan = re(r"(*UTF8)\p{Han}+")

proc processSubString(s: openArray[char]) =
  # This allocates, do something that does not allocate :P
  var myss = newString(s.len)
  for i in 0 ..< s.len:
    myss[i] = s[i]
  echo myss

proc main(s: string) =
  var start = 0
  var bounds = (first: 0, last: 0)
  while start < s.len:
    bounds = findBounds(s, reHan, start)
    if bounds.first == -1:
      break
    # Text before the match (non-Han)
    if bounds.first > 0:
      processSubString(toOpenArray(s, start, bounds.first-1))
    # The match itself (Han)
    processSubString(toOpenArray(s, bounds.first, bounds.last))
    start = bounds.last + 1
  # Trailing text after the last match
  if start < s.len:
    processSubString(toOpenArray(s, start, s.len-1))

echo "start1"
main("諸夏foo諸夏bar")
echo "end1"
echo "start2"
main("諸夏foo諸夏")
echo "end2"
echo "start3"
main("foo諸夏bar")
echo "end3"
echo "start4"
main("foo諸夏")
echo "end4"
echo "start5"
main("諸夏bar")
echo "end5"
#[
start1
諸夏
foo
諸夏
bar
end1
start2
諸夏
foo
諸夏
end2
start3
foo
諸夏
bar
end3
start4
foo
諸夏
end4
start5
諸夏
bar
end5
]#
However, last time I checked Nim's string modules ( Next step would be getting rid of regex. |
I just updated unicodedb. The following code does the same as the above code:
import unicode
import unicodedb/scripts

proc isHanCheck(r: Rune): bool =
  # fast ascii check followed by unicode check
  result = r.int > 127 and r.unicodeScript() == sptHan

iterator splitHan(s: string): string =
  var
    i = 0
    j = 0
    k = 0
    r: Rune
    isHan = false
    isHanCurr = false
  # Peek at the first rune without advancing i
  fastRuneAt(s, i, r, false)
  isHanCurr = r.isHanCheck()
  isHan = isHanCurr
  while i < s.len:
    # Advance while the runes stay in the same class (Han / non-Han)
    while isHan == isHanCurr:
      k = i
      if i == s.len:
        break
      fastRuneAt(s, i, r, true)
      isHanCurr = r.isHanCheck()
    yield s[j ..< k]
    j = k
    isHan = isHanCurr

proc main(s: string) =
  for ss in s.splitHan():
    echo ss

main("諸夏foo諸夏bar")
#[
諸夏
foo
諸夏
bar
]# |
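As far as I can tell, the iterator above reads the first rune unconditionally (so an empty string is a problem in a debug build) and can drop a final segment that is only one rune long, e.g. the 諸 in "foo諸". Here is a minimal sketch of a variant that avoids both. Note the assumptions: splitHanSafe and isHanApprox are names made up for this example, and isHanApprox is only a rough stand-in for unicodedb's script lookup that covers the CJK Unified Ideographs block (U+4E00..U+9FFF), so the snippet has no external dependency.

```nim
import std/unicode

# Assumption: rough stand-in for r.unicodeScript() == sptHan; it only
# covers the CJK Unified Ideographs block, not every Han code point.
proc isHanApprox(r: Rune): bool =
  r.int >= 0x4E00 and r.int <= 0x9FFF

# Hypothetical variant of splitHan: yields alternating Han / non-Han runs.
iterator splitHanSafe(s: string): string =
  var
    i = 0             # byte index of the next rune to decode
    segStart = 0      # byte index where the current run starts
    segIsHan = false  # class of the current run
    r: Rune
  while i < s.len:
    let runeStart = i
    fastRuneAt(s, i, r, true)
    let curIsHan = r.isHanApprox()
    if runeStart == 0:
      # The first rune decides the class of the first run
      segIsHan = curIsHan
    elif curIsHan != segIsHan:
      # Class changed: emit the finished run and start a new one
      yield s[segStart ..< runeStart]
      segStart = runeStart
      segIsHan = curIsHan
  # Emit whatever is left; nothing is yielded for an empty string
  if segStart < s.len:
    yield s[segStart ..< s.len]
```

It still allocates one string per run, same as the version above; swapping isHanApprox for the unicodedb check should be a one-line change.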
If you only care about the Han text, yield only when isHan is true:
# ...
if isHan:
  yield s[j ..< k]
# ...
#[
諸夏
諸夏
]#
Oh, don't forget to compile the code in release mode |
@nitely thank you! That helps me a lot! Both versions work fine: #19 (comment) saves 1 second, and the second version brings the time down to 1.5 seconds, which is faster than Python's version (about 2 seconds) in my test case. |
@bung87 Awesome! I'm glad to hear that! 😸 I'll leave this open so I don't forget to implement the |
That's even more wonderful, since it's implemented in Perl, Python, PHP, ... You described how string copies can affect performance so much, which I wasn't aware of before. My package has a big table:
let PROB_EMIT_DATA* = {
  'B': {
    "一": -3.6544978750449433,
    "丁": -8.125041941842026,
    ....
Could you give me more advice to improve performance? |
Sure, I can try 😄. Nim has proper constants, and you would use one here, as in the code below. Using strings as Table keys is slow, since it needs hashing the string plus a string comparison. Using single characters as keys, like in this sketch, should be cheaper:
# untested code
import unicode

# Use enums instead of this when keys are not provided by user
proc toRune(s: string): Rune =
  var n = 0
  fastRuneAt(s, n, result, true)
  if n < s.len:
    raise newException(ValueError, "not a single unicode char")

const ProbEmitData* = {
  'B': {
    "一".toRune: -3.6544978750449433,
    "丁".toRune: -8.125041941842026,
    ....
Next step is getting rid of the hash table. You can try a switch statement instead.
proc ProbEmitDataMap(keyA: char, keyB: Rune): float =
  case keyA
  of 'B':
    case keyB
    of "一".toRune:  # enum would be better if possible
      -3.6544978750449433  # maybe result = ... is needed here
    of "丁".toRune:
      -8.125041941842026
    else:
      raise newException(ValueError, "invalid key")
  # ...
  else:
    raise newException(ValueError, "invalid key")
This may get translated into a jump table (in assembly), a bunch of
As more general advice, it's better to think in C rather than Python, performance-wise. If you don't know C, then learn it. It's a very small language, and there are good books showing what good code looks like and explaining why it's good. It's probably not a good idea to write Nim as if it were C, but at least you'll get an idea of what code Nim may be generating.
About string handling, I'm not sure using toOpenArray is such a great idea. At some point Nim will be smart enough to translate
You can ask this kind of thing on IRC and the forum as well, as there are more knowledgeable people than me there. |
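Since the switch version above is untested, here is a small self-contained sketch of the same idea. To sidestep the question of whether "一".toRune is accepted as a case label, it switches on the rune's integer code point instead (一 is U+4E00, 丁 is U+4E01). The proc name probEmit is made up for this example, and only the two entries quoted in this thread are included.

```nim
import std/unicode

# Hypothetical lookup: nested case statements instead of a hash table.
proc probEmit(state: char, key: Rune): float =
  case state
  of 'B':
    # Switch on the code point so the labels are plain integer literals
    case key.int
    of 0x4E00: result = -3.6544978750449433  # "一"
    of 0x4E01: result = -8.125041941842026   # "丁"
    else: raise newException(ValueError, "invalid key")
  else:
    raise newException(ValueError, "invalid state")

# Usage: runeAt(0) decodes the first rune of a string
echo probEmit('B', "一".runeAt(0))
```

With tens of thousands of entries you would want to generate this proc from the data rather than write it by hand, and whether it beats a table is something to measure.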
Another thing I just thought of is using a flat hash table, like this:
const ProbEmitData* = {
  [Rune('B'.ord), "一".toRune]: -3.6544978750449433,
  [Rune('B'.ord), "丁".toRune]: -8.125041941842026,
  ....
Depending on the table length, this may take a lot more memory, though. Also, there may be better data structures than this, like a "static trie" (not the one with nodes) or another kind of state machine, but it depends on the data. I can't tell for sure without the whole |
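A runnable sketch of the flat-table idea, under some assumptions of mine: instead of an array key it packs the state and the rune into a single int (code points fit in 21 bits), so hashing is just an integer hash. packKey, probEmitFlat and lookup are names invented for this example, and returning NegInf for a missing key is only one possible choice.

```nim
import std/[tables, unicode]

# Hypothetical helper: pack (state, rune) into one int key.
# Unicode code points need at most 21 bits, so the state goes above them.
proc packKey(state: char, r: Rune): int =
  (state.ord shl 21) or r.int

# Single-level table: one lookup instead of two nested ones
let probEmitFlat = {
  packKey('B', "一".runeAt(0)): -3.6544978750449433,
  packKey('B', "丁".runeAt(0)): -8.125041941842026
}.toTable

proc lookup(state: char, r: Rune): float =
  # Assumption: a missing entry is reported as negative infinity
  probEmitFlat.getOrDefault(packKey(state, r), NegInf)

echo lookup('B', "丁".runeAt(0))
```

The table here is built at program start rather than at compile time, which keeps the sketch simple.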
The table is generated by gen_prob_json.py; the original data came from a Python package. You can see the resulting data here: I have now reduced the time to 1.2 seconds and think I may get it down to 1 second (but it throws an index error on a debug build). You mentioned const; I tried that but could not figure it out, since I convert it to a table with |
is |
"龢": -10.61937952828986
  }.newTable
}.newTable
It should end like this, 35235 lines. Hmm, I checked the generated JSON file, and it indeed only has the 'B' key, which I had not noticed. It seems the other states just depend on the 'B' state; it turns out there's only |
I tried the re and nre modules. They work but are very slow, and re is much slower than nre, so I'm looking into this module, because Araq told me nre is also deprecated.