
Commit 5da006e

Merge pull request #43 from mbukeRepo/ft/gpt4-omni-tokenization
chore: setup o200k_base tokenizer
2 parents 86c270c + 27b4e20 commit 5da006e

11 files changed, +360 -4 lines changed

README.md (+4 -2)

@@ -2,7 +2,7 @@

 [![Play with gpt-tokenizer](https://codesandbox.io/static/img/play-codesandbox.svg)](https://codesandbox.io/s/gpt-tokenizer-tjcjoz?fontsize=14&hidenavigation=1&theme=dark)

-`gpt-tokenizer` is a highly optimized Token Byte Pair Encoder/Decoder for all OpenAI's models (including those used by GPT-2, GPT-3, GPT-3.5 and GPT-4). It's written in TypeScript, and is fully compatible with all modern JavaScript environments.
+`gpt-tokenizer` is a highly optimized Token Byte Pair Encoder/Decoder for all OpenAI's models (including those used by GPT-2, GPT-3, GPT-3.5, GPT-4 and GPT-4o). It's written in TypeScript, and is fully compatible with all modern JavaScript environments.

 This package is a port of OpenAI's [tiktoken](https://github.com/openai/tiktoken), with some additional features sprinkled on top.

@@ -11,7 +11,7 @@ OpenAI's GPT models utilize byte pair encoding to transform text into a sequence
 As of 2023, it is the most feature-complete, open-source GPT tokenizer on NPM. It implements some unique features, such as:

 - Support for easily tokenizing chats thanks to the `encodeChat` function
-- Support for all current OpenAI models (available encodings: `r50k_base`, `p50k_base`, `p50k_edit` and `cl100k_base`)
+- Support for all current OpenAI models (available encodings: `r50k_base`, `p50k_base`, `p50k_edit`, `cl100k_base` and `o200k_base`)
 - Generator function versions of both the decoder and encoder functions
 - Provides the ability to decode an asynchronous stream of data (using `decodeAsyncGenerator` and `decodeGenerator` with any iterable input)
 - No global cache (no accidental memory leaks, as with the original GPT-3-Encoder implementation)
@@ -49,6 +49,7 @@ If you wish to use a custom encoding, fetch the relevant script.
 - https://unpkg.com/gpt-tokenizer/dist/p50k_base.js
 - https://unpkg.com/gpt-tokenizer/dist/p50k_edit.js
 - https://unpkg.com/gpt-tokenizer/dist/r50k_base.js
+- https://unpkg.com/gpt-tokenizer/dist/o200k_base.js

 The global name is a concatenation: `GPTTokenizer_${encoding}`.

@@ -150,6 +151,7 @@ chat:
 - `gpt-4-32k-0314` (`cl100k_base`)
 - `gpt-3.5-turbo` (`cl100k_base`)
 - `gpt-3.5-turbo-0301` (`cl100k_base`)
+- `gpt-4o` (`o200k_base`)

 text-only:
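The README hunk above lists one UMD script per encoding and states that the global name is `GPTTokenizer_${encoding}`. As a minimal sketch of that naming pattern, the helper below derives both strings for an encoding name. `umdBundleInfo`, `UmdBundleInfo` and the `EncodingName` union are hypothetical names introduced for illustration; they are not part of the library's API, and the `cl100k_base` URL is assumed to follow the same pattern as the URLs shown.

```typescript
// Encodings named in this commit's README hunk (this union is illustrative).
type EncodingName =
  | 'r50k_base'
  | 'p50k_base'
  | 'p50k_edit'
  | 'cl100k_base'
  | 'o200k_base'

interface UmdBundleInfo {
  scriptUrl: string
  globalName: string
}

// Hypothetical helper: builds the unpkg script URL and the UMD global name
// from the naming pattern described in the README.
function umdBundleInfo(encoding: EncodingName): UmdBundleInfo {
  return {
    scriptUrl: `https://unpkg.com/gpt-tokenizer/dist/${encoding}.js`,
    globalName: `GPTTokenizer_${encoding}`,
  }
}

const o200kBundle = umdBundleInfo('o200k_base')
```

Here `o200kBundle.scriptUrl` matches the URL added by this commit, and `o200kBundle.globalName` is the name a page would look up on the global object after loading that script.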

data/TestPlans.txt (+227)
@@ -1022,3 +1022,230 @@ EncodingName: cl100k_base
10221022
Sample: 🍏🍎🍐🍊🍋🍌🍉🍇🍓🍈🍒🍑
10231023
Encoded: [9468, 235, 237, 9468, 235, 236, 9468, 235, 238, 9468, 235, 232, 9468, 235, 233, 9468, 235, 234, 9468, 235, 231, 9468, 235, 229, 9468, 235, 241, 9468, 235, 230, 9468, 235, 240, 9468, 235, 239]
10241024

1025+
EncodingName: o200k_base
1026+
Sample: a a
1027+
Encoded: [64, 261]
1028+
1029+
EncodingName: o200k_base
1030+
Sample: hello
1031+
Encoded: [24912]
1032+
1033+
EncodingName: o200k_base
1034+
Sample: Hello, World! How are you today? 🌍
1035+
Encoded: [13225, 11, 5922, 0, 3253, 553, 481, 4044, 30, 130321, 235]
1036+
1037+
EncodingName: o200k_base
1038+
Sample: こんにちは、世界!お元気ですか?
1039+
Encoded: [95839, 1395, 28428, 3393, 8930, 6753, 25717, 15121, 7128, 4802]
1040+
1041+
EncodingName: o200k_base
1042+
Sample: Hola, mundo! ¿Cómo estás hoy? 🇪🇸
1043+
Encoded: [49864, 11, 10225, 0, 12873, 46515, 58166, 20502, 30, 173468, 103, 55506, 116]
1044+
1045+
EncodingName: o200k_base
1046+
Sample: Привет, мир! Как дела?
1047+
Encoded: [23881, 131903, 11, 37934, 0, 26029, 78857, 30]
1048+
1049+
EncodingName: o200k_base
1050+
Sample: 안녕하세요, 세상! 오늘 기분이 어때요? 🇰🇷
1051+
Encoded: [14307, 171731, 11, 28126, 8612, 0, 106820, 11061, 15567, 2186, 21252, 41856, 7952, 30, 173468, 108, 55506, 115]
1052+
1053+
EncodingName: o200k_base
1054+
Sample: Bonjour, le monde ! Comment ça va aujourd'hui ? 🇫🇷
1055+
Encoded: [45751, 11, 505, 15807, 1073, 15406, 13590, 3423, 32226, 43820, 1423, 173468, 104, 55506, 115]
1056+
1057+
EncodingName: o200k_base
1058+
Sample: The quick brown fox jumps over 13 lazy dogs. 😺
1059+
Encoded: [976, 4853, 19705, 68347, 65613, 1072, 220, 1311, 29082, 16798, 13, 22861, 118]
1060+
1061+
EncodingName: o200k_base
1062+
Sample: Здравствуйте, это мой первый раз здесь. Что мне делать?
1063+
Encoded: [182298, 11, 8577, 65733, 62134, 4702, 44039, 13, 53319, 27934, 45321, 30]
1064+
1065+
EncodingName: o200k_base
1066+
Sample: હેલો, વિશ્વ! તમે આજે કેમ છો? 🇮🇳
1067+
Encoded: [6094, 11954, 1903, 11, 5059, 71706, 15432, 0, 21720, 1138, 107600, 1138, 3058, 38937, 4289, 1903, 30, 173468, 106, 55506, 111]
1068+
1069+
EncodingName: o200k_base
1070+
Sample: ความรักและการเป็นกันเองเป็นสิ่งสำคัญที่สุดในโลก 🇹🇭
1071+
Encoded: [26224, 1619, 18971, 45798, 11855, 21876, 19373, 3015, 6560, 121316, 21876, 19373, 4406, 2781, 2055, 2795, 75160, 5131, 61134, 3998, 8070, 4406, 21584, 28208, 93469, 173468, 117, 55506, 255]
1072+
1073+
EncodingName: o200k_base
1074+
Sample: Python vs Java: Which programming language should you learn first?
1075+
Encoded: [60502, 10217, 13114, 25, 21580, 23238, 6439, 1757, 481, 4484, 1577, 30]
1076+
1077+
EncodingName: o200k_base
1078+
Sample: A journey of a thousand miles begins with a single step. - Lao Tzu
1079+
Encoded: [32, 12647, 328, 261, 26791, 10753, 18015, 483, 261, 4590, 5983, 13, 533, 144616, 353, 7846]
1080+
1081+
EncodingName: o200k_base
1082+
Sample: Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt. 🇩🇪
1083+
Encoded: [8796, 111745, 39103, 89476, 93295, 9627, 1076, 111745, 39103, 23079, 13, 173468, 102, 55506, 103]
1084+
1085+
EncodingName: o200k_base
1086+
Sample: יש לי כמה שאלות בנוגע לפרויקט החדש שלך. 🇮🇱
1087+
Encoded: [7899, 42151, 60962, 129852, 2433, 34083, 110495, 108591, 181894, 162562, 69019, 13, 173468, 106, 55506, 109]
1088+
1089+
EncodingName: o200k_base
1090+
Sample: Det är en vacker dag i Sverige. 🇸🇪
1091+
Encoded: [3639, 7706, 469, 323, 17798, 8724, 575, 64714, 13, 173468, 116, 55506, 103]
1092+
1093+
EncodingName: o200k_base
1094+
Sample: A ∀ x (P(x) → Q(x)) ∧ (∃x P(x)) → ∃x Q(x)
1095+
Encoded: [32, 35353, 222, 1215, 350, 47, 4061, 8, 15155, 1486, 4061, 915, 35353, 100, 350, 18085, 225, 87, 398, 4061, 915, 15155, 35353, 225, 87, 1486, 4061, 8]
1096+
1097+
EncodingName: o200k_base
1098+
Sample: O Brasil é o maior país da América do Sul. 🇧🇷
1099+
Encoded: [46, 15278, 1212, 293, 15966, 11106, 1033, 45086, 621, 27109, 13, 173468, 100, 55506, 115]
1100+
1101+
EncodingName: o200k_base
1102+
Sample: L'amore è una forza potente che unisce le persone. 🇮🇹
1103+
Encoded: [43, 30344, 510, 6272, 1969, 125511, 111848, 1378, 537, 48541, 505, 40144, 13, 173468, 106, 55506, 117]
1104+
1105+
EncodingName: o200k_base
1106+
Sample: Είναι μια ηλιόλουστη ημέρα στην Ελλάδα. 🇬🇷
1107+
Encoded: [10303, 16239, 33246, 13115, 57330, 2097, 85087, 42851, 122278, 7648, 21399, 112618, 13, 173468, 105, 55506, 115]
1108+
1109+
EncodingName: o200k_base
1110+
Sample: Teslim tarihi yaklaşıyor, projeyi zamanında bitirmemiz gerekiyor. 🇹🇷
1111+
Encoded: [110176, 5406, 162005, 16000, 148409, 17368, 11, 16022, 33468, 30355, 10884, 3546, 2835, 347, 482, 195151, 13, 173468, 117, 55506, 115]
1112+
1113+
EncodingName: o200k_base
1114+
Sample: Det finnes ingen bedre tid enn nå for å starte noe nytt. 🇳🇴
1115+
Encoded: [3639, 145817, 30430, 56755, 8692, 23075, 19937, 395, 7086, 167203, 49921, 66369, 13, 173468, 111, 55506, 112]
1116+
1117+
EncodingName: o200k_base
1118+
Sample: Aanvaard de uitdagingen van het leven met moed en vastberadenheid. 🇳🇱
1119+
Encoded: [68832, 84482, 334, 180964, 1164, 1448, 21987, 1421, 137256, 469, 11332, 718, 9519, 7157, 13, 173468, 111, 55506, 109]
1120+
1121+
EncodingName: o200k_base
1122+
Sample: Chào mừng bạn đến với thế giới của lập trình. 🇻🇳
1123+
Encoded: [1205, 35134, 284, 75104, 22673, 27528, 18019, 46773, 69217, 12153, 96352, 49051, 13, 173468, 119, 55506, 111]
1124+
1125+
EncodingName: o200k_base
1126+
Sample: Dlaczego warto uczyć się języków obcych? 🇵🇱
1127+
Encoded: [136923, 182265, 82074, 337, 150478, 9721, 140914, 3705, 87043, 1067, 55175, 30, 173468, 113, 55506, 109]
1128+
1129+
EncodingName: o200k_base
1130+
Sample: E = mc², uma equação famosa na física. 🇵🇹
1131+
Encoded: [36, 314, 36958, 13848, 11, 3030, 2801, 3890, 96317, 898, 50251, 13, 173468, 113, 55506, 117]
1132+
1133+
EncodingName: o200k_base
1134+
Sample: 你今天遇到什么有趣的事情了吗?🇨🇳
1135+
Encoded: [12370, 47256, 57127, 6946, 10555, 3666, 57922, 1616, 162913, 112451, 4802, 55506, 101, 55506, 111]
1136+
1137+
EncodingName: o200k_base
1138+
Sample: Nå er det tid for å feire med familie og venner. 🇳🇴
1139+
Encoded: [45, 592, 1111, 1476, 8692, 395, 7086, 1193, 594, 1475, 39603, 2085, 131786, 13, 173468, 111, 55506, 112]
1140+
1141+
EncodingName: o200k_base
1142+
Sample: Þetta er góður dagur til að læra eitthvað nýtt. 🇮🇸
1143+
Encoded: [7860, 20476, 1111, 91455, 17041, 8724, 330, 3453, 5993, 29333, 614, 180350, 49697, 1037, 13, 173468, 106, 55506, 116]
1144+
1145+
EncodingName: o200k_base
1146+
Sample: გამარჯობა! როგორ ხართ დღეს? 🇬🇪
1147+
Encoded: [165502, 69106, 24045, 0, 57298, 10892, 10875, 55856, 30, 173468, 105, 55506, 103]
1148+
1149+
EncodingName: o200k_base
1150+
Sample: Mā te whakawhiti kōrero e whai hua ai tātou. 🇳🇿
1151+
Encoded: [44, 2485, 729, 145047, 174352, 92760, 41643, 319, 101354, 76899, 8440, 260, 36813, 283, 13, 173468, 111, 55506, 123]
1152+
1153+
EncodingName: o200k_base
1154+
Sample: Это был незабываемый опыт, который я буду помнить всегда.
1155+
Encoded: [63250, 11066, 37028, 66181, 42684, 6770, 67711, 11, 21903, 3277, 61571, 179329, 34056, 13]
1156+
1157+
EncodingName: o200k_base
1158+
Sample: Διαβάζοντας βιβλία, εμπλουτίζουμε τον εαυτό μας με γνώσεις.
1159+
Encoded: [16611, 5690, 63324, 9153, 92025, 164613, 113428, 11, 109925, 85087, 30711, 9153, 33850, 20894, 4278, 727, 75653, 35170, 9173, 8558, 954, 92830, 13]
1160+
1161+
EncodingName: o200k_base
1162+
Sample: A számítástechnika világa tele van izgalmas lehetőségekkel. 🇭🇺
1163+
Encoded: [32, 70578, 5348, 449, 168649, 3113, 11748, 449, 2225, 5443, 1164, 4297, 8298, 4227, 51215, 53922, 95521, 108844, 13, 173468, 255, 55506, 118]
1164+
1165+
EncodingName: o200k_base
1166+
Sample: Vždy je dobré mít plán B, pokud něco nevyjde. 🇨🇿
1167+
Encoded: [53, 99728, 1264, 54560, 377, 98517, 192660, 418, 11, 118907, 134570, 453, 16670, 56244, 13, 173468, 101, 55506, 123]
1168+
1169+
EncodingName: o200k_base
1170+
Sample: Dragostea e un sentiment minunat care ne unește pe toți. 🇷🇴
1171+
Encoded: [25765, 564, 12932, 319, 537, 39160, 182050, 266, 2631, 453, 2463, 74495, 1045, 316, 20660, 13, 173468, 115, 55506, 112]
1172+
1173+
EncodingName: o200k_base
1174+
Sample: دیکھو، آسمان میں کتنی تارے ہیں! 🇵🇰
1175+
Encoded: [547, 55459, 417, 1368, 3382, 11248, 1195, 6431, 144008, 14148, 112711, 1531, 12406, 0, 173468, 113, 55506, 108]
1176+
1177+
EncodingName: o200k_base
1178+
Sample: Nenda polepole na ujifunze kila siku. 🇹🇿
1179+
Encoded: [45, 5968, 25059, 112657, 898, 62112, 366, 119365, 52237, 54647, 13, 173468, 117, 55506, 123]
1180+
1181+
EncodingName: o200k_base
1182+
Sample: Каква е твоята любима храна? 🇧🇬
1183+
Encoded: [29831, 2224, 2404, 70888, 8886, 2734, 13230, 27621, 2442, 73698, 30, 173468, 100, 55506, 105]
1184+
1185+
EncodingName: o200k_base
1186+
Sample: Sträva alltid efter att bli en bättre version av dig själv.
1187+
Encoded: [3504, 450, 2873, 63479, 22852, 1927, 27757, 469, 100580, 3926, 1452, 3807, 71554, 13]
1188+
1189+
EncodingName: o200k_base
1190+
Sample: Філософія - це наука про знання. 🇺🇦
1191+
Encoded: [10334, 17058, 107824, 30929, 533, 54543, 1235, 59929, 4964, 41072, 17561, 13, 173468, 118, 55506, 99]
1192+
1193+
EncodingName: o200k_base
1194+
Sample: Το πρόγραμμα αυτό είναι πολύ ενδιαφέρον. 🇬🇷
1195+
Encoded: [63423, 198704, 43845, 17278, 60896, 162904, 171319, 13, 173468, 105, 55506, 115]
1196+
1197+
EncodingName: o200k_base
1198+
Sample: 4gH@!0sT*#(9^%$[x{}j+|Yz6;Q]~8
1199+
Encoded: [19, 70, 39, 31, 0, 15, 82, 51, 9, 2, 7, 24, 61, 4, 3, 58, 87, 12083, 73, 10, 91, 56, 89, 21, 26, 48, 60, 93, 23]
1200+
1201+
EncodingName: o200k_base
1202+
Sample: wNb)I<>#:i^P]*cR8ytUx1Q`6O@z/
1203+
Encoded: [86, 67111, 8, 40, 28052, 97210, 72, 61, 47, 18579, 66, 49, 23, 5240, 182325, 16, 48, 63, 21, 46, 31, 89, 14]
1204+
1205+
EncodingName: o200k_base
1206+
Sample: ÄÜö¿¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
1207+
Encoded: [12921, 8858, 573, 11986, 20407, 61242, 18943, 43470, 43625, 41468, 18596, 64259, 19742, 25661, 4244, 74285, 8980, 98049, 6793, 32438, 13848, 45681, 14737, 39621, 69022, 5366, 68284, 84125, 11006, 1924, 43439, 27124, 75174, 11986]
1208+
1209+
EncodingName: o200k_base
1210+
Sample: ƒšŠŒŽƒšŠŒŽƒšŠŒŽƒšŠŒŽƒšŠŒŽƒšŠŒŽ
1211+
Encoded: [99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915]
1212+
1213+
EncodingName: o200k_base
1214+
Sample: 5ħÅŸēýïūē$%#^*()_+{[ö&!@#?>|,.<>
1215+
Encoded: [20, 5762, 13631, 198355, 6238, 1840, 9954, 7637, 6238, 3, 4, 2, 61, 9, 416, 62, 10, 90, 58, 573, 5, 0, 31, 2, 10730, 91, 26887, 28052]
1216+
1217+
EncodingName: o200k_base
1218+
Sample: 1B4t#%&*()_+dF5g^hJk7LmN0pQrS<>?
1219+
Encoded: [16, 33, 19, 83, 2, 4, 5, 9, 416, 62, 10, 67, 37, 20, 70, 61, 71, 41, 74, 22, 196093, 45, 15, 79, 135047, 50, 28052, 30]
1220+
1221+
EncodingName: o200k_base
1222+
Sample: ¬§±²³µ¶·¹ºª«»¦©¯°±!@#$%^&*()_+
1223+
Encoded: [74285, 18596, 32438, 13848, 45681, 39621, 69022, 5366, 84125, 11006, 25661, 4244, 1924, 41468, 19742, 98049, 6793, 32438, 0, 31, 108156, 108254, 5, 9, 416, 62, 10]
1224+
1225+
EncodingName: o200k_base
1226+
Sample: 8mR5*w7^a$!F(0%#J9@X6vZ1)nU3]_Y/
1227+
Encoded: [23, 76, 49, 20, 147727, 22, 61, 64, 3, 0, 37, 7, 15, 4, 2, 41, 24, 31, 55, 21, 85, 57, 16, 143612, 52, 18, 167793, 56, 14]
1228+
1229+
EncodingName: o200k_base
1230+
Sample: 😊😀😁😂🤣😃😄😅😆😉😊😋😎😍😘😗😙😚☺️🙂🤗🤔
1231+
Encoded: [102630, 84083, 156437, 41736, 92916, 13865, 225, 13865, 226, 13865, 227, 13865, 228, 72041, 102630, 13865, 233, 13865, 236, 74762, 122588, 13865, 245, 13865, 247, 13865, 248, 155014, 15148, 37459, 50378, 245, 50378, 242]
1232+
1233+
EncodingName: o200k_base
1234+
Sample: 🤨😐😑😶🙄😏😣😥😮🤐😯😪😫😴😌🤓😛😜😝🤤
1235+
Encoded: [50378, 101, 13865, 238, 13865, 239, 13865, 114, 70125, 226, 13865, 237, 13865, 96, 13865, 98, 13865, 106, 50378, 238, 13865, 107, 13865, 103, 13865, 104, 13865, 112, 13865, 234, 50378, 241, 13865, 249, 13865, 250, 13865, 251, 50378, 97]
1236+
1237+
EncodingName: o200k_base
1238+
Sample: 😒😓😔😕🙃🤑😲😷🤒🤕🤢🤧😈👿👹👺💀☠️
1239+
Encoded: [13865, 240, 13865, 241, 13865, 242, 13865, 243, 70125, 225, 4103, 11566, 13865, 110, 13865, 115, 50378, 240, 50378, 243, 50378, 95, 50378, 100, 13865, 230, 28823, 123, 28823, 117, 28823, 118, 31446, 222, 8434, 254, 15148]
1240+
1241+
EncodingName: o200k_base
1242+
Sample: 😾😿🙀😽😼😻🙈🙉🙊👶👦👧👨👩👴👵👨‍⚕️👩‍⚕️
1243+
Encoded: [13865, 122, 13865, 123, 70125, 222, 13865, 121, 13865, 120, 13865, 119, 70125, 230, 70125, 231, 70125, 232, 28823, 114, 28823, 99, 28823, 100, 28823, 101, 28823, 102, 28823, 112, 28823, 113, 28823, 101, 2524, 84396, 243, 15148, 28823, 102, 2524, 84396, 243, 15148]
1244+
1245+
EncodingName: o200k_base
1246+
Sample: 🌞🌝🌚🌛🌜🌙⭐️🌟💫✨🔥💥☄️🌈☀️🌤️⛅️🌥️
1247+
Encoded: [64364, 252, 64364, 251, 64364, 248, 64364, 249, 64364, 250, 64364, 247, 62160, 15148, 64364, 253, 31446, 104, 97375, 96606, 31446, 98, 8434, 226, 15148, 64364, 230, 8434, 222, 15148, 64364, 97, 15148, 158, 249, 227, 15148, 64364, 98, 15148]
1248+
1249+
EncodingName: o200k_base
1250+
Sample: 🍏🍎🍐🍊🍋🍌🍉🍇🍓🍈🍒🍑
1251+
Encoded: [102415, 237, 102415, 236, 102415, 238, 102415, 232, 102415, 233, 102415, 234, 102415, 231, 102415, 229, 102415, 241, 102415, 230, 102415, 240, 102415, 239]
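The records above follow a three-line layout (`EncodingName:`, `Sample:`, `Encoded:`) separated by blank lines, which is also how the `loadTestPlans` hunk in `src/GptEncoding.test.ts` splits the file. The sketch below shows one way such records could be parsed. `TestPlan`, `stripPrefix` and `parseTestPlans` are hypothetical names, this is not the repository's implementation, and it assumes that no sample contains a newline.

```typescript
interface TestPlan {
  encodingName: string
  sample: string
  encoded: number[]
}

// Strips a required "Key: " prefix from a line and fails loudly if it is missing.
function stripPrefix(line: string | undefined, prefix: string): string {
  if (line === undefined || !line.startsWith(prefix)) {
    throw new Error(`expected a line starting with "${prefix}"`)
  }
  return line.slice(prefix.length)
}

// Splits the file into blank-line-separated records and reads the three fields.
function parseTestPlans(data: string): TestPlan[] {
  return data
    .split('\n\n')
    .filter((record) => record.trim().length > 0)
    .map((record) => {
      const [nameLine, sampleLine, encodedLine] = record.split('\n')
      return {
        encodingName: stripPrefix(nameLine, 'EncodingName: '),
        sample: stripPrefix(sampleLine, 'Sample: '),
        // The token list is written as a JSON-compatible array of integers.
        encoded: JSON.parse(stripPrefix(encodedLine, 'Encoded: ')) as number[],
      }
    })
}

// Two records copied from the data above.
const sampleData = [
  'EncodingName: o200k_base',
  'Sample: a a',
  'Encoded: [64, 261]',
  '',
  'EncodingName: o200k_base',
  'Sample: hello',
  'Encoded: [24912]',
].join('\n')

const plans = parseTestPlans(sampleData)
```

Slicing off the prefix instead of trimming the line keeps any leading or trailing whitespace in a sample intact, which matters because whitespace changes the token sequence.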

package.json (+3 -1)

@@ -12,6 +12,7 @@
     "GPT-3",
     "GPT-3.5",
     "GPT-4",
+    "GPT-4o",
     "NLP",
     "Natural Language Processing",
     "Text Generation",
@@ -77,11 +78,12 @@
     "build": "yarn build:cjs && yarn build:esm && yarn build:umd",
     "build:cjs": "yarn rrun tsc --outDir cjs --module commonjs --target es2022 --project tsconfig-cjs.json",
     "build:esm": "yarn rrun tsc --outDir esm --module esnext --target es2022 && echo '{\"name\": \"gpt-tokenizer\", \"type\": \"module\"}' > ./esm/package.json",
-    "build:umd": "yarn build:umd:cl100k_base && yarn build:umd:p50k_base && yarn build:umd:p50k_edit && yarn build:umd:r50k_base",
+    "build:umd": "yarn build:umd:cl100k_base && yarn build:umd:p50k_base && yarn build:umd:p50k_edit && yarn build:umd:r50k_base && yarn build:umd:o200k_base",
     "build:umd:cl100k_base": "beemo webpack --entry='./src/main.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_cl100k_base' --env 'filename=cl100k_base.js'",
     "build:umd:p50k_base": "beemo webpack --entry='./src/encoding/p50k_base.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_p50k_base' --env 'filename=p50k_base.js'",
     "build:umd:p50k_edit": "beemo webpack --entry='./src/encoding/p50k_edit.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_p50k_edit' --env 'filename=p50k_edit.js'",
     "build:umd:r50k_base": "beemo webpack --entry='./src/encoding/r50k_base.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_r50k_base' --env 'filename=r50k_base.js'",
+    "build:umd:o200k_base": "beemo webpack --entry='./src/encoding/o200k_base.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_o200k_base' --env 'filename=o200k_base.js'",
     "clean": "git clean -dfX --exclude=node_modules src && beemo typescript:sync-project-refs",
     "format": "yarn rrun prettier --write \"./{src,tests,.config}/**/!(*.d).{.js,jsx,ts,tsx,json,md}\"",
     "postinstallDev": "yarn prepare",
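The UMD scripts above follow a regular pattern: one `build:umd:<encoding>` entry per encoding, plus a `build:umd` entry that chains them with `&&`. The sketch below regenerates those strings from a list of encoding names, which makes the two places a new encoding has to be registered easier to see. `umdBuildScripts` is a hypothetical helper, not something the repository contains; the special case for `cl100k_base` reflects the `./src/main.ts` entry shown in the scripts above.

```typescript
// Order matches the build:umd chain after this commit.
const umdEncodings = [
  'cl100k_base',
  'p50k_base',
  'p50k_edit',
  'r50k_base',
  'o200k_base',
]

// Hypothetical generator for the UMD-related entries of the "scripts" object.
function umdBuildScripts(names: string[]): Record<string, string> {
  const scripts: Record<string, string> = {
    // The aggregate script runs every per-encoding build in sequence.
    'build:umd': names.map((name) => `yarn build:umd:${name}`).join(' && '),
  }
  for (const name of names) {
    // In the scripts above, cl100k_base is built from src/main.ts,
    // while every other encoding has its own entry file.
    const entry =
      name === 'cl100k_base' ? './src/main.ts' : `./src/encoding/${name}.ts`
    scripts[`build:umd:${name}`] =
      `beemo webpack --entry='${entry}' --env 'outDir=dist' ` +
      `--env 'moduleTarget=umd' --env 'engineTarget=web' ` +
      `--env 'codeTarget=es2022' --env 'name=GPTTokenizer_${name}' ` +
      `--env 'filename=${name}.js'`
  }
  return scripts
}

const generatedScripts = umdBuildScripts(umdEncodings)
```

With the list above, `generatedScripts['build:umd']` and `generatedScripts['build:umd:o200k_base']` reproduce the two lines this commit adds to `package.json`.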

src/GptEncoding.test.ts (+17 -1)

@@ -26,6 +26,21 @@ const sharedResults = {
 }

 const results = {
+  o200k_base: {
+    space: [220],
+    tab: [197],
+    'This is some text': [2_500, 382, 1_236, 2_201],
+    indivisible: [521, 349, 181_386],
+    'hello 👋 world 🌍': [24_912, 61_138, 233, 2_375, 130_321, 235],
+    decodedHelloWorldTokens: ['hello', ' ', '👋', ' world', ' ', '🌍'],
+    'toString constructor hasOwnProperty valueOf': [
+      935, 916, 9_220, 853, 18_555, 3_895, 1_432, 2_566
+    ],
+    'hello, I am a text, and I have commas. a,b,c': [
+      24_912, 11, 357, 939, 261, 2_201, 11, 326, 357, 679, 179_663, 13, 261,
+      17_568, 22_261,
+    ],
+  },
   cl100k_base: {
     space: [220],
     tab: [197],
@@ -111,7 +126,7 @@ describe.each(encodingNames)('%s', (encodingName: EncodingName) => {
   it('decode token-by-token via generator', () => {
     const str = 'hello 👋 world 🌍'
     const generator = decodeGenerator(result[str])
-    result.decodedHelloWorldTokens.forEach((token) => {
+    result.decodedHelloWorldTokens.forEach((token: string) => {
       expect(generator.next().value).toBe(token)
     })
   })
@@ -243,6 +258,7 @@ function loadTestPlans() {
     p50k_base: [],
     p50k_edit: [],
     r50k_base: [],
+    o200k_base: [],
   }
   testPlanData.split('\n\n').forEach((testPlan) => {
     const [encodingNameLine, sampleLine, encodedLine] = testPlan.split('\n')
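The `decodedHelloWorldTokens` expectation above has the generator yield `' '` and `'👋'` as separate strings, even though the six tokens do not line up one-to-one with characters. One way to get that behaviour is to buffer bytes until a complete UTF-8 sequence is available before emitting text. The sketch below illustrates the idea on raw byte chunks. It is not the library's implementation: `sequenceLength`, `decodeSequence` and `decodeChunks` are hypothetical names, and the way the bytes are split across chunks is an assumption made for illustration.

```typescript
// Number of bytes in the UTF-8 sequence that starts with `lead`.
function sequenceLength(lead: number): number {
  if (lead < 0x80) return 1
  if (lead >= 0xc0 && lead < 0xe0) return 2
  if (lead >= 0xe0 && lead < 0xf0) return 3
  if (lead >= 0xf0 && lead < 0xf8) return 4
  throw new Error(`invalid UTF-8 lead byte: ${lead}`)
}

// Decodes one complete UTF-8 sequence into a string.
function decodeSequence(bytes: number[]): string {
  const lead = bytes[0] ?? 0
  if (bytes.length === 1) return String.fromCodePoint(lead)
  // Keep the payload bits of the lead byte, then add 6 bits per continuation byte.
  let codePoint = lead & (0xff >> (bytes.length + 1))
  for (const byte of bytes.slice(1)) {
    codePoint = (codePoint << 6) | (byte & 0x3f)
  }
  return String.fromCodePoint(codePoint)
}

// Yields text per chunk, holding back bytes of an unfinished character
// until a later chunk completes it. Chunks that complete nothing yield nothing.
function* decodeChunks(chunks: number[][]): Generator<string> {
  let pending: number[] = []
  for (const chunk of chunks) {
    let text = ''
    for (const byte of chunk) {
      pending.push(byte)
      if (pending.length === sequenceLength(pending[0] ?? 0)) {
        text += decodeSequence(pending)
        pending = []
      }
    }
    if (text.length > 0) yield text
  }
}

// Assumed byte chunks for 'hello 👋 world 🌍': each emoji is split so that
// its last byte arrives in a separate chunk.
const helloWorldChunks = [
  [0x68, 0x65, 0x6c, 0x6c, 0x6f], // 'hello'
  [0x20, 0xf0, 0x9f, 0x91], // ' ' plus the first three bytes of 👋
  [0x8b], // last byte of 👋
  [0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64], // ' world'
  [0x20, 0xf0, 0x9f, 0x8c], // ' ' plus the first three bytes of 🌍
  [0x8d], // last byte of 🌍
]

const decodedPieces = Array.from(decodeChunks(helloWorldChunks))
```

With these chunks, `decodedPieces` contains the same six strings as `decodedHelloWorldTokens`: the space is emitted as soon as it is complete, and each emoji appears only once its final byte has arrived.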

src/encoding/o200k_base.ts (+32)

@@ -0,0 +1,32 @@
+/* eslint-disable import/extensions */
+import { convertTokenBytePairEncodingFromTuples } from '../convertTokenBytePairEncodingFromTuples.js'
+import encoder from '../encodings/o200k_base.js'
+import { GptEncoding } from '../GptEncoding.js'
+
+export * from '../specialTokens.js'
+
+const api = GptEncoding.getEncodingApi('o200k_base', () =>
+  convertTokenBytePairEncodingFromTuples(encoder),
+)
+const {
+  decode,
+  decodeAsyncGenerator,
+  decodeGenerator,
+  encode,
+  encodeGenerator,
+  isWithinTokenLimit,
+  encodeChat,
+  encodeChatGenerator,
+} = api
+export {
+  decode,
+  decodeAsyncGenerator,
+  decodeGenerator,
+  encode,
+  encodeChat,
+  encodeChatGenerator,
+  encodeGenerator,
+  isWithinTokenLimit,
+}
+// eslint-disable-next-line import/no-default-export
+export default api

src/encodings/o200k_base.js

+6
Some generated files are not rendered by default.
