😄 HappyNodeTokenizer

A basic Twitter-aware tokenizer for JavaScript environments.

A TypeScript port of HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.

Features

Accurate port of both libraries (verify with npm run test)
TypeScript definitions
Uses generators and memoization for efficiency
Customizable and easy to use

Install

NPM

npm install --save happynodetokenizer

JSR (Deno / Bun)

bunx jsr i @phughesmcr/happynodetokenizer

Usage

HappyNodeTokenizer exports a function called tokenizer(), which takes an optional configuration object (see "The Options Object" below).

Example

import { tokenizer } from 'happynodetokenizer';
// or import * as mod from "@phughesmcr/happynodetokenizer"; if using JSR

const text = 'RT @ #happyfuncoding: this is a typical Twitter tweet :-)';

// these are the default options
const opts = {
  'mode': 'stanford',
  'normalize': undefined,
  'preserveCase': true,
};

// create a tokenizer instance with our options
const myTokenizer = tokenizer(opts);

// calling myTokenizer returns a generator function
const tokenGenerator = myTokenizer(text);

// you can turn the generator into an array of token objects like this:
const tokens = [...tokenGenerator()];

// you can also convert the token objects to an array of strings like this:
const values = Array.from(tokens, (token) => token.value);

Output

The tokens variable in the above example will look like this:

[
  { end: 1, start: 0, tag: 'word', value: 'rt' },
  { end: 3, start: 3, tag: 'punct', value: '@' },
  { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },
  { end: 20, start: 20, tag: 'punct', value: ':' },
  { end: 25, start: 22, tag: 'word', value: 'this' },
  { end: 28, start: 27, tag: 'word', value: 'is' },
  { end: 30, start: 30, tag: 'word', value: 'a' },
  { end: 38, start: 32, tag: 'word', value: 'typical' },
  { end: 46, start: 40, tag: 'word', value: 'twitter' },
  { end: 52, start: 48, tag: 'word', value: 'tweet' },
  { end: 56, start: 54, tag: 'emoticon', value: ':-)' }
]
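
Judging by the sample output, start and end appear to be inclusive character offsets into the input string (for example, 'RT' spans 0 to 1). Assuming that holds, the matched text can be recovered by slicing the input. The sketch below hard-codes a few tokens from the output above; TokenSpan and sliceToken are illustrative names, not part of the library.

```typescript
// Illustrative token shape, based on the sample output above.
interface TokenSpan {
  start: number; // offset of the first character
  end: number;   // offset of the last character (assumed inclusive)
  tag: string;
  value: string;
}

const sampleText = 'RT @ #happyfuncoding: this is a typical Twitter tweet :-)';

// A few tokens copied from the sample output.
const sampleTokens: TokenSpan[] = [
  { end: 1, start: 0, tag: 'word', value: 'rt' },
  { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },
  { end: 56, start: 54, tag: 'emoticon', value: ':-)' },
];

// Hypothetical helper: recover the matched text from the input string.
// `end` is treated as inclusive, so add 1 for String.prototype.slice.
function sliceToken(input: string, token: TokenSpan): string {
  return input.slice(token.start, token.end + 1);
}

const recovered: string[] = sampleTokens.map((t) => sliceToken(sampleText, t));
```

Slicing this way returns the text as it appeared in the input, regardless of any case changes applied to value.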

Where preserveCase in the Options Object is false, each result object may also contain a variation property which presents the token as originally matched if it differs from the value property. E.g.:

[
  { end: 1, start: 0, tag: 'word', value: 'rt', variation: 'RT' },
  { end: 3, start: 3, tag: 'punct', value: '@' },
  { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },
  ...
  { end: 46, start: 40, tag: 'word', value: 'twitter', variation: 'Twitter' },
  ...
]
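
Since variation is only present when it differs from value, code that needs the originally matched form can fall back to value when it is absent. A minimal sketch with tokens hard-coded from the output above; CasedToken and originalForm are hypothetical names used for illustration.

```typescript
// Illustrative token shape with the optional `variation` property.
interface CasedToken {
  start: number;
  end: number;
  tag: string;
  value: string;
  variation?: string; // only present when it differs from `value`
}

const casedTokens: CasedToken[] = [
  { end: 1, start: 0, tag: 'word', value: 'rt', variation: 'RT' },
  { end: 3, start: 3, tag: 'punct', value: '@' },
  { end: 46, start: 40, tag: 'word', value: 'twitter', variation: 'Twitter' },
];

// Hypothetical helper: prefer the originally matched form when available.
function originalForm(token: CasedToken): string {
  return token.variation ?? token.value;
}

const originals: string[] = casedTokens.map((t) => originalForm(t));
```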

The Options Object

The options object and its properties are optional. The defaults are:

{
  'mode': 'stanford',
  'normalize': undefined,
  'preserveCase': true,
};
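
Because every property is optional, a partial options object can be thought of as being merged over these defaults. The sketch below illustrates that idea only. TokenizerOptions, DEFAULT_OPTIONS, and resolveOptions are hypothetical names, and the library may resolve options differently internally.

```typescript
type TokenizerMode = 'stanford' | 'dlatk';
type NormalizationForm = 'NFC' | 'NFD' | 'NFKC' | 'NFKD';

// Hypothetical type mirroring the documented options.
interface TokenizerOptions {
  mode: TokenizerMode;
  normalize: NormalizationForm | null | undefined;
  preserveCase: boolean;
}

// Defaults copied from the object shown above.
const DEFAULT_OPTIONS: TokenizerOptions = {
  mode: 'stanford',
  normalize: undefined,
  preserveCase: true,
};

// Hypothetical helper: user-supplied properties override the defaults.
function resolveOptions(opts: Partial<TokenizerOptions> = {}): TokenizerOptions {
  return { ...DEFAULT_OPTIONS, ...opts };
}

const resolved = resolveOptions({ mode: 'dlatk' });
```

Here only mode is overridden, so normalize and preserveCase keep their default values.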

mode

string - valid options: stanford (default), or dlatk

stanford mode uses the original HappyFunTokenizer pattern. See GitHub.

dlatk mode uses the modified HappierFunTokenizing pattern. See GitHub.

normalize

string - valid options: "NFC" | "NFD" | "NFKC" | "NFKD" (default = undefined)

Normalize strings (e.g., when set, mañana becomes manana).

Normalization is disabled when set to null or undefined (default).
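
For reference, the four valid values are the normalization forms accepted by JavaScript's built-in String.prototype.normalize. Normalization by itself only composes or decomposes characters, so turning mañana into manana also requires removing the combining marks after decomposition. The sketch below shows that two-step idea using only standard string methods. Whether the library performs exactly these steps is an assumption, and stripDiacritics is a hypothetical helper.

```typescript
const word = 'ma\u00F1ana'; // "mañana" with a precomposed ñ (U+00F1)

// NFD splits ñ into "n" followed by a combining tilde (U+0303).
const decomposed = word.normalize('NFD');

// NFC recomposes the pair back into a single ñ.
const recomposed = decomposed.normalize('NFC');

// Hypothetical helper: decompose, then drop combining diacritical marks
// in the range U+0300 to U+036F.
function stripDiacritics(input: string): string {
  return input.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

const stripped = stripDiacritics(word);
```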

preserveCase

boolean - valid options: true (default), or false

Preserves the case of the input string if true, otherwise all tokens are converted to lowercase. Does not affect emoticons.
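
A minimal sketch of the behaviour described above, assuming tokens carry the tag shown in the sample output: every token is lowercased except those tagged emoticon, so an emoticon such as :-D keeps its capital letter. TaggedToken and applyCase are hypothetical names, and the library's own implementation may differ.

```typescript
// Illustrative token shape; only the fields needed here.
interface TaggedToken {
  tag: string;
  value: string;
}

// Hypothetical helper: lowercase a token unless case is preserved
// or the token is an emoticon.
function applyCase(token: TaggedToken, preserveCase: boolean): TaggedToken {
  if (preserveCase || token.tag === 'emoticon') {
    return token;
  }
  return { ...token, value: token.value.toLowerCase() };
}

const mixedCase: TaggedToken[] = [
  { tag: 'word', value: 'Twitter' },
  { tag: 'emoticon', value: ':-D' },
];

const lowered = mixedCase.map((t) => applyCase(t, false));
const untouched = mixedCase.map((t) => applyCase(t, true));
```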

Testing

To compare the results of HappyNodeTokenizer against HappyFunTokenizer and HappierFunTokenizing, run:

npm run test

The goal of this project is to provide an accurate port of HappyFunTokenizer and HappierFunTokenizing. Therefore, any pull requests with test failures will not be accepted.

Acknowledgements

Based on HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.

Uses the "he" library by Mathias Bynens under the MIT license.

License

Shared under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.


Tags

Tag	Stanford	DLATK	Example
phone	✔️	✔️	+1 (800) 123-4567
url	❌	✔️	http://www.youtube.com
url_scheme	❌	✔️	http://
url_authority	❌	✔️	[0-3]
url_path_query	❌	✔️	/index.html?s=search
htmltag	❌	✔️	<em class='grumpy'>
emoticon	✔️	✔️	>:(
username	✔️	✔️	@somefaketwitterhandle
hashtag	✔️	✔️	#tokenizing
punct	✔️	✔️	,
word	✔️	✔️	hello
<UNK>	✔️	✔️	(anything left unmatched)
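
Since every token carries one of these tags, results can be filtered or counted by tag. A minimal sketch over hard-coded sample tokens; TagToken, valuesWithTag, and countByTag are illustrative helpers, not library functions.

```typescript
// Illustrative token shape; only the fields needed here.
interface TagToken {
  tag: string;
  value: string;
}

const taggedSample: TagToken[] = [
  { tag: 'word', value: 'rt' },
  { tag: 'punct', value: '@' },
  { tag: 'hashtag', value: '#happyfuncoding' },
  { tag: 'punct', value: ':' },
  { tag: 'emoticon', value: ':-)' },
];

// Hypothetical helper: collect the values of all tokens with a given tag.
function valuesWithTag(tokens: TagToken[], tag: string): string[] {
  return tokens.filter((t) => t.tag === tag).map((t) => t.value);
}

// Hypothetical helper: count how many tokens carry each tag.
function countByTag(tokens: TagToken[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) {
    counts.set(t.tag, (counts.get(t.tag) ?? 0) + 1);
  }
  return counts;
}

const hashtags = valuesWithTag(taggedSample, 'hashtag');
const tagCounts = countByTag(taggedSample);
```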
