Home
- Read Joel Spolsky's Guide to Unicode and Character Sets.
- Keep in mind that all external resources (files, HTTP pages, etc.) are byte sequences, and are naturally represented in a Node.js program as Buffers or Streams of Buffers.
- When you read from an external resource and want to operate on strings:
  - Know the character encoding of the external resource. It is metadata; in general, you cannot deduce it from the byte sequence itself.
  - Make sure you provide the original Buffers as input to the decoder, along with the correct encoding name.
  - If you already have strings at some point in your program, then decoding has already happened, probably using the default encoding 'utf-8'. You cannot convert them to another encoding at this stage. You need to get the original Buffers, concat() them if needed, and pass them to the decoder. See more details.
  - It is tricky to convert encodings when you get data from a stream (in chunks). Use the Streaming API.
- When you write to an external resource:
  - Decide which encoding you want to use. The most useful is utf-8, which is the default in Node.
  - Use the Streaming API if you work with streams.
  - If you don't encode strings yourself, Node.js will do it for you using the default encoding.
- FYI, JavaScript strings are stored in memory as UTF-16.
- If you work with Chinese ideographs or rare characters outside the Basic Multilingual Plane, be sure to familiarize yourself with surrogate pairs. They can be a pain to work with.
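
The points above about original Buffers, chunked input and surrogate pairs can be illustrated with a minimal sketch. It uses only Node.js built-ins (Buffer and the global TextDecoder) as a stand-in for this module's decoder and Streaming API, and it assumes the external resource is UTF-8 and happens to arrive in two chunks; the variable names are made up for the example.

```javascript
// Assumption: the resource is UTF-8 text, "héllo €", written with escapes.
const original = Buffer.from('h\u00e9llo \u20ac', 'utf8'); // 10 bytes

// Split inside the 3-byte euro sign to imitate an unlucky chunk boundary.
const chunks = [original.subarray(0, 8), original.subarray(8)];

// Wrong: decoding each chunk on its own corrupts the character that
// straddles the boundary (it turns into U+FFFD replacement characters).
const naive = chunks.map((c) => c.toString('utf8')).join('');

// Option 1: keep the original Buffers, concat() them, decode once.
const whole = Buffer.concat(chunks).toString('utf8');

// Option 2: a stateful decoder that carries incomplete byte sequences
// over from one chunk to the next.
const decoder = new TextDecoder('utf-8');
let streamed = '';
for (const chunk of chunks) {
  streamed += decoder.decode(chunk, { stream: true });
}
streamed += decoder.decode(); // flush any buffered bytes

// Surrogate pairs: one character outside the BMP takes two UTF-16 code units.
const rare = '\u{20BB7}';
const codeUnits = rare.length;              // counts UTF-16 code units
const codePoints = Array.from(rare).length; // counts code points

console.log({ naive, whole, streamed, codeUnits, codePoints });
```

Option 1 is fine when the whole resource fits in memory; option 2 is the shape a streaming decoder takes when it does not.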
Q: How are encoding names matched?
A: 1) They are lowercased and all non-alphanumeric characters are removed; 2) the result is used as a key in the iconv.encodings object to retrieve the codec.
Q: How do I add aliases to encodings?
A: In your project, iconv.encodings['newalias'] = 'encoding'. The alias must be lowercase and have all non-alphanumeric characters removed.
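
The matching and aliasing rules from the two answers above can be sketched in plain JavaScript. This is not the module's own code: `encodings` is a hypothetical stand-in for iconv.encodings, the placeholder codec objects and the names `canonicalize` and `lookup` are invented for the example, and following a chain of string aliases is an assumption.

```javascript
// Hypothetical stand-in for iconv.encodings: string values are aliases,
// object values stand in for codecs. A null prototype avoids accidental
// hits on inherited keys such as "constructor".
const encodings = Object.assign(Object.create(null), {
  utf8: { label: 'utf8 codec (placeholder)' },
  win1251: { label: 'win1251 codec (placeholder)' },
  cp1251: 'win1251',
});

// Step 1: lowercase, then remove every non-alphanumeric character.
function canonicalize(name) {
  return String(name).toLowerCase().replace(/[^0-9a-z]/g, '');
}

// Step 2: use the result as a key; follow string aliases until a codec is found.
function lookup(name) {
  let entry = encodings[canonicalize(name)];
  const seen = new Set();
  while (typeof entry === 'string') {
    if (seen.has(entry)) throw new Error('Alias loop at ' + entry);
    seen.add(entry);
    entry = encodings[entry];
  }
  if (!entry) throw new Error('Encoding not recognized: ' + name);
  return entry;
}

// Adding an alias: the key must already be in canonical form,
// because lookups only ever use canonicalized keys.
encodings['windowscyrillic'] = 'cp1251';

console.log(canonicalize('UTF-8'));             // key used for the lookup
console.log(lookup('Windows-Cyrillic').label);  // resolved through two aliases
```

An alias stored as 'Windows-Cyrillic' would never be found in this sketch, which is why the alias key has to be lowercase with non-alphanumeric characters removed.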
Q: How do I add a new single-byte encoding?
A: See encodings/sbcs-data.js for an example of 'maccenteuro' encoding.
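
As a rough illustration of what a single-byte encoding amounts to, here is a table-driven sketch. It is not the module's codec and does not reproduce the format used in encodings/sbcs-data.js: the function `makeSingleByteCodec`, the rule that a 128-character table covers bytes 0x80..0xFF with ASCII below it, and the sample table are all assumptions made for the example.

```javascript
// Build a decoder/encoder pair from a character table.
// Assumption: `chars` holds either all 256 characters, or only the 128
// characters for bytes 0x80..0xFF, in which case ASCII fills the lower half.
function makeSingleByteCodec(chars) {
  if (chars.length === 128) {
    let ascii = '';
    for (let i = 0; i < 128; i++) ascii += String.fromCharCode(i);
    chars = ascii + chars;
  }
  if (chars.length !== 256) {
    throw new Error('Table must have 128 or 256 characters');
  }

  // Reverse table for encoding: character -> byte value.
  const toByte = new Map();
  for (let i = 0; i < 256; i++) toByte.set(chars[i], i);

  return {
    // Each byte is an index into the table.
    decode(buf) {
      let out = '';
      for (const byte of buf) out += chars[byte];
      return out;
    },
    // Characters missing from the table become '?' (0x3f).
    encode(str) {
      const bytes = [];
      for (const ch of str) bytes.push(toByte.has(ch) ? toByte.get(ch) : 0x3f);
      return Buffer.from(bytes);
    },
  };
}

// Made-up table: Latin-1 for the upper half, except that byte 0x80
// is mapped to the euro sign.
let upper = '';
for (let i = 0x80; i < 0x100; i++) upper += String.fromCharCode(i);
upper = '\u20ac' + upper.slice(1);

const codec = makeSingleByteCodec(upper);
const decoded = codec.decode(Buffer.from([0x48, 0x69, 0x80]));
const encoded = codec.encode('\u20ac?\u4e2d');
console.log(decoded, encoded);
```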
Q: How do I add a new multi-byte encoding?
A: See generation/gen-dbcs.js and encodings/dbcs-data.js for how it's done now. Just add the sources for your encoding there. The current multi-byte codec is very versatile and should be enough for most encodings.
Q: What is the format of tables (encodings/tables/*)?
A: It is a JSON array of chunks. Each chunk represents a contiguous mapping from the multibyte encoding to Unicode. The first element of a chunk is a hexadecimal 'address': the multibyte code at which the chunk starts. After that comes a mix of strings and integers. A string represents the Unicode characters that correspond to sequential multibyte codes. An integer represents the length of a run of incrementing Unicode characters, continuing from the last character of the previous string, similar to run-length encoding (RLE).
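
The chunk format described in this answer can be expanded with a few lines of code. The sketch below is a simplified reading of that description, not the module's loader: `expandChunk` is a hypothetical helper, the sample chunk is made up, and it assumes that multibyte codes inside a chunk simply increase by one and that an integer always follows at least one string.

```javascript
// Expand one chunk into a Map from multibyte code (number) to character.
function expandChunk(chunk) {
  const table = new Map();
  let code = parseInt(chunk[0], 16); // hexadecimal start 'address'
  let last = null;                   // code point of the last mapped character

  for (const part of chunk.slice(1)) {
    if (typeof part === 'string') {
      // A string maps its characters to sequential multibyte codes.
      for (const ch of part) {
        last = ch.codePointAt(0);
        table.set(code++, ch);
      }
    } else if (typeof part === 'number') {
      // An integer is a run: that many characters, each one code point
      // higher than the previous, continuing from the last character.
      if (last === null) throw new Error('Run without a preceding character');
      for (let i = 0; i < part; i++) {
        last += 1;
        table.set(code++, String.fromCodePoint(last));
      }
    } else {
      throw new Error('Unexpected chunk element: ' + String(part));
    }
  }
  return table;
}

// Made-up chunk: start at 0x41, map "A" and "B", continue with a run of
// three incrementing characters, then map "x".
const sample = ['41', 'AB', 3, 'x'];
const table = expandChunk(sample);

for (const [code, ch] of table) {
  console.log(code.toString(16), ch);
}
```

Under these assumptions the sample chunk covers six consecutive codes starting at 0x41, with the run filling in the three characters that follow "B".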
Q: Why was this format chosen?
A: It's visual. You can easily check that the table is correct. Also, it's quite compact and easy to work with, as it's just JSON.
Q: How do I add a completely new encoding, not reducible to multi-byte? (stateful for example)
A: You'll need to write the code. Please look at examples in encodings/internal.js, encodings/sbcs-codec.js and encodings/dbcs-codec.js. Don't forget to write tests.
Q: What directories are necessary for this module to work?
A: Please look at .npmignore for directories that can be ignored. All others are necessary.