Home

Advice for working with encodings

Read Joel Spolsky's Guide to Unicode and Character Sets.
Keep in mind that all external resources (files, HTTP pages, etc.) are byte sequences, and are naturally represented in a Node.js program as Buffers or Streams of Buffers.
When you read from an external resource and want to operate on strings:
- Know the character encoding of the external resource. It is metadata. In general, you cannot deduce it from the byte sequence itself.
- Make sure you provide the original Buffers as input to the decoder, along with the correct encoding name.
- If you already have strings at some point in your program, then decoding has already happened, probably using the default encoding 'utf-8'. You cannot convert them to another encoding at this stage. You need to get the original Buffers, concat() them if needed, and pass them to the decoder. See more details.
- It is tricky to convert encodings when you get data from a stream (in chunks), because a multi-byte character can be split across two chunks. Use the Streaming API.
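The two pitfalls above can be sketched with Node.js built-ins only. This is an illustration under stated assumptions, not the library's own code: the byte values are made-up sample data, and `StringDecoder` (from Node's `string_decoder` module) stands in for a streaming decoder. It handles only Node's native encodings; for other encodings the Streaming API mentioned above plays the same role.

```javascript
const { StringDecoder } = require('string_decoder');

// Sample data: "café" encoded as latin1 (é is the single byte 0xE9), in two chunks.
const latin1Chunks = [Buffer.from([0x63, 0x61]), Buffer.from([0x66, 0xe9])];

// Wrong: each chunk is decoded with the default utf-8. The byte 0xE9 is not
// valid utf-8 on its own, so the original character is lost at this point.
const wrong = latin1Chunks.map((chunk) => chunk.toString()).join('');

// Right: keep the original Buffers, concat them, and decode once
// with the encoding you know the resource uses.
const right = Buffer.concat(latin1Chunks).toString('latin1');

// Chunked input: the euro sign takes three bytes in utf-8 (E2 82 AC),
// and here the chunk boundary falls in the middle of the character.
const utf8Chunks = [Buffer.from([0xe2, 0x82]), Buffer.from([0xac])];

// Decoding chunk by chunk breaks the split character.
const naive = utf8Chunks.map((chunk) => chunk.toString('utf8')).join('');

// A stateful decoder buffers the incomplete bytes until the rest arrives.
const decoder = new StringDecoder('utf8');
let streamed = '';
for (const chunk of utf8Chunks) {
  streamed += decoder.write(chunk);
}
streamed += decoder.end();

console.log({ wrong, right, naive, streamed });
```

Here `right` and `streamed` hold the intended text, while `wrong` and `naive` contain replacement characters that cannot be converted back to the original bytes.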
When you write to an external resource:
- Decide which encoding you want to use. UTF-8 is usually the most useful choice, and it is the default in Node.js.
- Use the Streaming API if you work with streams.
- If you don't encode strings yourself, Node.js will do it for you, using the default encoding.
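As a small illustration of the default, the sketch below encodes the same sample string with Node's built-in `Buffer.from`, once without an encoding argument and once with an explicit one. The string and variable names are chosen for illustration only.

```javascript
// Sample text: "café", written with an escape so the source stays ASCII.
const text = 'caf\u00e9';

// No encoding given: Node.js falls back to utf-8.
const implicit = Buffer.from(text);

// Same result, but the choice of encoding is visible in the code.
const explicit = Buffer.from(text, 'utf8');

// A different encoding produces different bytes for the same string.
const latin1 = Buffer.from(text, 'latin1');

console.log(implicit.equals(explicit), implicit.length, latin1.length);
```

The utf-8 Buffers are identical and one byte longer than the latin1 one, because é takes two bytes in utf-8 and one in latin1.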
FYI, JavaScript strings are sequences of UTF-16 code units.
- If you work with rare Chinese ideographs or other characters outside the Basic Multilingual Plane, be sure to familiarize yourself with surrogate pairs. They can be a pain to work with.
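A short sketch of what surrogate pairs look like in practice, using U+20BB7 (a CJK ideograph outside the Basic Multilingual Plane) as the sample character. Only standard JavaScript string methods are used.

```javascript
// U+20BB7 lies outside the Basic Multilingual Plane.
const ideograph = '\u{20BB7}';

// One character, but two UTF-16 code units (a surrogate pair).
const codeUnits = ideograph.length;

// The two halves of the pair, as hexadecimal code units.
const high = ideograph.charCodeAt(0).toString(16);
const low = ideograph.charCodeAt(1).toString(16);

// codePointAt and string iteration work on whole code points.
const codePoint = ideograph.codePointAt(0).toString(16);
const characters = Array.from(ideograph).length;

// Slicing by code unit can cut a pair in half and leave a lone surrogate.
const broken = ideograph.slice(0, 1);

console.log({ codeUnits, high, low, codePoint, characters, brokenLength: broken.length });
```

Indexing, `length`, and `slice` all count code units, so prefer `codePointAt`, `Array.from`, or `for...of` when such characters may appear.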

How to / Internals

Q: How are encoding names matched?
A: 1) They are lowercased and all non-alphanumeric characters are removed, 2) the result is used as a key in the iconv.encodings object to retrieve the codec.
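The two steps can be sketched as follows. This is a simplified illustration, not the library's actual code: the `encodings` object below is a hypothetical stand-in for iconv.encodings, the function names are invented, and the sketch assumes that a string value in the table is an alias pointing at another key.

```javascript
// Hypothetical stand-in for iconv.encodings: keys are canonical names,
// string values are assumed to be aliases, object values are codecs.
const encodings = {
  utf8: { description: 'placeholder utf8 codec' },
  iso88591: { description: 'placeholder iso88591 codec' },
  latin1: 'iso88591',
};

// Step 1: lowercase the name and drop every non-alphanumeric character.
function canonicalize(name) {
  return String(name).toLowerCase().replace(/[^0-9a-z]/g, '');
}

// Step 2: use the canonical name as a key, following aliases if needed.
function lookup(name) {
  let key = canonicalize(name);
  const seen = new Set();
  while (typeof encodings[key] === 'string') {
    if (seen.has(key)) throw new Error('Alias loop for: ' + name);
    seen.add(key);
    key = encodings[key];
  }
  if (!Object.prototype.hasOwnProperty.call(encodings, key)) {
    throw new Error('Encoding not recognized: ' + name);
  }
  return { key, codec: encodings[key] };
}

console.log(canonicalize('ISO-8859-1'), lookup('Latin_1').key);
```

Because of the first step, spellings such as 'UTF-8', 'utf8' and 'Utf_8' all resolve to the same key.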

Q: How do I add aliases to encodings?
A: In your project, set iconv.encodings['newalias'] = 'encoding'. The alias must be lowercase and have all non-alphanumeric characters removed.

Q: How do I add a new single-byte encoding?
A: See the 'maccenteuro' encoding in encodings/sbcs-data.js for an example.

Q: How do I add a new multi-byte encoding?
A: See generation/gen-dbcs.js and encodings/dbcs-data.js for how it is done now. Just add the sources for your encoding there. The current multi-byte codec is versatile and should be sufficient for most encodings.

Q: What is the format of the tables (encodings/tables/*)?
A: Each table is a JSON array of chunks. Each chunk represents a contiguous mapping from multibyte codes to Unicode. The first element of a chunk is a hexadecimal 'address': the multibyte code at which the chunk starts. The rest is a mix of strings and integers. A string holds the Unicode characters that correspond to sequential multibyte codes. An integer is the length of a run of incrementing Unicode characters that continues from the last character of the previous string, similar to run-length encoding.
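To make the format concrete, here is a minimal sketch that expands one chunk into a map from multibyte code to character. It is written from the description above, not taken from the codec: `expandChunk` and the sample chunk are hypothetical, and it assumes an integer N means N further characters, each one code point higher than the previous, starting right after the last character written.

```javascript
// Expand one table chunk into a Map of multibyte code -> Unicode character.
function expandChunk(chunk) {
  const table = new Map();
  // First element: hexadecimal multibyte code where the chunk starts.
  let address = parseInt(chunk[0], 16);
  let lastCodePoint = null;

  for (const part of chunk.slice(1)) {
    if (typeof part === 'string') {
      // A string maps its characters to sequential multibyte codes.
      for (const ch of part) {
        lastCodePoint = ch.codePointAt(0);
        table.set(address++, ch);
      }
    } else if (typeof part === 'number') {
      // An integer is a run of incrementing characters (assumed semantics).
      if (lastCodePoint === null) {
        throw new Error('A run needs a preceding character');
      }
      for (let i = 0; i < part; i++) {
        lastCodePoint += 1;
        table.set(address++, String.fromCodePoint(lastCodePoint));
      }
    }
  }
  return table;
}

// Made-up chunk: starts at 0x8140, maps "AB", then a run of 3, then "x".
const table = expandChunk(['8140', 'AB', 3, 'x']);
for (const [code, ch] of table) {
  console.log(code.toString(16), ch);
}
```

With this sample chunk, codes 0x8140 through 0x8144 map to A, B, C, D, E, and 0x8145 maps to x.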

Q: Why was this format chosen?
A: It is visual, so you can easily check that a table is correct. It is also quite compact and easy to work with, as it is just JSON.

Q: How do I add a completely new encoding that is not reducible to multi-byte (a stateful one, for example)?
A: You'll need to write the code. Please look at the examples in encodings/internal.js, encodings/sbcs-codec.js and encodings/dbcs-codec.js. Don't forget to write tests.

Q: What directories are necessary for this module to work?
A: Please look at .npmignore for directories that can be ignored. All others are necessary.
