Alexander Shtuchkin edited this page Jun 11, 2014 · 5 revisions

Advice for working with encodings

  • Read Joel Spolsky's Guide to Unicode and Character Sets.
  • Keep in mind that all external resources (files, HTTP pages, etc.) are byte sequences, and are naturally represented in a Node.js program as Buffers or Streams of Buffers.
  • When you read from an external resource and want to operate on strings:
    • Know the character encoding of the external resource. It is metadata: in general, you cannot deduce it from the byte sequence itself.
    • Make sure you pass the original Buffers to the decoder, along with the correct encoding name.
    • If you already have strings at some point in your program, then decoding has already happened, probably with the default encoding 'utf-8'. You cannot convert them to another encoding at this stage. You need to get the original Buffers, concat() them if needed, and pass them to the decoder. See more details.
    • Converting encodings is tricky when data arrives from a stream in chunks, because a multi-byte character can be split across a chunk boundary. Use the Streaming API.
  • When you write to an external resource:
    • Decide which encoding you want to use. The most useful is utf-8, which is the default in Node.
    • Use the Streaming API if you work with streams.
    • If you don't encode strings yourself, Node.js will do that for you, using the default encoding.
  • FYI, JavaScript strings are stored in memory as UTF-16.
    • If you work with Chinese ideographs or rare characters outside the Basic Multilingual Plane, be sure to familiarize yourself with surrogate pairs. They can be a pain to work with.
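The advice about decoding from the original Buffers can be illustrated with a small sketch. It uses only Node's built-in UTF-8 support (Buffer and string_decoder) as a stand-in, so it runs without extra dependencies; the chunk contents are made up for illustration. The same principle applies to any multi-byte encoding handled by this library's decoder.

```javascript
const { StringDecoder } = require('string_decoder');

// 'é' is two bytes in UTF-8 (0xC3 0xA9). Simulate a stream that
// happens to split it across two chunks: "caf" + 0xC3, then 0xA9.
const chunks = [Buffer.from([0x63, 0x61, 0x66, 0xc3]), Buffer.from([0xa9])];

// Problematic: decoding each chunk on its own corrupts the split character,
// and the damage cannot be repaired from the resulting string.
const perChunk = chunks.map((chunk) => chunk.toString('utf8')).join('');

// Option 1: keep the original Buffers, concat() them, decode once.
const whole = Buffer.concat(chunks).toString('utf8');

// Option 2: use a stateful decoder that holds back incomplete sequences
// until the next chunk arrives.
const decoder = new StringDecoder('utf8');
const streamed = chunks.map((chunk) => decoder.write(chunk)).join('') + decoder.end();

console.log(perChunk === 'caf\u00e9'); // false
console.log(whole === 'caf\u00e9');    // true
console.log(streamed === 'caf\u00e9'); // true
```

Option 1 suits small inputs that fit in memory; option 2 is the idea behind a streaming decoder.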

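To make the note on surrogate pairs concrete, here is a minimal sketch using only built-in JavaScript and Node features. The character U+20BB7 is an arbitrary example of a CJK ideograph outside the Basic Multilingual Plane.

```javascript
// Build the character from its code point to keep this source file ASCII-only.
const ideograph = String.fromCodePoint(0x20bb7);

// One character, but two UTF-16 code units (a surrogate pair).
const utf16Units = ideograph.length;              // 2
const codePoints = Array.from(ideograph).length;  // 1

// The two halves of the pair.
const high = ideograph.charCodeAt(0).toString(16); // 'd842'
const low = ideograph.charCodeAt(1).toString(16);  // 'dfb7'

// Index-based slicing can cut a pair in half and leave a lone surrogate.
const broken = ideograph.slice(0, 1);

// Encoding the intact character gives 4 UTF-8 bytes; the lone surrogate
// cannot be encoded and becomes U+FFFD (bytes ef bf bd).
const utf8Length = Buffer.from(ideograph, 'utf8').length;        // 4
const brokenHex = Buffer.from(broken, 'utf8').toString('hex');   // 'efbfbd'

console.log({ utf16Units, codePoints, high, low, utf8Length, brokenHex });
```

When you need to count or slice by character, iterating with Array.from() or for...of walks code points rather than code units.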
How to / Internals

Q: How are encoding names matched?
A: 1) They are lowercased and all non-alphanumeric characters are removed; 2) the result is used as a key in the iconv.encodings object to retrieve the codec.
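The normalization step described above could be sketched roughly as follows. The function name canonicalizeEncodingName is hypothetical and this is not the library's actual code; it only mirrors the two rules stated in the answer, assuming "alphanumeric" means ASCII letters and digits.

```javascript
// Hypothetical helper, written from the description above.
function canonicalizeEncodingName(name) {
  return String(name)
    .toLowerCase()               // rule 1a: lowercase
    .replace(/[^0-9a-z]/g, '');  // rule 1b: drop non-alphanumeric characters
}

// Different spellings of a name collapse to one lookup key.
const examples = ['UTF-8', 'ISO_8859-1', 'Windows-1251'].map(canonicalizeEncodingName);
console.log(examples); // [ 'utf8', 'iso88591', 'windows1251' ]
```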

Q: How do I add aliases to encodings?
A: In your project, set iconv.encodings['newalias'] = 'encoding'. The alias must be lowercase and have all non-alphanumeric characters removed.
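Why the alias key must already be normalized can be seen in a toy model of the lookup. Everything here is an assumption for illustration: the encodings object is a stand-in (not iconv.encodings itself), the codec entry is a placeholder, and the rule that a string value is followed as an alias is inferred from the assignment shown in the answer.

```javascript
// Toy registry standing in for the real one.
const encodings = Object.create(null);
encodings.utf8 = { name: 'placeholder codec' };

function normalize(name) {
  return String(name).toLowerCase().replace(/[^0-9a-z]/g, '');
}

// Look up a normalized name, following string values as aliases.
function lookup(name) {
  let entry = encodings[normalize(name)];
  const seen = new Set();
  while (typeof entry === 'string') {
    if (seen.has(entry)) throw new Error('Alias loop at ' + entry);
    seen.add(entry);
    entry = encodings[entry];
  }
  return entry;
}

encodings['myalias'] = 'utf8';      // normalized key: reachable
encodings['Other-Alias'] = 'utf8';  // not normalized: never matched

const found = lookup('My-Alias');      // key becomes 'myalias'
const missed = lookup('Other-Alias');  // key becomes 'otheralias'

console.log(found === encodings.utf8); // true
console.log(missed);                   // undefined
```

Because the requested name is normalized before the lookup, a key stored with uppercase letters or punctuation can never be reached.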

Q: How do I add a new single-byte encoding?
A: See encodings/sbcs-data.js for an example: the 'maccenteuro' encoding.

Q: How do I add a new multi-byte encoding?
A: See generation/gen-dbcs.js and encodings/dbcs-data.js for how it's done now. Just add the sources for your encoding there. The current multi-byte codec is very versatile and should be enough for most encodings.

Q: What is the format of the tables (encodings/tables/*)?
A: It is a JSON array of chunks. Each chunk represents a contiguous mapping from the multibyte encoding to Unicode. The first element of a chunk is a hexadecimal 'address': the multibyte code at which the chunk starts. After that comes a mix of strings and integers. A string represents the Unicode characters that correspond to sequential multibyte codes. An integer represents the length of a run of incrementing Unicode characters, continuing from the last character of the previous string, similar to run-length encoding (RLE).
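As a rough illustration of the chunk format, the sketch below expands one chunk into a plain code-to-character map. It is written from the description above, not taken from the codec: the function name expandChunk and the sample chunk are made up, and the real codec's internal data structures are not modelled.

```javascript
// Expand one chunk, e.g. ['a1', 'AB', 3], into a Map of code -> character.
function expandChunk(chunk) {
  const mapping = new Map();
  let code = parseInt(chunk[0], 16); // hexadecimal start 'address'
  let lastCodePoint = null;

  for (const part of chunk.slice(1)) {
    if (typeof part === 'string') {
      // Each character maps to the next sequential multibyte code.
      for (const ch of part) {
        lastCodePoint = ch.codePointAt(0);
        mapping.set(code++, ch);
      }
    } else if (typeof part === 'number') {
      if (lastCodePoint === null) {
        throw new Error('A run must follow at least one character');
      }
      // A run of `part` characters, each one code point after the previous.
      for (let i = 0; i < part; i++) {
        lastCodePoint += 1;
        mapping.set(code++, String.fromCodePoint(lastCodePoint));
      }
    }
  }
  return mapping;
}

// 0xa1 -> 'A', 0xa2 -> 'B', then a run of 3: 0xa3 -> 'C', 0xa4 -> 'D', 0xa5 -> 'E'
const table = expandChunk(['a1', 'AB', 3]);
console.log(table.size, table.get(0xa5)); // 5 E
```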

Q: Why was this format chosen?
A: It's visual. You can easily check that the table is correct. Also, it's quite compact and easy to work with, as it's just JSON.

Q: How do I add a completely new encoding that is not reducible to multi-byte (a stateful one, for example)?
A: You'll need to write the code. Please look at examples in encodings/internal.js, encodings/sbcs-codec.js and encodings/dbcs-codec.js. Don't forget to write tests.

Q: What directories are necessary for this module to work?
A: Please look at .npmignore for directories that can be ignored. All others are necessary.
