add some EBCDIC encodings #112
EBCDIC 500 mapping has been taken from http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP500.TXT and automatically converted from 0xXXXX to \uXXXX format for JavaScript. EBCDIC 1148 is said to be different only at code point
There are some problems here with some of the control mappings. The problem arises because EBCDIC has a Carriage Return, a New Line, and a Line Feed. Control characters in EBCDIC that have no direct translation have been given arbitrary Unicode values starting at U+0080. This includes the NL character (0x20 in EBCDIC), which is assigned U+0080. On the systems I've touched, the EBCDIC NL character is used in place of the LF character for marking EOL.
Currently Wikipedia says that EBCDIC NL is 0x15 in EBCDIC 500 (and in its variation EBCDIC 1148) and in EBCDIC 037 (and in its variation EBCDIC 1140). These four are mapped (by the Microsoft mappings mentioned above) to U+0085 (officially named "NEXT LINE" or "NEL"), which seems correct to me.
I'm not really sure how many programs handle U+0085 properly. The other side is the conversion in the opposite direction: with LF being the usual line terminator, it gets converted to EBCDIC LF (0x25). I need to double check, but the EBCDIC machines I've had access to do not like this at all. They want NL line endings.
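To make the NL/LF concern concrete, here is a minimal sketch of a post-processing step around the codec. It assumes, as stated above, that EBCDIC NL (0x15) decodes to U+0085 and that "\n" encodes to EBCDIC LF (0x25). The helper names `nelToLf` and `lfToNel` are hypothetical and are not part of iconv-lite.

```typescript
// Hypothetical helpers, not part of iconv-lite.
// U+0085 is Unicode NEXT LINE (NEL), the target of EBCDIC NL in the Microsoft tables.
const NEL = "\u0085";

// After decoding EBCDIC: turn NEL into the LF that most programs expect.
function nelToLf(decoded: string): string {
  return decoded.split(NEL).join("\n");
}

// Before encoding to EBCDIC: turn LF into NEL so that the encoder
// would emit NL (0x15) instead of LF (0x25).
function lfToNel(text: string): string {
  return text.split("\n").join(NEL);
}
```

Note that this swap is lossy when the text contains both characters, and that it leaves any "\r" in place, so CRLF input would need extra handling.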
As I also have to add support for various EBCDIC encodings, I did a little research on this matter and found an implementation by IBM which they open sourced. Here ConversionMaps is used to map between encodings and code pages (or, more formally, CCSIDs). In ConvTable this mapping is then used to load the respective converter (e.g. ConvTable1140 to map between Unicode and EBCDIC; CCSID 1140 is the Euro update of CCSID 037 according to "Code pages with Latin-1 character sets" in the Wikipedia entry). Skimming through their codebase, a good number of such mappings are available, which might be helpful in adding support for those encodings to iconv-lite.
Using a slightly more complex EBCDIC sample taken from this page I was able, after some back-and-forth conversions and modifying my local sbcs-data.js file, to validate the ebcdic.txt sample file against the ascii.txt file with a test like this:

    import * as assert from "assert";
    import * as fs from "fs";
    import * as iconv from "iconv-lite";

    it("Read EBCDIC from stream", (done) => {
      // Strip all line breaks from the expected text before comparing.
      const expected: string = fs.readFileSync("./test/ascii.txt", "latin1").replace(/[\r\n]/g, "");
      // https://querysurge.zendesk.com/hc/en-us/articles/215029906-QuerySurge-and-Mainframe-Data-EBCDIC-Files
      // the EBCDIC file is UTF-8 encoded, so we'll need to specify this in the call. For the output
      // ASCII file, we'll use the ISO-8859-1 encoding. The record length for the sample file is 67
      // bytes
      const stream =
        fs.createReadStream("./test/ebcdic.txt")
          .pipe(iconv.decodeStream("utf8"))
          .pipe(iconv.encodeStream("iso88591"))
          .pipe(iconv.decodeStream("ebcdic037"))
          // .pipe(iconv.decodeStream("ebcdic1140"))
          // .pipe(iconv.decodeStream("ebcdic500"))
          // .pipe(iconv.decodeStream("ebcdic1148"))
          ;
      // decodeStream emits strings, so collect them and join once the stream ends.
      const chunks: string[] = [];
      stream.on("data", (chunk: string) => chunks.push(chunk));
      stream.on("end", () => {
        assert.deepStrictEqual(chunks.join(""), expected);
        done(); // tell the test runner that the asynchronous check has finished
      });
    });

This sample test works with CCSID 037, 277, 280, 284, 285, 297, 500 and 1047, but fails for e.g. 273.
BTW, one can check EBCDIC files in IntelliJ quite easily just by changing the file encoding from the default UTF-8 to the corresponding EBCDIC encoding.
HTH
Thanks for the research @RovoMe! Any specific action items you would like to add here, or is it mostly additional info? I always try to generate the encodings directly from authoritative sources, e.g. see https://github.com/ashtuchkin/iconv-lite/blob/master/generation/gen-dbcs.js, where we download the corresponding tables from unicode.org or encoding.spec.whatwg.org. To support EBCDIC, ideally I'd want something like gen-ebcdic.js that downloads the tables from unicode.org and transforms them to the iconv-lite format. Java sources do not work great for that purpose, unfortunately. Also, I think the NL concern by @devin122 is valid (see https://en.wikipedia.org/wiki/Newline#Representation). We might want to address it by 1) encoding/decoding without changes by default, which would keep a 1:1 representation of all latin1 characters, but then 2) adding a codec option for it. Finally, FYI, we do work on integrating iconv-lite into VS Code, but it hasn't happened yet.
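As a rough illustration of what the transformation step of such a gen-ebcdic.js could look like, here is a minimal sketch that parses the text of a unicode.org-style mapping table (one "0xXX 0xXXXX #comment" entry per line) into a 256-character string, where the character at index i is what byte i decodes to. The function name `parseMappingTable` is hypothetical, the download step is left out, and the assumption that iconv-lite's single-byte tables can be fed from such a string should be checked against the existing generation scripts.

```typescript
// Hypothetical helper, not part of iconv-lite.
// Input: text of a mapping table with lines like "0xC1<TAB>0x0041<TAB>#LATIN CAPITAL LETTER A".
// Output: 256-character string; index = byte value, character = decoded code point.
function parseMappingTable(tableText: string): string {
  const chars: string[] = [];
  for (let i = 0; i < 256; i++) {
    chars.push("\uFFFD"); // bytes without a mapping decode to U+FFFD
  }
  for (const rawLine of tableText.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop trailing comments
    if (line === "") continue;
    const fields = line.split(/\s+/);
    if (fields.length < 2) continue; // byte listed without a mapping
    const byte = parseInt(fields[0], 16); // parseInt accepts the "0x" prefix with radix 16
    const codePoint = parseInt(fields[1], 16);
    if (isNaN(byte) || isNaN(codePoint) || byte < 0 || byte > 0xff) continue;
    // Single-byte tables only map into the BMP, so fromCharCode is sufficient here.
    chars[byte] = String.fromCharCode(codePoint);
  }
  return chars.join("");
}

// Usage with a tiny hand-written excerpt (comment texts are illustrative).
const sampleTable = [
  "# illustrative excerpt",
  "0x15\t0x0085\t#NEXT LINE",
  "0x25\t0x000A\t#LINE FEED",
  "0xC1\t0x0041\t#LATIN CAPITAL LETTER A",
].join("\n");
const sampleChars = parseMappingTable(sampleTable);
```

Generating the table from the authoritative file on every run, instead of retyping it, also makes differences between variants easy to diff.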
84ee650 to 9aa082f
I would like this please. Thanks!
Agreed, it would be very helpful to have the capability of opening encodings like CP037.
Is there any chance this PR is going forward?
vscode depends on this issue: microsoft/vscode#49891 is "the big and old" one, and duplicates are at least microsoft/vscode#147064 and microsoft/vscode#179693. @ashtuchkin Can you take a look at integrating this and publish a new version?
Fixes #111 partially.
EBCDIC 037 mapping has been taken from http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT and automatically converted from 0xXXXX to \uXXXX format for JavaScript.
EBCDIC 1140 is said to be different only at code point 9F (I have manually retyped that difference).
Note: this pull request does not contain tests because I am not sure what they should look like.
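Instead of retyping the one differing code point by hand, a variant table could be derived from its base table. The sketch below is hypothetical: `withOverrides` is not an iconv-lite function, the base table is a stand-in rather than the real 037 data, and the concrete override (the euro sign U+20AC replacing the currency sign U+00A4 at 0x9F) is my assumption about what the difference is; the description above only names the code point.

```typescript
// Hypothetical helper, not part of iconv-lite.
// Copies a 256-character decoding table and replaces the characters at the given byte positions.
function withOverrides(baseChars: string, overrides: { [byte: number]: string }): string {
  const chars = baseChars.split("");
  for (const key of Object.keys(overrides)) {
    const byte = Number(key);
    chars[byte] = overrides[byte];
  }
  return chars.join("");
}

// Stand-in for a real 037 table: placeholders everywhere except the currency sign at 0x9F.
let stub037 = "";
for (let i = 0; i < 256; i++) {
  stub037 += i === 0x9f ? "\u00A4" : "?";
}

// Assumed difference for the 1140 variant: euro sign at 0x9F.
const stub1140 = withOverrides(stub037, { 0x9f: "\u20AC" });
```

Keeping the difference as data makes it easy to check against the vendor tables and to reuse for the 500/1148 pair.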