Support for iso-8859-1 #149

qLag · 2020-11-25T17:56:39Z

I am trying to parse XML that comes from an ISO-8859-1 API (some strings contain French accented characters).

Unfortunately, TikXml seems to work only with UTF-8.

I tried to use a TypeConverter:

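```kotlin
class StringUT8Converter : TypeConverter<String> {

    // Re-interprets the already-decoded string as ISO-8859-1 bytes, then decodes them as UTF-8
    override fun read(value: String): String {
        return String(value.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)
    }

    // Reverse direction: UTF-8 bytes decoded as ISO-8859-1
    override fun write(value: String): String {
        return String(value.toByteArray(Charsets.UTF_8), Charsets.ISO_8859_1)
    }
}
```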
but it doesn't work.
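
A likely reason the converter cannot help: by the time a `TypeConverter` receives the value, the raw bytes have presumably already been decoded as UTF-8, and the ISO-8859-1 bytes for accented characters are not valid UTF-8, so they have been replaced with U+FFFD. The original byte is gone and cannot be recovered by re-encoding. The plain JDK sketch below illustrates this under that assumption. It uses neither TikXml nor Okio, and `Latin1RoundTripDemo` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of TikXml.
class Latin1RoundTripDemo {

    // Simulates a reader that decodes ISO-8859-1 bytes as if they were UTF-8.
    // Bytes that are not valid UTF-8 become the replacement character U+FFFD.
    static String decodeAsUtf8(byte[] latin1Bytes) {
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    // Same re-encoding trick as the converter's read() method above.
    // U+FFFD has no ISO-8859-1 mapping, so the accent cannot come back.
    static String converterRead(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Decoding with the right charset from the start keeps the text intact.
    static String decodeAsLatin1(byte[] latin1Bytes) {
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Feeding the ISO-8859-1 bytes of `café` through `decodeAsUtf8` and then `converterRead` does not give `café` back, while `decodeAsLatin1` does. That suggests the charset has to be applied where the bytes are first read.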

Do you think you could support encodings other than UTF-8 (for poor old web services 😝)?

Thx

qLag · 2020-11-25T18:49:03Z

I haven't tried it, but wouldn't it be possible to use `buffer.read(index, this.charset)` instead of `buffer.readUtf8(index)` throughout XmlReader (with a default charset = Charsets.UTF_8)?

And give the possibility to define a custom Charset with TikXmlConfig, which would then be used in TikXml.java:
`XmlReader reader = XmlReader.of(source, config.charset);`

reline · 2020-11-25T22:37:55Z

@qLag It is possible. Okio's API allows you to provide a charset with Buffer#readString(), Buffer#writeString(), and ByteString#encodeString().

I think the only issue is skipping the leading BOM for each charset. This is the current implementation.

```java
private int nextNonWhitespace(boolean throwOnEof, boolean isDocumentBeginning) throws IOException {
  // Look for UTF-8 BOM sequence 0xEFBBBF and skip it
  if (isDocumentBeginning && source.rangeEquals(0, UTF8_BOM)) {
    source.skip(3);
  }
  ...
}
```

I'm not sure this is the best way to support skipping the BOM for each charset, but here's how OkHttp does it for several UTF charsets:
https://github.com/square/okhttp/blob/3f946d0b13534bcd1662e58624b0fc5816d1cc14/okhttp/src/main/java/okhttp3/internal/Util.kt#L255-L265

Edit:
FWIW, Moshi doesn't skip the BOM; you have to detect it and skip it yourself before handing the stream to Moshi. Perhaps that is another avenue of approach.
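
To make the "detect it and skip it yourself" option concrete, here is a minimal sketch using only the JDK. It is not the OkHttp or TikXml implementation: `BomSniffer` and its methods are hypothetical names, it works on a byte array rather than an Okio source, and it only knows the UTF-8 and UTF-16 BOMs (UTF-32 is left out for brevity). ISO-8859-1 has no BOM, so it is handled by the caller-supplied fallback charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of TikXml, Okio, or OkHttp.
class BomSniffer {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] UTF16_BE_BOM = {(byte) 0xFE, (byte) 0xFF};
    private static final byte[] UTF16_LE_BOM = {(byte) 0xFF, (byte) 0xFE};

    // Returns the charset implied by a leading BOM, or the fallback if none is found.
    static Charset detect(byte[] data, Charset fallback) {
        if (startsWith(data, UTF8_BOM)) return StandardCharsets.UTF_8;
        if (startsWith(data, UTF16_BE_BOM)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, UTF16_LE_BOM)) return StandardCharsets.UTF_16LE;
        return fallback;
    }

    // Returns how many leading bytes belong to a recognized BOM (0 if none).
    static int bomLength(byte[] data) {
        if (startsWith(data, UTF8_BOM)) return UTF8_BOM.length;
        if (startsWith(data, UTF16_BE_BOM)) return UTF16_BE_BOM.length;
        if (startsWith(data, UTF16_LE_BOM)) return UTF16_LE_BOM.length;
        return 0;
    }

    // Skips the BOM, if any, and decodes the rest with the detected charset.
    static String decode(byte[] data, Charset fallback) {
        Charset charset = detect(data, fallback);
        int skip = bomLength(data);
        return new String(data, skip, data.length - skip, charset);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

One caveat of this approach: single-byte input that happens to begin with the same bytes as a BOM would be misdetected, although that is unlikely for a document starting with `<`.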

reline · 2020-11-26T22:54:19Z

I made a draft in #150. It still needs unit tests, but I went ahead and started the legwork.

qLag · 2020-11-30T07:49:10Z

Hi reline,
Thanks for your support :) I will be pleased to test your feature when it is ready 👍

reline · 2020-12-03T00:33:22Z

@qLag In the meantime you can always build a snapshot off of that branch if it's urgent and meets your needs.
I'd like to get more feedback from the maintainers now.

qLag · 2020-12-16T20:20:37Z

Hi reline,

I tried your draft using this line in Gradle:

```groovy
implementation 'com.github.reline:tikxml:iso-8859-1-SNAPSHOT'
```

And this in my code:

```kotlin
val tikXml = TikXml.Builder()
    .charset(Charsets.ISO_8859_1)
    .exceptionOnUnreadXml(false)
    .build()
```

And... it works great ! 👍 😊 🎉
I also needed to add these lines to my build.gradle to make it work:

```groovy
packagingOptions {
    exclude 'META-INF/gradle/incremental.annotation.processors'
}
```

This is really good news. How can we proceed now to get it included in Tickaroo/tikXML?
Thanks again :)

Qlag

reline · 2020-12-22T21:44:31Z

@qLag Glad that worked for you!

I updated the PR with some unit tests. The only significant change I made was fixing the XML declaration when writing in charsets other than UTF-8.

```diff
- XML_DECLARATION = ByteString.encodeUtf8("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
+ XML_DECLARATION = ByteString.encodeString("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>", charset);
```
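
For readers without Okio at hand, the same idea can be sketched with the JDK alone: build the declaration from `charset.name()` and encode it with that same charset, so the declared encoding always matches the bytes actually written. `XmlDeclarationDemo` and `xmlDeclaration` are hypothetical names used only for illustration, and this is not the code in the PR.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of TikXml.
class XmlDeclarationDemo {

    // Builds the XML declaration for the given charset and encodes it with that charset.
    static byte[] xmlDeclaration(Charset charset) {
        String declaration = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>";
        return declaration.getBytes(charset);
    }
}
```

With `StandardCharsets.ISO_8859_1` the declaration reads `encoding="ISO-8859-1"`, and with `StandardCharsets.UTF_8` it reads `encoding="UTF-8"`.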

Is anyone available to review it? @sockeqwe @Bodo1981
