From eaa46a418b719602e361103fbdde48b9e1a09977 Mon Sep 17 00:00:00 2001
From: Shabanali Faghani
Date: Mon, 31 Dec 2018 03:03:15 +0330
Subject: [PATCH] Update README.md

---
 README.md | 71 +++++++++++++++++++++++--------------------------------
 1 file changed, 30 insertions(+), 41 deletions(-)

diff --git a/README.md b/README.md
index 3eeef83..70f3311 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-#IUST HTMLCharDet
+# IUST HTMLCharDet

IUST HTMLCharDet is a meta Java tool for detecting the *Charset Encoding* of HTML web pages. **HTMLCharDet** stands for **HTML** **Char**set **Det**ector and **IUST** stands for **I**ran **U**niversity of **S**cience & **T**echnology.

@@ -10,48 +10,51 @@ This tool is in connection with a paper entitled:

which was presented in *[Proceedings of the 11th Asia Information Retrieval Societies Conference][1]* (pp. 215-226), Brisbane, Australia, 2015.

-Although we wrote a paper to describe the algorithm, but this tool is not just an academic effort to solve *charset encoding detection* problem for HTML web pages. In fact this tool is an **industrial** product which is now actively used in a large-scale web crawler, under a load of over than **1 billion** web pages. But despite its accuracy in practice, it's very small in size (just has two class!!). Both the small size and the accuracy of this tool are originated from its algorithm. It is small, because its algorithm is so easy to implement, and it is accurate, because in addition to the logic of its algorithm per se; it uses two famous charset detector tools namely _**IBM ICU**_ and _**Mozilla CharDet**_ under the hood.
+Although we wrote a paper describing the algorithm, this tool is not just an academic effort to solve the *charset encoding detection* problem for HTML web pages. In fact, it is an **industrial** product that is actively used in a large-scale web crawler, under a load of more than **1 billion** web pages.
-##Precision (quick view)
+## Precision (quick view)

-In order to determine the precision of IUST HTMLCharDet, we compared it with the two famous charset detector tools, i.e. _**IBM ICU**_ and _**Mozilla CharDet**_, against two test scenario including **Encoding-Wise** and **Language-Wise**. Results of the comparisons are presented in the [paper][paper], but bellow you can have a glance at results. To read more about comparisons, please find the paper inside *wiki* folder.
+In order to determine the precision of IUST HTMLCharDet, we compared it with two famous charset detector tools, i.e. _**IBM ICU**_ and _**Mozilla CharDet**_, in two test scenarios: **Encoding-Wise** and **Language-Wise**. The results of the comparisons are presented in the [paper][paper], but you can take a glance at them below. To read more about the comparisons, you can find the paper inside the *wiki* folder.

-**Note:** In these images *Hybrid* is the same *IUST HTMLCharDet*, in the paper we called it *Hybrid* because it is actually a hybrid mechanism.
+**Note:** In these images, *Hybrid* is the same as *IUST HTMLCharDet*; we called it *Hybrid* in the paper because it is actually a hybrid mechanism.

-####Encoding-Wise Evaluation
-In this test scenario, we compared *IBM ICU*, *Mozilla CharDet* and the *hybrid mechanism* against a corpus of HTML documents. To create this corpus, we wrote a multi-threaded crawler and then we gathered a collection of nearly 2700 HTML pages with various charset encoding types. The code which we wrote for creating this corpus is available in the [*./src/test/java/encodingwise/corpus*][corpus-code] folder of this repository and the created corpus is available via [*./test-data/encoding-wise/corpus.zip*][corpus-data]. Bellow find the comparison results ...
+#### Encoding-Wise Evaluation
+In this test scenario, we compared *IBM ICU*, *Mozilla CharDet* and the *Hybrid mechanism* against a corpus of HTML documents. To create this corpus, we wrote a multi-threaded crawler and gathered a collection of nearly 2700 HTML pages with various charset encoding types. The code we wrote to create this corpus is available in the [*./eval/src/main/java/encodingwise/corpus*][corpus-code] folder of this repository, and the corpus itself is available at [*./eval/test-data/encoding-wise/corpus.zip*][corpus-data]. The comparison results are shown below ...

encoding-wise evaluation image

-Usually, graphical presentation of the results makes a better sense ...
+
+Usually, a graphical presentation of the results makes more sense ...

encoding-wise evaluation diagram image

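The evaluation above boils down to running each detector over a labeled corpus and counting correct guesses per charset. As a rough, self-contained illustration of that procedure (this is *not* the project's actual evaluation code; the `Detector` interface and the toy corpus are hypothetical stand-ins for IBM ICU, Mozilla CharDet, or the hybrid mechanism), it could be sketched like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an encoding-wise evaluation: for each (document, true
// charset) pair, ask a detector for its guess and compute per-charset accuracy.
public class AccuracySketch {

    interface Detector {
        String detect(byte[] html); // stand-in for any charset detector
    }

    static Map<String, Double> accuracyPerCharset(Map<byte[], String> corpus, Detector d) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // charset -> {correct, total}
        for (Map.Entry<byte[], String> e : corpus.entrySet()) {
            String truth = e.getValue();
            int[] c = counts.computeIfAbsent(truth, k -> new int[2]);
            if (truth.equalsIgnoreCase(d.detect(e.getKey()))) c[0]++;
            c[1]++;
        }
        Map<String, Double> accuracy = new LinkedHashMap<>();
        counts.forEach((cs, c) -> accuracy.put(cs, (double) c[0] / c[1]));
        return accuracy;
    }

    public static void main(String[] args) {
        // Toy two-document corpus; the real corpus has ~2700 labeled pages.
        Map<byte[], String> corpus = new LinkedHashMap<>();
        corpus.put("<html>UTF-8 page</html>".getBytes(), "UTF-8");
        corpus.put("<html>Arabic page</html>".getBytes(), "Windows-1256");
        // A toy detector that always answers UTF-8: right on the first
        // document, wrong on the second.
        Detector alwaysUtf8 = html -> "UTF-8";
        System.out.println(accuracyPerCharset(corpus, alwaysUtf8)); // {UTF-8=1.0, Windows-1256=0.0}
    }
}
```

The real evaluation project under *./eval* does this with the actual detectors and corpus; the sketch only shows the shape of the measurement.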
-I know that the results are odd and incredible indeed, specially for Windows-1251 and Windows-1256 (Cyrillic and Arabic specific charset encodings respectively), but it is just as well. To see the reason of this behavior of IUST HTMLCharDet from theoretical point of view please refer to the [paper][paper], but to see it in practice you can find prepared test codes in [*./test/java*][test-codes] and test data in [*./test-data*][test-data] folder.
-####Language-Wise Evaluation
-In this test scenario, we compared our *hybrid mechanism* with the two others from language point of view. In this connection, we collected a list of URLs that are pointing to various web pages with different languages. The URLs are selected from the **top 1 million websites** visited from all over the world, as reported by [Alexa][Alexa]. In order to collect HTML documents in a specific language, we investigated web pages with the internet domain name of that language. For example, *Japanese* web pages are collected from *.jp* domain. The results of evaluation for eight different languages are shown in details in the following table ...
+The results may look surprising, especially for Windows-1251 and Windows-1256 (Cyrillic-specific and Arabic-specific charset encodings, respectively), but they are sound. Refer to the [paper][paper] for the theoretical explanation of why IUST HTMLCharDet is so accurate; to verify its accuracy in practice yourself, you can find a prepared evaluation project inside the [*./eval*][eval] folder.
+
+#### Language-Wise Evaluation
+In this test scenario, we compared our *hybrid mechanism* with the two other tools from a language point of view. To this end, we gathered a collection of URLs of web pages in different languages. The URLs were selected from [Alexa][Alexa]'s **top 1 million websites** visited throughout the world. To get the URLs for a specific language, we looked at URLs with the Internet Top-Level Domain (TLD) of that language/country. For example, the URLs of *Japanese* web pages were collected from the *.jp* TLD. The results of the evaluation for eight different languages are shown in detail in the following table ...

language-wise evaluation image

-To find more details about this test, you may have a look at: [*./test-data/language-wise/results/*][lang-wise-results].
+
+Take a look at [*./eval/test-data/language-wise/results/*][lang-wise-results] for more details about this test.

language-wise evaluation diagram image

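The TLD-based URL selection described above can be sketched in a few lines of Java. This is only an illustration of the idea (the URL list and the helper name are made up, not the actual Alexa data or the project's collection code):

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of the language-wise URL selection: keep only URLs whose
// host ends with the top-level domain associated with the target language,
// e.g. ".jp" for Japanese.
public class TldFilterSketch {

    static List<String> urlsWithTld(List<String> urls, String tld) {
        return urls.stream()
                .filter(u -> {
                    String host = URI.create(u).getHost();
                    return host != null && host.endsWith(tld);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for entries of the top-1-million list.
        List<String> urls = List.of(
                "http://example.jp/index.html",
                "http://example.com/index.html",
                "http://news.example.jp/a.html");
        System.out.println(urlsWithTld(urls, ".jp")); // keeps only the two .jp pages
    }
}
```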
-As you can see from this diagram, in this test scenario the improveness in mean average accuracy of IUST HTMLCharDet aginst two other tools is less that from which in the previous test scenario (i.e. 0.14 and 0.10 in Lang-Wise versus 0.38 and 0.69 in Enc-Wise). It is due to the fact that over than 85 % of the websites use UTF-8 as their charset encoding [[ref][w3techs]]. With considering this fact and recalling [Encoding-Wise Evaluation diagram][ewe-diagram], in which we saw both *IBM ICU* and *Mozilla CharDet* are accurate in dealing with UTF-8, we would have been convinced with this results.
-##Installation
+As this diagram shows, in this test scenario the improvement in mean average accuracy of IUST HTMLCharDet over the two other tools is smaller than in the previous test scenario (i.e., 0.14 and 0.10 in Lang-Wise versus 0.38 and 0.69 in Enc-Wise). This is because more than 85% of websites use UTF-8 as their charset encoding ([ref][w3techs]) and, as we know from the [Encoding-Wise Evaluation diagram][ewe-diagram], both *IBM ICU* and *Mozilla CharDet* are more accurate on UTF-8 than on other charsets.
+
+## Installation

-####Maven
-If you use Maven as dependency manager, just place this dependency into your POM's `<dependencies>` section:
+#### Maven
 ```xml
 <dependency>
   <groupId>ir.ac.iust</groupId>
@@ -59,40 +62,26 @@ If you use Maven as dependency manager, just place this dependency into your POM's `<dependencies>` section:
   <artifactId>htmlchardet</artifactId>
   <version>1.0.1</version>
 </dependency>
 ```
-####Scala SBT
+#### Scala SBT
````scala
libraryDependencies += "ir.ac.iust" % "htmlchardet" % "1.0.1"
````
-####Otherwise!
-If you don't use any dependency manager and have a pure java project you can download **htmlchardet-1.0.1.jar** either from inside the [wiki][wiki] folder or from [mvnrepository.org][mvnrepo]. In this case you need also the first 4 dependency; those were mentioned in the [pom.xml][pom] file to get it to work for you.
-####Warning!!!
-Never use the version **1.0.0**, because it has a **FATAL** bug. See more details [here][bug-url].
+## Usage

-##Usage
-
-If you trust in the Web Server or trust in the Website that the pages are crawled from, you can use this tool as follows:
-```java
-String charset = HTMLCharsetDetector.detect(htmlByteArray, true);
-```
-otherwise, call detect method with `false` value for `lookInMeta` argument as follows:
 ```java
-String charset = HTMLCharsetDetector.detect(htmlByteArray, false);
+HTMLCharsetDetector htmlCharsetDetector = new HTMLCharsetDetector();
+String charset = htmlCharsetDetector.detect(htmlInputStream);
+// or
+String charset = htmlCharsetDetector.detect(htmlInputStream, true); // to look into meta tags
 ```
-Also, there is another detection method with `#detect(byte[] rawHtmlByteSequence)` signature, but I don't recommend to use it. To see why, please refer to its [javadoc][javadoc].

 [1]: http://airs-conference.org/2015/program.html
-[paper]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/wiki/Charset-Encoding-Detection-of-HTML-Documents.pdf
-[corpus-code]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/src/test/java/encodingwise/corpus
-[corpus-data]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/encoding-wise/corpus.zip
-[test-codes]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/src/test/java
-[test-data]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data
+[paper]: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
+[corpus-code]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/eval/src/main/java/encodingwise/corpus
+[corpus-data]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/eval/test-data/encoding-wise/corpus.zip
+[eval]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/eval
 [Alexa]: http://www.alexa.com
-[lang-wise-results]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/language-wise/results
+[lang-wise-results]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/eval/test-data/language-wise/results
 [w3techs]: http://w3techs.com/technologies/history_overview/character_encoding
-[pom]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/pom.xml
 [ewe-diagram]: https://cloud.githubusercontent.com/assets/14090324/12007849/cc8f46ca-ac2c-11e5-9600-dd3cd3a39ac1.jpg
-[mvnrepo]: http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet
-[wiki]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/wiki
-[bug-url]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/issues/2
-[javadoc]: https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/main/java/ir/ac/iust/htmlchardet/HTMLCharsetDetector.java#L146
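To give an idea of what the "look into meta tags" option in the usage section refers to, here is a minimal, self-contained sketch of pulling a charset declaration out of an HTML meta tag. The class, method name, and regex are illustrative only and are not the library's implementation, which is considerably more robust:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a naive version of "looking into meta tags" for a charset
// declaration. It handles both <meta charset="..."> (HTML5) and the older
// <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
public class MetaCharsetSketch {

    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    static String sniffMetaCharset(byte[] html) {
        // Charset names are ASCII, so decoding the bytes as ISO-8859-1 (which
        // never fails) is safe for sniffing purposes.
        String text = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null; // null when no declaration is found
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"windows-1256\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniffMetaCharset(page)); // windows-1256
    }
}
```

A declared charset can of course be missing or wrong, which is exactly why the detector treats the meta tag as a hint to be trusted only when the source is trusted, rather than as the final answer.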