From 0a7f382e2d488d46f1e12bad647f0384c4a2288f Mon Sep 17 00:00:00 2001
From: Shabanali Faghani
Date: Wed, 2 Jan 2019 00:01:57 +0330
Subject: [PATCH] Update README.md

---
 README.md | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 70f3311..be502f8 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # IUST HTMLCharDet
-IUST HTMLCharDet is a meta java tool for detecting *Charset Encoding* of HTML web pages. **HTMLCharDet** stands for **HTML** **Char**set **Det**ector and **IUST** stands for **I**ran **U**niversity of **S**cience & **T**echnology.
+IUST HTMLCharDet is a java tool for detecting *Charset Encoding* of HTML web pages. **HTMLCharDet** stands for **HTML** **Char**set **Det**ector and **IUST** stands for **I**ran **U**niversity of **S**cience & **T**echnology.
 This tool is in connection with a paper entitled:

@@ -10,19 +10,19 @@ This tool is in connection with a paper entitled:
 which was presented In *[Proceedings of the 11th Asia Information Retrieval Societies Conference][1]* (pp. 215-226), Brisbane, Australia, 2015.
-Although we wrote a paper to describe the algorithm, but this tool is not just an academic effort to solve *charset encoding detection* problem for HTML web pages. In fact this tool is an **industrial** product which is now actively used in a large-scale web crawler, under a load of over than **1 billion** web pages.
+Although we wrote a paper to describe the algorithm, but this tool is not just an academic effort to solve *charset encoding detection* problem for HTML web pages. In fact this tool is an industrial product which is now actively used in a large-scale web crawler, under a load of over than **1 billion** web pages.
 ## Precision (quick view)
-In order to determine the precision of IUST HTMLCharDet, we compared it with the two famous charset detector tools, i.e. _**IBM ICU**_ and _**Mozilla CharDet**_, against two test scenarios including **Encoding-Wise** and **Language-Wise**. Results of the comparisons were presented in the [paper][paper], but you can take a glance at them, below. To read more about comparisons, you can find the paper inside the *wiki* folder.
+In order to determine the precision of IUST HTMLCharDet, we compared it with the two famous charset detector tools namely _**IBM ICU**_ and _**Mozilla CharDet**_, against two test scenarios including **Encoding-Wise** and **Language-Wise**. Results of the comparisons are presented in the [paper][paper], but you can take a glance at them, below. Interested readers are referred to [TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038) for more evaluations and explanations.
 **Note:** In these images *Hybrid* is the same *IUST HTMLCharDet*, we called it *Hybrid* in the paper because it is actually a hybrid mechanism.
 #### Encoding-Wise Evaluation
-In this test scenario, we compared *IBM ICU*, *Mozilla CharDet* and the *Hybrid mechanism* against a corpus of HTML documents. To create this corpus, we wrote a multi-threaded crawler and then we gathered a collection of nearly 2700 HTML pages with various charset encoding types. The code which we wrote for creating this corpus is available in the [*./eval/src/main/java/encodingwise/corpus*][corpus-code] folder of this repository and the created corpus is available via [*./eval/test-data/encoding-wise/corpus.zip*][corpus-data]. Below find the comparison results ...
+In this test scenario, we compared *IBM ICU*, *Mozilla CharDet* and the *Hybrid mechanism* against a corpus of HTML documents. To create this corpus, we wrote a multi-threaded crawler and then we gathered a collection of nearly 2700 HTML pages with various charset encoding types. The code which we wrote for creating this corpus is available in the [*./eval/src/main/java/encodingwise/corpus*][corpus-code] folder of this repository and the created corpus is available via [*./eval/test-data/encoding-wise/corpus.zip*][corpus-data]. Below find the results of comparison ...

-encoding-wise evaluation image
+encoding-wise evaluation image

@@ -43,18 +43,19 @@ In this test scenario, we compared our *hybrid mechanism* with the two others fr

-Take a look at [*./eval/test-data/language-wise/results/*][lang-wise-results] to find more details about this test.
+More details about this test can be found in [*./eval/test-data/language-wise/results/*][lang-wise-results].

language-wise evaluation diagram image

-As you can see from this diagram, in this test scenario the improveness in mean average accuracy of IUST HTMLCharDet aginst two other tools is less that from which in the previous test scenario (i.e. 0.14 and 0.10 in Lang-Wise versus 0.38 and 0.69 in Enc-Wise). It is because over than 85% of the websites use UTF-8 as their charset encoding ([ref][w3techs]) and as we know from [Encoding-Wise Evaluation diagram][ewe-diagram], compared to other charsets, both *IBM ICU* and *Mozilla CharDet* are more accurate when the charset is UTF-8.
+As you can see from this diagram, in this test scenario the improveness in mean average accuracy of IUST HTMLCharDet aginst two other tools is less that from which in the previous test scenario (i.e. 0.14 and 0.10 in Lang-Wise versus 0.38 and 0.69 in Enc-Wise). It is because over than 85% of the websites use UTF-8 as their charset encoding ([ref][w3techs]) and as we know from the previous test, compared to other charsets, both *IBM ICU* and *Mozilla CharDet* are more accurate when the charset is UTF-8.
 ## Installation
 #### Maven
+
 ```java
 ir.ac.iust
@@ -63,17 +64,26 @@ As you can see from this diagram, in this test scenario the improveness in mean
 ````
 #### Scala SBT
+
 ````scala
 libraryDependencies += "ir.ac.iust" % "htmlchardet" % "1.0.1"
 ````
 ## Usage
+In case of using version 1.0.1:
+```java
+String charset = HTMLCharsetDetector.detect(htmlByteArray, false);
+// or
+String charset = HTMLCharsetDetector.detect(htmlByteArray, true); // to involve charsets in meta tags
+```
+
+For the current 1.0.2-SNAPSHOT version:
 ```java
 HTMLCharsetDetector htmlCharsetDetector = new HTMLCharsetDetector();
 String charset = htmlCharsetDetector.detect(htmlInputStream);
 // or
-String charset = htmlCharsetDetector.detect(htmlInputStream, true); // to look into meta tags
+String charset = htmlCharsetDetector.detect(htmlInputStream, true); // to involve charsets in meta tags
 ```
 [1]: http://airs-conference.org/2015/program.html
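The usage snippets in this patch return the detected charset as a `String`. A minimal sketch of what a caller might do next, assuming the returned value is a name that `java.nio.charset.Charset` can resolve: the class `DecodeWithDetectedCharset`, its `decode` helper, and the UTF-8 fallback are hypothetical and not part of HTMLCharDet, and the detector call is replaced by a hard-coded placeholder so the example runs without the library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeWithDetectedCharset {

    /**
     * Decodes raw HTML bytes using a charset name such as the one a detector returns.
     * Falls back to UTF-8 (an assumption of this sketch) when the name is null,
     * malformed, or not supported by the running JVM.
     */
    public static String decode(byte[] htmlBytes, String charsetName) {
        Charset charset;
        try {
            charset = Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException.
            charset = StandardCharsets.UTF_8;
        }
        return new String(htmlBytes, charset);
    }

    public static void main(String[] args) {
        // Sample page bytes encoded as ISO-8859-1.
        byte[] htmlBytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);

        // Placeholder for the value that detect(...) would return.
        String detected = "ISO-8859-1";

        System.out.println(decode(htmlBytes, detected));
    }
}
```

Resolving the name through `Charset.forName` rather than passing the string to `new String(bytes, name)` avoids the checked `UnsupportedEncodingException` and keeps the fallback decision in one place.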