A simple util for extracting strings from a huge text.
Regular expressions are the usual tool for extracting strings, but they are hard to learn. This util is a much easier way to extract strings from a huge text.
For example, given a search result page, we want to extract the individual search results.
The following is the extracted result.
All you need to do is define the extract pattern, and the only things you need to know about the pattern are the {%} and {*} symbols.
- {%} matches any characters you want to extract.
- {*} matches any characters you want to omit.
public static final String PATTERN = " - </span>{%}</div><div class=\"f13\">{*}data-tools=\"{\"title\":\"{%}\",\"url\":\"{%}\"}\">{*}{'rsv_snapshot':'1'}\" href=\"{%}\" target=\"_blank\" class=\"m\">百度快照";
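To make the two symbols concrete, here is a minimal sketch of how such a pattern could be interpreted. It is not Squirrel's actual implementation: the class PlaceholderSketch and its methods are hypothetical, and it assumes that {%} and {*} both match as few characters as possible and that everything else in the pattern is literal text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; Squirrel itself may work differently.
class PlaceholderSketch {

    // Translate a {%}/{*} pattern into a regular expression:
    // {%} becomes a lazy capturing group, {*} a lazy non-capturing wildcard,
    // and every other part of the pattern is quoted as literal text.
    static Pattern toRegex(String pattern) {
        StringBuilder regex = new StringBuilder();
        int pos = 0;
        while (pos < pattern.length()) {
            int capture = pattern.indexOf("{%}", pos);
            int skip = pattern.indexOf("{*}", pos);
            int next;
            if (capture < 0) {
                next = skip;
            } else if (skip < 0) {
                next = capture;
            } else {
                next = Math.min(capture, skip);
            }
            if (next < 0) {
                // No more symbols: the rest of the pattern is literal.
                regex.append(Pattern.quote(pattern.substring(pos)));
                break;
            }
            if (next > pos) {
                regex.append(Pattern.quote(pattern.substring(pos, next)));
            }
            regex.append(next == capture ? "(.*?)" : ".*?");
            pos = next + 3; // both symbols are three characters long
        }
        // DOTALL lets the symbols span line breaks in the text.
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Collect one list of captured {%} values per match found in the text.
    static List<List<String>> extract(String pattern, String text) {
        Matcher matcher = toRegex(pattern).matcher(text);
        List<List<String>> results = new ArrayList<>();
        while (matcher.find()) {
            List<String> row = new ArrayList<>();
            for (int g = 1; g <= matcher.groupCount(); g++) {
                row.add(matcher.group(g));
            }
            results.add(row);
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "<li><a href=\"/a\">First</a> <i>ad</i></li>"
                + "<li><a href=\"/b\">Second</a> <i>ad</i></li>";
        String pattern = "<a href=\"{%}\">{%}</a>{*}</li>";
        // prints [[/a, First], [/b, Second]]
        System.out.println(extract(pattern, text));
    }
}
```

In this sketch each inner list holds the {%} values of one match, in the order they appear in the pattern, while the text covered by {*} is dropped.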
Extraction traverses the whole text and matches the pattern against it. Finally, invoke the following method to extract the strings.
String text = readText("baidu_search_result.html");
List<List<String>> results = Squirrel.extract(PATTERN, text);
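The readText call above is not defined in this README. A minimal sketch of such a helper could look like the following. The class name TextFiles is hypothetical, and the sketch assumes the file is UTF-8 encoded and that the argument is a plain file system path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for the readText helper used above.
class TextFiles {

    // Read the whole file into one String, assuming UTF-8 content.
    // I/O failures are rethrown unchecked so callers need no try/catch.
    static String readText(String fileName) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(fileName));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading the whole file at once keeps the call a one-liner, at the cost of holding the entire text in memory.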
Please see the unit test for details.
Created by @[email protected] - feel free to contact me!