Floki is a simple HTML parser that lets you search for nodes using CSS selectors.
Take this HTML as an example:
<!doctype html>
<html>
<body>
<section id="content">
<p class="headline">Floki</p>
<span class="headline">Enables search using CSS selectors</span>
<a href="https://github.com/philss/floki">Github page</a>
<span data-model="user">philss</span>
</section>
<a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>
Here are some queries that you can perform (with return examples):
Floki.find(html, "p.headline")
# => [{"p", [{"class", "headline"}], ["Floki"]}]
Floki.find(html, "p.headline")
|> Floki.raw_html
# => <p class="headline">Floki</p>
Floki.find(html, "a[href^=https]")
# => [{"a", [{"href", "https://github.com/philss/floki"}], ["Github page"]},
# => {"a", [{"href", "https://hex.pm/packages/floki"}], ["Hex package"]}]
Floki.find(html, "#content a")
# => [{"a", [{"href", "https://github.com/philss/floki"}], ["Github page"]}]
Floki.find(html, "[data-model=user]")
# => [{"span", [{"data-model", "user"}], ["philss"]}]
Floki.find(html, ".headline:nth-child(1), a")
# => [{"p", [{"class", "headline"}], ["Floki"]},
# => {"a", [{"href", "https://github.com/philss/floki"}], ["Github page"]},
# => {"a", [{"href", "https://hex.pm/packages/floki"}], ["Hex package"]}]
Each HTML node is represented by a tuple like:
{tag_name, attributes, children_nodes}
Example of a node:
{"p", [{"class", "headline"}], ["Floki"]}
So even if the only child node is the element text, it is represented inside a list.
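Because nodes are plain tuples and text children are plain binaries, you can walk a tree with ordinary pattern matching. The following is a minimal sketch, not part of Floki: the `NodeWalk` module and its `tags/1` function are hypothetical names used only for illustration.

```elixir
defmodule NodeWalk do
  # Hypothetical helper (not a Floki function): lists tag names depth-first.

  # A list of nodes: collect the tags of every entry.
  def tags(nodes) when is_list(nodes), do: Enum.flat_map(nodes, &tags/1)

  # An element node: {tag_name, attributes, children_nodes}.
  def tags({tag, _attrs, children}) when is_binary(tag), do: [tag | tags(children)]

  # Text children are binaries; they and any other node shapes contribute no tags.
  def tags(_other), do: []
end

tree =
  {"section", [{"id", "content"}],
   [
     {"p", [{"class", "headline"}], ["Floki"]},
     {"span", [{"data-model", "user"}], ["philss"]}
   ]}

NodeWalk.tags(tree)
# => ["section", "p", "span"]
```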
You can write a simple HTML crawler with Floki and HTTPoison:
html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)
It is as simple as that!
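The pipeline above raises on the first failed request. If you would rather collect failures, a slightly more defensive variant could look like the sketch below. It assumes :floki and :httpoison are in your dependencies. The `Crawler` module, `fetch_pages/2`, and the injectable `fetch` argument are hypothetical names introduced here, not part of either library.

```elixir
defmodule Crawler do
  # Hypothetical module for illustration; assumes :floki and :httpoison are dependencies.

  # Returns a list of {url, result} pairs, where result is
  # {:ok, body} or {:error, reason}. The fetch function can be swapped out in tests.
  def fetch_pages(html, fetch \\ &HTTPoison.get/1) do
    html
    |> Floki.find(".pages a")
    |> Floki.attribute("href")
    # Skip duplicate links so each URL is requested once.
    |> Enum.uniq()
    |> Enum.map(fn url -> {url, fetch_body(url, fetch)} end)
  end

  defp fetch_body(url, fetch) do
    case fetch.(url) do
      {:ok, %{status_code: 200, body: body}} -> {:ok, body}
      {:ok, %{status_code: status}} -> {:error, {:unexpected_status, status}}
      {:error, error} -> {:error, error}
    end
  end
end
```

Calling `Crawler.fetch_pages(html)` uses HTTPoison.get/1. Passing a stub function as the second argument lets you exercise the link extraction without network access.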
Add Floki to your mix.exs:
defp deps do
  [
    {:floki, "~> 0.17.0"}
  ]
end
After that, run mix deps.get.
Floki needs the leex module in order to compile.
Normally this module is installed with Erlang in a complete installation.
If you get this kind of error, you need to install the erlang-dev and erlang-parsetools packages in order to get the leex module.
The package names may differ depending on your OS.
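One quick way to check whether leex is visible to your Erlang installation is to ask Elixir to load it. This is a minimal sketch, assuming you run it in iex or as a script. It only reports availability and does not install anything.

```elixir
# Code.ensure_loaded?/1 returns true when the module can be loaded.
# :leex ships with Erlang's parsetools application.
if Code.ensure_loaded?(:leex) do
  IO.puts("leex is available")
else
  IO.puts("leex is missing; install the Erlang parsetools package for your OS")
end
```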
You can configure Floki to use html5ever as your HTML parser.
This is recommended if you need better performance and a more accurate parser. However, html5ever is under active development and may be unstable.
Since it's written in Rust, we need to install Rust and compile the project. Luckily, the html5ever Elixir NIF makes the integration very easy.
You still need to install Rust on your system. To do that, please follow the instructions on the official page.
After setting up Rust, add the html5ever NIF to your dependency list:
defp deps do
  [
    {:floki, "~> 0.17.0"},
    {:html5ever, "~> 0.3.0"}
  ]
end
Run mix deps.get and compile the project with mix compile to make sure it works.
Then configure your app to use html5ever:
# in config/config.exs
config :floki, :html_parser, Floki.HTMLParser.Html5ever
After that, you can use html5ever as your HTML parser with Floki.
For more info, check the article Rustler - Safe Erlang and Elixir NIFs in Rust.
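To confirm that the parser setting is picked up, you can read it back from the application environment. This is a small sketch, assuming you run it in an iex -S mix session of the project configured above. Application.get_env/2 is standard Elixir and returns nil when the key is not set.

```elixir
# Reads the :html_parser key that config/config.exs sets for the :floki application.
parser = Application.get_env(:floki, :html_parser)

# Prints Floki.HTMLParser.Html5ever when the config line is in effect, nil when the key is unset.
IO.inspect(parser, label: "configured HTML parser")
```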
To parse an HTML document, try:
html = """
<html>
<body>
<div class="example"></div>
</body>
</html>
"""
Floki.parse(html)
# => {"html", [], [{"body", [], [{"div", [{"class", "example"}], []}]}]}
To find elements with the class example, try:
Floki.find(html, ".example")
# => [{"div", [{"class", "example"}], []}]
To convert your node tree back to raw HTML (spaces are ignored):
Floki.find(html, ".example")
|> Floki.raw_html
# => <div class="example"></div>
To fetch an attribute from elements, try:
Floki.attribute(html, ".example", "class")
# => ["example"]
You can get attributes from elements that you already have:
Floki.find(html, ".example")
|> Floki.attribute("class")
# => ["example"]
If you want to get the text from an element, try (using the HTML from the first example):
Floki.find(html, "p.headline")
|> Floki.text
# => "Floki"
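When a selection contains several elements, Floki.text returns a single string for the whole selection. To get one string per element, map over the nodes instead. The sketch below assumes :floki is a dependency and uses the two headline nodes from the first example, written out as literals.

```elixir
# The two headline nodes from the first example, as Floki represents them.
nodes = [
  {"p", [{"class", "headline"}], ["Floki"]},
  {"span", [{"class", "headline"}], ["Enables search using CSS selectors"]}
]

# One string for the whole selection (texts are concatenated).
Floki.text(nodes)
# => "FlokiEnables search using CSS selectors"

# One string per element.
Enum.map(nodes, &Floki.text/1)
# => ["Floki", "Enables search using CSS selectors"]
```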
Floki is released under the MIT license. Check the LICENSE file for more details.