Stripper
Stripper is an Elixir package for normalizing input from unpredictable sources (such as web scraping), useful as a pre-processing step in ETL pipelines for machine learning or data analysis. It is parser-based rather than regular-expression-based, so it does all its work in a single pass over the input, which should keep it fast.
Why the name? Because it describes the purpose and it's memorable -- get over it ;)
Normalizing whitespace:
iex> Stripper.Whitespace.normalize!(" random\tstuff\fI scraped\t\t\tfrom\nthe web\n\n")
"random stuff I scraped from the web"
This reduces all Unicode whitespace and separator characters to the humble space and collapses runs of multiple spaces into one. As the example shows, leading and trailing whitespace is removed as well.
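To illustrate what "one pass" can mean for whitespace collapsing, here is a minimal sketch that walks a binary once with pattern matching. It is not Stripper's actual implementation: the module name WhitespaceSketch, the function collapse/1, and the short list of whitespace code points are all assumptions made for this example, and the real package covers more Unicode separators than listed here. The sketch also assumes the input is valid UTF-8.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical module for illustration only; not part of Stripper.

  # A small, deliberately incomplete set of whitespace code points:
  # space, tab, line feed, vertical tab, form feed, carriage return,
  # no-break space, line separator, paragraph separator.
  @whitespace [0x20, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0xA0, 0x2028, 0x2029]

  # Collapses whitespace runs to a single space and trims both ends,
  # visiting each code point exactly once.
  def collapse(input) when is_binary(input) do
    input
    |> do_collapse([], :skip)
    |> IO.iodata_to_binary()
  end

  # States:
  #   :skip    - nothing emitted yet, so whitespace is dropped
  #   :text    - the last code point emitted was not whitespace
  #   :pending - whitespace was seen after text; emit one space
  #              only if more text follows
  defp do_collapse(<<>>, acc, _state), do: Enum.reverse(acc)

  defp do_collapse(<<char::utf8, rest::binary>>, acc, state)
       when char in @whitespace do
    next = if state == :skip, do: :skip, else: :pending
    do_collapse(rest, acc, next)
  end

  # Text after a whitespace run: emit the single separating space first.
  defp do_collapse(<<char::utf8, rest::binary>>, acc, :pending) do
    do_collapse(rest, [<<char::utf8>>, " " | acc], :text)
  end

  defp do_collapse(<<char::utf8, rest::binary>>, acc, _state) do
    do_collapse(rest, [<<char::utf8>> | acc], :text)
  end
end

WhitespaceSketch.collapse(" random\tstuff\fI scraped\t\t\tfrom\nthe web\n\n")
# => "random stuff I scraped from the web"
```

Deferring the space until the next non-whitespace character arrives is what lets the sketch trim trailing whitespace without a second pass or any backtracking.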
Simplifying quotes:
iex> Stripper.Quotes.normalize!(~S|‘make’ «it» „stop“|)
"'make' \"it\" \"stop\""
See the online documentation for more information.
If available in Hex, the package can be installed by adding stripper to your list of dependencies in mix.exs:
def deps do
  [
    {:stripper, "~> 1.4.0"}
  ]
end
See the Contributing Guidelines for more information.
The logo image is "wire strippers" by Designs by MB from the Noun Project.