RegexPath PoC

Proof of concept for translating easily written, XPath-like expressions into regular expressions, for more efficient web crawling. The project was created because lxml is comparatively slow, specifically when generating the ElementTree it works on.

Warning: This library is intended mostly as a fun experiment; please do not use it in production.

The behaviour should be the following:

'//h3/a' -> '<h3[^>]*>(<a[^>]*>[^<]*</a>)</h3>'
'//h1[@itemprop="name"]' -> '<h1[^>]*itemprop="name"[^>]*>[^<]*</h1>'
'//a[contains(@class,'checkBookDownloaded')]/@href' -> '<a[^>]*class="[^"]*checkBookDownloaded[^"]*"[^>]*href="([^"]*)">'
'//div[contains(@class,'property_year')]/div[contains(@class,'property_value')]' -> '<div[^>]*class="[^"]*property_year[^"]*"[^>]*>(<div[^>]*class="[^"]*property_value[^"]*"[^>]*>[^<]*</div>)</div>'
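As a quick check of the mappings above, the following sketch applies two of the listed patterns with Python's standard `re` module. The sample HTML string and the variable names are made up for illustration; only the two regex strings come from the list.

```python
import re

# Illustrative HTML snippet (not from the project).
html = (
    '<h1 class="title" itemprop="name">Dune</h1>'
    '<a class="btn checkBookDownloaded" href="/dl/42">Get</a>'
)

# Pattern listed for: //h1[@itemprop="name"]
h1_pattern = re.compile(r'<h1[^>]*itemprop="name"[^>]*>[^<]*</h1>')

# Pattern listed for: //a[contains(@class,'checkBookDownloaded')]/@href
href_pattern = re.compile(
    r'<a[^>]*class="[^"]*checkBookDownloaded[^"]*"[^>]*href="([^"]*)">'
)

h1_match = h1_pattern.search(html)
href_match = href_pattern.search(html)

print(h1_match.group(0))    # <h1 class="title" itemprop="name">Dune</h1>
print(href_match.group(1))  # /dl/42
```

Note that the second pattern only matches when `class` appears before `href` and `href` is the last attribute in the tag, which is one reason this approach is best treated as an experiment.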

Singleton (void) elements such as <br> or <img>, which have no closing tag, may pose an issue.
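One way this can go wrong, sketched under the assumption that a generated pattern follows the same `<tag[^>]*>[^<]*</tag>` shape as the examples above: an element without a closing tag never matches. The `void_aware` pattern is a hypothetical workaround, not something the library is known to produce.

```python
import re

# Illustrative HTML with two elements that have no closing tag.
html = '<div><img src="cover.png" alt="Cover"><br></div>'

# Shape used by the examples above: expects an explicit closing tag.
paired = re.compile(r'<img[^>]*>[^<]*</img>')

# Hypothetical alternative: match only the opening tag, with optional "/>".
void_aware = re.compile(r'<img[^>]*/?>')

paired_match = paired.search(html)
void_match = void_aware.search(html)

print(paired_match)         # None, because there is no </img>
print(void_match.group(0))  # <img src="cover.png" alt="Cover">
```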

Usage

The idea is that this library can replace xpath calls to an etree, so instead of an lxml.html etree we can use RegexPath. Calling etree.xpath() converts the XPath expression into a regex designed for HTML documents and then applies it to the document. For example:

from regexpath import RegexPath
test_string = """<script type="text/javascript" src="test_files/typeahead.js"></script>
                <script type="text/javascript" src="test_files/bootstrap-tagsinput.js"></script>
                <script type="text/javascript" src="test_files/jquery.js"></script>
                <script type="text/javascript" src="test_files/z-booklists-carousel.js"></script>
                <script type="text/javascript" src="test_files/z-readlist-card.js"></script>
                <script type="text/javascript" src="test_files/book-details.js"></script>"""
etree = RegexPath(test_string)
etree.xpath("//script[contains(@src, 'jquery')]")
# returns '<script type="text/javascript" src="test_files/jquery.js"></script>'
etree.xpath("//script[contains(@src, 'jquery')]/@src")
# returns "test_files/jquery.js"
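To make the conversion step more concrete, here is a minimal sketch of how a single-step expression could be turned into a regex. It is not the library's implementation: `STEP` and `xpath_to_regex` are hypothetical names, and only `//tag`, one optional `contains(@attr, '...')` or `@attr="..."` predicate, and an optional trailing `/@attr` are handled.

```python
import re

# Hypothetical grammar for one step: //tag[predicate]/@attr
STEP = re.compile(
    r"""^//(?P<tag>\w+)
        (?:\[(?:contains\(@(?P<cattr>[\w-]+),\s*['"](?P<cval>[^'"]*)['"]\)
               |@(?P<eattr>[\w-]+)=['"](?P<eqval>[^'"]*)['"])\])?
        (?:/@(?P<out>[\w-]+))?$""",
    re.VERBOSE,
)


def xpath_to_regex(xpath: str) -> str:
    """Translate a single-step XPath-like expression into a regex string."""
    m = STEP.match(xpath)
    if m is None:
        raise ValueError(f"unsupported expression: {xpath!r}")
    tag, out = m["tag"], m["out"]

    # Work out which attribute is constrained and the regex for its value.
    if m["cattr"]:
        attr = m["cattr"]
        value = '[^"]*' + re.escape(m["cval"]) + '[^"]*'
    elif m["eattr"]:
        attr = m["eattr"]
        value = re.escape(m["eqval"])
    else:
        attr, value = None, None

    pred = f'[^>]*{attr}="{value}"' if attr else ''
    if out is None:
        # Select the whole element, including its text and closing tag.
        return f'<{tag}{pred}[^>]*>[^<]*</{tag}>'
    if out == attr:
        # The selected attribute is the constrained one: capture its value.
        return f'<{tag}[^>]*{attr}="({value})"[^>]*>'
    # Assumes the selected attribute comes after the constrained one.
    return f'<{tag}{pred}[^>]*{out}="([^"]*)"[^>]*>'


html = (
    '<script type="text/javascript" src="test_files/typeahead.js"></script>\n'
    '<script type="text/javascript" src="test_files/jquery.js"></script>'
)
element = re.search(xpath_to_regex("//script[contains(@src, 'jquery')]"), html)
src = re.search(xpath_to_regex("//script[contains(@src, 'jquery')]/@src"), html)
print(element.group(0))
print(src.group(1))
```

The sketch keeps the attribute-order assumption visible in the last branch; a more robust translator would need lookaheads or a real parser, which is where the speed advantage over building a tree starts to shrink.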

Development

In order to install development dependencies run:

pip install -r requirements.txt

In order to run tests locally:

pytest -k ''

Licence

MIT license

Authors

regexpath was written by Jesse Constante.
