Use lxml instead of ugly HTMLParser #3

Open
albertmeronyo opened this issue Jun 24, 2014 · 6 comments

Comments

@albertmeronyo
Member

No description provided.

@cgueret
Member

cgueret commented Jun 24, 2014

I'm a big fan of BeautifulSoup, even nicer than lxml. I could show you ;-)

@albertmeronyo
Member Author

lxml is gloriously victorious in all the benchmarks I read yesterday, beating any other HTML/XML parser/docstructurer/serializer :-P
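
Since the issue proposes replacing HTMLParser with lxml, here is a minimal sketch of what the lxml side could look like; the markup is made up purely for illustration:

    from lxml import html, etree

    # lxml covers parsing, tree navigation and serialization in one package
    doc = html.fromstring("<html><body><h1>HISCO</h1><p>occupational codes</p></body></html>")
    print(doc.findtext('.//h1'))                            # query -> 'HISCO'
    print(etree.tostring(doc, pretty_print=True).decode())  # serialize back to HTML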

@cgueret
Member

cgueret commented Jun 24, 2014

Ok! Enjoy it then :-)

@albertmeronyo
Member Author

Although, from what I'm reading now, lxml can be used as the underlying parser for BeautifulSoup, which would make it more convenient for malformed HTML (which is likely to be the case for the HISCO website; we should talk with Richard to find out whether it is human- or machine-generated).

A.
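
A minimal sketch of that combination, i.e. BeautifulSoup with lxml selected as its underlying parser (the markup below is made up, and the lxml package has to be installed for this to work):

    from bs4 import BeautifulSoup

    # Deliberately malformed markup; passing 'lxml' tells BeautifulSoup to
    # delegate the actual parsing to lxml, which copes well with broken HTML
    broken = "<html><body><h1>HISCO</h1><p>an unclosed paragraph"
    soup = BeautifulSoup(broken, 'lxml')
    print(soup.h1.string)  # -> 'HISCO'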

@cgueret
Member

cgueret commented Jun 24, 2014

Here is a one-liner to find the first h1 in a file with BeautifulSoup:

    title = BeautifulSoup(response.read()).find_all('h1')[0].string

As seen in https://github.com/CEDAR-project/Integrator/blob/master/src/get_data_from_easy.py. Pretty neat, isn't it? You can let it take care of all the visitor code for lxml and focus on looking for tags with simpler commands...
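
For comparison, a rough lxml-only equivalent of the same lookup; 'page.html' is a hypothetical local file used only for illustration:

    from lxml import html

    # Parse a saved page and grab the text of the first h1, if any
    tree = html.parse('page.html')
    headings = tree.xpath('//h1/text()')
    title = headings[0] if headings else None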


@RinkeHoekstra

+1 for BeautifulSoup!

It’s beautiful.. eh.. soup!

Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants