Use lxml instead of ugly HTMLParser #3

Open
albertmeronyo opened this issue Jun 24, 2014 · 6 comments

Comments

@albertmeronyo
Member

No description provided.

@cgueret
Member

cgueret commented Jun 24, 2014

I'm a big fan of BeautifulSoup, even nicer than lxml. I could show you ;-)

@albertmeronyo
Member Author

lxml is gloriously victorious in all the benchmarks I read yesterday, beating any other HTML/XML parser/docstructurer/serializer :-P
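
Since the issue proposes replacing HTMLParser with lxml, here is a minimal sketch of what the lxml side could look like; the markup is made up purely for illustration:

    from lxml import html, etree

    # lxml covers parsing, tree navigation and serialization in one package
    doc = html.fromstring("<html><body><h1>HISCO</h1><p>occupational codes</p></body></html>")
    print(doc.findtext('.//h1'))                            # query -> 'HISCO'
    print(etree.tostring(doc, pretty_print=True).decode())  # serialize back to HTML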

@cgueret
Member

cgueret commented Jun 24, 2014

Ok! Enjoy it then :-)

@albertmeronyo
Member Author

Although, from what I'm reading now, lxml can be used as the underlying parser for BeautifulSoup, which would make it more convenient for malformed HTML (which is likely to be the case for the HISCO website; we should talk with Richard to find out whether it is human- or machine-generated).

A.
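
A minimal sketch of that combination, i.e. BeautifulSoup with lxml selected as its underlying parser (the markup below is made up, and the lxml package has to be installed for this to work):

    from bs4 import BeautifulSoup

    # Deliberately malformed markup; passing 'lxml' tells BeautifulSoup to
    # delegate the actual parsing to lxml, which copes well with broken HTML
    broken = "<html><body><h1>HISCO</h1><p>an unclosed paragraph"
    soup = BeautifulSoup(broken, 'lxml')
    print(soup.h1.string)  # -> 'HISCO'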

@cgueret
Member

cgueret commented Jun 24, 2014

Here is a one-liner to find the first h1 in a file with BeautifulSoup:

    title = BeautifulSoup(response.read()).find_all('h1')[0].string

As seen in https://github.com/CEDAR-project/Integrator/blob/master/src/get_data_from_easy.py. Pretty neat, isn't it? You can let it take care of all the visitor code for lxml and focus on looking for tags with simpler commands...
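
For comparison, a rough lxml-only equivalent of the same lookup; 'page.html' is a hypothetical local file used only for illustration:

    from lxml import html

    # Parse a saved page and grab the text of the first h1, if any
    tree = html.parse('page.html')
    headings = tree.xpath('//h1/text()')
    title = headings[0] if headings else None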


@RinkeHoekstra

+1 for BeautifulSoup!

It’s beautiful.. eh.. soup!

Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants