Skip to content

Convert HTML to JSON. Can also (intelligently) convert HTML tables to JSON (using table headers (if available) as keys in the resulting JSON).

License

Notifications You must be signed in to change notification settings

crawlab-team/html-to-json-enhanced

 
 

Repository files navigation

HTML to JSON

PyPI Build Status codecov

Convert HTML and/or HTML tables to JSON.

Installation

pip install html-to-json

Usage

HTML to JSON

import html_to_json_enhanced

html_string = """<head>
    <title>Test site</title>
    <meta charset="UTF-8"></head>"""
output_json = html_to_json_enhanced.convert(html_string)
print(output_json)

When calling the html_to_json.convert function, you can choose to not capture the text values from the html by passing in the key-word argument capture_element_values=False. You can also choose to not capture the attributes of the elements by passing capture_element_attributes=False into the function.

Example

Example input:

<head>
    <title>Floyd Hightower's Projects</title>
    <meta charset="UTF-8">
    <meta name="description" content="Floyd Hightower&#39;s Projects">
    <meta name="keywords" content="projects,fhightower,Floyd,Hightower">
</head>

Example output:

{
    "head": [
    {
        "title": [
        {
            "_value": "Floyd Hightower's Projects"
        }],
        "meta": [
        {
            "_attributes":
            {
                "charset": "UTF-8"
            }
        },
        {
            "_attributes":
            {
                "name": "description",
                "content": "Floyd Hightower's Projects"
            }
        },
        {
            "_attributes":
            {
                "name": "keywords",
                "content": "projects,fhightower,Floyd,Hightower"
            }
        }]
    }]
}

HTML Tables to JSON

In addition to converting HTML to JSON, this library can also intelligently convert HTML tables to JSON.

Currently, this library can handle three types of tables:

A. Those with table headers in the first row B. Those with table headers in the first column C. Those without table headers

Tables of type A and B are diagrammed below:

This package can handle tables with the headers in the first row or headers in the first column

Example

This code:

import html_to_json_enhanced

html_string = """<table>
    <tr>
        <th>#</th>
        <th>Malware</th>
        <th>MD5</th>
        <th>Date Added</th>
    </tr>

    <tr>
        <td>25548</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/034a37b2a2307f876adc9538986d7b86">034a37b2a2307f876adc9538986d7b86</a></td>
        <td>July 9, 2018, 6:25 a.m.</td>
    </tr>

    <tr>
        <td>25547</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/706eeefbac3de4d58b27d964173999c3">706eeefbac3de4d58b27d964173999c3</a></td>
        <td>July 7, 2018, 6:25 a.m.</td>
    </tr></table>"""
tables = html_to_json_enhanced.convert_tables(html_string)
print(tables)

will produce this output:

[
    [
        {
            "#": "25548",
            "Malware": "DarkComet",
            "MD5": "034a37b2a2307f876adc9538986d7b86",
            "Date Added": "July 9, 2018, 6:25 a.m."
        }, {
            "#": "25547",
            "Malware": "DarkComet",
            "MD5": "706eeefbac3de4d58b27d964173999c3",
            "Date Added": "July 7, 2018, 6:25 a.m."
        }
    ]
]

Credits

This package was created with Cookiecutter and fhightower's Python project template.

About

Convert HTML to JSON. Can also (intelligently) convert HTML tables to JSON (using table headers (if available) as keys in the resulting JSON).

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • HTML 54.0%
  • Python 45.7%
  • Other 0.3%