Share HTML and XML modelling with mxj and fq? #14

Open
wader opened this issue Jul 22, 2024 · 4 comments

wader commented Jul 22, 2024

Would it make sense to share how XML and HTML are modelled with fq and mxj? I think this is the closest thing there is to a spec for it: https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html

$ go run . -i html <<< '<html><b>111</b><b>222</b><a href="url">333</a><html>'
{
  "html": {
    "body": {
      "a": {
        "attr": {
          "href": "url"
        },
        "data": "333"
      },
      "b": [
        {
          "data": "111"
        },
        {
          "data": "222"
        }
      ]
    }
  }
}

$ fq <<< '<html><b>111</b><b>222</b><a href="url">333</a><html>'
{
  "html": {
    "body": {
      "a": {
        "#text": "333",
        "@href": "url"
      },
      "b": [
        "111",
        "222"
      ]
    },
    "head": ""
  }
}
Owner

JFryy commented Jul 27, 2024

Thanks for the input here. I would much rather adhere to this standard; I like it better for consistency's sake and for the general brevity of omitting a “data” key. Sorry, I didn’t see this until recently, but I will resolve it and publish a release once it is patched.

Author

wader commented Jul 28, 2024

Yeah, I agree the modelling is almost a bit too terse. fq does support another mode where it models XML and HTML as nested ["element", {attributes}, [children...]] arrays, which is less lossy but a bit of a pain to query.

$ fq -o array=true <<< '<html><b>111</b><b>222</b><a href="url">333</a><html>'
[
  "html",
  null,
  [
    [
      "head",
      null,
      []
    ],
    [
      "body",
      null,
      [
        [
          "b",
          {
            "#text": "111"
          },
          []
        ],
        [
          "b",
          {
            "#text": "222"
          },
          []
        ],
        [
          "a",
          {
            "#text": "333",
            "href": "url"
          },
          []
        ]
      ]
    ]
  ]
]

Owner

JFryy commented Jul 28, 2024

Got it. Yeah, I really like how this is handled in fq and would have emulated it if I had known better at the time. The PR linked below is only a draft for now because I need to add some more tests and clean up the messy branch, but its changes should get this back in line with the standard. Thanks very much for the feedback here @wader!

#16

JFryy self-assigned this Jul 28, 2024
Owner

JFryy commented Jul 29, 2024

I am going to leave this open since #16 still needs follow-up testing to confirm the fix. The change was bundled with a few other things that needed to go into a new release, and it should mostly cover this, with a few improvements left to make. Some minor edge cases still occur for more complex HTML, and they can be demonstrated in test cases.
