doc/adr/0021-xml-with-no-docx.md
+25 -16
@@ -1,4 +1,4 @@
-# 21. XML with no docx
+# 21. XML format for documents where source text cannot be extracted by the parser

 Date: 2024-11-20

@@ -8,7 +8,7 @@ Draft

 ## Context

-We need to support documents which do not have a source DOCX. They will typically be PDF-only, although we're imagining there might be other formats (zip files full of jpegs).
+We need to support documents where the structured contents of the document cannot necessarily be extracted by the parser. They will typically be PDF-only, although we're imagining there might be other formats (such as zip files full of jpegs).

 These documents might not have neutral citation numbers -- this is outside the scope of this document.

@@ -24,43 +24,52 @@ The emitted XML will validate against our modified AkomaNtoso schema

 Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.

-The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.
+The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment; however, `akn:` namespaced tags are forbidden, so the `uk:` namespace MUST be used instead.

-Editors SHOULD be able to edit the external metadata and the Editor UI have affordances to do so.
+Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.

-### Marked as Low Quality
+Editors SHOULD be able to edit these metadata values within the `proprietary` tag and the Editor UI SHOULD have affordances to do so.

-The `proprietary` tag will contain a `uk:source-document` tag, with attributes:
+### Marked as Low Quality

-TODO: I think this part wants tearing apart carefully.
+The `proprietary` tag will contain a `uk:source-document` tag, containing two fractional attributes:

-Each of these metrics defaults to `1`: we have no reason to doubt the quality of the document. Values of `0` imply a complete lack of quality, or an absence of that feature. Values below `0.5` should be considered poor, and mitigations applied. Values above `1` might be signs of additional verification.
+Each of these numerical metrics defaults to `1.0`: we have no reason to doubt the quality of the document. Values of `0.0` imply a complete lack of quality, or an absence of that feature. Values below `0.5` should be considered poor, and mitigations applied. Values above `1.0` might be signs of additional verification.

 - `uk:markup-quality-score`: Is the body of the document marked up well with appropriate HTML and AkomaNtoso tags?

 - `uk:text-quality-score`: Are all the words believed to be in the right order, with appropriate interword spacing? (Linebreaks optional)

-- `uk:parsed-format`: default `docx`, even if the docx isn't original. `pdf` an obvious value.
+There is also:

-The tag SHALL contain human-readable text warning that the XML is low quality and should not be relied upon.
+- `uk:source-document-format`: the MIME type of the document from which this XML was generated. For Rich Text
+documents, and others where the document was converted to docx first, the MIME type will still be `application/vnd.openxmlformats-officedocument.wordprocessingml.document`.

-### Purposes for Low Quality Marks
+- `uk:markup-human-reviewed`: boolean -- An editor has reviewed the document. This will not be added by any parser, but may be added in the EUI.

-#### Deprioritised in Search
+- `uk:quality-warning`: Human-readable text warning that the XML is low quality and should not be relied upon.

-Since the risk of documents having bad search experiences due to broken OCR / text flow issues, documents with low quality scores should be deprioritised in search.
+### Purposes for Low Quality Marks

 #### Rejected for Impossible Tasks

-Documents with no DOCX cannot be reparsed, so SHOULD be excluded from the list of reparse candidates, SHOULD not be reparsable in the UI and MUST not be sent for reparsing.
+Documents with no DOCX cannot be reparsed via the existing framework, so SHOULD be excluded from the list of reparse candidates, SHOULD NOT be reparsable in the UI and MUST NOT be sent for reparsing.
+
+Documents without text SHOULD NOT be sent to enrichment.

 #### Hidden Information

-The Public UI WILL NOT reveal the body of the XML, including HTML transforms, to users who have not opted in to receiving a poor quality document. It may reveal suitable subsections.
+The Atom Feed MAY allow searching for only documents with sufficient quality scores.
+
+The Public UI SHOULD NOT attempt to display XML or HTML transforms of XML to users where the text and/or markup quality scores are insufficient (0.5 is probably the threshold).
+
+XML representations MAY be available to users who explicitly request them but SHOULD NOT be routinely displayed
+in the PUI.

 #### Warning the user

-The PUI MAY flag that a document exists only as a PDF so they are not surprised by the absence of a HTML version.
+The Public UI SHOULD indicate to users that a document is not available as HTML and instead is only available
+as the original source document (or other compiled artifact in the future).
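
As an illustration of how a consumer might apply the quality-score and reparse rules in this change, here is a minimal Python sketch. It is not part of the ADR. It assumes that the two scores and `uk:source-document-format` are stored as `uk:`-namespaced attributes on `uk:source-document`, that a missing format means the source was a DOCX, and that `urn:example:uk` stands in for the real namespace URI. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI for the `uk:` prefix (an assumption, not taken from the ADR).
UK_NS = "urn:example:uk"
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
DEFAULT_SCORE = 1.0   # each metric defaults to 1.0
POOR_THRESHOLD = 0.5  # values below this are considered poor


def _attr(element, name):
    """Read a uk:-namespaced attribute, or None if it is absent."""
    return element.get(f"{{{UK_NS}}}{name}")


def quality_scores(source_document):
    """Return (markup, text) scores; a missing score falls back to the default."""
    def read(name):
        raw = _attr(source_document, name)
        return DEFAULT_SCORE if raw is None else float(raw)
    return read("markup-quality-score"), read("text-quality-score")


def should_display_html(source_document):
    """Display HTML only when neither score is below the threshold."""
    markup, text = quality_scores(source_document)
    return min(markup, text) >= POOR_THRESHOLD


def can_reparse(source_document):
    """Treat only DOCX-sourced documents as reparse candidates."""
    # Assumption: a missing format attribute means the source was a DOCX.
    fmt = _attr(source_document, "source-document-format") or DOCX_MIME
    return fmt == DOCX_MIME


# Hypothetical PDF-only document with poor text quality.
pdf_only = ET.fromstring(
    f'<uk:source-document xmlns:uk="{UK_NS}" '
    'uk:text-quality-score="0.2" '
    'uk:source-document-format="application/pdf"/>'
)
print(quality_scores(pdf_only))
print(should_display_html(pdf_only))
print(can_reparse(pdf_only))
```

For the PDF-only sample, the sketch reports the document as neither displayable as HTML nor eligible for reparsing, while an element with no attributes at all would pass both checks because of the defaults.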