feat: introduce HTML meta data extraction
shah committed Oct 15, 2023
1 parent 40c1006 commit 32f592c
Showing 2 changed files with 58 additions and 18 deletions.
29 changes: 21 additions & 8 deletions pattern/content-aide/README.md
@@ -1,9 +1,9 @@
 # Content Aide

 This SQLite-based pattern provides tables and queries which collectively manage
-and storing metadata related to file system content, MIME types, devices, and
-file system content walk sessions. They include various fields for timestamps,
-user information, and JSON data for additional details and elaboration.
+and store content and metadata related to files, MIME types, devices, and file
+system content walk sessions. They include various fields for timestamps, user
+information, and JSON data for additional details and elaboration.

 - `mime_type`: Stores MIME type information, including a ULID primary key and
 various attributes like name, description, file extension, timestamps, and
@@ -84,10 +84,6 @@ Others to consider:

 ## Testing

-Until is tests are fully automated, use
-[RunMe](https://marketplace.visualstudio.com/items?itemName=stateful.runme) via
-Visual Studio Code to execute the commands.
-
 Scan the current directory for all files and store them into
 `device-content.sqlite.db` (this is idempotent, by default it ignores `.git` and
 `node_modules` directories):
@@ -96,10 +92,16 @@ Scan the current directory for all files and store them into
 $ ./cactl.ts
 ```

+See the contents with [SQLpage](https://github.com/lovasoa/SQLpage):
+
+```bash
+DATABASE_URL=sqlite://./device-content.sqlite.db sqlpage.bin
+```
+
 Show the stats:

 ```bash
-$ ./cactl.ts sql contentStats | sqlite3 device-content.sqlite.db --table
+$ ./cactl.ts sql fsContentWalkSessionStats | sqlite3 device-content.sqlite.db --table
 ```

 Show all the HTML anchors in all HTML files:
@@ -115,6 +117,17 @@ $ ./cactl.ts sql allHtmlAnchors | sqlite3 device-content.sqlite.db --json
 - [ ] Figure out what to do about symlinks
 - [ ] Figure out what to do when fileio_read cannot read larger than 1,000,000
 bytes for hash, etc.
+- [ ] See [simon987/sist2](https://github.com/simon987/sist2) for other ideas
+like:
+- [ ] Extracts text and metadata from
+[common file types](https://github.com/simon987/sist2#format-support)
+- [ ] [Generates thumbnails](https://github.com/simon987/sist2#format-support)
+- [ ] Manual tagging from the UI and automatic tagging based on file
+attributes via
+[user scripts](https://github.com/simon987/sist2/blob/master/docs/scripting.md)
+- [ ] Recursive scan inside
+[archive files](https://github.com/simon987/sist2#archive-files)
+- [ ] [Named-entity recognition](https://github.com/simon987/sist2#NER)

 ## ULID Primary Keys across multiple devices

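Note on the README's Testing section: it states that the scan ignores `.git` and `node_modules` directories by default. A minimal sketch of what such a filter could look like follows. `isIgnoredPath` and `DEFAULT_IGNORED_DIRS` are hypothetical names chosen for illustration; they are not taken from `cactl.ts`, whose actual ignore logic is not shown in this commit.

```typescript
// Hypothetical defaults mirroring the two directories named in the README.
const DEFAULT_IGNORED_DIRS: readonly string[] = [".git", "node_modules"];

// Returns true when any segment of the relative path is an ignored directory.
// Whole segments are compared, so "docs/my.git.txt" is not treated as ".git".
function isIgnoredPath(
  relPath: string,
  ignoredDirs: readonly string[] = DEFAULT_IGNORED_DIRS,
): boolean {
  // Normalize Windows separators before splitting into segments.
  const segments = relPath.replace(/\\/g, "/").split("/");
  return segments.some((segment) => ignoredDirs.includes(segment));
}

const candidates = [
  "README.md",
  ".git/config",
  "src/node_modules/x/index.js",
  "docs/my.git.txt",
];
const kept = candidates.filter((p) => !isIgnoredPath(p));
console.log(kept); // [ "README.md", "docs/my.git.txt" ]
```

Matching on whole path segments rather than substrings keeps ordinary files whose names merely contain `.git` in the scan.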
47 changes: 37 additions & 10 deletions pattern/content-aide/notebook.sqla.ts
@@ -362,7 +362,9 @@ export function library<EmitContext extends SQLa.SqlEmitContext>(libOptions: {
 -- this second pass walks the path again and connects all found files to the immutable fs_content
 -- table; this is necessary so that if any files were removed in a subsequent session, the
 -- immutable fs_content table would still contain the file for history but it would not show up in
--- fs_content_walk_path_entry
+-- fs_content_walk_path_entry;
+-- NOTE: we denormalize and compute using path_dirname, path_basename, path_extension, etc. so that
+-- the ulid(), path_*, and other extensions are only needed on inserts, not reads.
 INSERT INTO ${fscwpe.tableName} (fs_content_walk_path_entry_id, walk_session_id, walk_path_id, fs_content_id, file_path_abs, file_path_rel_parent, file_path_rel, file_basename, file_extn)
 SELECT ulid() as fs_content_walk_path_entry_id,
 fscwp.walk_session_id as walk_session_id,
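
Note on the denormalization comment in this hunk: the path-derived columns are computed once at insert time so that later reads need no path extension. The sketch below shows the same idea in plain TypeScript. `derivePathColumns` and `PathColumns` are hypothetical names, the property names are copied from the `INSERT` column list above, and the reading of `file_path_rel_parent` as "parent directory relative to the walk root" is an assumption. It assumes POSIX-style `/` separators and keeps the leading dot in the extension, as the `file_extn = '.html'` filter elsewhere in this file suggests.

```typescript
// Hypothetical row shape: property names mirror the INSERT column list.
interface PathColumns {
  file_path_abs: string;
  file_path_rel_parent: string;
  file_path_rel: string;
  file_basename: string;
  file_extn: string;
}

// Derive every path column once, at insert time, from the walk root and the
// absolute file path. Assumes "/" separators (an assumption of this sketch).
function derivePathColumns(rootPath: string, absPath: string): PathColumns {
  const root = rootPath.endsWith("/") ? rootPath : rootPath + "/";
  if (!absPath.startsWith(root)) {
    throw new Error(`${absPath} is not under ${rootPath}`);
  }
  const rel = absPath.slice(root.length);
  const lastSlash = rel.lastIndexOf("/");
  const basename = lastSlash >= 0 ? rel.slice(lastSlash + 1) : rel;
  const parent = lastSlash >= 0 ? rel.slice(0, lastSlash) : "";
  // A leading dot (".gitignore") is part of the name, not an extension.
  const dot = basename.lastIndexOf(".");
  const extn = dot > 0 ? basename.slice(dot) : "";
  return {
    file_path_abs: absPath,
    file_path_rel_parent: parent,
    file_path_rel: rel,
    file_basename: basename,
    file_extn: extn,
  };
}

const row = derivePathColumns("/home/user/site", "/home/user/site/docs/index.html");
console.log(row.file_path_rel, row.file_basename, row.file_extn);
// docs/index.html index.html .html
```

Because the derived values are stored with the row, a reader can filter on `file_extn` or `file_basename` with a stock SQLite build.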
@@ -501,7 +503,7 @@ export function library<EmitContext extends SQLa.SqlEmitContext>(libOptions: {
 },
 });

-const allHtmlAnchors = (): SqlTextSupplier => ({
+const htmlAnchors = (): SqlTextSupplier => ({
 SQL: (ctx) => {
 const { loadExtnSQL: load } = libOptions;
 // deno-fmt-ignore
@@ -510,21 +512,45 @@ export function library<EmitContext extends SQLa.SqlEmitContext>(libOptions: {
 -- find all HTML files in the fs_content table and return
 -- each file and the anchors' labels and hrefs in that file
 -- TODO: create a table called fs_content_html_anchor to store this data after inserting it into fs_content
+-- so that simple HTML lookups do not require the html0 extension to be loaded
 WITH html_content AS (
-SELECT content, content_digest FROM fs_content WHERE file_extn = '.html'
+SELECT fs_content_id, content, content_digest, file_path, file_extn FROM fs_content WHERE file_extn = '.html'
 ),
 html AS (
-SELECT content_digest,
+SELECT file_path,
 text as label,
 html_attribute_get(html, 'a', 'href') as href
 FROM html_content, html_each(html_content.content, 'a')
+)
+SELECT * FROM html;
+`.SQL(ctx);
+},
+});
+
+const htmlHeadMeta = (): SqlTextSupplier => ({
+SQL: (ctx) => {
+const { loadExtnSQL: load } = libOptions;
+// deno-fmt-ignore
+return SQL()`
+${load("asg017/html/html0")}
+-- find all HTML files in the fs_content table and return
+-- each file and the <head><meta name="key" content="value"> pair
+-- TODO: create a table called fs_content_html_head_meta to store this data after inserting it into fs_content
+-- so that simple HTML lookups do not require the html0 extension to be loaded
+WITH html_content AS (
+SELECT fs_content_id, content, content_digest, file_path, file_extn FROM fs_content WHERE file_extn = '.html'
 ),
-file as (
-SELECT fs_content_path_id, file_path, label, href
-FROM fs_content_path, html
-WHERE fs_content_path.content_digest = html.content_digest
+html AS (
+SELECT file_path,
+html_attribute_get(html, 'meta', 'name') as key,
+html_attribute_get(html, 'meta', 'content') as value,
+html
+FROM html_content, html_each(html_content.content, 'head meta')
+WHERE key IS NOT NULL
 )
-SELECT * FROM file;
+SELECT * FROM html;
 `.SQL(ctx);
 },
 });
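
Note on `htmlHeadMeta`: the query selects every `meta` element under `head` and keeps only those with a `name` attribute, returning `name` as `key` and `content` as `value`. The sketch below mimics that result shape in plain TypeScript so the expected rows are easy to picture. It is an illustration only: it uses regular expressions instead of the html0 extension, `extractHeadMeta` and `attr` are hypothetical helpers, and regex-based parsing will not handle every valid HTML document.

```typescript
// One row per <meta> tag, mirroring the key/value columns of the query.
interface HeadMeta {
  key: string;
  value: string | null;
}

// Read a quoted attribute value from a single tag, or null when absent.
function attr(tag: string, name: string): string | null {
  const m = new RegExp(
    `\\s${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`,
    "i",
  ).exec(tag);
  return m ? (m[1] ?? m[2] ?? "") : null;
}

// Collect name/content pairs from <meta> tags inside <head> only.
function extractHeadMeta(html: string): HeadMeta[] {
  const head = /<head(?:\s[^>]*)?>([\s\S]*?)<\/head>/i.exec(html);
  if (!head) return [];
  const pairs: HeadMeta[] = [];
  for (const tag of head[1].match(/<meta\b[^>]*>/gi) ?? []) {
    const key = attr(tag, "name");
    // Same filter as "WHERE key IS NOT NULL" in the query.
    if (key !== null) pairs.push({ key, value: attr(tag, "content") });
  }
  return pairs;
}

const sample = `<html><head>
<meta charset="utf-8">
<meta name="description" content="Content Aide demo">
<meta content='shah' name='author'>
</head><body><meta name="ignored" content="not in head"></body></html>`;

console.log(extractHeadMeta(sample));
// two rows: description / "Content Aide demo" and author / "shah"
```

The `charset` tag is dropped because it has no `name`, and the `meta` in `body` is skipped because only the `head` is scanned, matching the `'head meta'` selector used in the query.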
@@ -581,7 +607,8 @@ export function library<EmitContext extends SQLa.SqlEmitContext>(libOptions: {
 mimeTypesSeedDML,
 insertContent,
 fsContentWalkSessionStats,
-allHtmlAnchors,
+htmlAnchors,
+htmlHeadMeta,
 sqlPageFiles,
 };
