
Proposal: Intermediate File Format to simplify note-sync syncing #260

Closed
kirkbyo opened this issue Mar 7, 2022 · 6 comments · Fixed by #266

kirkbyo (Collaborator) commented Mar 7, 2022

Okay, this is my proposal for partially tackling #220. @andymatuschak (and anyone else who might stumble upon this issue), any feedback is appreciated!


To eventually tackle #220 and #246, I propose creating an intermediate file format to encapsulate the syncing component of this problem.

Why an intermediate format?

  • It enforces the separation of concerns between syncing and parsing from various sources.
  • Orbit (or external actors) could define parsers for their sources and produce an Orbit Sync Document, which the sync library would translate into store events.
    • For example, Orbit could provide interpreters for Markdown and Anki sources and generate sync documents.
  • Stable-identifier agnostic. One of the current problems with the note-sync library is that it is coupled to Bear's export ID. Having an intermediate format allows other sources (Obsidian block links, Bear IDs, Anki IDs, etc.) to provide the stable identifiers for their prompts.
    • This can also help with two-way sync problems down the road (sync from Markdown -> ingest into Orbit -> edit prompt in Orbit -> update original Markdown note).
  • Interpreters can be created in other programming languages. The file acts as the transport medium.

File Format

Solution 1: Markdown as an Intermediate Format

One possible solution is to use markdown files augmented with CommonMark generic directive syntax as the intermediate format. This might look something like:

# Document Title

> Q. This is **Question 1**
> A. This is an answer to question 1
:orbit-prompt{prompt_id=d758904803cb}

> Q. This is **Question 2**
> A. This is an answer to question 2
:orbit-prompt{prompt_id=41b6d61c2aaa}

> This is a {cloze} prompt
:orbit-prompt{prompt_id=8656e0ee7d01}

This is just one example of how the format could be devised; if this solution is chosen, the layout can be discussed further.

Pro:

  • Common file format that can be edited in various editors.
  • Optimizes for developer readability.
  • Leverages existing Markdown syntax to help with styling.

Con:

  • Lacks expressiveness, which could lead to ambiguity around Markdown syntax (e.g. dollar signs in prompts being misinterpreted by KaTeX, #186).
  • Markdown is not a common serialization format in programming languages, so additional libraries or hand-rolled solutions would need to be introduced within the interpreters.

Solution 2: JSON-LD as an Intermediate Format

An alternative solution would be to follow the JSON-LD spec for our intermediate format. One of the powerful properties of JSON-LD is that the schema is referenced within the data, so the output produced by the Markdown interpreter does not need to exactly match the data output by the Anki interpreter; the JSON-LD expansion-compaction process handles standardizing everything into something the sync module can interpret.

Pro:

  • Schema is self-documenting
  • Allows for powerful linking between documents
  • Leverages publicly available schemas
  • Since it's just an extension of JSON, there are already de facto serialization libraries for each language.

Con:

  • Extra complexity since we now need to use an external parsing library for this format (likely jsonld.js).
  • Representing markup might become complicated, since as far as I could tell there are no popular schemas for this.
  • The data being represented does not have many absolute links, which are the real power of JSON-LD.

Solution 3: JSON as an Intermediate Format

Keep it simple, and just use JSON + JSON Schema for validation.

Pro:

  • Simple
  • We already use JSON Schema for API Validation

Con:

  • JSON Schema keeps the schema in a separate document, so in theory another service would need to load the schema in order to validate a given source (as opposed to JSON-LD, where it's encoded alongside the data).

Recommendation

As cool as JSON-LD is, I am currently skeptical of its benefits over plain JSON with JSON Schema validation for this particular problem. I would suggest Solution 3 in this case.

Representation

Regardless of the file format, there are two options for modeling this problem, each with their own tradeoffs.

Explicit

In this version, the file format explicitly describes which prompts have been mutated since the last sync. A rough example in JSON would be:

{
	"inserted": [{ "id": "a", "type": "qa", "question": "Question", "answer": "Answer"}],
	"updated": [{ "id": "b", "question": "New question" }],
	"deleted": [{ "id": "c" }]
}

Pro:

  • Explicit control over which prompts are inserted, updated or deleted.
  • The sync module does not need to infer these actions from the input.
  • Potentially more performant since the interpreter likely already "knows" which prompts have been modified / inserted / deleted at interpretation time.

Con:

  • The source now has to keep track of its last sync to Orbit.
  • Essentially a JSON interface in front of REST calls. Potentially unnecessary indirection.
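
Since Solution 3 pairs this JSON with a JSON Schema for validation, here is a minimal sketch of what such a schema might look like for the explicit shape above (property names are illustrative, not final):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["inserted", "updated", "deleted"],
  "properties": {
    "inserted": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id", "type", "question", "answer"],
        "properties": {
          "id": { "type": "string" },
          "type": { "enum": ["qa", "cloze"] },
          "question": { "type": "string" },
          "answer": { "type": "string" }
        }
      }
    },
    "updated": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id"],
        "properties": {
          "id": { "type": "string" },
          "question": { "type": "string" },
          "answer": { "type": "string" }
        }
      }
    },
    "deleted": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id"],
        "properties": { "id": { "type": "string" } }
      }
    }
  }
}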

Inferred

In this version, the interpreter produces an absolute set of prompts for a given source [1]. The sync module is then in charge of inferring the corresponding events through comparison.

Pro:

  • The interpreter is sync-agnostic; it will produce the same result every time, regardless of the result of a particular sync.
  • The interpreter is relatively simple given it only needs to translate from the original source to the intermediate file format.

Con:

  • It could lead to inconsistencies if the sources on two different devices are not synced but Orbit's store has synced.
    • For example, on computer A the user syncs prompts from Markdown; computer B doesn't know about computer A's prompts yet, so when it syncs, Orbit interprets all the prompts synced from computer A as deleted.

[1] A source is a collection of prompts. For example, a Markdown interpreter might produce one source per Markdown file and an Anki interpreter might produce one source per deck.
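
To make the inferred approach a bit more concrete, here is a rough sketch (all names hypothetical) of how the sync module might derive store events by comparing the interpreter's absolute set of prompts against what is already stored for that source:

// Hypothetical sketch of the "inferred" approach: diff an interpreter's
// absolute prompt set against the prompts already stored for the source.
interface Prompt {
  id: string;
  content: string;
}

function inferEvents(current: Prompt[], stored: Prompt[]) {
  const storedById = new Map(stored.map((p): [string, Prompt] => [p.id, p]));
  const currentIDs = new Set(current.map((p) => p.id));

  const inserted = current.filter((p) => !storedById.has(p.id));
  const updated = current.filter((p) => {
    const previous = storedById.get(p.id);
    return previous !== undefined && previous.content !== p.content;
  });
  const deleted = stored.filter((p) => !currentIDs.has(p.id));

  return { inserted, updated, deleted };
}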

Recommendation

I recommend the explicit approach, since I believe the interpreter is best suited to infer these actions from the original sources.

Example Markdown Interpreter

To test this recommendation, here is how I imagine the Markdown sync could be implemented using these two new components.

Say we have two files, apple.md and bananas.md.

apple.md:

# Apple

Q. Question **A**
A. Answer to question A

bananas.md:

# Bananas

Q. Question B, but with an update
A. This question was already synced previously, but the question changed

Q. This is a new question.
A. A new question deserves a new answer

An end-to-end sync script for this would look something like:

const interpreter = new MarkdownInterpreter(files)
const sync = new Sync()

// Generate a diff:
const result = interpreter.diff()
console.log(result)
/* 
{
  "inserted": [
    {
      "type": "qa",
      "millisecondsSince1970": "1242345124",
      "id": "d758904803cb",
      "question": {
        "text": "A",
        "attributes": [
          { "type": "bold", "range": [9, 10] }
        ]
      },
      "answer": {
        "text": "Answer to question A"
      },
      "source": {
        "id": "apple.md",
        "externalURL": "obsidian:apple.md",
        "name": "Apples"
      }
    }
  ],
  "updated": [
    {
      "type": "qa",
      "id": "41b6d61c2aaa",
      "milliseconds": "1242345124",
      "question": {
        "text": "Question B, but with an update"
      }
    }
  ],
  "deleted": []
}
*/

try {
    await sync.apply(result)
    interpreter.commit()
} catch (error) {
    console.error(error)
}

In this example, the MarkdownInterpreter is invoked and produces a compliant JSON document for the sync module to ingest. It is worth noting that the IDs are automatically generated by the interpreter, but exactly how these IDs are generated is a separate discussion that is up to the interpreter.

Also, after the sync module successfully completes, the interpreter "commits" the changes. This informs the interpreter that the sync was successful, so it can adjust its internal metadata accordingly. If commit is not called after a successful sync, it's no big deal! The next diff will just include the same changes, and the sync module will disregard them accordingly.
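
One way to pin down the contract sketched above (purely illustrative; the names are hypothetical) is an interpreter interface along these lines:

// Hypothetical interpreter contract implied by the example above.
interface SyncDocument {
  inserted: object[];
  updated: object[];
  deleted: { id: string }[];
}

interface Interpreter {
  // Produce the changes since the last committed sync. Calling diff()
  // repeatedly without an intervening commit() returns the same changes.
  diff(): SyncDocument;

  // Record that the last diff was applied successfully so the interpreter
  // can update its internal sync metadata.
  commit(): void;
}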

Okay wow, that was a lot. Let me know what you think of this approach!

andymatuschak (Owner) commented:

(Will be traveling without computer until 3/23, may be slow to reply!)

Thank you for thinking this through, Ozzie! I agree that for an intermediate representation, it’s appropriate to focus on machine readability, so something like JSON does make more sense than using Markdown directly, even if it means an extra layer of indirection. And if we find ourselves wanting to use JSON-LD in the future, one can always write a layer which resolves that more complex format to the simpler structured JSON.

I’m less clear on the “explicit” vs “inferred” question. In your source example, AFAICT, the MarkdownInterpreter could not generate a diff without also being provided access to the previous state of these files, right? This might not always be straightforward. For instance, if I just have a folder of Markdown files which I’m editing over time, I won’t necessarily still have the old version of the Markdown files around for the interpreter to diff against. I could provide it with an orbitStore instead, but now the Markdown interpreter is given the job of diffing structured Prompt data vs. markup AST. Wouldn’t it naturally implement this by translating the latter into structured Prompt data? If so, wouldn’t you end up with the same diffing implementation in a Biome interpreter?

That is, if I want to interpret a folder of Biome files and generate a diff, don’t I need to compare it to either an old folder of Biome files or the corresponding data in an orbitStore? And in the latter case, wouldn’t I end up translating the Biome data to some intermediate representation to generate the diff?

You make a good point that in some cases an overt diffing step can be avoided if a client instead uses semantic user actions to write diffs directly. But I’m not sure how useful that is in practice, if the goal is files on disk which present an “absolute” snapshot of the data to end users, rather than an event log.

I may be missing something important about how you’re thinking about this distinction, though, in which case please let me know!

kirkbyo (Collaborator, Author) commented Mar 19, 2022

I’m less clear on the “explicit” vs “inferred” question. In your source example, AFAICT, the MarkdownInterpreter could not generate a diff without also being provided access to the previous state of these files, right?

Sorry about the confusion here! I should have been more explicit with this (pun mostly intended). In the "explicit" case, I was thinking each of the interpreters would implement some sort of storage mechanism to handle the diff portion. So, like you said, the Markdown interpreter could use the orbitStore as the source of truth to determine which prompts have changed. The example listed above might then be something like:

const orbitStore = "/path/to/my/orbitStore"
const interpreter = new MarkdownInterpreter(files, orbitStore)
const sync = new Sync()

// Generate a diff:
const result = interpreter.diff()
console.log(result)
/* { ... } */

try {
    await sync.apply(result)
    interpreter.commit()
} catch (error) {
    console.error(error)
}

One other advantage of this approach is that the interpreter can choose how best to "diff". For example, a Markdown interpreter could scan the entire directory for all the files and their modification dates, and then only parse and diff the files which have been updated since the last sync. A Biome interpreter could use a similar approach, and an Anki interpreter might have similar metadata it can use.

Of course, if the "inferred" approach is used, the interpreters could provide this modification date alongside each absolute set of prompts, and the sync module could then make this optimization. But at that point, the interpreters have already done the work of scanning and parsing all the prompts. In the "explicit" solution, both the interpreters and the sync module can leverage this optimization. That being said, whether the interpreters *need* this optimization is a separate discussion that I don't have any strong intuitions about.
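
As a rough sketch of that optimization (assuming a Node environment; names are hypothetical), the Markdown interpreter might filter files by modification time like this:

import * as fs from "fs";
import * as path from "path";

// Hypothetical sketch: only re-parse Markdown files modified since the
// last successful sync, based on file system modification times.
function filesChangedSince(directory: string, lastSyncTime: Date): string[] {
  return fs
    .readdirSync(directory)
    .filter((name) => name.endsWith(".md"))
    .map((name) => path.join(directory, name))
    .filter((filePath) => fs.statSync(filePath).mtime.getTime() > lastSyncTime.getTime());
}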

Handling moves and deletes
The other case which I failed to mention in my original post was how interpreters could handle moves and deletes. Here is how I imagine a Markdown interpreter using the "explicit" solution could function (a code sketch follows the list):

  1. On execution: it filters for all the files that have a modified date greater than the date of the last execution. At this point, we know that each of these files has either new prompts, prompts which have been moved, or prompts that have been deleted.
  2. If a file is completely new, that is, its path is not within our cache of known paths from our last sync, then we consider all of its prompts as movedOrInserted.
  3. If a file is in our cache, then we compare the prompts in the file to the ones we last synced. If a prompt is not known from the last time we synced, we consider it movedOrInserted. If a prompt that was previously synced is no longer in the file, we consider it movedOrDeleted.
  4. We compare our set of known file paths to the file paths from the last time we synced. If a file is not within the current set of file paths, we consider all the prompts associated with this path as movedOrDeleted.
  5. We intersect movedOrDeleted and movedOrInserted; the prompts in the intersection are all marked as moved. Prompts in the movedOrDeleted set that are not within the intersection are marked as deleted, and prompts in the movedOrInserted set that are not part of the intersection are marked as inserted.
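
As a rough sketch of steps 2–5 (entirely hypothetical; previous and current are assumed to cover all tracked files, with unmodified files carried over from the cache):

// Hypothetical sketch of the move/insert/delete classification described
// above. `previous` and `current` map a file path to the set of prompt IDs
// found in that file.
function classify(
  previous: Map<string, Set<string>>,
  current: Map<string, Set<string>>
) {
  const movedOrInserted = new Set<string>();
  const movedOrDeleted = new Set<string>();

  // Steps 2–3: prompts that appeared in, or disappeared from, a file.
  for (const [filePath, promptIDs] of current) {
    const known = previous.get(filePath) ?? new Set<string>();
    for (const id of promptIDs) {
      if (!known.has(id)) movedOrInserted.add(id);
    }
    for (const id of known) {
      if (!promptIDs.has(id)) movedOrDeleted.add(id);
    }
  }

  // Step 4: files that no longer exist since the last sync.
  for (const [filePath, promptIDs] of previous) {
    if (!current.has(filePath)) {
      for (const id of promptIDs) movedOrDeleted.add(id);
    }
  }

  // Step 5: the intersection is a move; the remainders are true inserts/deletes.
  const moved = [...movedOrInserted].filter((id) => movedOrDeleted.has(id));
  const inserted = [...movedOrInserted].filter((id) => !movedOrDeleted.has(id));
  const deleted = [...movedOrDeleted].filter((id) => !movedOrInserted.has(id));

  return { moved, inserted, deleted };
}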

In this solution, we handle the following cases:

  • inserts
  • deletes
  • file renames (this would just be moves)

What we don't handle:

  • Prompt updates: any time a prompt is adjusted between syncs, the "review history" of the prompt would be lost.

Let me know if that makes more sense and if I am overlooking some important details 😄

andymatuschak (Owner) commented:

[Very sorry to have lost track of this after my travel!!]

Got it, that makes more sense. If you feel like giving this a crack, I'd be happy to review or riff at any point, including a spiked-out implementation or a full interface. Feel free to use the "explicit" method if you prefer. My guess is that there will be significant overlap between the Markdown and Biome implementations around the diffing, and we can factor that out into a utility or a separate top-level stage (inferred-style) as seems appropriate.

AB1908 commented Jun 10, 2023

Of course someone has thought of this already. This may not be the appropriate place to ask, but do you know of any other markdown-based flashcard formats? I'm in the process of building something between Readwise and Anki, and my storage format of choice is markdown. Curious to hear if you've come across anything that does that already.

andymatuschak (Owner) commented:

AB1908 commented Jun 13, 2023

Excellent. Thanks for the ideas and your work in general. I actually had some tangential thoughts related to SRS that I'd like to discuss with you at some point. The core idea is treating knowledge as code and thinking of prompts as tests. We could use the idea of test coverage as a metric, for example. This is something I hope to build in my implementation.
