-
-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Intermediate File Format to simplify note-sync syncing #260
Comments
(Will be traveling without computer until 3/23, may be slow to reply!) Thank you for thinking this through, Ozzie! I agree that for an intermediate representation, it’s appropriate to focus on machine readability, so something like JSON does make more sense than using Markdown directly, even if it means an extra layer of indirection. And if we find ourselves wanting to use JSON+LD in the future, one can always write a layer which resolves that more complex format to the simpler structured JSON. I’m less clear on the “explicit” vs “inferred” question. In your source example, AFAICT, the MarkdownInterpreter could not generate a diff without also being provided access to the previous state of these files, right? This might not always be straightforward. For instance, if I just have a folder of Markdown files which I’m editing over time, I won’t necessarily still have the old version of the Markdown files around for the interpreter to diff against. I could provide it with an orbitStore instead, but now the Markdown interpreter is given the job of diffing structured Prompt data vs. markup AST. Wouldn’t it naturally implement this by translating the latter into structured Prompt data? If so, wouldn’t you end up with the same diffing implementation in a Biome interpreter? That is, if I want to interpret a folder of Biome files and generate a diff, don’t I need to compare it to either an old folder of Biome files or the corresponding data in an orbitStore? And in the latter case, wouldn’t I end up translating the Biome data to some intermediate representation to generate the diff? You make a good point that in some cases an overt diffing step can be avoided if a client instead uses semantic user actions to write diffs directly. But I’m not sure how useful that is in practice, if the goal is files on disk which present an “absolute” snapshot of the data to end users, rather than an event log. I may be missing something important about how you’re thinking about this distinction, though, in which case please let me know! |
Sorry about the confusion here! I should have been more explicit with this (pun mostly intended). In the "explicit" case, I was thinking each of the interpreters would implement some sort of storage mechanism to handle the diff portion. So, like you said, the markdown interpreter could use the the const orbitStore = "/path/to/my/orbitStore"
const interpreter = new MarkdownInterpreter(files, orbitStore)
const sync = new Sync()
// Generate a diff:
const result = interpreter.diff()
console.log(result)
/* { ... } */
try {
await sync.apply(result)
interpreter.commit()
} catch (error) {
console.error(error)
} One other advantages of this approach is that the interpreter can choose how to best "diff". For example, a markdown interpreter could scan the entire directory for all the files and their date modified, and then only parse and diff the files which have been updated since the last sync. A biome interpreter could use a similar approach, and an Anki interpreter might have similar metadata it can use. Of course, if the "inferred" approach is used, the interpreters could provide this modified date alongside each of absolute set of prompts then the sync module could make this optimization. But at this point, the interpreters have already done the work of scanning and parsing all the prompts. In the "explicit" solution, both the interpreters and the sync module can leverage this optimization. That being said, whether the interpreters need* this optimization is a separate discussion that I don't have any strong intuitions for. Handling moves and deletes
In this solution, we handle the following cases:
what we don't handle:
Let me know if that makes more sense and If I am overlooking some important details 😄 |
[Very sorry to have lost track of this after my travel!!] Got it, that makes more sense. If you feel like giving this a crack, I'd be happy to review or riff at any point, including a spiked-out implementation or a full interface. Feel free to use the "explicit" method if you prefer. My guess is that there will be significant overlap between the Markdown and Biome implementations around the diffing, and we can factor that out into a utility or a separate top-level stage (inferred-style) as seems appropriate. |
Of course someone has thought of this already. This may not be the appropriate place to ask but do you know of any other markdown based flashcard formats? I'm in the process of building something between Readwise and Anki and my storage format of choice is markdown. Curious to hear if you've come across anything that does that already. |
Excellent. Thanks for the ideas and your work in general. I actually had some tangential thoughts related to SRS that I'd like to discuss with you at some point. The core idea is treating knowledge as code and thinking of prompts as tests. We could use the idea of test coverage as a metric, for example. This is something I hope to build in my implementation. |
Okay this is my proposal, for partially tackling #220. @andymatuschak (and anyone else who might stumble upon this issue) any feedback is appreciated!
To tackle eventually tackle #220 and #246, I propose creating an intermediate file format to aid and encapsulate the syncing component of this problem.
Why an intermediate format?
store
events.note-sync
library is that it is coupled to Bear's export ID. Having an intermediate format, allows for other sources (Obsidian Block Links, Bear ID, Anki ID, etc.) to all provide the stable identifiers of there prompts.File Format
Solution 1: Markdown as an Intermediate Format
One possible solution is to use markdown files augmented with CommonMark generic directive syntax as the intermediate format. This might look something like:
This is just one example in how the format could be devised, if this solution is chosen, the format layout can be discussed further.
Pro:
Con:
Solution 2: JSON-LD as an Intermediate Format
An alternative solution would be to follow the JSON-LD spec as our intermediate format. One of the powerful properties of JSON-LD is that the schema is referenced within the data, so the output produced by the Markdown interpreter does not need to exactly follow the data outputed by the Anki Interpreter, the JSON-LD expansion-compaction process will handle the standardization to something that the sync module could interpret.
Pro:
Con:
Solution 3: JSON as an Intermediate Format
Keep it simple, and just use JSON + JSON Schema for validation.
Pro:
Con:
Recommendation
As cool as JSON-LD is, I am currently skpetical of the benefits from this format as opposed to JSON and using JSON schema for validation for this particular problem. I would suggest Solution 3 in this case.
Representation
Regardless of the file format there are two options for modeling this problem, each with there own tradeoffs.
Explicit
In this version, the file format explicitly describes which prompts have been mutated since the last sync. A rough example in JSON would be:
Pro:
Con:
Inferred
In this version, the interpreter produces an absolute set of prompts for a given source [1]. The sync module is then in charge of inferring the corresponding events through comparsion.
Pro:
Con:
[1] A source is a collection of prompts. For example, a markdown interpreter might produce one source per markdown file and an anki interpreter might produce one source per deck.
Recommendation
I recommend the explicit approach since I believe interpretator is best suited for doing this type of inferring of the actions from the original sources.
Example Markdown Interpreter
To test this recommendation, here is how I imagine the markdown sync could be implemented using these two new components.
Say we have two files,
apple.md
andBananas.md
.apple.md:
bananas.md
# Bananas Q. Question B, but with an update A. This question was already synced previously, but the question changed Q. This is a new question. A. A new question deserves a new answer
An end-to-end sync script for this would look something like:
In this example, the
MarkdownInterpreter
is invoked and then produces a compliant JSON file for the sync module to ingest. It is worth noting that in this example, the ID's are automatically generated by the interpreter, but the exact mechanism in how these ID's are generated is a separate discussion that is up to the interpreter.Also, after the sync module successfully completes, the interpreter "commits" the changes. This is done to inform that the sync was successful and the interpreter can adjust its internal metadata accordingly. In the event that
commit
is not called after a successful sync, this is no big deal! It will just include the same changes as last diff, and the sync module will disregard accordingly.Okay wow, that was a lot. Let me know what you think of this approach!
The text was updated successfully, but these errors were encountered: