Proposed Taxonomy tree #46
d6e03a2 to cd8cae6
LGTM
Copying an existing hierarchy makes a ton of sense.
What I'm less sure about is whether having a single repo with no limit to its scope is workable. It just doesn't seem practical or sustainable. Has there been any discussion about drawing clearer lines around what to focus on in this first repo? Completely random knowledge contributions from the entire spectrum of human knowledge seem less effective than picking something, ideally with an associated test benchmark to refer to, so we can more concretely demonstrate whether we are moving in a positive direction.
So before agreeing to "sure the Wikipedia hierarchy makes sense," I'd love to understand some corresponding structure to keep a limited set of people focused on an achievable outcome. Are there docs on this anywhere I'm missing?
@mairin may know, but from what I understand it just existed when we started. I don't know of any design docs, hence our reason for trying to find some logic to this.
/lgtm
So is this for knowledge only? Wikipedia can be thought of as a source of knowledge and it has a taxonomy for its knowledge. But I am not sure the Wikipedia taxonomy has much applicability to compositional skills.
I would assume so, anything under
OK, we should then make it clear in the title of this PR and the text that the proposal applies to the
Creating a logical layout of the taxonomy tree needs to be agreed upon. The triage team has landed on the Wikipedia tree, and this PR is to help justify and enforce this decision. Signed-off-by: JJ Asghar <[email protected]>
cd8cae6 to 50ef474
Updated per BJ's request for clarification.
LGTM
@shivchander Does this proposal seem reasonable to you?
So we started targeting the domains which improve MMLU, and we have this list https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md - this could serve as a starting point to accept contributions. I like the idea of this PR and the one from Ming (instructlab/taxonomy#780); both serve as a good way to organize the knowledge tree.
Thanks for pushing this, @jjasghar. I also like the idea of defining the tree and of using an existing definition like Wikipedia.
However, I agree with @russellb that the allowed topics, or limits, of the tree matter. I think this is an important part of this design doc and should be defined or referenced here.
So we started targeting the domains which improve MMLU, and we have this list https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md - this could serve as a starting point to accept contributions
@shivchander Is this a definitive list, meaning these are the only topics allowed in the taxonomy tree?
As @bjhargrave raised, this covers knowledge only. Should we also include skills, or is it sufficient to handle knowledge and skills separately?
I think especially since the initial knowledge contributions are from Wikipedia, this is as good a format as any. I wouldn't be surprised if we need to deviate from this down the line - but to @russellb's point I'm expecting domains that are more highly specific to a use case to end up in either another top-level directory or a different repository. Examples of documents that might cause us to reconsider this structure: policy documents, documentation, contracts, legal case filings, sales copy. Those wouldn't natively fit in a Wikipedia-style structure. For now this will work.
Co-authored-by: Olivia Buzek <[email protected]> Signed-off-by: JJ Asghar <[email protected]>
I think we should figure out I have no idea (yet) how skills will develop, but we should absolutely start researching "skills trees" in this space.
Some topical input from InstructLab slack: "Question: can a knowledge contribution be from a source other than the wikipedia? Context: the AI Alliance is creating a set of reference implementations/use cases and one of the suggested reference implementations is a legal chatbot (based on instructlab) that answer questions about the GDPR (General Data Protection Regulation). We would like to contribute the text of the law as knowledge to instructlab. Would that be an acceptable contribution?" and: "Is it possible to have a new domain if its not listed in current ones ? https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md "
I believe the plan is to accept other sources eventually, but until that is announced only Wikipedia is accepted. @lhawthorn should we get a standard blurb together about accepting things other than Wikipedia and the expectation of when it can happen?
Yesterday @shivchander confirmed that the folder structure is not relevant to the InstructLab SDG/training processes. It is for humans to organize the taxonomy. So using Wikipedia as the knowledge taxonomy organizing principle is as good a choice as any.
@jjasghar We should absolutely do so. @obuzek I note your comment. I have heard from two different people who are interested in teaching InstructLab about legal texts (e.g. GDPR regulation) and about software CVE information (which I, perhaps naively, think of as documentation). If we went with the Wikipedia structure for taxonomy, how would we accommodate these use cases? I may not understand the problem space well enough, but appreciate the opportunity to be better educated.
@lhawthorn This is pure speculation, but I almost wonder if there's not a need for a "foundational knowledge" tree that's different based on the type of data it is. CVE info would live most happily in a CVE-specific organizational taxonomy, and really if one of these is relevant to your use case, you'd want to have the ability to filter by document type. So maybe that calls for a high-level folder within
Also I appreciate that you mentioned CVE info because I was very close to going back and editing my first message to add that exact case 😄. (Also journal articles, news, first person accounts ...)
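To make the "filter by document type" idea concrete, here is a minimal sketch. It assumes a hypothetical layout of `knowledge/<document_type>/<domain>/.../qna.yaml`; the folder names (`encyclopedia`, `cve`, `legal`), the file name, and the helper `group_by_document_type` are all made up for illustration and are not taken from the taxonomy repo.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_document_type(paths, root="knowledge"):
    """Group contribution paths by the folder directly under `root`.

    Assumes the hypothetical layout root/<document_type>/.../<file>.
    """
    groups = defaultdict(list)
    for raw in paths:
        parts = PurePosixPath(raw).parts
        # Skip anything that is not at least root/<document_type>/<file>.
        if len(parts) < 3 or parts[0] != root:
            continue
        groups[parts[1]].append(raw)
    return dict(groups)


# Illustrative paths only; none of these are claimed to exist in the repo.
example_paths = [
    "knowledge/encyclopedia/science/physics/qna.yaml",
    "knowledge/cve/2024/qna.yaml",
    "knowledge/legal/gdpr/qna.yaml",
    "skills/writing/qna.yaml",
]

by_type = group_by_document_type(example_paths)
cve_only = by_type.get("cve", [])
```

With a top-level split like this, someone who only cares about CVE data could select `by_type["cve"]` and ignore the rest, while the Wikipedia-style hierarchy could still apply inside the encyclopedia branch.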
@bjhargrave I thought the prompt for SDG was still including the folder path. Can you confirm?
TIL I could quote reply in GH Issues. Yay me! @obuzek I do believe we should plan for domain-specific taxonomies. (In fact, I know there is an open issue suggesting the same somewhere else, but my search skills to find it are currently failing me.) I absolutely envision a future where people will want domain-specific taxonomies; perhaps we would be able to offer smaller-footprint models based off these domain-specific taxonomies at some point. We should plan for that future.
Thanks @bjhargrave for that feedback. Based on this and the need to come up with a standard to start with, I am ok to approve with a view to extending in the future.
@russellb Are you ok to move forward with this PR as a start to bedding down standards for the taxonomy tree?
Yes, to be clear my review was not a "-1", just a comment. Don't block on me. (I try to always use "Request Changes" in a review to indicate when I want to block on changes I'm asking for).