Proposed Taxonomy tree #46

Merged
merged 2 commits into from
May 20, 2024

Conversation

jjasghar
Member

Creating a logical layout of the taxonomy tree
needs to be agreed upon. The triage team has
landed on the Wikipedia tree, and this PR
is to help justify and enforce this decision.

@jjasghar jjasghar force-pushed the jjasghar/taxonomy_tree branch from d6e03a2 to cd8cae6 Compare May 13, 2024 21:32
Member

@lhawthorn lhawthorn left a comment

LGTM

Member

@russellb russellb left a comment

Copying an existing hierarchy makes a ton of sense.

What I'm less sure about is having a single repo that has no limit to its scope. It just doesn't seem practical or sustainable. Has there been any discussion about drawing some clearer lines around what to focus on in this first repo? Completely random knowledge contributions from the entire spectrum of human knowledge seem less effective than picking something, ideally with an associated test benchmark to refer to, so we can more concretely demonstrate whether we are moving in a positive direction.

So before agreeing to "sure the Wikipedia hierarchy makes sense," I'd love to understand some corresponding structure to keep a limited set of people focused on an achievable outcome. Are there docs on this anywhere I'm missing?

@jjasghar
Member Author

> Are there docs on this anywhere I'm missing?

Maybe @mairin may know, but from what I understand it just existed when we started. I don't know of any design docs, hence our reason for trying to find some logic to this.

Member

@aakankshaduggal aakankshaduggal left a comment

/lgtm

@bjhargrave
Contributor

So is this for knowledge only? Wikipedia can be thought of as a source of knowledge and it has a taxonomy for its knowledge. But I am not sure the Wikipedia taxonomy has much applicability to compositional skills.

@bjhargrave bjhargrave requested a review from obuzek May 14, 2024 17:03
@jjasghar
Member Author

> So is this for knowledge only? Wikipedia can be thought of as a source of knowledge and it has a taxonomy for its knowledge.

I would assume so, anything under /knowledge. It does bring up the /skills tree, but I don't know of anything that reflects a taxonomy tree for skills.

@bjhargrave
Contributor

> I would assume so, anything under /knowledge

OK, we should then make it clear in the title of this PR and the text that the proposal applies to the knowledge part of the taxonomy.

Creating a logical layout of the taxonomy tree
needs to be agreed upon. The triage team has
landed on the Wikipedia tree, and this PR
is to help justify and enforce this decision.

Signed-off-by: JJ Asghar <[email protected]>
@jjasghar jjasghar force-pushed the jjasghar/taxonomy_tree branch from cd8cae6 to 50ef474 Compare May 14, 2024 22:33
@jjasghar
Member Author

Updated per BJ's request for clarification.

Member

@mairin mairin left a comment

LGTM

@mairin
Member

mairin commented May 15, 2024

> Maybe @mairin may know, but from what I understand it just existed when we started. I don't know of any design docs, hence our reason for trying to find some logic to this.

@shivchander Does this proposal seem reasonable to you?

@shivchander
Member

> So before agreeing to "sure the Wikipedia hierarchy makes sense," I'd love to understand some corresponding structure to keep a limited set of people focused on an achievable outcome. Are there docs on this anywhere I'm missing?

So we started targeting the domains which improve MMLU, and we have this list https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md - this could serve as a starting point to accept contributions

I like the idea of this PR and the one from Ming (instructlab/taxonomy#780), both serve as a good way to organize the knowledge tree.

Member

@hickeyma hickeyma left a comment

Thanks for pushing this @jjasghar, I also like the idea of a definition of the tree and also using an existing definition like Wikipedia.

However, I agree with @russellb on the question of which topics are allowed and what the limits of the tree are. I think this is an important part of this design doc and should be defined or referenced here.

> So we started targeting the domains which improve MMLU, and we have this list https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md - this could serve as a starting point to accept contributions

@shivchander Is this a definitive list, meaning these are the only topics allowed in the taxonomy tree?

As @bjhargrave raised, this is about knowledge only. Should we also include skills or is it sufficient to handle knowledge and skills separately?
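
One way to make the "allowed topics" question concrete is a check on where a contribution lands in the tree. The following is a minimal sketch, assuming an allow-list of top-level domains under `knowledge/`. The names in `ALLOWED_DOMAINS`, the function `is_allowed`, and the example paths are hypothetical placeholders and are not taken from knowledge_domains.md or from any InstructLab tooling.

```python
from pathlib import PurePosixPath

# Hypothetical placeholder allow-list. A real list would come from
# knowledge_domains.md, whose contents are not reproduced here.
ALLOWED_DOMAINS = {"science", "history", "technology"}


def is_allowed(path: str) -> bool:
    """Return True if `path` sits under knowledge/<allowed domain>/..."""
    parts = PurePosixPath(path).parts
    # Expect at least: knowledge / <domain> / <file or subfolder>
    if len(parts) < 3 or parts[0] != "knowledge":
        return False
    return parts[1] in ALLOWED_DOMAINS
```

Under these assumptions, a path such as `knowledge/science/physics/qna.yaml` would pass, while anything outside `knowledge/` or under an unlisted domain would be flagged for discussion rather than silently accepted.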

@obuzek
Contributor

obuzek commented May 15, 2024

I think especially since the initial knowledge contributions are from Wikipedia, this is as good a format as any. I wouldn't be surprised if we need to deviate from this down the line - but to @russellb's point I'm expecting domains that are more highly specific to a use case to end up in either another top-level directory or a different repository.

Examples of documents that might cause us to reconsider this structure: policy documents, documentation, contracts, legal case filings, sales copy. Those wouldn't natively fit in a Wikipedia-style structure.

For now this will work.

Co-authored-by: Olivia Buzek <[email protected]>
Signed-off-by: JJ Asghar <[email protected]>
@jjasghar
Member Author

> Should we also include skills or is it sufficient to handle knowledge and skills separately?

I think we should figure out compositional_skills/ differently. It will be significantly more subjective, which will require real conversations. For knowledge, on the other hand, if we have an agreed-upon template/format, it's much easier to justify why something belongs where it does.

I have no idea (yet) how skills will develop, but we should absolutely start researching "skills trees" in this space.

@hickeyma
Member

Some topical input from InstructLab slack:

"Question: can a knowledge contribution be from a source other than Wikipedia? Context: the AI Alliance is creating a set of reference implementations/use cases and one of the suggested reference implementations is a legal chatbot (based on InstructLab) that answers questions about the GDPR (General Data Protection Regulation). We would like to contribute the text of the law as knowledge to InstructLab. Would that be an acceptable contribution?"

and:

"Is it possible to have a new domain if it's not listed in the current ones? https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md"

@jjasghar
Member Author

I believe the plan is to accept other sources eventually, but until that is announced, only Wikipedia is accepted.

@lhawthorn should we get a standard blurb together about accepting things other than Wikipedia and the expectation of when it can happen?

Copy link
Contributor

@bjhargrave bjhargrave left a comment

Yesterday it was confirmed by @shivchander that the folder structure is not relevant to the InstructLab SDG/training processes. It is for humans to organize the taxonomy. So using Wikipedia as the knowledge taxonomy organizing principle is as good as any choice.

@lhawthorn
Member

@jjasghar We should absolutely do so.

@obuzek I note your comment
"Examples of documents that might cause us to reconsider this structure: policy documents, documentation, contracts, legal case filings, sales copy. Those wouldn't natively fit in a Wikipedia-style structure."

I have heard from two different people who are interested in teaching InstructLab about legal texts (e.g. the GDPR) and about software CVE information (which I, perhaps naively, think of as documentation).

If we went with the Wikipedia structure for taxonomy, how would we accommodate these use cases?

I may not understand the problem space well enough, but appreciate the opportunity to be better educated.

@obuzek
Contributor

obuzek commented May 16, 2024

@lhawthorn This is pure speculation, but I almost wonder if there's not a need for a "foundational knowledge" tree that's different based on the type of data it is. CVE info would live most happily in a CVE-specific organizational taxonomy, and really if one of these is relevant to your use case, you'd want to have the ability to filter by document type.

So maybe that calls for a high-level folder within knowledge labeled wikipedia, so that we have room to expand later.

Also I appreciate that you mentioned CVE info because I was very close to going back and editing my first message to add that exact case 😄. (Also journal articles, news, first person accounts ...)
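
To illustrate the "filter by document type" idea above, here is a minimal sketch that treats the folder directly under `knowledge/` as the source type. The layout, the folder names (`wikipedia`, `cve`), and the helpers `source_type` and `filter_by_source` are assumptions made for illustration only. Nothing here reflects an agreed structure or existing InstructLab code.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def source_type(path: str) -> Optional[str]:
    """Return the folder directly under knowledge/, or None if not applicable."""
    parts = PurePosixPath(path).parts
    # Assumed layout: knowledge / <source type> / ... / <file>
    if len(parts) >= 3 and parts[0] == "knowledge":
        return parts[1]
    return None


def filter_by_source(paths: List[str], wanted: str) -> List[str]:
    """Keep only the paths whose source-type folder equals `wanted`."""
    return [p for p in paths if source_type(p) == wanted]
```

With a layout like this, adding a new document type later would mean adding a sibling folder next to `wikipedia` rather than reshuffling the existing tree.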

@obuzek
Contributor

obuzek commented May 16, 2024

@bjhargrave I thought the prompt for SDG was still including the folder path. Can you confirm?

@lhawthorn
Member

TIL I could quote reply in GH Issues. Yay me!

@obuzek I do believe we should plan for domain specific taxonomies. (In fact, I know there is an open issue suggesting same somewhere else, but my search skills to find it are currently failing me.)

I absolutely envision a future where people will want domain specific taxonomies; perhaps we would be able to offer smaller footprint models based off these domain specific taxonomies at some point. We should plan for that future.

@hickeyma
Member

hickeyma commented May 20, 2024

> Yesterday it was confirmed by @shivchander that the folder structure is not relevant to the InstructLab SDG/training processes. It is for humans to organize the taxonomy. So using Wikipedia as the knowledge taxonomy organizing principle is as good as any choice.

Thanks @bjhargrave for that feedback. Based on this and the need for a standard to start with, I am OK to approve with a view to extending it in the future.

@hickeyma hickeyma self-requested a review May 20, 2024 08:39
@hickeyma
Member

@russellb Are you ok to move forward with this PR as a start to bedding down standards for the taxonomy tree?

@russellb
Member

> @russellb Are you ok to move forward with this PR as a start to bedding down standards for the taxonomy tree?

Yes, to be clear my review was not a "-1", just a comment. Don't block on me. (I try to always use "Request Changes" in a review to indicate when I want to block on changes I'm asking for).

@mingxzhao mingxzhao merged commit ba57807 into main May 20, 2024
4 checks passed
@hickeyma hickeyma deleted the jjasghar/taxonomy_tree branch May 21, 2024 09:11