How to create training data for NER task using snorkel ? #1254
Comments
@thak123 Follow the link #838 But I am doubtful on the area of tagging table data from PDFS/Receipts |
Hi Mageswaran. The link you posted returns "not found". |
Hi @thak123 while you can hopefully look at some of the existing tutorials to help you in the interim, we're actually planning to release an NER-specific tutorial soon! Marking as "feature request" and will leave open till this is done |
Hi @ajratner, I'm quite interested in this feature. Do you have an expected timeline for the release of those tutorials? Not a hard deadline, just whether it is a matter of weeks, months, or years... |
Hi @ajratner, I'm very interested in this feature. Any idea when the tutorial may be released? Here we are 2 months after your previous mention . . . does it still look months away? |
Any update on this issue? |
I found 2 papers on the Snorkel resources page that tackle the NER task. |
any update on this issue ? |
Also interested.. C'mon guys! :D |
The simplest way to do NER/sequence labeling using the off-the-shelf Snorkel label model is to assume each token is independent and define your label matrix accordingly (one row per token). For example, you could do very simple NER
Generating weakly labeled sequence data then just requires some bookkeeping to split your predicted token probabilities back into their original sequences. When training the end model, you can either mask tokens that don't have any LF coverage or assume some prior (e.g., all tags are equally likely) and train a BERT, BiLSTM, etc. model. As @pfllo pointed out, there are dependencies between tokens that we would also like to capture in the label model. The papers Multi-Resolution Weak Supervision for Sequential Data and Weakly Supervised Sequence Tagging from Noisy Rules do handle this (both have code available). In practice, however, treating tokens independently and using the default Snorkel label model works surprisingly well, especially if you come from a domain with rich knowledge base & dictionary resources such as biomedicine, geography, etc. |
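For readers who want something concrete, here is a minimal sketch of the token-independent approach described above. It is not the code from the original comment. The labeling functions, the tiny term lists, and the helper names (`build_label_matrix`, `majority_vote_probs`, `split_by_sentence`) are all hypothetical, and a plain majority vote stands in for the label model so the sketch runs without Snorkel installed. With Snorkel available, the flattened matrix could presumably be converted to a NumPy array and passed to `LabelModel.fit` and `LabelModel.predict_proba` instead.

```python
from collections import Counter

ABSTAIN, OTHER, ENTITY = -1, 0, 1

# Hypothetical resources; a real project would load a dictionary or knowledge base.
DISEASE_TERMS = {"diabetes", "asthma"}
STOPWORDS = {"the", "a", "has", "and", "of"}


def lf_dictionary(token):
    # Vote ENTITY when the token appears in the (assumed) disease dictionary.
    return ENTITY if token.lower() in DISEASE_TERMS else ABSTAIN


def lf_stopword(token):
    # Vote OTHER for common function words.
    return OTHER if token.lower() in STOPWORDS else ABSTAIN


def lf_suffix(token):
    # Crude heuristic: words ending in "-itis" are often disease names.
    return ENTITY if token.lower().endswith("itis") else ABSTAIN


LFS = [lf_dictionary, lf_stopword, lf_suffix]


def build_label_matrix(sentences):
    """Flatten sentences into one row per token and record sentence lengths."""
    rows, lengths = [], []
    for tokens in sentences:
        lengths.append(len(tokens))
        rows.extend([lf(tok) for lf in LFS] for tok in tokens)
    return rows, lengths


def majority_vote_probs(rows, cardinality=2):
    """Stand-in for a label model: normalized vote counts per token.

    Returns None for tokens where every labeling function abstained.
    """
    probs = []
    for row in rows:
        votes = [v for v in row if v != ABSTAIN]
        if not votes:
            probs.append(None)
            continue
        counts = Counter(votes)
        probs.append([counts.get(k, 0) / len(votes) for k in range(cardinality)])
    return probs


def split_by_sentence(flat, lengths):
    """Bookkeeping step: regroup per-token values into their original sentences."""
    out, start = [], 0
    for n in lengths:
        out.append(flat[start:start + n])
        start += n
    return out


sentences = [["The", "patient", "has", "asthma"], ["Bronchitis", "and", "diabetes"]]
rows, lengths = build_label_matrix(sentences)
probs = split_by_sentence(majority_vote_probs(rows), lengths)

# Mask marks tokens with LF coverage; uncovered tokens can be skipped in the loss.
mask = [[p is not None for p in sent] for sent in probs]
```

The `mask` corresponds to the first option mentioned above (ignore tokens without LF coverage when training the end model). For the second option, the `None` entries could be replaced with a uniform prior such as `[0.5, 0.5]` before training.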
@jason-fries thanks so much! Just to make sure it's clear: this repo has been generously maintained primarily by researchers like @jason-fries, and we are in general very capacity limited in terms of major changes to the repo. As such, we currently don't have a timeline on an NER tutorial. Contributions are very welcome though! To additionally be clear: our policy for the issues page is that questions and comments are great, but demands such as "cmon guys" are not appropriate usage. Thanks for your understanding! |
And also just to be very clear: we all really want to put more stuff out here... we're working on it, and so grateful to all of you on the issues page for your patience, enthusiasm, and support in trying Snorkel out in the meantime!!! :) |
Thanks a lot for this response @jason-fries ! @ajratner apologies for coming across impatient/rude; I've been really amazed by the current release, and the corresponding research papers and did not mean anything other than: "I'm also super interested in staying up to date on the topic". Thanks! |
thanks for this. I already experimented with a similar approach in the past, but it's really useful to me to have confirmation that this actually works quite well and there's not much difference (given enough resources) as compared to something specific to sequence data 👍 |
@jason-fries thanks for this, but could you please tell how to train the |
@raj5287 Try this as a tactical fix:
|
The thing to do here is to use skweak, not Snorkel. Snorkel is a commercial tool now, and investments in this area are going into other projects.
|
I want to create a dataset using the snorkel labelling function but I am not able to find any links.
I want to train a NER model using the above data.
Can anyone tell me how to proceed?