How to best deal with signal words to separate concatenated references #194
I think, in general, footnotes should probably be parsed with a separate parser model that supports multiple references per sequence. The current parser model only has one or two features that take the position of a token in the input sequence into account (so that, for example, words towards the beginning can be weighted more towards being part of author names). I doubt that these features are very important, and they could be dropped in a multi-reference model. The other thing that would have to change are some normalizers that currently combine or re-label segments when several share the same label. And of course the decision about which segments constitute a single reference would still have to be made -- but unlike in your current situation, you would have all the labels to work with, which should make it much easier (e.g., if you encounter one of the stop words above and you already have author, title, and year, and the thing after the stop word is again an author, it is easy to decide to split it into two references).

Of course you could also try to use the existing parser for this: train it with footnotes containing multiple references, use the XML output, and then group the segments yourself. Once you have separated the references, you can pass them to the normalizers. In other words, you would use the existing parser (but trained on multiple references per sequence) and, instead of letting it label and normalize everything, let it apply the labels only; then you separate the segments into groups belonging to a single reference and pass each group to the normalizers. You would still need to solve the same problem, but you would have a CRF-applied label for each word in your footnote to work with.
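The "apply labels, then group the segments yourself" step could be sketched roughly like this. The `[label, text]` pair shape, the signal-word list, and the splitting heuristic are all illustrative assumptions, not AnyStyle's actual API:

```ruby
# Sketch: group labelled segments into individual references.
# Assumes parser output has been reduced to an array of
# [label, text] pairs -- a hypothetical shape, not AnyStyle's API.

SIGNAL_WORDS = ['vgl.', 'siehe', 'dazu'].freeze # illustrative list

def group_references(segments)
  groups = [[]]
  seen   = {}
  segments.each do |label, text|
    signal = SIGNAL_WORDS.include?(text.downcase)
    # Start a new reference on a signal word, or when a "core" label
    # (here: author) repeats after author, title and date are present.
    if (signal || (label == :author && seen[:author] && seen[:title] && seen[:date])) &&
       !groups.last.empty?
      groups << []
      seen = {}
    end
    next if signal # drop the signal word itself
    groups.last << [label, text]
    seen[label] = true
  end
  groups.reject(&:empty?)
end

segments = [
  [:author, 'Müller, K.'], [:title, 'Ein Titel'], [:date, '1999'],
  [:signal, 'Vgl.'],
  [:author, 'Schmidt, A.'], [:title, 'Noch einer'], [:date, '2001']
]
group_references(segments).length # => 2
```

The heuristic deliberately errs on the side of splitting: a repeated author segment after a complete author/title/date triple is treated as the start of a new reference even without a signal word.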
Thank you for your thoughts. It would probably make sense to introduce a new tag for these stop/signal words, since that -- if correctly tagged by the parser -- would help to find the beginning of new references. BTW, how does the learning work in this respect: would it suffice to have a list of tagged stop words in the training material (i.e. is a single occurrence enough), or does frequency matter, i.e. would they have to appear, say, 20 times in the sequences to be correctly tagged as stop words?
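For illustration, a training sequence with such a tag might look like this. The `<signal>` tag name is an assumption (it is not one of AnyStyle's existing labels), and the exact segment labels and layout are a sketch of the tagged-XML training format, not copied from the real core dataset:

```xml
<dataset>
  <sequence>
    <signal>Vgl.</signal>
    <author>Müller, K.:</author>
    <title>Ein Titel,</title>
    <date>1999.</date>
  </sequence>
</dataset>
```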
If you train a model it will know about all the labels in the set, even if a label occurs just once. For footnotes I would expect a lot of text unrelated to references; in the current core model I think we normally use a dedicated label for that kind of text.
Thanks, I'll try that. BTW, to see what I am working on, I made a little screencast that showcases a web frontend to AnyStyle, if you are interested: https://owncloud.gwdg.de/index.php/s/u8AcKYwTn1F9PkL The end shows that there are still bugs.
Pretty cool, thanks for sharing!
I am experimenting with synthetic training data (with these signal words randomly inserted before the main reference), but the results aren't very good, even though I have hundreds of such training sequences.
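A generator for such synthetic sequences might look roughly like this. The `<signal>` tag, the word list, and the sequence layout are assumptions about the training format, not AnyStyle's documented schema:

```ruby
# Sketch: wrap an existing gold-standard tagged reference in a new
# training sequence, prepending a randomly chosen signal word.
# Tag name and XML layout are illustrative assumptions.
SIGNALS = ['Vgl.', 'Siehe', 'Dazu', 'So auch'].freeze

def synthesize(tagged_reference, rng: Random.new)
  signal = SIGNALS[rng.rand(SIGNALS.length)]
  "<sequence><signal>#{signal}</signal> #{tagged_reference}</sequence>"
end

puts synthesize('<author>Müller, K.:</author> <title>Ein Titel,</title> <date>1999.</date>')
```

As noted in the reply below about positional effects, it would be worth varying where the signal word lands rather than always prepending it.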
Like I said, I don't think that an additional feature would help in this case. In general, if a word like 'vgl.' occurs in the training data with a consistent label, the model will pick it up from context. One thing that may be happening: if you always insert the word at the beginning of your synthetic references when training, the model may learn to recognize it only when it is at the start of a reference (this is even more likely if other occurrences of 'vgl.' are labelled differently elsewhere in the training set).
I work with references in footnotes, which often contain several references in one line -- i.e. they are not cleanly separated like the entries of a standalone bibliography. There are a number of signal words and punctuation marks which make it very clear to a human reader where one citation starts and ends, but it is hard to figure out exact rules for separating them so that the AnyStyle parser can do its magic.
The semicolon as separator already goes a long way, because it is not normally part of any citation style; however, semicolons can also be found in titles. I came up with a couple of regular expressions, but as always with regexes you have to cover each and every case, and there will always be false positives.
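A regex-based pre-splitter along those lines could look like this. The signal-word list and the pattern are illustrative assumptions, and it will still mis-split titles that happen to contain semicolons:

```ruby
# Sketch: split a footnote line into citation candidates at semicolons
# and before common German signal words. Word list and pattern are
# illustrative; expect false positives.
SPLIT_PATTERN = /
  \s*;\s*                                   # semicolon between citations
  |
  \s+(?=(?:[Vv]gl\.|[Ss]iehe|[Dd]azu)\s)    # whitespace before a signal word
/x

def presplit(line)
  line.split(SPLIT_PATTERN).reject(&:empty?)
end

presplit('Müller, Ein Titel, 1999, S. 12; vgl. Schmidt, Anders, 2001')
# => ["Müller, Ein Titel, 1999, S. 12", "vgl. Schmidt, Anders, 2001"]
```

The lookahead keeps the signal word attached to the citation that follows it, so a later step can still use it as evidence for a reference boundary.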
I wondered whether it would make sense to train a separate model just on these words to preprocess the raw reference lines. What would be your approach to dealing with this problem?
Thank you.