Issue #8 in Mini-RFC #11

Open
golnazads opened this issue Sep 25, 2024 · 8 comments

Comments

@golnazads
Contributor

golnazads commented Sep 25, 2024

The import of references from classic files into the DB should just focus on the file parsing details and be decoupled from the individual reference format.

@ehenneken
Member

Isn't that how things work currently? For reference files with a default extension, the extension determines what parser to use. In all other cases, the journal and volume determine what parser to use. Individual references do not carry the required information to determine what parser to use.
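The selection rule described in this comment could be sketched roughly as follows. This is only an illustration: the mapping tables, the parser names, and the assumed `<journal>/<volume>/<file>` path layout are hypothetical and are not taken from the pipeline.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical tables; the pipeline's real mappings are not shown in this thread.
DEFAULT_EXTENSION_PARSERS = {".raw": "text_parser", ".xml": "xml_parser"}
JOURNAL_VOLUME_PARSERS = {("JrnlA", "0012"): "custom_parser"}


def select_parser(filename: str) -> Optional[str]:
    """Pick a parser name for a reference file, or None if nothing matches."""
    path = PurePosixPath(filename)

    # 1. A default extension decides the parser on its own.
    parser = DEFAULT_EXTENSION_PARSERS.get(path.suffix)
    if parser is not None:
        return parser

    # 2. Otherwise fall back to journal and volume, read here from the
    #    last two directory components (an assumed layout).
    if len(path.parts) < 3:
        return None
    journal, volume = path.parts[-3], path.parts[-2]
    return JOURNAL_VOLUME_PARSERS.get((journal, volume))
```

Note that nothing in the lookup inspects an individual reference string; only file-level information is used.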

@golnazads
Contributor Author

Yes, I added this issue because there was an ongoing discussion about addressing errors in reference files and whether to correct them directly in the file or within the pipeline. I wasn't sure if a conclusion had been reached. My approach has always been to manually fix the small number of problematic reference files, because I combined the parsing logic of various formats into a unified parser to minimize the number of parsers and make maintenance easier. For example, some text reference files format multi-line references with a tab at the beginning of each line, while others start the first line at the left margin and use a tab only for the subsequent lines. A single parser can only handle one of these formats. I implemented the format that correctly parses the majority of reference files and anticipated manually fixing the few outliers.

There was a suggestion to add more parsers, but I am not in favor of this approach as it would increase complexity and maintenance overhead—issues that the classic system already struggles with and that we want to avoid here.
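To make the tab example concrete, here is a minimal sketch of a joiner for the layout in which only continuation lines start with a tab. The function name and the sample references are made up, and the thread does not say which of the two layouts the unified parser actually implements, so the choice here is arbitrary.

```python
from typing import List


def join_continuation_lines(text: str) -> List[str]:
    """Join multi-line references where continuation lines start with a tab."""
    references: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("\t") and references:
            # A leading tab marks a continuation of the previous reference.
            references[-1] += " " + line.strip()
        else:
            # No leading tab (or nothing to continue): start a new reference.
            references.append(line.strip())
    return references


# Layout handled by this sketch: tab on continuation lines only.
hanging = "Smith, J. 2001,\n\tApJ, 1, 2\nDoe, A. 2002, AJ, 3, 4"
# Other layout: tab at the beginning of every line.
all_tabbed = "\tSmith, J. 2001,\n\tApJ, 1, 2\n\tDoe, A. 2002, AJ, 3, 4"

print(join_continuation_lines(hanging))
print(join_continuation_lines(all_tabbed))
```

With the first input the two references come out separately. With the second input every line looks like a continuation, so both references are merged into one string, which is the kind of outlier that would need a manual fix or a different parser.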

@golnazads
Contributor Author

@ehenneken I am including two unresolved issues that were discussed during the meeting for further review and action. Both relate to nested reference strings. The first issue involves references separated by semicolons. As detailed in the Feedback Document - Reference Pipeline, the pipeline cannot use the semicolon to break up references because, in some instances, it separates the title and journal within a single reference string. Therefore, I have refrained from splitting references on semicolons. I have documented all the instances where manual correction is required; there are not that many. The second issue concerns nested author replacement. Multiple underscores or hyphens indicate author substitution for multiple references in one line. The pipeline performs the substitution only if the first reference string includes the year after the list of authors and each subsequent reference string includes the year after the underscores or hyphens. If the year is not present, the pipeline is unable to replace the authors because it lacks the necessary anchor for substitution.
@aaccomazzi
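The year-anchored author substitution described above might look roughly like the sketch below. It is not the pipeline's code: the regular expressions, the function name, and the sample strings are assumptions, and the two reference strings are taken as already separated.

```python
import re

# A four-digit year from 1800 to 2099, used as the anchor (an assumed range).
YEAR = r"(?:1[89]|20)\d{2}"

# Everything in the first reference up to, but not including, the year.
AUTHORS_BEFORE_YEAR = re.compile(rf"^(?P<authors>.+?)(?=\b{YEAR}\b)")

# Two or more underscores or hyphens, plus trailing punctuation,
# immediately followed by a year.
PLACEHOLDER_BEFORE_YEAR = re.compile(
    rf"^\s*(?:_{{2,}}|-{{2,}})[\s,.(]*(?=\b{YEAR}\b)"
)


def substitute_authors(first: str, following: str) -> str:
    """Replace the placeholder in `following` with the authors of `first`.

    The substitution only happens when both strings contain the year
    anchor; otherwise `following` is returned unchanged.
    """
    authors = AUTHORS_BEFORE_YEAR.match(first)
    placeholder = PLACEHOLDER_BEFORE_YEAR.match(following)
    if authors is None or placeholder is None:
        return following
    return authors.group("authors") + following[placeholder.end():]


first_ref = "Smith, J., & Doe, A. 2001, ApJ, 1, 2"
print(substitute_authors(first_ref, "---. 2003, AJ, 5, 6"))
print(substitute_authors(first_ref, "---. AJ, 5, 6"))
```

The first call fills in the authors because both strings carry a year. The second call returns its input untouched, since there is no year after the hyphens to anchor the replacement.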

@aaccomazzi
Member

What are the journals where we see semicolons separating titles and journals? The formats may be sufficiently different to allow some branching for the logic associated with reference processing.
For instance, the semicolons separating multiple references are common in APS (physics) journals, but not in astronomy.

@golnazads
Contributor Author

I don’t remember, and unfortunately, I did not document it while working on the pipeline. I only knew they existed, likely from arXiv, because I had to ensure that the reference service could parse them correctly and separate the title from the journal when tokenizing. I primarily worked with arXiv while implementing the reference service.

I would rather not go down this route and create another parser, as my understanding, along with the documented instances, is that there aren’t that many cases, and they can be fixed manually. However, if you insist on having multiple parsers and returning to the way things were done in classic, please provide the bibstems of the files to redirect to a new parser and accept semicolons as multi-references.

@golnazads
Contributor Author

Please see error #2b in the text parsers verification report for a specific example of the semicolon issue.

@aaccomazzi
Member

Thanks, the report shows, as you say, that there are instances where semicolons are improperly used (they should be commas). I don't have the data to back up the following statement, but this is my current guess: the examples you have shown in the Google Doc are the outliers that need to be fixed by hand, because most physics journals tend to consistently use semicolons to separate references. The data "fix" should therefore be to edit those references where the semicolon was really supposed to be a comma, rather than the other way around.

If I'm correct, it may be that an additional non-manual solution is possible: knowing the source journal for a reference (which we know since we use it to select the proper handler), we could "turn on" semicolon-splitting based on it. The logic behind this is that for the major physics journals we know that semicolons are used to separate references, whereas in all other cases they are not.
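A minimal sketch of that idea, assuming the source journal is available as a bibstem string when a line is processed. The function name is hypothetical, and the bibstems in the set are purely illustrative; which journals belong there would have to come from the data.

```python
from typing import List

# Illustrative only: journals for which semicolon splitting is switched on.
SEMICOLON_SPLIT_BIBSTEMS = frozenset({"PhRvL", "PhRvD"})


def split_references(line: str, bibstem: str) -> List[str]:
    """Split a line into references, on semicolons only for listed journals."""
    if bibstem not in SEMICOLON_SPLIT_BIBSTEMS:
        # Default behaviour: the semicolon may sit inside a single
        # reference (for example between title and journal), so keep it.
        return [line.strip()]
    return [part.strip() for part in line.split(";") if part.strip()]


line = "A. Author, Phys. Rev. D 1, 2 (2001); B. Author, Phys. Rev. Lett. 3, 4 (2002)"
print(split_references(line, "PhRvD"))
print(split_references(line, "ApJ"))
```

Keeping the switch in a single lookup set avoids adding a separate parser: the same code path handles every journal, and only the set membership changes the splitting behaviour. References in the listed journals where the semicolon should have been a comma would still need a manual fix.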

@golnazads
Contributor Author

golnazads commented Oct 10, 2024 via email
