bibtex.py parser accents problems #250

Open
vramiro opened this issue Dec 18, 2012 · 6 comments

@vramiro
vramiro commented Dec 18, 2012

While displaying the JSON generated by the bibtex.py parser, I get all the accents wrong (shifted by one position). For instance, I see `Eric instead of Èric (fixing this corresponds to changing \u0301Eric to E\u0301ric).

I wrote a simple patch, but I am not sure it will work in all cases.

In the string_subst(self, val) function change:

                    if key+1 < len(parts) and len(parts[key+1]) > 0:
                        parts[key+1] = parts[key+1][0:]

to

                    if key+1 < len(parts) and len(parts[key+1]) > 0:
                        ### Change order to display accents
                        parts[key] = parts[key] + parts[key+1][0]
                        parts[key+1] = parts[key+1][1:]

Am I missing something with my solution?
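For reference, here is a minimal check of the ordering I mean, using only Python's standard unicodedata module (the variable names are just for illustration). A combining mark such as U+0301 attaches to the character before it, so it only composes when it follows the letter:

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

wrong = ACUTE + "Eric"       # accent emitted before its base letter
right = "E" + ACUTE + "ric"  # accent emitted after its base letter

# NFC composes a base letter followed by a combining mark into one code point.
composed_wrong = unicodedata.normalize("NFC", wrong)
composed_right = unicodedata.normalize("NFC", right)

print(len(composed_wrong))  # 5: the mark has no base letter to attach to
print(len(composed_right))  # 4: "E" + U+0301 became a single character
```

So the browser is probably displaying exactly what it receives, which suggests the ordering in the JSON is what needs fixing.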

@sciunto
sciunto commented Jan 1, 2013

Hi,

I also use this library in one of my projects and I agree that this is a bug. Nevertheless, I do not agree with the solution.

Your workaround is equivalent to replacing \'e with e\' in your BibTeX file, and this is not correct.

I investigated the issue a bit, though not in depth. My understanding is the following:
The value can contain \'e or \'{e}. In both cases, by the time it reaches the section you mention, it has become \'e, because the {} are stripped.
Then, from the dictionary (Unicode to LaTeX), it selects the \' value, and this leads to a wrong accent position.

I guess there are at least two problems:

  • {} are stripped for special characters, and they should not be. One needs to find where they are replaced [1] and how to fix that. This will fix the bug if the BibTeX is encoded with braces.
  • Accents might not have braces around them. IMHO, this case should not be handled by the \' value (for instance). I don't know whether it is better to add {} around the next character or to fix it another way.

[1] As far as I understand, it does not come from the strip_braces calls in add_val. It probably happens earlier, in parse_record(), in the block starting with the comment # for each line in record

@sciunto
sciunto commented Jan 1, 2013

OK, I think I got it. It was easier than I thought.

Here is my diff.
Basically, I think the for loop is useless.
The original issue is caused by wrong indentation: after the first item in the dict was processed, braces were removed from the string.

In my previous post, I was wrong. The first point I mentioned seems to be supported already.

diff --git a/parserscrapers_plugins/bibtex.py b/parserscrapers_plugins/bibtex.py
index cfea621..aa9d669 100755
--- a/parserscrapers_plugins/bibtex.py
+++ b/parserscrapers_plugins/bibtex.py
@@ -244,11 +244,8 @@ class BibTexParser(object):
             for k, v in self.unicode_to_latex.iteritems():
                 if v in val:
                     parts = val.split(str(v))
-                    for key,val in enumerate(parts):
-                        if key+1 < len(parts) and len(parts[key+1]) > 0:
-                            parts[key+1] = parts[key+1][0:]
                     val = k.join(parts)
-                val = val.replace("{","").replace("}","")
+            val = val.replace("{","").replace("}","")
         return val

     def add_val(self, val):

Let me know if everything is fine on your side.
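To illustrate the indentation effect, here is a toy reproduction in Python 3. It is only a sketch: the two-entry table below is hypothetical and assumes LaTeX values that contain braces, which may not match the real unicode_to_latex dict, and the strip_inside_loop flag exists only to compare the two behaviours.

```python
# Hypothetical table (assumption): the real unicode_to_latex dict is larger
# and its values may be written differently.
UNICODE_TO_LATEX = {
    "\u00e7": "{\\c c}",
    "\u00c9": "{\\'E}",
}


def string_subst(val, strip_inside_loop):
    """Replace LaTeX sequences with Unicode, then drop remaining braces."""
    for k, v in UNICODE_TO_LATEX.items():
        if v in val:
            val = k.join(val.split(v))
        if strip_inside_loop:
            # Old indentation: braces vanish after the first dict item,
            # so later braced values can no longer match.
            val = val.replace("{", "").replace("}", "")
    # Patched indentation: strip braces once, after all substitutions.
    return val.replace("{", "").replace("}", "")


print(repr(string_subst("{\\'E}ric", strip_inside_loop=True)))
print(repr(string_subst("{\\'E}ric", strip_inside_loop=False)))
```

With the old indentation the second entry never matches, so the accent command is left in the output untranslated; with the patched indentation it is replaced before the braces are removed.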

@vramiro
Author

vramiro commented Jan 3, 2013

No, it does not work for me. I keep getting `Eric instead of Èric.
I'm also getting & instead of &

Not sure if I made this clear, but for me the problem is with the browser display of the unicode json produced.

@sciunto
sciunto commented Jan 3, 2013

I see. Try \'{E} instead of \'E. I thought point 1 was supported (based on some of my tests), but maybe it is not.

@vramiro
Author

vramiro commented Jan 3, 2013

Thanks for the answer again!

I think I did not make myself clear, so here is the whole case:

  1. In my BibTeX file I have normalized entries (with BibtexTools [1]). All my accents follow the form {\'E}ric.
  2. The JSON produced by the parser translates this to \u0301Eric, which is displayed in Chrome/Safari/Firefox as `Eric (instead of Èric).
  3. To actually get Èric displayed, I changed \u0301Eric to E\u0301ric (so I am not sure whether it is a parser issue or a display issue).

In LaTeX, {\'E}ric and \'{E}ric give the same output; I think the parser here does not treat them the same way.

[1] http://strategoxt.org/Stratego/BibtexTools

@sciunto
Copy link

sciunto commented Jan 5, 2013

Thanks for the details.

I did a quick search on the internet about the best encoding for accents. I found something interesting here:
http://tex.stackexchange.com/questions/57743/how-to-write-a-and-other-umlauts-and-accented-letters-in-bibliography/57745#57745
It tells us that {\'E}ric is better than \'{E}ric.
Of course, the parser should handle all cases, and I agree that it does not.

I do not belong to this project, so this is only my own opinion.
In addition to the previous patch, I would add a new dict, similar to unicode_to_latex, containing translations for accents written like \'{E}.
Then, update unicode_to_latex with the correct accents (like {\'E}).
string_subst should iterate over the first dict and then the second one.

Regarding only bibtex.py, a single dict would be enough because we iterate over values, not keys. But since it is a library, it can be used elsewhere in the other direction (this is the case in my project, for instance), and a dict does not guarantee ordering.

What do you think about this suggestion?
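A rough sketch of the direction I have in mind, with one deliberate swap: instead of two dicts, it uses a single regular expression that accepts {\'E}, \'{E} and \'E alike, and always puts the combining mark after the letter. The names ACCENT_TO_COMBINING, ACCENT_RE and latex_accents_to_unicode are hypothetical and not part of bibtex.py, and only five accent commands on ASCII letters are covered.

```python
import re
import unicodedata

# Hypothetical table (assumption): LaTeX accent command -> combining mark.
ACCENT_TO_COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    '"': "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
}

# Alternatives are tried in order, so the braced forms win over the bare one.
ACCENT_RE = re.compile(
    r"\{\\([\'`\"^~])([A-Za-z])\}"   # {\'E}
    r"|\\([\'`\"^~])\{([A-Za-z])\}"  # \'{E}
    r"|\\([\'`\"^~])([A-Za-z])"      # \'E
)


def _replace(match):
    # Exactly one alternative matched, so two groups are not None.
    accent, letter = [g for g in match.groups() if g is not None]
    # Base letter first, combining mark second, then compose if possible.
    return unicodedata.normalize("NFC", letter + ACCENT_TO_COMBINING[accent])


def latex_accents_to_unicode(text):
    """Translate simple LaTeX accent commands to Unicode characters."""
    return ACCENT_RE.sub(_replace, text)


for sample in ("{\\'E}ric", "\\'{E}ric", "\\'Eric"):
    print(repr(latex_accents_to_unicode(sample)))
```

All three spellings give the same result here, which is the behaviour LaTeX itself has. Because the alternatives are tried in a fixed order, this also avoids relying on dict ordering.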
