Sentence merging still an issue #23

Open
DavidNemeskey opened this issue Jan 26, 2017 · 1 comment

@DavidNemeskey
Contributor

Ali kalifa/Ali ibn Abi Tálib (Mekka, 599. július 29. - Kúfa, 661. január 24.) volt az iszlám negyedik, a „helyesen vezetettek” közé tartozó kalifája (uralkodott 656. június 17-étől haláláig).

There are three errors in the output:

  • <w>kalifa/Ali</w> remains a single token
  • a </s> is inserted after "július 29.", splitting the sentence inside the parenthesis
  • "24.)" remains a single token instead of <w>24</w><c>.</c><c>)</c> (or something similar)
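
To make the third point concrete, here is a minimal sketch in Python of the expected tagging. It is not the tokenizer's actual code: `tag_tokens` and `TOKEN_RE` are hypothetical names, and the rule (word runs become <w>, every other non-space character becomes its own <c>) is an assumption that only illustrates the desired output for "24.)". It says nothing about the sentence-boundary or slash cases.

```python
import re

# Hypothetical rule, assumed for illustration only:
# a run of word characters is one token, any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tag_tokens(text: str) -> str:
    """Wrap word tokens in <w> and punctuation tokens in <c>."""
    parts = []
    for tok in TOKEN_RE.findall(text):
        tag = "w" if re.match(r"\w", tok) else "c"
        parts.append(f"<{tag}>{tok}</{tag}>")
    return "".join(parts)

print(tag_tokens("24.)"))  # <w>24</w><c>.</c><c>)</c>
```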

@gaebor Thanks for spotting this.

@mittelholcz
Collaborator

Regarding "kalifa/Ali": most of the \w+/\w+ strings found in the Hungarian webcorpus are URLs, abbreviations (e.g. TCP/IP) or units of measurement (e.g. km/h), all of which should be treated as a single token; see webcorpus_-_top100_words_with_slash.txt.
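
A quick way to see why this is hard: the pattern alone cannot tell the cases apart. The sketch below (Python, with `SLASH_WORD` and `examples` as illustrative names, not anything from the tokenizer) shows that "kalifa/Ali" matches \w+/\w+ exactly as "TCP/IP" and "km/h" do, so any rule keyed only on this pattern has to treat all three the same way.

```python
import re

# The pattern quoted in the comment above, compiled as-is.
SLASH_WORD = re.compile(r"\w+/\w+")

# One string from the issue and the two examples from the comment.
examples = ["kalifa/Ali", "TCP/IP", "km/h"]

for s in examples:
    # fullmatch requires the whole string to fit the pattern.
    print(s, bool(SLASH_WORD.fullmatch(s)))
```

Separating "kalifa/Ali" from the others would need extra information, for example a list of known abbreviations and units, which is an assumption here and not something proposed in this thread.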
