Segmentation: Preserve case? #19

Open
davidbernat opened this issue Nov 4, 2019 · 0 comments

The Segmentation tool you provide is excellent. One feature request:

Unless I am mistaken, the tool always (1) returns the split words in lower case and (2) gives no information about where spaces were inserted. A preserve_case or capitalize parameter would be helpful for (1). The following code re-capitalizes the segmented string according to the capitalization used in the original hashtag.


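from ekphrasis.classes.segmenter import Segmenter

segmenter = Segmenter(corpus="twitter")


def word_segmentation(text, fix_case=True):
    """Segment text, optionally restoring the capitalization of the input."""
    words_string = segmenter.segment(text)
    if not fix_case:
        return words_string

    # Assumes the segmenter only lowercases the text and inserts spaces.
    fixed = ""
    n_add = 0  # number of spaces the segmenter has inserted so far
    for i in range(len(words_string)):
        # words_string[i] lines up with text[i - n_add]
        if words_string[i] == " " and text[i - n_add] != " ":
            # Inserted space: it has no counterpart in the original text.
            n_add += 1
            fixed += " "
            continue

        if text[i - n_add].isupper():
            fixed += words_string[i].upper()
        else:
            fixed += words_string[i]
    return fixed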

Of course, if the user is writing in camelCase or PascalCase, the capitalization may not be meaningful, but in other cases it can be. For instance:

I #eatsomuch food --> I eat so much food
I care so much. #IranProtests --> I care so much. Iran Protests

Arguably, a stand-alone hashtag usually acts as a proper noun, in which case the capitalization the author chose is meaningful.
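
For point (2), the same alignment idea could also report where the spaces went. The following is only a sketch: restore_case is a hypothetical helper, not part of ekphrasis, and it takes the segmented string as an argument so that it can be tried without loading the segmenter. It assumes the segmented string is the original lowercased with extra spaces inserted and nothing else changed. The segmented strings in the usage lines are written by hand to match the examples above; they are not captured output from the library.

```python
def restore_case(original, segmented):
    """Re-apply the capitalization of `original` to `segmented`.

    Hypothetical helper. Assumes `segmented` equals `original` lowercased
    with extra spaces inserted and no other changes.
    Returns (recapitalized string, offsets into `original` before which
    a space was inserted).
    """
    fixed = []
    split_offsets = []
    j = 0  # current position in `original`
    for ch in segmented:
        # A space with no matching space in the original was inserted.
        if ch == " " and (j >= len(original) or original[j] != " "):
            split_offsets.append(j)
            fixed.append(" ")
            continue
        if j >= len(original):
            raise ValueError("segmented text does not align with original")
        fixed.append(ch.upper() if original[j].isupper() else ch)
        j += 1
    return "".join(fixed), split_offsets


# Hand-written segmentations, assumed for illustration only.
print(restore_case("IranProtests", "iran protests"))
print(restore_case("eatsomuch", "eat so much"))
```

Returning offsets into the original string, rather than into the segmented one, would let a caller map each word back to its position in the hashtag.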
