
Monosegmental Morphemes #31

Open
LinguList opened this issue Apr 15, 2024 · 50 comments

@LinguList
Contributor

If I want to read a word from string now, the following situation will yield an unwanted result:

>>> Word.from_string("p a t + ia")
[['p', 'a', 't'], ['i', 'a']]

Instead, what I would expect would be:

>>> Word([x.split() for x in "p a t + ia".split(" + ")])
[['p', 'a', 't'], ['ia']]

A simple solution would be to add another subclass and to specify words then with specific morphemes. But it seems also odd to have the following situation now with the Morpheme class:

>>> Morpheme("pat")
['p', 'a', 't']
>>> str(Morpheme("pat"))
'p a t'

I remember we discussed this, but I wonder if there's a way to maybe pass an argument to the from_string method to allow for a Morpheme that would contain one segment consisting of multiple characters?

@xrotwang self-assigned this Apr 15, 2024
@xrotwang
Contributor

No, I think there's no way around this. Arguably, it's rather a bug that + is recognized as a morpheme separator in Word.from_string. We cannot really allow both abc and a b c as the morpheme a b c in the same from_string method, because then the morpheme ia consisting of one multi-character segment cannot be distinguished from the multi-segment morpheme ia: the only way to distinguish the two representations is the presence of the separator ...

@xrotwang
Contributor

We could have multiple factory methods to distinguish: from_segmented_string or similar.

@LinguList
Contributor Author

I see that we have two different morpheme representations, both of which we can assume are frequently used: abc + def (or abcdef) and a b c + d e f. The former occurs in IGT, where you don't really care whether a segment is a valid sound; the latter in lingpy/edictor/lexibank-cldf, where you do care that each segment represents a meaningful sound and must therefore allow for a separator different from the empty string.

@xrotwang
Contributor

I think that's why I was hoping we could switch to the unambiguous "nested lists" representation. If we had different factory methods "from_text" (parses stuff as it appears in text, e.g. whitespace separates words) and "from_segmented_string" (recognizes + as morpheme separator and expects (whitespace-)segmented morphemes) or similar, it would work. But then the step to doing away with this and expecting input to already be nested lists wouldn't be too big in my opinion.
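The from_segmented_string idea could be sketched roughly like this (a minimal illustration on a simplified list subclass, not the actual linse class hierarchy; the method name follows the proposal above):

```python
class Word(list):
    """A word is a list of morphemes; a morpheme is a list of segment strings."""

    @classmethod
    def from_segmented_string(cls, s, separator=" + "):
        # "+" separates morphemes, whitespace separates segments,
        # so a multi-character segment like "ia" survives intact:
        return cls([morpheme.split() for morpheme in s.split(separator)])


print(Word.from_segmented_string("p a t + ia"))  # [['p', 'a', 't'], ['ia']]
```

Since the input is already whitespace-segmented, no guessing between the abc and a b c readings is needed.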

@LinguList
Contributor Author

The good thing is that, even now, the workaround shown above works and is not really tedious.

@LinguList
Contributor Author

So I think I can perfectly live with different factory methods. This would mean then, however, that Word.from_string("a b c + a b c") would throw an error, or assume a morpheme ["a", " ", "b", " ", "c"], right?

@xrotwang
Contributor

> So I think I can perfectly live with different factory methods. This would mean then, however, that Word.from_string("a b c + a b c") would throw an error, or assume a morpheme ["a", " ", "b", " ", "c"], right?

You mean the behaviour of my proposed from_text method? Not sure. It might raise an exception on encountering the "word" +. But otherwise would result in 7 one-character words, I guess.
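The two behaviours described here could be sketched as follows (a hypothetical from_text, illustrating the proposed semantics, not existing linse code):

```python
def from_text(s, strict=True):
    # Running text: whitespace separates words; each word becomes a
    # single morpheme, naively segmented character by character.
    tokens = s.split()
    if strict and "+" in tokens:
        raise ValueError('unexpected morpheme separator "+" in running text')
    return [[list(token)] for token in tokens]


# Non-strict mode turns "a b c + a b c" into seven one-character
# "words", one of them being the bare "+":
from_text("a b c + a b c", strict=False)
```

The strict flag is only there to show both options side by side: raising on the stray "+" versus silently treating it as a word.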

@LinguList
Contributor Author

I was still thinking of the original problem I brought up. What I wanted to say is that if we leave things as they are, the behavior remains unexpected, so a change is required, since the rule "split on whitespace; if there is no whitespace, call list()" seems problematic. And if that change consists in adding different methods for initialization from different kinds of strings, I am completely fine with it.

@xrotwang
Contributor

Ok, yes, different methods should be a good, if backwards-incompatible, solution. But that's why we went with v0.1, I guess :)

@LinguList
Contributor Author

Yes :-)

@LinguList
Contributor Author

The other solution would be to subclass morphemes differently, based on their purpose, and use different from_string methods.

Right now, we have:

>>> Word.from_string("abc-def", separator="-")
[['a', 'b', 'c', '-', 'd', 'e', 'f']]
>>> TypedSequence.from_string("abc-def", separator="-", type=Segment)
['abc', 'def']
>>> TypedSequence.from_string("abc-def", separator="-", type=Morpheme)
[['a', 'b', 'c'], ['d', 'e', 'f']]

I think for broader use cases, we would like to

  1. not overwrite "type" in the Word.from_string method
  2. allow to pass the separator to the Word.from_string method

But then, we will anyway need targeted from_string methods, since the strict interpretation that goes back from word to segment does not work.

But what would be nice, in my opinion, would be:

>>> Word.from_X("a b c + ei")
[["a", "b", "c"], ["ei"]]
>>> Word.from_Y("abc-ei", separator="-")
[["a", "b", "c"], ["e", "i"]]

This would then also cover the two major use cases of IGT vs. Wordlists, right?

@LinguList
Contributor Author

I think the following code would account for the problems:

class Morpheme(TypedSequence):

    item_type = Segment
    item_separator = None

    @classmethod
    def from_string(cls, s, **kw):
        return cls(list(s))


class SegmentedMorpheme(TypedSequence):

    item_type = Segment
    item_separator = " "

    @classmethod
    def from_string(cls, s, **kw):
        return cls(s.split(cls.item_separator))


class Word(TypedSequence):

    item_type = Morpheme
    item_separator = " + "

    @classmethod
    def from_string(cls, s, **kw):
        kw["type"] = kw.get("type", cls.item_type)
        kw["separator"] = kw.get("separator", cls.item_separator)
        return cls([kw["type"].from_string(m) for m in
                    s.split(kw["separator"])], **kw)

This would result in:

In [56]: Word.from_string("abc-def", separator="-")
Out[56]: [['a', 'b', 'c'], ['d', 'e', 'f']]

In [57]: Word.from_string("abc-def", separator="+")
Out[57]: [['a', 'b', 'c', '-', 'd', 'e', 'f']]

In [58]: Word.from_string("abc+def", separator="+")
Out[58]: [['a', 'b', 'c'], ['d', 'e', 'f']]

In [59]: Word.from_string("abc+def", separator="+", type=SegmentedMorpheme)
Out[59]: [['abc'], ['def']]

In [60]: Word.from_string("abc + d e f", separator=" + ", type=SegmentedMorpheme)
Out[60]: [['abc'], ['d', 'e', 'f']]

In [61]: Word.from_string("abc + d e f", separator=" + ", type=Morpheme)

TypeError: Segments must be non-empty strings without whitespace or "+".

@LinguList
Contributor Author

I'd consider this behaviour as expected and desirable in all cases, also allowing us to handle different applications with the same Word class.

@xrotwang
Contributor

Hm. That doesn't "feel" right. Internally we want morphemes to be exactly the same things - no matter where they came from, no? And even though that could be bolted on here with tailored __eq__ methods or similar, it would still provide potential for confusion.

@xrotwang
Contributor

Then I'd rather replace your type argument with morpheme_factory=Morpheme.from_segmented_string or similar.
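The morpheme_factory idea could look roughly like this (a sketch on simplified classes; Morpheme.from_segmented_string and from_text are the proposed, not yet existing, names):

```python
class Morpheme(list):
    @classmethod
    def from_text(cls, s):
        # text-style input: every character is a segment
        return cls(list(s))

    @classmethod
    def from_segmented_string(cls, s):
        # segmented-style input: whitespace separates segments
        return cls(s.split())


class Word(list):
    @classmethod
    def from_string(cls, s, separator=" + ",
                    morpheme_factory=Morpheme.from_segmented_string):
        # The factory decides how each chunk becomes a Morpheme; the
        # resulting morphemes are the same kind of object either way.
        return cls([morpheme_factory(chunk) for chunk in s.split(separator)])


Word.from_string("p a t + ia")
# [['p', 'a', 't'], ['ia']]
Word.from_string("pat-ia", separator="-", morpheme_factory=Morpheme.from_text)
# [['p', 'a', 't'], ['i', 'a']]
```

This keeps a single Morpheme class internally while letting the caller pick the parsing convention at the call site.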

@LinguList
Contributor Author

I'd be fine with that. The problem is that the use cases are different, and that we have different levels of representation. So we could also argue that we have a Morpheme in one case and something else in the other case.

@LinguList
Contributor Author

To confirm: the different levels of representation are these. In phonetic transcription, we need space-segmented representations of morphemes. In orthography, or in many IGT examples (even when they claim to be phonetic transcriptions), we do not work at this level, so a morpheme is represented by a sequence without spaces. The two types are not comparable, because they represent different levels.

@xrotwang
Contributor

> So we could also argue that we have a Morpheme in one case and something else in the other case.

But having two things of the same class Word where in one case Morphemes are "inside" and in the other case something else seems very opaque.

@LinguList
Contributor Author

So would we want to have two different words then?

@LinguList
Contributor Author

In this case, I would suggest adding something like a WordInText that consists of MorphemeInText, alongside a Word that consists of Morphemes.

@xrotwang
Contributor

No, I'd vote for a single Word class, but Word.from_text would call Morpheme.from_text for each morpheme, while Word.from_segmented_morphemes would call Morpheme.from_segmented_string or similar.

@xrotwang
Contributor

If we don't "reuse" from_string as method name, we could even keep the old implementation, adding a deprecation warning or similar.
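Keeping the old entry point while steering users to the new names could be sketched with the standard warnings module (illustrative class, not the actual linse implementation):

```python
import warnings


class Word(list):
    @classmethod
    def from_string(cls, s, separator=" + "):
        # Old entry point, kept for backwards compatibility but flagged:
        warnings.warn(
            "Word.from_string is deprecated; use from_text or "
            "from_segmented_string instead.",
            DeprecationWarning,
            stacklevel=2)
        return cls([m.split() for m in s.split(separator)])
```

With stacklevel=2 the warning points at the caller's line, which makes migrating old code easier.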

@LinguList
Contributor Author

Yes, that would be easiest, it seems.

@arubehn

arubehn commented Jul 16, 2024

From what I can see, this issue has not been resolved yet, right? I stumbled upon the same unwanted behavior (also using the same workaround that @LinguList proposed here 😄) and wanted to report it, only to find that you had already discussed it here.

If neither of you has worked on a fix yet, I could have a go at it and send a PR later.

@LinguList
Contributor Author

@arubehn and @xrotwang, I have just looked again at the current code, and I think we should discuss what we want as the basic behaviour here, before we modify anything.

My suggestion would be that a "normal" Morpheme should not fall back to list(s) when no whitespace is detected. This is semantically problematic, because whitespace is not the sole criterion for space-segmentation, as cases like ai show. So this part should be modified in favor of a single solution, perhaps with two methods to call:

>>> Morpheme.from_segments("ai")
["ai"]
>>> Morpheme.from_string("ai")
["a", "i"]

The same should hold for the Word class, where we can likewise argue that we have the two sources (segments here refers to the CLDF segmentation practice).

>>> Word.from_segments("ai + a a")
[["ai"], ["a", "a"]]
>>> Word.from_string("ai-aa", separator="-")
[["a", "i"], ["a", "a"]]
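The proposal above can be sketched end to end like this (simplified list subclasses standing in for the linse classes; the factory names and default separators follow this comment):

```python
class Morpheme(list):
    @classmethod
    def from_segments(cls, s):
        # CLDF-style segmented input: whitespace separates segments,
        # so "ai" is one multi-character segment.
        return cls(s.split())

    @classmethod
    def from_string(cls, s):
        # plain-text input: each character is a segment
        return cls(list(s))


class Word(list):
    @classmethod
    def from_segments(cls, s, separator="+"):
        return cls([Morpheme.from_segments(m) for m in s.split(separator)])

    @classmethod
    def from_string(cls, s, separator="-"):
        return cls([Morpheme.from_string(m) for m in s.split(separator)])


Word.from_segments("ai + a a")           # [['ai'], ['a', 'a']]
Word.from_string("ai-aa", separator="-")  # [['a', 'i'], ['a', 'a']]
```

Splitting each chunk with str.split() also conveniently strips the whitespace around the "+" in the segmented case.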

@arubehn

arubehn commented Jul 17, 2024

I agree, it seems reasonable to have two different factory methods for these two use cases.

@LinguList
Contributor Author

And we could also make sure that we set the default separators to - for the from_string case and + for the from_segments case.

@arubehn

arubehn commented Jul 17, 2024

Just to make sure we're on the same page, the second use case would be representations such as ge-gang-en, right? In that case, this sounds like a good solution to me.

@LinguList
Contributor Author

At the moment, the item_separator cannot be modified in linse. So we would have to make sure that the separator can be modified from the call. Right now, we can do the following:

>>> w = Word([["a"], ["b"]], separator=" _ ")
>>> print(w)
a _ b

But not the following:

>>> w = Word(separator=" _ ").from_string("b a _ a u")

That is: the call goes through, but the separator used is still " + ".

Ideally, the call should be:

>>> word.from_string(STRING, separator=X)

@LinguList
Contributor Author

Would you like to make a proposal in this regard, @arubehn?

@arubehn

arubehn commented Jul 17, 2024

Sure!

@xrotwang
Contributor

> At the moment, the item_separator cannot be modified in linse. So we would have to make sure that the separator can be modified from the call. Right now, we can do the following:
>
> >>> w = Word([["a"], ["b"]], separator=" _ ")
> >>> print(w)
> a _ b
>
> But not the following:
>
> >>> w = Word(separator=" _ ").from_string("b a _ a u")
>
> That is: we can do it, but the separator is " + ".
>
> Ideally, the call should be:
>
> >>> word.from_string(STRING, separator=X)

I think we should make sure that particularities of the separator are strictly kept to input/output functionality. I wouldn't want the Word instance to store the separator that was used when the word was instantiated. I.e. internally, a Word is just a list of morphemes where a morpheme is a list of segments.
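This separation of concerns can be sketched as follows (illustrative code: the separator appears only in parsing and serialisation, never as instance state; as_string is a hypothetical name):

```python
class Word(list):
    """Internally just a list of morphemes (lists of segment strings);
    no separator is stored on the instance."""

    @classmethod
    def from_string(cls, s, separator=" + "):
        # the separator is consumed here and then forgotten
        return cls([m.split() for m in s.split(separator)])

    def as_string(self, separator=" + "):
        # the output separator is chosen at serialisation time,
        # independently of whatever the input used
        return separator.join(" ".join(m) for m in self)


w = Word.from_string("b a _ a u", separator=" _ ")
w.as_string()               # 'b a + a u'
w.as_string(separator="-")  # 'b a-a u'
```

A nice consequence: two Words parsed from strings with different separators compare equal, since only the nested-list content matters.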

@xrotwang
Contributor

So I'd consider

>>> w = Word([["a"], ["b"]], separator=" _ ")
>>> print(w)
a _ b

to be a bug.

@xrotwang
Contributor

From my experience with LingPy, having these intuitions/assumptions about string representations of data structures permeate the code base is really difficult - and in particular difficult to get rid of later.

@arubehn

arubehn commented Jul 17, 2024

@xrotwang so in other words, you are saying that the item_separator of each class should be a constant, independent of which separator is used in the input? So, the code should behave somewhat like this:

w = Word.from_segments("a b _ c", separator="_")
str(w)
>>> 'a b + c'

@arubehn

arubehn commented Jul 17, 2024

i.e. the separator parameter is only used locally in the method and does not overwrite cls.item_separator?

@xrotwang
Contributor

> i.e. the separator parameter is only used locally in the method and does not overwrite cls.item_separator?

Yes, that would be my preference. I.e. cls.item_separator is just the default. Maybe it should not even be a class attribute but a module constant that is used as the default in method signatures.
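The module-constant variant could look like this (hypothetical constant names; the behaviour matches arubehn's 'a b + c' example above):

```python
# Module-level defaults, used only in method signatures,
# never stored on instances or overwritten on classes:
MORPHEME_SEPARATOR = " + "
SEGMENT_SEPARATOR = " "


class Word(list):
    @classmethod
    def from_segments(cls, s, separator=MORPHEME_SEPARATOR):
        # a caller-supplied separator stays local to this call
        return cls([m.split(SEGMENT_SEPARATOR) for m in s.split(separator)])

    def __str__(self):
        # output always falls back to the module defaults,
        # regardless of what the input looked like
        return MORPHEME_SEPARATOR.join(
            SEGMENT_SEPARATOR.join(m) for m in self)


str(Word.from_segments("a b _ c", separator=" _ "))  # 'a b + c'
```

So parsing with an exotic separator leaves no trace: serialisation is always canonical.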

@LinguList
Contributor Author

Okay, but then we must have two instances of a Word.

@LinguList
Contributor Author

We have a word that is the one that we find in LingPy's wordlists and in CLDF Wordlists, and we have a word that is found in IGT.

@LinguList
Contributor Author

Ah, wait. If we agree on + as the standard separator, I am also fine. But please note that this may have an impact on phrases and IGT representations, which we also wanted to cover. This was the reason the discussion emerged in the first place: we assume that we list() a string if we don't find whitespace, a behaviour that is not fully in line with the specification of a segmented string as we normally find it for Segments in CLDF Wordlists, right?

@LinguList
Contributor Author

So if we say that, apart from Segments, another type of representation must be covered in from_string, I'd just modify the current Morpheme.from_string method to not use list() when no whitespace is detected, since that behaviour is also very implicit.

@arubehn

arubehn commented Jul 17, 2024

Well, from what I see in the code, the list(s) was supposed to be replaced by a proper segmentation method later on anyway:

            if re.search(r'\s+', s):
                # We assume that s is a whitespace-separated list of segments:
                s = s.split()
            else:
                #
                # FIXME: do segmentation here!
                #
                s = list(s)

@LinguList
Contributor Author

@arubehn, if you replace s = list(s) by s = [s], it should work in the use cases we have discussed so far, but would probably lead to unwanted behaviour in tests further down on IGT examples.
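For illustration, the suggested change in the context of the snippet quoted above would read like this (a standalone sketch of the parsing branch only, not a patch to the actual method):

```python
import re


def parse_morpheme(s):
    # Sketch of the discussed change to the whitespace fallback:
    if re.search(r'\s+', s):
        # whitespace-separated list of segments
        return s.split()
    # previously: list(s), i.e. "ia" -> ['i', 'a']
    # suggested:  [s],     i.e. "ia" -> ['ia'], one multi-character segment
    return [s]
```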

@arubehn

arubehn commented Jul 17, 2024

@LinguList I did not really touch the from_string method, but implemented a parallel factory method from_segments instead. I thought that was what we agreed upon.

For IGT applications, I think it does make sense to perform some segmentation under the hood (even if it's just a naive character-by-character separation). So I think it makes sense to have two separate methods for these two use cases.

Likewise for the output, we could easily implement methods that allow for the generation of IGT-style text again, with arguments specifying which separator should be used, whether and how whitespace should be inserted, etc. That way, the item_separator attribute would just be an internal default, and applications of linse would not be bound to the choice of those symbols.

@xrotwang
Contributor

Unrelatedly, the issue of applying orthography profiles to IGT data just came up. Sebastian Nordhoff came across a case where - was used for the glottal stop, for whatever reason. My first reaction was to bring up orthography profiles.

@arubehn

arubehn commented Jul 17, 2024

On the same note, I am also not sure if we want to stick to the assumption that + can never be a segment:

class Segment(str):
    """
    A segment is a non-empty string which does not contain punctuation.
    """
    def __new__(cls, s):
        if not isinstance(s, str) or re.search(r'\s+', s) or '+' in s or not s:
            raise TypeError('Segments must be non-empty strings without whitespace or "+".')
        return str.__new__(cls, s)

I would bet that there is some dataset out there that uses + for a certain sound 😄

@LinguList
Contributor Author

I think we can do so for now.

@LinguList
Contributor Author

I don't have time to review now, @arubehn, but would try to have a look later. @xrotwang, would you also agree with the two methods from_segments and from_string for now?

@xrotwang
Contributor

Two methods sounds good. I should have time for review later today.

@LinguList
Contributor Author

Okay :-)

Development

No branches or pull requests

3 participants