
Monosegmental Morphemes #31

Open
LinguList opened this issue Apr 15, 2024 · 50 comments

@LinguList
Contributor

If I want to read a word from string now, the following situation will yield an unwanted result:

>>> Word.from_string("p a t + ia")
[['p', 'a', 't'], ['i', 'a']]

Instead, what I would expect would be:

>>> Word([x.split() for x in "p a t + ia".split(" + ")])
[['p', 'a', 't'], ['ia']]

A simple solution would be to add another subclass and to specify words then with specific morphemes. But it seems also odd to have the following situation now with the Morpheme class:

>>> Morpheme("pat")
['p', 'a', 't']
>>> str(Morpheme("pat"))
'p a t'

I remember we discussed this, but I wonder if there's a way to maybe pass an argument to the from_string method to allow for a Morpheme that would contain one segment consisting of multiple characters?

@xrotwang self-assigned this Apr 15, 2024
@xrotwang
Contributor

No, I think there's no way around this. Arguably, it's rather a bug that + is recognized as a morpheme separator in Word.from_string. We cannot really allow both abc and a b c as the morpheme a b c in the same from_string method, because then the morpheme ia consisting of one multi-character segment cannot be distinguished from the multi-segment morpheme ia: the only way to distinguish the two representations is the presence of the separator ...

@xrotwang
Contributor

We could have multiple factory methods to distinguish: from_segmented_string or similar.

@LinguList
Contributor Author

I see that we have two different morpheme representations, both of which we can assume are frequently used: abc + def (or abcdef) and a b c + d e f. The former occurs in IGT, where you don't really care whether a segment is a valid sound; the latter in lingpy/edictor/lexibank-cldf, where you do care that each segment represents a meaningful sound and must therefore allow for a separator different from the empty string.

@xrotwang
Contributor

I think that's why I was hoping we could switch to the unambiguous "nested lists" representation. If we had different factory methods "from_text" (parses stuff as it appears in text, e.g. whitespace separates words) and "from_segmented_string" (recognizes + as morpheme separator and expects (whitespace-)segmented morphemes) or similar, it would work. But then the step to doing away with this and expecting input to already be nested lists wouldn't be too big in my opinion.
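The from_segmented_string idea could be sketched roughly like this (a minimal illustration on a simplified list subclass, not the actual linse class hierarchy; the method name follows the proposal above):

```python
class Word(list):
    """A word is a list of morphemes; a morpheme is a list of segment strings."""

    @classmethod
    def from_segmented_string(cls, s, separator=" + "):
        # "+" separates morphemes, whitespace separates segments,
        # so a multi-character segment like "ia" survives intact:
        return cls([morpheme.split() for morpheme in s.split(separator)])


print(Word.from_segmented_string("p a t + ia"))  # [['p', 'a', 't'], ['ia']]
```

Since the input is already whitespace-segmented, no guessing between the abc and a b c readings is needed.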

@LinguList
Contributor Author

The good thing is that, even now, the workaround shown above works and is not really tedious.

@LinguList
Contributor Author

So I think I can perfectly live with different factory methods. This would mean then, however, that Word.from_string("a b c + a b c") would throw an error, or assume a morpheme ["a", " ", "b", " ", "c"], right?

@xrotwang
Contributor

> So I think I can perfectly live with different factory methods. This would mean then, however, that Word.from_string("a b c + a b c") would throw an error, or assume a morpheme ["a", " ", "b", " ", "c"], right?

You mean the behaviour of my proposed from_text method? Not sure. It might raise an exception on encountering the "word" +. But otherwise would result in 7 one-character words, I guess.
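The two behaviours described here could be sketched as follows (a hypothetical from_text, illustrating the proposed semantics, not existing linse code):

```python
def from_text(s, strict=True):
    # Running text: whitespace separates words; each word becomes a
    # single morpheme, naively segmented character by character.
    tokens = s.split()
    if strict and "+" in tokens:
        raise ValueError('unexpected morpheme separator "+" in running text')
    return [[list(token)] for token in tokens]


# Non-strict mode turns "a b c + a b c" into seven one-character
# "words", one of them being the bare "+":
from_text("a b c + a b c", strict=False)
```

The strict flag is only there to show both options side by side: raising on the stray "+" versus silently treating it as a word.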

@LinguList
Contributor Author

I was still thinking of the original problem I brought up. What I wanted to say is that if we leave things as they are, the behavior remains unexpected, so a change is required, since the rule "split on whitespace; if there is no whitespace, call list()" seems problematic. And if that change consists in adding different methods for initialization from different kinds of strings, I am completely fine with it.

@xrotwang
Contributor

Ok, yes, different methods should be a good, if backwards-incompatible, solution. But that's why we went with v0.1, I guess :)

@LinguList
Contributor Author

Yes :-)

@LinguList
Contributor Author

The other solution would be to subclass morphemes differently, based on their purpose, and use different from_string methods.

Right now, we have:

>>> Word.from_string("abc-def", separator="-")
[['a', 'b', 'c', '-', 'd', 'e', 'f']]
>>> TypedSequence.from_string("abc-def", separator="-", type=Segment)
['abc', 'def']
>>> TypedSequence.from_string("abc-def", separator="-", type=Morpheme)
[['a', 'b', 'c'], ['d', 'e', 'f']]

I think for broader use cases, we would like to

  1. not overwrite "type" in the Word.from_string method
  2. allow to pass the separator to the Word.from_string method

But then, we will anyway need targeted from_string methods, since the strict interpretation that goes back from word to segment does not work.

But what would be nice, in my opinion, would be:

>>> Word.from_X("a b c + ei")
[["a", "b", "c"], ["ei"]]
>>> Word.from_Y("abc-ei", separator="-")
[["a", "b", "c"], ["e", "i"]]

This would then also cover the two major use cases of IGT vs. Wordlists, right?

@LinguList
Contributor Author

I think the following code would account for the problems:

class Morpheme(TypedSequence):

    item_type = Segment
    item_separator = None

    @classmethod
    def from_string(cls, s, **kw):
        return cls(list(s))


class SegmentedMorpheme(TypedSequence):

    item_type = Segment
    item_separator = " "

    @classmethod
    def from_string(cls, s, **kw):
        return cls(s.split(cls.item_separator))


class Word(TypedSequence):

    item_type = Morpheme
    item_separator = " + "

    @classmethod
    def from_string(cls, s, **kw):
        kw["type"] = kw.get("type", cls.item_type)
        kw["separator"] = kw.get("separator", cls.item_separator)
        return cls([kw["type"].from_string(m) for m in
                    s.split(kw["separator"])], **kw)

This would result in:

In [56]: Word.from_string("abc-def", separator="-")
Out[56]: [['a', 'b', 'c'], ['d', 'e', 'f']]

In [57]: Word.from_string("abc-def", separator="+")
Out[57]: [['a', 'b', 'c', '-', 'd', 'e', 'f']]

In [58]: Word.from_string("abc+def", separator="+")
Out[58]: [['a', 'b', 'c'], ['d', 'e', 'f']]

In [59]: Word.from_string("abc+def", separator="+", type=SegmentedMorpheme)
Out[59]: [['abc'], ['def']]

In [60]: Word.from_string("abc + d e f", separator=" + ", type=SegmentedMorpheme)
Out[60]: [['abc'], ['d', 'e', 'f']]

In [61]: Word.from_string("abc + d e f", separator=" + ", type=Morpheme)

TypeError: Segments must be non-empty strings without whitespace or "+".

@LinguList
Contributor Author

I'd consider this behaviour as expected and desirable in all cases, also allowing us to handle different applications with the same Word class.

@xrotwang
Contributor

Hm. That doesn't "feel" right. Internally we want morphemes to be exactly the same things - no matter where they came from, no? And even though that could be bolted on here with tailored __eq__ methods or similar, it would still provide potential for confusion.

@xrotwang
Contributor

Then I'd rather replace your type argument with morpheme_factory=Morpheme.from_segmented_string or similar.
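The morpheme_factory idea could look roughly like this (a sketch on simplified classes; Morpheme.from_segmented_string and from_text are the proposed, not yet existing, names):

```python
class Morpheme(list):
    @classmethod
    def from_text(cls, s):
        # text-style input: every character is a segment
        return cls(list(s))

    @classmethod
    def from_segmented_string(cls, s):
        # segmented-style input: whitespace separates segments
        return cls(s.split())


class Word(list):
    @classmethod
    def from_string(cls, s, separator=" + ",
                    morpheme_factory=Morpheme.from_segmented_string):
        # The factory decides how each chunk becomes a Morpheme; the
        # resulting morphemes are the same kind of object either way.
        return cls([morpheme_factory(chunk) for chunk in s.split(separator)])


Word.from_string("p a t + ia")
# [['p', 'a', 't'], ['ia']]
Word.from_string("pat-ia", separator="-", morpheme_factory=Morpheme.from_text)
# [['p', 'a', 't'], ['i', 'a']]
```

This keeps a single Morpheme class internally while letting the caller pick the parsing convention at the call site.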

@LinguList
Contributor Author

I'd be fine with that. The problem is that the use cases are different, and that we have different levels of representation. So we could also argue that we have a Morpheme in one case and something else in the other case.

@LinguList
Contributor Author

To confirm: the different levels of representation are these. In phonetic transcription, we need space-segmented representations of morphemes. In orthography, or in many IGT examples (even when they claim to be phonetic transcriptions), we do not work at this level, so a morpheme is represented by a sequence without spaces. The two types are not comparable, because they represent different levels.

@xrotwang
Contributor

> So we could also argue that we have a Morpheme in one case and something else in the other case.

But having two things of the same class Word where in one case Morphemes are "inside" and in the other case something else seems very opaque.

@LinguList
Contributor Author

So would we want to have two different words then?

@LinguList
Contributor Author

In this case, I would suggest adding something like a WordInText that consists of MorphemeInText, alongside a Word that consists of Morphemes.

@xrotwang
Contributor

No, I'd vote for a single Word class, but Word.from_text would call Morpheme.from_text for each morpheme, while Word.from_segmented_morphemes would call Morpheme.from_segmented_string or similar.

@xrotwang
Contributor

If we don't "reuse" from_string as method name, we could even keep the old implementation, adding a deprecation warning or similar.
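Keeping the old entry point while steering users to the new names could be sketched with the standard warnings module (illustrative class, not the actual linse implementation):

```python
import warnings


class Word(list):
    @classmethod
    def from_string(cls, s, separator=" + "):
        # Old entry point, kept for backwards compatibility but flagged:
        warnings.warn(
            "Word.from_string is deprecated; use from_text or "
            "from_segmented_string instead.",
            DeprecationWarning,
            stacklevel=2)
        return cls([m.split() for m in s.split(separator)])
```

With stacklevel=2 the warning points at the caller's line, which makes migrating old code easier.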

@LinguList
Contributor Author

Yes, that would be easiest, it seems.

@arubehn

arubehn commented Jul 16, 2024

From what I can see, this issue has not been resolved yet, right? I stumbled upon the same unwanted behavior (also using the same workaround that @LinguList proposed here 😄) and wanted to report it, only to find that you had already discussed it here.

If neither of you has worked on a fix yet, I could have a go at it and send a PR later.

@LinguList
Contributor Author

@arubehn and @xrotwang, I have just looked again at the current code, and I think we should discuss what we want as the basic behaviour here, before we modify anything.

My suggestion would be that a "normal" Morpheme should not fall back to list(s) when no whitespace is detected. This is semantically problematic, because whitespace is not the sole criterion for space-segmentation, as cases like ai show. So this part should be modified in favor of a single solution, perhaps with two methods to call:

>>> Morpheme.from_segments("ai")
["ai"]
>>> Morpheme.from_string("ai")
["a", "i"]

The same should hold for the Word class, where we can likewise argue that we have the two sources (segments here refers to the CLDF segmentation practice).

>>> Word.from_segments("ai + a a")
[["ai"], ["a", "a"]]
>>> Word.from_string("ai-aa", separator="-")
[["a", "i"], ["a", "a"]]
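The proposal above can be sketched end to end like this (simplified list subclasses standing in for the linse classes; the factory names and default separators follow this comment):

```python
class Morpheme(list):
    @classmethod
    def from_segments(cls, s):
        # CLDF-style segmented input: whitespace separates segments,
        # so "ai" is one multi-character segment.
        return cls(s.split())

    @classmethod
    def from_string(cls, s):
        # plain-text input: each character is a segment
        return cls(list(s))


class Word(list):
    @classmethod
    def from_segments(cls, s, separator="+"):
        return cls([Morpheme.from_segments(m) for m in s.split(separator)])

    @classmethod
    def from_string(cls, s, separator="-"):
        return cls([Morpheme.from_string(m) for m in s.split(separator)])


Word.from_segments("ai + a a")           # [['ai'], ['a', 'a']]
Word.from_string("ai-aa", separator="-")  # [['a', 'i'], ['a', 'a']]
```

Splitting each chunk with str.split() also conveniently strips the whitespace around the "+" in the segmented case.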

@arubehn

arubehn commented Jul 17, 2024

I agree, it seems reasonable to have two different factory methods for these two use cases.

@LinguList
Contributor Author

And we could also make sure that we set the default separators to - for the from_string case and + for the from_segments case.

@arubehn

arubehn commented Jul 17, 2024

Just to make sure we're on the same page, the second use case would be representations such as ge-gang-en, right? In that case, this sounds like a good solution to me.

@LinguList
Contributor Author

At the moment, the item_separator cannot be modified in linse. So we would have to make sure that the separator can be modified from the call. Right now, we can do the following:

>>> w = Word([["a"], ["b"]], separator=" _ ")
>>> print(w)
a _ b

But not the following:

>>> w = Word(separator=" _ ").from_string("b a _ a u")

That is: the call goes through, but the separator used is still " + ".

Ideally, the call should be:

>>> word.from_string(STRING, separator=X)

@LinguList
Contributor Author

Would you like to make a proposal in this regard, @arubehn?

@arubehn

arubehn commented Jul 17, 2024

Sure!

@xrotwang
Contributor

> At the moment, the item_separator cannot be modified in linse. So we would have to make sure that the separator can be modified from the call. Right now, we can do the following:
>
> >>> w = Word([["a"], ["b"]], separator=" _ ")
> >>> print(w)
> a _ b
>
> But not the following:
>
> >>> w = Word(separator=" _ ").from_string("b a _ a u")
>
> That is: we can do it, but the separator is " + ".
>
> Ideally, the call should be:
>
> >>> word.from_string(STRING, separator=X)

I think we should make sure that particularities of the separator are strictly kept to input/output functionality. I wouldn't want the Word instance to store the separator that was used when the word was instantiated. I.e. internally, a Word is just a list of morphemes where a morpheme is a list of segments.
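This separation of concerns can be sketched as follows (illustrative code: the separator appears only in parsing and serialisation, never as instance state; as_string is a hypothetical name):

```python
class Word(list):
    """Internally just a list of morphemes (lists of segment strings);
    no separator is stored on the instance."""

    @classmethod
    def from_string(cls, s, separator=" + "):
        # the separator is consumed here and then forgotten
        return cls([m.split() for m in s.split(separator)])

    def as_string(self, separator=" + "):
        # the output separator is chosen at serialisation time,
        # independently of whatever the input used
        return separator.join(" ".join(m) for m in self)


w = Word.from_string("b a _ a u", separator=" _ ")
w.as_string()               # 'b a + a u'
w.as_string(separator="-")  # 'b a-a u'
```

A nice consequence: two Words parsed from strings with different separators compare equal, since only the nested-list content matters.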

@xrotwang
Contributor

So I'd consider

>>> w = Word([["a"], ["b"]], separator=" _ ")
>>> print(w)
a _ b

to be a bug.

@xrotwang
Contributor

From my experience with LingPy, having these intuitions/assumptions about string representations of data structures permeate the code base is really difficult - and in particular difficult to get rid of later.

@arubehn

arubehn commented Jul 17, 2024

@xrotwang so in other words, you are saying that the item_separator of each class should be a constant, independent of which separator is used in the input? So, the code should behave somewhat like this:

w = Word.from_segments("a b _ c", separator="_")
str(w)
>>> 'a b + c'

@arubehn

arubehn commented Jul 17, 2024

i.e. the separator parameter is only used locally in the method and does not overwrite cls.item_separator?

@xrotwang
Contributor

> i.e. the separator parameter is only used locally in the method and does not overwrite cls.item_separator?

Yes, that would be my preference. I.e. cls.item_separator is just the default. Maybe it should not even be a class attribute but a module constant that is used as the default in method signatures.
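The module-constant variant could look like this (hypothetical constant names; the behaviour matches arubehn's 'a b + c' example above):

```python
# Module-level defaults, used only in method signatures,
# never stored on instances or overwritten on classes:
MORPHEME_SEPARATOR = " + "
SEGMENT_SEPARATOR = " "


class Word(list):
    @classmethod
    def from_segments(cls, s, separator=MORPHEME_SEPARATOR):
        # a caller-supplied separator stays local to this call
        return cls([m.split(SEGMENT_SEPARATOR) for m in s.split(separator)])

    def __str__(self):
        # output always falls back to the module defaults,
        # regardless of what the input looked like
        return MORPHEME_SEPARATOR.join(
            SEGMENT_SEPARATOR.join(m) for m in self)


str(Word.from_segments("a b _ c", separator=" _ "))  # 'a b + c'
```

So parsing with an exotic separator leaves no trace: serialisation is always canonical.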

@LinguList
Contributor Author

Okay, but then we must have two instances of a Word.

@LinguList
Contributor Author

We have a word that is the one that we find in LingPy's wordlists and in CLDF Wordlists, and we have a word that is found in IGT.

@LinguList
Contributor Author

Ah, wait. If we agree on + as the standard separator, I am also fine. But please note that this may have an impact on phrases and IGT representations, which we also wanted to cover. This was the reason the discussion emerged in the first place: we assume that we list() a string if we don't find whitespace, a behaviour that is not fully in line with the specification of a segmented string as we normally find it for Segments in CLDF Wordlists, right?

@LinguList
Contributor Author

So if we say that, apart from Segments, another type of representation must be covered in from_string, I'd just modify the current Morpheme.from_string method to not use list() when no whitespace is detected, since that behaviour is also very implicit.

@arubehn

arubehn commented Jul 17, 2024

Well, from what I see in the code, the list(s) was supposed to be replaced by a proper segmentation method later on anyway:

            if re.search(r'\s+', s):
                # We assume that s is a whitespace-separated list of segments:
                s = s.split()
            else:
                #
                # FIXME: do segmentation here!
                #
                s = list(s)

@LinguList
Contributor Author

@arubehn, if you replace s = list(s) by s = [s], it should work in the use cases we have discussed so far, but would probably lead to unwanted behaviour in tests further down on IGT examples.
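For illustration, the suggested change in the context of the snippet quoted above would read like this (a standalone sketch of the parsing branch only, not a patch to the actual method):

```python
import re


def parse_morpheme(s):
    # Sketch of the discussed change to the whitespace fallback:
    if re.search(r'\s+', s):
        # whitespace-separated list of segments
        return s.split()
    # previously: list(s), i.e. "ia" -> ['i', 'a']
    # suggested:  [s],     i.e. "ia" -> ['ia'], one multi-character segment
    return [s]
```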

@arubehn

arubehn commented Jul 17, 2024

@LinguList I did not really touch the from_string method, but implemented a parallel factory method from_segments instead. I thought that was what we agreed upon.

For IGT applications, I think it does make sense to perform some segmentation under the hood (even if it's just a naive character-by-character separation). So I think it makes sense to have two separate methods for these two use cases.

Likewise for the output, we could easily implement methods that allow for the generation of IGT-style text again, with arguments specifying which separator should be used, whether and how whitespace should be inserted, etc. That way, the item_separator attribute would just be an internal default, and applications of linse would not be bound to the choice of those symbols.

@xrotwang
Contributor

Unrelatedly, the issue of applying orthography profiles to IGT data just came up. Sebastian Nordhoff came across a case where - was used for the glottal stop, for whatever reason. My first reaction was to bring up orthography profiles.

@arubehn

arubehn commented Jul 17, 2024

On the same note, I am also not sure if we want to stick to the assumption that + can never be a segment:

class Segment(str):
    """
    A segment is a non-empty string which does not contain punctuation.
    """
    def __new__(cls, s):
        if not isinstance(s, str) or re.search(r'\s+', s) or '+' in s or not s:
            raise TypeError('Segments must be non-empty strings without whitespace or "+".')
        return str.__new__(cls, s)

I would bet that there is some dataset out there that uses + for a certain sound 😄

@LinguList
Contributor Author

I think we can do so for now.

@LinguList
Contributor Author

I don't have time to review now, @arubehn, but would try to have a look later. @xrotwang, would you also agree with the two methods from_segments and from_string for now?

@xrotwang
Contributor

Two methods sounds good. I should have time for review later today.

@LinguList
Contributor Author

Okay :-)

Development

No branches or pull requests

3 participants