Monosegmental Morphemes #31
No, I think there's no way around this. Arguably, it's rather a bug that
We could have multiple factory methods to distinguish:
I see that we have two different morpheme representations, which we also assume are frequently used:
I think that's why I was hoping we could switch to the unambiguous "nested lists" representation. If we had different factory methods "from_text" (parses stuff as it appears in text, e.g. whitespace separates words) and "from_segmented_string" (recognizes
The good thing even now is that the workaround shown above also works and is not really tedious.
So I think I can perfectly live with different factory methods. This would mean, however, that
You mean the behaviour of my proposed
I was still thinking of the original problem that I brought up, and here I wanted to say that if we leave it as is, the behavior is still unexpected, so it would require a change, since the "split on whitespace, if no whitespace make a list" logic seems problematic, right? And if that change consists in adding different methods for initialization from different kinds of strings, I am completely fine.
Ok, yes, different methods should be a good - if backwards-incompatible - solution. But that's why we went with v0.1, I guess :)
Yes :-)
The other solution would be to subclass morphemes differently, based on their purpose, and use different factory methods. Right now, we have:

```python
>>> Word.from_string("abc-def", separator="-")
[['a', 'b', 'c', '-', 'd', 'e', 'f']]
>>> TypedSequence.from_string("abc-def", separator="-", type=Segment)
['abc', 'def']
>>> TypedSequence.from_string("abc-def", separator="-", type=Morpheme)
[['a', 'b', 'c'], ['d', 'e', 'f']]
```

I think for broader use cases, we would like to
But then, we will anyway need targeted from_string methods. What would be nice, in my opinion, would be:

```python
>>> Word.from_X("a b c + ei")
[["a", "b", "c"], ["ei"]]
>>> Word.from_Y("abc-ei", separator="-")
[["a", "b", "c"], ["e", "i"]]
```

This would then also cover the two major use cases of IGT vs. Wordlists, right?
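A minimal sketch of what such a pair of targeted factory methods could look like. The names `from_text` and `from_segmented` are hypothetical stand-ins for the `from_X`/`from_Y` above, and a plain `list` subclass stands in for the library's `TypedSequence`-based `Word`:

```python
class Word(list):
    """Toy stand-in for the library's Word class: a list of morphemes,
    each morpheme being a list of segment strings."""

    @classmethod
    def from_text(cls, s, separator="+"):
        # Orthographic input (Wordlist-style): split morphemes on the
        # separator, then segment each morpheme character by character.
        return cls([list(m) for m in s.split(separator)])

    @classmethod
    def from_segmented(cls, s, separator="+"):
        # Pre-segmented input (IGT-style): whitespace already marks
        # segment boundaries, the separator marks morpheme boundaries.
        return cls([m.split() for m in s.split(separator)])


print(Word.from_segmented("a b c + ei"))        # [['a', 'b', 'c'], ['ei']]
print(Word.from_text("abc-ei", separator="-"))  # [['a', 'b', 'c'], ['e', 'i']]
```

With two entry points, the ambiguity of a single `from_string` disappears: the caller states which level of representation the input string encodes.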
I think the following code would account for the problems:

```python
class Morpheme(TypedSequence):
    item_type = Segment
    item_separator = None

    @classmethod
    def from_string(cls, s, **kw):
        return cls(list(s))


class SegmentedMorpheme(TypedSequence):
    item_type = Segment
    item_separator = " "

    @classmethod
    def from_string(cls, s, **kw):
        return cls(s.split(cls.item_separator))


class Word(TypedSequence):
    item_type = Morpheme
    item_separator = " + "

    @classmethod
    def from_string(cls, s, **kw):
        kw["type"] = kw.get("type", cls.item_type)
        kw["separator"] = kw.get("separator", cls.item_separator)
        return cls(
            [kw["type"].from_string(m) for m in s.split(kw["separator"])],
            **kw)
```

This would result in:

```python
In [56]: Word.from_string("abc-def", separator="-")
Out[56]: [['a', 'b', 'c'], ['d', 'e', 'f']]

In [57]: Word.from_string("abc-def", separator="+")
Out[57]: [['a', 'b', 'c', '-', 'd', 'e', 'f']]

In [58]: Word.from_string("abc+def", separator="+")
Out[58]: [['a', 'b', 'c'], ['d', 'e', 'f']]

In [59]: Word.from_string("abc+def", separator="+", type=SegmentedMorpheme)
Out[59]: [['abc'], ['def']]

In [60]: Word.from_string("abc + d e f", separator=" + ", type=SegmentedMorpheme)
Out[60]: [['abc'], ['d', 'e', 'f']]

In [61]: Word.from_string("abc + d e f", separator=" + ", type=Morpheme)
TypeError: Segments must be non-empty strings without whitespace or "+".
```
I'd consider this behaviour expected and desirable in all cases, also allowing us to handle different applications with the same Word class.
Hm. That doesn't "feel" right. Internally, we want morphemes to be exactly the same things - no matter where they came from, no? And even though that could be bolted on here with tailored
Then I'd rather replace your
I'd be fine with that. The problem is that the use cases are different, and that we have different levels of representation. So we could also argue that we have a Morpheme in one case and something else in the other case.
To confirm: the different levels of representation are that in phonetic transcription, we would need space-segmented representations of morphemes, while in orthography or in many IGT examples - even if they claim to be phonetic transcriptions - we do not look at this level, so we have non-space sequences representing a morpheme. The two types are not comparable, due to the level which they represent.
But having two things of the same class
So would we want to have two different words then?
In this case, I would suggest adding something like a WordInText that would consist of MorphemeInText, and a Word that consists of Morphemes.
No, I'd vote for a single
If we don't "reuse"
Yes, that would be easiest, it seems.
From what I can see, this issue has not been resolved yet, right? I stumbled upon the same unwanted behavior (also using the same workaround that @LinguList proposed here 😄) and wanted to report it, when I found that you guys have already discussed it here. If neither of you has worked on a fix yet, I could have a go at it and send a PR later.
@arubehn and @xrotwang, I have just looked again at the current code, and I think we should discuss what we want as the basic behaviour here before we modify anything. My suggestion would be to say that a "normal" Morpheme should not use

```python
>>> Morpheme.from_segments("ai")
["ai"]
>>> Morpheme.from_string("ai")
["a", "i"]
```

The same should hold for the Word class, where we could also argue we have the two sources (

```python
>>> Word.from_segments("ai + a a")
[["ai"], ["a", "a"]]
>>> Word.from_string("ai-aa", separator="-")
[["a", "i"], ["a", "a"]]
```
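A minimal sketch of the proposed `from_segments`/`from_string` split, with plain `list` subclasses standing in for the library's `TypedSequence`-based classes (the real implementations would also validate segments):

```python
class Morpheme(list):
    """Toy stand-in: a morpheme is a list of segment strings."""

    @classmethod
    def from_segments(cls, s):
        # Space-separated segments are taken as-is: "ai" -> ["ai"].
        return cls(s.split())

    @classmethod
    def from_string(cls, s):
        # Plain text is segmented character by character: "ai" -> ["a", "i"].
        return cls(list(s))


class Word(list):
    """Toy stand-in: a word is a list of morphemes."""

    @classmethod
    def from_segments(cls, s, separator="+"):
        return cls([Morpheme.from_segments(m) for m in s.split(separator)])

    @classmethod
    def from_string(cls, s, separator="-"):
        return cls([Morpheme.from_string(m) for m in s.split(separator)])


print(Word.from_segments("ai + a a"))           # [['ai'], ['a', 'a']]
print(Word.from_string("ai-aa", separator="-")) # [['a', 'i'], ['a', 'a']]
```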
I agree, it seems reasonable to have two different factory methods for these two use cases.
And we could also make sure that we set the default separators to
Just to make sure we're on the same page: the second use case would be representations such as ge-gang-en, right? In that case, this sounds like a good solution to me.
At the moment, the following works:

```python
>>> w = Word([["a"], ["b"]], separator=" _ ")
>>> print(w)
a _ b
```

But not the following:

```python
>>> w = Word(separator=" _ ").from_string("b a _ a u")
```

That is: we can do it, but the separator is " + ". Ideally, the call should be:

```python
>>> word.from_string(STRING, separator=X)
```
Would you like to make a proposal in this regard, @arubehn?
Sure!
I think we should make sure that particularities of the separator are strictly kept to input/output functionality. I wouldn't want the
So I'd consider

```python
>>> w = Word([["a"], ["b"]], separator=" _ ")
>>> print(w)
a _ b
```

to be a bug.
From my experience with LingPy, having these intuitions/assumptions about string representations of data structures permeate the code base is really difficult - and in particular, difficult to get rid of later.
@xrotwang so in other words, you are saying that the

```python
>>> w = Word.from_segments("a b _ c", separator="_")
>>> str(w)
'a b + c'
```
i.e. the
Yes, that would be my preference. I.e.
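A sketch of that preference: input separators are consumed at parse time and never stored, so `str()` always uses canonical separators (`" + "` between morphemes, `" "` between segments, as assumed in the examples above; the stand-in class is hypothetical):

```python
MORPHEME_SEP = " + "  # canonical separator between morphemes
SEGMENT_SEP = " "     # canonical separator between segments


class Word(list):
    """Toy stand-in: the input separator only steers parsing; the
    internal representation and __str__ are separator-free/canonical."""

    @classmethod
    def from_segments(cls, s, separator="+"):
        # The separator argument is used here and then forgotten.
        return cls([m.split() for m in s.split(separator)])

    def __str__(self):
        return MORPHEME_SEP.join(SEGMENT_SEP.join(m) for m in self)


w = Word.from_segments("a b _ c", separator="_")
print(str(w))  # a b + c
```

Whatever separator the input used, round-tripping through `str()` yields the canonical form, which keeps string-representation assumptions out of the data structure itself.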
Okay, but then we must have two instances of a Word.
We have a word that is the one that we find in LingPy's wordlists and in CLDF Wordlists, and we have a word that is found in IGT.
Ah, wait. If we agree that we set the standard to
So if we say that apart from
Well, from what I see in the code, the

```python
if re.search(r'\s+', s):
    # We assume that s is a whitespace-separated list of segments:
    s = s.split()
else:
    #
    # FIXME: do segmentation here!
    #
    s = list(s)
```
@arubehn, if you replace
@LinguList I did not really touch the
For IGT applications, I think it does make sense to perform some segmentation under the hood (even if it's just a naive character-by-character separation). So I think it makes sense to have two separate methods for these two use cases. Likewise for the output, we could easily implement methods that allow for the generation of IGT-style text again, by passing arguments for which separator should be used and whether and how whitespace should be inserted, etc. That way, the
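A sketch of such an output method: separator choices live only in the serializer, not in the data structure. The name `to_text` and its parameters are hypothetical, with a plain `list` subclass standing in for the library's `Word`:

```python
class Word(list):
    """Toy stand-in: a word is a list of morphemes, each morpheme
    a list of segment strings."""

    def to_text(self, morpheme_separator="-", segment_separator=""):
        # All formatting decisions are made here, at output time;
        # the internal nested-list representation stays separator-free.
        return morpheme_separator.join(
            segment_separator.join(m) for m in self)


w = Word([["g", "e"], ["g", "a", "n", "g"], ["e", "n"]])
print(w.to_text())                       # ge-gang-en
print(w.to_text(morpheme_separator=" + ",
                segment_separator=" "))  # g e + g a n g + e n
```

The same internal object can thus be rendered as orthographic IGT-style text or as a space-segmented wordlist form, without two Word classes.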
Unrelatedly, the issue of applying orthography profiles to IGT data just came up. Sebastian Nordhoff came across a case where
On the same note, I am also not sure if we want to stick to the assumption that

```python
class Segment(str):
    """
    A segment is a non-empty string which does not contain punctuation.
    """
    def __new__(cls, s):
        if not isinstance(s, str) or re.search(r'\s+', s) or '+' in s or not s:
            raise TypeError('Segments must be non-empty strings without whitespace or "+".')
        return str.__new__(cls, s)
```

I would bet that there is some dataset out there that uses
I think we can do so for now.
Two methods sound good. I should have time for a review later today.
Okay :-)
If I want to read a word from a string now, the following situation will yield an unwanted result:
Instead, what I would expect would be:
A simple solution would be to add another subclass and to specify words with specific morphemes. But it also seems odd to have the following situation now with the Morpheme class:
I remember we discussed this, but I wonder if there's a way to pass an argument to the from_string method to allow for a Morpheme that contains one segment consisting of multiple characters?