Readme last updated: March 23, 2020
SigmaPie for subregular and subsequential grammar induction
pip install sigmapie
import sigmapie
# learning UTP pattern with an SP grammar
language = sigmapie.SP(polar="n")
language.k = 3
language.data = ["HHLLL", "LLHHH", "LHHHLL", "LLL", "HHH"]
language.alphabet = ["H", "L"]
language.learn()
language.grammar # [('H', 'L', 'H')]
language.scan("HHLLLL") # True
language.scan("HLLH") # False
language.generate_sample(n = 3) # ['LLHH', 'HL', 'HHHL']
This toolkit is relevant for anyone who works, or plans to work, with subregular grammars, whether from the perspective of theoretical linguistics or of formal language theory.
Why might theoretical linguists be interested in formal language theory?
Formal language theory explains how potentially infinite string sets, or formal languages,
can be generalized to grammars that encode the desired patterns, and what properties those
grammars have. It also allows one to compare different grammars with respect to parameters such as expressivity.
The Chomsky hierarchy aligns the main classes of formal languages with respect to their expressive power (Chomsky 1959).
- Regular grammars are as powerful as finite-state devices or regular expressions: they can "count" only up to a certain threshold (no a^n b^n patterns);
- Context-free grammars have access to a potentially infinite stack that allows them to reproduce patterns involving center embedding;
- Mildly context-sensitive grammars are powerful enough to handle cross-serial dependencies such as some types of copying;
- Context-sensitive grammars can handle non-linear patterns such as a^(2^n) for n > 0;
- Recursively enumerable grammars are as powerful as any theoretically possible computer and generate languages such as a^n, where n is a prime number.
Both phonology and morphology frequently display properties of regular languages.
Phonology does not require the power of center-embedding, which is a property of context-free languages. For example, consider a harmony where the first vowel agrees with the last vowel, the second vowel agrees with the second-to-last one, and so on.
GOOD: "arugula", "tropicalization", "electrotelethermometer", etc.
BAD: any other word violating the rule.
While it is a theoretically possible pattern, harmonies of that type are unattested in natural languages.
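To make this hypothetical pattern concrete, the check it would require can be written down directly: every vowel has to agree with its counterpart from the opposite end of the word. The sketch below is purely illustrative (it is not part of SigmaPie) and assumes a toy five-vowel inventory.
VOWELS = set("aeiou")  # assumed toy vowel inventory

def mirror_harmonic(word):
    # The i-th vowel must match the i-th vowel counted from the end.
    vowels = [ch for ch in word if ch in VOWELS]
    return vowels == vowels[::-1]

print(mirror_harmonic("arugula"))          # True: a-u-u-a
print(mirror_harmonic("tropicalization"))  # True: o-i-a-i-a-i-o
print(mirror_harmonic("formal"))           # False: o-a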
Morphology avoids center-embedding as well. In Aksënova et al. (2016) we show that it is possible to iterate prefixes with the meaning "after" in Russian. In Ilocano, where the same semantics is expressed via a circumfix, its iteration is prohibited.
RUSSIAN: "zavtra" (tomorrow), "posle-zavtra" (the day after tomorrow),
"posle-posle-zavtra" (the day after the day after tomorrow), ...
ILOCANO: "bigat" (morning), "ka-bigat-an" (the next morning),
<*>"ka-ka-bigat-an-an" (the morning after the next one).
Moreover, a typological review of attested patterns shows that phonology and morphology do not require the full power of regular languages. As an example of an unattested pattern, Heinz (2011) describes a language where a word must have an even number of vowels to be well-formed.
Regular languages can be further subdivided into another nested hierarchy of languages of decreasing expressive power: the subregular hierarchy. One of the most important characteristics of subregular languages is that they are learnable from positive data alone; more powerful classes require negative input as well.
However, when there is a mapping, i.e. a correspondence of an input form to its output counterpart, one can use subregular mappings. In linguistics, mapping underlying representations (URs) to surface forms (SFs) is indeed a frequent task. One way to encode such a mapping rule is with a transducer. A class of less-than-regular, or subregular, transducers is the class of subsequential transducers. They read input strings deterministically, symbol by symbol, and output the corresponding part of the output string.
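As a minimal illustration of the idea (a hand-written sketch, not SigmaPie's OSTIA machinery shown later), a subsequential transducer can be represented by deterministic transitions that emit output while reading, plus a state output that is appended when the input ends. The toy machine below devoices a word-final "b" in the {a, b, p} encoding used later in this readme.
# A toy subsequential transducer for word-final devoicing over {a, b, p}:
# a 'b' is buffered until we know whether it is word-final; the state
# output appends 'p' if a 'b' is still pending at the end of the input.
TRANSITIONS = {          # (state, input symbol) -> (output, next state)
    ("q0", "a"): ("a", "q0"),
    ("q0", "p"): ("p", "q0"),
    ("q0", "b"): ("", "qb"),   # hold the 'b'
    ("qb", "a"): ("ba", "q0"), # the held 'b' was not final
    ("qb", "p"): ("bp", "q0"),
    ("qb", "b"): ("b", "qb"),  # flush one 'b', hold the new one
}
STATE_OUTPUT = {"q0": "", "qb": "p"}  # a pending word-final 'b' surfaces as 'p'

def rewrite(word):
    state, out = "q0", []
    for ch in word:
        emitted, state = TRANSITIONS[(state, ch)]
        out.append(emitted)
    out.append(STATE_OUTPUT[state])
    return "".join(out)

print(rewrite("abab"))   # abap
print(rewrite("ababa"))  # ababa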
The SigmaPie toolkit currently contains functionality for the following subregular language and grammar classes:
- strictly piecewise (SP);
- strictly local (SL);
- tier-based strictly local (TSL);
- multiple tier-based strictly local (MTSL).
Additionally, it implements the OSTIA algorithm that learns subsequential mappings following Oncina, Garcia and Vidal (1993) and Colin de la Higuera (2014).
- Learners extract grammars from string sets.
- Scanners evaluate strings with respect to a given grammar.
- Sample generators generate stringsets for a given grammar.
- FSM constructors translate subregular grammars to finite state machines.
- Polarity converters switch negative grammars to positive, and vice versa.
Negative strictly piecewise (SP) grammars prohibit the occurrence of sequences of symbols at an arbitrary distance from each other.
Every SP grammar is associated with a value k that defines the size of the longest sequence this grammar can prohibit. Alternatively, if the grammar is positive, it lists all subsequences that are allowed in well-formed words of the language.
k = 2
POLARITY: negative
GRAMMAR: ab, ba
LANGUAGE: accaacc, cbccc, cccacaaaa, ...
*accacba, *bcccacbb, *bccccccca, ...
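Evaluating a string against such a grammar only requires collecting its subsequences of length up to k and checking them against the grammar. The snippet below is an illustrative sketch of the idea, not SigmaPie's implementation; the SP.scan method shown later does this for you.
from itertools import combinations

# Illustrative scan against a negative 2-SP grammar.
def sp_scan_negative(string, grammar, k=2):
    subseqs = {sub for r in range(1, k + 1)
               for sub in combinations(string, r)}
    return not any(tuple(g) in subseqs for g in grammar)

grammar = ["ab", "ba"]
print(sp_scan_negative("accaacc", grammar))  # True
print(sp_scan_negative("accacba", grammar))  # False: contains "a...b"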
In phonology, an example of an SP pattern is tone plateauing, discussed in Jardine (2015, 2016). For example, in Luganda (Bantu), a low tone (L) cannot intervene between two high tones (H): L is changed to H in such configurations. The prosodic domain cannot contain more than one stretch of H tones.
Luganda verb and noun combinations (Hyman and Katamba (2010), cited by Jardine (2016))
- /tw-áa-mú-láb-a, walúsimbi/ ---> tw-áá-mu-lab-a, walúsimbi
  'we saw him, Walusimbi' (HHLLL, LHLL)
- /tw-áa-láb-w-a walúsimbi/ ---> tw-áá-láb-wá wálúsimbi
  'we were seen by Walusimbi' (HHHHHHLL)
- /tw-áa-láb-a byaa=walúsimbi/ ---> tw-áá-láb-á byáá-wálúsimbi
  'we saw those of Walusimbi' (HHHHHHHHLL)
This pattern can be described using the SP grammar *HLH.
Let us say that luganda represents a "toy" example of the tone plateauing (TP) pattern.
luganda = ["LLLL", "HHLLL", "LHHHLL", "LLLLHHHH"]
Our goal will be to learn the SP generalization behind TP.
Negative and positive SP grammars are implemented as the class SP(). The next code cell initializes a positive SP grammar tp_pattern.
tp_pattern = SP()
Attributes of the SP class:
- polar ("p" or "n") is the polarity of the grammar; by default it is "p";
- alphabet (list) is the set of symbols that the grammar uses;
- grammar (list of tuples) is the list of allowed or prohibited substructures of the language;
- k (int) is the size of the locality window of the grammar; by default it is 2;
- data (list of strings) is the learning sample;
- fsm (FSM object) is the finite-state device that corresponds to the grammar.
The initial step is to define the training sample and the alphabet.
tp_pattern.data = luganda
tp_pattern.alphabet = ["H", "L"]
By default, the locality window of the grammar is 2.
print("Locality of the SP grammar:", tp_pattern.k)
SP attributes can be accessed directly. For example, let us change the locality window from 2 to 3:
tp_pattern.k = 3
print("Locality of the SP grammar:", tp_pattern.k)
Methods of the SP class:
- check_polarity() and switch_polarity() display and change the polarity of the grammar;
- learn() extracts prohibited or allowed subsequences from the training sample;
- scan(string) tells if a given string is well-formed with respect to the learned grammar;
- extract_alphabet() collects the alphabet based on the data or grammar attributes;
- generate_sample(n, repeat) generates n strings based on the given grammar; by default, repeat is set to False, so repetitions of the generated strings are not allowed, but this parameter can be set to True;
- fsmize() creates the corresponding FSM family by following the steps outlined in Heinz and Rogers (2013);
- subsequences(string) returns all k-piecewise subsequences of the given string;
- generate_all_ngrams() generates all possible strings of length k based on the provided alphabet.
Checking and changing the polarity of the grammar
By default, the grammars are positive. The polarity can be checked by running the check_polarity
method:
print("Polarity of the grammar:", tp_pattern.check_polarity())
If the polarity needs to be changed, this can be done using the switch_polarity
method. It will automatically switch the grammar, if one is provided or already extracted, to the opposite one.
tp_pattern.switch_polarity()
print("Polarity of the grammar:", tp_pattern.check_polarity())
Learning the SP grammar
The learn method extracts allowed or prohibited subsequences from the learning sample based on the polarity of the grammar and the locality window.
tp_pattern.learn()
print("Extracted grammar:", tp_pattern.grammar)
Indeed, it learned the TP pattern!
Scanning strings and telling if they are part of the language
The scan method takes a string as input and returns True or False depending on the well-formedness of the given string with respect to the encoded grammar.
tp = ["HHHLLL", "L", "HHL", "LLHLLL"]
no_tp = ["LLLLHLLLLH", "HLLLLLLH", "LLLHLLLHLLLHL"]
print("Tonal plateauing:")
for string in tp:
print("String", string, "is in L(G):", tp_pattern.scan(string))
print("\nNo tonal plateauing:")
for string in no_tp:
print("String", string, "is in L(G):", tp_pattern.scan(string))
Generating a data sample
Based on the learned grammar, a data sample of the desired size can be generated.
sample = tp_pattern.generate_sample(n = 10)
print("Sample:", sample)
Extracting subsequences
Finally, the toolkit can also be used to extract subsequences from an input word by feeding it to the subsequences method.
tp_pattern.k = 3
print("k = 3:", tp_pattern.subsequences("regular"), "\n")
tp_pattern.k = 5
print("k = 5:", tp_pattern.subsequences("regular"))
While SP languages capture multiple long-distance processes such as tone plateauing or some harmonies, they are unable to encode blocking effects or purely local processes.
Negative strictly local (SL) grammars prohibit the occurrence of consecutive substrings consisting of up to k symbols. The value of k defines the longest substring that cannot be present in a well-formed string of the language. Positive SL grammars define the substrings that can be present in the language.
Importantly, to define first and last elements, SL languages use delimiters (">" and "<") that indicate the beginning and the end of the string.
k = 2
POLARITY: positive
GRAMMAR: >a, ab, ba, b<
LANGUAGE: ab, abab, abababab, ...
*babab, *abaab, *bababba, ...
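Evaluation against an SL grammar is equally simple: pad the string with the delimiters and check that every bigram is licensed (for a positive grammar). The snippet below is an illustrative sketch; SigmaPie's SL.scan method, shown later, does this for you.
# Illustrative scan against a positive 2-SL grammar with edge markers.
def sl_scan_positive(string, grammar, left=">", right="<"):
    padded = left + string + right
    bigrams = {padded[i:i + 2] for i in range(len(padded) - 1)}
    return bigrams.issubset(set(grammar))

grammar = [">a", "ab", "ba", "b<"]
print(sl_scan_positive("abab", grammar))   # True
print(sl_scan_positive("abaab", grammar))  # False: contains "aa"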
In phonology, changes very frequently involve adjacent segments, so the notion of locality is extremely important. A discussion of local processes in phonology can be found in Chandlee (2014).
Russian word-final devoicing
In Russian, the final obstruent of a word cannot be voiced.
- "lug" [luK] meadow ---> "lug-a" [luGa] of the meadow
- "luk" [luK] onion ---> "luk-a" [luKa] of the onion
- "porog" [paroK] doorstep ---> "porog-a" [paroGa] of the doorstep
- "porok" [paroK] vice ---> "porok-a" [paroKa] of the vice
Assume a toy dataset with the following encoding:
- "a" stands for a vowel;
- "b" stands for a voiced obstruent;
- "p" stands for any other consonant.
russian = ["", "ababa", "babbap", "pappa", "pabpaapba", "aap"]
In these terms, the Russian word-final devoicing generalization is "do not have 'b' at the end of the word".
This pattern can then be described using the SL grammar *b<.
Let us initialize an SL object.
wf_devoicing = SL()
wf_devoicing.data = russian
Attributes of the SL class:
- polar ("p" or "n") is the polarity of the grammar; by default it is "p";
- alphabet (list) is the set of symbols that the grammar uses;
- grammar (list of tuples) is the list of allowed or prohibited substructures of the language;
- k (int) is the size of the locality window of the grammar; by default it is 2;
- data (list of strings) is the learning sample;
- edges (list of two characters) are the delimiters used by the grammar; the default values are ">" and "<";
- fsm (FSM object) is the finite-state device that corresponds to the grammar.
Methods of the SL class:
- check_polarity() and switch_polarity() display and change the polarity of the grammar;
- learn() extracts prohibited or allowed substrings from the training sample;
- scan(string) tells if a given string is well-formed with respect to the learned grammar;
- extract_alphabet() collects the alphabet based on the data or grammar;
- generate_sample(n, repeat) generates n strings based on the given grammar; by default, repeat is set to False, so repetitions of the generated strings are not allowed, but this parameter can be set to True;
- fsmize() creates the corresponding FSA;
- clean_grammar() removes useless k-grams from the grammar.
Extracting alphabet and learning SL grammar
As before, learn() extracts dependencies from the data: it simply extracts k-grams of the indicated size from the data, and the default value of k is 2.
wf_devoicing.learn()
print("The grammar is", wf_devoicing.grammar)
In order to automatically extract the alphabet from the data, it is possible to run extract_alphabet()
.
print("The original value of the alphabet is", wf_devoicing.alphabet)
wf_devoicing.extract_alphabet()
print("The modified value of the alphabet is", wf_devoicing.alphabet)
Changing polarity of the grammar
The grammar output above is positive. If we want to capture the pattern using restrictions rather than allowed substrings, we can apply switch_polarity() to the grammar:
wf_devoicing.switch_polarity()
print("The grammar is", wf_devoicing.grammar)
Scanning strings
As before, the scan(string) method returns True or False depending on the well-formedness of the given string with respect to the learned grammar.
wfd = ["apapap", "papa", "abba"]
no_wfd = ["apab", "apapapb"]
print("Word-final devoicing:")
for string in wfd:
print("String", string, "is in L(G):", wf_devoicing.scan(string))
print("\nNo word-final devoicing:")
for string in no_wfd:
print("String", string, "is in L(G):", wf_devoicing.scan(string))
Generating data samples
If the grammar is non-empty, a data sample can be generated in the same way as before: generate_sample(n, repeat), where n is the number of examples to generate, and repeat is a flag allowing or prohibiting repetitions of the same strings in the generated data.
sample = wf_devoicing.generate_sample(5, repeat = False)
print(sample)
Cleaning grammar
A grammar provided by the user can potentially contain "useless" k-grams. For example, consider the following grammar:
sl = SL()
sl.grammar = [(">", "a"), ("b", "a"), ("a", "b"), ("b", "<"),
(">", "g"), ("f", "<"), ("t", "t")]
sl.alphabet = ["a", "b", "g", "f", "t"]
This grammar contains 3 useless bigrams:
- (">", "g") can never be used because nothing can follow "g";
- ("f", "<") is useless because there is no way to start a string that would lead to "f";
- ("t", "t") has both of the problems listed above.
The clean_grammar() method removes such useless k-grams:
print("Old grammar:", sl.grammar)
sl.clean_grammar()
print("Clean grammar:", sl.grammar)
Even though SP and SL languages can capture a large portion of phonological well-formedness conditions, there are numerous examples of patterns that require increased complexity. For example, harmony with a blocking effect cannot be captured using SP grammars because they will "miss" a blocker, and cannot be encoded via SL grammars because they cannot be used for long-distance processes.
Tier-based strictly local (TSL) grammars operate just like strictly local ones, but they have the power to ignore a certain set of symbols completely. The symbols that are not ignored are called tier symbols, and the ones that do not matter for the well-formedness of strings are non-tier symbols (Heinz et al. 2011).
Assume that we have the following sets of tier and non-tier symbols.
tier = ["l", "r"]
non_tier = ["c", "d"]
Non-tier symbols are ignored when the strings are being evaluated by TSL grammars, so the alphabets tier
and non_tier
define the following mapping:
- lccrdclcddcrlc ---> lrlrl
- rldcdrccldcdrdl ---> rlrlrl
- cdcddcdcdcdc ---> e (the empty string)
The strings on the right-hand side are called tier images of the original strings because they exclude all non-tier symbols. Then the TSL grammars can be defined as SL grammars that operate on a tier.
Continuing the example above, let's prohibit "l" from following "l" unless "r" intervenes, and also ban "r" from following "r" unless "l" intervenes (this yields a toy Latin dissimilation pattern). Over the tier, the negative TSL grammar with the restrictions *ll and *rr expresses this rule.
Intuitively, TSL grammars make non-local dependencies local by evaluating only tier images of strings.
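The idea can be sketched in a few lines: project the tier image and then run an SL-style check over it. This is an illustration only; SigmaPie's TSL class, introduced below, implements it for you.
# Illustrative TSL check: project the tier, then scan the tier image.
def tier_image(string, tier):
    return "".join(ch for ch in string if ch in tier)

def tsl_scan_negative(string, tier, restrictions):
    image = tier_image(string, tier)
    bigrams = {image[i:i + 2] for i in range(len(image) - 1)}
    return not any(r in bigrams for r in restrictions)

tier = ["l", "r"]
print(tier_image("lccrdclcddcrlc", tier))                       # lrlrl
print(tsl_scan_negative("lccrdclcddcrlc", tier, ["ll", "rr"]))  # True
print(tsl_scan_negative("lccldcr", tier, ["ll", "rr"]))         # False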
Latin liquid dissimilation
In Latin, liquids tend to alternate: if the final liquid of the stem is "l", the adjectival affix is realized as "aris". Conversely, if the final liquid is "r", the choice of the affix is "alis". Consider the examples below.
- militaris ~ *militalis "military"
- floralis ~ *floraris "floral"
- pluralis ~ *pluraris "plural"
This pattern is not SP because SP grammars cannot exhibit the blocking effect, and it is not SL either due to its long-distance nature.
lat_dissim = TSL()
Attributes of the TSL class:
- polar ("p" or "n") is the polarity of the grammar; the default value is "p";
- alphabet (list) is the set of symbols that the grammar uses;
- grammar (list of tuples) is the list of allowed or prohibited substrings of the language;
- k (int) is the size of the locality window of the grammar; by default it is 2;
- data (list of strings) is the learning sample;
- edges (list of two characters) are the delimiters used by the grammar; the default values are ">" and "<";
- fsm (FSM object) is the finite-state device that corresponds to the grammar;
- tier (list) is the list of the tier symbols.
Methods of the TSL class:
- check_polarity() and switch_polarity() display and change the polarity of the grammar;
- learn() detects tier symbols and learns the tier grammar;
- tier_image(string) returns the tier image of a given string;
- scan(string) tells if a given string is well-formed with respect to the learned grammar;
- extract_alphabet() collects the alphabet based on the data or grammar;
- generate_sample(n, repeat) generates n strings based on the given grammar; by default, repeat is set to False, so repetitions of the generated strings are not allowed, but this parameter can be set to True;
- fsmize() creates the corresponding FSA;
- clean_grammar() removes useless k-grams from the grammar.
Assume the toy Latin dissimilation dataset, where we mask every non-liquid as "c".
lat_dissim.data = ["ccc", "lccrcccclcr", "lrl", "rcclc"]
Extracting alphabet
We don't need to explicitly provide the alphabet. Instead, it can be extracted from the data using the extract_alphabet()
method.
lat_dissim.extract_alphabet()
print("Alphabet:", lat_dissim.alphabet)
Learning the tier and the grammar
After the alphabet is extracted and the training sample is provided, we can learn the dependency.
lat_dissim.learn()
print('Tier: ', lat_dissim.tier)
print('Grammar:', lat_dissim.grammar)
Switching polarity
By default, the grammars are positive, but this pattern is clearer when represented as a restriction. We can convert the positive grammar to a negative one using the switch_polarity() method.
print("Initial polarity of the grammar:", lat_dissim.check_polarity(), "\n")
lat_dissim.switch_polarity()
print("New polarity of the grammar:", lat_dissim.check_polarity())
print("New grammar:", lat_dissim.grammar)
We can learn a negative grammar directly as well. For example, let us learn a pattern like this:
aaabaaaa, baaaa, aaaaaba, aaaaaab, ...
*aababaaa, *baaaababb, *aaaa, ...
In simple words, the desired pattern is that exactly one "b" must be present in a string. Translating it into a linguistically relevant pattern gives us stress culminativity.
stress = TSL(polar="n")
stress.data = ["aaabaaaa", "baaaa", "aaaaaba", "aaaaaab"]
stress.extract_alphabet()
stress.learn()
print("Tier: ", stress.tier)
print("Grammar: ", stress.grammar)
The learned negative TSL grammar prohibits an empty tier (stress must be present in a word) and prohibits a tier where there is more than a single stress.
Generating sample
Data sample generation is also available for the class of TSL languages. Repetition of the same items within the dataset can be allowed or prohibited by changing the parameter repeat
.
print(stress.generate_sample(n=10, repeat=True))
print(stress.generate_sample(n=10, repeat=False))
The implemented learning algorithm for k-TSL languages was designed by McMullin and Jardine (2017) and is based on Jardine and Heinz (2016).
However, there are some phonological processes that require more power than TSL. Some languages have more than just a single long-distance assimilation: for example, separate vowel and consonantal harmonies. In this case, one tier is not enough: putting both vowels and consonants on a single tier will create the desired locality neither among vowels nor among consonants. For cases like this, a subregular class of multiple tier-based strictly local languages is especially useful.
There are numerous examples in the typological literature of phonological patterns whose complexity is beyond the power of TSL languages. Any pattern where several long-distance dependencies affect different sets of elements is such an example; see McMullin (2016) and Aksënova and Deshmukh (2018) for examples and discussion of those patterns.
Two features spread, only one of them can be blocked
The first example comes from Imdlawn Tashlhiyt (Hansson 2010). Sibilants regressively agree in voicing and anteriority.
- s-as:twa CAUS-settle
- S-fiaSr CAUS-be.full.of.straw
- z-bruz:a CAUS-crumble
- Z-m:Zdawl CAUS-stumble
However, while the voicing harmony can be blocked by voiceless obstruents, those obstruents are transparent to the anteriority agreement.
- s-mXazaj CAUS-loathe.each.other
- S-quZ:i CAUS-be.dislocated
The blockers need to be projected on the tier in order to capture the voicing harmony; however, having those blockers on the tier would make the sibilants non-adjacent, and would therefore cause problems for the anteriority harmony.
Vowel harmony and consonant harmony
In Bukusu, vowels agree in height, and a liquid "l" assimilates to "r" if followed by "r" somewhere further in the word (Odden 1994).
- reeb-er- ask-APPL
- lim-il- cultivate-APPL
- rum-ir- send-APPL
A tier containing both vowels and liquids would not capture this pattern: intervening vowels would make the liquid spreading non-local over the tier, and intervening liquids would keep the vowels potentially far away from each other over the tier.
Multiple tier-based strictly local (MTSL) grammars are conjunctions of multiple TSL grammars: they consist of several tiers and restrictions defined for every one of those tiers. For example, consider the following toy example.
Good strings: aaabbabba, oppopooo, aapapapp, obooboboboobbb, ...
Bad strings: *aabaoob, *paabab, *obabooo, ...
Generalization: if a string contains "a", it cannot contain "o", and vice versa;
if a string contains "p", it cannot contain "b", and vice versa.
Two tiers are required to encode this pattern: a tier of vowels ("o" and "a") and a tier of consonants ("p" and "b"). This restriction can be expressed via the following MTSL grammar, where the tier containing the vowels "a" and "o" imposes the restrictions *ao and *oa, and the tier containing the consonants "p" and "b" rules out *pb and *bp.
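In other words, an MTSL scan is a conjunction of per-tier SL-style checks. The sketch below illustrates the idea with the toy grammar just described; SigmaPie's MTSL class, used next, implements this for you.
# Illustrative MTSL check: the string must pass an SL-style scan
# over every tier separately.
def mtsl_scan_negative(string, grammar):
    for tier, restrictions in grammar.items():
        image = "".join(ch for ch in string if ch in tier)
        bigrams = {image[i:i + 2] for i in range(len(image) - 1)}
        if any(r in bigrams for r in restrictions):
            return False
    return True

toy_grammar = {("a", "o"): ["ao", "oa"], ("b", "p"): ["pb", "bp"]}
print(mtsl_scan_negative("aaabbabba", toy_grammar))  # True
print(mtsl_scan_negative("aabaoob", toy_grammar))    # False: "ao" on the vowel tier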
data = ['aabbaabb', 'abab', 'aabbab', 'abaabb', 'aabaab', 'abbabb', 'ooppoopp',
'opop', 'ooppop', 'opoopp', 'oopoop', 'oppopp', 'aappaapp', 'apap',
'aappap', 'apaapp', 'aapaap', 'appapp', 'oobboobb', 'obob', 'oobbob',
'oboobb', 'ooboob', 'obbobb', 'aabb', 'ab', 'aab', 'abb', 'oopp', 'op',
'oop', 'opp', 'oobb', 'ob', 'oob', 'obb', 'aapp', 'ap', 'aap', 'app',
'aaa', 'ooo', 'bbb', 'ppp', 'a', 'o', 'b', 'p', '']
The first step is to initialize the MTSL object.
harmony = MTSL()
Attributes of the MTSL class:
- polar ("p" or "n") is the polarity of the grammar, where "p" is the default value;
- alphabet (list) is the set of symbols that the grammar uses;
- grammar (dictionary) is a dictionary where the keys (tuples) are the tier alphabets and the values (lists) are the restrictions imposed on those tiers;
- k (int) is the size of the locality window of the grammar; by default it is 2;
- data (list of strings) is the learning sample;
- edges (list of two characters) are the delimiters used by the grammar; the default values are ">" and "<".
Methods of the MTSL class:
- check_polarity() and switch_polarity() display and change the polarity of the grammar;
- learn() detects the tier symbols and learns the tier grammars;
- scan(string) tells if a given string is well-formed with respect to the learned grammar;
- extract_alphabet() collects the alphabet based on the data and grammar.
Extracting alphabet and learning the grammar
Now we can initialize the data
and alphabet
attributes of the MTSL class, and apply the learn
method to learn the tiers and the grammars that correspond to them.
harmony.data = data
harmony.extract_alphabet()
harmony.learn()
The value of the attribute grammar
is represented in the following way:
G = {
tier_1 (tuple): tier_1_restrictions (list),
tier_2 (tuple): tier_2_restrictions (list),
...
tier_n (tuple): tier_n_restrictions (list)
}
for i in harmony.grammar:
print("Tier:", i)
print("Restrictions;", harmony.grammar[i], "\n")
Switching polarity
The grammar learned by default is positive and rather verbose; it can easily be converted to a negative one by applying the switch_polarity method.
print("Old polarity:", harmony.check_polarity())
harmony.switch_polarity()
print("New polarity:", harmony.check_polarity(), "\n")
for i in harmony.grammar:
print("Tier:", i)
print("Restrictions;", harmony.grammar[i], "\n")
The learning algorithm for 2-MTSL languages was designed by McMullin et al. (2019).
Scanning strings
As before, the scan
method tells if the given string is well-formed with respect to the learned grammar.
good = ["apapappa", "appap", "popo", "bbbooo"]
bad = ["aoap", "popppa", "pabp", "popoa"]
for s in good:
print("String", s, "is in L(G):", harmony.scan(s))
print()
for s in bad:
print("String", s, "is in L(G):", harmony.scan(s))
The current state of the MTSL-related research
We are currently doing the theoretical work of extending the learning algorithm for MTSL languages from capturing 2-local dependencies to k-local ones. This module of the toolkit will therefore be updated as the theoretical work on this language class progresses.
Now, let us capture the feature-changing process itself instead of defining well-formedness conditions. Assuming that the spreading moves left to right, we can mask all non-initial mentions of vowels and consonants in the words.
[("baABB", "baabb"), ("bBBoA", "bbboo"), ("pBaAA", "ppaaa"), ("pBBBB", "ppppp")]
First of all, let us start by defining toy harmonic classes.
specifications = {("a", "o"):"A", ("b", "p"):"B"}
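As a sanity check, the masking step itself can be sketched as follows: keep the first member of each harmonic class and replace every later member with the class label. This mirrors the example pairs shown above; the toolkit's own generate_pairs and mask_words helpers may differ in their details.
# A minimal sketch of the masking step used to build (masked, gold) pairs.
def mask_word(word, specifications):
    seen = set()
    out = []
    for ch in word:
        masked = ch
        for harmonic_class, label in specifications.items():
            if ch in harmonic_class:
                if harmonic_class in seen:
                    masked = label   # non-initial member: mask with the class label
                else:
                    seen.add(harmonic_class)
                break
        out.append(masked)
    return "".join(out)

print(mask_word("bbaba", specifications))  # bBaBA
print(mask_word("ppaaa", specifications))  # pBaAA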
Now, let us generate the training sample.
num_examples = 10
len_examples = 5
S = generate_pairs(num_examples, len_examples, specifications)
show = 5
print(show, "first pairs of S:\n", S[:show])
5 first pairs of S:
[('bBaBA', 'bbaba'), ('abBAB', 'abbab'), ('aApBB', 'aappp'), ('pBoBB', 'ppopp'), ('opAAA', 'opooo')]
First, let us provide the necessary input to OSTIA and save the resulting machine.
S = generate_pairs(500, 5, specifications)
Sigma = ["a", "o", "A", "b", "p", "B"]
Gamma = ["a", "o", "b", "p"]
T = ostia(S, Sigma, Gamma)
For a step-by-step implementation of OSTIA, click here. In order to evaluate the performance of the obtained automaton, we can generate more input forms.
test = mask_words(generate_words(5, 5, specifications), specifications)
for w in test:
print(w, "--->", T.rewrite(w))
# pBBoA ---> pppoo
# opABB ---> opopp
# obBBB ---> obbbb
# pBBaA ---> pppaa
# obABA ---> obobo
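One way to quantify the performance is to score the transducer's outputs against freshly generated input-output pairs. The sketch below assumes, as in the sample printed earlier, that generate_pairs returns (masked input, gold output) pairs.
# Rough accuracy check of the learned transducer on held-out pairs.
held_out = generate_pairs(100, 5, specifications)
correct = sum(T.rewrite(masked) == gold for masked, gold in held_out)
print("Accuracy:", correct / len(held_out))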
The performance of OSTIA largely depends on the size of the training sample and the length of the words. For example, if the length of the words is set to 5, we need to observe at least 200 examples in order to see stably correct outputs. After the transducer has been extracted, it is possible to explore its structure by simply viewing the list of transitions, states, and state outputs of that machine.
print("States:", T.Q)
print("State outputs:", T.stout)
print("\nTransitions:", T.E)
# States: ['', 'o', 'p', 'po']
# State outputs: {'': '', 'o': '', 'p': '', 'po': ''}
# Transitions: [['', 'a', 'a', ''], ['', 'o', 'o', 'o'], ['o', 'b', 'b', 'o'], ['', 'p', 'p', 'p'], ['p', 'o', 'o', 'po'], ['po', 'B', 'p', 'po'], ['o', 'A', 'o', 'o'], ['', 'b', 'b', ''], ['p', 'B', 'p', 'p'], ['o', 'p', 'p', 'po'], ['p', 'a', 'a', 'p'], ['po', 'A', 'o', 'po'], ['', 'A', 'a', ''], ['p', 'A', 'a', 'p'], ['o', 'B', 'b', 'o'], ['', 'B', 'b', '']]
We can visualize this FST, showing that the results of transduction learning are interpretable.
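One lightweight way to do so is to dump the machine in Graphviz DOT format, using the transition and state-output lists shown above. The sketch below assumes that T.E holds [source, input, output, target] quadruples and that T.stout maps each state to its state output, as in the printout; empty strings are displayed as "λ".
# Sketch: export the learned transducer to Graphviz DOT.
def to_dot(T):
    lines = ["digraph FST {", "  rankdir=LR;"]
    for state, final_output in T.stout.items():
        label = "{} / {}".format(state or "λ", final_output or "λ")
        lines.append('  "{}" [label="{}"];'.format(state or "λ", label))
    for source, inp, out, target in T.E:
        lines.append('  "{}" -> "{}" [label="{}:{}"];'.format(
            source or "λ", target or "λ", inp, out or "λ"))
    lines.append("}")
    return "\n".join(lines)

with open("ostia_fst.dot", "w") as f:
    f.write(to_dot(T))
# Render with: dot -Tpng ostia_fst.dot -o ostia_fst.png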
Acknowledgments
I am very grateful to Thomas Graf, Kyle Gorman, Jeffrey Heinz, Aniello De Santo and Ayla Karakaş whose input on different parts of this project was extremely helpful.