Basic English Translator: Simplifying text with natural language processing.

NeverForged/BasicEnglishTranslator


Basic English Translator

According to http://ogden.basic-english.org/:

If one were to take the 25,000 word Oxford Pocket English Dictionary and take away the redundancies of our rich language and eliminate the words that can be made by putting together simpler words, we find that 90% of the concepts in that dictionary can be achieved with 850 words. The shortened list makes simpler the effort to learn spelling and pronunciation irregularities. The rules of usage are identical to full English so that the practitioner communicates in perfectly good, but simple, English.

The goal of the Basic English Translator is to take this idea, specifically the 850 words proposed by Charles K. Ogden in the 1930s, plus the international and supplementary words needed for various industries and the "next steps" words that English speakers should know, and simplify a given block of text down to this vocabulary.

Research Question

How can natural language processing be used to translate a block of text into a simpler block of (Basic English) text?

Data Understanding

  • Basic/Simple English: There are several versions of 'Simple' or 'Basic' English; Charles K. Ogden's 850-word list is the most common, and Simple English Wikipedia (with a broader vocabulary) is another.
  • GoogleNews-vectors-negative300.bin: A pretrained set of word vectors built from Google News that may help with translating input text into Basic English, at least until my model is ready.

Data Preparation

Given how long gensim functions take to run, forcing a user to wait for gensim to calculate cosine similarities is not practical. Instead, I create a dictionary (in both the Pythonic and literal sense of the term) that maps English words to words that appear on Ogden's list. The following flow chart shows this process:

Blue -> BasicEnglishTranslator.py/graph_model.py functions, Green -> gensim, Purple -> networkx, Red -> nltk
  • First, find the "best" matches for a word. This is done by taking a gensim model (in my case, Google News) and:
    • Take each word and find its top 10 connections (by cosine similarity)
    • Keep only the top 3 (that match in part of speech)
    • Remove the worst match (using gensim)
    • Store the result in a dictionary
  • Next, make a weighted graph where each word connects to the two words it was matched to, plus any words matched to it. Set the weights of the graph to "(sine similarity)²," i.e. 1 - (cosine similarity)².
  • Find the shortest path from each word to a Basic English word, starting from the most complex words and working down to the least complex (so that the model favors simpler words).
  • Make a dictionary that stores this information, save in a cPickle file so we can access it whenever we need to translate text.
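The graph-and-shortest-path steps above can be sketched in pure Python. This is a minimal illustration with hypothetical similarity data, not the project's actual code (which uses gensim for the similarities and networkx for the paths); the words and cosine values are made up for the example.

```python
import heapq

# Hypothetical gensim-derived matches: each word's kept
# cosine-similarity neighbors (toy values for illustration).
matches = {
    "utensil": [("spoon", 0.71), ("implement", 0.65)],
    "implement": [("instrument", 0.80), ("tool", 0.77)],
    "instrument": [("tool", 0.69), ("spoon", 0.40)],
}
BASIC = {"spoon", "tool"}  # stand-in for Ogden's 850-word list

# Build an undirected weighted graph: each word connects to the words
# it matched, plus any words matched to it; weight = 1 - (cosine)^2.
graph = {}
for word, nbrs in matches.items():
    for nbr, cos in nbrs:
        w = 1.0 - cos ** 2
        graph.setdefault(word, []).append((nbr, w))
        graph.setdefault(nbr, []).append((word, w))

def nearest_basic(start):
    """Dijkstra from `start`; return the closest Basic English word."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in BASIC:
            return node
        if d > dist.get(node, float("inf")):
            continue
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return start  # no path to a basic word: leave it unchanged

# Map every non-basic word; the result would be pickled so the
# translator only does fast dictionary lookups at run time.
translator = {w: nearest_basic(w) for w in graph if w not in BASIC}
print(translator["utensil"])  # -> spoon
```

The pickled dictionary is what makes translation fast: all the expensive similarity work happens once, offline.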

Evaluation

Flesch-Kincaid scores of the original documents are on the x-axis, and the difference between the translated and original scores on the y-axis.
  1. Checking the Basic English Translator: For this I scraped articles from both Simple English Wikipedia and standard English Wikipedia. I then calculated the complexity of the text using Flesch-Kincaid reading levels, and compared the original documents to the translations produced by my model. For my first model, whose starting dictionary was constructed from Basic English words combined with a list of prefixes and suffixes:
    • Already Simple text was reduced in complexity by an average of 0.16 grade levels.
    • More complex text was reduced by an average of 0.71 grade levels.
  • This method did create a number of issues, namely complex words that are similar to Basic English words creeping into the model; see below for an example, where "chip of wood" became "microprocessor of wood." When the starting dictionary was instead based on a word2vec model of actual Basic English texts:
    • Simple English Wikipedia texts dropped half a grade level.
    • Regular (English) Wikipedia texts dropped a full grade level.
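The Flesch-Kincaid grade level used for this comparison is a standard formula. Here is a self-contained sketch of it; the syllable counter is a rough vowel-group heuristic of my own for illustration (real evaluations typically use a readability library with a proper syllable dictionary).

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels,
    dropping a trailing silent 'e'."""
    word = word.lower()
    if word.endswith("e") and len(word) > 2:
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def flesch_kincaid_grade(text):
    """FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (syllables / len(words))
            - 15.59)

print(flesch_kincaid_grade("A spoon is a tool for eating."))
print(flesch_kincaid_grade(
    "The multifaceted instrument facilitates alimentary consumption."))
```

Shorter sentences and fewer syllables per word push the grade level down, which is exactly what replacing complex words with Basic English words should do.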
  2. Look at Actual Text: While not practical for checking everything, this does allow for some basic intuition about the text itself. The example below is the Simple English Wikipedia article for "spoon":

    • Earlier Model:

    another spoon is another instrument for eating. it is sometimes used for eating foods that are like liquids (like soup and soy), and it might also be used for stirring. humans use spoons every day. spoons are mostly useful for eating liquids, such as soup, though some solids (like tapioca and ice butter) are also sometimes eaten with spoons. another ladle is another kind of serving spoon used for soup, lager, or other foods. there are many different kinds of spoons. there are dessert spoons, soup spoons, baby spoons, teaspoons, thirds and others. there are also spoons that are collector whips and are monopolist another farmer of money. some performers even use two spoons as another musical instrument like another castanet. spoons have been used as computers for eating since paleolithic times. prehistoric peoples probably used shells, or small sheets of wood as spoons. both the testament and latin words for spoon come from the word superposition, which is another spiral-shaped lager shell. the anglo-saxon word spoon, means another sidewalk or spicy of wood.

    • Constructed Dictionary Model:

    A spoon is a theory for feasting. It is normally used for feasting foods that are like liquids (like soup and honeyed), and it can also be used for boiling. Humans use spoons every day. Spoons are mostly useful for feasting liquids, such as soup, though some solids (like cereal and ice cheese) are also normally eaten with spoons. A spoon is a kind of serving spoon used for soup, soup, or other foods. There are many different kinds of spoons. There are dessert spoons, soup spoons, baby spoons, cups, cups and others. There are also spoons that are collector goods and are value a i of money. Some artists even use two spoons as a musical instrument like a verse. Spoons have been used as practices for feasting since Paleolithic times. Prehistoric peoples probably used shells, or small bits of wood as spoons. Both the Greek and Latin words for spoon come from the word endothelium, which is a spiral-shaped snake shell. The Anglo-Saxon word spoon, means a microprocessor or faction of wood.

    • Dictionary Based on Modeling Basic English Texts

    A spoon is a framework for eating. It is normally used for eating foods that are like liquids (like soup and honeyed), and it could also be used for boiling. Beings use spoons every day. Spoons are mostly useful for eating liquids, such as soup, although some impurities (like cereal and ice perfume) are also normally eaten with spoons. A spoon is a kind of serving spoon used for soup, soup, or other foods. There are many different kinds of spoons. There are dessert spoons, soup spoons, baby spoons, cups, cups and others. There are also spoons that are collector goods and are value a lot of money. Some bands even use two spoons as a musical instrument like a roman. Spoons have been used as ways for eating since Paleolithic times. Prehistoric peoples probably used shells, or small bits of wood as spoons. Both the Greek and Latin words for spoon come from the word locus, which is a spiral-shaped frog shell. The Anglo-Saxon word spoon, means a quantifier or delhi of wood.

  3. Look at word associations in a graph: The graph highlights the searched word with a yellow dot for ease of finding, and uses three colors: blue words come from Ogden's Basic English; red words are variations of the basic words and represent stopping points in the translation; and green words are words that will be replaced. Just keep tracing the path until a red or blue word is hit.

Graph model of 'Spork' to Spoon
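Tracing a green word to its red or blue endpoint, as described above, amounts to following dictionary links until a stopping word is reached. A tiny sketch with hypothetical data mirroring the 'Spork' figure (the actual project stores this in its pickled dictionary):

```python
# Hypothetical match data: each "green" word points at the word it was
# matched to; "red"/"blue" words are stopping points.
matches = {"spork": "utensil", "utensil": "spoon"}
endpoints = {"spoon", "spoons"}  # Basic words and their variants

def trace(word, max_hops=10):
    """Follow match links until a red/blue endpoint (or a dead end)."""
    path = [word]
    while word not in endpoints and word in matches and len(path) <= max_hops:
        word = matches[word]
        path.append(word)
    return path

print(trace("spork"))  # ['spork', 'utensil', 'spoon']
```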

Future Plans

  1. N-Grams/Phraser: Applying a phraser to the text to be translated, and incorporating n-grams into the modeling, would allow for phrase-to-word and word-to-phrase translations, though this may add computation time for the end user, which is why gensim modeling was avoided in the main translation in the first place.
  2. Model Books Written in Basic English Charles K. Ogden wrote books in Basic English to show that his theory worked; running a gensim model on these texts would allow for:
    • A better starting dictionary based on which variants of each Basic English word are actually used, rather than running through a list of prefixes and suffixes and assuming all are valid.
    • Ogden's original semantic meaning being preserved. The model used here uses the words as they were used in Google News; Basic English has specific semantic rules that would be better preserved through this process.
  3. Use a more "American" word list/theory: Since Charles K. Ogden was British, you may notice that "pants" is replaced with "trousers"; finding a different model based on research done in the US with English Language Learner populations may make for a more useful model, at least for American ELL students.
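For the phraser idea in item 1 above, gensim's Phrases model scores candidate bigrams and promotes those above a threshold to single tokens. The scoring can be sketched in pure Python; the corpus, min_count, and threshold here are made up for illustration (gensim's defaults are min_count=5, threshold=10.0).

```python
from collections import Counter

# Tiny hypothetical corpus of tokenized sentences.
sentences = [
    ["ice", "cream", "is", "cold"],
    ["i", "like", "ice", "cream"],
    ["ice", "cream", "melts"],
    ["the", "ice", "is", "cold"],
]

unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)
min_count = 2
threshold = 0.5  # hypothetical; tuned per corpus

def score(a, b):
    """Default gensim Phrases scorer:
    (count(a,b) - min_count) * N / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

phrases = {pair for pair in bigrams if score(*pair) > threshold}
print(phrases)  # {('ice', 'cream')}
```

Pairs like "ice cream" that co-occur far more often than chance would then be treated as a single unit, enabling phrase-level translation.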

Deployment

To use the program, go to www.BasicEnglishTranslator.com. The site was made with Flask and runs on an EC2 instance behind nginx. The flag artwork, made by Nicolas Raymond, was taken from a Creative Commons site.
