This algorithm expands abbreviations in the input features: if a text is recognized as an abbreviation, the algorithm replaces it with the most recent preceding word in the text (in reading order). It also capitalizes the first letter of every word and lowercases the remaining letters. If the words in the text are sorted alphabetically, the algorithm can detect a word whose first letter does not match its neighbourhood and change it to the proper letter.
PW_ABBREVIATIONS is intended especially for data from lexicons, dictionaries and encyclopaedias.
If use of the script leads to a scientific publication, please acknowledge this by citing:
Graszka, O. (2021). Automatyzacja procesu rozpoznawania i weryfikacji nazw geograficznych ze źródeł historycznych na przykładzie Słownika geograficznego Królestwa Polskiego. W T. Epsztein (red.), Od Słownika geograficznego Królestwa Polskiego do map topograficznych Wojskowego Instytutu Geograficznego (s. 23–32).
The algorithm begins by sorting the features into reading order. The code assumes that the text is laid out in two columns that split the sheet in half. You can change that by editing the code below.
for sheet in SheetsOrderedList:
    if feedback.isCanceled():
        break
    # Compute both column rectangles once per sheet
    FirstColumnRect, SecondColumnRect = self.TakeColumnRect(feedback, sheet)
    FeaturesInFirstColumn = self.index.intersects(FirstColumnRect)
    FeaturesInSecondColumn = self.index.intersects(SecondColumnRect)
def TakeColumnRect(self, feedback, sheet):
    """Returns two rectangles: first and second column"""
    bbox = sheet.geometry().boundingBox()
    # x2 is the vertical midline that splits the sheet into two columns
    x1, x3 = bbox.xMinimum(), bbox.xMaximum()
    x2 = x1 + (x3 - x1) / 2
    y1, y2 = bbox.yMinimum(), bbox.yMaximum()
    FirstColumnRect = QgsRectangle(QgsPointXY(x1, y2), QgsPointXY(x2, y1))
    SecondColumnRect = QgsRectangle(QgsPointXY(x2, y2), QgsPointXY(x3, y1))
    return FirstColumnRect, SecondColumnRect
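If the source is laid out in a different number of columns, the half-sheet split could be generalized roughly as follows. This is only a sketch in plain Python: `split_into_columns` and its parameters are hypothetical names, rectangles are plain tuples rather than `QgsRectangle` objects, and equal-width columns are an assumption.

```python
def split_into_columns(x_min, y_min, x_max, y_max, n_columns=2):
    """Split a bounding box into n_columns equal-width rectangles.

    Hypothetical helper: each rectangle is returned as a tuple
    (x_min, y_min, x_max, y_max), ordered from left to right.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    width = (x_max - x_min) / n_columns
    return [
        (x_min + i * width, y_min, x_min + (i + 1) * width, y_max)
        for i in range(n_columns)
    ]


# With two columns this matches the half-sheet split used by the script.
columns = split_into_columns(0.0, 0.0, 100.0, 50.0, n_columns=2)
```

In the script itself each tuple would still have to be turned into a `QgsRectangle` before querying the spatial index.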
The algorithm treats two features as lying in the same row of text if the difference between their y coordinates (a) is less than half of the feature height (b).
if element.geometry().boundingBox().height()/2>(y_upper-element.geometry().centroid().asPoint().y()):
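The row test can be tried outside QGIS with a small sketch like the one below. It is not the script's own code: `sort_reading_order` is a hypothetical function, features are reduced to `(text, x, y, height)` tuples for a single column, and y is assumed to grow upwards so that the topmost feature has the largest y.

```python
def sort_reading_order(features):
    """Sort (text, x, y, height) tuples top-to-bottom, then left-to-right.

    Assumption: y grows upwards, so the topmost feature has the largest y.
    """
    remaining = sorted(features, key=lambda f: f[2], reverse=True)
    ordered = []
    while remaining:
        y_upper = remaining[0][2]  # centroid y of the topmost unassigned feature
        row, leftover = [remaining[0]], []
        for feature in remaining[1:]:
            # Same-row test from the text: half of the feature height
            # must exceed the difference of the y coordinates.
            if feature[3] / 2 > (y_upper - feature[2]):
                row.append(feature)
            else:
                leftover.append(feature)
        ordered.extend(sorted(row, key=lambda f: f[1]))  # left to right
        remaining = leftover
    return ordered


features = [
    ("Polski", 30.0, 99.0, 10.0),
    ("Cekinówka", 10.0, 80.0, 10.0),
    ("Cekcyn", 10.0, 100.0, 10.0),
]
texts = [f[0] for f in sort_reading_order(features)]
```

Here "Cekcyn" and "Polski" differ by 1 unit in y, which is less than half of their height (5), so they end up in the same row; "Cekinówka" starts the next row.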
Once the features are sorted in reading order, the algorithm deletes unwanted characters from the beginning and the end of each word. The user can choose these characters from a predefined list.
CharsList = ['.',',',':',';','/','\\','"',"'",'|','_','*','!','^','~','+','@','#','$','&','(',')',' ','0','1','2','3','4','5','6','7','8','9','-']
Cekcyn Polski,
-> Cekcyn Polski
|Cekinówka
-> Cekinówka
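A minimal way to reproduce this edge cleaning in plain Python is `str.strip` with a custom character set. The helper name `strip_edges` is hypothetical, and the sketch assumes every character from the predefined list has been selected.

```python
# Same characters as the predefined list, all selected (assumption).
CHARS_LIST = ['.', ',', ':', ';', '/', '\\', '"', "'", '|', '_', '*', '!',
              '^', '~', '+', '@', '#', '$', '&', '(', ')', ' ',
              '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']


def strip_edges(word, chars=CHARS_LIST):
    """Remove the given characters from both ends of word only."""
    return word.strip("".join(chars))


cleaned = [strip_edges("Cekcyn Polski,"), strip_edges("|Cekinówka")]
```

Because only the edges are stripped, the space inside "Cekcyn Polski" is preserved.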
The algorithm recognizes abbreviations (in this case, words shorter than 3 characters) and replaces each of them with the last preceding word longer than 3 characters. Note that at this point the words are already sorted in reading order.
self.if_short(string, 2)
Cegielnia
-> Cegielnia
C
-> Cegielnia
C
-> Cegielnia
C
-> Cegielnia
C
-> Cegielnia
C
-> Cegielnia
C
-> Cegielnia
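The replacement rule can be sketched as a single pass over the words in reading order. This is an illustration, not the script's implementation: `expand_abbreviations` and `limit` are hypothetical names, and the handling of words of exactly 3 characters (kept, but not remembered) simply follows the wording above.

```python
def expand_abbreviations(words, limit=3):
    """Replace words shorter than limit with the last word longer than limit.

    Words of exactly limit characters are kept as they are and are not
    remembered as replacements (a literal reading of the description).
    """
    expanded = []
    last_full_word = None
    for word in words:
        if len(word) < limit and last_full_word is not None:
            expanded.append(last_full_word)  # abbreviation: reuse last full word
        else:
            if len(word) > limit:
                last_full_word = word
            expanded.append(word)
    return expanded


result = expand_abbreviations(["Cegielnia", "C", "C", "Cekcyn", "C"])
```

An abbreviation that appears before any full word has nothing to expand to, so the sketch leaves it unchanged.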
The algorithm determines which letter is the most common in the neighbourhood of the word (the headwords lying before and after it). The script then looks for this letter in the group of capital letters at the beginning of the word and cuts off all the letters that precede it. If the script does not find the most common letter in that group, it replaces all the leading capital letters with this letter. If the word starts with a lowercase letter other than the most common letter, the algorithm adds the most common letter before the word. If the word starts with the most common letter in lowercase, the algorithm does nothing.
HCegielnia
-> Cegielnia
cegielnia
-> Cegielnia
egielnia
-> Cegielnia
IEegielnia
-> Cegielnia
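The four cases can be written down as a short sketch. The function names are hypothetical, the neighbourhood is assumed to be a non-empty list of headwords compared by their first letters, and taking the first occurrence of the letter within the leading capitals is an assumption as well.

```python
from collections import Counter


def most_common_initial(neighbours):
    """Return the most frequent first letter (uppercased) of the neighbours."""
    initials = [word[0].upper() for word in neighbours if word]
    return Counter(initials).most_common(1)[0][0]


def resolve_first_letter(word, letter):
    """Repair the beginning of word given the expected uppercase letter."""
    n = 0  # length of the run of capital letters at the beginning
    while n < len(word) and word[n].isupper():
        n += 1
    capitals, rest = word[:n], word[n:]
    if capitals:
        position = capitals.find(letter)
        if position != -1:
            return word[position:]  # cut everything before the letter
        return letter + rest  # letter not found: replace all leading capitals
    if word[:1].lower() == letter.lower():
        return word  # correct letter in lowercase: leave unchanged
    return letter + word  # wrong lowercase start: prepend the letter


letter = most_common_initial(["Cegielnia", "Cekcyn", "Hekinówka", "Cekinówka"])
fixed = [resolve_first_letter(w, letter)
         for w in ["HCegielnia", "cegielnia", "egielnia", "IEegielnia"]]
```

In this sketch "cegielnia" is returned unchanged; it only becomes "Cegielnia" in the final capitalization step.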
At the very end, the algorithm capitalizes the first letter of every word and changes all the other letters to lowercase.
Cekcyn polski
-> Cekcyn Polski
Czechowice-dziedzice
-> Czechowice-Dziedzice
SkierNiewice
-> Skierniewice
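One way to get this behaviour in plain Python is to capitalize every run of letters separately, so that the parts of hyphenated names are handled on their own. The helper name `capitalize_words` is hypothetical, and the regular expression is just one possible definition of a "word".

```python
import re

# A run of Unicode letters: word characters that are neither digits nor "_".
LETTER_RUN = re.compile(r"[^\W\d_]+")


def capitalize_words(text):
    """Uppercase the first letter of each letter run, lowercase the rest."""
    return LETTER_RUN.sub(lambda match: match.group().capitalize(), text)


capitalized = [
    capitalize_words("Cekcyn polski"),
    capitalize_words("Czechowice-dziedzice"),
    capitalize_words("SkierNiewice"),
]
```

Spaces, hyphens and other separators are left untouched; only the letter runs between them change.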
Input sheets layer
Text output field
Characters to remove on edges
Resolve first
Resolve capitalization
(Alphabetical order of the words in the text is necessary)
Output layer
PW_ABBREVIATIONS algorithm may process data obtained from these scripts: