Enhance text normalization in `numbers_pt.py` #4

coderabbitai · 2024-11-07T03:57:35Z

As discussed in PR #1 and this comment, we can improve the text normalization implementation in numbers_pt.py.

🛠️ Refactor suggestion

Improve text normalization implementation.

Several improvements can be made to enhance maintainability and performance:

Fix the misspelled parameter name 'agressive' → 'aggressive'
Use constants for the word list and accents dictionary
Use sets for O(1) lookups
Consider renaming to _normalize_portuguese_text for clarity

Consider this improved implementation:

+# Common Portuguese words to remove during aggressive pruning
+_COMMON_PT_WORDS = {
+    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes", 
+    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
+    "esta", "deste", "desta", "neste", "nesta", "nesse",
+    "nessa", "foi", "que"
+}
+
+# Portuguese accent mappings
+_PT_ACCENTS = {
+    "a": {"á", "à", "ã", "â"},
+    "e": {"ê", "è", "é"},
+    "i": {"í", "ì"},
+    "o": {"ò", "ó"},
+    "u": {"ú", "ù"},
+    "c": {"ç"}
+}
+
-def _pt_pruning(text, symbols=True, accents=True, agressive=True):
+def _normalize_portuguese_text(text: str,
+                               remove_symbols: bool = True,
+                               remove_accents: bool = True,
+                               aggressive: bool = True) -> str:
+    """
+    Normalize Portuguese text by removing symbols, accents, and common words.
+    
+    Args:
+        text (str): Input text to normalize.
+        remove_symbols (bool): Remove punctuation marks if True.
+        remove_accents (bool): Replace accented chars with non-accented ones.
+        aggressive (bool): Remove common Portuguese words if True.
+    
+    Returns:
+        str: Normalized text.
+    """
-    # agressive pt word pruning
-    words = ["a", "o", "os", "as", "de", "dos", "das",
-             "lhe", "lhes", "me", "e", "no", "nas", "na", "nos", "em", "para",
-             "este",
-             "esta", "deste", "desta", "neste", "nesta", "nesse",
-             "nessa", "foi", "que"]
-    if symbols:
-        symbols = [".", ",", ";", ":", "!", "?", "º", "ª"]
+    if remove_symbols:
+        symbols = {".", ",", ";", ":", "!", "?", "º", "ª"}
         for symbol in symbols:
             text = text.replace(symbol, "")
         text = text.replace("-", " ").replace("_", " ")
-    if accents:
-        accents = {"a": ["á", "à", "ã", "â"],
-                   "e": ["ê", "è", "é"],
-                   "i": ["í", "ì"],
-                   "o": ["ò", "ó"],
-                   "u": ["ú", "ù"],
-                   "c": ["ç"]}
-        for char in accents:
-            for acc in accents[char]:
+    if remove_accents:
+        for char, accents in _PT_ACCENTS.items():
+            for acc in accents:
                 text = text.replace(acc, char)
-    if agressive:
-        text_words = text.split(" ")
-        for idx, word in enumerate(text_words):
-            if word in words:
-                text_words[idx] = ""
-        text = " ".join(text_words)
-        text = ' '.join(text.split())
+    if aggressive:
+        # Filter out common words and normalize whitespace
+        text = ' '.join(
+            word for word in text.split()
+            if word not in _COMMON_PT_WORDS
+        )
     return text
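Only the aggressive step changes logic in this diff, so it is the part most worth checking before committing. The sketch below isolates that step from both versions and compares them; the helper names `old_aggressive` and `new_aggressive` and the sample strings are hypothetical and used for illustration only.

```python
# Word list copied from the diff; a list in the old code, a set in the new one.
_COMMON_PT_WORDS = {
    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes",
    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
    "esta", "deste", "desta", "neste", "nesta", "nesse",
    "nessa", "foi", "que",
}


def old_aggressive(text: str) -> str:
    """Aggressive step as written in the removed lines of the diff."""
    words = list(_COMMON_PT_WORDS)
    text_words = text.split(" ")          # splits on single spaces only
    for idx, word in enumerate(text_words):
        if word in words:
            text_words[idx] = ""
    text = " ".join(text_words)
    return " ".join(text.split())         # collapses all whitespace afterwards


def new_aggressive(text: str) -> str:
    """Aggressive step as written in the added lines of the diff."""
    return " ".join(
        word for word in text.split()     # splits on any whitespace
        if word not in _COMMON_PT_WORDS
    )


# Space-separated input: both versions agree.
for sample in ["vinte e tres", "  o  primeiro ministro foi a Lisboa ", ""]:
    assert old_aggressive(sample) == new_aggressive(sample)

# Tab-separated input: the old code never sees "e" as its own word.
print(repr(old_aggressive("vinte\te dois")))  # 'vinte e dois'
print(repr(new_aggressive("vinte\te dois")))  # 'vinte dois'
```

For input separated by plain spaces the two versions agree. They differ when words are separated by tabs or newlines, because the old code splits on `" "` before filtering while the new code splits on any whitespace. Whether that difference matters depends on what callers pass in, which this issue does not state.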

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

# Common Portuguese words to remove during aggressive pruning
_COMMON_PT_WORDS = {
    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes", 
    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
    "esta", "deste", "desta", "neste", "nesta", "nesse",
    "nessa", "foi", "que"
}

# Portuguese accent mappings
_PT_ACCENTS = {
    "a": {"á", "à", "ã", "â"},
    "e": {"ê", "è", "é"},
    "i": {"í", "ì"},
    "o": {"ò", "ó"},
    "u": {"ú", "ù"},
    "c": {"ç"}
}

def _normalize_portuguese_text(text: str,
                               remove_symbols: bool = True,
                               remove_accents: bool = True,
                               aggressive: bool = True) -> str:
    """
    Normalize Portuguese text by removing symbols, accents, and common words.
    
    Args:
        text (str): Input text to normalize.
        remove_symbols (bool): Remove punctuation marks if True.
        remove_accents (bool): Replace accented chars with non-accented ones.
        aggressive (bool): Remove common Portuguese words if True.
    
    Returns:
        str: Normalized text.
    """
    if remove_symbols:
        symbols = {".", ",", ";", ":", "!", "?", "º", "ª"}
        for symbol in symbols:
            text = text.replace(symbol, "")
        text = text.replace("-", " ").replace("_", " ")
    if remove_accents:
        for char, accents in _PT_ACCENTS.items():
            for acc in accents:
                text = text.replace(acc, char)
    if aggressive:
        # Filter out common words and normalize whitespace
        text = ' '.join(
            word for word in text.split()
            if word not in _COMMON_PT_WORDS
        )
    return text
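If the accent step turns out to matter in a benchmark, one optional variant is to build a translation table once and call `str.translate`, which makes a single pass over the text instead of one `replace` per accented character. This is a sketch of a swapped-in technique, not part of the suggestion above; `_ACCENT_TABLE` and `strip_pt_accents` are hypothetical names, and the mapping is assumed to be the same `_PT_ACCENTS` constant.

```python
# Same accent mapping as _PT_ACCENTS in the suggestion (assumed unchanged).
_PT_ACCENTS = {
    "a": {"á", "à", "ã", "â"},
    "e": {"ê", "è", "é"},
    "i": {"í", "ì"},
    "o": {"ò", "ó"},
    "u": {"ú", "ù"},
    "c": {"ç"},
}

# Invert {"a": {"á", ...}} into {"á": "a", ...} and build the table once at import.
_ACCENT_TABLE = str.maketrans(
    {acc: char for char, accents in _PT_ACCENTS.items() for acc in accents}
)


def strip_pt_accents(text: str) -> str:
    """Replace the accented characters listed in _PT_ACCENTS in one pass."""
    return text.translate(_ACCENT_TABLE)


print(strip_pt_accents("três maçãs"))  # tres macas
print(strip_pt_accents("avô"))         # avô (no entry for "ô" in the mapping)
```

The second call shows that the mapping, as copied from the original code, has no entries for "ô" or "õ", so those characters pass through unchanged in both the loop-based and the table-based version. Whether that is intended is a question for the maintainers.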

coderabbitai bot assigned JarbasAl Nov 7, 2024

coderabbitai bot commented Nov 7, 2024 • edited by JarbasAl
