Enhance text normalization in numbers_pt.py #4

Open
coderabbitai bot opened this issue Nov 7, 2024 · 0 comments
coderabbitai bot commented Nov 7, 2024

As discussed in PR #1 and this comment, we can improve the text normalization implementation in numbers_pt.py.

Requester: @JarbasAl

🛠️ Refactor suggestion

Improve text normalization implementation.

Several improvements can be made to enhance maintainability and performance:

  1. Fix the misspelled parameter name 'agressive' → 'aggressive'
  2. Use constants for the word list and accents dictionary
  3. Use sets for O(1) lookups
  4. Consider renaming to _normalize_portuguese_text for clarity

Consider this improved implementation:

+# Common Portuguese words to remove during aggressive pruning
+_COMMON_PT_WORDS = {
+    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes", 
+    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
+    "esta", "deste", "desta", "neste", "nesta", "nesse",
+    "nessa", "foi", "que"
+}
+
+# Portuguese accent mappings
+_PT_ACCENTS = {
+    "a": {"á", "à", "ã", "â"},
+    "e": {"ê", "è", "é"},
+    "i": {"í", "ì"},
+    "o": {"ò", "ó"},
+    "u": {"ú", "ù"},
+    "c": {"ç"}
+}
+
-def _pt_pruning(text, symbols=True, accents=True, agressive=True):
+def _normalize_portuguese_text(text: str,
+                               remove_symbols: bool = True,
+                               remove_accents: bool = True,
+                               aggressive: bool = True) -> str:
+    """
+    Normalize Portuguese text by removing symbols, accents, and common words.
+    
+    Args:
+        text (str): Input text to normalize.
+        remove_symbols (bool): Remove punctuation marks if True.
+        remove_accents (bool): Replace accented chars with non-accented ones.
+        aggressive (bool): Remove common Portuguese words if True.
+    
+    Returns:
+        str: Normalized text.
+    """
-    # agressive pt word pruning
-    words = ["a", "o", "os", "as", "de", "dos", "das",
-             "lhe", "lhes", "me", "e", "no", "nas", "na", "nos", "em", "para",
-             "este",
-             "esta", "deste", "desta", "neste", "nesta", "nesse",
-             "nessa", "foi", "que"]
-    if symbols:
-        symbols = [".", ",", ";", ":", "!", "?", "º", "ª"]
+    if remove_symbols:
+        symbols = {".", ",", ";", ":", "!", "?", "º", "ª"}
         for symbol in symbols:
             text = text.replace(symbol, "")
         text = text.replace("-", " ").replace("_", " ")
-    if accents:
-        accents = {"a": ["á", "à", "ã", "â"],
-                   "e": ["ê", "è", "é"],
-                   "i": ["í", "ì"],
-                   "o": ["ò", "ó"],
-                   "u": ["ú", "ù"],
-                   "c": ["ç"]}
-        for char in accents:
-            for acc in accents[char]:
+    if remove_accents:
+        for char, accents in _PT_ACCENTS.items():
+            for acc in accents:
                 text = text.replace(acc, char)
-    if agressive:
-        text_words = text.split(" ")
-        for idx, word in enumerate(text_words):
-            if word in words:
-                text_words[idx] = ""
-        text = " ".join(text_words)
-        text = ' '.join(text.split())
+    if aggressive:
+        # Filter out common words and normalize whitespace
+        text = ' '.join(
+            word for word in text.split()
+            if word not in _COMMON_PT_WORDS
+        )
     return text
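
One further option, not part of the suggestion above: the symbol and accent steps could be expressed with `str.translate` instead of chained `replace` calls, so each step makes a single pass over the text. The sketch below is only an illustration. The names `_SYMBOLS`, `_SYMBOL_TABLE`, `_ACCENT_TABLE`, `strip_symbols` and `strip_accents` are hypothetical and do not exist in numbers_pt.py. The accent mapping is copied as-is from the diff, so it still leaves characters such as õ and ô untouched.

```python
# Hypothetical helpers: a str.translate variant of the symbol and accent steps.
# The mapping below is copied unchanged from the suggestion in this issue.
_PT_ACCENTS = {
    "a": {"á", "à", "ã", "â"},
    "e": {"ê", "è", "é"},
    "i": {"í", "ì"},
    "o": {"ò", "ó"},
    "u": {"ú", "ù"},
    "c": {"ç"},
}
_SYMBOLS = ".,;:!?ºª"

# Accented character -> base character.
_ACCENT_TABLE = str.maketrans(
    {acc: base for base, accs in _PT_ACCENTS.items() for acc in accs}
)
# "-" and "_" become spaces; every character in _SYMBOLS is deleted.
_SYMBOL_TABLE = str.maketrans("-_", "  ", _SYMBOLS)


def strip_symbols(text: str) -> str:
    """Remove punctuation and turn '-' and '_' into spaces in one pass."""
    return text.translate(_SYMBOL_TABLE)


def strip_accents(text: str) -> str:
    """Replace the mapped accented characters with their base letters."""
    return text.translate(_ACCENT_TABLE)


print(strip_symbols("olá, mundo-novo!"))  # olá mundo novo
print(strip_accents("ação é"))            # acao e
```

Whether this is measurably faster for the short strings handled here would need benchmarking; the main gain is that the tables are built once at import time.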
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

# Common Portuguese words to remove during aggressive pruning
_COMMON_PT_WORDS = {
    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes", 
    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
    "esta", "deste", "desta", "neste", "nesta", "nesse",
    "nessa", "foi", "que"
}

# Portuguese accent mappings
_PT_ACCENTS = {
    "a": {"á", "à", "ã", "â"},
    "e": {"ê", "è", "é"},
    "i": {"í", "ì"},
    "o": {"ò", "ó"},
    "u": {"ú", "ù"},
    "c": {"ç"}
}

def _normalize_portuguese_text(text: str,
                               remove_symbols: bool = True,
                               remove_accents: bool = True,
                               aggressive: bool = True) -> str:
    """
    Normalize Portuguese text by removing symbols, accents, and common words.
    
    Args:
        text (str): Input text to normalize.
        remove_symbols (bool): Remove punctuation marks if True.
        remove_accents (bool): Replace accented chars with non-accented ones.
        aggressive (bool): Remove common Portuguese words if True.
    
    Returns:
        str: Normalized text.
    """
    if remove_symbols:
        symbols = {".", ",", ";", ":", "!", "?", "º", "ª"}
        for symbol in symbols:
            text = text.replace(symbol, "")
        text = text.replace("-", " ").replace("_", " ")
    if remove_accents:
        for char, accents in _PT_ACCENTS.items():
            for acc in accents:
                text = text.replace(acc, char)
    if aggressive:
        # Filter out common words and normalize whitespace
        text = ' '.join(
            word for word in text.split()
            if word not in _COMMON_PT_WORDS
        )
    return text