Several improvements can be made to enhance maintainability and performance:
Fix the misspelled parameter name 'agressive' → 'aggressive'
Use constants for the word list and accents dictionary
Use sets for O(1) lookups
Consider renaming to _normalize_portuguese_text for clarity
Consider this improved implementation:
+# Common Portuguese words to remove during aggressive pruning
+_COMMON_PT_WORDS = {
+    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes",
+    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
+    "esta", "deste", "desta", "neste", "nesta", "nesse",
+    "nessa", "foi", "que"
+}
+
+# Portuguese accent mappings
+_PT_ACCENTS = {
+    "a": {"á", "à", "ã", "â"},
+    "e": {"ê", "è", "é"},
+    "i": {"í", "ì"},
+    "o": {"ò", "ó"},
+    "u": {"ú", "ù"},
+    "c": {"ç"}
+}
+
-def _pt_pruning(text, symbols=True, accents=True, agressive=True):
+def _normalize_portuguese_text(text: str,
+                               remove_symbols: bool = True,
+                               remove_accents: bool = True,
+                               aggressive: bool = True) -> str:
+    """
+    Normalize Portuguese text by removing symbols, accents, and common words.
+
+    Args:
+        text (str): Input text to normalize.
+        remove_symbols (bool): Remove punctuation marks if True.
+        remove_accents (bool): Replace accented chars with non-accented ones.
+        aggressive (bool): Remove common Portuguese words if True.
+
+    Returns:
+        str: Normalized text.
+    """
     # agressive pt word pruning
-    words = ["a", "o", "os", "as", "de", "dos", "das",
-             "lhe", "lhes", "me", "e", "no", "nas", "na", "nos", "em", "para",
-             "este",
-             "esta", "deste", "desta", "neste", "nesta", "nesse",
-             "nessa", "foi", "que"]
-    if symbols:
-        symbols = [".", ",", ";", ":", "!", "?", "º", "ª"]
+    if remove_symbols:
+        symbols = {".", ",", ";", ":", "!", "?", "º", "ª"}
         for symbol in symbols:
             text = text.replace(symbol, "")
         text = text.replace("-", " ").replace("_", " ")
-    if accents:
-        accents = {"a": ["á", "à", "ã", "â"],
-                   "e": ["ê", "è", "é"],
-                   "i": ["í", "ì"],
-                   "o": ["ò", "ó"],
-                   "u": ["ú", "ù"],
-                   "c": ["ç"]}
-        for char in accents:
-            for acc in accents[char]:
+    if remove_accents:
+        for char, accents in _PT_ACCENTS.items():
+            for acc in accents:
                 text = text.replace(acc, char)
-    if agressive:
-        text_words = text.split(" ")
-        for idx, word in enumerate(text_words):
-            if word in words:
-                text_words[idx] = ""
-        text = " ".join(text_words)
-        text = ' '.join(text.split())
+    if aggressive:
+        # Filter out common words and normalize whitespace
+        text = ' '.join(
+            word for word in text.split()
+            if word not in _COMMON_PT_WORDS
+        )
     return text
📝 Committable suggestion
‼️IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Common Portuguese words to remove during aggressive pruning
_COMMON_PT_WORDS = {
    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes",
    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
    "esta", "deste", "desta", "neste", "nesta", "nesse",
    "nessa", "foi", "que"
}

# Portuguese accent mappings
_PT_ACCENTS = {
    "a": {"á", "à", "ã", "â"},
    "e": {"ê", "è", "é"},
    "i": {"í", "ì"},
    "o": {"ò", "ó"},
    "u": {"ú", "ù"},
    "c": {"ç"}
}


def _normalize_portuguese_text(text: str,
                               remove_symbols: bool = True,
                               remove_accents: bool = True,
                               aggressive: bool = True) -> str:
    """
    Normalize Portuguese text by removing symbols, accents, and common words.

    Args:
        text (str): Input text to normalize.
        remove_symbols (bool): Remove punctuation marks if True.
        remove_accents (bool): Replace accented chars with non-accented ones.
        aggressive (bool): Remove common Portuguese words if True.

    Returns:
        str: Normalized text.
    """
    if remove_symbols:
        symbols = {".", ",", ";", ":", "!", "?", "º", "ª"}
        for symbol in symbols:
            text = text.replace(symbol, "")
        text = text.replace("-", " ").replace("_", " ")
    if remove_accents:
        for char, accents in _PT_ACCENTS.items():
            for acc in accents:
                text = text.replace(acc, char)
    if aggressive:
        # Filter out common words and normalize whitespace
        text = ' '.join(
            word for word in text.split()
            if word not in _COMMON_PT_WORDS
        )
    return text
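The suggestion above still calls `str.replace` once per symbol and once per accented character. One possible follow-up, not part of the review, is to precompute translation tables with `str.maketrans` so each step is a single pass over the text. The sketch below reuses the same word list and accent mapping; the names `_SYMBOL_TABLE`, `_ACCENT_TABLE`, and `normalize_pt_sketch` are illustrative and do not exist in `numbers_pt.py`.

```python
# Same data as the suggestion above; frozenset gives O(1) membership tests.
_COMMON_PT_WORDS = frozenset({
    "a", "o", "os", "as", "de", "dos", "das", "lhe", "lhes",
    "me", "e", "no", "nas", "na", "nos", "em", "para", "este",
    "esta", "deste", "desta", "neste", "nesta", "nesse",
    "nessa", "foi", "que",
})

_PT_ACCENTS = {
    "a": {"á", "à", "ã", "â"},
    "e": {"ê", "è", "é"},
    "i": {"í", "ì"},
    "o": {"ò", "ó"},
    "u": {"ú", "ù"},
    "c": {"ç"},
}

# "-" and "_" become spaces; the punctuation in the third argument is deleted.
_SYMBOL_TABLE = str.maketrans("-_", "  ", ".,;:!?ºª")

# Flatten the mapping into accented char -> base char.
_ACCENT_TABLE = str.maketrans({
    acc: base for base, accs in _PT_ACCENTS.items() for acc in accs
})


def normalize_pt_sketch(text: str,
                        remove_symbols: bool = True,
                        remove_accents: bool = True,
                        aggressive: bool = True) -> str:
    """Illustrative single-pass variant of the normalization steps."""
    if remove_symbols:
        text = text.translate(_SYMBOL_TABLE)
    if remove_accents:
        text = text.translate(_ACCENT_TABLE)
    if aggressive:
        # Drop common words; split() also collapses repeated whitespace.
        text = " ".join(
            word for word in text.split() if word not in _COMMON_PT_WORDS
        )
    return text


print(normalize_pt_sketch("Olá, este é o número-três!"))  # Ola numero tres
```

Two properties carry over unchanged from the tables in the suggestion: the mapping covers lowercase characters only and has no entries for "õ" or "ô", and because accents are stripped before word filtering, "é" becomes "e" and is then removed as a common word.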
As discussed in PR #1 and this comment, we can improve the text normalization implementation in numbers_pt.py.
Requester: @JarbasAl
🛠️ Refactor suggestion
Improve text normalization implementation.