Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/decimal markers #43

Closed
wants to merge 2 commits into from
Closed

Feat/decimal markers #43

wants to merge 2 commits into from

Conversation

JarbasAl
Copy link
Member

@JarbasAl JarbasAl commented Nov 27, 2022

port MycroftAI#69

Summary by CodeRabbit

  • New Features

    • Introduced a normalize_decimals function for handling different decimal characters in number extraction.
    • Updated number extraction functions to accept a custom decimal character.
  • Tests

    • Added tests to validate number extraction with varying decimal markers.

@JarbasAl JarbasAl added the enhancement New feature or request label Nov 27, 2022
@codecov
Copy link

codecov bot commented Nov 27, 2022

Codecov Report

❗ No coverage uploaded for pull request base (dev@08ed3c6). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head ffa185e differs from pull request most recent head 811550c. Consider uploading reports for the commit 811550c to get more accurate results

@@          Coverage Diff          @@
##             dev     #43   +/-   ##
=====================================
  Coverage       ?   0.00%           
=====================================
  Files          ?      65           
  Lines          ?   16459           
  Branches       ?       0           
=====================================
  Hits           ?       0           
  Misses         ?   16459           
  Partials       ?       0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Member

@NeonDaniel NeonDaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple test cases to add

@@ -290,6 +290,17 @@ def test_combinations(self):


class TestExtractNumber(unittest.TestCase):
def test_extract_number_decimal_markers(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should include a test case for 1,000,000 and 1,000.0 (or similar). Maybe also "1000, is what I said" and/or "I said 1,000." to include boundary cases

@JarbasAl JarbasAl mentioned this pull request Jan 4, 2023
@JarbasAl
Copy link
Member Author

@coderabbitai review

Copy link

coderabbitai bot commented Jul 17, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

coderabbitai bot commented Jul 17, 2024

Warning

Rate limit exceeded

@JarbasAl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 3 seconds before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Commits

Files that changed from the base of the PR and between 811550c and e111f47.

Walkthrough

The update introduces a normalize_decimals function for handling different decimal point characters in text, ensuring consistent number extraction across multiple languages. The extract_number and extract_numbers functions now accept a decimal parameter, allowing flexibility in decimal character specification. Corresponding changes were made to various language-specific parsing modules, and new unit tests were added to validate these functionalities.

Changes

Files/Modules Change Summary
lingua_franca/lang/parse_common.py Added normalize_decimals function and updated match_yes_or_no function to resolve resource files based on language.
lingua_franca/lang/parse_*.py Included decimal parameter in number extraction functions and utilized normalize_decimals for consistent handling.
lingua_franca/parse.py Modified extract_numbers and extract_number functions to accept and handle a decimal parameter.
test/unittests/test_parse_en.py Added test_extract_number_decimal_markers method to test decimal normalization in extract_number and extract_numbers functions.

Poem

A rabbit hopped through fields of code,
🌾 With decimal dots and commas showed,
Numbers danced and twirled anew,
Parsed with ease, both old and true.
In every language, near and far,
Numbers now as clear as stars. ✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (11)
lingua_franca/parse.py (1)

Line range hint 59-78: Parameter Addition in extract_numbers Function

The addition of the decimal parameter to the extract_numbers function enhances flexibility by allowing different decimal characters. However, the documentation states that it will always extract numbers formatted with a decimal dot/full stop. This could be confusing for developers expecting the function to handle different decimal formats based on the decimal parameter.

Consider clarifying the documentation or adjusting the implementation to respect the decimal parameter when extracting numbers.

lingua_franca/lang/parse_eu.py (1)

Line range hint 287-307: Enhanced extract_numbers_eu function to support different decimal separators.

The addition of the decimal parameter to extract_numbers_eu allows for flexibility in handling different decimal characters, which is essential for internationalization. The implementation uses normalize_decimals to adjust the input text accordingly, which aligns well with the intended functionality.

However, there's a potential issue here:

  • The comment on line 303 mentions that numbers will always be extracted with a decimal dot, even if another decimal character is specified. This could lead to confusion or incorrect behavior if not clearly documented or handled in the function.

Consider clarifying this behavior in the documentation or adjusting the implementation to respect the decimal parameter fully.

-        will always extract numbers formatted with a decimal dot/full stop,
+        will attempt to normalize numbers to use the specified decimal character,
Tools
Ruff

308-308: extract_numbers_generic may be undefined, or defined from star imports

(F405)

lingua_franca/lang/parse_it.py (2)

Line range hint 228-249: Review of extract_number_it function with new decimal parameter

  1. Correctness and Consistency: The function now accepts a decimal parameter that allows for different decimal characters. This is consistent with the PR's objective to handle different decimal markers.

  2. Documentation: The function's documentation has been updated to reflect the new parameter. However, the note in lines 244-245 might be misleading as it implies that the function will always use a dot as a decimal separator regardless of the decimal parameter. This could confuse users.

  3. Implementation: The implementation uses the normalize_decimals function to replace the specified decimal character with a dot before processing. This approach ensures that the existing logic for number extraction does not need to be altered significantly.

Suggestions:

  • Clarify the documentation to ensure it accurately describes the behavior when a different decimal character is specified.
  • Add a unit test to verify that the function behaves as expected when a non-dot decimal character is used.
-        will always extract numbers formatted with a decimal dot/full stop,
-        such as '3.5', even if 'decimal' is specified.
+        will extract numbers formatted with the specified decimal character,
+        converting them internally to use a decimal dot for processing.

Would you like me to help by adding a unit test for this scenario?


Line range hint 1158-1178: Review of extract_numbers_it function with new decimal parameter

  1. Correctness and Consistency: The function now accepts a decimal parameter, similar to the extract_number_it function. This maintains consistency across the API.

  2. Documentation: Similar to extract_number_it, the documentation here might mislead users about the handling of the decimal parameter.

  3. Implementation: The usage of normalize_decimals is appropriate and ensures that the function can handle various decimal characters seamlessly.

Suggestions:

  • Update the documentation to accurately reflect the handling of the decimal parameter.
  • Consider adding unit tests to cover scenarios with different decimal characters.
-        will always extract numbers formatted with a decimal dot/full stop,
-        such as '3.5', even if 'decimal' is specified.
+        will extract numbers formatted with the specified decimal character,
+        converting them internally to use a decimal dot for processing.

Would you like assistance in creating unit tests for these scenarios?

lingua_franca/lang/parse_nl.py (1)

Ensure all calls to extract_number_nl include the decimal parameter

The function extract_number_nl has been updated to handle a custom decimal marker, but the existing calls to this function do not pass the decimal parameter. This could lead to inconsistencies or bugs if the decimal parameter is required in those cases.

  • File: lingua_franca/lang/parse_nl.py
    • Line: 426-442 (function definition)
    • Line: 439 (function call without decimal parameter)

Please verify and update the function calls to include the decimal parameter where necessary.

Analysis chain

Line range hint 418-439: Updated extract_number_nl function to handle custom decimal markers.

The addition of the decimal parameter allows for flexible handling of decimal markers, which is a key objective of this PR. The use of normalize_decimals to adjust the text based on the decimal parameter is correctly implemented.

However, ensure that all calls to extract_number_nl in other parts of the codebase pass the correct decimal parameter if needed.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `extract_number_nl` pass the correct `decimal` parameter.
# Test: Search for the function usage. Expect: Only occurances of the new signature with the `decimal` parameter.
rg --type python -A 5 $'extract_number_nl'

Length of output: 72


Script:

#!/bin/bash
# Description: Verify all function calls to `extract_number_nl` pass the correct `decimal` parameter.
# Test: Search for the function usage. Expect: Only occurrences of the new signature with the `decimal` parameter.
rg --type py -A 5 $'extract_number_nl'

Length of output: 902

lingua_franca/lang/parse_cs.py (1)

26-26: Ensure consistent import style.

The import statement for normalize_decimals is added separately from the other imports from the same module (parse_common). It's generally a good practice to consolidate imports from the same module into a single line for clarity and maintainability.

- from lingua_franca.lang.parse_common import normalize_decimals
+ from lingua_franca.lang.parse_common import is_numeric, look_for_fractions, invert_dict, ReplaceableNumber, partition_list, tokenize, Token, Normalizer, normalize_decimals
lingua_franca/lang/parse_en.py (2)

Line range hint 533-553: Review of extract_number_en function with added decimal parameter

The function extract_number_en has been updated to include a decimal parameter, allowing for the specification of different decimal markers, which is a significant enhancement for multi-language support. The default value for decimal is set to '.', which maintains backward compatibility.

Code Correctness and Logic

  • The logic to replace the decimal marker using normalize_decimals function before extracting numbers is correctly implemented. This ensures that numbers are consistently formatted before processing, which is crucial for accurate extraction.

Documentation and Comments

  • The function's documentation has been updated to reflect the new parameter and its default value. Additionally, the note explaining the behavior with the default decimal marker is clear and informative.

Potential Issue:

  • The function assumes that the decimal parameter will always be a single character. It might be beneficial to add a validation check to ensure that the decimal parameter is a single character to prevent unexpected behavior.
+ if len(decimal) != 1:
+     raise ValueError("Decimal marker must be a single character.")

Also applies to: 554-554


Line range hint 1665-1683: Review of extract_numbers_en function with added decimal parameter

The function extract_numbers_en has been similarly updated to handle the decimal parameter. This allows for consistent handling across singular and plural number extractions, which is important for maintaining uniform functionality across the API.

Code Correctness and Logic

  • The implementation for handling the decimal parameter is consistent with extract_number_en, using the normalize_decimals function to standardize the decimal marker before extraction. This consistency in handling across related functions is good practice.

Documentation and Comments

  • The documentation clearly states the purpose of the decimal parameter and its default behavior. The note added provides clarity on how numbers with a decimal point are handled, which is helpful for users of the function.

Potential Improvement:

  • Similar to extract_number_en, a validation for the decimal parameter to ensure it is a single character would enhance robustness and prevent errors.
+ if len(decimal) != 1:
+     raise ValueError("Decimal marker must be a single character.")

Also applies to: 1684-1684

lingua_franca/lang/parse_pt.py (3)

Line range hint 205-268: Review of PortugueseNormalizer class

  1. Tokenization Logic: The tokenization methods handle various cases like percentages, hashtags, and hyphenated words. This is important for ensuring that the input text is correctly split into tokens that can be processed further.
  2. Regex Usage: The use of regular expressions to split specific patterns is efficient. However, ensure that these patterns cover all expected formats (e.g., different uses of hyphens and numbers).
  3. Error Handling: There is no explicit error handling in the tokenization process. It might be beneficial to add error handling to manage unexpected input formats or regex match failures.

Consider adding more robust error handling mechanisms to improve the resilience of the tokenization process.


Line range hint 270-274: Review of normalize_pt function

The function normalize_pt provides a straightforward way to normalize Portuguese text by removing articles. This is essential for consistent text processing. However, it might be beneficial to expand the synonyms handling to cover more cases or make it configurable, which could enhance the function's flexibility and applicability in different contexts.


Line range hint 276-657: Review of extract_datetime_pt function

The extract_datetime_pt function is complex and handles a variety of date and time formats. The logic appears comprehensive, but it's quite lengthy and could benefit from being broken down into smaller, more manageable functions. This would improve readability and maintainability.

Consider refactoring this function to separate concerns, such as handling different types of date parts (e.g., days, months, years) into separate helper functions or classes.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 08ed3c6 and 811550c.

Files selected for processing (17)
  • lingua_franca/lang/parse_common.py (1 hunks)
  • lingua_franca/lang/parse_cs.py (5 hunks)
  • lingua_franca/lang/parse_da.py (3 hunks)
  • lingua_franca/lang/parse_de.py (4 hunks)
  • lingua_franca/lang/parse_en.py (5 hunks)
  • lingua_franca/lang/parse_es.py (4 hunks)
  • lingua_franca/lang/parse_eu.py (3 hunks)
  • lingua_franca/lang/parse_fa.py (5 hunks)
  • lingua_franca/lang/parse_fr.py (4 hunks)
  • lingua_franca/lang/parse_it.py (5 hunks)
  • lingua_franca/lang/parse_nl.py (5 hunks)
  • lingua_franca/lang/parse_pl.py (5 hunks)
  • lingua_franca/lang/parse_pt.py (2 hunks)
  • lingua_franca/lang/parse_ru.py (3 hunks)
  • lingua_franca/lang/parse_sv.py (2 hunks)
  • lingua_franca/parse.py (3 hunks)
  • test/unittests/test_parse_en.py (1 hunks)
Additional comments not posted (25)
lingua_franca/parse.py (1)

Line range hint 83-102: Parameter Addition in extract_number Function

Similar to the extract_numbers function, extract_number now accepts a decimal parameter. This change is consistent across both functions, maintaining uniformity in handling decimal characters. The documentation and implementation should ensure that the behavior matches the expectations set by the new parameter.

The change is consistent and well-documented, but ensure that the actual behavior of handling different decimal formats is tested thoroughly.

lingua_franca/lang/parse_fa.py (2)

Line range hint 311-327: Updated Function Signature and Logic in extract_numbers_fa

The extract_numbers_fa function now includes a decimal parameter, allowing for the specification of different decimal characters. The function uses normalize_decimals to adjust the input text accordingly, which is a good approach for handling diverse input formats.

The implementation aligns with the intended functionality of supporting multiple decimal characters. Ensure comprehensive testing to verify the correct normalization and extraction of numbers across different formats.


Line range hint 341-358: Handling of Decimal Character in extract_number_fa

The extract_number_fa function has been updated similarly to extract_numbers_fa, including the use of the decimal parameter and the normalization logic. This consistency across functions is beneficial for maintainability and understanding of the code.

The changes are well-implemented and documented. Testing should focus on ensuring that the normalization process does not introduce any errors in number extraction.

lingua_franca/lang/parse_common.py (2)

195-204: New Function: normalize_decimals

The normalize_decimals function is a critical addition that supports the uniform handling of decimal characters across different languages. The use of regex to replace the specified decimal character with a dot is a straightforward and effective approach.

This function is essential for the new functionality introduced in the number extraction functions. It is well-implemented and should work as expected with various decimal characters.


206-206: Modification in match_yes_or_no Function

The match_yes_or_no function now dynamically resolves the resource file based on the language code provided. This change improves the function's flexibility and its ability to handle language-specific nuances in user input.

The modification enhances the function's utility in a multilingual context. It is a positive change that aligns with the overall goals of supporting multiple languages more effectively.

lingua_franca/lang/parse_da.py (1)

Line range hint 882-898: Updated Functionality in extract_numbers_da

The extract_numbers_da function now accepts a decimal parameter, which is used to specify the character to use as the decimal point. This change increases the flexibility of the function and aligns it with similar updates in other language modules.

The implementation is consistent and well-documented. Testing should ensure that the normalization process correctly handles different decimal characters without affecting the accuracy of number extraction.

lingua_franca/lang/parse_sv.py (2)

20-20: Added import for normalize_decimals.

The addition of normalize_decimals from parse_common is necessary for the new functionality introduced in extract_number_sv to handle different decimal characters. This change is appropriate and aligns with the PR's objectives.


160-176: Review of extract_number_sv function changes.

  1. Parameter Addition: The addition of the decimal parameter with a default value of '.' is correctly implemented. This allows for flexibility in handling different decimal characters based on the user's locale.
  2. Decimal Normalization: The conditional use of normalize_decimals based on the decimal parameter is logical. It ensures that the input text is normalized only when necessary, which is efficient.

Overall, the changes are well-implemented and serve the intended purpose of enhancing number extraction capabilities across different locales.

lingua_franca/lang/parse_de.py (3)

24-24: Added import for normalize_decimals.

The import of normalize_decimals is crucial for the new functionality in handling different decimal characters in the German locale. This change is necessary and well-integrated into the existing codebase.


147-171: Review of extract_number_de function changes.

  1. Parameter Addition: The introduction of the decimal parameter with a default value of '.' is correctly implemented. This change allows for handling different decimal characters, enhancing the function's usability in various locales.
  2. Decimal Normalization: The use of normalize_decimals to adjust the input text based on the decimal parameter is a smart addition. It ensures that the function can accurately process numbers formatted with different decimal characters.

These changes are well-executed and align with the PR's objectives to enhance number extraction capabilities.


Line range hint 1018-1034: Review of extract_numbers_de function changes.

The addition of the decimal parameter to extract_numbers_de is consistent with the changes made in extract_number_de. This ensures uniformity across functions that handle number extraction. The implementation is straightforward and correctly uses normalize_decials when the decimal is not the default.

This change is well-implemented and supports the overall goal of the PR to handle numbers more flexibly across different locales.

lingua_franca/lang/parse_eu.py (1)

26-26: Import of normalize_decimals function.

The import of normalize_decimals from parse_common is necessary for the new functionality in extract_numbers_eu to handle different decimal separators. This is a good addition for supporting internationalization.

lingua_franca/lang/parse_fr.py (2)

26-26: Import of normalize_decimals function.

Just like in parse_eu.py, the import of normalize_decimals from parse_common is necessary for the new functionality in extract_number_fr and extract_numbers_fr to handle different decimal separators. Consistent use of this function across different language modules is a good practice.


Line range hint 1086-1106: Updated extract_numbers_fr function to support different decimal separators.

This function has been updated similarly to extract_number_fr, adding a decimal parameter to handle different decimal characters. The implementation and documentation should ensure that the behavior is consistent and clear to the users.

As with the single number extraction function, ensure the documentation accurately reflects how the decimal parameter influences the behavior.

lingua_franca/lang/parse_es.py (1)

Line range hint 284-304: Review of extract_numbers_es function signature and decimal handling

The function extract_numbers_es has been appropriately updated to include a decimal parameter. This change aligns with the updates to extract_number_es and supports consistency across the API. The delegation to extract_numbers_generic is a good practice, assuming that function is properly handling the decimal parameter as well.

It's important to ensure that extract_numbers_generic is also updated to respect the new decimal parameter to maintain consistency and correctness across the API.

Verification successful

Line range hint 284-304: Review of extract_numbers_es function signature and decimal handling

The function extract_numbers_es has been appropriately updated to include a decimal parameter. This change aligns with the updates to extract_number_es and supports consistency across the API. The normalize_decimals function ensures that the decimal parameter is handled correctly before calling extract_numbers_generic.

  • The normalize_decimals function in lingua_franca/lang/parse_common.py replaces the specified decimal character with a period, ensuring correct number extraction.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `extract_numbers_generic` to ensure it handles the `decimal` parameter correctly.

# Test: Search for the function definition and usage. Expect: Proper handling of the `decimal` parameter.
rg --type python -A 5 $'extract_numbers_generic'

Length of output: 78


Script:

#!/bin/bash
# Description: Verify the implementation of `extract_numbers_generic` to ensure it handles the `decimal` parameter correctly.

# Test: Search for the function definition and usage. Expect: Proper handling of the `decimal` parameter.
rg -A 5 'def extract_numbers_generic'

Length of output: 517


Script:

#!/bin/bash
# Description: Verify how `extract_numbers_es` handles the `decimal` parameter before calling `extract_numbers_generic`.

# Test: Search for the usage of `extract_numbers_generic` within `extract_numbers_es` and intermediary functions.
rg -A 5 'extract_numbers_generic' lingua_franca/lang/parse_es.py

Length of output: 307


Script:

#!/bin/bash
# Description: Verify the implementation of `normalize_decimals` to ensure it handles the `decimal` parameter correctly.

# Test: Search for the function definition of `normalize_decimals`.
rg -A 5 'def normalize_decimals'

Length of output: 470

Tools
Ruff

305-305: extract_numbers_generic may be undefined, or defined from star imports

(F405)

lingua_franca/lang/parse_nl.py (2)

28-28: Import statement added for normalize_decimals.

The addition of normalize_decimals from lingua_franca.lang.parse_common is consistent with the PR's goal to handle different decimal markers in number extraction functions.


Line range hint 1305-1324: Updated extract_numbers_nl function to handle custom decimal markers.

The function now correctly handles the decimal parameter to allow for flexible decimal markers, aligning with the PR objectives. The implementation of normalize_decimals to adjust the text based on the decimal parameter is consistent and appropriate.

Ensure that all calls to extract_numbers_nl in other parts of the codebase pass the correct decimal parameter if needed.

Verification successful

No calls to extract_numbers_nl found in the codebase.

The extract_numbers_nl function is defined correctly with the new decimal parameter. Since no calls to this function were found in the codebase, there are no further updates required.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `extract_numbers_nl` pass the correct `decimal` parameter.
# Test: Search for the function usage. Expect: Only occurances of the new signature with the `decimal` parameter.
rg --type python -A 5 $'extract_numbers_nl'

Length of output: 73


Script:

#!/bin/bash
# Description: Verify all function calls to `extract_numbers_nl` pass the correct `decimal` parameter.
# Test: Search for the function usage. Expect: Only occurrences of the new signature with the `decimal` parameter.
rg --type py -A 5 $'extract_numbers_nl'

Length of output: 497


Script:

#!/bin/bash
# Description: Search for any calls to the `extract_numbers_nl` function across the entire codebase.
rg 'extract_numbers_nl\('

Length of output: 135

lingua_franca/lang/parse_pl.py (2)

Line range hint 580-601: Review of extract_number_pl function:

  1. Parameter Addition: The addition of the decimal parameter with a default value of '.' is correctly implemented. This aligns with the PR's goal to handle different decimal markers.
  2. Logic to Handle Decimal Marker: The logic to replace the decimal marker using normalize_decimals when it is not '.' is sound. This ensures that the function can handle various decimal formats as intended.
  3. Documentation: The function's documentation has been updated to reflect the new parameter and its behavior, which is good for maintainability.

Overall, the changes are well-implemented and serve the intended purpose.


Line range hint 1343-1363: Review of extract_numbers_pl function:

  1. Parameter Addition: Similar to extract_number_pl, the decimal parameter has been added effectively with a default value of '.'.
  2. Handling of Decimal Marker: The function uses the normalize_decimals method to handle different decimal markers, which is consistent with the changes in extract_number_pl.
  3. Return Type: The function now returns a list of floats, ensuring that all numbers extracted are in a consistent format, which enhances the function's usability.

These changes are consistent with the PR's objectives and are correctly implemented.

lingua_franca/lang/parse_cs.py (2)

Line range hint 582-603: Review the implementation of extract_number_cs with the new decimal parameter.

The function extract_number_cs has been modified to accept a new parameter decimal, which specifies the character to use as a decimal point. The function uses normalize_decimals to replace the specified decimal character with a dot ('.') before proceeding with number extraction. This is a crucial change as it allows the function to handle different decimal separators used in various locales, which is essential for internationalization.

  • Correctness: The implementation correctly checks if the decimal parameter is different from '.' before applying normalization. This avoids unnecessary processing if the default is used.
  • Performance: Consider the impact of repeatedly calling normalize_decimals on large texts or in tight loops. If performance becomes an issue, optimizations or caching mechanisms might be needed.
  • Maintainability: The code is clear and modular, making it easy to understand and maintain.

Line range hint 1569-1589: Review the implementation of extract_numbers_cs with the new decimal parameter.

Similar to extract_number_cs, the extract_numbers_cs function now supports a decimal parameter to handle different decimal separators. This function extracts all numbers from the input text, normalizing them to use a dot as the decimal separator.

  • Correctness: The function handles the normalization of decimal separators correctly, ensuring that all extracted numbers are returned in a consistent format.
  • Performance: As with extract_number_cs, consider the performance implications of the normalization process.
  • Maintainability: The function maintains a consistent style with other parts of the module, using similar patterns and logic.
lingua_franca/lang/parse_ru.py (2)

31-31: Import addition: normalize_decimals

The addition of normalize_decimals from parse_common is appropriate given its usage in the modified extract_numbers_ru function to handle different decimal markers.


Line range hint 1581-1601: Enhancement in extract_numbers_ru for flexible decimal handling

The addition of the decimal parameter to extract_numbers_ru enhances flexibility by allowing different decimal markers. This is particularly useful for internationalization, where different locales use different characters for decimals.

  • Correctness: The function now uses normalize_decimals to adjust the text based on the provided decimal parameter before extracting numbers. This ensures that the function can adapt to various formats of decimal numbers across different languages.
  • Documentation: The function's documentation has been updated accordingly, which is good practice as it maintains clarity on how the function behaves with the new parameter.
  • Potential Issue: The note in the documentation states that numbers formatted with a decimal dot/full stop are always extracted, regardless of the decimal parameter. This could be misleading if the function behavior actually respects the decimal parameter. It's important to ensure that this behavior is either correctly documented or adjusted to match the documentation.
-        will always extract numbers formatted with a decimal dot/full stop,
-        such as '3.5', even if 'decimal' is specified.
+        will extract numbers formatted according to the 'decimal' parameter.
lingua_franca/lang/parse_en.py (1)

32-32: Added import statement for normalize_decimals

The addition of normalize_decimals from lingua_franca.lang.parse_common is appropriate given its usage in the extract_number_en and extract_numbers_en functions to handle different decimal characters. This aligns with the PR's goal of handling decimal markers effectively.

lingua_franca/lang/parse_pt.py (1)

81-102: Review of extract_number_pt function changes

  1. Parameter Addition: The addition of the decimal parameter is consistent with the PR's goal of handling different decimal markers. This is a good enhancement for supporting multiple locales.
  2. Documentation Update: The updated docstring clearly explains the new parameter, which is crucial for maintainability and usability.
  3. Functionality: The use of normalize_decimals function when decimal is not the default value is a logical approach to ensure that the input text is correctly formatted before number extraction. This should help in handling cases where different regions use different decimal markers.

Overall, the changes to this function are well-implemented and align with the objectives of the PR.

test/unittests/test_parse_en.py Show resolved Hide resolved
Comment on lines +60 to +77
def extract_number_es(text, short_scale=True, ordinals=False, decimal='.'):
"""
This function prepares the given text for parsing by making
numbers consistent, getting rid of contractions, etc.
This function extracts a number from a text string,
handles pronunciations in long scale and short scale

https://en.wikipedia.org/wiki/Names_of_large_numbers

Args:
text (str): the string to normalize
short_scale (bool): use short scale if True, long scale if False
ordinals (bool): consider ordinal numbers, third=3 instead of 1/3
decimal (str): character to use as decimal point. defaults to '.'
Returns:
(int) or (float): The value of extracted number
(int) or (float) or False: The extracted number or False if no number
was found
Note:
will always extract numbers formatted with a decimal dot/full stop,
such as '3.5', even if 'decimal' is specified.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of extract_number_es function signature and decimal handling

The function now correctly includes a decimal parameter allowing for flexible decimal character specification. The implementation of normalize_decimals based on this parameter is a good approach to handle locale-specific decimal characters.

However, the note in the documentation about always extracting numbers formatted with a decimal dot/full stop might be confusing. It suggests that despite the decimal parameter, the function may not respect it under certain conditions. This could benefit from further clarification to ensure users understand the behavior fully.

Comment on lines +373 to +394
def extract_number_fr(text, short_scale=True, ordinals=False, decimal='.'):
"""
This function extracts a number from a text string,
handles pronunciations in long scale and short scale

https://en.wikipedia.org/wiki/Names_of_large_numbers

Args:
text (str): the string to extract a number from
text (str): the string to normalize
short_scale (bool): use short scale if True, long scale if False
ordinals (bool): consider ordinal numbers, third=3 instead of 1/3
decimal (str): character to use as decimal point. defaults to '.'
Returns:
(str): The number extracted or the original text.
(int) or (float) or False: The extracted number or False if no number
was found
Note:
will always extract numbers formatted with a decimal dot/full stop,
such as '3.5', even if 'decimal' is specified.

"""
if decimal != '.':
text = normalize_decimals(text, decimal)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated extract_number_fr function to support different decimal separators.

The function now accepts a decimal parameter, which allows for handling different decimal characters. This change is crucial for supporting number extraction in French text that might use commas or other characters as decimal separators.

However, similar to parse_eu.py, there's a potential issue:

  • The comment on line 389 states that numbers will always be extracted with a decimal dot, regardless of the decimal parameter. This could lead to incorrect behavior if the user expects the function to handle different decimal characters as specified.

Consider revising the documentation or the implementation to ensure that the function respects the decimal parameter fully.

-        will always extract numbers formatted with a decimal dot/full stop,
+        will attempt to normalize numbers to use the specified decimal character,
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def extract_number_fr(text, short_scale=True, ordinals=False, decimal='.'):
"""
This function extracts a number from a text string,
handles pronunciations in long scale and short scale
https://en.wikipedia.org/wiki/Names_of_large_numbers
Args:
text (str): the string to extract a number from
text (str): the string to normalize
short_scale (bool): use short scale if True, long scale if False
ordinals (bool): consider ordinal numbers, third=3 instead of 1/3
decimal (str): character to use as decimal point. defaults to '.'
Returns:
(str): The number extracted or the original text.
(int) or (float) or False: The extracted number or False if no number
was found
Note:
will always extract numbers formatted with a decimal dot/full stop,
such as '3.5', even if 'decimal' is specified.
"""
if decimal != '.':
text = normalize_decimals(text, decimal)
will attempt to normalize numbers to use the specified decimal character,

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@JarbasAl JarbasAl closed this Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants