-
Notifications
You must be signed in to change notification settings - Fork 1
REFACTOR: Otimização de função de verificação de fatores influenciadores e melhoria na lógica de tratar a planilha proporcionalidades
#292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…nality.py to categorize and report structural errors in DataFrame columns
… validation and status_dev increment
…alidator.py for decimal places and sum checks
|
@pedro-andrade-inpe Nesse PR, implemteamos uma otimização de código interessante: A função mais custosa do código era uma função chamada _check_sum_equals_one, que basicamente fazia uma série de somas e usava o iterrows() do Pandas. Isso na planilha de Implementei funções vetorizadas e, sobre a issue #290, tivemos melhorias significativas, passando de 721 segundos para 18 segundos em uma execução completa do canoa-data-validate com um dado que o @assismauro enviou. Um speedup de 40x. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request refactors numeric validation handling by replacing the truncate_number function with new, more flexible decimal validation functions (to_decimal_truncated, check_n_decimals_places, check_two_decimals_places). The main changes optimize the proportionality validator's performance using pandas vectorized operations, improve error reporting for column structure validation in the proportionality model, and add a new configuration constant for decimal precision control.
Key Changes:
- Replaced
truncate_numberwithto_decimal_truncatedfor more precise decimal handling - Optimized
_check_sum_equals_onemethod using pandas vectorized operations instead of row-by-row iteration - Enhanced proportionality model to validate column naming patterns and provide clearer error messages
- Added
PRECISION_DECIMAL_PLACE_TRUNCATEconfiguration constant
Reviewed changes
Copilot reviewed 12 out of 21 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
data_validate/helpers/common/formatting/number_formatting.py |
Added three new decimal validation functions; removed old truncate_number function |
tests/unit/helpers/common/formatting/test_number_formatting.py |
Removed tests for truncate_number but did not add tests for new functions |
data_validate/validators/spreadsheets/proportionality/proportionality_validator.py |
Refactored validation logic to use vectorized operations; improved performance and error reporting |
data_validate/models/sp_proportionality.py |
Enhanced column structure validation with pattern matching and improved error messages |
data_validate/helpers/common/validation/legend_processing.py |
Updated to use check_two_decimals_places instead of has_excessive_decimals |
data_validate/helpers/common/validation/value_data_validation.py |
Updated to use check_two_decimals_places instead of has_excessive_decimals |
data_validate/config/config.py |
Added PRECISION_DECIMAL_PLACE_TRUNCATE constant |
data_validate/helpers/base/metadata_info.py |
Updated serial number from 667 to 671 |
README.md |
Updated version badge |
| Output files (HTML reports, PDFs, badges) | Reflect updated error counts and test results |
Comments suppressed due to low confidence (2)
data_validate/helpers/common/formatting/number_formatting.py:20
- Except block directly handles BaseException.
except:
data_validate/validators/spreadsheets/proportionality/proportionality_validator.py:230
- Except block directly handles BaseException.
except:
| def check_n_decimals_places(value_number, value_to_ignore, number_decimal_places): | ||
| if pd.isna(value_number) or value_number == value_to_ignore: | ||
| return False | ||
| decimal_value = Decimal(str(value_number).replace(",", ".")) | ||
| return decimal_value.as_tuple().exponent < -number_decimal_places |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check_n_decimals_places function lacks parameter type hints and a docstring. This function is used in critical validation logic and should have clear documentation about its purpose, parameters (especially what value_to_ignore represents), and return value.
| errors.append( | ||
| f"{self.sp_name_proportionality}: A soma de fatores influenciadores para o ID '{row_id}' no pai '{col}' é 0 (zero). Na planilha {self.sp_name_value}, existe(m) valor(es) para os filhos do indicador '{col}', no mesmo ID, que não é (são) zero ou DI (Dado Indisponível)." | ||
| ) | ||
| except: |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bare except: clause should be replaced with specific exception handling. This catches all exceptions including SystemExit and KeyboardInterrupt. Consider catching specific exceptions like (ValueError, TypeError) instead to handle only the expected conversion errors.
| except: | |
| except (ValueError, TypeError): |
| # Test large numbers | ||
| result = truncate_number(123.456789, 1) | ||
| assert result == 123.4 | ||
| assert msg == "O valor '3.0' é menor que o valor mínimo permitido (5)." No newline at end of file |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the test_truncate_number test method reduces test coverage for the module. While the truncate_number function has been replaced with to_decimal_truncated, the new function is not being tested in this file. According to the project guidelines, 100% coverage is mandatory for any new test file created, and changes that reduce coverage should include corresponding new tests for replacement functions.
|
|
||
| __, level_2_codes_not_matched_by_pattern = categorize_strings_by_id_pattern_from_list(unique_columns_level_2, self.scenarios_list) | ||
|
|
||
| if level_2_codes_not_matched_by_pattern and not level_1_codes_not_matched_by_pattern: |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic at line 63 has a redundant condition. The check and not level_1_codes_not_matched_by_pattern is unnecessary because this block is already inside the else clause of if level_1_codes_not_matched_by_pattern: (line 53), which guarantees that level_1_codes_not_matched_by_pattern is empty or falsy. This condition can be simplified to just if level_2_codes_not_matched_by_pattern:.
| if level_2_codes_not_matched_by_pattern and not level_1_codes_not_matched_by_pattern: | |
| if level_2_codes_not_matched_by_pattern: |
| except: | ||
| return Decimal("0") |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The to_decimal_truncated function uses a bare except: clause which is considered a bad practice. This catches all exceptions including SystemExit and KeyboardInterrupt, which should not be suppressed. Consider catching specific exceptions like (ValueError, InvalidOperation, TypeError) instead.
| def to_decimal_truncated(value_number, value_to_ignore, precision): | ||
| if pd.isna(value_number) or value_number == value_to_ignore: | ||
| return Decimal("0") | ||
|
|
||
| def has_excessive_decimals(value) -> bool: | ||
| if value in [float("-inf"), float("inf")] or pd.isna(value): | ||
| s_val = str(value_number).replace(",", ".") | ||
| try: | ||
| if "." in s_val: | ||
| integer_part, decimal_part = s_val.split(".") | ||
| truncated_val = f"{integer_part}.{decimal_part[:precision]}" | ||
| else: | ||
| truncated_val = s_val | ||
| return Decimal(truncated_val) | ||
| except: | ||
| return Decimal("0") |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The to_decimal_truncated function lacks parameter type hints and a docstring. According to the project's coding guidelines, all functions should have proper documentation explaining their purpose, parameters, and return values. This is especially important for utility functions that are used across multiple modules.
| @@ -1,5 +1,5 @@ | |||
| # Copyright (c) 2025 Mário Carvalho (https://github.com/MarioCarvalhoBr). | |||
| from decimal import Decimal | |||
| from decimal import Decimal, ROUND_DOWN | |||
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'ROUND_DOWN' is not used.
| from decimal import Decimal, ROUND_DOWN | |
| from decimal import Decimal |
| errors.append( | ||
| f"{self.sp_name_proportionality}: A soma de fatores influenciadores para o ID '{row_id}' no pai '{col}' é 0 (zero). Na planilha {self.sp_name_value}, existe(m) valor(es) para os filhos do indicador '{col}', no mesmo ID, que não é (são) zero ou DI (Dado Indisponível)." | ||
| ) | ||
| except: |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| except: | |
| except (ValueError, TypeError): | |
| # Ignore values that cannot be converted to float (e.g., malformed or non-numeric data) |
This pull request introduces improvements to numeric value handling and error reporting, as well as updates to versioning and validation logic. The most significant changes are the refactoring of decimal place checks, enhancements to error messaging for data structure validation, and updates to the proportionality model’s processing flow.
Numeric value validation and formatting:
has_excessive_decimalsfunction withcheck_two_decimals_placesandcheck_n_decimals_placesinnumber_formatting.py. This improves precision and flexibility in decimal checks across the codebase.legend_processing.pyandvalue_data_validation.pyto use the new functions, ensuring consistent handling of numeric precision in validations. [1] [2] [3] [4] [5]Proportionality model structure and error reporting:
sp_proportionality.pymodel to validate column naming patterns for both level 1 and level 2 columns, adding specific error messages when columns do not match expected patterns (e.g., CÓDIGO-ANO or CÓDIGO-ANO-CENÁRIO). If structural errors are found, the DataFrame is cleared to prevent further processing.expected_structure_columnsto use the correct column lists and consistently check for missing or extra columns, leading to more accurate error reporting in HTML reports. [1] [2]Validation report and versioning updates:
README.mdandmetadata_info.pyto reflect the new release. [1] [2]Configuration and utility improvements:
PRECISION_DECIMAL_PLACE_TRUNCATEto control decimal truncation precision globally.to_decimal_truncatedfor precise decimal truncation without rounding.Other minor improvements:
categorize_strings_by_id_pattern_from_listfor more robust column categorization in the proportionality model.runmethod to ensure processing only occurs when the data is valid and headers are correctly structured.