The predicted probabilities of some issues are 99.99% #44

Open
mcxwx123 opened this issue Oct 25, 2022 · 0 comments
Labels: bug, scope: model

@mcxwx123 (Collaborator) commented:
Currently, many open issues in some projects have a GFI probability of 99.99%, and some of these issues clearly should not be marked as GFI.
[screenshot: open issues with predicted GFI probabilities of 99.99%]
The performance metric of the model is also unusually high.
[screenshot: unusually high model performance metrics]
I examined the code and found two features that may be problematic. The first is 'created_at_timestamp', which is not one of the features and should not be included in X (def get_x_y() in gfibot/model/utils.py). The second is 'rpt_gfi_ratio': when I drop this feature, the model's performance metrics drop significantly.

The problems can be solved by the following steps:

  1. Add 'created_at_timestamp' to the list of non-feature columns dropped in get_x_y():
    ["owner", "name", "number", "is_gfi", "created_at", "closed_at"]
  2. The gfi_ratio and gfi_num features should be calculated from a new issue list that only includes issues closed before the data collection time t. These features are computed by:
    def _get_newcomer_ratio(n_user_commits: List[int], newcomer_thres: int) -> float:
    def _get_newcomer_num(n_user_commits: List[int], newcomer_thres: int) -> int:

    Currently we use the following list:
    issues = [i for i in user.issues if i.created_at <= t]
    A new list such as issues = [i for i in user.issues if i.closed_at <= t] should be created and used when calculating gfi_ratio and gfi_num (see the sketch after this list).
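A minimal sketch of both changes, assuming the general shape of get_x_y() and the reporter-feature computation. NON_FEATURE_COLUMNS, get_reporter_gfi_features, and the n_resolver_commits field are illustrative names, and the bodies of _get_newcomer_ratio / _get_newcomer_num are guesses based only on their signatures, not the actual GFI-Bot code:

```python
from datetime import datetime
from typing import List, Tuple

import pandas as pd

# Step 1 (sketch): drop 'created_at_timestamp' together with the other non-feature columns.
NON_FEATURE_COLUMNS = [
    "owner", "name", "number", "is_gfi",
    "created_at", "closed_at", "created_at_timestamp",
]

def get_x_y(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """Split a training frame into features X and label y, excluding non-feature columns."""
    y = df["is_gfi"]
    x = df.drop(columns=NON_FEATURE_COLUMNS, errors="ignore")
    return x, y

def _get_newcomer_ratio(n_user_commits: List[int], newcomer_thres: int) -> float:
    """Fraction of issues resolved by users with fewer than newcomer_thres commits (sketch)."""
    if not n_user_commits:
        return 0.0
    return sum(n < newcomer_thres for n in n_user_commits) / len(n_user_commits)

def _get_newcomer_num(n_user_commits: List[int], newcomer_thres: int) -> int:
    """Number of issues resolved by users with fewer than newcomer_thres commits (sketch)."""
    return sum(n < newcomer_thres for n in n_user_commits)

# Step 2 (sketch): only consider issues already closed at time t, so no
# information from the future leaks into gfi_ratio and gfi_num.
def get_reporter_gfi_features(user, t: datetime, newcomer_thres: int) -> Tuple[float, int]:
    issues = [i for i in user.issues if i.closed_at is not None and i.closed_at <= t]
    n_user_commits = [i.n_resolver_commits for i in issues]  # illustrative field name
    gfi_ratio = _get_newcomer_ratio(n_user_commits, newcomer_thres)
    gfi_num = _get_newcomer_num(n_user_commits, newcomer_thres)
    return gfi_ratio, gfi_num
```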

After the above features are corrected, most prediction probabilities may instead end up close to 0 because of the imbalance between positive and negative instances in the training data. This can be addressed by balancing the training dataset with methods such as SMOTE or ADASYN (a sketch follows). We can then check whether the '99.99% probabilities' problem is solved.
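
A minimal sketch of the balancing step using imbalanced-learn, assuming scikit-learn-style X/y arrays; balance_training_set and its place in the training pipeline are illustrative, not existing GFI-Bot code:

```python
from imblearn.over_sampling import ADASYN, SMOTE

def balance_training_set(x_train, y_train, method: str = "smote", random_state: int = 0):
    """Oversample the minority (GFI) class so positives and negatives are roughly balanced."""
    sampler = (
        SMOTE(random_state=random_state)
        if method == "smote"
        else ADASYN(random_state=random_state)
    )
    return sampler.fit_resample(x_train, y_train)

# Usage (sketch): resample only the training split, never the test split,
# then retrain and re-check the distribution of predicted probabilities.
# x_bal, y_bal = balance_training_set(x_train, y_train)
```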
