UofT-DSI | production - Assignment 3 #72

Closed
wants to merge 2 commits into from


sijiao-liu

What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)

I am attempting to use SHAP values to explain the predictions of the best-performing model. This involves:

  • Implementing code to explain the impact of features on a specific observation from the test set.
  • Identifying which features are most and least important across the entire training set.
  • Providing guidance on feature removal and performance testing.
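To make the first two goals concrete, here is a minimal sketch that computes exact Shapley values by brute force for a toy model. It does not use the SHAP library or the model from this PR: `predict`, `train`, and `x_test` are hypothetical stand-ins, and the brute-force loop is only practical for a handful of features. Features left out of a coalition are filled in from the background rows, which is one common choice of value function.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one observation x (brute force over coalitions)."""
    n = len(x)

    def value(coalition):
        # Average prediction when features in the coalition come from x
        # and the remaining features come from each background row.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model: the third feature has no effect on the prediction.
def predict(row):
    return 2.0 * row[0] - 1.0 * row[1] + 0.0 * row[2]


train = [[0.0, 0.0, 5.0], [2.0, 4.0, 1.0], [4.0, 2.0, 3.0]]
x_test = [3.0, 1.0, 9.0]

# Local explanation: contribution of each feature to one test prediction.
local = shapley_values(predict, x_test, train)

# Global importance: mean absolute Shapley value over the training rows.
all_phi = [shapley_values(predict, row, train) for row in train]
global_importance = [
    sum(abs(phi[i]) for phi in all_phi) / len(all_phi) for i in range(len(x_test))
]
```

The local values sum to the difference between the prediction for `x_test` and the average prediction over `train`, and the feature with the smallest mean absolute value is the natural first candidate for removal.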

What did you learn from the changes you have made?

  • I learned that the SHAP library can be quite resource-intensive, and using it in restricted environments may cause issues related to library compatibility (e.g., CUDA or GPU dependencies).
  • I also realized the importance of ensuring that the data passed to SHAP matches the transformed format that the model uses, especially when categorical features have been one-hot encoded.
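The second point can be illustrated with a small sketch, assuming one numeric column and one categorical column that has been one-hot encoded. The column names and the helpers `one_hot_feature_names` and `label_shap_row` are hypothetical. In a scikit-learn pipeline, the fitted transformer's `get_feature_names_out()` would normally supply the expanded names instead.

```python
def one_hot_feature_names(numeric_cols, categories):
    """Build the feature names the model sees after one-hot encoding."""
    names = list(numeric_cols)
    for col, levels in categories.items():
        names.extend(f"{col}_{level}" for level in levels)
    return names


def label_shap_row(shap_row, feature_names):
    """Pair each SHAP value with its transformed feature name, or fail loudly."""
    if len(shap_row) != len(feature_names):
        raise ValueError(
            f"{len(shap_row)} SHAP values but {len(feature_names)} transformed features"
        )
    return dict(zip(feature_names, shap_row))


# Hypothetical raw schema: 2 raw columns become 3 transformed features.
names = one_hot_feature_names(["age"], {"color": ["red", "blue"]})
labeled = label_shap_row([0.5, -0.2, 0.1], names)
```

Checking the widths before plotting turns a confusing downstream shape error into an explicit message about which side is out of step.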

Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?

  • Using KernelExplainer from SHAP, which is model-agnostic and does not require specialized GPU support, as a fallback.
  • Providing manual feature importance analysis using model coefficients (for linear models) or feature importances from tree-based models if SHAP visualizations continued to fail.
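The coefficient-based fallback could look roughly like the sketch below. It assumes a linear model whose coefficients are already available and scales each coefficient by the spread of its feature so that features on different scales are comparable. The coefficients, rows, and feature names are made up for illustration.

```python
from statistics import pstdev


def coefficient_importance(coefs, rows, names):
    """Rank features by |coefficient| times the feature's standard deviation."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = abs(coefs[j]) * pstdev(column)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical fitted coefficients and transformed training rows.
names = ["age", "income", "constant_flag"]
coefs = [2.0, -1.0, 5.0]
rows = [[0.0, 0.0, 1.0], [2.0, 4.0, 1.0], [4.0, 2.0, 1.0]]

ranking = coefficient_importance(coefs, rows, names)
```

Here `constant_flag` ranks last despite its large coefficient, because a feature that never varies cannot change the prediction from one row to the next.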

Were there any challenges? If so, what issue(s) did you face? How did you overcome it?

Challenges Faced:

  • Encountering compatibility issues with libraries that require GPU resources, leading to errors when importing SHAP or running specific visualizations.
  • Dealing with a dimension mismatch when trying to match SHAP values with the transformed features.

Overcoming the Challenges:

  • I fixed the dimension mismatch by ensuring that the features passed to SHAP were in the same transformed format used by the model.
  • For the CUDA errors, I planned to switch to a simpler SHAP setup that does not require GPU dependencies.
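One way to keep the analysis running while the GPU-related errors are sorted out is to guard the import and fall back to another importance method when it fails. This is a sketch under the assumption that the failure happens at import time; `load_shap` and `backend` are hypothetical names, and the exact exception raised will depend on the environment.

```python
def load_shap():
    """Return the shap module, or None if importing it fails for any reason."""
    try:
        import shap  # may fail with ImportError or a lower-level dependency error
    except Exception as exc:  # broad on purpose: the failure type is environment-specific
        print(f"SHAP unavailable ({type(exc).__name__}); using fallback importances")
        return None
    return shap


shap_module = load_shap()
backend = "shap" if shap_module is not None else "fallback"
```

Downstream code can then branch on `backend` instead of crashing at the top of the notebook.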

How were these changes tested?

The changes were tested by:

  • Running train-test split evaluations to confirm that the model pipelines worked correctly.
  • Using SHAP to generate explanations, although visual outputs were difficult to render in the current environment.
  • Debugging errors and validating the data transformations to ensure correctness.

A reference to a related issue in your repository (if applicable)

Checklist

  • I can confirm that my changes are working as intended


github-actions bot commented Nov 5, 2024

Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.

sijiao-liu closed this on Nov 5, 2024