5. Ethics
- A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
- This data consists of ground truth from the Georgia Power Solar Project and features from the North American Mesoscale Forecast System (NAM); it does not involve any human subjects.
- A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
- The data is collected automatically from the Georgia Power Solar Project and the NAM forecast system rather than through surveys or human participants, so the usual sources of collection bias do not apply.
- A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
- This dataset does not contain any personal information, so exposure of personally identifiable information is not a concern.
- B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
- The original dataset is not shared with anyone. The data is encrypted using `pyAesCrypt` and a randomly generated key. The encrypted files and the key are stored in separate GCP buckets with restricted access, shared only selectively (see the sketch below).
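As a rough illustration of this approach, a minimal `pyAesCrypt` sketch follows; the file names and key handling here are hypothetical, not the project's actual values:

```python
import secrets

import pyAesCrypt

# Hypothetical paths for illustration only.
PLAINTEXT = "solar_data.csv"
ENCRYPTED = "solar_data.csv.aes"

# Randomly generated key used as the encryption password.
key = secrets.token_urlsafe(32)

# Encrypt with a 64 KiB buffer; the encrypted file and the key would then
# be uploaded to separate, access-restricted GCP buckets.
pyAesCrypt.encryptFile(PLAINTEXT, ENCRYPTED, key, 64 * 1024)

# To decrypt later (by someone holding the key):
# pyAesCrypt.decryptFile(ENCRYPTED, PLAINTEXT, key, 64 * 1024)
```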
- B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
- The data does not contain any personal information, so such requests do not apply. If needed and prompted, data for a given year or day can be removed from use.
- B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?
- Yes. The decrypted data is deleted from the GCP bucket immediately after the models are trained and predictions are made, as sketched below.
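A minimal sketch of the deletion step, assuming the `google-cloud-storage` client library and hypothetical bucket and prefix names:

```python
from google.cloud import storage

# Hypothetical names for illustration; the real bucket and layout differ.
BUCKET_NAME = "solar-project-data"
DECRYPTED_PREFIX = "decrypted/"

def delete_decrypted_data() -> None:
    """Remove decrypted objects once training and prediction are done."""
    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    for blob in bucket.list_blobs(prefix=DECRYPTED_PREFIX):
        blob.delete()
```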
- C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
- We discussed potential blind spots with Georgia Power and Dr. Frederick Maier, Associate Director of the Institute for Artificial Intelligence. Releasing the results but not the model and data themselves prevents outsiders from using them for personal gain.
- C.2 Dataset bias: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
- The data is collected from solar farms. Apart from geographical bias due to availability, it does not perpetuate any bias. Since this is a regression problem, imbalanced classes are not a concern.
- C.3 Honest representation: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
- Yes. The results were obtained after training and testing the models multiple times to ensure they are reproducible with the same parameters.
- C.4 Privacy in analysis: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
- This data does not contain any PII. We also renamed the features in the data so that the data alone does not reveal any useful information.
- C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
- Yes. All steps needed to reproduce the results are documented in README.md. The program describes each command line argument (in the style sketched below), and the wiki contains further descriptions of the application.
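For illustration, a self-documenting command line interface might use `argparse`; the argument names below are hypothetical, not the project's actual flags:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Train solar power prediction models and produce forecasts.")
parser.add_argument("--data-path", required=True,
                    help="Path to the decrypted training data.")
parser.add_argument("--model", default="random_forest",
                    help="Name of the regression model to train.")
args = parser.parse_args()
# Running the script with --help then prints these descriptions automatically.
```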
- D.1 Proxy discrimination: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
- The models are trained on various distinct features, none of which are discriminatory. All features are scaled so that no single feature or variable dominates (see the sketch below). No proxy variables are used; each variable simply represents an observation of solar farms or weather in a particular region.
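A minimal sketch of the scaling step, using scikit-learn's `StandardScaler` with toy values for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: columns on very different scales
# (e.g., irradiance in W/m^2 and wind speed in m/s).
X_train = np.array([[800.0, 3.2], [150.0, 7.5], [430.0, 1.1]])

# Fit on training data only, then reuse the same transform on test data,
# so no feature dominates purely because of its units.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
```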
- D.2 Fairness across groups: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
- The dataset does not contain any information that can be associated with any individual or group of individuals.
- D.3 Metric selection: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
- We implemented cross-validation to minimize over-fitting of the models on the data. We also used several metrics, namely R-Squared, Root Mean Squared Error, and Mean Absolute Error, to compare the performance of different models (see the sketch below).
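A minimal sketch of this evaluation setup, with synthetic data standing in for the NAM features and solar-farm ground truth; the model choice here is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(random_state=0)

# Cross-validation guards against over-fitting to a single split.
print("CV R-squared:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())

# Complementary metrics on a held-out portion.
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])
print("R-squared:", r2_score(y[150:], pred))
print("RMSE     :", mean_squared_error(y[150:], pred) ** 0.5)
print("MAE      :", mean_absolute_error(y[150:], pred))
```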
- D.4 Explainability: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
- Our model captures variance in physical-world properties such as seasons, the day/night distinction, and the relative position of the sun. Predictions can therefore be explained in terms of the accuracy achieved across seasons or times of day, which we can relate back to those physical-world properties (see the sketch below).
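A minimal sketch of how such an explanation could be produced, grouping held-out predictions by season; the data here is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

# Synthetic prediction log; in practice this would hold the model's
# hourly forecasts alongside the observed values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=8760, freq="h"),
    "actual": rng.random(8760),
})
df["predicted"] = df["actual"] + rng.normal(scale=0.1, size=8760)

# Map months to seasons (0=winter, 1=spring, 2=summer, 3=autumn) and
# report accuracy per season, tied back to physical drivers.
df["season"] = df["timestamp"].dt.month % 12 // 3
for season, grp in df.groupby("season"):
    print(f"season {season}: R-squared = "
          f"{r2_score(grp['actual'], grp['predicted']):.3f}")
```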
- D.5 Communicate bias: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
- Yes. The conclusions section of our project discusses the limits, shortcomings, and biases of our models, and we will share this report with all the stakeholders.
- E.1 Redress: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
- All users of our project are stakeholders, and we have communicated the shortcomings, limitations, and biases of our models to them. The project's predictions do carry economic risk if too much confidence is placed in them; our models should therefore be right roughly 85% of the time (the R-squared value for a given hour, expressed as a percentage).
- E.2 Roll back: Is there a way to turn off or roll back the model in production if necessary?
- We use version control to roll back to the previous stable model if the present model is found to have glitches.
- E.3 Concept drift: Do we test and monitor for concept drift to ensure the model remains fair over time?
- The models make predictions for the next 24 hours based on present data, and the training data is updated every hour with new features and ground truth. Any unforeseen changes in the target variable are therefore reflected in the updated training data (see the sketch below).
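A minimal sketch of the hourly update, plus a simple drift check; the function names, window, and threshold are assumptions for illustration, not project-specified values:

```python
import pandas as pd

def hourly_update(train_df: pd.DataFrame,
                  new_features: pd.DataFrame,
                  new_ground_truth: pd.Series) -> pd.DataFrame:
    """Append the latest hour of features and ground truth before retraining."""
    new_rows = new_features.assign(target=new_ground_truth.values)
    return pd.concat([train_df, new_rows], ignore_index=True)

def drift_suspected(errors: pd.Series, window: int = 24,
                    threshold: float = 2.0) -> bool:
    """Flag drift when recent absolute error far exceeds the long-run mean.

    The 24-hour window and 2x threshold are illustrative assumptions.
    """
    recent = errors.tail(window).abs().mean()
    baseline = errors.abs().mean()
    return bool(recent > threshold * baseline)
```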
- E.4 Unintended use: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
- Unintended use is prevented by limiting access to the model, which is shared only with selected people.
Data Science Ethics Checklist generated with deon.