Created unit testing for analysis and bigquery2pandas #54

Open · wants to merge 11 commits into base: master

Conversation

CGNx (Contributor) commented Sep 20, 2016

Unit testing works by comparing previous runs of a given analysis with the current run in a single BigQuery query: the last analysis run is appended, the two runs are compared, and the difference is appended to the final unit-test table. The test courses are kept private.
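As a rough sketch of that comparison (not the actual test SQL; the table and column names below are hypothetical placeholders), a single query can join the previous and current runs and report the percentage change per column:

comparison_sql = """
SELECT
  prev.course_id,
  100 * (cur.n_flagged_pairs - prev.n_flagged_pairs) / prev.n_flagged_pairs AS pct_change
FROM dataset.analysis_run_previous AS prev
JOIN dataset.analysis_run_current AS cur
  ON prev.course_id = cur.course_id
"""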

bigquery2pandas is a library for interacting with BigQuery using pandas. SQL2df is the most frequently used function; it creates a correctly typed, correctly ordered pandas DataFrame from a SQL query. Estimated time to completion and other useful features are supported.
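A typical call might look like the following (a sketch only; the exact import path, signature, and argument handling may differ from the library code, and the query itself is hypothetical):

from edx2bigquery.edx2bigquery import bigquery2pandas as b2p

# Run a query and get back a pandas DataFrame with correctly typed, correctly ordered columns.
df = b2p.SQL2df("SELECT course_id, nregistered FROM [dataset.course_stats]")
print df.dtypes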

CGNx (Contributor, Author) commented Sep 24, 2016

HOW TO RUN ANALYSIS TESTS

The tests report the average percentage change for a sample of columns across five test courses whenever a change is made. Here is driver code to run the tests:

from edx2bigquery.edx2bigquery.bigquery2pandas import analysis_unit_tests

test_course_ids = analysis_unit_tests.fetch_test_course_ids()

update_msg = "Whatever the most recent update to the code is - keep it short, this will be added to the table"
analysis_unit_tests.ans_coupling_test1('dataset', test_course_ids=test_course_ids, what_changed=update_msg)
analysis_unit_tests.sab_test1("dataset", test_course_ids=test_course_ids, what_changed=update_msg)
analysis_unit_tests.cameo_test1("dataset", test_course_ids=test_course_ids, what_changed=update_msg)
print 'Done'

WILSON'S INTERVAL FOR RANKING CAMEO CHEATING AND COLLABORATION
CAMEO - show_ans_before
Collaboration - ans_coupling

The Wilson's Interval score provides a single value that ranks master, harvester pairs.
The score combines a negative and a positive score for each student into a confidence-based
measure.

The interpretability of the ranking is based on the features used to compute the Wilson's Interval
score. In the "show_ans_before" case, the score ranks user pairs based on their likelihood of
copying via CAMEO. In the "ans_coupling" case, the score ranks user pairs based on their
likelihood of answering problems together in pairs or groups (whether by copying or working
together).

IMPORTANT: The positive and negative scores are generated by first normalizing the features,
then combining them linearly with weights. How are these weights computed? A boosted logistic
regression classifier with regularization (with cross-validation to find parameters) is trained
on a random sample of 1 million master, harvester pairs. CAMEO cheating labels found using a
hand-tuned composite of five filtering algorithms are used as binary labels. The training set
uses the same features as those comprising the negative and positive scores. Features are
standardized using min-max scaling.
This process is repeated 1000 times. The trained weights are also standardized at each iteration,
and then all 1000 trained, standardized weights are averaged to produce the final weights. These
weights represent the predictive power of each of the features and are used in the linear
combination for the positive and negative scores. The positive and negative scores are combined
using Wilson's interval to produce the final CAMEO ranking.
Since the weights are trained on CAMEO labels, not collaboration labels, the Wilson's Interval
ranking is optimized for "show_ans_before", not "ans_coupling." However, the two tables
are nearly identical in structure, only with different semantics, making the Wilson's Interval
ranking highly relevant to "compute_ans_coupling".
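A condensed sketch of that fitting loop is below. Assumptions: the feature matrix X and binary CAMEO labels y are already assembled as numpy arrays; a plain L2-regularized logistic regression with cross-validation stands in for the boosted classifier; and the per-iteration weight standardization shown is just one plausible choice, since the exact scheme is not spelled out here.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import MinMaxScaler

def fit_feature_weights(X, y, n_iter=1000, sample_size=1000000):
    all_weights = []
    for _ in range(n_iter):
        # random sample of master, harvester pairs
        idx = np.random.choice(len(X), size=min(sample_size, len(X)), replace=False)
        Xs = MinMaxScaler().fit_transform(X[idx])                  # min-max standardize features
        clf = LogisticRegressionCV(penalty='l2').fit(Xs, y[idx])   # CV picks the regularization strength
        w = clf.coef_.ravel()
        all_weights.append(w / np.abs(w).sum())                    # standardize the trained weights
    return np.mean(all_weights, axis=0)                            # average over all iterations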

The Wilson's Interval is used to sort these analysis tables. The top row in the table therefore
represents the most statistically significant pair of users in the table, relevant to whichever
metric the table captures.
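For reference, here is a minimal sketch of a Wilson score lower bound used as such a ranking value, treating the positive and negative scores for a pair as pseudo-counts (the exact combination used in the analysis tables may differ):

from math import sqrt

def wilson_lower_bound(pos, neg, z=1.96):
    # Lower bound of the Wilson score interval for the proportion pos / (pos + neg),
    # at roughly 95% confidence when z = 1.96.
    n = pos + neg
    if n == 0:
        return 0.0
    p = float(pos) / n
    return (p + z*z/(2*n) - z*sqrt((p*(1 - p) + z*z/(4*n)) / n)) / (1 + z*z/n)

Sorting rows by this value in descending order yields the ranking described above.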

@@ -57,7 +57,9 @@ def get_creds(verbose=False):
        print "service_acct=%s, key_file=%s" % (SERVICE_ACCT, KEY_FILE)
        return get_service_acct_creds(SERVICE_ACCT, KEY_FILE)
    elif KEY_FILE=='USE_GCLOUD_AUTH':
        return get_gcloud_oauth2_creds()
Contributor review comment:

Instead of overwriting what is done for USE_GCLOUD_AUTH, could you please make this a different option, e.g. USE_GOOGLE_CREDENTIALS?
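A hypothetical sketch of that suggestion (the new KEY_FILE value and helper name are illustrative only, not actual edx2bigquery code):

    elif KEY_FILE=='USE_GCLOUD_AUTH':
        return get_gcloud_oauth2_creds()              # existing behaviour, left untouched
    elif KEY_FILE=='USE_GOOGLE_CREDENTIALS':          # new, separate option
        return get_google_default_creds()             # hypothetical helper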

maxliu mentioned this pull request Jan 3, 2018
maxliu and others added 2 commits April 17, 2018 13:15
Add ABS for HASH and remove INTEGER for sa_ca_dt_corr_ordered and sa_ca_dt_correlation.
1) HASH in the SQL query might return a negative integer. Add ABS to avoid it.
2) The sa_ca_dt_corr_ordered and sa_ca_dt_correlation should be real numbers between -1 and 1, e.g. 0.99993. INTEGER(sa_ca_dt_corr_ordered) will return only 0, 1, or -1.
add ABS for HASH and remove INTEGER for corr
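Illustrative before/after fragments for the two fixes described in these commits (the expressions and column usage here are hypothetical; legacy BigQuery SQL assumed):

hash_old = "HASH(CONCAT(username, course_id)) % 1000"        # HASH can return a negative integer
hash_new = "ABS(HASH(CONCAT(username, course_id))) % 1000"   # ABS keeps the result non-negative
corr_old = "INTEGER(sa_ca_dt_corr_ordered)"                  # truncates e.g. 0.99993 to 0
corr_new = "sa_ca_dt_corr_ordered"                           # keep the correlation as a FLOAT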