11-06-21: I think the data update steps of this are now automated via Airflow (https://github.com/moj-analytical-services/airflow-pq-tool). At least, I've just checked the tool and (1) it's still working, and (2) it has recent questions in it (ST)
This is a prototype tool for analysing and comparing written Parliamentary Questions for answer by the Ministry of Justice. Questions have been taken from the API provided by Parliament (accessed via http://www.data.parliament.uk/).
The tool allows the user to input a new question, or a key phrase, and produces a score and ranking of similarity between the input and the bank of past PQs. It also groups questions under 'topics' based on similar subject matter.
The tool is written in R and is based on a technique called Latent Semantic Analysis. For more information, or to provide any feedback/ideas please send an email to [email protected]
To access the deployed tool within the Ministry of Justice go to https://pq-tool.apps.alpha.mojanalytics.xyz/. If you are not from the MoJ, you can fork and run locally. There is also an external-facing tool that contains questions from the MoJ and selected other departments, located at https://pq-tool-external.apps.alpha.mojanalytics.xyz/. Access to it is via approved email addresses: if you wish to have a look, please email the above address. Alternatively, use the GTrebase branch of this repo, which holds the code corresponding to the external tool.
This version of the app stores data in the AWS S3 bucket 'alpha-app-pq-tool'. If there are any errors with mismatching questions in archived_pqs.csv, check to make sure that the mismatching questions look sensible (for example a change in an MP's political party), then rename or delete archived_pqs.csv and then re-run the scraping commands below.
Variables in block capitals are defined in .Rprofile because they're used in several different R files. This should load automatically whenever you start a new R session from the command line. If you make changes to .Rprofile, remember that you will either need to open a new R session to load the changes or run source('.Rprofile')
Rscript data_generators/getTheData.R
Rscript data_generators/DataCreator.R -e prod
These two lines will create or update the following files:
- getTheData.R: creates (or updates) Data/archived_pqs.csv
- DataCreator.R: creates (or updates) searchSpace.rda, MoJwrittenPQs.csv, topDozenWordsPerTopic.csv, and topDozenWordsPerMember.csv
Rscript data_generators/getTheData.R
This runs the file data_generators/getTheData.R, which contains code to run the following with show_progress = TRUE.
source('./R/apiClient.R')
# Without feedback
fetch_questions()
# With feedback
fetch_questions(show_progress = TRUE)
- When the fetch_questions() function is called for the first time, and no archive exists, it will create archived_pqs.csv in the s3 directory and download all answered PQs that were posed to the MoJ from http://lda.data.parliament.uk/answeredquestions. This takes about 8.5 minutes on a 2016 MacBook Pro.
- When an archive already exists, the function will update archived_pqs.csv by appending newly answered questions (downloaded from the same endpoint).
- Variables in BLOCK_CAPITALS are defined in .Rprofile
From the command line you can run
Rscript tests/TestQs.R
This will download the most recent 2000 questions and check that they are all in your archived_pqs.csv file. If you want a different number from 2000 you can define it using the argument -n, so for example to get 7000 instead, do
Rscript tests/TestQs.R -n 7000
If any of the remotely downloaded questions fail to match a question in the archive in every particular (e.g. if any are missing from the archive or the archive has the data wrong), those questions will be written to Data/nonMatchingQuestions.csv, which will be generated for you.
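As a rough illustration of what matching "in every particular" means, the sketch below does an exact whole-row comparison between two CSV files in shell. It is not the logic in tests/TestQs.R: the helper name non_matching_rows and any file names passed to it are hypothetical, and it assumes one record per line with no embedded newlines.

```shell
# Hypothetical helper, not part of this repo: print every row of the
# downloaded file that has no character-for-character identical row
# in the archive.
non_matching_rows() {
  archive="$1"
  downloaded="$2"
  # -F: treat archive rows as fixed strings, -x: match whole lines only,
  # -v: invert, so only rows absent from the archive are printed.
  # grep exits with 1 when it prints nothing; treat that as success.
  grep -Fxv -f "$archive" "$downloaded" || [ "$?" -eq 1 ]
}
```

Under these assumptions a row in which a single field differs (for example an MP's party) is reported, as is a row that is missing from the archive altogether.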
There are three files that create the data, within the data_generators folder.
- MoJScraper.R: Previously we scraped the parliament website to get our data, but now we use the API, so this file is no longer used, but is included for completeness.
- DataCreator.R: This does the work of getting and manipulating the data. See below for details of how to run it.
- MPClustering.R: This is a work in progress and is not yet used in the tool.
DataCreator.R generates the following:
- The search space.
- A new csv of questions with cluster assignments.
- A new csv of the 12 most significant terms in each cluster.
- A new csv of the 12 most significant terms for each MP/Peer.
Five arguments can be passed to the DataCreator.R script. The environment flag -e can be used as a shortcut to set sensible values for input (-i), output (-o), the number of dimensions (-x) and K (-k), for the two most common use cases:
- Quickly generating a small data set for testing purposes while avoiding overwriting production data.
- Generating the full data set for use in production, overwriting previously generated production data.
Input, output, x and K can also be set individually, but if environment is also set, it will override them.
Environment (test/prod)
- A shortcut to set values for the other arguments in one go.
- Use -e test OR -e prod
Input file (questions)
- When -e test: "${SHINY_ROOT}/tests/testthat/examples/lsa_training_sample.csv"
- When -e prod: "${SHINY_ROOT}/Data/archived_pqs.csv"
- Set to something else using -i or --input_file
Output directory (where the new data files are saved)
- When -e test: "${SHINY_ROOT}/tests/testthat/examples/"
- When -e prod: "${SHINY_ROOT}/Data/"
- Set to something else using -o or --output_dir
Number of dimensions for rank-reduced space (x)
- When -e test: 100
- When -e prod: 2000
- Set to something else using -x or --x_dims
Number of clusters (k)
- When -e test: 100
- When -e prod: 1000
- Set to something else using -k or --k_clusters
- Defaulting to -e test
# From the command line
Rscript ./data_generators/DataCreator.R
# From an R console
system("Rscript ./data_generators/DataCreator.R")
- For production
# From the command line
Rscript ./data_generators/DataCreator.R -e prod
# From an R console
system("Rscript ./data_generators/DataCreator.R -e prod")
This takes about 11 minutes on a 2016 MacBook Pro.
- With specific args
# From the command line
Rscript ./data_generators/DataCreator.R -i Data/archived_pqs.csv -o Data -k 1000 -x 2000
# From an R console
system("Rscript ./data_generators/DataCreator.R -i Data/archived_pqs.csv -o Data -k 1000 -x 2000")
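The precedence rule described above (an environment, when given, wins over individually set values) can be sketched in shell as follows. This is an illustration of the documented behaviour only, not code taken from DataCreator.R: the function name resolve_args is hypothetical, and what the real script does when only some values are supplied without -e is not stated here, so the sketch simply leaves the others empty.

```shell
# Hypothetical sketch, not code from DataCreator.R: work out the effective
# settings from the flags, letting -e override anything set individually.
# The body runs in a subshell, so no variables leak into the caller.
resolve_args() (
  root="${SHINY_ROOT:-.}"
  environment="" input="" output="" x="" k=""
  OPTIND=1
  while getopts "e:i:o:x:k:" opt; do
    case "$opt" in
      e) environment="$OPTARG" ;;
      i) input="$OPTARG" ;;
      o) output="$OPTARG" ;;
      x) x="$OPTARG" ;;
      k) k="$OPTARG" ;;
      *) exit 2 ;;
    esac
  done
  # With no flags at all, behave like -e test (the documented default).
  if [ -z "$environment$input$output$x$k" ]; then
    environment="test"
  fi
  case "$environment" in
    test)
      input="$root/tests/testthat/examples/lsa_training_sample.csv"
      output="$root/tests/testthat/examples/"
      x=100; k=100 ;;
    prod)
      input="$root/Data/archived_pqs.csv"
      output="$root/Data/"
      x=2000; k=1000 ;;
    "") ;;  # no environment given: keep the individually supplied values
    *) echo "unknown environment: $environment" >&2; exit 2 ;;
  esac
  printf 'input=%s\noutput=%s\nx=%s\nk=%s\n' "$input" "$output" "$x" "$k"
)
```

With this sketch, resolve_args -e prod -k 5 reports k=1000 rather than k=5, which is the overriding behaviour the text warns about.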
- Clone the Repo
- Point your working directory to the 'PQTool_master' folder
- Open one of global.R, server.R or ui.R in RStudio then hit 'Run App'.
Please make sure you run all tests, and that they pass, before making a pull request. This is especially important because some of the tests in test-tour.R will not run on Travis. This is, hopefully, temporary, whilst we get to the bottom of why those tests do not pass on Travis (whilst they do pass locally).
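To run the tests you will need RSelenium and geckodriver. The commands below install the Selenium server and geckodriver, start the server, run the app on port 8888 and then run the tests (the last two are R commands):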
brew install selenium-server-standalone
brew install geckodriver
java -jar -Dwebdriver.gecko.driver=/<path_to_gecko>/geckodriver /<path_to_selenium>/selenium-server-standalone-3.3.1.jar
runApp('/path_to_app/', port=8888)
devtools::test()
To deploy, you will need access to the Jenkins console. Once there, find the name of this app (pq-tool), select the branch that you want to deploy then go to 'build with parameters'. If you're not sure what parameters to use, have a look at previous builds and see what parameters were used there.
At the moment, using this pipeline, the only way for us to deploy to more than one env is to have more than one repo. We have created a second repo called pq-tool-staging. This is a new repo exclusively for the purpose of testing branches, and all branches pushed there should be considered disposable. You should also clean up after yourself and delete branches (from that repo) that are no longer needed for testing.
git remote add staging git@github.com:moj-analytical-services/pq-tool-staging.git
git push staging branch-to-test
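Once a branch has been tested, it can be removed from the staging repo again. A minimal sketch, assuming the remote is named staging as in the commands above; delete_staging_branch is a hypothetical helper name, not something defined in this repo.

```shell
# Hypothetical helper: remove a tested branch from the staging remote.
delete_staging_branch() {
  branch="$1"
  # Delete the branch on the remote named "staging".
  git push staging --delete "$branch" || return 1
  # Drop any stale remote-tracking refs for staging in the local clone.
  git fetch --prune staging
}
```

For example, delete_staging_branch branch-to-test removes the branch pushed in the step above; the local branch of the same name is left untouched.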
A separate repo (pq-tool-external) has been created to allow access outside of the MoJ to selected people, usually from other government departments.
git remote add external git@github.com:moj-analytical-services/pq-tool-external.git
Then you can push to that repo and deploy from the Jenkins console using the pq-tool-external job, where you can also give a list of email addresses for the people you wish to give access:
git push external branch-to-deploy