Meeting Notes
Meeting 1
Meeting 2
Meeting 3 Course Update 1
Meeting 4
Meeting 5
Meeting 6 Consultation
Meeting 7
Meeting 8
Meeting 9 Consultation
Meeting 10 Consultation
### Meeting 5: April 1, 2016
Participants:
Woojin, Jack, John, Nicolie, Stella, Fionn, Jong-hwan Sun, Jenny
Next week's tasks
- Go to Google Drive -> Data Collection -> Individual Source Text Files, and rename the txt files to the numbers that match the "new_id" column of the ID Sources spreadsheet.
  e.g.: "1-1 John.txt" -> "1.txt".
  Note: For files that have no reference, simply create an empty file whose name matches the "new_id".
Meeting Time & Attendance
- Starting next Monday, every absence must have a reasonable excuse and be announced in advance on Gitter, or let Woojin know through private chat.
- All team members are required to attend Monday's meeting. (3-4p Database, 4-5p Web Scraping)
- For Friday's meeting, each group (database, web scraping) selects 1-2 members to meet with Woojin and make sure everyone is on the same page.
### Meeting 4: March 11, 2016
Participants:
Woojin, John, Nicolie, Stella, Fionn, Jong-hwan Sun
Brief Summary:
Because Woojin needs a project proposal to discuss with the professor, both teams have to finish the methodology part of the proposal this weekend. Last week's work and next week's tasks were discussed.
Tasks:
Grant Proposal of CEGA Project - both teams
Contents: contributions and methodology (each team is responsible for writing a half-page paragraph discussing the methodology for its present tasks, due this weekend)
Aid Database - DB team
Use R/SQL to do some analysis of the aid data. From the aid data we aim to classify the aid sectors by their CRS codes, and to calculate the total aid amount for a given recipient, in a given year, for a given sector. Go to the Google Drive folder aid data >> CRS code.xls >> purpose code sheet to find the corresponding sector. Go to the Google Drive folder aid data >> data_process.R to get all the aid data.
NOTE: Make a WB aid sheet and two sub-sheets for IDA and IBRD aid.
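The sector classification and totals described above can be sketched as follows (a rough Python sketch: the CRS purpose codes and sector names in `CRS_SECTOR` are illustrative placeholders for the real mapping in CRS code.xls, and `records` stands in for the rows produced by data_process.R):

```python
from collections import defaultdict

# Placeholder mapping from CRS purpose code to sector; the real mapping
# is the "purpose code" sheet of CRS code.xls on Google Drive.
CRS_SECTOR = {11220: "Education", 12220: "Health", 16010: "Social protection"}

def total_aid(records):
    """Sum aid amounts per (recipient, year, sector).

    Each record is a dict with keys: recipient, year, purpose_code, amount.
    Codes missing from CRS_SECTOR fall into an "Other" bucket.
    """
    totals = defaultdict(float)
    for r in records:
        sector = CRS_SECTOR.get(r["purpose_code"], "Other")
        totals[(r["recipient"], r["year"], sector)] += r["amount"]
    return dict(totals)
```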
### Meeting 2: February 26, 2016
Participants:
Dav, Woojin, Jack, John, Zhichao, Nicolie, Stella, Fionn, Dongjie Zhu, Jong-hwan Sun
Brief Summary:
Two milestones to be accomplished: a World Bank and development paper by March 15; the SSMART grant proposal by March 28. On top of building the database of WB policy citations, we want to expand our database to documents found in the UN and the OECD. We also want to expand the research database to entire peer-reviewed journals, and further edit the finance database to include both aid data and government investment (this requires more research; we want to look at international and "within-nation" funds). Upon completion of the databases, do statistical analysis (descriptive; linear; machine learning; network), as well as creating visualizations for our results.
Note: 2-3 people will be devoted to helping write code to automate the process of getting information from the Open Knowledge website, and to web scraping in general.
Weekly Task: Everyone, go back to the ID Sources Google Sheet and fill out the new columns; this includes title, category, DOI, and handle number if necessary. Also, save the original copies of the document (both .pdf and .txt).
Category consists of: policy research working paper, journal article, other paper, ... and more
HOW TO FIND THE DOI: after you click a report, you see the abstract view and file downloads; scroll down the page until you see Author(s), and there will be a colored circle (ring) below. Next to the circle it should say "tweeted by 1" (or another number). Click "See more details" right below the circle, and you will see the DOI.
NOTE: Not every document has the colored circle on display; if that's the case, try a Google search to get the document's DOI. If a document's DOI cannot be found after a Google search, simply fill in the handle number for that document.
When saving the original copy of the document, rename it to its DOI (DOI.txt); if the DOI cannot be found, name the original copy after its handle number (handle_number.txt).
The DOI is more important than the title, so try to get that column filled.
If you have not added the abstract view of a document, you can just leave it blank.
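When the circle is missing and a Google search fails, the public Crossref REST API can sometimes recover a DOI from the title. A hedged sketch (`query.bibliographic` is a real Crossref parameter, but the function names here are ours, and results are only best-match guesses to verify by hand):

```python
import json
try:
    from urllib.parse import quote_plus   # Python 3
    from urllib.request import urlopen
except ImportError:
    from urllib import quote_plus         # Python 2
    from urllib2 import urlopen

def crossref_query_url(title):
    """Build a Crossref REST API query URL for a document title."""
    return ("https://api.crossref.org/works?rows=1&query.bibliographic="
            + quote_plus(title))

def extract_doi(response_text):
    """Pull the best-match DOI out of a Crossref JSON response, or None."""
    items = json.loads(response_text)["message"].get("items", [])
    return items[0].get("DOI") if items else None

# Usage (hits the network; check the match by hand before filling the sheet):
# doi = extract_doi(urlopen(crossref_query_url("some report title")).read().decode("utf-8"))
```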
Also, take a minute to fill out the task tracking; thus far we only have administrative tasks listed, and the technical parts are to be added later.
February 16, 2016
Participants:
Dav, Woojin, Jack, John, Nicole, Stella, Fionn, Jenny, Sunny
Week's tasks:
For everyone, start by downloading txt files from this site: https://openknowledge.worldbank.org/discover?scope=%2F&query=%22cash+transfers%22&submit=.
Then, for each text file's reference section, manually remove the extra whitespace and upload your text file to Google Drive (in the Data Collection/Individual Source Text Files folder). Also make sure you fill in the ID and reference for each book/report in the ID Sources spreadsheet. This week's assignments are listed below:
- John:1-6
- Jack:7-12
- Woojin:13-18
- Sunny:19-24
- Nicolie:25-30
- Stella: 31-36
- Jenny: 37-42
- Fionn: 43-48
- Zhichao: 49-54
Format (important!):
- For Sublime or Vim, you will find "Ctrl+J" quite useful.
- Text file name: page_number-book_number Name (make sure the order is the same as the page order on the website), for example, 19-1 Sunny.txt.
- Reference format: Name. Year. Title. Publication. Organization. Location.
Updates:
- If the text versions have weird formatting issues, use the PDF versions, but save them as text files.
- For those txt files which have TWO spaces between words, use "replace all" to replace the double spaces with single spaces. (Same for THREE spaces.)
- To avoid losing text when concatenating the sentences, always keep the PDF version as a reference.
- Instead of using software to convert the PDF file into a txt file, just copy the reference section of the PDF file.
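The whitespace cleanup above can be scripted rather than done by hand with "replace all". A small sketch (the exact regexes are an assumption about how the PDF-to-text conversion mangles spacing; keep the PDF open to check the result, as noted above):

```python
import re

def clean_reference_text(text):
    """Normalize whitespace in a pasted reference section.

    Re-joins words hyphenated across line breaks, unwraps single line
    breaks into spaces (keeping blank-line paragraph breaks), and
    collapses runs of two or more spaces into one.
    """
    text = re.sub(r"-\n(?=\w)", "", text)          # re-join "trans-\nfers"
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # unwrap single line breaks
    text = re.sub(r"[ \t]{2,}", " ", text)         # double/triple spaces -> one
    return text
```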
November 25, 2015
Participants:
Jiangjie Man (Jack), Yang Zhou (John)
Questions:
- Method of dealing with Chinese character encoding in txt files with Python.
- Usage of the Google Scholar API.
D-Lab suggestions:
- Try using chardet 2.3.0 for encoding detection, to deal with the Chinese character encoding.
- scholar.py can be helpful not only for getting the citation info but also for parsing the citation into author name, year, title, etc.
Procedure for extracting references from a txt file: detect the file's encoding -> decode (convert the txt file into UTF-8) -> locate the reference page -> parse the references (eliminate indents) -> Google Scholar API -> parsed citation info (BibTeX format) -> put the extracted info into a .csv/xml table.
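The first two steps of that pipeline can be sketched as below (chardet, as suggested by the D-Lab, is used when installed; the fallback codec list and the heading-based reference locator are our own assumptions, not part of the agreed procedure):

```python
def decode_to_utf8(raw_bytes):
    """Decode a txt file's raw bytes into unicode text.

    Tries chardet when it is installed; otherwise falls back to a few
    likely codecs (gb18030 covers most Chinese-encoded files).
    """
    try:
        import chardet
        guessed = chardet.detect(raw_bytes)["encoding"]
        if guessed:
            return raw_bytes.decode(guessed)
    except ImportError:
        pass
    for codec in ("utf-8", "gb18030", "latin-1"):
        try:
            return raw_bytes.decode(codec)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode("utf-8", errors="replace")

def locate_references(text):
    """Return the slice of `text` from the last References/Bibliography
    heading onward, or None when no such heading is found."""
    lowered = text.lower()
    start = max(lowered.rfind("references"), lowered.rfind("bibliography"))
    return text[start:] if start != -1 else None
```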
November 24, 2015
Participants:
Yang Zhou (John), Jiangjie Man (Jack), Zhichao Yang, Wing In Ng
Expected Model:
- Features: paper title, author, publication year, country, key words
- Response: citation, paper level, countries, budget
- Linear regression model: glm (in R)
- Visualization: ggplot or ShinyApp in R
- Mapping tools: MySQL
Questions:
- Text analysis: Getting the information from txt files with different formats seems tricky; can we automate the process of obtaining citation information from the Google Scholar API?
- Feature and response quantification: What should our features and responses be? We need to be specific about them so that we can do the analysis with a regression model.
Contact:
- DLAB
- Chris Paciorek: [email protected]
November 20, 2015
Participants: Dav Clark, Woojin Jung, Yang Zhou(John), Jiangjie Man(Jack), Zhichao Yang, Wing In Ng
Week Task
- Modify the existing code to also scrape the journal name from citations. Difficulties include distinguishing between book citations and journal citations, and dealing with inconsistently formatted citations.
- Try to automate the process of obtaining, via the API, how many times a paper has been cited according to Google Scholar.
- Schedule a consultation with the librarian as soon as possible.
- Schedule a statistical consultation (possibly Nov. 24, 2-3pm) to discuss what kinds of statistical analysis we could do with the existing data.
- Obtain more data.
- Follow up with the WB people.
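The citation-parsing difficulty mentioned above can be approached with a heuristic regex, assuming the sheet's "Name. Year. Title. Publication. ..." reference format; inconsistently formatted citations simply return None and can be flagged for manual review (the function name and field choices are ours):

```python
import re

def parse_citation(citation):
    """Heuristically split a citation in "Name. Year. Title. Publication. ..."
    format into its first four fields; return None on a format mismatch."""
    m = re.match(r"(?P<name>.+?)\.\s+(?P<year>(19|20)\d{2})\.\s+"
                 r"(?P<title>.+?)\.\s+(?P<publication>[^.]+)\.", citation)
    return m.groupdict() if m else None
```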
Next Meeting: Nov 30, Monday, 3pm. Prepare for the final presentation.
Final Presentation: Dec 03 Thursday
November 13, 2015
Participants: Dav Clark, Woojin Jung, Temina Madon, Fiona, Xavier Xiao, Yang Zhou (John), Zhichao Yang, Wing In Ng
Week Task
- Text mining: start by clicking the txt links of the documents on https://openknowledge.worldbank.org/discover?scope=%2F&query=%22cash+transfers%22&submit=, and find a nice way to extract each policy document's title, author name and publication date. Also, extract information on the papers/reports which cited the document. The outcome is expected to be an Excel sheet; for the format, refer to: https://drive.google.com/drive/u/0/folders/0B9tHcQuMozKuaWFtQ3BKbHBEZzA (Jack, Xavier, John)
- 1-page outline (on Google Drive: https://docs.google.com/document/d/1K_2VpMYOXdOHvYqp3drRcwUxgXCJYla-6nWb2JbeDiQ/edit): everyone needs to contribute to revising it and derive a final form by next Wednesday. (all)
- EDAM will send us the draft by Tuesday, which is used for evaluation between teams; everyone is supposed to work on it and finish it by Wednesday. (all)
- Team evaluation, on Git issue #14: https://github.com/BIDS-collaborative/cega-trace/issues/14, due on Monday. (all)
- Google form made by Woojin, for next week's meeting.
- Schedule a meeting with the librarian.
- WB txt files link used.
November 11, 2015
Participants: Jose De Buerba, Thomas Breineder, Dav Clark, Woojin Jung, Temina Madon
Overview of relevant efforts at World Bank
- The research dept (DEC) has examined impact of WB research on policy decision-making, by analyzing citations (JdB will share links). See http://elibrary.worldbank.org/doi/pdf/10.1596/1813-9450-6851
- WB is working with Altmetric.com to look at citations in policy documents (e.g. using DOIs, full-text search, other handles). They have been successful in identifying policy citations of WB research reports, although this is an imperfect/preliminary effort that requires further QA (i.e. they are picking up items that are not WB). There may be some data that can be shared by Altmetric. Early findings are that Oxfam, among others, has cited WB research.
- Univ of Leiden - bibliometrics department: Ongoing study of where WB papers are cited in policy documents, then building citation maps to uncover which other institutions are citing WB research. (e.g. DFID, IMF)
- WB does generic stuff too: qualitative analysis (i.e. annual "leadership" survey), download/view rates for policy docs
- Of potential interest, NISO.org is developing working groups for altmetric standards - see Oct 2015 altmetrics conference in Amsterdam. WB participated in this.
For cega-trace project
- We should use Open Knowledge repository, which follows OAI protocols (with exposed meta-data): we can crawl whatever collections we are interested in.
CEGA Follow-up
Send more detail about our research design, with a specific request about how the WB can help us. Jose will then get a contact on the Operations side for cash transfers, to help us understand where project investments are best documented. In our 1-pager, we should include hypotheses to be tested-- e.g. are there temporal correlations between research reports and Bank investments? Is the influence of research dispersed (across researchers/universities), or concentrated among "elites"? Are there geographic correlations?
WB Follow-up
WB can develop a virtual set of content (feed) using Meta-data Harvesting Protocol, see https://www.openarchives.org/pmh/. This allows us to run a query on the WB database for specific search terms, using their meta-data (essentially we crawl the xml feed to harvest the right publications/information). Tom will send us links for this virtual set. Unfortunately references/citations are not part of the meta-data, so we'll need to get that from the full text.
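Such an OAI-PMH query can be sketched as a URL builder (the `verb`/`metadataPrefix`/`set`/`resumptionToken` parameters are standard OAI-PMH; the base URL and set name below are placeholders until Tom sends the links for the virtual set):

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# Assumed endpoint; replace with the base URL from Tom's links.
OKR_OAI_BASE = "https://openknowledge.worldbank.org/oai/request"

def list_records_url(base_url, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request (Dublin Core metadata).

    When paging through a large result, the server hands back a
    resumptionToken that must be sent *instead of* the other arguments.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

Since references/citations are not part of the metadata, the harvested records only identify the publications; the citations still come from the full text, as noted above.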
Jose will send links to work by Leiden, DEC, Altmetric. He can also provide contact with Altmetric.com (start-up)
November 6, 2015
Participants: Woojin Jung, Xavier Xiao, Jiangjie Man (Jack), Yang Zhou (John), Zhichao Yang, Wing In Ng; Skype: Temina, Fiona
Further details on the Outline
For 3. Research Questions and Hypothesis, here's a further explanation of 3.1 and 3.2.
3.1 (stage I of our research)
"What is the pattern of research uptake into policy and practice, using the case of CCT impact evaluation and the World Bank policy documents? By analyzing the frequency, relevance and distribution of academic articles cited by policy reports, we look for strong evidence…"
Frequency = how many research papers are cited
Relevance = time gap/year gap between a research report and a policy report
3.2 (stage II of our research): Using AidData, we first find the correlation/mapping between countries with a large volume of research and the amount of CCT funding, then examine whether countries having more CCT funds lead to research elsewhere, i.e. in neighboring countries or other parts of a big country, thus increasing the coverage of CCT research and possibly influencing policy and project management.
More on Stage I
Important to find path of influence of :
[Academic Research] Research on impact evaluation of CCT -> World Bank policy report & WB development report -> Policy-oriented research -> Planning & budget -> Guidelines & tools -> Project WB evaluation reports. The idea is that project WB evaluation reports should lead to new policy, which then goes through another round of planning & budget and guidelines & tools, producing another project WB evaluation report. This second evaluation report will lead to another policy, forming a cycle.
Use https://openknowledge.worldbank.org/community-list to find the different class of reports:
- Planning & Budget report can be found under Economic and Sector Work (ESW) Studies
- Project WB evaluation reports can be found under Annual Reports & Independent Evaluations
- Key components of the Academic Research papers database: Author last name, year of publication, country, paper title, doi, keywords
- Optional ones: evaluation method, funding source, implementing countries
#### For now:
- Edit and expand Academic Research dataset on Google Drive (also table 4 on p.17 and p.299 from Fiszbein et al. - 2009 - Conditional cash transfers reducing present and f.pdf )
- Perform text mining to extract information from Planning & Budget and Project evaluation reports
- Look at all WB annual development reports from 1997-2015 and extract information
See more details on: https://drive.google.com/open?id=0B9tHcQuMozKub1dZSTBuQWE5R3c
October 23, 2015
Participants: Dav, Temina, Woojin Jung, Xavier Xiao, Jiangjie Man (Jack), Yang Zhou (John), Zhichao Yang, Wing In Ng
#### Next steps
- Collectively identify keywords for a more "systematic" review of the academic literature:
  - methods (like regression discontinuity, randomized trial, difference-in-difference);
  - content (CCTs in developing countries; look at sectors/outcomes like health, education, social counseling, vocational training, etc.)
- Define the search combinations/restrictions -- e.g. {CCT, methods(*)}
- Speak with UCB Library Data Lab about more useful search databases
- Carry out a Web of Knowledge search; download hits for keywords identified in the systematic review (text files)
- As searches are carried out, create a Google Sheet for sharing results -- with checkpointing (Google -> CSV -> Git)
- Extraction procedure: steps of finding data and automating
- Play with software that enables citation network analysis (e.g. GEPHI, http://gephi.github.io/ or tools for visualizing eigenvector centrality http://demonstrations.wolfram.com/NetworkCentralityUsingEigenvectors/)
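Before reaching for GEPHI, eigenvector centrality itself can be sketched in pure Python by power iteration (a toy sketch, not anyone's assigned method: citations are treated as undirected ties for simplicity, and iterating with I + A, i.e. adding each node's own score back in, keeps the iteration from oscillating on bipartite graphs):

```python
def eigenvector_centrality(edges, iterations=100):
    """Eigenvector centrality of a citation graph via power iteration.

    `edges` is a list of (citing, cited) pairs, treated as undirected.
    Scores are normalized so the most central node has score 1.0.
    """
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # (I + A) x : own score plus the sum of neighbors' scores.
        new = {n: score[n] + sum(score[m] for m in neighbors[n])
               for n in nodes}
        norm = max(new.values())
        score = {n: v / norm for n, v in new.items()}
    return score
```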
#### First pass
- Ask WB library for a text version of the Policy Reports database (Temina)
- ALL:
  a. Play with keyword searches on existing online databases (e.g. World Bank, dec.usaid.gov, etc.).
  b. Identify the web-based data available for policy reports, i.e. what information we'd want to pull into a table (e.g. authors, agency, date, ISBN, references/citations, etc.).
  c. Share your results by adding to Google Sheets, but "checkpoint" by exporting to CSV and uploading to Git.
- Poke around http://aiddata.org for budget data. (note: the OECD DB only has CRS codes, while AidData has richer project data)
#### Sources to look for data
- ResearchGate: http://www.researchgate.net/post/Google_scholar_for_systematic_reviews_what_limit_on_search_returns
- Microsoft Academic Search: http://academic.research.microsoft.com/About/Help.htm
- USAID DEC: https://dec.usaid.gov/dec/home/Default.aspx
- Web of Science: https://apps.webofknowledge.com/UA_GeneralSearch_input.do?product=UA&search_mode=GeneralSearch&SID=4B1oocZSHuVy33CMUlf&preferencesSaved=
- AidData: http://aiddata.org/
#### Ideas
- Building a network of the authors who have a strong effect in the cash transfer field.
- Finding connections between budget and the number of published papers.
October 22, 2015
Participants: HM course attendees
Topic: Feedback on Slide Deck
Research Questions
- Examine variation in "networks of influence" across agencies (e.g. which studies are cited in World Bank or DFID reports-- agencies with strong technical capacity-- versus USAID-- which traditionally has weak technical capacity) [Dav]
- To what extent is an agency "dominated" by a few influential scholars?
- Team wants to identify the academic studies that have been most "influential" in the research community. But there are existing databases to pull citation networks for academic publications - no need to independently carry out this step. [Dav]
- Need to define the scale for this pilot-- how many agencies, publications, reports to be included in the analysis?
October 16, 2015
Lead: Yang Zhou
Teammates had a short meeting today and we have made some notes:
Journal & Report searching
- Search keywords: cash transfer + unconditional transfer + conditional cash transfer
- Look for journals & reports that include an "abstract", "keywords", "references" and a "published date" (preferably after 2010)
- Format: html & txt & pdf (html and txt will save a lot of effort!!)
- Where to find them: the "search contents" in meeting note 1, and Google Scholar
- Anyone who finds a good resource, don't forget to create a new issue and share!
- Anyone who finds a good journal or report, don't forget to share it on our Google Drive! Jack has already created some files for the convenient distribution of the materials.
Some questions for Dav:
- Given that most of the material we find might be in PDF form, and in order to scrape certain contents (including the writer's name, publication date, and paper name) from the reference part of the material, we might need your help on how to do that.
- Are there any ways to make our final research outcomes more visual? (e.g. techniques that would enable us to build a network plot of the citation connections between the journals & reports we find)
October 2, 2015
Participants: Dav, Garret, Jack (Jiangjie), Temina, Woojin, Xavier (Zhisheng), Yang, Zhichao
Required Skills
- Hitting urls
- Isolating text from PDF or html files
- Visualizing the resulting data (citation networks)
- Possible add-on: NLP
Sector Focus
- Domain of focus will be conditional and unconditional cash transfers (CCTs/UCTs)
Content to Search
- Citations in the "grey" literature: Reports and grant/project databases from DFID, World Bank, OECD, etc.
- Study/grant registries (DFID, NIH, 3IE, IPA, AEA) and data repositories (e.g. Dataverse, ICPSR)
- Social media: Twitter (links to DOIs)
- News media
- Curricula?
Project Strategy
- Develop protocol for systematic review (using Cochrane?). Set up inclusion/exclusion criteria, possibly including author name(s), paper title, publication keywords, "other" keywords, phrases in abstracts
- Create database to allow people to update, contribute content
- Set up task management with databases
References
- Dollar & Collier, CPIA
- Cochrane & Campbell Systematic Review protocols (to be added to Literature)
- Review paper on CCTs/UCTs/cash transfers (to be added to Literature)