Survey 1 2013 Jun 22 methods

Tim L edited this page Jun 20, 2016 · 52 revisions

What is first

Let's get to it

This page describes how we surveyed the authors of ~900 LOD datasets.

Methods

Gathering dataset metadata from datahub.io via DataFAQs

There are 337 lodcloud datasets and 902 datasets tagged 'lod'.

How much overlap?
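
One way to answer this, once an identifier list is available for each group, is a plain set intersection. The sketch below is illustrative only: the function name and the dataset identifiers are hypothetical, and it does not reproduce how the FAqT Brick datasets were actually compared.

```python
def overlap(lodcloud_ids, lod_tag_ids):
    """Return (in both, lodcloud only, lod tag only) as sorted lists."""
    lodcloud, tagged = set(lodcloud_ids), set(lod_tag_ids)
    return (
        sorted(lodcloud & tagged),   # datasets in the lodcloud group AND tagged 'lod'
        sorted(lodcloud - tagged),   # in the lodcloud group but not tagged 'lod'
        sorted(tagged - lodcloud),   # tagged 'lod' but not in the lodcloud group
    )


# Hypothetical identifiers, for illustration only.
both, cloud_only, tag_only = overlap(
    ["dataset-a", "dataset-b", "dataset-c"],
    ["dataset-b", "dataset-c", "dataset-d"],
)
print(len(both), len(cloud_only), len(tag_only))
```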

The Prizms node at http://datafaqstest.tw.rpi.edu has two FAqT Brick datasets that survey the lod datasets:

  • how-o-is-lod gathers the VoID metadata for the 337 lodcloud datasets, and describes the named graphs in any SPARQL endpoint that is mentioned (run by lodcloud@datafaqstest on 2013-06-04).
  • lod-tag gathers the VoID metadata for the 902 lod tag datasets, and describes the named graphs in any SPARQL endpoint that is mentioned (run by lodcloud@datafaqstest on 2013-06-22).

Gathering dataset contact information from datahub.io CKAN API

The two datasets above do not preserve the contact information that is available in the original datahub.io entries (because of Unicode issues with RDF libraries), so we created a dedicated dataset (source/datahub-io/lod-tag-and-lodcloud-group-contacts) that accesses the contact information directly. Running the global retrieval script version/retrieve.sh recreates the survey emails using contact information taken directly from datahub.io. lebot@datafaqstest ran it to create versions 2013-Jul-01, 2013-Jul-02, and 2013-Jul-03.
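
As a rough illustration of reading contact fields straight from CKAN, the sketch below builds a package_show request URL and extracts the author and maintainer fields from the JSON response. It is not the code in version/retrieve.sh. The base URL and the helper names are assumptions; author, author_email, maintainer, and maintainer_email are standard CKAN package fields. Parsing the response as JSON keeps non-ASCII names intact without going through an RDF library.

```python
import json
from urllib.parse import urlencode

# Assumption: the CKAN instance serves the version 3 action API at this base URL.
CKAN_BASE = "http://datahub.io/api/3/action"


def package_show_url(dataset_id, base=CKAN_BASE):
    """Build the package_show URL for one dataset (hypothetical helper)."""
    return f"{base}/package_show?{urlencode({'id': dataset_id})}"


def extract_contacts(response_text):
    """Pull author and maintainer contacts out of a package_show JSON response."""
    package = json.loads(response_text)["result"]
    contacts = []
    for role in ("author", "maintainer"):
        # Fields may be missing, null, or empty, so normalize before testing.
        email = (package.get(f"{role}_email") or "").strip()
        if email:
            contacts.append(
                {"role": role, "name": package.get(role) or "", "email": email}
            )
    return contacts
```

The response body itself could be fetched with any HTTP client; only the parsing step is shown so that the sketch runs without network access.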

The (not public) spreadsheet datahub.io lod tag contacts is a hand-made list of contact email addresses for each dataset. It is used when contact information is missing from datahub.io. In the future the precedence will be reversed: the spreadsheet will be consulted first, with datahub.io as the fallback.
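
The precedence rule can be sketched as a two-source lookup. This is a minimal illustration under assumed names: resolve_contact, its parameters, and the dictionaries keyed by dataset identifier are hypothetical, not part of the actual scripts.

```python
def resolve_contact(dataset_id, datahub_contacts, curated_contacts, prefer_curated=False):
    """Return a contact email for dataset_id, or None if neither source has one.

    prefer_curated=False models the current behaviour (datahub.io first, then
    the hand-made spreadsheet); prefer_curated=True models the planned reversal.
    """
    if prefer_curated:
        first, second = curated_contacts, datahub_contacts
    else:
        first, second = datahub_contacts, curated_contacts
    # An empty string counts as "no contact", so fall through to the other source.
    return first.get(dataset_id) or second.get(dataset_id) or None


# Hypothetical data, for illustration only.
datahub = {"dataset-a": "old@example.org", "dataset-b": ""}
curated = {"dataset-a": "new@example.org", "dataset-b": "b@example.org"}

print(resolve_contact("dataset-a", datahub, curated))
print(resolve_contact("dataset-a", datahub, curated, prefer_curated=True))
```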

The emails are organized in the source/ directory according to whether the dataset had contact information and whether it is in the lodcloud group:

  • source/tagged-lod/is-contactable-by-recovery/is-in-lodcloud
  • source/tagged-lod/is-contactable-by-recovery/not-in-lodcloud
  • source/tagged-lod/is-contactable/is-in-lodcloud
  • source/tagged-lod/is-contactable/not-in-lodcloud
  • source/tagged-lod/not-contactable/is-in-lodcloud
  • source/tagged-lod/not-contactable/not-in-lodcloud
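
The six directories above are the product of three contactability states and two lodcloud membership states. A small sketch of that mapping follows; the function name and signature are hypothetical, and only the directory names come from the list above.

```python
from pathlib import PurePosixPath

CONTACTABILITY = ("is-contactable", "is-contactable-by-recovery", "not-contactable")


def partition_dir(contactability, in_lodcloud, root="source/tagged-lod"):
    """Return the partition directory for one dataset's survey email."""
    if contactability not in CONTACTABILITY:
        raise ValueError(f"unknown contactability state: {contactability!r}")
    membership = "is-in-lodcloud" if in_lodcloud else "not-in-lodcloud"
    return str(PurePosixPath(root) / contactability / membership)


# Enumerate every partition: 3 contactability states x 2 membership states.
all_partitions = sorted(
    partition_dir(state, member)
    for state in CONTACTABILITY
    for member in (True, False)
)
print("\n".join(all_partitions))
```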

Because the emails were generated on the server but we want to store them on our laptop, the local retrieval script retrieve.sh creates a version of the dataset from the temporary dump file that is created on the server. This bypasses our usual provenance- and archive-intensive publishing procedure and version control through GitHub, which was necessary to avoid publishing the contact information. The counts.sh script, when run from the conversion cockpit, counts the number of surveys in each partition:

bash-3.2$ pwd
projects/lodcloud/github/lodcloud/data/source/datahub-io/lod-tag-and-lodcloud-group-contacts/version/2013-Jul-03

bash-3.2$ ../counts.sh 
     890 tagged lod
     756 originally contactable
         456 not in lodcloud
         300 in lodcloud
       75 contactable by recovery
           56 not in lodcloud
           19 in lodcloud
       59 not contactable
           45 not in lodcloud
           14 in lodcloud
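
As a quick consistency check, the figures reported above can be re-added. The numbers are copied from the counts.sh output; the variable names are only for illustration.

```python
# Figures copied from the counts.sh output above.
counts = {
    "originally contactable": {"not in lodcloud": 456, "in lodcloud": 300},
    "contactable by recovery": {"not in lodcloud": 56, "in lodcloud": 19},
    "not contactable": {"not in lodcloud": 45, "in lodcloud": 14},
}

# Each subtotal should match the indented parent line in the output.
subtotals = {name: sum(parts.values()) for name, parts in counts.items()}
total = sum(subtotals.values())
in_lodcloud = sum(parts["in lodcloud"] for parts in counts.values())

print(subtotals)     # 756, 75, and 59 respectively
print(total)         # 890, the "tagged lod" line
print(in_lodcloud)   # 333 of the tagged datasets are also in lodcloud
```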

The survey emails in manual/todo/ were copied from source/, sent manually via email, and then moved into the manual/done/ directory. These files are available on the unpublished sending-survey-1 branch on our local laptop (see THERE-IS-MORE-DATA-HERE.readme for details). It took about two hours for one person to email the 758 surveys.

Archiving the raw email responses ("us/survey-1-responses")

See the description below about the sending-survey-1 branch of data/source/us/survey-1-responses/version/2012-Jul-03.

Quoting and coding the email responses ("us/survey-1-results")

See the description below about the partially-public dataset data/source/us/survey-1-results.

data/source/us/survey-1-results/version/2013-Jul-07/doc/publishing-permission.graffle illustrates the publication permission workflow. survey-methods-dataflow.graffle illustrates the dataset flow.

  • (public) datahub.io lod and lodcloud survey 1 questions contains the question text sent to the LOD publisher via email.
  • (not public - never will be) datahub.io lod tag contacts contains a manually curated list of "updated" contact emails for the datasets. The updates come from responses that pointed us to other people.
  • (not public) The email responses are archived on an unpublished branch (sending-survey-1) of the lodcloud prizms repository at data/source/us/survey-1-responses/version/2012-Jul-03/source/lodcloud-survey-1.mbox and data/source/us/survey-1-responses/version/2012-Jul-03/source/lodcloud-survey-1/. This file and directory are created by right-clicking the lodcloud-survey-1 folder in Mail.app, selecting "Export Mailbox...", selecting "Export all subfolders", choosing the directory data/source/us/survey-1-responses/version/2012-Jul-03/source/, and pressing the "Choose" button. This should be done on the sending-survey-1 branch (git checkout sending-survey-1).
    • version/2013-Jul-07/manual/contact-paths.graffle illustrates the "dead ends" faced when attempting to email the authors.
  • (not public - portions will be) datahub.io lod and lodcloud survey 1 results contains quotes from the email responses, with annotations to guide their parsing into Linked Data.
  • (public) datahub.io lod tools describes the tools mentioned in the survey (homepage, author, etc.).
  • (not public) datahub.io lod and lodcloud survey forward responses contains a list of names and email addresses of the participants who wish to see the survey results when they are complete.
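
For the archived responses, an exported .mbox file can be read locally with Python's standard mailbox module. The sketch below lists the sender and subject of each message. It is only an illustration of inspecting such an archive (the function name is hypothetical) and is not part of the quoting and coding workflow described above. It assumes the path points at a single mbox-format file.

```python
import mailbox


def summarize_mbox(path):
    """Return (sender, subject) pairs for every message in an mbox file."""
    # create=False avoids silently creating an empty mailbox on a mistyped path.
    box = mailbox.mbox(path, create=False)
    try:
        return [(msg.get("From", ""), msg.get("Subject", "")) for msg in box]
    finally:
        box.close()
```

Because the archive contains private contact information, a script like this should only be run against the local, unpublished copy.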

What is next