Create data directory hierarchy if not present #201

gregoryfoster · 2017-02-26T00:03:29Z

Hello, and thank you for sharing and maintaining such a valuable project. I'm just getting started by way of legis-graph and intend to become a frequent user and hopefully a helpful contributor.

I've set up a fresh installation and a Python 2.7 virtual environment. As a heads-up for potential future congress users, I ran into an SSL handshake issue traced to scrapelib, which prevents execution of the fdsys task (and likely others). That issue and a workaround are detailed here.

Currently, I'm attempting to run ./run bills --congress=115, and the task fails because there is no data hierarchy in the filesystem yet. Even after mkdir -p data/115, a subsequent os.listdir call fails because there are no bill types. This is easy enough to work around with some knowledge of the expected hierarchy, but it seems like something we could also easily fix.

I see there's a mkdir_p function in utils.py we could reuse - is there a good central place in the codebase to anticipate this edge case? I'd be happy to put together a pull request with a little guidance.
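As a rough sketch of the kind of guard being proposed here (not the project's actual code): create the directory before listing it, so a clean checkout yields an empty listing instead of an error. The helper names `ensure_dir` and `list_subdirs` are hypothetical, and nothing is assumed about the real signature of mkdir_p in utils.py.

```python
import errno
import os


def ensure_dir(path):
    """Create path and any missing parents; do nothing if it already exists."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Only swallow "already exists" for a directory; re-raise anything else.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise


def list_subdirs(path):
    """Return sorted subdirectory names of path, creating path first if needed."""
    ensure_dir(path)
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )
```

With a helper like this, listing a directory that does not exist yet returns an empty list rather than raising OSError. The try/except form is used instead of `exist_ok=True` so the sketch also runs on Python 2.7.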

Thanks again for this very useful project!


gregoryfoster · 2017-02-26T00:49:13Z

Hmm, after messing around a bit more I'm starting to feel like I'm missing an important step between ./run fdsys --collections=BILLSTATUS and ./run bills. It looks to me like the bills task is expecting a bootstrapped dataset to already exist in the data hierarchy, but I don't see any mention of how to achieve that in the README or the wiki.

gregoryfoster · 2017-02-26T00:50:46Z

Re-opening, didn't mean to close the issue.

konklone · 2017-02-26T00:51:59Z

@gregoryfoster I filed a quick PR to fix the issue you identified: #202

However, the bills task is defunct and unused. It was designed for thomas.gov, which is now 💀 in favor of congress.gov. The fdsys task is active, and would be easier for @JoshData to speak to, as he has it up in production.

gregoryfoster · 2017-02-26T01:28:24Z

Thanks, @konklone, for the quick fix. It does take care of creating the data hierarchy through a specified Congress.

I'm a little puzzled and honestly a little distressed to hear that the bills task is regarded as defunct, as that shines a different light on GovTrack's announcement that they'll no longer support bulk data access after the 2017 summer recess. Is this project winding down?

JoshData · 2017-02-26T01:32:04Z

No no no, I re-wrote the bills task last year to convert the new official bill XML (from fdsys) into the existing JSON data format. Since GovTrack relies on the JSON format and I don't have the capacity to re-write GovTrack's importer to use the fdsys XML directly, I'm still invested in keeping the bills task running.

JoshData · 2017-02-26T01:32:54Z

The mkdir issue probably stemmed from my rewrite last year, btw. Sorry about breaking it on clean directories (which I never test on).

gregoryfoster · 2017-02-26T01:58:22Z

Whew, glad to hear, @JoshData!

Returning to the original edge case of an absent and now clean data hierarchy - should I open a separate issue to tackle a clean load scenario? Meaning: while PR #202 avoids the os.listdir errors, the bills task as written doesn't take any action on a clean directory as it's compiling the list of bill types and bill IDs from an empty data hierarchy. That seems like a more substantial chunk of work that would require traversing the fdsys sitemap metadata files (or is there an easier route?).

Let me know if you want me to open a separate issue. And if you can sketch an outline of what needs to be done, I'd be happy to contribute a PR.

konklone · 2017-02-26T02:00:27Z

Apologies for confusing the issue! And I can verify what @gregoryfoster says -- #202 fixes the errors, but it still doesn't cause the bills task to do anything, it just stops with some messages about fetching 0 bills. I couldn't figure out why that was, and mistook the lack of network requests to mean it'd been retired.
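To illustrate why an empty hierarchy would lead to zero bills being processed, here is a minimal sketch of enumerating bills from disk. The directory layout `<data_dir>/<congress>/bills/<bill_type>/<bill_id>/` and the function name `bill_ids_on_disk` are assumptions made for illustration; they are not taken from the bills task itself.

```python
import os


def bill_ids_on_disk(data_dir, congress):
    """Collect bill IDs from an assumed <data_dir>/<congress>/bills/<type>/<id> tree."""
    bills_root = os.path.join(data_dir, str(congress), "bills")
    if not os.path.isdir(bills_root):
        # Nothing has been downloaded yet, so there is nothing to enumerate.
        return []

    found = []
    for bill_type in sorted(os.listdir(bills_root)):
        type_dir = os.path.join(bills_root, bill_type)
        if not os.path.isdir(type_dir):
            continue
        for bill_id in sorted(os.listdir(type_dir)):
            # Each bill is assumed to live in its own subdirectory.
            if os.path.isdir(os.path.join(type_dir, bill_id)):
                found.append(bill_id)
    return found
```

Under these assumptions, a task that builds its work list this way has nothing to do on a clean directory: creating the missing folders removes the error, but the list of bills stays empty until some earlier step has populated the tree.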

joec58 · 2019-06-12T05:09:15Z

Hello,
I came to this issue report after attempting to run a clean installation of this scraper and got the error: "No such file or directory: 'data'"

This issue and #202 seem to be related to my error, even though this issue is over 2 years old and still open. #202 says "This fixes #201 by using mkdir_p as necessary when examining data paths on disk.", but gives no specific directions on how or where that fix should be applied.

After reading the last 2 comments here, I have to ask if this scraper is still being maintained? If so, where can I find directions on how to fix this issue? Thanks.

JoshData · 2019-06-12T14:01:13Z

Hi.

At GovTrack we use this project extensively.

Unfortunately, we don't have the resources to fix problems that we're not experiencing ourselves. This repository was created at a time when multiple well-funded organizations (besides us) were investing in creating a shared data ecosystem for legislative data, but now some of those organizations effectively don't exist anymore.

joec58 · 2019-06-12T23:23:05Z

Thanks for your quick response.

I started a project several years ago with GovTrack (GT) bulk data. When I came back to it last year the GT data was no longer online. I found parts of it on ProPublica and elsewhere but some parts I can’t find, like the set of Amendments.

I will spend some time over the next few days trying to figure this scraper out. If it can produce what I’m looking for I will post the fix. I might even try to fork it to Python3 since Python2 is due to be obsolete next year.

dwillis · 2019-06-13T00:27:32Z

@joec58 Can I ask specifically which scraper you're running that doesn't create a data directory? I ask because I cloned the repository into a new directory and ran ./run govinfo --bulkdata=BILLSTATUS and it created a data directory.

joec58 · 2019-06-13T12:18:54Z

You are right. My mistake for not reading the instructions carefully. I ran ./run bills without first running ./run govinfo --bulkdata=BILLSTATUS.

gregoryfoster closed this as completed Feb 26, 2017

gregoryfoster reopened this Feb 26, 2017

konklone linked a pull request Feb 26, 2017 that will close this issue

mkdir_p as necessary when walking disk #202

Open
