Create data directory hierarchy if not present #201

Open
gregoryfoster opened this issue Feb 26, 2017 · 13 comments · May be fixed by #202

Comments

@gregoryfoster

Hello, and thank you for sharing and maintaining such a valuable project. I'm just getting started by way of legis-graph and intend to become a frequent user and hopefully a helpful contributor.

I've set up a fresh installation and Python 2.7 virtual environment. As a heads up for potential future congress users, I ran into an SSL handshake issue traced to scrapelib which prevents execution of the fdsys task (and likely others). That issue and its workaround are detailed here.

Currently, I'm attempting to ./run bills --congress=115 and the task fails because there is no data hierarchy in the filesystem yet. Even after running mkdir -p data/115, a subsequent os.listdir call fails because there are no bill types. This is easy enough to work around with some knowledge of the expected hierarchy, but it seems like something we could also easily fix.

I see there's a mkdir_p function in utils.py we could reuse - is there a good central place in the codebase to anticipate this edge case? I'd be happy to put together a pull request with a little guidance.
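A minimal sketch of the idea, not the project's actual code: it assumes a helper that behaves like the mkdir_p mentioned above (the real function in utils.py may differ), and both the list_bill_types name and the data/<congress>/bills layout are hypothetical, chosen only for illustration. The point is to create the directory before listing it, so os.listdir cannot fail on a clean checkout.

```python
import errno
import os


def mkdir_p(path):
    """Create path and any missing parents; do nothing if it already exists.

    Stand-in for the helper of the same name in utils.py (assumed behavior).
    """
    try:
        os.makedirs(path)
    except OSError as exc:
        # Swallow only "already exists as a directory"; re-raise anything else.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise


def list_bill_types(data_dir, congress):
    """Return the bill-type subdirectories for a congress.

    Hypothetical helper: the directory layout is an assumption. The directory
    is created first so that os.listdir never raises on a fresh data tree.
    """
    bills_dir = os.path.join(data_dir, str(congress), "bills")
    mkdir_p(bills_dir)
    return sorted(os.listdir(bills_dir))
```

On a clean directory this returns an empty list instead of raising, which avoids the error but does not by itself populate any data.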

Thanks again for this very useful project!

@gregoryfoster
Author

Hmm, after messing around a bit more I'm starting to feel like I'm missing an important step between ./run fdsys --collections=BILLSTATUS and ./run bills. It looks to me like the bills task is expecting a bootstrapped dataset to already exist in the data hierarchy, but I don't see any mention of how to achieve that in the README or the wiki.

@gregoryfoster
Author

Re-opening, didn't mean to close the issue.

@gregoryfoster gregoryfoster reopened this Feb 26, 2017
@konklone konklone linked a pull request Feb 26, 2017 that will close this issue
@konklone
Member

@gregoryfoster I filed a quick PR to fix the issue you identified: #202

However, the bills task is defunct and unused. It was designed for thomas.gov, which is now 💀 in favor of congress.gov. The fdsys task is active, and would be easier for @JoshData to speak to, as he has it up in production.

@gregoryfoster
Author

Thanks, @konklone, for the quick fix. It does take care of creating the data hierarchy through a specified Congress.

I'm a little puzzled and honestly a little distressed to hear that the bills task is regarded as defunct, as that shines a different light on GovTrack's announcement that they'll no longer support bulk data access after the 2017 summer recess. Is this project winding down?

@JoshData
Member

No no no, I re-wrote the bills task last year to convert the new official bill XML (from fdsys) into the existing JSON data format. Since GovTrack relies on the JSON format and I don't have the capacity to re-write GovTrack's importer to use the fdsys XML directly, I'm still invested in keeping the bills task running.

@JoshData
Member

The mkdir issue probably stemmed from my rewrite last year, btw. Sorry about breaking it on clean directories (which I never test on).

@gregoryfoster
Author

Whew, glad to hear, @JoshData!

Returning to the original edge case of an absent and now clean data hierarchy - should I open a separate issue to tackle a clean load scenario? Meaning: while PR #202 avoids the os.listdir errors, the bills task as written doesn't take any action on a clean directory as it's compiling the list of bill types and bill IDs from an empty data hierarchy. That seems like a more substantial chunk of work that would require traversing the fdsys sitemap metadata files (or is there an easier route?).
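To illustrate why a clean directory leads to no work being done, here is a rough sketch under assumptions: collect_bill_ids is a hypothetical name, and the data/<congress>/bills/<bill_type>/<bill_id> layout is assumed rather than taken from the codebase. If the list of bills is built only from what already exists on disk, an empty hierarchy yields an empty list.

```python
import os


def collect_bill_ids(data_dir, congress):
    """Return bill IDs discovered on disk for one congress.

    Hypothetical helper with an assumed directory layout. A missing or empty
    hierarchy produces an empty list, so a caller has nothing to process.
    """
    bills_dir = os.path.join(data_dir, str(congress), "bills")
    if not os.path.isdir(bills_dir):
        return []

    bill_ids = []
    for bill_type in sorted(os.listdir(bills_dir)):
        type_dir = os.path.join(bills_dir, bill_type)
        if not os.path.isdir(type_dir):
            continue  # skip stray files next to the bill-type directories
        bill_ids.extend(sorted(os.listdir(type_dir)))
    return bill_ids
```

Under these assumptions, something else has to populate the hierarchy before a task like this has any bills to find.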

Let me know if you want me to open a separate issue. And if you can sketch an outline of what needs to be done, I'd be happy to contribute a PR.

@konklone
Member

Apologies for confusing the issue! And I can verify what @gregoryfoster says -- #202 fixes the errors, but it still doesn't cause the bills task to do anything; it just stops with some messages about fetching 0 bills. I couldn't figure out why that was, and mistook the lack of network requests to mean it'd been retired.

@joec58

joec58 commented Jun 12, 2019

Hello,
I came to this issue report after attempting to run a clean installation of this scraper and got the error: "No such file or directory: 'data'"

This issue and #202 seem to be related to my error, even though this issue is over 2 years old and still open. #202 says "This fixes #201 by using mkdir_p as necessary when examining data paths on disk.", but without any specific directions on how or where that fix should be applied.

After reading the last 2 comments here, I have to ask if this scraper is still being maintained? If so, where can I find directions on how to fix this issue? Thanks.

@JoshData
Member

Hi.

At GovTrack we use this project extensively.

Unfortunately we don't have the resources to fix problems that we're not experiencing ourselves, though. This repository was created at a time when multiple well-funded organizations (besides us) were investing in creating a shared data ecosystem for legislative data, but now some of those organizations effectively don't exist anymore.

@joec58

joec58 commented Jun 12, 2019

Thanks for your quick response.

I started a project several years ago with GovTrack (GT) bulk data. When I came back to it last year the GT data was no longer online. I found parts of it on ProPublica and elsewhere but some parts I can’t find, like the set of Amendments.

I will spend some time over the next few days trying to figure this scraper out. If it can produce what I’m looking for I will post the fix. I might even try to fork it to Python3 since Python2 is due to be obsolete next year.

@dwillis
Member

dwillis commented Jun 13, 2019

@joec58 Can I ask specifically which scraper you're running that doesn't create a data directory? I ask because I cloned the repository into a new directory and ran ./run govinfo --bulkdata=BILLSTATUS and it created a data directory.

@joec58

joec58 commented Jun 13, 2019

You are right. My mistake for not reading the instructions carefully. I ran ./run bills without first running ./run govinfo --bulkdata=BILLSTATUS.
