Idaho 2013 pdf to CSV #23
Have a little free time and am working on this now. What's the source URL of this PDF? I don't see it listed in the ID README, and I just want to include it in the comments at the top of the script file.
2013 is from: https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf But as you see on the title page, the data are from Transparent Idaho. Will get a link from that site. There are more PDFs like this on the Transparent Idaho website, including for 2018:
Cooool, thanks!
Just finished extracting data from the 2013, 2014, and 2018 PDFs, and pushed the 7z files and script files to the repo. This ended up being a huge pain: for some reason pdftools was working just fine for the 2013 PDF but then stopped working about a week ago, and from that point on it wouldn't work for any of the ID PDF files. By "wouldn't work", I mean pdftools::pdf_text() returns an empty string for every page. The workaround I'm using is system2("pdftotext", args = c("-table", path_to_pdf_file)), which is pretty hacky. I'm working on a PC, and I'm not sure if that will work on any other OS. Also, as of now the three ID script files (one per PDF) are effectively identical; at some point I will replace them with a single script that reads from and writes to each yearly folder.
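For anyone reproducing this, the system2() call quoted above could be wrapped roughly as follows. This is only a sketch under assumptions: it assumes a `pdftotext` command-line binary is installed and on the PATH, the `-table` flag is taken as-is from the comment above and may not exist in every pdftotext build, and the function names `pdf_to_txt_path` and `extract_with_pdftotext` are made up for illustration.

```r
# Hypothetical helper: path of the .txt file that pdftotext writes when no
# output file is given (same name as the PDF, with a .txt extension).
pdf_to_txt_path <- function(pdf_path) {
  sub("\\.pdf$", ".txt", pdf_path, ignore.case = TRUE)
}

# Hypothetical wrapper around the system2() call from the comment above.
# Assumes `pdftotext` is on the PATH; `flags` defaults to the "-table"
# flag quoted in the thread.
extract_with_pdftotext <- function(pdf_path, flags = "-table") {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext not found on PATH")
  }
  # shQuote() protects paths that contain spaces, e.g. "Idaho 2013.pdf".
  status <- system2("pdftotext", args = c(flags, shQuote(pdf_path)))
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }
  # Return the extracted text as a character vector, one element per line.
  readLines(pdf_to_txt_path(pdf_path), warn = FALSE)
}
```

Checking `Sys.which()` first gives a clearer error on machines where the binary is missing, which is the main portability worry raised above.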
oy! sorry to hear. pdftools:
no worries on the 3 scripts. and congrats on getting across the line on this one! seems v. painful and that is where some new software is born! :-)
Yeah, it's all good. What's weirdest is that I was initially working with the 2013 doc on a Mac, then the issue started happening about a week ago. I tested it on my work PC and it was doing the same thing (and it persists for all of the Idaho PDF docs), so the pdftools issue is cutting across Mac and PC for me. Actually, do you mind trying it yourself? Try running:
url <- "https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"
txt <- pdftools::pdf_text(url)
and let me know if it works for you. For reference, it reads a single empty string for each page for me, so this resolves to:
identical(
  pdftools::pdf_text("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"),
  rep("", 1012)
)
#> TRUE
Just let me know what results you get if you don't mind.
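The check above hard-codes the page count (1012). A minimal sketch of a more general version follows; `all_pages_empty` is a hypothetical helper name, not part of pdftools, and the toy vectors stand in for real `pdftools::pdf_text()` output.

```r
# Hypothetical helper: TRUE when every page of extracted text is blank.
# `txt` is a character vector with one element per page, which is the
# shape pdftools::pdf_text() returns.
all_pages_empty <- function(txt) {
  length(txt) > 0 && all(!nzchar(trimws(txt)))
}

# Toy inputs standing in for pdf_text() output:
all_pages_empty(rep("", 3))             # every page blank
all_pages_empty(c("", "Name  Salary"))  # at least one page has text
```

A check like this could decide when to fall back to a different extraction method instead of comparing against a fixed-length vector.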
Dear @ChrisMuir, reason for delay = URL is now dead.
No worries on delay, thanks for trying and for the heads up!
https://github.com/public-salaries/public_salaries/tree/master/id/2013