Idaho 2013 pdf to CSV #23

soodoku · 2018-01-05T17:37:59Z

https://github.com/public-salaries/public_salaries/tree/master/id/2013

ChrisMuir · 2018-01-10T02:10:36Z

Have a little free time, so I'm working on this now. What's the source URL of this PDF? I don't see it listed in the ID README, and I want to include it in the comments at the top of the script file.

soodoku · 2018-01-10T02:26:19Z

2013 is from: https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf

But as you can see on the title page, the data are from Transparent Idaho. will get a link from that site. there are more PDFs like this on the Transparent Idaho website, including one for 2018:
https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf

soodoku · 2018-01-10T02:26:52Z

2014 here: http://mediad.publicbroadcasting.net/p/kisu/files/workforce.pdf

ChrisMuir · 2018-01-10T02:28:38Z

Cooool, thanks!

ChrisMuir · 2018-01-18T21:52:59Z

Just finished extracting data from the 2013, 2014, and 2018 PDFs, and pushed the 7z files and script files to the repo.

This ended up being a huge pain. For some reason pdftools was working just fine for the 2013 PDF, but then it stopped working about a week ago, and from that point on it wouldn't work for any of the ID PDF files. By "wouldn't work", I mean pdf_text would read the correct number of pages in the doc but return an empty string for each page. I ended up writing a custom function that mimics pdftools::pdf_text and calls

system2("pdftotext", args = c("-table", path_to_pdf_file))

which is pretty hacky. I'm working on a PC, so I'm not sure whether that will work on any other OS.
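For readers who want to try the same workaround, here is a minimal sketch of such a wrapper. It is not the function pushed to the repo (that code is not shown in this thread). The names `pdf_text_cli` and `split_pages` are hypothetical, and the sketch assumes a `pdftotext` binary on the PATH that accepts the `-table` flag used above and separates pages with a form feed character. Builds that lack `-table` could try passing `-layout` instead.

```r
# Hypothetical helper: split raw pdftotext output into one string per page.
# Assumes pages are separated by a form feed ("\f").
split_pages <- function(raw_text) {
  strsplit(raw_text, "\f", fixed = TRUE)[[1]]
}

# Hypothetical stand-in for pdftools::pdf_text() that shells out to pdftotext.
pdf_text_cli <- function(path_to_pdf_file, extra_args = "-table") {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  out_file <- tempfile(fileext = ".txt")
  on.exit(unlink(out_file), add = TRUE)

  # Write the extracted text to a temporary file and check the exit status.
  status <- system2(
    "pdftotext",
    args = c(extra_args, shQuote(path_to_pdf_file), shQuote(out_file))
  )
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }

  raw_text <- paste(readLines(out_file, warn = FALSE), collapse = "\n")
  split_pages(raw_text)
}
```

Like `pdftools::pdf_text`, this returns a character vector with one element per page, so downstream parsing code should not need to change. It expects a local file path, so a PDF referenced by URL would have to be downloaded first.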

Also, as of now the three ID script files, one per PDF, are effectively identical. At some point I will replace them with a single script that reads from and writes to each yearly folder.
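A rough sketch of what a single consolidated script could look like, assuming one folder per year under a base directory. The function name `process_year`, the `extract_fun` argument, and the page-per-row CSV output are all placeholders for illustration; the real scripts parse the page text into salary records, which is not reproduced here.

```r
# Hypothetical driver: process every PDF found in one yearly folder.
# `extract_fun` is any function that takes a PDF path and returns a
# character vector with one element per page.
process_year <- function(year, extract_fun, base_dir = "id") {
  year_dir <- file.path(base_dir, year)
  pdf_files <- list.files(
    year_dir, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE
  )
  for (pdf in pdf_files) {
    pages <- extract_fun(pdf)
    # Placeholder output: one row per page. A real script would parse
    # the page text into columns before writing.
    out_csv <- sub("\\.pdf$", ".csv", pdf, ignore.case = TRUE)
    write.csv(
      data.frame(page = seq_along(pages), text = pages),
      out_csv, row.names = FALSE
    )
  }
  invisible(pdf_files)
}
```

With that in place, the per-year scripts reduce to a loop such as `for (y in c("2013", "2014", "2018")) process_year(y, extract_fun = pdftools::pdf_text)`, with whichever extractor actually works swapped in.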

soodoku · 2018-01-18T22:02:45Z

oy! sorry to hear.

pdftools:

dk on the situation with pdftools, but post windows update some stuff may need admin privs to run correctly, as the function may be calling something else in the backend. always worth a try to run as admin.
i did notice that my MiKTeX conked out a week ago also, so i had to reinstall it and set up the path etc. again.
the other alternative to pdftools = ABBYY FineReader. it isn't free, but they have an API and there is an R wrapper. ABBYY is generally considered best in class for commercial OCR.

no worries on the 3 scripts. and congrats on getting across the line on this one! seems v. painful and that is where some new software is born! :-)

ChrisMuir · 2018-01-19T03:42:14Z

Yeah, it's all good. What's weirdest is that I was initially working with the 2013 doc on a Mac, then the issue started happening about a week ago, tested it on my work PC and it was doing the same thing (and is persisting for all of the Idaho pdf docs).....so the pdftools issue is cutting across Mac and PC for me.

Actually, do you mind trying it yourself? Try running:

url <- "https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"
txt <- pdftools::pdf_text(url)

and let me know if it works for you. For reference, it reads a single empty string for each page for me....so this resolves to TRUE for me:

identical(
  pdftools::pdf_text("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"), 
  rep("", 1012)
)
#> TRUE

Just let me know what results you get if you don't mind.
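The same check can be written without hard-coding the page count, and used to fall back to a second extractor automatically. This is only a sketch: `all_pages_empty` and `extract_with_fallback` are hypothetical names, and `primary` and `fallback` are assumed to be functions that take a path and return one string per page, as `pdftools::pdf_text` does.

```r
# TRUE when the extractor returned at least one page and every page is
# empty or whitespace-only (the failure mode described above).
all_pages_empty <- function(pages) {
  length(pages) > 0 && all(!nzchar(trimws(pages)))
}

# Try the primary extractor; if it yields only empty pages, try the fallback.
extract_with_fallback <- function(path, primary, fallback) {
  pages <- primary(path)
  if (all_pages_empty(pages)) {
    message("primary extractor returned only empty pages; trying fallback")
    pages <- fallback(path)
  }
  pages
}
```

For example, `extract_with_fallback(path, pdftools::pdf_text, my_cli_reader)` would only invoke `my_cli_reader` (a placeholder for any command-line based reader) when pdftools comes back blank.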

soodoku · 2018-01-20T22:06:40Z

dear @ChrisMuir,

reason for delay = URL is now dead.
tried on both linux and windows --- same result --- bunch of empty strings.

ChrisMuir · 2018-01-21T00:38:38Z

No worries on delay, thanks for trying and for the heads up!

soodoku assigned ChrisMuir Jan 5, 2018

ChrisMuir added a commit that referenced this issue Jan 11, 2018

add ID data and script, issue #23

c3f4532

ChrisMuir added a commit that referenced this issue Jan 18, 2018

add ID data and scripts for 2013, 2014, 2018, issue #23

f023c42
