Parse UW Co-op Package PDFs

Parse out relevant information from co-op student resume package PDFs provided by the University of Waterloo.

NOTE: This program is not perfect, and the results (email address, etc.) will need to be manually inspected for errors and fixed.

Installation

Ensure that Go is installed and set up with a working $GOPATH.

Installation and sanity check:

$ go get github.com/curvegrid/parse-uw-coop-package
$ parse-uw-coop-package -h

Usage

This assumes you are an employer of University of Waterloo co-operative education (co-op, interns) students and have a valid Employer login on WaterlooWorks.

1. Post a job on WaterlooWorks and wait for student applications to become available.
2. Log in to WaterlooWorks, navigate to the applications list, and click the blue 'Application Options' button near the top of the page to create a custom application bundle with each application as a separate PDF.
3. Download and unzip the consolidated package.
4. Install this utility, parse-uw-coop-package.
5. From the directory where you unzipped the consolidated package of PDFs, run parse-uw-coop-package and redirect the output to a CSV file (e.g., parse-uw-coop-package > applicants.csv). You can tweak the options (try parse-uw-coop-package -h) as required.
6. Import the CSV file into your spreadsheet of choice. As noted above, manual cleanup will be required.

Running and Command Line Options

By default, the program searches the current directory for all PDFs whose filenames match a regular expression (-fileregex) and parses the text within each for fields specific to UW co-op.

Usage of parse-uw-coop-package:
  -averagesRegex string
    	Regex for averages (default "Term Average:\\s*([0-9]{2}\\.*[0-9]*)")
  -concurrency int
    	Number of PDF parsing threads to run in parallel (default 4)
  -coverLetterRegex string
    	Regex for cover letter yes/no (default "[Ss]incerely|[Hh]iring [Mm]anager")
  -emailRegex string
    	Regex for email address (default "[A-Za-z0-9_.-]+\\@[A-Za-z0-9.-]+\\.[A-Za-z0-9]+")
  -fileregex string
    	Regex filter for filenames (default "([a-zA-Z ]+)-([a-zA-Z ]+)-[0-9]+-.*.pdf")
  -githubRegex string
    	Regex for Github (default "github.com/[A-Za-z0-9_.-]+")
  -idregex string
    	Regex filter for student IDs (default "[0-9]{7,10}")
  -linkedInRegex string
    	Regex for LinkedIn (default "linkedin.com/in/[A-Za-z0-9_.-]+")
  -pdftoascii string
    	PDF to ASCII converter (default "pdftotext %s -")
  -worktermEvalRegex string
    	Regex for work term evaluations (default "UNSATISFACTORY|MARGINAL|SATISFACTORY|VERY GOOD|EXCELLENT|OUTSTANDING")
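
To get a feel for what the default patterns match before changing them, they can be tried against sample input. The sketch below copies three defaults from the usage text (with the doubled backslashes from the help output unescaped); the helper names and the filename "Able-Baker-1234567-package.pdf" are hypothetical, and this is not necessarily how the program itself applies the patterns.

```go
package main

import (
	"fmt"
	"regexp"
)

// Default patterns as listed in the usage text above.
var (
	fileRegex     = regexp.MustCompile(`([a-zA-Z ]+)-([a-zA-Z ]+)-[0-9]+-.*.pdf`)
	emailRegex    = regexp.MustCompile(`[A-Za-z0-9_.-]+\@[A-Za-z0-9.-]+\.[A-Za-z0-9]+`)
	averagesRegex = regexp.MustCompile(`Term Average:\s*([0-9]{2}\.*[0-9]*)`)
)

// nameParts returns the two capture groups of the filename pattern,
// or ok=false when the filename does not match.
func nameParts(filename string) (first, second string, ok bool) {
	m := fileRegex.FindStringSubmatch(filename)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

// firstEmail returns the leftmost email-like match, or "" if none.
func firstEmail(text string) string {
	return emailRegex.FindString(text)
}

// termAverages returns the captured number from every "Term Average:" match.
func termAverages(text string) []string {
	var out []string
	for _, m := range averagesRegex.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	a, b, ok := nameParts("Able-Baker-1234567-package.pdf")
	fmt.Println(a, b, ok)
	fmt.Println(firstEmail("Contact: able.baker@example.com, 519-555-0100"))
	fmt.Println(termAverages("Term Average: 81\nTerm Average:84.5\n"))
}
```

Note that the patterns are unanchored, so they match anywhere in the filename or text. A custom pattern passed on the command line is matched the same way only if the program uses it as shown here, which is an assumption.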

Sample Run

$ parse-uw-coop-package 
ID,First name,Last name,Email,Email with name,LinkedIn,Github,Included a cover letter,Work term evaluations,Term averages,Overall average
123456,Able,Baker,[email protected],Able Baker <[email protected]>,,,Yes,"OUTSTANDING,OUTSTANDING,OUTSTANDING,GOOD,OUTSTANDING","72,81,84.5,72,78",73.4
...
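
The quoted fields in the sample row hold several values separated by commas inside a single CSV column. As an illustration only (the program's own output code may differ), Go's encoding/csv package produces that quoting automatically when a field contains a comma. The helper name csvLine and the shortened six-column row are hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

// csvLine encodes one record as a single CSV line, quoting any field
// that contains a comma, quote, or newline.
func csvLine(fields []string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write(fields); err != nil {
		return "", err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	evals := []string{"OUTSTANDING", "EXCELLENT"}
	avgs := []string{"72", "81"}
	line, err := csvLine([]string{
		"1234567", "Able", "Baker", "Yes",
		strings.Join(evals, ","), // multi-value field, quoted on output
		strings.Join(avgs, ","),
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(line)
}
```

Spreadsheet applications treat each quoted field as one cell, so the individual evaluations and averages need to be split again if they are wanted in separate columns.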

Known Issues and Limitations

This has only been tested on macOS.
The PDF-to-text converter defaults to pdftotext, part of Xpdf, which may not be available on your system. On macOS via Homebrew it is part of Poppler: brew install poppler. Use the -pdftoascii command line option to substitute a different converter. Previously we used ps2ascii, a non-default part of Ghostscript, which unfortunately segfaults on a high proportion of modern PDF files.
The PDF-to-text process is not perfect, especially with formatted PDFs. Email addresses seem to be especially problematic, with many of them mangled. For example, we've seen [email protected] turn into .com example je ef@ with ps2ascii, even in what seems like a fairly "standard" formatted PDF. Manual cleanup will be required.

Future enhancements

DRY up the whole program
Switch from pdftotext to a native Go PDF-to-text solution
Improve the parsing accuracy: better regexes, etc.
Direct package download, and integration with tabular info, from WaterlooWorks
Keyword extraction

Contributing

Pull requests welcome.

Development

Assuming parse-uw-coop-package was installed per the Installation section, change to the directory where go get downloaded the source, then build and run it:

$ cd $GOPATH/src/github.com/curvegrid/parse-uw-coop-package
$ go build parse-uw-coop-package.go
$ ./parse-uw-coop-package

Note that you will now have two copies of the parse-uw-coop-package binary on your system: the one in $GOPATH/bin installed by go get, and the one just built in $GOPATH/src/github.com/curvegrid/parse-uw-coop-package by go build.
