TX salaries #4
Comments
Added the city, k12, and university files. |
is anything remaining or are we done? if done, let's close the issue! yay! |
if you used any scripts, push them to scripts/tx/tx_tribune.R |
Yup that's everything. I ended up doing it manually, which was actually quicker than writing a script would have been. Complete! |
just so we are on the same page, you did an rbind on the cities etc. to create a city-year and school-year level file? thx. |
Ah no - the districts/unis/cities are still separated by each school/city etc. I will reopen and bind them. |
The data is really dirty (every file has different column names for the same thing, e.g., "Annual Salary", "Annual Rt", "Rate of Pay"). What are the primary variables we are concerned with? Name, Title, and Salary? |
Interesting.
|
Ok I pushed the data. I normalized the column names that were particularly important (First Name, Last Name, Full Name, Annual Salary); there are a lot of unique columns remaining. Given how much time it was taking and how many files there were, I wanted to get this out to you guys to see how much more time it is worth to normalize other columns like ethnicity. There is still lots of cleaning to do re: column formats, merging columns, and ensuring that 'annual salary' really is an annual salary (i.e., some require multiplying by a corresponding total hours per year when 'annual salary' is actually a rate, say below $1000 for example). |
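(Editorial aside: a minimal sketch of that last check, flagging "annual salary" values that look like periodic rates. The column names `annual_salary` and `hours_per_year` and the $1000 cutoff are illustrative assumptions, not the actual TX column names.)

```r
# Sketch: flag 'annual salary' values that look like rates and, where an
# hours column exists, convert them to a true annual figure.
# Column names and the $1000 cutoff are assumptions, not the real TX layout.
normalize_salary <- function(df, cutoff = 1000) {
  looks_like_rate <- !is.na(df$annual_salary) & df$annual_salary < cutoff
  if ("hours_per_year" %in% names(df)) {
    df$annual_salary[looks_like_rate] <-
      df$annual_salary[looks_like_rate] * df$hours_per_year[looks_like_rate]
  } else {
    # No hours column to convert with; just flag the rows for manual review.
    df$salary_flag <- looks_like_rate
  }
  df
}
```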
Gaurav, what do you think? I anticipate having to work on column names and such with a lot of the other data sets, as well as doing the work of figuring out what each column means and turning them into data that we can use. Do you think it makes sense for Tristan to work on this now, since I will probably be doing something like this later with a lot of other data sets? Thanks
|
1. @tristanjkaiser --- I will take a look and get back to you.
2. @vinay-pimple, you misunderstand the point. Tristan is working within-state and on a particular scrape; you will be working across states.
|
@tristanjkaiser: can you post the scripts you have been working on so that I can see what is going on? Thanks. |
With regard to how to handle the col headers of many data sets, this may be overkill, but I'm dealing with a very similar problem at work and wanted to share how we approached it. I'm on a long-term research project concerned with food safety in China, and one of the key pieces of data the team wants to analyze is government food inspection records. So I'm constantly writing scrapers for different government websites, most of which serve up data in the form of xlsx files. Each site will have dozens or hundreds of xlsx files, each file has multiple sheets, some sheets contain multiple data sets, and there is no consistency across files in format/structure. There's also very little consistency in column headers (e.g., we currently have 24 different strings that all mean "manufacturer").
To solve this, I created a column header data dictionary: a Python dict in which each key is a column that I know I want in the final DB, and each value is a list containing known string representations of that specific column header. I also have a "skips" dict to house column headers that I know can be skipped. From there, after scraping a new website, I have a script that will read in each of the new xlsx files, iterate over the column headers, and if a header appears in either of the data dicts it knows whether to skip it or how to bin/categorize it. When it sees a header that doesn't exist in either dict, it throws an error and prints to console the xlsx file name, the sheet name, the col header string, and the first five observations from that col (it also translates some of this info, because I don't speak Mandarin lol), and I have to manually add the new header string to one of the data dicts.
It's a rather brute-force approach, not very elegant, but it's worked pretty well for us so far. And again, it may be overkill for this project, but I figured I'd share. If anyone has questions about this, just let me know. |
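(Editorial aside: for concreteness, here is a minimal sketch of that header-dictionary idea translated into R. ChrisMuir describes a Python implementation; the canonical names, variant strings, and skip list below are made up for illustration.)

```r
# Sketch of the header-dictionary approach, translated to R.
# Canonical names and their known variants are illustrative, not the real lists.
header_map <- list(
  annual_salary = c("Annual Salary", "Annual Rt", "Rate of Pay"),
  first_name    = c("First Name", "FIRST_NAME", "Fname"),
  last_name     = c("Last Name", "LAST_NAME", "Lname")
)
skip_headers <- c("Row ID", "Notes")  # headers we know we can ignore

# Map raw column headers to canonical names; stop loudly on anything we have
# never seen, so it can be added to one of the lists above.
map_headers <- function(raw_headers) {
  vapply(raw_headers, function(h) {
    hit <- names(header_map)[vapply(header_map, function(v) h %in% v, logical(1))]
    if (length(hit) == 1) return(hit)
    if (h %in% skip_headers) return(NA_character_)  # NA marks a column to drop
    stop("Unknown column header: ", h)
  }, character(1))
}

map_headers(c("Annual Rt", "First Name", "Row ID"))
# Annual Rt -> annual_salary, First Name -> first_name, Row ID -> NA (skip)
```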
Thanks @ChrisMuir! What you suggest seems like a perfectly fine way to go. Also @tristanjkaiser, I see the script and will get back, once I have taken a look. This tx data is a bit unusual. |
At the back end (in both senses of the term), I was thinking of making a table with two fields:
1. Our descriptive name
2. The variations of that descriptor in the different data sets.
During the SQL queries, we could then use a function instead of a column name in the select clause, like "select GetTotalComp(table-name)", to get the relevant field from a particular table. I thought doing this at the back end would be more efficient.
We can also look for some package (or code ourselves) to find any relations between the different columns, e.g. A+B = C, A+B+C = D, A*B = C, etc. We will have to do this to make sense of some of the columns, since I don't expect all column names to be self-explanatory. I expect this to be a common feature of most if not all of our data sets.
|
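(Editorial aside: a rough sketch of the second idea above, detecting columns related by simple arithmetic. It assumes a numeric data frame, checks only pairwise sums, and uses an arbitrary tolerance; products and three-way sums would follow the same pattern.)

```r
# Sketch: look for columns related by A + B = C, within a small tolerance to
# allow for rounding. Only pairwise sums are checked here.
find_sum_relations <- function(df, tol = 0.01) {
  num  <- df[vapply(df, is.numeric, logical(1))]
  cols <- names(num)
  out  <- list()
  for (a in cols) for (b in cols) for (c in cols) {
    if (a < b && !(c %in% c(a, b))) {
      if (all(abs(num[[a]] + num[[b]] - num[[c]]) < tol, na.rm = TRUE)) {
        out[[length(out) + 1]] <- sprintf("%s + %s = %s", a, b, c)
      }
    }
  }
  unlist(out)
}
```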
@vinay-pimple what you suggest makes sense to me. Codd's third normal form for the win. To bring everyone onto the same page, one of the ideas is to also provide a web interface to the data, and @vinay-pimple is talking a bit about the backend of that. But the insights can be used for packaging data as well. TX was unusually complicated. For most datasets, I don't see this as a serious challenge. Time-consuming but doable. We would also need to get all the salaries in 2017 dollars or something. We probably also need to get PPP for each area so that we can pro-rate wages etc. But all that is on the menu and not relevant for this particular issue. I will look into hiving this discussion off into a new issue. |
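(Editorial aside: for when that new issue gets opened, converting to 2017 dollars is just a deflator ratio. The CPI values below are placeholders, not official figures.)

```r
# Sketch: convert a nominal salary to 2017 dollars with a CPI ratio.
# cpi values are placeholders; real figures would come from BLS.
cpi <- c("2015" = 237.0, "2016" = 240.0, "2017" = 245.1)
to_2017_dollars <- function(salary, year) {
  salary * cpi["2017"] / cpi[as.character(year)]
}
to_2017_dollars(50000, 2015)  # ~51709 in 2017 dollars
```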
@ChrisMuir: would you like to pick this up? |
Texas salary data is provided here:
https://salaries.texastribune.org/agencies/
You can download data for each link. For instance, clicking on Austin brings you to:
https://salaries.texastribune.org/austin/ and you have a link to 'download this data.'
The state data is already there. We need to download the city, school, and university data; combine each into city, k12, and university files; then 7zip them and upload to the relevant year folder.
Save script under scripts/tx/tx_tribune.R
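(Editorial aside: a rough starting point for scripts/tx/tx_tribune.R, assuming each entity's "download this data" link resolves to a CSV. The URL pattern, slugs, and output file names are assumptions and would need checking against the site.)

```r
# Sketch for scripts/tx/tx_tribune.R: download each entity's CSV and bind
# them into a single file per category (city, k12, university).
# csv_url() and the example slugs are assumptions; verify against the actual
# "download this data" links on salaries.texastribune.org.
library(readr)
library(dplyr)

csv_url <- function(slug) paste0("https://salaries.texastribune.org/", slug, "/?format=csv")

combine_entities <- function(slugs, out_file) {
  dat <- lapply(slugs, function(s) {
    df <- read_csv(csv_url(s), col_types = cols(.default = "c"))  # keep raw values as character
    df$entity <- s  # record which city/district/university each row came from
    df
  })
  combined <- bind_rows(dat)
  write_csv(combined, out_file)
  combined
}

# Example usage (slug lists would come from the agencies index page):
# cities <- c("austin", "houston", "dallas")
# combine_entities(cities, "tx_city.csv")
```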