TX salaries #4
Comments
Added the city, k12, and university files. |
is anything remaining or are we done? if done, let's close the issue! yay! |
if you used any scripts, push them to scripts/tx/tx_tribune.R |
Yup that's everything. I ended up doing it manually, which was actually quicker than writing a script would have been. Complete! |
just so we are on the same page, you did an rbind on the cities etc. to create a city-year and school-year level file? thx. |
Ah no - the districts/unis/cities are still separated by each school/city etc. I will reopen and bind them. |
The data is really dirty (every file has different column names for the same thing, e.g., "Annual Salary", "Annual Rt", "Rate of Pay"). What are the primary variables we are concerned with? Name, Title, and Salary? |
Interesting.
|
Ok I pushed the data. I normalized the column names that were particularly important (First Name, Last Name, Full Name, Annual Salary); there are a lot of unique columns remaining. Given how much time it was taking and how many files there were, I wanted to get this out to you guys to see how much more time it is worth to normalize other columns like ethnicity. There is still lots of cleaning to do re: column formats, merging columns, and ensuring that 'annual salary' really is an annual salary (i.e., some require multiplying by a corresponding total hours per year when 'annual salary' is actually a rate, say below $1000 for example). |
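(Editorial aside: a minimal sketch of that last check, flagging "annual salary" values that look like periodic rates. The column names `annual_salary` and `hours_per_year` and the $1000 cutoff are illustrative assumptions, not the actual TX column names.)

```r
# Sketch: flag 'annual salary' values that look like rates and, where an
# hours column exists, convert them to a true annual figure.
# Column names and the $1000 cutoff are assumptions, not the real TX layout.
normalize_salary <- function(df, cutoff = 1000) {
  looks_like_rate <- !is.na(df$annual_salary) & df$annual_salary < cutoff
  if ("hours_per_year" %in% names(df)) {
    df$annual_salary[looks_like_rate] <-
      df$annual_salary[looks_like_rate] * df$hours_per_year[looks_like_rate]
  } else {
    # No hours column to convert with; just flag the rows for manual review.
    df$salary_flag <- looks_like_rate
  }
  df
}
```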
Gaurav, what do you think? I anticipate having to work on column names and such with a lot of the other data sets, as well as doing the work of figuring out what each column means and turning them into data that we can use. Do you think it makes sense for Tristan to work on this now, since I will probably be doing something like this later with a lot of other data sets? Thanks
|
1. @tristanjkaiser --- I will take a look and get back to you.
2. @vinay-pimple, you misunderstand the point. Tristan is working within-state and on a particular scrape; you will be working across states.
|
@tristanjkaiser: can you post the scripts you have been working on so that I can see what is going on? Thanks. |
With regard to how to handle the col headers of many data sets, this may be overkill, but I'm dealing with a very similar problem at work and wanted to share how we approached it. I'm on a long-term research project concerned with food safety in China, and one of the key pieces of data the team wants to analyze is government food inspection records. So I'm constantly writing scrapers for different government websites, most of which serve up data in the form of xlsx files. Each site will have dozens or hundreds of xlsx files, each file has multiple sheets, some sheets contain multiple data sets, and there is no consistency across files in format/structure. There's also very little consistency in column headers (e.g., we currently have 24 different strings that all mean "manufacturer").
To solve this, I created a column header data dictionary: a Python dict in which each key is a column that I know I want in the final DB, and each value is a list containing known string representations of that specific column header. I also have a "skips" dict to house column headers that I know can be skipped. From there, after scraping a new website, I have a script that will read in each of the new xlsx files, iterate over the column headers, and if a header appears in either of the data dicts it knows whether to skip it or how to bin/categorize it. When it sees a header that doesn't exist in either dict, it throws an error and prints to console the xlsx file name, the sheet name, the col header string, and the first five observations from that col (it also translates some of this info, because I don't speak Mandarin lol), and I have to manually add the new header string to one of the data dicts.
It's a rather brute-force approach, not very elegant, but it's worked pretty well for us so far. And again, it may be overkill for this project, but I figured I'd share. If anyone has questions about this, just let me know. |
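(Editorial aside: for concreteness, here is a minimal sketch of that header-dictionary idea translated into R. ChrisMuir describes a Python implementation; the canonical names, variant strings, and skip list below are made up for illustration.)

```r
# Sketch of the header-dictionary approach, translated to R.
# Canonical names and their known variants are illustrative, not the real lists.
header_map <- list(
  annual_salary = c("Annual Salary", "Annual Rt", "Rate of Pay"),
  first_name    = c("First Name", "FIRST_NAME", "Fname"),
  last_name     = c("Last Name", "LAST_NAME", "Lname")
)
skip_headers <- c("Row ID", "Notes")  # headers we know we can ignore

# Map raw column headers to canonical names; stop loudly on anything we have
# never seen, so it can be added to one of the lists above.
map_headers <- function(raw_headers) {
  vapply(raw_headers, function(h) {
    hit <- names(header_map)[vapply(header_map, function(v) h %in% v, logical(1))]
    if (length(hit) == 1) return(hit)
    if (h %in% skip_headers) return(NA_character_)  # NA marks a column to drop
    stop("Unknown column header: ", h)
  }, character(1))
}

map_headers(c("Annual Rt", "First Name", "Row ID"))
# Annual Rt -> annual_salary, First Name -> first_name, Row ID -> NA (skip)
```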
Thanks @ChrisMuir! What you suggest seems like a perfectly fine way to go. Also @tristanjkaiser, I see the script and will get back, once I have taken a look. This tx data is a bit unusual. |
At the back end (in both senses of the term), I was thinking of making a table with two fields:
1. Our descriptive name
2. The variations of that descriptor in the different data sets.
During the SQL queries, we could then use a function instead of a column name in the select clause, like "select GetTotalComp(table-name)", to get the relevant field from a particular table. I thought doing this at the back end would be more efficient.
We can also look for some package (or code ourselves) to find any relations between the different columns, e.g. A+B = C, A+B+C = D, A*B = C, etc. We will have to do this to make sense of some of the columns, since I don't expect all column names to be self-explanatory. I expect this to be a common feature of most if not all of our data sets.
|
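(Editorial aside: a rough sketch of the second idea above, detecting columns related by simple arithmetic. It assumes a numeric data frame, checks only pairwise sums, and uses an arbitrary tolerance; products and three-way sums would follow the same pattern.)

```r
# Sketch: look for columns related by A + B = C, within a small tolerance to
# allow for rounding. Only pairwise sums are checked here.
find_sum_relations <- function(df, tol = 0.01) {
  num  <- df[vapply(df, is.numeric, logical(1))]
  cols <- names(num)
  out  <- list()
  for (a in cols) for (b in cols) for (c in cols) {
    if (a < b && !(c %in% c(a, b))) {
      if (all(abs(num[[a]] + num[[b]] - num[[c]]) < tol, na.rm = TRUE)) {
        out[[length(out) + 1]] <- sprintf("%s + %s = %s", a, b, c)
      }
    }
  }
  unlist(out)
}
```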
@vinay-pimple what you suggest makes sense to me. Codd's third normal form for the win. To bring everyone onto the same page, one of the ideas is to also provide a web interface to the data, and @vinay-pimple is talking a bit about the backend of that. But the insights can be used for packaging data as well. TX was unusually complicated. For most datasets, I don't see this as a serious challenge. Time-consuming but doable. We would also need to get all the salaries in 2017 dollars or something. We probably also need to get PPP for each area so that we can pro-rate wages etc. But all that is on the menu and not relevant for this particular issue. I will look into hiving this discussion off into a new issue. |
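(Editorial aside: for when that new issue gets opened, converting to 2017 dollars is just a deflator ratio. The CPI values below are placeholders, not official figures.)

```r
# Sketch: convert a nominal salary to 2017 dollars with a CPI ratio.
# cpi values are placeholders; real figures would come from BLS.
cpi <- c("2015" = 237.0, "2016" = 240.0, "2017" = 245.1)
to_2017_dollars <- function(salary, year) {
  salary * cpi["2017"] / cpi[as.character(year)]
}
to_2017_dollars(50000, 2015)  # ~51709 in 2017 dollars
```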
@ChrisMuir: would you like to pick this up? |
Texas salary data is provided here:
https://salaries.texastribune.org/agencies/
You can download data for each link. For instance, clicking on Austin brings you to:
https://salaries.texastribune.org/austin/ and you have a link to 'download this data.'
The state data is already there. We need to download the city, school, and university data; combine each into city, k12, and university files; then 7zip them and upload to the relevant year folder.
Save script under scripts/tx/tx_tribune.R
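(Editorial aside: a rough starting point for scripts/tx/tx_tribune.R, assuming each entity's "download this data" link resolves to a CSV. The URL pattern, slugs, and output file names are assumptions and would need checking against the site.)

```r
# Sketch for scripts/tx/tx_tribune.R: download each entity's CSV and bind
# them into a single file per category (city, k12, university).
# csv_url() and the example slugs are assumptions; verify against the actual
# "download this data" links on salaries.texastribune.org.
library(readr)
library(dplyr)

csv_url <- function(slug) paste0("https://salaries.texastribune.org/", slug, "/?format=csv")

combine_entities <- function(slugs, out_file) {
  dat <- lapply(slugs, function(s) {
    df <- read_csv(csv_url(s), col_types = cols(.default = "c"))  # keep raw values as character
    df$entity <- s  # record which city/district/university each row came from
    df
  })
  combined <- bind_rows(dat)
  write_csv(combined, out_file)
  combined
}

# Example usage (slug lists would come from the agencies index page):
# cities <- c("austin", "houston", "dallas")
# combine_entities(cities, "tx_city.csv")
```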