Iss435 #437

Draft
wants to merge 31 commits into
base: main

ekgutierrez1
Collaborator

The city homeless student files for 2019-2022 are ready: the total homeless student file and the subgroup-by-race file. I'll now turn to the county versions of the same files.

@cdsolari cdsolari requested a review from rpitingolo December 17, 2024 21:34
@ekgutierrez1
Collaborator Author

The county homeless student files for 2019-2022 are ready: the total homeless student file and the subgroup-by-race file. One big flag: CT counties prior to 2022 are not in the crosswalk and are therefore not in this data. This will need to be rerun once the crosswalk is fixed.

awunderground and others added 10 commits January 2, 2025 15:36
Added 2014 PEP population data into the crosswalk manually since the API is limited
…gh 2021

As discussed, aligning with the data team decision to maintain the original 8 CT counties through 2021.
Also manually adding PEP population data for 2014 to complete previously missing data.
Kicked off more specific documentation about the crosswalks in the README - will revisit as needed
@ekgutierrez1
Collaborator Author

CT is now available for prior years based on the updated crosswalk, both for all homelessness (2019-2022) and for subgroups (2019-2022).

@rpitingolo
Contributor

@ekgutierrez1 just finished reviewing. I will have line item comments in a bit. I need to do another pull to update for the commit you just pushed.

@rpitingolo rpitingolo left a comment

@ekgutierrez1 added line item comments! Many of the comments on the city code also apply to the county code, as much of the code is duplicated.

cap n ssc install libjson
net install educationdata, replace from("https://urbaninstitute.github.io/education-data-package-stata/")

*Set up globals and directories
Contributor

In the checklist on the repo Wiki, under reproducibility, there is "The program runs from start to finish without stopping due to errors or incompleteness". This currently does not pass because of some of the code in the setup section: the code will error if the folders in the mkdir lines already exist. I'm not sure how to do this off the top of my head, but the code should first check whether each folder exists and only run mkdir if it doesn't.
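One possible approach, sketched here with hypothetical folder names rather than the ones in the do-file: instead of testing for the folder first, prefix mkdir with capture so that Stata ignores the error raised when the folder already exists and simply moves on. This swaps an explicit existence check for error suppression, which has the same effect on reruns.

```stata
* Hypothetical folder names, for illustration only
foreach dir in raw intermediate final {
    * capture suppresses the error mkdir raises if the folder already exists,
    * so the loop behaves the same on a first run and on every rerun
    capture mkdir "`dir'"
}
```

With this pattern a new user still gets the folders created automatically, and a returning user sees no error messages in the log.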

Collaborator Author

The error it provides (that the folders already exist) does not prevent the rest of the code from running. If it is a new user, this code auto-creates the paths they need. Since it doesn't prevent the rest of the code from running but is necessary for the paths of the code, I'd prefer to leave as is (this is how all of my files currently run, but we can ask the higher ups).

clear all

global gitfolder "C:\Users\ekgut\OneDrive\Desktop\urban\Github2\mobility-from-poverty"
global years 2019 2020 2021 // refers to 2019-20 school year through most recent data
global gitfolder "C:\Users\ekgut\OneDrive\Desktop\urban\Github\mobility-from-poverty"
Contributor

Also from the checklist: "The program avoids hardcoding local file paths and instead uses global paths that will work regardless of where the program is being ran"

Collaborator Author

This uses a global - the user will have to hardcode this upon download. There isn't a way around this unfortunately. That's just how globals work.
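For reference, a minimal sketch of one commonly used alternative. It assumes the do-file is always started with Stata's working directory set to the repository root, which is an assumption and not something this PR establishes. Under that assumption the global can be filled from the built-in c(pwd) value instead of a typed path.

```stata
* Assumption: Stata's working directory is the repository root when this runs
global gitfolder "`c(pwd)'"

* Forward slashes are accepted by Stata on Windows, macOS, and Linux
cd "${gitfolder}/02_housing/data"
```

If the working-directory assumption does not hold for a given user, the hardcoded global remains the fallback.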

*****************************
****City/Place Crosswalk*****
*****************************
** Import city crosswalk file to edit names of city crosswalk to match city location strings in CCD school district data
Contributor

It would be helpful if the comments described more specifically what is happening. For example, it looks like you are converting the place codes from numeric to string and then adding leading zeros, but the comment doesn't really describe that.

Collaborator Author

I added relevant comments to this and the county dofiles.

}
save "intermediate/ccd_lea_recent_city_race.dta", replace

*merge two ccd datasets together
Contributor

It looks like the merge has 3,896 unmatched records. Is that expected? It would be helpful in the comment to describe what the expectation is.

Collaborator Author

Yes, it's expected. I added a line and a comment to explain.
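A rough sketch of how such an expectation could be recorded in code as well as in a comment. The toy data, variable names, and expected count of 2 are all hypothetical; in the actual do-file the expected number would be the 3,896 unmatched records mentioned above, provided that number is stable across reruns.

```stata
* Self-contained toy example; names and values are hypothetical
clear
input leaid enrollment
1 100
2 200
3 300
end
tempfile enroll
save `enroll'

clear
input leaid homeless
1 5
2 8
4 2
end

merge 1:1 leaid using `enroll'

* Expectation: exactly 2 unmatched records (leaid 3 and leaid 4).
* The assert stops the do-file if a future rerun produces a different count.
count if _merge != 3
assert r(N) == 2
```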

unzipfile "EdDataEx Homelessness `year'.zip", replace
}
cd "${gitfolder}\02_housing\data"

*import csvs
*Due to changes in EdDataExpress website, 2022-23 data must be manually downloaded. Please follow the following steps.
Contributor

This causes me a lot of stress. When I went to the website, there was a big red banner that said:

"Due to current system issues, datasets must be limited to fewer than 150,000 rows. A selection of data greater than 150,000 rows may result in a truncated dataset."

It looks like we may need an export larger than 150k so this may be a problem. Aside from that, it requires an amount of human intervention that is prone to error.

I don't know what the best solution is if the file is not able to be downloaded programmatically. Perhaps @awunderground or @jwalsh28 can weigh in?

Collaborator Author

I worked this out with Claudia as the best case solution for this - but welcome other comments. Their website is problematic right now, so we gave instructions as best as possible in the comments for how to download the data. If you follow the instructions, it should work without the 150,000 problem.

}

*new as of 4/13/23 - updated 2/8/24 - in ACS-based metrics, if it was less than 30, it's set to NA
*if aggregated enrollment is less than 30, quality of the variable is 3
replace homeless_quality = 3 if enrollment<30
Contributor

Should the replace homeless_quality = 3 if enrollment<30 line be part of the loop?

Collaborator Author

No, the next set of code/loop takes care of the individual race categories. If the total homeless student count is less than 30, regardless of the individual race counts, the quality should be 3. However, if the homeless counts for the individual race categories are less than 10, they should be replaced with NA, because those categories are smaller than the aggregated counts.

replace `var'_count=-1 if `var'_count<=2
}

*merge to crosswalk of places/cities
merge 1:1 year state city_name using "intermediate/cityfile.dta"
Contributor

We are losing the vast majority of cases on this merge. Is that right? It felt like a red flag to me.

Collaborator Author

It is correct because we only care about _merge values 2 and 3. I added some comments in the dofiles to help clarify.


*check quality
****************
*Quality Checks
Contributor

A lot of the quality checks are difficult for me to interpret as someone without much topical knowledge of the data. For checks that create a large number of summary stats, it would be helpful to know what to look for in those numbers. Alternatively, pass/fail checks would be a bit easier to interpret.

Collaborator Author

Added assert commands where relevant and comments for other checks.
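As an illustration of what pass/fail checks of this kind can look like, here is a minimal sketch on toy data. The variable names (homeless_share, homeless_quality) and the assumption that quality flags take the values 1 to 3 are illustrative guesses, not a copy of the checks added to the do-files.

```stata
* Toy data; variable names and values are hypothetical
clear
input year str2 state str3 county homeless_share homeless_quality
2019 "17" "061" .02 3
2019 "17" "063" .05 1
2020 "17" "061" .   .
end

* Pass/fail: exactly one row per county-year
isid year state county

* Pass/fail: shares are proportions whenever they are present
assert inrange(homeless_share, 0, 1) if !missing(homeless_share)

* Pass/fail: quality flags are 1, 2, or 3 whenever they are present
assert inlist(homeless_quality, 1, 2, 3) if !missing(homeless_quality)
```

Each command either passes silently or stops the do-file with an error, so a reviewer does not need topical knowledge to read the result.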

*we replace other==. to other==1 to mirror lines 147 in the other 4 race categories
replace other = 1 if other==. & year==2019
*we replace other==. to other==1 to mirror other 4 race categories
replace other = 1 if other==.
Contributor

There are two lines of commented out code in the loop:

*replace `var'="1" if `var'=="S"
*destring `var', replace

Is this temporary or permanent? If permanent please delete these lines. It doesn't look like this is commented out in the city code.

Collaborator Author

Great catch, I have now deleted that commented out code.

tab `var'_quality if `var'_share==.
tab `var'_quality if `var'_count==.
}
*To reviewer: in 2019, state 17 and county 061 there is an instance where black enrollment is 0 but homeless count is 3.
Contributor

If we think this is an issue caused by the underlying data, then it should probably be switched to the most conservative (unreliable data) flag and documented somewhere.

Collaborator Author

It's already a flag = 3 (most unreliable), but based on the underlying data, I think it should actually be NA. Making this change now.

jwalsh28 and others added 5 commits January 13, 2025 12:46
I separated the three subgroups from the ELA subgroup data into three separate eval forms. The "ela_subgroup_county" file should be deleted
Fix final evaluation function
7 participants