
Several thousand duplicate records in taxdata-created puf.csv #398

Open
donboyd5 opened this issue Jul 8, 2021 · 0 comments
donboyd5 commented Jul 8, 2021

Probably of interest to @andersonfrailey @MattHJensen :

I created a puf.csv using the master branch of taxdata, git cloned on 7/2/2021, by running `make cps-files` and then `make puf-files`. I believe I did this correctly.

I just finished creating state weights for this file and noticed many groups of records whose state weights are identical; for example, RECIDs 1-5 all have the same weight for every state. That means the data for the 22 variables I used in the targeting process were identical across those 5 records.

That led me to look for duplicates in my 7/2/2021 puf.csv (code copied below). I did it two ways: (1) looking for duplicates using all variables other than RECID, and (2) also dropping FLPDYR, on the theory that even if FLPDYR varies, a record that matches on every other variable is really the same person (there may be additional variables we should treat similarly).

This analysis showed that there are 2,673 records that are EXACT duplicates on all variables other than RECID (99 variables). If we also exclude FLPDYR, there are 2,828 records that are exact duplicates on the 98 variables compared.
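For clarity on what these counts mean: pandas' `duplicated(keep=False)` flags every member of a duplicate group, so a pair of identical records counts as 2, not 1. A toy illustration (made-up values, not actual PUF data):

```python
import pandas as pd

# Three records; the first two are identical on all columns other than RECID.
df = pd.DataFrame({'RECID': [1, 2, 3], 'e00200': [10000, 10000, 20000]})
cols = [c for c in df.columns if c != 'RECID']

# keep=False marks both members of the identical pair, not just the second.
n_dup_records = df.duplicated(subset=cols, keep=False).sum()
print(n_dup_records)  # 2
```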

This was a surprise to me and seems like it might not be intended.

In addition, I looked at the very first record in my 7/2/2021 puf.csv (RECID==1), which is duplicated 5 times under my second method, and it appears to be the same as the first record in the last official puf.csv I have ready access to (dated 8/20/2020). (I can't easily compare other records because, after that record, RECID is no longer comparable across the two files.)

This record has s006=147096 in the 8/20/2020 puf.csv, and ALL 5 corresponding duplicate records in my 7/2/2021 puf.csv also have s006=147096. If the duplicates arose because records are being matched against CPS records, or something like that, I would have expected the weight to be split so that the 5 copies sum to 147096. Because that is not the case, my guess is that the weights on the duplicate records are too large.

Is it possible that this is unintended?

Thanks.

```python
import pandas as pd

DIR_FOR_BOYD_PUFCSV = r'/media/don/data/puf_files/puf_csv_related_files/Boyd/2021-07-02/'
PUFPATH = DIR_FOR_BOYD_PUFCSV + 'puf.csv'

puf = pd.read_csv(PUFPATH)

# check for duplicates using ALL columns other than RECID
cols = puf.columns.tolist()
cols.remove('RECID')
# keep=False keeps all members of each duplicate group, not just the extra copies
dups = puf[puf.duplicated(subset=cols, keep=False)]
dups.shape  # 2673 duplicates

# also drop FLPDYR from the comparison columns
cols2 = cols.copy()
cols2.remove('FLPDYR')
dups2 = puf[puf.duplicated(subset=cols2, keep=False)]
dups2.shape  # 2828 duplicates

# which h_seq has the most duplicates?
dups2.h_seq.value_counts()
```
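One way to probe the weight question is to sum s006 within each duplicate group: if the duplication had split the weight, each group would sum back to the original single-record weight. A minimal sketch with toy data standing in for puf.csv (RECID and s006 are real PUF column names, but the values here are made up to mirror the RECID 1-5 case):

```python
import pandas as pd

# Toy stand-in for puf.csv: rows with RECID 1-5 are exact duplicates on the
# comparison columns, and each copy carries the full weight of 147096.
toy = pd.DataFrame({
    'RECID': [1, 2, 3, 4, 5, 6],
    'e00200': [50000] * 5 + [80000],
    's006':   [147096] * 5 + [90000],
})

cols = [c for c in toy.columns if c != 'RECID']
dup_mask = toy.duplicated(subset=cols, keep=False)

# If weights had been split across the copies, each duplicate group would
# sum to the original weight (147096); here it sums to 5x that instead.
group_weight = toy[dup_mask].groupby(cols)['s006'].sum()
print(int(group_weight.iloc[0]))  # 735480, i.e., 5 * 147096
```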
donboyd5 changed the title from "Several thousand duplicate records in taxdata" to "Several thousand duplicate records in taxdata's puf.csv" on Jul 8, 2021
donboyd5 changed the title from "Several thousand duplicate records in taxdata's puf.csv" to "Several thousand duplicate records in taxdata-created puf.csv" on Jul 8, 2021