some modify to utils convert function #335

Open
wants to merge 1 commit into
base: master

Conversation

MeloDi-23

Corrfunc forked:

I rewrote utils.py so that convert_rp_pi_counts_to_wp and convert_3d_counts_to_cf handle weights properly. The former way of calling them is still valid, and if you provide weights in the pair counting, the functions now account for them.

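import numpy as np
from Corrfunc.mocks.DDrppi_mocks import DDrppi_mocks
from Corrfunc.utils import convert_rp_pi_counts_to_wp

# load some catalogue (galaxy, random) and define pimax, rp_bin, Nbins...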

# Code 1
wei_norm = galaxy['w'] / (galaxy['w'].mean())
wei_norm_r = random['w'] / (random['w'].mean())

dd = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=wei_norm, is_comoving_dist=True, weight_type='pair_product')
dr = DDrppi_mocks(
    autocorr=False, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin, 
    RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=wei_norm, 
    RA2=random['ra'], DEC2=random['dec'], CZ2=random['distance'], weights2=wei_norm_r, 
    is_comoving_dist=True, weight_type='pair_product')
rr = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=random['ra'], DEC1=random['dec'], CZ1=random['distance'], weights1=wei_norm_r, is_comoving_dist=True, weight_type='pair_product')

Nd = len(galaxy)
Nr = len(random)

wp_1 = convert_rp_pi_counts_to_wp(Nd, Nd, Nr, Nr, dd, dr, dr, rr, pimax=pimax, nrpbins=Nbins)


# Code 2
dd = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=galaxy['w'], is_comoving_dist=True, weight_type='pair_product')
dr = DDrppi_mocks(
    autocorr=False, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin, 
    RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=galaxy['w'], 
    RA2=random['ra'], DEC2=random['dec'], CZ2=random['distance'], weights2=random['w'], 
    is_comoving_dist=True, weight_type='pair_product')
rr = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=random['ra'], DEC1=random['dec'], CZ1=random['distance'], weights1=random['w'], is_comoving_dist=True, weight_type='pair_product')

Nd = galaxy['w'].sum()
Nr = random['w'].sum()

wp_2 = convert_rp_pi_counts_to_wp(Nd, Nd, Nr, Nr, dd, dr, dr, rr, pimax=pimax, nrpbins=Nbins)
assert np.isclose(wp_1, wp_2).all()

With this PR, the code above works: both approaches give the same wp and the assertion passes.

Note that, for simplicity, I didn't add new parameters to the functions.
Instead you can either

  • normalize the weights first, e.g. weight_normal = weight / weight.mean(), and pass the number of points as before (Code 1), or
  • pass the sum of weights of dataset 1 as ND1, the sum of weights of dataset 2 as ND2, etc. (Code 2). This is reasonable because with no weighting (that is, every point has weight 1) the sum of weights equals the number of points.

Both ways will work.
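
To illustrate why the two options agree, here is a minimal, self-contained sketch that does not use Corrfunc at all. It assumes a brute-force weighted pair count on 1D toy data and a plain Landy-Szalay estimator; the helper names (weighted_pair_counts, landy_szalay, estimate) are made up for this example and are not part of the PR.

```python
import numpy as np

def weighted_pair_counts(x1, w1, x2, w2, edges):
    """Brute-force sum of w_i * w_j over all pairs, binned by |x_i - x_j|."""
    sep = np.abs(x1[:, None] - x2[None, :]).ravel()
    ww = (w1[:, None] * w2[None, :]).ravel()
    counts, _ = np.histogram(sep, bins=edges, weights=ww)
    return counts

def landy_szalay(nd, nr, dd, dr, rr):
    """Landy-Szalay estimator; nd and nr normalize the data and random samples."""
    f = nr / nd
    return (f * f * dd - 2.0 * f * dr + rr) / rr

# Toy 1D "catalogues" with arbitrary positive weights (illustrative data only)
rng = np.random.default_rng(0)
xd, wd = rng.random(200), rng.uniform(0.5, 2.0, 200)
xr, wr = rng.random(400), rng.uniform(0.5, 2.0, 400)
edges = np.linspace(0.05, 0.5, 6)  # first edge > 0 excludes self-pairs

def estimate(wd_, wr_, nd, nr):
    dd = weighted_pair_counts(xd, wd_, xd, wd_, edges)
    dr = weighted_pair_counts(xd, wd_, xr, wr_, edges)
    rr = weighted_pair_counts(xr, wr_, xr, wr_, edges)
    return landy_szalay(nd, nr, dd, dr, rr)

# Option 1: weights normalized by their mean, N = number of points
xi_norm = estimate(wd / wd.mean(), wr / wr.mean(), len(xd), len(xr))
# Option 2: raw weights, N = sum of weights
xi_raw = estimate(wd, wr, wd.sum(), wr.sum())
```

Rescaling the weights of a sample by a constant rescales its pair sums and its normalization consistently, so the constant cancels in the estimator and xi_norm matches xi_raw up to floating-point error.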

@pep8speaks

Hello @MeloDi-23! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 145:80: E501 line too long (88 > 79 characters)
Line 173:80: E501 line too long (99 > 79 characters)

@manodeep
Owner

manodeep commented Dec 1, 2024

Thanks again for the PR @MeloDi-23. We had only talked about modifying the documentation but you have added entirely new functionality - might require a bit of a think about how to proceed.

@lgarrison Thoughts?

@lgarrison
Collaborator

I think our API here is not great to begin with—accepting either an array or a dict that we try to unpack into an array seems pretty magical, and it's undocumented. I think it's a big part of the confusion about why the weights aren't used, because when passing a dict the function does indeed have enough information to apply the weighted counts (except for the sum of weights).

So I wonder if we should narrow the API to only accept results dicts, and maybe add a use_weights=True default parameter. That makes it pretty obvious what the function is doing. That, plus a documentation update, should help with the confusion.

(I would suggest that we only accept arrays instead of dicts, but then users have to type DD['npairs'] * DD['weightavg'] for each function argument, or we add new function args just to accept the weights, neither of which is great.)
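
For reference, the manual step mentioned above might look like the sketch below. The record array is a hand-built stand-in with only the two fields named in this comment (npairs and weightavg), not real pair-counter output, and the numbers are invented.

```python
import numpy as np

# Hand-built stand-in for a pair-counting result (illustrative values only)
result = np.zeros(3, dtype=[('npairs', np.uint64), ('weightavg', np.float64)])
result['npairs'] = [10, 20, 40]
result['weightavg'] = [1.5, 1.0, 0.5]

# Weighted pair count per bin: number of pairs times the average pair weight
weighted = result['npairs'] * result['weightavg']
```

Every count argument would need this product spelled out by the caller, which is the ergonomic cost being described.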

@MeloDi-23
Author

Thanks for the idea. The key point I want to make is that this function should stay compatible with the pair-counting APIs (DDrppi and the like). These functions return a record array (which looks like a dict of arrays). I don't know whether people count pairs with their own code and then use this function to calculate the correlation function. Personally, I just pass the data to the counting API and feed the result directly into this function, which is easy to use and reasonable.
P.S.: It might also be possible to modify the counting API so that it returns, say, an object that stores the counts, weights, and other information such as the sum of weights. That way the API could be simplified further.

@lgarrison
Collaborator

Yes, sorry, I meant to edit my comment to say "record array" instead of "dict". I'm not too concerned about other people using the API with their own counts, but if they are, I think the function can trivially accept a record array or dict of arrays. I also agree that an object-oriented API could be nice, but in the interest of fixing the current issue in a timely manner, maybe we should stick to improving the utils functions.

@manodeep
Owner

manodeep commented Dec 5, 2024

I am clearly the bottleneck here. @lgarrison Should we work out what a good compromise would be? This could be a useful addition as long as we don't break user expectations/existing code.

@lgarrison
Collaborator

I think I'm in favor of changing the function to only accept record arrays and/or dicts, and adding a use_weights=True default kwarg. That way the results of DD() (or whatever) could be passed to this function, or users bringing their own pair counts could just wrap them in a dict. This would be a breaking API change.

If that's too disruptive, we could instead introduce a new function with this behavior, and restrict the old function to its documented behavior, which is to accept only arrays and not record arrays/dicts. "Documented" is a fuzzy term here, though, because we have examples that are at odds with the docstring.

I think no matter what we do, the change will (and should!) be breaking in at least a small way, because the current situation of silently producing unweighted CFs is not a good user experience.
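
As a rough sketch of the first option (record arrays or dicts only, with a use_weights kwarg), something like the following could work. The names effective_counts and cf_from_results are hypothetical and do not exist in Corrfunc; the sketch assumes each input exposes npairs and weightavg, and that nd and nr are sums of weights when use_weights is True.

```python
import numpy as np

def effective_counts(result, use_weights=True):
    """Hypothetical helper: per-bin counts from a record array or dict of arrays."""
    npairs = np.asarray(result['npairs'], dtype=np.float64)
    if not use_weights:
        return npairs
    return npairs * np.asarray(result['weightavg'], dtype=np.float64)

def cf_from_results(nd, nr, dd, dr, rr, use_weights=True):
    """Hypothetical Landy-Szalay wrapper for an auto-correlation."""
    f = nr / nd
    d_d, d_r, r_r = (effective_counts(c, use_weights) for c in (dd, dr, rr))
    return (f * f * d_d - 2.0 * f * d_r + r_r) / r_r

# Users bringing their own counts can wrap them in a dict (invented numbers)
dd = {'npairs': [8, 2], 'weightavg': [1.0, 1.0]}
dr = {'npairs': [8, 8], 'weightavg': [1.0, 1.0]}
rr = {'npairs': [16, 16], 'weightavg': [1.0, 1.0]}
cf = cf_from_results(2.0, 4.0, dd, dr, rr)
```

Because indexing by field name works the same way for dicts and NumPy record arrays, one code path covers both input types. A real implementation would also need to decide what to do with results that were computed without weights.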

4 participants