some modify to utils convert function #335

MeloDi-23 · 2024-11-18T18:20:16Z

Corrfunc forked:

I rewrote utils.py so that convert_rp_pi_counts_to_wp and convert_3d_counts_to_cf work properly with weights. The previous call signature is still valid, and if you provide weights in the pair counting, the conversion now accounts for them.

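import numpy as np
from Corrfunc.mocks.DDrppi_mocks import DDrppi_mocks
from Corrfunc.utils import convert_rp_pi_counts_to_wp

# load some catalogue (galaxy, random) and define pimax, rp_bin, Nbins...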

# Code 1
wei_norm = galaxy['w'] / (galaxy['w'].mean())
wei_norm_r = random['w'] / (random['w'].mean())

dd = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=wei_norm, is_comoving_dist=True, weight_type='pair_product')
dr = DDrppi_mocks(
    autocorr=False, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin, 
    RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=wei_norm, 
    RA2=random['ra'], DEC2=random['dec'], CZ2=random['distance'], weights2=wei_norm_r, 
    is_comoving_dist=True, weight_type='pair_product')
rr = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=random['ra'], DEC1=random['dec'], CZ1=random['distance'], weights1=wei_norm_r, is_comoving_dist=True, weight_type='pair_product')

Nd = len(galaxy)
Nr = len(random)

wp_1 = convert_rp_pi_counts_to_wp(Nd, Nd, Nr, Nr, dd, dr, dr, rr, pimax=pimax, nrpbins=Nbins)


# Code 2
dd = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=galaxy['w'], is_comoving_dist=True, weight_type='pair_product')
dr = DDrppi_mocks(
    autocorr=False, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin, 
    RA1=galaxy['ra'], DEC1=galaxy['dec'], CZ1=galaxy['distance'], weights1=galaxy['w'], 
    RA2=random['ra'], DEC2=random['dec'], CZ2=random['distance'], weights2=random['w'], 
    is_comoving_dist=True, weight_type='pair_product')
rr = DDrppi_mocks(autocorr=True, cosmology=1, nthreads=50, pimax=pimax, binfile=rp_bin,
                   RA1=random['ra'], DEC1=random['dec'], CZ1=random['distance'], weights1=random['w'], is_comoving_dist=True, weight_type='pair_product')

Nd = galaxy['w'].sum()
Nr = random['w'].sum()

wp_2 = convert_rp_pi_counts_to_wp(Nd, Nd, Nr, Nr, dd, dr, dr, rr, pimax=pimax, nrpbins=Nbins)
assert np.isclose(wp_1, wp_2).all()

With this PR, both snippets run and give the same wp.

Note that, for simplicity, I didn't add new parameters to the function. Instead, you can either:

1. Normalize the weights first, e.g. weight_normal = weight / weight.mean(), and pass the number of points as before.
2. Pass the sum of weights of dataset 1 as ND1, the sum of weights of dataset 2 as ND2, and so on. This is reasonable: with no weighting (every point has weight 1), the sum of weights equals the number of points.

Both ways work.
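To see why the two options agree, here is a minimal pure-Python sketch on a toy 1D catalogue with a single separation bin. It does not call Corrfunc; weighted_pairs, landy_szalay and mean_normalize are hypothetical stand-ins, and the estimator is assumed to be Landy-Szalay with counts rescaled by NR/ND.

```python
import math


def weighted_pairs(x1, w1, x2, w2, rmax, auto=False):
    """Sum of w_i * w_j over pairs closer than rmax (toy brute-force counter)."""
    total = 0.0
    for i, (xi, wi) in enumerate(zip(x1, w1)):
        for j, (xj, wj) in enumerate(zip(x2, w2)):
            if auto and i == j:
                continue  # skip self-pairs in an auto-correlation
            if abs(xi - xj) < rmax:
                total += wi * wj
    return total


def landy_szalay(nd, nr, dd, dr, rr):
    """Landy-Szalay estimator with counts rescaled by f = NR / ND."""
    f = nr / nd
    return (f * f * dd - 2.0 * f * dr + rr) / rr


def mean_normalize(w):
    """Divide weights by their mean, so they sum to len(w)."""
    mean = sum(w) / len(w)
    return [wi / mean for wi in w]


# Toy data (made up for illustration only)
xd = [0.1, 0.4, 0.5, 0.9, 1.3]
wd = [1.0, 2.0, 0.5, 1.5, 3.0]
xr = [0.0, 0.3, 0.6, 0.8, 1.0, 1.2, 1.4]
wr = [0.8, 1.2, 1.0, 0.6, 1.4, 1.0, 2.0]
rmax = 0.45

# Option 1: mean-normalized weights, N = number of points
wdn, wrn = mean_normalize(wd), mean_normalize(wr)
xi_1 = landy_szalay(
    len(xd), len(xr),
    weighted_pairs(xd, wdn, xd, wdn, rmax, auto=True),
    weighted_pairs(xd, wdn, xr, wrn, rmax),
    weighted_pairs(xr, wrn, xr, wrn, rmax, auto=True),
)

# Option 2: raw weights, N = sum of weights
xi_2 = landy_szalay(
    sum(wd), sum(wr),
    weighted_pairs(xd, wd, xd, wd, rmax, auto=True),
    weighted_pairs(xd, wd, xr, wr, rmax),
    weighted_pairs(xr, wr, xr, wr, rmax, auto=True),
)

assert math.isclose(xi_1, xi_2, rel_tol=1e-9, abs_tol=1e-12)
```

The mean weights cancel between the pair counts and the NR/ND factors, so the two conventions differ only by floating-point rounding.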

pep8speaks · 2024-11-18T18:20:22Z

Hello @MeloDi-23! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file Corrfunc/utils.py:

Line 145:80: E501 line too long (88 > 79 characters)
Line 173:80: E501 line too long (99 > 79 characters)

manodeep · 2024-12-01T21:35:18Z

Thanks again for the PR @MeloDi-23. We had only talked about modifying the documentation but you have added entirely new functionality - might require a bit of a think about how to proceed.

@lgarrison Thoughts?

lgarrison · 2024-12-01T23:03:40Z

I think our API here is not great to begin with—accepting either an array or a dict that we try to unpack into an array seems pretty magical, and it's undocumented. I think it's a big part of the confusion about why the weights aren't used, because when passing a dict the function does indeed have enough information to apply the weighted counts (except for the sum of weights).

So I wonder if we should narrow the API to only accept results dicts, and maybe add a use_weights=True default parameter. That makes it pretty obvious what the function is doing. That, plus a documentation update, should help with the confusion.

(I would suggest that we only accept arrays instead of dicts, but then users have to type DD['npairs'] * DD['weightavg'] for each function argument, or we add new function args just to accept the weights, neither of which is great.)
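As a rough sketch of what a use_weights switch could look like, the helper below turns a results dict into per-bin counts. The name effective_counts is hypothetical and not part of Corrfunc; only the 'npairs' and 'weightavg' field names are taken from the discussion above.

```python
def effective_counts(result, use_weights=True):
    """Per-bin pair counts from a results dict (hypothetical helper).

    With use_weights=True the counts are npairs * weightavg;
    otherwise the raw npairs are returned as floats.
    """
    npairs = result["npairs"]
    if not use_weights:
        return [float(n) for n in npairs]
    return [n * w for n, w in zip(npairs, result["weightavg"])]


# Made-up counts for three bins
DD = {"npairs": [10, 40, 90], "weightavg": [1.5, 0.5, 1.0]}

weighted = effective_counts(DD)                       # [15.0, 20.0, 90.0]
unweighted = effective_counts(DD, use_weights=False)  # [10.0, 40.0, 90.0]
```

Indexing by field name should work the same way for a dict of arrays or a record array, which is the appeal of accepting results directly rather than bare count arrays.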

MeloDi-23 · 2024-12-02T23:39:21Z

Thanks for the idea. My main point is to keep the function compatible with the pair-counting APIs (DDrppi and the like). These functions return a record array (which looks like a dict of arrays). I don't know whether people use their own code to count pairs and then use this function to compute the correlation function. For my part, I just pass the data to the counting API and feed the result directly into this function. This is easy to use and reasonable.
P.S.: It might also be possible to modify the counting API to return, let's say, an object that stores the counts, weights, and other information such as the sum of weights. That way the API could be simplified further.
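One way to picture such an object is a small dataclass that carries the counts together with the sums of weights. This is only an illustration: PairCounts and all of its fields and methods are hypothetical and do not exist in Corrfunc.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairCounts:
    """Hypothetical container for one pair-counting result."""
    npairs: List[int]        # raw pair counts per bin
    weightavg: List[float]   # average pair weight per bin
    sum_weights1: float      # sum of weights of the first dataset
    sum_weights2: float      # sum of weights of the second dataset

    def weighted(self) -> List[float]:
        """Weighted pair counts per bin: npairs * weightavg."""
        return [n * w for n, w in zip(self.npairs, self.weightavg)]

    def normalized(self) -> List[float]:
        """Weighted counts divided by the product of the weight sums."""
        norm = self.sum_weights1 * self.sum_weights2
        return [c / norm for c in self.weighted()]


# Made-up numbers for two bins
counts = PairCounts(npairs=[10, 40], weightavg=[1.5, 0.5],
                    sum_weights1=4.0, sum_weights2=5.0)
print(counts.weighted())    # [15.0, 20.0]
print(counts.normalized())  # [0.75, 1.0]
```

With the normalization stored alongside the counts, a conversion function would no longer need separate ND1/ND2/NR1/NR2 arguments.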

lgarrison · 2024-12-02T23:52:16Z

Yes, sorry, I meant to edit my comment to say "record array" instead of "dict". I'm not too concerned about other people using the API with their own counts, but if they are, I think the function can trivially accept a record array or dict of arrays. I also agree that an object-oriented API could be nice, but in the interest of fixing the current issue in a timely manner, maybe we should stick to improving the utils functions.

manodeep · 2024-12-05T22:32:28Z

I am clearly the bottleneck here. @lgarrison Should we work out what a good compromise would be? This could be a useful addition as long as we don't break user expectations/existing code.

lgarrison · 2024-12-06T02:32:41Z

I think I'm in favor of changing the function to only accept record arrays and/or dicts, and adding a use_weights=True default kwarg. That way the results of DD() (or whatever) could be passed to this function, or users bringing their own pair counts could just wrap them in a dict. This would be a breaking API change.

If that's too disruptive, we could instead introduce a new function with this behavior, and restrict the old function to its documented behavior, which is to accept only arrays and not record arrays/dicts. "Documented" is a fuzzy term here, though, because we have examples that are at odds with the docstring.
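For the second option, restricting the old function to plain arrays could be done with a small guard that rejects dicts and record arrays instead of silently unpacking them. The sketch below is an assumption about how that might look; require_plain_counts is a hypothetical name, and structured arrays are detected through their dtype.names attribute without importing NumPy.

```python
from collections.abc import Mapping


def require_plain_counts(counts, name="counts"):
    """Return counts unchanged, or raise if it is a dict or record array.

    Hypothetical guard: structured/record arrays are recognised by a
    non-None ``dtype.names``; plain lists and plain arrays pass through.
    """
    field_names = getattr(getattr(counts, "dtype", None), "names", None)
    if isinstance(counts, Mapping) or field_names is not None:
        raise TypeError(
            f"{name} must be a plain array of pair counts, "
            "not a results dict or record array"
        )
    return counts


# A plain list of counts is accepted as-is
plain = require_plain_counts([3, 5, 8], name="D1D2")

# A results dict is rejected loudly rather than unpacked
try:
    require_plain_counts({"npairs": [3, 5, 8]}, name="D1D2")
    rejected = False
except TypeError:
    rejected = True
```

Failing loudly here trades a little convenience for making it obvious when weights would otherwise be dropped.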

I think no matter what we do, the change will (and should!) be breaking in at least a small way, because the current situation of silently producing unweighted CFs is not a good user experience.

some modify to utils convert function

63f39aa
