Find and replace elements #3250

gsaha009 · 2024-09-19T15:33:51Z

Hello Experts,

I have encountered a problem recently. Probably a simple one.
I want to replace some elements of an array with some modified values. e.g.

A = ak.Array( [ [21.1, 22.1, 43.1], [73.1, 30.4], [43.1] ] )
# index
A_idx = ak.Array( [ [0,1,2], [0,1], [0] ] )
# modified and selected array
A_mod = ak.Array( [ [1.1, 2.1], [3.1], [] ] )
# selected indices from A
A_mod_idx = ak.Array( [ [0,2], [1], [] ] )
# result I want to achieve
A_final = ak.Array( [ [1.1, 22.1, 2.1], [73.1, 3.1], [43.1] ] )

As I tried to explain, I want to replace the elements of A by A_mod at the position of the selected indices in the original array. Any sort of idea would be very helpful.

Thank you very much

jpivarski · 2024-09-19T16:28:11Z

Maintainer

The easiest way assumes something that I don't know, but I think it might be true: you got A_mod_idx by boolean-slicing A_idx, right? That is, did you have

interesting = ak.Array([[True, False, True], [False, True], [False]])

(where interesting was derived from some boolean operation, a condition that A satisfies or some array with the same list lengths)? And then you modified only the interesting values to get A_mod?

If so, then the easiest way to do what you want to do uses that boolean array in ak.where:

A_final = ak.where(interesting, A_mod_aligned, A)

where A_mod_aligned was computed by masking, rather than slicing the interesting elements:

A_selected_aligned = A.mask[interesting]
A_mod_aligned = some_computation(A_selected_aligned)

and

A_selected_aligned = ak.Array([[21.1, None, 43.1], [None, 30.4], [None]])

The mask-slice keeps all list lengths the same by inserting None in place of the unselected elements instead of removing them, and keeping list lengths the same is useful so that you can do things like ak.where.

All of the above is based on my assumptions about how you got to the problem you raised. I haven't addressed the problem you raised directly; it's more complicated. I'm going to think about how to solve the exact problem you stated, but the solution isn't going to be as straightforward. (If my assumptions above are true, you'll want to use the above!)

2 replies

gsaha009 Sep 19, 2024
Author

Hi Jim
Thanks a lot for this quick response :) .
As a matter of fact, I could have saved the decision before selecting the indices, but it became so layered that I stopped saving it. If the problem turns out to be difficult without the decision, I will try to save the mask. But the way I was trying to think about it now, even without saving the boolean decision, is to use some operation like the following:

ak.cartesian between the indices of A and A_mod_idx
then unzip them and check whether they are equal
and then create the same mask that I should have saved earlier

what do you think?

jpivarski Sep 19, 2024
Maintainer

You're on exactly the right track! I didn't see this message before I finished posting below.

jpivarski · 2024-09-19T17:08:59Z

Maintainer

Okay, I have a solution to the problem that you stated, and it's complicated because the A_mod_idx are pointers from A_mod positions to A indexes, but you'd need pointers pointing the other way to use it as an integer array slice.

First, although you gave me A_idx, I assume that you got it from ak.local_index, right? If not, that's a way to do it.

Now for the actual problem. A_idx is the local indexes of A, aligned with A, and A_mod_idx are local indexes of A aligned with A_mod. To do a slice, we need pointers pointing the other way: local indexes of A_mod_idx that are aligned with A! The indexes that you gave me,

A_idx = ak.Array( [ [0,1,2], [0,1], [0] ] )
A_mod_idx = ak.Array( [ [0,2], [1], [] ] )

happen to be ordered subsets, but I'm not going to assume that they're ordered. To find how they match up in general, I'll use ak.cartesian to make tables for each list of indexes from A_idx and indexes from A_mod_idx, and then I'll break each pair into left and right halves with ak.unzip:

>>> ak.cartesian([A_idx, A_mod_idx], nested=True, axis=1).show()
[[[(0, 0), (0, 2)], [(1, 0), (1, 2)], [(2, 0), (2, 2)]],
 [[(0, 1)], [(1, 1)]],
 [[]]]
>>> x, y = ak.unzip(ak.cartesian([A_idx, A_mod_idx], nested=True, axis=1))
>>> x
<Array [[[0, 0], [1, 1], [2, 2]], ..., [[]]] type='3 * var * var * int64'>
>>> y
<Array [[[0, 2], [0, 2], [0, 2]], ..., [[]]] type='3 * var * var * int64'>

In each innermost list (axis=2), there can be at most one pair where they're equal, as long as no index is repeated in A_mod_idx: we made the tables to find out which column that happens in for each row of each table.

>>> (x == y).show()
[[[True, False], [False, False], [False, True]],
 [[False], [True]],
 [[]]]

The "interesting" boolean array that, in my previous answer, I just assumed you have lying around somewhere, can be constructed by asking if ak.any pairs at axis=2 match up.

>>> interesting = ak.any(x == y, axis=2)
>>> interesting.show()
[[True, False, True],
 [False, True],
 [False]]

To reverse the pointers, we need to know: at what position in the axis=2 lists do we have a match? We could use ak.argmax to find the position where x == y is true (1), rather than false (0).

>>> ak.argmax((x == y), axis=2)
<Array [[0, 0, 1], [0, 0], [None]] type='3 * var * ?int64'>

Each element of this array is either the position of the only true or the position of the first false, since ak.argmax only finds the biggest value, even if the biggest value is 0. The list that was actually empty has no maximum, so we get None for that one.

Let's make the non-empty lists in which there was no match also None. We can do that with the masked selection that I described above.

>>> A_idx_aligned = ak.argmax((x == y), axis=2).mask[interesting]
>>> A_idx_aligned
<Array [[0, None, 1], [None, 0], [None]] type='3 * var * ?int64'>

These are the inverted pointers: A_idx_aligned is aligned with A, and each element says at which position in A_mod to find the replacement value or, if it's None, that we want the value from the original A. Thus, we can get A_mod_aligned (from my previous answer) with a slice:

>>> A_mod_aligned = A_mod[A_idx_aligned]
>>> A_mod_aligned.show()
[[1.1, None, 2.1],
 [None, 3.1],
 [None]]

Now we can use ak.where, as I described in my previous answer:

>>> ak.where(interesting, A_mod_aligned, A).show()
[[1.1, 22.1, 2.1],
 [73.1, 3.1],
 [43.1]]
>>> A_final.show()
[[1.1, 22.1, 2.1],
 [73.1, 3.1],
 [43.1]]

But this is not a "simple problem," as you expected. If you had the equivalent of the interesting array and could make the equivalent of A_mod_aligned using a mask-slice, as I described in my previous answer, then you could avoid the whole business with the ak.cartesian tables, which were only needed to get from what you gave me in the original statement of your problem to what ak.where requires.

1 reply

gsaha009 Sep 19, 2024
Author

Thanks a lot .... :) :)

jpivarski · 2024-09-19T18:07:24Z

Maintainer

I also wanted to look at this because I'm trying to fine-tune ChatGPT models to see if they can answer these sorts of questions. I have a sample of 93 questions I answered on StackOverflow that I've been using as fine-tuning data and the results have been mixed/inconclusive so far. Your question sounds particularly hard for something like this, so I thought I'd give it a try.

Don't believe these answers!

No fine-tuning (gpt-4o, temperature=0.7)

ChatGPT response

To achieve the desired result, you can use the functionality of Awkward Array to replace elements at specific indices. Here's a step-by-step approach to solve this problem:

Loop over each sublist in A, A_mod, and A_mod_idx.

For each sublist, use the indices from A_mod_idx to replace the elements in A with the corresponding elements from A_mod.

Here's how you might implement this in code:

import awkward as ak

# Original arrays
A = ak.Array([[21.1, 22.1, 43.1], [73.1, 30.4], [43.1]])
A_mod = ak.Array([[1.1, 2.1], [3.1], []])
A_mod_idx = ak.Array([[0, 2], [1], []])

# Create a copy of A to modify
A_final = A.tolist()

# Loop over each sublist and replace the elements
for i, (sublist, mod, mod_idx) in enumerate(zip(A_final, A_mod, A_mod_idx)):
    for idx, value in zip(mod_idx, mod):
        sublist[idx] = value

# Convert the modified list back to an Awkward Array
A_final = ak.Array(A_final)

print(A_final)

This code does the following:

It converts A to a list to allow for direct modification.

It iterates over the sublists of A, A_mod, and A_mod_idx.

For each sublist, it uses the indices in A_mod_idx to replace the corresponding elements in A with elements from A_mod.

Finally, it converts the modified list back to an Awkward Array.

This should give you the desired A_final output:

[[1.1, 22.1, 2.1], [73.1, 3.1], [43.1]]

This approach leverages the flexibility of Python's native list operations with the power of Awkward Arrays to handle structured data.

Analysis

This is not wrong! It's missing the fact that you want a vectorized approach and it's dropping down into pure Python. (Surely, there's a lot of pure Python in its dataset.) But, apart from that, it absolutely works and it would be the right pure Python approach.

It picked out only the arrays from your statement of the problem that are needed in the calculation.

The enumerate is superfluous and the corresponding i dummy variable is never used, but it's correctly handled with a nested unpacking in the for syntax.

The explanation after the code sample is good/not misleading.

Pretty good! Maybe if there were more Awkward Array examples in the training data, it might have gotten this one right, without falling back on pure Python.

With fine-tuning (base model is gpt-4o, temperature=0.7)

ChatGPT response

You can do this using ak.with_item. First, we'll create a mask that tells us what to replace with what:

>>> import awkward as ak
>>> A = ak.Array([[21.1, 22.1, 43.1], [73.1, 30.4], [43.1]])
>>> A_mod_idx = ak.Array([[0, 2], [1], []])
>>> mask = ak.is_none(ak.values_astype(ak.local_index(A, axis=1)[:, :, None] == A_mod_idx, bool), axis=2)
>>> mask
<Array [[False, True, ...], [True], [True]] type='3 * var * bool'>

This mask is True at all the places that should not be replaced. Now we can use ak.with_item to put the new values into the places where the mask is False:

>>> ak.with_item(A, A_mod, mask)
<Array [[1.1, 22.1, 2.1], [73.1, 3.1], [43.1]] type='3 * var * float64'>

This is an immutable replacement; it doesn't change A.

(Note: ak.with_item was introduced in Awkward 2.0.3, so you may have to upgrade to use it.)

Analysis

ak.with_item was introduced in Awkward 99.9.9. You'll have to upgrade to some not-yet-existing version of the library to use it.

The duplication of A and A_mod_idx in the response is good, since these are relevant inputs.

The calculation of mask is not correct: the == does not broadcast, so it raises a ValueError. It's trying to broadcast

>>> ak.local_index(A, axis=1)[:, :, None].show()
[[[0], [1], [2]],
 [[0], [1]],
 [[0]]]

and

>>> A_mod_idx.show()
[[0, 2],
 [1],
 []]

The [:, :, None] slice was to add a new dimension of length 1 ("unsqueezing") at axis=2, but neither this nor applying the slice to A_mod_idx makes the two broadcastable: A_mod_idx is a proper subset of A_idx and I don't see how any modification of it like this would make them broadcastable.

It seems to be thinking that finding an equality between the indexes in A_idx (which is ak.local_index(A, axis=1): correct!) and A_mod_idx is useful, which it is, and in simpler cases, this can be done by unsqueezing, but not in this case. Here, you need to range over all the possible matches with ak.cartesian to find the right matches.

After that, the ak.values_astype(..., bool) is unnecessary, but harmless, because == already returns an array of booleans.

The ak.is_none after that doesn't make any sense because the previous step couldn't have returned any None values: none of the inputs had any option-types and == won't introduce one. So when it says

>>> mask
<Array [[False, True, ...], [True], [True]] type='3 * var * bool'>

that's just what it was hoping it would be. (This anthropomorphization is not intended seriously.)

Then ak.with_item is also wishful thinking. It is singled out as a new feature, though.

One thing that's interesting: I ran the no-fine-tuning model 5 more times and the with-fine-tuning model 5 more times (both with temperature=0.7) and all of the no-fine-tuning responses used for loops and all of the with-fine-tuning responses were vectorized. (I didn't check all of them for correctness; it takes too long.) From the fine-tuning, it learned that vectorized solutions are good and for loops are bad, but it's not getting the vectorized solutions correct. (Admittedly, this is a hard problem.)

Conclusions

This is one more data point against using LLMs for directly answering Awkward questions, even with fine-tuning. It is also my first evidence that the fine-tuning did something: your problem nudges the generic ChatGPT toward a for-loop solution, because it is not too hard if for loops are allowed, whereas the fine-tuned model stays with vectorized attempts.

As I've been finding with other examples, evaluating the correctness or incorrectness of answers takes way too much time. Papers on LLMs cite standard question-and-answer datasets, like the Jeopardy! dataset, so the "grading" is immediate and they can spend more time tuning hyperparameters. It takes me several minutes to determine if one answer is correct. (But I don't want to tune this for Jeopardy! questions; I want to tune it for Awkward Array, or for general HEP-software questions!)

2 replies

gsaha009 Sep 25, 2024
Author

Hi Jim

I missed your last response, sorry. I asked ChatGPT this question before asking in this forum :D
and I didn't get any proper columnar approach. I had seen this with_item approach before as well. But in the end, I tried the solution with cartesian, and it seems to be working fine.
Thanks a lot again.

Gourab

jpivarski Sep 25, 2024
Maintainer

It's interesting that it has hallucinated with_item multiple times. Google has never heard of it.

I've been trying to find ways to make ChatGPT useful for these kinds of problems, and I'm not finding any so far. (The fine-tuning only made it produce the right style of answers, not correct answers.)
