Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: index historical votes #38

Merged
merged 8 commits into from
Sep 12, 2023
Merged

feat: index historical votes #38

merged 8 commits into from
Sep 12, 2023

Conversation

wgwz
Copy link
Contributor

@wgwz wgwz commented Aug 23, 2023

Closes: #22
Closes: #37 (I decided to add the fix for this here, since it's small and potentially needed in this PR)

@wgwz wgwz marked this pull request as ready for review August 29, 2023 01:59
@wgwz wgwz marked this pull request as draft August 29, 2023 01:59
@wgwz wgwz force-pushed the kyle/22-index-votes branch 3 times, most recently from 78b4b59 to a123734 Compare August 29, 2023 17:53
@wgwz wgwz changed the title feat(WIP): index votes feat: index historical votes Aug 29, 2023
sql/V1_12__votes.sql Outdated Show resolved Hide resolved
index_votes.py Outdated Show resolved Hide resolved
index_votes.py Outdated Show resolved Hide resolved
@wgwz wgwz requested a review from a team August 29, 2023 18:11
@wgwz wgwz marked this pull request as ready for review August 29, 2023 18:11
@wgwz wgwz force-pushed the kyle/22-index-votes branch from b2f888d to eb57e69 Compare August 29, 2023 19:39
@wgwz
Copy link
Contributor Author

wgwz commented Aug 29, 2023

while testing this historical votes piece a bit further in my local environment, i'm observing that proposals that are rejected, do not get saved by the indexer in our proposals table. this seems to be because cosmos.group.v1.EventProposalPruned does not get emitted for rejected proposals. is that intended? am i wrong?

for that reason, historical rejected proposals are currently not supported with the indexer. consequently, we are unable to store historical votes for rejected proposals. is this a problem we need to solve in a separate issue? /cc @blushi @ryanchristo

@blushi
Copy link
Member

blushi commented Aug 30, 2023

while testing this historical votes piece a bit further in my local environment, i'm observing that proposals that are rejected, do not get saved by the indexer in our proposals table. this seems to be because cosmos.group.v1.EventProposalPruned does not get emitted for rejected proposals. is that intended? am i wrong?

for that reason, historical rejected proposals are currently not supported with the indexer. consequently, we are unable to store historical votes for rejected proposals. is this a problem we need to solve in a separate issue? /cc @blushi @ryanchristo

Could you tell how you tested this in details? I believe such proposals should get pruned eventually and emitting cosmos.group.v1.EventProposalPruned at voting period + max exec period: https://github1s.com/cosmos/cosmos-sdk/blob/HEAD/x/group/keeper/keeper.go#L370

@wgwz
Copy link
Contributor Author

wgwz commented Aug 30, 2023

Could you tell how you tested this in details? I believe such proposals should get pruned eventually and emitting cosmos.group.v1.EventProposalPruned at voting period + max exec period: https://github1s.com/cosmos/cosmos-sdk/blob/HEAD/x/group/keeper/keeper.go#L370

thanks @blushi i think the max exec period explains why i'm seeing this. it looks like that defaults to two weeks. i only created the rejected proposal a couple days ago, so that would explain it for my case.

i'm going to re-test this with an example @ryanchristo gave me where you can lower the exec time

@wgwz
Copy link
Contributor Author

wgwz commented Aug 30, 2023

i'm going to re-test this with an example @ryanchristo gave me where you can lower the exec time

so this example was just about adjusting the min exec time.
as mentioned in stand-up today, we'd have to customize the max exec time locally (by adjusting this value if i'm not mistaken), so i'll just assume it works and move on.

that said, it's something to be aware of when testing the contents of the indexer database.
a rejected proposal and it's votes will not appear in the database until max exec period has passed.

@wgwz wgwz force-pushed the kyle/22-index-votes branch from eb57e69 to 208c702 Compare August 30, 2023 17:27
Copy link
Member

@ryanchristo ryanchristo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Haven't tested yet. A couple questions/comments.

utils.py Show resolved Hide resolved
index_votes.py Outdated Show resolved Hide resolved
index_votes.py Show resolved Hide resolved
@ryanchristo
Copy link
Member

Successfully tested the vote is not stored if the proposal has not yet been indexed and the vote is stored after the proposal is indexed: https://github.com/regen-network/indexer/actions/runs/6042565087/job/16397918758?pr=41

@ryanchristo
Copy link
Member

In addition to adding comments, can you make sure we add V1_12__votes.sql to the all migrations script?

sql/V1_12__votes.sql Outdated Show resolved Hide resolved
@wgwz wgwz marked this pull request as draft September 1, 2023 20:42
@wgwz wgwz force-pushed the kyle/22-index-votes branch from a822d46 to 6f29a66 Compare September 6, 2023 01:47
@wgwz
Copy link
Contributor Author

wgwz commented Sep 6, 2023

with the latest commit i've changed the architecture of the index votes process to solve a few problems.
most importantly, properly handling the fact that we need to get all votes for a proposal with the current implementation of EventVote (which only specifies proposal_id and not the voter and option).

i've also dropped the relationship between the proposals table and the votes.
this just means we can't traverse that relationship in graphql as i had mentioned in a previous comment.
that should be fine, we'll still have the ability to write graphql that queries the votes table.

last but not least i added a new function for figuring out which events to process (new_events_to_process).
this is a slightly different version of the already existing events_to_process function.
the difference being that instead of finding all msg_events of a particular type that have yet to be indexed, it finds only the newest events that occurred since the last time it ran.

events_to_process returns absolutely all events that have not been processed.
this is great when you want to have all historical data at every point in time (i.e. retirements (might not be the best example)).
but it's not great when you only care about having "most recent state".
i.e. for votes we only care about the votes of a proposal "now".
and we wouldn't want to see the state of the votes as of two weeks ago (time = block_height).

that's why i needed to add this new function new_events_to_process.
i mentioned to @blushi in our 1-1 today that we should actually convert the index_class_issuers process to use this function.
because the same thing applies for class issuers, we only care about the state of the class issuers "now".
the current class issuers indexing process is inefficient for this reason.
we're storing the historical state of the class issuers, and using a flag latest to mark the most recent state
i'll open up an issue for this.

another important difference between new_events_to_process and events_to_process is resiliency (not sure this is the best word).
if you make use of events_to_process, you always get all events of a particular type at any point in time.
i.e. if i were to delete a retirement from the retirements table at some old block height, the next time the indexer for retirements runs, it would re-index that retirement without any further manual interactions.
Whereas with new_events_to_process, it wouldn't work like that because it only ever pays attention to the events that occurred since the last indexed block height.
in this particular example, you would have to delete all retirements after the initial deleted retirement to get the whole thing re-indexed.

@wgwz wgwz marked this pull request as ready for review September 6, 2023 01:52
@wgwz wgwz requested a review from ryanchristo September 6, 2023 01:52
Copy link
Member

@ryanchristo ryanchristo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we address the multiple chains issue separately? I think there is more work to be done and I'm not sure one process for all chains is the best approach.

Also updated tests, which look good: https://github.com/regen-network/indexer/actions/runs/6099955306/job/16553243546

sql/V1_12__votes.sql Outdated Show resolved Hide resolved
Comment on lines +42 to +59
all_chain_nums = [
record[0] for record in gen_records(cur, "select num from chain;")
]
max_block_heights = {
chain_num: max_block_height
for chain_num, max_block_height in gen_records(
cur,
"select chain_num, MAX(block_height) from votes group by chain_num;",
)
}
logger.debug(f"{all_chain_nums=}")
logger.debug(f"{max_block_heights=}")
for chain_num in all_chain_nums:
if chain_num not in max_block_heights.keys():
max_block_heights[chain_num] = 0
logger.debug(f"{max_block_heights=}")
for chain_num, max_block_height in max_block_heights.items():
logger.debug(f"{chain_num=} {max_block_height=}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is ok for now but this should either be extracted to reuse in other processes or we should rethink how we are running processes, i.e. whether we want to run separate processes for separate chains.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also noticing we no longer make use of the _chain_num parameter with this addition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should address this issue (#33) separately. Any reason for adding it to this pull request?

Copy link
Contributor Author

@wgwz wgwz Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is ok for now but this should either be extracted to reuse in other processes or we should rethink how we are running processes, i.e. whether we want to run separate processes for separate chains.

we can figure out a way to extract it. maybe we can do that as a part of refactoring indexing credit class issuers to use the same method (new_events_to_process)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should address this issue (#33) separately. Any reason for adding it to this pull request?

it’s not really added to this PR. it’s just that this PR will index votes in a forward looking way. if we want to index multiple chains in the same processes, this process will still work for that. (as do all the other processes we’ve written to date).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also noticing we no longer make use of the _chain_num parameter with this addition.

this is an unused parameter for all the indexing processes so far i.e.

harmless to leave it for now IMO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s not really added to this PR. it’s just that this PR will index votes in a forward looking way

It's not included in the other processes, e.g. index_proposals.py, index_class_issuers, etc. We are adding a loop here that does not exist in the other processes, and which seems more relevant to #33 than #22. This pull request is already addressing two issues. It would be nice to focus on one issue at a time.

if we want to index multiple chains in the same processes, this process will still work for that

As mentioned in the review comment, I'm not sure one process for all chains is the best approach. It might be easier to run multiple instances of the process, one for each chain. We have a single environment variable for a a regen node. It is diverging from the current architecture and its not required for indexing votes as far as I can tell.

this is an unused parameter for all the indexing processes so far i.e.

Why was it added in the first place? Is there a future use case for it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an unused parameter for all the indexing processes so far i.e.

Why was it added in the first place? Is there a future use case for it?

Same question here. Maybe it could be useful if we want to run multiple instances of the process for each chain?

Adding the loop on max_block_heights.items() is needed for the new new_events_to_process logic. But the previous loop on all_chain_nums could potentially be avoided if we use the _chain_num param instead and distribute this logic into different processes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was it added in the first place? Is there a future use case for it?

I honestly forgot why I put it there...

Maybe it could be useful if we want to run multiple instances of the process for each chain?

But yes indeed, it can be useful for this.

Adding the loop on max_block_heights.items() is needed for the new new_events_to_process logic. But the previous loop on all_chain_nums could potentially be avoided if we use the _chain_num param instead and distribute this logic into different processes.

Yes, I think a good follow up task will be to simplify this as you've described here:

#44

This is also something that @ryanchristo and I discussed last week, so good that we are all in agreement it's a good thing to do.

sql/V1_12__votes.sql Outdated Show resolved Hide resolved
@wgwz
Copy link
Contributor Author

wgwz commented Sep 7, 2023

Should we address the multiple chains issue separately? I think there is more work to be done and I'm not sure one process for all chains is the best approach.

the way all the indexing processes are set up, is such that if we were to index multiple chains, we can successfully index events from multiple chains. the method i implemented in this PR was so that this stays true for all of the indexing processes.

the indexing processes in general support that by paying attn to chain_num and having that as key in the db tables.

@ryanchristo
Copy link
Member

the way all the indexing processes are set up, is such that if we were to index multiple chains

We have one environment variable for a single regen node, meaning yes it supports multiple chains but only if you run multiple instances, which I do think is the right approach. Adding a loop in the index_votes.py script to support multiple chains seems to diverge from the existing architecture.

Copy link
Member

@ryanchristo ryanchristo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per discussion, I think we should consider other solutions to the voter address not being available in the event as well as how we configure and support multiple chains. We should also make sure that we smooth out any inconsistencies now that we are introducing a new approach to how we index votes (i.e. different than how we index class issuers).

I tested a single chain. single proposal, and multiple votes with #41. Not necessarily thorough testing. You can see the logs for each service (i.e. db, indexer, etc.) separately here.

Nice work overall! Not an easy set of problems to solve.

@wgwz wgwz merged commit 927ddb3 into main Sep 12, 2023
@wgwz wgwz deleted the kyle/22-index-votes branch September 12, 2023 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants