
Added Balance Pool Group data for Cardano #145

Open
wants to merge 1 commit into main

Conversation

Cerkoryn

All Submissions:

  • Have you followed the guidelines in our Contributing documentation?
  • Have you verified that there aren't any other open Pull Requests for the same update/change?
  • Does the Pull Request pass all tests?

Description

I added a different data source using Balance's Pool Group API, available here. The public API available at the bottom of the page provides a more conservative grouping of Cardano Pool Groups that is in line with other community tools such as Cexplorer. The script I used for obtaining and parsing the data is in the cardano_preprocessing folder as get_cardano_info_balance.py. By default, the script saves the identifier and cluster files within that subfolder to prevent overwriting, but cardano_pool_data.json will get overwritten.

Additionally, I made minor changes in a few other files to address UTF-8 encoding issues. A slightly more impactful change was made in gini.py to prevent an integer overflow issue I was running into; however, this seems to have caused test_analyze.py to fail. I was not able to run the consensus tool for Dogecoin without this change, though, so feel free to ignore it (or any of the others) if necessary.

Checklist

Update Mapping Support Information Submissions:

  • For which ledger do you update the mapping information?
    • Cardano
  • What mapping information do you update?
    • identifiers
    • addresses
    • clusters
    • legal links
  • Did you update the tests (if needed)?

@LadyChristina (Member) left a comment

Thanks a lot for your contributions! Overall, I think it is a good idea to integrate pool data from Balance and other explorers into our tool, but I have identified some issues with the current code, so please take a look and let me know your thoughts on my comments.

for pool in pools:
    current_time = time.time()
    elapsed_time = current_time - last_request_time
    if elapsed_time < request_interval:  # Trying to stay within Cardanoscan's rate limits.
Member

The script seems to be about getting data from Balance but the comment mentions Cardanoscan - are they the same thing?
Same applies to the comment on L37

Author

The pool group data (known as clusters in this consensus tool) is available via a public API on Balance's website. However, the Pool IDs in Balance's data are in BECH32 format, which is different from the format used for Pool IDs in the source data. I needed a way to map one to the other and found that Cardanoscan (a Cardano block explorer) stores both together. There may be a better way to do this, but I am not familiar with any techniques for it.
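For illustration, the BECH32-to-hex mapping can in principle be done locally, without querying an explorer, since a bech32 pool ID is just a re-encoding of the raw pool hash. A minimal sketch, assuming the PyPI `bech32` package (this is not what the PR's script does):

```python
from bech32 import bech32_decode, convertbits  # pip install bech32

def pool_bech32_to_hex(pool_id):
    """Convert a bech32 pool ID (pool1...) to the hex-encoded pool hash."""
    hrp, data = bech32_decode(pool_id)  # hrp should be 'pool' for pool IDs
    if hrp is None:
        raise ValueError(f"Invalid bech32 pool ID: {pool_id}")
    decoded = convertbits(data, 5, 8, False)  # regroup 5-bit words into 8-bit bytes
    return bytes(decoded).hex()
```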

Member

Oh ok, I get it. I think it's fine to do it this way, but it would help to add a comment that explains it, so that it's clearer.

    representing a pool.
    """
    logging.info("Getting pool data..")
    filename = 'cardano_pool_data.json'
Member

I would suggest saving this as a separate file instead of overwriting the one with on-chain data, as this way we lose a lot of information, such as a pool's ticker, homepage and name, which seem to be missing here. There is also a test that fails because a block gets mapped to a pool ticker (CFLOW) instead of a pool name (CashFlow).

Member

In general, I think it's nice to get additional data from another source (Balance) but perhaps it would make more sense to employ this only for the pools and clusters that are not already identified by our current method of looking at on-chain data directly. One approach for doing this would be to first load the current data and then update as needed before writing the new file. This is especially relevant for the identifiers file, where the one you have uploaded now contains no links to the pools' websites and is therefore inconsistent with the files of all other ledgers.
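A minimal sketch of this load-then-update approach; the file path, schema and `source` field below are illustrative assumptions, not the repo's actual layout:

```python
import json

def merge_balance_identifiers(identifiers_path, balance_identifiers):
    """Fill in only the pools that the on-chain method hasn't already identified."""
    with open(identifiers_path, encoding='utf-8') as f:
        identifiers = json.load(f)
    for pool_hash, entry in balance_identifiers.items():
        if pool_hash not in identifiers:  # keep existing on-chain entries untouched
            identifiers[pool_hash] = {**entry, 'source': 'Balance API'}
    with open(identifiers_path, 'w', encoding='utf-8') as f:
        json.dump(identifiers, f, indent=4, ensure_ascii=False)
```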

Author

That makes sense. My primary reason for integrating the Balance data was to validate the EDI consensus tool against known data sourced by the community. Although the community data doesn't have the rigorously verified legal links, websites, etc., it was heartening to see that the EDI methodology produced results very similar to those of community tools. For example, Balance reports a Nakamoto Coefficient of 30 by simply counting pool stake, while this tool, which counts actual historical blocks, gives a Nakamoto Coefficient of 32 from the same data. The difference could likely be explained by the randomness of slot battles or by pools missing their propagation windows.
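For context, the Nakamoto Coefficient is the minimum number of entities that together control more than half of the resource being measured. A minimal sketch of the computation (not the tool's implementation), which works the same whether the inputs are pool stakes or historical block counts:

```python
def nakamoto_coefficient(amounts):
    """Smallest number of entities whose combined amounts exceed 50% of the total."""
    total = sum(amounts)
    running = 0
    for count, amount in enumerate(sorted(amounts, reverse=True), start=1):
        running += amount
        if running > total / 2:
            return count
    return len(amounts)
```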

I could save the Balance data to separate files, but another option I've considered would be to add a --balance flag to the existing get_cardano_info.py. This approach could use the existing identifiers/clusters as a source of truth (retaining the websites, links, etc.) and only fill in the gaps that are missing. The link or source fields for these entries could be set to the Balance API.

Otherwise, I could simply merge the Balance data directly into the existing identifiers/clusters JSON files, which is less invasive and in line with your contribution guidelines.

Do you have a preference for any of these options? I can get to work on these changes, but it may take me some time to implement them in the pull request.

@LadyChristina (Member) · Apr 7, 2024

If you could add them to the existing identifiers / clusters files that would be great! Thanks again for taking the time to contribute and apologies for the slow responses :)

    Queries the Balance Pool Groups API and Cardanoscan to get pool data and write it to a json file.
    :returns: dictionary, where each key is a pool's hash and the corresponding value is a dictionary with the
        pool's metadata (name, ticker, homepage, description)
    :returns: list of dictionaries, where each each dictionary contains metadata (pool_hash, pool_id, pool_group, pool_ticker)
Member

duplicate "each"

Member

Also, it seems that two separate lists of dictionaries are returned, so it should be noted here that the return value is actually a tuple.
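For example, the docstring could be reworded along these lines (a sketch; the function name is hypothetical, standing in for the actual one in get_cardano_info_balance.py):

```python
def get_pool_data():  # hypothetical name
    """
    Queries the Balance Pool Groups API and Cardanoscan to get pool data and writes it to a json file.

    :returns: tuple of (pool_data, pool_groups), where pool_data is a dictionary mapping each pool's
        hash to a dictionary with the pool's metadata (name, ticker, homepage, description), and
        pool_groups is a list of dictionaries, where each dictionary contains metadata
        (pool_hash, pool_id, pool_group, pool_ticker)
    """
```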

if __name__ == '__main__':
    logging.basicConfig(format='[%(asctime)s] %(message)s', datefmt='%Y/%m/%d %I:%M:%S %p', level=logging.INFO)

    #mapping_info_dir = pathlib.Path(__file__).parent.parent
Member

this line can be deleted

    logging.basicConfig(format='[%(asctime)s] %(message)s', datefmt='%Y/%m/%d %I:%M:%S %p', level=logging.INFO)

    #mapping_info_dir = pathlib.Path(__file__).parent.parent
    mapping_info_dir = pathlib.Path(__file__).parent
Member

However, saving the files in this directory will throw an error unless the subdirectories "clusters" and "identifiers" are manually created. This could be prevented by automatically creating the subdirectories if they don't exist.
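A small sketch of what that could look like, continuing from the snippet above:

```python
import pathlib

mapping_info_dir = pathlib.Path(__file__).parent
# Create the output subdirectories if they don't already exist, so the
# file writes below don't raise FileNotFoundError.
for subdir in ('clusters', 'identifiers'):
    (mapping_info_dir / subdir).mkdir(parents=True, exist_ok=True)
```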

@@ -27,4 +27,11 @@ def gini(array):
    array = np.sort(array)
    index = np.arange(1, array.shape[0] + 1)
    n = array.shape[0]
    return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array))
    # Normalize the array to prevent overflow
Member

I think these changes are fine, but two tests fail because of the floating point differences introduced here. One option is to update the tests to use the new values, as they are not exactly wrong. Another option would be to round the Gini coefficient (and perhaps all other metrics too) to some fixed number of decimals, e.g. 5, and then use the same number of digits when testing.
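A minimal sketch combining the normalization with the suggested rounding, using the formula from the diff above (the 5-decimal rounding is the example value from this comment, not the PR's actual code):

```python
import numpy as np

def gini(array):
    """Gini coefficient of a 1-D array of non-negative values."""
    array = np.sort(np.asarray(array, dtype=np.float64))
    array = array / np.sum(array)  # normalize to prevent overflow with large integer inputs
    index = np.arange(1, array.shape[0] + 1)
    n = array.shape[0]
    return round(np.sum((2 * index - n - 1) * array) / (n * np.sum(array)), 5)
```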
