Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve NewEpochState logging #5854

Merged
merged 3 commits into from
May 28, 2024
Merged

Conversation

Jimbo4350
Copy link
Contributor

@Jimbo4350 Jimbo4350 commented May 23, 2024

Description

Improvements

  1. The log output of startLedgerNewEpochStateLogging is now pretty JSON
  2. Output the diff of each transition of NewEpochState (i.e per block or at the turn of the epoch) to /logs/ledger-new-epoch-state-diffs.log

Prior output format of generated logs (calling show after each block application/epoch transition):

#### BLOCK ####
NewEpochState {nesEL = EpochNo 0, nesBprev = BlocksMade (fromList []), nesBcur = BlocksMade (fromList [(KeyHash "af20b2f144dc28a3bd62a300fa285666e3d3dc211ba483a2e4944846",1)]), nesEs = EpochState {esAccountState = AccountState {asTreasury = Coin 0, asReserves = Coin 191999999999}, esLState = LedgerState {lsUTxOState = UTxOState {utxosUtxo = UTxO (fromList [(TxIn (TxId {unTxId = SafeHash "2a9bb38ea6d56726250b03ab8dabfdce9dc9fb0021472e93cc51f6497775fbb2"}) (TxIx 0),(Addr Testnet (KeyHashObj (KeyHash "a17ad57373ec284b7ef329b744affdacf26dac6f6797ef0a715fe53d")) (StakeRefBase (KeyHashObj (KeyHash "0a19ee2042807a1b6925c170e545de22e0dc1b5f3e4dff4d70f70ca6"))),MaryValue (Coin 300000000000) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "49e5059fc654f4640243b2c6ede2f2537009281722ba7a53816787ac2c0f3070"}) (TxIx 0),(AddrBootstrap (BootstrapAddress (Address {addrRoot = 06f9f79cccef8ecbf9e680d3302f4d6916dbdb0bb094ff58cc7155ab, addrAttributes = Attributes { data_ = AddrAttributes {aaVKDerivationPath = Nothing, aaNetworkMagic = NetworkTestnet 42} }, addrType = ATVerKey})),MaryValue (Coin 2666666667) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "4ed4fd57533f3ba7316d47c013c19b21126940c20aa360ba09c7c4dead4852bc"}) (TxIx 0),(Addr Testnet (KeyHashObj (KeyHash "674a339680452ac1e2a8260c872b12ff98a5cdc19d4fca8f606bfa31")) StakeRefNull,MaryValue (Coin 300000000000) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "59a53a350e4b6e28ba38f920bcc3a9104fee48ecda7a4efad9765af078d32b5c"}) (TxIx 0),(Addr Testnet (KeyHashObj (KeyHash "05c601a3a326f7499d3c63d59613b1f04097c4fb1b5c9798ea3922bc")) (StakeRefBase (KeyHashObj (KeyHash "411dbefbd05427f8f6b4cd4e53205dc0d19df3080874fff3b9e3e3d9"))),MaryValue (Coin 300000000000) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "5f1f67f02daf2ce83dfc94a1e6c6b2f29ed41751887772ee5f922258926f62a3"}) (TxIx 0),(AddrBootstrap (BootstrapAddress (Address {addrRoot = a7011aecc6f237f2200606114ed891f3f07482ec1e58aba2d5300a89, addrAttributes = Attributes { data_ = AddrAttributes {aaVKDerivationPath = Nothing, aaNetworkMagic = NetworkTestnet 42} }, addrType = ATVerKey})),MaryValue (Coin 2666666667) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "6d77de0e223335be259cb9764dcb279259249a562ed309a254f963b336158d0d"}) (TxIx 0),(AddrBootstrap 

#### BLOCK ####
NewEpochState {nesEL = EpochNo 1, nesBprev = BlocksMade (fromList []), nesBcur = BlocksMade (fromList [(KeyHash "af20b2f144dc28a3bd62a300fa285666e3d3dc211ba483a2e4944846",2)]), nesEs = EpochState {esAccountState = AccountState {asTreasury = Coin 0, asReserves = Coin 191999999999}, esLState = LedgerState {lsUTxOState = UTxOState {utxosUtxo = UTxO (fromList [(TxIn (TxId {unTxId = SafeHash "2a9bb38ea6d56726250b03ab8dabfdce9dc9fb0021472e93cc51f6497775fbb2"}) (TxIx 0),(Addr Testnet (KeyHashObj (KeyHash "a17ad57373ec284b7ef329b744affdacf26dac6f6797ef0a715fe53d")) (StakeRefBase (KeyHashObj (KeyHash "0a19ee2042807a1b6925c170e545de22e0dc1b5f3e4dff4d70f70ca6"))),MaryValue (Coin 300000000000) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "49e5059fc654f4640243b2c6ede2f2537009281722ba7a53816787ac2c0f3070"}) (TxIx 0),(AddrBootstrap (BootstrapAddress (Address {addrRoot = 06f9f79cccef8ecbf9e680d3302f4d6916dbdb0bb094ff58cc7155ab, addrAttributes = Attributes { data_ = AddrAttributes {aaVKDerivationPath = Nothing, aaNetworkMagic = NetworkTestnet 42} }, addrType = ATVerKey})),MaryValue (Coin 2666666667) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "4ed4fd57533f3ba7316d47c013c19b21126940c20aa360ba09c7c4dead4852bc"}) (TxIx 0),(Addr Testnet (KeyHashObj (KeyHash "674a339680452ac1e2a8260c872b12ff98a5cdc19d4fca8f606bfa31")) StakeRefNull,MaryValue (Coin 300000000000) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "59a53a350e4b6e28ba38f920bcc3a9104fee48ecda7a4efad9765af078d32b5c"}) (TxIx 0),(Addr Testnet (KeyHashObj (KeyHash "05c601a3a326f7499d3c63d59613b1f04097c4fb1b5c9798ea3922bc")) (StakeRefBase (KeyHashObj (KeyHash "411dbefbd05427f8f6b4cd4e53205dc0d19df3080874fff3b9e3e3d9"))),MaryValue (Coin 300000000000) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "5f1f67f02daf2ce83dfc94a1e6c6b2f29ed41751887772ee5f922258926f62a3"}) (TxIx 0),(AddrBootstrap (BootstrapAddress (Address {addrRoot = a7011aecc6f237f2200606114ed891f3f07482ec1e58aba2d5300a89, addrAttributes = Attributes { data_ = AddrAttributes {aaVKDerivationPath = Nothing, aaNetworkMagic = NetworkTestnet 42} }, addrType = ATVerKey})),MaryValue (Coin 2666666667) (MultiAsset (fromList [])),NoDatum,SNothing)),(TxIn (TxId {unTxId = SafeHash "6d77de0e223335be259cb9764dcb279259249a562ed309a254f963b336158d0d"}) (TxIx 0),(AddrBootstrap 

Pretty JSON output improvement:

#### BLOCK ####
{
    "currentEpoch": 0,
    "currentEpochBlocks": {
        "814e499df8e9ff7732c07c342ce0f0ecc1b939d36515177035232588": 1
    },
    "currentEpochState": {
        "esAccountState": {
            "reserves": 191999999999,
            "treasury": 0
        },
        "esLState": {
            "delegationState": {
                "dstate": {
                    "fGenDelegs": [],
                    "genDelegs": {
                        "07a6850426b4c50b03658f91e271d159c7d093b59e3faf68aad92f63": {
                            "delegate": "c45e6e2952d6bd58767b66b134e18ca4bdd46a85b400e75bb84b70ef",
                            "vrf": "0bd71f801f2fe70513488248bab5fbba0aa479e964b29458ea296927de012c18"
                        },
                        ....

#### BLOCK ####
{
    "currentEpoch": 0,
    "currentEpochBlocks": {
        "814e499df8e9ff7732c07c342ce0f0ecc1b939d36515177035232588": 2
    },
    "currentEpochState": {
        "esAccountState": {
            "reserves": 191999999999,
            "treasury": 0
        },
        "esLState": {
            "delegationState": {
                "dstate": {
                    "fGenDelegs": [],
                    "genDelegs": {
                        "07a6850426b4c50b03658f91e271d159c7d093b59e3faf68aad92f63": {
                            "delegate": "c45e6e2952d6bd58767b66b134e18ca4bdd46a85b400e75bb84b70ef",
                            "vrf": "0bd71f801f2fe70513488248bab5fbba0aa479e964b29458ea296927de012c18"
                        },
                        ....

New epoch state diff output:

Epoch state transition: 4
6c6
<         "e42f09a8c25b867068c106567fedec0da386f4b661ccc6dd56db99d9": 3
---
>         "e42f09a8c25b867068c106567fedec0da386f4b661ccc6dd56db99d9": 4

Epoch state transition: 5
3c3
<     "currentEpoch": 0,
---
>     "currentEpoch": 1,
5,6c5
<         "814e499df8e9ff7732c07c342ce0f0ecc1b939d36515177035232588": 1,
<         "e42f09a8c25b867068c106567fedec0da386f4b661ccc6dd56db99d9": 4
---
>         "814e499df8e9ff7732c07c342ce0f0ecc1b939d36515177035232588": 1
121c120
<                 "deposited": 1000000,
---
>                 "deposited": 4000000,
780c779,780
<                         "tag": "NoPParamsUpdate"
---
>                         "contents": null,
>                         "tag": "PotentialPParamsUpdate"
783c783,820
<                         "enactedGovActions": [],
---
>                         "enactedGovActions": [
>                             {
>                                 "actionId": {
>                                     "govActionIx": 0,
>                                     "txId": "6047fc73079a2efab718be3a39705ad17e26b8b06bc22678248688bff9cd7ae3"
>                                 },
........

Checklist

  • Commit sequence broadly makes sense and commits have useful messages
  • New tests are added if needed and existing tests are updated. These may include:
    • golden tests
    • property tests
    • roundtrip tests
    • integration tests
      See Runnings tests for more details
  • Any changes are noted in the CHANGELOG.md for affected package
  • The version bounds in .cabal files are updated
  • CI passes. See note on CI. The following CI checks are required:
    • Code is linted with hlint. See .github/workflows/check-hlint.yml to get the hlint version
    • Code is formatted with stylish-haskell. See .github/workflows/stylish-haskell.yml to get the stylish-haskell version
    • Code builds on Linux, MacOS and Windows for ghc-8.10.7 and ghc-9.2.7
  • Self-reviewed the diff

Note on CI

If your PR is from a fork, the necessary CI jobs won't trigger automatically for security reasons.
You will need to get someone with write privileges. Please contact IOG node developers to do this
for you.

@Jimbo4350 Jimbo4350 requested a review from a team as a code owner May 23, 2024 14:36
@Jimbo4350 Jimbo4350 force-pushed the jordan/improve-new-epoch-state-logging branch 2 times, most recently from af9f3bd to 746038c Compare May 23, 2024 14:45
handler outputFp (AnyNewEpochState sbe nes) = do
handleException . liftIO $ do
appendFile outputFp $ "#### BLOCK ####" <> "\n"
appendFile outputFp $ BSC.unpack (shelleyBasedEraConstraints sbe $ encodePretty nes) <> "\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is flushing a concern for our logging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I was using the epoch state logs to investigate what happened after the failure, because the slots are short in the testnet tests. But if we'd like a live view, it would be better to flush, to not get partial JSON.


-- In order to create a diff of the epoch state logging file contents
-- we must do so when the resources are deallocated (see runInBackground)
-- and therefore when the log file is no longer in use.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant by "the file is no longer in use". Does it mean no longer being written to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you waiting on 228 to complete? I would use a synchronisation primitive of some kind. My primitive of choice is STM, but I think an MVar would work.

Copy link
Contributor Author

@Jimbo4350 Jimbo4350 May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go for an MVar.
Edit: This wasn't possible. See my most recent commit.

handler :: FilePath -> AnyNewEpochState -> StateT Int IO LedgerStateCondition
handler outputFp (AnyNewEpochState sbe nes) = do
handleException . liftIO $ do
appendFile outputFp $ "#### BLOCK ####" <> "\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this tag a json too? I'm thinking about the use case, where we could filter the json logs through jq, so a non-json line would break jq afair.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but that would also break because of encodePretty. It would be nice to have an option to get only JSON lines here, so it could be consumed by jq. Consider my comment as a potential improvement outside of the scope of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, jsonl is better than json.

-- TODO: I hate this. Is there a better way to do this without
-- reaching for concurrency primitives?
isFileReadableLoop :: FilePath -> IO Bool
isFileReadableLoop fp = do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appendFiles in 248 and 249 are constantly opening and closing the files so I don't think this will work - I think you might get inbetween accidentally. I second John's advice https://github.com/IntersectMBO/cardano-node/pull/5854/files#r1611865524

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed some changes. Let me know what you think.

:: String
-> [(Text, Text)]
epochStateBeforeAfterValues logFileContents =
let allEpochStates = filter (/= "") . Text.splitOn "#### BLOCK ####" $ Text.pack logFileContents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: move the block tag to a variable and allow reusing it. Both usage places are 40 lines apart, so it'd be easy to get overlooked by someone making changes here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather do this in a follow up PR. I want to get the concurrency stuff right first.

isFileReadableLoop logFile >>= \case
False -> error "isFileReadableLoop: Impossible"
True -> do
logFileContents <- IO.readFile logFile
Copy link
Contributor

@carbolymer carbolymer May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logFileContents <- IO.readFile logFile
logFileContents <- Text.readFile logFile

You're using Text later anyway

then "No changes in epoch state"
else ppDiff diffResult

instance (L.EraTxOut ledgerera, L.EraGov ledgerera) => ToJSON (L.NewEpochState ledgerera) where
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we later upstream this to ledger?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that 👍

-- In order to create a diff of the epoch state logging file contents
-- we must do so when the resources are deallocated (see runInBackground)
-- and therefore when the log file is no longer in use.
void . H.evalM $ allocate (pure ()) $ \_ -> do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also catch AsyncCancelled in the handler and act upon it instead. That may be easier than installing cleanups in MonadResource.

Copy link
Contributor Author

@Jimbo4350 Jimbo4350 May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I found out that technically allocate is the wrong thing to use here because we are forking a thread. See my most recent changes + comments.

appendFile outputFp $ "#### BLOCK ####" <> "\n"
appendFile outputFp $ show anyNewEpochState <> "\n"
pure ConditionNotMet
handler :: FilePath -> AnyNewEpochState -> StateT Int IO LedgerStateCondition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the Int in the state? It appears unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to remove it

Copy link
Contributor

@carbolymer carbolymer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Having epoch states as as json will be definitely more useful.

I think this may be problematic: https://github.com/IntersectMBO/cardano-node/pull/5854/files#r1613083518 . This also may be an easier alternative: https://github.com/IntersectMBO/cardano-node/pull/5854/files#r1613104145

@newhoggy newhoggy self-requested a review May 25, 2024 04:09
ignoreException :: E.SomeException -> Bool
ignoreException e =
case E.fromException e of
Just (MuxError errType _) ->
Copy link
Contributor

@carbolymer carbolymer May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be wrapped in ExceptT exception in foldEpochState. Now it seems that exceptions processing is fragmented - some are caught here, some are caught there and wrapped in ExceptT, which seems unintuitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show me what you want here? We need to ignore the MuxBearerClosed which occurs because the node is foldEpochState is connected to is shut down.

Copy link
Contributor

@carbolymer carbolymer May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like foldEpochState to be returning an error in ExceptT or a synchronous exception - not both of them. We're catching IOExceptions there already for example.

where
cleanUp :: H.Async a -> IO ()
cleanUp a = H.cancel a >> void (H.link a)
runInBackground runOnException act =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function body is too convoluted now.

  1. The constraint MonadResource m is redundant. You're running ResourceT in line 76, which means that any resource allocations here are freed in 76.
  2. You're actually not allocating any resources with ResourceT here. The runOnException and act are in IO, so you can't use ResourceT from this function. Why it's needed here? It does not help with concurrency.
  3. resourceForkWith f act - this looks like an incorrect use to me. f is a forking function, act is an action which that forking function gets as an argument (wrapped in bracket and ResourceT cleanup stuff) and executes. f is executed masked, but act is executed unmasked. So, your act does nothing, which reduces resourceForkWith to bracket_, because you're not using MonadResource here either. So why not just use bracket_ and async in the first place?
  4. What happens to the executed thread when the parent thread completes? Does it get re-parented to the test suite thread? It seems to me that it will live as long as the node, which results in foldEpochState throwing. So if you don't want to kill the thread, you can just fork the logger and forget, relying on that it will quit on broken node connection.
  5. linkOnly ignoreException r - so this will fail the test case on all other exceptions than MuxBearerClosed. What scenarios do you have in mind here?
  6. E.onException catches synchronous and asynchronous exceptions here. I think you should catch only synchronous (using onException from Control.Exception.Safe).

Copy link
Contributor Author

@Jimbo4350 Jimbo4350 May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This is true, removed.

2/3: We can and should use bracket 👍 . Initially I was trying to share the file handle between the threads and so I reached for resourceForkWith. However we are simply waiting for another thread to be finished with the resource (i.e we are not interleaving).

  1. In our use case when the parent thread completes our background process throws an exception because the node it was connected too shuts down. In the context of how we use runInBackground we can fork and forget because an exception will be thrown from foldEpochState.

  2. This is correct. I didn't have any specific scenarios in mind but if an unexpected exception is thrown we shouldn't ignore it.

  3. This is also a valid suggestion in our context. The parent thread will get terminated so we always expect our thread to throw a synchronous exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I responded in slack. This currently works, lets get it merged and you can investigate using bracket in a follow up PR.

Comment on lines -71 to -74
runInBackground act = void . H.evalM $ allocate (H.async act) cleanUp
where
cleanUp :: H.Async a -> IO ()
cleanUp a = H.cancel a >> void (H.link a)
Copy link
Contributor

@carbolymer carbolymer May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replying to: #5854 (comment)
I think allocate (H.async act) is correct here. Resources are cleaned in LIFO order, so this should ensure that the thread gets killed before resources it uses are freed.

@Jimbo4350 Jimbo4350 force-pushed the jordan/improve-new-epoch-state-logging branch from 0d8b075 to 7c1b7e8 Compare May 28, 2024 11:14
the recommended concurrency primitive the library exposes
Update `runInBackground` with the ability to execute an IO action when
the background thread is terminated due to an unexpected exception
@Jimbo4350 Jimbo4350 force-pushed the jordan/improve-new-epoch-state-logging branch from 7c1b7e8 to 37bc79b Compare May 28, 2024 14:01
@Jimbo4350 Jimbo4350 enabled auto-merge May 28, 2024 14:56
@Jimbo4350 Jimbo4350 disabled auto-merge May 28, 2024 16:19
@Jimbo4350 Jimbo4350 merged commit 6a2eefe into master May 28, 2024
16 of 17 checks passed
@Jimbo4350 Jimbo4350 deleted the jordan/improve-new-epoch-state-logging branch May 28, 2024 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants