Leveldb exception handle #3356

Closed
vncoelho opened this issue Jun 24, 2024 · 27 comments

@vncoelho
Member

vncoelho commented Jun 24, 2024

Describe the bug
Run a setup with 4 nodes running private net

To Reproduce
Steps to reproduce the behavior:
Start nodes and they will crash almost instantaneously

Error

dotnet: ./db/dbformat.cc:16: uint64_t leveldb::PackSequenceAndType(uint64_t, leveldb::ValueType): Assertion `seq <= kMaxSequenceNumber' failed.
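
For context, the assertion comes from the helper leveldb uses to pack a sequence number and a value type into a single 64-bit tag. What follows is a minimal Python sketch of that logic, not the C++ source itself: the Python names are illustrative, and the constants mirror leveldb's `kMaxSequenceNumber` (2^56 - 1) and `kTypeValue`. A sequence number above the limit does not fit in the upper 56 bits, so the assertion fires.

```python
# Illustrative re-statement of leveldb's PackSequenceAndType (db/dbformat.cc).
# Names are Python stand-ins chosen for this sketch, not part of any API.

K_MAX_SEQUENCE_NUMBER = (1 << 56) - 1  # the low 8 bits of the tag hold the value type
TYPE_DELETION = 0x0
TYPE_VALUE = 0x1


def pack_sequence_and_type(seq: int, value_type: int) -> int:
    """Pack a 56-bit sequence number and an 8-bit type into one 64-bit tag."""
    # This is the check that fails in the error above.
    assert seq <= K_MAX_SEQUENCE_NUMBER, "seq <= kMaxSequenceNumber"
    assert value_type <= TYPE_VALUE
    return (seq << 8) | value_type
```

With these definitions, `pack_sequence_and_type(1, TYPE_VALUE)` gives `257`, while any `seq` above `K_MAX_SEQUENCE_NUMBER` raises `AssertionError`, the Python analogue of the abort shown in the error above.
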
@cschuchardt88
Member

Need more information.

@vncoelho
Member Author

> Need more information.

description updated

@shargon
Member

shargon commented Jun 25, 2024

It seems the data is corrupted. Is it a fresh installation?

@vncoelho
Member Author

fresh with master

@vncoelho
Member Author

> It seems the data is corrupted. Is it a fresh installation?

Probably due to the unhandled exception management feature, but I have not investigated further yet.
It is easy to reproduce. Just run a node.

@Hecate2
Contributor

Hecate2 commented Jun 26, 2024

Is it because the 4 nodes are using the same directory for leveldb?
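
One way to rule this out is to compare the storage directories that each node resolves to. Below is a minimal Python sketch; the function name and the node-to-path mapping are hypothetical, and the paths would have to be copied by hand from each node's configuration.

```python
from collections import defaultdict
from pathlib import Path


def find_shared_storage_dirs(node_paths: dict[str, str]) -> dict[str, list[str]]:
    """Return {resolved_dir: [node names]} for directories used by more than one node.

    node_paths is a hypothetical mapping such as {"node1": "/data/node1/leveldb"}.
    """
    by_dir = defaultdict(list)
    for node, raw_path in node_paths.items():
        # resolve() makes the path absolute and follows symlinks, so two
        # different spellings of the same directory compare equal.
        by_dir[str(Path(raw_path).resolve())].append(node)
    return {d: sorted(nodes) for d, nodes in by_dir.items() if len(nodes) > 1}
```

An empty result means every node has its own directory. leveldb also guards each database directory with a `LOCK` file, so a second process opening the same directory would normally fail on open; the check above only helps confirm the configuration.
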

@cschuchardt88
Member

Based on the source code behind your error, it looks like your database is corrupt. Try deleting it to see if the problem goes away.

It has to do with seeking with the KeyComparator. The source code says:

// User key has become shorter physically, but larger logically.
// Tack on the earliest possible number to the shortened user key.

@vncoelho
Member Author

> Based on the source code behind your error, it looks like your database is corrupt. Try deleting it to see if the problem goes away.
>
> It has to do with seeking with the KeyComparator. The source code says:
>
> // User key has become shorter physically, but larger logically.
> // Tack on the earliest possible number to the shortened user key.

No, @cschuchardt88, it is a recently introduced problem.

@Jim8y
Contributor

Jim8y commented Jun 28, 2024

It's because you run too many nodes on the same machine that all use leveldb. Not a core problem. This happens every time you run multiple nodes on the same machine.

@vncoelho
Member Author

> It's because you run too many nodes on the same machine that all use leveldb. Not a core problem. This happens every time you run multiple nodes on the same machine.

No. This is not true in my setup.

@vncoelho
Member Author

Too many complaints and no real investigation of a simple scenario.
The cause is that we now crash the client on unhandled exceptions.

Without minimal tests, neo-cli will be unusable until we implement the exception handling and find the BASIC problems.

@vncoelho
Member Author

#3366 (comment)

@Jim8y
Contributor

Jim8y commented Jun 29, 2024

> Too many complaints and no real investigation of a simple scenario.

You can say this when you locate the real problem.

We have been working like this for many years, and all of a sudden it's all wrong and we all become complainers? And our work lacks investigation? But we definitely have tested it and checked it everywhere, and for this one, I have run the node. And I have asked NGD for help to test it as well.

But the code was there, the PR was there, and you were able to test, to review, to comment. We followed your suggestion to leave it open for a while for review. Actually, that PR was there for a week before I collected sufficient review approvals.

Before we release any new version, we can still correct any problem, so chill. Being a team means that even if someone makes a mistake, someone else can correct it, doesn't it?

> The cause is that we now crash the client on unhandled exceptions.

The funny part is that we should have crashed on an unhandled exception, unless we have set plugins to ignore unhandled exceptions. I would say that PR has found an issue, if any, instead of introducing one.

BTW, I admit that even when I run the test on my machine, I run a single node at most. I don't have a 4-node private net test environment. I will create one.

@AnnaShaleva
Member

AnnaShaleva commented Jul 1, 2024

> It's because you run too many nodes on the same machine that all use leveldb.

It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.

> I don't have a 4-node private net test environment.

I'd suggest you use NeoBench, but it's not yet updated to use the fresh monorepo; we have nspcc-dev/neo-bench#175 for that.

@vncoelho
Member Author

vncoelho commented Jul 1, 2024

> It's because you run too many nodes on the same machine that all use leveldb.
>
> It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.
>
> I don't have a 4-node private net test environment.
>
> I'd suggest you use NeoBench, but it's not yet updated to use the fresh monorepo; we have nspcc-dev/neo-bench#175 for that.

Are you using leveldb? Maybe it was rocksdb instead.

Were your experiments with the master branch?
Mine only runs now after reverting the exception-handling crash.

@cschuchardt88
Member

cschuchardt88 commented Jul 2, 2024

@vncoelho
Are you sure you didn't run out of storage (disk space)? Why don't you give #3355 a try?

@cschuchardt88
Member

cschuchardt88 commented Jul 2, 2024

Try doing ./neo-cli /repair or neo-cli.exe /repair

@vncoelho
Member Author

vncoelho commented Jul 2, 2024

> Try doing ./neo-cli /repair or neo-cli.exe /repair

This is not the case, @cschuchardt88.

The testing environment is the same with and without the PR being reverted.
The problem is that leveldb probably recovers from the crash, but the PR that handles exceptions detects it and then crashes the client.

The behavior may not be wrong. But this should have been tested before merging that PR, because the problem is easy to see.
Can you verify that, @superboyiii?

@cschuchardt88
Member

cschuchardt88 commented Jul 2, 2024

Try with this version of LevelDbStore #3274

@Jim8y
Contributor

Jim8y commented Jul 4, 2024

> It's because you run too many nodes on the same machine that all use leveldb.
>
> It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.
>
> I don't have a 4-node private net test environment.

I would love to argue, but I am not an expert on leveldb. All I can say is that it has now happened, and it is apparently a leveldb exception, not related to the core.

Possible reasons could be: platform, OS, version, dependencies. I would suggest trying rocksdb and memorydb as well.

@vncoelho
Member Author

> It's because you run too many nodes on the same machine that all use leveldb.
>
> It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.
>
> I don't have a 4-node private net test environment.
>
> I would love to argue, but I am not an expert on leveldb. All I can say is that it has now happened, and it is apparently a leveldb exception, not related to the core.
>
> Possible reasons could be: platform, OS, version, dependencies. I would suggest trying rocksdb and memorydb as well.

So, with this error and without the exception handling, it was good and safe to run a node?
Now, after the PR, the node is broken, right? Is it not a core problem?

@cschuchardt88
Member

cschuchardt88 commented Jul 11, 2024

It's a corruption problem.

We need more information on your setup:

  1. Are you using a container?
  2. What version of leveldb do you have?
  3. Which CI build are you using?
  4. What filesystem?
  5. What operating system?
  6. What CPU arch?
  7. Have you tried leveldb `repair`?
  8. What thread limit does your OS impose?
  9. Have you run a filesystem repair tool?
  10. Does this happen on other setups?
  11. What's your node setup?
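
Several of these answers can be collected by a script run inside the node's environment. The following is a rough Python sketch under stated assumptions: the function name and dictionary keys are made up for illustration, the `/.dockerenv` check is only a heuristic for Docker, and the process limit (`RLIMIT_NPROC`) is read only where the `resource` module provides it. It does not cover the leveldb version, filesystem type, or CI build.

```python
import os
import platform
import shutil


def collect_setup_info(data_dir: str = ".") -> dict:
    """Gather basic facts about the host, or container, that the node runs in."""
    info = {
        "os": platform.system(),            # e.g. "Linux"
        "os_release": platform.release(),   # kernel or OS release string
        "cpu_arch": platform.machine(),     # e.g. "x86_64" or "aarch64"
        "cpu_count": os.cpu_count(),
        # Heuristic only: Docker usually creates this file inside containers.
        "in_docker": os.path.exists("/.dockerenv"),
        # Free space on the filesystem that holds the database directory.
        "disk_free_bytes": shutil.disk_usage(data_dir).free,
    }
    try:
        import resource  # Unix only

        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NPROC)
        info["max_processes_soft"] = soft_limit  # -1 typically means unlimited
    except (ImportError, AttributeError):
        info["max_processes_soft"] = None  # not available on this platform
    return info
```

Calling `collect_setup_info("/path/to/node/data")`, with the path replaced by the node's actual data directory, and pasting the resulting dictionary into the issue would answer items 1, 5, 6 and 8, and also shows whether the disk is close to full.
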

@vncoelho
Member Author

vncoelho commented Jul 11, 2024

> 1. Are you using a `container`?

Yes

> 2. What `version` of `leveldb` do you have?

The plugin is compiled from master, with libleveldb-dev from apt-get on mcr.microsoft.com/dotnet/aspnet:8.0.3-jammy.

It is all in a Dockerfile, running in a container with the number of threads necessary for it to run safely.
It usually runs a node on mainnet with the resources it has available.
It runs perfectly without the commit that I said should be reverted until fixed.

The problem could, of course, be due to some limitation in leveldb. But that should have been handled before the PR was merged.
Furthermore, in my last tests rocksdb was also broken.

The only way to run a node nowadays is MemoryStore.

@vncoelho vncoelho reopened this Jul 15, 2024
@vncoelho
Member Author

Still crashing. I thought it was solved, but my config was using "MemoryStore" instead.

The problem persists even after updating all dotnet libraries during build and run.

RocksDb is also corrupted, but perhaps for a different reason.

@Jim8y
Contributor

Jim8y commented Jul 15, 2024

I will set up a multi-node network on my machine and check it.

@Jim8y Jim8y self-assigned this Jul 15, 2024
@gsmachado
Contributor

not entirely related, but see neo-project/neo-express#455

@vncoelho vncoelho changed the title from "Leveldb crash" to "Leveldb exception handle" Jul 18, 2024
@Jim8y Jim8y mentioned this issue Jul 18, 2024
15 tasks
@cschuchardt88
Member

fixed

7 participants