mirror of
https://github.com/nspcc-dev/neo-go.git
synced 2026-03-01 04:28:51 +00:00
Stop consensus if N last blocks failed to persist #1238
Originally created by @fyfyrchik on GitHub (Dec 15, 2023).
Problem
The neo-go database lives on disk. If the disk runs out of space, the database can no longer grow and blocks accumulate in memory, with these messages in logs:
This is not a problem in public networks, but in a privnet scenario, where all nodes are likely to have similar configurations, it can happen on every node at the same time. This is pretty bad, because the nodes then fail with OOM, neo-go restarts and some of the last blocks are lost (the exact number depends on the amount of RAM and on timing), not to mention possible problems with consensus after the restart.
Another problem is that clients connected to such a node may not be prepared for time travel, that is, for the node returning to an earlier chain height after the restart.
Proposed solution
Add a MaxFailedToPersistBlockCount: N config setting, which stops the consensus service if the last N blocks failed to be persisted. The administrator can then extend the partition or free up space by other means.
I am ready to do a PR if we accept this solution.
This has proved useful: we have encountered this situation twice. The second time, this solution worked as expected, and we were able to free up space on the partition and resume consensus without restarting neo-go or losing any blocks.
@roman-khimov commented on GitHub (Dec 15, 2023):
In general, I don't like new magic options; alternatives are:
At the same time, some reaction to write failures can be useful for ordinary nodes too.
Any other suggestions?
@fyfyrchik commented on GitHub (Dec 15, 2023):
So this is similar to MaxFailedToPersistBlockCount: 1 in the proposed solution? That also looks good to me. The option was initially there to handle transient failures smoothly, even though I don't have a particular example besides "no space", which is not transient. Having no option does indeed look better.
@roman-khimov commented on GitHub (Dec 15, 2023):
Looks like.
Yeah, usually it's either "out of space" or "your disk/FS is broken, have fun".