MPT can't get synced if RemoveUntraceableBlocks enabled #1509

Open
opened 2025-12-28 17:16:41 +00:00 by sami · 1 comment

Originally created by @Ayrtat on GitHub (Apr 18, 2025).

Current Behavior

MPT sync is not possible on the n-th node when the other n-1 nodes have run GC on the MPT

Expected Behavior

MPT sync is possible; the n-th node gets ready for the protocol as its state gets synced

Possible Solution

Introduce a flag along with `RemoveUntraceableBlocks` which clarifies that the MPT won't [get truncated](https://github.com/nspcc-dev/neo-go/blob/master/pkg/core/stateroot/module.go#L301). Meanwhile, removing old data and transfers is OK

OR

Fix [billet traversal](https://github.com/nspcc-dev/neo-go/blob/master/pkg/core/statesync/module.go#L257)

Steps to Reproduce

This problem can be reproduced if we have **4** neo-go nodes

Generate chain

  1. These protocol parameters must be enabled for **all** nodes: `StateRootInHeader`, `P2PStateExchangeExtensions`
  2. Application parameter `RemoveUntraceableBlocks` must be enabled for **all** nodes
  3. It's recommended to keep a short blockchain "tail": `MaxTraceableBlocks: 2104`. Tune `GarbageCollectionPeriod` to make GC more frequent (`2104` is fine).
  4. Set `TimePerBlock: 1s` to generate blocks faster
  5. Generate as many blocks as possible. The problem can be reproduced at 230k
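Putting the steps above together, a node config fragment for the repro might look like this (YAML layout as in neo-go configs; exact field placement may differ between versions, so treat this as a sketch rather than a ready-made file):

```yaml
ProtocolConfiguration:
  StateRootInHeader: true
  P2PStateExchangeExtensions: true
  MaxTraceableBlocks: 2104
  TimePerBlock: 1s
ApplicationConfiguration:
  RemoveUntraceableBlocks: true
  GarbageCollectionPeriod: 2104
```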

Try to sync the n-th node from the very beginning

  1. Take 1 of the 4 nodes
  2. Stop neo-go
  3. Remove `mainnet.bolt`
  4. Start neo-go
  5. You get `panic: failed to get MPT node from the pool`. See [billet](https://github.com/nspcc-dev/neo-go/blob/master/pkg/core/statesync/module.go#L257)

Context

I've tried to carry out an experiment: sync the n-th node from the very beginning while the rest of the n-1 nodes have removed **untraceable** blocks.
Here is my understanding of why this problem occurs:

  1. If `StateRootInHeader` and `P2PStateExchangeExtensions` are enabled, then we activate this [path](https://github.com/nspcc-dev/neo-go/blob/master/pkg/core/statesync/module.go#L112) in the stateSync module
  2. It syncs headers. The number of headers is calculated [here](https://github.com/nspcc-dev/neo-go/blob/master/pkg/core/statesync/module.go#L140). If we take the default value for `StateSyncInterval`, then for 230k blocks in the network it's going to sync ~200k headers
  3. As soon as the headers are synced, the server [requests MPT nodes](https://github.com/nspcc-dev/neo-go/blob/master/pkg/network/server.go#L909)
  4. When the [billet](https://github.com/nspcc-dev/neo-go/blob/master/pkg/core/statesync/module.go#L257) tries to traverse nodes, it fails with a panic. I suppose this is because we can't get the MPT from the past

Regression

No idea

Your Environment

The problem was reproduced on a cluster but can also be reproduced with dev-env

**Version: 0.106.3**, but I suppose it also applies to the latest version


@roman-khimov commented on GitHub (Apr 18, 2025):

  1. It looks like a misconfigured network. There is an inherent race between new-node synchronization and old-node GC. Old nodes store the MPT up to 2×`StateSyncInterval` back. A new node picks its target height as the nearest state sync point relative to the current height. So it has from `StateSyncInterval` to 2×`StateSyncInterval` of time to synchronize; otherwise the sync point becomes obsolete and there is nothing you can do about it. This can happen during header processing or during MPT fetching.
  2. Still, when it's fetching headers it can technically fetch up to the latest one (I don't remember exactly, likely that's what it does anyway already) and determine the state sync point afterwards.
  3. At this point there is nothing more we can do: the MPT has to be synchronized, and if `StateSyncInterval` is too small, it's a network configuration problem.
  4. But it all shouldn't panic anyway.
  5. https://github.com/neo-project/neo/issues/3463 is the future.