Epoch hangs at one node #956

Closed
opened 2025-12-28 17:21:17 +00:00 by sami · 3 comments

Originally created by @vkarak1 on GitHub (Jan 25, 2023).

I have faced a problem where one of the nodes was stuck at epoch 179 while the other nodes were at 189:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node2:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node3:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node4:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
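The per-node checks above can be scripted. Below is a minimal sketch, assuming the same `node1`..`node4` hostnames and `-r`/`-g` flags shown above; the `lagging_nodes` helper is a hypothetical convenience, not part of neofs-cli.

```shell
# Hypothetical helper: reads "name epoch" pairs on stdin and prints
# every node whose epoch is below the maximum seen.
lagging_nodes() {
    awk '{epoch[$1] = $2; if ($2 > max) max = $2}
         END {for (n in epoch) if (epoch[n] < max) print n, epoch[n]}'
}

# Query each node (guarded so the sketch is a no-op without neofs-cli).
if command -v neofs-cli >/dev/null; then
    for n in node1 node2 node3 node4; do
        echo "$n $(neofs-cli netmap epoch -r "$n:8080" -g)"
    done | lagging_nodes
fi
```

Run against the state in this report, the loop would print only `node1 179`, the straggler.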

Please find output from the netmap snapshot command:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap snapshot -g -r node1:8080
Epoch: 179
Node 1: 02183147e3d30745d3ecf402679d1378bd549f36fc77bf282fc1ffda50a865d8da ONLINE /ip4/172.26.160.150/tcp/8080
        Continent: Europe
        Country: Finland
        CountryCode: FI
        Deployed: YACZROKH
        Location: Helsinki (Helsingfors)
        Node: 172.26.160.150
        Price: 10
        SubDiv: Uusimaa
        SubDivCode: 18
        UN-LOCODE: FI HEL
Node 2: 0309c85a8be8f0a11df58a2ee440a76f37a9fbe4b82b4fd6fcfdb21c95fa8480f7 ONLINE /ip4/172.26.160.46/tcp/8080
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Deployed: YACZROKH
        Location: Saint Petersburg (ex Leningrad)
        Node: 172.26.160.46
        Price: 10
        SubDiv: Sankt-Peterburg
        SubDivCode: SPE
        UN-LOCODE: RU LED
Node 3: 03736791ef911acdbc37625daf91c796e3fff63d5961e4c048368a6033699f4bfc ONLINE /ip4/172.26.160.157/tcp/8080
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Deployed: YACZROKH
        Location: Moskva
        Node: 172.26.160.157
        Price: 10
        SubDiv: Moskva
        SubDivCode: MOW
        UN-LOCODE: RU MOW
Node 4: 03855b0548c4408a4d7725111c55c7346715dea8701a948046eecec7928b10cac0 ONLINE /ip4/172.26.160.205/tcp/8080
        Continent: Europe
        Country: Sweden
        CountryCode: SE
        Deployed: YACZROKH
        Location: Stockholm
        Node: 172.26.160.205
        Price: 10
        SubDiv: Stockholms län
        SubDivCode: AB
        UN-LOCODE: SE STO

I also tried to issue the 'force-new-epoch' command; please find the result below:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 187, increase to 188.
Waiting for transactions to persist...
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 188, increase to 189.
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 188, increase to 189.
Waiting for transactions to persist...
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179

Expected Behavior

All nodes should be at the same epoch.

Possible Solution

Restarting three services helped me restore epoch consistency between the nodes:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# systemctl restart neofs-ir; systemctl restart neofs-storage; systemctl restart neo-go
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node2:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node3:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node4:8080 -g
190

Steps to Reproduce (for bugs)

The nodes had been used for failover tests for 3 days. The tests aim to kill each service in turn, and several "healthcheck" commands were issued against each service to be 100% sure it had been restarted and was ready to process requests.
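A one-shot healthcheck can pass while the service is still catching up. A sketch of a more patient readiness probe follows; the `wait_until` helper and the commented example command are hypothetical, not part of the failover suite described here.

```shell
# Hypothetical helper: retry a command until it succeeds, or give up
# after the given number of attempts (one second apart).
wait_until() {
    attempts=$1; shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example (substitute the real healthcheck used by the failover tests):
# wait_until 30 neofs-cli netmap epoch -r node1:8080 -g
```

Returning a non-zero status on timeout lets the test harness fail fast instead of proceeding against a node that never became ready.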

Logs.zip: https://github.com/nspcc-dev/neofs-node/files/10498814/epoch_hangs_full.zip

Your Environment

NeoFS Inner Ring node
Version: v0.35.0-11-g1a7c827a-dirty
GoVersion: go1.18.4
NeoFS Storage node
Version: v0.35.0-11-g1a7c827a-dirty
GoVersion: go1.18.4
NeoGo
Version: 0.100.1-pre-1-g0cb86f39
GoVersion: go1.18.4

@vkarak1 commented on GitHub (Jan 25, 2023):

This could be related to "unable to force-new-epoch" (https://github.com/nspcc-dev/neofs-node/issues/2172).


@fyrchik commented on GitHub (Jan 27, 2023):

Right before the 180 epoch tick:

Jan 25 09:36:55 az neofs-ir[638]: 2023-01-25T09:36:55.242Z        info        neofs-ir/main.go:88        internal error        {"msg": "morph chain connection has been lost"}
Jan 25 09:36:55 az neofs-ir[638]: 2023-01-25T09:36:55.242Z        info        neofs-ir/main.go:107        application stopped
Jan 25 09:36:55 az systemd[1]: neofs-ir.service: Succeeded.
Jan 25 09:36:55 az systemd[1]: neofs-ir.service: Consumed 14.959s CPU time.
Jan 25 09:36:55 az neofs-node[480]: 2023-01-25T09:36:55.250Z        info        client/multi.go:51        connection to the new RPC node has been established        {"endpoint": "ws://172.26.160.205:40332/ws"}
Jan 25 09:36:55 az systemd[1]: neo-go.service: Main process exited, code=killed, status=9/KILL
Jan 25 09:36:55 az systemd[1]: neo-go.service: Failed with result 'signal'.
Jan 25 09:36:55 az systemd[1]: neo-go.service: Consumed 44.372s CPU time.
Jan 25 09:36:57 az neofs-node[480]: 2023-01-25T09:36:57.101Z        debug        neofs-node/morph.go:232        new block        {"index": 13479}
Jan 25 09:37:00 az systemd[1]: neofs-ir.service: Scheduled restart job, restart counter is at 2.
Jan 25 09:37:00 az systemd[1]: neo-go.service: Scheduled restart job, restart counter is at 1.
Jan 25 09:37:00 az systemd[1]: Stopped NeoFS InnerRing node.
Jan 25 09:37:00 az systemd[1]: neofs-ir.service: Consumed 14.959s CPU time.
Jan 25 09:37:00 az systemd[1]: Stopped NeoGo N3-neo-go node.
Jan 25 09:37:00 az systemd[1]: neo-go.service: Consumed 44.372s CPU time.
Jan 25 09:37:00 az systemd[1]: Started NeoGo N3-neo-go node.
Jan 25 09:37:00 az systemd[1]: Started NeoFS InnerRing node.

@fyrchik commented on GitHub (Feb 3, 2023):

Closed via #2220.
