Mirror of https://github.com/nspcc-dev/neofs-node.git (synced 2026-03-01 04:29:10 +00:00)
Node fails to init shard in some cases #1449
Originally created by @roman-khimov on GitHub (Jul 1, 2025).
Originally assigned to: @cthulhu-rider on GitHub.
Expected Behavior
🟢
Current Behavior
🔴
Possible Solution
Unknown
Steps to Reproduce (for bugs)
https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3489-1751369771/index.html#suites/295ed2e000f8fa6f3ade59cc13b98615/dad9f158acc8a212/
Regression
Maybe.
Your Environment
@roman-khimov commented on GitHub (Jul 9, 2025):
So it's a lock and a tombstone in a single shard. Nice combo.
@roman-khimov commented on GitHub (Jul 10, 2025):
https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3567-1752135041/index.html#suites/295ed2e000f8fa6f3ade59cc13b98615/6057db5bdc59daee/
@roman-khimov commented on GitHub (Jul 24, 2025):
Blocked by https://github.com/nspcc-dev/neofs-testcases/issues/1108?
@carpawell commented on GitHub (Jul 25, 2025):
No.
@carpawell commented on GitHub (Jul 25, 2025):
There was once a temporary state when the error was more detailed, but resync error was not skipped yet: https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3636-1753225895/index.html#suites/295ed2e000f8fa6f3ade59cc13b98615/51606c8ab75498c4/
It gives us a TS address, and it is seen that it was PUT to all 4 nodes successfully, and then at the resyncing stage, it is surprisingly known that it should not be accepted because its target has been locked (wow).
@cthulhu-rider commented on GitHub (Jul 28, 2025):
seems like:
- 5115e2f48c covered the occurring error, so both TOMBSTONE and LOCK are stored in the metabase
- 7105afffc3 injected a LOCK check into metabase PUT (here), returning an error

so, the smallest fix i see is to not fail the tombstone PUT completely but skip the garbage bucket update (as before)
@cthulhu-rider commented on GitHub (Jul 29, 2025):
alright, idk how this happens in the test (dont see any removal), but currently LOCKing the TOMBSTONEd object is not prohibited. Having a container incl. a single SN with a single shard, the situation is 100% reproducible using the API. So, it is very possible.
@cthulhu-rider commented on GitHub (Jul 29, 2025):
assuming that LOCK+TOMBSTONE in the same shard is exactly the situation, i think the inconsistency occurs due to the variable order of objects (depending on their IDs) in the resync process:
- (5115e2f48c) iiuc, looking at the code, L marks are lost on resync in the new format case
- i need to test 3 and 4. If 4 is true, it existed before v0.48.0. Overall, dependence on order is unsafe
it seems inconvenient to me that handling of the old and new L/T formats is done at different layers now: old ones are handled by Shard while new ones by Metabase. They are equivalent up to the number of associated elements.
@roman-khimov commented on GitHub (Jul 29, 2025):
Old indexes are to be removed eventually, so we're in transition phase.
@cthulhu-rider commented on GitHub (Jul 29, 2025):
seems so
@roman-khimov commented on GitHub (Aug 5, 2025):
The same test fails with https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3709-1754406330/index.html#suites/295ed2e000f8fa6f3ade59cc13b98615/f2b70d4bbc29088e/ now
@cthulhu-rider commented on GitHub (Aug 11, 2025):
[SN2 / SN1 log excerpts and test queries not recoverable from the page]

according to the timings, the test first encountered the +1 replica state (*), and then the restored amount.

(*) `transport: Error while dialing: dial tcp 127.0.0.1:49278: connect: connection refused`: apparently SN3 was in shutdown (restart?) at that moment.

Options:
1. use `neofs-cli object nodes` and get the `N` list at the `Get Nodes With Object` stage. Require the first `REP` (2 now) of `N` to respond with `200`. Require the rest to respond with either `200` or `404`.
2. `Get Nodes With Object` with the same asserts.

i'd stick to 2, but let @evgeniiz321 review 1 first.
@roman-khimov commented on GitHub (Aug 11, 2025):
I think the original definition of the problem is no longer relevant. The issue of excessive copies is a test problem to me because there is a restart in this test and other nodes can and should react to that. Either we're making it so that the object copy is unique or we deal with (sometimes) excessive copies.
So let's reopen this in testcases.