mirror of
https://github.com/nspcc-dev/neofs-node.git
synced 2026-03-01 04:29:10 +00:00
MacOS tests failing #1388
Reference: nspcc-dev/neofs-node#1388
Originally created by @roman-khimov on GitHub (Mar 25, 2025).
Originally assigned to: @cthulhu-rider on GitHub.
Expected Behavior
Green
Current Behavior
Red. macOS tests fail after #3234.
Possible Solution
Unknown.
Steps to Reproduce (for bugs)
Run tests on macOS via GitHub Actions. Local runs are confirmed to be OK by @evgeniiz321.
Regression
Yes.
@cthulhu-rider commented on GitHub (Mar 25, 2025):
Debug logs to review: https://github.com/nspcc-dev/neofs-testcases/pull/994. A quick `grep` of the SN1 logs didn't reveal any problems; I'll come back to them later.

I've already played with timeouts a bit:
- `neofs-cli object delete` timeout 15s -> 60s "fixed" some NeoFS suites using this command
- `policer.head_timeout` default 5s -> 15s "fixed" the storage policy suite, which failed due to a redundant copy

I also suggest comparing steps' durations against timeouts in runs where the tests PASSed; maybe they also ended right on the edge (I doubt it).
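For reference, the second timeout mentioned above lives in the storage node configuration; a hedged sketch of the relevant fragment (key names follow the neofs-node config layout as I understand it; the value is the one from the experiment):

```yaml
policer:
  head_timeout: 15s  # default is 5s; raised to 15s in the experimental runs above
```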
@cthulhu-rider commented on GitHub (May 6, 2025):
I noticed that the `Put object to random node` step takes ~10s sometimes, but not always. Measurement of the object content validation showed: https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3182-1746451558/index.html#suites/c88214427ec6d7cddd7d0b6aa757ad15/220fbc29d21ab6ed/

There is definitely something wrong with this check: there is nothing in it that should take 10s. Will see.

P.S. This still doesn't explain why DELETE fails due to a timeout when the LINK object doesn't appear. But maybe the reason is close.
@roman-khimov commented on GitHub (May 6, 2025):
We know macOS is slow with tzhash, but this wasn't an issue previously, and the amount of data to hash in tests is rather low. Still, maybe it has some influence here.
@cthulhu-rider commented on GitHub (May 13, 2025):
At the moment the problem is localized in the gRPC layer, which explains the regression of the tests. It took me quite a few runs, because sometimes the server's request handlers also took 10s+ to execute. However, this only concerned requests in the p2p case with forwarding.
Enabling logs with ENVs didn't help me at all. There is also `GRPC_TRACE`, but I haven't tried it yet (the envs are tested here). So, to get the needed debug info, I made a fork branch of the gRPC library and pulled it into the SN one. With this, I found out where the seconds are spent: DNS. To be more precise, this part of the whole request processing on macOS:

(trace screenshot not captured in the mirror)

As we can see, resolving `localhost:49327` took ~10s, which explains the delay in query execution. Compared to Ubuntu:

(trace screenshot not captured in the mirror)

To sum up, for now I tend to think that the transition from `DialContext` to `NewClient` made in 979811e90f changed address processing somehow. TODO:
@cthulhu-rider commented on GitHub (May 13, 2025):
Gotcha! Indeed, with the return to the `passthrough` scheme all tests passed: https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3224-1747135106/index.html

7fdabd6a277bcc4ea07bd65d74bcf51dd8082b50 heals the tests. Taking into account that the `dns` resolver is strongly proposed by the gRPC authors, I don't wanna just apply this change like that. Instead, I will try to figure out what causes the freezing on Mac.
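Mechanically, "return to the `passthrough` scheme" means the dial target bypasses gRPC's name resolution entirely: a target with an explicit `passthrough:///` prefix is dialed verbatim, while a scheme-less one falls through to the default resolver (`dns` for `NewClient`). A hedged sketch of such target rewriting (`withPassthrough` is my hypothetical helper, not the actual neofs-node code):

```go
package main

import (
	"fmt"
	"strings"
)

// withPassthrough forces gRPC's passthrough resolver by prefixing a
// scheme-less target, so the default dns resolver is skipped entirely.
func withPassthrough(target string) string {
	if strings.Contains(target, "://") {
		return target // an explicit scheme is already set, keep it
	}
	return "passthrough:///" + target
}

func main() {
	fmt.Println(withPassthrough("localhost:49327"))
	fmt.Println(withPassthrough("dns:///st4.storage.fs.neo.org:8080"))
}
```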
@roman-khimov commented on GitHub (May 13, 2025):
`localhost` should be resolved instantly; I'm not sure why Mac has any problems with that. Don't they have /etc/resolv.conf with it as well?

@cthulhu-rider commented on GitHub (May 13, 2025):
Maybe https://github.com/golang/go/issues/52839.
@cthulhu-rider commented on GitHub (May 13, 2025):
mac: (trace not captured in the mirror)

ubuntu: (trace not captured in the mirror)
@cthulhu-rider commented on GitHub (May 14, 2025):
Without the DNS resolver (the behavior before 979811e90f):

mac: (trace not captured in the mirror)

linux: (trace not captured in the mirror)
The DNS resolver adds noticeable costs in the test env, even critical ones on Mac. All other envs were perfectly fine without it. It should also be noted that its usage was unintentional and happened invisibly due to the use of Linux (quite unexpectedly, I'd say).

So, I propose switching back to the old resolver: there were and are no known problems with it, and the switch is one line of code. At the same time, the transition to the DNS resolver can make sense and be left as a separate issue.
@roman-khimov
@roman-khimov commented on GitHub (May 14, 2025):
Not sure I understand the difference. We're supposed to be compatible with DNS (non-IP) names anyway, and we were compatible; we have `/dns4/st4.storage.fs.neo.org/tcp/8080` and the like in the mainnet network map already. Let me check.

@roman-khimov commented on GitHub (May 14, 2025):
IIUC, this all comes from the fact that the gRPC library allows overriding resolver plugins at its own level (rather than in `Dial`, which is somewhat more canonical for Go). DNS is the default there, and then it works badly in some cases. I still don't understand why that's the case, since a regular `Dial` would handle the same address fine, and it works fine with proper DNS names as well. Most likely resolving at the gRPC level is beneficial in some load-balancing/multi-connection scenarios; `Dial` hides this somewhat. I agree with:

> We can figure out the way to use the gRPC resolver later; currently we don't have any problems we'd like it to solve for us.
Ref. https://github.com/grpc/grpc/blob/master/doc/naming.md
@cthulhu-rider commented on GitHub (May 14, 2025):
just to sum up
i tend to think we're facing smth from (macOS, DNS, CGO) space, found several related issues, both old and pretty new ones. Many encounter the same problem - delays. I tried some of the suggested workarounds but none of them worked