MacOS tests failing #1388

Closed
opened 2025-12-28 17:22:49 +00:00 by sami · 12 comments

Originally created by @roman-khimov on GitHub (Mar 25, 2025).

Originally assigned to: @cthulhu-rider on GitHub.

Expected Behavior

Green

Current Behavior

Red. MacOS tests fail after #3234.

Possible Solution

Unknown.

Steps to Reproduce (for bugs)

Run tests on MacOS via GH. Local runs are confirmed to be OK by @evgeniiz321.

Regression

Yes.


@cthulhu-rider commented on GitHub (Mar 25, 2025):

Debug logs to review: https://github.com/nspcc-dev/neofs-testcases/pull/994. A quick SN1 `grep` didn't reveal any problems; I'll come back to them later.

I've already played with timeouts a bit:

  • increasing the `neofs-cli object delete` timeout 15s -> 60s "fixed" some NeoFS suites using this command
  • increasing the `policer.head_timeout` default 5s -> 15s "fixed" the storage policy suite, which failed due to a redundant copy

I also suggest comparing steps' durations against the timeouts in runs where the tests PASSed; maybe they also ended on the edge (I doubt it).
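For reference, the second bump maps to a storage-node configuration knob. A minimal sketch of such an override, assuming a YAML config with a `policer` section (key name and values are taken from the comment above; the surrounding file layout is hypothetical):

```yaml
# Hypothetical storage-node config excerpt: raising the policer's
# HEAD timeout from its 5s default to the 15s that "fixed" the suite.
policer:
  head_timeout: 15s
```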


@cthulhu-rider commented on GitHub (May 6, 2025):

I noticed that the Put object to random node stage sometimes takes ~10s, but not always. Measurement of object content validation (https://github.com/nspcc-dev/neofs-node/blob/659f2335e3064b25da04c7a5846725a39e61b609/pkg/core/object/fmt.go#L247) showed:

validate content A13SMNg1aE1H4ie2z7Ehu1Fd2UqngYkqqVdkM16wsJRP LINK 10.021997083s

(https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3182-1746451558/index.html#suites/c88214427ec6d7cddd7d0b6aa757ad15/220fbc29d21ab6ed/)

Something is definitely wrong with this check; there is nothing to do there for 10s. Will see.

P.S. This still doesn't explain why DELETE fails due to a timeout when the LINK object doesn't appear, but maybe the reason is close.


@roman-khimov commented on GitHub (May 6, 2025):

We know MacOS is slow with tzhash, but this wasn't an issue previously and the amount of data to hash in tests is rather low. Still, maybe it has some influence here.


@cthulhu-rider commented on GitHub (May 13, 2025):

At the moment the problem is localized in the gRPC layer, which explains the test regression. It took me quite a few runs, because sometimes the server's request handlers also took 10s+ to execute; however, this only concerned requests in the p2p case with forwarding.

Enabling logs with ENVs (https://github.com/grpc/grpc-go#how-to-turn-on-logging) didn't help me at all. There is also GRPC_TRACE (https://github.com/grpc/grpc/blob/master/doc/environment_variables.md), but I have not tried it yet (the envs are tested in https://github.com/nspcc-dev/neofs-testcases/pull/994).

So, to get the needed debug info, I've made a fork branch of the gRPC lib (https://github.com/cthulhu-rider/grpc-go/tree/v1.70.0/timeouts) and pulled it into the SN one. With this, I found out where the seconds are spent: DNS. To be more precise, in this part of the whole request processing: https://github.com/grpc/grpc-go/blob/98a0092952dd4d8443229c3a335ec592d9c40c9b/clientconn.go#L676-L690

on macOS:

waitForResolvedAddrs ctx.Done localhost:49323 dns:///localhost:49323 context deadline exceeded
waitForResolvedAddrs took localhost:49323 dns:///localhost:49323 /neo.fs.v2.object.ObjectService/Head 4.993979292s
...
waitForResolvedAddrs cc.firstResolveEvent.Done localhost:49327 dns:///localhost:49327
waitForResolvedAddrs took localhost:49327 dns:///localhost:49327 /neo.fs.v2.object.ObjectService/Replicate 10.004733667s

As we can see, `localhost:49327` took ~10s, which explains the delay in request execution.

compared to Ubuntu:

waitForResolvedAddrs cc.firstResolveEvent.Done localhost:46761 dns:///localhost:46761
waitForResolvedAddrs took localhost:46761 dns:///localhost:46761 /neo.fs.v2.object.ObjectService/Replicate 4.071665ms
...
waitForResolvedAddrs cc.firstResolveEvent.HasFired localhost:46761 dns:///localhost:46761
waitForResolvedAddrs took localhost:46761 dns:///localhost:46761 /neo.fs.v2.object.ObjectService/Replicate 2.465µs

To sum up, for now I tend to think that the transition from DialContext (https://pkg.go.dev/google.golang.org/grpc#DialContext) to NewClient (https://pkg.go.dev/google.golang.org/grpc#NewClient) made in 979811e90f changed address processing somehow:

One subtle difference between NewClient and Dial and DialContext is that the former uses "dns" as the default name resolver, while the latter use "passthrough" for backward compatibility.

TODO:

  1. try the previously used resolver
  2. figure out why the Mac resolver takes 10s

@cthulhu-rider commented on GitHub (May 13, 2025):

Gotcha! Indeed, with the return to the passthrough scheme (https://pkg.go.dev/google.golang.org/grpc@v1.72.0/resolver/passthrough) all tests passed: https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3224-1747135106/index.html

7fdabd6a277bcc4ea07bd65d74bcf51dd8082b50 heals the tests. Taking into account that:

  • the dns resolver is strongly proposed (https://github.com/grpc/grpc-go/pull/7010#pullrequestreview-1907590080) by the gRPC authors
  • Linux is OK

I don't want to just apply this change as-is. Instead, I will try to figure out what causes the freezing on Mac.


@roman-khimov commented on GitHub (May 13, 2025):

  1. In general, a resolver can add some delays and we should be prepared for that.
  2. In this particular case localhost should be resolved instantly; I'm not sure why Mac has any problems with that, don't they have /etc/resolv.conf with it as well?
  3. Can it be related to IPv6 somehow?

@cthulhu-rider commented on GitHub (May 13, 2025):

Maybe https://github.com/golang/go/issues/52839


@cthulhu-rider commented on GitHub (May 13, 2025):

mac:

dnsResolver.LookupTXT took localhost:49342 10.011094416s lookup _grpc_config.localhost on 192.168.64.1:53: read udp 192.168.64.22:55327->192.168.64.1:53: i/o timeout

ubuntu:

dnsResolver.LookupTXT took localhost:41621 4.604967ms lookup _grpc_config.localhost on 127.0.0.53:53: no such host

@cthulhu-rider commented on GitHub (May 14, 2025):

Without the DNS resolver (behavior before 979811e90f):

mac:

waitForResolvedAddrs took localhost:49309 passthrough:///localhost:49309 /neo.fs.v2.object.ObjectService/Head 3.209µs

linux:

waitForResolvedAddrs took localhost:48615 passthrough:///localhost:48615 /neo.fs.v2.object.ObjectService/Head 8.276µs

The DNS resolver adds sensible costs in the test env, even critical ones for Mac. All other envs were perfectly fine without it. It should also be noted that its usage was unintentional and went unnoticed due to the use of Linux (quite unexpectedly, I'd say).

So, I propose to switch back to using the old resolver, because there were and are no known problems with it. It's done in one line of code. At the same time, a transition to the DNS resolver can make sense and be left as a separate issue.

@roman-khimov


@roman-khimov commented on GitHub (May 14, 2025):

Not sure I understand the difference. We're supposed to be compatible with DNS (non-IP) names anyway, and we were compatible: we already have /dns4/st4.storage.fs.neo.org/tcp/8080 and alike in the mainnet network map. Let me check.


@roman-khimov commented on GitHub (May 14, 2025):

IIUC this all comes from the fact that the gRPC library allows overriding resolver plugins at its level (rather than Dial, which is somewhat more canonical for Go). Then DNS is the default, and then it works badly in some cases. I still don't understand why that's the case, since a regular Dial would handle the same address fine, and it'll work fine with proper DNS names as well. Most likely, resolving at the gRPC level is beneficial in some load-balancing/multi-connection scenarios; Dial hides this somewhat. I agree with

so, i propose to switch back to using the old resolver because there were no and are no known problems with it. Done in one line of code. At the same time, transition to DNS resolver can make sense and be left as an issue

We can figure out a way to use the gRPC resolver later; currently we don't have any problems we'd like it to solve for us.

Ref. https://github.com/grpc/grpc/blob/master/doc/naming.md


@cthulhu-rider commented on GitHub (May 14, 2025):

Just to sum up:

I tend to think we're facing something from the (macOS, DNS, CGO) space. I found several related issues, both old and pretty new ones; many encounter the same problem: delays. I tried some of the suggested workarounds, but none of them worked.

Reference
nspcc-dev/neofs-node#1388