mirror of
https://github.com/nspcc-dev/neofs-node.git
synced 2026-03-01 04:29:10 +00:00
Evaluate tree service performance #1221
Reference
nspcc-dev/neofs-node#1221
Originally created by @roman-khimov on GitHub (May 5, 2024).
Originally assigned to: @carpawell on GitHub.
Is your feature request related to a problem? Please describe.
I'm always frustrated when we don't know what to expect from some critical components of our system. Tree service is like that. We kinda know about #1734, but we don't know exact numbers.
Describe the solution you'd like
Create an environment for tree service test (can be separated from the node (or not)). Measure single node ops/s (typical AddByPath) and delays. Add more nodes (up to 4 of them), push ops via a single node, see how they spread and what's the throughput and latency. Repeat on some real hw.
@carpawell commented on GitHub (May 29, 2024):
Test description
Tree service was used as a separate unit (no sharding, no other background processes), and network communications (fetching networkmap, containers, etc) were mocked; as a loader, I have added some k6 scripts and extensions:
https://github.com/nspcc-dev/neofs-node/tree/feat/tree-cmd
https://github.com/nspcc-dev/xk6-neofs/tree/feat/tree-loader
A `REP 4` "container" was mocked, and 4 distributed tree instances on 4 bare-metal machines with SSDs were used. This tree service config was used (`sync_interval` was either 5 mins (2-3 times per test) or turned off, see below).

Every run had a "target" rate, meaning k6 tried to perform exactly that number of operations every second. The network deadline was 5s, so the max number of VUs (virtual users, a k6 term) was the "target" RPS multiplied by 5; at any moment there could be from 0 to N*5 requests in progress, where N is the "target" RPS.
Two types of load were applied: the first was a "system" load (`ADD` tree operation, 10 RPS in every test) that wrote to a "system" tree (multipart upload, lock operation, etc. for S3 GW), and the second was a "user" load (`ADD_BY_PATH` tree operation, a variable number of operations in every test) that wrote to a "version" tree (regular object PUT for S3 GW). Every tree node had 7 meta fields and a unique path.

Also, there were two types of tests in terms of tree synchronization: background sync does so many bad things to the results that I decided to turn it off, although in general that is not possible in real cases, since it is the mechanism that restores missing operations in local logs after any normal or unexpected downtime (https://github.com/nspcc-dev/neofs-node/pull/2161, https://github.com/nspcc-dev/neofs-node/pull/2165). Every "sync on" load ran for 15 mins, every "sync off" for 30 mins.

Results
1000 RPS, NO sync, ~~30~~ 15 min load
Avg: 13.70ms
Errs: NO
summary1000_NOsync.html.pdf

1000 RPS, 5 min sync, 15 min load
Avg: ~~13.39~~ 13.43ms
Errs: NO
summary1000_5minsync.html.pdf

~~5000~~ 2500 RPS, NO sync, ~~30~~ 15 min load
Avg: ~~2094.34~~ 19.47ms
Errs: ~~YES, <1%~~ NO
summary2500_NOsync.html.pdf

~~5000~~ 2500 RPS, 5 min sync, 15 min load
Avg: ~~1202.18~~ 19.79ms
Errs: ~~YES, <1%~~ NO
summary2500_5minsync.html.pdf

~~7500~~ 5000 RPS, NO sync, ~~30~~ 15 min load
Avg: ~~4550.52~~ 1081.33ms
Errs: ~~YES, 84%~~ NO
summary5000_NOsync.html.pdf

~~7500~~ 5000 RPS, 5 min sync, 15 min load
Avg: ~~4082.70~~ 1193.95ms
Errs: ~~YES, 67%~~ NO
summary5000_5minsync.html.pdf
@carpawell commented on GitHub (May 29, 2024):
Oh, the 2.2.0 version of the reporter is broken. The numbers are correct but were placed in the wrong fields... Will rerun.
UPD: done.
@carpawell commented on GitHub (May 30, 2024):
I have done 1h tests.
1000 RPS, NO sync, 1h load
Avg: 13.93ms
Errs: NO
summary1000_nosync_1h.html.pdf
2500 RPS, NO sync, 1h load
Avg: 85.35ms
Errs: NO
summary2500_nosync_1h.html.pdf
NOTE: the first 55m were OK and close to the 1000 RPS results, but in the last 5 min I saw degradation, so I would say that is not a stable load (see p95).
2500 RPS, 15m sync, 1h load
Avg: 536.93ms
Errs: NO
summary2500_15msync_1h.html.pdf
@carpawell commented on GitHub (May 30, 2024):
@roman-khimov, saw
is it critical? I put requests through different nodes.
@roman-khimov commented on GitHub (May 30, 2024):
Try it with a single one. Multinode is interesting as well, but it's a more complex scenario.
@carpawell commented on GitHub (May 30, 2024):
That is a little bit more problematic:
1000 RPS, NO sync, 15m load, 1 node load
Avg: 4776.86ms
Errs: 91%
summary1000_nosync_100_clients.html.pdf
@roman-khimov commented on GitHub (Jun 3, 2024):
Seems like a dead end to me, nothing we can reuse from it.