Evaluate tree service performance #1221

Closed
opened 2025-12-28 17:22:13 +00:00 by sami · 7 comments

Originally created by @roman-khimov on GitHub (May 5, 2024).

Originally assigned to: @carpawell on GitHub.

Is your feature request related to a problem? Please describe.

I'm always frustrated when we don't know what to expect from some critical components of our system. Tree service is like that. We kinda know about #1734, but we don't know the exact numbers.

Describe the solution you'd like

Create an environment for tree service tests (it can be separated from the node or not). Measure single-node ops/s (typical AddByPath) and delays. Add more nodes (up to 4 of them), push ops via a single node, see how they spread and what the throughput and latency are. Repeat on some real hardware.

@carpawell commented on GitHub (May 29, 2024):

Test description

Tree service was used as a separate unit (no sharding, no other background processes), and network communications (fetching the network map, containers, etc.) were mocked; as a loader, I added some k6 scripts and extensions:
https://github.com/nspcc-dev/neofs-node/tree/feat/tree-cmd
https://github.com/nspcc-dev/xk6-neofs/tree/feat/tree-loader

A REP 4 "container" was mocked, and 4 distributed tree instances on 4 bare-metal machines with SSDs were used.
This tree service config was used (sync_interval was either 5 min (2-3 times per test) or turned off, see below):

```yml
tree:
  enabled: true
  cache_size: 15
  replication_worker_count: 1000
  replication_channel_capacity: 1000
  replication_timeout: 5s
  sync_interval: 5m # depends on tests

pilorama:
  max_batch_delay: 5ms
  max_batch_size: 100
  path: /tmp/tree.test
```

Every run had a "target" rate, meaning k6 tried to perform exactly that number of operations every second. The network deadline was 5s, so the maximum number of VUs (virtual users, a k6 term) was the "target" RPS multiplied by 5; at any moment there could be from 0 to N*5 requests in progress, where N is the "target" RPS.
Two types of load were applied: the first was a "system" load (ADD tree operation, 10 RPS in every test) that wrote to a "system" tree (multipart upload, lock operation, etc. for S3 GW), and the second was a "user" load (ADD_BY_PATH tree operation, a variable number of operations in every test) that wrote to a "version" tree (regular object PUT for S3 GW). Every tree node had 7 meta fields and a unique path.
Also, there were two types of tests in terms of tree synchronization: background sync does so many bad things to the results that I decided to turn it off, although in general that is not possible in real cases, since this is the mechanism that restores missing operations in local logs after any normal or unexpected downtime (https://github.com/nspcc-dev/neofs-node/pull/2161, https://github.com/nspcc-dev/neofs-node/pull/2165). Every load was run for 15 mins.
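The load shape described above could be declared with k6's constant-arrival-rate executor roughly as follows. This is a hedged sketch, not the actual script from the feat/tree-loader branch: the scenario names, durations, and executor settings are my assumptions based on the description (a fixed 10 RPS "system" load, a variable "user" target rate, and a 5 s network deadline that bounds the number of in-flight requests to N*5).

```javascript
// Illustrative k6 options object for the two load profiles described above.
// Scenario names and exec targets are hypothetical.
const TARGET_RPS = 1000; // "user" load rate, varied per test (1000/2500/5000)
const DEADLINE_S = 5;    // network deadline; caps requests in flight

const options = {
  scenarios: {
    // "system" load: ADD ops into the "system" tree at a fixed 10 RPS
    system_add: {
      executor: 'constant-arrival-rate',
      rate: 10,
      timeUnit: '1s',
      duration: '15m',
      preAllocatedVUs: 10 * DEADLINE_S,
      maxVUs: 10 * DEADLINE_S,
    },
    // "user" load: ADD_BY_PATH ops into the "version" tree at the target rate
    user_add_by_path: {
      executor: 'constant-arrival-rate',
      rate: TARGET_RPS,
      timeUnit: '1s',
      duration: '15m',
      // with a 5 s deadline, at most rate * 5 requests can be in flight,
      // so that many VUs suffice to sustain the target arrival rate
      preAllocatedVUs: TARGET_RPS * DEADLINE_S,
      maxVUs: TARGET_RPS * DEADLINE_S,
    },
  },
};

console.log(options.scenarios.user_add_by_path.maxVUs); // 5000
```

The constant-arrival-rate executor keeps the arrival rate fixed regardless of response latency, which is why slow responses show up as growing in-flight counts (up to N*5) rather than as a reduced request rate.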

Results

1000 RPS, NO sync, 15 min load

Avg: 13.70ms
Errs: NO

summary1000_NOsync.html.pdf

1000 RPS, 5 min sync, 15 min load

Avg: 13.43ms
Errs: NO

summary1000_5minsync.html.pdf

2500 RPS, NO sync, 15 min load

Avg: 19.47ms
Errs: NO

summary2500_NOsync.html.pdf

2500 RPS, 5 min sync, 15 min load

Avg: 19.79ms
Errs: NO

summary2500_5minsync.html.pdf

5000 RPS, NO sync, 15 min load

Avg: 1081.33ms
Errs: NO

summary5000_NOsync.html.pdf

5000 RPS, 5 min sync, 15 min load

Avg: 1193.95ms
Errs: NO

summary5000_5minsync.html.pdf

@carpawell commented on GitHub (May 29, 2024):

Oh, the 2.2.0 version of the reporter (https://github.com/benc-uk/k6-reporter) is broken. The numbers are correct but placed in the wrong fields... Will rerun.

UPD: done.

@carpawell commented on GitHub (May 30, 2024):

I have done 1h tests.

1000 RPS, NO sync, 1h load

Avg: 13.93ms
Errs: NO

summary1000_nosync_1h.html.pdf

2500 RPS, NO sync, 1h load

Avg: 85.35ms
Errs: NO

summary2500_nosync_1h.html.pdf

NOTE: the first 55 min were OK and close to the 1000 RPS results, but in the last 5 min I saw degradation, so I would not call that load stable (see p95).

2500 RPS, 15m sync, 1h load

Avg: 536.93ms
Errs: NO

summary2500_15msync_1h.html.pdf

@carpawell commented on GitHub (May 30, 2024):

@roman-khimov, saw

> push ops via a single node

is it critical? I put requests through different nodes.

@roman-khimov commented on GitHub (May 30, 2024):

Try it with a single one. Multinode is interesting as well, but it's a more complex scenario.

@carpawell commented on GitHub (May 30, 2024):

That is a little bit more problematic:

1000 RPS, NO sync, 15m load, 1 node load

Avg: 4776.86ms
Errs: 91%

summary1000_nosync_100_clients.html.pdf

@roman-khimov commented on GitHub (Jun 3, 2024):

Seems like a dead end to me, nothing we can reuse from it.

Reference
nspcc-dev/neofs-node#1221