Reuse object/data memory for replication workers #1012

Open
opened 2025-12-28 17:21:29 +00:00 by sami · 1 comment

Originally created by @roman-khimov on GitHub (Apr 24, 2023).

Is your feature request related to a problem? Please describe.

I'm always frustrated when I'm looking at the replication worker code. It takes an object from the storage, decompresses it, unmarshals it and then pushes the result to some other node. It allocates for the raw data, it allocates for the decompressed data, it allocates for the object and all of its fields. For big objects this means a heck of a lot of allocations.

Describe the solution you'd like

Have one per-replicator data buffer for raw data, one for decompressed data, one object. Reuse them. Likely this is not supported by our APIs at the moment, but this can probably be changed.
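A minimal sketch of the per-replicator reuse idea. The `replicator` type, `grow` helper and `replicate` method are hypothetical names for illustration only, not the actual neofs-node API:

```go
package main

import "fmt"

// replicator keeps its buffers across tasks instead of allocating per object.
type replicator struct {
	raw []byte // reused buffer for compressed data read from storage
	dec []byte // reused buffer for decompressed data
}

// grow returns buf resized to n, reusing the backing array when capacity allows.
func grow(buf []byte, n int) []byte {
	if cap(buf) < n {
		return make([]byte, n)
	}
	return buf[:n]
}

// replicate simulates one task: it sizes the reused buffers for the object
// instead of allocating fresh ones each time.
func (r *replicator) replicate(rawSize, decSize int) {
	r.raw = grow(r.raw, rawSize)
	r.dec = grow(r.dec, decSize)
	// ... read into r.raw, decompress into r.dec, push to the target node ...
}

func main() {
	var r replicator
	r.replicate(1024, 4096)
	p := &r.raw[0]
	r.replicate(512, 2048)          // smaller object: buffers are reused
	fmt.Println(&r.raw[0] == p)     // same backing array, no new allocation
}
```

The current APIs would need to accept caller-provided buffers (or decode into a preallocated object) for this to work end to end.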

Additional context

#2300/#2178


@cthulhu-rider commented on GitHub (Feb 21, 2024):

I suggest starting with a size-segmented sync pool as a simpler approach.

In the current situation, where each worker handles a single object at a time, keeping a static per-worker buffer resident could lead to poor memory utilization given the wide dispersion of object sizes (up to 64M). It is worth thinking about an adaptive size for this buffer. For example:

  • start with a relatively small size S (e.g. 256K)
  • as larger objects are processed, allocate buffers dynamically and count each MISS (by quantity or by excess over the average)
  • when MISS exceeds a specified limit, grow the buffer (e.g. double it or add the average)
  • shrink it the same way in reverse

For this approach to be effective, a more adaptive work queue may be needed. To me, the complexity of this approach could outweigh its efficiency.
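The size-segmented pool suggested above could be sketched as follows. The `segmented` type and its methods are hypothetical; the class layout (power-of-two buckets up to 64M) is an assumption for illustration:

```go
package main

import (
	"fmt"
	"math/bits"
	"sync"
)

// segmented pools byte buffers by power-of-two size class, so one 64M
// object does not pin a 64M buffer to every worker.
type segmented struct {
	pools [27]sync.Pool // classes 1B .. 64M (1<<26)
}

// class returns the smallest c such that 1<<c >= n.
func class(n int) int {
	if n <= 1 {
		return 0
	}
	return bits.Len(uint(n - 1))
}

// Get returns a buffer of length n from the matching size class,
// allocating one with the class capacity on a pool miss.
func (s *segmented) Get(n int) []byte {
	c := class(n)
	if b, ok := s.pools[c].Get().([]byte); ok {
		return b[:n]
	}
	return make([]byte, n, 1<<c)
}

// Put returns a buffer to the pool of its capacity class for reuse.
func (s *segmented) Put(b []byte) {
	s.pools[class(cap(b))].Put(b[:cap(b)])
}

func main() {
	var s segmented
	b := s.Get(300 * 1024) // 300K request lands in the 512K class
	fmt.Println(cap(b))
	s.Put(b)
}
```

Unlike a static per-worker buffer, sync.Pool lets the runtime reclaim idle buffers under memory pressure, which sidesteps most of the grow/shrink tuning described above.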

An alternative approach could be a replication batch: all Policer workers pack to-be-replicated objects into a single limited buffer which is flushed once full or by a timer. IMO managing the batch size would be simpler and more efficient than the current tuning of worker pool capacity, which relies only on incoming traffic and not outgoing. Per-node batches would also allow background replication to be correlated with external traffic (PUT) and would simplify the prioritization model.

Reference
nspcc-dev/neofs-node#1012