Memory consumption during replication #942

Open
opened 2025-12-28 17:21:14 +00:00 by sami · 0 comments

Originally created by @alexvanin on GitHub (Dec 26, 2022).

I see some unexpected memory consumption during the replication process. This is not reproducible on dev-env but can be seen in other environments: hardware or virtual machines.

If a node considers that an object does not have enough [replicas](https://github.com/nspcc-dev/neofs-node/blob/f4c3e40f47bf4eb9c2e6831c28e0056b5212babc/pkg/services/policer/check.go#L241), it [reads](https://github.com/nspcc-dev/neofs-node/blob/f4c3e40f47bf4eb9c2e6831c28e0056b5212babc/pkg/services/replicator/process.go#L30) the object into memory to put it to the other container nodes.

I reduced `replicator.pool_size` down to `1` and set a more aggressive GC setting (`GOGC=20`). However, after some time I see that some objects stay in memory longer than I expect.

```
heap profile: 63: 803864784 [36195: 73107425384] @ heap/1048576
6: 402702336 [122: 8188280832] @ 0x4d8e0b 0xc4ca2e 0xc520b5 0xc899b8 0xc89e9c 0xc897ae 0xc9a574 0xca48a8 0xc99e65 0xc99ad6 0xc939ef 0xc99a39 0xc9aa3f 0xd0d4c6 0xe12576 0xe112fa 0xe138bb 0xbcc0f7 0x46b921
#	0x4d8e0a	os.ReadFile+0xea												os/file.go:693
#	0xc4ca2d	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor/fstree.(*FSTree).Get+0xad			github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor/fstree/fstree.go:304
#	0xc520b4	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor.(*BlobStor).Get+0x2f4				github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor/get.go:20
#	0xc899b7	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).Get.func1+0xb7				github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard/get.go:73
#	0xc89e9b	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).fetchObjectData+0x41b			github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard/get.go:127
#	0xc897ad	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).Get+0x22d				github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard/get.go:86
#	0xc9a573	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).get.func1+0x113		github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:84
#	0xca48a7	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).iterateOverSortedShards+0xc7	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/shards.go:225
#	0xc99e64	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).get+0x324			github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:78
#	0xc99ad5	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).Get.func1+0x55			github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:48
#	0xc939ee	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).execIfNotBlocked+0xce		github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/control.go:147
#	0xc99a38	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).Get+0xb8			github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:47
#	0xc9aa3e	github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.Get+0x9e					github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:172
#	0xd0d4c5	github.com/nspcc-dev/neofs-node/pkg/services/replicator.(*Replicator).HandleTask+0x105				github.com/nspcc-dev/neofs-node/pkg/services/replicator/process.go:30
#	0xe12575	github.com/nspcc-dev/neofs-node/pkg/services/policer.(*Policer).processNodes+0xfd5				github.com/nspcc-dev/neofs-node/pkg/services/policer/check.go:241
#	0xe112f9	github.com/nspcc-dev/neofs-node/pkg/services/policer.(*Policer).processObject+0xc19				github.com/nspcc-dev/neofs-node/pkg/services/policer/check.go:127
#	0xe138ba	github.com/nspcc-dev/neofs-node/pkg/services/policer.(*Policer).shardPolicyWorker.func1+0x17a			github.com/nspcc-dev/neofs-node/pkg/services/policer/process.go:65
#	0xbcc0f6	github.com/panjf2000/ants/v2.(*goWorker).run.func1+0x96								github.com/panjf2000/ants/v2@v2.4.0/worker.go:68
```

In this memory profile I see 6 objects of 64 MiB (the maximum object size in the network) held in memory. This number changes but tends to grow over time (later in this run I saw 10 objects). Maybe we can do something about that.

Maybe it is just a GC thing (need to try `GOMEMLIMIT` with a go1.19 build), maybe something else.

To reduce the number of object reads, the node might want to skip replication when it receives `2048 access denied` errors (related to https://github.com/nspcc-dev/neofs-node/issues/1709). This can happen when eACL restricts system operations.

Any other ideas are appreciated.

/cc @fyrchik
