mirror of
https://github.com/nspcc-dev/neofs-node.git
synced 2026-03-01 04:29:10 +00:00
Support erasure codes in object service #186
Originally created by @alexvanin on GitHub (May 17, 2021).
@alexvanin commented on GitHub (Feb 11, 2022):
Erasure codes can be implemented on containers with a REP 1 policy. One replica doesn't make sense in terms of the netmap placement algorithm, so it can signal to the node or the client that the objects in this container are split with an erasure encoding scheme. Details of that scheme may be stored in container attributes.

The uploading/downloading scheme will be different. During payload split, we create new objects with the actual payload and parity data. Those objects may be linked the same way as they are linked now, with child links and a zero-object. All these objects are stored in one copy, as REP 1 describes by object placement rules.

@roman-khimov commented on GitHub (Feb 7, 2024):
Doing it per regular object implies splitting into many smaller parts plus parity, which can be done, but then there are questions:
@roman-khimov commented on GitHub (Feb 20, 2024):
Can be inefficient for small (like 1K) objects, also.
@cthulhu-rider commented on GitHub (May 21, 2025):
there is a variety of storage schemes which provide different ways of encoding, exchanging and storing information about user data to achieve the required levels of reliability at different costs. Let's define this variety as StorageScheme (SS). Additionally, each SS may provide parameters that affect the encoding and/or the exchange/storage of user data.

Up until now we have only one (default) SS: Replication. Each user object is stored as is, but in multiple copies (REP N). In particular, single-copy replication involves a trivial singular scheme that is not explicitly distinguished.

Erasure coding seems to be a separate SS, since it fundamentally differs from replication. Obviously, we must keep the previous scheme untouched. Additional schemes will give the user the freedom of choice to solve storage problems more efficiently.
Activation
Container is the best mechanism of grouping objects by common properties in NeoFS. Replication is configured per container and applies to all its objects. EC being one of the schemes, the most logical thing would be to extend this approach to it.

We may define an enum of storage schemes, specified through a container attribute. Let it be StorageScheme. Being the default scheme, Replication corresponds to the absent attribute; other algorithms are set in the value. For Reed-Solomon codes (d, p), with d data and p parity blocks correspondingly, the attribute is "StorageScheme": "RS(d,p)". Per-value attributes (3 for each of RS, d and p) are also an option, but take more space.

With this, the storage system will split and store data without going into the details of the rationale behind such a scheme. User wants, user gets. Full control over data life. At the same time, in practice, the selection of parameters will most likely cause difficulties: it is necessary to take into account the cluster topology (which also changes while the container is immutable), the volume/value of the data, and the expectations for the space consumed relative to the desired fault tolerance. As a compromise, standard options covering practical use cases will need to be provided.
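The "RS(d,p)" attribute format above is simple enough to sketch. Below is a minimal Go parser for such a value; the function name and error handling are illustrative, not from the codebase:

```go
package main

import (
	"fmt"
)

// parseStorageScheme parses a hypothetical "StorageScheme" container
// attribute value. An absent/empty value means plain Replication;
// "RS(d,p)" means Reed-Solomon with d data and p parity blocks.
func parseStorageScheme(v string) (d, p int, err error) {
	if v == "" {
		return 0, 0, nil // default scheme: Replication
	}
	if _, err = fmt.Sscanf(v, "RS(%d,%d)", &d, &p); err != nil {
		return 0, 0, fmt.Errorf("unsupported storage scheme %q: %w", v, err)
	}
	if d <= 0 || p <= 0 {
		return 0, 0, fmt.Errorf("invalid RS parameters d=%d, p=%d", d, p)
	}
	return d, p, nil
}

func main() {
	d, p, err := parseStorageScheme("RS(4,2)")
	fmt.Println(d, p, err) // 4 2 <nil>
}
```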
Encoding
Being stored in containers with "StorageScheme": "RS(d,p)", original user data is split (*) into data and parity blocks (aka shards) of the same size.

(*) NeoFS has no architectural limit on the user object's size. Since the entire block of coded data is required for EC, this operation will be applied to slices of the original data. This means that, in general, the original data is first split according to the current scheme, and only then each slice is subjected to EC and written to the storage.
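As an illustration of the split step, here is a dependency-free Go sketch for the degenerate p=1 case, where the single parity shard is the XOR of the d data shards. A real implementation would use a proper Reed-Solomon encoder (e.g. a library such as klauspost/reedsolomon); all names here are hypothetical:

```go
package main

import "fmt"

// splitWithParity splits payload into d equal-size data shards (the
// last one zero-padded) and appends a single XOR parity shard. This is
// the trivial p=1 case standing in for a full RS(d,p) encoder.
func splitWithParity(payload []byte, d int) [][]byte {
	shardLen := (len(payload) + d - 1) / d
	shards := make([][]byte, d+1)
	for i := 0; i < d; i++ {
		shards[i] = make([]byte, shardLen) // implicit zero padding
		start, end := i*shardLen, (i+1)*shardLen
		if start > len(payload) {
			start = len(payload)
		}
		if end > len(payload) {
			end = len(payload)
		}
		copy(shards[i], payload[start:end])
	}
	parity := make([]byte, shardLen)
	for i := 0; i < d; i++ {
		for j := range parity {
			parity[j] ^= shards[i][j]
		}
	}
	shards[d] = parity
	return shards
}

func main() {
	shards := splitWithParity([]byte("hello world!"), 4)
	fmt.Println(len(shards), len(shards[0])) // 5 3
}
```

Note that all shards come out the same size, matching the requirement in the comment above.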
For each received code block, an object (*) is formed whose payload is the code. These objects are placed according to a REP 1 policy, i.e. in 1 copy with HRW-by-ID distribution. This will allow the blocks to be distributed across nodes as much as possible and, in the limit, obtain equivalence between node failure and code block loss.

Storage

To a first approximation, code blocks are stored as regular objects. Each physically stored object in an EC container is a block of some code.
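The HRW-by-ID distribution mentioned above can be sketched with rendezvous hashing. This is an illustrative version, not the exact hash function or weighting NeoFS uses:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hrwOrder sorts candidate nodes for a block by Highest-Random-Weight
// (rendezvous) hashing: each node gets weight hash(nodeID || blockID),
// and nodes are ordered by descending weight. The first node stores the
// single copy that REP 1 prescribes; the rest form the fallback order.
func hrwOrder(nodes []string, blockID string) []string {
	weight := func(node string) uint64 {
		h := fnv.New64a()
		h.Write([]byte(node))
		h.Write([]byte(blockID))
		return h.Sum64()
	}
	ordered := append([]string(nil), nodes...)
	sort.Slice(ordered, func(i, j int) bool {
		return weight(ordered[i]) > weight(ordered[j])
	})
	return ordered
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	// Different block IDs tend to pick different primary nodes,
	// which is what spreads EC blocks across the netmap.
	fmt.Println(hrwOrder(nodes, "block-0")[0], hrwOrder(nodes, "block-1")[0])
}
```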
Recovery (active)
To recover data using the RS algorithm, we need to know the following:

Storing blocks as objects, item 2 is supported automatically. For item 1, we may persist the following metadata:
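For intuition on what recovery looks like, here is the decode counterpart of the XOR-parity (p=1) simplification: any one lost shard is the XOR of the survivors. A real RS(d,p) decoder generalizes this to up to p lost shards. The sketch and its shard layout are hypothetical:

```go
package main

import "fmt"

// recoverShard reconstructs one missing shard from the surviving ones
// under the trivial XOR-parity scheme (p=1 case of RS(d,p)): the XOR
// of all d+1 shards is zero, so the lost shard equals the XOR of the
// rest. The content of shards[lost] is ignored.
func recoverShard(shards [][]byte, lost int) []byte {
	var out []byte
	for i, s := range shards {
		if i == lost {
			continue // the missing shard
		}
		if out == nil {
			out = make([]byte, len(s))
		}
		for j := range s {
			out[j] ^= s[j]
		}
	}
	return out
}

func main() {
	// Two data shards "abc", "def" and their XOR parity shard.
	parity := []byte{'a' ^ 'd', 'b' ^ 'e', 'c' ^ 'f'}
	shards := [][]byte{[]byte("abc"), []byte("def"), parity}
	fmt.Printf("%s\n", recoverShard(shards, 0)) // reconstructs "abc"
}
```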
Recovery (background)
Policer checks whether individual objects are stored according to the container policy or not. For Replication containers, error correction is achieved by rewriting the replica of the object. At the same time, for REP 1 containers recovery is either not needed or impossible.

For EC containers, error detection comes down to detecting missing blocks that need to be restored. However, if the blocks are stored in one instance of an object, then to restore the missing blocks it is necessary to create a new object that is the restored code block. The object will be created by some SN. Such behavior will entail a number of related nuances.
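The detection step for an EC container reduces to comparing the set of block indices found in the network against the expected 0..n-1 range; a minimal sketch (names are illustrative, not the Policer's actual API):

```go
package main

import "fmt"

// missingBlocks returns the EC block indices that were not found and
// therefore must be restored, given the set of present indices and
// n = d + p expected blocks per encoded slice.
func missingBlocks(present map[int]bool, n int) []int {
	var missing []int
	for i := 0; i < n; i++ {
		if !present[i] {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	// RS(4,2): n = 6 blocks; blocks 2 and 5 were not found.
	found := map[int]bool{0: true, 1: true, 3: true, 4: true}
	fmt.Println(missingBlocks(found, 6)) // [2 5]
}
```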
@cthulhu-rider commented on GitHub (May 21, 2025):
to clarify
only the user payload (or its slice) is encoded into RS shards. Each shard then becomes the payload of a shard object. The header of the user object (or split-chain member) is embedded into the shard header, similar to the original header in a split-chain member's header. This ensures that the spawned EC shard is directly linked to the original object without requiring additional lookups, and that the header is preserved as long as the payload is.

P.S. similar to the split (V2) scheme, in which split-object processing context switches to the parent object regarding access etc., processing of a block object is going to switch to the originating object. This is ensured by having the header at hand.
@cthulhu-rider commented on GitHub (May 21, 2025):
to clarify
having EC blocks 0, ..., n-1 (n = d+p), they are placed on n nodes preserving the order.

Since all objects in an EC container are encoded the same way, the need to collect EC blocks to GET(OID) is determined purely by the container (CID).

Looking ahead, if the container policy allows both EC and replication (which may be needed as an enhancement in the future), the need for EC assembly will be determined from the response, similar to determining the splitting of an object with the raw flag.

@roman-khimov commented on GitHub (Nov 13, 2025):
Done here. Optimizations and other questions are separate issues.