Support erasure codes in object service #186

Closed
opened 2025-12-28 17:18:40 +00:00 by sami · 7 comments

Originally created by @alexvanin on GitHub (May 17, 2021).

sami 2025-12-28 17:18:40 +00:00
  • closed this issue
  • added the U4, epic, S1, I1 labels

@alexvanin commented on GitHub (Feb 11, 2022):

Erasure codes can be implemented in containers with a `REP 1` policy. A single replica does not make sense in terms of the netmap placement algorithm, so it can signal to nodes or clients that the objects in this container are split with an erasure-encoding scheme. Details of that scheme may be stored in container attributes.

The upload/download scheme will be different. During payload split, we create new objects with the actual payload and parity data. Those objects may be linked the same way as they are linked now, with child links and a zero-object. All these objects are stored in one copy, as `REP 1` prescribes by object placement rules.


@roman-khimov commented on GitHub (Feb 7, 2024):

Doing it per regular object implies splitting into many smaller parts plus parity, which can be done, but then there are questions:

  • one node can have many disks, and we distribute objects per node (can be mitigated by running multiple nodes per machine)
  • a perfect part-to-disk match cannot be achieved; multiple parts can end up on the same node/disk
  • a disk failure then means object loss, and something has to recreate the lost parts
  • expanding the cluster can't affect old objects, and it's not clear how new ones would be handled

@roman-khimov commented on GitHub (Feb 20, 2024):

It can also be inefficient for small (e.g. 1K) objects.


@cthulhu-rider commented on GitHub (May 21, 2025):

There is a variety of storage schemes which provide different ways of encoding, exchanging and storing information about user data to achieve the required levels of reliability at different costs. Let's define this variety as `StorageScheme` (SS). Additionally, each SS may provide parameters that affect the encoding and/or the exchange/storage of user data.

Up until now we have only one (default) SS: `Replication`. Each user object is stored as is, but in multiple copies (`REP N`). In particular, single-copy replication is a trivial degenerate scheme that is not explicitly distinguished.

Erasure coding seems to be a separate SS, since it fundamentally differs from replication. Obviously, we must keep the previous scheme untouched. Additional schemes will give the user the freedom to solve storage problems more efficiently.

Activation

A container is the best mechanism for grouping objects by common properties in NeoFS. Replication is configured per container and applies to all its objects. With replication being one of the schemes, the most logical thing would be to extend this approach to EC.

We may define an enum of storage schemes and specify it through a container attribute. Let it be `StorageScheme`. Being the default scheme, `Replication` corresponds to an absent attribute. Other algorithms are set in the value.

For Reed-Solomon codes `(d, p)`, with `d` data and `p` parity blocks correspondingly, the attribute is `"StorageScheme": "RS(d,p)"`. Per-value attributes (three for each: RS, `d` and `p`) are also an option, but take more data.
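Assuming the attribute value format sketched above, parsing it on the node side could look like the following Go sketch; `parseRSScheme` and the exact `"RS(d,p)"` syntax are illustrative, not an existing NeoFS API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRSScheme parses a hypothetical "RS(d,p)" StorageScheme attribute
// value into its data (d) and parity (p) block counts.
func parseRSScheme(v string) (d, p int, err error) {
	if !strings.HasPrefix(v, "RS(") || !strings.HasSuffix(v, ")") {
		return 0, 0, fmt.Errorf("not an RS scheme: %q", v)
	}
	parts := strings.Split(v[3:len(v)-1], ",")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("expected two parameters: %q", v)
	}
	if d, err = strconv.Atoi(strings.TrimSpace(parts[0])); err != nil {
		return 0, 0, err
	}
	if p, err = strconv.Atoi(strings.TrimSpace(parts[1])); err != nil {
		return 0, 0, err
	}
	return d, p, nil
}

func main() {
	d, p, err := parseRSScheme("RS(4,2)")
	fmt.Println(d, p, err) // 4 2 <nil>
}
```

An absent attribute would fall through to the default `Replication` behavior without touching this parser.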

With this, the storage system will split and store data without going into the rationale behind such a scheme: the user wants, the user gets, with full control over the data's life. At the same time, in practice, the selection of parameters will most likely cause difficulties. It is necessary to take into account the cluster topology (which also changes, while the container is immutable), the volume/value of the data, and the expectations for space consumed relative to the desired fault tolerance. As a compromise, we will need to provide standard options covering practical use cases.

Encoding

When stored in containers with `"StorageScheme": "RS(d,p)"`, original user data is split (*) into data and parity blocks (aka shards) of the same size.

(*) NeoFS has no architectural limit on the user object's size. Since the entire block of coded data is required for EC, this operation is applied to slices of the original data. This means that, in general, the original data is first split according to the current scheme, and only then is each slice subjected to EC and written to storage.

For each resulting code block, an object (*) is formed whose payload is that block. These objects are placed according to a `REP 1` policy, i.e. in one copy with HRW-by-ID distribution. This allows the blocks to be distributed across nodes as much as possible and, in the limit, makes node failure equivalent to code-block loss.
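The split-into-shards step can be sketched in Go. For illustration only, plain XOR parity stands in for real Reed-Solomon coding (XOR parity is the trivial `p = 1` case; a real `RS(d,p)` encoder would come from an erasure-coding library), and `splitWithParity` is a hypothetical helper, not NeoFS code:

```go
package main

import "fmt"

// splitWithParity splits data into d equal-size shards (zero-padded for
// alignment) plus one XOR parity shard. The original data size must be
// kept separately so padding can be undone on recovery.
func splitWithParity(data []byte, d int) [][]byte {
	shardLen := (len(data) + d - 1) / d // round up
	shards := make([][]byte, d+1)
	for i := 0; i < d; i++ {
		shards[i] = make([]byte, shardLen)
		start := i * shardLen
		if start > len(data) {
			start = len(data)
		}
		end := start + shardLen
		if end > len(data) {
			end = len(data)
		}
		copy(shards[i], data[start:end])
	}
	// Parity shard: byte-wise XOR of all data shards, so any single
	// lost shard can be rebuilt by XOR-ing the remaining ones.
	parity := make([]byte, shardLen)
	for i := 0; i < d; i++ {
		for j, b := range shards[i] {
			parity[j] ^= b
		}
	}
	shards[d] = parity
	return shards
}

func main() {
	shards := splitWithParity([]byte("hello world"), 4)
	fmt.Println(len(shards), len(shards[0])) // 5 3
}
```

Each of the `d+1` byte slices would then become the payload of one block object placed on its own node.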

Storage

To a first approximation, code blocks are stored as regular objects. Each physically stored object in an EC container is a block of some code.

Recovery (active)

To recover data using the RS algorithm, we need to know the following:

  1. the size of the original data (for alignment recovery)
  2. the index of each block in the code
  3. whether a particular block is corrupted or not

By storing blocks as objects, (2) is supported automatically. For (1), we may persist the following metadata:

  • size of the encoded data, i.e. the binary-encoded user root or split object, header+payload
  • number of data and parity blocks
  • block checksums
  • block order
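The metadata listed above could be grouped in a structure along these lines; `ECMeta` and all field names are hypothetical and only illustrate what would need to be persisted, not an actual NeoFS header:

```go
package main

import "fmt"

// ECMeta sketches the per-object metadata proposed for recovery.
type ECMeta struct {
	EncodedSize  uint64   // size of the binary-encoded original object (header+payload)
	DataBlocks   uint32   // d
	ParityBlocks uint32   // p
	Checksums    [][]byte // per-block checksums, in block order
}

func main() {
	m := ECMeta{
		EncodedSize:  4096,
		DataBlocks:   4,
		ParityBlocks: 2,
		Checksums:    make([][]byte, 6), // one checksum per block, d+p total
	}
	fmt.Println(m.DataBlocks+m.ParityBlocks == uint32(len(m.Checksums))) // true
}
```

Keeping `Checksums` ordered covers both the "block checksums" and "block order" items, and comparing a stored block against its checksum answers requirement (3).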

Recovery (background)

The Policer checks whether individual objects are stored according to the container policy. For `Replication` containers, error correction is achieved by rewriting a replica of the object. At the same time, for `REP 1` containers recovery is either not needed or impossible.

For EC containers, error detection comes down to detecting missing blocks that need to be restored. However, since the blocks are stored in a single object instance, restoring a missing block means creating a new object that is the restored code block. The object will be created by some SN. Such behavior entails a number of related nuances.


@cthulhu-rider commented on GitHub (May 21, 2025):

To clarify:

for each received block of code, an object (*) is formed, the payload of which is the code.

Only the user payload (or its slice) is encoded into RS shards. Each shard then becomes the payload of a shard object. The header of the user object (or split-chain member) is embedded into the shard header, similar to the original header in a split-chain member's header. This ensures that the spawned EC shard is directly linked to the original object, without requiring additional lookups, and that the header is preserved along with the payload.
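The embedding described above might look roughly like this; both types are hypothetical simplifications of the actual NeoFS object model, shown only to illustrate the parent-header link:

```go
package main

import "fmt"

// Header is a stand-in for an object header (real NeoFS headers carry
// much more: owner, signatures, attributes, etc.).
type Header struct {
	CID string // container ID
	OID string // object ID
}

// ShardObject carries one RS shard as payload and embeds the originating
// object's header, mirroring how split-chain members carry the parent
// header.
type ShardObject struct {
	Parent  Header // header of the original (or split-chain member) object
	Index   int    // position of this shard in the code
	Payload []byte // the RS shard itself
}

func main() {
	s := ShardObject{Parent: Header{CID: "container-1", OID: "object-1"}, Index: 2}
	fmt.Println(s.Parent.OID, s.Index) // object-1 2
}
```

With the parent header at hand, a node holding only a shard can still answer access-control and metadata questions about the original object, which is exactly the context switch the P.S. below refers to.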

P.S. Similar to the split (V2) scheme, in which split-object processing switches context to the parent object regarding access, etc., processing of a block object will switch to the originating object. This is ensured by having the header at hand.


@cthulhu-rider commented on GitHub (May 21, 2025):

To clarify:

These objects are placed according to REP 1 policy, i.e. in 1 copy with HRW-by-ID distribution.

Having EC blocks `0, ..., n-1`, where `n = d + p`:

  1. container nodes are selected using filters and selectors, similar to the replication policy (https://pkg.go.dev/github.com/nspcc-dev/neofs-sdk-go/netmap#NetMap.ContainerNodes)
  2. each node set is HRW-sorted by the OID of the blocks' parent, i.e. the originating user object or its split-chain member
  3. block objects are put onto the first n nodes, preserving the order
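The HRW step above can be sketched with a simplified weight function; a real implementation would use the SDK's netmap/HRW machinery, and `hrwSort` here only illustrates deterministic per-parent ordering:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hrwSort orders candidate node IDs by a weight derived from hashing each
// node ID together with the parent object ID (highest-random-weight, aka
// rendezvous hashing). Block i then goes to the i-th node of the result.
func hrwSort(nodes []string, parentOID string) []string {
	out := append([]string(nil), nodes...)
	weight := func(n string) uint64 {
		h := fnv.New64a()
		h.Write([]byte(n))
		h.Write([]byte(parentOID))
		return h.Sum64()
	}
	sort.Slice(out, func(i, j int) bool { return weight(out[i]) > weight(out[j]) })
	return out
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d"}
	order := hrwSort(nodes, "parent-object-id")
	fmt.Println(len(order) == len(nodes)) // true: same nodes, HRW order
}
```

Because the sort key depends only on the node set and the parent OID, every node computes the same ordering independently, so no coordination is needed to agree on which node holds which block.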

Since all objects in an EC container are encoded the same way, the need to collect EC blocks for GET(OID) is determined purely by the container (CID).


Looking ahead, if the container policy allows both EC and replication (which may be needed as a future enhancement), the need for EC assembly will be determined from the response, similar to how the splitting of an object is detected with the `raw` flag.


@roman-khimov commented on GitHub (Nov 13, 2025):

Done here. Optimizations and other questions are separate issues.

Reference
nspcc-dev/neofs-node#186