Request lifetime trace #1413

New issue

Open

opened 2025-12-28 17:22:54 +00:00 by sami · 3 comments

sami commented

2025-12-28 17:22:54 +00:00

Owner

Originally created by @cthulhu-rider on GitHub (May 14, 2025).

I'm always frustrated when I need to recreate the details of the request execution, including chronology. In most cases, input data, i.e. request body, and output status are known. In particular, the status can be DeadlineExceeded. Processing of any request goes through several stages, sometimes there are quite a lot of them

when debugging we almost always need to understand what was done and for how long. For example, removing a split object is a multi-step operation involving cryptography, network/disk I/O, etc

at the moment, per-SN log partially allow tracking data movement and inter-SN interactions, while metrics allow tracking duration. But the level of detail leaves much to be desired.

when exploring problems, the main challenges are:

identify request (input and/or output may be the same for different requests)
find what exactly was done by request ID
find how long

Describe the solution you'd like

support request identity. Have a way to assign system-global UID to any request. Support switchable request tracing: if it's enabled, all requests are identified, and their lives are recorded in detail and measured in time

mostly, these are debug log severity and metrics essentially. The only difference i think about is additional costs to metrics. Time measurement is not free. Doing it per-request unconditionally slows down request handling. Mixing and averaging requests also hinders

i'll try to formulate the implementation proposal more specifically. For now I will rather highlight the problem

Describe alternatives you've considered

obviously, just keep as is. It's possible that the time spent on implementing a verbose trace will exceed the total time spent on manual debugging. This will also complicate the code

it should be also noted that in production networks it will be impossible to add logs/metrics/printlines, we will have to investigate from what is available

Additional context

i needed to add printlines to my gRPC fork and pull it to the node in #3250

Originally created by @cthulhu-rider on GitHub (May 14, 2025). ## Is your feature request related to a problem? Please describe. I'm always frustrated when I need to recreate the details of the request execution, including chronology. In most cases, input data, i.e. request body, and output status are known. In particular, the status can be `DeadlineExceeded`. Processing of any request goes through several stages, sometimes there are quite a lot of them when debugging we almost always need to understand what was done and for how long. For example, removing a split object is a multi-step operation involving cryptography, network/disk I/O, etc at the moment, per-SN log partially allow tracking data movement and inter-SN interactions, while metrics allow tracking duration. But the level of detail leaves much to be desired. when exploring problems, the main challenges are: 1. identify request (input and/or output may be the same for different requests) 2. find what exactly was done by request ID 3. find how long ## Describe the solution you'd like support request identity. Have a way to assign system-global UID to any request. Support switchable request tracing: if it's enabled, all requests are identified, and their lives are recorded in detail and measured in time mostly, these are `debug` log severity and metrics essentially. The only difference i think about is additional costs to metrics. Time measurement is not free. Doing it per-request unconditionally slows down request handling. Mixing and averaging requests also hinders i'll try to formulate the implementation proposal more specifically. For now I will rather highlight the problem ## Describe alternatives you've considered obviously, just keep as is. It's possible that the time spent on implementing a verbose trace will exceed the total time spent on manual debugging. This will also complicate the code it should be also noted that in production networks it will be impossible to add logs/metrics/printlines, we will have to investigate from what is available ## Additional context i needed to add printlines to my gRPC fork and pull it to the node in #3250

sami added the

labels

2025-12-28 17:22:54 +00:00

sami commented

2025-12-28 17:22:55 +00:00

Author

Owner

@roman-khimov commented on GitHub (Jun 6, 2025):

I think @carpawell has mentioned Opentelemetry once. Maybe it can help.

@roman-khimov commented on GitHub (Jun 6, 2025): I think @carpawell has mentioned Opentelemetry once. Maybe it can help.

sami commented

2025-12-28 17:22:55 +00:00

Author

Owner

@EESergey commented on GitHub (Jun 16, 2025):

We can try to enable tracing of objects of a certain size, for example the main stream is 4096 KB and every 10 seconds a request for objects is 4090 KB. For these objects we output all stages of processing to the log.

@EESergey commented on GitHub (Jun 16, 2025): We can try to enable tracing of objects of a certain size, for example the main stream is 4096 KB and every 10 seconds a request for objects is 4090 KB. For these objects we output all stages of processing to the log.

sami commented

2025-12-28 17:22:55 +00:00

Author

Owner

@EESergey commented on GitHub (Dec 1, 2025):

Let's proceed step by step, first as described in https://github.com/nspcc-dev/neofs-node/issues/3710

@EESergey commented on GitHub (Dec 1, 2025): Let's proceed step by step, first as described in https://github.com/nspcc-dev/neofs-node/issues/3710