docs: added orchestration feature request #857
Conversation
Abstract
========
A few basic questions which I did not find completely answered:
- Does the feature intend to provide a mechanism to globally manage ("orchestrate") the compute resources within an ECU as a whole?
- Or does it intend to provide means for user-level scheduling and configuration of tasks and threads within a single application, potentially consisting of multiple processes? (Similar to what an async framework -- e.g. with Tokio in Rust -- does for a single process.)
- Or is it both?
If it is both, I would propose to split the feature request into at least two separate requests: one for the first scope and one for the second.
Short answer: both directions.
The goal is to provide user-level scheduling and configuration of tasks and threads within a single process. In addition, it should provide the building blocks that enable you to orchestrate the functionality within the ECU and across processes. Because this targets only the building blocks, it should force the user into neither a fully centralized nor a fully decentralized approach.
The reasons why I would like to combine these functionalities in one PR are elaborated in the answer to your comment below. I would propose a quick breakout session so that we can align on how to proceed with the FR.
In existing platforms for microprocessors (µP), each application is expected to interact with approximately 15 system services or daemons - such as ``ara::diag`` for diagnostics and ``ara::com`` for SOME/IP communication. Under a straightforward implementation, this interaction model results in the creation of around 15 threads per application. When scaled to 100-150 applications, this amounts to roughly 1500 to 2250 threads solely managing inter-process communication, excluding those performing the core application tasks.

Given that the HPC's µP typically provides between 2 and 16 cores, only a limited number of threads can be processed in parallel. In POSIX-like operating systems, threads become the currency for concurrency, meaning that when the thread count far exceeds available cores, the system must rely on context switching. With context switching times estimated between 2µs and 4.5µs [#f1]_ [#f2]_ [#f3]_, even a 100ms time slice could spend between 3% and 10% of its duration on context switching alone - assuming each thread is scheduled once. This overhead increases dramatically if threads are forced to switch more frequently due to competition for processing time.
If I use your worst-case figures (2250 threads, only 2 cores, 4.5 µs task-switching time), I end up with about 5 ms of task-switching time for all tasks. This is 5% of 100 ms. With your lower-end figures (1500 threads -- still a lot --, 16 cores and 2 µs task-switching time) I get a total of 188 µs, which is ~0.2% of the 100 ms cycle time.
How did you calculate the 3% ... 10% figure?
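For reference, the two bounding cases written out (assuming each thread is switched exactly once per 100 ms cycle and the switching work is spread evenly across the cores):

$$\frac{2250 \times 4.5\,\mu\mathrm{s}}{2\ \mathrm{cores}} \approx 5.1\,\mathrm{ms} \approx 5\%\,, \qquad \frac{1500 \times 2\,\mu\mathrm{s}}{16\ \mathrm{cores}} \approx 188\,\mu\mathrm{s} \approx 0.2\%$$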
The scheduling in operating systems has been optimized heavily. There are a few major motivations for task-based concurrency which shares threads (like `async` in Rust):
- Avoiding the overhead of spawning and reaping threads (e.g. not spawning and reaping one thread per connection when serving thousands of connections per second).
- Minimizing load in IO-bound workloads (if most of what the code does is network IO, saving thread switches pays off because their cost does not pale in comparison to the per-event work).
- Allowing imperative-style programming for concurrently executing code (every `await` is a potential hand-off to another execution path).

I do not see how our expected use-cases could harness those advantages. We will not have webserver-like applications which accept thousands of new connections every second, and most applications are either heavier on the compute than on the IO part, or they do IO continuously without switching between different IO sources all the time. Furthermore, the nested language which you propose does not bring the benefit of easy imperative-style programming for concurrent code, as `Future`s in Rust or e.g. `Promise`s in JS do.
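As a minimal illustration of the third point (a sketch using Tokio; the connection handler is a placeholder):

```rust
// Imperative-style concurrent code: each `.await` is a point where the
// executor may hand the OS thread off to another task.
use tokio::time::{sleep, Duration};

async fn handle_connection(id: u32) {
    sleep(Duration::from_millis(10)).await; // hand-off: simulated IO wait
    println!("connection {id} handled");
}

#[tokio::main]
async fn main() {
    // Thousands of tasks can share a handful of OS threads.
    let tasks: Vec<_> = (0..1000)
        .map(|id| tokio::spawn(handle_connection(id)))
        .collect();
    for t in tasks {
        t.await.unwrap(); // hand-off: wait for task completion
    }
}
```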
The broader objective of the HPC platform is to support the concurrent integration and maintenance of multiple applications, potentially sourced from different vendors. Unlike traditional microcontroller (µC) platforms that are statically configured, this platform must allow dynamic updates, upgrades, and new deployments throughout the vehicle's lifecycle. This introduces significant complexity in managing concurrency, particularly in mixed-criticality environments where applications of varying criticality levels must coexist without interference.

Currently, application developers are expected to manage runtime scheduling directly via the operating system, controlling thread lifecycles, priorities, and affinities. Such an approach ties the configuration too closely to a specific deployment scenario and often leads to discrepancies between behaviors observed during development and those on the target system. This misalignment complicates integration efforts, as integrators must repeatedly iterate with the application developer to meet the system's reliability requirements. An orchestrator that abstracts these complexities could alleviate these challenges by offering a uniform, deterministic interface for managing runtime scheduling, thus ensuring that lower-criticality applications do not adversely affect the performance or reliability of higher-criticality ones.
Doesn't the integrator still need to figure out the minimum requirements (e.g. compute resources) of each application with the developers, which might be an iterative process if resources are very limited?
How would the integrator combine the different concurrent programs they get from different suppliers? Is it intended to merge all of them into a huge program? Or would the integrator still have to use operating system mechanisms and tools to distribute the resources across applications?
Yes, this is correct. The integrator still needs to figure out the minimum requirements together with the application developers. However, the goal is that we achieve a better separation of concerns. The application developer is responsible for defining the algorithm, incl. potential parallelism that could be exploited. The integrator is responsible for integrating all applications on a given system and therefore, IMO, has to have the ability to assign the resources during integration. The goal is that these interfaces are clearly defined and that the integrator does not need to rely on the application developers assigning the correct configuration within the application code. For example, application developers implement their algorithms and define task-based programs to execute them, while the integrator instantiates and configures the executors and deploys the applications' programs to them.
Most likely we will not be able to completely avoid iteration between the application developers and the integrator, but the goal is that deployment-specific configurations can be avoided in the application code and managed during deployment by the integrator.
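A minimal sketch of that split; the `Program`/`Executor` types here are illustrative stand-ins, not a proposed API:

```rust
// Hypothetical separation of concerns: the developer describes the
// algorithm, the integrator assigns resources at deployment time.
struct Program {
    name: String,
    steps: Vec<fn()>,
}

impl Program {
    fn new(name: &str) -> Self {
        Self { name: name.into(), steps: Vec::new() }
    }
    // Developer side: record the algorithm's steps (parallelism elided).
    fn step(mut self, f: fn()) -> Self {
        self.steps.push(f);
        self
    }
}

struct Executor {
    workers: usize,
}

impl Executor {
    // Integrator side: size the executor for the target ECU.
    fn with_workers(workers: usize) -> Self {
        Self { workers }
    }
    fn run(&self, program: Program) {
        println!("running '{}' on {} workers", program.name, self.workers);
        for step in program.steps {
            step(); // sequential stand-in for real cooperative scheduling
        }
    }
}

fn preprocess() { println!("preprocess"); }
fn publish() { println!("publish"); }

fn main() {
    // No thread counts, priorities, or affinities in application code.
    let program = Program::new("sensor_pipeline").step(preprocess).step(publish);
    // Resource assignment happens here, during integration.
    Executor::with_workers(4).run(program);
}
```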
- Cooperative multi-tasking, allowing multiple concurrent tasks to use the same OS thread.
- Provision of a configurable thread pool to enable parallelism for concurrent tasks.
- Introduction of additional thread pools only when necessary, such as when tasks differ in criticality or require separation by process boundaries.
How does this differ from what the various available async frameworks offer?
As mentioned in the answer to your previous comment, an async executor like e.g. tokio-rs is one possible solution to implement the functionality, but not the only one.
Additional functionality as targeted in the following sections might, however, influence the design of an async executor and lead to additional requirements beyond the standard async use-case.
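For reference, this is roughly what the configurable-thread-pool building block looks like with Tokio today (one possible implementation, not the proposed design):

```rust
// A configurable shared thread pool: many cooperative tasks multiplexed
// onto a fixed number of OS threads chosen by the integrator.
use tokio::runtime::Builder;

fn main() {
    let runtime = Builder::new_multi_thread()
        .worker_threads(4) // set at deployment time, not in application code
        .thread_name("app-pool")
        .enable_time() // timers, e.g. for deadline handling
        .build()
        .expect("failed to build runtime");

    runtime.block_on(async {
        let handles: Vec<_> = (0..100)
            .map(|i| tokio::spawn(async move { i * 2 }))
            .collect();
        for h in handles {
            h.await.unwrap();
        }
    });
}
```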
Specification: How to enable a user-friendly & deterministic interface for concurrent programs?
------------------------------------------------------------------------------------------------
To harness the benefits of user-space multi-tasking while still providing a user-friendly and deterministic interface for concurrent programs, this proposal advocates for a nested task-based programming framework. The choice for a nested structure over a graph-based one is driven by the need to design reliable programs and enable straightforward control flow analysis - a requirement that becomes critical during safety inspections. Although graph-based structures may have a gentler learning curve and offer rapid initial results, they often become limiting when more complex scheduling descriptions are needed, such as conditional branching, time monitoring, and error handling paths.
A safety-certified framework for task-based asynchronous programming could be very beneficial. However, it should at least enable application development in the usual async programming model using async functions, spawning of tasks, and "await"ing of futures (as, e.g. with "Tokio" in Rust).
A "nested" programming model could be added on top for applications actually needing it.
If a nested language is part of the feature request, it should be specified in that feature request, so that the programming model can be understood on a conceptual level.
Also: The nested approach appears to me like introducing a kind of interpreted language on top of the native language. This might make it harder to verify the correctness of programs using it and thus make safety evaluations more difficult.
Yes, I fully agree with you (see the answer to your previous comment). If the implementation selects the async programming model to facilitate the cooperative multi-tasking, then it should expose a similar interface as defined in the Rust async book.
My goal was to not restrict the format of the program definition too much in the feature request but to leave this open for the implementation, as long as all of the requirements are covered. If it would help to understand the thoughts behind it better, I could introduce an "admonition only" code block for readability. Let me know if this is something that you would like to see in the feature request.
Introducing an interpreted language is not my intention here; let me know how I could make this clearer or where you got that impression. The goal is to introduce a framework that exploits the functionalities of cooperative multitasking while offering an interface that achieves the goals mentioned here. One example of a graph-based programming framework is e.g. taskflow.
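As one possible shape for such an admonition-only block, the nested structure could be conveyed in plain async Rust (purely illustrative; Tokio is used only as a stand-in executor, and all function names are placeholders):

```rust
// Sequencing, concurrency, branching, and a deadline with an error path,
// all visible in ordinary control flow -- no interpreted language involved.
use tokio::time::{timeout, Duration};

async fn read_inputs() -> u32 { 7 }
async fn detect_objects(x: u32) -> u32 { x + 1 }
async fn estimate_lanes(x: u32) -> u32 { x + 2 }

async fn cycle() {
    let input = read_inputs().await; // sequential step
    // Concurrent block: both branches may run in parallel.
    let (objects, lanes) = tokio::join!(detect_objects(input), estimate_lanes(input));
    let result = if objects > lanes { objects } else { lanes }; // conditional branch
    // Timing constraint with an explicit error-handling path:
    match timeout(Duration::from_millis(50), async move { result }).await {
        Ok(r) => println!("cycle ok: {r}"),
        Err(_) => eprintln!("deadline missed, running reaction"),
    }
}

#[tokio::main]
async fn main() {
    cycle().await;
}
```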
In response to the increasing complexity of modern centralized E/E architectures and the need to support hundreds of applications, this feature request proposes a comprehensive orchestration framework for managing concurrency in high-performance computing (HPC) systems. The motivation for this proposal is rooted in the significant performance penalties incurred by conventional thread-based approaches, where an excessive number of threads leads to costly context switching in operating systems.

The proposed solution introduces user-level scheduling through cooperative multi-tasking, allowing task switches to occur in the nanosecond range instead of microseconds. By treating tasks as the fundamental unit of concurrency and enabling multiple tasks to share the same OS thread, the framework significantly reduces overhead and simplifies resource allocation.
There are existing frameworks for user-level scheduling based on cooperative multitasking in many languages including Rust and C++. As you know, Rust even provides language elements for it. Unfortunately, the frameworks I know of (e.g. "Tokio" for Rust) are not safety certified.
Having a (safety-certified) framework for task-based asynchronous programming could be very beneficial for the reasons you are pointing out below. I think, such a framework should enable application development in the common async programming model using async functions, spawning of tasks, and "await"-ing of futures (as, e.g. with "Tokio" in Rust).
According to my understanding, the functionality requested in this feature request more or less needs such a framework (at least a minimal one) as a basis. Thus, it could make sense to create a separate feature request for a safety-certified async framework and then base the remainder of the present feature request on top of that.
Yes, correct: an async executor like e.g. tokio-rs is one possible solution to implement the functionality (and in fact this is also the direction we went). I wanted, however, to focus on the functionality that we intend to achieve and leave the implementation details for the feature architecture/component decomposition. Other possible solutions besides an async executor could be the introduction of "green threads", like e.g. in Google's fiber framework.
I also thought of introducing two feature requests, one for the cooperative executor and one for the higher-level orchestration. However, besides the standard async executor functionality we also need to take into account e.g. potential "special" tasks that can preempt or escalate on errors in a cooperative context. These requirements only become visible if the executor is designed with the context of the higher-level orchestration in mind. This is the reason why I decided to combine the "executor" and "orchestration" functionalities in one FR. My expectation is that during the feature architecture we will decompose the functionality accordingly and end up with two components.
I think we had a broad agreement in S-CORE that, if possible, the native means of the programming language or programming environment should be used. In the case of user-space scheduling and Rust, Rust async and the corresponding well-known and widespread API already solve this problem. Of course this needs a safe async runtime, and working on that would be a very useful thing to do.
Nothing in this feature request contradicts your statement. What exactly would you like to see changed?
:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** implement user-level scheduling for task management so that task switches occur in the nanosecond range.
As mentioned above, this could be achieved by providing an async framework comparable to "Tokio" in Rust, i.e. supporting the usual async programming model using async functions, spawning of tasks, and "await"ing of futures. There are similar frameworks for C++.
I fully agree with you; see the answer to your earlier comment.
:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** support cooperative multi-tasking, allowing multiple concurrent tasks to share the same OS thread.
See above: This could be achieved by providing an async framework similar to the known ones.
I fully agree with you; see the answer to your earlier comment.
:status: invalid

The system **SHALL** provide a configurable thread pool for executing concurrent tasks. Additional thread pools **MAY** be introduced only when necessary (e.g., when tasks differ in criticality or require separation by process boundaries).
See above: This could be achieved by providing an async framework similar to the known ones.
I fully agree with you; see the answer to your earlier comment.
:status: invalid

The programming framework **SHALL** allow developers to express concurrent and sequential dependencies, conditional branching, timing constraints, and error handling paths while abstracting explicit thread management and complex synchronization.
See above: This could be achieved by providing an async framework similar to the known ones. There is no need for a "nested" language to achieve this.
I do not fully agree with you; see the answer to your earlier comment. This could potentially require a framework on top of an async executor.
:satisfies: stkh_req__execution_model__processes
:status: invalid

The system **SHALL** decouple algorithm design from deployment specifics, allowing dynamic updates, upgrades, and new deployments.
Could you describe in more detail, how the intended orchestrator approach can solve this without closely coupling the applications into a huge meta-program?
:satisfies: stkh_req__execution_model__processes, stkh_req__dev_experience__tracing_of_exec
:status: invalid

The system **SHALL** provide hooks for tracing and profiling task execution to verify behavior and control flow of the integrated system.
Don't the tracing and logging frameworks already provide means for this? If not, could you please specify in more detail the features the orchestration framework needs to add?
Yes, correct: logging and tracing provide the means to log and trace. However, the specific place and information that has to be traced can only be known by the orchestration feature. The goal is that this is implemented so that abstractions on top of the OS threads can be traced in accordance with the traces exposed by the system kernel.
PS: see the answer to Martin's comment for more context.
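As a sketch of what such hooks could look like (the `tracing` crate is used purely as an illustration; the actual hook API is left open by the feature request):

```rust
// Wrapping each user-level task in a span so that task-level events can
// later be correlated with OS-level thread traces.
use tracing::{info_span, Instrument};

async fn my_task(task_id: u64) {
    tracing::info!(task_id, "task body running");
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init(); // print spans/events to stdout

    // The span carries the user-level task id; an exporter could merge
    // this with kernel scheduler traces of the worker threads.
    my_task(42)
        .instrument(info_span!("orchestrated_task", task_id = 42))
        .await;
}
```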
:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__security_features
:status: invalid

The orchestration feature **SHALL** assume that all code executing within a process is trusted.
I think this is not a requirement to the orchestrator but rather a requirement on how developers shall partition their applications (i.e., independent of the existence of the orchestration feature).
The goal of this requirement is to define the assumed context for the implementation/design of the functionality. This is intended to be the input to the later required security analysis. What is your opinion? How/where should we define these "assumed" safety/security contexts?
:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__automotive_safety
:status: invalid

All tasks within a single process **SHALL** share the same ASIL level.
I think this is not a requirement to the orchestrator but rather a requirement on how developers shall partition their applications (i.e., independent of the existence of the orchestration feature).
The goal of this requirement is to define the assumed context for the implementation/design of the functionality. This is intended to be the input to the later required safety analysis. What is your opinion? How/where should we define these "assumed" safety/security contexts?
:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__automotive_safety
:status: invalid

The system **SHALL** implement priority-based preemption between thread pools to ensure that lower-priority programs cannot interfere with higher-priority programs.
Is it intended to provide means for that in the orchestration feature? Or is there a separate mechanism for that (e.g. based on OS tools)?
The goal is to utilize preemption to ensure that lower-priority programs cannot interfere with higher-priority programs. Ideally this is achieved by using the means of the OS, i.e. by prioritizing the threads in the thread pools.
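A sketch of what "using the means of the OS" could look like (assuming POSIX/Linux via the `libc` crate; error handling elided):

```rust
// Give each pool's worker threads a fixed real-time priority, so a
// higher-priority pool preempts a lower-priority one at the OS level.
use std::thread;

fn spawn_pool_worker(priority: i32, work: impl FnOnce() + Send + 'static) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        unsafe {
            let param = libc::sched_param { sched_priority: priority };
            // SCHED_FIFO: strict priority-based preemptive scheduling.
            libc::pthread_setschedparam(libc::pthread_self(), libc::SCHED_FIFO, &param);
        }
        work();
    })
}

fn main() {
    let high = spawn_pool_worker(80, || { /* safety-critical executor loop */ });
    let low = spawn_pool_worker(10, || { /* best-effort executor loop */ });
    high.join().unwrap();
    low.join().unwrap();
}
```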
Concurrent programming in our target environment spans multiple scopes. At one level, concurrency exists within the algorithms of individual applications or system services, while at another level, multiple applications must execute concurrently across the platform. The challenge is to offer an interface that is not only simple and expressive but also deterministic and reliable - an important requirement in safety-critical systems.

Traditional thread-based concurrency in POSIX-like environments introduces complexities such as deadlocks, livelocks, and starvation. These issues, coupled with the inherent difficulties in debugging and validating thread-based systems, can compromise both performance and reliability. [#f4]_ [#f5]_ [#f6]_ Moreover, current designs often separate the management of timing requirements, monitoring, and error handling from the control flow. Integrating these aspects closer to the application logic would promote higher cohesion and lower coupling, enabling more effective debugging and validation, particularly when addressing application-specific scenarios.
Can you explain how the orchestrator helps to avoid deadlocks, livelocks, and starvation? In particular also in the case where priority-based OS scheduling is used, as described below.
- This is the motivation section, intending to give the background on why the feature request asks for a specific functionality. The functionality that should be implemented within the orchestration feature is defined in the specification section (see e.g. the chapter "Specification: How to enable a user-friendly & deterministic interface for concurrent programs?"). Please clarify whether the motivation leading up to the specification is unclear, or whether the contents defined in the specification need more detail.
- Describing how to implement a specific functionality is not the goal of a feature request. The feature request intends to clarify which functionality should be targeted by the S-CORE platform and its modules/components. Once we agree which functionality we intend to target (with the feature request), we can start reviewing implementations behind the contributions targeting the feature request.
> Traditional thread-based concurrency in POSIX-like environments introduces complexities such as deadlocks, livelocks, and starvation.

Reading the rest of your FR, it does not sound like the proposed feature would tackle those concurrency challenges. Instead, it would introduce a new way to react to timeouts and errors (which themselves might time out and error...). Could you explain how this actually reduces the complexity for developers, instead of replacing a well-known set of concurrency challenges with a new, unknown set?
The broader objective of the HPC platform is to support the concurrent integration and maintenance of multiple applications, potentially sourced from different vendors. Unlike traditional microcontroller (µC) platforms that are statically configured, this platform must allow dynamic updates, upgrades, and new deployments throughout the vehicle's lifecycle. This introduces significant complexity in managing concurrency, particularly in mixed-criticality environments where applications of varying criticality levels must coexist without interference.

Currently, application developers are expected to manage runtime scheduling directly via the operating system, controlling thread lifecycles, priorities, and affinities. Such an approach ties the configuration too closely to a specific deployment scenario and often leads to discrepancies between behaviors observed during development and those on the target system. This misalignment complicates integration efforts, as integrators must repeatedly iterate with the application developer to meet the system's reliability requirements. An orchestrator that abstracts these complexities could alleviate these challenges by offering a uniform, deterministic interface for managing runtime scheduling, thus ensuring that lower-criticality applications do not adversely affect the performance or reliability of higher-criticality ones.
Can you explain how the Orchestrator helps to integrate multiple applications and other services? For example, what does an integrator have to do when adding another orchestrator-enabled application to an existing system? I assume such an application comes with a description according to the "nested task-based programming framework". How will this new description be merged into the existing descriptions? How can it be ensured that the existing applications will still meet their timing requirements?
See the answer to Armin's comment; maybe this already partially answers your question.
To ensure that the system including the existing applications still meets its timing requirements, the functionality should "offer tracing and profiling of program execution to verify the behavior and control flow of the final integrated system" (see the specification section). In the very first shot, and for the sake of this feature request, the goal is that we are able to trace the overall system behavior. This feature should extend the information that we receive from the OS scheduler with information about any higher-level abstractions that the feature implements on top of the thread scheduling. This allows profiling the behavior of the individual applications deployed in the system, as well as correlating information about threads that are outside the control of the orchestration feature with threads that are managed within it. In a second step we could introduce tooling that visualizes this information and helps guide the integrator to find the optimal configuration for a reliable system. This is however outside the scope of this feature request.
> Currently, application developers are expected to manage runtime scheduling directly via the operating system, controlling thread lifecycles, priorities, and affinities.

This also passes the necessary tools to developers using a well-matured, time-tested scheduling approach which they already know. I could imagine that the effort for both application developers and integrators increases if they have to learn a new, restricted scheduling approach instead of using what they know and what has been optimized over time.

> This misalignment complicates integration efforts, as integrators must repeatedly iterate with the application developer to meet the system's reliability requirements.

I do not think that the iteration for system reliability requirements disappears with an added orchestration layer. Threads and thread pools of the orchestration feature would still interfere with each other and with other processes and threads on the OS. The orchestration feature cannot offer guarantees like e.g. a realtime OS kernel when it comes to scheduling.
Besides the potential upsides of an algorithm-independent description of scheduling requirements, there is the downside that the application logic is split into two parts: the part in the algorithm and the part in the scheduling description. I think this does not improve user friendliness from the perspective of an application developer. Therefore, the pros and cons have to be weighed carefully against each other.
Could you please elaborate on the downsides that you see with such an approach so that we can incorporate/update this in the feature request?
PS: The goal of the FR is not to separate the algorithm from the scheduling requirements, but to separate the algorithm (incl. scheduling requirements) from the resources they execute on.
- Free from complex synchronization mechanisms.
- Capable of expressing both concurrent and sequential dependencies.
- Capable of expressing conditional branching within the program.
- Capable of expressing timing constraints and error handling paths directly within the program.
Could you please explain what timing constraints you envision and what exactly the error handling paths would do? Is the error handling the reaction to a failed timing constraint check or is this error handling something more general? How does user code interact with the error handling?
Timing constraints that are envisioned are e.g. deadlines that could span multiple parts of the program/tasks, and one or more reactions that are executed as soon as a deadline is missed. The error handling mostly targets the timing constraints but should also provide the possibility to escalate on errors of user tasks in a cooperative context.
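To make that concrete, a minimal sketch of a deadline spanning several tasks with a reaction on expiry (assuming a Tokio-like runtime; the reaction is a placeholder):

```rust
use tokio::time::{timeout, Duration};

async fn step_a() { /* part one of the program */ }
async fn step_b() { /* part two of the program */ }

#[tokio::main]
async fn main() {
    // One deadline covers both steps together, not each step individually.
    let guarded = async {
        step_a().await;
        step_b().await;
    };
    if timeout(Duration::from_millis(100), guarded).await.is_err() {
        // Reaction executed as soon as the deadline is missed.
        eprintln!("deadline over step_a + step_b missed; escalating");
    }
}
```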
Hm, what would the system do if deadlines are missed or errors occur in the "reaction branches" to other missed deadlines / errors? I could imagine that this introduces a good amount of complexity. And application developers would need to learn about it to implement the right reaction branches - but compared to OS-level scheduling, it is new, unknown, and potentially not as mature.
This can be achieved by leveraging the proposed programming framework to express application structures without direct dependence on kernel threads. Integrators would then define a shared thread pool that executes the routines of these programs via cooperative multi-tasking. This design allows integrators to manage the number, affinity, and priorities of threads in the pool, providing full control over system resources. Consequently, integrators can assign computing resources during deployment without the need for iterative fine-tuning with each application developer.

Additionally, the system will use priority-based preemption between thread pools to ensure that lower-priority programs cannot interfere with higher-priority ones. Error handling is managed independently of the application's normal execution paths, with misbehaving code preemptively handled to maintain overall system stability. This mechanism empowers developers to build robust safety mechanisms on top of the orchestration feature.
Could you clarify what you envision here? Launching threads in thread pools with different OS priorities / in different cgroups?
.. _orchestration_feature:

Orchestration
After reading the FR several times I would suggest renaming it. "Orchestration" sounds too "global". What is proposed is a safe cross-process async framework.
Abstract
========
Suggested change:
Old: "In response to the increasing complexity of modern centralized E/E architectures and the need to support hundreds of applications, this feature request proposes a comprehensive orchestration framework for managing concurrency in high-performance computing (HPC) systems. The motivation for this proposal is rooted in the significant performance penalties incurred by conventional thread-based approaches, where an excessive number of threads leads to costly context switching in operating systems."
New: "This feature request proposes a comprehensive orchestration framework for efficiently managing concurrency within and across processes on an HPC ECU. The motivation for this proposal is rooted in the significant performance penalties incurred by conventional thread-based approaches, where an excessive number of threads leads to costly context switching in operating systems."
The proposed solution introduces user-level scheduling through cooperative multi-tasking, allowing task switches to occur in the nanosecond range instead of microseconds. By treating tasks as the fundamental unit of concurrency and enabling multiple tasks to share the same OS thread, the framework significantly reduces overhead and simplifies resource allocation.

Suggested change:
Old: "Overall, this orchestration feature is designed to provide a safe and secure, deterministic, user-friendly interface that streamlines concurrent execution, optimizes resource utilization, and facilitates the reliable integration of mixed-criticality applications in complex automotive and embedded systems."
New: "Overall, this orchestration feature is designed to provide a safe and secure, deterministic, user-friendly interface that streamlines concurrent execution, optimizes resource utilization, and facilitates the reliable integration of mixed-criticality applications, which are spawned across several OS processes."
Motivation: Increasing Performance in a Highly Concurrent System
-----------------------------------------------------------------

In existing platforms for microprocessors (µP), each application is expected to interact with approximately 15 system services or daemons - such as ``ara::diag`` for diagnostics and ``ara::com`` for SOME/IP communication. Under a straightforward implementation, this interaction model results in the creation of around 15 threads per application. When scaled to 100-150 applications, this amounts to roughly 1500 to 2250 threads solely managing inter-process communication, excluding those performing the core application tasks.
Not sure if this is the main motivation for the framework, and it leads again in the direction that this framework shall be used by each and every component in the platform. We can have this discussion, but I would separate it from this FR.
Furthermore, reducing context switching overhead not only improves raw performance but also enhances overall safety and reliability. By minimizing the number of context switches, the CPU can devote more time to executing critical application logic and monitoring system health, thereby reducing the risk of timing violations in safety-critical functions. This reduction in overhead also minimizes the likelihood of cascading delays and resource starvation, which bolsters system resilience and ensures predictable, real-time execution for safety-critical tasks.
Hmm, with this you could argue that every efficiency measure is important for safety and security. :-) Not sure if this is the key point.
The envisioned programming framework will be:
- Free from explicit thread management.
- Free from complex synchronization mechanisms.
- Capable of expressing both concurrent and sequential dependencies.
Suggested change:
Old: "- Capable of expressing both concurrent and sequential dependencies."
New: "- Capable of expressing both concurrent and sequential dependencies of tasks."
Specification: How to manage reliable integrations in mixed-criticality environments?
--------------------------------------------------------------------------------------

To address the challenge of integrating applications developed by distributed teams - each with only a partial view of the final target system - the solution must decouple algorithm design from deployment details. Application developers should be able to define algorithms that can exploit parallel execution when processing resources are available without binding their implementations to a specific deployment scenario.
Suggested change: replace "the final target system" with "the final target application" in the paragraph above.
Suggested change:
Old: "This can be achieved by leveraging the proposed programming framework to express application structures without direct dependence on kernel threads. Integrators would then define a shared thread pool that executes the routines of these programs via cooperative multi-tasking. This design allows integrators to manage the number, affinity, and priorities of threads in the pool, providing full control over system resources. Consequently, integrators can assign computing resources during deployment without the need for iterative fine-tuning with each application developer."
New: "This can be achieved by leveraging the proposed programming framework to express application structures without direct dependence on threads and processes. Integrators would then define a set of processes with shared thread pools that execute the tasks of these programs via cooperative multi-tasking. This design allows integrators to manage the number, affinity, and priorities of threads in the pool, providing full control over system resources."

I would remove the last sentence. It sounds like integrators can assign resources without taking the needs of the applications into account.
Suggested change:
Old: "The system **SHALL** implement user-level scheduling for task management so that task switches occur in the nanosecond range."
New: "The system **SHALL** implement user-level scheduling for task management."
:satisfies: stkh_req__execution_model__processes, stkh_req__dependability__automotive_safety
:status: invalid

The orchestration feature **SHALL** include configurable error handling mechanisms that are insulated from the effects of misbehaving tasks (e.g., tasks in an infinite loop).
I see a problem here: misbehaving in this case is limited to temporal misbehaving. As far as I understand, there is no mechanism to protect tasks from spatial misbehaving of other tasks.
@ramceb can you give an example of what you mean here?
closes: #273