

In order to assess the timed formal model approach, we apply it to an industrial case study, namely the validation of the automatic mode management model of the AOCS (Attitude and Orbit Control System) of a satellite. The full description is available in this report; a synthesis was published at the ERTS 2018 conference.
\subsection*{Acronyms}
\begin{center}
\begin{table}[h!]
	\begin{small}
		\centering
		\begin{tabular}{|c|c|}
			\hline
			\textbf{Term} & \textbf{Description} \\
			\hline
			ACM & Attitude Control Mode \\
			\hline
			ARO & Automatic Reconfiguration Order \\
			\hline
			ASM & Acquisition and Safe Mode \\
			\hline
			CAM & Collision Avoidance Manoeuvre \\
			\hline
			FCM & Formation Control Mode \\
			\hline
			GNC &  Guidance, Navigation and Control \\ % Guidance and Navignatio Mode
			\hline
			IMU & Inertial Measurement Unit \\
			\hline
			ISL & Inter-Satellite Link \\
			\hline
			OCM& Orbit Control Mode \\
			\hline
			OFF&OFF Mode \\
			\hline
			PFailed&Permanent Failure\\
			\hline
			STR&Star Tracker  \\
			\hline
			TC&Telecommand\\
			\hline
			TFailed&Temporary Failure\\
			\hline
			TM&Telemetry\\
			\hline
		\end{tabular}
	\end{small}
%	\caption{Acronyms used}
\end{table}
\end{center}


\section{An Expression of Industrial Needs and Requirements}\label{sec:aocsintro}

Failure Detection, Isolation and Recovery (FDIR) functions are implemented aboard satellites in order to detect the occurrence of failures and to prevent the failure from propagating in the whole system, which could cause critical events and thus jeopardize the mission.
The complexity of FDIR verification and validation (V\&V) increases with system complexity. As systems include more and more interacting functions and functional modes, it becomes harder to evaluate the effects of local failures at the overall system level. This is even more so when we consider the effects of time: unexpected behavior can arise from bad timing of events. Timing constraints make it possible to represent the impact of computation times, delays in the propagation of failures, and the reaction time of the reconfiguration steps triggered in reaction to failure detection.
Thus, new models and tools are needed to assist early V\&V of the on-board processes and FDIR design. Thomas and Blanquart~\cite{thomas2013} describe a process to model FDIR satellite functions. This process requires modelling failure propagation, detection, and recovery times, which in turn requires modelling languages expressive enough to support complex system modelling. Associated tools should provide automatic verification that on-board FDIR functions are correct despite latency in failure propagation or management. For instance, one should be able to verify whether failures are recovered before they propagate further in the system.

\begin{figure}[tb]
	\centering
	\includegraphics[width=\textwidth]{figures/satellite_telecom.png}
	\caption{A satellite and its equipment}
\end{figure}


\section{AOCS Case Study}\label{sec:aocs}

We validate the approach on the automatic mode management model of the AOCS (Attitude and Orbit Control System) of a satellite, based on the control and command specifications of a satellite constellation mission. The example discussed in these pages simplifies this industrial specification:
it takes into account a single function and three pieces of equipment per satellite. Each piece of equipment is necessary to the execution of its function in any of its operational modes.
The real specification includes three functions, several pieces of equipment, surveillance elements and automata.
Validation can be done by simulation or by property verification; we take the latter route here.

\subsection{Architecture description}

In space missions, the control of the attitude and the orbit of the spacecraft is delegated to the AOCS. In the case study, the system relies on:
\begin{itemize}
	\item AOCS sensors:
	\begin{itemize}
	\item the Inertial Measurement Unit (IMU), which comprises accelerometers that provide the acceleration and gyros that give the estimated angular speed;
	\item the Star Trackers (STR), which give the estimated attitude quaternion;
	\end{itemize}
	\item AOCS actuators: Thrusters (ColdGasProp) permit the control of the spacecraft;
	\item 	The On-Board Computer unit, which manages all the spacecraft’s activity and therefore the AOCS application software that acquires the information from the sensors and commands the actuators.
\end{itemize}

\subsection{AOCS mode automaton}
Space industries and agencies contribute to the Space AVionics Open Interface Architecture (SAVOIR) initiative, which aims to define a single, agreed and common solution for the architecture of avionics systems. We assess our approach on an industrial case study based on its Ground-Board Interface specifications~\cite{ASRA}, the ECSS standard for Space Segment Operability~\cite{ECSS‐E‐ST‐70‐11C} and the AOCS mode management~\cite{ECSS-E-ST-60-30C}. Generally, the satellite mode management is specified in early phases, at system level, with a co-engineering approach. The focus is on:
\begin{itemize}
	\item The AOCS capabilities to control the attitude and the orbit of the satellite;
	\item The Operations capabilities to operate the satellite by telecommand (TC) from the ground segment;
	\item The FDIR capabilities to continue the mission operations and to survive critical situations without relying on ground intervention.
\end{itemize}

In our approach, we experiment with early validation to increase confidence in these system specifications and, later, to reduce the consolidation effort with subsystem engineering.
The use case focuses on the initial satellite mode management shown in Figure~\ref{fig:1}, which is the current representation of the AOCS modes.

\begin{figure}[thb]
	\centering
	\includegraphics[width=\textwidth]{figures/AOCSmode_schema.png}\vspace*{-5em}
	\caption{Representation of the AOCS modes.\label{fig:1}}
\end{figure}


Several AOCS modes are designed for the different mission phases.
A switch from one mode to another has three possible causes:
\begin{itemize}
\item either the satellite receives a telecommand (TC) from Earth to trigger a mode change (Acquisition \& Safe Mode, Attitude Control Mode, Orbit Control Mode, Formation Control Mode);
\item	or there is an automatic on-board transition (A) when the separation from the launcher is detected or when the collision avoidance manoeuvre is finished;
\item or an automatic FDIR reconfiguration order (ARO) is triggered by the ground (TC) or by the on-board FDIR to recover from a failure. The ARO switches the AOCS to OFF mode and is available from all AOCS modes except CAM.
\end{itemize}

A transition between modes is possible only when the equipment involved is available. Some devices can take tens of minutes to be ready for a mode switch, which is why it is very important to take timing constraints into account.

The planning of, and the reactivity to, messages triggering a transition between modes are crucial: failing to detect a message or to reconfigure in time would yield an unwanted mode. For instance, it could result in a transition into Safe Mode, which entails a heavy reboot and a loss of scientific acquisition time. Many tasks of the mission depend on different inputs and are time-dependent; the temporal bounds of those tasks are the ones that we analyse through formal V\&V, in order to detect possible unwanted situations or undetected failures in the system, and thus to maintain the safety and mission autonomy objectives.

Figure~\ref{fig:aocs} displays the AOCS mode automaton considered in the case study. This is a more precise version of the automaton of Fig.~\ref{fig:1}, based on a SysML activity diagram that represents the transitions between modes and, for each AOCS mode, its associated state machine. Each AOCS mode has essentially three states: an initial state, a nominal state, and a ``degraded'' state.
When a mode-change request is received, the AOCS triggers, in the \emph{Initial state}, the initialization of the equipment used by the mode, and then switches to the \emph{Nominal state} when all the equipment involved is operational. The degraded state is entered when the failure of equipment in use is detected. All AOCS modes are described in the next subsections.


\begin{figure}[thb]
	\centering
	\includegraphics[width=1.1\textwidth]{{"figures/2-2 - AOCS mode automaton"}.png}
	\caption{AOCS mode automaton.\label{fig:aocs}}
\end{figure}

\begin{figure}[thb]
	\centering
	\includegraphics[width=.9\textwidth]{{"figures/2-3 - ASH mode"}.png}
	\caption{Acquisition \& Safe mode automaton.\label{fig:asm}}
\end{figure}

\begin{figure}[hbt]
	\centering
	\includegraphics[width=.6\textwidth]{{"figures/2-4 set ON the equipment for ASM mode"}.png}
	\caption{Set ON equipment for ASM mode.\label{fig:asm-ON}}
\end{figure}

\subsubsection{OFF mode}
In this mode, any equipment of the use case can be started. This mode is used when the satellite is in the launcher, and after an ARO to bring the satellite back to safe conditions.
\subsubsection{Acquisition \& Safe mode (ASM)}
In this mode, the first acquisition in safe configuration is performed. The ASM automaton with its three states is shown in Figure~\ref{fig:asm}. When the TC to switch to ASM is received, the initialization of the equipment used by ASM is started by the \texttt{EquipASM\_ON} activity (Figure~\ref{fig:asm-ON}).

\begin{figure}[htb]
	\centering
	\includegraphics[width=\textwidth]{{"figures/2-5 ACM automaton"}.png}
	\caption{Attitude Control Mode automaton.\label{fig:acm}}
\end{figure}

\begin{figure}[bht]
	\centering
	\includegraphics[width=.6\textwidth]{{"figures/2-6 set ON the equipment for ACM mode"}.png}
	\caption{Set ON equipment for ACM mode.\label{fig:acm-ON}}
\end{figure}

\subsubsection{Attitude Control Mode (ACM)}
In this mode, the attitude is coarsely controlled by the AOCS. The ACM automaton with its three states is shown in Figure~\ref{fig:acm}. When the TC to switch to ACM is received, the
initialization of the equipment used by ACM is started by the \texttt{EquipASM\_ON} activity (Figure~\ref{fig:acm-ON}).


\subsubsection{Collision Avoidance Manœuvre (CAM)}
This mode is used when the ground or the FDIR triggers a manoeuvre to avoid a collision with debris or another satellite. These specific TCs are called CAM and are only available
from the Attitude Control Mode, Orbit Control Mode and Formation Control Mode. A CAM request is therefore not possible from the Acquisition \& Safe Mode.

The CAM automaton is described in Figure~\ref{fig:cam}. The AOCS leaves CAM autonomously and enters ACM when the manoeuvre is finished or the predefined time has elapsed.

\begin{figure}[htb]
	\centering
	\includegraphics[width=.6\textwidth]{figures/{"2-7 Collision Avoidance Mode"}.png}
	\caption{Collision Avoidance Mode.\label{fig:cam}}
\end{figure}

\subsubsection{Orbit Control Mode (OCM)}
In this mode, the orbit is controlled and the mission is suspended. The OCM automaton is shown in Figure~\ref{fig:ocm}.

\begin{figure}[p]
	\centering
	\includegraphics[width=.8\textwidth]{figures/{"2-8 OCM automaton"}.png}
	\caption{Orbit Control Mode.\label{fig:ocm}}
\end{figure}

\subsubsection{Formation Control Mode (FCM)}
FCM is the nominal mode where the formation flying is operational. The FCM automaton with the 3 states is shown in Figure~\ref{fig:fcm}.

\begin{figure}[p]
	\centering
	\includegraphics[width=.9\textwidth]{figures/{"2-9 Formation Control Mode"}.png}
	\caption{Formation Control Mode.\label{fig:fcm}}
\end{figure}


\subsubsection{Equipment}

The model is based on the following assumptions:
\begin{itemize}
	\item An equipment can be on, off, or faulty.
	\item An equipment is started by the function that needs it. The reboot has a duration.
	\item An equipment that is not useful in the current mode is switched off.
	\item The number of redundancies is a parameter defined for each equipment.
	\item Temporary or permanent failures can affect an equipment\footnote{Those failures can be injected by the user.} only when it is running.
	\begin{itemize}
		\item After a temporary failure, the equipment is rebooted automatically.
		\item After a permanent failure, the equipment switches over to one of its available redundancies. If no redundancy is available, the state remains faulty.
		\item When an ARO event is triggered, the equipment switches off. The functions can then switch it on again without moving to a redundancy.
	\end{itemize}
\end{itemize}

\begin{figure}[h]
	\centering
	\includegraphics[width=.5\textwidth]{figures/table1.png}
	\caption{Allocation of the equipment on the modes.\label{tab:equip}}
\end{figure}

\begin{figure}[bt]
	\centering
	\includegraphics[width=\textwidth]{figures/{"Equipment mode"}.png}
	\caption{Equipment state machine.\label{fig:equip}}
\end{figure}

Figure~\ref{fig:equip} displays the state machine of an equipment. The IMU, ColdGasProp, and STR equipment have similar state machines. Equipment started from the ``OFF'' state can only reach the ``ON'' state by passing through the ``Starting'' state. When a failure happens, the equipment makes a transition to the ``Failed'' state. If the failure is permanent, the equipment switches over to a redundant unit. If the failure is temporary, the ``Starting'' state can be restored with no further consequences. The allocation of equipment to modes is given in the table of Fig.~\ref{tab:equip}. For example, in our case study, the STR is used in four modes.
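
As a reading aid, the behaviour described above can be summarised by the transition relation below. This is only a sketch reconstructed from the text, and Fig.~\ref{fig:equip} remains the reference. The event names ($\mathit{start}$, $\mathit{ready}$, $\mathit{tfail}$, $\mathit{pfail}$, $\mathit{reboot}$, $\mathit{switch}$, $\mathit{aro}$) and the counter $r$ of remaining redundancies are our own notation, not identifiers of the model. We further assume that failures strike in the ``ON'' state only, that a switch-over leads back to ``Starting'', and that an ARO can occur in any state other than ``OFF''.
\[
\begin{array}{lcll}
\mathrm{OFF} & \stackrel{\mathit{start}}{\longrightarrow} & \mathrm{Starting} & \mbox{(requested by a function)}\\
\mathrm{Starting} & \stackrel{\mathit{ready}}{\longrightarrow} & \mathrm{ON} & \mbox{(after the start-up delay)}\\
\mathrm{ON} & \stackrel{\mathit{tfail},\,\mathit{pfail}}{\longrightarrow} & \mathrm{Failed} & \\
\mathrm{Failed} & \stackrel{\mathit{reboot}}{\longrightarrow} & \mathrm{Starting} & \mbox{(temporary failure)}\\
\mathrm{Failed} & \stackrel{\mathit{switch}}{\longrightarrow} & \mathrm{Starting} & \mbox{(permanent failure, $r > 0$, $r := r - 1$)}\\
q & \stackrel{\mathit{aro}}{\longrightarrow} & \mathrm{OFF} & \mbox{(any state $q \neq \mathrm{OFF}$, $r$ unchanged)}
\end{array}
\]
After a permanent failure with $r = 0$, no transition other than $\mathit{aro}$ is enabled under these assumptions, which corresponds to the equipment remaining faulty.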


The AOCS, like all satellite systems, presents an increasing complexity of interactions between functions and other equipment, often raising design and integration issues during the final testing and verification phases of the product. Correcting these issues often generates heavy rework and is a well-known cause of cost overruns and project delays. Moreover, the operation of space systems is traditionally expressed informally in design documents, as a set of modes representing functions, equipment and monitoring mechanisms. Usually the mode dynamics of all the satellite systems and their interaction with the FDIR functions are validated by review, i.e. exhaustive manual (as opposed to automatic) consistency checks and cross-reading. In the case of formation flying, complex equipment and instruments are distributed over several spacecraft. Human validation activities become practically impossible due to the high number of combinations to be analyzed. Powerful computer-aided analysis techniques are expected to help overcome this issue.

The validation of FDIR functions is particularly difficult, especially with ``traditional approaches'', because of the large number of interactions and non-nominal situations to analyse. A large variety of points of view is necessary to fully analyse the behaviour of the FDIR and its impact on dependability properties, including, for completeness, an explicit representation of the entities handled by the FDIR: architecture, faults, time, etc.

We believe that formalizing and validating the specifications through animation/simulation and model-checking has several strong advantages. First, modelling during the specification phase forces the designer to formalise and clarify the specifications. Second, animation/simulation is useful for validating the model against the specifications and for identifying behaviour inconsistencies based on relevant user-defined scenarios. Such inconsistencies are difficult to identify in a classical, purely paper-based specification process. Last, formal verification proves that none of the possible execution scenarios violates the system properties. Such approaches are useful not only for validating an architecture or FDIR strategy once defined, but also for tuning its parameters during the definition phase.

\section{Case study modelling}

The original model of the AOCS case study comes from the validation of the FDIR of formation flying satellites. Formation flying requires specific techniques to ensure flight coordination and the safety of the spacecraft in case of anomaly. This requires giving more autonomy and complex decision-making mechanisms to the On-Board Computer unit. The SPaCIFY project~\cite{synoptic} described a similar architecture using the Synoptic language, in order to identify the needs in terms of guaranteeing traceability, validation, and general analysis---possibly using formal techniques---of the V-cycle of development~\cite{sutre09}. From this architecture, an AltaRica 1.0 version of the Synoptic model was generated automatically. We have used this initial AltaRica model and adapted it first to the AltaRica Dataflow syntax~\cite{Altarica2.3}. Then we can automatically translate this new model into Fiacre using our toolchain.


The validation of an FDIR approach in a satellite architecture was carried out in the AGATA project~\cite{rugina09} by coupling simulation with model-checking, the latter to prove that given logical or timing properties hold in all states of a scenario generated by simulation. AGATA made the same choice as we do: focus on a significant part of the system and abstract away the rest, using UPPAAL~\cite{uppaal1997}.
However, the model on which they performed model-checking was untimed, with several variables used as clocks to track time. The drawback is that the size of the graph is exponential in the number of clocks~\cite{daws}, whereas Fiacre/Tina relies on a single clock.


\subsection{AltaRica modelling process}
In order to validate the FDIR software, the AltaRica model shall represent both the FDIR logic and the failure propagation in the hardware platform that is monitored and reconfigured according to the FDIR logic. During earlier design phases, AltaRica models without timing constraints can be used to abstract the details of the failure propagation paths and to support a preliminary safety and dependability analysis. Then the model details can be refined and timing information can be introduced in Time AltaRica (see Chap.~\ref{altarica}). This helps system designers to verify critical properties when time-bounded reactions are required. On the application side, focusing on the interaction of FDIR mechanisms and AOCS modes is of the utmost importance when designing a space system.

In this case study, we considered a quite detailed view of the FDIR logic related to the management of the AOCS and a simplified view of the satellite hardware. A future study will consider a more detailed view of the satellite hardware architecture.

\subsection{Details of the model}
Regarding the modelling activity, we focused first on the methodology applied to timed failure propagation described for a multi-phase system. In the starting architecture, described in Sect.~\ref{sec:aocs}, each satellite has three separate kinds of equipment: the STR (with a redundancy of 3), the ColdGasProp (with a redundancy of 3), and the IMU (with a redundancy of 2). We assume that mode transitions are initiated instantaneously (with a transition associated with a Dirac law), while events related to equipment ``switching on'' or rebooting are associated with a time law as follows:
\begin{itemize}
\item the STR takes 30~min and is associated with the interval $[30,30]$;
\item the IMU takes less than 10~s and is associated with the interval $[0,0.1]$;
\item the ColdGasProp takes between 5 and 10~min and is associated with the interval $[5,10]$.
\end{itemize}

We consider that these durations include all physical delays. The notification of a detected failure from Surveillance is a timed event, ranging between 0 and 10~ms.
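
To make the role of these intervals concrete, here is a small worked example. It assumes the usual reading of a firing interval (an event enabled at date $t_0$ with interval $[a,b]$ occurs at some date in $[t_0+a,\,t_0+b]$), reads the Dirac law as the interval $[0,0]$, and takes the bounds exactly as listed above, in their common time unit. The two start-up scenarios considered (all three kinds of equipment started in parallel, or one after the other) are illustrative assumptions and not part of the specification.
\[
\begin{array}{rclcl}
T_{\mathit{par}}^{\min} &=& \max(30,\,0,\,5) &=& 30\\
T_{\mathit{par}}^{\max} &=& \max(30,\,0.1,\,10) &=& 30\\
T_{\mathit{seq}}^{\min} &=& 30 + 0 + 5 &=& 35\\
T_{\mathit{seq}}^{\max} &=& 30 + 0.1 + 10 &=& 40.1
\end{array}
\]
In both scenarios the STR accounts for most of the delay before a mode that needs all three kinds of equipment can reach its nominal state.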

It is worth noting that the Synoptic (and later, the AltaRica 1.0) models that we had access to make practically no use of flow variables, which are the variables that represent the input/output ports of components. This modelling style enforces that a communication is either performed simultaneously by both sender and receiver, or cannot occur at all. This has a great influence on the empirical evaluation we performed, as undesired states are avoided by the modeller, whereas one of the interests of our approach lies in speeding up the early design evaluation phase by outlining design errors and allowing an automated assessment of the model. As the design of the AOCS modes has no such flaws, we aimed at checking the invariants in order to validate the model, and at evaluating the scalability of the approach on these properties plus a temporal one.

\subsection{Empirical evaluation}

In general, real-time model-checking does not scale well on very detailed and large systems, especially when the system uses events that work on very different timescales. Nevertheless, failure propagation models can be large but are not very detailed in the early design phase. In this particular use case, we are able to check the smallest instances of the problem in less than 3 minutes on a typical laptop, while the most difficult problem can be checked in half an hour.

We propose different versions of the AOCS architecture in order to appraise the complexity of the case and the scalability of our toolchain. We apply a series of simplifications to the model, generating different benchmarks of growing complexity. The main parameters in our experiments are the number of replicas of each equipment and the possibility, or not, of having transient failures.
A first simplification of the model is to consider permanent failures only. Benchmarks done in this condition are denoted ``Pfail only'' (see Table~\ref{tab:results1}). This choice simplifies the model-checking problem since it discards loops in the behavior of the system that originate from the system rebooting through its ``Starting'' state. On the other hand, we can build instances that are more complex by increasing the number of equipment. For example, we can add a second kind of thruster. In Table~\ref{tab:results1}, we label each experiment with the number of different kinds of thrusters used.

For each configuration, we investigate two different kinds of properties.

A first set of properties contains invariants on the set of states reachable by the system, like the absence of deadlocks or the property that \emph{equipment STR is always OFF when the AOCS is in Acquisition \& Safe Mode}. These are typically referred to as \emph{logical safety properties}. Both examples of properties, when true, require enumerating all the possible states of the system. Therefore, they give a good estimate of the complexity and size of the problem. Since reachability properties do not require computing the possible transitions between two states (that is, computing the \emph{state graph}), it is possible to use very efficient on-the-fly techniques that are both space- and time-efficient.
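
The two invariants mentioned above can be written as follows. The notation is ours and none of these names is taken from the model: $\mathit{Reach}$ denotes the set of reachable states, $\rightarrow$ the transition relation between states, and $\mathit{mode}(s)$ and $\mathit{STR}(s)$ the AOCS mode and the state of the STR in state $s$.
\[
\forall s \in \mathit{Reach}.\ \exists s'.\ s \rightarrow s'
\]
\[
\forall s \in \mathit{Reach}.\ \bigl(\mathit{mode}(s) = \mathrm{ASM}\bigr) \Rightarrow \bigl(\mathit{STR}(s) = \mathrm{OFF}\bigr)
\]
The second formula quantifies over $\mathit{Reach}$ only, so it can be decided while states are enumerated, without storing the transitions between them.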

Then we check an example of timing property, namely we prove a bound on the maximal time it takes for the system to reach a safe mode after a particular event. More precisely, given a duration $\delta$, we check whether the AOCS can reach its ``Collision Avoidance'' mode in less than $\delta$ minutes after activating surveillance. We can check this kind of property by adding an ``observer'' component that monitors the time elapsed since an event occurred and that can raise an error after a timeout. The observer adds extra behaviors to the analysis of the initial system and can therefore significantly increase its state space, especially when there is a lot of non-determinism in the system or when there are long-running activities. We observed that our use case partially exhibits these two causes of state space explosion, which limits the verification of timed properties to the one-thruster benchmarks (in 640~s for ``\emph{Pfail only}'', and 44~min~54~s for ``\emph{Pfail and Tfail}''), within the timeout we set at 45~min. That these instances remain tractable can be explained by the fact that satellite systems are usually very deterministic. Indeed, operators need to plan the behavior of a satellite very precisely when they compute TCs. This is an encouraging observation for the tractability of our approach in the aerospace domain.
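
A possible formalisation of this bounded-response property is sketched below. The notation is ours: $t_s$ is any date at which surveillance is activated, $\mathit{mode}(t)$ is the AOCS mode at date $t$, and the formula is meant to hold on every execution of the system.
\[
\forall t_s.\ \exists t_c.\ \ t_s \leq t_c < t_s + \delta \ \wedge\ \mathit{mode}(t_c) = \mathrm{CAM}
\]
A hypothetical observer for this property has three states, $\mathit{Idle}$, $\mathit{Wait}$ and $\mathit{Err}$, and one clock $x$. It moves from $\mathit{Idle}$ to $\mathit{Wait}$ and resets $x$ when surveillance is activated, returns to $\mathit{Idle}$ when the mode becomes $\mathrm{CAM}$, and moves from $\mathit{Wait}$ to $\mathit{Err}$ when $x$ reaches $\delta$. Writing $\mathit{obs}(s)$ for the observer state in $s$, checking the bound then amounts to checking the invariant
\[
\forall s \in \mathit{Reach}.\ \mathit{obs}(s) \neq \mathit{Err}.
\]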

\begin{table}
	\centering
\begin{tabular}{|l|c|c|c|c|}
	\hline
	Model & states & transitions & time (s) & size (MB) \\
	\hline
	1 thruster Pfail only & 224 374 & 8 345 295  & 225 & 8 \\
	\hline
	1 thruster & 448 335 & 20 470 486  &  & 16 \\
	\hline
	2 thrusters & 4 003 939 & 207 594 548 &  & 189 \\
	\hline
	3 thrusters & - & - & - & - \\
	\hline
\end{tabular}
\caption{Empirical evaluation of state space generation, Intel Xeon @ 2.33GHz}
\label{tab:results1}
\end{table}

Table~\ref{tab:results1} gives the results obtained with our experiments on four different configurations. We record the size of the model (in number of states and number of transitions) as well as the execution time and the memory consumed. All these results were obtained on a typical laptop with 8 GB of memory and an Intel processor. We observe that the Tina model checker scales up well on these models, failing only on the biggest instance, for which it produced no result within 45~min of run time. These numbers give a good estimate of the complexity of checking safety properties.
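
The growth between configurations can be read off Table~\ref{tab:results1} with a few ratios. The following is our own arithmetic on the figures of the table, rounded to two decimals.
\[
\begin{array}{l|c|c|c}
 & \mbox{states} & \mbox{transitions} & \mbox{size}\\
\hline
\mbox{1 thruster vs.\ 1 thruster Pfail only} & \frac{448\,335}{224\,374} \approx 2.00 & \frac{20\,470\,486}{8\,345\,295} \approx 2.45 & \frac{16}{8} = 2\\[1ex]
\mbox{2 thrusters vs.\ 1 thruster} & \frac{4\,003\,939}{448\,335} \approx 8.93 & \frac{207\,594\,548}{20\,470\,486} \approx 10.14 & \frac{189}{16} \approx 11.81
\end{array}
\]
Adding temporary failures thus roughly doubles the state space, whereas adding a second kind of thruster multiplies it by about nine. A further increase of the same order for a third kind of thruster would be consistent with the largest instance not completing within the time limit.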

A little less than 10 years ago, a similar study~\cite{sutre09} was performed using models expressed in AltaRica 1.0 and two different model-checkers, ARC~\cite{arc} and MEC 5~\cite{mec5}. These are two tools developed specifically for the AltaRica language. As in the experiments reported in Table~\ref{tab:results1}, ARC and MEC were used to compute the set of reachable states from an initial configuration of the system. These tools are based on symbolic methods for representing the sets of states and transitions of an AltaRica model. This is usually more efficient than enumerative techniques, such as those used by Tina. On the other hand, neither ARC nor MEC inherently takes timing constraints into account, whereas Tina relies on a ``symbolic'' representation of time constraints. ARC and MEC therefore need to model time using an explicit clock (an integer variable) that must be updated in the model. ARC and MEC both ran into serious limitations when scaling up the models, even when simple invariants were checked~\cite{sutre09}.
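
To give an order of magnitude for the cost of an explicit clock, consider the following back-of-the-envelope computation. It assumes, purely for illustration, that the integer clock ticks at the finest time constant of our model (the 10~ms bound on failure notification) and must count up to the longest one (the 30~min start-up of the STR); the models of~\cite{sutre09} may well have used other granularities.
\[
\frac{30~\mathrm{min}}{10~\mathrm{ms}} = \frac{30 \times 60 \times 1000~\mathrm{ms}}{10~\mathrm{ms}} = 180\,000
\]
Under this assumption, a single explicit clock ranges over 180\,000 values and may multiply the number of discrete states by up to that factor, whereas a symbolic representation of time handles a whole interval of dates at once.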

It is worth noting that, in our experiments, the addition of timed transitions did not increase the number of reachable states and only increased the execution time by a factor of 4.
Even if no counterexamples were generated in our case study, because of the correctness of the proposed models, our tool suite can compute cutsets in an untimed model, as well as counterexamples showing a timed scenario in which the checked safety property is violated~\cite{albore2017IMBSA}. Such counterexamples are generally hard to read and to debug. In the case of an implementation with FIACRE, it is possible to exploit relations between models representing, on the one hand, the information required by the user and, on the other hand, the information produced by the tools, in order to visualize in a compact view both the outcome of the model-checker and the FIACRE model, which greatly facilitates the interpretation of the analysis process~\cite{Zalila}. Possible future work would be to apply a similar approach to allow the representation of the analysis directly in the initial AltaRica model.


\newpage


%%  LocalWords:  ELEC Fiacre Dataflow Altarica evt Xeon

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: