Failure propagation models are defined by safety engineers and are
usually obtained through a manual assessment of the safety of the
system. This is a complicated task since failures can depend on more
than one element of the system; be the result of the interaction
between many faults; be the consequence of the missed detection of
another fault (e.g. a fault inside an element tasked with detecting
faults); etc. To cope with the complexity of the systems and the
scenarios that need to be analysed, several model-based approaches
have been proposed, such as
AltaRica~\cite{Altarica2.3,arnold1999altarica} or
Figaro~\cite{bouissou05}, each with its associated tooling.

Our work is concerned with the modelling and analysis of failure
propagation in the presence of time constraints. We concentrate on a
particular safety property, called \emph{loss detection convergence},
meaning that the system applies an appropriate and timely response to
the occurrence of a fault before the failure is propagated and
produces unwanted system behaviours.
%
Similar problems were addressed in~\cite{thomas2013}, where the
authors describe a process to model a Failure Detection Isolation and
Reconfiguration architecture (for use on-board satellites) that
requires taking into account failure propagation, detection, and
recovery times. However, these needs are not matched by an effective
way to express or check the safety constraints of the system.

Our contribution is as follows. We define a lightweight extension of
AltaRica, meaning that timing constraints are declared separately
from the behaviour of a system (Sect.~\ref{sec:sect2}). Therefore it
is easy to reuse a prior safety model and to define its temporal
behaviour afterwards. We illustrate our method with an example
inspired by safety architectures found in avionic systems. This
example illustrates the impact of time when reasoning about failure
propagation, and we use it to show that taking into account timing
constraints---in particular propagation delays---can help find new
failure modes that cannot be detected in the untimed models currently
in use. In the process, we define two safety properties: \emph{loss
  detection} and its temporal version, \emph{loss detection
  convergence}, described above. We show that these two properties,
which are of interest in a much broader context, can be reduced to
effective model-checking problems. In the next chapter, we extend
this approach by applying it to an industrial satellite case study
and analysing the scalability of the methodology to more complex
system models.

\section{Model-Based Safety Analysis with AltaRica}\label{sec:sect2}

\subsection{AltaRica language and versions}

AltaRica is a high-level modelling language dedicated to Safety
Analysis. It has been defined to ease the modelling and analysis of
failure propagation in systems. The goal is to identify the possible
failure modes of the system and, for each mode, the chain of events
that leads to an unwanted situation. In AltaRica, a system is
expressed in terms of variables constrained by formulas and
transitions. Three main versions of the language have been designed
so far (see for instance~\cite{arn00,prosvirnova2013altarica}).
These three versions differ essentially in two respects: the way
variables are updated after each transition firing and the way
hierarchies of components are managed. In the first version, designed
at the Computer Science Laboratory of the University of Bordeaux
(LaBRI), variables were updated by solving constraint systems. This
model, although very powerful, was too resource-consuming for
industrial scale applications. A radical turn was therefore taken
with the design of a data-flow version of the language. In AltaRica
2.0 Dataflow, variables are updated by propagating values in a fixed
order. This order is determined at compile time. At the time this
document is written, most AltaRica tools rely on this Dataflow
version. Ongoing investigations aim at improving AltaRica Dataflow in
several ways, such as propagating variable values by computing a
fixpoint at execution time; this motivates the third version of the
language, currently under development: AltaRica 3.0, or
OpenAltaRica~\cite{prosvirnova2013altarica}.

\subsection{AltaRica modelling}

In this work, we use the AltaRica 2.0 Dataflow language, which is a
fragment of the other versions and is sufficient to analyse the
behaviour of computer-based systems. (The approach discussed here
could be applied to other AltaRica dialects.) The models used in this
work have been edited and analysed using Cecilia OCAS, a graphical
interactive simulator developed by Dassault
Aviation~\cite{bieber2004safety}.

An AltaRica model is made of interconnected \emph{nodes}. A node can
be essentially viewed as a mode automaton~\cite{rauzy02} extended
with guards and actions on data variables. A node comprises three
parts: a declaration of variables and events, the definition of
transitions, and the definition of assertions. We illustrate these
concepts with the example of a simple node called \FUNC. We give the
code (textual definition) of \FUNC in Listing~\ref{ls:altarica} and a
schematic representation in Fig.~\ref{fig:funct2}. The node \FUNC has
one input, \code{I}, and one output, \code{O}.

\lstinputlisting[language=Altarica,float=bht,captionpos=b,caption=\protect{Example of AltaRica code for the node \FUNC.},frame=single,label=ls:altarica]{algos/function2.alt}

\begin{figure}[thb]
  \centering
  \begin{minipage}{0.25\textwidth}
    \centering
    \includegraphics[width=\textwidth]{altarica-function.pdf}
  \end{minipage}
  \begin{minipage}{0.7\textwidth}
    \centering
    \vspace{1em}
    \includegraphics[width=0.8\textwidth]{altarica-diagrams.pdf}
  \end{minipage}
  \caption{Graphical representation of node \FUNC (left) and its
    associated failure mode automaton
    (right).\label{fig:funct2}}\vspace{1em}
\end{figure}
\enlargethispage{\baselineskip}
%
Nodes can have an internal state stored in a set of \emph{state
  variables}, declared in a heading called \texttt{state}. In its
nominal state (when \code{S = NOMINAL}), the \FUNC node acts as a
perfect relay: it copies on its output the value provided by its
input (we have \code{O = I}); this is expressed in the \code{assert}
block. Conversely, its output is set to \LOSS or \ERR when \code{S}
equals \code{LOST} or \code{ERROR}, respectively.
%
The \code{assert} directive is used to express constraints on the
values of the input and output variables of a node, also called
\emph{flow variables}, establishing a link between the node and its
environment, i.e. the other interconnected nodes. It distinguishes
between input (\texttt{in}) and output (\texttt{out}) variables.
An assertion defines a rule to update the value of output flows
according to the state of the component and the value of input flows.

The state variables of a node can only change when an event is
triggered. The code of \FUNC declares two events: \code{fail\_err},
that changes the state from \code{NOMINAL} to \code{ERROR}, and
\code{fail\_loss}, that can transition from any state to
\code{LOST}. This behaviour is the one displayed on the mode
automaton of Fig.~\ref{fig:funct2}. Transitions are listed in the
\texttt{trans} block of the node. Each transition has an (event) name
and a definition of the form
\lstinline[language=Altarica]{g |-evt-> e}, where the guard \code{g}
is a Boolean condition that can refer to state and flow
variables. The event \code{evt} can be triggered when the guard is
satisfied; in this case we apply the effect \code{e}, an expression
that modifies the values of state variables. Events are useful to
model the occurrence of failures or the reaction to conditions on the
flow variables.
% It is possible to indicate
% whether an event is triggered internally by a node---as in the case of
% a reconfiguration---or externally by its environment---as in the case
% of the occurrence of a failure. In the latter case,
We can assign a probability law to the occurrence of a failure using
the heading \code{extern}. For instance, we could state that event
\code{fail\_loss} follows an exponential distribution with the
declaration:
%%
{\lstinline[language=Altarica]{extern law ()="exp 1e-4";}}
%%
In the next section, we propose a way to enrich this syntax to
express timing constraints on events instead of probability
distributions. At the moment, it is not possible to use stochastic
events in addition to timed events.
\enlargethispage{\baselineskip}

In the general case, an AltaRica model is composed of several
interconnected node instances, following a component-based
approach. Global assertions relate the input flows of a component to
the output flows of other components. For the sake of brevity, we do
not describe component synchronisation here and we refer the reader
to \cite{Altarica2.3} for further details. More importantly, a
hierarchical AltaRica model can always be ``flattened'', i.e.
represented by a single node containing all the variables, events,
assertions, and transitions from the composite system. We use this
property in our interpretation of AltaRica in Fiacre.
%%
\subsection{Time AltaRica: Adding Timing Constraints to Events}%%\label{sec:adding-timing-constr}
%%
There is already a limited mechanism for declaring timing constraints
in AltaRica. It relies on the use of an external law associated with
a \emph{Dirac distribution}. An event with Dirac$(0)$ law denotes an
instantaneous transition, that should be triggered with the highest
priority. Likewise, an event with Dirac($d$) (where $d$ is a positive
constant) models a transition that should be triggered with a delay
of $d$ units of time.
%
In practice, Dirac laws are more a way to encode priorities between
events than an actual means to express durations. Moreover, while
Dirac laws are used during simulation, they are not taken into
account by the other analysis tools. Finally, the use of Dirac laws
is not expressive enough to capture non-deterministic transitions
that can occur within time intervals of the form $[a, b]$, where
$a \neq b$. These constraints are useful to reason about failure
propagation delays with different best- and worst-case traversal
times.
For this reason, we propose to extend event properties with
\emph{temporal laws} of the form:
%%
{\lstinline[language=Altarica]{extern law (evt) = "[a,b]";}}
%%
It is also possible to use open and/or unbounded time intervals, such
as \code{]a,}$\infty$\code{[}. With such a declaration, the
transition \lstinline[language=Altarica]{g |-evt-> e} can be
triggered only if the guard \code{g} is satisfied for a duration (or
\emph{waiting time}) $\delta$, with $\delta \in [a, b]$. A major
difference with the original semantics of AltaRica is that the timing
constraint of an event is not reinitialised unless its guard is set
to false.
%%
% Hence the waiting time of an event decreases when time elapses,
% while with a Dirac($d$) law the waiting time is reset to $d$ each
% time the event fires.
Moreover, our semantics naturally entails a notion of \emph{urgency},
meaning that it is not possible to miss a deadline: when $\delta$
equals $b$, either \code{evt} is triggered or another transition
should (instantaneously) change the value of the guard \code{g} to
false.

We can illustrate the use of temporal laws with the following example
of a new node, \PRE; see Listing~\ref{ls:pre}. This node encodes a
buffer that delays the propagation of its input. When the input
changes, event \code{pre\_read} has to be triggered
instantaneously. Then, only after a duration $\delta \in [a, b]$, the
value stored by \PRE (in the state variable \code{Stored}) is
propagated to its output.
%
\lstinputlisting[language=Altarica,float=t,captionpos=b,caption=\protect{Example of Time AltaRica code: the basic delay.},frame=single,label=ls:pre]{algos/pre.alt}

\section{A Definition of Fiacre Using Examples}\label{sec:sect3}

Fiacre~\cite{berthomieu2008fiacre} is a high-level, formal
specification language designed to represent both the behavioural and
timing aspects of reactive systems. Fiacre programs are stratified
into two main notions: \emph{processes}, which are well-suited for
modelling structured activities (such as simple state machines), and
\emph{components}, which describe a system as a composition of
processes. In the following, we base our presentation of Fiacre on
code examples used in our interpretation of Time AltaRica.
% \subsection{A Definition of Fiacre by Examples}
% \label{sec:defin-fiacre-exampl}
We give a simple example of a Fiacre specification in
Listing~\ref{lst:fiacre}. This code defines a process,
\code{Function}, that simulates the behaviour of the AltaRica node
given in Listing~\ref{ls:altarica}.

\lstinputlisting[language=Fiacre,float=hbt,captionpos=b,caption=\protect{Example of Fiacre code: type, functions and processes},label=lst:fiacre,frame=single]{algos/function.fcr}

Fiacre is a strongly typed language, meaning that type annotations
are exploited in order to avoid unchecked run-time errors. Our
example defines two enumeration types, \code{FState} and
\code{FailureType}, that are the equivalent of the namesake AltaRica
domains. We also define a record type, \code{Flows}, that models the
environment of the node \code{Function}, that is, an association from
flow variables to values. Fiacre provides more complex data types,
such as arrays, tagged unions or FIFO queues. Fiacre also supports
native \emph{functions} that provide a simple way to compute on
values. In our example, function \code{update} is used to compute the
state of the environment after an event is triggered, that is, to
model the effect of assertions in AltaRica.
It uses two ternary (conditional) operators to mimic the
\code{case}-expression found in the \code{assert} heading of
Listing~\ref{ls:altarica}.
% Functions are evaluated applicatively (there are no side effects)
% and can be recursively defined.

A Fiacre \emph{process} is defined by a set of parameters and
{control states}, each associated with a set of \emph{complex
  transitions} (introduced by the keyword \code{from}). Our example
defines a process with two {shared variables}---symbol \code{\&}
denotes variables passed by reference---that can be updated
concurrently by other processes. In our case, variable \code{S}
models the (unique) state variable of node \FUNC.

Complex transitions are expressions that declare how variables are
updated and which transitions may fire. They are built from
constructs available in imperative programming languages
(assignments, conditionals, sequential composition, \dots);
non-deterministic constructs (such as external choice, with the
\code{select} operator); communication on ports; and jumps to a state
(with the \code{to} or \code{loop} operators). In
Listing~\ref{lst:fiacre}, the \code{select} statement defines two
possible transitions, separated by the symbol \code{[]}, that loop
back to \code{s0}. Each transition maps exactly to one of the
AltaRica events, \code{fail\_loss} and \code{fail\_err}, that we want
to translate. Transitions are triggered non-deterministically and
their effects are atomic (they have an ``all or nothing''
semantics). A transition can also be guarded by a Boolean condition,
using the operator \code{on} or another conditional construct.

It is possible to associate a time constraint with a transition using
the operator \code{wait}. Actually, the ability to express timing
constraints directly in programs is a distinguishing feature of
Fiacre. We illustrate this mechanism in the code below, which
corresponds to the interpretation of the node \PRE of
Listing~\ref{ls:pre}. Basically, a transition constrained by a (time)
interval $I$ can be triggered after a time $\delta$, with
$\delta \in I$, only if its guard stayed continuously valid during
this time. It is this behaviour that inspired our choice of semantics
for temporal laws.

The last notion that we need to introduce is
\emph{components}. Components allow the hierarchical composition of
processes, in the same way as AltaRica nodes can be composed into
classes in order to build more complex behaviours. Thus, a Fiacre
component defines a parallel composition of components and/or
processes using statements of the form \code{par} \vars{P}$_0$
$\parallel \dots \parallel$ \vars{P}$_n$ \code{end}. It can also be
used to restrict the visibility of variables and ports and to define
priorities between communication events.
% Priorities are explicitly declared as constraints between
% communication ports, as in \code{go1 > go2}.
We give an example of a Fiacre component in Listing~\ref{lst:pre2},
where we define an upgraded version of the delay operator \PRE.
%
\lstinputlisting[language=Fiacre,float=bt,captionpos=b,caption=\protect{Wait statement, components and synchronisation on ports.},label=lst:pre2,frame=single]{algos/pre2.fcr}
%
% \lstinputlisting[language=Fiacre,float=bt,captionpos=b,caption=\protect{Example
%   of Fiacre code: declaring timing
%   constraints},label=lst:pre1,frame=single]{algos/pre.fcr}
%
A problem with the implementation of \PRE is that at most one failure
mode can be delayed at a time.
Indeed, if the input of \PRE changes while the state is \code{Full},
then the updated value is not taken into account until after event
\code{pre\_wait} triggers.
%
It is not possible to implement a version that can delay an unbounded
number of events in a bounded time; this would require an unbounded
amount of memory to store the intermediate values. More
fundamentally, this would give rise to undecidable verification
problems (see e.g.~\cite{bouyer2004updatable}). To fix this problem,
we can define a family of operators, \PRE\code{\_k}, that can delay
up to $k$ simultaneous different inputs. Our implementation relies on
a component that uses three process instances: one instance of
\code{front}, that reads messages from the input (variable \code{I}),
and two instances of \code{delay}, that act as buffers for the values
that need to be delayed. Process \code{front} uses the local ports
\code{go1} and \code{go2} to dispatch values to the buffers.

Component \PRE\code{\_2} is enough to model the use case defined in
Sect.~\ref{sec:example-fail-detect-1}. Indeed, any element in the FDI
system may propagate at most two different status changes, one from
\OK to \LOSS and then from \LOSS to \ERR.

\section{Example of a Failure Detection and Isolation System}
\label{sec:example-fail-detect-1}

We study the example of a safety-critical function that illustrates
standard failure propagation problems. We use this example to show
the adverse effects of temporal failure propagation even in the
presence of system safety improvements, in our case the addition of
Failure Detection and Isolation (FDI) capabilities.
%
This example is inspired by the avionic functions that provide
parameters for the Primary Flight Display (PFD), which is located in
the aircraft cockpit. The system of interest is the computer that
acquires sensor measurements and computes the aircraft
\emph{calibrated airspeed} (CAS) parameter. Airspeed is crucial for
pilots: it is taken into account to adjust the aircraft engines'
thrust and it plays a major role in the prevention of overspeed and
stall.
%%
\begin{figure}[thb]
  \centering
  \includegraphics[width=1\textwidth]{figures/AirSpeed_Computation.jpg}
  \caption{Functional and physical views of the airspeed computation
    function.\label{fig:cockpit}}
\end{figure}
%%
CAS is not directly measured by a dedicated sensor, but is computed
as a function of two auxiliary pressure measurements, the static
pressure (Ps) and the total pressure (Pt); that is
$\mathrm{CAS} = f(\mathrm{Pt}, \mathrm{Ps})$.
% \begin{equation*}
%   CAS = C_{SO}\sqrt{5 \cdot \left( \left(\frac{Pt - Ps}{P_0} + 1\right)^{2/7} - 1\right)}
% \end{equation*}
% with $C_{SO}$ and $P_0$ two constants.
% ($C_{SO}$ is the speed of sound under standard day sea level conditions.)
These two measurements come from sensors located on the aircraft
nose, a pressure probe and a Pitot tube.
% Our proposed functional view shows:
% \begin{itemize}
% \item Two external functions \I{1} and \I{2} that measure static
%   and total pressure. They represent the functions allocated to the
%   sensors.
% \item Three inner functions of the system: \F{1} and \F{2} for
%   sensor measurements acquisition by the onboard computer and \F{3}
%   for airspeed computation.
% \item The PFD functions have not been modelled in the sole purpose
%   of simplification.
% \end{itemize}
%%
Our proposed functional view is given in Fig.~\ref{fig:cockpit}.
It consists of two external input functions, \I{1} and \I{2}, that
measure static and total pressure; and three inner functions of the
system: \F{1} and \F{2} for sensor measurement acquisition by the
on-board computer, and \F{3} for airspeed computation. For
simplification purposes, the PFD functions have not been modelled.

Next, we propose a first failure propagation view aimed at
identifying the scenarios leading to an erroneous airspeed
computation and display to the pilot (denoted \ERR). Such a failure
can only be detected if a failure detector is implemented, for
instance by comparing the outputs of different functions. Undetected,
it could mislead the pilot and, consequently, lead to an
inappropriate engine thrust setting. We also want to identify the
scenarios leading to the loss of the measure (denoted \LOSS). In such
a case, the pilot can easily assess that the measure is missing or
false and consequently rely upon another measure to control the
aircraft (note that such redundancy is not modelled). For example, an
airspeed out of bounds---indicating that an airliner has crossed the
sound barrier---is considered to be of kind \LOSS. It can be
understood that scenarios leading to the loss of the airspeed are
less critical than the ones leading to erroneous values.
% We consider two possible kinds of fail status: the ``silent''
% failure of an element (denoted \ERR); or a communication problem
% with one of the functions (denoted \LOSS).
A \LOSS failure is typically a loss of connectivity or the detection
of (clearly) erroneous transmitted values, for instance a variable
out of bounds. It is less critical than an \ERR since it can be
detected immediately and isolated. On the other hand, an \ERR failure
can only be detected if a failure detector is implemented, for
instance by comparing the outputs of different functions. In the
following, we consider that the statuses are ordered by their level
of severity, that is \ERR $<$ \LOSS $<$ \OK.

\subsubsection{Safety model of the architecture without FDI}
%%\label{sec:error-prop-simple}

For the sake of understandability, we choose to study a simplified
version of an FDI system. While simple, this system is representative
of failure propagation mechanisms found in actual on-board systems.
%%
We provide an AltaRica model corresponding to the functional view of
the CAS function in Fig.~\ref{fig:example0}. This model, tailored to
study failure propagation, comprises:
% (All the functions are modelled according to the node \FUNC
% introduced in Sect.~\ref{sec:sect2}.)
%
two external functions, \I{1} and \I{2}, that have no input (so, in
their nominal state, the output is set to \OK); two inner functions,
\F{1} and \F{2}, which are instances of the node \FUNC described in
Sect.~\ref{sec:sect2}; and a function, \F{3}, that is the composition
of two basic elements: a multiplexer, \MIN, representing the
dependence of the output of \F{3} on its two inputs, and a computing
element, \F{3Processing}, that represents the computation of the
airspeed. \F{3Processing} is also an instance of node \FUNC.
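To make the component-based construction of Sect.~\ref{sec:sect2}
concrete, the interconnections of Fig.~\ref{fig:example0} amount to
global assertions equating the input flow of each node with the
output flow of its predecessor. With our own (hypothetical) names for
the ports of \MIN, they read:
\begin{align*}
  \text{\F{1.I}} &= \text{\I{1.O}}, & \text{\F{2.I}} &= \text{\I{2.O}},\\
  \text{\MIN.I1} &= \text{\F{1.O}}, & \text{\MIN.I2} &= \text{\F{2.O}},\\
  \text{\F{3Processing.I}} &= \text{\MIN.O}.
\end{align*}
In the ``flattened'' model used for verification, these equalities
simply become part of the \code{assert} block of a single top-level
node.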
\begin{figure}[t]
  \centering
  \begin{tikzpicture}
    \node[anchor=south west,inner sep=0] at (0,0)
    { \includegraphics[width=0.75\textwidth]{figures/Modele0}};
    \draw[red!50,ultra thick, dashed] (7,2.85) rectangle (2.6,0.5);
    \node[red] at (6.5,2.5) {\texttt{\large F3}};
  \end{tikzpicture}
  \caption{A simple example of failure propagation.\label{fig:example0}}
\end{figure}
%
In a single failure scenario, \MIN propagates the failure coming
either from one input or the other. In case of multiple failures,
% when both inputs have identical failures, the failure propagates;
% for instance, both inputs set to \LOSS lead to an output of \LOSS.
% But when it receives an erroneous fail then the multiplexer will
% propagate an \ERR. The remaining failure combination drew our
% attention during \F{3} modelling.
when different failures propagate, one being \LOSS and the other
being \ERR---and without appropriate FDI---the system outcome is
uncertain.
%
Solving this uncertainty would require a detailed behavioural model
of the on-board computer and a model of all the possible failure
modes, which is rarely feasible with a sufficient level of
confidence, except for time-tested technology. It is therefore usual
to retain the state with the most critical effect, that is to say:
\MIN can be thought of as a simple logic operator that computes the
minimum of its two inputs, and the output of \F{3} is \ERR.
% Since a \LOSS can be ``easily'' detected, we want to avoid a
% situation where the whole system propagates \ERR while one of its
% internal elements propagates a \LOSS. The rationale is that we
% should be able to isolate (or quarantine) the system when we know
% for sure that one of its elements does not reliably respond to
% commands. This can be expressed by the following safety property.
%%
% In the next section, we study the system with the addition of FDI
% capabilities.
Our goal is to prevent the computation of an erroneous airspeed while
one of \F{3}'s input signals is lost. The rationale is that the
system should be able to automatically passivate the airspeed when it
detects that one of its input signals is not reliable. This behaviour
can be expressed with the following property:
%%
\begin{prop}[Loss Detection and Instantaneous Propagation]\label{prop:1}
  A function is \emph{loss detection safe} if, when in nominal mode,
  it propagates a \LOSS whenever one of its input nodes propagates a
  \LOSS.
\end{prop}
% This is a simple system used to detect non-nominal behaviours and
% to trigger observables in order to isolate the failure mode. (The
% figures are actual screen captures of models edited with Cecilia
% OCAS.)
% \subsubsection{Analysis}
We can show that our example of Fig.~\ref{fig:example0} does not meet
this property using the \emph{Sequence Generation} tool available in
Cecilia OCAS.
%%
% This system has several sources of unreliability. Both its sources
% and the computing elements (\F{1}, \F{2} and \F{3Processing}) can
% experience spontaneous failures, meaning that their outputs can
% transition instantaneously to \LOSS or \ERR. Errors are
% irrecoverable; once a function becomes faulty, its output will
% always stay equal to \LOSS or \ERR. Likewise, if both inputs are
% ``nominal'' then the output of \MIN is \OK. We assume that the
% multiplexer cannot undergo spontaneous failures.
%%
% This can be tested using a perfect detector, \FF{3}{Loss}, placed
% at the output of the system.
%%
To this end, we compute the minimal cuts for the target equation
$((\text{\F{1.O.Loss}} \vee \text{\F{2.O.Loss}}) \wedge \neg
\text{\F{3.O.Loss}})$, meaning the scenarios where \F{3} does not
propagate a \LOSS when one of \F{1} or \F{2} does. Hence function
\F{3} is {loss detection safe} if and only if this set is
empty.\footnote{To check the safety of \F{3}, we need to perform the
  same analysis for all the elements that can propagate a loss
  failure, not only \F{1}.} In our example, once we eliminate the
cases where \F{3} is not nominal (that is, when \F{3Processing} is in
an error state), we find eight minimal cuts, all of order $2$:
%%
{%\centering
\begin{small}
\begin{tabular}{c@{\quad\quad}c}
\begin{minipage}[t]{0.42\linewidth}
\begin{verbatim}
{'F1.fail_err', 'F2.fail_loss'}
{'F1.fail_err', 'I2.fail_loss'}
{'F1.fail_loss', 'F2.fail_err'}
{'F1.fail_loss', 'I2.fail_err'}
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{0.42\linewidth}
\begin{verbatim}
{'F2.fail_err', 'I1.fail_loss'}
{'F2.fail_loss', 'I1.fail_err'}
{'I1.fail_err', 'I2.fail_loss'}
{'I1.fail_loss', 'I2.fail_err'}
\end{verbatim}
\end{minipage}\\
\end{tabular}\\
\end{small}
}
%
Each of these cuts leads to a trivial violation of our safety
property. For instance, we obtain the cut
\code{\{'F1.fail\_err', 'F2.fail\_loss'\}} that corresponds to the
case where \F{1} propagates \ERR and \F{2} propagates \LOSS, in which
case \F{3} propagates \ERR. This is exactly the situation that we
want to avoid. In the following section, we correct the behaviour of
\F{3} by considering a new architecture based on detectors and a
switch that isolates the output of \F{3} when faulty.

\begin{figure}[bt]
  \centering
  \begin{tikzpicture}
    \node[anchor=south west,inner sep=0] at (0,0)
    { \includegraphics[width=\textwidth]{figures/Modele2.png}};
    \draw[red!50,ultra thick, dashed] (10.8,3.4) rectangle (2,0);
    \node[red] at (10.4,3.1) {\texttt{\large F3}};
  \end{tikzpicture}
  \caption{Model of an FDI function with a switch and an
    alarm.\label{fig:example1}}
\end{figure}%\vspace*{-1cm}
%\enlargethispage{\baselineskip}

\subsubsection{Safety model of the architecture with FDI}
%%\label{sec:simple-fail-detect}
%%
%In order to satisfy Prop.~\ref{prop:1},
The updated implementation of \F{3} (see Fig.~\ref{fig:example1})
uses two perfect detectors, \F{1Loss} and \F{2Loss}, that can detect
a loss failure event on the inputs of the function. The (Boolean)
outputs of these detectors are linked to an OR gate
({\ttfamily AtLeastOneLoss}) which triggers an \ALARM when at least
one of the detectors outputs true. The alarm commands a \SWITCH: the
output of \SWITCH is the same as \MIN, unless \ALARM is activated, in
which case it propagates a \LOSS failure. The alarm can fail in two
modes, either continuously signalling a \LOSS or never being
activated. The schema in Fig.~\ref{fig:example1} also includes two
delay operators, \D{1} and \D{2}, that model propagation delays at
the inputs of the detectors; we will not consider them for now, but
come back to these timing constraints at the end of the section.
% \TODO{why do we need Alarm! why can the alarm fail! we could plug
% the or gate directly to the switch}
% - Permanent failure of the alarm.
The FDI function---with a switch and an alarm---is a stable scheme
for failure propagation: when in nominal mode, it detects all the
failures of the system and it is able to disambiguate the case where
its inputs contain both \ERR and \LOSS. Once again, this can be
confirmed using the Sequence Generation tool.
If we repeat the same analysis as before---and if we abstract away
the delay nodes---we find $56$ minimal cuts, all involving a failure
of either \ALARM or \F{3Processing}, i.e. a non-nominal mode. This
means that, in an untimed model, our new implementation of \F{3}
satisfies the loss detection property, as desired.
% \begin{figure}[hbt]
%   \centering
%   \begin{tikzpicture}
%     \node[anchor=south west,inner sep=0] at (0,0)
%     { \includegraphics[width=\textwidth]{figures/Modele2.png}};
%     \draw[red,ultra thick, dashed] (10.6,3.6) rectangle (2.2,0);
%     \node[red] at (10,3) {\texttt{\huge F3}};
%   \end{tikzpicture}
%   \caption{Taking into consideration propagation delays in the FDI
%     system.\label{fig:example2}}
% \end{figure}
% Unfortunately, this model makes unrealistic assumptions about the
% instantaneous failure propagation of detection. Indeed, it may be
% the case that the outputs of \F{1} and \F{2} are delayed as they
% are propagated to the observers \F{1Loss} and \F{2Loss}. Next, we
% study new failure modes that may arise from this situation and how
% we can detect them.
%%
% \subsection{Timed Safety model of the architecture with FDI}
% \label{sec:timed-safety-model}
%%
% In this section, we consider a new AltaRica model that takes into
% account propagation delays. To this end, we may insert two
% instances of the delay node (the element \PRE defined in
% Sect.~\ref{sec:sect2}) before the detectors \F{1Loss} and \F{2Loss}.
Even so, it is easy to find a timed scenario where the safety
property is violated. Assume now that \F{1} and \F{2} propagate the
statuses \LOSS and \ERR, respectively, at the same date. In such a
case, and considering possible latencies, while \ERR reaches \MIN
instantaneously, the output of \F{1} might reach \F{1Loss} only at a
later date.
% or the output of \F{2} might be delayed before reaching \F{2Loss}.
This leads to a transient state where the alarm is not activated
whereas the output of \MIN is set to \ERR. This brings us back to the
same dreaded scenario as in our initial model.
%%
In particular, this scenario corresponds to the cut
\code{\{'F1.fail\_loss', 'F2.fail\_err'\}}, which is not admissible
in an untimed semantics.
%%
This example suggests that we need a more powerful method to compute
the set of cuts in the presence of temporal constraints. On the other
hand, we may also advocate that our safety property is too limiting
in this context, where perfect synchronicity of events is
rare. Actually, it can be proven that the output of \F{3} will
eventually converge to a loss detection and isolation mode (assuming
that \F{3} stays nominal and that its inputs stay stable).
%
To reflect this situation, we propose an improved safety property
that takes into account the temporal properties of the system:
%%
\begin{prop}[Loss Detection Convergent]\label{prop:2}
  A function is \emph{loss detection convergent} if (when in nominal
  mode) there exists a duration $\Delta$ such that it continuously
  outputs a \LOSS after the date $\delta_0 + \Delta$ whenever at
  least one of its input nodes continuously propagates a \LOSS from
  $\delta_0$ onward. The smallest possible value for $\Delta$ is
  called the \emph{convergence latency} of the function.
\end{prop}

Hence, if the latency needed to detect the loss failure can be
bounded, and if the bound is sufficiently small safety-wise, we can
still deem our system safe.
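Property~\ref{prop:2} can be stated more formally. Writing
$\mathit{in}_i(t)$ for the status propagated by the $i$-th input of
the function at date $t$ and $\mathit{out}(t)$ for the status of its
output (notations ours), a nominal function is loss detection
convergent when
\[
  \exists \Delta \,.\; \forall \delta_0 \,.\;
  \Bigl( \exists i \,.\; \forall t \geq \delta_0 \,.\;
  \mathit{in}_i(t) = \text{\LOSS} \Bigr)
  \;\Rightarrow\;
  \Bigl( \forall t \geq \delta_0 + \Delta \,.\;
  \mathit{out}(t) = \text{\LOSS} \Bigr)
\]
and the convergence latency is the least such $\Delta$.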
In the example in Fig.~\ref{fig:cockpit}, this property can indicate
for how long an erroneous airspeed is shown on the PFD to the pilot
before the failure is isolated.

In the next section, we use our approach to generate a list of
``timed cuts'' (as model-checking counterexamples) that would have
exposed the problems described above. We also use model-checking to
compute the {convergence latency} for the node \F{3}. In this simple
example, we can show that the latency is equal to the maximal
propagation delay at the input of the detectors. The value of the
latency could be much harder to compute in a more sophisticated
scenario, where delays can be chained and/or depend on the internal
state of a component.

\section{Compilation of AltaRica and Experimental Evaluation}\label{sec:sect4}

We have implemented the transformation outlined in
Sect.~\ref{sec:sect3}; the result is a compiler that automatically
generates Fiacre code from an AltaRica model. The compilation process
relies on the fact that it is possible to ``flatten'' a composition
of interconnected nodes into an intermediate representation, called a
\emph{Guarded Transition System} (GTS)~\cite{rau08}. A GTS is very
similar to a (single) AltaRica node and can therefore be encoded in a
similar way. Our tool is built using the codebase of the
model-checker EPOCH~\cite{epoch}, which provides the functionalities
for the syntactic analysis and the linking of AltaRica code.

After compilation, the Fiacre code can be checked using
Tina~\cite{BRV04}. As seen in Sect.~\ref{sec:tina}, the core of the
Tina toolset is an exploration engine that can be exploited by
dedicated model-checking and transition analyser tools. Tina offers
several abstract state space constructions that preserve specific
classes of properties, like absence of deadlocks, reachability of
markings, or linear- and branching-time temporal properties. These
state space abstractions are vital when dealing with timed systems,
which generally have an infinite state space (due to the use of a
dense time model). In our experiments, most of the requirements can
be reduced to reachability properties, so we can use on-the-fly
model-checking techniques.
% and very ``aggressive'' abstractions.

\begin{figure}[thb]
  \centering
  \includegraphics[width=\textwidth]{figures/passerelle.png}
  \caption{Graphical representation of the AltaRica/Fiacre
    toolchain.\label{fig:toolchain}}
\end{figure}

We interpret a GTS as a Fiacre process whose parameters consist of
all its state and flow variables. Each transition
\lstinline[language=Altarica]{g |-evt-> e} in the GTS is
(bijectively) encoded by a transition that matches the guard \code{g}
and updates the variables to reflect the effect of \code{e} plus the
assertions. Each transition can be labelled with time and priority
constraints to take into account the \code{extern} declarations of
the node. This translation is straightforward since all the operators
available in AltaRica have a direct equivalent in Fiacre. Hence every
state/transition in the GTS corresponds to a unique state/transition
in Fiacre. This means that the state (reachability) graph of a GTS
and its associated Fiacre model are isomorphic. This is a very strong
and useful property for formal verification, since we can very easily
transfer verification artefacts (such as counterexamples) from one
model back to the other.

The close proximity between AltaRica and Fiacre is not really
surprising.
First of all, both languages have similar roots in process algebra
theory and share very similar synchronisation mechanisms. More
deeply, they share formal models that are very close: AltaRica
semantics is based on the product of ``communicating automata'',
whereas the semantics of Fiacre can be expressed using (a time
extension of) one-safe Petri nets. The main difference is that
AltaRica provides support for defining probabilities on events,
whereas Fiacre is targeted towards the definition of timing aspects.

This proximity in both syntax and semantics is an advantage for the
validation of our tool, because it means that our translation should
preserve the semantics of AltaRica on models that do not use extern
laws to define probabilities and time. We have used this property to
validate our translation by comparing the behaviours of the models
obtained using the Cecilia OCAS simulation tool with those of their
translations. For instance, in the case of the CAS system of
Sect.~\ref{sec:example-fail-detect-1}, we can compute the set of cuts
corresponding to Safety Property~\ref{prop:1} (loss detection) by
checking an invariant of the form
$\bigl((\text{\F{1.O}} = \text{\code{Loss}}) \vee (\text{\F{2.O}} =
\text{\code{Loss}})\bigr) \Rightarrow (\text{\F{3.O}} =
\text{\code{Loss}})$.
%
%%\TODO{Check if it is necessary to mention the tool, as opposed to counterexamples}
In both cases---with and without FDI---we are able to compute the
exact same set of cuts as Cecilia OCAS. This is done using the
model-checker for modal mu-calculus provided with Tina, which can
list all the counterexamples for a (reachability) formula as a
graph. More importantly, we can use our approach to compute the timed
counterexample described at the end of
Sect.~\ref{sec:example-fail-detect-1}. All these computations can be
done in less than a second on our test machine.

\subsection{Empirical evaluation}

We have used our toolchain to generate the reachable state space of
several AltaRica models\footnote{All the benchmarks tested in this
  paper are available at
  \url{https://w3.onera.fr/ifa-esa/content/model-checking-temporal-failure-propagation-altarica}.}:
RUDDER describes a control system for the rudder of an A340
aircraft~\cite{bernard2007experiments}; ELEC refers to three
simplified electrical generation and power distribution systems for a
hypothetical twin jet aircraft; the HYDRAU model describes a
hydraulic system similar to the one of the A320
aircraft~\cite{bieber2002combination}. The results are reported in
Table~\ref{table:2}. In each case, we indicate the time needed to
generate the whole state space (in seconds) and the number of states
and transitions explored. For information, we also give the number of
state variables as reported by Cecilia OCAS. All tests were run on an
Intel 2.50~GHz CPU with 8~GB of RAM running Linux.

In the case of the HYDRAU model, we stopped the exploration after
$30$ minutes and more than $9 \times 10^{9}$ generated states. The
state space is large in this benchmark because it models the
physical, a-causal propagation of a leak: a leak can impact both
upstream and downstream components and trigger a reconfiguration,
multiplying the number of reachable states. Moreover, the topology of
the system must be reconfigured after the detection of some failures;
this dynamic reconfiguration, combined with a-causal propagation,
further increases the size of the state space. In all cases, the time
needed to generate the Fiacre code is negligible, of the order of
10~ms.
\begin{table}[tb]
  \centering
  \begin{tabular}{|l|c|c|c|c|}
    \hline
    \textbf{Model} & \textbf{time} (s) & \textbf{\# states} &
    \textbf{\# trans.} & \textbf{\# state vars}\\ \hline
    RUDDER & 0.85 & $3.3\times10^4$ & $2.5\times10^5$ & 15 \\ \hline
    ELEC 01 & 0.40 & 512 & $2.3\times10^3$ & 9 \\ \hline
    ELEC 02 & 0.40 & 512 & $2.3\times10^3$ & 9 \\ \hline
    ELEC 03 & 101 & $4.2\times10^6$ & $4.6\times10^7$ & 22 \\ \hline
    HYDRAU & 1800 & --- & --- & 59 \\ \hline
    CAS & 0.40 & 729 & $2.9\times10^3$ & 6 \\ \hline
    CAS with \PRE & 46 & $9.7\times10^5$ & $4.3\times10^6$ & 10 \\ \hline
  \end{tabular}
  \vspace*{1em}
  \caption{State space size and generation time for several use
    cases.\label{table:2}}
\end{table}

Our models also include two versions of the complete CAS system
(including the detectors, the alarm and the switch), both with and
without the delay functions \D{1} and \D{2}. The ``CAS with \PRE''
model is our only example that contains timing constraints. In this
case, we give the size of the state class graph generated by Tina,
that is an abstract version of the state space that preserves LTL
properties.

We can use Tina to check temporal properties on this example. More
precisely, we can check that \F{3} has the \emph{loss detection
  convergence} property. To this end, a solution is to add a time
observer to check the maximal duration between two events: first, an
\code{obs\_start} event is triggered when the output of \F{1} or
\F{2} changes to \LOSS; then an \code{obs\_end} event is triggered
when the output of \F{3} changes to \LOSS. The observer also has a
third transition (\code{obs\_err}) that acts as a timeout; it is
associated with a time interval $I$ and is enabled concurrently with
\code{obs\_end}. Hence, the observer ends up in the state yielded by
\code{obs\_err} when the output of \F{3} deviates from its expected
value for more than $d$ units of time, with $d \in I$.

We have used this observer to check that the \emph{convergence
  latency} of the CAS system equals $3$ when we assume that the
delays are in the time interval $[1, 3]$: \code{obs\_err} is fireable
for any value of $d$ in the interval $[0,3]$, while it is not
fireable if $I = ]3, \infty[$.
% It is enough to choose $I = [0,3]$.
These two safety properties can be checked on the system (plus the
observer) in less than $0.6$~s.
% \begin{wraptable}{r}{0.35\textwidth}
%   %\begin{table}[bth]
%   \centering
%   \vspace*{-7mm}
%   \begin{tabular}{|c|c|l}
%     \cline{1-2}
%     \textbf{delay} $\mathbf\tau$ & \textbf{safety property} & \\ \cline{1-2}
%     % {$\quad[4,4]\quad$} & ok & \\ \cline{1-2}
%     {$]3, \infty[$} & holds & \\ \cline{1-2}
%     {[}3,3{]} & does not hold & \\ \cline{1-2}
%     {[}0,3{]} & does not hold & \\ \cline{1-2}
%   \end{tabular}
%   \caption{Latencies check on Eq.~\eqref{eq:mc1}
%     with $\delta = [1,3]$.\label{table:1}}
%   %\end{table}
% \end{wraptable}
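For reference, we sketch below one possible Fiacre encoding of such a
time observer, using only the constructs introduced in
Sect.~\ref{sec:sect3} (shared variables, \code{on} guards,
\code{select} and \code{wait}). This is a minimal sketch, not the
exact code used in our experiments: the process and state names are
ours, and we assume that the enumeration value \code{Loss} comes from
the type \code{FailureType} of Listing~\ref{lst:fiacre}.

\begin{lstlisting}[language=Fiacre,frame=single,captionpos=b,caption=\protect{A sketch of a time observer for loss detection convergence.},label=lst:observer]
process Observer (&F1O : FailureType, &F2O : FailureType,
                  &F3O : FailureType) is
  states idle, watch, err
  from idle
    /* obs_start: a Loss appears on one of the monitored inputs */
    on ((F1O = Loss) or (F2O = Loss));
    to watch
  from watch
    select
       /* obs_end: F3 isolates the failure in time */
       on (F3O = Loss); to idle
    [] /* obs_err: timeout, convergence takes more than 3 time units */
       wait ]3,...[; to err
    end
\end{lstlisting}

The two branches of the \code{select} statement are enabled
concurrently, which matches the informal description above: the
observer reaches state \code{err} exactly when the output of \F{3}
fails to switch to \code{Loss} within the prescribed delay.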