This project is a demonstrator tool, developed by the MOISE project, that translates timed AltaRica models into Fiacre models. This translation makes it possible to use model checkers such as Tina to prove properties of the models. The repository contains the translator tool.
\section{Example of a Failure Detection and Isolation System}
\label{sec:example-fail-detect-1}
We study the example of a safety critical function that illustrates
standard failure propagation problems. We use this example to show the
adverse effects of temporal failure propagation even in the presence
of % system safety improvements. In our case the addition of
Failure Detection and Isolation (FDI) safety capabilities.
This example is inspired by the avionic functions that provide
parameters for the Primary Flight Display (PFD) located in the
aircraft cockpit. The system of interest is the computer that acquires
sensor measurements and computes the aircraft \emph{calibrated
airspeed} (CAS) parameter. Airspeed is crucial for pilots: it is taken
into account to adjust the engines' thrust and it plays a major role
in the prevention of overspeed and stall.
%%
\begin{figure}[tb]
\centering
\includegraphics[width=0.99\textwidth]{figures/AirSpeed_Computation.jpg}
\caption{Functional and physical views of the airspeed computation function.\label{fig:cockpit}}
\end{figure}
%%
CAS is not directly measured by a dedicated sensor, but is computed as
a function of two auxiliary pressure measurements, the static pressure
(Ps) and total pressure (Pt); that is
$\mathrm{CAS} = f(\mathrm{Pt}, \mathrm{Ps})$.
% \begin{equation*}
% CAS = C_{SO}\sqrt{5 \cdot \left( \left(\frac{Pt - Ps}{P_0} + 1\right)^{2/7} - 1\right)}
% \end{equation*}
% with $C_{SO}$ and $P_0$ two
% constants. % the speed of sound under standard day sea level conditions
These two measurements come from sensors located on the aircraft nose:
a pressure probe and a pitot tube.
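For illustration only, the following Python sketch evaluates this
relation with the standard subsonic formula
$\mathrm{CAS} = C_{S0}\sqrt{5\left(\left(\frac{\mathrm{Pt}-\mathrm{Ps}}{P_0}+1\right)^{2/7}-1\right)}$,
where $C_{S0}$ is the speed of sound and $P_0$ the static pressure
under standard sea-level conditions; it is not part of the safety
model and the numerical values are merely indicative.
\begin{verbatim}
# Illustrative sketch (not part of the AltaRica model): CAS computed
# from total pressure Pt and static pressure Ps with the standard
# subsonic formula. C_S0 and P_0 are the usual sea-level values.
C_S0 = 661.47   # speed of sound, in knots
P_0  = 1013.25  # static pressure, in hPa

def cas(pt, ps):
    """Calibrated airspeed from total pressure pt and static pressure ps."""
    qc = pt - ps                      # impact pressure
    return C_S0 * (5.0 * ((qc / P_0 + 1.0) ** (2.0 / 7.0) - 1.0)) ** 0.5

print(round(cas(1063.25, 1013.25), 1))   # qc = 50 hPa, about 174 knots
\end{verbatim}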
% Our proposed functional view shows:
% \begin{itemize}
% \item Two external functions \I{1} and \I{2} that measure static and total pressure. They represent the functions allocated to the sensors.
% \item Three inner functions of the system: \F{1} and \F{2} for sensor measurements acquisition by the onboard computer and \F{3} for airspeed computation.
% \item The PFD functions have not been modelled in the sole purpose of simplification.
% \end{itemize}
%
%%
Our proposed functional view is given in Fig.~\ref{fig:cockpit}. It
consists of two external functions, \I{1} and \I{2}, that measure static
and total pressure, and three inner functions of the system: \F{1} and
\F{2} for the acquisition of the sensor measurements by the onboard
computer, and \F{3} for the airspeed computation. For the sake of
simplicity, the PFD functions have not been modelled.
Next, we propose a first failure propagation view aimed at
identifying the scenarios leading to an erroneous airspeed being
computed and displayed to the pilot (denoted \ERR). Such a failure can
only be detected if a failure detector is implemented, for instance by
comparing the outputs of different functions. Undetected, it could
mislead the pilot and, consequently, lead to an inappropriate engine
thrust setting. We also want to identify the scenarios leading to
% (clearly) erroneous transmitted values, for instance a variable out of bounds or
the loss of the measure (denoted \LOSS). In such a case, the pilot can
easily assess that the measure is missing or false and consequently
rely upon another measure to control the aircraft (note that such
redundancy is not modelled). For example, an airspeed out of
bounds---indicating that an airliner has crossed the sonic
barrier---is considered to be of kind \LOSS. It follows that scenarios
leading to the loss of the airspeed are less critical than the ones
leading to erroneous values.
% We consider two possible kinds of fail status: the ``silent'' failure of
%an element (denoted \ERR); or a communication problem with one of the
% functions (denoted \LOSS).
% A failure is typically a loss of connectivity or the
% detection of (clearly) erroneous transmitted values, for instance a
% variable out of bounds. A \LOSS fail is less critical than an \ERR
% since it can be detected immediately and isolated. On the other hand,
% an \ERR fail can only be detected if a failure detector is
% implemented, for instance by comparing the outputs of different
% functions. In the remainder of this paper, we consider that the status
% are ordered by their level of severity, that is \ERR $<$ \LOSS $<$\OK.
\subsubsection*{Safety model of the architecture without FDI.}
%%\label{sec:error-prop-simple}
%For the sake of brevity, we choose to study a simplified version of
%FDI system. While simple, this system is representative of failure
%propagation mechanisms found in actual onboard systems. To simplify
%the presentation, we start by describing the system without any
%safety-related elements, see Fig.~\ref{fig:example0}.
%%
%
We provide an AltaRica model corresponding to the functional structure
of the CAS function in Fig.~\ref{fig:example0}. This model, tailored
to the study of failure propagation, is composed of:
% ( All the functions are modelled according to the node function
% introduced in the Sect.~\ref{sec:sect2}.
%
two external functions, \I{1} and \I{2}, that have no input (so, in
their nominal state, their output is set to \OK); two inner functions,
\F{1} and \F{2}, which are instances of the node \FUNC described in
Sect.~\ref{sec:sect2}; and a function, \F{3}, that is the composition
of two basic elements: a multiplexer, \MIN, representing the
dependence of the output of \F{3} on its two inputs, and a computing
element, \F{3Processing}, that represents the computation of the
airspeed. \F{3Processing} is also an instance of the node \FUNC.
\begin{figure}[t]
\centering
\begin{tikzpicture}
\node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=0.7\textwidth]{figures/Modele0}};
\draw[red!50,ultra thick, dashed] (6.7,2.6) rectangle (2.6,0.5);
\node[red] at (6.3,2.3) {\texttt{\large F3}};
\end{tikzpicture}
\caption{A simple example of failure propagation.\label{fig:example0}}
\end{figure}
%
In a single failure scenario, \MIN propagates the failure coming from
either one input or the other. In case of multiple failures, when both
inputs carry the same failure, this failure propagates; for instance,
both inputs set to \LOSS lead to an output of \LOSS.
% But when it receives an erroneous fail then the multiplexer will
% propagate an \ERR. The remaining failure combination drew our
% attention during \F{3} modelling.
On the other hand, when two different failures propagate, one being
\LOSS and the other \ERR---and without appropriate FDI---the outcome
of the system is uncertain.
%
Solving this uncertainty would require a detailed behavioural model of
the onboard computer and a model for all the possible failure modes,
which is rarely feasible with a sufficient level of confidence, except
for time-tested technology. Given this uncertainty, it is usual to
retain the state with the most critical effect, that is to say:
% in this case, \MIN can be thought of as a simple logic operator that
% computes the minimum of its two inputs, and
the output of \F{3} is \ERR.
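To make this choice concrete, the following Python sketch (ours, not
the AltaRica node) orders the fail statuses by severity---\ERR $<$
\LOSS $<$ \OK---and treats \MIN as the minimum over this ordering,
i.e., the most critical of its two inputs.
\begin{verbatim}
# Illustrative sketch: fail statuses ordered by severity and a
# pessimistic multiplexer that keeps the most critical (smallest) one.
from enum import IntEnum

class Status(IntEnum):
    ERR  = 0   # erroneous value, hardest to detect
    LOSS = 1   # lost or clearly invalid value
    OK   = 2   # nominal

def mux_min(a: Status, b: Status) -> Status:
    """Retain the state with the most critical effect."""
    return min(a, b)

assert mux_min(Status.LOSS, Status.LOSS) == Status.LOSS
assert mux_min(Status.OK,   Status.LOSS) == Status.LOSS
assert mux_min(Status.LOSS, Status.ERR)  == Status.ERR   # the dreaded case
\end{verbatim}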
% Since a \LOSS can be ``easily'' detected, we want to avoid a situation
% where the whole system propagates \ERR while one of its
%internal element propagates a \LOSS. The rationale is that we should be
% able to isolate (or quarantine) the system when we know for sure that
% one of its element does not reliably respond to commands. This can be
% expressed by the following safety property.
%%
% In the next section, we study the system with the addition of FDI
% capabilities.
Our goal is to prevent the computation of an erroneous airspeed while
one of \F{3}'s input signals is lost. The rationale is that the system
should be able to automatically passivate the airspeed output when it
detects that one of its input signals is not reliable. This behaviour
can be expressed with the following property.
%The goal of the safety engineer is to avoid a situation where \F{3}
%propagates an erroneous value while one of its input propagates a
%\LOSS. The rationale is that we should be able to isolate (or
% quarantine) the function when we can detect that one of its element is
% not reliable. This can be expressed by the following safety property.
%%
\begin{prop}[Loss Detection and Instantaneous Propagation]\label{prop:1}
A function is \emph{loss detection safe} if, when in nominal mode, it
propagates a \LOSS whenever one of its input nodes propagates a
\LOSS.
\end{prop}
% This is a simple system used to detect non-nominal behaviours
% and to trigger observables in order to isolate the failure mode.
% ({The figures are actual screen capture of models edited with Cecilia
% OCAS.})
% \subsubsection{Analysis}
We can show that our example of Fig.~\ref{fig:example0} does not meet
this property using the \emph{Sequence Generation} tool available in
Cecilia OCAS.
%%
% This system has several sources of unreliability. Both its sources and
% the computing elements (\F{1}, \F{2} and \F{3Processing}) can
% experience spontaneous failures, meaning that their outputs can
% transition instantaneously to \LOSS or \ERR. Errors are irrecoverable;
% once a function becomes faulty, its output will always stay equal to
% \LOSS or \ERR. Likewise, if both inputs are ``nominal'' then the
% output of \MIN is \OK. We assume that the multiplexer cannot undergo
% spontaneous failures.
%%
% This can be tested using a perfect detectors, \FF{3}{Loss}, placed at
% the output of the system.
%%
To this end, we compute the minimal cuts for the target equation
$((\text{\F{1.O.Loss}} \vee \text{\F{2.O.Loss}}) \wedge \neg
\text{\F{3.O.Loss}})$, that is, the scenarios where \F{3} does not
propagate a \LOSS although one of \F{1} or \F{2} does. Hence, function
\F{3} is {loss detection safe} if and only if this set of cuts is
empty.
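This check can be reproduced, for this small example, on an untimed
abstraction of the model. The following Python sketch (ours, not the
Sequence Generation tool) assumes that a failed function outputs its
own fail status, that a nominal function forwards its input, and that
\MIN retains the most critical status; it enumerates the failure
combinations of order at most two and keeps the minimal ones.
\begin{verbatim}
# Illustrative sketch: brute-force search for the minimal cuts of the
# untimed model, under the stated abstraction of the node behaviours.
from itertools import combinations, product

OK, LOSS, ERR = "OK", "LOSS", "ERR"
SEVERITY = {ERR: 0, LOSS: 1, OK: 2}
FAILABLE = ["I1", "I2", "F1", "F2"]           # F3Processing kept nominal

def func(state, inp):
    """Node FUNC: forwards its input when nominal, otherwise its failure."""
    return inp if state == OK else state

def violates(failures):
    """(F1.O.Loss or F2.O.Loss) and not F3.O.Loss, with F3 itself nominal."""
    f1 = func(failures.get("F1", OK), failures.get("I1", OK))
    f2 = func(failures.get("F2", OK), failures.get("I2", OK))
    mux = min(f1, f2, key=SEVERITY.get)       # pessimistic multiplexer MIN
    f3 = func(OK, mux)                        # F3Processing stays nominal
    return (f1 == LOSS or f2 == LOSS) and f3 != LOSS

cuts = []
for order in (1, 2):
    for comps in combinations(FAILABLE, order):
        for modes in product((LOSS, ERR), repeat=order):
            cut = dict(zip(comps, modes))
            minimal = not any(c.items() <= cut.items() for c in cuts)
            if minimal and violates(cut):
                cuts.append(cut)

print(len(cuts), "minimal cuts")              # 8 minimal cuts, all of order 2
for cut in cuts:
    print(cut)
\end{verbatim}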
% \footnote{To check the safety of \F{3}, we need to perform the same
% analysis for all the elements that can propagate a loss fail, not
% only \F{1}.}.
In our example, once we eliminate the cases where \F{3} is not nominal
(that is when \F{3Processing} is in an error state), we find eight
minimal cuts, all of order $2$. In the following section, we correct
the behaviour of \F{3} by considering a new architecture based on
detectors and a switch to isolate the output of \F{3} when faulty.
%%
% {%\centering
% \begin{small}
% \begin{tabular}{c@{\quad\quad}c}
% \begin{minipage}[t]{0.42\linewidth}
% \begin{verbatim}
% {'F1.fail_err', 'F2.fail_loss'}
% {'F1.fail_err', 'I2.fail_loss'}
% {'F1.fail_loss', 'F2.fail_err'}
% {'F1.fail_loss', 'I2.fail_err'}
% \end{verbatim}
% \end{minipage}
% &
% \begin{minipage}[t]{0.42\linewidth}
% \begin{verbatim}
% {'F2.fail_err', 'I1.fail_loss'}
% {'F2.fail_loss', 'I1.fail_err'}
% {'I1.fail_err', 'I2.fail_loss'}
% {'I1.fail_loss', 'I2.fail_err'}
% \end{verbatim}
% \end{minipage}\\
% \end{tabular}\\
% \end{small}
% }
%%
% Each of these cuts lead to trivial violations of our safety
% property. For instance, we obtain the cut \code{\{'F1.fail\_err',
% 'F2.fail\_loss'\}} that corresponds to the case where \F{1}
% propagates \ERR and \F{2} propagates \LOSS, in which case \F{3}
% propagates \ERR. This is exactly the situation that we want to
% avoid.
\begin{figure}[bt]
\centering
\begin{tikzpicture}
\node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=\textwidth]{figures/Modele2.png}};
\draw[red!50,ultra thick, dashed] (10.8,3.4) rectangle (2,0);
\node[red] at (10.4,3.1) {\texttt{\large F3}};
\end{tikzpicture}
\caption{Model of a FDI function with a switch and an alarm.\label{fig:example1}}
\end{figure}%\vspace*{-1cm}
%\enlargethispage{\baselineskip}
\subsubsection*{Safety model of the architecture with FDI.}
%%\label{sec:simple-fail-detect}
%%
%In order to satisfy the Prop.~\ref{prop:1},
We update our implementation of \F{3} (see Fig.~\ref{fig:example1})
using two perfect detectors, \F{1Loss} and \F{2Loss}, that can detect
a loss fail on the inputs of the function. The (Boolean) outputs of
these detectors are linked to an OR gate ({\ttfamily AtLeastOneLoss})
which triggers an \ALARM when at least one of the detectors outputs
true. The alarm commands a \SWITCH; the output of \SWITCH is the same
as \MIN, unless \ALARM is activated, in which case it propagates a
\LOSS fail status. The alarm itself can fail in two modes, either
continuously triggering the switch or never activating it. The schema
also includes two delay operators, \D{1} and \D{2}, that can be used
to model propagation delays at the input of the detectors. We will
come back to these timing constraints at the end of this section.
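The following Python sketch (ours, not the AltaRica model) summarises
the untimed behaviour of this architecture, with \F{3Processing} kept
nominal; the detectors, the OR gate, the alarm failure modes and the
switch are modelled directly as a single function.
\begin{verbatim}
# Illustrative sketch of F3 with FDI (untimed): two perfect loss
# detectors, an OR gate, an alarm and a switch forcing LOSS when raised.
OK, LOSS, ERR = "OK", "LOSS", "ERR"
SEVERITY = {ERR: 0, LOSS: 1, OK: 2}

def f3_with_fdi(in1, in2, alarm_failure=None):
    """alarm_failure: None (nominal), 'stuck_on' or 'stuck_off'."""
    detected = (in1 == LOSS) or (in2 == LOSS)     # F1Loss OR F2Loss
    if alarm_failure == "stuck_on":
        alarm = True                              # continuously triggering
    elif alarm_failure == "stuck_off":
        alarm = False                             # never activated
    else:
        alarm = detected
    mux = min(in1, in2, key=SEVERITY.get)         # pessimistic multiplexer
    return LOSS if alarm else mux                 # the switch isolates F3

assert f3_with_fdi(LOSS, ERR) == LOSS             # disambiguated, as desired
assert f3_with_fdi(LOSS, ERR, "stuck_off") == ERR # alarm failure defeats FDI
\end{verbatim}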
% \TODO{why do we need Alarm ! why does the alarm can fail ! we could
% plug directly the or gate to the switch}- Permanent failure of the alarm.
The FDI function---with a switch and an alarm---is a stable scheme for
failure propagation: when in nominal mode, it detects all the failures
of the system and is able to disambiguate the case where its inputs
carry both \ERR and \LOSS. Once again, this can be confirmed using
the Sequence Generation tool. If we repeat the same analysis as
before---and if we abstract away the delay nodes---we find $56$
minimal cuts, all involving a failure of either \ALARM or
\F{3Processing}. This means that, in an untimed model, our new
implementation of \F{3} satisfies the loss detection property, as
desired.
% \begin{figure}[hbt]
% \centering
% \begin{tikzpicture}
% \node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=\textwidth]{figures/Modele2.png}};
% \draw[red,ultra thick, dashed] (10.6,3.6) rectangle (2.2,0);
% \node[red] at (10,3) {\texttt{\huge F3}};
% \end{tikzpicture}
% % \includegraphics[width=0.95\textwidth]{figures/Modele2}
% \caption{Taking into consideration propagation delays in the FDI
% system.\label{fig:example2}}
% \end{figure}
% Unfortunately, this model makes unrealistic assumptions about the
% instantaneous failure propagation of detection. Indeed, it may be the
% case that the outputs of \F{1} and \F{2} are delayed as they are
% propagated to the observers \F{1Loss} and \F{2Loss}. Next, we study
% new failure modes that may arise from this situation and how we can
% detect them.
%%
% \subsection{Timed Safety model of the architecture with FDI}
% \label{sec:timed-safety-model}
%%
% In this section, we consider a new AltaRica model that takes into
% account propagation delays. To this end, we may insert two instances
% of the delay node (the element \PRE defined in Sect.~\ref{sec:sect2})
% before the detectors \F{1Loss} and \F{2Loss}.
Even so, it is easy to find a timed scenario where the safety property
is violated.
Assume now that \F{1} and \F{2} propagate the statuses \LOSS and \ERR,
respectively, at the same date. In such a case, the output of \F{1}
might reach \F{1Loss} at a later date than the output of \F{2} reaches
\F{2Loss}, while the \ERR reaches \MIN instantaneously. This leads to
a transient state where the alarm is not activated whereas the output
of \MIN is set to \ERR. This brings us back to the same dreaded
scenario as in our initial model.
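The following Python sketch (ours) replays this scenario over discrete
dates, with an assumed delay of three ticks on the lost input; it
exhibits the transient window during which \F{3} outputs \ERR before
the alarm isolates it.
\begin{verbatim}
# Illustrative sketch: discrete-time trace of the dreaded scenario,
# with F1 propagating LOSS and F2 propagating ERR from date 0, and the
# loss delayed by D1 ticks (node D1) before reaching detector F1Loss.
OK, LOSS, ERR = "OK", "LOSS", "ERR"
SEVERITY = {ERR: 0, LOSS: 1, OK: 2}
D1 = 3                                        # assumed propagation delay

def f1(t): return LOSS if t >= 0 else OK      # output of F1 over time
def f2(t): return ERR if t >= 0 else OK       # output of F2 over time

for t in range(6):
    alarm = (f1(t - D1) == LOSS) or (f2(t) == LOSS)   # F2Loss sees no loss
    mux = min(f1(t), f2(t), key=SEVERITY.get)         # MIN is not delayed
    out = LOSS if alarm else mux
    print(t, out)   # ERR for dates 0..D1-1, then LOSS: F3 converges after D1
\end{verbatim}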
%%
% In particular, this scenario corresponds to the cut
% \code{\{'F1.fail\_loss', 'F2.fail\_err'\}}, that is not admissible in
% an untimed semantics.
%%
This example suggests that we need a more powerful method to compute
the set of cuts in the presence of temporal constraints. On the other
hand, we may also argue that our safety property is too strong in this
context, where perfect synchronicity of events is rare. Actually, we
can prove that the output of \F{3} will eventually converge to a loss
detection and isolation (assuming that \F{3} stays nominal and that
its inputs stay stable). Therefore, if we can bound the latency needed
to detect the loss failure, and if this bound is sufficiently small
safety-wise, we can still deem our system safe. To reflect this
situation, we propose an improved safety property that takes temporal
conditions into account.
%%
\begin{prop}[Loss Detection Convergent]\label{prop:2} A function is
\emph{loss detection convergent} if (when in nominal mode) there
exists a duration $\Delta$ such that it continuously outputs a \LOSS
after the date $\delta_0 + \Delta$ if at least one of its input
nodes continuously propagates a \LOSS starting from $\delta_0$
onward. The smallest possible value for $\Delta$ is called the
\emph{convergence latency} of the function.
\end{prop}
In the next section, we use our approach to generate a list of ``timed
cuts'' (as model-checking counterexamples) that would have exposed the
problems that we just described. We also use model-checking to compute
the {convergence latency} for the node \F{3}. In this simple example,
we can show that the latency is equal to the maximal propagation delay
at the input of the detectors. The value of the latency could be much
harder to compute in a more sophisticated scenario, where delays can
be chained and/or depend on the internal state of a component.
% The failure to detect a lost element in the system strongly affects
% the safety assessment process, and motivates our approach to take
% directly into account temporal constraints when analysing AltaRica
% models. Unfortunately, while it is possible to define some timing
% constraints on AltaRica synchronizations using ``Dirac events'', this
% information is not taken into account during analysis. In particular,
% we cannot use the Sequence Generation tool to find the ``timed''
% minimal cuts that would expose the problems that we just described. In
% the next section, we propose a way to check our timed extension of
% AltaRica using a translation into the Fiacre specification
% language. After translation into this new format, the model can be
% used by a realtime model-checker Tina.
% \TODO{Limitations of OCAS are described above as well: to integrate the two parts}
% \TODO{talk about Dirac(2) in AltaRica as a way to do some crude
% temporal constraints. It works in simulation mode in Cecilia OCAS but
% no tooling !? actually I am not sure.}
% \subsubsection{Dialects of AltaRica}
% AltaRica is a high level modelling language dedicated to Safety Analysis.
% The semantics of AltaRica is formally defined in terms of Guarded Transitions Systems, and the state of the system is described by means of variables~\cite{poi99}. The system changes of state are dictate by events, i.e. the transitions between states happen only when an event occurs as it is the only mechanism updates the value of variables.
% Few versions of AltaRica exist; we consider here AltaRica 2.0 Dataflow, a version where variables are updated by propagating values in a fixed order, and this order is determined at compilation time.
% The model we realised used CECILIA as a tool, then brought to a more
% standard AltaRica 2.0 version (the one of Epoch).
% \subsection{Cecilia OCAS graphical interactive simulator}
% %% Pris de l'article de Christel
% The system architecture we modelled can be realised using Cecilia OCAS graphical interactive simulator, which has the capacity of exporting AltaRica code, in a dialect that can easily be translated in AltaRica DataFlow.
% Each system component is modelled by an AltaRica node
% that can be regarded as a mode automaton~\cite{rauzy02}. In Cecilia
% OCAS, each node is associated to an icon and belongs to a library.
% Components are dragged and dropped from the library to the system
% architecture sheet and then linked graphically.
% As failures are events in the AltaRica model, the safety engineer can use the graphical capacities of the tool to communicate more easily the model characteristics to other project members, as graphical representations are possibly the easiest way to communicate models, not only for the safety engineers.
% This tool permits to inject in the model a number of failure events in order to observe whether a failure condition is reached (such as loss of one or several elements).
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
%% LocalWords: PFD OCAS CAS Ps Fiacre