Introduction to Latch-X

Latch-X is a modeling framework for availability and reliability analysis of complex systems using YAML-defined components, Bayesian Networks (BN), and Monte Carlo (MC) simulation.


Availability vs Reliability (what we mean in these docs)

  • Availability (A or A∞): repairable systems. Probability that the service is up [1].
    By default we mean steady-state availability (A or A∞). When we discuss time-varying behavior, we’ll write A(t) for point availability or Ā[t₁,t₂] for interval/mission availability.

  • Reliability (R(t)): non-repairable (mission) view. Probability that a component/system operates without failure over mission time t.
    In general, reliability is “one minus the failure distribution” [2].
    Exponential is the most common choice; normal and lognormal are also supported in Latch-X [3][4].


In short:
Availability → “How often is it up?” (A steady-state; A(t) if time-varying)
Reliability → “Chance it survives the next t hours?” (R(t))
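To make the contrast concrete, here is a minimal Python sketch that evaluates both quantities for a single component. It is not part of Latch-X: the function names and the numbers (MTTF of 1000 hours, MTTR of 2 hours, a 100-hour mission) are illustrative assumptions, and the reliability function assumes an exponential time-to-failure.

```python
import math


def steady_state_availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR); both times must use the same unit."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be > 0 and mttr must be >= 0")
    return mttf / (mttf + mttr)


def exponential_reliability(t: float, mttf: float) -> float:
    """R(t) = exp(-t / MTTF), assuming exponential time-to-failure."""
    if t < 0 or mttf <= 0:
        raise ValueError("t must be >= 0 and mttf must be > 0")
    return math.exp(-t / mttf)


# Illustrative values only (hours).
a = steady_state_availability(mttf=1000.0, mttr=2.0)
r = exponential_reliability(t=100.0, mttf=1000.0)

print(f"Steady-state availability: {a:.6f}")  # 0.998004
print(f"Reliability over 100 h:    {r:.6f}")  # 0.904837
```

The same component is up about 99.8% of the time in the long run, yet has only about a 90% chance of running 100 hours without any failure. The two numbers answer different questions.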



How Latch-X supports both

Repairable (Availability mode)

  • Model components with mttf and mttr.
  • Use BN for steady-state availability and contribution analysis.
  • Use MC for temporal uptime/downtime distributions and recovery time objective (RTO) metrics.

Non-repairable (Reliability mode)

  • Model components with mttf only (no repair).
  • Use BN for mission reliability calculations.
  • Use MC for failure time distributions and mission success probability.

What Latch-X Analysis Answers

Both availability and reliability analysis help answer critical system questions:

  • What's our expected system availability over time?
  • What's the reliability (survival probability) for a mission duration?
  • Which components are most critical to system uptime?
  • How would losing a specific component affect the overall system?
  • What's the probability of meeting our SLA commitments?
  • Where should we invest in redundancy or improvements?

Core concepts

Components

The building blocks of your system. Each component can fail independently and has its own mean time to failure (MTTF) and mean time to repair (MTTR).

Examples

Web servers, databases, load balancers, network switches, power supplies

Dependencies

Relationships between components that determine how failures propagate through your system.

Examples

Web servers depend on databases, load balancers depend on web servers, everything depends on power

Availability vs Reliability Modeling

Different modeling approaches for different analysis goals:

Availability modeling (repairable systems)

  • Components have both MTTF and MTTR parameters
  • Formula: see note 1
  • Answers: "What fraction of time is the system operational?"

Reliability modeling (mission systems)

  • Set repair_enabled: false (repairs are ignored).
  • For normal nodes, the schema still requires mttf and mttr when modeling with times; the engine ignores mttr in this mode.
    (Alternatively, use prob instead of time-based params.)
  • Latch nodes: prob or mttf + max_delay (no mttr).
  • Formula examples: see notes 2 and 3
  • Answers: "What's the probability of no failures during mission time t?"
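The closed-form reliability functions for the three supported time-to-failure distributions can be sketched in a few lines of Python. This is a standalone illustration, not the Latch-X API: the `reliability` function and its arguments are hypothetical, and only the `dist` strings are borrowed from the `mttf_dist` values (`exp`, `norm`, `lognorm`). The standard normal CDF is built from `math.erf`.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reliability(t, dist, mttf=None, mu=None, sigma=None):
    """Return R(t) = 1 - F(t) for a time-to-failure distribution.

    dist="exp"     uses mttf
    dist="norm"    uses mu (mean) and sigma (std)
    dist="lognorm" uses mu (log-mean) and sigma (log-std)
    """
    if dist == "exp":
        return math.exp(-t / mttf)
    if dist == "norm":
        return 1.0 - std_normal_cdf((t - mu) / sigma)
    if dist == "lognorm":
        if t <= 0:
            return 1.0  # a lognormal lifetime is always positive
        return 1.0 - std_normal_cdf((math.log(t) - mu) / sigma)
    raise ValueError(f"unsupported distribution: {dist!r}")


# Illustrative mission: 500 hours.
print(reliability(500.0, "exp", mttf=1000.0))
print(reliability(500.0, "norm", mu=1000.0, sigma=200.0))
print(reliability(500.0, "lognorm", mu=math.log(1000.0), sigma=0.5))
```

A quick sanity check for the normal and lognormal cases: at the median lifetime (`t = mu` for normal, `t = exp(mu)` for lognormal) the reliability is exactly 0.5.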

Failure propagation

How the failure of one component affects other components and the overall system through dependency relationships.

Example

Database failure → Web server failure → Load balancer failure → System unavailable
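The chain above can be reproduced with a small graph traversal. The sketch below is plain Python, not Latch-X code: the `DEPENDS_ON` map and the `affected_by` helper are hypothetical, and it assumes hard dependencies with no redundancy (a component fails as soon as any one of its dependencies fails).

```python
from collections import deque

# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "load_balancer": ["web_server"],
    "web_server": ["database"],
    "database": ["power"],
    "power": [],
}


def affected_by(failed, depends_on):
    """Return every component that transitively depends on `failed`."""
    # Invert the map: component -> components that depend on it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk upward from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(affected_by("database", DEPENDS_ON)))
# ['load_balancer', 'web_server']
```

With redundant components the rule "any failed dependency takes the parent down" no longer holds, which is where a probabilistic model becomes necessary.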

Analysis approaches

BN Engine (Bayesian Network)

Exact, mathematical analysis that returns results in seconds. Supports both availability (steady-state) and reliability (mission-time) calculations.

Key benefits

  • Fast: Results in seconds
  • Exact: Mathematically precise probabilities
  • Dual mode: Steady-state availability OR mission reliability
  • Versatile: Impact analysis, root cause analysis
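To show what "exact" means here, the following Python sketch computes system availability for a tiny model by enumerating every up/down combination of its components. This is not how the Latch-X BN engine is implemented. It is a brute-force stand-in that gives the same exact probability for a model this small. The component names, availabilities, and the `system_up` rule (a load balancer, a redundant pair of web servers, and a database) are illustrative assumptions, and components are assumed to fail independently.

```python
from itertools import product

# Hypothetical steady-state availabilities per component.
AVAIL = {"lb": 0.9999, "web_a": 0.99, "web_b": 0.99, "db": 0.999}


def system_up(state):
    """System works if lb and db are up and at least one web server is up."""
    web_tier = state["web_a"] or state["web_b"]
    return state["lb"] and web_tier and state["db"]


def exact_availability(avail, structure, forced=None):
    """Sum the probability of every component state in which the system is up.

    `forced` pins components to a state (True = up, False = down), which
    gives the system availability conditional on that evidence.
    """
    forced = forced or {}
    names = list(avail)
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        state = dict(zip(names, bits))
        if any(state[n] != v for n, v in forced.items()):
            continue  # state contradicts the evidence
        p = 1.0
        for n in names:
            if n in forced:
                continue  # pinned components contribute probability 1
            p *= avail[n] if state[n] else 1.0 - avail[n]
        if structure(state):
            total += p
    return total


baseline = exact_availability(AVAIL, system_up)
without_web_a = exact_availability(AVAIL, system_up, forced={"web_a": False})
without_db = exact_availability(AVAIL, system_up, forced={"db": False})

print(f"baseline:       {baseline:.6f}")
print(f"web_a down:     {without_web_a:.6f}")
print(f"db down:        {without_db:.6f}")
```

Pinning a component down and recomputing is a simple form of impact analysis: losing one web server lowers availability a little, while losing the database takes it to zero. Enumeration grows as 2^n, so it only serves as an illustration for small models.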

MC Engine (Monte Carlo)

Statistical, simulation-based analysis providing detailed temporal insights for both availability and reliability scenarios.

Key benefits

  • Detailed: Statistical confidence intervals
  • Temporal: Timing patterns and distributions
  • Dual mode: Availability (repair cycles) OR reliability (failure times)
  • Realistic: Models actual event sequences
  • Validation: Cross-checks BN results
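The idea behind the simulation approach can be sketched for a single repairable component. The Python below is an illustration, not the Latch-X MC engine: the function name, the parameter values, and the choice of exponential failure and repair times are assumptions. It simulates alternating up and down periods, estimates availability with a normal-approximation 95% confidence interval, and compares the result with MTTF / (MTTF + MTTR).

```python
import math
import random
import statistics


def uptime_fraction(mttf, mttr, horizon, rng):
    """Simulate one up/down history and return the fraction of time spent up."""
    t = 0.0
    up_time = 0.0
    is_up = True  # assumption: the component starts in the up state
    while t < horizon:
        mean = mttf if is_up else mttr
        duration = rng.expovariate(1.0 / mean)  # exponential sojourn time
        end = min(t + duration, horizon)        # clip at the horizon
        if is_up:
            up_time += end - t
        t = end
        is_up = not is_up
    return up_time / horizon


rng = random.Random(42)  # fixed seed for reproducibility
runs = [uptime_fraction(1000.0, 10.0, 200_000.0, rng) for _ in range(200)]

estimate = statistics.mean(runs)
half_width = 1.96 * statistics.stdev(runs) / math.sqrt(len(runs))
ci_low, ci_high = estimate - half_width, estimate + half_width
analytic = 1000.0 / (1000.0 + 10.0)

print(f"MC estimate: {estimate:.5f}  (95% CI {ci_low:.5f} to {ci_high:.5f})")
print(f"Analytic:    {analytic:.5f}")
```

Because each run records a full event history, the same loop could also collect downtime durations or outage counts, which is the kind of temporal detail a closed-form calculation does not provide.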

Getting started

Next steps


  1. A∞ ≈ MTTF / (MTTF + MTTR) — exact for a two-state exponential failure/repair model; a practical shortcut otherwise. 

  2. R(t) = 1 − F(t), where F(t) is the cumulative distribution function (CDF) of time-to-failure (TTF).

  3. Examples of R(t) by TTF distribution:

    • Exponential TTF: R(t) = exp(−t / MTTF)
    • Normal TTF (mean mu, std sigma): R(t) = 1 − Phi((t − mu)/sigma), where Phi is the standard normal CDF
    • Lognormal TTF (log-mean mu, log-std sigma): R(t) = 1 − Phi((ln t − mu)/sigma) (valid for t > 0)

  4. In YAML, set mttf_dist: exp|norm|lognorm (and include sigma for norm|lognorm). mttr_dist supports delta|exp|norm|lognorm.