Entropy

Entropy is a fault-injection tool for the Elixir/OTP runtime.

It acts as a sidecar application, stochastically selecting and suspending ("zombifying") processes to simulate Grey Failures (degradation) rather than simple termination.

Purpose

Standard supervisors recover from crashes (Termination). They do not recover from hanging processes (Degradation). Entropy validates system resilience by forcibly suspending processes for defined intervals, proving whether the host system correctly handles timeouts and backpressure.
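
To make the distinction concrete, the sketch below reproduces a grey failure by hand, without Entropy. It uses the BEAM's own `:erlang.suspend_process/1` on an `Agent`. The module name `GreyFailureDemo` is hypothetical, and whether Entropy relies on this exact primitive internally is an assumption.

```elixir
defmodule GreyFailureDemo do
  # Hypothetical demo module; not part of Entropy.
  def run(timeout_ms \\ 100) do
    # Unlinked Agent so a caller timeout cannot take it down.
    {:ok, agent} = Agent.start(fn -> 0 end)

    # Healthy call: the Agent answers within the timeout.
    0 = Agent.get(agent, fn state -> state end, timeout_ms)

    # Freeze the Agent. It stays alive but stops processing messages.
    true = :erlang.suspend_process(agent)

    result =
      try do
        Agent.get(agent, fn state -> state end, timeout_ms)
      catch
        # The caller gives up; the callee never crashed.
        :exit, {:timeout, _} -> :timeout
      end

    true = :erlang.resume_process(agent)
    still_alive? = Process.alive?(agent)
    Agent.stop(agent)

    {result, still_alive?}
  end
end
```

`GreyFailureDemo.run/1` returns `{:timeout, true}`: the caller timed out, yet the process never died, so a supervisor would have had nothing to restart.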

Features

  • Grey Failure Simulation: Simulates degradation (freezing) in addition to simple termination to validate timeout handling.
  • Stochastic Selection: Uses weighted probabilistic selection to ensure fair coverage of the process tree over time.
  • Safety Circuit Breaker: Automatically halts injection if node CPU or Memory exceeds configured safety thresholds.
  • Immunity: Supports static and dynamic immunity to protect critical infrastructure processes.
  • Dead Man's Switch: Guarantees that all suspended processes are automatically resumed if the Entropy daemon crashes.

Installation

Add entropy to your list of dependencies in mix.exs:

def deps do
  [
    {:entropy, "~> 0.1.0"}
  ]
end

Ensure the :os_mon application is started alongside your application by listing it under extra_applications in mix.exs. Entropy relies on :cpu_sup and :memsup for its safety checks.

def application do
  [
    extra_applications: [:logger, :os_mon]
  ]
end

Configuration

Entropy is configured via the standard application environment.

Note: The safety thresholds define the Circuit Breaker. If system resources exceed these limits, Entropy halts injection to prevent cascading failure.

# config/config.exs

config :entropy,
  # Enable/Disable the injection scheduler.
  # Default: false (Safety first)
  is_injection_enabled: false,

  # The time between injection attempts in milliseconds.
  # Default: 5000
  injection_interval_ms: 2000,

  # The frequency at which the Circuit Breaker polls system resources.
  # Lower values shorten reaction time but add system overhead.
  # Default: 1000
  safety_check_interval_ms: 1000,

  # The maximum CPU utilization (0.0 - 100.0) allowed.
  # If the host node exceeds this, injection pauses.
  # Default: 95.0
  max_cpu_util_percent: 80.0,

  # The maximum Memory utilization (0.0 - 100.0) allowed.
  # Default: 90.0
  max_memory_util_percent: 80.0,

  # The maximum number of concurrent zombies allowed.
  # Default: 50
  max_active_zombies: 25,

  # A list of atoms (application names) strictly immune to selection.
  # :kernel, :init, :logger, and :entropy are immune by default.
  # Default: []
  immune_modules: [:my_critical_app],

  # A list of atoms (application names) allowed to be targeted.
  # If empty, all applications are valid targets.
  # Default: []
  target_applications: [:my_target_app],

  # The duration range {min, max} in ms for a process suspension.
  # Default: {1000, 10_000}
  zombie_ttl_range_ms: {1000, 10_000},

  # Fault Strategy Weights
  # A keyword list defining the relative frequency of fault types.
  # Keys: :suspend, :kill
  # Default: [suspend: 10, kill: 0] = Suspension only
  fault_strategy_weights: [suspend: 9, kill: 1],

  # Cooldown period for repetitive telemetry events in ms.
  # Default: 1000
  telemetry_debounce_ms: 1000,

  # Whether the AxiomaticLogger should output to `stdout`.
  # In standard operation, if the system crashes (T=0), the what and why must be
  # preserved.
  is_axiomatic_reporting_enabled: true,

  # The buffer subtracted from the max resource utilization limit required to
  # recover from an unsafe state. Prevents rapid oscillation in the circuit
  # breaker.
  # Default: 10.0
  hysteresis_padding: 10.0
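
As an illustration of how relative weights such as `fault_strategy_weights: [suspend: 9, kill: 1]` can translate into a choice, here is a minimal sketch of weighted selection. `WeightedPick` is a hypothetical module, and this is not necessarily how Entropy implements it.

```elixir
defmodule WeightedPick do
  # Hypothetical helper; not Entropy's implementation.

  # Picks a strategy using a fresh random roll in 1..total_weight.
  # Assumes at least one weight is positive.
  def pick(weights) do
    total = weights |> Keyword.values() |> Enum.sum()
    pick(weights, :rand.uniform(total))
  end

  # Deterministic variant: walks the list, subtracting each weight
  # from the roll until the roll falls inside a strategy's share.
  def pick(weights, roll) when is_integer(roll) and roll >= 1 do
    Enum.reduce_while(weights, roll, fn {strategy, weight}, remaining ->
      if remaining <= weight do
        {:halt, strategy}
      else
        {:cont, remaining - weight}
      end
    end)
  end
end
```

With `[suspend: 9, kill: 1]`, rolls 1 through 9 select `:suspend` and roll 10 selects `:kill`. A weight of 0 can never be selected, which matches the documented default of suspension only.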

Configuration Hierarchy

Entropy follows the standard Elixir configuration cascade:

  1. config/config.exs: Sets the Static Defaults (Safe/Disabled) at compile time.
  2. config/runtime.exs: Reads Environment Variables at boot time.
    • Rule: Values set here override the static defaults.

Environment Variables (Runtime)

Entropy supports runtime configuration via config/runtime.exs.

The following environment variables override static configuration in production:

  • ENTROPY_INJECTION_ENABLED: ("true" | "false") Toggles the injection scheduler.
  • ENTROPY_INJECTION_INTERVAL_MS: (Integer) Sets the time between injection attempts.

Example:

export ENTROPY_INJECTION_ENABLED="true"
export ENTROPY_INJECTION_INTERVAL_MS="5000"
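
For orientation, a config/runtime.exs that maps these variables onto the configuration keys might look roughly like the fragment below. Only the variable names and config keys come from this README; the parsing rules (anything other than "true" disables injection, the interval must be a valid integer) are assumptions.

```elixir
# config/runtime.exs (sketch, not Entropy's shipped file)
import Config

# Only override the static default when the variable is actually set.
if value = System.get_env("ENTROPY_INJECTION_ENABLED") do
  config :entropy, is_injection_enabled: value == "true"
end

if value = System.get_env("ENTROPY_INJECTION_INTERVAL_MS") do
  # String.to_integer/1 raises on malformed input, failing fast at boot.
  config :entropy, injection_interval_ms: String.to_integer(value)
end
```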

Usage

Entropy operates as a daemon. Interactions occur via the Entropy module or by observing Telemetry events.

1. Verification

After deployment, confirm the daemon is active and the environment permits injection.

# Returns true if the Entropy supervision tree is alive.
iex> Entropy.is_alive?()
true

# Returns true if the Circuit Breaker allows injection.
# (i.e., CPU < max_cpu_util_percent AND Memory < max_memory_util_percent)
iex> Entropy.is_ready?()
true

2. Runtime Control

Configuration changes (e.g., increasing aggression) can be applied without restarting the node.

  1. Modify config.exs or runtime.exs
  2. Execute reload:
iex> Entropy.reload_config()
:ok

3. Dynamic Immunity

Specific processes can be temporarily granted immunity during critical transactions.

# Protect the current process from chaos
Entropy.State.ImmunityRegistry.register(self())

# Critical work...

# Revoke protection
Entropy.State.ImmunityRegistry.unregister(self())
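
If the critical work raises, the unregister call above is never reached. One way to guard against that is a small wrapper built on try/after. `ImmunityGuard` is a hypothetical helper, not part of Entropy; it only assumes the `register/1` and `unregister/1` functions shown above.

```elixir
defmodule ImmunityGuard do
  # Hypothetical wrapper; not part of Entropy.
  # `registry` defaults to the module named in this README and can be
  # swapped for a stub in tests.
  def with_immunity(fun, registry \\ Entropy.State.ImmunityRegistry)
      when is_function(fun, 0) do
    registry.register(self())

    try do
      fun.()
    after
      # Runs on normal return, raise, throw, and exit alike.
      registry.unregister(self())
    end
  end
end
```

Usage would then be `ImmunityGuard.with_immunity(fn -> do_critical_work() end)`, where `do_critical_work/0` stands in for your own code.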

Observability

Entropy emits structured events via :telemetry.

Injection Events

  • [:entropy, :injection, :start] - Injection attempt initiated.
  • [:entropy, :injection, :stop] - Injection successfully completed.
    • Metadata: %{strategy: :suspend | :kill, ...}
  • [:entropy, :injection, :failure] - Injection failed (e.g., target died before suspension).

Safety Events

  • [:entropy, :safety, :veto] - Circuit Breaker tripped. Injection paused.
  • [:entropy, :safety, :recovery] - Circuit Breaker reset. Injection resumed.

Scheduler Events

  • [:entropy, :scheduler, :skip] - Cycle skipped (e.g., due to circuit breaker or zombie limit).
  • [:entropy, :scheduler, :noop] - Cycle executed but no valid victim found.

Configuration Events

Events emitted when Entropy.reload_config/0 applies runtime changes.

  • [:entropy, :scheduler, :injection_interval_change]
    • Metadata: %{old: integer(), new: integer()}
  • [:entropy, :circuit_breaker, :threshold_change]
    • Metadata: %{old: map(), new: map()}
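
A host application can subscribe to the events listed above with `:telemetry.attach_many/4`. The sketch below is one possible handler; `EntropyObserver`, the handler id, and the log wording are all hypothetical, and the only metadata key it reads is the documented `:strategy`.

```elixir
defmodule EntropyObserver do
  # Hypothetical handler module; not part of Entropy.
  require Logger

  @events [
    [:entropy, :injection, :stop],
    [:entropy, :injection, :failure],
    [:entropy, :safety, :veto],
    [:entropy, :safety, :recovery]
  ]

  # Call once at application start. Needs the :telemetry dependency.
  def attach do
    :telemetry.attach_many("entropy-observer", @events, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry invokes this with the event name, measurements, metadata, config.
  def handle_event(event, _measurements, metadata, _config) do
    Logger.info(describe(event, metadata))
  end

  # Pure formatting, kept separate so it can be tested without :telemetry.
  def describe([:entropy, :injection, :stop], metadata),
    do: "injection completed (strategy: #{inspect(metadata[:strategy])})"

  def describe([:entropy, :safety, :veto], _metadata),
    do: "circuit breaker tripped, injection paused"

  def describe(event, _metadata), do: "event #{inspect(event)}"
end
```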

Architecture

Circuit Breaker & Hysteresis

Entropy polls :cpu_sup and :memsup at a configurable interval (Default: 1000ms).

  1. Safety Trip: If usage exceeds max_cpu_util_percent or max_memory_util_percent, the system enters a Safety State. Injection halts immediately.
  2. Hysteresis Recovery: To prevent oscillation (flapping) between safe and unsafe states, Entropy applies a Hysteresis Padding (Default: 10.0%).
    • Example: If the CPU limit is 80%, the system becomes unsafe at >80%. However, it will not return to a safe state until CPU usage drops below 70% (80% - 10% padding).

This ensures the host system has genuinely recovered before chaos resumes.
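
The trip and recovery rules can be expressed as a two-state transition function. This is a minimal sketch for a single metric; `HysteresisSketch` is a hypothetical name, and the real circuit breaker presumably combines CPU and memory readings.

```elixir
defmodule HysteresisSketch do
  # Hypothetical illustration; not Entropy's implementation.

  # Trip: a safe system becomes unsafe once usage exceeds the limit.
  def next_state(:safe, usage, limit, _padding) when usage > limit, do: :unsafe

  # Recover: an unsafe system becomes safe only below (limit - padding).
  def next_state(:unsafe, usage, limit, padding) when usage < limit - padding, do: :safe

  # Otherwise the state is unchanged, which is what prevents flapping.
  def next_state(state, _usage, _limit, _padding), do: state
end
```

With a limit of 80.0 and padding of 10.0, a reading of 75.0 leaves an unsafe system unsafe, while the same reading leaves a safe system safe.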

Zombie Registry

Suspended processes are tracked in an ETS table owned by Entropy.State.ZombieRegistry.

  • Constraint: If the registry process crashes, the BEAM VM automatically resumes all suspended processes (Dead Man's Switch).
  • Limit: The system enforces a hard limit of max_active_zombies (Default: 50) to prevent total resource starvation.
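
The Dead Man's Switch rests on a documented BEAM guarantee: suspensions held by a process are released when that process terminates. The sketch below demonstrates the guarantee in isolation. `DeadMansSwitchDemo` is hypothetical, and a plain spawned process stands in for the registry.

```elixir
defmodule DeadMansSwitchDemo do
  # Hypothetical demo module; not part of Entropy.
  def run do
    parent = self()
    worker = spawn(fn -> echo_loop() end)

    # Stand-in for the registry: suspends the worker, then idles.
    suspender =
      spawn(fn ->
        :erlang.suspend_process(worker)
        send(parent, :suspended)
        Process.sleep(:infinity)
      end)

    receive do
      :suspended -> :ok
    end

    # While suspended, the worker cannot answer; the ping stays queued.
    send(worker, {:ping, parent})
    before_crash = await_pong(100)

    # Simulate the registry crashing. The VM resumes the worker,
    # which then answers the queued ping.
    Process.exit(suspender, :kill)
    after_crash = await_pong(1_000)

    Process.exit(worker, :kill)
    {before_crash, after_crash}
  end

  defp await_pong(timeout_ms) do
    receive do
      :pong -> :pong
    after
      timeout_ms -> :no_reply
    end
  end

  defp echo_loop do
    receive do
      {:ping, from} ->
        send(from, :pong)
        echo_loop()
    end
  end
end
```

`DeadMansSwitchDemo.run/0` returns `{:no_reply, :pong}`: no answer while the suspender is alive, and an answer once it has been killed.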

Census

Entropy maintains a cached snapshot of the process table to minimize overhead. The Entropy.Sanctuary.Census process refreshes this list on a fixed interval (default: 5s).

Refresh Lifecycle:

  1. Retrieves the global process list.
  2. Filters processes based on the target_applications allowlist (if configured).
  3. Converts the result to a Tuple for O(1) random access.

This architecture ensures that the Scheduler performs constant-time victim selection without blocking the VM with expensive Process.list/0 calls during every tick.
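
The refresh lifecycle and the constant-time lookup can be sketched as follows. `CensusSketch` is a hypothetical module; it uses uniform selection for simplicity, and determining a process's application via `:application.get_application/1` is an assumption about how the allowlist filter could work.

```elixir
defmodule CensusSketch do
  # Hypothetical illustration; not Entropy's implementation.

  # Steps 1-3: list processes, filter by allowlist, freeze into a tuple.
  def snapshot(target_apps \\ []) do
    Process.list()
    |> Enum.filter(&targeted?(&1, target_apps))
    |> List.to_tuple()
  end

  # Constant-time selection: elem/2 on a tuple is O(1).
  def pick(snapshot) when tuple_size(snapshot) > 0 do
    elem(snapshot, :rand.uniform(tuple_size(snapshot)) - 1)
  end

  def pick({}), do: nil

  # An empty allowlist means every process is a valid target.
  defp targeted?(_pid, []), do: true

  defp targeted?(pid, apps) do
    case :application.get_application(pid) do
      {:ok, app} -> app in apps
      :undefined -> false
    end
  end
end
```

The cost of walking the process list is paid once per refresh, while each scheduler tick only pays for a random number and a tuple lookup.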

Development

This section explains how to set up the project locally for development.

Requirements

  • Elixir ~> 1.16 (OTP 26+)
  • :os_mon (Required for System Sensors)

Setup

# 1. Clone the repository (use one of the two methods below)
## via HTTPS
git clone https://github.com/nrednav/entropy.git

## via SSH
git clone git@github.com:nrednav/entropy.git

cd entropy

# 2. Install dependencies
mix deps.get

# 3. Run the test suite
# Note: Tests use a Simulated Physics engine to avoid actual system interference.
mix test

Testing Strategy

Entropy uses a Deterministic Testing Pattern to eliminate race conditions.

  • Simulated Physics: Tests run against a Physics simulation, not the host OS.
  • Manual Polling: In the test environment, the Circuit Breaker's automatic polling loop is paused. You must explicitly trigger state updates.

Example test workflow:

# 1. Set the simulated physical state
Entropy.Simulation.Physics.set_cpu_util_percent(99.9)

# 2. Force the Circuit Breaker to read the new state
Entropy.State.CircuitBreaker.force_safety_check()

# 3. Assert the system reaction
Wait.until(fn ->
  case Entropy.State.CircuitBreaker.get_safety_report() do
    {:unsafe, metrics} -> metrics.cpu_util_percent == 99.9
    _ -> false
  end
end)

# or
assert Entropy.is_ready?() == false

Versioning

This project uses Semantic Versioning. For a list of available versions, see the repository tag list.

Issues & Requests

If you encounter a bug or have a feature request, please open an issue on the GitHub repository.
