Routelock Knowledge Base

Comprehensive documentation for the intelligent BGP route optimization platform

What is Routelock?

Overview of the intelligent BGP route optimization platform

Introduction

Routelock is an intelligent BGP route optimization platform designed to automatically analyze, select, and implement the best network routes across multiple upstream transit providers. In multi-homed networks where traffic can exit through several carriers, the default BGP best-path selection algorithm often chooses suboptimal routes based on simple AS-path length rather than actual performance metrics. Routelock solves this by combining real-time NetFlow traffic analysis, active probing, and sophisticated optimization algorithms to make data-driven routing decisions that minimize latency, reduce packet loss, and optimize cost.

How It Works

The platform continuously collects NetFlow data from your network to identify which destination prefixes carry the most traffic. It then actively probes those destinations through each available upstream provider, measuring latency, jitter, and packet loss. An optimization engine compares these measurements against configurable thresholds and decides whether a route change would provide meaningful improvement. When an improvement is identified, Routelock injects a more-specific BGP route through the preferred provider using BIRD 2.x as the route server, steering traffic along the better path.

Comparison with Noction IRP

Routelock draws architectural inspiration from Noction Intelligent Routing Platform (IRP) v4.3 but offers several key advantages. It features a modern web interface with real-time WebSocket updates, a comprehensive REST API with 85+ endpoints, integrated DDoS detection and mitigation including XDP/eBPF-based packet scrubbing, and native support for high availability with active-passive failover. While Noction uses a proprietary BGP implementation, Routelock leverages BIRD 2.x, a well-tested open-source routing daemon, providing greater transparency and community support.

Key Features

Three operating modes: Test (observe only), Human (approval required), and Robot (fully automated)
Multi-provider support: Transit, partial-route, and IX providers with per-provider metrics
Active probing: ICMP, UDP, and TCP probes with policy-based routing to test each provider path
DDoS protection: EWMA-based anomaly detection, RTBH, FlowSpec, and XDP/eBPF scrubbing
95th percentile commit control: Automatic traffic balancing to stay within commit levels
Enterprise authentication: JWT, API keys, LDAP, Google/Microsoft SSO, email 2FA
Role-based access control: Admin, operator, and viewer roles with granular permissions

System Requirements

Hardware, software, and network prerequisites for deploying Routelock

Hardware Requirements

Routelock is designed to handle production-scale networks with up to 1.1 million active BGP routes and traffic throughput exceeding 300 Gbps of NetFlow-monitored traffic. The hardware requirements vary based on the number of routes, NetFlow volume, and whether DDoS scrubbing is enabled.

Component	Minimum	Recommended
CPU	4 cores	6+ cores (for XDP scrubbing)
RAM	4 GB	8 GB+
Storage	50 GB SSD	150 GB NVMe
Network	1 Gbps	10 Gbps (for scrubber)

Software Prerequisites

Operating System: Linux (Debian 12/Ubuntu 22.04+ recommended). Kernel 5.15+ required for XDP/eBPF scrubber features.
Go: Version 1.21+ (for building from source)
PostgreSQL: Version 15+ with TimescaleDB 2.x extension for time-series hypertables
BIRD 2.x: BGP routing daemon, version 2.13+ recommended
clang/llvm: Required only if compiling XDP/eBPF programs

Network Requirements

Routelock must be deployed on a server that can establish BGP sessions with your border routers and receive NetFlow exports. The server needs IP connectivity to all upstream transit providers for active probing, ideally with policy-based routing (PBR) configured on the routers to steer probe packets through specific providers. For best results, the Routelock server should be on the same management VLAN as your routing infrastructure.

Note: The server can operate behind NAT for its management interface, but must have routable connectivity for BGP sessions and active probes. Self-signed TLS certificates are supported for the web UI and API.

Network Topology

A typical deployment places Routelock adjacent to the border routers. Each router peers with the upstream providers and also establishes an iBGP session with Routelock (via BIRD). Routers export NetFlow v9 data to Routelock's collector. When Routelock decides to optimize a prefix, it announces a more-specific route with a higher local-preference, causing traffic to shift to the preferred provider.

Quick Start Guide

Get Routelock up and running in minutes

Step 1: Install Dependencies

Begin by installing the required system packages. On Debian/Ubuntu:

apt update && apt install -y postgresql bird2 golang-go
# Install TimescaleDB extension (this package comes from the TimescaleDB apt repository, which must be configured first)
apt install -y timescaledb-2-postgresql-15
timescaledb-tune --yes
systemctl restart postgresql

Step 2: Create the Database

Routelock uses TimescaleDB for high-performance time-series storage of NetFlow records, probe results, and traffic statistics.

sudo -u postgres psql -c "CREATE USER routelock WITH PASSWORD 'your-secure-password';"
sudo -u postgres psql -c "CREATE DATABASE routelock OWNER routelock;"
sudo -u postgres psql -d routelock -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"

Step 3: Configure Routelock

Create the main configuration file at /etc/routelock/config.yaml. This file defines database connectivity, BGP settings, NetFlow listener ports, and operating mode.

server:
  listen: ":8080"
  mode: test          # Start in test mode (observe only)
database:
  host: localhost
  port: 5432
  name: routelock
  user: routelock
  password: your-secure-password
netflow:
  listen: ":2055"     # NetFlow v9 collector port
bgp:
  bird_socket: /run/bird/bird.ctl
  local_as: 65000
  router_id: 10.10.5.120
providers:
  - name: Provider-A
    type: transit
    asn: 64512
    communities: ["65000:100"]

Step 4: Run Migrations

routelock migrate up

Step 5: Start Routelock

routelock serve

Navigate to https://your-server:8080/ui/ to access the web dashboard. The default admin credentials are displayed in the startup log on first run. Change them immediately.

Step 6: Verify NetFlow Reception

Configure your Cisco routers to export NetFlow v9 to the Routelock server on port 2055. Within minutes, you should see traffic data populating the dashboard. The system will automatically begin identifying top prefixes and building traffic profiles.

Tip: Start in Test mode to observe Routelock's recommendations without making any actual routing changes. Once you're confident in the optimization suggestions, switch to Human mode for approval-based changes, or Robot mode for full automation.

Understanding Operating Modes

Test, Human, and Robot modes control how Routelock acts on optimization decisions

Overview

Routelock provides three distinct operating modes that control the level of automation for route optimization. These modes let you progressively build confidence in the platform before granting it full control over your routing decisions. The operating mode can be changed at any time through the web UI or API without restarting the service.

Test Mode (Observe Only)

In Test mode, Routelock performs all analysis, probing, and optimization calculations but does not inject any BGP routes. All proposed improvements are logged and visible in the dashboard as "pending" changes. This mode is ideal for initial deployment, letting you evaluate the quality of Routelock's recommendations against your network's actual behavior. Test mode still collects NetFlow, runs probes, and builds baseline metrics, so the system is fully warmed up when you're ready to enable route injection.

Human Mode (Approval Required)

Human mode generates route optimization proposals that require explicit approval from an operator before they are applied. When the optimization engine identifies an improvement, it creates a pending change request visible in the "Pending Changes" view. An administrator or operator can review the proposed change—including the current and proposed routes, probe metrics, expected latency improvement, and cost impact—and choose to approve or reject it. Approved changes are immediately injected via BIRD. This mode provides a safety net while still benefiting from Routelock's analysis.

Robot Mode (Fully Automated)

In Robot mode, Routelock automatically injects optimized routes without human intervention. The optimization engine applies all configured thresholds, anti-flap timers, rate limits, and cost constraints before making any change. This mode is recommended only after thorough validation in Test and Human modes. Robot mode includes safety mechanisms: maximum route injection rate (configurable, default 50 routes/minute), anti-flap timers to prevent rapid oscillation, and automatic withdrawal if probe metrics degrade after injection.

Changing Modes

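# Via API (authentication omitted; supply your JWT or API key as your deployment requires)
curl -X PUT https://your-server:8080/api/v1/config/mode \
  -H "Content-Type: application/json" \
  -d '{"mode":"human"}'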

# Via web UI: Settings → Operating Mode

Warning: Switching from Robot to Test mode does not automatically withdraw already-injected routes. Use the bulk withdrawal feature or let existing improvements expire naturally via their TTL.

Router & Interface Setup Guide

Overview

Accurate traffic analysis in Routelock depends on knowing where traffic enters and leaves your network. This is determined by the role assigned to each router interface. When an interface is classified as an upstream (provider) port, Routelock knows that traffic arriving on that interface is inbound from the internet, while traffic departing through it is outbound. Without proper classification, features like per-provider bandwidth reporting, DDoS detection direction, and optimization scoring cannot function correctly.

The setup process has three stages: register each router with its SNMP credentials, let Routelock discover all physical and logical interfaces via SNMP, and then classify each interface by its role in the network. SNMP-discovered interface names and descriptions make classification straightforward because they reflect the real cabling and purpose of each port.

Step 1: Register Your Routers

Navigate to the Routers page in the dashboard and click Add Router. Fill in the following fields:

Field	Description
Name	A human-readable label for this router (e.g., "edge-router-01")
Management IP	The IP address Routelock will use for SNMP queries
NetFlow Source IP	The IP that appears as the source address in NetFlow packets sent by this router. This must match exactly or flows will not be associated with the router.
SNMP Community	The SNMPv2c community string configured on the router
SNMP Port	UDP port for SNMP (default 161)
Role	The router's role in the network topology

Router Roles

Role	Description
Edge	Provider-facing router that peers with upstream transit carriers
Core	Backbone router connecting internal network segments
Distribution	Aggregation-layer router between core and access
Access	Customer-facing router providing last-mile connectivity

Note: Use the Test SNMP Connection button before saving. This verifies that Routelock can reach the router on the specified IP and community string. A failed test usually indicates a firewall rule blocking UDP 161 or an incorrect community string.

Step 2: Discover Interfaces

After registering a router, click Discover Interfaces on the router's detail page. Routelock performs an SNMP walk of four key OIDs:

OID	MIB Object	Purpose
1.3.6.1.2.1.2.2.1.2	ifDescr	Interface name (e.g., "HundredGigE0/0/0/3")
1.3.6.1.2.1.31.1.1.1.18	ifAlias	Interface description/alias set by the operator (e.g., "to Zayo1")
1.3.6.1.2.1.31.1.1.1.15	ifHighSpeed	Interface speed in Mbps (e.g., 100000 for 100G)
1.3.6.1.2.1.2.2.1.8	ifOperStatus	Operational status (up, down, testing)

All discovered interfaces are listed with their real names, descriptions, speed, and current operational status. Discovery can be re-run at any time to pick up new interfaces added to the router.

Tip: Setting meaningful interface descriptions on your routers (e.g., description to Zayo1 100G transit) makes classification much faster because the purpose of each port is immediately visible.

Step 3: Classify Interfaces

Navigate to the Interfaces page and select the router from the dropdown. For each discovered interface, assign a role that describes its function in the network:

Role	Description	Example
Upstream (Provider)	Connected to an ISP or transit provider. When selected, you also choose which provider this interface belongs to.	HundredGigE0/0/0/3 → Zayo
Downstream (Customer)	Connected to customers or downstream network segments that originate/receive end-user traffic.	TenGigE0/0/0/10 → Customer VLAN
Internal	Backbone or infrastructure links between your own routers. Traffic on these links is not counted toward provider bandwidth.	HundredGigE0/0/0/0 → Core link
Management	Out-of-band management interfaces used for SSH, SNMP, etc. Excluded from all traffic analysis.	MgmtEth0/RSP0/CPU0/0
Ignore	Loopback, null, and unused interfaces. These are hidden from traffic views.	Loopback0, Null0

Note: Routelock may suggest interface classifications based on interface names and descriptions. For example, an interface named "HundredGigE" with a description containing a known provider name will be suggested as Upstream. Review and confirm suggestions before saving.

How Direction Detection Works

Once interfaces are classified, Routelock uses their roles to determine the direction of every traffic flow and SNMP counter reading. The logic is straightforward:

Ingress Interface	Egress Interface	Direction
Upstream (Provider)	Downstream (Customer)	Inbound — traffic entering your network from the internet
Downstream (Customer)	Upstream (Provider)	Outbound — traffic leaving your network to the internet
Downstream	Downstream	Internal — traffic between customer segments
Internal	Internal	Internal — backbone transit traffic

This classification drives several key features:

DDoS detection direction: Attacks are identified as inbound (volumetric floods targeting your customers) or outbound (compromised hosts sending attack traffic), enabling direction-specific thresholds and mitigation.
SNMP bandwidth accuracy: Per-provider bandwidth is reported correctly because Routelock knows which counters correspond to which provider.
Traffic analytics: The dashboard's inbound/outbound breakdown, top-prefix tables, and provider utilization charts all depend on direction tagging.
Optimization scoring: The optimization engine evaluates route changes in the correct direction, ensuring improvements benefit the traffic that actually traverses each provider.
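
The direction table above can be expressed as a small lookup. The following Python sketch is illustrative only: the Role enum and the flow_direction function are hypothetical names, not part of Routelock, and role pairs that the table does not list are returned as "unknown".

```python
from enum import Enum


class Role(Enum):
    UPSTREAM = "upstream"
    DOWNSTREAM = "downstream"
    INTERNAL = "internal"
    MANAGEMENT = "management"
    IGNORE = "ignore"


def flow_direction(ingress: Role, egress: Role) -> str:
    """Map an (ingress, egress) interface-role pair to a direction label."""
    if ingress is Role.UPSTREAM and egress is Role.DOWNSTREAM:
        return "inbound"
    if ingress is Role.DOWNSTREAM and egress is Role.UPSTREAM:
        return "outbound"
    if ingress is egress and ingress in (Role.DOWNSTREAM, Role.INTERNAL):
        return "internal"
    # Pairs not covered by the table (for example upstream-to-upstream)
    # are left unclassified in this sketch.
    return "unknown"
```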

SNMP Bandwidth Polling

After classification, Routelock begins SNMP polling on all Upstream and Downstream interfaces. The counters are interpreted relative to the interface role:

Counter	On Upstream Interface	On Downstream Interface
ifHCInOctets (InOctets)	Network inbound traffic from this provider	Traffic received from customer segment
ifHCOutOctets (OutOctets)	Network outbound traffic to this provider	Traffic delivered to customer segment

Polling runs every 60 seconds. Raw counter deltas are smoothed using an Exponentially Weighted Moving Average (EWMA) with alpha 0.3, which reduces noise from traffic bursts while remaining responsive to sustained changes. The 95th percentile billing calculation uses the SNMP-derived per-provider interface counters over the configured billing period.
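
As a rough illustration of the polling arithmetic, the sketch below converts two consecutive octet-counter readings into Mbps and applies an EWMA with alpha 0.3. The function names are hypothetical, and the wrap handling assumes 64-bit (ifHC*) counters; Routelock's internal implementation may differ.

```python
COUNTER64_MOD = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters
ALPHA = 0.3              # smoothing factor quoted in the text


def rate_mbps(prev_octets: int, curr_octets: int, interval_s: float) -> float:
    """Rate in Mbps between two counter readings taken interval_s apart."""
    delta = (curr_octets - prev_octets) % COUNTER64_MOD  # tolerates one wrap
    return delta * 8 / interval_s / 1_000_000


def ewma(previous, sample: float, alpha: float = ALPHA) -> float:
    """Blend a new sample into the running average; the first sample seeds it."""
    if previous is None:
        return sample
    return alpha * sample + (1 - alpha) * previous


# Example: three polls 60 seconds apart (counter values are made up).
readings = [0, 750_000_000, 2_250_000_000]
smoothed = None
for prev, curr in zip(readings, readings[1:]):
    smoothed = ewma(smoothed, rate_mbps(prev, curr, 60))
```

A rate needs two readings, which is why the first poll after registration cannot produce a bandwidth value on its own.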

Important: SNMP bandwidth display uses interface counters, not NetFlow. NetFlow is used for traffic classification and prefix-level analysis. This ensures bandwidth numbers match what your upstream providers report on their billing portals, since both use SNMP counters as the source of truth.

Multi-Router Setup

In networks with multiple border routers, each router is registered and discovered independently. Key considerations:

Separate registration: Each router has its own management IP, NetFlow source IP, and SNMP credentials. Register them individually through the Routers page.
Independent discovery: Interface discovery is performed per router. Each router's interface list is managed separately.
Shared providers: The same provider can have interfaces on multiple routers. For example, if Zayo has a 100G link on both edge-router-01 and edge-router-02, classify both interfaces as Upstream and assign them to the Zayo provider. Routelock aggregates bandwidth across all interfaces for each provider.
NetFlow correlation: Flows from all routers are correlated using the interface_mappings table. The NetFlow source IP identifies the router, and the SNMP interface index (ifIndex) in the flow record maps to the discovered interface and its classification.

# Example: Two routers, same provider on both
Router: edge-router-01  (10.10.5.1)
  HundredGigE0/0/0/3  →  Upstream (Zayo)
  HundredGigE0/0/0/5  →  Upstream (RCN)
  HundredGigE0/0/0/0  →  Internal (core link)

Router: edge-router-02  (10.10.5.2)
  HundredGigE0/0/0/1  →  Upstream (Zayo)
  HundredGigE0/0/0/4  →  Upstream (PCCW)
  HundredGigE0/0/0/0  →  Internal (core link)
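
The correlation step can be pictured as a dictionary keyed by (NetFlow source IP, ifIndex). The Python sketch below is a hypothetical in-memory stand-in for the interface_mappings table: the ifIndex values are invented, and it assumes each router's NetFlow source IP is the address shown in the example above.

```python
from typing import NamedTuple, Optional


class InterfaceInfo(NamedTuple):
    router: str
    name: str
    role: str
    provider: Optional[str]


# Hypothetical mapping; ifIndex values (3, 5, 0, 1, 4) are made up for illustration.
INTERFACE_MAP = {
    ("10.10.5.1", 3): InterfaceInfo("edge-router-01", "HundredGigE0/0/0/3", "upstream", "Zayo"),
    ("10.10.5.1", 5): InterfaceInfo("edge-router-01", "HundredGigE0/0/0/5", "upstream", "RCN"),
    ("10.10.5.1", 0): InterfaceInfo("edge-router-01", "HundredGigE0/0/0/0", "internal", None),
    ("10.10.5.2", 1): InterfaceInfo("edge-router-02", "HundredGigE0/0/0/1", "upstream", "Zayo"),
    ("10.10.5.2", 4): InterfaceInfo("edge-router-02", "HundredGigE0/0/0/4", "upstream", "PCCW"),
}


def provider_for_flow(source_ip: str, if_index: int) -> Optional[str]:
    """Return the provider for a flow's interface, or None if it cannot be tagged."""
    info = INTERFACE_MAP.get((source_ip, if_index))
    if info is None or info.role != "upstream":
        return None
    return info.provider
```

A lookup that returns None corresponds to the "Flows not tagged with provider" case: either the source IP does not match a registered router or the ifIndex has no classified interface.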

Troubleshooting

"SNMP connection failed"

Verify the community string matches the router configuration. Check that the Routelock server can reach the router's management IP on UDP port 161. Common causes include ACL restrictions on the router, host-based firewalls on the Routelock server, or NAT interfering with SNMP responses.

# Test SNMP reachability from the Routelock server (queries sysDescr.0)
snmpget -v2c -c your-community router-ip 1.3.6.1.2.1.1.1.0

"No interfaces discovered"

The router may restrict which MIB objects are accessible via SNMP. Verify that the SNMP view or access list on the router includes the interfaces MIB (IF-MIB). Some routers require explicit configuration to expose ifAlias (the description field).

"Flows not tagged with provider"

This occurs when NetFlow records contain an ifIndex that does not match any classified interface. Ensure that the interface has been both discovered and classified. After saving a classification change, the NetFlow collector refreshes its interface map within 30 seconds. Also verify that the NetFlow source IP on the router registration matches the actual source IP of the exported flow packets.

"Bandwidth shows 0"

SNMP bandwidth calculation requires at least two consecutive poll cycles to compute a rate (delta bytes / delta time). After first registering a router, expect a 60-120 second delay before bandwidth values appear. If bandwidth remains at zero after several minutes, check that the interface is operationally up and that the SNMP counters (ifHCInOctets, ifHCOutOctets) are incrementing.

Providers

Understanding upstream transit providers, partial routes, and IX peers

What Are Providers?

In Routelock, a Provider represents an upstream network connection through which your traffic can be routed to the internet. Each provider is typically a transit carrier, partial-route peer, or Internet Exchange (IX) connection. Routelock monitors the performance and cost characteristics of each provider and uses this data to make intelligent routing decisions that optimize traffic across all available paths.

Provider Types

Transit Providers

Transit providers offer full routing tables (typically 900,000+ IPv4 prefixes) and carry traffic to any destination on the internet. These are your primary upstream carriers and usually represent the majority of traffic volume and cost. Routelock tracks each transit provider's 95th percentile billing, committed data rates, and per-prefix performance metrics.

Partial-Route Providers

Partial-route providers offer a subset of the full routing table, typically routes learned from their direct customers and peers. These connections are often cheaper than full transit and may offer better performance for specific regions. When evaluating optimization candidates, Routelock considers a partial-route provider only for prefixes that are actually reachable through it.

IX Providers

Internet Exchange providers represent peering connections at IXPs. These offer direct paths to other networks without traversing transit, typically providing lower latency and zero per-Mbps cost. Routelock can prefer IX routes over transit when performance is comparable, reducing transit costs.

Provider Configuration

providers:
  - name: "Cogent"
    type: transit
    asn: 174
    commit_mbps: 10000
    cost_per_mbps: 0.50
    communities:
      announce: "65000:174"
      local_pref: 100
    probe_source: "10.0.1.1"
    enabled: true

Metrics Tracked Per Provider

Metric	Description
Current throughput	Real-time inbound/outbound Mbps from NetFlow or SNMP
95th percentile	Rolling billing-period 95th percentile calculation
Average latency	Mean RTT from active probes across all monitored prefixes
Packet loss	Percentage of probe packets lost
Jitter	Variation in probe RTT values
Active improvements	Number of prefixes currently routed through this provider by Routelock
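
For readers unfamiliar with 95th percentile billing, the sketch below shows the common industry convention: sort the rate samples for the billing period, discard the top 5%, and bill on the highest remaining sample. This is an assumption about the method; the document does not state the sample interval or the exact percentile rule Routelock applies, and percentile_95 is a hypothetical helper name.

```python
def percentile_95(samples_mbps):
    """Highest sample remaining after the top 5% of samples are discarded."""
    if not samples_mbps:
        raise ValueError("no samples")
    ordered = sorted(samples_mbps)
    # Integer ceiling of 0.95 * n, converted to a zero-based index.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


# Example: 19 quiet samples and one burst; the burst falls in the discarded 5%.
billable = percentile_95([10] * 19 + [1000])
```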

Prefixes & Routes

How BGP routing works within Routelock and how prefixes are optimized

BGP Routing Fundamentals

In BGP (Border Gateway Protocol), a prefix is a block of IP addresses identified by a network address and mask length, such as 203.0.113.0/24. Each prefix can be reachable through multiple paths (routes), each offered by a different upstream provider. The standard BGP best-path algorithm selects one route per prefix based on attributes like local-preference, AS-path length, MED, and origin type. However, this algorithm does not consider real-world performance metrics like latency or packet loss.

How Routelock Optimizes Prefixes

Routelock identifies the most important prefixes in your network by analyzing NetFlow data to determine which destinations carry the most traffic. These "top prefixes" are then actively probed through each available provider to measure actual performance. When the optimization engine determines that a different provider offers meaningfully better performance for a given prefix, Routelock can inject a more-specific BGP route to redirect traffic.

Prefix Lifecycle

Discovery: NetFlow analysis identifies a prefix with significant traffic volume
Probing: The prefix enters the active probing pool and is measured through all providers
Evaluation: The optimization engine compares probe results against thresholds
Optimization: If improvement meets criteria, a route change is proposed or injected
Monitoring: Post-injection probes verify the improvement remains valid
Expiry: Improvements have a TTL; they expire and must be re-evaluated

Best-Path Selection

Routelock's best-path selection goes beyond traditional BGP. It calculates a weighted score for each provider path incorporating latency (default weight 40%), packet loss (30%), jitter (20%), and cost (10%). These weights are configurable. A provider must beat the current path by the configured improvement threshold (default 20%) to trigger an optimization, preventing unnecessary route churn.

score = (w_latency × latency_improvement) +
        (w_loss × loss_improvement) +
        (w_jitter × jitter_improvement) -
        (w_cost × cost_penalty)
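
To make the formula concrete, here is a minimal Python sketch. It assumes each improvement term is the fractional gain of the candidate path over the current path and that the cost penalty is the fractional cost increase; the document does not define these terms precisely, so PathMetrics, relative_gain, and score are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PathMetrics:
    latency_ms: float
    loss_pct: float
    jitter_ms: float
    cost_per_mbps: float


WEIGHTS = {"latency": 0.4, "loss": 0.3, "jitter": 0.2, "cost": 0.1}
MIN_IMPROVEMENT = 0.20  # default 20% threshold


def relative_gain(current: float, candidate: float) -> float:
    """Fractional improvement of candidate over current (positive is better)."""
    if current <= 0:
        return 0.0
    return (current - candidate) / current


def score(current: PathMetrics, candidate: PathMetrics, weights=WEIGHTS) -> float:
    # Assumed definition: penalty is positive when the candidate costs more.
    cost_penalty = -relative_gain(current.cost_per_mbps, candidate.cost_per_mbps)
    return (weights["latency"] * relative_gain(current.latency_ms, candidate.latency_ms)
            + weights["loss"] * relative_gain(current.loss_pct, candidate.loss_pct)
            + weights["jitter"] * relative_gain(current.jitter_ms, candidate.jitter_ms)
            - weights["cost"] * cost_penalty)


current = PathMetrics(latency_ms=80, loss_pct=2.0, jitter_ms=10, cost_per_mbps=0.50)
candidate = PathMetrics(latency_ms=40, loss_pct=1.0, jitter_ms=5, cost_per_mbps=0.50)
worth_changing = score(current, candidate) >= MIN_IMPROVEMENT
```

In this example every performance metric halves at equal cost, so the composite score is 0.45 and clears the 20% threshold.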

Improvements

Understanding route improvements, their lifecycle, and weighted scoring

What Are Improvements?

An improvement in Routelock represents an active route optimization—a prefix whose traffic has been redirected from the default BGP path to a better-performing provider. Each improvement tracks the original route, the optimized route, the performance gain achieved, and the remaining time-to-live (TTL) before the improvement expires and must be re-evaluated.

Improvement Lifecycle

Improvements progress through a well-defined state machine:

State	Description
`pending`	Optimization proposed but not yet applied (Human/Test mode)
`approved`	Operator approved the change, queued for injection
`active`	Route injected and traffic is flowing through the optimized path
`expired`	TTL reached zero; the route was withdrawn and prefix returns to re-evaluation
`withdrawn`	Manually withdrawn by operator or auto-withdrawn due to degradation
`rejected`	Operator rejected the proposed improvement
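
The state descriptions above imply a set of legal transitions. The sketch below encodes one plausible reading of them; the transition map is inferred from the table rather than taken from Routelock's code (for instance, Robot mode presumably creates improvements directly in the active state), and the names are hypothetical.

```python
# Transitions inferred from the state descriptions; treat as an assumption.
ALLOWED = {
    "pending": {"approved", "rejected"},
    "approved": {"active"},
    "active": {"expired", "withdrawn"},
    "expired": set(),
    "withdrawn": set(),
    "rejected": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```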

Weight Scoring

Each improvement candidate receives a composite score based on configurable weights. The default scoring formula considers latency improvement (40%), packet loss reduction (30%), jitter improvement (20%), and cost optimization (10%). An improvement must exceed the minimum threshold (default: 20% composite improvement) to be considered. This prevents marginal improvements that would cause unnecessary route churn.

TTL and Re-evaluation

Active improvements have a configurable TTL (default: 3600 seconds / 1 hour). When the TTL expires, the injected route is withdrawn and the prefix returns to the probing pool. If the optimization is still beneficial, a new improvement is created automatically. This ensures that route optimizations remain valid as network conditions change. The TTL is reset if the improvement is refreshed by new probe data confirming continued benefit.

Anti-Flap Protection

To prevent rapid oscillation between providers, Routelock implements anti-flap timers. After an improvement is withdrawn, the prefix enters a cooldown period (default: 300 seconds) during which it cannot be re-optimized to the same provider. This prevents scenarios where a marginal improvement repeatedly flaps between two providers.

Traffic Analysis

NetFlow collection, top prefix identification, and traffic distribution monitoring

NetFlow Collection

Routelock includes a high-performance NetFlow v9 collector that receives flow records from your Cisco routers. The collector listens on a configurable UDP port (default 2055) and parses flow records to extract source/destination IP addresses, byte counts, packet counts, protocol information, and interface indices. Flow data is aggregated into per-prefix traffic statistics and stored in TimescaleDB hypertables for efficient time-series querying.
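
Every NetFlow v9 export packet starts with a 20-byte header defined in RFC 3954. As an illustration of the first decoding step, the Python sketch below unpacks that header; V9Header and parse_header are hypothetical names, and template and data FlowSet parsing is deliberately left out.

```python
import struct
from typing import NamedTuple

# version, count, sysUpTime, UNIX secs, sequence number, source ID
HEADER = struct.Struct("!HHIIII")  # 20 bytes, network byte order


class V9Header(NamedTuple):
    version: int
    count: int
    sys_uptime_ms: int
    unix_secs: int
    sequence: int
    source_id: int


def parse_header(datagram: bytes) -> V9Header:
    """Decode the fixed NetFlow v9 packet header from a UDP payload."""
    if len(datagram) < HEADER.size:
        raise ValueError("datagram shorter than NetFlow v9 header")
    header = V9Header(*HEADER.unpack_from(datagram))
    if header.version != 9:
        raise ValueError(f"unexpected NetFlow version {header.version}")
    return header
```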

Top Prefix Identification

The traffic analysis engine continuously ranks destination prefixes by traffic volume. This "top prefixes" list determines which prefixes are worth optimizing—there is no benefit in optimizing routes for prefixes carrying negligible traffic. The configurable top_n parameter (default: 1000) sets how many prefixes are actively tracked and probed. Prefixes can also be explicitly included or excluded using prefix lists.

Traffic Distribution

Routelock tracks how traffic is distributed across providers in real time. The traffic distribution view shows each provider's share of total traffic (by bytes and packets), both as current snapshots and historical trends. This data feeds into cost optimization decisions—the system can identify when a provider is approaching its commit threshold and proactively shift traffic to avoid overage charges.

Flow Processing Pipeline

Collection: Raw NetFlow v9 packets received on UDP socket
Decoding: Templates cached per source; flow records decoded into structured data
Aggregation: Flows aggregated by destination prefix over configurable intervals (default: 60s)
Storage: Aggregated records written to netflow_records hypertable in batches
Ranking: Background job computes top-N prefixes every analysis cycle
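
The aggregation and ranking stages can be sketched in a few lines of Python. This is a simplification under stated assumptions: it buckets IPv4 destinations into fixed /24 networks, whereas a real collector would more likely match flows against the BGP table, and aggregate_by_prefix and top_prefixes are hypothetical names.

```python
import ipaddress
from collections import Counter


def aggregate_by_prefix(flows, prefix_len=24):
    """Sum bytes per destination prefix. flows: iterable of (dst_ip, byte_count)."""
    totals = Counter()
    for dst_ip, byte_count in flows:
        network = ipaddress.ip_network(f"{dst_ip}/{prefix_len}", strict=False)
        totals[str(network)] += byte_count
    return totals


def top_prefixes(totals, n=1000):
    """Return the n busiest prefixes as (prefix, bytes) pairs, largest first."""
    return totals.most_common(n)


flows = [("203.0.113.5", 1000), ("203.0.113.77", 500), ("198.51.100.9", 200)]
ranked = top_prefixes(aggregate_by_prefix(flows), n=2)
```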

# Example: Query top prefixes via API
GET /api/v1/netflow/top-prefixes?limit=20&period=1h

# Response includes prefix, bytes, packets, provider, percentage of total

Performance: The NetFlow collector can process over 100,000 flows per second on modest hardware. TimescaleDB hypertables with compression retain 90 days of history in approximately 20 GB of storage.

Active Probing

ICMP, UDP, and TCP probes for measuring per-provider path quality

Overview

Active probing is the mechanism by which Routelock measures real-time network performance to each destination prefix through each available upstream provider. Unlike passive NetFlow analysis which shows traffic volumes, active probing reveals actual latency, packet loss, and jitter on each path. This data is essential for making informed route optimization decisions.

Probe Types

ICMP Probes

ICMP echo (ping) probes are the default and most widely compatible method. They measure round-trip time and detect packet loss. ICMP probes have minimal bandwidth impact but may be rate-limited or deprioritized by some networks.

UDP Probes

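UDP probes send packets to high-numbered ports and measure the time until the ICMP Port Unreachable response arrives. They can get through where ICMP echo requests are filtered, but they may be blocked by firewalls and still depend on the destination returning ICMP errors. UDP probes are useful when ICMP echo is unreliable for a particular destination.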

TCP Probes

TCP SYN probes attempt connections to common ports (80, 443) and measure the SYN-ACK response time. Because web servers must keep these ports reachable, TCP probes are rarely filtered and are the most reliable way to measure latency to web servers. They also tend to reflect actual user experience more closely than ICMP or UDP probes.

Policy-Based Routing (PBR)

To measure performance through each specific provider, Routelock relies on PBR rules configured on your border routers. Each probe packet is tagged with a source address or DSCP value that the router's PBR policy matches, forcing the probe through the designated upstream provider. This ensures that probe measurements accurately reflect the performance of each individual path.

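! Cisco IOS PBR example for provider probing
ip access-list extended PROBE-PROVIDER-A
 permit ip host 10.0.1.1 any
route-map PBR-PROBES permit 10
 match ip address PROBE-PROVIDER-A
 set ip next-hop 198.51.100.1
! Apply the policy on the interface where probe packets arrive
! (GigabitEthernet0/1 is a placeholder; use your own interface)
interface GigabitEthernet0/1
 ip policy route-map PBR-PROBES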

Adaptive Probing

Routelock implements adaptive probe intervals. High-traffic prefixes are probed more frequently (every 15 seconds), while low-traffic prefixes may only be probed every 60 seconds. When an active improvement exists, the target prefix is probed at the highest frequency to quickly detect any degradation. The probe scheduler automatically adjusts intervals based on traffic volume, active improvement status, and configured resource limits.

Probe Algorithms

Results are smoothed using an exponentially weighted moving average (EWMA) to reduce the impact of transient spikes. A minimum sample count (default: 5 probes) is required before metrics are considered valid for optimization decisions. Outlier detection removes probe results that are more than 3 standard deviations from the mean.
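
The three rules in this section (outlier rejection, minimum sample count, EWMA smoothing) can be combined as in the Python sketch below. The function names are hypothetical, the order in which the rules are applied is an assumption, and the alpha of 0.3 is borrowed from the SNMP polling section because the probe smoothing factor is not documented.

```python
import statistics


def filter_outliers(samples, max_sigma=3.0):
    """Drop samples more than max_sigma standard deviations from the mean."""
    if len(samples) < 2:
        return list(samples)
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:
        return list(samples)
    return [s for s in samples if abs(s - mean) <= max_sigma * stdev]


def smoothed_rtt(samples, alpha=0.3, min_samples=5):
    """EWMA of the filtered samples, or None if too few samples remain."""
    kept = filter_outliers(samples)
    if len(kept) < min_samples:
        return None
    value = kept[0]
    for sample in kept[1:]:
        value = alpha * sample + (1 - alpha) * value
    return value
```

One property of this particular filter is worth knowing: with the population standard deviation used here, a single sample can only exceed 3 sigma when the window holds at least 11 samples, so very short windows pass every sample through.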

Optimization Engine

How Routelock makes route optimization decisions

Decision Process

The optimization engine is the brain of Routelock. Every analysis cycle (configurable, default 60 seconds), it evaluates all probed prefixes and determines whether route changes would provide meaningful improvements. The engine considers probe metrics, traffic volume, cost implications, commit thresholds, anti-flap timers, and rate limits before making any decision.

Optimization Modes

Performance Mode

In performance mode (default), the engine prioritizes latency reduction and packet loss elimination. The best provider for each prefix is selected based on the weighted composite score of latency, loss, and jitter. Cost is a secondary consideration used only as a tiebreaker.

Cost Mode

In cost mode, the engine balances performance optimization with commit management. It actively steers traffic toward providers that are under their committed rate while avoiding providers approaching their 95th percentile billing threshold. Cost mode is ideal for networks where transit costs are a primary concern.

Threshold Configuration

optimization:
  min_improvement_pct: 20    # Minimum 20% composite improvement required
  min_latency_diff_ms: 5     # Ignore latency differences under 5ms
  min_loss_diff_pct: 1.0     # Ignore loss differences under 1%
  max_inject_rate: 50        # Maximum 50 route injections per minute
  anti_flap_seconds: 300     # 5-minute cooldown after withdrawal
  ttl_seconds: 3600          # Improvements expire after 1 hour
  weights:
    latency: 0.4
    loss: 0.3
    jitter: 0.2
    cost: 0.1
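The documentation does not spell out how the composite score is computed, so the following Python sketch is only one plausible reading of the thresholds and weights above. It assumes the composite improvement is a weighted sum of each metric's relative gain (lower values are better) and that a candidate must clear at least one of the absolute latency or loss floors. All function names are hypothetical.

```python
WEIGHTS = {"latency": 0.4, "loss": 0.3, "jitter": 0.2, "cost": 0.1}
MIN_IMPROVEMENT_PCT = 20
MIN_LATENCY_DIFF_MS = 5
MIN_LOSS_DIFF_PCT = 1.0


def relative_gain(current, candidate):
    """Fractional improvement of candidate over current (lower is better)."""
    if current <= 0:
        return 0.0 if candidate <= 0 else -1.0
    return (current - candidate) / current


def composite_improvement_pct(current, candidate):
    """Weighted sum of per-metric relative gains, expressed as a percentage."""
    return 100 * sum(w * relative_gain(current[m], candidate[m]) for m, w in WEIGHTS.items())


def worth_switching(current, candidate):
    """Apply the absolute floors first, then the composite threshold."""
    latency_gain = current["latency"] - candidate["latency"]
    loss_gain = current["loss"] - candidate["loss"]
    if latency_gain < MIN_LATENCY_DIFF_MS and loss_gain < MIN_LOSS_DIFF_PCT:
        return False
    return composite_improvement_pct(current, candidate) >= MIN_IMPROVEMENT_PCT
```

Under these assumptions, a candidate that is only 3 ms faster is rejected by the latency floor, and one that is 10 ms faster on a 100 ms path is rejected because a 4% composite gain is below the 20% threshold.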

Anti-Flap Mechanism

The anti-flap mechanism prevents route oscillation that would destabilize the network. When a route is withdrawn (either by TTL expiry or manual action), the prefix enters a cooldown period for the specific provider pairing. During cooldown, the same provider cannot be selected again for that prefix, even if probe metrics suggest it would be beneficial. This prevents the classic scenario where two providers alternate as "best" due to minor metric fluctuations.
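A cooldown tracker of this kind can be modeled as a map from (prefix, provider) to the time of the last withdrawal. The sketch below is illustrative only; the class and method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to test.

```python
class AntiFlap:
    """Tracks per-(prefix, provider) cooldowns after a withdrawal."""

    def __init__(self, cooldown_seconds=300):  # anti_flap_seconds default
        self.cooldown = cooldown_seconds
        self._withdrawn_at = {}

    def record_withdrawal(self, prefix, provider, now):
        self._withdrawn_at[(prefix, provider)] = now

    def is_blocked(self, prefix, provider, now):
        """True while the pairing is still inside its cooldown window."""
        t = self._withdrawn_at.get((prefix, provider))
        return t is not None and now - t < self.cooldown
```

A different provider for the same prefix is not blocked, which matches the per-pairing behavior described above.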

Rate Limiting

Route injection is rate-limited to prevent a thundering herd of changes that could overwhelm BIRD or cause a routing storm. The default limit of 50 injections per minute is sufficient for most networks but can be adjusted. In addition to per-minute limits, there is a maximum total active improvements limit (default: 10,000) to cap the number of more-specific routes in the routing table.
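One way to enforce both limits is a sliding 60-second window combined with a cap on active improvements. The sketch below assumes a sliding window; the text does not say whether Routelock uses a sliding window, fixed buckets, or a token bucket, and the class name is hypothetical.

```python
from collections import deque


class InjectionLimiter:
    """Sliding-window limiter: at most max_per_minute injections in any 60 s span."""

    def __init__(self, max_per_minute=50, max_active=10_000):
        self.max_per_minute = max_per_minute
        self.max_active = max_active
        self._stamps = deque()

    def try_inject(self, now, active_count):
        # Forget injections that are older than the 60-second window.
        while self._stamps and now - self._stamps[0] >= 60:
            self._stamps.popleft()
        if active_count >= self.max_active or len(self._stamps) >= self.max_per_minute:
            return False
        self._stamps.append(now)
        return True
```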

How decisions flow: NetFlow identifies prefix → Probes measure all paths → Engine calculates scores → Threshold check → Anti-flap check → Rate limit check → Inject (Robot) or Propose (Human/Test)

BIRD 2.x Integration

How Routelock communicates with the BIRD routing daemon

Architecture

Routelock uses BIRD 2.x as its BGP route server. Rather than implementing its own BGP stack, Routelock delegates all BGP session management, route advertisement, and protocol handling to BIRD. This approach provides a mature, well-tested BGP implementation while allowing Routelock to focus on optimization logic. Communication between Routelock and BIRD occurs through two channels: the BIRD control socket for runtime commands and generated configuration files for static setup.

Socket Control Interface

BIRD exposes a Unix domain socket (typically /run/bird/bird.ctl) that accepts text-based commands. Routelock connects to this socket to perform real-time operations:

# Show route for a specific prefix
birdc show route for 203.0.113.0/24 all

# Reload configuration so updated static routes take effect (used for injection)
birdc configure soft

# Show protocol status
birdc show protocols all

# Show memory usage
birdc show memory
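When talking to the control socket directly instead of through birdc, replies arrive as lines prefixed with a four-digit code. The Python sketch below parses such a reply, assuming the framing described in BIRD's client documentation: a code followed by "-" means more lines follow, a code followed by a space marks the final line, and a line starting with a single space continues the previous code. The function names are hypothetical, and the framing should be verified against the BIRD version in use.

```python
def parse_bird_reply(text):
    """Split a raw control-socket reply into (code, message) pairs.

    Returns (entries, complete), where complete is True once a final line
    (code followed by a space) has been seen.
    """
    entries, complete, code = [], False, None
    for line in text.splitlines():
        if len(line) >= 5 and line[:4].isdigit() and line[4] in " -":
            code = line[:4]
            entries.append((code, line[5:]))
            complete = line[4] == " "
        elif line.startswith(" ") and code is not None:
            # Continuation line: same code as the previous line.
            entries.append((code, line[1:]))
            complete = False
    return entries, complete


def reply_failed(entries):
    """Assumes codes 8xxx (runtime error) and 9xxx (parse error) signal failure."""
    return any(code[0] in "89" for code, _ in entries)
```

A caller would typically keep reading from the socket until `complete` is true before sending the next command.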

Configuration Generation

Routelock generates BIRD configuration fragments for its optimization routes. These are placed in an include directory (default: /etc/bird/routelock.d/) and loaded by BIRD via the include directive. When improvements are created or withdrawn, Routelock updates the configuration fragment and triggers a soft reconfiguration via the socket.

# Generated BIRD config fragment example
protocol static routelock_opt {
    ipv4 { table master4; };
    route 203.0.113.0/25 via 198.51.100.1 {
        bgp_local_pref = 200;
        bgp_community.add((65000,174));
    };
}
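A generator for fragments like the one above might look like the following Python sketch. It reproduces the layout of the example and validates inputs with the standard `ipaddress` module so that malformed values never reach the file. The function name and the dictionary keys (`prefix`, `next_hop`, `provider_tag`) are hypothetical, not Routelock's real data model.

```python
import ipaddress


def render_fragment(improvements, local_pref=200, asn=65000):
    """Render a BIRD static protocol block for a list of improvements."""
    lines = ["protocol static routelock_opt {", "    ipv4 { table master4; };"]
    for imp in improvements:
        # Raises ValueError on malformed prefixes or next hops.
        prefix = ipaddress.ip_network(imp["prefix"])
        next_hop = ipaddress.ip_address(imp["next_hop"])
        lines += [
            f"    route {prefix} via {next_hop} {{",
            f"        bgp_local_pref = {local_pref};",
            f"        bgp_community.add(({asn},{int(imp['provider_tag'])}));",
            "    };",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Writing the result to a temporary file and renaming it into the include directory before triggering the reload would avoid BIRD reading a half-written fragment; that detail is a suggestion, not documented behavior.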

BGP Session Monitoring

Routelock continuously monitors the health of all BGP sessions through BIRD. If a provider's BGP session goes down, all active improvements using that provider are immediately withdrawn. Session state changes trigger WebSocket events and alerts. The /api/v1/bgp/sessions endpoint provides real-time session status including uptime, prefix counts, and last error messages.

Route Injection

How optimized routes are announced to steer traffic through preferred providers

The Injection Process

When the optimization engine determines that a prefix should be routed through a different provider, it creates an "improvement" and initiates route injection. The injection process involves generating a more-specific BGP route (e.g., splitting a /24 into two /25s) with a higher local-preference value, then announcing it through BIRD. Routers forward on the longest matching prefix, so the more-specific route takes precedence over the covering route regardless of its BGP attributes; the higher local-preference additionally ensures the injected route wins if another route for the same more-specific prefix exists. Together these steer traffic to the optimized provider.
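The /24-to-/25 split mentioned above can be reproduced with Python's standard `ipaddress` module. The helper name is hypothetical; it only illustrates how the more-specific prefixes are derived.

```python
import ipaddress


def more_specifics(prefix):
    """Split a prefix into its two halves (one bit longer)."""
    net = ipaddress.ip_network(prefix)
    return [str(subnet) for subnet in net.subnets(prefixlen_diff=1)]


halves = more_specifics("203.0.113.0/24")
# halves == ["203.0.113.0/25", "203.0.113.128/25"]
```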

Local Preference

Injected routes use a configurable local-preference value (default: 200) that is higher than the standard local-preference of provider-learned routes (typically 100). Because local preference is compared before AS-path length, origin, and MED in standard BGP best-path selection, the optimization route is preferred within your AS over any other route for the same prefix. Different local-preference values can be configured per provider to create a preference hierarchy.

BGP Communities

Each injected route is tagged with BGP communities that identify it as a Routelock optimization. These communities serve multiple purposes: they help operators identify optimized routes in router tables, they can be used in route-map filters on border routers, and they enable automated tooling to track which routes are managed by Routelock.

# Default community tagging
65000:10000  - Routelock managed route
65000:XXXX   - Provider identifier
65000:200    - High-priority optimization
65000:100    - Standard optimization

Rate Limiting

Injections are rate-limited to prevent overwhelming the routing infrastructure. The default maximum injection rate is 50 routes per minute. During initial deployment or after a mass withdrawal, the queue may build up; routes are injected in priority order (highest traffic volume first). The rate limit applies globally across all providers.
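Priority ordering of the pending queue can be modeled with a heap keyed on traffic volume. This is a small sketch with a hypothetical function name; the real queue presumably carries more state than a prefix and a rate.

```python
import heapq


def injection_order(pending):
    """Yield prefixes from (prefix, traffic_bps) pairs, highest traffic first."""
    # heapq is a min-heap, so negate the rate to pop the largest first.
    heap = [(-bps, prefix) for prefix, bps in pending]
    heapq.heapify(heap)
    while heap:
        _, prefix = heapq.heappop(heap)
        yield prefix
```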

Important: Never manually edit BIRD configuration files in the routelock.d directory. Routelock manages these files automatically and manual changes will be overwritten on the next configuration cycle.

Route Withdrawal

When and why optimized routes are removed, including TTL expiry and manual withdrawal

Automatic Withdrawal

Routes injected by Routelock are not permanent. They are automatically withdrawn under several conditions to ensure the routing table always reflects current network conditions:

TTL Expiry: Every improvement has a time-to-live (default 3600 seconds). When the TTL expires, the route is withdrawn and the prefix returns to the probing pool for re-evaluation. If the optimization is still beneficial, a new improvement will be created, subject to the anti-flap cooldown.
Performance Degradation: If post-injection probes detect that the optimized path has degraded below acceptable thresholds, the route is immediately withdrawn. This can happen when a provider experiences congestion or an outage.
BGP Session Down: If the BGP session to the target provider drops, all routes using that provider are immediately withdrawn. Traffic falls back to the default BGP best-path.
Provider Disabled: When an operator disables a provider through the UI or API, all active improvements using that provider are withdrawn.
Maintenance Window: Scheduled maintenance windows can trigger bulk withdrawal for affected providers or prefixes.

Manual Withdrawal

Operators can manually withdraw individual improvements or perform bulk withdrawals through the web UI or API. Manual withdrawals take effect immediately and trigger the anti-flap cooldown period for the affected prefix-provider pairing.

# Withdraw a single improvement
DELETE /api/v1/improvements/{id}

# Bulk withdraw all improvements for a provider
POST /api/v1/improvements/bulk-withdraw
{"provider_id": 3}

# Withdraw all improvements (emergency)
POST /api/v1/improvements/withdraw-all

Withdrawal Behavior

When a route is withdrawn, Routelock removes the corresponding entry from the BIRD configuration fragment and triggers a soft reconfiguration. The withdrawal propagates to BGP peers within seconds. Traffic for the affected prefix reverts to the default BGP best-path. The improvement record is retained in the database with a withdrawn or expired status for historical reporting.

Commit Control

95th percentile management and traffic balancing across provider commits

Understanding Commit-Based Billing

Most transit providers bill based on the 95th percentile of traffic utilization measured over the billing period (typically monthly). This means that for each 5-minute interval, the average throughput is recorded, and at the end of the month, the top 5% of samples are discarded. The next highest value becomes the billable rate. Going significantly over the committed data rate (CDR) incurs expensive overage charges, while staying well under it means you are paying for unused capacity.
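The calculation described above is short enough to show directly. The Python sketch below sorts the 5-minute samples, discards the top 5%, and returns the next highest value. Rounding the discard count down is an assumption; providers differ on this detail, so check your contract.

```python
def billable_rate_95th(samples_mbps):
    """Return the 95th percentile billing rate from 5-minute average samples."""
    if not samples_mbps:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_mbps)
    discard = len(ordered) * 5 // 100  # top 5% of samples, rounded down
    return ordered[len(ordered) - discard - 1]
```

For 100 samples valued 1 through 100 Mbps, the five highest (96 to 100) are discarded and the billable rate is 95 Mbps. A 30-day month has 8,640 five-minute samples, so the 432 highest are ignored.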

How Routelock Manages Commits

Routelock tracks the rolling 95th percentile for each provider in real time, calculated from SNMP interface counters or NetFlow aggregates. The commit control module compares each provider's current 95th percentile against configurable high and low thresholds relative to their committed rate.

Threshold	Default	Action
Rate High	85% of commit	Stop sending more traffic to this provider; actively drain if possible
Rate Low	50% of commit	Prefer this provider for optimizations to increase utilization

Cost-Aware Optimization

When operating in cost mode or with cost awareness enabled, the optimization engine factors commit utilization into its routing decisions. If Provider A offers 10ms better latency but is already at 90% of commit, while Provider B is at 40% of commit with only 10ms more latency, cost mode may select Provider B to avoid overage charges on Provider A while bringing Provider B closer to its committed utilization.

commit_control:
  enabled: true
  rate_high_pct: 85
  rate_low_pct: 50
  billing_day: 1          # Day of month billing period starts
  sample_interval: 300    # 5-minute samples (standard)
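The threshold logic behind this configuration can be expressed as a small classifier. In this sketch the function name and the returned labels are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
def commit_action(p95_mbps, commit_mbps, rate_high_pct=85, rate_low_pct=50):
    """Classify a provider by its 95th percentile relative to its commit."""
    utilization = 100 * p95_mbps / commit_mbps
    if utilization >= rate_high_pct:
        return "drain"    # stop adding traffic; move traffic away if possible
    if utilization <= rate_low_pct:
        return "prefer"   # favor this provider to raise utilization
    return "neutral"
```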

Billing Period Tracking

The dashboard displays each provider's current 95th percentile, projected end-of-month 95th percentile, commit utilization percentage, and estimated cost. Historical billing data is retained for trend analysis and capacity planning.

DDoS Detection

EWMA baselines, threshold triggers, anomaly detection, and severity levels

Detection Architecture

Routelock's DDoS detection engine continuously analyzes NetFlow data to identify volumetric attacks targeting your network. Unlike signature-based systems that rely on known attack patterns, Routelock uses statistical anomaly detection based on Exponentially Weighted Moving Averages (EWMA) to establish dynamic traffic baselines and detect deviations that indicate an attack in progress.

EWMA Baselines

For each monitored prefix, Routelock maintains EWMA baselines for bytes per second, packets per second, and flows per second. The EWMA algorithm gives more weight to recent observations while smoothing out normal traffic fluctuations. The smoothing factor (alpha, default 0.1) controls how quickly the baseline adapts to gradual traffic changes. A lower alpha means the baseline is more stable but slower to adapt; a higher alpha makes it more responsive but more prone to false positives.

baseline(t) = α × observation(t) + (1 - α) × baseline(t-1)

# With α = 0.1:
# Recent observation contributes 10% to the new baseline
# Historical average contributes 90%
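The update rule translates directly into code. The Python sketch below applies the formula once and then repeatedly, to show how slowly a baseline with α = 0.1 follows a sustained change. The function name is hypothetical.

```python
def update_baseline(baseline, observation, alpha=0.1):
    """One EWMA step: baseline(t) = alpha * observation(t) + (1 - alpha) * baseline(t-1)."""
    return alpha * observation + (1 - alpha) * baseline


# A single 200 Mbps observation moves a 100 Mbps baseline to about 110 Mbps.
one_step = update_baseline(100.0, 200.0)

# Ten consecutive 200 Mbps observations still leave the baseline well below 200.
baseline = 100.0
for _ in range(10):
    baseline = update_baseline(baseline, 200.0)
```

After ten intervals the baseline has closed roughly 65% of the gap (1 - 0.9^10), which is why a sudden attack stands out against it.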

Threshold Triggers

An alert is triggered when the current traffic rate exceeds the EWMA baseline by a configurable multiplier. The default multipliers define severity levels:

Severity	Multiplier	Example (baseline 100 Mbps)
Low	3x	Traffic exceeds 300 Mbps
Medium	5x	Traffic exceeds 500 Mbps
High	10x	Traffic exceeds 1 Gbps
Critical	20x	Traffic exceeds 2 Gbps
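The severity table can be read as a lookup from the ratio of current traffic to baseline. A minimal sketch, assuming "exceeds" means strictly greater than the multiplier and using a hypothetical function name:

```python
SEVERITY_MULTIPLIERS = [("critical", 20), ("high", 10), ("medium", 5), ("low", 3)]


def classify_severity(current_rate, baseline_rate):
    """Return the highest severity whose multiplier is exceeded, or None."""
    if baseline_rate <= 0:
        return None  # no usable baseline yet
    ratio = current_rate / baseline_rate
    for name, multiplier in SEVERITY_MULTIPLIERS:
        if ratio > multiplier:
            return name
    return None
```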

Anomaly Detection

Beyond simple threshold triggers, the engine performs protocol distribution analysis. A sudden shift in protocol mix (e.g., 90% UDP when the baseline is 30% UDP) indicates a potential amplification attack even if the total volume is below the threshold multiplier. Similarly, a spike in packets-per-second without a corresponding byte increase suggests a small-packet flood designed to exhaust router CPU rather than bandwidth.

Detection Pipeline

NetFlow records aggregated per destination prefix per interval
Current rates compared against EWMA baselines
Protocol distribution analyzed for anomalies
If thresholds exceeded, DDoS event created with severity and attack classification
WebSocket event fires; alert sent to configured channels
Mitigation engine evaluates response options based on severity and policy

DDoS Mitigation

RTBH blackholing, FlowSpec rules, and automated vs manual mitigation strategies

Mitigation Options

When a DDoS attack is detected, Routelock offers multiple mitigation strategies that can be applied individually or in combination. The appropriate strategy depends on the attack type, severity, and your network's capability.

RTBH (Remotely Triggered Black Hole)

RTBH is the fastest and most widely supported mitigation method. Routelock injects a BGP route for the targeted prefix with a well-known blackhole community (e.g., 65535:666), causing upstream providers to drop all traffic destined for the target. While effective at stopping the attack, RTBH also drops legitimate traffic. It is best suited for critical severity attacks where the target is already unreachable and the priority is protecting the rest of the network from collateral damage.

FlowSpec (BGP Flow Specification)

FlowSpec provides surgical mitigation by describing specific traffic patterns to filter. Routelock can inject FlowSpec rules that match attack traffic by protocol, port, packet size, and other attributes while allowing legitimate traffic to pass. FlowSpec requires router support (RFC 8955, which obsoletes RFC 5575) and is more sophisticated than RTBH. It is ideal for medium and high severity attacks where the attack traffic has identifiable characteristics.

XDP/eBPF Scrubbing

For networks where the Routelock server sits in the traffic path, the integrated XDP/eBPF scrubber provides line-rate packet filtering without involving the kernel networking stack. This is the most granular mitigation option, capable of filtering based on complex rules including rate limiting, geographic filtering, and protocol validation. See the dedicated XDP/eBPF Scrubber article for details.

Automatic vs Manual Mitigation

Mitigation can be configured to trigger automatically based on severity thresholds or require manual approval. The default configuration auto-mitigates only critical severity events with RTBH, while lower severities generate alerts for operator review. This behavior is fully configurable per severity level.

ddos:
  auto_mitigate:
    critical: rtbh       # Auto-blackhole critical attacks
    high: flowspec       # Auto-inject FlowSpec for high severity
    medium: alert        # Alert only for medium
    low: alert           # Alert only for low
  rtbh_community: "65535:666"
  flowspec_enabled: true
  scrubber_enabled: false  # Enable if server is inline

XDP/eBPF Scrubber

Inline packet filtering at wire speed using XDP and eBPF programs

What is XDP?

XDP (eXpress Data Path) is a Linux kernel technology that allows packet processing programs to run at the earliest point in the network stack—before the kernel allocates any socket buffers. eBPF (extended Berkeley Packet Filter) is the programmable bytecode that XDP programs are written in. Together, they enable line-rate packet filtering with minimal CPU overhead, making them ideal for DDoS scrubbing at speeds of 10 Gbps and beyond on commodity hardware.

Routelock's Scrubber Architecture

The Routelock XDP scrubber attaches eBPF programs to network interfaces to filter malicious traffic before it reaches the kernel. When a DDoS event is detected, the mitigation engine can push filtering rules to the XDP program via eBPF maps. These rules take effect immediately (within microseconds) and operate at line rate without consuming significant CPU resources.

Rule Types

Rule Type	Description
IP Blocklist	Drop all traffic from specific source IPs or prefixes
Protocol Filter	Drop specific protocols (e.g., all UDP to port 53 during DNS amplification)
Rate Limit	Per-source-IP packet rate limiting using token bucket algorithm
Packet Size	Drop packets outside expected size ranges (e.g., drop >1400 byte UDP)
GeoIP Filter	Drop traffic from specific countries using embedded GeoIP database
SYN Cookie	Validate TCP connections with SYN cookies to stop SYN floods
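The Rate Limit rule type relies on a token bucket. The actual scrubber runs as an eBPF program, so the Python model below only illustrates the algorithm: the class name, the `burst` parameter, and passing time explicitly are all choices made for the sketch.

```python
class TokenBucket:
    """Per-source token bucket: refills at rate_pps, holds at most burst tokens."""

    def __init__(self, rate_pps, burst):
        self.rate = rate_pps
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return True if a packet arriving at time `now` (seconds) may pass."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket would be kept per source IP; in eBPF that state typically lives in a hash map keyed by the source address.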

Multi-NIC Redirect

In a scrubbing topology, the XDP program can redirect clean traffic from the ingress interface to an egress interface using XDP_REDIRECT. This enables a bump-in-the-wire deployment where the Routelock server sits between the upstream router and internal network, scrubbing traffic transparently. Dirty traffic is dropped at the XDP layer; clean traffic is forwarded at line rate.

# Enable scrubber on interface
POST /api/v1/scrubber/enable
{"interface": "eth1", "mode": "xdp_native"}

# Add a filtering rule
POST /api/v1/scrubber/rules
{"type": "rate_limit", "src_prefix": "0.0.0.0/0",
 "protocol": "udp", "dst_port": 53, "pps_limit": 10000}

Kernel Requirement: XDP native mode requires kernel 5.15+ and a network driver that supports XDP. Most modern NICs (Intel, Mellanox) support native XDP. Generic/SKB mode works on all drivers but with reduced performance.

Scrubber Redirect Path (BGP Steering)

On-demand BGP-driven redirect of attack traffic through a Linux XDP scrubber, leaving normal customer traffic on the production path.

The Problem with Always-Inline Scrubbing

Putting a scrubber inline with all customer traffic creates two problems: the scrubber must scale to peak normal packet rate (not just peak attack rate), and a scrubber failure becomes a customer outage. DDoS protection services such as Cloudflare Magic Transit, AWS Shield Advanced, and Akamai Prolexic address this with on-demand redirection: the scrubber is out of path until an attack is detected, then traffic for the targeted destination is steered through it via a more-specific BGP announcement. Routelock implements the same pattern with a Linux XDP scrubber acting as a transparent bridge between two router ports.

Architecture

Two physical paths between each edge router and the core:

Path A (production): existing direct ER↔CR LAG. Carries 100% of normal customer traffic. The scrubber is not in this path.
Path B (scrubbing): ER → scrubber agent (XDP transparent bridge) → CR. Used only when a redirect is active for a specific destination.

The scrubber agent is a bare-metal Linux server with a Mellanox ConnectX-6 Dx 100G NIC. Both ports of the NIC are joined by a small XDP/eBPF program that does line-rate L2 forwarding using BPF_MAP_TYPE_DEVMAP redirects, plus per-port packet/byte telemetry counters in BPF_MAP_TYPE_PERCPU_HASH. No IP, no routing, no control plane on the scrubber — it's a pure bump-in-the-wire.

BGP Community Signaling

Routelock uses two communities on its iBGP session from brain to edge:

Community	Meaning	Edge action
11878:6660	Scrubber redirect	Rewrite next-hop to scrubber-bundle's CR-side address; local-pref 5000 to beat all eBGP paths
11878:6100	Route optimization signal (Noction-style)	Boost local-pref to 500, preserve announced next-hop

Edge Policy

The edge router's inbound policy from brain matches the community and conditionally rewrites the next-hop. On IOS XR (NCS-5500):

route-policy rl-deny-in-4
  if community matches-any rl-comm-scrubber then
    if destination in rl-scrubber-targets-v4 then
      set next-hop 10.255.0.2
      set local-preference 5000
      pass
    else
      drop
    endif
  elseif community matches-any rl-comm-optimize then
    set local-preference 500
    pass
  else
    drop
  endif
end-policy

The rl-scrubber-targets-v4 prefix-set is a defense-in-depth measure: it limits which prefixes can be redirected, even if the community is set. The brain's protected_prefixes table is the operator-facing gate; the prefix-set is the router-side enforcement.
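To reason about the policy without a router, its decision logic can be modeled in Python. This is a simplified sketch: it assumes the prefix-set entries match any more-specific prefix they cover (on IOS XR that depends on the `le`/`ge` qualifiers in the prefix-set), and the function name and return shape are hypothetical.

```python
import ipaddress

SCRUBBER = "11878:6660"
OPTIMIZE = "11878:6100"


def edge_policy(prefix, communities, scrubber_targets):
    """Mirror the route-policy branches: scrubber redirect, optimization, or drop."""
    net = ipaddress.ip_network(prefix)
    if SCRUBBER in communities:
        targets = [ipaddress.ip_network(t) for t in scrubber_targets]
        # Compare versions first: subnet_of() raises TypeError on mixed IPv4/IPv6.
        if any(net.version == t.version and net.subnet_of(t) for t in targets):
            return {"action": "pass", "next_hop": "10.255.0.2", "local_pref": 5000}
        return {"action": "drop"}
    if OPTIMIZE in communities:
        return {"action": "pass", "local_pref": 500}
    return {"action": "drop"}
```

A /32 tagged with the scrubber community but outside the target list is dropped, which is the router-side safeguard the paragraph above describes.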

Brain BIRD Configuration

The brain runs BIRD 2.x with two key safeguards:

Dedicated tables (redirects4 / redirects6) fed by protocol static. A one-way protocol pipe copies routes from these tables into master4/master6 only if they carry community 11878:6660. This keeps the rest of the route table clean.
Kernel export filter drops community-tagged routes on export to the OS routing table, so the brain itself never installs an unreachable route for a customer host. The brain must keep its normal route to that host.

Failure Behavior

Why this design is safe: if the scrubber crashes, kernel-panics, or loses power, only redirected traffic is affected. Path A continues carrying every other customer flow. The brain monitors XDP packet counters and auto-withdraws redirects if the scrubber stops forwarding, falling traffic back to Path A. Hardware fail-to-wire bypass NICs are not required.

Telemetry

Per-port packet and byte counters are exposed via bpftool map dump name stats on the scrubber agent. The mlx5 driver's ethtool -S exposes rx_xdp_redirect, tx_xdp_xmit, and rx_xdp_drop per receive queue. In-bridge packet drops (mismatch between RX redirect and TX xmit on the peer port) indicate per-CPU XDP TX queue exhaustion or driver issues; these should be zero in steady state.

Limitations

MTU 1500 today. Multi-buffer XDP on mlx5 with kernel 6.1 is not fully wired for jumbo frames (>3500 bytes). Internet PMTU is 1500 anyway.
No BFD on Path B yet. Failure detection currently relies on BGP session state. Sub-second failure detection is on the roadmap.
No LACP on Path B. 802.1D bars bridges from forwarding slow-protocols multicast (01:80:C2:00:00:02), so static aggregation is required. Multi-scrubber LACP is possible later with custom XDP code that explicitly forwards slow-protocols frames.
Stateless filtering only. Asymmetric return paths are tolerated because the scrubber doesn't track flow state. Stateful filters (conntrack-style) would require active/standby pairs with state sync.

Scrubber Clustering

Multi-node scrubber synchronization and peer health monitoring

Why Cluster?

A single scrubber node may not have sufficient capacity to handle large-scale DDoS attacks, or it may represent a single point of failure. Routelock supports scrubber clustering, where multiple XDP-enabled nodes work together to distribute scrubbing load and provide redundancy. The cluster maintains synchronized rule sets so that any node can filter the same attack traffic.

Cluster Architecture

Scrubber clusters use a primary-replica model for rule distribution. The Routelock server acts as the control plane, pushing rules to all cluster members simultaneously. Each scrubber node runs a lightweight agent that receives rule updates over a gRPC channel and applies them to the local XDP program. Rule updates are transactional: an update is committed only once the required quorum of nodes acknowledges it (see Rule Synchronization); otherwise it is rolled back.

Peer Health Checks

Each cluster node sends heartbeat messages to the control plane every 5 seconds. If a node misses 3 consecutive heartbeats (15 seconds), it is marked as unhealthy and traffic should be rerouted to healthy nodes using your upstream load balancing or ECMP configuration. The health check includes CPU utilization, packet processing rate, and drop counters to detect nodes that are alive but overwhelmed.
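The missed-heartbeat rule amounts to a timeout of interval × threshold. A minimal Python sketch of the control-plane side, with hypothetical class and method names and explicit timestamps for testability:

```python
class HeartbeatMonitor:
    """Marks a node unhealthy after unhealthy_threshold missed heartbeats."""

    def __init__(self, interval=5.0, unhealthy_threshold=3):
        self.timeout = interval * unhealthy_threshold  # 15 s with the defaults
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def unhealthy(self, now):
        """Return the nodes whose last heartbeat is at least `timeout` old."""
        return sorted(n for n, t in self.last_seen.items() if now - t >= self.timeout)
```

This covers only liveness; the CPU, packet-rate, and drop-counter checks mentioned above would need additional fields in the heartbeat message.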

Rule Synchronization

When a new mitigation rule is created (either automatically by the DDoS detection engine or manually by an operator), the control plane distributes it to all healthy cluster members in parallel. Each node acknowledges the rule installation, and the rule is not considered active until a quorum (default: majority) of nodes confirm. This prevents split-brain scenarios where some nodes are filtering and others are not.

scrubber:
  cluster:
    enabled: true
    nodes:
      - address: "10.0.1.10:9090"
        interfaces: ["eth1", "eth2"]
      - address: "10.0.1.11:9090"
        interfaces: ["eth1", "eth2"]
    heartbeat_interval: 5s
    unhealthy_threshold: 3
    rule_quorum: majority
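With rule_quorum set to majority, the activation check reduces to simple arithmetic. The helper below is a hypothetical sketch of that check only; it does not model rollback.

```python
def rule_is_active(acks, cluster_size):
    """True once a strict majority of cluster nodes has acknowledged the rule."""
    needed = cluster_size // 2 + 1
    return acks >= needed
```

Note that in the two-node cluster shown above, a majority means both nodes must acknowledge; a third node is needed before the cluster can tolerate one unresponsive member.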

FlowSpec Rules

BGP Flow Specification for surgical DDoS mitigation

What is FlowSpec?

BGP Flow Specification (FlowSpec), originally defined in RFC 5575 and now specified by RFC 8955 (which obsoletes it), extends BGP to carry traffic filtering rules alongside routing information. Instead of blackholing an entire prefix (RTBH), FlowSpec allows you to describe specific traffic patterns (by protocol, port, packet size, DSCP, fragment flags, and more) and instruct routers to drop, rate-limit, or redirect matching traffic. This enables surgical mitigation that stops attack traffic while preserving legitimate services.

How Routelock Uses FlowSpec

When the DDoS detection engine classifies an attack, it automatically maps the attack characteristics to FlowSpec rules. For example, a DNS amplification attack, in which reflected responses arrive from source port 53 as large UDP packets, generates a FlowSpec rule matching UDP source port 53 with packet size > 512 bytes. These rules are injected into BIRD, which propagates them via BGP to all FlowSpec-capable routers in your network.

Attack Type Mappings

Attack Type	FlowSpec Match	Action
DNS Amplification	UDP src-port 53, length >512	Drop
NTP Amplification	UDP src-port 123, length >468	Drop
SSDP Amplification	UDP src-port 1900	Drop
SYN Flood	TCP flags SYN, no ACK	Rate-limit
UDP Flood	UDP, specific dst-port	Rate-limit
ICMP Flood	ICMP type 8	Rate-limit 1000pps
Fragment Flood	Fragment flag set	Drop
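The amplification rows of this table can be expressed as templates that are filled in with the attacked prefix. The sketch below borrows its field names from the rule-creation request body shown under Rule Management; the attack-type keys and the function name are hypothetical, and only the three rows with fully specified matches are included.

```python
FLOWSPEC_TEMPLATES = {
    "dns_amplification": {"protocol": "udp", "src_port": 53, "min_length": 512, "action": "drop"},
    "ntp_amplification": {"protocol": "udp", "src_port": 123, "min_length": 468, "action": "drop"},
    "ssdp_amplification": {"protocol": "udp", "src_port": 1900, "action": "drop"},
}


def build_rule(attack_type, dst_prefix, expires_in="1h"):
    """Return a rule body for the given attack type and target prefix."""
    rule = dict(FLOWSPEC_TEMPLATES[attack_type])  # copy so the template stays untouched
    rule["dst_prefix"] = dst_prefix
    rule["expires_in"] = expires_in
    return rule
```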

Rule Management

# List active FlowSpec rules
GET /api/v1/flowspec/rules

# Create a manual FlowSpec rule
POST /api/v1/flowspec/rules
{
  "dst_prefix": "203.0.113.0/24",
  "protocol": "udp",
  "src_port": 53,
  "min_length": 512,
  "action": "drop",
  "expires_in": "1h"
}

# Delete a FlowSpec rule
DELETE /api/v1/flowspec/rules/{id}

Expiration and Cleanup

FlowSpec rules created by the auto-mitigation engine have a configurable TTL (default: 1 hour). When the DDoS detection engine confirms the attack has subsided (traffic returns to within 1.5x of baseline for 10 consecutive minutes), the associated FlowSpec rules are automatically withdrawn. Manual rules can have custom expiration times or be set to persist indefinitely until explicitly removed.
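The "attack has subsided" condition can be checked against a list of per-minute traffic rates. A minimal sketch with a hypothetical function name, assuming one sample per minute and an inclusive 1.5x bound:

```python
def attack_subsided(per_minute_rates, baseline, factor=1.5, quiet_minutes=10):
    """True if the last quiet_minutes samples are all within factor x baseline."""
    recent = per_minute_rates[-quiet_minutes:]
    return len(recent) == quiet_minutes and all(r <= factor * baseline for r in recent)
```

A single sample above the bound resets the wait, because the ten quiet minutes must be consecutive.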

User Roles (RBAC)

Admin, operator, and viewer permissions explained

Role-Based Access Control

Routelock implements role-based access control (RBAC) with three predefined roles that govern what actions a user can perform. Every user is assigned exactly one role, which determines their access to API endpoints, web UI features, and operational capabilities. Roles are assigned during user creation and can be changed by administrators at any time.

Role Definitions

Role	Description	Key Capabilities
Admin	Full system access	User management, configuration changes, provider management, approval/rejection, DDoS mitigation, system settings, API key management
Operator	Operational access	View all data, approve/reject pending changes, manually withdraw routes, acknowledge alerts, manage maintenance windows, trigger manual probes
Viewer	Read-only access	View dashboard, reports, alerts, improvements, traffic data. Cannot make any changes or approve proposals

Permission Matrix

The following table shows key actions and which roles can perform them:

Action	Admin	Operator	Viewer
View dashboard & reports	Yes	Yes	Yes
Approve/reject changes	Yes	Yes	No
Withdraw routes	Yes	Yes	No
Manage providers	Yes	No	No
Change operating mode	Yes	No	No
Manage users	Yes	No	No
System configuration	Yes	No	No
DDoS mitigation actions	Yes	Yes	No
Manage API keys	Yes	Own only	No

API Enforcement

RBAC is enforced at the API middleware level. Every request is checked against the user's role before the handler executes. Authenticated requests that lack the required role receive a 403 Forbidden response with a descriptive error message indicating the required role. Role checks are performed after authentication (JWT or API key validation) and before any business logic.
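The permission matrix maps naturally onto a lookup table consulted by the middleware. In the Python sketch below, the action keys and the function name are hypothetical, and the "own only" API-key rule is left out because it needs an ownership check in addition to the role.

```python
PERMISSIONS = {
    "view_dashboard": {"admin", "operator", "viewer"},
    "approve_changes": {"admin", "operator"},
    "withdraw_routes": {"admin", "operator"},
    "ddos_mitigation": {"admin", "operator"},
    "manage_providers": {"admin"},
    "change_operating_mode": {"admin"},
    "manage_users": {"admin"},
    "system_configuration": {"admin"},
}


def check_access(role, action):
    """Return (status, error): (200, None) if allowed, else (403, message)."""
    allowed = PERMISSIONS.get(action, set())  # unknown actions are denied
    if role in allowed:
        return 200, None
    required = ", ".join(sorted(allowed)) or "no role"
    return 403, f"requires one of: {required}"
```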

JWT Authentication

How JSON Web Tokens secure the Routelock API and web interface

How JWT Works in Routelock

Routelock uses JSON Web Tokens (JWT) for stateless authentication of API requests and web UI sessions. When a user logs in with valid credentials, the server issues an access token and a refresh token. The access token is a signed JWT containing the user's ID, role, and expiration time. It is included in the Authorization: Bearer header of all subsequent API requests.

Token Lifecycle

# Login
POST /api/v1/auth/login
{"username": "admin", "password": "secret"}

# Response
{
  "access_token": "eyJhbG...",    # Short-lived (15 min default)
  "refresh_token": "eyJhbG...",   # Long-lived (7 days default)
  "expires_in": 900
}

# Refresh
POST /api/v1/auth/refresh
{"refresh_token": "eyJhbG..."}

Token Claims

The JWT access token contains standard claims (iss, sub, exp, iat) plus custom claims for the user's role, username, and session ID. The token is signed using HMAC-SHA256 with a server-side secret key. Tokens cannot be tampered with without invalidating the signature.

Session Management

While JWTs are stateless by design, Routelock maintains a session registry for security features like concurrent session limits, forced logout, and token revocation. Each user is limited to a configurable number of concurrent sessions (default: 5). When the limit is reached, the oldest session is revoked. Administrators can force-logout any user, which invalidates all their active tokens.

Security Considerations

Short expiry: Access tokens expire after 15 minutes by default, limiting the window of exposure if a token is compromised
Refresh rotation: Each refresh generates a new refresh token and invalidates the old one, preventing replay attacks
HTTPS only: Tokens are only transmitted over TLS; the Secure flag is set on cookies
IP binding (optional): Tokens can be bound to the client IP, rejecting requests from different IPs

API Key Authentication

Creating and managing long-lived API keys for programmatic access

Overview

API keys provide an alternative to JWT authentication for programmatic and machine-to-machine access to the Routelock API. Unlike JWT tokens which expire frequently and require credential exchange, API keys are long-lived tokens that can be used directly in request headers. They are ideal for monitoring scripts, automation tools, and integrations that need persistent access without interactive login flows.

Creating API Keys

API keys are created through the web UI (Settings → API Keys) or via the API itself. Each key is associated with a user account and inherits that user's role permissions. Keys can have optional descriptions, IP restrictions, and expiration dates.

# Create an API key
POST /api/v1/auth/api-keys
{
  "name": "Monitoring Script",
  "expires_at": "2025-12-31T23:59:59Z",  # Optional
  "allowed_ips": ["10.0.0.0/8"]           # Optional IP restriction
}

# Response (key shown ONCE, store securely)
{
  "id": "ak_abc123",
  "key": "rl_live_k1_aBcDeFgHiJkLmNoPqRsT...",
  "name": "Monitoring Script",
  "created_at": "2025-01-15T10:00:00Z"
}

Using API Keys

Include the API key in the X-API-Key header of your requests:

curl -H "X-API-Key: rl_live_k1_aBcDeFgH..." https://routelock.example.com/api/v1/providers
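The same request can be built from a script using only Python's standard library. The helper below constructs the request without sending it; the function name is hypothetical, and the key should come from a secret store or environment variable rather than being hard-coded.

```python
import urllib.request


def api_request(base_url, path, api_key):
    """Build a GET request carrying the API key in the X-API-Key header."""
    request = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    request.add_header("X-API-Key", api_key)
    return request


req = api_request("https://routelock.example.com", "/api/v1/providers", "rl_live_k1_example")
# To send it: urllib.request.urlopen(req, timeout=10)
```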

Key Management

Administrators can view and revoke any API key in the system. Operators can manage only their own keys. Keys can be rotated by creating a new key and deleting the old one. The audit log records all API key creation, usage, and revocation events. Keys that have not been used in 90 days are flagged as stale in the UI.

Security: API keys are shown only once at creation time. They are stored as bcrypt hashes in the database and cannot be retrieved. If a key is lost, create a new one and delete the old one.

LDAP/Active Directory

Configuring LDAP authentication with group-to-role mapping

Overview

Routelock supports LDAP and Active Directory (AD) authentication, allowing users to log in with their corporate directory credentials. When LDAP is enabled, Routelock validates credentials against the LDAP server rather than its local user database. LDAP groups can be mapped to Routelock roles for automatic role assignment, eliminating the need to manually configure permissions for each user.

Configuration

auth:
  ldap:
    enabled: true
    url: "ldaps://ad.company.com:636"
    bind_dn: "CN=routelock-svc,OU=Service Accounts,DC=company,DC=com"
    bind_password: "${LDAP_BIND_PASSWORD}"
    base_dn: "OU=Users,DC=company,DC=com"
    user_filter: "(&(objectClass=user)(sAMAccountName={{username}}))"
    group_filter: "(&(objectClass=group)(member={{user_dn}}))"
    group_mappings:
      "CN=Network-Admins,OU=Groups,DC=company,DC=com": admin
      "CN=NOC-Operators,OU=Groups,DC=company,DC=com": operator
      "CN=NOC-Viewers,OU=Groups,DC=company,DC=com": viewer
    default_role: viewer     # Role when no group matches
    tls_skip_verify: false
    timeout: 10s

Authentication Flow

User submits username and password to the login endpoint
Routelock binds to LDAP using the service account credentials
Searches for the user entry matching the provided username
Attempts to bind as the found user with the provided password
On success, queries group membership to determine role
Creates or updates the local user record with the LDAP-derived role
Issues JWT tokens as with normal authentication
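
The role-determination step can be sketched as follows. The mapping mirrors the sample configuration above; the precedence rule (highest-privilege match wins when a user belongs to several mapped groups) and the case-insensitive DN comparison are assumptions made for illustration, not documented behaviour.

```python
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}  # assumed precedence

GROUP_MAPPINGS = {
    "CN=Network-Admins,OU=Groups,DC=company,DC=com": "admin",
    "CN=NOC-Operators,OU=Groups,DC=company,DC=com": "operator",
    "CN=NOC-Viewers,OU=Groups,DC=company,DC=com": "viewer",
}


def resolve_role(user_groups, mappings=GROUP_MAPPINGS, default_role="viewer"):
    """Pick the highest-ranked mapped role, or default_role when nothing matches."""
    # Compare DNs case-insensitively (assumption for this sketch).
    normalized = {dn.lower(): role for dn, role in mappings.items()}
    matched = [normalized[g.lower()] for g in user_groups if g.lower() in normalized]
    if not matched:
        return default_role
    return max(matched, key=ROLE_RANK.__getitem__)
```

A user who is a member of both NOC-Viewers and NOC-Operators would resolve to operator under this rule, and a user with no mapped group receives default_role.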

Fallback Behavior

When LDAP is enabled, local authentication can be configured as a fallback. If the LDAP server is unreachable, Routelock can fall back to local password verification for accounts that have local passwords set. This ensures administrators can still access the system during LDAP outages. The built-in admin account always supports local authentication as a safety net.

SSO (Google & Microsoft)

OAuth2/OIDC single sign-on with auto-provisioning

Overview

Routelock supports Single Sign-On (SSO) via Google Workspace and Microsoft Entra ID (formerly Azure AD) using the OAuth2/OpenID Connect (OIDC) protocol. SSO enables users to log in with their existing Google or Microsoft corporate accounts, eliminating the need for separate Routelock passwords and providing a seamless authentication experience.

OAuth2/OIDC Flow

User clicks "Sign in with Google/Microsoft" on the login page
Browser redirects to the identity provider's authorization endpoint
User authenticates with their corporate account (may include MFA)
Identity provider redirects back to Routelock's callback URL with an authorization code
Routelock exchanges the code for an ID token and access token
Routelock validates the ID token, extracts user info (email, name, groups)
User is created or updated locally and issued Routelock JWT tokens

Configuration

auth:
  sso:
    google:
      enabled: true
      client_id: "123456789.apps.googleusercontent.com"
      client_secret: "${GOOGLE_CLIENT_SECRET}"
      allowed_domains: ["company.com"]
      default_role: viewer
    microsoft:
      enabled: true
      client_id: "abcdef-1234-5678-..."
      client_secret: "${MICROSOFT_CLIENT_SECRET}"
      tenant_id: "your-tenant-id"
      allowed_groups: ["Network-Admins", "NOC-Team"]
      group_mappings:
        "Network-Admins": admin
        "NOC-Operators": operator
      default_role: viewer

Auto-Provisioning

When a user logs in via SSO for the first time, Routelock automatically creates a local user account based on the identity provider's claims. The user's email becomes their username, and their role is determined by group mappings (if configured) or the default role. Auto-provisioned users cannot set local passwords—they must always authenticate via SSO. Administrators can override the auto-assigned role after the account is created.

Domain Restrictions

For Google SSO, the allowed_domains setting restricts login to users from specific Google Workspace domains, preventing unauthorized access from personal Gmail accounts. For Microsoft SSO, the tenant_id setting restricts login to users from your organization's Entra ID tenant.
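
A minimal sketch of a domain check on a validated Google ID token is shown below. The hd (hosted domain), email, and email_verified claims are standard Google ID token claims; whether Routelock evaluates exactly these claims is an assumption, and the function name is hypothetical.

```python
def is_allowed_google_user(claims, allowed_domains):
    """Accept only verified accounts whose hosted domain is on the allow list.

    `claims` is the already-validated ID token payload as a dict.
    Personal Gmail accounts carry no "hd" claim, so they are rejected.
    """
    allowed = {d.lower() for d in allowed_domains}
    hosted_domain = (claims.get("hd") or "").lower()
    return claims.get("email_verified") is True and hosted_domain in allowed
```

The check must run only after the token's signature, issuer, and audience have been validated; otherwise the claims themselves cannot be trusted.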

Two-Factor Authentication

Email-based 2FA setup and verification flow

Overview

Routelock supports email-based two-factor authentication (2FA) as an additional security layer. When 2FA is enabled for a user, they must provide a one-time code sent to their registered email address after entering their password. This ensures that even if a password is compromised, an attacker cannot access the account without also having access to the user's email.

Setup Process

Administrator enables 2FA requirement globally or per-user in Settings → Security
On next login, after entering valid credentials, the user is prompted to set up 2FA
A verification code is sent to the user's registered email address
User enters the code to complete setup; 2FA is now active on the account
Future logins will always require the email verification step

Verification Flow

# Step 1: Normal login
POST /api/v1/auth/login
{"username": "admin", "password": "secret"}

# Response indicates 2FA required
{"requires_2fa": true, "temp_token": "eyJ..."}

# Step 2: Submit 2FA code
POST /api/v1/auth/verify-2fa
{"temp_token": "eyJ...", "code": "847291"}

# Response: full JWT tokens
{"access_token": "eyJ...", "refresh_token": "eyJ..."}

Code Characteristics

Verification codes are 6-digit numeric codes generated using a cryptographically secure random number generator. Each code is valid for 5 minutes and can only be used once. If the user requests a new code, the previous code is immediately invalidated. After 5 failed verification attempts, the account is temporarily locked for 15 minutes to prevent brute-force attacks.
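
The rules above (6 digits, 5-minute validity, single use, invalidation on re-issue, lockout after 5 failures) can be sketched as a small state machine. The class and attribute names are hypothetical, and the current time is passed in explicitly so the logic is easy to test.

```python
import secrets

CODE_TTL = 5 * 60     # seconds a code stays valid
MAX_FAILURES = 5      # failed attempts before lockout
LOCKOUT = 15 * 60     # lockout duration in seconds


class TwoFactorState:
    """Per-user 2FA state; `now` is a timestamp in seconds."""

    def __init__(self):
        self.code = None
        self.expires_at = 0.0
        self.failures = 0
        self.locked_until = 0.0

    def issue(self, now):
        # secrets.randbelow draws from the OS CSPRNG; zero-pad to 6 digits.
        self.code = f"{secrets.randbelow(10**6):06d}"
        self.expires_at = now + CODE_TTL  # overwriting invalidates any earlier code
        return self.code

    def verify(self, candidate, now):
        if now < self.locked_until:
            return False
        valid = (
            self.code is not None
            and now < self.expires_at
            and secrets.compare_digest(candidate.encode(), self.code.encode())
        )
        if valid:
            self.code = None  # single use
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= MAX_FAILURES:
            self.locked_until = now + LOCKOUT
            self.failures = 0
        return False
```

With one million possible codes and at most five guesses per 15-minute lockout cycle, an attacker's chance of guessing a code within its 5-minute lifetime stays very small.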

Email Configuration

2FA requires a properly configured SMTP server for sending verification emails. The email template is customizable and includes the code, expiration time, and a warning not to share the code. Routelock supports TLS-encrypted SMTP connections and SMTP authentication.

email:
  smtp_host: "smtp.company.com"
  smtp_port: 587
  smtp_user: "routelock@company.com"
  smtp_password: "${SMTP_PASSWORD}"
  from_address: "routelock@company.com"
  from_name: "Routelock"
  tls: true

High Availability

Active-passive failover, heartbeat monitoring, and VIP management

Architecture

Routelock supports active-passive high availability (HA) to eliminate single points of failure. In an HA deployment, two Routelock instances run on separate servers. The active node handles all operations (NetFlow collection, probing, optimization, route injection), while the standby node maintains a synchronized state and is ready to take over within seconds if the active node fails.

Heartbeat Protocol

The active and standby nodes exchange heartbeat messages over a dedicated link (or network) every 2 seconds. Each heartbeat includes the node's health status, current role, database replication lag, and uptime. If the standby node misses 5 consecutive heartbeats (10 seconds), it initiates a failover. The heartbeat protocol uses a lightweight UDP-based format to minimize overhead and latency.
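
The failover trigger described above can be sketched as a simple missed-heartbeat counter. The class is hypothetical and omits the UDP transport; it only shows how a 2-second interval and a threshold of 5 misses translate into a 10-second detection time.

```python
HEARTBEAT_INTERVAL = 2.0  # seconds, from the text above
MISSED_THRESHOLD = 5      # consecutive misses that trigger failover


class HeartbeatMonitor:
    """Tracks when the peer was last heard from; `now` is in seconds."""

    def __init__(self, now):
        self.last_seen = now

    def on_heartbeat(self, now):
        self.last_seen = now

    def missed(self, now):
        """Number of whole heartbeat intervals elapsed since the last heartbeat."""
        return int((now - self.last_seen) // HEARTBEAT_INTERVAL)

    def should_fail_over(self, now):
        return self.missed(now) >= MISSED_THRESHOLD
```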

Failover Process

Detection: Standby detects active node failure via missed heartbeats
Verification: Standby performs additional health checks (database connectivity, BIRD socket) to confirm it can safely take over
VIP Migration: Standby assumes the shared Virtual IP (VIP) using gratuitous ARP
Service Activation: Standby starts NetFlow collector, probe scheduler, and optimization engine
BGP Reattachment: Standby connects to BIRD socket and verifies all active improvements are still injected
Notification: Alert sent to configured channels announcing the failover

Split-Brain Resolution

Split-brain scenarios (both nodes believing they are active) are resolved using a fencing mechanism. When a node transitions to active, it updates a "leader" record in the shared PostgreSQL database with a short TTL lease. Only the node holding the current lease can inject routes into BIRD. If both nodes are active but only one holds the database lease, the other will detect the conflict and revert to standby within one lease interval (default: 30 seconds).

ha:
  enabled: true
  role: active             # or "standby"
  peer_address: "10.0.1.11:9100"
  vip: "10.0.1.100/24"
  vip_interface: "eth0"
  heartbeat_interval: 2s
  heartbeat_timeout: 10s
  db_lease_ttl: 30s
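
The lease-based fencing described in Split-Brain Resolution can be sketched with an in-memory stand-in for the "leader" record. The class and method names are hypothetical; in a real deployment the acquire step would need to be a single atomic conditional update in the shared database rather than Python attribute assignments.

```python
class LeaderLease:
    """In-memory stand-in for the shared leader record (illustration only)."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl          # corresponds to db_lease_ttl
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now):
        """Acquire or renew the lease; fails while another node holds a live lease."""
        if self.holder not in (None, node) and now < self.expires_at:
            return False
        self.holder = node
        self.expires_at = now + self.ttl
        return True

    def may_inject_routes(self, node, now):
        """Only the current, unexpired lease holder may write routes to BIRD."""
        return self.holder == node and now < self.expires_at
```

A node that believes it is active but fails try_acquire knows within one TTL that the peer holds the lease and should step down to standby.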

State Synchronization

Both nodes work from the same PostgreSQL/TimescaleDB data, either through a single shared database or through a primary instance with a streaming replica. When a replica is used, the standby node's Routelock instance reads from its local replica for monitoring purposes but does not write. Upon failover, the standby promotes its local replica (if using separate DB instances) or simply begins writing to the shared database.

Multi-Routing Domains

Per-POP and per-site routing optimization with domain-scoped providers

What Are Routing Domains?

A routing domain in Routelock represents an independent routing scope—typically a physical Point of Presence (POP) or data center site—that has its own set of upstream providers and BGP sessions. Multi-routing domain support allows a single Routelock instance to optimize routing across multiple sites simultaneously, each with different providers, policies, and traffic patterns.

Why Use Multiple Domains?

Large networks often operate from multiple locations, each with different transit providers and peering arrangements. Without routing domains, you would need separate Routelock deployments per site. With multi-domain support, a single deployment manages all sites, providing a unified view of network-wide optimization while respecting the fact that each site has its own routing table and provider set.

Configuration

routing_domains:
  - name: "NYC-POP"
    id: 1
    bird_socket: "/run/bird/bird-nyc.ctl"
    providers: [1, 2, 3]       # Provider IDs scoped to this domain
    probe_source: "10.1.0.1"
    netflow_source: "10.1.0.254"
  - name: "LAX-POP"
    id: 2
    bird_socket: "/run/bird/bird-lax.ctl"
    providers: [4, 5, 6]
    probe_source: "10.2.0.1"
    netflow_source: "10.2.0.254"

Domain Scoping

All core objects in Routelock are scoped to a routing domain: providers, improvements, probes, and traffic statistics. The optimization engine runs independently for each domain, ensuring that a provider outage in one site does not affect routing decisions in another. The web dashboard and API support filtering by domain, and the global overview aggregates metrics across all domains.

Cross-Domain Considerations

While each domain operates independently, Routelock provides cross-domain analytics. For example, it can identify if a destination prefix is being optimized through different providers in different POPs and whether the aggregate cost impact is beneficial. Future versions will support coordinated optimization where domains share probe data to reduce redundant probing of the same destinations.

Maintenance Windows

Scheduling downtime with automatic route withdrawal and probe suspension

Purpose

Maintenance windows allow operators to schedule periods when specific providers, prefixes, or the entire system should pause optimization activities. During maintenance, Routelock automatically withdraws affected improvements, suspends probing, and suppresses related alerts. This prevents the system from reacting to expected performance degradation during planned network changes.

Creating Maintenance Windows

# Schedule provider maintenance
POST /api/v1/maintenance
{
  "name": "Cogent fiber cut maintenance",
  "scope": "provider",
  "scope_id": 3,
  "start_time": "2025-02-15T02:00:00Z",
  "end_time": "2025-02-15T06:00:00Z",
  "auto_withdraw": true,
  "suppress_alerts": true
}

# Schedule global maintenance
POST /api/v1/maintenance
{
  "name": "Core router upgrade",
  "scope": "global",
  "start_time": "2025-02-20T04:00:00Z",
  "end_time": "2025-02-20T05:00:00Z",
  "auto_withdraw": true
}

Maintenance Behavior

When a maintenance window becomes active:

Route Withdrawal: If auto_withdraw is enabled, all active improvements in the maintenance scope are withdrawn gracefully
Probe Suspension: Active probing through the affected provider(s) is paused to avoid generating misleading metrics
Alert Suppression: Alerts related to the maintenance scope are suppressed to prevent notification fatigue
Optimization Pause: The optimization engine skips the affected scope during its analysis cycle

When the maintenance window ends, all paused activities resume automatically. Probing restarts, and the optimization engine begins evaluating the affected prefixes in the next cycle. Previously withdrawn improvements must be re-earned through the normal optimization process; they are not automatically re-injected.
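
A minimal sketch of how a window's scope and time range might be evaluated is shown below. The field names follow the request bodies above; the helper names are hypothetical, and treating end_time as exclusive is an assumption.

```python
from datetime import datetime, timezone


def parse_ts(value):
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_suppressed(window, provider_id, now):
    """True if optimization for provider_id should pause at time `now`."""
    if not (parse_ts(window["start_time"]) <= now < parse_ts(window["end_time"])):
        return False
    if window["scope"] == "global":
        return True
    return window["scope"] == "provider" and window.get("scope_id") == provider_id


provider_window = {
    "scope": "provider",
    "scope_id": 3,
    "start_time": "2025-02-15T02:00:00Z",
    "end_time": "2025-02-15T06:00:00Z",
}
during = datetime(2025, 2, 15, 3, 0, tzinfo=timezone.utc)
suppressed = is_suppressed(provider_window, 3, during)
```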

Recurring Windows

Maintenance windows can be configured as recurring (daily, weekly, monthly) for regular maintenance activities. Recurring windows are evaluated at each scheduler tick and activated automatically when the schedule matches.

IX Peering

Internet Exchange support with DSCP-based probing and prefer-over-transit

Overview

Internet Exchange Points (IXPs) provide direct peering between networks, typically offering lower latency and zero per-Mbps cost compared to transit providers. Routelock natively supports IX providers, enabling optimization decisions that prefer IX paths when performance is comparable to transit, thereby reducing transit costs without sacrificing quality.

IX Provider Configuration

IX providers are configured with type: ix and additional IX-specific settings. Since IX connections typically do not have committed data rates or per-Mbps billing, cost calculations treat IX traffic as free, making IX paths highly attractive in cost optimization mode.

providers:
  - name: "AMS-IX"
    type: ix
    asn: 64999
    cost_per_mbps: 0         # IX traffic is free
    prefer_over_transit: true
    ix_specific:
      peering_lan: "80.249.208.0/21"
      route_server: true

DSCP-Based Probing

Probing through IX connections requires special handling because IX peering LANs often have different traffic policies than transit links. Routelock uses DSCP (Differentiated Services Code Point) marking to tag probe packets for IX paths, allowing policy-based routing (PBR) rules on routers to steer these probes through the IX connection specifically. This ensures accurate measurement of IX path quality.

Prefer-Over-Transit Logic

When prefer_over_transit is enabled for an IX provider, the optimization engine gives IX paths a bonus in the scoring algorithm. Even if a transit provider offers marginally better latency (within a configurable tolerance, default 5 ms), the IX path is preferred because it eliminates transit cost. This feature is especially valuable for networks with high traffic volumes where transit costs are significant.
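
The tolerance rule can be sketched as follows. This is a deliberate simplification: it compares latency only, whereas the actual scoring algorithm also weighs other metrics, and the function and field names are hypothetical.

```python
def pick_path(candidates, tolerance_ms=5.0):
    """Choose a path from candidate dicts.

    Each candidate has: name, type ("ix" or "transit"), latency_ms,
    and optionally prefer_over_transit.
    """
    best = min(candidates, key=lambda c: c["latency_ms"])
    # IX paths flagged prefer_over_transit win if they are within the tolerance
    # of the lowest-latency candidate.
    preferred_ix = [
        c for c in candidates
        if c["type"] == "ix"
        and c.get("prefer_over_transit")
        and c["latency_ms"] - best["latency_ms"] <= tolerance_ms
    ]
    if preferred_ix:
        return min(preferred_ix, key=lambda c: c["latency_ms"])
    return best
```

With a transit path at 20 ms and an IX path at 23 ms, the IX path is selected; if the IX path were at 30 ms, the transit path would win.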

Partial Reachability

IX connections typically only provide routes to the IX members' networks, not full internet reachability. Routelock handles this by only considering IX providers for prefixes that are actually reachable through the IX (i.e., present in the IX BGP table). The system automatically tracks IX reachability through the BGP RIB received from BIRD.
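
A reachability filter of this kind can be sketched with the standard ipaddress module. Treating a destination as reachable when any IX-learned route covers it is an assumption for this example, and the function name is hypothetical.

```python
import ipaddress


def reachable_via_ix(prefix, ix_rib):
    """True if `prefix` is covered by a route learned from the IX.

    `ix_rib` is an iterable of prefix strings received via the IX sessions.
    """
    target = ipaddress.ip_network(prefix)
    return any(
        # subnet_of raises TypeError across IP versions, so check the version first.
        target.version == route.version and target.subnet_of(route)
        for route in map(ipaddress.ip_network, ix_rib)
    )
```

Prefixes that fail this check would be evaluated against transit providers only.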

Inbound Optimization

AS-path prepend manipulation for inbound traffic rebalancing

The Inbound Challenge

While outbound optimization (controlling which provider carries your outbound traffic) is straightforward via local-preference and more-specific routes, inbound optimization is fundamentally harder. Inbound traffic is controlled by remote networks' routing decisions based on BGP attributes you announce. The primary tool for influencing inbound traffic is AS-path prepending—making your AS-path artificially longer through certain providers to make the path less attractive to remote networks.

How Routelock Handles Inbound

Routelock analyzes inbound traffic distribution across providers using NetFlow data and SNMP interface counters. When it detects an imbalance (e.g., one provider carrying 70% of inbound traffic while others are underutilized), it can automatically adjust AS-path prepend levels to redistribute inbound traffic more evenly.

Prepend Strategy

inbound:
  enabled: true
  target_distribution:
    provider_a: 40    # Target 40% of inbound traffic
    provider_b: 35    # Target 35%
    provider_c: 25    # Target 25%
  max_prepends: 3     # Never prepend more than 3 times
  adjustment_interval: 1h  # Re-evaluate hourly
  min_deviation_pct: 10    # Only act if >10% off target

Prepend Adjustment Algorithm

Measure current inbound traffic distribution per provider
Compare against target distribution
If a provider is over-target by more than the deviation threshold, increase prepend by 1 (never exceeding max_prepends)
If a provider is under-target by more than the deviation threshold, decrease prepend by 1 (minimum 0)
Apply changes to BIRD's BGP export filters
Wait for adjustment interval before next evaluation (BGP convergence takes time)
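
One evaluation cycle of the steps above can be sketched as a pure function. The parameter names follow the sample configuration; measuring deviation in percentage points and applying the threshold in both directions are assumptions made for this example.

```python
def adjust_prepends(current_pct, target_pct, prepends,
                    max_prepends=3, min_deviation_pct=10):
    """Return a new {provider: prepend_count} map after one evaluation cycle."""
    updated = dict(prepends)
    for provider, target in target_pct.items():
        deviation = current_pct.get(provider, 0) - target
        if deviation > min_deviation_pct:
            # Over target: lengthen the AS path, capped at max_prepends.
            updated[provider] = min(prepends.get(provider, 0) + 1, max_prepends)
        elif deviation < -min_deviation_pct:
            # Under target: shorten the AS path, never below zero.
            updated[provider] = max(prepends.get(provider, 0) - 1, 0)
    return updated
```

Changing the prepend count by at most one step per interval keeps each adjustment small, which matters because the effect of a change is only visible after remote networks have reconverged.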

Limitations

Inbound optimization via AS-path prepending is inherently imprecise. Remote networks may use local-preference overrides, traffic engineering, or routing policies that ignore AS-path length differences. Routelock's inbound optimization works best for achieving approximate traffic distribution goals rather than precise percentage targets. Changes take effect gradually as remote networks reconverge their routing tables, typically over 15-60 minutes.

Caution: Excessive prepending (more than 3x) can cause reachability issues with some remote networks that filter paths beyond a certain AS-path length. Always test prepend changes in Human mode first.

Real-Time Dashboard

Overview of all dashboard widgets and what they show

Dashboard Layout

The Routelock dashboard provides a comprehensive real-time view of your network's routing optimization status. It is the primary interface for operators to monitor system health, track improvements, and identify issues requiring attention. All dashboard data updates in real time via WebSocket connections, eliminating the need for manual page refreshes.

Widget Overview

System Status Banner

The top banner displays the current operating mode (Test/Human/Robot), system uptime, active alert count, and a quick health indicator. Green indicates all systems operational; yellow indicates warnings; red indicates critical issues requiring immediate attention.

Provider Overview

Shows each configured provider with its current status (up/down), BGP session state, current throughput (inbound/outbound), 95th percentile utilization, and active improvement count. Providers approaching their commit threshold are highlighted in amber.

Traffic Distribution Chart

A real-time pie chart and time-series graph showing how traffic is distributed across providers. The chart updates every 30 seconds and can be toggled between bytes, packets, and percentage views. Historical comparison (e.g., vs. same time yesterday) is available.

Active Improvements

Displays the count of active, pending, and recently expired improvements. A mini-table shows the top 10 improvements by traffic volume with their current provider, latency improvement, and remaining TTL. Click any improvement to view full details.

Probe Health

Shows the probe scheduler status, including active probes, probe success rate, and average probe latency across all providers. A sparkline chart displays probe health over the last hour. Probes with abnormal failure rates are flagged.

DDoS Status

Displays active DDoS events (if any), current mitigation status, and a traffic anomaly indicator. When no attacks are detected, it shows the time since the last event and current baseline values for the top monitored prefixes.

Recent Events

A live event feed showing the most recent system events: improvements created/withdrawn, alerts triggered, configuration changes, user logins, and BGP session state changes. Events are color-coded by severity and type.

NetFlow Statistics

Shows the current NetFlow collection rate (flows/second), the total flows processed in the current period, and the top 5 destination prefixes by traffic volume. The widget links to the full traffic analysis view.

Customization

Dashboard widgets can be rearranged and resized by administrators. The layout is saved per-user, so each operator can configure their preferred view. Widgets can be collapsed or hidden entirely if not needed for a particular operator's workflow.

WebSocket Events

Real-time event streaming for live UI updates and toast notifications

WebSocket Architecture

Routelock maintains a persistent WebSocket connection between the web UI and the server for real-time event delivery. When significant events occur (improvement created, alert triggered, BGP session change), the server pushes an event message to all connected clients. This eliminates polling and provides instant visibility into system changes.

Connecting

// WebSocket endpoint (requires JWT authentication)
const ws = new WebSocket('wss://routelock.example.com/api/v1/ws?token=eyJ...');

ws.onmessage = function(event) {
    const data = JSON.parse(event.data);
    console.log(data.type, data.payload);
};

Event Types

Event Type	Trigger	Payload
`improvement.created`	New improvement proposed/injected	Improvement ID, prefix, provider, metrics
`improvement.withdrawn`	Route withdrawn	Improvement ID, reason
`improvement.approved`	Operator approved pending change	Improvement ID, approver
`alert.triggered`	New alert created	Alert ID, severity, message
`alert.resolved`	Alert condition cleared	Alert ID
`bgp.session_up`	BGP session established	Provider, peer IP
`bgp.session_down`	BGP session dropped	Provider, peer IP, reason
`ddos.detected`	DDoS attack detected	Target prefix, severity, type
`ddos.mitigated`	Mitigation applied	Target prefix, method
`system.mode_changed`	Operating mode changed	Old mode, new mode, user
`provider.status`	Provider metrics update	Provider ID, throughput, latency

Toast Notifications

The web UI displays toast notifications for important events. Toasts are color-coded by severity (blue for info, green for success, amber for warning, red for critical). Non-critical toasts auto-dismiss after 5 seconds, while critical toasts remain visible until manually dismissed. Users can configure which event types trigger toast notifications in their profile settings.

Event Filtering

Clients can subscribe to specific event types by sending a subscription message after connecting. This reduces bandwidth for clients that only need specific event categories:

ws.send(JSON.stringify({
    action: "subscribe",
    types: ["improvement.*", "alert.*", "ddos.*"]
}));
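
How a subscription pattern such as "alert.*" might be matched against event types can be sketched as follows. Shell-style wildcard semantics are an assumption here; the exact matching rules applied by the server are not specified above.

```python
from fnmatch import fnmatchcase


def matches_subscription(event_type, patterns):
    """True if event_type matches any subscribed pattern such as "alert.*"."""
    return any(fnmatchcase(event_type, pattern) for pattern in patterns)


subscribed = ["improvement.*", "alert.*", "ddos.*"]
```

Under these semantics, a client with the subscription shown would receive alert.triggered and ddos.detected but not bgp.session_up.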

Reports

Traffic, performance, cost, and security reports

Available Reports

Routelock generates comprehensive reports that provide historical analysis and trends. Reports can be viewed in the web UI, exported as CSV/PDF, or retrieved via the API. All reports support configurable time ranges and can be filtered by provider, routing domain, or prefix.

Traffic Report

Shows traffic volume trends over time, broken down by provider, protocol, and direction (inbound/outbound). Includes peak utilization, average throughput, and traffic growth rate. The traffic report is essential for capacity planning and identifying traffic pattern changes.

Performance Report

Summarizes latency, packet loss, and jitter trends per provider and per destination region. Highlights periods of degradation and correlates them with improvements or route changes. Includes before/after comparisons showing the impact of route optimizations on actual performance.

Cost Report

Tracks 95th percentile utilization per provider over the billing period. Shows projected end-of-month costs, cost savings from optimization, and commit utilization trends. The cost report helps justify the ROI of route optimization by quantifying transit cost reductions.
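
For readers unfamiliar with 95th percentile billing, the calculation can be sketched as below. The nearest-rank method and the rule of billing the larger of the commit and the measured percentile are common industry conventions assumed for this example; they are not taken from Routelock's documentation, and the function names are hypothetical.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile: the top 5% of samples are discarded."""
    if not samples_mbps:
        raise ValueError("no samples")
    ordered = sorted(samples_mbps)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def projected_cost(samples_mbps, commit_mbps, cost_per_mbps):
    """Bill the larger of the commit and the measured 95th percentile (assumed model)."""
    return max(percentile_95(samples_mbps), commit_mbps) * cost_per_mbps
```

Because the top 5% of samples are ignored, short bursts do not raise the bill, which is what makes shifting sustained traffic between providers effective for cost control.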

Optimization Report

Details all improvements created during the period: how many were successful, average improvement in latency/loss, total traffic optimized, and provider shift distribution. Includes improvement churn rate and anti-flap trigger counts.

Security Report

Lists all DDoS events, their severity, duration, and mitigation actions taken. Shows attack volume trends, most targeted prefixes, and attack type distribution. Includes scrubber performance metrics if XDP scrubbing is enabled.

Generating Reports

# Generate a performance report via API
GET /api/v1/reports/performance?from=2025-01-01&to=2025-01-31&provider_id=3

# Export as CSV
GET /api/v1/reports/traffic?format=csv&period=7d

Scheduled Reports

Reports can be scheduled for automatic generation and email delivery. Common schedules include daily traffic summaries, weekly performance reviews, and monthly cost reports. Scheduled reports are configured in Settings → Reports.

Alerts

Alert categories, severity levels, and acknowledgment workflow

Alert System

Routelock's alerting system monitors all aspects of the platform and generates notifications when conditions require attention. Alerts are categorized by source, assigned severity levels, and can be delivered through multiple channels. The system distinguishes between automatically resolved alerts (which clear when the condition resolves) and persistent alerts that require manual acknowledgment.

Alert Categories

Category	Examples
BGP	Session down, session flapping, prefix count anomaly
Performance	Provider latency spike, widespread packet loss, jitter threshold exceeded
DDoS	Attack detected, mitigation triggered, scrubber overloaded
Commit	Provider approaching commit threshold, 95th percentile warning
System	High CPU/memory, database lag, probe scheduler behind, disk space low
HA	Peer unreachable, failover triggered, split-brain detected

Severity Levels

Critical: Immediate action required. BGP session loss, active DDoS attack, system failure. Generates audio/visual notification and escalation.
High: Prompt attention needed. Significant performance degradation, commit threshold approaching, scrubber rule failure.
Medium: Should be investigated. Minor performance anomalies, stale improvements, configuration warnings.
Low: Informational. Routine events, cleanup reminders, optimization statistics.

Notification Channels

Alerts can be delivered through multiple channels simultaneously: web UI toast notifications, email, webhook (for integration with PagerDuty, Slack, OpsGenie, etc.), and syslog. Each channel can be configured to receive only specific severity levels—for example, send only critical alerts to PagerDuty while sending all severities to the web UI.
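
Per-channel severity filtering can be sketched as a lookup from channel to accepted severities. The channel names and the configuration shape are hypothetical and chosen only to mirror the PagerDuty example above.

```python
# Hypothetical channel configuration: which severities each channel accepts.
CHANNEL_SEVERITIES = {
    "web_ui": {"low", "medium", "high", "critical"},
    "email": {"high", "critical"},
    "pagerduty_webhook": {"critical"},
}


def channels_for(severity, config=CHANNEL_SEVERITIES):
    """Return the sorted list of channels that should receive this alert."""
    return sorted(name for name, accepted in config.items() if severity in accepted)
```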

Acknowledgment

Persistent alerts must be acknowledged by an operator to indicate that the issue is being investigated. Acknowledging an alert stops escalation and removes it from the active alert count. Acknowledged alerts remain visible in the alert history. Auto-resolved alerts clear automatically when the triggering condition no longer exists (e.g., BGP session recovers).

# Acknowledge an alert
POST /api/v1/alerts/{id}/acknowledge
{"note": "Investigating with provider NOC, ticket #12345"}

API Overview

Base URL, authentication, response format, and pagination

Base URL

All API endpoints are served under the /api/v1/ path prefix. For a Routelock instance running at https://routelock.example.com, the full API base URL is:

https://routelock.example.com/api/v1/

Authentication

All API endpoints (except /api/v1/auth/login) require authentication. Two methods are supported:

# JWT Bearer Token
Authorization: Bearer eyJhbGciOiJIUzI1NiIs...

# API Key
X-API-Key: rl_live_k1_aBcDeFgHiJkLmNoPqRsT...

Unauthenticated requests receive a 401 Unauthorized response. Requests with insufficient role permissions receive 403 Forbidden.

Response Format

All API responses use JSON. Successful responses return the resource directly for single-resource endpoints, or wrap the results in a data envelope for list endpoints. Error responses follow a consistent format:

// Success (single resource)
{"id": 1, "name": "Cogent", "type": "transit", ...}

// Success (list)
{"data": [...], "total": 150, "page": 1, "per_page": 50}

// Error
{"error": {"code": "INVALID_PARAM", "message": "Invalid provider ID", "details": {...}}}

Pagination

List endpoints support cursor-based and offset-based pagination. Use page and per_page query parameters for offset pagination (default: page=1, per_page=50, max per_page=1000). The response includes total count and pagination metadata.

GET /api/v1/improvements?page=2&per_page=25&sort=-created_at
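
A client that needs every item from a list endpoint can walk the pages until the reported total is reached. The sketch below assumes the list envelope shown under Response Format; fetch_page is a hypothetical callable supplied by the caller, which keeps the example independent of any particular HTTP library.

```python
def iter_all(fetch_page, per_page=50):
    """Yield every item from a list endpoint using offset pagination.

    fetch_page(page, per_page) must return the decoded list envelope:
    {"data": [...], "total": N, "page": P, "per_page": K}.
    """
    page = 1
    seen = 0
    while True:
        body = fetch_page(page, per_page)
        items = body["data"]
        yield from items
        seen += len(items)
        # Stop on an empty page as well, in case the total changes mid-iteration.
        if not items or seen >= body["total"]:
            break
        page += 1
```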

Filtering and Sorting

Most list endpoints support filtering via query parameters specific to the resource type (e.g., status=active, provider_id=3). Sorting is controlled via the sort parameter with a - prefix for descending order. Multiple sort fields are comma-separated.

Rate Limiting

The API enforces rate limiting to protect system resources. Default limits are 100 requests per minute for authenticated users and 10 requests per minute for unauthenticated endpoints (login). Rate limit headers are included in all responses:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1706540400
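
A client can use these headers to pause before it would be rejected. The sketch below assumes that X-RateLimit-Reset is a Unix timestamp in seconds, which the sample value suggests but the text does not state; the function name is hypothetical.

```python
def seconds_until_retry(headers, now):
    """Return how long to pause before the next request, in seconds.

    `headers` is a dict of response headers; `now` is the current Unix time.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```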

Versioning

The API is versioned via the URL path (/api/v1/). Future breaking changes will be introduced under /api/v2/ while maintaining backward compatibility on v1 for a deprecation period.

API Endpoints by Category

All 85+ endpoints grouped by functional area

Authentication (6 endpoints)

Method	Endpoint	Description
POST	`/auth/login`	Authenticate with username/password
POST	`/auth/refresh`	Refresh access token
POST	`/auth/logout`	Invalidate current session
POST	`/auth/verify-2fa`	Submit 2FA verification code
GET	`/auth/sso/{provider}`	Initiate SSO login flow
GET	`/auth/sso/{provider}/callback`	SSO callback handler

Users (7 endpoints)

Method	Endpoint	Description
GET	`/users`	List all users
POST	`/users`	Create a new user
GET	`/users/{id}`	Get user details
PUT	`/users/{id}`	Update user
DELETE	`/users/{id}`	Delete user
GET	`/users/me`	Get current user profile
PUT	`/users/me/password`	Change own password

API Keys (4 endpoints)

Method	Endpoint	Description
GET	`/auth/api-keys`	List API keys
POST	`/auth/api-keys`	Create API key
GET	`/auth/api-keys/{id}`	Get API key details
DELETE	`/auth/api-keys/{id}`	Revoke API key

Providers (8 endpoints)

Method	Endpoint	Description
GET	`/providers`	List all providers
POST	`/providers`	Create provider
GET	`/providers/{id}`	Get provider details
PUT	`/providers/{id}`	Update provider
DELETE	`/providers/{id}`	Delete provider
GET	`/providers/{id}/metrics`	Get provider performance metrics
GET	`/providers/{id}/traffic`	Get provider traffic stats
POST	`/providers/{id}/toggle`	Enable/disable provider

Improvements (10 endpoints)

Method	Endpoint	Description
GET	`/improvements`	List improvements (filterable by status)
GET	`/improvements/{id}`	Get improvement details
POST	`/improvements/{id}/approve`	Approve pending improvement
POST	`/improvements/{id}/reject`	Reject pending improvement
DELETE	`/improvements/{id}`	Withdraw active improvement
POST	`/improvements/bulk-approve`	Approve multiple improvements
POST	`/improvements/bulk-reject`	Reject multiple improvements
POST	`/improvements/bulk-withdraw`	Withdraw multiple improvements
POST	`/improvements/withdraw-all`	Emergency: withdraw all
GET	`/improvements/stats`	Improvement statistics summary

BGP (8 endpoints)

Method	Endpoint	Description
GET	`/bgp/sessions`	List BGP session status
GET	`/bgp/routes`	Query BGP routing table
GET	`/bgp/routes/{prefix}`	Get routes for specific prefix
GET	`/bgp/summary`	BGP summary (peer count, prefix count)
POST	`/bgp/reconfigure`	Trigger BIRD soft reconfigure
GET	`/bgp/communities`	List configured communities
GET	`/bgp/looking-glass`	Looking glass query
GET	`/bgp/rib`	RIB entries with detailed attributes

NetFlow (6 endpoints)

Method	Endpoint	Description
GET	`/netflow/stats`	Collector statistics
GET	`/netflow/top-prefixes`	Top prefixes by traffic
GET	`/netflow/top-talkers`	Top source IPs
GET	`/netflow/distribution`	Traffic distribution by provider
GET	`/netflow/protocols`	Protocol distribution
GET	`/netflow/timeseries`	Traffic time-series data

Probes (6 endpoints)

Method	Endpoint	Description
GET	`/probes/status`	Probe scheduler status
GET	`/probes/results`	Recent probe results
GET	`/probes/results/{prefix}`	Probe results for specific prefix
POST	`/probes/trigger`	Trigger manual probe
GET	`/probes/config`	Get probe configuration
PUT	`/probes/config`	Update probe configuration

DDoS (8 endpoints)

Method	Endpoint	Description
GET	`/ddos/events`	List DDoS events
GET	`/ddos/events/{id}`	Get event details
POST	`/ddos/events/{id}/mitigate`	Trigger mitigation for event
DELETE	`/ddos/events/{id}/mitigate`	Stop mitigation
GET	`/ddos/baselines`	View current EWMA baselines
GET	`/flowspec/rules`	List FlowSpec rules
POST	`/flowspec/rules`	Create FlowSpec rule
DELETE	`/flowspec/rules/{id}`	Delete FlowSpec rule

Scrubber (6 endpoints)

Method	Endpoint	Description
GET	`/scrubber/status`	Scrubber status and stats
POST	`/scrubber/enable`	Enable scrubber on interface
POST	`/scrubber/disable`	Disable scrubber
GET	`/scrubber/rules`	List scrubber rules
POST	`/scrubber/rules`	Add scrubber rule
DELETE	`/scrubber/rules/{id}`	Remove scrubber rule

Configuration & System (10 endpoints)

Method	Endpoint	Description
GET	`/config`	Get current configuration
PUT	`/config`	Update configuration
PUT	`/config/mode`	Change operating mode
GET	`/system/health`	Health check endpoint
GET	`/system/version`	Version and build info
GET	`/system/stats`	System resource usage
GET	`/alerts`	List alerts
POST	`/alerts/{id}/acknowledge`	Acknowledge alert
GET	`/maintenance`	List maintenance windows
POST	`/maintenance`	Create maintenance window

Reports (6 endpoints)

Method	Endpoint	Description
GET	`/reports/traffic`	Traffic report
GET	`/reports/performance`	Performance report
GET	`/reports/cost`	Cost/commit report
GET	`/reports/optimization`	Optimization effectiveness report
GET	`/reports/security`	DDoS/security report
GET	`/reports/overview`	Executive overview dashboard data

WebSocket (1 endpoint)

Method	Endpoint	Description
WS	`/ws`	Real-time event stream

Pending Changes Review

Reviewing and approving or rejecting proposed route optimizations

Overview

In Test and Human operating modes, the optimization engine creates pending changes rather than immediately injecting routes. These pending changes represent proposed route optimizations that require operator review. The Pending Changes view is the primary workflow interface for operators running Routelock in Human mode, providing all the information needed to make informed approval or rejection decisions.

Pending Change Details

Each pending change displays comprehensive information about the proposed optimization:

Target Prefix: The destination network being optimized (e.g., 203.0.113.0/24)
Current Provider: The provider currently carrying traffic for this prefix
Proposed Provider: The provider Routelock recommends switching to
Current Metrics: Latency, loss, and jitter through the current provider
Proposed Metrics: Expected latency, loss, and jitter through the new provider
Improvement Score: Composite improvement percentage
Traffic Volume: How much traffic this prefix carries (helps prioritize reviews)
Cost Impact: How the change affects commit utilization on both providers

Approval Workflow

# Approve a single pending change
POST /api/v1/improvements/{id}/approve

# Reject with reason
POST /api/v1/improvements/{id}/reject
{"reason": "Provider B has planned maintenance tomorrow"}

# Bulk approve specific pending changes by ID
POST /api/v1/improvements/bulk-approve
{"ids": [1, 2, 3, 4, 5]}

# Bulk approve by filter (e.g., all with >30% improvement)
POST /api/v1/improvements/bulk-approve
{"filter": {"min_improvement_pct": 30}}
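
The calls above can also be issued from a script. The following is a minimal sketch using only the Python standard library; the base URL is an assumption taken from the health-check example (http://localhost:8080/api/v1), the helper names are hypothetical, and authentication headers are omitted because they are not described in this section.

```python
import json
import urllib.request

# Assumption: same host/port as the health-check example; adjust for your deployment.
BASE_URL = "http://localhost:8080/api/v1"


def build_post(path, payload=None):
    """Build (but do not send) a JSON POST request for the improvements API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else b""
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def reject_request(improvement_id, reason):
    """Request that rejects one pending change with a reason."""
    return build_post(f"/improvements/{improvement_id}/reject", {"reason": reason})


def bulk_approve_request(ids):
    """Request that approves an explicit list of pending change IDs."""
    return build_post("/improvements/bulk-approve", {"ids": list(ids)})


if __name__ == "__main__":
    req = bulk_approve_request([1, 2, 3])
    print(req.get_method(), req.full_url, req.data.decode())
    # To send it: urllib.request.urlopen(req)  (add auth headers first)
```

Building the request separately from sending it makes the payload easy to inspect or log before anything changes on the network.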

Best Practices

Review pending changes at least every 15 minutes in Human mode to prevent a backlog of stale proposals
Sort by traffic volume to prioritize high-impact changes
Check the cost impact column to avoid pushing providers over their commit thresholds
Use bulk approve for changes above your confidence threshold and review lower-scoring changes individually
Rejected changes enter a cooldown period before being re-proposed, reducing repeated reviews of the same prefix
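
The triage these practices describe can be sketched as follows. This is an illustration only: the PendingChange fields, the traffic unit, and the 30% confidence threshold are assumptions, not the API's actual response schema.

```python
from dataclasses import dataclass


@dataclass
class PendingChange:
    # Hypothetical subset of the fields shown in the Pending Changes view.
    id: int
    improvement_pct: float  # composite improvement score
    traffic_mbps: float     # traffic volume (unit is an assumption)


def triage(changes, confidence_pct=30.0):
    """Split changes into IDs to bulk approve and a manual review queue.

    Changes at or above the confidence threshold are collected for bulk
    approval; the rest are sorted by traffic volume, highest first, so the
    most impactful proposals are reviewed before the others.
    """
    auto_ids = [c.id for c in changes if c.improvement_pct >= confidence_pct]
    manual = sorted(
        (c for c in changes if c.improvement_pct < confidence_pct),
        key=lambda c: c.traffic_mbps,
        reverse=True,
    )
    return auto_ids, manual


if __name__ == "__main__":
    pending = [
        PendingChange(1, 45.0, 10.0),
        PendingChange(2, 22.0, 900.0),
        PendingChange(3, 25.0, 50.0),
    ]
    auto_ids, manual = triage(pending)
    print("bulk approve:", auto_ids)
    print("review in order:", [c.id for c in manual])
```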

Tip: If you find yourself approving 90%+ of pending changes, consider switching to Robot mode with conservative thresholds. This reduces operator burden, while a high minimum improvement requirement keeps automated changes limited to clear wins.

Configuration Guide

Comprehensive guide to all configuration sections

Configuration File

Routelock is configured via a YAML file located at /etc/routelock/config.yaml (default) or specified with the --config flag. Environment variables can be referenced using ${ENV_VAR} syntax for sensitive values. The configuration is loaded at startup and can be partially reloaded at runtime via the API.
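
To illustrate the ${ENV_VAR} syntax, here is a minimal sketch of how such references could be expanded before the YAML is parsed. It is not Routelock's loader; in particular, raising an error for an unset variable is an assumption, since this section does not say how missing variables are handled.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env=None):
    """Replace every ${NAME} in text with the value of NAME from env."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            # Assumption: fail loudly rather than substitute an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(substitute, text)


if __name__ == "__main__":
    line = 'password: "${DB_PASSWORD}"'
    print(expand_env(line, {"DB_PASSWORD": "example-secret"}))
```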

Server Section

server:
  listen: ":8080"          # HTTP/HTTPS listen address
  tls_cert: "/etc/routelock/cert.pem"
  tls_key: "/etc/routelock/key.pem"
  mode: test               # Operating mode: test, human, robot
  log_level: info           # debug, info, warn, error
  log_format: json          # json or text

Database Section

database:
  host: localhost
  port: 5432
  name: routelock
  user: routelock
  password: "${DB_PASSWORD}"
  max_connections: 25
  ssl_mode: require
  migrations_auto: true     # Run migrations on startup

BGP Section

bgp:
  bird_socket: "/run/bird/bird.ctl"
  config_dir: "/etc/bird/routelock.d/"
  local_as: 65000
  router_id: "10.10.5.120"
  reconfigure_delay: 5s     # Batch changes before BIRD reconfigure
  max_routes: 10000         # Maximum injected routes

NetFlow Section

netflow:
  listen: ":2055"
  workers: 4                # Parallel flow processing workers
  buffer_size: 8192         # UDP receive buffer
  aggregation_interval: 60s
  top_n: 1000               # Track top N prefixes

Optimization Section

optimization:
  mode: performance         # performance or cost
  cycle_interval: 60s       # Analysis cycle frequency
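  min_improvement_pct: 20   # Minimum improvement (%) before a change is proposed
  min_latency_diff_ms: 5    # Minimum latency difference (ms)
  max_inject_rate: 50       # Rate limit on route injections
  anti_flap_seconds: 300    # Anti-flap timer (seconds)
  ttl_seconds: 3600         # Improvement TTL (seconds)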
  weights: {latency: 0.4, loss: 0.3, jitter: 0.2, cost: 0.1}
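
One way to read the weights and thresholds above is as a weighted sum of per-metric gains that is then compared against the two minimums. The sketch below works under that assumption; this section does not give the scoring formula, so the relative-gain definition and the rule that both thresholds must pass are illustrative, not the engine's actual logic.

```python
WEIGHTS = {"latency": 0.4, "loss": 0.3, "jitter": 0.2, "cost": 0.1}


def relative_gain(current, proposed):
    """Fractional improvement; positive when the proposed value is lower."""
    if current <= 0:
        return 0.0
    return (current - proposed) / current


def composite_improvement_pct(current, proposed, weights=WEIGHTS):
    """Weighted sum of per-metric relative gains, expressed as a percentage."""
    return 100.0 * sum(
        weight * relative_gain(current[metric], proposed[metric])
        for metric, weight in weights.items()
    )


def should_propose(current, proposed, min_improvement_pct=20, min_latency_diff_ms=5):
    """Assumed gate: both the score and the latency gain must clear their minimums."""
    latency_diff = current["latency"] - proposed["latency"]
    score = composite_improvement_pct(current, proposed)
    return score >= min_improvement_pct and latency_diff >= min_latency_diff_ms


if __name__ == "__main__":
    now = {"latency": 80.0, "loss": 2.0, "jitter": 10.0, "cost": 1.0}
    alt = {"latency": 40.0, "loss": 1.0, "jitter": 5.0, "cost": 1.0}
    print(round(composite_improvement_pct(now, alt), 1), should_propose(now, alt))
```

With these inputs, latency, loss, and jitter each halve while cost is unchanged, so the score is 0.4 x 50 + 0.3 x 50 + 0.2 x 50 + 0.1 x 0 = 45%.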

Probes Section

probes:
  type: icmp               # icmp, udp, tcp
  interval_high: 15s       # High-traffic prefix interval
  interval_low: 60s        # Low-traffic prefix interval
  timeout: 3s
  count: 5                 # Probes per measurement
  ewma_alpha: 0.3          # Smoothing factor
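
The ewma_alpha smoothing factor refers to an exponentially weighted moving average. A minimal sketch of the standard update rule follows; seeding the average with the first sample is an assumption about initialization, not documented Routelock behavior.

```python
def ewma_update(previous, sample, alpha=0.3):
    """Standard EWMA step: new = alpha * sample + (1 - alpha) * previous."""
    if previous is None:
        # Assumption: the first measurement seeds the average.
        return sample
    return alpha * sample + (1 - alpha) * previous


def smooth(samples, alpha=0.3):
    """Return the running EWMA after each sample."""
    value = None
    smoothed = []
    for sample in samples:
        value = ewma_update(value, sample, alpha)
        smoothed.append(value)
    return smoothed


if __name__ == "__main__":
    # A single 120 ms spike after steady 20 ms latency moves the average
    # only part of the way toward the spike.
    print(smooth([20.0, 20.0, 120.0]))
```

A higher alpha reacts faster to real path changes but is more sensitive to one-off spikes; a lower alpha is steadier but slower to notice degradation.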

Security Sections

See dedicated articles for LDAP, SSO, 2FA, and DDoS configuration. Each section is documented in its respective article with full example configurations.

Runtime Configuration Changes

Some configuration parameters can be changed at runtime via the API without restarting Routelock. These include operating mode, optimization thresholds, probe intervals, and alert settings. Changes to database, BGP socket, or listen address require a restart.

Database & Migrations

Schema overview, running migrations, and TimescaleDB hypertables

Database Architecture

Routelock uses PostgreSQL with the TimescaleDB extension for its data store. TimescaleDB provides transparent time-series optimization through hypertables, which automatically partition data by time for efficient querying and retention management. The database contains 27 tables covering configuration, operational state, time-series metrics, and audit logging.

Key Tables

Table	Type	Description
`providers`	Regular	Provider configuration and metadata
`improvements`	Regular	Route improvements (active, pending, historical)
`netflow_records`	Hypertable	Aggregated NetFlow data by prefix and interval
`probe_results`	Hypertable	Active probe measurements per prefix per provider
`traffic_stats`	Hypertable	Provider traffic statistics over time
`ddos_events`	Regular	DDoS detection events and mitigation state
`ddos_baselines`	Hypertable	EWMA baseline values per prefix
`users`	Regular	User accounts and authentication data
`api_keys`	Regular	API key hashes and metadata
`sessions`	Regular	Active JWT sessions
`alerts`	Regular	Alert records
`audit_log`	Hypertable	All user and system actions
`maintenance_windows`	Regular	Scheduled maintenance periods
`config`	Regular	Runtime configuration key-value store

Running Migrations

# Apply all pending migrations
routelock migrate up

# Rollback last migration
routelock migrate down 1

# Show migration status
routelock migrate status

# Auto-migration on startup (config)
database:
  migrations_auto: true

TimescaleDB Hypertables

Hypertables are created automatically during migration. They chunk data by time (default: 1-day chunks) for efficient time-range queries. Compression is enabled on chunks older than 7 days, reducing storage by 90%+. Retention policies automatically drop data older than the configured retention period (default: 90 days for detailed data, 365 days for aggregates).

# Check hypertable info
SELECT hypertable_name, num_chunks, compression_enabled
FROM timescaledb_information.hypertables;
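
The chunk lifecycle described above (compression after 7 days, removal after the retention period) can be sketched as a simple age check. This only illustrates the stated defaults; measuring age from the end of the chunk's time range is an assumption, and the real policies run inside TimescaleDB.

```python
from datetime import datetime, timedelta, timezone


def chunk_state(chunk_end, now,
                compress_after=timedelta(days=7),
                retain_for=timedelta(days=90)):
    """Classify a chunk by the age of the newest data it can contain."""
    age = now - chunk_end
    if age > retain_for:
        return "dropped"
    if age > compress_after:
        return "compressed"
    return "uncompressed"


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    for days in (3, 30, 120):
        print(days, "days old:", chunk_state(now - timedelta(days=days), now))
```

For aggregate tables, pass retain_for=timedelta(days=365) to reflect the longer default retention.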

Backup and Recovery

Standard PostgreSQL backup tools (pg_dump, pg_basebackup) work with TimescaleDB. For large databases, use pg_basebackup for full backups and WAL archiving for point-in-time recovery. TimescaleDB-specific backup considerations include ensuring the extension is installed on the restore target, at the same version as on the source, and that chunk ordering is preserved.

Troubleshooting

Common issues, diagnostic procedures, and solutions

No NetFlow Data Appearing

Symptoms: Dashboard shows zero traffic, no top prefixes.

Verify routers are configured to export NetFlow v9 to the correct IP and port (default 2055)
Confirm the collector is listening: ss -ulnp | grep 2055, and check that firewall rules allow inbound UDP on that port
Verify flow packets are arriving from the routers: tcpdump -i eth0 udp port 2055 -c 5
Check logs for template parsing errors: NetFlow v9 requires templates before data records
Ensure NetFlow export version is v9 (not v5 or IPFIX)
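
To confirm which export version a router is actually sending, the first two bytes of each datagram can be inspected: NetFlow v5, NetFlow v9, and IPFIX all begin with a 16-bit version field in network byte order (5, 9, and 10 respectively). The helper below is a hypothetical diagnostic, not part of Routelock.

```python
import struct

VERSION_NAMES = {5: "NetFlow v5", 9: "NetFlow v9", 10: "IPFIX"}


def export_version(datagram):
    """Return the flow export format named by the datagram's version field."""
    if len(datagram) < 2:
        raise ValueError("datagram too short to contain a version field")
    (version,) = struct.unpack("!H", datagram[:2])
    return VERSION_NAMES.get(version, f"unknown ({version})")


if __name__ == "__main__":
    # Synthetic header prefix: version 9 followed by a record count of 1.
    sample = struct.pack("!HH", 9, 1)
    print(export_version(sample))
```

The input can be the UDP payload of a packet captured with tcpdump; anything other than "NetFlow v9" points to an exporter setting that needs changing.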

BGP Session Not Establishing

Symptoms: BIRD shows session in Connect/Active state.

Verify BIRD is running: birdc show status
Check TCP connectivity to BGP peer: nc -zv peer_ip 179
Verify AS numbers: the neighbor AS configured on each side must match the other side's local AS
Check router-id uniqueness
Review BIRD logs: journalctl -u bird -f
Ensure Routelock's BIRD config include directory is properly referenced in the main bird.conf
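
The TCP connectivity check in the list above can also be scripted, for example to test several peers in one run. This is a minimal sketch equivalent to nc -zv; the function name is hypothetical, and 192.0.2.1 is a documentation address standing in for a real peer.

```python
import socket


def tcp_port_open(host, port=179, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for peer in ["192.0.2.1"]:  # replace with your BGP peer addresses
        state = "reachable" if tcp_port_open(peer) else "unreachable"
        print(f"{peer}:179 {state}")
```

A successful connection only shows that port 179 is reachable; the session can still fail during the BGP OPEN exchange, for example on an AS number mismatch.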

Improvements Not Being Created

Symptoms: System collects data and probes but no improvements appear.

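Check that the operating mode (test, human, or robot) is set as intended; in Test and Human modes, proposals appear as pending changes rather than injected routes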
Verify minimum improvement threshold: a 20% default may be too high for well-optimized networks
Ensure multiple providers have active BGP sessions (need at least 2 paths to compare)
Check probe results: GET /api/v1/probes/results. If all providers show similar metrics, no improvement is possible
Verify anti-flap timers are not blocking re-optimization of recently withdrawn prefixes
Check rate limits: if max_inject_rate has been reached, new improvements may be held in the injection queue rather than applied immediately

High Memory Usage

Symptoms: Routelock consuming excessive RAM.

Full BGP tables (1.1M routes) require approximately 2-3 GB RAM in BIRD
Reduce top_n prefix count if monitoring too many prefixes
Check for NetFlow buffer growth: increase worker count to process flows faster
Enable TimescaleDB compression for older chunks
Review probe pool size: reduce concurrent probes if memory is constrained

Database Connection Errors

Symptoms: "connection refused" or "too many connections" errors.

Verify PostgreSQL is running: systemctl status postgresql
Check max_connections in postgresql.conf (should be higher than Routelock's pool size)
Ensure TimescaleDB extension is installed in the Routelock database: psql -d routelock -c "SELECT extversion FROM pg_extension WHERE extname='timescaledb'"
Check pg_hba.conf for authentication rules matching the Routelock user

Diagnostic Commands

# Check system health
curl -s http://localhost:8080/api/v1/system/health | jq

# View recent logs
journalctl -u routelock -n 100 --no-pager

# Check BIRD status
birdc show protocols
birdc show route count

# Check database size
psql -d routelock -c "SELECT pg_size_pretty(pg_database_size('routelock'));"

# Check hypertable chunk status
psql -d routelock -c "SELECT * FROM timescaledb_information.chunks ORDER BY range_start DESC LIMIT 10;"

Routelock Knowledge Base

What is Routelock?

Introduction

How It Works

Comparison with Noction IRP

Key Features

Related Articles

System Requirements

Hardware Requirements

Software Prerequisites

Network Requirements

Network Topology

Related Articles

Quick Start Guide

Step 1: Install Dependencies

Step 2: Create the Database

Step 3: Configure Routelock

Step 4: Run Migrations

Step 5: Start Routelock

Step 6: Verify NetFlow Reception

Related Articles

Understanding Operating Modes

Overview

Test Mode (Observe Only)

Human Mode (Approval Required)

Robot Mode (Fully Automated)

Changing Modes

Related Articles

Router & Interface Setup Guide

Overview

Step 1: Register Your Routers

Router Roles

Step 2: Discover Interfaces

Step 3: Classify Interfaces

How Direction Detection Works

SNMP Bandwidth Polling

Multi-Router Setup

Troubleshooting

"SNMP connection failed"

"No interfaces discovered"

"Flows not tagged with provider"

"Bandwidth shows 0"

Related Articles

Providers

What Are Providers?

Provider Types

Transit Providers

Partial-Route Providers

IX Providers

Provider Configuration

Metrics Tracked Per Provider

Related Articles

Prefixes & Routes

BGP Routing Fundamentals

How Routelock Optimizes Prefixes

Prefix Lifecycle

Best-Path Selection

Related Articles

Improvements

What Are Improvements?

Improvement Lifecycle

Weight Scoring

TTL and Re-evaluation

Anti-Flap Protection

Related Articles

Traffic Analysis

NetFlow Collection

Top Prefix Identification

Traffic Distribution

Flow Processing Pipeline

Related Articles

Active Probing

Overview

Probe Types

ICMP Probes

UDP Probes

TCP Probes

Policy-Based Routing (PBR)

Adaptive Probing

Probe Algorithms