Network Design

This page documents some key elements of the current burble.dn42 design.

Tunnel Mesh

Hosts within the burble.dn42 network are joined using a WireGuard/L2TP mesh. Static, unmanaged L2TP tunnels operate at the IP level and are configured to create a full mesh between nodes. WireGuard provides encryption and encapsulates the L2TP traffic in plain UDP, which hides fragmentation from the underlay and allows packets to be processed in intermediate routers' fast path.

Using L2TP allows for a large virtual MTU of 4310 between nodes; this is chosen to spread the encapsulation costs of the higher layers across packets. L2TP also allows multiple tunnels and sessions between hosts, which can be used to separate low-level traffic (e.g. NFS cross-mounting) without incurring the additional overhead of VXLAN.

Network configuration on hosts is managed by systemd-networkd and applied with Ansible.
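
As a rough illustration of how one mesh link is constructed, the commands below build the WireGuard interface and the static L2TP tunnel/session by hand. The interface names, IDs, keys, ports and addresses are hypothetical; in the real deployment the equivalent configuration is rendered as systemd-networkd units by Ansible rather than typed as ad-hoc commands.

    # WireGuard interface carrying the encrypted underlay towards the peer node
    # (key path and '<node2-public-key>' are placeholders)
    ip link add wg-node2 type wireguard
    wg set wg-node2 listen-port 51821 private-key /etc/wireguard/wg-node2.key \
        peer '<node2-public-key>' endpoint node2.example.net:51821 \
        allowed-ips 172.31.0.2/32
    ip addr add 172.31.0.1 peer 172.31.0.2/32 dev wg-node2
    ip link set wg-node2 up

    # Static, unmanaged L2TPv3 tunnel over the WireGuard addresses; the
    # oversized L2TP/UDP datagrams are fragmented before they reach WireGuard,
    # so every clearnet packet stays a standard size
    ip l2tp add tunnel tunnel_id 12 peer_tunnel_id 21 encap udp \
        local 172.31.0.1 remote 172.31.0.2 udp_sport 1701 udp_dport 1701
    ip l2tp add session name l2tp-node2 tunnel_id 12 session_id 1 peer_session_id 1

    # Large virtual MTU between the nodes
    ip link set l2tp-node2 mtu 4310 up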

Real Life Networks and Fragmentation

Earlier designs for the burble.dn42 network relied on passing fragmented packets directly down to the clearnet layer (e.g. via ESP IPsec fragmentation, or UDP fragmentation with WireGuard). In practice it was observed that clearnet ISPs could struggle with uncommon packet types, with packet loss seen particularly in the IPv6 case. It seems likely that some providers' anti-DDoS and load-balancing platforms were particularly responsible for magnifying this problem.

To resolve this, the network was re-designed to ensure fragmentation takes place at the L2TP layer, so that all traffic is encapsulated into standard-sized UDP packets. This design ensures all traffic looks ‘normal’ and can remain within intermediate routers' fast path.

ISP Rate Limiting

The burble.dn42 network uses jumbo-sized packets that are fragmented by L2TP before being encapsulated by WireGuard. This means a single packet in the overlay layers can generate multiple WireGuard UDP packets in quick succession, appearing as a high-bandwidth burst of traffic on the outgoing clearnet interface. It’s vital that all of these packets arrive at the destination, or the entire overlay packet will be corrupted. For most networks this is not a problem and the approach generally works very well.

However, if you have bandwidth limits with your ISP (e.g. a 100Mbit allowance provided on a 1Gbit port), packets may be generated at a high bit rate and then policed down by the ISP to match the bandwidth allowance. This would normally be fine, but when a fragmented packet is sent, the burst of smaller packets is highly likely to exceed the bandwidth allowance, and the impact on upper-layer traffic is severe: losing any fragment corrupts the whole overlay packet, so nearly all large packets end up being dropped.

The burble.dn42 network manages this issue by implementing traffic shaping on outgoing traffic using Linux tc (via FireQOS). This allows outgoing packets to be queued and paced at the permitted rate, rather than being arbitrarily policed by the ISP.
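
As an illustration only, a minimal version of this shaping could be applied with plain tc; the interface name and rate below are hypothetical, and the real deployment uses FireQOS to generate a more complete tc configuration. The shaped rate is set slightly below the ISP allowance so that queueing happens locally rather than at the ISP.

    # Shape egress slightly below the ISP allowance (e.g. 95Mbit on a 100Mbit cap)
    tc qdisc replace dev eth0 root cake bandwidth 95Mbit

    # Roughly equivalent alternative without cake: HTB with an fq_codel leaf
    tc qdisc replace dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:10 htb rate 95mbit ceil 95mbit
    tc qdisc add dev eth0 parent 1:10 fq_codel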

BGP EVPN

EVPN diagram

Overlaying the WireGuard/L2TP mesh is a set of VXLANs managed by a BGP EVPN.

The VXLANs are primarily designed to tag and isolate transit traffic, making their use similar to MPLS.

The Babel routing protocol is used to discover loopback addresses between nodes; Babel is configured to operate across the point-to-point L2TP tunnels, with a static, latency-based metric that is applied during deployment.

The BGP EVPN uses FRR with two global route reflectors located on different continents, for redundancy. Once overheads are taken into account, the MTU within each VXLAN is 4260.
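
As a sketch of how one of these VXLANs sits on top of the mesh, the commands below create an EVPN-style VXLAN device by hand; the VNI, device names and loopback address are hypothetical. No static remotes or multicast groups are configured, because MAC/IP reachability is distributed by the BGP EVPN.

    # EVPN-managed VXLAN: learning disabled, no remote/group configured,
    # local endpoint is the node loopback discovered via Babel
    ip link add vxlan100 type vxlan id 100 dstport 4789 \
        local 172.31.255.1 nolearning
    ip link set vxlan100 mtu 4260

    # Bridge the VNI locally; the EVPN control plane advertises local MACs
    ip link add br-vni100 type bridge
    ip link set vxlan100 master br-vni100
    ip link set vxlan100 up
    ip link set br-vni100 up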

dn42 Core Routing

EVPN diagram

Each host in the network runs an unprivileged LXD container that acts as a dn42 router for that host. The container uses Bird2 and routes between dn42 peer tunnels, local services on the same node, and transit to the rest of the burble.dn42 network via a single dn42 core VXLAN.

Local services and peer networks are fully dual-stack IPv4/IPv6; however, the transit VXLAN uses only IPv6 link-local addressing, making use of the BGP multiprotocol and extended next hop capabilities to carry IPv4 routes.

The transit VXLAN and burble.dn42 services networks use an MTU of 4260; however, the dn42 BGP configuration includes internal communities that distribute the destination MTU across the network, giving per-route MTUs. This helps ensure path MTU discovery takes place as early and as efficiently as possible.

Local services on each host are provided by LXD containers or VMs connecting to internal network bridges.
These vary across hosts but typically include:

  • tier1 - used for publicly available services (DNS, web proxy, etc)
  • tier2 - used for internal services, with access restricted to burble.dn42 networks

Other networks might include:

  • dmz - used for hosting untrusted services (e.g. the shell servers)
  • dn42 services - for other networks, such as the registry services
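
As a rough sketch of how a container is attached to one of these bridges (the bridge and container names are hypothetical, and in practice this is driven by the LXD and Ansible configuration rather than done by hand):

    # Create an internal service bridge on the host
    ip link add br-tier1 type bridge
    ip link set br-tier1 up

    # Attach an extra NIC in an LXD container to the bridge
    lxc config device add dns1 eth1 nic nictype=bridged parent=br-tier1 name=eth1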

dn42 peer tunnels are created directly on the host and then injected into the container using a small script, allowing the router container itself to remain unprivileged.
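
The script itself isn't reproduced here, but one plausible approach is sketched below (interface, container and path names are hypothetical). A WireGuard interface keeps its cryptographic state when it moves between network namespaces, so it can be configured on the host and then handed to the container.

    # Create and configure the peer tunnel on the host
    ip link add wg-peer-64512 type wireguard
    wg setconf wg-peer-64512 /etc/wireguard/peers/64512.conf

    # Move it into the unprivileged router container as a physical NIC device
    lxc config device add dn42-router wg-peer-64512 nic \
        nictype=physical parent=wg-peer-64512 name=wg-peer-64512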

The routers also run nftables to manage access to each of the networks, bird_exporter for metrics, and the bird-lg-go proxy for the burble.dn42 looking glass.
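
The real ruleset isn't shown here, but a hedged sketch of the kind of nftables policy involved might look like the following; the interface names and source prefix are placeholders.

    # Default-drop forwarding policy on the router
    nft add table inet fw
    nft add chain inet fw forward '{ type filter hook forward priority 0 ; policy drop ; }'
    nft add rule inet fw forward ct state established,related accept

    # tier1 services are reachable from anywhere in dn42
    nft add rule inet fw forward oifname "tier1" accept

    # tier2 services only from burble.dn42's own networks
    BURBLE_NET4="172.20.0.0/24"   # placeholder for the real internal prefixes
    nft add rule inet fw forward ip saddr "$BURBLE_NET4" oifname "tier2" accept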

Host Configuration

EVPN diagram

burble.dn42 nodes are designed to have the minimum functionality at the host level, with all major services being delivered via virtual networks, containers and VMs.

Hosts have three main functions:

  • connecting into the burble.dn42 WireGuard/L2TP mesh and BGP EVPN
  • providing internal bridges for virtual networks
  • hosting LXD containers and VMs

Together these three capabilities allow arbitrary, isolated networks and services to be created and hosted within the network.

The hosts also provide a few ancillary services:

  • delivering clearnet access for internal containers/VMs using an internal bridge. The host manages addressing and routing for the bridge so that containers get clearnet access independent of the host's own connectivity (e.g. proxied vs routed IPv6); a minimal sketch follows this list
  • creating dn42 peer tunnels and injecting them into the dn42 router container
  • monitoring via netdata
  • backup using borg
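
For the clearnet-access bridge mentioned above, a minimal sketch for an IPv4-only, NAT-based host might look like the following; the names and prefixes are hypothetical, and hosts with routed IPv6 would instead route a prefix onto the bridge.

    # Internal bridge for container/VM clearnet access
    ip link add br-clearnet type bridge
    ip addr add 192.168.42.1/24 dev br-clearnet
    ip link set br-clearnet up

    # Forward and masquerade container traffic out of the host uplink (eth0)
    sysctl -w net.ipv4.ip_forward=1
    nft add table ip clearnet
    nft add chain ip clearnet postrouting '{ type nat hook postrouting priority 100 ; }'
    nft add rule ip clearnet postrouting ip saddr 192.168.42.0/24 oifname "eth0" masquerade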