VyOS HA in AWS

TL;DR

This post shows how to solve a recurring problem when using highly-available virtual routers in AWS: floating IPs.

This approach uses a Python script that lets the new master router claim an EC2 secondary private IP during the failover transition.

Motivation

For certain AWS architectures we need to deploy a managed virtual router (EC2 instance) to handle tunnel termination, BGP sessions, NATing, etc. In a production environment, High Availability for these network functions is clearly a must, so that services are minimally impacted if one of the routers fails.

I’ve chosen VyOS for this scenario since it is an open-source fork of Vyatta. VyOS is an operating system for network appliances with multiple capabilities such as routing, firewalling, VPN, VXLAN, BGP peering, etc., which allows it to be used in projects with managed infrastructure. Its easy-to-use command-line interface and extensive documentation are also worth mentioning.

An added complexity in these kinds of deployments is that AWS does not support multicast traffic, so HA protocols that rely on multicast by default (such as VRRP) have to run in unicast mode.

Architecture

A specific problem I faced when designing a solution with managed routers in AWS was handling NATed outgoing traffic from the on-premises private environment together with the BGP sessions.

To replicate this scenario, I’ve set up a first tunnel to the AWS infrastructure, and a second one between an on-premises VyOS router (shown without HA to simplify the diagram) and its highly-available AWS counterpart.

 

VIPA in AWS

For a regular active/passive cluster configuration like this one, we will need, apart from the routers’ IPs, a virtual IP address to float between them in a failover scenario.

AWS doesn’t provide this kind of floating IP; instead, any IP used within the VPC range must be explicitly assigned to an EC2 instance (more precisely, to one of its network interfaces).

To solve this problem, I’ve created a script (vrrp-master.py) to be configured on both routers, which claims (reassigns to itself) the IP designated as the VIPA during a failover.

This script manages the VIPA assignment automatically, so assigning this IP manually is strongly discouraged in order to avoid human error (like forgetting to allow reassignment).

Considerations

Because this script uses the boto3 Python module (the AWS SDK for Python), we must install it on the VyOS router:

echo "deb http://ftp.de.debian.org/debian/ jessie main contrib non-free" > /etc/apt/sources.list
apt-get update && apt-get -y install python-pip && pip install boto3

Since there is already private connectivity between the on-premises facilities and AWS (using a Customer Gateway attached to a Transit Gateway), we don’t want to assign a public IP to the EC2 instances. Therefore, we create an EC2 VPC Endpoint (e.g. com.amazonaws.eu-central-1.ec2), making sure that the “Private DNS Name” option is enabled so that the endpoint resolves to a private IP of the VPC.
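One way to confirm that the private DNS name is in effect is to resolve the regional EC2 hostname and check that the answer falls inside the VPC range. The following is only a minimal sketch: it assumes Python 3 (for the ipaddress module) on any host inside the VPC, and the function names are hypothetical helpers that are not part of the original setup.

```python
import ipaddress
import socket


def is_inside_vpc(address, vpc_cidr):
    """Return True if the IPv4 address belongs to the given CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(vpc_cidr)


def endpoint_is_private(hostname, vpc_cidr):
    """Resolve hostname and check that every returned IPv4 address is inside vpc_cidr."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return bool(addresses) and all(is_inside_vpc(a, vpc_cidr) for a in addresses)
```

With the endpoint in place, calling endpoint_is_private("ec2.eu-central-1.amazonaws.com", vpc_cidr) with your own VPC CIDR should return True; a False result suggests the name still resolves to the public EC2 endpoint.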

At the time of writing, STS (Security Token Service) only offers a VPC Endpoint in the Oregon region (com.amazonaws.us-west-2.sts), so if we want to use STS roles in another region, the routers must have internet access (not really an option for production environments). To overcome this, we created a user with the following policy directly attached (limited to the “vyos-ha” user and the VPC “vpc-0a6f6a161f5ae1fc2”):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeAddresses",
        "ec2:DescribeInstances",
        "ec2:AssignPrivateIpAddresses"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:username": "vyos-ha",
          "aws:SourceVpc": "vpc-0a6f6a161f5ae1fc2"
        }
      }
    }
  ]
}
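If the same policy has to be reused for other users or VPCs, it can be generated instead of edited by hand. This is just a sketch: build_vipa_policy and render_policy are hypothetical helper names, and the optional "Sid" field is left out.

```python
import json


def build_vipa_policy(username, vpc_id):
    """Return the policy document as a dict, scoped to one IAM user and one VPC."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeAddresses",
                    "ec2:DescribeInstances",
                    "ec2:AssignPrivateIpAddresses",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aws:username": username,
                        "aws:SourceVpc": vpc_id,
                    }
                },
            }
        ],
    }


def render_policy(username, vpc_id):
    """Serialize the policy as indented JSON, ready to paste into the IAM console."""
    return json.dumps(build_vipa_policy(username, vpc_id), indent=2)
```

For the values used in this post, render_policy("vyos-ha", "vpc-0a6f6a161f5ae1fc2") produces a document equivalent to the one above.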

Cluster

According to the VyOS website, cluster mode is the recommended method, since it allows us to have a service as a cluster resource associated with the VIPA.

Unfortunately, the VyOS version available in the AWS Marketplace doesn’t allow unicast traffic in this mode:

vyos@vyos-1# set cluster interface eth0 peer 100.80.33.249
  Configuration path: cluster interface eth0 [peer] is not valid
  Set failed
[edit]
vyos@vyos-1#

VRRP

Luckily, unicast traffic for VRRP is implemented in the VyOS version available in AWS.

Here is the VRRP configuration for both routers:

set high-availability vrrp group vyos-aws vrid 10
set high-availability vrrp group vyos-aws interface eth0
set high-availability vrrp group vyos-aws virtual-address /
set high-availability vrrp group vyos-aws priority 200
set high-availability vrrp group vyos-aws no-preempt
set high-availability vrrp group vyos-aws peer-address 
set high-availability vrrp group vyos-aws hello-source-address 
set high-availability vrrp group vyos-aws transition-script master "/config/scripts/vrrp-master.py "
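The values for virtual-address, peer-address, hello-source-address and the script argument are blank in the listing above. The sketch below renders the commands for one router under the assumption that those blanks stand for, respectively, the VIPA with its prefix length, the other router's IP, the router's own IP, and the VIPA passed to the script. The function name is hypothetical and the addresses in the usage loop are documentation examples, not the ones from this deployment.

```python
def vrrp_commands(own_ip, peer_ip, vipa, prefix_len, group="vyos-aws",
                  script="/config/scripts/vrrp-master.py"):
    """Render the VRRP `set` commands for one router of the pair."""
    base = "set high-availability vrrp group {}".format(group)
    return [
        "{} vrid 10".format(base),
        "{} interface eth0".format(base),
        "{} virtual-address {}/{}".format(base, vipa, prefix_len),
        "{} priority 200".format(base),
        "{} no-preempt".format(base),
        # Unicast VRRP: each router points at the other one...
        "{} peer-address {}".format(base, peer_ip),
        # ...and sends its hellos from its own address
        "{} hello-source-address {}".format(base, own_ip),
        # Assumption: the script receives the VIPA as its first argument
        '{} transition-script master "{} {}"'.format(base, script, vipa),
    ]


# Hypothetical addresses; swap own_ip and peer_ip for the second router
for line in vrrp_commands("192.0.2.11", "192.0.2.12", "192.0.2.100", 24):
    print(line)
```

Only own_ip and peer_ip change between the two routers; everything else is identical on both.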

Verifying the configuration

To check the VRRP status we can use this command:

vyos@vyos-1$ show vrrp
Name      Interface    VRID    State    Last Transition
--------  -----------  ------  -------  -----------------
vyos-aws  eth0         10      MASTER   6s
vyos@vyos-1$

To test the failover, we can restart the MASTER node:

vyos@vyos-1$ reboot backup
Are you sure you want to reboot this system? [y/N] y

Once the master node is powered off, the backup (slave) router becomes the new master and the script claims the VIPA through the EC2 VPC Endpoint.

When the previous stage finishes, the VIPA (100.80.33.100 in the example) shows up as a secondary IP on the eth0 NIC.

This can be verified by listing the eth0 interface within the router:

vyos@vyos-2# ip address list eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc mq state UP group default qlen 1000
    link/ether 0a:bf:ab:a6:ff:68 brd ff:ff:ff:ff:ff:ff
    inet 100.80.33.158/24 brd 100.80.33.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 100.80.33.100/24 scope global secondary eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::8bf:abff:fea6:ff68/64 scope link
       valid_lft forever preferred_lft forever
vyos@vyos-2#

In the AWS console, we can also see the VIPA in the new MASTER’s “Secondary private IP” field (EC2 instance, Description tab).
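The same check can be scripted against a describe_instances response, which the policy above already allows. The function below is a minimal sketch (find_vipa_owner is a hypothetical name) that walks the response and reports which instance holds the VIPA; the sample dictionary is hand-written to mimic the shape of the response, with a made-up instance ID.

```python
def find_vipa_owner(response, vipa):
    """Return (instance_id, is_primary) for the NIC holding vipa, or None if unassigned."""
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            for nic in instance.get("NetworkInterfaces", []):
                for addr in nic.get("PrivateIpAddresses", []):
                    if addr.get("PrivateIpAddress") == vipa:
                        return instance["InstanceId"], addr.get("Primary", False)
    return None


# Hand-written sample with the same nesting as a describe_instances response
sample = {
    "Reservations": [{
        "Instances": [{
            "InstanceId": "i-0123456789abcdef0",  # hypothetical ID
            "NetworkInterfaces": [{
                "PrivateIpAddresses": [
                    {"PrivateIpAddress": "100.80.33.158", "Primary": True},
                    {"PrivateIpAddress": "100.80.33.100", "Primary": False},
                ],
            }],
        }],
    }],
}

print(find_vipa_owner(sample, "100.80.33.100"))
```

On a router, the response would come from ec2client.describe_instances() instead of the sample. After a failover, the VIPA should be reported on the new master as a non-primary address.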

vrrp-master.py

You can use this simple script to claim the VIPA (it must be copied to both nodes, e.g. with scp, and made executable).

#!/usr/bin/python

import boto3
import os,sys
from botocore.exceptions import ClientError

def reassign_addr(nic, ip):
  try:
    # A single formatted string prints the same under Python 2 and Python 3
    print("Assigning IP %s to NIC %s" % (ip, nic["NetworkInterfaceId"]))
    response = ec2client.assign_private_ip_addresses(
      NetworkInterfaceId = nic["NetworkInterfaceId"],
      AllowReassignment = True,
      PrivateIpAddresses = [
        ip,
      ],
    )
    print(response)
  except ClientError as e:
    print(e)

def get_iface(instance):
  # Return the first NIC of the instance, or None if it has none
  if instance.get("NetworkInterfaces", False):
    # VyOS instances have only one NIC
    return instance["NetworkInterfaces"][0]

def get_instance(response, instance_id):
  for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
      if instance["InstanceId"] == instance_id:
        return instance

# The VIPA is expected as the first argument; any extra arguments are ignored
if len(sys.argv) < 2:
  print("This script expects the VIPA as argument.")
  sys.exit(2)
vipa = sys.argv[1]

# Set credentials and config files path

os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "/root/.aws/credentials"
os.environ["AWS_CONFIG_FILE"] = "/root/.aws/config"

session = boto3.session.Session()
ec2client = session.client('ec2', region_name = 'eu-central-1')
response = ec2client.describe_instances()

with open('/run/cloud-init/.instance-id', 'r') as instance_file:
  my_instance_id=instance_file.read().replace('\n', '')

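instance = get_instance(response, my_instance_id)
if instance is None:
  print("Instance %s not found in the describe_instances response." % my_instance_id)
  sys.exit(1)
iface = get_iface(instance)
if iface is None:
  print("Instance %s has no network interface." % my_instance_id)
  sys.exit(1)
reassign_addr(iface, vipa)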

Conclusions

We’ve seen how to get around an AWS limitation when deploying a highly-available VyOS router.

Since AWS doesn’t provide floating IPs, the VIPA failover is done using the Python SDK (boto3) and a user with a restrictive policy. We couldn’t use STS since its VPC Endpoint is not available outside the Oregon region, and giving the routers direct internet access is unacceptable.

Unfortunately, we cannot use the VyOS cluster mode since unicast peers are currently not supported for it in the latest AWS AMI version, so we have opted to use VRRP unicast instead.

Both routers were deployed with the VyOS AMI, so we need to install the boto3 module beforehand. This can be done by connecting them to an Internet Gateway (test) or by downloading the packages from a private and secured package repository (prod).

That’s all for now. I hope you’ve enjoyed it, and if you have any trouble testing or deploying this architecture, feel free to leave a question in the comments section.


Published by Santiago Sánchez Paz

Engineer, passionate about new technologies and a world citizen (always a traveller, never a tourist). As a Solution Architect, Santiago loves to design resilient, secure and fault-tolerant open source distributed systems, with a focus on Big Data and self-healing architectures.
