Documentation

Kowabunga

1 - Overview

How can Kowabunga sustain your application hosting ?

What is it ?

Kowabunga is an SD-WAN and HCI (Hyper-Converged Infrastructure) Orchestration Engine.

Marketing BS aside, Kowabunga provides DevOps teams with a complete infrastructure automation suite to orchestrate virtual resources on privately-owned commodity hardware.

It brings the best of both worlds:

  • Cloud API, automation, infrastructure-as-code, X-as-a-service …
  • On-Premises mastered and predictable flat-rate hardware.

The Problem

Cloud services are unnecessarily expensive and come with vendor lock-in.

“Cloud computing is basically renting computers, instead of owning and operating your own server hardware. From the start, companies that offer cloud services have promised simplicity and cost savings. Basecamp has had one foot in the cloud for well over a decade, and HEY has been running there exclusively since it was launched two years ago. We’ve run extensively in both Amazon’s cloud and Google’s cloud, but the savings promised in reduced complexity never materialized. So we’ve left.

The rough math goes like this: We spent $3.2m on cloud in 2022. The cost of rack space and new hardware is a total of $840,000 per year.

Leaving the cloud will save us $7 million over five years.

At a time when so many companies are looking to cut expenses, saving millions through hosting expenses sounds like a better first move than the rounds of layoffs that keep coming.”

Basecamp by 37signals

Why Kowabunga ?

  • Cost-Effective: Full private-cloud on-premises readiness and ability to run on commodity hardware. No runtime fees, no egress charges, flat-rate predictable cost. Keep control of your TCO.

  • Resilient & Feature-Rich: Kowabunga enables highly-available designs across multiple data centers and availability zones, and brings automated software-as-a-service. Shorten application development and setup times.

  • No Vendor Lock-In: Harness the potential of an Open Source software stack as a backend: no third-party commercial dependency. We stand on the shoulders of giants: KVM, Ceph … Technical choices remain yours and yours only.

  • Open Source … by nature: Kowabunga itself is Open Source, from API to client and server-side components. We have nothing to hide but everything to contribute. We believe in mutual trust.

A Kowabunga-hosted project costs 1/10th of a Cloud-hosted one.

Why do I want it ?

  • What is it good for?: The success of modern SaaS products is tightly coupled with profitability. As soon as you scale up, you’ll quickly understand that you’re actually sponsoring your Cloud provider more than your own teams. Kowabunga allows you to keep control of your infrastructure and its associated cost and lifecycle. You’ll never have to fear unexpected business model changes, tariffs and whatnot. You own your stack, with no surprises.

  • What is it not good for?: PoC and MVP startups. Let’s be realistic, if your goal is to vibe-code your next million-dollar idea and deliver it, no matter how and what, forget about us. You have other fish to fry than mastering your own infrastructure. Get funded, wait for your investors to ask for RoI, and you’ll make up your mind.

  • What is it not yet good for?: Competing with GAFAM. Let’s be honest, we’ll never be the next AWS or GCP (or even OpenStack). We’ll never have 200+ as-a-service kind of stuff, but how many people actually need that much ?

Is it business-ready ?

Simply put … YES !

Kowabunga allows you to host and manage personal labs, SOHO sandboxes, as well as million-user SaaS projects. Using Open Source software doesn’t mean being on your own. Through our sponsoring program, Kowabunga comes with 24x7 enterprise-grade support.

Fun Facts 🍿

Where does it come from ? Everything comes as a solution to a given problem.

Our problem was (and still is …) that Cloud services are unnecessarily expensive and often come with vendor lock-in. While Cloud services are appealing at first and great to bootstrap your project to an MVP level, you’ll quickly hit profitability issues when scaling up.

Provided you have the right IT and DevOps skills in-house, self-managing your own infrastructure makes economic sense.

Linux and QEMU/KVM come in handy, especially when powered by libvirt, but we lacked true resource orchestration to push it to the next stage.

OpenStack was too big, heavy, and costly to maintain. We needed something lighter and simpler.

So we came up with Kowabunga: Kvm Orchestrator With A BUNch of Goods Added.

Where should I go next ?

2 - Concepts

Learn about Kowabunga conceptual architecture

Conceptual Architecture

Simply put, Kowabunga allows you to control and manage low-level infrastructure in your local on-premises data-centers and spin up various virtual resources on top of it to host your applications.

Local data centers consist of a bunch of physical machines (ranging from personal computers and commodity hardware to high-end enterprise-grade servers) providing raw networking, computing and storage resources. Physical assets plainly sit in your basement. They don’t need to be connected to other data-centers, they don’t even need to know about other data-centers’ existence and, more than anything, they don’t need to be exposed to the public Internet.

From an IT and asset management perspective, one simply needs to ensure they run and, with capacity planning in mind, that they offer enough physical resources to sustain future application hosting needs.

In each data-center, some physical machines (usually lightweight) will be dedicated to providing networking services, through Kowabunga’s Kiwi agents, while others will provide computing and storage capabilities, thanks to Kowabunga’s Kaktus agents.

The Kowabunga project then comes with Kahuna, its orchestration engine. This is the cornerstone of your architecture. Kahuna acts as a maestro, providing API services for admins and end-users, and provisioning and controlling virtual resources in the various data-centers through connected Kowabunga agents.

Ultimately, DevOps consumers will only ever interface with Kahuna.

So, how does magic happen ?

Kahuna exposes three roles:

  • Public REST API: implements and operates the API calls to manage resources, DevOps-orchestrated, manually (not recommended) or through automation tools such as Terraform, OpenTofu or Ansible.
  • Public WebSocket endpoint: agent connection manager, to which the various Kowabunga agents (from managed data-centers) establish secure WebSocket tunnels in order to be controlled, bypassing on-premises firewall constraints and removing the need for any public service exposure.
  • Metadata endpoint: where managed virtual instances and services can retrieve their own information and configure themselves.

Core Components

So, let’s rewind: the Kowabunga project consists of multiple core components:

  • Kahuna: the core orchestration system. Remotely controls every resource and keeps the ecosystem consistent. Gateway to the Kowabunga REST API.
  • Kaktus: the HCI node(s). Provides KVM-based virtual computing hypervisor with Ceph-based distributed storage services.
  • Kiwi: the SD-WAN node(s). Provides various network services like routing, firewall, DHCP, DNS, VPN, peering (with active-passive failover).
  • Koala: the WebUI. Allows for day-to-day supervision and operation of the various projects and services.

Aside from these, Kowabunga introduces the concept of:

  • Region: basically a physical location, which can be thought of as a data-center.
  • Zone: a specific subset of a region, where all underlying resources are guaranteed to be self-autonomous (in terms of Internet connectivity, power-supply, cooling …). As with other Cloud providers, the zones allow for application workload distribution within a single region, offering resilience and high-availability.

Topology Use Cases

This illustrates what a Kowabunga multi-zone, multi-region topology would look like:

On the left side, one would have a multi-zone region. Divided into 3 zones (i.e. 3 physically isolated data-centers, inter-connected by a network link), the region features 11 server instances:

  • 2 Kiwi instances, providing networking capabilities
  • 3x3 Kaktus instances, providing computing and storage capabilities.

Zones can be pictured in different ways:

  • several floors from your personal home basement (ok … useless … but for the sake of example).
  • several IT rooms from your company’s office.
  • several buildings from your company’s office.

Should a Kowabunga user request the creation of a virtual machine in this region, they could specifically request it to be assigned to one of the 3 zones (the underlying hypervisor from that zone will be picked automatically), or request some as-a-service feature, which would be seamlessly spawned in multiple zones to provide service redundancy.

Since the whole region shares the same L2/L3 network, disk instances will be distributed and replicated across zones, allowing for fast instance relocation in the event of one zone’s failure.

On the right side, one would have a single-zone region, with just a couple of physical instances.

What Makes it Different ?

Cloud providers aside, what makes Kowabunga different from other on-premises infrastructure and virtualization providers (such as VMware, Nutanix, OpenStack …) ?

Well … 0 licensing costs. Kowabunga is Open Source with no paywalled features. There’s no per-CPU or per-GB-of-memory kind of license. Whether you’d like to set up your small company’s private data-center with 3 servers or a full fleet of 200+, your cost of operation will remain flat.

But aside from cost, Kowabunga has been developed by and for DevOps, the ones who:

  • need to orchestrate, deploy and maintain heterogeneous applications on heterogeneous infrastructures.
  • use Infrastructure-as-Code principles to ensure reliability, durability and traceability.
  • bear security in mind, ensuring that nothing more than what’s required is publicly exposed.
  • believe that smaller and simpler is better.

2.1 - Kahuna

Learn about Kahuna orchestrator.

Kahuna is Kowabunga’s orchestration system. Its name comes from the Hawaiian word (Big) Kahuna, meaning “the expert, the most dominant thing”.

Kahuna remotely controls every resource and keeps the ecosystem consistent. It’s the gateway to the Kowabunga REST API.

From a technological stack perspective, Kahuna features:

  • a Caddy public HTTPS frontend, reverse-proxying requests to:
    • Koala Web application, or
    • Kahuna orchestrator daemon
  • a MongoDB database backend.

The Kahuna orchestrator features:

  • Public REST API handler: implements and operates the API calls to manage resources, interacting with the relevant local agents through JSON-RPC over WSS.
  • Public WebSocket handler: agent connection manager, to which the various agents establish secure WebSocket tunnels in order to be controlled, bypassing on-premises firewall constraints and removing the need for any public service exposure.
  • Metadata endpoint: where managed virtual instances and services can retrieve their own information and configure themselves.

The Kowabunga API is split into 2 types of assets:

  • admin ones, used to handle objects like region, zone, kaktus and kiwi hosts, agents, networks …
  • user ones, used to handle objects such as Kompute, Kawaii, Konvey

Kahuna implements robust RBAC and segregation of duties to enforce access boundaries, such as:

  • Nominative RBAC capabilities and per-organization and team user management.
  • Per-project team association for per-resource access control.
  • Support for both JWT bearer (human-to-server) and API-Key token-based (server-to-server) authentication mechanisms.
  • Support for 2-step account creation/validation and enforced robust password/token usage (server-generated, user input is prohibited).
  • Nominative robust HMAC ID+token credentials over secured WebSocket agent connections.

This ensures that:

  • only rightful designated agents are able to establish WSS connections with Kahuna
  • created virtual instances can only retrieve the metadata profile they belong to (and self configure or update themselves at boot or runtime).
  • users can only see and manage resources for the projects they belong to.

2.2 - Koala

Learn about Koala Web application.

Koala is Kowabunga’s WebUI. It allows for day-to-day supervision and operation of the various projects and services.

Koala

But should you ask a senior DevOps / SRE / IT admin, fully automation-driven, they’d curse anyone who used the Web client to manually create/edit resources and mess with their perfectly maintained CasC.

We’ve all been there !!

That’s why Koala has been designed to be read-only. While changes go through Kowabunga’s API, the project’s directive is to enforce infrastructure and configuration as code and, as such, to prevent any means of doing harm.

Koala is AngularJS based and usually located next to Kahuna’s instance. It lets users connect, check the resources of the various projects they belong to, optionally start/reboot/stop them and/or see various pieces of information and … that’s it ;-)

2.3 - Kiwi

Learn about Kiwi SD-WAN node.

Kiwi is Kowabunga’s SD-WAN node in your local data-center. It provides various network services like routing, firewall, DHCP, DNS, VPN and peering, all with active-passive failover (ideally over multiple zones).

Kiwi is central to the smooth operation of the regional infrastructure and is the internal gateway to all your projects’ Kawaii private network instances. It controls the local network configuration and creates/updates VLANs, subnets and DNS entries according to API requests.

Kiwi provides network isolation for Kowabunga projects by enabling a VLAN-bound, cross-zone, project-attributed VPC L3 networking range. Created virtual instances and services are bound to the VPC by default and never publicly exposed unless requested.

Access to a project’s VPC resources is managed through either:

  • Kiwi-managed region-global VPN tunnels.
  • Kawaii-managed project-local VPN tunnels.

Which one to use depends on your own Kowabunga IT policy.

2.4 - Kaktus

Learn about Kaktus HCI node.

Kaktus stands for Kowabunga Amazing KVM and TUrnkey Storage (!!), basically, our Hyper-Converged Infrastructure (HCI) node.

While large virtualization systems such as VMware usually require you to dedicate servers as computing hypervisors (with plenty of CPU and memory) and associate them with remote, extensive NAS or vSAN storage, Kowabunga follows the opposite approach. Modern hardware is powerful enough to handle both computing and storage.

This approach allows you to:

  • use commodity hardware, if needed
  • use heterogeneous hardware, each member of the pool featuring more or less computing and storage resources.

If you’re already ordering a heavy computing rackable server, extending it with 4-8 SSDs is always going to be cheaper than adding an extra enterprise SAN.

Kaktus nodes will then consist of:

  • a KVM/QEMU + libvirt virtualization computing stack. Featuring all possible VT-x and VT-d assistance on x86_64 architectures, it’ll provide near passthrough virtualization capabilities.
  • several local disks, to be part of a region-global Ceph distributed storage cluster.
  • the Kowabunga Kaktus agent, connected to Kahuna

From a pure low-level software perspective, our virtualization stack relies on 3 components:

  • Linux Network Bridging driver, giving virtual interfaces access to the host’s raw network interfaces and physical network.
  • Linux KVM driver, for CPU VT-x extension support and improved virtualization performance.
  • RBD (RADOS Block Device) driver, for storing virtual block devices in the distributed Ceph storage engine.

QEMU drives these different backends to virtualize resources on top of them.

Kaktus Topology

Since QEMU is a local host process that needs to be spawned, we need some kind of orchestration layer on top of it. Here comes libvirt. libvirt provides an API over TCP/TLS/SSH that wraps virtual machine definitions in an XML representation that can be fully created/updated/destroyed remotely, controlling QEMU underneath. The Kaktus agent controls the local KVM hypervisor through the libvirt backend, as well as the local-network distributed Ceph storage, allowing management of virtual machines and disks.

3 - Getting Started

Deploy your first Kowabunga instance !

3.1 - Hardware Requirements

Prepare hardware for setup

Setting up a Kowabunga platform requires you to provide the following hardware:

  • 1x Kahuna instance (more can be used if high-availability is expected).
  • 1x Kiwi instance per-region (2x recommended for production-grade)
  • 1x Kaktus instance per-region (a minimum of 3x recommended for production-grade, can scale to N).

Kahuna Instance

Kahuna is the only instance that will be exposed to end users. It is recommended to have it exposed on the public Internet, making it easier for DevOps and users to access, but there’s no strong requirement for that. It is perfectly possible to keep it local to your private corporate network, only accessible from the on-premises network or through VPN.

Hardware requirements are lightweight:

  • 2 vCPU cores
  • 4 to 8 GB RAM
  • 64 GB for OS + MongoDB database.

Disk and network performance is fairly insignificant here, anything modern will do just fine.

We personally use and recommend small VPS-like public Cloud instances. They come with a public IPv4 address and all that one needs for a monthly price of only $5 to $20.

Kiwi Instance

Kiwi will act as a software network router and gateway. Even more than for Kahuna, you don’t need much horsepower here. If you plan on setting up your own home lab, a small 2 GB RAM Raspberry Pi would be sufficient (keep in mind that SOHO routers and gateways are even more lightweight than that).

If you intend to use it for enterprise-grade purposes, just pick the lowest-end server you can find.

It’s probably going to come bundled with a 4-core CPU, 8 GB of RAM and whatever SSD and, in any case, it will be more than enough, unless you really intend to handle 1000+ computing nodes with multi-Gbps traffic.

Kaktus Instance

Kaktus instances are another story. If there’s one place you need to put your money, this is it. The instance will handle as many virtual machines as it can and be part of the distributed Ceph storage cluster.

Sizing depends on your expected workload; there’s no accurate rule of thumb for that. You’ll need to plan capacity ahead. How many vCPUs do you expect to run in total ? How many GBs of RAM ? How much disk ? What overcommit ratio do you expect to set ? How much data replication (and so … resilience) do you expect ?

These are all good questions to be asked. Note that you can easily start low with only a few Kaktus instances and scale up later on, as you grow. The various Kaktus instances from your fleet may also be heterogeneous (to some extent).
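As a rough way to turn the capacity-planning questions above into numbers, here is a minimal Python sketch. It is only an illustration: the function name, the workload figures and the overcommit ratio are all made-up assumptions, and RAM is deliberately not overcommitted.

```python
import math


def hosts_needed(total_vcpus, total_ram_gb, threads_per_host,
                 ram_gb_per_host, cpu_overcommit=1.0):
    """Estimate how many hosts cover the requested vCPUs and RAM.

    cpu_overcommit is the number of vCPUs scheduled per physical thread.
    RAM is not overcommitted in this sketch (an assumption, not a rule).
    """
    by_cpu = math.ceil(total_vcpus / (threads_per_host * cpu_overcommit))
    by_ram = math.ceil(total_ram_gb / ram_gb_per_host)
    # The scarcest resource drives the number of hosts.
    return max(by_cpu, by_ram)


# Hypothetical workload: 400 vCPUs and 1 TB of RAM on 64-thread, 512 GB
# hosts, with a 3:1 CPU overcommit ratio.
print(hosts_needed(400, 1024, threads_per_host=64,
                   ram_gb_per_host=512, cpu_overcommit=3.0))
```

The result is a floor: it ignores the headroom needed for maintenance and failures, which is discussed next.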

As a rule of thumb, unless you’re setting up a sandbox or home lab, a minimum of 3 Kaktus instances is recommended. This allows you to move workload from one to another, or simply put one in maintenance mode (i.e. shut down its workload) while maintaining business continuity.

Supposing you have X Kaktus instances and expect up to Y to be down at a given time, the following applies:

Instance Maximum Workload: (X - Y) / X × 100 %

Said differently, with only 3 machines and 1 allowed to be down, don’t go above 66% average load or you won’t be able to put one in maintenance without tearing down applications.

Consequently, with availability in mind, it is better to have many lightweight instances than a few heavy ones.
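The maximum-workload formula above is easy to check for a few fleet sizes. The following Python sketch simply evaluates (X - Y) / X as a percentage; the function name is illustrative.

```python
def max_average_load(total_instances, tolerated_down):
    """Maximum average load (in %) that still lets the remaining
    instances absorb the workload of the ones that are down."""
    if not 0 <= tolerated_down < total_instances:
        raise ValueError("tolerated_down must be in [0, total_instances)")
    return (total_instances - tolerated_down) / total_instances * 100


# With one instance allowed to be down, headroom shrinks as the fleet grows.
for fleet_size in (3, 4, 6, 8):
    print(fleet_size, round(max_average_load(fleet_size, 1), 1))
```

With 3 instances the ceiling is about 66%, while 8 instances let you run at 87.5%, which is why many smaller instances waste less capacity than a few large ones.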

The same applies (even more so) to the Ceph storage cluster. Each instance’s local disks will be part of the Ceph cluster (each one a Ceph OSD, to be accurate) and data will be spread across those within the same region.

Now, let’s consider you want to achieve 128 TB of usable disk space. First, you need to define your replication ratio (i.e. how many times object storage fragments will be replicated across disks). We recommend a minimum of 2, and 3 for production-grade workloads. With a ratio of 3, you’ll actually need a total of 3 × 128 = 384 TB of physical disks.

Here are different options to achieve it:

  • 1 server with 24x 16TB SSDs
  • 3 servers with 8x 16TB SSDs each
  • 3 servers with 16x 8TB SSDs each
  • 8 servers with 6x 8TB SSDs each
  • […]

From a pure resilience perspective, the last option would be the best. It spreads data across the most machines, meaning that if anything happens to one of them, the smallest fraction of the cluster’s data is affected. The loss may only be temporary (the time it takes to bring the server or disk back up). But while it is down, Ceph will try to re-copy data from duplicated fragments to other disks, inducing major private network bandwidth usage. Whether you have 8 TB or 128 TB of data to recover makes a very different impact.
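The sizing arithmetic above can be reproduced with a short Python sketch. It only multiplies the figures already given (128 TB usable, replication ratio of 3, and the four server layouts); the function names are illustrative and real Ceph clusters also need free-space headroom, which is ignored here.

```python
def raw_capacity_tb(usable_tb, replication):
    """Physical capacity needed to store usable_tb with N-way replication."""
    return usable_tb * replication


def describe_layout(servers, disks_per_server, disk_tb):
    """Return (total raw TB, % of raw capacity held by a single server)."""
    total_tb = servers * disks_per_server * disk_tb
    share_per_server = 100 / servers
    return total_tb, share_per_server


# (servers, disks per server, disk size in TB), as listed above.
LAYOUTS = [(1, 24, 16), (3, 8, 16), (3, 16, 8), (8, 6, 8)]

print("raw TB needed:", raw_capacity_tb(128, 3))
for layout in LAYOUTS:
    total_tb, share = describe_layout(*layout)
    print(layout, "->", total_tb, "TB raw,", round(share, 1), "% per server")
```

Every layout reaches the same raw capacity; what changes is how much of the cluster goes offline, and has to be re-replicated, when a single server fails.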

Also, as your virtual machines’ performance will be tightly bound to the underlying network storage, it is vital (at least for production-grade workloads) to use NVMe SSDs with 10 to 25 Gbps network controllers and sub-millisecond latency between your private region servers.

So let’s recap …

Typical Kaktus instances for home labs or sandbox environments would look like:

  • 4-core (8-thread) CPU.
  • 16 GB RAM.
  • 2x 1TB SATA or NVMe SSDs (shared between OS partition and Ceph ones)
  • 1 Gbps NIC

While Kaktus instances for production-grade workload could easily look like:

  • 32- to 128-core CPUs.
  • 128 GB to 1.5 TB RAM.
  • 2x 256 GB SATA RAID-1 SSDs for OS.
  • 6 to 12x 2-8 TB NVMe SSDs for Ceph.
  • 10 to 25 Gbps NICs with link aggregation.

3.2 - Software Requirements

Get your toolchain ready

Kowabunga’s deployment philosophy relies on IaC (Infrastructure-as-Code) and CasC (Configuration-as-Code). We heavily rely on OpenTofu for the former and Ansible for the latter.

Kobra Toolchain

While natively compatible with the aforementioned tools, we recommend using Kowabunga Kobra as a toolchain overlay.

Kobra is a DevOps deployment Swiss Army knife. It provides a convenient wrapper over OpenTofu, Ansible and Helmfile with proper secrets management, removing the hassle of complex deployment strategies.

Anything can be done without Kobra, but it makes things simpler, not having to care about the gory details.

Kobra supports various secret management providers. Please choose the one that fits your expected collaborative work experience.

At runtime, it’ll also make sure your OpenTofu / Ansible toolchain is properly set up on your computer, and will install it otherwise (i.e. brainless setup).

Installation can be easily performed on various targets:

Installation on Ubuntu Linux

Register the Kowabunga APT repository and then simply:

$ sudo apt-get install kobra

Installation on macOS

macOS can install Kobra through Homebrew. Simply do:

$ brew tap kowabunga/cloud https://github.com/kowabunga-cloud/homebrew-tap.git
$ brew update
$ brew install kobra

Manual Installation

Kobra can be manually installed through released binaries.

Just download and extract the tarball for your target.

Setup Git Repository

Kowabunga comes with a ready-to-consume platform template. One can clone it from Git through:

$ git clone https://github.com/kowabunga-cloud/platform-template.git

or better, fork it in your own account, as a bootstrapping template repository.

Secrets Management

Passwords, API keys, tokens … they are all sensitive and meant to be secrets. You don’t want any of those to leak on a public Git repository. Kobra relies on SOPS to ensure all secrets are located in an encrypted file (which is safe to be Git hosted), which can be encrypted/decrypted on the fly thanks to a master key.

Kobra supports various key providers:

  • aws: AWS Secrets Manager
  • env: Environment variable stored master-key
  • file: local plain text master-key file (not recommended for production)
  • hcp: HashiCorp Vault
  • input: interactive command-line input prompt for master-key
  • keyring: local OS keyring (macOS Keychain, Windows Credentials Manager, Linux Gnome Keyring/KWallet)

If you’re building a large production-grade system, with multiple contributors and admins, using a shared key management system like aws or hcp is probably welcome.

If you’re a single contributor or in a very small team, storing your master encryption key in your local keyring will do just fine.

Simply edit your kobra.yml file in the following section:

secrets:
  provider: string                    # aws, env, file, hcp, input, keyring
  aws:                                # optional, aws-provider specific
    region: string
    role_arn: string
    id: string
  env:                                # optional, env-provider specific
    var: string                       # optional, defaults to KOBRA_MASTER_KEY
  file:                               # optional, file-provider specific
    path: string
  hcp:                                # optional, hcp-provider specific
    endpoint: string                  # optional, defaults to "http://127.0.0.1:8200" if unspecified
  master_key_id: string

As an example, managing platform’s master key through your system’s keyring is as simple as:

secrets:
  provider: keyring
  master_key_id: my-kowabunga-labs

As a one-time thing, let’s init our new SOPS key pair.

$ kobra secrets init
[INFO 00001] Issuing new private/public master key ...
[INFO 00002] New SOPS private/public key pair has been successuflly generated and stored

Ansible

The official Kowabunga Ansible Collection and its associated documentation will seamlessly integrate with SOPS for secrets management.

Thanks to that, any file from your inventory’s host_vars or group_vars directories with the .sops.yml suffix will automatically be included when running playbooks. It is then absolutely safe for you to use these encrypted-at-rest files to store your most sensitive variables.

Creating such files and/or editing these to add extra variables is then as easy as:

$ kobra secrets edit ansible/inventories/group_vars/all.sops.yml

Kobra will automatically decrypt the file on the fly, open the editor of your choice (as stated in your $EDITOR env var), and re-encrypt it with the master key at save/exit.

That’s it, you’ll never have to worry about secrets management and encryption any longer !

OpenTofu

The very same applies to OpenTofu, where the SOPS master key is used to encrypt the most sensitive data. Anything sensitive you’d need to add to your TF configuration can be set in the terraform/secrets.yml file as a simple key/value pair.

$ kobra secrets edit terraform/secrets.yml

Note however that each secret must also be manually declared in the HCL-formatted terraform/secrets.tf file, e.g.:

locals {
  secrets = {
    my_service_api_token = data.sops_file.secrets.data.my_service_api_token
  }
}

supposing that you have an encrypted my_service_api_token: ABCD…Z entry in your terraform/secrets.yml file.

Note that OpenTofu adds a very strong feature over plain old Terraform: TF state file encryption. Where the TF state file is located (locally, i.e. in Git, or remotely, in S3 or alike) is up to you, but should you use a Git-hosted one, we strongly advise having it encrypted.

You can achieve this easily by extending the terraform/providers.tf file in your platform’s repository:

terraform {
  encryption {
    key_provider "pbkdf2" "passphrase" {
      passphrase = var.passphrase
    }

    method "aes_gcm" "sops" {
      keys = key_provider.pbkdf2.passphrase
    }
    state {
      method = method.aes_gcm.sops
    }

    plan {
      method = method.aes_gcm.sops
    }
  }
}

variable "passphrase" {
  # Value to be defined in your local passphrase.auto.tfvars file.
  # Content to be retrieved from deciphered secrets.yml file.
  sensitive = true
}

Then, create a local terraform/passphrase.auto.tfvars file with the secret of your choice:

passphrase = "ABCD...Z"

3.3 - Network Topology

Our Tutorial network topology

Let’s use this sample network topology for the rest of this tutorial:

Network Topology

We’ll start with a single Kahuna instance, with public Internet exposure. The instance’s hostname will be kowabunga-kahuna-1 and it has 2 network adapters and associated IP addresses:

  • a private one, 10.0.0.1, in case we later need to peer with other instances for high-availability.
  • a public one, 1.2.3.4, exposed as kowabunga.acme.com for WebUI, REST API calls to the orchestrator and the WebSocket agents endpoint. It’ll also be exposed as grafana.acme.com, logs.acme.com and metrics.acme.com for Kiwi and Kaktus to push logs and metrics and allow for service monitoring.

Next is the main (and only) region, EU-WEST and its single zone, EU-WEST-A. The region/zone will feature 2 Kiwi instances and 3 Kaktus ones.

All instances will be connected under the same L2 network layer (as defined in requirements) and we’ll use different VLANs and associated network subnets to isolate content:

  • VLAN101 will be used as the default administration VLAN, with the associated 10.50.101.0/24 subnet. All Kiwi and Kaktus instances will be part of it.
  • VLAN102 will be used for the Ceph backplane, with the associated 10.50.102.0/24 subnet. While not mandatory, this separates the administrative control plane traffic from pure storage cluster data synchronization, allowing for better traffic shaping and monitoring, if need be. Note that on enterprise-grade production systems, the Ceph project recommends using a dedicated NIC for Ceph traffic, so isolation here makes sense.
  • VLAN201 to VLAN209 will be application VLANs. Kiwi, being the region’s router, will bind them, but Kaktus instances won’t. Instantiated VMs will, however, through bridged network adapters.
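Before wiring anything, the addressing plan above can be sanity-checked with Python’s standard ipaddress module. This is only a sketch: the constant and function names are illustrative, and no subnets are assumed for the application VLANs since the tutorial does not assign any.

```python
import ipaddress

# Subnets taken from the tutorial topology.
ADMIN_NET = ipaddress.ip_network("10.50.101.0/24")  # VLAN101, administration
CEPH_NET = ipaddress.ip_network("10.50.102.0/24")   # VLAN102, Ceph traffic
APP_VLANS = range(201, 210)                         # VLAN201 to VLAN209


def usable_hosts(network):
    """Addresses left once network and broadcast addresses are excluded."""
    return network.num_addresses - 2


def vlan_for(address):
    """Return the infrastructure VLAN ID an address belongs to, if any."""
    ip = ipaddress.ip_address(address)
    if ip in ADMIN_NET:
        return 101
    if ip in CEPH_NET:
        return 102
    return None


# The two infrastructure subnets must never overlap.
assert not ADMIN_NET.overlaps(CEPH_NET)
print(usable_hosts(ADMIN_NET), vlan_for("10.50.102.17"), list(APP_VLANS))
```

A /24 per VLAN leaves 254 usable addresses, far more than the 2 Kiwi and 3 Kaktus instances of this tutorial need.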

3.4 - Setup Kahuna

Let’s start with the orchestration core

Now let’s suppose that you’ve cloned the Git platform repository template and that your Kahuna instance server has been provisioned with the latest Ubuntu LTS distribution. Make sure that it is SSH-accessible with some local user.

Let’s make the following assumptions for the rest of this tutorial:

  • We only have one single Kahuna instance (no high-availability).
  • Local bootstrap user with sudo privileges is ubuntu, with key-based SSH authentication.
  • Kahuna instance is public-Internet exposed through IP address 1.2.3.4, translated to kowabunga.acme.com DNS.
  • Kahuna instance is private-network exposed through IP address 10.0.0.1.
  • Kahuna instance hostname is kowabunga-kahuna-1.

Setup DNS

Please ensure that your kowabunga.acme.com domain translates to public IP address 1.2.3.4. Configuration is up to you and your DNS provider and can be done manually.

Being IaC supporters, we advise using OpenTofu for that purpose. Let’s see how to do it with the Cloudflare DNS provider.

Start by editing the terraform/providers.tf file in your platform’s repository:

terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 5"
    }
  }
}

provider "cloudflare" {
  api_token = local.secrets.cloudflare_api_token
}

extend the terraform/secrets.tf file with:

locals {
  secrets = {
    cloudflare_api_token = data.sops_file.secrets.data.cloudflare_api_token
  }
}

and add the associated:

cloudflare_api_token: MY_PREVIOUSLY_GENERATED_API_TOKEN

variable in terraform/secrets.yml file thanks to:

$ kobra secrets edit terraform/secrets.yml

Then, simply edit your terraform/main.tf file with the following:

resource "cloudflare_dns_record" "kowabunga" {
  zone_id = "ACME_COM_ZONE_ID"
  name    = "kowabunga"
  ttl     = 3600
  type    = "A"
  content = "1.2.3.4"
  proxied = false
}

initialize OpenTofu (once, or each time you add a new provider):

$ kobra tf init

and apply infrastructure changes:

$ kobra tf apply

Inventory Management

It is now time to declare your Kahuna instance in Ansible’s inventory. Simply extend the ansible/inventories/hosts.txt file the following way:

[kahuna]
10.0.0.1 name=kowabunga-kahuna-1 ansible_ssh_user=ubuntu

The instance is now declared to be part of the kahuna group and Ansible will use the ubuntu local user account to connect through SSH.

Note that doing so, you can now safely:

  • declare host-specific variables in ansible/inventories/host_vars/10.0.0.1.yml file.
  • declare host-specific sensitive variables in ansible/inventories/host_vars/10.0.0.1.sops.yml file.
  • declare kahuna group-specific variables in ansible/inventories/group_vars/kahuna/main.yml file.
  • declare kahuna group-specific sensitive variables in ansible/inventories/group_vars/kahuna.sops.yml file.
  • declare any other global variables in ansible/inventories/group_vars/all/main.yml file.
  • declare any other global sensitive variables in ansible/inventories/group_vars/all.sops.yml file.

Note that Ansible variables precedence will apply:

role defaults < all vars < group vars < host vars < role vars

Ansible Kowabunga Collection

Kowabunga comes with an official Ansible Collection and its associated documentation.

The collection contains:

  • roles and playbooks to easily deploy the various Kahuna, Koala, Kiwi and Kaktus instances.
  • actions so you can create your own tasks to interact with and manage a previously set up Kowabunga instance.

Check out ansible/requirements.yml file to declare the specific collection version you’d like to use:

---
collections:
  - name: kowabunga.cloud
    version: 0.0.1

By default, your platform is configured to pull a tagged official release from Ansible Galaxy. You may however prefer to pull it directly from Git, using the latest commit for instance. This can be accommodated through:

---
collections:
  - name: git@github.com:kowabunga-cloud/ansible-collections-kowabunga
    type: git
    version: master

Once defined, simply pull it into your local machine:

$ kobra ansible pull

Kahuna Settings

Kahuna instance deployment will take care of everything. It assumes a supported Ubuntu LTS release, enforces some configuration and security settings, installs the necessary packages, creates local admin user accounts if required, and sets up a firewall with a deny-all filtering policy, so you’re safely exposed.

Admin Accounts

Let’s start by declaring some admin user accounts we’d like to create. After all, we don’t want everyone to keep sharing the single ubuntu account.

Simply create/edit the ansible/inventories/group_vars/all/main.yml file the following way:

kowabunga_os_user_admin_accounts_enabled:
  - admin_user_1
  - admin_user_2

kowabunga_os_user_admin_accounts_pubkey_dirs:
  - "{{ playbook_dir }}/../../../../../files/pubkeys"

to declare all your expected admin users, and add their respective SSH public key files in the ansible/files/pubkeys directory, e.g.:

$ tree ansible/files/pubkeys/
ansible/files/pubkeys/
├── admin_user_1
└── admin_user_2

We’d also recommend that you set or update the root account password. By default, Ubuntu comes without one, making it impossible to log in as root. Kowabunga’s playbook makes sure that root login over SSH is prohibited for security reasons (e.g. brute-force attacks), but we encourage you to set a password anyway: it is always useful for console/IPMI access, especially on public cloud VPS or bare metal servers.

If you intend to do so, simply edit the secrets file:

$ kobra secrets edit ansible/inventories/group_vars/all.sops.yml

and set the requested password:

secret_kowabunga_os_user_root_password: MY_SUPER_STRONG_PASSWORD

Firewall

If your Kahuna instance is connected to the public Internet, enabling a network firewall is more than recommended. This can be easily done by extending the ansible/inventories/group_vars/kahuna/main.yml file with:

kowabunga_firewall_enabled: true
kowabunga_firewall_open_tcp_ports:
  - 22
  - 80
  - 443

Note that we’re limiting open ports to SSH and HTTP/HTTPS here, which should be more than enough (HTTP is only used by the Caddy server for certificate auto-renewal and will redirect traffic to HTTPS anyway). If you don’t expect your instance to be SSH-accessible from the public Internet, you can safely drop port 22 from the list.

MongoDB

Kahuna comes with a bundled, ready-to-be-used MongoDB deployment. This comes in handy if you only have a single instance to manage. It remains optional (default), however, as you may well want to reuse an existing, already deployed, production-grade external MongoDB cluster.

If you intend to go with the bundled one, a few settings must be configured in ansible/inventories/group_vars/kahuna/main.yml file:

kowabunga_mongodb_enabled: true
kowabunga_mongodb_listen_addr: "127.0.0.1,10.0.0.1"
kowabunga_mongodb_rs_key: "{{ secret_kowabunga_mongodb_rs_key }}"
kowabunga_mongodb_rs_name: kowabunga
kowabunga_mongodb_admin_password: "{{ secret_kowabunga_mongodb_admin_password }}"
kowabunga_mongodb_users:
  - base: kowabunga
    username: kowabunga
    password: '{{ secret_kowabunga_mongodb_user_password }}'
    readWrite: true

and their associated secrets in ansible/inventories/group_vars/kahuna.sops.yml

secret_kowabunga_mongodb_rs_key: YOUR_CUSTOM_REPLICA_SET_KEY
secret_kowabunga_mongodb_admin_password: A_STRONG_ADMIN_PASSWORD
secret_kowabunga_mongodb_user_password: A_STRONG_USER_PASSWORD

This will basically instruct Ansible to install MongoDB server, configure it with a replica set (so it can be part of a future cluster, we never know), secure it with admin credentials of your choice and create a kowabunga database/collection with its associated service user.

Kahuna Settings

Finally, let’s ensure the Kahuna orchestrator gets everything it needs to operate.

You’ll need to define:

  • a custom email address (and associated SMTP connection settings) for Kahuna to be able to send email notifications to users.
  • a randomly generated key to sign JWT tokens (please ensure it is long and random enough, so the robustness of issued tokens is not compromised).
  • a randomly generated admin API key. It’ll be used to provision the admin bits of Kowabunga, until proper user accounts have been created.
  • a private/public SSH key-pair to be used by platform admins to seamlessly SSH into instantiated Kompute instances. Please ensure that the private key is being stored securely somewhere.
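
If you don’t have a preferred way to generate these values, here is a minimal sketch, assuming a Linux host providing /dev/urandom. The generate_secret helper and the kowabunga_admin key file name are purely illustrative (they are not part of Kowabunga), and we assume here that Kahuna accepts arbitrary strings for the JWT signature and the admin API key.

```shell
#!/bin/sh
# Hypothetical helper: print NBYTES (default 32) random bytes, read from the
# kernel random source, as a lowercase hexadecimal string.
generate_secret() {
  nbytes=${1:-32}
  # od prints the bytes as space-separated hex pairs; tr strips the
  # spaces and newlines so a single token remains.
  head -c "$nbytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Usage (values to paste into the SOPS-encrypted secrets file):
#   generate_secret    # e.g. for secret_kowabunga_kahuna_jwt_signature
#   generate_secret    # e.g. for secret_kowabunga_kahuna_api_key
#
# SSH key-pair for platform admins (the file name is an assumption):
#   ssh-keygen -t ed25519 -f ./kowabunga_admin -C "kowabunga-admin"
```

The content of the resulting kowabunga_admin.pub file is what would go into the bootstrap public key setting, while the private key should be kept in a secure place.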

Then simply edit the ansible/inventories/group_vars/kahuna/main.yml file the following way:

kowabunga_public_url: "https://kowabunga.acme.com"

kowabunga_kahuna_http_address: "10.0.0.1"
kowabunga_kahuna_admin_email: kowabunga@acme.com
kowabunga_kahuna_jwt_signature: "{{ secret_kowabunga_kahuna_jwt_signature }}"
kowabunga_kahuna_db_uri: "mongodb://kowabunga:{{ secret_kowabunga_mongodb_user_password }}@10.0.0.1:{{ mongodb_port }}/kowabunga?authSource=kowabunga"
kowabunga_kahuna_api_key: "{{ secret_kowabunga_kahuna_api_key }}"

kowabunga_kahuna_bootstrap_user: kowabunga
kowabunga_kahuna_bootstrap_pubkey: "YOUR_ADMIN_SSH_PUB_KEY"

kowabunga_kahuna_smtp_host: "smtp.acme.com"
kowabunga_kahuna_smtp_port: 587
kowabunga_kahuna_smtp_from: "Kowabunga <{{ kowabunga_kahuna_admin_email }}>"
kowabunga_kahuna_smtp_username: johndoe
kowabunga_kahuna_smtp_password: "{{ secret_kowabunga_kahuna_smtp_password }}"

and add the respective secrets into ansible/inventories/group_vars/kahuna.sops.yml:

secret_kowabunga_kahuna_jwt_signature: A_STRONG_JWT_SIGNATURE
secret_kowabunga_kahuna_api_key: A_STRONG_API_KEY
secret_kowabunga_kahuna_smtp_password: A_STRONG_PASSWORD

Ansible Deployment

We’re done with configuration (finally) ! All we need to do now is run Ansible to make things live. This is done by invoking the kahuna playbook from the kowabunga.cloud collection:

$ kobra ansible deploy -p kowabunga.cloud.kahuna

Note that, under the hood, Ansible will use the Mitogen extension to speed things up. Bear in mind that an Ansible run is idempotent: anything that fails can be re-executed. You can also run it as many times as you want, or re-run it in 6 months or so; provided you’re using a tagged collection, the end result will always be the same.

After a few minutes, if everything went okay, you should have a working Kahuna instance, i.e.:

  • A Caddy front-end reverse-proxy, taking care of automatic TLS certificate issuance, renewal and traffic termination, forwarding requests to either the Koala Web application or the Kahuna backend server.
  • The Kahuna backend server itself, our core orchestrator.
  • Optionally, MongoDB database.

We’re now ready for provisioning users and teams !

3.5 - Provisioning Users

Let’s populate admin users and teams

Your Kahuna instance is now up and running, so let’s get things going and create a few admin user accounts. At first, we only have the super-admin API key that was previously set through Ansible deployment. We’ll make use of it to provision further users and associated teams. After all, we want a nominative user account for each contributor, right ?

Back to TF config, let’s edit the terraform/providers.tf file:

terraform {
  required_providers {
    kowabunga = {
      source  = "kowabunga-cloud/kowabunga"
      version = ">=0.55.0"
    }
  }
}

provider "kowabunga" {
  uri   = "https://kowabunga.acme.com"
  token = local.secrets.kowabunga_admin_api_key
}

Make sure to edit the Kowabunga provider’s uri with the DNS name of your freshly deployed Kahuna instance, and edit the terraform/secrets.yml file to set kowabunga_admin_api_key to the admin API key you picked before (exposing it through the secrets local in terraform/secrets.tf, as was done for the Cloudflare token). OpenTofu will use these parameters to connect to your private Kahuna and apply resources.

Now declare a few users in your terraform/locals.tf file:

locals {
  admins = {
    // HUMANS
    "John Doe" = {
      email  = "john@acme.com",
      role   = "superAdmin",
      notify = true,
    }
    "Jane Doe" = {
      email  = "jane@acme.com",
      role   = "superAdmin",
      notify = true,
    }

    // BOTS
    "Admin TF Bot" = {
      email = "tf@acme.com",
      role  = "superAdmin",
      bot   = true,
    }
  }
}

and the following resources definition in terraform/main.tf:

resource "kowabunga_user" "admins" {
  for_each      = local.admins
  name          = each.key
  email         = each.value.email
  role          = each.value.role
  notifications = try(each.value.notify, false)
  bot           = try(each.value.bot, false)
}

resource "kowabunga_team" "admin" {
  name  = "admin"
  desc  = "Kowabunga Admins"
  users = sort([for key, user in local.admins : kowabunga_user.admins[key].id])
}

Then, simply apply for resources creation:

$ kobra tf apply

What we’ve done here is register a new admin team, with 3 new associated user accounts: 2 regular ones for human administrators and one bot, whose API key you’ll be able to use instead of the super-admin master one to further provision resources if you’d like.

This is the better way to go: should the bot’s key be compromised, you’ll only have to revoke it or destroy the bot account, instead of replacing the master key on the Kahuna instance.

Each newly registered user will receive 2 emails from Kahuna:

  • a “Welcome to Kowabunga !” one, simply asking them to confirm their account’s creation.
  • a “Forgot about your Kowabunga password ?” one, prompting for a password reset.

Once users have been registered and their passwords generated, and provided the Koala Web application has been deployed as well, they can connect to it (and land on a perfectly empty, and so useless, dashboard ;-) for now at least).

Let’s move on and start creating our first region !

3.6 - Create Your First Region

Let’s setup a new region and its Kiwi and Kaktus instances

Orchestrator being ready, we can now bootstrap our first region.

Let’s take the following assumptions for the rest of this tutorial:

  • The Kowabunga region is to be called eu-west.
  • The region will have a single zone named eu-west-a.
  • It’ll feature 2 Kiwi and 3 Kaktus instances.

Back on the TF configuration, let’s use the following:

locals {
  eu-west = {
    desc = "Europe West"

    zones = {
      "eu-west-a" = {
        id = "A"
      }
    }
  }
}

resource "kowabunga_region" "eu-west" {
  name = "eu-west"
  desc = local.eu-west.desc
}

resource "kowabunga_zone" "eu-west" {
  for_each = local.eu-west.zones
  region   = kowabunga_region.eu-west.id
  name     = each.key
  desc     = "${local.eu-west.desc} - Zone ${each.value.id}"
}

And apply:

$ kobra tf apply

Nothing really complex here to be fair, we’re just using Kahuna’s API to register the region and its associated zone.

Now, we’ll register the 2 Kiwi instances and 3 Kaktus ones. Please note that:

  • we’ll extend the TF locals definition for that.
  • Kiwi is to be associated to the global region.
  • while Kaktus is to be associated to the region’s zone.

Let’s start by registering one Kiwi and 2 associated agents:

locals {
  eu-west = {

    agents = {
      "kiwi-eu-west-1" = {
        desc = "Kiwi EU-WEST-1 Agent"
        type = "Kiwi"
      }
      "kiwi-eu-west-2" = {
        desc = "Kiwi EU-WEST-2 Agent"
        type = "Kiwi"
      }
    }

    kiwi = {
      "kiwi-eu-west" = {
        desc   = "Kiwi EU-WEST",
        agents = ["kiwi-eu-west-1", "kiwi-eu-west-2"]
      }
    }
  }
}

resource "kowabunga_agent" "eu-west" {
  for_each = merge(local.eu-west.agents)
  name     = each.key
  desc     = "${local.eu-west.desc} - ${each.value.desc}"
  type     = each.value.type
}

resource "kowabunga_kiwi" "eu-west" {
  for_each = local.eu-west.kiwi
  region   = kowabunga_region.eu-west.id
  name     = each.key
  desc     = "${local.eu-west.desc} - ${each.value.desc}"
  agents   = [for agent in try(each.value.agents, []) : kowabunga_agent.eu-west[agent].id]
}

Let’s continue with the 3 Kaktus instances declaration and their associated agents. Note that, this time, instances are associated to the zone itself, not the region.

locals {
  currency           = "EUR"
  cpu_overcommit     = 3
  memory_overcommit  = 2

  eu-west = {
    zones = {
      "eu-west-a" = {
        id = "A"

        agents = {
          "kaktus-eu-west-a-1" = {
            desc = "Kaktus EU-WEST A-1 Agent"
            type = "Kaktus"
          }
          "kaktus-eu-west-a-2" = {
            desc = "Kaktus EU-WEST A-2 Agent"
            type = "Kaktus"
          }
          "kaktus-eu-west-a-3" = {
            desc = "Kaktus EU-WEST A-3 Agent"
            type = "Kaktus"
          }
        }

        kaktuses = {
          "kaktus-eu-west-a-1" = {
            desc        = "Kaktus EU-WEST A-1",
            cpu_cost    = 500
            memory_cost = 200
            agents      = ["kaktus-eu-west-a-1"]
          }
          "kaktus-eu-west-a-2" = {
            desc        = "Kaktus EU-WEST A-2",
            cpu_cost    = 500
            memory_cost = 200
            agents      = ["kaktus-eu-west-a-2"]
          }
          "kaktus-eu-west-a-3" = {
            desc        = "Kaktus EU-WEST A-3",
            cpu_cost    = 500
            memory_cost = 200
            agents      = ["kaktus-eu-west-a-3"]
          }
        }
      }
    }
  }
}

resource "kowabunga_agent" "eu-west-a" {
  for_each = merge(local.eu-west.zones.eu-west-a.agents)
  name     = each.key
  desc     = "${local.eu-west.desc} - ${each.value.desc}"
  type     = each.value.type
}

resource "kowabunga_kaktus" "eu-west-a" {
  for_each          = local.eu-west.zones.eu-west-a.kaktuses
  zone              = kowabunga_zone.eu-west["eu-west-a"].id
  name              = each.key
  desc              = "${local.eu-west.desc} - ${each.value.desc}"
  cpu_price         = each.value.cpu_cost
  memory_price      = each.value.memory_cost
  currency          = local.currency
  cpu_overcommit    = try(each.value.cpu_overcommit, local.cpu_overcommit)
  memory_overcommit = try(each.value.memory_overcommit, local.memory_overcommit)
  agents            = [for agent in try(each.value.agents, []) : kowabunga_agent.eu-west-a[agent].id]
}

And again, apply:

$ kobra tf apply

That done, Kiwi and Kaktus instances have been registered and, more importantly, so have their associated agents. For each newly created agent, you should have received an email (check the admin address you previously set in Kahuna’s configuration). Keep track of these emails: they contain one-time credentials, namely the agent identifier and its associated API key.

This is the super secret thing that will later allow agents to establish a secure connection to the Kahuna orchestrator. We’re soon going to declare these credentials in Ansible’s secrets so Kiwi and Kaktus instances can be provisioned accordingly.

Let’s continue and provision our region’s Kiwi instances !

3.7 - Provisioning Kiwi

Let’s provision our Kiwi instances

4 - Services

Discover Kowabunga pre-baked services

Kowabunga provides more than just raw infrastructure resources access. It features various “ready-to-be-consumed” -as-a-service extensions to easily bring life to your various application and automation deployment needs.

4.1 - Kaddie

Kowabunga Private Key Infrastructure

This service is still work-in-progress

4.2 - Kalipso

Kowabunga Application Load-Balancer

This service is still work-in-progress

4.3 - Karamail

Kowabunga SMTP Server

This service is still work-in-progress

4.4 - Kawaii

Kowabunga Internet Gateway

Kawaii is your project’s private Internet Gateway, with complete ingress/egress control. It stands for Kowabunga Adaptive WAn Intelligent Interface (if you have better ideas, we’re all ears ;-) ).

It is the network gateway to your private network. All Kompute (and other services) instances always use Kawaii as their default gateway, relaying all traffic.

Kawaii itself relies on the underlying region’s Kiwi SD-WAN nodes to provide access to both public networks (i.e. Internet) and possibly other projects’ private subnets (when requested).

Kawaii is always the first service to be created (more exactly, other instances’ cloud-init boot sequences will likely wait until they reach proper network connectivity, which Kawaii provides). Being critical for your project’s resilience, Kawaii uses Kowabunga’s concept of Multi-Zone Resources (MZR) to ensure that, when the requested region features multiple availability zones, a project’s Kawaii instance gets created in each zone.

Using multiple floating virtual IP (VIP) addresses with per-zone affinity, this guarantees that all instantiated services will always be able to reach their associated network router. As much as possible, using weighted routes, service instances will target their zone-local Kawaii instance, the best pick for latency. In the unfortunate event of a local zone failure, network traffic will automatically get routed to another zone’s Kawaii (with an affordable extra millisecond penalty).

While obviously providing egress capability to all of the project’s instances, Kawaii can also be used as an ingress controller, exposed to the public Internet through a dedicated IPv4 address. Associated with a Konvey or Kalipso load-balancer, it makes it simple to expose your application publicly, as one would do with a Cloud provider.

Kowabunga’s API allows for complete control of the ingress/egress capability with built-in firewalling stack (deny-all filtering policy, with explicit port opening) as well as peering capabilities.

This allows you to inter-connect your project’s private network with:

  • VPC peering with other Kowabunga-hosted projects from the same region (network translation and routing being performed by underlying Kiwi instances).
  • IPSEC peering with non-Kowabunga managed projects and network, from any provider.

Note that thanks to Kowabunga’s internal network architecture and on-premises network backbone, inter-zones traffic is a free-of-charge possibility ;-) There’s no reason not to spread your resources on as many zones as can be, you won’t ever see any end-of-the-month surprise charge.

4.5 - Knox

Kowabunga Vault Service

This service is still work-in-progress

4.6 - Kompute

Kowabunga Virtual Machine instance

Kowabunga Kompute is the incarnation of a virtual machine instance.

Associated with underlying distributed block storage, it provides everything one needs to run any generic kind of application workload.

Kompute instance can be created (and further edited) with complete granularity:

  • number of virtual CPU cores.
  • amount of virtual memory.
  • one OS disk and any number of extra data disks.
  • optional public (i.e. Internet) direct exposure.

Compared to major Cloud providers, who only provide pre-defined machine flavors (with X vCPUs and Y GB of RAM), you’re free to size machines to your exact needs.

Kompute instances are created and bound to a specific region and zone, where they’ll remain. Kahuna orchestration will make sure to instantiate the requested machine on the best Kaktus hypervisor (at the time), but thanks to underlying distributed storage, it can easily migrate to any other Kaktus instance from the specified zone, for failover or balancing.

Kompute’s OS disk image is cloned from one of the various OS templates you’ll have provided Kowabunga with and, thanks to thin-provisioning and underlying copy-on-write mechanisms, no disk space is reserved upfront. Feel free to allocate 500 GB of disk, it’ll never get consumed until you actually store data on it !

Like any other service, Kompute instances are bound to a specific project, and consequently to its associated subnet, sealing them from other projects’ reach. Private and public interface IP addresses are automatically assigned by Kahuna, as defined by the administrator, making instances ready to be consumed by end users.

4.7 - Konvey

Kowabunga Network Load-Balancer

Konvey is a plain simple network Layer-4 (UDP/TCP) load-balancer.

Its only goal is to accept remote traffic and forward it to one of the many application backends, through a round-robin algorithm (with health check support).

Konvey can either be used to:

  • load-balance traffic from private network to private network
  • load-balance traffic from public network (i.e. Internet) to private network, in association with Kawaii. In such a scenario, Kawaii holds the public IP address exposure, and routes public traffic to Konvey instances, through NAT settings.

As with Kawaii, Konvey uses Kowabunga’s concept of Multi-Zone Resources (MZR) to ensure that, when the requested region features multiple availability zones, a project’s Konvey instance gets created in each zone, making it highly resilient.

4.8 - Kosta

Kowabunga Object Storage Service

This service is still work-in-progress

4.9 - Kryo

Kowabunga Backup and Cold Storage

This service is still work-in-progress

4.10 - Kylo

Kowabunga Distributed Network File System

Kylo is Kowabunga’s incarnation of NFS. While all Kompute instances have their own local block-device storage disks, Kylo provides the capability to access a network storage, shared amongst virtual machines.

Kylo fully implements the NFSv4 protocol, making it easy for Linux instances (and even Windows) to mount it without any specific tools.

Under the hood, Kylo relies on an underlying CephFS volume, exposed by Kaktus nodes, making it natively distributed and resilient (i.e. there is no need to add HA on top).

5 - Troubleshooting

Always get a plan B …

Google’s Site Reliability Engineering book says so:

Hope is not a strategy; wish for the best, but prepare for the worst.

We’re working hard to make Kowabunga as resilient and fault-tolerant as possible but human nature will always prevail. There’s always going to be one point in time where your database will get corrupted, when you’ll face a major power-supply incident, when you’ll have to bring everything back from ashes, in a timely manner …

Take a deep breath, let’s see how we can help !

5.1 - Ceph

Troubleshooting Ceph storage

Kaktus HCI nodes rely on Ceph for underlying distributed storage.

Ceph provides both:

  • RBD block-device images for Kompute virtual instances
  • CephFS distributed file system for Kylo storage.

Ceph is awesome. Ceph is fault-tolerant. Ceph hashes your file objects into thousands of pieces, distributed and replicated over dozens if not hundreds of SSDs on countless machines. And yet, Ceph sometimes crashes or fails to recover (even though it has incredible self healing capabilities).

While Ceph perfectly survives the occasional node failure, try it with a complete network or power-supply outage in your region, and you’ll figure it out ;-)

So let’s see how we can restore a Ceph cluster.

Unable to start OSDs

If Ceph OSDs can’t be started, it is likely because of un-detected (and un-mounted) LVM partition.

A proper mount command should provide the following:

$ mount | grep /var/lib/ceph/osd
tmpfs on /var/lib/ceph/osd/ceph-0 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-2 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-1 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-3 type tmpfs (rw,relatime,inode64)

If not, that means that /var/lib/ceph/osd/ceph-X directories are empty and OSD can’t run.

Run the following command to re-scan all LVM partitions, remount and start OSDs.

$ sudo ceph-volume lvm activate --all

Check for mount output (and/or re-run command) until all target disks are mounted.
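
The check-and-retry step above can be sketched as a small shell helper. This is only an illustration: the count_osd_mounts name and the expected OSD count are hypothetical and must be adapted to your node.

```shell
#!/bin/sh
# Hypothetical helper: count the OSD data directories that appear as
# mount points in `mount` output read from stdin.
count_osd_mounts() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # helper usable under "set -e" while still printing 0.
  grep -c ' on /var/lib/ceph/osd/ceph-[0-9][0-9]* ' || true
}

# Usage on a Kaktus node (the expected count is an assumption):
#   expected=4
#   for attempt in 1 2 3 4 5; do
#     [ "$(mount | count_osd_mounts)" -ge "$expected" ] && break
#     sudo ceph-volume lvm activate --all
#     sleep 5
#   done
```

Bounding the number of attempts avoids looping forever when a disk is genuinely missing, in which case the LVM volumes themselves need to be inspected.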

Fix damaged filesystem and PGs

In case of health error and damaged filesystem/PGs, one can easily fix those:

$ ceph status

  cluster:
    id:     be45512f-8002-438a-bf12-6cbc52e317ff
    health: HEALTH_ERR
            25934 scrub errors
            Possible data damage: 7 pgs inconsistent

Isolate the damaged PGs:

$ ceph health detail
HEALTH_ERR 25934 scrub errors; Possible data damage: 7 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 25934 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 7 pgs inconsistent
    pg 2.16 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,11]
    pg 5.20 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,4]
    pg 5.26 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,3]
    pg 5.47 is active+clean+scrubbing+deep+inconsistent+repair, acting [2,9]
    pg 5.62 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,1]
    pg 5.70 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,2]
    pg 5.7f is active+clean+scrubbing+deep+inconsistent+repair, acting [5,3]

Proceed with PG repair (iterate on all inconsistent PGs):

$ ceph pg repair 2.16

and wait until everything’s fixed.

$ ceph status
  cluster:
    id:     be45512f-8002-438a-bf12-6cbc52e317ff
    health: HEALTH_OK
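
With several inconsistent PGs, the repair step can be scripted rather than typed one PG at a time. Here is a minimal sketch, assuming ceph health detail prints damaged PGs in the format shown above; list_inconsistent_pgs is a hypothetical helper, not a Ceph command.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph health detail` output from stdin and
# print the ID of every PG whose state contains "inconsistent", one per line.
list_inconsistent_pgs() {
  # Matching lines look like:
  #   pg 2.16 is active+clean+...+inconsistent+repair, acting [5,11]
  awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }'
}

# Usage on a live cluster:
#   ceph health detail | list_inconsistent_pgs | while read -r pg; do
#     ceph pg repair "$pg"
#   done
```

Repairs run asynchronously, so keep checking ceph status afterwards until the cluster reports HEALTH_OK.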

MDS daemon crashloop

If your Ceph MDS daemon (i.e. CephFS) is in a crashloop, probably because of corrupted journal, let’s see how we can proceed:

Get State

Check the global CephFS status, including the clients list, number of active MDS servers etc …

$ ceph fs status

Additionally, you can get a dump of all filesystems, to find out the MDS daemons’ status (laggy, replay …):

$ ceph fs dump

Prevent client connections

If you suspect the filesystem to be damaged, the first thing to do is to prevent any further corruption.

Start by stopping all CephFS clients, if under control.

For Kowabunga, that means stopping NFS Ganesha server on all Kaktus instances:

$ sudo systemctl stop nfs-ganesha

Prevent all client connections from server-side (i.e. Kaktus).

We consider that filesystem name is nfs:

$ ceph config set mds mds_deny_all_reconnect true
$ ceph config set mds mds_heartbeat_grace 3600
$ ceph fs set nfs max_mds 1
$ ceph fs set nfs refuse_client_session true
$ ceph fs set nfs down true

Stop server-side MDS instances on all Kaktus servers:

$ sudo systemctl stop ceph-mds@$(hostname)

Fix metadata journal

You may refer to Ceph Troubleshooting guide for more details on disaster recovery.

Start backing up journal:

$ cephfs-journal-tool --rank nfs:all journal export backup.bin

Inspect journal:

$ cephfs-journal-tool --rank nfs:all journal inspect

Then proceed with dentries recovery and journal truncation:

$ cephfs-journal-tool --rank=nfs:all event recover_dentries summary
$ cephfs-journal-tool --rank=nfs:all journal reset

Optionally reset session entries:

$ cephfs-table-tool all reset session
$ ceph fs reset nfs --yes-i-really-mean-it

Verify Ceph MDS can be brought up again:

$ sudo /usr/bin/ceph-mds -f --cluster ceph --id $(hostname) --setuser ceph --setgroup ceph

If ok, then kill it ;-) (Ctrl+C)

Resume Operations

Flush all OSD blocklisted MDS clients:

$ for i in $(ceph osd blocklist ls 2>/dev/null | cut -d ' ' -f 1); do ceph osd blocklist rm $i; done

Ensure we’re all fine:

$ ceph osd blocklist ls

There should be no entry anymore.

Start server-side MDS instances on all Kaktus servers:

$ sudo systemctl start ceph-mds@$(hostname)

Re-enable client connections:

$ ceph fs set nfs down false
$ ceph fs set nfs max_mds 2
$ ceph fs set nfs refuse_client_session false
$ ceph config set mds mds_heartbeat_grace 15
$ ceph config set mds mds_deny_all_reconnect false

Restart all CephFS clients, if under control.

For Kowabunga, that means starting NFS Ganesha server on all Kaktus instances:

$ sudo systemctl start nfs-ganesha