1 - Overview
How can Kowabunga sustain your application hosting?
What is it ?
Kowabunga is an SD-WAN and HCI (Hyper-Converged Infrastructure) Orchestration Engine.
Market BS aside, Kowabunga provides DevOps teams with a complete infrastructure automation suite to orchestrate virtual resource management on privately-owned commodity hardware.
On-premises, fully mastered, and at a predictable flat rate.
The Problem
Cloud Services are unnecessarily expensive and come with vendor-locking.
“Cloud computing is basically renting computers, instead of owning and operating your own server hardware. From the start, companies that offer cloud services have promised simplicity and cost savings. Basecamp has had one foot in the cloud for well over a decade, and HEY has been running there exclusively since it was launched two years ago. We’ve run extensively in both Amazon’s cloud and Google’s cloud, but the savings promised in reduced complexity never materialized. So we’ve left.
The rough math goes like this: We spent $3.2m on cloud in 2022. The cost of rack space and new hardware is a total of $840,000 per year.
Leaving the cloud will save us $7 million over five years.
At a time when so many companies are looking to cut expenses, saving millions through hosting expenses sounds like a better first move than the rounds of layoffs that keep coming.”
Cost-Effective: Full private-cloud on-premises readiness and ability to run on commodity hardware. No runtime fees, no egress charges, flat-rate predictable cost. Keep control of your TCO.
Resilient & Feature-Rich: Kowabunga enables highly-available designs across multiple data centers and availability zones and brings automated software-as-a-service. Shorten application development and setup times.
No Vendor-Locking: Harness the potential of Open-Source software stack as a backend: no third-party commercial dependency. We stand on the shoulders of giants: KVM, Ceph … Technical choices remain yours and yours only.
Open Source … by nature: Kowabunga itself is OpenSource, from API to client and server-side components. We have nothing to hide but everything to contribute. We believe in mutual trust.
A Kowabunga-hosted project costs 1/10th of a Cloud-hosted one.
Why do I want it ?
What is it good for?: A modern SaaS product's success is tightly coupled with profitability. As soon as you scale up, you'll quickly understand that you're actually sponsoring your Cloud provider more than your own teams. Kowabunga allows you to keep control of your infrastructure and its associated cost and lifecycle. You'll never have to fear unexpected business-model changes, tariffs and whatnot. You own your stack, with no surprises.
What is it not good for?: PoC and MVP startups. Let's be realistic: if your goal is to vibe-code your next million-dollar idea and deliver it, no matter how and what, forget about us. You have other fish to fry than mastering your own infrastructure. Get funded, wait for your investors to ask for RoI, and you'll make up your mind.
What is it not yet good for?: Competing with GAFAM. Let's be honest, we'll never be the next AWS or GCP (or even OpenStack). We'll never have 200+ kinds of as-a-service offerings, but how many people actually need that much?
Is it business-ready ?
Simply put … YES !
Kowabunga allows you to host and manage personal labs, SOHO sandboxes, as well as million-user SaaS projects. Using Open Source software doesn't imply being on your own. Through our sponsoring program, Kowabunga comes with a 24x7 enterprise-grade level of support.
Fun Facts 🍿
Where does it come from? Everything starts as a solution to a given problem.
Our problem was (and still is …) that Cloud services are unnecessarily expensive and often come with vendor-locking.
While Cloud services are appealing at first and great to bootstrap your project to an MVP level, you’ll quickly hit profitability issues when scaling up.
Provided you have the right IT and DevOps skills in-house, self-managing your own infrastructure makes economic sense.
Linux and QEMU/KVM come in handy, especially when powered by libvirt, but we lacked true resource orchestration to push it to the next stage.
OpenStack was too big, heavy, and costly to maintain. We needed something lighter, simpler.
So we came up with Kowabunga: Kvm Orchestrator With ABUNch of Goods Added.
Where should I go next ?
Concepts: Learn about Kowabunga architecture and design
Simply put, Kowabunga allows you to control and manage low-level infrastructure at your local on-premises data-centers and spin up various virtual resources on top, on which to run your applications.
Local data centers consist of a bunch of physical machines (ranging from personal computers and commodity hardware to high-end enterprise-grade servers) providing raw networking, computing and storage resources. Physical assets plainly sit in your basement. They don't need to be connected to other data-centers, they don't even need to know about other data-centers' existence and, more than anything, they don't need to be exposed to the public Internet.
From an IT and asset-management perspective, one simply needs to ensure they run and, with capacity planning in mind, that they offer enough physical resources to sustain future application hosting needs.
On each data-center, some physical machines (usually lightweight) will be dedicated to providing network services, through Kowabunga's Kiwi agents, while others will provide computing and storage capabilities, thanks to Kowabunga's Kaktus agents.
The Kowabunga project then comes with Kahuna, its orchestration engine. This is the cornerstone of your architecture. Kahuna acts as a maestro, providing API services for admins and end-users, and provisioning and controlling virtual resources on the various data-centers through connected Kowabunga agents.
Ultimately, DevOps consumers will only ever interface with Kahuna.
So, how does the magic happen?
Kahuna has a triple role exposure:
Public REST API: implements and serves the API calls used to manage resources, whether invoked manually by DevOps (not recommended) or through automation tools such as Terraform, OpenTofu or Ansible.
Public WebSocket endpoint: agent connection manager, to which the various Kowabunga agents (from managed data-centers) establish secure WebSocket tunnels so they can be remotely controlled, bypassing on-premises firewall constraints and removing the need for any public service exposure.
Metadata endpoint: where managed virtual instances and services can retrieve their metadata and self-configure.
Core Components
So, let’s rewind, the Kowabunga projects consists of multiple core components:
Kahuna: the core orchestration system. Remotely controls every resource and maintains ecosystem consistent. Gateway to the Kowabunga REST API.
Kaktus: the HCI node(s). Provides KVM-based virtual computing hypervisor with Ceph-based distributed storage services.
Kiwi: the SD-WAN node(s). Provides various network services like routing, firewall, DHCP, DNS, VPN, peering (with active-passive failover).
Koala: the WebUI. Allows for day-to-day supervision and operation of the various projects and services.
Aside from these, Kowabunga introduces the concept of:
Region: basically a physical location, which can be assimilated to a data-center.
Zone: a specific subset of a region, where all underlying resources are guaranteed to be self-autonomous (in terms of Internet connectivity, power-supply, cooling …). As with other Cloud providers, the zones allow for application workload distribution within a single region, offering resilience and high-availability.
Warning
While Zones are part of the same Region, it is recommended that they be geographically isolated (5 to 30 km apart, for example), yet inter-connected with sub-millisecond latency.
Regardless of their respective Zone, all physical instances from a given Region must share the same L2/L3 physical network backbone (dark fiber) to provide efficient distributed storage performance.
Topology Use Cases
This illustrates what a Kowabunga multi-zone and multi-region topology would look like:
On the left side, one would have a multi-zone region. Divided into 3 zones (i.e. 3 physically isolated data-centers, inter-connected by a network link), the region features 11 server instances:
3x3 Kaktus instances, providing computing and storage capabilities.
Zones can be pictured in different ways:
several floors from your personal home basement (ok … useless … but for the sake of example).
several IT rooms from your company’s office.
several buildings from your company’s office.
Should a Kowabunga user request the creation of a virtual machine in this region, they could specifically request it to be assigned to one of the 3 zones (the underlying hypervisor within the zone being automatically picked), or request some -as-a-service feature, which would seamlessly be spawned across multiple zones to provide service redundancy.
Sharing the same L2/L3 network across the region, instance disks will be distributed and replicated across zones, allowing for fast instance relocation in the event of a zone failure.
On the right side, one would have a single-zone region, with just a couple of physical instances.
Tips
Bear in mind that regions are autonomous. They can blindly co-exist, with different underlying capabilities and level of services.
One could imagine having a specific region dedicated for staging and one for production workloads (to keep resources isolated from each environment) or even multiple regions, each being specific to a given company or customer.
What Makes it Different ?
Cloud providers aside, what makes Kowabunga different from other on-premises infrastructure and virtualization providers (such as VMware, Nutanix, OpenStack …)?
Well … zero licensing costs. Kowabunga is Open Source with no paywalled features. There's no per-CPU or per-GB-of-memory kind of license. Whether you'd like to set up your small-sized company's private data-center with 3 servers or a full fleet of 200+, your cost of operation will remain flat.
But aside from cost, Kowabunga has been developed by and for DevOps, the ones who:
need to orchestrate, deploy and maintain heterogeneous applications on heterogeneous infrastructures.
use Infrastructure-as-Code principles to ensure reliability, durability and traceability.
bear security in mind, ensuring that nothing more than what's required is publicly exposed.
believe that smaller and simpler is better.
Tips for Managed Services Providers
If you’re acting as a Managed Services Provider (MSP) having to sustain various applications for dozens if not hundreds of customers, Kowabunga might come in handy.
Simply picture your various customers' on-premises data-centers as Kowabunga regions: all autonomous, unaware of each other, not exposed to the Internet (hello IT!), yet fully remotely manageable in a single, unified way, thanks to Kahuna's orchestration!
2.1 - Kahuna
Learn about Kahuna orchestrator.
Kahuna is Kowabunga’s orchestration system. Its name takes root from Hawaiian’s (Big) Kahuna word, meaning “the expert, the most dominant thing”.
Kahuna remotely controls every resource and maintains ecosystem consistent. It’s the gateway to Kowabunga REST API.
From a technological stack perspective, Kahuna features:
a Caddy public HTTPS frontend, reverse-proxying requests to:
Public REST API handler: implements and serves the API calls used to manage resources, interacting with the rightful local agents through JSON-RPC over WSS.
Public WebSocket handler: agent connection manager, to which the various agents establish secure WebSocket tunnels so they can be remotely controlled, bypassing on-premises firewall constraints and removing the need for any public service exposure.
Metadata endpoint: where managed virtual instances and services can retrieve their metadata and self-configure.
The Kowabunga API folds into 2 types of assets:
admin ones, used to handle objects like region, zone, kaktus and kiwi hosts, agents, networks …
user ones, used to handle objects such as Kompute, Kawaii, Konvey …
Kahuna implements robust RBAC and segregation of duties to enforce access boundaries, such as:
Nominative RBAC capabilities and per-organization and team user management.
Per-project team association for per-resource access control.
Support for both JWT bearer (human-to-server) and API-Key token-based (server-to-server) authentication mechanisms.
Support for two-step account creation/validation and enforced robust password/token usage (server-generated, user input is prohibited).
Nominative robust HMAC ID+token credentials over secured WebSocket agent connections.
This ensures that:
only rightful designated agents are able to establish WSS connections with Kahuna
created virtual instances can only retrieve the metadata profile they belong to (and self-configure or update themselves at boot or runtime).
users can only see and manage resources for the projects they belong to.
Warning
Despite being central, Kahuna’s implementation does not yet allow for stateless distribution.
The current design with DB caches and WebSocket connections makes it hard to distribute without involving a message-queue middleware. This is a good problem for the future, but not for now. A single Kahuna instance is perfectly capable of handling multiple thousands of concurrent connections, which we believe to be more than enough for a private platform orchestrator (it wouldn't be for a large-scale public Cloud one).
Providing Kahuna with high-availability remains however fully possible, using a good old active-passive failover mechanism.
2.2 - Koala
Learn about Koala Web application.
Koala is Kowabunga’s WebUI. It allows for day-to-day supervision and operation of the various projects and services.
But ask any senior, fully automation-driven DevOps / SRE / IT admin: they'd damn anyone who used a Web client to manually create/edit resources and mess around with their perfectly maintained CasC.
We’ve all been there !!
That’s why Koala has been designed to be read-only. While using Kowabunga’s API, the project’s directive is to enforce infrastructure and configuration as code, and such, prevents any means to do harm.
Koala is AngularJS based and usually located next to Kahuna’s instance. It provides users with capability to connect, check for the various projects (they belong to) resources, optionnally start/reboot/stop them and/or see various piece of information and … that’s it ;-)
2.3 - Kiwi
Learn about Kiwi SD-WAN node.
Kiwi is Kowabunga SD-WAN node in your local data-center. It provides various network services like routing, firewall, DHCP, DNS, VPN and peering, all with active-passive failover (ideally over multiple zones).
Kiwi is central to the smooth operation of the regional infrastructure and is the internal gateway to all your projects' Kawaii private network instances. It controls the local network configuration and creates/updates VLANs, subnets and DNS entries per API request.
Kiwi provides per-project network isolation by enabling a VLAN-bound, cross-zone, project-attributed VPC L3 networking range. Created virtual instances and services are bound to the VPC by default and never publicly exposed unless requested.
Access to project’s VPC resources is managed either through:
Kiwi-managed region-global VPN tunnels.
Kawaii-managed project-local VPN tunnels.
Deciding on one or the other depends on your private Kowabunga IT policy.
2.4 - Kaktus
Learn about Kaktus HCI node.
Kaktus stands for Kowabunga Amazing KVM and TUrnkey Storage (!!), basically, our Hyper-Converged Infrastructure (HCI) node.
While large virtualization systems such as VMware usually require you to dedicate servers as computing hypervisors (with plenty of CPU and memory) and associate them with remote, extensive NAS or vSAN storage, Kowabunga follows the opposite approach. Modern hardware is powerful enough to handle both computing and storage.
This approach allows you to:
use commodity hardware, if needed
use heterogeneous hardware, each member of the pool featuring more or less computing and storage resources.
If you’re already ordering a heavy computing rackable server, extending it with 4-8 SSDs is always going to be cheaper than adding an extra enterprise SAN.
Kaktus nodes then consist of:
a KVM/QEMU + libvirt virtualization computing stack. Featuring all possible VT-x and VT-d assistance on x86_64 architectures, it’ll provide near passthrough virtualization capabilities.
several local disks, to be part of a region-global Ceph distributed storage cluster.
the Kowabunga Kaktus agent, connected to Kahuna
From a purely low-level software perspective, our virtualization stack relies on 3 building blocks:
Linux Network Bridging driver, for virtual interfaces access to host raw network interfaces and physical network.
Linux KVM driver, for CPU VT-X extension support and improved virtualization performances.
RBD (Rados Block Device) driver, for storing virtual block devices under distributed Ceph storage engine.
QEMU drives these different backends to virtualize resources on top of them.
Now, QEMU being a local host process to be spawned, we need some kind of orchestration layer on top of it. Here comes libvirt. libvirt provides an API over TCP/TLS/SSH that wraps virtual machine definitions in an XML representation which can be fully created/updated/destroyed remotely, controlling QEMU underneath. The Kaktus agent controls the local KVM hypervisor through the libvirt backend and the network-distributed Ceph storage, allowing management of virtual machines and disks.
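For illustration only, libvirt's remote API can also be exercised by hand with virsh (the qemu+ssh URI and the kaktus-1 hostname are hypothetical; the Kaktus agent itself talks to the libvirt API directly):
$ virsh -c qemu+ssh://root@kaktus-1/system list --all
$ virsh -c qemu+ssh://root@kaktus-1/system dumpxml my-vm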
Note
When configured for production systems, the Ceph storage cluster will be backed by cross-zone, N-times (usually 3) replicated high-performance block devices, providing virtually infinitely scalable and resizable disk volumes with byte precision.
Virtual disk contents are sharded into thousands of objects, spread across the various disks of the various Kaktus instances of a given region, so the “chance” of data loss or corruption is close to none.
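For reference, the replication factor is a per-pool Ceph setting; a hedged example (the rbd pool name is hypothetical, and Kowabunga's deployment tooling may already handle this for you):
$ ceph osd pool set rbd size 3
$ ceph osd pool set rbd min_size 2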
Enterprise Recommendations
If you intend to use Kowabunga to run serious business (and we hope you do), you need to make sure to give Ceph its full potential.
Too many Cloud systems today are limited by disk bandwidth (CPUs stuck in I/O wait). Using Ceph implies that your disk I/Os go through the network. Simply put, don't expect local NVMe SSD access times.
In order to ensure the fastest storage possible, it remains key that you:
use local NVMe SSDs on as many server instances as possible (they'll all be part of the same cluster pool).
use physical servers with at least 10 Gbps network interfaces (25 Gbps is even better, link-aggregation is a nice bonus).
ensure that your regional zones are less than 1ms away from each other.
This may sound like heavy requirements, but by today's enterprise-grade standards, it really isn't anymore ;-)
3 - Getting Started
Deploy your first Kowabunga instance !
3.1 - Hardware Requirements
Prepare hardware for setup
Setting up a Kowabunga platform requires you to provide the following hardware:
1x Kahuna instance (more can be used if high-availability is expected).
1x Kiwi instance per region (2x recommended for production-grade).
1x Kaktus instance per region (a minimum of 3x recommended for production-grade, can scale to N).
Important
Note that while it should work on any Linux distribution, Kowabunga has only been tested (read: supported) with Ubuntu LTS. Kowabunga comes pre-packaged for Ubuntu.
Kahuna Instance
Kahuna is the only instance that will be exposed to end users. It is recommended to have it exposed on the public Internet, making it easier for DevOps and users to access, but there's no strong requirement for that. It is perfectly possible to keep it local to your private corporate network, only accessible from the on-premises network or through a VPN.
Hardware requirements are lightweight:
2 vCPU cores
4 to 8 GB RAM
64 GB of disk for the OS + MongoDB database.
Disk and network performance is fairly insignificant here, anything modern will do just fine.
We personally use and recommend small VPS-like public Cloud instances. They come with a public IPv4 address and all one needs, for a monthly price of only $5 to $20.
Kiwi Instance
Kiwi will act as a software network router and gateway. Even more than for Kahuna, you don't need much horsepower here. If you plan on setting up your own home lab, a small 2 GB RAM Raspberry Pi would be sufficient (keep in mind that SoHo routers and gateways are even more lightweight than that).
If you intend to use it for enterprise-grade purposes, just pick the lowest-end server you can find.
It's probably going to come bundled with a 4-core CPU, 8 GB of RAM and whatever SSD, and in any case, it will be more than enough, unless you really intend to handle 1000+ computing nodes pushing multi-Gbps traffic.
Kaktus Instance
Kaktus instances are another story. If there's one place to put your money, this is it. Each instance will handle as many virtual machines as possible and be part of the distributed Ceph storage cluster.
Sizing depends on your expected workload; there's no accurate rule of thumb for that. You'll need to think about capacity planning ahead. How many vCPUs do you expect to run in total? How many GBs of RAM? How much disk? What overcommit ratio do you expect to set? How much data replication (and so … resilience) do you expect?
These are all good questions to be asked. Note that you can easily start low with only a few Kaktus instances and scale up later on, as you grow. The various Kaktus instances from your fleet may also be heterogeneous (to some extent).
As a rule of thumb, unless you're setting up a sandbox or home lab, a minimum of 3 Kaktus instances is recommended. This allows you to move workloads from one instance to another, or simply put one in maintenance mode (i.e. shut down its workload) while keeping business continuity.
Supposing you have X Kaktus instances and expect up to Y to be down at a given time, the following applies:
Instance Maximum Workload: (X - Y) / X %
Said differently, with only 3 machines, don't go above 66% average load usage or you won't be able to put one in maintenance without tearing down applications.
Consequently, with availability in mind, it's better to have many lightweight instances than a few heavy ones.
The same applies (even more so to the Ceph storage cluster). Each instance's local disks will be part of the Ceph cluster (as Ceph OSDs, to be accurate) and data will be spread across those, within the same region.
Now, let's consider you want to achieve 128 TB of usable disk space. At first, you need to define your replication ratio (i.e. how many times object storage fragments will be replicated across disks). We recommend a minimum of 2, and 3 for production-grade workloads. With a ratio of 3, that means you'll actually need a total of 384 TB of physical disks.
Here are different options to achieve it:
1 server with 24x 16 TB SSDs
3 servers with 8x 16TB SSDs each
3 servers with 16x 8TB SSDs each
8 servers with 6x 8TB SSDs each
[…]
From a pure resilience perspective, the last option would be the best. It provides the most machines, with the most disks, meaning that if anything happens, the smallest fraction of the cluster's data will be affected. Lost data is possibly only ephemeral (the time for the server or disk to be brought back up). But while it is down, Ceph will try to re-copy data from duplicated fragments to other disks, inducing major private network bandwidth usage. Whether you have only 8 TB of data to recover or 128 TB makes a very different impact.
Also, as your virtual machines' performance will be heavily tied to the underlying network storage, it is vital (at least for production-grade workloads) to use NVMe SSDs with 10 to 25 Gbps network controllers and sub-millisecond latency between your private region servers.
So let’s recap …
Typical Kaktus instances for home labs or sandbox environments would look like:
4-cores (8-threads) CPUs.
16 GB RAM.
2x 1TB SATA or NVMe SSDs (shared between OS partition and Ceph ones)
1 Gbps NIC
While Kaktus instances for production-grade workload could easily look like:
32 to 128 cores CPUs.
128 GB to 1.5 TB RAM.
2x 256 GB SATA RAID-1 SSDs for OS.
6 to 12x 2-8 TB NVMe SSDs for Ceph.
10 to 25 Gbps NICs with link-aggregation.
Important
Remember that you can start low and grow later on. Instances need not all be alike (you can perfectly well have “small” 32-core servers and bigger 128-core ones). But completely heterogeneous instances (especially regarding disk and network capabilities) could have disastrous effects.
Keep in mind that all disks from all instances will be part of the same Ceph cluster, which any virtual machine instance can read and write data from. Mixing 25 Gbps network servers with fast NVMe SSDs and low-end 1 Gbps ones with rotational HDDs would drag down your whole setup.
3.2 - Software Requirements
Get your toolchain ready
Kowabunga’s deployment philosophy relies on IaC (Infrastructure-as-Code) and CasC (Configuration-as-Code). We heavily rely on:
While natively compatible with the aformentionned, we recommend using Kowabunga Kobra as a toolchain overlay.
Kobra is a DevOps deployment swiss-army knife utility. It provides a convenient wrapper over OpenTofu, Ansible and Helmfile with proper secrets management, removing the hassle of complex deployment startegy.
Anything can be done without Kobra, but it makes things simpler, not having to care about the gory details.
Kobra supports various secret management providers. Please choose the one that fits your expected collaborative work experience.
At runtime, it'll also make sure your OpenTofu / Ansible toolchain is properly set up on your computer, and will set it up otherwise (i.e. brainless setup).
Installation can be easily performed on various targets:
or better, fork it into your own account, as a bootstrapping template repository.
Secrets Management
Passwords, API keys, tokens … they are all sensitive and meant to be secrets. You don't want any of those to leak into a public Git repository. Kobra relies on SOPS to ensure all secrets are located in an encrypted file (which is then safe to be Git-hosted), encrypted/decrypted on the fly thanks to a master key.
Kobra supports various key providers:
aws: AWS Secrets Manager
env: Environment variable stored master-key
file: local plain text master-key file (not recommended for production)
hcp: Hashicorp Vault
input: interactive command-line input prompt for master-key
keyring: local OS keyring (macOS Keychain, Windows Credentials Manager, Linux Gnome Keyring/KWallet)
If you’re building a large production-grade system, with multiple contributors and admins, using a shared key management system like aws or hcp is probably welcome.
If you’re single contributor or in a very small team, storing your master encryption key in your local keyring will do just fine.
Simply edit the following section of your kobra.yml file:
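As a purely illustrative sketch (the key names below are assumptions, not necessarily Kobra's actual schema), selecting the keyring provider could look like:
# kobra.yml excerpt (illustrative only; key names are assumptions)
secrets:
  provider: keyring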
Thanks to that, any file from your inventory's host_vars or group_vars directories suffixed with .sops.yml will automatically be included when running playbooks. It is then absolutely safe to use these encrypted-at-rest files to store your most sensitive variables.
Creating such files and/or editing them to add extra variables is then as easy as:
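For instance, on a group_vars secrets file (the path below is just an example):
$ kobra secrets edit ansible/inventories/group_vars/kahuna/secrets.sops.yml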
Kobra will automatically decrypt the file on the fly, open the editor of your choice (as set in your $EDITOR env var), and re-encrypt it with the master key on save/exit.
That’s it, you’ll never have to worry about secrets management and encryption any longer !
OpenTofu
The very same applies to OpenTofu, where the SOPS master key is used to encrypt the most sensitive data. Anything sensitive you'd need to add to your TF configuration can be set in the terraform/secrets.yml file as simple key/value pairs.
$ kobra secrets edit terraform/secrets.yml
Note however that their existence must be manually reflected in the HCL-formatted terraform/secrets.tf file, supposing for example that you have an encrypted my_service_api_token: ABCD…Z entry in your terraform/secrets.yml file.
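One way to wire this up (a minimal sketch, assuming the community carlpett/sops provider for OpenTofu; your platform template may do it differently):
# terraform/secrets.tf (sketch)
data "sops_file" "secrets" {
  source_file = "secrets.yml"
}

locals {
  # decrypted on the fly thanks to the SOPS master key
  my_service_api_token = data.sops_file.secrets.data["my_service_api_token"]
}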
Note that OpenTofu adds a very strong feature over plain old Terraform: TF state file encryption. Where the TF state file is located (locally, i.e. in Git, or remotely, in S3 or alike) is up to you, but should you use a Git-located one, we strongly advise having it encrypted.
You can achieve this easily by extending the terraform/providers.tf file in your platform’s repository:
terraform {
  encryption {
    key_provider "pbkdf2" "passphrase" {
      passphrase = var.passphrase
    }
    method "aes_gcm" "sops" {
      keys = key_provider.pbkdf2.passphrase
    }
    state {
      method = method.aes_gcm.sops
    }
    plan {
      method = method.aes_gcm.sops
    }
  }
}

variable "passphrase" {
  # Value to be defined in your local passphrase.auto.tfvars file.
  # Content to be retrieved from the deciphered secrets.yml file.
  sensitive = true
}
Then, create a local terraform/passphrase.auto.tfvars file with the secret of your choice:
passphrase = "ABCD...Z"
Warning
Note that you don’t want the terraform/passphrase.auto.tfvars (being plain-text) file to be stored on Git, so make sure it is well ignored in your .gitignore configuration.
Also, it’s strongly advised that whatever passphrase you’d chose to encrypt TF state is kept secure. A good practice would be to have it copied and defined in terraform/secrets.yml file, as any other sensitive variable, so to keep it vaulted.
3.3 - Network Topology
Our Tutorial network topology
Let’s use this sample network topology for the rest of this tutorial:
We’ll start with a single Kahuna instance, with public Internet exposure. The instance’s hostname will be kowabunga-kahuna-1 and it has 2 network adapters and associated IP addresses:
a private one, 10.0.0.1, in the event we'd later need to peer it with other instances for high-availability.
a public one, 1.2.3.4, exposed as kowabunga.acme.com for the WebUI, REST API calls to the orchestrator and the WebSocket agents endpoint. It'll also be exposed as grafana.acme.com, logs.acme.com and metrics.acme.com for Kiwi and Kaktus to push logs and metrics and allow for service metrology.
Next is the main (and only) region, EU-WEST and its single zone, EU-WEST-A. The region/zone will feature 2 Kiwi instances and 3 Kaktus ones.
All instances will be connected under the same L2 network layer (as defined in requirements) and we’ll use different VLANs and associated network subnets to isolate content:
VLAN101 will be used as the default, administration VLAN, with the associated 10.50.101.0/24 subnet. All Kiwi and Kaktus instances will be part of it.
VLAN102 will be used for the Ceph backplane, with the associated 10.50.102.0/24 subnet. While not mandatory, this allows differentiating the administrative control-plane traffic from pure storage cluster data synchronization, which allows for better traffic shaping and monitoring, if ever need be. Note that on enterprise-grade production systems, the Ceph project recommends using a dedicated NIC for Ceph traffic, so isolation here makes sense.
VLAN201 to VLAN209 will be application VLANs. Kiwi will bind them, being the region's router, but Kaktus instances won't. Instantiated VMs will, however, through bridged network adapters.
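As an illustration only (the interface name and host addresses are hypothetical, and your own provisioning tooling may handle this differently), a Kaktus node's netplan configuration for the administration and Ceph VLANs could look like:
# /etc/netplan/01-kowabunga.yaml (sketch)
network:
  version: 2
  ethernets:
    eno1: {}
  vlans:
    vlan101:
      id: 101
      link: eno1
      addresses: [10.50.101.11/24]
      routes:
        - to: default
          via: 10.50.101.1   # Kiwi VIP, the region's gateway
    vlan102:
      id: 102
      link: eno1
      addresses: [10.50.102.11/24]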
Warning
It is suggested to use manually-assigned fixed addresses for Kiwi and Kaktus instances. Being critical, you don't want to risk a service interruption because of a DHCP lease issue.
Note
Note that while Kiwi instances have static IP addresses (namely .2 and .3), they'll also use .1 as a virtual IP (VIP), which is used for failover. Consequently, .1 will always be the network's router/gateway here, whichever Kiwi instance holds it.
3.4 - Setup Kahuna
Let’s start with the orchestration core
Now let’s suppose that you’ve cloned the Git platform repository template and that your Kahuna instance server has been provisioned with latest Ubuntu LTS distribution. Be sure that it is SSH-accessible with some local user.
Let’s take the following assumptions for the rest of this tutorial:
We only have one single Kahuna instance (no high-availability).
Local bootstrap user with sudo privileges is ubuntu, with key-based SSH authentication.
Kahuna instance is public-Internet exposed through IP address 1.2.3.4, translated to kowabunga.acme.com DNS.
Kahuna instance is private-network exposed through IP address 10.0.0.1.
Kahuna instance hostname is kowabunga-kahuna-1.
Setup DNS
Please ensure that your kowabunga.acme.com domain translates to public IP address 1.2.3.4. Configuration is up to you and your DNS provider and can be done manually.
Being IaC supporters, we advise using OpenTofu for that purpose. Let's see how we can do it, using the Cloudflare DNS provider.
Start by editing the terraform/providers.tf file in your platform’s repository:
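A minimal sketch (assuming the Cloudflare provider v4 syntax; the zone ID is a placeholder and the variable wiring is an assumption) could look like:
# terraform/providers.tf (sketch)
terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }
}

variable "cloudflare_api_token" {
  sensitive = true # hypothetical variable, e.g. fed from your SOPS-encrypted secrets
}

provider "cloudflare" {
  api_token = var.cloudflare_api_token
}

# A record pointing kowabunga.acme.com to the Kahuna public IP
resource "cloudflare_record" "kahuna" {
  zone_id = "YOUR_CLOUDFLARE_ZONE_ID" # placeholder
  name    = "kowabunga"
  type    = "A"
  value   = "1.2.3.4"
}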
By default, your platform is configured to pull a tagged official release from Ansible Galaxy. You may however prefer to pull it directly from Git, using the latest commit for instance. This can be accommodated through:
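For instance, a hedged sketch of an Ansible requirements file pulling the collection from Git (the repository URL and branch are assumptions):
# ansible/requirements.yml (sketch; repository URL and branch are assumptions)
collections:
  - name: https://github.com/kowabunga-cloud/ansible-collections.git
    type: git
    version: main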
Once defined, simply pull it into your local machine:
$ kobra ansible pull
Kahuna Settings
Kahuna instance deployment will take care of everything. It assumes a supported Ubuntu LTS release, enforces some configuration and security settings, installs the necessary packages, creates local admin user accounts if required, and sets up some form of deny-all filtering-policy firewall, so you're safely exposed.
Admin Accounts
Let’s start by declaring some user admin accounts we’d like to create. We don’t want to keep on using the single nominative ubuntu account for everyone after all.
Simply create/edit the ansible/inventories/group_vars/all/main.yml file to declare all your expected admin users.
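A purely illustrative sketch (the variable name is an assumption, not necessarily the collection's actual one):
# ansible/inventories/group_vars/all/main.yml (illustrative sketch; variable name assumed)
kowabunga_admin_users:
  - admin_user_1
  - admin_user_2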
Then add their respective SSH public key files in the ansible/files/pubkeys directory, e.g.:
$ tree ansible/files/pubkeys/
ansible/files/pubkeys/
└── admin_user_1
└── admin_user_2
Note
Note that all registered admin accounts will have password-less sudo privileges.
We’d also recommend you to set/update the root account password. By default, Ubuntu comes without any, making it impossible to login. Kowabunga’s playbook make sure that root login is prohibited from SSH for security reasons (e.g. brute-force attacks) but we encourage you setting one, as it’s always useful, especially on public cloud VPS or bare metal servers to get a console/IPMI access to log into.
If you intend to do so, simply edit the secrets file:
If your Kahuna instance is exposed to the public Internet, it is more than recommended to enable a network firewall. This can easily be done by extending the ansible/inventories/group_vars/kahuna/main.yml file with:
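As an illustration (the variable name is an assumption, not necessarily the collection's actual one), limiting open TCP ports to SSH and HTTP/HTTPS might look like:
# ansible/inventories/group_vars/kahuna/main.yml (illustrative sketch; variable name assumed)
kowabunga_firewall_allowed_tcp_ports: [22, 80, 443]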
Note that we’re limited opened ports to SSH and HTTP/HTTPS here, which should be more than enough (HTTP is only used by Caddy server for certificate auto-renewal and will redirect traffic to HTTPS anyway). If you don’t expect your instance to be SSH-accessible on public Internet, you can safely drop this line.
MongoDB
Kahuna comes with a bundled, ready-to-be-used MongoDB deployment. This comes in handy if you only have a single instance to manage. It remains however optional (though it is the default), as you may very well be willing to re-use an existing, already deployed, external production-grade MongoDB cluster.
If you intend to go with the bundled one, a few settings must be configured in ansible/inventories/group_vars/kahuna/main.yml file:
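Illustrative only (variable names are assumptions), such settings typically boil down to enabling the bundled instance and picking credentials:
# ansible/inventories/group_vars/kahuna/main.yml (illustrative sketch; variable names assumed)
kowabunga_mongodb_enabled: true
kowabunga_mongodb_admin_password: "{{ secret_mongodb_admin_password }}"          # stored in a .sops.yml file
kowabunga_mongodb_kowabunga_password: "{{ secret_mongodb_kowabunga_password }}"  # idem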
This will basically instruct Ansible to install the MongoDB server, configure it with a replica set (so it can be part of a future cluster, we never know), secure it with admin credentials of your choice, and create a kowabunga database/collection and associated service user.
Kahuna Settings
Finally, let’s ensure the Kahuna orchestrator gets everything he needs to operate.
You’ll need to define:
a custom email address (and associated SMTP connection settings) for Kahuna to be able to send email notifications to users.
a randomly generated key to sign JWT tokens (please ensure it is secure enough so as not to compromise the robustness of issued tokens).
a randomly generated admin API key. It’ll be used to provision the admin bits of Kowabunga, until proper user accounts have been created.
a private/public SSH key-pair to be used by platform admins to seamlessly SSH into instantiated Kompute instances. Please ensure the private key is stored securely somewhere.
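For example, the random keys and the SSH key-pair can be generated like this (file names are up to you):
$ openssl rand -base64 48    # JWT signing key
$ openssl rand -hex 32       # admin API key
$ ssh-keygen -t ed25519 -f kowabunga_admins -C "kowabunga-admins"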
Then simply edit the ansible/inventories/group_vars/kahuna/main.yml file the following way:
We’re done with configuration (finally) ! All we need to do now is finally run Ansible to make things live. This is done by invoking the kahuna playbook from the kowabunga.cloud collection:
$ kobra ansible deploy -p kowabunga.cloud.kahuna
Note that, under the hood, Ansible will use the Mitogen extension to speed things up. Bear in mind that Ansible's run is idempotent: anything that fails can be re-executed. You can also run it as many times as you want, or re-run it in 6 months or so; provided you're using a tagged collection, the end result will always be the same.
After a few minutes, if everything went okay, you should have a working Kahuna instance, i.e.:
A Caddy front-end reverse-proxy, taking care of automatic TLS certificate issuance, renewal and traffic termination, forwarding requests back to either the Koala Web application or the Kahuna backend server.
The Kahuna backend server itself, our core orchestrator.
Your Kahuna instance is now up and running; let's get things going and create a few admin user accounts. At first, we only have the super-admin API key that was previously set through the Ansible deployment. We'll make use of it to provision further users and associated teams. After all, we want a nominative user account for each contributor, right?
Back to TF config, let’s edit the terraform/providers.tf file:
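A minimal sketch of the provider block (the registry source and token attribute name are assumptions; check the Kowabunga provider's documentation):
# terraform/providers.tf (sketch)
terraform {
  required_providers {
    kowabunga = {
      source = "kowabunga-cloud/kowabunga" # assumed registry source
    }
  }
}

variable "kowabunga_admin_api_key" {
  sensitive = true # typically fed from the SOPS-encrypted terraform/secrets.yml
}

provider "kowabunga" {
  uri   = "https://kowabunga.acme.com"
  token = var.kowabunga_admin_api_key # attribute name assumed
}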
Make sure to edit the Kowabunga provider's uri with the associated DNS name of your freshly deployed Kahuna instance and edit the terraform/secrets.yml file to match the kowabunga_admin_api_key you picked before. OpenTofu will use these parameters to connect to your private Kahuna and apply resources.
Now declare a few users in your terraform/locals.tf file:
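Reconstructed from the for_each usage below, a sketch of those locals could be (emails and role values are placeholders):
# terraform/locals.tf (sketch)
locals {
  admins = {
    admin_user_1 = { email = "user1@acme.com", role = "superAdmin", notify = true }
    admin_user_2 = { email = "user2@acme.com", role = "superAdmin", notify = true }
    admin_bot    = { email = "bot@acme.com", role = "superAdmin", bot = true }
  }
}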
and the following resources definition in terraform/main.tf:
resource"kowabunga_user" "admins" {
for_each =local.admins name =each.key email =each.value.email role =each.value.role notifications =try(each.value.notify, false)
bot =try(each.value.bot, false)
}
resource"kowabunga_team" "admin" {
name ="admin" desc ="Kowabunga Admins" users =sort([forkey, userinlocal.admins:kowabunga_user.users[key].id])
}
Then, simply apply for resources creation:
$ kobra tf apply
What we’ve done here was to register a new admin team, with 3 new associated user accounts: 2 regular ones for human administrators and one bot, which you’ll be able to use its API key instead of the super-admin master one to further provision resources if you’d like.
Better do this way as, shall the key be compromised, you’ll only have to revoke it or destroy the bot account, instead of replacing the master one on Kahuna instance.
Newly registered user will be prompted with 2 emails from Kahuna:
a “Welcome to Kowabunga !” one, simply asking yourself to confirm your account’s creation.
a “Forgot about your Kowabunga password ?” one, prompting for a password reset.
Warning
Account’s creation confirmation is required for the user to proceed further. For security purpose, newly created user accounts are locked-down until properly activated.
With security in mind, Kowabunga will prevent you from setting your own password. Whichever IT policy you’d choose, you will always end up with users having a weak password or finding a way to compromise your system. We don’t want that to happen, nor do we think it’s worth asking a user to generate a random ‘strong-enough’ password by himself, so Kowabunga does it for you.
Once users have been registered and password generated, and provided Koala Web application has been deployed as well, they can connect to (and land on a perfectly empty and so useless dashboard ;-) for now at least ).
Note that, despite have 2 Kiwi instances, from Kowabunga perspective, we’re only registering one. This is because, the 2 instances are only used for high-availability and failover perspective. From service point of view, the region only has one single network gateway.
Despite that, each instance will have its own agent, to establish a WebSocket connection to Kahuna orchestrator.
Let’s continue with the 3 Kaktus instances declaration and their associated agents. Note that, this time, instances are associated to the zone itself, not the region.
Information
Note that Kaktus instance creation/update takes 4 specific parameters into account:
cpu_price and memory_price are purely arbitrary values that express how much actual money your bare-metal infrastructure is worth. These are used for virtual cost calculation later, when you spawn Kompute instances with vCPUs and vGB of RAM. Each server being different, it's perfectly okay to have different values across your fleet.
cpu_overcommit and memory_overcommit define the overcommit ratio you allow your physical hosts to handle. As for price, not every server is born equal: some have hyper-threading, others don't. You may consider a value of 3 or 4 fine; others tend to be stricter and use 2 instead. The higher you set the bar, the more virtual resources you'll be able to create, but the fewer actual physical resources they'll be able to get.
That done, the Kiwi and Kaktus instances have been registered, and more importantly, so have their associated agents. For each newly created agent, you should have received an email (check the admin address you previously set in Kahuna's configuration). Keep track of these emails: they contain one-time credentials, i.e. the agent identifier and its associated API key.
This is the super-secret material that will later allow them to establish a secure connection to the Kahuna orchestrator. We're soon going to declare these credentials in Ansible's secrets so Kiwi and Kaktus instances can be provisioned accordingly.
Warning
There’s no way to recover the agent API key. It’s never printed anywhere but on the email you just received. Even the database doesn’t contain it. If one agent’s API key is lost, you can either request a new one from API or destroy the agent and create a new one in-place.
Kowabunga provides more than just raw infrastructure resources access. It features various “ready-to-be-consumed” -as-a-service extensions to easily bring life to your various application and automation deployment needs.
4.1 - Kaddie
Kowabunga Private Key Infrastructure
This service is still work-in-progress
4.2 - Kalipso
Kowabunga Application Load-Balancer
This service is still work-in-progress
4.3 - Karamail
Kowabunga SMTP Server
This service is still work-in-progress
4.4 - Kawaii
Kowabunga Internet Gateway
Kawaii is your project’s private Internet Gateway, with complete ingress/egress control. It stands for Kowabunga Adaptive WAn Intelligent Interface (if you have better ideas, we’re all ears ;-) ).
It is the network gateway to your private network. All Kompute (and other services) instances always use Kawaii as their default gateway, relaying all traffic.
Kawaii itself relies on the underlying region’s Kiwi SD-WAN nodes to provide access to both public networks (i.e. Internet) and possibly other projects’ private subnets (when requested).
Kawaii is always the first service to be created (more exactly, other instances' cloud-init boot sequences will likely wait until they get proper network connectivity, which Kawaii provides). Being critical for your project's resilience, Kawaii uses Kowabunga's concept of Multi-Zone Resources (MZR) to ensure that, when the requested region features multiple availability zones, a project's Kawaii instance gets created in each zone.
Using multiple floating virtual IP (VIP) addresses with per-zone affinity, this guarantees that all instantiated services will always be able to reach their associated network router. Whenever possible, using weighted routes, service instances will target their zone-local Kawaii instance, the best pick for latency. In the unfortunate event of a local zone failure, network traffic will automatically get routed to another zone's Kawaii (with an affordable extra millisecond of penalty).
While obviously providing egress capability to all of a project's instances, Kawaii can also be used as an ingress controller, exposed to the public Internet through a dedicated IPv4 address. Associated with a Konvey or Kalipso load-balancer, it makes it simple to expose your application publicly, as one would do with a Cloud provider.
Kowabunga’s API allows for complete control of the ingress/egress capability with built-in firewalling stack (deny-all filtering policy, with explicit port opening) as well as peering capabilities.
This allows you to inter-connect your project’s private network with:
VPC peering with other Kowabunga-hosted projects from the same region (network translation and routing being performed by underlying Kiwi instances).
IPSEC peering with non-Kowabunga managed projects and network, from any provider.
Warning
Keep in mind that peering requires a bi-directional agreement. Connection and possibly firewalling must be configured at both endpoints.
Note that thanks to Kowabunga's internal network architecture and on-premises network backbone, inter-zone traffic is free of charge ;-) There's no reason not to spread your resources across as many zones as possible; you won't ever see any end-of-the-month surprise charge.
4.5 - Knox
Kowabunga Vault Service
This service is still work-in-progress
4.6 - Kompute
Kowabunga Virtual Machine instance
Kowabunga Kompute is the incarnation of a virtual machine instance.
Associated with underlying distributed block storage, it provides everything one needs to run generic application workloads.
Kompute instances can be created (and later edited) with complete granularity:
number of virtual CPU cores.
amount of virtual memory.
one OS disk and any number of extra data disks.
optional public (i.e. Internet) direct exposure.
Compared to major Cloud providers, which only provide pre-defined machine flavors (with X vCPUs and Y GB of RAM), you're free to tailor machines to your exact needs.
Kompute instances are created and bound to a specific region and zone, where they'll remain. Kahuna orchestration will make sure to instantiate the requested machine on the best Kaktus hypervisor (at the time), but thanks to the underlying distributed storage, it can easily be migrated to any other instance from the specified zone, for failover or balancing.
Kompute's OS disk image is cloned from one of the various OS templates you'll have provided Kowabunga with, and thanks to thin-provisioning and underlying copy-on-write mechanisms, no disk space is ever reserved upfront. Feel free to allocate 500 GB of disk; it'll never get consumed until you actually store data on it!
Like any other service, Kompute instances are bound to a specific project, and consequently its associated subnet, sealing them off from other projects' reach. Private and public interface IP addresses are automatically assigned by Kahuna, as defined by administrators, making them ready to be consumed by end-users.
4.7 - Konvey
Kowabunga Network Load-Balancer
Konvey is a plain simple network Layer-4 (UDP/TCP) load-balancer.
It’s only goal is to accept remote traffic and ship it back to one of the many application backend, through round-robin algorithm (with health check support).
Konvey can either be used to:
load-balance traffic from private network to private network
load-balance traffic from a public network (i.e. Internet) to a private network, in association with Kawaii. In such a scenario, Kawaii holds the public IP address exposure and routes public traffic to Konvey instances through NAT settings.
As with Kawaii, Konvey uses Kowabunga's concept of Multi-Zone Resources (MZR) to ensure that, when the requested region features multiple availability zones, a project's Konvey instance gets created in each zone, making it highly resilient.
Warning
Being a Layer-4 network load-balancer, Konvey passes any SSL/TLS traffic through to the configured backends. No traffic inspection is ever performed.
4.8 - Kosta
Kowabunga Object Storage Service
This service is still work-in-progress
4.9 - Kryo
Kowabunga Backup and Cold Storage
This service is still work-in-progress
4.10 - Kylo
Kowabunga Distributed Network File System
Kylo is Kowabunga’s incarnation of NFS. While all Kompute instances have their own local block-device storage disks, Kylo provides the capability to access a network storage, shared amongst virtual machines.
Kylo fully implements the NFSv4 protocol, making it easy for Linux instances (and even Windows) to mount it without any specific tools.
Under the hood, Kylo relies on an underlying CephFS volume, exposed by Kaktus nodes, making it natively distributed and resilient (i.e. one doesn't need to add HA on top).
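From any instance inside the project's private network, mounting a Kylo share is then standard NFSv4 fare (the endpoint name and mount point are hypothetical):
$ sudo mkdir -p /mnt/kylo
$ sudo mount -t nfs4 kylo.acme.local:/ /mnt/kylo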
Warning
Kylo access is restricted to project’s private network. While all your project’s instances can mount a Kylo endpoint, it can’t be reached from the outside.
Hope is not a strategy; wish for the best, but prepare for the worst.
We’re working hard to make Kowabunga as resilient and fault-tolerant as possible but human nature will always prevail. There’s always going to be one point in time where your database will get corrupted, when you’ll face a major power-supply incident, when you’ll have to bring everything back from ashes, in a timely manner …
Breath up, let’s see how we can help !
5.1 - Ceph
Troubleshooting Ceph storage
Kaktus HCI nodes rely on Ceph for underlying distributed storage.
Ceph provides both:
RBD block-device images for Kompute virtual instances
CephFS file systems for Kylo network shares
Ceph is awesome. Ceph is fault-tolerant. Ceph hashes your file objects into thousands of pieces, distributed and replicated over dozens if not hundreds of SSDs on countless machines. And yet, Ceph sometimes crashes or fails to recover (even though it has incredible self healing capabilities).
While Ceph perfectly survives occasional node failures, try having a complete network or power-supply outage in your region and you'll figure it out ;-)
So let's see how we can restore a Ceph cluster.
Unable to start OSDs
If Ceph OSDs can’t be started, it is likely because of un-detected (and un-mounted) LVM partition.
A proper mount command should provide the following:
$ mount | grep /var/lib/ceph/osd
tmpfs on /var/lib/ceph/osd/ceph-0 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-2 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-1 type tmpfs (rw,relatime,inode64)
tmpfs on /var/lib/ceph/osd/ceph-3 type tmpfs (rw,relatime,inode64)
If not, it means the /var/lib/ceph/osd/ceph-X directories are empty and the OSDs can't run.
Run the following command to re-scan all LVM partitions, remount and start OSDs.
$ sudo ceph-volume lvm activate --all
Check the mount output (and/or re-run the command) until all target disks are mounted.
Fix damaged filesystem and PGs
In case of health error and damaged filesystem/PGs, one can easily fix those:
$ ceph status
cluster:
id: be45512f-8002-438a-bf12-6cbc52e317ff
health: HEALTH_ERR
25934 scrub errors
Possible data damage: 7 pgs inconsistent
Isolate the damaged PGs:
$ ceph health detail
HEALTH_ERR 25934 scrub errors; Possible data damage: 7 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 25934 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 7 pgs inconsistent
pg 2.16 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,11]
pg 5.20 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,4]
pg 5.26 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,3]
pg 5.47 is active+clean+scrubbing+deep+inconsistent+repair, acting [2,9]
pg 5.62 is active+clean+scrubbing+deep+inconsistent+repair, acting [8,1]
pg 5.70 is active+clean+scrubbing+deep+inconsistent+repair, acting [11,2]
pg 5.7f is active+clean+scrubbing+deep+inconsistent+repair, acting [5,3]
Proceed with PG repair (iterate on all inconsistent PGs):
$ ceph pg repair 2.16
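To iterate over all the inconsistent PGs listed above in one go, a small shell loop helps (PG IDs taken from the ceph health detail output):
$ for pg in 2.16 5.20 5.26 5.47 5.62 5.70 5.7f; do ceph pg repair "$pg"; done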
and wait until everything’s fixed.
$ ceph status
cluster:
id: be45512f-8002-438a-bf12-6cbc52e317ff
health: HEALTH_OK
MDS daemon crashloop
If your Ceph MDS daemon (i.e. CephFS) is in a crashloop, probably because of corrupted journal, let’s see how we can proceed:
Get State
Check the global CephFS status, including the client list, the number of active MDS servers, etc.:
$ ceph fs status
Additionally, you can get a dump of all filesystems to inspect the MDS daemons' status (laggy, replay …):
$ ceph fs dump
Prevent client connections
If you suspect the filesystem to be damaged, the first thing to do is to prevent any further corruption.
Start by stopping all CephFS clients, if they're under your control.
For Kowabunga, that means stopping NFS Ganesha server on all Kaktus instances:
$ sudo systemctl stop nfs-ganesha
Then prevent all client connections from the server side (i.e. Kaktus).
We assume the filesystem name is nfs:
$ ceph config set mds mds_deny_all_reconnect true
$ ceph config set mds mds_heartbeat_grace 3600
$ ceph fs set nfs max_mds 1
$ ceph fs set nfs refuse_client_session true
$ ceph fs set nfs down true
Stop server-side MDS instances on all Kaktus servers: