exclusive worker tokens with NATS

At kraud we of course use several orchestration engines. During the process of retiring Hashicorp Nomad, a few exciting new things were created that i hope to open source soon.

The core of it is a bunch of workers, and items to be worked on. A fairly standard work queue situation. Currently everything is built around notifying a worker through redis, and then holding a lock in redis for that particular resource. Redis is doing a great job at all of that, but somehow i felt like exploring the new(ish) thing for reliable notifications, nats.io's jetstream, and then realized that nats can now also do locks!

locking with nats.io

the maintainer's comment here describes the solution for building a lock with NATS.

The idea is to have a stream with DiscardNewPerSubject, which prevents new items from being published to a subject that already holds a value.

nats stream add  locks --defaults --discard-per-subject \
    --subjects='locks.*' --storage=memory \
    --discard=new   --max-msgs-per-subject=1

nats req locks.bob red # lock
nats req locks.bob red # will fail

nats stream rmm locks 1 #clear the lock

in a distributed system you would also have to think about expiry and refreshing, which would be done with the MaxAge stream setting. but then exactly what i had hoped for happened: i discovered an entirely new method of building workers in the design of nats.
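for example, the lock stream could be recreated with a 30 second expiry (the value here is arbitrary, --max-age is the cli counterpart of the MaxAge setting):

nats stream add  locks --defaults --discard-per-subject \
    --subjects='locks.*' --storage=memory \
    --discard=new --max-msgs-per-subject=1 \
    --max-age=30s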

exclusive work tokens

in addition to locks, you would also have to notify a worker that new work is available. that's trivially done in either redis or nats, although i feel like nats has a slight edge here: with jetstream it can actually guarantee that a message is delivered, which simplifies retry on the application side.

But what if we just notified one worker and then made sure to not notify any other worker until the first one is done? That's essentially a classic token ring/bus/whatever architecture, where a token is handed around that makes the holder eligible to access a shared resource.

We already built exactly that in the previous step. We just have to use it the other way around.

nats stream add  work --defaults --discard-per-subject \
    --subjects='work.*' --storage=memory \
    --discard=new  --max-msgs-per-subject=1 \
    --retention=work

the first difference is retention=work, which means that any acked message is discarded. since we only allow one item per subject, this means the subject is blocked until that item is acked.

now instead of locking an item from the worker, we push the work into the subject from the requester

nats req work.car1 paint # ok
nats req work.car2 paint # ok
nats req work.car1 tires # nope, car1 is already being worked on

workers then consume work items

nats con add work worker --ack=all --wait=5s --target="worker" --defaults --deliver-group=workers

nats sub worker --queue=workers
nats sub worker --queue=workers

nats will deliver the message to only one random worker. If that worker fails to ack the work in time, it is redelivered to another random worker until an ack finally clears the message. Since we only allow one message per subject to be queued, this means only one worker will ever work on it at a time. Of course the worker can receive work on other subjects in the meantime, just not on this one.

In this cli example the timeout is fixed, and the worker cannot really tell nats to hold back redelivery until it's done, but i'll get to a full code example in a second where it can.

but queueing?

Most of the time you just want to set-and-forget work items into a queue and not wait for the workers to be available. nats has another elegant construct we can use for that: sources.

We can just have another stream that allows more than one message per subject

nats stream add  inbox --defaults --discard-per-subject \
    --subjects='work.*' --storage=memory \
    --discard=new  --max-msgs-per-subject=10

and then create the work stream with that one as source

nats stream add  work --defaults --discard-per-subject \
    --source='inbox' --storage=memory \
    --discard=new  --max-msgs-per-subject=1 \
    --retention=work

now publishing multiple work items on the same subject is allowed

nats req work.car1 paint # ok
nats req work.car1 tires # ok
nats req work.car1 battery # ok

yet, only one will be actually worked on at a time

nats stream report
╭─────────────────────────────────╮
│              Stream Report      │
├──────────┬───────────┬──────────┤
│ Stream   │ Consumers │ Messages │
├──────────┼───────────┼──────────┤
│ work     │         1 │ 1        │
│ inbox    │         0 │ 3        │
╰──────────┴───────────┴──────────╯

once the item in work is acked, nats will automatically feed the next one from inbox.

implementation in go

now as promised, here's a complete example in golang including the ability to hold an item in the queue for longer than its ack timeout.

package main

import (
    "fmt"
    "github.com/nats-io/nats.go"
    "time"
)

func main() {
    nc, err := nats.Connect("localhost")
    if err != nil {
        panic(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        panic(err)
    }

    // first create an inbox queue holding the latest state of work to be done
    // values in here are replaced when new work on the same topic is submitted
    _, err = js.AddStream(&nats.StreamConfig{
        Name:     "inbox",
        Subjects: []string{"inbox.*"},

        MaxMsgsPerSubject: 1,
        Discard:           nats.DiscardNew,
    })
    if err != nil {
        panic(fmt.Sprintf("Error creating jetstream [needs a nats-server with -js] : %v", err))
    }

    // items are moved from the inbox into a token lock.
    // these are held by a worker until its done and only THEN a new value is pulled from the inbox.
    // if the worker fails to ack the item, it is resent to a different worker
    _, err = js.AddStream(&nats.StreamConfig{
        Name: "work",
        Sources: []*nats.StreamSource{
            {
                Name: "inbox",
            },
        },

        MaxMsgsPerSubject: 1,
        Discard:           nats.DiscardNew,

        // this means you cant update a running token
        DiscardNewPerSubject: true,

        // an ack deletes the message and frees the topic for new work
        Retention: nats.WorkQueuePolicy,
    })
    if err != nil {
        panic(fmt.Sprintf("Error creating jetstream [needs a nats-server with -js] : %v", err))
    }

    // push the token into a delivery group
    _, err = js.AddConsumer("work", &nats.ConsumerConfig{
        Durable:        "work",
        DeliverSubject: "work",
        DeliverGroup:   "workers",
        DeliverPolicy:  nats.DeliverAllPolicy,
        AckPolicy:      nats.AckExplicitPolicy,
        AckWait:        30 * time.Second,
        Heartbeat:      time.Second,
    })
    if err != nil {
        panic(fmt.Sprintf("Error creating jetstream consumer : %v", err))
    }

    ch := make(chan *nats.Msg, 64)
    _, err = nc.ChanQueueSubscribe("work", "workers", ch)
    if err != nil {
        panic(err)
    }

    for msg := range ch {

        if len(msg.Reply) == 0 {
            // not jetstream, probably keepalive
            continue
        }

        fmt.Println(msg.Reply)

        fmt.Printf("Received a message on %s: %s\n", msg.Subject, string(msg.Data))
        go func(msg *nats.Msg) {
            for i := 0; i < 60; i++ {
                fmt.Println("working...")
                rsp, err := nc.Request(msg.Reply, []byte("+WPI"), time.Second)
                if err != nil {
                    // lost lock, stop immediately or we risk working in parallel
                    panic(err)
                }
                fmt.Println("got in progress response", string(rsp.Data))
                time.Sleep(1 * time.Second)
            }
            fmt.Println("done")
            msg.Ack()
        }(msg)
    }
}
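to feed the worker above, a requester simply publishes into the inbox subjects, for example with the cli (note that the go example uses inbox.* for the inbox stream instead of work.*):

nats req inbox.car1 paint
nats req inbox.car2 paint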

ipv4 docker overlay networks are now supported

Each tenant on kraud has at least one VPC, a "virtual private cloud" as the industry term goes. VPCs are fully isolated from other VPCs, and can span across multiple datacenters, including on-prem.

using the ipv6 VPC network

Each pod can connect to every other pod within the same VPC (again, also across data centers) over a wireguard encrypted channel. A VPC is a layer 3 ipv6 transport, which is the most efficient way to do routing between many different sites and participants.

Domain names of pods resolve to their ipv6 vpc address by default. You can simply start two pods in the same namespace, let's say "db" and "app", and then connect to "db" from the app. This works out of the box with almost all modern software.

You can also connect to an app in a different namespace using its fqn, i.e. "app"."namespace"
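for example (pod and namespace names are made up here), a pod in another namespace could reach a "db" pod living in the "nextcloud" namespace like this:

docker --context kraud.aep exec app ping db.nextcloud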

using ipv4

Feedback from you, our users, included the feature request to use ipv4 between pods instead of ipv6, because some legacy applications do not support v6. Also some docker compose setups include multiple networks with fixed ip addresses that do not work well with dynamically assigned v6 addresses.

Starting today, every namespace additionally has a default ipv4 overlay assigned. It is implemented using ip6ip on top of vpc, so it's not as efficient as the default vpc routing, but it allows for convenience features such as assigning fixed ipv4 addresses.

You will notice that docker --context kraud network ls now returns slightly different output. Since the docker cli lacks a large number of features, we need to condense the information a bit

$ docker --context kraud.aep network ls
NETWORK ID     NAME                      DRIVER    SCOPE
v1223123123    nextcloud/default         ipip      global

the first part is the namespace, the same thing returned by kra ns ls

$ kra ns ls
NAME             CREATED              
nextcloud        2023-09-11 12:16:30

the second part is the overlay network

$ kra overlay ls
aid           namespace        name     driver  net4             net6                   
v1223123123   nextcloud        default  ipip    10.198.62.0/23   fd42:1a8:0:2a09::/64

as you can see, this one has an ipv4 subnet assigned. you can specify which one!

docker-compose.yaml
networks:
  v:
    driver: ipip
    ipam:
      config:
        - subnet: 192.168.0.0/16
services:
  k2:
    ...
    networks:
      v:
        ipv4_address: 192.168.2.0/24

hopefully this brings us closer to direct docker compose compat, and helps you with legacy applications. We're always happy to hear about feedback and feature requests. If you need help, don't hesitate to contact [email protected]

Full Kubernetes is now generally available

get your very own real k8s for free with kraud.cloud

Good news, kubernetes is now available for everyone as the k8s app.

There's a free tier version with a single node and v6 addresses, as well as a version that has 3 nodes and uses v4 addresses. As you probably know, v4 addresses are rare, so they're not available in the free tier.

Deploying k8s on top of kraud gives you access to the full original k8s experience, including the ability to install CNI plugins and custom resource definitions.

You can also download and change the compose spec and launch k8s with kra

kubectl and compose are incompatible use cases

Thanks to the helpful feedback of our users we have learned that the original idea of emulating the k8s control plane is not the experience you're looking for.

The kraud k8s control plane is not fully compatible with the k8s ecosystem, and will always lag behind k8s upstream. With the availability of real k8s as an app, we have decided to completely stop working on kubectl compatibility for the kraud control plane. kubectl may function for a while until we completely remove the api.

We will instead further enhance docker compose compat. For example just recently we introduced the ability to use overlay networks, which k8s does not support.

Going forward there are 4 ways to use kraud:

  1. launch or publish a complete app on the marketplace
  2. launch k8s from the marketplace and work with the k8s ecosystem
  3. launch your own app on managed docker compose clusters with kra
  4. launch a bare linux VM and use legacy tooling

as always we love your feedback and wish you an efficient computing experience.

NFS is production ready | LV2 will have lower guarantees

This year we introduced NFS for highly available file volumes and LV2 for very fast local block storage.

NFS is worry-free, just use NFS

NFS is sufficiently complete that it's the second service to be marked production ready in the product specification and the first to come with a money-back guarantee. It is fully single-zone redundant with active/passive failover and automatic snapshots. This is the default storage everyone should be using. It offers complete worry-free file storage that works with most docker containers out of the box. NFS is also accessible via webdav and S3. This allows easy collaboration on file based workloads.

Later this year we're planning to make snapshots explicitly controlled by users, so that you can instantly rollback yourself instead of asking customer support to do it.

LV2 requires special attention

Some customers prefer "virtual disks" instead of file storage. The primary advantage is that they're generally faster for single process workloads. There are several different ways to implement "virtual disks". If a competitor does not tell you which technology they use, it is very likely what we call LV1, that is locally redundant (RAID) storage directly attached to the host. In case of a host failure, the volume becomes unavailable and the associated pod will not be rescheduled.

With LV2 we tried to approach high availability and guarantees similar to our HA system, but ultimately the two worlds are just too dissimilar, and customers have much lower expectations. Instead we are reducing the high availability of LV2 for the benefit of much quicker launch times. In case of a host failure, users may choose to either restore from cold storage with a Recovery Point Objective of 1 hour, or wait for manual recovery of the host (usually within a business day).

If you look out there you may notice that services like AWS EBS or Hetzner's CEPH volumes offer much lower RPO than kraud's LV2. This is because they're network volumes, not directly attached. These products are very unpopular, so we currently do not offer them. The primary disadvantage is that network storage has a significant latency penalty, at which point you might just use NFS.

However, if you would like to see zero RPO block volumes at kraud, let us know!

open sourcing kraud cradle

I'm super excited to announce that we're making the source code for our microvm launcher available to the public. It's all out there in the GitHub repo

cradle, the green microvm pod launcher

Cradle is how we isolate pods from each other on the kraud platform using microvms in KVM. The repo should contain enough of the host implementation to enable anyone to launch docker containers in isolation.

bootup log

In contrast to other solutions such as firecracker, cradle is intended to run existing containers without changes. Kraud is primarily focused on creating a docker like experience in the cloud, for maximum developer comfort.

It is not "cloud native" compliant and does not fit into the k8s ecosystem. Instead cradles mission is to start a docker container within 100ms on an arbitrary host machine, so we can schedule workloads dynamically without the massive economic and ecological cost of operating the infrastructure usually required to do that.

trust no cloud, verify and attest

While this is cool and interesting tech, the primary purpose of publishing cradle as open source is to enable our customers to verify the complete boot stack using AMD SNP. Everything that runs inside the VM is open and can be attested to be exactly what we claim it to be.

This, in addition to purchasing on-premise cloud machines, enables our customers to enjoy the flexibility and comfort of a cloud service while also staying in full control of their data.

Running an attested confidential docker container locally is a bit involved, since this is not intended as a finished product.

Assuming you have an SNP host setup:

sed -e 's/VARIANT=default/VARIANT=snp/g' -i Makefile
make
docker pull mariadb
kcradle summon /tmp/vm mariadb
ln -sf pkg-snp /tmp/vm/cradle
kcradle run /tmp/vm

For the high level confidential compute stack such as remote attestation, we recommend enclaive.

get ready to move your pods to Berlin

Backstory

Earlier this year we semi-announced the intention to move to a new datacenter location. Kraud is on a mission to build carbon negative infrastructure, and the current datacenter provider misrepresented the origin of their electricity in a way that does not align with our values.

The new location is 3U Telecom in Berlin. 3U Holding, the parent company of 3U Telecom, is heavily invested in renewable energy, and we hope to have found a partner that shares our vision.

We also picked new network partners, inter.link, Core-Backbone, and BCIX, significantly increasing the connection quality of the new site.

We spent half a year building out the new site with an entirely new storage architecture based on ZFS. The old site had extremely poor IOPS and frequent total cluster freezes due to bugs in ceph. We can now offer LV2 storage with half a million IOPS, and automatic instant snapshots. A huge shift away from thinking first about compute to thinking first about storage.

The old site will be shut down in July, so please prepare your workloads asap.

TLDR

Moving pods with no storage attached is easy. Just delete the container and relaunch with the zone label -l kr.zone=uca. The new site will become the default in July, after which you need to specify -l kr.zone=yca to launch a pod in the legacy zone.
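for example (container and image names are placeholders), relaunching a stateless pod in the new zone could look like this:

docker --context kraud.aep rm -f myapp
docker --context kraud.aep run -d --name myapp -l kr.zone=uca myimage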

If you have storage attached, moving is slightly more involved. We currently recommend moving from rbd to lv2 and from gfs to nfs, as this will most likely match your expectations.

  1. stop all pods
  2. create a new volume with docker volume create myvol --driver nfs
  3. copy all the files from the old to the new volume, for example by using webdav or with docker exec, depending on your use case (see the sketch after this list)
  4. delete the old volume with docker volume rm oldvol
  5. recreate your pods with -l kr.zone=uca
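a possible sketch of step 3, copying between the volumes with a temporary container (volume and image names are placeholders):

docker --context kraud.aep run --rm -v oldvol:/old -v myvol:/new alpine cp -a /old/. /new/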

if you need assistance, don't hesitate to ask on discord or send an email to [email protected]

Berlin is a huge step forward

with the new 100G infiniband SAN, we're starting a new chapter of thinking about storage as a primary feature. Data is critical to all apps, and we want to make sure to offer a great experience in that regard too.

Being connected directly to an IXP also means we're now an actual proper datacenter with redundant paths to the internet and high speed direct peering to a significant chunk of the European internet.

Finally, i personally live in Berlin, which of course had a huge impact on the choice of locations. Having direct access to the machines enables me to run new and exciting hardware features like GPU and risc-v.

If you're a free tier user or paid customer, i thank you for being on board this adventure. I hope you're happy with our new site. Let me know if you have comments or feedback.

picture of berlin server rack

open sourcing kraud bali

github: https://github.com/kraudcloud/bali

Internally at kraud we use lots of things to deploy cloud and on-prem machines. Old school dpkg and ansible is a whole different world than docker, kubernetes, etc.

One of the things that's been missing is something in the middle, where we can configure and deploy services in a consistent and repeatable way like docker, but without depending on high level services like docker being available yet.

docker build is the standard

There are countless innovative approaches to building consistent system services, nix being a popular one for example. But docker build is really the standard. It's not elegant, but it's simple, which is very important in an operational environment where you need to quickly understand a large number of moving parts while all the red alarm lights are blinking.

docker usually builds images in layers, which is an unfortunate design choice. distributing those layers also requires a docker registry. The registry is a docker container with persistent storage. The recommended k8s architecture is bootstrapping k8s from a registry that runs on k8s. But i'd rather sleep well at night, so we're not doing that.

Instead bali builds docker images to a single tarball. They can be distributed using fairly standard tools, even replicated. They're also ephemeral, meaning no cleanup is required after run.

signed images

security is critical in cloud operations. Users give up control over the low level infrastructure and in turn expect us to do it correctly, of course. docker content trust exists but it's ... weird and i'm not entirely sure if it's even cryptographically correct.

Fortunately, single files are easy to sign, so bali signs any image it builds by default with a local ed25519 key. The signature is embedded, so that the image still works as a normal tar.gz you can extract with any other tool.

bali run can then yolo-get images from any untrusted storage, but checks the signature on open. This lets us keep distribution of images as low-tech as possible, so we can rely on it to bootstrap higher level systems.

zero isolation by design

docker was designed back when containers were not really a thing. it can keep poorly behaving software in check fairly well, and it makes things very convenient for the user by hiding a lot of the complexity required to make that work.

When bootstrapping infrastructure, both of those features are counterproductive. While you can run docker with --net=host for example, it still creates a namespace anyway, interfering with our own namespacing for vpcs. The only namespace we really want is the filesystem one, because it makes it convenient to just apk add dependencies without having to think about new build systems like nix.

An even bigger issue for us with docker was the daemon. When you do docker run, the calling process is really just an rpc client to the docker daemon, which then runs the container. This means, for example, that SIGKILL to the docker run process will not kill the container; it will just kill the docker cli, leaving the service dangling. Any cgroup limits etc also do not actually apply to the system service. docker has its own solutions for each management task, but they're often intended for developers rather than operators.

bali instead calls execve after it's done with the setup. It becomes the process you originally wanted to run. This means any context coming from the caller directly applies to the service, making bali work great inside systemd units, nomad, etc.

alternatives

bali is trivial by design, only intended for our narrow use case. there are other systems for wider use cases. For example, consider systemd-nspawn for services that have a persistent root filesystem on the same machine.

We're happy to share bali with the community if it's useful to anyone else.

Introducing local storage tier

we're adding a new storage tier: LV

kraud currently has 3 different storage types

  • GP: our general purpose ceph cluster with 100TB of consumer grade SSDs.
  • GFS: global shared filesystem, very useful for just having files available in multiple pods concurrently. This is backed by GP with cephfs MD on top.
  • RED: a data garbage bin using 200TB of spinning consumer disks. You should not be using this unless you want really cheap, large, slow storage.

The new LV tier will aid users of legacy applications that are built for more traditional virtual server deployments. It is backed by a raid 1 of enterprise NVME drives and peaks at 1 million IOPS. A 4TB volume will have 3GB/s of dedicated bandwidth and can request pcie passthrough for low latency, while smaller allocations share the bandwidth and IOPS.

Unlike GP, LV does not survive host failure, meaning the loss of a host will result in the volume becoming unavailable. During the last (bad) month this would have resulted in 99.1% uptime, unlike GP which had 99.9% uptime. There's a residual risk of total loss, due to the nature of being electrically connected to the same chassis. We advise users to build their own contingency plan, similar to what you should have done with competing virtual machine offers.

While LV has very little benefit for modern applications, it pairs well with traditional VMs and will become the default storage for VMs in the kraud marketplace.

ceph let us down(time)

Most kraud users would rather not bother with the details of how storage works. This is, after all, what we've built kraud for. However, as you may have noticed, we haven't been doing great in terms of uptime recently (still better than Azure, lol), and this is due to storage. To reach our goal of carbon negative computing while also taking in zero venture funding, we must navigate the difficult path of serving a variety of incompatible workloads.

Kraud is all about energy saving, so we use lower clocked EPYC cpus with the highest possible compute-per-energy efficiency. Ceph was built for high clock speed XEONs with very little regard for energy efficiency, so it does not perform great in this scenario.

Adding to that, we treat physical servers as expendable and built all of our software to recover gracefully from losing a node. That allows building machines for a third of the cost of traditional OEMs like HP, etc. We use things like cockroachdb, which performs well under frequent failures. While ceph does also recover from such an event, it does NOT do it as gracefully as you'd hope for, resulting in several minutes of downtime for the entire cluster every time a single node fails.

As i keep saying, high availability is the art of turning a single node incident into a multi node incident.

In summary, ceph is not the correct solution for the customer group that is currently the most important to the company's survival (paying customers, yes). This is why we're moving that customer group away from ceph, so GFS can return to its previous slow-but-stable glory.

In the future, once we become big enough, we hope to deliver a custom built storage solution that can work well within the energy targets.

Thank you for your patience and for joining us on this critical mission towards carbon negative compute.

Introducing the kraud cli: kra

From the beginning of the project we have always strived for compatibility with your existing tools, be it docker or kubectl. Your feedback is always greatly appreciated, as it helps us clarify what exactly that means in practice. How much compat is good, and where do the existing tools not work?

We haven't reached a stage where this is entirely clear yet, but the trend is pointing towards

  • Fully supporting the docker cli
  • Building a custom cli to supplement docker
  • Freezing kubectl at 1.24
  • Partially supporting the most popular of the many incompatible docker compose variants

kubectl in particular is a difficult choice. Kubernetes is supposed to be a standard, but unfortunately it's not actually one, and keeping up with upstream does not seem feasible at the moment.

Instead we will shift our focus entirely to supporting docker and docker compose. The compose spec is weird and inconsistent, but it is simple and hence very popular. Most of the confusion we've seen in practice is easily addressable with better tooling.

So we are introducing: kra

The kra commandline program works on docker-compose files and will implement some of the processes that docker does not do at all (ingress config currently requires kubectl) or does incorrectly (local file mounts).

Specifically a pain point in some user setups has been CI. Since we don't support docker build yet, users build on the ci machine and then use docker load. This is slow, because the docker api was never intended to be used remotely.

kra push, on the other hand, is very fast and should be used in CI instead.

github CI example

here's a typical .github/workflows/deploy.yaml

name: Deploy
on:
  push:
    branches: [  main, 'staging' ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: 'deploy to kraud'
        env:
          KR_ACCESS_TOKEN: ${{secrets.KR_ACCESS_TOKEN}}
        run: |
          curl -L https://github.com/kraudcloud/cli/releases/latest/download/kra-linux-amd64.tar.gz  | tar xvz
          sudo mv kra /usr/local/bin/

          # get credentials for docker cli
          kra setup docker

          # build the images locally
          docker compose build

          # push images to kraud
          kra push

          # destroy running pod
          # this is currently needed until we fix service upgrades
          docker --context kraud.aep rm -f myapp

          # start new pods with new images
          docker compose up -d

kra is open source and available here: https://github.com/kraudcloud/cli. We're always happy for feedback. Feel free to open github issues or chat with a kraud engineer on discord.

screenshot

deployment screenshot

vdocker moved to cradle

Vdocker is what we call the thing that responds to docker api calls like log, attach, exec, and cp.

Vdocker used to run on the host, and all the commands were carefully funneled through a virtio-serial. The advantage was that cradle stays small and starts faster. However, we realized most people start fairly large containers that take a few seconds to start anyway, so a sub-100ms startup time for cradle is no longer a priority.

Instead we traded a few milliseconds of start time for much higher bandwidth by moving vdocker directly into cradle. It listens inside of your pod on port 1 and accepts the necessary docker commands from the api frontend. docker cp now works properly and is much faster. Also docker attach no longer glitches.
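for example (container name and path are placeholders), copying a log file out of a running pod now goes straight to vdocker inside the pod:

docker --context kraud.aep cp myapp:/var/log/app.log .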

Unfortunately this means docker run feels slower, although it really hasn't changed much. Log output starts appearing roughly 80ms after the download completed, but for larger containers it may take several seconds to download the layers, which you currently cannot see.

On the upside, all other commands now feel a lot faster, because we skip the vmm and just proxy the http call directly to vdocker.