← Back to Index

Devlog 002

Posted Jan 3rd, 2024.

Happy New Year! It's been roughly a month and a half since my last "weekly" devlog; each time I sat down to write it, I'd always get carried away working on something else, pushing this devlog back further and further. Going forward, I'd rather release larger, more in-depth devlogs less frequently than release half-assed ones each week.

High level updates:

The remainder of this devlog will expand on each of these higher-level updates, and go in-depth into certain areas of interest.

Zig Port of LevelDB

For reasons mentioned in the last devlog, I spent a good bit of time understanding and porting LevelDB over to Zig (it was originally written in C++). Although this was probably a mistake in hindsight (and has delayed the launch of lovely software's first product significantly), it was certainly a fun exercise to a) get better at writing Zig code and b) understand LevelDB from the ground up. It's a beautiful piece of software, and trying to re-create it with the same level of elegance and performance is, well, difficult.

Since it is a near-direct port (with some changes outlined below), I thought it might be interesting to take some time to dive deeper into the implementation differences (and yes, with some caveats, benchmarks will be included).

Differences of note:

In terms of performance, before showing any benchmarks, I want to give a few legal disclaimers:

All benchmark code (for the Zig version) can be found here.

Baseline LevelDB performance:

LevelDB:    version 1.23
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
fillseq      :       2.267 micros/op;   48.8 MB/s
fillsync     :    4002.117 micros/op;    0.0 MB/s (1000 ops)
fillrandom   :       3.714 micros/op;   29.8 MB/s
overwrite    :       4.419 micros/op;   25.0 MB/s
readrandom   :       3.180 micros/op; (864322 of 1000000 found)
readrandom   :       2.846 micros/op; (864083 of 1000000 found)
readseq      :       0.097 micros/op; 1143.4 MB/s
readreverse  :       0.191 micros/op;  580.6 MB/s
compact      :  553663.000 micros/op;
readrandom   :       2.209 micros/op; (864105 of 1000000 found)
readseq      :       0.092 micros/op; 1196.7 MB/s
readreverse  :       0.176 micros/op;  628.7 MB/s
fill100K     :     618.928 micros/op;  154.1 MB/s (1000 ops)
crc32c       :       0.973 micros/op; 4014.2 MB/s (4K per op)
(compression benchmarks omitted)

Zig port of LevelDB:

Performing 14 benchmark steps:
fillseq         1.988 us per op         (55.624 MiB/s)
fillsync        23.157 us per op        (4.777 MiB/s)
fillrandom      2.566 us per op         (43.102 MiB/s)
overwrite       3.158 us per op         (35.028 MiB/s)
readrandom      6.433 us per op
readrandom      6.501 us per op
readseq         0.328 us per op         (337.219 MiB/s)
readreverse     0.492 us per op         (224.674 MiB/s)
compact         433641.500 us per op
readrandom      3.391 us per op
readseq         0.189 us per op         (582.270 MiB/s)
readreverse     0.312 us per op         (354.024 MiB/s)
fill100K        281.926 us per op       (338.324 MiB/s)
crc32c          1.764 us per op         (2.162 GiB/s)

Some general takeaways:

Work on the DB will continue in the background as needed, but the focus these days has moved away from it and up a level, i.e. the application layer. Specifically, I've built two applications with the DB so far, which'll be discussed below.

Web Server

When I started working on Lovely, I made the (perhaps irrational) call to build every bit of it from scratch, with zero software dependencies (1). This means no imports, no libraries, nothing. Following this philosophy, it didn't feel right to use a pre-built web server to host the website. Thus, the Lovely Web Server (lws) was born.

There hasn't been much effort into optimizing lws for performance, rather, the focus has been on developer and user experience. Namely, it does a few things that other web servers don't do:

Not the main priority at the moment, but there's a bit of work to do on lws still:


Filesystem

The major focus of the last little while has been the design and implementation of a new "filesystem", or at least something akin to one. I hesitate to call it a filesystem because I don't think it really counts as one: it sits atop an existing traditional filesystem and implements additional functionality on top (in addition to, eventually, a mountable FUSE interface). Perhaps calling it a "virtual filesystem" or something of the like is more accurate. Regardless, I want to spend a good bit of time going through its design, since it's really cool.

Before going into the design itself, it's worth mentioning and introducing Perkeep. It's a fantastic piece of software originally built by Brad Fitzpatrick, and its design underpins much of this filesystem's design. Specifically, files in Perkeep are modeled as "blobs" which are simply little content-addressable chunks of bytes (conceptually, think of a file whose name uniquely identifies its contents). Some blobs are special (i.e. schema blobs), and give context as to the meaning of other blobs.

There are a lot of advantages to "blob-based" systems, since, well, blobs are easy to work with. Conceptually, you can think of a blob as a file, and a standard Perkeep deployment could manage millions of blobs. If two blobs contain the same content, then by definition they'll have the same name, and thus only one copy needs to be stored (which is to say, deduplication is built-in by design). Backing up blobs is also easy, since you don't have to be smart about it; you can just copy + paste blobs to as many places as you want, and as long as at least one copy of every blob is present somewhere, all your data is intact. Many Perkeep users routinely dump blobs onto a hard drive and move them to different machines, or use cloud-based services (i.e. S3) to back them up with more redundancy.
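The deduplication-by-design property falls out of content addressing almost for free. Here's a minimal sketch in Go (Perkeep itself is written in Go): a toy blob store whose refs are derived from the blob's bytes. The `BlobStore` type and its methods are mine for illustration, not Perkeep's actual storage API, though the `sha256-<hex>` ref naming mirrors Perkeep's convention.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// BlobStore is a toy content-addressable store: each blob is keyed by a
// ref derived from its bytes. Illustrative only, not Perkeep's API.
type BlobStore struct {
	blobs map[string][]byte // ref -> content
}

func NewBlobStore() *BlobStore {
	return &BlobStore{blobs: make(map[string][]byte)}
}

// Put stores content under its content-derived ref and returns the ref.
// Identical content always hashes to the same ref, so it is only ever
// stored once: deduplication is built in by construction.
func (s *BlobStore) Put(content []byte) string {
	ref := fmt.Sprintf("sha256-%x", sha256.Sum256(content))
	s.blobs[ref] = content
	return ref
}

func main() {
	s := NewBlobStore()
	a := s.Put([]byte("hello, blobs"))
	b := s.Put([]byte("hello, blobs")) // same bytes => same ref, stored once
	fmt.Println(a == b, len(s.blobs))  // true 1
}
```

The backup story follows from the same property: since a ref commits to its content, copying the `blobs` map anywhere (a hard drive, S3) can never produce conflicting versions, only redundant identical copies.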

Perkeep is designed as a backup system, though with a few tweaks, I'm reasonably confident it can be adapted into a FUSE-compliant filesystem with some significant advantages (in some areas) over more traditional filesystems. Specifically, since blobs are never deleted, and file operations are modeled as a sequence of time-series mutations to objects, it's possible to "time travel" (aka scrub forward and backward in time) through all versions of your files and directory layout.

The most important contribution Perkeep made was, in my opinion, permanodes. A permanode is simply a blob that contains a random string of data:

    {
        "random": "dsfj39resf239fj230f23fsdf439058",
    }

Note: Many fields have been omitted for clarity.

Conceptually, you can imagine a permanode as an object; some "thing" that can be mutated and changed, but exists in some space. Mutations to permanodes take the form of claims, i.e. a set of attribute changes applied to a given permanode, at a given time:

    {
        "mutations": [
            {
                "permanode": "sha256-xxx",
                "action": "set",
                "attribute": "foo",
                "value": "bar"
            }
        ],
        "timestamp": "2023-12-13T17:03:43.741977533Z"
    }

Again, I've removed a few extra fields to help the reader focus on the important part. The permanode field inside the mutation specifies which permanode (by ref, aka file name) this mutation touches, and the rest of the fields define some change in the permanode's attributes.

If we consider a single file as a unique permanode, then claims are operations done on that file (aka changing its contents, renaming it, etc). If we're clever about how we model these changes, we can implement all of the core filesystem operations, while preserving the all-important time-travel property we're after.


Thanks for reading. As mentioned above, I'm going to make more of an effort to keep these semi-regular, though I'm certainly not making any promises, as I'd rather release longer, higher-quality devlogs than more frequent, lower-quality ones. I've had a ton of fun these last few weeks, and I'm looking forward to finishing the core filesystem implementation shortly and moving on to indexing (i.e. full text + semantic).

On more of a personal note, I'll also be starting to figure out some job-stuff in the next few weeks. In the long run, making Lovely sustainable would be a dream come true. However, in the meantime, there are a few companies that I've been itching to talk with / consider joining, so I'll be pursuing those for the next little bit.

45 files changed, 15506 insertions(+), 986 deletions(-)


  1. With the exception of the Zig standard library, which is in fact, lovely.