Flat file key value database
2024-04-11

preface

	I just uploaded yet another youtube video about using flat files as a key value store instead of using a database. Clickity click it above, or read on below.

we're database junkies

	A database is the very first thing we spin up when beginning a new project, even when it makes zero sense to do so. "I might as well, I'll probably need it in the future." "Ugh, dealing with concurrency and flat files is a hassle.. I'll just spin up a database instead." "Flat files? What is this, the 2010s?"

But there are so many cases where a simple key value store is optimal, and a filesystem is an efficient version of that. I'm not saying that things like redis don't have a place, of course they do. But with kernel level caching having evolved as much as it has in the last decade, that place is far smaller than it used to be. And it keeps shrinking.

In the video I just posted I parsed 25k rows from a flat file in a fraction of a second.. with a debug build of rust. If I had used the release profile it would have been far, far faster. Is that performant when it comes to online servers with tens of thousands of concurrent requests? No, it isn't.

But are you going to be reading 25k+ rows every second, concurrently, in that little project you're working on right now? No, you're not.

chimerasyntax's infrastructure

	I designed chimerasyntax to be key / value based from the ground up. This is partly because I was hoping that serverless tech would catch up to its promises. It did not. The performance is dreadful for even fairly simple use cases, and the tech is abysmal. I remember a fun conversation in the cloudflare discord for their offering where a person noted that "maybe it isn't a great look when the example project.. doesn't compile."

It isn't really the fault of the single maintainer they had working on it. I got the impression that it was more of a hobby, or experiment, for cloudflare than a proper offering. But they were offering it commercially. So.

It was also the very personification of everything I despise about modern development. You weren't given a library and a reference manual. No, you had to install node.js to simply access the client used to upload your projects to cloudflare's servers. Node.js and its hundreds upon hundreds of dependencies. And once I got that infernal, bloated mess actually installed the performance was so poor that even a single, €2 vps would run circles around it. But it would have scaled better than the vps, no denying that.

So I eventually reduced chimerasyntax to three microservices:
cax - a custom http server.
dax - a database abstraction server.
rax - a websocket backend for application logic.

Eventually I added wax - a worker for generating cache files for searches, collections, etc. Wax is completely standalone and has no interaction with the other services beyond generating files for them to read, so I don't count it in the holy trinity.

All of these services are containerized through podman and can scale individually or in unison. There's logic built in to allow multiple instances of each of them to run concurrently and load-balance, but so far that hasn't been necessary. We're launching "officially" next week, after all. All of it is kubernetes compliant if I want to go down that road (I don't).

But this necessitated a database, and it needed to be properly decentralized. So going back to the notion of a key / value system I naturally looked up amazon's and all the other providers' alternatives. But every single time I ended up right back at "I could just use regular files for this".

A major benefit of files is that we have half a century's worth of synchronization technology that has been battle tested and hardened. Every new solution being rolled out has its own pitfalls and gotchas. And those are only the ones we know about, more problems always turn up as the technology enters the trial by fire phase.

So I said screw it and wrote "nodb". Yes, I'm that clever.

nodb

	nodb uses atomic appends to ensure that any number of threads can write to it at once. It's completely agnostic to how many servers write to it at once (via networked filesystems), allowing infinite scalability (up to the i/o cap on the physical hardware itself). There are never any synchronization problems since all data is always appended with every write, allowing updates to be made out of sequence. It doesn't require handshakes or any of the voodoo database servers require in order to keep the dataset in sync. And there's zero chance of it ever going out of sync.
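
A minimal sketch of what such an append could look like in rust. This is not nodb's actual code: the `append_record` name and the tab separated, one record per line format are assumptions made up for illustration. It leans on the operating system's append mode and hands each record over as one small buffer. Whether appends stay atomic on a given networked filesystem depends on that filesystem and isn't verified here.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append one `key<TAB>value` record to the log at `path`.
/// Hypothetical helper: the record format is an assumption, not nodb's real format.
fn append_record(path: &Path, key: &str, value: &str) -> io::Result<()> {
    // Tabs and newlines are the delimiters in this sketch, so reject them in the data.
    let is_delimiter = |c: char| c == '\t' || c == '\n';
    if key.contains(is_delimiter) || value.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "record contains a delimiter",
        ));
    }

    // Append mode: the OS positions every write at the current end of the file.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;

    // Build the complete record first so it is handed to the OS as one buffer
    // instead of several small pieces that could interleave with other writers.
    let line = format!("{key}\t{value}\n");
    file.write_all(line.as_bytes())
}
```

Calling `append_record` twice with the same key leaves both lines in the file. Nothing is overwritten, which is what lets writers avoid coordinating with each other.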

It just works. But there are drawbacks to it. Some quite severe, as I outline in the video.

1. Since it is, in effect, a write ahead log that is never merged into the underlying data (the WAL is the underlying data), the file has to be re-parsed every time an update is made. But I've got a very efficient caching system, so that generally only happens once per collection / comment section / etc. and simply isn't an issue. 99.3% of all requests hit the cache.
2. Parsing the data requires the use of a hashmap. Rust's various hashmap offerings are blazing fast, but they're still ultimately hashmaps with their inherent slowness.
3. As the number of updates to a nodb file increases you eventually have to re-render it. This is a decision dax makes depending on the total amount of processing time it spends on a nodb. A file with many changes that only requires parsing once is seldom rendered. Meanwhile a nodb that has to be re-cached several times a day is rendered to account for any updates and deletes.
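
To make the parsing and re-rendering steps concrete, here is a rough sketch under stated assumptions. It is not how dax actually does it. The `replay` and `render` names, the `key<TAB>value` line format, the "last line for a key wins" rule and the empty value used as a delete marker are all invented for the example.

```rust
use std::collections::HashMap;

/// Replay an append-only log into its current state.
/// Assumed format: one `key<TAB>value` record per line, later lines override
/// earlier ones, and an empty value marks a delete.
fn replay(log: &str) -> HashMap<String, String> {
    let mut state = HashMap::new();
    for line in log.lines() {
        // Lines without a tab are skipped as malformed in this sketch.
        if let Some((key, value)) = line.split_once('\t') {
            if value.is_empty() {
                state.remove(key);
            } else {
                state.insert(key.to_string(), value.to_string());
            }
        }
    }
    state
}

/// "Render" a log: rewrite it with exactly one line per live key,
/// dropping overwritten values and deleted keys.
fn render(log: &str) -> String {
    let state = replay(log);
    // Sort the keys so the rendered output is deterministic.
    let mut keys: Vec<&String> = state.keys().collect();
    keys.sort();

    let mut out = String::new();
    for key in keys {
        out.push_str(key);
        out.push('\t');
        out.push_str(&state[key]);
        out.push('\n');
    }
    out
}
```

Replaying the rendered log gives the same state as replaying the original. The rendered file is just shorter, so every later parse has less work to do.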

summa summarum

	I'm beyond happy with the result. As it stands I am currently using zero services that I haven't made myself. I use a bare minimum of external dependencies, ensuring that I have complete control of my code. There are some things I will not write myself on principle though, cryptography libraries and tokio's async runtime are two good examples. The cost of screwing either of those up is too great for me to even consider writing my own versions.

My servers require a base installation of any version of linux, podman, OpenSSH and OpenSSL. That's it. Nothing else. And it feels great.