carlsverre 1 days ago [-]
You might be interested in taking a look at Graft (https://graft.rs/). I have been iterating in this space for the last year, and have learned a lot about it. Graft has a slightly different set of goals, one of which is to keep writes fast and small and optimize for partial replication. That said, Graft shares several design decisions, including the use of framed ZStd compression to store pages.
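For readers unfamiliar with framed compression: the idea is to compress pages in independently decompressible frames so a reader can fetch one page without inflating the whole object. A minimal sketch, using stdlib zlib as a stand-in for ZStd's seekable framing (frame and page sizes here are toy values, not what Graft or turbolite actually use):

```python
import zlib

PAGE_SIZE = 4096
PAGES_PER_FRAME = 4  # toy frame size; real systems use much larger groups

def write_frames(pages):
    """Compress pages in independent frames; return (blob, frame byte offsets)."""
    blob, offsets = b"", []
    for i in range(0, len(pages), PAGES_PER_FRAME):
        frame = b"".join(pages[i:i + PAGES_PER_FRAME])
        offsets.append(len(blob))          # byte offset where this frame starts
        blob += zlib.compress(frame)
    offsets.append(len(blob))              # sentinel: end of the last frame
    return blob, offsets

def read_page(blob, offsets, page_no):
    """Fetch one page by decompressing only its containing frame."""
    f = page_no // PAGES_PER_FRAME
    frame = zlib.decompress(blob[offsets[f]:offsets[f + 1]])
    start = (page_no % PAGES_PER_FRAME) * PAGE_SIZE
    return frame[start:start + PAGE_SIZE]

pages = [bytes([i]) * PAGE_SIZE for i in range(10)]
blob, offsets = write_frames(pages)
assert read_page(blob, offsets, 7) == pages[7]
```

The offsets table is what makes a ranged GET useful: the reader asks object storage for just the bytes of one frame instead of the whole object.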
I do like the B-tree aware grouping idea. This seems like a useful optimization for larger scan-style workloads. It helps eliminate the need to vacuum as much.
Have you considered doing other kinds of optimizations? Empty pages, free pages, etc.
russellthehippo 1 days ago [-]
Very cool, thanks. I hadn’t seen Graft before, but that sounds pretty adjacent in a lot of interesting ways. I'll look at the repo and see what I can apply.
I've tried out all sorts of optimizations - for free pages, I've considered leaving empty space in each S3 object and serving those as free pages to get efficient writes without shuffling pages too much. My current bias has been to over-store a little if it keeps the read path simpler, since the main goal so far has been making cold reads plausible rather than maximizing space efficiency. Especially because free pages compress well.
I have two related roadmap items: hole-punching and LSM-like writing. For local non-HDD storage, we can evict empty pages automatically by releasing empty page space back to the OS. For writes, LSM is best because it groups related things together, which is what we need, but that would mean doing a lot of rewriting on checkpoint. So both of these feel a little premature to optimize for vs other things.
carlsverre 1 days ago [-]
Both of those roadmap items make sense! Excited to see how you evolve this project!
russellthehippo 1 days ago [-]
Thanks! I think the hole punching is the key one, as it is important for "this delete has happened but we need to free the space without a vacuum".
wgjordan 17 hours ago [-]
Nice set of experiments! I appreciate that you're running benchmarks on real object storage setups to validate rapid design variations. (Meta-note: I love how agents have recently made this kind of experimental-research work possible with much less human time investment.)
I've been doing some experiments of my own in a relatively similar space, also focusing on S3/Tigris-backed SQLite on ephemeral compute, also with B-tree aware prefetching (see https://github.com/wjordan/sqlite-prefetch).
I think the idea of storing grouped pages together to optimize read-locality is interesting. Note that it steers in the opposite direction of the temporal locality that a format like LTX/Litestream uses to provide transaction-aware features like point-in-time restore. The tradeoff also involves significantly greater write amplification (re-upload the entire page group every time a single page dirties), heavily favoring cold-read-heavy workloads over mixed-write or warm-read workloads.
The query-plan frontrunning is a very novel experiment as well, discovering in advance that SQLite is about to run a full-table scan seems like a very useful optimization hint to work with. I'd love to see experiments validating how much of an improvement that offers compared to simple reactive prefetch (which takes at least a couple page faults to get up to speed).
russellthehippo 3 hours ago [-]
Fantastic comment, thanks for jumping in. I have well-tuned replies here lol given we're working on exactly the same problems!
First: your prefetch solution is precisely what I have in the roadmap for the next level of optimization: B-tree introspection at the page-child level, not the table level. I haven't launched it yet as I only realized the potential in the past couple days and I wanted to focus on stability before showing this to folks. I am 100% going to try to use the sqlite-prefetch approach in that experiment. I'm curious what kind of results you're seeing.
On write amplification: you're right, re-uploading an entire page group when one page is dirty feels inefficient. This tradeoff is intentional because turbolite believes all PUTs are equal, and it is aimed at optimizing for queries on cold databases with bursty reads, not write-heavy workloads (though writes work too). Checkpoints are ideally infrequent due to upload latency, and the page group sizes are tunable (default 256 64KB pages = 16MB uncompressed, a few MB compressed to S3). For the use case I'm targeting, the read locality wins dominate. That being said, turbolite does let the user choose when to checkpoint locally (durable on disk) vs to S3 (atomic commit in case of disk crash or new compute).
On LTX temporal locality vs page-group spatial locality: agreed, they're different design goals. LTX optimizes for transaction-aware features like PITR. turbolite optimizes for cold query speed from object storage. You could imagine a system that does both (WAL shipping for durability + grouped pages for reads), which is roughly where my roadmap goes. In fact, WAL + immutable page groups on checkpoints gives a much faster restore time than Litestream's snapshots+WAL if only because uploading the entire snapshot on each commit is slow. I'm starting to explore embedded WAL shipping in my walrust project https://github.com/russellromney/walrust.
On frontrun vs reactive: I have very specific benchmarks for this that I think you will be interested in. The tiered-bench binary has a --plan-aware flag that toggles between prefetch schedule (reactive, similar in spirit to your sibling detection) and frontrun (EQP-based). See benchmark/README.md for detail. On 100K rows, local to Tigris, EQP is:
- who-liked (one-to-many JOIN): 4.4x faster cold, 1.4x with interior cached
- scan + filter: 2.9x faster cold
- indexed filter: 2.9x faster cold
- point lookup + JOIN: 1.8x cold
- simple queries (mutual friends): 1.3x, less to gain
The wins are largest on cold index lookups and SCANs where reactive prefetch has to discover the scan through sequential misses. Frontrun knows the full plan upfront and fires everything in parallel. For single-table scans, reactive catches up quickly after 1-2 cache misses, so the gap is smaller.
Finally - I included a query tuner CLI in the repo that lets you compare different prefetch aggression levels for a given query on given data. You may be interested in using this.
edinetdb 22 hours ago [-]
The database-per-tenant angle resonates. I hit a similar tradeoff building a financial data API—highly uneven access patterns where most entities are rarely queried but a few hundred get hit constantly.
We went a different direction: BigQuery for durable storage, but an in-memory layer that pre-loads the "gold" aggregated tables (~26MB of pre-computed data for ~4,000 companies) at startup. Cold queries fall back to BigQuery directly; warm queries skip it entirely. Reduced our query costs by ~88%, though that's partly a function of table size being small enough to fit in memory.
The scans-are-slow limitation you mention is where we'd still hit BigQuery directly anyway—full text search across 9M+ document sections can't be cached that way. So there's probably a hybrid approach where hot structured data lives in-memory and raw text stays in the warehouse.
What's your strategy for tables that grow beyond what fits in a memory budget? Does the page-group layout still help when you can't fit the whole table—do you see a way to do partial pre-loading by access frequency?
russellthehippo 1 days ago [-]
A bit more color on what I found interesting building this:
The motivating question for me was less “can SQLite read over the network?” and more “what assumptions break once the storage layer is object storage instead of a filesystem?”
The biggest conceptual shift was around *layout*.
What felt most wrong in naive designs was that SQLite page numbers are not laid out in a way that matches how you want to fetch data remotely. If an index is scattered across many unrelated page ranges, then “prefetch nearby pages” is kind of a fake optimization. Nearby in the file is not the same thing as relevant to the query.
That pushed me toward B-tree-aware grouping. Once the storage layer starts understanding which table or index a page belongs to, a lot of other things get cleaner: more targeted prefetch, better scan behavior, less random fetching, and much saner request economics.
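The grouping step itself can be sketched in a few lines. Assuming you already have a page-number-to-b-tree mapping (e.g. derived from SQLite's dbstat virtual table; the names and group size below are illustrative, not turbolite's actual internals), packing each b-tree's pages into homogeneous fixed-size groups looks roughly like:

```python
from collections import defaultdict

GROUP_SIZE = 4  # pages per stored object; toy value for illustration

def group_pages(page_owner):
    """Assign pages to object-storage groups so each group holds pages
    from a single b-tree, in page-number order.
    page_owner: {page_no: btree_name}."""
    by_tree = defaultdict(list)
    for page_no, tree in sorted(page_owner.items()):
        by_tree[tree].append(page_no)
    groups = []
    for tree, members in by_tree.items():
        for i in range(0, len(members), GROUP_SIZE):
            groups.append((tree, members[i:i + GROUP_SIZE]))
    return groups

# interleaved on-disk layout: table and index pages alternate
layout = {1: "users", 2: "idx_users_name", 3: "users",
          4: "idx_users_name", 5: "users", 6: "users"}
for tree, members in group_pages(layout):
    print(tree, members)
```

Each group comes out homogeneous, so a scan of "users" touches a couple of objects instead of faulting into every object that happens to contain one of its pages.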
Another thing that became much more important than I expected is that *different page types matter a lot*. Interior B-tree pages are tiny in footprint but disproportionately important, because basically every query traverses them. That changed how I thought about the system: much less as “a database file” and much more as “different classes of pages with very different value on the critical path.”
The query-plan-aware “frontrun” part came from the same instinct. Reactive prefetch is fine, but SQLite often already knows a lot about what it is about to touch. If the storage layer can see enough of that early, it can start warming the right structures before the first miss fully cascades. That’s still pretty experimental, but it was one of the more fun parts of the project.
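For the curious, SQLite exposes this information through EXPLAIN QUERY PLAN, so a storage layer can peek at the plan before the first page fault. A rough sketch of the detection half (the actual frontrun scheduling is more involved; the helper name here is invented for illustration):

```python
import sqlite3

def plan_hints(conn, sql, params=()):
    """Run EXPLAIN QUERY PLAN before the real query and report which
    tables/indexes it will touch, e.g. 'SCAN t' vs 'SEARCH t USING INDEX ...'."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("CREATE INDEX t_v ON t(v)")

scan = plan_hints(conn, "SELECT count(*) FROM t")
seek = plan_hints(conn, "SELECT id FROM t WHERE v = ?", ("x",))
print(scan)  # contains a SCAN: prefetch the table's b-tree up front
print(seek)  # a SEARCH via t_v: warm the index interior pages instead
```

A "SCAN" row is the cue to fire off the whole-table prefetch in parallel; a "SEARCH ... USING INDEX" row says to warm that index's interior pages first.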
A few things I learned building this:
1. *Cold point reads and small joins seem more plausible than I expected.* Not local-disk fast, obviously, but plausible for the “many mostly-cold DBs” niche.
2. *The real enemy is request count more than raw bytes.* Once I leaned harder into grouping and prefetch by tree, the design got much more coherent.
3. *Scans are still where reality bites.* They got much less bad, but they are still the place where remote object storage most clearly reminds you that it is not a local SSD.
4. *The storage backend is super important.* Different storage backends (S3, S3 Express, Tigris) have very different round trip latencies, and it's the single most important thing in determining how to tune prefetching.
Anyway, happy to talk about the architecture, the benchmark setup, what broke, or why I chose this shape instead of raw-file range GETs / replication-first approaches / etc.
hgo 1 days ago [-]
I really appreciate this post. Freely and humbly sharing real insights from an interesting project. I almost feel like I got a significant chunk of the reward for your investment into this project just by reading.
Thank you for sharing.
russellthehippo 1 days ago [-]
Thanks for your kind words!
agosta 1 days ago [-]
This is awesome! With all of the projects/teams working on improving sqlite, it feels like it's just a matter of time before it becomes a better default than postgres for serious projects.
I do wonder - for projects that do ultimately enforce single writer sqlite setups - it still feels to me as if it would always be better to keep the sqlite db local (and then rsync/stream backups to whatever S3 storage one prefers).
The nut I've yet to see anyone crack on such setup is to figure out a way to achieve zero downtime deploys. For instance, adding a persistent disk to VMs on Render prevents zero downtime deploys (see https://render.com/docs/disks#disk-limitations-and-considera...) which is a real unfortunate side effect. I understand that the reason for this is because a VM instance is attached to the volume and needs to be swapped with the new version of said instance...
There are so many applications where merely scaling up a single VM as your product grows simplifies devops / product maintenance so much that it's a very compelling choice vs managing a cluster/separate db server. But getting forced downtime between releases to achieve that isn't acceptable in a lot of cases.
Not sure if it's truly a cheaply solvable problem. One potential option is to use a tool like turbolite as a parallel data store and, only during deployments, use it to keep the application running for the 10 to 60 seconds during a release swap. During this time, writes to the db are slower than usual but entirely online. And then, when your new release is live, it can sync the difference of data written to s3 back to the local db. In this way, during regular operation, we get the performance of local IO and fall back onto s3-backed sqlite during upgrades for persistent uptime.
Sounds like a fraught thing to build. But man it really is hard/impossible to beat the speed of local reads!
russellthehippo 1 days ago [-]
Yeah I mostly agree, and another comment brought up this idea kind of. If you can keep SQLite local, that’s usually the better answer. Local reads are insanely hard to beat, and Litestream is the canonical “keep SQLite local and ship durability elsewhere” model.
That’s actually a big part of why I started another project, haqlite: https://github.com/russellromney/haqlite which is much more on the “keep SQLite local” side: leader election via S3 conditional PUTs, WAL replication to S3, follower catchup, write forwarding to the leader, and graceful leader handoff. So it can get pretty close to zero-downtime deploys with 2+ nodes.
But not true zero downtime on a single server — if there isn’t already another warm follower alive, there is nowhere for leadership to go. So in my head:
- Litestream: local SQLite is primary; object storage is for durability / replication / failover
- haqlite: kind of a Litestream but with a focus on embedding it into any application.
- turbolite: object storage is the actual backing store, and the question is how plausible cold reads can get. No HA ideas
hrmtst93837 1 days ago [-]
Keeping the DB local cuts the worst latency spikes, but then you trade away the whole pitch of ephemeral compute and just-in-time scaling, so you end up glued to old-school infra patterns in disguise, plus node affinity and warm-cache babysitting that look a lot like the stuff SQLite was supposed to let you dodge. Add a few readers on volatile nodes and it gets ugly fast.
AbanoubRodolf 20 hours ago [-]
The tradeoff is real, but it's workload-specific. Turbolite is for read-mostly analytics on truly ephemeral compute -- Lambda or spot instances hitting a dataset that refreshes once a day. There's no persistent node to pin, so local-with-replication doesn't apply. You're comparing apples to a different fruit entirely.
What makes 250ms achievable isn't just range requests. It's the B-tree-aware page grouping. Standard S3 VFS pays 15-20ms per random page fetch. Group the pages a query actually touches into contiguous segments and you turn dozens of round trips into a few sequential reads. For cold JOIN queries on small-to-medium tables, that's the difference between 4 seconds and usable.
For OLTP or anything with meaningful write throughput, local WAL replication (LiteFS, haqlite, Litestream) is clearly right. These are two different problems that happen to both use SQLite.
russellthehippo 19 hours ago [-]
No. Turbolite is explicitly read-write, and it’s genuinely hard to balance availability against local query speed. Those tradeoffs are real. I don’t think your response is describing this project very accurately.
russellthehippo 1 days ago [-]
Couldn't have said it better myself.
AlexeyBelov 18 hours ago [-]
This is LLM-speak. Am I the only one who notices?
russellthehippo 17 hours ago [-]
the comment history speaks for itself
russellthehippo 1 days ago [-]
Also I want to acknowledge the other projects in adjacent parts of this space — raw SQLite range-request VFSes, Litestream/LiteFS-style replication approaches, libSQL/Turso, Neon, mvsqlite, etc. I took a lot of inspiration from them, thanks!
bob1029 1 days ago [-]
Have you considered using techniques like conditional PUT to enable multiple writers?
It's funny, I did consider that and put some work into doing it. I moved that work into a different project, https://github.com/russellromney/haqlite that focuses on high-availability SQLite. Like Litestream but focused on embedding in your application. It uses conditional PUTs for leader election, ships WAL to S3, and writes (and reads for consistency) are forwarded to the leader - so it's still single writer, but failover happens automatically when the leader fails to claim the leader lease again. The data-loss window is the lease timeout + checkpoint interval. Maybe I'll explore how to use haqlite + turbolite together.
Short answer: conditional PUTs scare me for distributed multi-writer. The issue isn't doing the writes, it's ensuring that the writer is writing against the most current data. For OLTP workloads with upserts, this is very hard! If you have immutable data without any upserts and your writes don't depend on reads, that actually works really well. But in any other scenario it's dangerous. The writer needs to ensure that the other writers have checkpointed first, and there's not a great way to do that.
One thing that could make this work in turbolite is a fast distributed lock system with transaction and lock timeouts. E.g. use Redis as the lock lease holder, and distributed writers acquire it, fetch the latest manifest from s3, sync any data they need, do the write, checkpoint, update the lock to be released and note the new manifest version, and return success to the user. Then before any reads, they simply check Redis for the new manifest (or S3 for guarantee that the manifest update/lock release didn't fail). All writes have an N second timeout, and the write lock has N second + T timeout, so it's guaranteed that successful checkpoints are used in the next read, as long as readers check the manifest first.
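A toy sketch of that lease protocol, with a plain dict standing in for Redis and all names invented for illustration (a real version would use atomic SET NX PX and expiry on the Redis side):

```python
import time

class LeaseLock:
    """Single-writer lease whose value carries the latest manifest version,
    so the next writer or reader knows what to sync before proceeding."""
    def __init__(self, lease_secs):
        self.store, self.lease_secs = {}, lease_secs  # dict stands in for Redis

    def acquire(self, writer_id, now=None):
        """Return the manifest version to sync from, or None if held."""
        now = time.monotonic() if now is None else now
        holder = self.store.get("lock")
        if holder and now < holder["expires"]:
            return None                         # another writer holds the lease
        self.store["lock"] = {"id": writer_id, "expires": now + self.lease_secs}
        return self.store.get("manifest_version", 0)

    def release(self, writer_id, new_manifest_version):
        """Publish the new manifest version and drop the lease."""
        holder = self.store.get("lock")
        if holder and holder["id"] == writer_id:
            self.store["manifest_version"] = new_manifest_version
            del self.store["lock"]

lock = LeaseLock(lease_secs=5)
v = lock.acquire("writer-a")   # fetch manifest v, write, checkpoint to S3...
lock.release("writer-a", new_manifest_version=v + 1)
assert lock.acquire("writer-b") == v + 1   # next writer sees the new manifest
```

The key property is the one described above: a writer only proceeds after learning the latest manifest version from the lock itself, so a successful checkpoint is always visible to the next acquirer.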
This could work, but it's *still* single writer lol. But you'd only want it with infrequent writes. And reads cost an S3 GET. So I guess it would work best with Wasabi, which doesn't charge for operations, or self-hosted MinIO.
How do you handle manifest consistency and updates—append-only or rewrite-heavy?
russellthehippo 16 hours ago [-]
Rewrite on every checkpoint. The manifest is small (a few hundred KB for a million pages), so a single PUT is fine. The PUT is the atomic commit point - old page groups become garbage, cleaned up later by gc(). The SQLite engine fetches updated pages from the WAL so turbolite doesn't have to manage manifest updates until checkpoint. If it's not in WAL mode, checkpoints happen on every write so consistency is still there.
manudaro 1 days ago [-]
Sub 250ms for cold queries from S3 is impressive, but curious about the consistency of those numbers. Are you doing any prefetching of table schemas or statistics? With geographic datasets we often see huge variance in S3 latency depending on object size and region - a 10MB spatial index file might take 400ms to fetch while smaller lookup tables stay under 100ms.
russellthehippo 2 hours ago [-]
[dead]
ovaistariq 19 hours ago [-]
Ovais - co-founder of Tigris here.
This is very cool. I have been thinking about embedded databases running on Tigris. Especially from an agent perspective: agents can suspend and continue their sessions. Would love to collaborate.
russellthehippo 15 hours ago [-]
Happy to hear from you, I'm a delighted customer. In fact I am building a database company on top of Tigris focused on exactly that use case. I'll reach out on X if that works.
mememememememo 14 hours ago [-]
It's great. Well done. 250ms is slow for a DB query, though for what you are doing and how you do it, it is surprisingly fast.
russellthehippo 2 hours ago [-]
Thanks! yeah it's definitely slow, but compared to other options it's quite fast:
- Neon startup is 500ms+
- a Fly Machine with a SQLite database on a volume takes 150ms+ to start up at minimum
- restoring a db from S3 with Litestream can take multiple seconds depending on how long from the last snapshot and how large the database is
- downloading a whole .sqlite file from S3 can be faster than Litestream restore but you still have to download the whole db file.
- One similar option: you could save each table in a separate database and only download the db files for tables you need for a given query, then ATTACH. But this is an awkward setup even though it's simple
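That last option is plain SQLite, for anyone who wants to try it. A minimal sketch with two local files standing in for per-table databases pulled from S3:

```python
import os
import sqlite3
import tempfile

# one database file per table, downloaded on demand, then ATTACHed
tmp = tempfile.mkdtemp()
users_db = os.path.join(tmp, "users.db")
orders_db = os.path.join(tmp, "orders.db")
for path, ddl in [
    (users_db, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (orders_db, "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT)"),
]:
    c = sqlite3.connect(path)
    c.execute(ddl)
    c.commit()
    c.close()

# a query session attaches only the tables it needs
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ? AS u", (users_db,))
conn.execute("ATTACH DATABASE ? AS o", (orders_db,))
conn.execute("INSERT INTO u.users VALUES (1, 'ada')")
conn.execute("INSERT INTO o.orders VALUES (1, 1)")
row = conn.execute(
    "SELECT name FROM u.users JOIN o.orders ON users.id = orders.user_id"
).fetchone()
print(row)  # ('ada',)
```

The awkward parts are exactly what you'd expect: no cross-file transactions beyond what ATTACH gives you, and SQLite's default limit of 10 attached databases per connection.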
pdyc 19 hours ago [-]
this is a great project, does it support wasm? i want to use it in browser with sqlite wasm.
russellthehippo 15 hours ago [-]
Not today, but the architecture isn't fundamentally incompatible. The page grouping and seekable compression would translate well to browser fetch + range GETs. It would need a new storage backend targeting OPFS/fetch instead of S3/disk. I'm happy to discuss more if you'd like to open a Github issue - abstracting the storage API seems like a decent idea in itself.
russellthehippo 2 hours ago [-]
Also I'm curious how you would handle credentials - would you be proxying through your backend? To me turbolite is definitely a backend tool.
alex_hirner 1 days ago [-]
What are your thoughts on eviction, re how easy to add some basic policy?
russellthehippo 1 days ago [-]
Great question. I have some eviction functions in the Rust library; I don’t expose them through the extension/VFS yet. The open question is less “can I evict?” and more “when should eviction fire?”: via user action, via policy, or both.
The obvious policy-driven versions are things like:
- when cache size crosses a limit
- on checkpoint
- every N writes (kind of like autocheckpoint)
- after some idle / age threshold
My instinct is that for the workload I care about, the best answer is probably hybrid. The VFS should have a tier-aware policy internally that users can configure with separate policies for interior/index/data pages. But the user/application may still be in the best position to say “this tenant/session DB is cold now, evict aggressively.”
russellthehippo 1 days ago [-]
Also it’s super easy to add them, it’s definitely on the roadmap. The manifest tracks the data locally and could quickly evict on any operation that accesses it. Very safe post-checkpoint
michaeljelly 1 days ago [-]
Really cool
jijji 1 days ago [-]
i wonder how much that costs per hour to run any normal load? what benefit does this have versus using mysql (or any similar rdbms) for the queries? mysql/pgsql/etc is free remember, so using S3 obviously charges by the request, or am i wrong?
russellthehippo 1 days ago [-]
Cost really depends on your workflow, your durability needs, prefetch aggression (how quickly you need all the data locally), and how aggressively you cache. The storage is what it is, and it's cheap. The variation comes in transaction counts.
If you evict the cache on every read/write, then all reads and writes do S3 GET. PUTs are 10-100x more expensive than GETs usually. Check the benchmark/README.md for GET counts, but it's usually 5-50 per cold read (with interior B-tree pages on disk).
S3 GETs are $0.0004/1000, S3 Express is $0.00003. So 10 queries per minute all day long averaging 20 GET operations, with full eviction on each request, would be 20*10*$0.0004*60*24/1000*30 = $3.45 per month. With S3 Express One Zone that's $0.26/mo. Both plus storage costs, which would probably be lower.
On Wasabi, that would be just the cost of storage (but they have minimum 1TB requirements at $6.99/mo).
If you checkpoint after every write and write 10 times per minute, each write hitting e.g. 5 page groups (S3 objects) plus the manifest file, the analysis looks like: 6*10*$0.005*60*24/1000*30=$12.96. Again that's worst case for the benchmark 1.5GB database. On S3 Express that's $2.92.
Point is - it's not too bad, and that's kind of worst-case scenario where you evict on every request every 6 seconds, which isn't really realistic. If you evict the cache hourly, that cost drops to 1/600th of that - less than $0.01 per month.
Summary: use S3 Express One Zone, don't evict the cache too often, checkpoint to S3 once a minute (turbolite lets you checkpoint either locally (disk-durability) or locally+S3 separately), and you're running a decent workload for pennies every month.
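The arithmetic above is easy to replay. A small helper, using the prices and workload numbers from this thread (the $3.45 figure above is $3.456 truncated):

```python
def monthly_op_cost(ops_per_query, queries_per_min, price_per_1000):
    """Monthly cost of one S3 operation type at a steady query rate."""
    ops_per_month = ops_per_query * queries_per_min * 60 * 24 * 30
    return ops_per_month * price_per_1000 / 1000

# read path: 20 GETs/query, 10 queries/min, full eviction on each request
gets = monthly_op_cost(20, 10, 0.0004)   # S3 Standard GET price per 1000
# write path: 5 page groups + 1 manifest = 6 PUTs/write, 10 writes/min
puts = monthly_op_cost(6, 10, 0.005)     # S3 Standard PUT price per 1000
print(f"GETs: ${gets:.2f}/mo, PUTs: ${puts:.2f}/mo")
```

Swap in the S3 Express GET price and the read-path number drops to roughly the $0.26/mo quoted above.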
I do like the B-tree aware grouping idea. This seems like a useful optimization for larger scan-style workloads. It helps eliminate the need to vacuum as much.
Have you considered doing other kinds of optimizations? Empty pages, free pages, etc.
I've tried out all sorts of optimizations - for free pages, I've considered leaving empty space in each S3 object and serving those as free pages to get efficient writes without shuffling pages too much. My current bias has been to over-store a little if it keeps the read path simpler, since the main goal so far has been making cold reads plausible rather than maximizing space efficiency. Especially because free pages compress well.
I have two related roadmap item: hole-punching and LSM-like writing. For local on non-HDD storage, we can evict empty pages automatically by releasing empty page space back to the OS. For writes, LSM is best because it groups related things together, which is what we need. but that would mean doing a lot of rewriting on checkpoint. So both of these feel a little premature to optimize for vs other things.
I've been doing some experiments of my own in a relatively similar space, also focusing on S3/Tigris-backed SQLite on ephemeral compute, also with B-tree aware prefetching (see https://github.com/wjordan/sqlite-prefetch).
I think the idea of storing grouped pages together to optimize read-locality is interesting. Note that it steers in the opposite direction of the temporal locality that a format like LTX/Litestream uses to provide transaction-aware features like point-in time restore. The tradeoff also involves significantly greater write amplification (re-upload the entire page group every time a single page dirties), heavily favoring cold-read-heavy workloads over mixed-write or warm-read workloads.
The query-plan frontrunning is a very novel experiment as well, discovering in advance that SQLite is about to run a full-table scan seems like a very useful optimization hint to work with. I'd love to see experiments validating how much of an improvement that offers compared to simple reactive prefetch (which takes at least a couple page faults to get up to speed).
First: your prefetch solution is precisely what I have in the roadmap for the next level of optimization: B-tree introspection at the page-child level not the table level. I haven't launched it yet as I only realized the potential in the past couple days and I wanted to focus on stability before showing this to folks. I am 100% going to try to use sqlite-prefetch approach in that experiment. I'm curious what kind of results you're seeing.
On write amplification: you're right, re-uploading an entire page group when one page is dirty feels inefficient. This tradeoff is intentional because turbolite believes all PUTs are equal, and it is aimed at optimizing for queries on cold databases with bursty reads, not write-heavy workloads (though writes work too). Checkpoints are ideally infrequent due to upload latency, and the page group sizes are tunable (default 256 64KB pages = 16MB uncompressed, a few MB compressed to S3). For the use case I'm targeting, the read locality wins dominate. That being said, turbolite does let the user choose when to checkpoint locally (durable on disk) vs to S3 (atomic commit in case of disk crash or new compute).
On LTX temporal locality vs page-group spatial locality: agreed, they're different design goals. LTX optimizes for transaction-aware features like PITR. turbolite optimizes for cold query speed from object storage. You could imagine a system that does both (WAL shipping for durability + grouped pages for reads), which is roughly where my roadmap goes. In fact, WAL + immutable page groups on checkpoints gives a much faster restore time than Litestream's snapshots+WAL if only because uploading the entire snapshot on each commit is slow. I'm starting to explore embedded WAL shipping in my walrust project https://github.com/russellromney/walrust.
On frontrun vs reactive: I have very specific benchmarks for this I think you will be intrested in. The tiered-bench binary has a --plan-aware flag that toggles between prefetch schedule (reactive, similar in spirit to your sibling detection) and frontrun (EQP-based). Look at the benchmark/README.md for detail. On 100K rows, local to Tigris, EQP is:
- who-liked (one-to-many JOIN): 4.4x faster cold, 1.4x with interior cached
- scan + filter: 2.9x faster cold
- indexed filter: 2.9x faster cold
- point lookup + JOIN: 1.8x cold
- simple queries (mutual friends): 1.3x, less to gain
The wins are largest on cold index lookups and SCANs where reactive prefetch has to discover the scan through sequential misses. Frontrun knows the full plan upfront and fires everything in parallel. For single-table scans, reactive catches up quickly after 1-2 cache misses, so the gap is smaller.
Finally - I included a query tuner CLI in the repo that lets you compare different prefetch aggression levels for a given query on given data. You may be interested in using this.
We went a different direction: BigQuery for durable storage, but an in-memory layer that pre-loads the "gold" aggregated tables (~26MB of pre-computed data for ~4,000 companies) at startup. Cold queries fall back to BigQuery directly; warm queries skip it entirely. Reduced our query costs by ~88%, though that's partly a function of table size being small enough to fit in memory.
The scans-are-slow limitation you mention is where we'd still hit BigQuery directly anyway—full text search across 9M+ document sections can't be cached that way. So there's probably a hybrid approach where hot structured data lives in-memory and raw text stays in the warehouse.
What's your strategy for tables that grow beyond what fits in a memory budget? Does the page-group layout still help when you can't fit the whole table—do you see a way to do partial pre-loading by access frequency?
The motivating question for me was less “can SQLite read over the network?” and more “what assumptions break once the storage layer is object storage instead of a filesystem?”
The biggest conceptual shift was around *layout*.
What felt most wrong in naive designs was that SQLite page numbers are not laid out in a way that matches how you want to fetch data remotely. If an index is scattered across many unrelated page ranges, then “prefetch nearby pages” is kind of a fake optimization. Nearby in the file is not the same thing as relevant to the query.
That pushed me toward B-tree-aware grouping. Once the storage layer starts understanding which table or index a page belongs to, a lot of other things get cleaner: more targeted prefetch, better scan behavior, less random fetching, and much saner request economics.
Another thing that became much more important than I expected is that *different page types matter a lot*. Interior B-tree pages are tiny in footprint but disproportionately important, because basically every query traverses them. That changed how I thought about the system: much less as “a database file” and much more as “different classes of pages with very different value on the critical path.”
The query-plan-aware “frontrun” part came from the same instinct. Reactive prefetch is fine, but SQLite often already knows a lot about what it is about to touch. If the storage layer can see enough of that early, it can start warming the right structures before the first miss fully cascades. That’s still pretty experimental, but it was one of the more fun parts of the project.
A few things I learned building this:
1. *Cold point reads and small joins seem more plausible than I expected.* Not local-disk fast, obviously, but plausible for the “many mostly-cold DBs” niche.
2. *The real enemy is request count more than raw bytes.* Once I leaned harder into grouping and prefetch by tree, the design got much more coherent.
3. *Scans are still where reality bites.* They got much less bad, but they are still the place where remote object storage most clearly reminds you that it is not a local SSD.
4. *The storage backend is super important.* Different storage backends (S3, S3 Express, Tigris) have verg different round trip latencies and it's the single most important thing in determining how to tune prefetching.
Anyway, happy to talk about the architecture, the benchmark setup, what broke, or why I chose this shape instead of raw-file range GETs / replication-first approaches / etc.
Thank you for sharing.
I do wonder - for projects that do ultimately enforce single writer sqlite setups - it still feels to me as if it would always be better to keep the sqlite db local (and then rsync/stream backups to whatever S3 storage one prefers).
The nut I've yet to see anyone crack on such setup is to figure out a way to achieve zero downtime deploys. For instance, adding a persistent disk to VMs on Render prevents zero downtime deploys (see https://render.com/docs/disks#disk-limitations-and-considera...) which is a real unfortunate side effect. I understand that the reason for this is because a VM instance is attached to the volume and needs to be swapped with the new version of said instance...
There are so many applications where merely scaling up a single VM as your product grows simplifies devops / product maintenance so much that it's a very compelling choice vs managing a cluster/separate db server. But getting forced downtime between releases to achieve that isn't acceptable in a lot of cases.
Not sure if it's truly a cheaply solvable problem. One potential option is to use a tool like turbolite as a parallel data store and, only during deployments, use it to keep the application running for the 10 to 60 seconds during a release swap. During this time, writes to the db are slower than usual but entirely online. And then, when your new release is live, it can sync the difference of data written to s3 back to the local db. In this way, during regular operation, we get the performance of local IO and fallback onto s3 backed sqlite during upgrades for persistent uptime.
Sounds like a fraught thing to build. But man it really is hard/impossible to beat the speed of local reads!
That’s actually a big part of why I started another project, haqlite (https://github.com/russellromney/haqlite), which is much more on the “keep SQLite local” side: leader election via S3 conditional PUTs, WAL replication to S3, follower catchup, write forwarding to the leader, and graceful leader handoff. So it can get pretty close to zero-downtime deploys with 2+ nodes.
But not true zero downtime on a single server — if there isn’t already another warm follower alive, there is nowhere for leadership to go. So in my head:
- Litestream: local SQLite is primary; object storage is for durability / replication / failover
- haqlite: kind of a Litestream, but focused on embedding replication into any application.
- turbolite: object storage is the actual backing store, and the question is how plausible cold reads can get. No HA ideas
What makes 250ms achievable isn't just range requests. It's the B-tree-aware page grouping. Standard S3 VFS pays 15-20ms per random page fetch. Group the pages a query actually touches into contiguous segments and you turn dozens of round trips into a few sequential reads. For cold JOIN queries on small-to-medium tables, that's the difference between 4 seconds and usable.
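A back-of-envelope model shows why request count dominates. The 18ms first-byte latency and 100 MB/s stream rate below are illustrative assumptions, not measured turbolite numbers:

```python
# Rough model: per-request latency dominates until reads get large.
RTT_S = 0.018                # assumed first-byte latency per GET
STREAM_BYTES_PER_S = 100e6   # assumed streaming throughput once connected

def fetch_time(requests, total_bytes):
    """Total wall time for a set of GETs, issued sequentially."""
    return requests * RTT_S + total_bytes / STREAM_BYTES_PER_S

pages, page_size = 40, 4096
scattered = fetch_time(pages, pages * page_size)  # one GET per random page
grouped = fetch_time(3, pages * page_size)        # same bytes, 3 contiguous segments
```

Same bytes either way; the scattered version is over 10x slower purely on round trips. That gap only widens once you account for per-request pricing.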
For OLTP or anything with meaningful write throughput, local WAL replication (LiteFS, haqlite, Litestream) is clearly right. These are two different problems that happen to both use SQLite.
https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3...
https://docs.aws.amazon.com/AmazonS3/latest/userguide/condit...
Short answer: conditional PUTs scare me for distributed multi-writer setups. The issue isn't doing the writes, it's ensuring that each writer is writing against the most current data. For OLTP workloads with upserts, this is very hard! If your data is immutable, with no upserts, and your writes don't depend on reads, that actually works really well. But in any other scenario it's dangerous: a writer needs to ensure that the other writers have checkpointed first, and there's not a great way to do that.
One thing that could make this work in turbolite is a fast distributed lock system with transaction and lock timeouts. E.g. use Redis as the lock lease holder: distributed writers acquire the lease, fetch the latest manifest from S3, sync any data they need, do the write, checkpoint, release the lock while noting the new manifest version, and return success to the user. Then, before any reads, they simply check Redis for the new manifest (or S3, for a guarantee that the manifest update/lock release didn't fail). All writes have an N-second timeout and the write lock has an N+T-second timeout, so successful checkpoints are guaranteed to be visible to the next read, as long as readers check the manifest first.
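Roughly, the lease logic would look like this. A real version would use Redis (SET key token NX PX ttl, plus a token-checked release, typically via a Lua script); here an in-memory dict stands in for Redis, so the sketch only shows the ownership rules, not the networked part:

```python
import time, uuid

class LeaseLock:
    """Toy single-writer lease. In production this state would live in
    Redis; an in-memory dict stands in here."""
    def __init__(self):
        self._store = {}  # key -> (token, expires_at)

    def acquire(self, key, ttl_s):
        now = time.monotonic()
        held = self._store.get(key)
        if held and held[1] > now:
            return None  # another writer holds a live lease
        token = uuid.uuid4().hex
        self._store[key] = (token, now + ttl_s)
        return token

    def release(self, key, token):
        held = self._store.get(key)
        if held and held[0] == token:  # only the current holder may release
            del self._store[key]
            return True
        return False
```

The write path would then be: acquire, fetch the latest manifest, sync, write, checkpoint, release while recording the new manifest version. An expired lease means the write timed out, so the next writer can safely take over.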
This could work, but it's *still* single writer lol. And you'd only want it with infrequent writes, since every read costs an S3 GET. So I guess it would work best with Wasabi, which doesn't charge for operations, or with self-hosted MinIO.
How do you handle manifest consistency and updates—append-only or rewrite-heavy?
This is very cool. I have been thinking about embedded databases running on Tigris, especially from an agent perspective: agents can suspend and continue their sessions. Would love to collaborate.
- Neon startup is 500ms+
- a Fly Machine with a SQLite database on a volume takes 150ms+ to start up at minimum
- restoring a db from S3 with Litestream can take multiple seconds depending on how long from the last snapshot and how large the database is
- downloading a whole .sqlite file from S3 can be faster than Litestream restore but you still have to download the whole db file.
- A similar option: save each table in a separate database file, download only the files for the tables a given query needs, then ATTACH them. Simple, but an awkward setup.
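For reference, the ATTACH trick from the last bullet looks like this (file names and schema are made up):

```python
import os, sqlite3, tempfile

tmp = tempfile.mkdtemp()
users_db = os.path.join(tmp, "users.db")
orders_db = os.path.join(tmp, "orders.db")

# Two single-table database files, standing in for per-table downloads.
c = sqlite3.connect(users_db)
c.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
c.execute("INSERT INTO users VALUES (1, 'ada')")
c.commit(); c.close()

c = sqlite3.connect(orders_db)
c.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
c.execute("INSERT INTO orders VALUES (10, 1)")
c.commit(); c.close()

# Fetch only the files this query needs, then ATTACH and join across them.
conn = sqlite3.connect(users_db)
conn.execute("ATTACH DATABASE ? AS odb", (orders_db,))
rows = conn.execute(
    "SELECT u.name, o.id FROM users u JOIN odb.orders o ON o.user_id = u.id"
).fetchall()
conn.close()
```

It works, but every cross-table query needs the right set of files present first, which is exactly the bookkeeping that makes the setup awkward.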
The obvious policy-driven versions are things like:
- when cache size crosses a limit
- on checkpoint
- every N writes (kind of like autocheckpoint)
- after some idle / age threshold
My instinct is that for the workload I care about, the best answer is probably hybrid. The VFS should have a tier-aware policy internally that users can configure with separate policies for interior/index/data pages. But the user/application may still be in the best position to say “this tenant/session DB is cold now, evict aggressively.”
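To make the tier-aware part concrete, the policy hook might look something like this. The tiers and thresholds are made-up illustrations, not turbolite's actual defaults:

```python
import time

# Made-up per-tier idle limits: interior pages are tiny to keep and
# expensive to miss, so they get the longest residency.
MAX_IDLE_S = {"interior": 3600.0, "index_leaf": 600.0, "table_leaf": 60.0}

def should_evict(tier, last_access, cache_bytes, cache_limit, now=None):
    """Decide whether a cached page should be dropped."""
    now = time.monotonic() if now is None else now
    idle = now - last_access
    if cache_bytes > cache_limit:    # hard pressure: shed leaves, keep interior
        return tier != "interior"
    return idle > MAX_IDLE_S[tier]   # otherwise age out per tier
```

An application-level "this tenant is cold now" signal would then just be a forced pass over that tenant's pages with the thresholds set to zero.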
If you evict the cache on every read/write, then every read and write hits S3 with GETs. PUTs are usually 10-100x more expensive than GETs. Check the benchmark/README.md for GET counts, but it's usually 5-50 GETs per cold read (with interior B-tree pages on disk).
S3 GETs are $0.0004/1000, S3 Express is $0.00003. So 10 queries per minute all day long averaging 20 GET operations, with full eviction on each request, would be 20*10*$0.0004*60*24/1000*30 = $3.45 per month. With S3 Express One Zone that's $0.26/mo. Both plus storage costs, which would probably be lower.
On Wasabi, that would be just the cost of storage (but they have a minimum 1TB requirement at $6.99/mo).
If you checkpoint after every write and write 10 times per minute, with each write hitting e.g. 5 page groups (S3 objects) plus the manifest file, the analysis looks like: 6*10*$0.005*60*24/1000*30 = $12.96. Again, that's worst case for the benchmark's 1.5GB database. On S3 Express that's $2.92.
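As a function, the spreadsheet math above looks like this (prices as quoted: S3 GET $0.0004/1k, S3 Express GET $0.00003/1k, S3 PUT $0.005/1k):

```python
def monthly_request_cost(ops_per_query, queries_per_min, price_per_1k):
    """Requests per month times the per-1000-request price."""
    ops_per_month = ops_per_query * queries_per_min * 60 * 24 * 30
    return ops_per_month * price_per_1k / 1000

gets_s3 = monthly_request_cost(20, 10, 0.0004)        # standard S3 GETs
gets_express = monthly_request_cost(20, 10, 0.00003)  # S3 Express One Zone GETs
puts_s3 = monthly_request_cost(6, 10, 0.005)          # 5 page groups + manifest, per write
```

Plug in your own request counts and the per-backend numbers above; request cost scales linearly in all three inputs, so halving eviction frequency halves the GET bill directly.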
Point is, it's not too bad, and that's kind of a worst-case scenario: evicting on every request, every 6 seconds, isn't really realistic. If you evict the cache hourly instead, that cost drops to 1/600th of that, less than $0.01 per month.
Summary: use S3 Express One Zone, don't evict the cache too often, checkpoint to S3 once a minute (turbolite lets you checkpoint either locally (disk-durability) or locally+S3 separately), and you're running a decent workload for pennies every month.
[Apologies for the spreadsheet math in plaintext]