Following up on Bryan Newbold's request for trip reports this afternoon, I wanted to weigh in and provide an abbreviated tour of the Graze architecture for feed serving, and how we built a system to deliver (and bean-count the delivery of) 24.5b posts across thousands of feeds in the first year of operation at Graze. There are a few large-scale operations in the ATmosphere, and it's our pleasure to share some of the painful lessons we learned along the way to help people on their own journeys.
When Graze launched in early November of 2024, we started on "God's own Postgres". In this era of compute, nobody ever got fired for going with postgres, and it seemed like a great starting point. We caught fire in that first month and were quickly on the hook for delivering something around a few million posts a day. One of the big requirements we wanted to maintain was analytics - we wanted to persist data about what we served to whom and when, so we could deliver rich introspective details to our users - so we ended up storing a ton of data about those deliveries.
Within 30 days of being online, Graze was already falling over with postgres. We had specified very small, very exact tables with exactly the format we needed. And yet, the query planner would continually reject the path I would have expected and instead do horrible table scans in all the wrong places. Over an incredibly painful few days, during which we watched our query times climb, we realized we needed to stem the bleeding with what we now call our "front-line workers" - a set of 6-12 small servers that are essentially "HTTP repeaters": they are the actual servers responding to api.graze.social requests, and they hold in redis a snapshot of high-use feeds for up to 15 seconds to reduce backpressure on the "live" servers that actually talk to the database.
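A minimal sketch of that front-line caching pattern (simplified from our actual repeaters), assuming redis-py and a render_feed() callable that stands in for the real work against the live servers:
import json
import redis

CACHE_TTL_SECONDS = 15  # high-use feed snapshots live for at most ~15 seconds

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def serve_feed(feed_uri, cursor, render_feed):
    """Cache-aside: return a fresh snapshot if we have one, otherwise render and cache it."""
    # render_feed is a hypothetical callable that queries the "live" servers / database
    cache_key = f"feed-snapshot:{feed_uri}:{cursor or 'head'}"
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    page = render_feed(feed_uri, cursor)
    r.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(page))
    return page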
Even with that, a few days later, the query latency for fetching posts kept increasing and that strategy started to fall over. We needed a radical departure, probably off postgres altogether. For better or worse, under immediate duress, we moved over to Clickhouse Cloud and it's been a dream. Here's a bit about what happens when you query a feed hosted on Graze today:
1. HTTP repeaters still exist, and still return cached content for very small snippets of time for almost all feeds (we're not personalizing feeds quite yet),
2. Live requests hit an array of nine tables depending on different features a feed uses (i.e. reverse chronological requires very little, while a trending feed with user interaction cutoffs requires a ton more),
3. We compile a SQL query with joins against the relevant tables (see the sketch after this list),
4. Literally that's it.
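As a rough illustration of steps 2 and 3 - definitely not our real query compiler - here's a toy sketch that assembles a query from a couple of hypothetical feature flags; the two tables it references are defined later in this post:
def compile_feed_query(algo_id, features, limit=30):
    """Assemble one SELECT, joining only what this feed's features actually need."""
    joins = []
    where = [f"p.algo_id = {algo_id}"]
    order = "p.bluesky_created_at DESC"  # plain reverse-chron needs nothing else

    if "trending_sort" in features or "like_threshold" in features:
        # hypothetical fragment: fold in the engagement rollup described further down
        joins.append(
            "ANY LEFT JOIN (SELECT action_identifier, sum(like_count) AS likes"
            f" FROM default.uri_action_counts_v2 WHERE algo_id = {algo_id}"
            " GROUP BY action_identifier) AS a ON a.action_identifier = p.uri"
        )
        if "like_threshold" in features:
            where.append("a.likes >= 10")
        if "trending_sort" in features:
            order = "a.likes DESC"

    return (
        "SELECT p.uri, p.cid, p.bluesky_created_at"
        " FROM default.algorithm_posts_v2 AS p "
        + " ".join(joins)
        + f" WHERE {' AND '.join(where)} ORDER BY {order} LIMIT {limit}"
    )

print(compile_feed_query(42, {"trending_sort"}))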
On the ingest side, we have one jetstream listener, which packages up batches of ≈100 posts at a time, then sends them into a "turbostream" worker that makes multiple concurrent requests for the objects referenced by all the posts coming off the jetstream (i.e. if there's a quoted post / parent post, go grab that post; go grab user data for mentioned users and authors; etc.). We then have some internal LRUs so we're not re-requesting the same thing over and over in short periods of time.
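For a feel of the shape of that path, here's a heavily simplified asyncio sketch; hydrate() is a stand-in for the concurrent fetches and LRU caching described above:
import asyncio
import json
import websockets  # pip install websockets

# one of the public jetstream endpoints
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)
BATCH_SIZE = 100  # roughly the batch size we hand to the turbostream worker

async def hydrate(event):
    """Hypothetical: fetch quoted/parent posts and author profiles for this event."""
    return event  # real code fans out concurrent requests and caches results in LRUs

async def turbostream(queue):
    while True:
        batch = await queue.get()
        enriched = await asyncio.gather(*(hydrate(e) for e in batch))
        print(f"enriched {len(enriched)} events")
        # ... hand the enriched batch to "megastream" for media + text inference ...

async def listen():
    queue = asyncio.Queue(maxsize=10)
    worker = asyncio.create_task(turbostream(queue))
    batch = []
    async with websockets.connect(JETSTREAM_URL) as ws:
        async for raw in ws:
            batch.append(json.loads(raw))
            if len(batch) >= BATCH_SIZE:
                await queue.put(batch)
                batch = []

if __name__ == "__main__":
    asyncio.run(listen())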
Turbostream feeds into "megastream", where we download images and videos, run ML inference against those, extract all the text, run a bunch of inference on all that text, and then these "megastream-enriched" records go into the "algo-smasher": a grid of machines that take the thousands of rules our users have created for their respective feeds (some rule sets are only 80 lines of formatted JSON, others are in excess of 50,000 lines). We use early quitting on all the scans (i.e. if this post could never satisfy the rules because it has already failed on an earlier branch, stop), but otherwise we're literally brute-force scanning for set inclusion for every post, for every feed, in real time. It's immensely expensive - probably something like ≈5 trillion post-to-feed checks just this year. This is painful.
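To make "early quitting" concrete, here's a toy version of the scan, pretending each feed's ruleset compiles down to nested AND/OR nodes of predicates (the real rule format is richer than this):
def evaluate(node, post):
    """Short-circuit an AND/OR rule tree: bail as soon as a branch decides the outcome."""
    kind = node["type"]
    if kind == "and":
        return all(evaluate(child, post) for child in node["children"])  # stops at first False
    if kind == "or":
        return any(evaluate(child, post) for child in node["children"])  # stops at first True
    if kind == "predicate":
        return node["fn"](post)
    raise ValueError(f"unknown node type: {kind}")

def smash(post, feeds):
    """Brute-force set inclusion: return the algo_ids whose rules this post satisfies."""
    return [algo_id for algo_id, tree in feeds.items() if evaluate(tree, post)]

# Illustrative rule tree, not Graze's actual JSON format:
# a feed that wants English-language posts mentioning clickhouse
feeds = {
    42: {
        "type": "and",
        "children": [
            {"type": "predicate", "fn": lambda p: "en" in p.get("langs", [])},
            {"type": "predicate", "fn": lambda p: "clickhouse" in p.get("text", "").lower()},
        ],
    }
}
print(smash({"text": "ClickHouse is neat", "langs": ["en"]}, feeds))  # -> [42]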
The output is, finally, the subset of posts that match a given feed (typically expressed by its algorithm ID - the ID in postgres for that algorithm; yes, we still use postgres for most of our application logic). We store this data in two tables:
CREATE TABLE default.algorithm_posts_uri_lookup_v2
(
`uri` String,
`algo_id` UInt32,
`inserted_at` DateTime DEFAULT now(),
INDEX idx_algo_id algo_id TYPE minmax GRANULARITY 1,
INDEX idx_uri_bloom uri TYPE bloom_filter(0.001) GRANULARITY 1
)
ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toYYYYMM(inserted_at)
ORDER BY (algo_id, cityHash64(uri), uri)
SETTINGS index_granularity = 8192, max_bytes_to_merge_at_max_space_in_pool = 10737418240
This table keeps track of which posts appear in which feeds as a support table (i.e. where does this post appear throughout Graze feeds, and vice versa). There are about 20 billion records in this table.
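Two example lookups against it, one in each direction (the inline string formatting here is just for readability - use proper query parameters in real code):
def feeds_containing(uri):
    """Which Graze feeds has this post appeared in?"""
    return (
        "SELECT DISTINCT algo_id FROM default.algorithm_posts_uri_lookup_v2"
        f" WHERE uri = '{uri}'"  # the uri bloom filter index makes this tolerable
    )

def recent_posts_in_feed(algo_id, since):
    """Which posts entered this feed after a given timestamp?"""
    return (
        "SELECT uri, inserted_at FROM default.algorithm_posts_uri_lookup_v2"
        f" WHERE algo_id = {algo_id} AND inserted_at >= '{since}'"
        " ORDER BY inserted_at DESC"
    )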
We then store the actual details of the match on algorithm_posts_v2, the meat of our feed serving operation:
CREATE TABLE default.algorithm_posts_v2
(
`uri` String,
`cid` String,
`author` String,
`reply_parent` Nullable(String),
`reply_root` Nullable(String),
`bluesky_created_at` DateTime,
`algo_id` UInt32,
`indexed_at` DateTime,
`created_at` DateTime DEFAULT now(),
`version` UInt32 DEFAULT 1,
INDEX idx_created_at_minmax bluesky_created_at TYPE minmax GRANULARITY 3
)
ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY intHash32(algo_id) % 50
ORDER BY (algo_id, bluesky_created_at)
SETTINGS index_granularity = 8192
This table has 36 billion records on it.
We write to this table through a buffer table, which holds lots of small writes in memory for a few moments and then sends them to the underlying table in bulk. Moving to buffer tables reduced server load by an extraordinary amount. On the other lookup table, a buffer table actually increased the load. We haven't had time to learn why. Go figure. But give it a shot.
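For reference, this is ClickHouse's Buffer table engine (async inserts are the other common route). A sketch with illustrative thresholds, not our production numbers:
BUFFER_DDL = """
CREATE TABLE default.algorithm_posts_v2_buffer AS default.algorithm_posts_v2
ENGINE = Buffer(default, algorithm_posts_v2,
    16,                  -- num_layers
    5, 30,               -- min_time, max_time (seconds)
    10000, 1000000,      -- min_rows, max_rows
    1048576, 104857600)  -- min_bytes, max_bytes
"""
# Writers then INSERT INTO default.algorithm_posts_v2_buffer; the buffer flushes to the
# base table once any max_* threshold is crossed or all min_* thresholds are met, and
# reads hit the base table as usual.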
These tables would be sufficient to serve basic reverse chronological feeds - but on Graze we also let users sort feeds with trending scores (i.e. what's "doing numbers"), and threshold feeds based on interaction (i.e. only show stuff that's "doing numbers", or the opposite, only show stuff that's not doing numbers). A fundamental requirement at Graze is to be "metric agnostic" - we don't want to make any hard assumptions about which sorting mechanisms are good, which thresholds are best, or which factors should matter most in scoring (i.e. is it about likes or about reposts?). One of the most important things at Graze is to increase the agency of our users and enable them to find the future of the attention economy through maximal algorithmic choice. It's much more painful to support that, of course, because we're then on the hook to allow arbitrary thresholding of any nature. There are definitely some feeds causing incredible amounts of load with poorly specified thresholding, but that is a Later Problem; the more important thing here is user agency.
So, the "thresholding and sorting tables" - we track engagement on every post that flows through a Graze feed from the moment we recognize it as being in one of our feeds (via that lookup table above). For every post, we track social actions (e.g. likes/quotes/replies etc) by running a dedicated jetstream listener, checking which posts from a burst of activity are in feeds hosted on Graze, then incrementing those scores on a "global action logs" table:
CREATE TABLE default.global_action_logs
(
`algo_id` UInt32,
`action_type` String,
`action_identifier` String,
`action_time` DateTime,
`action_count` UInt32 DEFAULT 1
)
ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toYYYYMM(action_time)
ORDER BY (action_type, algo_id, action_time, action_identifier)
SETTINGS index_granularity = 8192
This table literally just sums up actions over a small snippet of time (e.g. 7 likes in this 15-second window on this post (action_identifier), which is in this feed (algo_id)). There are ≈250b records on this thing. We then have a materialized view on top of this:
CREATE MATERIALIZED VIEW default.uri_action_counts_v2
(
`action_identifier` String,
`algo_id` UInt32,
`action_date` Date,
`like_count` UInt64,
`reply_count` UInt64,
`repost_count` UInt64,
INDEX idx_action_id_bloom action_identifier TYPE bloom_filter() GRANULARITY 3,
INDEX idx_date_minmax action_date TYPE minmax GRANULARITY 3
)
ENGINE = SharedSummingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toYYYYMM(action_date)
ORDER BY (algo_id, action_identifier, action_date)
SETTINGS index_granularity = 8192
AS SELECT
action_identifier,
algo_id,
toDate(action_time) AS action_date,
sumIf(action_count, action_type = 'app.bsky.feed.like') AS like_count,
sumIf(action_count, action_type = 'app.bsky.feed.reply') AS reply_count,
sumIf(action_count, action_type = 'app.bsky.feed.repost') AS repost_count
FROM default.global_action_logs
GROUP BY
algo_id,
action_identifier,
action_date
This reduces it to ≈25b records. We then use this to query for "global" activity (i.e. thresholds based on, say, 10 likes from anyone in the ecosystem). We also have "reader-specific" thresholding (a reader is simply anyone who has ever requested a page from the feed). This is a much smaller action set of ≈640m records, which reduces to 9.6m interactions on the materialized view since we started tracking feed feedback for all feeds:
CREATE TABLE default.user_action_logs
(
`algo_id` UInt32,
`user_did` String,
`action_type` String,
`action_identifier` Nullable(String),
`action_time` DateTime,
`action_count` UInt32 DEFAULT 1
)
ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toYYYYMM(action_time)
ORDER BY (algo_id, user_did, action_time, action_type)
SETTINGS index_granularity = 8192;
CREATE MATERIALIZED VIEW default.reader_action_counts_v3
(
`action_identifier` String,
`algo_id` UInt32,
`action_date` Date,
`like_count` UInt64,
`reply_count` UInt64,
`repost_count` UInt64,
`interaction_like_count` UInt64,
`interaction_reply_count` UInt64,
`interaction_repost_count` UInt64,
`interaction_quote_count` UInt64,
`interaction_seen_count` UInt64,
`request_less_count` UInt64,
`request_more_count` UInt64,
INDEX idx_action_id_bloom action_identifier TYPE bloom_filter() GRANULARITY 3,
INDEX idx_date_minmax action_date TYPE minmax GRANULARITY 3
)
ENGINE = SharedSummingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toYYYYMM(action_date)
ORDER BY (algo_id, action_identifier, action_date)
SETTINGS index_granularity = 8192
AS SELECT
action_identifier,
algo_id,
toDate(action_time) AS action_date,
sumIf(action_count, action_type = 'app.bsky.feed.like') AS like_count,
sumIf(action_count, action_type = 'app.bsky.feed.reply') AS reply_count,
sumIf(action_count, action_type = 'app.bsky.feed.repost') AS repost_count,
sumIf(action_count, action_type = 'app.bsky.feed.defs#interactionLike') AS interaction_like_count,
sumIf(action_count, action_type = 'app.bsky.feed.defs#interactionReply') AS interaction_reply_count,
sumIf(action_count, action_type = 'app.bsky.feed.defs#interactionRepost') AS interaction_repost_count,
sumIf(action_count, action_type = 'app.bsky.feed.defs#interactionQuote') AS interaction_quote_count,
sumIf(action_count, action_type = 'app.bsky.feed.defs#interactionSeen') AS interaction_seen_count,
sumIf(action_count, action_type = 'app.bsky.feed.defs#requestLess') AS request_less_count,
sumIf(action_count, action_type = 'app.bsky.feed.defs#requestMore') AS request_more_count
FROM default.user_action_logs
GROUP BY
algo_id,
action_identifier,
action_date
Finally, we honor deletions from original authors, as well as requests by feed operators to remove posts from their feeds. This is an extremely expensive operation and one of the "no-no's" in Clickhouse: reaching into the database to edit records in-line or delete records from a table needs to be done extremely sparingly. For that reason, we write the deletions to another table, then periodically flush that data in what are essentially scheduled garbage collection events. All these tables look about the same:
CREATE TABLE default.clear_status
(
`algo_id` UInt32,
`uri` String,
`hidden` Bool DEFAULT false,
`updated_at` DateTime DEFAULT now(),
INDEX idx_uri_bloom uri TYPE bloom_filter() GRANULARITY 3
)
ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (algo_id, uri)
SETTINGS index_granularity = 8192;
CREATE TABLE default.hidden_status
(
`algo_id` UInt32,
`uri` String,
`hidden` Bool DEFAULT false,
`updated_at` DateTime DEFAULT now(),
INDEX idx_uri_bloom uri TYPE bloom_filter() GRANULARITY 3
)
ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (algo_id, uri)
SETTINGS index_granularity = 8192;
CREATE TABLE default.post_deletions
(
`uri` String,
`deleted_at` DateTime DEFAULT now(),
INDEX uri_bloom_idx uri TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY uri
SETTINGS index_granularity = 8192
And our queries look at these tables and diff out anything pending there.
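Illustratively, the serve-time diff and the periodic flush look something like this (column names match the DDL above; the predicates and scheduling are simplified, not our exact queries):
def feed_page_without_pending(algo_id, limit=30):
    """Serve-time diff: exclude anything sitting in the deletion/hidden queues."""
    return f"""
    SELECT p.uri, p.cid, p.bluesky_created_at
    FROM default.algorithm_posts_v2 AS p
    WHERE p.algo_id = {algo_id}
      AND p.uri NOT IN (SELECT uri FROM default.post_deletions)
      AND p.uri NOT IN (SELECT uri FROM default.hidden_status
                        WHERE algo_id = {algo_id} AND hidden = true)
    ORDER BY p.bluesky_created_at DESC
    LIMIT {limit}
    """

# Scheduled garbage collection: apply the queued deletions as one heavy mutation, then
# age out the queue table. The mutation is the expensive part, which is exactly why it
# only runs periodically instead of on every delete.
GC_MUTATION = """
ALTER TABLE default.algorithm_posts_v2
DELETE WHERE uri IN (SELECT uri FROM default.post_deletions)
"""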
The last thing that happens with our feeds is minimal content injection. For posts that our users pin at the top or put into rotation in a feed ("sticky" and "rotating" posts), we store those in postgres, since they need to come in and out very quickly and are often deleted. For sponsored posts running in a campaign, we need to pull them out of rotation as fast as possible once target impression counts are hit, to reduce cost overruns, so they live in postgres too.
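The injection step itself is simple. A simplified sketch, assuming the pinned URIs come back from a postgres query and the page is a getFeedSkeleton-style list of {"post": uri} items:
def inject_pinned(page, pinned_uris):
    """Put sticky posts at the top of the skeleton and de-duplicate the rest of the page."""
    # pinned_uris is a hypothetical list fetched from postgres; sponsored rotation works
    # the same way, with an extra impression-count check before inclusion
    pinned = [{"post": uri} for uri in pinned_uris]
    already = set(pinned_uris)
    return pinned + [item for item in page if item["post"] not in already]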
We use an old-school monitoring page that someone in Clickhouse customer support showed us to track activity on the database and see what our load looks like on a typical Saturday at 4pm west coast time.
The Clickhouse bill, when we started, was bad. We were spending something like ≈$9k a month because, well, we had about 48 hours to build in Clickhouse or die, so we built something that technically would not fall over, then immediately ran to the next fire (such is early stage life!). Over the course of a year, our team of relatively smart but non-expert engineers has brought that bill down by about 65%. There's still a ton of cost savings to be had, but we have other, bigger fires (the much larger cost is the algo smashing, as you may surmise). The kubernetes cluster we run for serving feeds is overprovisioned and runs a bunch of other services, but could easily handle the serving load for ≈$500 / mo if it were focused on just this part of the equation. Between sunsetting data past a given age, doing another few ablative sessions to cut down on excess costs, and - particularly relevant for other folks - not letting anyone threshold anything by any social signal, we bet there's another big chunk of cost savings against the database without a ton of effort as well.
So, to answer Bryan:
The pain points were postgres. Postgres fell over quick, quick, quick. Did we do everything right there? Probably not. But we did it well enough and it still totally fell short. It was extremely painful to learn this in prod.
Clickhouse ended up being amazing. It's super easy to use, has fairly good documentation, and playing with it is simple as heck. We can and do routinely try out new hundred-million-row table ideas on the fly. We can ask questions about billions of records in a few seconds. We can add indices or totally restructure and backfill painlessly. It's great.
One extremely specific nit - we had an experience where the Python version of the Clickhouse client package very clearly had a memory leak in its session management. Customer support said it didn't exist. We empirically showed the accretive memory bloat, query after query. We felt like we were insane. We ended up writing a thin stateless HTTP transaction layer in lieu of their package.
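The replacement didn't need to be fancy - a minimal stateless helper in the same spirit (placeholder endpoint and credentials below, and simplified from what we actually run) is just one requests.post per query against Clickhouse's HTTP interface:
import requests

CLICKHOUSE_URL = "https://YOUR-INSTANCE.clickhouse.cloud:8443"  # placeholder
AUTH = ("default", "YOUR-PASSWORD")                             # placeholder

def ch_query(sql, timeout=30.0):
    """Run one query over the HTTP interface; no sessions, no client-side state to leak."""
    resp = requests.post(
        CLICKHOUSE_URL,
        params={"default_format": "TabSeparatedWithNames"},
        data=sql.encode("utf-8"),
        auth=AUTH,
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.text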
Right now storing and serving this costs ≈$4k / mo. It could be made way cheaper (maybe down to $1-2k / mo) by spending more time with the slowdowns, making compromises about what our users can do, or having more than zero full time engineers at Graze today.
You should strongly consider using Clickhouse for Bluesky data.