Building on ATProto is a team sport. As we've shown previously, in open social we only win when other folks in the ATmosphere win. To that end, the Graze team is delighted to announce access, effective immediately, to two archived datasets for researchers, developers, archivists, and anyone else looking to push the boundaries of the ATmosphere.
Turbostream
The turbostream has been available via websocket for about six months. In short, it is a stream of metadata-enriched posts in which the objects a post references, such as its author, mentioned users, and parent or quoted posts, are hydrated. Under the hood, we've been storing that data in S3 for long-term archival. We've now opened that S3 bucket to outside users and set it up for requester-pays access. In theory, nearly every post should be in this archive, enriched with its referenced objects to the greatest extent possible.
Megastream
The megastream is a relatively new dataset: it is the turbostream, further enriched with ML inferences. At Graze, we run a handful of ML classifiers against every post so that our users can filter content by those classifications. We also generate several text embeddings and, as of recently, text transcriptions for every video passing through Bluesky. All of this is now generally available in the megastream bucket. The turbostream archive begins on 2025-04-21; the megastream archive begins on 2025-09-09.
Graze Bluesky Archive Access
Two S3 buckets provide enriched Bluesky data snapshots as SQLite databases:
graze-turbo-01: Turbostream archive (hydrated references, no ML inferences)
graze-mega-02: Megastream archive (turbostream + ML inferences)
What's Inside
Each file contains a several-minute slice of the Bluesky firehose that has been progressively enriched:
Turbostream Archive (graze-turbo-01)
Available from: April 21, 2025
Jetstream: Raw Bluesky events (posts, likes, follows, etc.)
Turbostream: Hydrated references including full user profiles, mentions, parent/reply posts, and quoted posts
Megastream Archive (graze-mega-02)
Available from: September 9, 2025
Jetstream: Raw Bluesky events
Turbostream: Hydrated references
Megastream: Machine learning inferences added to each record
ML Inferences Included
The Megastream enrichment adds extensive analysis to each post, including:
Language detection: Probability scores for 20+ languages
Content moderation: Flags for violence, hate speech, self-harm, sexual content, harassment
Sentiment analysis: Positive, negative, and neutral classification
Topic classification: 20+ categories (Gaming, Arts & Culture, News, Sports, etc.)
Emotion detection: 28 emotions (Joy, Anger, Surprise, Sadness, Amusement, etc.)
Toxicity scores: Threat, insult, identity hate, obscenity levels
Financial sentiment: Market-relevant positive/negative/neutral signals
Marketing detection: Spam vs organic content classification
Text embeddings: Vector representations for semantic search (multiple models)
All inference scores are included as probability values (0-1 range) for each record.
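Because every inference is a probability, a common first step is to pick the most likely label and discard low-confidence results. The sketch below assumes you have already loaded one classifier's scores into a plain dict. The label names, the values, and the 0.5 threshold are hypothetical, and the actual field layout inside the archive files is not documented here.

```python
from typing import Optional


def top_label(scores: dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label if it clears the threshold, else None."""
    if not scores:
        return None
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= threshold else None


# Hypothetical sentiment scores for one post (illustrative values only,
# not taken from the archive).
example_scores = {"positive": 0.81, "negative": 0.07, "neutral": 0.12}
print(top_label(example_scores))
```

A threshold that works for one classifier may be too strict or too loose for another, so it is worth tuning per task.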
File Format
Turbostream Archive
jetstream_YYYYMMDD_HHMMSS.db.zip
Example:
jetstream_20250421_235152.db.zip
Megastream Archive
mega/mega_jetstream_YYYYMMDD_HHMMSS.db.zip
Example:
mega/mega_jetstream_20250909_181102.db.zip
Each .db.zip file is a compressed SQLite database containing enriched Bluesky posts from a specific time window.
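Since the table layout is not described here, a reasonable first step after downloading a file is to unzip it and ask SQLite which tables it holds. This is a minimal sketch using only the Python standard library. The function name `list_tables` is ours, and it assumes each archive contains a single SQLite file.

```python
import sqlite3
import tempfile
import zipfile


def list_tables(zip_path: str) -> dict[str, int]:
    """Unzip a .db.zip archive and return {table_name: row_count}."""
    with tempfile.TemporaryDirectory() as workdir:
        with zipfile.ZipFile(zip_path) as archive:
            # Assumption: the first non-directory member is the SQLite file.
            members = [n for n in archive.namelist() if not n.endswith("/")]
            db_path = archive.extract(members[0], path=workdir)
        conn = sqlite3.connect(db_path)
        try:
            # sqlite_master is SQLite's built-in catalog of schema objects.
            names = [
                row[0]
                for row in conn.execute(
                    "SELECT name FROM sqlite_master "
                    "WHERE type = 'table' ORDER BY name"
                )
            ]
            return {
                name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                for name in names
            }
        finally:
            conn.close()
```

Calling `list_tables("jetstream_20250421_235152.db.zip")` on a downloaded file should show which tables to query next.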
Prerequisites
An AWS account (with whitelisted access - see below)
AWS CLI installed (installation guide)
AWS credentials configured (aws configure)
Getting Access
These buckets use access control via whitelist. To request access:
Fill out this Google Form to request access and agree to our usage terms
Get your AWS account ID by running:
aws sts get-caller-identity --query Account --output text
Send your 12-digit account ID to the bucket administrator
Once whitelisted, you'll be able to access the buckets using the commands below
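If you work in Python rather than the CLI, the same account ID can be read through boto3's STS client. This is a small sketch: the helper name `get_account_id` is ours, and it takes the client as an argument so that it can be exercised without live credentials.

```python
import re


def get_account_id(sts_client) -> str:
    """Return the caller's AWS account ID, checking that it is 12 digits."""
    account = sts_client.get_caller_identity()["Account"]
    if not re.fullmatch(r"\d{12}", account):
        raise ValueError(f"unexpected account ID format: {account!r}")
    return account


# Usage (requires boto3 and configured credentials):
#   import boto3
#   print(get_account_id(boto3.client("sts")))
```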
Accessing the Buckets
Both buckets use Requester Pays, which means you pay the request and data transfer costs when listing or downloading files. Storage costs are covered by the bucket owner.
List All Files
Turbostream archive:
aws s3 ls s3://graze-turbo-01/ --request-payer requester
Megastream archive:
aws s3 ls s3://graze-mega-02/mega/ --request-payer requester
Download a Specific File
Turbostream:
aws s3 cp s3://graze-turbo-01/jetstream_20250421_235152.db.zip . --request-payer requester
Megastream:
aws s3 cp s3://graze-mega-02/mega/mega_jetstream_20250909_181102.db.zip . --request-payer requester
Download All Files
Turbostream:
aws s3 sync s3://graze-turbo-01/ ./turbo-archive/ --request-payer requester
Megastream:
aws s3 sync s3://graze-mega-02/mega/ ./mega-archive/ --request-payer requester
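Syncing a whole archive can move a lot of data, so you may prefer to download only a time window. Because each key embeds a timestamp, a list of keys can be filtered before anything is fetched. The sketch below assumes the timestamp follows the YYYYMMDD_HHMMSS pattern shown earlier. The timezone, and whether the timestamp marks the start or the end of a slice, are not stated in this document, so the values are parsed as naive datetimes. The helper names are ours.

```python
import re
from datetime import datetime

# Matches jetstream_YYYYMMDD_HHMMSS.db.zip and mega/mega_jetstream_YYYYMMDD_HHMMSS.db.zip
_KEY_PATTERN = re.compile(r"jetstream_(\d{8}_\d{6})\.db\.zip$")


def key_timestamp(key: str):
    """Parse the timestamp embedded in an archive key, or return None."""
    match = _KEY_PATTERN.search(key)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")


def keys_in_window(keys, start: datetime, end: datetime):
    """Keep keys whose filename timestamp falls in [start, end)."""
    selected = []
    for key in keys:
        ts = key_timestamp(key)
        if ts is not None and start <= ts < end:
            selected.append(key)
    return selected
```

The resulting keys can then be downloaded one at a time instead of syncing the full bucket.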
Using Python (boto3)
import boto3

s3 = boto3.client('s3')

# List turbostream files
response = s3.list_objects_v2(
    Bucket='graze-turbo-01',
    RequestPayer='requester'
)
for obj in response.get('Contents', []):
    print(obj['Key'])

# List megastream files
response = s3.list_objects_v2(
    Bucket='graze-mega-02',
    Prefix='mega/',
    RequestPayer='requester'
)
for obj in response.get('Contents', []):
    print(obj['Key'])

# Download a turbostream file
s3.download_file(
    'graze-turbo-01',
    'jetstream_20250421_235152.db.zip',
    'local_turbo.db.zip',
    ExtraArgs={'RequestPayer': 'requester'}
)

# Download a megastream file
s3.download_file(
    'graze-mega-02',
    'mega/mega_jetstream_20250909_181102.db.zip',
    'local_mega.db.zip',
    ExtraArgs={'RequestPayer': 'requester'}
)
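One caveat for the listing calls: a single list_objects_v2 request returns at most 1,000 keys, and an archive of several-minute slices will hold far more than that. A sketch using boto3's built-in paginator follows. The helper name `iter_archive_objects` is ours, and it accepts the client as an argument so the logic can be tested without network access.

```python
def iter_archive_objects(s3_client, bucket: str, prefix: str = ""):
    """Yield (key, size_in_bytes) for every object, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        RequestPayer="requester",  # required on Requester Pays buckets
    )
    for page in pages:
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]


# Usage (requires boto3 and whitelisted credentials):
#   import boto3
#   for key, size in iter_archive_objects(boto3.client("s3"), "graze-mega-02", "mega/"):
#       print(key, size)
```

Yielding the size alongside the key makes it easy to total up how much data a download would transfer before starting it.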
Important Notes
Always include --request-payer requester in your commands or the request will fail
You will be charged AWS data transfer costs for downloads
Storage costs are covered by the bucket owner
Anonymous access is not supported - you must use authenticated AWS credentials
Cost Estimation
AWS S3 data transfer pricing (as of 2025; rates vary by region):
First 10 TB/month: $0.09/GB
Next 40 TB/month: $0.085/GB
Over 50 TB/month: Lower rates available
Check current pricing: https://aws.amazon.com/s3/pricing/
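For a rough budget before downloading, multiply the total size of the files you plan to fetch by the per-GB rate. The sketch below is a deliberately simplified, flat-rate estimate at the $0.09/GB first-tier rate listed above. It ignores higher-volume tiers and per-request charges, and it assumes 1 GB is 2^30 bytes, so treat the result as approximate. The function name is ours.

```python
def estimate_transfer_cost(total_bytes: int, usd_per_gb: float = 0.09) -> float:
    """Flat-rate estimate of data transfer cost in USD.

    Assumptions: a single per-GB rate (no tiering), 1 GB = 2**30 bytes,
    and no per-request charges.
    """
    return total_bytes / 1024**3 * usd_per_gb


# Example: 50 GB of archive files at the default rate.
print(f"${estimate_transfer_cost(50 * 1024**3):.2f}")
```

The byte total can come from summing object sizes in a bucket listing.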
Questions?
Contact Graze.social on BSky or via our site for assistance.