Building on ATProto is a team sport. As we've shown previously, in open social we only win when other folks in the ATmosphere win. In that spirit, the Graze team is delighted to announce access, effective immediately, to two archived datasets for researchers, developers, archivists, and anyone else looking to push the boundaries of the ATmosphere.

Turbostream

The turbostream has been available for about six months via websocket. In short, it is a stream of metadata-enriched posts in which the objects a post references are hydrated: the author of the post, mentioned users, parent and quoted posts, and so forth. Under the hood, we've been storing that data in S3 for long-term archival. We've now opened that S3 bucket to outside access and set it up as Requester Pays. In theory, nearly every post should be in this archive, enriched with these referenced objects to the greatest extent possible.

Megastream

The megastream is a relatively new dataset: it is the turbostream, further enriched with ML inferences. At Graze, we run a handful of ML classifiers against every post so that our users can filter content by those classifications. We also generate several text embeddings and, as of recently, text transcriptions for every video passing through Bluesky. All of this is now generally available in the megastream bucket. The turbostream archive begins at 2025-04-21, while the megastream archive begins at 2025-09-09.

Graze Bluesky Archive Access

Two S3 buckets provide enriched Bluesky data snapshots as SQLite databases:

  • graze-turbo-01: Turbostream archive (hydrated references, no ML inferences)

  • graze-mega-02: Megastream archive (turbostream + ML inferences)

What's Inside

Each file contains a several-minute slice of the Bluesky firehose that has been progressively enriched:

Turbostream Archive (graze-turbo-01)

Available from: April 21, 2025

  • Jetstream: Raw Bluesky events (posts, likes, follows, etc.)

  • Turbostream: Hydrated references including full user profiles, mentions, parent/reply posts, and quoted posts

Megastream Archive (graze-mega-02)

Available from: September 9, 2025

  • Jetstream: Raw Bluesky events

  • Turbostream: Hydrated references

  • Megastream: Machine learning inferences added to each record

ML Inferences Included

The Megastream enrichment adds extensive analysis to each post, including:

  • Language detection: Probability scores for 20+ languages

  • Content moderation: Flags for violence, hate speech, self-harm, sexual content, harassment

  • Sentiment analysis: Positive, negative, and neutral classification

  • Topic classification: 20+ categories (Gaming, Arts & Culture, News, Sports, etc.)

  • Emotion detection: 28 emotions (Joy, Anger, Surprise, Sadness, Amusement, etc.)

  • Toxicity scores: Threat, insult, identity hate, obscenity levels

  • Financial sentiment: Market-relevant positive/negative/neutral signals

  • Marketing detection: Spam vs organic content classification

  • Text embeddings: Vector representations for semantic search (multiple models)

All inference scores are included as probability values (0-1 range) for each record.
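
Because every score is a probability, a consumer will usually either apply a threshold or take the highest-scoring label. The following is a minimal sketch of both, assuming a hypothetical layout in which one classifier's output is a dict mapping label names to scores. The actual field names and nesting in the Megastream databases are not described here and may differ.

```python
from typing import Dict, List, Optional


def labels_above(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return labels whose probability meets the threshold, highest first."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within the 0-1 range")
    hits = [(label, p) for label, p in scores.items() if p >= threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in hits]


def top_label(scores: Dict[str, float]) -> Optional[str]:
    """Return the single most probable label, or None for an empty mapping."""
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical emotion scores for one post (label names are illustrative only).
example_scores = {"joy": 0.82, "surprise": 0.55, "anger": 0.03}
print(labels_above(example_scores, threshold=0.5))
print(top_label(example_scores))
```

The 0.5 default is arbitrary; a suitable threshold depends on the classifier and on how costly false positives are for your use case.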

File Format

Turbostream Archive

jetstream_YYYYMMDD_HHMMSS.db.zip

Example:

jetstream_20250421_235152.db.zip

Megastream Archive

mega/mega_jetstream_YYYYMMDD_HHMMSS.db.zip

Example:

mega/mega_jetstream_20250909_181102.db.zip

Each .db.zip file is a compressed SQLite database containing enriched Bluesky posts from a specific time window.
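
The table layout inside each database is not documented in this post, so a reasonable first step after downloading a file is to unzip it and ask SQLite which tables and columns it contains. The sketch below does that using only the Python standard library. It assumes each archive holds a single database file; the helper names `extract_db` and `describe_tables` are illustrative.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path
from typing import Dict, List


def extract_db(zip_path: str, dest_dir: str) -> Path:
    """Extract the first file in a .db.zip archive and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        members = [n for n in archive.namelist() if not n.endswith("/")]
        if not members:
            raise ValueError(f"{zip_path} contains no files")
        # Assumption: the archive holds exactly one SQLite database.
        return Path(archive.extract(members[0], path=dest_dir))


def describe_tables(db_path: Path) -> Dict[str, List[str]]:
    """Map each table name to its column names, read from SQLite metadata."""
    conn = sqlite3.connect(str(db_path))
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        conn.close()


if __name__ == "__main__":
    archive_name = "jetstream_20250421_235152.db.zip"
    if Path(archive_name).exists():  # only runs once the file has been downloaded
        with tempfile.TemporaryDirectory() as tmp:
            db_file = extract_db(archive_name, tmp)
            for table, columns in describe_tables(db_file).items():
                print(table, columns)
```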

Prerequisites

  • An AWS account (with whitelisted access - see below)

  • AWS CLI installed (installation guide)

  • AWS credentials configured (aws configure)

Getting Access

Access to these buckets is restricted to whitelisted AWS accounts. To request access:

  • Fill out this Google Form to request access and agree to our usage terms

  • Get your AWS account ID by running: aws sts get-caller-identity --query Account --output text

  • Send your 12-digit account ID to the bucket administrator

  • Once whitelisted, you'll be able to access the buckets using the commands below
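
If you prefer Python to the CLI for the account ID lookup, the same STS call is available through boto3. This is a minimal sketch, assuming boto3 is installed and credentials are configured; the function name `get_account_id` and the optional `sts_client` argument are illustrative additions that make the helper easy to test with a stub.

```python
import re


def get_account_id(sts_client=None) -> str:
    """Return the 12-digit AWS account ID for the active credentials."""
    if sts_client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3
        sts_client = boto3.client("sts")
    account_id = sts_client.get_caller_identity()["Account"]
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"unexpected account ID format: {account_id!r}")
    return account_id


# Usage (requires configured AWS credentials):
#     print(get_account_id())
```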

Accessing the Buckets

Both buckets use Requester Pays, which means you pay the request and data transfer costs when listing or downloading files. Storage costs are covered by the bucket owner.

List All Files

Turbostream archive:

aws s3 ls s3://graze-turbo-01/ --request-payer requester

Megastream archive:

aws s3 ls s3://graze-mega-02/mega/ --request-payer requester

Download a Specific File

Turbostream:

aws s3 cp s3://graze-turbo-01/jetstream_20250421_235152.db.zip . --request-payer requester

Megastream:

aws s3 cp s3://graze-mega-02/mega/mega_jetstream_20250909_181102.db.zip . --request-payer requester

Download All Files

Turbostream:

aws s3 sync s3://graze-turbo-01/ ./turbo-archive/ --request-payer requester

Megastream:

aws s3 sync s3://graze-mega-02/mega/ ./mega-archive/ --request-payer requester
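
Syncing an entire bucket transfers every file at your expense, so you may want only a specific time window. Since each key embeds a YYYYMMDD_HHMMSS stamp, one option is to list the keys and filter them by date before downloading. The sketch below shows the filtering step only. The helper names are illustrative, the second key in the example is made up, and the timestamps are treated as naive datetimes because the time zone of the filename stamp is not stated here.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional

# Matches both jetstream_... and mega/mega_jetstream_... keys.
_KEY_PATTERN = re.compile(r"(?:mega_)?jetstream_(\d{8})_(\d{6})\.db\.zip$")


def key_timestamp(key: str) -> Optional[datetime]:
    """Parse the YYYYMMDD_HHMMSS stamp from an archive key, or None if absent."""
    match = _KEY_PATTERN.search(key)
    if match is None:
        return None
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")


def keys_between(keys: Iterable[str], start: datetime, end: datetime) -> List[str]:
    """Keep keys whose filename stamp falls in the half-open range [start, end)."""
    selected = []
    for key in keys:
        stamp = key_timestamp(key)
        if stamp is not None and start <= stamp < end:
            selected.append(key)
    return selected


example_keys = [
    "jetstream_20250421_235152.db.zip",
    "jetstream_20250422_000410.db.zip",  # hypothetical key for illustration
    "mega/mega_jetstream_20250909_181102.db.zip",
]
print(keys_between(example_keys, datetime(2025, 4, 22), datetime(2025, 4, 23)))
```

Each selected key can then be downloaded individually with the Requester Pays option set, rather than syncing the whole archive.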

Using Python (boto3)

import boto3

s3 = boto3.client('s3')

# list_objects_v2 returns at most 1,000 keys per call, so page through the results
paginator = s3.get_paginator('list_objects_v2')

# List turbostream files
for page in paginator.paginate(
    Bucket='graze-turbo-01',
    RequestPayer='requester'
):
    for obj in page.get('Contents', []):
        print(obj['Key'])

# List megastream files
for page in paginator.paginate(
    Bucket='graze-mega-02',
    Prefix='mega/',
    RequestPayer='requester'
):
    for obj in page.get('Contents', []):
        print(obj['Key'])

# Download a turbostream file
s3.download_file(
    'graze-turbo-01',
    'jetstream_20250421_235152.db.zip',
    'local_turbo.db.zip',
    ExtraArgs={'RequestPayer': 'requester'}
)

# Download a megastream file
s3.download_file(
    'graze-mega-02',
    'mega/mega_jetstream_20250909_181102.db.zip',
    'local_mega.db.zip',
    ExtraArgs={'RequestPayer': 'requester'}
)

Important Notes

  • Always include --request-payer requester in your commands (RequestPayer='requester' in boto3), or the request will fail with a 403 Access Denied error

  • You will be charged AWS request and data transfer costs for listings and downloads

  • Storage costs are covered by the bucket owner

  • Anonymous access is not supported - you must use authenticated AWS credentials

Cost Estimation

AWS S3 data transfer out pricing (approximate, as of 2025; rates vary by region):

  • First 10 TB/month: $0.09/GB

  • Next 40 TB/month: $0.085/GB

  • Over 50 TB/month: Lower rates available

Check current pricing: https://aws.amazon.com/s3/pricing/
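
To get a rough figure before downloading, you can total the object sizes (for example, the Size field that list_objects_v2 returns for each key) and multiply by a per-GB rate. The sketch below assumes a flat $0.09/GB rate and treats one GB as 2**30 bytes. Both are simplifying assumptions, it ignores per-request charges, and the file sizes in the example are invented.

```python
from typing import Iterable

BYTES_PER_GB = 1024 ** 3  # assumption: one billed GB is 2**30 bytes


def estimate_transfer_cost(sizes_in_bytes: Iterable[int],
                           rate_per_gb: float = 0.09) -> float:
    """Estimate the data transfer cost in USD at a single flat per-GB rate."""
    if rate_per_gb < 0:
        raise ValueError("rate_per_gb must not be negative")
    total_bytes = sum(sizes_in_bytes)
    return total_bytes / BYTES_PER_GB * rate_per_gb


# Hypothetical example: 40 files of 250 MiB each.
sizes = [250 * 1024 ** 2] * 40
print(f"Estimated transfer cost: ${estimate_transfer_cost(sizes):.2f}")
```

Treat the result as an order-of-magnitude estimate and confirm against the current AWS price list for your region.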

Questions?

Contact Graze.social on Bluesky or via our site for assistance.