What types of social media datasets does SocialIntel offer?

SocialIntel offers datasets from 100+ social media platforms including Reddit, GitHub, YouTube, Medium, Discord, Telegram, and more. Each dataset is enriched with 250+ fields covering sentiment analysis, topic detection, financial signals, engagement metrics, and content classification.

How much does SocialIntel cost?

Catalog datasets up to 10K rows are always free to download. Individual datasets start at $18. Pro plans range from $78 to $504 per month, with annual pricing from $780 to $5,040 per year.

Can I request a custom dataset?

Yes! You can submit a custom dataset request specifying the platform, fields, date range, and format. Custom datasets are priced based on row count, starting at $18.

What formats are datasets available in?

Datasets are available in CSV, JSON, JSONL, and Parquet formats. You can download them directly or access them via our REST API.

Is the data ethically sourced and legal?

Yes. We collect only publicly accessible data from social media platforms using official APIs and permissive robots.txt directives. We comply with each platform's terms of service and applicable laws. See our Legality & Compliance page for a full platform-by-platform breakdown.

The foundation of everything we build

Data quality, baked in.

Every dataset passes through a multi-layer quality pipeline — from collection to delivery — ensuring production-ready data for AI training, research, and analytics.

Quality Stages: collection to delivery
Enrichment Checks: deterministic auto-corrections
Structural Fixes: Phase 1 + Phase 2 repairs
Dedup Strategies: hash, URL, author, content, enrichment

▸ Quality Pipeline

6-stage quality pipeline.

Every stage has specific quality gates — posts failing any gate are flagged, corrected, or rejected.

1. Collection

Raw data is scraped from 130+ platforms and quality-checked before it ever touches the database.

Content hash generated for deduplication
Post-level quality scoring (0-100)
NSFW content flows through full pipeline (no hardcoded sentiment bypass)
Platform-level NSFW flagging via over_18 subreddit attribute
Duplicate detection against existing DB hashes
Intra-batch duplicate removal
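
The hashing and duplicate-removal steps above could be sketched as follows. This is a minimal illustration, not SocialIntel's actual code: it assumes SHA-256 over lowercased, whitespace-normalized text, and the function names and the `content` key are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash lowercased, whitespace-normalized text (normalization is an assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(batch: list[dict], known_hashes: set[str]) -> list[dict]:
    """Keep posts whose hash is neither already stored nor seen earlier in the batch."""
    seen = set(known_hashes)  # copy so the caller's set is not mutated
    unique = []
    for post in batch:
        digest = content_hash(post["content"])
        if digest in seen:
            continue
        seen.add(digest)
        unique.append({**post, "content_hash": digest})
    return unique


batch = [
    {"content": "Hello  World"},
    {"content": "hello world"},     # intra-batch duplicate after normalization
    {"content": "Already stored"},  # matches a hash that is already in the database
    {"content": "Something new"},
]
existing = {content_hash("already stored")}
kept = drop_duplicates(batch, existing)
print([p["content"] for p in kept])
```

Checking against one combined set handles both the database comparison and the intra-batch pass in a single loop.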

2. Post Validation

Every post is validated across 7 dimensions before insertion. Posts scoring below 50/100 are rejected. Empty or stub rows (title-only, or with all fields null) are rejected before enrichment, with minimal overhead.

Content: min length, truncation, duplicate title
Author: missing, platform name, email/URL/hash
URL: missing, signup/login patterns, domain mismatch
Metrics: negatives, outliers, implausible ratios
Dates: future timestamps, excessive age (3yr+)
Content source: placeholder, URL-only, RSS teaser
Engagement sanity: likes > views, comments > views
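
One way to picture the scoring gate is a penalty-based score, sketched below. Only the rejection threshold of 50 comes from the text above. The penalty sizes, the 20-character minimum, the field names, and the subset of checks shown are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

REJECT_BELOW = 50  # threshold stated above; every penalty size below is an assumption


def validation_score(post: dict, now: datetime) -> int:
    """Start at 100 and subtract a penalty for each failed check."""
    score = 100
    if len(post.get("content") or "") < 20:        # Content: minimum length
        score -= 30
    if not post.get("author"):                     # Author: missing
        score -= 20
    if not post.get("url"):                        # URL: missing
        score -= 20
    likes, views = post.get("likes", 0), post.get("views", 0)
    if likes < 0 or views < 0:                     # Metrics: negative values
        score -= 20
    if views and likes > views:                    # Engagement sanity: likes > views
        score -= 20
    created = post.get("created_at")
    if created is None or created > now:           # Dates: missing or in the future
        score -= 20
    elif now - created > timedelta(days=3 * 365):  # Dates: older than about 3 years
        score -= 10
    return max(score, 0)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
good = {
    "content": "A reasonably long post about open-source tooling.",
    "author": "alice",
    "url": "https://example.com/p/1",
    "likes": 10,
    "views": 200,
    "created_at": now - timedelta(days=30),
}
bad = {"content": "hi", "author": "", "url": "", "likes": 500, "views": 20,
       "created_at": now + timedelta(days=1)}

for post in (good, bad):
    score = validation_score(post, now)
    print(score, "rejected" if score < REJECT_BELOW else "accepted")
```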

3. Enrichment

250+ enrichment fields computed per post via a deterministic pipeline: lexicon, sliding-window, and heuristic formulas.

VADER sentiment (lexicon) + sliding-window aspect sentiment
8-emotion model (joy, trust, fear, anger…) with polarity dampening on pre_alignment
Topic classification (16+ domains) with word-boundary regex matching
Flow-through NSFW pipeline (no hardcoded sentiment/toxicity bypass)
Financial signal detection (FUD, FOMO, pump)
Sarcasm & subjectivity scoring
Language detection (15 language codes)
Content quality & intensity scoring
VADER reliability flag: True for English ("en"/"eng"), False for unknown or empty language; fields zeroed for non-English posts
Near-duplicate detection via word-shingling per author
Self-healing exception handler — partial enrichment never crashes
65+ mandatory fields with typed defaults via _ensure_essential_fields
Platform-specific signal suppression (AO3, fanfiction: misinformation/propaganda disabled)
25 deterministic validation checks
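
The per-author near-duplicate step in the list above can be illustrated with word shingles and Jaccard similarity. This is a sketch under assumptions: the shingle size of 3, the 0.8 threshold, the use of Jaccard similarity, and all function and field names are hypothetical rather than taken from the pipeline.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(posts: list[dict], threshold: float = 0.8) -> list:
    """Flag posts that closely match an earlier post by the same author."""
    earlier: dict[str, list[set]] = {}
    flagged = []
    for post in posts:
        current = shingles(post["content"])
        previous = earlier.setdefault(post["author"], [])
        if any(jaccard(current, old) >= threshold for old in previous):
            flagged.append(post["id"])
        previous.append(current)
    return flagged


posts = [
    {"id": 1, "author": "alice",
     "content": "the quick brown fox jumps over the lazy dog today"},
    {"id": 2, "author": "alice",
     "content": "the quick brown fox jumps over the lazy dog today again"},
    {"id": 3, "author": "bob",
     "content": "the quick brown fox jumps over the lazy dog today"},
    {"id": 4, "author": "alice",
     "content": "completely different text about database indexing strategies"},
]
print(find_near_duplicates(posts))
```

Grouping by author keeps the comparison cheap and avoids flagging two different people who happen to quote the same text.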

4. Quality Correction

25 deterministic validation checks auto-correct enrichment data. Errors trigger pipeline flags.

Hate speech → force negative sentiment
Emotion score capping (max threshold)
Sentiment↔emotion contradiction auto-fix
Sentiment score-label consistency check
Tone alignment with sentiment label
Language-VADER mismatch detection
Phishing/scam content override
Timestamp validation & future-date fix
Platform mismatch correction
Missing critical fields filled with defaults
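
A few of the corrections listed above could be expressed as deterministic rules like the ones below. This is a hedged sketch: the field names, the cap of 1.0, the forced score of -0.5, and the label thresholds of ±0.05 (a convention commonly used with VADER compound scores) are assumptions, not documented SocialIntel values.

```python
EMOTION_CAP = 1.0  # assumed maximum for any emotion score


def label_for(score: float) -> str:
    """Map a sentiment score to a label using assumed +/-0.05 cutoffs."""
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"


def apply_corrections(record: dict) -> tuple[dict, list[str]]:
    """Return a corrected copy of the record and the names of the rules that fired."""
    fixed = dict(record)
    flags = []

    # Hate speech forces negative sentiment.
    if fixed.get("is_hate_speech") and fixed.get("sentiment_label") != "negative":
        fixed["sentiment_label"] = "negative"
        fixed["sentiment_score"] = min(fixed.get("sentiment_score", 0.0), -0.5)
        flags.append("hate_speech_override")

    # Emotion score capping.
    emotions = fixed.get("emotions", {})
    fixed["emotions"] = {name: min(value, EMOTION_CAP) for name, value in emotions.items()}

    # Sentiment score-label consistency.
    expected = label_for(fixed.get("sentiment_score", 0.0))
    if fixed.get("sentiment_label") != expected:
        fixed["sentiment_label"] = expected
        flags.append("score_label_mismatch")

    return fixed, flags


record = {"is_hate_speech": True, "sentiment_label": "positive",
          "sentiment_score": 0.6, "emotions": {"anger": 1.4, "joy": 0.2}}
fixed, flags = apply_corrections(record)
print(fixed["sentiment_label"], fixed["emotions"], flags)
```

Because each rule depends only on the record itself, running the function again on its own output changes nothing.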

5. Export & Dedup

Multi-strategy deduplication runs at export time. Schema is standardized across all formats.

Content-hash dedup (most reliable)
Enrichment duplicate_hash dedup
URL dedup (exact match)
Title + author combination dedup
Content-similarity dedup (first 200 chars)
Standardized column ordering
CSV · JSON · JSONL · Parquet formats
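
To make the multi-strategy idea concrete, the sketch below derives one key per strategy and drops a row as soon as any of its keys has been seen before. The column names (other than `duplicate_hash`, which is named above) and the lowercasing of title and author are assumptions for illustration.

```python
def dedup_keys(row: dict) -> list[tuple]:
    """Build one key per dedup strategy for which the row has data."""
    keys = []
    if row.get("content_hash"):
        keys.append(("content_hash", row["content_hash"]))
    if row.get("duplicate_hash"):
        keys.append(("duplicate_hash", row["duplicate_hash"]))
    if row.get("url"):
        keys.append(("url", row["url"]))  # exact match only
    if row.get("title") and row.get("author"):
        keys.append(("title_author",
                     (row["title"].strip().lower(), row["author"].strip().lower())))
    content = row.get("content") or ""
    if content:
        keys.append(("content_prefix", content[:200]))  # first 200 characters
    return keys


def export_dedup(rows: list[dict]) -> list[dict]:
    """Keep the first row for every key; drop rows that collide on any strategy."""
    seen = set()
    kept = []
    for row in rows:
        keys = dedup_keys(row)
        if any(key in seen for key in keys):
            continue
        seen.update(keys)
        kept.append(row)
    return kept


rows = [
    {"content_hash": "h1", "url": "https://example.com/a", "title": "Launch",
     "author": "alice", "content": "We shipped v2 today."},
    {"content_hash": "h2", "url": "https://example.com/a", "title": "Repost",
     "author": "carol", "content": "Another body."},   # same URL as the first row
    {"content_hash": "h3", "url": "https://example.com/b", "title": "launch ",
     "author": "Alice", "content": "Different body."},  # same title + author
    {"content_hash": "h4", "url": "https://example.com/c", "title": "Other",
     "author": "bob", "content": "Unique text."},
]
print([r["content_hash"] for r in export_dedup(rows)])
```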

6. Storage & Delivery

Cleaned data served via CDN. Catalog refresh updates record counts instantly.

Files uploaded with correct headers for delivery
CDN for fast global distribution
Catalog refresh updates DB record counts
API cache invalidation after refresh
Correct Content-Disposition for downloads

▸ Enrichment Quality

42 deterministic quality checks.

Each check runs after enrichment. Auto-corrections fix issues silently; errors set pipeline flags.

Hate Speech Override (error)
Language-Sentiment Consistency (warning)
Low Relevance Detection (info)
Emotion & Sentiment Score Capping (warning)
Syndication Evidence (warning)
Pipeline Completeness (warning)
Engagement Rate Cap (warning)
Tone-Sentiment Alignment (info)
Sentiment-Emotion Contradiction (warning)
Sentiment Score-Label Consistency (warning)
Language-VADER Mismatch (warning)
Reliability Flag Consistency (info)
Tech Stack Validation (info)
Content Length Recalc (info)
Emotion Mutual Exclusivity (info)
Phishing/Scam Detection (error)
Spam Detection (warning)
Piracy Landing Page Detection (warning)
Platform Mismatch (info)
Engagement Sanity (info)
Language Code Validation (warning)
Timestamp Validation (warning)
Missing Null Fallbacks (info)
Near-Duplicate Detection (info)
Quality Score Recalculation (info)
Financial Signal Gate (info)
On-Chain Signal Precision (info)
Health Condition Context (info)
Entity Extraction Precision (info)
VADER Lexicon Neutralized (info)
Finance Domain Context Gate (info)
Bot Template Detection (info)
Credibility Score Accuracy (info)
On-Chain Signal Completeness (info)
Entity Extraction Precision (info)
Hyphenated Content Handling (info)
ALL-CAPS Neutrality (info)
Translation Language Mapping (info)
URL Detection Completeness (info)
Platform-Aware Hashtag Density (info)
Mention-Only Placeholder Detection (info)

▸ Database Repairs

17 structural & enrichment fixes.

The fix_quality.py script runs idempotently across all datasets, repairing data integrity at scale.

Phase 1 — Structural

1. Duplicate Records

Remove exact + content-hash duplicates from the database

2. Missing Authors

Fill from platform-appropriate fallback values

3. author_followers 0→NULL

Distinguish "unknown" from "0 followers"

4. HTML Stripping

Strip HTML tags from content fields

5. Engagement Rate

Recalculate from raw metric values

6. Bad URLs

Remove signup/login/search pattern URLs

7. Placeholder Content

Remove "click here", "loading…", "see more"

8. Ingestion Timestamps

Null out ingestion dates, not creation dates

9. Off-topic Noise

Remove blocked platform data from wrong datasets

10. Metric Mappings

YouTube views/shares, Reddit views, etc.
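
Two of the structural repairs above, HTML stripping and the author_followers 0→NULL conversion, could look roughly like this. It is a sketch using Python's standard html.parser; the helper names and the `content` field are hypothetical, and fix_quality.py itself may work differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and discard tags; entities such as &amp; are decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Remove tags and collapse whitespace. Script/style bodies are kept as text."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def repair_row(row: dict) -> dict:
    """Return a repaired copy of the row, leaving the input untouched."""
    fixed = dict(row)
    fixed["content"] = strip_html(row.get("content") or "")
    if fixed.get("author_followers") == 0:
        fixed["author_followers"] = None  # 0 is treated as "unknown", not "no followers"
    return fixed


row = {"content": "<p>Hello <b>world</b> &amp; more</p>", "author_followers": 0}
once = repair_row(row)
print(once)
print(repair_row(once) == once)  # a second pass changes nothing
```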

Phase 2 — Enrichment

A. Tone Correction

Tone derived from sentiment label, not emotion

B. Contradictory Emotions

Mutual exclusivity + polarity alignment

C. Sentinel Values

Replace -1→None, "n/a"→null across fields

D. Language Detection

Re-run detection on unknown-language records

E. Inapplicable Fields

Clear platform-specific fields from wrong records

F. UI Boilerplate

Strip "Skip to Navigation" etc. from content

G. URL/Platform Mismatch

Fix records where URL doesn't match platform
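
The sentinel cleanup and URL/platform repair might be sketched as below. Several things here are assumptions: the list of numeric fields where -1 means "unknown", the domain table, and the choice to trust the URL over the stored platform label. Note that -1 is only treated as a sentinel in selected fields, since it can be a legitimate value elsewhere (a sentiment score, for example).

```python
from urllib.parse import urlparse

# Hypothetical: numeric fields where -1 can only mean "unknown".
SENTINEL_NUMERIC_FIELDS = {"author_followers", "views", "likes"}
# Hypothetical platform-to-domain table.
PLATFORM_DOMAINS = {"reddit": "reddit.com", "github": "github.com", "youtube": "youtube.com"}


def clean_sentinels(row: dict) -> dict:
    """Replace "n/a" strings everywhere and -1 in selected numeric fields with None."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str) and value.strip().lower() == "n/a":
            value = None
        elif key in SENTINEL_NUMERIC_FIELDS and value == -1:
            value = None
        cleaned[key] = value
    return cleaned


def platform_from_url(url: str):
    """Return the platform whose domain matches the URL host, or None."""
    host = urlparse(url).hostname or ""
    for platform, domain in PLATFORM_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return platform
    return None


def fix_platform(row: dict) -> dict:
    """Overwrite the platform label when the URL clearly points elsewhere."""
    detected = platform_from_url(row.get("url") or "")
    if detected and detected != row.get("platform"):
        return {**row, "platform": detected}
    return row


row = {"platform": "reddit", "url": "https://github.com/org/repo",
       "views": -1, "language": "n/a", "sentiment_score": -1.0}
print(fix_platform(clean_sentinels(row)))
```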

▸ Schema & Standards

Consistent schemas, every time.

Every dataset follows the same 200+ column standard, regardless of platform or format.

Standardized Schema

200+ columns with consistent ordering across all datasets. Core identifiers first, then content, metrics, enrichment fields.

Same columns for every platform
Predictable ordering
Backward-compatible additions
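
A standardized ordering with backward-compatible additions can be illustrated as follows. The ten column names are a made-up stand-in for the real 200+ column schema, and appending unknown columns at the end is an assumption about how additions are handled.

```python
# Hypothetical stand-in for the real schema: identifiers, content, metrics, enrichment.
STANDARD_ORDER = [
    "id", "platform", "url", "author",      # core identifiers
    "title", "content",                     # content
    "likes", "views",                       # metrics
    "sentiment_label", "sentiment_score",   # enrichment
]


def standardize(row: dict) -> dict:
    """Emit every standard column in fixed order, then any extra columns sorted by name."""
    ordered = {column: row.get(column) for column in STANDARD_ORDER}
    for extra in sorted(key for key in row if key not in ordered):
        ordered[extra] = row[extra]
    return ordered


row = {"views": 5, "id": "1", "new_field": True}
print(list(standardize(row).keys()))
```

Missing columns come out as None, so every platform yields the same columns, and a newly added field never shifts the position of an existing one.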

Dedup at Every Level

Four complementary strategies ensure no duplicate records reach your pipeline.

Content hash (SHA-256)
URL exact match
Title + author composite
Content-similarity check

Idempotent Repairs

The fix script is safe to run any number of times. All operations are idempotent.

Dry-run mode for preview
Per-step granular control
Enrichment-only mode
Safe for production use
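
The properties listed above (idempotence, dry-run preview, per-step control) can be shown with a small runner. This sketch is not the fix_quality.py script: the step names, the `only` and `dry_run` parameters, and the two example repairs are hypothetical.

```python
PLACEHOLDERS = {"click here", "loading…", "see more"}


def followers_zero_to_null(row: dict) -> dict:
    """Treat author_followers == 0 as unknown."""
    if row.get("author_followers") == 0:
        return {**row, "author_followers": None}
    return row


def placeholder_content(row: dict) -> dict:
    """Null out content that is only a UI placeholder."""
    if (row.get("content") or "").strip().lower() in PLACEHOLDERS:
        return {**row, "content": None}
    return row


STEPS = {
    "followers_zero_to_null": followers_zero_to_null,
    "placeholder_content": placeholder_content,
}


def run_repairs(rows, steps, only=None, dry_run=False):
    """Apply the selected steps; return (rows, per-step count of changed rows)."""
    selected = {name: fn for name, fn in steps.items() if only is None or name in only}
    report = {name: 0 for name in selected}
    repaired = []
    for row in rows:
        current = dict(row)
        for name, step in selected.items():
            updated = step(current)
            if updated != current:
                report[name] += 1
                current = updated
        repaired.append(current)
    # In dry-run mode the report is still computed, but the input is returned as-is.
    return (rows if dry_run else repaired), report


rows = [{"author_followers": 0, "content": "see more"},
        {"author_followers": 12, "content": "Real post"}]
_, preview = run_repairs(rows, STEPS, dry_run=True)
fixed, _ = run_repairs(rows, STEPS)
_, second_pass = run_repairs(fixed, STEPS)
print(preview, second_pass)
```

Each step returns the row unchanged once its condition no longer holds, which is what makes repeated runs safe.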

▸ Get started

Ready for quality data?

Browse the catalog, download a free sample, and see the quality for yourself. No account needed.
