Data Quality Report

Brain Combined Sentiment Dataset

Comprehensive data profiling, quality assessment, and index coverage analysis

Date: March 31, 2026
Source: Brain (brain_combined_sentiment)
Records: 7,116,723
Time Span: Jan 2017 to Mar 2025
Quality Grade: A
🏆 Quality Scorecard
Overall Score: 93.1 (Grade A)
Completeness: 97.6% (2 of 12 columns have nulls)
Consistency: 95% (0 missing business days)
Accuracy: 90% (outlier rate of 1.1% to 8.8% by column)
Timeliness: 88% (Latest: Mar 10, 2025)
📋 Dataset Overview
Total Records: 7.1M (7,116,723 rows × 12 columns)
Unique Tickers: 5,076 (~3,326 avg per day)
Trading Days: 2,136 (Jan 2, 2017 to Mar 10, 2025)

Dataset Description

Brain's Combined Sentiment dataset aggregates sentiment signals from three sources (news articles, earnings call transcripts, and 10-K filing language) into a single combined sentiment score for ~5,000 US equities. Each daily file contains sentiment metrics per ticker, with lookback windows configurable per source. This panel dataset is well suited to cross-sectional factor construction and event-driven signal research.

Schema Profile

| Column | Type | Non-Null | Null % | Unique | Description |
|---|---|---|---|---|---|
| DATE | string | 7,116,723 | 0% | 2,136 | Trading date (YYYY-MM-DD) |
| COMPOSITE_FIGI | string | 7,116,723 | 0% | 4,961 | Bloomberg FIGI identifier |
| NAME | string | 7,116,723 | 0% | 5,022 | Company name |
| TICKER | string | 7,116,723 | 0% | 5,076 | Stock ticker symbol |
| NEWS_N_PAST_DAYS_AGGR | int32 | 7,116,723 | 0% | 1 | News lookback window (constant=30) |
| NEWS_VOLUME | float64 | 6,073,565 | 14.7% | 3,479 | Article count in lookback period |
| NEWS_SENTIMENT | float64 | 6,073,565 | 14.7% | 864,063 | News-based sentiment score |
| EC_LAST_CALL_N_PAST_DAYS | float64 | 7,116,723 | 0% | 365 | Days since last earnings call |
| EC_LAST_CALL_SENTIMENT | float64 | 7,116,723 | 0% | 811,478 | Earnings call sentiment score |
| CF_LAST_10K_N_PAST_DAYS | float64 | 7,116,723 | 0% | 730 | Days since last 10-K filing |
| CF_LAST_10K_SENTIMENT | float64 | 7,116,723 | 0% | 561,920 | 10-K filing sentiment score |
| COMBINED_SENTIMENT | float64 | 7,116,723 | 0% | 700,120 | Weighted combined sentiment |
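
A profile like the one above can be reproduced with a few pandas calls. The following is a minimal sketch, assuming the data is already loaded into a DataFrame; `profile_schema` and the small `demo` frame are hypothetical names and made-up values used only for illustration.

```python
import pandas as pd

def profile_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, non-null count, null percentage, and distinct count."""
    return pd.DataFrame({
        "type": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "null_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(dropna=True),
    })

# Tiny illustrative frame (made-up values, not taken from the dataset)
demo = pd.DataFrame({
    "TICKER": ["AAA", "BBB", "AAA", "CCC"],
    "NEWS_VOLUME": [3.0, None, 5.0, None],
})
profile = profile_schema(demo)
print(profile)
```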
🔍 Data Completeness Analysis

Missing Data Summary

Columns with no nulls: 10 of 12
NEWS_VOLUME nulls: 1,043,158 (14.7%)
NEWS_SENTIMENT nulls: 1,043,158 (14.7%)
Null pattern: Perfectly correlated (same rows)
Missing business days: 0
Calendar coverage: 100% of business days

Interpretation: NEWS_VOLUME and NEWS_SENTIMENT are null for the same 1.04M rows — these represent tickers with zero news coverage in the 30-day lookback window. This is structurally expected (many small-caps have little news). The COMBINED_SENTIMENT still has values for these rows as it falls back on EC and 10K signals.
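
The "same rows" claim is easy to re-check on any new delivery. The sketch below is one way to do it, assuming a pandas DataFrame with the column names from the schema; `null_pattern` and the `demo` values are hypothetical.

```python
import pandas as pd

def null_pattern(df, col_a="NEWS_VOLUME", col_b="NEWS_SENTIMENT"):
    """Count rows where both columns are null and rows where exactly one is."""
    a_null = df[col_a].isna()
    b_null = df[col_b].isna()
    return {
        "both_null": int((a_null & b_null).sum()),
        "only_one_null": int((a_null ^ b_null).sum()),  # 0 means identical null rows
        "null_rate": float(a_null.mean()),
    }

# Made-up rows for illustration
demo = pd.DataFrame({
    "NEWS_VOLUME": [4.0, None, 2.0, None],
    "NEWS_SENTIMENT": [0.1, None, -0.2, None],
    "COMBINED_SENTIMENT": [0.05, 0.02, -0.10, 0.00],
})
result = null_pattern(demo)

# Combined score should still be populated where news data is missing
combined_present = bool(
    demo.loc[demo["NEWS_VOLUME"].isna(), "COMBINED_SENTIMENT"].notna().all()
)
print(result, combined_present)
```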

[Chart: Missing data by month]
🌍 Universe Coverage Over Time
[Chart: Universe size over time]
Min Tickers/Day: 3,057
Max Tickers/Day: 3,613
Avg Tickers/Day: 3,326
Total Unique: 5,076
[Chart: Ticker coverage distribution]
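
The per-day universe figures above come down to a distinct count of tickers per date. A minimal sketch, assuming the `DATE` and `TICKER` columns from the schema; `universe_summary` and the `demo` rows are hypothetical.

```python
import pandas as pd

def universe_summary(df, date_col="DATE", id_col="TICKER"):
    """Distinct identifiers per day, plus min/max/avg and the overall distinct count."""
    per_day = df.groupby(date_col)[id_col].nunique()
    summary = {
        "min": int(per_day.min()),
        "max": int(per_day.max()),
        "avg": float(per_day.mean()),
        "total_unique": int(df[id_col].nunique()),
    }
    return per_day, summary

# Made-up rows: 2 tickers on the first day, 3 on the second
demo = pd.DataFrame({
    "DATE": ["2025-03-03"] * 2 + ["2025-03-04"] * 3,
    "TICKER": ["AAA", "BBB", "AAA", "BBB", "CCC"],
})
per_day, summary = universe_summary(demo)
print(summary)
```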
📊 Index Coverage Analysis
S&P 500: 97.4% (443 / 455 constituents), EXCELLENT
Russell 3000: 89.4% (1,102 / 1,233 constituents), GOOD
Russell 2000: 84.7% (666 / 786 constituents), GOOD

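Assessment: The dataset provides excellent large-cap coverage (97.4% of the S&P 500 list) and strong broad-market coverage (89.4% of the Russell 3000 list). The 84.7% Russell 2000 coverage is high for a sentiment dataset, as many small-caps have limited news and filing coverage. Note that the constituent lists used here (455, 1,233, and 786 names) are smaller than full index membership, so these percentages describe coverage of the listed names rather than of each complete index. Within that scope, the dataset is well suited to factor research across the major US equity benchmarks.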
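
Each coverage percentage is the share of a constituent list that also appears in the dataset. The sketch below shows that calculation; `index_coverage` is a hypothetical helper, and how the constituent lists were sourced is not stated in this report.

```python
def index_coverage(dataset_ids, constituent_ids):
    """Return (matched, total, percent) for a constituent list against the dataset."""
    constituents = set(constituent_ids)
    if not constituents:
        raise ValueError("constituent list is empty")
    matched = constituents & set(dataset_ids)
    return len(matched), len(constituents), 100.0 * len(matched) / len(constituents)

# Made-up identifiers for illustration
matched, total, pct = index_coverage(["A", "B", "C"], ["A", "B", "D", "E"])
print(matched, total, pct)

# The reported percentages follow from the reported counts
reported = {name: round(100 * m / n, 1) for name, (m, n) in {
    "S&P 500": (443, 455),
    "Russell 3000": (1102, 1233),
    "Russell 2000": (666, 786),
}.items()}
print(reported)
```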

📈 Sentiment Score Analysis
[Chart: Sentiment distributions]

Sentiment Statistics

| Metric | News Sentiment | Earnings Call | 10-K Filing | Combined |
|---|---|---|---|---|
| Range | [-1.082, 0.817] | [-1.713, 0.340] | [-0.340, 0.575] | [-1.022, 0.431] |
| Mean | 0.0005 | 0.0137 | -0.0123 | -0.0005 |
| Median | 0.0132 | 0.0915 | -0.0212 | 0.0234 |
| Std Dev | 0.185 | 0.286 | 0.106 | 0.133 |
| Skewness | -0.769 | -2.525 | 0.507 | -1.688 |
| Kurtosis | 2.736 | 9.918 | 0.447 | 5.894 |
| % Positive | 53.8% | 65.4% | 42.0% | 58.6% |
| % Negative | 46.2% | 34.6% | 58.0% | 41.4% |
| Outlier Rate (IQR) | 4.5% | 5.6% | 1.1% | 3.8% |
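
The distribution statistics above map onto standard pandas methods. This is a sketch under one assumption the report does not state: that "Kurtosis" means excess kurtosis, which is what `Series.kurt` returns. `sentiment_stats` and the sample values are hypothetical.

```python
import pandas as pd

def sentiment_stats(s: pd.Series) -> pd.Series:
    """Summary statistics for one sentiment column, ignoring nulls."""
    s = s.dropna()
    return pd.Series({
        "min": s.min(),
        "max": s.max(),
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(),
        "skew": s.skew(),          # sample skewness
        "kurtosis": s.kurt(),      # excess kurtosis (normal distribution = 0)
        "pct_positive": (s > 0).mean() * 100,
        "pct_negative": (s < 0).mean() * 100,
    })

# Made-up scores, including one null
stats = sentiment_stats(pd.Series([-0.4, -0.1, 0.1, 0.2, None]))
print(stats)
```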

⚠️ Key Observations

  • Earnings Call Sentiment is heavily left-skewed (skew=-2.53, kurtosis=9.92) — extreme negative calls drive fat left tails. Consider winsorizing at 1st/99th percentiles for factor construction.
  • 10-K Filing Sentiment has a negative median (-0.021) — legal boilerplate in 10-K filings naturally skews negative. Cross-sectional rank is more informative than raw levels.
  • Combined Sentiment is well-centered near zero (mean=-0.0005) — effective for long-short signal construction.
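  • News Sentiment is only moderately left-skewed (skew=-0.77), far less than Earnings Call (-2.53) or Combined (-1.69), so it is comparatively well-behaved for direct use in models. By absolute skew, 10-K Filing Sentiment (0.51) is the most symmetric of the four.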
⏱️ Temporal Analysis
[Charts: Sentiment time series; News volume]

Temporal Observations

  • Zero missing business days — perfect daily delivery across 8+ years
  • Stable universe size: between 3,057 and 3,613 tickers per day, with gradual expansion
  • News volume trending upward — reflects growing news coverage, especially post-2020
  • Sentiment regime shifts visible — COVID crash (Mar 2020) shows clear negative spike in earnings call sentiment
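
The "zero missing business days" check can be sketched with `pd.bdate_range`, which enumerates Monday-to-Friday dates and does not exclude market holidays. That appears to match the report's definition: the weekday count from Jan 2, 2017 to Mar 10, 2025 is exactly 2,136. `missing_business_days` and the demo dates are hypothetical.

```python
import pandas as pd

def missing_business_days(dates) -> pd.DatetimeIndex:
    """Weekdays (Mon-Fri) between the first and last date that never appear."""
    observed = pd.to_datetime(pd.Index(dates).unique())
    expected = pd.bdate_range(observed.min(), observed.max())
    return expected.difference(observed)

# Made-up week with Wednesday missing (and one duplicated date)
demo_dates = ["2025-03-03", "2025-03-04", "2025-03-06", "2025-03-07", "2025-03-07"]
missing = missing_business_days(demo_dates)
print(missing)

# Weekday count over the report's stated span
n_weekdays = len(pd.bdate_range("2017-01-02", "2025-03-10"))
print(n_weekdays)
```

If holiday-aware checking is needed, an exchange calendar would have to replace the plain weekday range; that is outside what this report describes.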
🔗 Correlation Analysis
[Chart: Correlation matrix]

Key Correlations

News Sent ↔ Combined: Strong positive
EC Sent ↔ Combined: Strong positive
10K Sent ↔ Combined: Moderate positive
News Vol ↔ News Sent: Weak
News Sent ↔ EC Sent: Weak
News Sent ↔ 10K Sent: Weak

Interpretation: The three sentiment sources (news, earnings calls, 10-K filings) are only weakly correlated with each other, which suggests they carry largely independent information and makes the combined score a diversified signal. News volume has minimal correlation with any sentiment measure, which argues against a simple linear volume bias, though weak correlation alone does not rule out nonlinear effects.
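
A correlation matrix of this kind can be produced as follows. The report does not say which correlation method was used, so this sketch assumes Pearson correlation on pairwise-complete rows (the pandas default). `sentiment_correlations` and the `demo` values are hypothetical.

```python
import pandas as pd

SENTIMENT_COLS = [
    "NEWS_SENTIMENT",
    "EC_LAST_CALL_SENTIMENT",
    "CF_LAST_10K_SENTIMENT",
    "COMBINED_SENTIMENT",
]

def sentiment_correlations(df, cols=SENTIMENT_COLS):
    """Pearson correlations; a row with a null is skipped only for the pairs it affects."""
    return df[cols].corr(method="pearson")

# Made-up values; EC is exactly 2x NEWS wherever NEWS is present
demo = pd.DataFrame({
    "NEWS_SENTIMENT": [0.1, None, -0.2, 0.3, 0.0],
    "EC_LAST_CALL_SENTIMENT": [0.2, 0.1, -0.4, 0.6, 0.0],
    "CF_LAST_10K_SENTIMENT": [0.0, 0.1, 0.0, -0.1, 0.05],
    "COMBINED_SENTIMENT": [0.1, 0.1, -0.2, 0.2, 0.02],
})
corr = sentiment_correlations(demo)
print(corr.round(3))
```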

Outlier Detection (IQR Method)
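| Column | Outliers | Rate | Lower Bound | Upper Bound | Min Value | Max Value | Status |
|---|---|---|---|---|---|---|---|
| NEWS_VOLUME | 537,141 | 8.84% | -47.50 | 100.50 | 1.00 | 3,968.00 | ELEVATED |
| NEWS_SENTIMENT | 272,345 | 4.48% | -0.389 | 0.410 | -1.082 | 0.817 | MODERATE |
| EC_LAST_CALL_SENTIMENT | 397,446 | 5.58% | -0.498 | 0.618 | -1.713 | 0.340 | MODERATE |
| CF_LAST_10K_SENTIMENT | 79,952 | 1.12% | -0.296 | 0.263 | -0.340 | 0.575 | LOW |
| COMBINED_SENTIMENT | 270,449 | 3.80% | -0.274 | 0.301 | -1.022 | 0.431 | MODERATE |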

News volume outliers are driven by mega-cap stocks (AAPL, TSLA, NVDA) which naturally attract orders of magnitude more coverage. Earnings call sentiment outliers come from extreme negative calls (profit warnings, restatements). These are real signals, not data errors — consider robust scaling (winsorization or rank-transform) rather than removal.
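
The bounds in the outlier table are consistent with the usual Tukey fences, Q1 - 1.5×IQR and Q3 + 1.5×IQR. The multiplier 1.5 is an assumption, since the report does not state it; under it, the NEWS_VOLUME bounds of -47.50 and 100.50 imply Q1 = 8 and Q3 = 45. `iqr_outliers` and the sample series are hypothetical.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> dict:
    """IQR fences and outlier count for one column, ignoring nulls."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (s < lower) | (s > upper)
    return {
        "lower": float(lower),
        "upper": float(upper),
        "n_outliers": int(is_outlier.sum()),
        "rate_pct": float(is_outlier.mean() * 100),
    }

# Made-up counts with one extreme value
res = iqr_outliers(pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float))
print(res)

# Fences implied by Q1 = 8, Q3 = 45 (the NEWS_VOLUME row)
news_volume_fences = (8 - 1.5 * (45 - 8), 45 + 1.5 * (45 - 8))
print(news_volume_fences)
```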

💡 Recommendations
HIGH PRIORITY
Winsorize Earnings Call Sentiment
EC_LAST_CALL_SENTIMENT has extreme left-tail values (min=-1.71, skew=-2.53). Winsorize at 1st/99th percentiles before factor construction to reduce noise from extreme negative calls.
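
One way to apply this is to clip each day's cross-section at its own 1st and 99th percentiles. Winsorizing per date rather than over the pooled history is an assumption here; the recommendation does not specify. `winsorize_by_date` and the demo data are hypothetical.

```python
import pandas as pd

def winsorize_by_date(df, col, lower=0.01, upper=0.99, date_col="DATE"):
    """Clip `col` to the [lower, upper] quantiles computed within each date."""
    def _clip(s: pd.Series) -> pd.Series:
        return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))
    return df.groupby(date_col)[col].transform(_clip)

# Made-up single-day cross-section: values 0, 1, ..., 100
demo = pd.DataFrame({
    "DATE": ["2025-03-03"] * 101,
    "EC_LAST_CALL_SENTIMENT": [float(v) for v in range(101)],
})
clipped = winsorize_by_date(demo, "EC_LAST_CALL_SENTIMENT")
print(clipped.min(), clipped.max())
```

Clipping keeps every row in the panel, which is usually preferable to dropping extreme calls outright.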
HIGH PRIORITY
Handle News Nulls Explicitly
14.7% of rows lack news data (NEWS_VOLUME=null, NEWS_SENTIMENT=null). When building signals, impute these as neutral (0) or exclude — do not let them propagate as NaN into portfolio construction.
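
A minimal sketch of the neutral-imputation option follows. It keeps a flag so that downstream code can still tell "no news" apart from "neutral news". The `HAS_NEWS` column and `fill_news_nulls` are hypothetical names, and filling NEWS_VOLUME with 0 relies on the report's reading that nulls mean no coverage in the lookback window.

```python
import pandas as pd

def fill_news_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with news nulls set to 0 and a boolean coverage flag added."""
    out = df.copy()
    out["HAS_NEWS"] = out["NEWS_SENTIMENT"].notna()  # hypothetical helper column
    out["NEWS_SENTIMENT"] = out["NEWS_SENTIMENT"].fillna(0.0)
    out["NEWS_VOLUME"] = out["NEWS_VOLUME"].fillna(0.0)
    return out

# Made-up rows for illustration
demo = pd.DataFrame({
    "TICKER": ["AAA", "BBB", "CCC"],
    "NEWS_VOLUME": [12.0, None, 3.0],
    "NEWS_SENTIMENT": [0.2, None, -0.1],
})
filled = fill_news_nulls(demo)
print(filled)
```

To exclude instead of impute, filter on the flag before ranking, for example `filled[filled["HAS_NEWS"]]`.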
MEDIUM PRIORITY
Use Cross-Sectional Ranks for Factor Scores
Given the different scales and distributions across sentiment types, convert to cross-sectional z-scores or percentile ranks per day before aggregation for more robust factor signals.
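
Both transforms can be computed per day with a pandas groupby. The sketch below is illustrative only; `cross_sectional_scores` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def cross_sectional_scores(df, col, date_col="DATE"):
    """Per-date z-score and percentile rank of `col`."""
    grouped = df.groupby(date_col)[col]
    # Sample standard deviation (ddof=1); a date with a single name yields NaN
    z = (df[col] - grouped.transform("mean")) / grouped.transform("std")
    pct_rank = grouped.rank(pct=True)
    return z, pct_rank

# Two made-up days on very different scales
demo = pd.DataFrame({
    "DATE": ["2025-03-03"] * 3 + ["2025-03-04"] * 3,
    "COMBINED_SENTIMENT": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
z, pct_rank = cross_sectional_scores(demo, "COMBINED_SENTIMENT")
print(z.tolist())
print(pct_rank.tolist())
```

After the transform, both days produce the same scores despite the tenfold difference in raw scale, which is the property that makes the sources comparable before aggregation.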
MEDIUM PRIORITY
Monitor Staleness of EC and 10K Signals
EC_LAST_CALL_N_PAST_DAYS ranges up to 365 days, and CF_LAST_10K_N_PAST_DAYS up to 730 days. Consider applying a decay function or capping staleness at a reasonable threshold (e.g., 180 days for EC, 400 days for 10K).
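
As one possible decay scheme, the sketch below combines an exponential half-life with a hard cap. The caps (180 and 400 days) are the thresholds suggested above; the half-life values and the function name `staleness_weight` are hypothetical choices for illustration, not something the report prescribes.

```python
import numpy as np

def staleness_weight(days, half_life, max_age):
    """Weight that halves every `half_life` days and drops to 0 beyond `max_age`."""
    days = np.asarray(days, dtype=float)
    weight = 0.5 ** (days / half_life)
    return np.where(days > max_age, 0.0, weight)

# Earnings calls: hypothetical 90-day half-life, 180-day cap
ec_weights = staleness_weight([0, 90, 180, 181], half_life=90, max_age=180)
print(ec_weights)

# 10-K filings: hypothetical 200-day half-life, 400-day cap
cf_weights = staleness_weight([0, 200, 730], half_life=200, max_age=400)
print(cf_weights)
```

Multiplying `EC_LAST_CALL_SENTIMENT` by a weight computed from `EC_LAST_CALL_N_PAST_DAYS` (and likewise for the 10-K columns) shrinks stale signals toward neutral.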
LOW PRIORITY
Resolve TICKER vs FIGI Mismatch
5,076 unique tickers vs 4,961 unique FIGIs: there are 115 more ticker symbols than FIGIs, so some FIGIs appear under more than one ticker (likely ticker changes or M&A). Use COMPOSITE_FIGI as the primary identifier for time-series continuity.
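
To see which identifiers are involved, one can list every FIGI that appears under more than one ticker. This is a minimal sketch; `multi_ticker_figis` is a hypothetical helper and the identifiers below are placeholders, not real FIGIs. The difference of 115 is a net figure, so the number of affected FIGIs may differ from it.

```python
import pandas as pd

def multi_ticker_figis(df, figi_col="COMPOSITE_FIGI", ticker_col="TICKER"):
    """Map each FIGI seen under several tickers to its sorted list of tickers."""
    counts = df.groupby(figi_col)[ticker_col].nunique()
    multi = counts[counts > 1].index
    subset = df[df[figi_col].isin(multi)]
    return subset.groupby(figi_col)[ticker_col].agg(lambda s: sorted(s.unique()))

# Placeholder identifiers: FIGI_A changes ticker from OLD to NEW
demo = pd.DataFrame({
    "COMPOSITE_FIGI": ["FIGI_A", "FIGI_A", "FIGI_B", "FIGI_B"],
    "TICKER": ["OLD", "NEW", "BBB", "BBB"],
})
remapped = multi_ticker_figis(demo)
print(remapped)
```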
LOW PRIORITY
Drop NEWS_N_PAST_DAYS_AGGR Column
This column is a constant (30 for all rows). It carries no information and can be dropped to reduce file size. Document the value as metadata instead.
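
Before dropping the column, it is worth asserting that it really is constant in the delivery at hand. The sketch below does that and returns the value so it can be recorded as metadata; `drop_constant_column` and the demo rows are hypothetical.

```python
import pandas as pd

def drop_constant_column(df, col="NEWS_N_PAST_DAYS_AGGR"):
    """Drop `col` if it holds a single value; return the new frame and that value."""
    values = df[col].unique()
    if len(values) != 1:
        raise ValueError(f"{col} is not constant: {len(values)} distinct values")
    return df.drop(columns=[col]), values[0]

# Made-up rows for illustration
demo = pd.DataFrame({
    "TICKER": ["AAA", "BBB"],
    "NEWS_N_PAST_DAYS_AGGR": [30, 30],
})
slim, lookback_days = drop_constant_column(demo)
print(list(slim.columns), lookback_days)
```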