Data Quality Report — Brain Combined Sentiment Dataset

🏆 Quality Scorecard

A

Overall Score

93.1

Completeness

97.6%

2 of 12 columns have nulls

Consistency

95%

0 missing business days

Accuracy

90%

1-9% outlier rate by column

Timeliness

88%

Latest: Mar 10, 2025

📋 Dataset Overview

Total Records

7.1M

7,116,723 rows × 12 columns

Unique Tickers

5,076

~3,326 avg per day

Trading Days

2,136

Jan 2, 2017 — Mar 10, 2025

Dataset Description

Brain's Combined Sentiment dataset aggregates sentiment signals from three sources (news articles, earnings call transcripts, and 10-K filing language) into a single combined sentiment score for ~5,000 US equities. Each daily file contains sentiment metrics per ticker, with lookback windows configurable per source. This is a panel dataset well suited to cross-sectional factor construction and event-driven signal research.

Schema Profile

Column	Type	Non-Null	Null %	Unique	Description
DATE	string	7,116,723	0%	2,136	Trading date (YYYY-MM-DD)
COMPOSITE_FIGI	string	7,116,723	0%	4,961	Bloomberg FIGI identifier
NAME	string	7,116,723	0%	5,022	Company name
TICKER	string	7,116,723	0%	5,076	Stock ticker symbol
NEWS_N_PAST_DAYS_AGGR	int32	7,116,723	0%	1	News lookback window (constant=30)
NEWS_VOLUME	float64	6,073,565	14.7%	3,479	Article count in lookback period
NEWS_SENTIMENT	float64	6,073,565	14.7%	864,063	News-based sentiment score
EC_LAST_CALL_N_PAST_DAYS	float64	7,116,723	0%	365	Days since last earnings call
EC_LAST_CALL_SENTIMENT	float64	7,116,723	0%	811,478	Earnings call sentiment score
CF_LAST_10K_N_PAST_DAYS	float64	7,116,723	0%	730	Days since last 10-K filing
CF_LAST_10K_SENTIMENT	float64	7,116,723	0%	561,920	10-K filing sentiment score
COMBINED_SENTIMENT	float64	7,116,723	0%	700,120	Weighted combined sentiment

🔍 Data Completeness Analysis

Missing Data Summary

Columns with no nulls: 10 of 12

NEWS_VOLUME nulls: 1,043,158 (14.7%)

NEWS_SENTIMENT nulls: 1,043,158 (14.7%)

Null pattern: Perfectly correlated (same rows)

Missing business days: 0

Calendar coverage: 100% of business days

Interpretation: NEWS_VOLUME and NEWS_SENTIMENT are null for the same 1.04M rows — these represent tickers with zero news coverage in the 30-day lookback window. This is structurally expected (many small-caps have little news). The COMBINED_SENTIMENT still has values for these rows as it falls back on EC and 10K signals.
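The null pattern described above can be checked directly. The following is a minimal sketch, assuming the data is loaded into a pandas DataFrame with the column names from the schema; the helper name `null_cooccurrence` and the toy values are illustrative and not part of the dataset.

```python
import numpy as np
import pandas as pd

def null_cooccurrence(df: pd.DataFrame, a: str, b: str) -> dict:
    """Count rows where columns a and b are null together versus separately."""
    a_null, b_null = df[a].isna(), df[b].isna()
    return {
        "both_null": int((a_null & b_null).sum()),
        "only_" + a: int((a_null & ~b_null).sum()),
        "only_" + b: int((~a_null & b_null).sum()),
    }

# Toy frame standing in for one daily file (tickers and values are made up).
sample = pd.DataFrame({
    "TICKER": ["AAA", "BBB", "CCC"],
    "NEWS_VOLUME": [12.0, np.nan, 3.0],
    "NEWS_SENTIMENT": [0.10, np.nan, -0.05],
    "COMBINED_SENTIMENT": [0.08, 0.02, -0.01],
})

pattern = null_cooccurrence(sample, "NEWS_VOLUME", "NEWS_SENTIMENT")

# Rows without news should still carry a combined score.
no_news = sample["NEWS_SENTIMENT"].isna()
combined_present = bool(sample.loc[no_news, "COMBINED_SENTIMENT"].notna().all())
```

If the two "only_" counts are zero on the full dataset, the nulls are perfectly aligned as reported.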

🌍 Universe Coverage Over Time

Min Tickers/Day

3,057

Max Tickers/Day

3,613

Avg Tickers/Day

3,326

Total Unique

5,076

📊 Index Coverage Analysis

S&P 500

97.4%

443 / 455 constituents

EXCELLENT

Russell 3000

89.4%

1,102 / 1,233 constituents

GOOD

Russell 2000

84.7%

666 / 786 constituents

GOOD

Assessment: The dataset provides excellent large-cap coverage (97.4% SPX) and strong broad-market coverage (89.4% R3K). The 84.7% Russell 2000 coverage is impressive for a sentiment dataset, as many small-caps have limited news and filing coverage. This dataset is highly representative for factor research across all major US equity benchmarks.
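The coverage percentages are simply matched constituents divided by total constituents. A minimal sketch of that calculation follows; the function name `index_coverage` and the toy ticker lists are hypothetical, and the source of the index membership lists is not specified in this report.

```python
def index_coverage(dataset_tickers, index_constituents):
    """Return (matched, total, coverage in percent) for one index membership list."""
    constituents = set(index_constituents)
    matched = constituents & set(dataset_tickers)
    return len(matched), len(constituents), 100.0 * len(matched) / len(constituents)

# Toy example with made-up tickers.
matched, total, pct = index_coverage(
    ["AAA", "BBB", "CCC"],
    ["AAA", "BBB", "YYY", "ZZZ"],
)

# Reproducing the reported S&P 500 figure from its own counts.
spx_pct = round(100.0 * 443 / 455, 1)
```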

📈 Sentiment Score Analysis

Sentiment Statistics

Metric	News Sentiment	Earnings Call	10-K Filing	Combined
Range	[-1.082, 0.817]	[-1.713, 0.340]	[-0.340, 0.575]	[-1.022, 0.431]
Mean	0.0005	0.0137	-0.0123	-0.0005
Median	0.0132	0.0915	-0.0212	0.0234
Std Dev	0.185	0.286	0.106	0.133
Skewness	-0.769	-2.525	0.507	-1.688
Excess Kurtosis	2.736	9.918	0.447	5.894
% Positive	53.8%	65.4%	42.0%	58.6%
% Negative	46.2%	34.6%	58.0%	41.4%
Outlier Rate (IQR)	4.5%	5.6%	1.1%	3.8%

⚠️ Key Observations

• Earnings Call Sentiment is heavily left-skewed (skew=-2.53, kurtosis=9.92) — extreme negative calls drive fat left tails. Consider winsorizing at 1st/99th percentiles for factor construction.
• 10-K Filing Sentiment has a negative median (-0.021) — legal boilerplate in 10-K filings naturally skews negative. Cross-sectional rank is more informative than raw levels.
• Combined Sentiment is well-centered near zero (mean=-0.0005) — effective for long-short signal construction.
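• News Sentiment is only moderately left-skewed (skew=-0.77, vs. -2.53 for earnings calls) and centered near zero, so it is reasonably well-behaved for direct use in models. Note that 10-K Filing Sentiment has the smallest absolute skew (0.51).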
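The statistics behind these observations can be recomputed per column. This is a sketch only: it uses pandas' bias-corrected `skew()` and `kurt()` (the latter returns excess kurtosis, where a normal distribution scores 0), and it is an assumption that the report used comparable estimators. The helper name `sentiment_profile` and the toy values are illustrative.

```python
import pandas as pd

def sentiment_profile(scores: pd.Series) -> pd.Series:
    """Summary statistics for one sentiment column, ignoring nulls."""
    s = scores.dropna()
    return pd.Series({
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(),
        "skew": s.skew(),             # negative means a longer left tail
        "excess_kurtosis": s.kurt(),  # 0 for a normal distribution
        "pct_positive": 100.0 * (s > 0).mean(),
    })

# Toy earnings-call-like sample: mostly mild positives, one extreme negative call.
toy_ec = pd.Series([-1.5, 0.1, 0.2, 0.2, 0.3])
profile = sentiment_profile(toy_ec)
```

In the toy sample the single extreme negative value pulls the mean below the median and makes the skew negative, which is the same shape the table shows for EC_LAST_CALL_SENTIMENT.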

⏱️ Temporal Analysis

Temporal Observations

• Zero missing business days — perfect daily delivery across 8+ years
• Stable universe size: 3,057 to 3,613 tickers per day, with gradual expansion
• News volume trending upward — reflects growing news coverage, especially post-2020
• Sentiment regime shifts visible — COVID crash (Mar 2020) shows clear negative spike in earnings call sentiment
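The "zero missing business days" check can be reproduced by comparing the observed dates against a Monday-to-Friday calendar. The sketch below assumes DATE is a YYYY-MM-DD string column as in the schema; `missing_business_days` is a hypothetical helper, and the calendar used here does not account for exchange holidays.

```python
import pandas as pd

def missing_business_days(dates: pd.Series) -> pd.DatetimeIndex:
    """Weekdays between the first and last observed date that have no data."""
    observed = pd.DatetimeIndex(pd.to_datetime(dates).unique())
    expected = pd.bdate_range(observed.min(), observed.max())
    return expected.difference(observed)

# Toy date column with Wednesday 2025-03-05 deliberately left out.
toy_dates = pd.Series([
    "2025-03-03", "2025-03-04", "2025-03-06", "2025-03-07", "2025-03-10",
])
gaps = missing_business_days(toy_dates)
gap_labels = list(gaps.strftime("%Y-%m-%d"))
```

An empty result on the full DATE column corresponds to the 100% calendar coverage reported above.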

🔗 Correlation Analysis

Key Correlations

News Sent ↔ Combined: Strong positive

EC Sent ↔ Combined: Strong positive

10K Sent ↔ Combined: Moderate positive

News Vol ↔ News Sent: Weak

News Sent ↔ EC Sent: Weak

News Sent ↔ 10K Sent: Weak

Interpretation: The three sentiment sources (news, earnings calls, 10-K filings) have low cross-correlation with each other, which suggests they carry largely independent information and makes the combined score a diversified signal. News volume is only weakly correlated with news sentiment, which reduces (but does not eliminate) volume-bias concerns.
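The qualitative labels above can be backed by an explicit correlation matrix. The report does not state which correlation measure was used, so this sketch assumes pooled Pearson correlation via `DataFrame.corr()`, which drops nulls pairwise. The toy values are made up.

```python
import numpy as np
import pandas as pd

SENTIMENT_COLUMNS = [
    "NEWS_VOLUME",
    "NEWS_SENTIMENT",
    "EC_LAST_CALL_SENTIMENT",
    "CF_LAST_10K_SENTIMENT",
    "COMBINED_SENTIMENT",
]

# Toy rows; the last row has no news coverage, as in the real data.
toy = pd.DataFrame({
    "NEWS_VOLUME": [5.0, 40.0, 12.0, np.nan],
    "NEWS_SENTIMENT": [0.10, -0.20, 0.30, np.nan],
    "EC_LAST_CALL_SENTIMENT": [0.20, 0.10, -0.40, 0.00],
    "CF_LAST_10K_SENTIMENT": [-0.02, 0.01, -0.03, 0.02],
    "COMBINED_SENTIMENT": [0.05, -0.10, 0.15, 0.00],
})

# Pearson by default; rows with a null in either column of a pair are skipped.
corr = toy[SENTIMENT_COLUMNS].corr()
news_vs_combined = corr.loc["NEWS_SENTIMENT", "COMBINED_SENTIMENT"]
```

For a panel like this one, computing the matrix per DATE and averaging is a common alternative that avoids mixing time-series and cross-sectional effects.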

⚡ Outlier Detection (IQR Method)

Column	Outliers	Rate	Lower Bound	Upper Bound	Min Value	Max Value	Status
NEWS_VOLUME	537,141	8.84%	-47.50	100.50	1.00	3,968.00	ELEVATED
NEWS_SENTIMENT	272,345	4.48%	-0.389	0.410	-1.082	0.817	MODERATE
EC_LAST_CALL_SENTIMENT	397,446	5.58%	-0.498	0.618	-1.713	0.340	MODERATE
CF_LAST_10K_SENTIMENT	79,952	1.12%	-0.296	0.263	-0.340	0.575	LOW
COMBINED_SENTIMENT	270,449	3.80%	-0.274	0.301	-1.022	0.431	MODERATE

News volume outliers are driven by mega-cap stocks (AAPL, TSLA, NVDA) which naturally attract orders of magnitude more coverage. Earnings call sentiment outliers come from extreme negative calls (profit warnings, restatements). These are real signals, not data errors — consider robust scaling (winsorization or rank-transform) rather than removal.
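As a worked check of the IQR method, the NEWS_VOLUME bounds in the table (-47.50 and 100.50) are consistent with Q1 = 8 and Q3 = 45 under the conventional 1.5 × IQR fences: IQR = 37, so the fences sit at 8 - 55.5 and 45 + 55.5. The 1.5 multiplier is an assumption, since the report does not state it. A minimal sketch, with hypothetical helper names and toy values:

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outlier_rate(values: pd.Series, k: float = 1.5) -> float:
    """Share of non-null values falling outside the IQR fences."""
    s = values.dropna()
    lower, upper = iqr_bounds(s, k)
    return float(((s < lower) | (s > upper)).mean())

# Toy volumes chosen so that Q1 = 8 and Q3 = 45, plus one mega-cap-like value.
toy_volume = pd.Series([8.0, 8.0, 45.0, 45.0, 3968.0])
bounds = iqr_bounds(toy_volume)
rate = iqr_outlier_rate(toy_volume)
```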

💡 Recommendations

HIGH PRIORITY

Winsorize Earnings Call Sentiment

EC_LAST_CALL_SENTIMENT has extreme left-tail values (min=-1.71, skew=-2.53). Winsorize at 1st/99th percentiles before factor construction to reduce noise from extreme negative calls.
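One way to apply this is to clip each day's cross-section at its own 1st and 99th percentiles. Whether to use per-day or pooled percentiles is not specified in the recommendation; the per-day choice and the helper name `winsorize_by_day` are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

def winsorize_by_day(df: pd.DataFrame, col: str,
                     lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip col to its [lower, upper] quantiles within each DATE."""
    def clip_one_day(s: pd.Series) -> pd.Series:
        return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))
    return df.groupby("DATE")[col].transform(clip_one_day)

# Toy cross-section for a single day: 101 made-up scores from 0.0 to 100.0.
toy = pd.DataFrame({
    "DATE": ["2025-03-10"] * 101,
    "EC_LAST_CALL_SENTIMENT": np.arange(101, dtype=float),
})
clipped = winsorize_by_day(toy, "EC_LAST_CALL_SENTIMENT")
```

Clipping keeps every row in the panel and only limits the magnitude of the extremes, which matches the report's view that these values are real signals and not data errors.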

HIGH PRIORITY

Handle News Nulls Explicitly

14.7% of rows lack news data (NEWS_VOLUME=null, NEWS_SENTIMENT=null). When building signals, impute these as neutral (0) or exclude — do not let them propagate as NaN into portfolio construction.
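A sketch of the neutral-imputation option follows. It keeps an explicit coverage flag so that "no news" stays distinguishable from "neutral news" downstream. The new column names (`HAS_NEWS`, `NEWS_SENTIMENT_FILLED`, `NEWS_VOLUME_FILLED`) are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

def add_news_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a coverage flag and neutral-filled news columns."""
    out = df.copy()
    out["HAS_NEWS"] = out["NEWS_SENTIMENT"].notna()
    out["NEWS_SENTIMENT_FILLED"] = out["NEWS_SENTIMENT"].fillna(0.0)
    out["NEWS_VOLUME_FILLED"] = out["NEWS_VOLUME"].fillna(0.0)
    return out

# Toy rows; the second ticker has no news in the lookback window.
toy = pd.DataFrame({
    "TICKER": ["AAA", "BBB"],
    "NEWS_VOLUME": [12.0, np.nan],
    "NEWS_SENTIMENT": [0.10, np.nan],
})
features = add_news_features(toy)
```

To exclude instead of impute, filter on the flag, for example `features[features["HAS_NEWS"]]`.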

MEDIUM PRIORITY

Use Cross-Sectional Ranks for Factor Scores

Given the different scales and distributions across sentiment types, convert to cross-sectional z-scores or percentile ranks per day before aggregation for more robust factor signals.
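Both transformations can be sketched with a per-DATE groupby. The helper names are illustrative, the z-score uses the sample standard deviation (pandas default, ddof=1), and the toy scores are made up.

```python
import pandas as pd

def cross_sectional_zscore(df: pd.DataFrame, col: str) -> pd.Series:
    """Standardize col within each DATE: (x - daily mean) / daily std."""
    grouped = df.groupby("DATE")[col]
    return (df[col] - grouped.transform("mean")) / grouped.transform("std")

def cross_sectional_rank(df: pd.DataFrame, col: str) -> pd.Series:
    """Percentile rank of col within each DATE, in (0, 1]."""
    return df.groupby("DATE")[col].rank(pct=True)

# Toy panel: three tickers on one day, two on the next.
toy = pd.DataFrame({
    "DATE": ["2025-03-07"] * 3 + ["2025-03-10"] * 2,
    "TICKER": ["AAA", "BBB", "CCC", "AAA", "BBB"],
    "COMBINED_SENTIMENT": [0.1, 0.2, 0.3, -0.4, 0.4],
})
z = cross_sectional_zscore(toy, "COMBINED_SENTIMENT")
r = cross_sectional_rank(toy, "COMBINED_SENTIMENT")
```

Ranks are insensitive to the fat left tails noted earlier, while z-scores preserve relative distances and are usually paired with winsorization.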

MEDIUM PRIORITY

Monitor Staleness of EC and 10K Signals

EC_LAST_CALL_N_PAST_DAYS ranges up to 365 days, and CF_LAST_10K_N_PAST_DAYS up to 730 days. Consider applying a decay function or capping staleness at a reasonable threshold (e.g., 180 days for EC, 400 days for 10K).
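One possible decay scheme is an exponential half-life weight with a hard cap. The 180-day cap comes from the example threshold above; the 90-day half-life and the helper name `staleness_weight` are assumptions chosen purely for illustration.

```python
import pandas as pd

def staleness_weight(days_since: pd.Series, half_life_days: float,
                     max_age_days: float) -> pd.Series:
    """Exponential decay weight in [0, 1]; zero once the signal exceeds max_age_days."""
    days = days_since.astype(float)
    weight = 0.5 ** (days / half_life_days)
    return weight.where(days <= max_age_days, 0.0)

# Toy EC_LAST_CALL_N_PAST_DAYS values.
toy_days = pd.Series([0.0, 90.0, 180.0, 181.0])
weights = staleness_weight(toy_days, half_life_days=90.0, max_age_days=180.0)

# Example of applying the weight to a raw score (made-up values).
toy_scores = pd.Series([0.2, 0.2, 0.2, 0.2])
decayed_scores = toy_scores * weights
```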

LOW PRIORITY

Resolve TICKER vs FIGI Mismatch

5,076 unique tickers vs 4,961 unique FIGIs: there are 115 more ticker symbols than FIGIs, so some FIGIs appear under more than one ticker over time (likely ticker changes, M&A). Use COMPOSITE_FIGI as the primary identifier for time-series continuity.
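The affected identifiers can be listed by counting distinct tickers per FIGI. This is a minimal sketch; `figis_with_multiple_tickers` is a hypothetical helper, and the FIGI strings and tickers in the toy frame are placeholders, not real identifiers.

```python
import pandas as pd

def figis_with_multiple_tickers(df: pd.DataFrame) -> pd.Series:
    """Number of distinct tickers per COMPOSITE_FIGI, keeping only FIGIs with more than one."""
    counts = df.groupby("COMPOSITE_FIGI")["TICKER"].nunique()
    return counts[counts > 1]

# Toy rows: the first placeholder FIGI changes ticker between dates.
toy = pd.DataFrame({
    "DATE": ["2025-03-07", "2025-03-10", "2025-03-07", "2025-03-10"],
    "COMPOSITE_FIGI": ["FIGI_PLACEHOLDER_1", "FIGI_PLACEHOLDER_1",
                       "FIGI_PLACEHOLDER_2", "FIGI_PLACEHOLDER_2"],
    "TICKER": ["OLD", "NEW", "CCC", "CCC"],
})
multi = figis_with_multiple_tickers(toy)
```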

LOW PRIORITY

Drop NEWS_N_PAST_DAYS_AGGR Column

This column is a constant (30 for all rows). It carries no information and can be dropped to reduce file size. Document the value as metadata instead.
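A generic way to do this is to detect single-valued columns, record their values as metadata, and drop them. The helper name `drop_constant_columns` is illustrative, and the toy frame is made up.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame):
    """Drop columns holding a single value; return the slimmer frame and the dropped values."""
    constants = {
        col: df[col].iloc[0]
        for col in df.columns
        if df[col].nunique(dropna=False) == 1
    }
    return df.drop(columns=list(constants)), constants

# Toy frame with the constant 30-day news lookback column.
toy = pd.DataFrame({
    "TICKER": ["AAA", "BBB"],
    "NEWS_N_PAST_DAYS_AGGR": [30, 30],
    "COMBINED_SENTIMENT": [0.05, -0.10],
})
slim, metadata = drop_constant_columns(toy)
```

The returned dictionary can be stored alongside the data so the lookback window remains documented.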
