BRAIN Sentiment Data Quality Report

Generated: 2026-01-20 16:50 | Dataset: Brain Combined Sentiment (BCS) | Period: Jan 2017 - Mar 2025
1. Executive Summary
Completeness: B+
Accuracy: A
Timeliness: A-
Stability: B
Overall: B+
Data Quality Assessment: ✅ SUITABLE FOR PRODUCTION

✅ Strengths

  • 100% coverage for earnings call and 10-K sentiment
  • News coverage improved from 64% in 2017 to 92-93% from 2020 onward
  • Low outlier rates (<2% across all years)
  • Stable signal distribution (mean near zero)
  • 8+ years of consistent daily data
  • Point-in-time data with no backfill bias

⚠️ Limitations

  • Early years (2017-2018) have lower news coverage (64-71%)
  • Signal standard deviation declined from ~0.16 (2017) to ~0.10 (2024-2025), a regime change
  • Small-cap stocks may have sparse news data
  • 2020 shows elevated signal std (COVID volatility)
  • Universe composition changes by about 0.8% annually

Recommended Use Cases

✓ Alpha Research: Signal quality supports factor research
✓ Risk Models: Consistent coverage for sentiment risk
⚠ Small-Cap: Use caution for micro-cap analysis
2. Dataset Overview & Lineage
Total Records: 7.1M
Unique Tickers: 5,076
Trading Days: 2,136
Time Span: 8+ yrs

Data Source

Vendor: BRAIN (braincompany.co)
Dataset Name: Brain Combined Sentiment (BCS)
Coverage Start: 2017-01-02
Coverage End: 2025-03-10
Frequency: Daily (EOD)
Delivery: FTP / AWS S3

Data Lineage

News Sources: Multiple news aggregators
EC Transcripts: Earnings call transcripts
SEC Filings: 10-K full documents
NLP Processing: BRAIN proprietary models
Aggregation: Equal-weight average
Update Cadence: Daily by market close

Data Flow

Raw Sources (News, Transcripts, Filings) → NLP Processing (Sentiment extraction) → Normalization (Cross-sectional z-score) → Aggregation (Combined signal) → Delivery (Daily files)
3. Universe Definition & Coverage
Avg News Coverage: 85.9%
EC Coverage: 100%
10-K Coverage: 100%
Annual Churn: 0.8%

Coverage by Year

Year | Tickers | News Coverage | EC Coverage | 10-K Coverage | Status
2017 | 3,494 | 64.2% | 100% | 100% | ✗ Low
2018 | 3,503 | 70.8% | 100% | 100% | ⚠ Fair
2019 | 3,460 | 82.6% | 100% | 100% | ⚠ Fair
2020 | 3,503 | 92.7% | 100% | 100% | ✓ Good
2021 | 3,628 | 93.5% | 100% | 100% | ✓ Good
2022 | 3,776 | 92.2% | 100% | 100% | ✓ Good
2023 | 3,673 | 92.3% | 100% | 100% | ✓ Good
2024 | 3,521 | 91.8% | 100% | 100% | ✓ Good
2025 | 3,121 | 92.6% | 100% | 100% | ✓ Good
[Charts: Coverage by Year; Universe Size]
Key Finding: News coverage improved from 64.2% (2017) to 91.8% (2024), a gain of about 28pp, after peaking at 93.5% in 2021. EC and 10-K coverage remain at 100% throughout.
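
The yearly coverage figures above can be reproduced as the share of rows with a non-null news score. The following is a minimal sketch, assuming coverage is measured per ticker-date row (the report does not state this), that rows are dicts keyed by the field names in the data dictionary, and that a missing score is stored as None. The helper name and the sample rows are hypothetical.

```python
from collections import defaultdict

def news_coverage_by_year(rows):
    """Return {year: share of rows with a non-null NEWS_SENTIMENT}."""
    total = defaultdict(int)
    covered = defaultdict(int)
    for row in rows:
        year = int(row["DATE"][:4])  # DATE assumed to be an ISO string such as "2017-01-02"
        total[year] += 1
        if row["NEWS_SENTIMENT"] is not None:
            covered[year] += 1
    return {year: covered[year] / total[year] for year in sorted(total)}

# Illustrative rows only, not taken from the dataset.
sample_rows = [
    {"DATE": "2017-01-02", "COMPOSITE_FIGI": "FIGI_A", "NEWS_SENTIMENT": 0.12},
    {"DATE": "2017-01-02", "COMPOSITE_FIGI": "FIGI_B", "NEWS_SENTIMENT": None},
    {"DATE": "2024-01-02", "COMPOSITE_FIGI": "FIGI_A", "NEWS_SENTIMENT": -0.05},
    {"DATE": "2024-01-02", "COMPOSITE_FIGI": "FIGI_B", "NEWS_SENTIMENT": 0.30},
]
coverage = news_coverage_by_year(sample_rows)
```

Subtracting one year's value from another gives the percentage-point change quoted in the key finding.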
4. Completeness & Missing Data Analysis
Avg Missing News: 14.8%
Max Missing News: 42.0%
Missing EC: 0%
Missing 10-K: 0%

Missing Data Patterns

[Charts: Missing Data Heatmap; Coverage Over Time]

Imputation Policy

Field | Missing Handling | Max Gap | Notes
NEWS_SENTIMENT | Forward fill | 90 days | Excluded after 90d gap
EC_SENTIMENT | Carry forward | Until next EC | ~90 day refresh cycle
CF_10K_SENTIMENT | Carry forward | Until next filing | Annual refresh
COMBINED | Calculated | N/A | Equal-weight average
Note: Early years (2017-2018) have higher missing rates for news (about 36% in 2017 and 29% in 2018 on an annual basis, per the coverage table). Consider excluding or down-weighting this period in sensitive analyses.
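
The NEWS_SENTIMENT rule in the imputation policy (forward fill, excluded after a 90-day gap) can be sketched as follows. This is an illustration under stated assumptions, not the vendor's implementation: the gap is counted in calendar days (the report does not say whether calendar or trading days apply), and the function name and sample series are hypothetical.

```python
from datetime import date

MAX_GAP_DAYS = 90  # from the imputation policy table

def forward_fill_news(observations, max_gap_days=MAX_GAP_DAYS):
    """Forward-fill (date, value) pairs sorted by date.

    A missing value is filled with the last real value only while that
    value is at most max_gap_days old; beyond that it stays None.
    """
    filled = []
    last_value, last_date = None, None
    for day, value in observations:
        if value is not None:
            last_value, last_date = value, day
            filled.append((day, value))
        elif last_date is not None and (day - last_date).days <= max_gap_days:
            filled.append((day, last_value))
        else:
            filled.append((day, None))
    return filled

# Illustrative series for one security.
series = [
    (date(2020, 1, 2), 0.20),
    (date(2020, 2, 3), None),   # 32 days after the last real value: filled
    (date(2020, 4, 15), None),  # 104 days after the last real value: excluded
]
result = forward_fill_news(series)
```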
5. Accuracy & Validation Checks
Avg Outlier Rate: 0.83%
Min Outlier Rate: 0.14%
Max Outlier Rate: 1.60%
Range Check: Pass

Validation Checks

Check | Criteria | Result | Status
Range Check | Signal in [-1, 1] | All values within range | ✓ Pass
Null Check | No unexpected nulls | Only expected missing (news) | ✓ Pass
Duplicate Check | No duplicate ticker-date pairs | No duplicates found | ✓ Pass
Date Continuity | No gaps in trading days | All trading days present | ✓ Pass
Outlier Check | <2% outliers per year | Max 1.6% (2017) | ✓ Pass
Distribution Check | Mean near zero | Mean: -0.0005 | ✓ Pass
[Chart: Outlier Rate]
Summary: All validation checks pass. Outlier rates are consistently below 2% and declining over time (1.6% in 2017 → 0.14% in 2025).
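
Two of the checks above, the range check and the duplicate check, are simple enough to sketch directly. The example below assumes rows are dicts keyed by the data dictionary field names, uses COMPOSITE_FIGI as the ticker key, and applies the range check to NEWS_SENTIMENT only. The function name and sample rows are hypothetical, and the outlier check is left out because the report does not define what counts as an outlier.

```python
from collections import Counter

def run_basic_checks(rows, lo=-1.0, hi=1.0):
    """Return {check name: passed?} for the range and duplicate checks."""
    # Range check: every non-null score must lie in [lo, hi].
    values = [r["NEWS_SENTIMENT"] for r in rows if r["NEWS_SENTIMENT"] is not None]
    in_range = all(lo <= v <= hi for v in values)
    # Duplicate check: each (identifier, date) pair may appear only once.
    key_counts = Counter((r["COMPOSITE_FIGI"], r["DATE"]) for r in rows)
    no_duplicates = all(count == 1 for count in key_counts.values())
    return {"range_check": in_range, "duplicate_check": no_duplicates}

# Illustrative rows only.
sample_rows = [
    {"DATE": "2024-01-02", "COMPOSITE_FIGI": "FIGI_A", "NEWS_SENTIMENT": 0.4},
    {"DATE": "2024-01-02", "COMPOSITE_FIGI": "FIGI_B", "NEWS_SENTIMENT": None},
    {"DATE": "2024-01-03", "COMPOSITE_FIGI": "FIGI_A", "NEWS_SENTIMENT": -0.9},
]
checks = run_basic_checks(sample_rows)
```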
6. Timeliness & Latency
Availability Lag: T+1
Timestamp: EOD
Revisions: None
Update Freq: Daily

Latency Specifications

Component | Typical Latency | Max Latency | Notes
News Sentiment | T+1 (EOD) | T+1 | 30-day rolling aggregation
EC Sentiment | T+1 after call | T+2 | Processed after transcript available
10-K Sentiment | T+1 after filing | T+3 | Full document processing
Combined Signal | T+1 | T+1 | Available by market open

Timestamp Reliability

✓ Point-in-Time: All data is timestamped as of EOD on the observation date. No backfilling or revisions occur. Signal is available T+1 for trading at next open.
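
In a backtest, the T+1 rule means a signal observed on day T should only be acted on from the next trading day. A minimal sketch of that alignment follows, assuming an ordered list of trading days is available. The function name and the sample dates and values are hypothetical.

```python
def align_signal_to_trade_date(signal_by_date, trading_days):
    """Shift each observation to the next trading day, when it becomes tradable.

    Observations on the last trading day are dropped because their trade
    date is not yet known.
    """
    next_day = dict(zip(trading_days, trading_days[1:]))
    return {next_day[d]: s for d, s in signal_by_date.items() if d in next_day}

# Illustrative calendar and signal values.
trading_days = ["2024-01-02", "2024-01-03", "2024-01-04"]
signal = {"2024-01-02": 0.10, "2024-01-03": -0.20, "2024-01-04": 0.05}
tradable = align_signal_to_trade_date(signal, trading_days)
```

Using an explicit trading calendar rather than adding one calendar day keeps weekends and holidays from producing trade dates on which the market is closed.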
7. Stability & Structural Breaks
Std CV: 16.1%
Avg Std: 0.131
5th Pctl: -0.34
95th Pctl: 0.21

Signal Distribution Over Time

[Chart: Signal Distribution]

Signal Volatility (Stability Check)

[Chart: Signal Std]

Detected Regime Changes

Period | Observation | Impact | Recommendation
2017-2018 | Lower news coverage (64-71%) | Sparse signal | Weight down or exclude
Mar 2020 | Elevated volatility (COVID) | Signal std spike | Expected, no action needed
2020-2021 | Signal std declining trend | Regime shift | Monitor for continued decline
2023-2025 | Stable, lower volatility | New steady state | Adjust models if needed
Note: Signal standard deviation has declined from ~0.16 (2017) to ~0.10 (2024-2025). This may indicate reduced cross-sectional dispersion in sentiment or improved data quality.
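
One plausible reading of the "Std CV" metric in this section is the coefficient of variation of the yearly signal standard deviations, that is, their standard deviation divided by their mean. The report does not define the metric, so the sketch below is an assumption, including the choice of population rather than sample standard deviation. The input values are illustrative placeholders, not the dataset's yearly figures.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

# Illustrative yearly standard deviations, NOT the actual per-year values.
yearly_std = [0.16, 0.13, 0.10]
cv = coefficient_of_variation(yearly_std)
```

A lower value indicates that signal dispersion is more stable from year to year.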
8. Bias & Distortion Analysis

Bias Assessment

Bias Type | Present? | Severity | Mitigation
Survivorship Bias | ✓ No | None | Point-in-time data includes delisted securities
Look-Ahead Bias | ✓ No | None | T+1 lag enforced, trade at next open
Backfill Bias | ✓ No | None | No historical revisions
Selection Bias | ✓ No | None | Universe defined independently
Coverage Bias | ⚠ Yes | Low-Med | Large caps have better news coverage
Temporal Bias | ⚠ Yes | Low | Early years have lower coverage

Signal Mean by Year (Bias Check)

[Chart: Signal Mean]
Coverage Bias Detail: Large-cap stocks typically have 95%+ news coverage while small-caps may have 60-70%. Consider filtering universe by market cap or applying coverage-weighted analysis.
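
Market cap is not a field in this dataset, so a filter that uses only the delivered fields has to rely on NEWS_VOLUME as a proxy for coverage. The sketch below assumes rows are dicts keyed by the data dictionary field names. The function name and the five-article threshold are hypothetical and would need tuning against the actual universe.

```python
def filter_by_news_volume(rows, min_articles=5):
    """Keep rows whose lookback article count meets a minimum threshold."""
    return [r for r in rows if r["NEWS_VOLUME"] >= min_articles]

# Illustrative rows only.
sample_rows = [
    {"COMPOSITE_FIGI": "FIGI_A", "NEWS_VOLUME": 42, "NEWS_SENTIMENT": 0.10},
    {"COMPOSITE_FIGI": "FIGI_B", "NEWS_VOLUME": 1, "NEWS_SENTIMENT": 0.65},
    {"COMPOSITE_FIGI": "FIGI_C", "NEWS_VOLUME": 5, "NEWS_SENTIMENT": -0.20},
]
well_covered = filter_by_news_volume(sample_rows)
```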
9. Corporate Actions & Event Handling

Corporate Action Treatment

Event Type | Handling | Historical Consistency
Ticker Changes | COMPOSITE_FIGI maintains continuity | ✓ Consistent
Mergers/Acquisitions | Acquirer inherits target sentiment history | ✓ Consistent
Spin-offs | New FIGI assigned, fresh history | ✓ Consistent
Delistings | Data ends at delist date | ✓ Consistent
IPOs | Data begins at IPO date | ✓ Consistent
Stock Splits | No impact (sentiment, not price) | ✓ N/A

Earnings Call Handling

Timing: Sentiment updated day after call
Carry Forward: Previous EC sentiment until next call
Missing Calls: Rare (<1%), uses prior value
Summary: Corporate actions are handled consistently using COMPOSITE_FIGI as the primary identifier, ensuring continuity across ticker changes and corporate events.
10. Field-Level Documentation & Metadata

Data Dictionary

Field | Type | Range | Description | Known Limitations
DATE | Date | 2017-01-02 to present | Trading day (EOD timestamp) | None
COMPOSITE_FIGI | String | BBG format | Bloomberg FIGI identifier | Some legacy tickers may lack FIGI
NEWS_SENTIMENT | Float | [-1, 1] | 30-day news sentiment score | Sparse for small-caps
NEWS_VOLUME | Int | [0, ∞) | Article count in lookback | None
EC_LAST_CALL_SENTIMENT | Float | [-1, 1] | Most recent EC sentiment | ~90 day stale for some
CF_LAST_10K_SENTIMENT | Float | [-1, 1] | Most recent 10-K sentiment | Annual update only
COMBINED_SENTIMENT | Float | Normalized | Equal-weight average | Dominated by news for active names
NEWS_DAYS | Int | 30 | News aggregation window | Fixed parameter
EC_DAYS | Int | [0, ~120] | Days since last EC | None
CF_DAYS | Int | [0, ~400] | Days since last 10-K | None
11. Controls, QA Process & Monitoring

Quality Assurance Checks

Check | Frequency | Owner | Action on Failure
File delivery | Daily | Data Ops | Alert, investigate vendor
Record count | Daily | Automated | Alert if >5% deviation
Schema validation | Daily | Automated | Reject file, alert
Range checks | Daily | Automated | Flag outliers for review
Coverage check | Weekly | Data Ops | Investigate drops >5%
Distribution check | Monthly | Quant Team | Review regime changes
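
The daily record-count check in the table above reduces to a relative-deviation test. The sketch below assumes the baseline is a reference count such as the previous day's file or a trailing average, which the report does not specify. The function name and the sample counts are hypothetical.

```python
def record_count_alert(today_count, baseline_count, threshold=0.05):
    """Return True when today's count deviates from the baseline by more than threshold."""
    deviation = abs(today_count - baseline_count) / baseline_count
    return deviation > threshold

# Illustrative counts: a drop of 200 from 3,500 is about 5.7% and triggers an alert.
alert_large_drop = record_count_alert(3300, 3500)
alert_small_drop = record_count_alert(3450, 3500)
```

Taking the absolute deviation means an unexpected jump in record count is flagged as well as a drop.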

Incident Handling

Process: Data issues are logged, investigated within 24h, and communicated to downstream users. Historical corrections are applied as new files (not in-place updates).
12. Known Issues & Limitations

Active Issues

Issue | Severity | Affected Period | Workaround
Low news coverage in 2017-2018 | Medium | 2017-01 to 2018-12 | Weight down or exclude period
Signal std regime change | Low | 2020 onwards | Re-normalize if needed
Sparse small-cap coverage | Low | Ongoing | Filter by market cap or news volume

Resolved Issues

Issue | Period | Resolution | Date Fixed
News coverage gap | 2017-2019 | News sources expanded | 2020-01
High outlier rates | 2017 | Improved NLP models | 2018-06

Known Constraints

  • EC sentiment may be stale (up to 120 days between calls)
  • 10-K sentiment updates annually only
  • Combined signal dominated by news for high-volume names
  • International stocks not covered (US equities only)
13. Compliance & Licensing Considerations

Data Rights

Aspect | Status | Notes
Internal Use | ✓ Permitted | Research, trading, risk
Model Development | ✓ Permitted | Alpha signals, factors
Redistribution | ✗ Not Permitted | Requires separate license
Client Reports | ✓ Permitted | Aggregated/derived only
Regulatory Filing | ✓ Permitted | Audit trail available

Regulatory Considerations

MNPI: Sentiment is derived from public sources only (news, public filings, public earnings calls). No MNPI concerns identified.
Audit Trail: Full lineage maintained. Historical files retained for 7 years per regulatory requirements.
14. Appendices

A. Monthly Coverage Patterns

[Chart: Monthly Pattern]

B. Yearly Ticker Counts

[Chart: Tickers by Year]

C. Statistical Summary

Metric | Value
Total Records | 7,116,723
Trading Days | 2,136
Unique Tickers (Max) | 3,776
Date Range | 2017-01-02 to 2025-03-10
Average News Coverage | 85.9%
Signal Mean (Overall) | -0.0005
Signal Std (Overall) | 0.1313
Average Outlier Rate | 0.83%

D. Methodology Notes

Signal Construction: COMBINED_SENTIMENT is calculated as the equal-weighted average of NEWS_SENTIMENT, EC_LAST_CALL_SENTIMENT, and CF_LAST_10K_SENTIMENT after cross-sectional z-score normalization within each date. Outliers are winsorized at the 1st and 99th percentiles.
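
The normalization and averaging steps described above can be sketched for a single date as follows. This is an illustration, not BRAIN's implementation. It assumes population standard deviation for the z-score and averages whichever components are present for a security, a choice the report does not document. Winsorization is omitted because the report does not say at which stage it is applied. Function names and sample rows are hypothetical.

```python
from statistics import mean, pstdev

COMPONENTS = ("NEWS_SENTIMENT", "EC_LAST_CALL_SENTIMENT", "CF_LAST_10K_SENTIMENT")

def zscore(values):
    """Cross-sectional z-score of one component; None entries pass through."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mu, sigma = mean(present), pstdev(present)
    if sigma == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]

def combined_sentiment(rows):
    """rows: all records for ONE date. Returns one combined score per row."""
    columns = [zscore([r[c] for r in rows]) for c in COMPONENTS]
    combined = []
    for per_row in zip(*columns):
        available = [z for z in per_row if z is not None]
        # Assumption: equal-weight average over the components that exist.
        combined.append(mean(available) if available else None)
    return combined

# Illustrative cross-section of two securities on one date.
rows = [
    {"NEWS_SENTIMENT": 0.2, "EC_LAST_CALL_SENTIMENT": 0.1, "CF_LAST_10K_SENTIMENT": -0.1},
    {"NEWS_SENTIMENT": -0.2, "EC_LAST_CALL_SENTIMENT": -0.1, "CF_LAST_10K_SENTIMENT": 0.1},
]
scores = combined_sentiment(rows)
```

Running the function once per date keeps the normalization cross-sectional, so scores are comparable across securities on the same day.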