EDA for Financial Data — The Gotchas
Six ways financial data trips up exploratory data analysis: fat tails, survivorship bias, look-ahead bias, regime changes, missingness that means something, and time-zone landmines.
Exploratory data analysis is the first thing you do with a new dataset, and financial data has a specific set of traps that EDA tutorials never warn you about. These are the ones I’ve watched smart people walk into.
1. Fat tails are not outliers
In a normal distribution, a 5-sigma event happens about once every 1.7 million observations. In daily equity returns, 5-sigma days happen multiple times per decade. They’re not bugs in the data, they’re the data.
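The 1.7 million figure is easy to check by hand. A minimal sketch, assuming "5-sigma" means a move of more than five standard deviations in either direction and roughly 252 trading days per year (both are my assumptions, not properties of any particular dataset):

```python
import math

def two_sided_normal_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2.0))

p5 = two_sided_normal_tail(5.0)
obs_per_event = 1.0 / p5                 # about 1.74 million observations
years_per_event = obs_per_event / 252    # assuming ~252 trading days per year

print(f"P(|Z| > 5) = {p5:.3e}")
print(f"one event per {obs_per_event:,.0f} observations")
print(f"roughly one per {years_per_event:,.0f} years of daily data")
```

Under a normal model you would wait thousands of years for one such day, which is why seeing several per decade tells you the model is wrong, not the data.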
If your EDA pipeline auto-removes “outliers” by z-score, you’ve just thrown away the most informative observations — the days that actually matter for risk. Don’t winsorize blindly.
What to do instead: plot the empirical CDF (or its complement, the survival function) with the probability axis on a log scale, and look at the tail behavior directly. If you must trim, trim on documented data-error criteria (e.g., a known missing stock-split adjustment), not statistical ones.
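One way to look at the tail directly is to compute the empirical survival function of the absolute z-scores. The sketch below is illustrative only: `empirical_tail` is a hypothetical helper, and the Student-t sample is a synthetic stand-in for real daily returns.

```python
import numpy as np

def empirical_tail(returns):
    """Return (x, p): sorted |z-scores| and the fraction of observations at or above each."""
    r = np.asarray(returns, dtype=float)
    r = r[~np.isnan(r)]
    z = np.abs((r - r.mean()) / r.std(ddof=1))
    x = np.sort(z)
    n = x.size
    p = 1.0 - np.arange(n) / n   # 1 at the smallest value, 1/n at the largest
    return x, p

# Synthetic stand-in for daily returns: Student-t with 3 degrees of freedom (fat-tailed).
rng = np.random.default_rng(0)
fake_returns = 0.01 * rng.standard_t(df=3, size=5000)

x, p = empirical_tail(fake_returns)
n_extreme = int((x > 5).sum())
print(f"{n_extreme} of {x.size} observations beyond 5 sigma")

# To inspect the tail, plot x against p with both axes on a log scale
# (for example matplotlib's plt.loglog(x, p)) and compare with a normal sample.
```

On log axes a normal sample falls away sharply, while a fat-tailed one decays much more slowly. Nothing is removed from the data at any point.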
2. Survivorship bias
If you pull “every company in the S&P 500 today” and backtest a strategy on them, you’ve selected the winners. The losers got delisted or dropped from the index and aren’t in your sample. Expect backtested returns roughly 1–2 percentage points per year higher than reality.
Same trap with mutual funds: a database that keeps only live funds (a live-only Morningstar extract, say) carries roughly 1% per year of survivorship bias. The dead funds had worse returns and disappeared.
What to do instead: use point-in-time data, meaning the universe as it was on each historical date, not as it is now. CRSP for US equities has this. For corporate credit, use Compustat with point-in-time financial statements.
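If your vendor gives you membership history as start and end dates, rebuilding the universe for any date takes a few lines. A minimal sketch, assuming a hypothetical `membership` table with one row per ticker and membership spell (the column names and tickers are made up for illustration):

```python
import pandas as pd

# Hypothetical membership history: an open-ended spell has end = NaT.
membership = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],
        "start": pd.to_datetime(["2005-01-03", "2005-01-03", "2012-06-01"]),
        "end": pd.to_datetime(["2011-09-30", None, None]),
    }
)

def universe_as_of(membership: pd.DataFrame, date) -> list:
    """Tickers that were members on `date`, whatever happened to them later."""
    date = pd.Timestamp(date)
    active = (membership["start"] <= date) & (
        membership["end"].isna() | (membership["end"] >= date)
    )
    return sorted(membership.loc[active, "ticker"])

print(universe_as_of(membership, "2010-01-04"))  # includes AAA, which later left
print(universe_as_of(membership, "2015-01-02"))  # AAA is gone, CCC has joined
```

A backtest that calls something like this on every rebalance date sees the losers while they were still alive.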
3. Look-ahead bias
Easier to describe than to avoid. You build a feature using Q4 earnings as if you knew them on January 1st, but those earnings aren’t reported until late January through March. Your model “knows the future” and looks brilliant in backtest.
What to do instead: lag every feature by its actual availability date. For accounting data, that’s the filing date (or filing date + a buffer for “when the market actually processed it”). For macro data, mind the release-date vs. reference-period distinction (a number labeled “January 2026 CPI” is released in mid-February).
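In pandas, an as-of join keyed on the availability date handles most of this. The sketch below uses made-up numbers; the column names and the one-day `BUFFER` are assumptions for illustration, not a recommendation.

```python
import pandas as pd

# Hypothetical fundamentals: period_end is what the number describes,
# filing_date is when it became public.
fundamentals = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA"],
        "period_end": pd.to_datetime(["2023-09-30", "2023-12-31"]),
        "filing_date": pd.to_datetime(["2023-11-02", "2024-02-15"]),
        "eps": [1.10, 1.40],
    }
)
prices = pd.DataFrame(
    {
        "ticker": ["AAA"] * 3,
        "date": pd.to_datetime(["2024-01-02", "2024-02-14", "2024-02-16"]),
        "close": [50.0, 52.0, 55.0],
    }
)

BUFFER = pd.Timedelta(days=1)  # assumed processing lag; tune or drop as appropriate
fundamentals["available"] = fundamentals["filing_date"] + BUFFER

# For each price date, take the latest row whose availability date is on or before it.
features = pd.merge_asof(
    prices.sort_values("date"),
    fundamentals.sort_values("available"),
    left_on="date",
    right_on="available",
    by="ticker",
    direction="backward",
)
print(features[["date", "close", "period_end", "eps"]])
```

The Q4 figure only shows up on the February 16 row. On January 2 and February 14 the join still returns the Q3 number, which is all the market knew.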
4. Regime changes break stationarity
Most ML assumes the data-generating process doesn’t change. Financial data violates this constantly: monetary policy regimes, market structure changes, technological shifts in trading.
A model trained on 2010–2019 (low rates, low vol) learns a different world than one whose training window also includes 2020–2022. The 2008 crisis broke models that had worked beautifully for a decade.
What to do instead: check stationarity (ADF test, KPSS test, but more importantly visual inspection). Plot rolling means and rolling correlations. If they’re drifting, your model probably won’t generalize.
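For the visual check, rolling statistics are usually enough to spot a break. The sketch below builds synthetic returns with a deliberate volatility shift halfway through; the series, the seed, and the 126-day window are all arbitrary choices of mine. If you also want the formal tests, `statsmodels.tsa.stattools` provides `adfuller` and `kpss`. Keep in mind they have opposite null hypotheses (unit root for ADF, stationarity for KPSS).

```python
import numpy as np
import pandas as pd

# Synthetic returns with a deliberate break halfway: volatility triples.
rng = np.random.default_rng(1)
dates = pd.bdate_range("2018-01-01", periods=1000)
calm = rng.normal(0.0, 0.005, size=500)
stressed = rng.normal(0.0, 0.015, size=500)
a = pd.Series(np.concatenate([calm, stressed]), index=dates, name="a")
noise = pd.Series(rng.normal(0.0, 0.005, size=1000), index=dates)
b = (0.5 * a + noise).rename("b")

WINDOW = 126  # about six months of trading days; an arbitrary choice

rolling = pd.DataFrame(
    {
        "mean_a": a.rolling(WINDOW).mean(),
        "vol_a": a.rolling(WINDOW).std(),
        "corr_ab": a.rolling(WINDOW).corr(b),
    }
).dropna()

# Plot these (for example rolling.plot(subplots=True)) and look for level shifts or drift.
print(rolling.iloc[[0, -1]])
```

In this toy example both the rolling volatility and the rolling correlation jump after the break, even though the relationship between `a` and `b` never changed. Only the noise level did.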
5. Missing values mean things
In a survey dataset, missing might mean “the respondent skipped this question.” In financial data, missing usually means something specific you should encode:
- A debt covenant ratio missing because the company has no debt — meaningful, not noise.
- A loan’s last-payment-date missing because no payment was ever made — extremely meaningful.
- A bond’s yield missing because the bond didn’t trade today — informative about liquidity.
Don’t impute mean. Don’t impute median. Create a missing indicator and let the model use it.
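A minimal sketch of the indicator approach, using a made-up loan table (`add_missing_flags` and the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical loan-level data: NaN in debt_to_ebitda means "no debt",
# NaT in last_payment_date means "never paid".
loans = pd.DataFrame(
    {
        "loan_id": [1, 2, 3],
        "debt_to_ebitda": [2.5, np.nan, 4.0],
        "last_payment_date": pd.to_datetime(["2024-03-01", None, "2024-02-15"]),
    }
)

def add_missing_flags(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy with an integer <col>_missing flag per column; values are left untouched."""
    out = df.copy()
    for col in columns:
        out[f"{col}_missing"] = out[col].isna().astype(int)
    return out

flagged = add_missing_flags(loans, ["debt_to_ebitda", "last_payment_date"])
print(flagged)
```

If the downstream model can't accept NaN, fill the original column with a sentinel only after the flag exists, so the information survives.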
6. Time zones and trading calendars
Two ways this breaks:
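- Time zones: an Asian market’s close on a given date happens more than half a day before the US close on that same date. Stamp a “daily” Asian equity return with US end-of-day timestamps, or convert timestamps carelessly, and the return can land on the neighboring calendar day. Easy to be off by one for a whole day of returns.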
- Trading calendars: US markets close on Thanksgiving, UK markets don’t. If you compute a US/UK correlation on calendar-date-aligned data, you’re correlating Thursday-UK against Thursday-US (which is a holiday, so a forward-filled price shows no change). Use trading calendars, not calendar dates.
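Both failure modes are easy to reproduce. The sketch below uses made-up closes around US Thanksgiving 2023 (Thursday, November 23) and a single Tokyo timestamp; the prices are invented. Keeping only dates on which both markets traded is one simple alignment choice among several.

```python
import pandas as pd

# Hypothetical closes: the US series has no row for the holiday, the UK series does.
us = pd.Series(
    [100.0, 101.0, 102.0],
    index=pd.to_datetime(["2023-11-21", "2023-11-22", "2023-11-24"]),
    name="us",
)
uk = pd.Series(
    [200.0, 202.0, 201.0, 203.0],
    index=pd.to_datetime(["2023-11-21", "2023-11-22", "2023-11-23", "2023-11-24"]),
    name="uk",
)

# Naive: put both on every date and forward-fill the gap.
naive = pd.concat([us, uk], axis=1).sort_index().ffill().pct_change().dropna()

# Calendar-aware: keep only dates on which both markets traded,
# so each pair of returns spans the same window.
common = pd.concat([us, uk], axis=1, join="inner").sort_index()
aligned = common.pct_change().dropna()

print(naive)    # includes a spurious 0% US return on the holiday
print(aligned)  # the move from the 22nd to the 24th is one return on both sides

# Time zones: the same instant can carry different calendar dates.
tokyo_open = pd.Timestamp("2024-03-05 09:00", tz="Asia/Tokyo")
ny_time = tokyo_open.tz_convert("America/New_York")
print(ny_time)  # March 4 in New York, not March 5
```

For real work you would pull holiday schedules from an exchange calendar source instead of inferring them from gaps in the data.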
A useful EDA checklist for any new financial dataset
Before any modeling, write down answers to:
- What is the actual unit of observation? (Loan? Borrower? Loan-month?)
- What is the date column actually — origination, observation, or report date?
- What’s the universe selection rule? Is it point-in-time or current-roster?
- What fraction is missing per column, and what does each missing mean?
- What does the tail look like? (Empirical CDF on log scale.)
- Is there a structural break in the time series? Where?
- What’s the earliest date by which each feature would have been knowable?
If you can answer all seven before fitting a single model, you’ve already avoided the most common embarrassments.
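A couple of these questions have a mechanical first pass. The sketch below is a hypothetical helper (`quick_profile` is my name for it, and the loan-month panel is invented) that checks whether your assumed unit of observation is actually unique and how much of each column is missing. What the missing values mean still takes domain knowledge.

```python
import pandas as pd

def quick_profile(df: pd.DataFrame, key: list, date_col: str) -> dict:
    """Mechanical checks for a few checklist items; the rest need domain knowledge."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df.duplicated(subset=key).sum()),
        "date_min": df[date_col].min(),
        "date_max": df[date_col].max(),
        "missing_fraction": df.isna().mean().round(3).to_dict(),
    }

# Hypothetical loan-month panel with one duplicated key and one missing balance.
panel = pd.DataFrame(
    {
        "loan_id": [1, 1, 2, 2, 2],
        "month": pd.to_datetime(
            ["2024-01-31", "2024-02-29", "2024-01-31", "2024-02-29", "2024-02-29"]
        ),
        "balance": [1000.0, 950.0, 500.0, None, 480.0],
    }
)

profile = quick_profile(panel, key=["loan_id", "month"], date_col="month")
print(profile)
```

A nonzero `duplicate_keys` count means the unit of observation is not what you assumed, and that is worth resolving before anything else.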
Next in this track: [[module-2-feature-engineering-credit]] — building features that respect all the gotchas above, using WOE and Information Value (which we’ll also use in the [[module-2-retail-credit-scoring]] credit-models post).