machine learning
feature engineering
quant research

Feature Engineering for Financial Machine Learning

In financial ML, most of the edge and most of the disasters come from features, not models. A practitioner's guide to building features that generalize: stationarity, fractional differencing, triple-barrier labeling, look-ahead leakage, and purged cross-validation.

D&T Systems · 12 min read

Why features beat models in finance

In most machine learning domains, model architecture is where the gains live. In finance it is almost the opposite. Two properties of market data invert the usual priorities: the signal-to-noise ratio is brutally low, and the data-generating process is non-stationary. A daily price series is mostly noise, and whatever predictable structure exists keeps changing as regimes shift.

Under those conditions, a flexible model is a liability, not an asset. Give a deep network thousands of free parameters and a noisy, short dataset and it will fit the noise beautifully, then collapse out of sample. A well-engineered feature set fed to a regularized gradient-boosted tree will routinely beat a deep net on the same problem, because the feature set has already done the hard work of exposing the little real structure that exists, and the model is constrained enough not to invent structure that isn't there.

The practical consequence: spend your time on features. The difference between a strategy that survives live trading and one that dies in the first month is almost never the model. It is whether your features carry genuine forward-looking information, and whether your labels and validation were honest. The rest of this guide is about both halves of that sentence.

Stationarity and the differencing dilemma

Almost every statistical learning method assumes the input distribution is stable. Raw price violates that assumption completely: its level wanders, its variance changes, and a model trained while Bitcoin traded near $30,000 has never seen inputs that look like Bitcoin at $90,000. You have to make the series stationary before a model can learn anything that transfers.

The reflex fix is to take returns or log-returns. That works for stationarity, but it throws away almost all memory of where price actually is. A return series has no idea whether the market is at an all-time high or deep in a drawdown. You traded non-stationarity for amnesia.

The differencing trade-off

Raw price     → full memory,  not stationary

Returns (d=1) → stationary,  almost no memory

Fractional diff (0 < d < 1) → stationary + maximum memory

Pick the smallest d whose series passes an ADF stationarity test.

Fractional differencing, popularized for financial ML by Marcos López de Prado, is the middle ground. Instead of differencing once (integer order d = 1), you apply a fractional order, say d = 0.35 or 0.4, which amounts to a weighted sum of past values whose weights shrink slowly with lag. You then search for the smallest differencing order that makes the series pass an Augmented Dickey-Fuller test. The result is stationary enough to model while retaining far more of the level information than plain returns. It is one of the highest-value transforms in the financial ML toolkit, and most retail pipelines skip it entirely.
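To make the weighted sum concrete, here is a minimal sketch of the fixed-width-window variant in plain Python. The function names, the 1e-4 weight cutoff, and the list-based interface are our own illustrative choices, not a reference implementation. The search over d (running an ADF test such as statsmodels' `adfuller` on each candidate series) is left out.

```python
from typing import List, Optional

def fracdiff_weights(d: float, threshold: float = 1e-4, max_lags: int = 1000) -> List[float]:
    """Weights of the (1 - B)^d expansion, cut off once |w_k| < threshold."""
    weights = [1.0]
    for k in range(1, max_lags):
        # Recurrence: w_k = -w_{k-1} * (d - k + 1) / k
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
    return weights

def fracdiff(series: List[float], d: float, threshold: float = 1e-4) -> List[Optional[float]]:
    """Fractionally difference a series using only current and past values."""
    w = fracdiff_weights(d, threshold)
    out: List[Optional[float]] = []
    for t in range(len(series)):
        if t < len(w) - 1:
            out.append(None)  # not enough history for the full window yet
        else:
            out.append(sum(w[k] * series[t - k] for k in range(len(w))))
    return out
```

With d = 1 the weights collapse to [1, -1], which is the ordinary first difference. With d = 0.4 the weights start 1, -0.4, -0.12, -0.064 and keep shrinking slowly, which is where the retained memory comes from. In practice you would usually apply this to log prices.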

Useful feature families

A strong feature set is diversified across the things that actually move markets. Stacking ten variations of the same momentum indicator adds correlated noise, not information. Cover distinct families instead.

  • Momentum and returns: computed over multiple horizons (1, 5, 20, 60 bars) so the model sees both fast and slow drift; ratios between horizons capture acceleration.
  • Volatility: realized vol over rolling windows, ATR, Garman-Klass or Parkinson estimators from OHLC; volatility regime is one of the most predictable things in markets.
  • Volume and microstructure: volume ratios, order-flow imbalance, bid-ask spread, depth, and (for perps) funding rate and open-interest change.
  • Calendar and session: day of week, time of day, session (Asia/EU/US), distance to known events; many effects are conditional on session.
  • Cross-asset and regime: DXY, rates, BTC dominance, correlated-asset returns, plus an explicit regime label (trending vs. ranging) so the model can condition on context.
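As a rough illustration of the first two families, the sketch below computes multi-horizon log returns and a Parkinson volatility estimate from plain lists of prices. The function names and the `accel_5_20` feature are hypothetical. We express acceleration as a difference of per-bar drifts rather than a ratio, which is our own choice to avoid dividing by returns close to zero.

```python
import math

def log_return(closes, t, horizon):
    """Log return over the `horizon` bars ending at bar t, or None if history is short."""
    if t - horizon < 0:
        return None
    return math.log(closes[t] / closes[t - horizon])

def momentum_features(closes, t, horizons=(1, 5, 20, 60)):
    """Multi-horizon returns plus a simple fast-vs-slow acceleration term."""
    feats = {f"ret_{h}": log_return(closes, t, h) for h in horizons}
    fast, slow = log_return(closes, t, 5), log_return(closes, t, 20)
    # Per-bar drift over 5 bars minus per-bar drift over 20 bars.
    feats["accel_5_20"] = None if fast is None or slow is None else fast / 5 - slow / 20
    return feats

def parkinson_vol(highs, lows, t, window=20):
    """Parkinson per-bar volatility from high/low ranges over a trailing window."""
    if t - window + 1 < 0:
        return None
    s = sum(math.log(highs[i] / lows[i]) ** 2 for i in range(t - window + 1, t + 1))
    return math.sqrt(s / (4.0 * window * math.log(2.0)))
```

Every value here uses bar t's own close, high, and low, so these features are only usable for a decision taken after bar t has finished.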

Two principles tie this together. First, every feature should have an economic rationale you can state in one sentence; if you cannot say why it might predict returns, it is probably noise mining. Second, normalize features to be comparable across regimes (a raw ATR of 800 means nothing without dividing by price or z-scoring against a rolling window) but normalize using only past data, which brings us to the most dangerous topic in the field.
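A trailing z-score is one simple way to normalize with past data only. The sketch below is a minimal version under our own assumptions: the name `trailing_zscore`, the 50-bar default window, and the decision to exclude the current observation from its own mean and standard deviation are illustrative choices.

```python
import statistics

def trailing_zscore(values, window=50):
    """z-score each value against the `window` observations strictly before it."""
    out = []
    for t, x in enumerate(values):
        history = values[max(0, t - window):t]  # slice stops before t: no current or future data
        if len(history) < window:
            out.append(None)  # warm-up period
            continue
        mu = statistics.fmean(history)
        sd = statistics.pstdev(history)
        out.append((x - mu) / sd if sd > 0 else 0.0)
    return out
```

A quick sanity check for any normalizer like this: append new data to the end of the series and confirm that none of the earlier outputs change. If they do, the transform is looking ahead.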

Labeling done right

Most people obsess over features and then attach a careless label. The label is half the supervised learning problem. Get it wrong and even perfect features cannot save you.

The naive approach is a fixed-horizon label: "will price be higher 10 bars from now?" This has two problems. It ignores the path (a trade that hits your stop-loss and then recovers before the horizon is still labeled a win), and it makes every prediction blind to risk. Worse, overlapping fixed-horizon windows make neighboring labels heavily correlated, which leaks information between folds in cross-validation.

The triple-barrier method, from López de Prado's Advances in Financial Machine Learning, fixes the path problem. Each observation gets three barriers, and the label is set by whichever is touched first:

  • Upper barrier: profit-taking level, often a multiple of realized volatility → label +1
  • Lower barrier: stop-loss level → label −1
  • Vertical barrier: maximum holding period in bars → label by sign of return at expiry

This produces labels that reflect how a trade would actually be managed, with volatility-scaled barriers so a 1% move in calm markets and a 4% move in volatile ones are treated comparably. It pairs naturally with meta-labeling: a primary model (or even a simple rule) decides direction, and a secondary ML model learns whether to act on each signal, effectively learning the precision of your primary strategy and sizing accordingly. Meta-labeling is one of the cleanest ways to add ML on top of a discretionary edge without letting the model take over direction entirely.
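The three barriers can be sketched in a few lines. This is a simplified illustration for a long entry at the close of bar `t0`, not López de Prado's full implementation: the function name, the default multipliers, and the use of closes only (a production version would typically check intrabar highs and lows) are all our assumptions.

```python
def triple_barrier_label(closes, t0, vol, pt_mult=2.0, sl_mult=2.0, max_hold=10):
    """Return (label, exit_index) for a long position entered at closes[t0].

    `vol` is a per-bar volatility estimate known at t0, as a return fraction.
    The label is None when the vertical barrier lies beyond the available data.
    """
    entry = closes[t0]
    upper = entry * (1.0 + pt_mult * vol)  # profit-taking barrier
    lower = entry * (1.0 - sl_mult * vol)  # stop-loss barrier
    horizon_end = t0 + max_hold            # vertical barrier
    last = min(horizon_end, len(closes) - 1)
    for t in range(t0 + 1, last + 1):
        if closes[t] >= upper:
            return 1, t
        if closes[t] <= lower:
            return -1, t
    if horizon_end > len(closes) - 1:
        return None, last  # outcome not yet known: leave unlabeled
    ret = closes[last] / entry - 1.0
    return (1 if ret > 0 else -1 if ret < 0 else 0), last
```

The returned exit index matters as much as the label: the span from `t0` to the exit is the sample's label window, which is what purged cross-validation needs in order to detect overlap between training and test samples.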

The cardinal sin: look-ahead leakage

If a backtest looks brilliant, assume leakage until proven otherwise. Look-ahead bias, letting information that would not have been available at decision time into your features or labels, is the single most common reason a model that scores 65% accuracy in research delivers 50% live. These are the leaks we find most often when auditing client pipelines:

Concrete leaks to hunt for

1. Using a bar's close to predict that same bar

you only know the close after the bar finishes

2. Normalizing with full-sample mean / std

the scaler has seen the test set's statistics

3. Resampling or interpolating across train/test split

future values bleed into past rows

4. Back-filled / restated data used as-of past

fundamentals and corporate data leak when later values or revisions are written into earlier rows

The discipline that prevents all of these is simple to state and hard to enforce: every value in a feature row must be computable using only information available strictly before the prediction timestamp. Compute every rolling statistic on an expanding or trailing window. Fit every scaler, every PCA, every target encoder on the training fold only, then apply it to the test fold. Never fit on the full dataset. Lag any data that arrives with delay (fundamentals, on-chain, sentiment) by its real publication lag. When in doubt, ask: "Would I have had this number, with this value, at the moment I needed to act?"
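Two of those rules are easy to show in code: fitting a scaler on the training fold only, and looking up delayed data by the time it was actually released. The helpers below are hypothetical names of our own, and the as-of lookup assumes you have already converted each record's timestamp into its real release time (event time plus publication lag) and sorted by it.

```python
import bisect
import statistics

def fit_scaler(train_values):
    """Learn mean and standard deviation from the training fold only."""
    mu = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values)
    return mu, (sd if sd > 0 else 1.0)

def apply_scaler(values, scaler):
    """Apply previously fitted statistics to any fold, including the test fold."""
    mu, sd = scaler
    return [(v - mu) / sd for v in values]

def as_of(release_times, values, decision_time):
    """Latest value released strictly before decision_time, or None.

    release_times must be sorted ascending and must already include the
    publication lag of the data source.
    """
    i = bisect.bisect_left(release_times, decision_time)
    return values[i - 1] if i > 0 else None
```

For example, `scaler = fit_scaler(train)` followed by `apply_scaler(test, scaler)` keeps the test fold's statistics out of the transform, and `as_of` returns nothing at all for a decision made before the first release rather than silently borrowing a later value.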

Cross-validation that respects time

Standard k-fold cross-validation shuffles observations and assumes they are independent. Financial observations are neither independent nor exchangeable. They are serially correlated, and with overlapping label windows (think triple-barrier labels that span several bars), a training sample can share its outcome window with a test sample. Shuffle them together and you get spectacular, completely fictional scores.

There are two correct approaches, and serious pipelines use both:

  • Purged & embargoed CV: remove from the training set any sample whose label window overlaps the test window (purging), then add a small gap after the test window (embargo) to kill residual serial correlation. This is López de Prado's recommended k-fold replacement.
  • Walk-forward validation: always train on the past and test on the future, rolling or expanding the window forward. It mirrors live deployment closely and exposes regime decay across folds.
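Purging and embargoing can be sketched as a small split generator. This is a simplified illustration rather than López de Prado's reference code: the name `purged_kfold`, the integer-index interface, and the embargo expressed in bars are our assumptions. It expects samples in time order, with `label_end[i]` giving the index of the last bar that sample i's label depends on, and at least as many samples as folds.

```python
def purged_kfold(label_end, n_splits=5, embargo=0):
    """Yield (train_idx, test_idx) with purging and an embargo.

    label_end[i] >= i is the last bar index used by sample i's label.
    """
    n = len(label_end)
    fold_size, rem = divmod(n, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + fold_size + (1 if k < rem else 0)
        test_idx = list(range(start, stop))
        test_end = max(label_end[i] for i in test_idx)  # last bar any test label touches
        train_idx = []
        for i in range(n):
            if start <= i < stop:
                continue
            # Purge: sample i's window [i, label_end[i]] overlaps [start, test_end].
            overlaps = i <= test_end and label_end[i] >= start
            # Embargo: sample starts within `embargo` bars after the test window.
            embargoed = test_end < i <= test_end + embargo
            if not overlaps and not embargoed:
                train_idx.append(i)
        yield train_idx, test_idx
        start = stop
```

With 20 samples whose labels each span two bars ahead, 4 folds, and a one-bar embargo, the second fold tests on samples 5 to 9 and trains on samples 0 to 2 and 13 to 19. Samples 3, 4, 10, and 11 are purged because their label windows overlap the test window, and sample 12 is embargoed.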

One more discipline: never tune hyperparameters on the same out-of-sample windows you report. Each time you peek at the test set and adjust, you leak a little. Keep a final hold-out period that the model and the modeler never see until the very end. Watch how performance behaves across walk-forward folds, too. Monotonic decay into the future usually means the model learned regime-specific patterns rather than durable structure.
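A walk-forward splitter with a reserved hold-out might look like the sketch below. The function name and parameters (`test_size`, `min_train`, `gap`, `holdout`) are illustrative choices of ours. The `gap` plays the same role as an embargo, leaving a few bars between the end of training and the start of testing.

```python
def walk_forward(n_samples, test_size, min_train, gap=0, holdout=0, expanding=True):
    """Yield (train_range, test_range) pairs that always train on the past.

    The last `holdout` samples are never yielded, so they stay untouched
    until the final confirmation run.
    """
    usable = n_samples - holdout
    test_start = min_train + gap
    while test_start + test_size <= usable:
        train_stop = test_start - gap
        # Expanding keeps all history; rolling keeps only the latest min_train samples.
        train_start = 0 if expanding else train_stop - min_train
        yield range(train_start, train_stop), range(test_start, test_start + test_size)
        test_start += test_size
```

Scoring each fold separately, rather than pooling all folds into one number, is what lets you see whether performance decays as the test window moves further from the training data.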

Feature selection without overfitting

With dozens of candidate features, the temptation is to throw them all in and let the model sort it out. On noisy data that just gives the model more ways to fit noise. Prune deliberately, and prune in a way that holds up across random seeds.

  • Don't trust importance from a single fit

    A single model's feature importance is unstable on noisy data. Train across many random seeds and CV folds, then keep features whose importance is consistently high. A feature that ranks first on one seed and fortieth on the next is noise.

  • Use Mean Decrease Accuracy (MDA), not just split counts

    Default tree importance (how often a feature is split on) is biased toward high-cardinality features. MDA permutes one feature at a time on out-of-sample data and measures how much accuracy drops. It directly answers: does this feature actually help prediction?

  • Kill correlated and unstable features

    Highly correlated features split their importance and mask each other. Cluster correlated features and keep one representative per cluster. Drop features whose importance or even sign flips across folds. Instability out of sample is a strong signal they will not survive live.

  • Don't confuse in-sample importance with predictive value

    A feature can be important to the model and still be useless or harmful out of sample, especially if it is leaking. Always validate the selected feature set on a clean walk-forward window, not on the data you selected from.
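The MDA idea can be sketched without any ML library. The version below takes an already-fitted model as a `predict` callable and rows as plain lists; both the interface and the function names are assumptions made for illustration. Note that it only varies the permutation seed for one fitted model. The fuller procedure described above would also repeat this across training seeds and CV folds.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def mda_importance(predict, X_test, y_test, n_features, seeds=range(20)):
    """Per feature: (mean, min) drop in out-of-sample accuracy when it is permuted."""
    base = accuracy(predict, X_test, y_test)
    result = {}
    for j in range(n_features):
        drops = []
        for seed in seeds:
            rng = random.Random(seed)
            col = [row[j] for row in X_test]
            rng.shuffle(col)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_test, col)]
            drops.append(base - accuracy(predict, X_perm, y_test))
        result[j] = (sum(drops) / len(drops), min(drops))
    return result
```

Reporting the minimum alongside the mean is a cheap stability check: a feature worth keeping should show a clear drop on nearly every seed, not just on average.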

Clean features? Now compare the models

Once your features are clean and your validation is honest, the model choice becomes a measurable question. Use our ML Model Comparison tool to see how XGBoost, LightGBM, linear baselines, and deep models actually stack up on financial data.

A practical pipeline checklist

Putting it together, here is the order of operations we use when building a feature pipeline for a trading model. Each step assumes the previous one was done honestly.

Feature pipeline, in order

1. Make inputs stationary (fractional diff, returns, vol-scaling)

2. Build diversified feature families with economic rationale

3. Lag every feature to its real availability time

4. Label with triple-barrier; add meta-labels if layering on a rule

5. Fit all scalers / encoders inside the training fold only

6. Validate with purged + embargoed CV and walk-forward

7. Select features by MDA across many seeds; cut correlated ones

8. Confirm on an untouched hold-out → only then pick a model

Notice that model selection is the last step, not the first. By the time you reach it, the feature set has already determined the ceiling of what any model can achieve. That is the central claim of this entire article: in financial machine learning, the features are the strategy, and the model is just the part that reads them.

Summary

  • Spend your time on features, not architecture: low signal-to-noise and non-stationarity make a clean feature set worth more than a fancier model
  • Make inputs stationary while keeping memory; fractional differencing retains far more level information than plain returns
  • Diversify feature families: momentum, volatility, microstructure, calendar, cross-asset/regime, each with an economic rationale
  • Label with the triple-barrier method, and use meta-labeling to layer ML on a directional rule
  • Hunt look-ahead leakage relentlessly; every value must be knowable before the decision timestamp
  • Validate with purged/embargoed CV and walk-forward, never shuffled k-fold
  • Select features by MDA across many seeds, kill correlated and unstable ones, and confirm on an untouched hold-out before choosing a model

Frequently asked questions

What is feature engineering in financial machine learning?

Feature engineering is the process of transforming raw market data (price, volume, order book) into inputs that expose predictable structure to a model. In finance it covers stationarity transforms, multi-horizon returns, realized volatility, microstructure features, calendar effects, and cross-asset context. Because financial data has a very low signal-to-noise ratio, the quality of these features determines almost all of the realized edge. The choice of model (XGBoost, LSTM, logistic regression) usually matters far less.

What is fractional differencing and why does it matter?

Raw price is non-stationary (its mean and variance drift over time), which breaks most statistical learning. The usual fix is to take returns, but integer differencing erases almost all memory of the price level. Fractional differencing, popularized for financial ML by Marcos López de Prado, applies a fractional differencing order (e.g. 0.4) so the series becomes stationary while retaining as much memory as possible. You pick the smallest differencing order that passes a stationarity test like ADF, which keeps as much predictive structure as stationarity allows.

What is the triple-barrier method for labeling?

The triple-barrier method labels each observation by which of three barriers price touches first: an upper profit-taking barrier, a lower stop-loss barrier, or a vertical time barrier (a maximum holding period). This produces labels that reflect how a trade would actually be managed, instead of an arbitrary fixed-horizon return. It pairs naturally with meta-labeling, where a primary model decides direction and a secondary model decides whether to act on each signal.

What is look-ahead bias and how do I avoid it?

Look-ahead bias (or leakage) is when information that would not have been available at decision time leaks into your features or labels. Common sources: using the current bar's close to predict that same bar, normalizing features with full-sample mean and standard deviation, resampling across the train/test boundary, and back-filling or restating data so future values bleed backward. Avoid it by computing every transform on a strictly expanding or rolling window, fitting scalers only on training data, and aligning every feature to information available before the prediction timestamp.

Why can't I use standard k-fold cross-validation on trading data?

Standard shuffled k-fold assumes observations are independent and identically distributed. Financial observations are serially correlated and often share overlapping label windows, so a shuffled fold leaks information between train and test and produces wildly optimistic scores. Use purged and embargoed cross-validation, which removes training samples whose label windows overlap the test set and adds an embargo gap after it, or use walk-forward validation that always trains on the past and tests on the future.

Clean features, then the right model

Once your features are clean and your validation is honest, compare how models actually stack up on financial data with our free ML Model Comparison tool. Or book a diagnostic and we will pressure-test your feature pipeline for leakage.