
XGBoost Beats LSTM and Transformers on Most Financial Time Series

Deep learning gets picked because it looks impressive, not because it works better on financial data. XGBoost and LightGBM consistently outperform LSTM and Transformer models on typical trading datasets. Here is why, and when deep learning is actually the right call.

D&T Systems · 12 min read

Why LSTM gets picked (and why that reflex is usually wrong)

AlphaFold used deep learning. GPT-4 uses Transformers. Every ML paper on arXiv seems to have an attention mechanism. So when a trader or quant developer starts thinking about prediction models, the reflex is natural: reach for LSTM or a Transformer architecture.

Those models were built for problems with very different properties. Long sequences. Rich structure. Hundreds of millions of training examples. Financial time series has almost none of that.

We see this pattern regularly. A client comes in with an LSTM that performed beautifully in-sample and degraded to coin-flip accuracy within the first 60 days of live data. The architecture worked fine in the abstract; their specific data shape made it the wrong call. Gradient boosting would have generalized better with a fraction of the development time.

Financial time series is a tabular problem in disguise

Consider what your features actually look like by the time you feed them into a model. RSI(14) is already a 14-bar aggregation. ATR(20) is a 20-bar aggregation. Funding rate, volume ratio, spread, open interest change: each of these compresses a time window into a single number. You have already done the temporal work.

What you feed into the model is a row of numbers: one value per feature, one row per bar. That is a tabular dataset. LSTM's ability to model sequential dependencies is not helping you here, because the sequence information is already baked into the features. You are paying the parameter cost without getting the benefit.
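The point is easy to see in code. A minimal sketch, assuming a pandas DataFrame of OHLCV bars (the price series and feature names here are synthetic stand-ins, and the RSI is a simplified version): each engineered feature collapses a window of bars into one number per bar, leaving a plain tabular matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical OHLCV frame; in practice this comes from your data feed.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))),
    "volume": rng.integers(1_000, 5_000, 300).astype(float),
})

# Each feature compresses a time window into a single number per bar.
delta = df["close"].diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
df["rsi_14"] = 100 - 100 / (1 + gain / loss)                # simplified RSI(14)
df["vol_ratio_20"] = df["volume"] / df["volume"].rolling(20).mean()
df["ret_5"] = df["close"].pct_change(5)

X = df[["rsi_14", "vol_ratio_20", "ret_5"]].dropna()
print(X.shape)  # one row per bar, one column per feature: a tabular dataset
```

Any model that consumes `X` sees rows of numbers, not sequences; the 14- and 20-bar windows are already gone by the time the model sees the data.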

Financial time series also has three structural properties that make it hostile to deep learning:

Three reasons financial data fights deep learning

01

Non-stationarity

The distribution shifts constantly. What predicted direction during a trending regime in 2021 may predict the opposite in a mean-reverting regime in 2023. A model trained on historical data learns patterns that may no longer exist.

02

Low signal-to-noise ratio

A typical daily price series is roughly 80% noise. The predictable component is small and unstable. Deep models with millions of parameters are very good at memorizing noise. That is the opposite of what you want.

03

Data scarcity

3 years of daily bars gives you roughly 750 samples. LSTM requires input sequences. With sequence length 20, a sliding window gives you approximately 730 heavily overlapping training sequences (split non-overlapping, it is closer to 37). That is the entire dataset. Deep learning was designed for millions of examples, not hundreds.

The data math: 730 samples vs. 33,000 parameters

This is where the mismatch becomes concrete. Take a conservative LSTM setup: one layer, 64 hidden units. The parameter count for a single LSTM layer is approximately 4 × (input_size + hidden_size) × hidden_size. With 20 input features and 64 hidden units, that is roughly 4 × (20 + 64) × 64 = 21,504 parameters in the recurrent layer alone (21,760 once you include the bias terms). Add a dense output layer: another 65. Stack a second, smaller recurrent layer, and you clear 33,000 parameters easily.

You are asking 730 training sequences to constrain 33,000 parameters.

The model will memorize. Dropout and early stopping reduce this. They do not fix the fundamental ratio. You are compensating for a structural mismatch, not solving it.
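A back-of-envelope check of the arithmetic above, using the standard per-layer LSTM parameter formula (a sketch; exact counts vary slightly with a framework's bias conventions):

```python
def lstm_param_count(input_size, hidden_size):
    # 4 gates (input, forget, cell, output), each with an input-to-hidden
    # matrix, a hidden-to-hidden matrix, and a bias vector.
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

# Single 64-unit layer plus a dense output head (weights + bias).
one_layer = lstm_param_count(20, 64) + (64 + 1)
# Stack a second, smaller recurrent layer on top of the first.
two_layer = lstm_param_count(20, 64) + lstm_param_count(64, 32) + (32 + 1)
print(one_layer, two_layer)  # 21,825 and 34,209
```

Either way, the ratio of training sequences to parameters stays far below one sample per parameter.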

Transformers are worse. The attention mechanism was designed for 512-token sentences where token position carries meaning. A 20-bar window of OHLCV data has no equivalent positional structure. Multi-head attention on a 20-bar price window has no real temporal relationships to find. The attention mechanism just fits noise with a more expressive architecture.

LSTM on daily data

Training sequences: ~730
LSTM parameters (64 units): ~33,000
Samples per parameter: ~0.02
Training time: 2-10 min (GPU)
SHAP interpretability: none

Typical result

Strong in-sample accuracy. OOS performance degrades to near-random within 60 days. Model learned regime-specific noise.

XGBoost on daily data

Training rows: ~750
Typical leaf parameters: ~500–2,000
Effective regularization: high
Training time: <30 sec (CPU)
SHAP interpretability: full

Typical result

In-sample and OOS accuracy within 2-4 percentage points on directional tasks. Retrain weekly. Debug with SHAP when it stops working.

Feature importance you can actually use

When your XGBoost model starts degrading, you have something to work with. Run SHAP on the last 90 days of predictions. You will see exactly which features drove each call and by how much. If funding rate used to be the top feature and now ATR is dominating, that is a regime shift signal. You can act on that.

When your LSTM degrades, you have a loss curve. The model is a 33,000-parameter function that you cannot interrogate. You retrain it and hope the new regime is similar enough to your training window. Retraining and hoping is a guess, not a workflow.

A practical SHAP workflow for a trading model looks like this:

import shap

# After training
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_oos)

# Which features matter most on OOS data
shap.summary_plot(shap_values, X_oos, feature_names=feature_cols)

# Per-prediction breakdown for a specific bar
shap.force_plot(explainer.expected_value, shap_values[i], X_oos.iloc[i])

This output tells you, for bar i: "the model predicted long because funding rate was -0.03% (pushing +0.12 toward long) and realized vol was below its 20-day average (pushing +0.09 toward long)." You can validate that reasoning against what actually happened. You cannot do this with LSTM.
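One way to operationalize the regime-shift check described above is to compare mean absolute SHAP contributions between an older window and the most recent one. The feature names and SHAP matrices below are synthetic stand-ins, just to show the mechanics:

```python
import numpy as np

def top_features(shap_values, feature_names, k=3):
    # Rank features by mean absolute SHAP contribution over a window.
    mean_abs = np.abs(shap_values).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]
    return [feature_names[i] for i in order[:k]]

features = ["funding_rate", "atr_20", "rsi_14", "vol_ratio"]
rng = np.random.default_rng(1)
# Hypothetical SHAP matrices for an older and a recent 90-day window,
# scaled so different features dominate in each.
old = rng.normal(0, 1, (90, 4)) * np.array([1.0, 0.3, 0.2, 0.1])
new = rng.normal(0, 1, (90, 4)) * np.array([0.2, 1.0, 0.3, 0.1])

print(top_features(old, features))  # funding_rate dominates the older window
print(top_features(new, features))  # atr_20 takes over: a regime-shift signal
```

In practice `old` and `new` would be slices of the `shap_values` array from the workflow above, split at the point where performance started degrading.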

What the benchmarks actually show

Academic papers comparing gradient boosting to deep learning on financial directional prediction consistently find gradient boosting within 1-3 percentage points of deep learning, with far less tuning and compute. On daily data, the gap is often zero or slightly in favor of trees.

When you see a paper claiming LSTM beats XGBoost on stock prediction by a wide margin, check four things:

01

Same features?

LSTM often gets raw price sequences while XGBoost gets engineered features. That is not a fair comparison. The feature engineering work is the real alpha source.

02

Future data leakage?

Normalizing across the full dataset, using future volatility in features, or look-ahead bias in labels inflates in-sample numbers. LSTM tends to overfit these artifacts more aggressively.

03

Walk-forward or single split?

A single train/test split on financial data is almost always misleading. Walk-forward validation with 5+ folds is the minimum for a credible comparison.

04

Daily or intraday?

Results on minute-level data with large sample counts do not transfer to daily trading models. Sample count changes the comparison entirely.

What makes XGBoost strong on financial features

Threshold splits on discontinuous features

Trees split on exact feature values. When funding rate crosses zero, that is a regime flip. XGBoost finds this boundary directly. LSTM has to infer it from sequence patterns.

Works on 500 to 2,000 rows

3 years of daily bars is about 750 samples. XGBoost regularizes well at this scale via learning rate, max depth, and subsampling. LSTM needs more data to generalize.

SHAP values out of the box

You can see exactly which features drove each prediction and by how much. With LSTM, the model is a black box. When the strategy starts losing, you have nothing to debug.

Retrains in under 30 seconds

You can retrain on every new week of data, or on each detected regime shift. No GPU required. A $10/month VPS handles it.

One-liner inference

A single model.predict(X) call returns probabilities with XGBoost's native Booster API (predict_proba in the sklearn wrapper). No tokenization, no sequence padding, no batching. Latency is microseconds on CPU.

When deep learning earns its place

Deep learning earns its place under specific conditions, and most traders are working with data that does not meet them. But there are cases where it is the correct choice.

Tick-level L2 orderbook data

100K+ rows per day, raw bid/ask queue features, genuine millisecond-level temporal structure. LSTM or a convolutional architecture earns its place here.

NLP alpha

Earnings call transcripts, analyst reports, news headlines. BERT or a fine-tuned LLM is the correct tool. Do not try to hand-engineer sentiment into tabular features.

Cross-asset correlation modeling

10+ years of minute data across 20+ correlated instruments. Transformer attention can capture cross-sequence dependencies that trees miss.

More than 500K samples with 10+ correlated input sequences

If you have the data volume and genuine sequential structure, try a shallow MLP or attention layer. But validate honestly on walk-forward windows.

The common thread: high sample count, raw or minimally processed features, and genuine sequential structure that has not already been compressed into tabular form. If all three are present, deep learning is worth the cost. Otherwise, start with trees.

What the workflow actually looks like

When we build ML-based predictive models for trading strategies, the process follows a consistent pattern. Complexity is added only when the simpler step reaches its ceiling.

1. Baseline with Ridge or Lasso

Linear models first. If RSI, volume ratio, and funding rate cannot predict direction at above-chance accuracy with a regularized linear model, no amount of deep learning will save you. The signal is not there. Find better features.
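A minimal version of this baseline, sketched with scikit-learn's RidgeClassifier on synthetic features (a real run would use your engineered features and walk-forward splits rather than a single time-ordered split; the planted signal here just gives the sketch something to find):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
n = 750                                    # ~3 years of daily bars
X = rng.normal(size=(n, 3))                # stand-ins for rsi, vol_ratio, funding
# Direction label with a weak, noisy planted signal in the third feature.
y = (X[:, 2] + rng.normal(0, 2.0, n) > 0).astype(int)

split = int(n * 0.8)                       # time-ordered split: never shuffle
clf = RidgeClassifier(alpha=1.0).fit(X[:split], y[:split])
acc = clf.score(X[split:], y[split:])
print(f"OOS directional accuracy: {acc:.2f}")
```

If a regularized linear model cannot beat chance here, adding parameters will not create signal that is not in the features.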

2. XGBoost or LightGBM with Optuna

Tune max_depth, learning_rate, n_estimators, subsample, and colsample_bytree. Use walk-forward cross-validation with at least 5 folds and 20% OOS minimum per fold. Track feature importance across folds.
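Optuna is the tool named above; as a dependency-light sketch of the same search pattern, here is the equivalent with scikit-learn's GradientBoostingClassifier and RandomizedSearchCV. Parameter names map onto the XGBoost ones, with max_features standing in for colsample_bytree; the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(750, 5))
y = (X[:, 0] + rng.normal(0, 2.0, 750) > 0).astype(int)

param_space = {
    "max_depth": [2, 3, 4, 5],             # shallow trees regularize hard
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": [100, 200, 300],
    "subsample": [0.6, 0.8, 1.0],          # row subsampling
    "max_features": [0.6, 0.8, 1.0],       # column subsampling analogue
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_space,
    n_iter=5,                              # kept small for the sketch
    cv=TimeSeriesSplit(n_splits=5),        # walk-forward folds, never shuffled
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The critical detail is the cv argument: TimeSeriesSplit keeps every validation fold strictly after its training window, which is what makes the tuning scores honest on time series.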

3. Walk-forward validation

Retrain on an expanding or rolling window. Measure OOS directional accuracy, Sharpe of signals, and hit rate stability across folds. Decay in later folds is a regime signal, not a model failure.

4. SHAP audit before deployment

Check which features drive predictions on the most recent OOS window. If the top features make no intuitive sense, something is wrong. Feature importance should be explainable in trading terms.

5. Deep learning only if XGBoost plateaus

If you have more than 50K samples and XGBoost accuracy has flattened over several Optuna trials, try a shallow MLP or a simple attention layer. Keep the same features. Compare on the same walk-forward windows. You will be surprised how rarely this step adds value.

Frequently asked questions

Why does XGBoost outperform LSTM on financial time series?

Financial time series with engineered features is essentially a tabular problem. XGBoost handles discontinuous thresholds well, trains in seconds on small datasets, and generalizes better when you have fewer than 2,000 samples. LSTM needs long sequences to learn temporal structure, but most financial datasets are too small for that to happen without memorizing noise.

When should I use LSTM or a Transformer for trading?

Deep learning earns its place when you have tick-level L2 orderbook data (100K+ rows per day), NLP alpha from earnings transcripts or news, or cross-asset models trained on 10+ years of minute data. With fewer than 50K samples and standard OHLCV features, start with gradient boosting.

How do I know if my LSTM is overfitting on financial data?

Check the gap between in-sample and out-of-sample directional accuracy on a walk-forward test. A gap larger than 5-8 percentage points on daily data usually means the model is fitting to noise. Also check: does OOS performance degrade monotonically as you move further into the future? If yes, the model learned regime-specific patterns that do not generalize.
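Both checks from this answer fit in a few lines (the threshold and the example accuracies are illustrative):

```python
def overfitting_flags(in_sample_acc, fold_accs, gap_threshold=0.07):
    # Check 1: gap between in-sample and mean OOS accuracy.
    oos = sum(fold_accs) / len(fold_accs)
    gap_too_large = (in_sample_acc - oos) > gap_threshold
    # Check 2: does OOS accuracy degrade monotonically across folds?
    monotonic_decay = all(a > b for a, b in zip(fold_accs, fold_accs[1:]))
    return gap_too_large, monotonic_decay

# Hypothetical walk-forward results for an overfit model:
print(overfitting_flags(0.71, [0.58, 0.55, 0.52, 0.50, 0.49]))  # (True, True)
```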

Do I need a GPU to run XGBoost for trading predictions?

No. XGBoost on CPU trains in under 30 seconds on datasets typical for daily or hourly trading strategies. Inference is a single function call. You can retrain daily on a $10/month cloud instance with no GPU.

Building an ML model for your strategy?

We scope the right architecture for your data shape, validate it on walk-forward windows, and build the retraining pipeline that keeps it current. Book a free 30-minute diagnostic to discuss your specific setup.