Paper published at ECAI 2025: EMTSF full paper (PDF)
https://arxiv.org/pdf/2510.23396
The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Series Forecasting (TSF), where it has shown superior performance. However, an influential recent paper questioned the effectiveness of Transformers by demonstrating that a simple single-layer linear model outperforms Transformer-based models. This claim was in turn challenged by an improved Transformer-based model, PatchTST. More recently, TimeLLM demonstrated even better results by reprogramming, i.e., repurposing, a Large Language Model (LLM) for the TSF domain. A follow-up paper then challenged this in turn by showing that removing the LLM component, or replacing it with a basic attention layer, in fact yields better performance.
One of the challenges in forecasting is that TSF data favors the more recent past and is sometimes subject to unpredictable events. Building on these recent insights, we propose a Mixture of Experts (MoE) framework. Our method combines state-of-the-art (SOTA) models including xLSTM, enhanced Linear models, PatchTST, and minGRU, among others. This set of complementary and diverse TSF models is integrated into a Transformer-based MoE architecture. Our results on standard TSF benchmarks surpass all current TSF models, including those based on recent MoE frameworks.
- Mixture of Experts Architecture: Combines multiple SOTA models (xLSTM, minGRU, PatchTST, Enhanced Linear) for superior forecasting
- Advanced Gating Mechanism: Transformer-based attention layer for intelligent expert selection
- Flexible Configuration: Support for multiple forecasting horizons (96, 192, 336, 720)
- Comprehensive Dataset Support: Works with 14+ standard benchmarks (ETT, Weather, Electricity, Traffic, etc.)
- Reversible Instance Normalization (RevIN): Built-in support for improved generalization (see the sketch after this list)
- Distributed Training: Multi-GPU support for efficient training
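RevIN normalizes each input window with its own mean and standard deviation before it reaches the model and restores those statistics on the forecast. The snippet below is a minimal sketch of that idea, assuming the standard RevIN formulation; the repository's own implementation (StandardNorm.py) may differ in its details.

```python
# Minimal sketch of the RevIN idea: per-instance normalization on the way in,
# denormalization on the way out. Illustrative only; not the code in StandardNorm.py.
import torch
import torch.nn as nn

class RevIN(nn.Module):
    def __init__(self, num_features: int, eps: float = 1e-5, affine: bool = True):
        super().__init__()
        self.eps = eps
        self.affine = affine
        if affine:
            self.weight = nn.Parameter(torch.ones(num_features))
            self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor, mode: str) -> torch.Tensor:
        # x: [batch, sequence_length, num_features]
        if mode == "norm":
            self.mean = x.mean(dim=1, keepdim=True).detach()
            self.std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + self.eps).detach()
            x = (x - self.mean) / self.std
            return x * self.weight + self.bias if self.affine else x
        if mode == "denorm":
            if self.affine:
                x = (x - self.bias) / (self.weight + self.eps)
            return x * self.std + self.mean
        raise ValueError("mode must be 'norm' or 'denorm'")
```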
The framework supports the following standard TSF benchmarks:
- ETT (Electricity Transformer Temperature): ettm1, ettm2, etth1, etth2
- Weather: Weather forecasting data
- Electricity: Electricity consumption data
- Traffic: Road occupancy rates
- Illness: Illness cases data
- PEMS: Traffic datasets (PEMS03, PEMS04, PEMS07, PEMS08)
- Python 3.8 or higher
- PyTorch 2.0 or higher
- CUDA (optional, for GPU support)
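A quick way to check that an environment satisfies these requirements (a minimal check script, not part of the repository):

```python
# Prints the Python and PyTorch versions and whether a CUDA GPU is visible.
import sys
import torch

print("Python:", sys.version.split()[0])         # expect 3.8 or higher
print("PyTorch:", torch.__version__)             # expect 2.0 or higher
print("CUDA available:", torch.cuda.is_available())
```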
- Clone the repository:
git clone https://github.com/muslehal/EMTSF.git
cd EMTSF
- Install dependencies:
pip install torch torchvision torchaudio
pip install numpy pandas scikit-learn matplotlib
pip install einops timm
- Download datasets:
- Place your datasets in the appropriate directories as configured in datautils.py
- Update the root_path in datautils.py to match your local paths (illustrated below)
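The exact layout of datautils.py is repository-specific, but the update usually amounts to pointing root_path at the folder that holds the benchmark files. A hypothetical illustration (the path is an assumption, not a required location):

```python
# Hypothetical example only; adapt to the actual variable layout in datautils.py.
root_path = "/home/user/data/ETT/"   # local folder containing the benchmark CSV files
```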
Before training the MoE model, you need to train individual expert models:
# Train model_a (Linear model)
python main.py --dset ettm1 --model_type model_a --context_points 512 --target_points 96 --n_epochs 100
# Train model_b (xLSTM model)
python main.py --dset ettm1 --model_type model_b --context_points 512 --target_points 96 --n_epochs 100
# Train model_c (minGRU model)
python main.py --dset ettm1 --model_type model_c --context_points 512 --target_points 96 --n_epochs 100
# Train model_d (PatchTST model)
python main.py --dset ettm1 --model_type model_d --context_points 512 --target_points 96 --n_epochs 100

After training all expert models:
python main.py --dset ettm1 --model_type EMTSF --context_points 512 --target_points 96 --n_epochs 50

For automated training across multiple forecasting horizons:
# Run training for multiple target points (192, 336, 720)
bash script.sh -d ettm1 -e 100
# With testing
bash script.sh -d ettm1 -e 100 --test

The EMTSF model architecture consists of:
- Expert Models:
  - Model A: PatchTST for patch-based attention
  - Model B: xLSTMTime model for long-term dependencies
  - Model C: minGRU for efficient sequence modeling
  - Model D: Enhanced Linear model with decomposition
- Gating Network: Transformer-based attention mechanism that learns to weight expert predictions
- Integration Layer: Combines expert outputs using learned gating weights
Input → [Expert A, Expert B, Expert C, Expert D] → Transformer Attention → Gating Weights → Weighted Combination → Output
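As a rough illustration of this flow, the sketch below combines four expert forecasts using attention-derived gating weights. All names (MoEGate, d_model, and so on) are assumptions made for illustration; this is not the actual implementation in models.py.

```python
# Illustrative sketch of attention-based gating over stacked expert forecasts.
import torch
import torch.nn as nn

class MoEGate(nn.Module):
    def __init__(self, horizon: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(horizon, d_model)   # embed each expert's forecast
        self.attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)        # one gating logit per expert

    def forward(self, expert_outputs: torch.Tensor) -> torch.Tensor:
        # expert_outputs: [batch, num_experts, horizon]
        h = self.attn(self.proj(expert_outputs))        # [batch, num_experts, d_model]
        weights = torch.softmax(self.score(h), dim=1)   # [batch, num_experts, 1]
        return (weights * expert_outputs).sum(dim=1)    # [batch, horizon]

# Example: combine four expert forecasts for a 96-step horizon.
experts = torch.randn(8, 4, 96)             # e.g. outputs of the four trained experts
forecast = MoEGate(horizon=96)(experts)     # shape [8, 96]
```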
Our EMTSF model achieves state-of-the-art performance across multiple benchmarks, outperforming:
- Traditional LSTM/GRU models
- Transformer-based models (Autoformer, FEDformer, etc.)
- Recent MoE frameworks
- LLM-based approaches (TimeLLM)
EMTSF/
├── main.py             # Main training script
├── models.py           # Model architectures (EMTSF, model_a-d)
├── datautils.py        # Dataset loading utilities
├── lr_scheduler.py     # Learning rate scheduling
├── StandardNorm.py     # Normalization utilities
├── script.sh           # Automated training script
├── src/
│   ├── learner.py      # Training loop implementation
│   ├── data/           # Data loading modules
│   ├── models/         # Additional model components
│   └── callback/       # Training callbacks
├── xlstm1/             # xLSTM implementation
├── minGRU_pytorch/     # minGRU implementation
└── layers/             # Custom layer implementations
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code in your research, please cite:
@article{emtsf2025,
title={EMTSF: Extraordinary Mixture of SOTA Models for Time Series Forecasting},
author={Musleh Alharthi and Kaleel Mahmood and Sarosh Patel and Ausif Mahmood},
journal={https://ebooks.iospress.nl/volumearticle/76052},
year={2025}
}

This project builds upon several excellent works:
Contact: muslehneyash@gmail.com
For questions and feedback, please open an issue on GitHub.
Note: Make sure to update dataset paths in datautils.py before running experiments.