A benchmark for in-the-wild distribution shift over time

Get Started

The Wild-Time package is an open-source Python package that provides a simple, standardized interface for all datasets in the benchmark. It contains:

  • Data loaders that automatically handle data downloading, processing, and splitting, and
  • Dataset evaluators that standardize model evaluation for each dataset.


To use our code, you first need to install your own version of PyTorch, with version > 1.7.1.

Then, we recommend using pip to install Wild-Time by running pip install wildtime.
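The version requirement above can be checked programmatically before installing Wild-Time. The sketch below is only an illustration: the helper names parse_version and meets_minimum are hypothetical and not part of Wild-Time or PyTorch. It compares plain version strings, so it runs even when PyTorch is not installed.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a version string such as '1.13.1+cu117' into (1, 13, 1)."""
    # Drop any local build suffix (e.g. '+cu117') before splitting.
    core = version.split("+")[0]
    parts = []
    for piece in core.split(".")[:3]:
        # Keep only the leading digits of each dot-separated component.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as '2.0' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version: str, minimum: str = "1.7.1") -> bool:
    """Return True if `version` is strictly greater than `minimum`."""
    return parse_version(version) > parse_version(minimum)
```

In practice you would pass the installed version string, for example meets_minimum(torch.__version__) after importing torch.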

Using the Wild-Time package

Follow the steps below to use the Wild-Time package.

Import dependencies

To load the Wild-Time data, first import the configs. They hold the parameters that select the dataset to load, and the individual parameters are described in the Baseline Running section.

import argparse
from configures import configures
configs = argparse.Namespace(**configures)

Load Wild-Time data

Then, you can use the following code to load the dataset:

from WildTime import dataloader    
data = dataloader.getdata(configs)

Baseline Running

To train a baseline on a Wild-Time dataset and evaluate under Eval-Fix (default evaluation), use the code:

from WildTime import baseline_trainer    

Specify parameters in the config as follows:

  • Specify the dataset with 'dataset' in configs.
    • [arxiv, drug, fmow, huffpost, mimic, yearbook]
    • For MIMIC, specify one of two prediction tasks (mortality and readmission) using 'prediction_type': 'mortality' or 'prediction_type': 'readmission'.
  • Specify the baseline with 'method'.
  • To run Eval-Fix, set 'offline': True.
    • Specify the ID/OOD split time step with 'split_time'.
  • To run Eval-Stream, set 'eval_next_timesteps'.
  • Set the number of training iterations with 'train_update_iters'.
  • [Optional] If using a data directory or checkpoint directory other than './Data' and './checkpoints', specify their paths with '--data_dir' and '--log_dir'.
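The parameter list above can be turned into a small config builder. This is a minimal sketch under stated assumptions: make_configs is a hypothetical helper, not part of the Wild-Time package, and only the key names and dataset names come from the list above. The method name, split time, and iteration count in the usage note are placeholder values, and the real package may expect additional keys.

```python
import argparse

# Dataset names as listed above.
DATASETS = ["arxiv", "drug", "fmow", "huffpost", "mimic", "yearbook"]


def make_configs(dataset, method, split_time, train_update_iters,
                 prediction_type=None):
    """Build an Eval-Fix config namespace (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    # MIMIC needs one of the two prediction tasks.
    if dataset == "mimic" and prediction_type not in ("mortality", "readmission"):
        raise ValueError("mimic requires prediction_type "
                         "'mortality' or 'readmission'")
    params = {
        "dataset": dataset,
        "method": method,
        "offline": True,               # Eval-Fix, per the list above
        "split_time": split_time,      # ID/OOD split time step
        "train_update_iters": train_update_iters,
    }
    if prediction_type is not None:
        params["prediction_type"] = prediction_type
    return argparse.Namespace(**params)
```

For example, make_configs("mimic", "erm", 2013, 3000, prediction_type="mortality") returns a namespace whose attributes mirror the keys above; "erm", 2013, and 3000 are illustrative values only.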


Distribution shift occurs when the test distribution differs from the training distribution, and it can considerably degrade the performance of machine learning models deployed in the real world. Temporal shifts, i.e., distribution shifts arising from the passage of time, often occur gradually and have the additional structure of timestamp metadata. By leveraging timestamp metadata, models can potentially learn from trends in past distribution shifts and extrapolate into the future. While recent works have studied distribution shifts, temporal shifts remain underexplored. To address this gap, we curate Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. We use two evaluation strategies: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream).
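The two evaluation strategies can be pictured as two ways of partitioning an ordered list of time steps. The sketch below is an illustration, not the benchmark's implementation: both function names are hypothetical, and two conventions are assumptions (that the split time step itself belongs to the in-distribution side, and that the stream setting evaluates on a fixed number of following time steps). Please refer to the paper for the exact protocol.

```python
def eval_fix_split(timesteps, split_time):
    """Split time steps into in-distribution (ID) and out-of-distribution (OOD).

    Assumption: time steps up to and including `split_time` are ID.
    """
    ordered = sorted(timesteps)
    in_dist = [t for t in ordered if t <= split_time]
    out_dist = [t for t in ordered if t > split_time]
    return in_dist, out_dist


def eval_stream_windows(timesteps, eval_next_timesteps):
    """Map each time step to the future time steps it is evaluated on.

    Assumption: after training up to time step t, the model is evaluated
    on the next `eval_next_timesteps` time steps (fewer near the end).
    """
    ordered = sorted(timesteps)
    windows = {}
    for i, t in enumerate(ordered[:-1]):
        windows[t] = ordered[i + 1 : i + 1 + eval_next_timesteps]
    return windows
```

With time steps 1930 to 1933 and a split at 1931, eval_fix_split yields two ID and two OOD time steps, while eval_stream_windows with a window of 2 pairs 1930 with 1931 and 1932, and so on.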

We briefly introduce the datasets below. Please refer to our paper for more detailed dataset descriptions.

  • Yearbook: Social norms, fashion styles, and population demographics change over time. This is captured in the Yearbook dataset, which consists of 37,921 frontal-facing American high school yearbook photos. We exclude portraits from 1905–1929 due to the limited number of examples in these years, resulting in 33,431 examples from 1930–2013. Each photo is a 32×32×1 grey-scale image associated with a binary label y, which represents the student’s gender.
  • FMoW: Machine learning models can be used to analyze satellite imagery and aid humanitarian and policy efforts by monitoring croplands and predicting crop yield and poverty levels. Due to human activity, satellite imagery changes over time, requiring models that are robust to temporal distribution shifts. We study this problem on the Functional Map of the World (FMoW) dataset, adapted from the WILDS benchmark. Given a satellite image, the task is to predict the type of land usage. The FMoW dataset consists of 141,696 examples from 2002–2017.
  • MIMIC-IV: Many machine learning healthcare applications have emerged in the last decade. However, changes in healthcare over time, such as the emergence of new treatments and changes in patient demographics, are an obstacle to deploying machine learning-based clinical decision support systems. We study this problem on MIMIC-IV, one of the largest public healthcare datasets, which comprises abundant medical records of over 40,000 patients. In MIMIC-IV, we treat each admission as one record, resulting in 216,487 healthcare records from 2008–2020. Specifically, we consider two classification tasks:
    • MIMIC-Readmission aims to predict the risk of being readmitted to the hospital within 15 days.
    • MIMIC-Mortality aims to predict in-hospital mortality for each patient.
  • Huffpost: For language models that deal with time-correlated information, temporal distribution shifts degrade performance on downstream tasks such as Twitter hashtag classification or question answering. Performance drops across time reflect changes in the style or content of current events. We study this temporal shift on the Huffpost dataset. The task is to identify tags of news articles from their headlines. Each input feature x is a news headline, and the output y is the news category.
  • arXiv: Due to the evolution of research fields, the style of arXiv pre-prints also changes over time, reflected by the change in article categories. For example, “neural network attack” was originally a popular keyword in the security community, but gradually became more prevalent in the machine learning community. We study this temporal shift in the arXiv dataset [18], where the task is to predict the primary category of arXiv pre-prints given the paper title as input. The entire dataset includes 172 pre-print categories from 2007–2022.



To submit, please read our submission guidelines (coming this week).

Higher numbers are better for all metrics. In parentheses, we show corrected sample standard deviations across random replicates.

An asterisk next to a value indicates that the entry deviates from the official submission guidelines, for example because it uses a non-default model or additional pre-training data. The deviations are described in the notes in the dataset-specific leaderboards.

This overall leaderboard shows out-of-distribution test performance across all datasets. For each dataset, we bold the best-performing algorithm that conforms to official submission guidelines.

The leaderboard is only used for comparing the results under the Eval-Fix setting.
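The numbers reported in the leaderboard can be reproduced from per-replicate results along the following lines. This is a sketch under assumptions: summarize is a hypothetical helper, and it assumes that "Avg." is the mean and "Worst" the minimum over out-of-distribution time steps within one replicate. The corrected sample standard deviation (dividing by n − 1) is what Python's statistics.stdev computes, and it needs at least two replicates.

```python
from statistics import mean, stdev


def summarize(per_replicate_scores):
    """Aggregate scores into (mean, corrected sample std) pairs.

    `per_replicate_scores` is a list with one entry per random replicate;
    each entry is a list of scores, one per OOD time step.
    """
    # Assumed definitions: average and worst score over OOD time steps.
    avg_per_replicate = [mean(scores) for scores in per_replicate_scores]
    worst_per_replicate = [min(scores) for scores in per_replicate_scores]
    return {
        # statistics.stdev uses the n - 1 denominator (corrected sample std).
        "avg": (mean(avg_per_replicate), stdev(avg_per_replicate)),
        "worst": (mean(worst_per_replicate), stdev(worst_per_replicate)),
    }
```

For instance, summarize([[80.0, 60.0], [90.0, 70.0]]) aggregates two replicates with two OOD time steps each; these toy numbers are not taken from the leaderboard.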

Eval-Fix Evaluation

| Method | Yearbook Acc (Avg.) | Yearbook Acc (Worst) | FMoW Acc (Avg.) | FMoW Acc (Worst) | MIMIC-Readmission Acc (Avg.) | MIMIC-Readmission Acc (Worst) | MIMIC-Mortality AUC (Avg.) | MIMIC-Mortality AUC (Worst) | Huffpost Acc (Avg.) | Huffpost Acc (Worst) | arXiv Acc (Avg.) | arXiv Acc (Worst) | Contact | References |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fine-tuning | 81.98 (1.52) | 69.62 (3.38) | 45.77 (0.53) | 43.21 (0.85) | 62.19 (3.71) | 59.57 (4.43) | 63.37 (1.91) | 52.45 (2.64) | 69.59 (0.10) | 68.91 (0.49) | 50.31 (0.39) | 48.19 (0.41) | Wild-Time | Paper/Code |
| EWC | 80.07 (0.22) | 66.61 (1.95) | 45.60 (0.28) | 43.13 (0.50) | 66.40 (0.09) | 64.69 (0.01) | 62.07 (1.52) | 50.41 (2.03) | 69.42 (1.00) | 68.61 (0.98) | 50.40 (0.11) | 48.18 (0.18) | Wild-Time | Paper/Code |
| SI | 78.70 (3.78) | 65.18 (2.44) | 44.87 (0.73) | 42.97 (1.15) | 62.60 (3.27) | 61.13 (3.39) | 61.76 (0.58) | 50.19 (1.25) | 70.46 (0.27) | 69.05 (0.52) | 50.21 (0.40) | 48.07 (0.48) | Wild-Time | Paper/Code |
| A-GEM | 81.04 (1.40) | 67.07 (2.23) | 45.21 (0.20) | 42.49 (0.70) | 63.95 (0.14) | 62.66 (1.23) | 61.78 (0.27) | 50.40 (0.51) | 70.22 (0.50) | 69.15 (0.88) | 50.30 (0.37) | 48.14 (0.40) | Wild-Time | Paper/Code |
| ERM | 79.50 (6.23) | 63.09 (5.15) | 51.99 (0.37) | 48.79 (0.49) | 61.33 (3.45) | 59.46 (3.66) | 72.89 (8.96) | 65.80 (12.3) | 70.42 (1.15) | 68.71 (1.36) | 45.94 (0.97) | 44.09 (1.05) | Wild-Time | Paper/Code |
| GroupDRO | 77.06 (1.67) | 60.96 (1.83) | 37.61 (1.16) | 34.41 (1.39) | 56.12 (4.35) | 54.69 (4.36) | 76.88 (4.74) | 71.40 (6.84) | 69.53 (0.54) | 67.68 (0.78) | 39.06 (0.54) | 37.18 (0.52) | Wild-Time | Paper/Code |
| mixup | 76.72 (1.35) | 58.70 (1.36) | 49.82 (0.19) | 45.58 (0.31) | 58.82 (4.03) | 57.30 (4.77) | 73.69 (4.74) | 71.40 (6.84) | 71.18 (1.17) | 68.89 (0.38) | 45.12 (0.71) | 43.23 (0.75) | Wild-Time | Paper/Code |
| LISA | 83.65 (4.61) | 68.53 (5.79) | 48.76 (0.48) | 45.41 (0.21) | 56.90 (0.95) | 54.01 (0.92) | 76.34 (8.94) | 71.14 (12.4) | 69.99 (0.60) | 68.04 (0.75) | 47.82 (0.47) | 45.91 (0.42) | Wild-Time | Paper/Code |
| CORAL | 77.53 (2.15) | 59.34 (1.46) | 47.34 (0.09) | 44.04 (0.46) | 57.31 (4.45) | 54.69 (4.36) | 77.98 (2.57) | 64.81 (10.8) | 70.05 (0.63) | 68.39 (0.88) | 42.32 (0.60) | 40.31 (0.61) | Wild-Time | Paper/Code |
| IRM | 80.46 (3.53) | 64.42 (4.38) | 38.73 (1.67) | 34.93 (1.88) | 56.53 (3.36) | 52.67 (5.17) | 76.17 (6.32) | 70.64 (8.99) | 70.21 (1.05) | 68.71 (1.13) | 35.75 (0.90) | 33.91 (1.09) | Wild-Time | Paper/Code |
| SimCLR | 78.59 (2.72) | 60.15 (3.48) | 42.91 (0.40) | 39.54 (0.67) | - | - | - | - | - | - | - | - | Wild-Time | Paper/Code |
| SwAV | 78.38 (1.86) | 60.73 (1.08) | 49.53 (0.27) | 46.31 (0.58) | - | - | - | - | - | - | - | - | Wild-Time | Paper/Code |
| SWA | 84.25 (3.06) | 67.90 (4.34) | 50.59 (0.46) | 47.15 (0.92) | 59.88 (5.48) | 57.68 (6.36) | 69.53 (1.60) | 60.83 (2.73) | 70.98 (0.05) | 69.52 (0.10) | 44.36 (0.77) | 42.54 (0.68) | Wild-Time | Paper/Code |



If you have any questions or feedback, or if you would like to contribute a dataset, please contact us at wildtime@googlegroups.com.

For issues with the code, please file a GitHub issue.

Wild-Time is under active development. To get updates, please subscribe to our mailing list!