0


COVID-19 Cases Prediction (Regression)

目录

Objectives:

  • Solve a regression problem with deep neural networks (DNN).
  • Understand basic DNN training tips.
  • Familiarize yourself with PyTorch.

Task Description

  • COVID-19 Cases Prediction
  • Source: Delphi group @ CMU- A daily survey since April 2020 via facebook.

Try to find out the data and use it to your training is forbidden

在这里插入图片描述

  • Given survey results in the past 5 days in a specific state in U.S., then predict the percentage of new tested positive cases in the 5th day.

在这里插入图片描述

Data

在这里插入图片描述

Conducted surveys via facebook (every day & every state) Survey: symptoms, COVID-19 testing, social distancing, mental health, demographics, economic effects, …

  • States (37, encoded to one-hot vectors)
  • COVID-like illness (4) - cli、ili …
  • Behavior Indicators (8) - wearing_mask、travel_outside_state …
  • Mental Health Indicators (3) - anxious、depressed …
  • Tested Positive Cases (1) - tested_positive (this is what we want to predict)

Data – One-hot Vector

  • One-hot vectors: Vectors with only one element equals to one while others are zero. Usually used to encode discrete values.

the details about One-hot Vector please read the blog:One-Hot

在这里插入图片描述

Evaluation Metric

  • Mean Squared Error (MSE)

在这里插入图片描述
在这里插入图片描述

Download data

If the Google Drive links below do not work, you can download data from Kaggle, and upload data manually to the workspace.

!gdown --id'1kLSW_-cW2Huj7bh84YTdimGBOJaODiOS'--output covid.train.csv
!gdown --id'1iiI5qROrAhZn-o4FPqsE97bMzDEFvIdg'--output covid.test.csv
/usr/local/lib/python3.7/dist-packages/gdown/cli.py:131: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
  category=FutureWarning,
Downloading...
From: https://drive.google.com/uc?id=1kLSW_-cW2Huj7bh84YTdimGBOJaODiOS
To: /content/covid.train.csv
100% 2.49M/2.49M [00:00<00:00, 238MB/s]
/usr/local/lib/python3.7/dist-packages/gdown/cli.py:131: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
  category=FutureWarning,
Downloading...
From: https://drive.google.com/uc?id=1iiI5qROrAhZn-o4FPqsE97bMzDEFvIdg
To: /content/covid.test.csv
100% 993k/993k [00:00<00:00, 137MB/s]

Import packages

# Numerical Operationsimport math
import numpy as np

# Reading/Writing Dataimport pandas as pd
import os
import csv

# For Progress Barfrom tqdm import tqdm

# Pytorchimport torch 
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split

# For plotting learning curvefrom torch.utils.tensorboard import SummaryWriter

Some Utility Functions

You do not need to modify this part.

defsame_seed(seed):'''Fixes random number generator seeds for reproducibility.'''
    torch.backends.cudnn.deterministic =True
    torch.backends.cudnn.benchmark =False
    np.random.seed(seed)
    torch.manual_seed(seed)if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)deftrain_valid_split(data_set, valid_ratio, seed):'''Split provided training data into training set and validation set'''
    valid_set_size =int(valid_ratio *len(data_set)) 
    train_set_size =len(data_set)- valid_set_size
    train_set, valid_set = random_split(data_set,[train_set_size, valid_set_size], generator=torch.Generator().manual_seed(seed))return np.array(train_set), np.array(valid_set)defpredict(test_loader, model, device):
    model.eval()# Set your model to evaluation mode.
    preds =[]for x in tqdm(test_loader):
        x = x.to(device)with torch.no_grad():                   
            pred = model(x)                     
            preds.append(pred.detach().cpu())   
    preds = torch.cat(preds, dim=0).numpy()return preds

Dataset

classCOVID19Dataset(Dataset):'''
    x: Features.
    y: Targets, if none, do prediction.
    '''def__init__(self, x, y=None):if y isNone:
            self.y = y
        else:
            self.y = torch.FloatTensor(y)
        self.x = torch.FloatTensor(x)def__getitem__(self, idx):if self.y isNone:return self.x[idx]else:return self.x[idx], self.y[idx]def__len__(self):returnlen(self.x)

Neural Network Model

Try out different model architectures by modifying the class below.

classMy_Model(nn.Module):def__init__(self, input_dim):super(My_Model, self).__init__()# TODO: modify model's structure, be aware of dimensions. 
        self.layers = nn.Sequential(
            nn.Linear(input_dim,16),
            nn.ReLU(),
            nn.Linear(16,8),
            nn.ReLU(),
            nn.Linear(8,1))defforward(self, x):
        x = self.layers(x)
        x = x.squeeze(1)# (B, 1) -> (B)return x

Feature Selection

Choose features you deem useful by modifying the function below.

defselect_feat(train_data, valid_data, test_data, select_all=True):'''Selects useful features to perform regression'''
    y_train, y_valid = train_data[:,-1], valid_data[:,-1]
    raw_x_train, raw_x_valid, raw_x_test = train_data[:,:-1], valid_data[:,:-1], test_data

    if select_all:
        feat_idx =list(range(raw_x_train.shape[1]))else:
        feat_idx =[0,1,2,3,4]# TODO: Select suitable feature columns.return raw_x_train[:,feat_idx], raw_x_valid[:,feat_idx], raw_x_test[:,feat_idx], y_train, y_valid

Training Loop

deftrainer(train_loader, valid_loader, model, config, device):

    criterion = nn.MSELoss(reduction='mean')# Define your loss function, do not modify this.# Define your optimization algorithm. # TODO: Please check https://pytorch.org/docs/stable/optim.html to get more available algorithms.# TODO: L2 regularization (optimizer(weight decay...) or implement by your self).
    optimizer = torch.optim.SGD(model.parameters(), lr=config['learning_rate'], momentum=0.9) 

    writer = SummaryWriter()# Writer of tensoboard.ifnot os.path.isdir('./models'):
        os.mkdir('./models')# Create directory of saving models.

    n_epochs, best_loss, step, early_stop_count = config['n_epochs'], math.inf,0,0for epoch inrange(n_epochs):
        model.train()# Set your model to train mode.
        loss_record =[]# tqdm is a package to visualize your training progress.
        train_pbar = tqdm(train_loader, position=0, leave=True)for x, y in train_pbar:
            optimizer.zero_grad()# Set gradient to zero.
            x, y = x.to(device), y.to(device)# Move your data to device. 
            pred = model(x)             
            loss = criterion(pred, y)
            loss.backward()# Compute gradient(backpropagation).
            optimizer.step()# Update parameters.
            step +=1
            loss_record.append(loss.detach().item())# Display current epoch number and loss on tqdm progress bar.
            train_pbar.set_description(f'Epoch [{epoch+1}/{n_epochs}]')
            train_pbar.set_postfix({'loss': loss.detach().item()})

        mean_train_loss =sum(loss_record)/len(loss_record)
        writer.add_scalar('Loss/train', mean_train_loss, step)

        model.eval()# Set your model to evaluation mode.
        loss_record =[]for x, y in valid_loader:
            x, y = x.to(device), y.to(device)with torch.no_grad():
                pred = model(x)
                loss = criterion(pred, y)

            loss_record.append(loss.item())
            
        mean_valid_loss =sum(loss_record)/len(loss_record)print(f'Epoch [{epoch+1}/{n_epochs}]: Train loss: {mean_train_loss:.4f}, Valid loss: {mean_valid_loss:.4f}')
        writer.add_scalar('Loss/valid', mean_valid_loss, step)if mean_valid_loss < best_loss:
            best_loss = mean_valid_loss
            torch.save(model.state_dict(), config['save_path'])# Save your best modelprint('Saving model with loss {:.3f}...'.format(best_loss))
            early_stop_count =0else: 
            early_stop_count +=1if early_stop_count >= config['early_stop']:print('\nModel is not improving, so we halt the training session.')return

Configurations

config

contains hyper-parameters for training and the path to save your model.

device ='cuda'if torch.cuda.is_available()else'cpu'
config ={'seed':5201314,# Your seed number, you can pick your lucky number. :)'select_all':True,# Whether to use all features.'valid_ratio':0.2,# validation_size = train_size * valid_ratio'n_epochs':3000,# Number of epochs.            'batch_size':256,'learning_rate':1e-5,'early_stop':400,# If model has not improved for this many consecutive epochs, stop training.     'save_path':'./models/model.ckpt'# Your model will be saved here.}

Dataloader

Read data from files and set up training, validation, and testing sets. You do not need to modify this part.

# Set seed for reproducibility
same_seed(config['seed'])# train_data size: 2699 x 118 (id + 37 states + 16 features x 5 days) # test_data size: 1078 x 117 (without last day's positive rate)
train_data, test_data = pd.read_csv('./covid.train.csv').values, pd.read_csv('./covid.test.csv').values
train_data, valid_data = train_valid_split(train_data, config['valid_ratio'], config['seed'])# Print out the data size.print(f"""train_data size: {train_data.shape} 
valid_data size: {valid_data.shape} 
test_data size: {test_data.shape}""")# Select features
x_train, x_valid, x_test, y_train, y_valid = select_feat(train_data, valid_data, test_data, config['select_all'])# Print out the number of features.print(f'number of features: {x_train.shape[1]}')

train_dataset, valid_dataset, test_dataset = COVID19Dataset(x_train, y_train), \
                                            COVID19Dataset(x_valid, y_valid), \
                                            COVID19Dataset(x_test)# Pytorch data loader loads pytorch dataset into batches.
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
valid_loader = DataLoader(valid_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], shuffle=False, pin_memory=True)

Start training!

it may take lots of time(depends on GPU you drew),be sure stay front your computer or get some scripts or devices to make your screen stay light.

在这里插入图片描述

model = My_Model(input_dim=x_train.shape[1]).to(device)# put your model and data on the same computation device.
trainer(train_loader, valid_loader, model, config, device)

and if you never modify any above code,it may train 1883 times

在这里插入图片描述

Plot learning curves with tensorboard (optional)

tensorboard

is a tool that allows you to visualize your training progress.

If this block does not display your learning curve, please wait for few minutes, and re-run this block. It might take some time to load your logging information.

%reload_ext tensorboard
%tensorboard --logdir=./runs/

you will get a picture like this.

在这里插入图片描述

Testing

The predictions of your model on testing set will be stored at

pred.csv

.

defsave_pred(preds,file):''' Save predictions to specified file '''withopen(file,'w')as fp:
        writer = csv.writer(fp)
        writer.writerow(['id','tested_positive'])for i, p inenumerate(preds):
            writer.writerow([i, p])

model = My_Model(input_dim=x_train.shape[1]).to(device)
model.load_state_dict(torch.load(config['save_path']))
preds = predict(test_loader, model, device) 
save_pred(preds,'pred.csv')
100%|██████████| 5/5 [00:00<00:00, 554.88it/s]

after these cells,you will get a pred.csv in the files
在这里插入图片描述

you can download this doc and submit it in kaggle

在这里插入图片描述

and the kaggle will give you a score,if you never modify,you may have a low score like this:

在这里插入图片描述

Some optimization

if you work above code,you will pass the Simple Baseline. About how can pass Medium Baseline & Strong Baseline, i will try to give the answer in the future(maybe in 09/2022), teaching assistant gave some hints:

  • 特征选择(Feature selection-what other features are useful?)
  • DNN结构:层数,维度,激活函数(DNN construction-layers, dimension, activation function)
  • 训练(training-mini batch, optimizer, leaning rate)
  • L2 regularization

Reference

all the code was from HUNG-Yi LEE(李宏毅),you can study the 《MACHINE LEARNING 2022 SPRING》 in https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php

above all was my study note, if you have any suggestions, welcome to comment.


本文转载自: https://blog.csdn.net/qq_52156445/article/details/125097008
版权归原作者 辰chen 所有, 如有侵权,请联系我们删除。

“COVID-19 Cases Prediction (Regression)”的评论:

还没有评论