Two-dimensional attention-based multi-input LSTM for time series prediction

Eun Been Kim$^{a}$, Jung Hoon Park$^{a}$, Yung-Seop Lee$^{b}$, Changwon Lim$^{1,a}$

$^{a}$Department of Applied Statistics, Chung-Ang University, Korea;
$^{b}$Department of Statistics, Dongguk University, Korea
Correspondence to: $^{1}$Department of Applied Statistics, Chung-Ang University, 47, Heukseok-ro, Dongjak-Gu, Seoul 06974, Korea.
E-mail: clim@cau.ac.kr
Received August 19, 2020; Revised December 19, 2020; Accepted December 23, 2020.
Abstract
Time series prediction is an area of great interest to many people. Algorithms for time series prediction are widely used in many fields such as stock price, temperature, energy and weather forecasting; in addition, classical models as well as recurrent neural networks (RNNs) have been actively developed. After the attention mechanism was introduced to neural network models, many new models with improved performance were developed; models using attention twice have also recently been proposed, resulting in further performance improvements. In this paper, we consider time series prediction by introducing attention twice to an RNN model. The proposed model introduces H-attention and T-attention, applied to the output values and the time step information respectively, to select useful information. We conduct experiments on stock price, temperature and energy data and confirm that the proposed model outperforms existing models.
Keywords : recurrent neural network, correlation, attention, time series
1. Introduction

Time series prediction is a field that has attracted significant attention. Algorithms for time series prediction are widely used in many fields such as weather, electricity, and financial markets. They have been developed using well-known classical models such as the autoregressive moving average (McLeod and Li, 1983) as well as the recurrent neural network (RNN) (Rumelhart et al., 1986; Werbos, 1990; Elman, 1991) and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) to improve their performance. An RNN is a sequential model widely used for time series data. It uses a hidden state recurrently, so that information from prior inputs influences the current input and output. LSTM is a special RNN structure. Unlike a plain RNN, LSTM adds memory cells that determine whether information is important. Three gates (the input gate, forget gate and output gate) evaluate the input and decide whether to retain or forget it, which allows the model to capture the long-term dependency of time series data. In addition, the performance of time series prediction has improved since the nonlinear autoregressive exogenous (NARX) model (Chen et al., 2008) was proposed, which uses exogenous variables that can affect the target variable and not just the past values of the target variable.

The NARX model has been studied continuously; examples include an improved hybrid prediction model (Pham et al., 2010) and a new approach to identifying a new class of NARX models (Li et al., 2011). In particular, applying RNNs to the NARX model succeeded in increasing prediction accuracy. However, it was difficult to capture long-term dependency. To overcome this, methods using LSTM, gated recurrent unit algorithms (Cho et al., 2014a) and encoder-decoder networks (Cho et al., 2014b; Sutskever et al., 2014) were developed.

Research on introducing an attention mechanism has also been actively conducted, including studies using new RNN methods such as an attention-based encoder-decoder network (Liu and Lane, 2016). Research has also been conducted to introduce attention to the NARX model. Proposed algorithms include a hierarchical attention network that selects the hidden state of the related encoder using two attention mechanisms (Yang et al., 2016), a model that unifies the spatiotemporal feature extraction of exogenous variables and the temporal dynamics modeling of the target variable (Tao et al., 2018), and a two-stage two-phase RNN (Liu et al., 2019). All these algorithms show improved performance. A dual-stage attention-based RNN (DA-RNN) model (Qin et al., 2017) has also been proposed, which has the advantage of selecting the exogenous variables with a great influence on the prediction of the target variable and of properly capturing the long-term dependency of time series. In addition, a new attention-based multi-input LSTM (MI-LSTM) (Li et al., 2018), which makes effective use of factors with low correlation, was proposed, followed by a two-dimensional attention-based LSTM (2D-LSTM) (Yu and Kim, 2019). 2D-LSTM is a model in which the importance of exogenous variables and the importance of time are separately calculated and then combined.

In this paper, we propose a two-dimensional attention-based multi-input LSTM (2DA-MILSTM) model that includes the advantages of DA-RNN, 2D-LSTM and MI-LSTM. This model calculates the correlation between the target variable and the exogenous variables, divides the exogenous variables accordingly and then inputs them to MI-LSTM. It also applies two separate attention layers simultaneously using the output value of MI-LSTM and the hidden state of the previous layer. Two kinds of weights are created by applying the attentions to each hidden state and time step. By integrating these two weights, the predicted value can be calculated, and the importance of the hidden state and the time step can be considered to improve the long-term dependency problem and prediction performance. We compare the performance of the proposed model with existing models through prediction experiments on stock price, room temperature and energy data.

The rest of this paper is organized as follows. In Section 2, we review the existing models, and in Section 3 we explain the components and configuration of the proposed model in detail. We describe the datasets used in the experiments and present the results in Section 4. Lastly, Section 5 concludes the paper.

2. Existing methods

### 2.1. Recurrent neural network and long short-term memory

Among neural networks, RNN and LSTM are the most popular models for time series, including language and speech data (Huang et al., 2015). Figure 1 shows that the RNN model consists of an input layer x, a hidden layer h, and an output layer y. The input layer takes the features at time t, and the output layer represents a probability distribution over labels at time t. Because the recurrent layer stores history information, the RNN connects the previous hidden state to the current hidden state.

Then the values in the hidden and output layers are computed by $h(t) = f(Ux(t) + Wh(t-1))$ and $y(t) = g(Vh(t))$, where $U$, $W$, and $V$ are the connection weights computed in training, and $f(z)$ and $g(z)$ are the sigmoid and softmax activation functions. Softmax (Jang et al., 2016) is used as the activation function for classification tasks. It produces a normalized output that forms a probability distribution. It is defined as $\sigma(z_i)=\exp(z_i)/\sum_{j=1}^{K}\exp(z_j)$: it takes as input a vector $z$ over $K$ classes and normalizes it by dividing each exponentiated component by the sum of all exponentiated components.
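As a minimal sketch of the recurrence above, the following NumPy code evaluates the hidden and output layers over a short input sequence. The layer sizes, the random weights and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, h_prev, U, W, V):
    """One step of h(t) = f(U x(t) + W h(t-1)), y(t) = g(V h(t))."""
    h_t = sigmoid(U @ x_t + W @ h_prev)
    y_t = softmax(V @ h_t)
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2          # illustrative sizes (assumption)
U = rng.normal(size=(n_hidden, n_in))
W = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_out, n_hidden))

h = np.zeros(n_hidden)                   # initial hidden state
for x_t in rng.normal(size=(5, n_in)):   # five time steps of toy input
    h, y = rnn_step(x_t, h, U, W, V)
```

The output `y` is a probability vector over the `n_out` labels, and the hidden state `h` is carried from one step to the next.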

LSTM is similar to RNN, but it captures long-term dependencies of time series data better by using three gates (input, forget, and output) in the LSTM cell. In Figure 2, σ represents the logistic sigmoid function, and i, f, o, and c represent the input gate, forget gate, output gate and cell vectors, all of the same size as the hidden vector h.

Then the procedure of LSTM cell using matrix notation is:

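$\begin{aligned} i_t&=\sigma(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i),\\ f_t&=\sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f),\\ c_t&=f_tc_{t-1}+i_t\tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c),\\ o_t&=\sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_t+b_o),\\ h_t&=o_t\tanh(c_t). \end{aligned}$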

For example, $W_{xo}$ represents the input-output gate matrix and $W_{hi}$ represents the hidden-input gate matrix. These calculations determine which information to focus on in order to capture short- and long-term dependencies in time series data, and the forget gate determines how much information should be forgotten.
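The LSTM cell equations above can be sketched in NumPy as follows. This is only an illustration: the parameter dictionary, the helper `init_params`, the sizes and the random initialization are assumptions, and the cell-to-gate terms ($W_{ci}$, $W_{cf}$, $W_{co}$) are treated as full matrices exactly as the equations are written.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the five equations in the text."""
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    # The output gate looks at the updated cell state c_t.
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(n_in, n_hidden, rng):
    """Hypothetical helper: small random weights and zero biases."""
    p = {}
    for g in "ifco":
        p["Wx" + g] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p["Wh" + g] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p["b" + g] = np.zeros(n_hidden)
    for g in "ifo":  # cell-to-gate weights exist only for the three gates
        p["Wc" + g] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    return p

rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(6, 3)):      # six time steps of toy input
    h, c = lstm_step(x_t, h, c, params)
```

Because $h_t$ is the product of a sigmoid output and a tanh output, every component of `h` stays strictly between -1 and 1.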

### 2.2. Multi-input long short-term memory model

The target variable is defined as $Y = (y_1, y_2, \ldots, y_T) \in \mathbb{R}^{T}$, and the exogenous variables related to the prediction of the target variable are defined as $X = (X_1, X_2, \ldots, X_T) \in \mathbb{R}^{T\times D}$, where $T$ is the time window size and $D$ is the number of exogenous variables. To express the exogenous variables in detail, two notations are used. First, $X_t=(x_t^1,x_t^2,\ldots,x_t^D)^\top\in\mathbb{R}^{D}$ denotes the values of all exogenous variables at time $t$, and $X^k=(x_1^k,x_2^k,\ldots,x_T^k)^\top\in\mathbb{R}^{T}$ denotes the values of the $k$th exogenous variable. The matrix $X$ contains three kinds of exogenous variables: $X_p \in \mathbb{R}^{T\times P}$, having a positive correlation with the target variable; $X_n \in \mathbb{R}^{T\times N}$, having a negative correlation with the target variable; and an index variable $X_i \in \mathbb{R}^{T}$ for stock price data if available, so that $D = P + N + 1$.

The Pearson correlation coefficient is used to measure the correlation between the exogenous variables and the target variable, and Xp and Xn are generated using P-largest and N-smallest variables with positive and negative correlation coefficients, respectively. In our experiment, {4, 6, 10, 15} is a set of possible values for P and N, and 15, the largest value, is finally selected.
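The grouping step can be sketched as follows. This is one possible reading of the description above, in which the $P$ variables with the largest positive coefficients and the $N$ variables with the most negative coefficients are kept; the function name and the toy data are assumptions made for illustration.

```python
import numpy as np

def split_by_correlation(X, y, P, N):
    """X: (T, D) exogenous series, y: (T,) target.

    Returns the P most positively and N most negatively correlated
    columns of X, plus the Pearson coefficient of every column.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(corr)                          # ascending order
    pos = [k for k in order[::-1] if corr[k] > 0][:P]
    neg = [k for k in order if corr[k] < 0][:N]
    return X[:, pos], X[:, neg], corr

# Toy data (assumption): four exogenous series of varying correlation.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([
    y + 0.1 * rng.normal(size=200),       # strongly positive
    -y + 0.1 * rng.normal(size=200),      # strongly negative
    0.5 * y + rng.normal(size=200),       # weakly positive
    -0.5 * y + rng.normal(size=200),      # weakly negative
])
X_p, X_n, corr = split_by_correlation(X, y, P=1, N=1)
```

With `P=1` and `N=1`, `X_p` holds the first column and `X_n` the second; in the experiment described above, both values are set to 15.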

We define “Self”, “Index”, “Positive”, and “Negative” based on Y, Xi, Xp, and Xn values, respectively, by using LSTM with the hidden state size of r. The definitions are:

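• Self:

$\tilde{Y}=\mathrm{LSTM}(Y).$

• Index:

$\tilde{I}=\mathrm{LSTM}(X_i),$

where $\tilde{Y},\tilde{I}\in\mathbb{R}^{T\times r}$.

• Positive:

$\tilde{P}=\frac{1}{P}\sum_{p=1}^{P}\mathrm{LSTM}(X_p^{p}).$

• Negative:

$\tilde{N}=\frac{1}{N}\sum_{n=1}^{N}\mathrm{LSTM}(X_n^{n}),$

where $\tilde{P},\tilde{N}\in\mathbb{R}^{T\times r}$.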

The MI-LSTM is a model that can extract valuable information from factors with low correlation and discard negative noise using extra input gates controlled by convincing factors. This model deviates from the existing LSTM methods and shows that a model with multiple inputs to the LSTM can improve prediction performance. As defined above, MI-LSTM uses four input values: the past value of the target variable, exogenous variables with positive and negative correlation, and an index exogenous variable. The input values are expressed as $\tilde{Y},\tilde{P},\tilde{N},\tilde{I}$, and the values after passing through MI-LSTM can be expressed as

$\tilde{Y}'=\mathrm{MILSTM}(\tilde{Y},\tilde{P},\tilde{N},\tilde{I})$

and an output value of $\tilde{Y}'=(\tilde{Y}'_1,\tilde{Y}'_2,\ldots,\tilde{Y}'_T)^\top\in\mathbb{R}^{T\times p}$ is obtained.

Figure 3 is the structure of MI-LSTM. The following attention is applied to Ỹ′ to give weights:

$\begin{aligned} j_t&=v_b^\top\tanh\left(W_b(\tilde{Y}'_t)^\top+b_b\right),\\ \beta&=\mathrm{Softmax}\left([j_1,j_2,\ldots,j_T]^\top\right),\\ \tilde{y}&=\beta^\top\tilde{Y}',\\ \hat{y}_{T+1}&=\mathrm{ReLU}(W\tilde{y}+b), \end{aligned}$

where the matrix $W_b\in\mathbb{R}^{p\times p}$, bias $b_b\in\mathbb{R}^{p}$, and vector $v_b\in\mathbb{R}^{p}$ are the parameters to be learned, $\tilde{y}\in\mathbb{R}^{p}$ is the attention output, $W\in\mathbb{R}^{p}$ and $b\in\mathbb{R}$ are also parameters to be learned, and ReLU (rectified linear unit) is an activation function. ReLU was introduced by Nair and Hinton (2010) to make deep learning models train better. It is defined as $\mathrm{ReLU}(x)=\max(0,x)$ and, unlike earlier activation functions such as the sigmoid, it suffers less from the vanishing gradient problem, which prevents neurons from learning weights during the propagation steps.
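The attention readout above can be sketched in a few lines of NumPy. The array `Y_out` stands in for the MI-LSTM output $\tilde{Y}'$; its random values, the sizes and the function name `attention_readout` are assumptions made only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_readout(Y_out, W_b, b_b, v_b, W, b):
    """Y_out: (T, p) outputs of MI-LSTM. Returns the prediction and weights."""
    # j_t = v_b^T tanh(W_b Y'_t + b_b), evaluated for all T rows at once.
    scores = np.tanh(Y_out @ W_b.T + b_b) @ v_b      # shape (T,)
    beta = softmax(scores)                           # weights over time steps
    y_tilde = beta @ Y_out                           # weighted sum, shape (p,)
    y_hat = max(0.0, float(W @ y_tilde + b))         # ReLU(W y_tilde + b)
    return y_hat, beta

rng = np.random.default_rng(0)
T, p = 10, 8                                         # illustrative sizes
Y_out = rng.normal(size=(T, p))
y_hat, beta = attention_readout(
    Y_out,
    W_b=rng.normal(size=(p, p)), b_b=np.zeros(p), v_b=rng.normal(size=p),
    W=rng.normal(size=p), b=0.0,
)
```

The weights `beta` sum to one, so `y_tilde` is a convex combination of the per-step outputs, and the ReLU keeps the prediction non-negative.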

### 2.3. Dual-stage attention-based recurrent neural network

DA-RNN consists of two steps. In the first step it extracts the hidden state of the previous encoder and the relevant exogenous variables in each step by using the input attention. Then, in the second step the model selects the output value of the related decoder using the temporal attention mechanism. This model can select the most relevant exogenous variables as well as adequately capture the long-term dependency of the time series. Figure 4 and Figure 5 show the structure of DA-RNN.

The $n$ exogenous variables are represented by the matrix $X=(x^1,x^2,\ldots,x^n)^\top=(x_1,x_2,\ldots,x_T)\in\mathbb{R}^{n\times T}$, where $T$ is the time window size, $x^k=(x_1^k,x_2^k,\ldots,x_T^k)^\top\in\mathbb{R}^{T}$ represents the $k$th exogenous variable, and $x_t=(x_t^1,x_t^2,\ldots,x_t^n)^\top\in\mathbb{R}^{n}$ represents the vector of the $n$ exogenous variables at time $t$. DA-RNN creates a new input vector by multiplying this by the weight obtained through the following input attention mechanism. It then feeds the new input vectors into the LSTM and makes predictions by obtaining weights through a temporal attention mechanism.

$\begin{aligned} e_t^k&=v_e^\top\tanh\left(W_e[h_{t-1};s_{t-1}]+U_ex^k\right),\\ \alpha_t^k&=\frac{\exp(e_t^k)}{\sum_{i=1}^{n}\exp(e_t^i)},\\ \tilde{x}_t&=(\alpha_t^1x_t^1,\alpha_t^2x_t^2,\ldots,\alpha_t^nx_t^n)^\top, \end{aligned}$

where $h_{t-1}\in\mathbb{R}^{m}$ and $s_{t-1}\in\mathbb{R}^{m}$ are the previous hidden state and the cell state in the encoder LSTM unit, respectively, $v_e\in\mathbb{R}^{T}$, $W_e\in\mathbb{R}^{T\times 2m}$, and $U_e\in\mathbb{R}^{T\times T}$ are parameters to be learned, and $m$ is the size of the hidden state of the encoder. As in the above equations, a new input value is created by multiplying by the weight; it is then input into the LSTM as shown in the following equations, and the result is finally input into the temporal attention mechanism of the decoder.

$\begin{aligned} h_t&=f_1(h_{t-1},\tilde{x}_t),\\ l_t^i&=v_d^\top\tanh\left(W_d[d_{t-1};s'_{t-1}]+U_dh_i\right),\quad 1\le i\le T,\\ \beta_t^i&=\frac{\exp(l_t^i)}{\sum_{j=1}^{T}\exp(l_t^j)},\\ c_t&=\sum_{i=1}^{T}\beta_t^ih_i,\\ \tilde{y}_{t-1}&=\tilde{w}^\top[y_{t-1};c_{t-1}]+\tilde{b},\\ d_t&=f_2(d_{t-1},\tilde{y}_{t-1}),\\ \hat{y}_T&=F(y_1,\ldots,y_{T-1},x_1,\ldots,x_T)=v_y^\top\left(W_y[d_T;c_T]+b_w\right)+b_v, \end{aligned}$

where $v_d\in\mathbb{R}^{m}$, $W_d\in\mathbb{R}^{m\times 2p}$, and $U_d\in\mathbb{R}^{m\times m}$ are parameters to be learned, $p$ is the size of the decoder’s hidden state, $f_1$ and $f_2$ are LSTMs, and $v_y\in\mathbb{R}^{p}$, $W_y\in\mathbb{R}^{p\times(p+m)}$, $b_w\in\mathbb{R}^{p}$ and $b_v\in\mathbb{R}$ are also parameters to be learned.

3. Proposed method

### 3.1. Two-dimensional attention-based multi-input LSTM

In this section we present the details of the two-dimensional attention-based multi-input LSTM (2DA-MILSTM) model, a new model that includes the advantages of the MI-LSTM and DA-RNN models described in Section 2 and 2D-LSTM. Figure 6 is the structure of the 2DA-MILSTM model.

The exogenous variables are divided using the value of the correlation coefficient, and the output value after inputting them to MI-LSTM is expressed as $Z=(z_1,z_2,\ldots,z_T)\in\mathbb{R}^{T\times u}$, where $u$ is the number of MI-LSTM hidden units, and $z_t=(z_t^1,z_t^2,\ldots,z_t^u)^\top\in\mathbb{R}^{u}$ represents the input vector of the $u$ hidden units at time $t$. The proposed model uses H-attention and T-attention for $Z$, which can be calculated as:

$\hat{y}_{T+1}=F(Y,X,Z),$

where F is the function we want to learn.

In this model, as in the existing MI-LSTM model, “Self”, “Positive”, and “Negative” are used as input values of MI-LSTM. However, unlike in the existing model, only these three factors are input and the index variable is excluded, because there is no index exogenous variable in the data we consider in this paper. Therefore, we use $\tilde{Y},\tilde{P},\tilde{N}\in\mathbb{R}^{T\times r}$ as the input of the three factors, and $\tilde{Y}_t,\tilde{P}_t,\tilde{N}_t\in\mathbb{R}^{r}$ as the corresponding input vectors for $t=1,2,\ldots,T$. Figure 7 shows the structure of the modified MI-LSTM.

The forget gate and the output gate of the MI-LSTM remain the same when compared to the original LSTM as shown in the following equations:

$\begin{aligned} f_t&=\sigma\left(W_f[h_{t-1};\tilde{Y}_t]+b_f\right),\\ o_t&=\sigma\left(W_o[h_{t-1};\tilde{Y}_t]+b_o\right), \end{aligned}$

where $f_t,o_t\in\mathbb{R}^{u}$ are the forget gate and output gate, respectively, $h_{t-1}\in\mathbb{R}^{u}$ is the hidden state of the previous time step and $u$ is the number of MI-LSTM hidden units. $W_f,W_o\in\mathbb{R}^{u\times(u+r)}$ are the weights of the forget gate and the output gate, respectively, and $b_f,b_o\in\mathbb{R}^{u}$ are the biases.

All input gates are determined by the “Self” variable and the previous hidden state to control the auxiliary factors as in the following equations:

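$\begin{aligned} i_t&=\sigma\left(W_i[h_{t-1};\tilde{Y}_t]+b_i\right),\\ i_{pt}&=\sigma\left(W_{ip}[h_{t-1};\tilde{Y}_t]+b_{ip}\right),\\ i_{nt}&=\sigma\left(W_{in}[h_{t-1};\tilde{Y}_t]+b_{in}\right), \end{aligned}$

where $i_t,i_{pt},i_{nt}\in\mathbb{R}^{u}$ are the input gates of the three factors, $W_i,W_{ip},W_{in}\in\mathbb{R}^{u\times(u+r)}$ are the weights, and $b_i,b_{ip},b_{in}\in\mathbb{R}^{u}$ are the biases. The cell states of the three factors, $\tilde{C}_t,\tilde{C}_{pt},\tilde{C}_{nt}\in\mathbb{R}^{u}$, are obtained as:

$\begin{aligned} \tilde{C}_t&=\tanh\left(W_c[h_{t-1};\tilde{Y}_t]+b_c\right),\\ \tilde{C}_{pt}&=\tanh\left(W_{cp}[h_{t-1};\tilde{P}_t]+b_{cp}\right),\\ \tilde{C}_{nt}&=\tanh\left(W_{cn}[h_{t-1};\tilde{N}_t]+b_{cn}\right), \end{aligned}$

where $W_c,W_{cp},W_{cn}\in\mathbb{R}^{u\times(u+r)}$ are the weights, and $b_c,b_{cp},b_{cn}\in\mathbb{R}^{u}$ are the biases. In LSTM, the element-wise multiplication of $\tilde{C}_t$ and $i_t$ is the final cell state input at time $t$. Here the same product is formed for each factor:

$\begin{aligned} l_t&=\tilde{C}_t\odot i_t,\\ l_{pt}&=\tilde{C}_{pt}\odot i_{pt},\\ l_{nt}&=\tilde{C}_{nt}\odot i_{nt}, \end{aligned}$

where $l_t,l_{pt},l_{nt}\in\mathbb{R}^{u}$ are the cell inputs of the factors, and $\odot$ is the element-wise multiplication. Then, a weighted sum of the cell inputs is computed as:

$L_t=\alpha_tl_t+\alpha_{pt}l_{pt}+\alpha_{nt}l_{nt},$

where $\alpha_t,\alpha_{pt},\alpha_{nt}\in\mathbb{R}$ are the attention weights, and $L_t$ is the input of the final cell state at time $t$. The attention weights are determined by the cell state input itself and the cell state in the previous step as shown in the following equations: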

$\begin{aligned} u_t&=\tanh\left(l_t^\top W_aC_{t-1}+b_a\right),\\ u_{pt}&=\tanh\left(l_{pt}^\top W_aC_{t-1}+b_{ap}\right),\\ u_{nt}&=\tanh\left(l_{nt}^\top W_aC_{t-1}+b_{an}\right),\\ [\alpha_t,\alpha_{pt},\alpha_{nt}]^\top&=\mathrm{Softmax}\left([u_t,u_{pt},u_{nt}]^\top\right), \end{aligned}$

where $u_t,u_{pt},u_{nt}\in\mathbb{R}$ are intermediate values in the calculation of the attention weights, and $W_a\in\mathbb{R}^{u\times u}$ and $b_a,b_{ap},b_{an}\in\mathbb{R}$ are parameters to be learned. Also, $C_{t-1}\in\mathbb{R}^{u}$ is the state of the cell at time $t-1$. $[u_t,u_{pt},u_{nt}]$ is the attention score and is input to the softmax layer. Once the update vector of the cell state $L_t$ is obtained, the remaining MI-LSTM units proceed in the same manner as the original LSTM.

$\begin{aligned} C_t&=C_{t-1}\odot f_t+L_t,\\ h_t&=\tanh(C_t)\odot o_t, \end{aligned}$

where $C_t,h_t\in\mathbb{R}^{u}$ are the cell state and hidden state at time $t$, respectively.

The above equations define the structure of MI-LSTM. In this paper, equations (3.1)–(3.18) are summarized as $F_2$:

$h_t=F_2\left(h_{t-1},\tilde{Y}_t,\tilde{P}_t,\tilde{N}_t\right).$

More compactly, this is written as:

$Z=\mathrm{MILSTM}(\tilde{Y},\tilde{P},\tilde{N}),$

and we get an output value $Z=(Z_1,Z_2,\ldots,Z_T)\in\mathbb{R}^{T\times u}$ for each time step.
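To make the data flow of the modified MI-LSTM cell concrete, the following NumPy sketch performs one step of $F_2$ and stacks the hidden states into $Z$. The parameter names, the helper `init_params`, the sizes and the random inputs are assumptions for illustration and do not reproduce the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mi_lstm_step(h_prev, C_prev, y_t, p_t, n_t, prm):
    """One step of h_t = F2(h_{t-1}, Y~_t, P~_t, N~_t)."""
    hy = np.concatenate([h_prev, y_t])
    hp = np.concatenate([h_prev, p_t])
    hn = np.concatenate([h_prev, n_t])
    f = sigmoid(prm["Wf"] @ hy + prm["bf"])            # forget gate
    o = sigmoid(prm["Wo"] @ hy + prm["bo"])            # output gate
    # All three input gates are driven by the "Self" input.
    i_s = sigmoid(prm["Wi"] @ hy + prm["bi"])
    i_p = sigmoid(prm["Wip"] @ hy + prm["bip"])
    i_n = sigmoid(prm["Win"] @ hy + prm["bin"])
    # Cell states of the three factors, then the gated cell inputs.
    l_s = np.tanh(prm["Wc"] @ hy + prm["bc"]) * i_s
    l_p = np.tanh(prm["Wcp"] @ hp + prm["bcp"]) * i_p
    l_n = np.tanh(prm["Wcn"] @ hn + prm["bcn"]) * i_n
    # Attention over the three cell inputs, conditioned on C_{t-1}.
    WaC = prm["Wa"] @ C_prev
    scores = np.tanh(np.array([l_s @ WaC + prm["ba"],
                               l_p @ WaC + prm["bap"],
                               l_n @ WaC + prm["ban"]]))
    alpha = softmax(scores)
    L = alpha[0] * l_s + alpha[1] * l_p + alpha[2] * l_n
    C = C_prev * f + L
    h = np.tanh(C) * o
    return h, C, alpha

def init_params(u, r, rng):
    """Hypothetical helper: small random weights, zero biases."""
    prm = {"Wa": rng.normal(scale=0.1, size=(u, u)),
           "ba": 0.0, "bap": 0.0, "ban": 0.0}
    for name in ["f", "o", "i", "ip", "in", "c", "cp", "cn"]:
        prm["W" + name] = rng.normal(scale=0.1, size=(u, u + r))
        prm["b" + name] = np.zeros(u)
    return prm

rng = np.random.default_rng(0)
u, r, T = 5, 4, 10                                     # illustrative sizes
prm = init_params(u, r, rng)
Y_in, P_in, N_in = (rng.normal(size=(T, r)) for _ in range(3))
h, C = np.zeros(u), np.zeros(u)
rows = []
for t in range(T):
    h, C, alpha = mi_lstm_step(h, C, Y_in[t], P_in[t], N_in[t], prm)
    rows.append(h)
Z = np.stack(rows)                                     # shape (T, u)
```

At the first step the previous cell state is zero, so the three attention scores coincide and the cell inputs are weighted equally; the weights differentiate once the cell state carries information.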

### 3.2. H-attention

Considering the $k$th output value $z^k=(z_1^k,z_2^k,\ldots,z_T^k)\in\mathbb{R}^{T}$, we can construct the attention model by referring to the previous hidden state $h'_{t-1}$ and the cell state $C'_{t-1}$ in the LSTM unit. By constructing an attention model, an attention weight $\alpha_t^k$ can be obtained. We focus on the output values of the MI-LSTM structure and consider, through their importance at each time step, whether each output value carries accurate and sufficiently informative content. Therefore, we introduce an H-attention mechanism that can select which value is of high importance among the output values from MI-LSTM as:

$\begin{aligned} u_t^k&=v_e^\top\tanh\left(W_e[h'_{t-1};C'_{t-1}]+U_ez^k\right),\\ \alpha_t^k&=\frac{\exp(u_t^k)}{\sum_{i=1}^{u}\exp(u_t^i)}, \end{aligned}$

where $v_e\in\mathbb{R}^{T}$, $W_e\in\mathbb{R}^{T\times 2m}$, and $U_e\in\mathbb{R}^{T\times T}$ are parameters to be learned, and $\alpha_t^k$ is the attention weight that measures the importance of the $k$th hidden unit at time $t$. In addition, the softmax function is applied to $u_t^k$, so that the weight values sum to 1.

A new input vector can be extracted by multiplying the weight derived through the H-attention mechanism by the input value.

$\tilde{z}_t=(\alpha_t^1z_t^1,\alpha_t^2z_t^2,\ldots,\alpha_t^uz_t^u)^\top$

Using this new input vector, the hidden state at time $t$ can be updated as:

$h'_t=f_{\mathrm{lstm}}(h'_{t-1},\tilde{z}_t),$

where $f_{\mathrm{lstm}}$ is the LSTM function.
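The H-attention weights for a single time step can be sketched as follows. The function name `h_attention`, the random parameters and the zero initial states are assumptions for illustration; the LSTM update $f_{\mathrm{lstm}}$ is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def h_attention(Z, h_prev, C_prev, W_e, U_e, v_e):
    """Z: (T, u), whose k-th column is z^k in R^T. Returns alpha_t in R^u."""
    hc = np.concatenate([h_prev, C_prev])              # [h'_{t-1}; C'_{t-1}]
    # W_e hc has shape (T,); U_e Z applies U_e to every column z^k at once.
    pre = (W_e @ hc)[:, None] + U_e @ Z                # shape (T, u)
    scores = v_e @ np.tanh(pre)                        # u_t^k for k = 1..u
    return softmax(scores)

rng = np.random.default_rng(0)
T, u, m = 10, 6, 4                                     # illustrative sizes
Z = rng.normal(size=(T, u))                            # stand-in MI-LSTM output
alpha = h_attention(
    Z, h_prev=np.zeros(m), C_prev=np.zeros(m),
    W_e=rng.normal(size=(T, 2 * m)),
    U_e=rng.normal(size=(T, T)),
    v_e=rng.normal(size=T),
)
t = 0
z_tilde = alpha * Z[t]                                 # re-weighted input at time t
```

Each hidden unit receives one weight, so `z_tilde` rescales the components of $z_t$ according to their estimated importance before they enter the LSTM.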

### 3.3. T-attention

In order to predict the output $\hat{y}_{T+1}$, another LSTM-based RNN is used to introduce an attention that can select the importance of each time step. Considering the input series $z_t=(z_t^1,z_t^2,\ldots,z_t^u)^\top\in\mathbb{R}^{u}$ at time $t$, the attention weight of each hidden state is calculated based on the previous hidden state $d_{t-1}\in\mathbb{R}^{u}$ and the cell state $C''_{t-1}\in\mathbb{R}^{u}$ of the LSTM unit, and the attention model can be constructed as:

$\begin{aligned} r_t^i&=v_d^\top\tanh\left(W_d[d_{t-1};C''_{t-1}]+U_dz_i\right),\quad 1\le i\le T,\\ \beta_t^i&=\frac{\exp(r_t^i)}{\sum_{j=1}^{T}\exp(r_t^j)}, \end{aligned}$

where $[d_{t-1};C''_{t-1}]\in\mathbb{R}^{2u}$ is a concatenation of the previous hidden state and the cell state of the LSTM unit, and $v_d\in\mathbb{R}^{m}$, $W_d\in\mathbb{R}^{m\times 2u}$, and $U_d\in\mathbb{R}^{m\times m}$ are parameters to be learned. The attention weight $\beta_t^i$ denotes the importance of the $i$th hidden state for prediction. The attention mechanism computes the context vector $c_t$ as the weighted average of all new input vectors as:

$c_t=\sum_{i=1}^{T}\beta_t^i\tilde{z}_i.$

Note that the context vector $c_t$ is distinct at each time step.

Once we get the context vector, we can obtain $\tilde{y}_{t-1}$ as:

$\tilde{y}_{t-1}=\tilde{w}^\top[y_{t-1};c_{t-1}]+\tilde{b},$

where $[y_{t-1};c_{t-1}]\in\mathbb{R}^{m+1}$ is a concatenation of the decoder input $y_{t-1}$ and the calculated context vector $c_{t-1}$. The parameters $\tilde{w}\in\mathbb{R}^{m+1}$ and $\tilde{b}\in\mathbb{R}$ map the concatenation to the input size of the T-attention.

The newly calculated $\tilde{y}_{t-1}$ can be used to update the decoder hidden state at time $t$ as:

$d_t=f_2(d_{t-1},\tilde{y}_{t-1}),$

where f2 is the LSTM function. Then dt can be updated as:

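$\begin{aligned} f'_t&=\sigma\left(W'_f[d_{t-1};\tilde{y}_{t-1}]+b'_f\right),\\ i'_t&=\sigma\left(W'_i[d_{t-1};\tilde{y}_{t-1}]+b'_i\right),\\ o'_t&=\sigma\left(W'_o[d_{t-1};\tilde{y}_{t-1}]+b'_o\right),\\ C''_t&=f'_t\odot C''_{t-1}+i'_t\odot\tanh\left(W'_c[d_{t-1};\tilde{y}_{t-1}]+b'_c\right),\\ d_t&=o'_t\odot\tanh(C''_t), \end{aligned}$

where $[d_{t-1};\tilde{y}_{t-1}]\in\mathbb{R}^{u+1}$ is a concatenation of the previous hidden state $d_{t-1}$ and the input value $\tilde{y}_{t-1}$ of the LSTM of the T-attention. $W'_f,W'_i,W'_o,W'_c\in\mathbb{R}^{u\times(u+1)}$ and $b'_f,b'_i,b'_o,b'_c\in\mathbb{R}^{u}$ are parameters to be learned. $\sigma$ and $\odot$ are the logistic sigmoid function and element-wise multiplication, respectively.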
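The T-attention weights, the context vector and the decoder input can be sketched as below. The array shapes are chosen only so that every product is defined (in particular, $U_d$ is taken as $m\times u$ and $\tilde{w}$ as length $u+1$, which is an assumption of this sketch); `Z_tilde` stands in for the H-attention-weighted inputs, and all values are random.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def t_attention(Z, Z_tilde, d_prev, C_prev, W_d, U_d, v_d):
    """Return the context vector c_t and the weights beta_t over T steps."""
    dc = np.concatenate([d_prev, C_prev])              # [d_{t-1}; C''_{t-1}]
    pre = W_d @ dc + Z @ U_d.T                         # shape (T, m)
    r = np.tanh(pre) @ v_d                             # r_t^i for i = 1..T
    beta = softmax(r)
    c = beta @ Z_tilde                                 # sum_i beta_t^i z~_i
    return c, beta

def decoder_input(y_prev, c_prev, w_tilde, b_tilde):
    """y~_{t-1} = w~^T [y_{t-1}; c_{t-1}] + b~."""
    return float(w_tilde @ np.concatenate([[y_prev], c_prev]) + b_tilde)

rng = np.random.default_rng(0)
T, u, m = 10, 6, 4                                     # illustrative sizes
Z = rng.normal(size=(T, u))
Z_tilde = 0.5 * Z                                      # stand-in for weighted inputs
c, beta = t_attention(
    Z, Z_tilde, d_prev=np.zeros(u), C_prev=np.zeros(u),
    W_d=rng.normal(size=(m, 2 * u)),
    U_d=rng.normal(size=(m, u)),
    v_d=rng.normal(size=m),
)
y_in = decoder_input(y_prev=0.3, c_prev=c,
                     w_tilde=rng.normal(size=u + 1), b_tilde=0.0)
```

Because `beta` is a softmax output, the context vector is a convex combination of the rows of `Z_tilde`, and `y_in` is the scalar that would be passed to the decoder LSTM $f_2$.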

4. Experiment

We conduct an experiment using two types of data sets to check whether the proposed model has good performance in stock price data as well as in time series data in other fields. A stock price prediction experiment is conducted using KOSPI 200 data. A general prediction experiment on time series data in various fields other than stock prices is conducted using temperature data and energy data.

### 4.1. Data description

• KOSPI 200: This is a data set where, among the listed companies on KOSPI, 200 stocks considered to represent KOSPI due to their large market capitalizations and large trading volume are selected, and used to predict the stock index. This is daily data measured for a total of 2158 days from August 23, 2011 to May 29, 2020. In our experiment, the KOSPI 200 index is used as the target variable, and the stock prices for 165 companies are used as exogenous variables (Table 1). These companies are selected so that there are no missing values in the data for companies such as Samsung Electronics, SK Hynix, Samsung Biologics, Naver, Hyundai Motor Company, and LG Chem.

• SML 2010 (Zamora-Martinez et al., 2014): This is a public dataset used for indoor temperature prediction, collected from a monitoring system installed in a house. Data were sampled every minute and averaged over 15-minute intervals. It consists of approximately 40 days (before and after 43 days) of monitoring data. We use 4137 observations. In the experiment, 16 exogenous variables are used as shown in Table 2, and observations with missing values and variables such as date and time are removed. Among the indoor temperatures, the temperature of the dining room is used as the target variable.

• Appliances energy (Candanedo et al., 2017): This dataset has been used to predict appliance energy use in low-energy buildings. Data were measured at intervals of 10 minutes over a period of 4.5 months. The total size of the data is 19734 and, as shown in Table 3, 27 exogenous variables are used (excluding date), with the appliances' energy consumption as the target variable.

Time series data should be stationary before any forecast is made, so we preprocessed the three time series datasets with min-max scaling using the Scikit-learn package (Pedregosa et al., 2011).

### 4.2. Design of experiment

The parameters of our model include the LSTM dimension r, the MI-LSTM dimension u, time window size T, the number of exogenous variables with positive correlation P and the number of exogenous variables with negative correlation N, the size of hidden states of the LSTM for H-attention, q, and the size of hidden states of the LSTM for T-Attention, m.

In our experiment, r and m are set to 128 after conducting experiments with 64 and 128. P and N are set to be the same and selected among {4,6,10,15}, and the time window size T is selected among {10,16,32}. Lastly, 0.8, 0.1, and 0.1 are used as the ratios of the training, validation, and test datasets.
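The data preparation implied by these settings can be sketched as follows. The chronological (unshuffled) split, the choice to compute the min-max range on the training portion only, the function names and the toy random-walk data are all assumptions of this sketch; the paper states only the split ratios, the window sizes and the use of min-max scaling.

```python
import numpy as np

def chronological_split(n, ratios=(0.8, 0.1, 0.1)):
    """Return slices for training, validation and test, in time order."""
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (slice(0, n_train),
            slice(n_train, n_train + n_val),
            slice(n_train + n_val, n))

def min_max_scale(a, lo, hi):
    return (a - lo) / (hi - lo)

def make_windows(X, y, T):
    """Pair each window of length T with the next target value."""
    starts = range(len(y) - T)
    X_win = np.stack([X[i:i + T] for i in starts])     # (n - T, T, D)
    y_win = np.stack([y[i:i + T] for i in starts])     # (n - T, T)
    target = y[T:]                                     # (n - T,)
    return X_win, y_win, target

rng = np.random.default_rng(0)
n, D, T = 1000, 5, 16                                  # toy sizes
X = rng.normal(size=(n, D)).cumsum(axis=0)             # toy random walks
y = rng.normal(size=n).cumsum()

train, val, test = chronological_split(n)
X_s = min_max_scale(X, X[train].min(axis=0), X[train].max(axis=0))
y_s = min_max_scale(y, y[train].min(), y[train].max())
X_win, y_win, target = make_windows(X_s[train], y_s[train], T)
```

Each row of `X_win` and `y_win` holds the $T$ past values of the exogenous variables and of the target, and the matching entry of `target` is the value at the next time step.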

We consider three different metrics to evaluate the performance of the models in time series prediction experiments. Assuming yt is the target at time t and ŷt is the predicted value at time t,

• Mean absolute percentage error (MAPE): a value obtained by taking the average of the absolute percentage of the residual

$\mathrm{MAPE}=\frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_t^i-\hat{y}_t^i}{y_t^i}\right|.$

• Root mean square error (RMSE): a value obtained by taking the root of the average of the squares of the residual

$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_t^i-y_t^i\right)^2}.$

• Mean absolute error (MAE): average of all absolute errors

$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_t^i-y_t^i\right|.$

Note that the MSE, RMSE, and MAE are calculated using normalized data because the scales of the different time series differ greatly.
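The three metrics defined above translate directly into NumPy. The function names and the three-point example are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mae(y, y_hat):
    return np.mean(np.abs(y_hat - y))

y = np.array([2.0, 4.0, 5.0])        # toy targets
y_hat = np.array([1.0, 5.0, 5.0])    # toy predictions

m1 = mape(y, y_hat)   # (0.5 + 0.25 + 0) / 3 = 0.25
m2 = rmse(y, y_hat)   # sqrt((1 + 1 + 0) / 3)
m3 = mae(y, y_hat)    # (1 + 1 + 0) / 3
```

The first two errors have the same absolute size, but the one on the smaller target contributes twice as much to the MAPE, which illustrates why the MAPE depends on the level of the series while the RMSE and MAE do not.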

To evaluate the performance of 2DA-MILSTM, we conduct a comparative experiment with two models, DA-RNN and MI-LSTM. In addition, we fit ARIMA, a classical statistical time series model that is not based on neural networks. The auto.arima function in the forecast package for R (Hyndman and Benítez, 2016), designed by Hyndman and Khandakar, automatically determines the order of the ARIMA model using a step-wise algorithm. The experiment was conducted using KOSPI 200 data (stock price data), SML 2010 data (room temperature data), and appliances energy data (energy data).

### 4.3. Results

The time series prediction results of 2DA-MILSTM and three existing methods over the three datasets are shown in Table 4. In Table 4, we observe that the MAPE, RMSE and MAE of ARIMA are larger than those of the neural net-based methods. There are two interesting exceptions to this observation, one of which is the MAPE for the KOSPI 200 dataset, where ARIMA outperforms the other methods. The possible reasons are that (1) the stock price by nature shows higher volatility as its level goes up, and (2) the MAPE puts a heavier penalty on negative errors than on positive errors. Because of the high volatility at high stock price levels, it is difficult to predict the price accurately there, and there are possibly more positive errors at high levels. Consequently, the MAPE can be small even though a model does not predict well at high levels, since it puts a lighter penalty on positive errors.

We can also see that the proposed model outperforms the other methods for all three datasets in terms of all three criteria, except the case described above. This suggests that it might be beneficial to consider the correlation between exogenous variables and the target variable by introducing MI-LSTM. We can consider the importance of exogenous variables using the input attention; however, DA-RNN just downgrades the importance of variables with negative correlation with the target variable. By integrating MI-LSTM with two attention mechanisms, the proposed 2DA-MILSTM achieves the best MSE, RMSE and MAE across the three datasets. It uses MI-LSTM to consider the correlation of exogenous variables effectively, and it employs H- and T-attention mechanisms to capture relevant hidden features across all time steps as well as potential output information.

For visual comparison, we show the prediction results of the three neural net-based methods over certain ranges of the three datasets in Figures 8–16.

Figures 8–10 show that our model makes better predictions for KOSPI 200 data than the other two models. In particular, DA-RNN follows the overall trend, but we can see some time shift across the whole time period. In the case of MI-LSTM, the prediction is accurate until time point 70, but after that the result deteriorates. However, we can see that 2DA-MILSTM predicts the actual value almost exactly, except for slight overestimation and shifting after time point 115.

Figures 11–13 show that our model predicts SML 2010 data better than the other two models. In the case of both DA-RNN and MI-LSTM, we can see that the prediction before time point 60 is not accurate (the prediction is accurate only between 60 and 75 for DA-RNN). Using MI-LSTM, in addition to the range of 60–75, the prediction is fairly accurate after time point 105. However, we can see that 2DA-MILSTM predicts very accurately over the whole time period except between 70 and 95.

Figures 14–16 show that our model makes better predictions for appliances energy data than the other two models. For this dataset, none of the three methods predicts accurately between time points 20 and 40. DA-RNN predicts slightly better than the other two models for that period. However, it overestimates before time point 20. We can also see that MI-LSTM underestimates between 40 and 110. However, 2DA-MILSTM predicts very accurately before 20 and between 40 and 110.

5. Conclusion

In this paper, we proposed a two-dimensional attention-based MI-LSTM (2DA-MILSTM) motivated by the advantages of MI-LSTM, which effectively uses exogenous variables with low correlation, and DA-RNN and 2D-LSTM, in which an attention mechanism is introduced twice. The proposed method was applied to KOSPI 200, SML 2010 and appliance energy datasets. The results show that our proposed 2DA-MILSTM can outperform existing methods for time series prediction.

The proposed model, 2DA-MILSTM, can selectively use output value information and temporal information by introducing H-attention and T-attention. H-attention can better capture potential information of the output value, and T-attention can capture temporal information. 2DA-MILSTM, which uses the two attention mechanisms, selects the output value with accurate information and captures temporal information appropriately to improve predictive performance. Better forecasting depends on capturing long- and short-term dependencies, and a longer forecasted time series increases variability. While MI-LSTM captures the dependencies, T-attention and H-attention stabilize the output values of MI-LSTM by considering temporal and potential values. This means that the model can control the increasing variability of time series data because the weights of the two attention mechanisms control both temporal and potential information.

The stock price has high volatility at high levels. Therefore, if one uses the log-return of the stock price rather than the original scale, the prediction at high levels might be more accurate and, consequently, the performance of the model could be improved. However, in the NARX literature there are several papers in which stock prices are used on the original scale, including the two that motivated our research, DA-RNN and MI-LSTM. Therefore, we believe that using the same (original) scale for the stock price makes it easier for readers to compare our proposed model with the two models.

As a follow-up study, a visualization method for the H-attention and T-attention weights could be considered to check where the two attention mechanisms concentrate among exogenous variables, time stamps and logits from MI-LSTM. In addition, non-linear relations between the exogenous variables and the target variable should be considered. When non-linear relations are applied, the proposed 2DA-MILSTM model can be expected to receive additional information about the relation between the exogenous variables and the target variable. Also, various ranges of the hyperparameters should be tested while considering the relationships between hyperparameters.

Acknowledgement
This research was supported by Next-Generation Information Computing Development Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Science, ICT (NRF-2017M3C4A7083281). We thank the referees for many important comments which helped to improve the presentation of the manuscript.
Figures
Fig. 1. Basic structure of RNN.
Fig. 2. Basic structure of LSTM.
Fig. 3. Structure of MI-LSTM.
Fig. 4. Input attention mechanism of DA-RNN.
Fig. 5. Temporal attention mechanism of DA-RNN.
Fig. 6. Structure of 2DA-MILSTM.
Fig. 7. Structure of a modified MI-LSTM.
Fig. 8. Prediction of KOSPI 200 using DA-RNN.
Fig. 9. Prediction of KOSPI 200 using MI-LSTM.
Fig. 10. Prediction of KOSPI 200 using 2DA-MILSTM.
Fig. 11. Prediction of SML 2010 using DA-RNN.
Fig. 12. Prediction of SML 2010 using MI-LSTM.
Fig. 13. Prediction of SML 2010 using 2DA-MILSTM.
Fig. 14. Prediction of Appliances energy using DA-RNN.
Fig. 15. Prediction of Appliances energy using MI-LSTM.
Fig. 16. Prediction of Appliances energy using 2DA-MILSTM.
TABLES

### Table 1

Variables of KOSPI 200 data (https://kr.investing.com/)

| Role | Variable |
|---|---|
| Target variable | KOSPI 200 index |
| Exogenous variable | 165 among 200 companies such as Samsung Electronics, SK Hynix, Samsung Biologics, Naver, Hyundai Motors, and LG Chem |

### Table 2

Variables of SML 2010 data (https://archive.ics.uci.edu/ml/datasets/SML2010)

| Role | Variable |
|---|---|
| Target variable | Indoor temperature (dining room) |
| Exogenous variable | Indoor temperature (room) |
| | Weather forecast temperature |
| | Carbon dioxide in ppm (dining room) |
| | Carbon dioxide in ppm (room) |
| | Relative humidity (dining room) |
| | Relative humidity (room) |
| | Lighting (dining room) |
| | Lighting (room) |
| | Rain (percentage of the last 15 minutes in which rain was detected) |
| | Sun dusk |
| | Wind (m/s) |
| | Enthalpic motor 2, 0 or 1 |
| | Enthalpic motor turbo, 0 or 1 |
| | Outdoor temperature |

### Table 3

Variables of Appliances energy data (https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction)

| Role | Variable |
|---|---|
| Target variable | Appliances, energy use in Wh |
| Exogenous variable | lights, energy use of light fixtures in the house in Wh |
| | T1, Temperature in kitchen area, in Celsius |
| | RH 1, Humidity in kitchen area, in % |
| | T2, Temperature in living room area, in Celsius |
| | RH 2, Humidity in living room area, in % |
| | T3, Temperature in laundry room area |
| | RH 3, Humidity in laundry room area, in % |
| | T4, Temperature in office room, in Celsius |
| | RH 4, Humidity in office room, in % |
| | T5, Temperature in bathroom, in Celsius |
| | RH 5, Humidity in bathroom, in % |
| | T6, Temperature outside the building (north side), in Celsius |
| | RH 6, Humidity outside the building (north side), in % |
| | T7, Temperature in ironing room, in Celsius |
| | RH 7, Humidity in ironing room, in % |
| | T8, Temperature in teenager room 2, in Celsius |
| | RH 8, Humidity in teenager room 2, in % |
| | T9, Temperature in parents room, in Celsius |
| | RH 9, Humidity in parents room, in % |
| | To, Temperature outside (from Chievres weather station), in Celsius |
| | Pressure (from Chievres weather station), in mm Hg |
| | RH out, Humidity outside (from Chievres weather station), in % |
| | Wind speed (from Chievres weather station), in m/s |
| | Visibility (from Chievres weather station), in km |
| | Tdewpoint (from Chievres weather station), in Celsius |
| | rv1, Random variable 1, nondimensional |
| | rv2, Random variable 2, nondimensional |

### Table 4

Time series prediction results over the KOSPI 200 dataset, SML 2010 dataset and Appliances energy dataset (best performance displayed in boldface)

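| Model | KOSPI 200 MAPE | KOSPI 200 RMSE | KOSPI 200 MAE | SML 2010 MAPE | SML 2010 RMSE | SML 2010 MAE | Appliances energy MAPE | Appliances energy RMSE | Appliances energy MAE |
|---|---|---|---|---|---|---|---|---|---|
| ARIMA | 6.358 | 29.696 | 23.791 | 19.440 | 3.680 | 3.098 | 49.855 | 97.498 | 53.859 |
| DA-RNN | 10.982 | 2.647 | 2.664 | 16.872 | 11.272 | 1.148 | 32.883 | 32.488 | 6.049 |
| MI-LSTM | 10.182 | 0.420 | 0.556 | 5.380 | 0.240 | 0.321 | 7.120 | 0.562 | 1.243 |
| 2DA-MILSTM | 9.100 | 0.356 | 0.472 | 2.300 | 0.198 | 0.289 | 5.110 | 0.510 | 1.166 |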
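
For readers who want to compute the three error measures reported in Table 4, the following is a minimal sketch using their standard textbook definitions. Whether MAPE is expressed as a percentage here is an assumption, as are the function names and the sample values; the exact conventions behind the table are those given in the paper's experimental section.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumed convention).

    Requires every true value to be non-zero.
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


# Made-up values, for illustration only.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

scores = {
    "MAPE": mape(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
}
```

MAE and RMSE are in the units of the target variable, so they are not comparable across the three datasets, whereas MAPE is scale-free but undefined when a true value is zero.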

References
1. Candanedo LM, Feldheim V, and Deramaix D (2017). Data driven prediction models of energy use of appliances in a low-energy house. Energy and Buildings, 140, 81-97.
2. Chen S, Wang XX, and Harris CJ (2008). NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Transactions on Control Systems Technology, 16, 78-84.
3. Cho K, van Merriënboer B, Bahdanau D, and Bengio Y (2014a). On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259
4. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y (2014b). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078
5. Elman JL (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.
6. Hochreiter S and Schmidhuber J (1997). Long short-term memory. Neural Computation, 9, 1735-1780.
7. Huang Z, Xu W, and Yu K (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv:1508.01991
8. Hyndman RJ and Benítez JM (2016). Bagging exponential smoothing methods using STL decomposition and Box-Cox transformation. International Journal of Forecasting, 32, 303-312.
9. Jang E, Gu S, and Poole B (2016). Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144
10. Li H, Shen Y, and Zhu Y (2018). Stock price prediction using attention-based multi-input LSTM, 454-469.
11. Li G, Wen C, Zheng W, and Chen Y (2011). Identification of a class of nonlinear autoregressive models with exogenous inputs based on kernel machines. IEEE Transactions on Signal Processing, 59, 2146-2159.
12. Liu B and Lane I (2016). Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv:1609.01454
13. Liu Y, Gong C, Yang L, and Chen Y (2019). DSTP-RNN: a dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. arXiv:1904.07464
14. McLeod AI and Li WK (1983). Diagnostic checking ARMA time series models using squared-residual autocorrelations. Journal of Time Series Analysis, 4, 269-273.
15. Nair V and Hinton GE (2010). Rectified linear units improve restricted Boltzmann machines. ICML.
16. Pedregosa F, Varoquaux G, Gramfort A, et al. (2011). Scikit-learn: machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.
17. Pham H, Tran V, and Yang BS (2010). A hybrid of nonlinear autoregressive model with exogenous input and autoregressive moving average model for long-term machine state forecasting. Expert Systems with Applications, 37, 3310-3317.
18. Qin Y, Song D, Chen H, Cheng W, Jiang G, and Cottrell G (2017). A dual-stage attention-based recurrent neural network for time series prediction. arXiv:1704.02971
19. Rumelhart DE, Hinton GE, and Williams RJ (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
20. Sutskever I, Vinyals O, and Le QV (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 3104-3112.
21. Tao Y, Ma L, Zhang W, Liu J, Liu W, and Du Q (2018). Hierarchical attention based recurrent highway networks for time series prediction. arXiv:1806.00685
22. Werbos P (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78, 1550-1560.
23. Yang Z, Yang D, Dyer C, He X, Smola A, and Hovy E (2016). Hierarchical attention networks for document classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480-1489.
24. Yu Y and Kim YJ (2019). Two-dimensional attention-based LSTM model for stock index prediction. Journal of Information Processing Systems, 15, 1231-1242.
25. Zamora-Martinez F, Romeu-Guallart P, and Pardo J (2014). SML2010 Data Set. UCI Machine Learning Repository.