April 12, 2024

Building Advanced Horizons

Navigating the World of Business Trading

Multi-level deep Q-networks for Bitcoin trading strategies


This section presents the learning algorithm employed to develop an effective trading strategy by leveraging the processed datasets. A more comprehensive understanding of the proposed model is provided in the three subsections. The proposed M-DQN consists of three independent DQN-based models: Trade-DQN, Predictive-DQN, and Main-DQN. The subsections respectively cover the theoretical background of RL and DQN, a summary of the results from the Trade-DQN and Predictive-DQN models, and the design methodology of the proposed Main-DQN model.

Background—RL and DQN

As mentioned above, because the proposed M-DQN is based on the RL and DQN structures, the basic concepts are introduced prior to describing the method in detail.

Reinforcement learning

Reinforcement learning is a machine-learning paradigm in which an agent learns to make decisions by interacting with its environment [51]. The goal of the agent is to maximize its cumulative reward by discovering an optimal policy that maps states to actions. The agent performs actions based on its current state, and the environment responds by providing feedback in the form of rewards or penalties. This process is iterative and continues until the agent has learned enough about the environment to perform the given task well.

The foundation of RL lies in the Markov decision process (MDP) framework, which comprises a tuple (\(S, A, P, R, \gamma\)), where S denotes the set of states; A represents the set of actions; P is the state-transition probability function; R is the reward function; and \(\gamma\) is the discount factor with \(0\le \gamma \le 1\). The discount factor indicates the agent’s preference for present over future rewards. When \(\gamma\) is close to zero, the agent prioritizes immediate rewards, whereas a \(\gamma\) value close to one means that the agent values future rewards almost as much as immediate rewards.
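As a small illustration of how the discount factor weights rewards, the sketch below computes the discounted sum of a finite reward sequence for three values of \(\gamma\). The function name and the reward sequence are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return the sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Work backwards so that each earlier step adds one more factor of gamma
    # to everything that follows it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical rewards, not taken from the study

only_immediate = discounted_return(rewards, 0.0)  # gamma = 0: first reward only
half_discount = discounted_return(rewards, 0.5)   # later rewards count for less
no_discount = discounted_return(rewards, 1.0)     # all rewards count equally

print(only_immediate, half_discount, no_discount)
```

With \(\gamma = 0\) only the first reward contributes, while with \(\gamma = 1\) the result is the plain sum of all rewards.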

The agent’s objective is to learn an optimal policy \(\pi\) that maximizes the expected cumulative reward for each state; this expected cumulative reward is known as the value function. The value function V(s) is defined as the expected cumulative reward starting from state s and following policy \(\pi\). Similarly, the action-value function Q(s, a) represents the expected cumulative reward starting from state s, taking action a, and following policy \(\pi\).
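Written out under one common convention, the two functions take the following form. The indexing of the rewards and the expectation notation are assumptions, since the text does not fix them.

$$\begin{aligned} V^{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s \right], \\ Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a \right]. \end{aligned}$$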

A popular method for solving RL problems is Q-learning, which is a model-free, value-based method that directly estimates the optimal action-value function [52]. Q-learning is an off-policy algorithm: it learns the optimal policy regardless of the policy the agent follows while collecting experience. In Q-learning, the agent updates its action-value function using the Bellman equation, which expresses the optimal value of a state-action pair as the immediate reward plus the discounted value of the best action in the next state. The agent iteratively updates the Q-values using this equation until they converge to the optimal values.
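The update described above can be sketched for the tabular case as follows. This is a minimal illustration of a standard Q-learning step, not the authors' code; the state labels, the learning rate, and the initial Q-value are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      learning_rate=0.1, gamma=0.99):
    """Apply one tabular Q-learning step and return the updated Q(state, action)."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Bellman target: immediate reward plus discounted best next value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction of the way toward the target.
    q[(state, action)] += learning_rate * (td_target - q[(state, action)])
    return q[(state, action)]


actions = ["buy", "hold", "sell"]
q = defaultdict(float)       # unseen state-action pairs start at 0.0
q[("s1", "sell")] = 2.0      # hypothetical existing estimate for the next state

new_value = q_learning_update(q, "s0", "buy", reward=1.0, next_state="s1",
                              actions=actions, learning_rate=0.5, gamma=0.9)
print(new_value)
```

Here the target is 1.0 + 0.9 × 2.0 = 2.8, and with a learning rate of 0.5 the estimate for ("s0", "buy") moves halfway from 0 toward that target.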

Deep Q-network

Building upon the foundations of Q-learning, DQN is an extension that combines reinforcement learning with deep learning techniques [15]. It uses a deep neural network as a function approximator to estimate the action-value function Q(s, a). DQN addresses the main challenges of traditional Q-learning, such as unstable learning. Moreover, by employing deep learning, DQN can handle high-dimensional state spaces, such as those encountered in image-based tasks or large-scale problems [53].

To ensure stable learning, DQN incorporates two essential techniques: experience replay and target networks. Experience replay is a mechanism that stores an agent’s experiences (i.e., state transitions and rewards) in a replay buffer [54]. During training, the agent samples random minibatches of experiences from the buffer to update the Q-values. This process helps break the correlation between consecutive experiences, thereby reducing the variance of updates and leading to more stable learning.
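A replay buffer of this kind can be sketched with the Python standard library. The class name, the capacity, and the transition layout (state, action, reward, next state, done flag) are assumptions made for illustration; the text does not specify them.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal ordering
        # of consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


replay = ReplayBuffer(capacity=3)
for step in range(5):
    replay.push(step, "hold", 0.0, step + 1, False)

batch = replay.sample(2)
print(len(replay), batch)
```

Because the capacity is 3, only the three most recent of the five pushed transitions remain available for sampling.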

Complementing experience replay, target networks address the issue of moving targets in the Q-value update equation. In DQN, a separate neural network called the target network is used to compute the target Q-values for the Bellman update. The target network has the same architecture as the main Q-network. However, its parameters are updated less frequently, by periodically copying weights from the main network. This technique mitigates the issue of nonstationary targets and improves learning stability.
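The role of the target network can be sketched as two small helpers: one that computes the Bellman targets from the target network's outputs, and one that copies the main network's weights at a fixed interval. The function names, the callable `target_q`, the dictionary of weights, and the sync interval are all hypothetical stand-ins for a real network.

```python
import copy


def td_targets(batch, target_q, gamma):
    """Compute y = r (terminal) or y = r + gamma * max_a' Q_target(s', a')."""
    targets = []
    for state, action, reward, next_state, done in batch:
        if done:
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(target_q(next_state)))
    return targets


def maybe_sync(step, online_weights, target_weights, sync_every):
    """Return a copy of the online weights every `sync_every` steps."""
    if step % sync_every == 0:
        return copy.deepcopy(online_weights)
    return target_weights


def frozen_q(_state):
    # Stand-in for the target network: fixed Q-values for three actions.
    return [0.0, 1.0, 3.0]


batch = [("s0", "buy", 1.0, "s1", False), ("s1", "sell", 2.0, "s2", True)]
targets = td_targets(batch, frozen_q, gamma=0.5)

online = {"w": [1.0, 2.0]}
target = {"w": [0.0, 0.0]}
target_after_sync = maybe_sync(100, online, target, sync_every=100)
target_no_sync = maybe_sync(101, online, target, sync_every=100)
print(targets, target_after_sync, target_no_sync)
```

Between syncs the targets are computed from unchanged weights, which is what keeps them from shifting on every gradient step.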

In summary, RL and DQN provide a robust and scalable framework for learning optimal policies in complex environments with large state spaces. By leveraging deep learning techniques, DQN effectively tackles the challenges of scalability and stability in traditional Q-learning. In the context of this study, the DQN framework was applied to develop an enhanced trading strategy that incorporates both Bitcoin historical price data and Twitter sentiment data.

Preprocessing DQN

As described previously, the proposed M-DQN consists of two parts: (1) Preprocessing DQN and (2) Main Trading DQN. The Preprocessing DQN derives the input data of the Main DQN from the original data. For this purpose, two different DQNs were constructed to handle the different datasets: Trade-DQN, which uses Bitcoin price data, and Predictive-DQN, which uses Bitcoin price and tweet sentiment data. A detailed explanation of the Preprocessing DQN is provided below.

Trade-DQN with Bitcoin price data

Figure 3. Trade-DQN model structure [18].

In the Trade-DQN model, the agent attempts to maximize short-term profits in the Bitcoin market, learning from features related to market conditions, relationships between historical Bitcoin prices, and the agent’s current financial position. Over time, the agent learns how to make optimal investment decisions: buy, sell, or hold Bitcoin.

The agent interacts with its environment, which is defined as an hourly Bitcoin market. That is, the agent observes the environment and receives hourly Bitcoin price data as a state, chooses an action based on the policies learned during training, and obtains a reward for the actions taken. In this study, the state, action, and reward at time t are denoted by \(s_t\), \(a_t\), and \(r_t\), respectively, for all DQN models.

In the Trade-DQN step, the state is defined as \(s_t:=AP_t\), where \(AP_t\) is the actual Bitcoin price at time t. The Bitcoin price is considered up to the second decimal place and used as a discrete value. The action of the agent is \(a_t \in \{buy, hold, sell\}\), i.e., the agent can perform three types of actions: buy, hold, and sell Bitcoins, as shown in Fig. 3. The reward \(r_t\) is designed to encourage the agent to make profitable trades and discourage unprofitable or indecisive actions. If the agent chooses to “hold,” it receives zero reward (\(r_t=0\)). However, if the “hold” action is repeated twenty or more times in a row, the agent is punished with a negative reward (\(r_t=-1\) if the number of consecutive “hold” actions \(m \ge 20\)). After each “sell” action, the agent receives a reward, which may be negative or positive depending on the profitability of the sale. It is calculated by subtracting the last purchasing price, denoted by \(P_{buy}\), from the selling price, denoted by \(P_{sell}\), so that \(r_t=P_{sell}-P_{buy}\). If the agent repeatedly chooses the “buy” action and the number of consecutive purchases exceeds the limit (in this case 20), the agent receives a negative reward (\(r_t=-1\)). This prevents the agent from making many sequential purchases and improves its performance.
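The reward rules above can be summarized in a short sketch. The function and parameter names are hypothetical, and the zero reward for a “buy” below the limit is an assumption, since the text only states the penalty case for purchases.

```python
HOLD_LIMIT = 20  # from the text: penalty once consecutive holds reach 20
BUY_LIMIT = 20   # from the text: penalty once consecutive buys exceed 20


def trade_reward(action, consecutive_holds=0, consecutive_buys=0,
                 sell_price=None, last_buy_price=None):
    """Return the Trade-DQN reward for one action under the stated rules."""
    if action == "hold":
        return -1.0 if consecutive_holds >= HOLD_LIMIT else 0.0
    if action == "buy":
        # Assumption: a buy within the limit yields zero reward.
        return -1.0 if consecutive_buys > BUY_LIMIT else 0.0
    if action == "sell":
        # Profit of the sale relative to the last purchase price.
        return sell_price - last_buy_price
    raise ValueError(f"unknown action: {action}")


print(trade_reward("hold", consecutive_holds=5))
print(trade_reward("hold", consecutive_holds=20))
print(trade_reward("sell", sell_price=105.0, last_buy_price=100.0))
print(trade_reward("buy", consecutive_buys=21))
```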

The Trade-DQN model is a four-layer network designed to suggest one of three possible actions: buy, sell, or hold a position. The first layer has 64 hidden units; the second layer has 32; the third layer has eight; and the last layer contains three units, corresponding to the number of possible actions. The first three layers use the rectified linear unit (ReLU) activation function, and the last layer uses a linear function. The mean squared error (MSE) is used as the loss function. The final outputs are used to assess the confidence indicators for each of the three available actions.

Predictive-DQN with Bitcoin price and tweet sentiment data

Figure 4. Predictive-DQN model structure.

In Predictive-DQN, sentiments are extracted from Bitcoin-related tweets and separated into positive (compound score between 0 and 1), neutral (score of 0), and negative (score between 0 and −1) categories, as described above. By leveraging these sentiment scores and employing the DQN algorithm, the objective is to construct a model capable of predicting future Bitcoin prices. Hence, in this model, the state \(s_t\) is defined by the pair \(s_t:=[AP_t, TS_t]\), where \(TS_t\) is the Twitter sentiment score at time t. Because \(AP_t\) and \(TS_t\) are considered only up to the second decimal place, the state space is discrete. Based on this state, the action of the agent is defined as a number \(a_t \in \{-100, -99, \ldots ,0, \ldots ,99,100\}\), representing the predicted future price as a percentage change from its current value (Fig. 4). The comparative difference reward (CDR) function designed in our previous work [41] was adapted as the reward function. This reward function is designed to provide more nuanced feedback to the model based on the accuracy of its predictions. The CDR function considers the rate of change in the actual Bitcoin price and introduces the concept of a zero-reward value, which was defined in [41] as follows:

Definition 1

[41] Let \(\alpha = (AP_t - AP_{t-1})/AP_{t-1}\), where \(AP_t\) is the actual price of Bitcoin at time t and \(AP_{t-1}>0\), i.e., \(\alpha\) is the rate of change in the actual price. Let \(PP_t\) be the predicted price at time t and let \(l=AP_t - PP_{t-1}(1+\alpha )>0\). The point whose difference from \(AP_t\) is l is referred to as the zero-value reward (\(ZR_t\)) at time t.

The reward value is then computed based on whether the predicted price (\(PP_t\)) is lower or higher than the actual price (\(AP_t\)). Therefore, two ZRs exist, as shown in Fig. 5: \(ZR_t^1\) for the former case and \(ZR_t^2\) for the latter. If \(PP_t\) is smaller than \(AP_t\), the agent receives a negative reward if \(PP_t < ZR_t^1\) and a positive reward if \(PP_t\) is between \(ZR_t^1\) and \(AP_t\). Mathematically, the reward value is calculated as follows:

$$\begin{aligned} r_t =\frac{PP_t-ZR_t^1}{AP_t-ZR_t^1} \times 100\% \end{aligned}$$


If \(PP_t\) is higher than \(AP_t\), the reward is positive if \(PP_t\) is between \(AP_t\) and \(ZR_t ^2\), and the reward is negative if \(PP_t\) is higher than \(ZR_t ^2\). In this case, the equation for calculating the reward value is:

$$\begin{aligned} r_t =\frac{PP_t-ZR_t^2}{AP_t-ZR_t^2} \times 100\% \end{aligned}$$


In both cases, the reward increases as the predicted price approaches the actual price and reaches 100% when the two coincide. The CDR function thus provides more detailed feedback to the model, allowing it to better adjust its predictions over time.
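The two reward formulas can be sketched in code as follows. This reading assumes that \(ZR_t^1\) lies at distance l below the actual price and \(ZR_t^2\) at distance l above it, which is how Definition 1 and Fig. 5 are interpreted here; the function names and the example prices are hypothetical.

```python
def zero_reward_distance(actual, previous_actual, previous_predicted):
    """Return l = AP_t - PP_{t-1} * (1 + alpha), with alpha the price change rate."""
    alpha = (actual - previous_actual) / previous_actual
    return actual - previous_predicted * (1 + alpha)


def cdr_reward(predicted, actual, distance):
    """Return the CDR reward in percent for one prediction."""
    if distance <= 0:
        raise ValueError("distance l must be positive")
    if predicted <= actual:
        zero_point = actual - distance  # ZR_t^1, assumed below the actual price
    else:
        zero_point = actual + distance  # ZR_t^2, assumed above the actual price
    return (predicted - zero_point) / (actual - zero_point) * 100.0


# Hypothetical example: actual price 100 with zero-reward points at 90 and 110.
for predicted in (85.0, 90.0, 95.0, 100.0, 105.0, 115.0):
    print(predicted, cdr_reward(predicted, actual=100.0, distance=10.0))
```

In this example the reward is 100 for an exact prediction, falls to 0 at either zero-reward point, and becomes negative beyond them.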

The Predictive-DQN model comprises five layers and is designed to output a number between −100 and 100 with up to two-decimal precision. As previously mentioned, this number represents the percentage change from the actual price. The first layer serves as the input layer and has two neurons for the two input features: the actual Bitcoin price and the sentiment score. The three subsequent layers, referred to as the dense layers, contain 64 hidden units each. The final output layer comprises 20,001 units, corresponding to the number of possible actions, i.e., all numbers between −100 and 100 with up to two-decimal precision. The ReLU function serves as the activation function for the three hidden layers, whereas the output layer uses a linear function. MSE was adopted as the loss function.
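To make the size of the output layer concrete, the sketch below maps an output-unit index to a percentage change and back, assuming the 20,001 units are ordered from −100.00 to 100.00 in steps of 0.01. The ordering, the helper names, and the example price are assumptions for illustration.

```python
N_ACTIONS = 20001  # from the text: all values in [-100, 100] at 0.01 resolution
OFFSET = 10000     # index of the unit that represents a 0.00% change


def index_to_percent(index):
    """Map an output-unit index to the percentage change it represents."""
    if not 0 <= index < N_ACTIONS:
        raise ValueError("index out of range")
    return (index - OFFSET) / 100


def percent_to_index(percent):
    """Map a percentage change (two-decimal precision) to its output-unit index."""
    return round(percent * 100) + OFFSET


def predicted_price(actual_price, percent):
    """Apply a predicted percentage change to the current price."""
    return actual_price * (1 + percent / 100)


print(index_to_percent(0), index_to_percent(OFFSET), index_to_percent(N_ACTIONS - 1))
print(percent_to_index(1.25))
print(predicted_price(200.0, 50.0))
```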

Figure 5. Computation of zero-value reward [41].

To summarize, the results from the Predictive-DQN model were positive, achieving 86.13% accuracy and drawing attention to the effects of Bitcoin-related tweets on Bitcoin futures price changes. Therefore, in this study, to develop an efficient Bitcoin trading strategy, a unique dataset was proposed to include market decisions and market prediction information for Bitcoin.

Main trade recommendation DQN with integrated data

Figure 6. Proposed M-DQN model. In the Preprocessing DQN, Trade-DQN receives Bitcoin price data as input and generates initial trade decisions as output \(x_1\); Predictive-DQN receives Bitcoin price data along with the tweet sentiment score as input and generates the predicted percentage change as output \(x_2\). In the Trade Decision DQN, Main-DQN uses the two preprocessed outputs [\(x_1,x_2\)] as input to generate the final trade decision.

In the Main-DQN, which makes the final trade decision, the output data of the two aforementioned Preprocessing DQNs are used for learning. Therefore, the performance of the proposed Main-DQN model depends on the outputs of Trade-DQN, which provides trade recommendations, and Predictive-DQN, which offers futures price predictions. Before providing a detailed explanation of the Main-DQN model, the process of leveraging these outputs is described.

Data integration

First, two different types of output data are merged into one. Combining these datasets allows us to develop a more comprehensive trading strategy that leverages the strengths of both data sources.

Large-scale timespans were considered for both datasets. As there was a slight difference in the time periods covered by each, to maintain data integrity and consistency, we identified overlapping periods between the two datasets. Furthermore, to satisfy the research objective of uncovering correlations between variables within these datasets, it is crucial that the data originate from a consistent timeframe. This alignment guarantees that genuine relationships are not distorted by variations over time. The overlapping time periods in the datasets were identified as spanning from October 1, 2014 to November 14, 2018, comprising 1505 days. This overlapping period enabled us to effectively combine the datasets and ensure the integrity of the analysis.

Next, the two datasets were merged into a single dataset by aligning trading recommendations and futures price predictions based on their respective timestamps. For each hour within the overlapping period, the corresponding trading recommendations and futures price predictions were placed in the same row. This approach facilitates the seamless integration of data, allowing a more effective examination of the relationship between trading recommendations and futures price predictions.
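The timestamp alignment can be sketched as an inner join on the hours that both outputs share. The timestamps and values below are made up for illustration, and the dictionary-based layout is an assumption about how the outputs are stored.

```python
def merge_by_hour(trade_recs, price_preds):
    """Inner-join two {timestamp: value} mappings on their shared timestamps."""
    shared = sorted(set(trade_recs) & set(price_preds))
    # One row per shared hour: (timestamp, recommendation, predicted change).
    return [(ts, trade_recs[ts], price_preds[ts]) for ts in shared]


# Hypothetical hourly outputs of Trade-DQN (x1) and Predictive-DQN (x2).
trade_recs = {
    "2014-10-01 00:00": 1,
    "2014-10-01 01:00": 0,
    "2014-10-01 02:00": -1,
}
price_preds = {
    "2014-10-01 01:00": 0.42,
    "2014-10-01 02:00": -1.10,
    "2014-10-01 03:00": 0.05,
}

rows = merge_by_hour(trade_recs, price_preds)
for row in rows:
    print(row)
```

Only the two hours present in both mappings survive the join, mirroring the restriction to the overlapping period.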

Given that the objective was to develop an hourly trading strategy, the total number of hours within the experimental period was calculated by multiplying the number of days (1505) by 24 hours. This resulted in 36,120 hours of data, which formed the basis of our dataset. Therefore, the final dataset comprised 36,120 rows, with each row representing an hour within the experimental period. Each row contained trading recommendations and futures price predictions for a specific hour. This integrated dataset enabled us to explore the synergistic potential of combining Bitcoin historical price data with Twitter sentiment analysis, ultimately aiming to enhance our trading strategy and improve its performance in the volatile Bitcoin market.
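The day and hour counts can be checked with standard date arithmetic. Note that the difference between the two dates counts the start date but not the end date, which is the convention that yields 1505 days.

```python
from datetime import date

start = date(2014, 10, 1)
end = date(2018, 11, 14)

n_days = (end - start).days  # end date itself is not counted
n_hours = n_days * 24

print(n_days, n_hours)
```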

Modeling the main DQN

After obtaining the integrated dataset, the DQN-based Bitcoin trading model (Main-DQN) was built. Similar to the preprocessing models, the proposed DQN model, which acts as an agent, interacts with the environment represented by the Bitcoin market. The MDP elements of state, action, and reward are defined as follows:

  • State Space \(\mathcal{S}\): A state represents the current situation in the market, which is crucial for making informed decisions. In our study, each row of the prepared dataset describes hourly data of historical price and Twitter sentiment and is considered a state \(s_t:=[x_1, x_2] \in \mathcal{S}\) at time t. Specifically, each state is a two-dimensional array, where the first element \(x_1\) can be either −1, 0, or 1, indicating a sell, hold, or buy recommendation, respectively, and the second element \(x_2\) is a number between −100 and 100, representing the predicted futures price change percentage based on Twitter sentiment, as shown in Fig. 6.

  • Action Space \(\mathcal{A}\): An action represents the decision made by the agent at a particular state. In our trading task, the action of the agent \(a_t \in \mathcal{A}\) at time t is defined as the final decision on trading, which can represent one of three options: buy a Bitcoin from the market, hold, or sell. The agent learns how to choose the most suitable action based on the state information and experience, aiming to maximize the expected cumulative reward.

  • Reward Function r: For the trading task, three aspects are considered important for effective trading: gaining high profits, keeping risk at a low level, and maintaining active trading. Based on these, the proposed reward function evaluates the agent’s decision and guides the learning process. As mentioned earlier, achieving high returns is one of the components of an effective trading strategy. In this study, the agent’s performance is assessed in terms of gains through profit and loss (PnL) calculations after each decision. Transaction fees, which are charges incurred by traders when conducting buy or sell actions in the market, play a role in determining the PnL. These fees vary depending on the trading platform; however, they typically range between 0.1 and 1.5% of the trade value [55]. For the purpose of this study, a constant 1.5% transaction fee rate was assumed. Regardless of the transaction fee, a PnL can only be generated when a trade has both a purchasing and a selling price. Therefore, when the agent decides to buy or hold, it receives zero reward, and as soon as it decides to sell, the buying price of Bitcoin and all transaction fees are subtracted from the selling price. If the resulting value is positive, the agent receives an equivalent positive reward. If the value is negative, the agent receives a penalty equal to the negative value. Thus, the reward function considers both the profit potential and transaction costs involved in trading. To define the reward function mathematically, the following notation is introduced: \(P_k^{buy}\) and \(P_k^{sell}\) represent the buying and selling price values of the Bitcoin for a given order k (i.e., the 1st Bitcoin, 2nd Bitcoin, and so forth), whereas \(c_k^{buy}\) and \(c_k^{sell}\) refer to the transaction fees incurred during the purchase and sale of the kth Bitcoin, respectively. Under these terms, the PnL for each Bitcoin transaction can be computed as:

    $$\begin{aligned} PnL_k =P_k^{sell} - P_k^{buy} - c_k^{buy} - c_k^{sell}. \end{aligned}$$


    This formula quantifies the net profit (or loss) obtained from the kth Bitcoin transaction after accounting for the transaction fees. The reward value at time-step t is thus equal to \(PnL_k\) (\(r_t=PnL_k\)). The second critical factor for an effective trading strategy is to maintain a low risk level. In this work, the risk level is described as the percentage of the investment that the model is allowed to risk. After each decision made by the agent, the amount of investment is calculated, and if it is below a certain threshold, the agent receives a penalty. By contrast, if the investment is at or above the threshold, the agent receives zero reward. More precisely, \(I_{current}\) is defined to represent the current value of the investment, which is determined at each time-step t after purchasing Bitcoin. This value is computed by deducting all relevant expenses (\(P_k^{buy}\) and \(c_k^{buy}\)) from the initial investment amount, \(I_{Initial}\). The threshold, denoted by \(\alpha\), represents the maximum permissible part of the investment that the agent is allowed to risk. The final factor for an efficient trading strategy is to maintain active trading, which refers to buying or selling Bitcoins in the market rather than simply holding them. Encouraging the agent to engage in active trading is essential for capitalizing on market opportunities and adapting to changing market conditions. To promote active trading, a threshold on the number of active trades is established and monitored. Let m be the total number of purchases and sales and \(\omega\) be the threshold for active trades. The agent receives a negative reward only when the number of active trades (m) exceeds the threshold: \(r_t = -1\) if \(m > \omega\). The reward \(r_t\) can then be computed depending on the action \(a_t\) by:

    $$\begin{aligned} r_t= \left\{ \begin{array}{ll} -1, & \text{if } m>\omega \text{ or } a_t = buy \text{ with } I_{current} < \alpha \\ 0, & \text{if } m \le \omega ,\ a_t = buy \text{ with } I_{current} \ge \alpha \text{ or } a_t = hold\\ PnL_k, & \text{if } m \le \omega ,\ a_t = sell \end{array} \right. \end{aligned}$$


    In the equation above, when the action is buy or sell, m becomes \(m+1\). This formula provides a way to measure the instantaneous reward obtained at each time step, given the current state of the investment, the cost of the transaction, and the predefined risk threshold. To determine the optimal threshold, three different risk levels are considered, whereby the agent is trained separately for each level: 30% (low), 55% (medium), and 80% (high) [56]. The performance of the agent under these different risk levels is analyzed and reported in the “Experiment and results” section. Furthermore, as the Bitcoin market operates 24/7 and our dataset is reported hourly, the maximum number of possible active trades is 24 per day. To explore the optimal number of trades for the proposed model, we defined three different thresholds: up to 8, 16, and 24 active trades per day. Testing these thresholds shows how the agent’s trading activity impacts performance. The results of the experiments conducted with these thresholds are presented in the “Experiment and results” section, showcasing the effectiveness of the proposed trading strategy under various levels of trading activity.
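The PnL and the piecewise reward can be sketched together as follows. This is an illustrative reading rather than the authors' implementation: each fee is assumed to be 1.5% of the price of its own leg of the trade, the threshold is passed in the same units as the current investment value, and all function and parameter names are hypothetical.

```python
FEE_RATE = 0.015  # constant 1.5% transaction fee assumed in the text


def pnl(buy_price, sell_price, fee_rate=FEE_RATE):
    """Return sell price minus buy price minus the fees on both legs."""
    buy_fee = fee_rate * buy_price    # assumption: fee proportional to buy price
    sell_fee = fee_rate * sell_price  # assumption: fee proportional to sell price
    return sell_price - buy_price - buy_fee - sell_fee


def main_reward(action, n_trades, max_trades, current_investment, threshold,
                buy_price=None, sell_price=None):
    """Return r_t following the three cases of the piecewise definition."""
    if n_trades > max_trades:                 # m > omega
        return -1.0
    if action == "buy":
        # Penalty when the remaining investment falls below the threshold.
        return -1.0 if current_investment < threshold else 0.0
    if action == "hold":
        return 0.0
    if action == "sell":
        return pnl(buy_price, sell_price)
    raise ValueError(f"unknown action: {action}")


print(pnl(10000.0, 10500.0))
print(main_reward("buy", 3, 8, current_investment=200.0, threshold=300.0))
print(main_reward("sell", 3, 8, 500.0, 300.0, buy_price=10000.0, sell_price=10500.0))
```

For a purchase at 10,000 and a sale at 10,500, the fees come to 150 + 157.5, so the gross gain of 500 shrinks to a PnL of about 192.5.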

By incorporating these factors into the reward function, the goal is to create an agent capable of making effective trading decisions that balance risk, profitability, and trading activity, ultimately yielding an optimal trading strategy. The key components and layers of the architecture used in our trading strategy are outlined next, together with the role of each layer in extracting meaningful information from the input data and estimating the action-value function Q(s, a). The model architecture begins with an input layer that has two neurons, which capture the information needed for making trading decisions. These neurons represent the market action (−1, 0, or 1) and the price prediction score (ranging from −100 to 100).

The DQN model contains three fully connected dense layers, each containing 64 neurons. These layers aim to capture the complex relationships between the input features and actions. Increasing the number of neurons and layers allows the model to learn more complex patterns, at the expense of increased computational cost and risk of overfitting. Each dense layer utilizes a ReLU activation function to prevent vanishing gradient issues during training. The choice of the activation function is critical, as it introduces nonlinearity into the model, enabling the learning of intricate patterns.

The output layer of the model comprises three neurons corresponding to the three possible actions: buy, sell, or hold. These neurons represent the Q-values for each action, given the current state, and guide the decision-making process. To minimize the difference between the predicted and target Q-values, the model uses the MSE loss function. This choice of loss function determines how the model learns from errors and updates its parameters. The Adam optimizer was employed in this model because it strikes a balance between fast convergence and stability during training.
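The layer sizes described above can be sketched as a plain forward pass. This is a dependency-free illustration with random, untrained weights; the linear output layer, the ordering of the output neurons, the weight initialization, and the greedy action choice are assumptions, and training with MSE and Adam is not shown.

```python
import random

LAYER_SIZES = [2, 64, 64, 64, 3]   # input, three dense layers, output (from the text)
ACTIONS = ["buy", "hold", "sell"]  # assumed ordering of the three output neurons


def init_params(sizes, seed=0):
    """Create small random weights and zero biases for each layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def forward(params, inputs):
    """Return one Q-value per action for a state [x1, x2]."""
    values = list(inputs)
    for layer, (weights, biases) in enumerate(params):
        values = [sum(w * v for w, v in zip(row, values)) + b
                  for row, b in zip(weights, biases)]
        if layer < len(params) - 1:
            values = [max(0.0, v) for v in values]  # ReLU on hidden layers only
    return values


def count_params(sizes):
    """Count weights and biases across all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


params = init_params(LAYER_SIZES)
q_values = forward(params, [1, 2.5])  # x1 = buy recommendation, x2 = +2.5%
greedy_action = ACTIONS[q_values.index(max(q_values))]
print(q_values, greedy_action, count_params(LAYER_SIZES))
```

A network of this size has 8,707 trainable parameters, which is small enough that the cost of training is dominated by the number of transitions rather than by the model itself.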

