Why Deep Reinforcement Learning Can Help Improve Trading Efficiency

Viktor Tachev
12 min read · May 4, 2019

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

— Mission Statement, Dartmouth Conference, 1956

It is interesting to see how the mission statement of the “Dartmouth Summer Research Project on Artificial Intelligence”, one of the most prominent events in the history of AI development, was so far ahead of its time. Looking at the first concepts, shaped more than 60 years ago, really puts into perspective how long it took us to reach the point where machines can start learning by doing. And today, thanks to deep reinforcement learning, machines are finally capable of replicating a more advanced decision-making process that was previously out of reach.

What is Deep Reinforcement Learning?

Imagine that you wake up one morning, totally committed to the idea of getting yourself a cat. You head straight to the closest shelter to adopt your new pal. Once you get back home, you start introducing your tiny new roommate to its new home. At first, the cat looks a little confused by the unfamiliar environment. After a few minutes, though, it cautiously starts taking its first steps around and exploring. What it will do next is try to figure out and get used to the new surroundings. A sniff here and there and a paw-touch for this and that is the way it starts getting familiar with the environment. In effect, it collects information on how the surrounding objects respond to its actions. That way, the cat can analyze the feedback and settle into its new home far quicker. The conclusions it draws from the environment’s reactions to its actions help it make informed choices in the future. The collected feedback improves its decision-making on questions like which spot is best for a good nap, which area of the floor gets the most sunlight, which plants are good to eat and which are of the don’t-touch-or-you-will-get-spanked variety, and so on. The longer the cat interacts with its surroundings, the better decision-maker it becomes* and the more comfortable it starts to feel.

*at least in theory

Deep reinforcement learning works in much the same way. The core idea is built around an agent that is placed in an entirely new environment and has to learn to make the best decisions given the specifics of its circumstances. Because the situation is new, the agent has no previous experience or prior knowledge of what consequences to expect from its interactions with the surrounding objects. Its first goal, therefore, is to collect information about the environment that can later help it make informed decisions. It does so by testing how the objects around it react to its actions. To help it build a sense of what counts as a “good” or a “bad” decision, it is rewarded for each positive choice and penalized for each negative one. This distinction is usually set by the agent’s operator. The more the agent interacts with the environment, the more it learns and the better it becomes at making the right decisions.
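
To make that loop concrete, here is a minimal sketch of the reward-driven, trial-and-error process in Python. It uses tabular Q-learning rather than a deep network, and it assumes a generic environment with hashable states and a reset()/step() interface; the function name, parameters and defaults are illustrative only, not taken from any particular library.

```python
import random

def train(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: the simplest version of the trial-and-error loop
    described above. A deep RL agent replaces the table with a neural network."""
    q = {}  # (state, action) -> estimated long-term value

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore occasionally; otherwise exploit what has been learned so far.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q.get((state, a), 0.0))

            # The environment responds with a new state and a reward (or penalty).
            next_state, reward, done = env.step(action)

            # Nudge the value estimate towards reward + discounted future value.
            best_next = max(q.get((next_state, a), 0.0) for a in actions)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
    return q
```

The reward signal is the only supervision the agent receives; everything else it works out from its own interactions with the environment.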

But how is this different from machine learning or deep learning models where a human operator is also involved in the process? The main difference is that deep reinforcement learning agents have a significant degree of freedom. Although at first they are taught what is considered “good” and “bad”, over time they build on that logic and develop it further, based on their individual experience. So, in a sense, they become independent operators that base their decisions on self-gathered feedback. This allows DRL agents to move beyond what their operators explicitly taught them and, thanks to their far greater capacity, tackle more complex problems.

Deep Reinforcement Learning in Practice

The first major breakthrough in applying DRL came from DeepMind and its 2013 work on achieving human-level control in 7 Atari games. Later on, the project was expanded to focus on training DQN agents (bots powered by DeepMind’s deep Q-network algorithm) on 49 different Atari games, with no prior knowledge of the rules. The results, presented in DeepMind’s paper, showed human-level performance on roughly half of the games, with the agent even outperforming a professional human games tester in some of them. At the time of publishing, this was a very significant achievement compared to existing machine learning methods.

Source: DeepMind (https://deepmind.com/blog/deep-reinforcement-learning/)

Over time, the team managed to considerably improve the DQN algorithm, leading to a 300% improvement in the mean score across all Atari games. This allowed the algorithm to achieve human-level performance in almost all of the test games. In addition, a single DQN network even became capable of learning to play multiple Atari games.

A video that illustrates the improvement in the performance of DQN after 100, 200, 400 and 600 training scenarios. Source: DeepMind

Although these advancements were valuable breakthroughs, there was one limiting factor: they were applied in a 2D environment. DeepMind’s team needed something more challenging.

Labyrinth, a suite of 3D navigation and puzzle-solving environments, presented deep reinforcement learning algorithms with new challenges: figuring out an unknown terrain, discovering rewards and exploiting any available benefits. This new environment helped the algorithms reach human-level performance and out-of-the-box decision-making on many Labyrinth tasks. It paved the way for the concept of an agent that learns to navigate an unknown environment and overcome obstacles on the go. The following video visualizes the conclusions from DeepMind’s paper on locomotion behaviours in rich environments.

Emergence of Locomotion Behaviours in Rich Environments: https://arxiv.org/pdf/1707.02286.pdf

Aside from that, DeepMind also developed a highly successful poker player for heads-up Texas Hold’em. Although poker is a game of imperfect information, the program still managed to win 3 silver medals in a 2014 competition. This marked a major breakthrough in the mission of developing agents capable of operating in imperfect-information environments, and a solid foundation for the future application of deep reinforcement learning algorithms on financial markets.

All the gradual improvements in deep reinforcement learning methodology led to the most popular example of computers’ dominance over humans: AlphaGo, DeepMind’s algorithm that defeated the reigning world Go champion. In 2016, AlphaGo beat Lee Sedol, an 18-time world Go champion, by 4 games to 1. This was not the first time a machine had beaten a human in a two-way contest (IBM’s Deep Blue defeated Garry Kasparov at chess in 1997), but the achievement was renowned for its importance because of Go’s complexity. Unlike chess, Go has an almost unimaginable number of possible moves, which further emphasizes the intricacy of the machine’s decision-making process.

At some point, all this remarkable progress in deep reinforcement learning raised the question of whether it could be applied to trading. Unlike games, however, markets are filled with randomness and their states are nondeterministic. In a game, for example, when Pac-Man sees a ghost approaching, the only rational way to escape is to change the direction of its movement; otherwise, it will fail and the game will end. When it comes to trading, things are quite different, as the states are random and very hard to predict. Even so, deep reinforcement learning has the potential to disrupt the trading field, given its advantages over existing algorithmic models.

Source: Pixelart

Deep Reinforcement Learning in Trading

For now, most studies conducted within the DRL framework relate to robotics and gameplay. There is not much work focused on the application of deep reinforcement learning in trading. Although adoption is at an early stage, some studies already suggest that DRL algorithms can outperform existing automated trading mechanisms. This means deep reinforcement learning can help handle some of the most complex issues that are typical of financial markets:

  • Markets require efficient handling of extensive streams of continuous data;
  • Agents’ actions may have long-term consequences that other machine-learning mechanisms are unable to measure;
  • Agents’ actions also have short-term effects on current market conditions, which makes the environment highly unpredictable.

Now let’s find out how deep reinforcement learning’s application to trading can work in practice. In applying DRL to financial markets, researchers have narrowed down the architecture of trading algorithms to the following components:

1. Agent

In this case, the agent is the trader. He opens his trading account, checks the current market situation and makes a trading decision.

2. Environment

The environment where the trading agent operates is, obviously, the market. However, the market is full of other agents as well — both human and computer-driven. The interaction between all agents within the particular environment is what makes things quite complicated.

The usual methodology when trading is as follows: the agent makes a move (places an order) and waits for the market’s reaction (the order is or isn’t executed). Based on the environment’s feedback, the agent then takes another action (submits a new order, changes the terms of the existing one, or remains passive). So what the agent does is analyze the response of the environment and decide on his further moves based on the specifics of the current state. The agent does not have control over the environment or the actions of other agents. All he can do is react to them.

3. State

The state of the environment (the market) is usually unclear to the agent (unless he has some sort of insider information or technological advantage over other market participants): he is not aware of the number of other agents, their actions, their positions, order specifications, etc.

4. Reward

The reward function is another characteristic that is crucial for the success of a deep reinforcement learning algorithm. If the reward function naively maximizes absolute potential profit, the algorithm will start placing highly risky bets, underestimating the potential losses in the name of reaching its ultimate goal. In reality, traders strive for optimal Sharpe ratios, and research suggests that risk-adjusted rewards of this kind work better for DRL trading algorithms than raw profit. A minimal sketch of such an environment and reward function is shown below.
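
To tie the four components together, here is a small, purely illustrative sketch of a single-asset trading environment with a Sharpe-style reward. The class name, the state representation (recent log returns plus the current position) and the discrete short/flat/long action set are simplifying assumptions made for the example, not the design of any specific study.

```python
import numpy as np

class MinimalTradingEnv:
    """A toy single-asset environment in the agent/state/reward framing above.
    Everything here is deliberately simplified and illustrative."""

    def __init__(self, prices, window=30):
        self.prices = np.asarray(prices, dtype=float)
        self.window = window            # number of past returns the agent observes
        self.reset()

    def reset(self):
        self.t = self.window            # current time step
        self.position = 0               # -1 short, 0 flat, +1 long
        self.strategy_returns = []      # realized per-step strategy returns
        return self._state()

    def _state(self):
        # The agent sees only recent log returns and its own position; it has no
        # view of other participants' orders (partial observability).
        recent = np.diff(np.log(self.prices[self.t - self.window:self.t + 1]))
        return np.append(recent, self.position)

    def step(self, action):
        # action: 0 = go short, 1 = stay flat, 2 = go long
        self.position = action - 1
        price_return = np.log(self.prices[self.t + 1] / self.prices[self.t])
        self.strategy_returns.append(self.position * price_return)
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self._state(), self._reward(), done

    def _reward(self):
        # Risk-adjusted reward: a rolling Sharpe-like ratio rather than raw
        # profit, so the agent is not encouraged to chase outsized risks.
        r = np.array(self.strategy_returns[-self.window:])
        if r.std() < 1e-8:
            return 0.0
        return float(r.mean() / r.std())
```

An agent would interact with this environment in the usual loop: observe the state, choose an action, collect the reward and update its policy. With a continuous state vector like this one, the policy would typically be a neural network rather than a lookup table, which is where the “deep” part comes in.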

Why Deep Reinforcement Learning Can Help Improve Trading Efficiency

The application of deep reinforcement learning to trading still remains largely unexplored. However, the promise that the learning mechanism has shown in other fields has attracted researchers’ interest and channeled efforts towards exploring DRL for trading.

For now, the odds that trading can be disrupted look promising, thanks to some of deep reinforcement learning’s main advantages:

  • It builds upon the existing algorithmic trading models

Most existing algorithmic models are designed with two main components: strategy and implementation. The strategy is usually designed by the trader (autonomous trading mechanisms do exist, although most of them still need some passive human guidance and are yet to prove their efficiency), while the implementation is handled by the machine. Although this may seem like a promising human-machine symbiosis, in practice a crucial element in the system often breaks down, and the trading results leave a lot to be desired. One of the main challenges, for example, is picking appropriate, unbiased and representative financial data. The inability to do so is often a trader’s fault rather than a purely technical limitation. Although the data quality problem is a widely familiar weakness in the design of efficient trading algorithms, it still remains an issue and is often addressed quite poorly.

However, with the major advancements that deep reinforcement learning brings, we are getting closer to the next level of autonomous trading system: one where the machine is in charge of both the strategic and the implementation parts. Deep reinforcement learning mechanisms, as shown in other fields, can help us construct portfolio management systems with basic cognitive skills that allow better navigation of a complex stochastic environment like financial markets.

  • The self-learning process suits the ever-evolving market environment

Financial markets are dynamic structures. Trading nowadays is very different from what it was 15 or 20 years ago. Today, we have plenty of different order types, exotic trading strategies, new asset classes, technologically enhanced market participants and many other market variables that were not present a while ago. The markets of today are also turbulent, with increased volatility, lower liquidity and periodic flash crashes. These factors influence the way we trade and thus result in the formation of short-lived or unique patterns. Such patterns are usually very hard to identify in historical data and can also turn out to be irrelevant for the accurate prediction of future market movements. Even if similar patterns are found in historical data, there is no guarantee that the next time they occur, the outcome will be similar. All this has affected even one of the most sophisticated funds in the world, Renaissance Technologies. The hedge fund has reduced the use of pattern-based strategies for futures trading in its Renaissance Institutional Diversified Alpha (RIDA) fund by more than 60%. However, RIDA still remains head-and-shoulders ahead of its competition, gaining 3.23% in 2018 compared to the industry’s average of -4.75%, according to HFR data. As for other trading firms, many hedge funds have given up on trend-following strategies after struggling to replicate past returns.

What this means is that, for an automated trading mechanism to be successful, it has to be flexible and capable of adjusting to the present situation rather than relying solely on past information, just as its human colleagues do. For now, deep reinforcement learning is the closest thing we have to the way we learn. Just like us, DRL agents learn on the go and by doing. This means they get better at taking real-time decisions based on the specific characteristics of the moment. At its core, learning is an interactive process that requires feedback from both sides. That is why it is reasonable to expect that deep reinforcement learning agents will become better at navigating financial markets, thanks to their ability to adapt to changing environments and learn from the immediate results of their actions.

  • It brings more power and efficiency to a high-density environment

Financial markets can be considered a high-density environment because of all the variables affecting the result of a trading decision. Even the most basic order requires the computer to make a set of decisions about price, size, order time, duration, type, etc. All these factors require the algorithm to make a set of interconnected decisions, such as: at what price to buy or sell, and in what quantity; whether to place a single order or multiple orders at the same or different prices; whether to trade at one venue or several; whether to act sequentially or simultaneously, and so on.

To put it into perspective, the average chess game is approximately 40 moves long, and a Go game around 200. According to estimates in the JP Morgan team’s “Idiosyncrasies and challenges of data-driven learning in electronic trading” study, a medium-frequency electronic trading algorithm that reconsiders its options every second will make 3,600 moves per hour. Each of those moves, in turn, involves a collection of child orders with different characteristics (price, order type, size, etc.). What all this means is that financial markets are far too complex for straightforward algorithms: the action space keeps expanding as the possible combinations of moves and characteristics change dynamically at any given point in time. The sketch below illustrates how quickly these combinations add up.
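
As a rough illustration of that expanding action space, the snippet below enumerates a small, entirely hypothetical set of order attributes; the specific price offsets, sizes, order types and venues are made-up values for the example, not figures from the JP Morgan study.

```python
from itertools import product

# Hypothetical, discretised choices for a single child order.
price_offsets = range(-5, 6)           # ticks away from the mid price
sizes = [100, 200, 500, 1000]          # order size in shares
order_types = ["limit", "market", "iceberg"]
venues = ["venue_A", "venue_B", "venue_C"]

# Every combination of attributes is a distinct action at each decision point.
actions_per_step = len(list(product(price_offsets, sizes, order_types, venues)))
print(actions_per_step)                # 11 * 4 * 3 * 3 = 396 possible actions

# Reconsidering once per second, as in the example above:
decisions_per_hour = 60 * 60
print(decisions_per_hour)              # 3,600 decision points per hour
print(actions_per_step ** 2)           # 156,816 distinct two-step sequences already
```

Even with this toy discretisation, the number of possible multi-step sequences grows geometrically, which is exactly the kind of search space that simple rule-based algorithms struggle with.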

Are we there yet?

Although far more flexible and powerful, a deep reinforcement learning agent still requires millions of test scenarios to become proficient at what it does (think of AlphaGo, for example). Also, although these agents are closer to autonomy, they still require an operator to label their actions as either positive or negative, exactly like the case with your pet: should it decide that your new shoes are the perfect breakfast, the chance of it being rewarded is slim.

However, from a purely technological point of view, the reward part is quite tricky, as it can become the make-or-break element of the whole system. For example, if you reward a vacuum cleaner solely for the goal of having a clean living room, you may be surprised to find it has swept all the dirt into the corridor. What this means is that computers are still far from general intelligence and will continue to require human guidance, even if in a more limited form. For how long? It remains to be seen. But as things stand, we are closer to a fully autonomous trading system than we have ever been.

