Data and execution in the crypto far west

Quantitools
Feb 19, 2023


Building trading systems for cryptocurrencies is a different game. Protocols and APIs are newer than those of traditional asset classes, and therefore less iterated and more prone to errors. There is a big infrastructure black hole when it comes to developing systems: getting a consolidated feed for historical data and real-time execution is both expensive and complex.

One could argue that the existence of many segregated exchanges in cryptocurrencies is similar to the multiple stock exchanges of the US equities market, where the concepts of SOR (smart order routing), arbitrage and maker/taker models are well known and important.

However, there are big differences when we go deeper into the real implementation of models in these two worlds. Crypto exchanges are spread around the world, and most are unregulated, lightly regulated or self-regulated. Some change jurisdictions and server locations every few years. On top of this, the exchanges offering services for trading cryptocurrencies and tokens (mostly centralized companies) try to gain market share by establishing their own conventions for naming tickers and setting their own orderbook dynamics (tick size, min/max volume, latency…), and there is no incentive for them to converge on a protocol where everything is standardized.

But there is beauty in this mess. Although for most cryptocurrencies the concept of decentralization (in token governance) is not as clear as it was conceived with Bitcoin, the market still tends to provide openness, transparency and fair access to all players regardless of other factors. In other words, institutional players don't have the enormous advantage they enjoy in traditional assets.

CCXT is a powerful library designed to take care of most of these issues with a standard protocol that maps the APIs and websockets of the exchanges to a unified set of functions handling all the plumbing behind the scenes. Here is an example of an EMA calculation using ccxt and pandas, way simpler than building the connectors for Binance from scratch.

binance-ema.py
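The embedded gist isn't reproduced above, so here is a minimal sketch of what it might contain: fetching candles through ccxt and computing the EMA in pandas. The BTC/USDT symbol, the 1h timeframe and the 20-period span are illustrative assumptions, not taken from the original script.

```python
# binance-ema.py (sketch): pull OHLCV candles via ccxt, compute an EMA with pandas.
# Symbol, timeframe and EMA span below are illustrative choices.
import ccxt
import pandas as pd

exchange = ccxt.binance()

# Each candle: [timestamp_ms, open, high, low, close, volume]
candles = exchange.fetch_ohlcv("BTC/USDT", timeframe="1h", limit=500)

df = pd.DataFrame(candles, columns=["timestamp", "open", "high", "low", "close", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

# Exponentially weighted moving average of the close price
df["ema20"] = df["close"].ewm(span=20, adjust=False).mean()

print(df[["timestamp", "close", "ema20"]].tail())
```

Note that ccxt normalizes the symbol ("BTC/USDT") and the candle layout, so pointing the same code at another exchange is mostly a matter of swapping ccxt.binance() for another exchange class.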

The main drawbacks of this approach are the latency limitations and, of course, the dependency on an abstraction layer to deploy your strategies, but we find this library is one of the best if these concerns are manageable.

At Quantitools, we use different technologies depending on the problem we are trying to solve: while sometimes a fully-fledged EMS in C# or an HFT model in C++ will be better suited, for simple strategies ccxt is good enough. Of course, you can always deploy using the APIs of the exchanges directly, but then you will need to refactor your code whenever you want to trade on a new exchange, and straight off the bat you will face more complexity, as you might need to handle multiple threads (if, for example, you want to trade a portfolio of symbols), exchange reconnections, building your own OHLC stream from tick-by-tick data, and so on.
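As an illustration of that last piece, here is a minimal sketch of turning raw trades into OHLCV bars with pandas; the trade schema (millisecond timestamp, price, amount) is an assumption for the example.

```python
# Sketch: one piece of the plumbing ccxt spares you, aggregating raw
# trades into OHLCV bars. Trade schema is assumed for illustration.
import pandas as pd

def trades_to_ohlcv(trades: list, bar: str = "1min") -> pd.DataFrame:
    df = pd.DataFrame(trades)
    df.index = pd.to_datetime(df["timestamp"], unit="ms")
    ohlcv = df["price"].resample(bar).ohlc()        # open/high/low/close per bar
    ohlcv["volume"] = df["amount"].resample(bar).sum()
    return ohlcv.dropna()                           # drop bars with no trades

# Example with two synthetic ticks
ticks = [
    {"timestamp": 1676800000000, "price": 24000.0, "amount": 0.5},
    {"timestamp": 1676800030000, "price": 24010.0, "amount": 0.2},
]
print(trades_to_ohlcv(ticks))
```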

Another infrastructure pain point (probably the biggest one) is reliable historical data. For intraday, HFT and advanced trading models, there is virtually no quality data available in the market. This is where we came in and developed our flagship data service, QT Data.

We collected, validated and standardized all the transactional and orderbook information from the top 20+ exchanges. The result looks like this: daily CSV files (Parquet and other formats are available too) with all trades and orderbook changes across all tickers, exchanges and asset classes.

Exchange | Symbol_Type | Symbol_Name | ID | Type | Local_Timestamp | Exchange_Timestamp | Side | Amount | Price

Note that the exchange timestamp is missing in this particular dataset for snapshots (which we periodically request from the exchange to facilitate the orderbook rebuilding process and to strengthen the validation of our data). The power of this simple mapping is that we can combine it with other trades datasets to streamline the analysis and merge the data together if desired, as sketched after the schema below.

Exchange | Symbol_Type | Symbol_Name | ID | Type | Local_Timestamp | Exchange_Timestamp | Side | Amount | Price
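For instance, here is a small sketch of that merge: combining a daily trades file with a book-updates file into a single time-ordered event stream. The file names are hypothetical; the columns are those of the schema above.

```python
# Sketch: merge a daily trades file and a book-updates file into one
# time-ordered event stream. File names are hypothetical. Snapshot rows
# may lack Exchange_Timestamp, so we order by Local_Timestamp, which is
# always present.
import pandas as pd

trades = pd.read_csv("binance_BTCUSDT_trades_2023-02-19.csv")
book = pd.read_csv("binance_BTCUSDT_book_2023-02-19.csv")

events = (
    pd.concat([trades, book], ignore_index=True)
      .sort_values("Local_Timestamp")
      .reset_index(drop=True)
)
print(events.head())
```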

From the multi-format JSONs that exchanges dump, we go to fully granular, standardized datasets that make any research, backtesting or ML model feed possible. Furthermore, the biggest added value is that exchanges only offer limited requests for historical data (usually OHLCV or throttled tick datasets); quotes are not available, and full tick datasets are rare to find. This is understandable, as we are talking about petabytes of data. Such data is the cornerstone of any successful quantitative approach, and a few years after our quest for institutional-grade quality data began, we are ready to make it accessible to anyone.

By Jesús Martín García
