Data sourcing is not just about finding alternative data vendors. Identifying potentially interesting data sets is only the beginning of the long path from data to alpha. In fact, now that so many alternative data products are available on the market, the biggest challenge is evaluating them, selecting the ones worth using, and integrating new data sources into the existing research process.
According to a recent Greenwich Associates report, “A buyer’s guide to alternative data,” it takes two highly paid quantitative engineers an average of 85 person-hours to assess a new alternative data source. In our company (as in many other quant trading firms) we apply machine learning techniques to automate the research process, which significantly reduces the time needed to evaluate a data set’s alpha-generation potential. But that doesn’t mean a vendor will hear from us within a few days of our initiating trial access; the process from the initial pitch to the final decision still takes quite a long time.
A big part of the evaluation process is the preparation stage: work that needs to be done before we even start analyzing the data. All data sets are different (format, collection methodology, metadata, point-in-time history handling, update patterns, etc.); there is no standard format even within data types, let alone across them. Buy-side firms often have to spend a disproportionate amount of time on cleansing and formatting versus actual alpha-discovery research.
What happens between the vendor’s pitch and the first dollar of the fund’s profit attributed to the data set may vary significantly depending on the trading strategy. Here is what the process generally looks like from the perspective of a quant fund with an actively traded equity strategy.
(1). Initial contact and vendor’s pitch. Today there is plenty of information available about alternative data use cases, and competition between data vendors is getting tight, so we fully expect a vendor to know which essential questions to cover in their first pitch. Standard points such as length of history, mapping, coverage, frequency of updates, delivery time and average delay, and availability of point-in-time history need to be included in a one-pager or an intro e-mail. In addition, it is always helpful to receive research papers, case studies, and other relevant materials.
As a quant equity fund, we filter new data sets on the following criteria: at least five years of point-in-time history; broad coverage (at least 100 public companies); reasonably low update-delivery latency; and a number of uncorrelated factors in the data set.
These criteria may differ depending on the strategy. The purpose of the initial screening is to get a rough idea of the data’s basic properties and filter out data sets that are outside the area of application of our alpha extraction process.
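The initial screen described above can be expressed as a simple mechanical filter. The sketch below is illustrative only: the field names, the latency threshold, and the metadata dictionary are assumptions, not a vendor-standard schema.

```python
from datetime import date

# Hypothetical screening thresholds mirroring the criteria in the text.
# The 24-hour delivery-lag cutoff is an assumed example value.
SCREEN_CRITERIA = {
    "min_history_years": 5,        # point-in-time history length
    "min_coverage": 100,           # number of public companies covered
    "max_delivery_lag_hours": 24,  # assumed latency threshold
}

def passes_initial_screen(meta: dict) -> bool:
    """Return True if a data set's one-pager stats clear the basic filter."""
    history_years = (date.today() - meta["history_start"]).days / 365.25
    return (
        history_years >= SCREEN_CRITERIA["min_history_years"]
        and meta["coverage"] >= SCREEN_CRITERIA["min_coverage"]
        and meta["delivery_lag_hours"] <= SCREEN_CRITERIA["max_delivery_lag_hours"]
        and meta["is_point_in_time"]
    )

example = {
    "history_start": date(2012, 1, 1),
    "coverage": 450,
    "delivery_lag_hours": 6,
    "is_point_in_time": True,
}
print(passes_initial_screen(example))  # True
```

A filter like this only encodes the "looks good on paper" step; everything downstream still requires human judgment and the deeper checks described next.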
(2). If the data set looks good on paper, we sign an NDA and request data samples and detailed documentation describing file fields, data structure, and attributes, as well as the methodology the vendor uses to calculate pre-defined features and aggregated metrics (if applicable). This is not yet trial access, nor should it be confused with historical back-testing. At this point, we just need to assess the metadata, estimate the data-wrangling effort, and identify potential data-quality issues to address.
(3). Next, we proceed with the qualitative evaluation stage, which includes communicating with the vendor to dig deeper into their data collection and pre-processing. It is essential for us to understand how the data processing pipeline has evolved on the vendor’s side. We ask whether they have made any structural changes to the data collection process: changes in the panel or in the suppliers of the underlying data, changes in delivery frequency or schedule, and backfills in the historical data. If the vendor does backfill, we absolutely need to understand the logic behind the process to make sure it doesn’t introduce forward-looking bias. We aim to find out whether the vendor tends to contaminate the data while pre-processing and handling it.
Although historical data quality issues (restatements, delivery schedule inconsistencies, format changes, etc.) may be a big problem, not all of them are automatically deal breakers. Of course, if a data product is just an overfitted signal with an inflated Sharpe ratio on the vendor’s back-test, we will pass on it. But if we see potential value in the data set, we are willing to work with the vendor to get quality issues fixed.
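The backfill concern in step (3) can be made concrete with a point-in-time sanity check. The sketch below is a minimal illustration and assumes each record carries both an event timestamp and an "available_at" (knowledge) timestamp; both field names are hypothetical.

```python
from datetime import datetime, timedelta

def lookahead_suspects(records, min_lag=timedelta(0)):
    """Flag records whose stated availability precedes the event itself
    (plus an optional minimum processing lag). Such records suggest the
    history was backfilled with knowledge that was not available in
    real time, i.e., forward-looking bias."""
    return [
        r for r in records
        if r["available_at"] < r["event_time"] + min_lag
    ]

records = [
    {"event_time": datetime(2020, 3, 2), "available_at": datetime(2020, 3, 3)},
    {"event_time": datetime(2020, 3, 2), "available_at": datetime(2020, 3, 1)},  # suspect
]
print(len(lookahead_suspects(records)))  # 1
```

In practice the minimum lag would be set from the vendor's stated processing time, so that records arriving "too fast" relative to the documented pipeline are also flagged.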
(4). Back-test. Here we get access to all the available historical data. Our research team looks at the number of data points per name, the value ranges of the metrics, how these numbers evolved over time, and whether there are any major inconsistencies. The closest analogy lies in the change-point detection domain, where the goal is to identify the times at which the nature of the data structurally changed over the historical period.
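To illustrate the change-point idea (this is a textbook sketch, not the fund's proprietary algorithm), a basic CUSUM detector can be run over a per-name data-point count series, flagging the first month where the cumulative deviation from a baseline window drifts past a threshold. The baseline length and threshold here are assumed example values.

```python
import numpy as np

def first_changepoint(series, baseline=12, threshold=5.0):
    """Return the index of the first detected structural break, or None.

    A simple one-sided CUSUM: standardize each point against the mean and
    std of the first `baseline` observations, accumulate positive and
    negative drift, and flag when either exceeds the threshold."""
    x = np.asarray(series, dtype=float)
    mu = x[:baseline].mean()
    sigma = x[:baseline].std() + 1e-9  # avoid division by zero
    pos = neg = 0.0
    for i in range(baseline, len(x)):
        z = (x[i] - mu) / sigma
        pos = max(0.0, pos + z)
        neg = min(0.0, neg + z)
        if pos > threshold or neg < -threshold:
            return i
    return None

# Monthly data-point counts for one name: stable for a year, then the
# count collapses -- a structural change in the data set.
counts = [98, 102, 99, 101, 100, 97, 103, 100, 99, 101, 102, 98,
          42, 40, 41, 39, 40, 41, 42, 40, 39, 41, 40, 42]
print(first_changepoint(counts))  # 12
```

A drop like this in point counts is exactly the kind of "dramatic deterioration of quality" that, in the process described above, shows up as abnormal behavior in the model portfolio's track record.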
Our research process is fully automated and scales across the data sets we use. Applying our machine-learning-based algorithm, we get a model portfolio for the data set, in which a number of different trading ideas are combined into a single track record. Analyzing this track record shows clearly whether the data set suffered a dramatic deterioration in quality, sometimes with accuracy down to the month. If we observe abnormal performance behavior indicating data-quality issues, we work with the vendor to investigate. In our experience, vendors are often aware of these issues; they just don’t recognize their importance and potential negative impact on the accuracy of the back-test.
(5). Trial access. We normally ask for a 90-day real-time free trial. We compare the pattern of updates in the real-time data against the pattern observed in the history. If the two don’t match, it is a major red flag for overly processed historical data.
We understand that data vendors don’t like providing longer trial periods. But if we follow this process and only request trial access once the preparation stage and back-tests are done and we are ready to work with real-time data, a standard trial period is enough to complete the evaluation.
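The update-pattern comparison from step (5) can be sketched as a two-sample distribution test on delivery lags. Below is a minimal hand-rolled Kolmogorov–Smirnov statistic (one could equally use `scipy.stats.ks_2samp`); the lag values are made-up examples, and "lag" is assumed to mean hours between an event and the arrival of the file reporting it.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs, evaluated over all observed points."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

hist_lags = [2, 3, 2, 4, 3, 2, 3]   # lags implied by the historical files
live_lags = [26, 30, 28, 27, 29]    # lags observed during the live trial

# A statistic near 1.0 means the live delivery pattern looks nothing like
# the history -- the red flag for overly processed historical data.
print(ks_statistic(hist_lags, live_lags))  # 1.0
```

Here the history claims same-day delivery while the live feed arrives more than a day late, so any back-test built on the historical timestamps would have been trading on information that was not actually available.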
(6). Contract negotiation. Only now, after the evaluation process is complete, do we have a clear understanding of how much alpha this data set can add in the context of our strategy and the existing portfolio of alphas. At this point, it makes sense to move on to the conversation about pricing and contract terms. Pricing certainly deserves a separate discussion, but the general approach is that the cost of acquiring the data set, plus the internal resources needed to develop and implement it, should be justified by the projected alpha the data set generates.
(7). Once we go live with a new data set, we regularly check both the statistical properties of the data we receive (completeness, consistency, etc.) and the delivery times and restatements. Out-of-sample analysis is important no matter how good the back-test is. Besides, there is always a risk that a vendor makes changes without informing us in advance.
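A minimal version of the live completeness check in step (7) might compare each delivery against the expected security universe. The sketch below is illustrative; the `id` field and the report structure are assumptions, not a standard interface.

```python
def delivery_report(expected_ids, received_rows):
    """Compare one delivery against the expected universe: report the
    completeness ratio, identifiers that went missing, and identifiers
    that appeared unexpectedly (e.g., after an unannounced vendor change)."""
    received_ids = {row["id"] for row in received_rows}
    missing = set(expected_ids) - received_ids
    unexpected = received_ids - set(expected_ids)
    return {
        "completeness": 1 - len(missing) / len(expected_ids),
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
    }

rows = [{"id": "AAPL"}, {"id": "MSFT"}, {"id": "ZZZT"}]
print(delivery_report(["AAPL", "MSFT", "GOOG"], rows))
```

Run on every delivery and tracked over time, a report like this surfaces both gradual coverage decay and the sudden, unannounced format or panel changes mentioned above.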
Building a robust data sourcing and evaluation process is essential for investment managers seeking an informational edge from alternative data. It helps optimize the data budget and avoid damaging strategy performance with data sets of questionable quality.
Alternative data products inherently involve a long sales cycle. But if both vendors and data buyers adhere to a simple predefined procedure and timeline, the process can be standardized and made more efficient.