Software that spots future unicorns using public data. A Blockchain Master's Thesis

Luca Bella
Jun 24, 2021


Information is crucial in the investing world.

On one side, there is the blockchain industry, in which companies tend to disclose as much as possible in order to stand out from the crowd [1]. On the other, we can observe venture capital firms (VCs) that build their competitive advantage on exclusive information.

Pixabay photo

Despite this difference, we are witnessing a large influx of VCs into the blockchain industry. Bakkt is probably one of the biggest and most famous examples. It raised $300 million in its Series B round in 2020 [2].

It is interesting to see so many VCs coming up with new multi-million-dollar deals every month. You can verify it by reading the tweets of @ICO_Analytics and observing how present venture capital firms are in this industry.

How do VCs identify new investment opportunities?

It is difficult to keep up with an industry in which interesting new projects are launched every day. It is even more difficult to collect valuable information by calling every entrepreneur directly.

Sourcing deals can become an impossible job.

The Fried and Hisrich model (1994) [3] illustrates how VCs screen new deals. They explain that this process accounts for 60% of returns. The process is divided into 3 main stages:

1. Deal origination: The VC becomes aware of potential investments. A slight majority of these come from internal referrals and the firm's own deal sourcing [4].

2. Firm-specific screen: The VC reduces the number of potential companies gathered in the deal origination phase by applying its own selection criteria.

3. Generic screen: At this stage, projects are evaluated more carefully. The most promising ones are selected for deeper analysis.

Rodnae Productions photo

As readers can easily see, this funnel starts from a very broad base and narrows down to a small number of companies. Almost everything is done by hand.

What does manual screening imply? Biases.

The literature mostly cites similarity, overconfidence, and availability biases [5].

What is the solution? Automated screening.

Software avoids these biases and is capable of analysing a much larger amount of data.

The Unicorn Spotter

This is not a crystal ball.

It does not claim to predict outcomes with extreme accuracy or to replace the work of analysts. However, it is a tool that can help a lot in all 3 screening phases, and it has to be paired with human evaluation.

With this premise in mind, we can dive into the software.

The scripts I have written do 3 main things:

1. Gather data in an automated way.

2. Measure what is effectively measurable and evaluate through standard procedures what is difficult to quantify.

For example, it is possible to count the number of partners, but it is difficult to measure the previous experience of a team. The software handles these situations with predetermined procedures.

3. Make a classification in order to select the most promising companies.

Danny Meneses Photo

The first question I asked myself was: “How can I find the projects to analyze?”

I looked at the most famous VCs, evangelists (those I consider reliable from my experience), Messari Research, and crypto information aggregators.

The next step was to define the variables to include in the model. I included parameters about:

· Team (LinkedIn): measuring diversity and shared cognition.

· Virality (Twitter): number of followers (divided by the company’s own market cap), average likes per post, and level of interactions relative to followers.

· Partners (Twitter): the number of partners (normalized by the company’s size) and ratios describing their interactions on Twitter.

· Product (GitHub): GitHub repository metrics, such as Stars.

· Market (CoinGecko): the number of different categories and the average number of competitors per category.

· VC presence: whether or not a VC invested.
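The size-normalized ratios in this list can be illustrated with a minimal sketch. All field names, the per-million-dollar unit, and the exact definition of "interactions relative to followers" (average interactions per post divided by followers) are assumptions made for illustration, not the definitions used in the thesis.

```python
from dataclasses import dataclass


@dataclass
class ProjectSnapshot:
    """Raw public figures for one project (all field names are hypothetical)."""
    followers: int           # Twitter followers
    total_likes: int         # likes summed over the sampled posts
    total_interactions: int  # likes + retweets + replies over the sampled posts
    posts: int               # number of sampled posts
    partners: int            # number of partners
    market_cap_usd: float    # market capitalisation in USD


def normalized_features(p: ProjectSnapshot) -> dict:
    """Turn raw counts into ratios that are comparable across company sizes."""
    if p.market_cap_usd <= 0 or p.posts <= 0 or p.followers <= 0:
        raise ValueError("market cap, posts and followers must be positive")
    # Assumption: express market cap in millions of USD to keep ratios readable.
    cap_millions = p.market_cap_usd / 1_000_000
    return {
        "followers_per_million_cap": p.followers / cap_millions,
        "avg_likes_per_post": p.total_likes / p.posts,
        # Assumption: average interactions per post, relative to followers.
        "interactions_per_follower": p.total_interactions / p.posts / p.followers,
        "partners_per_million_cap": p.partners / cap_millions,
    }
```

Dividing by market cap means a small project with a large community scores higher than a large project with the same follower count, which is the intent of the normalization described above.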

In order to feed the classification algorithm, I had to define when a company is successful or not.

For this task, I relied on my experience, since no company has done an IPO and I do not have any information on secondary sales.

The token of the project had to exist before December 2020 in order to be included in the model.

Then, the project had to satisfy at least 3 of these 4 points:

· Average daily volume of $10 million in the last month

· Be in the top 250

· Listing on major exchanges (Binance or Coinbase)

· The current price is not more than 80% below its all-time high
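The "at least 3 of 4" rule can be sketched as a small labelling function. The field names are hypothetical, and the price criterion is read here as "the price has not fallen more than 80% from its all-time high", which is an interpretation rather than something the text states precisely. The December 2020 inclusion filter is not covered by this sketch.

```python
from dataclasses import dataclass


@dataclass
class MarketData:
    """Market figures for one token (all field names are hypothetical)."""
    avg_daily_volume_usd: float  # average daily volume over the last month
    rank: int                    # ranking position (1 = largest)
    exchanges: set               # names of exchanges listing the token
    price: float                 # current price
    all_time_high: float         # all-time-high price


MAJOR_EXCHANGES = {"Binance", "Coinbase"}


def is_successful(m: MarketData) -> bool:
    """Label a project as successful if it meets at least 3 of the 4 criteria."""
    criteria = [
        m.avg_daily_volume_usd >= 10_000_000,
        m.rank <= 250,
        bool(m.exchanges & MAJOR_EXCHANGES),
        # Assumption: "not more than 80% below the all-time high"
        # means the price is at least 20% of the all-time high.
        m.price >= 0.2 * m.all_time_high,
    ]
    return sum(criteria) >= 3
```

Because each criterion is a boolean, summing the list counts how many are met, so changing the threshold or adding a fifth criterion is a one-line edit.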

The outcomes

Now let’s talk about numbers.

I collected data on 386 companies.

I divided them into 2 subsets: one for which I was able to retrieve LinkedIn data and one without it.

The set with LinkedIn data was composed of 125 companies (about one-third) and the other of 205. Some companies were not included because of the lack of GitHub or Twitter data.

Then I applied the classification criteria. Since a lot of companies were born recently due to the crypto explosion of the last few months, the subsets were reduced to 67 companies with LinkedIn data and 76 without.

The Random Forest algorithm gave the best results, with 86% accuracy for the model without LinkedIn data and 76% for the one with it.

To handle the small size of the dataset, I applied k-fold cross-validation with 10 folds.

From the results I could extract the following insights:
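To make the mechanics of 10-fold cross-validation concrete, here is a minimal sketch using only the Python standard library. The classifier is a deliberately trivial stand-in (it always predicts the most common training label), not the Random Forest used in the thesis; the function names and the `fit`/`predict` interface are assumptions for illustration. In practice, a library implementation such as scikit-learn's `RandomForestClassifier` would take the place of the stand-in.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    if n_samples < k:
        raise ValueError("need at least as many samples as folds")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k spreads the shuffled indices evenly across the folds.
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the mean accuracy over k train/test splits."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Train on k-1 folds, then score on the single held-out fold.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Stand-in classifier (NOT a random forest): predicts the majority label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model
```

Each company is used for testing exactly once and for training in the other nine rounds, which is why this procedure suits a dataset of only a few dozen labelled companies.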

· The model considers Stars the most important GitHub metric for evaluating the code.

· The number of partners divided by the company’s market cap is very relevant for the model, while the absolute number of partners is not. This implies that there is an optimal number of partners based on the company’s size.

· Team composition (Points) is the most important team variable. Shared cognition is relevant too, but not as much.

· The size of the community on Twitter divided by the company’s market cap is relevant for the algorithm.

Feature importance in the random forest model

Limitations and final considerations

Of course, there are some limitations.

Firstly, I would have liked to include panel data about companies’ social sentiment, to measure how it has changed over time compared to the general market sentiment. Previous literature has demonstrated that it can signal the quality of a project.

Secondly, I would have liked to dig deeper into team members’ personal profiles and elaborate more on their experience. The same applies to white papers, checking whether certain sections such as tokenomics were present. This could be done with Natural Language Processing. However, I had difficulties collecting white papers, since they come in different formats, from PDF to web pages.

The blockchain industry is undoubtedly the best and the easiest ground to develop this kind of software.

In my opinion, the way in which blockchain companies do business is the way in which every company should. It consists of building a community, making some noise, and continuously coming to market with prototypes and products.

This software is only at the beginning; I will continue to optimize it over time.

This project was made possible by the Chair of Strategy and Organization at TUM and Professor Isabel Welpe.


[1] Adhami, Saman, Giancarlo Giudici, and Stefano Martinazzi. “Why do businesses go crypto? An empirical analysis of initial coin offerings.” Journal of Economics and Business 100 (2018): 64–75.

[2] Blockchain Venture Capital Report, 2020. https://mercuryredstone.com/wp-content/uploads/2021/04/Cointelegraph-consulting-venture-capital-report.pdf

[3] Fried, V. H., Bruton, G. D., & Hisrich, R. D. (1998). Strategy and the board of directors in venture capital-backed firms. Journal of Business Venturing, 13(6), 493–503.

[4] Gompers, P. A., Gornall, W., Kaplan, S. N., & Strebulaev, I. A. (2020). How do venture capitalists make decisions? Journal of Financial Economics, 135(1), 169–190.

[5] Zacharakis, A. L., & Meyer, G. D. (1998). A lack of insight: do venture capitalists really understand their own decision process? Journal of Business Venturing, 13(1), 57–76.


