The United States Stock Market has historically been dominated by institutional investors. Historically 90% of trades and orders in the market were from Banks, Pension Funds, Insurance Companies etc.
However, that percent has slowly been declining as retail investors, like you and I are placing more and more orders. With the advent of commission-free trades, regular people are realizing that the market is an accessible way to make some money.
Recently, however, with the spread of COVID-19 and people staying at home, we've seen a huge surge in the number of trades (order flow) from Retail Investors. This is likely because people are finding more time, that used to be spent commuting and travelling, to invest (gamble ;) ) in the stock market. According to data from Larry R Tabb of Bloomberg Intelligence (1) retail investors now account for 19.5% of order flow.
As the percentage of retail investors grow, understanding their actions becomes more and more profitable. If one can predict their actions or atleast understand how they work then this is very profitable. Such research is similar to what Engineers, Data Scientists and Researchers at Quant Firms like Renaissance Two Sigma do to create their investing models. Sucess in such research has netted funds like Rennisance's Medalion Fund a return of 66% annualized without fees over a 30 year lifespan whereas the S&P 500 , the most common US Stock Market Index has only returned 10%.
As can be seen by the chart above Robinhood is the second biggest choice of broker for retail investors. Until August 2020, Robinhood exported popularity data, which was captured and stored by the creators of RobinTrack. This dataset is a treasure trove because while it is easy to get information about trades in the market, it is hard to get data specifically about retail investors.
We downloaded the data from https://robintrack.net/data-download and untar the directory which seems like csv's of over >8000 stock tickers.
Let's take a look at what the data about a specific ticker looks like by reading the first 5 and last 5 lines of Alcoa Corp (AA)'s csv.
import pandas as pd
df1 = pd.read_csv("AA.csv")
df1.tail()
timestamp | users_holding | |
---|---|---|
14711 | 2020-08-13 15:45:17 | 5686 |
14712 | 2020-08-13 16:47:14 | 5691 |
14713 | 2020-08-13 20:46:44 | 5697 |
14714 | 2020-08-13 21:44:45 | 5697 |
14715 | 2020-08-13 22:46:45 | 5699 |
The data looks like a list of timestamps and the number of robinhood users holding a stock.
Let's get the top 5 stocks that were being held by the robinhood users at the last timestamp available and try to better understand those stocks by importing them into this ipython notebook.
Let's take a look at the top 5 stocks holidays from Aug 2020. From the above analysis, they are F - Ford GE - General Electric AAPL - Apple MSFT - Microsoft AAL - American Airlines
Let's plot the daily holdings by RobinHood users
FordDf = pd.read_csv("F.csv", parse_dates=['timestamp'], index_col=0).resample('D').mean().rename_axis('Date')
GEDf = pd.read_csv("GE.csv", parse_dates=['timestamp'], index_col=0).resample('D').mean().rename_axis('Date')
AppleDf = pd.read_csv("AAPL.csv", parse_dates=['timestamp'], index_col=0).resample('D').mean().rename_axis('Date')
MicrosoftDf = pd.read_csv("MSFT.csv", parse_dates=['timestamp'], index_col=0).resample('D').mean().rename_axis('Date')
AmericanAirlinesDf = pd.read_csv("AAL.csv", parse_dates=['timestamp'], index_col=0).resample('D').mean().rename_axis('Date')
ax = FordDf.plot(y="users_holding", label="Ford")
GEDf.plot(ax=ax, y="users_holding", label="GE")
AppleDf.plot(ax=ax, y="users_holding", label="Apple")
MicrosoftDf.plot(ax=ax, y="users_holding", label="Microsoft")
AmericanAirlinesDf.plot(ax=ax, y="users_holding", label="American Airlines")
ax.set_ylabel("Date")
ax.set_ylabel("# Of Users Holding Stock")
ax.set_title("Stock Holdings by Robinhood Users")
Text(0.5, 1.0, 'Stock Holdings by Robinhood Users')
Now let's look at the percent change over time
FordDFPctChange = FordDf.pct_change()
FordDFPctChange.plot(y="users_holding", label="Percent Change in Holdings", title="Ford")
GEPctChange = GEDf.pct_change()
GEPctChange.plot(y="users_holding", label="Percent Change in Holdings", title="GE")
ApplePctChange = AppleDf.pct_change()
ApplePctChange.plot(y="users_holding", label="Percent Change in Holdings", title="Apple")
MicrosoftPctChange = MicrosoftDf.pct_change()
MicrosoftPctChange.plot(y="users_holding", label="Percent Change in Holdings", title="Microsoft")
AmericanAirlinesPctChange = AmericanAirlinesDf.pct_change()
AmericanAirlinesPctChange.plot(y="users_holding", label="Percent Change in Holdings", title="American Airlines")
ax.set_ylabel("Date")
ax.set_ylabel("# Of Users Holding Stock")
ax.set_title("Stock Holdings by Robinhood Users")
Text(0.5, 1.0, 'Stock Holdings by Robinhood Users')
Let's also look at the stock prices for these stocks over time. We'll load them from the Yahoo Finance API and plot the chart
import datetime
import pandas_datareader.data as web
start = datetime.datetime(2018, 4, 1)
end = datetime.datetime(2020, 8, 1)
FordStockPrice = web.DataReader("F", 'yahoo', start, end)
GEStockPrice = web.DataReader("GE", 'yahoo', start, end)
AppleStockPrice = web.DataReader("AAPL", 'yahoo', start, end)
MicrosoftStockPrice = web.DataReader("MSFT", 'yahoo', start, end)
AmericanAirlinesStockPrice = web.DataReader("AAL", 'yahoo', start, end)
FordStockPrice.head()
High | Low | Open | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|
Date | ||||||
2018-04-02 | 11.09 | 10.70 | 11.06 | 10.86 | 43598100.0 | 9.405505 |
2018-04-03 | 11.18 | 10.90 | 11.01 | 11.15 | 42523900.0 | 9.656665 |
2018-04-04 | 11.36 | 10.80 | 10.87 | 11.33 | 44834600.0 | 9.812557 |
2018-04-05 | 11.38 | 11.23 | 11.34 | 11.35 | 28755600.0 | 9.829880 |
2018-04-06 | 11.35 | 11.07 | 11.28 | 11.18 | 31920700.0 | 9.682647 |
ax = FordStockPrice.plot( y="Close", label="Ford")
GEStockPrice.plot(ax=ax, y="Close", label="GE")
AppleStockPrice.plot(ax=ax, y="Close", label="Apple")
MicrosoftStockPrice.plot(ax=ax, y="Close", label="Microsoft")
AmericanAirlinesStockPrice.plot(ax=ax, y="Close", label="American Airlines")
ax.set_ylabel("$ Cost per stock")
ax.set_title("Stock Prices")
Text(0.5, 1.0, 'Stock Prices')
AAL here is quite interesting. There seems to have been a dip in the price, which has led to more people investing in the stock. This is opposite from what I would've assumed. As the stocks descrease the number of people holding the stock should decrease. Maybe this is because of the news cycle, as these stocks have been in the news and that has caused a rush of investors into American Airlines. Let's scrape some data from Google Trends to figure out if stocks are correlated to the number of searches for that type of stock.
from pytrends.request import TrendReq
pytrend = TrendReq()
pytrend.build_payload(kw_list=['AAL', 'F', 'GE', 'AAPL', 'MSFT'], cat=107, geo="US")
GoogleTrendsSearchesForStocks = pytrend.interest_over_time()
GoogleTrendsSearchesForStocks.plot(y="AAL")
AmericanAirlinesPctChange.plot(y="users_holding", label="Percent Change in Holdings", title="American Airlines")
<AxesSubplot:title={'center':'American Airlines'}, xlabel='Date'>
It looks like the two charts are pretty well correlated. Let's get a correlation coefficient for each of the stocks.
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import numpy as np
GoogleTrendsSearchesForStocksDaily = pd.merge_asof(AmericanAirlinesPctChange, GoogleTrendsSearchesForStocks, left_index=True, right_index=True, direction='nearest')
r, p = stats.pearsonr(FordDFPctChange.dropna()['users_holding'], GoogleTrendsSearchesForStocksDaily.dropna()['F'])
print(f"F r: {r} and p-value: {p}")
r, p = stats.pearsonr(GEPctChange.dropna()['users_holding'], GoogleTrendsSearchesForStocksDaily.dropna()['GE'])
print(f"GE r: {r} and p-value: {p}")
r, p = stats.pearsonr(ApplePctChange.dropna()['users_holding'], GoogleTrendsSearchesForStocksDaily.dropna()['AAPL'])
print(f"AAPL r: {r} and p-value: {p}")
r, p = stats.pearsonr(MicrosoftPctChange.dropna()['users_holding'], GoogleTrendsSearchesForStocksDaily.dropna()['MSFT'])
print(f"MSFT r: {r} and p-value: {p}")
r, p = stats.pearsonr(AmericanAirlinesPctChange.dropna()['users_holding'], GoogleTrendsSearchesForStocksDaily.dropna()['AAL'])
print(f"AAL r: {r} and p-value: {p}")
F r: 0.48740368132711914 and p-value: 5.613153146516906e-51 GE r: 0.3476926542613339 and p-value: 4.157065299407993e-25 AAPL r: 0.3396310395867009 and p-value: 5.806648403868119e-24 MSFT r: 0.3606291584349058 and p-value: 5.130887393707221e-27 AAL r: 0.44023214882951456 and p-value: 7.501615793970934e-41
Based on the analysis above it seems like news cycle can be used to predict what stocks retail investors are likely to buy. The analysis from the top 5 stocks show with high confidence that there is some correlation. This is especially true of Ford and American Airlines which seems to be bought according to the news cycle. There are other factors in play, but Google Trends might be a very good signal for traders to figure out what is being bought and sold by retail investors.