### Prediction Sample Approach

This article refers to a problem on Auquan’s platform. To follow along check out the details here: https://quant-quest.auquan.com/competitions/uk-data-science-varsity

#### Introduction

In our first sample approach for this competition, we’re going to look at how you might tackle a prediction problem. The aim isn’t to produce the best answer we can. Instead, we will go through the steps that you might take in thinking about this kind of problem and then producing an answer.

As a recap, in this problem we have access to fundamental data for a selection of companies in the S&P 100. The aim is to produce a model that can predict future revenue and income, based on previous figures. The idea is that if you knew future performance, you would be able to better value a given stock or company.

#### Working through a solution

If you’ve been following these articles for a while now, you will start to see a familiar structure in how we tackle these problems. Making the actual model is normally very quick and a lot of our time is spent working towards that point. This process helps us to understand the data and prepare it in a way that gives us the best chance of creating a useful model.

The steps we are going to go through today are as follows:

- Data Exploration
- Dimensionality Reduction
- Feature Engineering
- Clustering
- Creating a Prediction Model

(If you don’t want to read the whole guide, this would be a good summary or take away!)

As you go through this approach, remember that the implementation we’ve done here is just an example. It is almost certain that there are better ways to do each step, so you should experiment. For example, at the end, we train a linear regression model, but this might not be the best approach. Instead, you could try non-linear or non-parametric methods and play around with the hyperparameters. Finally, in this approach, we are only considering revenue, but the same steps apply for income.

#### Step 1: Data Exploration

Regardless of the problem we are tackling, the first thing we need to do is understand the data we are working with. If you are working as a professional data scientist, this is probably what you spend most of your time doing. Once you understand a dataset, you can begin asking the right questions to extract value from it!

Let’s get started:

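Imports and setting up Pandas

    import matplotlib.pyplot as plt
    import numpy as np  # used later for np.abs, np.append, np.vstack and np.nan_to_num
    import pandas as pd
    from sklearn import linear_model
    from sklearn.metrics import mean_squared_error, r2_score

    # Show up to 999 rows when printing DataFrames
    pd.options.display.max_rows = 999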

In this problem, we are given normalised data as part of the template file and the original data in a folder of CSVs. At this point, it’s unclear which data will be most useful, so we would want to experiment with both. In order to save time and make this guide clearer, we are just going to work with the original data. To make this step quicker, I’ve created a list of all the tickers that we have data for (the code for this isn’t shown because the list is super long).

Importing the data

    def checkNa(df):
        # Print which columns contain missing values, if any
        if df.isna().any().any():
            print(df.isna().any())

    for tickername in tickers:
        folder = 'normalizedData'
        try:
            df = pd.read_csv('%s/%s.csv' % (folder, tickername), header=0, index_col=0)
            print(tickername)
        except FileNotFoundError as e:
            # Skip tickers that have no CSV file
            print(e)
            continue

If we then look at the information, we can see that there are around 50 stocks with 281 features (or columns) and up to 40 rows of data for each. This presents us with a problem: we have too many features for only a few rows. Reducing this will be our first task.
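To see why many columns and few rows is a problem, here is a minimal sketch on made-up data. The shapes mirror the 281 features and 40 rows mentioned above, but the numbers are random and are not from the competition. With more features than rows, an ordinary linear regression can fit the training rows almost perfectly even when the target is pure noise, and that fit does not carry over to unseen rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for one stock: 40 rows, 281 features, random target
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 281))
y = rng.normal(size=40)

# Earlier rows for training, later rows for testing
X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

model = LinearRegression().fit(X_train, y_train)
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

print("train R^2: %.3f" % train_r2)  # close to 1: the noise has been memorised
print("test R^2:  %.3f" % test_r2)   # lower: the fit does not generalise
```

The training score looks excellent only because there are more free coefficients than training rows, which is the situation we want to get out of by reducing the number of features.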

#### Step 2: Reducing Dimensionality

To understand why we need to reduce the number of columns, it’s helpful to think of each feature as being composed of some useful information and some bad information (noise). If you then imagine each feature contributing its information to our final model, it’s easy to see that having lots of features means lots of bad information!

This is compounded because the good information contained in two columns might actually be the same. If we already have that good information from the first column, the second column is just adding noise. To illustrate this, imagine you are creating a model to predict the weather and you have two features: the number of ice creams sold and the number of ice cream sticks in the trash. These numbers are going to be different, but they are essentially measuring the same thing.

In shorthand, each feature can be written as f(x) = u + n, where u is the useful information and n is the noise.
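One common way to spot features that measure the same thing is to look at how strongly they are correlated. The sketch below uses made-up numbers for the ice cream example. The helper `drop_correlated` and the 0.95 threshold are illustrative assumptions, not part of the original approach.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, other] < threshold for other in kept):
            kept.append(col)
    return kept

# Made-up data for the ice cream example
rng = np.random.default_rng(1)
sold = rng.integers(50, 150, size=200).astype(float)
features = pd.DataFrame({
    "ice_creams_sold": sold,
    "sticks_in_trash": sold + rng.normal(0, 1, size=200),  # same signal, slightly noisy
    "humidity": rng.normal(60, 10, size=200),               # unrelated to sales
})

kept = drop_correlated(features)
print(kept)
```

Here the sticks column is dropped because it adds almost nothing beyond the sales column, while the unrelated column is kept.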

The first step in reducing the dimensionality of this data is to understand which of our columns contain valuable information. To do this in the simplest manner, we are going to fit a linear regression on all the factors and print the coefficients. Remember, we are only doing this for revenue, but you would want to follow a similar process for income.

    columns = []

Creating training and test data

    # Use the first 30 rows for training and the next 6 for testing
    features = df.drop(columns=['Revenue(Y)', 'Income(Y)'])
    X_train, y_train = features.iloc[:30], df['Revenue(Y)'].iloc[:30]
    X_test, y_test = features.iloc[30:36], df['Revenue(Y)'].iloc[30:36]

    # linear_regression is the helper function defined in Step 5
    r, y_pred = linear_regression(X_train, y_train, X_test, y_test)

If we look at the results it becomes apparent that this isn’t that useful. Many of the columns have very different scales, meaning we can’t directly compare the coefficients with each other. In order to get around this problem, we are going to have to normalise our data.
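To make the scale problem concrete, here is a small sketch with a made-up feature expressed in two different units. The names and numbers are hypothetical. The information in the feature is identical in both cases, but the raw coefficient changes by a factor of one million, so its size tells us nothing about importance until the scale is removed.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of a one-feature least-squares fit: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def standardise(x):
    """Remove the scale: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(2)
revenue_millions = rng.uniform(100, 500, size=40)  # hypothetical feature, in millions
target = 2.0 * revenue_millions + rng.normal(0, 5, size=40)
revenue_dollars = revenue_millions * 1e6           # same information, different unit

slope_millions = ols_slope(revenue_millions, target)
slope_dollars = ols_slope(revenue_dollars, target)

slope_std_millions = ols_slope(standardise(revenue_millions), target)
slope_std_dollars = ols_slope(standardise(revenue_dollars), target)

print(slope_millions, slope_dollars)          # differ by a factor of one million
print(slope_std_millions, slope_std_dollars)  # match once the scale is removed
```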

Normally normalisation is a bit tricky with time series data as you are prone to introducing lookahead bias. Fortunately, the Auquan toolbox prevents this from happening, so we can just go straight ahead and create a rolling mean. I’ve created a variable for the period in case we want to change it later to see how that affects our model.

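Create normalise function

    def normalize(X, y, period, key):
        # Rolling z-score: each value is compared with the mean and standard
        # deviation of the trailing `period` rows
        X_norm = (X - X.rolling(period).mean()) / X.rolling(period).std()
        # Drop rows where every feature is NaN (the warm-up rows), then zero-fill the rest
        X_norm = X_norm.dropna(axis=0, how='all').fillna(0)
        # Scale the target with the rolling statistics of the matching feature column
        y_norm = (y - X[key].rolling(period).mean()) / X[key].rolling(period).std()
        # Keep only the rows that survived above. fillna returns a new Series,
        # so the result has to be assigned back
        y_norm = y_norm[X_norm.index].fillna(0)
        return X_norm, y_norm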
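As an illustration of why a rolling window helps with lookahead bias, the sketch below compares a rolling z-score with one computed over the whole series. The data is made up and the helper names are hypothetical. If we change only the most recent values, the rolling version of the earlier rows stays exactly the same, while the full-sample version changes because the future has leaked into the mean and standard deviation.

```python
import numpy as np
import pandas as pd

def rolling_zscore(s, period):
    """Normalise each value using only the trailing `period` values."""
    return (s - s.rolling(period).mean()) / s.rolling(period).std()

def full_sample_zscore(s):
    """Normalise using the mean and std of the whole series (future included)."""
    return (s - s.mean()) / s.std()

rng = np.random.default_rng(0)
original = pd.Series(rng.normal(100, 10, size=20))

# Same history, but the last 5 values are changed
revised = original.copy()
revised.iloc[15:] += 50

past = slice(0, 15)
rolling_unchanged = np.allclose(
    rolling_zscore(original, 4).iloc[past],
    rolling_zscore(revised, 4).iloc[past],
    equal_nan=True,
)
full_sample_unchanged = np.allclose(
    full_sample_zscore(original).iloc[past],
    full_sample_zscore(revised).iloc[past],
)

print(rolling_unchanged)      # True: past values do not depend on the future
print(full_sample_unchanged)  # False: the future leaked into the past
```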

Now that we’ve created this function, we can normalise the data and run the regression again. Instead of printing the coefficients again, I’ve decided just to get rid of every column whose coefficient has an absolute value of 0.1 or less. This should leave only columns that have some predictive power. The output should show us which features are related to revenue in some way.

    columns = list(df.drop(columns=['Revenue(Y)', 'Income(Y)']).columns)

    for tickername in tickers[:5]:
        # folder = 'qq16p1Data'
        folder = 'cleanData'
        try:
            df = pd.read_csv('%s/%s.csv' % (folder, tickername), header=0, index_col=0)
            print(tickername)
        except FileNotFoundError as e:
            print(e)
            continue

        norm_period = 3
        X_norm, y_norm = normalize(df.drop(columns=['Revenue(Y)', 'Income(Y)']),
                                   df['Revenue(Y)'], norm_period, 'Revenue')
        # Train on everything except the last 6 rows, which are held out for testing
        regr_norm, y_pred = linear_regression(X_norm.iloc[:-6], y_norm.iloc[:-6],
                                              X_norm.iloc[-6:], y_norm.iloc[-6:])

        # coef_ is ordered like the feature columns, so index X_norm.columns
        for i in range(len(regr_norm.coef_)):
            name = X_norm.columns[i]
            if np.abs(regr_norm.coef_[i]) > 0.1:
                print("'%s', " % name)
            elif name in columns:
                # Drop features whose coefficient is small for this ticker
                columns.remove(name)

    print(columns)

#### Step 3: Feature Engineering

Now that we have some features that we know have some information about the target variable, we need to concentrate that information. This is the aim of feature engineering. In finance, this is probably the most important step in determining how good your final model will be.

There are lots of different ways to create new features. Some of the things you can look into doing:

- Creating non-linear versions of features
- Creating combinations of features (there are a lot of these in finance, e.g. P/E ratio)
- Creating altogether new features, e.g. with PCA
- Identifying which features carry the same information (and using just one feature from each group)
- Many more…
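A few of these ideas can be sketched in a handful of lines. The example below uses made-up fundamentals, and the column names are hypothetical rather than the ones in the competition data. It creates a non-linear feature, a ratio of two features, and two PCA components.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up fundamentals for illustration; column names are hypothetical
rng = np.random.default_rng(3)
raw = pd.DataFrame({
    "revenue": rng.uniform(1e3, 1e5, size=30),
    "price": rng.uniform(10, 200, size=30),
    "earnings_per_share": rng.uniform(1, 10, size=30),
    "total_debt": rng.uniform(1e2, 1e4, size=30),
})

features = raw.copy()
# Non-linear version of a feature
features["log_revenue"] = np.log(raw["revenue"])
# Combination of features: price / earnings ratio
features["pe_ratio"] = raw["price"] / raw["earnings_per_share"]

# New features from PCA: scale first so no column dominates, then keep 2 components
scaled = StandardScaler().fit_transform(features)
components = PCA(n_components=2).fit_transform(scaled)
features["pc1"] = components[:, 0]
features["pc2"] = components[:, 1]

print(features.columns.tolist())
```

With time series data, the scaler and the PCA would need to be fitted on past rows only, otherwise they introduce the same lookahead bias discussed in the normalisation step.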

If you were trying to completely solve this problem you would want to try all of these things to see what worked best for this problem. You also might want to do research to see what other people have found predictive in similar situations.

We are going to create 2 sets of similar new features as an example. These columns are going to be the value of each feature one year earlier (t-1 year) and two years earlier (t-2 years). In other words, to predict revenue, we are guessing that you don’t just need today’s features, but also the value of each feature last year and the year before. To understand why, imagine I was asking you to predict Company X’s sales on Black Friday or 11/11 Singles Day (or a similar event in your country). Which set of data would be more useful: the volume of sales for the three months up to that point, or the volume of sales on the three previous sales days?

The second reason we have chosen this set of features is to simplify the problem. One of the difficulties of working with time series data is that the history matters for what will happen next. The data we are working with is really a 3D array of features x stocks x time. By including trailing values in the feature dimension, we are able to transform the data from 3D to 2D. Here we have added 2 annual trailing values, but you will need to experiment to find out which values are important:

- How many years back?
- Are trailing quarters important?
- Are the differences between values important? Is there some other relationship that is important? (e.g. Moving average)
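If you want to experiment with these questions, pandas makes it easy to build trailing values, differences and moving averages. This is a minimal sketch on made-up data: it assumes one row per quarter, and the helper `add_trailing_features` is a hypothetical name, not part of the original implementation.

```python
import numpy as np
import pandas as pd

def add_trailing_features(df, columns, lags=(4, 8)):
    """Add lagged values, year-on-year change and a 4-row moving average."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
        out[f"{col}_yoy_change"] = df[col] - df[col].shift(4)
        out[f"{col}_ma4"] = df[col].rolling(4).mean()
    # Early rows have no history to look back on, so they are dropped
    return out.dropna()

# Made-up quarterly revenue: 0, 10, ..., 110
quarterly = pd.DataFrame({"Revenue": np.arange(12) * 10.0})
trailing = add_trailing_features(quarterly, ["Revenue"])
print(trailing)
```

Changing `lags` to something like `(1, 2, 4)` is a quick way to test whether trailing quarters matter as well as trailing years.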

Hopefully, that makes sense. Once you’ve got your head around why, take a look below to see our implementation.

    X = None
    t = []
    for tickername in tickers:
        folder = 'qq16p1Data'
        # folder = 'cleanData'
        try:
            df = pd.read_csv('%s/%s.csv' % (folder, tickername), header=0, index_col=0)
            print(tickername)
            t.append(tickername)
        except FileNotFoundError as e:
            print(e)
            continue

        # Latest row, plus the rows 4 and 8 periods earlier
        # (one and two years back, assuming quarterly rows)
        temp = df[columns].iloc[-1].values
        temp = np.append(df[columns].iloc[-5].values, temp)
        temp = np.append(df[columns].iloc[-9].values, temp)
        # Other offsets (e.g. -2, -3, -8, -10, ..., -16) can be added the same way

        if X is None:
            X = temp
        else:
            # Append below so that row i of X lines up with t[i]
            X = np.vstack((X, temp))

    X = np.nan_to_num(X)

#### Step 4: Clustering

Now that we’ve created some better features, we need to get smart about our stocks. It isn’t feasible to create a model for each individual stock, and it wouldn’t be a good idea even if we could: with the amount of data we have here, individual models would be likely to overfit. Let’s take an example to help think this through. Imagine you have 5 airline stocks and the price of oil doubles. The biggest cost for all of these companies has just doubled, which would likely cause all their stocks to lose value. Based on the data we have, we would expect the magnitude of this effect to be the same for each of these companies. Of course, if we happened to know that one company had stockpiled a load of cheap oil and that another’s fleet consisted of ageing planes with poor fuel efficiency, this would merit individual approaches.

How you choose to cluster your stocks is also likely to have a significant effect on your ability to produce good predictions. The more closely related the companies in your clusters are, the more accurate your models will be. This is why we removed the irrelevant factors before clustering. If we hadn’t done that, we might have grouped companies that are similar in ways that don’t predict revenue growth.

Here we’ve just taken a quick and simple approach to clustering. You should experiment with the clustering method and its hyperparameters, including the number of clusters.

    from sklearn.cluster import KMeans

    n_clusters = 8
    km = KMeans(n_clusters=n_clusters, random_state=0)
    km.fit(np.nan_to_num(X))

    clusters = {}
    for i in range(n_clusters):
        clusters[i] = []

Next, we label each cluster (from 0 to n_clusters - 1) so we can easily identify them later. So that we have a reference, we also print the clusters and explore which companies are in which group.

    # km.labels_[i] is the cluster assigned to row i of X, i.e. to ticker t[i]
    for i in range(len(km.labels_)):
        clusters[km.labels_[i]].append(t[i])

    print(clusters)
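One way to experiment with the number of clusters is to score each choice with the silhouette score, which is higher when points sit close to their own cluster and far from the others. The sketch below uses made-up data with three obvious groups rather than the competition data, and scaling the features first is an assumption that usually helps distance-based methods like k-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Made-up data: three well-separated groups of 20 "stocks" with 2 features each
rng = np.random.default_rng(4)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(0, 1, size=(20, 2)) for c in centres])

# Scale the features so that no single column dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X_demo)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best number of clusters:", best_k)
```

On real fundamentals the groups will be far less clean, so treat the score as a guide and also check whether the companies in each cluster make sense together.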

#### Step 5: Create a model(s)

The next step is to finally create models for each of the clusters. We’re going to just do one approach (linear regression) for one cluster ([0]) but you would want to experiment with different models (e.g. non-linear).

This first step is just preparing the data of group 0:

    X = None
    y = []
    for tickername in clusters[0]:
        folder = 'qq16p1Data'
        try:
            df = pd.read_csv('%s/%s.csv' % (folder, tickername), header=0, index_col=0)
            print(tickername)
        except FileNotFoundError as e:
            print(e)
            continue

        for i in range(len(df.index) - 24):
            if df['Revenue(Y)'].iloc[-1 - i] > 0:
                # Start with the current row, then prepend the trailing rows
                temp = df[columns].iloc[-1 - i].values
                for offset in (2, 3, 4, 6, 8, 10, 12, 14, 16):
                    temp = np.append(df[columns].iloc[-offset - i].values, temp)

                if X is None:
                    X = temp
                else:
                    # Append below so that row k of X lines up with y[k]
                    X = np.vstack((X, temp))
                y.append(df['Revenue(Y)'].iloc[-1 - i])

Next, we need to train our model:

    len(X), len(y)

    def linear_regression(X_train, y_train, X_test, y_test):
        # regr = linear_model.LinearRegression()
        regr = linear_model.Lasso(alpha=0.1)
        # regr = linear_model.Ridge(alpha=.5)
        # regr = linear_model.ElasticNet(random_state=0)

        # Train the model using the training set
        regr.fit(X_train, y_train)
        # Make predictions using the testing set
        y_pred = regr.predict(X_test)

        # The coefficients
        print('Coefficients: \n', regr.coef_)
        # The mean squared error
        print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
        # R^2 (coefficient of determination): 1 is perfect prediction
        print('Variance score: %.2f' % r2_score(y_test, y_pred))

        # Plot predicted against actual values; points on the blue line are perfect predictions
        plt.scatter(y_test, y_pred, color='black')
        plt.plot(y_test, y_test, color='blue', linewidth=3)
        plt.xlabel('Y(actual)')
        plt.ylabel('Y(Predicted)')
        plt.show()

        return regr, y_pred

Finally, we can plot our results:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        np.nan_to_num(X), np.nan_to_num(y), test_size=0.33, random_state=42)

    regr_norm, y_pred = linear_regression(X_train, y_train, X_test, y_test)

We can see from the output that this model gives the following results:

- Mean squared error: 0.06
- Variance score: 0.18
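The variance score printed by our helper comes from `r2_score`, the coefficient of determination. To get a feel for the scale, here is a small hand-written version on made-up numbers (the function `r2` is only for illustration). A score of 1 is a perfect prediction, 0 is no better than always predicting the mean, and the score goes negative when the model does worse than that.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

actual = [1, 2, 3, 4, 5]

perfect = r2(actual, [1, 2, 3, 4, 5])    # every prediction is exact
baseline = r2(actual, [3, 3, 3, 3, 3])   # always predict the mean
backwards = r2(actual, [5, 4, 3, 2, 1])  # worse than predicting the mean

print(perfect, baseline, backwards)  # 1.0 0.0 -3.0
```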

For comparison, the model at the beginning had a variance score of around -35. This is a huge improvement, but a score of 0.18 still leaves most of the variation in revenue unexplained, so we can still make it much better.
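One way to look for improvements is to compare several models on the same data using cross-validation. The sketch below does this on made-up data with a partly non-linear target. The candidate models and their settings are just examples to start from, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Made-up data with a partly non-linear target; not the competition data
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(120, 6))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=120)

candidates = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=0.5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated R^2 for each candidate
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print("%-14s mean R^2 = %.2f" % (name, score))
```

For time-ordered data like ours, a splitter such as scikit-learn’s `TimeSeriesSplit` is worth trying in place of plain k-fold, so that each model is always tested on rows that come after the ones it was trained on.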

If you want to put this into practice, sign up for the UK Data Science Varsity here: links.auquan.com/DSV