There are two steps to this process:

- First, you need to decide if you want to use the feature as a market feature, or an instrument feature. Instrument features are calculated per instrument (for example position, fees, moving average of instrument price). The toolbox auto-loops through all instruments to calculate features for you. Market features are calculated for the whole trading system (for example portfolio value).

- Instrument features get declared in the function *getInstrumentFeatureConfigDicts()*; market features get declared in the function *getMarketFeatureConfigDicts()*.

- Then you need to create one config dictionary per feature. Feature config dictionaries are defined with the following keys:

- *featureId*: A string representing the feature you want to use
- *featureKey* (optional): A string representing the key to access the value of this feature. If not present, featureId is used
- *params* (optional): A dictionary which contains other parameters, if needed by the feature

Example: Creating a moving sum

def getInstrumentFeatureConfigDicts(self):
    msDict = {'featureKey': 'ms_5',
              'featureId': 'moving_sum',
              'params': {'period': 5,
                         'featureName': 'basis'}}
    return [msDict]

In order to help you quickly create features, the Auquan toolbox contains several common features predefined, for use in these dictionaries. To see the list, click here.

If you are a more advanced practitioner, you can create your own features to use in the config dicts. To see how, click here.

This function does exactly what it says on the label: it gets predictions. Specifically, the getPrediction function will combine all of the features you've created, plus any other logic you include, and calculate the predictions at each point in time. The getPrediction function will only use data up to that time stamp. This ensures that your prediction models don't suffer from lookahead bias.

You can call your previously created features by referencing their featureId. For example, to use the moving average feature and a custom feature, you would write the following code:

def getPrediction(self, time, currentMarketFeatures, instrumentManager):
    # holder for all the instrument features
    lookbackInstrumentFeatures = instrumentManager.getLookbackInstrumentFeatures()

    # Dataframe for a historical instrument feature (ms_5 in this case).
    # The index is the timestamps of at most lookback data points.
    # The last row of this dataframe contains the features being calculated
    # in this update cycle, i.e. for this time. The second to last row
    # (if it exists) has the features for the previous time update.
    # The columns of this dataframe are the stock symbols/instrumentIds.
    ms5Data = lookbackInstrumentFeatures.getFeatureDf('ms_5')

    # This returns the value of the feature at the last time update.
    # It is a series with the instrumentIds as its index.
    ms5 = ms5Data.iloc[-1]

    customData = lookbackInstrumentFeatures.getFeatureDf('my_custom_feature_key')
    custom = customData.iloc[-1]

    predictions = ms5 / custom
    return predictions

Predictions from this function can be of many types:

- You can calculate (predict) the *FairValue* of a parameter, for example price.
- You can predict the probability that price will increase in the future.
- You can predict the ratio of price of two securities.
- You can predict the ranking of a basket of securities.

Output of the prediction function is used by the toolbox to make further trading decisions via the execution system. In competitions, this function is called by the scorer to assess your strategy's performance.

For example, the value of a roll of a die is a random variable. This variable, X, can take values 1 to 6, each with a probability of ⅙, but its exact value is unknown till the die roll is actually performed.

A **probability distribution** is a mathematical function that assigns a probability to every possible value of a random variable. For example, the random variable X that represents the value of a die roll can take values 1 to 6, each with a probability of ⅙, and has the distribution: 𝑃(𝑋=𝑖) = 1/6, where i = 1, 2, 3, 4, 5, 6.

Random variables can be separated into two different classes:

- Discrete random variables
- Continuous random variables

Discrete random variables have countably many outcomes. For example, the value of a coin toss can only be H or T, each with a probability of 1/2. Similarly, the value of a die roll can only be between 1 and 6.

For discrete random variables where X can take a finite set of values, the probability distribution function gives the probability p(x) that X is exactly equal to some value. p(x)=P(X=x), where x belongs to the finite set of values that are possible

Let's look at the distribution of a die roll below.

from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from auquanToolbox import dataloader as dl

class DiscreteRandomVariable:
    def __init__(self, a=0, b=1):
        self.variableType = ""
        self.low = a
        self.high = b
        return

    def draw(self, numberOfSamples):
        # randint's upper bound is exclusive, so add 1 to include self.high
        samples = np.random.randint(self.low, self.high + 1, numberOfSamples)
        return samples

A die roll can have 6 values, and each value occurs with a probability of 1/6. Each time we roll the die, we have an equal chance of getting each face. This is an example of a uniform distribution. The chart below shows the distribution for 10 die rolls.

DieRolls = DiscreteRandomVariable(1, 6)
plt.hist(DieRolls.draw(10), bins=[1, 2, 3, 4, 5, 6, 7], align='mid')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

In the short run this looks uneven, but if we take a large number of samples it becomes apparent that each face occurs the same percentage of times. The chart below shows the distribution for 10,000 die rolls.

plt.hist(DieRolls.draw(10000), bins=[1, 2, 3, 4, 5, 6, 7], align='mid')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

A collection of random variables is **independent and identically distributed** *(i.i.d.)* if each random variable has the same probability distribution as the others and all are mutually independent, i.e. the outcome of one doesn't affect the others. For example, random variables representing die rolls are i.i.d.: the value of one die roll does not affect the value of the next die roll.

A binomial distribution is used to describe successes and failures in a binary experiment. This can be very useful in an investment context as many of our choices tend to be binary like this. A single experiment which can result in success with probability p and failure with probability (1-p) is called a Bernoulli trial.

𝑝(1)=𝑃(𝑋=1)=𝑝

𝑝(0)=𝑃(𝑋=0)=1−𝑝

A binomial distribution is a set of n Bernoulli trials. There can be between 0 and n successes in n trials, with each trial having the same probability of success, p, and all of the trials being independent of each other. A binomial random variable is denoted as X **~** B(n, p).

The probability function of a binomial random variable p(x) is the probability that there are exactly 𝑥 successes in 𝑛 trials. This is obtained by choosing which x of the n trials result in success and multiplying by the probability that those x trials result in success and the remaining n−x trials result in failure. The resulting probability function is:

p(x) = P(X = x) = C(n, x) pˣ(1−p)ⁿ⁻ˣ, where C(n, x) = n! / (x!(n−x)!)

If X is a binomial random variable distributed with B(n, p), we can model it by extending our DiscreteRandomVariable class:

class BinomialRandomVariable(DiscreteRandomVariable):
    def __init__(self, numberOfTrials=10, probabilityOfSuccess=0.5):
        self.variableType = "Binomial"
        self.numberOfTrials = numberOfTrials
        self.probabilityOfSuccess = probabilityOfSuccess
        return

    def draw(self, numberOfSamples):
        samples = np.random.binomial(self.numberOfTrials, self.probabilityOfSuccess, numberOfSamples)
        return samples
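As a quick numerical check of the probability function above, this sketch compares the formula against scipy's binomial pmf (using `math.comb` and `scipy.stats.binom`, neither of which is required elsewhere in this lesson):

```python
from math import comb
from scipy.stats import binom

n, p = 5, 0.5
# pmf from the formula: C(n, x) * p^x * (1-p)^(n-x)
manual = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
# the same values from scipy's binomial distribution
library = [binom.pmf(x, n, p) for x in range(n + 1)]
print(manual)  # probabilities for x = 0..5; they sum to 1
```

The two sets of values agree, and the probabilities over all possible x sum to 1, as any probability distribution must.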

Let's draw the distribution of 10,000 samples of a binomial random variable B(5, 0.5), i.e. 5 trials with 50% probability of success.

StockProbabilities = BinomialRandomVariable(5, 0.5)
plt.hist(StockProbabilities.draw(10000), bins=range(0, 7), align='left')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

We see that the distribution is symmetric, since probability of success = probability of failure. If we skew the probabilities such that the probability of success is 0.25, we get an asymmetric distribution.

StockProbabilities = BinomialRandomVariable(5, 0.25)
plt.hist(StockProbabilities.draw(10000), bins=range(0, 7), align='left')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

We can extend this idea of an experiment following a binomial random variable into a framework that we call the **Binomial Model of Stock Price Movement**. This is used as one of the foundations for option pricing. In the Binomial Model, it is assumed that for any given time period a stock price can move up or down by a value determined by the up or down probabilities. This turns the stock price into the function of a binomial random variable, the magnitude of upward or downward movement, and the initial stock price. We can vary these parameters in order to approximate different stock price distributions.

For continuous random variables (where X can take an infinite number of values over a continuous range), the probability that X is exactly equal to any single value is zero. In this case, the probability distribution function gives the probability over intervals, which can include infinitely many outcomes. Here we define a **probability density function** (PDF), f(x), such that we can say:

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

For example, if you buy a piece of rope and the scale reads 1 meter, this value is possible but the probability that the length is exactly 1 meter is zero; you can keep increasing the accuracy of your instrument so that the probability of measuring exactly 1 m tends to zero. However, we might be able to say that there is a 99% probability that the length is between 99 cm and 1.01 m. Just like a probability density function f(x) gives the probability that a random variable lies in a range, a cumulative distribution function F(x) = P(X ≤ x) describes the probability that a random variable is less than or equal to a given value.
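To make the rope example concrete, here is a sketch using `scipy.stats.norm`, assuming (purely for illustration) that the measured length is normal with mean 1 m and standard deviation 5 mm:

```python
from scipy import stats

# Hypothetical measurement model (for illustration only):
# length ~ Normal(mean = 1.0 m, std = 0.005 m)
length = stats.norm(loc=1.0, scale=0.005)

# P(X == 1.0 exactly) is zero for a continuous variable, but the CDF
# gives the probability over an interval: P(a <= X <= b) = F(b) - F(a)
p_interval = length.cdf(1.01) - length.cdf(0.99)
print('P(0.99 m <= length <= 1.01 m) = %.4f' % p_interval)
```

Under this assumed model, the interval [0.99 m, 1.01 m] is the ±2σ band, so it carries roughly 95% probability even though any single exact length has probability zero.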

class ContinuousRandomVariable:
    def __init__(self, a=0, b=1):
        self.variableType = ""
        self.low = a
        self.high = b
        return

    def draw(self, numberOfSamples):
        samples = np.random.uniform(self.low, self.high, numberOfSamples)
        return samples

The most widely used distribution with widespread applications in finance is the normal distribution.

Many important tests and methods in statistics, and by extension, finance, are based on the assumption of normality. A large part of this is due to the results of the Central Limit Theorem (CLT) which states that the sum of many independent random variables tends toward a normal distribution, even if the original variables themselves are not normally distributed. The convenience of the normal distribution finds its way into certain algorithmic trading strategies as well.

class NormalRandomVariable(ContinuousRandomVariable):
    def __init__(self, mean=0, variance=1):
        ContinuousRandomVariable.__init__(self)
        self.variableType = "Normal"
        self.mean = mean
        self.standardDeviation = np.sqrt(variance)
        return

    def draw(self, numberOfSamples):
        samples = np.random.normal(self.mean, self.standardDeviation, numberOfSamples)
        return samples

Normal distributions are described by their mean (μ) and variance (σ², where σ is the standard deviation). The probability density of the normal distribution is:

f(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²))

and it is defined for −∞ < x < ∞. When we have μ = 0 and σ = 1, we call this the standard normal distribution.

By changing the mean and standard deviation of the normal distribution, we can change the depth and width of the bell curve. With a larger standard deviation, the values of the distribution are less concentrated around the mean.

mu_1 = 0
mu_2 = 0
sigma_1 = 1
sigma_2 = 2

x = np.linspace(-8, 8, 200)
y = (1/(sigma_1 * np.sqrt(2 * np.pi))) * np.exp(-(x - mu_1)*(x - mu_1) / (2 * sigma_1 * sigma_1))
z = (1/(sigma_2 * np.sqrt(2 * np.pi))) * np.exp(-(x - mu_2)*(x - mu_2) / (2 * sigma_2 * sigma_2))
plt.plot(x, y, x, z)
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()

In modern portfolio theory, stock returns are generally assumed to follow a normal distribution. We use the distribution to model returns instead of stock prices because prices cannot go below 0 while the normal distribution can take on all values on the real line, making it better suited to returns.

One major characteristic of a normal random variable is that a linear combination of two or more normal random variables is another normal random variable. This is useful for considering mean returns and variance of a portfolio of multiple stocks.
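A quick numerical sketch of this property: the sum of two independent normal samples has the mean and variance the theory predicts (μ₁+μ₂ and σ₁²+σ₂² for independent variables). The specific parameters here are arbitrary, chosen only for illustration.

```python
import numpy as np

np.random.seed(0)
A = np.random.normal(1, 2, 100000)  # N(mean=1, std=2)
B = np.random.normal(3, 4, 100000)  # N(mean=3, std=4)
C = A + B                           # also normal when A and B are independent

# Theory: mean = 1 + 3 = 4, variance = 2^2 + 4^2 = 20
print('mean: %.2f, variance: %.2f' % (np.mean(C), np.var(C)))
```

This is exactly why the mean and variance of a portfolio of normally distributed stock returns can be computed directly from the individual assets' means, variances, and weights.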

This rule of thumb states that given the mean and variance of a normal distribution, we can make the following statements:

- Around 68% of all observations fall within one standard deviation of the mean (μ±σ)
- Around 95% of all observations fall within two standard deviations of the mean (μ±2σ)
- Around 99.7% of all observations fall within three standard deviations of the mean (μ±3σ)
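We can verify these three numbers directly from the normal CDF:

```python
from scipy import stats

# Fraction of a normal distribution within k standard deviations of the mean
for k in (1, 2, 3):
    mass = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print('within %d sigma: %.4f' % (k, mass))
```

The printed values are approximately 0.6827, 0.9545, and 0.9973, matching the rule of thumb.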

**The power of normal distributions lies in the fact that using the central limit theorem, we can standardize different random variables so that they become normal random variables.**

We standardize a random variable X by subtracting the mean and dividing by the standard deviation, resulting in the standard normal random variable Z = (X − μ)/σ.

Let's look at the case where X **~** B(n,p) is a binomial random variable. In the case of a binomial random variable, the mean is μ=np and the variance is σ2=np(1−p).

n = 50
p = 0.25
X = BinomialRandomVariable(n, p)
X_samples = X.draw(10000)
Z_samples = (X_samples - n * p) / np.sqrt(n * p * (1 - p))

plt.hist(X_samples, bins=range(0, n + 2), align='left')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()

plt.hist(Z_samples, bins=20)
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()

The idea that we can standardize random variables is very important. By changing a random variable to a distribution that we are more familiar with, the standard normal distribution, we can easily answer any probability questions that we have about the original variable. This is dependent, however, on having a large enough sample size.

Let's assume that stock returns are normally distributed. Say that Y is the price of a stock. We will simulate its returns and plot it.

Y_initial = 100
X = NormalRandomVariable(0, 1)
Y_returns = X.draw(1000)  # generate 1000 daily returns
Y = pd.Series(np.cumsum(Y_returns), name='Y') + Y_initial
Y.plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Say that we have some other stock, Z, and that we have a portfolio of Y and Z, called W.

Z_initial = 50
Z_returns = X.draw(1000)
Z = pd.Series(np.cumsum(Z_returns), name='Z') + Z_initial
Z.plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Y_quantity = 20
Z_quantity = 50
Y_weight = Y_quantity/(Y_quantity + Z_quantity)
Z_weight = 1 - Y_weight

W_initial = Y_weight * Y_initial + Z_weight * Z_initial
W_returns = Y_weight * Y_returns + Z_weight * Z_returns
W = pd.Series(np.cumsum(W_returns), name='Portfolio') + W_initial
W.plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

We construct W by taking a weighted average of Y and Z based on their quantity.

pd.concat([Y, Z, W], axis=1).plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Note how the returns of our portfolio, W, are also normally distributed:

plt.hist(W_returns)
plt.xlabel('Return')
plt.ylabel('Occurrences')
plt.show()

Let's attempt to fit a probability distribution to the returns of a stock. We will take the returns of AAPL and try to fit a normal distribution to them. The first thing to check is whether the returns actually exhibit properties of a normal distribution. For this purpose, we will use the Jarque-Bera test, which indicates non-normality if the p-value is below a cutoff.

start = '2014-01-01'
end = '2016-12-31'
data = dl.load_data_nologs('nasdaq', ['AAPL'], start, end)
prices = data['ADJ CLOSE']['AAPL']  # adjusted close prices (key names depend on the toolbox version)

Reading AAPL

# Take the daily returns
returns = (prices/prices.shift(1) - 1).dropna()

# Set a cutoff
cutoff = 0.01

# Get the p-value of the normality test
k2, p_value = scipy.stats.mstats.normaltest(returns.values)
print("The JB test p-value is: ", p_value)
print("We reject the hypothesis that the data are normally distributed ", p_value < cutoff)
print("The skewness of the returns is: ", scipy.stats.skew(returns.values))
print("The kurtosis of the returns is: ", scipy.stats.kurtosis(returns.values))

plt.hist(returns, bins=20)
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

The JB test p-value is: 8.61e-22
We reject the hypothesis that the data are normally distributed: True
The skewness of the returns is: 0.381
The kurtosis of the returns is: 4.232

The low p-value of the test leads us to *reject* the null hypothesis that the returns are normally distributed. This is partly due to the high kurtosis (note that scipy reports excess kurtosis, which is 0 for a normal distribution).

We will proceed from here assuming that the returns are normally distributed so that we can go through the steps of fitting a distribution. We calculate the sample mean and standard deviation of the series and see how a theoretical normal curve fits against the actual values.

# Take the sample mean and standard deviation of the returns
sample_mean = np.mean(returns)
sample_std_dev = np.std(returns)
print("Mean: ", sample_mean)

Mean: -0.0005

x = np.linspace(-(sample_mean + 4 * sample_std_dev), (sample_mean + 4 * sample_std_dev), len(returns))
sample_distribution = ((1/(sample_std_dev * np.sqrt(2 * np.pi))) *
                       np.exp(-(x - sample_mean)*(x - sample_mean) / (2 * sample_std_dev * sample_std_dev)))
plt.hist(returns, bins=20, density=True)
plt.plot(x, sample_distribution)
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

Our theoretical curve for the returns has a substantially lower peak than the actual values, which makes sense because the returns are not actually normally distributed. This is again due to kurtosis: the returns have positive excess kurtosis, i.e. their peak is higher than that of a fitted normal distribution, and a higher kurtosis leads to a higher peak.

A major reason why it is so difficult to model prices and returns is due to the underlying probability distributions. A lot of theories and frameworks in finance require that data be somehow related to the normal distribution. This is a major reason why the normal distribution seems to be so prevalent. However, it is exceedingly difficult to find real-world data that fits nicely into the assumptions of normality. When actually implementing a strategy, you should not assume that data follows a distribution that it demonstrably does not unless there is a very good reason for it.

Generally, when trying to fit a probability distribution to real-world values, we should have a particular distribution (or distributions) in mind. There are a variety of tests for different distributions that we can apply to see what might be the best fit. In addition, as more information becomes available, it will become necessary to update the sample mean and standard deviation or maybe even to find a different distribution to more accurately reflect the new information. The shape of the distribution will change accordingly.


Covariance measures the extent to which the relationship between two variables is linear. The sign of the covariance shows the trend in the linear relationship between the variables, i.e if they tend to move together or in separate directions. A positive sign indicates that the variables are directly related, i.e. when one increases the other one also increases. A negative sign indicates that the variables are inversely related, so that when one increases the other decreases.

It is calculated as:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Note that Cov(X, X) = E[(X − E[X])²] = Var(X): when the two variables are identical, covariance is the same as variance.

Let's say we have two variables X and Y and we take the covariance of the two.

import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
X = np.random.rand(50)
Y = 2 * X + np.random.normal(0, 0.1, 50)
np.cov(X, Y)

What does this mean? To make better sense of this information, we introduce correlation.

Correlation uses information about the variance of X and Y to normalize this metric. The value of correlation coefficient is always between -1 and 1. Once we've normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare correlations.

To normalize covariance, consider

ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)

where ρ is the correlation coefficient of the two series X and Y, and σ_X, σ_Y are their standard deviations. Just like covariance, a positive coefficient indicates that the variables are directly related and a negative coefficient indicates that the variables are inversely related. The closer the correlation coefficient is to 0, the weaker the relationship between the variables.

Two random sets of data will have a correlation coefficient close to 0.

Correlation is simply a normalized form of covariance. They are otherwise the same and are often used semi-interchangeably in everyday conversation. It is obviously important to be precise with language when discussing the two, but conceptually they are almost identical.

print('Covariance of X and Y: %.2f'%np.cov(X, Y)[0, 1])
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

To get a sense of what correlated data looks like, let's plot two correlated datasets.

X = np.random.rand(50)
Y = X + np.random.normal(0, 0.1, 50)
plt.scatter(X,Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

And here's an inverse relationship:

X = np.random.rand(50)
Y = -X + np.random.normal(0, .1, 50)
plt.scatter(X,Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

import auquanToolbox.dataloader as dl
# Pull the pricing data for our two stocks and S&P 500
start = '2013-01-01'
end = '2015-12-31'
exchange = 'nasdaq'
base = 'SPX'
m1 = 'AAPL'
m2= 'LRCX'
data = dl.load_data_nologs(exchange, [m1, m2], start, end)
# key names below depend on the toolbox version
bench = data['ADJ CLOSE'][base]
a1 = data['ADJ CLOSE'][m1]
a2 = data['ADJ CLOSE'][m2]
plt.scatter(a1, a2)
plt.xlabel(m1)
plt.ylabel(m2)
plt.title('Stock prices from ' + start + ' to ' + end)
plt.show()
print('Correlation coefficients')
print('Correlation of %s and %s: %.2f'%(m1, m2, np.corrcoef(a1, a2)[0, 1]))
print('Correlation of %s and %s: %.2f'%(m1, base, np.corrcoef(a1, bench)[0, 1]))
print('Correlation of %s and %s: %.2f'%(m2, base, np.corrcoef(a2, bench)[0, 1]))

Once we've established that two series are probably related, we can use that in an effort to predict future values of the series.

Another application is to find uncorrelated assets to produce hedged portfolios - if the assets are uncorrelated, a drawdown in one will not correspond with a drawdown in another. This leads to a very stable return stream when many uncorrelated assets are combined.
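A minimal sketch of why this works, using simulated (hypothetical) return streams: averaging N uncorrelated assets shrinks portfolio volatility roughly by a factor of √N.

```python
import numpy as np

np.random.seed(42)
N, T = 25, 1000
# N uncorrelated assets, each with 1% daily return volatility (simulated)
asset_returns = np.random.normal(0, 0.01, (T, N))
# Equal-weight portfolio: average the asset returns each day
portfolio_returns = asset_returns.mean(axis=1)

print('single asset vol: %.4f' % asset_returns[:, 0].std())
print('portfolio vol:    %.4f' % portfolio_returns.std())  # roughly 0.01 / sqrt(25)
```

With 25 uncorrelated assets, the portfolio's daily volatility is about one fifth of a single asset's; any correlation between the assets would weaken this reduction.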

It's hard to rigorously determine whether or not a correlation is significant, especially when, as here, we do not know whether the variables are normally distributed. Their correlation coefficient is close to 1, so it's pretty safe to say that the two stock prices are correlated over the time period we use, but is this indicative of future correlation? If we examine the correlation of each of them with the S&P 500, we see that it is also quite high. So, we may be led to believe that two stocks have a relationship because of their high correlation, when in fact both are driven by a third factor (the market).

The problem is we may determine a good correlation by picking the right time period but it may not hold out of sample. To avoid this, one should compute the correlation of two quantities over many historical time periods and examine the distribution of the correlation coefficient.

As an example, remember that the correlation of AAPL and LRCX from 2013-01-01 to 2015-12-31 was 0.95. Let's take the rolling 60-day correlation between the two to see how that varies.

rolling_correlation = a1.rolling(60).corr(a2)
plt.plot(rolling_correlation)
plt.xlabel('Day')
plt.ylabel('60-day Rolling Correlation')
plt.show()

We see the correlation is not only unstable, but also reverses sign!

Another shortcoming is that two variables may be associated in different, predictable but non-linear ways which this analysis would not pick up. For example, a variable may be related to the rate of change of another which will not be detected by correlation alone.
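For example, if Y depends on X quadratically, the two are perfectly associated but the correlation coefficient is close to zero:

```python
import numpy as np

np.random.seed(1)
X = np.random.uniform(-1, 1, 1000)
Y = X ** 2  # Y is completely determined by X, but not linearly
print('Correlation of X and X^2: %.2f' % np.corrcoef(X, Y)[0, 1])
```

Here knowing X tells you Y exactly, yet the linear correlation is near zero because increases in X are matched by decreases in Y half of the time.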

Just remember to be careful not to interpret results where there are none.

We mentioned in the notebook on Expected Value and Standard Deviation that statistics derived from a sample (data available to us) may differ from the true value (population statistic). For example, we want to measure the population mean, but we can only calculate a sample mean. We then want to use the sample mean to estimate the population mean. We use confidence intervals in an attempt to determine how accurately our sample mean estimates the population mean.

A confidence interval gives an estimated range of values between which the variable is likely to lie. This range is calculated from a given set of data or from a probability distribution. The selection of a confidence level for the interval determines the probability that the confidence interval will contain the value of the variable *over many computations* (read the subtlety note below). So, a 95% confidence interval for a variable states that the interval will contain the true population mean 95% of the time.

For example, if you want to estimate the average height of students in a university, you might do this by measuring 100 students and estimating that the mean of that sample was close to the population mean. Let's try that.

np.random.seed(100)
# Let's define some 'true' population parameters, we'll pretend we don't know these.
POPULATION_MU = 64
POPULATION_SIGMA = 5
# Generate our sample by drawing from the population distribution
sample_size = 100
heights = np.random.normal(POPULATION_MU, POPULATION_SIGMA, sample_size)
mean_height = np.mean(heights)
print('sample mean: %.2f'%mean_height)

Unfortunately simply reporting the sample mean doesn't do much for us, as we don't know how it relates to the population mean. To get a sense for how it might relate, we can look for how much variance there is in our sample. Higher variance indicates instability and uncertainty.

print('sample standard deviation: %.2f'%np.std(heights))

This still doesn't help; to really get a sense of how our sample mean relates to the population mean we need to compute a standard error. The standard error is a measure of the uncertainty (the standard deviation) of the sample mean.

**IMPORTANT Computing a standard error involves assuming that the way you sample is unbiased and that the data are normal and independent. If these conditions are violated, your standard error will be wrong. There are ways of testing for this and correcting.**

The formula for standard error is:

SE = s / √n

where s is the sample standard deviation and n is the number of samples.

SE = np.std(heights) / np.sqrt(sample_size)
print('standard error: %.2f'%SE)

Assuming our data are normally distributed, we can use the standard error to compute our confidence interval.

To do this we set the desired confidence level (say 95%) and determine how many standard deviations around the mean contain 95% of the data for our data's distribution.

For example, for a normal distribution, 95% of the observations lie within about 1.96 standard deviations of the mean. When the samples are large enough (generally > 30 is taken as a threshold) the Central Limit Theorem applies and normality can be safely assumed; if sample sizes are smaller, a safer approach is to use a t-distribution with appropriately specified degrees of freedom. The actual way to compute the values is by using a cumulative distribution function (CDF). If you need more background on probability distributions, CDFs and inverse CDFs, read about them here and here. Look here for information on the t-distribution. We can check the 95% number using one of the Python functions.
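For instance, the inverse CDF (percent point function, `ppf`) gives the familiar 1.96 value directly:

```python
from scipy import stats

# A 95% two-sided interval leaves 2.5% in each tail, so we want the 97.5th percentile
z = stats.norm.ppf(0.975)
print('z for 95%% (normal): %.2f' % z)

# With small samples the t-distribution gives a wider cutoff, e.g. with 9 d.o.f.:
t = stats.t.ppf(0.975, df=9)
print('t for 95%%, 9 d.o.f.: %.2f' % t)
```

The normal cutoff is about 1.96, while the t cutoff with 9 degrees of freedom is about 2.26; the two converge as the degrees of freedom grow.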

NOTE: Be careful when applying the Central Limit Theorem, however, as many datasets in finance are fundamentally non-normal and it is not safe to apply the theorem casually or without attention to subtlety.

We can visualize the 95% mass bounds here.

#Generate a normal distribution
x = np.linspace(-5,5,100)
y = stats.norm.pdf(x,0,1)
plt.plot(x,y)
# Plot the intervals
plt.vlines(-1.96, 0, 1, colors='r', linestyles='dashed')
plt.vlines(1.96, 0, 1, colors='r', linestyles='dashed')
fill_x = np.linspace(-1.96, 1.96, 500)
fill_y = stats.norm.pdf(fill_x, 0, 1)
plt.fill_between(fill_x, fill_y)
plt.xlabel('$\sigma$')
plt.ylabel('Normal PDF')
plt.show()

Now, rather than reporting our sample mean without any sense of the probability of it being correct, we can compute an interval and be much more confident that the population mean lies in that interval. To do this we take our sample mean and report (sample mean − 1.96·SE, sample mean + 1.96·SE).

This works because assuming normality, that interval will contain the population mean 95% of the time.

Note that it is incorrect to say that "The true mean lies in a range with 95% probability," but unfortunately this is a very common misinterpretation. Rather, the 95% refers instead to the fact that over many computations of a 95% confidence interval, the true value will be in the interval in 95% of the cases (assuming correct calibration of the confidence interval, which we will discuss later).

But in fact for a single sample and the single confidence interval computed from it, we have no way of assessing the probability that the interval contains the population mean.

In the code below, we generate 100 95% confidence intervals around a mean of 0. Notice how for some of them, the mean lies completely outside the interval.

np.random.seed(8309)
n = 100 # number of samples to take
samples = [np.random.normal(loc=0, scale=1, size=100) for _ in range(n)]

fig, ax = plt.subplots(figsize=(10, 7))
for i in np.arange(1, n, 1):
    sample_mean = np.mean(samples[i]) # calculate sample mean
    se = stats.sem(samples[i]) # calculate sample standard error
    sample_ci = [sample_mean - 1.96 * se, sample_mean + 1.96 * se]
    if ((sample_ci[0] <= 0) and (0 <= sample_ci[1])):
        plt.plot((sample_ci[0], sample_ci[1]), (i, i), color='blue', linewidth=1)
        plt.plot(np.mean(samples[i]), i, 'bo')
    else:
        plt.plot((sample_ci[0], sample_ci[1]), (i, i), color='red', linewidth=1)
        plt.plot(np.mean(samples[i]), i, 'ro')
plt.axvline(x=0, ymin=0, ymax=1, linestyle='--', label='Population Mean')
plt.legend(loc='best')
plt.title('100 95% Confidence Intervals for mean of 0')
plt.show()

Going back to our heights example, we can manually construct confidence intervals using the t-distribution.

# standard error SE was already calculated
t_val = stats.t.ppf((1+0.95)/2, sample_size - 1) # d.o.f. = 100 - 1
print('sample mean height: %.2f'%mean_height)
print('t-value: %.2f'%t_val)
print('standard error: %.2f'%SE)
print('confidence interval: (%.2f, %.2f)'%(mean_height - t_val * SE, mean_height + t_val * SE))

or use a built-in function in scipy.stats for computing the interval

print('99% confidence interval: ', stats.t.interval(0.99, df=sample_size - 1, loc=mean_height, scale=SE))
print('95% confidence interval: ', stats.t.interval(0.95, df=sample_size - 1, loc=mean_height, scale=SE))
print('80% confidence interval: ', stats.t.interval(0.8, df=sample_size - 1, loc=mean_height, scale=SE))

Note that as your confidence increases, the interval necessarily widens.

Confidence intervals allow us to set our desired confidence, and then report a range that will likely contain the population mean. The higher our desired confidence, the larger the range we report. In general, one can never report a single point value, because the probability that any given point is the true population mean is incredibly small. Let's see how our intervals tighten as we change the sample size.

np.random.seed(10)
sample_sizes = [10, 100, 1000]  # example sample sizes
for s in sample_sizes:
    heights = np.random.normal(POPULATION_MU, POPULATION_SIGMA, s)
    SE = np.std(heights) / np.sqrt(s)
    print(stats.norm.interval(0.95, loc=np.mean(heights), scale=SE))

sample_size = 100
heights = np.random.normal(POPULATION_MU, POPULATION_SIGMA, sample_size)
SE = np.std(heights) / np.sqrt(sample_size)
(l, u) = stats.norm.interval(0.95, loc=np.mean(heights), scale=SE)
print('%.2f,%.2f'%(l, u))
plt.hist(heights, bins=20)
plt.xlabel('Height')
plt.ylabel('Frequency')
# Just for plotting
y_height = 5
plt.plot((l, u), (y_height, y_height), '-', color='r', linewidth=4, label='Confidence Interval')
plt.plot(np.mean(heights), y_height, 'o', color='r', markersize=10)
plt.legend(loc='best')
plt.show()

The computation of a standard deviation, standard error, and confidence interval all rely on certain assumptions. If these assumptions are violated then the 95% confidence interval will not necessarily contain the population parameter 95% of the time. Here is an example.

If your data generating process is autocorrelated, then estimates of standard deviation will be wrong. This is because autocorrelated processes tend to produce more extreme values than normally distributed processes. In such a process, new values depend on previous values, so series that are already far from the mean are likely to stay far from the mean.

def generate_autocorrelated_data(theta, mu, sigma, N):
    # Initialize the array
    X = np.zeros((N, 1))
    for t in range(1, N):
        # X_t = theta * X_{t-1} + epsilon
        X[t] = theta * X[t-1] + np.random.normal(mu, sigma)
    return X

X = generate_autocorrelated_data(0.5, 0, 1, 100)
plt.plot(X)
plt.xlabel('t')
plt.ylabel('X')
plt.show()

Now, we need to know the true population mean of this series. For larger sample sizes, the sample mean should still asymptotically converge to zero, because the process is still centred around zero. Let's check if that's true. We'll vary the number of samples drawn, and look for convergence as we increase the sample size.

sample_means = np.zeros(200-1)
for i in range(1, 200):
    X = generate_autocorrelated_data(0.5, 0, 1, i * 10)
    sample_means[i-1] = np.mean(X)

plt.bar(range(1, 200), sample_means);
plt.xlabel('Sample Size')
plt.ylabel('Sample Mean')
plt.show()

Definitely looks like there's some convergence, we can also check what the mean of the sample means is.

np.mean(sample_means)

Pretty close to zero. We could also derive symbolically that the mean is zero, but let's assume that we've convinced ourselves that the true population mean is 0. Now that we know the population mean, we can check the calibration of confidence intervals. Let's first compute an interval using our earlier process and then check whether the interval actually contains the true mean, 0.

def compute_unadjusted_interval(X):
    T = len(X)
    # Compute mu and sigma MLE
    mu = np.mean(X)
    sigma = np.std(X)
    SE = sigma / np.sqrt(T)
    # Compute the bounds
    return stats.norm.interval(0.95, loc=mu, scale=SE)

# We'll make a function that returns true when the computed bounds contain 0
def check_unadjusted_coverage(X):
    l, u = compute_unadjusted_interval(X)
    # Check to make sure l <= 0 <= u
    if l <= 0 and u >= 0:
        return True
    else:
        return False

Now we'll run many trials, in each we'll sample some data, compute a confidence interval, and then check if the confidence interval contains the population mean. Over many runs, we should expect to see 95% of the trials succeed if the intervals are computed correctly.

T = 100
trials = 500
times_correct = 0
for i in range(trials):
    X = generate_autocorrelated_data(0.5, 0, 1, T)
    if check_unadjusted_coverage(X):
        times_correct += 1

print('Empirical Coverage: ', times_correct/float(trials))
print('Expected Coverage: ', 0.95)

Clearly the coverage is wrong. In this case, we'd need to do what's known as a Newey-West correction on our standard error estimate to account for the autocorrelation.

In practice, it's important to check the assumptions you make. It is quick and easy to check whether your data are stationary and whether they are autocorrelated, and doing so can save you a lot of pain and suffering. A normality test such as Jarque-Bera is also a good idea, as it may detect distribution properties that violate the assumptions of many subsequent statistical analyses.
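As an illustrative sketch with simulated data (our own example): `scipy.stats.jarque_bera` returns a test statistic and a p-value, and a small p-value means normality is rejected.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
normal_data = np.random.normal(0, 1, 1000)    # should look normal
skewed_data = np.random.exponential(1, 1000)  # heavily right-skewed

_, p_normal = stats.jarque_bera(normal_data)
_, p_skewed = stats.jarque_bera(skewed_data)
print('p-value for normal data:', p_normal)
print('p-value for skewed data:', p_skewed)
```

The skewed sample gets a tiny p-value, so we would reject normality for it and be wary of applying normality-based statistics.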

This work derives from the works of Michael Halls-Moore on Quantstart and Quantopian Lecture Series.

The idea is simple.

Instead of just rewarding you for the people you invite, we will also reward you for the people your invitees invite, and their invitees, and so on. The diagram below gives an example of how it might work. Mo, Sid and Pooja are winners of certain cash prizes and we can see how that would lead to you earning $112, without doing any work yourself.

Here you can see that 'You' didn't invite anyone who directly won, so in most referral schemes you would be out of luck. However, with our fairer system, you are rewarded for your connections because they led to people submitting correct answers. Congratulations!

The formula for your reward is: R = P * 0.1/2^(n)

where: R = your reward

P = the winner's prize

n = the number of people between you and the winner in the referral chain (capped at n < 5)

This table shows how much you would earn if someone in your chain won a prize of just $1000. It also shows how many people could win you money if everyone just invites 5 of their friends or colleagues.

In theory, we could let this go on indefinitely, but we've decided to limit the chain length to 5. This will limit the overheads for our accountants, who wouldn't thank me for making them pay 100 people <$1 prizes.
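To make the arithmetic concrete, here is a tiny helper (the function name is ours, purely illustrative) implementing R = P * 0.1 / 2^n with the 5-link cap, assuming n = 0 for the person who directly invited the winner:

```python
def referral_reward(prize, n):
    """Reward for a winner n links away in your referral chain (capped at 5)."""
    if n >= 5:
        return 0.0
    return prize * 0.1 / 2 ** n

# For a $1000 prize: $100 for the direct inviter, then halving down the chain
for n in range(6):
    print(n, referral_reward(1000, n))
```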

>>>> You can invite people to our current competition here: quant-quest.auquan.com/competitions/asia-open-2019 <<<<<

Some of the finer print:

- People must be invited and signup using the invite a friend system (found on all competition homepages). Otherwise, we won't be able to attribute your invite. This also means each person can only be invited by one person.
- You cannot invite yourself and you must not break any of our general platform rules.
- Chains are limited to 5 people, as explained above.
- This promotion only applies to cash prizes
- We will only pay out prizes of greater than $5
- We initially intend to run this promotion until 31/12/2020, but reserve the right to stop this promotion at any time (we also might extend it!).
- Prizes are awarded at our sole discretion. For example, any evidence of trying to manipulate the system for unfair personal gain will result in your immediate disqualification from all payouts.

In this refresh article, we are going to take you back to basics and quickly recap expected value and different types of mean. Although this is basic it is important that you are fluent with these terms to master the more advanced lessons coming later in this series.

The expected value of a random variable is the probability-weighted average of all possible values. When these probabilities are equal, the expected value is the same as the arithmetic mean, defined as the sum of the observations divided by the number of observations:

mean = (x_1 + x_2 + ... + x_n) / n

where x_1, ..., x_n are our observations.

For example, if a die is rolled repeatedly many times, we expect all numbers from 1 to 6 to show up an equal number of times, so the expected value of rolling a six-sided die is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
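In code, with equal probabilities the expected value is just the arithmetic mean of the faces:

```python
# Each face of a fair die is equally likely, so E[X] is the plain average
values = [1, 2, 3, 4, 5, 6]
expected_value = sum(values) / len(values)
print(expected_value)  # 3.5
```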

from __future__ import print_function
from auquanToolbox.dataloader import load_data_nologs
import numpy as np
import scipy.stats as stats

# Let's say the random variables x1 and x2 have the following (example) values
x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]  # x2 is x1 plus one large outlier
print('Mean of x1:', sum(x1), '/', len(x1), '=', np.mean(x1))
print('Mean of x2:', sum(x2), '/', len(x2), '=', np.mean(x2))

When the probabilities of different observations are not equal, i.e. a random variable X takes value x_1 with probability p_1, x_2 with probability p_2, and so on, the expected value of X is the same as the *weighted* arithmetic mean. The weighted arithmetic mean is defined as

E[X] = p_1*x_1 + p_2*x_2 + ... + p_n*x_n

where p_1 + p_2 + ... + p_n = 1.

Therefore, the expected value is the average of the values obtained when you perform the experiment it represents many times. This follows from the law of large numbers: the average of the results obtained from a large number of repetitions of an experiment should be close to the expected value, and will tend to become closer as more trials are performed.
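A quick simulation (our own illustration) of the law of large numbers for the die example: the average of many rolls approaches the expected value of 3.5.

```python
import numpy as np

np.random.seed(0)
rolls = np.random.randint(1, 7, size=100000)  # 100,000 fair die rolls
print(np.mean(rolls))  # close to 3.5
```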

- The expected value of a constant c is the constant itself: E[c] = c
- The expected value is linear, i.e. E[aX + bY] = a*E[X] + b*E[Y]
- If X <= Y, then E[X] <= E[Y]
- The expected value is not multiplicative, i.e. E[XY] is not necessarily equal to E[X]*E[Y]. The amount by which they differ is called the covariance, covered in a later notebook. If X and Y are uncorrelated, E[XY] = E[X]*E[Y]

**Median**

The median is the number that appears in the middle of a list when it is sorted in increasing or decreasing order, i.e. the value in position (n+1)/2 when n is odd, and the average of the values in positions n/2 and n/2 + 1 when n is even. One advantage of using the median to describe data, compared to the mean, is that it is not skewed so much by extremely large or small values.

The median is the value that splits the data set in half, but it tells us nothing about how much smaller or larger the other values are.

print('Median of x1:', np.median(x1))
print('Median of x2:', np.median(x2))

**Mode**

The mode is the most frequently occurring value in a data set. The mode of a probability distribution is the value x at which its probability density (or mass) function takes its maximum value.

def mode(l):
    # Count the number of times each element appears in the list
    counts = {}
    for e in l:
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1

    # Return the elements that appear the most times
    maxcount = 0
    modes = {}
    for key in counts:
        if counts[key] > maxcount:
            maxcount = counts[key]
            modes = {key}
        elif counts[key] == maxcount:
            modes.add(key)

    if maxcount > 1 or len(l) == 1:
        return list(modes)
    return 'No mode'

print('All of the modes of x1:', mode(x1))

**Geometric mean**

The geometric mean measures the central tendency of a set of numbers using the product of their values (as opposed to the arithmetic mean, which uses their sum). The geometric mean is defined as the nth root of the product of n numbers:

G = (x_1 * x_2 * ... * x_n)^(1/n)

for observations x_1, ..., x_n. We can also rewrite it as an arithmetic mean using logarithms:

ln G = (ln x_1 + ln x_2 + ... + ln x_n) / n

The geometric mean is always less than or equal to the arithmetic mean (when working with nonnegative observations), with equality only when all of the observations are the same.

# Use scipy's gmean function to compute the geometric mean
print ('Geometric mean of x1:', stats.gmean(x1))
print ('Geometric mean of x2:', stats.gmean(x2))

If we have stock returns R_1, ..., R_T over different periods, we use the geometric mean to calculate the average return R_G, so that if the rate of return over the whole time period were constant and equal to R_G, the final price of the security would be the same as in the case of returns R_1, ..., R_T.
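As an illustration with made-up returns, the geometric average return R_G is the constant rate that compounds to the same final value:

```python
import numpy as np

returns = np.array([0.10, -0.05, 0.03])  # hypothetical period returns
growth = np.prod(1 + returns)            # total growth factor over all periods
R_G = growth ** (1 / len(returns)) - 1   # geometric average return
print(R_G)

# Compounding R_G over the same number of periods reproduces the final value
print((1 + R_G) ** len(returns), growth)
```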

**Harmonic mean**

The harmonic mean is less commonly used than the other types of means. It is defined as

H = n / (1/x_1 + 1/x_2 + ... + 1/x_n)

As with the geometric mean, we can rewrite the harmonic mean to look like an arithmetic mean. The reciprocal of the harmonic mean is the arithmetic mean of the reciprocals of the observations:

1/H = (1/x_1 + 1/x_2 + ... + 1/x_n) / n

The harmonic mean for positive numbers is always at most the geometric mean (which is at most the arithmetic mean), and they are equal only when all of the observations are equal.

print ('Harmonic mean of x1:', stats.hmean(x1))
print ('Harmonic mean of x2:', stats.hmean(x2))

The harmonic mean can be used when the data can be naturally phrased in terms of ratios.
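A classic ratio example (the numbers are ours): travelling the same distance at two different speeds, the true average speed is the harmonic mean of the speeds, not the arithmetic mean.

```python
from scipy import stats

# Drive 60 km/h out and 30 km/h back over the same distance
speeds = [60, 30]
print('Arithmetic mean:', sum(speeds) / len(speeds))           # 45 - overstates it
print('Harmonic mean (true average speed):', stats.hmean(speeds))  # close to 40
```

The harmonic mean is the right average here because the time spent at each speed is proportional to 1/speed.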

Variance and Standard Deviation are measures of dispersion of dataset from the mean.

We can define the mean absolute deviation as the average of the distances of observations from the arithmetic mean. We use the absolute value of the deviation, so that 5 above the mean and 5 below the mean both contribute 5, because otherwise the deviations always sum to 0:

MAD = (|x_1 - mu| + |x_2 - mu| + ... + |x_n - mu|) / n

where n is the number of observations and mu is their mean.

Instead of using absolute deviations, we can use the squared deviations. This is called **variance**: the average of the squared deviations around the mean:

sigma^2 = ((x_1 - mu)^2 + (x_2 - mu)^2 + ... + (x_n - mu)^2) / n

**Standard deviation** is simply the square root of the variance, sigma, and it is the easier of the two to interpret because it is in the same units as the observations.

Note that variance is additive while standard deviation is not.

print('Variance of x1:', np.var(x1))
print('Standard deviation of x1:', np.std(x1))
print('Variance of x2:', np.var(x2))
print('Standard deviation of x2:', np.std(x2))

Standard deviation indicates the amount of variation in a set of data values. A low standard deviation indicates that the data points tend to be close to the expected value, while a high standard deviation indicates that the data points are spread out over a wider range of values.

- The standard deviation of a constant is equal to 0
- Standard deviations cannot be added: sigma(X + Y) is not sigma(X) + sigma(Y)
- However, variances can be added. In fact, Var(X + Y) = Var(X) + Var(Y) + 2*Cov(X, Y)
- If X and Y are uncorrelated, Var(X + Y) = Var(X) + Var(Y) and sigma(X + Y) = sqrt(sigma(X)^2 + sigma(Y)^2)

If an experiment is performed daily and the results on one day do not affect the results on any other day, the daily observations are uncorrelated. If we measure the daily standard deviation as sigma_daily, then we can calculate the standard deviation for a year, also called the annualized standard deviation, as sigma_annual = sqrt(T) * sigma_daily, where T is the number of trading days.

In finance, we sum over all trading days and this annualized standard deviation is called **Volatility**.
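A sketch with simulated data (using 252 trading days per year, a common convention): annualized volatility scales the daily standard deviation by the square root of the number of trading days.

```python
import numpy as np

np.random.seed(1)
daily_returns = np.random.normal(0, 0.01, 252)  # simulated daily returns, ~1% daily vol
sigma_daily = np.std(daily_returns)
sigma_annual = np.sqrt(252) * sigma_daily       # annualized volatility
print('Daily volatility: %.4f' % sigma_daily)
print('Annualized volatility: %.4f' % sigma_annual)
```

With roughly 1% daily volatility this gives an annualized volatility in the neighbourhood of 16%.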

It is important to remember that when we are working with a subset of actual data, these computations will only give you sample statistics, that is mean and standard deviation of a sample of data. Whether or not this reflects the current true population mean and standard deviation is not always obvious, and more effort has to be put into determining that. This is especially problematic in finance because all data are time series and the mean and variance may change over time. In general do not assume that because something is true of your sample, it will remain true going forward.

When working with time series financial data, stationarity (or lack thereof!) is going to be a defining aspect of how you conduct your analyses. In this article, we're going to give you a quick refresher of what these terms mean and how they affect your data.

Let's start with importing the basics!

import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller
import matplotlib.pyplot as plt

A commonly untested assumption in time series analysis is the stationarity of the data. Data are stationary when the parameters of the data generating process do not change over time.

*The mean of a time series is mu_t = E[X_t]*

*The variance of a time series is sigma_t^2 = E[(X_t - mu_t)^2]*

**A time series is stationary in the mean if mu_t = mu for all t, i.e. the mean is constant with time**

**A time series is stationary in the variance if sigma_t^2 = sigma^2 for all t, i.e. the variance is constant with time**

As an example, let's consider two series, A and B. Series A is generated from a stationary process with fixed parameters, series B is generated with parameters that change over time.

def generate_datapoint(params):
    mu = params[0]
    sigma = params[1]
    return np.random.normal(mu, sigma)

Series A:

# Set the parameters and the number of datapoints
params = (0, 1)
T = 100
A = pd.Series(index=range(T))
A.name = 'A'
for t in range(T):
    A[t] = generate_datapoint(params)

plt.figure(figsize=(15,7))
plt.plot(A)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

Series B:

# Set the number of datapoints
T = 100
B = pd.Series(index=range(T))
B.name = 'B'
for t in range(T):
    # Now the parameters are dependent on time
    # Specifically, the mean of the series changes over time
    params = (t * 0.1, 1)
    B[t] = generate_datapoint(params)

plt.figure(figsize=(15,7))
plt.plot(B)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

Many statistical tests, deep down in the fine print of their assumptions, require that the data being tested are stationary. A stationary time series is simple to predict, as we can assume that future statistical properties are the same as or proportional to current statistical properties. If you naively use certain statistics on a non-stationary data set, you will get garbage results. As an example, let's take an average through our non-stationary series B.

m = np.mean(B)

plt.figure(figsize=(15,7))
plt.plot(B)
plt.hlines(m, 0, len(B), linestyles='dashed', colors='r')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

The computed mean will show the mean of all data points to date, but it won't be useful for any forecasting of future state. It's meaningless when compared with any specific time, as it's a collection of different states at different times mashed together. This is just a simple and clear example of why non-stationarity can screw with analysis; much more subtle problems can arise in practice.

Now we want to check for stationarity using a statistical test. We perform the standard Augmented Dickey-Fuller test.

def check_for_stationarity(X, cutoff=0.01):
    # We must observe a significant p-value to convince ourselves that the series is stationary
    pvalue = adfuller(X)[1]
    if pvalue < cutoff:
        print('p-value = ' + str(pvalue) + ' The series ' + X.name + ' is likely stationary.')
        return True
    else:
        print('p-value = ' + str(pvalue) + ' The series ' + X.name + ' is likely non-stationary.')
        return False

check_for_stationarity(A);
check_for_stationarity(B);

Sure enough, the changing mean of the series makes it non-stationary. Let's try an example that might be a little more subtle.

# Set the number of datapoints
T = 100
C = pd.Series(index=range(T))
C.name = 'C'
for t in range(T):
    # Now the parameters are dependent on time
    # Specifically, the mean of the series changes over time
    params = (np.sin(t), 1)
    C[t] = generate_datapoint(params)

plt.figure(figsize=(15,7))
plt.plot(C)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

A cyclic movement of the mean will be very difficult to tell apart from random noise. In practice on noisy data and limited sample size it can be hard to determine if a series is stationary and whether any drift is random noise or part of a trend. In each individual case the test may or may not pick up subtle effects like this.

check_for_stationarity(C);

Let's try this out on some real pricing data.

import auquan_toolbox.dataloader as dl
end = '2015-01-01'
start = '2007-01-01'
symbols =
data = dl.load_data_nologs('nasdaq', symbols , start, end)
X = data

check_for_stationarity(X);

plt.plot(X.index, X.values)
plt.ylabel('Price')
plt.legend()
plt.show()

Now let's take the delta of the series, giving us the additive returns. We'll check if this is stationary.

X1 = X.diff()[1:]  # drop the initial NaN from differencing
X1.name = X.name + ' Additive Returns'
check_for_stationarity(X1)
plt.plot(X1.index, X1.values)
plt.ylabel('Additive Returns')
plt.legend()
plt.show()

Seems like the additive returns are stationary. That means we will probably be able to model the returns much better than the price. It also means that the price series was integrated of order 1, i.e. I(1).

Let's also check the multiplicative returns.

X1 = X.pct_change()[1:]  # drop the initial NaN
X1.name = X.name + ' Multiplicative Returns'
check_for_stationarity(X1)
plt.plot(X1.index, X1.values)
plt.ylabel('Multiplicative Returns')
plt.legend()
plt.show()

Seems like the multiplicative returns are also stationary. Both the multiplicative and additive deltas on a series get at similar pieces of information, so it's not surprising both are stationary. In practice, this might not always be the case.

As always, you should not naively assume that because a time series is stationary in the past it will continue to be stationary in the future. Tests for consistency of stationarity such as cross-validation and out of sample testing are necessary. This is true of any statistical property, we just reiterate it here. Returns may also go in and out of stationarity, and may be stationary or non-stationary depending on the timeframe and sampling frequency.

The reason returns are usually used for modelling in quantitative finance is that they are far more stationary than prices. This makes them easier to model and returns forecasting more feasible. Forecasting prices is more difficult, as there are many trends induced by their integration. Even using a returns forecasting model to forecast price can be tricky, as any error in the returns forecast will be magnified over time.

An important concept in time series analysis is moving average representation. We will discuss this briefly here, but a more complete explanation is available in the AR, MA and ARMA Models sheets. Also, check Wikipedia as listed below.

This representation expresses any time series Y_t as

Y_t = b_0*e_t + b_1*e_(t-1) + b_2*e_(t-2) + ... + d_t

- e is the residuals or errors - a stochastic white noise process
- b_j are the moving average weights of the residuals
- d is a deterministic series

d is deterministic (such as a sine wave), so we could model it perfectly. The difference between the predictions from this deterministic model (d_t) and the actual observations (Y_t) leads to the residuals (e_t). The residuals are stochastic and are there to simulate new information occurring over time.

Specifically, e_t = Y_t - Y_hat_t, where Y_hat_t is the optimal forecast of Y_t (the actual observed value) using only information from before time t. In other words, the best prediction you can make at time t cannot account for the randomness in e_t.

Each b_j just says how much previous values of e influence Y_t.

We will denote a series integrated of order i as I(i).

A time series is said to be I(0) if the following condition holds for its moving average representation: the squared weights are summable, i.e.

b_0^2 + b_1^2 + b_2^2 + ... < infinity

In simpler terms, the autocorrelation of the series decays to 0 sufficiently quickly.

This property turns out to be true of all stationary series, but by itself it is not enough for stationarity to hold. This means that stationarity implies I(0), but I(0) does not imply stationarity. For more on orders of integration, please see the following links.

https://en.wikipedia.org/wiki/Order_of_integration

https://en.wikipedia.org/wiki/Wold%27s_theorem

In practice, testing whether the sum of the autocorrelations is finite may not be possible. It is possible in a mathematical derivation, but when we have a finite set of data and a finite number of estimated autocorrelations, the sum will always be finite. Given this difficulty, tests for I(0) rely on stationarity implying the property. If we find that a series is stationary, then it must also be I(0).

Let's take our original stationary series A. Because A is stationary, we know it's also I(0).

plt.plot(A)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

If one takes an I(0) series and cumulatively sums it (discrete integration), the new series will be I(1). Notice how this is related to the calculus concept of integration. The same relation applies in general: to get an I(n) series, take an I(0) series and iteratively take the cumulative sum n times.

Now let's make an I(1) series by taking the cumulative sum of A.

A1 = np.cumsum(A)
plt.plot(A1)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

**Once all the math has settled, remember that any stationary series is I(0).**

Finally, now that we've discussed stationarity and order of integration, we can discuss cointegration.

A linear combination of the time series X_1, X_2, ..., X_k is a new time series Y constructed as follows, for any set of real numbers b_1, ..., b_k:

Y = b_1*X_1 + b_2*X_2 + ... + b_k*X_k

For some set of time series X_1, X_2, ..., X_k, if all series are I(1), and some linear combination of them is I(0), we say the set of time series is cointegrated.

For example, suppose X_1, X_2, and X_3 are all I(1), and some combination, say X_2 - X_1, turns out to be I(0). In this case the time series are cointegrated.

The intuition here is that for some linear combination of the series, the result lacks much auto-covariance and is mostly noise. This is useful for cases such as pairs trading, in which we find two assets whose prices are cointegrated. Since the linear combination of their prices is noise, we can bet on the relationship mean reverting and place trades accordingly. See the Pairs Trading notebook for more information.

Let's make some data to demonstrate this.

# Length of series
N = 100
# Generate a stationary random X1
X1 = np.random.normal(0, 1, N)
# Integrate it to make it I(1)
X1 = np.cumsum(X1)
X1 = pd.Series(X1)
X1.name = 'X1'
# Make an X2 that is X1 plus some noise
X2 = X1 + np.random.normal(0, 1, N)
X2.name = 'X2'

plt.plot(X1)
plt.plot(X2)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend()
plt.show()

Because X2 is just an I(1) series plus some stationary noise, it should still be I(1). Let's check this by differencing it.

Z = X2.diff()[1:]  # drop the initial NaN from differencing
Z.name = 'Z'
check_for_stationarity(Z);

Looks good. Now to show cointegration we'll need to find some linear combination of X1 and X2 that is stationary. We can take Z = X2 - X1. All that's left over should be stationary noise by design. Let's check this.

Z = X2 - X1
Z.name = 'Z'
plt.plot(Z)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend()
plt.show()
check_for_stationarity(Z);

There are a bunch of ways to test for cointegration. This Wikipedia article describes some. In general we're just trying to solve for the coefficients that will produce an I(0) linear combination. If our best guess for these coefficients does not pass a stationarity check, then we reject the hypothesis that the set is cointegrated. This will lead to risk of Type II errors (false negatives), as we will not exhaustively test for stationarity on all coefficient combinations. However, Type II errors are generally okay here, as they are safe and do not lead to us making any wrong forecasts.

In practice, a common way to do this for pairs of time series is to use linear regression to estimate beta in the following model:

X2 = alpha + beta * X1 + epsilon

The idea is that if the two are cointegrated, we can remove X2's dependency on X1, leaving behind stationary noise. The combination Z = X2 - beta * X1 = alpha + epsilon should be stationary.

Let's try on some real data. We'll get prices and plot them first.

end = '2017-01-01'
start = '2011-01-01'
symbols =
data = dl.load_data_nologs('nasdaq', symbols , start, end)

X1 = data

X2 = data

plt.plot(X1)
plt.plot(X2)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend()
plt.show()

Now use linear regression to compute beta.

X1 = sm.add_constant(X1)
results = sm.OLS(X2, X1).fit()
# Get rid of the constant column
X1 = X1.iloc[:, 1]
print(results.params)

b = results.params[1]  # the slope coefficient, beta
Z = X2 - b * X1
Z.name = 'Z'
plt.plot(Z.index, Z.values)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend()
plt.show()
check_for_stationarity(Z);

We can see here that the resulting Z was likely stationary over the time frame we looked at. This leads us to accept the hypothesis that our two assets were cointegrated over that timeframe.

**This is only a forecast!**

Remember as with anything else, you should not assume that because some set of assets have passed a cointegration test historically, they will continue to remain cointegrated. You need to verify that consistent behaviour occurs, and use various model validation techniques as you would with any model.

One of the most important things done in finance is to make many independent bets. Here a quant would find many pairs of assets they hypothesize are cointegrated, and evenly distribute their dollars between them in bets. This only requires more than half of the asset pairs to remain cointegrated for the strategy to work.

Luckily there are some pre-built tests for cointegration. Here's one. Read up on the documentation on your own time.

from statsmodels.tsa.stattools import coint
coint(X1, X2)

For day one, we're starting off with the basics: Cheat Sheets.

Honestly, no one remembers all the functionality of every tool and library, so it's useful to have some quick summaries that you can look at to remind yourself of how to do specific tasks. This article by Robbie Allen has a nice collection that should cover most of the day to day tasks you'll be faced with as a machine learning engineer or data scientist.

The article contains some nice explainers and recaps of different concepts, as well as cheat sheets for common libraries such as Numpy, Pandas and Scikit Learn. We've added the Pandas for data science one below.

Check it out here: https://medium.com/machine-learning-in-practice/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6

(From: DataScienceFree via Robbie Allen on Medium)

There are several approaches to building text summarizers. Broadly, they can be split into two groups, defined by the type of output they produce:

- **Extractive**, where important sentences are selected from the input text to form the summary. Most summarization approaches today are extractive in nature.
- **Abstractive**, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is definitely more appealing, but much more difficult than extractive summarization.

From a practical point of view, abstractive summaries require an extra leap of faith for people to adopt. Not only do they have to accept that the summarizer has identified the key sentences, but also that it has created a summary that conveys that information without mistake. When high-value decisions are made on the back of this summary, it makes sense to minimise this risk.

For this reason, we’ll be looking at an extractive approach based on this paper from the University of Texas. Our pipeline is going to consist of the following steps:

- Preparation
- Elmo encoder
- Clustering
- Summarization

Let’s look at each of these steps.

These initial bits are just downloading packages that we are going to need and getting the data for our problem (If you’re not doing one of these competitions you might want to skip this step).

Firstly, we’re going to turn on auto-reload for this notebook so that any changes to the external libraries get automatically included here.

%load_ext autoreload

%autoreload 2

Next, here is a list of the non-standard packages we’re using:

# Uncomment to install any of the needed packages if you don't have them already

# !pip install tensorflow==1.14

# !pip install tensorflow_hub

# !pip install -U wget

# !pip install -U spacy

Spacy is an industrial-grade NLP library that we’re going to use as a pre-trained model to help separate our sample text into sentences. We’re using the English, core, web trained, medium model, so the code is pretty self-explanatory.

!python -m spacy download en_core_web_md

#this may take a little while

And the rest:

import numpy as np

import wget

from zipfile import ZipFile

import os

import glob

from rouge_score import rouge_scorer

from gensim.summarization.summarizer import summarize

from tqdm import tqdm

import spacy

nlp = spacy.load('en_core_web_md')

Finally, we’re ready to import the data and get started. We’ve just copied all of this straight out of the template file for the problem. If you want to understand what each part is specifically doing, check out the getting started section of the problem page.

# Copied from the template file
class Summarizer():

    def __init__(self):
        # Define any global variables or download any pre trained models here
        pass

    def download_data(self):
        '''
        Download data from S3 bucket
        '''
        if not os.path.isdir('summary_data_train'):
            print('Downloading!!')
            data_url = 'https://qq16-data.s3.us-east-2.amazonaws.com/summary_data_train.zip'
            filename = wget.download(data_url, out='.')
            with ZipFile(filename, 'r') as zipObj:
                zipObj.extractall()
        else:
            print('Already Downloaded data, delete previous folder to download again!!')

    def read_data(self):
        '''
        Read the data in the form of two dictionaries

        Returns:
            X_dict (dict: filename->text): dictionary containing text from the document
            y_dict (dict: filename->text): dictionary containing text from the summary
        '''
        X_filepath = glob.glob('summary_data_train/docs/*.txt')
        y_filepath = glob.glob('summary_data_train/summaries/*.txt')
        X_dict, y_dict = {}, {}
        for f1, f2 in zip(X_filepath, y_filepath):
            # Use the filename (without extension) as the dictionary key
            name = os.path.splitext(os.path.basename(f1))[0]
            with open(f1, 'r', encoding="utf8") as f:
                X_dict[name] = f.read()
            with open(f2, 'r', encoding="utf8") as f:
                y_dict[name] = f.read()
        return X_dict, y_dict

    def get_summary(self, raw_text):
        '''
        Function to generate the summary from raw text. THIS IS THE ONLY PART THAT IS REQUIRED TO BE FILLED.

        Args:
            raw_text (string): string of text from full document

        Returns:
            summary (text)
        '''
        # Using a naive model
        return summarize(raw_text, 0.55)

The last bit of preparation is downloading the data and reading it into the summarizer:

summarizer = Summarizer()

print('Step-1: Downloading data:')
try:
    summarizer.download_data()
except Exception as e:
    print(e)
    print('Failed to download data')

# Getting document and summary
print('Step-2: Reading documents and summaries:')
X_dict, y_dict = summarizer.read_data()

As you might be aware from the title, we are using ELMo as our encoder. Developed in 2018 by AllenNLP, ELMo goes beyond traditional embedding techniques. It uses a deep, bi-directional LSTM model to create word representations.

Rather than a dictionary of words and their corresponding vectors, ELMo analyses words within the context in which they are used. This is important because a word like “sick” may have entirely opposite meanings depending on the context. It is also character-based, allowing the model to form representations of out-of-vocabulary words.

This means that the way ELMo is used is quite different from word2vec or fastText. Rather than having a dictionary ‘look-up’ of words and their corresponding vectors, ELMo instead creates vectors on-the-fly by passing text through the deep learning model.
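As a toy illustration of why this matters, consider cosine similarity between word vectors. A static lookup assigns “sick” one vector for every usage, while a contextual model like ELMo would emit different vectors for different usages. The vectors below are made up purely for illustration, not real ELMo output:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d vectors purely for illustration (real ELMo vectors are 1024-d):
static_sick = np.array([0.2, 0.9, 0.1])     # one vector for every usage of "sick"
elmo_sick_ill = np.array([0.2, 0.9, 0.1])   # "I feel sick"
elmo_sick_cool = np.array([0.8, 0.1, 0.5])  # "that trick was sick"

print(cosine(static_sick, static_sick))     # ~1.0: a static lookup can't tell usages apart
print(cosine(elmo_sick_ill, elmo_sick_cool))  # well below 1.0: context changes the vector
```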

We’ll be using ELMo to create embeddings for the text.

Let’s start by working with just one document, parsing it into sentences using spaCy:

document = list(X_dict.values())[0]
summary = list(y_dict.values())[0]

doc = nlp(document)
sentences = [sent.text.strip() for sent in doc.sents]

We’re going to use TensorFlow Hub, as it makes NLP projects slightly easier. The ELMo model is fully pre-trained and includes all the layers, vectors and weights that we will need.

import tensorflow as tf
import tensorflow_hub as hub

url = "https://tfhub.dev/google/elmo/2"
embed = hub.Module(url)

Now we need to create the embeddings for our text:

embeddings = embed(
    sentences,
    signature="default",
    as_dict=True)["default"]

# Start a session and run ELMo to return the embeddings in variable x

Then start up TensorFlow to do its magic

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    x = sess.run(embeddings)

x.shape

We should see a shape of X by 1024, with X representing the number of sentences in the text and 1024 being the number of dimensions ELMo creates for each sentence.

At this point, we now have all our sentences represented as embeddings. Clustering is going to group sentences that are similar, from which we will be able to create our final summary.

The number of clusters we create here will be the number of sentences we’ll have in our summary. Currently, the approach here is pretty basic and that will limit the quality of our final summary. It would be wise to experiment with different clustering approaches and numbers of clusters to see how that affects your final results.

For our simple approach, we’ve just used a K-means clustering algorithm, splitting the text into 10 sentences. Do you think this would be good for small documents? What about large ones?
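A fixed 10 clusters means every document gets a 10-sentence summary, which is probably too long for short documents. One alternative (a hypothetical heuristic, not part of the original code) is to scale the cluster count with document length:

```python
def pick_n_clusters(sentences, ratio=0.1, min_k=3, max_k=10):
    # Hypothetical heuristic: summarize to roughly 10% of the sentences,
    # with a floor so tiny documents still get a few sentences and a cap
    # so huge documents don't produce an unreadably long summary.
    k = max(min_k, int(len(sentences) * ratio))
    # Never ask for more clusters than there are sentences
    return min(k, max_k, len(sentences))

print(pick_n_clusters(['s'] * 200))  # 10 (capped)
print(pick_n_clusters(['s'] * 40))   # 4  (10% of 40)
print(pick_n_clusters(['s'] * 5))    # 3  (floor)
```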

import numpy as np
from sklearn.cluster import KMeans

n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(x)

The final step is to take these clusters and create our summary. To do this we’re going to identify the center of the cluster and then find the sentence embedding that is closest to that point. These will then all be added together and returned as the text summary.

One thing to bear in mind here is that if you have some very different sentences, e.g. those containing an HTML link, they are likely to end up in a cluster of their own and be pulled into the summary, even though they are not actually useful there.

from sklearn.metrics import pairwise_distances_argmin_min

# For each cluster, record the average position of its sentences in the document,
# so the summary sentences can be ordered as they appear in the original text
avg = []
for j in range(n_clusters):
    idx = np.where(kmeans.labels_ == j)[0]
    avg.append(np.mean(idx))
# Find the sentence embedding closest to each cluster centre
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, x)
ordering = sorted(range(n_clusters), key=lambda k: avg[k])
summary = ' '.join([sentences[closest[idx]] for idx in ordering])
summary
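One way to guard against the caveat above, odd sentences such as those containing links claiming a cluster of their own, is to filter them out before embedding. A minimal sketch, assuming `sentences` is the list produced by spaCy earlier (the threshold and regex are illustrative choices):

```python
import re

def filter_sentences(sentences, min_words=4):
    # Drop sentences that contain a URL, and very short fragments,
    # so they can't dominate a cluster and leak into the summary
    url = re.compile(r'https?://|www\.')
    return [s for s in sentences
            if not url.search(s) and len(s.split()) >= min_words]

sents = ['See https://example.com for details.',
         'Ok.',
         'The model clusters sentence embeddings to pick a summary.']
print(filter_sentences(sents))
# Only the third sentence survives: the first has a URL, the second is too short
```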

There you go.
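The `rouge_scorer` imported at the start can be used to score your generated summary against the reference text in `y_dict`. As a rough, stdlib-only illustration of the idea behind ROUGE-1 recall (the fraction of reference unigrams that also appear in the candidate; the real scorer also clips repeated tokens, which this sketch does not):

```python
def rouge1_recall(candidate, reference):
    # Fraction of the reference's unigrams that also appear in the candidate
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return sum(1 for w in ref if w in cand) / len(ref)

# Hypothetical strings purely for illustration
print(rouge1_recall('the cat sat on the mat', 'the cat lay on the mat'))
# 5 of the 6 reference tokens are covered
```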

*This problem can be found on Auquan’s Quant Quest: links.auquan.com/asiaopen*