There are two steps to this process:

- First, you need to decide if you want to use the feature as a market feature, or an instrument feature. Instrument features are calculated per instrument (for example position, fees, moving average of instrument price). The toolbox auto-loops through all instruments to calculate features for you. Market features are calculated for the whole trading system (for example portfolio value).

- Instrument features are declared in the function *getInstrumentFeatureConfigDicts()*; market features are declared in the function *getMarketFeatureConfigDicts()*.

- Then you need to create one config dictionary per feature. Feature config dictionaries are defined with the following keys:

- *featureId*: A string representing the feature you want to use
- *featureKey* (optional): A string representing the key to access the value of this feature. If not present, the featureId is used
- *params* (optional): A dictionary which contains other parameters, if needed by the feature

Example: Creating a moving sum

def getInstrumentFeatureConfigDicts(self):
    msDict = {'featureKey': 'ms_5',
              'featureId': 'moving_sum',
              'params': {'period': 5, 'featureName': 'basis'}}
    return [msDict]

In order to help you quickly create features, the Auquan toolbox contains several common features predefined, for use in these dictionaries. To see the list, click here.

If you are a more advanced practitioner, you can create your own features to use in the config dicts. To see how, click here.

This function does exactly what it says on the label: it gets predictions. Specifically, the getPrediction function will combine all of the features you've created, plus any other logic you include, and calculate the predictions at each point in time. The getPrediction function will only use data up to that time stamp. This ensures that your prediction models don't suffer from lookahead bias.

You can call your previously created features by referencing their featureId. For example, to use the moving average feature and a custom feature, you would write the following code:

def getPrediction(self, time, currentMarketFeatures, instrumentManager):
    # holder for all the instrument features
    lookbackInstrumentFeatures = instrumentManager.getLookbackInstrumentFeatures()
    # dataframe for a historical instrument feature (ms_5 in this case).
    # The index is the timestamps of at most lookback data points.
    # The last row of this dataframe contains the features being calculated
    # in this update cycle, i.e. for this time.
    # The second to last row (if it exists) has the features for the previous update.
    # The columns of this dataframe are the stock symbols/instrumentIds.
    ms5Data = lookbackInstrumentFeatures.getFeatureDf('ms_5')
    # iloc[-1] returns the value of the feature at the last time update,
    # as a series with index as all the instrumentIds.
    ms5 = ms5Data.iloc[-1]
    customData = lookbackInstrumentFeatures.getFeatureDf('my_custom_feature_key')
    custom = customData.iloc[-1]
    predictions = ms5 / custom
    return predictions

Predictions from this function can be of many types:

- You can calculate (predict) the *FairValue* of a parameter, for example price.
- You can predict the probability that price will increase in the future.
- You can predict the ratio of price of two securities.
- You can predict the ranking of a basket of securities.

Output of the prediction function is used by the toolbox to make further trading decisions via the execution system. In competitions, this function is called by the scorer to assess your strategy's performance.

For example, the value of a roll of a die is a random variable. This variable, X, can take values 1 - 6, each with a probability of ⅙, but its exact value is unknown until the die roll is actually performed.

A **probability distribution** is a mathematical function that assigns a probability to every possible value of a random variable. For example, the random variable X that represents the value of a die roll can take values 1 to 6, each with a probability of ⅙, and has the distribution: P(X = i) = 1/6, where i = 1, 2, 3, 4, 5, 6.
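As a quick sanity check, a valid distribution must assign non-negative probabilities that sum to 1; for the die:

```python
# A fair die: P(X = i) = 1/6 for i = 1, ..., 6
die_distribution = {i: 1 / 6 for i in range(1, 7)}

# A valid distribution: non-negative probabilities that sum to 1
assert all(p >= 0 for p in die_distribution.values())
total = sum(die_distribution.values())
print(total)
```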

Random variables can be separated into two different classes:

- Discrete random variables
- Continuous random variables

Discrete random variables take countably many outcomes. For example, the value of a coin toss can only be H or T, each with a probability of 1/2. Similarly, the value of a die roll can only be between 1 and 6.

For discrete random variables where X can take a finite set of values, the probability distribution function gives the probability p(x) that X is exactly equal to some value. p(x)=P(X=x), where x belongs to the finite set of values that are possible

Let's look at the distribution of a die roll below.

from __future__ import division

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from auquanToolbox import dataloader as dl

class DiscreteRandomVariable:
    def __init__(self, a=0, b=1):
        self.variableType = ""
        self.low = a
        self.high = b
        return

    def draw(self, numberOfSamples):
        # np.random.randint excludes the upper bound, so add 1 to include it
        samples = np.random.randint(self.low, self.high + 1, numberOfSamples)
        return samples

A die roll can have 6 values, each value can occur with a probability of 1/6. Each time we roll the die, we have an equal chance of getting each face. This is an example of uniform distribution. The chart below shows the distribution for 10 die rolls.

DieRolls = DiscreteRandomVariable(1, 6)

plt.hist(DieRolls.draw(10), bins=[1, 2, 3, 4, 5, 6, 7], align='mid')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

In the short run, this looks uneven, but if we take a large number of samples it is apparent that each face is occurring the same percentage of times. The chart below shows the distribution for 10,000 die rolls

plt.hist(DieRolls.draw(10000), bins=[1, 2, 3, 4, 5, 6, 7], align='mid')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

A random variable is **independent and identically distributed** *(i.i.d)* if each random variable has the same probability distribution as the others and all are mutually independent, i.e. outcome of one doesn’t affect the other. For example, random variables representing die rolls are i.i.d. The value of one die roll does not affect the value of next die roll.

A binomial distribution is used to describe successes and failures in a binary experiment. This can be very useful in an investment context as many of our choices tend to be binary like this. A single experiment which can result in success with probability p and failure with probability (1-p) is called a Bernoulli trial.

𝑝(1)=𝑃(𝑋=1)=𝑝

𝑝(0)=𝑃(𝑋=0)=1−𝑝

A binomial distribution is a set of n Bernoulli trials. There can be between 0 and n successes in n trials, with each trial having the same probability of success, p, and all of the trials being independent of each other. A binomial random variable is denoted as X **~** B(n, p).

The probability function of a binomial random variable p(x) is the probability that there are exactly x successes in n trials. This is obtained by choosing which x of the n trials result in success, then multiplying by the probability that those x trials result in success and the remaining n−x trials result in failure. The resulting probability function is:

p(x) = P(X = x) = (n! / (x!(n−x)!)) p^x (1−p)^(n−x), for x = 0, 1, ..., n

If X is a binomial random variable distributed as B(n, p), its mean is μ = np and its variance is σ² = np(1−p).
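This probability function can be sanity-checked numerically: the pmf values should sum to 1, with mean np and variance np(1−p). A minimal sketch for B(5, 0.5):

```python
from math import comb

n, p = 5, 0.5
# p(x) = C(n, x) * p**x * (1-p)**(n-x)
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

print(pmf)        # peaks at x = 2 and 3 for p = 0.5
print(mean, var)  # equals n*p = 2.5 and n*p*(1-p) = 1.25
```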

class BinomialRandomVariable(DiscreteRandomVariable):
    def __init__(self, numberOfTrials=10, probabilityOfSuccess=0.5):
        self.variableType = "Binomial"
        self.numberOfTrials = numberOfTrials
        self.probabilityOfSuccess = probabilityOfSuccess
        return

    def draw(self, numberOfSamples):
        samples = np.random.binomial(self.numberOfTrials, self.probabilityOfSuccess, numberOfSamples)
        return samples

Let's draw the distribution of 10000 samples of a binomial random variable B(5, 0.5), i.e. 5 trials with 50% probability of success.

StockProbabilities = BinomialRandomVariable(5, 0.5)

plt.hist(StockProbabilities.draw(10000), bins=[0, 1, 2, 3, 4, 5, 6], align='left')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

We see that the distribution is symmetric, since probability of success = probability of failure. If we skew the probabilities such that the probability of success is 0.25, we get an asymmetric distribution.

StockProbabilities = BinomialRandomVariable(5, 0.25)

plt.hist(StockProbabilities.draw(10000), bins=[0, 1, 2, 3, 4, 5, 6], align='left')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

We can extend this idea of an experiment following a binomial random variable into a framework that we call the **Binomial Model of Stock Price Movement**. This is used as one of the foundations for option pricing. In the Binomial Model, it is assumed that for any given time period a stock price can move up or down by a value determined by the up or down probabilities. This turns the stock price into the function of a binomial random variable, the magnitude of upward or downward movement, and the initial stock price. We can vary these parameters in order to approximate different stock price distributions.
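The idea above can be sketched in a few lines; the step size, up-probability and horizon below are illustrative assumptions, not parameters from the toolbox:

```python
import numpy as np

np.random.seed(42)

initial_price = 100.0
up_move, down_move = 1.0, -1.0   # assumed magnitudes of each step
p_up = 0.5                       # assumed probability of an up move
n_steps = 252                    # roughly one trading year

# Each step is a Bernoulli trial: up with probability p_up, else down
steps = np.where(np.random.rand(n_steps) < p_up, up_move, down_move)
prices = initial_price + np.cumsum(steps)

print(prices[-1])  # final simulated price
```

Varying p_up, or the sizes of the up and down moves, changes the shape of the resulting price distribution.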

For continuous random variables (where X can take an infinite number of values over a continuous range), the probability that X is exactly equal to any single value is zero. In this case, the probability distribution function gives the probability over intervals, which can include infinitely many outcomes. Here we define a **probability density function** (PDF), f(x), such that we can say:

P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b

For example, if you buy a piece of rope and the scale reads 1 meter, this value is possible but the probability that the length is exactly 1 meter is zero; You can keep increasing the accuracy of your instrument so that the probability of measuring exactly 1m tends to zero. However, we might be able to say that there is 99% probability that the length is between 99cm and 1.01 m. Just like a probability distribution function f(x) gives the probability that a random variable lies in a range, a cumulative distribution function F(x) describes the probability that a random variable is less than or equal to a given value.
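The rope example can be made concrete with the normal CDF; the mean and standard deviation below are illustrative assumptions:

```python
from scipy import stats

# Suppose the rope length is normally distributed around 1 m with a 5 mm std dev
mu, sigma = 1.0, 0.005

# P(X = exactly 1 m) is zero, but P(0.99 <= X <= 1.01) is a difference of CDFs
prob = stats.norm.cdf(1.01, mu, sigma) - stats.norm.cdf(0.99, mu, sigma)
print(prob)  # about 0.954 under these assumptions (a +/- 2 sigma interval)
```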

class ContinuousRandomVariable:
    def __init__(self, a=0, b=1):
        self.variableType = ""
        self.low = a
        self.high = b
        return

    def draw(self, numberOfSamples):
        samples = np.random.uniform(self.low, self.high, numberOfSamples)
        return samples

The most widely used distribution with widespread applications in finance is the normal distribution.

Many important tests and methods in statistics, and by extension, finance, are based on the assumption of normality. A large part of this is due to the results of the Central Limit Theorem (CLT) which states that the sum of many independent random variables tends toward a normal distribution, even if the original variables themselves are not normally distributed. The convenience of the normal distribution finds its way into certain algorithmic trading strategies as well.
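A quick illustration of the CLT: sums of uniform random variables (which are far from normal individually) have skewness and excess kurtosis near the normal values of 0 once standardized:

```python
import numpy as np
from scipy import stats

np.random.seed(0)

# Sum 50 independent Uniform(0, 1) draws, 10,000 times
sums = np.random.rand(10000, 50).sum(axis=1)

# A sum of n Uniform(0, 1) variables has mean n/2 and variance n/12
z = (sums - 25) / np.sqrt(50 / 12)

print(stats.skew(z), stats.kurtosis(z))  # both near 0, as for a normal
```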

class NormalRandomVariable(ContinuousRandomVariable):
    def __init__(self, mean=0, variance=1):
        ContinuousRandomVariable.__init__(self)
        self.variableType = "Normal"
        self.mean = mean
        self.standardDeviation = np.sqrt(variance)
        return

    def draw(self, numberOfSamples):
        samples = np.random.normal(self.mean, self.standardDeviation, numberOfSamples)
        return samples

Normal distributions are described by their mean (μ) and variance (σ², where σ is the standard deviation). The probability density of the normal distribution is:

f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))

And is defined for −∞<x<∞. When we have μ=0 and σ=1, we call this the standard normal distribution.

By changing the mean and standard deviation of the normal distribution, we can change the depth and width of the bell curve. With a larger standard deviation, the values of the distribution are less concentrated around the mean.

mu_1 = 0
mu_2 = 0
sigma_1 = 1
sigma_2 = 2

x = np.linspace(-8, 8, 200)
y = (1 / (sigma_1 * np.sqrt(2 * np.pi))) * np.exp(-(x - mu_1) ** 2 / (2 * sigma_1 ** 2))
z = (1 / (sigma_2 * np.sqrt(2 * np.pi))) * np.exp(-(x - mu_2) ** 2 / (2 * sigma_2 ** 2))

plt.plot(x, y, x, z)
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()

In modern portfolio theory, stock returns are generally assumed to follow a normal distribution. We use the distribution to model returns instead of stock prices because prices cannot go below 0 while the normal distribution can take on all values on the real line, making it better suited to returns.

One major characteristic of a normal random variable is that a linear combination of two or more normal random variables is another normal random variable. This is useful for considering mean returns and variance of a portfolio of multiple stocks.
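For instance, the variance of a weighted sum obeys Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y), which is what makes portfolio variance tractable. A minimal numerical check (the weights and volatilities are made-up):

```python
import numpy as np

np.random.seed(1)

a, b = 0.6, 0.4                       # hypothetical portfolio weights
X = np.random.normal(0, 1, 100000)    # returns of one asset (assumed)
Y = np.random.normal(0, 2, 100000)    # returns of another asset (assumed)

portfolio = a * X + b * Y

# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
expected = (a**2 * X.var() + b**2 * Y.var()
            + 2 * a * b * np.cov(X, Y, ddof=0)[0, 1])  # ddof=0 to match .var()
print(portfolio.var(), expected)
```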

A common rule of thumb (the 68-95-99.7 rule) states that given the mean and variance of a normal distribution, we can make the following statements:

- Around 68% of all observations fall within one standard deviation of the mean (μ±σ)
- Around 95% of all observations fall within two standard deviations of the mean (μ±2σ)
- Around 99.7% of all observations fall within three standard deviations of the mean (μ±3σ)
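These figures can be checked directly with the standard normal CDF:

```python
from scipy import stats

# Mass within k standard deviations of the mean of a normal distribution
for k in (1, 2, 3):
    mass = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print('within %d standard deviation(s): %.4f' % (k, mass))
# ~0.6827, ~0.9545, ~0.9973
```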

**The power of normal distributions lies in the fact that using the central limit theorem, we can standardize different random variables so that they become normal random variables.**

We standardize a random variable X by subtracting the mean and dividing by the standard deviation, resulting in the standard normal random variable Z: Z = (X − μ) / σ.

Let's look at the case where X **~** B(n, p) is a binomial random variable. In the case of a binomial random variable, the mean is μ = np and the variance is σ² = np(1−p).

n = 50
p = 0.25
X = BinomialRandomVariable(n, p)
X_samples = X.draw(10000)
Z_samples = (X_samples - n * p) / np.sqrt(n * p * (1 - p))

plt.hist(X_samples, bins = range(0, n + 2), align = 'left')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()

plt.hist(Z_samples, bins=20)

plt.xlabel('Value')

plt.ylabel('Probability');

plt.show()

The idea that we can standardize random variables is very important. By changing a random variable to a distribution that we are more familiar with, the standard normal distribution, we can easily answer any probability questions that we have about the original variable. This is dependent, however, on having a large enough sample size.
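For example, using the standardized binomial above, we can approximate a binomial probability with the standard normal CDF and compare it to the exact answer (the threshold of 8 successes is just an illustrative choice):

```python
import numpy as np
from scipy import stats

n, p = 50, 0.25
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.cdf(8, n, p)              # exact binomial P(X <= 8)
approx = stats.norm.cdf((8.5 - mu) / sigma)   # normal approx. with continuity correction
print(exact, approx)
```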

Let's assume that stock returns are normally distributed. Say that Y is the price of a stock. We will simulate its returns and plot it.

Y_initial = 100
X = NormalRandomVariable(0, 1)
Y_returns = X.draw(1000) # generate 1000 daily returns
Y = pd.Series(np.cumsum(Y_returns), name = 'Y') + Y_initial
Y.plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Say that we have some other stock, Z, and that we have a portfolio of Y and Z, called W.

Z_initial = 50
Z_returns = X.draw(1000)
Z = pd.Series(np.cumsum(Z_returns), name = 'Z') + Z_initial
Z.plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Y_quantity = 20
Z_quantity = 50
Y_weight = Y_quantity / (Y_quantity + Z_quantity)
Z_weight = 1 - Y_weight

W_initial = Y_weight * Y_initial + Z_weight * Z_initial
W_returns = Y_weight * Y_returns + Z_weight * Z_returns
W = pd.Series(np.cumsum(W_returns), name = 'Portfolio') + W_initial
W.plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

We construct W by taking a weighted average of Y and Z based on their quantity.

pd.concat([Y, Z, W], axis = 1).plot()
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Note how the returns of our portfolio, W, are also normally distributed:

plt.hist(W_returns)
plt.xlabel('Return')
plt.ylabel('Occurrences')
plt.show()

Let's attempt to fit a probability distribution to the returns of a stock. We will take the returns of AAPL and try to fit a normal distribution to them. The first thing to check is whether the returns actually exhibit properties of a normal distribution. For this purpose, we will use the Jarque-Bera test, which indicates non-normality if the p-value is below a cutoff.

start = '2014-01-01'
end = '2016-12-31'
data = dl.load_data_nologs('nasdaq', ['AAPL'], start, end)
# assumes the loader returns a dict of DataFrames keyed by field, e.g. 'ADJ CLOSE'
prices = data['ADJ CLOSE']['AAPL']

Reading AAPL

# Take the daily returns
returns = (prices / prices.shift(1) - 1).dropna()

# Set a cutoff
cutoff = 0.01

# Get the p-value of the normality test
k2, p_value = scipy.stats.mstats.normaltest(returns.values)
print("The JB test p-value is: ", p_value)
print("We reject the hypothesis that the data are normally distributed ", p_value < cutoff)
print("The skewness of the returns is: ", scipy.stats.skew(returns.values))
print("The kurtosis of the returns is: ", scipy.stats.kurtosis(returns.values))

plt.hist(returns, bins = 20)
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

The JB test p-value is: 8.6122250241313796e-22
We reject the hypothesis that the data are normally distributed: True
The skewness of the returns is: 0.38138558143920764
The kurtosis of the returns is: 4.231909703399142

The low p-value of the test leads us to *reject* the null hypothesis that the returns are normally distributed. This is largely due to the heavy tails: scipy.stats.kurtosis reports excess kurtosis, which is 0 for a normal distribution, and the returns come in around 4.2.

We will proceed from here assuming that the returns are normally distributed so that we can go through the steps of fitting a distribution. We calculate the sample mean and standard deviation of the series and see how a theoretical normal curve fits against the actual values.

# Take the sample mean and standard deviation of the returns
sample_mean = np.mean(returns)
sample_std_dev = np.std(returns)
print("Mean: ", sample_mean)

Mean: -0.0004662534806121209

x = np.linspace(sample_mean - 4 * sample_std_dev, sample_mean + 4 * sample_std_dev, len(returns))
sample_distribution = ((1 / (sample_std_dev * np.sqrt(2 * np.pi))) *
                       np.exp(-(x - sample_mean) ** 2 / (2 * sample_std_dev ** 2)))

plt.hist(returns, bins=20, density=True)
plt.plot(x, sample_distribution)
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

Our theoretical curve for the returns has a substantially lower peak than the actual values, which makes sense because the returns are not actually normally distributed. This is again due to the kurtosis: the returns have an excess kurtosis of around 4.2, while the excess kurtosis of the normal distribution is 0. A higher kurtosis leads to a higher peak.

A major reason why it is so difficult to model prices and returns is due to the underlying probability distributions. A lot of theories and frameworks in finance require that data be somehow related to the normal distribution. This is a major reason why the normal distribution seems to be so prevalent. However, it is exceedingly difficult to find real-world data that fits nicely into the assumptions of normality. When actually implementing a strategy, you should not assume that data follows a distribution that it demonstrably does not unless there is a very good reason for it.

Generally, when trying to fit a probability distribution to real-world values, we should have a particular distribution (or distributions) in mind. There are a variety of tests for different distributions that we can apply to see what might be the best fit. In addition, as more information becomes available, it will become necessary to update the sample mean and standard deviation or maybe even to find a different distribution to more accurately reflect the new information. The shape of the distribution will change accordingly.

Both your installations failed because of the way pip has changed its handling of file dependencies. We're working on how we can fix this, but it's affecting everyone.

For now you can use the requirements.txt method you describe above, using the following command:

python3 -m pip install --user -r requirements.txt

If you get errors saying no distribution found, try removing the version number from the requirements.txt file for that package.

Another method is to install the auquan toolbox using your initial command, then install the dependencies as they error out. Make sure you install the versions on pandas and numpy listed in the requirements.txt file though, as our toolbox is version locked to these.

Finally, the Coursera practice competition isn't actually ready to be live yet! The first problem is similar to P1 of the other competition. See here:

https://quant-quest.auquan.com/competitions/auquan-qq1

We've hidden the competition now as the features weren't live :(. It will be back in early January so keep your eyes peeled!.

Best,

David

-> ModuleNotFoundError: No module named 'plotly'

So I thought of installing the requirements file by downloading it from https://auquan-website-data.s3.us-east-2.amazonaws.com/qq17/requirements.txt, putting it in a directory and trying to install. But I am still getting an error:

-> Could not find a version that satisfies the requirement tensorboardXwheel (from -r requirements.txt (line 1)) (from versions: )

No matching distribution found for tensorboardXwheel (from -r requirements.txt (line 1))

Can anyone help me in resolving the issue?

Covariance measures the extent to which the relationship between two variables is linear. The sign of the covariance shows the trend in the linear relationship between the variables, i.e if they tend to move together or in separate directions. A positive sign indicates that the variables are directly related, i.e. when one increases the other one also increases. A negative sign indicates that the variables are inversely related, so that when one increases the other decreases.

It is calculated as:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Note that Cov(X, X) = Var(X).

When the two variables are identical, covariance is the same as variance.
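This identity is easy to verify numerically with numpy's sample covariance:

```python
import numpy as np

np.random.seed(3)
X = np.random.rand(50)

cov_xx = np.cov(X, X)[0, 1]   # covariance of X with itself
var_x = np.var(X, ddof=1)     # sample variance (np.cov uses ddof=1 by default)
print(cov_xx, var_x)
```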

Let's say we have two variables X and Y, and we take the covariance of the two.

import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
X = np.random.rand(50)
Y = 2 * X + np.random.normal(0, 0.1, 50)
np.cov(X, Y)

What does this mean? To make better sense of this information, we introduce correlation.

Correlation uses information about the variance of X and Y to normalize this metric. The value of correlation coefficient is always between -1 and 1. Once we've normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare correlations.

To normalize covariance, consider

ρ = Cov(X, Y) / (σX σY)

where ρ is the correlation coefficient of the two series X and Y, and σX and σY are their standard deviations. Just like covariance, a positive coefficient indicates that the variables are directly related and a negative coefficient indicates that the variables are inversely related. The closer to 0 the correlation coefficient is, the weaker the relationship between the variables.

Two random sets of data will have a correlation coefficient close to 0.

Correlation is simply a normalized form of covariance. They are otherwise the same and are often used semi-interchangeably in everyday conversation. It is obviously important to be precise with language when discussing the two, but conceptually they are almost identical.

print('Covariance of X and Y: %.2f'%np.cov(X, Y)[0, 1])
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

To get a sense of what correlated data looks like, let's plot two correlated datasets.

X = np.random.rand(50)
Y = X + np.random.normal(0, 0.1, 50)
plt.scatter(X, Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

And here's an inverse relationship:

X = np.random.rand(50)
Y = -X + np.random.normal(0, .1, 50)
plt.scatter(X,Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

import auquanToolbox.dataloader as dl
# Pull the pricing data for our two stocks and S&P 500
start = '2013-01-01'
end = '2015-12-31'
exchange = 'nasdaq'
base = 'SPX'
m1 = 'AAPL'
m2= 'LRCX'
data = dl.load_data_nologs(exchange, [m1, m2], start, end)
# assumes the loader also returns the benchmark (SPX) and a dict of DataFrames keyed by field
bench = data['ADJ CLOSE'][base]
a1 = data['ADJ CLOSE'][m1]
a2 = data['ADJ CLOSE'][m2]
plt.scatter(a1, a2)
plt.xlabel(m1)
plt.ylabel(m2)
plt.title('Stock prices from ' + start + ' to ' + end)
plt.show()
print('Correlation coefficients')
print('Correlation of %s and %s: %.2f'%(m1, m2, np.corrcoef(a1, a2)[0, 1]))
print('Correlation of %s and %s: %.2f'%(m1, base, np.corrcoef(a1, bench)[0, 1]))
print('Correlation of %s and %s: %.2f'%(m2, base, np.corrcoef(a2, bench)[0, 1]))

Once we've established that two series are probably related, we can use that in an effort to predict future values of the series.

Another application is to find uncorrelated assets to produce hedged portfolios - if the assets are uncorrelated, a drawdown in one will not correspond with a drawdown in another. This leads to a very stable return stream when many uncorrelated assets are combined.

It's hard to rigorously determine whether or not a correlation is significant, especially when, as here, we don't know whether the variables are normally distributed. Their correlation coefficient is close to 1, so it's pretty safe to say that the two stock prices are correlated over the time period we use, but is this indicative of future correlation? If we examine the correlation of each of them with the S&P 500, we see that it is also quite high. So, we may be led to believe that two stocks have a relationship because of their high correlation, when in fact they are both driven by a third factor, the market.

The problem is we may determine a good correlation by picking the right time period but it may not hold out of sample. To avoid this, one should compute the correlation of two quantities over many historical time periods and examine the distribution of the correlation coefficient.

As an example, remember that the correlation of AAPL and LRCX from 2013-01-01 to 2015-12-31 was 0.95. Let's take the rolling 60-day correlation between the two to see how that varies.

rolling_correlation = a1.rolling(60).corr(a2)
plt.plot(rolling_correlation)
plt.xlabel('Day')
plt.ylabel('60-day Rolling Correlation')
plt.show()

We see the correlation is not only unstable, but also reverses sign!

Another shortcoming is that two variables may be associated in different, predictable but non-linear ways which this analysis would not pick up. For example, a variable may be related to the rate of change of another which will not be detected by correlation alone.
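A classic illustration: Y = X² is completely determined by X, yet over a symmetric range the linear correlation is close to zero, so correlation alone would miss the relationship:

```python
import numpy as np

np.random.seed(7)
X = np.random.uniform(-1, 1, 10000)
Y = X ** 2   # perfectly (but non-linearly) dependent on X

corr = np.corrcoef(X, Y)[0, 1]
print(corr)  # near zero, even though Y is a deterministic function of X
```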

Just remember to be careful not to interpret results where there are none.

We mentioned in the notebook on Expected Value and Standard Deviation that statistics derived from a sample (data available to us) may differ from the true value (population statistic). For example, we want to measure the population mean, but we can only calculate a sample mean. We then want to use the sample mean to estimate the population mean. We use confidence intervals in an attempt to determine how accurately our sample mean estimates the population mean.

A confidence interval gives an estimated range of values between which the variable is likely to lie. This range is calculated from a given set of data or from a probability distribution. The selection of a confidence level for the interval determines the probability that the confidence interval will contain the value of the variable *over many computations* (read the subtlety note below). So, a 95% confidence interval for a variable states that the interval will contain the true population mean 95% of the time.

For example, if you want to estimate the average height of students in a university, you might do this by measuring 100 students and estimating that the mean of that sample was close to the population. Let's try that.

np.random.seed(100)
# Let's define some 'true' population parameters, we'll pretend we don't know these.
POPULATION_MU = 64
POPULATION_SIGMA = 5
# Generate our sample by drawing from the population distribution
sample_size = 100
heights = np.random.normal(POPULATION_MU, POPULATION_SIGMA, sample_size)
mean_height = np.mean(heights)
print('sample mean: %.2f'%mean_height)

Unfortunately simply reporting the sample mean doesn't do much for us, as we don't know how it relates to the population mean. To get a sense for how it might relate, we can look for how much variance there is in our sample. Higher variance indicates instability and uncertainty.

print('sample standard deviation: %.2f'%np.std(heights))

This still doesn't help; to really get a sense of how our sample mean relates to the population mean we need to compute a standard error. The standard error is a measure of the variance of the sample mean.

**IMPORTANT Computing a standard error involves assuming that the way you sample is unbiased and that the data are normal and independent. If these conditions are violated, your standard error will be wrong. There are ways of testing for this and correcting.**

The formula for standard error is

SE = s / √n

where s is the sample standard deviation and n is the number of samples.

SE = np.std(heights) / np.sqrt(sample_size)
print('standard error: %.2f'%SE)

Assuming our data are normally distributed, we can use the standard error to compute our confidence interval.

To do this we set the desired confidence level (say 95%) and determine how many standard deviations from the mean a range must span to contain 95% of our data's distribution.

For example, for a normal distribution, 95% of the observations lie within 1.96 standard deviations of the mean. When the samples are large enough (generally > 30 is taken as a threshold) the Central Limit Theorem applies and normality can be safely assumed; if sample sizes are smaller, a safer approach is to use a t-distribution with appropriately specified degrees of freedom. The actual way to compute the values is by using a cumulative distribution function (CDF). If you need more background on probability distributions, CDFs and inverse CDFs, read about them here and here. Look here for information on the t-distribution. We can check the 95% number using one of the Python functions.

NOTE: Be careful when applying the Central Limit Theorem, however, as many datasets in finance are fundamentally non-normal and it is not safe to apply the theorem casually or without attention to subtlety.
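Checking the 95% number with scipy's inverse CDF (`ppf`):

```python
from scipy import stats

z = stats.norm.ppf(0.975)   # inverse CDF: the 97.5th percentile
print(z)                    # approximately 1.96

mass = stats.norm.cdf(z) - stats.norm.cdf(-z)
print(mass)                 # 0.95
```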

We can visualize the 95% mass bounds here.

#Generate a normal distribution
x = np.linspace(-5,5,100)
y = stats.norm.pdf(x,0,1)
plt.plot(x,y)
# Plot the intervals
plt.vlines(-1.96, 0, 1, colors='r', linestyles='dashed')
plt.vlines(1.96, 0, 1, colors='r', linestyles='dashed')
fill_x = np.linspace(-1.96, 1.96, 500)
fill_y = stats.norm.pdf(fill_x, 0, 1)
plt.fill_between(fill_x, fill_y)
plt.xlabel('$\sigma$')
plt.ylabel('Normal PDF')
plt.show()

Now, rather than reporting our sample mean without any sense of the probability of it being correct, we can compute an interval and be much more confident that the population mean lies in that interval. To do this we take our sample mean and report the interval (sample mean − 1.96·SE, sample mean + 1.96·SE).

This works because assuming normality, that interval will contain the population mean 95% of the time.

Note that it is incorrect to say that "The true mean lies in a range with 95% probability," but unfortunately this is a very common misinterpretation. Rather, the 95% refers instead to the fact that over many computations of a 95% confidence interval, the true value will be in the interval in 95% of the cases (assuming correct calibration of the confidence interval, which we will discuss later).

But in fact for a single sample and the single confidence interval computed from it, we have no way of assessing the probability that the interval contains the population mean.

In the code below, we generate 100 95% confidence intervals around a true mean of 0. Notice how for some of them, the true mean lies completely outside the interval.

np.random.seed(8309)
n = 100 # number of samples to take
samples = [np.random.normal(loc=0, scale=1, size=100) for _ in range(n)]

fig, ax = plt.subplots(figsize=(10, 7))
for i in np.arange(1, n, 1):
    sample_mean = np.mean(samples[i]) # calculate sample mean
    se = stats.sem(samples[i]) # calculate sample standard error
    h = se * stats.t.ppf((1 + 0.95) / 2, len(samples[i]) - 1)
    sample_ci = [sample_mean - h, sample_mean + h]
    if ((sample_ci[0] <= 0) and (0 <= sample_ci[1])):
        plt.plot((sample_ci[0], sample_ci[1]), (i, i), color='blue', linewidth=1)
        plt.plot(sample_mean, i, 'bo')
    else:
        plt.plot((sample_ci[0], sample_ci[1]), (i, i), color='red', linewidth=1)
        plt.plot(sample_mean, i, 'ro')
plt.axvline(x=0, ymin=0, ymax=1, linestyle='--', label='Population Mean')
plt.legend(loc='best')
plt.title('100 95% Confidence Intervals for mean of 0')
plt.show()

Going back to our heights example, we can manually construct confidence intervals using the t-distribution:

# standard error SE was already calculated
t_val = stats.t.ppf((1 + 0.95) / 2, 9) # d.o.f. = 10 - 1
print('sample mean height: %.2f' % mean_height)
print('t-value: %.2f' % t_val)
print('standard error: %.2f' % SE)
print('confidence interval: (%.2f, %.2f)' % (mean_height - t_val * SE, mean_height + t_val * SE))

or use a built-in function in scipy.stats for computing the interval

print('99% confidence interval: ', stats.t.interval(0.99, df=9, loc=mean_height, scale=SE))
print('95% confidence interval: ', stats.t.interval(0.95, df=9, loc=mean_height, scale=SE))
print('80% confidence interval: ', stats.t.interval(0.80, df=9, loc=mean_height, scale=SE))

Note that as your confidence increases, the interval necessarily widens.

Confidence intervals allow us to set our desired confidence, and then report a range that will likely contain the population mean. The higher our desired confidence, the larger the range we report. In general one can never report a single point value, because the probability that any given point is the true population mean is incredibly small. Let's see how our intervals tighten as we change the sample size.

np.random.seed(10)
sample_sizes = [10, 100, 1000]
for s in sample_sizes:
    heights = np.random.normal(POPULATION_MU, POPULATION_SIGMA, s)
    SE = np.std(heights) / np.sqrt(s)
    print(stats.norm.interval(0.95, loc=np.mean(heights), scale=SE))

sample_size = 100
heights = np.random.normal(POPULATION_MU, POPULATION_SIGMA, sample_size)
SE = np.std(heights) / np.sqrt(sample_size)
(l, u) = stats.norm.interval(0.95, loc=np.mean(heights), scale=SE)
print('%.2f,%.2f' % (l, u))
plt.hist(heights, bins=20)
plt.xlabel('Height')
plt.ylabel('Frequency')
# Just for plotting
y_height = 5
plt.plot((l, u), (y_height, y_height), '-', color='r', linewidth=4, label='Confidence Interval')
plt.plot(np.mean(heights), y_height, 'o', color='r', markersize=10)
plt.legend(loc='best')
plt.show()

The computation of a standard deviation, standard error, and confidence interval all rely on certain assumptions. If these assumptions are violated then the 95% confidence interval will not necessarily contain the population parameter 95% of the time. Here is an example.

If your data generating process is autocorrelated, then estimates of standard deviation will be wrong. This is because autocorrelated processes tend to produce more extreme values than normally distributed processes. In such a process, new values depend on previous values, so series that are already far from the mean are likely to stay far from the mean.

def generate_autocorrelated_data(theta, mu, sigma, N):
    # Initialize the array
    X = np.zeros((N, 1))
    for t in range(1, N):
        # X_t = theta * X_{t-1} + epsilon
        X[t] = theta * X[t - 1] + np.random.normal(mu, sigma)
    return X

X = generate_autocorrelated_data(0.5, 0, 1, 100)
plt.plot(X)
plt.xlabel('t')
plt.ylabel('X')
plt.show()

Now, we need to know the true population mean of this series. For larger sample sizes, the sample mean should still asymptotically converge to zero, because the process is still centred around zero, but let's check whether that's true. We'll vary the number of samples drawn and look for convergence as we increase the sample size.

sample_means = np.zeros(200 - 1)
for i in range(1, 200):
    X = generate_autocorrelated_data(0.5, 0, 1, i * 10)
    sample_means[i - 1] = np.mean(X)

plt.bar(range(1, 200), sample_means)
plt.xlabel('Sample Size (in multiples of 10)')
plt.ylabel('Sample Mean')
plt.show()

It definitely looks like there's some convergence; we can also check what the mean of the sample means is.

print(np.mean(sample_means))

Pretty close to zero. We could also derive symbolically that the mean is zero, but let's assume that we've convinced ourselves that the true population mean is 0. Now that we know the population mean, we can check the calibration of confidence intervals. Let's first compute an interval using our earlier process and then check whether the interval actually contains the true mean, 0.

def compute_unadjusted_interval(X):
    T = len(X)
    # Compute mu and sigma MLE
    mu = np.mean(X)
    sigma = np.std(X)
    SE = sigma / np.sqrt(T)
    # Compute the bounds
    return stats.norm.interval(0.95, loc=mu, scale=SE)

# We'll make a function that returns True when the computed bounds contain 0
def check_unadjusted_coverage(X):
    l, u = compute_unadjusted_interval(X)
    # Check to make sure l <= 0 <= u
    return l <= 0 and u >= 0

Now we'll run many trials, in each we'll sample some data, compute a confidence interval, and then check if the confidence interval contains the population mean. Over many runs, we should expect to see 95% of the trials succeed if the intervals are computed correctly.

T = 100
trials = 500
times_correct = 0
for i in range(trials):
    X = generate_autocorrelated_data(0.5, 0, 1, T)
    if check_unadjusted_coverage(X):
        times_correct += 1

print('Empirical Coverage: ', times_correct / float(trials))
print('Expected Coverage: ', 0.95)

Clearly the coverage is wrong. In this case, we'd need to do what's known as a Newey-West correction on our standard error estimate to account for the autocorrelation.

In practice, it's important to check the assumptions you make. It is quick and easy to test your data for autocorrelation (for example by examining the sample autocorrelation function), and it can save you a lot of pain and suffering to do so. A normality test such as Jarque-Bera is also a good idea, as it may detect distributional properties that violate the assumptions of many subsequent statistical analyses.
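A minimal sketch of the Jarque-Bera test using scipy, applied to normal and deliberately skewed synthetic data (the seed and sample sizes are arbitrary):

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)
normal_data = np.random.normal(0, 1, 1000)
skewed_data = np.random.exponential(1, 1000)

# scipy's Jarque-Bera test returns (statistic, p-value);
# a small p-value is evidence against normality
jb_norm, p_norm = stats.jarque_bera(normal_data)
jb_skew, p_skew = stats.jarque_bera(skewed_data)

print('normal data: JB = %.2f, p = %.3f' % (jb_norm, p_norm))
print('skewed data: JB = %.2f, p = %.3f' % (jb_skew, p_skew))
```

The exponential sample is strongly skewed, so its test statistic is far larger and its p-value far smaller than those of the normal sample.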

This work derives from the works of Michael Halls-Moore on Quantstart and Quantopian Lecture Series.

The idea is simple.

Instead of just rewarding you for the people you invite, we will also reward you for the people your invitees invite, and their invitees, and so on. The diagram below gives an example of how it might work. Mo, Sid and Pooja are winners of certain cash prizes and we can see how that would lead to you earning $112, without doing any work yourself.

Here you can see that 'You' didn't invite anyone who directly won, so in most referral schemes you would be out of luck. However, with our fairer system, you are rewarded for your connections because they led to people submitting correct answers. Congratulations!

The formula for your reward is: R = P * 0.1/2^(n)

where: R = your reward

P = the winner's prize

n = the number of people between you and the winner in the referral chain (n < 5)

This table shows how much you would earn if someone in your chain won a prize of just $1000. It also shows how many people could win you money if everyone just invites 5 of their friends or colleagues.
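As a quick sanity check of the arithmetic, the formula is easy to compute directly. The helper below is purely illustrative, and it interprets n = 0 as a win by your direct invitee (no intermediaries), per the definition above:

```python
def referral_reward(prize, n):
    """Reward for a win n steps down your referral chain (R = P * 0.1 / 2^n)."""
    if n >= 5:
        return 0.0  # chains are capped at 5 people
    return prize * 0.1 / 2 ** n

# A $1000 prize won by your direct invitee (n = 0) pays you $100,
# and each extra step in the chain halves the reward
print(referral_reward(1000, 0))  # 100.0
print(referral_reward(1000, 1))  # 50.0
print(referral_reward(1000, 4))  # 6.25
```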

In theory, we could let this go on indefinitely, but we've decided to limit the chain length to 5. This will limit the overheads for our accountants, who wouldn't thank me for making them pay 100 people <$1 prizes.

>>>> You can invite people to our current competition here: quant-quest.auquan.com/competitions/asia-open-2019 <<<<<

Some of the finer print:

- People must be invited and sign up using the invite-a-friend system (found on all competition homepages). Otherwise, we won't be able to attribute your invite. This also means each person can only be invited by one person.
- You cannot invite yourself and you must not break any of our general platform rules.
- Chains are limited to 5 people, as explained above.
- This promotion only applies to cash prizes.
- We will only pay out prizes greater than $5.
- We initially intend to run this promotion until 31/12/2020, but reserve the right to stop this promotion at any time (we also might extend it!).
- Prizes are awarded at our sole discretion. For example, any evidence of trying to manipulate the system for unfair personal gain will result in your immediate disqualification from all payouts.