**Introduction**

In November last year, Auquan held its first UK competition with Imperial College London. Master's students were invited to take part in a one-day event where they were tasked with identifying trading signals from a noisy dataset.

The competition was run in association with the statistics department and was designed to give students experience working with a real-world problem. As it turns out, some of the answers submitted in that competition went on to win prizes in this year's Auquan Spring Challenge.

We don’t often talk about it, but one of the unique features of our platform is that any answer users submit is automatically entered into any similar competitions in the future. So if you submitted an amazing classifier to a competition today, you could be passively winning competitions for the next 50 years!

This was the first time someone has won a competition on our platform in this way, so we reached out to Yanni to learn a little more about him. Read on below…

# The Interview

**What is your background?**

This year I've been doing my master's in statistics at Imperial; the specific stream I'm in is applied statistics. I'm currently working on my summer research problem, which is about inference using kernel-based methods for complex systems. Before I came to Imperial, I was at Cambridge doing pure maths, but I've always been interested in statistics!

**That’s really mathsy! Data science itself is mainly statistics, but people consider them different topics. Do you consider yourself a data scientist?**

I guess when people talk about data science they are mainly referring to a combination of different fields: the statistics part and also computer science. Actually, one of the biggest problems I’m having now is that I have an algorithm that works but is very slow and I need to find a way to parallelise it and split the computation across different nodes — which is something you need to be able to do if you want to process large amounts of data.

I would say I'm more of a statistician. I haven't tackled many data science problems. We have some exposure to big data in our course, but I have been focused more on theoretical maths: algorithms and that sort of thing. The Auquan IC Challenge was actually the first competition like this that I took part in!

**These are actually the sort of people we try to work with. A lot of our work is doing the coding heavy lifting (in the template files and toolbox) so people only have to write the prediction logic, which means there is less need to understand things like object-oriented coding.**

**Given that this was your first time doing this sort of competition, is data science something you want to continue with, or just a one time thing?**

I actually do want to improve my coding skills. I know in finance they use a lot of stochastic differential equations, and that's one direction the work I'm doing is going to head in. So, in summary, yeah, I definitely want to do more of these sorts of competitions during my PhD. I think it's a really important skill to have, and it's also really interesting to see how these skills can be applied in different situations, like the real world.

**So do you find it interesting doing applied stuff in contrast to the more theoretical or pure work you do day in day out?**

I do. When I first started this project I was reading loads of academic papers full of very abstract ideas, and then I started coding them into the model I'm working on now. The model is actually a differential equation modelling chemical reactions, and what you were saying about the interdisciplinary nature of data science is really apparent there.

A lot of the equations I’m working with have a shared background with quantum mechanics, which is something I did a lot of in my undergrad. It’s really cool to take this sort of thing and apply it in another real-world area and see it work.

**If we were to think about someone similar to you who feels they need to ramp up on their coding skill, is there anything you would recommend?**

So, starting out, I would try to find some basic tutorials on Python online and code along with them at the same time. Once you've done that and got the basic syntax down, which shouldn't take long if you know some other languages, I think the best way to actually learn a language is to use it in a real project. That's how I learnt Python!

For example, I know the equations I'm working with at the moment algorithmically and how they work; I just need to figure out how to code them. I had to do some googling on how to do this in Python. First I wrote a simple double for loop, then I made it a bit quicker using some factorisation tricks, and then I found out that there's actually a package in scipy that does this for you. So the best way to learn is definitely to do projects.
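(Yanni doesn't say which operation this was, so purely as an illustration of the loop-to-library progression he describes, here is a discrete convolution written first as a naive double for loop and then as a single call to a ready-made scipy routine — a sketch, not code from his project.)

```python
import numpy as np
from scipy.signal import convolve

def convolve_loops(a, b):
    """Discrete convolution via a naive double for loop."""
    n, m = len(a), len(b)
    out = np.zeros(n + m - 1)
    for i in range(n):
        for j in range(m):
            out[i + j] += a[i] * b[j]  # each pairwise product lands at index i + j
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 0.5])

# The library call does the same thing in compiled code:
print(convolve_loops(a, b))            # [0.5 1.5 2.5 1.5]
print(convolve(a, b, mode="full"))     # [0.5 1.5 2.5 1.5]
```

On large inputs the scipy version is dramatically faster, since the looping happens in compiled code rather than at the Python level.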

**If you’re looking for this kind of material, is there anywhere you specifically go?**

Yeah, so I found this one site really useful [ https://lectures.quantecon.org/py/ ]. It's for quantitative economics, but they have loads of really useful material on numpy, scipy and some more object-oriented stuff.

There are also loads of really good articles on specific topics like linear algebra and Markov chains. I didn't go through everything, of course, but the ones I did look at were really useful.

**If we then move on to the problem, can you quickly remind us what it was about?**

The competition was run by our course department, actually. We were paired up on the day, and from what I remember there was a training dataset with hundreds of variables and some observations of each. I think the idea was that the variables were highly correlated with each other, and we either had to maximise or minimise the residual sum of squares (sorry, it was a while ago now!). Most of the solutions we uploaded were getting negative scores (not good), but we were told not to be discouraged by low positive numbers, as that's commonly the best you can get with datasets like this. In the end, we won with a correlation of around 0.03.

**Was there anything about the problem that you found particularly difficult or interesting?**

Actually, from what I remember the data was timestamped, and a load of my classmates were trying to use that. I didn't think it was that useful to think of it in those terms, because there were a lot of variables whose meaning I just didn't know. It's really difficult in cases like this to do exploratory data analysis and visualise all the combinations. So that was something that was difficult.

The thing that got us to the top of the leaderboard the first time was using principal component analysis, so the hardest part of this competition was actually implementing these approaches. We'd heard about them before but had never actually coded them up. Beyond that, I think we got a bit of a lucky break and it just happened to work.
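(As an aside for readers who, like Yanni's team, have heard of PCA but never implemented it: the core of it fits in a few lines of numpy. The competition itself was in R, so this Python version is purely illustrative.)

```python
import numpy as np

def pca_scores(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                            # centre each feature column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are the principal directions
    return Xc @ Vt[:k].T                               # k-dimensional scores per sample

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
print(pca_scores(X, 3).shape)  # (200, 3)
```

The SVD sorts the components by explained variance, so the first column of the scores captures the most variation in the data.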

**Given that this competition was in R and you now do mostly Python work, what would you say has been the biggest difference between the two?**

I've found that Python is just much nicer for working with linear algebra: inverting matrices, solving linear systems, working with arrays. One thing that really annoyed me about R is that a vector isn't considered a matrix and vice versa. Whereas in Python's numpy, an array is an array, regardless of whether it's a vector, a matrix, a tensor and so on.
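(To illustrate the point — our example, not Yanni's: in numpy a 1-D vector and a 2-D matrix are both plain `ndarray`s, and operations between them compose without any type juggling.)

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])   # 1-D array, shape (3,)
M = 2 * np.eye(3)               # 2-D array, shape (3, 3)

# Same ndarray type for both; the matrix-vector product just works
# and returns another ndarray:
print(M @ v)           # [2. 4. 6.]
print((M @ v).shape)   # (3,)
```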