Last updated October 2018

When you start to dive into data science/ML, it can feel like there are hundreds of different courses, books and MOOCs available, with no clear path showing exactly what you’ll need or the most efficient way to become self-sufficient. Talk to anyone who has taught themselves and they’ll each describe their own path, but some resources pop up often, for example Andrew Ng’s Coursera courses. Six months ago I started studying with only Codecademy-level Python skills, and after averaging 15-25 hours per week (with some long periods of inactivity while working on other businesses/projects/relationships), I should be ready for an entry-level position within the next few months.

So, here’s my advice on the courses, books and skills you should practice in order to get up to speed as soon as possible with minimal wasted time. This is by no means the best path you could take; it’s just the one that has worked for me, and it might save you many of the hours I spent exploring the different options.

The path

1. Get some experience using python (unless you already know R)

Most of the main ML libraries and tools are built for Python, so unless you already know R, learn Python. You can get away with minimal Python skills to start with, although you should keep practicing with non-ML projects, especially if you want to get into the ML engineering side of things. Codecademy and Learn Python the Hard Way are good resources. If you have completed one of these (or know other languages), skip to the next section, where the recommended course includes a Python crash course.

Resources:

Codecademy

Learn Python the Hard Way

2. Start getting practice with python/data/visualisation libraries

Before we dive into the ML algorithms, we need to learn some of the essential data science libraries so we can explore and manipulate our data (80% of machine learning is data munging after all). For this, I recommend the Python for Data Science and Machine Learning Bootcamp course on Udemy and Python for Data Analysis.

The Udemy course is a good, practical short course for getting you acquainted with Python, Jupyter notebooks and the main data analysis libraries you’ll want to use (numpy/pandas/scikit-learn). A lot of the other Udemy courses don’t use pandas or some of the plotting libraries (matplotlib/seaborn), but this one does. It also has sections on neural networks (although I’d skip them and use the resources below for that) and Spark (cloud/big data processing). It’s very practical and the exercises will whet your appetite, but once you start getting into the machine learning sections, you may want to pair it with the Andrew Ng course below (learn the intuition from Andrew, and practical skills from this course).

 

The Python for Data Analysis book is highly regarded and will give you a deeper look at Python, IDEs, and the core data analysis libraries, and how they work under the hood. Check it out (it’s free) and use it as a reference resource if nothing else.

Neither of these resources will give you the skills to build real ML pipelines efficiently and with the goal of deployment in mind. You’ll graduate to that step soon though.

3. Theory – Develop an understanding of the classical ML algorithms and core ML ideas

Andrew Ng’s original ML course on Coursera is one of the best introductions to how machine learning works, and will give you the intuition behind how each of the algorithms works (concretely). It’s a watered-down version of his old Stanford course, and while it’s over 7 years old now, it’s a great starting point.

I’d recommend watching the course videos, but I wouldn’t bother with the programming assignments since they’re done in Octave (similar to MATLAB). If you want to do them, find GitHub repos of people who have done them in Python. If you’re struggling with some of the math here (calculus & linear algebra), check out the additional resources below to help brush up. Honestly though, you can get away without building the algorithms from scratch if you understand the theory well (hardly anyone builds them from scratch anymore), so I’d focus on the aforementioned Udemy course or (ideally) the book I mention in point 5.
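If you do want one small from-scratch exercise to make the theory stick, linear regression trained with gradient descent (one of the first algorithms the course covers) is a good candidate. The snippet below is only a minimal sketch in plain Python, not the course’s own code (which is in Octave). The function name fit_line, the learning rate, the step count and the toy data are all my own choices for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors for the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

On this toy data the fitted values should land very close to w = 2 and b = 1. Try a much larger learning rate and watch it diverge, which is a cheap way to see why that hyperparameter matters.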

4. Theory – Start reading the ISLR (optional)

At this stage, you should also check out the de facto machine learning handbook for beginners, An Introduction to Statistical Learning (ISLR for short). Once you start learning and using the algorithms above, work through the book if you want a deeper grasp of how the algorithms work and are optimised, and of when and why you would use certain algorithms over others. It also doesn’t require much math. The exercises are in R, but you can again find Python versions on GitHub.

5. Start to dive deeper into Scikit-learn & Tensorflow

By now you should have a pretty solid understanding of the classical algorithms, the main data science libraries, and what a typical ML workflow looks like. Now we want to dive deeper, and it’s time for my favourite resource of them all: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (I’ll call it HOML for short). This book gets a bit more advanced and will teach you about streamlining your code, building data processing and ML pipelines, and tuning hyperparameters efficiently. The book has no filler, and it’s important to be comfortable with the libraries and algorithms beforehand as it’s not going to dive too deep into theory (which is why I recommend the Udemy course first). It includes written and practical exercises, and I’d recommend writing a summary of each chapter as you go. If you already have experience coding, understand the main themes of ML, and want to get your hands dirty quickly, this is the book for you.

The first half of the book is dedicated to classical algorithms (using scikit-learn), and the second half to neural networks (using TensorFlow). It’s a fantastic resource to refer back to, and you should check out Aurélien’s YouTube channel as well (he has the best explanation of capsule networks I’ve seen).

It’s also important to note that by this stage you should already have completed a project or two and be aiming to do at least one each month.

6. Deep Learning and Neural Networks

Note: I’d definitely recommend you consolidate what you’ve learned with some projects using classical algorithms before moving on to deep learning. Once you have, I’d recommend the fast.ai courses or Andrew Ng’s deeplearning.ai course. Both will give you a good intuition of how deep learning works (similar to Andrew’s original ML course), with Andrew’s course making you build basic networks from scratch in Python, and the fast.ai course focusing on practical exercises and implementing state-of-the-art algorithms in Jupyter notebooks using PyTorch. Andrew’s course is more fundamental and has a little more support (it’s a paid course), but fast.ai’s courses are more practical and will teach you how to implement some of the state-of-the-art algorithms being used today (their second course was only just released), with a focus on transfer learning for best results. Having done both, I’d recommend fast.ai, with the Andrew Ng course as a backup for understanding the intuition if you get stuck.

As you move through either of these courses (or both), you should add HOML into the mix for a look at applications in TensorFlow, or check out Deep Learning with Keras. It’s written by one of the main developers behind the Keras library, which is a more user-friendly wrapper around TensorFlow and offers a great starting point for building neural networks, abstracting away a lot of TensorFlow’s complexity. On this note, PyTorch appears to be picking up a lot of steam, particularly in the research community, so if you had to pick one, I’d pick PyTorch at this stage (unless you plan on deploying to mobile devices). Check out some of the detailed comparisons online if you have a specific use case in mind.

Next steps for learning

At this point, you should be honing your skills, working on and publishing your projects (posting them on your blog/GitHub/Kaggle), and starting to build real-world skills so you can work within a software/data science team. You should also start branching out into some of the big data services (Apache Spark rather than Hadoop), other libraries and wrappers (e.g. Keras), and the libraries and methods specific to the problems you’d like to solve (e.g. NLP, image recognition etc.).

Beyond this, the big cloud providers are all releasing a range of ML-focused products (e.g. AWS SageMaker/Rekognition), and it’s worth knowing them. If you’re new to cloud computing, just pick one provider (AWS is the safest bet) and get familiar with its offerings, from deploying servers to user access to git to its ML services. Getting a Solutions Architect certification with one of the providers can boost your resume and teach you a lot about how all of their services work (though little time is spent on ML), killing two birds with one stone. AWS and Azure also offer advanced certifications for Big Data (with a focus on data engineering and Spark/Hadoop), which will be worth getting down the road. This will be my next cert, having already completed the AWS Solutions Architect: Associate certification.

Large software companies usually host their entire software stacks with one of these providers, so you’ll need to know how they work if you ever want to be an engineer (deploying models into production) rather than an analyst (working in your notebooks and making graphs). More importantly, if you know how to use services like Amazon SageMaker, you will be able to build, train and deploy ML models to a simple API endpoint, removing the need to know how to manage servers and opening up the possibility of working as a one-person data science team. When people talk about how every business will be using machine learning in some form or another in their software, these tools, along with the automated ML services, will be how many of them get started (especially for basic models). Since these services are only just being released, being late to the game is less of a disadvantage, and you have an opportunity to deploy working ML models without needing to learn a lot of the skills your more experienced competition has already had to learn.

Other resources

Statistics and probability

You’ll want to cover both research statistics (to learn about sampling, bias and research design) and, most importantly, inferential statistics (to learn core statistics concepts and how to work with data). For a good intro to stats, check out Stats 101 from Harvard for video content, and this book, which is the best intro to stats book (even better than the textbook for the Harvard course, according to reviews). On the research side, make sure you cover sampling, biases, correlation (e.g. Pearson’s r vs Spearman’s rho), p-values, analyses of variance etc. Here is a summary I found which goes through a lot of it in a concise manner.
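To make the Pearson vs Spearman distinction concrete, here is a minimal sketch in plain Python. Pearson’s r measures how linear a relationship is, while Spearman’s rho is just Pearson’s r computed on the ranks of the data, so it only cares whether the relationship is monotonic. The function names and the toy data are my own, and in real work you would normally reach for a stats library rather than write these by hand.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: strength of the linear relationship between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks of the data."""
    return pearson_r(ranks(xs), ranks(ys))


# Toy data: y grows with x, but along a curve rather than a straight line.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(r, rho)
```

Because y always increases with x here, rho comes out as a perfect 1, while r is a little lower since the points don’t sit on a straight line.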

Maths (Calculus & Linear Algebra)

Depending on your previous exposure to maths (if you’ve done first year university maths), you may only need to brush up on the intuition behind matrix multiplication, calculus, and the notation. If you already have experience with this and want to buzz through it, I can wholeheartedly recommend 3Blue1Brown’s YouTube channel. Not only does he have calculus and linear algebra courses that will teach you the intuition (10-15 videos each), but he has a whole range of other brilliant math videos. If you want to practice, you should jump into Khan Academy.
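As a quick taste of the matrix multiplication intuition, here is a minimal sketch in plain Python (the function and variable names are my own, purely for illustration). Each entry of the result is the dot product of a row from the first matrix with a column from the second, and multiplying two transformation matrices together gives a single matrix that applies both transformations in turn.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    columns_of_b = list(zip(*b))
    # Entry (i, j) is the dot product of row i of a with column j of b.
    return [
        [sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
        for row in a
    ]


rotate_90 = [[0, -1], [1, 0]]  # quarter turn anticlockwise
scale_2 = [[2, 0], [0, 2]]     # double every coordinate

# The right-hand matrix is applied first: rotate, then scale.
combined = matmul(scale_2, rotate_90)

point = [[1], [0]]  # the point (1, 0) written as a column vector
moved = matmul(combined, point)
print(combined, moved)
```

The point (1, 0) ends up at (0, 2): rotated a quarter turn onto the y-axis, then pushed out to twice the distance. In practice numpy does all of this for you, but seeing the loops once makes the notation much less mysterious.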

Datasets for projects

The books above should have a bunch of projects already in them, but you’ll want to explore around to find some that are actually interesting to you. Kaggle has a bunch of datasets, including user-contributed ones, and you should find some that interest you more than those in most courses. The best part is that you can check out how other people have analysed the data and get ideas for what you can do. There are a lot of other beginners on there, but there are also some very advanced data scientists you can learn a lot from (especially in previous competitions which offered large prizes). Kaggle is usually the first place I look if I need to solve a specific problem (e.g. a segmentation problem), since many of the leaders will post their kernels (essentially a notebook containing their workflow) in the discussion area of a competition. There are also a bunch of public repositories offering datasets to use; just do a Google search to find ones that interest you.

Websites/news

Hacker News – General tech news and forum. Procrastination site of choice for many software developers.

DataTau – Hacker News for data scientists.

Data Elixir – Weekly email that has most of the top developments in the data science world. This is probably one of the better resources for industry news, top new papers, popular blog posts, learning resources and other stuff.

I won’t make a list of blogs or sites I follow; just do a Google search to find the common/best ones.

Youtube channels

If you’re like me and like to watch and learn, there are a bunch of great channels on Youtube. I haven’t done a heap of exploring (I know there are many more), but these should get you started.

3Blue1Brown – As mentioned above, this is a fantastic channel if you like math in general, and he uses Python to visualise everything he talks about. He also has videos on signal analysis and topology, and a couple of videos on how neural networks work. You’ll probably already be familiar with that material by this point, but they’re a great intro and good for sharing with others to show them how neural networks work.

Aurélien Géron – The author of Hands-On Machine Learning, and his videos are great too. He has only just started releasing videos and they are 10/10. Check out his CapsNets explanation video.

Two Minute Papers – A great channel for keeping abreast of the main new developments in the world of ML, and neural networks in particular. The format is very digestible and will give you a heap of new ideas for stuff you can do.

Siraj Raval – You’ve probably seen him before (it’s SIRAJ!), and while I’m not a huge fan of his teaching style and don’t watch his videos anymore, you should still check his channel out. He isn’t the best teacher but you may find him entertaining, and he’s done quite a few videos where he builds stuff on stream so you can follow along at home.

Sentdex – This guy creates courses for https://pythonprogramming.net/ and has done some pretty cool projects (including building a GTA5 bot which you could interact with via Twitch chat commands). He does a lot of live coding as well.

Victor Lavrenko – This guy doesn’t release anything anymore, but has some good videos on how different algorithms work and some more advanced/niche topics that you won’t find in the other courses.

The One – Has some cool videos on his projects, in particular using Unity and reinforcement learning to evolve segmented line creatures.

Machine Learning Plus – Just discovered this as I opened hackernews, but it has a great course + 101 questions for using numpy for data analysis. I’ll be doing this in the coming days.

Final thoughts

I’ll keep this post updated as I keep moving forward, but hopefully this gives you a clear idea of what to focus on and how to do it. If this helped you, you have questions for me, or you want to connect, shoot me an email.

Have a great week!