March 2020 - 2021 Data Science (Machine Learning / Deep Learning) Study Path

A complete ML study path focused on making you a Real World AI Expert (Engineer / Developer and Researcher)

This repository is intended to provide a complete and organic learning path for getting started with Machine Learning. You will understand the theory and be able to apply it in practice, with hands-on projects.

It does not require any previous Machine Learning knowledge, but you need to be comfortable with programming and high school math to understand and implement Machine Learning concepts.

I have organized the Path in 5 sections (Note: this is a work in progress):

Fundamentals

  • Python
  • Jupyter Notebook
  • The Math you need
  • The Machine Learning landscape
  • On to Plateau 2

Data Visualization

  • Pandas
  • Matplotlib
  • Seaborn, Plotly and More
  • vvvv
  • On to Plateau 3

Machine learning

  • Introduction
  • Statistics and Randomness
  • Probability
  • Bayesian Approach
  • Math of Curvature and Surfaces
  • Information Theory, Entropy & Cross Entropy
  • Classification, Clustering and Popular Classifiers
  • Data : Preparation, Training and Testing
  • Overfitting and Underfitting
  • Ensembles: Voting, Bagging, Boosting, Random Forests and More
  • Scikit-Learn
  • End-to-End Machine Learning project
  • Linear Regression
  • Classification
  • Training models
  • Support Vector Machines
  • Decision Trees
  • Ensemble Learning and Random Forest (remove / merge with top)
  • Wrap up and to Plateau 4

Deep Learning

  • Understanding the Neuron
  • Feed Forward Networks
  • Activation Functions
  • Backpropagation - Complete ins and outs
  • Optimizers - Why we need them
  • Deep Learning - Full review
  • Introduction to TensorFlow, PyTorch and other frameworks
  • Up and Running with TensorFlow and PyTorch
  • ANN - Artificial Neural Networks
  • NLP - Comprehensive Tutorials
  • CNN - Convolutional Neural Networks
  • RNN - Recurrent Neural Networks
  • The Mighty keras - Complete overview
  • AutoEncoders
  • Reinforcement Learning (Flippers, L-Learning, Q-Learning, SARSA)
  • Generative Models - GANs, Flow-Based Models and more
  • Next steps to Plateau 5

Applied AI in Production

  • Datasets - Complete Review
  • Training Networks: Best practices
  • Building Creative Applications
  • Machine Learning Projects
  • MLOps - Stitching Data Science Tools for Successful
  • Blogs / Youtube Channels / Websites worth taking a look!

So let's get started!


Fundamentals

Python

According to Sun Tzu:

If you don't know Python, learn it yesterday!

Python is one of the most used and loved programming languages, and it's necessary to get things done in the Machine Learning field. Like most frameworks in the broader Data Science field, TensorFlow is tightly coupled with Python, and Scikit-Learn is written in Python.

First, let's install Python 3 on your machine!

We are ready to start our journey!

If you don't know the basics of Python, just start from here.
If you already know the syntax and want a more solid Python background (recommended), take this Intermediate Python Course from here.
If you are looking for tons of exercises to get your hands dirty and get experience with Python, check here and here.

Once you're familiar with Python, take a look at NumPy, an important module for math operations. It provides the n-dimensional array (ndarray), which plays the role of the Tensor data type, the most used one in ML (especially when dealing with Neural Nets). It's not just a matrix: it can have any number of dimensions. This is an awesome NumPy Tutorial.
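A minimal sketch of the array type NumPy provides (`numpy.ndarray`), assuming NumPy is already installed. The values and variable names are illustrative only and do not come from the linked tutorial.

```python
import numpy as np

# A 1-D array (vector), a 2-D array (matrix) and a 3-D array (a rank-3 "tensor").
vector = np.array([1.0, 2.0, 3.0])
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
tensor = np.zeros((2, 3, 4))  # e.g. 2 samples, each with 3 rows and 4 columns

print(vector.ndim, matrix.ndim, tensor.ndim)  # number of axes of each array
print(tensor.shape)                           # size along each axis

# Element-wise arithmetic: the scalar is broadcast to every entry.
scaled = matrix * 10

# Matrix multiplication uses the @ operator, not *.
product = matrix @ matrix
print(scaled)
print(product)
```

The point to notice is that the same type covers vectors, matrices and higher-dimensional data; only the shape changes.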

I also recommend installing PyCharm Community Edition, a complete IDE for Python development, and setting up a new Python virtual environment for our experiments.

Jupyter Notebook

Directly from here: "The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more." Working with data means a lot of experiments. To document experiments and organize them in a valuable way to get insights, you definitely need to use Jupyter Notebook during your journey. Why?

The math you need

Whoever says that the math behind Machine Learning is hard... is not so wrong! But consider that every time you use it, it will be handled by the machine for you. So the important thing is to grasp the main math concepts and recognize their limits and applications. No one is going to ask you to calculate a gradient by hand! Even if you are not familiar with these concepts, check them out, because they are the reason behind everything.

With these three resources, you'll get most of what you really need to understand things deeply.

A top course about linear algebra is here.
Integrate it with basic probability and statistics concepts here.
Most of the remaining math you need is here.

The machine learning Landscape

Directly from the book cited earlier, this is the most concise and illuminating overview of what machine learning is and when you need it. Let's stop using buzzwords! Check it here.


Machine Learning

Introduction


Statistics and Randomness


Probability


Bayesian Approach


Scikit-Learn

To install Scikit-Learn:

python -m pip install -U scikit-learn

If you encounter some problems, it may be because you don't have the latest version of pip. So in the same folder run:

 python -m pip install --upgrade pip

Scikit-Learn is one of the most complete, mature and well-documented libraries for Machine Learning tasks. It comes out of the box with powerful and advanced models and offers utility functions for the data science process. We'll learn and use other modules along the road; for quick usage, just look at their official documentation.
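To give a feel for the library, here is a minimal sketch of the usual fit / score workflow. It assumes Scikit-Learn is installed; the choice of the bundled Iris dataset, of `LogisticRegression`, and of the split parameters is ours for illustration and is not prescribed by this guide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small dataset that ships with Scikit-Learn.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every Scikit-Learn estimator follows the same pattern: create, fit, predict/score.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # fraction of correct predictions
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for another estimator usually leaves the rest of the code unchanged, which is a large part of why the library is convenient for experiments.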

End-to-End Machine Learning project

For a first taste, I suggest you go through this Kaggle notebook, which is the most classic example of an ML task. The goal is to predict whether a Titanic passenger would have been likely to survive or not. Many things will be unclear for now, but don't worry, they will all be explained comprehensively later. It is nice to get the picture of an "applied" project, going through the classical steps of applied Machine Learning (problem framing, data exploration, question formulation...).

The notebook is on Kaggle, the go-to platform for ML and general Data Science projects, which provides a lot of free datasets and offers interesting challenges and ML model experiments.

This is the notebook. Read it to get the big picture of the process; the details, functions and code will become clearer later.

Linear Regression

This is one of the simplest forms of Machine Learning, and the starting point for everyone interested in predicting outcomes from a dataset. Check the theoretical lesson from Andrew Ng here, and then go through these examples, from the simplest to the most complete. This is the math behind Linear Regression.
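As a small illustration of that math, the sketch below fits a straight line to a handful of points with the closed-form least-squares solution for one input variable (slope = covariance(x, y) / variance(x)). The function name `fit_line` and the toy data are our own and are not taken from the linked resources.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance of x and y).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x (proportional to the variance of x).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1, so the fit should recover those values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy real data the line will not pass through every point; it is the line that minimizes the sum of squared vertical distances to the points.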

Classification

Classification is one of the most important ML tasks. It consists of predicting an outcome given an input, assigning it to one of several different possibilities. For example, given handwritten numbers, guess what the number is, with the lowest error rate possible. The simplest case is binary classification (Yes or No, Survived or Not Survived); have a look here. Check here for a brief explanation of the theory of the logistic regression algorithm for classification, and check here for a deeper understanding (using the Titanic dataset). You can use a lot of different ML models to classify things, even neural networks! For now, just take a look here, where you see an example comparing the accuracy and recall of different models. Here you have an article about the metrics used to evaluate your classifiers.
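To make the evaluation metrics concrete, here is a minimal sketch that computes accuracy, precision and recall for a binary classifier from scratch. The helper `binary_metrics` and the label lists are made up for illustration; in practice you would likely use the equivalent functions in Scikit-Learn.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for labels encoded as 0 / 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Precision: of everything predicted positive, how much really was positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything really positive, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical ground truth and predictions (1 = survived, 0 = did not survive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can be misleading on imbalanced data, which is why precision and recall are usually reported alongside it.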

Training models

Here I grouped some of the techniques used in ML tasks to train models. In this Google Crash Course you find:

Support Vector Machines

This is another classical algorithm for creating ML models. Here you have the explanation of the theory, and here a more practical approach. Check both. Here is a very good explanation plus a practical application in Scikit-Learn.

Decision Trees

Decision Trees are one of the simplest but most effective ideas for predicting outcomes, and they're used in many ways (e.g. Random Forest). Check here and go through the playlist to get a theoretical overview of Decision Trees (ID3). Here you have the practical application of ID3. Here you have some end-to-end examples with Scikit-Learn:
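ID3 chooses which feature to split on by measuring information gain, that is, how much the entropy of the labels drops after the split. The sketch below shows that single step on a tiny made-up dataset; the function names and data are illustrative assumptions, and a full ID3 implementation would apply this recursively to build the tree.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    # Group the labels by the value the chosen feature takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    # Weighted average entropy of the groups after the split.
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical data: each row is (outlook, temperature); label is "play?".
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, 0)
gain_temperature = information_gain(rows, labels, 1)

# ID3 would split on the feature with the highest gain.
best_feature = max(range(2), key=lambda i: information_gain(rows, labels, i))
print(gain_outlook, gain_temperature, best_feature)
```

Here the outlook separates the classes perfectly while the temperature tells us nothing, so the first feature is the one ID3 would pick.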

Ensemble Learning and Random Forest

The idea of Ensemble Learning is to leverage the different features, pros and cons of several ML models to obtain a group of "voters" that, for each prediction, gives you the most likely outcome, voted by different classifiers (SVM, ID3, maybe Logistic Regression). Here you find the basics of the ensemble learning approach, and here you find the most classic of them, the Random Forest. Although the idea is simple, this ensemble model has proved really effective at tackling even some "hard" classification problems, or problems with a lot of data.
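The "voters" idea can be sketched in a few lines of plain Python. The prediction lists below are invented stand-ins for the outputs of three already-trained classifiers, and `majority_vote` is a hypothetical helper; this is hard voting only, not a Random Forest.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions by picking the most common label per sample.

    predictions: list of prediction lists, one per model, all of the same length.
    """
    voted = []
    # zip(*predictions) yields, for each sample, the labels proposed by every model.
    for sample_votes in zip(*predictions):
        most_common_label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(most_common_label)
    return voted


# Hypothetical outputs of three classifiers on the same four samples.
svm_preds = [1, 0, 1, 1]
tree_preds = [1, 1, 0, 1]
logreg_preds = [0, 0, 1, 1]

ensemble_preds = majority_vote([svm_preds, tree_preds, logreg_preds])
print(ensemble_preds)
```

Using an odd number of voters on a binary problem avoids ties; with more classes or an even number of models you would need an explicit tie-breaking rule.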

Here you get a complete overview of the best practices for ensemble learning, and here you find an example of Random Forest with Scikit-Learn. Both links come with a bunch of useful techniques to use in practice.

Wrapping up and looking forward

Now, if you followed all the steps and explored all the resources I posted, you're likely to be more confident with Machine Learning and have a general idea of how things work. Of course you need to explore and learn more, because this field is changing and improving its techniques and approaches day by day! All the algorithms we've seen are widely used in the Data Science and Analytics field, but there are some complex tasks where they fail or perform really poorly. Now we are ready to fall down the deep rabbit hole, trying to understand how Neural Networks, and Deep Learning in general, can help tackle big problems with millions of parameters and variables. Why use Deep Learning over classical ML algorithms?


Deep Learning

In this section we'll follow a track that will bring us from zero knowledge of neural networks to a full understanding of them, thanks to the Stanford University Deep Learning course and some tutorials I've found on the internet. Some of them come from Google, others from Stanford or Cambridge University, and you will learn to leverage neural networks (ANN, CNN, RNN) for several kinds of ML tasks. These are some use cases of using TensorFlow for ML tasks.

The theory and applications of Neural Networks are not easy to grasp at first look. Because of that, you'll need to go through the tutorials and videos more than once to ensure a full comprehension of the coming topics. For the same reason, I spent a decent amount of time trying to understand (reading paths like this, articles, official forums, related subreddits) which was the most effective way to deeply learn the concepts, formulas, tradeoffs... I came up with this approach, but you can tweak it as you prefer, because every brain is different.

After taking the TensorFlow section, follow this 3-phase iterative cycle:

  • 1 Get an idea of the main concepts through an entire pass of this Stanford course. Don't care too much about the math explanations; focus on the what and why.
  • 2 Deeply explore one topic at a time, with theory + tutorials + examples (e.g. RNN theory + RNN tutorials + RNN examples), using the links and resources in the topic's section of the guide.
  • 3 After iterating phase 2 for each topic, walk through the entire Stanford course again. This time you can fully understand all the formulas, connecting them and also catching the "math flow" of the course.

This iterative process (1-2-2-2-2.....-3) can be repeated as many times as you want, and will probably build in your mind a nice general schema of things. In each complete iteration you can drop one or more topics, and focus on the ones that are more interesting to you or not so clear.

In each section I've put content for the first time you arrive there (during the first complete iteration), and some content for the next time you arrive there (after phase 3).

The structure follows the track proposed by the awesome Stanford course. You find the slides here.

This is an alternative course from MIT, with more or less the same contents. It's worth watching to compare and get a different point of view on things, besides listening twice to some of the best professors in the world exploring each topic.

This is the Book I refer to in each section.

What is TensorFlow

Created by the Google Brain team, TensorFlow is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural networking) models and algorithms and makes them useful by way of a common metaphor. It uses Python to provide a convenient front-end API for building applications with the framework, while executing those applications in high-performance C++. TensorFlow is a de-facto standard for large, industry-scale companies that need to implement Machine Learning algorithms. It is built for scaling, with really cool features to parallelize training over multiple GPUs or devices.

What is PyTorch

PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab (FAIR). It is free and open-source software released under the Modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface.

A number of pieces of Deep Learning software are built on top of PyTorch, including Tesla Autopilot, Uber's Pyro etc.

PyTorch provides two high-level features: tensor computing (like NumPy) with strong acceleration via graphics processing units (GPUs), and deep neural networks built on a tape-based automatic differentiation system.
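To make "tape-based automatic differentiation" less abstract, here is a toy sketch in plain Python, with no PyTorch involved. Each operation records how to pass gradients back to its inputs, and `backward()` replays those records in reverse order. The `Value` class is a hypothetical teaching aid for scalars only and is not how PyTorch is implemented internally.

```python
class Value:
    """A scalar that remembers which operation produced it."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = None  # set by the operation that creates this value

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Order the recorded graph so every node comes after its inputs,
        # then walk it in reverse, applying the chain rule at each step.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


# A single "neuron" without activation: y = w * x + b
x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b
y.backward()
print(y.data, w.grad, x.grad, b.grad)
```

Real frameworks apply the same idea to whole tensors and many more operations, which is what lets them compute the gradients needed for training automatically.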

Up and Running with TensorFlow

Assuming Python is on your PATH, to install the TensorFlow library you just need to open a terminal and run this command.

python -m pip install tensorflow

The first read I recommend is this. The second thing to do is to follow this Introduction to TensorFlow directly from the awesome Google Education page. Again, some theoretical concepts might be unclear, but focus on how the TensorFlow library and process are conceived. This is a good summary of the latter. Another beginner tutorial from Google. This is about the TensorFlow 2.0 update.

Now you're most likely familiar with TensorFlow as a tool, and it's time to understand how to use it to build large scale Neural Networks.

ANN - Artificial Neural Networks

First look (in order):

  • This video.
  • This is your bible, understand it totally.
  • This is a gem; also read this from the authors.
  • This is a really fast-talking guy implementing a Neural Network library from scratch, super useful to understand how the core of a NN is implemented in Python. You can imagine that each existing framework is just an enormous expansion of this concept-library.
  • This is a step-by-step backpropagation example with calculus.

Second pass:

Tips & Best practices: 1, 2, 3, 4, 5, 6, 7, 8.

CNN - Convolutional Neural Networks

First look (in order):

  • Here is an awesome deep explanation.
  • Here is another super good one.
  • Here is a serious CNN tutorial with TensorFlow.

Second pass:

Tips & Best practices: 1, 2, 3, 4, 5, 6, 7, 8.

RNN - Recurrent Neural Networks

First look (in order):

  • Here is a gentle but detailed explanation.
  • Here is another interesting explanation.
  • Here is a video with a more practical approach.
  • Here is a guide to implementing an RNN in TensorFlow.
  • Here is a 7-page blog post regarding the TensorFlow implementation.

Second pass:

Tips & Best practices: 1, 2, 3, 4, 5, 6, 7.

Training Networks: Best practices

First look (in order): I strongly recommend you refer to this page from Stanford and go through all of Modules 1 and 2. I also put here a list of the various topics to explore when talking about how to train NNs in real-life applications.

  • Overfitting vs Underfitting: 1, 2, 3, 4, 5.
  • Vanishing/Exploding Gradient: 1, 2, 3, 4, 5.
  • Transfer Learning: 1, 2, 3, 4, 5.
  • Faster Optimizers: 1, 2, 3, 4.
  • Avoiding Overfitting through Regularization: 1, 2, 3, 4.

Second pass:

AutoEncoders

First look (in order):

  • Here you find a first read.
  • This is your second recommended read.
  • This is a lecture from Andrew Ng.
  • I also give you some examples: 1, 2, 3, 4.

Second pass: AutoEncoders Chapter.

Tips & Best practices: 1, 2, 3, 4, 5.

Reinforcement Learning

First look (in order):

  • Here you have an explanation video.
  • This article explains RL well.
  • Here is an interesting read.
  • Some examples: 1, 2, 3, 4.

Second pass: The go-to guide. A paper with a state-of-the-art RL architecture. A complete free book on RL.

Tips & Best practices: 1, 2.

Applied AI

Hey you. During the last few years I collected tons of articles, web apps, Reddit threads, best practices, projects and repositories, and I want to share every single bit of information with you, trying to organize them by type of resource (blogs, project ideas, and so on).

Machine Learning Projects

Tools

Youtube Channels

Blogs

Websites worth taking a look!

Subreddits you want to follow!

Next Steps Roadmap

A lot of cool, 2021-savvy stuff is coming to this content:

  • Unsupervised Learning / Self-Supervised Learning
  • MLOps : Machine Learning mindset framework (how to work like a successful data scientist)
  • Data processing and preparation
  • Feature Selection
  • Feature Engineering
  • Extending the parameters optimization section
  • Popular Computer Vision Models and datasets
  • Popular NLP, RNN, Language Models and datasets
  • Real World AI Projects / Industry Verticals (Agriculture, Climate / Energy, Healthcare, Retail, Document AI, HR analytics, Ethics AI - Project Hominis, PETAI, CAELI, AI Projects)