FOUNDATION
INTRODUCTION TO DATA SCIENCE
- Where data science fits into today’s society.
- Why are there so many business and data science buzzwords?
- Analysis vs Analytics
- Intro to Business Analytics, Data Analytics, and Data Science
- Adding Business Intelligence (BI), Machine Learning (ML), and Artificial
Intelligence (AI) to the picture
- The Relationship Between Different Data Science Fields
- When are Traditional Data, Big Data, BI, Traditional Data Science and ML applied?
- What Is The Purpose Of Each Data Science Field?
- Why do we Need each of these Disciplines?
- Common Data Science Techniques
- Traditional Data: Techniques
- Traditional Data: Real-life Examples
- Big Data: Techniques
- Big Data: Real-life Examples
- Business Intelligence (BI): Techniques
- Business Intelligence (BI): Real-life Examples
- Traditional Methods: Techniques
- Traditional Methods: Real-life Examples
- Machine Learning (ML): Techniques
- Machine Learning (ML): Types of Machine Learning
- Machine Learning (ML): Real-life Examples
- Common Data Science Tools
- Programming Languages & Software Employed in Data Science - All the Tools You
Need
- Data Science Job Positions: What Do They Involve And What To Look Out For?
- Dispelling Common Misconceptions
SQL AND DATABASES FOR DATA SCIENCE
Getting Started & Installation
What Is A Database?
SQL vs. MySQL
Installation
Creating Databases & Tables
Showing Databases
Creating Databases
Dropping and Using Databases
Introducing Tables
Data Types: The Basics
Creating Tables
How Do We Know It Worked?
Dropping Tables
Tables Basics Activity
MySQL Comments
Inserting Data
INSERT: The Basics
A Quick Preview of SELECT
Multi-inserts
Working With NOT NULL
Sidenote: Quotes In MySQL
Adding DEFAULT Values
Introducing Primary Keys
Working With AUTO_INCREMENT
CRUD Basics
Introducing CRUD
Getting Our New "Dataset"
Officially Introducing SELECT
The WHERE clause
Aliases
Using UPDATE
A Quick Rule Of Thumb
Introducing DELETE
String Functions
The World Of String Functions
Loading Our Books Data
CONCAT
SUBSTRING
Combining String Functions
Sidenote: SQL Formatting
REPLACE
REVERSE
CHAR_LENGTH
UPPER & LOWER
Other String Functions
Refining Selections
Adding Some New Books
DISTINCT
ORDER BY
More On ORDER BY
LIMIT
LIKE
Escaping Wildcards
Aggregate Functions
Count Basics
GROUP BY
MIN and MAX Basics
Subqueries
Grouping By Multiple Columns
MIN and MAX With GROUP BY
SUM
AVG
Aggregate Functions Docs
Revisiting Data Types
Surveying Other Data Types
CHAR vs. VARCHAR
INT, TINYINT, BIGINT, etc.
DECIMAL
FLOAT & DOUBLE
DATE and TIME
Working With Dates
CURDATE, CURTIME, & NOW
Date Functions
Time Functions
Formatting Dates
Date Maths
TIMESTAMPS
DEFAULT & ON UPDATE TIMESTAMPS
Comparison & Logical Operators
Not Equal
NOT LIKE
Greater Than
Less Than Or Equal To
Logical AND
Logical OR
Between
Comparing Dates
The IN Operator
CASE
IS NULL
Constraints & ALTER TABLE
UNIQUE Constraint
CHECK Constraints
Named Constraints
Multiple Column Constraints
ALTER TABLE: Adding Columns
ALTER TABLE: Dropping Columns
ALTER TABLE: Renaming
ALTER TABLE: Modifying Columns
ALTER TABLE: Constraints
One to Many & Joins
Data is Messy
Relationships Basics
One to Many Relationship
Working with FOREIGN KEY
Cross Joins
Inner Joins
Inner Joins With Group By
Left Join
Left Join With Group By
Right Join
On Delete Cascade
Many to Many
Many to Many Basics
Creating Our Many To Many Tables
TV Series Challenge #1
TV Series Challenge #2
TV Series Challenge #3
TV Series Challenge #4
TV Series Challenge #5
TV Series Challenge #6
TV Series Challenge #7
Views, Modes, & More!
Introducing Views
Updateable Views
Replacing/Altering Views
HAVING clause
WITH ROLLUP
SQL Modes Basics
STRICT_TRANS_TABLES
Slicer
Synchronising slicers to multiple pages
Slicer Warning
Adding more control to your visualisations - Filters and slicers
Sort visuals
Configure small multiples
Use Bookmarks for reports
Group and layer visuals by using the Selection pane
Adding more control to your visualisations
Drillthrough
Buttons and Actions
Page Navigation and Drill through actions
Enable Natural Language Queries (Ask A Question) and Page Formatting
Tooltip Pages
Page and Bookmark Navigator
Adding more control to your visualisations - Part
STATISTICS FOR DATA SCIENCE
- Introduction to Statistical Research Methods
- Data Visualization
- Measures of Central Tendency
- Variability
- Standardisation
- Normal Distribution
- Sampling Distributions
- Estimation
- Hypothesis Testing
- t-Tests
- One-way Analysis of Variance (ANOVA)
- Two-way Analysis of Variance (ANOVA)
- Correlation
- Regression
- Chi-Squared Tests
VERSION CONTROL - GIT AND GITHUB
- The Terminal
- Install Git Bash on Windows
- Introduction to Version Control and Git
- Version Control using Git and the Command Line
- Github and Remote Repositories
- Gitignore
- Cloning
- Branching and Merging
- Forking and Pull Requests
- Setting Up Comet
Power BI
1. Getting Started with Power BI:
- Understanding Power BI Desktop, Power BI Service, and Power BI Mobile
- Importing data from various sources (Excel, CSV, SQL Server, Web, etc.)
- Basic navigation and interface of Power BI Desktop
2. Data Preparation:
- Data cleaning and transformation using Power Query Editor
- Merging and appending queries
- Data types and error handling
3. Data Modeling:
- Creating relationships between tables
- Understanding and using star and snowflake schemas
- Managing relationships (one-to-one, one-to-many, many-to-many)
- Using calculated columns and tables
4. DAX (Data Analysis Expressions):
- Basics of DAX syntax and functions
- Creating calculated columns and measures
- Understanding row context and filter context
- Common DAX functions (SUM, COUNT, AVERAGE, MIN, MAX)
- Time intelligence functions (DATEADD, DATESYTD, SAMEPERIODLASTYEAR)
- Advanced DAX functions (CALCULATE, ALL, FILTER, RELATED)
5. Visualization:
- Creating and customizing basic charts (bar, line, pie, scatter, etc.)
- Using slicers for filtering data
- Creating and customizing tables and matrices
- Using maps and geographical data visualizations
- Custom visualizations from the marketplace
6. Advanced Visualization:
- Using bookmarks and selections for interactive reports
- Creating drill-through and drill-down reports
- Using tooltips for enhanced data presentation
- Implementing conditional formatting
7. Reports and Dashboards:
- Designing report layouts and themes
- Creating and managing dashboards in Power BI Service
- Pinning visuals to dashboards
- Using Q&A feature for natural language queries
8. Power BI Service:
- Publishing reports to Power BI Service
- Understanding workspaces, apps, and content packs
- Managing datasets and data refresh schedules
- Sharing reports and dashboards with stakeholders
- Collaborating with team members
9. Power BI Embedded:
- Integrating Power BI reports into applications
- Using Power BI REST API for automation
10. Security:
- Implementing row-level security (RLS)
- Managing roles and permissions
- Understanding and applying data protection and compliance measures
11. Performance Optimization:
- Optimizing data models for performance
- Using Performance Analyzer tool
- Best practices for efficient report design
12. Advanced Analytics:
- Using AI visuals (Key Influencers, Decomposition Tree, Q&A Visual)
- Integrating R and Python scripts in Power BI
- Implementing what-if parameters for scenario analysis
13. Power BI Integration:
- Connecting Power BI with other Microsoft services (Excel, Azure, SQL Server)
- Integrating with third-party tools and data sources
- Using Power Automate for workflow automation
14. Power BI Administration:
- Managing Power BI gateway for on-premises data sources
- Monitoring usage and performance
- Implementing governance and best practices for organisation-wide usage
15. Power BI Community and Resources:
- Participating in Power BI community forums and events
- Utilising Power BI documentation and learning resources
- Staying updated with new features and updates
PYTHON FOR DATA SCIENCE
Why Python Programming
- Introduction to Python and its popularity
- Python's use in various domains (Web development, Data science, Automation, etc.)
- Advantages of Python over other programming languages
- Python community and resources
Data Types and Operators
- Variables and data types (integers, floats, strings, booleans)
- Type conversion and casting
- Basic operators (arithmetic, comparison, logical)
- String manipulation and formatting
- Working with variables and constants
Data Structures in Python
- Lists: creation, indexing, slicing, and manipulation
- Tuples: immutability and use cases
- Dictionaries: key-value pairs and dictionary methods
- Sets: unique elements and set operations
- Lists vs. Tuples vs. Dictionaries vs. Sets
Control Flow
- Conditional statements (if, elif, else)
- Loops (for and while loops)
- Loop control statements (break, continue)
- Using loops for iteration and pattern printing
- Exception handling (try, except, finally)
Functions
- Defining and calling functions
- Parameters and arguments
- Return statements and function documentation (docstrings)
- Scope and lifetime of variables
- Lambda functions and built-in functions
Scripting
- Reading and writing files
- Command-line arguments (sys.argv)
- Creating and running Python scripts
- Understanding shebang (#!/usr/bin/env python)
- Organising code into modules and packages
NUMPY FOR DATA SCIENCE
- Introduction to NumPy and its importance in data science
- Creating NumPy arrays
- Array indexing and slicing
- Array manipulation and broadcasting
- Mathematical operations with NumPy arrays
- Loading and saving data using NumPy
PANDAS FOR DATA WRANGLING
- Introduction to Pandas for data manipulation and analysis
- Series and DataFrame objects
- Loading data into Pandas
- Data exploration and basic statistics
- Data cleaning and handling missing values
- Data filtering, selection, and sorting
- Data visualisation with Pandas
- What is data wrangling and why is it important?
- Data acquisition methods (reading from files, web scraping, APIs)
- Data cleaning techniques (handling missing values, dealing with duplicates)
- Data transformation (reshaping data, merging and joining datasets)
- Data aggregation and grouping
- Data normalisation and scaling
- Dealing with outliers
- Handling categorical data (encoding and one-hot encoding)
- Date and time data manipulation
- Introduction to data quality and validation
- Advanced Pandas techniques for data manipulation (pivot tables, melt, stack, unstack)
- Combining and merging DataFrames (concatenation, merging on keys)
- Data filtering and selection (loc, iloc)
- Using Pandas functions to clean and transform data
- Handling missing data with Pandas
- Applying custom functions to data using Pandas
MATPLOTLIB
- Introduction to Matplotlib and its role in data visualisation
- Basic plotting with Matplotlib (line plots, scatter plots, bar charts)
- Customising plots (labels, titles, legends)
- Subplots and figure customization
- Advanced plotting techniques (histograms, box plots, heatmaps)
- Saving and exporting plots in different formats
SEABORN
- Introduction to Seaborn and its advantages over Matplotlib
- Seaborn's aesthetics and built-in themes
- Creating statistical visualisations (distribution plots, categorical plots)
- Visualising relationships (scatter plots, pair plots, heatmaps)
- Advanced customization and styling in Seaborn
- Combining Seaborn with Pandas DataFrames for effective data exploration
VISUALIZATION
- Univariate Exploration of Data
- This lesson shows how to use Matplotlib and Seaborn to produce informative
visualisations of single variables.
- Bivariate Exploration of Data
- Multivariate Exploration of Data
- Explanatory Visualisations
MACHINE LEARNING
ADVANCED REGRESSION
- Introduction To Machine Learning
- Predictive Modelling And Classification
- Assessing Accuracy And The Train-Test Split
- Statistical Learning
- Linear Models
- Least Squares Regression
- Splitting Datasets
- The Train/Test Split
- Multiple Linear Regression
- Variables And Variable Selection
- Feature Engineering
- Saving And Restoring Models
- Regularisation - Data Scaling
- Regularisation: Ridge Regression
- Regularisation: LASSO Regression
- Decision Trees
- Bias-Variance Tradeoff
- Parametric Methods, Ensembling And Bootstrapping
- Random Forests
ADVANCED CLASSIFICATION
- Advanced Classification
- Natural Language Processing
- How Machines Understand Language
- Logistic Regression
- Intro To Binary Classification Using Logistic Regression
- Classification Metrics
- Model Improvements
- Improving Classification Models
- Dealing With Imbalanced Data
- Tree-Based Classification Methods
- Training A Decision Tree
- Tree-Based Methods For Classification
- Support Vector Classification
- Support Vector Machines
- Nearest Neighbours And Naive Bayes
- KNNs And Naive Bayes
- Hyperparameter Tuning & Model Validation
- Hyperparameters And Model Validation
- Neural Network Classifiers
- Classifier Model Selection
- Build All The Classifiers
UNSUPERVISED LEARNING
- Principal Component Analysis
- Advanced Dimensionality Reduction
- Advanced Dimensionality Reduction Techniques
- K-Means Clustering
- Hierarchical Clustering
- Gaussian Mixture Models
- Clustering And Geospatial Analysis
- Recommender Systems
Introduction to Streamlit
○ What is Streamlit?
○ Installing Streamlit
○ Basic Streamlit Concepts: Widgets, Layouts, and State Management
○ Running and Sharing Streamlit Apps
Streamlit Components and Layouts
○ Advanced Layouts and Widgets
○ Creating Interactive User Interfaces
○ Integrating Plotly, Matplotlib, and Altair with Streamlit
Introduction to Big Data
○ What is Big Data?
○ Characteristics of Big Data (Volume, Velocity, Variety, Veracity)
○ Overview of Big Data Technologies (Hadoop, Spark, NoSQL)
○ Data Storage: HDFS, Cloud Storage
Data Wrangling with PySpark
● Topics Covered:
○ Introduction to Apache Spark
○ Working with PySpark DataFrames
○ Data Cleaning and Transformation with PySpark
Data Visualization for Big Data
● Topics Covered:
○ Visualisation Techniques for Large Datasets
○ Aggregation and Filtering in PySpark
○ Integrating PySpark with Streamlit for Real-Time Visualisations
Connecting Streamlit with Big Data Storage
● Topics Covered:
○ Connecting Streamlit to Cloud Storage (AWS S3, Google Cloud Storage)
○ Streaming Data into Streamlit from Big Data Sources
○ Real-time Data Processing with Kafka and Streamlit
Machine Learning on Big Data
● Topics Covered:
○ Introduction to Machine Learning on Big Data
○ Using MLlib with PySpark
○ Integrating Machine Learning Models in Streamlit
Advanced Streamlit Features
○ Custom Components in Streamlit
○ Deploying Streamlit Apps on Heroku, AWS, and Google Cloud
○ Streamlit Authentication and Security
Big Data Project Development
● Topics Covered:
○ Project Planning and Management
○ Integrating All Components: Data Ingestion, Processing, Visualization, and
Machine Learning
○ Optimising Streamlit Apps for Performance
Projects
The Blackjack Capstone Project
Higher Lower Game
Data Analysis of a CSV File
Obtain a dataset in CSV format (e.g., from Kaggle or other open datasets).
Use Pandas to load and clean the data.
Perform exploratory data analysis (EDA) using Pandas and NumPy to answer
questions and visualise patterns in the data.
Generate summary statistics, histograms, and other visualisations to gain insights
from the dataset.
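As a rough illustration of the load, clean, and summarise steps above, the sketch below runs Pandas on a tiny inline CSV. The column names (city, temp_c, rainfall_mm) and the values are invented for the example, and filling gaps with the column mean is only one possible cleaning rule. A real project would read a downloaded file instead.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a downloaded dataset.
RAW_CSV = """city,temp_c,rainfall_mm
Lagos,31.0,12
Nairobi,,3
Accra,29.5,8
Lagos,31.0,12
"""

df = pd.read_csv(io.StringIO(RAW_CSV))

# Cleaning: drop the repeated row, then fill the missing temperature
# with the mean of the remaining values (an assumed rule).
df = df.drop_duplicates()
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())

# Summary statistics for the numeric columns.
summary = df.describe()
print(summary)
```

From here, df["temp_c"].hist() or a Seaborn plot would be a natural next step for the visual part of the EDA.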
Stock Portfolio Analysis
Retrieve historical stock price data using pandas-datareader or a market data API.
Create a Pandas DataFrame to store and manipulate the data.
Calculate and visualise portfolio statistics, such as returns, volatility, and risk-adjusted
performance.
Implement simple portfolio optimization strategies, such as the Markowitz Efficient
Frontier.
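The portfolio statistics mentioned above can be sketched with NumPy as follows. The prices and weights are made up, the portfolio is assumed to be rebalanced to the same weights each day, the risk-free rate is taken as zero, and 252 trading days per year is a common convention rather than a requirement.

```python
import numpy as np

# Hypothetical closing prices: rows are days, columns are two assets.
prices = np.array([
    [100.0, 50.0],
    [102.0, 49.0],
    [101.0, 51.0],
    [104.0, 52.0],
])
weights = np.array([0.6, 0.4])  # assumed portfolio weights, summing to 1

# Simple daily returns per asset: p_t / p_(t-1) - 1.
returns = prices[1:] / prices[:-1] - 1.0

# Daily portfolio return as the weighted sum of asset returns.
portfolio_returns = returns @ weights

TRADING_DAYS = 252  # assumed annualisation factor
annual_return = portfolio_returns.mean() * TRADING_DAYS
annual_volatility = portfolio_returns.std(ddof=1) * np.sqrt(TRADING_DAYS)

# Sharpe ratio with an assumed risk-free rate of zero.
sharpe_ratio = annual_return / annual_volatility
print(annual_return, annual_volatility, sharpe_ratio)
```

An efficient-frontier exercise would repeat this calculation over many candidate weight vectors and compare the resulting return and volatility pairs.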
Customer Segmentation
Obtain a customer dataset (e.g., retail sales data or online store data).
Use Pandas to preprocess and clean the dataset.
Utilise NumPy to implement a clustering algorithm such as k-means and segment
customers based on their purchase behaviour.
Visualise customer segments and analyse their characteristics.
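One way the k-means step could look in plain NumPy is sketched below. The six customers and their two already-scaled features are invented, and the first k points are used as starting centroids purely to keep the example deterministic. Random or k-means++ initialisation is the more usual choice.

```python
import numpy as np


def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    distances = np.linalg.norm(
        points[:, None, :] - centroids[None, :, :], axis=2
    )
    return distances.argmin(axis=1)


def kmeans(points, k, n_iter=50):
    """Basic Lloyd's algorithm; initial centroids are the first k points."""
    centroids = points[:k].copy()
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(points, centroids), centroids


# Hypothetical customers: two scaled features, e.g. spend and order count.
customers = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])
labels, centroids = kmeans(customers, k=2)
print(labels, centroids)
```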
Time Series Forecasting
Collect a time series dataset (e.g., stock prices, weather data).
Load and manipulate the data with Pandas.
Use NumPy and Pandas to implement time series forecasting models like moving
averages, exponential smoothing, or ARIMA.
Visualise the time series data and the forecasted values.
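Two of the simpler forecasting methods named above, a moving average and simple exponential smoothing, can be sketched with Pandas as shown here. The series values, the window of 3, and the smoothing factor of 0.5 are arbitrary choices for illustration. ARIMA would normally come from a dedicated library and is not shown.

```python
import pandas as pd

# Hypothetical observations in time order.
series = pd.Series([10.0, 12.0, 13.0, 12.0, 15.0, 16.0])

# Moving-average forecast: the mean of the last 3 observations
# is used as the prediction for the next step.
ma_forecast = series.rolling(window=3).mean().iloc[-1]

# Simple exponential smoothing: level_t = alpha * x_t + (1 - alpha) * level_(t-1),
# starting from the first observation (adjust=False gives this recursion).
ses_forecast = series.ewm(alpha=0.5, adjust=False).mean().iloc[-1]

print(ma_forecast, ses_forecast)
```

A larger alpha makes the smoothed level react faster to recent values, while a wider window makes the moving average smoother but slower.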
Movie Recommender System
Acquire a movie ratings dataset (e.g., MovieLens dataset).
Clean and preprocess the data using Pandas.
Implement a basic movie recommender system using NumPy and Pandas, based on
user ratings and movie metadata.
Provide movie recommendations for a given user.
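A basic recommender of the kind described above could be built on item-to-item cosine similarity, as in this NumPy sketch. The ratings matrix is invented (rows are users, columns are movies, 0 means "not rated"), and treating unrated entries as zeros in the similarity is a simplification, not the only option.

```python
import numpy as np

# Hypothetical ratings: 5 users x 5 movies, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 5, 1, 1],
    [1, 1, 0, 5, 4],
    [5, 5, 4, 0, 1],
    [1, 2, 1, 4, 5],
], dtype=float)


def item_similarity(r):
    """Cosine similarity between movie columns."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated movies
    normalised = r / norms
    return normalised.T @ normalised


def recommend(user, r, top_n=1):
    """Indices of the top_n unrated movies with the highest predicted score."""
    sim = item_similarity(r)
    rated = r[user] > 0
    weights = sim[:, rated]
    # Predicted score: similarity-weighted average of the user's own ratings.
    totals = np.maximum(np.abs(weights).sum(axis=1), 1e-12)
    scores = weights @ r[user, rated] / totals
    scores[rated] = -np.inf  # never recommend something already rated
    return np.argsort(scores)[::-1][:top_n]


print(recommend(0, ratings, top_n=2))
```

Movie metadata such as genre could be added as extra columns in the similarity calculation, which the project brief leaves open.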
E-commerce Sales Analysis
Collect e-commerce sales data, including customer transactions and product
information.
Use Pandas for data cleaning and merging datasets.
Analyse sales trends, customer behaviour, and product performance using Pandas and
NumPy.
Create visualisations and reports to summarise the findings.
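The merge-then-aggregate pattern behind this analysis might look like the following Pandas sketch. Both tables, their column names, and the prices are hypothetical stand-ins for real transaction and product data.

```python
import pandas as pd

# Hypothetical transaction and product tables.
transactions = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "product_id": [10, 11, 10, 12],
    "quantity": [2, 1, 1, 3],
})
products = pd.DataFrame({
    "product_id": [10, 11, 12],
    "name": ["mug", "lamp", "pen"],
    "unit_price": [8.0, 25.0, 1.5],
})

# Attach product details to each transaction, then compute revenue per line.
sales = transactions.merge(products, on="product_id", how="left")
sales["revenue"] = sales["quantity"] * sales["unit_price"]

# Product performance: total revenue per product, best seller first.
revenue_by_product = (
    sales.groupby("name")["revenue"].sum().sort_values(ascending=False)
)
print(revenue_by_product)
```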
Data Cleaning and Transformation Tool
Develop a tool that allows users to upload messy datasets.
Use Pandas to clean and transform the data, addressing common data quality issues
like missing values, duplicates, and inconsistent formatting.
Provide options for data export in different formats (e.g., CSV, Excel) after cleaning.
House Price Prediction
Utilise a dataset of housing prices, including features like square footage, number of
bedrooms, and location.
Build regression models (linear regression, decision tree regression, or random forest
regression) to predict house prices.
Evaluate and compare model performance using metrics like Mean Absolute Error
(MAE) and Root Mean Squared Error (RMSE).
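To make the evaluation metrics concrete, here is a minimal sketch that fits an ordinary least squares model with NumPy and scores it with MAE and RMSE. The housing numbers are invented, and the helper names mae and rmse are illustrative. A real project would more likely use a library such as scikit-learn for the models.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalises large errors more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def with_intercept(X):
    """Prepend a column of ones so the model learns an intercept."""
    return np.column_stack([np.ones(len(X)), X])


# Hypothetical data: [square footage, bedrooms] -> price in thousands.
X_train = np.array([[1000, 2], [1500, 3], [2000, 3], [1200, 2], [1800, 4]], dtype=float)
y_train = np.array([172.0, 228.0, 283.0, 189.0, 271.0])
X_test = np.array([[1600, 3], [1100, 2]], dtype=float)
y_test = np.array([242.0, 178.0])

# Fit linear regression by least squares.
coef, *_ = np.linalg.lstsq(with_intercept(X_train), y_train, rcond=None)
predictions = with_intercept(X_test) @ coef

print("MAE:", mae(y_test, predictions))
print("RMSE:", rmse(y_test, predictions))
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two hints at a few unusually large errors.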
Energy Consumption Forecasting
Gather time-series data on energy consumption along with weather-related features.
Develop a time series forecasting model (e.g., ARIMA, LSTM) to predict future
energy consumption.
Assess the accuracy of the model's predictions.
Stock Price Prediction
Collect historical stock price data for a specific company or stock market index.
Implement a time series regression model to predict future stock prices.
Evaluate the model's performance using metrics like Mean Squared Error (MSE) and
visualise the predictions.
Customer Churn Prediction
Work with customer data from a business (telecom, subscription service, etc.).
Create a classification model (logistic regression, random forest, or support vector
machine) to predict customer churn.
Evaluate the model's accuracy, precision, recall, and F1-score.
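The four evaluation metrics listed above all follow from the confusion-matrix counts. The sketch below computes them with plain Python so the arithmetic is visible. The labels are made up (1 = churned, 0 = stayed), and the function name is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct "stayed"

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For churn, recall is often watched most closely, since a missed churner usually costs more than a false alarm. That priority is a business assumption rather than a rule.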
Sentiment Analysis on Social Media
Collect social media data (e.g., tweets or reviews) related to a product or topic of
interest.
Build a text classification model using techniques like natural language processing
(NLP) and sentiment analysis.
Analyse sentiment trends and sentiment distribution.
Image Classification (e.g., MNIST, CIFAR-10)
Use popular image datasets like MNIST or CIFAR-10.
Create a convolutional neural network (CNN) for image classification tasks.
Visualise the model's performance and make predictions on new images.
Customer Segmentation
Apply clustering algorithms like k-means or hierarchical clustering to segment
customers based on their purchasing behaviour.
Analyse customer segments and develop targeted marketing strategies.
Anomaly Detection in Network Traffic
Work with network traffic data and focus on anomaly detection.
Implement unsupervised learning techniques (e.g., isolation forests or autoencoders)
to identify unusual patterns or attacks in network traffic.
Topic Modeling for Text Data
Use a dataset of text documents (e.g., news articles, research papers).
Apply topic modelling techniques like Latent Dirichlet Allocation (LDA) to discover
underlying topics within the documents.
Market Basket Analysis
Work with transaction data from a retail store.
Use association rule mining (e.g., Apriori algorithm) to identify patterns in customer
purchasing behaviour.
Suggest product recommendations based on frequent itemsets.
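The support and confidence measures behind association rules can be sketched with the standard library as below. This counts item pairs by brute force rather than applying Apriori's candidate pruning, and the baskets and the 0.4 support threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

# How many baskets contain each item, and each (sorted) pair of items.
item_counts = Counter(item for basket in transactions for item in basket)
pair_counts = Counter(
    pair for basket in transactions for pair in combinations(sorted(basket), 2)
)

MIN_SUPPORT = 0.4  # assumed threshold: pair appears in at least 40% of baskets
frequent_pairs = {
    pair: count / n
    for pair, count in pair_counts.items()
    if count / n >= MIN_SUPPORT
}


def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(frequent_pairs)
print(confidence("butter", "bread"))
```

A rule with high confidence, such as butter implying bread in this toy data, is the kind of pattern that would feed a "customers also bought" suggestion.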
Introduction to Streamlit
● Practical Exercise:
○ Build a basic Streamlit app that displays text, images, and charts.
● Assignment:
○ Create a simple dashboard with user inputs (e.g., sliders, checkboxes).
Streamlit Components and Layouts
● Practical Exercise:
○ Develop a Streamlit app with complex layouts and multiple interactive charts.
● Assignment:
○ Design a multi-page Streamlit app.
Introduction to Big Data
● Practical Exercise:
○ Explore a small dataset using traditional methods.
● Assignment:
○ Write a brief report on the challenges and opportunities of Big Data.
Data Wrangling with PySpark
● Practical Exercise:
○ Process a medium-sized dataset using PySpark.
● Assignment:
○ Clean and transform a dataset using PySpark and load it into a Streamlit app.
Data Visualization for Big Data
● Practical Exercise:
○ Visualize a large dataset in Streamlit using PySpark.
● Assignment:
○ Build a data dashboard in Streamlit that visualizes trends in a large dataset.
Connecting Streamlit with Big Data Storage
● Practical Exercise:
○ Set up a connection between Streamlit and a cloud storage service.
● Assignment:
○ Create a Streamlit app that pulls data from a cloud storage service and
visualizes it.
Machine Learning on Big Data
● Practical Exercise:
○ Build and deploy a machine learning model using PySpark and Streamlit.
● Assignment:
○ Develop a Streamlit app that allows users to train and test a machine learning
model on large datasets.
Advanced Streamlit Features
● Practical Exercise:
○ Create and deploy a Streamlit app with custom components.
● Assignment:
○ Secure a Streamlit app and deploy it to a cloud platform.
Big Data Project Development
● Practical Exercise:
○ Start working on a capstone project that integrates Streamlit and Big Data.
● Assignment:
○ Submit a project proposal outlining the scope, objectives, and technologies
used.
Capstone Project Presentation
● Practical Exercise:
○ Complete and present the capstone project.
● Assignment:
○ Submit the final project and present it to the class.