Eng: Mahmoud Yahia
Recap
Create Dataframe
import pandas as pd
# Three equivalent ways to build the same DataFrame: dict of columns, list of rows, list of records
df = pd.DataFrame({'Name':['John','Smith','Paul','Mark'],'Age':[25,30,50,45]})
df = pd.DataFrame([['John',25],['Smith',30],['Paul',50],['Mark',45]],columns=['Name','Age'])
df = pd.DataFrame([{'Name':'John','Age':25},{'Name':'Smith','Age':30},{'Name':'Paul','Age':50},{'Name':'Mark','Age':45}])
Features: the columns of the DataFrame
Observations: the rows of the DataFrame
Create series
Ser = pd.Series([1, 2, 3, 4])
DataFrame Index
Recap
Create Dataframe
Create series
Concat Dataframes
drop_duplicates()
df.drop_duplicates()
df.drop_duplicates(subset=['Name'])
df.drop_duplicates(keep='first')
df.drop_duplicates(subset=['Name'], keep='last')
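The four calls above differ in which columns are compared and which copy is kept. A minimal sketch on made-up data (the rows below are hypothetical, chosen so that each call gives a different result):

```python
import pandas as pd

# Hypothetical data: 'John' appears twice with identical values,
# 'Smith' appears twice with different ages.
df = pd.DataFrame({
    'Name': ['John', 'Smith', 'John', 'Smith'],
    'Age': [25, 30, 25, 31],
})

# Compare whole rows: only the second 'John' row is an exact duplicate.
# keep='first' is the default, so it does not need to be written out.
full = df.drop_duplicates()

# Compare on 'Name' only, keeping the first row seen for each name.
first_by_name = df.drop_duplicates(subset=['Name'])

# Compare on 'Name' only, keeping the last row seen for each name.
last_by_name = df.drop_duplicates(subset=['Name'], keep='last')
```

Note that the original index labels are preserved, so the result may have gaps in its index; reset_index(drop=True) renumbers it.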
Recap
Create Dataframe
Create series
Concat Dataframes
Values Count
Shape/Len
nunique()
describe()
drop_duplicates()
sort_values()
rename()
sort_index()
reset_index()
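As a quick refresher, the recap items can be sketched in one short script. The two small DataFrames are hypothetical and only there to give each call something to work on:

```python
import pandas as pd

# Hypothetical input data.
df1 = pd.DataFrame({'Name': ['John', 'Smith'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Paul', 'Mark', 'John'], 'Age': [50, 45, 25]})

# Concat DataFrames: stack the rows; ignore_index renumbers them 0..n-1.
df = pd.concat([df1, df2], ignore_index=True)

counts = df['Name'].value_counts()   # occurrences of each name
n_rows, n_cols = df.shape            # (rows, columns); len(df) gives rows only
n_unique = df['Name'].nunique()      # number of distinct names
summary = df.describe()              # summary statistics of numeric columns

by_age = df.sort_values('Age', ascending=False)   # sort rows by a column
renamed = df.rename(columns={'Age': 'Years'})     # rename a column
by_index = by_age.sort_index()                    # back to index order
fresh = by_age.reset_index(drop=True)             # new 0..n-1 index
```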
Matplotlib
Matplotlib is a low-level library that provides a wide range of plotting options. It allows you to create basic charts,
such as line charts, scatter plots, and bar charts, as well as more complex visualizations, such as heatmaps, contour
plots, and 3D plots. Matplotlib provides a lot of control over the details of the plot, but it can be more difficult to
use than Seaborn.
Seaborn is a high-level library that is built on top of Matplotlib. It provides a simpler interface for creating common
types of statistical plots, such as scatter plots, line plots, and bar plots. Seaborn also provides more advanced
statistical visualizations, such as violin plots, box plots, and heatmaps. Seaborn is designed to work well with Pandas
data frames, and it provides built-in support for many common data visualization tasks, such as grouping data by
categories and computing summary statistics.
In general, if you need to create complex visualizations or have very specific requirements for your plots, Matplotlib
may be the better choice. If you want to create common types of statistical plots quickly and easily, or if you are
working with Pandas data frames, Seaborn may be the better choice.
Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in
data. It is an important step in the data analysis process, as it ensures that the data is accurate and reliable.
There are several techniques that can be used for data cleaning, such as removing duplicates, handling missing values,
correcting data types, and standardizing data formats. The specific techniques used will depend on the nature of the
data and the goals of the analysis.
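The four techniques named above can be sketched with pandas as follows. The messy input is hypothetical, and filling missing ages with the median is only one possible choice; dropping the row would be another:

```python
import pandas as pd

# Hypothetical messy data: inconsistent text, numbers stored as strings,
# a duplicated row, and a missing value.
raw = pd.DataFrame({
    'Name': [' john', 'Smith', 'Smith', 'PAUL '],
    'Age': ['25', '30', '30', None],
})

clean = raw.copy()

# Standardize the text format: trim whitespace, consistent capitalization.
clean['Name'] = clean['Name'].str.strip().str.title()

# Correct the data type: strings to numbers (unparseable values become NaN).
clean['Age'] = pd.to_numeric(clean['Age'], errors='coerce')

# Remove exact duplicate rows and renumber the index.
clean = clean.drop_duplicates().reset_index(drop=True)

# Handle missing values: here, fill with the median of the known ages.
clean['Age'] = clean['Age'].fillna(clean['Age'].median())
```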
Categorical data is data that consists of categories or labels, rather than numerical values. Examples of categorical
data include gender, race, and occupation.
When working with categorical data in Python, it is important to encode the data in a way that can be used in machine
learning models. One common technique is one-hot encoding, which creates a binary column for each category. Another
technique is label encoding, which assigns a numerical value to each category.
The pandas library provides several tools for working with categorical data, including pd.Categorical(), which
creates a categorical variable, and pd.get_dummies(), which performs one-hot encoding. The sklearn library also
provides encoder classes for categorical data, including sklearn.preprocessing.LabelEncoder, which
performs label encoding (it is intended for target labels), and sklearn.preprocessing.OneHotEncoder, which performs one-hot encoding.
When working with categorical data, it is important to choose the appropriate encoding technique based on the nature
of the data and the goals of the analysis.
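Both encodings can be sketched with pandas alone. The Occupation column is hypothetical, and the integer codes from pd.Categorical are used here as a stand-in for label encoding rather than sklearn's LabelEncoder:

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({'Occupation': ['nurse', 'teacher', 'engineer', 'teacher']})

# One-hot encoding: one indicator column per category,
# with exactly one column "on" in each row.
one_hot = pd.get_dummies(df['Occupation'])

# Label encoding: pd.Categorical assigns an integer code to each category.
# Categories are sorted, so the codes follow alphabetical order here.
cat = pd.Categorical(df['Occupation'])
df['Occupation_code'] = cat.codes
```

Label codes imply an ordering (engineer < nurse < teacher) that has no real meaning for this column, which is one reason one-hot encoding is often preferred for unordered categories.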
A scatter plot displays the values of two variables along two axes. Each dot represents an observation, and its position
on the X (horizontal) and Y (vertical) axes gives the values of the two variables. It is very useful for studying
the relationship between the two variables. It is common to add more information using colors or shapes (to
show groups, or a third variable). It is also possible to map another variable to the size of each dot, which makes a
bubble plot. If you have many dots and struggle with overplotting, consider using a 2D density plot.
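A minimal Matplotlib sketch of a scatter plot with color for groups and dot size for a third variable. The Income and Group columns are hypothetical, and the Agg backend line is only there so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    'Age': [25, 30, 50, 45],
    'Income': [30, 42, 80, 65],      # in thousands
    'Group': ['A', 'A', 'B', 'B'],
})

fig, ax = plt.subplots()

# Map each group to a color.
colors = df['Group'].map({'A': 'tab:blue', 'B': 'tab:orange'}).tolist()

# x and y give the two variables; color encodes the group;
# marker size (s) encodes a third variable, which makes a bubble plot.
points = ax.scatter(df['Age'], df['Income'], c=colors, s=df['Income'] * 5)
ax.set_xlabel('Age')
ax.set_ylabel('Income (thousands)')
# plt.show() would display the figure in an interactive session.
```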
A line plot is a type of chart that displays data as a series of points connected by straight lines. It is useful for
showing trends over time or for comparing two or more sets of data. Each point on the line represents a data value,
and the line shows how the values change over time or across different categories. Line plots are commonly used in
scientific research, finance, and other fields where data analysis is important.
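A minimal Matplotlib sketch of a line plot comparing two series over time. The monthly sales figures are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical monthly values for two products.
months = [1, 2, 3, 4, 5, 6]
sales_a = [10, 12, 15, 14, 18, 21]
sales_b = [8, 9, 9, 11, 12, 12]

fig, ax = plt.subplots()

# Each call draws one series: points connected by straight lines.
(line_a,) = ax.plot(months, sales_a, marker='o', label='Product A')
(line_b,) = ax.plot(months, sales_b, marker='s', label='Product B')

ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.legend()
# plt.show() would display the figure in an interactive session.
```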
A histogram is a graphical representation of the distribution of numerical data. It takes as input one
numerical variable only. The variable is cut into several bins, and the number of observations per bin is represented
by the height of the bar.
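A minimal Matplotlib sketch of a histogram. The list of ages is hypothetical, and the choice of four bins is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical numerical variable.
ages = [22, 25, 27, 30, 31, 33, 35, 38, 41, 45, 50, 58]

fig, ax = plt.subplots()

# hist cuts the range of the data into equal-width bins and counts
# the observations in each; it returns the counts and the bin edges.
counts, bin_edges, patches = ax.hist(ages, bins=4)

ax.set_xlabel('Age')
ax.set_ylabel('Number of observations')
# plt.show() would display the figure in an interactive session.
```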
A bar plot (or bar chart) is one of the most common types of plot. It shows the relationship between a numerical
variable and a categorical variable. For example, you can display the height of several individuals using a bar chart.
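A minimal Matplotlib sketch of the height example. The names come from the DataFrame used earlier; the heights are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

names = ['John', 'Smith', 'Paul', 'Mark']   # categorical variable
heights_cm = [175, 182, 168, 190]           # hypothetical numerical variable

fig, ax = plt.subplots()

# One bar per category; the bar height is the numerical value.
bars = ax.bar(names, heights_cm)

ax.set_xlabel('Name')
ax.set_ylabel('Height (cm)')
# plt.show() would display the figure in an interactive session.
```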
A box-and-whisker plot (or box plot) is a convenient way of visually displaying the distribution of data through
its quartiles.
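A minimal Matplotlib sketch of a box plot, with the quartiles also computed directly with NumPy for comparison. The list of ages is hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical numerical variable.
ages = [22, 25, 27, 30, 31, 33, 35, 38, 41, 45, 50, 58]

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(ages, [25, 50, 75])

fig, ax = plt.subplots()

# boxplot returns a dict of the drawn artists (boxes, medians, whiskers, ...).
result = ax.boxplot(ages)
ax.set_ylabel('Age')
# plt.show() would display the figure in an interactive session.
```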
Thank you