Data Mining - Summer 2 - Sesh 1

Uploaded by

Akif Ansari

BAN 6003 – SEC 7B06

Lecture 1 – July 16 2025


AGENDA

 Introduction
 Canvas Assignments
 Course Expectations & Grading
 McGraw Hill
 Lecture 1 –
 Data Measurement
 Summary Measures
 Data Visualization
ABOUT ME

Education:
 BS Psychology and Biology - University of Illinois
 MS Industrial and Systems Engineering - Virginia Tech
 PhD Industrial and Systems Engineering (Human Factors) - University
of Wisconsin
Work History:
 Northrop Grumman, Children’s Hospital of Philadelphia, Divvy Dose
(Optum), Volkswagen Auto Cloud, Saama, Froedtert Hospital, Teladoc
Health*
Hobbies:
 Travel, Sports (Tennis, Basketball), Music, Food
CHAPTER 2: DATA
MEASUREMENT AND
WRANGLING
Turning Raw Data into Reliable Insights
Prof. Siddarth Ponnala | 7/16/2025
LECTURE OVERVIEW

 Part 1: Introduction to Data Measurement


 Part 2: Data Wrangling Techniques
 Part 3: Practical Tips and Wrap-up
WHAT IS DATA MEASUREMENT?

 Assigning values to variables for comparison and analysis


 Ensures consistency, interpretability, and analytical validity
SCALES OF MEASUREMENT

 Nominal – Categories without order (e.g., Gender, Blood Type)


 Ordinal – Ordered categories (e.g., Satisfaction rating)
 Interval – Ordered with equal spacing, no true zero (e.g.,
Temperature)
 Ratio – Interval with true zero (e.g., Height, Income)
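
The four scales can be made explicit in code. The sketch below assumes pandas and uses made-up column names and values; it shows that order-based comparisons only make sense once a column is declared ordinal.

```python
import pandas as pd

# Hypothetical survey data; column names and values are illustrative only.
df = pd.DataFrame({
    "blood_type": ["A", "O", "B", "O"],                  # nominal
    "satisfaction": ["low", "high", "medium", "high"],   # ordinal
    "temp_c": [36.6, 37.2, 38.1, 36.9],                  # interval (no true zero)
    "height_cm": [170.0, 165.5, 180.2, 158.0],           # ratio (true zero)
})

# Nominal: categories with no order.
df["blood_type"] = pd.Categorical(df["blood_type"])

# Ordinal: categories with an explicit order.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

# Order-based comparison is valid for the ordinal column only.
at_least_medium = df["satisfaction"] >= "medium"
print(at_least_medium.tolist())
```

Trying the same comparison on the nominal column raises an error, which is the behavior you want: the scale restricts the operations that are meaningful.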
WHY MEASUREMENT MATTERS

 Determines suitable visualizations


 Influences statistical tests
 Affects model interpretation
COMMON RAW DATA ISSUES

 Missing values
 Inconsistent formats
 Duplicates
 Outliers
 Mixed data types
IMPORTING AND CLEANING DATA

 Import from CSV, Excel, SQL, APIs


 Remove duplicates: drop_duplicates()
 Handle missing data: dropna(), impute values
 Standardize: casing, dates, whitespace
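
A minimal cleaning sketch in pandas, assuming a small made-up table (the column names and the choice of median imputation are illustrative, not a prescribed method):

```python
import pandas as pd

# Hypothetical raw data with messy casing, whitespace, duplicates, and gaps.
raw = pd.DataFrame({
    "name": [" Alice", "BOB ", "bob", "Cara", "Cara"],
    "visit_date": ["2025-07-01", "2025-07-03", "2025-07-03", None, None],
    "score": [88.0, None, None, 92.0, 92.0],
})

clean = raw.copy()
# Standardize whitespace and casing so "BOB " and "bob" become the same value.
clean["name"] = clean["name"].str.strip().str.title()
# Parse date strings; missing entries become NaT.
clean["visit_date"] = pd.to_datetime(clean["visit_date"])
# Remove exact duplicate rows (only detectable after standardizing).
clean = clean.drop_duplicates().reset_index(drop=True)
# Impute missing scores with the column median (one option among many).
clean["score"] = clean["score"].fillna(clean["score"].median())
# Alternatively, drop rows that are still missing a required field.
complete = clean.dropna(subset=["visit_date"])

print(clean)
print(complete)
```

Note the order of steps: standardizing before `drop_duplicates()` matters, because rows that differ only in casing or whitespace are not exact duplicates until they are cleaned.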
TRANSFORMING DATA

 Create new columns: BMI = weight (kg) / height (m)^2


 Recode values: Yes/No to 1/0
 Bin continuous data into categories
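
The three transformations above might look like this in pandas. The data are invented, and the BMI cut points are the commonly cited 18.5 / 25 / 30 thresholds, used here as an assumption for illustration.

```python
import pandas as pd

# Hypothetical data; column names are illustrative.
people = pd.DataFrame({
    "weight_kg": [70.0, 90.0, 50.0],
    "height_m": [1.75, 1.80, 1.60],
    "smoker": ["Yes", "No", "No"],
})

# Create a new column: BMI = weight (kg) / height (m) squared.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# Recode Yes/No to 1/0.
people["smoker_flag"] = people["smoker"].map({"Yes": 1, "No": 0})

# Bin the continuous BMI values into categories.
# right=False makes each bin include its lower edge, e.g. [18.5, 25).
people["bmi_group"] = pd.cut(
    people["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)

print(people)
```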
RESHAPING AND AGGREGATING

 Reshape: pivot(), melt()


 Group by category: groupby() + aggregate
 Summarize with mean, median, count
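
A short sketch of reshaping and aggregating with pandas, using an invented table of clinic visits per quarter:

```python
import pandas as pd

# Hypothetical wide table: one column per quarter.
wide = pd.DataFrame({
    "clinic": ["North", "South"],
    "q1": [10, 20],
    "q2": [14, 18],
})

# melt(): wide -> long, one row per clinic/quarter pair.
long = wide.melt(id_vars="clinic", var_name="quarter", value_name="visits")

# pivot(): long -> wide again.
back = long.pivot(index="clinic", columns="quarter", values="visits")

# groupby() + aggregate: summarize visits per clinic.
summary = long.groupby("clinic")["visits"].agg(["mean", "median", "count"])

print(long)
print(summary)
```

Long format is usually the easier shape for `groupby()` and for most plotting libraries; wide format tends to be easier for people to read.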
MERGING DATA SOURCES

 Join multiple datasets


 Merge types: inner, left, right, outer
 Example: Combining patient records with lab results
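
To see how the four merge types differ, the sketch below joins a made-up patient table with a made-up lab table on a shared `patient_id` key (all names and values are illustrative):

```python
import pandas as pd

# Hypothetical tables; patient 1 has no lab result, lab row 4 has no patient record.
patients = pd.DataFrame({"patient_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
labs = pd.DataFrame({"patient_id": [2, 3, 4], "glucose": [95, 110, 102]})

joined = {
    how: patients.merge(labs, on="patient_id", how=how)
    for how in ["inner", "left", "right", "outer"]
}

for how, result in joined.items():
    # inner keeps only matching keys; left/right keep all rows from one side;
    # outer keeps every key from both sides, filling gaps with NaN.
    print(how, len(result))
```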
POPULAR TOOLS FOR
WRANGLING

 Python/Pandas: Scalable, efficient scripting


 R/dplyr: Tidyverse-style chaining
 Excel: Manual but visual for small data
 SQL: Great for structured databases
BEST PRACTICES

 Know your data types and scales


 Document each transformation
 Validate cleaned data visually/statistically
 Write reproducible code
KEY TAKEAWAYS

 Proper measurement guides valid analysis


 Wrangling transforms messy data into insights
 Foundation for data visualization and modeling
WHAT’S NEXT?

 Exploratory Data Analysis (EDA)


 Data visualization best practices
 Feature engineering
 Data quality assessment
QUESTIONS OR DISCUSSION

 Prompt: What challenges have you faced in cleaning data?


 Thank you!
CHAPTER 3: UNDERSTANDING
SUMMARY MEASURES IN DATA
ANALYSIS
How to Describe and Compare Data Effectively
LECTURE OVERVIEW

 Part 1: What Are Summary Measures?


 Part 2: Types of Summary Statistics
 Part 3: Practical Applications and Tips
WHAT ARE SUMMARY
MEASURES?

 Describe features of a dataset using single values


 Understand data's center, spread, and shape
 Reduce complexity and support comparisons
CATEGORIES OF SUMMARY
MEASURES

 Measures of Central Tendency (mean, median, mode)


 Measures of Spread (range, variance, SD, IQR)
 Shape of Distribution (skewness, kurtosis)
 Position Metrics (percentiles, quartiles)
MEAN, MEDIAN, AND MODE

 Mean: Average value – use for symmetric distributions


 Median: Middle value – good for skewed data
 Mode: Most frequent – use with categorical data
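
A small illustration with Python's standard `statistics` module, using invented numbers: a single extreme value pulls the mean far above the median, which is why the median is preferred for skewed data.

```python
import statistics

# Hypothetical incomes in thousands; the last value is an extreme case.
incomes = [30, 32, 35, 38, 40, 250]

mean_income = statistics.mean(incomes)      # pulled upward by 250
median_income = statistics.median(incomes)  # middle of the sorted values

# Mode works on categorical data, where mean and median are undefined.
blood_types = ["O", "A", "O", "B", "O", "A"]
modal_type = statistics.mode(blood_types)

print(round(mean_income, 2), median_income, modal_type)
```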
MEASURES OF SPREAD

 Range = Max - Min


 Variance: Average squared deviation from the mean
 Standard Deviation: √variance
 IQR: Q3 - Q1 (middle 50%)
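
The measures above can be computed with the standard `statistics` module. The data are made up; note that "average squared deviation" corresponds to the population variance, while the sample variance divides by n - 1 instead of n. The quartile method ("inclusive") is one of several conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean = 5

data_range = max(data) - min(data)

# Population versions: divide by n.
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Sample versions: divide by n - 1.
sample_var = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Quartiles and IQR (the spread of the middle 50% of the data).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, pop_var, pop_sd, round(sample_var, 3), iqr)
```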
DISTRIBUTION SHAPE: SKEWNESS
& KURTOSIS

 Skewness: Direction and extent of asymmetry


 Kurtosis: Heaviness of the tails (often described as peakedness or flatness)
 Tools: Histogram, boxplot, density plot
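
As a rough sketch, skewness and kurtosis can be computed from central moments. The functions below use the simple population (moment-based) definitions; statistical packages often apply small-sample corrections, so their results may differ slightly. The helper names and data are illustrative.

```python
import statistics

def central_moment(xs, k):
    """k-th central moment: average of (x - mean)**k."""
    mu = statistics.fmean(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)

def skewness(xs):
    """Moment-based skewness: m3 / m2**1.5 (0 for symmetric data)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # long tail toward large values

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

Positive skewness indicates a longer right tail, negative skewness a longer left tail.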
POSITION METRICS

 Percentiles: Value below which a given % of the data falls


 Quartiles: Divide data into four parts
 Used in boxplots, benchmarks
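
A brief sketch of quartiles and percentiles using `statistics.quantiles`, on the integers 1 to 100 as stand-in data. The "inclusive" interpolation method is an assumption; Excel, R, and pandas each offer several conventions that can give slightly different cut points.

```python
import statistics

scores = list(range(1, 101))  # illustrative data: 1, 2, ..., 100

# Quartiles: three cut points that divide the data into four parts.
quartiles = statistics.quantiles(scores, n=4, method="inclusive")

# Percentiles: 99 cut points; index 89 is the 90th percentile.
percentiles = statistics.quantiles(scores, n=100, method="inclusive")
p90 = percentiles[89]

print(quartiles, p90)
```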
SUMMARY STATS BY DATA TYPE

 Nominal: Mode, Frequency Count


 Ordinal: Median, IQR
 Interval/Ratio: Mean, SD, Variance, Percentiles
VISUALIZATION AIDS

 Boxplots: Show median, IQR, outliers


 Histograms: Show distribution and skew
 Bar Charts: For mode/categorical frequency
REAL-LIFE EXAMPLES

 Median income across cities


 IQR of blood pressure in clinic
 Average customer support response time
BEST PRACTICES

 Assess data type and distribution


 Report multiple summary measures
 Use visuals to complement stats
 Watch for outliers and skew
SUMMARY AND KEY TAKEAWAYS

 Simplify complex data with summary measures


 Different measures serve different goals
 Context and data type matter
WHAT’S NEXT?

 Exploratory Data Analysis (EDA)


 Inferential statistics
 Data visualization techniques
QUESTIONS OR DISCUSSION

 Prompt: Which summary measure do you use most—and why?


 Thank you!
CHAPTER 4: DATA
VISUALIZATION
Data Visualization

• Data visualization: the process of displaying data (often in large
quantities) in a meaningful fashion to provide insights that will
support better decisions.
– Data visualization improves decision-making, provides
managers with better analysis capabilities that reduce
reliance on IT professionals, and improves collaboration
and information sharing.
Creating Charts in Microsoft Excel

• Highlight the data.
• Select the Insert tab.
• Click on the chart type, then subtype.
• Use the options in the Design (Chart Design in Mac) and Format tabs
to customize your chart.
Column and Bar Charts

• Excel distinguishes between vertical and horizontal bar charts, calling the
former column charts and the latter bar charts.
– A clustered column chart compares values across categories using vertical
rectangles;
– a stacked column chart displays the contribution of each value to the total by
stacking the rectangles;
– a 100% stacked column chart compares the percentage that each value
contributes to a total.
• Column and bar charts are useful for comparing categorical or ordinal
data, for illustrating differences between sets of values, and for showing
proportions or percentages of a whole.
Line Charts

• Line charts provide a useful means for displaying data over time.
– You may plot multiple data series in line charts; however, they can be
difficult to interpret if the magnitude of the data values differs greatly.
In that case, it would be advisable to create separate charts for each
data series.
Pie Charts

• A pie chart displays the relative proportion of each data source to
the total by partitioning a circle into pie-shaped areas.
Pie Chart Alternatives

• Data visualization professionals don't recommend using pie charts. In
a pie chart, it is difficult to compare the relative sizes of areas;
the bars in a column chart, by contrast, can easily be compared to
determine relative ratios of the data.
– If you do use pie charts, restrict them to small numbers of categories,
always ensure that the numbers add to 100%, and use labels to display
the group names and actual percentages. Keep them simple and avoid
three-dimensional (3-D) pie charts, especially rotated ones.
Area Charts

• An area chart combines the features of a pie chart with those of
line charts.
– Area charts present more information than pie or line
charts alone but may clutter the observer’s mind with
too many details if too many data series are used; thus,
they should be used with care.
Scatter Charts

• Scatter charts show the relationship between two variables. To
construct a scatter chart, we need observations that consist of
pairs of variables.
Orbit Charts

• An orbit chart is a scatter chart in which the points are connected in
sequence, such as over time. Orbit charts show the “path” that the
data take over time, often showing some unusual patterns that can
provide unique insights.
– Create a scatter chart with smooth lines and markers.
Bubble Charts

• A bubble chart is a type of scatter chart in which the size of the
data marker corresponds to the value of a third variable;
consequently, it is a way to plot three variables in two dimensions.
Combination Charts

• Often, we wish to display multiple data series on the same chart
using different chart types. Excel 2016 for Windows provides a
Combo Chart option for constructing such a combination chart; in
Excel 2016 for Mac, it must be done manually.
• We can also plot a second data series on a secondary axis; this is
particularly useful when the scales differ greatly.
Radar Charts

• Radar charts plot multiple metrics on axes that radiate from a common
center, giving a spider-web appearance.
• This is a useful chart to compare survey data from one time period
to another or to compare performance of different entities such as
factories, companies, and so on using the same criteria.
Stock Charts

• A stock chart allows you to plot stock prices, such as daily high,
low, and close values.
• We will explain how to create stock charts in
Chapter 6 to visualize some statistical results,
and again in Chapter 15 to visualize optimization
results.
Sparklines

• Sparklines are graphics that summarize a row or column of data in a
single cell.
• Excel has three types of sparklines: line, column,
and win/loss.
– Line sparklines are clearly useful for time-series data.
– Column sparklines are more appropriate for categorical
data.
– Win/loss sparklines are useful for data that move up or
down over time.
Dashboards

• A dashboard is a visual representation of a set of key business
measures. It is derived from the analogy of an automobile’s control
panel, which displays speed, gasoline level, temperature, and so on.
– Dashboards provide important summaries of key business information to help
manage a business process or function.