Intro
Structure
    Basic Visualization                Next lecture
    1. Why visualize                   5. Interactive visualization with Plotly
    2. How do we visualize             6. Pitfalls of visualization
    3. Visualization tool
    4. Case study: US presidential election
Topic 1: Visualization
Why visualize
  • Numbers are “boring”; visuals are more appealing.
  • “Without data visualization, data analysis is like finding a needle in a
    haystack... without knowing what a needle looks like!”
  • Humans quantify things visually. You can easily be fooled by a
    misleading visualization.
    Example: did we just double the profit each month?
Why visualize
  • Statistical measures, such as
    mean and variance, do not
    always tell you the truth
    about the data.
  • All of the following datasets
    have the same mean and
    variance. Are they the same
    data?
  Without proper visualization,
  you can never tell what the
  data is about.
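
The point about identical summary statistics can be checked numerically. The sketch below uses two small made-up datasets (they are assumptions for illustration, not the datasets pictured on the slide) that share the same mean and variance although their values are clearly different.

```python
import numpy as np

# Two small made-up datasets (not the ones pictured on the slide).
a = np.array([-1.0, -1.0, 1.0, 1.0])
b = np.array([-np.sqrt(2.0), 0.0, 0.0, np.sqrt(2.0)])

for name, data in (("a", a), ("b", b)):
    # np.var computes the population variance by default (ddof=0).
    print(f"{name}: mean={data.mean():.3f}, "
          f"variance={data.var():.3f}, values={np.round(data, 3)}")
```

Both lines report the same mean and variance, so only looking at the values themselves (or plotting them) reveals the difference.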
How do we visualize
  • There are hundreds of
    different kinds of
    visualizations.
  • The right choice often depends
    on your specific data and purpose.
  • A visualization can be static or
    interactive (next lecture).
  You don’t need to remember
  all of them, but you have to
  know they exist.
Distribution plot
  To show how data is distributed,
  identify patterns, central tendency,
  spread, and outliers.
  •   Histogram: Shows the
      distribution of a variable by
      dividing it into bins.
  •   Kernel Density Estimate
      (KDE) Plot: A smoothed curve
      that represents the
      distribution of data.
  •   Box Plot: Displays the
      summary of a dataset’s
      distribution with minimum,
      first quartile, median, third
      quartile, and maximum.
  •   Violin Plot: Combines a box
      plot and a KDE plot to show both
      the summary statistics and the
      shape of the distribution.
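
As a rough illustration of the distribution plots listed above, the following sketch draws a histogram, a box plot, and a violin plot of the same sample. The data is synthetic (an assumption for illustration), and the KDE curve is left out to keep the sketch to NumPy and Matplotlib only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample (an assumption for illustration): 500 draws from a normal distribution.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax_hist, ax_box, ax_violin) = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: counts of values falling into 20 equal-width bins.
counts, bin_edges, _ = ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# Box plot: quartiles and median; points beyond the whiskers are drawn as outliers.
ax_box.boxplot(values)
ax_box.set_title("Box plot")

# Violin plot: a mirrored density estimate, here with the median marked.
ax_violin.violinplot(values, showmedians=True)
ax_violin.set_title("Violin plot")

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```

Note that Matplotlib's default box plot draws whiskers up to 1.5 times the interquartile range rather than to the strict minimum and maximum.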
Comparison Plots
  To compare data across different
  groups or over time.
  •   Bar Plot (Bar Chart):
      Compares categorical data
      using rectangular bars where
      the height/length represents
      the value.
  •   Grouped Bar Plot: Shows
      comparisons across multiple
      categories within a main
      category.
  •   Line Plot (Line Chart): Used
      for tracking changes and
      trends over time or ordered
      categories.
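
A minimal sketch of a grouped bar plot and a line plot, assuming made-up quarterly figures for two hypothetical products:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sales figures for two products over four quarters (illustrative only).
quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [20, 34, 30, 35]
product_b = [25, 32, 34, 20]

fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(10, 4))

# Grouped bar plot: shift each group's bars by half a bar width around the tick.
positions = np.arange(len(quarters))
width = 0.4
ax_bar.bar(positions - width / 2, product_a, width, label="Product A")
ax_bar.bar(positions + width / 2, product_b, width, label="Product B")
ax_bar.set_xticks(positions)
ax_bar.set_xticklabels(quarters)
ax_bar.set_title("Grouped bar plot")
ax_bar.legend()

# Line plot: the same values drawn as trends over the ordered quarters.
ax_line.plot(quarters, product_a, marker="o", label="Product A")
ax_line.plot(quarters, product_b, marker="o", label="Product B")
ax_line.set_title("Line plot")
ax_line.legend()

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```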
Relationship Plots
  To show relationships or
  correlations between two or more
  variables.
  •   Scatter Plot: Displays values
      for two continuous variables
      using points on a 2D space to
      see correlations or patterns.
  •   Bubble Plot: An extension of
      the scatter plot that also
      shows a third variable through
      the size of the bubbles.
  •   Heatmap: Uses a color
      gradient to represent the
      relationship and intensity
      between two dimensions or
      categories.
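
The three relationship plots can be sketched as follows. The variables x, y, and z are randomly generated for illustration, and using a correlation matrix as the heatmap input is just one possible choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=50)
y = 2 * x + rng.normal(0, 2, size=50)   # y depends linearly on x, plus noise
z = rng.uniform(10, 200, size=50)       # third variable, used as bubble area

fig, (ax_scatter, ax_bubble, ax_heat) = plt.subplots(1, 3, figsize=(13, 4))

# Scatter plot: one point per (x, y) pair.
ax_scatter.scatter(x, y)
ax_scatter.set_title("Scatter plot")

# Bubble plot: the `s` argument sets the marker area, here taken from z.
ax_bubble.scatter(x, y, s=z, alpha=0.5)
ax_bubble.set_title("Bubble plot")

# Heatmap: colour-code the correlation matrix of the three variables.
corr = np.corrcoef([x, y, z])
image = ax_heat.imshow(corr, cmap="viridis", vmin=-1, vmax=1)
ax_heat.set_xticks(range(3))
ax_heat.set_xticklabels(["x", "y", "z"])
ax_heat.set_yticks(range(3))
ax_heat.set_yticklabels(["x", "y", "z"])
fig.colorbar(image, ax=ax_heat)
ax_heat.set_title("Heatmap")

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```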
Composition Plots
  To show the proportions of a whole
  and how they change over time.
  •   Pie Chart: Shows parts of a
      whole as slices of a pie.
  •   Stacked Bar Chart: Displays
      the composition of multiple
      values within a bar.
  •   Area Chart: Similar to a line
      chart, but the area beneath
      the line is filled to show the
      volume.
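
A small sketch of the three composition plots, assuming made-up market shares for three hypothetical products over four years:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up market shares (percent) of three products over four years.
years = [2020, 2021, 2022, 2023]
shares = {
    "A": np.array([50, 45, 40, 35]),
    "B": np.array([30, 30, 35, 35]),
    "C": np.array([20, 25, 25, 30]),
}

fig, (ax_pie, ax_stack, ax_area) = plt.subplots(1, 3, figsize=(13, 4))

# Pie chart: composition in a single year (the last one).
ax_pie.pie([s[-1] for s in shares.values()], labels=list(shares), autopct="%1.0f%%")
ax_pie.set_title("Pie chart (2023)")

# Stacked bar chart: each segment starts where the previous one ended.
bottom = np.zeros(len(years))
for name, values in shares.items():
    ax_stack.bar(years, values, bottom=bottom, label=name)
    bottom = bottom + values
ax_stack.set_xticks(years)
ax_stack.set_title("Stacked bar chart")
ax_stack.legend()

# Area chart: stackplot fills the area under each stacked line.
ax_area.stackplot(years, *shares.values(), labels=list(shares))
ax_area.set_xticks(years)
ax_area.set_title("Area chart")
ax_area.legend()

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```

The pie chart shows one moment in time, while the stacked bar and area charts show how the same composition changes across years.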
Ranking Plots
  To display data in order of
  importance or rank.
  •   Bar Chart
      (Ordered/Sorted): Displays
      categories sorted by their
      value.
  •   Dot Plot: An alternative to
      bar charts that plots points to
      indicate rank or order.
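
Both ranking plots rely on sorting the categories before drawing. A minimal sketch, assuming made-up scores:

```python
import matplotlib.pyplot as plt

# Made-up scores per category (illustrative only).
scores = {"Delta": 12, "Alpha": 30, "Charlie": 18, "Bravo": 25}

# Sort categories by value so the order itself carries the ranking.
ranked = sorted(scores.items(), key=lambda item: item[1])
labels = [name for name, _ in ranked]
values = [value for _, value in ranked]

fig, (ax_bar, ax_dot) = plt.subplots(1, 2, figsize=(10, 4))

# Sorted horizontal bar chart: barh draws the first item at the bottom,
# so ascending order puts the largest value on top.
ax_bar.barh(labels, values)
ax_bar.set_title("Sorted bar chart")

# Dot plot: the same ranking, with a marker instead of a bar.
ax_dot.plot(values, labels, "o")
ax_dot.grid(axis="x")
ax_dot.set_title("Dot plot")

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```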
Part-to-Whole Plots
  To represent a part-to-whole
  relationship in data.
  •   Donut Chart: A variation of a
      pie chart with a central cut-out.
  •   Treemap: Uses nested
      rectangles to display data
      hierarchies.
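
In Matplotlib, one common way to get a donut chart is to draw a pie chart whose wedges are narrower than the radius. The sketch below assumes a made-up budget split. A treemap is not shown because it typically relies on a third-party package.

```python
import matplotlib.pyplot as plt

# Made-up budget split in percent (illustrative only).
parts = {"Rent": 40, "Food": 25, "Transport": 15, "Other": 20}

fig, ax = plt.subplots(figsize=(5, 5))

# A donut chart is a pie chart whose wedges have a width smaller than
# the radius, which leaves the centre empty.
ax.pie(list(parts.values()), labels=list(parts), wedgeprops={"width": 0.4})
ax.set_title("Donut chart")

# plt.show()  # uncomment to display the figure interactively
```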
Visualization tool
                           Seaborn                              Matplotlib
  Level of Abstraction     High-level                           Low-level
  Ease of Use              Easier to learn and use              More complex, requires more code
  Default Aesthetics       Built-in themes and color palettes   More customization required
  Statistical Graphics     Specialized functions for            Requires more manual setup
                           statistical visualizations
  Integration with Pandas  Seamless integration                 Requires more data manipulation
  Customization            Less flexible customization          Highly customizable
Visualization tool
  You can choose between Matplotlib (plt) and Seaborn. The two scripts below
  draw the same plot.

  Matplotlib:
     import matplotlib.pyplot as plt
     import numpy as np

     # Prepare data
     x = np.linspace(0, 10, 100)
     y = np.sin(x)

     # Create the plot. In Matplotlib, you have to customize the
     # visualization type yourself.
     plt.figure(figsize=(8, 6))
     plt.plot(x, y, label='Sine Wave')

     # Find peaks: indices where the slope changes from positive to negative
     peaks = np.where(np.diff(np.sign(np.diff(y))) < 0)[0] + 1

     # Plot points at peaks
     plt.plot(x[peaks], y[peaks], 'ro')

     # Add labels for the axes, a title, and an annotation
     plt.xlabel('x-axis')
     plt.ylabel('y-axis')
     plt.title('Simple Sine Wave Plot')
     plt.text(5, 0.5, 'Peak Value', fontsize=12)

     # Add source reference outside the plot
     plt.figtext(0.05, 0.05, 'Source: Generated Data', fontsize=10)

     plt.grid(True)
     plt.legend()
     plt.show()

  Seaborn:
     import matplotlib.pyplot as plt
     import numpy as np
     import pandas as pd
     import seaborn as sns

     sns.set_palette('Set2')

     # Prepare data in DataFrame format
     x = np.linspace(0, 10, 100)
     df = pd.DataFrame({'x': x, 'y': np.sin(x)})

     # Create the plot. Seaborn has high-level plot functions that are
     # easier to use.
     sns.lineplot(x='x', y='y', data=df, label='Sine Wave')

     # Find peaks: indices where the slope changes from positive to negative
     peaks = np.where(np.diff(np.sign(np.diff(df['y']))) < 0)[0] + 1

     # Plot points at peaks
     sns.scatterplot(x=df['x'][peaks], y=df['y'][peaks], color='red')

     # Add labels for the axes, a title, and an annotation.
     # Seaborn also supports Matplotlib functions.
     plt.xlabel('x-axis')
     plt.ylabel('y-axis')
     plt.title('Simple Sine Wave Plot')
     plt.text(5, 0.5, 'Peak Value', fontsize=12)

     # Add source reference outside the plot
     plt.figtext(0.05, 0.02, 'Source: Generated Data', fontsize=10)

     plt.show()
Structure of a plot
  A nice and cohesive plot often
  contains:
  •   Clear Title and Labels
  •   Axis Labels
  •   Legends
  •   Gridlines
  •   Appropriate Color
  •   Data source
Case study: US presidential election
  Summary of the U.S. Election
  Process:
  • Two types of votes: popular vote
    and electoral vote.
  • The popular vote determines
    which candidate receives the
    state's electors.
  • In almost every state, the winner
    takes all of the state’s electoral
    votes (Maine and Nebraska split
    theirs).
  • Each state has a different
    number of electoral votes.
  • The candidate with at least 270
    out of 538 electoral votes wins.
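
The winner-takes-all rule summarized above can be sketched in a few lines. The state names, candidate names, and vote counts below are invented for illustration, and the sketch assumes every state awards all of its electors to its popular-vote winner.

```python
# Hypothetical data: state -> (electoral votes, {candidate: popular votes}).
# All names and numbers are invented for illustration only.
states = {
    "State A": (10, {"Candidate X": 510_000, "Candidate Y": 490_000}),
    "State B": (3, {"Candidate X": 100_000, "Candidate Y": 300_000}),
    "State C": (6, {"Candidate X": 260_000, "Candidate Y": 240_000}),
}


def tally(states):
    """Return (electoral votes per candidate, popular votes per candidate)."""
    electoral = {}
    popular = {}
    for electors, votes in states.values():
        # Winner-takes-all: the state's popular-vote leader gets every elector.
        winner = max(votes, key=votes.get)
        electoral[winner] = electoral.get(winner, 0) + electors
        for candidate, count in votes.items():
            popular[candidate] = popular.get(candidate, 0) + count
    return electoral, popular


electoral, popular = tally(states)
print("Electoral votes:", electoral)
print("Popular votes:  ", popular)
```

In this made-up example Candidate Y collects more popular votes overall, yet Candidate X collects more electoral votes, which is one reason a single map of either quantity can mislead.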
Case study: US presidential election
  Challenges to visualize:
  • A state's area is not proportional
    to its number of electoral votes.
  • The vote in each county does not
    represent the whole state
    (winner takes all).
  • Population differs significantly
    between counties and between
    states.
  Can we use a single visualization
  to address all of these problems?
Case study: US presidential election
   This could be misleading: the vast
   majority of the US land area is
   sparsely populated.
Case study: US presidential election
   We can scale each state by its electoral
   votes; each hexagon is one electoral
   vote.
   • The map is distorted; we can hardly
     recognize the US.
   • What about the majority vote?
Case study: US presidential election
   We keep the original map, but show the
   number of electoral votes.
   • It still doesn’t show the majority vote.
   • We don’t know which area
Case study: US presidential election
   The arrows show the orientation of
   voters (toward Democrat or Republican).
   • It doesn’t show the electoral votes.
   • It isn’t scaled by population.
Case study: US presidential election
   Circle size is proportional to the
   amount by which each county’s leading
   candidate is ahead.
   • We lose the state boundaries.
   • In dense areas like the East Coast, it
     is really hard to read the
     visualization.
   There is no “one size fits all”
   solution; we choose the
   visualization that conveys what we
   want to say.