Data Science (21CS772)
Module 1
      Introduction
Basic Terminologies
• Data
• It can be
   • Generated
   • Collected
   • Retrieved
[Slide graphic: Similarity Measures, Data Structures, Algorithms]
Basic Terminologies
•   Data: facts with no meaning
•   Information: learning from facts
•   Knowledge: practical understanding of a subject
•   Understanding: the ability to absorb knowledge and learn to reason
•   Wisdom: the quality of having experience and good judgement; the ability to think and foresee
•   Validity: ways to confirm truth
[Slide diagram: Data → (processed) → Information → (validated) → Knowledge → (thinking) → Wisdom]
 Science is a systematic discipline that builds and organises
 knowledge in the form of testable hypotheses and predictions
 about the world.
   Big Data and Data Science Hype
1. Lack of Clear Definitions:
  • The terms "Big Data" and "data science" are often used without clear
    definitions.
  • Questions arise about the relationship between them and whether data
    science is exclusive to certain industries.
  • Ambiguities make these terms seem almost meaningless.
2. Respect for Existing Work:
  • Lack of recognition for researchers in academia and industry who have
    worked on similar concepts for years.
  • Media portrays machine learning as a recent invention, overlooking the long
    history of work by statisticians, computer scientists, mathematicians, and
    engineers.
 Big Data and Data Science Hype
3. Excessive Hype:
  • Hype surrounding data science is criticized for using over-the-top
    phrases and creating unrealistic expectations.
  • Comparisons to the pre-financial crisis "Masters of the Universe"
    era are seen as detrimental.
  • Excessive hype can obscure the real value underneath and turn
    people away.
   Big Data and Data Science Hype
4. Overlap with Statistics:
  • Statisticians already consider themselves working on the
   "Science of Data."
  • Data science is argued to be more than just a rebranding of
   statistics or machine learning, but media often portrays it as
   such, especially in the context of the tech industry.
5. Debating the Term "Science":
  • Some question whether anything that has to label itself as a
   science truly is one.
  • The term "data science" may not strictly represent a scientific
   discipline but might be more of a craft or a different kind of field.
       Getting Past The Hype
• Amid the hype, there's a kernel of truth: data science
 represents something genuinely new but is at risk of
 premature rejection due to unrealistic expectations.
Why Now!? (Data Science Popularity)
1. Data Abundance and Computing Power:
  • We now have massive amounts of data about various aspects of
   our lives.
  • There's also an abundance of inexpensive computing power,
   making it easier to process and analyze this data.
2. Datafication of Offline Behavior:
  • Our online activities, like shopping, communication, and
   expressing opinions, are commonly tracked.
  • The trend of collecting data about our offline behavior has also
   started, similar to the online data collection revolution.
                            Why Now!?
3. Data's Influence Across Industries:
   • Data is not limited to the internet; it's prevalent in finance, medical industry,
      pharmaceuticals, bioinformatics, social welfare, government, education,
      retail, and more.
   • Many sectors are experiencing a growing influence of data, sometimes
      reaching the scale of "big data."
4. Real-Time Data as Building Blocks:
   • The interest in new data is not just due to its sheer volume but because it
      can be used in real time to create data products.
   • Examples include recommendation systems on Amazon, friend
       recommendations on Facebook, trading algorithms in finance, personalized
       learning in education, and data-driven policies in government.
                         Why Now!?
5. Culturally Saturated Feedback Loop:
  • A significant shift is occurring where our behavior influences products,
    and products, in turn, shape our behavior.
  • Technology, with its large-scale data processing capabilities, increased
    memory, and bandwidth, along with cultural acceptance, enables this
    feedback loop.
6. Emergence of a Feedback Loop:
  • The interaction between behavior and products creates a feedback
    loop, influencing both culture and technology.
  • The book aims to initiate a conversation about understanding and
    responsibly managing this loop, addressing ethical and technical
    considerations.
                           Datafication
• Datafication is described as a process of taking all aspects of life and
  transforming them into data.
• Examples include how "likes" on social media quantify friendships, Google's
  augmented-reality glasses datafy the gaze, and Twitter datafies thoughts.
• People's actions, whether online or in the physical world, are being recorded
  for later analysis.
• Datafication occurs intentionally, such as when actively engaging in
  social media, or unintentionally through passive actions like browsing the
  web or walking around with sensors and cameras capturing data.
• Datafication ranges from intentional participation in social media
  experiments to unintentional surveillance and stalking.
• Regardless of individual intentions, the outcome is the same – datafication.
            The Current Landscape
• So what is Data Science?
• The passage raises questions about what data
 science is and whether it's something new or a
 rebranding of statistics or analytics.
• Data science is described as a blend of hacking
 and statistics. It involves practical knowledge
 of tools and materials along with a theoretical
 understanding of possibilities.
• Drew Conway's Venn diagram from 2010 is
 mentioned as a representation of data science
 skills.
• Data science involves skills such as traditional
 statistics, data munging (parsing, scraping, and
 formatting data), and other technical abilities.
A Data Science Profile (Skillset needed)
  •   Computer science
  •   Math
  •   Statistics
  •   Machine learning
  •   Domain expertise
  •   Communication and presentation skills
  •   Data visualization
 A Data Science Profile (Skillset needed)
• The following experiment was done:
  • Students were given index cards and asked to profile their skill levels in
     different data science domains.
  • The domains include computer science, math, statistics, machine
     learning, domain expertise, communication and presentation skills, and
     data visualization.
• An example of a data science profile is
  shown, indicating the relative skill levels in
  each domain.
 A Data Science Profile (Skillset needed)
• There was noticeable variation in the skill
  profiles of each student, especially
  considering the diverse backgrounds of the
  class, including many students from social
  sciences.
• So a data science team works best when
  different skills (profiles) are represented
  across different people, because nobody is
  good at everything.
OK, So What Is a Data Scientist, Really?
• In Academia:
  • An academic data scientist is described as a scientist trained in
    various disciplines, working with large amounts of data. They must
    address computational challenges posed by the structure, size, and
    complexity of data while solving real-world problems.
  • Articulating data science in academia involves emphasizing
    commonalities in computational and deep data problems across
    disciplines. Collaboration among researchers from different
    departments can lead to solving real-world problems.
  OK, So What Is a Data Scientist, Really?
• In Industry:
  • Chief Data Scientist's Role:
    A chief data scientist in industry sets the data strategy of the company,
    covering aspects such as data collection infrastructure, privacy concerns,
    user-facing data, decision-making processes, and integration into products.
    They manage teams, communicate with leadership, and focus on
    innovation and research goals.
  • General Data Scientist Skills:
    A data scientist in industry extracts meaning from and interprets data using
    tools and methods from statistics and machine learning. They engage in
    collecting, cleaning, and processing data, requiring persistence, statistical
    knowledge, and software engineering skills.
 OK, So What Is a Data Scientist, Really?
• In Industry:
  • Exploratory Data Analysis:
    Exploratory data analysis involves visualization and data sense
    to find patterns, build models, and algorithms. Data scientists
    contribute to understanding product usage, the overall health
    of the product, and designing experiments.
  • Communication and Decision-Making:
    Data scientists communicate with team members, engineers,
    and leadership using clear language and data visualizations.
    They play a critical role in data-driven decision-making
    processes.
                                  Case Studies
• Case Study 1: IBM Watson Health
  •   IBM Watson Health employs data science to enhance healthcare by providing personalized
      diagnostic and treatment recommendations. Watson's natural language processing
      capabilities enable it to sift through vast medical literature and patient records to assist
      doctors in making more informed decisions.
  •   Data science has significantly aided IBM Watson Health in healthcare diagnostics and
      personalized treatment in:
      •   IBM Watson Health has demonstrated a 15% increase in the accuracy of cancer diagnoses when
          assisting oncologists in analyzing complex medical data, including genomic information and
          medical journals.
      •   In a recent clinical trial, IBM Watson Health's AI-powered recommendations helped reduce the
          average time it takes to develop a personalized cancer treatment plan from weeks to just a few
          days, potentially improving patient outcomes and survival rates.
      •   Watson's data-driven insights have contributed to a 30% reduction in medication errors in some
          healthcare facilities by flagging potential drug interactions and allergies in patient records.
      •   IBM Watson Health has processed over 200 million pages of medical literature to date, providing
          doctors with access to a vast knowledge base that can inform their diagnostic and treatment
          decisions.
                                  Case Studies
• Case Study 2: Urban planning and smart cities
  •   Singapore is pioneering the smart city concept, using data science to optimize urban
      planning and public services. They gather data from various sources, including sensors
      and citizen feedback, to manage traffic flow, reduce energy consumption, and improve the
      overall quality of life in the city-state.
  •   Here’s how data science helped Singapore in efficient urban planning:
      •   Singapore's real-time traffic management system, powered by data analytics, has led to a 25%
          reduction in peak-hour traffic congestion, resulting in shorter commute times and lower fuel
          consumption.
      •   Through its data-driven initiatives, Singapore has achieved a 15% reduction in energy
          consumption across public buildings and street lighting, contributing to significant
          environmental sustainability gains.
      •   Citizen feedback platforms have seen 90% of reported issues resolved within 48 hours, reflecting
          the city's responsiveness in addressing urban challenges through data-driven decision-making.
      •   The implementation of predictive maintenance using data science has resulted in a 30%
          decrease in the downtime of critical public infrastructure, ensuring smoother operations and
          minimizing disruptions for residents.
                                Case Studies
• Case Study 3: E-commerce personalization and recommendation
 systems
  • Amazon, the e-commerce giant, heavily relies on data science to personalize the
      shopping experience for its customers. They use algorithms to analyze customers'
      browsing and purchasing history, making product recommendations tailored to
      individual preferences. This approach has contributed significantly to Amazon's success
      and customer satisfaction.
  •   Additionally, Amazon leverages data science for:
      •   Amazon's data-driven product recommendations have led to a 29% increase in average order
          value as customers are more likely to add recommended items to their carts.
      •   A study found that Amazon's personalized shopping experience has resulted in a 68%
          improvement in click-through rates on recommended products compared to non-personalized
          suggestions.
      •   Customer service response times have been reduced by 40% due to fewer inquiries related
          to product recommendations, as customers find what they need more easily.
      •   Amazon's personalized email campaigns, driven by data science, have shown an 18% higher
          open rate and a 22% higher conversion rate compared to generic email promotions.
                            Case Studies
• Case Study 4: Transportation and route optimization
  • Uber revolutionized the transportation industry by using data science to optimize
    ride-sharing and delivery routes. Their algorithms consider real-time traffic
    conditions, driver availability, and passenger demand to provide efficient, cost-
    effective transportation services. Other use cases include:
     • Uber's data-driven routing and matching algorithms have led to an average 20%
       reduction in travel time for passengers, ensuring quicker and more efficient
       transportation.
     • By optimizing driver routes and minimizing detours, Uber has contributed to a 30%
       decrease in fuel consumption for drivers, resulting in cost savings and reduced
       environmental impact.
     • Uber's real-time demand prediction models have helped reduce passenger wait times
       by 25%, enhancing customer satisfaction and increasing the number of rides booked.
     • Over the past decade, Uber's data-driven approach has enabled 100 million active
       users to complete over 15 billion trips, demonstrating the scale and impact of their
       transportation services.
         What is Data Science?
• Data Science is a multidisciplinary
  field that focuses on finding
  actionable insights from large sets of
  structured and unstructured data.
• Data Science experts integrate
  computer science, predictive analytics,
  statistics, and machine learning to mine
  very large data sets, with the goal of
  discovering relevant insights that can
  help the organisation move forward and
  identify specific future events.
Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
• What is Big Data?
  • Big data refers to extremely large and diverse collections of
   structured, unstructured, and semi-structured data that continues to
   grow exponentially over time.
  • These datasets are so huge and complex in volume, velocity, and
   variety, that traditional data management systems cannot store,
   process, and analyze them.
  • Big Data is a term often used loosely, but it generally refers to three
   things:
    • a set of technologies,
    • a potential revolution in measurement, and
    • a philosophy about decision-making.
             Statistical Inference
• The world is complex, random, and uncertain, functioning as a
 massive data-generating machine.
• Everyday activities, from commuting to work, shopping, and even
 biological processes, potentially produce data.
• Processes in our lives are inherently data-generating, and
 understanding them is crucial for problem-solving.
• Data represents traces of real-world processes, collected through
 subjective data collection methods.
              Statistical Inference
• Two sources of randomness and uncertainty:           (i) underlying
 process and (ii) uncertainty in data collection methods.
• Data scientists turn the world into data through subjective
 observation and data collection.
• Capturing the world in data is not enough; the challenge is to
 understand the complex processes behind the data.
• Need for simplification: transforming captured traces          into
 comprehensible forms, often through statistical estimators.
• Statistical Inference is the discipline focused on developing
 procedures and methods to extract meaning from data generated
 by stochastic processes.
             Populations and Samples
• Population
  • Population refers to the entire set of objects or units, not limited to people (e.g.,
    tweets, photographs, stars).
  • Denoted by N, representing the total number of observations in the population.
  • If characteristics of all objects in the population are measured, we have a complete
    set of observations.
• Example:
  • Suppose your population was all emails sent last year by employees at a huge
    corporation, BigCorp.
  • Then a single observation could be a list of things: the sender’s name, the list of
    recipients, date sent, text of email, number of characters in the email, number
    of sentences in the email, number of verbs in the email, and the length of time
    until first reply.
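As a concrete sketch, one such BigCorp observation could be held in a simple record; every field name and value below is made up for illustration:

# One observation from the hypothetical BigCorp email population.
# All field names and values are illustrative, not a real schema.
observation = {
    "sender": "a.kumar@bigcorp.com",
    "recipients": ["b.rao@bigcorp.com", "c.shah@bigcorp.com"],
    "date_sent": "2024-03-11",
    "text": "Quarterly numbers attached ...",
    "n_characters": 1042,
    "n_sentences": 18,
    "n_verbs": 31,
    "seconds_until_first_reply": 540,
}
print(observation["sender"], observation["seconds_until_first_reply"])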
            Populations and Samples
• Sampling
  •   A sample is a subset of units (size n) taken to draw conclusions about the population.
  •   Different sampling mechanisms exist, and awareness is crucial to avoid biases.
• Sampling Methods Example: BigCorp Email:
  • Two reasonable methods:
      •   Randomly selecting 1/10th of all employees and taking their emails, or
      •   sampling 1/10th of all emails sent each day.
  • Both methods yield the same sample size but can lead to different conclusions
      about the underlying distribution of emails.
• Biases can be introduced during sampling, distorting data.
• Distorted data can lead to incorrect and biased conclusions, especially
 in complex algorithms and models.
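A minimal simulation sketch of the two BigCorp sampling schemes (all numbers are assumptions for illustration) shows how they disagree: sampling 1/10th of employees preserves each sender's email volume, while sampling 1/10th of each day's emails shrinks every sender's count by roughly ten.

import random
from collections import Counter

random.seed(0)

# Synthetic population: 1,000 employees; each sends 1, 5, or 50 emails
# on random days of a 250-day year (illustrative numbers).
emails = [(emp, random.randrange(250))                 # (sender, day)
          for emp in range(1000)
          for _ in range(random.choice([1, 5, 50]))]

# Method 1: randomly select 1/10th of all employees, keep all their emails.
chosen = set(random.sample(range(1000), 100))
sample1 = [e for e in emails if e[0] in chosen]

# Method 2: sample 1/10th of the emails sent each day.
by_day = {}
for e in emails:
    by_day.setdefault(e[1], []).append(e)
sample2 = [e for msgs in by_day.values()
           for e in random.sample(msgs, max(1, len(msgs) // 10))]

def mean_emails_per_sender(sample):
    counts = Counter(sender for sender, _ in sample)
    return sum(counts.values()) / len(counts)

# Similar sample sizes, very different pictures of sender behavior.
print(len(sample1), len(sample2))
print(mean_emails_per_sender(sample1), mean_emails_per_sender(sample2))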
                                       Modeling
• We need to build models from collected data.
• The term "model" has different meanings, causing confusion in discussions.
• Example:
  •   Data models for storing data (database managers) vs.
  •   statistical models (central to this course).
• Reference to a provocative Wired magazine piece by Chris Anderson:
  •   Argues that complete information from data eliminates the need for models; correlation alone
      suffices.
  •   But the author of our book, Rachel, doesn't agree with what Anderson is saying. She thinks models
      are still very important.
• Recognizing the Media's Role:
  •   The media plays a big part in how people see data science and modeling.
  •   It's crucial to think carefully and judge opinions, especially from those who don't work directly with
      data.
• Data scientists should share their well-informed opinions in discussions about these
  topics.
                               What is a model?
• Humans understand the world by creating various representations.
• Architects use blueprints, molecular biologists use visualizations, and statisticians/data scientists use
  mathematical functions.
• Statisticians and data scientists express uncertainty and randomness in data-generating processes
  through mathematical functions.
• Models serve as lenses to understand and represent reality, whether in architecture, biology, or
  mathematics.
• A model is an artificial construction that simplifies reality by removing or abstracting irrelevant details.
• Attention must be given to these abstracted details post-analysis to ensure nothing crucial was
  overlooked.
• Imagine creating a model to predict students' final grades based on various factors.
   •   The model might consider variables like attendance, study hours, participation in class, and past academic performance.
   •   Students’ contact details can be abstracted.
                 Statistical modelling
• Draw a conceptual picture of the underlying process before diving into
  data and coding.
• Identify the sequence of events, factors influencing each other, and
  causation relationships.
• Consider questions like what comes first, what influences what, and
  what causes what.
• Formulate testable hypotheses based on your initial understanding of
  the process.
• Different Thinking Styles:
  • Recognize that people have different preferences in expressing relationships.
  • Some individuals lean towards mathematical expressions, while others prefer
     visual representations.
                         Statistical modeling
• Mathematical Expression:
  •   If inclined towards math, use expressions with Greek letters for parameters and Latin letters
      for data.
  •   Example: For a potential linear relationship between columns x and y, express it as
      •   y = β0 + β1x
      •   Where β0 and β1 are parameters.
  •   Acknowledge that the actual numerical values of parameters are unknown initially.
  •   Parameters represent the coefficients or constants in mathematical models.
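As a sketch of what y = β0 + β1x describes as a data-generating process, data could be simulated from it like this (the noise term and all numeric values are assumptions for illustration):

import numpy as np

# Assumed "true" parameter values; in practice β0 and β1 are unknown
# and must be estimated from data.
beta0, beta1 = 2.0, 0.5

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)                     # observed column x
y = beta0 + beta1 * x + rng.normal(0, 1, size=100)   # y = β0 + β1·x plus noise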
• Visual Representation:
  •   Alternatively, some individuals may start with visual representations, like diagrams illustrating
      data flow.
  •   Use arrows to depict how variables impact each other or represent temporal changes.
  •   A visual representation helps in forming an abstract picture of relationships before translating
      them into mathematical equations.
    But how do you build a model?
• Functional Form of Data: Art and Science
  • Determining the functional form involves both art and science.
  • Lack of guidance in textbooks despite its critical role in modeling.
  • Making assumptions about the underlying structure of reality.
• Challenges and Lack of Global Standards
  • Lack of global standards for making assumptions.
  • Need for standards in making and explaining choices.
  • Making assumptions in a thoughtful way, despite the absence of clear
    guidelines.
• Where to Start in Modeling: Not Obvious
  • Starting point not obvious, similar to the meaning of life.
     But how do you build a model?
• Exploratory Data Analysis (EDA)
  • Introduction to EDA as a starting point.
  • Making plots and building intuition for the dataset.
  • Emphasizing the importance of trial and error and iteration.
• Mystery of Modeling Until Practice
  • Modeling appears mysterious until practiced extensively.
  • Starting with simple approaches and gradually increasing complexity.
  • Using exploratory techniques like histograms and scatterplots.
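For instance, a minimal EDA sketch using the histograms and scatterplots just mentioned (the data here is synthetic, purely for illustration):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 500)            # one illustrative numeric column
y = 3 * x + rng.normal(0, 20, 500)     # a second, related column

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=30)                   # histogram: shape of one variable
ax1.set_title("Histogram of x")
ax2.scatter(x, y, s=5)                 # scatterplot: relationship between two
ax2.set_title("x vs y")
plt.tight_layout()
plt.show()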
• Writing Down Assumptions and Starting Simple
  • Advantages of writing down assumptions.
  • Starting with the simplest models and building complexity.
  • Encouraging the use of full-blown sentences to express assumptions.
    But how do you build a model?
• Trade-off Between Simple and Accurate Models
  • Acknowledging the trade-off between simplicity and accuracy.
  • Simple models may be easier to interpret and understand.
  • Highlighting that simple models can often achieve a significant level of
    accuracy.
• Building a range of Potential Models
  • Introducing probability distributions as fundamental components.
  • Stressing the significance of comprehending and applying probability
    distributions.
           Probability Distribution
• Probability distributions are fundamental in statistical models.
• Back in the day, before computers, scientists observed real-
 world phenomenon, took measurements, and noticed that
 certain mathematical shapes kept reappearing.
• The classical example is the height of humans, following a
 normal distribution—a bell-shaped curve, also called a
 Gaussian distribution, named after Gauss.
Probability Distribution
[Figure: shapes of common named probability distributions]
          Probability Distribution
• Not all processes generate data that looks like a named
 distribution, but many do. We can use these functions as
 building blocks of our models.
• It’s beyond the scope of the book / syllabus to go into each of
 the distributions in detail, but we look at the figure as an
 illustration of the various common shapes.
• Note that they only have names because someone observed
 them enough times to think they deserved names.
• There is actually an infinite number of possible distributions.
Probability Distribution
• Each distribution has a corresponding function.
• For example, the normal (Gaussian) distribution is written as:
     p(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²))
• The parameter μ is the mean and median and controls where the
  distribution is centered (because this is a symmetric distribution).
• The parameter σ controls how spread out the distribution is.
• This is the general functional form, but for specific real-world
  phenomena, these parameters have actual numbers as values, which
  we can estimate from the data.
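A small sketch of this idea: code the normal PDF directly and estimate μ and σ from a sample of heights (the height values are made up for illustration).

import numpy as np

def normal_pdf(x, mu, sigma):
    # p(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Estimate the parameters from data (heights in cm; values are illustrative).
heights = np.array([162.0, 171.5, 168.2, 155.9, 174.3, 166.8, 170.1])
mu_hat, sigma_hat = heights.mean(), heights.std(ddof=1)
print(mu_hat, sigma_hat, normal_pdf(170.0, mu_hat, sigma_hat))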
            Probability Distribution
• Random Variable (x or y):
  • A random variable is a variable whose possible values are outcomes of a random
     phenomenon. It's represented by symbols like x or y.
• Probability Distribution (p(x)):
  • A probability distribution is a mathematical function that provides the probabilities
     of occurrence of different possible outcomes in an experiment.
  • For a continuous random variable, the probability distribution is described by a
     probability density function (PDF), denoted as p(x), which assigns probabilities to
     intervals rather than individual values.
• Probability Density Function (PDF):
  • A probability density function maps the values of a random variable to non-
     negative real numbers. It's denoted as p(x). For a PDF to be valid, its integral over
     its entire domain must equal 1. This ensures that the total probability across all
     possible outcomes is 1.
            Probability Distribution
• Example (Time until the next bus):
  • In this example, the random variable x represents the amount of time until the
    next bus arrives, measured in minutes. Since the arrival time can vary and is
    uncertain, it is a random variable.
• Given PDF (p(x)):
  • The probability density function (PDF) for the time until the next bus arrives is
    given as
       p(x) = 2e^(−2x)
• Calculating Probability:
  • If we want to find the probability that the next bus arrives between 12 and 13
    minutes, we need to find the area under the curve of the PDF between x = 12
    and x = 13.
  • This is done by integrating the PDF from 12 to 13:
       P(12 ≤ x ≤ 13) = ∫₁₂¹³ 2e^(−2x) dx = e^(−24) − e^(−26)
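The same area can be checked numerically; a sketch using scipy's quadrature (the closed form follows from the antiderivative −e^(−2x)):

import numpy as np
from scipy.integrate import quad

pdf = lambda x: 2 * np.exp(-2 * x)

# P(12 <= x <= 13): area under the PDF between 12 and 13.
prob, _ = quad(pdf, 12, 13)
print(prob)                       # numerical integration
print(np.exp(-24) - np.exp(-26))  # closed form, same (tiny) value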
            Probability Distribution and
            Probability Density Function
• Probability distribution function and probability density function are functions defined
  over the sample space, to assign the relevant probability value to each element.
• Probability distribution functions are defined for the discrete random variables while
  probability density functions are defined for the continuous random variables.
• Distribution of probability values (i.e. probability distributions) are best portrayed by
  the probability density function and the probability distribution function.
• The probability distribution function can be represented as values in a table, but that
  is not possible for the probability density function because the variable is continuous.
• When plotted, the probability distribution function gives a bar plot while the
  probability density function gives a curve.
• The height/length of the bars of the probability distribution function must add to 1
  while the area under the curve of the probability density function must add to 1.
• In both cases, all the values of the function must be non-negative.
               Probability Distribution
• Choosing the Right Distribution:
  •   One way to determine the appropriate probability distribution for a random variable is by
      conducting experiments and collecting data. By analyzing the data and plotting it, we can
      approximate the probability distribution function (PDF).
  •   Alternatively, if we have prior knowledge or experience with a real-world phenomenon,
      we might use a known distribution that fits that phenomenon. For instance, waiting times
      often follow an exponential distribution, which has the form p(x) = λe^(-λx), where λ is a
      parameter.
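A sketch of the first approach for the exponential case (the waiting-time values are made up): for p(x) = λe^(−λx), the maximum-likelihood estimate of λ is one over the sample mean.

import numpy as np

# Observed waiting times, e.g. minutes between arrivals (illustrative).
waits = np.array([0.8, 2.1, 0.3, 1.7, 0.9, 3.2, 0.5, 1.1])

# MLE for the exponential rate parameter: λ̂ = 1 / mean(x).
lam_hat = 1 / waits.mean()
print(lam_hat)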
• Joint Distributions:
  •   In scenarios involving multiple random variables, we use joint distributions, denoted as
      p(x, y). These distributions assign probabilities to combinations of values of the variables.
  •   The joint distribution function is defined over a plane, where each point corresponds to a
      pair of values for the variables.
  •   Similar to single-variable distributions, the integral of the joint distribution over the entire
      plane must equal 1 to represent probabilities.
               Probability Distribution
• Conditional Distributions:
  • Conditional distributions, denoted as p(x|y), represent the distribution of one variable
     given a specific value of another variable.
  • In practical terms, conditioning corresponds to subsetting or filtering the data based
     on certain criteria.
  • For example, in user-level data for Amazon.com, we might want to analyze the
     amount of money spent by users given their gender or other characteristics. This
     analysis involves conditional distributions.
  • If we consider X to be the random variable that represents the amount of money
     spent, then we can look at the distribution of money spent across all users, and
     represent it as p(X)
  • We can then take the subset of users who looked at more than five items before
     buying anything, and look at the distribution of money spent among these users.
  • Let Y be the random variable that represents number of items looked at,
  • then p (X | Y > 5) would be the corresponding conditional distribution.
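In code, conditioning really is just subsetting; a pandas sketch with made-up user-level columns:

import pandas as pd

# Illustrative user-level data; column names are assumptions.
users = pd.DataFrame({
    "items_viewed": [2, 7, 1, 9, 6, 3, 12],
    "amount_spent": [0.0, 35.5, 0.0, 120.0, 15.0, 9.99, 80.0],
})

# p(X): distribution of money spent across all users.
print(users["amount_spent"].describe())

# p(X | Y > 5): subset to users who viewed more than five items.
print(users.loc[users["items_viewed"] > 5, "amount_spent"].describe())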
                   Joint Probability
• Joint probability is the probability of two (or more) events
  happening simultaneously. It is denoted as P(A∩B) for two events A
  and B, which reads as the probability of both A and B occurring.
• For two events A and B, the joint probability is defined as:
• P(A∩B)=P(both A and B occur)
• Note: If A and B are dependent, the joint probability is calculated
  using conditional probability
• Examples of Joint Probability - Rolling Two Dice
• Let A be the event that the first die shows a 3.
• Let B be the event that the second die shows a 5.
                  Joint Probability
• The joint probability P(A∩B) is the probability that the first die
 shows a 3 and the second die shows a 5. Since the outcomes
 are independent,
• P(A∩B) = P(A) ⋅ P(B).
• Given: P(A) = 1/6 and P(B) = 1/6, so
• ⇒ P(A∩B) = 1/6 × 1/6 = 1/36.
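This result can be checked by exact enumeration; a small sketch:

from fractions import Fraction

# All 36 equally likely outcomes of rolling two dice.
outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
favorable = sum(1 for d1, d2 in outcomes if d1 == 3 and d2 == 5)
print(Fraction(favorable, len(outcomes)))   # 1/36, matching P(A)·P(B)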
              Conditional Probability
• Conditional probability is the probability of an event occurring given that another
  event has already occurred. It provides a way to update our predictions or beliefs
  about the occurrence of an event based on new information.
• The conditional probability of event A given event B is denoted as P(A ∣B) and is
  defined by the formula:
• P(A∣B)=P(A∩B) / P(B)
Where:
• P(A∩B) is the joint probability of both events A and B occurring.
• P(B) is the probability of event B occurring.
           Conditional Probability
Examples of Conditional Probability
• Suppose we have a deck of 52 cards, and we want to find the
 probability of drawing an Ace given that we have drawn a red card.
• Let A be the event of drawing an Ace.
• Let B be the event of drawing a red card.
• There are 2 red Aces in a deck (Ace of hearts and Ace of diamonds)
 and 26 red cards in total.
           P(A∣B) = P(A∩B) / P(B) = (2/52) / (26/52) = 2/26 = 1/13
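The same answer falls out of enumerating the deck; a small sketch:

from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]   # hearts/diamonds are red
deck = list(product(ranks, suits))

red = [c for c in deck if c[1] in ("hearts", "diamonds")]   # event B
red_aces = [c for c in red if c[0] == "A"]                  # event A ∩ B

# P(A|B) = P(A∩B) / P(B), which reduces to |A∩B| / |B| here.
print(Fraction(len(red_aces), len(red)))                    # 1/13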
| Aspect              | Joint Probability                                                        | Conditional Probability                                                                     |
|---------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| Definition          | The probability of two or more events occurring together.                | The probability of an event given that another event has occurred.                           |
| Notation            | P(A∩B) or P(A, B)                                                        | P(A∣B) or P(B∣A)                                                                             |
| Formula             | P(A∩B)                                                                   | P(A∣B) = P(A∩B) / P(B)                                                                       |
| Example             | Probability of rolling a 2 and flipping heads: P(2 ∩ Heads)              | Probability of rolling a 2 given that the coin flip is heads: P(2 ∣ Heads)                    |
| Calculation Context | Calculated from a joint probability distribution.                        | Calculated using the joint probability and the marginal probability of the given condition.  |
| Dependencies        | Involves multiple events happening simultaneously.                       | Depends on the occurrence of another event.                                                  |
| Use Case            | Used to find the likelihood of combined events in probabilistic models.  | Used to update the probability of an event based on new information.                         |
                   Fitting a model
• Fitting a model involves estimating the parameters of the model
 using observed data. This process helps approximate the real-
 world mathematical process that generated the data.
• It often requires optimization methods like maximum likelihood
 estimation to find the best parameters that fit the data.
• The parameters estimated from the data are themselves
 functions of the data and are called estimators.
• Once the model is fitted, you can express it in a mathematical
 form, such as y = 7.2 + 4.5x, which represents the relationship
 between variables based on the assumption that the data follows
 a specific pattern.
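A minimal fitting sketch, assuming data generated by a linear form like the one above (here 7.2 and 4.5 are treated as unknown "true" values the estimators should recover):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 7.2 + 4.5 * x + rng.normal(0, 2, 200)   # assumed data-generating process

# Least-squares fit; the returned coefficients are the estimators.
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
print(beta0_hat, beta1_hat)                 # close to 7.2 and 4.5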
                 Fitting a model
• Coding the Model:
  • Coding the model involves reading in the data and specifying
   the mathematical form of the model.
  • Programming languages like R or Python use built-in
   optimization methods to find the most likely values of the
   parameters given the data.
  • While you should understand that optimization is happening,
   you typically don't need to code this part yourself, as it's
   handled by the programming language's functions.
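To see what such built-in optimization is doing, here is a sketch that fits the same kind of model by explicitly minimizing a negative log-likelihood (Gaussian noise is an assumption; in practice the language's fitting functions handle this for you):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 7.2 + 4.5 * x + rng.normal(0, 2, 200)

def neg_log_likelihood(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)               # keeps sigma positive
    resid = y - (b0 + b1 * x)
    # Gaussian negative log-likelihood, up to an additive constant.
    return np.sum(0.5 * (resid / sigma) ** 2 + np.log(sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
print(result.x[:2])                          # MLEs of β0 and β1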
                 Fitting a model
• Overfitting:
  • Overfitting occurs when a model is too complex and fits the
   training data too closely, capturing noise         or   random
   fluctuations rather than the underlying pattern.
  • When you apply an overfitted model to new data (not used for
   training) for prediction, it performs poorly because it hasn't
   generalized well beyond the training data.
  • Overfitting can be detected by evaluating the model's
   performance on unseen data using metrics like accuracy.
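A compact sketch of detecting overfitting on held-out data (all data synthetic): as polynomial degree grows, training error keeps shrinking while error on unseen test data eventually worsens.

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 30)   # noisy pattern
x_test = rng.uniform(-1, 1, 30)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 30)     # unseen data

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # High-degree fits drive training error down but test error up.
    print(degree, round(train_mse, 4), round(test_mse, 4))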