
Chapter 4: Social Media and Text Analytics

1. List applications of NLP.


-> - Google Search
- Chatbots like ChatGPT or Siri
- Voice assistants (Alexa, Google Assistant)
- Language translation (Google Translate)

2. What is Web Scraping?


->Web scraping means collecting information from websites.
Example: Getting product names and prices from Amazon automatically.
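A minimal sketch of what web scraping looks like in Python, assuming the requests and beautifulsoup4 packages are installed; the URL and the "price" CSS class below are hypothetical placeholders, not a real site's layout:

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse its HTML (hypothetical URL).
response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

# Print the text of every element with a (hypothetical) "price" class.
for tag in soup.find_all(class_="price"):
    print(tag.get_text(strip=True))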

3. What is the purpose of an n-gram?


-> - An n-gram is a small group of words taken from a sentence.
- Purpose: It helps the computer understand how words appear together.
Example:
Sentence: "I love milk"
2-gram: "I love", "love milk"
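A tiny sketch in plain Python of how n-grams can be generated (the function name is just for illustration):

def ngrams(sentence, n):
    # Slide a window of n words across the sentence.
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I love milk", 2))  # ['I love', 'love milk']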

4. What is Social Media Analytics?


->Social Media Analytics means checking what people are saying and doing on social media
like Facebook, Instagram, or Twitter.

5. What is Tokenization? (Define)
1. Tokenization means breaking a sentence into smaller parts called tokens.
2. Tokens can be words, sentences, or characters.
3. It helps computers understand the text better.
4. Example: "I like milk" → Tokens: "I", "like", "milk"
5. It is the first step in Natural Language Processing (NLP).
Types of Tokenization:
1. Sentence Tokenization 2. Word Tokenization
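A minimal sketch of sentence and word tokenization, assuming the NLTK library is installed and its "punkt" tokenizer data has been downloaded:

import nltk
nltk.download("punkt", quiet=True)  # one-time download of tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I like milk. I like tea."
print(sent_tokenize(text))           # sentence tokens: ['I like milk.', 'I like tea.']
print(word_tokenize("I like milk"))  # word tokens: ['I', 'like', 'milk']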

6. Define Text Analytics. (What is?)


-> -Text Analytics means analyzing text data to find useful information.
- It helps to understand what people are saying in emails, reviews, messages, etc.
- It finds patterns, trends, and important keywords in the text.
- It can tell if the text is positive or negative (sentiment analysis).
- It helps companies make better decisions using customer feedback.
- It is used in areas like marketing, customer support, and research.

7. What are Stop Words? (Define)


-> 1. Stop words are common words like "a", "an", "the", "is", "in", "on", etc.
2. These words don’t add much meaning to a sentence.
3. In text processing, we usually remove stop words to focus on important words.
4. Example:
Sentence: "I am going to the market"
After removing stop words: "going market"
5. Removing stop words makes text analysis faster and cleaner.

8. What is Bag of Words?


-> 1. Bag of Words is a method to convert text into numbers so that computers can
understand it.
2. It counts how many times each word appears in a sentence or document.
3. It ignores grammar and word order, only cares about the frequency of words.
4. Example:
Sentences:
 "I like milk"
 "I like tea"
BoW:
 I: 2, like: 2, milk: 1, tea: 1
5. It is used in text classification, like spam detection or sentiment analysis.
6. It's a basic and important technique in Natural Language Processing (NLP).
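A short sketch of Bag of Words using scikit-learn's CountVectorizer (assuming scikit-learn is installed; the token pattern is loosened so one-letter words like "I" are kept):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I like milk", "I like tea"]
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")  # keep 1-letter tokens too
matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['i' 'like' 'milk' 'tea']
print(matrix.toarray())                    # word counts per sentence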
POS Tagging (Part-of-Speech Tagging):
1. POS Tagging means finding the role of each word in a sentence.
2. It tells whether a word is a noun, verb, adjective, etc.
3. Example:
Sentence: "The cat eats fish"
POS Tags:
o The (Determiner)
o cat (Noun)
o eats (Verb)
o fish (Noun)
4. Helps computers understand the meaning and structure of sentences.
5. Used in grammar checking, question answering, and more.
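A minimal POS tagging sketch with NLTK, assuming the tokenizer and tagger data have been downloaded:

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The cat eats fish")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('eats', 'VBZ'), ('fish', 'NN')]
# DT = determiner, NN = noun, VBZ = verb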

9. Explain Stemming and Lemmatization.

-> 1.Stemming:
1. Stemming means cutting the word to its root form.
2. It removes suffixes like "ing", "ed", "s", etc.
3. It may not always give a real word.
4. Example:
o "playing" → "play"
o "played" → "play"
o "studies" → "studi" (not a real word)
2. Lemmatization:
1. Lemmatization gives the actual base form of the word.
2. It uses dictionary meaning and grammar rules.
3. Always gives a real word.
4. Example:
o "playing" → "play"
o "better" → "good"
o "studies" → "study"

3. Difference between Stemming and Lemmatization:


Feature | Stemming | Lemmatization
Meaning | Cuts off prefixes/suffixes | Converts word to dictionary root form
Output | May not be a real word | Always a real word
Example | "studies" → "studi" | "studies" → "study"
Accuracy | Less accurate | More accurate
Speed | Faster | Slower
Grammar rules used | No | Yes
Used for | Quick and basic NLP tasks | Deep linguistic analysis

10. What is TF-IDF?
-> TF-IDF is a method to find important words in a document or text.
1. TF (Term Frequency):
 Tells how often a word appears in a document.
 The more the word appears, the higher the score.
2. IDF (Inverse Document Frequency):
 Tells how rare the word is across all documents.
 Common words like "the", "is", "a" get low scores.
3. The final score is TF × IDF, so a word scores highest when it is frequent in one document but rare in the rest.
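A brief sketch of TF-IDF with scikit-learn (assuming scikit-learn is installed); the two toy documents are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat drinks milk", "the dog eats fish"]
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(docs)

# The shared word "the" gets a lower weight than document-specific words.
print(vectorizer.get_feature_names_out())
print(scores.toarray().round(2))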

11. Short Note on Community Detection.

1. Community Detection is used to find groups (communities) in a network or graph.
2. A community is a group of nodes that are more connected to each other than to the rest.
3. It helps in understanding social networks, like finding friend groups on Facebook.
4. Used in areas like marketing, biology, and recommendation systems.
5. Example: Finding people who talk more to each other in a large group chat.
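A minimal sketch using NetworkX's greedy modularity algorithm (assuming the networkx package is installed); the toy graph models two friend groups joined by one weak link:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),  # friend group 1
                  ("D", "E"), ("E", "F"), ("D", "F"),  # friend group 2
                  ("C", "D")])                         # weak bridge between groups

for community in greedy_modularity_communities(G):
    print(sorted(community))  # e.g. ['A', 'B', 'C'] and ['D', 'E', 'F']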
12. Describe the life cycle of Social Media Analytics with diagram.
-> 1. Data Identification
 Identify the platforms (Facebook, Twitter, Instagram, etc.)
 Choose what kind of data you want (posts, likes, comments, etc.)
2. Data Collection
 Use tools/APIs to gather data from social media platforms
 Example: Twitter API, web scraping
3. Data Cleaning & Preprocessing
 Remove noise, duplicates, stop words, etc.
 Convert data into a structured format
4. Data Analysis
 Use NLP, sentiment analysis, trends, graphs
 Understand what people are saying
13. Explain the Phases in Natural Language Processing.
-> 1. Lexical Analysis (Tokenization)
 Breaks a sentence into words or tokens
 Example: "I love milk" → "I", "love", "milk"
 Removes punctuation and splits text
2. Syntactic Analysis (Parsing)
 Checks the grammar of the sentence
 Finds structure using parts of speech (noun, verb, etc.)
 Example: "The cat run fast" (the grammar mistake is detected here)
3. Semantic Analysis
 Understands the meaning of the sentence
 Example: "The sun is sleeping" → Doesn't make sense
 Helps in finding the correct sense of words
14. Explain the process of Text Preprocessing.
-> 1. Lowercasing
 Change all words to small letters
 Example: "Milk" → "milk"
2. Remove Punctuation
 Remove symbols like . , ! ?
 Example: "Hello!" → "Hello"
3. Remove Stop Words
 Remove common words like "the", "is", "on"
 Example: "I am happy" → "happy"
4. Tokenization
 Split the sentence into individual words
 Example: "I like tea" → ["I", "like", "tea"]
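A simple sketch of these four steps in plain Python; the stop-word set here is a small hand-picked sample, not a full list:

import string

STOP_WORDS = {"i", "am", "is", "the", "on", "to", "a", "an"}

def preprocess(text):
    text = text.lower()                                               # 1. lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    tokens = text.split()                                             # 4. tokenization
    return [w for w in tokens if w not in STOP_WORDS]                 # 3. remove stop words

print(preprocess("I am happy!"))  # ['happy']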

15. Challenges to Social Media Analytics.

-> 1. Unstructured Data
 Data is in different forms (text, image, video), hard to analyze.
2. Privacy Issues
 Cannot collect or analyze personal data without permission.
3. Volume (Too Much Data)
 Millions of posts, comments, likes, and videos are shared daily.
4. Velocity (Speed of Data)
 Data comes in real time – every second, new content is created.
 It's hard to analyze fast-moving data quickly.
16. Influence Maximization:
1. It means finding the most powerful people in a social network.
2. These people can influence others to spread ideas or products.
3. Used in marketing, like promoting a product through top influencers.
4. Goal: Reach more people with the least effort.

17. Trend Analytics:

1. Trend Analytics means finding patterns or popular topics over time.
2. It helps to know what people are talking about the most.
3. Used in social media, news, business, and marketing.

* What is Clustering?
Clustering is a process of grouping similar data items together.
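A tiny k-means sketch (assuming scikit-learn is installed) that groups four toy points into two clusters:

from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [8, 8], [8.5, 9]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 1 1] - the two nearby pairs form two groups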
Chapter 3: Mining Frequent Patterns, Associations, and Correlations

* What is Data Characterization?

-> Data Characterization means giving a summary or overview of the main features of a dataset.

1. What are Frequent Itemsets?


-> -Frequent itemsets are groups of items that appear together often in a dataset.
-Used in market basket analysis (like what people buy together).
- Found using algorithms like Apriori and FP-Growth.

2. What is an Outlier?
-> An outlier is a value in the data that is very different from the other values and doesn't
follow the normal pattern.

3. Write the formulas for Support and Confidence.

-> Support(X) = (Number of transactions containing X) / (Total number of transactions) = Freq(X) / N

Confidence(X => Y) = (Number of transactions containing both X and Y) / (Number of transactions containing X)
= Freq(X, Y) / Freq(X) = Support(X ∪ Y) / Support(X)
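A worked sketch of both formulas on a toy transaction list in plain Python:

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
]
N = len(transactions)

freq_milk = sum(1 for t in transactions if "Milk" in t)                   # 3
freq_milk_bread = sum(1 for t in transactions if {"Milk", "Bread"} <= t)  # 3

print("Support(Milk -> Bread)    =", freq_milk_bread / N)          # 3/4 = 0.75
print("Confidence(Milk -> Bread) =", freq_milk_bread / freq_milk)  # 3/3 = 1.0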
4. Define Classification (refer Chapter 2 😊 for diagram)
-> Classification is a process in which a computer learns to put things into groups or
categories based on data.

5. What are the advantages of the FP-Growth algorithm?


-> 1. Faster than Apriori

 It scans the data only two times, so it's very fast.

2. No Candidate Generation

 It does not create all possible item combinations, saving time and memory.

3. Uses FP-Tree (Compact Structure)

 Stores data in a tree form, which takes less space.

4. Good for Big Data

 Works well with large datasets and many transactions.

6. What are frequent itemsets & association rules? Describe with an example.

1.What are Frequent Itemsets?

 Frequent itemsets are groups of items that appear together often in transactions.

✅ Example:

Suppose customers buy these items:

Transaction ID | Items Bought
T1 | Milk, Bread
T2 | Milk, Bread, Butter
T3 | Milk, Bread
T4 | Bread, Butter

Here, {Milk, Bread} appears 3 times, so it's a frequent itemset.

2.What are Association Rules?

 Association rules show how items are related to each other.

 An association rule is a rule that shows a relationship between items in a dataset.

 It is written as:
If item A is bought, then item B is likely to be bought too.
(A → B)
✅ Example:

From the above data:


Milk → Bread
Means: If a customer buys Milk, they are likely to buy Bread too.

7. Define Support and Confidence in association rule mining.


-> 1. Support
 Support tells how frequently an item or itemset appears in the whole dataset.
 It shows how popular an itemset is.
✅ Example:
If {Milk, Bread} appears in 3 out of 5 transactions:
Support = 3/5 = 60%.
2. Confidence
 Confidence tells how often item B is bought when item A is bought.
 It shows the strength of the rule A → B.
✅ Example:
If 4 people bought Milk, and out of them 3 also bought Bread:
Confidence (Milk → Bread) = 3/4 = 75%

8. Explain the Knowledge Discovery in Databases (KDD) process.

-> KDD is the process of finding useful and hidden knowledge (patterns or information) from large datasets.
1. ✅ Data Selection
 Choose the relevant data from the database.
2. 🧹 Data Preprocessing (Cleaning)
 Remove errors, missing values, or duplicate data.
3. 🔧 Data Transformation
 Convert the data into a proper format for mining (e.g., normalization).
4. 💡 Data Mining
 Apply techniques (like classification, clustering, association) to find patterns.
9. Explain the Apriori algorithm.
-> -Apriori is used to find frequent itemsets in large datasets.
-It works based on the rule: "If an itemset is frequent, all of its subsets are also
frequent."
- It uses support to check how often items appear together.
- It removes itemsets that don’t meet the minimum support threshold.
- It generates association rules to show relationships between items.
- It is commonly used in market basket analysis and recommendation systems.
Small Example:

Transaction | Items
T1 | Milk, Bread
T2 | Milk, Bread, Butter
T3 | Bread, Butter

 Frequent itemset = {Milk, Bread}

 Rule = Milk → Bread
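A simplified sketch of the counting idea behind Apriori in plain Python (only the pair-counting step, not the full level-wise algorithm):

from itertools import combinations
from collections import Counter

transactions = [{"Milk", "Bread"}, {"Milk", "Bread", "Butter"}, {"Bread", "Butter"}]
min_support = 2 / 3  # an itemset must appear in at least 2 of the 3 transactions

# Count every 2-item combination that occurs in a transaction.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Keep only the pairs that meet the minimum support threshold.
for pair, count in pair_counts.items():
    if count / len(transactions) >= min_support:
        print(pair, "support =", round(count / len(transactions), 2))
# ('Bread', 'Milk') and ('Bread', 'Butter') both pass with support 0.67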

10. Explain the FP-Growth algorithm.


-> FP-Growth (Frequent Pattern Growth) is a fast algorithm to find frequent itemsets
without generating all combinations.
1. Improves speed over Apriori algorithm.
2. Uses a special data structure called FP-Tree (Frequent Pattern Tree).
3. Scans the database only twice – very efficient.
4. Avoids candidate generation, which saves memory.
5. Good for large datasets like retail or transaction data.
🛒 Example:
 T1: Milk, Bread
 T2: Milk, Bread, Butter
 T3: Bread, Butter
 T4: Milk, Butter
FP-Growth finds the frequent itemsets (each appearing in 2 of the 4 transactions):
 {Milk, Bread}
 {Bread, Butter}
 {Milk, Butter}
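A short sketch using the mlxtend library's FP-Growth implementation on the same four transactions (assuming mlxtend and pandas are installed):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["Milk", "Bread"], ["Milk", "Bread", "Butter"],
                ["Bread", "Butter"], ["Milk", "Butter"]]

# One-hot encode the transactions, then mine itemsets with support >= 0.5.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))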

11. What is Data Mining? (Define)


-> Data Mining is the process of finding useful patterns, trends, or information from large amounts of
data.

- It helps to discover hidden knowledge from data.

- Used to make decisions in business, healthcare, marketing, etc.

- Data mining is part of a bigger process called KDD (Knowledge Discovery in Database).

12. What is Market Basket Analysis?


-> - Market Basket Analysis (MBA) is a data mining technique used to find relationships between products that customers buy together.
- It helps in improving sales.

Customer | Items Bought
C1 | Bread, Butter
C2 | Bread, Butter, Milk
C3 | Bread, Milk

Bread → Butter
Means: Customers who buy Bread also often buy Butter.

13. What is Correlation Analysis?


1. Correlation Analysis checks the relationship between two variables.
2. It tells us how strongly and in which direction (positive or negative) the variables are
related.
3. Example: If ice cream sales increase when temperature increases, they have a
positive correlation.
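A tiny sketch of the ice cream example with NumPy (the numbers are made up for illustration):

import numpy as np

temperature = [20, 25, 30, 35, 40]          # degrees
ice_cream_sales = [100, 150, 210, 260, 320]

r = np.corrcoef(temperature, ice_cream_sales)[0, 1]
print(round(r, 2))  # close to +1.0: a strong positive correlation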
Chapter 2: Machine Learning Overview

1. State applications of AI.
-> 1. Healthcare 2. E-commerce 3. Banking 4. Transportation 5. Agriculture 6. Education 7. Smart Assistants

2. State types of Logistic Regression.


->1. Binary Logistic Regression
2. Multinomial Logistic Regression
3. Ordinal Logistic Regression.

3. What is Supervised Learning?


-> Supervised Learning is a type of machine learning where:
1. The model learns from labeled data (input + correct output).
2. It tries to predict the output for new data based on what it learned.
3. It’s like a teacher supervising the learning process.

4. What is Link Prediction?


-> Link Prediction is a technique used to predict future or missing connections in a network
or graph.
- Commonly used in social networks, recommendation systems, and biological networks.
- It helps platforms like Facebook or LinkedIn suggest "People you may know" (possible new connections).

5. Define the terms:

-> 1. Precision:
Precision measures how many of the predicted positive results are actually correct.
Formula: Precision = TP / (TP + FP)
2. Accuracy:
Accuracy is the percentage of correct predictions made by the model.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

6. Define Machine Learning.


->Machine Learning (ML) is a type of technology where computers learn from data and
make decisions or predictions without being clearly programmed.
- It learns from past data.
- It can predict future outcomes.

7. Write any two applications of supervised machine learning.

-> 1. Email Spam Detection 2. Loan Approval Prediction

8. Define Recall.
-> Recall tells us how many of the actual positive cases were correctly found by the model.
Formula: Recall = TP / (TP + FN)

9. Explain Underfitting and Overfitting.


->Underfitting:
 The model is too simple.
 It doesn’t learn well from the training data.
 Low accuracy on both training and test data.
 Example: Using a straight line to fit a curve-shaped data.
Overfitting:
 The model is too complex.
 It learns too much, even the noise in training data.
 High accuracy on training data, but poor on test data.
10. What are Dependent and Independent Variables?
->Independent Variable:
 The input or cause.
 It is changed or controlled in an experiment.
 Example: In predicting marks based on study hours,
Study Hours is the independent variable.
Dependent Variable:
 The output or result.
 It depends on the independent variable.
 Example: In the same case,
Marks Scored is the dependent variable.

11. What is a Confusion Matrix?

-> It compares the model’s predictions with the actual results.

 | Predicted: Yes | Predicted: No
Actual: Yes | ✅ True Positive (TP) | ❌ False Negative (FN)
Actual: No | ❌ False Positive (FP) | ✅ True Negative (TN)

12. What is Logistic Regression? Explain it with an example.


-> -Logistic Regression is a machine learning algorithm used for classification problems.
-It predicts categorical outcomes, like Yes/No, True/False
-It gives output between 0 and 1 (a probability), and based on that, it decides the class.
Example: Let’s say we want to predict if a student will pass an exam based on how many
hours they study:

Hours Studied | Passed (Yes=1 / No=0)
1 | 0
2 | 0
3 | 1
4 | 1

The logistic regression model will learn from this data and then predict:
"If a student studies 2.5 hours, what is the probability they will pass?"

13. What is Machine Learning? Explain its types.

-> What is Machine Learning?


1. Machine Learning is a part of Artificial Intelligence (AI).
2. It helps computers learn from data without being clearly programmed.
3. It is used to make predictions or decisions.
4. Examples: Spam detection, movie recommendations, face recognition.

Types of Machine Learning (refer book for diagram)


1. Supervised Learning
 Learns from labeled data (input + correct output).
 Example: Predicting house price, spam detection.
2. Unsupervised Learning
 Learns from unlabeled data (no correct output).
 Finds hidden patterns or groups.
 Example: Customer segmentation, grouping similar products.
3. Semi-Supervised Learning
 Uses a small amount of labeled data + a large amount of unlabeled data.
 Example: Image recognition with few labeled images.
4. Reinforcement Learning (may also be asked separately as a 4-mark question 😊)
 The model learns by trial and error and gets rewards or penalties.
 Example: Game playing (like Chess, Pac-Man), robots learning to walk.

14. Explain any two Machine Learning (ML) applications.


->1. Email Spam Detection
 ML is used to classify emails as Spam or Not Spam.
 The model learns from examples of spam and non-spam emails.

2. Movie Recommendation

 ML is used by apps like Netflix or YouTube to suggest movies or videos.


 It learns from what you watch, like, and search.

15. What is Prediction? Explain one regression model.


-> What is Prediction?
 Prediction means forecasting the result using data and a model.
 In ML, we use data to predict future values or outcomes.

* Linear Regression (refer book for diagram)

- Linear Regression is used to predict a value using a straight line.


- It finds the relationship between input (X) and output (Y).
- Example: Predicting a person’s weight based on their height.
Application:
1. Predicting House Prices – Based on size, location, etc.
2. Sales Forecasting – Estimate future sales using past data.
3. Weather Prediction – Predict temperature based on past trends.
4. Student Performance – Predict marks based on study hours.
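A small sketch of the height/weight example with scikit-learn (the numbers are made-up illustration data):

from sklearn.linear_model import LinearRegression

heights = [[150], [160], [170], [180]]  # cm (input X)
weights = [50, 58, 66, 74]              # kg (output Y)

model = LinearRegression().fit(heights, weights)
print(model.predict([[175]]))  # predicted weight for 175 cm: 70 kg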

16. Write a short note on Support Vector Machine. (Diagram: refer book)

1. SVM is a supervised learning algorithm used for classification and regression.
2. It finds the best line (or hyperplane) that separates different classes of data.
3. It works well in high-dimensional spaces (many features).
4. SVM uses support vectors (important data points) to build the decision boundary.
5. Example: Classifying emails as Spam or Not Spam.
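A brief sketch of SVM classification on toy 2-D data (assuming scikit-learn is installed):

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [8, 8], [9, 9]]  # two well-separated groups of points
y = [0, 0, 1, 1]                      # their class labels

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[2, 2], [7, 7]]))  # [0 1]
print(clf.support_vectors_)           # the points that define the boundary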

Chapter 1: Introduction to Data Analytics


1. Define Data Analytics.
-> Data Analytics is the process of collecting, organizing, analyzing, and interpreting data
to find useful patterns, trends, and insights that help in decision making.

2. What is AUC & ROC curve?

-> - AUC stands for Area Under the Curve.
- ROC stands for Receiver Operating Characteristic curve.

Term | Meaning
ROC | Curve showing model performance
AUC | Number showing how well the model performs

3. How is a Receiver Operating Characteristic (ROC) curve created?

-> 1. Train a Classification Model
2. Get Predicted Probabilities (like 0.8, 0.3, etc.)
3. Set Threshold Values (e.g., 0.2, 0.5, 0.7) to decide when to classify as "positive"
4. Calculate TPR and FPR for Each Threshold
 True Positive Rate (TPR) = TP / (TP + FN)
 False Positive Rate (FPR) = FP / (FP + TN)
5. Plot the Points: for each threshold, plot a point with FPR on the X-axis and TPR on the Y-axis
6. Draw the Curve by joining the points
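A compact sketch of these steps with scikit-learn (assuming scikit-learn is installed); the labels and scores are made up for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

actual = [1, 0, 1, 1, 0, 0]              # true labels
scores = [0.8, 0.3, 0.6, 0.9, 0.4, 0.2]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(actual, scores)  # steps 3-5 in one call
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(actual, scores))  # 1.0 here: perfect separation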

4. Exploratory Data Analysis (EDA)


-> 1.EDA helps us understand the data before building a model.

2. It includes summarizing data using charts, graphs, and statistics.

3. We find patterns, trends, and relationships in data.

4. EDA helps us detect missing values or outliers.

5. It shows us the shape and spread of the data (like mean, median).

6. Common tools: histograms, box plots, scatter plots, correlation matrix.
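A quick EDA sketch with pandas (assuming pandas is installed); the toy DataFrame is made up and includes one obvious outlier:

import pandas as pd

df = pd.DataFrame({"age": [23, 25, 31, 35, 90],      # 90 looks like an outlier
                   "salary": [30, 32, 40, 45, 48]})  # in thousands

print(df.describe())       # count, mean, spread, quartiles of each column
print(df.isna().sum())     # missing values per column
print(df.corr().round(2))  # correlation matrix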


5. Describe the types of data analytics.

-> 1. Descriptive Analytics

 What happened?

- Descriptive Analytics explains what has already happened using data.
- It uses charts, reports, and summaries to show trends and patterns.
- Example: "Sales increased by 10% last month" – this is descriptive analytics.

2. Diagnostic Analytics

❓ Why did it happen?

 It finds reasons or causes behind the data.

 Example: Sales dropped due to fewer ads.

3. Predictive Analytics

🔮 What will happen?

 It uses data to predict future events.

 Example: Predict next month's sales using past trends.

4. Prescriptive Analytics

🧠 What should we do?

 It suggests actions or solutions based on data.

 Example: Increase ads in specific areas to boost sales.

6. Explain the Life Cycle of Data Analytics.

-> 1. Data Collection

 Gather raw data from various sources.

 Sources: websites, surveys, sensors, social media, etc.

 Data is usually unstructured or unclean.

2. Data Cleaning

 Remove duplicates, missing or incorrect values.

 Make data accurate and consistent.

 Prepares data for analysis.


3. Data Analysis / Modeling

 Apply statistical methods or ML models.

 Discover hidden patterns or make predictions.

 Use algorithms to get results.

4. Interpretation

 Understand what the analysis means.

 Convert numbers into useful insights.

 Supports business understanding.

5. Visualization

 Show data using charts, graphs, dashboards.

 Makes complex data easier to understand.

 Useful for reporting and presentations.
