True and false from previous exams:
1. Mining techniques need not be scalable.
Answer: False – Mining techniques must be scalable to handle large datasets efficiently.
2. Most clustering techniques are not incremental.
Answer: True – Most clustering techniques process data in batches rather than incrementally.
3. In classification, rules are easier to understand than large trees.
Answer: True – Rules are more interpretable than large decision trees.
4. Clustering is not important for city-planning decisions.
Answer: False – Clustering is crucial for city planning, zoning, and infrastructure allocation.
5. A good clustering method produces clusters in which data objects are cohesive.
Answer: True – Clustering should ensure high intra-cluster similarity.
6. The ability to deal with different types of attributes is one of the requirements and
challenges of clustering techniques.
Answer: True – Clustering techniques must handle numerical and categorical attributes
effectively.
7. A data warehouse is subject-oriented because it is organized around major subjects
such as the operators and middle managers.
Answer: False – A data warehouse is subject-oriented but focuses on business subjects, not
specific users.
8. OLTP users are measured in hundreds while OLAP users are measured in thousands.
Answer: False – OLTP users are usually in thousands, whereas OLAP users are fewer (hundreds).
9. A virtual warehouse is a set of views over operational databases.
Answer: True – A virtual warehouse provides a logical view of operational data.
10. A data mart is a subset of corporate-wide data that is of value to a specific group of users.
Answer: True – A data mart is designed for a specific department or business function.
11. If you have databases, you generally do not need data warehouses.
Answer: False – Databases store operational data, while data warehouses store historical,
analytical data for decision-making.
12. ETL is short for Extraction, Transformation and Loading and it also incorporates cleaning
and refresh steps.
Answer: True – ETL includes data extraction, transformation, loading, and often cleaning and
refreshing.
13. Metadata stores include, among other things, a description of the structure of the Data
Warehouse and the algorithms used for summarization.
Answer: True – Metadata contains details about the structure and summarization techniques of
the data warehouse.
14. In data warehousing, an n-D base cube is called a base Cuboid.
Answer: True – A base cuboid represents the lowest-level data in an n-dimensional cube.
15. In data warehousing, the topmost 0-D cube, which holds the highest level of
summarization, is called the apex cuboid.
Answer: True – The apex cuboid contains the most summarized form of data.
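For intuition, the number of cuboids between the base cuboid and the apex cuboid follows the standard textbook product formula T = Π(Li + 1), where Li is the number of hierarchy levels of dimension i. A minimal Python sketch (the dimension names and level counts are illustrative):

    from math import prod

    def total_cuboids(levels):
        # levels[i] = number of hierarchy levels of dimension i;
        # the '+ 1' accounts for the virtual 'all' level of each dimension.
        return prod(l + 1 for l in levels)

    # A 3-D cube (time, item, location) with 4, 4, and 3 hierarchy levels:
    print(total_cuboids([4, 4, 3]))  # 100 cuboids, from base cuboid to apex cuboid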
16. A snowflake schema is composed of multiple fact tables that share dimension tables.
Answer: False – A snowflake schema normalizes dimension tables, while a galaxy schema has
multiple fact tables sharing dimensions.
17. The star schema is the most popular data warehouse schema.
Answer: True – The star schema is widely used because of its simplicity and efficiency.
18. FP-growth scans the data set twice.
Answer: True – FP-Growth scans the dataset twice: once to build the FP-tree and once to
extract frequent patterns.
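To see the two algorithms side by side, here is a minimal sketch using the third-party mlxtend package (an assumption; it is not part of the course material). Both calls return the same frequent itemsets; FP-Growth simply gets there with only two passes over the data instead of one pass per candidate level:

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, fpgrowth

    transactions = [['milk', 'bread'], ['bread', 'eggs'],
                    ['milk', 'bread', 'eggs'], ['milk', 'eggs']]

    # One-hot encode the transactions into a boolean DataFrame.
    te = TransactionEncoder()
    df = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

    print(apriori(df, min_support=0.5, use_colnames=True))
    print(fpgrowth(df, min_support=0.5, use_colnames=True))  # same itemsets, fewer scans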
19. All frequent patterns can be considered closed patterns.
Answer: False – Only frequent patterns with no superset having the same frequency are
considered closed patterns.
20. All maximal patterns are frequent patterns.
Answer: True – Maximal patterns are frequent but have no frequent supersets.
21. Data mining is used to extract knowledge, so it can be considered a type of expert system.
Answer: False – Data mining finds patterns; expert systems use predefined rules for
decision-making.
22. Data mining is used to find unknown patterns.
Answer: True – The goal of data mining is to uncover previously unknown patterns in data.
23. In most cases, data preprocessing is done before data integration.
Answer: True – Preprocessing (cleaning, normalization) is usually done before integration
(combining datasets).
24. The result of data mining from data warehouses is better than transactional databases.
Answer: True – Data warehouses store historical, structured data, making mining results more
accurate than transactional databases.
25. In pre-pruning, we remove branches from a "fully grown" tree.
Answer: False – Pre-pruning stops tree growth early, while post-pruning removes branches
from a fully grown tree.
26. The overfitting problem occurs because of using the training data set for the testing of the
classifier.
Answer: True – Overfitting happens when a model learns noise from the training set instead of
general patterns.
27. Lossy compression is the preferred compression technique.
Answer: False – Lossy compression loses information, making lossless compression preferable
for critical data storage.
28. Clustering is the best technique for fraud detection.
Answer: False – Anomaly detection and classification (e.g., decision trees, neural networks) are
better suited for fraud detection.
29. Entropy is a measure of uncertainty associated with a random variable.
Answer: True – Entropy quantifies uncertainty; higher entropy means more unpredictability.
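As a quick illustration of the definition H(X) = -Σ p(x) log2 p(x), a small Python sketch (the 9-yes/5-no split is the classic weather-data example):

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of the label distribution, in bits.
        counts = Counter(labels)
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    print(entropy(['yes'] * 9 + ['no'] * 5))  # ~0.940 bits
    print(entropy(['yes'] * 7 + ['no'] * 7))  # 1.0 bit: a 50/50 split is maximally uncertain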
30. Clustering technique is not a stand-alone tool to get insight into data distribution.
Answer: True – Clustering alone is not enough; visualization and statistical methods are also
needed.
31. The vector space model (VSM) is a generalization of the bag of words model.
Answer: True – VSM generalizes bag-of-words by representing documents as vectors of
weighted terms (e.g., TF-IDF) rather than raw word counts.
32. Stemming refers to the process of deriving the root word form and meaning.
Answer: True – Stemming reduces words to their root forms (e.g., "running" → "run").
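A quick demonstration using NLTK's Porter stemmer (assuming the nltk package is installed). Note that stems need not be dictionary words, which is why stemming recovers the root form rather than the full meaning:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ['running', 'runs', 'studies', 'studying']:
        print(word, '->', stemmer.stem(word))
    # running -> run, runs -> run, studies -> studi, studying -> studi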
33. Web data are mainly structured.
Answer: False – Most web data is semi-structured or unstructured (HTML, XML, JSON).
34. TF-IDF assigns higher values to terms that are common in the corpus.
Answer: False – TF-IDF gives higher values to less common but important terms.
35. Document content analysis can be used in recommender systems.
Answer: True – Content-based filtering in recommender systems uses document analysis.
36. In the bag-of-words model assumption, the ordering of terms within the document is
significant and relevant.
Answer: False – Bag-of-words ignores term order, considering only word frequency.
37. IDF is computed at the corpus level.
Answer: True – Inverse Document Frequency (IDF) is calculated across the entire corpus.
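To make the corpus-level nature of IDF concrete, a small sketch of one common TF-IDF variant (textbooks differ on the exact weighting; this one uses tf = count/length and idf = log(N/df)):

    import math

    corpus = [['data', 'mining', 'rocks'],
              ['data', 'warehouse', 'design'],
              ['mining', 'frequent', 'patterns']]

    def idf(term, docs):
        # IDF is computed over the whole corpus: log(N / document frequency).
        df = sum(term in doc for doc in docs)
        return math.log(len(docs) / df)

    def tf_idf(term, doc, docs):
        tf = doc.count(term) / len(doc)  # TF is per-document
        return tf * idf(term, docs)

    print(idf('data', corpus))   # low weight: appears in 2 of 3 documents
    print(idf('rocks', corpus))  # high weight: appears in only 1 document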
38. A corpus refers to blog posts, status updates, or tweets.
Answer: True – A corpus is a collection of text data, including blog posts and tweets.
39. Text mining plays a vital role in news analysis.
Answer: True – Text mining extracts trends and sentiment from news articles.
40. In the vector space model, a term can be a word or a sequence of words.
Answer: True – VSM can represent single words or phrases (n-grams).
41. The cosine similarity is determined by measuring the angle between vectors.
Answer: True – Cosine similarity measures the cosine of the angle between document vectors; a smaller angle means greater similarity.
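A minimal numeric check of the definition cos(θ) = (a · b) / (||a|| ||b||), with made-up term-frequency vectors:

    import numpy as np

    def cosine_similarity(a, b):
        # 1.0 means the vectors point in the same direction (angle 0).
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    d1 = np.array([3, 2, 0, 5])  # term counts of document 1
    d2 = np.array([1, 0, 0, 2])  # term counts of document 2
    print(cosine_similarity(d1, d2))  # ~0.94: small angle, similar documents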
42. Stemming increases the size of the vocabulary.
Answer: False – Stemming reduces vocabulary size by merging similar words.
43. FP-growth has a higher number of scans compared to the apriori algorithm.
Answer: False – FP-growth has fewer scans than Apriori, making it more efficient.
44. There is a difference between classification and numeric prediction.
Answer: True – Classification predicts categorical labels, while numeric prediction predicts numbers.
45. In unsupervised learning, the class label of training data is known.
Answer: False – In unsupervised learning, the class labels are unknown.
46. Credit/loan approval is a typical example of numeric prediction.
Answer: False – Credit/loan approval is a classification problem, not numeric prediction.
47. The higher the entropy, the higher the uncertainty.
Answer: True – Higher entropy means higher uncertainty in a dataset.
48. We can dynamically define discrete-valued attributes that partition the continuous attribute value
into a discrete set of intervals to apply classification on continuous attributes.
Answer: True – Discretization helps in classification by converting continuous data into categories.
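A short sketch of such discretization with pandas (the bin boundaries and labels are illustrative assumptions):

    import pandas as pd

    ages = pd.Series([5, 17, 23, 31, 46, 58, 72])

    # Partition the continuous attribute into labeled intervals so that
    # a classifier can treat it as a discrete-valued attribute.
    bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                  labels=['child', 'young', 'middle_aged', 'senior'])
    print(bins.tolist())
    # ['child', 'child', 'young', 'young', 'middle_aged', 'middle_aged', 'senior']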
True and false 2024 exam:
1. In the vector space model, a term can be a word or a sequence of words.
Answer: True. In the vector space model (VSM), a term can be a single word
(unigram) or a sequence of words (n-grams) used to represent text in numerical
form.
2. The cosine similarity is determined by measuring the angle between vectors.
Answer: True. Cosine similarity calculates the similarity between two vectors by
measuring the cosine of the angle between them, which helps in determining how
similar two documents are in text analysis.
3. Stemming increases the size of the vocabulary.
Answer: False. Stemming reduces the vocabulary size by converting words to
their root forms (e.g., "running" → "run"), which helps in reducing redundancy in
text mining.
4. FP-growth has a higher number of scans compared to Apriori.
Answer: False. The FP-growth algorithm scans the dataset only twice, whereas
Apriori requires multiple scans, making FP-growth more efficient for frequent
pattern mining.
5. There is no difference between classification and numeric prediction
algorithms.
Answer: False. Classification predicts categorical values (e.g., Yes/No, Spam/Not
Spam), while numeric prediction predicts continuous values (e.g., stock prices,
temperatures).
6. The higher the entropy, the higher the uncertainty.
Answer: True. Entropy measures the uncertainty in a dataset; a higher entropy
value means the dataset is more diverse and less predictable.
7. We can dynamically define discrete-valued attributes that partition the
continuous attribute value into a discrete set of intervals to apply classification
on continuous attributes.
Answer: True. This technique, known as discretization, helps convert continuous
values into categorical bins to make them suitable for classification.
8. A data warehouse is subject-oriented because it is organized around major
subjects like operators and middle managers.
Answer: False. A data warehouse is subject-oriented because it is organized
around business domains like sales, finance, and customer data, not job roles.
9. OLTP users are measured in hundreds while OLAP users are measured in
thousands.
Answer: False. OLTP (Online Transaction Processing) systems support thousands
of users for day-to-day transactions, while OLAP (Online Analytical Processing)
systems have fewer users (typically in the hundreds) performing complex queries.
10. A virtual warehouse is a set of views over operational databases.
Answer: True. A virtual warehouse is an abstraction layer that creates views on
existing databases without physically storing the data.
11. A data mart is a subset of corporate-wide data that is of value to a specific
group of users.
Answer: True. A data mart is a smaller, focused version of a data warehouse
designed for specific departments like marketing or finance.
12. If you have databases, you generally do not need data warehouses.
Answer: False. A database is optimized for transaction processing (OLTP),
whereas a data warehouse is optimized for analytical processing (OLAP). Both
serve different purposes.
13. ETL is short for Extraction, Transformation, and Loading and also
incorporates cleaning and refresh steps.
Answer: True. The ETL process extracts data from various sources, transforms it
into a usable format, cleans it, and loads it into a data warehouse.
14. Metadata stores include, among other things, a description of the structure
of the Data Warehouse and the algorithms used for summarization.
Answer: True. Metadata in data warehouses includes details about tables,
columns, relationships, transformations, and summarization techniques.
15. In data warehousing, an n-D base cube is called a base cuboid.
Answer: True. A base cuboid is the lowest level in a data cube, containing raw
data before aggregation.
16. In data warehousing, the topmost 0-D cube, which holds the highest level of
summarization, is called the apex cuboid.
Answer: True. The apex cuboid represents the most aggregated level in a data
cube, summarizing all data into a single value.
17. A snowflake schema is composed of multiple fact tables that share
dimension tables.
Answer: False. A snowflake schema normalizes dimension tables into smaller
related tables, but it still contains only one fact table.
18. The star schema is the most popular data warehouse schema.
Answer: True. The star schema is widely used because it simplifies queries and
improves performance by directly linking dimension tables to a central fact table.
19. FP-growth scans the dataset twice.
Answer: True. Unlike Apriori, FP-growth scans the dataset only twice, making it
more efficient.
20. All frequent patterns can be considered closed patterns.
Answer: False. A closed pattern is a frequent pattern with no super-pattern
having the same frequency. Not all frequent patterns are closed.
21. All maximal patterns are frequent patterns.
Answer: True. A maximal pattern is a frequent pattern that has no frequent
super-patterns.
22. In pre-pruning, we remove branches from a "fully grown" tree.
Answer: False. Pre-pruning stops tree growth early if further branching does not
improve classification. Post-pruning removes branches after the tree is fully
grown.
23. Clustering is the best technique for fraud detection.
Answer: False. Anomaly detection and classification techniques (like decision
trees and neural networks) are more commonly used for fraud detection.
24. Entropy is a measure of uncertainty associated with a random variable.
Answer: True. Entropy quantifies uncertainty and is used in decision tree
algorithms for attribute selection.
25. Clustering technique is not a stand-alone tool to get insight into data
distribution.
Answer: True. Clustering is often combined with visualization and statistical
techniques for better insights.
26. The vector space model (VSM) is a generalization of the bag-of-words
model.
Answer: True. VSM extends the bag-of-words model by representing documents
as vectors in a multi-dimensional space.
27. Stemming refers to the process of deriving the root word form and meaning.
Answer: True. Stemming reduces words to their root form (e.g., "running" →
"run").
28. Web data are mainly structured.
Answer: False. Most web data is unstructured, including text, images, and videos.
29. TF-IDF assigns higher values to terms that are common in the corpus.
Answer: False. TF-IDF gives higher weight to rare terms that appear frequently in
a single document but not in many others.
30. Document content analysis can be used in recommender systems.
Answer: True. Content-based recommendation systems use text analysis to
suggest relevant content.
31. In the bag-of-words assumption, the ordering of terms is significant and
relevant.
Answer: False. The bag-of-words model ignores word order and considers only
term frequency.
32. IDF is computed at the corpus level.
Answer: True. Inverse Document Frequency (IDF) is computed across all
documents in the corpus.
33. A corpus refers to blog posts, status updates, or tweets.
Answer: True. A corpus is a collection of text documents, which can include blog
posts, social media updates, tweets, and more.
34. Text mining plays a vital role in news analysis.
Answer: True. Text mining techniques, such as sentiment analysis and topic
modeling, are widely used for news classification, trend detection, and fake
news identification.
35. The ability to deal with different types of attributes is one of the
requirements of data mining.
Answer: True. Data mining algorithms must handle different attribute types
(e.g., categorical, numerical, ordinal) for effective analysis.
36. Data preprocessing is a step in Knowledge Discovery.
Answer: True. Data preprocessing (e.g., cleaning, normalization, transformation)
is an essential step in the Knowledge Discovery in Databases (KDD) process to
improve data quality before analysis.
37. Clustering techniques are used to discover the most frequently purchased
items together.
Answer: False. Frequent pattern mining (e.g., Apriori, FP-Growth) is used for
market basket analysis to find frequently purchased items together. Clustering
groups similar items but does not track co-occurrence frequency.
38. Dimensionality reduction is more important than resolution in data mining.
Answer: True. Reducing the number of attributes (dimensionality reduction)
improves efficiency, model performance, and interpretability in data mining.
39. In the clustering technique, we have predefined classes in which future data
can be classified.
Answer: False. Clustering is unsupervised learning, meaning there are no
predefined classes. The algorithm groups similar data points based on patterns.
40. Dissimilarity measures are used to perform classification.
Answer: False. Dissimilarity measures (e.g., Euclidean or Manhattan distance) are used
in clustering and information retrieval, not classification.
41. Data mining is the first step in the Knowledge Discovery process.
Answer: False. Data mining is one of the final steps in the Knowledge Discovery
in Databases (KDD) process. The first step is usually data preprocessing.
42. Interval data has an inherent zero point.
Answer: False. Interval data (e.g., temperature in Celsius) has equal intervals but
no true zero. Ratio data (e.g., height, weight) has a true zero point.
43. Filling the missing values, smoothing noisy data, identifying and removing
outliers, and resolving inconsistencies are known as data cleaning.
Answer: True. Data cleaning is part of data preprocessing, ensuring high-quality
data for analysis.
44. Derivable data means that one attribute may be derived from another table.
Answer: True. Derived attributes are computed from existing data, e.g., age can
be derived from the date of birth.
45. Missing data may be due to equipment malfunction.
Answer: True. Missing data can result from sensor failures, human errors, or
system crashes.
46. Overfitting is an undesirable machine learning behavior that occurs when
the machine learning model gives accurate predictions for training data but not
for new data.
Answer: True. Overfitting happens when a model learns noise in the training
data instead of general patterns, reducing its ability to generalize to new data.
47. In the decision tree algorithm, each internal node corresponds to a class
label.
Answer: False. Each internal node represents a decision (split), while leaf nodes
correspond to class labels.
48. Support, S, is the conditional probability that a transaction having X also
contains Y.
Answer: False. Support is the probability of both X and Y occurring together,
while confidence measures the conditional probability P(Y | X).
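A worked example with five hypothetical transactions makes the distinction concrete:

    transactions = [{'milk', 'bread'}, {'milk', 'bread', 'eggs'},
                    {'bread'}, {'milk', 'eggs'}, {'milk', 'bread'}]
    N = len(transactions)

    both = sum({'milk', 'bread'} <= t for t in transactions)  # 3
    milk = sum('milk' in t for t in transactions)             # 4

    support = both / N        # P(milk AND bread) = 3/5 = 0.6
    confidence = both / milk  # P(bread | milk)   = 3/4 = 0.75
    print(support, confidence)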
49. Any subset of a frequent itemset must be frequent.
Answer: True. This is the Apriori property, which states that if an itemset is
frequent, all its subsets must also be frequent.
50. Apriori algorithm does not require a lot of scans.
Answer: False. The Apriori algorithm requires multiple scans of the dataset,
making it slower compared to FP-Growth.
Masters 2020:
1) Dimensionality reduction involves reducing the number of attributes.
Answer: True. Dimensionality reduction reduces the number of attributes
(features) while preserving important information to improve model efficiency
and performance.
2) Discretization is used in data smoothing.
Answer: False. Discretization is used to convert continuous attributes into
discrete categories, whereas data smoothing is used to remove noise from data.
3) Five-number summary is used to measure the dispersion of data.
Answer: True. The five-number summary (minimum, Q1, median, Q3, maximum)
provides a measure of data dispersion, spread, and distribution.
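A one-line computation with NumPy (the data values are made up):

    import numpy as np

    data = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

    # Minimum, Q1, median, Q3, maximum: the summary behind the boxplot.
    print(np.percentile(data, [0, 25, 50, 75, 100]))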
4) K-Medoids Clustering Method has no drawbacks.
Answer: False. K-Medoids is more robust to noise than K-Means, but it is
computationally expensive for large datasets.
5) Star Schema is the best data warehouse schema.
Answer: False. While Star Schema is widely used due to its simplicity, the best
schema depends on the use case (e.g., Snowflake Schema offers better
normalization).
6) Symmetric attributes are those with values of equal significance to the mining
task.
Answer: True. Symmetric attributes (e.g., gender, yes/no responses) have values
that are equally significant in data mining.
7) Data warehouse is mainly used for supporting organizational operations.
Answer: False. Data warehouses are designed for decision support and analytical
processing (OLAP) rather than day-to-day operations (OLTP).
8) We cannot use the Clustering technique with unlabeled data.
Answer: False. Clustering is an unsupervised learning technique, which means it
works without labeled data by grouping similar objects.
9) Association rules technique is considered unsupervised learning.
Answer: True. Association rule mining (e.g., Apriori, FP-Growth) does not require
labeled data and is an unsupervised learning technique.
10) It is always recommended to use lossless compression when reducing the
size of data.
Answer: False. While lossless compression retains all data, lossy compression is
sometimes preferred when minor data loss is acceptable for significant size
reduction.
11) Overfitting provides a better decision tree.
Answer: False. Overfitting causes poor generalization—the decision tree
becomes too complex, fitting training data perfectly but failing on new data.
12) The support of an association rule (LHS ⇒ RHS) is the probability of RHS.
Answer: False. Support is the probability of both LHS and RHS occurring
together, not just RHS.
13) Min-max normalization can put normalized values between any two
numbers.
Answer: True. Min-max normalization scales values to a specified range, such as
[0,1] or [-1,1].
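A minimal sketch of the formula v' = (v - min) / (max - min) × (new_max - new_min) + new_min, with income values borrowed from the common textbook example:

    import numpy as np

    def min_max(x, new_min=0.0, new_max=1.0):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

    incomes = [12000, 73600, 98000]
    print(min_max(incomes))         # mapped to [0, 1]; 73600 -> ~0.716
    print(min_max(incomes, -1, 1))  # any target range works, e.g. [-1, 1]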
14) Outlier values are the only type of noise in data.
Answer: False. Noise in data can come from measurement errors, missing
values, duplicate records, and inconsistencies, not just outliers.
15) The measure of interestingness of discovered patterns is always
quantitative.
Answer: False. Interestingness measures can be quantitative (e.g., support,
confidence) or qualitative (e.g., usefulness, novelty).
16) C4.5 handles outliers better than ID3.
Answer: True. C4.5 handles continuous attributes and missing values better than
ID3, making it more robust to outliers.
17) Post-pruning is recommended over pre-pruning in Decision Tree Induction.
Answer: True. Post-pruning is preferred because it allows a fully grown tree to be
pruned after training, improving accuracy and avoiding premature cuts.
18) Bayesian Classification mainly relies on probabilities.
Answer: True. Bayesian classifiers (e.g., Naïve Bayes) use probability theory
(Bayes' Theorem) to make classification decisions.
19) Closed patterns could be seen as a subset of max patterns.
Answer: False. It is the other way around: every maximal pattern is closed (a maximal
pattern has no frequent super-pattern at all), so maximal patterns are a subset of
closed patterns, not vice versa.
20) Distance function d(a, b) in cluster analysis shows the distance between two
clusters.
Answer: True. Distance functions (e.g., Euclidean, Manhattan, cosine similarity)
measure how far apart clusters are.
21) Discrete Fourier Transform (DFT) is one of the data mining techniques.
Answer: True. DFT is used in time-series analysis, feature extraction, and signal
processing in data mining.
22) In case of having some missing data in one of the tuples, we may choose to
ignore the tuple rather than filling the missing value.
Answer: True. If the missing data is minimal, removing the tuple is a valid
approach, but if missing data is significant, imputation is preferred.
23) When the value of chi-square is high, this means that it is less likely that the
variables are related.
Answer: False. A higher chi-square value indicates a stronger relationship
between the variables, meaning they are likely dependent.
24) For two attributes, when the correlation coefficient value is 0, this means
that they are highly related.
Answer: False. A correlation coefficient of 0 means no linear relationship, not
high correlation. A correlation close to ±1 indicates a strong relationship.
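Both tests are easy to check numerically; a sketch with hypothetical data, assuming SciPy is available:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Chi-square on a 2x2 contingency table: a large statistic (tiny p-value)
    # suggests the two attributes ARE related.
    table = np.array([[250, 200], [50, 1000]])
    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)

    # Pearson correlation: 0 means no LINEAR relationship, +/-1 a strong one.
    x = np.array([1, 2, 3, 4, 5])
    print(np.corrcoef(x, 2 * x + 1)[0, 1])        # 1.0: perfectly correlated
    print(np.corrcoef(x, [2, 5, 1, 4, 3])[0, 1])  # ~0.1: essentially unrelated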
Exam 2022:
1) A data warehouse is subject-oriented because it is organized around major
subjects such as operators and middle managers.
Answer: False. A data warehouse is subject-oriented because it is organized
around major business subjects such as sales, customers, and products—not
specific user roles like operators or middle managers.
2) Recall is one of the measures of clustering performance.
Answer: False. Recall is a metric used in classification tasks, not clustering.
Clustering performance is evaluated using measures like the Silhouette Score,
Davies-Bouldin Index, and Dunn Index.
3) A data warehouse is defined as integrated because it is composed of
relational databases.
Answer: False. A data warehouse is called integrated because it combines data
from multiple heterogeneous sources (databases, flat files, ERP systems, etc.),
not because it uses relational databases.
4) OLTP users are measured in hundreds while OLAP users are measured in
thousands.
Answer: False. OLTP (Online Transaction Processing) systems typically have more
users (thousands or millions), while OLAP (Online Analytical Processing) systems
are used by analysts and decision-makers, often in the hundreds.
5) A virtual warehouse is a set of views over operational databases.
Answer: True. A virtual data warehouse does not store data physically but
provides a set of views on operational data sources.
6) A data mart is a subset of corporate-wide data that is of value to a specific
group of users.
Answer: True. Data marts are focused on a specific business function (e.g.,
finance, sales) and contain a subset of enterprise-wide data.
7) If you have databases, you generally do not need data warehouses.
Answer: False. Databases store operational data, while data warehouses
support analytics and decision-making by integrating and transforming data from
multiple sources.
8) ETL is short for Extraction, Transaction, and Loading, and it also incorporates
cleaning.
Answer: False. ETL stands for Extraction, Transformation, and Loading. The
transformation phase includes data cleaning and integration.
9) Metadata stores, among other things, a description of the structure of the
data warehouse and the algorithms used for summarization.
Answer: True. Metadata includes data definitions, relationships, ETL processes,
indexing, and query optimizations in a data warehouse.
10) In data warehousing, an n-D base cube is called a base cuboid.
Answer: True. The n-D base cuboid represents the most detailed level of data in
a data warehouse.
11) In data warehousing, the topmost 0-D cube holds the highest level of
summarization.
Answer: True. The 0-D cube (or apex cuboid) contains the most aggregated data
(e.g., total sales revenue).
12) In data warehousing, the topmost 0-D cube is called the apex cuboid.
Answer: True. The apex cuboid represents the highest level of summarization in
an OLAP cube.
13) A snowflake schema is composed of multiple fact tables that share
dimension tables.
Answer: False. In a snowflake schema, dimension tables are normalized to
reduce redundancy. Fact tables remain centralized.
14) The star schema is the most popular data warehouse schema.
Answer: True. The star schema is widely used due to simplicity, faster query
performance, and easy interpretation.
15) The typical OLAP operations are Roll-in, Drill-up, Slice and Dice, and Pivot.
Answer: False. The correct OLAP operations are Roll-up, Drill-down, Slice, Dice,
and Pivot. "Roll-in" is not an OLAP operation.
16) Relative support (s) is the frequency of occurrence of an itemset.
Answer: True. Relative support measures how often an itemset appears in
transactions as a fraction of the total transactions.
17) A k-itemset X contains k+2 items.
Answer: False. A k-itemset contains exactly k items (not k+2).
18) FP-Growth has a higher number of scans compared to the Apriori algorithm.
Answer: False. FP-Growth is more efficient because it reduces the number of
scans compared to Apriori, which scans the dataset multiple times.
19) In unsupervised learning, the class label of training data is known.
Answer: False. Unsupervised learning does not use labeled data; it finds patterns
and structures in unlabeled data (e.g., clustering).
20) There is no difference between classification and numeric data prediction.
Answer: False. Classification predicts discrete categories, while numeric
prediction (regression) predicts continuous values.
21) Accuracy rate is the percentage of test set samples that are correctly
classified by the model.
Answer: True. Accuracy is measured as (Correct Predictions / Total Test Samples)
× 100%.
22) The higher the entropy, the higher the uncertainty.
Answer: True. In decision trees and information theory, higher entropy means
more randomness and uncertainty.
23) We can dynamically define discrete-valued attributes that partition
continuous attributes into a discrete set of intervals for classification.
Answer: True. This is called discretization, commonly used in decision trees and
Naïve Bayes classifiers.
24) Mining techniques need not be scalable.
Answer: False. Scalability is crucial for handling large datasets efficiently in
real-world data mining.
25) Clustering algorithms should not be incremental.
Answer: False. Incremental clustering (e.g., online K-Means) is useful when new
data arrives continuously.
26) In classification, rules are easier to understand than large trees.
Answer: True. Rule-based classification (e.g., decision rules) is often more
interpretable than large decision trees.
27) In evaluating classifiers, specificity is the True Negative recognition rate.
Answer: True. Specificity = TN / (TN + FP), where TN = True Negatives.
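A tiny worked example with hypothetical confusion-matrix counts ties the last few measures together:

    # TP/FP/TN/FN counts are made up for illustration.
    TP, FP, TN, FN = 85, 10, 90, 15

    accuracy = (TP + TN) / (TP + FP + TN + FN)  # correctly classified / total
    specificity = TN / (TN + FP)                # True Negative recognition rate
    sensitivity = TP / (TP + FN)                # True Positive rate (recall)
    print(accuracy, specificity, sensitivity)   # 0.875 0.9 0.85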
28) A Cluster is a collection of data objects.
Answer: True. Clustering groups similar data points together.
29) Clustering is not important for city-planning decisions.
Answer: False. Clustering is widely used in city planning, e.g., traffic patterns,
population density, and resource allocation.
30) The ability to deal with different types of attributes is one of the
requirements of clustering techniques.
Answer: True. Clustering should handle categorical, numerical, and mixed-type
data.
35) The vector space model (VSM) is a generalization of the Bag of Words.
Answer: True. VSM represents text as weighted term vectors, expanding on the
Bag of Words model.
36) The closer the cosine value to 1, the smaller the angle and greater the match
between documents.
Answer: True. Cosine similarity of 1 means identical documents.
37) Stemming refers to the process of deriving the root word form and meaning.
Answer: True. Example: "running" → "run".