DP 100 Demo
DP-100 Exam
Azure Data Scientist Associate
https://authorizedumps.com/dp-100-exam/
www.authorizedumps.com
Questions & Answers PDF Page 2
Version:30.0
Overview
You are a data scientist in a company that provides data science for professional sporting events.
Models will use global and local market data to meet the following business goals:
• Understand sentiment of mobile device users at sporting events based on audio from crowd
reactions.
Current environment
Requirements
• Media used for penalty event detection will be provided by consumer devices. Media may include
images and videos captured during the sporting event and shared using social media. The images
and videos will have varying sizes and formats.
• The data available for model building comprises seven years of sporting event media. The
sporting event media includes recorded videos, transcripts of radio commentary, and logs from
related social media feeds captured during the sporting events.
• Crowd sentiment will include audio recordings submitted by event attendees in both mono and
stereo formats.
Advertisements
• Ad response models must be trained at the beginning of each event and applied during the
sporting event.
• Sampling must guarantee mutual and collective exclusivity between local and global segmentation
models that share the same features.
• Local market segmentation models will be applied before determining a user’s propensity to
respond to an advertisement.
• The ad propensity model uses a cut threshold of 0.45, and retraining occurs if weighted Kappa
deviates from 0.1 +/- 5%.
• The ad propensity model uses cost factors shown in the following diagram:
The ad propensity model uses proposed cost factors shown in the following diagram:
Performance curves of current and proposed cost factor scenarios are shown in the following
diagram:
Findings
• Data scientists must build an intelligent solution by using multiple machine learning models for
penalty event detection.
• Data scientists must build notebooks in a local environment using automatic feature engineering
and model building in machine learning pipelines.
• Notebooks must be deployed to retrain by using Spark instances with dynamic worker allocation
• Notebooks must execute with the same code on new Spark instances to recode only the source of
the data.
• Global penalty detection models must be trained by using dynamic runtime graph computation
during training.
• Experiments for local crowd sentiment models must combine local penalty detection data.
• Crowd sentiment models must identify known sounds such as cheers and known catch phrases.
Individual crowd sentiment models will detect similar sounds.
• Shared features must use double precision. Subsequent layers must have aggregate running mean
and standard deviation metrics available.
segments
• The distribution of features across training and production data is not consistent.
Analysis shows that of the 100 numeric features on user location and behavior, the 47 features that
come from location sources are being used as raw features. A suggested experiment to remedy the
bias and variance issue is to engineer 10 linearly uncorrelated features.
• Initial data discovery shows a wide range of densities of target states in training data used for crowd
sentiment models.
• All penalty detection models show inference phases using a Stochastic Gradient Descent (SGD) are
running too slow.
• Audio samples show that the length of a catch phrase varies between 25%-47%, depending on
region.
• The performance of the global penalty detection models shows lower variance but higher bias when
comparing training and validation sets. Before implementing any feature changes, you must confirm
the bias and variance using all training and validation cases.
Question: 1
You need to resolve the local machine learning pipeline performance issue. What should you do?
Answer: A
Explanation:
Question: 2
DRAG DROP
You need to modify the inputs for the global penalty event model to address the bias and variance
issue.
Which three actions should you perform in sequence? To answer, move the appropriate actions from
the list of actions to the answer area and arrange them in the correct order.
Answer:
Explanation:
Question: 3
You need to select an environment that will meet the business and data requirements.
Answer: D
Explanation:
Question: 4
DRAG DROP
Which three actions should you perform in sequence? To answer, move the appropriate actions from
the list of actions to the answer area and arrange them in the correct order.
Answer:
Explanation:
Question: 5
DRAG DROP
Which three actions should you perform in sequence? To answer, move the appropriate actions from
the list of actions to the answer area and arrange them in the correct order.
Answer:
Explanation:
Question: 6
DRAG DROP
You need to define an evaluation strategy for the crowd sentiment models.
Which three actions should you perform in sequence? To answer, move the appropriate actions from
the list of actions to the answer area and arrange them in the correct order.
Answer:
Explanation:
Scenario:
Experiments for local crowd sentiment models must combine local penalty detection data.
Crowd sentiment models must identify known sounds such as cheers and known catch phrases.
Individual crowd sentiment models will detect similar sounds.
Note: Evaluate the change in correlation between model error rate and centroid distance.
Reference:
https://en.wikipedia.org/wiki/Nearest_centroid_classifier
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/sweep-clustering
Question: 7
HOTSPOT
You need to build a feature extraction strategy for the local models.
How should you complete the code segment? To answer, select the appropriate options in the
answer area.
Answer:
Explanation:
Question: 8
You need to implement a scaling strategy for the local penalty detection data.
A. Streaming
B. Weight
C. Batch
D. Cosine
Answer: C
Explanation:
Post batch normalization statistics (PBN) is the Microsoft Cognitive Toolkit (CNTK) approach to
evaluating the population mean and variance of Batch Normalization, which can then be used in
inference.
In CNTK, custom networks are defined using the BrainScriptNetworkBuilder and described in the
CNTK network description language "BrainScript."
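The idea of population statistics for batch normalization can be sketched in plain Python. This is an illustration only, not the CNTK API: the function names are hypothetical, and the sketch assumes a single scalar feature. After training, the data is passed through once more to accumulate a mean and variance over all batches, and those fixed values are then used at inference time.

```python
def population_stats(batches):
    """Accumulate the mean and variance of all values seen across batches."""
    count, total, total_sq = 0, 0.0, 0.0
    for batch in batches:
        for x in batch:
            count += 1
            total += x
            total_sq += x * x
    mean = total / count
    variance = total_sq / count - mean * mean  # population (biased) variance
    return mean, variance


def normalize(x, mean, variance, eps=1e-5):
    """Normalize one inference-time value with the fixed population statistics."""
    return (x - mean) / (variance + eps) ** 0.5


batches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mean, variance = population_stats(batches)
```

Because the statistics are computed once over the whole population, the normalized output for a given input no longer depends on which other samples happen to share its batch.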
Scenario:
Reference:
https://docs.microsoft.com/en-us/cognitive-toolkit/post-batch-normalization-statistics
Question: 9
HOTSPOT
You need to use the Python language to build a sampling strategy for the global penalty detection
models.
How should you complete the code segment? To answer, select the appropriate options in the
answer area.
Answer:
Explanation:
Box 2: ..DistributedSampler(Sampler)..
DistributedSampler(Sampler):
Scenario: Sampling must guarantee mutual and collective exclusivity between local and global
segmentation models that share the same features.
Scenario: All penalty detection models show inference phases using a Stochastic Gradient Descent
(SGD) are running too slow.
Box 4: .. nn.parallel.DistributedDataParallel..
DistributedSampler(Sampler): The sampler that restricts data loading to a subset of the dataset.
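A minimal sketch of the partitioning idea, written without PyTorch so it stays self-contained. The function name `shard_indices` is hypothetical, and the sketch is a simplification: the PyTorch class also handles shuffling and uneven dataset sizes, which are omitted here. Each worker (rank) takes a strided slice of the index range, so the subsets are mutually exclusive and together cover the whole dataset.

```python
def shard_indices(dataset_size, num_replicas, rank):
    """Return the strided subset of indices assigned to one worker (rank)."""
    return list(range(rank, dataset_size, num_replicas))


# Three workers sharing a dataset of ten samples.
shards = [shard_indices(10, 3, rank) for rank in range(3)]
```

No index appears in two shards, and the union of the shards is the full index range, which is the property the scenario calls mutual and collective exclusivity.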
Reference:
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py
Question: 10
You need to implement a feature engineering strategy for the crowd sentiment local models.
Answer: D
Explanation:
The linear discriminant analysis method works only on continuous variables, not categorical or
ordinal variables.
Linear discriminant analysis is similar to analysis of variance (ANOVA) in that it works by comparing
the means of the variables.
Scenario:
Data scientists must build notebooks in a local environment using automatic feature engineering and
model building in machine learning pipelines.
Experiments for local crowd sentiment models must combine local penalty detection data.
Incorrect Answers:
B: The Pearson correlation coefficient, sometimes called Pearson’s R test, is a statistical value that
measures the linear relationship between two variables. By examining the coefficient values, you can
infer something about the strength of the relationship between the two variables, and whether they
are positively correlated or negatively correlated.
C: Spearman’s correlation coefficient is designed for use with non-parametric and non-normally
distributed data. Spearman's coefficient is a nonparametric measure of statistical dependence
between two variables, and is sometimes denoted by the Greek letter rho. The Spearman’s
coefficient expresses the degree to which two variables are monotonically related. It is also called
Spearman rank correlation, because it can be used with ordinal variables.
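The difference between the two coefficients can be seen on a small example. The sketch below is plain Python with hypothetical helper names, and it assumes the data has no tied values. Spearman's coefficient is computed as the Pearson correlation of the ranks, so a relationship that is monotonic but not linear scores a perfect 1 under Spearman and less than 1 under Pearson.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1..n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotonic but not linear
```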
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/fisher-linear-discriminant-analysis
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/compute-linear-correlation
Question: 11
DRAG DROP
Which three actions should you perform in sequence? To answer, move the appropriate actions from
the list of actions to the answer area and arrange them in the correct order.
Answer:
Explanation:
Decision jungles are non-parametric models, which can represent non-linear decision boundaries.
Step 3: Use the raw score as a feature in a Score Matchbox Recommender model
The goal of creating a recommendation system is to recommend one or more "items" to "users" of
the system. Examples of an item could be a movie, restaurant, book, or song. A user could be a
person, group of persons, or other entity with item preferences.
Scenario:
Ad response models must be trained at the beginning of each event and applied during the sporting
event.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/multiclass-decision-jungle
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/score-matchbox-recommender
Question: 12
DRAG DROP
You need to define an evaluation strategy for the crowd sentiment models.
Which three actions should you perform in sequence? To answer, move the appropriate actions from
the list of actions to the answer area and arrange them in the correct order.
Answer:
Explanation:
When using a neural network to perform classification and prediction, it is usually better to use
cross-entropy error than classification error, and somewhat better to use cross-entropy error than
mean squared error to evaluate the quality of the neural network.
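A small example shows why cross-entropy is more informative than classification error. The sketch uses made-up probabilities for a three-class problem and hypothetical function names. Both sets of predictions pick the correct class every time, so their classification error is identical, but cross-entropy still distinguishes the confident model from the hesitant one.

```python
import math


def classification_error(probs, labels):
    """Fraction of items whose highest-probability class is wrong."""
    wrong = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) != y)
    return wrong / len(labels)


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


labels = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
```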
Reference:
https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/
Question: 13
You need to implement a model development strategy to determine a user’s tendency to respond to
an ad.
A. Use a Relative Expression Split module to partition the data based on centroid distance.
B. Use a Relative Expression Split module to partition the data based on distance travelled to the
event.
C. Use a Split Rows module to partition the data based on distance travelled to the event.
D. Use a Split Rows module to partition the data based on centroid distance.
Answer: A
Explanation:
Split Data partitions the rows of a dataset into two distinct sets.
The Relative Expression Split option in the Split Data module of Azure Machine Learning Studio is
helpful when you need to divide a dataset into training and testing datasets using a numerical
expression.
Relative Expression Split: Use this option whenever you want to apply a condition to a number
column. The number could be a date/time field, a column containing age or dollar amounts, or even
a percentage. For example, you might want to divide your data set depending on the cost of the
items, group people by age ranges, or separate data by a calendar date.
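The effect of a relative expression split can be sketched in plain Python. This is not the Studio module itself: the function name, the column name `centroid_distance`, and the threshold of 1.0 are all assumptions chosen for illustration. Rows are divided into two sets according to a condition on one numeric column.

```python
def relative_expression_split(rows, column, predicate):
    """Split rows into (matching, remaining) by a condition on one numeric column."""
    matching = [row for row in rows if predicate(row[column])]
    remaining = [row for row in rows if not predicate(row[column])]
    return matching, remaining


rows = [
    {"user": "a", "centroid_distance": 0.2},
    {"user": "b", "centroid_distance": 1.7},
    {"user": "c", "centroid_distance": 0.9},
]
near, far = relative_expression_split(rows, "centroid_distance", lambda d: d < 1.0)
```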
Scenario:
Local market segmentation models will be applied before determining a user’s propensity to respond
to an advertisement.
The distribution of features across training and production data is not consistent.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data
Question: 14
You need to implement a new cost factor scenario for the ad response models as illustrated in the
A. Set the threshold to 0.5 and retrain if weighted Kappa deviates +/- 5% from 0.45.
B. Set the threshold to 0.05 and retrain if weighted Kappa deviates +/- 5% from 0.5.
C. Set the threshold to 0.2 and retrain if weighted Kappa deviates +/- 5% from 0.6.
D. Set the threshold to 0.75 and retrain if weighted Kappa deviates +/- 5% from 0.15.
Answer: A
Explanation:
Scenario:
Performance curves of current and proposed cost factor scenarios are shown in the following
diagram:
The ad propensity model uses a cut threshold of 0.45, and retraining occurs if weighted Kappa deviates
from 0.1 +/- 5%.
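The current rule can be written down as a short sketch. Two points are assumptions rather than facts from the scenario: a score equal to the cut threshold is treated as a positive, and "+/- 5%" is read as a relative band around the target Kappa value. The function names are hypothetical.

```python
def classify(propensity, cut_threshold=0.45):
    """Label a user as a likely responder when the score reaches the cut threshold."""
    return propensity >= cut_threshold


def needs_retraining(weighted_kappa, target=0.1, tolerance=0.05):
    """Flag retraining when kappa leaves the band around the target.

    Assumption: "+/- 5%" is a relative band, i.e. target * (1 +/- tolerance).
    """
    return abs(weighted_kappa - target) > target * tolerance
```

Switching to a different cost factor scenario would then amount to calling the same functions with a different `cut_threshold` and `target`.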
Case study
Overview
You are a data scientist for Fabrikam Residences, a company specializing in quality private and
commercial property in the United States. Fabrikam Residences is considering expanding into Europe
and has asked you to investigate prices for private residences in major European cities. You use Azure
Machine Learning Studio to measure the median value of properties. You produce a regression
model to predict property prices by using the Linear Regression and Bayesian Linear Regression
modules.
Datasets
There are two datasets in CSV format that contain property details for two cities, London and Paris,
with the following columns:
The two datasets have been added to Azure Machine Learning Studio as separate datasets and
included as the starting point of the experiment.
Dataset issues
The AccessibilityToHighway column in both datasets contains missing values. The missing data must
be replaced with new data so that it is modeled conditionally using the other variables in the data
before filling in the missing values.
Columns in each dataset contain missing and null values. The dataset also contains many outliers.
The Age column has a high proportion of outliers. You need to remove the rows that have outliers in
the Age column. The MedianValue and AvgRoomsinHouse columns both hold data in numeric
format. You need to select a feature selection algorithm to analyze the relationship between the two
columns in more detail.
Model fit
The model shows signs of overfitting. You need to produce a more refined regression model that
reduces the overfitting.
Experiment requirements
You must set up the experiment to cross-validate the Linear Regression and Bayesian Linear
Regression modules to evaluate performance.
In each case, the predictor of the dataset is the column named MedianValue. An initial investigation
showed that the datasets are identical in structure apart from the MedianValue column. The smaller
Paris dataset contains the MedianValue in text format, whereas the larger London dataset contains
the MedianValue in numerical format. You must ensure that the datatype of the MedianValue
column of the Paris dataset matches the structure of the London dataset.
You must prioritize the columns of data for predicting the outcome. You must use non-parametric
statistics to measure the relationships.
You must use a feature selection algorithm to analyze the relationship between the MedianValue and
AvgRoomsinHouse columns.
Model training
Given a trained model and a test dataset, you need to compute the permutation feature importance
scores of feature variables. You need to set up the Permutation Feature Importance module to select
the correct metric to investigate the model’s accuracy and replicate the findings.
You want to configure hyperparameters in the model learning process to speed the learning phase by
using hyperparameters. In addition, this configuration should cancel the lowest performing runs at
each evaluation interval, thereby directing effort and resources towards models that are more likely
to be successful.
You are concerned that the model might not efficiently use compute resources in hyperparameter
tuning. You also are concerned that the model might prevent an increase in the overall tuning time.
Therefore, you need to implement an early stopping criterion on models that provides savings
without terminating promising jobs.
Testing
You must produce multiple partitions of a dataset based on sampling using the Partition and Sample
module in Azure Machine Learning Studio. You must create three equal partitions for cross-
validation. You must also configure the cross-validation process so that the rows in the test and
training datasets are divided evenly by properties that are near each city’s main river. The data that
identifies that a property is near a river is held in the column named NextToRiver. You want to
complete this task before the data goes through the sampling process.
When you train a Linear Regression module using a property dataset that shows data for property
prices for a large city, you need to determine the best features to use in a model. You can choose
standard metrics provided to measure performance before and after the feature importance process
completes. You must ensure that the distribution of the features across multiple training models is
consistent.
Data visualization
You need to provide the test results to the Fabrikam Residences team. You create data visualizations
to aid in presenting the results.
You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test
evaluation of the model. You need to select appropriate methods for producing the ROC curve in
Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision
Jungle modules with one another.
Question: 15
DRAG DROP
You need to implement early stopping criteria as stated in the model training requirements.
Which three code segments should you use to develop the solution? To answer, move the
appropriate code segments from the list of code segments to the answer area and arrange them in
the correct order.
NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct
orders you select.
Answer:
Explanation:
You need to implement an early stopping criterion on models that provides savings without
terminating promising jobs.
Truncation selection cancels a given percentage of lowest performing runs at each evaluation
interval. Runs are compared based on their performance on the primary metric and the lowest X%
are terminated.
Example:
from azureml.train.hyperdrive import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1,
truncation_percentage=20, delay_evaluation=5)
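What the policy decides at one evaluation interval can be sketched without the SDK. The function below is a hypothetical illustration, not the service's implementation: it assumes the primary metric is being maximized and that the number of runs to cancel is rounded down. The run names and metric values are made up.

```python
def truncation_selection(run_metrics, truncation_percentage):
    """Return the run ids to cancel: the lowest X% by primary metric (higher is better)."""
    n_cancel = int(len(run_metrics) * truncation_percentage / 100)
    ranked = sorted(run_metrics, key=run_metrics.get)  # worst first
    return set(ranked[:n_cancel])


metrics = {"run1": 0.91, "run2": 0.72, "run3": 0.85, "run4": 0.60, "run5": 0.88}
cancelled = truncation_selection(metrics, truncation_percentage=20)
```

With five runs and a truncation percentage of 20, exactly one run (the worst) is cancelled at this interval, while the rest continue.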
Incorrect Answers:
Bandit is a termination policy based on slack factor/slack amount and evaluation interval. The policy
early terminates any runs where the primary metric is not within the specified slack factor / slack
amount with respect to the best performing training run.
Example:
delay_evaluation=5
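For comparison, the slack-factor rule can be sketched in the same way. This is an illustration with hypothetical names and made-up metrics, assuming the primary metric is maximized: a run is cancelled when its metric is below the best run's metric divided by (1 + slack factor).

```python
def bandit_cancellations(run_metrics, slack_factor):
    """Return run ids whose metric falls below best / (1 + slack_factor).

    Assumes the primary metric is being maximized.
    """
    best = max(run_metrics.values())
    cutoff = best / (1 + slack_factor)
    return {run for run, value in run_metrics.items() if value < cutoff}


metrics = {"run1": 0.80, "run2": 0.70, "run3": 0.60}
cancelled = bandit_cancellations(metrics, slack_factor=0.2)
```

Unlike truncation selection, the number of cancelled runs is not fixed: it depends on how far each run trails the best one.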
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters
Question: 16
HOTSPOT
You need to identify the methods for dividing the data according to the testing requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
Answer:
Explanation:
Sampling
Question: 17
HOTSPOT
You need to configure the Permutation Feature Importance module for the model training
requirements.
What should you do? To answer, select the appropriate options in the dialog box in the answer area.
Answer:
Explanation:
Box 1: 500
For Random seed, type a value to use as seed for randomization. If you specify 0 (the default), a
number is generated based on the system clock.
A seed value is optional, but you should provide a value if you want reproducibility across runs of the
same experiment.
Scenario: Given a trained model and a test dataset, you must compute the Permutation Feature
Importance scores of feature variables. You need to set up the Permutation Feature Importance
module to select the correct metric to investigate the model’s accuracy and replicate the findings.
Regression. Choose one of the following: Precision, Recall, Mean Absolute Error, Root Mean Squared
Error, Relative Absolute Error, Relative Squared Error, Coefficient of Determination
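The role of the random seed can be illustrated with a small permutation importance sketch. It is not the Studio module: the model, the column names, and the data are hypothetical, and mean absolute error is used as the metric. One feature column is shuffled with a seeded generator and the increase in error is reported; the same seed gives the same score on every run.

```python
import random


def mean_absolute_error(model, rows, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature, seed):
    """Increase in error after shuffling one feature column with a fixed seed."""
    baseline = mean_absolute_error(model, rows, targets)
    shuffled_values = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled_values)
    shuffled_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled_values)]
    return mean_absolute_error(model, shuffled_rows, targets) - baseline


# Hypothetical "trained model": depends on rooms only and ignores noise.
model = lambda r: 10.0 * r["rooms"]
rows = [{"rooms": i, "noise": (i * 7) % 5} for i in range(1, 9)]
targets = [10.0 * r["rooms"] for r in rows]
```

Shuffling a feature the model ignores leaves the error unchanged, so its importance is zero, while shuffling a feature the model relies on can only leave the error the same or make it worse here.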
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/permutation-feature-importance
Question: 18
HOTSPOT
You need to configure the Edit Metadata module so that the structure of the datasets matches.
Which configuration options should you select? To answer, select the appropriate options in the
answer area.
Answer:
Explanation:
Scenario: An initial investigation shows that the datasets are identical in structure apart from the
MedianValue column. The smaller Paris dataset contains the MedianValue in text format, whereas
the larger London dataset contains the MedianValue in numerical format.
Box 2: Unchanged
Note: Select the Categorical option to specify that the values in the selected columns should be
treated as categories.
For example, you might have a column that contains the numbers 0,1 and 2, but know that the
numbers actually mean "Smoker", "Non smoker" and "Unknown". In that case, by flagging the
column as categorical you can ensure that the values are not used in numeric calculations, only to
group data.
Question: 19
DRAG DROP
Which three actions should you perform in sequence? To answer, move the appropriate actions from
the list of actions to the answer area and arrange them in the correct order.
Answer:
Explanation:
Scenario: Columns in each dataset contain missing and null values. The datasets also contain many
outliers.
Scenario: You produce a regression model to predict property prices by using the Linear Regression
and Bayesian Linear Regression modules.
Regularization typically is used to avoid overfitting. For example, in L2 regularization weight, type the
value to use as the weight for L2 regularization. We recommend that you use a non-zero value to
avoid overfitting.
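The shrinking effect of an L2 weight can be seen in the simplest possible case. The sketch below is an illustration with made-up data, assuming a one-feature linear model with no intercept, for which the penalized least-squares slope has the closed form sum(x*y) / (sum(x*x) + l2_weight). The function name is hypothetical.

```python
def ridge_slope(xs, ys, l2_weight):
    """Slope of y = w * x (no intercept) minimizing squared error + l2_weight * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2_weight)


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
unregularized = ridge_slope(xs, ys, l2_weight=0.0)
regularized = ridge_slope(xs, ys, l2_weight=1.0)
```

A non-zero weight pulls the coefficient toward zero, which makes the fitted model less sensitive to noise in the training data.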
Scenario:
Model fit: The model shows signs of overfitting. You need to produce a more refined regression
model that reduces the overfitting.
Incorrect Answers:
Decision jungles are a recent extension to decision forests. A decision jungle consists of an ensemble
of decision directed acyclic graphs (DAGs).
L-BFGS:
L-BFGS stands for "limited memory Broyden-Fletcher-Goldfarb-Shanno". It can be found in the
Two-Class Logistic Regression module, which is used to create a logistic regression model that can be
used to predict two (and only two) outcomes.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/linear-regression
Question: 20
DRAG DROP
You need to visually identify whether outliers exist in the Age column and quantify the outliers
before the outliers are removed.
Which three Azure Machine Learning Studio modules should you use in sequence? To answer, move
the appropriate modules from the list of modules to the answer area and arrange them in the correct
order.
Answer:
Explanation:
Create Scatterplot
Summarize Data
Clip Values
You can use the Clip Values module in Azure Machine Learning Studio, to identify and optionally
replace data values that are above or below a specified threshold. This is useful when you want to
remove outliers or replace them with a mean, a constant, or other substitute value.
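The behavior described above can be sketched in a few lines of Python. This is an illustration, not the Studio module: the function name, the thresholds, and the sample ages are assumptions. Values outside the thresholds are replaced either by the nearest threshold or by a supplied substitute such as a constant.

```python
def clip_values(values, lower, upper, substitute=None):
    """Replace values outside [lower, upper]; by default use the nearest threshold."""
    clipped = []
    for v in values:
        if v < lower:
            clipped.append(lower if substitute is None else substitute)
        elif v > upper:
            clipped.append(upper if substitute is None else substitute)
        else:
            clipped.append(v)
    return clipped


ages = [3, 25, 40, 250, 12]
```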
Reference:
https://blogs.msdn.microsoft.com/azuredev/2017/05/27/data-cleansing-tools-in-azure-machine-learning/
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clip-values
Question: 21
HOTSPOT
How should you configure the Clean Missing Data module? To answer, select the appropriate options
in the answer area.
Answer:
Explanation:
Replace using MICE: For each missing value, this option assigns a new value, which is calculated by
using a method described in the statistical literature as "Multivariate Imputation using Chained
Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method, each
variable with missing data is modeled conditionally using the other variables in the data before filling
in the missing values.
Scenario: The AccessibilityToHighway column in both datasets contains missing values. The missing
data must be replaced with new data so that it is modeled conditionally using the other variables in
the data before filling in the missing values.
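The phrase "modeled conditionally using the other variables" can be illustrated with a deliberately simplified sketch. It is not MICE itself: full MICE cycles over every incomplete variable and produces multiple imputations, whereas this example performs one regression pass with a single predictor. The helper names and the sample values are hypothetical; only the column names come from the scenario.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def impute_conditionally(rows, target, predictor):
    """Fill missing target values (None) from a regression on another column."""
    complete = [r for r in rows if r[target] is not None]
    a, b = fit_line([r[predictor] for r in complete], [r[target] for r in complete])
    return [
        r if r[target] is not None else dict(r, **{target: a + b * r[predictor]})
        for r in rows
    ]


rows = [
    {"Age": 10, "AccessibilityToHighway": 2.0},
    {"Age": 20, "AccessibilityToHighway": 4.0},
    {"Age": 30, "AccessibilityToHighway": None},
    {"Age": 40, "AccessibilityToHighway": 8.0},
]
filled = impute_conditionally(rows, "AccessibilityToHighway", "Age")
```

The missing value is predicted from the rows where both columns are present, rather than being replaced by a global mean or constant.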
Box 2: Propagate
Cols with all missing values: Indicates whether columns in which all values are missing should be
preserved in the output.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
Question: 22
DRAG DROP
You need to produce a visualization for the diagnostic test evaluation according to the data
visualization requirements.
Which three modules should you recommend be used in sequence? To answer, move the
appropriate modules from the list of modules to the answer area and arrange them in the correct
order.
Answer:
Explanation:
Start by using the "Tune Model Hyperparameters" module to select the best sets of parameters for
each of the models we're considering.
One of the interesting things about the "Tune Model Hyperparameters" module is that it not only
outputs the results from the Tuning, it also outputs the Trained Model.
Scenario: You need to provide the test results to the Fabrikam Residences team. You create data
visualizations to aid in presenting the results.
You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test
evaluation of the model. You need to select appropriate methods for producing the ROC curve in
Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision
Jungle modules with one another.
Reference:
http://breaking-bi.blogspot.com/2017/01/azure-machine-learning-model-evaluation.html
Question: 23
HOTSPOT
You need to set up the Permutation Feature Importance module according to the model training
requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
Answer:
Explanation:
Box 1: Accuracy
Scenario: You want to configure hyperparameters in the model learning process to speed the
learning phase by using hyperparameters. In addition, this configuration should cancel the lowest
performing runs at each evaluation interval, thereby directing effort and resources towards models
that are more likely to be successful.
Box 2: R-Squared
Question: 24
HOTSPOT
You need to configure the Filter Based Feature Selection module based on the experiment
requirements and datasets.
How should you configure the module properties? To answer, select the appropriate options in the
dialog box in the answer area.
Answer:
Explanation:
The mutual information score is particularly useful in feature selection because it maximizes the
mutual information between the joint distribution and target variables in datasets with many
dimensions.
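How a mutual information score separates an informative feature from an uninformative one can be sketched for discrete values. The function name and the toy data are hypothetical, and numeric columns would need to be binned before a calculation like this applies. The score is 1 bit when the target is fully determined by a balanced two-valued feature and 0 when the two are independent.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


feature = ["low", "low", "high", "high"]
target_same = ["cheap", "cheap", "costly", "costly"]   # fully determined by feature
target_mixed = ["cheap", "costly", "cheap", "costly"]  # independent of feature
```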
Box 2: MedianValue
Scenario: The MedianValue and AvgRoomsinHouse columns both hold data in numeric format. You
need to select a feature selection algorithm to analyze the relationship between the two columns in
more detail.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/filter-based-feature-selection
Question: 25
A. Mutual information
C. Kendall correlation
Answer: C
Explanation:
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau coefficient
(after the Greek letter τ), is a statistic used to measure the ordinal association between two
measured quantities.
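The coefficient can be computed by counting concordant and discordant pairs. The sketch below implements the simplest variant (tau-a) and assumes there are no tied values; the function name and the sample data are hypothetical.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs (no tie handling)."""
    concordant = discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


rooms = [3, 4, 5, 6, 7]
value = [120, 150, 140, 200, 260]
```

Of the ten pairs in this data, nine are ordered the same way in both columns and one is reversed, giving a tau of 0.8.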
Scenario: When you train a Linear Regression module using a property dataset that shows data for
property prices for a large city, you need to determine the best features to use in a model. You can
choose standard metrics provided to measure performance before and after the feature importance
process completes. You must ensure that the distribution of the features across multiple training
models is consistent.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-selection-modules
Question: 26
HOTSPOT
You need to identify the methods for dividing the data according to the testing requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
Answer:
Explanation:
Scenario: Testing
You must produce multiple partitions of a dataset based on sampling using the Partition and Sample
module in Azure Machine Learning Studio.
Use Assign to folds option when you want to divide the dataset into subsets of the data. This option
is also useful when you want to create a custom number of folds for cross-validation, or to split
rows into several groups.
Not Head: Use Head mode to get only the first n rows. This option is useful if you want to test a
pipeline on a small number of rows, and don't need the data to be balanced or sampled in any way.
Not Sampling: The Sampling option supports simple random sampling or stratified random sampling.
This is useful if you want to create a smaller representative sample dataset for testing.
Specify the partitioner method: Indicate how you want data to be apportioned to each partition,
using these options:
Partition evenly: Use this option to place an equal number of rows in each partition. To specify the
number of output partitions, type a whole number in the Specify number of folds to split evenly into
text box.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/partition-and-sample
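The idea behind Assign to folds with Partition evenly can be sketched in plain Python. This is only an illustration of even, non-overlapping fold assignment, not the module's actual implementation; the function name `assign_to_folds` and its `seed` parameter are hypothetical.

```python
import random

def assign_to_folds(rows, n_folds, seed=0):
    """Randomly split rows into n_folds partitions of near-equal size."""
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal rows out round-robin so fold sizes differ by at most one.
    return [shuffled[i::n_folds] for i in range(n_folds)]

# Hypothetical usage: ten row indices split into three folds.
folds = assign_to_folds(range(10), n_folds=3)
for k, fold in enumerate(folds):
    print(k, sorted(fold))
```

Every row lands in exactly one fold, so the partitions are mutually exclusive and together cover the whole dataset.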
Question: 27
A. Spearman correlation
B. Mutual information
C. Mann-Whitney test
D. Pearson’s correlation
Answer: A
Explanation:
Spearman's rank correlation coefficient assesses how well the relationship between two variables
can be described using a monotonic function.
Note: Both Spearman's and Kendall's can be formulated as special cases of a more general
correlation coefficient, and they are both appropriate in this scenario.
Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You
need to select a feature selection algorithm to analyze the relationship between the two columns in
more detail.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-selection-modules
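To make the monotonic-function point concrete, the sketch below computes Spearman's coefficient as Pearson's correlation applied to ranks. It is a minimal stdlib-only illustration; the sample values for the two columns are made up, and the helper names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: value rises monotonically but not linearly with rooms.
avg_rooms_in_house = [3, 4, 5, 6, 7, 8]
median_value = [r ** 3 for r in avg_rooms_in_house]
print("Spearman:", spearman(avg_rooms_in_house, median_value))
print("Pearson: ", pearson(avg_rooms_in_house, median_value))
```

On this made-up data the rank-based coefficient reaches 1 because the relationship is perfectly monotonic, while Pearson's coefficient stays below 1 because the relationship is not linear.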
Question: 28
Note: This question is part of a series of questions that present the same scenario. Each question in
the series contains a unique solution that might meet the stated goals. Some question sets might
have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these
questions will not appear in the review screen.
You are analyzing a numerical dataset which contains missing values in several columns.
You must clean the missing values using an appropriate operation without affecting the
dimensionality of the feature set.
Solution: Replace each missing value using the Multiple Imputation by Chained Equations (MICE)
method.
A. Yes
B. No
Answer: A
Explanation:
Replace using MICE: For each missing value, this option assigns a new value, which is calculated by
using a method described in the statistical literature as "Multivariate Imputation using Chained
Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method, each
variable with missing data is modeled conditionally using the other variables in the data before filling
in the missing values.
Note: Multivariate imputation by chained equations (MICE), sometimes called “fully conditional
specification” or “sequential regression multiple imputation” has emerged in the statistical literature
as one principled method of addressing missing data. Creating multiple imputations, as opposed to
single imputations, accounts for the statistical uncertainty in the imputations. In addition, the
chained equations approach is very flexible and can handle variables of varying types (e.g.,
continuous or binary) as well as complexities such as bounds or survey skip patterns.
Reference:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
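The chained-equations idea described above (model each variable with missing data conditionally on the others, then fill in) can be sketched for two numeric columns. This is a deliberately simplified, assumption-laden illustration: it uses simple linear regression, produces a single deterministic fill rather than multiple imputations, and is not the Clean Missing Data module's algorithm. The names `fit_line` and `chained_impute` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def chained_impute(cols, n_iter=10):
    """Fill None entries in two equal-length columns by chained regressions."""
    a, b = (list(c) for c in cols)
    missing = [[i for i, v in enumerate(c) if v is None] for c in (a, b)]
    # Step 1: start from a simple mean fill of each column.
    for col, miss in zip((a, b), missing):
        observed = [v for v in col if v is not None]
        mean = sum(observed) / len(observed)
        for i in miss:
            col[i] = mean
    # Step 2: repeatedly re-predict each column's missing cells
    # from the other column, fitting only on originally observed targets.
    for _ in range(n_iter):
        for target, predictor, miss in ((a, b, missing[0]), (b, a, missing[1])):
            train = [i for i in range(len(target)) if i not in miss]
            slope, intercept = fit_line([predictor[i] for i in train],
                                        [target[i] for i in train])
            for i in miss:
                target[i] = slope * predictor[i] + intercept
    return a, b

# Made-up data where the second column is twice the first.
col_a, col_b = chained_impute(([1, 2, 3, 4, None], [2, 4, None, 8, 10]))
print(col_a)
print(col_b)
```

Both columns keep their original length and no column is dropped, which is why imputation of this kind leaves the dimensionality of the feature set unchanged.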
Question: 29
Note: This question is part of a series of questions that present the same scenario. Each question in
the series contains a unique solution that might meet the stated goals. Some question sets might
have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these
questions will not appear in the review screen.
You are analyzing a numerical dataset which contains missing values in several columns.
You must clean the missing values using an appropriate operation without affecting the
dimensionality of the feature set.
Solution: Remove the entire column that contains the missing data point.
A. Yes
B. No
Answer: B
Explanation:
Reference:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
Question: 30
Note: This question is part of a series of questions that present the same scenario. Each question in
the series contains a unique solution that might meet the stated goals. Some question sets might
have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these
questions will not appear in the review screen.
You are analyzing a numerical dataset which contains missing values in several columns.
You must clean the missing values using an appropriate operation without affecting the
dimensionality of the feature set.
Solution: Use the Last Observation Carried Forward (LOCF) method to impute the missing data points.
A. Yes
B. No
Answer: B
Explanation:
Replace using MICE: For each missing value, this option assigns a new value, which is calculated by
using a method described in the statistical literature as "Multivariate Imputation using Chained
Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method, each
variable with missing data is modeled conditionally using the other variables in the data before filling
in the missing values.
Note: Last observation carried forward (LOCF) is a method of imputing missing data in longitudinal
studies. If a person drops out of a study before it ends, then his or her last observed score on the
dependent variable is used for all subsequent (i.e., missing) observation points. LOCF is used to
maintain the sample size and to reduce the bias caused by the attrition of participants in a study.
Reference:
https://methods.sagepub.com/reference/encyc-of-research-design/n211.xml
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
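For comparison, LOCF as described in the note above can be sketched in a few lines. This minimal illustration assumes the values are ordered in time for a single subject; the function name `locf` and the sample scores are hypothetical.

```python
def locf(series):
    """Replace each None with the most recent observed value before it."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        # A leading None stays None because nothing has been observed yet.
        filled.append(last)
    return filled

# Made-up scores for one participant across six visits.
scores = [7.0, 6.5, None, None, 5.0, None]
print(locf(scores))  # [7.0, 6.5, 6.5, 6.5, 5.0, 5.0]
```

The fill depends entirely on row order, so the method only makes sense when consecutive rows are repeated observations of the same subject.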