0% found this document useful (0 votes)
109 views2 pages

Data Analytics 02: Pivoting of Data. You Might Be Familiar With The Concept of Pivoting From BI Tools or Excel: Rotate

This document discusses pivoting and renaming aggregate data. It shows how to pivot a Titanic passenger data set to aggregate counts by gender and passenger class. Regular expressions are then used to automatically rename the pivoted columns in a more readable way. Several tasks are provided to further experiment with pivoting and renaming the data.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
109 views2 pages

Data Analytics 02: Pivoting of Data. You Might Be Familiar With The Concept of Pivoting From BI Tools or Excel: Rotate

This document discusses pivoting and renaming aggregate data. It shows how to pivot a Titanic passenger data set to aggregate counts by gender and passenger class. Regular expressions are then used to automatically rename the pivoted columns in a more readable way. Several tasks are provided to further experiment with pivoting and renaming the data.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

DATA ANALYTICS 02

PIVOTING AND RENAMING


AGGREGATE DATA AND PIVOT.
In this lecture, you will learn about another common data blending technique, namely the
Pivoting of data. You might be familiar with the concept of Pivoting from BI tools or Excel: rotate
the data from a long table format (one attribute with a lot of examples) into a wide table format
(lots of attributes with a single example). This transformation is especially useful to aggregate
information along two or more dimensions as a step to prepare the data for machine learning.
Machine learning models need the data to be stored in a wide table format so you will encounter
this preprocessing step frequently before you start with the actual modeling.
PIVOT THE DATA.
Let’s create a table that shows how many passengers were in each class, broken down by
gender.

1. Drag the Titanic data into the process.


2. Add the operator Pivot and connect it.
3. In its Parameters, add Sex to the group by attributes.
4. Select Passenger Class as column grouping attribute.
5. Also use Passenger Class with function count as a new entry for aggregation attributes.

NOTE: The resulting data table has four columns and two rows. Each row represents one of the
values from the Sex column, (the group by attributes parameter). The three different values of
the column grouping attribute ( Passenger Class) become the three new columns. The actual
values in the table are representing the counts of each combination of groups in the rows, i.e.
the gender, and the groups in the columns, which is the passenger class in our case. For
example, we will get 144 females who booked first class.
RENAMING ATTRIBUTES WITH REGULAR EXPRESSIONS.
The names of the new columns reflect how they were created, but this is not always the easiest
to read. You could use the operator Rename for manually renaming the three attributes to
something nicer, like "Passenger Class First". Rename is the way to go if you only have few
attributes which should be renamed, but we will take a more advanced approach which would
allow you to rename hundreds of attributes at a time.

1. Search for the operator Rename by Replacing, add it, and connect it to Pivot.
2. Also connect the operator to the result port on the right.
3. Copy count\((.*)\)_(.*) into the replace what parameter field. Make sure that you get all
the parentheses right!
4. Copy $1 $2 into the replace by parameter.

NOTE: You might already be familiar with regular expressions (this is the name for the strange
parameters we have used for the renaming). They are a powerful tool and can be found in many

Page 1 of 2
Dr. Stephan Kupsch
DATA ANALYTICS 02
different places in RapidMiner. The expression you have used for replace what means that you
look for something between count( and )_ and then something else after the underscore. Those
two elements are identified later by the round brackets. Each time you use round brackets, you
define a new so-called capturing group which you can refer to in the definition of the
replacement. Since the round brackets have a special meaning here, we need to quote the
brackets in the name itself with a backslash. Finally, we can use the capturing groups in the
replace by parameter with the dollar sign and the number identifying the group. $1 is for the
content of the first group, which always happens to be "Passenger Class", and $2 is for
identifying the second group. Those are the three different classes "First", "Second", and "Third".
RUN THE PROCESS

1. Run the process.

NOTE: Your data set should now have column names that look like Passenger Class First.
You rotated an aggregated data set in wide table format. Pivot can be difficult to configure
sometimes. Just keep in mind that the group by attributes parameter will define the groups in
the rows with one row per group while the values of the column grouping attribute parameter
will define the new columns.

TASKS:

 Can you change the process so that the column names will be changed to "First
Passenger Class", "Second Passenger Class", and "Third Passenger Class"?
 Can you also change them to just say "First Class", "Second Class", and "Third Class"?
 Change the Pivot so that the gender is transformed into new columns and the passenger
class is defining three groups of data. How many columns and rows are you getting now?
 Try to adapt the renaming so that it just uses the gender as column names after the new
pivoting.
 Now, remove the Rename by Replacing operator and remove the column grouping
attribute from Pivot. Set Sex and Passenger Class as group by attributes and use
Passenger Class with count as the aggregation attribute. Run the process and inspect the
result. In how far is it different from the first result you obtained through Pivot?

Page 2 of 2
Dr. Stephan Kupsch

You might also like