Proe Summary: A Powerful Exploratory Data Analysis Tool: Systems Seminar Consultants, Kalamazoo, MI
Proe Summary: A Powerful Exploratory Data Analysis Tool: Systems Seminar Consultants, Kalamazoo, MI
These procedures allow us to explore our data not Aggregating Data with PROC SUMMARY
only in terms of counts and distributions, but also The CLASS statement in PROC SUMMARY names the
statistically. Both procedures now support the CLASS character or numeric variables in which the data will
statement (this is new to PROC MEANS starting with be CLASSified upon. The variables listed on the
Version 6). The CLASS statement acts much like the CLASS statement should be categorical variables, that
BY, only it does not require sorting and in most cases is, they should have a few number of discrete values.
is more effiCient than the BY statement. The VAR statement on the other hand, names the
numeric variables which will be analyzed. The
For the purpose of this paper, the examples will use OUTPUT OUT = statement names the output SAS data
PROC SUMMARY. Keep in mind, however, PROC set. It also defines which statistics will be used with
MEANS could have been used instead and it would which variables and what they will be called on the
produce the exact same results. output SAS data set.
Optional statements (partial list): How are the TYPE Values Useful?
CLASS variable(s); The TYPE variableis a part of the output SAS data
VAR variable(s); set built (meaning if you did a PROC CONTENTS you
BY variable(s); would see it on there). This also means that you can
WHERE where condition(s); query against this variable's value and use it to create
OUTPUT OUT= SASdataset options=variable(s); different reports, each of which contain different
198
Beginning Tutorials
199
Beginning Tutorials
200
Beginning Tutorials
PROC PRINT DATA=SUMDS; Also notice that the NWAY option was added to the
TITLE 'WITHOUT USING THE CLASS'; PROe SUMMARY statement. The NWAY option
RUN; eliminates intermediate levels of summarization- in this
(See Output #8 for results) case we only got the _TYPE_ =3 records.
Because a VAR statement is being used different You can format CLASS variable regardless if the
statistics c,!n be specified on the OUTPUT SAS data NWAY option is being used or not.
set.
Creating Flat Files
Formatting CLASS Variables Some times after the data has been aggregated you
User-defined formats can be used when CLASSifying may want to take the result of that summarization and
data. put it into a flat file so it could be downloaded to
another platform and imported in to another software
PROC FORMAT; product. It is very easy to take the results of PROe
VALUE $ STNOFMT SUMMARY and create flat file if you know what you
'AK','AZ','CA','CO','NM', are doing:
'NV','OR','WA' = 'WEST'
'AL','FL','GA','KY','MO', PROC SUMMARY DATA=MKTDATA NWAY;
'OK','SC','TN','VA','WV' ='SOUTH' CLASS STNORE STRCTYP;
'CT','DC','DE','MA','MD', VAR FRCSLVL MKTVL;
'NC','NH','NJ','NY','PA', OUTPUT OUT= SUMDS
'RI' ='EAST SUM(MKlVL) =TOTMKTVL
'IL', 'IN','KS','MI' ,'MN', MEAN(FRCSLVL) = AVGFORCL
'NE','OH', 'WI' = 'MIDWEST; MIN(MKTVL) =MINMKlVL
RUN; MAX(MKTVL) =MAXMKTVL;
RUN;
PROC SUMMARY DATA=MKTDATA NWAY;
CLASS STNORE STRCTYP; FILENAME OUTFILE 'path-name';
FORMAT STNORE $STNOFMT.;
VAR FRCSLVL MKlVL; DATA NULL;
OUTPUT OUT=SUMDS SET SUMDS(DROP= _TYPE__FREQj;
SUM(MKTVL) =TOTMKTVL FILE OUTFILE NOTITLES;
MEAN(FRCSLVL) =AVGFORCL PUT @1 STRCTYP $2. @5 STNORE $2.
MIN (MKTVL) =MINMKlVL @10TOTMKTVL 8. @20AVGFORCL 10.
MAX(MKTVL) = MAXMKlVL; RUN; @35 MINMKTVL 7. @45 MAXMKTVL 10.;
(See Output #10 for results)
PROC PRINT DATA=SUMDS;
TITLE 'FORMATTING A CLASS VARIABLE'; Always use DATA NULL when creating flat files.
RUN; This prevents a SAS data set from being built. The
(See Output #9 for results) only reason to ever build a SAS data set is so that
you can pass it on to another DATA or PROe step for
You want to be careful of formatting CLASS variables further processing. You usually buDd a flat file as a
if you are going to tum around later on and check for last step within your program, so there is no need to
the formatted value (i.e. WHERE STNORE='EAST';), build a SAS data set. Remember, DATA; does not
because remember that formats DO NOT change the prevent a data set from being built, SAS builds one for
value of the variable, they simply change their us, but it chooses the name. The ONLY way to
appearance. It may be better to create a variable in prevent a data set from being built is by saying DATA
a DATA step prior to PROC SUMMARY that physically NULL.
contains the formatted value by using the PUT
- -
function (FSTNORE=PUT(STNORE,$STNOFMT.);) and Both SAS-defined and user-defined formats can be
use that on your CLASS statement instead. used on the PUT statement.
201
Beginning Tutorials
The example above will create a fixed column. blank Another way of doing it would be to assign constants
delimited fde to the location specified on the rather than literals:
FILENAME statement (depending on your platform
you could also reference the name specified on the DATA NUU:
FILE statement in JCL. ASSIGN statements. SET SUMDS(DROP= _TYPE__FREQj:
ALLOCates. etc.). RETAIN DQUOTE'" COMMA ',';
FILE OUTFILE NOTITLES:
Creating a Flat File to Import Into Lotus PUT DQUOTE STRCTYP
Character literals can be specified on the PUT DQUOTE COMMA DQUOTE STNORE
statement. which is useful in creating a delimited file DQUOTE COMMA DQUOTE TOTMKTVL
that Lotus can import. This technique can be used to DQUOTE COMMA DQUOTE AVGFORCL
create flat fHes that are delimited by characters other DQUOTE COMMA OQUOTE MINMKTVL
than blanks. Check the reference manual of the DQUOTE COMMA DQUOTE MAXMKTVL
product you wHI be importing Into under "Importing DQUOTE:RUN:
Files· to determine what type of delimited file the (See Output #13 for results)
particular product can read most easily. Use that
delimiter in place of the delimiters used in the SAS PUTs a blank after character constants as well,
example below. so the decrernental pointer used In the previous
example will need to be used again:
PROC SUMMARY DATA=MKTDATA NWAY;
CLASS STNORE STRCTYP; DATA NULL:
VAR FRCSlVl MKTVL; SET SUMDS(DROP= _TYPE__FREQj:
OUTPUT OUT= SUMDS RETAIN DQUOTE '"' COMMA ',':
SUM(MKTVl) =TOTMKTVl FILE OUTFILE NOTITLES;
MEAN (FRCSLVL) =AVGFORCl PUT DQUOTE +(-1) STRCTYP +(-1)
MIN(MKTVL) =MINMKTVl DQUOTE +(-1) COMMA +(-1)
MAX(MKTVL) = MAXMKTVL; DQUOTE +(-1) STNORE +(-1)
RUN; . DQUOTE +(-1) COMMA +(-1)
FILENAME OUTFILE 'path-name'; DQUOTE +(-1) TOTMKTVL +(-1)
DQUOTE +(-1) COMMA +(-1)
DATA NULL: DQUOTE +(-1) AVGFORCL +(-1)
SET SUMDS(DROP= _TYPE_fREQ.J: DQUOTE +(-1) COMMA +(-1)
FILE OUTFILE NOTITLES: DQUOTE +(-1) MINMKTVL +(-1)
PUT ••, STRCTYP ..... STNORE DQUOTE +(-1) COMMA +(-1)
...... TOTMKTVl··.. • AVGFORCL DQUOTE +(-1) MAXMKTVL +(-1)
,
,. ot
MINMKTVL ,. ,'" MAXMKTVL '.'; DQUOTE;
RUN: RUN:
(See Output #11 for results) (See Output #14 for results)
Note that SAS writes out a blank after PUTting out the In Summary
value of a variable. We can solve this by "tricking" As we have seen through the course of this paper,
SAS into thinking It has a decremental pointer: PROC SUMMARY is a very powerful exploratory data
analysis tool. It allows us to explore our data both
DATA NUU; statistically as well as in terms of how our data values
SET SUMDS(DROP= _TYPE __FREQj: are distributed in a single PROC step.
FILE OUTFILE NOTITLES:
PUT ••• STRCTYP +(-1) '",.. STNORE +(-1) We also have seen how we can post-process the
"',n' TOTMKTVL +(-1) ''',.' AVGFORCL +(-1) output SAS data set from PROC SUMMARY in DATA
''',.. MINMKTVL +(-1) '.,.' MAXMKTVL +(-1) or PROC steps to produce various types of reports
.... ;RUN; and flat files which can then be downloaded and
(See Output #12 for results) imported into any software product.
202
Beginning Tutorials
Trademark Notice
SAS is a registered trademark of the SAS Institute
Inc., Cary, NC, USA and other countries.
203
Beginning Tutorials
Partial Output:
RESULT OF PROC SUMMARY
STNORE STRCTYP _TYPE_
-FREQ_ TOTMKTVL AVGFORCL MINMKTVL MAXMKTVL
0 3034 371473153 109400.72 10000 3300002
NA 1 108 11824700 97274.31 27000 350000
ND 1 2109 268794599 113624.23 15000 3300002
YA 1 62 6089850 89167.13 25500 235000
••••••••••.•••••••••••••••••••••••••••••••• ETC •••••••••••••••••••••••••••••••••••••
AK 2 1 46000 36800.00 46000 46000
AL 2 17 1081600 56213.00 39000 137500
AZ 2 40 3905000 89237.50 35000 358000
•••••••••.••••••••••••••••••••••••••••••••• ETC •.•.••••••••.••••••••••••••••••••••••
AK NA 3 1 46000 36800.00 46000 46000
AL NA 3 1 42000 35000.00 42000 42000
AL NO 3 9 548000 56435.67 39000 135000
• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • . .• ETC
OUtput fI2.
Partial Output:
REGIONAL MANAGER'S REPORT- STATE TOTALS
OBS STNORE TOTMKTVL AVGFORCL MINMKTVL MAXMKTVL
9 AK 46000 36800.00 46000 46000
10 AL 1081600 56213.00 39000 137500
11 AZ 3905000 89237.50 35000 358000
12 CA 116594652 1714BB.64 25000 3300002
13 CO 4858660 98243.09 30000 319000
•••••••••••••••••••••••••• ETC •••••••••••••••••••••••••••.••••.•
OUtput fI2.
204
Beginning Tutorials
Partial Output:
DETAILED ANALYSIS REPORT
OUtput tIZ
OUtput iI.5
Partial Output:
RESULT OF IMPLICIT OUTPUT STATEMENT
OUtput IS
205
Beginning Tutorials
Partial Output:
WHEN AN IMPLICIT OUTPUT
STATEMENT IS USEFUL
OUtput 16
Partial Output:
RESULT WITHOUT USING A VAR
1 0 ~~
2 NA 1 108
3 NO 1 2109
4 YA 1 62
••••••••••••••••••• ETC ••••••••••••••••••
9 AK 2 1
10 AL 2 17
11 AZ 2 40
• • • • • • • • • • • • • • • • • •• ETC ••••••••••••••••••
46 AK NA 3 1
47 AL NA 3 1
48 AL NO 3 9
••••••••••••••••••• ETC ••••••••••••••••••
OUtput ff1
OUtput 18
206
Beginning Tutorials
Partial Output:
FORMATTING A CLASS VARIABLE
OBS STNORE TOTMKTVL AVGFORCL MINMKTVL MAXMKTVL
1 WEST NA 3 25 3808000 139096.00 44500 350000
2 WEST NO 3 626 110375052 157244.74 26000 3300002
3 WEST YA 3 19 2440250 120378.95 57250 235000
•••••••••••••••.•..••••••••••••••••••••••••••••• ETC ••••••••••••••••••••••••••••••••.•••••••
output fJ9
207