sas basics
*For addig line numbers: select Program Editor window then go to Tools tab --> Option -->
Enhanced Editor. In General tab check Show line numbers in its check box.;
*==================================
Two methods for writing comments:
===================================;
*your comment;
/*your comment */
*==================================
Two basic building blocks:
===================================
• SAS programs are constructed from two basic building blocks:
1.DATA Steps (starts with data keyword)
2.PROC Steps (starts with proc keyword)
• DATA and PROC steps are made up of statements. A step
may have as few as one or as many as hundreds of
statements.
We normally use data step for importing a dataset or creating a new dataset,
or getting a subset of data set, or manipulating a dataset
and we use proc step to print, summerize , or visualize data;
*note: Sas is case sensitive only for values inside its data but for keywords and identifires( name of variables,
name of tables, ...) is not case sensitive;
*note: you have to put semicolon at the end of each statementl;
* for execution: highlight the block of code that has to have run keyword at the end
then select running person icon or F3 or run tab==>submit;
*Note: after any execution you must check the log even if you get the result;
Data test1;*test1 is the name of table(=dataset) that will be saved in work library;
a=2;
b=3;
c=A+b;* SAS is case sensitive only for values inside a table not for others;
run;
* i used the above data step to craete a data now for print that data I will use proc print;
proc print data=test1;* test1=work.test1;
run;
/* I want to create a table with info below
id name age
1 Hamid 41
2 Haiqing 60
3 Banu 18
4 Mahhi 21
*/
data work.test2;/*work.test2 is the same as test2*/
input id name $ age;
cards;
1 Hamid 41
2 Haiqing 60
3 Banu 18
4 Mahhi 21
;
proc print data=test2;
run;
*any dataset in SAS has two leveles for naming like Sashelp.CARS , first we must mention the name of libraray
then dot after that name of data set
Note: you can skip typing the name of library only if it is work libraray i.e. EUROPEANCAR=WORK.EUROPEANCAR
NOTE: Work library is the only temporary libraray that means all you have there will be disappear after closing
current session;
/*temp.data1 ==> the name of library is temp but it is permanent library( the only temporary library is work)
*/
proc print data=sashelp.class;
run;
proc print data=sashelp.cars;
run;
* Get subset of sashelp.cars that includes only European cars;
*highlight line 78 to 84 and execute;
DATA EUROPEANCAR;*We start a data step to create a dataset that its name is EUROPEANCAR , EUROPEANCAR is output data;
SET Sashelp.CARS;/*Sashelp.CARS is input data that I want to get subset of it */
WHERE Origin = "Europe"; *I added a condition that is Origin = "Europe";
*Note: SAS is case sensetive only for values of tables;
*you should put run statement at the end and highlight them togather then hit the running person icine for executing;
proc print data =europeancar;
run;
**syntax rules;
*SAS Program is a series of SAS statement;
*Most Statement not all of them begin with an identifying keyword;
*SAS statement must end with semicolon ==> so, for calculating number of statements count number of semicolons;
*These statement execute in order i.e. observation by observation ,and for ecah observation line by line (statement by statement),
That means SAS takes first observation and runs it all the way
through the DATA step (line by line) before looping back to pick up
the second observation.;
*====================================================================================;
*SAS terminolgy ;
*====================================================================================;
*data set=SAS Table;
*Variable (var) =columns;
*observation(obs)=rows(recrods);
* data values (val)=intersectional between Rows and Column;
*=======================================================================================;
*NAMING CONVENTIONS FOR SAS TABLES and VARABILES;
*========================================================================================;
*1. START WITH A LETTER OR AN UNDERSCORE;
*2. CAN CONTAIN ONLY LETTERS,NUMBERS,AND UNDERSCORES;
*3. CAN BE 32 CHARACTERS OR FEWER;
*=====================================================;
*NAMING CONVENTIONS FOR SAS LIBRARY(libref);
*======================================================;
*1. START WITH A LETTER OR AN UNDERSCORE;
*2. ONLY LETTERS,NUMBER,UNDERSCORE;
*3. 8 CHARACTERS OR FEWER;
*=============================================================
Type of libraries
==============================================================;
*SAS LIBARY:we have two types of library:
1.PERMANENT AND 2.TEMPORARY SAS LIBRARY
* The only temporary library is work==> all of the other libraries are permanent;
/*The WORK library is the only temporary storage location for SAS data sets.
It is also the default library. If you create a SAS data set without specifying a library, SAS
will put it in the WORK library, and then delete it when you end your session.*/
*=====================================================================
TYPE OF DATA VALUES
======================================================================;
*TYPE OF DATA VALUES : 1.Numeric or 2.chategoric:
If a variable contains letters or special characters, it must be a character(categoric)_ variable.
However, if it contains only numbers, then it may be numeric or character.
i.e. for numeric values you are allowed to assign a numeric or charector data type
For example you can store value=100 to numeric variable and charactor variable
but you can store value="Hamid2" only in cahractor variable
If YOU STORTE "HAMID2" IN NUMERiC VARAIBLE YOU WILL GET MISSING VALUES;
* IN SAS we consider two types for numbers :
1.standard like 348 0r 102.5 that doesn't have any special charectors and
2.non-standard numbers like 25% or 02-12-2022 or $152 or 3,245
for reading non-standard numbers we must specify the format of them(we call it informat and I will talk about that);
*NUMERIC :1. Standard numbers 345678432, 156.8 , -267.9
2. NON STANDARD NUMERIC DATA : have $, %, coma( like 2,564,790) ,dates,..
NOTE:dates in SAS are NON-STANDARD numeric and if you read them with the proper informat
they are stored as a number of days between that date and
date base (date base is 1st of Janurary 1960)
as an example if SAS read 10-Jan-1960 with proper foramt it stores it with value equals to 9 ((10-Jan-1960) - (1-Jan-1960))
;
*CHARACTER: values like "ABCDE","hamid rajaee","5 years" can store only in charector varibles;
/*numeric value: contains only numbers, and sometimes a decimal point and/or minus sign. When they are read into
a SAS data set, numeric values are stored in the floating-point format native to the operating environment.
Nonstandard numeric values can contain other characters as numbers; you must use formatted input to enable SAS to read them.*/
/*nonstandard data is data that can be read only with the aid of informats. Examples of nonstandard data include
numeric values that contain commas, dollar signs, blanks, date and time values; and hexadecimal and binary values.
Note: If you store charector value or non-standard numeric value in numeric variable without specifying the proper format
SAS will store missing valuees there
*/
data test2;
input id name $ age; *we need to specify charector varibles using $ after them;
/*for inserting values in SAS use datalines or cards satement */
datalines;
001 Hamid 41
002 Nick 6
003 Rasoul 40
;/* if you want to insert data maually type datalines or cards then semicolon then
values, after inserting values semcolon must be in the next line if you put in the last line that obs won't be inserted*/
run;
*this is simple list input that values are separted by space and have default length(8 bytes)
and we don't have non-standard numeric. If these conditions are not satisfied you should use column input,
formated input or mixed input that I will teach you in later
Note: in simple list input when you insert values for missing values you must type dot
and when you print them you will see dot for numeric missing values and blank for charector missing values
;
proc print data=test2;
run;
/*in the basic way of inserting VALUES(simple list input)
1.you must type dot for any missing values and
2.all values must be separated by at least one space( 2 spaces or more is the same as one space
3.All chrector values must have atmost 8 chrectors( the rest of that won't be saved in simple list input)*/
data test3;
input id name $ age;
cards;
001 Hamid .
002 . 6
003 Rasoul 40
;
run;
proc print data=test3;
run;*in the result/output missing values will be seen as a blank for chrectors and as a dot for numbers
but be notice that when you insert them in the simple list input you must type dot for both of them;
data test4;
input id $ name $ age; *you can store numbers in numeric or cahrector varibles
but you can't stor charector in numeric varibles . If you do that you will get missing values;
cards;
001 Hamid 40
002 Nick 6
003 Rasoul 40
;
/* cards or datalines are used to specify that you want to insert some values in the table*/
run;
proc print data=test4;
run;
data test5;
input id name age; *you can store numbers in numeric or cahrector varibles
but you can't stor charector in numeric varibles . If you do that you will get missing values;
cards;
001 Hamid 40
002 Nick 6
003 Rasoul 40
;
/* cards or datalines are used to specify that you want to insert some values in the table*/
run;
proc print data=test5;
run;
*======================================================================
Length of values
=======================================================================;
*
Character Length 1–32,767 bytes(Case sensitive) Default 8 bytes
( for each charector you need 1 byte)
Numeric Length 3-8 bytes Default 8 bytes(16 digits);
data test5;
input id $ name $ age;
cards;
001 Hamid_Rajaee 41
002 Nick 6
003 Rasoul 40
;
run;
proc print data=test5;
run;* since it is simple list input the deafult length is 8 bytes so you can store up to 8 charectors for chrector values,
* The length of a numeric variable lies between 3 and 8
bytes. It means SAS can store a numeric value from 1 to
16 digits.
any character needs 1 byte to store as an example to
stroe Hamid we need at least 5 bytes and for storing Hamid_Rajaee you need at least 12 bytes.
;
data test6;
length name $ 15;
input id $ name $ age;
cards;
001 Hamid_Rajaee 41
002 Nick 6
003 Rasoul 40
;
run;
proc print data=test6;
run;
*___________________________________________________________________;
/*Hint: 1)sas program is a series of sas statements
2)Most sas statement begins with an identifying key word like data,set,where,proc,..
all sas statements must end with semicolon(;)*/
/*, hint : if running of your program takes long time go to icone !(break) in above toolbar and hit that*/
/* SAS reads observation by observation and executes line by line for each observation */
*=====================================================================;
/*comment:*/
*======================================================================;
*Comments are colored green
Note:you should put your comment between asterisk and semicolon or between forward slash asterisk and asterisk forward slash;
* your comment ; /*These comments cannot contain internal semicolons or unmatched quotation marks.*
/* or */
/* your comment*/ /*These comments can contain semicolons and unmatched quotation marks.
Tip: When using comments within a macro definition or to hide text from the SAS macro facility,
use this style comment*/
/*or In the Microsoft Windows operating environment, if you use the Enhanced Editor,
you can comment out a block of code by highlighting the block and then pressing CTRL and / (forward slash) together.
To uncomment a block of code, highlight the block and press CTRL and SHIFT and / (forward slash) together.
*/
*==========================================================
Two basic building blocks of SAS program
===========================================================;
/* the program contains 2 steps:
1)data step
2)Proc step
we use proc step to craete a report or graph , ...*/
/*SAS statement usually begins with a keyword and end with a semicolon ==> each semicolon one statement*/
* Every time Check the log to make sure
the program run without errors or warnings;
DATA EUROPEANCAR;*we created (writing) a temporary data set that is EUROPEANCAR or work.EUROPEANCAR ;
SET Sashelp.CARS;*Sashelp.CARS is input data set(we reading permanent data set that its name is cars and it
is availbale in sashelp libraray);
WHERE Origin = 'Europe' ;*'Europe'="Europe";
run;
*in above data step Sashelp.CARS is input data set and EUROPEANCAR or WORK.EUROPEANCAR are output data set ;
*=======================================================
Browsing Data Portion and descriptor portion
=========================================================;
/*Browsing Data Portion*/
Proc print data=EUROPEANCAR;
run;
/*Browsing the Descriptor Portion*/
proc contents data=EUROPEANCAR; *be noticed that we have s at the end of contents;
run;
*=====================================================================================;
*ceating Library;
*=====================================================================================;
/*creating libraray*/
/*SAS datasets can be temporary or permanent.
Temporary SAS datasets only exist during the current SAS session.
Permanent SAS datasets are saved to a location on the computer or server and exist after exiting SAS.
Unless otherwise specified to be permanent, SAS considers all datasets to be temporary
data1=work.data1 */
/*There are a few guidelines to keep in mind when you create the name of your library.
Library names(libref name):
1.are limited to eight characters.
2.must begin with a letter or underscore.
3.can contain only letters, numbers, or underscores. Blanks are not allowed.*/
/*create your library :1) by coding or 2) by using GUI( Graphical user interface):
* before it create a folder in the server or your computer;
using GUI:You can also create a new SAS library by opening
the New Library window from the Libraries section of the navigation pane in SAS Studio:
icone new library in up toolbar or Explorer window ==>Right click on Library ==> New): give the name for example hmd,
enable at start up,
choose your path and ok **/
* I want to create a subset of sashelp.cars in my library;
data hr.ecar;
set sashelp.cars;
where origin='Europe';
run;
/*by coding:go to any drive for example.C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\My sas library
and create a folder for example My sas library then copy its address and come back to sas and write the keyword libname
and give any name you want as a libref for example Hmd and then
paste just the address in single or double quotation */
libname h1 "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\My sas library";
/* highlite the code and F3 or choose the runnig person icone*/
/*For any SAS session, You need to submit a LIBNAME statement (only once during each SAS session)
The libref remains in effect for the duration of that session.
But the data that you stored in that library exist for ever until you delete them,
but in any session you should say where that library is located .
Note:You can assign different names for libref for the same
location and use it i.e.
The libref can change, but the member name( file name = dataset names), and directory must the same.
(A libref is just a nickname that corresponds to the location of a SAS data library. )
as an example for the same folder I will assign another name as well*/
libname h2 "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\libreary for practice";
*==========================================================;
*LIBNAME statement
*===========================================================
A LIBNAME statement consists of
1.the keyword LIBNAME.
2.the libref name that you want to use.
(that have at most 8 charectos and starts with letter or underscore and contain leters, underscore and numbers)
3.the pathname that represents the physical location of the library and wrapped in sigle or double qoutation marks.
4.a semicolon.;
/*after every run check log*/
data euro;
set Sashelp.cars;
where origin= "Europe";
run;
*It created a temporary data set that its name is euro inside work libary euro=work.euro;
/*If you have a data in work library that you need for future you should better to save as in permanent library*/
/*to get a save as(copy and paste) euro dataset;*/
data h1.euro2;*euro2 is the output dataset in permanent library , euro is the input data in work (temporary) library;
set euro;
run;
*It created a permanent data set that its name is euro inside h1 libary;
proc print data=h1.euro;* if You don't mention which data SAS will execute for the most recent data;
run;
proc print;run;
data h1.Asia;*output data is h1.asia;
set Sashelp.cars;*input data is sashelp.cars;
where origin= "Asia";* be notice that unlike SQL that you are allowed to use only single qoutes to wrap the charector values, In SAS you can use either single or double qutes;
run;
proc print;* in proc if you skip mentioning name of data ,that proc run for the most recent data that you worked on it;
run;
data h1.USA;
set Sashelp.cars;
where origin= "USA";
run;
proc print Data =Asia;* you will get an error beacause asia is stoptred in h1 not work;
run;
proc print Data =h1.asia;
run;
*===========================================
Using SAS for calculation
*===========================================;
*find 2*3
*Method #1:you can use macro variable that I will teach you lator
*Method #2: using data step and proc step;
*highlite the next 4 lines and execute together;
data calculation;
result=2*3;*whenever you use a new name in datastep SAS will craete a new variable with that name ;
proc print ;
run;
data cal;
x=2;
y=3;
result=x*y;
run;
proc print data=cal;
run;
* Example:Create a SAS data set named distance and Convert 26.22 miles to kilometers;
DATA distance;
Miles = 26.22;* By using new name you can create new column(variable);
Kilometers = 1.61 * Miles;
RUN;
* Print the results;
PROC PRINT DATA = distance;
RUN;
*=================================================
do ... end
*====================================================;
* Example:Create a SAS data set named distance_2 and convert 1 to 20 miles to kilometers;
data distance_2;
do miles = 1 to 20;
kilometers = miles * 1.61;
output;* for saving result of any iteration in distance_2;
end;
run;
* I will talk about loop later in details;
*We have differnt types of loop and I will talk about looping more in day7 ;
proc
data=distance_2;
run;
*Missing Data Values :
A Character missing value is displayed as a blank
A numeric missing value is displayed as a period (.);
/*
SAS Statements are free format:
• One or more blanks or special characters can be used to separate words
• They can begin and end in any column(i.e. you can put spaces at the begining)
• A single statement can span multiple lines
• Several statement can be on the same line
but becarefull that each staement must be ended with semicolon
*/
/*In any session you should specify where that library is located .You can assign different name for libref for
same location and use it i.e.
The libref can change, but the member name( file name), and directory must the same.
(A libref is just a nickname that corresponds to the location of a SAS data library. )
*/
*
craete a data in SAS (manually ie. by inserting values) :
step1: assigning a name for data(table) (After data keyword)
stpe2: assigning names for variables(columns) and assigning data type for them (after input keyword)
step3: inserting values (after cards or datalines keyword)
;
data my_1st_data;*output data is my_1st_data=work.my_1st_data that is temporary data;
input name $ age gender $ weight height;/* we use input statement to assign variable's name= column's name=header :
variables'name:
1)are limited to 32 characters.
2)must begin with a letter or underscore.
3)can contain only letters, numbers, or underscores. Blanks are not allowed*/
*To create a character column, specify a $ sign righ after the sapace after
column name.;
*Note: unlike SQL that we use comma to separate columns names in SAS we use sapace;
* when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.;
*Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns;
/* For inserting data you should type cards or datalines then semicolon and in new rows type observations after
the last observations(rows) go to new line and type semicolon*/
*for inserting values use cards or datalines;
cards;
Hamid 41 . 82 185
Nick . M . 124
Sarah 34 F 76 167cm
. 4 F 18 115
Aryan 12 M 67 156
Rosa 34 F 57 163
;
*you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted;
run;
*So important: after any execution in SAS you must check the log to make sure that result is reliable or not;
proc print;run;*in any procedure if you skip calrifying name of data , that proc excute for the last data that is executed;
**in result output characters are on left side, numeric are on the right side;
*Note: for numeric variable if you insert charector you will get missing values;
* I stored that privouse data set in work library, Since it is temporary libraray after closing the SAS program ,
this data is goinig to be disapear. If you want to save it permanently you have to create that data in permanent
library or getting save as like ;
Data Hamid.myData;*Hamid.myData is output data that I want to save in in Hamid libraray;
*note: you can save it with the same name like Hamid.my_1st_data;
set my_1st_data;*my_1st_data is input data that exits in work libraray and I want to get a copy of it;
Run;* I created new data set in hamid library and I named it mydata and I save as all data that I have in my_1st_data;
data test_1;
input id name $ family $ height ;* several sapces is the same as one space;
/*cards=datalines*/
cards;
001 Hamid Rajaee 184.7
002 Mudasir khan 164.6
003 Nick rajaee 154.2
;
run;
proc print data=test_1; run;
proc print data=test_1 noobs; run;
* you can save numbers as a charector as well;
data test_2;
input id $ name $ family $ height ;* several sapces is the same as one space;
cards;
001 Hamid Rajaee 184.7
002 Mudasir khan 164.6
003 Nick rajaee 154.2
;
run;
proc print data=test_2; run;
data h1.test2;
length name $ 15;
input id $ name $ age;
cards;
001 Melaku_Tessema 45
002 Nazreit_zeleke 40
003 Bilen_Tesfaye 11
;
run;
proc print;run;
data mytable;
input Id Name $ Age Tell $ Height;
datalines;
1 Hamid . 41 647-996-4597 185
2 Fu 23 416-987-3245 176
3 Ajana 21 647-998-8976 .
4 shiva 20 647-657-3245 177
5 kamal 36 416-987-0876 177
;
run;
PROC PRINT data=mytable;RUN;
/*Brows Data descriptor*/
proc contents data=mytable;run;
* for inserting any non-standard numbers in your dataset( table) you must use informat;
*************************************************************
inforamt Vs. format
************************************************************;
* for reading and writing non-standard numeric values you can use inforamt( for writing/inserting values)
and format( for reading/printing values);
data mytest;
input id name $ family $ age dob;* several sapces is the same as one space;
informat dob ddmmyy8.;*informat for reading perpuse (inserting in table)
Note:right after each format_name put dot;
cards;
001 Hamid Rajaee 41 14.05.85
002 Mudasir khan 32 9.01.60
003 Nick rajaee 6 20.12.10
;
run;
*note: some availble format :
14.05.85 or 14/05/85 ==>ddmmyy8. or ddmmyy.
17/03/2013 ==>ddmmyy10.
03/17/2013 ==>mmddyy10.
17MAR13 ==>Date.
17MAR2013 ==>Date9.
you can find more by typing date format in help window( use help tab)
;
/*SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS*/
/*For reading date to sas(Informat) we should specify format of date*/
/*Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and
3rd 2 digit is year*/
* after checking a log you will find that this data is created successfully;
* but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960;
proc print data=mytest;run;
* we use informat for reading and we use format for writing;
* to see the result for dob in undrstanding form use format statement;
proc print data=mytest;
format dob ddmmyy8.;
run;
proc print data=mytest;
format dob ddmmyy.;
run;
proc print data=mytest;
format dob date9.;
run;
proc print data=mytest;
format dob weekdate.;
run;
* you can use any other date format;
*you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values;
proc print data=mytest;
format dob ddmmyy10.;
run;
proc print data=mytest;
format dob date9.;
run;
proc print data=mytest;
format dob weekdate32.;
run;
data mytest;
input id name $ family $ age dob;* several sapces is the same as one space;
informat dob ddmmyy8.;/*informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step*/
format dob date9.;
/*if you define format in data step it will saved with it but you can override whenever you want */
cards;
001 Hamid Rajaee 41 14.05.85
002 Mudasir khan 32 9.01.60
003 Nick rajaee 6 20.12.10
;
run;
proc print data=mytest;
run;
proc print data=mytest;
format dob ddmmyy10.;
run;
proc print data=mytest;
format dob mmddyy10.;
run;
*HMW2.1: Print the last 5 obs(tail of data) for sashelp.cars;
proc contents data=sashelp.cars;run;
proc print data=sashelp.cars (obs=428 firstobs=424);run;
*Hmw2.2: Find the highest income for each employee;
data tt;
input id age income;
cards;
1 50 43000
2 40 40000
2 39 39000
2 41 50000
3 30 65000
3 30 70000
;
run;
proc print;run;
* to keep the highest income for employees;
proc sort data=tt out=tt2;
by id descending income;
run;
proc print;run;
proc sort data=tt2 out=result nodupkey;
by id ;
run;
proc print;run;
*FYI:you will learn proc sql in Advanced SAS that is much ,more easier for answering this type of quesion;
proc sql;
select id ,max(income)AS highest_income
from tt
group by id;
quit;
proc sql;
create table max_age as
select id ,max(income)AS highest_income
from tt
group by id;
quit;
proc sql;
create table report as
select t1.* ,t2.age
from max_age t1
left join tt t2
on t1.id=t2.id and t1.highest_income=t2.income;
quit;
proc print data=report;run;
*Hmw2.3 Example Universe Cars is surveying its customers as to their preferences for car colors.
They have information about the customer’s age, sex (coded as 1 for male and 2 for female),
annual income, and preferred car color (yellow, gray, blue, or white).
Here are the data:
19 1 14000 Y
45 1 65000 G
72 2 35000 B
31 1 44000 Y
58 2 83000 W
29 1 14000 Y
35 1 65000 G
62 2 35000 B
21 1 44000 Y
38 2 83000 W
sex (coded as 1 for male and 2 for female)
car color (yellow, gray, blue, or white);
libname hamid "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\My sas library";
data Hamid.carsurvey;
input age gender $ Income Color $;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
31 1 44000 Y
58 . 83000 W
29 1 14000 Y
35 1 65000 G
62 2 35000 B
21 1 44000 Y
38 2 83000 W
;
run;
proc print data = hamid.carsurvey;
run;
* change value of sex and color to:;
*sex (coded as 1 for male and 2 for female)
car color (yellow, gray, blue, or white);
PROC FORMAT ;
VALUE $GENDER "1" = "Male"
"2" = "Female" ;
VALUE $COLOR "Y" ="Yellow"
"G" ="Green"
"B"="Blue"
"W"="White";
RUN;
proc print data = hamid.carsurvey;
format gender $gender. color $color.;
run;
data Hamid.carsurvey2;
input age gender Income Color $;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
31 1 44000 Y
58 . 83000 W
29 1 14000 Y
35 1 65000 G
62 2 35000 B
21 1 44000 Y
38 2 83000 W
;
run;
proc print data = hamid.carsurvey2;
run;
* change value of sex and color to:;
*sex (coded as 1 for male and 2 for female)
car color (yellow, gray, blue, or white);
PROC FORMAT ;
VALUE GEN 1 = "Male"
2 = "Female" ;
VALUE $COL "Y" ="Yellow"
"G" ="Green"
"B"="Blue"
"W"="White";
RUN;
proc print data = hamid.carsurvey2;
format gender gen. color $col.;
run;
*method#2:;
data test;
set hamid.carsurvey2;
length fomated_color $ 6.;
if missing(color) then fomated_color='NA' ;
else if color='Y' then fomated_color='Yellow' ;
else if color='G' then fomated_color='Green' ;
else if color='B' then fomated_color='Blue' ;
else if color='W' then fomated_color='White' ;
else fomated_color='Unknown' ;
if missing(gender) then sex='unknown';
else if gender=1 then sex='Male';
else sex='Female';
run;
proc print data=test;run;
*Hmw 2.4:;
*Create user defined formats/custom format and conduct segmentation ;
*binning :segmentaion for age;
proc format;
value agefmt
low-<26='<=25'
26- <36='26-35'
36- <46='36-45'
46- <56='46-55'
56-high='Senior'
;
value $genderfmt
'M'='Male'
'F'='Female'
;
value $sexfmt /*Hint:To create a format for a character variable, the format name must begin with $:
you can not use number at the end of the format name */
'M'=1
'F'= 0
;/*don't forget ; at the end of any format*/
run;
proc format;
value ageg
low-30='<=30'
31- <56='26-35'
56-high='Senior'
;
run;
*List input;
data test;
input gender $ age income;
cards;
F 22 23400
F 40 45000
F . 26800
F 22 44900
F 56 65000
M 23 38000
M 28 47250
M . 68500
M 22 95600
M 42 .
. 40 85230
M 25 96000
;run;
proc print;run;
proc print data=test;
format age agefmt. gender $genderfmt. ;
run;
proc print data=test;
format age ageg. gender $sexfmt. ;
run;
proc print data=test;
format age agefmt. ;
run;
*============================================================================================;
***************************************************************;
*STATISTICS;
*APPLIED STATISTICS;
*1.DESCRIPTIVE STATISTICS (means summarizing existing data( previous and current data))
2.INFERENTIAL STATISTICS(Statistical inference is the process of making as estimate, prediction, or decision about a population based on sample data. )
3.PREDICTIVE/PRESCRIPTIVE STATISTICS;
*
###############################
Exploratory Data Analytics (EDA)
###############################
Analysts need first to explore the data for potential research questions before
jumping into confirming the answers with hypothesis testing and inferential statistics.
EDA is often likened to “interviewing” the data it’s a time for the analyst to
get to know it and learn about what interesting things it has to say.
As part of our interview, we’ll want to do the following:
1.Classify our variables as continuous, categorical, and find distribution of them
2. Summarize our variables using descriptive statistics
3. Visualize our variables using charts
EDA is a process to try to understand existing data by visualization, Summarization, data cleaning, and
testing hypothesises
#We select the method for summmrizing data and analysis and testing hypothesis base on data type
Assume you have a breast cancer data set and in it you have ID, age, weight, height, blood pressure
and cancer . Cancer is binary 0 an 1 (0 means not having breast cancer and 1 means having breast cancer)
we want to built a model to predict having cancer for new patients
So in this situation cancer is target and we call that target or dependent variable or output variable and
we call other columns predictors or independents or features or input variables
we have two data types numerical and categorical data.
Categorical data has two types :Nominal(like name, eye color) and Ordinal(like education level, day of week)
numerical data has two types continuous (that can accept decimal like height and weight ) and
discrete(like number of children, number of females, ...)
If your variable has only two levels we call it binary at we can consider it as a spcial case of nominal categorical variable
;
*TWO TYPES OF DATA (VALUE)1.NUMERIC 2.CHARACTER;
/*numeric data could be discrete or continuous
If your data is discrete with limited levels like number of childeren ( number of cylinder in cars),
you can treat it as a categorical variable as well i.e. you can find mean and varaince because it is
numeric and you can draw bar chart and pie chart as well */
*SAS STATITICS:1.CONTINUOUS 2.CATEGORCAL;
data test;
input gender $ age income;
cards;
F 22 23400
F 40 45000
F . 26800
F 22 44900
F 56 65000
M 23 38000
M 28 47250
M . 68500
M 22 95600
M 42 .
. 40 85230
M 25 96000
;run;
proc print data=test;run;
*Segmentation and changing values;
proc format;
value agefmt
low-<26='<=25'
26- <36='26-35'
36- <46='36-45'
46- <56='46-55'
56-high='Senior'
;
value ageg /*format names can not end with numbers*/
low-30='Young'
31- <56='Middel aged'
56-high='Senior'
;
value $genderfmt
'M'='Male'
'F'='Female'
;
value $sexfmt /*Hint:To create a format for a character variable, the format name must begin with $:
you can not use number at the end of the format name */
'M'=1
'F'= 0
;/*don't forget ; at the end of any format*/
run;
*Browsing data portion;
proc print data=test; run;
proc print data=test;
format age ageg.;
run;
*Browsing descriptor portion;
proc contents data=test; run;
*CONTINUOUS: like: AGE,income,IQ,SALES,PRICE;
*Charector variables like: gender, name,purpose of loan, loan status
*DESCRIBE ONLY one VARIBLE:UNIVARIATE DESCRIPTIVE ANALYSIS;
*DESCRIBE ONLY TWO VARIBLES:BIVARIATE DESCRIPTIVE ANALYSIS;
*DESCRIBE MORE THAN 2 VARIBLES:MULTIVARIATE DESCRIPTIVE ANALYSIS;
/*
Univariate Analysis:
For categorical columns/ variabels:
for summarization: frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)
for visualization : Bar chart or Pie chart
For numerical columns :
for summarization: Central tendency(mean, median, mode,...) and
measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max),
and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...
( by using proc means,summary, or univariate in SAS)
for visualization: Histogram , Box plot,density,...
*/
/*
https://online.stat.psu.edu/stat500/
*/
******************************************************
Continuous Variabels
*******************************************************;
*WHAT PROCEDURE we USE FOR summerizing Numerical variable;
*1)Proc means 2)Proc summary 3)Proc univariate;
/* Proc means 1/2 */
Proc means data=test;*by default proc means do summearization ( univarite) for all numeric variables and;
Run;*proc means by deafult returns only N(number of non-missing values) , mean( average), Std Dev (standard deviation), minimumn and maximum with 7 decimal numbers;
Proc means data=test maxdec=2;
Run;
title " The five-number summary";
Proc means data=test min Q1 median Q3 Max;run;
*Q1 is 1st quartile=25th percentile that means the value that 25% of obs have value less than or equal to it
median=Q2 is 2nd quartile=50th percentile that means the value that 50% of obs have value less than or equal to it
Q3 is 3rd quartile=75th percentile that means the value that 75% of obs have value less than or equal to it;
title " The five-number summary";
TITLE2 "DESCRIBE ONLY TWO FLOATING DECIMALS";
Proc means data=test min Q1 median Q3 Max maxdec=2;
run;
TITLE1 "DESCRIBE ONLY TWO FLOATING DECIMALS";
TITLE2 "OVERRIDING DEFAULT SETTING";
Proc means data=test MAXDEC =2 N NMISS MIN p1 p5 MEAN var STD CV Q1 MEDIAN Q3 p95 p99 clm max;*The coefficient of variation (CV)=STD/mean;
Run;
*WHAT IS n : number of non_missing values;
*WHAT IS nmiss : number of missing values;
* what is p1 is 1th percentile that means 1 pecent of all records(obs) have value less than or equal to it;
*WHAT IS CV : COEFFICIENT OF VARIATION=STD/MEAN;
*WHAT IS STD : STANDARD DEVIATION ;
*What is clm: confidence inteval;
Proc means data=test MAXDEC =2 N NMISS MIN p1 p5 MEAN var STD CV Q1 MEDIAN Q3 p95 p99 clm max alpha=0.02;*The coefficient of variation (CV)=STD/mean;
Run;
* if you have a big distance between min and p1 but not between p1 and Q1 you have outliers and
if you have a big distance between p99 and max but not between Q3 and p99 you have outliers;
*note: we have some other methods for finding outliers like 1. mean and standard deviation( Z score)
2. Interqurtile Range or box plot(Tukey and extreme Tukey method)
3. LOF
that you will learn in statistic and data mining course;
*By default proc means do descripteve analysis for all continuouse varibles in data set;
title " Describe one variable";
Proc means data=test;/*do not forget "s" at the end of proc means and proc contents*/
var income;
run;
title;
Proc means data=test;
var age income;
Run;
Proc means data=test ;
output out=hamid.data_summary_Oct18th;*you can save the reult of proc means as a new table (dataset) by using output statement;
Run;
proc print data=hamid.data_summary_Oct18th;run;
*================================================
Bivariate Analysis
================================================;
* in SAS base in bivariate analysis I just talk baout summerization, and in advanced SAS I will
teach you hypothese testing,
*******************************
Continuous VS. Categorical with proc means
*********************************;
title "Bivariate Analysis";
title2 "Continues VS. Categorical";
title3 "DESCRIPTIVE OF Income Vs. Gender";
Proc means data=test;
var income;* after var list numeric variables abd after calss or by list categorical variable;
class gender;/* Bivariate continues VS. categorical*/
run;
footnote "to get the idea about association between income and geder if exits";
title "Bivariate Analysis";
title2 "Continues VS. Categorical";
title3 "DESCRIPTIVE OF Income Vs. age_group";
Proc means data=test;
var income;
class age;/* Bivariate continues VS. categorical*/
format age agefmt.;
run;
footnote "to get the idea about association between income and age_group if exits";
title "Multi-variate Analysis";
proc means data=test maxdec=2 NMISS MEAN STD CV MEDIAN p20;
var age income;* summeraiztion for these two continuous varibales;
class gender age;*by using fomat I will vconvert age to categorical variable and use it for grouping;
format age agefmt.;
run;
title;
footnote;
*Hint: You can use by statement instead of class statement in this situation you will get separate tables for each level and be carefull if you want to use by you must sort by that variable first ;
Proc sort data=test;
by gender;
run;
proc means data=test;
by gender;
run;
proc print data=sashelp.baseball (obs=5);run;
proc means data=sashelp.baseball min p1 p5 p10 p20 p25 p30 p50 p75 p90 p95 p99 max maxdec=2;
var nHits;
run;
*https://sasexamplecode.com/3-easy-ways-to-calculate-percentiles-in-sas-examples/#:~:text=%203%20Easy%20Ways%20to%20Calculate%20Percentiles%20in,3%20Calculate%20Percentiles%20with%20PROC%20FREQ%20More%20;
*the following code is just FYI;
proc means data=sashelp.baseball p1 p5 p10 p20 p25 p50 p75 p90 p95 p99;
var nHits;
/*In SAS in assignment if you want to assign a value of keyword to your own variable we specify that keyword on the left side of equal sign
and the name of your own varibvle on the right side of equal sign */
* in proc means for saving the result in one table you can use ouput statement;
output out=percentiles_proc_means
p1 = P_1_hamid /*P_1_hamid is the name of varible that i assigned value of p1 to it*/
p5 = fifth_percentile
p10 = P_10
p20 = p_20
p25 = P_25
p50 = P_50
/* p75 = P_75*/ /*we do not have to save all result in output data*/
/* p90 = P_90*/
/* p95 = P_95*/
p99 = P_99;
run;
proc print data=work.percentiles_proc_means noobs;
run;
proc print data=sashelp.heart (obs=5);run;
title "Bivariate Analysis";
title2 "Continues VS. Categorical";
TITLE3 "DESCRIPTIVE OF BODY-WEIGHT BY SMOKING STATUS";
PROC MEANS DATA=SASHELP.Heart MAXDEC =2 NMISS MIN MEAN STD CV MAX ;
VAR WEIGHT;
CLASS Smoking_Status;
RUN;
footnote;
title "Proc means";
proc means data=test maxdec=2 mean clm;*clm is confidence interval for population mean that we predict a range of mean for population( inferential statistic) by default at 95% confidenc;
var age income;
run;
/*
clm is confidence interval
Let’s assume that I will give you an exam with 30 marks, and I say that based on my experience your class average mark would be between 21.75 and 26.25 at 95% confidence
That means there is 5% possibly that your class average would be less than 21.75 or bigger than 26.25
If I want to predict your class average at 100% confidence I must say:
your class average mark would be between 0 and 30 at 100% confidence
*/
proc means data=test maxdec=2 mean clm alpha=0.10;* alpha=significant level=0.10==> confidence= 1-alpha =0.90
clm is confidence interval that we predict a range of mean for population at 90% confidenc;
*The bigger confidence the wider interval;
var age income;
run;
*FYI:alpha is called signifiacnt level it is probabilty of Type I error(FP);
Proc means data=test MAXDEC =2 n nmiss mean std stderr cv lclm uclm median min max range Q1 Q3 qrange maxdec=2 ;
Run;
*stderr=sqrt(std)/n;
***********************************************
where VS. If
**********************************************;
*1. you can use where in proc step or data step but you can use if only in data step
i.e you are not allowed to use if in proc step
2.we have where option as well;
******************************************
*where option vs. where statement;
*******************************************;
proc print data=hamid.data_summary_Oct18th;run;
proc print data=hamid.data_summary_Oct18th(where=(_STAT_='MEAN'));*where option;
var age income;
run;
proc print data=hamid.data_summary_Oct18th ;
var age income;
where _STAT_='MEAN';*where statement;
run;
title;
/*FYI: Proc summary 1/2 */
/*Proc SUMMARY and Proc MEANS are essentially the same procedure. Both procedures compute descriptive statistics.
The main difference concerns the default type of output they produce. Proc MEANS by default produces printed output in
the LISTING window or other open destination whereas Proc SUMMARY does not. Inclusion of the print option on the Proc SUMMARY
statement will output results to the output window.
Proc SUMMARY
When all variables in the data set are character the same output: a simple count of observations, is produced for each
procedure.However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables and
produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum).
Inclusion of a VAR statement in both Proc MEANS and Proc SUMMARY, produces output that contains exactly the same default
statistics.*/
proc summary data=test; *In proc suumary you must use print option to see the result or use output statement to save the result in one table ;
var age income;
run;
proc means data=test;
var age income;
run;
title "Proc summary with print option";
proc summary data=test print;
var age income;
run;
title "Proc summary with print option";
proc summary data=test print;
run;
*in proc summary like proc means you can add output statement to save the result;
proc summary data=test;
var age income;
output out=sumdata_For_Hamid;*creating new data which name is sumdata_For_Hamid an availble in work library
you can save the result in permanent library if you want;
run;
title "printing result of proc summary that you saved it by using output option";
proc print data=sumdata_For_Hamid;
run;
title "Proc summary with print option to see the result and saving result by using output options";
proc summary data=test print;
var age income;
output out=sumdata2;
run;
title "printing result of proc summary that you saved it by using output option";
proc print data=sumdata2;
run;
title "Proc summary with print option and saving result by using output options";
proc summary data=test print p20 mean median clm qrange;
var age income;
output out=sumdata2;
run;
title "printing result of proc summary that you saved it by using output option";
proc print data=sumdata2;
run;
proc summary data=test print p20 mean median;
var age;
output
out=sumdata3
p20=p20th_hamid
mean=Avarge_of_values_For_Hamid;
run;
title "printing result of proc summary that you saved it by using output option";
proc print data=sumdata3;
run;
proc print data=sumdata2;
run;
title;
*Hint:you are not allowed to use if statement in procedures;
title;
Proc summary data=test maxdec=2 print;
var income;
class gender age;* I classified all of observation based on gender first and then for each gender based on age group;
format gender $genderfmt. Age agefmt.;
output out=sumdata3a;
Run;
proc print data=sumdata3a;run;*you can see bigger number of type shows you more detailed in terms of classification;
* I classified all of observation based on gender and age so type 3 is classifying based on gender then age
type 2 means classifying only based ongender
type 1 means classifying only based age
and type 0 means no classification;
Proc summary data=test maxdec=2 print;
var income;
class gender age;* I classified all of observation based on gender and age;
format gender $genderfmt. Age agefmt.;
output out=sumdata3 n=number_of_non_missing min=minimum
mean=mean std=standard_dev sum=total;
/* here n is keyword and we assign the value of that to number_of_non_missing, In SAS keywords located on
the left side of equal sign*/
Run;
proc print data=sumdata3;run;
proc print data=sumdata3(where=(_type_=3));
run;
proc print data=sumdata3(where=(_type_=2));
run;
****************************************************
proc univarite
****************************************************;
*
Note:In proc univarite you can get more information about continuous varaible by default , and unlike proc means and summary
1.You can use proc univarite to test of normality and to visualize data as well
2.you can create a graph
3.when you want to create output in proc univarite you must specify what _stat_ you want to save
4.we use pctlpts=20 instead of p20 and so on and The PCTLPRE= option must be specified to generate additional percentiles
5. you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles
;
proc means data=test;
run;
title;
Proc univariate data=test;
run;
Proc univariate data=test plot;
run;
Proc univariate data=test;
var income;
run;
Proc univariate data=sashelp.cars;
var invoice;
run;
title;
*You can use proc univariate to test of normality by adding normal option
If p value is less than 5% you can reject null hypotheses( data is normaly distributed) and conclude
your data is not normally distributed at 5% significant level;
Proc univariate data=sashelp.cars normal;
var invoice;
run;
Proc univariate data=sashelp.cars normal plot;
var invoice;
run;
*you can use proc univariate for visualize data by adding some options;
proc univariate data=sashelp.cars;
var invoice;
qqplot;
run;
Proc univariate data=sashelp.cars plot;
var invoice;
Run;
proc univariate data=sashelp.cars normal;
var invoice;
qqplot;
run;
*I will talk about normality test in Advanced SAS;
Proc univariate data=sashelp.cars;
var invoice;
output out=sumdata4a ;*WARNING: Data set WORK.SUMDATA4A was not replaced because new file is incomplete ;
Run;
proc print data=sumdata4a;
run;*NOTE: No variables in data set WORK.SUMDATA4A.;
Proc univariate data=sashelp.cars normal plot freq;
var invoice;
output out=hamid.sumdata
n=n min=mininum
mean=mean std=standard_dev
median=median;
Run;
proc print data=hamid.sumdata;
run;
title;
Proc univariate data=sashelp.cars normal plot;
var invoice;
Run;
* Bivariate continues VS. categorical;
Proc univariate data=sashelp.cars normal plot;
var invoice;
class make;/* Hint:you can use by instead of class but must sort before by make, unless your data sorted before*/
/*Note: when ever you use by statement in SAS you must sort your data by thah variable before ,unless your data sorted before*/
run;
*Please compare it with proc means;
Proc means data=sashelp.cars;
var invoice;
class make;
run;
TITLE "PROCEDURE UNIVARIATE FOR DESCRIPTIVE ANALYSIS";
PROC UNIVARIATE DATA=SASHELP.Heart normal;
VAR HEIGHT;
CLASS SEX;
RUN;
TITLE "PROCEDURE MEANS FOR DESCRIPTIVE ANALYSIS";
PROC MEANS DATA=SASHELP.Heart;
VAR HEIGHT;
CLASS SEX;
RUN;
proc print data=test;run;
proc univariate data=test;
var age;
class gender;
run;
proc univariate data=test;
var age income;
class gender;
run;
*Please compare it with proc means;
proc means data=test;
var age income;
class gender;
run;
Proc univariate data=sashelp.cars;
var invoice;
run;
Proc univariate data=sashelp.cars;
var invoice;
output out=hamid.sumdata4
n=n
min=mininum
mean=mean_of_data
std=standard_dev
p25=lower_quartile
p90=p90th
median=median
q3=upper_quartile ;*p25=q1, p75=q3;
Run;
proc print data=hamid.sumdata4;
run;
*The following code is just FYI;
Proc univariate data=sashelp.cars;
var invoice;
output out=sumdata4
n=n
min=mininum
mean=mean_of_data
std=standard_dev
p25=lower_quartile
p90=p90th
pctlpts= 2 20 30 30.5 42.5 97.5 pctlpre=Percentile_
median=median
q3=upper_quartile ;*p25=q1, p75=q3;
Run;
/*PCTLPRE=prefixes
specifies one or more prefixes to create the variable names for the variables that contain the PCTLPTS= percentiles. To save the same percentiles for more than one analysis variable, specify a list of prefixes. The order of the prefixes corresponds to the order of the analysis variables in the VAR statement. The PCTLPRE= and PCTLPTS= options must be used together.
PROC UNIVARIATE generates new variable names by using the prefix and the percentile values. If the specified percentile is an integer, the variable name is simply the prefix followed by the value. If the specified value is not an integer, an underscore replaces the decimal point in the variable name, and decimal values are truncated to one decimal place. For example, the following statements create the variables pwid20, pwid33_3, pwid66_6, and pwid80 for the 20th, 33.33rd, 66.67th, and 80th percentiles of Width, respectively:
proc univariate noprint;
var Width;
output pctlpts=20 33.33 66.67 80 pctlpre=pwid;
run;*/
proc print data=sumdata4;
run;
*Browsing data portion;
proc print data=test; run;
*Browsing descriptor portion;
proc contents data=test; run;
*CATEGORICAL DATA;
/* Proc freq */
/*univariate analysis*/
/***************for categorical variables ********************/
title;
proc freq data=test;
run;*for each variable (numeric and categorical variables) it shows frequancy of each level
it is not suumerizing of num,erical varable just shows us the freq of each unique value;
/* If an observation has a missing value for a variable in a TABLES request,
by default PROC FREQ does not include that observation in the frequency or
comulative freq and percentage , it shows missing values at the bottom of tables.
Also by default, PROC FREQ does not include observations
with missing values in the computation of percentages and statistics.
The procedure displays the number of missing observations below each table.*/
/* unlike proc means, summary and proc univariate that we use var statement to specify varables for
thoes procedures, in proc freq we use table statement :recall that we show counts of each value in table*/
proc freq data=test;
table gender;
run;
proc print data=test; run;
*By default proc freq ignores missing values in calculation percent and cumulative percent
unless you youse missing option in table statement;
proc freq data=test;
table gender/missing; * we use forward slash to add option in proc freq;
/* treats missing values as a valid level for all TABLES variables.
Displays missing levels in frequency and crosstabulation tables
and includes them in computations of percentages and statistics.
*/
run;
proc freq data=test;
table _all_ /missing;
run;
Proc freq data= test;
table age/ out=Hamid_agefreq_oct18th;
Run;
proc print data=hamid_agefreq_oct18th;
run;
title "frequency of each age group";
Proc freq data= test;
table age/ missing;
format Age agefmt.;
Run;
Proc freq data= test;
table age/ out=agefreq missing;
format Age agefmt.;
Run;
proc print data=agefreq;
run;
Proc freq data=test;
table gender;
Run;
Proc freq data=test;
table gender;
format gender $genderfmt.;
Run;
Proc freq data=test;
table gender/missing;
format gender $genderfmt.;
Run;
proc freq DATA = SASHELP.CARS NLEVELS; *NLEVELS :Displays the number of levels for all TABLES variables;
table type;
run;
PROC FREQ DATA=SASHELP.Heart;
RUN;
TITLE "UNIVARIATE DESCRIPTIVE ANALYSIS:SEX";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX;
RUN;
title;
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX ;
TABLE SMOKING_STATUS;
RUN;
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX SMOKING_STATUS;*it is two univariate analysis ,
i.e creating two tables one for sex, the other one for SMOKING_STATUS
IF you want to get two-way table youu need to put star between sex and
SMOKING_STATUS;
RUN;
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX SMOKING_STATUS/ missing;
run;
/*Bivariate*/
*Cathegorical Vs. Chategorical;
*assume you are working as a data scientist at Bell and you want to see that is there any associtaion between
type of cancelletion( online or calling) and churn ( cancelling the agreement and switch to anothrer provider like Bell, ...)
this is hypotheses testing ( you may design an A/B testing for answering this problem)
Null hypothese: there is no assocition between churn and type of canceletion
since both variables are categorical vbariables we run biviate anlysis for categorical vs. categorical
fgor this problem we need to use chi_squre test in proc freq that i will teach you;
/*Bivariate Analysis:
Continuous Vs. Continuous : For Visulaization scatter plot,line chart( specially if you have time data)
For test of independence: pearson correlation or spearman or Kendal tau or ... coorelation
and Simple linear regression ...in SAS by using proc corr, proc reg
Categorical Vs. Categorical : For summaraization: contingency table (two-way table) in SAS by using proc freq and put * between the name of thoes two variables in tabel statement
For visualization :stacked bar chart,Grouped bar chart,...
For test of independence:chi-square test
Continuous Vs. Categorical : For summaraization:gropup by categorical column and aggragte for numerical column in SAS by using proc means,summary, or univariate and use name of continuous variable in var statement and name of categorical variable in class statement
For visualization: Grouped box plot,...
For test of independence :1) if categorical column has only two levels :t-test in SAS by using proc ttest ( if all assumption are met)
2) if categorical column has more than two levels: ANOVA in SAS by using proc glm or Anova( if all assumption are met)
*/
********************************************************
/******CHI SQURE*****/
********************************************************;
*We use Chi squre test for finding if two categorical varaibles are independent from each other or not.
The Null and Alternate Hypotheses
we are interested in knowing if there is a relationship between two categorical variables or not.
In order to do so, we would have to use the Chi-squared test. But first, let's state our null hypothesis and
the alternative hypothesis.
The Null Hypotheses:H0:There is no statistically significant relationship between thoes two categorical variables.
The Alternate Hypotheses:Ha:There is a statistically significant relationship between thoes two categorical variables.
After running chi square test and checking the conditions of it
if p-value is less than 5% you can reject null hypotheses and get this conclusion that
"There is some correlation between thoes two variables at 0.05 significant level
or
There is a statistically significant relationship between thoes two categorical variables at 0.05 significant level."
(* Note : if p-value is bigger than 5% yopui fail tyo reject null hypothese so you can conclude that there is not enough evidence top reject the null hypothese at 5% signifiucant level)
The Chi-square test statistic can be used if the following conditions are satisfied:
1.N, the total frequency, should be reasonably large, say greater than 50 or 30.
2. The sample observations should be independent. This implies that no individual item should be included twice or more
in the sample.
3. No expected frequencies should be small. Small is a relative term. Preferably each expected frequencies
should be larger than 10 but in any case not less than 5.
;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SMOKING_STATUS * SEX;/* it gives us contingency table(two-way table) here rows=SMOKING_STATUS and columns=SEX*/
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING only Frequency";
PROC FREQ DATA=SASHELP.Heart;
TABLE SMOKING_STATUS * SEX/nopercent norow nocol;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING only Prcentsge";
PROC FREQ DATA=SASHELP.Heart;
TABLE SMOKING_STATUS * SEX/nofreq norow nocol;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING only percentage considering missing values";
PROC FREQ DATA=SASHELP.Heart;
TABLE SMOKING_STATUS * SEX/nofreq norow nocol missing;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX*SMOKING_STATUS ;/* it gives us contingency table(two-way table) here rows=Sex and columns=SMOKING_STATUS*/
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX*SMOKING_STATUS/list;
RUN;
/* If you specify the LIST option in the TABLES statement,
PROC FREQ displays multiway tables in a list format
rather than as crosstabulation tables.
The LIST option displays the entire multiway table in one column,
instead of displaying a separate two-way table for each stratum.*/
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS/missing;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS/NOPERCENT;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS /norow NOCOL NOPERCENT ;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS /nofreq NOCOL NOPERCENT ;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS /nofreq NOrow NOPERCENT ;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS /NOCOL NOROW NOFREQ NOCUM MISSING;
RUN;
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS /NOCOL NOROW NOPERCENT NOCUM MISSING list;
RUN;
TITLE "Hypoteses testing:SEX Vs. SMOKING";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS/ chi_square;
RUN;
*if p-value is less than 5% and all assumption are satisfied( I will talk about assumption in advanced SAS course)
we conclude that there is a statistcally association between thoes two varaibles at 5% significant level;
* in this example the p-value<0.0001==> p-value<0.05=5% that is less than 5% so you can conclude that:
there is a statistcally association between sex and Smoking_Status at 5% significant level;
/*Data type: Numeric and categoric
Note: Numeric variable can be continuous or discrete , we can treat discrete variable like
ordinal variable ( categorical) specially if it has limited options ( like number of children in family :
it is numeric( discreate), so you can apply mean ,standard deviation and so on but you can treat it like categorical variable
for example draw bar chart for it)
Univariate Analysis:
For categorical columns/ variabels:
for summarization: frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)
for visualization : Bar chart or Pie chart
For Continuous columns :
for summarization: Central tendency(mean, median, mode,...) and
measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max),
and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...
( by using proc means,summary, or univariate in SAS)
for visualization: Histogram , Box plot,density,...
*/
TITLE "BIVARIATE ANALYSIS:SEX AND SMOKING_status";
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS /chisq;
RUN;
*Since the p-value is less than 5% we reject null hypotheses and conclude:
There is a statistically significant relationship between thoes two categorical variables at 0.05 significant level;
PROC FREQ DATA=SASHELP.Heart;
TABLE SEX * SMOKING_STATUS /NOCOL NOROW NOPERCENT NOCUM MISSING chisq;
RUN;
*Here since p-value is less than 5%=0.05 we reject null hypothese and get conclusion that there is statistaclly association between sex and
smoking_status at 5% significant level;
******************************************summary until this point;
title;
*Brows data portion;
proc print data=sashelp.heart (obs=5);
run;
*To find the data type of each variable;
proc contents data=sashelp.heart ;
run;
*univariate analysis for numerical variable;
proc means data=sashelp.heart;
run;
proc means data=sashelp.heart n nmiss min Q1 median Q3 max mean std cv clm alpha=0.10 maxdec=2;
run;
*check normality for weight;
proc univariate data=sashelp.heart normal plot;
var weight;
run;
/*since p-value < 0.05 we can reject normality at 5% significant level */
*univariate analysis for categorical variable;
proc freq data=sashelp.heart;
table BP_Status Chol_Status DeathCause Sex Smoking_Status Status Weight_Status/ missing;
run;
*BIivariate analysis for age vs Chol_Status;
proc means data=sashelp.heart;
var AgeAtStart;
class Chol_Status;
run;
*bivariate analysis for BP_Status vs Chol_Status;
proc freq data=sashelp.heart;
table BP_Status * Chol_Status / chisq;
run;
/*since p-value < 0.05 we can reject independency at 5% significant level and conclude:
There is statitically association between BP_Status and Chl_status at 5% significant level*/
************************************************************************;
data test;
input gender $ age income;
cards;
F 22 23400
F 40 45000
F . 26800
F 22 44900
F 56 65000
M 23 38000
M 28 47250
M . 68500
M 22 95600
M 42 .
. 40 85230
M 25 96000
;run;
proc print data=test;run;
proc freq data=test;
table gender;
run;
proc freq data=test;
table gender/missing;
run;
*Segmentation and changing values;
proc format;
value agefmt
low-<26='<=25'
26- <36='26-35'
36- <46='36-45'
46- <56='46-55'
56-high='Senior'
;
value ageg /*format names can not end with numbers*/
low-30='Young'
31- <56='Middel aged'
56-high='Senior'
;
value $genderfmt
'M'='Male'
'F'='Female'
;
value $sexfmt /*Hint:To create a format for a character variable, the format name must begin with $:
you can not use number at the end of the format name */
'M'=1
'F'= 0
;/*don't forget ; at the end of any format*/
run;
proc freq data=test;
table gender*age/chisq;
format age agefmt. gender $genderfmt.;
run;
*WARNING: 100% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.;
*Note: you can create several tables in one proc freq;
proc freq data=test;
table gender /missing ;
table age;
table gender*age/missing ;
format age agefmt.;
run;
/* If you specify the LIST option in the TABLES statement,
PROC FREQ displays multiway tables in a list format
rather than as cross tabulation tables.
The LIST option displays the entire multiway table in one column,
instead of displaying a separate two-way table for each stratum.*/
proc freq data=test;
table gender /missing ; *for one table adding list option doesn't have any effect;
table age/missing ;
table gender*age/missing ;
table gender*age/missing list;
format age agefmt.;
run;
proc freq data=test;
table gender age/missing ;
table gender*age/missing list;
format age agefmt.;
run;
proc freq data=sashelp.cars;
table make /missing;
table model/missing ;* we can say table make model/missing list instead;
table make*model/missing list;
run;
proc print data=test;run;
****************************************************************************************
*Vertical calculation (column wise ) vs. Horizontal calculation(row wise);
**************************************************************************************;
Proc means data=test MAXDEC =2 sum MEAN var STD CV Q1 MEDIAN Q3 p95 p99 max;*The coefficient of variation (CV)=STD/mean;
Run;
*It is vertical calculation i.e. SAS it is going to calculate mean, std, ... for each columns;
Proc means data=test MAXDEC =2 sum MEAN ;
Run;
***********************************************************************
***********Horizontal calculation/row wise****************;
*************************************************************************,
*so far we did summeraization for each variables/ columns
what if we want to summerize for each rows/obs as an example assume each rows is student id and scores in different courses for that student i.e you have a student id,
SQL_score, SAS_score, Python_score and so on and you want to find average of score for each studen if we use proc means, or suumary or univarita
we will get average of all students in each course,
or you have a customer Id at 1st column and sales amount for each month of 2021 in 12 columns and we want to
calculate sum of sales and average of sales for each customer;
*CONNECT YOUR PERMANENT SAS LIB USING LIBNAME STATEMENT;
libname hamid "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\hamid libraray";
* SAS STATISTICAL FUNCTIONS;
*LIST INPUT WE CAN USE CARDS OR DATALINES;
DATA test_;
INPUT ID T1 T2 T3 T4 T5;
CARDS;
1001 1 2 3 4 5
1002 76 67 87 89 79
1003 87 78 84 82 87
1004 89 90 67 76 88
;
RUN;
PROC PRINT DATA=test_;
RUN;
proc means data=test_ sum mean;run;*It is vertical calculation i.e. SAS it is going to calculate sum and mean for each columns;
*As an example if columns are course names and rows are students you can find average of score for each cours;
*HOW WE CAN DO HORIZONTAL CALUCULATION as an example how we can calculate average of score for each students;
* recall that SAS select obs by obs and for each obs excutes codes line by line;
DATA TEST_01;
SET test_;
TOTAL_SCORE=T1+T2+T3+T4+T5; *ASSIGNMENT STATEMENT using plus operator;
RUN;
PROC PRINT DATA=TEST_01;
RUN;
DATA TEST_01;
SET test_;
TOTAL_SCORE=T1+T2+T3+T4+T5;*ASSIGNMENT STATEMENT;
AVG_SCORE=TOTAL_SCORE/5;
RUN;
PROC PRINT DATA=TEST_01;
RUN;
*with missing values;
****************************************************************;
*Plus Operator Vs. sum function;
*****************************************************************;
DATA test_;
INPUT ID T1 T2 T3 T4 T5;
CARDS;
1001 1 2 3 4 .
1002 76 . 87 89 79
1003 87 78 84 . 87
1004 89 90 67 76 88
;
RUN;
PROC PRINT DATA=test_;
RUN;
DATA TEST_01;
SET test_;
TOTAL_SCORE=T1+T2+T3+T4+T5;*ASSIGNMENT STATEMENT and Defining new variable and ASSIGNMENT STATEMENT;
AVG_SCORE=TOTAL_SCORE/5;
RUN;
PROC PRINT DATA=TEST_01;
RUN;
PROC PRINT DATA=TEST_01;
RUN;
DATA TEST_01;
SET test_;
TOTAL_SCORE1=T1+T2+T3+T4+T5; *ASSIGNMENT STATEMENT;*Plus operator GIVES YOU MISSING VALUE IF YOU HAVE MISSING IN YOUR DATA;
TOTAL_SCORE2=SUM (T1,T2,T3,T4,T5);*ASSIGNMENT STATEMENT ;*IGNORES MISSING VALUES; *Sum() function ignores missing values;
TOTAL_SCORE3=SUM (OF T1-T5);*ASSIGNMENT STATEMENT;*YOU CAN USE THIS METHOD FOR VARIBLES WITH THE SAME NAME AND DIFFEREN INDEX;
RUN;
PROC PRINT DATA=TEST_01;
RUN;
*In following data find the total amount and average amount that each customer bought in 2023 so far;
DATA Amazon;
INPUT Cust_ID Jan2023 Feb2023 Mar2023 Apr2023 May2023;
CARDS;
1001 1 2 3 4 0
1002 76 0 87 89 79
1003 87 78 84 0 87
1004 89 90 67 76 88
;
RUN;
data report;
set amazon;
Total=sum(Jan2023 ,Feb2023, Mar2023, Apr2023, May2023);
Avg=mean(Jan2023 ,Feb2023, Mar2023, Apr2023, May2023);
run;
proc print data=report;run;
***********************************************************************************************
* Interview question: how can we calculate the average of each row by ignoring missing values?
************************************************************************************************;
*SAS FUNCTION f(ARG1,ARG2,ARG3);
*BY DEFAULT,SAS FUNCTIONS EXCLUDE (ignore) MISSING VALUES IN CALCULATION;
DATA TEST_02;
SET test_;
TOTAL_SCORE=SUM(T1,T2,T3,T4,T5);
RUN;
PROC PRINT DATA=TEST_02;
RUN;
DATA TEST_02;
SET test_;
TOTAL_SCORE=SUM(T1,T2,T3,T4,T5);*sum function returns to us sum of all non-missing values;
TOTAL_TEST=N(T1,T2,T3,T4,T5);*N function returns to us number of non-missing values;
MISS_TEST=NMISS(T1,T2,T3,T4,T5);*NMISS function returns to us number of missing values;
AVG_SCORE=MEAN(T1,T2,T3,T4,T5);*mean function returns to us avaerge of all non-missing values;
MIN_SCORE=MIN(T1,T2,T3,T4,T5);*min function returns to us minimum of all non-missing values;
MAX_SCORE=MAX(T1,T2,T3,T4,T5);*max function returns to us maximum of all non-missing values;
STD_SCORE=STD(T1,T2,T3,T4,T5);*std function returns to us standard deviation of all non-missing values;
RUN;
PROC PRINT DATA=TEST_02;
RUN;
DATA TEST_02b;
SET test_;
TOTAL_SCORE=SUM(T1,T2,T3,T4,T5);*sum function returns to us sum of all non-missing values;
Number_of_non_missing=N(T1,T2,T3,T4,T5);*N function returns to us number of non-missing values;
AVG_SCORE=TOTAL_SCORE/Number_of_non_missing;
AVG_SCORE2=mean(T1,T2,T3,T4,T5);*mean function returns to us avaerge of all non-missing values;
RUN;
PROC PRINT DATA=TEST_02b;
RUN;
DATA TEST_02;
SET test_;
TOTAL_SCORE1=SUM(T1+T2+T3+T4+T5);
*first it is going to calculate T1+T2+T3+T4+T5 (so if you have missing value you will get missing)then sum( sum here
doesn't have any effect) ;
*Note: use comma between arguments;
TOTAL_SCORE2=SUM(T1,T2,T3,T4,T5);
TOTAL_TEST=N(T1,T2,T3,T4,T5);*Number of non_missing values;
MISS_TEST=NMISS(T1,T2,T3,T4,T5);*Number of missing values;
RUN;
PROC PRINT DATA=TEST_02;
RUN;
*********************************************
*FYI: USING VARIABLE SHORTCUT;
*********************************************;
*EXCERCISE;
DATA TEST3;
INPUT ID T1 T2 T3 T4 T5;
CARDS;
1001 1 2 3 4 .
1002 76 . 87 89 79
1003 87 78 84 . 87
1004 89 90 67 76 88
;
RUN;
PROC PRINT DATA=TEST3;
RUN;
DATA TEST_03;
SET TEST3;
TOTAL_SCORE=SUM(OF T1-T5);
TOTAL_TEST=N(OF T1-T5);
MISS_TEST=NMISS(OF T1-T5);
AVG_SCORE=MEAN(OF T1-T5);
MIN_SCORE=MIN(OF T1-T5);
MAX_SCORE=MAX(OF T1-T5);
STD_SCORE=STD(OF T1-T5);
RUN;
PROC PRINT DATA=TEST_03;
RUN;
DATA TEST_03;
SET TEST3;
TOTAL_SCORE=SUM(of T1-T5);*DON'T FORGET OF;
TOTAL_SCORE_PLUS5=SUM(OF T1-T5,5);* OR TOTAL_SCORE_PLUS52=SUM(T1,T2,T3,T4,T5,5);
X=SUM(T1-T5);*HERE WE HAVE JUST ONE ARGUMENT THAT IS THE RESULT OF T1 MINUS T5 ;
Y=SUM(T1-T5,5);*HERE WE HAVE TWO ARGUMENTS THAT ARE THE RESULT OF T1 MINUS T5 AND 5 THAT MEANS T1-T5+5 ;
Z=T1-T5+5;
RUN;
PROC PRINT DATA=TEST_03;
RUN;
*: EXCERCISE;
DATA TEST4;
INPUT ID T1 S1 T2 S2 T3;
CARDS;
1001 1 2 3 4 5
1002 2 . 7 8 1
1003 -1 5 . 4 .
1004 . 4 2 . 3
;
RUN;
* USING VARIABLE SHORT CUT;
DATA TEST_04;
SET TEST4;
TOTAL_Ts=SUM(OF T1-T3);*SUM(OF T1-T3)=sum(T1, T2, T3);
TOTAL_SCORE=SUM(OF T1--T3);*SUM(OF T1--T3)=sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them;
Num_Ts=N(OF T1-T3);
TOTAL_number=N(OF T1--T3);* MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3;
PROC PRINT DATA=TEST_04;
RUN;
DATA TEST4;
INPUT ID T1 S1 T2 S2 H T3;
CARDS;
1001 1 2 3 4 4 5
1002 2 . 7 8 3 1
1003 -1 5 . 4 0 .
1004 . 4 2 . 1 3
;
RUN;
* USING VARIABLE SHORT CUT;
DATA TEST_04;
SET TEST4;
TOTAL_Ts=SUM(OF T1-T3);*SUM(OF T1-T3)=sum(T1, T2, T3);
TOTAL_SCORE=SUM(OF T1--T3);*SUM(OF T1--T3)=sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them;
Total_Ts_and_Ss=sum(of T1-T3,of s1-s2);
Num_Ts=N(OF T1-T3);
TOTAL_number=N(OF T1--T3);* MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3;
PROC PRINT DATA=TEST_04;
RUN;
DATA TEST_06;
SET TEST4;
x=SUM(OF T1-S2);*
NOTE: No observations in data set WORK.TEST_06
;
PROC PRINT DATA=TEST_06;
RUN;
DATA TEST_07;
SET TEST4;
x=SUM(OF T1--S2);
;
PROC PRINT DATA=TEST_07;
RUN;
DATA TEST4;
INPUT ID T1 S1 T2 S2 G3;
CARDS;
1001 1 2 3 4 5
1002 2 . 7 8 1
1003 -1 5 . 4 .
1004 . 4 2 . 3
;
RUN;
DATA TEST_08;
SET TEST4;
x=SUM(OF T1--G3);*Any numeric variable between T1 and G3;
;
PROC PRINT DATA=TEST_08;
RUN;
DATA TEST4;
INPUT ID T S1 T2 S2 G;
CARDS;
1001 1 2 3 4 5
1002 2 . 7 8 1
1003 -1 5 . 4 .
1004 . 4 2 . 3
;
RUN;
DATA TEST_09;
SET TEST4;
x=SUM(OF T--G);
;
PROC PRINT DATA=TEST_09;
RUN;
DATA TEST4;
INPUT ID T S1 T2 S2 F $ G;
CARDS;
1001 1 2 3 4 M 5
1002 2 . 7 8 F 1
1003 -1 5 . 4 M .
1004 . 4 2 . m 3
;
RUN;
proc print;run;
DATA TEST_10;
SET TEST4;
x=SUM(OF T--G);
;
PROC PRINT DATA=TEST_10;
RUN;
*************Happy Learning!************************
libname hamid "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\My sas library";
/**** The most 7 important proc:
1)proc print to brwos data portion
2)proc contents to brows desciptor portion
3)proc freq to summarize chractor variables: create a table that gives us frequency of each level( unique values)
and chi-square test
4)proc means to summarize numerical variables
5)proc univariate to summarize numerical variables with more information
and test of normality
6)proc sql to use sql syntax
7)proc sort to sort data and deduplicate data
we talk about them and other procedures during this course and advanced SAS programming
*Note: we have more procedures like proc import for importing data, proc export for exporting data,
proc ttest, proc Anova, proc reg, ...
***/
data my_1st_data;*my_1st_data=work.my_1st_data;
input name $ age gender $ weight height;/* we use input statement to assign variable's name= column's name=header :
variables'name:
1)are limited to 32 characters.
2)must begin with a letter or underscore.
3)can contain only letters, numbers, or underscores. Blanks are not allowed*/
*To create a character column, specify a $ sign righ after the sapace after
column name.;
*Note: unlike SQL that we use comma to separate columns names in SAS we use sapace;
* when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.;
*Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns;
/* For inserting data you should type cards or datalines then semicolon and in new rows type observations after
the last observations(rows) go to new line and type semicolon*/
*for inserting values use cards or datalines;
cards;
Hamid 41 . 82 185
Nick . M . 124
Sarah 34 F 76 167cm
. 4 F 18 115
Aryan 12 M 67 156
Rosa 34 F 57 163
;
*you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted;
run;
*Browsing Data portion;
proc print data=my_1st_data;run;
proc print data=my_1st_data;
var gender;
run;
proc print data=my_1st_data (obs=5);
run;
proc print;run;*if you skip naming data, SAS conduct that procedure on the most recent data;
*Browsing decriptor portion;
proc contents;run;
proc contents data=my_1st_data;run;
proc contents data=my_1st_data varnum;run;
proc contents data=my_1st_data varnum short;run;
*******************Descriptive statistic ( summarizing existig data)(part1)*********************;
*For each variable it shows frequency of each level(frequency of each unique values);
proc freq data=my_1st_data;run;
proc freq data=my_1st_data;
table gender;*unlike other procedurs that we specify name of variable by using var statement in proc freq we use table since we looking for table that shows us frequency of each levels;
run;
*Summerizing numeric variables;
proc means data=my_1st_data;run;
proc means data=my_1st_data;
var age;
run;
proc univariate data=my_1st_data;run;*proc univariate by default returns more information compare to proc means and you can draw a grapgh with it as well that I will teach you;
proc univariate data=my_1st_data;
var age;
run;
*Using SQL code in SAS;
proc sql;
select * from my_1st_data;
quit;
proc sql;
select gender,avg(weight) as average_of_weight , count(*) as number
/* in this version of SAS 1."as" in proc sql is not optional
and 2.you can not say [avrerage of weight] */
from my_1st_data /*notice that my_1st_data is a SAS dataset and exits in work SAS library */
where gender ne ''
group by gender
order by average_of_weight Desc
;
quit;
*Reacall that:
order in SQL select statement:
select ....
from ....
where ....
group by ....
having ....
order by ....
rermeber
"Some Female Workers Go Home Ontime"
;
* Sorting data;
Proc sort data=my_1st_data;* it will sort the original data unless you will specify output(i will talk about that);
By age;*sorting by age deafault=ascending;
Run;
proc print data=my_1st_data;run;
*Note that in SAS missing values are considered less than any value that you can assume;
*i.e. in SAS unlike R and Python, missing values consider less than any values that you have;
Proc sort data=my_1st_data;
By descending age;*in SAS unlike SQL 1.we use complete word of descending 2.before variable name;
Run;
proc print data=my_1st_data;run;
Proc sort data=my_1st_data;
By desending age;*descending=desending=decending;
Run;
proc print data=my_1st_data;run;
data test_2;
input id $ name $ family $ height ;* several sapces is the same as one space;
cards;
001 Hamid Rajaee 184.7
002 Mudasir khan 164.6
003 Nick rajaee 154.2
;
run;
proc print data=test_2; run;
******************************************************************************************************;
/*FYI*/
*Label (part1);
******************************************************************************************************;
*LABLE STATEMENT ;
*LABEL TO VARIBLE TO BE MORE DESCRIPTIVE;
*CAN BE UP TO 256 CHARACTERs can conatain any charector even space;
data hamid.test_2;
set test_2;
label name="first name" family="last name";
run;
proc print data=hamid.test_2;run;
*if you want to get labels in print you need to use lable option in proc statement;
proc print data=hamid.test_2 label;run;
proc print data=hamid.test_2 label;
label name='Given Name' family="Family Name";
run;
* if you assign lable in data step it will saved with data ( you can overwrite later if you want)
but if you assign lable in [proc step it won't saved with data and only apear in that proc;
proc contents data=hamid.test_2;run;
proc print data=hamid.test_2 label;
label family="Family Name of classmates";
run;
* in labels you don't need to put them inside qoutation marks;
proc print data=hamid.test_2 label;
label name=Fisrt Name of Student family=Family of student height=height of student in cm;
run;
DATA Hamid.CARS;
SET SASHELP.CARS;
LABEL MSRP=MANUFACTURAL RETAIL PRICE
ORIGIN =ORIGIN OF CONTINENT;
RUN;
proc print;run;
proc print data=hamid.cars label;run;* to see the labels in proc print statement you must type label;
* for displaying lable you have to type lable keyword in print statement then in other lines
define lable unless you define lable before;
proc print data=hamid.cars label;
label make="name of maker";
run;
proc print data=hamid.cars label;run;
/*if you put a LABEL statement in a data step, the labels will be saved with the data
if you put a LABEL statement in a PROC, then the lables will be used only by that particular execution.
*/
Data myFirstLabeledData;
set my_1st_data;
label height="Height of person";*wraping label with qoutation marks is not mandetory;
Run;
proc print data=myFirstLabeledData;run;
proc print data=myFirstLabeledData label;run;* to see the labels in proc print statement you must type label;
proc contents data=myFirstLabeledData;run;
proc contents data=my_1st_data;run;
Data LabeledData;
set my_1st_data;
label height=Height of person;*wraping label with qoutation marks is not mandetory;
Run;
proc print data=LabeledData label;run;
Data LabeledData2;
set my_1st_data;
label height=Height of person name=first name;*wraping label with qoutation marks is not mandetory;
Run;
proc print data=LabeledData2 label;run;
proc print data=LabeledData2 ;run;
proc contents data=labeleddata2;run;
*********************************************************************
/*create dataset based on one existing dataset in SAS*/
*********************************************************************;
*subset ;
*Ex. Create a data set that stores all records of cars data set in sashelp with origin=Asia;
data test;/* test=work.test and it is the name of output data
this statement create table and its name is test and its library by default is work,
if you want to save it permanetly you have to put a libref
befor test and dot like: Hamid.test but because we didn't do that here it is temporary data and saved in work library*/
set sashelp.cars; /* sashelp.cars is the name of input data
that means gives data from library =sashelp and table = cars*/
where origin ="Asia";
Run;*Check log after any execution ;
proc print;run;*characters are on left side, numeric are on the right side;
*Ex. create a subset of cars data set in sashelp that includes only Sedans from Europe;
Data test2;
set sashelp.cars;
where oRiGin="Europe" and type="Sedan";*two conditions;
Run;
proc print;run;
*Ex. create a subset of cars data set in sashelp that includes only european cars;
Data test3;
set sashelp.cars;
where origin="EUROPE";*O observation means no observation meet that condithion because SAS in case sesitive for values;
run;
proc print;run;
*Note: SAS is case sensitive in terms of vaues inside its table;
data test1;
set sashelp.cars;
where origin="ASIA";
run;
proc print data=test1;run;
data test2;
set sashelp.cars;
where origin="asia";
run;
proc print data=test2;run;
proc freq data=sashelp.cars;
table origin;
run;
************************;
*upcase;
**************************;
Data Asia;
set sashelp.cars;
where upcase(origin)="ASIA";
/* origin=upcase(origin);*/ *If you want to upcase values of origing permanently;
run;
proc print data=Asia; run;
Data test1c;
set sashelp.cars;
where upcase(origin)=upcase("EuropE");* or upcase(origin)="EUROPE" ==>upcase(origin) converts all of values of origin column to upercase temporary and compare them with upcase("eUropE") that is "EUROPE";
run;
proc print;run;
Data test1c;*test1c=work.test1c is the output data which is temporary data;
set sashelp.cars;*sashelp.cars is the input data which is permanent data;
where upcase(origin)="EUROPE";
run;
proc print;run;
****************************************
distinct (unique) values
***************************************;
*If you want to make sure about unique valus of origin;
proc freq data=sashelp.cars;
table origin;
run;* so you can see that we ahve only "Europe", "Asia" , and "USA" and we don't have "EUROPE" or "europe" or "eUropE"
***********************************;
*creating new variable(column);
**************************************;
*you can create a new variable in data step by assinment to the new name;
* you can create a new variable in data step by just using a new name and assigning a value to it;
data hamid.test4;
set sashelp.cars;
where origin="Europe";
newprice=invoice*1.5;*creating new variable;
run;
proc print;run;
proc print data=my_1st_data;run;
data my_1st_data;*my_1st_data=work.my_1st_data;
input name $ age gender $ weight height;
cards;
Hamid 41 . 82 185
Nick . M . 124
Sarah 34 F 76 167cm
. 4 F 18 115
Aryan 12 M 67 156
Rosa 34 F 57 163
;
run;
proc print data=my_1st_data;run;
data data1;
set my_1st_data;
new_age=age+1;
run;
proc print;run;
/*updating an exsiting variable*/
data data1;
set my_1st_data;
age=age+1;* here i don't created a new varaible , I just want to change existing variable;
run;
proc print;run;
********************************
*Delete a variable
*****************************;
* In data step you can delete a variable either by drop opthions or by using keep otions with other variables'name,
*********************;
*drop;
*********************;
Data test5 (drop=invoice type);* don't put comma between variables' name;
set sashelp.cars;
where origin="Europe";
newprice=invoice*1.5;*creating new variable;
Run;
proc print;run;
Data test5b (drop=invoice type);* in SAS don't put comma between variables' name;
set sashelp.cars;
where origin="Europe";
Run;
proc print;run;
data data1(drop= age);*this is drop option;
set my_1st_data;
new_age=age+1;
run;
proc print;run;
data data1;
set my_1st_data;
new_age=age+1;
drop age;* this is drop statement;
run;
proc print;run;
*************************;
*keep;
*****************************;
Data test5b (keep=make origin type newprice);* in SAS don't put comma between variables' name;
set sashelp.cars;
where origin="Europe";
newprice=invoice*1.5;
Run;
proc print;run;
data data1(keep= name gender new_age);*this is keep option;
set my_1st_data;
new_age=age+1;
run;
proc print;run;
data data2;
set my_1st_data;
new_age=age+1;
keep name gender new_age;*this is keep statement;
run;
proc print;run;
Data test6 (keep=model type make newprice);
set sashelp.cars;
where origin="Europe";
newprice=invoice*1.5;*creating new variable;
Run;
proc print;run;
Data test6 ;
set sashelp.cars;
where origin="Europe";
newprice=invoice*1.5;*creating new variable;
keep model type make newprice;
Run;
proc print;run;
******************************************;
*Title/FOOTNOTES STATEMENTS;
*********************************************;
*Up to 10 TITLEs/FOOTNOTS STATEEMENTS;
TITLE "THIS IS MY FIRST REPORT";
TITLE2 "THIS IS MONTHLY@#$%^& SALE REPORT";
TITLE3 "HERE IS ANOTHER TITLE";
FOOTNOTE "THIS IS MY FIRST FOOTNOTS";
FOOTNOTE2 "THIS IS MY SECOND FOOTNOTE";
FOOTNOTE3 "THIS IS MY THIRD FOOTNOTE";
PROC PRINT DATA =data1;
RUN;
***********************************************
proc print
************************************************;
*Note: proc print have more options that allowes you to choose featurs(columns)(variables) and
number of obs(rows) that you want to print;
*proc print with var satetments;
PROC PRINT DATA =SASHELP.CARS;
VAR TYPE ORIGIN MAKE INVOICE;* I used Var statements in proc print to specify which features I want to see in print;
RUN;
*Note: we can use drop or keep in data step when you want to select some varaibles durnig getting a copy of data
but in poc step we use var statement to slect some varaibles or table statement in proc freq;
*HOW TO CANCEL PREVIOUS TITle AND FOOTNOTES STATEMENT?
By making them empty or assigning new one to overwright previouse one;
TITLE;
FOOTNOTE;
*proc print with obs and firstobs option;
PROC PRINT DATA =SASHELP.CARS (obs=5);*it prints the first 5 obs (head of data);
RUN;
PROC PRINT DATA =SASHELP.CARS (obs=10);run;
PROC PRINT DATA =SASHELP.CARS (obs=5 firstobs=3);
*it will start from 3rd rows(obs)( that means you don't want to print the first 2 obs) and end at 5 obs so you will see 5- 2=3obs;
RUN;
PROC PRINT DATA =SASHELP.CARS (obs=10 firstobs=3);
*it will start from 3rd rows(obs)( that means you don't want to print the fist 2 obs) and end at 10th obs so you will see 10-(3-1)=10-2=8 obs;
RUN;
*HMW2.1: Print the last 5 obs(tail of data) for sashelp.cars;
**********Internal data:Craeting data set and inserting data manually;
*so far you saw simple list input that
1.seprater is space and
2.you must type dot for missing values and
3.you can not have non-standard numbers like date, 48%, $23000, or comma for numbers like 3,642,912
4.you can store charectors up to 8 charectors( default length);
data mytest;
input id name $ family $ age dob;* several sapces is the same as one space;
cards;
001 Hamid Rajaee 41 14.05.85
002 Mudasir khan 32 9.06.90
003 Nick rajaee 6 20.12.10
;
run;
proc print data=mytest;run;
* for inserting any non-standard numbers in your dataset( table) you must use informat;
*************************************************************
inforamt Vs. format
************************************************************;
* for reading and writing non-standard numeric values you can use inforamt( for writing/inserting values)
and format( for reading/printing values);
data mytest;
input id name $ family $ age dob;* several sapces is the same as one space;
informat dob ddmmyy8.;*informat for iserting/reading perpuse
Note:right after each format_name put dot;
cards;
001 Hamid Rajaee 41 14.05.85
002 Mudasir khan 32 9.01.60
003 Nick rajaee 6 20.12.10
;
run;
*note: some availble format :
14.05.85 or 14/05/85 ==>ddmmyy8. or ddmmyy.
17/03/2013 ==>ddmmyy10.
03/17/2013 ==>mmddyy10.
17MAR13 ==>Date.
17MAR2013 ==>Date9.
you can find more by typing date format in help window( use help tab)
;
/*SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS*/
/*For reading date to sas(Informat) we should specify format of date*/
/*Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and
3rd 2 digit is year*/
* after checking a log you will find that this data is created successfully;
* but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960;
proc print data=mytest;run;
* we use informat for reading and we use format for writing;
* to see the result for dob in undrstanding form use format statement;
proc print data=mytest;
format dob ddmmyy8.;
run;
proc print data=mytest;
format dob ddmmyy.;
run;
* you can use any other date format;
*you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values;
proc print data=mytest;
format dob ddmmyy10.;
run;
proc print data=mytest;
format dob date9.;
run;
proc print data=mytest;
format dob weekdate32.;
run;
data mytest;
input id name $ family $ age dob;* several sapces is the same as one space;
informat dob ddmmyy8.;/*informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step*/
format dob date9.;
/*if you define format in data step it will saved with it but you can override whenever you want */
cards;
001 Hamid Rajaee 41 14.05.85
002 Mudasir khan 32 9.01.60
003 Nick rajaee 6 20.12.10
;
run;
proc print data=mytest;
run;
proc print data=mytest;
format dob ddmmyy10.;
run;
proc print data=mytest;
format dob mmddyy10.;
run;
*recall that in simple list input you are not allowed to have space inside values;
*Formatted input enables you to read data stored in non-standard form, such as dates, percent or
numbers with commas.
and you don't need to put dot for missing values when you insert data
but all values of each column must starts from the same position;
*Note: in SAS unlike Python index strarts from 1;
data test1;
input @1 dob mmddyy6. /*note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1*/
@8 fname $10. /*Read values based on starting (here @8 means column#8 in dataliens/Cards)and total length of its value=10 $ means it is charector and dot after 10 is mandetory after any format*/
@19 lname $5.
@25 height 5.
@31 sex $1.;
datalines;
012060 Amir pasha Jones 102.8 M
121104 Joan geray f
Nick kia 140.5
;/*we will get 4 obs because of the empty line after 3rd obs*/
run;
proc print; run;*we got dob as a number of day difference from 1st of Jan 1960;
proc print;
format dob ddmmyy10. ;
run;
data test1;
input @1 dob mmddyy6. /*note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1*/
@8 fname $10. /*Read values based on starting (here @8 means column#8 in dataliens/Cards)and total length of its value=10 $ means it is charector and dot after 10 is mandetory after any format*/
@19 lname $5.
@25 height 5.
@31 sex $1.;
format dob date9.;
datalines;
012060 Amir pasha Jones 102.8 M
121104 Joan geray f
Nick kia 140.5
;/*we will get 4 obs because of the empty line after 3rd obs*/
run;
proc print; run;
* FYI: You can mix foramtted input with simple list input;
data test1;
input @1 dob mmddyy6.
@8 fname $10. /*Read values based on starting (here @8 means column#8 in dataliens/Cards)and total length of its value=10 $ means it is charector and dot after 10 is mandetory after any format*/
lname $
height
sex $ ;
datalines;
012060 Amir pasha Jones 102.8 M
121104 Joan geray . f
Nick kia 140.5 .
;
run;
proc print;
format dob ddmmyy10. ;
run;
/*
June 2021
SAS Data Analyst for Marketing Department
Job descriptions:
Title: SAS Data Analyst
Rate: 68.71/hr INC or 60.59/hr T4
Client: BMO
Job Description:
We are looking for a talented individual to join our team in the position of Data Analyst. The primary responsibility is to leverage customer data to deliver data and recommendations that drive optimal customer journeys and experience.
The role will sit on the Enterprise Customer Analytics team and support Marketing and LOB partners across BMO FG with proactive, actionable insights on Retail and Wealth customers.
Required Skills and Experience:
• 2 years of tech skills: SAS EG, PROC SQL, SAS DATA STEP, merging (inner and outer join) and aggregating data across different granularities (transaction, account rolled up to customer) and time frame (daily, monthly)
• Soft Skills: ability to adapt to team operating standards and existing processes, strong attention to detail and able to QA results, time management and ability to work under pressure
• Does not have to be from a bank but large enterprise environment - not ETL ppl - this is a marketing dpt.
• Nice to have: communication with non technical people
Key Accountabilities:
• Perform data analysis in SAS
• Create reports for Executives in Excel
• Manage and enhance daily/weekly/monthly channel reporting
• Manage, prioritize, and respond to ad-hoc requests
• Leveraging statistical approaches (A/B testing, time series analysis) against multiple sources of data including internal sales/transactions and external party data/benchmarks
• Deliver targeted campaign leads in a multichannel environment
*/
*Non-standard numeric values eaxamples: 47%, $240, 11/10/2022 for using mathematics functions like average, sum, subtraction,... we need to spaecify they are numeric values
* for reading(insereting) non-standrd numbers we use informat and for writing ( printing) them we should use format
*Hint:You can save format when you read data by using format statement in this situation you don't need to use fromat when you writing data
but you still can override that fromat;
*hint: when all of values for each column start from the same position you can use formatted input like the follwingh example;
title;
data test2;
input @1 fname $6.
@8 lname $7.
@16 sex $1.
@18 dob mmddyy6.
@25 empdate mmddyy6.
@32 age 2.
@35 income 5.
@41 department $9.;
format dob empdate weekdate27.;
datalines;
Johon Martin M 92586 100172 49 50000 HR
Mary Smith F 101072 110292 40000 marketing
Johon1 Martin1 M 123069 100172 19 60000 Admin
Johon1 Martin1 M 123069 100172 19 60000 Admin
Mary1 Smith1 F 101072 110292 30 . Marketing
Mary1 Smith1 101072 110292 30 40000 Marketing
Johon2 Martin2 M 123069 100172 25 70000
Mary2 Smith F 110292 40 40000
Johon2 Martin2 M 123069 100172 25 60000
Mary2 Smith F 101072 110292 40 45000
;
run;
proc print data=test2;run;
proc print data=test2;
format dob weekdate21. empdate date9.;
run;
* BROWSING THE DESCRIPTION PORTION ;
PROC CONTENTS DATA=test2 ;
RUN;
* BROWSING THE DATA PORTION ;
*head;
proc print data=test2 (obs=5);
format dob empdate date9. ;
run;
*tail;
proc print data=test2 (obs=10 firstobs=6) ;/*OBS= tells SAS when to stop processing observations and firstobs means from which starts
here we have 10-(6-1)=5 obs in pint*/
format dob empdate date9. ;
run;
*DROP DUPLICATE OBSERVATION IF EXIST( deduplicate data);
PROC SORT DATA=test2 OUT=test3 NODUPKEY;*here input data is test2 and output data is test3 and both are stored at work library;
BY _ALL_;* if 2 obs( rows) have same values for all columns the second one will be consider duplicated;
RUN;
*chech the Log ==> 1 observations with duplicate key values were deleted;
/*FYI: Getting your data into SAS with proc IMPORT using GUI:
First Make sure that the data is closed
File==> Import Data==>Select the type of file for example CSV==> Brows the location of file==>
Selecte the library that you want==> assign a name to it in member session==>
select a locatio for proc import to save that and assign a nmae for it==> finish
you need to check the log to make sure that you read data correctly
Note: By deafault sas uses only the first 20 rows to guess the datatypes of variables if
you get warning or error in the log you may neeed to assign a greater value to GuessingRow*/
*/
************************************************
*Segmentation(binning)and Discretization;
************************************************;
* SEGMENTATION is the process of dividing continuous variable into segments( different groups);
*Discretization :Process of transferring continuous variable into discrete variable;
/*Methods for changing values for segmentatioin or changing apearence :
a)with proc format
b) with conditinal statement like if statement or case when*/
/* Proc format */
*Note:We can use proc format to define new format for writing(printing) data, we normally use it to change values or encoding,
or segmenation(binning),
or...;
*SO Important:;
************************************
Proc format
*********************************;
/*in proc format we use value statement to assign a name for the format that we want to create and defining our format
Rules for defining a FORMAT name:
1.If you are creating a format for character variables,format name must start with $
i.e. use a dollar sign ($) as the first character of its name.
for numeric variables must begin with letters or underscore
2.The name must be a valid SAS name (Only letters, underscore and numbers are valid characters),
3.but Can’t end with a number and for charector variables must start with $
4.Max length: up to 32 charactor including $
so A numeric format name can be up to 32 characters. A character format name can be up to 31 characters
(because we have $ before that).
5.name can’t be the name of an existing format (can't overrride)
pay attention that the format name Can’t end with a number
Note: when you assign a name for format name in proc foramt it doesn't end with dot but after assigning its name
whenever you want to use it you must put dot at the end of its name to specify that is a format name
*/
*scenario:assume we have age column in our data and we want to do segmenation based on the following explenation:
If the person is youngr than 30 years old , it belongs to youth group,
If he/she is between 31 to 60, it belongs to midele aged group and
older than 60 ,it belongs to senior ;
*Since age is numeric column ,format name for it must starts with letter or underscore;
proc format;
value agegroup /*agegroup is the name of this foramt that we want to use it for numerical values that's why
we don't have $ before its name
Note: after creating you are allowed to use this format for any numeric columns
Note:After assing a name to our format when ever we want to use this format we must put dot
after format name*/
low-30='youth' /*we are allowed to use low and high keyword note: =<30 is not acceptable in proc format*/
31- 60='midele aged'
61-high='senior'
;
run;
proc print data=test2;
format dob empdate date9. ;
run;
proc print data=test2;
format dob empdate date9. age agegroup.;/* I used format named agegroup,use dot after name of format*/
run;
*Note: be noticed that in proc format we use low and high keyword and it returs missing values for missing values;
/*Five advantages of using proc formats:
1.The orginal data will not been changed. You can use the original variable values in the analysis.
2.You can apply formats to both character and numerical variables.
3.You can use formats to merge categories and
to bin numeric variables.
4.you can apply a format to multiple variables in multiple data sets.
5.You can save formats in a permanent libref and use them across SAS sessions.*/
*Note:You can define several format name in one proc format ;
proc format;
value ageg
low-25='=<25' /*you can use double qoutatiopn marks*/
26- <36='26-35'
36- <46='36-45'
46- <56='46-55'
56-high='Senior'
;
value $genderg /* because I want to use this format for categoriacl collumn , its name must start with $*/
'M'='Male'
'F'='Female'
;
value $sexfmt /*recall that to create a format for a character variable, the format name must begin with $:*/
'M'=1
'F'= 0
;/*don't forget ; at the end of any value statement*/
run;
proc print data=test2;
format sex $genderg.;
run;
proc print data=test2;
format sex $sexfmt.;
run;
data test1b;
set test2;
format age ageg. sex $sexfmt.;
run;
proc print data=test1b;run;
proc contents data=test1b;run;
data Hamid.test2;
set test2;
format age ageg. sex $sexfmt. dob empdate date9.;
run;
*I saved test data set that is temporary data set in work library as a permanent data set with the name of
test2 in permanent library that its libref is hamid with saved format;
proc print data=Hamid.test2;run;
proc contents data=hamid.test2;run;* be noticed that the data type of age and sex;
*If you want to have original value for age and created new formatted value you can do like below:
;
data hamid.test3;
set test2;
age_groupe=age;*creating new variable that its name is age_groupe and its values are the same as values in age column ;
format dob empdate date9. age_groupe ageg.;
run;
proc print;run;
data test1;
set test2;
gender=sex;*creating new variable that its name is gender and its values are the same as values in sex column ;
age_group=age;
format age_group ageg. gender $genderg.;
run;
proc print;run;
proc contents data=test1;run;
/*b) with if statement: you can use if statement for segmentation(bining).with if you can define new variable as well*/
*When you use if statement for binning(segnentation) you should be careful about 2 things;
/*1.If you define new variable inside if statement without defing length of that before it :
SAS will assign length of that new variable equals to length of first value that you initilized to it*/
/*you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement*/
/*2.Be carefule that SAS consider missing values less that any numbers that you can think in comparision
i.e
. < 30000
.< -234567890876546789
I will tell you how handle these problems ;*/
proc print data=test2;run;
*senario: income less than 30000 is low income , between 30000 and 50000 (not include 50000) is mediom income and greater than or equal to 50000 is high income;
data test4;
set test2;
*lets create new variable (type) based on income;
if income<30000 then type='low income';*SAS is going to create new column that its name is type and length of that
is based on initial value;
else if income<50000 then type='mediom income';*else if 30000=<income<50000 then type='mediom income';
/* if you use "if " instead of "else if" the second one overrride 1st one*/
else type='high income' ;
run;
proc print data=test4; run;
data test5;
set test2;
if income<30000 then type='low income';
if income<50000 then type='mediom income';
/* if you use "if " instead of "else if" the second one overrride 1st one
unless you say if 30000=<income<50000 then type='mediom income'*/
else type='high income' ;
run;
proc print; run;
*As I mentioned we have two problem when we use if statement 1st length of values 2nd missing values
but we can fix them:;
/*you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement*/
data test4;
set test2;
if income<30000 then type='low income ';
else if income<50000 then type='mediom income';
else type='high income' ;
run;
proc print; run;
data test6;
set test2;
length type $15;*13 or more are fine;
if income<30000 then type='low income';
else if income<50000 then type='mediom income';
else type='high income' ;
run;
proc print; run;
*Hint: Be careful that SAS consider missing values less that any numbers that you can think
you can use one of the follwing methods to handle missing values easily;
data test8;
set test2;
length type $15;*13 or more are fine;
if 0<=income<30000 then type='low income';
else if 30000<=income<50000 then type='mediom income';
else if income>= 50000 then type='high income' ;
run;
proc print; run;
data test9;
set test2;
length type $15;*13 or more are fine;
if missing(income) then type='';
else if income<30000 then type='low income';
else if income<50000 then type='mediom income';
else type='high income' ;
run;
proc print; run;
*Coditional statement;
*in data step you can use if or where but in proc step you are allowed to use only where;
*Hint: If you want to add condition in procedure if statement doesn't work you should use where statement
I will talk about if Vs. Where in detail;
************************************************************************;
/* Proc print (part2 of 2)*/
*************************************************************************;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
RUN;
*head;
PROC PRINT DATA =SASHELP.CARS(OBS=5);
RUN;
*tail;
*first find number of obs;
proc contents data=sashelp.cars;run;
PROC PRINT DATA =SASHELP.CARS(OBS=428 firstobs=424) ;/*OBS= tells SAS when to stop processing observations and firstobs means from which starts
here we have 428-424+1=5 obs in pint*/
RUN;
*________________________________________________________________________;
*LIMITING THE NUMBERS OF VARIABLES;
*_________________________________________________________________________;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
VAR ORIGIN TYPE Cylinders INVOICE;
RUN;
PROC PRINT DATA =SASHELP.CARS(keep= ORIGIN TYPE Cylinders INVOICE );
RUN;
PROC PRINT DATA =SASHELP.CARS(drop= ORIGIN TYPE Cylinders INVOICE );
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=10);
var origin invoice;
RUN;
*note: we can use if as well as where for filtering records. But there are some points that
you need to pay attention;
*****************************************
WHERE
Where Vs. If
*****************************************;
* We are not allowed to use if in proc step;
*For FILTERING THE DATA VALUES in proc Use WHERE CLAUSE;
* some operators that you can use with where:;
*GE GREATER THAN OR EQUAL TO;
*GT GRATER THAN;
*LE LESS THAN OR EQUAL;
*LT LESS THAN;
*EQ EQUAL;
*NE NOT EQUAL;
*IN OPERATOR: for checking if that value belongs to that set (checking membership) ;
*BETWEEN... AND... :to check if that value belongs to that range including end points;
PROC PRINT DATA =SASHELP.CARS(OBS=5);*SAS will print the first 5 observations that met follwing condition;
WHERE Cylinders GE 6;/*GE is the same as >=*/
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=5);
WHERE Cylinders >= 6;
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
If Cylinders GE 6;*Note:unlike where statement ,you are not allowed to use if in proc step you can use if only in data step ;
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE Cylinders IN (4,8);*Cylinders IN (4,8) is the same as Cylinders =4 or Cylinders =8;
RUN;
*Same as following code:;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE Cylinders =4 or Cylinders =8;*Cylinders IN (4,8) is the same as Cylinders =4 or Cylinders =8;
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE Cylinders BETWEEN 4 AND 8;* Cylinders BETWEEN 4 AND 8 is the same as Cylinders Ge 4 AND Cylinders LE 8;
RUN;
*Same as following code:;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE Cylinders Ge 4 AND Cylinders LE 8;
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE Cylinders IN (4);*Cylinders IN (4) is the same as Cylinders =4 and it is the same as Cylinders eq 4 ;
RUN;
*Note:
*Cylinders IN ("4") is not correct (WHERE clause operator requires compatible variables.
;
*Same as following code:;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE Cylinders =4;
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=10);*it will be prit the 1st 10 obs that are satisfied following conditions;
WHERE type IN ("SUV","Wagon");
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=10);
WHERE type IN ("SUV","WAGON");
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE upcase(type) IN ("SUV","WAGON");
RUN;
*upcase (Wagon)=WAGON
*upcase (sedan)=SEDAN
*upcase(SUV)=SUV;
* or you can check all leveles by using proc freq ;
proc freq data=SASHELP.CARS;
table type;
run;
proc freq data=SASHELP.CARS;
table origin;
run;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE ORIGIN IN ("Asia");
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE ORIGIN ="Asia";
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE ORIGIN IN ("Asia","Europe");
RUN;
*not IN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE ORIGIN not IN ("Asia","Europe");
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE ORIGIN NE "USA";
RUN;
PROC PRINT DATA =SASHELP.CARS(OBS=20);
WHERE ORIGIN not IN ("USA");
RUN;
proc print data=test;run;
*************************************;
* informat vs. format;
************************************;
* informat is for reading purpose, format is for writing purpose you can use format in data step or any proc step;
*READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS;
*DISPLAY THESE NON STANDARD NUMERIC :FORMATS;
*Recall that:;
*TYPES OF DATA : 1.NUMERIC 2. CHARACTER;
*NUMERIC:STANDARD AND NON STANDARD
*NON STANDARD NUMERIC DATA: E.G DATE:14JAN 1970,75%, 1,000,000, $240;
proc print data=test2;
format dob empdate date9. age ageg. sex $genderg. ;
run;
proc print data=test2 (obs=7 firstobs=3) ;
/*OBS= tells SAS when to stop processing observations and firstobs means from which starts
here we have 7-3+1=5 obs in pint*/
format dob empdate date9. ;
run;
proc print data=test2 noobs;/*noobs means no index */
format dob empdate date9. ;
run;
proc print data=test2 noobs ;/*noobs means no index */
format dob empdate date9. ;
by sex;
run;*Please check the log;
/*So important:
statement make sure that your data is sorted by that value before*/
proc sort data=test2;
by sex;
run;
proc print data=test2 noobs ;/*noobs means no index */
format dob empdate date9. ;
by sex;
run;
*Label 2/3;
*Unlike variable's name that must be less than or equal 32 charectors starts with letters or underscore and
can conatain only letters, underscore and numbers
label can contain and start with any thing but must be less than or equal 256 charectors;
PROC PRINT DATA =sashelp.CARS (obs=5);run;
PROC PRINT DATA =sashelp.CARS (obs=5) LABEL ;
label type="@@##Type of car:";run;
PROC PRINT DATA =sashelp.CARS (obs=5) LABEL ;
label type=@@##Type of car:;run;
proc print data=test2;
* if you want to use lable statement you have to type "lable" keyword in first statemen as well;
label fname=first name lname=last name;
run;
proc print data=test2 label;
* if you want to use lable statement you have to type "lable" keyword in first statemen as well;
label fname=first name lname=last name;
run;
* if you want to print ecah group in one table you should use by statement in print;
proc print data=test2 label;
* if you want to use lable statement you have to type "lable" keyword in first statemen as well;
label fname=first name lname=last name;
by sex;* you might get somthing wrong;
run;
*So important: when ever in SAS you want to use By statement you must sort the data by that variable first unless it was alredy sorted;
proc sort data=test2;by sex;run;
proc print data=test2 label;
label fname=first name lname=last name;
by sex;
run;
proc print data=test2 noobs label ;
* if you want to use lable statement you have to type "lable" keyword in first statemen as well;
var fname lname empdate income sex age ;
format empdate date9. age ageg. sex $genderg. ;
where sex='M' and age>20;*two conditions;
by income; * to print different table for each income;
/*Note: every time you use' "by" for any variable' you must sort by that variabe first unless the data was sorted before;
look at the result with and without sort*/
label fname=first name lname=last name;
title 'this is an example';
footnote "created by hamid";
run;
*please check the log, although we got a table in results window, but it is not reliable because we have error in log;
proc sort data=test2 out=test1;
by income;
run;
proc print data=test1 noobs label;
var fname empdate income sex age lname;
format empdate date9. age ageg. sex $genderg. ;
where sex='M' and age>20;
by income;
label fname=first name lname=last name;
title 'this is an example';
footnote "created by hamid";
run;
title ;
footnote;
**********************************************************;
/* Proc sort 1/2*/
**********************************************************;
*SORT PROCEDURES;
*BY DEFAULT IT IS ASCENDING(from smallest to greatest) :
Note:1.For chategorical variables First it is gana sort based on the first digit or charector:
missing values < 0-9 < A-Z < a-z ,
(in SAS upper case less then lower case and in alphebetic order for example A<a, Z<a A< B,a<b, A<a, B<a)
if the first digit or charector are the same it is gana check the second one and so on
Recall that SAS is case sensitive in terms of values
2.for numeric varaiblles it is going to sort based on their values ,missing comes first;
dATA TEST0;
INPUT NAME $ @@;*list input;
*Note:@@ means you may have more than one observation in atleast one line;
cards;
Ali ali ALi Babak babak babi arun . aFi . aadel Adel 15 2
Foysal andrrie . Anna
;/* it has to be after last data in new line*/
run;
proc print; run;
proc sort;
*Because we didn't mention the name of data, it is gana sort the most recent data that we excuted
and because I didn't use out option it is going to change the original data;
by name;
run;
proc print; run;
proc sort data=test0 out=test0_1; /* when you are sorting data it is better than save output in different table because
1)normaly we don't like to change original table
2)it takes a lot of sources to shuffel and save in the same table*/
by descending name;*unlike sql that we mention ascending or descending after variable name in short form
in SAS we must specify decending before any variable in complet form;
run;
proc print; run;
proc print data=sashelp.cars;run;
/*In SAS we use Descending before any variable, in sql we use Descending after variable and
in short form(desc)*/
proc sort data=sashelp.cars out=test2;*Input data is sashelp.cars output data is test2=work.test2;
by Decending type invoice;/*descending=Decending ,here it was sorted first by type in descending order , then for all observations in
the same type they
were sorted by invoice in ascending order(default is ascending)*/
run;
proc print data=test2 ;run;
proc sort data=sashelp.cars out=test3;
by Decending type Decending invoice;/*here it was sorted first bt type in desceng order , then for all observation in same type they
were sorted by invoice in decending order*/
run;
proc print data=test3 ;run;
/*import data participant1.csv and save it as HAmid.Parti*/
proc print data=HAMID.PARTI ;run;
proc contents data=hamid.parti;run;
PROC SORT DATA=Hamid.parti OUT=Hamid.PARTICIPANT_ASC;
BY SALARY;*by default it will sort in Ascending order;
RUN;
* note : if data type of variable is Num 2<15 , if it is Char 15<2;
PROC PRINT DATA = Hamid.PARTICIPANT_ASC;
RUN;
*DESCINDING ORDER;
PROC SORT DATA=Hamid.PARTi OUT=Hamid.PARTICIPANT_DSC;
BY descending SALARY;
RUN;
PROC PRINT DATA = Hamid.PARTICIPANT_DSC;
RUN;
PROC PRINT DATA = Hamid.PARTICIPANT_DSC (obs=1);
RUN;
proc means data=Hamid.PARTi;* i will talk about proc means in next lecture;
var SALARY;
run;
proc print data=hamid.parti;
where SALARY=849;
run;
**************nodupkey;
*you can use nodupkey in proc sort to drop observations that have same values for featurs that you are sorting by them
it will keep first one and drop other duplicated in that column;
proc sort data=hamid.parti nodupkey out=hamid.parti2;
by sex;/*deletes observations with duplicate BY values (here no sex duplicated)*/
* hint: for duplicate value in sex keep first row and drop others;
run;
proc print;run;
* senorio : Find highest income for each level of sex in hamid.parti;
PROC SORT DATA=hamid.parti OUT=t2;
BY sex descending salary;
RUN;
PROC PRINT DATA= t2;
RUN;
PROC SORT DATA=t2 nodupkey OUT=t3;
BY sex ;
RUN;
PROC PRINT DATA= t3;
var sex salary;
RUN;
*Be carefull if you use nodupkey when you sort two variables SAS is going to consider one observation duplicated if it has same values for
thoes two varible with at least one of priviouse obs;
PROC SORT DATA=t2 nodupkey OUT=t4;
BY sex salary;* here SAS considers one obs is duplicated if sex and pai both are the same with priviouse obs;
RUN;
PROC PRINT DATA= t4;
RUN;
*Handling duplicated data;
*We call one observation duplicated if it has same values for all columns with one of the priviouse obs;
proc sort data=test nodupkey out=test4;
by _all_;
run;
proc print;run;
proc sort data=test1 out=Hamid.test5;
by age descending income;
format dob empdate date9. ;
run;
proc print;run;
proc sort data=hamid.test5 nodupkey;
by age;
run;
proc print;run;
data tt;
input id age income;
cards;
1 50 43000
2 40 40000
2 39 39000
2 41 50000
3 30 65000
3 30 70000
;
run;
proc print;run;
*Hmw2.2: Find the highest income for each employee;
* Example Universe Cars is surveying its customers as to their preferences for car colors.
They have information about the customer’s age, sex (coded as 1 for male and 2 for female),
annual income, and preferred car color (yellow, gray, blue, or white).
Here are the data:
19 1 14000 Y
45 1 65000 G
72 2 35000 B
31 1 44000 Y
58 2 83000 W
29 1 14000 Y
35 1 65000 G
62 2 35000 B
21 1 44000 Y
38 2 83000 W
sex (coded as 1 for male and 2 for female)
car color (yellow, gray, blue, or white);
*Hmw2.3: read above data and change value of sex and color to:;
*sex (coded as 1 for male and 2 for female)
car color (yellow, gray, blue, or white);
*List input;
data test;
input gender $ age income;
cards;
F 22 23400
F 40 45000
F . 26800
F 22 44900
F 56 65000
M 23 38000
M 28 47250
M . 68500
M 22 95600
M 42 .
. 40 85230
M 25 96000
;run;
proc print;run;
*Hmw 2.4:;
*Create user defined formats/custom format for converting F to female and M to male and conduct segmentation for age ;
*binning :segmentaion for age by adding new column age_group and divide them to youth(<35), middle aged(35-60), senior(>60);
/*Recall that in simple list input :
1.we have space as a delimeter and (we can solve this problem by using delimeter in modified list input)
2.we don't have non_standard numbers (we can solve this problem by using informat in modified list input)
3.missing values encoding with dot, (we can solve this problem by using delimeter in modified list input)
4.and charectros values have a default length=8 bytes(8 charectors)(
we can solve this problem by using length statement or informat in modified list input)
*/
******************************************************************;
***************************NON STANDARD NUMERIC;
* like : 98%, date, $678 or values that have comma 3,997,657
*READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS;
*DISPLAY THESE NON STANDARD NUMERIC :FORMATS;
DATA NUMBERS;
INPUT ID GENDER $ DOB;
CARDS;
1001 M 1FEB1987
1002 F 24MAR1990
;
RUN;
PROC PRINT DATA=NUMBERS;
RUN;
*INFORMATS;
DATA NUMBERS;
INPUT ID GENDER $ DOB DATE9.;
CARDS;
1001 M 1FEB 1987
1002 F 24MAR1990
;
RUN;
PROC PRINT DATA=NUMBERS;
RUN;
PROC PRINT DATA=NUMBERS;
format dob mmddyy10.;
RUN;
PROC PRINT DATA=NUMBERS;
format dob weekdate32.;
RUN;
*FORMAT STATEMENT CAN BE IN EITHER DATA OR PROC STEP;
DATA DOLLAR;
INPUT ID AMOUNT;
CARDS;
1001 $12,350.2245
1002 $15,456.8977
;
RUN;
proc print;run;
*DOLLAR format;
DATA DOLLAR;
INPUT ID AMOUNT DOLLAR12.4;* IT is modified list input that we specify informat in front of variable;
*DOLLAR12.4 is informat for reading but for writing
if you want to see $ sign you should define format
here in data step or in proc step;
CARDS;
1001 $12,350.2255
1002 $15,456.8977
;
RUN;
proc print;run;
proc print;
format amount dollar11.4;
run;
proc print;
format amount dollar11.2;
run;
PROC PRINT DATA= DOLLAR;
FORMAT AMOUNT DOLLAR9.2;
RUN;
PROC PRINT DATA= DOLLAR;
FORMAT AMOUNT DOLLAR7.2;
RUN;
*PERCENT format;
DATA PERCENT;
INPUT ID AMOUNT PERCENT8.2;
CARDS;
1001 45.23%
1002 56.99%
;
RUN;
PROC PRINT DATA =PERCENT;
FORMAT AMOUNT PERCENT8.2;
RUN;
PROC PRINT DATA =PERCENT;
FORMAT AMOUNT PERCENT8.1;
RUN;
***************************
*Type casting
***************************;
*Converting variable types from character to numeric and numeric to character:;
*you can convert numeric to charector by using put()function and
you can convert charactor to numeric by using input()function if its values is convertable;
/* put always convert from numeric to char and input always convert from char. to numeric */
data temp;
input numeric char $;
datalines;
789 0034
07 0009
067 1Hamid
. 001
1 9999
4 a
23 .
;
Run;
proc print;run;
data temp;
input num char $;
new_num=input(char,4.); /* I define new variable its name is new_num and convert character to numeric with 4 digits*/
new_char=put(num,3.); /*I define new variable its name is new_char convert numeric to character */
datalines;
789 0034
07 0009
067 1Hamid
. 001
1 9999
4 a
23 .
;
Run;
proc print;run;
/*If the character variable char in the above example contains missing values or non-numeric data then the
value of new_num will be missing. When char contains non-numeric data, an 'invalid argument' note will be written
to the log*/
**LABEL 3/3;
DATA funnies (LABEL = 'Comics Character Data');
INPUT Id Name $ Height Weight DoB MMDDYY8. @@;
LABEL Id = 'Identification no. '
Height = 'Height in inches'
Weight = 'Weight in pounds'
DoB = 'Date of birth';
*INFORMAT DoB MMDDYY8.;
FORMAT DoB WORDDATE18.;
DATALINES;
53 Susie 42 41 07-11-81 54 Charlie 46 55 10-26-54
55 Calvin 40 35 01-10-81 56 Lucy 46 52 01-13-55
;
run;
* Use PROC CONTENTS to describe data set funnies;
PROC CONTENTS DATA = funnies;
RUN;
proc print DATA = funnies;run;
proc print DATA = funnies label;
run;
**proc sort 2/2;
data test;
input @1 fname $6.
@8 lname $7.
@16 sex $1.
@18 dob mmddyy6.
@25 empdate mmddyy6.
@32 age 2.
@35 income 5.
@41 department $9.;
datalines;
Johon Martin M 92586 100172 49 50000 HR
Mary Smith F 101072 110292 40000 marketing
Johon1 Martin1 M 123069 100172 19 60000 Admin
Johon1 Martin1 M 123069 100172 19 60000 Admin
Mary1 Smith1 F 101072 110292 35 . Marketing
Mary1 Smith1 101072 110292 35 40000 Marketing
Johon2 Martin2 M 123069 100172 25 70000
Mary2 Smith F 110292 40 20000
Johon2 Martin2 M 123069 100172 25 60000
Mary2 Smith2 F 101072 110292 40 45000
;
run;
proc print;run;
proc print;
format dob empdate date9.;
run;
proc sort data=test out=test1;
by income;
run;
proc print data=test1;
format dob empdate date9.;
run;
proc sort data=test out=test1;
by income;
format dob empdate date9.;
run;
proc print;run;
proc sort data=test out=test1;
by sex;
run;
proc print;run;
proc freq data=test;
table fname;
run;
proc sort data=test nodupkey out=test2;
by fname;
run;
proc print;run;
proc freq data=test;
table fname lname;*this will create two frequency tables one for fname and one for lname;
run;
proc freq data=test;
table fname*lname;*this will give you a two-way table that has fname in rows and lname in columns;
run;
proc freq data=sashelp.cars;
table make*type;
run;
proc freq data=sashelp.cars;
table make*type/norow NOCOL NOPERCENT;
run;
proc freq data=sashelp.cars;
table make*type/ NOrow nocol;
run;
proc freq data=sashelp.cars;
table make*type/ NOPERCENT;
run;
PROC IMPORT OUT= HAMID.Credit_tr
DATAFILE= "C:\Users\hmdra\Desktop\Metro College_Teaching Mat
erials\New Teaching Binders\DMASM\Data\credit_train.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
GuessingRows=10000;
RUN;
评论
发表评论