sas note1
SAS programs are constructed from two basic building blocks:
1.DATA Steps (starts with data keyword)
2.PROC Steps (starts with proc keyword)
DATA and PROC steps are made up of statements. A step
may have as few as one or as many as hundreds of
statements.
We normally use data step for importing a dataset or creating a new dataset,
or getting a subset of data set, or manipulating a dataset
and we use proc step to print, summerize , or visualize data
note: Sas is case sensitive only for values inside its data but for keywords and identifires( name of variables,
name of tables, ...) is not case sensitive
note: you have to put semicolon at the end of each statementl
for execution: highlight the block of code that has to have run keyword at the end
then select running person icon or F3 or run tab>submit
Note: after any execution you must check the log even if you get the result
Data test1test1 is the name of table(dataset) that will be saved in work library
cA+b SAS is case sensitive only for values inside a table not for others
run
i used the above data step to craete a data now for print that data I will use proc print
any dataset in SAS has two leveles for naming like Sashelp.CARS , first we must mention the name of libraray
then dot after that name of data set
Note: you can skip typing the name of library only if it is work libraray i.e. EUROPEANCARWORK.EUROPEANCAR
NOTE: Work library is the only temporary libraray that means all you have there will be disappear after closing
current session
temp.data1 > the name of library is temp but it is permanent library( the only temporary library is work)
Get subset of sashelp.cars that includes only European cars
DATA EUROPEANCARWe start a data step to create a dataset that its name is EUROPEANCAR , EUROPEANCAR is output data
SET Sashelp.CARSSashelp.CARS is input data that I want to get subset of it
WHERE Origin "Europe" I added a condition that is Origin "Europe"
Note: SAS is case sensetive only for values of tables
you should put run statement at the end and highlight them togather then hit the running person icine for executing
syntax rules
SAS Program is a series of SAS statement
Most Statement not all of them begin with an identifying keyword
SAS statement must end with semicolon > so, for calculating number of statements count number of semicolons
These statement execute in order i.e. observation by observation ,and for ecah observation line by line (statement by statement),
That means SAS takes first observation and runs it all the way
through the DATA step (line by line) before looping back to pick up
the second observation.
SAS terminolgy
data setSAS Table
Variable (var) columns
observation(obs)rows(recrods)
data values (val)intersectional between Rows and Column
NAMING CONVENTIONS FOR SAS TABLES and VARABILES
1. START WITH A LETTER OR AN UNDERSCORE
2. CAN CONTAIN ONLY LETTERS,NUMBERS,AND UNDERSCORES
3. CAN BE 32 CHARACTERS OR FEWER
NAMING CONVENTIONS FOR SAS LIBRARY(libref)
1. START WITH A LETTER OR AN UNDERSCORE
2. ONLY LETTERS,NUMBER,UNDERSCORE
3. 8 CHARACTERS OR FEWER
Type of libraries
SAS LIBARY:we have two types of library:
1.PERMANENT AND 2.TEMPORARY SAS LIBRARY
The only temporary library is work> all of the other libraries are permanent
The WORK library is the only temporary storage location for SAS data sets.
It is also the default library. If you create a SAS data set without specifying a library, SAS
will put it in the WORK library, and then delete it when you end your session.
TYPE OF DATA VALUES
TYPE OF DATA VALUES : 1.Numeric or 2.chategoric:
If a variable contains letters or special characters, it must be a character(categoric)_ variable.
However, if it contains only numbers, then it may be numeric or character.
i.e. for numeric values you are allowed to assign a numeric or charector data type
For example you can store value100 to numeric variable and charactor variable
but you can store value"Hamid2" only in cahractor variable
If YOU STORTE "HAMID2" IN NUMERiC VARAIBLE YOU WILL GET MISSING VALUES
IN SAS we consider two types for numbers :
1.standard like 348 0r 102.5 that doesn't have any special charectors and
2.non-standard numbers like 25% or 02-12-2022 or $152 or 3,245
for reading non-standard numbers we must specify the format of them(we call it informat and I will talk about that)
NUMERIC :1. Standard numbers 345678432, 156.8 , -267.9
2. NON STANDARD NUMERIC DATA : have $, %, coma( like 2,564,790) ,dates,..
NOTE:dates in SAS are NON-STANDARD numeric and if you read them with the proper informat
they are stored as a number of days between that date and
date base (date base is 1st of Janurary 1960)
as an example if SAS read 10-Jan-1960 with proper foramt it stores it with value equals to 9 ((10-Jan-1960) - (1-Jan-1960))
CHARACTER: values like "ABCDE","hamid rajaee","5 years" can store only in charector varibles
numeric value: contains only numbers, and sometimes a decimal point andor minus sign. When they are read into
a SAS data set, numeric values are stored in the floating-point format native to the operating environment.
Nonstandard numeric values can contain other characters as numbers you must use formatted input to enable SAS to read them.
nonstandard data is data that can be read only with the aid of informats. Examples of nonstandard data include
numeric values that contain commas, dollar signs, blanks, date and time values and hexadecimal and binary values.
Note: If you store charector value or non-standard numeric value in numeric variable without specifying the proper format
SAS will store missing valuees there
input id name $ age we need to specify charector varibles using $ after them
for inserting values in SAS use datalines or cards satement
if you want to insert data maually type datalines or cards then semicolon then
values, after inserting values semcolon must be in the next line if you put in the last line that obs won't be inserted
this is simple list input that values are separted by space and have default length(8 bytes)
and we don't have non-standard numeric. If these conditions are not satisfied you should use column input,
formated input or mixed input that I will teach you in later
Note: in simple list input when you insert values for missing values you must type dot
and when you print them you will see dot for numeric missing values and blank for charector missing values
in the basic way of inserting VALUES(simple list input)
1.you must type dot for any missing values and
2.all values must be separated by at least one space( 2 spaces or more is the same as one space
3.All chrector values must have atmost 8 chrectors( the rest of that won't be saved in simple list input)
runin the resultoutput missing values will be seen as a blank for chrectors and as a dot for numbers
but be notice that when you insert them in the simple list input you must type dot for both of them
input id $ name $ age you can store numbers in numeric or cahrector varibles
but you can't stor charector in numeric varibles . If you do that you will get missing values
cards or datalines are used to specify that you want to insert some values in the table
input id name age you can store numbers in numeric or cahrector varibles
but you can't stor charector in numeric varibles . If you do that you will get missing values
cards or datalines are used to specify that you want to insert some values in the table
Length of values
Character Length 1–32,767 bytes(Case sensitive) Default 8 bytes
( for each charector you need 1 byte)
Numeric Length 3-8 bytes Default 8 bytes(16 digits)
since it is simple list input the deafult length is 8 bytes so you can store up to 8 charectors for chrector values,
The length of a numeric variable lies between 3 and 8
bytes. It means SAS can store a numeric value from 1 to
16 digits.
any character needs 1 byte to store as an example to
stroe Hamid we need at least 5 bytes and for storing Hamid_Rajaee you need at least 12 bytes.
___________________________________________________________________
Hint: 1)sas program is a series of sas statements
2)Most sas statement begins with an identifying key word like data,set,where,proc,..
all sas statements must end with semicolon()
, hint : if running of your program takes long time go to icone !(break) in above toolbar and hit that
SAS reads observation by observation and executes line by line for each observation
comment:
Comments are colored green
Note:you should put your comment between asterisk and semicolon or between forward slash asterisk and asterisk forward slash
your comment These comments cannot contain internal semicolons or unmatched quotation marks.
or
your comment These comments can contain semicolons and unmatched quotation marks.
Tip: When using comments within a macro definition or to hide text from the SAS macro facility,
use this style comment
or In the Microsoft Windows operating environment, if you use the Enhanced Editor,
you can comment out a block of code by highlighting the block and then pressing CTRL and (forward slash) together.
To uncomment a block of code, highlight the block and press CTRL and SHIFT and (forward slash) together.
Two basic building blocks of SAS program
the program contains 2 steps:
1)data step
2)Proc step
we use proc step to craete a report or graph , ...
SAS statement usually begins with a keyword and end with a semicolon > each semicolon one statement
Every time Check the log to make sure
the program run without errors or warnings
DATA EUROPEANCARwe created (writing) a temporary data set that is EUROPEANCAR or work.EUROPEANCAR
SET Sashelp.CARSSashelp.CARS is input data set(we reading permanent data set that its name is cars and it
is availbale in sashelp libraray)
WHERE Origin 'Europe' 'Europe'"Europe"
run
in above data step Sashelp.CARS is input data set and EUROPEANCAR or WORK.EUROPEANCAR are output data set
Browsing Data Portion and descriptor portion
Browsing Data Portion
Proc print dataEUROPEANCAR
run
Browsing the Descriptor Portion
proc contents dataEUROPEANCAR be noticed that we have s at the end of contents
ceating Library
creating libraray
SAS datasets can be temporary or permanent.
Temporary SAS datasets only exist during the current SAS session.
Permanent SAS datasets are saved to a location on the computer or server and exist after exiting SAS.
Unless otherwise specified to be permanent, SAS considers all datasets to be temporary
data1work.data1
There are a few guidelines to keep in mind when you create the name of your library.
Library names(libref name):
1.are limited to eight characters.
2.must begin with a letter or underscore.
3.can contain only letters, numbers, or underscores. Blanks are not allowed.
create your library :1) by coding or 2) by using GUI( Graphical user interface):
before it create a folder in the server or your computer
using GUI:You can also create a new SAS library by opening
the New Library window from the Libraries section of the navigation pane in SAS Studio:
icone new library in up toolbar or Explorer window >Right click on Library > New): give the name for example hmd,
enable at start up,
choose your path and ok
I want to create a subset of sashelp.cars in my library
by coding:go to any drive for example.C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\My sas library
and create a folder for example My sas library then copy its address and come back to sas and write the keyword libname
and give any name you want as a libref for example Hmd and then
paste just the address in single or double quotation
highlite the code and F3 or choose the runnig person icone
For any SAS session, You need to submit a LIBNAME statement (only once during each SAS session)
The libref remains in effect for the duration of that session.
But the data that you stored in that library exist for ever until you delete them,
but in any session you should say where that library is located .
Note:You can assign different names for libref for the same
location and use it i.e.
The libref can change, but the member name( file name dataset names), and directory must the same.
(A libref is just a nickname that corresponds to the location of a SAS data library. )
as an example for the same folder I will assign another name as well
libname h2 "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\libreary for practice"
LIBNAME statement
A LIBNAME statement consists of
1.the keyword LIBNAME.
2.the libref name that you want to use.
(that have at most 8 charectos and starts with letter or underscore and contain leters, underscore and numbers)
3.the pathname that represents the physical location of the library and wrapped in sigle or double qoutation marks.
4.a semicolon.
after every run check log
It created a temporary data set that its name is euro inside work libary eurowork.euro
If you have a data in work library that you need for future you should better to save as in permanent library
to get a save as(copy and paste) euro dataset
It created a permanent data set that its name is euro inside h1 libary
proc print datah1.euro if You don't mention which data SAS will execute for the most recent data
data h1.Asiaoutput data is h1.asia
set Sashelp.carsinput data is sashelp.cars
where origin "Asia" be notice that unlike SQL that you are allowed to use only single qoutes to wrap the charector values, In SAS you can use either single or double qutes
proc print in proc if you skip mentioning name of data ,that proc run for the most recent data that you worked on it
Using SAS for calculation
Missing Data Values :
A Character missing value is displayed as a blank
A numeric missing value is displayed as a period (.)
SAS Statements are free format:
One or more blanks or special characters can be used to separate words
They can begin and end in any column(i.e. you can put spaces at the begining)
A single statement can span multiple lines
Several statement can be on the same line
but becarefull that each staement must be ended with semicolon
In any session you should specify where that library is located .You can assign different name for libref for
same location and use it i.e.
The libref can change, but the member name( file name), and directory must the same.
(A libref is just a nickname that corresponds to the location of a SAS data library. )
craete a data in SAS (manually ie. by inserting values) :
step1: assigning a name for data(table) (After data keyword)
stpe2: assigning names for variables(columns) and assigning data type for them (after input keyword)
step3: inserting values (after cards or datalines keyword)
data my_1st_dataoutput data is my_1st_datawork.my_1st_data that is temporary data
input name $ age gender $ weight height we use input statement to assign variable's name column's nameheader :
variables'name:
1)are limited to 32 characters.
2)must begin with a letter or underscore.
3)can contain only letters, numbers, or underscores. Blanks are not allowed
To create a character column, specify a $ sign righ after the sapace after
column name.
Note: unlike SQL that we use comma to separate columns names in SAS we use sapace
when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.
Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns
For inserting data you should type cards or datalines then semicolon and in new rows type observations after
the last observations(rows) go to new line and type semicolon
for inserting values use cards or datalines
you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted
So important: after any execution in SAS you must check the log to make sure that result is reliable or not
proc printrunin any procedure if you skip calrifying name of data , that proc excute for the last data that is executed
in result output characters are on left side, numeric are on the right side
Note: for numeric variable if you insert charector you will get missing values
I stored that privouse data set in work library, Since it is temporary libraray after closing the SAS program ,
this data is goinig to be disapear. If you want to save it permanently you have to create that data in permanent
library or getting save as like
Data Hamid.myDataHamid.myData is output data that I want to save in in Hamid libraray
note: you can save it with the same name like Hamid.my_1st_data
set my_1st_datamy_1st_data is input data that exits in work libraray and I want to get a copy of it
Run I created new data set in hamid library and I named it mydata and I save as all data that I have in my_1st_data
input id name $ family $ height several sapces is the same as one space
you can save numbers as a charector as well
Brows Data descriptor
for inserting any non-standard numbers in your dataset( table) you must use informat
inforamt Vs. format
for reading and writing non-standard numeric values you can use inforamt( for writinginserting values)
and format( for readingprinting values)
input id name $ family $ age dob several sapces is the same as one space
informat dob ddmmyy8.informat for reading perpuse (inserting in table)
Note:right after each format_name put dot
note: some availble format :
14.05.85 or 140585 >ddmmyy8. or ddmmyy.
17032013 >ddmmyy10.
03172013 >mmddyy10.
17MAR13 >Date.
17MAR2013 >Date9.
you can find more by typing date format in help window( use help tab)
SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS
For reading date to sas(Informat) we should specify format of date
Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and
3rd 2 digit is year
after checking a log you will find that this data is created successfully
but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960
proc print datamytestrun
we use informat for reading and we use format for writing
you can use any other date format
you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values
input id name $ family $ age dob several sapces is the same as one space
informat dob ddmmyy8.informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step
format dob date9.
if you define format in data step it will saved with it but you can override whenever you want
The most 7 important proc:
1)proc print to brwos data portion
2)proc contents to brows desciptor portion
3)proc freq to summarize chractor variables: create a table that gives us frequency of each level( unique values)
and chi-square test
4)proc means to summarize numerical variables
5)proc univariate to summarize numerical variables with more information
and test of normality
6)proc sql to use sql syntax
7)proc sort to sort data and deduplicate data
we talk about them and other procedures during this course and advanced SAS programming
Note: we have more procedures like proc import for importing data, proc export for exporting data,
proc ttest, proc Anova, proc reg, ...
input name $ age gender $ weight height we use input statement to assign variable's name column's nameheader :
variables'name:
1)are limited to 32 characters.
2)must begin with a letter or underscore.
3)can contain only letters, numbers, or underscores. Blanks are not allowed
To create a character column, specify a $ sign righ after the sapace after
column name.
Note: unlike SQL that we use comma to separate columns names in SAS we use sapace
when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.
Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns
For inserting data you should type cards or datalines then semicolon and in new rows type observations after
the last observations(rows) go to new line and type semicolon
for inserting values use cards or datalines
you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted
proc printrunif you skip naming data, SAS conduct that procedure on the most recent data
Descriptive statistic ( summarizing existig data)(part1)
For each variable it shows frequency of each level(frequency of each unique values)
proc freq datamy_1st_datarun
proc freq datamy_1st_data
table genderunlike other procedurs that we specify name of variable by using var statement in proc freq we use table since we looking for table that shows us frequency of each levels
run
proc univariate datamy_1st_datarunproc univariate by default returns more information compare to proc means and you can draw a grapgh with it as well that I will teach you
in this version of SAS 1."as" in proc sql is not optional
and 2.you can not say [avrerage of weight]
from my_1st_data notice that my_1st_data is a SAS dataset and exits in work SAS library
Reacall that:
order in SQL select statement:
select ....
from ....
where ....
group by ....
having ....
order by ....
rermeber
"Some Female Workers Go Home Ontime"
Sorting data
Proc sort datamy_1st_data it will sort the original data unless you will specify output(i will talk about that)
By agesorting by age deafaultascending
proc print datamy_1st_datarun
Note that in SAS missing values are considered less than any value that you can assume
i.e. in SAS unlike R and Python, missing values consider less than any values that you have
Proc sort datamy_1st_data
By descending agein SAS unlike SQL 1.we use complete word of descending 2.before variable name
Proc sort datamy_1st_data
By desending agedescendingdesendingdecending
input id $ name $ family $ height several spaces is the same as one space
FYI
Label (part1)
LABLE STATEMENT
LABEL TO VARIBLE TO BE MORE DESCRIPTIVE
CAN BE UP TO 256 CHARACTERs can conatain any charector even space
if you want to get labels in print you need to use lable option in proc statement
if you assign lable in data step it will saved with data ( you can overwrite later if you want)
but if you assign lable in [proc step it won't saved with data and only apear in that proc
in labels you don't need to put them inside qoutation marks
proc print datahamid.cars labelrun to see the labels in proc print statement you must type label
for displaying lable you have to type lable keyword in print statement then in other lines
define lable unless you define lable before
if you put a LABEL statement in a data step, the labels will be saved with the data
if you put a LABEL statement in a PROC, then the lables will be used only by that particular execution.
create dataset based on one existing dataset in SAS
subset
Ex. Create a data set that stores all records of cars data set in sashelp with originAsia
data test testwork.test and it is the name of output data
this statement create table and its name is test and its library by default is work,
if you want to save it permanetly you have to put a libref
befor test and dot like: Hamid.test but because we didn't do that here it is temporary data and saved in work library
set sashelp.cars sashelp.cars is the name of input data
that means gives data from library sashelp and table cars
where origin "Asia"
RunCheck log after any execution
proc printruncharacters are on left side, numeric are on the right side
Ex. create a subset of cars data set in sashelp that includes only Sedans from Europe
Ex. create a subset of cars data set in sashelp that includes only european cars
where origin"EUROPE"O observation means no observation meet that condithion because SAS in case sesitive for values
Note: SAS is case sensitive in terms of vaues inside its table
upcase
where upcase(origin)"ASIA"
originupcase(origin) If you want to upcase values of origing permanently
where upcase(origin)upcase("EuropE") or upcase(origin)"EUROPE" >upcase(origin) converts all of values of origin column to upercase temporary and compare them with upcase("eUropE") that is "EUROPE"
distinct (unique) values
If you want to make sure about unique valus of origin
proc freq datasashelp.cars
table origin
run so you can see that we ahve only "Europe", "Asia" , and "USA" and we don't have "EUROPE" or "europe" or "eUropE"
creating new variable(column)
you can create a new variable in data step by assinment to the new name
you can create a new variable in data step by just using a new name and assigning a value to it
Delete a variable
In data step you can delete a variable either by drop opthions or by using keep otions with other variables'name,
drop
Data test5 (dropinvoice type) don't put comma between variables' name
Data test5b (dropinvoice type) in SAS don't put comma between variables' name
keep
Data test5b (keepmake origin type newprice) in SAS don't put comma between variables' name
data data1(keep name gender new_age)this is keep option
keep name gender new_agethis is keep statement
TitleFOOTNOTES STATEMENTS
Up to 10 TITLEsFOOTNOTS STATEEMENTS
TITLE "THIS IS MY FIRST REPORT"
TITLE2 "THIS IS MONTHLY@#$%^& SALE REPORT"
TITLE3 "HERE IS ANOTHER TITLE"
FOOTNOTE "THIS IS MY FIRST FOOTNOTS"
FOOTNOTE2 "THIS IS MY SECOND FOOTNOTE"
FOOTNOTE3 "THIS IS MY THIRD FOOTNOTE"
PROC PRINT DATA data1
proc print
Note: proc print have more options that allowes you to choose featurs(columns)(variables) and
number of obs(rows) that you want to print
proc print with var satetments
PROC PRINT DATA SASHELP.CARS
VAR TYPE ORIGIN MAKE INVOICE I used Var statements in proc print to specify which features I want to see in print
RUN
Note: we can use drop or keep in data step when you want to select some varaibles durnig getting a copy of data
but in poc step we use var statement to slect some varaibles or table statement in proc freq
HOW TO CANCEL PREVIOUS TITle AND FOOTNOTES STATEMENT?
By making them empty or assigning new one to overwright previouse one
TITLE
FOOTNOTE
proc print with obs and firstobs option
PROC PRINT DATA SASHELP.CARS (obs5)it prints the first 5 obs (head of data)
PROC PRINT DATA SASHELP.CARS (obs10)run
PROC PRINT DATA SASHELP.CARS (obs5 firstobs3)
it will start from 3rd rows(obs)( that means you don't want to print the first 2 obs) and end at 5 obs so you will see 5- 23obs
PROC PRINT DATA SASHELP.CARS (obs10 firstobs3)
it will start from 3rd rows(obs)( that means you don't want to print the fist 2 obs) and end at 10th obs so you will see 10-(3-1)10-28 obs
Internal data:Craeting data set and inserting data manually
so far you saw simple list input that
1.seprater is space and
2.you must type dot for missing values and
3.you can not have non-standard numbers like date, 48%, $23000, or comma for numbers like 3,642,912
4.you can store charectors up to 8 charectors( default length)
for inserting any non-standard numbers in your dataset( table) you must use informat
inforamt Vs. format
for reading and writing non-standard numeric values you can use inforamt( for writinginserting values)
and format( for readingprinting values)
input id name $ family $ age dob several sapces is the same as one space
informat dob ddmmyy8.informat for isertingreading perpuse
Note:right after each format_name put dot
note: some availble format :
14.05.85 or 140585 >ddmmyy8. or ddmmyy.
17032013 >ddmmyy10.
03172013 >mmddyy10.
17MAR13 >Date.
17MAR2013 >Date9.
you can find more by typing date format in help window( use help tab)
SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS
For reading date to sas(Informat) we should specify format of date
Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and
3rd 2 digit is year
after checking a log you will find that this data is created successfully
but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960
proc print datamytestrun
we use informat for reading and we use format for writing
to see the result for dob in undrstanding form use format statement
proc print datamytest
format dob ddmmyy8.
run
you can use any other date format
you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values
input id name $ family $ age dob several sapces is the same as one space
informat dob ddmmyy8.informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step
format dob date9.
if you define format in data step it will saved with it but you can override whenever you want
recall that in simple list input you are not allowed to have space inside values
Formatted input enables you to read data stored in non-standard form, such as dates, percent or
numbers with commas.
and you don't need to put dot for missing values when you insert data
but all values of each column must starts from the same position
Note: in SAS unlike Python index strarts from 1
input @1 dob mmddyy6. note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1
@8 fname $10. Read values based on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format
we will get 4 obs because of the empty line after 3rd obs
proc print runwe got dob as a number of day difference from 1st of Jan 1960
input @1 dob mmddyy6. note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1
@8 fname $10. Read values based on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format
FYI: You can mix foramtted input with simple list input
@8 fname $10. Read values based on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format
Non-standard numeric values eaxamples: 47%, $240, 11102022 for using mathematics functions like average, sum, subtraction,... we need to spaecify they are numeric values
for reading(insereting) non-standrd numbers we use informat and for writing ( printing) them we should use format
Hint:You can save format when you read data by using format statement in this situation you don't need to use fromat when you writing data
but you still can override that fromat
hint: when all of values for each column start from the same position you can use formatted input like the follwingh example
DROP DUPLICATE OBSERVATION IF EXIST( deduplicate data)
PROC SORT DATAtest2 OUTtest3 NODUPKEYhere input data is test2 and output data is test3 and both are stored at work library
BY _ALL_ if 2 obs( rows) have same values for all columns the second one will be consider duplicated
chech the Log > 1 observations with duplicate key values were deleted
FYI: Getting your data into SAS with proc IMPORT using GUI:
First Make sure that the data is closed
File> Import Data>Select the type of file for example CSV> Brows the location of file>
Selecte the library that you want> assign a name to it in member session>
select a locatio for proc import to save that and assign a nmae for it> finish
you need to check the log to make sure that you read data correctly
Note: By deafault sas uses only the first 20 rows to guess the datatypes of variables if
you get warning or error in the log you may neeed to assign a greater value to GuessingRow
Segmentation(binning)and Discretization
SEGMENTATION is the process of dividing continuous variable into segments( different groups)
Discretization :Process of transferring continuous variable into discrete variable
Methods for changing values for segmentatioin or changing apearence :
a)with proc format
b) with conditinal statement like if statement or case when
Proc format
Note:We can use proc format to define new format for writing(printing) data, we normally use it to change values or encoding,
or segmenation(binning),
or...
SO Important:
Proc format
in proc format we use value statement to assign a name for the format that we want to create and defining our format
Rules for defining a FORMAT name:
1.If you are creating a format for character variables,format name must start with $
i.e. use a dollar sign ($) as the first character of its name.
for numeric variables must begin with letters or underscore
2.The name must be a valid SAS name (Only letters, underscore and numbers are valid characters),
3.but Can’t end with a number and for charector variables must start with $
4.Max length: up to 32 charactor including $
so A numeric format name can be up to 32 characters. A character format name can be up to 31 characters
(because we have $ before that).
5.name can’t be the name of an existing format (can't overrride)
pay attention that the format name Can’t end with a number
Note: when you assign a name for format name in proc foramt it doesn't end with dot but after assigning its name
whenever you want to use it you must put dot at the end of its name to specify that is a format name
scenario:assume we have age column in our data and we want to do segmenation based on the following explenation:
If the person is youngr than 30 years old , it belongs to youth group,
If heshe is between 31 to 60, it belongs to midele aged group and
older than 60 ,it belongs to senior
Since age is numeric column ,format name for it must starts with letter or underscore
proc format
value agegroup agegroup is the name of this foramt that we want to use it for numerical values that's why
we don't have $ before its name
Note: after creating you are allowed to use this format for any numeric columns
Note:After assing a name to our format when ever we want to use this format we must put dot
after format name
low-30'youth' we are allowed to use low and high keyword note: <30 is not acceptable in proc format
format dob empdate date9. age agegroup. I used format named agegroup,use dot after name of format
Note: be noticed that in proc format we use low and high keyword and it returs missing values for missing values
Five advantages of using proc formats:
1.The orginal data will not been changed. You can use the original variable values in the analysis.
2.You can apply formats to both character and numerical variables.
3.You can use formats to merge categories and
to bin numeric variables.
4.you can apply a format to multiple variables in multiple data sets.
5.You can save formats in a permanent libref and use them across SAS sessions.
Note:You can define several format name in one proc format
I saved test data set that is temporary data set in work library as a permanent data set with the name of
test2 in permanent library that its libref is hamid with saved format
proc print dataHamid.test2run
proc contents datahamid.test2run be noticed that the data type of age and sex
If you want to have original value for age and created new formatted value you can do like below:
age_groupeagecreating new variable that its name is age_groupe and its values are the same as values in age column
format dob empdate date9. age_groupe ageg.
gendersexcreating new variable that its name is gender and its values are the same as values in sex column
age_groupage
format age_group ageg. gender $genderg.
b) with if statement: you can use if statement for segmentation(bining).with if you can define new variable as well
When you use if statement for binning(segnentation) you should be careful about 2 things
1.If you define new variable inside if statement without defing length of that before it :
SAS will assign length of that new variable equals to length of first value that you initilized to it
you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement
2.Be carefule that SAS consider missing values less that any numbers that you can think in comparision
i.e
. < 30000
.< -234567890876546789
I will tell you how handle these problems
senario: income less than 30000 is low income , between 30000 and 50000 (not include 50000) is mediom income and greater than or equal to 50000 is high income
lets create new variable (type) based on income
if income<30000 then type'low income'SAS is going to create new column that its name is type and length of that
is based on initial value
else if income<50000 then type'mediom income'else if 30000<income<50000 then type'mediom income'
if you use "if " instead of "else if" the second one overrride 1st one
if you use "if " instead of "else if" the second one overrride 1st one
unless you say if 30000<income<50000 then type'mediom income'
else type'high income'
As I mentioned we have two problem when we use if statement 1st length of values 2nd missing values
but we can fix them:
you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement
if income<30000 then type'low income '
else if income<50000 then type'mediom income'
Hint: Be careful that SAS consider missing values less that any numbers that you can think
you can use one of the follwing methods to handle missing values easily
Coditional statement
in data step you can use if or where but in proc step you are allowed to use only where
Hint: If you want to add condition in procedure if statement doesn't work you should use where statement
I will talk about if Vs. Where in detail
PROC PRINT DATA SASHELP.CARS(OBS428 firstobs424) OBS tells SAS when to stop processing observations and firstobs means from which starts
here we have 428-424+15 obs in pint
LIMITING THE NUMBERS OF VARIABLES
_________________________________________________________________________
PROC PRINT DATA SASHELP.CARS(OBS20)
note: we can use if as well as where for filtering records. But there are some points that
you need to pay attention
WHERE
Where Vs. If
We are not allowed to use if in proc step
For FILTERING THE DATA VALUES in proc Use WHERE CLAUSE
some operators that you can use with where:
GE GREATER THAN OR EQUAL TO
GT GRATER THAN
LE LESS THAN OR EQUAL
LT LESS THAN
EQ EQUAL
NE NOT EQUAL
IN OPERATOR: for checking if that value belongs to that set (checking membership)
BETWEEN... AND... :to check if that value belongs to that range including end points
PROC PRINT DATA SASHELP.CARS(OBS5)SAS will print the first 5 observations that met follwing condition
WHERE Cylinders GE 6GE is the same as >
If Cylinders GE 6Note:unlike where statement ,you are not allowed to use if in proc step you can use if only in data step
WHERE Cylinders IN (4,8)Cylinders IN (4,8) is the same as Cylinders 4 or Cylinders 8
PROC PRINT DATA SASHELP.CARS(OBS10)it will be prit the 1st 10 obs that are satisfied following conditions
WHERE type IN ("SUV","Wagon")
PROC PRINT DATA SASHELP.CARS(OBS10)
WHERE type IN ("SUV","WAGON")
PROC PRINT DATA SASHELP.CARS(OBS20)
WHERE upcase(type) IN ("SUV","WAGON")
upcase (Wagon)WAGON
upcase (sedan)SEDAN
upcase(SUV)SUV
or you can check all leveles by using proc freq
informat vs. format
informat is for reading purpose, format is for writing purpose you can use format in data step or any proc step
READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS
DISPLAY THESE NON STANDARD NUMERIC :FORMATS
Recall that:
TYPES OF DATA : 1.NUMERIC 2. CHARACTER
NUMERIC:STANDARD AND NON STANDARD
NON STANDARD NUMERIC DATA: E.G DATE:14JAN 1970,75%, 1,000,000, $240
OBS tells SAS when to stop processing observations and firstobs means from which starts
here we have 7-3+15 obs in pint
format dob empdate date9.
proc print datatest2 noobsnoobs means no index
statement make sure that your data is sorted by that value before
Unlike variable's name that must be less than or equal 32 charectors starts with letters or underscore and
can conatain only letters, underscore and numbers
label can contain and start with any thing but must be less than or equal 256 charectors
PROC PRINT DATA sashelp.CARS (obs5)run
PROC PRINT DATA sashelp.CARS (obs5) LABEL
label type"@@##Type of car:"run
PROC PRINT DATA sashelp.CARS (obs5) LABEL
label type@@##Type of car:
if you want to use lable statement you have to type "lable" keyword in first statemen as well
label fnamefirst name lnamelast name
if you want to use lable statement you have to type "lable" keyword in first statemen as well
label fnamefirst name lnamelast name
if you want to print ecah group in one table you should use by statement in print
proc print datatest2 label
if you want to use lable statement you have to type "lable" keyword in first statemen as well
label fnamefirst name lnamelast name
by sex you might get somthing wrong
So important: when ever in SAS you want to use By statement you must sort the data by that variable first unless it was alredy sorted
if you want to use lable statement you have to type "lable" keyword in first statemen as well
var fname lname empdate income sex age
format empdate date9. age ageg. sex $genderg.
where sex'M' and age>20two conditions
by income to print different table for each income
Note: every time you use' "by" for any variable' you must sort by that variabe first unless the data was sorted before
look at the result with and without sort
label fnamefirst name lnamelast name
title 'this is an example'
footnote "created by hamid"
please check the log, although we got a table in results window, but it is not reliable because we have error in log
SORT PROCEDURES
BY DEFAULT IT IS ASCENDING(from smallest to greatest) :
Note:1.For chategorical variables First it is gana sort based on the first digit or charector:
missing values < 0-9 < A-Z < a-z ,
(in SAS upper case less then lower case and in alphebetic order for example A<a, Z<a A< B,a<b, A<a, B<a)
if the first digit or charector are the same it is gana check the second one and so on
Recall that SAS is case sensitive in terms of values
2.for numeric varaiblles it is going to sort based on their values ,missing comes first
INPUT NAME $ @@list input
Note:@@ means you may have more than one observation in atleast one line
Because we didn't mention the name of data, it is gana sort the most recent data that we excuted
and because I didn't use out option it is going to change the original data
proc sort datatest0 outtest0_1 when you are sorting data it is better than save output in different table because
1)normaly we don't like to change original table
2)it takes a lot of sources to shuffel and save in the same table
by descending nameunlike sql that we mention ascending or descending after variable name in short form
in SAS we must specify decending before any variable in complet form
In SAS we use Descending before any variable, in sql we use Descending after variable and
in short form(desc)
proc sort datasashelp.cars outtest2Input data is sashelp.cars output data is test2work.test2
by Decending type invoicedescendingDecending ,here it was sorted first by type in descending order , then for all observations in
the same type they
were sorted by invoice in ascending order(default is ascending)
by Decending type Decending invoicehere it was sorted first bt type in desceng order , then for all observation in same type they
were sorted by invoice in decending order
import data participant1.csv and save it as HAmid.Parti
nodupkey
you can use nodupkey in proc sort to drop observations that have same values for featurs that you are sorting by them
it will keep first one and drop other duplicated in that column
proc sort datahamid.parti nodupkey outhamid.parti2
by sexdeletes observations with duplicate BY values (here no sex duplicated)
hint: for duplicate value in sex keep first row and drop others
senorio : Find highest income for each level of sex in hamid.parti
Be carefull if you use nodupkey when you sort two variables SAS is going to consider one observation duplicated if it has same values for
thoes two varible with at least one of priviouse obs
PROC SORT DATAt2 nodupkey OUTt4
BY sex salary here SAS considers one obs is duplicated if sex and pai both are the same with priviouse obs
Recall that in simple list input :
1.we have space as a delimeter and (we can solve this problem by using delimeter in modified list input)
2.we don't have non_standard numbers (we can solve this problem by using informat in modified list input)
3.missing values encoding with dot, (we can solve this problem by using delimeter in modified list input)
4.and charectros values have a default length8 bytes(8 charectors)(
we can solve this problem by using length statement or informat in modified list input)
NON STANDARD NUMERIC
like : 98%, date, $678 or values that have comma 3,997,657
READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS
DISPLAY THESE NON STANDARD NUMERIC :FORMATS
INPUT ID AMOUNT DOLLAR12.4 IT is modified list input that we specify informat in front of variable
DOLLAR12.4 is informat for reading but for writing
if you want to see $ sign you should define format
here in data step or in proc step
Type casting
Converting variable types from character to numeric and numeric to character:
you can convert numeric to charector by using put()function and
you can convert charactor to numeric by using input()function if its values is convertable
put always convert from numeric to char and input always convert from char. to numeric
new_numinput(char,4.) I define new variable its name is new_num and convert character to numeric with 4 digits
new_charput(num,3.) I define new variable its name is new_char convert numeric to character
If the character variable char in the above example contains missing values or non-numeric data then the
value of new_num will be missing. When char contains non-numeric data, an 'invalid argument' note will be written
to the log
Hmw2.3 Example Universe Cars is surveying its customers as to their preferences for car colors.
They have information about the customer’s age, sex (coded as 1 for male and 2 for female),
annual income, and preferred car color (yellow, gray, blue, or white).
STATISTICS
APPLIED STATISTICS
1.DESCRIPTIVE STATISTICS (means summarizing existing data( previous and current data))
2.INFERENTIAL STATISTICS(Statistical inference is the process of making as estimate, prediction, or decision about a population based on sample data. )
3.PREDICTIVEPRESCRIPTIVE STATISTICS
###############################
Exploratory Data Analytics (EDA)
###############################
Analysts need first to explore the data for potential research questions before
jumping into confirming the answers with hypothesis testing and inferential statistics.
EDA is often likened to “interviewing” the data it’s a time for the analyst to
get to know it and learn about what interesting things it has to say.
As part of our interview, we’ll want to do the following:
1.Classify our variables as continuous, categorical, and find distribution of them
2. Summarize our variables using descriptive statistics
3. Visualize our variables using charts
EDA is a process to try to understand existing data by visualization, Summarization, data cleaning, and
testing hypothesises
#We select the method for summmrizing data and analysis and testing hypothesis base on data type
Assume you have a breast cancer data set and in it you have ID, age, weight, height, blood pressure
and cancer . Cancer is binary 0 an 1 (0 means not having breast cancer and 1 means having breast cancer)
we want to built a model to predict having cancer for new patients
So in this situation cancer is target and we call that target or dependent variable or output variable and
we call other columns predictors or independents or features or input variables
we have two data types numerical and categorical data.
Categorical data has two types :Nominal(like name, eye color) and Ordinal(like education level, day of week)
numerical data has two types continuous (that can accept decimal like height and weight ) and
discrete(like number of children, number of females, ...)
If your variable has only two levels we call it binary at we can consider it as a spcial case of nominal categorical variable
TWO TYPES OF DATA (VALUE)1.NUMERIC 2.CHARACTER
numeric data could be discrete or continuous
If your data is discrete with limited levels like number of childeren ( number of cylinder in cars),
you can treat it as a categorical variable as well i.e. you can find mean and varaince because it is
numeric and you can draw bar chart and pie chart as well
SAS STATITICS:1.CONTINUOUS 2.CATEGORCAL
CONTINUOUS: like: AGE,income,IQ,SALES,PRICE
Charector variables like: gender, name,purpose of loan, loan status
DESCRIBE ONLY one VARIBLE:UNIVARIATE DESCRIPTIVE ANALYSIS
DESCRIBE ONLY TWO VARIBLES:BIVARIATE DESCRIPTIVE ANALYSIS
DESCRIBE MORE THAN 2 VARIBLES:MULTIVARIATE DESCRIPTIVE ANALYSIS
Univariate Analysis:
For categorical columns variabels:
for summarization: frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)
for visualization : Bar chart or Pie chart
For numerical columns :
for summarization: Central tendency(mean, median, mode,...) and
measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max),
and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...
( by using proc means,summary, or univariate in SAS)
for visualization: Histogram , Box plot,density,...
https:online.stat.psu.edustat500
Continuous Variabels
WHAT PROCEDURE we USE FOR summerizing Numerical variable
1)Proc means 2)Proc summary 3)Proc univariate
Proc means 12
Proc means datatestby default proc means do summearization ( univarite) for all numeric variables and
Runproc means by deafult returns only N(number of non-missing values) , mean( average), Std Dev (standard deviation), minimumn and maximum with 7
decimal numbers
Proc means datatest maxdec2
Run
title " The five-number summary"
Proc means datatest min Q1 median Q3 Max run
Q1 is 1st quartile25th percentile that means the value that 25% of obs have value less than or equal to it
medianQ2 is 2nd quartile50th percentile that means the value that 50% of obs have value less than or equal to it
Q3 is 3rd quartile75th percentile that means the value that 75% of obs have value less than or equal to it
if you have a big distance between min and p1 but not between p1 and Q1 you have outliers and
if you have a big distance between p99 and max but not between Q3 and p99 you have outliers
note: we have some other methods for finding outliers like 1. mean and standard deviation( Z score)
2. Interqurtile Range or box plot(Tukey and extreme Tukey method)
3. LOF
that you will learn in statistic and data mining course
By default proc means do descripteve analysis for all continuouse varibles in data set
title " Describe one variable"
Proc means datatestdo not forget "s" at the end of proc means and proc contents
output outdb1.data_summary_Oct18th you can save the reult of proc means as a new table (dataset) by using output statement
Bivariate Analysis
in SAS base in bivariate analysis I just talk baout summerization, and in advanced SAS I will
teach you hypothese testing,
Continuous VS. Categorical with proc means
title "Bivariate Analysis"
title2 "Continues VS. Categorical"
title3 "DESCRIPTIVE OF Income Vs. Gender"
Proc means datatest maxdec2
var income after var list numeric variables abd after calss or by list categorical variable
class gender Bivariate continues VS. categorical
run
footnote "to get the idea about association between income and geder if exits"
title "Bivariate Analysis"
title2 "Continues VS. Categorical"
title3 "DESCRIPTIVE OF Income Vs. age_group"
footnote "to get the idea about association between income and age_group if exits"
title "Multi-variate Analysis"
proc means datatest maxdec2 NMISS MEAN STD CV MEDIAN p20
var age income summeraiztion for these two continuous varibales
class gender ageby using fomat I will vconvert age to categorical variable and use it for grouping
format age agefmt.
Hint: You can use by statement instead of class statement in this situation you will get separate tables for each level and be carefull if you want to use by you must sort by that variable first
the following code is just FYI
proc means datasashelp.baseball p1 p5 p10 p20 p25 p50 p75 p90 p95 p99 maxdec2
var nHits
In SAS in assignment if you want to assign a value of keyword to your own variable we specify that keyword on the left side of equal sign
and the name of your own varibvle on the right side of equal sign
in proc means for saving the result in one table you can use ouput statement
output outpercentiles_proc_means
p1 P_1_hamid P_1_hamid is the name of varible that i assigned value of p1 to it
proc means datatest maxdec2 mean clm maxdec2clm is confidence interval for population mean that we predict a range of mean for population( inferential statistic) by default at 95% confidenc
clm is confidence interval
Let’s assume that I will give you an exam with 30 marks, and I say that based on my experience your class average mark would be between 21.75 and 26.25 at 95% confidence
That means there is 5% possibly that your class average would be less than 21.75 or bigger than 26.25
If I want to predict your class average at 100% confidence I must say:
your class average mark would be between 0 and 30 at 100% confidence
proc means datatest maxdec2 mean clm alpha0.10 alphasignificant level0.10> confidence 1-alpha 0.90
clm is confidence interval that we predict a range of mean for population at 90% confidenc
The bigger confidence the wider interval
FYI:alpha is called signifiacnt level it is probabilty of Type I error(FP)
Proc means datatest MAXDEC 2 n nmiss mean std stderr cv lclm uclm median min max range Q1 Q3 qrange maxdec2
where VS. If
1. you can use where in proc step or data step but you can use if only in data step
i.e you are not allowed to use if in proc step
2.we have where option as well
where option vs. where statement
FYI: Proc summary 12
Proc SUMMARY and Proc MEANS are essentially the same procedure. Both procedures compute descriptive statistics.
The main difference concerns the default type of output they produce. Proc MEANS by default produces printed output in
the LISTING window or other open destination whereas Proc SUMMARY does not. Inclusion of the print option on the Proc SUMMARY
statement will output results to the output window.
Proc SUMMARY
When all variables in the data set are character the same output: a simple count of observations, is produced for each
procedure.However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables and
produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum).
Inclusion of a VAR statement in both Proc MEANS and Proc SUMMARY, produces output that contains exactly the same default
statistics.
proc summary datatest In proc suumary you must use print option to see the result or use output statement to save the result in one table
output outsumdata_For_Hamidcreating new data which name is sumdata_For_Hamid an availble in work library
you can save the result in permanent library if you want
Hint:you are not allowed to use if statement in procedures
proc print datasumdata3arunyou can see bigger number of type shows you more detailed in terms of classification
I classified all of observation based on gender and age so type 3 is classifying based on gender then age
class gender age I classified all of observation based on gender and age
here n is keyword and we assign the value of that to number_of_non_missing, In SAS keywords located on
the left side of equal sign
proc univarite
Note:In proc univarite you can get more information about continuous varaible by default , and unlike proc means and summary
1.You can use proc univarite to test of normality and to visualize data as well
2.you can create a graph
3.when you want to create output in proc univarite you must specify what _stat_ you want to save
4.we use pctlpts20 instead of p20 and so on and The PCTLPRE option must be specified to generate additional percentiles
5. you can use the PCTLDEF option in PROC UNIVARIATE or the QNTLDEF option in other procedures to control the method used to estimate quantiles
You can use proc univariate to test of normality by adding normal option
If p value is less than 5% you can reject null hypotheses( data is normaly distributed) and conclude
your data is not normally distributed at 5% significant level
output outsumdata4a WARNING: Data set WORK.SUMDATA4A was not replaced because new file is incomplete
Bivariate continues VS. categorical
Proc univariate datasashelp.cars normal plot
class make Hint:you can use by instead of class but must sort before by make, unless your data sorted before
Note: when ever you use by statement in SAS you must sort your data by thah variable before ,unless your data sorted before
PCTLPREprefixes
specifies one or more prefixes to create the variable names for the variables that contain the PCTLPTS percentiles. To save the same percentiles for more than one analysis variable, specify a list of prefixes. The order of the prefixes corresponds to the order of the analysis variables in the VAR statement. The PCTLPRE and PCTLPTS options must be used together.
PROC UNIVARIATE generates new variable names by using the prefix and the percentile values. If the specified percentile is an integer, the variable name is simply the prefix followed by the value. If the specified value is not an integer, an underscore replaces the decimal point in the variable name, and decimal values are truncated to one decimal place. For example, the following statements create the variables pwid20, pwid33_3, pwid66_6, and pwid80 for the 20th, 33.33rd, 66.67th, and 80th percentiles of Width, respectively:
univariate analysis
for categorical variables
title
proc freq datatest
runfor each variable (numeric and categorical variables) it shows frequancy of each level
it is not suumerizing of num,erical varable just shows us the freq of each unique value
If an observation has a missing value for a variable in a TABLES request,
by default PROC FREQ does not include that observation in the frequency or
comulative freq and percentage , it shows missing values at the bottom of tables.
Also by default, PROC FREQ does not include observations
with missing values in the computation of percentages and statistics.
The procedure displays the number of missing observations below each table.
unlike proc means, summary and proc univariate that we use var statement to specify varables for
thoes procedures, in proc freq we use table statement :recall that we show counts of each value in table
proc freq datatest
table gender
run
proc print datatest run
By default proc freq ignores missing values in calculation percent and cumulative percent
unless you youse missing option in table statement
proc freq datatest
table gendermissing we use forward slash to add option in proc freq
treats missing values as a valid level for all TABLES variables.
Displays missing levels in frequency and crosstabulation tables
and includes them in computations of percentages and statistics.
proc freq DATA SASHELP.CARS NLEVELS NLEVELS :Displays the number of levels for all TABLES variables
TABLE SEX SMOKING_STATUSit is two univariate analysis ,
i.e creating two tables one for sex, the other one for SMOKING_STATUS
IF you want to get two-way table youu need to put star between sex and
SMOKING_STATUS
Bivariate
Cathegorical Vs. Chategorical
assume you are working as a data scientist at Bell and you want to see that is there any associtaion between
type of cancelletion( online or calling) and churn ( cancelling the agreement and switch to anothrer provider like Bell, ...)
this is hypotheses testing ( you may design an AB testing for answering this problem)
Null hypothese: there is no assocition between churn and type of canceletion
since both variables are categorical vbariables we run biviate anlysis for categorical vs. categorical
fgor this problem we need to use chi_squre test in proc freq that i will teach you
Bivariate Analysis:
Continuous Vs. Continuous : For Visulaization scatter plot,line chart( specially if you have time data)
For test of independence: pearson correlation or spearman or Kendal tau or ... coorelation
and Simple linear regression ...in SAS by using proc corr, proc reg
Categorical Vs. Categorical : For summaraization: contingency table (two-way table) in SAS by using proc freq and put between the name of thoes two variables in tabel statement
For visualization :stacked bar chart,Grouped bar chart,...
For test of independence:chi-square test
Continuous Vs. Categorical : For summaraization:gropup by categorical column and aggragte for numerical column in SAS by using proc means,summary, or univariate and use name of continuous variable in var statement and name of categorical variable in class statement
For visualization: Grouped box plot,...
For test of independence :1) if categorical column has only two levels :t-test in SAS by using proc ttest ( if all assumption are met)
2) if categorical column has more than two levels: ANOVA in SAS by using proc glm or Anova( if all assumption are met)
CHI SQURE
We use Chi squre test for finding if two categorical varaibles are independent from each other or not.
The Null and Alternate Hypotheses
we are interested in knowing if there is a relationship between two categorical variables or not.
In order to do so, we would have to use the Chi-squared test. But first, let's state our null hypothesis and
the alternative hypothesis.
The Null Hypotheses:H0:There is no statistically significant relationship between thoes two categorical variables.
The Alternate Hypotheses:Ha:There is a statistically significant relationship between thoes two categorical variables.
After running chi square test and checking the conditions of it
if p-value is less than 5% you can reject null hypotheses and get this conclusion that
"There is some correlation between thoes two variables at 0.05 significant level
or
There is a statistically significant relationship between thoes two categorical variables at 0.05 significant level."
( Note : if p-value is bigger than 5% yopui fail tyo reject null hypothese so you can conclude that there is not enough evidence top reject the null hypothese at 5% signifiucant level)
The Chi-square test statistic can be used if the following conditions are satisfied:
1.N, the total frequency, should be reasonably large, say greater than 50 or 30.
2. The sample observations should be independent. This implies that no individual item should be included twice or more
in the sample.
3. No expected frequencies should be small. Small is a relative term. Preferably each expected frequencies
should be larger than 10 but in any case not less than 5.
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING"
TABLE SMOKING_STATUS SEX it gives us contingency table(two-way table) here rowsSMOKING_STATUS and columnsSEX
TABLE SMOKING_STATUS SEXnopercent norow nocol
If you specify the LIST option in the TABLES statement,
PROC FREQ displays multiway tables in a list format
rather than as crosstabulation tables.
The LIST option displays the entire multiway table in one column,
instead of displaying a separate two-way table for each stratum.
TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING"
PROC FREQ DATASASHELP.Heart
TABLE SEX SMOKING_STATUSmissing
TITLE "Hypoteses testing:SEX Vs. SMOKING"
PROC FREQ DATASASHELP.Heart
TABLE SEX SMOKING_STATUS chi_square
if p-value is less than 5% and all assumption are satisfied( I will talk about assumption in advanced SAS course)
we conclude that there is a statistcally association between thoes two varaibles at 5% significant level
in this example the p-value<0.0001> p-value<0.055% that is less than 5% so you can conclude that:
there is a statistcally association between sex and Smoking_Status at 5% significant level
Data type: Numeric and categoric
Note: Numeric variable can be continuous or discrete , we can treat discrete variable like
ordinal variable ( categorical) specially if it has limited options ( like number of children in family :
it is numeric( discreate), so you can apply mean ,standard deviation and so on but you can treat it like categorical variable
for example draw bar chart for it)
Univariate Analysis:
For categorical columns variabels:
for summarization: frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)
for visualization : Bar chart or Pie chart
For Continuous columns :
for summarization: Central tendency(mean, median, mode,...) and
measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max),
and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...
( by using proc means,summary, or univariate in SAS)
for visualization: Histogram , Box plot,density,...
TITLE "BIVARIATE ANALYSIS:SEX AND SMOKING_status"
TABLE SEX SMOKING_STATUS chisq
Since the p-value is less than 5% we reject null hypotheses and conclude:
There is a statistically significant relationship between thoes two categorical variables at 0.05 significant level
PROC FREQ DATASASHELP.Heart
TABLE SEX SMOKING_STATUS NOCOL NOROW NOPERCENT NOCUM MISSING chisq
Here since p-value is less than 5%0.05 we reject null hypothese and get conclusion that there is statistaclly association between sex and
smoking_status at 5% significant level
summary until this point
since p-value < 0.05 we can reject normality at 5% significant level
univariate analysis for categorical variable
proc freq datasashelp.heart
table BP_Status Chol_Status DeathCause Sex Smoking_Status Status Weight_Status missing
since p-value < 0.05 we can reject independency at 5% significant level and conclude:
There is statitically association between BP_Status and Chl_status at 5% significant level
Segmentation and changing values
value ageg format names can not end with numbers
value $sexfmt Hint:To create a format for a character variable, the format name must begin with $:
you can not use number at the end of the format name
don't forget at the end of any format
WARNING: 100% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
Note: you can create several tables in one proc freq
If you specify the LIST option in the TABLES statement,
PROC FREQ displays multiway tables in a list format
rather than as cross tabulation tables.
The LIST option displays the entire multiway table in one column,
instead of displaying a separate two-way table for each stratum.
table gender missing for one table adding list option doesn't have any effect
table modelmissing we can say table make modelmissing list instead
table makemodelmissing list
Vertical calculation (column wise ) vs. Horizontal calculation(row wise)
Proc means datatest MAXDEC 2 sum MEAN var STD CV Q1 MEDIAN Q3 p95 p99 maxThe coefficient of variation (CV)STDmean
Run
It is vertical calculation i.e. SAS it is going to calculate mean, std, ... for each columns
Proc means datatest MAXDEC 2 sum MEAN
Run
Horizontal calculationrow wise
,
so far we did summeraization for each variables columns
what if we want to summerize for each rowsobs as an example assume each rows is student id and scores in different courses for that student i.e you have a student id,
SQL_score, SAS_score, Python_score and so on and you want to find average of score for each studen if we use proc means, or suumary or univarita
we will get average of all students in each course,
or you have a customer Id at 1st column and sales amount for each month of 2021 in 12 columns and we want to
calculate sum of sales and average of sales for each customer
CONNECT YOUR PERMANENT SAS LIB USING LIBNAME STATEMENT
proc means datatest_ sum meanrunIt is vertical calculation i.e. SAS it is going to calculate sum and mean for each columns
As an example if columns are course names and rows are students you can find average of score for each cours
HOW WE CAN DO HORIZONTAL CALUCULATION as an example how we can calculate average of score for each students
recall that SAS select obs by obs and for each obs excutes codes line by line
with missing values
Plus Operator Vs. sum function
TOTAL_SCORET1+T2+T3+T4+T5ASSIGNMENT STATEMENT and Defining new variable and ASSIGNMENT STATEMENT
AVG_SCORETOTAL_SCORE5
TOTAL_SCORE1T1+T2+T3+T4+T5 ASSIGNMENT STATEMENTPlus operator GIVES YOU MISSING VALUE IF YOU HAVE MISSING IN YOUR DATA
TOTAL_SCORE2SUM (T1,T2,T3,T4,T5)ASSIGNMENT STATEMENT IGNORES MISSING VALUES Sum() function ignores missing values
TOTAL_SCORE3SUM (OF T1-T5)ASSIGNMENT STATEMENTYOU CAN USE THIS METHOD FOR VARIBLES WITH THE SAME NAME AND DIFFEREN INDEX
In following data find the total amount and average amount that each customer bought in 2023 so far
Interview question: how can we calculate the average of each row by ignoring missing values?
SAS FUNCTION f(ARG1,ARG2,ARG3)
BY DEFAULT,SAS FUNCTIONS EXCLUDE (ignore) MISSING VALUES IN CALCULATION
TOTAL_SCORESUM(T1,T2,T3,T4,T5)sum function returns to us sum of all non-missing values
TOTAL_TESTN(T1,T2,T3,T4,T5)N function returns to us number of non-missing values
MISS_TESTNMISS(T1,T2,T3,T4,T5)NMISS function returns to us number of missing values
AVG_SCOREMEAN(T1,T2,T3,T4,T5)mean function returns to us avaerge of all non-missing values
MIN_SCOREMIN(T1,T2,T3,T4,T5)min function returns to us minimum of all non-missing values
MAX_SCOREMAX(T1,T2,T3,T4,T5)max function returns to us maximum of all non-missing values
STD_SCORESTD(T1,T2,T3,T4,T5)std function returns to us standard deviation of all non-missing values
TOTAL_SCORESUM(T1,T2,T3,T4,T5)sum function returns to us sum of all non-missing values
Number_of_non_missingN(T1,T2,T3,T4,T5)N function returns to us number of non-missing values
AVG_SCORETOTAL_SCORENumber_of_non_missing
AVG_SCORE2mean(T1,T2,T3,T4,T5)mean function returns to us avaerge of all non-missing values
TOTAL_SCORE1SUM(T1+T2+T3+T4+T5)
first it is going to calculate T1+T2+T3+T4+T5 (so if you have missing value you will get missing)then sum( sum here
doesn't have any effect)
Note: use comma between arguments
TOTAL_SCORE2SUM(T1,T2,T3,T4,T5)
TOTAL_TESTN(T1,T2,T3,T4,T5)Number of non_missing values
MISS_TESTNMISS(T1,T2,T3,T4,T5)Number of missing values
FYI: USING VARIABLE SHORTCUT
TOTAL_SCORESUM(of T1-T5)DON'T FORGET OF
TOTAL_SCORE_PLUS5SUM(OF T1-T5,5) OR TOTAL_SCORE_PLUS52SUM(T1,T2,T3,T4,T5,5)
XSUM(T1-T5)HERE WE HAVE JUST ONE ARGUMENT THAT IS THE RESULT OF T1 MINUS T5
YSUM(T1-T5,5)HERE WE HAVE TWO ARGUMENTS THAT ARE THE RESULT OF T1 MINUS T5 AND 5 THAT MEANS T1-T5+5
ZT1-T5+5
USING VARIABLE SHORT CUT
DATA TEST_04
SET TEST4
TOTAL_TsSUM(OF T1-T3)SUM(OF T1-T3)sum(T1, T2, T3)
TOTAL_SCORESUM(OF T1--T3)SUM(OF T1--T3)sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them
Num_TsN(OF T1-T3)
TOTAL_numberN(OF T1--T3) MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3
USING VARIABLE SHORT CUT
TOTAL_TsSUM(OF T1-T3)SUM(OF T1-T3)sum(T1, T2, T3)
TOTAL_SCORESUM(OF T1--T3)SUM(OF T1--T3)sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them
Total_Ts_and_Sssum(of T1-T3,of s1-s2)
Num_TsN(OF T1-T3)
TOTAL_numberN(OF T1--T3) MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3
xSUM(OF T1--G3)Any numeric variable between T1 and G3
评论
发表评论