sas note1


SAS programs are constructed from two basic building blocks:

1.DATA Steps (starts with data keyword)

2.PROC Steps (starts with proc keyword)

DATA and PROC steps are made up of statements. A step 

may have as few as one or as many as hundreds of 

statements. 


We normally use data step for importing a dataset or creating a new dataset,

or getting a subset of data set, or manipulating a dataset

and we use proc step to print, summerize , or visualize data


note: Sas is case sensitive only for values inside its data but for keywords and identifires( name of variables,

name of tables, ...) is not case sensitive

note: you have to put semicolon at the end of each statementl


 for execution: highlight the block of code that has to have run keyword at the end 

then select running person icon or F3 or run tab>submit 

Note: after any execution you must check the log even if you get the result

Data test1test1 is the name of table(dataset) that will be saved in work library 

cA+b SAS is case sensitive only for values inside a table not for others

run

 i used the above data step to craete a data now for print that data I will use proc print


any dataset in SAS has two leveles for naming like Sashelp.CARS , first we must mention the name of libraray 

then dot after that name of data set

Note: you can skip typing the name of library only if it is work libraray i.e. EUROPEANCARWORK.EUROPEANCAR

NOTE: Work library is the only temporary libraray that means all you have there will be disappear after closing

current session

temp.data1  > the name of library is temp but it is permanent library( the only temporary library is work)


 Get subset of sashelp.cars that includes only European cars

DATA EUROPEANCARWe start a data step to create a dataset that its name is EUROPEANCAR , EUROPEANCAR is output data

  SET Sashelp.CARSSashelp.CARS is input data that I want to get subset of it  

  WHERE Origin  "Europe" I added a condition that is Origin  "Europe"

  Note: SAS is case sensetive only for values of tables

you should put run statement at the end and highlight them togather then hit the running person icine for executing


syntax rules

SAS Program is a series of SAS statement

Most Statement not all of them begin with an identifying keyword


SAS statement must end with semicolon > so, for calculating number of statements count number of semicolons

These statement execute in order i.e. observation by observation ,and for ecah observation line by line (statement by statement), 

That means SAS takes first observation and runs it all the way

through the DATA step (line by line) before looping back to pick up

the second observation.



SAS terminolgy 


data setSAS Table

Variable (var) columns

observation(obs)rows(recrods)

 data values (val)intersectional between Rows and Column




NAMING CONVENTIONS FOR SAS TABLES and VARABILES


1. START WITH A LETTER OR AN UNDERSCORE

2. CAN CONTAIN ONLY LETTERS,NUMBERS,AND UNDERSCORES

3. CAN BE 32 CHARACTERS OR FEWER





NAMING CONVENTIONS FOR SAS LIBRARY(libref)


1. START WITH A LETTER OR AN UNDERSCORE

2. ONLY LETTERS,NUMBER,UNDERSCORE

3. 8 CHARACTERS OR FEWER





Type of libraries


SAS LIBARY:we have two types of library:

1.PERMANENT AND 2.TEMPORARY SAS LIBRARY

 The only temporary library is work> all of the other libraries are permanent


The WORK library is  the only temporary storage location for SAS data sets. 

It is also the default library. If you create a SAS data set without specifying a library, SAS 

will put it in the WORK library, and then delete it when you end your session.




TYPE OF DATA VALUES 



TYPE OF DATA VALUES : 1.Numeric or 2.chategoric: 

If a variable contains letters or special characters, it must be a character(categoric)_ variable. 

However, if it contains only numbers, then it may be numeric or character. 

i.e. for numeric values you are allowed to assign a numeric or charector data type

For example you can store value100 to numeric variable and charactor variable

but you can store value"Hamid2" only in cahractor variable

If YOU STORTE  "HAMID2" IN NUMERiC VARAIBLE YOU WILL GET MISSING VALUES




 IN SAS we consider two types for numbers : 

1.standard like 348 0r 102.5 that doesn't have any special charectors and 

2.non-standard numbers like 25% or 02-12-2022 or $152 or 3,245

for reading non-standard numbers we must specify the format of them(we call it informat and I will talk about that)

NUMERIC :1. Standard numbers 345678432, 156.8 , -267.9

          2. NON STANDARD NUMERIC DATA : have $, %, coma( like 2,564,790) ,dates,..


NOTE:dates in SAS are NON-STANDARD numeric and if you read them with the proper informat 

they are stored as a number of days between that date and

date base (date base is 1st of Janurary 1960)

as an example if SAS read 10-Jan-1960 with proper foramt it stores it with value equals to 9 ((10-Jan-1960) - (1-Jan-1960))


 


CHARACTER: values like "ABCDE","hamid rajaee","5 years" can store only in charector varibles

numeric value: contains only numbers, and sometimes a decimal point andor minus sign. When they are read into 

a SAS data set, numeric values are stored in the floating-point format native to the operating environment. 

Nonstandard numeric values can contain other characters as numbers you must use formatted input to enable SAS to read them.

nonstandard data is data that can be read only with the aid of informats. Examples of nonstandard data include

numeric values that contain commas, dollar signs,  blanks, date and time values and hexadecimal and binary values.

 Note: If you store charector value or non-standard numeric value in numeric variable without specifying the proper format

       SAS will store missing valuees there

 input id name $ age we need to specify charector varibles using $ after them

for inserting values in SAS use datalines or cards satement 

 if you want to insert data maually type datalines or cards then semicolon then 

values, after inserting values semcolon must be in the next line if you put in the last line that obs won't be inserted

this is simple list input that values are separted by space and have default length(8 bytes)

and we don't have non-standard numeric. If these conditions are not satisfied you should use column input, 

formated input or mixed input that I will teach you in later

Note: in simple list input when you insert values for missing values you must type dot

and when you print them you will see dot for numeric missing values and blank for charector missing values 

in the basic way of inserting VALUES(simple list input) 

1.you must type dot for any missing values and 

2.all values must be separated by at least one space( 2 spaces or more is the same as one space

3.All chrector values must have atmost 8 chrectors( the rest of that won't be saved in simple list input)

runin the resultoutput missing values will be seen as a blank for chrectors and as a dot for numbers

but be notice that when you insert them in the simple list input you must type dot for both of them

 input id $ name $  age you can store numbers in numeric or cahrector varibles 

                    but you can't stor charector in numeric varibles . If you do that you will get missing values

 cards or datalines are used to specify that you want to insert some values in the table 

 input id  name   age you can store numbers in numeric or cahrector varibles 

                    but you can't stor charector in numeric varibles . If you do that you will get missing values

 cards or datalines are used to specify that you want to insert some values in the table 


Length of values



Character Length 1–32,767 bytes(Case sensitive) Default 8 bytes

( for each charector you need 1 byte)

Numeric Length 3-8 bytes  Default 8 bytes(16 digits)

 since it is simple list input the deafult length is 8 bytes so you can store up to 8 charectors for chrector values,


 The length of a numeric variable lies between 3 and 8 

bytes. It means SAS can store a numeric value from 1 to 

16 digits.

any character needs 1 byte to store as an example to 

stroe Hamid we need at least 5 bytes and for storing Hamid_Rajaee you need at least 12 bytes.

___________________________________________________________________

Hint: 1)sas program is a series of sas statements

        2)Most sas statement  begins with an identifying key word like data,set,where,proc,..

          all sas statements  must end with semicolon()


, hint : if running of your program takes long time go to icone !(break) in above toolbar and hit that


 SAS reads observation by observation and executes line by line for each observation 




comment: 


Comments are colored green

Note:you should put your comment between asterisk and semicolon or between forward slash asterisk and asterisk forward slash

 your comment    These comments cannot contain internal semicolons or unmatched quotation marks. 

 or  

 your comment  These comments can contain semicolons and unmatched quotation marks.

           Tip: When using comments within a macro definition or to hide text from the SAS macro facility, 

              use this style comment


or In the Microsoft Windows operating environment, if you use the Enhanced Editor, 

you can comment out a block of code by highlighting the block and then pressing CTRL and  (forward slash) together. 

To uncomment a block of code, highlight the block and press CTRL and SHIFT and  (forward slash) together. 




Two basic building blocks of SAS program



 the program contains 2 steps:

                                1)data step 

2)Proc step

   we use proc step to craete a report or graph , ...

SAS statement usually begins with a keyword and end with a semicolon > each semicolon one statement



 Every time Check the log to make sure 

the program run without errors or  warnings


DATA EUROPEANCARwe created (writing) a temporary data set that is EUROPEANCAR or work.EUROPEANCAR 

SET Sashelp.CARSSashelp.CARS is input data set(we reading permanent data set that its name is cars and it

                                                 is availbale in sashelp libraray)

WHERE Origin  'Europe' 'Europe'"Europe"

run

in above data step Sashelp.CARS is input data set and EUROPEANCAR or WORK.EUROPEANCAR are output data set 





Browsing Data Portion and descriptor portion 



Browsing Data Portion

Proc print dataEUROPEANCAR

run


Browsing the Descriptor Portion


proc contents dataEUROPEANCAR be noticed that we have s at the end of contents




ceating Library


creating libraray


SAS datasets can be temporary or permanent.


Temporary SAS datasets only exist during the current SAS session.

Permanent SAS datasets are saved to a location on the computer or server and exist after exiting SAS.

Unless otherwise specified to be permanent, SAS considers all datasets to be temporary

data1work.data1 


There are a few guidelines to keep in mind when you create the name of your library.


Library names(libref name):


1.are limited to eight characters.

2.must begin with a letter or underscore.

3.can contain only letters, numbers, or underscores. Blanks are not allowed.


create your library :1) by coding or 2) by using GUI( Graphical user interface):

 before it create a folder in the server or your computer

using GUI:You can also create a new SAS library by opening

the New Library window from the Libraries section of the navigation pane in SAS Studio:

icone new library in up toolbar or Explorer window >Right click on Library > New): give the name for example hmd,

enable at start up, 

choose your path and ok 

 I want to create a subset of sashelp.cars in my library

by coding:go to any drive for example.C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\My sas library

and create a folder for example My sas library then copy  its address and come back to sas and write the keyword libname 

and give any name you want as a libref for example Hmd and then

paste just the address in single or double quotation 

 highlite the code and F3 or choose the runnig person icone


For any SAS session, You need to submit a LIBNAME statement (only once during each SAS session)

The libref remains in effect for the duration of that session. 

But the data that you stored in that library exist for ever until you delete them, 

but in any session you should say where that library is located .

Note:You can assign different names for libref for the same 

location and use it i.e.

The libref can change, but the member name( file name  dataset names), and directory  must the same.

(A libref is just a nickname that corresponds to the location of a SAS data library. )

as an example for the same folder I will assign another name as well

libname h2 "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\libreary for practice"



LIBNAME statement 


A LIBNAME statement consists of

1.the keyword LIBNAME.

2.the libref name that you want to use.

(that have at most 8 charectos and starts with letter or underscore and contain leters, underscore and numbers)

3.the pathname that represents the physical location of the library and wrapped in sigle or double qoutation marks.

4.a semicolon.


after every run check log


It created a temporary data set that its name is euro inside work libary eurowork.euro 

If you have a data in work library that you need for future you should better to save as in permanent library

to get a save as(copy and paste) euro dataset

It created a permanent data set that its name is euro inside h1 libary 

proc print datah1.euro if You don't mention which data SAS will execute for the most recent data

data  h1.Asiaoutput data is h1.asia

set Sashelp.carsinput data is sashelp.cars

where origin "Asia" be notice that unlike SQL that you are allowed to use only single qoutes to wrap the charector values, In SAS you can use either single or double qutes

proc print in proc if you skip mentioning name of data ,that proc run for the most recent data that you worked on it



Using SAS for calculation



Missing Data Values :

A Character missing value is displayed as a blank

A numeric missing value is displayed as a period (.)



SAS Statements are free format:

One or more blanks or special characters can be used to separate words

They can begin and end in any column(i.e. you can put spaces at the begining)

A single statement can span multiple lines

Several statement can be on the same line


but becarefull that each staement must be ended with semicolon



In any session you should specify where that library is located .You can assign different name for libref for 

same location and use it i.e.

The libref can change, but the member name( file name), and directory  must the same.

(A libref is just a nickname that corresponds to the location of a SAS data library. )



craete a data in SAS (manually ie. by inserting values) :


step1:  assigning a name for data(table) (After data keyword)

stpe2: assigning names for variables(columns) and assigning data type for them (after input keyword)

step3: inserting values (after cards or datalines keyword)



data my_1st_dataoutput data is my_1st_datawork.my_1st_data that is temporary data

input name $ age gender $ weight  height we use input statement to assign variable's name column's nameheader :

variables'name:

                                            1)are limited to 32 characters.

2)must begin with a letter or underscore.

3)can contain only letters, numbers, or underscores. Blanks are not allowed

                                        To create a character column, specify a $ sign righ after the sapace after

column name.

Note: unlike SQL that we use comma to separate columns names in SAS we use sapace

when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.

Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns

 For inserting data you should type cards or datalines then semicolon and in new rows type observations after

the last observations(rows) go to new line and  type semicolon

for inserting values use cards or datalines


you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted 

So important: after any execution in SAS you must check the log to make sure that result is reliable or not

proc printrunin any procedure if you skip calrifying name of data , that proc excute for the last data that is executed



in result output characters are on left side, numeric are on the right side 

Note: for numeric variable if you insert charector you will get missing values


 I stored that privouse data set in work library, Since it is temporary libraray after closing the SAS program , 

this data is goinig to be disapear. If you want to save it permanently you have to create that data in permanent 

library or getting save as like 


Data Hamid.myDataHamid.myData is output data that I want to save in in Hamid libraray

note: you can save it with the same name like Hamid.my_1st_data

set my_1st_datamy_1st_data is input data that exits in work libraray and I want to get a copy of it

Run I created new data set in hamid library and I named it mydata and I save as all data that I have in my_1st_data

input  id   name $  family $ height  several sapces is the same as one space

 you can save numbers as a charector as well

Brows Data descriptor

 for inserting any non-standard numbers in your dataset( table) you must use informat


inforamt Vs. format



 for reading and writing non-standard numeric values you can use inforamt( for writinginserting values)

and format( for readingprinting values)

input  id  name $  family $ age dob several sapces is the same as one space

informat  dob ddmmyy8.informat for reading perpuse (inserting in table)

                       Note:right after each format_name put dot

note: some availble format :

14.05.85 or 140585 >ddmmyy8. or ddmmyy.

17032013 >ddmmyy10.

03172013 >mmddyy10.

17MAR13 >Date.

17MAR2013 >Date9.

you can find more by typing date format in help window( use help tab)


SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS

 For reading date to sas(Informat) we should specify format of date

 Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and

 3rd 2 digit is year


 after checking a log you will find that this data is created successfully

 but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960

proc print datamytestrun

 we use informat for reading and we use format for writing


 you can use any other date format

you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values

input  id  name $  family $ age dob several sapces is the same as one space

informat  dob ddmmyy8.informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step

format dob date9. 

if you define format in data step it will saved with it but you can override whenever you want 

 The most 7 important proc:

1)proc print  to brwos data portion

2)proc contents to brows desciptor portion

3)proc freq  to summarize chractor variables: create a table that gives us frequency of each level( unique values)

                                                                         and chi-square test

4)proc means to summarize numerical variables

                            5)proc univariate to summarize numerical variables with more information

                        and test of normality

6)proc sql  to use sql syntax

7)proc sort  to sort data and deduplicate data          

we talk about them and other procedures during this course and advanced SAS programming 

Note: we have more procedures like proc import for importing data, proc export for exporting data,

proc ttest, proc Anova, proc reg, ...

input name $ age gender $ weight  height we use input statement to assign variable's name column's nameheader :

variables'name:

                                            1)are limited to 32 characters.

2)must begin with a letter or underscore.

3)can contain only letters, numbers, or underscores. Blanks are not allowed

                                        To create a character column, specify a $ sign righ after the sapace after

column name.

Note: unlike SQL that we use comma to separate columns names in SAS we use sapace

when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.

Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns

 For inserting data you should type cards or datalines then semicolon and in new rows type observations after

the last observations(rows) go to new line and  type semicolon

for inserting values use cards or datalines

you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted 


proc printrunif you skip naming data, SAS conduct that procedure on the most recent data


Descriptive statistic ( summarizing existig data)(part1)


For each variable it shows frequency of each level(frequency of each unique values)

proc freq datamy_1st_datarun


proc freq datamy_1st_data

table genderunlike other procedurs that we specify name of variable by using var statement in proc freq we use table since we looking for table that shows us frequency of each levels

run

proc univariate datamy_1st_datarunproc univariate by default returns more information compare to proc means and you can draw a grapgh with it as well that I will teach you

                                                                      in this version of SAS 1."as" in proc sql is not optional

                                                                        and 2.you can not say [avrerage of weight] 

 from  my_1st_data notice that my_1st_data is a SAS dataset and exits in work SAS library 

Reacall that:

order in SQL select statement:

select ....

from ....

where ....

group by ....

having ....

order by ....


rermeber

"Some Female Workers Go Home Ontime"


 Sorting data

Proc  sort datamy_1st_data it will sort the original data unless you will specify output(i will talk about that)

By agesorting by age deafaultascending

proc print datamy_1st_datarun

Note that in SAS missing values are considered less than any value that you can assume

i.e. in SAS unlike R and Python, missing values consider less than any values that you have

Proc  sort datamy_1st_data

By descending agein SAS unlike SQL 1.we use complete word of descending 2.before variable name

Proc  sort datamy_1st_data

By desending agedescendingdesendingdecending

input  id $  name $ family $ height  several spaces is the same as one space


FYI

Label (part1)


LABLE STATEMENT 

LABEL TO VARIBLE TO BE MORE DESCRIPTIVE

CAN BE UP TO 256 CHARACTERs can conatain any charector even space

if you want to get labels in print you need to use lable option in proc statement

 if you assign lable in data step it will saved with data ( you can overwrite later if you want)

but if you assign lable in [proc step it won't saved with data and only apear in that proc 

 in labels you don't need to put them inside qoutation marks

proc print datahamid.cars labelrun to see the labels in proc print statement you must type label


 for displaying lable you have to type lable keyword in  print statement then in other lines

define lable unless you define lable before

if you put a LABEL statement in a data step, the labels will be saved with the data

if you put a LABEL statement in a PROC, then the lables will be used only by that particular execution.




create dataset based on one existing dataset in SAS


subset 

Ex. Create a data set that stores all records of cars data set in sashelp with originAsia

data test testwork.test and it is the name of output data

this statement create table and its name is test and its library by default is work, 

if you want to save it permanetly you have to put a libref

           befor test and dot like: Hamid.test but because we didn't do that here it is temporary data and saved in work library

 set sashelp.cars  sashelp.cars is the name of input data

                     that means gives data from library sashelp and table  cars

 where origin "Asia"

RunCheck log after any execution 

proc printruncharacters are on left side, numeric are on the right side

Ex. create a subset of cars data set in sashelp that includes only Sedans from Europe

Ex. create a subset of cars data set in sashelp that includes only european cars

 where origin"EUROPE"O observation means no observation meet that condithion because SAS in case sesitive for values

Note: SAS is case sensitive in terms of vaues inside its table


upcase


where upcase(origin)"ASIA"

originupcase(origin) If you want to upcase values of origing permanently

where upcase(origin)upcase("EuropE") or upcase(origin)"EUROPE"  >upcase(origin) converts all of values of origin column to upercase temporary and compare them with upcase("eUropE") that is "EUROPE"



distinct (unique) values


If you want to make sure about unique valus of origin

proc freq datasashelp.cars

table origin

run so you can see that we ahve only "Europe", "Asia" , and "USA" and  we don't have "EUROPE" or "europe" or "eUropE"





creating new variable(column)


you can create a new variable in data step by  assinment to the new name 

 you can create a new variable in data step by just using a new name and assigning a value to it


Delete a variable


 In data step you can delete a variable either by drop opthions or by using keep otions with other variables'name,



drop


Data test5 (dropinvoice type) don't put comma between variables' name


Data test5b (dropinvoice type) in SAS don't put comma between variables' name


keep


Data test5b (keepmake origin type newprice) in SAS don't put comma between variables' name

data data1(keep name gender new_age)this is keep option

keep name gender new_agethis is keep statement



TitleFOOTNOTES STATEMENTS


Up to 10 TITLEsFOOTNOTS STATEEMENTS

TITLE "THIS IS MY FIRST REPORT"

TITLE2 "THIS IS MONTHLY@#$%^& SALE REPORT"

TITLE3 "HERE IS ANOTHER TITLE"

FOOTNOTE "THIS IS MY FIRST FOOTNOTS"

FOOTNOTE2 "THIS IS MY SECOND FOOTNOTE"

FOOTNOTE3 "THIS IS MY THIRD FOOTNOTE"

PROC PRINT DATA data1



proc print



Note: proc print have more options that allowes you to choose featurs(columns)(variables) and 

number of obs(rows) that you want to print 

proc print with var satetments

PROC PRINT DATA SASHELP.CARS

VAR TYPE ORIGIN MAKE  INVOICE I used Var statements in proc print to specify which features I want to see in print

RUN

Note: we can use drop or keep in data step when you want to select some varaibles durnig getting a copy of data

but in poc step we use var statement to slect some varaibles or table statement in proc freq


HOW TO CANCEL PREVIOUS TITle AND FOOTNOTES STATEMENT? 

By making them empty or assigning new one to overwright previouse one

TITLE

FOOTNOTE


proc print with obs and firstobs option

PROC PRINT DATA SASHELP.CARS (obs5)it prints the first 5 obs (head of data)

PROC PRINT DATA SASHELP.CARS (obs10)run

PROC PRINT DATA SASHELP.CARS (obs5 firstobs3)

it will start from 3rd rows(obs)( that means you don't want to print the first 2 obs) and end at 5 obs so you will see 5- 23obs

PROC PRINT DATA SASHELP.CARS (obs10 firstobs3)

it will start from 3rd rows(obs)( that means you don't want to print the fist 2 obs) and end at 10th obs so you will see 10-(3-1)10-28 obs


Internal data:Craeting data set and inserting data manually

so far you saw simple list input that 

1.seprater is space and 

2.you must type dot for missing values and 

3.you can not  have non-standard numbers like date, 48%, $23000, or comma for numbers like 3,642,912

4.you can store charectors up to 8 charectors( default length)

 for inserting any non-standard numbers in your dataset( table) you must use informat


inforamt Vs. format



 for reading and writing non-standard numeric values you can use inforamt( for writinginserting values)

and format( for readingprinting values)

input  id  name $  family $ age dob several sapces is the same as one space

informat  dob ddmmyy8.informat for isertingreading perpuse 

                       Note:right after each format_name put dot

note: some availble format :

14.05.85 or 140585 >ddmmyy8. or ddmmyy.

17032013 >ddmmyy10.

03172013 >mmddyy10.

17MAR13 >Date.

17MAR2013 >Date9.

you can find more by typing date format in help window( use help tab)



SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS

 For reading date to sas(Informat) we should specify format of date

 Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and

 3rd 2 digit is year


 after checking a log you will find that this data is created successfully

 but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960

proc print datamytestrun

 we use informat for reading and we use format for writing


 to see the result for dob in undrstanding form use format statement

proc print datamytest

format dob ddmmyy8.

run

 you can use any other date format

you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values

input  id  name $  family $ age dob several sapces is the same as one space

informat  dob ddmmyy8.informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step

format dob date9. 

if you define format in data step it will saved with it but you can override whenever you want 

recall that in simple list input you are not allowed to have space inside values

Formatted input enables you to read data stored in non-standard form, such as dates, percent or

numbers with commas.

and you don't need to put dot for missing values when you insert data

but all values of each column must starts from the same position

Note: in SAS unlike Python index strarts from 1

 input @1 dob mmddyy6. note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1

       @8 fname $10. Read values based  on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format

     we will get 4 obs because of the empty line after 3rd obs

proc print runwe got dob as a number of day difference from 1st of Jan 1960

 input @1 dob mmddyy6. note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1

       @8 fname $10. Read values based  on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format

     

 FYI: You can mix foramtted input with simple list input

       @8 fname $10. Read values based  on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format

       



Non-standard numeric values eaxamples: 47%, $240, 11102022 for using  mathematics functions like average, sum, subtraction,... we need to spaecify they are numeric values

 for reading(insereting) non-standrd numbers we use  informat and for writing ( printing) them we should use format


Hint:You can save format when you read data by using format statement in this situation you don't need to use fromat when you writing data

but you still can override that fromat

hint: when all of values for each column start from the same position you can use formatted input like the follwingh example

DROP DUPLICATE OBSERVATION IF EXIST( deduplicate data)

PROC SORT DATAtest2 OUTtest3 NODUPKEYhere input data is test2 and output data is test3 and both are stored at work library

 BY _ALL_ if 2 obs( rows) have same values for all columns the second one will be consider duplicated

 chech the Log > 1 observations with duplicate key values were deleted

FYI: Getting your data into SAS with proc IMPORT using GUI:

First Make sure that the data is closed

File> Import Data>Select the type of file for example CSV> Brows the location of file>

Selecte the library that you want> assign a name to it in member session>

select a locatio for proc import to save that and assign a nmae for it> finish

you need to check the log to make sure that you read data correctly

Note: By deafault sas uses only the first 20 rows to guess the datatypes of variables if

you get warning or error in the log you may neeed to assign a greater value to GuessingRow




Segmentation(binning)and Discretization


 SEGMENTATION is the process of dividing continuous variable into segments( different groups)

Discretization :Process of transferring continuous variable into discrete variable


Methods for changing values for segmentatioin or changing apearence :

a)with proc format

b) with conditinal statement like if statement or case when


 Proc format 

Note:We can use proc format to define new format for writing(printing) data, we normally use it to change values or encoding, 

or segmenation(binning),

or... 



SO Important:


Proc format



in proc format we use value statement to assign a name for the format that we want to create and defining our format


Rules for defining a FORMAT  name:  

1.If you are creating a format for character variables,format name must start with $ 

  i.e. use a dollar sign ($) as the first character of its name.

                               for numeric variables must begin with letters or underscore 

2.The name must be a valid SAS name (Only letters, underscore and numbers are valid characters),

3.but Can’t end with a number and for charector variables must start with $

4.Max length: up to 32 charactor including $ 

 so A numeric format name can be up to 32 characters. A character format name can be up to 31 characters

(because we have $ before that).

5.name can’t be the name of an existing format (can't overrride) 

pay attention that the format name Can’t end with a number

Note: when you assign a name for format name in proc foramt it doesn't end with dot but after assigning its name

             whenever you want to use it you must put dot at the end of its name to specify that is a format name



scenario:assume we have age column in our data and we want to do segmenation based on the following explenation:

If the person is youngr than 30 years old , it belongs to youth group,

If heshe is between 31 to 60, it belongs to midele aged group and 

older than 60  ,it belongs to  senior 


Since age is numeric column ,format name for it must starts with letter or underscore

proc format

    value agegroup agegroup is the name of this foramt that we want to use it for numerical values that's why

we don't have $ before its name

Note: after creating  you are allowed to use this format for any numeric columns  

                Note:After assing a name to our format when ever we want to use this format we  must put dot 

after format name

    low-30'youth' we are allowed to use low and high keyword note: <30 is not acceptable in proc format

  format dob empdate date9. age agegroup. I used format named agegroup,use dot after name of format

Note: be noticed that in proc format we use low and high keyword and it returs missing values for missing values


Five advantages of using proc formats:


1.The orginal data will not been changed. You can use the original variable values in the analysis.

2.You can apply formats to both character and numerical variables.

3.You can use formats to merge categories and

  to bin numeric variables.

4.you can apply a format to multiple variables in multiple data sets.

5.You can save formats in a permanent libref and use them across SAS sessions.


Note:You can define several format name in one proc format 

I saved test data set that is temporary data set in work library as a  permanent data set with the name of

test2 in permanent library that its libref is hamid with saved format 

proc print dataHamid.test2run

proc contents datahamid.test2run be noticed that the data type of age and sex


If you want to have original value for age and created new formatted value you can do like below:

  age_groupeagecreating new variable that its name is  age_groupe and its values are the same as values in age column 

  format dob empdate date9. age_groupe ageg.

  gendersexcreating new variable that its name is gender and its values are the same as values in sex column 

  age_groupage

  format age_group ageg. gender $genderg.

b) with if statement: you can use if statement for segmentation(bining).with if you can define new variable as well

When you use if statement for binning(segnentation) you should be careful about 2 things

1.If you define new variable inside if statement without defing length of that before it :

SAS will assign length of that new variable equals to length of first value that you initilized to it

you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement

2.Be carefule that SAS consider missing values less that any numbers that you can think in comparision

i.e

. < 30000 

.< -234567890876546789

I will tell you how handle these problems 

senario: income less than 30000 is low income , between 30000 and 50000 (not include 50000) is mediom income and greater than or equal to 50000 is high income

lets create new variable (type) based on income

if income<30000 then type'low income'SAS is going to create new column that its name is type and length of that 

                                       is based on initial value


else if income<50000 then type'mediom income'else if 30000<income<50000 then type'mediom income'


 if you use "if " instead of "else if" the second one overrride 1st one

 if you use "if " instead of "else if" the second one overrride 1st one

unless you say if 30000<income<50000 then type'mediom income'

else type'high income' 


As I mentioned we have two problem when we use if statement 1st length  of values 2nd missing values 

but we can fix them:

you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement

if income<30000 then type'low income   '

else if income<50000 then type'mediom income'


Hint: Be careful that SAS consider missing values less that any numbers that you can think 

you can use one of the follwing methods to handle missing values easily


Coditional statement

in data step you can use if or where but in proc step you are allowed to use only where

Hint: If you want to add condition in procedure if statement doesn't work you should use where statement

 I will talk about if Vs. Where in detail



PROC PRINT DATA SASHELP.CARS(OBS428 firstobs424) OBS tells SAS when to stop processing observations and firstobs means from which starts

                                                       here we have 428-424+15 obs in pint

LIMITING THE NUMBERS OF VARIABLES

_________________________________________________________________________


PROC PRINT DATA SASHELP.CARS(OBS20)

note: we can use if as well as where for filtering records. But there are some points that

you need to pay attention


WHERE

Where Vs. If


 We are not allowed to use if in proc step

For FILTERING THE DATA VALUES in proc Use WHERE CLAUSE


 some operators that you can use with where:

GE GREATER THAN OR EQUAL TO

GT GRATER THAN

LE LESS THAN OR EQUAL

LT LESS THAN

EQ EQUAL

NE NOT EQUAL

IN OPERATOR: for checking if that value belongs to that set (checking membership) 

BETWEEN... AND...  :to check if that value belongs to that range including end points


PROC PRINT DATA SASHELP.CARS(OBS5)SAS will print the first 5 observations that met follwing condition

WHERE Cylinders GE 6GE is the same as >

If Cylinders GE 6Note:unlike where statement ,you are not allowed to use if in proc step you can use if only in data step 

WHERE Cylinders IN (4,8)Cylinders IN (4,8)  is the same as Cylinders 4 or Cylinders 8


PROC PRINT DATA SASHELP.CARS(OBS10)it will be prit the 1st 10 obs that are satisfied following conditions

WHERE type IN ("SUV","Wagon")

PROC PRINT DATA SASHELP.CARS(OBS10)

WHERE type IN ("SUV","WAGON")

PROC PRINT DATA SASHELP.CARS(OBS20)

WHERE upcase(type) IN ("SUV","WAGON")

upcase (Wagon)WAGON

upcase (sedan)SEDAN

upcase(SUV)SUV

 or you can check all leveles by using proc freq 


 informat vs. format


 informat is for reading purpose, format is for writing purpose you can use format in data step or any proc step

READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS

DISPLAY THESE NON STANDARD NUMERIC :FORMATS


Recall that:

TYPES OF DATA : 1.NUMERIC 2. CHARACTER

NUMERIC:STANDARD AND NON STANDARD

NON STANDARD NUMERIC DATA: E.G DATE:14JAN 1970,75%, 1,000,000, $240

OBS tells SAS when to stop processing observations and firstobs means from which starts

  here we have 7-3+15 obs in pint

  format dob empdate date9. 

proc print datatest2 noobsnoobs means no index 

statement make sure that your data is sorted by that value before

Unlike variable's name that must be less than or equal 32 charectors  starts with  letters or underscore and

can conatain only letters, underscore and numbers 

  label can contain and start with any thing but must be less  than or equal 256 charectors


PROC PRINT DATA sashelp.CARS (obs5)run

PROC PRINT DATA sashelp.CARS (obs5) LABEL  

label type"@@##Type of car:"run


PROC PRINT DATA sashelp.CARS (obs5) LABEL  

label type@@##Type of car:

 if you want to use lable statement you have to type "lable" keyword in first statemen as well

  label fnamefirst name lnamelast name

 if you want to use lable statement you have to type "lable" keyword in first statemen as well

  label fnamefirst name lnamelast name


 if you want to print ecah group in one table you should use by statement in print

proc print datatest2 label

 if you want to use lable statement you have to type "lable" keyword in first statemen as well

  label fnamefirst name lnamelast name

  by sex you might get somthing wrong

So important: when ever in SAS you want to use By statement you must sort the data by that variable first unless it was alredy sorted


 if you want to use lable statement you have to type "lable" keyword in first statemen as well

  var fname lname empdate income sex age 

  format empdate date9. age ageg. sex $genderg. 

  where sex'M' and age>20two conditions

  by income  to print different table for each income

  Note: every time you use' "by" for any variable'  you must sort by that variabe first unless the data was sorted before

  look at the result with and without sort

  label fnamefirst name lnamelast name

  title 'this is an example'

  footnote "created by hamid"

please check the log, although we got a table in results window, but it is not reliable because we have error in log


SORT PROCEDURES

BY DEFAULT IT IS ASCENDING(from smallest to greatest) : 



  Note:1.For chategorical variables First it is gana sort based on the first digit or charector:

  missing values < 0-9 < A-Z < a-z  ,

  (in SAS upper case less then lower case and  in alphebetic order  for example A<a, Z<a A< B,a<b, A<a, B<a)

  if the first digit or charector are the same it is gana check the  second one and so on

Recall that SAS is case sensitive in terms of values

  2.for numeric varaiblles it is going to sort based on their values ,missing comes first

INPUT NAME $ @@list input

Note:@@ means you may have more than one observation in atleast one line

Because we didn't mention the name of data,  it is gana sort the most recent data that we excuted

and because I didn't use out option it is going to change the original data 


proc sort datatest0 outtest0_1  when you are sorting data it is better than save output in different table because

                                  1)normaly we don't like to change original table 

2)it takes a lot of sources to shuffel and save in the same table 

by descending nameunlike sql that we mention ascending or descending after variable name in short form 

in SAS we must specify decending before any variable in complet form

 In SAS we use Descending before any variable, in sql we use Descending after variable and 

 in short form(desc) 

   



proc sort datasashelp.cars outtest2Input data is sashelp.cars output data is test2work.test2

 by Decending type invoicedescendingDecending ,here it was sorted first by type in descending order , then for all observations in

                             the same type they

  were sorted by invoice in ascending order(default is ascending)

 by Decending type Decending invoicehere it was sorted first bt type in desceng order , then for all observation in same type they

  were sorted by invoice in decending order

import data participant1.csv and save it as HAmid.Parti


nodupkey

you can use nodupkey in proc sort to drop observations that have same values for featurs that you are sorting by them 

it will keep first one and drop other duplicated in that column


proc sort datahamid.parti nodupkey outhamid.parti2

  by sexdeletes observations with duplicate BY values (here no sex duplicated)

         hint: for duplicate value in sex keep first row and drop others

 senorio : Find highest income for each level of sex in hamid.parti



Be carefull if you use nodupkey when you sort two variables SAS is going to consider one observation duplicated if it has same values for 

thoes two varible with at least one of priviouse obs

PROC SORT DATAt2 nodupkey OUTt4

BY  sex salary here SAS considers one obs is duplicated if sex and pai both are the same with priviouse obs



Recall that in simple list input :

1.we have space as a delimeter and (we can solve this problem by using delimeter in modified list input)

2.we don't have non_standard numbers (we can solve this problem by using informat in modified list input)

3.missing values encoding with dot, (we can solve this problem by using delimeter in modified list input)

4.and charectros values have a default length8 bytes(8 charectors)(

we can solve this problem by using length statement or informat in modified list input)






NON STANDARD NUMERIC

 like : 98%, date, $678 or values that have comma 3,997,657

READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS

DISPLAY THESE NON STANDARD NUMERIC :FORMATS

INPUT ID AMOUNT DOLLAR12.4 IT is modified list input that we specify informat in front of variable

DOLLAR12.4 is informat for reading but for writing 

if you want to see $ sign you should define format 

here in data step or in proc step



Type casting


Converting variable types from character to numeric and numeric to character:

you can convert numeric to charector by using put()function and

you can convert charactor to numeric by using input()function if its values is convertable 

 put always convert from numeric to char and input always convert from char. to numeric  

new_numinput(char,4.)   I define new variable its name is new_num and convert character to numeric with 4 digits

new_charput(num,3.)  I define new variable its name is new_char convert numeric to character  

If the character variable char in the above example contains missing values or non-numeric data then the 

value of new_num will be missing. When char contains non-numeric data, an 'invalid argument' note will be written 

to the log



Hmw2.3 Example Universe Cars is surveying its customers as to their preferences for car colors. 

They have information about the customer’s age, sex (coded as 1 for male and 2 for female), 

annual income, and preferred car color (yellow, gray, blue, or white). 


STATISTICS

APPLIED STATISTICS

1.DESCRIPTIVE STATISTICS (means summarizing existing data( previous and current data))

 2.INFERENTIAL STATISTICS(Statistical inference is the process of making as estimate, prediction, or decision about a population based on sample data. )

3.PREDICTIVEPRESCRIPTIVE STATISTICS


###############################

Exploratory Data Analytics (EDA) 

###############################

Analysts need first to explore the data for potential research questions before

jumping into confirming the answers with hypothesis testing and inferential statistics.

EDA is often likened to “interviewing” the data it’s a time for the analyst to 

get to know it and learn about what interesting things it has to say.

As part of our interview, we’ll want to do the following:


1.Classify our variables as continuous, categorical, and find distribution of them

2. Summarize our variables using descriptive statistics 

3. Visualize our variables using charts


EDA is a process to try to understand existing data by visualization, Summarization, data cleaning, and 

testing hypothesises 



#We select the method for summmrizing data and analysis and testing hypothesis base on data type


Assume you have a breast cancer data set and in it you have ID, age, weight, height, blood pressure

and cancer . Cancer is binary 0 an 1 (0 means not having breast cancer and 1 means  having breast cancer)

we want to built a model to predict having cancer for new patients

So in this situation cancer is target and we call that target or dependent variable or output variable and

we call other columns predictors or independents or features or input variables


we have two data types numerical and categorical data. 

Categorical data has two types :Nominal(like name, eye color) and Ordinal(like education level, day of week)

numerical data has two types continuous (that can accept decimal like height and weight ) and 

discrete(like number of children, number of females, ...)


If your variable has only two levels we call it binary at we can consider it as a spcial case of nominal categorical variable




TWO TYPES OF DATA (VALUE)1.NUMERIC 2.CHARACTER 

numeric data could be discrete or continuous

If your data is discrete with limited levels like number of childeren ( number of cylinder in cars),

you can treat it as a categorical variable as well i.e. you can find mean and varaince because it is 

numeric and you can draw bar chart and pie chart as well 


SAS STATITICS:1.CONTINUOUS 2.CATEGORCAL


CONTINUOUS: like: AGE,income,IQ,SALES,PRICE


Charector variables like: gender, name,purpose of loan, loan status


DESCRIBE ONLY one VARIBLE:UNIVARIATE DESCRIPTIVE ANALYSIS

DESCRIBE ONLY TWO VARIBLES:BIVARIATE DESCRIPTIVE ANALYSIS

DESCRIBE MORE THAN 2 VARIBLES:MULTIVARIATE DESCRIPTIVE ANALYSIS


Univariate Analysis:

  For categorical columns variabels:

                      for summarization:   frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)    

                            for visualization :  Bar chart or Pie chart

  For numerical columns  :

                            for summarization:   Central tendency(mean, median, mode,...) and

                                                 measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max), 

and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...

( by using proc means,summary, or univariate in SAS)

                            for visualization:   Histogram , Box plot,density,...




https:online.stat.psu.edustat500



Continuous Variabels


WHAT PROCEDURE we USE FOR summerizing Numerical variable

1)Proc means 2)Proc summary 3)Proc univariate


 Proc means 12 


 

Proc means datatestby default proc means do summearization ( univarite) for all numeric variables and

Runproc means by deafult returns only N(number of non-missing values) , mean( average), Std Dev (standard deviation), minimumn and maximum with 7 

decimal numbers

Proc means datatest maxdec2

Run


title " The five-number summary"

Proc means datatest min Q1 median Q3 Max run

Q1 is 1st quartile25th percentile that means the value that 25% of obs have value less than or equal to it

medianQ2 is 2nd quartile50th percentile that means the value that 50% of obs have value less than or equal to it

Q3 is 3rd quartile75th percentile that means the value that 75% of obs have value less than or equal to it


 if you have a big distance between min and p1 but not between p1 and Q1 you have outliers and 

if you have a big distance between p99 and max but not between Q3 and p99 you have outliers

note: we have some other methods for finding outliers like 1. mean and standard deviation( Z score)

                                                            2. Interqurtile Range or box plot(Tukey and extreme Tukey method)

                                                            3. LOF

that you will learn in statistic and data mining course 



By default proc means do descripteve analysis for all continuouse varibles in data set

title " Describe one variable"

Proc means datatestdo not forget "s" at the end of proc means and proc contents

output outdb1.data_summary_Oct18th you can save the reult of proc means as a new table (dataset) by using output statement


Bivariate Analysis


 in SAS base in bivariate analysis I just talk baout summerization, and in advanced SAS I will

teach you hypothese testing, 


Continuous VS. Categorical with proc means 


title "Bivariate Analysis"

title2 "Continues VS. Categorical"

title3 "DESCRIPTIVE OF Income Vs. Gender" 


Proc means datatest maxdec2

 var income after var list numeric variables abd after calss or by list categorical variable

 class gender Bivariate continues VS. categorical

run

footnote "to get the idea about association between income and geder if exits"


title "Bivariate Analysis"

title2 "Continues VS. Categorical"

title3 "DESCRIPTIVE OF Income Vs. age_group" 

footnote "to get the idea about association between income and age_group if exits"


title "Multi-variate Analysis"

proc means datatest maxdec2 NMISS MEAN STD CV MEDIAN p20

  var age income summeraiztion for these two continuous varibales

  class gender ageby using fomat I will vconvert age to categorical variable and use it for grouping

  format age agefmt.

Hint: You can use by statement instead of class statement in this situation you will get separate tables for each level and be carefull if you want to use by you must sort by that variable first 

the following code is just FYI

proc means datasashelp.baseball p1 p5 p10 p20 p25 p50 p75 p90 p95 p99 maxdec2

    var nHits

In SAS in assignment if you want to assign a value of keyword to your own variable we specify that keyword on the left side of equal sign

and the name of your own varibvle on the right side of equal sign 

  in proc means for saving the result in one table you can use ouput statement

    output outpercentiles_proc_means

p1  P_1_hamid P_1_hamid is the name of varible that i assigned value of p1 to it

proc means datatest maxdec2 mean clm maxdec2clm is confidence interval for population mean that we predict a range of mean for population( inferential statistic) by default at 95% confidenc

 

clm is confidence interval

Let’s assume that I will give you an exam with 30 marks, and I say that based on my experience your class average mark would be between 21.75 and 26.25 at 95% confidence

That means there is 5% possibly that your class average would be less than 21.75 or bigger than 26.25 


If I want to predict your class average at 100% confidence I must say:

your class average mark would be between 0 and 30 at 100% confidence



proc means datatest maxdec2 mean clm alpha0.10 alphasignificant level0.10> confidence 1-alpha 0.90

clm is confidence interval that we predict a range of mean for population at 90% confidenc

The bigger confidence the wider interval

FYI:alpha is called signifiacnt level it is probabilty of Type I error(FP)

Proc means datatest MAXDEC 2 n nmiss mean std stderr cv lclm uclm median min max range Q1 Q3 qrange maxdec2 



where VS. If


1. you can use where in proc step or data step but you can use if only in data step 

    i.e you are not allowed to use if in proc step

 2.we have where option as well



where option vs. where statement


FYI: Proc summary 12 

Proc SUMMARY and Proc MEANS are essentially the same procedure. Both procedures compute descriptive statistics. 

The main difference concerns the default type of output they produce. Proc MEANS by default produces printed output in 

the LISTING window or other open destination whereas Proc SUMMARY does not. Inclusion of the print option on the Proc SUMMARY 

statement will output results to the output window.


Proc SUMMARY 


When all variables in the data set are character the same output: a simple count of observations, is produced for each 

procedure.However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables  and 

produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum).

Inclusion of a VAR statement in both Proc MEANS and Proc SUMMARY, produces output that contains exactly the same default 

statistics.

proc summary datatest In proc suumary you must use print option to see the result or use output statement to save the result in one table 

  output outsumdata_For_Hamidcreating new data which name is sumdata_For_Hamid an availble in work library

                                you can save the result in permanent library if you want


Hint:you are not allowed to use if statement in procedures

proc print datasumdata3arunyou can see bigger number of type shows you more detailed in terms of classification

 I classified all of observation based on gender and age so type 3 is classifying based on gender then age

class gender age I classified all of observation based on gender and age

 here n is keyword and we assign the value of that to number_of_non_missing, In SAS keywords located on

the left side of equal sign


proc univarite



Note:In proc univarite you can get more information about continuous varaible by default , and unlike proc means and summary

    1.You can use proc univarite to test of normality and to visualize data as well

    2.you can create a graph

3.when you want to create output in proc univarite you must specify what _stat_ you want to save 

4.we use pctlpts20 instead of p20 and so on and The PCTLPRE option must be specified to generate additional percentiles

5. you can use the PCTLDEF option in PROC UNIVARIATE or the QNTLDEF option in other procedures to control the method used to estimate quantiles


You can use proc univariate to test of normality by adding normal option

If p value is less than 5% you can reject null hypotheses( data is normaly distributed) and conclude

your data is not normally distributed at 5% significant level

output outsumdata4a WARNING: Data set WORK.SUMDATA4A was not replaced because new file is incomplete 


 Bivariate continues VS. categorical


Proc univariate datasashelp.cars normal plot

 class make Hint:you can use by instead of class but must sort before by make, unless your data sorted before

 Note: when ever you use by statement in SAS you must sort your data by thah variable before ,unless your data sorted before

PCTLPREprefixes

specifies one or more prefixes to create the variable names for the variables that contain the PCTLPTS percentiles. To save the same percentiles for more than one analysis variable, specify a list of prefixes. The order of the prefixes corresponds to the order of the analysis variables in the VAR statement. The PCTLPRE and PCTLPTS options must be used together.

PROC UNIVARIATE generates new variable names by using the prefix and the percentile values. If the specified percentile is an integer, the variable name is simply the prefix followed by the value. If the specified value is not an integer, an underscore replaces the decimal point in the variable name, and decimal values are truncated to one decimal place. For example, the following statements create the variables pwid20, pwid33_3, pwid66_6, and pwid80 for the 20th, 33.33rd, 66.67th, and 80th percentiles of Width, respectively:


univariate analysis

for categorical variables 

title

proc freq datatest

runfor each variable (numeric and categorical variables) it shows frequancy of each level

it is not suumerizing of num,erical varable just shows us the freq of each unique value

                            If an observation has a missing value for a variable in a TABLES request, 

                            by default PROC FREQ does not include that observation in the frequency or 

                            comulative freq and percentage , it shows missing values at the bottom of tables. 

                           Also by default, PROC FREQ does not include observations 

                             with missing values in the computation of percentages and statistics. 

                            The procedure displays the number of missing observations below each table.



 unlike proc means, summary and proc univariate that we use var statement to specify varables for 

 thoes procedures, in proc freq we use table statement :recall that we show counts of each value in table

proc freq datatest

 table gender

run

proc print datatest run

By default proc freq ignores missing values in calculation percent and cumulative percent

unless you youse missing option in table statement

proc freq datatest

 table gendermissing   we use forward slash to add option in proc freq

                           treats missing values as a valid level for all TABLES variables. 

                           Displays missing levels in frequency and crosstabulation tables 

                            and includes them in computations of percentages and statistics.


proc freq DATA  SASHELP.CARS NLEVELS NLEVELS :Displays the number of levels for all TABLES variables

TABLE SEX SMOKING_STATUSit is two univariate analysis ,

i.e creating two tables one for sex, the other one for SMOKING_STATUS

IF you want to get two-way table youu need to put star between sex and

SMOKING_STATUS


Bivariate

Cathegorical Vs. Chategorical

assume you are working as a data scientist at Bell and you want to see that is there any associtaion between

type of cancelletion( online or calling) and churn ( cancelling the agreement and switch to anothrer provider like Bell, ...)

this is hypotheses testing ( you may design an  AB testing for answering this problem)

Null hypothese: there is no assocition between churn and type of canceletion

since both variables are categorical vbariables we run biviate anlysis for categorical vs. categorical

fgor this problem we need to use chi_squre test in proc freq that i will teach you



Bivariate Analysis:


  Continuous Vs. Continuous   : For Visulaization scatter plot,line chart( specially if you have time data)

                                For test of independence: pearson correlation or spearman or Kendal tau or ... coorelation

and Simple linear regression ...in SAS by using proc corr, proc reg

            

  Categorical Vs. Categorical : For summaraization: contingency table (two-way table) in SAS by using proc freq and put  between the name of thoes two variables in tabel statement

                                For visualization :stacked bar chart,Grouped bar chart,...

                                For test of independence:chi-square test

  Continuous Vs. Categorical  : For summaraization:gropup by categorical column and aggragte for numerical column in SAS by using proc means,summary, or univariate and use name of continuous variable in var statement and name of categorical variable in class statement

                                For visualization: Grouped box plot,...

                                For test of independence :1) if categorical column has only two levels :t-test in SAS by using proc ttest ( if all assumption are met)

                                                          2) if categorical column has more than two levels: ANOVA in SAS by using proc glm or Anova( if all assumption are met)






CHI SQURE


We use Chi squre test for finding if two categorical varaibles are independent from each other or not.


The Null and Alternate Hypotheses

we are interested in knowing if there is a relationship between two categorical variables or not. 

In order to do so, we would have to use the Chi-squared test. But first, let's state our null hypothesis and 

the alternative hypothesis.


The Null Hypotheses:H0:There is no statistically significant relationship between thoes two categorical variables.


The Alternate Hypotheses:Ha:There is a statistically significant relationship between  thoes two categorical variables.


After running chi square test and checking the conditions of it 

if p-value is less than 5% you can reject null hypotheses and get this conclusion that 

"There is some correlation between thoes  two variables at 0.05 significant level

or

There is a statistically significant relationship between  thoes two categorical variables at 0.05 significant level."


( Note : if p-value is bigger than 5% yopui fail tyo reject null hypothese so you can conclude that there is not enough evidence top reject the null hypothese at 5% signifiucant level)

The Chi-square test statistic can be used if the following conditions are satisfied:

1.N, the total frequency, should be reasonably large, say greater than 50 or 30.

2. The sample observations should be independent. This implies that no individual item should be included twice or more 

in the sample.

3. No expected frequencies should be small. Small is a relative term. Preferably each expected frequencies 

should be larger than 10 but in any case not less than 5.

 

TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING"

TABLE  SMOKING_STATUS  SEX it gives us contingency table(two-way table) here rowsSMOKING_STATUS  and columnsSEX

TABLE  SMOKING_STATUS  SEXnopercent norow nocol


 If you specify the LIST option in the TABLES statement, 

  PROC FREQ displays multiway tables in a list format 

  rather than as crosstabulation tables. 

 The LIST option displays the entire multiway table in one column, 

 instead of displaying a separate two-way table for each stratum.


TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING"

PROC FREQ DATASASHELP.Heart

TABLE SEX  SMOKING_STATUSmissing


TITLE "Hypoteses testing:SEX Vs. SMOKING"

PROC FREQ DATASASHELP.Heart

TABLE SEX  SMOKING_STATUS chi_square

if p-value is less than 5% and all assumption are satisfied( I will talk about assumption in advanced SAS course)

we conclude that there is a statistcally association between thoes two varaibles at 5% significant level

 in this example the p-value<0.0001> p-value<0.055% that is less than 5% so you can conclude that:

there is a statistcally association between sex and Smoking_Status at 5% significant level 


Data type: Numeric and categoric

Note: Numeric variable can be continuous or discrete , we can treat discrete variable like 

ordinal variable ( categorical) specially if it has limited options ( like number of children in family : 

it is numeric( discreate), so you can apply mean ,standard deviation and so on but you can treat it like categorical variable

for example draw bar chart for it)


Univariate Analysis:

  For categorical columns variabels:

                      for summarization:   frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)    

                            for visualization :  Bar chart or Pie chart

  For Continuous columns  :

                            for summarization:   Central tendency(mean, median, mode,...) and

                                                 measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max), 

and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...

( by using proc means,summary, or univariate in SAS)

                            for visualization:   Histogram , Box plot,density,...



TITLE "BIVARIATE  ANALYSIS:SEX AND SMOKING_status"

TABLE  SEX  SMOKING_STATUS chisq

Since the p-value is less than 5% we reject null hypotheses and conclude:

There is a statistically significant relationship between  thoes two categorical variables at 0.05 significant level


PROC FREQ DATASASHELP.Heart

TABLE  SEX  SMOKING_STATUS NOCOL  NOROW NOPERCENT NOCUM MISSING chisq

Here since p-value is less than 5%0.05 we reject null hypothese and get conclusion that there is statistaclly association between sex and

smoking_status at 5% significant level


summary until this point

since p-value < 0.05 we can reject normality at 5% significant level 


univariate analysis for categorical variable

proc freq datasashelp.heart

table BP_Status Chol_Status DeathCause Sex Smoking_Status Status Weight_Status missing

since p-value < 0.05 we can reject independency at 5% significant level and conclude:

There is statitically association between BP_Status and Chl_status at 5% significant level


Segmentation and changing values

 value ageg format names can not end with numbers

value $sexfmt Hint:To create a format for a character variable, the format name must begin with $:

                     you can not use number at the end of the format name 

don't forget  at the end of any format

WARNING: 100% of the cells have expected counts less

          than 5. Chi-Square may not be a valid test.

Note: you can create several tables in one proc freq

 If you specify the LIST option in the TABLES statement, 

  PROC FREQ displays multiway tables in a list format 

  rather than as cross tabulation tables. 

 The LIST option displays the entire multiway table in one column, 

 instead of displaying a separate two-way table for each stratum.

table gender missing   for one table adding list option doesn't have any effect

table modelmissing  we can say table make modelmissing list instead

table makemodelmissing list




Vertical calculation (column wise ) vs. Horizontal calculation(row wise)



Proc means datatest MAXDEC 2  sum MEAN var STD CV Q1  MEDIAN Q3 p95 p99 maxThe coefficient of variation (CV)STDmean 

Run

It is vertical calculation i.e. SAS it is going to calculate mean, std, ... for each columns

Proc means datatest MAXDEC 2  sum MEAN  

Run


Horizontal calculationrow wise

,

so far we did summeraization for each variables columns

what if we want to summerize for each rowsobs as an example assume each rows is student id and scores in different courses for that student i.e you have a student id, 

SQL_score, SAS_score, Python_score and so on and you want to find average of score for each studen if we use proc means, or suumary or univarita

we will get average of all students in each course,

or you have a customer Id at 1st column and sales amount for each month of 2021 in 12 columns and we want to

calculate sum of sales and average of sales for each customer

CONNECT YOUR PERMANENT SAS LIB USING LIBNAME STATEMENT


proc means datatest_ sum meanrunIt is vertical calculation i.e. SAS it is going to calculate sum and mean for each columns

As an example if columns are course names and rows are students you can find average of score for each cours 

HOW WE CAN DO HORIZONTAL CALUCULATION as an example how we can calculate average of score for each students

 recall that SAS select obs by obs and for each obs excutes codes line by line


with missing values


Plus Operator Vs. sum function


TOTAL_SCORET1+T2+T3+T4+T5ASSIGNMENT STATEMENT and Defining new variable and ASSIGNMENT STATEMENT

AVG_SCORETOTAL_SCORE5

TOTAL_SCORE1T1+T2+T3+T4+T5 ASSIGNMENT STATEMENTPlus operator GIVES YOU MISSING VALUE IF YOU HAVE MISSING IN YOUR DATA

TOTAL_SCORE2SUM (T1,T2,T3,T4,T5)ASSIGNMENT STATEMENT IGNORES MISSING VALUES  Sum() function ignores missing values

TOTAL_SCORE3SUM (OF T1-T5)ASSIGNMENT STATEMENTYOU CAN USE THIS METHOD FOR VARIBLES WITH THE SAME NAME AND DIFFEREN INDEX

In following data find the total amount and average amount that each customer bought in 2023 so far



 Interview question: how can we calculate the average of each row by ignoring missing values?



SAS FUNCTION  f(ARG1,ARG2,ARG3)

BY DEFAULT,SAS FUNCTIONS EXCLUDE (ignore) MISSING VALUES IN CALCULATION


TOTAL_SCORESUM(T1,T2,T3,T4,T5)sum function returns to us sum of all non-missing values

TOTAL_TESTN(T1,T2,T3,T4,T5)N function returns to us number of non-missing values

MISS_TESTNMISS(T1,T2,T3,T4,T5)NMISS function returns to us number of missing values

AVG_SCOREMEAN(T1,T2,T3,T4,T5)mean function returns to us avaerge of all non-missing values

MIN_SCOREMIN(T1,T2,T3,T4,T5)min function returns to us minimum of all non-missing values

MAX_SCOREMAX(T1,T2,T3,T4,T5)max function returns to us maximum of all non-missing values

STD_SCORESTD(T1,T2,T3,T4,T5)std function returns to us standard deviation of all non-missing values

TOTAL_SCORESUM(T1,T2,T3,T4,T5)sum function returns to us sum of all non-missing values

Number_of_non_missingN(T1,T2,T3,T4,T5)N function returns to us number of non-missing values

AVG_SCORETOTAL_SCORENumber_of_non_missing

AVG_SCORE2mean(T1,T2,T3,T4,T5)mean function returns to us avaerge of all non-missing values

TOTAL_SCORE1SUM(T1+T2+T3+T4+T5)

first it is going to calculate T1+T2+T3+T4+T5 (so if you have missing value you will get missing)then sum( sum here

doesn't have any effect)  

Note: use comma between arguments

TOTAL_SCORE2SUM(T1,T2,T3,T4,T5)

TOTAL_TESTN(T1,T2,T3,T4,T5)Number of non_missing values

MISS_TESTNMISS(T1,T2,T3,T4,T5)Number of  missing values



FYI: USING VARIABLE SHORTCUT


TOTAL_SCORESUM(of T1-T5)DON'T FORGET OF

TOTAL_SCORE_PLUS5SUM(OF T1-T5,5) OR TOTAL_SCORE_PLUS52SUM(T1,T2,T3,T4,T5,5)

XSUM(T1-T5)HERE WE HAVE JUST ONE ARGUMENT THAT IS THE RESULT OF T1 MINUS T5 

YSUM(T1-T5,5)HERE WE HAVE TWO ARGUMENTS THAT ARE THE RESULT OF T1 MINUS T5 AND 5 THAT MEANS T1-T5+5 

ZT1-T5+5


 USING VARIABLE SHORT CUT

DATA TEST_04

SET TEST4

TOTAL_TsSUM(OF T1-T3)SUM(OF T1-T3)sum(T1, T2, T3)

TOTAL_SCORESUM(OF T1--T3)SUM(OF T1--T3)sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them

Num_TsN(OF T1-T3)

TOTAL_numberN(OF T1--T3) MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3


 USING VARIABLE SHORT CUT

TOTAL_TsSUM(OF T1-T3)SUM(OF T1-T3)sum(T1, T2, T3)

TOTAL_SCORESUM(OF T1--T3)SUM(OF T1--T3)sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them

Total_Ts_and_Sssum(of T1-T3,of s1-s2)

Num_TsN(OF T1-T3)

TOTAL_numberN(OF T1--T3) MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3


xSUM(OF T1--G3)Any numeric variable between T1 and G3


评论

此博客中的热门博文

Don't be afraid, this time the two-way foil is in the hands of civilization别怕,这一次二向箔掌握在文明的手中

Zelensky signed a deal to sell the country泽连斯基签下协议,把国家给卖了,以后乌克兰的事,华尔街说了算

Marianne Situ的帖子