sas note1

SAS programs are constructed from two basic building blocks:

1.DATA Steps (starts with data keyword)

2.PROC Steps (starts with proc keyword)

DATA and PROC steps are made up of statements. A step

may have as few as one or as many as hundreds of

statements.

We normally use data step for importing a dataset or creating a new dataset,

or getting a subset of data set, or manipulating a dataset

and we use proc step to print, summerize , or visualize data

note: Sas is case sensitive only for values inside its data but for keywords and identifires( name of variables,

name of tables, ...) is not case sensitive

note: you have to put semicolon at the end of each statementl

for execution: highlight the block of code that has to have run keyword at the end

then select running person icon or F3 or run tab>submit

Note: after any execution you must check the log even if you get the result

Data test1test1 is the name of table(dataset) that will be saved in work library

cA+b SAS is case sensitive only for values inside a table not for others

run

i used the above data step to craete a data now for print that data I will use proc print

any dataset in SAS has two leveles for naming like Sashelp.CARS , first we must mention the name of libraray

then dot after that name of data set

Note: you can skip typing the name of library only if it is work libraray i.e. EUROPEANCARWORK.EUROPEANCAR

NOTE: Work library is the only temporary libraray that means all you have there will be disappear after closing

current session

temp.data1 > the name of library is temp but it is permanent library( the only temporary library is work)

Get subset of sashelp.cars that includes only European cars

DATA EUROPEANCARWe start a data step to create a dataset that its name is EUROPEANCAR , EUROPEANCAR is output data

SET Sashelp.CARSSashelp.CARS is input data that I want to get subset of it

WHERE Origin "Europe" I added a condition that is Origin "Europe"

Note: SAS is case sensetive only for values of tables

you should put run statement at the end and highlight them togather then hit the running person icine for executing

syntax rules

SAS Program is a series of SAS statement

Most Statement not all of them begin with an identifying keyword

SAS statement must end with semicolon > so, for calculating number of statements count number of semicolons

These statement execute in order i.e. observation by observation ,and for ecah observation line by line (statement by statement),

That means SAS takes first observation and runs it all the way

through the DATA step (line by line) before looping back to pick up

the second observation.

SAS terminolgy

data setSAS Table

Variable (var) columns

observation(obs)rows(recrods)

data values (val)intersectional between Rows and Column

NAMING CONVENTIONS FOR SAS TABLES and VARABILES

1. START WITH A LETTER OR AN UNDERSCORE

2. CAN CONTAIN ONLY LETTERS,NUMBERS,AND UNDERSCORES

3. CAN BE 32 CHARACTERS OR FEWER

NAMING CONVENTIONS FOR SAS LIBRARY(libref)

1. START WITH A LETTER OR AN UNDERSCORE

2. ONLY LETTERS,NUMBER,UNDERSCORE

3. 8 CHARACTERS OR FEWER

Type of libraries

SAS LIBARY:we have two types of library:

1.PERMANENT AND 2.TEMPORARY SAS LIBRARY

The only temporary library is work> all of the other libraries are permanent

The WORK library is the only temporary storage location for SAS data sets.

It is also the default library. If you create a SAS data set without specifying a library, SAS

will put it in the WORK library, and then delete it when you end your session.

TYPE OF DATA VALUES

TYPE OF DATA VALUES : 1.Numeric or 2.chategoric:

If a variable contains letters or special characters, it must be a character(categoric)_ variable.

However, if it contains only numbers, then it may be numeric or character.

i.e. for numeric values you are allowed to assign a numeric or charector data type

For example you can store value100 to numeric variable and charactor variable

but you can store value"Hamid2" only in cahractor variable

If YOU STORTE "HAMID2" IN NUMERiC VARAIBLE YOU WILL GET MISSING VALUES

IN SAS we consider two types for numbers :

1.standard like 348 0r 102.5 that doesn't have any special charectors and

2.non-standard numbers like 25% or 02-12-2022 or $152 or 3,245

for reading non-standard numbers we must specify the format of them(we call it informat and I will talk about that)

NUMERIC :1. Standard numbers 345678432, 156.8 , -267.9

2. NON STANDARD NUMERIC DATA : have $, %, coma( like 2,564,790) ,dates,..

NOTE:dates in SAS are NON-STANDARD numeric and if you read them with the proper informat

they are stored as a number of days between that date and

date base (date base is 1st of Janurary 1960)

as an example if SAS read 10-Jan-1960 with proper foramt it stores it with value equals to 9 ((10-Jan-1960) - (1-Jan-1960))

CHARACTER: values like "ABCDE","hamid rajaee","5 years" can store only in charector varibles

numeric value: contains only numbers, and sometimes a decimal point andor minus sign. When they are read into

a SAS data set, numeric values are stored in the floating-point format native to the operating environment.

Nonstandard numeric values can contain other characters as numbers you must use formatted input to enable SAS to read them.

nonstandard data is data that can be read only with the aid of informats. Examples of nonstandard data include

numeric values that contain commas, dollar signs, blanks, date and time values and hexadecimal and binary values.

Note: If you store charector value or non-standard numeric value in numeric variable without specifying the proper format

SAS will store missing valuees there

input id name $ age we need to specify charector varibles using $ after them

for inserting values in SAS use datalines or cards satement

if you want to insert data maually type datalines or cards then semicolon then

values, after inserting values semcolon must be in the next line if you put in the last line that obs won't be inserted

this is simple list input that values are separted by space and have default length(8 bytes)

and we don't have non-standard numeric. If these conditions are not satisfied you should use column input,

formated input or mixed input that I will teach you in later

Note: in simple list input when you insert values for missing values you must type dot

and when you print them you will see dot for numeric missing values and blank for charector missing values

in the basic way of inserting VALUES(simple list input)

1.you must type dot for any missing values and

2.all values must be separated by at least one space( 2 spaces or more is the same as one space

3.All chrector values must have atmost 8 chrectors( the rest of that won't be saved in simple list input)

runin the resultoutput missing values will be seen as a blank for chrectors and as a dot for numbers

but be notice that when you insert them in the simple list input you must type dot for both of them

input id $ name $ age you can store numbers in numeric or cahrector varibles

but you can't stor charector in numeric varibles . If you do that you will get missing values

cards or datalines are used to specify that you want to insert some values in the table

input id name age you can store numbers in numeric or cahrector varibles

but you can't stor charector in numeric varibles . If you do that you will get missing values

cards or datalines are used to specify that you want to insert some values in the table

Length of values

Character Length 1–32,767 bytes(Case sensitive) Default 8 bytes

( for each charector you need 1 byte)

Numeric Length 3-8 bytes Default 8 bytes(16 digits)

since it is simple list input the deafult length is 8 bytes so you can store up to 8 charectors for chrector values,

The length of a numeric variable lies between 3 and 8

bytes. It means SAS can store a numeric value from 1 to

16 digits.

any character needs 1 byte to store as an example to

stroe Hamid we need at least 5 bytes and for storing Hamid_Rajaee you need at least 12 bytes.

___________________________________________________________________

Hint: 1)sas program is a series of sas statements

2)Most sas statement begins with an identifying key word like data,set,where,proc,..

all sas statements must end with semicolon()

, hint : if running of your program takes long time go to icone !(break) in above toolbar and hit that

SAS reads observation by observation and executes line by line for each observation

comment:

Comments are colored green

Note:you should put your comment between asterisk and semicolon or between forward slash asterisk and asterisk forward slash

your comment These comments cannot contain internal semicolons or unmatched quotation marks.

your comment These comments can contain semicolons and unmatched quotation marks.

Tip: When using comments within a macro definition or to hide text from the SAS macro facility,

use this style comment

or In the Microsoft Windows operating environment, if you use the Enhanced Editor,

you can comment out a block of code by highlighting the block and then pressing CTRL and (forward slash) together.

To uncomment a block of code, highlight the block and press CTRL and SHIFT and (forward slash) together.

Two basic building blocks of SAS program

the program contains 2 steps:

1)data step

2)Proc step

we use proc step to craete a report or graph , ...

SAS statement usually begins with a keyword and end with a semicolon > each semicolon one statement

Every time Check the log to make sure

the program run without errors or warnings

DATA EUROPEANCARwe created (writing) a temporary data set that is EUROPEANCAR or work.EUROPEANCAR

SET Sashelp.CARSSashelp.CARS is input data set(we reading permanent data set that its name is cars and it

is availbale in sashelp libraray)

WHERE Origin 'Europe' 'Europe'"Europe"

run

in above data step Sashelp.CARS is input data set and EUROPEANCAR or WORK.EUROPEANCAR are output data set

Browsing Data Portion and descriptor portion

Browsing Data Portion

Proc print dataEUROPEANCAR

run

Browsing the Descriptor Portion

proc contents dataEUROPEANCAR be noticed that we have s at the end of contents

ceating Library

creating libraray

SAS datasets can be temporary or permanent.

Temporary SAS datasets only exist during the current SAS session.

Permanent SAS datasets are saved to a location on the computer or server and exist after exiting SAS.

Unless otherwise specified to be permanent, SAS considers all datasets to be temporary

data1work.data1

There are a few guidelines to keep in mind when you create the name of your library.

Library names(libref name):

1.are limited to eight characters.

2.must begin with a letter or underscore.

3.can contain only letters, numbers, or underscores. Blanks are not allowed.

create your library :1) by coding or 2) by using GUI( Graphical user interface):

before it create a folder in the server or your computer

using GUI:You can also create a new SAS library by opening

the New Library window from the Libraries section of the navigation pane in SAS Studio:

icone new library in up toolbar or Explorer window >Right click on Library > New): give the name for example hmd,

enable at start up,

choose your path and ok

I want to create a subset of sashelp.cars in my library

by coding:go to any drive for example.C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\My sas library

and create a folder for example My sas library then copy its address and come back to sas and write the keyword libname

and give any name you want as a libref for example Hmd and then

paste just the address in single or double quotation

highlite the code and F3 or choose the runnig person icone

For any SAS session, You need to submit a LIBNAME statement (only once during each SAS session)

The libref remains in effect for the duration of that session.

But the data that you stored in that library exist for ever until you delete them,

but in any session you should say where that library is located .

Note:You can assign different names for libref for the same

location and use it i.e.

The libref can change, but the member name( file name dataset names), and directory must the same.

(A libref is just a nickname that corresponds to the location of a SAS data library. )

as an example for the same folder I will assign another name as well

libname h2 "C:\Users\hmdra\Desktop\Metro College_Teaching Materials\SAS\libreary for practice"

LIBNAME statement

A LIBNAME statement consists of

1.the keyword LIBNAME.

2.the libref name that you want to use.

(that have at most 8 charectos and starts with letter or underscore and contain leters, underscore and numbers)

3.the pathname that represents the physical location of the library and wrapped in sigle or double qoutation marks.

4.a semicolon.

after every run check log

It created a temporary data set that its name is euro inside work libary eurowork.euro

If you have a data in work library that you need for future you should better to save as in permanent library

to get a save as(copy and paste) euro dataset

It created a permanent data set that its name is euro inside h1 libary

proc print datah1.euro if You don't mention which data SAS will execute for the most recent data

data h1.Asiaoutput data is h1.asia

set Sashelp.carsinput data is sashelp.cars

where origin "Asia" be notice that unlike SQL that you are allowed to use only single qoutes to wrap the charector values, In SAS you can use either single or double qutes

proc print in proc if you skip mentioning name of data ,that proc run for the most recent data that you worked on it

Using SAS for calculation

Missing Data Values :

A Character missing value is displayed as a blank

A numeric missing value is displayed as a period (.)

SAS Statements are free format:

One or more blanks or special characters can be used to separate words

They can begin and end in any column(i.e. you can put spaces at the begining)

A single statement can span multiple lines

Several statement can be on the same line

but becarefull that each staement must be ended with semicolon

In any session you should specify where that library is located .You can assign different name for libref for

same location and use it i.e.

The libref can change, but the member name( file name), and directory must the same.

(A libref is just a nickname that corresponds to the location of a SAS data library. )

craete a data in SAS (manually ie. by inserting values) :

step1: assigning a name for data(table) (After data keyword)

stpe2: assigning names for variables(columns) and assigning data type for them (after input keyword)

step3: inserting values (after cards or datalines keyword)

data my_1st_dataoutput data is my_1st_datawork.my_1st_data that is temporary data

input name $ age gender $ weight height we use input statement to assign variable's name column's nameheader :

variables'name:

1)are limited to 32 characters.

2)must begin with a letter or underscore.

3)can contain only letters, numbers, or underscores. Blanks are not allowed

To create a character column, specify a $ sign righ after the sapace after

column name.

Note: unlike SQL that we use comma to separate columns names in SAS we use sapace

when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.

Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns

For inserting data you should type cards or datalines then semicolon and in new rows type observations after

the last observations(rows) go to new line and type semicolon

for inserting values use cards or datalines

you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted

So important: after any execution in SAS you must check the log to make sure that result is reliable or not

proc printrunin any procedure if you skip calrifying name of data , that proc excute for the last data that is executed

in result output characters are on left side, numeric are on the right side

Note: for numeric variable if you insert charector you will get missing values

I stored that privouse data set in work library, Since it is temporary libraray after closing the SAS program ,

this data is goinig to be disapear. If you want to save it permanently you have to create that data in permanent

library or getting save as like

Data Hamid.myDataHamid.myData is output data that I want to save in in Hamid libraray

note: you can save it with the same name like Hamid.my_1st_data

set my_1st_datamy_1st_data is input data that exits in work libraray and I want to get a copy of it

Run I created new data set in hamid library and I named it mydata and I save as all data that I have in my_1st_data

input id name $ family $ height several sapces is the same as one space

you can save numbers as a charector as well

Brows Data descriptor

for inserting any non-standard numbers in your dataset( table) you must use informat

inforamt Vs. format

for reading and writing non-standard numeric values you can use inforamt( for writinginserting values)

and format( for readingprinting values)

input id name $ family $ age dob several sapces is the same as one space

informat dob ddmmyy8.informat for reading perpuse (inserting in table)

Note:right after each format_name put dot

note: some availble format :

14.05.85 or 140585 >ddmmyy8. or ddmmyy.

17032013 >ddmmyy10.

03172013 >mmddyy10.

17MAR13 >Date.

17MAR2013 >Date9.

you can find more by typing date format in help window( use help tab)

SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS

For reading date to sas(Informat) we should specify format of date

Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and

3rd 2 digit is year

after checking a log you will find that this data is created successfully

but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960

proc print datamytestrun

we use informat for reading and we use format for writing

you can use any other date format

you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values

input id name $ family $ age dob several sapces is the same as one space

informat dob ddmmyy8.informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step

format dob date9.

if you define format in data step it will saved with it but you can override whenever you want

The most 7 important proc:

1)proc print to brwos data portion

2)proc contents to brows desciptor portion

3)proc freq to summarize chractor variables: create a table that gives us frequency of each level( unique values)

and chi-square test

4)proc means to summarize numerical variables

5)proc univariate to summarize numerical variables with more information

and test of normality

6)proc sql to use sql syntax

7)proc sort to sort data and deduplicate data

we talk about them and other procedures during this course and advanced SAS programming

Note: we have more procedures like proc import for importing data, proc export for exporting data,

proc ttest, proc Anova, proc reg, ...

input name $ age gender $ weight height we use input statement to assign variable's name column's nameheader :

variables'name:

1)are limited to 32 characters.

2)must begin with a letter or underscore.

3)can contain only letters, numbers, or underscores. Blanks are not allowed

To create a character column, specify a $ sign righ after the sapace after

column name.

Note: unlike SQL that we use comma to separate columns names in SAS we use sapace

when you insert values in simple list input separater is space(at least 1 space) and Any missing data must be indicated with a period.

Note:recall that when you print a table for missing values you will see period for numeric columns and blank for categorical columns

For inserting data you should type cards or datalines then semicolon and in new rows type observations after

the last observations(rows) go to new line and type semicolon

for inserting values use cards or datalines

you must put semicolon after the last obs in "new line" if you put semicolon at the end of obs thet one won't be inserted

proc printrunif you skip naming data, SAS conduct that procedure on the most recent data

Descriptive statistic ( summarizing existig data)(part1)

For each variable it shows frequency of each level(frequency of each unique values)

proc freq datamy_1st_datarun

proc freq datamy_1st_data

table genderunlike other procedurs that we specify name of variable by using var statement in proc freq we use table since we looking for table that shows us frequency of each levels

run

proc univariate datamy_1st_datarunproc univariate by default returns more information compare to proc means and you can draw a grapgh with it as well that I will teach you

in this version of SAS 1."as" in proc sql is not optional

and 2.you can not say [avrerage of weight]

from my_1st_data notice that my_1st_data is a SAS dataset and exits in work SAS library

Reacall that:

order in SQL select statement:

select ....

from ....

where ....

group by ....

having ....

order by ....

rermeber

"Some Female Workers Go Home Ontime"

Sorting data

Proc sort datamy_1st_data it will sort the original data unless you will specify output(i will talk about that)

By agesorting by age deafaultascending

proc print datamy_1st_datarun

Note that in SAS missing values are considered less than any value that you can assume

i.e. in SAS unlike R and Python, missing values consider less than any values that you have

Proc sort datamy_1st_data

By descending agein SAS unlike SQL 1.we use complete word of descending 2.before variable name

Proc sort datamy_1st_data

By desending agedescendingdesendingdecending

input id $ name $ family $ height several spaces is the same as one space

FYI

Label (part1)

LABLE STATEMENT

LABEL TO VARIBLE TO BE MORE DESCRIPTIVE

CAN BE UP TO 256 CHARACTERs can conatain any charector even space

if you want to get labels in print you need to use lable option in proc statement

if you assign lable in data step it will saved with data ( you can overwrite later if you want)

but if you assign lable in [proc step it won't saved with data and only apear in that proc

in labels you don't need to put them inside qoutation marks

proc print datahamid.cars labelrun to see the labels in proc print statement you must type label

for displaying lable you have to type lable keyword in print statement then in other lines

define lable unless you define lable before

if you put a LABEL statement in a data step, the labels will be saved with the data

if you put a LABEL statement in a PROC, then the lables will be used only by that particular execution.

create dataset based on one existing dataset in SAS

subset

Ex. Create a data set that stores all records of cars data set in sashelp with originAsia

data test testwork.test and it is the name of output data

this statement create table and its name is test and its library by default is work,

if you want to save it permanetly you have to put a libref

befor test and dot like: Hamid.test but because we didn't do that here it is temporary data and saved in work library

set sashelp.cars sashelp.cars is the name of input data

that means gives data from library sashelp and table cars

where origin "Asia"

RunCheck log after any execution

proc printruncharacters are on left side, numeric are on the right side

Ex. create a subset of cars data set in sashelp that includes only Sedans from Europe

Ex. create a subset of cars data set in sashelp that includes only european cars

where origin"EUROPE"O observation means no observation meet that condithion because SAS in case sesitive for values

Note: SAS is case sensitive in terms of vaues inside its table

upcase

where upcase(origin)"ASIA"

originupcase(origin) If you want to upcase values of origing permanently

where upcase(origin)upcase("EuropE") or upcase(origin)"EUROPE" >upcase(origin) converts all of values of origin column to upercase temporary and compare them with upcase("eUropE") that is "EUROPE"

distinct (unique) values

If you want to make sure about unique valus of origin

proc freq datasashelp.cars

table origin

run so you can see that we ahve only "Europe", "Asia" , and "USA" and we don't have "EUROPE" or "europe" or "eUropE"

creating new variable(column)

you can create a new variable in data step by assinment to the new name

you can create a new variable in data step by just using a new name and assigning a value to it

Delete a variable

In data step you can delete a variable either by drop opthions or by using keep otions with other variables'name,

drop

Data test5 (dropinvoice type) don't put comma between variables' name

Data test5b (dropinvoice type) in SAS don't put comma between variables' name

keep

Data test5b (keepmake origin type newprice) in SAS don't put comma between variables' name

data data1(keep name gender new_age)this is keep option

keep name gender new_agethis is keep statement

TitleFOOTNOTES STATEMENTS

Up to 10 TITLEsFOOTNOTS STATEEMENTS

TITLE "THIS IS MY FIRST REPORT"

TITLE2 "THIS IS MONTHLY@#$%^& SALE REPORT"

TITLE3 "HERE IS ANOTHER TITLE"

FOOTNOTE "THIS IS MY FIRST FOOTNOTS"

FOOTNOTE2 "THIS IS MY SECOND FOOTNOTE"

FOOTNOTE3 "THIS IS MY THIRD FOOTNOTE"

PROC PRINT DATA data1

proc print

Note: proc print have more options that allowes you to choose featurs(columns)(variables) and

number of obs(rows) that you want to print

proc print with var satetments

PROC PRINT DATA SASHELP.CARS

VAR TYPE ORIGIN MAKE INVOICE I used Var statements in proc print to specify which features I want to see in print

RUN

Note: we can use drop or keep in data step when you want to select some varaibles durnig getting a copy of data

but in poc step we use var statement to slect some varaibles or table statement in proc freq

HOW TO CANCEL PREVIOUS TITle AND FOOTNOTES STATEMENT?

By making them empty or assigning new one to overwright previouse one

TITLE

FOOTNOTE

proc print with obs and firstobs option

PROC PRINT DATA SASHELP.CARS (obs5)it prints the first 5 obs (head of data)

PROC PRINT DATA SASHELP.CARS (obs10)run

PROC PRINT DATA SASHELP.CARS (obs5 firstobs3)

it will start from 3rd rows(obs)( that means you don't want to print the first 2 obs) and end at 5 obs so you will see 5- 23obs

PROC PRINT DATA SASHELP.CARS (obs10 firstobs3)

it will start from 3rd rows(obs)( that means you don't want to print the fist 2 obs) and end at 10th obs so you will see 10-(3-1)10-28 obs

Internal data:Craeting data set and inserting data manually

so far you saw simple list input that

1.seprater is space and

2.you must type dot for missing values and

3.you can not have non-standard numbers like date, 48%, $23000, or comma for numbers like 3,642,912

4.you can store charectors up to 8 charectors( default length)

for inserting any non-standard numbers in your dataset( table) you must use informat

inforamt Vs. format

for reading and writing non-standard numeric values you can use inforamt( for writinginserting values)

and format( for readingprinting values)

input id name $ family $ age dob several sapces is the same as one space

informat dob ddmmyy8.informat for isertingreading perpuse

Note:right after each format_name put dot

note: some availble format :

14.05.85 or 140585 >ddmmyy8. or ddmmyy.

17032013 >ddmmyy10.

03172013 >mmddyy10.

17MAR13 >Date.

17MAR2013 >Date9.

you can find more by typing date format in help window( use help tab)

SAS stores date values as numeric values and showes us differnce dayes from 1Jan1960(base date in SAS

For reading date to sas(Informat) we should specify format of date

Note:mmddyy6. is one type format for date that mention 1st 2 digit is month , 2nd 2 digit is day and

3rd 2 digit is year

after checking a log you will find that this data is created successfully

but when you print the data you will see some numbers for dob since SAS returns the number of days differences from 1st of Jan1960

proc print datamytestrun

we use informat for reading and we use format for writing

to see the result for dob in undrstanding form use format statement

proc print datamytest

format dob ddmmyy8.

run

you can use any other date format

you can find all avaible options for date format by going help then type date and then select About SAS Date, Time, and Datetime Values

input id name $ family $ age dob several sapces is the same as one space

informat dob ddmmyy8.informat for reading perpuse and format for writing perpuse for dates, you can assign format in data step or proc step

format dob date9.

if you define format in data step it will saved with it but you can override whenever you want

recall that in simple list input you are not allowed to have space inside values

Formatted input enables you to read data stored in non-standard form, such as dates, percent or

numbers with commas.

and you don't need to put dot for missing values when you insert data

but all values of each column must starts from the same position

Note: in SAS unlike Python index strarts from 1

input @1 dob mmddyy6. note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1

@8 fname $10. Read values based on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format

we will get 4 obs because of the empty line after 3rd obs

proc print runwe got dob as a number of day difference from 1st of Jan 1960

input @1 dob mmddyy6. note: @1 that means the position of 1st cahrector of its value is 1 i.e., values of this feature starts from column #1

@8 fname $10. Read values based on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format

FYI: You can mix foramtted input with simple list input

@8 fname $10. Read values based on starting (here @8 means column#8 in dataliensCards)and total length of its value10 $ means it is charector and dot after 10 is mandetory after any format

Non-standard numeric values eaxamples: 47%, $240, 11102022 for using mathematics functions like average, sum, subtraction,... we need to spaecify they are numeric values

for reading(insereting) non-standrd numbers we use informat and for writing ( printing) them we should use format

Hint:You can save format when you read data by using format statement in this situation you don't need to use fromat when you writing data

but you still can override that fromat

hint: when all of values for each column start from the same position you can use formatted input like the follwingh example

DROP DUPLICATE OBSERVATION IF EXIST( deduplicate data)

PROC SORT DATAtest2 OUTtest3 NODUPKEYhere input data is test2 and output data is test3 and both are stored at work library

BY _ALL_ if 2 obs( rows) have same values for all columns the second one will be consider duplicated

chech the Log > 1 observations with duplicate key values were deleted

FYI: Getting your data into SAS with proc IMPORT using GUI:

First Make sure that the data is closed

File> Import Data>Select the type of file for example CSV> Brows the location of file>

Selecte the library that you want> assign a name to it in member session>

select a locatio for proc import to save that and assign a nmae for it> finish

you need to check the log to make sure that you read data correctly

Note: By deafault sas uses only the first 20 rows to guess the datatypes of variables if

you get warning or error in the log you may neeed to assign a greater value to GuessingRow

Segmentation(binning)and Discretization

SEGMENTATION is the process of dividing continuous variable into segments( different groups)

Discretization :Process of transferring continuous variable into discrete variable

Methods for changing values for segmentatioin or changing apearence :

a)with proc format

b) with conditinal statement like if statement or case when

Proc format

Note:We can use proc format to define new format for writing(printing) data, we normally use it to change values or encoding,

or segmenation(binning),

or...

SO Important:

Proc format

in proc format we use value statement to assign a name for the format that we want to create and defining our format

Rules for defining a FORMAT name:

1.If you are creating a format for character variables,format name must start with $

i.e. use a dollar sign ($) as the first character of its name.

for numeric variables must begin with letters or underscore

2.The name must be a valid SAS name (Only letters, underscore and numbers are valid characters),

3.but Can’t end with a number and for charector variables must start with $

4.Max length: up to 32 charactor including $

so A numeric format name can be up to 32 characters. A character format name can be up to 31 characters

(because we have $ before that).

5.name can’t be the name of an existing format (can't overrride)

pay attention that the format name Can’t end with a number

Note: when you assign a name for format name in proc foramt it doesn't end with dot but after assigning its name

whenever you want to use it you must put dot at the end of its name to specify that is a format name

scenario:assume we have age column in our data and we want to do segmenation based on the following explenation:

If the person is youngr than 30 years old , it belongs to youth group,

If heshe is between 31 to 60, it belongs to midele aged group and

older than 60 ,it belongs to senior

Since age is numeric column ,format name for it must starts with letter or underscore

proc format

value agegroup agegroup is the name of this foramt that we want to use it for numerical values that's why

we don't have $ before its name

Note: after creating you are allowed to use this format for any numeric columns

Note:After assing a name to our format when ever we want to use this format we must put dot

after format name

low-30'youth' we are allowed to use low and high keyword note: <30 is not acceptable in proc format

format dob empdate date9. age agegroup. I used format named agegroup,use dot after name of format

Note: be noticed that in proc format we use low and high keyword and it returs missing values for missing values

Five advantages of using proc formats:

1.The orginal data will not been changed. You can use the original variable values in the analysis.

2.You can apply formats to both character and numerical variables.

3.You can use formats to merge categories and

to bin numeric variables.

4.you can apply a format to multiple variables in multiple data sets.

5.You can save formats in a permanent libref and use them across SAS sessions.

Note:You can define several format name in one proc format

I saved test data set that is temporary data set in work library as a permanent data set with the name of

test2 in permanent library that its libref is hamid with saved format

proc print dataHamid.test2run

proc contents datahamid.test2run be noticed that the data type of age and sex

If you want to have original value for age and created new formatted value you can do like below:

age_groupeagecreating new variable that its name is age_groupe and its values are the same as values in age column

format dob empdate date9. age_groupe ageg.

gendersexcreating new variable that its name is gender and its values are the same as values in sex column

age_groupage

format age_group ageg. gender $genderg.

b) with if statement: you can use if statement for segmentation(bining).with if you can define new variable as well

When you use if statement for binning(segnentation) you should be careful about 2 things

1.If you define new variable inside if statement without defing length of that before it :

SAS will assign length of that new variable equals to length of first value that you initilized to it

you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement

2.Be carefule that SAS consider missing values less that any numbers that you can think in comparision

i.e

. < 30000

.< -234567890876546789

I will tell you how handle these problems

senario: income less than 30000 is low income , between 30000 and 50000 (not include 50000) is mediom income and greater than or equal to 50000 is high income

lets create new variable (type) based on income

if income<30000 then type'low income'SAS is going to create new column that its name is type and length of that

is based on initial value

else if income<50000 then type'mediom income'else if 30000<income<50000 then type'mediom income'

if you use "if " instead of "else if" the second one overrride 1st one

unless you say if 30000<income<50000 then type'mediom income'

else type'high income'

As I mentioned we have two problem when we use if statement 1st length of values 2nd missing values

but we can fix them:

you have to put extra spaces for 1st one to make it as long as the bigest one or use length statement

if income<30000 then type'low income '

else if income<50000 then type'mediom income'

Hint: Be careful that SAS consider missing values less that any numbers that you can think

you can use one of the follwing methods to handle missing values easily

Coditional statement

in data step you can use if or where but in proc step you are allowed to use only where

Hint: If you want to add condition in procedure if statement doesn't work you should use where statement

I will talk about if Vs. Where in detail

PROC PRINT DATA SASHELP.CARS(OBS428 firstobs424) OBS tells SAS when to stop processing observations and firstobs means from which starts

here we have 428-424+15 obs in pint

LIMITING THE NUMBERS OF VARIABLES

_________________________________________________________________________

PROC PRINT DATA SASHELP.CARS(OBS20)

note: we can use if as well as where for filtering records. But there are some points that

you need to pay attention

WHERE

Where Vs. If

We are not allowed to use if in proc step

For FILTERING THE DATA VALUES in proc Use WHERE CLAUSE

some operators that you can use with where:

GE GREATER THAN OR EQUAL TO

GT GRATER THAN

LE LESS THAN OR EQUAL

LT LESS THAN

EQ EQUAL

NE NOT EQUAL

IN OPERATOR: for checking if that value belongs to that set (checking membership)

BETWEEN... AND... :to check if that value belongs to that range including end points

PROC PRINT DATA SASHELP.CARS(OBS5)SAS will print the first 5 observations that met follwing condition

WHERE Cylinders GE 6GE is the same as >

If Cylinders GE 6Note:unlike where statement ,you are not allowed to use if in proc step you can use if only in data step

WHERE Cylinders IN (4,8)Cylinders IN (4,8) is the same as Cylinders 4 or Cylinders 8

PROC PRINT DATA SASHELP.CARS(OBS10)it will be prit the 1st 10 obs that are satisfied following conditions

WHERE type IN ("SUV","Wagon")

PROC PRINT DATA SASHELP.CARS(OBS10)

WHERE type IN ("SUV","WAGON")

PROC PRINT DATA SASHELP.CARS(OBS20)

WHERE upcase(type) IN ("SUV","WAGON")

upcase (Wagon)WAGON

upcase (sedan)SEDAN

upcase(SUV)SUV

or you can check all leveles by using proc freq

informat vs. format

informat is for reading purpose, format is for writing purpose you can use format in data step or any proc step

READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS

DISPLAY THESE NON STANDARD NUMERIC :FORMATS

Recall that:

TYPES OF DATA : 1.NUMERIC 2. CHARACTER

NUMERIC:STANDARD AND NON STANDARD

NON STANDARD NUMERIC DATA: E.G DATE:14JAN 1970,75%, 1,000,000, $240

OBS tells SAS when to stop processing observations and firstobs means from which starts

here we have 7-3+15 obs in pint

format dob empdate date9.

proc print datatest2 noobsnoobs means no index

statement make sure that your data is sorted by that value before

Unlike variable's name that must be less than or equal 32 charectors starts with letters or underscore and

can conatain only letters, underscore and numbers

label can contain and start with any thing but must be less than or equal 256 charectors

PROC PRINT DATA sashelp.CARS (obs5)run

PROC PRINT DATA sashelp.CARS (obs5) LABEL

label type"@@##Type of car:"run

PROC PRINT DATA sashelp.CARS (obs5) LABEL

label type@@##Type of car:

if you want to use lable statement you have to type "lable" keyword in first statemen as well

label fnamefirst name lnamelast name

if you want to use lable statement you have to type "lable" keyword in first statemen as well

label fnamefirst name lnamelast name

if you want to print ecah group in one table you should use by statement in print

proc print datatest2 label

if you want to use lable statement you have to type "lable" keyword in first statemen as well

label fnamefirst name lnamelast name

by sex you might get somthing wrong

So important: when ever in SAS you want to use By statement you must sort the data by that variable first unless it was alredy sorted

if you want to use lable statement you have to type "lable" keyword in first statemen as well

var fname lname empdate income sex age

format empdate date9. age ageg. sex $genderg.

where sex'M' and age>20two conditions

by income to print different table for each income

Note: every time you use' "by" for any variable' you must sort by that variabe first unless the data was sorted before

look at the result with and without sort

label fnamefirst name lnamelast name

title 'this is an example'

footnote "created by hamid"

please check the log, although we got a table in results window, but it is not reliable because we have error in log

SORT PROCEDURES

BY DEFAULT IT IS ASCENDING(from smallest to greatest) :

Note:1.For chategorical variables First it is gana sort based on the first digit or charector:

missing values < 0-9 < A-Z < a-z ,

(in SAS upper case less then lower case and in alphebetic order for example A<a, Z<a A< B,a<b, A<a, B<a)

if the first digit or charector are the same it is gana check the second one and so on

Recall that SAS is case sensitive in terms of values

2.for numeric varaiblles it is going to sort based on their values ,missing comes first

INPUT NAME $ @@list input

Note:@@ means you may have more than one observation in atleast one line

Because we didn't mention the name of data, it is gana sort the most recent data that we excuted

and because I didn't use out option it is going to change the original data

proc sort datatest0 outtest0_1 when you are sorting data it is better than save output in different table because

1)normaly we don't like to change original table

2)it takes a lot of sources to shuffel and save in the same table

by descending nameunlike sql that we mention ascending or descending after variable name in short form

in SAS we must specify decending before any variable in complet form

In SAS we use Descending before any variable, in sql we use Descending after variable and

in short form(desc)

proc sort datasashelp.cars outtest2Input data is sashelp.cars output data is test2work.test2

by Decending type invoicedescendingDecending ,here it was sorted first by type in descending order , then for all observations in

the same type they

were sorted by invoice in ascending order(default is ascending)

by Decending type Decending invoicehere it was sorted first bt type in desceng order , then for all observation in same type they

were sorted by invoice in decending order

import data participant1.csv and save it as HAmid.Parti

nodupkey

you can use nodupkey in proc sort to drop observations that have same values for featurs that you are sorting by them

it will keep first one and drop other duplicated in that column

proc sort datahamid.parti nodupkey outhamid.parti2

by sexdeletes observations with duplicate BY values (here no sex duplicated)

hint: for duplicate value in sex keep first row and drop others

senorio : Find highest income for each level of sex in hamid.parti

Be carefull if you use nodupkey when you sort two variables SAS is going to consider one observation duplicated if it has same values for

thoes two varible with at least one of priviouse obs

PROC SORT DATAt2 nodupkey OUTt4

BY sex salary here SAS considers one obs is duplicated if sex and pai both are the same with priviouse obs

Recall that in simple list input :

1.we have space as a delimeter and (we can solve this problem by using delimeter in modified list input)

2.we don't have non_standard numbers (we can solve this problem by using informat in modified list input)

3.missing values encoding with dot, (we can solve this problem by using delimeter in modified list input)

4.and charectros values have a default length8 bytes(8 charectors)(

we can solve this problem by using length statement or informat in modified list input)

NON STANDARD NUMERIC

like : 98%, date, $678 or values that have comma 3,997,657

READ NON STANDARD NUMERIC INTO SAS AS NUMERIC DATA:INFORMATS

DISPLAY THESE NON STANDARD NUMERIC :FORMATS

INPUT ID AMOUNT DOLLAR12.4 IT is modified list input that we specify informat in front of variable

DOLLAR12.4 is informat for reading but for writing

if you want to see $ sign you should define format

here in data step or in proc step

Type casting

Converting variable types from character to numeric and numeric to character:

you can convert numeric to charector by using put()function and

you can convert charactor to numeric by using input()function if its values is convertable

put always convert from numeric to char and input always convert from char. to numeric

new_numinput(char,4.) I define new variable its name is new_num and convert character to numeric with 4 digits

new_charput(num,3.) I define new variable its name is new_char convert numeric to character

If the character variable char in the above example contains missing values or non-numeric data then the

value of new_num will be missing. When char contains non-numeric data, an 'invalid argument' note will be written

to the log

Hmw2.3 Example Universe Cars is surveying its customers as to their preferences for car colors.

They have information about the customer’s age, sex (coded as 1 for male and 2 for female),

annual income, and preferred car color (yellow, gray, blue, or white).

STATISTICS

APPLIED STATISTICS

1.DESCRIPTIVE STATISTICS (means summarizing existing data( previous and current data))

2.INFERENTIAL STATISTICS(Statistical inference is the process of making as estimate, prediction, or decision about a population based on sample data. )

3.PREDICTIVEPRESCRIPTIVE STATISTICS

###############################

Exploratory Data Analytics (EDA)

###############################

Analysts need first to explore the data for potential research questions before

jumping into confirming the answers with hypothesis testing and inferential statistics.

EDA is often likened to “interviewing” the data it’s a time for the analyst to

get to know it and learn about what interesting things it has to say.

As part of our interview, we’ll want to do the following:

1.Classify our variables as continuous, categorical, and find distribution of them

2. Summarize our variables using descriptive statistics

3. Visualize our variables using charts

EDA is a process to try to understand existing data by visualization, Summarization, data cleaning, and

testing hypothesises

#We select the method for summmrizing data and analysis and testing hypothesis base on data type

Assume you have a breast cancer data set and in it you have ID, age, weight, height, blood pressure

and cancer . Cancer is binary 0 an 1 (0 means not having breast cancer and 1 means having breast cancer)

we want to built a model to predict having cancer for new patients

So in this situation cancer is target and we call that target or dependent variable or output variable and

we call other columns predictors or independents or features or input variables

we have two data types numerical and categorical data.

Categorical data has two types :Nominal(like name, eye color) and Ordinal(like education level, day of week)

numerical data has two types continuous (that can accept decimal like height and weight ) and

discrete(like number of children, number of females, ...)

If your variable has only two levels we call it binary at we can consider it as a spcial case of nominal categorical variable

TWO TYPES OF DATA (VALUE)1.NUMERIC 2.CHARACTER

numeric data could be discrete or continuous

If your data is discrete with limited levels like number of childeren ( number of cylinder in cars),

you can treat it as a categorical variable as well i.e. you can find mean and varaince because it is

numeric and you can draw bar chart and pie chart as well

SAS STATITICS:1.CONTINUOUS 2.CATEGORCAL

CONTINUOUS: like: AGE,income,IQ,SALES,PRICE

Charector variables like: gender, name,purpose of loan, loan status

DESCRIBE ONLY one VARIBLE:UNIVARIATE DESCRIPTIVE ANALYSIS

DESCRIBE ONLY TWO VARIBLES:BIVARIATE DESCRIPTIVE ANALYSIS

DESCRIBE MORE THAN 2 VARIBLES:MULTIVARIATE DESCRIPTIVE ANALYSIS

Univariate Analysis:

For categorical columns variabels:

for summarization: frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)

for visualization : Bar chart or Pie chart

For numerical columns :

for summarization: Central tendency(mean, median, mode,...) and

measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max),

and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...

( by using proc means,summary, or univariate in SAS)

for visualization: Histogram , Box plot,density,...

https:online.stat.psu.edustat500

Continuous Variabels

WHAT PROCEDURE we USE FOR summerizing Numerical variable

1)Proc means 2)Proc summary 3)Proc univariate

Proc means 12

Proc means datatestby default proc means do summearization ( univarite) for all numeric variables and

Runproc means by deafult returns only N(number of non-missing values) , mean( average), Std Dev (standard deviation), minimumn and maximum with 7

decimal numbers

Proc means datatest maxdec2

Run

title " The five-number summary"

Proc means datatest min Q1 median Q3 Max run

Q1 is 1st quartile25th percentile that means the value that 25% of obs have value less than or equal to it

medianQ2 is 2nd quartile50th percentile that means the value that 50% of obs have value less than or equal to it

Q3 is 3rd quartile75th percentile that means the value that 75% of obs have value less than or equal to it

if you have a big distance between min and p1 but not between p1 and Q1 you have outliers and

if you have a big distance between p99 and max but not between Q3 and p99 you have outliers

note: we have some other methods for finding outliers like 1. mean and standard deviation( Z score)

2. Interqurtile Range or box plot(Tukey and extreme Tukey method)

3. LOF

that you will learn in statistic and data mining course

By default proc means do descripteve analysis for all continuouse varibles in data set

title " Describe one variable"

Proc means datatestdo not forget "s" at the end of proc means and proc contents

output outdb1.data_summary_Oct18th you can save the reult of proc means as a new table (dataset) by using output statement

Bivariate Analysis

in SAS base in bivariate analysis I just talk baout summerization, and in advanced SAS I will

teach you hypothese testing,

Continuous VS. Categorical with proc means

title "Bivariate Analysis"

title2 "Continues VS. Categorical"

title3 "DESCRIPTIVE OF Income Vs. Gender"

Proc means datatest maxdec2

var income after var list numeric variables abd after calss or by list categorical variable

class gender Bivariate continues VS. categorical

run

footnote "to get the idea about association between income and geder if exits"

title "Bivariate Analysis"

title2 "Continues VS. Categorical"

title3 "DESCRIPTIVE OF Income Vs. age_group"

footnote "to get the idea about association between income and age_group if exits"

title "Multi-variate Analysis"

proc means datatest maxdec2 NMISS MEAN STD CV MEDIAN p20

var age income summeraiztion for these two continuous varibales

class gender ageby using fomat I will vconvert age to categorical variable and use it for grouping

format age agefmt.

Hint: You can use by statement instead of class statement in this situation you will get separate tables for each level and be carefull if you want to use by you must sort by that variable first

the following code is just FYI

proc means datasashelp.baseball p1 p5 p10 p20 p25 p50 p75 p90 p95 p99 maxdec2

var nHits

In SAS in assignment if you want to assign a value of keyword to your own variable we specify that keyword on the left side of equal sign

and the name of your own varibvle on the right side of equal sign

in proc means for saving the result in one table you can use ouput statement

output outpercentiles_proc_means

p1 P_1_hamid P_1_hamid is the name of varible that i assigned value of p1 to it

proc means datatest maxdec2 mean clm maxdec2clm is confidence interval for population mean that we predict a range of mean for population( inferential statistic) by default at 95% confidenc

clm is confidence interval

Let’s assume that I will give you an exam with 30 marks, and I say that based on my experience your class average mark would be between 21.75 and 26.25 at 95% confidence

That means there is 5% possibly that your class average would be less than 21.75 or bigger than 26.25

If I want to predict your class average at 100% confidence I must say:

your class average mark would be between 0 and 30 at 100% confidence

proc means datatest maxdec2 mean clm alpha0.10 alphasignificant level0.10> confidence 1-alpha 0.90

clm is confidence interval that we predict a range of mean for population at 90% confidenc

The bigger confidence the wider interval

FYI:alpha is called signifiacnt level it is probabilty of Type I error(FP)

Proc means datatest MAXDEC 2 n nmiss mean std stderr cv lclm uclm median min max range Q1 Q3 qrange maxdec2

where VS. If

1. you can use where in proc step or data step but you can use if only in data step

i.e you are not allowed to use if in proc step

2.we have where option as well

where option vs. where statement

FYI: Proc summary 12

Proc SUMMARY and Proc MEANS are essentially the same procedure. Both procedures compute descriptive statistics.

The main difference concerns the default type of output they produce. Proc MEANS by default produces printed output in

the LISTING window or other open destination whereas Proc SUMMARY does not. Inclusion of the print option on the Proc SUMMARY

statement will output results to the output window.

Proc SUMMARY

When all variables in the data set are character the same output: a simple count of observations, is produced for each

procedure.However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables and

produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum).

Inclusion of a VAR statement in both Proc MEANS and Proc SUMMARY, produces output that contains exactly the same default

statistics.

proc summary datatest In proc suumary you must use print option to see the result or use output statement to save the result in one table

output outsumdata_For_Hamidcreating new data which name is sumdata_For_Hamid an availble in work library

you can save the result in permanent library if you want

Hint:you are not allowed to use if statement in procedures

proc print datasumdata3arunyou can see bigger number of type shows you more detailed in terms of classification

I classified all of observation based on gender and age so type 3 is classifying based on gender then age

class gender age I classified all of observation based on gender and age

here n is keyword and we assign the value of that to number_of_non_missing, In SAS keywords located on

the left side of equal sign

proc univarite

Note:In proc univarite you can get more information about continuous varaible by default , and unlike proc means and summary

1.You can use proc univarite to test of normality and to visualize data as well

2.you can create a graph

3.when you want to create output in proc univarite you must specify what _stat_ you want to save

4.we use pctlpts20 instead of p20 and so on and The PCTLPRE option must be specified to generate additional percentiles

5. you can use the PCTLDEF option in PROC UNIVARIATE or the QNTLDEF option in other procedures to control the method used to estimate quantiles

You can use proc univariate to test of normality by adding normal option

If p value is less than 5% you can reject null hypotheses( data is normaly distributed) and conclude

your data is not normally distributed at 5% significant level

output outsumdata4a WARNING: Data set WORK.SUMDATA4A was not replaced because new file is incomplete

Bivariate continues VS. categorical

Proc univariate datasashelp.cars normal plot

class make Hint:you can use by instead of class but must sort before by make, unless your data sorted before

Note: when ever you use by statement in SAS you must sort your data by thah variable before ,unless your data sorted before

PCTLPREprefixes

specifies one or more prefixes to create the variable names for the variables that contain the PCTLPTS percentiles. To save the same percentiles for more than one analysis variable, specify a list of prefixes. The order of the prefixes corresponds to the order of the analysis variables in the VAR statement. The PCTLPRE and PCTLPTS options must be used together.

PROC UNIVARIATE generates new variable names by using the prefix and the percentile values. If the specified percentile is an integer, the variable name is simply the prefix followed by the value. If the specified value is not an integer, an underscore replaces the decimal point in the variable name, and decimal values are truncated to one decimal place. For example, the following statements create the variables pwid20, pwid33_3, pwid66_6, and pwid80 for the 20th, 33.33rd, 66.67th, and 80th percentiles of Width, respectively:

univariate analysis

for categorical variables

title

proc freq datatest

runfor each variable (numeric and categorical variables) it shows frequancy of each level

it is not suumerizing of num,erical varable just shows us the freq of each unique value

If an observation has a missing value for a variable in a TABLES request,

by default PROC FREQ does not include that observation in the frequency or

comulative freq and percentage , it shows missing values at the bottom of tables.

Also by default, PROC FREQ does not include observations

with missing values in the computation of percentages and statistics.

The procedure displays the number of missing observations below each table.

unlike proc means, summary and proc univariate that we use var statement to specify varables for

thoes procedures, in proc freq we use table statement :recall that we show counts of each value in table

proc freq datatest

table gender

run

proc print datatest run

By default proc freq ignores missing values in calculation percent and cumulative percent

unless you youse missing option in table statement

proc freq datatest

table gendermissing we use forward slash to add option in proc freq

treats missing values as a valid level for all TABLES variables.

Displays missing levels in frequency and crosstabulation tables

and includes them in computations of percentages and statistics.

proc freq DATA SASHELP.CARS NLEVELS NLEVELS :Displays the number of levels for all TABLES variables

TABLE SEX SMOKING_STATUSit is two univariate analysis ,

i.e creating two tables one for sex, the other one for SMOKING_STATUS

IF you want to get two-way table youu need to put star between sex and

SMOKING_STATUS

Bivariate

Cathegorical Vs. Chategorical

assume you are working as a data scientist at Bell and you want to see that is there any associtaion between

type of cancelletion( online or calling) and churn ( cancelling the agreement and switch to anothrer provider like Bell, ...)

this is hypotheses testing ( you may design an AB testing for answering this problem)

Null hypothese: there is no assocition between churn and type of canceletion

since both variables are categorical vbariables we run biviate anlysis for categorical vs. categorical

fgor this problem we need to use chi_squre test in proc freq that i will teach you

Bivariate Analysis:

Continuous Vs. Continuous : For Visulaization scatter plot,line chart( specially if you have time data)

For test of independence: pearson correlation or spearman or Kendal tau or ... coorelation

and Simple linear regression ...in SAS by using proc corr, proc reg

Categorical Vs. Categorical : For summaraization: contingency table (two-way table) in SAS by using proc freq and put between the name of thoes two variables in tabel statement

For visualization :stacked bar chart,Grouped bar chart,...

For test of independence:chi-square test

Continuous Vs. Categorical : For summaraization:gropup by categorical column and aggragte for numerical column in SAS by using proc means,summary, or univariate and use name of continuous variable in var statement and name of categorical variable in class statement

For visualization: Grouped box plot,...

For test of independence :1) if categorical column has only two levels :t-test in SAS by using proc ttest ( if all assumption are met)

2) if categorical column has more than two levels: ANOVA in SAS by using proc glm or Anova( if all assumption are met)

CHI SQURE

We use Chi squre test for finding if two categorical varaibles are independent from each other or not.

The Null and Alternate Hypotheses

we are interested in knowing if there is a relationship between two categorical variables or not.

In order to do so, we would have to use the Chi-squared test. But first, let's state our null hypothesis and

the alternative hypothesis.

The Null Hypotheses:H0:There is no statistically significant relationship between thoes two categorical variables.

The Alternate Hypotheses:Ha:There is a statistically significant relationship between thoes two categorical variables.

After running chi square test and checking the conditions of it

if p-value is less than 5% you can reject null hypotheses and get this conclusion that

"There is some correlation between thoes two variables at 0.05 significant level

There is a statistically significant relationship between thoes two categorical variables at 0.05 significant level."

( Note : if p-value is bigger than 5% yopui fail tyo reject null hypothese so you can conclude that there is not enough evidence top reject the null hypothese at 5% signifiucant level)

The Chi-square test statistic can be used if the following conditions are satisfied:

1.N, the total frequency, should be reasonably large, say greater than 50 or 30.

2. The sample observations should be independent. This implies that no individual item should be included twice or more

in the sample.

3. No expected frequencies should be small. Small is a relative term. Preferably each expected frequencies

should be larger than 10 but in any case not less than 5.

TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING"

TABLE SMOKING_STATUS SEX it gives us contingency table(two-way table) here rowsSMOKING_STATUS and columnsSEX

TABLE SMOKING_STATUS SEXnopercent norow nocol

If you specify the LIST option in the TABLES statement,

PROC FREQ displays multiway tables in a list format

rather than as crosstabulation tables.

The LIST option displays the entire multiway table in one column,

instead of displaying a separate two-way table for each stratum.

TITLE "BIVARIATE DESCRIPTIVE ANALYSIS:SEX AND SMOKING"

PROC FREQ DATASASHELP.Heart

TABLE SEX SMOKING_STATUSmissing

TITLE "Hypoteses testing:SEX Vs. SMOKING"

PROC FREQ DATASASHELP.Heart

TABLE SEX SMOKING_STATUS chi_square

if p-value is less than 5% and all assumption are satisfied( I will talk about assumption in advanced SAS course)

we conclude that there is a statistcally association between thoes two varaibles at 5% significant level

in this example the p-value<0.0001> p-value<0.055% that is less than 5% so you can conclude that:

there is a statistcally association between sex and Smoking_Status at 5% significant level

Data type: Numeric and categoric

Note: Numeric variable can be continuous or discrete , we can treat discrete variable like

ordinal variable ( categorical) specially if it has limited options ( like number of children in family :

it is numeric( discreate), so you can apply mean ,standard deviation and so on but you can treat it like categorical variable

for example draw bar chart for it)

Univariate Analysis:

For categorical columns variabels:

for summarization: frequency, percentage, Mode, levels (unique values) ( by using proc freq in SAS)

for visualization : Bar chart or Pie chart

For Continuous columns :

for summarization: Central tendency(mean, median, mode,...) and

measue of position:the the five-number summary(min,Q1,median(Q2),Q3,Max),

and measure of dispresion such as standard deviation ,CV ( COEFFICIENT OF VARIATION),Variance ,IQR, ...

( by using proc means,summary, or univariate in SAS)

for visualization: Histogram , Box plot,density,...

TITLE "BIVARIATE ANALYSIS:SEX AND SMOKING_status"

TABLE SEX SMOKING_STATUS chisq

Since the p-value is less than 5% we reject null hypotheses and conclude:

There is a statistically significant relationship between thoes two categorical variables at 0.05 significant level

PROC FREQ DATASASHELP.Heart

TABLE SEX SMOKING_STATUS NOCOL NOROW NOPERCENT NOCUM MISSING chisq

Here since p-value is less than 5%0.05 we reject null hypothese and get conclusion that there is statistaclly association between sex and

smoking_status at 5% significant level

summary until this point

since p-value < 0.05 we can reject normality at 5% significant level

univariate analysis for categorical variable

proc freq datasashelp.heart

table BP_Status Chol_Status DeathCause Sex Smoking_Status Status Weight_Status missing

since p-value < 0.05 we can reject independency at 5% significant level and conclude:

There is statitically association between BP_Status and Chl_status at 5% significant level

Segmentation and changing values

value ageg format names can not end with numbers

value $sexfmt Hint:To create a format for a character variable, the format name must begin with $:

you can not use number at the end of the format name

don't forget at the end of any format

WARNING: 100% of the cells have expected counts less

than 5. Chi-Square may not be a valid test.

Note: you can create several tables in one proc freq

If you specify the LIST option in the TABLES statement,

PROC FREQ displays multiway tables in a list format

rather than as cross tabulation tables.

The LIST option displays the entire multiway table in one column,

instead of displaying a separate two-way table for each stratum.

table gender missing for one table adding list option doesn't have any effect

table modelmissing we can say table make modelmissing list instead

table makemodelmissing list

Vertical calculation (column wise ) vs. Horizontal calculation(row wise)

Proc means datatest MAXDEC 2 sum MEAN var STD CV Q1 MEDIAN Q3 p95 p99 maxThe coefficient of variation (CV)STDmean

Run

It is vertical calculation i.e. SAS it is going to calculate mean, std, ... for each columns

Proc means datatest MAXDEC 2 sum MEAN

Run

Horizontal calculationrow wise

so far we did summeraization for each variables columns

what if we want to summerize for each rowsobs as an example assume each rows is student id and scores in different courses for that student i.e you have a student id,

SQL_score, SAS_score, Python_score and so on and you want to find average of score for each studen if we use proc means, or suumary or univarita

we will get average of all students in each course,

or you have a customer Id at 1st column and sales amount for each month of 2021 in 12 columns and we want to

calculate sum of sales and average of sales for each customer

CONNECT YOUR PERMANENT SAS LIB USING LIBNAME STATEMENT

proc means datatest_ sum meanrunIt is vertical calculation i.e. SAS it is going to calculate sum and mean for each columns

As an example if columns are course names and rows are students you can find average of score for each cours

HOW WE CAN DO HORIZONTAL CALUCULATION as an example how we can calculate average of score for each students

recall that SAS select obs by obs and for each obs excutes codes line by line

with missing values

Plus Operator Vs. sum function

TOTAL_SCORET1+T2+T3+T4+T5ASSIGNMENT STATEMENT and Defining new variable and ASSIGNMENT STATEMENT

AVG_SCORETOTAL_SCORE5

TOTAL_SCORE1T1+T2+T3+T4+T5 ASSIGNMENT STATEMENTPlus operator GIVES YOU MISSING VALUE IF YOU HAVE MISSING IN YOUR DATA

TOTAL_SCORE2SUM (T1,T2,T3,T4,T5)ASSIGNMENT STATEMENT IGNORES MISSING VALUES Sum() function ignores missing values

TOTAL_SCORE3SUM (OF T1-T5)ASSIGNMENT STATEMENTYOU CAN USE THIS METHOD FOR VARIBLES WITH THE SAME NAME AND DIFFEREN INDEX

In following data find the total amount and average amount that each customer bought in 2023 so far

Interview question: how can we calculate the average of each row by ignoring missing values?

SAS FUNCTION f(ARG1,ARG2,ARG3)

BY DEFAULT,SAS FUNCTIONS EXCLUDE (ignore) MISSING VALUES IN CALCULATION

TOTAL_SCORESUM(T1,T2,T3,T4,T5)sum function returns to us sum of all non-missing values

TOTAL_TESTN(T1,T2,T3,T4,T5)N function returns to us number of non-missing values

MISS_TESTNMISS(T1,T2,T3,T4,T5)NMISS function returns to us number of missing values

AVG_SCOREMEAN(T1,T2,T3,T4,T5)mean function returns to us avaerge of all non-missing values

MIN_SCOREMIN(T1,T2,T3,T4,T5)min function returns to us minimum of all non-missing values

MAX_SCOREMAX(T1,T2,T3,T4,T5)max function returns to us maximum of all non-missing values

STD_SCORESTD(T1,T2,T3,T4,T5)std function returns to us standard deviation of all non-missing values

TOTAL_SCORESUM(T1,T2,T3,T4,T5)sum function returns to us sum of all non-missing values

Number_of_non_missingN(T1,T2,T3,T4,T5)N function returns to us number of non-missing values

AVG_SCORETOTAL_SCORENumber_of_non_missing

AVG_SCORE2mean(T1,T2,T3,T4,T5)mean function returns to us avaerge of all non-missing values

TOTAL_SCORE1SUM(T1+T2+T3+T4+T5)

first it is going to calculate T1+T2+T3+T4+T5 (so if you have missing value you will get missing)then sum( sum here

doesn't have any effect)

Note: use comma between arguments

TOTAL_SCORE2SUM(T1,T2,T3,T4,T5)

TOTAL_TESTN(T1,T2,T3,T4,T5)Number of non_missing values

MISS_TESTNMISS(T1,T2,T3,T4,T5)Number of missing values

FYI: USING VARIABLE SHORTCUT

TOTAL_SCORESUM(of T1-T5)DON'T FORGET OF

TOTAL_SCORE_PLUS5SUM(OF T1-T5,5) OR TOTAL_SCORE_PLUS52SUM(T1,T2,T3,T4,T5,5)

XSUM(T1-T5)HERE WE HAVE JUST ONE ARGUMENT THAT IS THE RESULT OF T1 MINUS T5

YSUM(T1-T5,5)HERE WE HAVE TWO ARGUMENTS THAT ARE THE RESULT OF T1 MINUS T5 AND 5 THAT MEANS T1-T5+5

ZT1-T5+5

USING VARIABLE SHORT CUT

DATA TEST_04

SET TEST4

TOTAL_TsSUM(OF T1-T3)SUM(OF T1-T3)sum(T1, T2, T3)

TOTAL_SCORESUM(OF T1--T3)SUM(OF T1--T3)sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them

Num_TsN(OF T1-T3)

TOTAL_numberN(OF T1--T3) MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3

USING VARIABLE SHORT CUT

TOTAL_TsSUM(OF T1-T3)SUM(OF T1-T3)sum(T1, T2, T3)

TOTAL_SCORESUM(OF T1--T3)SUM(OF T1--T3)sum(T1,S1, T2,S2, T3)i.e. sum of all columns between T1 and T3 including them

Total_Ts_and_Sssum(of T1-T3,of s1-s2)

Num_TsN(OF T1-T3)

TOTAL_numberN(OF T1--T3) MEANS ANY VARIABLE BETWEEN T1 AND T3 INCLUDE T1 AND T3

xSUM(OF T1--G3)Any numeric variable between T1 and G3

搜索此博客

Voice of China

sas note1

评论

发表评论

此博客中的热门博文

Don't be afraid, this time the two-way foil is in the hands of civilization别怕，这一次二向箔掌握在文明的手中

Zelensky signed a deal to sell the country泽连斯基签下协议，把国家给卖了，以后乌克兰的事，华尔街说了算

Marianne Situ的帖子