Data Mining – Advanced Statistical Modeling: Teaching Notes for Day 1

A. Teaching Objectives:

 

Knowledge

Upon completion of today’s session, the student will:

 

K-11(1): Understand feature selection

K-11(2): Understand feature transformation

K-15(1): Understand supervised and unsupervised learning

K-15(2): Be familiar with the theoretical background and mechanisms of data mining and machine learning

 

Skill                                                                

S-12(1): Able to obtain descriptive statistics

S-12(2): Able to conduct data analysis for general patterns prior to modeling

 

B. Teaching Topics (Including Examples):

 

Introduction to Predictive Analytics

1)    Understand predictive models, supervised and unsupervised learning, and data mining and machine learning concepts and use cases. [App F-K-15(1)] [45 min]

2)    Learn the process for conducting predictive analytics step by step. [App F-K-15(1, 2)] [45 min]

3)    Learn the different types of predictive models and business backgrounds. [App F-K-11(1, 2)] [App F-S-12(1, 2)] [45 min]

4)    Normality test, the test of Homogeneity of Variances, and Transformation [App F-S-11(1, 3, 4)] [120 min]

5)  Classroom exercises [45 min]

6)    Homework exercises [300 min]


 

Understand predictive models, supervised and unsupervised learning, and data mining and machine learning concepts and use cases. [App F-K-15(1)] [45 min]

 

The ultimate goal of analytics solutions is to provide improved decision support so that humans can make better judgments with the help of relevant data. There are five types of decision support capability, each used to answer a different kind of question:

      Planning analytics: What is our plan?

      Descriptive analytics: What happened?

      Diagnostic analytics: Why did it happen?

      Predictive analytics: What will happen next?

      Prescriptive analytics: What should be done about it? [App F-K-15(2)]

 

 

Knowledge Discovery in Databases (KDD): KDD refers to the overall process of discovering useful and valuable knowledge from large volumes of data. It is a broader and more comprehensive term that encompasses several stages, including data selection, preprocessing, transformation, data mining, interpretation, and evaluation. KDD includes all the steps involved in transforming raw data into actionable insights. The goal of KDD is to turn data into knowledge that can drive decision-making and provide a deeper understanding of the underlying patterns and relationships within the data. [App F-K-15(2)]

 

DIKW pyramid: The so-called “DIKW pyramid” is also known as the “DIKW hierarchy”, the “wisdom hierarchy”, the “knowledge hierarchy”, the “information hierarchy”, and the “data pyramid”. It refers to a class of models representing the purported functional and/or structural relationships between data, information, knowledge, and wisdom. Most versions of the DIKW model reference all four components, and some include additional ones. Apart from a pyramid and a hierarchy, the DIKW model can also be characterized as a chain, a continuum, or a framework.

Data → Information → Knowledge → Wisdom

 

Data mining: Data mining is a specific step within the KDD process. It focuses on the application of algorithms and techniques to extract patterns, trends, and useful information from data. It involves using various techniques from statistics, machine learning, and database management to discover hidden relationships, trends, and patterns within the data. It's a subset of the larger KDD process that concentrates on the analysis and modeling aspects of knowledge discovery. Data mining techniques are used to identify meaningful patterns that might not be immediately apparent, which can lead to valuable insights and predictions.

The goal of data mining is to uncover valuable information that can aid in decision-making, prediction, and optimization. This process typically involves several steps:

1.     Data Collection: Gathering relevant and potentially useful data from various sources.

2.     Data Preprocessing: Cleaning, transforming, and preparing the data for analysis. This step includes handling missing values, dealing with outliers, and ensuring data quality.

3.     Exploratory Data Analysis (EDA): Exploring the data visually and statistically to gain a better understanding of its characteristics and relationships.

4.     Feature Selection/Engineering: Choosing or creating the most relevant and informative features (variables) that will be used in the analysis.

5.     Model Selection: Selecting appropriate algorithms or methods for analyzing the data, such as decision trees, clustering, regression, neural networks, etc.

6.     Model Training: Applying the selected models to the data and allowing them to learn the underlying patterns.

7.     Pattern Discovery: Applying data mining techniques to find interesting and previously unknown patterns, trends, or relationships within the data.

8.     Model Evaluation: Assessing the performance of the models and the quality of the discovered patterns. This may involve techniques such as cross-validation and metrics like accuracy, precision, and recall (see the sketch after this list).

9.     Interpretation and Application: Interpreting the results of the analysis and using the insights gained to make informed decisions, predictions, or recommendations.

10.  Deployment: Integrating the findings into real-world applications, systems, or processes to drive improvements and optimize outcomes.
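To make step 8 concrete, here is a minimal sketch of model evaluation with cross-validation, assuming scikit-learn; the dataset and model are placeholders chosen only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # example labeled dataset
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation with several evaluation metrics
scores = cross_validate(model, X, y, cv=5,
                        scoring=["accuracy", "precision", "recall"])
for metric in ["test_accuracy", "test_precision", "test_recall"]:
    print(metric, round(scores[metric].mean(), 3))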

Data mining is widely used across various industries and domains, including finance, marketing, healthcare, retail, telecommunications, and more. It helps businesses and organizations make sense of their data, uncover actionable insights, and gain a competitive advantage by making informed decisions based on data-driven knowledge. [App F-K-15(2)]

 

 

 

 

 

 

Cross-Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a widely used data mining process model consisting of six phases. It is a cyclical process that provides a structured approach to data mining. The phases are not strictly sequential: projects often need to backtrack to previous phases and repeat steps.

The six phases of CRISP-DM include:

1) Business Understanding: In this phase, you frame the business problem, set the business goals, and identify the important factors that will help achieve those goals.

2) Data Understanding: In this phase, the data is collected and loaded into the analysis tool (if one is used). Each dataset is documented with its source, location, how it was acquired, and any issues encountered. The data is visualized and queried to check its completeness.

3) Data Preparation: This step involves selecting the appropriate data, cleaning, constructing attributes from data, and integrating data from multiple databases.

4) Modeling: This phase covers selecting a data mining technique (such as a decision tree), generating a test design for evaluating the selected model, building models from the dataset, and reviewing the built model with experts to discuss the results.

5) Evaluation: This phase determines the degree to which the resulting model meets the business requirements. Evaluation can be done by testing the model on real applications. The model is reviewed for any mistakes or steps that should be repeated.

6) Deployment: In this phase, a deployment plan is made, a strategy to monitor and maintain the data mining results is formed, final reports are written, and the whole process is reviewed to check for mistakes and determine whether any steps should be repeated. [App F-K-15(2)]

 

CRISP-DM stages and tasks: (diagram not included in these notes)

 

 SEMMA (Sample, Explore, Modify, Model, Assess)

SEMMA is another data mining methodology developed by SAS Institute. The acronym SEMMA stands for sample, explore, modify, model, and assess.

SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictor variables, create a model from those variables, and check its accuracy. SEMMA is also driven by a highly iterative cycle.

Steps in SEMMA

1.   Sample: In this step, a sample that represents the full dataset is extracted. Sampling reduces computational cost and processing time.

2.   Explore: The data is explored for outliers and anomalies to gain a better understanding of it. The data is visually checked to find trends and groupings.

3.   Modify: In this step, the data is manipulated (for example, grouped and subgrouped) with the model to be built kept in focus.

4.   Model: Based on the explorations and modifications, the models that explain the patterns in data are constructed.

5.   Assess: The usefulness and reliability of the constructed model are assessed in this step. Testing of the model against real data is done here.

Both SEMMA and CRISP-DM support the knowledge discovery process. Once models are built, they are deployed for business and research use. [App F-K-15(2)]

 


Variable

A variable is any characteristic, number, or quantity that can be measured, counted, or observed for each record.

There may be many variables in a study, and they may play different roles. Variables can be classified as either explanatory or response variables.

Response Variable

       A variable about which the researcher is posing the question; it may also be called the outcome, the target, or the dependent variable.

Target = outcome = dependent = response variable

Explanatory Variable

       A variable that serves to explain changes in the response; it may also be called a predictor, independent variable, or input.

Predictor = input = independent = explanatory variable

Note! A variable can serve as an explanatory variable in one study but a response variable in another.

As an example, in the dataset below, if you want to predict Income for some observation, Income is the target (output/response/dependent variable) and the other variables are predictors (independent variables/inputs).

If you instead want to predict Education for some observation, Education becomes the target (output/response/dependent variable) and the other variables are predictors (independent variables/inputs); see the sketch after the table.

ID | Name       | Age | Height | Weight | Telephone     | Postal code | Education              | Experience | Income
---|------------|-----|--------|--------|---------------|-------------|------------------------|------------|-------
1  | Shailendra | 45  | 186    | 75     | 647-902-5678  | L2J 5C6     | MSc in Economics       | 8 years    | $100 k
2  | Adel       | 81  | 205    | 870    | (416)888-3456 | E4C 6J2     | PMP                    | 10+ years  | $120 k
3  | Feng       | NA  | 160    | 50     | 4168001623    | H7W 4R5     | BA in computer science | 8 years    | NA
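A minimal sketch of this idea in pandas, assuming the table above is loaded into a DataFrame (only a few columns are reproduced here): the same data yields different X/y splits depending on which column is chosen as the target.

import pandas as pd

# A few columns from the table above, entered by hand for illustration
df = pd.DataFrame({
    "Name": ["Shailendra", "Adel", "Feng"],
    "Age": [45, 81, None],          # NA recorded as missing
    "Experience": ["8 years", "10+ years", "8 years"],
    "Income_k": [100, 120, None],   # income in $k; NA recorded as missing
})

# Predicting Income: Income is the target, the rest are predictors
y = df["Income_k"]
X = df.drop(columns=["Income_k"])

# Predicting Age instead: now Age is the target
y2 = df["Age"]
X2 = df.drop(columns=["Age"])

print(X.columns.tolist(), "->", y.name)
print(X2.columns.tolist(), "->", y2.name)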

 

 Types of attributes:

The two types of variables are Qualitative (Categorical) and Quantitative (Numerical).

Qualitative (Categorical)

Data that serves the function of a name only. Categorical values may be:

      Ordinal – where the names imply levels with a hierarchy or order of preference (it is possible to apply a rank order), i.e., there is a clear ordering of the categories, e.g., level of education, day of the week

      Nominal – where no hierarchy is implied, e.g. purpose of loan, weather, eye color

 

      Binary – where there are two choices, e.g. Male and Female

For example, for coding purposes, you may assign Male as 0 and Female as 1. The numbers 0 and 1 stand only for the two categories; there is no order between them.

Note: A binary attribute is a special case of a nominal attribute.

Quantitative

Data that takes on numerical values with a meaningful measure of distance between them. Quantitative values can be:

      Continuous – or “measured”: a quantitative variable whose possible values form some interval of numbers, e.g., the weight or height of a person, temperature.

      Discrete – or “counted”: not continuous; can be counted, e.g., the number of people in attendance.

Interval and ratio scales are two levels of measurement that describe attributes on quantitative scales. An interval scale is one where there is order and the difference between two values is meaningful. Examples of interval variables include temperature (Fahrenheit or Celsius), pH, SAT score (200-800), and credit score (300-850). A ratio variable has all the properties of an interval variable and also has a clear definition of 0.0 (a true zero point). A good example is the Kelvin scale, which has an absolute zero. Price, height, weight, and money quantities are other common ratio variables.
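A minimal sketch of how these attribute types can be represented in pandas; the column names and values are made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "education": ["BSc", "MSc", "PhD", "BSc"],          # ordinal
    "eye_color": ["brown", "blue", "green", "brown"],   # nominal
    "is_male": [0, 1, 1, 0],                            # binary, coded 0/1
    "temperature_c": [21.5, 19.0, 23.1, 20.2],          # continuous (interval)
    "num_children": [0, 2, 1, 3],                       # discrete (ratio)
})

# Ordinal: an ordered categorical preserves the hierarchy
df["education"] = pd.Categorical(
    df["education"], categories=["BSc", "MSc", "PhD"], ordered=True)

# Nominal: an unordered categorical implies no rank
df["eye_color"] = df["eye_color"].astype("category")

print(df.dtypes)
print(df["education"] > "BSc")   # rank comparisons work only when ordered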

  

1.   Supervised Learning, in which the training data is labeled with the correct answers, e.g., “spam” or “ham.” The two most common types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).

2.    Unsupervised learning, in which we are given a collection of unlabeled data, which we wish to analyze and discover patterns within. The two most important examples are dimension reduction and clustering (see the sketch after this list).

3.   Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based on the outcomes of past actions.
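A minimal sketch contrasting the first two settings on the same data, assuming scikit-learn; the iris dataset is just a convenient example.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are used during training
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("classification accuracy:", round(clf.score(X, y), 3))

# Unsupervised: only X is given; the algorithm discovers structure itself
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("discovered cluster labels:", km.labels_[:10])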

 

Learn the process for conducting predictive analytics step by step.

[App F-K-15(1, 2)][45min]

  

The first step in any data science project: exploring the data. 


American mathematician John Tukey promoted the use of EDA in his book Exploratory Data Analysis (Pearson). Tukey emphasized that analysts first need to explore the data for potential research questions before jumping into confirming the answers with hypothesis testing and inferential statistics.

EDA is often likened to “interviewing” the data; it’s a time for the analyst to get to know it and learn about what interesting things it has to say.

In practice, EDA includes checking the assumptions required for model fitting and hypothesis testing, handling missing values, detecting outliers, and transforming variables as needed.

Typical activities during EDA (a code sketch follows this list):

·  Trends

·  Distributions

·  Mean

·  Median

·  Outliers

·  Variance

·  Correlations

·  Hypothesis testing

·  Visual exploration
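A minimal EDA sketch with pandas covering several of the items above; the file name and the "income" column are placeholders.

import pandas as pd

df = pd.read_csv("data.csv")           # hypothetical input file

print(df.describe())                   # counts, means, std, quartiles
print(df.median(numeric_only=True))    # medians
print(df.corr(numeric_only=True))      # pairwise correlations

# A simple outlier screen: rows more than 3 standard deviations
# from the mean of a numeric column ("income" is illustrative)
col = df["income"]
print(df[(col - col.mean()).abs() > 3 * col.std()])

# Visual exploration: histogram of the distribution (needs matplotlib)
col.plot(kind="hist", bins=30)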

 

Learn the different types of predictive models and business backgrounds. [App F-K-11(1, 2)] [App F-S-12(1, 2)]

 Supervised Learning

 

Linear Regression Model (useful for a numeric target):

Linear regression models are among the most basic statistical techniques and are widely used for predictive analysis. Simply put, linear regression is a statistical method for studying the relationship between a dependent variable Y, which is continuous, and at least one independent variable X.

If you select only one x (one feature) to use to predict the target (y), it is SLR (simple linear regression).

  

As an example, suppose you want to predict weight based on height:

Weight = β₀ + β₁ × Height, or Y = β₀ + β₁X

Or, if you want to predict the price of bitcoin based on time: Price = β₀ + β₁ × Time

If you select more than one x (feature) to predict the target (y), it is MLR (multiple linear regression).

As an example, suppose you want to predict weight based on height and age:

Weight = β₀ + β₁ × Height + β₂ × Age, or Y = β₀ + β₁X₁ + β₂X₂

 

For example, suppose we have a dataset with the weight and height of seven individuals (dataset table not included in these notes).

Let weight be the predictor (independent) variable and let height be the response (dependent/target) variable.

 

In this example, the line of best fit is:

height = 32.783 + 0.2001*(weight)

 

A residual is simply the distance between an actual data value and the value predicted by the regression line of best fit.

 

 

Notice that some of the residuals are positive and some are negative. If we add up all of the residuals, they sum to zero. This is because linear regression finds the line that minimizes the total squared residuals, so the line passes through the middle of the data, with some of the data points lying above the line and some lying below it.

 

Y = β₀ + β₁X

β₀: intercept

β₁: slope

For the best-fit line, the sum of all residuals (errors) equals zero and the SSE (sum of squared errors) is at its minimum.

The sum of squared errors, typically abbreviated SSE or SSe, refers to the residual sum of squares of a regression: the sum of the squared deviations of the actual values from the predicted values within the sample used for estimation, SSE = Σ(yᵢ − ŷᵢ)². This is also called the least squares estimate: the regression coefficients are chosen such that this sum is minimal (i.e., its derivative is zero).
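A minimal sketch verifying both properties on made-up data with NumPy: the residuals of a least-squares fit with an intercept sum to (numerically) zero, and the fitted line minimizes the SSE.

import numpy as np

# Made-up weight and height values for seven individuals
weight = np.array([64, 70, 75, 80, 85, 90, 100], dtype=float)
height = np.array([45.5, 47.0, 48.2, 48.0, 50.1, 50.5, 53.0])

# Fit height = b0 + b1 * weight by least squares
b1, b0 = np.polyfit(weight, height, deg=1)   # slope first, then intercept
predicted = b0 + b1 * weight
residuals = height - predicted

print("sum of residuals:", residuals.sum())  # ~0 up to rounding error
print("SSE:", (residuals ** 2).sum())        # minimized by this (b0, b1)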

 

Note: You will learn more about Linear Regression in this course, and you will learn about other regression algorithms in the machine learning course.

 

What is logistic regression?

Logistic regression is a classification algorithm. It is used to predict a binary outcome based on a set of independent variables.

 

 

 

      Logistic regression is one of the most popular machine learning algorithms and falls under supervised learning. It is used for predicting a categorical dependent variable from a given set of independent variables.

      Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of returning exactly 0 or 1, it gives a probability that lies between 0 and 1.

 

Odds = p / (1 − p) = the chance of the positive outcome divided by the chance of the negative outcome.

The log (base e) of the odds is called the logit: logit(p) = ln(p / (1 − p)).
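A minimal sketch, assuming scikit-learn, showing the probability output and the corresponding odds and logit; the data is made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied (X) vs. whether the exam was passed (y)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
p = model.predict_proba([[4.5]])[0, 1]            # P(pass | 4.5 hours)
print("probability:", round(p, 3))                # lies between 0 and 1
print("odds:", round(p / (1 - p), 3))             # p / (1 - p)
print("logit:", round(np.log(p / (1 - p)), 3))    # log-odds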

 


Note: You will learn more about Logistic Regression in this course.

 

Decision Tree: Decision Trees (DTs) are a supervised learning technique that predicts the value of a response by learning decision rules derived from the features. They can be used in both regression and classification contexts, which is why they are sometimes also referred to as Classification and Regression Trees (CART). A code sketch follows the lists below.

      They are powerful algorithms, capable of fitting complex datasets.

      One of the primary benefits of using a DT/CART is that, by construction, it produces interpretable if-then-else decision rule sets, which are akin to graphical flowcharts.

      Their main disadvantage lies in the fact that they are often uncompetitive with other supervised techniques such as support vector machines or deep neural networks in terms of prediction accuracy.

      However, they can become extremely competitive when used in an ensemble method such as bootstrap aggregation (“bagging”, as in Random Forests) or boosting.

      In quantitative finance, ensembles of DT/CART models are used for forecasting, either of future asset prices/directions or of the liquidity of certain instruments.

 

Advantages of Decision Tree

· Easy to understand

· Can handle both categorical and numerical data

· Robust to outliers

· Decision trees frequently perform well on imbalanced data.

Disadvantages of Decision Tree

· Prone to Overfitting

· Need to be careful with parameter tuning

· Can create biased learned trees if some classes dominate
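As mentioned above, a minimal sketch assuming scikit-learn: fitting a small, interpretable tree and printing its if-then-else decision rules.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree: limiting max_depth guards against overfitting
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned decision rules as an if-then-else flowchart
print(export_text(tree, feature_names=load_iris().feature_names))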

 

 

 


 

Market Basket Analysis (MBA) is one of the most common and useful types of data analysis for marketing and retailing.  The purpose of market basket analysis is to determine what products customers purchase together.  It takes its name from the idea of customers throwing all their purchases into a shopping cart (a "market basket") during grocery shopping.  Knowing what products people purchase as a group can be very helpful to a retailer or to any other company.

Suppose you are running an e-commerce site; it would be nice to know which combinations of products tend to be bought at the same time. Then you can ‘recommend’ a list of products to your customers based on what they have already bought or are about to buy. This is basically what many of us call a ‘Recommendation Engine’ today.

One of the ways to find this out is to use an algorithm called ‘Association Rules’, often referred to as ‘Market Basket Analysis’.

This is the purpose of market basket analysis: to improve the effectiveness of marketing and sales tactics using customer data already available to the company.
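A minimal sketch of association rule mining, assuming the mlxtend library (pip install mlxtend); the baskets are made up, with each row one transaction in one-hot form.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Five made-up baskets, one-hot encoded: True = item is in the basket
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]],
    columns=["bread", "butter", "jam"],
).astype(bool)

# Frequent itemsets appearing in at least 40% of baskets
itemsets = apriori(baskets, min_support=0.4, use_colnames=True)

# Rules such as {bread} -> {butter}, filtered by confidence
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])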

 

 

 


 


Normality test, the test of Homogeneity of Variances, and Transformation [App F-S-11(1, 3, 4)]

 

Please work through the notebooks "Day 1.1" and "Day 1.2" to enhance your understanding. A short standalone sketch of the same tests follows.
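A minimal sketch, assuming SciPy: a Shapiro-Wilk normality test, Levene's test for homogeneity of variances, and a log transformation; the data is simulated.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=0.0, sigma=0.6, size=100)  # right-skewed
group_b = rng.lognormal(mean=0.2, sigma=0.6, size=100)

# Shapiro-Wilk normality test: a small p-value rejects normality
stat, p = stats.shapiro(group_a)
print("Shapiro-Wilk p-value:", round(p, 4))

# Levene's test for homogeneity of variances across groups
stat, p = stats.levene(group_a, group_b)
print("Levene p-value:", round(p, 4))

# A log transformation often makes right-skewed data closer to normal
stat, p = stats.shapiro(np.log(group_a))
print("Shapiro-Wilk after log transform:", round(p, 4))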

 

 

C.  Class Exercises and Solutions: 

a) Class Exercises: [C-Challenging] [App F-K-11(1, 2)] [App F-S-12(1, 2)]

Q1) When should you consider using a linear versus a logistic regression model?

Q2) What are supervised and unsupervised learning?

Q3) Give some business examples of applying predictive models.

b) Solutions to Class Exercises

Q1) When the response is continuous, use a linear regression model; when the response is binary, use logistic regression.

Q2) Within the field of machine learning, there are two main types of tasks: supervised, and unsupervised. The main difference between the two types is that supervised learning is done using a ground truth, or in other words, we have prior knowledge of what the output values for our samples should be. Therefore, the goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data. Unsupervised learning, on the other hand, does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points.

Supervised learning is typically done in the context of classification, when we want to map input to output labels, or regression, when we want to map input to a continuous output. Common algorithms in supervised learning include logistic regression, naive Bayes, support vector machines, artificial neural networks, and random forests.

The most common tasks within unsupervised learning are clustering, representation learning, and density estimation. In all of these cases, we wish to learn the inherent structure of our data without using explicitly provided labels. Some common algorithms include k-means clustering, principal component analysis, and autoencoders. Since no labels are provided, there is no specific way to compare model performance in most unsupervised learning methods.

Q3) Some real-life business examples of predictive analytics include:

·         Amazon uses predictive analysis to recommend products and services to users based on their past behavior.

·         Harley Davidson uses predictive models to target potential customers, attract leads, and close deals.

·         A call center can predict how many support calls they will receive per hour.

·         A shoe store can calculate how much inventory they should keep on hand in order to meet demand during a particular sales period.

 

D.  Homework Exercises and Solutions: 

a) Homework Exercises: [B- Basic] [App F-K-11(1, 2)] [App F-S-12(1, 2)]

Q1) What is predictive modeling?

Q2) What are decision trees and regression trees?

b) Solutions to homework
