Data Mining - Advanced Statistical Modeling: Teaching Notes for Day 1
A. Teaching Objectives:
Upon completion of today’s session, the student will:
K-11(1): Understand feature selection
K-11(2): Understand feature transformation
K-15(1): Understand supervised and unsupervised learning
K-15(2): Be familiar with the theoretical background and mechanisms of data mining and machine learning
S-12(1): Be able to obtain descriptive statistics
S-12(2): Be able to conduct data analysis for general patterns prior to modeling
B. Teaching Topics (Including Examples):
1) Understand predictive models, supervised and unsupervised learning, and data mining and machine learning concepts and use cases. [App F-K-15(1)] [45 min]
2) Learn the process for conducting predictive analytics step by step. [App F-K-15(1, 2)] [45 min]
3) Learn the different types of predictive models and business backgrounds. [App F-K-11(1, 2)] [App F-S-12(1, 2)]
4) Normality test, the test of homogeneity of variances, and transformation. [App F-S-11(1, 3, 4)]
5) Homework exercises [300 minutes]
Understand predictive models, supervised and unsupervised learning, and data mining and machine learning concepts and use cases. [App F-K-15(1)] [45 min]
The ultimate goal of
analytics solutions is to give improved decision support so that humans can
make better judgments with the help of relevant data. There are five different
sorts of decision support capabilities, each of which is used to answer
different types of questions:
• Planning analytics: What is our plan?
• Descriptive analytics: What happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen next?
• Prescriptive analytics: What should be done about it? [App F-K-15(2)]
Knowledge Discovery
in Databases (KDD):
KDD refers to the overall process of discovering useful and valuable knowledge
from large volumes of data. It is a broader and more comprehensive term that
encompasses several stages, including data selection, preprocessing,
transformation, data mining, interpretation, and evaluation. KDD includes all
the steps involved in transforming raw data into actionable insights. The goal
of KDD is to turn data into knowledge that can drive decision-making and
provide a deeper understanding of the underlying patterns and relationships
within the data. [App F-K-15(2)]
DIKW pyramid: The so-called “DIKW pyramid” is also known as the “DIKW hierarchy”, the “wisdom hierarchy”, the “knowledge hierarchy”, the “information hierarchy”, and the “data pyramid”. It refers to a class of models for representing the purported functional and/or structural relationships between data, information, knowledge, and wisdom. Most versions of the “DIKW model” reference all four components, and some include additional ones. Apart from a pyramid and a “hierarchy”, the “DIKW model” can also be characterized as a “chain”, a continuum, or a “framework”.
Data → Information → Knowledge → Wisdom
Data mining: Data mining is a specific step within the
KDD process. It focuses on the application of algorithms and techniques to
extract patterns, trends, and useful information from data. It involves using
various techniques from statistics, machine learning, and database management
to discover hidden relationships, trends, and patterns within the data. It's a
subset of the larger KDD process that concentrates on the analysis and modeling
aspects of knowledge discovery. Data mining techniques are used to identify meaningful
patterns that might not be immediately apparent, which can lead to valuable
insights and predictions.
The goal of data
mining is to uncover valuable information that can aid in decision-making,
prediction, and optimization. This process typically involves several steps:
1. Data Collection: Gathering relevant and potentially useful data from various sources.
2. Data Preprocessing: Cleaning, transforming, and preparing the data for analysis. This step includes handling missing values, dealing with outliers, and ensuring data quality.
3. Exploratory Data Analysis (EDA): Exploring the data visually and statistically to gain a better understanding of its characteristics and relationships.
4. Feature Selection/Engineering: Choosing or creating the most relevant and informative features (variables) that will be used in the analysis.
5. Model Selection: Selecting appropriate algorithms or methods for analyzing the data, such as decision trees, clustering, regression, neural networks, etc.
6. Model Training: Applying the selected models to the data and allowing them to learn the underlying patterns.
7. Pattern Discovery: Applying data mining techniques to find interesting and previously unknown patterns, trends, or relationships within the data.
8. Model Evaluation: Assessing the performance of the models and the quality of the discovered patterns. This may involve techniques such as cross-validation and metrics like accuracy, precision, recall, etc.
9. Interpretation and Application: Interpreting the results of the analysis and using the insights gained to make informed decisions, predictions, or recommendations.
10. Deployment: Integrating the findings into real-world applications, systems, or processes to drive improvements and optimize outcomes.
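The steps above can be sketched end to end in code. The following is a minimal illustration using scikit-learn (assumed available) and its built-in Iris dataset; the particular model, scaler, and settings are placeholder choices for illustration, not prescribed by these notes.

```python
# Minimal sketch of the mining workflow: collect -> prepare -> model -> evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: gather data (here, a built-in sample dataset).
X, y = load_iris(return_X_y=True)

# Steps 4-6: prepare features and train a selected model; scaling stands in
# for the data-preparation step, a decision tree for model selection/training.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# Step 8: evaluate with cross-validation, as mentioned in the list above.
scores = cross_val_score(model, X, y, cv=5)
print(round(scores.mean(), 2))
```

In practice steps 9 and 10 (interpretation and deployment) happen outside a script like this, once the evaluated model meets the business goal.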
Data mining is widely
used across various industries and domains, including finance, marketing,
healthcare, retail, telecommunications, and more. It helps businesses and
organizations make sense of their data, uncover actionable insights, and gain a
competitive advantage by making informed decisions based on data-driven
knowledge. [App F-K-15(2)]
Cross-Industry
Standard Process for Data Mining (CRISP-DM)
CRISP-DM is a reliable data mining model consisting of six phases. It is a cyclical process that provides a structured approach to the data mining project. The phases are not strictly one-directional: backtracking to previous steps and repeating actions is often required.
The six phases of
CRISP-DM include:
1) Business Understanding: In this step, you frame the business problem, set the goals of the business, and identify the important factors that will help in achieving those goals.
2) Data Understanding: In this step, all of the data is collected and loaded into the tool (if one is being used). The data is catalogued with its source, location, how it was acquired, and any issues encountered. The data is visualized and queried to check its completeness.
3) Data Preparation: This step involves selecting the
appropriate data, cleaning, constructing attributes from data, and integrating
data from multiple databases.
4) Modeling: In this step, a data mining technique such as a decision tree is selected, a test design for evaluating the selected model is generated, models are built from the dataset, and the built model is assessed with experts to discuss the results.
5) Evaluation: This step will determine the degree to
which the resulting model meets the business requirements. Evaluation can be
done by testing the model on real applications. The model is reviewed for any
mistakes or steps that should be repeated.
6) Deployment: In this step, a deployment plan is made, a strategy to monitor and maintain the data mining model results is formed to check their usefulness, final reports are produced, and the whole process is reviewed to catch mistakes and determine whether any step should be repeated. [App F-K-15(2)]
[Figure: CRISP-DM stages and tasks]
SEMMA is another data
mining methodology developed by SAS Institute. The acronym SEMMA stands for
sample, explore, modify, model, and assess.
SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictor variables, create a model using those variables, and check its accuracy. SEMMA is also driven by a highly iterative cycle.
Steps in SEMMA
1. Sample: In this step, a large dataset is extracted
and a sample that represents the full data is taken out. Sampling will reduce
the computational costs and processing time.
2. Explore: The data is explored for any outlier and
anomalies for a better understanding of the data. The data is visually checked
to find out the trends and groupings.
3. Modify: In this step, manipulation of data such as
grouping, and subgrouping is done by keeping in focus the model to be built.
4. Model: Based on the explorations and modifications,
the models that explain the patterns in data are constructed.
5. Assess: The usefulness and reliability of the
constructed model are assessed in this step. Testing of the model against real
data is done here.
Both the SEMMA and CRISP-DM approaches support the knowledge discovery process. Once models are built, they are deployed for business and research work. [App F-K-15(2)]
1. Supervised learning, in which the training data is labeled with the correct answers, e.g., “spam” or “ham.” The two most common types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).
2. Unsupervised learning, in which we are given a collection of unlabeled data, which we wish to analyze and discover patterns within. The two most important examples are dimension reduction and clustering.
3. Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based on the outcomes of past actions.
Learn the process for conducting
predictive analytics step by step.
[App F-K-15(1, 2)][45min]
The first step in any data science project: exploring the data.
American mathematician John Tukey promoted the use of EDA in his book, Exploratory Data Analysis (Pearson). Tukey emphasized that analysts first need to explore the data for potential research questions before jumping into confirming the answers with hypothesis testing and inferential statistics.
EDA is often likened to “interviewing” the data; it’s a time for the analyst to get to know the data and learn what interesting things it has to say.
In practice, EDA also covers checking the assumptions required for model fitting and hypothesis testing, handling missing values, detecting outliers, and transforming variables as needed.
Typical activities during EDA include examining:
§ Trends
§ Distributions
§ Mean
§ Median
§ Outliers
§ Variance
§ Correlations
§ Hypothesis testing
§ Visual exploration
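The activities above can be illustrated with a short pandas session (pandas is assumed available; the toy height/weight data below is invented for illustration): summary statistics, a correlation, and a simple IQR-based outlier check.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"height": [150, 160, 165, 170, 180, 210],
                   "weight": [50, 58, 61, 68, 80, 120]})

summary = df.describe()                    # mean, quartiles (median = 50%), std
corr = df["height"].corr(df["weight"])     # linear correlation between features

# Flag outliers: values beyond 1.5 * IQR from the quartiles.
q1, q3 = df["weight"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["weight"] < q1 - 1.5 * iqr) | (df["weight"] > q3 + 1.5 * iqr)]
print(summary)
print(round(corr, 3), len(outliers))
```

Here the 120 kg row is flagged as an outlier, which is exactly the kind of finding an analyst would follow up on before modeling.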
Learn the different types of
predictive models and business backgrounds. [App F-K-11(1, 2)] [App F-S-12(1, 2)]
Linear Regression Model (Useful for Numeric Features):
Linear regression models are among the most basic statistical techniques and a widely used form of predictive analysis. Simply put, linear regression is a statistical method for studying the relationship between a dependent variable Y, which is a continuous variable, and at least one independent variable X.
If you select only one x (one feature) to predict the target (y), it is SLR (simple linear regression):
y = b0 + b1*x
As an example, if you want to predict weight based on height:
Weight = b0 + b1*(Height)
Or, if you want to predict the price of bitcoin based on time:
Price = b0 + b1*(Time)
If you select more than one x (feature) to predict the target (y), it is MLR (multiple linear regression):
y = b0 + b1*x1 + b2*x2 + … + bk*xk
As an example, if you want to predict weight based on height and age:
Weight = b0 + b1*(Height) + b2*(Age)
Let weight be the predictor (independent) variable and let height be the response (dependent) variable, i.e., the target.
In this example, the line of best fit is:
height = 32.783 + 0.2001*(weight)
A residual is simply the
distance between the actual data value and the value predicted by the
regression line of best fit.
Notice that some of the residuals are positive and some are negative. If we add up all of the residuals, they will sum to zero. This is because linear regression (with an intercept) finds the line that minimizes the total squared residuals, so the line passes through the middle of the data, with some data points lying above the line and some lying below it.
Y = b0 + b1*X + e, where e is the residual (error) term.
For the best-fit line, the sum of all residuals (errors) equals zero, and SSE (the sum of squared errors) is minimized.
The sum of squared errors, typically abbreviated SSE or SSe, refers to the residual sum of squares of a regression: the sum of the squares of the deviations of the actual values from the predicted values, within the sample used for estimation. This is also called the least squares estimate, where the regression coefficients are chosen such that the sum of squares is minimal (i.e., its derivative is zero).
Note: You will learn more about linear regression in this course, and you will learn about other regression algorithms in the machine learning course.
What is logistic
regression?
Logistic regression is
a classification algorithm. It is used to predict a binary outcome based on a
set of independent variables.
• Logistic regression is one of the most popular
Machine Learning algorithms, which comes under the Supervised Learning
technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.
Odds = p / (1 - p)
OR (Odds Ratio) = the odds of the outcome in one group divided by the odds in another group.
The log (base e) of the odds is modeled as a linear function of the predictors:
log(p / (1 - p)) = b0 + b1*x
Note:
You will learn more about Logistic Regression in this course.
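The relationship between log-odds, odds, and probability can be demonstrated with a few lines of NumPy. The coefficients below are illustrative assumptions, not estimates from any dataset; the point is that the sigmoid maps a linear log-odds back to a probability in (0, 1), and that a one-unit increase in x multiplies the odds by exp(b1).

```python
import numpy as np

# Illustrative (assumed) coefficients for log(p / (1 - p)) = b0 + b1*x.
b0, b1 = -4.0, 0.8

def predict_proba(x):
    log_odds = b0 + b1 * x                 # the logit is linear in x
    return 1.0 / (1.0 + np.exp(-log_odds)) # sigmoid maps it into (0, 1)

p = predict_proba(5.0)                     # log-odds = 0 here, so p = 0.5
odds = p / (1 - p)

# Odds ratio for a one-unit increase in x: equals exp(b1).
odds_next = predict_proba(6.0) / (1 - predict_proba(6.0))
odds_ratio = odds_next / odds
print(round(float(p), 3), round(float(odds_ratio), 4))
```

This is why logistic regression coefficients are commonly reported as exponentiated odds ratios.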
Decision Tree: Decision Trees (DTs) are a supervised learning technique that predicts
values of responses by learning decision rules derived
from features. They can be used in both regression and classification context.
For this reason, they are sometimes also referred to as Classification and
Regression Trees (CART).
• They are powerful algorithms, capable of fitting complex datasets.
• One of the primary benefits of using a DT/CART is that, by construction, it produces interpretable if-then-else decision rule sets, which are akin to graphical flowcharts.
• Their main disadvantage lies in the fact that they are often uncompetitive with other supervised techniques, such as support vector machines or deep neural networks, in terms of prediction accuracy.
• However, they can become extremely competitive when used in an ensemble method, such as bootstrap aggregation (“bagging”, as in random forests) or boosting.
• In quantitative finance, ensembles of DT/CART models are used in forecasting, either future asset prices/directions or the liquidity of certain instruments.
•
In
quantitative finance ensembles of DT/CART models are used in forecasting,
either future asset prices/directions or liquidity of certain instruments.
Advantages of Decision Trees
· Easy to understand
· Can handle both categorical and numerical data
· Robust to outliers
· Often perform reasonably well on imbalanced data
Disadvantages of Decision Trees
· Prone to overfitting
· Need careful parameter tuning
· Can produce biased trees if some classes dominate
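The interpretability claim above is easy to see in practice. A minimal sketch with scikit-learn (assumed available): fit a deliberately shallow tree on the built-in Iris data and print its learned if-then-else rules as text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 keeps the rule set short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the decision rules as a flowchart-like listing.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print(round(tree.score(iris.data, iris.target), 3))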
Market Basket Analysis (MBA) is one of the most common
and useful types of data analysis for marketing and retailing. The purpose of market basket analysis is to
determine what products customers purchase together. It takes its name from the idea of customers
throwing all their purchases into a shopping cart (a "market basket")
during grocery shopping. Knowing what
products people purchase as a group can be very helpful to a retailer or to any
other company.
Suppose that you are running an e-commerce site; it would be nice to know what combinations of products tend to be bought at the same time. Then, you can “recommend” a list of products to your customers based on what they have already bought or are about to buy. This is basically what many of us call a “recommendation engine” today.
One of the ways to find this out is to use an algorithm called “Association Rules”, often referred to as “Market Basket Analysis”.
• This is the purpose of market basket analysis: to improve the effectiveness of marketing and sales tactics using customer data already available to the company.
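The core association-rule metrics can be computed directly over a handful of baskets. The toy transactions and item names below are invented for illustration; the sketch shows support, confidence, and lift for one rule, {bread} → {milk}.

```python
# Toy baskets (illustrative only): each set is one customer's purchases.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def support(items):
    """Fraction of baskets containing every item in `items`."""
    return sum(items <= basket for basket in transactions) / n

# Rule {bread} -> {milk}:
sup_both = support({"bread", "milk"})          # support of the rule
confidence = sup_both / support({"bread"})     # P(milk | bread)
lift = confidence / support({"milk"})          # >1 means positive association
print(sup_both, confidence, lift)
```

A lift below 1, as here, would mean buying bread makes milk slightly *less* likely than its baseline rate; real engines keep only rules whose support, confidence, and lift clear chosen thresholds.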
Normality test, the test of Homogeneity of Variances, and Transformation [App F-S-11(1,3)] [App F-S-11(1, 3, 4)]
Kindly utilize the notebooks "Day 1.1" and "Day 1.2" to enhance your understanding.
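As a complement to those notebooks, here is a brief sketch with SciPy (assumed available) of the three topics in this section: the Shapiro-Wilk test for normality, Levene's test for homogeneity of variances, and a log transformation applied to right-skewed data. All samples are simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10, 2, size=40)            # roughly normal sample
group_b = rng.normal(10, 2, size=40)            # same variance as group_a
skewed = rng.lognormal(mean=0, sigma=1, size=40)  # right-skewed sample

_, p_norm = stats.shapiro(group_a)              # H0: sample is normal
_, p_var = stats.levene(group_a, group_b)       # H0: variances are equal

# A log transformation often pulls right-skewed data toward normality.
_, p_before = stats.shapiro(skewed)
_, p_after = stats.shapiro(np.log(skewed))
print(p_norm > 0.05, p_var > 0.05, p_before < p_after)
```

Large p-values fail to reject normality / equal variances; the skewed sample's Shapiro-Wilk p-value improves markedly after the log transform because log of a lognormal sample is exactly normal.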
C. Class Exercises and Solutions:
a) Class Exercises: [C-Challenging] [App F-K-11(1, 2)] [App F-S-12(1, 2)]:
Q1) When should you consider using linear and logistic regression models?
Q2) What are supervised and unsupervised learning?
Q3) Give some business examples for performing predictive modeling.
b) Solutions to Class Exercises
Q1) When the response is continuous (numeric), we should use linear regression; when the response is binary, we should use logistic regression.
Q2) Within the field of machine learning,
there are two main types of tasks: supervised, and unsupervised. The main
difference between the two types is that supervised learning is done using a
ground truth, or in other words, we have prior knowledge of what the output
values for our samples should be. Therefore, the goal of supervised learning is
to learn a function that, given a sample of data and desired outputs, best
approximates the relationship between input and output observable in the data.
Unsupervised learning, on the other hand, does not have labeled outputs, so its
goal is to infer the natural structure present within a set of data points.
Supervised learning is typically done in the context of classification, when we want to map input to output labels, or regression, when we want to map input to a continuous output. Common algorithms in supervised learning include logistic regression, naive Bayes, support vector machines, artificial neural networks, and random forests.
The most common tasks within unsupervised
learning are clustering, representation learning, and density estimation. In
all of these cases, we wish to learn the inherent structure of our data without
using explicitly provided labels. Some common algorithms include k-means
clustering, principal component analysis, and autoencoders. Since no labels are
provided, there is no specific way to compare model performance in most
unsupervised learning methods.
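The contrast in the answer above can be made concrete with one of the unsupervised algorithms it names. A minimal sketch with scikit-learn (assumed available): k-means recovers two obvious groups from simulated data without ever seeing a label.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs; the algorithm is never told which point is which.
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# All points within one blob should end up sharing a cluster label.
print(len(set(labels[:50])), len(set(labels[50:])))
```

Note that choosing n_clusters=2 is itself a judgment call: with no ground truth, there is no single correct way to score the result, which is exactly the evaluation difficulty mentioned above.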
Q3) Some real-life business examples of predictive analytics include:
· Amazon uses predictive analysis to recommend products and services to users based on their past behavior.
· Harley-Davidson uses predictive models to target potential customers, attract leads, and close deals.
· A call center can predict how many support calls it will receive per hour.
· A shoe store can calculate how much inventory it should keep on hand in order to meet demand during a particular sales period.
D. Homework Exercises and Solutions:
a)
Homework Exercises: [B- Basic] [App F-K-11(1, 2)] [App F-S-12(1, 2)]
Q1) What is predictive modeling?
Q2) What are decision trees and regression trees?
b)
Solutions to homework