Introduction to Data Science
Statistics and Probability
Basics of Machine Learning
Linear Regression
Logistic Regression
Decision Tree
Random Forest
K Nearest Neighbor
Naive Bayes
Support Vector Machine
K-Means Clustering
Association Rule Mining
Reinforcement Learning
Deep Learning
Data Science Interview Questions
Need for Data Science
Walmart Use Case
What is Data Science?
Who is a Data Scientist?
Data Science - Skill Set
Data Science Job Roles
Data Life Cycle
Introduction to Machine Learning
K-Means Use Case
K-Means Algorithm
Hands-On
Data Science Certification
Data Sources: Mobile Phone, PC, Smart Car
IoT: Internet of Things
Social Media
Other Factors: Retail, Banking & Finance, Insurance, Transportation, Government, Education, Healthcare, Media & Entertainment
What is Data Science? Data Science is the process of extracting knowledge and insights from data using scientific methods.
Scientific methods: Programming + Statistics + Business
Who is a Data Scientist?
Mathematics, Business and Technology
Data Science - Skill Set
Statistics, Programming languages, Data extraction & processing, Data wrangling & exploration, Machine Learning, Big Data processing frameworks, Data visualization
Data Scientist: Programming
Data Analyst: Visualization, SQL, R, Python
Data Architect: Blueprints, security, data modeling
Data Engineer: Builds and tests scalable systems; Java, C++, MATLAB
Statisticians: Math
Database Administrator: SQL
Business Analyst: Business growth, data modeling
Data & Analytics Manager: Data Science operation
Business Requirement: Understand the problem. Identify central objectives, identify variables that need to be predicted.
Data Acquisition: What data do I need for my project? What are the data sources? How can I obtain the data? What is the most efficient way to store and access all of it?
Data Processing: Transform the data into the desired format; handle missing values and corrupted data.
Data Exploration: Understand the patterns in the data, retrieve insights, and form hypotheses.
Modelling: Determine optimal data features for the machine-learning model. Create a model that predicts the target accurately. Evaluate & test the efficiency of model.
Deployment: Check the deployment environment for dependency issues. Deploy the model in a pre-production/test environment. Monitor the performance
Agenda
What is data?: Data refers to facts and statistics collected together for reference or analysis. Data is collected, stored, measured, analyzed, and visualized.
Categories of data: Data falls into two categories, Qualitative and Quantitative. Qualitative data is either Nominal or Ordinal; Quantitative data is either Discrete or Continuous. Nominal Data: data with no inherent order or ranking, such as gender or race. Ordinal Data: data with an ordered series, such as ratings of Good, Average, Bad.
Quantitative: Discrete and Continuous. Discrete data, also known as categorical data, can hold only a finite number of possible values. Example: the number of students in a class.
Continuous Data: data that can hold an infinite number of possible values. Example: the weight of a person.
What is statistics? Statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
Basic terminologies in statistics: Population: a collection or set of individuals, objects, or events whose properties are to be analyzed. Sample: a subset of the population is called a "sample." A well-chosen sample will contain most of the information about a particular population parameter.
Sampling techniques: Random sampling, Systematic sampling (every nth record), and Stratified sampling (strata share at least one common characteristic).
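As a rough illustration of these three techniques (my own sketch, not part of the course material — the customer table and column names are made up):
import pandas as pd
import numpy as np
# Hypothetical population: 1,000 customers split into three segments
population = pd.DataFrame({
    "customer_id": range(1000),
    "segment": np.random.choice(["A", "B", "C"], size=1000),
})
# Random sampling: every record has an equal chance of selection
random_sample = population.sample(n=100, random_state=42)
# Systematic sampling: take every nth record (here n = 10)
systematic_sample = population.iloc[::10]
# Stratified sampling: sample within each segment so every stratum is represented
stratified_sample = population.groupby("segment", group_keys=False).sample(frac=0.1, random_state=42)
print(len(random_sample), len(systematic_sample), len(stratified_sample))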
Types of statistics: Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations or through graphs and tables (e.g., maximum, average, minimum). Descriptive statistics focuses on the main characteristics of the data and provides a graphical summary of it.
Probability:
Inferential statistics:
Descriptive Statistics: is a method used to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data.
Descriptive Statistics are broken down into two categories:
Measures of Central tendency
Measures of Variability (spread)
Measures of Spread: Range, Interquartile Range, Variance, Standard Deviation
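A quick NumPy sketch of these four measures on a made-up sample (illustrative only):
import numpy as np
sample = np.array([4, 8, 15, 16, 23, 42])            # made-up sample
data_range = sample.max() - sample.min()              # Range
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1                                         # Interquartile Range
variance = sample.var(ddof=1)                         # sample variance
std_dev = sample.std(ddof=1)                          # standard deviation
print(data_range, iqr, variance, std_dev)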
Entropy measures the impurity or uncertainty present in the data
Information Gain (IG) indicates how much information a particular feature variable gives us about the final outcome.
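A minimal sketch of both quantities on a toy split, using only the Python standard library (the labels and the candidate split are made up for illustration):
from collections import Counter
from math import log2
def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a candidate split
# Information gain = parent entropy - weighted entropy of the child nodes
weighted = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - weighted
print(entropy(parent), info_gain)   # 1.0 and 1.0 for this perfectly separating split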
Decision Tree
A confusion matrix is a table that is often used to describe the performance of a classification model (or classifier) on a set of test data for which the true values are known
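A small illustrative sketch of how such a table is produced and read with scikit-learn (the true and predicted labels below are made up):
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # known true values (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier's predictions (made up)
# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))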
set.seed(1)
#Generate random numbers and store it in a variable called data
data = runif(20,1,10)
#data <- c(1, 2, 3, 4, 5, 6, 7, 7, 8, 9)
print(data)
#Calculate Mean
mean = mean(data)
print(mean)
#Calculate Median
median = median(data)
print(median)
#Create a function to calculate the Mode
mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
result <- mode(data)
cat("mode =", result, "\n")
#Calculate Variance and std Deviation
variance = var(data)
standardDeviation = sqrt(var(data))
print(standardDeviation)
#plot Histogram
hist(data, breaks=10, xlim=c(0,10), border='black')
Probability is the measure of how likely an event will occur.
Probability is the ratio of desired outcomes to total outcomes. (desired outcomes) / (total outcomes)
The probabilities of all outcomes always sum to 1
Example:
On rolling a die, there are 6 possible outcomes.
Each outcome is equally likely, so each has a probability of 1/6
For example, the probability of getting a 2 on the die is 1/6
Terminologies in probability
Random Experiment: An experiment or a process for which the outcome cannot be predicted with certainty
Sample Space: The entire possible set of outcomes of a random experiment is the sample space of that experiment
Event: One or more outcomes of an experiment. It is a subset of sample space.
Probability Distribution
Probability Density Function:
Normal Distribution:
Central Limit Theorem:
Probability Density Function
Central Limit Theorem: states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough.
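A short NumPy simulation of this statement (my own illustration): the means of repeated samples drawn from a skewed exponential population concentrate around the true mean and look increasingly normal as the sample size n grows.
import numpy as np
rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, clearly non-normal
for n in (2, 30, 200):                                  # increasing sample sizes
    sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]
    # As n grows, the distribution of sample means narrows around the true mean (2.0)
    print(n, np.mean(sample_means), np.std(sample_means))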
Marginal Probability
Joint Probability
Conditional Probability
Marginal Probability is the probability of occurrence of a single event
Joint Probability is a measure of two events happening at the same time
Example: The probability that a card is an Ace of hearts = p(Ace of hearts)
There are 13 heart cards in a deck of 52, and only one of them is the Ace of Hearts, so p(Ace of Hearts) = 1/52
The probability that a candidate has undergone Edureka's training is 45/105 ≈ 0.43
Find the probability that a candidate has attended Edureka's training and also has good package.
Find the probability that a candidate has a good package given that he has not undergone training.
Bayes' Theorem shows the relation between one conditional probability and its inverse: P(A|B) = P(B|A) * P(A) / P(B)
Point Estimation is concerned with the use of the sample data to measure a single value which serves as an approximate value or the best estimate of an unknown population parameter.
4 Ways to find estimates
Method of Moments: estimates are found by equating the first k sample moments to the corresponding k population moments
Maximum Likelihood: uses a model and chooses the parameter values that maximize the likelihood function, i.e., the values under which the observed data are most probable (see the sketch after this list)
Bayes' Estimators: minimize the average risk (an expectation over random variables)
Best Unbiased Estimators: several unbiased estimators can be used to approximate a parameter (which one is "best" depends on which parameter you are trying to find)
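As a concrete illustration of the maximum likelihood idea referenced above (my own sketch, not from the course): for a normal model the likelihood is maximized by the sample mean and the uncorrected sample standard deviation, which we can verify numerically on simulated data.
import numpy as np
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)     # simulated sample
# For a normal model, the likelihood is maximized at these parameter values:
mu_hat = data.mean()                                   # MLE of the mean
sigma_hat = np.sqrt(((data - mu_hat) ** 2).mean())     # MLE of the std (divides by n, not n-1)
print(mu_hat, sigma_hat)                               # close to the true values 5.0 and 2.0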
Confidence Interval is the measure of your confidence that the interval estimate contains the population mean.
Statisticians use a confidence interval to describe the amount of uncertainty associated with a sample estimate of a population parameter.
Technically, it is a range of values constructed so that there is a specified probability of including the true value of the parameter within it.
The difference between the point estimate and the actual population parameter value is called the Sampling Error.
When μ is estimated by the sample mean x̄, the sampling error is the difference μ − x̄.
Margin of Error E: for a given level of confidence, the greatest possible distance between the point estimate and the value of the parameter it is estimating.
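A small sketch tying these ideas together — a point estimate, its margin of error E at 95% confidence using the normal critical value, and the resulting confidence interval (the data are simulated; this is an illustration, not part of the course demo):
import numpy as np
from scipy import stats
rng = np.random.default_rng(7)
sample = rng.normal(loc=100, scale=15, size=50)       # simulated sample
x_bar = sample.mean()                                 # point estimate of the mean
s = sample.std(ddof=1)
z = stats.norm.ppf(0.975)                             # 95% confidence -> z ≈ 1.96
E = z * s / np.sqrt(len(sample))                      # margin of error
print(f"point estimate = {x_bar:.2f}, CI = ({x_bar - E:.2f}, {x_bar + E:.2f})")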
Agenda
Need for machine learning
What is machine learning?
Machine learning definitions
Machine Learning process
Types of machine learning
Type of problems solved using machine learning
Demo
What is Regression?
Regression Use-Case
Types of Regression Linear vs Logistic Regression
What is Linear Regression?
Finding best fit regression line using Least Square Method
Checking goodness of fit using R squared Method
Implementation of Linear Regression using Python
Linear Regression Algorithm using Python from scratch
Linear Regression Algorithm using Python (scikit lib)
Three major uses for regression analysis are
Determining the strength of predictors
Forecasting an effect, and
Trend forecasting
Selection Criteria
Classification and Regression Capabilities
Data Quality
Computational Complexity
Comprehensible and Transparent
Where is Linear Regression used?
Evaluating Trends and Sales Estimates
Analyzing the impact of Price Changes
Assessment of risk in financial services and insurance domain
R-Squared value is a statistical measure of how close the data are to the fitted regression line.
It is also known as coefficient of determination, or the coefficient of multiple determination.
Regression
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0, 4.0)
data = pd.read_csv('headbrain.csv')
print(data.shape)
data.head()
(237, 4)
X = data['Head Size(cm^3)'].values
Y = data['Brain Weight(grams)'].values
mean_x = np.mean(X)
mean_y = np.mean(Y)
m = len(X)
numer = 0
denom = 0
for i in range(m):
numer += (X[i] - mean_x) * (Y[i] - mean_y)
denom += (X[i] - mean_x) ** 2
b1 = numer / denom
b0 = mean_y - (b1 * mean_x)
# b1 is the slope (m) of the regression line
# b0 is the intercept (c)
print(b1, b0)
0.26342933948939945 325.57342104944223
max_x = np.max(X) + 100
min_x = np.min(X) - 100
x = np.linspace(min_x, max_x, 1000)
y = b0 + b1 * x
plt.plot(x, y, color='#58b970', label='Regression Line')
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('Head size in cm^3')
plt.ylabel('Brain Weight in grams')
plt.legend()
plt.show()
# ss_t: total sum of squares
# ss_r: residual sum of squares
ss_t = 0
ss_r = 0
for i in range(m):
y_pred = b0 + b1 * X[i]
ss_t += (Y[i] - mean_y) ** 2
ss_r += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_r/ss_t)
print(r2)
0.6393117199570003
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X = X.reshape((m, 1))
reg = LinearRegression()
reg = reg.fit(X, Y)
Y_pred = reg.predict(X)
# RMSE and R2 Score
mse = mean_squared_error(Y, Y_pred)
rmse = np.sqrt(mse)
r2_score = reg.score(X, Y)
print(np.sqrt(mse))
72.1206213783709
print(r2_score)
0.639311719957
Agenda
What is Regression?
Logistic regression: What and Why?
Linear VS Logistic Regression
Use-Cases
Demo
What is Regression? Regression Analysis is a predictive modeling technique
It estimates the relationship between a dependent (target) and an independent variable (predictor)
Logistic Regression produces results in a binary format which is used to predict the outcome of a categorical dependent variable. So the outcome should be discrete/categorical, such as 0 or 1, Yes or No, True or False, High or Low.
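Under the hood, the discrete outcome comes from passing a linear combination of the inputs through the sigmoid function and thresholding the result; a minimal NumPy sketch with made-up coefficients:
import numpy as np
def sigmoid(z):
    # Maps any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))
# Made-up coefficients and a single input value
b0, b1 = -4.0, 1.5
x = 3.2
p = sigmoid(b0 + b1 * x)          # predicted probability of class 1
prediction = int(p >= 0.5)        # threshold at 0.5 -> 0 or 1
print(p, prediction)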
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('z_data/train.csv')
train.info()
train.head()
Exploratory Data Analysis Let's begin some exploratory data analysis! We'll start by checking out missing data!
Missing Data We can use seaborn to create a simple heatmap to see where we are missing data!
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Roughly 20 percent of the Age data is missing. The proportion of Age missing is likely small enough for reasonable replacement with some form of imputation. Looking at the Cabin column, it looks like we are just missing too much of that data to do something useful with at a basic level. We'll probably drop this later.
sns.set_style('whitegrid')
sns.countplot(x='Survived', data=train, palette='viridis')
sns.countplot(x='Survived', hue='Sex', data=train, palette='viridis')
A greater number of female passengers survived the tragedy compared to the male passengers on the ship.
sns.countplot(x='Survived', hue= 'Pclass', data=train, palette='viridis')
The death rate is higher in passengers who were in third class.
sns.countplot(x='Survived', hue= 'Embarked', data=train, palette='viridis')
sns.distplot(train['Age'].dropna(),kde=False, bins=30, color='blue' )
sns.countplot(x='Parch', data=train, palette='viridis')
sns.countplot(x='SibSp', data=train, palette='viridis')
The plot indicates that very few passengers had siblings, spouses, parents, or children aboard. This is consistent with the age distribution, which suggests more young adults were on the ship.
Data Cleaning We want to fill in missing age data instead of just dropping the missing age data rows. One way to do this is by filling in the mean age of all the passengers (imputation). However we can be smarter about this and check the average age by passenger class.
import cufflinks as cf
cf.go_offline()
box_age = train[['Pclass', 'Age']]
box_age.pivot(columns='Pclass', values='Age').iplot(kind='box')
We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average age values to impute based on Pclass for Age.
def impute_age(cols):
Age = cols[0]
Pclass = cols[1]
if pd.isnull(Age):
if Pclass == 1:
return 37
elif Pclass == 2:
return 29
else:
return 24
else:
return Age
train['Age'] = train[['Age', 'Pclass']].apply(impute_age,axis=1)
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Great! Let's go ahead and drop the Cabin column
train.drop('Cabin', axis=1, inplace=True)
train.head()
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Converting Categorical Features We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
sex = pd.get_dummies(train['Sex'], dtype="int", drop_first= True)
embark = pd.get_dummies(train['Embarked'], dtype="int", drop_first= True)
train = pd.concat([train, sex, embark], axis=1)
train.head()
train.drop(['Sex','Embarked', 'Name', 'Ticket'], axis=1, inplace=True)
train.head(3)
Building a Logistic Regression model
X = train.drop('Survived', axis=1)
y = train['Survived']
X.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=101)
Training and Predicting
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(solver='liblinear')
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
Evaluation
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)
0.7761194029850746
import numpy as np
import pandas as pd
import matplotlib as plt
%matplotlib inline
dataset = pd.read_csv("z_data/suv_prediction.csv")
dataset.head(10)
X = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values  # 1-D target vector
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # scale the test set with the scaler fitted on the training data
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)
array([[63,  5],
       [ 8, 24]])
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
0.87
What is Classification? For example, Gmail classifies an incoming email as spam or not spam.
Types of Classification
Classification Use Cases: credit card fraud detection, car evaluation
What is Decision Tree?
Terminologies associated to a Decision Tree
Visualizing a Decision Tree
Writing a Decision Tree Classifier from Scratch in Python using CART Algorithm
Types of Classification
Decision Tree:
Graphical representation of all the possible solutions to a decision
Decisions are based on some conditions
Decision made can be easily explained.
Random Forest
Builds multiple decision trees and merges them together
More accurate and stable prediction
Random decision forests correct for decision trees' habit of overfitting to their training set
Trained with the bagging method
Naive Bayes
Classification technique based on Bayes Theorem
Assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature
K-Nearest Neighbours
Stores all the available cases and classifies new cases based on a similarity measure
The K in the KNN algorithm is the number of nearest neighbours we wish to take a vote from
A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions.
Gini Index: The measure of impurity (or purity) used in building decision tree in CART is Gini Index
Information Gain: the information gain is the decrease in entropy after a dataset is split on the basis of an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain.
Reduction in Variance: an algorithm used for continuous target variables (regression problems). The split with the lower variance is selected as the criterion to split the population.
Chi-Square: an algorithm to find the statistical significance of the differences between sub-nodes and the parent node.
# Sample dataset
# Format: each row is an example
# The Last column is the label
# The first two columns are features
# If you want you can add more features & examples
# Interesting note: the 2nd and 5th examples have the same features, but different labels
# Let's see how the tree handles this case
training_data = [
['Green', 3, 'Mango'],
['Yellow', 3, 'Mango'],
['Red', 1, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon']
]
# Column Labels
# These are used only to print the tree.
header = ["color", "diameter", "label"]
def unique_vals(rows, col):
return set([row[col] for row in rows])
# unique_vals(training_data, 0)
# unique_vals(training_data, 1)
def class_counts(rows):
"""Counts the number of each type of example in a dataset"""
counts = {} # a dictionary of Label -> count
for row in rows:
# in our dataset format, the Label is always the Last column
label = row[-1]
if label not in counts:
counts[label] = 0
counts[label] += 1
return counts
# class_counts(training_data)
def is_numeric(value):
"""Test if a value is numeric"""
return isinstance(value, int) or isinstance(value, float)
# is_numeric(7)
# is_numeric(Red)
class Question:
"""A Question is used to partition a dataset
This class just records a column number (e.g., 0 for color) and a
'column value' (e.g., Green). The 'match' method is used to compare
the feature value in an example to the feature value stored in the
question. See the demo below.
"""
def __init__(self, column, value):
self.column = column
self.value = value
def match(self, example):
# Compare the feature value in an example to the
# feature value in the question
val = example[self.column]
if is_numeric(val):
return val >= self.value
else:
return val == self.value
def __repr__(self):
# This is just a helper method to print
# the question in a readable format
condition = "=="
if is_numeric(self.value):
condition = ">="
return "IS %s %s %s?" % (
header[self.column], condition, str(self.value))
def partition(rows, question):
"""Partitions a dataset.
For each row in the dataset, check if it matches the question. If
so, add it to 'true rows', otherwise, add it to the 'false rows'
"""
true_rows, false_rows = [], []
for row in rows:
if question.match(row):
true_rows.append(row)
else:
false_rows.append(row)
return true_rows, false_rows
# Let's partition the training data based on whether rows are Red.
# true_rows, false_rows = partition(training_data, Question(0, 'Red'))
# This will contain all the 'Red' rows
# true_rows
# This will contain everything else.
# false_rows
def gini(rows):
"""Calculate the Gini Impurity for a list of rows
There are a few different ways to do this, I thought this one was
the most concise
"""
counts = class_counts(rows)
impurity = 1
for lbl in counts:
prob_of_lbl = counts[lbl] / float(len(rows))
impurity -= prob_of_lbl**2
return impurity
def info_gain(left, right, current_uncertainty):
"""Information Gain.
The uncertainty of the starting node, minus the weighted impurity of
two child nodes
"""
p = float(len(left)) / (len(left) + len(right))
return current_uncertainty - p * gini(left) - (1 - p) * gini(right)
# Calculate the uncertainty of our training data
# current_uncertainty = gini(training_data)
# How much information do we gain by partitioning on 'Green'?
# true_rows, false_rows = partition(training_data, Question(0, 'Green'))
# info_gain(true_rows, false_rows, current_uncertainty)
# What about if we partitioned on "Red" instead?
# true_rows, false_rows = partition(training_data, Question(0, 'Red'))
# info_gain(true_rows, false_rows, current_uncertainty)
def find_best_split(rows):
"""Find the best question to ask by iterating over every feature / vale
and calculating the information gain
"""
best_gain = 0 # keep track of the best information gain
best_question = None # keep track of the feature / value that produced it
current_uncertainty = gini(rows)
n_features = len(rows[0]) - 1 # number of columns
for col in range(n_features): # for each feature
values = set([row[col] for row in rows]) # unique values in the column
for val in values: # for each value
question = Question(col, val)
# try splitting the dataset
true_rows, false_rows = partition(rows, question)
# skip this split if it doesn't divide the
# dataset
if len(true_rows) == 0 or len(false_rows) == 0:
continue
# Calculate the information gain from this split
gain = info_gain(true_rows, false_rows, current_uncertainty)
# You actually can use '>' instead of '>=' here
# but I wanted the tree to look a certain way for our
# toy dataset
if gain >= best_gain:
best_gain, best_question = gain, question
return best_gain, best_question
class Leaf:
"""A Leaf node classified data.
This holds a dictionary of class (e.g. 'Mango') -> number of times
it appears in the rows from the training data that reach this leaf
"""
def __init__(self, rows):
self.predictions = class_counts(rows)
class Decision_Node:
"""A Decision Node asks a question
This holds a reference to the question, and to the two child nodes
"""
def __init__(self,
question,
true_branch,
false_branch):
self.question = question
self.true_branch = true_branch
self.false_branch = false_branch
def build_tree(rows):
"""Builds the tree
Rules of recursion: 1) Believe that it works. 2) Start by checking
for the base case (no further information gain). 3) Prepare for giant stack traces.
"""
# Try partitioning the dataset on each of the unique attributes,
# calculate the information gain,
# and return the question that produces the highest gain
gain, question = find_best_split(rows)
# Base case: no further info gain
# Since we can ask no further question
# we'll return a Leaf
if gain == 0:
return Leaf(rows)
# If we reach here, we have found a useful feature / value
# to partition on
true_rows, false_rows = partition(rows, question)
# Recursively build the true branch
true_branch = build_tree(true_rows)
# Recursively build the false branch
false_branch = build_tree(false_rows)
# Return a question node
# This records the best feature / value to ask at this point
# as well as the branches to follow
# depending on the answer
return Decision_Node(question, true_branch, false_branch)
def print_tree(node, spacing=""):
# Base case: We've reached a Leaf
if isinstance(node, Leaf):
print (spacing + "Predict", node.predictions)
return
# Print the question at this node
print (spacing + str(node.question))
# Call this function recursively on the true branch
print (spacing + '--> True:')
print_tree(node.true_branch, spacing + " ")
# Call this function recursively on the false branch
print (spacing + '--> False:')
print_tree(node.false_branch, spacing + " ")
def classify(row, node):
# Base case: We've reached a Leaf
if isinstance(node, Leaf):
return node.predictions
# Decide whether to follow the true branch or the false branch
# Compare the feature / value stored in the node.
# to the example we're considering
if node.question.match(row):
return classify(row, node.true_branch)
else:
return classify(row, node.false_branch)
def print_leaf(counts):
"""Print the predictions at a leaf"""
total = sum(counts.values()) * 1.0
probs = {}
for lbl in counts.keys():
probs[lbl] = str(int(counts[lbl] / total * 100)) + "%"
return probs
if __name__ == '__main__':
my_tree = build_tree(training_data)
print_tree(my_tree)
testing_data = [
['Green', 3, 'Apple'],
['Yellow', 4, 'Apple'],
['Red', 2, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon']
]
for row in testing_data:
print ("Actual: %s. Predicted: %s" %
(row[-1], print_leaf(classify(row, my_tree))))
Decision Tree: Decision tree builds classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets.
Random Forest: Random Forest is an ensemble classifier made using many decision tree models. Ensemble models combine the results from different models.
Naive Bayes: It is a classification technique based on Bayes' Theorem with an assumption of independence among attributes.
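A minimal scikit-learn sketch of the Naive Bayes idea, using Gaussian Naive Bayes on the built-in iris dataset (an illustration of my own, not part of the course demo):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Each feature is assumed independent given the class (the "naive" assumption)
model = GaussianNB()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))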
Why Random Forest?
Random Forest is a versatile algorithm capable of performing both Regression and Classification.
It is a type of ensemble learning method.
It is a commonly used predictive modelling and machine learning technique.
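A short scikit-learn sketch of the ensemble idea described above, again on the built-in iris data (illustrative only):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 100 trees, each trained on a bootstrap sample (bagging); predictions are combined by vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))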
What is KNN Algorithm?
Industrial Use case of KNN Algorithm.
How things are predicted using KNN Algorithm
How to choose the value of K?
KNN Algorithm Using Python
Implementation of KNN Algorithm from scratch
Compare with the model built using scikit-learn
K Nearest Neighbour is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure.
K = 3 means the three nearest neighbours are taken to vote on the class of the new point.
Two orange neighbours vs. one blue, so the new point belongs to the orange class.
How to choose K?
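One common way to choose K is to evaluate the classifier over a range of K values on held-out data and pick the K with the lowest error; a sketch on the iris data (illustrative, not the course demo):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Track test error for odd K values (odd K helps avoid ties in a two-class vote)
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error = 1 - knn.score(X_test, y_test)
    print(k, round(error, 3))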
import numpy as np
from sklearn import datasets
from sklearn import neighbors
import pylab as pl
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
iris = datasets.load_iris()
print(iris.keys())
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
n_samples, n_features = iris.data.shape
print((n_samples, n_features))
(150, 4)
print(iris.data[0])
[5.1 3.5 1.4 0.2]
print(iris.target.shape)
(150,)
print(iris.target)
print(iris.target_names)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']
x_index = 0
y_index = 1
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
c=iris.target, cmap=plt.colormaps.get_cmap('RdYlBu'))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
Classification model We use K-nearest neighbors (k-NN), which is one of the simplest learning strategies:
given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.
Let’s try it out on our iris classification problem:
Prepare the data
Initialize the model object
fit the model to the data
Make a prediction
X, y = iris.data, iris.target
clf = neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
result = clf.predict([[3, 5, 4, 2],])
print(iris.target_names[result])
['versicolor']
You can also do probabilistic predictions, i.e. check individual probability of this data point belonging to each of the classes:
clf.predict_proba([[3, 5, 4, 2],])
array([[0. , 0.8, 0.2]])
Let’s visualize k-NN predictions on a plot.
We take a ‘slice’ of the original dataset, taking only the first two features. This is because we will be drawing a 2D plot, where we can only visualize two features at a time. Then we fit a new k-NN model to this slice, using only two features from the original data. Next, we paint a ‘map’ of predicted classes: we fill the plot area using a mesh grid of colored regions, where each region’s color is based on the class predicted by the model. Finally, we put the data points from the original dataset on the plot as well (in bold).
# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
def plot_iris_knn():
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure()
pl.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlabel('sepal length (cm)')
pl.ylabel('sepal width (cm)')
pl.axis('tight')
plot_iris_knn()
Machine learning is a subset of artificial intelligence (AI) which provides machines the ability to learn automatically & improve from experience without being explicitly programmed.
Support Vector Machine (SVM) is a supervised classification method that separates data using a hyperplane.
The bigger the margin, the better.
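A minimal scikit-learn sketch of a maximum-margin classifier on the built-in iris data (illustrative only); the C parameter trades margin width against training errors, and a smaller C favors a wider margin:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# A linear kernel separates the classes with the widest possible margin;
# smaller C tolerates more margin violations in exchange for a bigger margin.
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))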
K-Means clustering
What is Clustering?
Clustering is the process of dividing the dataset into groups, consisting of similar data-points.
It means grouping of objects based on the information found in the data, describing the objects or their relationship.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('z_data/Mall_Customers.csv')
df.head()
df.columns
Index(['CustomerID', 'Gender', 'Age', 'Annual Income (k$)',
'Spending Score (1-100)'],
dtype='object')
df.info()
df.describe()
df.isnull().sum()
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
plt.figure(1 , figsize = (15 , 6))
n = 0
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
n += 1
plt.subplot(1 , 3 , n)
plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
sns.distplot(df[x] , bins = 15)
plt.title('Distplot of {}'.format(x))
plt.show()
sns.pairplot(df, vars = ['Spending Score (1-100)', 'Annual Income (k$)', 'Age'], hue = "Gender")
plt.figure(1 , figsize = (15 , 7))
plt.title('Scatter plot of Age v/s Spending Score', fontsize = 20)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, s = 100)
plt.show()
X1 = df[['Age' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 15):
algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X1)
inertia.append(algorithm.inertia_)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 15) , inertia , 'o')
plt.plot(np.arange(1 , 15) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()
algorithm = (KMeans(n_clusters = 4 ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X1)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y = centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')
plt.show()
algorithm = (KMeans(n_clusters = 5, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan'))
algorithm.fit(X1)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y = centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')
plt.show()
X2 = df[['Annual Income (k$)' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X2)
inertia.append(algorithm.inertia_)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()
algorithm = (KMeans(n_clusters = 5 ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X2)
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z2 = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z2 = Z2.reshape(xx.shape)
plt.imshow(Z2 , interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')
plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)' , data = df , c = labels2 ,
s = 100 )
plt.scatter(x = centroids2[: , 0] , y = centroids2[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Annual Income (k$)')
plt.show()
X3 = df[['Age' , 'Annual Income (k$)' ,'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
algorithm = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 111, algorithm='elkan'))
algorithm.fit(X3)
inertia.append(algorithm.inertia_)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()
algorithm = (KMeans(n_clusters = 6 ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X3)
labels3 = algorithm.labels_
centroids3 = algorithm.cluster_centers_
y_kmeans = algorithm.fit_predict(X3)
df['cluster'] = pd.DataFrame(y_kmeans)
df.head()
import plotly as py
import plotly.graph_objs as go
trace1 = go.Scatter3d(
x= df['Age'],
y= df['Spending Score (1-100)'],
z= df['Annual Income (k$)'],
mode='markers',
marker=dict(
color = df['cluster'],
size= 10,
line=dict(
color= df['cluster'],
width= 12
),
opacity=0.8
)
)
data = [trace1]
layout = go.Layout(
title= 'Clusters wrt Age, Income and Spending Scores',
scene = dict(
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Spending Score'),
zaxis = dict(title = 'Annual Income')
)
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
df.head()
df.to_csv("z_data/segmented_customers.csv", index = False)
In the example transaction table, itemset {1} appears in 3 transactions (TIDs), while itemset {4} appears in only one, so it has a support count of just 1.
Market Basket Analysis with Apriori Algorithm
Association Rule Learning (ARL) In today's world, where the number of customers and transactions keeps increasing, it has become more valuable to create meaningful results from data and to use them for developing marketing strategies. Revealing hidden patterns in the data in order to compete better, to maximize profit in the face of intense competition in the market, and to establish value-oriented long-term relationships with customers makes a great contribution to determining marketing strategies.
However, developing rule-based strategies by hand is no longer feasible in the big-data world. Offering the right product to the right customer at the right time forms the basis of cross-selling and of loyalty programs aimed at customer retention and increasing lifetime value. Therefore, it has become a crucial point for companies to make product offers by using these patterns of association and to develop effective marketing strategies. Market Basket Analysis is one of the association rule applications: it allows us to predict the products that customers tend to buy in the future by learning a pattern from their past behavior and habits.
There are different algorithms to be used for Association Rules Learning. One of them is the Apriori algorithm. In this project, product association analysis will be handled with “Apriori Algorithm” and the most suitable product offers will be made for the customer who is in the sales process, using the sales data of an e-commerce company.
Dataset Story: • The Online Retail II data set, which includes the sales data of the UK-based online sales store, was used. • Sales data between 01/12/2009 - 09/12/2011 are included in the data set. • The product catalog of this company includes souvenirs.
Business Problem: Suggesting products to users at the basket stage. In this study, we will apply Market Basket analysis using the Apriori algorithm. In this context, we will consider the work in 5 steps:
Import Data & Data Preprocessing
Preparing Invoice-Product Matrix for ARL Data Structure
Determination of Association Rules
Suggesting appropriate product offers to customers at the basket stage
Functionalization
Variables Descriptions:
• InvoiceNo: Invoice Number -> If this code starts with C, it means that the operation has been canceled.
• StockCode: Product Code -> Unique number for each product
• Description: Product name
• Quantity: Number of products -> how many of each product on the invoice were sold
• InvoiceDate
• Price: Unit price of the product
• CustomerID: Unique customer number
• Country
# Import Libraries
import pandas as pd
# For Association Rules Learning & Apriori
# !pip install mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
# Setting Configurations:
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Import Warnings:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
df = pd.read_excel('z_data/online_retail_II.xlsx')
df.head()
df.info()
df.isna().sum()
df.dropna(inplace=True)
df.isna().sum()
df.shape
(417534, 8)
df.describe().T
# Let's first determine cancelled transactions (Invoice Id contains value "C") and then remove them:
df_Invoice = pd.DataFrame({"Invoice":[row for row in df["Invoice"].values if "C" not in str(row)]})
df_Invoice.head()
df_Invoice = df_Invoice.drop_duplicates("Invoice")
# The transactions except cancelled transactions:
df = df.merge(df_Invoice, on = "Invoice")
df
# Outlier Detection:
# Let's determine the low and up limits that will be used to cap outlier values:
def outlier_thresholds(dataframe, variable):
quartile1 = dataframe[variable].quantile(0.01)
quartile3 = dataframe[variable].quantile(0.99)
interquantile_range = quartile3 - quartile1
up_limit = quartile3 + 1.5 * interquantile_range
low_limit = quartile1 - 1.5 * interquantile_range
return low_limit, up_limit
# Replace outliers with thresholds
def replace_with_thresholds(dataframe, variable):
low_limit, up_limit = outlier_thresholds(dataframe, variable)
dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit
df.dtypes
num_cols = [col for col in df.columns if df[col].dtypes in ["int64","float64"] and "ID" not in col]
print(num_cols)
['Quantity', 'Price']
for col in num_cols:
replace_with_thresholds(df, col)
df.describe().T
df = df[df["Quantity"] > 0]
df = df[df["Price"] > 0]
# Unique Number of Products (with Description)
df.Description.nunique()
# Unique Number of Products (with StockCode)
df.StockCode.nunique()
The unique values of these 2 variables (Description & StockCode) should be equal, because each stock code represents a product:
# 1st Step
df_product = df[["Description","StockCode"]].drop_duplicates()
df_product = df_product.groupby(["Description"]).agg({"StockCode":"count"}).reset_index()
df_product.sort_values("StockCode", ascending=False).head()
df_product.rename(columns={'StockCode':'StockCode_Count'},inplace=True)
df_product = df_product[df_product["StockCode_Count"]>1]
Let's delete products with more than one stock code:
df = df[~df["Description"].isin(df_product["Description"])]
print(df.StockCode.nunique())
print(df.Description.nunique())
3969
4419
# 2nd Step
df_product = df[["Description","StockCode"]].drop_duplicates()
df_product = df_product.groupby(["StockCode"]).agg({"Description":"count"}).reset_index()
df_product.rename(columns={'Description':'Description_Count'},inplace=True)
df_product = df_product.sort_values("Description_Count", ascending=False)
df_product.head()
df_product = df_product[df_product["Description_Count"] > 1]
df_product.head()
Let's delete stock codes that represent multiple products:
df = df[~df["StockCode"].isin(df_product["StockCode"])]
# Now each stock code represents a single product:
print(df.StockCode.nunique())
print(df.Description.nunique())
3550
3550
The 'POST' stock code represents the postage cost; let's delete it since it is not a product:
df = df[~df["StockCode"].str.contains("POST", na=False)]
We'll handle sales data of Germany as an example:
df_germany = df[df["Country"] == "Germany"]
df_germany.shape
(4628, 8)
def create_invoice_product_df(dataframe, id=False):
if id:
return dataframe.groupby(['Invoice', "StockCode"])['Quantity'].sum().unstack().fillna(0). \
applymap(lambda x: 1 if x > 0 else 0)
else:
return dataframe.groupby(['Invoice', 'Description'])['Quantity'].sum().unstack().fillna(0). \
applymap(lambda x: 1 if x > 0 else 0)
gr_inv_pro_df = create_invoice_product_df(df_germany, id=True)
gr_inv_pro_df.head()
# Let's define a function to find the product name corresponding to the stock code:
def check_id(dataframe, stockcode):
product_name = dataframe[dataframe["StockCode"] == stockcode]["Description"].unique()[0]
return stockcode, product_name
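The step that actually generates the rules table discussed below is not shown above; a sketch of it with mlxtend's apriori and association_rules might look like this (the min_support and min_threshold values are assumptions, not taken from the course):
# Frequent itemsets from the Germany invoice-product matrix (support threshold assumed)
frequent_itemsets = apriori(gr_inv_pro_df, min_support=0.01, use_colnames=True)
# Association rules with their support, confidence and lift metrics
rules = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)
rules.sort_values("support", ascending=False).head()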
Let's explain the metrics we see in the table above:
• antecedent support: if X is the antecedent, 'antecedent support' is the proportion of transactions that contain X.
• consequent support: if Y is the consequent, 'consequent support' is the proportion of transactions that contain Y.
• support: the proportion of transactions that contain both X and Y.
• confidence: the probability of buying Y when X is bought.
• lift: how many times the probability of buying Y increases when X is bought.
Let's sort dataframe by lift:
sorted_rules = rules.sort_values("lift", ascending=False)
4. Suggesting a Product to Users at the Basket Stage
We can develop different strategies at the product offer stage.
For example, When X is bought, we can sort according to the probability of buying Y (confidence) and make a product offer, or we can make an offer according to how many times the probability of sales over the lift increases. We can also make a product recommendation with a hybrid filtering where support, lift and confidence are used together.
If a user buys the product whose id is 22728, which products do you recommend?
product_id = 22728
check_id(df, product_id)
(22728, 'ALARM CLOCK BAKELIKE PINK')
product_id = 22728
recommendation_list = []
for idx, product in enumerate(sorted_rules["antecedents"]):
# the antecedents are stored as tuples/frozensets, so convert to a list and search within it:
for j in list(product):
if j == product_id:
# use the index (idx) of the matching antecedent to look at that rule's consequents and recommend the first product [0]
recommendation_list.append(list(sorted_rules.iloc[idx]["consequents"])[0])
recommendation_list = list( dict.fromkeys(recommendation_list) )
list_top5 = recommendation_list[0:5]
list_top5
[22741, 22419, 22752, 21578, 20719]
for item in list_top5:
product_id = item
print(check_id(df, product_id))