Introduction to Data Science
Statistics and Probability
Basics of Machine Learning
Linear Regression
Logistic Regression
Decision Tree
Random Forest
K Nearest Neighbor
Naive Bayes
Support Vector Machine
K-Means Clustering
Association Rule Mining
Reinforcement Learning
Deep Learning
Data Science Interview Questions
Need for Data Science
Walmart Use Case
What is Data Science?
Who is a Data Scientist?
Data Science - Skill Set
Data Science Job Roles
Data Life Cycle
Introduction to Machine Learning
K-Means Use Case
K-Means Algorithm
Hands-On
Data Science Certification
Data Sources: Mobile Phone, PC, Smart Car
IoT: Internet of Things
Social Media
Other Factors: Retail, Banking & Finance, Insurance, Transportation, Government, Education, Healthcare, Media & Entertainment
What is Data Science? Data Science is the process of extracting knowledge and insights from data using scientific methods.
Scientific methods: Programming + Statistics + Business
Who is a Data Scientist?
Mathematics, Business and Technology
Data Science - Skill Set
Statistics, Programming languages, Data extraction & processing, Data wrangling & exploration, Machine Learning, Big Data processing frameworks, Data visualization
Data Scientist: Programming
Data Analyst: Visualization, SQL, R, Python
Data Architect: Blueprints, security, data modeling
Data Engineer: Builds and tests scalable systems; Java, C++, MATLAB
Statisticians: Math
Database Administrator: SQL
Business Analyst: Business growth, data modeling
Data & Analytics Manager: Data Science operation
Business Requirement: Understand the problem. Identify central objectives, identify variables that need to be predicted.
Data Acquisition: What data do I need for my project? What are the data sources? How can I obtain the data? What is the most efficient way to store and access all of it?
Data Processing: Transform the data into the desired format; handle missing values and corrupted data.
Data Exploration: Understand the patterns in the data, retrieve insights, and form hypotheses.
Modelling: Determine optimal data features for the machine-learning model. Create a model that predicts the target accurately. Evaluate & test the efficiency of model.
Deployment: Check the deployment environment for dependency issues. Deploy the model in a pre-production/test environment. Monitor the performance
Agenda
What is data?: Data refers to facts and statistics collected together for reference or analysis. Data is collected, stored, measured, analyzed, and visualized.
Categories of data: Data falls into two categories, Qualitative and Quantitative. Qualitative data is either Nominal or Ordinal; Quantitative data is either Discrete or Continuous. Nominal Data: data with no inherent order or ranking, such as gender or race. Ordinal Data: data with an ordered series, such as ratings of Good, Average, Bad.
Quantitative: Discrete and Continuous. Discrete data, also known as categorical data, can hold only a finite number of possible values. Example: the number of students in a class.
Continuous Data: data that can hold an infinite number of possible values. Example: the weight of a person.
What is statistics? Statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
Basic terminologies in statistics: Population: a collection or set of individuals, objects, or events whose properties are to be analyzed. Sample: a subset of the population is called a "sample." A well-chosen sample will contain most of the information about a particular population parameter.
Sampling techniques: Random sampling, Systematic sampling (every nth record), and Stratified sampling (strata share at least one common characteristic).
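As a rough illustration of these three techniques (my own sketch, not part of the course material — the customer table and column names are made up):
import pandas as pd
import numpy as np
# Hypothetical population: 1,000 customers split into three segments
population = pd.DataFrame({
    "customer_id": range(1000),
    "segment": np.random.choice(["A", "B", "C"], size=1000),
})
# Random sampling: every record has an equal chance of selection
random_sample = population.sample(n=100, random_state=42)
# Systematic sampling: take every nth record (here n = 10)
systematic_sample = population.iloc[::10]
# Stratified sampling: sample within each segment so every stratum is represented
stratified_sample = population.groupby("segment", group_keys=False).sample(frac=0.1, random_state=42)
print(len(random_sample), len(systematic_sample), len(stratified_sample))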
Types of statistics: Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations or through graphs and tables (e.g., maximum, average, minimum). Descriptive statistics focuses on the main characteristics of the data and provides a graphical summary of it.
Probability:
Inferential statistics:
Descriptive Statistics: is a method used to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data.
Descriptive Statistics are broken down into two categories:
Measures of Central tendency
Measures of Variability (spread)
Measures of Spread: Range, Interquartile Range, Variance, Standard Deviation
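A quick NumPy sketch of these four measures on a made-up sample (illustrative only):
import numpy as np
sample = np.array([4, 8, 15, 16, 23, 42])            # made-up sample
data_range = sample.max() - sample.min()              # Range
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1                                         # Interquartile Range
variance = sample.var(ddof=1)                         # sample variance
std_dev = sample.std(ddof=1)                          # standard deviation
print(data_range, iqr, variance, std_dev)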
Entropy measures the impurity or uncertainty present in the data
Information Gain (IG) indicates how much information a particular feature variable gives us about the final outcome.
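A minimal sketch of both quantities on a toy split, using only the Python standard library (the labels and the candidate split are made up for illustration):
from collections import Counter
from math import log2
def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a candidate split
# Information gain = parent entropy - weighted entropy of the child nodes
weighted = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - weighted
print(entropy(parent), info_gain)   # 1.0 and 1.0 for this perfectly separating split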
Decision Tree
A confusion matrix is a table that is often used to describe the performance of a classification model (or classifier) on a set of test data for which the true values are known
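A small illustrative sketch of how such a table is produced and read with scikit-learn (the true and predicted labels below are made up):
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # known true values (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier's predictions (made up)
# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))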
set.seed(1)
#Generate random numbers and store it in a variable called data
data = runif(20,1,10)
#data <- c(1, 2, 3, 4, 5, 6, 7, 7, 8, 9)
print(data)
#Calculate Mean
mean = mean(data)
print(mean)
#Calculate Median
median = median(data)
print(median)
#Create a function to calculate the Mode
mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
result <- mode(data)
cat("mode =", result, "\n")
#Calculate Variance and std Deviation
variance = var(data)
standardDeviation = sqrt(var(data))
print(standardDeviation)
#plot Histogram
hist(data, breaks=10, xlim=c(0,10), border='black')
Probability is the measure of how likely an event will occur.
Probability is the ratio of desired outcomes to total outcomes. (desired outcomes) / (total outcomes)
The probabilities of all outcomes always sum to 1
Example:
On rolling a die, there are 6 possible outcomes.
Each outcome is equally likely, so each has a probability of 1/6
For example, the probability of getting a 2 on the die is 1/6
Terminologies in probability
Random Experiment: An experiment or a process for which the outcome cannot be predicted with certainty
Sample Space: The entire possible set of outcomes of a random experiment is the sample space of that experiment
Event: One or more outcomes of an experiment. It is a subset of sample space.
Probability Distribution
Probability Density Function:
Normal Distribution:
Central Limit Theorem:
Probability Density Function
Central Limit Theorem: states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough.
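A short NumPy simulation of this statement (my own illustration): the means of repeated samples drawn from a skewed exponential population concentrate around the true mean and look increasingly normal as the sample size n grows.
import numpy as np
rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, clearly non-normal
for n in (2, 30, 200):                                  # increasing sample sizes
    sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]
    # As n grows, the distribution of sample means narrows around the true mean (2.0)
    print(n, np.mean(sample_means), np.std(sample_means))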
Marginal Probability
Joint Probability
Conditional Probability
Marginal Probability is the probability of occurrence of a single event
Joint Probability is a measure of two events happening at the same time
Example: The probability that a card is an Ace of hearts = p(Ace of hearts)
There are 13 heart cards in a deck of 52, and only one of them is the Ace of Hearts, so p(Ace of Hearts) = 1/52
The probability that a candidate has undergone Edureka's training is 45/105 ≈ 0.43
Find the probability that a candidate has attended Edureka's training and also has good package.
Find the probability that a candidate has a good package given that he has not undergone training.
Bayes' Theorem shows the relation between one conditional probability and its inverse: P(A|B) = P(B|A) * P(A) / P(B)
Point Estimation is concerned with the use of the sample data to measure a single value which serves as an approximate value or the best estimate of an unknown population parameter.
4 Ways to find estimates
Method of Moments: estimates are found by equating the first k sample moments to the corresponding k population moments
Maximum Likelihood: uses a model and chooses the parameter values that maximize the likelihood function, i.e., the values under which the observed data are most probable (see the sketch after this list)
Bayes' Estimators: minimize the average risk (an expectation over random variables)
Best Unbiased Estimators: several unbiased estimators can be used to approximate a parameter (which one is "best" depends on which parameter you are trying to find)
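As a concrete illustration of the maximum likelihood idea referenced above (my own sketch, not from the course): for a normal model the likelihood is maximized by the sample mean and the uncorrected sample standard deviation, which we can verify numerically on simulated data.
import numpy as np
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)     # simulated sample
# For a normal model, the likelihood is maximized at these parameter values:
mu_hat = data.mean()                                   # MLE of the mean
sigma_hat = np.sqrt(((data - mu_hat) ** 2).mean())     # MLE of the std (divides by n, not n-1)
print(mu_hat, sigma_hat)                               # close to the true values 5.0 and 2.0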
Confidence Interval is the measure of your confidence that the interval estimate contains the population mean.
Statisticians use a confidence interval to describe the amount of uncertainty associated with a sample estimate of a population parameter.
Technically, it is a range of values constructed so that there is a specified probability of including the true value of the parameter within it.
The difference between the point estimate and the actual population parameter value is called the Sampling Error.
When μ is estimated by the sample mean x̄, the sampling error is the difference μ − x̄.
Margin of Error E: for a given level of confidence, the greatest possible distance between the point estimate and the value of the parameter it is estimating.
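A small sketch tying these ideas together — a point estimate, its margin of error E at 95% confidence using the normal critical value, and the resulting confidence interval (the data are simulated; this is an illustration, not part of the course demo):
import numpy as np
from scipy import stats
rng = np.random.default_rng(7)
sample = rng.normal(loc=100, scale=15, size=50)       # simulated sample
x_bar = sample.mean()                                 # point estimate of the mean
s = sample.std(ddof=1)
z = stats.norm.ppf(0.975)                             # 95% confidence -> z ≈ 1.96
E = z * s / np.sqrt(len(sample))                      # margin of error
print(f"point estimate = {x_bar:.2f}, CI = ({x_bar - E:.2f}, {x_bar + E:.2f})")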
Agenda
Need for machine learning
What is machine learning?
Machine learning definitions
Machine Learning process
Types of machine learning
Type of problems solved using machine learning
Demo
What is Regression?
Regression Use-Case
Types of Regression Linear vs Logistic Regression
What is Linear Regression?
Finding best fit regression line using Least Square Method
Checking goodness of fit using R squared Method
Implementation of Linear Regression using Python
Linear Regression Algorithm using Python from scratch
Linear Regression Algorithm using Python (scikit lib)
Three major uses for regression analysis are
Determining the strength of predictors
Forecasting an effect, and
Trend forecasting
Selection Criteria
Classification and Regression Capabilities
Data Quality
Computational Complexity
Comprehensible and Transparent
Where is Linear Regression used?
Evaluating Trends and Sales Estimates
Analyzing the impact of Price Changes
Assessment of risk in financial services and insurance domain
R-Squared value is a statistical measure of how close the data are to the fitted regression line.
It is also known as coefficient of determination, or the coefficient of multiple determination.
Regression
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0, 4.0)
data = pd.read_csv('headbrain.csv')
print(data.shape)
data.head()
(237, 4)
X = data['Head Size(cm^3)'].values
Y = data['Brain Weight(grams)'].values
mean_x = np.mean(X)
mean_y = np.mean(Y)
m = len(X)
numer = 0
denom = 0
for i in range(m):
numer += (X[i] - mean_x) * (Y[i] - mean_y)
denom += (X[i] - mean_x) ** 2
b1 = numer / denom
b0 = mean_y - (b1 * mean_x)
# b1 is the slope (m) of the regression line
# b0 is the intercept (c)
print(b1, b0)
0.26342933948939945 325.57342104944223
max_x = np.max(X) + 100
min_x = np.min(X) - 100
x = np.linspace(min_x, max_x, 1000)
y = b0 + b1 * x
plt.plot(x, y, color='#58b970', label='Regression Line')
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('Head size in cm^3')
plt.ylabel('Brain Weight in grams')
plt.legend()
plt.show()
# ss_t: total sum of squares
# ss_r: residual sum of squares
ss_t = 0
ss_r = 0
for i in range(m):
y_pred = b0 + b1 * X[i]
ss_t += (Y[i] - mean_y) ** 2
ss_r += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_r/ss_t)
print(r2)
0.6393117199570003
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X = X.reshape((m, 1))
reg = LinearRegression()
reg = reg.fit(X, Y)
Y_pred = reg.predict(X)
# RMSE and R2 Score
mse = mean_squared_error(Y, Y_pred)
rmse = np.sqrt(mse)
r2_score = reg.score(X, Y)
print(np.sqrt(mse))
72.1206213783709
print(r2_score)
0.639311719957
Agenda
What is Regression?
Logistic regression: What and Why?
Linear VS Logistic Regression
Use-Cases
Demo
What is Regression? Regression Analysis is a predictive modeling technique
It estimates the relationship between a dependent (target) and an independent variable (predictor)
Logistic Regression produces results in a binary format which is used to predict the outcome of a categorical dependent variable. So the outcome should be discrete/categorical, such as 0 or 1, Yes or No, True or False, High or Low.
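Under the hood, the discrete outcome comes from passing a linear combination of the inputs through the sigmoid function and thresholding the result; a minimal NumPy sketch with made-up coefficients:
import numpy as np
def sigmoid(z):
    # Maps any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))
# Made-up coefficients and a single input value
b0, b1 = -4.0, 1.5
x = 3.2
p = sigmoid(b0 + b1 * x)          # predicted probability of class 1
prediction = int(p >= 0.5)        # threshold at 0.5 -> 0 or 1
print(p, prediction)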
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('z_data/train.csv')
train.info()
train.head()
Exploratory Data Analysis Let's begin some exploratory data analysis! We'll start by checking out missing data!
Missing Data We can use seaborn to create a simple heatmap to see where we are missing data!
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Roughly 20 percent of the Age data is missing. The proportion of Age missing is likely small enough for reasonable replacement with some form of imputation. Looking at the Cabin column, it looks like we are just missing too much of that data to do something useful with at a basic level. We'll probably drop this later.
sns.set_style('whitegrid')
sns.countplot(x='Survived', data=train, palette='viridis')
sns.countplot(x='Survived', hue='Sex', data=train, palette='viridis')
A greater number of female passengers survived the tragedy compared to the male passengers on the ship.
sns.countplot(x='Survived', hue= 'Pclass', data=train, palette='viridis')
The death rate is higher in passengers who were in third class.
sns.countplot(x='Survived', hue= 'Embarked', data=train, palette='viridis')
sns.distplot(train['Age'].dropna(),kde=False, bins=30, color='blue' )
sns.countplot(x='Parch', data=train, palette='viridis')
sns.countplot(x='SibSp', data=train, palette='viridis')
The plot indicates that very few passengers had siblings, spouses, parents, or children aboard. This is consistent with the age distribution, which suggests more young adults were on the ship.
Data Cleaning We want to fill in missing age data instead of just dropping the missing age data rows. One way to do this is by filling in the mean age of all the passengers (imputation). However we can be smarter about this and check the average age by passenger class.
import cufflinks as cf
cf.go_offline()
box_age = train[['Pclass', 'Age']]
box_age.pivot(columns='Pclass', values='Age').iplot(kind='box')
We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average age values to impute based on Pclass for Age.
def impute_age(cols):
Age = cols[0]
Pclass = cols[1]
if pd.isnull(Age):
if Pclass == 1:
return 37
elif Pclass == 2:
return 29
else:
return 24
else:
return Age
train['Age'] = train[['Age', 'Pclass']].apply(impute_age,axis=1)
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Great! Let's go ahead and drop the Cabin column
train.drop('Cabin', axis=1, inplace=True)
train.head()
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Converting Categorical Features We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
sex = pd.get_dummies(train['Sex'], dtype="int", drop_first= True)
embark = pd.get_dummies(train['Embarked'], dtype="int", drop_first= True)
train = pd.concat([train, sex, embark], axis=1)
train.head()
train.drop(['Sex','Embarked', 'Name', 'Ticket'], axis=1, inplace=True)
train.head(3)
Building a Logistic Regression model
X = train.drop('Survived', axis=1)
y = train['Survived']
X.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=101)
Training and Predicting
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(solver='liblinear')
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
Evaluation
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)
0.7761194029850746
import numpy as np
import pandas as pd
import matplotlib as plt
%matplotlib inline
dataset = pd.read_csv("z_data/suv_prediction.csv")
dataset.head(10)
X = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values  # 1-D target vector
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # scale the test set with the scaler fitted on the training data
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)
array([[63,  5],
       [ 8, 24]])
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
0.87
What is Classification? For example, Gmail classifies an incoming email as spam or not spam.
Types of Classification
Classification Use Cases: credit card fraud detection, car evaluation
What is Decision Tree?
Terminologies associated to a Decision Tree
Visualizing a Decision Tree
Writing a Decision Tree Classifier from Scratch in Python using CART Algorithm
Types of Classification
Decision Tree:
Graphical representation of all the possible solutions to a decision
Decisions are based on some conditions
Decision made can be easily explained.
Random Forest
Builds multiple decision trees and merges them together
More accurate and stable prediction
Random decision forests correct for decision trees' habit of overfitting to their training set
Trained with the bagging method
Naive Bayes
Classification technique based on Bayes Theorem
Assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature
K-Nearest Neighbours
Stores all the available cases and classifies new cases based on a similarity measure
The K in the KNN algorithm is the number of nearest neighbours we wish to take a vote from
A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions.
Gini Index: The measure of impurity (or purity) used in building decision tree in CART is Gini Index
Information Gain: the information gain is the decrease in entropy after a dataset is split on the basis of an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain.
Reduction in Variance: an algorithm used for continuous target variables (regression problems). The split with the lower variance is selected as the criterion to split the population.
Chi-Square: an algorithm to find the statistical significance of the differences between sub-nodes and the parent node.
# Sample dataset
# Format: each row is an example
# The Last column is the label
# The first two columns are features
# If you want you can add more features & examples
# Interesting note: the 2nd and 5th examples have the same features, but different labels
# Let's see how the tree handles this case
training_data = [
['Green', 3, 'Mango'],
['Yellow', 3, 'Mango'],
['Red', 1, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon']
]
# Column Labels
# These are used only to print the tree.
header = ["color", "diameter", "label"]
def unique_vals(rows, col):
return set([row[col] for row in rows])
# unique_vals(training_data, 0)
# unique_vals(training_data, 1)
def class_counts(rows):
"""Counts the number of each type of example in a dataset"""
counts = {} # a dictionary of Label -> count
for row in rows:
# in our dataset format, the Label is always the Last column
label = row[-1]
if label not in counts:
counts[label] = 0
counts[label] += 1
return counts
# class_counts(training_data)
def is_numeric(value):
"""Test if a value is numeric"""
return isinstance(value, int) or isinstance(value, float)
# is_numeric(7)
# is_numeric(Red)
class Question:
"""A Question is used to partition a dataset
This class just records a column number (e.g., 0 for color) and a
'column value' (e.g., Green). The 'match' method is used to compare
the feature value in an example to the feature value stored in the
question. See the demo below.
"""
def __init__(self, column, value):
self.column = column
self.value = value
def match(self, example):
# Compare the feature value in an example to the
# feature value in the question
val = example[self.column]
if is_numeric(val):
return val >= self.value
else:
return val == self.value
def __repr__(self):
# This is just a helper method to print
# the question in a readable format
condition = "=="
if is_numeric(self.value):
condition = ">="
return "IS %s %s %s?" % (
header[self.column], condition, str(self.value))
def partition(rows, question):
"""Partitions a dataset.
For each row in the dataset, check if it matches the question. If
so, add it to 'true rows', otherwise, add it to the 'false rows'
"""
true_rows, false_rows = [], []
for row in rows:
if question.match(row):
true_rows.append(row)
else:
false_rows.append(row)
return true_rows, false_rows
# Let's partition the training data based on whether rows are Red.
# true_rows, false_rows = partition(training_data, Question(0, 'Red'))
# This will contain all the 'Red' rows
# true_rows
# This will contain everything else.
# false_rows
def gini(rows):
"""Calculate the Gini Impurity for a list of rows
There are a few different ways to do this, I thought this one was
the most concise
"""
counts = class_counts(rows)
impurity = 1
for lbl in counts:
prob_of_lbl = counts[lbl] / float(len(rows))
impurity -= prob_of_lbl**2
return impurity
def info_gain(left, right, current_uncertainty):
"""Information Gain.
The uncertainty of the starting node, minus the weighted impurity of
two child nodes
"""
p = float(len(left)) / (len(left) + len(right))
return current_uncertainty - p * gini(left) - (1 - p) * gini(right)
# Calculate the uncertainty of our training data
# current_uncertainty = gini(training_data)
# How much information do we gain by partitioning on 'Green'?
# true_rows, false_rows = partition(training_data, Question(0, 'Green'))
# info_gain(true_rows, false_rows, current_uncertainty)
# What about if we partitioned on "Red" instead?
# true_rows, false_rows = partition(training_data, Question(0, 'Red'))
# info_gain(true_rows, false_rows, current_uncertainty)
def find_best_split(rows):
"""Find the best question to ask by iterating over every feature / vale
and calculating the information gain
"""
best_gain = 0 # keep track of the best information gain
best_question = None # keep track of the feature / value that produced it
current_uncertainty = gini(rows)
n_features = len(rows[0]) - 1 # number of columns
for col in range(n_features): # for each feature
values = set([row[col] for row in rows]) # unique values in the column
for val in values: # for each value
question = Question(col, val)
# try splitting the dataset
true_rows, false_rows = partition(rows, question)
# skip this split if it doesn't divide the
# dataset
if len(true_rows) == 0 or len(false_rows) == 0:
continue
# Calculate the information gain from this split
gain = info_gain(true_rows, false_rows, current_uncertainty)
# You actually can use '>' instead of '>=' here
# but I wanted the tree to look a certain way for our
# toy dataset
if gain >= best_gain:
best_gain, best_question = gain, question
return best_gain, best_question
class Leaf:
"""A Leaf node classified data.
This holds a dictionary of class (e.g. 'Mango') -> number of times
it appears in the rows from the training data that reach this leaf
"""
def __init__(self, rows):
self.predictions = class_counts(rows)
class Decision_Node:
"""A Decision Node asks a question
This holds a reference to the question, and to the two child nodes
"""
def __init__(self,
question,
true_branch,
false_branch):
self.question = question
self.true_branch = true_branch
self.false_branch = false_branch
def build_tree(rows):
"""Builds the tree
Rules of recursion: 1) Believe that it works. 2) Start by checking
for the base case (no further information gain). 3) Prepare for giant stack traces.
"""
# Try partitioning the dataset on each of the unique attributes,
# calculate the information gain,
# and return the question that produces the highest gain
gain, question = find_best_split(rows)
# Base case: no further info gain
# Since we can ask no further question
# we'll return a Leaf
if gain == 0:
return Leaf(rows)
# If we reach here, we have found a useful feature / value
# to partition on
true_rows, false_rows = partition(rows, question)
# Recursively build the true branch
true_branch = build_tree(true_rows)
# Recursively build the false branch
false_branch = build_tree(false_rows)
# Return a question node
# This records the best feature / value to ask at this point
# as well as the branches to follow
# depending on the answer
return Decision_Node(question, true_branch, false_branch)
def print_tree(node, spacing=""):
# Base case: We've reached a Leaf
if isinstance(node, Leaf):
print (spacing + "Predict", node.predictions)
return
# Print the question at this node
print (spacing + str(node.question))
# Call this function recursively on the true branch
print (spacing + '--> True:')
print_tree(node.true_branch, spacing + " ")
# Call this function recursively on the false branch
print (spacing + '--> False:')
print_tree(node.false_branch, spacing + " ")
def classify(row, node):
# Base case: We've reached a Leaf
if isinstance(node, Leaf):
return node.predictions
# Decide whether to follow the true branch or the false branch
# Compare the feature / value stored in the node.
# to the example we're considering
if node.question.match(row):
return classify(row, node.true_branch)
else:
return classify(row, node.false_branch)
def print_leaf(counts):
"""Print the predictions at a leaf"""
total = sum(counts.values()) * 1.0
probs = {}
for lbl in counts.keys():
probs[lbl] = str(int(counts[lbl] / total * 100)) + "%"
return probs
if __name__ == '__main__':
my_tree = build_tree(training_data)
print_tree(my_tree)
testing_data = [
['Green', 3, 'Apple'],
['Yellow', 4, 'Apple'],
['Red', 2, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon']
]
for row in testing_data:
print ("Actual: %s. Predicted: %s" %
(row[-1], print_leaf(classify(row, my_tree))))
Decision Tree: Decision tree builds classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets.
Random Forest: Random Forest is an ensemble classifier made using many decision tree models. Ensemble models combine the results from different models.
Naive Bayes: It is a classification technique based on Bayes' Theorem with an assumption of independence among attributes.
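A minimal scikit-learn sketch of the Naive Bayes idea, using Gaussian Naive Bayes on the built-in iris dataset (an illustration of my own, not part of the course demo):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Each feature is assumed independent given the class (the "naive" assumption)
model = GaussianNB()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))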
Why Random Forest?
Random Forest is a versatile algorithm capable of performing both Regression and Classification.
It is a type of ensemble learning method.
It is a commonly used predictive modelling and machine learning technique.
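A short scikit-learn sketch of the ensemble idea described above, again on the built-in iris data (illustrative only):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 100 trees, each trained on a bootstrap sample (bagging); predictions are combined by vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))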
What is KNN Algorithm?
Industrial Use case of KNN Algorithm.
How things are predicted using KNN Algorithm
How to choose the value of K?
KNN Algorithm Using Python
Implementation of KNN Algorithm from scratch
Compare with the model built using scikit-learn
K Nearest Neighbour is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure.
K = 3 means the three nearest neighbours are taken to vote on the class of the new point.
Two orange neighbours vs. one blue, so the new point belongs to the orange class.
How to choose K?
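One common way to choose K is to evaluate the classifier over a range of K values on held-out data and pick the K with the lowest error; a sketch on the iris data (illustrative, not the course demo):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Track test error for odd K values (odd K helps avoid ties in a two-class vote)
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error = 1 - knn.score(X_test, y_test)
    print(k, round(error, 3))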
import numpy as np
from sklearn import datasets
from sklearn import neighbors
import pylab as pl
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
iris = datasets.load_iris()
print(iris.keys())
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
n_samples, n_features = iris.data.shape
print((n_samples, n_features))
(150, 4)
print(iris.data[0])
[5.1 3.5 1.4 0.2]
print(iris.target.shape)
(150,)
print(iris.target)
print(iris.target_names)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']
x_index = 0
y_index = 1
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
c=iris.target, cmap=plt.colormaps.get_cmap('RdYlBu'))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
Classification model We use K-nearest neighbors (k-NN), which is one of the simplest learning strategies:
given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.
Let’s try it out on our iris classification problem:
Prepare the data
Initialize the model object
fit the model to the data
Make a prediction
X, y = iris.data, iris.target
clf = neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
result = clf.predict([[3, 5, 4, 2],])
print(iris.target_names[result])
['versicolor']
You can also do probabilistic predictions, i.e. check individual probability of this data point belonging to each of the classes:
clf.predict_proba([[3, 5, 4, 2],])
array([[0. , 0.8, 0.2]])
Let’s visualize k-NN predictions on a plot.
We take a ‘slice’ of the original dataset, taking only the first two features. This is because we will be drawing a 2D plot, where we can only visualize two features at a time. Then we fit a new k-NN model to this slice, using only two features from the original data. Next, we paint a ‘map’ of predicted classes: we fill the plot area using a mesh grid of colored regions, where each region’s color is based on the class predicted by the model. Finally, we put the data points from the original dataset on the plot as well (in bold).
# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
def plot_iris_knn():
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure()
pl.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlabel('sepal length (cm)')
pl.ylabel('sepal width (cm)')
pl.axis('tight')
plot_iris_knn()
Machine learning is a subset of artificial intelligence (AI) which provides machines the ability to learn automatically & improve from experience without being explicitly programmed.
Support Vector Machine (SVM) is a supervised classification method that separates data using a hyperplane.
The bigger the margin, the better.
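A minimal scikit-learn sketch of a maximum-margin classifier on the built-in iris data (illustrative only); the C parameter trades margin width against training errors, and a smaller C favors a wider margin:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# A linear kernel separates the classes with the widest possible margin;
# smaller C tolerates more margin violations in exchange for a bigger margin.
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))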
K-Means clustering
What is Clustering?
Clustering is the process of dividing the dataset into groups, consisting of similar data-points.
It means grouping of objects based on the information found in the data, describing the objects or their relationship.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('z_data/Mall_Customers.csv')
df.head()
df.columns
Index(['CustomerID', 'Gender', 'Age', 'Annual Income (k$)',
'Spending Score (1-100)'],
dtype='object')
df.info()
df.describe()
df.isnull().sum()
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
plt.figure(1 , figsize = (15 , 6))
n = 0
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
n += 1
plt.subplot(1 , 3 , n)
plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
sns.distplot(df[x] , bins = 15)
plt.title('Distplot of {}'.format(x))
plt.show()
sns.pairplot(df, vars = ['Spending Score (1-100)', 'Annual Income (k$)', 'Age'], hue = "Gender")
plt.figure(1 , figsize = (15 , 7))
plt.title('Scatter plot of Age v/s Spending Score', fontsize = 20)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, s = 100)
plt.show()
X1 = df[['Age' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 15):
algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X1)
inertia.append(algorithm.inertia_)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 15) , inertia , 'o')
plt.plot(np.arange(1 , 15) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()
algorithm = (KMeans(n_clusters = 4 ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X1)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y = centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')
plt.show()
algorithm = (KMeans(n_clusters = 5, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan'))
algorithm.fit(X1)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y = centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')
plt.show()
X2 = df[['Annual Income (k$)' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X2)
inertia.append(algorithm.inertia_)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()
algorithm = (KMeans(n_clusters = 5 ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X2)
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z2 = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z2 = Z2.reshape(xx.shape)
plt.imshow(Z2 , interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')
plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)' , data = df , c = labels2 ,
s = 100 )
plt.scatter(x = centroids2[: , 0] , y = centroids2[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Annual Income (k$)')
plt.show()
X3 = df[['Age' , 'Annual Income (k$)' ,'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
algorithm = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 111, algorithm='elkan'))
algorithm.fit(X3)
inertia.append(algorithm.inertia_)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()
algorithm = (KMeans(n_clusters = 6 ,init='k-means++', n_init = 10 ,max_iter=300,
tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X3)
labels3 = algorithm.labels_
centroids3 = algorithm.cluster_centers_
y_kmeans = algorithm.fit_predict(X3)
df['cluster'] = pd.DataFrame(y_kmeans)
df.head()
import plotly as py
import plotly.graph_objs as go
trace1 = go.Scatter3d(
x= df['Age'],
y= df['Spending Score (1-100)'],
z= df['Annual Income (k$)'],
mode='markers',
marker=dict(
color = df['cluster'],
size= 10,
line=dict(
color= df['cluster'],
width= 12
),
opacity=0.8
)
)
data = [trace1]
layout = go.Layout(
title= 'Clusters wrt Age, Income and Spending Scores',
scene = dict(
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Spending Score'),
zaxis = dict(title = 'Annual Income')
)
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
df.head()
df.to_csv("z_data/segmented_customers.csv", index = False)
In the example transaction table, itemset {1} appears in 3 transactions (TIDs), while itemset {4} appears in only one, so it has a support count of just 1.
Market Basket Analysis with Apriori Algorithm
Association Rule Learning (ARL) In today's world, where the number of customers and transactions keeps increasing, it has become more valuable to create meaningful results from data and to use them for developing marketing strategies. Revealing hidden patterns in the data in order to compete better, to maximize profit in the face of intense competition in the market, and to establish value-oriented long-term relationships with customers makes a great contribution to determining marketing strategies.
However, developing rule-based strategies by hand is no longer feasible in the big-data world. Offering the right product to the right customer at the right time forms the basis of cross-selling and of loyalty programs aimed at customer retention and increasing lifetime value. Therefore, it has become a crucial point for companies to make product offers by using these patterns of association and to develop effective marketing strategies. Market Basket Analysis is one of the association rule applications: it allows us to predict the products that customers tend to buy in the future by learning a pattern from their past behavior and habits.
There are different algorithms to be used for Association Rules Learning. One of them is the Apriori algorithm. In this project, product association analysis will be handled with “Apriori Algorithm” and the most suitable product offers will be made for the customer who is in the sales process, using the sales data of an e-commerce company.
Dataset Story: • The Online Retail II data set, which includes the sales data of the UK-based online sales store, was used. • Sales data between 01/12/2009 - 09/12/2011 are included in the data set. • The product catalog of this company includes souvenirs.
Business Problem: Suggesting products to users at the basket stage. In this study, we will apply Market Basket analysis using the Apriori algorithm. In this context, we will consider the work in 5 steps:
Import Data & Data Preprocessing
Preparing Invoice-Product Matrix for ARL Data Structure
Determination of Association Rules
Suggesting appropriate product offers to customers at the basket stage
Functionalization
Variables Descriptions:
• InvoiceNo: Invoice Number -> If this code starts with C, it means that the operation has been canceled.
• StockCode: Product Code -> Unique number for each product
• Description: Product name
• Quantity: Number of products -> how many of each product on the invoice were sold
• InvoiceDate
• Price: Unit price of the product
• CustomerID: Unique customer number
• Country
# Import Libraries
import pandas as pd
# For Association Rules Learning & Apriori
# !pip install mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
# Setting Configurations:
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Import Warnings:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
df = pd.read_excel('z_data/online_retail_II.xlsx')
df.head()
df.info()
df.isna().sum()
df.dropna(inplace=True)
df.isna().sum()
df.shape
(417534, 8)
df.describe().T
# Let's first determine cancelled transactions (Invoice Id contains value "C") and then remove them:
df_Invoice = pd.DataFrame({"Invoice":[row for row in df["Invoice"].values if "C" not in str(row)]})
df_Invoice.head()
df_Invoice = df_Invoice.drop_duplicates("Invoice")
# The transactions except cancelled transactions:
df = df.merge(df_Invoice, on = "Invoice")
df
# Outlier Detection:
# Let's determine the low and up limits that will be used to cap outlier values:
def outlier_thresholds(dataframe, variable):
quartile1 = dataframe[variable].quantile(0.01)
quartile3 = dataframe[variable].quantile(0.99)
interquantile_range = quartile3 - quartile1
up_limit = quartile3 + 1.5 * interquantile_range
low_limit = quartile1 - 1.5 * interquantile_range
return low_limit, up_limit
# Replace outliers with thresholds
def replace_with_thresholds(dataframe, variable):
low_limit, up_limit = outlier_thresholds(dataframe, variable)
dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit
df.dtypes
num_cols = [col for col in df.columns if df[col].dtypes in ["int64","float64"] and "ID" not in col]
print(num_cols)
['Quantity', 'Price']
for col in num_cols:
replace_with_thresholds(df, col)
df.describe().T
df = df[df["Quantity"] > 0]
df = df[df["Price"] > 0]
# Unique Number of Products (with Description)
df.Description.nunique()
# Unique Number of Products (with StockCode)
df.StockCode.nunique()
The unique values of these 2 variables (Description & StockCode) should be equal, because each stock code represents a product:
# 1st Step
df_product = df[["Description","StockCode"]].drop_duplicates()
df_product = df_product.groupby(["Description"]).agg({"StockCode":"count"}).reset_index()
df_product.sort_values("StockCode", ascending=False).head()
df_product.rename(columns={'StockCode':'StockCode_Count'},inplace=True)
df_product = df_product[df_product["StockCode_Count"]>1]
Let's delete products with more than one stock code:
df = df[~df["Description"].isin(df_product["Description"])]
print(df.StockCode.nunique())
print(df.Description.nunique())
3969
4419
# 2nd Step
df_product = df[["Description","StockCode"]].drop_duplicates()
df_product = df_product.groupby(["StockCode"]).agg({"Description":"count"}).reset_index()
df_product.rename(columns={'Description':'Description_Count'},inplace=True)
df_product = df_product.sort_values("Description_Count", ascending=False)
df_product.head()
df_product = df_product[df_product["Description_Count"] > 1]
df_product.head()
Let's delete stock codes that represent multiple products:
df = df[~df["StockCode"].isin(df_product["StockCode"])]
# Now each stock code represents a single product:
print(df.StockCode.nunique())
print(df.Description.nunique())
3550
3550
The 'POST' stock code represents the postage cost; let's delete it since it is not a product:
df = df[~df["StockCode"].str.contains("POST", na=False)]
We'll handle sales data of Germany as an example:
df_germany = df[df["Country"] == "Germany"]
df_germany.shape
(4628, 8)
def create_invoice_product_df(dataframe, id=False):
if id:
return dataframe.groupby(['Invoice', "StockCode"])['Quantity'].sum().unstack().fillna(0). \
applymap(lambda x: 1 if x > 0 else 0)
else:
return dataframe.groupby(['Invoice', 'Description'])['Quantity'].sum().unstack().fillna(0). \
applymap(lambda x: 1 if x > 0 else 0)
gr_inv_pro_df = create_invoice_product_df(df_germany, id=True)
gr_inv_pro_df.head()
# Let's define a function to find the product name corresponding to the stock code:
def check_id(dataframe, stockcode):
product_name = dataframe[dataframe["StockCode"] == stockcode]["Description"].unique()[0]
return stockcode, product_name
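The step that actually generates the rules table discussed below is not shown above; a sketch of it with mlxtend's apriori and association_rules might look like this (the min_support and min_threshold values are assumptions, not taken from the course):
# Frequent itemsets from the Germany invoice-product matrix (support threshold assumed)
frequent_itemsets = apriori(gr_inv_pro_df, min_support=0.01, use_colnames=True)
# Association rules with their support, confidence and lift metrics
rules = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)
rules.sort_values("support", ascending=False).head()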
Let's explain the metrics we see in the table above:
• antecedent support: if X is the antecedent, 'antecedent support' is the proportion of transactions that contain X.
• consequent support: if Y is the consequent, 'consequent support' is the proportion of transactions that contain Y.
• support: the proportion of transactions that contain both X and Y.
• confidence: the probability of buying Y when X is bought.
• lift: how many times the probability of buying Y increases when X is bought.
Let's sort dataframe by lift:
sorted_rules = rules.sort_values("lift", ascending=False)
4. Suggesting a Product to Users at the Basket Stage
We can develop different strategies at the product offer stage.
For example, When X is bought, we can sort according to the probability of buying Y (confidence) and make a product offer, or we can make an offer according to how many times the probability of sales over the lift increases. We can also make a product recommendation with a hybrid filtering where support, lift and confidence are used together.
If a user buys the product whose id is 22728, which products do you recommend?
product_id = 22728
check_id(df, product_id)
(22728, 'ALARM CLOCK BAKELIKE PINK')
product_id = 22728
recommendation_list = []
for idx, product in enumerate(sorted_rules["antecedents"]):
# the antecedents are stored as tuples/frozensets, so convert to a list and search within it:
for j in list(product):
if j == product_id:
# use the index (idx) of the matching antecedent to look at that rule's consequents and recommend the first product [0]
recommendation_list.append(list(sorted_rules.iloc[idx]["consequents"])[0])
recommendation_list = list( dict.fromkeys(recommendation_list) )
list_top5 = recommendation_list[0:5]
list_top5
[22741, 22419, 22752, 21578, 20719]
for item in list_top5:
product_id = item
print(check_id(df, product_id))