Data science

Q1: Explain Machine Learning to a school going kid.

Imagine going to a party and meeting total strangers. You know nothing about them, so you group them on the go by gender, age group, clothing and so on. Sorting people like this, with no prior knowledge, is unsupervised learning.

Q2: Types of Machine Learning (Supervised Learning and Unsupervised Learning)

Supervised Learning:

Is like learning with a teacher
Training dataset is like a teacher which is used to train the machine.
Model is "trained" on a pre-defined dataset before it starts making decisions when given new data.

Unsupervised Learning:

Is like learning without a teacher.
Model learns through observation and finds structures in data.
Model is given a dataset and is left to automatically find patterns and relationships in it, for example by creating clusters.

Reinforcement Learning:

Model learns by trial and error (the hit-and-trial method).
It learns on the basis of the reward or penalty given for every action it performs.

Q3: What's your favorite algorithm?

Linear Regression:
Logistic Regression:
Decision Trees:
Naive Bayes:
KNN:
Support Vector Machine (SVM)
K-Means Clustering
Principal Component Analysis (PCA)
Neural Networks
Random Forests

Q4: How does deep learning differ from machine learning?

Deep Learning is a form of machine learning that is inspired by the structure of the human brain and is particularly effective in feature detection.

Machine Learning is all about algorithms that parse data, learn from that data, and then apply what they've learned to make informed decisions.

Q5: Explain Classification and Regression

Part of supervised learning

Classification (class labels): very cheap, affordable, costly

Regression (continuous values): size, area, location

Q6: What do you understand by selection bias?

A statistical error that causes a bias in the sampling portion of an experiment.
The error causes one sampling group to be selected more often than the other groups included in the experiment.
If it is not identified, selection bias may lead to an inaccurate conclusion.

Q7: What do you understand by Precision and Recall?

Suppose you try to recall the events of a birthday party. If you recall all 10 out of 10 events, the recall ratio is 100%; if you recall only 7 out of 10, the recall ratio is 70%.

If you recall 15 events, of which 10 are correct and 5 are wrong, the precision is only 10/15 = 66.67%.

Number of events you can correctly recall = True positive

Number of all correct events = True positive + False negative (they're correct but you don't recall them)

Number of all events you recall = True positive + False positive (they're not correct but you recall them)

Recall = True positive / (True positive + False negative)

Precision = True positive / (True positive + False positive)
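As a quick check of the two formulas, here is a minimal Python sketch using the numbers from the birthday example. The function names `precision` and `recall` are only illustrative.

```python
def precision(tp, fp):
    """Share of the recalled events that were actually correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of the correct events that were actually recalled."""
    return tp / (tp + fn)


# Scenario 1: 10 real events, only 7 recalled -> 3 false negatives
recall_7_of_10 = recall(tp=7, fn=3)

# Scenario 2: 15 events recalled, 10 correct and 5 wrong -> 5 false positives
precision_10_of_15 = precision(tp=10, fp=5)

print(f"recall = {recall_7_of_10:.2%}")         # 70.00%
print(f"precision = {precision_10_of_15:.2%}")  # 66.67%
```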

Q8: Explain false negative, false positive, true negative and true positive with a simple example.

True Positive: The alarm goes off and there is a fire. Fire is positive and the prediction made by the system is true.

False Positive: The alarm goes off and there is no fire. The system predicted fire to be positive, which is a wrong prediction, hence the prediction is false.

False Negative: The alarm does not go off but there was a fire. The system predicted fire to be negative, which was false since there was a fire.

True Negative: The alarm does not go off and there was no fire. Fire is negative and the prediction was true.

Q9: What is a confusion matrix?

A confusion matrix, or error matrix, is a table that summarizes the performance of a classification algorithm by counting its true positives, false positives, false negatives and true negatives.
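A minimal sketch of building such a table by hand, reusing the fire alarm example. The `actual` and `predicted` lists are made-up observations, not real data.

```python
from collections import Counter

# Hypothetical observations: 1 = fire, 0 = no fire
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

# Count every (actual, predicted) pair
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # fire, alarm went off
fn = counts[(1, 0)]  # fire, alarm stayed silent
fp = counts[(0, 1)]  # no fire, alarm went off
tn = counts[(0, 0)]  # no fire, alarm stayed silent

print("                predicted fire  predicted no fire")
print(f"actual fire     {tp:>14}  {fn:>17}")
print(f"actual no fire  {fp:>14}  {tn:>17}")
```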

Q10: What is the difference between inductive and deductive learning?

Inductive learning = observation -> conclusion

Deductive learning = conclusion -> observation

Q11: How is KNN different from k-means clustering?

Q12: Can you explain what an ROC curve is and what it represents?

ROC stands for Receiver Operating Characteristic. The curve is a fundamental tool for diagnostic test evaluation: it plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) for the different possible cut-off points of a diagnostic test.
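To make the cut-off idea concrete, here is a small sketch that computes one point of the curve per cut-off. The labels and scores are hypothetical, and `roc_points` is an illustrative helper, not a library function.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the cut-off is predicted positive
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical diagnostic test: 1 = diseased, 0 = healthy
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cut-off moves the point up and to the right: more true positives are caught, at the price of more false alarms.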

Q13: What's the difference between Type I and Type II error?

Type 1 error (false positive): Claiming something has happened when it hasn't happened. (false fire alarm)

Type 2 error (false negative): Claiming nothing has happened when in fact something has happened. (broken fire alarm)

Q14: Is it better to have too many false positive or too many false negatives?

Q15: What is more important to you: model accuracy or model performance?

Model accuracy is a subset of model performance.

Q16: What is the difference between Gini impurity and Entropy in a Decision Tree?

These two are metrics for deciding how to split a tree.
Gini impurity is the probability that a random sample would be classified incorrectly if you picked its label at random according to the distribution in the branch.
Entropy is a measure of lack of information. You calculate the information gain of a split as the difference in entropy before and after it. This measure tells you how much the split reduces the uncertainty about the label.

Q17: What is difference between Entropy and Information gain?

Entropy is an indicator of how messy your data is. It generally decreases as you get closer to the leaf nodes.
Information gain is the decrease in entropy after a dataset is split on an attribute. At each node, the split with the highest information gain is preferred.
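A minimal sketch of both quantities, assuming Shannon entropy in bits and a made-up split of ten labels. The helper names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" labels, split into two children
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(f"parent entropy   = {entropy(parent):.3f}")
print(f"information gain = {information_gain(parent, [left, right]):.3f}")
```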

Q18: What is Overfitting? And how do you ensure you're not overfitting with a model?

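Overfitting is when a model fits the training data too closely (a curve with too many bends) and so captures noise instead of the underlying pattern.

Three main methods to avoid overfitting:

Collect more data.
Use ensembling methods that "average" models.
Choose simpler models.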

Q19: Explain ensemble learning technique in Machine Learning?

Ensemble learning combines several different models to make a better model.

It is like learning from a community rather than from a single person.

Q20: What is Bagging and boosting in machine learning?

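Both are types of ensemble models built from weak learners. Bagging trains each model on a random sample of the data and combines their predictions. Boosting trains models in sequence, with each new model improving on the previous one, so the ensemble keeps on improving.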

Q21: How would you screen for outliers and what should you do if you find one?

Extreme Value Analysis (box plot)
Probabilistic and Statistical Model (normal distribution)
Linear Models (time series linear model)
Proximity-based Models (k-means clustering)
Information Theoretic Models
High-Dimensional Outlier Detection
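For the first item in the list, extreme value analysis with a box plot, a minimal sketch of the usual 1.5 * IQR fence rule looks like this. The data values are hypothetical, and `iqr_outliers` is an illustrative helper.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR], the box plot fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one extreme value
data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]
print(iqr_outliers(data))
```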

How to handle outliers,

Q22: What is collinearity and multicollinearity?

Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression are correlated. (Date of birth and age are always going to be correlated.)
Multicollinearity occurs when more than two predictor variables (e.g., x1, x2 and x3) are inter-correlated. (For example, age, year of birth and school grade: you are in 5th grade when you are 10 years old.)
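One simple way to spot collinearity is to compute the correlation between pairs of predictors. The sketch below uses a hand-written Pearson correlation on made-up values; the reference year 2024 and the `income` column are assumptions for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictors: age is fully determined by year of birth
birth_year = [1990, 1985, 2000, 1978, 1995]
age = [2024 - year for year in birth_year]  # assumes a reference year of 2024
income = [52, 61, 30, 58, 45]               # hypothetical, in thousands

print(f"age vs birth_year: {pearson(age, birth_year):.2f}")
print(f"age vs income:     {pearson(age, income):.2f}")
```

A correlation at or near -1 or +1, as for age and year of birth here, means the two predictors carry the same information.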

Q23: What do you understand by Eigenvectors and Eigenvalues?

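If multiplying a matrix by a vector gives back the same vector scaled by a number, that vector is an eigenvector of the matrix and the number is its eigenvalue (A v = λ v). For example, if the product is 3 times the original vector, then 3 is an eigenvalue, with the original vector in the multiplication problem being an eigenvector.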
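A tiny worked example of such a multiplication problem, using a hypothetical 2x2 matrix chosen so that the eigenvalue comes out as 3:

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Hypothetical matrix and candidate eigenvector
A = [[2, 1],
     [1, 2]]
v = [1, 1]

Av = mat_vec(A, v)         # [3, 3]
eigenvalue = Av[0] / v[0]  # 3.0

# A v equals eigenvalue * v, so v is an eigenvector of A with eigenvalue 3
print(Av, eigenvalue)
```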

Q24: What is A/B Testing?

Statistical hypothesis testing for a randomized experiment with two variants, A and B.
Goal: Identify changes to a web page that maximize or increase an outcome of interest.
Example: Identifying the click-through rate for a banner ad.
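A minimal sketch of how the click-through rates of two banner variants could be compared, assuming a two-proportion z-test is an acceptable choice of test. The click and view counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via the error function
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts for banner A and banner B
z, p_value = two_proportion_z_test(clicks_a=200, views_a=10_000,
                                   clicks_b=260, views_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone.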

Q25: What is Cluster Sampling?

It is the process of randomly selecting intact groups, each sharing similar characteristics, from within the defined population.
A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
Example: if managers are the elements, then companies are the clusters, and you sample by picking companies at random.
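The manager and company example can be sketched in a few lines. The company names and manager IDs below are invented for illustration.

```python
import random

# Hypothetical population: companies (clusters) and their managers (elements)
companies = {
    "Company A": ["A1", "A2", "A3"],
    "Company B": ["B1", "B2"],
    "Company C": ["C1", "C2", "C3", "C4"],
    "Company D": ["D1", "D2"],
}

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Step 1: randomly select whole clusters
chosen_companies = rng.sample(sorted(companies), k=2)

# Step 2: include every manager from each selected company
sample = [manager for name in chosen_companies for manager in companies[name]]

print(chosen_companies, sample)
```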

Q26: Running a binary classification tree algorithm is quite easy. But do you know how the tree decides which variable to split on at the root node and its succeeding child nodes?

The tree compares candidate splits using a measure such as Gini or entropy. We can calculate Gini as follows:

Calculate Gini for the sub-nodes, using the formula sum of squares of the probabilities of success and failure (p^2 + q^2). With this score, a higher value means a more homogeneous node.
Calculate Gini for the split using the weighted Gini score of each node of that split.

Entropy is the measure of impurity or randomness in the data (for a binary class):

Entropy = -p log2(p) - q log2(q)

Here p and q are the probabilities of success and failure respectively in that node.
Entropy is zero when a node is homogeneous and is maximum when both classes are present in a node at 50%-50%. All we want is to achieve lower entropy.
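The two Gini steps can be sketched as follows, using the p^2 + q^2 score described above. The two candidate splits and their counts are hypothetical.

```python
def gini_score(successes, failures):
    """Gini score p^2 + q^2 of a single node."""
    total = successes + failures
    p, q = successes / total, failures / total
    return p ** 2 + q ** 2


def split_gini(nodes):
    """Size-weighted Gini score of a split; nodes are (successes, failures) pairs."""
    total = sum(s + f for s, f in nodes)
    return sum((s + f) / total * gini_score(s, f) for s, f in nodes)


# Two hypothetical ways of splitting the same 30 samples
split_a = [(2, 8), (13, 7)]
split_b = [(6, 8), (9, 7)]

print(f"split A: {split_gini(split_a):.3f}")
print(f"split B: {split_gini(split_b):.3f}")
```

Under this scoring the tree would prefer the split with the higher weighted score, because its sub-nodes are more homogeneous.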

PY 1: Name a few libraries in Python used for Data Analysis and Scientific computations.

NumPy: Numerical library
SciPy: Scientific Python
Pandas: Data analysis library built around DataFrames
scikit-learn: Machine learning library, including pre-processing
Matplotlib: Data visualization tool
Seaborn: Built on top of Matplotlib, offers more kinds of statistical data visualization
Bokeh: Data visualization

PY 2: Which library would you prefer for plotting in Python language: Seaborn or Matplotlib or Bokeh?

Seaborn: in-depth analysis of your data, with statistical plots such as KDE plots
Matplotlib: quick access to bar plots, scatter plots and so on
Bokeh: interactive plots that let you interact with the data

PY 3: How are NumPy and SciPy related?

NumPy is part of the SciPy ecosystem, and SciPy is built on top of NumPy.
NumPy defines arrays along with some basic numerical functions like indexing, sorting, reshaping, etc.
SciPy implements things like numerical integration and optimization using NumPy's functionality.

PY 4: What is the main difference between a Pandas series and a single-column DataFrame in Python?

PY 5: How can you handle duplicate values in a dataset for a variable in Python?

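import pandas as pd

# Load the dataset
bill_data = pd.read_csv("dataset/bill.csv")

# Number of rows and columns before de-duplication
print(bill_data.shape)

# Boolean Series marking rows that repeat an earlier row
dupes = bill_data.duplicated()

# Number of duplicate rows
print(dupes.sum())

# Keep only the first occurrence of each row
bill_data_uniq = bill_data.drop_duplicates()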

PY 6: Write a basic machine learning program to check the accuracy of the dataset importing any dataset using any classifier.

import pandas as pd

import numpy as np

from sklearn.datasets import load_iris

from sklearn.metrics import accuracy_score

# Loading the built-in Iris dataset

data = load_iris()

# Extracting Attributes / Features

X = data.data

# Extracting Target / Class Labels

y = data.target

# Import Library for splitting data

from sklearn.model_selection import train_test_split

# Creating Train and Test datasets

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42, test_size = 0.25)

# Creating Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

clf.fit(X_train,y_train)

# Predict on the test set and report accuracy scores

y_pred = clf.predict(X_test)

print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred=clf.predict(X_train)))

print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred=y_pred))

To improve the result, try tweaking the probability cut-off and test_size.
You can also look for better features or try a stronger model such as a random forest.

S1: You are given a data set in which some variables have more than 30% missing values. Let's say, out of 50 variables, 8 variables have more than 30% missing values. How will you deal with them?

Assign a unique category to the missing values; who knows, the missing values might reveal some trend.
We can remove those variables outright.
Or, we can sensibly check their distribution against the target variable and, if we find any pattern, keep those missing values and assign them a new category while removing the others.

S2: Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a two-column table of users and their friends, and a two-column table of users and the pages they liked. It should not recommend pages you already like.

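-- Assumed tables: friend(user_id, friend_id) and like(user_id, page_id).
-- "like" is a reserved word in SQL, so quote it or rename the table in a real schema.
SELECT DISTINCT f.user_id, l.page_id
FROM friend f
JOIN like l ON f.friend_id = l.user_id
WHERE l.page_id NOT IN (SELECT page_id FROM like
                        WHERE user_id = f.user_id)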

S3: There's a game where you are asked to roll two fair six-sided dice. If the sum of the values on the dice equals seven, then you win $21. However, you must pay $5 to play each time you roll both dice. Do you play this game? And as a follow-up: if you play 6 times, what is the probability of making money from this game?

There are 36 possible outcomes and 6 of them sum to 7, so the chance of winning is 6/36 = 1/6. Over 6 plays you expect to win once, collecting $21, but 6 plays at $5 each cost $30. The expected profit is negative, so no, don't play.

The follow-up is a probability question that uses the binomial distribution.

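from scipy.stats import binom

# Roll the dice 6 times; the probability of getting 7 on each roll is 6/36 = 1/6 (about 0.17)
n = 6

# Number of wins we ask about
k = 2

p = 1 / 6

# Probability of winning exactly 2 times out of 6
binomial_pmf = binom.pmf(k, n, p)

print(binomial_pmf)

# About 20% of the time I will win exactly 2 times.
# Making money needs at least 2 wins (6 plays cost $30); binom.sf(1, n, p) gives that probability.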
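To answer both parts of the question directly, here is a sketch with only the standard library. It assumes that "making money" means total winnings exceed the $30 spent on six plays, which requires at least two wins.

```python
from math import comb

p_win = 6 / 36       # six of the 36 outcomes sum to seven
prize, cost = 21, 5

# Expected profit of a single play
expected_profit = p_win * prize - cost


def binom_pmf(k, n, p):
    """Probability of exactly k wins in n independent plays."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Six plays cost $30, so a profit needs at least two wins (2 * $21 = $42)
n_plays = 6
p_profit = 1 - binom_pmf(0, n_plays, p_win) - binom_pmf(1, n_plays, p_win)

print(f"expected profit per play: {expected_profit:.2f}")
print(f"P(profit after 6 plays):  {p_profit:.4f}")
```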

S4: We have two options for serving ads within Newsfeed:

1 - Out of every 25 stories, one will be an ad

2 - Every story has a 4% chance of being an ad

For each option, what is the expected number of ads shown in 100 news stories? If we go with option 2, what is the chance a user will be shown only a single ad in 100 stories? What about no ads at all?

For question 1: both options have the same expected value of 4 ads per 100 stories.

For question 2: use the binomial distribution. For one specific case to happen (an ad in one given position and no ad in the other 99), the probability is

p(one case) = (0.96)^99 * (0.04)^1

In total, there are 100 possible positions for the ad, so

100 * p(one case) = 7.03%

For no ads at all, every one of the 100 stories must be a non-ad: (0.96)^100 = 1.69%

import numpy as np

from scipy.stats import binom

# Every story has a 4% chance of being an ad

# Chance of seeing exactly one ad in 100 stories

n = 100

k = 1

p = 0.04

binomial_pmf = binom.pmf(k, n, p)

bino_pct = binomial_pmf * 100

print("%.2f" % bino_pct,'%')
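The question also asks about the expected number of ads and the chance of no ads at all. A standard-library sketch of all three quantities for option 2 (the helper `binom_pmf` is written here for illustration):

```python
from math import comb

n_stories, p_ad = 100, 0.04


def binom_pmf(k, n, p):
    """Probability of exactly k ads among n independent stories."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


expected_ads = n_stories * p_ad           # option 1 gives exactly 100 / 25 ads
p_one_ad = binom_pmf(1, n_stories, p_ad)  # exactly one ad
p_no_ads = binom_pmf(0, n_stories, p_ad)  # no ads at all

print(f"expected ads: {expected_ads:.1f}")
print(f"P(one ad) = {p_one_ad:.2%}")
print(f"P(no ads) = {p_no_ads:.2%}")
```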

S5: How would you predict who will renew their subscription next month? What data would you need to solve this? What analysis would you do? Would you build predictive models? If so, which algorithms?

S6: How do you map nicknames (Pete, Andy, Nick, Rob, etc) to real names?

S7: A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also head?

There are two ways of choosing the coin. One is to pick a fair coin and the other is to pick the one with two heads.

Probability of selecting fair coin = 999/1000 = 0.999

Probability of selecting unfair coin = 1/1000 = 0.001

Getting 10 heads in a row = (selecting a fair coin * getting 10 heads) + (selecting the unfair coin * getting 10 heads)

P(A) = P(fair coin and 10 heads) = 0.999 * (1/2)^10 = 0.999 * (1/1024) = 0.000976

P(B) = P(unfair coin and 10 heads) = 0.001 * 1 = 0.001

P(fair coin | 10 heads) = P(A) / (P(A) + P(B)) = 0.000976 / (0.000976 + 0.001) = 0.4939

P(unfair coin | 10 heads) = P(B) / (P(A) + P(B)) = 0.001 / 0.001976 = 0.5061

Probability of another head = P(fair coin | 10 heads) * 0.5 + P(unfair coin | 10 heads) * 1 = 0.4939 * 0.5 + 0.5061 = 0.7531
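The same Bayes calculation can be checked with exact fractions, which avoids rounding in the intermediate steps. The variable names are illustrative.

```python
from fractions import Fraction

# Prior probabilities of picking each kind of coin
prior_fair = Fraction(999, 1000)
prior_double = Fraction(1, 1000)

# Likelihood of seeing 10 heads in a row with each coin
like_fair = Fraction(1, 2) ** 10
like_double = Fraction(1)

# Bayes' rule: posterior = prior * likelihood / total probability of the evidence
evidence = prior_fair * like_fair + prior_double * like_double
post_fair = prior_fair * like_fair / evidence
post_double = prior_double * like_double / evidence

# Probability that the next toss is also a head
p_next_head = post_fair * Fraction(1, 2) + post_double * 1

print(post_fair, post_double, p_next_head)
print(f"{float(p_next_head):.4f}")
```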

S8: Suppose you are given a data set which has missing values spread along 1 standard deviation from the median. What percentage of data would remain unaffected and Why?

Since the data is spread across the median, let's assume it's a normal distribution.

In a normal distribution, about 68% of the data lies within 1 standard deviation of the mean (which equals the mode and median). The missing values fall in that range, which leaves about 32% of the data unaffected. Therefore, about 32% of the data would remain unaffected by missing values.
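The 68% figure can be verified from the normal distribution itself. This sketch uses the error function from the standard library; `normal_within` is an illustrative helper name.

```python
from math import erf, sqrt


def normal_within(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return erf(k / sqrt(2))


within_one_sd = normal_within(1)  # share of data within 1 standard deviation
unaffected = 1 - within_one_sd    # share of data outside that range

print(f"within 1 SD: {within_one_sd:.2%}")
print(f"unaffected:  {unaffected:.2%}")
```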

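S9: You are given a cancer detection data set. Let's suppose when you build a classification model you achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

Cancer detection data is usually imbalanced, so a model can reach 96% accuracy by mostly predicting the majority (no cancer) class while missing the cases that matter. Check recall and precision on the cancer class rather than accuracy alone. Ways to improve the model:

Add more data
Treat missing and outlier values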
Feature Engineering
Feature Selection
Multiple Algorithms
Algorithm Tuning
Ensemble Method
Cross Validation

S10: You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?

Yes. Time series data often follows a linear trend, while a decision tree algorithm is known to work best at detecting non-linear interactions.
The decision tree fails to provide robust predictions because it cannot map the linear relationship as well as a regression model does.
Keep in mind that a linear regression model provides robust predictions only if the data set satisfies its linearity assumptions.

S11: Suppose you found that your model is suffering from low bias and high variance. Which algorithm do you think could tackle this situation and why?

In other words: how do we tackle high variance?

Low bias occurs when the model's predicted values are near the actual values.
In this case we can use a bagging algorithm (e.g. random forest) to tackle the high variance problem.
A bagging algorithm divides the data set into subsets using repeated randomized sampling.
These samples are then used to generate a set of models with a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).

S12: You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

You might be tempted to say no, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.

S13: You are asked to build a multiple regression model but your model's R square isn't as good as you wanted. For improvement, you remove the intercept term and now your model's R square becomes 0.8 from 0.3. Is it possible? How?

Yes, it is possible. The intercept term refers to the model prediction without any independent variable, in other words the mean prediction.

R square = 1 - Σ(y - y_pred)^2 / Σ(y - Ymean)^2
In the presence of the intercept term, the R square value evaluates your model with respect to the mean model.
In the absence of the intercept term (Ymean), the model can make no such evaluation, and the denominator becomes Σ y^2.
With this larger denominator, the value of the ratio Σ(y - y_pred)^2 / Σ y^2 becomes smaller than actual, thereby resulting in a higher value of R square.
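A small numeric sketch of this effect, using one predictor and made-up data. It assumes the common convention that R square is measured against the mean model when an intercept is present and against zero (sum of y squared) when it is absent.

```python
def r_squared(xs, ys, intercept=True):
    """R square of a one-predictor least-squares fit, with or without an intercept."""
    n = len(xs)
    if intercept:
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        offset = mean_y - slope * mean_x
        # Centered total: the model is compared with the mean model
        total = sum((y - mean_y) ** 2 for y in ys)
    else:
        slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        offset = 0.0
        # Uncentered total: a much larger denominator
        total = sum(y * y for y in ys)
    residual = sum((y - (offset + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - residual / total


# Hypothetical data where y barely depends on x
xs = [1, 2, 3, 4, 5]
ys = [10, 12, 9, 11, 10]

r2_with = r_squared(xs, ys, intercept=True)
r2_without = r_squared(xs, ys, intercept=False)
print(f"with intercept:    {r2_with:.3f}")
print(f"without intercept: {r2_without:.3f}")
```

The fit without an intercept is actually worse here (its residuals are larger), yet its R square is far higher, purely because of the change of denominator.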

S14: You are asked to build a random forest model with 10000 trees. During its training you got a training error of 0.00, but on testing the validation error was 34.23. What is going on? Haven't you trained the model perfectly?

The model is overfit.
A training error of 0.00 means that the classifier has mimicked the training data patterns to such an extent that they do not carry over to unseen data.
So when this classifier ran on an unseen sample, it could not find those patterns and returned predictions with a higher error.
In a random forest, this can happen when the model is more complex than necessary, for example with more trees than needed. Hence, to avoid this situation, we should tune the number of trees using cross validation.

S15: "People who bought this also bought..." recommendations seen on Amazon are based on which algorithm?

The basic idea for this kind of recommendation engine comes from collaborative filtering.
