MySQL 5.7 backup with mysqldump
#replace .my.cnf with the credentials for the failed-over source instance
rm -f .my.cnf
echo "y" | cp .my.cnf.failed_over .my.cnf
my_path=/backup/myom
my_year=`date +'%Y'`
my_month=`date +'%m'`
my_date=`date +'%d'`
my_hour=`date +'%H'`
my_min=`date +'%M'`
today="$my_year-$my_month$my_date"
my_time="$my_year-$my_month$my_date-$my_hour$my_min"
mkdir -p $my_path/$today
echo "" >> $my_path/$today/status.txt
echo "Starting the DB backup sqldump $my_time" >> $my_path/$today/status.txt
mysqldump --hex-blob --single-transaction --set-gtid-purged=OFF --default-character-set=utf8mb4 mydb > $my_path/$today/mydb-$today.sql
end_time=`date +'%Y-%m%d-%H%M'`
echo "Finished the DB backup sqldump $end_time" >> $my_path/$today/status.txt
echo "" >> $my_path/$today/status.txt
MySQL 5.7 to MySQL 8 restore
#replace .my.cnf with the credentials for the MySQL 8 target instance
rm -f .my.cnf
echo "y" | cp .my.cnf.mysql8.mydb .my.cnf
my_path=/backup/myom
my_year=`date +'%Y'`
my_month=`date +'%m'`
my_date=`date +'%d'`
my_hour=`date +'%H'`
my_min=`date +'%M'`
today="$my_year-$my_month$my_date"
my_time="$my_year-$my_month$my_date-$my_hour$my_min"
echo ""
echo "Dropping MySQL8 mydb DB"
mysql -e "DROP DATABASE IF EXISTS mydb;"
sleep 30
echo ""
echo "Creating MySQL8 mydb DB"
mysql -e "CREATE DATABASE mydb;"
echo "Starting DB restore $my_time" >> $my_path/$today/status.txt
/usr/bin/pv $my_path/$today/mydb-$today.sql | mysql mydb
end_time=`date +'%Y-%m%d-%H%M'`
echo "Finished the DB restore $end_time" >> $my_path/$today/status.txt
echo "" >> $my_path/$today/status.txt
%%timeit
#bin the 0-100 review_score into five star-rating labels
df.loc[:,'review_stars8'] = pd.cut(x=df['review_score'], bins=[0, 20, 40, 60, 80, 100],
                                   labels=['One Star', 'Two Stars', 'Three Stars', 'Four Stars', 'Five Stars'])
Time series forecasting: e.g., what will demand for a business be next month, or next year?
Another example series: the change in median real income.
Requirements: past data about what we want to forecast,
the data must be quantitative or quantifiable,
and an expectation that past data patterns will extend into the future.
Horizontal pattern: the series (e.g., sales plotted by month) fluctuates around a roughly constant level.
#load libraries
library(tidyverse)
library(TTR)
library(ggplot2)
#set working directory (adjust this for your own computer)
setwd("/Users/myom@cad/Documents/Eastern/12_DTSC_560_DataScience_for_Business/module4")
#read dataset into R
milkdf <- read.csv("united_dairies.csv")
View(milkdf)
#use the naive method to forecast the 13th week of milk sales
#(sales_actuals, created below from milkdf$Sales, holds the 12 actual weekly values)
naive13 <- c(NA, sales_actuals)
naive13
#The last value in the vector is the forecast for sales for the 13th week
#Create functions for the accuracy measures with vector of actual values
#and vector of predicted values as inputs
mae<-function(actual,pred){
mae <- mean(abs(actual-pred), na.rm=TRUE)
return (mae)
}
mse<-function(actual,pred){
mse <- mean((actual-pred)^2, na.rm=TRUE)
return (mse)
}
rmse<-function(actual,pred){
rmse <- sqrt(mean((actual-pred)^2, na.rm=TRUE))
return (rmse)
}
mape<-function(actual,pred){
mape <- mean(abs((actual - pred)/actual), na.rm=TRUE)*100
return (mape)
}
#Adjust the vector of predicted values to align with the sales_actuals vector
Naive_pred <- naive13[-length(naive13)]
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mae(sales_actuals, Naive_pred)
[1] 190.9091
mse(sales_actuals, Naive_pred)
[1] 50000
rmse(sales_actuals, Naive_pred)
[1] 223.6068
mape(sales_actuals, Naive_pred)
[1] 6.269048
#use the simple moving average method to forecast the 13th week of milk sales
sma13<-SMA (sales_actuals, n=3)
sma13
# NA NA 3033.333 3050.000 2983.333 2916.667 3083.333 3150.000 3116.667 3016.667 3050.000 3116.667
#The last value in the vector is the forecast for sales for the 13th week
#Adjust the vector of predicted values to align with the sales_actuals vector
sales_ma_pred<-c(NA, sma13[-length(sma13)])
sales_ma_pred
# [1] NA NA NA 3033.333 3050.000 2983.333 2916.667 3083.333 3150.000 3116.667 3016.667 3050.000
#(A centered moving average is an alternative way to smooth the series.)
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mae(sales_actuals, sales_ma_pred)
mse(sales_actuals, sales_ma_pred)
rmse(sales_actuals, sales_ma_pred)
mape(sales_actuals, sales_ma_pred)
> mae(sales_actuals, sales_ma_pred)
[1] 161.1111
> mse(sales_actuals, sales_ma_pred)
[1] 36203.7
> rmse(sales_actuals, sales_ma_pred)
[1] 190.2727
> mape(sales_actuals, sales_ma_pred)
[1] 5.268628
#use the exponential smoothing method with alpha = 0.2 to forecast the
#13th week of milk sales
exp13 <- EMA (sales_actuals, n=1, ratio = .2)
exp13
# 2750.000 2820.000 2906.000 2884.800 2887.840 2920.272 2996.218 3016.974 3003.579 3002.863 3042.291 3063.833
#The last value in the vector is the forecast for sales for the 13th week
#Adjust the vector of predicted values to align with the sales_actuals vector
exp_pred <- c(NA, exp13[-length(exp13)])
exp_pred
# NA 2750.000 2820.000 2906.000 2884.800 2887.840 2920.272 2996.218 3016.974 3003.579 3002.863 3042.291
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mape(sales_actuals, exp_pred)
mae(sales_actuals, exp_pred)
mse(sales_actuals, exp_pred)
rmse(sales_actuals, exp_pred)
> mape(sales_actuals, exp_pred)
[1] 5.542897
> mae(sales_actuals, exp_pred)
[1] 174.7518
> mse(sales_actuals, exp_pred)
[1] 50462.68
> rmse(sales_actuals, exp_pred)
[1] 224.639
#use the exponential smoothing method with alpha = 0.4 to forecast the
#13th week of milk sales
exp13_4 <- EMA (sales_actuals, n=1, ratio = .4)
exp13_4
# 2750.000 2890.000 3034.000 2940.400 2924.240 2974.544 3104.726 3102.836 3041.702 3025.021 3095.013 3117.008
#The last value in the vector is the forecast for sales for the 13th week
#Adjust the vector of predicted values to align with the sales_actuals vector
exp_pred_4 <- c(NA, exp13_4[-length(exp13_4)])
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mae(sales_actuals, exp_pred_4)
mse(sales_actuals, exp_pred_4)
rmse(sales_actuals, exp_pred_4)
mape(sales_actuals, exp_pred_4)
> mae(sales_actuals, exp_pred_4)
[1] 169.5315
> mse(sales_actuals, exp_pred_4)
[1] 44453.35
> rmse(sales_actuals, exp_pred_4)
[1] 210.8396
> mape(sales_actuals, exp_pred_4)
[1] 5.4582
#create a time series plot showing 12 weeks of milk sales
ggplot(data = milkdf, mapping = aes(x = Week, y = Sales)) +
geom_line () +
geom_point() +
scale_x_continuous(breaks = seq(0, 13, by = 1)) +
labs(title = "Weekly milk sales for United Dairies", x = "Week", y = "Sales")
#create a separate vector for the actual weekly sales
sales_actuals<-milkdf$Sales
sales_actuals
#recode FuelType variable with 0 for Diesel and 1 for Petrol
cardf$FuelType<-ifelse(cardf$FuelType=="Petrol",1,0)
#Convert categorical variables to factors with levels and labels
cardf$FuelType<-factor(cardf$FuelType,levels = c(0,1),labels = c("Diesel","Petrol"))
cardf$MetColor<-factor(cardf$MetColor,levels = c(0,1),labels = c("No","Yes"))
cardf$Automatic<-factor(cardf$Automatic,levels = c(0,1),labels = c("No","Yes"))
#check for missing data
sum(is.na(cardf))
summary(cardf)
ggplot(data = cardf, mapping = aes(x = Age, y = Price)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Scatterplot for Car Age and Car Price", x = "Age", y = "Price")
# we can see the negative correlation between car age and price
#Calculate a correlation coefficient for the relationship between Age and Price
cor(cardf$Age, cardf$Price)
[1] -0.8715612
# Car age and car price are strongly negatively correlated: the price goes down as the car ages.
#Perform a simple linear regression with Car Age and Car Price
toyota_SLR <- lm(Price ~ Age, data = cardf)
summary(toyota_SLR)
R-squared: how well the model predicts. In multiple linear regression it is called the coefficient of multiple determination and is still reported as R-squared.
Sum of squares due to error (SSE): how tightly the observations cluster around the regression line that predicts y.
Sum of squares due to regression (SSR): the part of the variability in y that the regression explains.
Total sum of squares (SST): how tightly the observations cluster around the line representing the mean of y.
SST = SSR + SSE
We square the residuals because without squaring they sum to zero.
Coefficient of determination: R-squared = SSR / SST.
R-squared = 0.7596 means car age accounts for about 76% of the variability in Toyota Corolla prices (e.g., if SST were 100, SSR would be about 76 and SSE about 24).
#create a correlation matrix with all quantitative variables in the dataframe
cor(cardf[c(1, 2, 3, 5)])
#turn off scientific notation for all variables
options(scipen=999)
#Perform a multiple regression with Car Age, KM, and Horsepower and Car Price
toyota_MR <- lm(Price ~ Age + KM + Horsepower, data = cardf)
#View the multiple regression output
summary(toyota_MR)
#install lm.beta package to extract standardized regression coefficients
install.packages ("lm.beta")
#load lm.beta
library(lm.beta)
#Extract standardized regression coefficients
lm.beta(toyota_MR)
On average, each additional year of car age decreases the price by about 148 euros.
#Steps to create a scatterplot of residuals vs. predicted values of the
#dependent variable
#Create a vector of predicted values generated from the multiple
#regression above
toyota_pred = predict(toyota_MR)
#Create a vector of residuals generated from the multiple regression above
toyota_res = resid(toyota_MR)
#Create a data frame of the predicted values and the residuals
pred_res_df <- data.frame(toyota_pred, toyota_res)
#create a scatterplot of the residuals versus the predicted values
ggplot(data = pred_res_df, mapping = aes(x = toyota_pred, y = toyota_res)) +
geom_point() +
labs(title = "Plot of residuals vs. predicted values", x = "Predicted values",
y = "Residuals")
#Steps to create a Normal Probability Plot
#create a vector of standardized residuals generated from the multiple
#regression above
toyota_std.res = rstandard(toyota_MR)
#produce normal scores for the standardized residuals and create
#normal probability plot
qqnorm(toyota_std.res, ylab = "Standardized residuals", xlab = "Normal scores")
#install packages
install.packages ("car")
#load libraries
library(car)
#create a correlation matrix with all quantitative variables in the dataframe
cor(cardf[c(1, 2, 3, 5)])
#calculate Variance Inflation Factor for each variable to assess
#multicollinearity
vif(toyota_MR)
#Perform a multiple regression with Car Age, KM, Horsepower, FuelType,
#MetColor and Automatic as predictor variables and Car Price as the outcome
#variable
toyota_MR_Cat <- lm(Price ~ Age + KM + Horsepower + FuelType + MetColor + Automatic,
data = cardf)
#View multiple regression output
summary(toyota_MR_Cat)
#Steps to create a new scatterplot of residuals vs. predicted values of the
#dependent variable with the new categorical variables added
#Create a vector of predicted values generated from the multiple
#regression above
toyota_pred2 = predict(toyota_MR_Cat)
#Create a vector of residuals generated from the multiple regression above
toyota_res2 = resid(toyota_MR_Cat)
#Create a data frame of the predicted values and the residuals
pred_res_df2 <- data.frame(toyota_pred2, toyota_res2)
#create a scatterplot of the residuals versus the predicted values
ggplot(data = pred_res_df2, mapping = aes(x = toyota_pred2, y = toyota_res2)) +
geom_point() +
labs(title = "Plot of residuals vs. predicted values", x = "Predicted values",
y = "Residuals")
#Steps to create a Normal Probability Plot for the model with the categorical variables
#create a vector of standardized residuals generated from the multiple
#regression above
toyota_std.res2 = rstandard(toyota_MR_Cat)
#produce normal scores for the standardized residuals and create
#normal probability plot
qqnorm(toyota_std.res2, ylab = "Standardized residuals", xlab = "Normal scores")
#partition the data into a training set and a validation set
#set seed so the random sample is reproducible
set.seed(42)
sample <- sample(c(TRUE, FALSE), nrow(cardf), replace=TRUE, prob=c(0.7,0.3))
traincar <- cardf[sample, ]
validatecar <- cardf[!sample, ]
#Install package needed for best subsets procedure
install.packages("olsrr")
#Load olsrr library
library(olsrr)
#run a multiple regression model using the "train" dataframe and all
#available independent variables
toyota_MR_Alltrain <- lm(Price ~ Age + KM + Horsepower + FuelType + MetColor +
Automatic, data = traincar)
summary(toyota_MR_Alltrain)
#run best subsets procedure with multiple regression output "toyota_MR_Alltrain"
bestsubset <- ols_step_all_possible(toyota_MR_Alltrain)
View(bestsubset)
#run a final multiple regression model using the "validate" dataframe
#and the following predictors: Age, KM, Horsepower, FuelType, and Automatic
toyota_MR_Val <- lm(Price ~ Age + KM + Horsepower + FuelType + Automatic,
data = validatecar)
summary(toyota_MR_Val)
# The model with the largest adjusted R-squared uses Age, KM, Horsepower, FuelType, and Automatic
#read inventory dataset into R
inventorydf <- read.csv("toyota_corolla_inventory.csv")
View(inventorydf)
#Convert categorical variables to factors with levels and labels
inventorydf$FuelType<-factor(inventorydf$FuelType,levels = c(0,1),labels = c("Diesel","Petrol"))
inventorydf$Automatic<-factor(inventorydf$Automatic,levels = c(0,1),labels = c("No","Yes"))
#estimate predicted y values and prediction intervals for five additional
#Toyota Corollas in inventory
predict(toyota_MR_Val, inventorydf, interval = "prediction", level = 0.95)
install.packages ("tidyverse")
install.packages("cluster")
install.packages("fpc")
#load libraries
library(tidyverse)
library(cluster)
library(fpc)
#set working directory (adjust this for your own computer)
setwd("/Users/Documents/Eastern/12_DTSC_560_DataScience_for_Business/module2")
#read dataset into R
empdf <- read.csv("Employees.csv")
View(empdf)
quantdf<-empdf[c(1,16,20,25)]
View(quantdf)
#Create a data frame with only continuous variables - Age,
#Value.of.Investments,Number.of.Transactions, Household.Income -
#by removing columns 2, 3, 6, and 8
#(magdf is the young_professional_magazine.csv data frame, read in the hierarchical clustering section below)
quantdf<-magdf[c(-2,-3,-6,-8)]
View(quantdf)
#normalize each variable
quantdfn<-scale(quantdf)
View(quantdfn)
#set random number seed in order to replicate the analysis
set.seed(42)
#create a function to calculate total within-cluster sum of squared deviations
#to use in elbow plot
wss<-function(k){kmeans(quantdfn, k, nstart=10)$tot.withinss}
#range of k values for elbow plot
k_values<- 1:10
# run the function to create the range of values for the elbow plot
wss_values<-map_dbl(k_values, wss)
#create a new data frame containing both k_values and wss_values
elbowdf<- data.frame(k_values, wss_values)
#graph the elbow plot
ggplot(elbowdf, mapping = aes(x = k_values, y = wss_values)) +
geom_line() + geom_point()
#run k-means clustering with 4 clusters (k=4) and 1000 random restarts
k4<-kmeans(quantdfn, 4, nstart=1000)
#display the structure of the 4-means clustering object
str(k4)
#cluster statistics for the 4-cluster solution:
$n
[1] 410
$cluster.number
[1] 4
$cluster.size
[1] 63 46 175 126
$min.cluster.size
[1] 46
$noisen
[1] 0
$diameter
[1] 8.731119 6.459981 5.047733 4.948840
$average.distance (the densest cluster has the smallest average distance: 1.702407)
[1] 2.712944 2.289854 1.702407 1.821043
$median.distance
[1] 2.503044 2.096996 1.633555 1.748860
$separation
[1] 0.3072036 0.4262397 0.3094109 0.3072036
$average.toother
[1] 3.260644 3.249676 2.735785 2.717608
$average.between
[1] 2.903973
$average.within
[1] 1.960052
$n.between
[1] 57757
$n.within
[1] 26088
$max.diameter
[1] 8.731119
$min.separation
[1] 0.3072036
$within.cluster.ss
[1] 956.4456
$clus.avg.silwidths
1 2 3 4
0.07470313 0.18916895 0.26707357 0.21641395
$avg.silwidth
[1] 0.2132051
$g2
NULL
$g3
NULL
$pearsongamma
[1] 0.4197272
$dunn
[1] 0.0351849
$dunn2
[1] 0.882738
$entropy
[1] 1.25922
$wb.ratio
[1] 0.6749552
$ch
[1] 96.15431
$cwidegap
[1] 3.413033 3.493278 1.616582 1.376527
$widestgap
[1] 3.493278
$sindex
[1] 0.4509849
$corrected.rand
NULL
$vi
NULL
#combining each observation's cluster assignment with unscaled data frame
quantdfk4<-cbind(quantdf, clusterID=k4$cluster)
View(quantdfk4)
#write data frame to CSV file to analyze in Excel
write.csv(quantdfk4, "magazine_kmeans_4clusters.csv")
#calculate variable averages for all non-normalized observations
summarize_all(quantdf, mean)
#Calculate variable averages for each cluster
quantdfk4 %>%
group_by(clusterID) %>%
summarize_all(mean)
library(tidyverse)
library(cluster)
library(fpc)
library(factoextra)
library(janitor)
#set working directory
setwd("/Users/myom@cadent.tv/Documents/Eastern/12_DTSC_560_DataScience_for_Business")
#read in datafile
magdf<-read.csv("young_professional_magazine.csv")
View(magdf)
#Create a data frame with only binary variables - Gender,
#Real.Estate.Purchases, Graduate.Degree, Have.Children -
#by removing columns 1, 4, 5, and 7
bindf<-magdf[c(-1,-4,-5,-7)]
View(bindf)
#calculate distance between each pair of observations using the dist function
#and manhattan distance
match_dist<-dist(bindf, method="manhattan")
#run hierarchical clustering with the hclust function and group average linkage
cl_match_avg<-hclust(match_dist, method="average")
#plot the dendrogram
plot(cl_match_avg)
#Create 4 clusters using the cutree function
cl_match_avg_4<-cutree(cl_match_avg, k=4)
#display vector of cluster assignments for each observation
cl_match_avg_4
#visualize clusters on the dendrogram
rect.hclust(cl_match_avg, k=4, border=2:4)
#link cluster assignments to original categorical data frame
hcl4df<-cbind(bindf, clusterID=cl_match_avg_4)
#write data frame to CSV file to analyze in Excel
write.csv(hcl4df, "magazine_hier4_clusters.csv")
#display number of observations in each cluster
hcl4df %>%
group_by(clusterID) %>%
summarize(n())
#attach value labels to binary variables
hcl4df$Female<-factor(hcl4df$Female,levels=c(0,1),labels=c("no","yes"))
hcl4df$Real.Estate.Purchases<-factor(hcl4df$Real.Estate.Purchases,levels=c(0,1),labels=c("No","Yes"))
hcl4df$Graduate.Degree<-factor(hcl4df$Graduate.Degree,levels=c(0,1),labels=c("No","Yes"))
hcl4df$Have.Children<-factor(hcl4df$Have.Children,levels=c(0,1),labels=c("No","Yes"))
#Create frequency tables for each variable overall
tabyl(hcl4df$Female)
tabyl(hcl4df$Real.Estate.Purchases)
tabyl(hcl4df$Graduate.Degree)
tabyl(hcl4df$Have.Children)
#Create frequency tables for each variable by cluster
tabyl(hcl4df,Female,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
tabyl(hcl4df,Real.Estate.Purchases,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
tabyl(hcl4df,Graduate.Degree,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
tabyl(hcl4df,Have.Children,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
1. Programming Fundamentals
As a data scientist, your main job is to turn data into actionable insights. That is a long process, and you will need to write Python code for each of its steps, so you need solid Python programming fundamentals to write efficient code for your own tasks and to understand other people's code.
Here are some of the basic Python programming fundamentals you should master (a short illustrative sketch follows the list):
Data types: Python has several built-in data types including integers, floats, and strings. It is important to know how to work with each of these data types and when to use them.
Variables: A variable is a way to store a value in a program. In Python, you can create a variable by assigning it a value using the equals sign (=).
Operators: Operators are special symbols in Python that perform specific operations on one or more operands. Some common operators include addition (+), subtraction (-), and multiplication (*).
Lists: A list is a collection of items in a specific order. Lists are useful for storing data that needs to be accessed in a specific order, or for storing multiple items of the same data type.
Dictionaries: A dictionary is a collection of key-value pairs. Dictionaries are useful for storing data that needs to be accessed using a unique key.
Control structures: Control structures are blocks of code that determine how other blocks of code are executed. Some common control structures in Python include if statements, for loops, and while loops.
Functions: A function is a block of code that performs a specific task and can be reused multiple times in a program. Defining and calling functions is an important aspect of programming in Python.
Object-Oriented Programming (OOP): a programming paradigm that uses objects, which are instances of classes, to represent and manipulate data.
Modules and packages: A module is a file containing Python code, while a package is a collection of modules. Knowing how to import and use modules and packages is essential for writing larger, more complex Python programs.
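A minimal sketch, using only the standard library, that touches several of these fundamentals at once; the average function and the grades dictionary are made-up names for illustration:

def average(scores):
    """Return the mean of a list of numbers, or 0.0 for an empty list."""
    if not scores:                      # control structure: guard clause
        return 0.0
    return sum(scores) / len(scores)    # integers and floats in arithmetic

# a dictionary mapping names (strings) to lists of test scores (floats)
grades = {"ada": [88.0, 92.5, 79.0], "grace": [95.0, 87.5]}

for name, scores in grades.items():     # for loop over key-value pairs
    print(f"{name}: average = {average(scores):.1f}")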
2. Data Manipulation & Analysis
As a data scientist, you will spend a lot of time preparing and manipulating data to make it ready for analysis and modeling, so it is important to be able to use Python to clean and prepare data of different types and sizes.
You should be able to use Python to manipulate and analyze datasets of different sizes and types efficiently. Skills in this area include working with libraries like NumPy and pandas for structured data manipulation and analysis, using PySpark for very large datasets, and using specialized libraries for other data types such as images, text, and audio when needed. A short pandas sketch follows.
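A minimal pandas sketch, assuming a hypothetical sales.csv file with date, region, units, and unit_price columns:

import pandas as pd

# read the hypothetical sales file and parse the date column
df = pd.read_csv("sales.csv", parse_dates=["date"])

df = df.dropna(subset=["units", "unit_price"])        # drop incomplete rows
df["revenue"] = df["units"] * df["unit_price"]        # vectorized column arithmetic

# total revenue per month and region
monthly = (df
           .groupby([df["date"].dt.to_period("M"), "region"])["revenue"]
           .sum()
           .reset_index())
print(monthly.head())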
3. Data Visualization
Data visualization is an important aspect of data science, as it allows you to explore and understand your data, identify patterns and trends, and communicate your findings to others. It is therefore important for data scientists to have a strong understanding and hands-on skills in data visualization tools and how to use them effectively.
There are many libraries and tools available in Python for data visualization; some of the most popular ones are listed below, followed by a short Matplotlib sketch:
Matplotlib: This is a widely used library for creating static, animated, and interactive visualizations in Python. It gives you fine-grained, low-level control over almost every element of a figure.
Seaborn: This is a library for creating statistical graphics in Python. It is built on top of Matplotlib and provides a higher-level interface for drawing attractive and informative statistical plots.
Plotly: This is a library for creating interactive visualizations in Python. Like Bokeh (below), it targets browser-based interactivity, and it also renders well inside Jupyter notebooks.
Bokeh: This is a library for creating interactive visualizations in Python. It is particularly well-suited for creating visualizations that can be displayed in web browsers.
Altair: This is a library for creating declarative statistical visualizations in Python. It is based on the Vega and Vega-Lite visualization grammars, which provide a high-level, concise syntax for creating a wide range of visualizations.
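A minimal Matplotlib sketch; the monthly revenue numbers are made up purely for illustration:

import matplotlib.pyplot as plt
import numpy as np

# made-up monthly revenue figures for one year
months = np.arange(1, 13)
revenue = np.array([12, 14, 13, 17, 19, 18, 21, 24, 22, 25, 27, 30])

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(months, revenue, marker="o")      # line plot with point markers
ax.set_xticks(months)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousands)")
ax.set_title("Monthly revenue")
fig.tight_layout()
plt.show()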
4. Data Storage and Retrieval
As a data scientist, you will mainly be working with data, whether you are retrieving raw data or storing it again after processing. Data storage and retrieval skills therefore matter because they let you manage and access the data you work with efficiently.
There are many ways to store and retrieve data in Python, depending on your needs and the nature of the data. Here are some common approaches you may encounter during your career, followed by a short sketch that touches a few of them:
Flat files: Flat files are simple text files that contain tabular data, with each row representing a record and each column representing a field. Flat files can be read and written using Python’s built-in open() function and the various methods of the file object, such as read(), readline(), and write().
CSV files: CSV (Comma Separated Values) files are a type of flat file that uses commas to separate values. They can be read and written with Python's built-in csv module or with the pandas library.
JSON files: JSON (JavaScript Object Notation) is a widely-used, human-readable data interchange format. It can be used to represent complex data structures, such as lists and dictionaries. JSON files can be read and written using Python’s built-in json module and pandas library.
Relational databases: Relational databases are powerful systems for storing and querying structured data. There are several Python libraries for interacting with popular database management systems (DBMS) such as MySQL, PostgreSQL, and SQLite. Some popular options include psycopg2 for PostgreSQL, mysql-connector-python for MySQL, and sqlite3 for SQLite.
NoSQL databases: NoSQL databases are designed to handle large amounts of unstructured data, such as that generated by social media, IoT devices, and e-commerce platforms. Some popular NoSQL databases include MongoDB, Cassandra, and Redis. Python provides various libraries for interacting with these databases, such as pymongo for MongoDB and redis-py for Redis.
Cloud storage: Cloud storage services such as Amazon S3, Google Cloud Storage, and Microsoft Azure Storage provide scalable, flexible options for storing large amounts of data in the cloud. Python provides libraries for accessing these services, such as boto3 for Amazon S3 and google-cloud-storage for Google Cloud Storage.
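A minimal sketch touching a few of these options, assuming a hypothetical customers.csv file and writing two local example files (config.json and example.db):

import json
import sqlite3
import pandas as pd

# CSV: read the hypothetical flat file into a DataFrame
df = pd.read_csv("customers.csv")

# JSON: write and read a nested structure with the built-in json module
with open("config.json", "w") as fh:
    json.dump({"retries": 3, "regions": ["us-east-1", "eu-west-1"]}, fh)
with open("config.json") as fh:
    config = json.load(fh)

# SQLite: store the DataFrame in a relational table and query it back
conn = sqlite3.connect("example.db")
df.to_sql("customers", conn, if_exists="replace", index=False)
count = pd.read_sql_query("SELECT COUNT(*) AS n FROM customers", conn)
conn.close()
print(config, count)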
5. Applied Machine & Deep Learning
Applied machine learning and deep learning are both important Python skills for data scientists to master. Machine learning involves the use of algorithms and statistical models to enable computers to improve their performance on a given task without explicitly being programmed to perform that task. This is accomplished by training the machine learning model on a dataset and allowing it to learn the relationships and patterns within the data.
Deep learning, on the other hand, involves the use of artificial neural networks to learn and make decisions. Deep learning has proven to be particularly effective in image and speech recognition, natural language processing, and even playing games.
In order to apply machine learning and deep learning in Python, it is important to have a strong understanding of the various algorithms and libraries available. Here are three of the most common libraries you should master as a data scientist, followed by a short scikit-learn sketch:
Scikit-learn: A machine learning library for Python, which provides a wide range of algorithms for classification, regression, clustering, and dimensionality reduction, as well as tools for model evaluation and selection.
TensorFlow: An open-source library for deep learning developed by Google, which provides tools for building, training, and deploying machine learning models.
Keras: A high-level neural network library built on top of TensorFlow, which provides a convenient interface for defining and training deep learning models.
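A minimal scikit-learn sketch using its built-in diabetes regression dataset; the 70/30 split and the linear model are arbitrary choices for illustration:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# built-in regression dataset, split into training and test partitions
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)         # fit on the training partition
pred = model.predict(X_test)        # predict on unseen data
print("MAE:", mean_absolute_error(y_test, pred))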