MySQL 5.7 backup with mysqldump
#replace .my.cnf with the credentials for the failed-over source instance
rm -f .my.cnf
echo "y" | cp .my.cnf.failed_over .my.cnf
my_path=/backup/myom
my_year=`date +'%Y'`
my_month=`date +'%m'`
my_date=`date +'%d'`
my_hour=`date +'%H'`
my_min=`date +'%M'`
today="$my_year-$my_month$my_date"
my_time="$my_year-$my_month$my_date-$my_hour$my_min"
mkdir -p $my_path/$today
echo "" >> $my_path/$today/status.txt
echo "Starting the DB backup sqldump $my_time" >> $my_path/$today/status.txt
mysqldump --hex-blob --single-transaction --set-gtid-purged=OFF --default-character-set=utf8mb4 mydb > $my_path/$today/mydb-$today.sql
end_time=`date +'%Y-%m%d-%H%M'`
echo "Finished the DB backup sqldump $end_time" >> $my_path/$today/status.txt
echo "" >> $my_path/$today/status.txt
MySQL 5.7 to MySQL 8 restore
#replace .my.cnf with the credentials for the MySQL 8 target instance
rm -f .my.cnf
echo "y" | cp .my.cnf.mysql8.mydb .my.cnf
my_path=/backup/myom
my_year=`date +'%Y'`
my_month=`date +'%m'`
my_date=`date +'%d'`
my_hour=`date +'%H'`
my_min=`date +'%M'`
today="$my_year-$my_month$my_date"
my_time="$my_year-$my_month$my_date-$my_hour$my_min"
echo ""
echo "Dropping MySQL8 mydb DB"
mysql -e "DROP DATABASE IF EXISTS mydb;"
sleep 30
echo ""
echo "Creating MySQL8 mydb DB"
mysql -e "CREATE DATABASE mydb;"
echo "Starting DB restore $my_time" >> $my_path/$today/status.txt
/usr/bin/pv $my_path/$today/mydb-$today.sql | mysql mydb
end_time=`date +'%Y-%m%d-%H%M'`
echo "Finished the DB restore $end_time" >> $my_path/$today/status.txt
echo "" >> $my_path/$today/status.txt
%%timeit
#bin the 0-100 review_score into five star-rating labels
df.loc[:,'review_stars8'] = pd.cut(x=df['review_score'], bins=[0, 20, 40, 60, 80, 100],
                                   labels=['One Star', 'Two Stars', 'Three Stars', 'Four Stars', 'Five Stars'])
Time series forecasting: e.g., what will demand for a business be next month, or next year?
Another example series: the change in median real income.
Requirements: past data about what we want to forecast,
the data must be quantitative or quantifiable,
and an expectation that past data patterns will extend into the future.
Horizontal pattern: the series (e.g., sales plotted by month) fluctuates around a roughly constant level.
#load libraries
library(tidyverse)
library(TTR)
library(ggplot2)
#set working directory (adjust this for your own computer)
setwd("/Users/myom@cad/Documents/Eastern/12_DTSC_560_DataScience_for_Business/module4")
#read dataset into R
milkdf <- read.csv("united_dairies.csv")
View(milkdf)
#use the naive method to forecast the 13th week of milk sales
#(sales_actuals, created below from milkdf$Sales, holds the 12 actual weekly values)
naive13 <- c(NA, sales_actuals)
naive13
#The last value in the vector is the forecast for sales for the 13th week
#Create functions for the accuracy measures with vector of actual values
#and vector of predicted values as inputs
mae<-function(actual,pred){
mae <- mean(abs(actual-pred), na.rm=TRUE)
return (mae)
}
mse<-function(actual,pred){
mse <- mean((actual-pred)^2, na.rm=TRUE)
return (mse)
}
rmse<-function(actual,pred){
rmse <- sqrt(mean((actual-pred)^2, na.rm=TRUE))
return (rmse)
}
mape<-function(actual,pred){
mape <- mean(abs((actual - pred)/actual), na.rm=TRUE)*100
return (mape)
}
#Adjust the vector of predicted values to align with the sales_actuals vector
Naive_pred <- naive13[-length(naive13)]
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mae(sales_actuals, Naive_pred)
[1] 190.9091
mse(sales_actuals, Naive_pred)
[1] 50000
rmse(sales_actuals, Naive_pred)
[1] 223.6068
mape(sales_actuals, Naive_pred)
[1] 6.269048
#use the simple moving average method to forecast the 13th week of milk sales
sma13<-SMA (sales_actuals, n=3)
sma13
# NA NA 3033.333 3050.000 2983.333 2916.667 3083.333 3150.000 3116.667 3016.667 3050.000 3116.667
#The last value in the vector is the forecast for sales for the 13th week
#Adjust the vector of predicted values to align with the sales_actuals vector
sales_ma_pred<-c(NA, sma13[-length(sma13)])
sales_ma_pred
# [1] NA NA NA 3033.333 3050.000 2983.333 2916.667 3083.333 3150.000 3116.667 3016.667 3050.000
#(A centered moving average is an alternative way to smooth the series.)
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mae(sales_actuals, sales_ma_pred)
mse(sales_actuals, sales_ma_pred)
rmse(sales_actuals, sales_ma_pred)
mape(sales_actuals, sales_ma_pred)
> mae(sales_actuals, sales_ma_pred)
[1] 161.1111
> mse(sales_actuals, sales_ma_pred)
[1] 36203.7
> rmse(sales_actuals, sales_ma_pred)
[1] 190.2727
> mape(sales_actuals, sales_ma_pred)
[1] 5.268628
#use the exponential smoothing method with alpha = 0.2 to forecast the
#13th week of milk sales
exp13 <- EMA (sales_actuals, n=1, ratio = .2)
exp13
# 2750.000 2820.000 2906.000 2884.800 2887.840 2920.272 2996.218 3016.974 3003.579 3002.863 3042.291 3063.833
#The last value in the vector is the forecast for sales for the 13th week
#Adjust the vector of predicted values to align with the sales_actuals vector
exp_pred <- c(NA, exp13[-length(exp13)])
exp_pred
# NA 2750.000 2820.000 2906.000 2884.800 2887.840 2920.272 2996.218 3016.974 3003.579 3002.863 3042.291
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mape(sales_actuals, exp_pred)
mae(sales_actuals, exp_pred)
mse(sales_actuals, exp_pred)
rmse(sales_actuals, exp_pred)
> mape(sales_actuals, exp_pred)
[1] 5.542897
> mae(sales_actuals, exp_pred)
[1] 174.7518
> mse(sales_actuals, exp_pred)
[1] 50462.68
> rmse(sales_actuals, exp_pred)
[1] 224.639
#use the exponential smoothing method with alpha = 0.4 to forecast the
#13th week of milk sales
exp13_4 <- EMA (sales_actuals, n=1, ratio = .4)
exp13_4
# 2750.000 2890.000 3034.000 2940.400 2924.240 2974.544 3104.726 3102.836 3041.702 3025.021 3095.013 3117.008
#The last value in the vector is the forecast for sales for the 13th week
#Adjust the vector of predicted values to align with the sales_actuals vector
exp_pred_4 <- c(NA, exp13_4[-length(exp13_4)])
#Calculate accuracy measures with vector of actual values and vector
#of predicted values as inputs
mae(sales_actuals, exp_pred_4)
mse(sales_actuals, exp_pred_4)
rmse(sales_actuals, exp_pred_4)
mape(sales_actuals, exp_pred_4)
> mae(sales_actuals, exp_pred_4)
[1] 169.5315
> mse(sales_actuals, exp_pred_4)
[1] 44453.35
> rmse(sales_actuals, exp_pred_4)
[1] 210.8396
> mape(sales_actuals, exp_pred_4)
[1] 5.4582
#create a time series plot showing 12 weeks of milk sales
ggplot(data = milkdf, mapping = aes(x = Week, y = Sales)) +
geom_line () +
geom_point() +
scale_x_continuous(breaks = seq(0, 13, by = 1)) +
labs(title = "Weekly milk sales for United Dairies", x = "Week", y = "Sales")
#create a separate vector for the actual weekly sales
sales_actuals<-milkdf$Sales
sales_actuals
#recode FuelType variable with 0 for Diesel and 1 for Petrol
cardf$FuelType<-ifelse(cardf$FuelType=="Petrol",1,0)
#Convert categorical variables to factors with levels and labels
cardf$FuelType<-factor(cardf$FuelType,levels = c(0,1),labels = c("Diesel","Petrol"))
cardf$MetColor<-factor(cardf$MetColor,levels = c(0,1),labels = c("No","Yes"))
cardf$Automatic<-factor(cardf$Automatic,levels = c(0,1),labels = c("No","Yes"))
#check for missing data
sum(is.na(cardf))
summary(cardf)
ggplot(data = cardf, mapping = aes(x = Age, y = Price)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Scatterplot for Car Age and Car Price", x = "Age", y = "Price")
# we can see the negative correlation between car age and price
#Calculate a correlation coefficient for the relationship between Age and Price
cor(cardf$Age, cardf$Price)
[1] -0.8715612
# Car age and car price are strongly negatively correlated: the price goes down as the car ages.
#Perform a simple linear regression with Car Age and Car Price
toyota_SLR <- lm(Price ~ Age, data = cardf)
summary(toyota_SLR)
R-squared: how well the model predicts. In multiple linear regression it is called the coefficient of multiple determination and is still reported as R-squared.
Sum of squares due to error (SSE): how tightly the observations cluster around the regression line that predicts y.
Sum of squares due to regression (SSR): the part of the variability in y that the regression explains.
Total sum of squares (SST): how tightly the observations cluster around the line representing the mean of y.
SST = SSR + SSE
We square the residuals because without squaring they sum to zero.
Coefficient of determination: R-squared = SSR / SST.
R-squared = 0.7596 means car age accounts for about 76% of the variability in Toyota Corolla prices (e.g., if SST were 100, SSR would be about 76 and SSE about 24).
#create a correlation matrix with all quantitative variables in the dataframe
cor(cardf[c(1, 2, 3, 5)])
#turn off scientific notation for all variables
options(scipen=999)
#Perform a multiple regression with Car Age, KM, and Horsepower and Car Price
toyota_MR <- lm(Price ~ Age + KM + Horsepower, data = cardf)
#View the multiple regression output
summary(toyota_MR)
#install lm.beta package to extract standardized regression coefficients
install.packages ("lm.beta")
#load lm.beta
library(lm.beta)
#Extract standardized regression coefficients
lm.beta(toyota_MR)
On average, each additional year of car age decreases the price by about 148 euros.
#Steps to create a scatterplot of residuals vs. predicted values of the
#dependent variable
#Create a vector of predicted values generated from the multiple
#regression above
toyota_pred = predict(toyota_MR)
#Create a vector of residuals generated from the multiple regression above
toyota_res = resid(toyota_MR)
#Create a data frame of the predicted values and the residuals
pred_res_df <- data.frame(toyota_pred, toyota_res)
#create a scatterplot of the residuals versus the predicted values
ggplot(data = pred_res_df, mapping = aes(x = toyota_pred, y = toyota_res)) +
geom_point() +
labs(title = "Plot of residuals vs. predicted values", x = "Predicted values",
y = "Residuals")
#Steps to create a Normal Probability Plot
#create a vector of standardized residuals generated from the multiple
#regression above
toyota_std.res = rstandard(toyota_MR)
#produce normal scores for the standardized residuals and create
#normal probability plot
qqnorm(toyota_std.res, ylab = "Standardized residuals", xlab = "Normal scores")
#install packages
install.packages ("car")
#load libraries
library(car)
#create a correlation matrix with all quantitative variables in the dataframe
cor(cardf[c(1, 2, 3, 5)])
#calculate Variance Inflation Factor for each variable to assess
#multicollinearity
vif(toyota_MR)
#Perform a multiple regression with Car Age, KM, Horsepower, FuelType,
#MetColor and Automatic as predictor variables and Car Price as the outcome
#variable
toyota_MR_Cat <- lm(Price ~ Age + KM + Horsepower + FuelType + MetColor + Automatic,
data = cardf)
#View multiple regression output
summary(toyota_MR_Cat)
#Steps to create a new scatterplot of residuals vs. predicted values of the
#dependent variable with the new categorical variables added
#Create a vector of predicted values generated from the multiple
#regression above
toyota_pred2 = predict(toyota_MR_Cat)
#Create a vector of residuals generated from the multiple regression above
toyota_res2 = resid(toyota_MR_Cat)
#Create a data frame of the predicted values and the residuals
pred_res_df2 <- data.frame(toyota_pred2, toyota_res2)
#create a scatterplot of the residuals versus the predicted values
ggplot(data = pred_res_df2, mapping = aes(x = toyota_pred2, y = toyota_res2)) +
geom_point() +
labs(title = "Plot of residuals vs. predicted values", x = "Predicted values",
y = "Residuals")
#Steps to create a Normal Probability Plot for the model with the categorical variables
#create a vector of standardized residuals generated from the multiple
#regression above
toyota_std.res2 = rstandard(toyota_MR_Cat)
#produce normal scores for the standardized residuals and create
#normal probability plot
qqnorm(toyota_std.res2, ylab = "Standardized residuals", xlab = "Normal scores")
#partition the data into a training set and a validation set
#set seed so the random sample is reproducible
set.seed(42)
sample <- sample(c(TRUE, FALSE), nrow(cardf), replace=TRUE, prob=c(0.7,0.3))
traincar <- cardf[sample, ]
validatecar <- cardf[!sample, ]
#Install package needed for best subsets procedure
install.packages("olsrr")
#Load olsrr library
library(olsrr)
#run a multiple regression model using the "train" dataframe and all
#available independent variables
toyota_MR_Alltrain <- lm(Price ~ Age + KM + Horsepower + FuelType + MetColor +
Automatic, data = traincar)
summary(toyota_MR_Alltrain)
#run best subsets procedure with multiple regression output "toyota_MR_Alltrain"
bestsubset <- ols_step_all_possible(toyota_MR_Alltrain)
View(bestsubset)
#run a final multiple regression model using the "validate" dataframe
#and the following predictors: Age, KM, Horsepower, FuelType, and Automatic
toyota_MR_Val <- lm(Price ~ Age + KM + Horsepower + FuelType + Automatic,
data = validatecar)
summary(toyota_MR_Val)
# The model with the largest adjusted R-squared uses Age, KM, Horsepower, FuelType, and Automatic
#read inventory dataset into R
inventorydf <- read.csv("toyota_corolla_inventory.csv")
View(inventorydf)
#Convert categorical variables to factors with levels and labels
inventorydf$FuelType<-factor(inventorydf$FuelType,levels = c(0,1),labels = c("Diesel","Petrol"))
inventorydf$Automatic<-factor(inventorydf$Automatic,levels = c(0,1),labels = c("No","Yes"))
#estimate predicted y values and prediction intervals for five additional
#Toyota Corollas in inventory
predict(toyota_MR_Val, inventorydf, interval = "prediction", level = 0.95)
install.packages ("tidyverse")
install.packages("cluster")
install.packages("fpc")
#load libraries
library(tidyverse)
library(cluster)
library(fpc)
#set working directory (adjust this for your own computer)
setwd("/Users/Documents/Eastern/12_DTSC_560_DataScience_for_Business/module2")
#read dataset into R
empdf <- read.csv("Employees.csv")
View(empdf)
quantdf<-empdf[c(1,16,20,25)]
View(quantdf)
#Create a data frame with only continuous variables - Age,
#Value.of.Investments,Number.of.Transactions, Household.Income -
#by removing columns 2, 3, 6, and 8
#(magdf is the young_professional_magazine.csv data frame, read in the hierarchical clustering section below)
quantdf<-magdf[c(-2,-3,-6,-8)]
View(quantdf)
#normalize each variable
quantdfn<-scale(quantdf)
View(quantdfn)
#set random number seed in order to replicate the analysis
set.seed(42)
#create a function to calculate total within-cluster sum of squared deviations
#to use in elbow plot
wss<-function(k){kmeans(quantdfn, k, nstart=10)$tot.withinss}
#range of k values for elbow plot
k_values<- 1:10
# run the function to create the range of values for the elbow plot
wss_values<-map_dbl(k_values, wss)
#create a new data frame containing both k_values and wss_values
elbowdf<- data.frame(k_values, wss_values)
#graph the elbow plot
ggplot(elbowdf, mapping = aes(x = k_values, y = wss_values)) +
geom_line() + geom_point()
#run k-means clustering with 4 clusters (k=4) and 1000 random restarts
k4<-kmeans(quantdfn, 4, nstart=1000)
#display the structure of the 4-means clustering object
str(k4)
#cluster statistics for the 4-cluster solution:
$n
[1] 410
$cluster.number
[1] 4
$cluster.size
[1] 63 46 175 126
$min.cluster.size
[1] 46
$noisen
[1] 0
$diameter
[1] 8.731119 6.459981 5.047733 4.948840
$average.distance (the densest cluster has the smallest average distance: 1.702407)
[1] 2.712944 2.289854 1.702407 1.821043
$median.distance
[1] 2.503044 2.096996 1.633555 1.748860
$separation
[1] 0.3072036 0.4262397 0.3094109 0.3072036
$average.toother
[1] 3.260644 3.249676 2.735785 2.717608
$average.between
[1] 2.903973
$average.within
[1] 1.960052
$n.between
[1] 57757
$n.within
[1] 26088
$max.diameter
[1] 8.731119
$min.separation
[1] 0.3072036
$within.cluster.ss
[1] 956.4456
$clus.avg.silwidths
1 2 3 4
0.07470313 0.18916895 0.26707357 0.21641395
$avg.silwidth
[1] 0.2132051
$g2
NULL
$g3
NULL
$pearsongamma
[1] 0.4197272
$dunn
[1] 0.0351849
$dunn2
[1] 0.882738
$entropy
[1] 1.25922
$wb.ratio
[1] 0.6749552
$ch
[1] 96.15431
$cwidegap
[1] 3.413033 3.493278 1.616582 1.376527
$widestgap
[1] 3.493278
$sindex
[1] 0.4509849
$corrected.rand
NULL
$vi
NULL
#combining each observation's cluster assignment with unscaled data frame
quantdfk4<-cbind(quantdf, clusterID=k4$cluster)
View(quantdfk4)
#write data frame to CSV file to analyze in Excel
write.csv(quantdfk4, "magazine_kmeans_4clusters.csv")
#calculate variable averages for all non-normalized observations
summarize_all(quantdf, mean)
#Calculate variable averages for each cluster
quantdfk4 %>%
group_by(clusterID) %>%
summarize_all(mean)
library(tidyverse)
library(cluster)
library(fpc)
library(factoextra)
library(janitor)
#set working directory
setwd("/Users/myom@cadent.tv/Documents/Eastern/12_DTSC_560_DataScience_for_Business")
#read in datafile
magdf<-read.csv("young_professional_magazine.csv")
View(magdf)
#Create a data frame with only binary variables - Gender,
#Real.Estate.Purchases, Graduate.Degree, Have.Children -
#by removing columns 1, 4, 5, and 7
bindf<-magdf[c(-1,-4,-5,-7)]
View(bindf)
#calculate distance between each pair of observations using the dist function
#and manhattan distance
match_dist<-dist(bindf, method="manhattan")
#run hierarchical clustering with the hclust function and group average linkage
cl_match_avg<-hclust(match_dist, method="average")
#plot the dendrogram
plot(cl_match_avg)
#Create 4 clusters using the cutree function
cl_match_avg_4<-cutree(cl_match_avg, k=4)
#display vector of cluster assignments for each observation
cl_match_avg_4
#visualize clusters on the dendrogram
rect.hclust(cl_match_avg, k=4, border=2:4)
#link cluster assignments to original categorical data frame
hcl4df<-cbind(bindf, clusterID=cl_match_avg_4)
#write data frame to CSV file to analyze in Excel
write.csv(hcl4df, "magazine_hier4_clusters.csv")
#display number of observations in each cluster
hcl4df %>%
group_by(clusterID) %>%
summarize(n())
#attach value labels to binary variables
hcl4df$Female<-factor(hcl4df$Female,levels=c(0,1),labels=c("no","yes"))
hcl4df$Real.Estate.Purchases<-factor(hcl4df$Real.Estate.Purchases,levels=c(0,1),labels=c("No","Yes"))
hcl4df$Graduate.Degree<-factor(hcl4df$Graduate.Degree,levels=c(0,1),labels=c("No","Yes"))
hcl4df$Have.Children<-factor(hcl4df$Have.Children,levels=c(0,1),labels=c("No","Yes"))
#Create frequency tables for each variable overall
tabyl(hcl4df$Female)
tabyl(hcl4df$Real.Estate.Purchases)
tabyl(hcl4df$Graduate.Degree)
tabyl(hcl4df$Have.Children)
#Create frequency tables for each variable by cluster
tabyl(hcl4df,Female,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
tabyl(hcl4df,Real.Estate.Purchases,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
tabyl(hcl4df,Graduate.Degree,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
tabyl(hcl4df,Have.Children,clusterID) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits=2) %>%
adorn_ns()
1. Programming Fundamentals
As a data scientist, your main job is to turn data into actionable insights. That is a long process, and you will need to write Python code for each of its steps, so you need solid Python programming fundamentals to write efficient code for your own tasks and to understand other people's code.
Here are some of the basic Python programming fundamentals you should master (a short illustrative sketch follows the list):
Data types: Python has several built-in data types including integers, floats, and strings. It is important to know how to work with each of these data types and when to use them.
Variables: A variable is a way to store a value in a program. In Python, you can create a variable by assigning it a value using the equals sign (=).
Operators: Operators are special symbols in Python that perform specific operations on one or more operands. Some common operators include addition (+), subtraction (-), and multiplication (*).
Lists: A list is a collection of items in a specific order. Lists are useful for storing data that needs to be accessed in a specific order, or for storing multiple items of the same data type.
Dictionaries: A dictionary is a collection of key-value pairs. Dictionaries are useful for storing data that needs to be accessed using a unique key.
Control structures: Control structures are blocks of code that determine how other blocks of code are executed. Some common control structures in Python include if statements, for loops, and while loops.
Functions: A function is a block of code that performs a specific task and can be reused multiple times in a program. Defining and calling functions is an important aspect of programming in Python.
Object-Oriented Programming (OOP): a programming paradigm that uses objects, which are instances of classes, to represent and manipulate data.
Modules and packages: A module is a file containing Python code, while a package is a collection of modules. Knowing how to import and use modules and packages is essential for writing larger, more complex Python programs.
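A minimal sketch, using only the standard library, that touches several of these fundamentals at once; the average function and the grades dictionary are made-up names for illustration:

def average(scores):
    """Return the mean of a list of numbers, or 0.0 for an empty list."""
    if not scores:                      # control structure: guard clause
        return 0.0
    return sum(scores) / len(scores)    # integers and floats in arithmetic

# a dictionary mapping names (strings) to lists of test scores (floats)
grades = {"ada": [88.0, 92.5, 79.0], "grace": [95.0, 87.5]}

for name, scores in grades.items():     # for loop over key-value pairs
    print(f"{name}: average = {average(scores):.1f}")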
2. Data Manipulation & Analysis
As a data scientist, you will spend a lot of time preparing and manipulating data to make it ready for analysis and modeling, so it is important to be able to use Python to clean and prepare data of different types and sizes.
You should be able to use Python to manipulate and analyze datasets of different sizes and types efficiently. Skills in this area include working with libraries like NumPy and pandas for structured data manipulation and analysis, using PySpark for very large datasets, and using specialized libraries for other data types such as images, text, and audio when needed. A short pandas sketch follows.
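A minimal pandas sketch, assuming a hypothetical sales.csv file with date, region, units, and unit_price columns:

import pandas as pd

# read the hypothetical sales file and parse the date column
df = pd.read_csv("sales.csv", parse_dates=["date"])

df = df.dropna(subset=["units", "unit_price"])        # drop incomplete rows
df["revenue"] = df["units"] * df["unit_price"]        # vectorized column arithmetic

# total revenue per month and region
monthly = (df
           .groupby([df["date"].dt.to_period("M"), "region"])["revenue"]
           .sum()
           .reset_index())
print(monthly.head())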
3. Data Visualization
Data visualization is an important aspect of data science, as it allows you to explore and understand your data, identify patterns and trends, and communicate your findings to others. It is therefore important for data scientists to have a strong understanding and hands-on skills in data visualization tools and how to use them effectively.
There are many libraries and tools available in Python for data visualization; some of the most popular ones are listed below, followed by a short Matplotlib sketch:
Matplotlib: This is a widely used library for creating static, animated, and interactive visualizations in Python. It gives you fine-grained, low-level control over almost every element of a figure.
Seaborn: This is a library for creating statistical graphics in Python. It is built on top of Matplotlib and provides a higher-level interface for drawing attractive and informative statistical plots.
Plotly: This is a library for creating interactive visualizations in Python. Like Bokeh (below), it targets browser-based interactivity, and it also renders well inside Jupyter notebooks.
Bokeh: This is a library for creating interactive visualizations in Python. It is particularly well-suited for creating visualizations that can be displayed in web browsers.
Altair: This is a library for creating declarative statistical visualizations in Python. It is based on the Vega and Vega-Lite visualization grammars, which provide a high-level, concise syntax for creating a wide range of visualizations.
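A minimal Matplotlib sketch; the monthly revenue numbers are made up purely for illustration:

import matplotlib.pyplot as plt
import numpy as np

# made-up monthly revenue figures for one year
months = np.arange(1, 13)
revenue = np.array([12, 14, 13, 17, 19, 18, 21, 24, 22, 25, 27, 30])

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(months, revenue, marker="o")      # line plot with point markers
ax.set_xticks(months)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousands)")
ax.set_title("Monthly revenue")
fig.tight_layout()
plt.show()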
4. Data Storage and Retrieval
As a data scientist, you will mainly be working with data, whether you are retrieving raw data or storing it again after processing. Data storage and retrieval skills therefore matter because they let you manage and access the data you work with efficiently.
There are many ways to store and retrieve data in Python, depending on your needs and the nature of the data. Here are some common approaches you may encounter during your career, followed by a short sketch that touches a few of them:
Flat files: Flat files are simple text files that contain tabular data, with each row representing a record and each column representing a field. Flat files can be read and written using Python’s built-in open() function and the various methods of the file object, such as read(), readline(), and write().
CSV files: CSV (Comma Separated Values) files are a type of flat file that uses commas to separate values. They can be read and written with Python's built-in csv module or with the pandas library.
JSON files: JSON (JavaScript Object Notation) is a widely-used, human-readable data interchange format. It can be used to represent complex data structures, such as lists and dictionaries. JSON files can be read and written using Python’s built-in json module and pandas library.
Relational databases: Relational databases are powerful systems for storing and querying structured data. There are several Python libraries for interacting with popular database management systems (DBMS) such as MySQL, PostgreSQL, and SQLite. Some popular options include psycopg2 for PostgreSQL, mysql-connector-python for MySQL, and sqlite3 for SQLite.
NoSQL databases: NoSQL databases are designed to handle large amounts of unstructured data, such as that generated by social media, IoT devices, and e-commerce platforms. Some popular NoSQL databases include MongoDB, Cassandra, and Redis. Python provides various libraries for interacting with these databases, such as pymongo for MongoDB and redis-py for Redis.
Cloud storage: Cloud storage services such as Amazon S3, Google Cloud Storage, and Microsoft Azure Storage provide scalable, flexible options for storing large amounts of data in the cloud. Python provides libraries for accessing these services, such as boto3 for Amazon S3 and google-cloud-storage for Google Cloud Storage.
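A minimal sketch touching a few of these options, assuming a hypothetical customers.csv file and writing two local example files (config.json and example.db):

import json
import sqlite3
import pandas as pd

# CSV: read the hypothetical flat file into a DataFrame
df = pd.read_csv("customers.csv")

# JSON: write and read a nested structure with the built-in json module
with open("config.json", "w") as fh:
    json.dump({"retries": 3, "regions": ["us-east-1", "eu-west-1"]}, fh)
with open("config.json") as fh:
    config = json.load(fh)

# SQLite: store the DataFrame in a relational table and query it back
conn = sqlite3.connect("example.db")
df.to_sql("customers", conn, if_exists="replace", index=False)
count = pd.read_sql_query("SELECT COUNT(*) AS n FROM customers", conn)
conn.close()
print(config, count)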
5. Applied Machine & Deep Learning
Applied machine learning and deep learning are both important Python skills for data scientists to master. Machine learning involves the use of algorithms and statistical models to enable computers to improve their performance on a given task without explicitly being programmed to perform that task. This is accomplished by training the machine learning model on a dataset and allowing it to learn the relationships and patterns within the data.
Deep learning, on the other hand, involves the use of artificial neural networks to learn and make decisions. Deep learning has proven to be particularly effective in image and speech recognition, natural language processing, and even playing games.
In order to apply machine learning and deep learning in Python, it is important to have a strong understanding of the various algorithms and libraries available. Here are three of the most common libraries you should master as a data scientist, followed by a short scikit-learn sketch:
Scikit-learn: A machine learning library for Python, which provides a wide range of algorithms for classification, regression, clustering, and dimensionality reduction, as well as tools for model evaluation and selection.
TensorFlow: An open-source library for deep learning developed by Google, which provides tools for building, training, and deploying machine learning models.
Keras: A high-level neural network library built on top of TensorFlow, which provides a convenient interface for defining and training deep learning models.
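A minimal scikit-learn sketch using its built-in diabetes regression dataset; the 70/30 split and the linear model are arbitrary choices for illustration:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# built-in regression dataset, split into training and test partitions
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)         # fit on the training partition
pred = model.predict(X_test)        # predict on unseen data
print("MAE:", mean_absolute_error(y_test, pred))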