DTSC 520: Fundamentals of Data Science
Introduction to foundational concepts, technologies, and theories of data and data science. This includes methods in data acquisition, cleaning, and visualization. Taught in Python using NumPy, Pandas, Matplotlib, and Seaborn. Includes an introduction to Python, IPython, and Jupyter Notebooks.
NumPy-1
NumPy-2
NumPy-3
NumPy-4
Pandas-1
Pandas-2
Matplotlib-1
Seaborn-1
You can run Matplotlib from a script, an IPython shell, or a notebook. (True)
To enable Matplotlib plotting in an IPython shell or notebook, use the magic command (%matplotlib)
Matplotlib has two interfaces: one based on its MATLAB-style origins, and the other on Python's object-oriented approach. (True)
To display our matplotlib plot, we use ( plt.show() )
The MATLAB style interface creates an object. (False)
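A minimal sketch of the two interfaces side by side. The data and titles are made up for illustration, and the non-interactive "Agg" backend is an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# MATLAB-style interface: plt.* calls act on an implicit "current" figure and axes
plt.figure()
plt.plot(x, np.sin(x))
plt.title("MATLAB-style")

# Object-oriented interface: plotting methods belong to explicit Figure and Axes objects
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))
ax.set_title("Object-oriented")
```

In a script with an interactive backend, a final plt.show() would display both figures.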
To create a pie chart in Matplotlib, you must first calculate value counts of columns. (True)
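A small sketch of that workflow, assuming a hypothetical DataFrame named pets with one categorical column; value_counts() supplies the wedge sizes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one categorical column
pets = pd.DataFrame({"animal": ["cat", "dog", "dog", "bird", "dog", "cat"]})

# Count how often each category occurs
counts = pets["animal"].value_counts()

# Each wedge is sized by its count and labeled with its category
fig, ax = plt.subplots()
ax.pie(counts.values, labels=list(counts.index))
```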
To create a scatterplot with height on the x-axis, and weight on the y-axis, how would you complete the following?
fig, ax = plt.subplots()
ax.scatter('height','weight', data = example)
plt.show()
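For reference, a self-contained version of the snippet above. The contents of example are hypothetical, and the "Agg" backend is an assumption so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with the two columns named in the question
example = pd.DataFrame({"height": [150, 160, 170, 180],
                        "weight": [50, 58, 66, 80]})

fig, ax = plt.subplots()
# Passing column names plus data= lets Matplotlib look the columns up in the DataFrame
points = ax.scatter("height", "weight", data=example)
ax.set_xlabel("height")
ax.set_ylabel("weight")
```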
Matplotlib is built on Seaborn. (False)
By convention, you alias seaborn as (sns)
In Seaborn, histograms are included by default in which plot? (Distplots)
By default, a Seaborn distplot contains which of the following? (Histogram)
Seaborn barplots show only the frequency of observation across groups. (False)
Seaborn catplots can show the relationship between a numerical and one or more categorical variables. (True)
The box in a boxplot is based on which values? (Quartiles)
NumPy Questions
How many different data types can be present in a single array? (One)
Which of the following is/are true? (Python allows users to create heterogeneous lists) (A single integer in Python contains four pieces of information)
A single integer in Python 3.4 actually contains four pieces:
ob_refcnt: a reference count that helps Python silently handle memory allocation and deallocation.
ob_type: which encodes the type of the variable.
ob_size: which specifies the size of the following data members.
ob_digit: which contains the actual integer value that we expect the Python variable to represent.
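A short sketch contrasting a heterogeneous Python list with a single-type NumPy array. The values are arbitrary.

```python
import numpy as np

# A Python list can mix types; each element carries its own type information
mixed = [True, "2", 3.0, 4]
types = [type(item).__name__ for item in mixed]

# A NumPy array holds one data type; mixing int and float upcasts to float
ints = np.array([1, 2, 3])
upcast = np.array([1, 2, 3.5])

print(types)         # ['bool', 'str', 'float', 'int']
print(upcast.dtype)  # float64
```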
In order to create NumPy arrays from Python lists, we use (np.array)
To access a subarray of a defined array "x", you use slice notation. Fill in the blank for the following generalized NumPy syntax for slicing. ( x[start:stop:step] )
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]); x[3:7] returns array([3, 4, 5, 6]): the element at index 3 is included, the element at index 7 is not.
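A few more slices of the same array, as a quick illustration of how start, stop, and step interact.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

middle = x[3:7]         # start is included, stop is excluded
every_other = x[::2]    # step of 2 over the whole array
reversed_x = x[::-1]    # negative step walks backwards
stepped = x[1:8:3]      # indices 1, 4, 7

print(middle)       # [3 4 5 6]
print(every_other)  # [0 2 4 6 8]
print(stepped)      # [1 4 7]
```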
What is concatenation? (Joining arrays)
What is the opposite of concatenation? (Splitting)
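A minimal sketch of joining and splitting with np.concatenate and np.split. The arrays are arbitrary.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Concatenation joins arrays end to end
joined = np.concatenate([a, b])

# Splitting is the reverse: [3] is the index at which to cut
left, right = np.split(joined, [3])

print(joined)       # [1 2 3 4 5 6]
print(left, right)  # [1 2 3] [4 5 6]
```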
Dictionaries can be thought of as (key-value pairings)
How to delete a dictionary entry ( del dict_1['Philadelphia'] )
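A quick sketch of removing one key-value pair. The contents of dict_1 are hypothetical.

```python
# Hypothetical dictionary mapping cities to state abbreviations
dict_1 = {"Philadelphia": "PA", "Boston": "MA"}

# del removes the key and its value; the rest of the dictionary is untouched
del dict_1["Philadelphia"]

print(dict_1)  # {'Boston': 'MA'}
```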
The ufunc for the multiplication operation is (np.multiply)
The ufunc for the subtraction operation is (np.subtract)
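A short sketch showing that the arithmetic operators on arrays correspond to these ufuncs. The arrays are arbitrary.

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

product = np.multiply(a, b)     # same result as a * b
difference = np.subtract(a, b)  # same result as a - b

print(product)     # [10 40 90]
print(difference)  # [ 9 18 27]
```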
ndim is an array attribute that tells you the number of (axes)
The total size of an array is indicated by the array attribute (size)
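These attributes can be checked on any array. A minimal sketch with an arbitrary 3-by-4 array:

```python
import numpy as np

grid = np.zeros((3, 4))

print(grid.ndim)   # 2, the number of axes
print(grid.shape)  # (3, 4), the length of each axis
print(grid.size)   # 12, the total number of elements
```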
np.random.seed(1)
x = np.random.randint(10, size=(4,4))
print(x)
x[:3, :2] (first three rows and first two columns)
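A sketch that checks the shape of that slice. The name sub is introduced here for illustration only.

```python
import numpy as np

np.random.seed(1)
x = np.random.randint(10, size=(4, 4))

# The row slice comes first, then the column slice, separated by a comma
sub = x[:3, :2]

print(sub.shape)  # (3, 2)
```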
In a multidimensional array, each row is separated by a (comma)
What is the standard mutable multi-element container in Python? (list)
The three fundamental Pandas data structures are (Series), (DataFrame), and (Index)
A Pandas Series can be thought of as a specialization of a Python (dictionary)
A Pandas (Series) is a one-dimensional array of indexed data.
To construct a Pandas DataFrame, we use (pd.DataFrame)
To construct a Pandas Series, we use (pd.Series)
If we are given code for the following Pandas Series, what is the output?
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
index = ["a", "b", "c", "d", "e"])
x["c"] (3.0)
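A runnable version of the example above, with one extra label-based slice added for illustration.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=["a", "b", "c", "d", "e"])

single = x["c"]         # look up one value by its index label
label_slice = x["b":"d"]  # label-based slices include the end label

print(single)                # 3.0
print(label_slice.tolist())  # [2.0, 3.0, 4.0]
```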
A (DataFrame) is a two-dimensional array of indexed data with flexible row indices and flexible column names.
Which of the following is/are true? (Pandas objects are more flexible than NumPy's ndarray data structure)
You can NOT create a Series from NumPy arrays. (False)
Pandas Series include both values and indices, both of which can be accessed. (True)
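A minimal sketch of building a Series from a NumPy array and reading back its two parts. The numbers are arbitrary.

```python
import numpy as np
import pandas as pd

arr = np.array([0.25, 0.5, 0.75, 1.0])

# With no index given, pandas assigns the default integer index 0..n-1
s = pd.Series(arr)

print(s.values)       # the underlying NumPy array
print(list(s.index))  # [0, 1, 2, 3]
```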
Which of the following will successfully create a Pandas Series with index values 'a', 'b', 'c', and 'd'?
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index = ['a', 'b', 'c', 'd'])
The primary difference between a Pandas Series and a NumPy array is the (presence of the index)
If you want to import data from a .csv, which of the following would you use? ( pd.read_csv('name.csv') )
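pandas reads CSV data with pd.read_csv. The sketch below uses an in-memory string through io.StringIO instead of a file on disk, so it is self-contained; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real call would pass a file path such as 'name.csv'
csv_text = "state,area\nA,100\nB,200\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['state', 'area']
```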
A Series is a two-dimensional DataFrame. (False)
You can create a Pandas Index as its own object. (True)
I want to access all states with an area greater than 400,000. Which of the following would accomplish that? ( data[data['area'] > 400000] )
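A sketch of that Boolean-mask selection. The state names and areas are placeholders, not real figures.

```python
import pandas as pd

# Hypothetical data: placeholder state names with made-up areas
data = pd.DataFrame({"area": [500000, 150000, 650000]},
                    index=["State A", "State B", "State C"])

# The comparison builds a Boolean Series; indexing with it keeps the True rows
large = data[data["area"] > 400000]

print(list(large.index))  # ['State A', 'State C']
```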
Pandas will automatically align (indices) when passing objects to the ufunc.
When conducting operations with DataFrames, indices are aligned. (True)
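A small sketch of alignment using two Series with partly overlapping labels. The labels and values are arbitrary.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Values are matched by label, not by position;
# labels present in only one operand produce NaN
total = a + b

print(total["y"])            # 12.0
print(total.isnull().sum())  # 2
```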
isnull() generates a boolean mask that indicates missing values. Where there is a missing value, the mask says "True" and where there is not a missing value, the mask says "False"
Missing floating point values are indicated with (NaN) , which stands for Not a Number.
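A short sketch of the mask that isnull() produces. The values are arbitrary; note that None in a float Series is also stored as NaN.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, None])

# True where a value is missing, False where it is present
mask = data.isnull()

print(mask.tolist())  # [False, True, False, True]
print(data.dtype)     # float64
```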
Which of the following are true? (There are many ways of dealing with missing data, but none is universally agreed upon).
One way of dealing with missing values is by using a (mask) that globally indicates missing values. The mask might be an entirely separate Boolean array.
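The idea of a separate Boolean array marking missing entries can be sketched with NumPy masked arrays, shown here next to the pandas fillna method for comparison. The values and the choice of 0.0 as the fill value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Separate Boolean array: True marks an entry to treat as missing
values = np.array([1.0, 2.0, -999.0, 4.0])
missing = np.array([False, False, True, False])
masked = np.ma.masked_array(values, mask=missing)

# Masked entries are skipped, so the mean uses only 1.0, 2.0, and 4.0
masked_mean = masked.mean()

# pandas marks missing values with NaN instead; fillna replaces them
filled = pd.Series([1.0, np.nan, 3.0]).fillna(0.0)

print(filled.tolist())  # [1.0, 0.0, 3.0]
```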