Explore and analyze data with Python Flashcards

1
Q

Data exploration and analysis

A
2
Q

The role of a data scientist primarily involves exploring and analyzing data. Data scientists begin their work with data, and Python is the most popular programming language they use to work with it.

A

Role of a data scientist: exploring and analyzing data

3
Q

Python provides extensive functionality with powerful statistical and numerical libraries:

NumPy and Pandas simplify analyzing and manipulating data
Matplotlib provides attractive data visualizations
Scikit-learn offers simple and effective predictive data analysis
TensorFlow and PyTorch supply machine learning and deep learning capabilities

A
4
Q

Usually, a data analysis project is designed to establish insights around a particular scenario or to test a hypothesis.

A
5
Q

What is NumPy?

NumPy is a Python library that provides functionality comparable to mathematical tools such as MATLAB and R. While NumPy significantly simplifies the user experience, it also offers comprehensive mathematical functions.
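
To make the idea of array-based math concrete, here is a minimal sketch. The grade values and variable names are hypothetical and chosen only for illustration; the calls used (np.array, mean, boolean indexing) are standard NumPy.

```python
import numpy as np

# Hypothetical sample of student grades (illustrative values only)
grades = np.array([50, 50, 47, 97, 49, 3, 53, 42, 26, 74])

# Aggregate over the whole array in one call
mean_grade = grades.mean()

# Element-wise arithmetic: every value is divided by 100, no loop needed
scaled = grades / 100

# Boolean indexing: keep only the grades above the mean
above_mean = grades[grades > mean_grade]

print(grades.shape)
print(mean_grade)
print(above_mean)
```

Operating on the whole array at once, rather than looping over values, is the main convenience NumPy adds over plain Python lists.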

A
6
Q

What is Pandas?

Pandas is an extremely popular Python library for data analysis and manipulation. Pandas is like a spreadsheet application for Python—providing easy-to-use functionality for data tables.
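
The spreadsheet comparison can be sketched as follows. The table contents and the names df_example, top, and mean_grade are assumptions made up for this illustration; pd.DataFrame, column selection, and row filtering are standard Pandas features.

```python
import pandas as pd

# Hypothetical data table: one row per student, one column per field
df_example = pd.DataFrame({
    "Name": ["Dan", "Joann", "Pedro"],
    "StudyHours": [10.0, 11.5, 9.0],
    "Grade": [50.0, 50.0, 47.0],
})

# Filter rows, much like a spreadsheet filter
top = df_example[df_example["StudyHours"] > 9.5]

# Summarize a column, much like a spreadsheet formula
mean_grade = df_example["Grade"].mean()

print(top)
print(mean_grade)
```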

A
7
Q

Explore data in a Jupyter notebook

Jupyter notebooks are a popular way of running basic scripts using your web browser. Typically, these notebooks are a single webpage, broken up into text sections and code sections that are executed on the server rather than your local machine. By running code in Jupyter notebooks on a server, you can get started quickly without needing to install Python or other tools on your local computer.

A
8
Q

Testing hypotheses
Data exploration and analysis is typically an iterative process, in which the data scientist takes a sample of data and performs the following kinds of tasks to analyze it and test hypotheses:

Clean data to handle errors, missing values, and other issues.
Apply statistical techniques to better understand the data and how the sample might be expected to represent the real-world population of data, allowing for random variation.
Visualize data to determine relationships between variables, and in the case of a machine learning project, identify features that are potentially predictive of the label.
Revise the hypothesis and repeat the process.
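
One pass through the tasks above might look roughly like the sketch below. The sample data, the column names, and the choice to drop incomplete rows are all assumptions for illustration. The plotting step is left out, and a correlation coefficient stands in for the relationship a chart would show.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the last row has a missing StudyHours value
sample = pd.DataFrame({
    "StudyHours": [10.0, 11.5, 9.0, 16.0, 9.25, np.nan],
    "Grade": [50.0, 50.0, 47.0, 97.0, 49.0, 3.0],
})

# 1. Clean: drop rows with missing values (one possible strategy among several)
cleaned = sample.dropna()

# 2. Apply statistics: summarize each column of the cleaned sample
summary = cleaned.describe()

# 3. Look for a relationship: Pearson correlation between the two variables
corr = cleaned["StudyHours"].corr(cleaned["Grade"])

print(summary)
print(corr)
```

If the result does not support the hypothesis, the next iteration would revise it and repeat these steps, possibly with a different cleaning strategy or a different sample.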

A
9
Q

Handling missing values
One of the most common issues data scientists need to deal with is incomplete or missing data. So how can you tell whether a DataFrame contains missing values? You can use the isnull method to identify which individual values are null.

A

df_students.isnull()

Of course, with a larger DataFrame, it would be inefficient to review all of the rows and columns individually, so we can get the sum of missing values for each column like this:

df_students.isnull().sum()

To see them in context, we can filter the DataFrame to include only rows where any of the columns (axis 1 of the DataFrame) are null.

df_students[df_students.isnull().any(axis=1)]
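
The three calls above can be tried end to end with a small stand-in table. The contents of df_students here are hypothetical, since the original data is not shown on this card; the names missing_per_column and rows_with_missing are also introduced only for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_students, with a few missing values
df_students = pd.DataFrame({
    "Name": ["Dan", "Joann", "Bill", "Ted"],
    "StudyHours": [10.0, 11.5, 8.0, np.nan],
    "Grade": [50.0, 50.0, np.nan, np.nan],
})

# True/False for every individual cell
null_mask = df_students.isnull()

# Count of missing values in each column
missing_per_column = null_mask.sum()

# Only the rows in which at least one column is null
rows_with_missing = df_students[null_mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

In this stand-in table, the filter keeps the rows for Bill and Ted, because each has at least one null column.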
