Explore and analyze data with Python Flashcards
Data exploration and analysis
The role of a data scientist primarily involves exploring and analyzing data. Data scientists begin their work with data, and Python is the most popular programming language they use for that work.
Role of a data scientist: exploring and analyzing data
Python provides extensive functionality with powerful statistical and numerical libraries:
NumPy and Pandas simplify analyzing and manipulating data
Matplotlib provides attractive data visualizations
Scikit-learn offers simple and effective predictive data analysis
TensorFlow and PyTorch supply machine learning and deep learning capabilities
Usually, a data analysis project is designed to establish insights around a particular scenario or to test a hypothesis.
What is NumPy?
NumPy is a Python library that provides functionality comparable to mathematical tools such as MATLAB and R. It significantly simplifies the user experience while offering comprehensive mathematical functions.
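As a minimal sketch (the grade values here are made up for illustration), NumPy lets you treat a list of numbers as an array and apply mathematical operations to the whole array at once:

```python
import numpy as np

# Hypothetical list of student grades
grades = [50, 50, 47, 97, 49, 3]

# Converting the list to a NumPy array unlocks vectorized math
grades_array = np.array(grades)
print(grades_array.mean())  # arithmetic mean of all the grades
```

The same mean calculation on a plain Python list would require an explicit loop or a call to `sum(...) / len(...)`; the array form scales to large datasets and many more statistical operations.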
What is Pandas?
Pandas is an extremely popular Python library for data analysis and manipulation. Think of Pandas as a spreadsheet application for Python, providing easy-to-use functionality for working with data tables.
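For example, a small DataFrame can be built from a dictionary (the student names and grades here are purely illustrative):

```python
import pandas as pd

# Hypothetical table of students and their grades
df_students = pd.DataFrame({
    'Name': ['Dan', 'Joann', 'Pedro'],
    'Grade': [50, 50, 47]
})

# Column operations work much like spreadsheet formulas
print(df_students['Grade'].mean())
```

Each column of a DataFrame is a Pandas Series, so per-column aggregations like `mean`, `min`, and `max` are available directly.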
Explore data in a Jupyter notebook
Jupyter notebooks are a popular way of running basic scripts in your web browser. Typically, a notebook is a single webpage, broken up into text sections and code sections that are executed on the server rather than your local machine. By running code in Jupyter notebooks on a server, you can get started quickly without needing to install Python or other tools on your local computer.
Testing hypotheses
Data exploration and analysis is typically an iterative process, in which the data scientist takes a sample of data and performs the following kinds of tasks to analyze it and test hypotheses:
Clean data to handle errors, missing values, and other issues.
Apply statistical techniques to better understand the data and how the sample might be expected to represent the real-world population of data, allowing for random variation.
Visualize data to determine relationships between variables, and in the case of a machine learning project, identify features that are potentially predictive of the label.
Revise the hypothesis and repeat the process.
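The iterative steps above might be sketched roughly like this (the sample data and column names are assumptions for illustration, not the course's actual dataset):

```python
import pandas as pd

# Hypothetical sample with some missing values
df = pd.DataFrame({
    'StudyHours': [10.0, None, 9.0, 8.5],
    'Grade': [50.0, 50.0, 47.0, None]
})

# 1. Clean: fill missing numeric values with each column's mean
df_clean = df.fillna(df.mean(numeric_only=True))

# 2. Apply statistical techniques to summarize the sample
print(df_clean.describe())

# 3. Inspect relationships between variables, e.g. correlation between
#    study hours and grade (a candidate predictive feature)
print(df_clean['StudyHours'].corr(df_clean['Grade']))
```

In a real project you would then revise the hypothesis based on what the statistics and visualizations reveal, draw a fresh sample if needed, and repeat.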
Handling missing values
One of the most common issues data scientists need to deal with is incomplete or missing data. So how do you know whether a DataFrame contains missing values? You can use the isnull method to identify which individual values are null.
df_students.isnull()
Of course, with a larger DataFrame, it would be inefficient to review all of the rows and columns individually, so you can get the count of missing values for each column like this:
df_students.isnull().sum()
To see them in context, you can filter the DataFrame to include only rows where any of the columns (axis 1 of the DataFrame) are null.
df_students[df_students.isnull().any(axis=1)]
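Once located, missing values are commonly handled by either dropping the affected rows or imputing a replacement value. A sketch, using a hypothetical df_students built here so the example is self-contained:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame containing some null values
df_students = pd.DataFrame({
    'Name': ['Dan', 'Joann', 'Pedro'],
    'StudyHours': [10.0, np.nan, 9.0],
    'Grade': [50.0, 50.0, np.nan]
})

# Option 1: drop any row that contains a null value
df_dropped = df_students.dropna(axis=0, how='any')

# Option 2: impute, e.g. replace missing study hours with the column mean
mean_hours = df_students['StudyHours'].mean()
df_students['StudyHours'] = df_students['StudyHours'].fillna(mean_hours)
```

Dropping rows is simplest but discards data; imputing preserves the sample size at the cost of introducing estimated values, so the right choice depends on how much data is missing and why.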