Module 6: Data Collection and Cleaning Flashcards
What are APIs
Application programming interfaces (APIs) are the primary method to querying for and retrieving data from the internet and webpages, they are the main method used for enterprises to deliver services and consume data feeds. APIs is the way that applications talk to one another, or how your program connects to external sources like websites or apps:
- access data: google’s weather app, for example, retrieves information from the weather network
HTML vs XML
HTML:
Hypter Text Markup Language (HTML) is quite literally a language for specifying the rules for displaying content. It is simple to assume that HTML has the same structure as XML. HTML can be thought of as instructions on what to show the user. HTML includes a head (contains the information that is supposed to be loaded when you request the website) and body (the main content of the website).
HTML vs XML:
- Purpose: HTML is for creating web pages; XML is for storing and transporting data.
- Flexibility: XML is more flexible as it allows custom tag definitions, while HTML has a fixed set of tags for specific purposes.
XML has no pre-defined elements. It is used to define other file formats. As such, XML has two ways to be evaluated.
- Validity: if the XML in question is well formed and also adheres to definition or schema, then it is valid
- Well formedness: if elements are strictly nested as a tree format and there is no ambiguity as to what is an attribute, element, and text, then the file is well-formed.
XML supports hierarchical nested data with metadata.
Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples include lxml, Beautiful Soup, and html5lib.
Javascript and JSON
JavaScript is the programming language that helps with the user interacting with the website, getting responses back from the websites. It acts as the primary progamming layer for all client-side requests. It allows for the display of pages to change during the time a webpage is viewed.
JavaScript uses the Document Object Model (DOM), which is a data structure representing the current state of the web page content, and contains the CSS, HTML, XML content that are placed by the web browser into the DOM, which the browser then manipulates using JavaScript. The DOM is the active state of the web page, JavaScript manipulations are forced to only trigger event listeners (like a mouse click) attached on elements, and then run a handler/function for reacting to listened events.
Since the browser already uses DOM to represent all client-side data, it is often easier to use JavaScript values and objects directly to manipulate it. Hence, JSON (short for JavaScript Object Notation), is used to transmit message content and has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. JSON is basically a dictionaries and arrays, with strings and numbers, Booleans and nulls. All of the keys in an object must be strings. JSON is injectable code, and it is allowed because it is assumed that your client already trusts the server, otherwise it wouldn’t have connected to the URL.
There are several Python libraries for reading and writing JSON data. I’ll use json here, as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads:
In [68]: import json
In [69]: result = json.loads(obj)
In [70]: result
Out[70]:
{‘name’: ‘Wes’,
‘cities_lived’: [‘Akron’, ‘Nashville’, ‘New York’, ‘San Francisco’],
‘pet’: None,
‘siblings’: [{‘name’: ‘Scott’,
‘age’: 34,
‘hobbies’: [‘guitars’, ‘soccer’]},
{‘name’: ‘Katie’, ‘age’: 42, ‘hobbies’: [‘diving’, ‘art’]}]
How you convert a JSON object or list of objects to a DataFrame or some other data structure for analysis will be up to you. Conveniently, you can pass a list of dictionaries (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:
In [73]: siblings = pd.DataFrame(result[“siblings”], columns=[“name”, “age”])
In [74]: siblings
Out[74]:
name age
0 Scott 34
1 Katie 42
webscraping vs APIs
Web Scraping
Definition: Web scraping involves programmatically extracting data from web pages. This is usually done by sending HTTP requests to a website and parsing the HTML or XML of the web pages to retrieve the desired information.
Techniques: It often uses libraries like BeautifulSoup or Scrapy in Python, which help navigate and parse the document structure of web pages.
Use Cases: Web scraping is useful when:
An API is not available.
You need data from multiple pages or sites.
The structure of the data is not easily accessible through standard methods.
Challenges: It can face legal and ethical issues, especially if scraping a site violates its terms of service. Additionally, websites can change their layout, which may break scraping scripts.
APIs (Application Programming Interfaces)
Definition: An API is a set of rules and protocols for building and interacting with software applications. It allows one piece of software to communicate with another, often providing a way to retrieve or send data in a structured format (like JSON or XML).
Use Cases: APIs are ideal when:
A service provides an official API for developers.
You need reliable and consistent access to data.
You want to interact with services programmatically (e.g., posting data, retrieving user information).
Benefits: APIs are generally more stable and easier to use compared to web scraping. They often include documentation, rate limits, and structured data formats that make integration straightforward.
Summary
In short, web scraping is great for extracting data when no other options are available, while APIs provide a more stable and reliable way to interact with data and services when they are offered. Depending on your needs and the data source, one method may be more suitable than the other.
GET and POST requests in APIs
GET
GET is one of the most common HTTP methods you’ll use when working with REST APIs. This method allows you to retrieve resources from a given API. GET is a read-only operation, so you shouldn’t use it to modify an existing resource.
To test out GET and the other methods in this section, you’ll use a service called JSONPlaceholder. This free service provides fake API endpoints that send back responses that requests can process.
To try this out, start up the Python REPL and run the following commands to send a GET request to a JSONPlaceholder endpoint:
Python
»> import requests
»> api_url = “https://jsonplaceholder.typicode.com/todos/1”
»> response = requests.get(api_url)
»> response.json()
{‘userId’: 1, ‘id’: 1, ‘title’: ‘delectus aut autem’, ‘completed’: False}
Beyond viewing the JSON data from the API, you can also view other things about the response:
Python
»> response.status_code
200
> > > response.headers[“Content-Type”]
‘application/json; charset=utf-8’
POST
Now, take a look at how you use requests to POST data to a REST API to create a new resource. You’ll use JSONPlaceholder again, but this time you’ll include JSON data in the request. Here’s the data that you’ll send:
JSON
{
“userId”: 1,
“title”: “Buy milk”,
“completed”: false
}
This JSON contains information for a new todo item. Back in the Python REPL, run the following code to create the new todo:
Python
»> import requests
»> api_url = “https://jsonplaceholder.typicode.com/todos”
»> todo = {“userId”: 1, “title”: “Buy milk”, “completed”: False}
»> response = requests.post(api_url, json=todo)
»> response.json()
{‘userId’: 1, ‘title’: ‘Buy milk’, ‘completed’: False, ‘id’: 201}
> > > response.status_code
201
Once you’ve made a GET request you can also convert the JSON into dataframes using Pandas:
Convert JSON data to pandas DataFrame
df = pd.DataFrame(json_data)
Display the DataFrame (optional)
print(df.head()) # Print the first few rows to verify
output:
DataFrame created from JSON data:
userId id title body
0 1 1 sunt aut facere repellat provident occaecati e… quia et suscipit\nsuscipit repellat esse qui…
1 1 2 qui est esse est rerum tempore vitae\nsequi sint nihil r…
2 1 3 ea molestias quasi exercitationem repellat qui… et iusto sed quo iure\nvoluptatem occaecati o…
3 1 4 eum et est occaecati ullam et saepe reiciendis voluptatem adipisci…
4 1 5 nesciunt quas odio repudiandae veniam quaerat sunt sed\nalias au…
Now that you have the data in a DataFrame, you can perform various analyses, such as calculating statistics, grouping data, or visualizing it:
Example analysis: Calculate summary statistics
summary_stats = df.describe()
print(“\nSummary statistics:”)
print(summary_stats)
Example analysis: Group by a column and calculate mean
mean_by_userId = df.groupby(‘userId’)[‘id’].count()
print(“\nMean number of posts by userId:”)
print(mean_by_userId)
Summary statistics:
userId id
count 100.00000 100.00000
mean 1.50000 50.50000
std 0.50252 29.01149
min 1.00000 1.00000
25% 1.00000 25.75000
50% 1.50000 50.50000
75% 2.00000 75.25000
max 2.00000 100.00000
Mean number of posts by userId:
userId
1 50
2 50
Name: id, dtype: int64
What does it mean to have clean data
Once data is collected, it must be prepared, meaning it must be:
- Cleaned
- Missing data must be handled
- Transformed into meaningful indicators and measures
First, what does clean data actually mean:
- Each dataset column represents one variable
- Each observation forms a row
- Each type of observational unit forms a table
Tidy data eases variable extraction because of the standardized structure of the dataset. One way of organizing the variables is by their role and use in the analysis, are they for indexing an observation (treatment type or timestamp) or is it an actual measured value in an experiment. Indexes should come first followed by measured variables.
Methods for missing data: imputation, iterative imputation, dropping values, flagging values, using predictive modelling, time series methods
cleanedDF[‘c’] = cleanedDF[‘c’].fillna(cleanedDF[‘c’].median())
Imputation
Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.
python
Copy code
import pandas as pd
df[‘column’].fillna(df[‘column’].mean(), inplace=True)
K-Nearest Neighbors (KNN) Imputation: Use KNN to estimate missing values based on similar instances.
python
Copy code
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)
Iterative Imputation: Model each feature with missing values as a function of other features and iteratively impute.
python
Copy code
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
df_imputed = imputer.fit_transform(df)
- Dropping Missing Values
Drop Rows: Remove any rows with missing data.
python
Copy code
df.dropna(inplace=True)
Drop Columns: Remove any columns with a significant amount of missing data.
python
Copy code
df.dropna(axis=1, inplace=True)
- Flagging Missing Values
Create an indicator variable to flag rows with missing values.
python
Copy code
df[‘missing_flag’] = df[‘column’].isnull().astype(int) - Using Predictive Models
Build a predictive model to predict missing values based on other features.
python
Copy code
from sklearn.ensemble import RandomForestRegressor
Example for a single feature with missing values
model = RandomForestRegressor()
model.fit(train_data.drop(‘column_with_missing’, axis=1), train_data[‘column_with_missing’])
df.loc[df[‘column_with_missing’].isnull(), ‘column_with_missing’] = model.predict(df.loc[df[‘column_with_missing’].isnull()].drop(‘column_with_missing’, axis=1))
- Using Time Series Methods
For time series data, forward-fill or backward-fill missing values based on adjacent values.
python
Copy code
df.fillna(method=’ffill’, inplace=True) # Forward fill
df.fillna(method=’bfill’, inplace=True) # Backward fill
what is binning in Pandas
In data analysis, binning and permutation are essential techniques for categorizing continuous data and sampling rows, respectively. Here’s a detailed explanation of each process with example code:
________________________________________
Binning with cut() and qcut()
Binning is the process of converting continuous data into discrete categories or bins. This is useful for summarizing and analyzing data in a more interpretable format.
Using cut()
The cut() function bins continuous data into discrete intervals.
Example:
import pandas as pd
Sample DataFrame
df4 = pd.DataFrame({‘d’: [4, 8, 12]}, index=[‘hello’, ‘world’, ‘NaN’])
Define bins for the ‘d’ column
bins = [-100, -10, -5, 0, 5, 10, 100]
Use cut() to bin the data into the specified intervals
binned = pd.cut(df4[“d”], bins, right=True, labels=None)
print(binned)
message
hello (0, 5]
world (5, 10]
NaN (10, 100]
Name: d, dtype: category
Categories (5, interval[int64]): [(-110, -5] < (-5, 0] < (0, 5] < (5, 10] < (10, 100]]
Custom Bin labels:
Assign custom labels to the bins
bin_labels = “a b c d e f g h i j”.split(“ “)
binned_custom_labels = pd.cut(df4[“d”], 10, right=True, labels=bin_labels)
print(binned_custom_labels)
message
hello a
world e
NaN j
Name: d, dtype: category
Categories (10, object): [a < b < c < d … g < h < i < j]
Using qcut()
The qcut() function bins data based on quantiles, dividing the data into bins with approximately equal number of observations.
# Bin the data into quantiles
quantile_bins = pd.qcut(df4[“d”], 100, labels=list(range(100)))
print(quantile_bins)
message
hello 0
world 49
NaN 99
Name: d, dtype: category
Categories (100, int64): [0 < 1 < 2 < 3 … 96 < 97 < 98 < 99]
What are permutations in Pandas
Permutation is the process of randomly reordering elements in a Series or DataFrame. It’s useful for simulations, creating random trials, and shuffling data.
Permuting Rows with numpy.random.permutation()
Example:
import numpy as np
Create a DataFrame
df = pd.DataFrame(np.arange(20).reshape((5, 4)))
Generate a random permutation of row indices
sampler = np.random.permutation(5)
Apply the permutation to reorder rows
df_permuted = df.take(sampler)
print(df_permuted)
0 1 2 3 1 4 5 6 7 4 16 17 18 19 2 8 9 10 11 3 12 13 14 15 0 0 1 2 3
- Explanation:
o Rows are reordered according to a random permutation of indices.
Sampling Rows with DataFrame.sample()
Example:
# Sample 10 rows with replacement
df_sampled = df.sample(n=10, replace=True)
print(df_sampled)
0 1 2 3 0 0 1 2 3 2 8 9 10 11 3 12 13 14 15 4 16 17 18 19 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 4 16 17 18 19 0 0 1 2 3 1 4 5 6 7
Explanation:
* 10 rows are sampled with replacement, allowing for duplicate rows.