Unit 9 Flashcards
Two distinctions with data
What does the data show - fact
Why might this be the case - opinion
correlation
similarities, patterns
causation
this thing caused that thing
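A minimal sketch of measuring correlation, assuming made-up numbers and a hand-written helper named pearson (both are illustrative, not from the unit). It shows that two columns can move together perfectly without either one causing the other.

```python
# Hypothetical data, invented for illustration only.
ice_cream_sales = [10, 20, 30, 40, 50]
sunburn_cases = [1, 2, 3, 4, 5]

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# A value near 1 means the columns rise together (a pattern, i.e. correlation).
# It does not show that ice cream causes sunburn; a third factor such as
# sunny weather could explain both.
r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 2))
```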
metadata
Data about other data
Can help us uncover the “why” questions (sometimes gathered automatically)
Metadata are data about data:
It can be changed without impacting the primary data
Used for finding, organizing, and managing information
Increases effective use of data by providing extra information
Allows data to be structured and organized
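A small sketch of the idea that metadata can change without touching the primary data. The photo record and its field names are hypothetical, chosen only for illustration.

```python
import copy

# Hypothetical record: "data" is the primary data (pixel values),
# "metadata" is data about that data.
photo = {
    "data": [[0, 255], [255, 0]],
    "metadata": {"author": "unknown", "date_taken": "2019-01-11"},
}

# Keep a copy of the primary data so we can compare afterwards.
original_data = copy.deepcopy(photo["data"])

# Change only the metadata.
photo["metadata"]["author"] = "A. Student"

# The primary data is exactly what it was before.
data_unchanged = photo["data"] == original_data
print(data_unchanged)
```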
visualizations
Look at lots of data at once
See patterns that are “invisible” if you just look at the table
data analysis process
- collect or choose data
- clean and/or filter
- visualize and find patterns
- generate new information
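The four steps above can be sketched in a few lines. The survey answers are made up, and the text "chart" stands in for a real visualization tool.

```python
from collections import Counter

# 1. Collect or choose data (hypothetical survey answers).
raw_answers = ["Dog", "cat", " dog", "", "Cat", "dog", "bird"]

# 2. Clean and/or filter: trim spaces, lowercase, drop empty answers.
cleaned = [a.strip().lower() for a in raw_answers if a.strip()]

# 3. Visualize and find patterns: a simple text bar chart.
counts = Counter(cleaned)
for animal, count in counts.most_common():
    print(f"{animal:<5} {'#' * count}")

# 4. Generate new information: which answer was most common?
most_common_pet = counts.most_common(1)[0][0]
print("Most common answer:", most_common_pet)
```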
bar chart
Count how many times each value in the column appears and make a bar at that height.
What value(s) are most common in this column?
What value(s) are least common in this column?
What is the unique list of values in this column?
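A rough sketch of what a bar chart computes, assuming a made-up column of color names. It answers the three questions above with plain counting.

```python
from collections import Counter

# Hypothetical column of values.
colors = ["red", "blue", "red", "green", "blue", "red"]

# Count how many times each value appears (the height of each bar).
counts = Counter(colors)

highest = max(counts.values())
lowest = min(counts.values())

# There can be ties, so each answer is a list.
most_common = sorted(v for v, c in counts.items() if c == highest)
least_common = sorted(v for v, c in counts.items() if c == lowest)
unique_values = sorted(counts)

print(most_common, least_common, unique_values)
```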
histogram
Similar to a bar chart, but first all numbers in a range or “bucket” are grouped together. For example, with a bucket size of 20, the numbers 41, 48, and 53 would all be placed in the same bucket between 40 and 60.
Histograms can only be created with numeric data but can be useful when a normal bar chart may be difficult to read.
What range of value(s) are most common in this column?
What range of value(s) are least common in this column?
What ranges of values do or do not appear in this column?
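A minimal sketch of bucketing with a bucket size of 20. The numbers 41, 48, and 53 come from the example above; the other values are made up. It assumes a value on a boundary (such as 60) belongs to the bucket that starts there, which a given tool may handle differently.

```python
from collections import Counter

BUCKET_SIZE = 20

# 41, 48, 53 are from the flashcard; the rest are hypothetical.
values = [41, 48, 53, 12, 67, 85, 59]

def bucket_start(value, size=BUCKET_SIZE):
    """Return the lower edge of the bucket that holds value."""
    return (value // size) * size

# Group values by bucket and count each group.
buckets = Counter(bucket_start(v) for v in values)

# Text histogram: one row per bucket that contains at least one value.
for start in sorted(buckets):
    print(f"{start}-{start + BUCKET_SIZE}: {'#' * buckets[start]}")
```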
visualization takeaways
Programs (like the Data Visualizer) can help process data so we can understand it and learn.
Charts and other visualizations can help both find and communicate what we’ve learned from data
Bar charts and histograms are two common chart types for exploring one column of data in a table.
when does data need to be cleaned?
Data is incomplete
Data is invalid
Multiple tables are combined into one
What leads to “messy” data?
Users enter in different types of data (“two”, 2)
Users use different abbreviations to represent the same information (“February”, “Feb”, “Febr”)
Data may have different spellings (“color”, “colour”) or inconsistent capitalization (“spring”, “Spring”)
cleaning data
Look through the data manually. Find and fix messy data.
Use a program to find and fix messy data.
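A sketch of cleaning with a program, using the messy examples listed earlier (“two” vs 2, “Feb” vs “February”, “spring” vs “Spring”). The lookup tables and function names are hypothetical and cover only these examples.

```python
# Hypothetical lookup tables; a real cleaner would need many more entries.
MONTH_ALIASES = {"feb": "February", "febr": "February", "february": "February"}
NUMBER_WORDS = {"two": 2}

def clean_month(text):
    """Map known abbreviations to one spelling; leave unknown text as-is."""
    key = text.strip().lower().rstrip(".")
    return MONTH_ALIASES.get(key, text.strip())

def clean_number(value):
    """Turn a number word or numeric text into an int (raises ValueError if it cannot)."""
    key = str(value).strip().lower()
    if key in NUMBER_WORDS:
        return NUMBER_WORDS[key]
    return int(key)

def clean_season(text):
    """Fix inconsistent capitalization."""
    return text.strip().capitalize()

print([clean_month(m) for m in ["February", "Feb", "Febr"]])
print([clean_number(n) for n in ["two", 2]])
print([clean_season(s) for s in ["spring", "Spring"]])
```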
filtering data
Filtering data allows the user to look at a subset of the data.
In Unit 5, we filtered data programmatically using traversals to gain insight and knowledge from data.
Software programs with built-in tools (like the Data Visualizer) can also be used to filter data.
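A short sketch of filtering with a traversal. The rows and the helper name filter_rows are made up for illustration.

```python
# Hypothetical table: each row is a dictionary.
rows = [
    {"level": 9, "score": 73},
    {"level": 3, "score": 41},
    {"level": 7, "score": 88},
]

def filter_rows(table, keep):
    """Traverse the table and collect only the rows where keep(row) is true."""
    result = []
    for row in table:
        if keep(row):
            result.append(row)
    return result

# The subset of rows with a score of at least 70.
high_scores = filter_rows(rows, lambda row: row["score"] >= 70)
print(high_scores)
```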
data stored in text files
old school PC games
.csv Comma Separated Values
date, level, score
01/11/2019, 9, 73
Common file format
Requires a spreadsheet program or a specific program to iterate through the data
Easy to mess up a file
No standard way to create a file
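A minimal sketch of a program iterating through CSV text with Python's built-in csv module. It assumes the sample row is meant to have three comma-separated fields (date, level, score) and reads from a string instead of a real file.

```python
import csv
import io

# The header and row from the flashcard, held in a string for this sketch.
csv_text = "date, level, score\n01/11/2019, 9, 73\n"

# skipinitialspace=True ignores the space that follows each comma.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
rows = list(reader)

# Every value is read as text, so numbers must be converted by the program.
for row in rows:
    print(row["date"], int(row["level"]), int(row["score"]))
```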
data storage through spreadsheets
Designed for people to analyze data, not for programs
data storage through databases
Preferred method of storing data that will be used in programs
Programmers use SQL (Structured Query Language) to interact with databases.
To be a data scientist, you often need to learn programming languages like Python or R to analyze and visualize data.
You also need to learn SQL to be able to interact with databases.
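A sketch of a program using SQL through Python's built-in sqlite3 module. The table name scores and the second row are hypothetical; the database lives in memory only.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (date TEXT, level INTEGER, score INTEGER)")

# The first row matches the CSV example; the second is made up.
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("01/11/2019", 9, 73), ("01/12/2019", 10, 58)],
)

# SQL query: levels and scores where the score is at least 70.
high_scores = conn.execute(
    "SELECT level, score FROM scores WHERE score >= ? ORDER BY level",
    (70,),
).fetchall()
conn.close()

print(high_scores)
```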
scatter plot
Shows combinations of values from two columns
Useful for:
Seeing patterns and trends between two values
Numeric data with lots of different values
Not useful:
Lots of repeated values
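A rough sketch of why repeated values make a scatter plot less useful: each row becomes one (x, y) point, and identical rows land on the same spot. Both datasets are made up.

```python
# Hypothetical numeric columns with many different values.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 61, 68, 80, 88]

# Each row becomes one point: (value from column 1, value from column 2).
points = list(zip(hours_studied, test_scores))
distinct_points = len(set(points))

# Hypothetical columns with lots of repeated values.
rating_a = [1, 1, 2, 2, 1, 2]
rating_b = [5, 5, 4, 4, 5, 4]
repeated_points = list(zip(rating_a, rating_b))
distinct_repeated = len(set(repeated_points))

# 5 rows give 5 visible dots, but 6 rows collapse into only 2 visible dots.
print(distinct_points, distinct_repeated)
```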
crosstab chart
Counts how many times combinations of values appear. Arrows show where each row in the data table would be counted in the chart.
Counts how often pairs of values in two columns appear.
Useful for:
Finding the most / least common combinations of values in two columns
Finding patterns across two columns
Exploring two columns when one or both are strings.
Not useful:
If either column has too many values (the chart would be enormous)
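A small sketch of what a crosstab chart counts, assuming a made-up table with two string columns (grade and transport).

```python
from collections import Counter

# Hypothetical rows: (grade, how the student gets to school).
rows = [
    ("9th", "bus"),
    ("9th", "car"),
    ("10th", "bus"),
    ("9th", "bus"),
    ("10th", "walk"),
]

# Count how often each pair of values appears.
crosstab = Counter(rows)

# Print the grid: one row per grade, one column per transport type.
grades = sorted({grade for grade, _ in rows})
transports = sorted({transport for _, transport in rows})
print("     ", transports)
for grade in grades:
    print(f"{grade:<5}", [crosstab[(grade, t)] for t in transports])

# The most common combination of values across the two columns.
top_pair, top_count = crosstab.most_common(1)[0]
print(top_pair, top_count)
```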
when to use what graph
study slide 17 in 9.4
big data
“Collect huge amounts of data so we can learn even more from it”
The size of the datasets we analyze impacts how much information can be extracted
As a result, in business, science, and many other contexts people are working with increasingly big data sets
When data gets too big it can no longer be processed on one computer. Cloud computing or parallel systems are sometimes used to help process all that information.
In general, the scalability of your system is important to consider when working with big data. You want your system to keep working even as you use more and more data.
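A toy sketch of the split, process, and combine idea behind parallel systems. It uses threads on one computer and a tiny made-up dataset, so it only illustrates the structure; a real big data system would spread the chunks across many machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "big" dataset (tiny here so the sketch runs instantly).
data = list(range(1, 1001))

def split_into_chunks(items, size):
    """Split a list into consecutive pieces of at most size items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

chunks = split_into_chunks(data, 250)

# Each worker handles one chunk; the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

total = sum(partial_sums)
print(len(chunks), total)
```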
citizen science and crowdsourcing
“collecting data from others so you can analyze it”
Crowdsourcing is the practice of obtaining input or information from a large number of people via the Internet.
Citizen science is research where some of the data collection is done by members of the public using their own computing devices, which helps solve scientific problems
Crowdsourcing offers new models for collaboration, such as connecting businesses or social causes with funding
Both are examples of how human capabilities can be enhanced by collaboration via computing
open data
“sharing data with others so they can analyze it”
Open data is publicly available data shared by governments, organizations, and others
Making data open helps spread useful knowledge or creates opportunities for others to use it to solve problems