Unit 5 (Data) Flashcards
Visualizations can help us:
Answer questions
Look at lots of data at once
See patterns that are “invisible” if you just look at the table
Explain the difference between Correlation and Causation
* Correlation def
* Causation def
Correlation - two things show similar patterns or change together
Causation - one thing directly caused the other
When looking at visualizations, what are facts and opinions?
* fact question def
* opinion question def
What does the data show? - fact
Why might that be the case? - opinion
Metadata are data about data:
* It can be changed without …..
* Used for …..(3 things)…. information
* Increases ……
* Allows data to be ….. and ……
- It can be changed without impacting the primary data
- Used for finding, organizing, and managing information
- Increases effective use of data by providing extra information
- Allows data to be structured and organized
Programs (like the Data Visualizer) can help:
Charts and other visualizations can help:
process data so we can understand it and learn.
both find and communicate what we’ve learned from data
Bar charts and histograms are two common chart types for exploring one column of data in a table. Explain
* Information we can get out of bar charts (what value(s) are …., what value(s) are …, what is the unique ….)
* Information we can get out of histograms (what range of value(s) are …., what range of value(s) are ….., what range of values do or do not ….)
1 column data charts
Bar charts: Count how many times each value in the column appears and make a bar at that height. Bar charts aren’t very useful when every value is unique.
* What value(s) are most common in this column?
* What value(s) are least common in this column?
* What is the unique list of values in this column?
Histogram: Similar to a bar chart, but first all numbers in a range or “bucket” are grouped together. For example, with a bucket size of 20, the numbers 41, 48, and 53 would all be placed in the same bucket, between 40 and 60.
* What range of value(s) are most common in this column?
* What range of value(s) are least common in this column?
* What ranges of values do or do not appear in this column?
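A minimal Python sketch of the counting behind both one-column charts. The function names and the sample columns are made up for illustration; this is not how the Data Visualizer itself is implemented.

```python
from collections import Counter

def bar_chart_counts(values):
    """Count how many times each value appears (these are the bar heights)."""
    return Counter(values)

def histogram_counts(numbers, bucket_size):
    """Group numbers into buckets of width bucket_size, then count each bucket."""
    counts = Counter()
    for n in numbers:
        start = (n // bucket_size) * bucket_size  # lower edge of the bucket
        counts[(start, start + bucket_size)] += 1
    return dict(sorted(counts.items()))

# Made-up sample columns (assumptions, not real data).
colors = ["red", "blue", "red", "green", "red", "blue"]
ages = [41, 48, 53, 12, 67, 59]

bars = bar_chart_counts(colors)
print(bars.most_common(1))          # most common value and its count
print(sorted(bars))                 # unique list of values in the column
print(histogram_counts(ages, 20))   # 41, 48, 53 all land in the (40, 60) bucket
```

If every value in a column is unique, every bar has height 1, which is why bar charts stop being useful in that case.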
The data analysis process
* The 4 steps
* when do you clean data?
* what would filtering allow the user to do?
Collect or Choose Data
Clean and/or Filter
* Clean when: data is incomplete, data is invalid, or multiple tables are combined into one. Also clean when data is messy or was not entered correctly.
* Filtering data allows the user to look at a subset of the data.
Visualize and Find Patterns
New Information
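A minimal Python sketch of the "Clean and/or Filter" step, assuming a made-up table stored as a list of rows. The column names `city` and `temp_f`, the values, and both helper functions are hypothetical.

```python
# Made-up sample table (assumption): each row is a dict of column -> text value.
rows = [
    {"city": "Austin", "temp_f": "91"},
    {"city": "Boston", "temp_f": ""},        # incomplete: missing value
    {"city": "Chicago", "temp_f": "warm"},   # invalid: not a number
    {"city": "Denver", "temp_f": "78"},
]

def clean(rows):
    """Keep only rows whose temp_f is a whole non-negative number, as an int."""
    cleaned = []
    for row in rows:
        text = row["temp_f"].strip()
        if text.isdigit():  # simple check; rejects blanks, words, and negatives
            cleaned.append({"city": row["city"], "temp_f": int(text)})
    return cleaned

def filter_rows(rows, min_temp):
    """Return the subset of rows with temp_f at or above min_temp."""
    return [row for row in rows if row["temp_f"] >= min_temp]

cleaned = clean(rows)
hot = filter_rows(cleaned, 80)
print(cleaned)  # rows that survived cleaning
print(hot)      # subset of the cleaned rows
```

Cleaning changes or removes bad rows so the data can be trusted. Filtering leaves the data alone and just picks a subset to look at.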
Cross tab charts definition, usefulness, and non usefulness
* finding the ….. in two columns
* finding ….. across 2 columns
* exploring two columns when one or both are …..
* not useful if either column has ……
Scatter plots definition, usefulness, and non usefulness
* seeing ….. and …… between 2 values
* …. data with lots of different ……
* not useful if it has lots of ……
2 column data charts
Cross Tab: Counts how often pairs of values in two columns appear.
* Finding the most / least common combinations of values in two columns
* Finding patterns across two columns
* Exploring two columns when one or both are strings.
Not useful: If either column has too many values (the chart would be enormous)
Scatter: Shows combinations of values from two columns
* Seeing patterns and trends between two values
* Numeric data with lots of different values
Not useful: Lots of repeated values
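A minimal Python sketch of what a cross tab counts: how often each pair of values from two columns appears together. The `cross_tab` helper and the sample rows are made up for illustration.

```python
from collections import Counter

def cross_tab(rows, col_a, col_b):
    """Count how often each (col_a value, col_b value) pair appears."""
    return Counter((row[col_a], row[col_b]) for row in rows)

# Made-up sample table (assumption): two string columns.
pets = [
    {"pet": "dog", "color": "brown"},
    {"pet": "cat", "color": "black"},
    {"pet": "dog", "color": "brown"},
    {"pet": "dog", "color": "black"},
]

table = cross_tab(pets, "pet", "color")
print(table.most_common(1))   # most common combination of values
print(len(table))             # number of distinct pairs that appear
```

The number of possible cells is (unique values in column A) x (unique values in column B), which is why the chart becomes enormous when either column has too many values.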
Explain:
1. Open data
2. Citizen science and crowdsourcing
3. Big data
Open Data
* “sharing data with others so they can analyze it”
* Open data is publicly available data shared by governments, organizations, and others
Citizen Science and Crowdsourcing
* “collecting data from others so you can analyze it”
* Crowdsourcing is the practice of obtaining input or information from a large number of people via the Internet.
* Citizen science is research where some of the data collection is done by members of the public using their own computing devices, which helps solve scientific problems
Big data
* “Collect huge amounts of data so we can learn even more from it”
* The size of the datasets we analyze affects how much information can be extracted
* As a result, in business, science, and many other contexts people are working with increasingly big data sets
* When data gets too big it can no longer be processed on one computer. Cloud computing or parallel systems are sometimes used to help process all that information.