Unit 9 Flashcards

1
Q

Two distinctions with data

A

What does the data show - fact
Why might this be the case - opinion

2
Q

correlation

A

similarities, patterns

3
Q

causation

A

this thing caused that thing

4
Q

metadata

A

Data about other data
Can help us answer the "why" questions (sometimes gathered automatically)

5
Q

Metadata are data about data:

A

It can be changed without impacting the primary data
Used for finding, organizing, and managing information
Increases effective use of data by providing extra information
Allows data to be structured and organized

6
Q

visualizations

A

Look at lots of data at once
See patterns that are “invisible” if you just look at the table

7
Q

data analysis process

A
  1. collect or choose data
  2. clean and/or filter
  3. visualize and find patterns
  4. generate new information
8
Q

bar chart

A

Count how many times each value in the column appears and make a bar at that height.
What value(s) are most common in this column?
What value(s) are least common in this column?
What is the unique list of values in this column?
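The counting behind a bar chart can be sketched in a few lines of Python. This is only an illustration: the `colors` column and the `value_counts` helper are made up for this example and are not part of the Data Visualizer.

```python
from collections import Counter

def value_counts(column):
    """Count how many times each value appears in a column."""
    return Counter(column)

# Hypothetical column of data, invented for illustration.
colors = ["red", "blue", "red", "green", "blue", "red"]
counts = value_counts(colors)

# Draw a crude text bar for each value; bar length equals the count.
for value, count in counts.most_common():
    print(f"{value:<6} {'#' * count} ({count})")

# The three questions a bar chart answers:
most_common = counts.most_common(1)[0][0]
least_count = min(counts.values())
least_common = sorted(v for v, c in counts.items() if c == least_count)
unique_values = sorted(counts)
```

Here `most_common`, `least_common`, and `unique_values` correspond to the three questions listed on the card.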

9
Q

histogram

A

Similar to a bar chart, but first all numbers in a range or “bucket” are grouped together. For example, the chart below has a bucket size of 20 so the numbers 41, 48, and 53 would all be placed in the same bucket between 40 and 60.

Histograms can only be created with numeric data but can be useful when a normal bar chart may be difficult to read.
What range of value(s) are most common in this column?
What range of value(s) are least common in this column?
What ranges of values do or do not appear in this column?
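The bucketing step can be sketched in Python as follows. The `bucket_counts` helper and the sample numbers beyond 41, 48, and 53 are assumptions made for illustration. The sketch also assumes each bucket includes its lower edge and excludes its upper edge, so 60 would land in the 60 to 80 bucket.

```python
from collections import Counter

def bucket_counts(numbers, bucket_size):
    """Group numbers into buckets and count how many fall in each.

    Each bucket is labelled by its lower edge and covers
    [start, start + bucket_size).
    """
    counts = Counter()
    for n in numbers:
        start = (n // bucket_size) * bucket_size
        counts[start] += 1
    return dict(sorted(counts.items()))

# 41, 48, and 53 come from the card; the other values are invented.
values = [41, 48, 53, 12, 67, 75, 59]
hist = bucket_counts(values, 20)

for start, count in hist.items():
    print(f"{start}-{start + 20}: {'#' * count}")
```

With a bucket size of 20, the values 41, 48, and 53 are all counted under the bucket that starts at 40.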

10
Q

visualization takeaways

A

Programs (like the Data Visualizer) can help process data so we can understand it and learn.

Charts and other visualizations can help both find and communicate what we’ve learned from data.

Bar charts and histograms are two common chart types for exploring one column of data in a table.

11
Q

when does data need to be cleaned?

A

Data is incomplete
Data is invalid
Multiple tables are combined into one

12
Q

What leads to “messy” data?

A

Users enter the same value as different types of data (“two”, 2)
Users use different abbreviations to represent the same information (“February”, “Feb”, “Febr”)
Data may have different spellings (“color”, “colour”) or inconsistent capitalization (“spring”, “Spring”)

13
Q

cleaning data

A

Look through the data manually. Find and fix messy data.
Use a program to find and fix messy data.
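As a rough idea of what "use a program" could look like, the sketch below cleans the kinds of messy values mentioned in this deck ("Feb", "Febr", "two", "spring"). The lookup tables and function names are hypothetical and cover only these few cases; real data would need much larger tables.

```python
# Hypothetical lookup tables, covering only the examples from the cards.
MONTHS = {"feb": "February", "febr": "February", "february": "February"}
NUMBER_WORDS = {"two": 2}

def clean_month(raw):
    """Map different abbreviations of a month to one spelling."""
    key = raw.strip().rstrip(".").lower()
    return MONTHS.get(key, raw.strip())

def clean_number(raw):
    """Turn "two" or "2" or 2 into the integer 2; return None if invalid."""
    text = str(raw).strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    try:
        return int(text)
    except ValueError:
        return None

def clean_capitalization(raw):
    """Make capitalization consistent, e.g. "spring" becomes "Spring"."""
    return raw.strip().capitalize()

cleaned_months = [clean_month(m) for m in ["February", "Feb", "Febr"]]
cleaned_numbers = [clean_number(n) for n in ["two", 2, "n/a"]]
```

Values the program cannot interpret (like "n/a" above) come back as None, which flags them for a manual look.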

14
Q

filtering data

A

Filtering data allows the user to look at a subset of the data.
In Unit 5, we filtered data programmatically using traversals to gain insight and knowledge from data.
Software programs with built in tools (like the Data Visualizer) can also be used to filter data.
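Filtering with a traversal can be sketched like this in Python. The rows, the column names, and the `filter_rows` helper are invented for illustration; the course's own exercises may use a different language and data.

```python
def filter_rows(rows, keep):
    """Traverse every row and collect only the ones where keep(row) is true."""
    result = []
    for row in rows:
        if keep(row):
            result.append(row)
    return result

# Hypothetical table, one dictionary per row.
rows = [
    {"name": "Ava", "level": 9, "score": 73},
    {"name": "Ben", "level": 4, "score": 41},
    {"name": "Cy", "level": 10, "score": 88},
]

# Subset of the data: only rows with a score of at least 70.
high_scores = filter_rows(rows, lambda row: row["score"] >= 70)
names = [row["name"] for row in high_scores]
```

The original list is left unchanged; the filter only produces a smaller view of it.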

15
Q

data stored in text files

A

old school PC games
.csv Comma Separated Values
date, level, score
01/11/2019, 9, 73
Common file format
Require spreadsheet programs or specific programs to iterate through

Easy to mess up a file
No standard way to create a file
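A program can iterate through CSV text with Python's standard `csv` module, as in this minimal sketch. It assumes the header shown on the card (date, level, score) and reads the sample row as three comma-separated values; the second row is invented. The text is held in a string here so the example runs without a file on disk.

```python
import csv
import io

# CSV text as it might appear in a file. Only the first data row
# follows the card; the second is made up for illustration.
text = "date, level, score\n01/11/2019, 9, 73\n01/12/2019, 10, 81\n"

# skipinitialspace=True ignores the space that follows each comma.
reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)

records = []
for row in reader:
    # Every CSV field is read as a string, so convert numbers explicitly.
    records.append({
        "date": row["date"],
        "level": int(row["level"]),
        "score": int(row["score"]),
    })

best = max(records, key=lambda r: r["score"])
```

The explicit `int(...)` conversions show one reason text files are easy to mess up: a single mistyped character in a number makes the conversion fail.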

16
Q

data storage through spreadsheets

A

Designed for people to analyze data not for programs

17
Q

data storage through databases

A

Preferred method of storing data that will be used in programs

Programmers use SQL (Structured Query Language) to interact with databases.

To be a data scientist, you often need to learn programming languages like Python or R to analyze and visualize data.

You also need to learn SQL to be able to interact with databases.
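To give a feel for how a program talks to a database with SQL, here is a small sketch using Python's built-in `sqlite3` module and an in-memory database. The `scores` table, its columns, and its rows are assumptions made up for this example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for illustration.
conn.execute("CREATE TABLE scores (date TEXT, level INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("01/11/2019", 9, 73),
        ("01/12/2019", 10, 81),
        ("01/13/2019", 3, 40),
    ],
)

# SQL query: filter rows and sort them, all inside the database.
rows = conn.execute(
    "SELECT date, score FROM scores WHERE score >= ? ORDER BY score DESC",
    (70,),
).fetchall()

conn.close()
```

The `?` placeholders let the database fill in values safely instead of building the query by pasting strings together.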

18
Q

scatter plot

A

Shows combinations of values from two columns

Useful for:
Seeing patterns and trends between two values
Numeric data with lots of different values

Not useful:
Lots of repeated values

19
Q

crosstab chart

A

Counts how many times combinations of values appear. Arrows show where that row in the data table would be counted in the chart.
Counts how often pairs of values in two columns appear.

Useful for:
Finding the most / least common combinations of values in two columns
Finding patterns across two columns
Exploring two columns when one or both are strings.

Not useful:
If either column has too many values (the chart would be enormous)
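The pair counting behind a crosstab chart can be sketched as follows. The rows, the column names, and the `crosstab` helper are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter

def crosstab(rows, col_a, col_b):
    """Count how often each pair of values from two columns appears."""
    return Counter((row[col_a], row[col_b]) for row in rows)

# Hypothetical table with two string columns.
rows = [
    {"season": "Spring", "color": "green"},
    {"season": "Spring", "color": "green"},
    {"season": "Fall", "color": "red"},
    {"season": "Spring", "color": "red"},
]

pairs = crosstab(rows, "season", "color")
most_common_pair = pairs.most_common(1)[0]

for (season, color), count in sorted(pairs.items()):
    print(f"{season:<7} {color:<6} {count}")
```

Each distinct pair becomes one cell, which is why a column with too many different values makes the chart enormous.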

20
Q

when to use what graph

A

study slide 17 in 9.4

21
Q

big data

A

“Collect huge amounts of data so we can learn even more from it”
The size of the datasets we analyze affects how much information can be extracted
As a result, in business, science, and many other contexts people are working with increasingly big data sets
When data gets too big it can no longer be processed on one computer. Cloud computing or parallel systems are sometimes used to help process all that information.
In general scalability of your system is important to consider when working with big data. You want your system to be able to work even as you’re using more and more data.

22
Q

citizen science and crowdsourcing

A

“collecting data from others so you can analyze it”
Crowdsourcing is the practice of obtaining input or information from a large number of people via the Internet.
Citizen science is research where some of the data collection is done by members of the public using their own computing devices, which helps solve scientific problems
Crowdsourcing offers new models for collaboration, such as connecting businesses or social causes with funding
Both are examples of how human capabilities can be enhanced by collaboration via computing

23
Q

open data

A

“sharing data with others so they can analyze it”
Open data is publicly available data shared by governments, organizations, and others
Making data open helps spread useful knowledge and creates opportunities for others to use it to solve problems