Midterm Flashcards
FILTER + REPRESENT
Reorganize your data and take only what you need
The pro of mining before filtering is that you know exactly what you want to filter. The con is that you don’t know whether there is enough data to answer your questions
Filter and Represent are iterative. How you represent data can influence what you acquire
This stage could lead you back to acquire
ACQUIRE
Locate and download the data from a source
Primary Data
information collected for specific purpose at hand
Secondary Data
information that already exists somewhere, having been collected for another purpose
PARSE
Look through the data columns, identify each column’s type, and check its correctness
Modify columns by splitting if needed
Each piece of data needs to be converted to a useful format
String
a set of characters that forms a word or sentence
Float
a number with a decimal point
Character
a single letter or other symbol
Integer
a number with no fractional part
Alphanumeric
consists of both letters and numbers
Boolean
True or False
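As a rough illustration of the data types above, and of converting each piece of data to a useful format, the following Python sketch guesses a type for each raw text field. The function name `parse_value` and the sample values are hypothetical, made up for this example.

```python
def parse_value(raw: str):
    """Convert a raw text field to a Boolean, Integer, or Float, else keep it as a String."""
    text = raw.strip()
    # Boolean: only the literal words true/false (case-insensitive) count here.
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    # Integer: a number with no fractional part.
    try:
        return int(text)
    except ValueError:
        pass
    # Float: a number with a decimal point.
    try:
        return float(text)
    except ValueError:
        pass
    # Anything else stays a String (a Character is a String of length 1;
    # Alphanumeric text such as "abc123" also stays a String).
    return text


sample_row = ["42", "3.14", "True", "A", "abc123"]  # hypothetical raw fields
parsed = [parse_value(field) for field in sample_row]
types = [type(value).__name__ for value in parsed]
print(types)  # ['int', 'float', 'bool', 'str', 'str']
```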
MINE
Determine basic descriptors and statistics for your data, categorize it, and figure out the range and spread, as well as patterns
Categorize your data into groups such as nutrient facts
Should also start asking questions
Figure out if temporal data needs to be reorganized
Range check is important to see if there are null / na or negative numbers
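A minimal sketch of the basic descriptors and the range check described above, using only Python’s standard library. The column `values` is hypothetical, and `None` is assumed to mark a null / NA entry.

```python
import statistics

values = [12, 7, None, 15, -3, 9, 7]  # hypothetical column; None marks a missing entry

present = [v for v in values if v is not None]
summary = {
    "count": len(present),
    "missing": len(values) - len(present),         # null / NA entries
    "negative": sum(1 for v in present if v < 0),  # flagged by the range check
    "min": min(present),
    "max": max(present),
    "range": max(present) - min(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
}
print(summary)
```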
CHRTS
categorical, hierarchical, relational, temporal, spatial
Categorical
compare categories of quantitative data
Hierarchical
visualize relationships and hierarchies
Relational
charts relations to explore correlations
Temporal
data that happens over time
Spatial
data pertaining to a location
CRITIQUE + REFINE
Get feedback on your charts and refine them based on that feedback
This stage could lead you back to acquire, mine, or filter & represent
Data Product
translate the records of a data source into an easily understandable format
ex:
Raw vs Processed
Granular vs Summarized
Textual vs Quantitative
Static vs Dynamic
Small vs Massive
Structured Data
easily searchable
Unstructured Data
not easily searchable
ex:
audio, video, reviews
Quantitative
numerical data that is either discrete or continuous
Qualitative Data Types
nominal, ordinal
Nominal
label for a field
ex:
M/F, color, names
Ordinal
order matters
Anatomy of a graphic
Chart title, data label, legend, horizontal axis title, left vertical axis title, category labels
Bar Charts vs Histograms
bar charts compare categories, while histograms show the distribution of data within a range
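To make the distinction concrete, here is a small Python sketch (hypothetical data, text-only output) that prepares the counts behind each chart: one count per category for a bar chart, and counts per equal-width bin for a histogram. The bin width of 10 is an arbitrary choice for the example.

```python
from collections import Counter

# Bar chart input: one count per category (the categories have no inherent order).
fruits = ["apple", "pear", "apple", "kiwi", "apple", "pear"]  # hypothetical data
bar_counts = Counter(fruits)

# Histogram input: counts of numeric values falling into equal-width bins.
ages = [3, 8, 12, 15, 21, 22, 27, 34]  # hypothetical data
bin_width = 10
hist_counts = Counter((age // bin_width) * bin_width for age in ages)

# Bars are sorted by length; histogram bins must stay in numeric order.
for category, count in bar_counts.most_common():
    print(f"{category:<6} {'#' * count}")
for start in sorted(hist_counts):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * hist_counts[start]}")
```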
Bar Chart
categories don’t have an order
order the bars by length for easy comparison
horizontal bar charts for long category labels
Categorical
Clustered Bar Chart
comparison between subcategories
Categorical
Pictogram
use point marks, in the form of symbols or pictures, to represent an associated quantitative count
Categorical
Proportional Symbol Chart
works best when you have a diverse range of quantitative value sizes
Categorical
Word Cloud
shows the frequency of individual words
Categorical
Matrix Chart and Heat Map
displays quantitative values across the intersection of two categorical and/or discrete quantitative dimensions
Categorical
Histogram and Density Plot
displays the frequency and distribution of quantitative measurements across grouped values for data items
Categorical
Box and Whisker Plot
displays the distribution and shape of quantitative values for different categories
Categorical
Pie Chart and Donut Chart
how proportions of quantities for different constituent categories make up a whole
Categorical
Treemap
an enclosure diagram providing a hierarchical display that shows how quantitative values for different constituent categorical parts make up a whole
Hierarchical
Venn Diagram
shows collections of and relationships between multiple sets
Hierarchical
Scatter Plot
displays the relationship between two quantitative variables for different category items
Relational
Bubble Plot
displays the relationship between three quantitative variables for different category items
Relational
Network Diagram
display relationships through the connections between data items
Relational
Line Chart
shows how quantitative values have changed over time for different categorical items
Temporal
Bump Chart/Ribbon Chart/Rank Chart
shows how quantitative values have changed over time for categorical items, where the quantitative values are ranking measurements
Temporal
Slope Graph
shows how quantitative values have changed over two points in time for different category items
Temporal
Area Chart
shows how quantitative values have changed over time for a single categorical item
Temporal
Stacked Area Chart
shows how quantitative values have changed over time for multiple categorical items
Temporal
Gantt Chart
shows time based intervals for different categorical items
Temporal
Instance Chart
displays time-based events for different categorical items
Temporal
Choropleth
displays quantitative values for distinct, definable spatial regions
Spatial
Isarithmic Map/Contour Map
displays distinct spatial surfaces on a map that share the same quantitative classification
Spatial
Proportional Symbol Map
displays quantitative values for locations on a map; ideal for highlighting the magnitude of data at specific locations through varying symbol sizes
Spatial
Dot Map
displays the distribution of phenomena on a map
Spatial
Flow Map
shows the characteristics of movement or connections between phenomena across spatial regions
Spatial
Area Cartogram
displays the quantitative values associated with distinct, definable spatial regions on a map by proportionately distorting (inflating or deflating) the relative size and, to some degree, the shape of the respective regional areas
Spatial
Dorling Cartogram
displays the quantitative values associated with distinct, definable spatial regions on a map with marks that are proportionally sized to represent the quantitative values
Spatial
Grid Map
displays the quantitative values associated with distinct, definable spatial regions on a map. Each geographic region is represented by a fixed-size uniform shape, sometimes termed a tile. Attributes of color are applied to each regional tile to represent a quantitative measurement
Spatial
Projections
Ways of flattening the globe onto a map, each with its own distortions. The Mercator projection, for example, preserves local angles but severely distorts areas near the poles
Spatial
Logarithmic Transformation
Useful when data spans multiple orders of magnitude or has skewness (right-skewed)
Square Root Transformation
Appropriate for moderately skewed data or data with moderate outliers (right-skewed)
Reciprocal Transformation
Effective when large values disproportionately influence the dataset or when the data is right-skewed
Squaring/Cubing
Effective for left-skewed data
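The four transformations above can be sketched in a few lines of Python. The sample `data` is hypothetical and deliberately spans several orders of magnitude. Note that the logarithm and the reciprocal are only defined here for positive, nonzero values.

```python
import math

data = [1, 10, 100, 1000, 10000]  # hypothetical right-skewed values, all positive

# Logarithmic: each order of magnitude becomes one even step.
log_t = [math.log10(x) for x in data]
# Square root: compresses large values, but more gently than the log.
sqrt_t = [math.sqrt(x) for x in data]
# Reciprocal: large values shrink toward zero (and the ordering reverses).
recip_t = [1 / x for x in data]
# Squaring: stretches the upper end, which is why it suits left-skewed data.
squared = [x ** 2 for x in data]

print([round(v, 2) for v in log_t])
print([round(v, 2) for v in sqrt_t])
print(recip_t)
print(squared)
```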
Currency (Verifying Data)
Is the information up to date? When was it collected/published/updated?
Relevancy (Verifying Data)
Is the information suitable for your intended use? Does it address your research question? Is there other (better) information?
Authority (Verifying Data)
Is the creator of the information reputable, with the necessary credentials? Can you trust the information?
Accuracy (Verifying Data)
Do you spot any errors? What is the source of the information? Can other data or research support this information?
Purpose (Verifying Data)
What was the intended purpose of collecting the information? Are other potential uses identified?
Data Type Checking (Data Cleaning)
Checking that every value in a field has the same, expected data type
ex: all inputs for ages should be integers
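A minimal sketch of this check in Python, assuming the ages arrive as a plain list. The sample `ages` is hypothetical and includes a few deliberately wrong entries.

```python
ages = [23, 41, "35", 19.5, 60, None]  # hypothetical raw inputs

# Collect (position, value) for every entry that is not an integer.
# bool is a subclass of int in Python, so it is excluded explicitly.
wrong_type = [
    (index, value)
    for index, value in enumerate(ages)
    if not isinstance(value, int) or isinstance(value, bool)
]
print(wrong_type)  # [(2, '35'), (3, 19.5), (5, None)]
```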
Range Check (Data Cleaning)
Checking to make sure that the information is within a reasonable range
ex: an age shouldn’t be negative, zero or over a hundred
Missing or incorrect values should be replaced with an estimate (e.g. the median age of the dataset) or labeled as “Missing” or “Unknown”
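One way to sketch the range check in Python, following the age example: values outside the range are replaced with the median of the valid ages. The bounds `LOW` and `HIGH` and the sample `ages` are assumptions made for this example.

```python
import statistics

ages = [34, -2, 51, 0, 27, 140, 45]  # hypothetical data
LOW, HIGH = 1, 100  # assumed reasonable range: not negative, zero, or over a hundred

# Compute the median from in-range values only, so bad entries do not skew it.
valid = [a for a in ages if LOW <= a <= HIGH]
median_age = statistics.median(valid)

# Replace each out-of-range value with the estimate.
cleaned = [a if LOW <= a <= HIGH else median_age for a in ages]
print(cleaned)  # [34, 39.5, 51, 39.5, 27, 39.5, 45]
```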
Format Check (Data Cleaning)
Making sure the format is uniform
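As an example of a format check, the sketch below tests whether date strings all follow one uniform format. The choice of `YYYY-MM-DD` as the target format, the helper name `is_iso_date`, and the sample dates are assumptions for illustration.

```python
import re
from datetime import datetime


def is_iso_date(text: str) -> bool:
    """Return True only for a real calendar date written exactly as YYYY-MM-DD."""
    # The pattern enforces the zero-padded layout; strptime alone would accept "2024-3-5".
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is None:
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")  # rejects impossible dates such as Feb 30
        return True
    except ValueError:
        return False


dates = ["2024-03-05", "03/05/2024", "2024-3-5", "2024-02-30"]  # hypothetical entries
checks = [is_iso_date(d) for d in dates]
print(checks)  # [True, False, False, False]
```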
Handling Missing Data (Data Cleaning)
< 5% of data missing:
delete those entries
make note on how this impacts the data analysis and size
> 5% of the data missing:
Categorical Data should have a placeholder like “Unknown”
Numerical Data: replace with the mean of the data
Temporal/Interval Data: Use interpolation or a placeholder like “Unknown”
check for patterns of missing data
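The 5% rule of thumb above can be sketched as a small Python function. The name `handle_missing`, the use of `None` for a missing entry, and the sample lists are hypothetical. The sketch covers only the categorical and numerical cases (not interpolation for temporal data) and assumes a non-empty list.

```python
import statistics


def handle_missing(values, kind, threshold=0.05):
    """Apply the rule of thumb; `kind` is 'categorical' or 'numerical'."""
    missing = sum(1 for v in values if v is None)
    share = missing / len(values)
    if share < threshold:
        # Under 5% missing: delete those entries.
        return [v for v in values if v is not None]
    # The card does not say what to do at exactly 5%; this sketch
    # treats it like the over-5% case (an assumption).
    if kind == "categorical":
        return ["Unknown" if v is None else v for v in values]
    mean = statistics.mean([v for v in values if v is not None])
    return [mean if v is None else v for v in values]


colors = ["red", None, "blue", "red"]   # 25% missing, categorical
scores = [10, None, 20, 30]             # 25% missing, numerical
mostly_full = list(range(1, 40)) + [None]  # 2.5% missing

print(handle_missing(colors, "categorical"))  # ['red', 'Unknown', 'blue', 'red']
print(handle_missing(scores, "numerical"))    # [10, 20, 20, 30]
print(len(handle_missing(mostly_full, "numerical")))  # 39
```

Whichever branch applies, it is still worth recording how many entries were dropped or filled, since that affects the size of the dataset and the analysis.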
Duplication (Data Cleaning)
Making sure that there are no duplicates in your data and removing any entries that are duplicated
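A minimal Python sketch of duplicate removal that keeps the first occurrence of each row and preserves the original order. The `records` are hypothetical, and rows are assumed to be tuples so they can be stored in a set.

```python
records = [
    ("Ana", 34), ("Ben", 27), ("Ana", 34), ("Cy", 41), ("Ben", 27),
]  # hypothetical rows

seen = set()
unique = []
for row in records:
    if row not in seen:  # keep only the first occurrence of each row
        seen.add(row)
        unique.append(row)

removed = len(records) - len(unique)
print(unique)   # [('Ana', 34), ('Ben', 27), ('Cy', 41)]
print(removed)  # 2
```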
Spelling Check (Data Cleaning)
Detect and correct any spelling errors
Data Standardization
Ensure consistency in text entries, formats, and measurement units
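As one illustration of standardizing measurement units, this Python sketch converts heights written in mixed units and mixed casing to centimetres. The helper `standardize_height`, the supported units, and the sample strings are all assumptions made for the example.

```python
import re

UNIT_TO_CM = {"cm": 1.0, "m": 100.0, "in": 2.54}  # conversion factors to centimetres


def standardize_height(raw: str) -> float:
    """Return a height in centimetres from text like '180 cm', '1.8 M', or '71in'."""
    # Lowercase first so 'M' and 'm' are treated the same.
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([a-z]+)\s*", raw.lower())
    if match is None or match.group(2) not in UNIT_TO_CM:
        raise ValueError(f"unrecognized height: {raw!r}")
    number, unit = match.groups()
    return round(float(number) * UNIT_TO_CM[unit], 1)


heights = ["180 cm", "1.8 M", "71in"]  # hypothetical inconsistent entries
standardized = [standardize_height(h) for h in heights]
print(standardized)  # [180.0, 180.0, 180.3]
```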
Design Principles
Trustworthy: data should be accurate, consistent, complete, and reliable with no misleading data representation
Accessible: data should be relevant and understandable
Elegant: eliminate the arbitrary and be thorough
Interval
quantitative data that’s measured on a scale with equal intervals between values
Ratio
quantitative data that has a true zero point
Textual
stores any kind of text data