Exam 1 Flashcards
Where is the most effort put in data mining?
Data preparation and cleaning
What are the data-related steps in the CRISP-DM guide?
Select/find data, Clean the Data, Prepare the data, Integrate the data, and Format the data
How is data represented?
Numeric: Continuous attributes. Measurements, int, float data types
Nominal - values are symbolic labels: sunny, old, yellow. Can perform equality checks only. Categorical coding may use "1", but the number has no arithmetic meaning.
Ratio - the measurement scheme defines a true zero point, e.g., a distance or a temperature differential (but not the temperature itself); math operations are valid.
Ordinal - rank order: "cold, cool, warm, hot" or "good, better, best". No defined distance between values; can perform equality checks.
Interval - ordered and measured in fixed units, e.g., temperature in °F; differences are meaningful, but there is no true zero.
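A minimal sketch in Python (with made-up values) of which operations each scale supports:

```python
# Nominal: symbolic labels - only equality checks are meaningful.
outlook = ["sunny", "overcast", "rainy"]
print(outlook[0] == "sunny")      # True; outlook[0] + outlook[1] would be meaningless

# Ordinal: rank order, but no defined distance between values.
quality = ["good", "better", "best"]
print(quality.index("good") < quality.index("best"))   # True

# Interval: fixed units, so differences are valid (e.g., temperatures in F).
temps_f = [32.0, 68.0]
print(temps_f[1] - temps_f[0])    # 36.0

# Ratio: a true zero point, so ratios are valid (e.g., distances in km).
dists_km = [5.0, 10.0]
print(dists_km[1] / dists_km[0])  # 2.0
```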
When is numeric data easy to interpret?
When defined ranges exist.
How do you measure whether something is good, bad, or healthy?
Need domain expert.
What are some cautions on data cleaning?
Document what you do, work carefully, don’t make assumptions, be aware of bias.
What are some ways to introduce bias?
Language - different terms or grammars to describe the domain, data attributes, or the problem.
Search - the chosen search strategy constrains what you can find; look at other search options.
Overfitting - results provide a solution based on bad assumptions/patterns, or the search stops too soon.
Actions already performed on the data.
How the data was gathered (how questions were asked, how responses were interpreted, who asked the questions, how samples were selected).
Bias is essentially a synonym for "error".
What are some examples of data cleaning?
Handling invalid values, duplicates, missing data, data entry errors, converting data to specific values in order to perform correct measurements.
What is meant by dirty data?
Data that is incorrect, inaccurate, irrelevant or incomplete.
Data needing to be converted (nominal to numeric)
Data with different formats or coding schemes (such as dates)
Data from >1 file with different field delimiters
Data that is coded
Data that must be summarized ("rolled up")
How does data get “dirty”?
Inconsistent definitions, meanings (especially when combining different sources)
Data entry mistakes
Collection errors
Corrupted data transmissions
Conversion errors.
What are some data issues?
Out of range entries
Unknown, unrecorded or irrelevant data
Missing values
Language translation issues
Unavailable readings
Inapplicable data (asking a male if pregnant)
Customer provided incorrect data
Duplicate data
Stale data
Unavailable data
Data may be available but not in electronic form.
Data associated with the wrong person
User provided wrong data.
Consider representing dates as YYYYMM or YYYYMMDD. What’s good about this formatting? What is the limitation?
Good: You can sort the data.
Limitation: Does not preserve intervals (e.g., 20040201 - 20040131 = 70 as integers, even though the dates are only one day apart).
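A small Python sketch of the tradeoff: YYYYMMDD strings sort correctly as text, but integer subtraction does not yield day intervals.

```python
from datetime import date

dates = ["20040201", "20040131", "20031225"]
print(sorted(dates))          # ['20031225', '20040131', '20040201'] - sorts correctly

print(20040201 - 20040131)    # 70, even though the dates are only 1 day apart

# A real date type preserves intervals:
print(date(2004, 2, 1) - date(2004, 1, 31))   # 1 day, 0:00:00
```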
What are some legacy issues when it comes to dates?
Y2K - the 2-digit year. Is year 02 1902 or 2002? It depends on context (a child's birthday vs. the year a house was built). The typical approach is to set a cutoff year: if YY < cutoff, then 20YY, else 19YY.
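A sketch of the cutoff heuristic in Python (the cutoff of 30 is an arbitrary choice for illustration):

```python
def expand_year(yy: int, cutoff: int = 30) -> int:
    """Expand a 2-digit year: if YY < cutoff then 20YY else 19YY."""
    return 2000 + yy if yy < cutoff else 1900 + yy

print(expand_year(2))    # 2002
print(expand_year(75))   # 1975
```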
What are some reasons values may be missing?
- They are unknown, unrecorded, or irrelevant data
- Malfunctioning equipment
- Changes in the design
- Collation/merge of different datasets.
- Unavailable data
- Removals because of security or privacy issues.
- Translation issues (especially languages)
- Data being used for a different purpose than originally planned (ethical/legal issues)
- Self-reporting - people may omit values if the input mechanism does not require them.
How should one deal with missing values?
- Ignore the attribute or entire instances. (May throw out the needle in the haystack!)
- Try to estimate or predict: use mean, mode, or median values. Relatively easy and not bad on average.
- Treat missing as a separate value
- Look for placeholder values such as blank, 0, ".", 999, or N/A; decide on a standard and create a new value.
- Does missing imply a default value?
- Compute the value based on previous values.
- If inserting zeros for missing values, think about what it has done to the mean and standard deviation.
- Be careful when using tools (some have default operations to handle missing data)
- Randomly select values from the current distribution (pro: won't change the overall shape of the curve - little impact on the mean). Two of these options are sketched below.
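A minimal sketch of two of these options in Python, using a made-up age column with None marking missing values:

```python
import random
import statistics

ages = [23, 31, None, 45, 29, None, 52]
known = [a for a in ages if a is not None]

# Option 1: fill with the mean - easy, and "not bad on average".
mean_filled = [a if a is not None else statistics.mean(known) for a in ages]

# Option 2: draw randomly from the current distribution - preserves the
# overall shape of the curve and has little impact on the mean.
rand_filled = [a if a is not None else random.choice(known) for a in ages]

print(mean_filled)
print(rand_filled)
```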
Again, what are some sources of inaccurate data? :)
- Data entry mistakes
- Measurement errors
- Outliers previously removed
- Duplicates
- Stale data
- Different representations of the same value: New York, NY, N.Y.
How can you find inaccurate data?
Look for the obvious (run statistical tools) and look for nonsensical data (a negative grade or age).
What is discretization?
- Binning
- Useful for generating summary data
- produces discrete values
What is one issue that can come from binning with equal-width?
It can result in clumping. For example, if 99% of employees earn $0-200,000 and the owner makes $2,000,000, then with a bin width of $200,000 nearly everyone lands in the first bin, most bins are empty, and the owner sits alone in the top bin.
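A quick Python illustration of the clumping (the salary values are made up):

```python
from collections import Counter

# 99 ordinary salaries plus the owner's.
salaries = [40_000 + i * 1_500 for i in range(99)] + [2_000_000]

width = 200_000
bin_index = [s // width for s in salaries]   # equal-width bin per salary

print(Counter(bin_index))   # Counter({0: 99, 10: 1}) - everyone clumps into
                            # bin 0; the owner sits alone in the top bin
```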
How can we even out the distribution?
By binning with equal-height (equal-frequency). Instead of defining bins that each span a range of size N, assign N values to each bin.
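Continuing the sketch above, equal-height binning assigns the same number of values to each bin:

```python
# Same made-up salaries as above, now cut into 4 equal-height bins.
salaries = sorted([40_000 + i * 1_500 for i in range(99)] + [2_000_000])
n_bins = 4
per_bin = len(salaries) // n_bins            # 25 values per bin

bins = [salaries[i * per_bin:(i + 1) * per_bin] for i in range(n_bins)]
print([len(b) for b in bins])                # [25, 25, 25, 25] - evened out
print([b[-1] for b in bins])                 # bin boundaries now follow the data
```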
When binning…
- Do not split repeated values across bins
- Create a separate bin for special values.
Talk about the considerations with equal-width and equal-height binning.
Equal-width is simplest and works well in many situations, but equal-height usually gives better results because it avoids clumping.
Are my bins okay?
After you create bins, create a histogram of the values and look at its general shape. A jagged shape may indicate a weakness in the way the bins were formed, so try a different number of bins and different boundaries (shift the ranges). A quick check is sketched below.
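One quick way to eyeball the shape without a plotting library is a tiny text histogram (a real plotting tool works just as well):

```python
from collections import Counter

def text_hist(bin_indices):
    """Print a crude histogram of bin counts to eyeball the shape."""
    counts = Counter(bin_indices)
    for b in sorted(counts):
        print(f"bin {b:2d} | {'#' * counts[b]}")

text_hist([0, 0, 1, 1, 1, 1, 2, 2, 2, 3])
# bin  0 | ##
# bin  1 | ####
# bin  2 | ###
# bin  3 | #
```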
Why use rollup?
Can help reduce the complexity of your model.
What is an outlier in data mining?
Any value that doesn’t really look like most of the others in the data set.
May just be a unique data point, or it could really be an outlier (noise, an error).
How do you handle outliers?
- Do nothing
- Enforce upper and lower bounds
- Let binning handle the problem.
Where can outliers exist?
In a normal distribution, they could be >3 standard deviations from your mean. With a bimodal distribution, they could be at the middle or at the ends.
Some are easy to find, such as a negative age or an age > 120, a negative number of children, or a gender that's not M/F. Both kinds of check are sketched below.
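A sketch of both checks in Python, on made-up ages:

```python
import statistics

# Thirty plausible ages plus one suspicious entry.
ages = [25 + (i % 20) for i in range(30)] + [300]

mu = statistics.mean(ages)
sigma = statistics.stdev(ages)

# Rule of thumb: flag values more than 3 standard deviations from the mean.
flagged = [a for a in ages if abs(a - mu) > 3 * sigma]

# Easy domain checks: negative age or age > 120.
nonsense = [a for a in ages if a < 0 or a > 120]

print(flagged, nonsense)   # both catch 300
```

Note that on very small samples a single extreme value inflates the standard deviation enough that nothing can exceed 3 sigma, so the domain checks are the more robust of the two.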
How can you tell an outlier from an error?
Usually you can’t. Don’t discard outliers unless you are sure they are really outliers.
What is the best approach to dealing with outliers?
- Work with your domain expert.
- Try to help identify why the data values are extreme.
- Do remove the outliers if you think they will negatively impact your analysis.
- Check the source and quality of the raw data.
What is the risk with data warehouses?
If they clean the data and remove the outliers, the needle in the haystack you were looking for may have been removed.
Why use transformations?
Data may be skewed; for example, house prices with a long tail due to a few extremely high prices. You may need to transform the data to make neater bins or to scale it for visualizations.
What is one approach to transform data?
Apply the log10 function to numeric values: log10(1) = 0, log10(10) = 1, log10(100) = 2, log10(1000) = 3.
To handle log(0) being undefined, we can bump all values by 1; taking the absolute value handles negative values. This scales the data, making it easier to visualize and handle.
In general form: log10(|X| + 1). Multiply by the sign of X to regain the sign on the negative side: sign(X) * log10(|X| + 1).
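A sketch of this signed log transform in Python:

```python
import math

def signed_log10(x: float) -> float:
    """sign(X) * log10(|X| + 1): keeps the sign and maps 0 to 0."""
    return math.copysign(math.log10(abs(x) + 1), x)

for x in (0, 9, 99, -99, 2_000_000):
    print(x, round(signed_log10(x), 3))   # 0.0, 1.0, 2.0, -2.0, 6.301
```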
What is regression modeling?
Try to fit the data to a line and calculate the error (residual) of each point; extreme values are sometimes easy to identify by their large errors, as in the sketch below.
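A minimal sketch with made-up data: fit a least-squares line (statistics.linear_regression, Python 3.10+) and look for the point with the largest error.

```python
import statistics

# y roughly follows 2x, except one extreme point at x = 5.
xs = list(range(10))
ys = [0.1, 2.2, 3.9, 6.1, 8.0, 30.0, 12.1, 13.8, 16.2, 18.1]

fit = statistics.linear_regression(xs, ys)

# The residual (error) of each point; the extreme value stands out.
for x, y in zip(xs, ys):
    print(x, y, round(y - (fit.slope * x + fit.intercept), 2))
```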