Data Preparation Flashcards

1
Q

What are the critical steps in data preprocessing that involve cleaning and filling in missing values in a dataset?

A

Data preparation and imputation. Data preparation cleans the dataset (correcting or removing invalid records), while imputation fills in missing values with estimated ones.

2
Q

What are some reasons why missing values may arise in a dataset?

A

Missing values can arise due to a variety of reasons, including human error, incomplete data, or system failure.

3
Q

What is one common method to handle missing values, and what is a potential drawback of this method?

A

One common method to handle missing values is to remove rows and columns with missing data. However, this can result in a loss of data, which may not be acceptable in many cases.
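
A minimal pandas sketch of this removal approach, using a made-up two-column DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31], "income": [50000, 62000, 71000]})

rows_dropped = df.dropna()        # drops the row with the missing age
cols_dropped = df.dropna(axis=1)  # drops the whole age column instead
```

Either way some data is discarded, which is the drawback noted above.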

4
Q

What is imputation, and what is one common imputation method?

A

Imputation involves filling in the missing data with estimated values. One common imputation method is to use mean, median, or mode to fill in the missing values.
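
A short pandas sketch of this approach on hypothetical `age` and `city` columns:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 22], "city": ["NY", "LA", None, "NY"]})

# Mean (or median) suits numeric columns; mode suits categorical ones.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```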

5
Q

What is the k-nearest neighbor algorithm (KNN), and how is it used for imputation?

A

The k-nearest neighbor algorithm (KNN) involves estimating missing values by using the values of the k nearest neighbors in the dataset. KNN imputation is quite simple and can yield decent results, especially in small datasets.
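
A minimal sketch using scikit-learn's `KNNImputer` on a small made-up array:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])

# Each missing entry is replaced by the mean of that feature across
# the k nearest neighbors (by Euclidean distance on the observed features).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```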

6
Q

Besides KNN, what are some other machine learning approaches that can be used for imputation?

A

Other machine learning approaches that can be used for imputation include decision trees, random forests, and neural networks.
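
One way to sketch a tree-based approach is scikit-learn's `IterativeImputer` with a random forest as the estimator; the array here is made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 41.0]])

# Each feature with missing values is modeled as a function of the
# other features, here using a random forest regressor.
imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=0))
X_filled = imputer.fit_transform(X)
```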

7
Q

How are missing values typically imputed in binary data, and what are the types of regression used?

A

For binary data, missing values are typically imputed using logistic regression, which handles binary classification problems; linear regression is used instead when the target is continuous-valued.
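
A minimal sketch of regression-based imputation for a binary column; the column names (`age`, `income`, `subscribed`) are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"age": [25, 31, 22, 40, 35],
                   "income": [50, 62, 41, 80, 70],
                   "subscribed": [1, 0, np.nan, 1, np.nan]})

# Fit on the rows where the binary target is observed...
observed = df["subscribed"].notna()
model = LogisticRegression()
model.fit(df.loc[observed, ["age", "income"]], df.loc[observed, "subscribed"])

# ...then predict the missing entries from the other features.
df.loc[~observed, "subscribed"] = model.predict(df.loc[~observed, ["age", "income"]])
```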

8
Q

What is normalization, and in what fields is it commonly used?

A

Normalization is a critical data preparation step that adjusts raw measurements to a common scale, for example by subtracting background signal, to improve the accuracy and consistency of the dataset. It is commonly used in bioinformatics to preprocess data from microarrays and other high-throughput experiments.
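
A minimal NumPy sketch of background subtraction; the intensities and background estimates are made up:

```python
import numpy as np

raw_signal = np.array([120.0, 340.0, 95.0, 410.0])  # raw spot intensities
background = np.array([30.0, 45.0, 28.0, 50.0])     # per-spot background estimates

# Subtract the background, clipping at zero so intensities stay non-negative.
corrected = np.clip(raw_signal - background, 0, None)
```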

9
Q

What is the KNN algorithm and what tasks can it be used for in bioinformatics?

A

The KNN algorithm is a machine learning algorithm that can be used for classification and regression tasks. In bioinformatics, it can be used for tasks such as gene expression analysis, protein structure prediction, and disease diagnosis.

10
Q

How does the KNN algorithm work?

A

The KNN algorithm predicts the class or value of a new data point by comparing it with the k nearest neighbors in the training set. The steps are:
  • Choose the number of neighbors k.
  • Measure the distance between the new data point and every point in the training set.
  • Find the k nearest neighbors.
  • Determine the class (majority vote) or value (average) of those neighbors.
  • Output the prediction.
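
A from-scratch sketch of these steps for classification; the function name and toy data are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Measure Euclidean distance to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Find the indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Output the majority class among those neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # -> "A"
```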

11
Q

What is the most common distance measure used in the KNN algorithm?

A

The most common distance measure used in the KNN algorithm is the Euclidean distance, which is the straight-line distance between two points.
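
A quick NumPy check of the formula on two made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Square root of the sum of squared coordinate differences.
dist = np.sqrt(np.sum((a - b) ** 2))  # 5.0
# Equivalent built-in shortcut:
dist = np.linalg.norm(a - b)
```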

12
Q

What are some advantages and disadvantages of the KNN algorithm?

A

Some advantages of the KNN algorithm include being simple to implement and not requiring any assumptions about the distribution of the data. Some disadvantages include being sensitive to the choice of k and being computationally intensive for large datasets.

13
Q

Why is the KNN algorithm popular in bioinformatics?

A

The KNN algorithm remains a popular algorithm in bioinformatics due to its effectiveness in many applications, such as gene expression analysis, protein structure prediction, and disease diagnosis.

14
Q

What are duplicate records?

A

Duplicate records refer to multiple instances of the same data.

15
Q

What are some reasons that can lead to the occurrence of duplicate records in a dataset?

A

Duplicate records in a dataset can arise due to various reasons such as data entry errors, data storage issues, merging multiple datasets, or combining data from different sources.

16
Q

Why is it important to clean duplicate data in a dataset?

A

Duplicate data can cause inaccuracies in statistical analysis, modeling, and data visualization. Therefore, cleaning duplicate data is a crucial step in the data cleaning process.

17
Q

What techniques can data scientists use to deal with duplicate records in a dataset?

A

Data scientists typically use record linkage and fuzzy matching techniques to deal with duplicate records in a dataset.

18
Q

What is record linkage and how does it work?

A

Record linkage is a technique that involves comparing different records in a dataset to identify those that refer to the same entity. The aim is to identify the duplicates and merge them into a single record. This process is usually done by using unique identifiers such as social security numbers, email addresses, or phone numbers.

19
Q

When is fuzzy matching useful for identifying duplicate records?

A

Fuzzy matching is useful when the data has errors or inconsistencies, such as spelling variations, missing data, or data entered in different formats. It involves comparing records that are not an exact match using techniques such as edit distance, phonetic matching, and tokenization.
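
A minimal fuzzy-matching sketch using Python's standard-library `difflib`; the names and the 0.85 threshold are illustrative choices:

```python
from difflib import SequenceMatcher

def likely_duplicates(a, b, threshold=0.85):
    """Flag two strings as probable duplicates when their similarity is high."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(likely_duplicates("Jonathan Smith", "Jonathon Smith"))  # True (one-letter variation)
print(likely_duplicates("Jonathan Smith", "Maria Garcia"))    # False
```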

20
Q

How can data scientists decide which duplicate records to keep and which ones to discard?

A

The decision to keep or discard duplicate records depends on the specific use case and the goals of the analysis. For instance, if the goal is to count the number of unique customers, it may be appropriate to remove all duplicate records. However, if the goal is to analyze customer behavior over time, it may be useful to keep all records but mark them as duplicates.
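
A pandas sketch of both choices on made-up purchase records:

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [101, 102, 101, 103],
                   "purchase": ["book", "pen", "book", "lamp"]})

# Counting unique customers: drop exact duplicates outright.
unique_customers = df.drop_duplicates()

# Analyzing behavior over time: keep every record but mark the duplicates.
df["is_duplicate"] = df.duplicated(keep="first")
```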

21
Q

What is the importance of correcting incorrect data types in data preprocessing for data science and machine learning?

A

Correcting incorrect data types is important in data preprocessing because incorrect data types can cause errors in data analysis and modeling, and can also affect the accuracy of predictions made by machine learning models.

22
Q

What are some common methods for correcting incorrect data types in a dataset for data science and machine learning?

A

Some common methods include using data type conversion functions, regular expressions, imputation techniques, and data validation techniques.
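
A short pandas sketch of type-conversion functions on hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({"price": ["19.99", "5.49", "n/a"],
                   "date": ["2023-01-05", "2023-02-10", "2023-03-15"]})

# Coerce strings to numbers; unparseable entries become NaN for later imputation.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["date"] = pd.to_datetime(df["date"])
```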

23
Q

How can regular expressions be used to correct incorrect data types in a dataset for data science and machine learning?

A

Regular expressions can be used to identify and correct data that is in the wrong format. They can be used to identify patterns in data, such as phone numbers, email addresses, or dates, and convert them to the correct format.
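
A minimal sketch using Python's `re` module to pull hypothetical phone numbers into one format:

```python
import re

def normalize_phone(p):
    """Strip non-digits, then rewrite ten digits as XXX-XXX-XXXX."""
    digits = re.sub(r"\D", "", p)
    return re.sub(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", digits)

phones = ["(555) 123-4567", "555.123.4567", "5551234567"]
print([normalize_phone(p) for p in phones])  # all become '555-123-4567'
```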

24
Q

Why is it important to note that incorrect data types can also arise due to missing values or data entry errors in data science and machine learning?

A

Incorrect data types can also arise from missing values or data entry errors, and recognizing the cause matters because the remedy differs: imputation techniques such as mean or regression imputation can fill in missing values, while data validation techniques can identify and correct data entry errors.

25
Q

What are outliers in machine learning and why is it important to remove them?

A

Outliers are data points that significantly differ from other observations in the dataset. They can occur due to errors in data collection, measurement errors, or rare events. It is important to remove outliers because they can affect the accuracy and performance of machine learning models by introducing noise and bias into the data.

26
Q

What are the statistical techniques used to remove outliers in machine learning?

A

The two commonly used statistical techniques for removing outliers are the Z-score method and the interquartile range (IQR) method. The Z-score method measures how many standard deviations a data point lies from the mean and flags points beyond a chosen cutoff (commonly 2 or 3). The IQR method computes the range between the first and third quartiles and flags points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
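
A NumPy sketch of both techniques on a small made-up sample:

```python
import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])  # 95 is the suspect point

# Z-score method: flag points far from the mean (a cutoff of 2-3 is common).
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```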

27
Q

What are the machine learning techniques used to remove outliers in machine learning?

A

Clustering and regression are two machine learning techniques that can be used to remove outliers. Clustering algorithms can be used to group similar data points together, and outliers can be identified as data points that do not belong to any cluster. Regression analysis can be used to identify data points that have a high residual and are thus potential outliers.
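
One way to sketch the clustering approach is scikit-learn's DBSCAN, which labels points that fall in no dense cluster as noise (-1); the toy data and parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [20.0, 20.0]])

# Points assigned the label -1 belong to no cluster and are outlier candidates.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
outliers = X[labels == -1]  # here: the isolated point [20, 20]
```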

28
Q

What are the potential effects of removing outliers on machine learning models?

A

Removing outliers can improve the accuracy and performance of machine learning models by reducing noise in the data. However, removing too many outliers can also result in biased or inaccurate models. Careful consideration should be given to the number of outliers to remove to ensure that the models are not biased or inaccurate.

29
Q

What to do with an outlier?

A
  • Delete
  • Transform the variable
  • Transform the value
  • Set outliers to NA then impute the values
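
For example, the last option (set outliers to NA, then impute) can be sketched in pandas:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 95.0, 13.0])

# Replace values outside the IQR fences with NaN, then impute with the median.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
s = s.mask((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr))
s = s.fillna(s.median())
```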