Data Preparation Flashcards
What are the critical steps in data preprocessing that involve cleaning and filling in missing values in a dataset?
Data preparation and imputation: preparation cleans the dataset (for example, correcting errors and removing duplicates), while imputation fills in missing values with estimated ones.
What are some reasons why missing values may arise in a dataset?
Missing values can arise due to a variety of reasons, including human error, incomplete data, or system failure.
What is one common method to handle missing values, and what is a potential drawback of this method?
One common method to handle missing values is to remove the rows or columns that contain missing data. However, this discards otherwise usable information, which may not be acceptable when data are scarce.
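As a minimal sketch of this removal approach (assuming pandas and a small, hypothetical table), both row-wise and column-wise dropping can be done with dropna; note how much of the data disappears:

```python
import pandas as pd
import numpy as np

# Hypothetical expression table with missing values (NaN).
df = pd.DataFrame({
    "gene_a": [1.2, np.nan, 3.1, 2.8],
    "gene_b": [0.5, 0.7, np.nan, 0.9],
    "gene_c": [2.2, 2.4, 2.1, 2.0],
})

rows_dropped = df.dropna(axis=0)   # drop any row with a missing value
cols_dropped = df.dropna(axis=1)   # drop any column with a missing value

# The data loss is substantial: 2 of 4 rows (or 2 of 3 columns) are discarded.
print(rows_dropped.shape, cols_dropped.shape)
```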
What is imputation, and what is one common imputation method?
Imputation involves filling in the missing data with estimated values. One common imputation method is to fill in the missing values of a column with its mean or median (for numeric data) or its mode (for categorical data).
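A minimal sketch of mean/median/mode imputation, assuming pandas and hypothetical columns (one numeric, one categorical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "expression": [1.2, np.nan, 3.1, 2.8],    # continuous: mean or median
    "genotype":   ["AA", "AG", None, "AA"],   # categorical: mode
})

# Replace each gap with a summary statistic of the observed values.
df["expression"] = df["expression"].fillna(df["expression"].median())
df["genotype"] = df["genotype"].fillna(df["genotype"].mode()[0])
print(df)
```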
What is the k-nearest neighbor algorithm (KNN), and how is it used for imputation?
The k-nearest neighbor algorithm (KNN) involves estimating missing values by using the values of the k nearest neighbors in the dataset. KNN imputation is quite simple and can yield decent results, especially in small datasets.
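A sketch of KNN imputation using scikit-learn's KNNImputer on a small, hypothetical samples-by-features matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical samples x features matrix with NaN gaps.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 0.9],
    [0.9, 1.9, 1.1],
    [5.0, 6.0, 4.8],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest neighbours (by Euclidean distance on the observed features).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```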
Besides KNN, what are some other machine learning approaches that can be used for imputation?
Other machine learning approaches that can be used for imputation include decision trees, random forests, and neural networks.
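One way such models can be plugged in, shown here as a sketch with scikit-learn's IterativeImputer wrapping a random forest (the data are hypothetical, and other estimators such as decision trees could be substituted):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, np.nan, 0.9],
    [0.9, 1.9, 1.1],
    [5.0, 6.0, 4.8],
])

# Each feature with gaps is modelled from the other features;
# here a random forest plays the role of the regressor.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled)
```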
How are missing values typically imputed in binary data, and what are the types of regression used?
In binary data, missing values are typically imputed with a regression model trained on the complete records: logistic regression is used when the column being imputed is binary (a classification problem), while linear regression is used when the target is continuous-valued.
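A minimal sketch of this idea, assuming scikit-learn and a hypothetical binary "mutation" column imputed from two other features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 'mutation' is binary (0/1) and partly missing.
df = pd.DataFrame({
    "expr_1":   [1.2, 3.1, 2.8, 0.9, 2.5],
    "expr_2":   [0.5, 0.7, 0.9, 0.4, 0.8],
    "mutation": [0,   1,   1,   0,   np.nan],
})

# Train on the complete records only.
known = df["mutation"].notna()
model = LogisticRegression()
model.fit(df.loc[known, ["expr_1", "expr_2"]], df.loc[known, "mutation"])

# Fill the gap with the predicted class (a LinearRegression would be used
# instead if the column being imputed were continuous-valued).
df.loc[~known, "mutation"] = model.predict(df.loc[~known, ["expr_1", "expr_2"]])
print(df)
```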
What is normalization, and in what fields is it commonly used?
Normalization is a critical data preparation step that adjusts measurements to a common scale, for example by subtracting background signal, so that values are accurate and comparable across samples. This approach is often used in bioinformatics to preprocess data from microarrays and other high-throughput experiments.
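A simple sketch of background subtraction followed by between-array normalization, using NumPy on hypothetical microarray intensities:

```python
import numpy as np

# Hypothetical intensities: rows = probes, columns = arrays,
# plus a matching matrix of local background estimates.
signal     = np.array([[120.0, 140.0], [300.0, 260.0], [80.0, 95.0]])
background = np.array([[ 20.0,  25.0], [ 30.0,  28.0], [15.0, 18.0]])

# Background correction: subtract the local background (clip at a small
# positive value so the log transform below stays defined).
corrected = np.clip(signal - background, 1.0, None)

# Simple between-array normalization: log-transform, then centre each
# array on its median so the arrays are comparable.
log_values = np.log2(corrected)
normalized = log_values - np.median(log_values, axis=0)
print(normalized)
```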
What is the KNN algorithm and what tasks can it be used for in bioinformatics?
The KNN algorithm is a machine learning algorithm that can be used for classification and regression tasks. In bioinformatics, it can be used for tasks such as gene expression analysis, protein structure prediction, and disease diagnosis.
How does the KNN algorithm work?
The KNN algorithm works by comparing a new data point with the k nearest neighbors in the training set to predict the class or value of the new data point. The steps are: (1) choose the number of neighbors, k; (2) measure the distance between the new data point and every data point in the training set; (3) find the k nearest neighbors; (4) take the majority class (for classification) or the average value (for regression) of those neighbors; and (5) output the prediction.
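A sketch of these steps using scikit-learn's KNeighborsClassifier on a hypothetical two-class expression dataset (the classifier handles the distance computation and majority vote internally):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: expression profiles with disease labels.
X_train = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 6.1], [5.2, 5.9]])
y_train = np.array(["healthy", "healthy", "disease", "disease"])

# Choose k, fit on the training set, then predict a new point
# by majority vote over its 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

x_new = np.array([[4.8, 6.0]])
print(knn.predict(x_new))
```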
What is the most common distance measure used in the KNN algorithm?
The most common distance measure used in the KNN algorithm is the Euclidean distance, which is the straight-line distance between two points.
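For concreteness, the Euclidean distance between two feature vectors can be computed directly (a small NumPy sketch with made-up vectors):

```python
import numpy as np

# Euclidean (straight-line) distance: sqrt of the sum of squared
# coordinate differences between the two points.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])
distance = np.sqrt(np.sum((a - b) ** 2))   # equivalent to np.linalg.norm(a - b)
print(distance)
```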
What are some advantages and disadvantages of the KNN algorithm?
Some advantages of the KNN algorithm include being simple to implement and not requiring any assumptions about the distribution of the data. Some disadvantages include being sensitive to the choice of k and being computationally intensive for large datasets.
Why is the KNN algorithm popular in bioinformatics?
The KNN algorithm remains a popular algorithm in bioinformatics due to its effectiveness in many applications, such as gene expression analysis, protein structure prediction, and disease diagnosis.
What are duplicate records?
Duplicate records are multiple rows in a dataset that represent the same underlying entry or observation.
What are some reasons that can lead to the occurrence of duplicate records in a dataset?
Duplicate records in a dataset can arise due to various reasons such as data entry errors, data storage issues, merging multiple datasets, or combining data from different sources.
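As an illustration (a hypothetical table assembled from two sources), pandas can flag and drop such repeated rows:

```python
import pandas as pd

# Hypothetical table with one repeated record for sample S2.
df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S2", "S3"],
    "value":     [0.8, 1.1, 1.1, 0.4],
})

print(df.duplicated())             # flags the second occurrence of S2
deduplicated = df.drop_duplicates()
print(deduplicated)
```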