Data Preparation Flashcards
What are the critical steps in data preprocessing that involve cleaning and filling in missing values in a dataset?
Data cleaning and imputation are the critical steps: cleaning detects and removes or corrects erroneous and inconsistent records, while imputation fills in missing values with estimated ones.
What are some reasons why missing values may arise in a dataset?
Missing values can arise due to a variety of reasons, including human error, incomplete data, or system failure.
What is one common method to handle missing values, and what is a potential drawback of this method?
One common method to handle missing values is to remove rows and columns with missing data. However, this can result in a loss of data, which may not be acceptable in many cases.
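As a minimal sketch of this approach with pandas (the DataFrame and column names here are hypothetical), note how dropping rows or columns can discard a large share of the data:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values (NaN)
df = pd.DataFrame({
    "age":    [25, np.nan, 37, 41],
    "income": [52000, 61000, np.nan, 58000],
    "city":   ["Boston", "Austin", "Denver", np.nan],
})

rows_dropped = df.dropna()        # drop every row that contains a missing value
cols_dropped = df.dropna(axis=1)  # drop every column that contains a missing value

print(rows_dropped)  # only the one fully complete row survives
print(cols_dropped)  # every column here has a NaN, so all columns are dropped
```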
What is imputation, and what is one common imputation method?
Imputation involves filling in the missing data with estimated values. One common imputation method is to use mean, median, or mode to fill in the missing values.
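A small sketch of mean and mode imputation with pandas, on hypothetical columns (median imputation works the same way via .median()):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":   [25, np.nan, 37, 41, np.nan],
    "color": ["red", "blue", None, "blue", "red"],
})

# Numeric column: fill missing entries with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill missing entries with the mode (most frequent value)
df["color"] = df["color"].fillna(df["color"].mode()[0])

print(df)
```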
What is the k-nearest neighbor algorithm (KNN), and how is it used for imputation?
The k-nearest neighbor algorithm (KNN) involves estimating missing values by using the values of the k nearest neighbors in the dataset. KNN imputation is quite simple and can yield decent results, especially in small datasets.
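A sketch using scikit-learn's KNNImputer on a small, made-up numeric matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small numeric matrix with missing entries (np.nan)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature across the
# k nearest rows (k=2 here), with distances computed on the observed features.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```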
Besides KNN, what are some other machine learning approaches that can be used for imputation?
Other machine learning approaches that can be used for imputation include decision trees, random forests, and neural networks.
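One way to apply a tree-based model to imputation is scikit-learn's IterativeImputer (an experimental API at the time of writing), which models each incomplete column as a function of the others; the data below is made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to expose IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# A random forest predicts the missing entries in each column from the
# other columns, cycling over the columns in a round-robin fashion.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled)
```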
How are missing values typically imputed in binary data, and what are the types of regression used?
Missing values in binary variables are typically imputed with logistic regression, which predicts the missing category from the other features; linear regression is the analogous choice when the variable being imputed is continuous-valued rather than binary.
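A sketch of logistic-regression imputation for a hypothetical binary column "smoker", predicted from the other (made-up) features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: "smoker" is a binary column with some missing entries
df = pd.DataFrame({
    "age":    [25, 34, 47, 52, 29, 61],
    "bmi":    [22.1, 27.5, 30.2, 28.9, 24.3, 31.0],
    "smoker": [0, 1, np.nan, 1, 0, np.nan],
})

observed = df["smoker"].notna()

# Fit on the rows where the binary value is known
model = LogisticRegression()
model.fit(df.loc[observed, ["age", "bmi"]], df.loc[observed, "smoker"])

# Predict the missing binary values from the other features
df.loc[~observed, "smoker"] = model.predict(df.loc[~observed, ["age", "bmi"]])
print(df)
```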
What is normalization, and in what fields is it commonly used?
Normalization is a critical data preparation step that adjusts measurements for systematic technical variation, for example by subtracting background signal or scaling values to a common range, so that samples can be compared on an equal footing. This approach is often used in bioinformatics to preprocess data from microarrays and other high-throughput experiments.
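As an illustrative sketch (the intensity values and background estimates are made up, and this is only one of several normalization schemes), background subtraction followed by a log transform might look like this:

```python
import numpy as np

# Hypothetical microarray intensities: rows = probes, columns = samples
intensities = np.array([
    [520.0, 610.0, 480.0],
    [1500.0, 1720.0, 1390.0],
    [300.0, 280.0, 330.0],
])
background = np.array([150.0, 170.0, 140.0])  # per-sample background estimate

# Subtract the per-sample background, clip at a small positive floor,
# then log2-transform so values are comparable across samples
corrected = np.clip(intensities - background, 1.0, None)
normalized = np.log2(corrected)
print(normalized)
```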
What is the KNN algorithm and what tasks can it be used for in bioinformatics?
The KNN algorithm is a machine learning algorithm that can be used for classification and regression tasks. In bioinformatics, it can be used for tasks such as gene expression analysis, protein structure prediction, and disease diagnosis.
How does the KNN algorithm work?
The KNN algorithm works by comparing a new data point with the k nearest neighbors in the training set to predict its class or value. The steps are: (1) choose the number of neighbors k; (2) measure the distance between the new point and every point in the training set; (3) select the k nearest neighbors; (4) take the majority class (classification) or the average value (regression) of those neighbors; and (5) output the prediction.
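A from-scratch sketch of these steps for classification, on a toy two-class dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1-2. Measure the Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 3. Find the indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 4-5. Classification: return the majority class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example with two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"
```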
What is the most common distance measure used in the KNN algorithm?
The most common distance measure used in the KNN algorithm is the Euclidean distance, which is the straight-line distance between two points.
What are some advantages and disadvantages of the KNN algorithm?
Some advantages of the KNN algorithm include being simple to implement and not requiring any assumptions about the distribution of the data. Some disadvantages include being sensitive to the choice of k and being computationally intensive for large datasets.
Why is the KNN algorithm popular in bioinformatics?
The KNN algorithm remains a popular algorithm in bioinformatics due to its effectiveness in many applications, such as gene expression analysis, protein structure prediction, and disease diagnosis.
What are duplicate records?
Duplicate records are multiple rows in a dataset that represent the same entity or observation.
What are some reasons that can lead to the occurrence of duplicate records in a dataset?
Duplicate records in a dataset can arise due to various reasons such as data entry errors, data storage issues, merging multiple datasets, or combining data from different sources.
Why is it important to clean duplicate data in a dataset?
Duplicate data can cause inaccuracies in statistical analysis, modeling, and data visualization. Therefore, cleaning duplicate data is a crucial step in the data cleaning process.
What techniques can data scientists use to deal with duplicate records in a dataset?
Data scientists typically use record linkage and fuzzy matching techniques to deal with duplicate records in a dataset.
What is record linkage and how does it work?
Record linkage is a technique that involves comparing different records in a dataset to identify those that refer to the same entity. The aim is to identify the duplicates and merge them into a single record. This process is usually done by using unique identifiers such as social security numbers, email addresses, or phone numbers.
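A minimal pandas sketch, assuming a hypothetical customer table in which the email address serves as the unique identifier:

```python
import pandas as pd

# Hypothetical customer records; email is the unique identifier
customers = pd.DataFrame({
    "name":  ["Ann Lee", "Ann Lee", "Bo Chan", "A. Lee"],
    "email": ["ann@example.com", "ann@example.com", "bo@example.com", "ann@example.com"],
    "phone": ["555-0101", None, "555-0199", "555-0101"],
})

# Records sharing the same identifier are treated as the same entity;
# keep the first occurrence and drop the rest
linked = customers.drop_duplicates(subset="email", keep="first")
print(linked)
```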
When is fuzzy matching useful for identifying duplicate records?
Fuzzy matching is useful when the data has errors or inconsistencies, such as spelling variations, missing data, or data entered in different formats. It involves comparing records that are not an exact match using techniques such as edit distance, phonetic matching, and tokenization.
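A small sketch of fuzzy matching using Python's standard-library difflib; the name pairs and the similarity threshold are purely illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means an exact match
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Jon Smith", "John Smith"),
    ("ACME Corp.", "Acme Corporation"),
    ("Jane Doe", "Bob Roe"),
]
for a, b in pairs:
    score = similarity(a, b)
    # Flag pairs above a chosen threshold (0.6 here) as likely duplicates
    flag = "-> likely duplicate" if score > 0.6 else ""
    print(f"{a!r} vs {b!r}: {score:.2f} {flag}")
```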
How can data scientists decide which duplicate records to keep and which ones to discard?
The decision to keep or discard duplicate records depends on the specific use case and the goals of the analysis. For instance, if the goal is to count the number of unique customers, it may be appropriate to remove all duplicate records. However, if the goal is to analyze customer behavior over time, it may be useful to keep all records but mark them as duplicates.
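A short pandas sketch contrasting the two choices, on a made-up orders table:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 101],
    "order_date":  ["2024-01-05", "2024-01-05", "2024-01-07", "2024-01-09", "2024-02-01"],
})

# Goal 1: count unique customers -> duplicates are irrelevant, so collapse them
unique_customers = orders["customer_id"].nunique()

# Goal 2: analyze behavior over time -> keep every row but flag the repeats
orders["is_duplicate"] = orders.duplicated(subset=["customer_id", "order_date"])

print(unique_customers)  # 3
print(orders)
```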
What is the importance of correcting incorrect data types in data preprocessing for data science and machine learning?
Correcting incorrect data types is important in data preprocessing because incorrect data types can cause errors in data analysis and modeling, and can also affect the accuracy of predictions made by machine learning models.
What are some common methods for correcting incorrect data types in a dataset for data science and machine learning?
Some common methods include using data type conversion functions, regular expressions, imputation techniques, and data validation techniques.
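A sketch of type conversion with pandas, assuming raw values arrive as strings (as they often do from CSV files); the columns are hypothetical:

```python
import pandas as pd

# Hypothetical raw data where every column arrives as a string
raw = pd.DataFrame({
    "price":     ["19.99", "5.50", "not available"],
    "quantity":  ["3", "10", "7"],
    "order_day": ["2024-03-01", "2024-03-02", "2024-03-05"],
})

raw["price"] = pd.to_numeric(raw["price"], errors="coerce")  # unparseable values become NaN
raw["quantity"] = raw["quantity"].astype(int)                # plain cast when values are clean
raw["order_day"] = pd.to_datetime(raw["order_day"])          # string -> datetime64

print(raw.dtypes)
```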
How can regular expressions be used to correct incorrect data types in a dataset for data science and machine learning?
Regular expressions can be used to identify and correct data that is in the wrong format. They can be used to identify patterns in data, such as phone numbers, email addresses, or dates, and convert them to the correct format.
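A sketch using a regular expression to normalize hypothetical phone numbers to a single format:

```python
import re

# Hypothetical phone numbers entered in inconsistent formats
raw_phones = ["(555) 010-1234", "555.010.5678", "555 010 9012", "not a number"]

def normalize_phone(value: str):
    digits = re.sub(r"\D", "", value)      # strip everything that is not a digit
    if len(digits) == 10:                  # simple US-style 10-digit check
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return None                            # leave unparseable entries for later validation

print([normalize_phone(p) for p in raw_phones])
# ['555-010-1234', '555-010-5678', '555-010-9012', None]
```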
Why is it important to note that incorrect data types can also arise due to missing values or data entry errors in data science and machine learning?
Incorrect data types sometimes stem from missing values or data entry errors rather than formatting problems alone. Recognizing this matters because those cases call for different fixes: imputation techniques such as mean imputation or regression imputation fill in the missing values, while data validation techniques identify and correct the entry errors before the types are converted.