Data Analyst Flashcards
What are the steps in an analytics project?
Problem definition, data exploration, data preparation, modeling, validation, and implementation and tracking
What is data cleansing?
Identifying and removing errors and inconsistencies from data to enhance the quality of the data
What are some of the best practices for data cleansing?
- Sort by attributes
- Work stepwise: improve the data with each step
- Break down large data sets into smaller data sets to clean
- Use find and replace or other functions within Excel
- Review summary statistics to spot inconsistencies
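As a rough illustration of the summary-statistics check, the sketch below profiles one column with Python's standard `statistics` module. The `ages` values are invented for the example; a large gap between the mean and the median is only a hint that a value deserves a closer look, not proof of an error.

```python
import statistics

# Hypothetical column of ages; 290 is probably a data-entry error.
ages = [34, 29, 41, 38, 290, 33]

summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
}

# A mean far above the median suggests one or more extreme values.
suspicious = summary["mean"] > 1.5 * summary["median"]
print(summary, suspicious)
```

The 1.5 threshold is an arbitrary choice for this sketch; in practice the cutoff depends on the data.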
What is logistic regression?
Logistic regression is a statistical method for examining a dataset in which one or more independent variables determine an outcome. The outcome is categorical, most commonly binary (for example, yes/no).
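To make the idea concrete, here is a minimal sketch of how a fitted logistic regression turns independent variables into a probability. The weights and bias are hypothetical numbers chosen for illustration, not the result of fitting any real data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Return P(outcome = 1) for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two independent variables.
weights = [0.8, -0.5]
bias = -0.2

p = predict_probability([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is a common but not mandatory cutoff
print(round(p, 3), label)
```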
What is data mining vs. data profiling vs. data analysis?
Data profiling: It targets the instance-level analysis of individual attributes. It gives information on attributes such as value range, discrete values and their frequency, occurrence of null values, data type, length, etc.
Data mining: It focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relationships between several attributes, etc.
Data mining vs. data analysis:
- Purpose: data mining is used to recognize patterns in stored data; data analysis is used to order and organize raw data in a meaningful manner.
- Input data: data mining is performed on clean, well-documented data; data analysis involves data cleaning, so its data is not yet in a well-documented format.
- Interpretability: results extracted from data mining are not easy to interpret; results extracted from data analysis are easy to interpret.
Data mining is often used to identify patterns in stored data. It is mostly used for machine learning, where analysts recognize the patterns with the help of algorithms. Data analysis, by contrast, is used to gather insights from raw data, which has to be cleaned and organized before the analysis is performed.
Explain KNN Imputation Method
In KNN imputation, a missing attribute value is imputed from the values of that attribute in the k records that are most similar to the record with the missing value. A distance function determines how similar two records are.
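A minimal sketch of KNN imputation in plain Python follows. It assumes numeric data, marks missing values with `None`, uses Euclidean distance over the columns the incomplete row actually has, and averages the k nearest complete rows. The function name `knn_impute` and the sample data are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows.

    Assumes at least one row has no missing values.
    """
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def distance(other):
            # Euclidean distance over the columns this row has values for.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled = [
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed

data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, None],  # third value is missing
]
result = knn_impute(data, k=2)
print(result[3])
```

Here the last row is closest to the first two rows, so its missing value becomes the mean of 10.0 and 12.0.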
What should be done with suspected or missing data?
- Prepare a validation report that gives information on all suspected data, such as the validation criteria it failed and the date and time of occurrence
- Experienced personnel should examine the suspicious data to determine its acceptability
- Invalid data should be assigned a validation code and replaced
- To work on missing data, use the best analysis strategy, such as the deletion method, single imputation methods, model-based methods, etc.
What is the hierarchical clustering algorithm?
A hierarchical clustering algorithm combines or divides existing groups, creating a hierarchical structure that shows the order in which groups are divided or merged.
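The merging direction can be sketched as follows. This is a small agglomerative example on one-dimensional points using single linkage (distance between the closest members of two groups); the linkage choice and the function name `agglomerative` are assumptions made for the illustration, and the dividing (top-down) variant is not shown.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on 1-D points.

    Returns the merges in the order they happened, which is the hierarchy.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerative([1.0, 2.0, 9.0, 10.0, 25.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```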
What is the k-means algorithm?
K-means is a well-known partitioning method. Each object is classified as belonging to one of k groups, with k chosen a priori.
The k-means algorithm assumes that:
- The clusters are spherical: the data points in a cluster are centered around that cluster's centroid
- The variance/spread of the clusters is similar: each data point belongs to the closest cluster
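The two alternating steps of k-means (assign each point to its closest centroid, then move each centroid to the mean of its group) can be sketched like this. The starting centroids are passed in by the caller to keep the example deterministic; real implementations usually pick them randomly or with a seeding heuristic. The sample points are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm for 2-D points with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, groups = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(groups)
```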
What are the key skills for a data analyst?
- Database knowledge
- Predictive analytics (statistical background, predictive modeling)
- Big data knowledge (machine learning, unstructured data analytics)
- Presentation skills (data visualization, insight presentation, report design)
What is MapReduce?
MapReduce is a framework for processing large data sets: it splits them into subsets, processes each subset on a different server, and then combines the results obtained from each.
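The classic word-count example shows the shape of the computation. This sketch runs everything in one process, so the "servers" are just list elements; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative and do not belong to any particular framework.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one subset of the data."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Each string stands in for a subset that a separate server would process.
chunks = ["big data big", "data sets", "big sets of data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```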
What is time series analysis?
Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a process is forecast by analyzing its previous data with methods such as exponential smoothing and log-linear regression.
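As one example of a time-domain method, simple exponential smoothing can be sketched in a few lines. The smoothing factor `alpha` and the `sales` values are hypothetical; seeding the smoothed series with the first observation is one common convention among several.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed series; its last value serves as the next-step forecast."""
    smoothed = [series[0]]  # seed with the first observation
    for value in series[1:]:
        # New estimate = weighted blend of the new value and the previous estimate.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10.0, 12.0, 14.0, 16.0]
smoothed = exponential_smoothing(sales, alpha=0.5)
forecast = smoothed[-1]
print(smoothed, forecast)
```

A larger `alpha` reacts faster to recent values; a smaller one smooths more heavily.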
What is a hash table?
In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array. It uses a hash function to compute an index into an array of slots, from which the desired value can be fetched.
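A toy version makes the mechanics visible. This sketch resolves collisions by chaining (each slot holds a list of key/value pairs), which is one common strategy among several; it uses a fixed number of slots and never resizes, so it is for illustration only.

```python
class HashTable:
    """Minimal hash table using separate chaining and a fixed slot count."""

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        # The hash function maps the key to a slot index.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("apple", 3)
table.put("pear", 5)
table.put("apple", 4)  # replaces the earlier value
print(table.get("apple"), table.get("pear"))
```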
What is imputation, and what are some techniques?
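During imputation, missing data is replaced with substituted values. Techniques include mean imputation (the mean of the observed values), regression imputation (values predicted by a regression), and stochastic regression imputation (the regression prediction plus a random variance term).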
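Two of these techniques can be sketched briefly. Both functions below mark missing values with `None`; `regression_impute` assumes a single predictor and an ordinary least-squares line, which is a simplification chosen for the example. The function names and data are hypothetical.

```python
def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def regression_impute(xs, ys):
    """Replace missing ys with predictions from a least-squares line fitted on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        / sum((x - mean_x) ** 2 for x, _ in pairs)
    )
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

filled_mean = mean_impute([4.0, None, 6.0, None, 8.0])
filled_reg = regression_impute([1, 2, 3, 4], [2.0, 4.0, None, 8.0])
print(filled_mean)
print(filled_reg)
```

Mean imputation ignores relationships between variables, while regression imputation uses them, at the cost of understating the natural spread of the data.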
What is an n-gram?
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram model is a type of probabilistic language model that predicts the next item in such a sequence in the form of an (n-1)-order Markov model.
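A short sketch shows both ideas: extracting n-grams from a token list, and using bigram counts to guess the most likely next word. The helper names `ngrams` and `next_word_counts` and the sample sentence are made up for the example, and counting raw frequencies is the simplest possible estimate (no smoothing).

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_counts(tokens):
    """Count which word follows each word, using bigrams."""
    following = defaultdict(Counter)
    for first, second in ngrams(tokens, 2):
        following[first][second] += 1
    return following

tokens = "the cat sat on the mat the cat ran".split()
trigrams = ngrams(tokens, 3)
following = next_word_counts(tokens)
prediction = following["the"].most_common(1)[0][0]
print(trigrams[:2], prediction)
```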