Data Analyst Flashcards by Danielle Dale

What are the steps in an analytics project?

Problem definition, data exploration, data preparation, modeling, validation, and implementation and tracking

What is data cleansing?

Identifying and removing errors and inconsistencies from data to enhance the quality of the data

What are some of the best practices for data cleansing?

Sort the data by attributes
Work stepwise: improve the data with each step
Break large data sets down into smaller data sets to clean
Use find and replace or other functions within Excel
Look at summary statistics to spot inconsistencies

What is logistic regression?

A statistical method for modeling a dataset in which one or more independent variables determine an outcome. The outcome is categorical, typically binary (for example yes/no).
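
A minimal Python sketch of the idea, assuming a single independent variable, a binary outcome, and plain gradient descent. The function names and the study-hours data are made up for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and intercept b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b

# Hypothetical data: hours studied and whether the student passed (1) or not (0).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]

# Center the predictor on its mean so gradient descent behaves well.
mean_hours = sum(hours) / len(hours)
centered = [h - mean_hours for h in hours]

w, b = fit_logistic(centered, passed)
p_low = sigmoid(w * (1.0 - mean_hours) + b)   # probability of passing after 1 hour
p_high = sigmoid(w * (6.0 - mean_hours) + b)  # probability of passing after 6 hours
```

The fitted model returns a probability, which is then turned into a class by applying a cutoff such as 0.5.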

What is data mining vs. data profiling vs. data analysis?

Data profiling: Focuses on instance-level analysis of individual attributes. It provides information on attributes such as value range, discrete values and their frequency, occurrence of null values, data type, length, etc.
Data mining: Focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relationships holding between several attributes, etc.

Data mining vs. data analysis:
- Data mining is used to recognize patterns in stored data. Data analysis is used to order and organize raw data in a meaningful manner.
- Mining is performed on clean and well-documented data. Analysis involves data cleaning, so the data does not start out in a well-documented format.
- Results extracted from data mining are not easy to interpret. Results extracted from data analysis are easy to interpret.

Data mining is often used to identify patterns in stored data. It is mostly used for machine learning, where analysts recognize the patterns with the help of algorithms. Data analysis, by contrast, is used to gather insights from raw data, which has to be cleaned and organized before the analysis is performed.

Explain the KNN imputation method

In KNN imputation, a missing attribute value is imputed from the values of that attribute in the k records most similar to the record with the missing value. The similarity of two records is determined with a distance function.
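
A small Python sketch of KNN imputation, under some simplifying assumptions: missing values are marked with `None`, distance is Euclidean over the columns both records have observed, and the imputed value is the plain mean of the k nearest neighbors. The function name and data are illustrative only.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Compare only the columns that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows where column j is observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],   # third attribute is missing
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],   # far away, so it should not influence the result
]
filled = knn_impute(data, k=2)
```

Here the missing value is filled from the first and third records, which are the two closest to the incomplete one.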

What should be done with suspected or missing data?

-Prepare a validation report that gives information about all suspected data. It should include details such as the validation criteria that the data failed and the date and time of occurrence
-Experienced personnel should examine the suspicious data to determine its acceptability
-Invalid data should be assigned and replaced with a validation code
-To work on missing data, use the best analysis strategy, such as a deletion method, single imputation methods, model-based methods, etc.

What is the hierarchical clustering algorithm?

A hierarchical clustering algorithm combines or divides existing groups, creating a hierarchical structure that shows the order in which groups are divided or merged.
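
A rough Python sketch of the merging (agglomerative) variant, assuming one-dimensional points and single linkage, where the distance between two groups is the distance between their closest members. The function name is illustrative.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on 1-D points.

    Returns the merges in the order they happened, which is the
    hierarchy the card describes.
    """
    clusters = [[p] for p in points]  # start with every point in its own group
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

merges = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0])
```

The closest pair is merged first and the most isolated point joins last, so reading `merges` top to bottom gives the hierarchy from the leaves upward.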

What is the k-means algorithm?

K-means is a well-known partitioning method. Objects are classified as belonging to one of k groups, with k chosen a priori.

The k-means algorithm assumes that:
- The clusters are spherical: the data points in a cluster are centered around that cluster's centroid
- The variance/spread of the clusters is similar: each data point is assigned to the closest cluster
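
A minimal Python sketch of the usual assign-then-update loop, assuming 2-D points, squared Euclidean distance, and a naive initialization that takes the first k points as centroids. Real implementations choose starting centroids more carefully; the names and data here are illustrative.

```python
def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c in range(k):
            if clusters[c]:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(clusters[c]) for dim in zip(*clusters[c])
                )
    return centroids, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
```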

What are the key skills for a data analyst?

Database knowledge
Predictive analytics (statistical background, predictive modeling)
Big data knowledge (machine learning, unstructured data analytics)
Presentation skills (data visualization, insight presentation, report design)

What is MapReduce?

MapReduce is a framework for processing large data sets: it splits them into subsets, processes each subset on a different server, and then combines the results obtained on each.
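
A toy, single-process Python sketch of the pattern using a word count. In a real cluster each chunk would be mapped on a different server; here the "servers" are simulated by a list of strings, and all function names are illustrative.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one subset of the data."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

chunks = ["big data big", "data sets", "big sets"]  # one chunk per simulated server
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```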

What is time series analysis?

Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a process can be forecast by analyzing previous data with methods such as exponential smoothing and log-linear regression.
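
A short Python sketch of simple exponential smoothing, assuming the first observation seeds the smoothed series and the last smoothed value is used as the one-step-ahead forecast. The series and the `alpha` value are made up for illustration.

```python
def exponential_smoothing(series, alpha):
    """Each smoothed value is a weighted average of the new observation
    and the previous smoothed value; alpha controls how fast old data fades."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 14.0, 16.0]
smoothed = exponential_smoothing(series, alpha=0.5)
forecast = smoothed[-1]  # one-step-ahead forecast for the next period
```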

What is a hash table?

In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array. It uses a hash function to compute an index into an array of slots, from which the desired value can be fetched.
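
A bare-bones Python sketch of the structure, assuming a fixed number of slots and separate chaining (each slot holds a small list of key/value pairs) to handle collisions. The class is illustrative; Python's built-in `dict` already does this job in practice.

```python
class HashTable:
    def __init__(self, slots=8):
        self._slots = [[] for _ in range(slots)]

    def _index(self, key):
        # The hash function turns the key into an index into the slot array.
        return hash(key) % len(self._slots)

    def put(self, key, value):
        bucket = self._slots[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self._slots[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = HashTable()
table.put("alice", 30)
table.put("bob", 25)
table.put("alice", 31)  # replaces the earlier value for "alice"
```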

What is imputation, and what are some techniques?

Imputation replaces missing data with substituted values. Techniques include substituting the mean of the observed values, regression (expected values), or average regression variance.
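
As an illustration of the simplest of these techniques, here is a Python sketch of mean imputation, assuming missing entries are marked with `None`. The function name and values are made up.

```python
def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [20.0, None, 30.0, None, 40.0]
filled_ages = mean_impute(ages)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is one reason regression-based methods are often preferred.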

What is an n-gram?

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of an (n-1)-order Markov model.
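
A quick Python sketch of extracting n-grams, assuming the items are whitespace-separated words. The function name and sentence are illustrative.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```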

What are the criteria for a good data model?

It can be easily consumed
It should scale to large changes in data
It should provide predictable performance
It can adapt to changes in requirements

Can you mention a few problems that data analysts usually encounter while performing analysis?

Duplicate entries and spelling mistakes reduce data quality.
Extracting data from a poor source means spending a lot of time cleaning the data.
Data extracted from different sources may vary in representation, and combining it can cause delays while the variations are reconciled.
Incomplete data makes the analysis difficult to perform.

What is a normal distribution?

Commonly known as the bell curve or Gaussian distribution, the normal distribution describes how values are spread around their mean, with the spread measured by the standard deviation. Data is distributed around a central value without any bias to the left or right side, so the random variable follows a symmetrical bell-shaped curve.
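
A brief Python check of the symmetry, using `statistics.NormalDist` from the standard library (Python 3.8+). The mean of 100 and standard deviation of 15 are arbitrary example values.

```python
from statistics import NormalDist

dist = NormalDist(mu=100.0, sigma=15.0)

# The curve has the same height one standard deviation either side of the mean.
left = dist.pdf(85.0)
right = dist.pdf(115.0)

# Share of values expected within one standard deviation of the mean.
within_one_sd = dist.cdf(115.0) - dist.cdf(85.0)
```

The share within one standard deviation comes out at roughly 68%, whatever mean and standard deviation are chosen.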

What is A/B testing?

A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. Also known as split testing, it is an analytical method that estimates population parameters based on sample statistics. The test compares two versions of a web page by showing variants A and B to a similar number of visitors; the variant with the better conversion rate wins.

The goal of A/B testing is to identify whether a change to the web page affects the outcome being measured. For example, if you have spent a substantial amount of money on a banner ad, you can find out the return on investment, i.e. the click-through rate of the banner ad.
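
One common way to decide whether the difference in conversion rates is real is a two-proportion z-test. The Python sketch below assumes large samples and a two-sided test; the visitor and conversion counts are invented for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for B versus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical experiment: 1000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
```

A small p-value suggests the gap between the variants is unlikely to be due to chance alone.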

What is the alternative hypothesis?

To explain the alternative hypothesis, first explain the null hypothesis. The null hypothesis is a statement that is tested for possible rejection under the assumption that it is true; it usually states that the observed result is due to chance alone.

The alternative hypothesis is the statement contrary to the null hypothesis. It usually holds that the observations are the result of a real effect, together with some chance variation.

What are eigenvalues and eigenvectors?

Eigenvectors: Eigenvectors are used to understand linear transformations. In data analysis, they are usually calculated for a correlation or covariance matrix.

By definition, eigenvectors are the directions along which a specific linear transformation acts by flipping, compressing, or stretching.

Eigenvalues: An eigenvalue is the strength of the transformation, i.e. the factor by which stretching or compression occurs in the direction of the corresponding eigenvector.
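
A tiny Python check of the definition, using a hand-picked 2x2 symmetric matrix whose eigenvectors and eigenvalues are known in advance. The matrix is illustrative, not taken from real data.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

cov = [[2.0, 1.0],
       [1.0, 2.0]]

v1, lambda1 = [1.0, 1.0], 3.0   # direction that gets stretched by a factor of 3
v2, lambda2 = [1.0, -1.0], 1.0  # direction that is left at its original length

stretched = mat_vec(cov, v1)
unchanged = mat_vec(cov, v2)
```

Multiplying by the matrix only rescales each eigenvector by its eigenvalue; the direction itself does not change.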

What is the difference between 1-Sample T-test, and 2-Sample T-test?

T-tests are a type of hypothesis test used to compare means. Each test reduces your sample data to a single value, the t-value.

The 1-sample t-test compares the mean of a single sample against a known or hypothesized mean, while the 2-sample t-test determines whether the difference between the means of two samples is statistically significant or could have arisen purely by chance.
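
A short Python sketch of how the t-value is computed in each case, using only the standard library. The 2-sample version shown is the unequal-variance (Welch) form, and the sample values are invented. Turning a t-value into a p-value additionally needs the t distribution, which is left out here.

```python
import math
from statistics import mean, stdev, variance

def one_sample_t(sample, mu0):
    """t-value for a sample mean against a hypothesized mean mu0."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    return (mean(sample) - mu0) / standard_error

def two_sample_t(a, b):
    """t-value for the difference between two sample means (Welch form)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

sample_a = [5.0, 6.0, 7.0, 8.0, 9.0]
sample_b = [1.0, 2.0, 3.0, 4.0, 5.0]

t_one = one_sample_t(sample_a, mu0=7.0)  # sample mean equals mu0 here
t_two = two_sample_t(sample_a, sample_b)
```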

What are the different types of hypothesis testing?

T-test: Used when the population standard deviation is unknown and the sample size is comparatively small.
Chi-square test for independence: Used to find out the significance of the association between categorical variables in the population sample.
Analysis of variance (ANOVA): Used to analyze differences between the means of various groups. It is used similarly to a t-test, but for more than two groups.
Welch's t-test: Used to test for equality of means between two samples without assuming that their variances are equal.

What is the difference between variance and covariance?

Variance and covariance are two mathematical terms used frequently in statistics. Variance refers to how far numbers are spread out from the mean. Covariance, on the other hand, refers to how two random variables change together. Covariance is used to calculate the correlation between variables.
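
A small Python sketch of both quantities and of how covariance feeds into correlation, assuming sample (n - 1) formulas. The helper names and data are illustrative.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance: average squared distance from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def covariance(xs, ys):
    """Sample covariance: how xs and ys move together around their means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly twice xs, so perfectly correlated

var_x = variance(xs)
cov_xy = covariance(xs, ys)
corr_xy = correlation(xs, ys)
```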

You have developed a data model, but the user is having difficulty understanding how the model works and what valuable insights it can reveal. How will you explain it so that the user understands the purpose of the model?

Use layman's terms. The model is only as good as the user who is using it. Break it down step by step. Use examples to show how it works in different situations.

What are your communication strengths?

- Listening - Open mindedness - Providing and receiving feedback openly - Respect

How do you handle pressure and stress?

I am pretty easy going, and it takes a lot to stress me. I currently work two jobs and am going to school, but I find it easy to manage my time and make things happen. Exercising regularly and taking some time to clear my mind helps me keep my stress levels reduced.

Why do you want this job?

I am looking to enter this field (hence the degree program that I am pursuing), and this position would allow me to grow professionally. The day-to-day skills used in this role would give me so much experience, and I feel like I could help make a difference for our organization by researching business needs.

Why should we hire you?

I have a strong analytical background, and I am learning more techniques and skills in this field daily that I can apply towards this role. I am a quick learner. I have never entered a job where I had prior professional experience in the role, and I have learned and excelled quickly. I feel like I am ahead if I were to enter this role because I have an education in this field, so I would be able to start right away.

What are your greatest professional strengths?

Analytical skills, determination, patience, drive, technical skills learned in my degree program

What are your greatest weaknesses?

- I am very hard on myself, I don’t ever settle for sub-par work - I am still learning the technical skills required for this career. I am about 60% of the way through my program, but I will be done by the end of the year.

What is your greatest professional achievement?

Moving From L1->L2->L3 really quickly, outstanding new colleague award, TEACH award

Tell me about a challenge or conflict you've faced at work, and how you dealt with it.

Biggest challenge: Before the pre-chat survey changed, we switched from asking for email first to going for offline
Conflict: managers at the pool, leadership differences (girls shorts, inservice timings, etc.)

What is a time that you exercised leadership?

- As an L3, I have taken the lead on projects, most recently a strategy presentation. I am always happy to answer questions and provide feedback to my team - Working at the pool

What's a time you disagreed with a decision that was made at work?

triple concurrency

How would your boss and co-workers describe you?

- Hard working - Silent leader - Determined - Go-to for advice

What questions do you have?

1. What is a typical day like?
2. What are some examples of projects that I would be working on?
3. What skills and experiences are you looking for in an ideal candidate?
4. What attributes does a person need to have to be successful in this position?
5. What are some of the biggest challenges that a person would face in this position?
6. Would there be opportunities for growth within this position?
7. What kinds of metrics would I be evaluated with?
8. Who would I report to directly?

Data Analyst Flashcards

(37 cards)