DS1 Flashcards

Question 1

Q

An insurer’s business need for data obtained through statistical plans is mainly concerned with

Select one:
A. The cost and pricing of insurance.
B. Market conduct.
C. Regulatory requirements.
D. Financial solvency.

Answer

A

A. The cost and pricing of insurance.

Question 2

Q

A data set consists of heights for third-, fourth-, and fifth-grade students. The fitted value for each grade is the mean height. Which one of the following approaches best describes how to evaluate whether grade distributions differ only in location?

Select one:
A. Plot the fitted heights for one grade on the vertical scale and the fitted heights for all grades on the horizontal scale to evaluate the pattern of residuals.
B. Plot the quantiles of the residuals for one grade on the vertical scale and the fitted heights for all grades on the horizontal scale to evaluate the pattern of residuals.
C. Plot the quantiles of the residuals for one grade on the vertical scale and the quantiles of the residuals for all grades on the horizontal scale to evaluate the pattern of residuals.
D. Plot the fitted heights for one grade on the vertical scale and the quantiles of the residuals for all grades on the horizontal scale to evaluate the pattern of residuals.

Answer

A

C. Plot the quantiles of the residuals for one grade on the vertical scale and the quantiles of the residuals for all grades on the horizontal scale to evaluate the pattern of residuals.

Question 3

Q

An independent state bureau is examining data in a unit statistical plan (USP). The USP indicates the underwriting experience (premiums and losses) collected for which type of insurance?

Select one:
A. General liability
B. Workers compensation
C. Surety
D. Personal auto

Answer

A

B. Workers compensation

Question 4

Q

Which statement about geoms in ggplot is true?

Select one:
A. The default geom smoothing method is loess.
B. Mapping colors can be continuous or categorical.
C. An object with an alpha of 1 will be completely transparent.
D. You must add points to your plot before adding a smoothed line.

Answer

A

B. Mapping colors can be continuous or categorical.

Question 5

Q

In Healy, there is a case study related to a slide created to evaluate Marissa Mayer’s performance as CEO of Yahoo. What is the biggest problem noted about this slide in the text?

Select one:
A. Dual axes can be scaled to misrepresent the association in the variables.
B. For this topic it is not appropriate to have time on the x axis.
C. The color theme doesn’t align with best practices for preattentive processing.
D. The overall message of the slide is unclear.

Answer

A

D. The overall message of the slide is unclear.

Question 6

Q

When using a power transformation to adjust non-normal variables, the effectiveness of the transformation depends on the selection of an appropriate T parameter. Which one of the following statements regarding the selection of the T parameter is true?

Select one:
A. If the ratio of the largest observation to the smallest observation of a data set is very close to 1, power transformations with T from -1 to 1 have a large effect.
B. For data sets with zeroes, a power transformation with a T parameter less than zero will be most effective.
C. A trial and error method can be used to identify the value of the T parameter that will be most effective.
D. For a highly skewed distribution, a power transformation with the parameter T = 1 will be most effective.

Answer

A

C. A trial and error method can be used to identify the value of the T parameter that will be most effective.
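
A minimal sketch of the trial-and-error idea in answer C, assuming positive data and using sample skewness as the yardstick for "most effective" (that yardstick, the data, the candidate grid, and the function names are all illustrative assumptions, not from the card). T = 0 is treated as the log transform, which is the usual convention.

```python
import math
from statistics import mean

def power_transform(values, t):
    """Apply x**t, with t == 0 treated as the natural log (values must be positive)."""
    if t == 0:
        return [math.log(v) for v in values]
    return [v ** t for v in values]

def skewness(values):
    """Moment-based skewness; 0 means symmetric, positive means a long right tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data (for example, claim sizes).
data = [1, 2, 2, 3, 4, 6, 9, 15, 40, 120]

# Try each candidate T and keep the one whose result is closest to symmetric.
candidates = [-1, -0.5, 0, 0.5, 1]
results = {t: skewness(power_transform(data, t)) for t in candidates}
best_t = min(candidates, key=lambda t: abs(results[t]))
```

In practice the choice would also be checked visually, for example with a normal q-q plot of the transformed values.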

Question 7

Q

Robust estimation techniques are valuable for visualizing non-normal data. To assess whether the residuals for different groups of a non-normal data set may be pooled, distributions of the spread-standardized residuals may be graphed by normal q-q plots. Which one of the following descriptions correctly defines the spread-standardized residual?

Select one:
A. The difference between a transformed observation and its group median, divided by its group standard deviation
B. The difference between a transformed observation and its group median
C. The difference between a transformed observation and its group mean, divided by its group standard deviation
D. The difference between a transformed observation and its group median, divided by its group mean absolute deviation

Answer

A

D. The difference between a transformed observation and its group median, divided by its group mean absolute deviation

Question 8

Q

A data set of observations quantifying mobile phone battery life is skewed toward large values. It is most likely that

Select one:
A. The values on a quantile plot are symmetric.
B. The quantile plot displays a convex pattern.
C. The distribution is well-approximated by the normal distribution.
D. The median and the mean measure the same aspect of the distribution.

Answer

A

B. The quantile plot displays a convex pattern.
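
A small sketch of why answer B holds, assuming an exponential distribution as a stand-in for "skewed toward large values" (the distribution and the f-value convention are assumptions for illustration). On a quantile plot the slope between neighboring points keeps increasing, which is what a convex pattern means.

```python
import math

n = 9
# f-values for n sorted observations, using the (i - 0.5) / n convention.
f = [(i - 0.5) / n for i in range(1, n + 1)]

# Theoretical quantiles of a unit exponential distribution (right-skewed).
q = [-math.log(1 - fi) for fi in f]

# Slope between each pair of neighboring points on the quantile plot.
slopes = [(q[i + 1] - q[i]) / (f[i + 1] - f[i]) for i in range(n - 1)]

# Convex pattern: every slope is larger than the one before it.
is_convex = all(later > earlier for earlier, later in zip(slopes, slopes[1:]))
```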

Question 9

Q

Quantiles are essential to visualizing distributions. Which one of the following statements is true of quantiles?

Select one:
A. The precise form of fi is important.
B. A fraction f of the data is greater than q(f).
C. No explicit rule is needed to compute q(f).
D. The f-values provide a standard for comparison.

Answer

A

D. The f-values provide a standard for comparison.
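
A minimal sketch of f-values and q(f), assuming the f_i = (i - 0.5) / n convention and linear interpolation between sorted observations (both are assumptions; the card itself notes that the precise form of f_i is not what matters). The heights are made up.

```python
def f_values(n):
    """f-value attached to each of n sorted observations: (i - 0.5) / n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]

def quantile(sorted_data, f):
    """q(f) by linear interpolation between the sorted observations."""
    n = len(sorted_data)
    fs = f_values(n)
    if f <= fs[0]:
        return sorted_data[0]
    if f >= fs[-1]:
        return sorted_data[-1]
    for k in range(1, n):
        if f <= fs[k]:
            # Weight of the upper neighbor between f-values k-1 and k.
            w = (f - fs[k - 1]) / (fs[k] - fs[k - 1])
            return sorted_data[k - 1] + w * (sorted_data[k] - sorted_data[k - 1])

heights = sorted([50, 52, 53, 55, 58, 60])  # hypothetical heights
median = quantile(heights, 0.5)
```

Because two data sets of different sizes can both be evaluated at the same f, the f-values give the common standard for comparison that answer D refers to.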

Question 10

Q

The Review Process includes which one of the following activities?

Select one:
A. Rank results with respect to business success criteria.
B. Identify misleading steps.
C. Determine deployment strategy.
D. Select best model.

Answer

A

B. Identify misleading steps.

Question 11

Q

Which one of the following categories of data quality measures how well data represents true values and the business information being analyzed?

Select one:
A. Accuracy
B. Reasonability
C. Validity
D. Timeliness

Answer

A

A. Accuracy

Question 12

Q

Which one of the following is true in regard to using analytic tools to identify atypical values for a particular variable?

Select one:
A. A common formula for standardizing a variable is to subtract the mean and multiply by the standard deviation.
B. Variance and standard deviation are scale dependent and increase as the scale of a variable increases without the relative variability increasing.
C. The rule of thumb that values greater or less than three standard deviations from the mean is particularly applicable for heavy-tailed insurance data.
D. An unusually narrow range (given the number of values) or few extreme minimum or maximum values will suggest the presence of outliers.

Answer

A

B. Variance and standard deviation are scale dependent and increase as the scale of a variable increases without the relative variability increasing.
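
A short sketch of answer B, using made-up heights. Changing the unit from meters to centimeters multiplies the standard deviation by 100 even though the relative variability is unchanged. The usual standardization subtracts the mean and divides by the standard deviation, which removes the scale.

```python
from statistics import mean, pstdev

def standardize(values):
    """z-score: subtract the mean, then divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

meters = [1.5, 1.6, 1.7, 1.8, 1.9]           # hypothetical heights
centimeters = [v * 100 for v in meters]      # same heights, larger scale

sd_m = pstdev(meters)
sd_cm = pstdev(centimeters)                  # about 100 times sd_m

# Coefficient of variation (sd / mean) is the same on both scales.
cv_m = sd_m / mean(meters)
cv_cm = sd_cm / mean(centimeters)

# Standardized values are also the same on both scales.
z_m = standardize(meters)
z_cm = standardize(centimeters)
```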

Question 13

Q

A stakeholder analysis is undertaken by an insurer’s data governance committee because

Select one:
A. Data is received on different bases and broken down by several variables.
B. Various departments have similar demands for types and formats of collected data.

C. Stakeholders come to a consensus on their expectations of how data should be handled.
D. Mergers with legacy systems are considered essential to users of insurance data.

Answer

A

A. Data is received on different bases and broken down by several variables.

Question 14

Q

Web scraping transforms

Select one:
A. Small amounts of data from the internet.
B. Structured data into relational databases.
C. Unstructured data into structured data.
D. Internet data into a library.

Answer

A

C. Unstructured data into structured data.
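
A minimal sketch of the idea in answer C, using only the Python standard library. The HTML snippet, the column names, and the TableParser class are hypothetical; a real scraper would fetch the page over the network and handle messier markup.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# Hypothetical page content: raw markup with no usable structure yet.
page = """
<html><body><table>
  <tr><th>State</th><th>Premium</th></tr>
  <tr><td>NC</td><td>1200</td></tr>
  <tr><td>VA</td><td>950</td></tr>
</table></body></html>
"""

parser = TableParser()
parser.feed(page)

# First row is the header; the rest become structured records.
header, *records = parser.rows
structured = [dict(zip(header, record)) for record in records]
```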

Question 15

Q

The 5 C’s are

Answer

A

Consent
Clarity
Consistency
Control
Consequences

Question 16

Q

CRISP DM steps

Answer

A

Business Understanding
Data Understanding
Data Prep
Modelling
Evaluation
Deployment

Question 17

Q

Claim velocity vs acceleration

Answer

A

velocity is # of changes
acceleration is rate of changes

Question 18

Q

confusion matrix

Answer

A

shows results of a model
columns are predicted values
rows are actual values
shows precision (share of predicted yes that are actually yes)
specificity and sensitivity
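
A small sketch of the measures named above, computed from made-up actual and predicted labels (1 = yes, 0 = no). The function and variable names are illustrative.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 means yes."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # actual yes, predicted yes
        elif not a and p:
            fp += 1          # actual no, predicted yes
        elif a and not p:
            fn += 1          # actual yes, predicted no
        else:
            tn += 1          # actual no, predicted no
    return tp, fp, fn, tn

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)      # of predicted yes, share that are actually yes
sensitivity = tp / (tp + fn)    # of actual yes, share predicted yes
specificity = tn / (tn + fp)    # of actual no, share predicted no
```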

Question 19

Q

kernel density

Answer

A

smooths data and gives weights to points
pick a bandwidth (like a bin width) and a weighting function
Gaussian gives more weight to close points
uniform is standard weight all around
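
A minimal sketch of a kernel density estimate with the two weighting functions mentioned above. The data, the bandwidth of 0.5, and the function names are assumptions for illustration.

```python
import math

def gaussian_kernel(u):
    """Weight falls off smoothly with distance, so close points count more."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def uniform_kernel(u):
    """Same weight for every point within one bandwidth, zero outside."""
    return 0.5 if abs(u) <= 1 else 0.0

def kde(x, data, bandwidth, kernel):
    """Density estimate at x: average kernel weight, scaled by the bandwidth."""
    n = len(data)
    return sum(kernel((x - d) / bandwidth) for d in data) / (n * bandwidth)

data = [1.0, 1.2, 1.9, 2.1, 2.3, 4.0]   # hypothetical observations

density_gauss = kde(2.0, data, 0.5, gaussian_kernel)
density_unif = kde(2.0, data, 0.5, uniform_kernel)
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual points more closely.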

Question 20

Q

variance

Answer

A

avg squared distance from the mean
(the sq root of this is standard deviation)
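
A worked example of the definition above, on made-up numbers. This is the population version, dividing by n; the sample version divides by n - 1.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]

m = sum(data) / len(data)                               # mean
variance = sum((x - m) ** 2 for x in data) / len(data)  # avg squared distance
std_dev = math.sqrt(variance)                           # square root of variance
```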

Question 21

Q

kurtosis

Answer

A

measures how heavy-tailed the data is. Normal is 3, more than that is heavy-tailed
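
A small sketch of the calculation, using the moment-based definition (fourth central moment divided by the squared variance), under which a normal distribution scores 3. Both data sets are made up.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2

light_tails = [-1, 1] * 5            # all points the same distance from the mean
heavy_tails = [0] * 18 + [-10, 10]   # mostly at the mean, two extreme values

k_light = kurtosis(light_tails)      # below 3
k_heavy = kurtosis(heavy_tails)      # above 3
```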

Question 22

Q

reg ex

Answer

A

^ is start of string
$ is end of string
. is wildcard
m// gives true or false on match
s// replaces match
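
The m// and s// notation above is Perl-style. A rough Python equivalent, using the standard re module, is sketched below; the policy number is a made-up example.

```python
import re

policy = "WC-12345"   # hypothetical policy number

# Like m//: re.search returns a match object or None, so bool() gives true/false.
starts_with_wc = bool(re.search(r"^WC", policy))    # ^ anchors the start
ends_with_digit = bool(re.search(r"\d$", policy))   # $ anchors the end
any_second_char = bool(re.search(r"W.-", policy))   # . matches any one character

# Like s//: re.sub returns a new string with the match replaced.
relabeled = re.sub(r"^WC", "GL", policy)
```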

Question 23

Q

NCCI
NAIC
CLUE

Answer

A

NCCI is work comp
NAIC is annual statement
CLUE is personal prop and auto

Question 24

Q

two fields required for prem and loss

Answer

A

Acc Date and Inception Date

Question 25

Q

ggplot fill vs color

Answer

A

fill is inside of shapes, color is only lines
if you set guides(fill = FALSE) then the fill legend goes away

Question 26

Q

Binning is

Select one:
A. A statistic used to evaluate relationships between categorical variables. A high value supports the hypothesis of a significant relationship.
B. When a predictive model fits not just patterns in the data, but also random fluctuations in that data that will not be present when the model is applied to another dataset.
C. A way of dealing with numeric variables when the relationship between the variable and the target variable is unknown or changes with the level of the other variable.
D. A set of mathematical operations, such as regression, classification trees, and clustering, that can be used to achieve a desired result.

Answer

A

C. A way of dealing with numeric variables when the relationship between the variable and the target variable is unknown or changes with the level of the other variable.
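
A minimal sketch of binning a numeric variable, assuming driver age as the variable and hypothetical cut points. Each age is mapped to a labeled band, so a model can fit a separate effect per band instead of assuming one fixed relationship.

```python
import bisect

# Hypothetical bin edges and labels for driver age.
edges = [25, 40, 60]
labels = ["under 25", "25-39", "40-59", "60+"]

def bin_age(age):
    """Return the label of the bin that the age falls into."""
    return labels[bisect.bisect_right(edges, age)]

ages = [19, 25, 39, 40, 59, 60, 73]
binned = [bin_age(a) for a in ages]
```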

Question 27

Q

Which one of the following steps undertaken by an analyst during a data review can be particularly helpful in detecting data anomalies?

Select one:
A. Review prior data
B. Perform exploratory analysis
C. Identify questionable data values
D. Determine data or metadata definitions

Answer

A

B. Perform exploratory analysis

Question 28

Q

Which one of the following is true regarding data quality and an insurer’s financial results?

Select one:
A. Data errors exist but rarely reach the point of directly affecting an insurer’s financial statement.
B. Improving data quality could free up actuarial resources for more value-producing assignments.
C. Actuaries report that even though they spend over half their time on data quality issues, errors still cause financial problems.
D. More than half of projects undertaken by actuaries are adversely affected by data quality issues.

Answer

A

D. More than half of projects undertaken by actuaries are adversely affected by data quality issues.

Question 29

Q

Which one of the following is an example of a piece of nominal data?

Select one:
A. Temperatures measured on the Celsius scale
B. An auto insured’s likelihood of being involved in an accident
C. The number of auto insureds in a particular region
D. A ratio scale measurement

Answer

A

B. An auto insured’s likelihood of being involved in an accident

Question 30

Q

Which one of the following is true regarding data quality?

Select one:
A. Capturing enough data to generate statistically significant results generally leads to quality data.
B. The “fitness” of data is an unchangeable standard regardless of its end users.
C. Having quality and accurate data from the start assures it will remain in that condition.
D. Fair and accurate insurance rates are unrelated to the quality of the data used in their development.

Answer

A

A. Capturing enough data to generate statistically significant results generally leads to quality data.

Question 31

Q

A data scientist decides to perform analysis of potential fraud by using the actual recordings of claimants’ statements rather than transcripts of the statements. Which one of the following explains why a data scientist would decide on this method?

Select one:
A. It is impossible to detect potential fraud from a statement without actually hearing the recording.
B. The data scientist would like to use computer technology to analyze voice patterns that might indicate lying.
C. It is simpler to analyze voice recordings than it is to analyze the text in transcripts.
D. Analysis of the recorded statements will better indicate how well the adjuster handled the process.

Answer

A

B. The data scientist would like to use computer technology to analyze voice patterns that might indicate lying.

Question 32

Q

Which one of the following is true concerning the approximate matching of strings to specifications in a query?

Select one:
A. Approximate matching results from searching for correctly spelled, but inaccurate, entries in a database.
B. Approximate matching of strings is necessary to compensate for possible typographical errors.
C. Data scientists are unable to ignore misspellings in the data they are searching and analyzing.
D. Search results that are inappropriate for use by data scientists are referred to as “fuzzy matches.”

Answer

A

B. Approximate matching of strings is necessary to compensate for possible typographical errors.
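
A small sketch of approximate matching with Python's standard difflib module. The list of valid values, the misspelled entry, and the 0.8 cutoff are assumptions for illustration.

```python
from difflib import get_close_matches

valid_lines = [
    "general liability",
    "workers compensation",
    "personal auto",
    "surety",
]

entry = "workers compensaton"   # typo: missing the second "i"

# An exact lookup fails because of the typo.
exact_hit = entry in valid_lines

# Approximate matching returns the closest valid value above the cutoff.
fuzzy_hits = get_close_matches(entry, valid_lines, n=1, cutoff=0.8)
```

A higher cutoff demands a closer match and returns fewer candidates; a lower one tolerates more errors but risks false matches.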

Question 33

Q

Durham Insurance is a consolidation of many small carriers. A challenging data issue is that each of the carriers had its data stored and processed on its proprietary legacy platforms. Having a dedicated data governance committee will provide which one of the following advantages for Durham?

Select one:
A. Stakeholder representation
B. Consolidated deployment of resources
C. Timely and integrated systems
D. Coordinated and unified data management

Answer

A

C. Timely and integrated systems

Question 34

Q