Exam potential questions Flashcards
Why do the models which use industrial data cannot be very strong?
Because replicative data points are not typical, and cannot be used for data models.
Define MLR, the goal from using it, and what does it assume?
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
The goal of multiple linear regression (MLR) is to model the linear relationship between the explanatory (independent) variables and response (dependent) variable.
MLR assumes the following:
1- There is a linear relationship between the dependent variables and the independent variables.
2- The independent variables are not too highly correlated with each other.
Define PLS and determine when it is useful?
Partial least squares regression (PLS regression) is a statistical method that is related to regression:
It finds a linear regression model by projecting predicted variables (predictors) and observable variables. PLS is used to find the fundamental relations between two matrices (X and Y).
Partial least squares (PLS) regression is a technique that reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components, instead of the original data.
PLS is useful when:
1) Predictors are highly collinear,
2) you have more predictors than observations,
3) ordinary least-squares regression either produces coefficients with high standard errors or completely fails.
Define Anova, and when is it used?
ANOVA determines IF the difference between samples is by chance or due to systematic treatment effect.
ANOVA is used in the analysis of comparative experiments, those in which only the difference in outcomes is of interest. The statistical significance of the experiment is determined by a ratio of two variances. This ratio is independent of several possible alterations to the experimental observations: Adding a constant to all observations does not alter significance. Multiplying all observations by a constant does not alter significance. So, ANOVA statistical significance result is independent of constant bias and scaling errors as well as the units used in expressing observations
what are the steps that should be followed to check model suitability?
To check model suitability, the following steps should be taken:
1) Evaluation of raw data: data scattering, any replicate points in process data
2) Regression analysis and model interpretation: R2 and Q2 values, coefficient plot, confidence limits of weight factors, …
3) Use of regression model: response contour plot, Equation fit for other models
what is the importance of raw data quality?
The raw data quality is essential in getting a reliable data model which can explain process disturbances, or which can be used in process performance at different conditions.
How can low-quality data be differentiated from better quality data?
There are a couple of main characteristics for which low-quality data can be differentiated from better quality data:
1) data should be dispersed well spatially. It means that amount of data is sufficient (> 20-25 data points to get distributed data). Data should not be in a small region, and it is recommended that it is uniformly dispersed.
2) If replicate points are available, the responses should be within a reasonable limit as it is related to measurement accuracy.
what is the use of the predicted vs. observed plot?
It is useful to find outliers and to determine the data dispersivity;
what is the use of the Residuals n-plot?
The Residuals n-plot is also useful to find out outliers in the data.
Describe a good model fit where an experimental plan can be conducted and when process data is used. why the second one is lower?
The good model fit has R2 > 0.9 and Q2 > 0.7 especially in cases where an experimental plan can be conducted. However, when process data is used the R2 and Q2 values can be weaker, R2 > 0.8 and Q2 > 0.6. An obvious reason is that replicate points may not be available, and the variable values can be concentrated within too tight specifications.
What is the solution to increasing too narrow process window?
1) To Include laboratory test results together with actual process data
2) To Perform intentional out-of-process variable specification data points (confirm hypothesis)
3) To Use process development data history along with actual process data
Define the Coefficient plot?
It is an important tool in defining the most meaningful process parameters. The parameter weight is primarily used for the estimation of parameter significance to the response.
what does the confidence interval of weight factors (error bars) determine?
It determines the parameter significance as well.
What does the contour plot indicate? and what does its shape describe?
It indicates the goodness of the regression model. The contour plot shape not only describes the process parameter significance but also is informative to model reflection to reality.
What are the uses of regression modeling?
can be used for several purposes:
1) Experimental plan results interpretation
- In statistical analysis software the Design of Experiments (DOE) is supported. It is used for screening significant process parameters and initial process conditions, and also for optimizing best process performances.
- The models are relatively weak because physical phenomena are not included in these models. It is, however, possible to some extent when process variables are transformed to e.g. dimensionless numbers: Reynolds-, Nusselt-, Froude- or Damköhler-numbers, dimensionless length, dimensionless time.
2) Trouble-shooting process
- Solving trouble-shooting problems is relatively easy to perform but there are the following limitations to data which will be revealed during data-analysis: the lack of replications causes difficulties to evaluate measurement & sampling-based errors; the model may suffer from a narrow parameter window which weakens the model predictive capabilities in trouble-shooting.
3) Creation of a simplistic mathematical model for simulation of process performance.
- In the case of large data amounts it is relatively fast to set-up a set of linear equation-based model. It is noted that the calculated weight factors are easy to pick-up from results and include them into the selected set of equations.
- The acquired model can be then used for e.g. predicting process performance using different process run scenarios.