Statistical Methods for Data Science DATA7202

There are six questions below. For questions 2 and 5, present your analysis of the data using Python, MATLAB, or R as a short report that clearly answers the objectives, justifies the modeling (and hence statistical analysis) choices you make, and discusses your conclusions. Do not include excessive amounts of output in your reports, though you can append additional output (with explanation) to your report as an appendix.
1. (10%) Consider the following variant of the cross-validation procedure.
(i) Using the available data, find a subset of “good” predictors that show correlation with the response variable.
(ii) Using these predictors, construct a model (for regression or classification).
(iii) Use cross-validation to estimate the model prediction error.
Is this a good method? Do you expect to obtain the true prediction error? Explain your answer.
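One way to build intuition for this question is a small simulation on pure-noise data, where no classifier can truly do better than a 50% error rate. The sketch below is illustrative only: the sample sizes, the nearest-centroid classifier, and every function name are assumptions chosen for brevity and are not part of the assignment. It compares cross-validation run after the predictors were screened on all the data, as in steps (i) to (iii), with cross-validation that repeats the screening inside each fold.

```python
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def screen(X, y, rows, m):
    """Indices of the m columns most correlated with y, using only `rows`."""
    ys = [y[i] for i in rows]
    scores = [abs(corr([X[i][j] for i in rows], ys)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:m]

def predict(X, y, train, cols, i):
    """Nearest-centroid prediction for row i, fitted on the training rows."""
    best, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        centroid = [sum(X[r][j] for r in members) / len(members) for j in cols]
        dist = sum((X[i][j] - c) ** 2 for j, c in zip(cols, centroid))
        if dist < best_dist:
            best, best_dist = label, dist
    return best

def cv_error(X, y, m, k, screen_inside_folds):
    """k-fold misclassification rate, screening either once or per fold."""
    n = len(y)
    folds = [list(range(f, n, k)) for f in range(k)]
    cols = None if screen_inside_folds else screen(X, y, range(n), m)
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        fold_cols = screen(X, y, train, m) if screen_inside_folds else cols
        errors += sum(predict(X, y, train, fold_cols, i) != y[i] for i in fold)
    return errors / n

def simulate(n=50, p=2000, m=10, k=5, seed=1):
    """Return (error with screening on all data, error with screening per fold)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure noise
    y = [i % 2 for i in range(n)]  # balanced labels, unrelated to X
    rng.shuffle(y)
    return cv_error(X, y, m, k, False), cv_error(X, y, m, k, True)
```

Calling `simulate()` returns the two error estimates side by side. Since the labels are independent of the predictors, comparing each estimate with 0.5 indicates how far it can be trusted.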
2. Consider the Hitters data-set (given in Hitters.csv). Our objective is to predict a hitter’s salary via linear models.
(a) (5%) Load the data-set and replace all categorical values with numbers. (You can use the LabelEncoder object in Python.)
(b) (5%) Fit linear regression and report the 10-Fold Cross-Validation mean squared error.
(c) (10%) Apply Principal Component Regression (PCR) with all possible numbers of principal components. Using 10-Fold Cross-Validation, plot the mean squared error as a function of the number of components and determine the optimal number of components.
(d) (10%) Apply the Lasso method and plot the 10-Fold Cross-Validation mean squared error as a function of λ. Determine the best λ and the corresponding mean squared error.
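The 10-fold cross-validation loop used in parts (b) to (d) can be sketched without any libraries. This is a minimal illustration, not a solution: the function names are hypothetical, and the one-predictor least-squares fit is a stand-in for the full models, which in practice would more likely be fitted with scikit-learn (for example `LinearRegression` together with `KFold` or `cross_val_score`).

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def cv_mse(xs, ys, k=10, seed=0):
    """Average over folds of the held-out mean squared error."""
    fold_mses = []
    for test in kfold_indices(len(xs), k, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        fold_mses.append(sum((ys[i] - (a + b * xs[i])) ** 2 for i in test) / len(test))
    return sum(fold_mses) / k
```

The same loop applies to PCR and the Lasso: only the fitting step changes, and the tuning parameter (number of components or λ) is varied in an outer loop while the folds are held fixed.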
3. (10%) Specify a method to generate a random variable from the discrete pmf
x = 0,1,2,…,n,
0 otherwise.
Discuss the time complexity of your method in terms of n, e.g. is it O(n), O(ln(n)), etc. Give a short explanation (at most 2 sentences) for your answer.
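As a point of reference, the inverse-transform method works for any pmf on {0, 1, …, n} supplied as a table of probabilities. The sketch below is a generic illustration under that assumption: the function name is hypothetical, and the probabilities in the usage note are placeholders, not the pmf of this question. Building the cumulative table costs O(n) once; each draw then uses binary search, which costs O(ln(n)).

```python
import bisect
import itertools
import random

def make_sampler(pmf, rng=None):
    """Return a function that draws from the pmf p(0), ..., p(n) given as a list."""
    rng = rng or random.Random()
    cdf = list(itertools.accumulate(pmf))  # O(n) preprocessing, done once

    def draw():
        u = rng.random()  # uniform on [0, 1)
        # First index whose cumulative probability exceeds u; the clamp
        # guards against the table summing to slightly less than 1.
        return min(bisect.bisect_right(cdf, u), len(cdf) - 1)

    return draw
```

For example, `draw = make_sampler([0.2, 0.5, 0.3])` gives a function whose calls return 0, 1, or 2 with those (placeholder) probabilities. Replacing the binary search with a linear scan of the table would make each draw O(n) instead.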
4. (5%) Let $f(x) = 3(1 - x)^2$, $x \in [0,1]$,
be a pdf. Show how to generate a random variable $X \sim f(x)$.
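One possible approach is the inverse-transform method. Integrating the pdf 3(1 − x)^2 over [0, x] gives the cdf F(x) = 1 − (1 − x)^3 on [0,1], and solving U = F(X) for a uniform U yields X = 1 − (1 − U)^(1/3). The sketch below assumes this method; the function name is hypothetical.

```python
import random

def sample_x(rng=random):
    """Draw X with pdf 3(1 - x)^2 on [0, 1] by inverting the cdf."""
    u = rng.random()  # uniform on [0, 1)
    return 1.0 - (1.0 - u) ** (1.0 / 3.0)
```

Because 1 − U is itself uniform on (0,1], the shorter form 1 − U^(1/3) has the same distribution.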
5. (5%) Consider a random variable
$$Y = 3X + X^2 - 200\cos(X),$$
where $X \sim f(x) = 3(1 - x)^2$, $x \in [0,1]$. Write a Crude Monte Carlo algorithm for the estimation of
$$\ell = \mathbb{E}[Y],$$
using a sample size of $N = 10000$. Deliver the 95% confidence interval.
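A Crude Monte Carlo estimate is the sample mean of N independent copies of Y, with an approximate 95% confidence interval of the mean plus or minus 1.96 times the sample standard deviation divided by √N. The sketch below is one way this might look, assuming X is drawn by inverting the cdf 1 − (1 − x)^3; the function names and the fixed seed are assumptions made for reproducibility.

```python
import math
import random

def sample_x(rng):
    """Draw X with pdf 3(1 - x)^2 on [0, 1] by inverse transform."""
    return 1.0 - (1.0 - rng.random()) ** (1.0 / 3.0)

def crude_monte_carlo(n=10_000, seed=0):
    """Return the CMC estimate of E[Y] and a normal-approximation 95% CI."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        x = sample_x(rng)
        ys.append(3 * x + x ** 2 - 200 * math.cos(x))
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mean, (mean - half_width, mean + half_width)
```

The interval relies on the central limit theorem, which is reasonable here since Y is bounded and N is large.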
6. Answer the following questions.
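(a) (10%) Let $X$ be a random variable and consider the estimation of the probability $\ell_\gamma = \mathbb{P}(X > \gamma)$ for some large $\gamma \in \mathbb{R}$. The Crude Monte Carlo (CMC) estimator of $\ell_\gamma$ is
$$\hat{\ell}_\gamma = \frac{1}{N} \sum_{i=1}^{N} Z_i, \tag{1}$$
where $Z_i = 1\{X_i > \gamma\}$, $i = 1,\ldots,N$, are indicator random variables and $X_1,\ldots,X_N$ are iid copies of $X$. Find the squared coefficient of variation $\mathrm{CV}^2$ of $Z$. (Recall that $\mathrm{CV}^2 = \mathrm{Var}(Z)/(\mathbb{E}[Z])^2$.)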
(b) (10%) Find the relative error of the estimator in terms of $N$ and $\ell_\gamma$.
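(c) (20%) The estimator (1) of $\ell_\gamma = \mathbb{E}(Z)$ is said to be logarithmically efficient if
$$\lim_{\gamma \to \infty} \frac{\ln \mathbb{E}[Z^2]}{\ln \ell_\gamma^2} = 1.$$
Prove that the CMC estimator is not logarithmically efficient.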
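The answers to parts (a) and (b) can be sanity-checked numerically by computing the sample mean of the indicators together with an empirical relative error, that is, the estimated standard deviation of the estimator divided by the estimate. The sketch below is illustrative only: the choice X ∼ Exp(1), the value of γ, and the function name are assumptions, since the question leaves the distribution of X unspecified.

```python
import math
import random

def cmc_tail_probability(gamma, n, seed=0):
    """CMC estimate of P(X > gamma) for X ~ Exp(1) and its empirical relative error."""
    rng = random.Random(seed)
    z = [1.0 if rng.expovariate(1.0) > gamma else 0.0 for _ in range(n)]
    ell_hat = sum(z) / n  # sample mean of the indicators
    s2 = sum((zi - ell_hat) ** 2 for zi in z) / (n - 1)  # sample variance of Z
    rel_err = math.sqrt(s2 / n) / ell_hat if ell_hat > 0 else float("inf")
    return ell_hat, rel_err
```

Running it for increasing γ with N held fixed shows how the relative error behaves as the event becomes rarer, which is the behaviour part (c) asks about.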
