1. A critical first step in data mining is to perform initial exploratory data analysis. You have collected a data set of statistics of supporters of various sports teams. You have salary information for fans of the Red Sox and Mets: 10,000 fans each. What graphical technique could you use to compare the distributions of salaries for fans of the two teams? Explain what this might look like. Make sure your method allows for easy comparison of the mean or median of the distribution.
2. This question relates to the diagram shown below. Sketch a decision tree corresponding to the partition of the predictor space illustrated in the figure. The classes inside the boxes indicate the response. Use only binary splits and classify every record correctly. You can either represent your decision tree graphically or write it down as decision rules (“if X greater/less than Y then Z”).
3. Below is the output of a linear regression using data from the May 1985 Current Population Survey by the US Census Bureau. The variables used in the model are as follows:
• wage (in dollars per hour).
• education: Number of years of education.
• Age: Age in years.
• Sector: Factor with levels manufacturing (manufacturing or mining), construction, other.
For simplification, you can round the numbers to the nearest whole number.
(a) How can you explain the negative intercept? What does this tell you about the data?
(b) What hourly wage would you predict for a 45-year-old male with 12 years of education who is working in manufacturing? (You can round numbers to two digits after the decimal to simplify your calculations.)
(c) Given the regression results, how confident are you in the following statement: “Working in construction is very lucrative as people in that sector earn higher wages” – explain!
4. Amy built a model to predict fraud. Amy says “My model is awesome! It is 94% accurate, because it was correct 1005 times out of 1065 total cases!” Do you agree that this is an awesome model? Why or why not?
                   Fraud = Yes   Fraud = No
Prediction = Yes             5           50
Prediction = No             10         1000
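As a starting point for question 4, the table can be entered in R and Amy's accuracy figure recomputed from it. This is a minimal sketch: the object names (confusion, correct, total, accuracy) are illustrative assumptions and are not part of the question.

```r
# Confusion matrix from question 4.
# Rows are the model's predictions, columns are the actual fraud status.
confusion <- matrix(
  c(5,   50,
    10, 1000),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("Yes", "No"), Fraud = c("Yes", "No"))
)

# Accuracy = correctly classified cases (the diagonal) / all cases.
correct  <- sum(diag(confusion))
total    <- sum(confusion)
accuracy <- correct / total

print(confusion)
cat("Correct:", correct, "of", total, "\n")
cat("Accuracy:", round(accuracy, 3), "\n")
```

Other summaries can be derived from the same matrix by dividing individual cells by their row or column totals.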
5. This is a basic question about R.
Write the first six lines of a hypothetical external data file named data.csv that will be read without error by the following R command:
mydata <- read.table("data.csv", skip=3, header=FALSE, sep=",", colClasses=c("character", rep("numeric", 3)))
Lines of data.csv:
1:
2:
3:
4:
5:
6:
6. This is a basic question about R. We often use objects of class data.frame as data structures. Explain the key features of data.frame objects. What are some of their strengths and weaknesses?
7. This is a basic question about R. Which elements (i.e., rows and columns) of the mtcars data.frame does the following command return?
mtcars[mtcars$mpg < mean(mtcars$mpg), c("mpg", "hp")]
8. You have been hired as a security and data analyst for a company operating an online social media platform. You are tasked to work on a project to identify possible threats related to fake user accounts (so-called sybils). How can you get started on the project? Try to break it down using the six phases of the CRISP-DM process. Start your analysis by explaining briefly what the goal of each phase of CRISP-DM is. Use bulleted lists and short sentences to structure your response. Do not write paragraphs of prose text.
9. A data set is collected on music tracks: for each track, 30 variables are collected on qualities of the music, extracted from the audio files. Some of the songs are labeled as classical, jazz, country, rock, rap, or R&B. The goal is to label the remaining songs. Is this an example of a classification, regression, or clustering problem? Why?
10. An online business wants to build a model to predict future sales revenue of their customers. They collect information from existing customers: what browser they are using, what state they are from, and what products they have purchased in the past. Is this an example of a classification, regression, or clustering problem? Why?