Question 1 (65 marks)
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether or not a loan should be granted. This requires banks to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years.
Historical data (cs-training.csv) are provided on 150,000 borrowers. The following variables are available to you:
The goal of Question 1 is to build a model from the training dataset that banks can use to help make the best financial decisions about borrowers in the test dataset (cs-test.csv).
1.1 Carefully pre-process the dataset by considering the following activities:
• Exploratory data analysis.
• Missing value handling (if any). Marks will be deducted for simply replacing missing values with a single value; a proper study of the missing values is required.
• Outlier detection and treatment (if any). Marks will be deducted for simply eliminating or replacing outliers without justification; a proper study of the outliers is required.
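As an illustration of what a study of missing values and outliers might involve, the sketch below uses only the Python standard library. It checks whether missingness in a column is related to the target before any imputation is chosen, and flags outliers with Tukey's IQR fences. The `income` column, the toy rows and the helper names are hypothetical and are not taken from cs-training.csv; the IQR rule is just one of several defensible choices.

```python
from statistics import quantiles


def missing_report(rows, column, target):
    """Share of missing values in `column`, plus the target rate with and without them."""
    missing = [r for r in rows if r[column] is None]
    present = [r for r in rows if r[column] is not None]

    def target_rate(group):
        return sum(r[target] for r in group) / len(group) if group else None

    return {
        "missing_share": len(missing) / len(rows),
        "target_rate_missing": target_rate(missing),
        "target_rate_present": target_rate(present),
    }


def iqr_bounds(values, k=1.5):
    """Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up toy data: `income` is a hypothetical column, None marks a missing value.
rows = [
    {"income": 2500, "SeriousDlqin2yrs": 0},
    {"income": None, "SeriousDlqin2yrs": 1},
    {"income": 3100, "SeriousDlqin2yrs": 0},
    {"income": 2900, "SeriousDlqin2yrs": 0},
    {"income": None, "SeriousDlqin2yrs": 0},
    {"income": 3300, "SeriousDlqin2yrs": 1},
    {"income": 2700, "SeriousDlqin2yrs": 0},
    {"income": 2800, "SeriousDlqin2yrs": 0},
    {"income": 3000, "SeriousDlqin2yrs": 0},
    {"income": 250000, "SeriousDlqin2yrs": 0},
]

report = missing_report(rows, "income", "SeriousDlqin2yrs")
observed = [r["income"] for r in rows if r["income"] is not None]
low, high = iqr_bounds(observed)
outliers = [v for v in observed if v < low or v > high]

print(report)
print("fences:", low, high, "flagged:", outliers)
```

If the target rate differs clearly between rows with and without a value, the missingness itself carries information, which argues against silently filling in a single constant. Likewise, a flagged value is only a candidate for treatment; whether it is a data error or a genuine extreme case still needs to be justified.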
1.2 Build a credit scoring model in which SeriousDlqin2yrs is used as the target (default) and report the following:
• What method do you use?
• Why do you use this method?
• Discuss your results.
• The most important variables
• The impact of the variables on the target
• The performance of the model. Use various performance metrics and discuss their relationship if any.
• What do banks win and lose by doing this?
In terms of software, use SAS Enterprise Miner or anything else (e.g., Python, R and so on). Carefully report the various steps of your methodology and discuss your results in a rigorous way!
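To make the request for "various performance metrics" and their relationship concrete, here is a minimal standard-library sketch that computes accuracy, precision, recall and AUC from predicted default probabilities. The labels, scores and thresholds are made up for illustration, and the function names are hypothetical; in practice a library such as scikit-learn would normally be used instead.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tp, fp, tn, fn) when scores >= threshold are predicted as default."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def auc(y_true, scores):
    """Probability that a random defaulter scores above a random non-defaulter (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summary(y_true, scores, threshold):
    """Accuracy, precision and recall at one threshold (None if undefined)."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Made-up, imbalanced toy data: 1 = default, 0 = no default.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.55, 0.90]

at_050 = summary(y_true, scores, threshold=0.5)
at_070 = summary(y_true, scores, threshold=0.7)
print("threshold 0.5:", at_050)
print("threshold 0.7:", at_070)
print("AUC:", auc(y_true, scores))
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the one false positive but misses one defaulter, so precision rises while recall falls. AUC does not depend on the threshold. Note also that predicting "no default" for everyone would already reach 80% accuracy here, which is why accuracy alone says little on imbalanced data.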
Question 2 (35 marks)
Find an academic or business paper published in 2019 or later discussing a real-life application of data mining or credit scoring. It is important that the case considered is a real-life case and not an artificial one. Some suggested journals are:
• Management Science
• Operations Research
• INFORMS Journal on Computing
• INFORMS Journal on Applied Analytics
• Journal of Machine Learning Research
• European Journal of Operational Research
• ICDM (IEEE International Conference on Data Mining)
• NeurIPS (Conference on Neural Information Processing Systems)
• KDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
Once you have found an appropriate paper, report the following in separate sections:
• Title, authors and complete citation (journal name, book title, issue, year, …)
• The data mining problem considered
• The data mining techniques used
• The results reported
• A critical discussion of the model and results (assumptions made, shortcomings, limitations, …)
Make sure you demonstrate that you understand what the article is all about!