Checking date: 30/05/2022


Academic year: 2022/2023

Statistical methods in data mining
(13722)
Study: Bachelor in Statistics and Business (203)


Coordinating teacher: MUÑOZ GARCIA, ALBERTO

Department assigned to the subject: Department of Statistics

Type: Compulsory
ECTS Credits: 6.0 ECTS

Course:
Semester:




Requirements (Subjects that are assumed to be known)
Regression Methods and Multivariate Analysis (third year). Knowledge of the R statistical software.
Objectives
1. To know and use advanced statistical techniques with state-of-the-art software support.
2. To extract and analyze information from large data sets.

1. Ability to analyze and synthesize information.
2. Modeling and solving practical problems in data mining.
3. Oral and written communication skills.
Skills and learning outcomes
Description of contents: programme
1. Introduction to the Tidyverse.
1.1 Data wrangling.
1.2 Data visualization: ggplot2.
1.3 Grouping and summarizing.
2. Text Mining.
2.1 Main concepts.
2.2 Word clouds.
2.3 Term-by-document matrix.
2.4 R implementations and applications.
3. Data visualization. Metric Multidimensional Scaling, Correspondence Analysis, Biplots.
3.1 Metric Multidimensional Scaling.
3.2 Biplots.
3.3 Perceptual mappings.
4. Cluster Analysis. Hierarchical methods, k-means and mixture models.
4.1 Bottom-up hierarchical clustering algorithms.
4.2 k-means and related algorithms.
5. Information Theory and classification trees.
5.1 Information theory.
5.2 Classification tree algorithms.
5.3 Real case: credit scoring.
5.4 Case studies.
6. Association Rules.
6.1 Main concepts and algorithms.
6.2 Complete example with R code.
6.3 Case studies.
7. Deep Learning.
7.1 Support Vector Machines.
7.2 Neural networks for classification.
7.3 Neural networks for regression.
8. Case Studies.
8.1 Comprehensive real cases involving all the techniques studied.
Learning activities and methodology
Theory (4 ECTS). Theory classes, with lessons available on the web.
Practice (2 ECTS). Problem solving and case studies. Computing sessions in computer labs. Oral presentations and debates.
Assessment System
  • End-of-term examination: 50%
  • Continuous assessment (assignments, laboratory, practicals...): 50%
Calendar of Continuous assessment
Basic Bibliography
  • A.J. Izenman. Modern Multivariate Statistical Techniques. Springer. 2008
  • E. Alpaydin. Introduction to Machine Learning, 2nd Edition. MIT Press. 2010
  • X. Wu. The Top Ten Algorithms in Data Mining. Chapman & Hall/CRC. 2009
Electronic Resources *
Additional Bibliography
  • I.H. Witten, E. Frank, M.A. Hall. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufmann. 2011
  • John M. Chambers. Software for Data Analysis: Programming with R. Springer. 2008
  • Luis Torgo. Data Mining with R. Chapman & Hall/CRC. 2010
  • W.J. Braun, D.J. Murdoch. A first course in statistical programming with R. Cambridge University Press. 2007
(*) Access to some electronic resources may be restricted to members of the university community and may require validation through Campus Global. If you connect from outside the University, you will need to set up a VPN.


The course syllabus may change due to academic events or other reasons.