AMS 598: Big Data Analysis

This course introduces the application of supercomputing to statistical data analysis, particularly for big data. Implementations of various statistical methodologies within a parallel computing framework are demonstrated throughout the lectures. The course covers (1) parallel computing basics, including interconnection network architectures, communication methodologies, algorithms, and performance measurement, and (2) their applications to modern data mining techniques, including modern variable selection/dimension reduction, linear/logistic regression, tree-based classification methods, kernel-based methods, nonlinear statistical models, and model inference/resampling methods.

Supplementary Textbooks:

  1. Applied Parallel Computing by Yuefan Deng, 2012, World Scientific
  2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2nd edition, 2016, Springer
  3. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman, 3rd edition, Cambridge University Press


Learning Outcomes

  1. Demonstrate knowledge of parallel computing basics:
    • Node architecture, central processing units, and accelerators;
    • Distributed- and shared-memory architectures
  2. Demonstrate skills with software architecture and R:
    • Communication patterns and protocols;
    • Process creation and management;
    • MapReduce framework (a MapReduce-style sketch in R appears after this list);
    • Hadoop in R
  3. Demonstrate mastery of basic tools for big data analysis:
    • Linear regression
    • Logistic regression
    • Dimension reduction
  4. Demonstrate understanding of advanced methods for big data analysis:
    • Classification and regression trees
    • Random forest
    • Gradient boosting
    • Support vector machine
    • Neural network
  5. Demonstrate understanding of model selection and performance evaluation:
    • Best subset; forward selection; backward selection
    • Cross-validation (a K-fold cross-validation sketch appears after this list)
    • Bootstrap
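
Illustrative example (not part of the course materials): a toy MapReduce-style word count written with base R's parallel package, a minimal sketch of the map/reduce pattern covered in the course. The corpus, the two-worker cluster, and all variable names are invented for illustration only; Hadoop itself is not used here.

    ## Toy MapReduce-style word count with base R's parallel package
    library(parallel)

    docs <- c("big data needs parallel computing",
              "parallel computing speeds up big data analysis")

    cl <- makeCluster(2)                       # start 2 worker processes

    ## Map step: each worker tokenizes a document into per-word counts
    mapped <- parLapply(cl, docs, function(d) {
      table(strsplit(tolower(d), "\\s+")[[1]])
    })

    stopCluster(cl)

    ## Reduce step: merge the per-document counts into global word counts
    counts <- Reduce(function(a, b) {
      words <- union(names(a), names(b))
      sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
    }, mapped)

    print(sort(counts, decreasing = TRUE))

Hadoop-based frameworks automate the same map/reduce split at scale; here the reduce step runs serially on the master process only to keep the sketch short.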
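
Illustrative example (not part of the course materials): a minimal K-fold cross-validation loop for a linear regression model, written in base R on simulated data. The simulated data, the seed, and the choice of K = 5 are arbitrary; the same pattern estimates prediction error for any of the models listed above.

    ## K-fold cross-validation of a linear model on simulated data
    set.seed(598)
    n   <- 200
    x   <- rnorm(n)
    y   <- 1 + 2 * x + rnorm(n)
    dat <- data.frame(x = x, y = y)

    K    <- 5
    fold <- sample(rep(1:K, length.out = n))   # random fold labels

    cv_mse <- sapply(1:K, function(k) {
      train <- dat[fold != k, ]                # fit on K - 1 folds
      test  <- dat[fold == k, ]                # evaluate on the held-out fold
      fit   <- lm(y ~ x, data = train)
      mean((test$y - predict(fit, test))^2)
    })

    mean(cv_mse)   # cross-validated estimate of prediction error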