STOR 767 Course Information Fall 2022

Class information: Tuesday and Thursday 2:00pm – 3:15pm in Hanes 130.  In person format.

Registration: Enrollment and registration for the course are handled online.  Please contact Ms. Christine Keat (crikeat@email.unc.edu) if you have questions.

Instructor:  Andrew B. Nobel

Office: Hanes 308   Email: nobel@email.unc.edu

Nobel Office Hours (via Zoom): Mondays 2-3:20pm

TA: Stephanie Lin

Office: Hanes B7   Email: tzlin@live.unc.edu

TA Office Hours: Friday 3-4:30pm via Zoom

Overview: Machine learning encompasses a wide variety of activities in academia and industry, including or overlapping data mining, the analysis of “big data”, artificial intelligence, and deep learning.  As an academic discipline, machine learning has points of contact with statistics, optimization, computer science, mathematics, and engineering.  For the purposes of this course, we regard machine learning as the study and understanding of statistical methods that identify structure in large data sets, and that use existing data to make predictions about future observations.  In most cases, machine learning approaches are based on general models and procedures that are not tailored to the specific problem at hand.

Audience: This course is targeted to graduate (masters and PhD) students in STOR, Computer Science, Mathematics, and related fields who have a strong background in statistics, linear algebra, probability, and advanced calculus (see prerequisites below).

Goals: The course will familiarize students with a number of key ideas and techniques in statistical machine learning and exploratory data analysis.  The emphasis of the lectures will be on statistical fundamentals and mathematical rigor, as opposed to methodological recipes: rather than cover many methods in a cursory, drive-by fashion, we will consider a smaller number of representative methods in greater detail, with the goal of illustrating core ideas having broad applicability.  The course will also emphasize the value and importance of exploratory data analysis as an indispensable precursor to more sophisticated methods.

Homework assignments will emphasize theoretical material and understanding.  Computer assignments and class projects will familiarize students with exploratory data analysis and a variety of machine learning methods.

Prerequisites: Students should have a good understanding of theoretical and applied statistics, at the level of STOR 654 and STOR 664, and some familiarity with the R programming language.  In particular, students should be familiar with the following material:

  • Basic statistics, including point estimation and hypothesis testing, Bernoulli, binomial, Poisson, normal, exponential, and uniform distributions, conditional distributions
  • Linear and matrix algebra, including norms, inner products, eigenvalues and eigenvectors, rank, inverse, projections, orthogonal matrices, and non-negative definite matrices
  • Calculus based probability, including conditional probability, regular and conditional expectation, variance and covariance, correlation, Markov and Chebyshev inequalities, moment generating functions
  • Advanced calculus, including limits, continuous functions, basic properties of derivatives and integrals, multivariate differentiation, multivariate integration, open, closed, and compact sets

Students should also have familiarity with data wrangling (dplyr, tidyverse, lubridate, stringr) and data visualization (ggplot2, ggcorrplot, shiny, maps) in R.  In particular, students should be familiar with the following programming tasks:

  • manipulation of vectors, matrices, and lists
  • importing and outputting data
  • transforming datatypes
  • functions, iterations, and logical design
  • simulating random variables and vectors
  • creation of figures (e.g., line plots, histograms, scatterplots, correlation plots, barplots)

Protocol for lectures

  • Please arrive on time, before the beginning of class.  If you need to arrive late or leave early, let the instructor know in advance.
  • Please refrain from using laptops, phones, and other non-note-taking devices.  Use of tablets is allowed during lectures only if they are used for taking notes.

Office Hours: If you have questions about the homework assignments or lecture material, please speak with the instructor after class, or during his office hours.  If you have questions about the computer assignments, please speak with the TA during her office hours.  The instructor and the TA may not be able to respond to emails outside office hours, so please begin assignments well before they are due.

Primary Text: Elements of Statistical Learning [ESL], by Hastie et al. (2009). Springer (available online).  We will also make use of online tutorials and surveys as needed.

Secondary Texts and Sources

Machine Learning: A Probabilistic Perspective [MLPP], by Kevin P. Murphy (2012). MIT Press.

Optimization Methods for Large-Scale Machine Learning, by Bottou, Curtis, and Nocedal

Robust Stochastic Approximation Approach to Stochastic Programming, by Nemirovski, Juditsky, Lan, and Shapiro

Computational Optimal Transport, by Peyré and Cuturi

Attendance:  Students should attend all lectures.  If you are unable to attend a lecture, please make plans to get the notes from another student in the class.

Homework and Computer Assignments: Homework and computer assignments will be posted on the course web page.  In most cases, homework assignments will be due every week, computer assignments every other week.

Homework Policy: Homework assignments will be handled via Gradescope, and should be submitted before class on the day that they are due, so please be prepared to submit your assignments at that time.

Each homework and computer assignment will be graded; late or missed assignments will receive a grade of zero.  All assignments will have equal weight.  In computing a student’s overall score for the course, their lowest homework score and lowest computer assignment score will be dropped.  This provision is meant to cover exceptional situations in which a student is unable to turn in an assignment due to circumstances beyond their control.  Under normal circumstances students should turn in every homework and computer assignment.

For homework assignments, please clearly label each problem, show your work (including your mathematical arguments), and give a clear account of your reasoning in English, using full sentences, when appropriate.

Computer assignments should be completed using the R programming language.  Please add meaningful comments to your code and avoid including excessive output.  Your work should be submitted to Gradescope as a single PDF generated from an R Markdown file.

You are allowed to discuss the homework and computer assignments with other students, but must prepare each assignment by yourself.  Copying of solutions or code from any external source (e.g. other students or websites) is not allowed.  If your answers to a question are based in whole or in part on an online source, that source should be cited.

Any questions regarding the grading of homework or computer assignments should first be addressed to the TA. If you are absent from class when an assignment is returned, you can get your paper from the TA during their office hours.

Project:  There will be a final group project due at the end of the semester.  The project will involve an in-class presentation as well as a written report.  Students will have the option of doing a more theoretically oriented project, in which they read, summarize, and discuss a technical paper in the machine learning literature, or a more applied project, in which they analyze one or more data sets using methods discussed in the lectures, and new methods as well.

Exams (tentative): There will be several quizzes and a final exam.  Quizzes and exams will be in class, and will be closed note and closed book.   There will be no makeup exams.

Grading (tentative): Grading will be based on homework, computer assignments, a final project, an in-class midterm, and an in-class final exam, using the weights below.

Homework 15%
Computer Assignments 15%
Final Project 35%
Exams 35%
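
As a rough illustration of how the tentative weights above could combine with the drop-lowest rule in the Homework Policy, here is a minimal sketch in Python.  It rests on assumptions that are not part of the syllabus: all scores are on a 0-100 scale, each category is summarized by a simple average, and the exam component is already a single combined score.  The function names and the example scores are hypothetical, and the actual computation is at the instructor's discretion.

```python
# Tentative category weights, taken from the list above.
WEIGHTS = {"homework": 0.15, "computer": 0.15, "project": 0.35, "exams": 0.35}


def drop_lowest_mean(scores):
    """Average equally weighted scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def overall_score(homework, computer, project, exams):
    """Weighted course score on a 0-100 scale (assumed, not official)."""
    parts = {
        "homework": drop_lowest_mean(homework),
        "computer": drop_lowest_mean(computer),
        "project": project,
        "exams": exams,  # assumed to be one combined exam score
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Hypothetical student: one missed homework (scored zero) is dropped.
example = overall_score(
    homework=[90, 80, 0, 100],
    computer=[70, 90, 80],
    project=88,
    exams=75,
)
print(round(example, 2))
```

In this sketch the missed homework does not affect the homework average, but a second missed homework would count as a zero.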

 

Syllabus (tentative): The course will begin with some introductory material, an overview of exploratory data analysis, and a review of inequalities and matrix analysis.  The remaining course material will be divided into the study of unsupervised methods and supervised methods.  The following is a tentative syllabus for the course.

I. Introduction and Preliminaries

Overview of exploratory data analysis

  • Univariate sample statistics: mean, median, mode, standard deviation; histograms and density plots; z- and t-statistics
  • Bivariate sample statistics: correlation, scatter-plots, r-squared values
  • Normalization, outlier detection, missing values

Mathematical background

  • Euclidean norm, inner products, Cauchy-Schwarz inequality
  • Review of matrix algebra
  • Maxima, minima, absolute values
  • Definition and basic properties of convex sets and functions
  • Jensen’s inequality
  • Local and global minima

II. Unsupervised Learning

Principal Component Analysis (PCA) and dimension reduction

  • Finding good summary directions in high-dimensional data
  • Derivation of PCs from eigenvectors of sample variance matrix

The Singular Value Decomposition (SVD)

  • Connections between PCA and SVD
  • Low rank approximations

Clustering

  • Overview of the problem, finding group structure in data
  • K-means clustering
  • Hierarchical clustering, trees and dendrograms
  • Applications

III. Supervised Learning

Classification, background

  • Introduction. Stochastic setting
  • Classification rules, decision regions, decision boundaries
  • Basic marginal and conditional distributions
  • Bayes risk and Bayes rule
  • Random vectors and the multivariate normal distribution

Classification Methods

  • Histogram rules
  • Nearest neighbor rules
  • Naive Bayes
  • Logistic regression
  • Linear discriminant analysis, quadratic discriminant analysis
  • Support vector machines, separable and non-separable cases

Probability Inequalities

  • Review of Markov and Chebyshev
  • Chernoff bound
  • Bernstein and Hoeffding inequalities
  • McDiarmid’s inequality and Gaussian concentration

Model assessment

  • Overfitting and underfitting
  • Training and test sets
  • Training error vs. test error
  • Optimism of training error
  • Estimates of prediction error when feature vectors are fixed
  • Effective number of parameters
  • k-fold cross-validation

Empirical risk minimization

  • Estimation and approximation error
  • Vapnik-Chervonenkis dimension
  • Performance bounds for VC families

Linear Regression

  • Basic problem
  • Ordinary least squares: derivation and some basic properties
  • Variable (subset) selection
  • Ridge regression: derivation; shrinkage and regularization
  • The LASSO
  • Principal component regression and partial least squares

Support Vector Machines

  • Linear classification rules
  • Maximum margin classifiers
  • Lagrangian dual and solution of maximum margin problem

Decision and Regression Trees

  • Binary trees and tree-structured partitions
  • Growing and pruning
  • Boosting and bagging
  • Random forests

IV. Other Topics

  • The EM Algorithm
  • Multiple testing and the false discovery rate

Disclaimer: The instructor reserves the right to make changes to the syllabus, and to the due dates of assignments. The latter will be announced as early as possible.

 

Study tips

1. Keep up with the reading and homework assignments. If the reading assignment is long, break it up into smaller pieces (perhaps one section or subsection at a time).

2. Always look over the notes from lecture k before attending lecture k+1. This will help keep you on top of the course material. Ideas from one lecture often carry over to the next: you will get much more out of the material if you can maintain a sense of continuity and keep the “big picture” in mind.

3. Complete the reading *before* doing the homework. Trying to find the right formula or paragraph for a particular problem often takes as much time, and it tends to create more confusion than it resolves.

4. When looking over your notes or the reading assignment, keep a pencil and scratch paper on hand, and try to work out the details of any argument or idea that is not completely clear to you.  Even if the argument or idea is clear, it can be helpful to write it down again in a different way in order to test and strengthen your understanding.

5. It is important to know what you know, but it’s especially important to know what you don’t know.  As you look over the reading material and your notes, ask yourself if you (really) understand it.  Keep careful track of any concepts and ideas that are not clear to you, and make efforts to master these in a timely fashion.

6. One good way of seeing if you understand an idea or concept is to write down (or state out loud) the associated definitions and basic facts, without the aid of your notes and in complete, grammatical sentences.  Translating mathematics into English, and back again, is an important research skill, and a good way to build and assess your understanding.

7. The homework and computer assignments play two important roles in the course. First, they provide an opportunity to actively think about, engage with, and learn the course material. In addition, they provide feedback on your understanding of the material. Carefully look over your corrected assignments. Most students do well on the assignments: even if you received a good score, make sure to note, understand, and correct any mistakes you have made.

8. Begin studying for exams at least one week before they are given. Look over your notes, homework, and the text. Write up a study guide containing the main concepts and definitions being covered, and use this to get a clear picture of the overall landscape of the material. For every topic on the study guide, you should know the relevant definitions, motivating ideas, and at least one or two examples.

 

Honor Code Policy

As a condition of joining the Carolina community, Carolina students pledge “not to lie, cheat, or steal” and to hold themselves, as members of the Carolina community, to a high standard of academic and non-academic conduct while both on and off Carolina’s campus. This commitment to academic integrity, ethical behavior, personal responsibility and civil discourse exemplifies the “Carolina Way,” and this commitment is codified in both the University’s Honor Code and in other University student conduct-related policies.

Accessibility Resources

The University of North Carolina at Chapel Hill facilitates the implementation of reasonable accommodations, including resources and services, for students with disabilities, chronic medical conditions, a temporary disability or pregnancy complications resulting in barriers to fully accessing University courses, programs and activities.

Accommodations are determined through the Office of Accessibility Resources and Service (ARS) for individuals with documented qualifying disabilities in accordance with applicable state and federal laws. See the ARS Website for contact information: https://ars.unc.edu or email ars@unc.edu.

Counseling and Psychological Resources

CAPS is strongly committed to addressing the mental health needs of a diverse student body through timely access to consultation and connection to clinically appropriate services, whether for short or long-term needs. Go to their website: https://caps.unc.edu/ or visit their facilities on the third floor of the Campus Health Services building for a walk-in evaluation to learn more.

Title IX Resources

Any student who is impacted by discrimination, harassment, interpersonal (relationship) violence, sexual violence, sexual exploitation, or stalking is encouraged to seek resources on campus or in the community. Please contact the Director of Title IX Compliance (Adrienne Allison – Adrienne.allison@unc.edu), Report and Response Coordinators in the Equal Opportunity and Compliance Office (reportandresponse@unc.edu), Counseling and Psychological Services (confidential), or the Gender Violence Services Coordinators (gvsc@unc.edu; confidential) to discuss your specific needs. Additional resources are available at safe.unc.edu.

University Attendance Policy

No right or privilege exists that permits a student to be absent from any class meetings, except for these University Approved Absences:

  1. Authorized University activities
  2. Disability/religious observance/pregnancy, as required by law and approved by Accessibility Resources and Service and/or the Equal Opportunity and Compliance Office (EOC)
  3. Significant health condition and/or personal/family emergency as approved by the Office of the Dean of Students, Gender Violence Service Coordinators, and/or the Equal Opportunity and Compliance Office (EOC).