Class times and location: Tuesday and Thursday 3:30pm – 4:45pm in Hanes 120
Format: At present the plan is for classes to be held in-person, with students following the University’s COVID protocols (see below for more details).
Prerequisites: The material in STOR 565 incorporates and makes use of ideas from statistics, optimization, and computer science. Students should have completed and have a good understanding of the material in MATH 233, STOR 215 or MATH 381, and STOR 435. In addition, students should have knowledge of basic matrix algebra at the level of MATH 347 (formerly MATH 547). Familiarity with the material in STOR 415 (Optimization) and STOR 455 (Methods of Data Analysis), along with some prior experience with basic programming, is desirable but not required. A more detailed list of select prerequisite material is given below and in the online lecture notes.
Registration: Enrollment and registration for the course are handled online. Please contact Ms. Christine Keat (firstname.lastname@example.org) if you have questions.
Instructor: Andrew B. Nobel
Office: Hanes 308 Email: email@example.com
Nobel Office Hours (via Zoom): Mondays 2-3:20pm, and Fridays 2-3pm
TA: Jose Angel Sanchez Gomez
Office: Hanes B-54 Email: firstname.lastname@example.org
Sanchez Gomez Office Hours (via Zoom): Mondays 1-2pm and Wednesdays 11am-12pm
Overview: Broadly speaking, machine learning is the study and development of statistical and computational methods that identify useful structure in large data sets, or that use existing observations to make predictions about new (unseen or partially unseen) observations. In most cases, machine learning approaches are based on general models and procedures that are not tailored to the specific problem at hand. Machine learning theory and methods draw on ideas from statistics, optimization, and computer science, with key advances coming from researchers and techniques in each of these fields. How one defines machine learning often depends on the field one is in; machine learning encompasses or overlaps a number of active research areas, including data mining, the analysis of “big data”, artificial intelligence, and deep learning.
Audience: The course is targeted to advanced undergraduate (usually seniors) and masters students who have completed the prerequisite coursework and are interested in exploring machine learning from a mathematically rigorous point of view.
Goals: STOR 565 is intended to provide a mathematically rigorous, broad-based introduction to the ideas and techniques of statistical machine learning. The course is targeted to advanced undergraduate and masters students with interest and background in statistics, mathematics, or computer science. The emphasis of the class will be on theory and fundamentals. Rather than cover many methods in a cursory manner, we will cover a smaller number of representative methods in greater detail, with the goal of illustrating core ideas that underlie many methods. Lectures will focus primarily on the theoretical background and the description of different methods. Computer assignments completed outside of the lectures will introduce students to the R programming language, the methods discussed in class, and exploratory data analysis.
Protocol for lectures
- Please arrive on time, before the beginning of class, and leave only after the official end of class. If you need to arrive late or leave early, let the instructor know in advance.
- Use of laptops, tablets, phones, and other electronic devices, as well as the reading of newspapers, is not permitted during the lecture.
Office Hours: If you have questions about the homework assignments or lecture material, please speak with the instructor after class, or during his office hours. If you have questions about the computer assignments, please speak with the TA during his office hours. The instructor and the TA may not be able to respond to emails (including those received shortly before assignments are due), so please begin assignments well before they are due.
Primary Text: An Introduction to Statistical Learning [ISL]: James et al. (2013), Springer (free online, includes R labs).
Secondary Texts and Sources
Introduction to Applied Linear Algebra [IALA], by Stephen Boyd and Lieven Vandenberghe.
3Blue1Brown videos on linear algebra
The Elements of Statistical Learning [ESL]: Hastie et al. (2009), Springer (free online).
Machine Learning: A Probabilistic Perspective [MLPP], by Kevin P. Murphy (2012), MIT Press.
Attendance: Unless they have a University Approved Absence (see the University Attendance Policy below), students are expected to attend all lectures. If you are unable to attend a lecture, please make plans to get the notes from another student in the class.
Homework and Computer Assignments: Homework and computer assignments will be posted on the course web page, and in most cases will be due once a week. The computer assignments are intended to introduce students to the basics of the R programming language, and the basic statistical machine learning methods discussed in the lectures. The homework assignments are intended to cover and strengthen students’ understanding of the theoretical material.
Homework Policy: Homework for the class will be handled using Gradescope. Homework and computer assignments will be collected before class on the day that they are due, so please be prepared to submit your homework at that time. Each assignment will be graded: late/missed assignments will receive a grade of zero. All assignments will have equal weight. In computing a student’s overall score for the course, their lowest homework score and lowest computer assignment score will be dropped. This provision is meant to cover exceptional situations in which a student is unable to turn in an assignment due to circumstances beyond his/her control. Under normal circumstances students are expected to turn in every homework and computer assignment.
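As an illustration of the drop-lowest rule described above, here is a minimal sketch in Python (the course's own assignments use R). It assumes scores are recorded on a common 0 to 100 scale and that the rule is applied separately to homework and to computer assignments. The function name and the scores are hypothetical, and the sketch does not reflect the weights used to combine categories into a final grade.

```python
def category_average(scores):
    """Average of equally weighted scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores in order to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score only
    return sum(kept) / len(kept)


# Hypothetical scores; the 0 stands for a late or missed assignment.
homework = [90, 0, 80, 100]
computer = [70, 85, 95]

print(category_average(homework))  # mean of 80, 90, 100
print(category_average(computer))  # mean of 85, 95
```

Only one score per category is dropped, so a second late or missed assignment would still count as a zero.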
To receive full credit on the homework and computing assignments, you should upload the assignments in an orderly manner. Please write your name or initials on each page. In addition, for homework assignments, you should clearly label each problem, neatly show all your work (including your mathematical arguments), and give a clear account of your reasoning in English, using full sentences where appropriate. For computer assignments please avoid including excessive output.
You are allowed to discuss the homework and computer assignments with other students, but must prepare each assignment by yourself. Copying of another person’s answers or code is not allowed. Any questions regarding the grading of homework or computer assignments should first be addressed to the TA. If you are absent from class when an assignment is returned, you can get your paper from the TA during their office hours.
Projects: There will be two group projects, an initial project due near the middle of the semester, and a final project due at the end of the semester. For the projects students can apply methods covered in class to the analysis of one or more data sets of interest, or may investigate in some detail a theoretical direction related to but not covered in the lectures. For both projects students will work in teams, and prepare short written reports on their findings.
Exams: There will be one midterm exam, and a comprehensive final exam. Both exams will be in person, and will be closed note and closed book. There will be no makeup exams. The final exam will be given at the date and time specified in the official University Final Exam Schedule.
Grading: Grading will be based on homework, computer assignments, a final project, an in-class midterm, and an in-class final exam, using the weights below.
Honor Code: Students are expected to adhere to the UNC honor code at all times.
1. Calculus: Basic properties of integration and differentiation. Integration and differentiation of simple functions (e.g. exponential functions, trigonometric functions and polynomials). Integration and partial differentiation of functions of several variables. Taylor series, minima and maxima of functions.
2. Probability: Joint and conditional densities and probability mass functions. Cumulative distribution functions. Random variables. Definition and basic properties of expectation, variance, covariance, and correlation. Key discrete and continuous distributions and their basic properties: Bernoulli, binomial, Poisson, geometric; uniform, normal, exponential, gamma. Finding the distribution of a function of a random variable: the CDF method and the general change of variables theorem.
3. Statistics: Sample vs. population quantities. One- and two-sample z- and t-statistics. Basics of point estimation, hypothesis testing and p-values.
4. Linear algebra: Vector spaces, dimension, subspaces and spans. Inner and outer products. Matrix addition and multiplication. Determinants and inverses. Eigenvectors and eigenvalues. Symmetric matrices, and non-negative definite matrices. Rank and trace of a matrix. The Frobenius norm of a matrix. Orthogonal matrices.
Tentative Syllabus: The course will begin with some introductory material, an overview of exploratory data analysis, and a review of inequalities and matrix analysis. The bulk of the remaining course material will be divided into the study of unsupervised methods and supervised methods. The following is a tentative syllabus for the course.
I. Introduction and Preliminaries
Overview of machine learning
- Supervised analysis: Fitting functions to labeled observations
- Unsupervised analysis: Finding patterns and groups in unlabeled data
- Euclidean norm, inner products, Cauchy-Schwarz inequality
- Review of matrix algebra
- Maxima, minima, absolute values
Convexity, machine learning and optimization
- Definition and basic properties of convex sets and functions
- Local and global minima
- The canonical convex program
Introduction to exploratory data analysis
- Univariate sample statistics: mean, median, mode, standard deviation; histograms and density plots; z- and t-statistics
- Bivariate sample statistics: correlation, scatter-plots, r-squared values
II. Unsupervised Methods
Principal Component Analysis (PCA) and dimension reduction
- Finding good summary directions in high-dimensional data
- Derivation of PCs from eigenvectors of sample variance matrix
The Singular Value Decomposition (SVD)
- Overview of the problem, finding group structure in data
- K-means clustering
- Hierarchical clustering, trees and dendrograms
III. Supervised Methods
Convex sets and functions, Jensen’s inequality
- Introduction. Stochastic setting
- Classification rules, decision regions, decision boundaries
- Basic marginal and conditional distributions
- Bayes risk and Bayes rule
- Random vectors and the multivariate normal distribution
- Histogram rules
- Nearest neighbor rules
- Naive Bayes
- Logistic regression
- Linear discriminant analysis, quadratic discriminant analysis
- Support vector machines, separable and non-separable cases
Probability Inequalities and the weak law of large numbers
- Training and test sets
- Training error vs. test error
- k-fold cross-validation
Empirical risk minimization
Overview of Optimization
- Basic problem
- Ordinary least squares: derivation and some basic properties
- Ridge regression: derivation; shrinkage and regularization
- The LASSO
Support Vector Machines
- Linear classification rules
- Maximum margin classifiers
- Lagrangian dual and solution of maximum margin problem
Decision and Regression Trees
- Binary trees and tree-structured partitions
- Growing and pruning
- Boosting and bagging
IV. Other Topics (depending on time and class interest)
- The EM Algorithm
- Analysis of networks, including community detection and the friendship paradox
- Random forests
- Online learning and individual sequences
- Multiple testing and the false discovery rate
Disclaimer: The instructor reserves the right to make changes to the syllabus, and to the due dates of assignments. The latter will be announced as early as possible.
1. Keep up with the reading and homework assignments. If the reading assignment is long, break it up into smaller pieces (perhaps one section or subsection at a time).
2. Always look over the notes from lecture k before attending lecture k+1. This will help keep you on top of the course material. Ideas from one lecture often carry over to the next: you will get much more out of the material if you can maintain a sense of continuity and keep the “big picture” in mind.
3. Complete the reading *before* doing the homework. Hunting for the right formula or paragraph for a particular problem often takes as much time as doing the reading, and it tends to create more confusion than it resolves.
4. When looking over your notes or the reading assignment, keep a pencil and scratch paper on hand, and try to work out the details of any argument or idea that is not completely clear to you. Even if the argument or idea is clear, it can be helpful to write it down again in a different way in order to test and strengthen your understanding.
5. It is important to know what you know, but it’s especially important to know what you don’t know. As you look over the reading material and your notes, ask yourself if you (really) understand it. Keep careful track of any concepts and ideas that are not clear to you, and make efforts to master these in a timely fashion.
6. One good way of seeing if you understand an idea or concept is to write down (or state out loud) the associated definitions and basic facts, without the aid of your notes and in complete, grammatical sentences. Translating mathematics into English, and back again, is an important research skill, and a good way to build and assess your understanding.
7. The homework and computer assignments play two important roles in the course. First, they provide an opportunity to actively think about, engage with, and learn the course material. In addition, they provide feedback on your understanding of the material. Carefully look over your corrected assignments. Most students do well on the assignments: even if you received a good score, make sure to note, understand, and correct any mistakes you have made.
8. Begin studying for exams at least one week before they are given. Look over your notes, homework, and the text. Write up a study guide containing the main concepts and definitions being covered, and use this to get a clear picture of the overall landscape of the material. For every topic on the study guide, you should know the relevant definitions, motivating ideas, and at least one or two examples.
Face Mask Policy
This semester, while we are in the midst of a global pandemic, all enrolled students are required to wear a mask covering the mouth and nose at all times in our classroom. This requirement is to protect our educational community, your classmates and me, as we learn together. If you choose not to wear a mask, or wear it improperly, I will ask you to leave immediately, and I will submit a report to the Office of Student Conduct. At that point you will be disenrolled from this course for the protection of our educational community. Students who have an authorized accommodation from Accessibility Resources and Service are exempt from this requirement. For additional information, see Carolina Together.
Honor Code Policy
As a condition of joining the Carolina community, Carolina students pledge “not to lie, cheat, or steal” and to hold themselves, as members of the Carolina community, to a high standard of academic and non-academic conduct while both on and off Carolina’s campus. This commitment to academic integrity, ethical behavior, personal responsibility and civil discourse exemplifies the “Carolina Way,” and this commitment is codified in both the University’s Honor Code and in other University student conduct-related policies.
The University of North Carolina at Chapel Hill facilitates the implementation of reasonable accommodations, including resources and services, for students with disabilities, chronic medical conditions, a temporary disability or pregnancy complications resulting in barriers to fully accessing University courses, programs and activities.
Accommodations are determined through the Office of Accessibility Resources and Service (ARS) for individuals with documented qualifying disabilities in accordance with applicable state and federal laws. See the ARS Website for contact information: https://ars.unc.edu or email email@example.com.
Counseling and Psychological Resources
CAPS is strongly committed to addressing the mental health needs of a diverse student body through timely access to consultation and connection to clinically appropriate services, whether for short or long-term needs. Go to their website: https://caps.unc.edu/ or visit their facilities on the third floor of the Campus Health Services building for a walk-in evaluation to learn more.
Title IX Resources
Any student who is impacted by discrimination, harassment, interpersonal (relationship) violence, sexual violence, sexual exploitation, or stalking is encouraged to seek resources on campus or in the community. Please contact the Director of Title IX Compliance (Adrienne Allison – Adrienne.firstname.lastname@example.org), Report and Response Coordinators in the Equal Opportunity and Compliance Office (email@example.com), Counseling and Psychological Services (confidential), or the Gender Violence Services Coordinators (firstname.lastname@example.org; confidential) to discuss your specific needs. Additional resources are available at safe.unc.edu.
University Attendance Policy
No right or privilege exists that permits a student to be absent from any class meetings, except for these University Approved Absences:
- Authorized University activities
- Disability/religious observance/pregnancy, as required by law and approved by Accessibility Resources and Service and/or the Equal Opportunity and Compliance Office (EOC)
- Significant health condition and/or personal/family emergency as approved by the Office of the Dean of Students, Gender Violence Service Coordinators, and/or the Equal Opportunity and Compliance Office (EOC).