Instructors: John P. Dickerson and Prem Saggar
Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism—just ask Nate Silver!
This course focuses on (i) data management systems, (ii) exploratory and statistical data analysis, (iii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly available final project that will strengthen course participants' data science portfolios.
This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.
Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.
There will be one written, in-class midterm examination. There will not be a final examination; rather, in the interest of building students' public portfolios, and in the spirit of "learning by doing", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here (subject to change!), the tutorial will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.
Final grades will be calculated as:
You can earn full credit for class participation in three ways:
This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: john@cs.umd.edu!
For course-related questions, please use Piazza to communicate with your fellow students, the TAs, and the course instructors. For private correspondence or special situations (e.g., excused absences, DDS accommodations, etc.), please email John with [CMSC320] in the email subject line.
Human | Time | Location |
---|---|---|
Eleftheria Briakou | Mondays, 11:45am–1:45pm | AVW 1120 |
Hao Chen | Fridays, 3pm–5pm | AVW 1120 |
John Dickerson | TBD 1hr/wk. Also by appointment; please email John with [CMSC320] in the email subject line. | AVW 3217 |
Susmija Jabbireddy | Thursdays, 3pm–5pm | AVW 1120 |
Harihara Muralidharan | Mondays, 10am–12pm | AVW 1120 |
Renkun Ni | Wednesdays, 11:00am–1:00pm | AVW 1120 |
Nischal Reddy | Thursdays, 9am–10am | AVW 1120 |
Prem Saggar | By appointment; please email Prem with [CMSC320] in the email subject line. | |
Yanchao Sun | Tuesdays, 10am–12pm | AVW 1120 |
Policies relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics that are addressed in these various policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, copyright and intellectual property.
Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.
# | Date | Topic | Reading | Slides | Lecturer | Notes |
---|---|---|---|---|---|---|
1 | 8/27 | Introduction | What the Fox Knows. | pdf, pptx | Dickerson | Sign up on Piazza! |
Part I: Data Collection, Storage, & Management | ||||||
2 | 8/29 | Scraping Data with Python | Anaconda's Test Drive. | pdf, pptx | Dickerson | PDF download script from class: link |
3 | 9/5 | NumPy, SciPy, & DataFrames | Introduction to pandas. | pdf, pptx | Dickerson | Pandas tutorials: link |
4 | 9/10 | Best practices & version control | — | pdf, pptx | Dickerson | Workflows for git: link |
5 | 9/12 | Data Wrangling I: Pandas & Tidy Data | Hadley Wickham. "Tidy Data." | pdf, pptx | Dickerson | Hould's Tidy Data for Python |
6 | 9/17 | Data Wrangling II: Tidy data & SQL | Derman & Wilmott's "Financial Modelers' Manifesto." | pdf, pptx | Dickerson | SQLite: link; pandasql library: link |
7 | 9/19 | Quality Assurance (QA) | — | ipynb | Saggar | "High Availability" & 9's rule |
8 | 9/24 | Missing Data (Theory) | Pandas tutorial on working with missing data. | pdf, pptx | Dickerson | Scikit-learn's imputation functionality: link |
9 | 9/26 | Missing Data (Practice) | — | ipynb | Saggar | We'll try to have the .ipynb up before lecture; if so, please feel free to download it and bring a laptop to class! |
10 | 10/1 | Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity Resolution | Data Cleaning: Problems and Current Approaches (Note: this is a reference piece; please don't read the whole thing!) | pdf, pptx | Dickerson | Wikipedia article on outliers |
11 | 10/3 | Exploratory Data Analysis: Summary Statistics, Transformations, & Visualization | John W. Tukey: His Life and Professional Contributions. | pdf, pptx | Dickerson | Seaborn visualization library for Python: link |
12 | 10/8 | Basic Probability, Causation & Correlation, Hypothesis Testing | — | pdf, pptx | Saggar | |
13 | 10/10 | Basic Probability, Causation & Correlation, Hypothesis Testing | — | pdf, pptx | Saggar | Excel spreadsheet covering hypothesis testing example from class: link |
14 | 10/15 | Data Wrangling Wrap-Up: Graphs | Introduction to GraphQL: link | pdf, pptx | Dickerson | NetworkX: link |
15 | 10/17 | Basic Probability & Bayes' Theorem | — | pdf, pptx | Saggar | Excel spreadsheet (second one!) covering hypothesis testing example from class: link |
16 | 10/22 | Midterm Review & Raw Text/NLP Intro | — | pdf, pptx | Dickerson | Sample midterm from Spring 2017 was also posted on Piazza earlier. |
17 | 10/24 | Midterm | — | — | Dickerson | Bring a pen! |
18 | 10/29 | NLP I | NLTK Book. | pdf, pptx | Dickerson | Python Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link |
19 | 10/31 | NLP II | Continued from last class ... | pdf, pptx | Dickerson | Continued from last class ... |
20 | 11/5 | Decision Trees | | pdf, pptx | Saggar | — |
21 | 11/7 | Decision Trees | | pdf, pptx | Saggar | Excel file from class: link |
22 | 11/12 | Introduction to Machine Learning, & Linear Regression | Hal Daumé III. A Course in Machine Learning. | pdf, pptx | Dickerson | Scikit-learn cheat sheet: link |
23 | 11/14 | Gradient Descent | — | pdf, pptx | Saggar | Gradient descent example from class: xlsx file |
24 | 11/19 | Random Forests, SVM | — | pdf, pptx | Dickerson | TensorFlow: link |
— | 11/21 | Thanksgiving Break | — | — | — | — |
25 | 11/26 | Neural Nets I | — | pdf, pptx | Saggar | |
26 | 11/28 | Neural Nets II | — | pdf, pptx | Saggar | |
27 | 12/3 | Debugging Data Science | — | pdf, pptx | Dickerson | |
28 | 12/5 | Data Science In Industry | — | ipynb, pdf, pptx | Saggar | |
29 | 12/10 | Course Wrap Up & Research | — | pdf, pptx | Dickerson | |
Final | 12/15 | Final Exam Date | Final versions of tutorials must be posted by 1:30PM, the exam time. | Instructions & rubric: link |
In addition to the tutorial to be posted publicly at the end of the semester, there will be four "mini-projects" assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.
Posting solutions publicly online without the staff's express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.
# | Description | Date Released | Date Due | Project Link |
---|---|---|---|---|
0 | Setting Things Up | August 27 | September 7 | link |
1 | Fly Me To The Moon | September 13 | link | |
2 | Moneyball | October 1 | link | |
3 | Fact Tank | October 28 | link | |
4 | Baltimore Crime | November 26 | December 7 | link |