CMSC320 – Fall 2020

Introduction to Data Science

Data Science!?

Instructor: John P. Dickerson
TAs: Sweta Agrawal (1/2 TA), Nitin Balachandran, Tracy Chen, Aviva Prins, Noor Singh, Qingyang Tan (1/2 TA)
Lectures: Tuesday and Thursday, 5:00–6:15 PM
Lectures are live on Zoom & posted on ELMS and YouTube

Description of Course

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism—just ask Nate Silver!

This course focuses on (i) data management systems, (ii) exploratory and statistical data analysis, (iii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly-available final project that will strengthen course participants' data science portfolios.

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

Requirements

Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.

There will be one written, take-home (obviously, given COVID-19 and all) midterm examination. There will not be a final examination; rather, in the interest of building students' public portfolios, and in the spirit of "learning by doing", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here (subject to change!), the tutorial will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.
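
To make the shape of such a tutorial concrete, here is a minimal sketch of the scrape, explore, and insight steps using only the Python standard library. Everything in it is an assumption made for illustration: the embedded HTML table stands in for a real scraped page, the city names and counts are made up, and the helper names (TableParser, scrape, explore, insight) are hypothetical. A real tutorial would fetch a live source and would likely lean on the libraries covered in lecture.

```python
from html.parser import HTMLParser
from statistics import mean

# Hypothetical page content standing in for a scraped source (made-up data).
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>year</th><th>incidents</th></tr>
  <tr><td>Avon</td><td>2019</td><td>120</td></tr>
  <tr><td>Avon</td><td>2020</td><td>90</td></tr>
  <tr><td>Brook</td><td>2019</td><td>200</td></tr>
  <tr><td>Brook</td><td>2020</td><td>260</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape(html):
    """Step 1: turn the raw HTML table into a list of dict records."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def explore(records):
    """Step 2: exploratory summary, here the mean incident count per city."""
    by_city = {}
    for r in records:
        by_city.setdefault(r["city"], []).append(int(r["incidents"]))
    return {city: mean(values) for city, values in by_city.items()}


def insight(records):
    """Step 3: an operational takeaway, here percent change from 2019 to 2020."""
    counts = {(r["city"], r["year"]): int(r["incidents"]) for r in records}
    cities = sorted({r["city"] for r in records})
    return {
        c: 100 * (counts[(c, "2020")] - counts[(c, "2019")]) / counts[(c, "2019")]
        for c in cities
    }


records = scrape(SAMPLE_HTML)
averages = explore(records)
changes = insight(records)
for city in changes:
    print(f"{city}: mean={averages[city]}, change={changes[city]:+.1f}%")
```

Keeping the three steps as separate functions mirrors how a tutorial notebook is usually laid out: each stage can be explained, plotted, and rerun on its own.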

Final grades will be calculated as:

This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: john@cs.umd.edu!

Office Hours & Communication

For course-related questions, please use Piazza to communicate with your fellow students, the TAs, and the course instructors. For private correspondence or special situations (e.g., excused absences, DDS accommodations, etc.), please email John with [CMSC320] in the email subject line.

Office Hours (all times EDT)
Human Time Location
Sweta Agrawal 10AM-11AM Tuesday; Piazza on Wednesday Check ELMS/Piazza
Nitin Balachandran 3PM-5PM Monday; Piazza on Monday Check ELMS/Piazza
Tracy Chen 2PM-4PM Thursday; Piazza on Thursday Check ELMS/Piazza
John Dickerson By appointment; please email John with [CMSC320] in the email subject line. Zoom
Aviva Prins 12PM-2PM Tuesday; Piazza on Tuesday Check ELMS/Piazza
Noor Singh 4PM-6PM Wednesday; Piazza on Friday Check ELMS/Piazza
Qingyang Tan 2PM-3PM Friday; Piazza on Tuesday Check ELMS/Piazza

University Policies and Resources

Policies relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics addressed in these policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, and copyright and intellectual property.

Course evaluations

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.


Schedule

(Schedule subject to change as the semester progresses!)
# Date Topic Reading Slides Lecturer Notes
1 9/1 Introduction What the Fox Knows. pdf, pptx Dickerson Sign up on Piazza!
2 9/3 What is Data & Lightning Python Overview Anaconda's Test Drive. pdf, pptx Dickerson
3 9/8 Scraping Data (with Python) I "What happens when you type google.com into your browser's address bar?" pdf, pptx Dickerson PDF download script from class: link; Extra reading/quick tutorial on using BeautifulSoup: link
4 9/10 Scraping Data (with Python) II pdf, pptx Dickerson Regex helper sites: regexr.com, pythex.org, regex101.com, rubular.com (thanks to J Helperin, J Martinez, M Mohades, & R Amor)
5 9/15 NumPy & SciPy, & Best Practices Introduction to pandas. pdf, pptx Dickerson Pandas tutorials: link
6 9/17 Data Wrangling I: Pandas & Tidy Data Hadley Wickham. "Tidy Data." pdf, pptx Dickerson Hould's Tidy Data for Python
7 9/22 Data Wrangling II: Tidy data & SQL Derman & Wilmott's "Financial Modelers' Manifesto." pdf, pptx Dickerson SQLite: link; pandasql library: link
8 9/24 Version Control & Git pdf, pptx Dickerson
9 9/29 Version Control Wrap-up, & Graphs Introduction to GraphQL: link pdf, pptx Dickerson NetworkX: link
10 10/1 Graphs, & Summary Statistics and Transformations Backstrom & Kleinberg. "Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook," CSCW-14. arXiv link. pdf, pptx Dickerson
11 10/6 Summary Statistics and Transformations, & Missing Data I pdf, pptx Dickerson
12 10/8 Missing Data II Pandas tutorial on working with missing data. pdf, pptx Dickerson Scikit-learn's imputation functionality: link
13 10/13 Missing Data III, & Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity Resolution Data Cleaning: Problems and Current Approaches (Note: this is a reference piece; please don't read the whole thing!) pdf, pptx Dickerson Wikipedia article on outliers
14 10/15 Natural Language I: Syntax & Semantics NLTK Book. pdf, pptx Dickerson Python Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link
15 10/20 Natural Language II: Representation Continued from last class ... pdf, pptx Dickerson Continued from last class ...
16 10/22 Natural Language III: Embeddings & Similarity Continued from last class ... pdf, pptx Dickerson Pre-recorded class; John will not be available during the live lecture period for this class.
17 10/27 Midterm Review & TBD Midterm review: pdf, pptx; Lecture slides: pdf, pptx Dickerson New material from this lecture will not be included on the midterm.
18 10/29 Midterm Dickerson
19 11/3 Vote! Dickerson Election day!
20 11/5 Introduction to Machine Learning Hal Daumé III. A Course in Machine Learning. pdf, pptx Dickerson
21 11/10 Decision Trees and Random Forests Russell & Norvig's Chapter 18 lecture slides: pdf, pptx Dickerson Scikit-learn's basic decision tree functionality: link; Bart Selman's CS4700: link
22 11/12 Random Forests, K-NN pdf, pptx Dickerson
23 11/17 Practical Issues I: Overfitting, Cross-validation, Regularization pdf, pptx Dickerson xkcd on overfitting: link; Polynomial features/Interaction terms in Scikit: link
24 11/19 Practical Issues II: Feature Engineering, PCA, Clustering, Association Rules Nguyen & Holmes. "Ten quick tips for effective dimensionality reduction," PLoS Computational Biology. pdf, pptx Dickerson Wikipedia article on the confusion matrix: link
25 11/24 Practical Issues III: Recommender Systems and Association Rules Best Practices for Recommender Systems (from Microsoft). pdf, pptx Dickerson
11/26 Thanksgiving Break
26 12/1 Scaling It Up Dean & Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters," CACM. pdf, pptx Dickerson Wikipedia on SGD: link
27 12/3 Data Science Ethics & Best Practices I The Atlantic. "Everything We Know About Facebook's Secret Mood Manipulation Experiment" pdf, pptx Dickerson What is GDPR? (link)
28 12/8 Data Science Ethics & Best Practices II Apple's brief overview of differential privacy; Barocas, Hardt, & Narayanan. Fairness in Machine Learning. pdf, pptx Dickerson SIGCOMM paper that passed IRB review but is widely seen as unethical: link
29 12/10 Debugging Data Science, & Data Science in Industry pdf, pptx Dickerson Additional discussion of debugging models (from Cornell): link
Final 12/21 Final Exam Date Final versions of tutorials must be posted by 4:00PM, the exam time. Instructions & rubric: link

Mini-Projects né Homework

In addition to the tutorial to be posted publicly at the end of the semester, there will be four "mini-projects" assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.

Posting solutions publicly online without the staff's express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.

(Assignments will appear over the course of the semester.)
# Description Date Released Date Due Project Link
0 Setting Things Up September 1 September 8 link
1 Fly Me To The Moon September 15 September 29 link
2 Moneyball October 2 October 19 October 22 link
3 Fact Tank November 6 November 23 December 3 link
4 Baltimore Crime December 3 December 10 link

Final Tutorials

In the spirit of "learning by doing," students created a self-contained online tutorial to be posted publicly. Tutorials could be created individually or in a small group. The intention was to create a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data. Below is a list of (most of) the tutorials created in the Fall 2020 version of CMSC320.

Most links lead to a public GitHub Page created by a student or small group in the Fall 2020 CMSC320 course; some links lead to students' personal websites or to a Notebook hosted on Google Colab. Project creators: if a link is missing or incorrect, please get in touch with John!
Project Title URL
2020 Presidential Election: From a Data Science Perspective link
A Closer look at the NFL Draft link
A Data Science Walkthrough Using Global Happiness Data link
A Data Scientist's Guide to the S&P 500 link
A March Madness Analysis link
A Pandemic: An Analysis of COVID-19 link
American Music Awards Tweets link
An Analysis of Amazon's Top 50 Bestselling Books link
An Analysis of Heart Diseases and Attributes Leading to Heart Disease link
An Analysis of Metrics in Predicting Economic Performance based on the Modern Portfolio Theory link
An Analysis of Salaries and Cost of Living in Different US Cities
An Analysis of the Impact of COVID-19 on Crime in College Park, MD link
An Analysis of the Prevalence of US Events on Reddit link
An Introduction to Genome Analysis in Python (Data Science Tutorial) link
Analysis of Amazon's Top 50 Bestselling Books link
Analysis of Book Data from Amazon link
Analysis of COVID Data and Politicial Outcomes for the United States link
Analysis of COVID-19 data in United States link
Analysis of Crime in Maryland link
Analysis of Homelessness in Maryland link
Analysis of NFL Games link
Analysis of the Coronavirus by Coninent link
Analysis of the Covid-19 Pandemic link
Analysis of the Google Play Store link
Analysis of Tournament Matches in Super Smash Bros. Melee link
Analysis of Traffic Violations in Montgomery County Maryland link
Analysis on the Potential of Life on Exoplanets link
Analysis on Voter Turnout Data from 2020 General Election link
Analysis San Francisco Criminal Records (CMSC320 Final Project) link
Analyze 2020 Election Data link
Analyzing Avocado Prices and Consumption in the U.S. link
Analyzing Football Clubs in the U.K. link
Analyzing Global Suicide Rate from 1985 to 2016 link
Analyzing Retail Investors with Robinhood Data link
Analyzing the Prices of Boston Airbnb Rentals: What Affects Prices and Have Prices Changed Since the Pandemic? link
Analyzing the relationship between home matches and match wins in the English Premier League (Soccer) link
Analyzing the Top Spotify Songs of the 2010s link
Aspects of Trending Videos on YouTube link
Attempting to predict the outcome of a hit baseball link
Black Lives Matter movement link
Breaking down the Grammy Award for Record of the Year - An Analysis link
BREAKING DOWN THE TOP TRENDING YOUTUBE VIDEOS (U.S. & CA) link
Chicago Burglaries link
Citi Bike Ridership & Public Safety During COVID-19 link
Classifying Pokemon Competitively link
CMSC320 Final Project - Spotify Data Analysis link
CMSC320 Final Project: Steven Struglia, Michael Strobel, HtetMyat Aung (Stock Market) link
CMSC320 Final Tutorial link
Cooking Recipes: An analysis of Ratings, Nutrition, and Tags link
Coronavirus Exploratory Data Analysis
Countering a Dangerous Problem link
COVID-19, An Analysis link
COVID-19: Modeling The Relative Impact on US States link
COVID-19's effect on Twitch & which games are the best to stream link
COVID's Effect on Music Trends link
COVID19 and state demographics: Finding which factors might affect COVID19 rates link
Critical and Commercial Success in Music of the 2010 Decade link
Data Analysis on FAANG Stocks from 2013 to 2020 link
Data Visualization and Analysis of COVID-19 link
Determining Buzz Words on Reddit link
Do Masks Help in the Prevention of Covid-19? link
Do Professional Wine Reviewers Know What They're Doing? link
Drink To Forget: An Analysis of Drinking Habits During the COVID-19 Pandemic link
Evaluating Chess Positions link
Expected Value, Win Probability, and why "Common Knowledge" Hurts Sports Teams link
Film Genre and Popularity Trends link
Final Tutorial link
Final Tutorial link
Finding Your Ideal Wine link
Formula 1: A brief look through history link
Formula One Racing link
From 2016 to 2020, How Politics Have Changed In America link
Get a formula to predict the sale price of houses in Ames, Iowa link
Global Food Waste Analysis link
Gold Prices: Driven by Inflation, Volatility, or Treasury Yields? link
Happiness in the World link
Happiness Within Countries link
Hospital Wait Times in The U.S. link
How Corona Started link
How Happy is Our World? link
How has Music Changed over Time - A Spotify Data Analysis link
How Height has effected Win % in 1980 and 2020 link
How much better can the 2019 NBA Draft class get? link
How to Beat Better Rated Chess Players link
How to Make a Successful Game on Steam? link
How well do our police departments represent the populations they serve? link
Individual and Comparative Analysis of Pop, Hip Hop, and Rock Song Structures link
Is College Tuition Infected? Diagnosing Baumol's Cost Disease link
Is Joe Flacco an Elite Quarterback? link
Is The Electoral College Misleading?
Leicester's Unprecedented Title Win link
Mental Health in the Tech Industry link
Missing Migrants - An Analysis on the Risk of Seeking Asylum link
Movie Genre Popularity and Economic Activity link
Music attributes and its effect on popularity link
Music Over the Decades link
Music Throughout the Decades: An Analysis link
My Brother, My Brother and Me and the McElroy Brand link
NBA 2020 Season Statistics Analysis
NBA Project link
Netflix Movie/Tv Show Trends link
NTSB Investigations of Aircraft Incursions in the USA link
Pokemon Type Analysis link
Predicting 2020-2021 English Premier League Table Results Using Machine Learning link
Predicting an MLB Player's Performance In Fantasy Baseball link
Predicting Average Salaries for all Proffessor Ranks link
Predicting Car Prices link
Predicting Chess Wins based off Openings link
Predicting Current Quarterback Win Rates link
Predicting Dementia and Alzheimer's link
Predicting Gaon Digital Streams through Spotify Audio Features link
Predicting Gross Domestic Investment in the United States link
Predicting House Prices in King County, Washington link
Predicting NBA Players' Salaries link
Predicting Student Performance In High School link
Predicting the Chance of Winning in League of Legends (Given the First 10 Minutes of Data) link
Predicting the Popularity of a Book on Project Gutenberg link
Predicting Winning Play Styles in Texas Hold'em link
Predicting Wins at the Highest Level in the NBA link
Prediction of Diabetes Melitus from Patients Medical Records link
Predictive Power of 3-Pointer for Team Win% in 21st Century NBA link
Presidential Election 2020 Voter Turnout Analysis link
Relationships Among Crime Rate, Gini Coefficient, and Median Income in US link
Reverse Line Movement link
Small Ball link
Smart, not fair: An analysis of CS:GO metagame tactics link
Testing a stock buying strategy versus buying and holding a stock link
The Best Times Of The Year To Puchase Tech Stocks link
The Evolution of the 3 point shot link
The Future of Console Gaming in a PC World link
The NBA Draft link
The Probabilities and Financial Impact of Gacha Games link
The Trends of Happiness link
Trends in Seattle Crimes link
Twitter's Climate Tide: An Analysis of Tweets About Climate Change link
Ultimate Fighter Championship Data Analysis link
Using Data Analysis & Visualization to Understand the Performance of NCAA Division 1 College Basketball Teams in March Madness link
Venmo Transactions Analysis link
Video Game Sales link
Visualization and Analysis of Liquor Sales in Iowa link
Visualizing Ocean Data on Reconstructed pH and Coral Bleaching Reports link
Wall Street Bets Sentiment Analysis link
What Determines A Soccer Player's Salary? link
What Factors Help Predict the Outcome of the 2020 Election? link
What Make You Happy? link
What makes an ArchiveOfOurOwn story successful? link
What Review Scores Mean for Games link
What's the movie score link
Will You Accept This Analysis? link
Winter is Coming: An Analysis of Sunshine vs Depression link
World Happiness link

Additional Administrative Information

Excused Absences

Missing an exam for reasons such as illness, religious observance, participation in required university activities, or family or personal emergency (such as a serious automobile accident or close relative’s funeral) will be excused so long as the absence is requested in writing at least 2 days in advance and the student includes documentation that shows the absence qualifies as excused; a self-signed note is not sufficient as exams are Major Scheduled Grading Events. For this class, such events are the final project assessment and the midterm, which will be due on the dates listed in the schedule above. The final exam is scheduled according to the University Registrar.

For medical absences, you must furnish documentation from the health care professional who treated you. This documentation must verify dates of treatment and indicate the timeframe during which you were unable to meet academic responsibilities. In addition, it must contain the name and phone number of the medical service provider to be used if verification is needed. No diagnostic information will ever be requested. Note that simply being seen by a health care professional does not constitute an excused absence; it must be clear that you were unable to perform your academic duties.

It is the University’s policy to provide accommodations for students with religious observances conflicting with exams, but it is your responsibility to inform the instructor in advance of intended religious observances. If you have a conflict with a planned exam, you must inform the instructor prior to the end of the first two weeks of the class.

The policies for excused absences do not apply to project assignments. Projects will be assigned with sufficient time to be completed by students who have a reasonable understanding of the necessary material and begin promptly. In cases of extremely serious documented illness of lengthy duration or other protracted, severe emergency situations, the instructor may consider extensions on project assignments, depending upon the specific circumstances.

Besides the policies in this syllabus, the University’s policies apply during the semester. Various policies that may be relevant appear in the Undergraduate Catalog.

If you experience difficulty during the semester keeping up with the academic demands of your courses, you may consider contacting the Learning Assistance Service in 2201 Shoemaker Building at (301) 314-7693. Their educational counselors can help with time management issues, reading, note-taking, and exam preparation skills.

Right to Change Information

Although every effort has been made to be complete and accurate, unforeseen circumstances arising during the semester could require the adjustment of any material given here. Consequently, given due notice to students, the instructors reserve the right to change any information on this syllabus or in other course materials. Such changes will be announced and prominently displayed at the top of the syllabus.

University of Maryland Policies for Undergraduate Students

Please read the university’s guide on Course Related Policies, which provides you with resources and information relevant to your participation in a UMD course.