Course materials for UC Davis STA141C: Big Data & High Performance Statistical Computing
Office Hours:
Lecture content is in the lecture directory.
Date | Topic | Video |
---|---|---|
1-8 | introduction, syllabus, first steps in R | |
1-10 | group by computation, zip files | |
1-15 | debugging | |
1-17 | parallelism with independent local processors | |
1-22 | vectorization, apply family of functions | |
1-24 | size and efficiency of objects, intro to S4 / Matrix | |
1-29 | unsupervised learning / cluster analysis, agglomerative nested clustering | |
1-31 | introduction to bash, file navigation, help, permissions, executables | |
2-5 | SLURM cluster model, example job submissions | |
2-7 | mid-quarter evaluation, bash pipes and filters, students practice SLURM, SWC lesson example hw | |
2-12 | review course suggestions, bash coding style guidelines | https://youtu.be/vqA18iYk7BM |
2-14 | shared memory parallelism | https://youtu.be/l34IBkk8xcc |
2-19 | Python Introduction | https://youtu.be/t7UZR_hVMpY |
2-21 | Python Iterators, generators, integration with shell pipelines | https://youtu.be/e-RAah4Eey4 |
2-26 | bootstrap, data flow, intermediate variables | https://youtu.be/ORvLvBj8dzo |
2-28 | performance monitoring, chunked streaming computation | https://youtu.be/RjrijQXd1dY |
3-5 | profiling in R, test driven development | https://youtu.be/lRKVzirgumw |
3-7 | database interfaces, SQL | https://youtu.be/O99Vx0L6hZM |
3-12 | Map Reduce, Hive | https://youtu.be/vdusmDcgGPg |
3-14 | Compiled languages | https://youtu.be/uwBm0ESc9-s |
Catalog Description:
High-performance computing in high-level data analysis languages; different computational approaches and paradigms for efficient analysis of big data; interfaces to compiled languages; R and Python programming languages; high-level parallel computing; MapReduce; parallel algorithms and reasoning.
As of January 2019, the fastest machine in the world is the Oak Ridge Summit supercomputer.
This is an experiential course.
Students will learn how to work with big data by actually working with big data.
We’ll cover the foundational concepts that are useful for data scientists and data engineers.
These are the goals of the course:
The class will cover the following topics.
In class we’ll mostly use the R programming language, but these concepts apply more or less to any language.
Optional topics:
We won’t do the following in class:
These are all worth learning, but out of scope for this class.
We’ll use the raw data behind usaspending.gov as the primary example dataset for this class.
These are comprehensive records of how the US government spends taxpayer money.
From their website:
USA Spending tracks federal spending to ensure taxpayers can see how their money is being used in communities across America.
How did I get this data?
I downloaded the raw Postgres database.
Nehad Ismail, our excellent department systems administrator, helped me set it up.
It’s about 1 Terabyte when built.
The largest tables are around 200 GB and have hundreds of millions of rows.
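Tables this large generally won't fit in memory on a laptop, so one common approach is to stream through the rows and keep only a running summary. Here is a minimal sketch in Python of a streaming group by sum. The column names `agency` and `amount` and the sample values are made up for illustration; they are not the actual usaspending schema.

```python
import csv
import io
from collections import defaultdict


def total_by_group(lines, key_column, value_column):
    """Sum value_column within each key_column, reading one row at a time."""
    totals = defaultdict(float)
    # DictReader pulls rows lazily, so only the running totals stay in memory.
    for row in csv.DictReader(lines):
        totals[row[key_column]] += float(row[value_column])
    return dict(totals)


# Tiny in-memory stand-in for a large file. Hypothetical columns and values.
sample = io.StringIO("agency,amount\nA,10.5\nB,2\nA,4.5\n")
totals = total_by_group(sample, "agency", "amount")
print(totals)
```

The same function works on an open file handle, since it only needs something that yields lines. Memory use grows with the number of distinct groups, not with the number of rows.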
You may find these books useful, but they aren’t necessary for the course.
I’ll post other references along with the lecture notes.
Category | Grade Percentage |
---|---|
Assignments | 75 |
Group Project | 20 |
Participation | 5 |
I expect you to ask lots of questions as you learn this material.
Here is where you can do this:
For private or sensitive questions you can post privately on Piazza or email the instructor or TA.
Asking good technical questions is an important skill.
Stack Overflow offers some sound advice on how to ask questions.
Summarizing,
You’re welcome to opt in or out of Piazza’s Network service, which lets employers find you.
For the group project you will form groups of 2-3 and pursue a more open-ended question using the usaspending dataset.
This is your opportunity to pursue a question that you are personally interested in as you create a public ‘portfolio project’ that shows off your big data processing skills to potential employers or admissions committees.
Start early!
Programming takes a long time, and you may also have to wait a long time for your job submission to complete on the cluster.
I encourage you to talk about assignments, but you need to do your own work, and keep your work private.
OK
NOT OK
Adapted from Nick Ulle’s Fall 2018 STA141A class.
Point values and weights may differ among assignments.
The weights indicate which aspects are most important, so that you can spend your time on those that matter most.
Check the homework submission page on Canvas to see what the point values are for each assignment.
The grading criteria are correctness, code quality, and communication.
The following describes what an excellent homework solution should look like:
The report does the following:
The attached code runs without modification.
The code is idiomatic and efficient.
Different steps of the data processing are logically organized into scripts and small, reusable functions.
Variable names are descriptive.
The style is consistent and easy to read.
Plots include titles, axis labels, and legends or special annotations where appropriate.
Tables include only columns of interest, are clearly explained in the body of the report, and are not too large.
Numbers are reported in human-readable terms, e.g. 31 billion rather than 31415926535.
Writing is clear, correct English.
The report points out anomalies or notable aspects of the data discovered over the course of the analysis.
It discusses assumptions in the overall approach and examines how credible they are.
It mentions ideas for extending or improving the analysis or the computation.
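The guideline above about reporting numbers in human readable terms can be automated with a small helper. This is a minimal sketch in Python; the function name `humanize` and the choice to round to the nearest whole unit are assumptions for illustration, not a course requirement.

```python
def humanize(n):
    """Return a short, readable string such as '31 billion' for a large number."""
    units = [
        (10**12, "trillion"),
        (10**9, "billion"),
        (10**6, "million"),
        (10**3, "thousand"),
    ]
    for size, name in units:
        if abs(n) >= size:
            # Round to the nearest whole unit. Fine detail is dropped on purpose.
            return f"{round(n / size)} {name}"
    return str(n)


print(humanize(31415926535))
```

Rounding to a whole unit loses precision, so keep the exact value in your code and format only at the point where the number goes into the report.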