This is a research-oriented class on human-centered aspects of data management and analysis across the end-to-end data science/AI lifecycle. The class will entail reading and discussing classical and modern research papers in this space, and students will undertake a research project in the same area. Students taking the class should have taken a database or data engineering class, at the level of INFO 258 / DATA 101 / COMPSCI 186, and/or have experience working with database or data engineering tools.
This class emphasizes the central role of humans in data management. We will collectively explore a range of research papers in this space, drawn from the leading data management and human-computer interaction (HCI)/visualization venues. We will cover a range of technologies and methodologies.
From a technology standpoint, we will explore some or all of the following, time permitting: visual analytics systems, visualization recommendation systems, spreadsheet systems, data cleaning and transformation systems, notebook-centric analysis tools, SQL query builders, explanation and provenance systems, data discovery systems, gestural interfaces, approximate query processing systems, predictive materialization systems, speech and natural language querying systems, text and video analysis systems, and semi-structured data systems. We will also explore papers that study human perception and behavior as they pertain to data management. The emphasis will be on a mix of human-centric concerns, interface ideas, and scalable data processing ideas (all with an eye towards the end user).
This class has multiple goals. First, for those for whom this is a first exposure to research papers in data management and HCI, we will learn how to read and critically evaluate research papers from multiple perspectives (more on this later). Second, we will learn about multiple state-of-the-art techniques, exploring the design of novel interfaces for data management concerns and the design of novel scalability techniques that keep humans at the center. Third, we will learn about the process of designing, validating, and evaluating ideas in this space.
The class will center around discussion. We will be employing a role-playing approach to the class. More on this here.
Since this is the first time this class is being offered, and the first time we're trying out a role-playing activity, everything, including the breakdown of topics and grading, is tentative. Please expect hiccups if you're taking the class, and apologies in advance for any issues!
Date | Topic | Materials | Notes |
---|---|---|---|
8/24 | Introduction | Slides | |
8/29 | No Class: VLDB | | |
8/31 | Primers: Reading Papers and Visualization | Papers Primer, Visualization Primer | |
9/5 | Polaris: A System for Query, Analysis, and Visualization of Multidimensional Relational Databases | Paper Author, Industry Practitioner, Academic Researcher, Additional Discussion | |
9/7 | Expressive Time-Series Querying by Hand-drawn Visual Sketches | Paper Author & Additional Discussion, Academic Researcher | |
9/12 | SeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics | Paper Author, Archaeologist, Industry Practitioner, Academic Researcher | |
9/14 | Falcon: Balancing Interactive Latency and Resolution Sensitivity for Scalable Linked Visualizations | Scalability Primer, Paper Author, Academic Researcher, Industry Practitioner, Peer Reviewer | |
9/19 | Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data | Paper Author, Archaeologist, Additional Discussion | Project Proposal Due 19th |
9/21 | Benchmarking Spreadsheet Systems | HCI Primer, Paper Author | |
9/26 | Sigma Worksheet: Interactive Construction of OLAP Queries | Paper Author, Peer Reviewer, Academic Researcher | |
9/28 | Wrangler: interactive visual specification of data transformation scripts | Paper Author, Additional Discussion | |
10/3 | Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment | Paper Author, Industry Practitioner, Additional Discussion | |
10/5 | Gestural Query Specification | Paper Author, Additional Discussion | |
10/10 | DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization | Intermediate Report & Paper Author, Academic Researcher, Industry Practitioner | |
10/12 | SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data | Paper Author, Archaeologist | |
10/17 | Interactive Browsing and Navigation in Relational Databases | Paper Author | |
10/19 | DataPlay: Interactive Tweaking and Example-driven Correction of Graphical Database Queries | Paper Author, Industry Practitioner | |
10/24 | Lux: Always-on Visualization Recommendations | Discussion, Paper Author | |
10/26 | mage: Fluid Moves Between Code and Graphical Work in Computational Notebooks | Discussion, Paper Author, Industry Practitioner, Archaeologist | |
10/31 | BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data | Discussion, Paper Author, Academic Researcher | |
11/2 | AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics | Paper Author | Intermediate Report Due 2nd |
11/7 | Distributed and Interactive Cube Exploration | | |
11/9 | Dremel: Interactive Analysis Of Web-Scale Datasets | Discussion, Paper Author, Archaeologist | |
11/14 | Scorpion: Explaining Away Outliers in Aggregate Queries | Paper Author | |
11/16 | Macrobase: Prioritizing attention in fast data | Paper Author, Industry Practitioner | |
11/21 | Can Foundation Models Wrangle Your Data? | Paper Author, Peer Reviewer, Academic Researcher | |
11/23 | Thanksgiving Holiday | | |
11/28 | Final Project Presentations (8-11) | Projects Listing | |
11/30 | No Class | | Final Project Report Due 12/3 |
The official class requirements state "Students taking the class should have taken a database or data engineering class, at the level of INFO 258 / DATA 101 / COMPSCI 186, and/or have experience working with database or data engineering tools."
Given the varied backgrounds of students coming into this class, we are willing to be flexible in this regard, but we expect that you either know the material or can pick up missing pieces as we go along, at least well enough to take part in the paper discussions, presentations, and reviews, as well as the research project.
We expect that you have experience with data science tooling (such as dataframe and visualization libraries, and computational notebooks) and with databases (from both a query language standpoint and a performance standpoint: query optimization, materialized view maintenance, indexing). If you struggle with SQL or have not heard of query optimization, I encourage you to take a database class first. Experience with research or reading research papers is not a must.
For any class where you are not part of the role-playing activity, you may submit a brief review of the paper. You must use the following link to submit class reviews: Link.
Remember to cover the 5 key questions, all within 500 words: what is the problem, why is it important, what sets it apart from previous work, what are the key technical ideas, and what are the main areas of improvement and open issues.
The class reviews must be submitted by midnight the day before class. No late submissions will be accepted. These reviews will be lightly graded: we just want to make sure you've read the paper closely enough to contribute in class.
You need to submit 10 reviews over the semester, for papers where you are not actively playing one of the presenter roles.
This class involves multiple roles adapted originally from here. The goal of these multiple roles is to ensure that the class is organized not in a one-to-many format (one speaker, many passive listeners), but in a many-to-many format (many speakers, many active listeners).
The roles in our class are the following: paper author, peer reviewer, archaeologist, academic researcher, and industry practitioner.
The paper author role involves a 15-20 minute presentation in the style of a conference talk. Convey what's interesting about the paper: what is the domain, what is the problem, what is wrong with prior work, what is the approach advocated by the paper, how well does it do, etc. This is done either solo or in groups of two.
The other four roles are 5 minutes each. Each starts with a one-slide summary.
More details on the breakdown and how to sign up will be announced shortly.
As part of this class, you need to complete a semester-long project. Details will be announced shortly. We encourage you to look for ideas in your domain of expertise: for instance, if you work in computational journalism, building a new way to browse and manage large collections of textual archives could be a perfectly reasonable project. Either way, you must speak to the instructor to verify that the project fulfils the needs of the class.
One interesting avenue for projects is to revisit data management problems (including papers in the list above) with LLMs in the loop. Aspects of interest would include thinking of other interaction modalities (beyond chat), human verification/validation of system interpretations, and dealing with brittleness and hallucinations.
We intend for this class to be a safe, accepting, and inclusive environment where everyone, independent of background, can not just thrive, but flourish. We believe in strength in diversity, and that everyone is on a journey of continual “un”-learning and self-reflection to identify and address implicit biases. If there are aspects of your experience in class that violate these principles, please reach out and let the instructors know.
We will not be taking attendance in class, so as we get into flu/COVID season, you may skip classes for genuine medical reasons to protect your fellow classmates. While we are incentivizing participation, participating in a reasonable fraction of the classes is sufficient from a grading standpoint. If you are presenting one of the roles, please inform the instructor ASAP if you are unable to make it to class for a medical reason.
Paper readings are heavily inspired by Eugene Wu (Columbia)'s Human-Data Interaction class. The role-playing class model is heavily inspired by Kexin Rong (Georgia Tech)'s Human-in-the-loop Data Analysis class.