Info290T: Human-Centered Data Management

This is a research-oriented class on human-centered aspects in data management and analysis across the end-to-end data science/AI lifecycle. The class will entail reading and discussion of classical and modern research papers in this space. As part of this class, students will undertake a research project in this space. Students taking the class should have taken a database or data engineering class, at the level of INFO 258 / DATA 101 / COMPSCI 186, and/or have experience working with database or data engineering tools.

Description

This class emphasizes the central role of humans in data management. We will collectively explore a range of research papers in this space, drawn from the leading data management and human-computer interaction (HCI)/visualization venues. We will cover a range of technologies and methodologies.

From a technology standpoint, we will explore some or all of the following, time permitting: visual analytics systems, visualization recommendation systems, spreadsheet systems, data cleaning and transformation systems, notebook-centric analysis tools, SQL query builders, explanation and provenance systems, data discovery systems, gestural interfaces, approximate query processing systems, predictive materialization systems, speech and natural language querying systems, text and video analysis systems, semi-structured data systems. We will also explore papers that study human perception and behavior as it pertains to data management. The emphasis will be on a mix of human-centric concerns, interface ideas, and scalable data processing ideas (all with an eye towards the end user).

This class has multiple goals. First, for those for whom this is a first exposure to research papers in data management and HCI, we will learn how to read and critically evaluate research papers from multiple perspectives (more on this later). Second, we will learn about multiple state-of-the-art techniques. We will explore the design of novel interfaces for data management concerns and the design of novel scalability techniques that put humans at the center. Third, we will learn about the process of designing, validating, and evaluating ideas in this space.

The class will center on discussion, and we will use a role-playing approach. More on this in the instructions for the role-playing activity below.

Since this is the first time this class is being offered, and the first time we're trying out a role-playing activity, everything, including the breakdown of topics and the grading, is tentative. Please expect hiccups if you're taking the class, and apologies in advance for any issues!

Schedule (tentative)

Date | Topic | Materials | Notes
8/24 | Introduction | Slides |
8/29 | No Class: VLDB | |
8/31 | Primers: Reading Papers and Visualization | Papers Primer; Visualization Primer |
9/5 | Polaris: A System for Query, Analysis, and Visualization of Multidimensional Relational Databases | Paper Author; Industry Practitioner; Academic Researcher; Additional Discussion |
9/7 | Expressive Time-Series Querying by Hand-drawn Visual Sketches | Paper Author & Additional Discussion; Academic Researcher |
9/12 | SeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics | Paper Author; Archaeologist; Industry Practitioner; Academic Researcher |
9/14 | Falcon: Balancing Interactive Latency and Resolution Sensitivity for Scalable Linked Visualizations | Scalability Primer; Paper Author; Academic Researcher; Industry Practitioner; Peer Reviewer |
9/19 | Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data | Paper Author; Archaeologist; Additional Discussion | Project Proposal Due 19th
9/21 | Benchmarking Spreadsheet Systems | HCI Primer; Paper Author |
9/26 | Sigma Worksheet: Interactive Construction of OLAP Queries | Paper Author; Peer Reviewer; Academic Researcher |
9/28 | Wrangler: interactive visual specification of data transformation scripts | Paper Author; Additional Discussion |
10/3 | Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment | Paper Author; Industry Practitioner; Additional Discussion |
10/5 | Gestural Query Specification | Paper Author; Additional Discussion |
10/10 | DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization | Intermediate Report & Paper Author; Academic Researcher; Industry Practitioner |
10/12 | SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data | Paper Author; Archaeologist |
10/17 | Interactive Browsing and Navigation in Relational Databases | Paper Author |
10/19 | DataPlay: Interactive Tweaking and Example-driven Correction of Graphical Database Queries | Paper Author; Industry Practitioner |
10/24 | Lux: Always-on Visualization Recommendations | Discussion; Paper Author |
10/26 | mage: Fluid Moves Between Code and Graphical Work in Computational Notebooks | Discussion; Paper Author; Industry Practitioner; Archaeologist |
10/31 | BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data | Discussion; Paper Author; Academic Researcher |
11/2 | AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics | Paper Author | Intermediate Report Due 2nd
11/7 | Distributed and Interactive Cube Exploration | |
11/9 | Dremel: Interactive Analysis Of Web-Scale Datasets | Discussion; Paper Author; Archaeologist |
11/14 | Scorpion: Explaining Away Outliers in Aggregate Queries | Paper Author |
11/16 | Macrobase: Prioritizing attention in fast data | Paper Author; Industry Practitioner |
11/21 | Can Foundation Models Wrangle Your Data? | Paper Author; Peer Reviewer; Academic Researcher |
11/23 | Thanksgiving Holiday | |
11/28 | Final Project Presentations (8-11) | Projects Listing |
11/30 | No Class | | Final Project Report Due 12/3

Requirements

The official class requirements state "Students taking the class should have taken a database or data engineering class, at the level of INFO 258 / DATA 101 / COMPSCI 186, and/or have experience working with database or data engineering tools."

Given the varied backgrounds of students coming into this class, we are willing to be flexible in this regard, but we will expect that you will know or be able to pick up missing pieces as we go along — to the extent that it ensures that you can take part in the paper discussions, presentations, and reviews, as well as the research project.

We will expect that you have experience with data science tooling (such as dataframe and visualization libraries, and computational notebooks) and with databases, both from a query language standpoint and in terms of performance aspects (query optimization, materialized view maintenance, indexing). If you struggle with SQL or have not heard of query optimization, I encourage taking a database class first. Experience with research or reading research papers is not a must.

Tentative List of Papers

Theme 1: Data Exploration (6)

Visual Analytics Systems + Next Generation: Visualization Search and Recommendation

Perceptual Approximation

Theme 2: Data Manipulation (4)

Spreadsheets and Direct Manipulation

Data Cleaning and Transformation

Theme 3: Beyond GUIs: Other No-Code Interface Modalities (3)

Touch and Gesture

Natural Language & Speech

Theme 4: No-Code-meets-Code (4)

SQL Query Construction

Computational Notebook Tools

Low-Code Data Manipulation Libraries

Theme 5: Scalability for Humans (5)

Approximate Query Processing

Materialization, Reuse, Prediction

Parallel Data Processing

Surveys and Benchmarks

Theme 6: Going Deeper (3)

Outliers, Explanations, and Provenance

Large Language Models for Data Work

Collaborative Query Processing and Data Discovery

Video Analysis

Machine Learning Systems

Instructions for Submitting Class Reviews

For any class where you are not part of the role-playing activity, you may submit a brief review of the paper. You must use the following link to submit class reviews: Link.

Remember to cover the 5 key questions: what is the problem, why is it important, what sets it apart from previous work, what are the key technical ideas, what are the main areas of improvement and open issues, all within 500 words.

The class reviews must be submitted by midnight the day before class. No late submissions accepted. These reviews will be lightly graded: we just want to make sure you've read the paper enough to contribute in class.

You need to submit 10 reviews over the semester, for papers where you are not actively playing one of the presenter roles.

Instructions for the Role Playing/Presentation Activity

This class involves multiple roles adapted originally from here. The goal of these roles is to ensure that the class is organized not in a one-to-many format (one speaker, many passive listeners), but in a many-to-many format (many speakers, many active listeners).

The roles in our class are the following: paper author, peer reviewer, archaeologist, academic researcher, and industry practitioner.

The paper author role involves a 15-20 minute presentation in the style of a conference talk. Convey what's interesting about the paper: what is the domain, what is the problem, what is wrong with prior work, what is the approach advocated by the paper, how well does it do, etc. This is done either solo or in groups of two.

The other four roles are 5 minutes each. Each starts with a one-slide summary.

  • The peer reviewer presents a full review of the paper, as if for a top venue in the corresponding research area (databases, visualization, HCI, ...). The review summarizes the key contributions and identifies the strong and weak points of the paper.
  • The archaeologist reports on one older paper that the current paper builds on, and one newer paper that the current paper inspired, as a means of situating the paper in the literature. Reporting on more papers is also allowed!
  • The academic researcher proposes one or more follow-on projects that build on the ideas of the paper.
  • The industry practitioner argues for why they should be paid to implement the methods in the paper, and discusses any benefits and risks.

More details on the breakdown and how to sign up will be announced shortly.

Grading Policy

  • Class Participation: 50%
    • Paper Reviews: 15%
      • Due by midnight the day before class. You need to submit at least 10 reviews over the semester. These will be lightly graded. You can't "double-dip": the papers where you are a presenter won't count towards your 10 reviews.
    • Class Participation: 10%
      • Any participation is good participation (within reason!). We want to have a good discussion as part of the presentations. Feel free to chime in with ideas, concerns, questions, discussion points, etc.
    • Paper Presentation: 25%
      • You will play the paper author role roughly 1-2 times, and other accessory roles up to 4-5 times.
  • Research Project: 50%. A semester-long research project on human-centered data management in teams of 2-3 (1 only in the case of exceptions).
    • Project Proposal: 5%
    • Intermediate Report: 10%
    • Final Report + Presentation: 35%
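
As a rough illustration of how the percentages above add up, here is a minimal sketch in Python. It assumes each component is scored from 0 to 100 and that components are combined by a plain weighted sum. The policy lists only the weights, so the combination rule, the function name final_grade, and the example scores are all hypothetical.

```python
# Weights copied from the grading policy above (percent of the final grade).
WEIGHTS = {
    "paper_reviews": 15,
    "class_participation": 10,
    "paper_presentation": 25,
    "project_proposal": 5,
    "intermediate_report": 10,
    "final_report_and_presentation": 35,
}

# Sanity check: the six components should account for the whole grade.
assert sum(WEIGHTS.values()) == 100


def final_grade(scores):
    """Combine per-component scores (each 0-100) into a weighted total (0-100).

    Assumption: a simple weighted sum; the syllabus does not specify this.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for one student, for illustration only.
example_scores = {
    "paper_reviews": 100,
    "class_participation": 90,
    "paper_presentation": 80,
    "project_proposal": 100,
    "intermediate_report": 90,
    "final_report_and_presentation": 85,
}

print(final_grade(example_scores))  # prints 87.75
```

In this sketch the three class participation components contribute 50 points and the three project components contribute the other 50, matching the two top-level items in the policy.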

Project

As part of this class, you need to complete a semester-long project. Details will be announced shortly. We encourage you to look for ideas in your domain of expertise: for instance, if you work in computational journalism, building a new way to browse and manage large collections of textual archives could be a perfectly reasonable project. Either way, you must speak to the instructor to verify that the project fulfils the needs of the class.

One interesting avenue for projects is to revisit data management problems (including papers in the list above) with LLMs in the loop. Aspects of interest would include thinking of other interaction modalities (beyond chat), human verification/validation of system interpretations, and dealing with brittleness and hallucinations.

Final Project Slides

  • SPADE: A System for Prompt Analysis and Delta-Based Evaluation
    [Slide] Shreya Shankar
  • Code Generation for Data Wrangling Tasks
    [Slide] Sahil Bhatia
  • VisualLing: Enhancing User Interaction and Aesthetics in LLM-Based Visualization Systems
    [Slide] Chandramita Dutta, Chirag Manghani
  • GENIE in a Notebook: Speech-Based Code Generation in Computational Notebooks
    [Slide] Alice Yeh
  • LLM-Powered Semantic Dataset Search Engine
    [Slide] Wenjing Lin
  • Supporting Human-in-the-Loop Evaluation of Biased Image Embeddings in Vector Databases through Configurable Visualizations
    [Slide] Ankita Suresh Shanbhag

Statement of Values

We intend for this class to be a safe, accepting, and inclusive environment where everyone, independent of background, can not just thrive, but flourish. We believe in strength in diversity, and that everyone is on a journey of continual “un”-learning and self-reflection to identify and address implicit biases. If there are aspects of your experience in class that violate these principles, please reach out and let the instructors know.

Medical Disruptions

We will not be taking attendance in class, so, as we get into a flu/COVID season, you may skip classes for genuine medical reasons—to protect and safeguard your fellow classmates. While we are incentivizing participation, as long as you participate in a reasonable fraction of the classes, that is sufficient from a grading standpoint. If you are a presenter of one of the roles, please inform the instructor ASAP if you are unable to make it to class due to a medical reason.

Credits

The paper readings are heavily inspired by the Human-Data Interaction class taught by Eugene Wu (Columbia). The role-playing class model is heavily inspired by the Human-in-the-loop Data Analysis class taught by Kexin Rong (Georgia Tech).