Info290T: Human-Centered Data Management

This is a research-oriented class on human-centered aspects in data management and analysis across the end-to-end data science/AI lifecycle. The class will entail reading and discussion of classical and modern research papers in this space. As part of this class, students will undertake a research project in this space. Students taking the class should have taken a database or data engineering class, at the level of INFO 258 / DATA 101 / COMPSCI 186, and/or have experience working with database or data engineering tools.

Description

This class emphasizes the central role of humans in data management. We will collectively explore a range of research papers in this space, drawn from the leading data management and human-computer interaction (HCI)/visualization venues. We will cover a range of technologies and methodologies.

From a technology standpoint, we will explore some or all of the following, time permitting: visual analytics systems, visualization recommendation systems, spreadsheet systems, data cleaning and transformation systems, notebook-centric analysis tools, SQL query builders, explanation and provenance systems, data discovery systems, gestural interfaces, approximate query processing systems, predictive materialization systems, speech and natural language querying systems, text and video analysis systems, and semi-structured data systems. We will also explore papers that study human perception and behavior as they pertain to data management. The emphasis will be on a mix of human-centric concerns, interface ideas, and scalable data processing ideas (all with an eye towards the end user).

This class has multiple goals. First, for those for whom this is a first exposure to research papers in data management and HCI, we will learn how to read and critically evaluate research papers from multiple perspectives (more on this later). Second, we will learn about multiple state-of-the-art techniques: we will explore the design of novel interfaces for data management concerns and the design of novel scalability techniques that keep humans at the center. Third, we will learn about the process of designing, validating, and evaluating ideas in this space.

The class will center around discussion. We will be employing a role-playing approach to the class. More on this here.

Since this is the first time this class is being offered, and the first time we're trying out a role-playing activity, everything, including the breakdown of topics and grading, is tentative; please expect hiccups if you're taking the class, and apologies in advance for any issues!

Schedule (tentative)

Date	Topic	Materials	Notes
8/24	Introduction	Slides
8/29			No Class: VLDB
8/31	Primers: Reading Papers and Visualization	Papers Primer Visualization Primer
9/5	Polaris: A System for Query, Analysis, and Visualization of Multidimensional Relational Databases	Paper Author Industry Practitioner Academic Researcher Additional Discussion
9/7	Expressive Time-Series Querying by Hand-drawn Visual Sketches	Paper Author & Additional Discussion Academic Researcher
9/12	SeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics	Paper Author Archaeologist Industry Practitioner Academic Researcher
9/14	Falcon: Balancing Interactive Latency and Resolution Sensitivity for Scalable Linked Visualizations	Scalability Primer Paper Author Academic Researcher Industry Practitioner Peer Reviewer
9/19	Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data	Paper Author Archaeologist Additional Discussion	Project Proposal Due 19th
9/21	Benchmarking Spreadsheet Systems	HCI Primer Paper Author
9/26	Sigma Worksheet: Interactive Construction of OLAP Queries	Paper Author Peer Reviewer Academic Researcher
9/28	Wrangler: interactive visual specification of data transformation scripts	Paper Author Additional Discussion
10/3	Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment	Paper Author Industry Practitioner Additional Discussion
10/5	Gestural Query Specification	Paper Author Additional Discussion
10/10	DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization	Intermediate Report & Paper Author Academic Researcher Industry Practitioner
10/12	SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data	Paper Author Archaeologist
10/17	Interactive Browsing and Navigation in Relational Databases	Paper Author
10/19	DataPlay: Interactive Tweaking and Example-driven Correction of Graphical Database Queries	Paper Author Industry Practitioner
10/24	Lux: Always-on Visualization Recommendations	Discussion Paper Author
10/26	mage: Fluid Moves Between Code and Graphical Work in Computational Notebooks	Discussion Paper Author Industry Practitioner Archaeologist
10/31	BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data	Discussion Paper Author Academic Researcher
11/2	AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics	Paper Author	Intermediate Report Due 2nd
11/7	Distributed and Interactive Cube Exploration
11/9	Dremel: Interactive Analysis Of Web-Scale Datasets	Discussion Paper Author Archaeologist
11/14	Scorpion: Explaining Away Outliers in Aggregate Queries	Paper Author
11/16	Macrobase: Prioritizing attention in fast data	Paper Author Industry Practitioner
11/21	Can Foundation Models Wrangle Your Data?	Paper Author Peer Reviewer Academic Researcher
11/23	Thanksgiving Holiday
11/28	Final Project Presentations (8-11)	Projects Listing
11/30	No Class	Final Project Report Due 12/3

Requirements

The official class requirements state "Students taking the class should have taken a database or data engineering class, at the level of INFO 258 / DATA 101 / COMPSCI 186, and/or have experience working with database or data engineering tools."

Given the varied backgrounds of students coming into this class, we are willing to be flexible in this regard, but we will expect that you will know or be able to pick up missing pieces as we go along — to the extent that it ensures that you can take part in the paper discussions, presentations, and reviews, as well as the research project.

We will expect that you have experience with data science tooling (such as dataframe and visualization libraries, and computational notebooks) and with databases (from both a query language standpoint and a performance standpoint: query optimization, materialized view maintenance, indexing). If you struggle with SQL or have not heard of query optimization, I encourage taking a database class first. Experience with research or reading research papers is not a must.

Tentative List of Papers

Theme 1: Data Exploration (6)

Visual Analytics Systems + Next Generation: Visualization Search and Recommendation

Perceptual Approximation

Theme 2: Data Manipulation (4)

Spreadsheets and Direct Manipulation

Data Cleaning and Transformation

Theme 3: Beyond GUIs: Other No-Code Interface Modalities (3)

Touch and Gesture

Gestural Query Specification
Optional: dbTouch: Analytics at your Fingertips
Optional: PanoramicData: Data Analysis through Pen & Touch

Natural Language & Speech

Theme 4: No-Code-meets-Code (4)

SQL Query Construction

Computational Notebook Tools

Low-Code Data Manipulation Libraries

Theme 5: Scalability for Humans (5)

Approximate Query Processing

Materialization, Reuse, Prediction

Parallel Data Processing

Dremel: Interactive Analysis Of Web-Scale Datasets
Optional: The Snowflake Elastic Data Warehouse
Optional: Spark SQL: Relational Data Processing in Spark
Optional: DuckDB: an Embeddable Analytical Database

Surveys and Benchmarks

Theme 6: Going Deeper (3)

Outliers, Explanations, and Provenance

Large Language Models for Data Work

Collaborative Query Processing and Data Discovery

Optional: Finding Related Tables in Data Lakes for Interactive Data Science
Optional: Google Fusion Tables: Web-Centered Data Management and Collaboration
Optional: The Case for a Collaborative Query Management System

Video Analysis

Machine Learning Systems

Optional: MAD Skills: New Analysis Practices for Big Data
Optional: MLbase: A Distributed Machine-learning System
Optional: Towards a Unified Architecture for in-RDBMS Analytics
Optional: GraphLab: A New Framework For Parallel Machine Learning

Instructions for Submitting Class Reviews

For any class where you are not part of the role-playing activity, you may submit a brief review of the paper. You must use the following link to submit class reviews: Link.

Remember to cover the 5 key questions: what is the problem, why is it important, what sets it apart from previous work, what are the key technical ideas, and what are the main areas of improvement and open issues. Keep the review within 500 words.

The class reviews must be submitted by midnight the day before class. No late submissions accepted. These reviews will be lightly graded: we just want to make sure you've read the paper enough to contribute in class.

You need to submit 10 reviews over the semester, for papers where you are not actively playing one of the presenter roles.

Instructions for the Role Playing/Presentation Activity

This class involves multiple roles adapted originally from here. The goal of these multiple roles is to ensure that the class is organized not in a one-to-many format (one speaker, many passive listeners) but in a many-to-many format (many speakers, many active listeners).

The roles in our class are the following: paper author, peer reviewer, archaeologist, academic researcher, and industry practitioner.

The paper author role involves a 15-20 minute presentation in the style of a conference talk. Convey what's interesting about the paper: what is the domain, what is the problem, what is wrong with prior work, what is the approach advocated by the paper, how well does it do, etc. This is done either solo or in groups of two.

The other four roles are 5 minutes each. Each starts with a one-slide summary.

The peer reviewer presents a full review of the paper, as if for a top venue in the corresponding research area (databases, visualization, HCI, ...). The review summarizes the key contributions and identifies the strong and weak points of the paper.
The archaeologist reports on one older paper that the current paper builds on, and one new paper that the current paper inspires, as a means of situating the paper in the literature. Reporting on more papers is also allowed!
The academic researcher proposes one or more follow-on projects that build on the ideas of the paper.
The industry practitioner tries to make an argument for why they should be paid to implement the methods in the paper — and to discuss any benefits and risks.

More details on breakdown, how to sign up, to be announced shortly.

Grading Policy

Class Participation: 50%

Paper Reviews: 15%

Due at midnight the day before class. You need to submit at least 10 reviews over the semester. Reviews will be lightly graded. You can't "double-dip": the papers where you are a presenter won't count towards your 10 reviews.

Class Participation: 10%

Any participation is good participation (within reason!). We want to have a good discussion as part of the presentations. Feel free to chime in with ideas, concerns, questions, discussion points, etc.

Paper Presentation: 25%

You will play the paper author role roughly 1-2 times, and other accessory roles up to 4-5 times.

Research Project: 50%. A semester-long research project on human-centered data management in teams of 2-3 (1 only in the case of exceptions).

Project Proposal: 5%
Intermediate Report: 10%
Final Report + Presentation: 35%

Project

As part of this class, you need to complete a semester-long project. Details will be announced shortly. We encourage you to look for ideas in your domain of expertise: for instance, if you work in computational journalism, building a new way to browse and manage large collections of textual archives could be a perfectly reasonable project. Either way, you must speak to the instructor to verify that the project fulfils the needs of the class.

One interesting avenue for projects is to revisit data management problems (including papers in the list above) with LLMs in the loop. Aspects of interest would include thinking of other interaction modalities (beyond chat), human verification/validation of system interpretations, and dealing with brittleness and hallucinations.

Final Project Slides

SPADE: A System for Prompt Analysis and Delta-Based Evaluation
[Slide] Shreya Shankar
Code Generation for Data Wrangling Tasks
[Slide] Sahil Bhatia
VisualLing: Enhancing User Interaction and Aesthetics in LLM-Based Visualization Systems
[Slide] Chandramita Dutta, Chirag Manghani
GENIE in a Notebook: Speech-Based Code Generation in Computational Notebooks
[Slide] Alice Yeh
LLM-Powered Semantic Dataset Search Engine
[Slide] Wenjing Lin
Supporting Human-in-the-Loop Evaluation of Biased Image Embeddings in Vector Databases through Configurable Visualizations
[Slide] Ankita Suresh Shanbhag

Statement of Values

We intend for this class to be a safe, accepting, and inclusive environment where everyone, independent of background, can not just thrive, but flourish. We believe in strength in diversity, and that everyone is on a journey of continual “un”-learning and self-reflection to identify and address implicit biases. If there are aspects of your experience in class that violate these principles, please reach out and let the instructors know.

Medical Disruptions

We will not be taking attendance in class, so, as we get into flu/COVID season, you may skip classes for genuine medical reasons, to protect and safeguard your fellow classmates. While we are incentivizing participation, as long as you participate in a reasonable fraction of the classes, that is sufficient from a grading standpoint. If you are presenting one of the roles, please inform the instructor ASAP if you are unable to make it to class due to a medical reason.

Credits

Paper readings are heavily inspired by Eugene Wu (Columbia)'s Human-Data Interaction class. The role-playing class model is heavily inspired by Kexin Rong (Georgia Tech)'s Human-in-the-loop Data Analysis class.