Shuyang Li

Shuyang Li, PhD

Research Scientist at Meta

shuyangli94 [at] GMAIL
( he / him )

I earned my Ph.D. and M.S. at UC San Diego, advised by Julian McAuley. Before that, I received my B.S.E. from Princeton University in Operations Research and Financial Engineering (ORFE).

My research is centered on Natural Language Generation, Interactive AI, and Conversational Agents.




Recent News

  • [Aug. 2022]: Joined Meta as a Research Scientist in New York
  • [June 2022]: Defended my PhD dissertation: Personalizing Interactive Agents
  • [Oct. 2021]: Joined Salesforce Research in Palo Alto (Remote) as a research intern for winter 2021
  • [June 2021]: Joined Amazon Alexa in NYC (Remote) again as a research intern for summer 2021
  • [Sep. 2020]: Received the 2020 Qualcomm Innovation Fellowship for our proposal on conversational recommender systems
  • [June 2020]: Joined Amazon Alexa in NYC (Remote) as a research intern for summer 2020
  • [June 2019]: Joined Google as a summer intern with the Kaggle Datasets team, working on metadata extraction and data discoverability
  • [June 2019]: Our team was selected as one of 10 finalists in the 2019 Alexa Prize! News coverage
  • [Sep. 2018]: Started my PhD at UCSD, studying applied machine learning, recommender systems and NLP under Prof. Julian McAuley
  • [June 2017]: Joined Bloomberg LP as a Senior Software Engineer in the Structured Products Waterfall team
  • [July 2016]: Joined Goldman Sachs as a Technology Analyst in the Operations Automation and Analytics team
  • [June 2016]: Graduated from Princeton University with a BSE in Operations Research and Financial Engineering

Shuyang Li

Research and Experience

My research focuses on the intersection of natural language processing (NLP) and recommender systems. I explore personalization in natural language generation and question-answering, subjectivity in knowledge bases, and mining the wealth of data available publicly online. My previous research includes structured text generation, sentiment analysis, and spatial clustering in neuroscience.

My research is centered on Natural Language Generation, Interactive AI, and Conversational Agents.

I also focus on making large-scale, useful and interesting datasets available for research, hosted at my pages on Kaggle and HuggingFace.

I have been a PC/reviewer for: ICLR, NeurIPS, ICML, EMNLP, NAACL-HLT, WWW, KDD, RecSys, AAAI, and ACL Rolling Review.

Selected research projects are listed below. For a complete list, see my Google Scholar page.

Quick links to sections: publications, workshop/other projects.


Publications

* denotes joint authorship and equal contribution

Self-Supervised Bot Play for Transcript-Free Conversational Critiquing with Rationales

Self-Supervised Bot Play for Transcript-Free Conversational Critiquing with Rationales
Shuyang Li, Bodhisattwa Prasad Majumder, Julian McAuley
2024 ACM TORS, 2022 RecSys
doi (ACM TORS) | pdf (RecSys) | q&a | hub

We introduce a framework for training interactive recommender systems for multi-turn conversational recommendation using self-supervised bot play grounded in review data. Experiments on three real-world datasets demonstrate that our method is model-agnostic and allows simple matrix factorization and linear recommender systems to outperform existing state-of-the-art techniques for conversational recommendation.

SumCSE: Summary as a transformation for Contrastive Learning

SumCSE: Summary as a transformation for Contrastive Learning
Raghuveer Thirukovalluru, Xiaolan Wang, Jun Chen, Shuyang Li, Jie Lei, Rong Jin, Bhuwan Dhingra
2024 NAACL (Findings)
pdf

We present SumCSE, a method to compose several sentence transformation methods (summarization, paraphrasing, contradictions) to train sentence embeddings via contrastive learning. Sentence representations trained using our compositional method improve Semantic Textual Similarity (STS) performance compared to previous SOTA methods (SimCSE, SynCSE).

Assistive Recipe Editing through Critiquing

Assistive Recipe Editing through Critiquing
Diego Antognini, Shuyang Li, Boi Faltings, Julian McAuley
2023 EACL
pdf

We present RecipeCrit, a hierarchical denoising auto-encoder that edits recipes given ingredient-level critiques. The model is trained for recipe completion to learn semantic relationships within recipes. Our work's main innovation is our unsupervised critiquing module that allows users to edit recipes by interacting with the predicted ingredients; the system iteratively rewrites recipes to satisfy users' feedback. Experiments on the Recipe1M recipe dataset show that our model can more effectively edit recipes compared to strong language-modeling baselines, creating recipes that satisfy user constraints and are more correct, serendipitous, coherent, and relevant as measured by human judges.

SHARE: a System for Hierarchical Assistive Recipe Editing

SHARE: a System for Hierarchical Assistive Recipe Editing
Shuyang Li, Yufei Li, Jianmo Ni, Julian McAuley
2022 EMNLP
pdf | video | data

We introduce SHARE: a System for Hierarchical Assistive Recipe Editing to assist home cooks with dietary restrictions, a population under-served by existing cooking resources. Our hierarchical recipe editor makes necessary substitutions to a recipe's ingredients list and rewrites the directions to make use of the new ingredients. We show that recipe editing is a challenging task that cannot be adequately solved with human-written ingredient substitution rules or straightforward adaptation of state-of-the-art models for recipe generation.

Instilling type knowledge in language models via multi-task QA

Instilling type knowledge in language models via multi-task QA
Shuyang Li, Mukund Sridhar, Chandana Satya Prakash, Jin Cao, Wael Hamza, Julian McAuley
2022 NAACL (Findings)
pdf | data

We introduce a novel multi-task QA framework for pre-training language models on joint articles and knowledge bases to recognize types, entities, and their relationship in a type ontology. We demonstrate that generative language models trained using our framework can a) achieve up to 14.9% (absolute) / 49.4% (relative) improvement in zero-shot dialog state tracking (DST) over current state-of-the-art, b) infer types for previously unseen entities better than other pre-trained language models (+16.7 F1), and c) discover novel complex and meaningful types. We also release the WikiWiki dataset linking 10M Wikipedia articles with 41K unique types from the Wikidata knowledge graph.

Zero-shot Generalization in Dialog State Tracking through Generative Question Answering

Zero-shot Generalization in Dialog State Tracking through Generative Question Answering
Shuyang Li, Jin Cao, Mukund Harakere Sridhar, Henry Zhu, Daniel Li, Wael Hamza, Julian McAuley
2021 EACL
pdf

We introduce a novel ontology-free framework that supports natural language queries for unseen constraints and slots in multi-domain task-oriented dialogs. Our approach is based on generative question-answering using a conditional language model pre-trained on substantive English sentences. Our model improves joint goal accuracy in zero-shot domain adaptation settings by up to 9% (absolute) over the previous state-of-the-art on the MultiWOZ 2.1 dataset.

Interview: Large-scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding

Interview: Large-scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding
Bodhisattwa P. Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
2020 EMNLP
pdf | data | code

We perform the first large-scale analysis of discourse in media dialog and its impact on generative modeling of dialog turns, with a focus on interrogative patterns and use of external knowledge. We introduce Interview—a large-scale (105K conversations) media dialog dataset collected from news interview transcripts—which allows us to investigate such patterns at scale.

Speech Recognition and Multi-Speaker Diarization of Long Conversations

Speech Recognition and Multi-Speaker Diarization of Long Conversations
Henry Huanru Mao, Shuyang Li, Julian McAuley, Gary W. Cottrell
2020 INTERSPEECH
pdf | data

We compare separate and joint frameworks for speech recognition (ASR) and speaker diarization in a multi-speaker setting, showing that joint models can perform well in settings without known utterance bounds. We release a dataset to support multi-speaker ASR and diarization, drawn from a weekly radio program: This American Life.

Interview: A Large-Scale Open-Source Corpus of Media Dialog

Interview: A Large-Scale Open-Source Corpus of Media Dialog
Bodhisattwa P. Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
2020 ArXiv preprint (CoRR)
pdf | data

We introduce a large-scale dataset of media dialog: interview-style transcripts from NPR talk shows. We demonstrate its usefulness for dialog modeling and its special traits and dialog structure when compared to existing small-scale spontaneous dialog transcriptions and written proxies for conversation.

Generating Personalized Recipes from Historical User Preferences (EMNLP 2019)

Generating Personalized Recipes from Historical User Preferences
Bodhisattwa P. Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
2019 EMNLP
pdf | code | data | news 1 2 3

We propose the task of personalized recipe generation: expanding a name and incomplete ingredient details into a complete natural-text recipe instruction set, aligned with a user's historical preferences; we present a model that attends on user activity traces to solve this task.

Molecular Pain 2012 TRP Channel Clustering

Expression of the transient receptor potential channels TRPV1, TRPA1 and TRPM8 in mouse trigeminal primary afferent neurons innervating the dura
Dongyue Huang, Shuyang Li, Ajay Dhaka, Gina M Story, Yu-Qing Cao
Molecular Pain 2012, 8:66
pdf

We study the size and clustering of TRP channel-expressing trigeminal neurons that innervate the dura. Our results suggest that TRPV1 and TRPA1, but not TRPM8, channels likely contribute to the excitation of dural afferent neurons and the subsequent activation of the headache circuit. These results provide an anatomical basis for further understanding the functional significance of TRP channels in headache pathophysiology.



Workshops and Articles

Variable Bitrate Discrete Neural Representations via Causal Self-Attention

Variable Bitrate Discrete Neural Representations via Causal Self-Attention
Shuyang Li*, Huanru Henry Mao*, Julian McAuley
2021 NeurIPS Workshop on Pre-registration in Science
pdf | poster | video

Generative modeling across text, video, and image domains has benefited from the introduction of discrete (quantized) representations for continuous data. We aim to learn a single model able to produce discrete representations at different granularities (bitrates). We propose a framework based on the Perceiver IO architecture, incorporating causal attention to learn ordered latent codes that can then be adaptively pruned to a target compression rate.

Bernard: A Stateful Neural Open-domain Socialbot

Bernard: A Stateful Neural Open-domain Socialbot
Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Huanru Henry Mao, Sophia Sun, Julian McAuley
Alexa Prize Grand Challenge 3 Proceedings (2019)
pdf | press

We propose Bernard: a framework for an engaging open-domain socialbot. We explore various strategies to generate coherent dialog given an arbitrary dialog history. We incorporate a stateful autonomous dialog manager using non-deterministic finite automata to control multi-turn conversations.

Recipes for Success: Data Science in the Home Kitchen (HDSR 2019)

Recipes for Success: Data Science in the Home Kitchen
Shuyang Li, Julian McAuley
Harvard Data Science Review
pdf | link

We survey the history of data science and machine learning methods that help home cooks find and make nutritious, delicious meals. This article explores domains including recipe recommendation, recipe generation, image-to-recipe retrieval, and assistive technologies that integrate with the cooking process.

Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models (2015-16)

Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models
Shuyang Li, Xiaoyan Li
IEEE Undergraduate Research Technology Conference 2016, Princeton Senior Thesis 2016
pdf | poster | data

This was my senior thesis at Princeton University, supervised by Dr. Xiaoyan Li. Here, we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on subjectivity analysis and sentence position features. In addition, I created a manually labeled set of subjective and summary sentences for 2000 reviews in the Cornell IMDB movie review corpus.

Learning Rate Analysis for Temporal-Difference Learning (2014)

Learning Rate Analysis for Temporal-Difference Learning
Work at CASTLE Labs, supervised by Warren Powell
Summer 2014
pdf | framework

The objective was to find optimal policies in an energy allocation problem. The model contained a battery, a power grid, wind energy, and demand, with the latter three modeled as stochastic variables. I investigated the performance of Q-Learning and SARSA under different learning rates.
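
For reference, a minimal Python sketch of the two temporal-difference update rules whose learning-rate sensitivity was compared; the toy MDP, step sizes, and random transitions below are illustrative stand-ins, not the energy-allocation model from the project.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 3  # toy MDP dimensions, purely illustrative

    def q_learning_update(Q, s, a, r, s_next, alpha, gamma=0.95):
        # Off-policy TD target: bootstrap from the greedy next action
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma=0.95):
        # On-policy TD target: bootstrap from the action actually taken
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    # Sweep a few constant step sizes and watch how the value estimates evolve;
    # swapping in sarsa_update would also require sampling the next action here.
    for alpha in (0.01, 0.1, 0.5):
        Q = np.zeros((n_states, n_actions))
        s = 0
        for _ in range(1000):
            a = rng.integers(n_actions)      # random behavior policy
            s_next = rng.integers(n_states)  # stand-in transition
            r = float(rng.normal())          # stand-in reward
            q_learning_update(Q, s, a, r, s_next, alpha)
            s = s_next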


Shuyang Li

Work Experience

I have worked as a research scientist and software engineer in the tech and finance industries. My work focuses on applying AI and machine learning to improve consumer products (via personalization, knowledge grounding, and data quality) and on democratizing access to these technologies.

My work is centered on Natural Language Generation, Interactive AI, and Conversational Agents.

Work Experience

Meta AI: NLP (2022 August - Present)

Meta: Natural Language Processing
Research Scientist

Heterogeneous Knowledge Graphs, Text Generation (August 2022 - Present)

Salesforce AI: NLP (Winter 2021-22)

Salesforce AI: Natural Language Processing
Research Scientist Intern

Meta-Learning for Low-Resource Dialog Agents (October 2021 - February 2022)

Alexa NLU - Amazon Alexa AI (2020 & 2021 Summers)

Amazon Alexa: Alexa Natural Language Understanding
Applied Scientist Intern

Concept-centric Pre-training for Language Modeling (June 2021 - October 2021)
At Alexa NLU, I researched ways to instill knowledge about entities and types in large language models. I investigated how such pre-training can be posed as question-answering and how such models enable strong entity typing and generalization performance in dialog state tracking.


Generalizable Dialog State Tracking (June 2020 - October 2020)
At Alexa NLU, I conducted research on generative approaches toward generalizable dialog state tracking of user preferences in a domain adaptation setting. I investigated knowledge transfer in autoregressive language models for question-answering based dialog understanding. I also analyzed common error modalities in preference and belief tracking for task-oriented dialog.

Kaggle - Google Cloud AI (2019 Summer)

Google Cloud AI: Kaggle
Software Engineering Intern

Dataset Categorization and Search (June 2019 - September 2019)
At Kaggle, I built a framework for automatically generating semantic tags for datasets based on free-text metadata. I also implemented metrics for dataset discoverability and search success. By the end of the summer, I had more than doubled the size of our tag ontology and tripled tag coverage across all public datasets on Kaggle.

Bloomberg LP (2017-2018)

Bloomberg: Structured Products
Senior Software Engineer

Data Quality Control Platform (June 2017 - September 2018)
I helped to design and implement a Spark-based infrastructure for high-bandwidth data processing jobs. We wrote a series of applications to wrap common data-access, filtering, and testing paradigms, so other members of the Structured Products group could write test logic as a self-contained plugin and farm the job out to our Spark cluster. Parallelization yielded a 100x+ speedup for some jobs, and the framework abstracted away Spark application boilerplate.


Semi-Structured Text Clustering for Securities (June 2017 - September 2018)
I investigated how various mortgage-backed securities (pools, generics, CMOs) could be clustered by the shape of the data we regularly received for each security. This entailed identifying data shape features for semi-structured text data received for over 2.5 million securities and clustering on time-series data for each security.

Goldman Sachs (2016-2017)

Goldman Sachs: Operations Analytics Strats
Technology Analyst

Machine Learning Platform as a Service (July 2016 - June 2017)
We built a platform for data science and machine learning on top of a centralized data store, to allow anybody to classify and regress on arbitrary datasets without needing a deep background in programming for machine learning. One client was an internal data platforms team, which used our platform to better predict expected runtimes for their data ingestion workflow.


FIX Market Data Pipeline (July 2016 - August 2017)
This was my project for Grey Wolf with Goldman Sachs Asset Management - Fixed Income (GSAM FI). Over 6 weeks, we built a system to consume FIX messages from several marketplaces, parse them, and store the messages in a database. Our application allowed the GSAM FI team to consume a vastly greater quantity of market messages and made that data available for further analysis by traders. It is currently a production system.

Goldman Sachs (2015)

Goldman Sachs: Operations Analytics Strats
Summer Analyst

Automated Invoice Recognition and Extraction (June 2015 - August 2015)
We worked with Accounting Services to build a tool that could automatically index certain values from images of invoices. They received invoices through email or fax and scanned them into .tiff files. Our algorithm relied on segmentation of individual values to produce structured output and templating to identify likely segment locations.


Workflow Assignment through Mixed Integer Linear Programming (June 2015 - August 2015)
We worked with several different Operations groups to produce a tool for automatically assigning tasks to available analysts. The tool accounted for which analysts were available, their expected bandwidth for the remainder of the day, and their proficiency with incoming tasks, and it implemented a Maker-Checker process. It is run by managers and outputs each person's assignments along with a list of unassigned tasks and the reasons they could not be assigned.


Shuyang Li

Projects

The End of the World

Top-Down Exploration Game


A simple game inspired by Haruki Murakami's Hard-Boiled Wonderland and the End of the World. Move around the world with the arrow keys.


Maze Generator and Explorer

Cellular Automaton


Uses cellular automata to generate mazes. Allows customization of maze dimensions and visualization options. Solve the maze with the arrow keys, or use the built-in auto-solver.
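
As a rough illustration of the idea (the project's actual ruleset isn't documented here), a common choice for maze-like growth is the Life-like "Maze" rule B3/S12345, sketched below in Python:

    import numpy as np

    def step(grid):
        # One generation of the Life-like "Maze" rule (B3/S12345), toroidal edges
        neighbors = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                        if (dy, dx) != (0, 0))
        born = (grid == 0) & (neighbors == 3)
        survive = (grid == 1) & (neighbors >= 1) & (neighbors <= 5)
        return (born | survive).astype(np.uint8)

    rng = np.random.default_rng(1)
    grid = (rng.random((31, 31)) < 0.1).astype(np.uint8)  # sparse random seed
    for _ in range(50):
        grid = step(grid)  # walls grow into maze-like corridors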


Cellular Automaton Sandbox

Cellular Automaton


Set a beginning configuration and a ruleset, and allow a cellular automaton to grow. Allows flipping of grid squares as the automaton is running, as well as customization of run speed and canvas dimensions.


Cookie Crumbler

Game


Break cookies into different-sized pieces to feed a friendly colony of ants as they are constantly harassed by locusts. An experiment with the HTML5 canvas.


Incremental Adventure

Incremental Game


Dabbling in development of an incremental RPG. Progress through an infinite dungeon, collect party members, and upgrade your gear.





Data Quality Control Platform (2017-2018)

Distributed Computing


In my work with the Structured Products team (Mortgage Waterfall Infrastructure), I helped to design and implement a Spark-based infrastructure for high-bandwidth data processing jobs. We wrote a series of applications to wrap common data-access, filtering, and testing paradigms, so other members of the Structured Products group could write test logic as a self-contained plugin and farm the job out to our Spark cluster. Parallelization yielded a 100x+ speedup for some jobs, and the framework abstracted away Spark application boilerplate.

The platform infrastructure was written in Python with jobs distributed via pySpark.
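
A minimal PySpark sketch of the plugin idea: a check is a self-contained piece of filter logic, and the framework handles data access and distribution. The class name, fields, and data path below are hypothetical; the actual Bloomberg code is proprietary.

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F

    # Hypothetical plugin contract: a QC check receives a DataFrame of records
    # and returns only the rows that violate its rule.
    class MissingCouponCheck:
        name = "missing_coupon"

        def run(self, records: DataFrame) -> DataFrame:
            return records.filter(F.col("coupon").isNull())

    def run_checks(records: DataFrame, checks):
        # Framework side: each plugin's logic is executed on the Spark cluster
        return {check.name: check.run(records).count() for check in checks}

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("qc-sketch").getOrCreate()
        records = spark.read.parquet("path/to/security_records")  # placeholder path
        print(run_checks(records, [MissingCouponCheck()]))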



Identifying Similar Securities (2017-2018)

Semi-structured Text Clustering


As part of my work with the Quality Control team in Mortgage Waterfall Infrastructure, I investigated how various mortgage-backed securities (pools, generics, CMOs) could be clustered by the shape of the data we regularly received for each security. This entailed identifying data-shape features for semi-structured text data received for over 2.5 million securities and clustering on time-series data for each security.

The application was written in Python, using the Spark MLlib library for feature processing and clustering and PostgreSQL (accessed programmatically via SQLAlchemy) to store time-series data. The application used the infrastructure framework developed as part of the Data QC Platform.
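
A hedged sketch of that kind of pipeline using PySpark's ML library; the feature names, file path, and cluster count below are illustrative placeholders rather than the features actually engineered for the securities data.

    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("security-clustering-sketch").getOrCreate()

    # Hypothetical per-security "data shape" features derived from incoming feeds,
    # e.g. how often messages arrive and how many fields they populate.
    features = spark.read.parquet("path/to/security_shape_features")  # placeholder path

    pipeline = Pipeline(stages=[
        VectorAssembler(
            inputCols=["msgs_per_day", "fields_populated", "avg_field_len", "update_gap_days"],
            outputCol="raw_features"),
        StandardScaler(inputCol="raw_features", outputCol="features"),
        KMeans(featuresCol="features", k=8, seed=42),
    ])

    model = pipeline.fit(features)
    clustered = model.transform(features)  # adds a "prediction" column with the cluster id
    clustered.groupBy("prediction").count().show()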



Machine Learning Platform (2016-2017)

Distributed ML Platform


This was my primary project in Ops Analytics Strats (OAS), in the Operations Technology group at Goldman Sachs. Over the course of 10 months, we built a platform for data science and machine learning on top of the firm's centralized data store. Our goal was to allow anybody to classify and regress on arbitrary datasets without needing a deep background in programming for machine learning. One client was an internal data platforms team, which used our platform to better predict expected runtimes for their data ingestion workflow.

The platform infrastructure was written in Java. For machine learning, we used Spark MLlib and R.
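
The platform itself was Java with Spark MLlib and R; purely to illustrate the "fit a sensible model on an arbitrary table" idea, here is a hedged scikit-learn sketch in Python. The column handling and model choices are assumptions for the example, not the platform's actual logic.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    def auto_model(df: pd.DataFrame, target: str) -> Pipeline:
        # Fit a reasonable default model on an arbitrary tabular dataset
        X, y = df.drop(columns=[target]), df[target]
        categorical = X.select_dtypes(include="object").columns.tolist()
        numeric = X.select_dtypes(exclude="object").columns.tolist()
        preprocess = ColumnTransformer([
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
            ("num", StandardScaler(), numeric),
        ])
        # Classify discrete-looking targets, regress on continuous ones
        estimator = (RandomForestClassifier()
                     if y.dtype == "object" or y.nunique() < 20
                     else RandomForestRegressor())
        return Pipeline([("prep", preprocess), ("model", estimator)]).fit(X, y)

    # model = auto_model(pd.read_csv("any_table.csv"), target="label")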



Market Data Pipeline (2016)

FIX Message Parsing


This was my project for Grey Wolf with Goldman Sachs Asset Management - Fixed Income (GSAM FI) as a new Technology Analyst in the fall of 2016. Over 6 weeks, our team built a system to consume FIX messages from several marketplaces, parse them into a standardized object, and store the messages in a database. Our application allowed the GSAM FI team to consume a vastly greater quantity of market messages and made that data available for further analysis by traders. It is currently a production system.

The application was written in Java, using the QuickFIX/J library for FIX message decoding.
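
For readers unfamiliar with FIX: a message is a flat string of tag=value pairs separated by the SOH (\x01) delimiter. The toy Python parser below illustrates the format only; the production system used Java and QuickFIX/J, and the tags shown are a tiny, hand-picked subset of the spec.

    # Illustrative only; production parsing used QuickFIX/J.
    SOH = "\x01"

    TAG_NAMES = {"8": "BeginString", "35": "MsgType", "55": "Symbol",
                 "270": "MDEntryPx", "271": "MDEntrySize"}  # tiny subset of the spec

    def parse_fix(raw: str) -> dict:
        fields = {}
        for pair in raw.strip(SOH).split(SOH):
            tag, _, value = pair.partition("=")
            fields[TAG_NAMES.get(tag, tag)] = value
        return fields

    sample = SOH.join(["8=FIX.4.4", "35=W", "55=US912828U816",
                       "270=99.875", "271=5000000"]) + SOH
    print(parse_fix(sample))
    # {'BeginString': 'FIX.4.4', 'MsgType': 'W', 'Symbol': 'US912828U816', ...}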



Automated Invoice Recognition (2015)

Optical Character Recognition


This was my main project as a Summer Technology Analyst in Operations Analytics Strats. We worked with Accounting Services to build a tool that could automatically index certain values from images of invoices. They received invoices through email or fax and scanned them into .tiff files. Our algorithm relied on segmentation of individual values to produce structured output and templating to identify likely segment locations.

The program was written in Python, using the numpy library for fast array operations and the scikit-image library for image processing. We used the Tesseract open-source OCR engine.
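
A minimal sketch of the segment-then-OCR approach in Python; the field names and bounding boxes in the template are hypothetical placeholders, and the real templates and segmentation logic were more involved.

    import numpy as np
    import pytesseract
    from skimage import filters, io

    # Hypothetical template: bounding boxes (row/column slices) where one invoice
    # layout places each field; the real templates were learned and curated.
    TEMPLATE = {
        "invoice_number": (slice(50, 90), slice(400, 700)),
        "total_amount":   (slice(900, 950), slice(500, 750)),
    }

    def extract_fields(path: str) -> dict:
        image = io.imread(path, as_gray=True)
        binary = image > filters.threshold_otsu(image)  # binarize away scan noise
        results = {}
        for field, (rows, cols) in TEMPLATE.items():
            segment = (binary[rows, cols] * 255).astype(np.uint8)
            results[field] = pytesseract.image_to_string(segment).strip()
        return results

    # fields = extract_fields("invoice_0001.tiff")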



Workflow Assignment (2015)

Mixed Integer Linear Programming


This was another project that I worked on as a Summer Technology Analyst in Operations Analytics Strats. We worked with several different Operations groups to produce a tool for automatically assigning tasks to available analysts. This took into account available analysts, their expected bandwidth for the remainder of the day, their proficiency with incoming tasks, and implemented a Maker-Checker process. The tool is to be run by managers, and outputs a list of each person's assignments as well as a list of unassigned tasks with reasoning listed.

The program was written in R, using the Rglpk library for linear programming.
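
The production tool was written in R with Rglpk; as a hedged illustration of the underlying assignment formulation, here is an equivalent sketch in Python using PuLP. The analysts, tasks, capacities, and proficiency scores are made-up inputs.

    import pulp

    # Made-up inputs; the real data came from Operations systems.
    analysts = {"alice": 5, "bob": 3}          # remaining capacity (tasks) per analyst
    tasks = ["t1", "t2", "t3", "t4"]
    proficiency = {(a, t): 1.0 for a in analysts for t in tasks}
    proficiency[("alice", "t1")] = 3.0         # alice is far more proficient at t1

    prob = pulp.LpProblem("workflow_assignment", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("assign", (list(analysts), tasks), cat="Binary")

    # Maximize total proficiency of the assignments that get made
    prob += pulp.lpSum(proficiency[a, t] * x[a][t] for a in analysts for t in tasks)
    for t in tasks:                            # each task goes to at most one analyst
        prob += pulp.lpSum(x[a][t] for a in analysts) <= 1
    for a, capacity in analysts.items():       # respect each analyst's bandwidth
        prob += pulp.lpSum(x[a][t] for t in tasks) <= capacity

    prob.solve()
    assigned = [(a, t) for a in analysts for t in tasks if x[a][t].value() > 0.5]
    unassigned = [t for t in tasks if not any(pair[1] == t for pair in assigned)]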



Rich Features for Sentiment Analysis (2015-16)

Natural Language Processing


This was my senior thesis at Princeton University, supervised by Dr. Xiaoyan Li. Here, we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on subjectivity analysis and sentence position features. In addition, I created a manually labeled set of subjective and summary sentences for 2000 reviews in the Cornell IMDB movie review corpus. The full title of the thesis is Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models.

All model training and execution were performed in Python, using NLTK and TextBlob for document parsing and feature extraction, and scikit-learn for machine learning algorithms. Data analysis was done in R.
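
A simplified Python sketch of combining bag-of-words with richer document features, in the spirit of the thesis; the specific subjectivity and sentence-position features below are illustrative, not the thesis's exact feature set.

    # Requires TextBlob's corpora for sentence tokenization:
    #   python -m textblob.download_corpora
    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from textblob import TextBlob

    def rich_features(reviews):
        # Per-document features beyond bag-of-words: mean subjectivity and the
        # relative position of the most subjective sentence in the review.
        rows = []
        for text in reviews:
            subj = [s.sentiment.subjectivity for s in TextBlob(text).sentences] or [0.0]
            rows.append([float(np.mean(subj)), int(np.argmax(subj)) / len(subj)])
        return csr_matrix(rows)

    reviews = ["A tense, moving film. I loved every minute of it.",
               "The plot is thin and the acting is worse."]
    labels = [1, 0]

    bow = CountVectorizer()
    X = hstack([bow.fit_transform(reviews), rich_features(reviews)])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)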

Documents: Poster presented at the 2016 IEEE MIT Undergraduate Research Technology Conference. A full copy of the thesis may be requested via Princeton DataSpace. You can also download a .7z archive of my manually labeled movie reviews.



Career Imagineers (2015)

Social Impact


I worked on this semester-long project in my junior year of college for EGR 392 - Creativity, Innovation, and Design. We helped the campus career center provide resources for career exploration and preparation in a way that was better suited to the undergraduate population. We created a regular mentorship program pairing underclassmen with upperclassmen who had experience with various research and industry roles. We also held a well-attended career exploration workshop on campus.

The team consisted of Grace Chang '17, Catherine Idylle '16, Jean Choi '15, Maggie Zhang '16, Annie Chen '15, and myself.

Documents: Presentation given to the EGR 392 class, as well as administrators of Princeton career services.



Los Angeles Freeway Pricing (2014)

Optimal Learning


This was a project I worked on with Max Kaplan for ORF 418: Optimal Learning in Spring 2014. We created an algorithm for pricing the Express Lanes on the I-110 Freeway using optimal learning techniques. Specifically, we tested the Knowledge Gradient, Interval Estimation, Pure Exploitation, and Constrained Exploration algorithms with linear and logistic belief models.

Code and graphics were written in MATLAB.
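
As a hedged illustration of one of those policies, here is a small interval-estimation loop in Python (the original work was in MATLAB); the price grid, demand simulator, and belief parameters are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    prices = np.linspace(1.0, 15.0, 8)   # candidate express-lane tolls (invented)

    mu = np.zeros_like(prices)           # belief: expected revenue at each price
    sigma2 = np.full_like(prices, 1e6)   # belief variance (starts very uncertain)
    noise2 = 50.0 ** 2                   # assumed observation-noise variance
    z = 1.96                             # interval-estimation exploration parameter

    def observe_revenue(p):
        # Stand-in simulator: demand falls off with price in a logistic fashion
        demand = 1000.0 / (1.0 + np.exp(0.5 * (p - 8.0)))
        return p * demand + rng.normal(0.0, 50.0)

    for _ in range(200):
        i = int(np.argmax(mu + z * np.sqrt(sigma2)))  # optimistic price choice
        y = observe_revenue(prices[i])
        precision = 1.0 / sigma2[i] + 1.0 / noise2    # conjugate normal belief update
        mu[i] = (mu[i] / sigma2[i] + y / noise2) / precision
        sigma2[i] = 1.0 / precision

    best_price = prices[int(np.argmax(mu))]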

Documents: Project Report, Project Presentation