Shuyang Li

shuyangli94 [at] GMAIL
Office @ 4146 CSE (EBU3B)
Second-year PhD student at UC San Diego, advised by Julian McAuley.
Conducting research at the intersection of Natural Language Generation (NLG), Recommender Systems, Dialogue Systems, and Artificial Intelligence.

Princeton University Class of 2016; BSE in Operations Research and Financial Engineering. Avid badminton player and writer.

Publications

My research focuses on the intersection of natural language processing (NLP) and recommender systems. I explore personalization in natural language generation and question-answering, subjectivity in knowledge bases, and mining the wealth of data available publicly online. My previous research includes structured text generation, sentiment analysis, and spatial clustering in neuroscience.

Selected research projects are listed below. For a complete list, see my Google Scholar page.


Generating Personalized Recipes from Historical User Preferences (EMNLP 2019)

Generating Personalized Recipes from Historical User Preferences
Bodhisattwa P. Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)
pdf | code | data

We propose the task of personalized recipe generation: expanding a name and incomplete ingredient details into a complete natural-text recipe instruction set, aligned with a user's historical preferences; we present a model that attends on user activity traces to solve this task.

Molecular Pain 2012 TRP Channel Clustering

Expression of the transient receptor potential channels TRPV1, TRPA1 and TRPM8 in mouse trigeminal primary afferent neurons innervating the dura
Dongyue Huang, Shuyang Li, Ajay Dhaka, Gina M Story, Yu-Qing Cao
Molecular Pain 2012, 8:66
pdf

We study the size and clustering of TRP-channel expressing neurons in the trigeminal dura. Our results suggest that TRPV1 and TRPA1 but not TRPM8 channels likely contribute to the excitation of dural afferent neurons and the subsequent activation of the headache circuit. These results provide an anatomical basis for understanding further the functional significance of TRP channels in headache pathophysiology.

Work Experience

Kaggle - Google Cloud AI (2019 Summer)

Google Cloud AI: Kaggle
Software Engineering Intern
June 2019 - September 2019

Dataset Categorization and Search
At Kaggle, I built a framework for automatically generating semantic tags for datasets based on free-text metadata. I also implemented metrics for dataset discoverability and search success. By the end of the summer, I had more than doubled the size of our tag ontology and tripled tag coverage across all public datasets on Kaggle.

Bloomberg, LP (2017-2018)

Bloomberg: Structured Products
Senior Software Engineer
June 2017 - September 2018

Data Quality Control Platform
I helped to design and implement a Spark-based infrastructure for high bandwidth data processing jobs. We wrote a series of applications to wrap common data-access, filtering, and testing paradigms, so other members of the Structured Products group could write test logic as a self-contained plugin and farm the job out to our Spark cluster. We achieved a 100x+ speedup for some jobs due to parallelization, and abstracted away Spark application boilerplate.

Semi-Structured Text Clustering for Securities
I investigated how various mortgage-backed securities (pools, generics, CMOs) could be clustered by the shape of the data we regularly received for each security. This entailed identifying data shape features for semi-structured text data received for over 2.5 million securities and clustering on time-series data for each security.

Goldman Sachs (2016-2017)

Goldman Sachs: Operations Analytics Strats
Technology Analyst
July 2016 - June 2017

Machine Learning Platform as a Service
Over the course of 10 months, we built a platform for data science and machine learning on top of the firm's centralized data store. Our goal was to allow anybody to classify and regress on arbitrary datasets without needing a deep background in programming for machine learning. One client was an internal data platforms team, which used our platform to better predict expected runtimes for their data ingestion workflow.

FIX Market Data Pipeline
This was the project team I worked on for Grey Wolf on Goldman Sachs Asset Management - Fixed Income (GSAM FI). Over 6 weeks, our team built a system to consume FIX messages from several marketplaces, parse them into a standardized object, and store the messages in a database. Our application allowed the GSAM FI team to consume a vastly greater quantity of market messages and made them available for further analysis by traders. It is currently a production system.

Goldman Sachs (2015 Summer)

Goldman Sachs: Operations Analytics Strats
Summer Analyst
June 2015 - August 2015

Automated Invoice Recognition and Extraction
We worked with Accounting Services to build a tool that could automatically index certain values from images of invoices. They received invoices through email or fax and scanned them into .tiff files. Our algorithm relied on segmentation of individual values to produce structured output and templating to identify likely segment locations.

Workflow Assignment through Mixed Integer Linear Programming
We worked with several different Operations groups to produce a tool for automatically assigning tasks to available analysts. The tool took into account the available analysts, their expected bandwidth for the remainder of the day, and their proficiency with incoming tasks, and it implemented a Maker-Checker process. It is meant to be run by managers, and it outputs a list of each person's assignments as well as a list of unassigned tasks with the reasoning for each.

Other Research

Cooking Common Sense: Personalized Recipe ‘Tweak’ Inference via Common Sense Reasoning (SoCal NLP 2019)

Cooking Common Sense: Personalized Recipe ‘Tweak’ Inference via Common Sense Reasoning
Shuyang Li*, Bodhisattwa P. Majumder*, Julian McAuley
2019 Southern California Natural Language Processing Symposium (SoCal NLP)
poster

We propose the task of personalized tweak selection, modeled both as entailment of a tweak from a recipe and activity traces and as a generative task. We collect a dataset of 72K tweaks linked to the Food.com dataset from our EMNLP 2019 publication on personalized recipe generation. We also propose a framework for collecting such tweaks via crowd-sourcing.

Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models (2015-2016)

Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models
Shuyang Li, Xiaoyan Li
IEEE Undergraduate Research Technology Conference 2016, Princeton Senior Thesis 2016
pdf | poster | data

This was my senior thesis at Princeton University, supervised by Dr. Xiaoyan Li. Here, we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on subjectivity analysis and sentence position features. In addition, I created a manually labeled set of subjective and summary sentences for 2000 reviews in the Cornell IMDB movie review corpus.

Learning Rate Analysis for Temporal-Difference Learning (2014)

Learning Rate Analysis for Temporal-Difference Learning
Work in Castle Labs, supervised by Warren Powell
Summer 2014
pdf | framework

The objective was to find optimal policies in an energy allocation problem. The model contained a battery, power grid, wind energy, and demand, with the latter three stochastic variables. I investigated the performance of Q-Learning and SARSA given different learning rates.
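
For readers unfamiliar with the two algorithms, the difference between them comes down to which next-state value the update bootstraps from. The sketch below is a generic tabular illustration and not the original project code: the function names, the dictionary-based Q-table, and the `harmonic_step` schedule (one common example of a declining learning rate) are all assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Off-policy update: bootstrap from the best action available in next_state."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def sarsa_update(q, state, action, reward, next_state, next_action, alpha, gamma):
    """On-policy update: bootstrap from the action actually taken in next_state."""
    target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])


def harmonic_step(n, a=10.0):
    """Illustrative declining learning rate a / (a + n - 1) for the n-th visit (n >= 1)."""
    return a / (a + n - 1)


# Hypothetical usage with made-up states and actions.
q = defaultdict(float)
actions = ["charge", "discharge"]
q_learning_update(q, "low", "charge", 1.0, "high", actions, alpha=harmonic_step(1), gamma=0.9)
```

A constant `alpha` and a declining schedule such as `harmonic_step` can be swapped in without changing either update, which is what makes a side-by-side learning-rate comparison straightforward.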


Projects


The End of the World

Top-Down Exploration Game


A simple game inspired by Haruki Murakami's Hard-Boiled Wonderland and the End of the World. Move around the world with the arrow keys.



Maze Generator and Explorer

Cellular Automaton


Using cellular automata to generate mazes. Allows customization of maze dimensions and visualizations. Solve the maze with arrow keys. Contains an auto-solver feature.



Cellular Automaton Sandbox

Cellular Automaton


Set a beginning configuration and a ruleset, and allow a cellular automaton to grow. Allows flipping of grid squares as the automaton is running, as well as customization of run speed and canvas dimensions.
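
As a rough illustration of the idea (not the sandbox's actual source), one generation of a two-state grid automaton can be computed from a ruleset given as "birth" and "survive" neighbor counts. The function names, the default ruleset (Conway's Game of Life, B3/S23), and the choice to treat cells outside the grid as dead are assumptions of this sketch.

```python
def step(grid, birth=frozenset({3}), survive=frozenset({2, 3})):
    """Advance a 2D grid of 0/1 cells by one generation.

    Cells outside the grid count as dead (an assumption of this sketch).
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live cells among the up-to-eight surrounding neighbors.
            n = sum(
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            rule = survive if grid[r][c] else birth
            nxt[r][c] = 1 if n in rule else 0
    return nxt


def flip(grid, r, c):
    """Toggle one cell in place, as a click might while the automaton is running."""
    grid[r][c] ^= 1
```

Passing a different ruleset changes the behavior without touching the loop. For example, `survive=frozenset({1, 2, 3, 4, 5})` with the default `birth` gives the life-like rule usually called "Maze" (B3/S12345); whether the maze project above uses that rule is not stated here.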



Cookie Crumbler

Game


Break cookies into different-sized pieces to feed a friendly colony of ants, as they're constantly harassed by locusts. An experiment with the HTML5 canvas.



Incremental Adventure

Incremental Game


Dabbling in development of an incremental RPG. Progress through an infinite dungeon, collect party members, and upgrade your gear.



Data Quality Control Platform (2017-)

Distributed Computing


In my work with the Structured Products team (Mortgage Waterfall Infrastructure), I helped to design and implement a Spark-based infrastructure for high bandwidth data processing jobs. We wrote a series of applications to wrap common data-access, filtering, and testing paradigms, so other members of the Structured Products group could write test logic as a self-contained plugin and farm the job out to our Spark cluster. We achieved a 100x+ speedup for some jobs due to parallelization, and abstracted away Spark application boilerplate.

The platform infrastructure was written in Python with jobs distributed via PySpark.



Identifying Similar Securities (2017-)

Semi-structured Text Clustering


As part of my work with the Quality Control team in Mortgage Waterfall Infrastructure, I investigated how various mortgage-backed securities (pools, generics, CMOs) could be clustered by the shape of the data we regularly received for each security. This entailed identifying data shape features for semi-structured text data received for over 2.5 million securities and clustering on time-series data for each security.

The application was written in Python, using the Spark MLlib library for feature processing and clustering, and PostgreSQL to store time series data, accessed programmatically via SQLAlchemy. The application utilizes the infrastructure framework developed as part of the Data QC Platform.



Machine Learning Platform (2016-2017)

Distributed ML Platform


This was my primary project in Ops Analytics Strats (OAS), in the Operations Technology group at Goldman Sachs. Over the course of 10 months, we built a platform for data science and machine learning on top of the firm's centralized data store. Our goal was to allow anybody to classify and regress on arbitrary datasets without needing a deep background in programming for machine learning. One client was an internal data platforms team, which used our platform to better predict expected runtimes for their data ingestion workflow.

The platform infrastructure was written in Java. For machine learning, we used Spark MLlib and R.



Market Data Pipeline (2016)

FIX Message Parsing


This was the project team I worked on for Grey Wolf on Goldman Sachs Asset Management - Fixed Income (GSAM FI) as a new Technology Analyst in the fall of 2016. Over 6 weeks, our team built a system to consume FIX messages from several marketplaces, parse them into a standardized object, and store the messages in a database. Our application allowed the GSAM FI team to consume a vastly greater quantity of market messages and made them available for further analysis by traders. It is currently a production system.

The application was written in Java, using the QuickFIX/J library for FIX message decoding.
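
To give a sense of what "parsing into a standardized object" involves, the sketch below splits a raw FIX message in the standard tag=value encoding, where fields are separated by the SOH character (0x01). It is written in Python for brevity and is not the production Java code, which used QuickFIX/J; the function names and the record layout are hypothetical. The tag numbers are standard FIX tags (35 MsgType, 52 SendingTime, 55 Symbol).

```python
SOH = "\x01"  # standard FIX field delimiter


def parse_fix(raw):
    """Split a raw FIX string into an ordered list of (tag, value) pairs.

    A list is used instead of a dict because tags can repeat within a message.
    """
    fields = []
    for part in raw.strip(SOH).split(SOH):
        if not part:
            continue
        tag, _, value = part.partition("=")
        fields.append((int(tag), value))
    return fields


def to_record(fields):
    """Map a few well-known tags onto named keys; this record layout is hypothetical."""
    names = {35: "msg_type", 52: "sending_time", 55: "symbol"}
    return {names[tag]: value for tag, value in fields if tag in names}


# Made-up message for illustration only.
raw = SOH.join(["8=FIX.4.4", "35=W", "55=XYZ", "10=000"]) + SOH
record = to_record(parse_fix(raw))
```

This sketch skips the checksum and body-length validation and the repeating-group handling that a real FIX engine performs.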



Automated Invoice Recognition (2015)

Optical Character Recognition


This was my main project as a Summer Technology Analyst in Operations Analytics Strats. We worked with Accounting Services to build a tool that could automatically index certain values from images of invoices. They received invoices through email or fax and scanned them into .tiff files. Our algorithm relied on segmentation of individual values to produce structured output and templating to identify likely segment locations.

The program was written in Python, using the NumPy library for fast array operations and the scikit-image library for image processing. We used the Tesseract open-source OCR engine.



Workflow Assignment (2015)

Mixed Integer Linear Programming


This was another project that I worked on as a Summer Technology Analyst in Operations Analytics Strats. We worked with several different Operations groups to produce a tool for automatically assigning tasks to available analysts. The tool took into account the available analysts, their expected bandwidth for the remainder of the day, and their proficiency with incoming tasks, and it implemented a Maker-Checker process. It is meant to be run by managers, and it outputs a list of each person's assignments as well as a list of unassigned tasks with the reasoning for each.

The program was written in R, using the Rglpk library for linear programming.



Rich Features for Sentiment Analysis (2015-2016)

Natural Language Processing


This was my senior thesis at Princeton University, supervised by Dr. Xiaoyan Li. Here, we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on subjectivity analysis and sentence position features. In addition, I created a manually labeled set of subjective and summary sentences for 2000 reviews in the Cornell IMDB movie review corpus. The full title of the thesis is Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models.

All model training and execution were performed in Python, using NLTK and TextBlob for document parsing and feature extraction, and scikit-learn for machine learning algorithms. Data analysis was done in R.

Documents: Poster presented at the 2016 IEEE MIT Undergraduate Research Technology Conference. A full copy of the thesis may be requested via Princeton DataSpace. You can also download a .7z archive of my manually labeled movie reviews.



Career Imagineers (2015)

Social Impact


I worked on this semester-long project in my junior year of college for EGR 392 - Creativity, Innovation, and Design. We helped the campus career center provide resources for career exploration and preparation in a way that was better suited to the undergraduate population. We created a regular mentorship program for underclassmen and upperclassmen who had experience with various research and industry roles. We also held a well-attended career exploration workshop on campus.

The team consisted of Grace Chang '17, Catherine Idylle '16, Jean Choi '15, Maggie Zhang '16, Annie Chen '15, and myself.

Documents: Presentation given to the EGR 392 class, as well as administrators of Princeton career services.



Los Angeles Freeway Pricing (2014)

Optimal Learning


This was a project that I worked on for ORF 418: Optimal Learning in Spring 2014. I worked with Max Kaplan. We created an algorithm for pricing Express Lanes on the I-110 Freeway using Optimal Learning techniques. Specifically, we tested the Knowledge Gradient, Interval Estimation, Pure Exploitation, and Constrained Exploration algorithms with linear and logistic belief models.
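
As a simplified illustration of one of these policies, Interval Estimation picks the alternative (here, a candidate toll price) whose estimated value plus a multiple of its uncertainty is highest. The sketch below is written in Python and uses independent normal beliefs per alternative, which is a simplifying assumption: the project itself used linear and logistic belief models and MATLAB. All names and numbers are hypothetical.

```python
import math


def interval_estimation_choice(means, stds, z=2.0):
    """Return the index of the alternative with the highest mean + z * std."""
    scores = [m + z * s for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)


def update_belief(mean, std, observation, noise_std):
    """Bayesian update of one normal belief after a single noisy observation."""
    precision = 1.0 / std**2 + 1.0 / noise_std**2
    new_mean = (mean / std**2 + observation / noise_std**2) / precision
    return new_mean, math.sqrt(1.0 / precision)


# Made-up beliefs about two candidate prices.
means, stds = [1.0, 0.8], [0.1, 0.5]
explore = interval_estimation_choice(means, stds, z=2.0)  # favors the uncertain option
exploit = interval_estimation_choice(means, stds, z=0.0)  # z = 0 is pure exploitation
```

Setting `z` to zero recovers Pure Exploitation, so the two policies can share one code path in a comparison.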

Code and graphics were done in MATLAB.

Documents: Project Report, Project Presentation