Resting at Eikan-dō Zenrin-ji, in Kyoto, Japan - June 2016

Shuyang Li

PhD Student at UC San Diego, advised by Prof. Julian McAuley
shuyangli94 [at] GMAIL / LinkedIn / Resume / Google Scholar

Princeton University Class of 2016
BSE in Operations Research and Financial Engineering

Avid badminton player. Writer. Check out various projects I've worked on via tabs on the upper-left.

Recent News

  • [August 2019] Paper "Generating Personalized Recipes from Historical User Preferences" accepted to EMNLP 2019; work w/ Bodhisattwa, Jianmo, and Julian
  • [August 2019] Will present poster "Cooking Common Sense: Personalized Recipe ‘Tweak’ Inference via Common Sense Reasoning" at SoCal NLP 2019; work w/ Bodhisattwa and Julian
  • [June 2019] Joined Google as a summer intern with the Kaggle Datasets team, working on metadata extraction and data discoverability
  • [June 2019] Our team was selected as one of 10 finalists in the 2019 Alexa Prize. News.
  • [September 2018] Started my PhD at UCSD, studying applied machine learning, recommender systems and NLP under Prof. Julian McAuley
  • [June 2017] Joined Bloomberg LP as a Senior Software Engineer in the Structured Products Waterfall team
  • [July 2016] Joined Goldman Sachs as a Technology Analyst in the Operations Automation and Analytics team
  • [June 2016] Graduated from Princeton University with a BSE in Operations Research and Financial Engineering

The End of the World

Top-Down Exploration Game

A simple game inspired by Haruki Murakami's Hard-Boiled Wonderland and the End of the World. Move around the world with the arrow keys.

Maze Generator and Explorer

Cellular Automaton

Using cellular automata to generate mazes. Allows customization of maze dimensions and visualizations. Solve the maze with arrow keys. Contains an auto-solver feature.
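The auto-solver can be illustrated with a breadth-first search over the maze grid. This is only a sketch in Python: the page does not say which algorithm the solver uses, and the grid encoding (0 for open, 1 for wall) and the name solve_maze are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def solve_maze(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Return a shortest path from start to goal as (row, col) cells, or None.

    Assumed encoding: 0 is an open cell, 1 is a wall. Moves are the four
    arrow-key directions (up, down, left, right).
    """
    rows, cols = len(grid), len(grid[0])
    parent: Dict[Cell, Optional[Cell]] = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None
```

Breadth-first search visits cells in order of distance from the start, so the first time it reaches the goal the reconstructed path is a shortest one.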

Cellular Automaton Sandbox

Cellular Automaton

Set a beginning configuration and a ruleset, and allow a cellular automaton to grow. Allows flipping of grid squares as the automaton is running, as well as customization of run speed and canvas dimensions.
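One way to picture the sandbox is as a two-state "life-like" automaton whose ruleset lists the neighbor counts that cause birth and survival. The Python sketch below makes that assumption; the function names, the fixed dead border, and the birth/survival rule format are illustrative and not taken from the sandbox itself.

```python
from typing import FrozenSet, List

Grid = List[List[int]]


def step(grid: Grid, birth: FrozenSet[int], survive: FrozenSet[int]) -> Grid:
    """Advance the automaton by one generation.

    A dead cell becomes alive if its live-neighbor count is in `birth`;
    a live cell stays alive if its count is in `survive`. Cells outside
    the grid are treated as dead.
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbors += grid[rr][cc]
            rule = survive if grid[r][c] else birth
            nxt[r][c] = 1 if neighbors in rule else 0
    return nxt


def flip(grid: Grid, row: int, col: int) -> None:
    """Toggle one cell in place, as a user might do while the automaton runs."""
    grid[row][col] = 1 - grid[row][col]


# Conway's Game of Life expressed in this rule format (B3/S23).
LIFE_BIRTH = frozenset({3})
LIFE_SURVIVE = frozenset({2, 3})
```

Swapping in a different pair of sets changes the behavior without touching the update loop; for example, the well-known "Maze" rule B3/S12345 tends to grow corridor-like patterns.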

Cookie Crumbler


Break cookies into different-sized pieces to feed a friendly colony of ants as they're constantly harassed by locusts. An experiment with the HTML5 canvas.

Incremental Adventure

Incremental Game

Dabbling in the development of an incremental RPG. Progress through an infinite dungeon, collect party members, and upgrade your gear.

Data Quality Control Platform (2017-)

Distributed Computing

In my work with the Structured Products team (Mortgage Waterfall Infrastructure), I helped to design and implement a Spark-based infrastructure for high-bandwidth data processing jobs. We wrote a series of applications to wrap common data-access, filtering, and testing paradigms, so other members of the Structured Products group could write test logic as a self-contained plugin and farm the job out to our Spark cluster. We achieved a 100x+ speedup for some jobs through parallelization, and abstracted away Spark application boilerplate.

The platform infrastructure was written in Python, with jobs distributed via PySpark.

Identifying Similar Securities (2017-)

Semi-structured Text Clustering

As part of my work with the Quality Control team in Mortgage Waterfall Infrastructure, I am currently investigating how various mortgage-backed securities (pools, generics, CMOs) can be clustered by the shape of the data we regularly receive for each security. This entails identifying data shape features for semi-structured text data received for over 2.5 million securities and clustering on time-series data for each security.

The application was written in Python, using the Spark MLlib library for feature processing and clustering, and PostgreSQL to store time-series data, accessed programmatically via SQLAlchemy. The application utilizes the infrastructure framework developed as part of the Data QC Platform.

Machine Learning Platform (2016-2017)

Distributed ML Platform

This was my primary project in Ops Analytics Strats (OAS), in the Operations Technology group at Goldman Sachs. Over the course of 10 months, we built a platform for data science and machine learning on top of the firm's centralized data store. Our goal was to allow anybody to classify and regress on arbitrary datasets without needing a deep background in programming for machine learning. One client was an internal data platforms team, which used our platform to better predict expected runtimes for their data ingestion workflow.

The platform infrastructure was written in Java. For machine learning, we used Spark MLlib and R.

Market Data Pipeline (2016)

FIX Message Parsing

This was the team project I worked on for Grey Wolf, for Goldman Sachs Asset Management - Fixed Income (GSAM FI), as a new Technology Analyst in the fall of 2016. Over 6 weeks, our team built a system to consume FIX messages from several marketplaces, parse them into a standardized object, and store the messages in a database. Our application allowed the GSAM FI team to consume a vastly greater quantity of market messages and made them available for further analysis by traders. It is currently a production system.

The application was written in Java, using the QuickFIX/J library for FIX message decoding.
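To show what parsing a FIX message involves, here is a small Python sketch of the tag=value wire format. It is not the production code, which was written in Java on top of QuickFIX/J; the function names are hypothetical, and the sketch only splits fields and computes the standard trailer checksum.

```python
from typing import List, Tuple

SOH = "\x01"  # FIX field delimiter (ASCII 0x01)


def parse_fix(raw: str) -> List[Tuple[int, str]]:
    """Split a raw FIX message into ordered (tag, value) pairs.

    A list is used instead of a dict because tags can legitimately
    repeat inside repeating groups.
    """
    fields = []
    for part in raw.split(SOH):
        if not part:
            continue  # skip the empty piece after the final delimiter
        tag, sep, value = part.partition("=")
        if not sep or not tag.isdigit():
            raise ValueError(f"malformed field: {part!r}")
        fields.append((int(tag), value))
    return fields


def fix_checksum(body: str) -> str:
    """Checksum for tag 10: byte sum modulo 256, zero-padded to 3 digits.

    `body` is everything before the "10=" field, including the
    delimiter that ends the preceding field.
    """
    return f"{sum(body.encode('ascii')) % 256:03d}"
```

For example, a heartbeat body of "8=FIX.4.2", "9=5", "35=0" (each followed by the delimiter) can be completed by appending "10=" plus fix_checksum(body) and a final delimiter, and parse_fix then returns the four fields in order.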

Automated Invoice Recognition (2015)

Optical Character Recognition

This was my main project as a Summer Technology Analyst in Operations Analytics Strats. We worked with Accounting Services to build a tool that could automatically index certain values from images of invoices. They received invoices through email or fax and scanned them into .tiff files. Our algorithm relied on segmentation of individual values to produce structured output and templating to identify likely segment locations.

The program was written in Python, using the NumPy library for fast array operations and the scikit-image library for image processing. We used the Tesseract open-source OCR engine.

Workflow Assignment (2015)

Mixed Integer Linear Programming

This was another project that I worked on as a Summer Technology Analyst in Operations Analytics Strats. We worked with several different Operations groups to produce a tool for automatically assigning tasks to available analysts. The tool took into account the available analysts, their expected bandwidth for the remainder of the day, and their proficiency with incoming tasks, and it implemented a Maker-Checker process. It is intended to be run by managers, and outputs a list of each person's assignments as well as a list of unassigned tasks with the reasoning for each.

The program was written in R, using the Rglpk library for linear programming.

Rich Features for Sentiment Analysis (2015-2016)

Natural Language Processing

This was my senior thesis at Princeton University, supervised by Dr. Xiaoyan Li. Here, we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on subjectivity analysis and sentence position features. In addition, I created a manually labeled set of subjective and summary sentences for 2000 reviews in the Cornell IMDB movie review corpus. The full title of the thesis is Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models.

All model training and execution were performed in Python, using NLTK and TextBlob for document parsing and feature extraction, and scikit-learn for machine learning algorithms. Data analysis was done in R.

Documents: Poster presented at the 2016 IEEE MIT Undergraduate Research Technology Conference. A full copy of the thesis may be requested via Princeton DataSpace. You can also download a .7z archive of my manually labeled movie reviews.

Career Imagineers (2015)

Social Impact

I worked on this semester-long project in my junior year of college for EGR 392 - Creativity, Innovation, and Design. We helped the campus career center provide resources for career exploration and preparation in a way that was better suited to the undergraduate population. We created a regular mentorship program pairing underclassmen with upperclassmen who had experience with various research and industry roles. We also held a well-attended career exploration workshop on campus.

The team consisted of Grace Chang '17, Catherine Idylle '16, Jean Choi '15, Maggie Zhang '16, Annie Chen '15, and myself.

Documents: Presentation given to the EGR 392 class, as well as administrators of Princeton career services.

Learning Rate Analysis for Temporal-Difference Learning (2014)

Reinforcement Learning

This was a project that I worked on as a Summer Research Intern at CASTLE Lab at Princeton University in the summer of 2014. The objective was to find optimal policies in an energy allocation problem. The model contained a battery, power grid, wind energy, and demand, with the latter three modeled as stochastic variables. I investigated the performance of Q-Learning and SARSA given different learning rates.

The simulator was written in Java, building on the BURLAP Learning Algorithm Library. Graphics and metrics were produced in Excel and MATLAB.
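The difference between the two algorithms comes down to one term in the temporal-difference update, and the learning rate scales how far each update moves the estimate. The Python sketch below shows the textbook tabular updates only; it is not the Java/BURLAP simulator, and the dictionary-based Q-table and function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def q_learning_update(q: QTable, s, a, reward: float, s_next,
                      actions: Sequence, alpha: float, gamma: float) -> None:
    """Off-policy update: bootstrap from the best action in the next state."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (reward + gamma * best_next - old)


def sarsa_update(q: QTable, s, a, reward: float, s_next, a_next,
                 alpha: float, gamma: float) -> None:
    """On-policy update: bootstrap from the action actually taken next."""
    old = q.get((s, a), 0.0)
    target = reward + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = old + alpha * (target - old)
```

Here alpha is the learning rate and gamma the discount factor. With alpha = 1 each update replaces the old estimate with the new target, while smaller values average over noisy rewards more slowly, which is the trade-off a learning-rate study explores.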

Documents: Problem Model, Project Presentation

Los Angeles Freeway Pricing (2014)

Optimal Learning

This was a project that I worked on for ORF 418: Optimal Learning in Spring 2014. I worked with Max Kaplan. We created an algorithm for pricing Express Lanes on the I-110 Freeway using Optimal Learning techniques. Specifically, we tested the Knowledge Gradient, Interval Estimation, Pure Exploitation, and Constrained Exploration algorithms with linear and logistic belief models.

Code and graphics were done in MATLAB.
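Two of the policies named above can be sketched compactly for the simplest case of independent normal beliefs about each candidate price. This Python sketch is not the project's MATLAB code and does not use the linear or logistic belief models from the report; the function names and the lookup-table belief are assumptions for illustration.

```python
import math
from typing import List, Sequence


def _normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_estimation_choice(means: Sequence[float], stds: Sequence[float],
                               z: float = 2.0) -> int:
    """Pick the alternative with the highest upper bound mean + z * std.

    With z = 0 this reduces to pure exploitation.
    """
    scores = [m + z * s for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)


def knowledge_gradient(means: Sequence[float], stds: Sequence[float],
                       noise_std: float) -> List[float]:
    """Knowledge-gradient value of measuring each alternative once.

    Assumes independent normal beliefs, at least two alternatives,
    and normally distributed measurement noise.
    """
    values = []
    for i, (mu, sigma) in enumerate(zip(means, stds)):
        # Standard deviation of the change in the posterior mean.
        sigma_tilde = sigma ** 2 / math.sqrt(sigma ** 2 + noise_std ** 2)
        if sigma_tilde == 0.0:
            values.append(0.0)  # nothing left to learn about this one
            continue
        best_other = max(m for j, m in enumerate(means) if j != i)
        zeta = -abs(mu - best_other) / sigma_tilde
        values.append(sigma_tilde * (zeta * _normal_cdf(zeta) + _normal_pdf(zeta)))
    return values
```

A knowledge-gradient policy would then measure the alternative with the largest value, which favors prices that are both uncertain and close to the current best.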

Documents: Project Report, Project Presentation