My research focuses on the intersection of natural language processing (NLP) and recommender systems. I explore personalization in natural language generation and question-answering, subjectivity in knowledge bases, and mining the wealth of data available publicly online. My previous research includes structured text generation, sentiment analysis, and spatial clustering in neuroscience.
I also focus on extracting large-scale, useful and interesting datasets from publicly available sources to analyze and make available for research.
Selected reseach projects below. For a complete list, see my Google Scholar page.
We introduce a large-scale dataset of media dialog: interview-style transcripts from NPR talk shows. We demonstrate its usefulness for dialog modeling and its special traits and dialog structure when compared to existing small-scale spontaneous dialog transcriptions and written proxies for conversation.
Generating Personalized Recipes from Historical User Preferences
We propose the task of personalized recipe generation: expanding a name and incomplete ingredient details into a complete natural-text recipe instruction set, aligned with a user's historical preferences; we present a model that attends on user activity traces to solve this task.
Expression of the transient receptor potential channels TRPV1, TRPA1 and TRPM8 in mouse trigeminal primary afferent neurons innervating the dura
We study the size and clustering of TRP-channel expressing neurons in the trigeminal dura. Our results suggest that TRPV1 and TRPA1 but not TRPM8 channels likely contribute to the excitation of dural afferent neurons and the subsequent activation of the headache circuit. These results provide an anatomical basis for understanding further the functional significance of TRP channels in headache pathophysiology.
Dataset Categorization and Search
Bloomberg: Structured Products
Data Quality Control Platform
Semi-Structured Text Clustering for Securities
Goldman Sachs: Operations Analytics Strats
Machine Learning Platform as a Service
FIX Market Data Pipeline
Goldman Sachs: Operations Analytics Strats
Automated Invoice Recognition and Extraction
Workflow Assignment through Mixed Integer Linear Programming
Cooking Common Sense: Personalized Recipe ‘Tweak’ Inference via Common Sense Reasoning
We propose the task of personalized tweak selection modeled as both entailment of a tweak from a recipe and activity traces, as well as a generative task. We collect a dataset of 72K tweaks linked to the Food.com dataset from our EMNLP 2019 publication on personalized recipe generation. We also propose a framework for collecting such tweaks via crowd-sourcing.
Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models
This was my senior thesis at Princeton University, supervised by Dr. Xiaoyan Li. Here, we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on subjectivity analysis and sentence position features. In addition, I created a manually labeled set of subjective and summary sentences for 2000 reviews in the Cornell IMDB movie review corpus.
The objective was to find optimal policies in an energy allocation problem. The model contained a battery, power grid, wind energy, and demand, with the latter three stochastic variables. I investigated the performance of Q-Learning and SARSA given different learning rates.
A simple game inspired by Haruki Murakami's Hard-Boiled Wonderland and the End of the World. Move around the world with the arrow keys.
Using cellular automata to generate mazes. Allows customization of maze dimensions and visualizations. Solve the maze with arrow keys. Contains an auto-solver feature.
Set a beginning configuration and a ruleset, and allow a cellular automaton to grow. Allows flipping of grid squares as the automaton is running, as well as customization of run speed and canvas dimensions.
Break cookies into different-sized pieces to feed a friendly colony of ants, as they're constantly harrassed by locusts. Experiment with HTML5 canvas.
Dabbling in development of an incremental RPG. Progress through an infinite dungeon and collect party members and upgrade your gear.
In my work with the Structured Products team (Mortgage Waterfall Infrastructure), I helped to design and implement a Spark-based infrastructure for high bandwidth data processing jobs. We wrote a series of applications to wrap common data-access, filtering, and testing paradigms, so other members of the Structured Products group could write test logic as a self-contained plugin and farm the job out to our Spark cluster. We achieved a 100x+ speedup for some jobs due to parallelization, and abstracted away Spark application boilerplate.
The platform infrastructure was written in Python with jobs distributed via pySpark.
As part of my work with the Quality Control team in Mortgage Waterfall Infrastructure, I am currently investigating how various mortgage-backed securities (pools, generics, CMOs) can be clustered by the shape of the data we regularly receive for each security. This entails identifying data shape features for semi-structured text data received for over 2.5 million securities and clustering on time-series data for each security.
The application was written in Python, using Spark MLlib library for feature processing and clustering, and PostgreSQL to store time series data, accessed programmatically via SQLAlchemy. The application utilizes the infrastructure framework developed as part of the Data QC Platform.
This was my primary project in Ops Analytics Strats (OAS), in the Operations Technology group at Goldman Sachs. Over the course of 10 months, we built a platform for data science and machine learning on top of the firm's centralized data store. Our goal was to allow anybody to classify and regress on arbitrary datasets without needing a deep background in programming for machine learning. One client was an internal data platforms team, which used our platform to better predict expected runtimes for their data ingestion workflow.
The platform infrastructure was written in Java. For machine learning, we used Spark MLlib and R.
This was the project team I worked on for Grey Wolf on Goldman Sachs Asset Management - Fixed Income (GSAM FI) as a new Technology Analyst in the fall of 2016. Over 6 weeks, our team built a system to consume FIX messages from several marketplaces, parse them into a standardized object, and store the messages in a database. Our application allowed the GSAM FI team to consume a vastly greater quantity of market messages and made it available for further analysis by traders. It is currently a production system.
The application was written in Java, using the QuickFIX/J library for FIX message decoding.
This was my main project as a Summer Technology Analyst in Operations Analytics Strats. We worked with Accounting Services to build a tool that could automatically index certain values from images of invoices. They received invoices through email or fax and scanned them into .tiff files. Our algorithm relied on segmentation of individual values to produce structured output and templating to identify likely segment locations.
The program was written in Python, using the numpy library for fast array operations and scikit-image library for image processing. We used the Tesseract open-source OCR engine.
This was another project that I worked on as a Summer Technology Analyst in Operations Analytics Strats. We worked with several different Operations groups to produce a tool for automatically assigning tasks to available analysts. This took into account available analysts, their expected bandwidth for the remainder of the day, their proficiency with incoming tasks, and implemented a Maker-Checker process. The tool is to be run by managers, and outputs a list of each person's assignments as well as a list of unassigned tasks with reasoning listed.
The program was written in R, using the Rglpk library for linear programming.
This was my senior thesis at Princeton University, supervised by Dr. Xiaoyan Li. Here, we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on subjectivity analysis and sentence position features. In addition, I created a manually labeled set of subjective and summary sentences for 2000 reviews in the Cornell IMDB movie review corpus. The full title of the thesis is Exploring Rich Features for Sentiment Analysis with Various Machine Learning Models.
All model training and execution were performed in Python, using NLTK and TextBlob for document parsing and feature extraction, and scikit-learn for machine learning algorithms. Data analysis was done in R.
Documents: Poster presented at the 2016 IEEE MIT Undergraduate Research Technology Conference. A full copy of the thesis may be requested via Princeton DataSpace. You can also download a .7z archive of my manually labeled movie reviews.
I worked on this semester-long project in Junior year of college for EGR 392 - Creativity, Innovation, and Design. We helped the campus career center provide resources for career exploration and preparation in a way that was better suited for the undergraduate population. We created a regular mentorship program for underclassmen and upperclassmen who had experience with various research and industry roles. We also held a well-attended career exploration workshop on campus.
The team consisted of Grace Chang '17, Catherine Idylle '16, Jean Choi '15, Maggie Zhang '16, Annie Chen '15, and myself.
Documents: Presentation given to the EGR 392 class, as well as administrators of Princeton career services.
This was a project that I worked on for ORF 418: Optimal Learning in Spring 2014. I worked with Max Kaplan. We created an algorithm for pricing Express Lanes on the I-110 Freeway using Optimal Learning techniques. Specifically, we tested the Knowledge Gradient, Interval Estimation, Pure Exploitation, and Constrained Exploration algorithms with linear and logistic belief models.
Code and graphics done in MATLAB.
Documents: Project Report, Project Presentation