I'm open to opportunities in Machine Learning and Data Science. In particular, I'm interested in using machine learning (especially deep learning) to solve real-world problems. I created this site as an expanded portfolio where I describe my previous work in detail. If you want just the highlights, please download my resume above.
DO NOT contact me about ANY data engineering positions! I'm NOT interested in roles that focus on building pipelines or supporting data science teams. The same goes for low-level analytics roles that revolve primarily around writing SQL queries or building dashboards.
My Work
Feel free to contact me regarding opportunities, networking, my projects/PaddleSoft, or just to discuss technology in general.
When the coronavirus pandemic struck, I started volunteering at CoronaWhy. At CoronaWhy I've worked on several projects, including extracting adverse drug events, clustering COVID-19 literature, and forecasting COVID-19 spread. I'm currently leading a team of data scientists, data engineers, and epidemiologists to build models that forecast COVID-19 spread and to develop causal models that gauge policy impacts. We are also investing significant time in continuing to develop Flow Forecast, an open-source deep learning framework for time series forecasting built in PyTorch.
I developed and fine-tuned PyTorch models (specifically, variations of the transformer) to add new triplets to the company knowledge graph, enhancing downstream applications such as search, job recommendation, and autocompletion. I also helped educate colleagues and set standards around writing PyTorch models, Python code quality, and data science best practices. Additionally, part of my time went to building out the company data lake on GCP with tools like Terraform, Cloud Functions, Dataflow, and BigQuery.
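To give a flavor of this work, here is a minimal sketch of a transformer-based relation classifier for proposing new triplets; the base model, label set, and `extract_triplet` helper are all illustrative, not the production setup:

```python
# Minimal sketch of transformer-based relation classification for
# knowledge-graph triplet extraction. Model name and labels are
# illustrative, not the production setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

RELATIONS = ["skill_of", "located_in", "no_relation"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(RELATIONS)
)

def extract_triplet(sentence: str, head: str, tail: str):
    """Classify the relation between two entities into a (head, relation, tail) triplet."""
    # Mark the candidate entity pair so the encoder can attend to it.
    text = f"{sentence} [SEP] {head} [SEP] {tail}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    relation = RELATIONS[int(logits.argmax(dim=-1))]
    return (head, relation, tail)
```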
My role involved a mix of data engineering and machine learning tasks. On the data engineering side, I helped develop the company's unified data platform, built with Apache Airflow (running on Kubernetes), Spark (running on EMR clusters), and Hive tables stored on S3. Specifically, my tasks included developing Airflow DAGs to run pipelines, creating custom Airflow operators to launch EMR clusters with the proper dependencies, and translating old SQL jobs to SparkSQL. On the machine learning side, I refined models to forecast retail demand with Spark MLlib (later experimenting with deep learning architectures in PyTorch), trained and tested models to cluster products for better categorization with TensorFlow, and researched techniques to improve personalization (also with TensorFlow).
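As an illustration of the Airflow side, here is a stripped-down sketch of a DAG that launches an EMR cluster; the cluster spec and DAG names are hypothetical and far simpler than the real pipelines:

```python
# Sketch of an Airflow DAG that launches an EMR cluster for a Spark job;
# the job-flow config below is a placeholder, not the production pipeline.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

JOB_FLOW_OVERRIDES = {  # hypothetical cluster spec
    "Name": "nightly-demand-forecast",
    "ReleaseLabel": "emr-5.30.0",
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
}

with DAG(
    dag_id="demand_forecast_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
    )
```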
From July to January I worked on a variety of contracts and open-source projects. This included designing Peelout, a set of tools aimed at easing the process of deploying deep learning models to production; creating a chatbot with spaCy, TensorFlow, and Redis; and writing a series of articles for EyeOnAI, an online magazine focused on A.I.
I designed interactive charts with Bokeh in Python for hospital administrators and doctors, created ETL pipelines to pull data from the hospital's decentralized data sources (such as Cerner, 3M, "side" SQL databases, and manually maintained Excel notebooks), and employed data-driven approaches to improve hospital performance. Finally, I also worked on a "modern" data architecture built on Kafka and PostgreSQL to automate tedious manual processes and provide real-time analytics for clients.
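For a sense of the charting work, here is a minimal Bokeh sketch; the CSV file and column names are stand-ins for the real hospital data:

```python
# Minimal sketch of an interactive Bokeh chart of the kind used for
# hospital dashboards; the CSV path and columns are illustrative only.
import pandas as pd
from bokeh.plotting import figure, output_file, show

df = pd.read_csv("daily_admissions.csv")  # hypothetical extract from the ETL pipeline
df["date"] = pd.to_datetime(df["date"])

p = figure(x_axis_type="datetime", title="Daily Admissions",
           tools="pan,wheel_zoom,box_zoom,reset,hover")
p.line(df["date"], df["admissions"], line_width=2)
p.yaxis.axis_label = "Admissions"

output_file("admissions.html")  # self-contained HTML for administrators
show(p)
```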
I assisted the data analytics team at EMMC during the summer months (June-August) and over the winter break (December-January). I analyzed a variety of data, including both patient and financial data, and used tools such as SQL and Altova MapForce to extract, transform, and load (ETL) data before creating data visualizations.
I founded PaddleSoft to help paddlers plan their whitewater adventures. Over the course of the last two years I have created many paddling-related services. Some of my favorites included using D3.js, NodeJS, and Kafka to build a real-time river flow map, using Neo4j and Cypher (CQL) queries to recommend rivers and paddling partners to our users, and using MATLAB to create a time series neural network to predict the flow of the Kenduskeag Stream. More details can be found below in the projects section of this page and on the PaddleSoft blog. All blog entries are written by me unless otherwise noted.
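As a taste of the recommendation work, here is a sketch of the kind of collaborative-filtering Cypher query involved, run through the Python Neo4j driver; the node labels, relationship types, and credentials are all hypothetical:

```python
# Sketch of a collaborative-filtering style Cypher query for recommending
# rivers; labels, relationships, and credentials are made up for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

RECOMMEND_RIVERS = """
MATCH (me:Paddler {name: $name})-[:PADDLED]->(:River)<-[:PADDLED]-(peer:Paddler)
MATCH (peer)-[:PADDLED]->(rec:River)
WHERE NOT (me)-[:PADDLED]->(rec)
RETURN rec.name AS river, count(peer) AS score
ORDER BY score DESC LIMIT 5
"""

with driver.session() as session:
    # Rivers paddled by similar paddlers, ranked by how many peers overlap.
    for record in session.run(RECOMMEND_RIVERS, name="Isaac"):
        print(record["river"], record["score"])
```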
I collaborated with HR to develop and fine-tune HRIS systems. A few of my tasks included analyzing LAWSON reports in Excel, using SQL and Microsoft Access to automate Affordable Care Act and COBRA reporting, creating custom Excel functions and macros with VB, and collaborating with a larger team on the implementation of manager self-service.
I conducted research for the University of Maine Chemistry department, working mainly on computer modeling of tungsten-oxide molecules. Specifically, I wrote bash scripts to run jobs on the university supercomputer. We were the theoretical arm of a larger team working to convert forest byproducts to gasoline-grade fuel oil. The overall goal of my specific team was to simulate chemical reactions and the formation of chemical compounds using computational chemistry software. I also attended weekly meetings and collaborated with the larger research team.
Initially a repository for forecasting flash floods and stream flows, Flow Forecast has evolved into a general time series forecasting library. We are currently using Flow Forecast to forecast COVID-19 cases around the U.S. as well as to study transfer learning by training on stream flow, wind, and solar data.
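To give a rough idea of the underlying approach, here is a toy PyTorch forecaster; this is a hand-rolled sketch rather than Flow Forecast's actual API (see the repository for that):

```python
# A toy PyTorch forecaster in the spirit of Flow Forecast; a hand-rolled
# sketch, not the library's actual API.
import torch
import torch.nn as nn

class SimpleForecaster(nn.Module):
    """Map a window of past observations to the next time step."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # predict from the last hidden state

# One training step on synthetic data (a stand-in for flow/wind/solar series).
model = SimpleForecaster(n_features=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 30, 3), torch.randn(8, 1)  # 8 windows of 30 steps
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```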
I developed a Game of Thrones chatbot from scratch using Flask, Redis, ElasticSearch, PostgreSQL, spaCy, and TensorFlow. The bot operated through a Flask-based REST API where incoming user messages (from the Slack API) were combined with prior cached messages in Redis and fed to a combination of rule-based NLP methods and TensorFlow models. These methods in turn constructed queries to the appropriate data sources (ElasticSearch or PostgreSQL) and synthesized a response based on the returned information. Full conversation history for the bot was stored in PostgreSQL and periodically analyzed and reintegrated into the training data to improve performance. The app ran primarily on AWS (ElasticSearch Service, ECS, and EC2), while a few of the TensorFlow components ran on GCP instances. The bot integrated with the Slack API and used OAuth for authorization.
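Here is a minimal sketch of the message-caching step in that Flask REST API; the route, key scheme, and `generate_reply` helper are hypothetical stand-ins for the real pipeline:

```python
# Sketch of the message-caching step in the chatbot's Flask REST API;
# the route, key scheme, and generate_reply helper are hypothetical.
import json
import redis
from flask import Flask, request, jsonify

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379)

def generate_reply(history):
    """Placeholder for the rule-based NLP + TensorFlow response pipeline."""
    return "Winter is coming."

@app.route("/message", methods=["POST"])
def handle_message():
    payload = request.get_json()
    key = f"conv:{payload['user_id']}"
    cache.rpush(key, json.dumps(payload["text"]))                  # append new turn
    history = [json.loads(m) for m in cache.lrange(key, -10, -1)]  # last 10 turns
    return jsonify({"reply": generate_reply(history)})
```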
This project aims to ease the difficulties surrounding utilizing deep learning models in a production environment. It involves three parts: creating a set of model-agnostic tools to rapidly adapt models to a business use case, developing a set of scripts and extensions to existing frameworks to actually deploy the models, and designing a set of tools to monitor models and automatically adapt or continue to train them. The focus now is on the deployment phase. Specifically, this consists of automatically packaging deep learning models into a Docker container and creating a Kubernetes-based auto-scaling microservice that can integrate with other applications. There is also work to embed DL models directly in Flink (and other Java applications) with Java Embedded Python (JEP).
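To illustrate the packaging idea, here is a simplified sketch of generating a Dockerfile around a serialized model; Peelout's actual tooling differs, and the template and paths here are made up:

```python
# Illustrative sketch of the "auto-package a model into Docker" idea;
# Peelout's real tooling differs, and these paths/templates are made up.
from pathlib import Path

DOCKERFILE_TEMPLATE = """\
FROM python:3.8-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY {model_path} /app/model.pt
COPY serve.py /app/serve.py
CMD ["python", "/app/serve.py"]
"""

def package_model(model_path: str, out_dir: str = "build") -> Path:
    """Write a Dockerfile that wraps the serialized model in a serving image."""
    build = Path(out_dir)
    build.mkdir(exist_ok=True)
    dockerfile = build / "Dockerfile"
    dockerfile.write_text(DOCKERFILE_TEMPLATE.format(model_path=model_path))
    return dockerfile  # next: `docker build`, then a Kubernetes Deployment + HPA

package_model("model.pt")
```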
This research explored localizing and classifying a variety of conditions in lung X-rays given only a small dataset, through the use of transfer learning and meta-learning. I have for the most part set this project aside to focus on my NLP research and on developing Peelout.
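For context, the transfer learning baseline looked roughly like the following sketch: fine-tuning only the head of an ImageNet-pretrained CNN, which suits a small labeled dataset (the class count and freezing scheme here are illustrative):

```python
# Hedged sketch of a transfer-learning baseline: fine-tune only the head
# of a pretrained CNN on a small X-ray dataset; class count is made up.
import torch.nn as nn
from torchvision import models

NUM_CONDITIONS = 4  # hypothetical number of lung conditions

model = models.resnet18(pretrained=True)   # ImageNet-pretrained backbone
for p in model.parameters():
    p.requires_grad = False                # freeze the feature extractor
model.fc = nn.Linear(model.fc.in_features, NUM_CONDITIONS)  # new trainable head
```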
This is a project to automate the scraping of Facebook data. The end goal is to create a continuous scraping and analysis engine for posts from Facebook groups and pages. For PaddleSoft, we want to use it to extract information about which rivers people are paddling, records of trips, and flow information for rivers that do not have gauges. However, we are working hard to make our solution generalizable to anyone who wishes to extract meaningful information from Facebook.
This was a project I originally undertook in 2016 to predict the flow of the Kenduskeag Stream in Bangor, ME. It involved many components, including collecting flow and weather data in a PostgreSQL database, training a time series neural network (NARX) in MATLAB to predict flow, and finally displaying the predictions with Chart.js. Unfortunately, as you may know, MATLAB is closed source and not easily deployable, so I do not have a working demo at the moment. However, I'm working on recreating our model in Python for the Kenduskeag and other streams as well. In the process, I hope to make meaningful contributions and provide valuable insights to Python frameworks like PyFlux and PyNeurGen.
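A rough sketch of what the Python recreation could look like, framing NARX as a nonlinear model over lagged flow (autoregressive) and lagged rainfall (exogenous) inputs; the column names and lag counts are illustrative:

```python
# Sketch of recreating the NARX idea in Python: a nonlinear model over
# lagged flow (AR terms) and lagged rainfall (exogenous inputs).
import pandas as pd
from sklearn.neural_network import MLPRegressor

df = pd.read_csv("kenduskeag.csv")  # hypothetical flow + weather extract

LAGS = 3
for i in range(1, LAGS + 1):
    df[f"flow_lag{i}"] = df["flow"].shift(i)      # past flow (autoregressive)
    df[f"rain_lag{i}"] = df["rainfall"].shift(i)  # past rainfall (exogenous)
df = df.dropna()

X = df[[c for c in df.columns if "lag" in c]].to_numpy()
y = df["flow"].to_numpy()

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X[:-30], y[:-30])            # hold out the last 30 days
print(model.predict(X[-30:])[:5])      # one-step-ahead flow predictions
```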
This is a Ruby on Rails application that I built at the request of a friend who coordinates the ACA nationals race. He wanted an application where racers in the competition could form teams and earn points for their respective teams. So I built a simple ROR application where users could sign up, browse teams, and join teams. The application used Devise for the authentication system and a PostgreSQL database for storing team data. Gradually, I added more features, such as Facebook login and ways for teams to filter and search for prospective members. It is still online at the link below (though some of the links are now stale). The code is also available on my GitHub.
A real-time map of river flows across America that renders a graph of a river's flow information when selected. It also queries our whitewater search engine for results on the selected river. The project uses ElasticSearch (for search results), NodeJS, D3.js, and SocketIO.
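The search-engine lookup fired when a river is selected looks roughly like this Python sketch (the index name and field are hypothetical, and the production version lives in NodeJS):

```python
# Sketch of the whitewater search-engine lookup fired when a river is
# selected on the map; index name and mapping are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_river(name: str):
    """Full-text search for results about the selected river."""
    resp = es.search(index="whitewater", query={"match": {"river_name": name}})
    return [hit["_source"] for hit in resp["hits"]["hits"]]

print(search_river("Kenduskeag Stream"))
```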