Chaskin Saroff

Machine Learning Engineer

About Me

Hi, I’m Chaskin Saroff, a machine learning engineer with a passion for creating tools that empower people to achieve more in less time!

After stints at IBM, Bad Monkeys, and B-Stock, I’m currently building ML models for fetal diagnostics and reporting automation at BioticsAI.

I’m really interested in the intersection of machine learning and health science, and I’m grateful to be working on advancing this technology.

Experience

BioticsAI

Co-Founder and CTO

October 2021 - Present

According to the American College of Obstetricians and Gynecologists (ACOG), the US has a shortage of 9,000 obstetricians. With only 1 in 5 Ob/Gyns under the age of 40, ACOG projects the shortage to grow to 22,000 Ob/Gyns by 2050. A large part of this deficit is physician burnout, which physicians experience at twice the rate of other working adults. What’s the number one cause of physician burnout? Too many administrative tasks (e.g., charting and paperwork).

At BioticsAI, we’re building a platform to automate the most time-consuming and tedious aspects of obstetrics and gynecology, namely reporting. Our tool automatically generates reports from fetal ultrasound images and we’re currently piloting it with a handful of clinics in the MENA region.

We’re starting with a focus on administrative automation, but we’re also building tools to help doctors recognize abnormalities earlier and with higher accuracy to improve patient outcomes. Over 50% of fetal structural anomalies go undetected during pregnancy, and ~20% of those that are detected are found only after 23 weeks, when termination is often no longer an option. Congenital anomalies affect around 1.8% of fetuses, so of the ~140 million births per year globally, roughly 2.5 million are affected and ~1.26 million of those anomalies go undetected. This is probably an underestimate, because most of this data comes from the US and Europe, where high-quality healthcare is relatively accessible.

In developing nations, where the bulk of pregnancies occur, the problem is even worse. Luckily, ultrasound is a widely available technology, with probe costs running as low as $1,000. If we can improve the diagnosis rate by just 10% or detect anomalies earlier, it would have a life-changing impact on hundreds of thousands of families. That’s our mission: building tools that help doctors detect anomalies earlier and with higher accuracy to improve patient outcomes.

So how are we doing it?

The most important ingredient in any machine learning project is data, and the journey to high-quality labeled data was long and bumpy. Together with one of our partner clinics, we built a dataset and an annotation team, and started labeling our ultrasound images with plane labels and anatomical regions.

BioticsAI started as a side project in 2019, and at the time there weren’t many high-quality annotation tools around. The really good ones cost hundreds of dollars a month per user, well outside the budget of an unfunded side project. We were annotating all of our images with a local tool called PixelAnnotationTool, with our annotation team syncing the dataset and annotations between Google Drive and their local machines.

Reviewing annotation progress was manual. Uploading anonymized data to Google Drive was manual. Splitting the data so each annotator had their own set to work on was manual. We were spending way too much time just keeping our annotation team organized.

After doing research and testing a bunch of different labeling tools, my cofounder Robhy and I eventually settled on an open-source, cloud-based tool called Label Studio.

I’m really partial to open-source projects that offer a hosted option to fund the project, but perhaps the biggest factors in our decision were cost and security. Self-deployment meant we could keep all of our data inside Google Cloud and we could use our cloud credits to pay for everything (thanks GCP!).

Finally, our data was in the cloud and we no longer had to manually split it between annotators. Paradise, right? Well… not quite.

Label Studio is great for annotating images, but the open-source version doesn’t have features for measuring labeling progress. To track our annotation progress, I deployed it into our GKE cluster and added some dashboards.
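The open-source edition does expose per-project task counts through its REST API, so progress tracking can be a small script. Here’s a minimal sketch, assuming a placeholder deployment URL and access token; field names like num_tasks_with_annotations can vary between Label Studio versions:

    import requests

    # Placeholders: point these at your own Label Studio deployment.
    LABEL_STUDIO_URL = "https://label-studio.example.com"
    API_TOKEN = "your-access-token"

    def annotation_progress():
        """Print percent-complete for every Label Studio project."""
        resp = requests.get(
            f"{LABEL_STUDIO_URL}/api/projects",
            headers={"Authorization": f"Token {API_TOKEN}"},
        )
        resp.raise_for_status()
        for project in resp.json()["results"]:
            total = project.get("task_number") or 0
            done = project.get("num_tasks_with_annotations") or 0
            pct = 100 * done / total if total else 0.0
            print(f"{project['title']}: {done}/{total} labeled ({pct:.1f}%)")

    if __name__ == "__main__":
        annotation_progress()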

One last pain point, though: I was still manually uploading our data into Label Studio, which was a huge time-suck.

At this point, we had already built a data ingestion pipeline for our core reporting software: our partner clinics forward DICOM studies into a GCP DICOM store using a protocol called DIMSE. From there, we just set up a Pub/Sub subscriber that called the GCP anonymization API, uploaded the anonymized data to our GCS bucket, called our latest model to generate a starting annotation, and called the Label Studio API to import the image.
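The subscriber itself is a thin callback around those four steps. Here’s a rough sketch of its shape rather than our production code: the project, subscription, and bucket names are placeholders, and deidentify_and_render, predict_planes_and_anatomy, and import_into_label_studio are hypothetical helpers wrapping the Cloud Healthcare API, our model endpoint, and Label Studio’s import endpoint:

    import json

    from google.cloud import pubsub_v1, storage

    # Placeholder infrastructure names.
    PROJECT_ID = "my-gcp-project"
    SUBSCRIPTION = "dicom-study-received"
    BUCKET = "annotation-staging"

    def handle_study(message):
        """Called once per DICOM-study notification from the DICOM store."""
        study = json.loads(message.data)
        # 1. De-identify the study (hypothetical helper wrapping the
        #    Cloud Healthcare API's deidentify operation).
        image_bytes = deidentify_and_render(study)
        # 2. Stage the anonymized image in GCS for Label Studio to read.
        blob = storage.Client().bucket(BUCKET).blob(f"{study['uid']}.png")
        blob.upload_from_string(image_bytes, content_type="image/png")
        # 3. Pre-annotate with the latest model (hypothetical helper).
        prediction = predict_planes_and_anatomy(blob)
        # 4. Import into Label Studio as a task carrying the model's
        #    output as a prediction (hypothetical helper that POSTs to
        #    Label Studio's /api/projects/<id>/import endpoint).
        import_into_label_studio(blob, prediction)
        message.ack()

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)
    # Block forever, processing studies as they arrive.
    subscriber.subscribe(sub_path, callback=handle_study).result()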

Voilà! No more manual data munging to get our data annotated.

We’re still in the early stages of our project, but we already have a handful of clinics actively using our tool for reporting!

B-Stock Solutions

Senior Data Engineer

January 2020 - October 2021

B-Stock provides a marketplace that helps other companies liquidate their excess inventory. It’s a surprisingly large market, and B-Stock is the largest player. Retailers like Amazon and Costco have huge warehouses full of inventory, and when an item gets returned and can’t be resold right away, it starts to clutter up the warehouse.

When consumer returns pile up, retailers have two options:

  • Sell it to a liquidator like B-Stock
  • Dump it in a landfill

Prior to B-Stock, the liquidation industry was pretty insular: many small, private liquidators offered recovery rates as low as 2-3%, and some companies would even pay to have their inventory destroyed.

When I joined B-Stock, we had a small “data warehouse” built on top of MySQL, called “alldata”, fed by a one-off ETL pipeline that funneled data in from our application databases. The problem was that alldata held only a small subset of our application data, the pipeline was pretty buggy, and adding new data sources was really time-consuming.

Because of these limitations, most of the account managers had been running their own analytics through Excel and were manually generating reports for their clients. The report generation process was collectively costing the account management team almost 100 hours a week, and it was a huge bottleneck for the company.

I built a new data warehouse on top of BigQuery and set up Fivetran to sync data directly from our microservices’ databases into BigQuery. Some of our data, like auction manifests (stored as CSV files), couldn’t be ingested through Fivetran, so I wrote a few data pipelines to sync that data into BigQuery as well.
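For a sense of scale, a CSV load into BigQuery is only a few lines with the google-cloud-bigquery client. This is a simplified sketch with placeholder bucket and table names, not one of the actual pipelines:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # manifest files start with a header row
        autodetect=True,      # infer the table schema from the CSV
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Load every manifest staged in GCS into one warehouse table.
    load_job = client.load_table_from_uri(
        "gs://manifest-bucket/manifests/*.csv",
        "my-project.warehouse.auction_manifests",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish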

From there, I built a dashboard in our BI tool (Metabase) that let account managers generate client reports in a few clicks. In the end, we had a data warehouse accessible to the whole company, which really democratized the company’s data: anyone could build a visualization without having to download a bunch of CSV files, concatenate them, and then generate a report in Excel.

Bad Monkeys

Machine Learning Engineer

January 2019 - January 2020

After taking a bit of time off to visit family and work on personal projects, I joined Bad Monkeys as a machine learning engineer.

Bad Monkeys is a startup that builds tools and does consulting for the building information modeling industry (think CAD tools like Revit). Their vision was to automate some of the more repetitive and boring aspects of floor plan design.

One issue with floor plan design is that you have to label the various elements within a room (columns, walls, doorways, etc.) prior to construction. A draftsman or technician will often finish a room or building and then go back and label every element in it. This process is very tedious, and trying to automate it with traditional heuristic rules like “place the label as close as possible to the element, but with no overlap” fails to place labels in reasonable locations.

We decided to tackle this problem with computer vision. Unlike many computer vision problems that require human annotators to label training data, we were able to generate our own training data from completed floor plans. As long as you have a dataset of completed drawings, you can open them in a CAD program, zoom in on a specific element, and take a screenshot. You can then temporarily hide the element’s label and take another screenshot.

After that, you just diff the two screenshots to grab a bounding box for the label, and voilà: you have a fully labeled dataset to train your model!
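Assuming the two screenshots are pixel-aligned crops of the same view, the diffing step itself can be sketched in a few lines of numpy (the file paths and threshold here are placeholders):

    import numpy as np
    from PIL import Image

    def label_bounding_box(with_label_png, without_label_png, threshold=10):
        """Diff two aligned screenshots and return the changed region's
        bounding box as (x_min, y_min, x_max, y_max), or None if the
        screenshots are effectively identical."""
        a = np.asarray(Image.open(with_label_png).convert("L"), dtype=np.int16)
        b = np.asarray(Image.open(without_label_png).convert("L"), dtype=np.int16)
        changed = np.abs(a - b) > threshold  # pixels the label occupied
        ys, xs = np.nonzero(changed)
        if xs.size == 0:
            return None
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())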

Obviously we scripted all of this, but the end result was a model that could annotate potential label locations in a floor plan very effectively. We plugged this into Revit and boom, we had a fully automated floor plan element tagging tool!

We worked on a few different ML projects while I was there. Not all of the projects panned out as well as this one, but it was a great experience and I learned a ton about modeling and computer vision in the process.

IBM Cloud

Data Engineer

May 2015 - July 2018

I started my career at IBM Cloud. My team built and maintained the account management, billing, and analytics systems.

We wrote almost everything using Node, Express, CouchDB, and React when a frontend was required. Over the course of my three years there I had my hands in many pies, including:

  • Email notifications for outages and customer-lifecycle events.
  • Data pipelines for funneling behavioral and billing data into our analytics systems.
  • An event-based messaging system for any events that happened on the cloud platform.
  • Managing onboarding and placement of our intern cohorts.

One issue we had at the time was that nearly all of our microservice infrastructure was built on top of Cloudant, IBM’s CouchDB-as-a-service offering.

Whenever Cloudant had an outage (which happened fairly often at the time), many of our microservices would go down. Outages were usually localized to a specific Cloudant cluster, but all of our microservices were hosted on that same cluster.

To mitigate these downtime issues, I set up a secondary backup cluster with bi-directional replication so we could fail over in the event of an outage. I added a proxy in front of the clusters that routed traffic to the healthy cluster, failing over to the backup whenever the primary was down. This was a huge improvement for our uptime and allowed us to keep our SLAs.
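In CouchDB, bi-directional replication boils down to a pair of continuous replication documents, one in each cluster’s _replicator database. A minimal sketch, with placeholder hostnames, credentials, and database names:

    import requests

    # Placeholder cluster URLs (credentials embedded for brevity).
    PRIMARY = "https://admin:secret@primary.cloudant.example.com"
    BACKUP = "https://admin:secret@backup.cloudant.example.com"

    def replicate(source, target, db, doc_id):
        """Create a continuous replication doc in the source's _replicator db."""
        resp = requests.put(
            f"{source}/_replicator/{doc_id}",
            json={
                "source": f"{source}/{db}",
                "target": f"{target}/{db}",
                "continuous": True,     # keep replicating as writes arrive
                "create_target": True,  # create the db on the target if missing
            },
        )
        resp.raise_for_status()

    # One replication in each direction makes the pair bi-directional.
    replicate(PRIMARY, BACKUP, "accounts", "accounts-to-backup")
    replicate(BACKUP, PRIMARY, "accounts", "accounts-to-primary")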

During my time at IBM I became really fascinated with machine learning, particularly reinforcement learning, and began to study it on the side. After a few years, I decided to leave IBM to work full-time on machine learning.

Education

State University of New York at Oswego

BA Computer Science & BS Applied Mathematics

2012 - 2015

I started programming on my TI-84 in middle school, writing text-based games and algebra solvers. My high school didn’t offer programming classes, so when I finally graduated and started college I was starving to learn more. I was also addicted to mathematics, spending many evenings in the math lab helping other students with their homework.

College was an amazing experience for me. It was the first time that I felt truly independent. I finally had the freedom to pursue programming, and I was surrounded by people with the same interests and passions as me!

To help pay for college, I worked as a resident assistant and math and computer science tutor. I joined the math and computer science clubs, eventually becoming the VP of the Computer Science Association.

I won first place in the CS2 programming competition. The prize was a Raspberry Pi which I used to build a gym capacity indicator.

It’s difficult to convey how special my time at Oswego was. I met some of my best friends there and I learned a lot about myself. Being in an environment where everyone is passionate about learning brought out the best in me.

A Little More About Me

When I’m not programming, you’ll probably find me at the climbing gym or on a weekend climbing trip to Bishop, CA. I love traveling and working from new places!

I’m a huge audiobook addict and I’m working through Nick Lane’s The Vital Question right now.

If you’re interested in sci-fi, definitely check out Andy Weir’s Project Hail Mary: 10/10 science fiction!

Each December, my partner and I work through Advent of Code problems together.