Machine Learning Pipelines - Part II

In Part I, we wrote a rudimentary version of the pipeline we’ll be using for the webapp.

It's not a pipeline

In this post, we’ll focus on acquiring an initial dataset and on extracting and generating features.

Writing Better Questions

Following along with the book, we want to build an editor that helps its users write better questions. Before we go and build a model, the first step is playing around with the data. That raises the question: what kind of dataset should we be looking at?

Some good places to find datasets are Kaggle, the UCI Machine Learning Repository, and the AWS Registry of Open Data.

For our use case, we’ll go with a Stack Exchange data dump, specifically the Writers dump, available here.

Each of these dumps is an XML file in which every row element stores the actual information we need in its attributes. We need to extract the raw text from those attributes, and this is where we’ll write a basic pipeline to get the data we want.
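As a rough sketch of that extraction step, the snippet below parses a Posts.xml-style dump with the standard library and pulls the title and body text out of each row’s attributes. The `sample` string is a hypothetical fragment mimicking the dump format, and the regex-based tag stripping is a deliberately crude stand-in for a proper HTML parser:

```python
import re
import xml.etree.ElementTree as ET


def extract_posts(xml_text):
    """Yield (title, body_text) pairs from a Stack Exchange Posts.xml dump.

    Each <row> element keeps its data in attributes; the Body attribute
    holds escaped HTML, which the XML parser decodes for us, so we only
    need to strip the remaining tags to get raw text.
    """
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        title = row.attrib.get("Title", "")
        body_html = row.attrib.get("Body", "")
        body_text = re.sub(r"<[^>]+>", " ", body_html)   # crude tag stripper
        body_text = re.sub(r"\s+", " ", body_text).strip()
        yield title, body_text


# Hypothetical fragment in the shape of a real dump file
sample = (
    '<posts>'
    '<row Id="1" PostTypeId="1" Title="How do I start?" '
    'Body="&lt;p&gt;I want to write a novel.&lt;/p&gt;"/>'
    '</posts>'
)

for title, body in extract_posts(sample):
    print(title, "->", body)  # prints: How do I start? -> I want to write a novel.
```

For a full dump you’d want `ET.iterparse` instead of `fromstring` so the file streams instead of loading into memory at once, but the attribute-walking logic stays the same.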