Machine Learning Pipelines - Part II
In Part I, we wrote a rudimentary version of the pipeline we’ll be using for the webapp.
In this post, we’ll focus on acquiring an initial dataset and on extracting and generating features.
Writing Better Questions
Following along with the book, we want to build an editor that helps its users write better questions. Before we build a model, the first step is playing around with the data. That raises the question: what kind of dataset should we be looking at?
Some good places to find datasets are Kaggle, the UCI Machine Learning Repository, and the AWS Open Data registry.
For our use case, we’ll go with the Stack Exchange data dump, specifically the Writers dump available here.
Each of these dumps is an XML file whose rows store the information we need in their attributes. We need to extract the raw text, and this is where we’ll write a basic pipeline to pull out the data we need.
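As a minimal sketch of that extraction step, the snippet below parses a dump-style XML string with the standard library and pulls the raw text out of each row's `Body` attribute. The `extract_posts` helper and the `sample` string are illustrative, not part of the real pipeline; in an actual dump the posts live in a file such as `Posts.xml`, and the `Body` attribute holds HTML markup that we strip with a simple regex.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape


def extract_posts(xml_text):
    """Return the raw text of each <row> in a Stack Exchange-style dump.

    The dump stores its data in XML attributes: Id, Title, and an
    HTML-formatted Body that we unescape and strip down to plain text.
    """
    root = ET.fromstring(xml_text)
    posts = []
    for row in root.iter("row"):
        body = unescape(row.attrib.get("Body", ""))
        # Drop HTML tags and collapse whitespace to get the raw text.
        text = " ".join(re.sub(r"<[^>]+>", " ", body).split())
        posts.append({
            "id": row.attrib.get("Id"),
            "title": row.attrib.get("Title", ""),
            "text": text,
        })
    return posts


# A hypothetical two-line stand-in for a real Posts.xml dump.
sample = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I start?"
       Body="&lt;p&gt;What is a good first step?&lt;/p&gt;" />
</posts>"""

print(extract_posts(sample))
# → [{'id': '1', 'title': 'How do I start?', 'text': 'What is a good first step?'}]
```

For a full-size dump you would swap `ET.fromstring` for `ET.iterparse` so the file streams instead of loading into memory at once, but the attribute-extraction logic stays the same.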