Data Science Altitude for This Article: Camp Two. As we concluded our last post, we used Azure ML Studio to attach a Logistic regression model to some data for a hypothetical electrical grid (11 predictor variables). We had a first look at the output which scores whether our response variable as adequately predicted, and promised a dive into the scoring. We’ll get to that, but first I’d like to put together another, competitive model to compare and contrast to our Logistic regression efforts.
Data Science Altitude for This Article: Camp Two. Splitting the Data Azure ML Studio gives us a handy way to split our data and provides some alternatives in doing so. I’ll mark 80% of our data for use in training the model, and 20% to use in scoring it with an eye towards which model is better when encountering new data. With our split of categories to predict being roughly 60/40, there’s little to gain from ensuring that the division of data into 80% train / 20% test keeps to a consistent 60/40 split along those category percentages.
Data Science Altitude for This Article: Camp Two. Our prior posts set the stage for access to MS Azure’s ML Studio and got us rolling on data loading, problem definition and the initial stages of Exploratory Data Analysis (EDA). Let’s finish off the EDA phase so that in our next post, we can get to evaluating the first of two models we’ll use to forecast stability - or the lack thereof - for a hypothetical electrical grid.
Data Science Altitude for This Article: Camp Two. We left our prior post with covering what’s involved for you to set up a Microsoft Account, a survey of your access options (Quick Evaluation / Most Popular / Enterprise Grade), and a brief and basic tour of the Azure ML Studio Interface. We’re going to head there shortly and get a big whiff of the drag-and-drop catnip, but first, let’s discuss the particular problem I’d like to throw at it and set the stage for the next several posts.
Data Science Altitude for This Article: Camp Two. While working my way through the code in Wei-Meng Lee’s excellent Python Machine Learning, I ran into Chapter 11, titled Using Azure Machine Learning Studio. Sporting a drag-and-drop interface, you can tackle many Data Science problems without the need to write a line of code. You’re not absolved from knowing why you’re doing what you’re doing, of course, but you can try out some problem-solving proofs-of-concept in short order.
Data Science Altitude for This Article: Camp Two. So, all the pieces on the chessboard are in their strategic locations. We’ve identified a set of papers from which we want to identify thematic intent, taking The Federalist Papers directly from the Project Gutenberg site. We’ve cleaned them up, removing common words and metadata, and have formatted them into a DocumentTermMatrix. We then pulled that object into a Latent Dirichlet Allocation (LDA) model as defined in the topicmodels package and took a look at some of the high-level mathematics involved and the resulting object’s composition.
Data Science Altitude for This Article: Camp Two. Previously, we created a DocumentTermMatrix for the express purpose of its fitting in nicely with our upcoming LDA model formation. Here, we’ll discuss LDA in a bit of detail and dive into our findings. But first, a brief refresher on the format and content of the DT matrix. The word count for the first ten stemmed words out of the first eight documents and their aggregation for documents 9-85 are seen below:
Data Science Altitude for This Article: Camp Two. Previously, we removed a bunch of metadata from The Federalist Papers that was introduced from its being hosted by the team at The Gutenberg Project. After that, we took out much of the intra-document metadata that was explanatory in nature to each of the 85 essays. Now, our goal is to polish off the metadata removal and transition the original unstructured data into object types that are more conducive to numerical analysis.
Data Science Altitude for This Article: Camp Two. Our last post set the stage for what it’s going to take for us to end up at our desired conclusion: a programmatic assessment of topics gleaned from The Federalist Papers. Before we can throw some fancy mathematics at the subject matter, we have to get the data to the point where it’s conducive to analysis. In this era of data coming at us from multiple sources and in structured and unstructured formats, we have to be versatile in our coding skills to deal with data in whatever manner it comes.
Data Science Altitude for This Article: Camp Two. For our next post, we tackle something a little more difficult than in the past posts: the use of Probabilistic Topic Modeling for thematic assessment in literature. Got a lot of documents that you’re trying to condense down into manageable themes? From an academic perspective, you might be interested in thematic constructs in one or more of Mark Twain’s works, perhaps. Or from a socio-political perspective - and something a little more present-day - you might want to identify recurring themes in political speeches and see how they track among candidates for a particular office.