Text Mining

Probabilistic Topic Models and Latent Dirichlet Allocation: Part 4

From Data Formatting to Model Formation and Object Characteristics. All this effort is about to pay off...

Data Science Altitude for This Article: Camp Two. Previously, we created a DocumentTermMatrix for the express purpose of its fitting in nicely with our upcoming LDA model formation. Here, we’ll discuss LDA in a bit of detail and dive into our findings. But first, a brief refresher on the format and content of the DT matrix. The word count for the first ten stemmed words out of the first eight documents and their aggregation for documents 9-85 are seen below:

Probabilistic Topic Models and Latent Dirichlet Allocation: Part 3

From Data Cleaning to Data Formatting: Finding a statue within a block of marble.

Data Science Altitude for This Article: Camp Two. Previously, we removed a bunch of metadata from The Federalist Papers that was introduced from its being hosted by the team at The Gutenberg Project. After that, we took out much of the intra-document metadata that was explanatory in nature to each of the 85 essays. Now, our goal is to polish off the metadata removal and transition the original unstructured data into object types that are more conducive to numerical analysis.

Probabilistic Topic Models and Latent Dirichlet Allocation: Part 2

Data Cleaning. A dirty job, but someone has to do it. Yes, I'm talking to you...

Data Science Altitude for This Article: Camp Two. Our last post set the stage for what it’s going to take for us to end up at our desired conclusion: a programmatic assessment of topics gleaned from The Federalist Papers. Before we can throw some fancy mathematics at the subject matter, we have to get the data to the point where it’s conducive to analysis. In this era of data coming at us from multiple sources and in structured and unstructured formats, we have to be versatile in our coding skills to deal with data in whatever manner it comes.