Data Cleaning

Probabilistic Topic Models and Latent Dirichlet Allocation: Part 3

From Data Cleaning to Data Formatting: Finding a statue within a block of marble.

Data Science Altitude for This Article: Camp Two. Previously, we removed the metadata that Project Gutenberg adds to its hosted copy of The Federalist Papers. After that, we took out much of the explanatory intra-document metadata attached to each of the 85 essays. Now, our goal is to finish the metadata removal and convert the original unstructured text into object types that are more conducive to numerical analysis.

Probabilistic Topic Models and Latent Dirichlet Allocation: Part 2

Data Cleaning. A dirty job, but someone has to do it. Yes, I'm talking to you...

Data Science Altitude for This Article: Camp Two. Our last post set the stage for what it’s going to take to reach our desired conclusion: a programmatic assessment of topics gleaned from The Federalist Papers. Before we can throw some fancy mathematics at the subject matter, we have to get the data to the point where it’s conducive to analysis. With data coming at us from multiple sources, in both structured and unstructured formats, our coding skills have to be versatile enough to handle it in whatever form it arrives.