Spam filtering techniques used for classifying healthcare job data

Roger Magoulas shared a story on O’Reilly Radar about how he’s using Bayesian classification, a technique widely used for categorization in applications like spam filtering, in a project for the Department of Health and Human Services:

“We are working with the US Department of Health and Human Services (HHS) on a project to look for trends in demand for jobs related to Electronic Medical Records (EMR) and Health Information Technology (HIT). The twist, and the reason we decided to build a classifier, is that we wanted to separate jobs for those using EMR systems from those building, implementing, running and selling EMR systems. While many jobs easily fit in one of the two buckets, plenty of job descriptions had duties and company descriptions that made classifying the jobs difficult even for humans with domain expertise.”

He goes on to describe how his team tweaked the Bayes algorithm to radically boost speed. The final result? “On the latest run, a random sample showed the classifier working with 92% accuracy.”
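
For readers who haven't seen the technique, here is a minimal sketch of a naive Bayes text classifier of the sort spam filters use, applied to the "using vs. building" split described above. It is not Magoulas's implementation and includes none of his team's speed tweaks. The class name `NaiveBayes`, the labels `user` and `builder`, and the four training sentences are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Start from the log prior: how common this label is overall.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples, invented for this sketch.
clf = NaiveBayes()
clf.train("nurse enters patient records in the EMR system", "user")
clf.train("medical assistant updates patient charts using EMR", "user")
clf.train("software engineer builds EMR system integrations", "builder")
clf.train("sales consultant sells and implements EMR software", "builder")

print(clf.classify("registered nurse updates patient charts"))
print(clf.classify("engineer implements software integrations"))
```

Working in log space avoids multiplying many tiny probabilities together. A real job-listing classifier would need far more training data and better tokenization than splitting on whitespace.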