The New York Times recently open-sourced Emphasis, the technology it uses to let readers link directly to individual paragraphs and sentences of each story on its site. The technology, including the clever fuzzy hashing mechanism it uses to create close-to-permanent identifiers for potentially changeable text, is described on the Times’ blog.
“The solution seemed clear: create a unique identifier or key for each paragraph. But how to do that? The identifiers need to be consistent for all readers, and they must remain intact when a paragraph is edited and a page is republished. So it’s not simply a matter of generating the identifiers on the back end — with that approach, you might end up with the insane idea of building a mini-CMS for managing paragraph keys. On the flip side, how do you generate the same key each time the piece of content changes?”
They solved the problem by knowing their data. For example, the New York Times’ developers know that paragraph order changes more often (e.g. when an update paragraph gets added to the beginning of the article) than sentence order within a given paragraph, and their fuzzy hashing mechanism reflects that.
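To make the idea concrete, here is a minimal Python sketch of one way such a content-derived key could work. It is an illustration under stated assumptions, not the Times' actual code: the key recipe (initials of the first three words of a paragraph's first and last sentences), the edit-distance tolerance of 2, and the names `paragraph_key`, `edit_distance` and `find_paragraph` are all hypothetical choices made for this example.

```python
import re


def paragraph_key(text, words_per_end=3):
    """Build a short key from the initials of the first few words of the
    paragraph's first and last sentences (hypothetical recipe)."""
    # Keep only letters, periods and spaces so punctuation edits don't matter.
    cleaned = re.sub(r"[^A-Za-z. ]+", "", text)
    # Naive sentence split on periods; each sentence becomes a list of words.
    sentences = [s.split() for s in cleaned.split(".") if s.strip()]
    if not sentences:
        return ""
    words = sentences[0][:words_per_end] + sentences[-1][:words_per_end]
    return "".join(word[0] for word in words)


def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def find_paragraph(key, paragraphs, max_distance=2):
    """Return the index of the paragraph whose key is closest to `key`,
    or None if nothing is within `max_distance` edits (assumed tolerance)."""
    best_index, best_distance = None, max_distance + 1
    for index, text in enumerate(paragraphs):
        distance = edit_distance(key, paragraph_key(text))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


original = [
    "The council voted on Tuesday. Members will meet again in May.",
    "Rain is expected all week. Officials urged residents to prepare.",
]
key = paragraph_key(original[0])

# The story is republished: an update paragraph is added on top and the
# linked paragraph is lightly edited.
updated = [
    "Update: the vote was postponed.",
    "The city council voted on Tuesday. Members will meet again in May.",
    original[1],
]
match = find_paragraph(key, updated)
```

Because the key is computed from the paragraph's own text rather than its position, the link survives the new paragraph at the top, and the fuzzy comparison absorbs the small edit. A position-based identifier such as "paragraph 1" would now point at the wrong text.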