Researcher’s Bleg: Looking for a technical solution to enable project & document management, collaboration and revising

UCSF researcher Ralph Gonzales writes to get our advice regarding his Wiki-Whiteboard (or Wiki-Noteboard). Here is his description of what he is looking for:

Version 1.0. Lives on my iPad. A handwriting recognition program that lets you organize documents into different notebooks (i.e., projects), and attach different types of documents (Word, PowerPoint, PDF, scanned documents, etc.) to different locations on different pages and notebooks. Think of the “Insert Comment” function in Word… for this we would have an “Insert Document” function. The mock-up/layout could actually resemble a Word document, except it’s my handwritten notes with documents inserted. It would be nice to be able to insert documents directly from different sources such as email folders, as well as the hard drive.

Version 2.0. Lives on a server, with all the same functions as above. Selected individuals could also access specific Noteboards and comment on the notes or attached documents using something similar to “Track Changes” in Word, via the “Insert Comment” function. Different individuals’ comments would appear in different colors.

Great question. Team, can we offer some ideas/recommendations?

Data anonymization: mission impossible?

Pete Warden discusses why anonymized social media datasets can be so easy to re-identify by matching them against other public data:

“[T]his anonymization process is an illusion. Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records. Arvind first demonstrated this when he and his fellow researcher took the “anonymous” dataset released as part of the first Netflix prize, and demonstrated how he could correlate the movie rentals listed with public IMDB reviews. That let them identify some named individuals, and then gave access to their complete rental histories. More recently, he and his collaborators used the same approach to win a Kaggle contest by matching the topography of the anonymized and a publicly crawled version of the social connections on Flickr. They were able to take two partial social graphs, and like piecing together a jigsaw puzzle, figure out fragments that matched and represented the same users in both.” (via)
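The core trick here is structural matching: even with names stripped, a node’s pattern of connections can fingerprint it. Below is a toy Python sketch of that idea; the graphs and names are invented, and the real attack uses far more robust, error-tolerant matching that propagates outward from an initial set of seed matches.

```python
def signature(graph, node):
    """Structural fingerprint: own degree plus the sorted degrees of neighbors."""
    neighbors = graph[node]
    return (len(neighbors), tuple(sorted(len(graph[n]) for n in neighbors)))

def match_nodes(anon_graph, public_graph):
    """Pair anonymous nodes with public identities sharing a unique signature."""
    public_sigs = {}
    for name in public_graph:
        public_sigs.setdefault(signature(public_graph, name), []).append(name)
    matches = {}
    for node in anon_graph:
        candidates = public_sigs.get(signature(anon_graph, node), [])
        if len(candidates) == 1:  # only claim unambiguous matches
            matches[node] = candidates[0]
    return matches

# "Anonymized" graph: the same topology as the public one, labels stripped.
anon = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3]}
public = {"alice": ["bob"], "bob": ["alice", "carol"],
          "carol": ["bob", "dave", "eve"], "dave": ["carol"], "eve": ["carol"]}

print(match_nodes(anon, public))  # → {1: 'alice', 2: 'bob', 3: 'carol'}
```

Nodes 4 and 5 stay ambiguous here because their fingerprints collide; in the published attacks, confirmed matches are used as anchors to disambiguate the rest of the graph, jigsaw-style.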


Open Source Genetics

We’re familiar with open source software and open source data.  Now it looks like we need to add open source molecular biology to the list.

The same concepts that led open source to shake up the software world have spawned the beginnings of a revolution in biotech. An organization called Biofab, funded by the NSF and run through teams at Stanford and Berkeley, is applying open development approaches to creating building blocks (BioBricks™, from the BioBricks Foundation) for the bio products of the future. Now the first of those building blocks, based on E. coli, are rolling off the production line. This, according to the organizers, represents “a new paradigm for biological research.” (via)


Compelling Video Describes New Visualization Tool “Many Eyes”

It can be challenging to create animated video that conveys a complex message. Here is a great example that shows it’s doable – mind you, without a single spoken word.

A 60 second social story about developing and refining ideas, gaining insight and sharing through community; all based on the premise that many sets of eyes are better than one!

Take a look and let me know what you think. By the way, the visualization tool “Many Eyes”, developed by IBM, is worth a look as well.

Measuring scholarly impact, beyond citation scores

How do you track scholarly impact beyond citation-counting? Princeton computer scientists Sean Gerrish and David Blei developed a model based on the hypothesis that the most influential publications change the mix of terminology used in subsequent work in the field, testing it on corpora from Nature, PNAS, and other journals:

“Identifying the most influential documents in a corpus is an important problem in many fields, from information science and historiography to text summarization and news aggregation. Unfortunately, traditional bibliometrics such as citations are often not available. We propose using changes in the thematic content of documents over time to measure the importance of individual documents within the collection. We describe a dynamic topic model for both quantifying and qualifying the impact of these documents. We validate the model by analyzing three large corpora of scientific articles.” (via)

For example, they show how, after the publication of “Molecular cloning of a cDNA encoding human antihaemophilic factor” in 1984, terms frequent in that highly cited paper (e.g., “expression” and “blot”) became much more commonplace in the field. This content-based approach makes an interesting supplement to bibliometric approaches that rely primarily on author-generated citations.
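Gerrish and Blei’s model is a full dynamic topic model; a much cruder Python sketch of the underlying intuition (with invented toy corpora) is to score a paper by how much the corpus-wide frequency of its terms rises after its publication:

```python
from collections import Counter

def term_freqs(docs):
    """Relative frequency of each term across a list of tokenized documents."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def influence_score(doc, before_docs, after_docs):
    """Crude proxy for influence: average rise in corpus frequency of the
    document's terms, comparing work published after it to work before it.
    (The real model is a dynamic topic model, not this simple difference.)"""
    before = term_freqs(before_docs)
    after = term_freqs(after_docs)
    terms = set(doc)
    return sum(after.get(t, 0.0) - before.get(t, 0.0) for t in terms) / len(terms)

# Hypothetical toy corpora: "expression" and "blot" spread after the paper.
paper = ["expression", "blot", "cdna", "factor"]
earlier = [["gene", "assay"], ["protein", "assay", "gene"]]
later = [["expression", "blot", "gene"], ["expression", "cdna", "blot"]]

print(round(influence_score(paper, earlier, later), 3))  # → 0.208
```

A paper whose vocabulary fades rather than spreads would score negative under this measure, which is roughly the qualitative behavior the dynamic topic model captures with far more statistical care.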


Scientists, Social Media, and Web 2.0

Here are two interesting postings regarding science and the “new web”.

First, how do most labs view the use of social media?  Not very highly, if you believe the results from a recent survey by Lab Manager Magazine:

Laboratories are at the forefront of research and analysis. But when it comes to communication, they are followers rather than leaders and can be very slow to adopt innovations. The use of social media is a case in point, as a recent survey of nearly 200 lab managers revealed. There are six good reasons for labs to explore the opportunities offered by the social media…

This could also be part of a bigger trend, which some say is the demise (or maybe transition) of Science 2.0. As David Crotty argues in “Not with a Bang: The First Wave of Science 2.0 Slowly Whimpers to an End”:

The Nature Network launched in 2006, organized around researchers in Boston, then went global in 2007, five years ago. It perhaps offered the high-water mark in terms of the irrational exuberance by publishers and other companies in building big Web 2.0 tools for scientists. For a time, the widespread adoption of these tools seemed inevitable, and business models were an afterthought when investing in revolutionary new technologies.

Five years on, reality has reared its ugly head, and, as is often repeated here at the Scholarly Kitchen, culture has trumped technology. It turns out that what works well for some cultures does not immediately translate into success in others. Rather than focusing on the needs of the research community, much of what passed for Science 2.0 was an attempt to force science to change — to make the culture adapt to the tools rather than the other way around.

Do we see either of these phenomena in our day-to-day interactions?

A GitHub of science

A conversation on Quora about scientists’ favorite online tools led to several suggestions for tools scientists wish existed. The most popular was Marius Kembe’s idea:

“Github for scientists – a distributed hosting and version control system for all parts of scientific communication, including writing, code, data, and audio/video/images. So that you could build on somebody else’s work by versioning it! Isn’t that what science is meant to be about?”

As a GitHub user in non-biomedical domains, this makes a lot of sense to me. Marius went on to describe the idea further on his blog:

“GitHub is a social network of code, the first platform for sharing validated knowledge native to the social web…I believe it represents a demonstrably superior way of distributing validated knowledge than academic publishing. How are these even related? Software developers rarely write applications from scratch. Instead, they often start with various modular bundles of open source code…Scientists never begin a research project from an intellectual vacuum. They stand on the shoulders of giants, building on the knowledge contained in previous publications to form a new, coherent finding…Gems on GitHub are not just code.  They also have authors whose relative contributions are automatically catalogued…This impact graph can let you know precisely which developers are responsible for this awesome-ness…By contrast, current Open Science efforts that ask scientists to ‘share all your data’ have not become mainstream, because they do not appropriately reward knowledge producers.”


Database replication for global health applications

Can solid database replication support have global health impacts? Global health tech company Dimagi discusses how they use CouchDB (a NoSQL document-oriented database) for health data management in rural Zambia:

“We’ve got computers at clinics that are maintaining patient records…None of these clinics have Internet out of the box, so most of the time our only Internet connection is through a GSM modem that connects over the local cell network. It’s very hard to move data in that environment, and you can’t do anything that relies on an always-on Internet connection with a web app that is always accessing data remotely…CouchDB was a really good option for us because we could install a Couch database at each clinic site, and then that way all the clinic operations would be local. There would be no Internet use in terms of going out and getting the patient records, or entering data at the clinic site. Couch has a replication engine that lets you synchronize databases — both pull replication and push replication — so we have a star network of databases with one central server in the middle and all of these satellite clinic servers that are connecting through that cell network whenever they’re able to get on, and sending the data back and forth. That way we’re able to get data in and out of these really remote, rural areas without having to write our own synchronization protocols and network stack.” (via)
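CouchDB exposes that replication engine over plain HTTP. Here is a minimal Python sketch of how a clinic node might push and pull against a central server via the `_replicate` endpoint; the host and database names are hypothetical, and the POST of course requires a reachable CouchDB server.

```python
import json
from urllib import request

def replication_doc(source, target, continuous=True):
    """Build the JSON body CouchDB's /_replicate endpoint expects.
    For push replication the source is local and the target is remote;
    swap them for pull replication."""
    return {"source": source, "target": target, "continuous": continuous}

def trigger_replication(couch_url, doc):
    """POST a replication request to a CouchDB server (needs a live server)."""
    req = request.Request(
        couch_url + "/_replicate",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Star topology: each clinic pushes local records out and pulls updates in
# whenever the GSM link comes up. Host names here are invented.
central = "https://central.example.org/patients"
push = replication_doc("patients", central)  # clinic -> central server
pull = replication_doc(central, "patients")  # central server -> clinic
# trigger_replication("http://localhost:5984", push)  # run when online
```

The appeal in a low-connectivity setting is exactly what Dimagi describes: reads and writes stay local, and synchronization happens opportunistically whenever the link is up, with CouchDB handling the retry and conflict bookkeeping.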


Enhancing linking, sentence by sentence

Deep links to content enhance users’ ability to discuss and comment on specific parts of a site, and that extends to elements within a page.

The New York Times recently open-sourced Emphasis, the technology it uses to let readers link directly to individual paragraphs and sentences of each story on its site. The approach, including the clever fuzzy hashing used to create close-to-permanent identifiers for text that may change, is described on their blog.

“The solution seemed clear: create a unique identifier or key for each paragraph. But how to do that? The identifiers need to be consistent for all readers, and they must remain intact when a paragraph is edited and a page is republished. So it’s not simply a matter of generating the identifiers on the back end — with that approach, you might end up with the insane idea of building a mini-CMS for managing paragraph keys. On the flip side, how do you generate the same key each time the piece of content changes?”

They solved the problem by knowing their data. For example, the New York Times’ developers know that it is more common for paragraph order to change (e.g., when an update paragraph is added to the top of an article) than for sentence order to change within a given paragraph, and their fuzzy hashing scheme reflects that.
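A simplified sketch of that kind of fuzzy key in Python: fingerprint each paragraph by the first letters of the words in its first and last sentences, so edits in the middle of a paragraph, and reshuffling of other paragraphs, usually leave the key intact. This is a toy version in the spirit of Emphasis, not the Times’ actual implementation.

```python
import re

def sentence_key(sentence, length=6):
    """First letter of each word, capped at `length` words; small edits
    elsewhere in the sentence leave this fingerprint intact."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return "".join(w[0] for w in words[:length])

def paragraph_key(paragraph):
    """Fingerprint a paragraph by its first and last sentences, so the key
    survives edits to the middle of the paragraph and reordering of other
    paragraphs on the page."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return sentence_key(sentences[0]) + sentence_key(sentences[-1])

para = ("Deep links are useful. They let readers point at a sentence. "
        "Fuzzy keys keep them stable.")
print(paragraph_key(para))  # → "DlauFkkts"

# Editing the middle sentence does not change the paragraph's key.
edited = para.replace("point at", "link to")
print(paragraph_key(edited) == paragraph_key(para))  # → True
```

The trade-off is that a key is a heuristic, not a guarantee: if the first or last sentence itself is rewritten, a consumer of the link has to fall back to nearest-match logic, which is where the "fuzzy" part of the real system earns its name.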