Data sharing licenses to avoid

So you want to share scientific data, but which license should you use? The Panton Principles have something to say:

“Many widely recognized licenses are not intended for, and are not appropriate for, data or collections of data. A variety of waivers and licenses that are designed for and appropriate for the treatment of data are described here. Creative Commons licenses (apart from CCZero), GFDL, GPL, BSD, etc are NOT appropriate for data and their use is STRONGLY discouraged.”

Instead, the Panton Principles recommend the four licenses conforming to the 11 requirements of the Open Knowledge Definition: the Open Data Commons Public Domain Dedication and Licence (PDDL), the Open Data Commons Attribution License, the Open Data Commons Open Database License (ODbL), and Creative Commons’ CC Zero license.

Database replication for global health applications

Can solid database replication support have global health impacts? Global health tech company Dimagi discusses how they use CouchDB (a NoSQL document-oriented database) for health data management in rural Zambia:

“We’ve got computers at clinics that are maintaining patient records…None of these clinics have Internet out of the box, so most of the time our only Internet connection is through a GSM modem that connects over the local cell network. It’s very hard to move data in that environment, and you can’t do anything that relies on an always-on Internet connection with a web app that is always accessing data remotely…CouchDB was a really good option for us because we could install a Couch database at each clinic site, and then that way all the clinic operations would be local. There would be no Internet use in terms of going out and getting the patient records, or entering data at the clinic site. Couch has a replication engine that lets you synchronize databases — both pull replication and push replication — so we have a star network of databases with one central server in the middle and all of these satellite clinic servers that are connecting through that cell network whenever they’re able to get on, and sending the data back and forth. That way we’re able to get data in and out of these really remote, rural areas without having to write our own synchronization protocols and network stack.” (via)
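The star-topology sync pattern Dimagi describes can be sketched in a few lines. This is a toy model, not CouchDB itself: a minimal document store with integer revisions, where the hub pulls each clinic's changes and pushes the merged state back out, so no clinic ever talks to another directly. All names here (`Store`, `sync_star`) are illustrative.

```python
# Toy model of CouchDB-style star replication: clinics hold local
# stores; a central hub pulls from and pushes to each one whenever
# the cell link happens to be up.

class Store:
    """A minimal document store: doc_id -> (revision, body)."""
    def __init__(self):
        self.docs = {}

    def put(self, doc_id, body):
        # Each write bumps the document's revision number.
        rev = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (rev, body)

def replicate(source, target):
    """One-shot replication: copy any doc that is newer on the
    source than on the target (or missing from the target)."""
    for doc_id, (rev, body) in source.docs.items():
        if target.docs.get(doc_id, (0, None))[0] < rev:
            target.docs[doc_id] = (rev, body)

def sync_star(hub, clinics):
    """Pull every clinic's changes into the hub, then push the
    merged state back out to every clinic."""
    for clinic in clinics:
        replicate(clinic, hub)   # pull replication
    for clinic in clinics:
        replicate(hub, clinic)   # push replication
```

Real CouchDB does the equivalent over HTTP (its `_replicate` endpoint takes a source and target database), with proper conflict tracking; the point of the sketch is only the topology: local reads and writes at each site, opportunistic sync through one hub.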

Link

Mining ClinicalTrials.gov Data

The ClinicalTrials.gov results database now offers summary trial data that were not previously available publicly. A new article, published in The New England Journal of Medicine, summarizes the updates, key issues, and limitations of the database. However, according to the authors, ClinicalTrials.gov is continually adding features and linkages to facilitate the use and repackaging of the data by different audiences. The article provides some good food for thought as we’re looking for additional public data sources to expand our research networking tool UCSF Profiles.

Turning Science Communication into a Dialogue

The Stanford School of Medicine managed to promote science stories broadly without issuing any press releases. At the national CTSA Communications Meeting, John Stafford, New Media Strategist at Stanford, shared some insights into how this worked.

Depending on the science story, they posted what’s newsworthy on their blog Scope, Twitter, Facebook, and Flickr, and – very important – they successfully leveraged informal relationships with their “blogger friends”. As a result, some of their stories made it into leading science magazines and newspapers.

But the story doesn’t end here: John also demoed a few online monitoring tools to measure media reach and brand leadership. These tools provide dashboards for monitoring how many and what types of media outlets pick up science stories, and even what attitudes readers have towards those stories. Here is a list of tools that might become useful to some of our organizational initiatives:

  • Radian6: Provides a platform to listen, measure and engage with customers across the entire social web.
  • ScoutLabs: A self-serve, web-based tool that includes natural language processing techniques for sentiment and tone scoring.
  • Sysomos Heartbeat: Provides constantly updated snapshots of online conversations.
  • General Sentiment: Media Measurement Dashboard, Reporting Service, and Data API.
  • Jive: Social media monitoring, engagement, and measurement.
  • Klout: Helps you identify people you might want to start a conversation with.
  • Cotweet
  • Tweetreach

For those who still seek more, Stanford will be hosting a social media conference “Medicine 2.0” in September this year.

Visualizations vary

Visualizations aren’t set in stone. Five network visualization developers recently did a cook-off at University College Dublin’s Visual Analysis of Complex Networks workshop, each using his or her own tools to represent a large dataset of college Facebook friendship data.

The results and conclusions are wildly different, reflecting the art and science of extracting and visualizing meaning.

Link

Unlocking the hospitalization algorithm

Heritage Provider Network (HPN) is launching the $3 million Heritage Health Prize, with help from data prediction contest operator Kaggle.

HPN is releasing anonymized patient health records, hospitalization records, and claims data. The team that can come up with the best algorithm to predict which patients have the greatest need for hospitalization wins the big bucks.

As they put it:

“More than 71 Million individuals in the United States are admitted to hospitals each year, according to the latest survey from the American Hospital Association. Studies have concluded that in 2006 well over $30 billion was spent on unnecessary hospital admissions. Each of these unnecessary admissions took away one hospital bed from someone else who needed it more…Can we identify earlier those most at risk and ensure they get the treatment they need? The Heritage Provider Network (HPN) believes that the answer may be “yes” – but to do it will require harnessing the world’s top experts from many fields. Heritage launched the $3 million Heritage Health Prize with one goal in mind: to develop a breakthrough algorithm that uses available patient data, including health records and claims data, to predict and prevent unnecessary hospitalizations.”

Link

Web registration for kids

Debra Gelman writes about designing web registration processes for 6-8 year olds in A List Apart. She shares fascinating stories and best practices. For example, many parents have trained children never to reveal anything about themselves online:

“As a result, kids are wary of providing any data, even information as basic as gender and age. In fact, many kids fib about their ages online. A savvy eight-year-old girl, when prompted by the Candystand site to enter her birthdate, said, ‘I’m going to put that I’m 12. I know it’s lying, but it’s ok because I’m not allowed to tell anyone on the internet anything real about me.’…Similarly, a seven-year-old boy refused to create a Club Penguin account because it asked for a parent’s e-mail address. ‘You can’t say anything about yourself on the web. If you do, people will figure out where you live and come to your house and steal your stuff.'”

Gelman goes on to share one example of how to collect innocuous non-identifying data (e.g. grade level) without triggering children’s anxieties about sharing personal information.

She also describes the importance of using images that are “simple, clear representations of common items that are part of a child’s current context,” while trying to avoid symbolic meanings:

“It’s important to note that while pictures are useful, symbols and icons can be problematic, because, at this age kids are just learning abstract thought. While adults realize that a video camera icon means they can watch videos, kids associate the icon with actually making videos. In a recent usability test evaluating popular kids’ sites, a six-year-old girl pointed out the video camera icon and said, ‘This is cool! It means I can make a movie here and share it with my friends.’ She wasn’t able to extrapolate the real meaning of the icon based on site context and content.”

The lesson is clear: know your users.

Read the article.

Enhancing linking, sentence by sentence

Deep links to content enhance users’ ability to discuss and comment on specific parts of a site — and that extends to intra-page elements.

The New York Times recently open-sourced Emphasis, the technology they use to allow users to link directly to individual paragraphs and sentences of each story on their site. The technology, including the clever fuzzy hashing mechanisms they use to create close-to-permanent identifiers for potentially changeable text, is described on their blog.

“The solution seemed clear: create a unique identifier or key for each paragraph. But how to do that? The identifiers need to be consistent for all readers, and they must remain intact when a paragraph is edited and a page is republished. So it’s not simply a matter of generating the identifiers on the back end — with that approach, you might end up with the insane idea of building a mini-CMS for managing paragraph keys. On the flip side, how do you generate the same key each time the piece of content changes?”

They solved the problem by knowing their data. For example, the New York Times‘ developers know that it’s more common for paragraph order to change (e.g. when an update paragraph gets added to the beginning of an article) than for sentence order to change within a given paragraph, and their fuzzy hashing mechanism reflects that.
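A minimal sketch shows the flavor of such a fuzzy key. This is not the Times’ exact algorithm (which their blog post documents), but it illustrates the core idea: derive the key from only the opening words of a paragraph’s first and last sentences, so that edits in the middle of a paragraph — the common case — leave the key unchanged. The function name `fuzzy_key` is illustrative.

```python
# A simplified Emphasis-style fuzzy key: hash only the first
# characters of the opening words of the paragraph's first and
# last sentences, so mid-paragraph edits don't break the link.

import re

def fuzzy_key(paragraph):
    # Split into sentences on terminal punctuation followed by space.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

    def sig(sentence):
        # First character of each of the sentence's first three words.
        words = re.findall(r"[A-Za-z0-9']+", sentence)
        return "".join(w[0] for w in words[:3])

    return sig(sentences[0]) + sig(sentences[-1])
```

Because only the first and last sentences contribute, a correction inserted mid-paragraph yields the same key, while a whole new paragraph gets a different one — which is exactly the robustness/uniqueness trade-off the quote describes.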

Spam filtering techniques used for classifying healthcare job data

Roger Magoulas shared a story on O’Reilly Radar about how he’s using Bayesian classification, a technique widely used for categorization in applications like spam filtering, in a project for the Department of Health and Human Services:

“We are working with the US Department of Health and Human Services (HHS) on a project to look for trends in demand for jobs related to Electronic Medical Records (EMR) and Health Information Technology (HIT). The twist, and the reason we decided to build a classifier, is that we wanted to separate jobs for those using EMR systems from those building, implementing, running and selling EMR systems. While many jobs easily fit in one of the two buckets, plenty of job descriptions had duties and company descriptions that made classifying the jobs difficult even for humans with domain expertise.”

He goes on to describe how his team tweaked the Bayes algorithm to radically boost speed. The final result? “On the latest run, a random sample showed the classifier working with 92% accuracy.”
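The two-bucket classification described above can be sketched with a plain multinomial naive Bayes classifier. This is a generic textbook implementation, not Magoulas’s actual code or data; the training snippets in the usage example below are invented placeholders standing in for real job descriptions.

```python
# A minimal multinomial naive Bayes text classifier of the kind
# used to sort job ads into two buckets ("using" EMR systems vs.
# "building" them), with add-one (Laplace) smoothing.

import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: iterable of (label, text). Returns model params."""
    word_counts = {}          # label -> Counter of word frequencies
    doc_counts = Counter()    # label -> number of training docs
    vocab = set()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior plus summed log likelihoods (add-one smoothing).
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

For example, trained on a handful of hand-labeled descriptions, the classifier can then score new postings; the real project’s hard cases are exactly the ads whose vocabulary mixes both buckets, which is why human-labeled training data and domain-specific tweaks matter.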