Data & Statistics for Journalists: Managing Data

Data Management

What is data management?

Data management is planning for the short-term and/or long-term care of and access to your data.

  • How will you describe your data?
  • How are you organizing your data?
  • Where will you store your data--while you use it (short-term), and when you've completed your project (long-term)?
  • How will you and others access it? For how long?

Why is data management important?

You may need your data in the future for many reasons:

  • Grant requirements (funding may depend on public access to data/results)
  • Validation, replication of results
  • Re-use (by others) or continued research (by you)
  • Ethical commitment to providing sources for your story

Without proper data management, your data might become inaccessible for a number of reasons:

  • Natural disasters
  • Computer failure, stolen laptop
  • USB or hard drives can break, fail, get lost or stolen, etc.
  • Corrupted files
  • No code book, metadata, or notes = "What is this file for? What does this variable measure? What does this code mean?"

File Organization & Naming

Name your files so that you can easily determine their contents, and organize them in directories (folders) on your computer so that they are easy to find. Identify the most important aspects of your files, and put them in the filename. Use the strategies below:

  • Be consistent, brief, and descriptive with filenames.
  • Track versions in the filename (v01, v02, FINAL).
  • Avoid spaces in filenames.
  • Suggested directory structure (and example):
    • /[Project]/[Event]/[Date]/
    • /MastersProject/Interviews/20141215/
  • Suggested filename method (and examples):
    • [description]_[instrument]_[location]_[YYYYMMDD]_[version].[ext]
    • JimSmith_interviewtranscript_Queens_20141205.doc
    • PetraArgos_interviewquestions_20141205_v02.doc
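
The naming pattern above can also be applied by a small script, so that every file follows it consistently. The following is a minimal Python sketch, not a required tool: the function names (clean, build_filename) and the choice to make location and version optional are assumptions made for illustration.

```python
import re
from datetime import date


def clean(part):
    """Drop spaces and any character other than letters, digits, and hyphens."""
    return re.sub(r"[^A-Za-z0-9-]", "", part)


def build_filename(description, instrument, when, ext, location=None, version=None):
    """Build [description]_[instrument]_[location]_[YYYYMMDD]_[version].[ext].

    Location and version are skipped when they are not supplied.
    """
    parts = [clean(description), clean(instrument)]
    if location:
        parts.append(clean(location))
    parts.append(when.strftime("%Y%m%d"))  # four-digit year, then month, then day
    if version is not None:
        parts.append(f"v{version:02d}")  # zero-padded: v01, v02, ...
    return "_".join(parts) + "." + ext.lstrip(".")


print(build_filename("Jim Smith", "interview transcript", date(2014, 12, 5), "doc",
                     location="Queens"))
# prints JimSmith_interviewtranscript_Queens_20141205.doc
```

Putting the date in YYYYMMDD order means that files sort chronologically when listed by name.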

File Size & Format

  • Are you using open formats (.txt, .csv) or proprietary formats (.doc, .xls, .mdb)? (Users without certain kinds of software may not be able to read proprietary files.)
  • How large are your files? Will this create problems for storage or user download?

Security

When accessing university information resources (or other limited-access information), you must take appropriate and necessary measures to ensure the security, integrity, and protection of these resources. Be vigilant and protective of the data in your custody!

  • Be careful about what your files reveal (about yourself and/or your investigation)! Check your metadata.
  • Does your data reveal identifying information about individuals (names, addresses, patient treatments, geo-location tags, etc.)?
    • If so, you may need to anonymize it.
    • Be cautious when transporting data (whether paper or electronic). Don't leave it in unsecured locations (cars, lockers).
    • Alternatively, your data may be for restricted use, limited to certain users who need access to that identifying information (for instance, school records). In that case, you should encrypt the data.
  • Don't upload data to cloud storage (Dropbox, Box.net) unless you use proper encryption!

Your data may require different security at different stages.
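
One common first step toward anonymizing a data set is to replace direct identifiers (such as names) with codes that cannot easily be reversed. The Python sketch below does this with a keyed hash from the standard library. The field names, the secret key, and the 12-character code length are all assumptions chosen for illustration. Note that this is pseudonymization rather than full anonymization: the remaining columns (addresses, dates, locations) may still identify a person, and anyone holding the key can re-create the codes.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Return a short, stable code for an identifier, using a keyed hash (HMAC-SHA256)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()[:12]


def anonymize_rows(rows, identifying_fields, secret_key):
    """Return copies of the rows with the identifying fields replaced by codes."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the original data is left untouched
        for field in identifying_fields:
            if field in new_row:
                new_row[field] = pseudonymize(new_row[field], secret_key)
        cleaned.append(new_row)
    return cleaned


# Hypothetical example data and key; keep a real key separate from the data.
rows = [{"name": "Jane Doe", "borough": "Queens"},
        {"name": "John Roe", "borough": "Bronx"}]
for row in anonymize_rows(rows, ["name"], b"keep-this-key-somewhere-safe"):
    print(row)
```

Because the same name always produces the same code, records for one person can still be linked across files without exposing the name itself.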

Storage, Retention, & Backup System

Your storage and access needs may vary throughout the lifetime of your project.

  • Active access: frequent additions and updates, and/or need to access frequently.
  • Archival access: data in fixed, final form. Only need periodic access.

What's your backup plan?

  • Test it to be sure it's working! Check that files can be restored, use checksum validation, and be sure backups are occurring regularly.
  • Using cloud storage? You may want to encrypt your data (unless it is from public sources).
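
Checksum validation means computing a fingerprint of each original file and of its backup copy, then confirming that the two match. The Python sketch below is one way to do this with the standard library; the function names and the choice of SHA-256 are assumptions made for illustration, and the folder names in the usage note are placeholders.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir, backup_dir):
    """Return the relative paths of files that are missing from, or differ in, the backup."""
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file() or sha256_of(source) != sha256_of(copy):
            problems.append(relative.as_posix())
    return problems
```

For example, verify_backup("MastersProject", "E:/Backup/MastersProject") returns an empty list when every file in the project folder has an identical copy in the backup folder.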

When archiving your data, consider:

  • How long does it need to be accessible? (one year, seven years, forever)
  • Who needs access to this data? (you, only a certain group, publicly accessible)

Access, Sharing, & Transparency

Document your data with a code book and metadata.

  • Label your variables, and take notes on their meaning, what they measure, what units you used, and any codes (1 = female, 2 = male).
  • Write out any interview or survey questions.
  • Explain your statistical analysis processes and/or your research methodology.
  • Provide a preferred data citation so that others can cite (and find!) your data. Example:
    • Pew Hispanic Center. (2008). 2007 Hispanic Healthcare Survey [Data file and code book]. Retrieved from: http://pewhispanic.org/datasets/
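
A code book does not need special software: a plain table saved in an open format (.csv) is enough. The Python sketch below shows one possible layout and a simple check for variables that have not been documented yet. The column headings (variable, label, units, codes) and the example variables are assumptions for illustration, not a standard.

```python
import csv
import io

# Hypothetical code book entries: one per variable in the data file.
CODEBOOK = [
    {"variable": "gender", "label": "Respondent gender", "units": "",
     "codes": "1 = female; 2 = male"},
    {"variable": "age", "label": "Respondent age at interview", "units": "years",
     "codes": ""},
]


def codebook_to_csv(entries):
    """Return the code book as CSV text, ready to save alongside the data file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["variable", "label", "units", "codes"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(data_columns, entries):
    """List the columns in the data that have no entry in the code book."""
    documented = {entry["variable"] for entry in entries}
    return [name for name in data_columns if name not in documented]


print(codebook_to_csv(CODEBOOK))
print(undocumented_columns(["gender", "age", "zip"], CODEBOOK))  # prints ['zip']
```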

To help others find your data (for re-use, validation, etc.), you can share it.

A Cautionary Tale

The importance of proper data management and sharing is underscored in the recent case of Michael LaCour, a grad student at UCLA who is now infamous for having faked an entire research study.

Saving Public Data

Why bother saving publicly available data?

Well, public data doesn't always stay put! The NYPD is notorious for replacing its precinct-level Crime Statistics on a weekly basis. Although the NYPD sends this information to the FBI, which archives it, the FBI makes it available to the public only at the county (borough) level, so the precinct data isn't there. Therefore, it's important to save this--and any--data for your own reference.
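
If a source replaces its files on a schedule, saving a dated copy each time builds your own archive. The Python sketch below is one way to do that with the standard library. The function names, the snapshot folder, and the file name are assumptions for illustration; the address of the file you want to save is left for you to supply.

```python
import hashlib
from datetime import date
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download the raw bytes found at a URL."""
    with urlopen(url) as response:
        return response.read()


def save_snapshot(content, name, ext, folder, when=None):
    """Save content as [name]_[YYYYMMDD].[ext], plus a .sha256 checksum file beside it."""
    when = when or date.today()
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{name}_{when:%Y%m%d}.{ext}"
    target.write_bytes(content)
    checksum = hashlib.sha256(content).hexdigest()
    (folder / (target.name + ".sha256")).write_text(checksum + "\n")
    return target
```

Calling save_snapshot(fetch(url), "precinct_crime_stats", "pdf", "snapshots") once a week, where url is the address of the current file, would leave one dated copy per week. Running it twice on the same day overwrites that day's copy.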