Best Practices & Definitions

Sources for Further Information on Best Data Practices


Data Citations

Data citations, like literature citations, give credit to the data creator or provider, improve the transparency of data sources, and link research to the data behind it. If you share a dataset, providing citation guidelines for it can help other researchers cite your work accurately, which can boost your publication metrics.

Find how-to information about data citations provided by the Digital Curation Centre (DCC) here.

Data Dictionaries

Data dictionaries are created, typically in the form of tables, to add documentation or detail about the values and variables present in research data spreadsheets.

Consider including the following information in your data dictionary:

  • A complete list of the parameter names used in the dataset

    • Use standardized naming across files and projects, when possible.

  • Description of each parameter

  • Units of measurement

    • When possible, use standard units. If using abbreviations in the dataset, spell out the complete units in the documentation.

  • Description of what a missing value signifies and how missing values are represented (e.g., -9999, n/a, FALSE, NULL, NaN, nodata, None). Leaving an entry blank may cause misregistration of the data in many applications.

  • Instrument used to collect the data

  • Meaning of acronyms that may be in the element value

  • An attribute/variable that describes data quality or certainty using coded values. Describe precision, accuracy, and uncertainty, and the quality control methods used. Some repositories may have standardized data quality levels.
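
As a minimal sketch, a data dictionary can also be kept in a machine-readable form so that a script can check a dataset against it. The Python example below is illustrative only: the parameter names, units, and missing-value codes are hypothetical and are not drawn from any particular dataset.

```python
# Hypothetical data dictionary: one entry per parameter in the dataset.
DATA_DICTIONARY = {
    "site_id": {
        "description": "Identifier of the sampling site",
        "units": None,
        "missing_value": "n/a",
    },
    "water_temp": {
        "description": "Water temperature at time of sampling",
        "units": "degrees Celsius",
        "missing_value": "-9999",
    },
}


def undocumented_parameters(header, dictionary):
    """Return column names that have no entry in the data dictionary."""
    return [name for name in header if name not in dictionary]


def decode_missing(row, dictionary):
    """Replace each parameter's documented missing-value code with None."""
    cleaned = {}
    for name, value in row.items():
        entry = dictionary.get(name)
        if entry is not None and value == entry["missing_value"]:
            cleaned[name] = None
        else:
            cleaned[name] = value
    return cleaned


header = ["site_id", "water_temp", "depth"]
print(undocumented_parameters(header, DATA_DICTIONARY))  # ['depth']
print(decode_missing({"site_id": "A1", "water_temp": "-9999"}, DATA_DICTIONARY))
```

Recording the missing-value code next to each parameter means a script never has to guess whether -9999 is a measurement or a placeholder.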


What this might look like in practice:

Example columns: Spreadsheet tab, Element or value display name, Character Length, Acceptable Values

Example entries:

  • 454db1 source: Database searched for RNA sequences

  • SGN source: Database searched for RNA sequences

  • All unigenes: Total number of unigenes from database searched

  • Date of database search

  • Standard Deviation: Standard deviation of length of RNA sequences





Sources for Further Information

Data Management Plans (DMPs)

Learn more about Data Management Plans (DMPs) here.

Data Repositories

Data repositories store datasets, along with information about those datasets, in a way that users can search. They are often databases made available online, and they tend to have an overall theme for the type of data they hold; for example, a repository may be relevant to a specific field of study or research methodology. There are also general or multidisciplinary data repositories that cover a wide variety of topics.

Learn more about data repositories and how to find one that meets your requirements here.

Data Use Agreements (DUAs)

A DUA is a legally binding agreement stipulating how data may or may not be used. You may need to sign a DUA in order to access a dataset, or you may want to create a DUA when providing your data to other parties.

At Yale, the Office of Sponsored Projects can assist you in creating or interpreting DUAs. Learn more here.


Digital Object Identifier (DOI)

A digital object identifier (DOI) is a unique alphanumeric string that identifies content and provides a persistent link to its location or representation on the internet. DOIs are assigned through registration agencies, such as Crossref and DataCite, that operate under the International DOI Foundation. DOIs can be applied to journal articles or datasets; publishers or repositories assign a DOI when materials are made available electronically.
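
Because a DOI is a persistent identifier, it can be turned into a link by appending it to the doi.org resolver address. The Python sketch below illustrates this. The function name is hypothetical, the pattern is a loose sanity check rather than a full validation of DOI syntax, and 10.1000/182 is used only as an example identifier.

```python
import re
from urllib.parse import quote

# Loose check: "10.", a numeric registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Return the https://doi.org link for a DOI string."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Does not look like a DOI: {doi!r}")
    # Percent-encode characters that are not safe in a URL path; keep "/".
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_to_url("10.1000/182"))  # https://doi.org/10.1000/182
```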

File Names

File names are, quite simply, the names you assign to the files you create and save, but they are also one of the easiest ways to capture metadata (data about data) about a file.

For example, if a file is called “2019-02-07_dog_playing.jpeg”, we can make an educated guess, without needing to open the file, that it is a photo of a dog playing that was saved on February 7, 2019.

Best practices in file name creation

  • Create unique file names that are descriptive and easy to understand.  

  • Use only alphanumeric characters, underscores, and dashes; avoid special characters such as ? / $ % & ^ # . \ : < >

  • Use underscores (_) and dashes (-) to represent spaces.

  • Use leading zeros with the numbers 0-9 (for example, 01 rather than 1) to facilitate proper sorting and file management.

  • Dates should follow the ISO 8601 standard of YYYY-MM-DD or YYYYMMDD. Variations include YYYY, YYYY-MM, YYYY-YYYY. This maintains chronological order.

  • Include the version number at the end of the file name by using ‘v’ or ‘V’ followed by the version number (example: 2019_Notes_v01.doc).
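
The naming practices above can be sketched as a small helper that assembles a file name from its parts. This Python example is one possible arrangement, not a required convention: the function name, its parameters, and the choice to put the date first are assumptions made for illustration.

```python
import re
from datetime import date


def build_file_name(description, saved_on, version, extension):
    """Build a name like 2019-02-07_dog_playing_v01.jpeg."""
    # Replace runs of spaces with underscores, then drop other special characters.
    words = re.sub(r"\s+", "_", description.strip())
    words = re.sub(r"[^A-Za-z0-9_-]", "", words)
    # ISO 8601 date first so names sort chronologically; zero-padded version last.
    return f"{saved_on.isoformat()}_{words}_v{version:02d}.{extension}"


print(build_file_name("dog playing", date(2019, 2, 7), 1, "jpeg"))
# 2019-02-07_dog_playing_v01.jpeg
```

Generating names from a single function keeps them consistent across a project, which is harder to guarantee when each name is typed by hand.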

File Versioning

Versioning involves tracking the changes you make to your data by saving new copies of data files with indicators of the changes made. This allows you to recognize and access older copies of files.

Add these features to your file names to track changes you make to your data:

  • Include a version number, e.g., "v1," "v2," or "v2.1".

  • Include information about the status of the file, e.g., "draft" or "final," as long as you don't end up with confusing names like "final2" or "final_revised".

  • Include information about what changes were made, e.g., "cropped" or "normalized".
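
As a rough illustration of version numbering, the Python sketch below computes the next file name from an existing one. It assumes a hypothetical convention in which the name ends with "_v" or "_V", a whole-number version, and a file extension. Decimal versions such as "v2.1" are not handled.

```python
import re

# Matches the digits of a trailing _vNN or _VNN just before the file extension.
VERSION_PATTERN = re.compile(r"_[vV](\d+)(?=\.[^.]+$)")


def next_version_name(file_name):
    """Return the file name with its trailing version number increased by one."""
    match = VERSION_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"No version number found in {file_name!r}")
    digits = match.group(1)
    # Keep the original zero padding, e.g. 01 becomes 02 and 09 becomes 10.
    bumped = str(int(digits) + 1).zfill(len(digits))
    return file_name[:match.start(1)] + bumped + file_name[match.end(1):]


print(next_version_name("2019_Notes_v01.doc"))  # 2019_Notes_v02.doc
```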



Metadata

Metadata is data that provides information about other data. Here are three commonly used types of metadata:

  1. Descriptive metadata: describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords.
  2. Structural metadata: indicates how compound objects are put together, for example, a data dictionary describing the structure of a data table, or an XML hierarchy.
  3. Administrative metadata: provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. This can include rights management metadata, which could detail the copyright or data use agreement of a dataset.

Here you can find more information about metadata provided by the National Information Standards Organization (NISO).
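
To make the three types concrete, the Python sketch below groups a metadata record by type and writes it out as JSON. Every field name and value here is hypothetical; a real project would normally follow the metadata standard used by its field or repository rather than this ad hoc layout.

```python
import json

# Hypothetical metadata record for one dataset, grouped by metadata type.
record = {
    "descriptive": {
        "title": "Example survey dataset",
        "author": "A. Researcher",
        "keywords": ["survey", "example"],
    },
    "structural": {
        "files": ["survey_v01.csv", "data_dictionary_v01.csv"],
    },
    "administrative": {
        "created": "2019-02-07",
        "file_type": "text/csv",
        "access": "research team only",
        "rights": "Covered by a data use agreement",
    },
}

# Serialize the record so it can be saved alongside the data files.
metadata_json = json.dumps(record, indent=2)
print(metadata_json)
```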

Research Data

Recorded factual material commonly accepted in the scientific community as necessary to document and support research findings. This does not mean summary statistics or tables; rather, it means the data on which summary statistics and tables are based. - NIH Data Sharing Policy and Implementation Guidance

Types of Research Data:

  • Observational data

  • Experimental data

  • Simulation data


Research Data Stages:

  • Raw data

  • Processed data

  • Intermediate data

  • Derived data

Research Data Does Not Include:

  • Summary statistics, tables, or visualizations

  • Physical objects such as gels or lab specimens


Research Data Management

The care and maintenance of data produced during research, through:

  • File and folder organization
  • Data backups
  • Applying appropriate security measures
  • Preserving the context and meaning of the data through documentation and metadata.

Speaking broadly, throughout your entire research process you should complete the following data management activities:

  1. Plan your data management efforts early, with a data management plan.
  2. Include data management costs in your application budget
  3. Use descriptive file naming conventions
  4. Store your data in multiple locations
  5. Define roles and assign responsibilities for data management within your research team  
  6. Identify and use relevant metadata standards
  7. Deposit your data into an appropriate repository
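
Storing data in multiple locations is only useful if the copies are intact. The Python sketch below copies a file into several backup directories and compares SHA-256 checksums to confirm that each copy matches the original. The function names are hypothetical, and temporary directories stand in for real backup locations such as an external drive or network share.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dirs):
    """Copy source into each backup directory and verify every copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, directory / source.name))
        if sha256_of(copy) != expected:
            raise IOError(f"Checksum mismatch for {copy}")
        copies.append(copy)
    return copies


# Usage with throwaway directories standing in for real backup locations.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "2019-02-07_results_v01.csv"
    original.write_text("id,value\n1,42\n")
    copies = back_up(original, [Path(tmp) / "drive_a", Path(tmp) / "drive_b"])
    print([copy.parent.name for copy in copies])  # ['drive_a', 'drive_b']
```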