
Supporting health science data users across the Yale medical campus through instruction, collaboration, and consultation.
Who We Are
Email healthdatalibrarian@yale.edu with data questions.
Find Data
Where to start your search…
Literature Search
Review academic literature, where many published studies generate and analyze data. Search for articles related to the population, problem, or methodology you are targeting. These articles should reference the data repositories, databases, or specific datasets used to conduct the research, and may detail where data has been deposited.
- Select data-specific search filters. For example, in PubMed, you can limit results to articles with Associated Data by selecting this filter under Article Attributes in the sidebar. You can also select Dataset as an Article Type. If Dataset doesn’t appear under Article Type in your sidebar, click the Additional Filters button and add it.
- Browse articles for sections such as Supplementary Materials, Data Availability Statement, or Data Citations to find attached data files and links to external data.
- Use data as a search term. For example, “nonalcoholic fatty liver disease AND data”.
- Search across all National Center for Biotechnology Information (NCBI) resources to see many National Library of Medicine resources at once, including literature and databases.
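The PubMed search described above can also be expressed as a URL for NCBI's E-utilities ESearch service. The sketch below only builds the URL and sends no request. The function name and its parameters are illustrative, and the `data[filter]` term is assumed to be the query-syntax form of the Associated Data sidebar filter; check PubMed Help before relying on it.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(topic, associated_data_only=True, max_results=20):
    """Return an ESearch URL for a PubMed query (no request is sent)."""
    term = topic
    if associated_data_only:
        # Assumption: "data[filter]" mirrors the Associated Data sidebar filter.
        term = f"({topic}) AND data[filter]"
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical usage with the example topic from the list above.
print(build_pubmed_search_url("nonalcoholic fatty liver disease"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return matching PubMed IDs as JSON, which you can then review for Data Availability Statements.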
Data Repositories
Find storage locations for datasets and information about those datasets (such as metadata and documentation):
- Use a data repository registry, such as re3data.org and fairsharing.org.
- Find data repositories by funder, such as the National Institutes of Health (NIH).
- Find data repositories by discipline, such as those recommended by Nature’s Scientific Data and PLoS ONE.
- Search generalist data repositories, such as Dataverse, Dryad, Figshare, Inter-University Consortium for Political and Social Research (ICPSR), Mendeley Data, Open Science Framework, Qualitative Data Repository (QDR), Synapse, Vivli, and Zenodo.
- Use a government data portal, such as data.gov and healthdata.gov.
Data at Yale
Find data that has been purchased by the library or departments, data generated by your colleagues, and data available through some other means. Data at Yale that may be of interest to health sciences researchers includes:
- Yale University Open Data Access (YODA) Project, available from Center for Outcomes Research and Evaluation
- American Hospital Association (AHA), available from Yale Library on WRDS
- Merative MarketScan Database (must be on Yale VPN to access the link), available from Yale Biomedical Informatics and Computing (YBIC) and Biomedical Informatics & Data Science (BIDS)
Data Creators
Consider who might be creating the data you need for research. This includes governing bodies, academic and research institutions, biomedical companies, and not-for-profit organizations, as well as news organizations and other invested groups.
Examples of these include:
- New York Times’ COVID-19 data on Github (read more about this data)
- Gapminder data (read more about Gapminder)
Data-Oriented Research
Some academic journals focus on publishing data and information about data science. Try searching one of the following:
- GigaScience: publishes all research objects from big data studies across the entire spectrum of life and biomedical sciences.
- Scientific Data: a peer-reviewed, open-access journal for descriptions of scientifically valuable datasets and research that advances the sharing and reuse of scientific data.
- Data in Brief: a multidisciplinary, open access, peer-reviewed journal, publishing short articles that describe and provide access to research data.
- Journal of Open Psychology Data: features peer-reviewed data papers describing psychology datasets with high reuse potential.
- Open Health Data: features peer-reviewed data papers describing health datasets with high reuse potential.
What to ask as you search…
Does the data help me answer my research question?
It’s important to formulate your research question before starting your data search, as your question can direct and inform your data search.
Does the data contain the variables I need?
Once you have a research question, you may start to understand what type of analysis you want to perform. Does the data you’ve found contain the variables you need to perform your analysis?
For example, if you’re studying brain disorders in young adults, you may be interested in data variables such as age at diagnosis, disorder type, and progression state. You may also be seeking brain scan images. Thinking about analysis in advance of your data search can help narrow down relevant datasets.
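A quick way to check a candidate dataset against your variable list is to compare its header row with the variables you need. This is a minimal sketch that assumes the data arrives as a CSV file with a header row; the column names and sample rows are hypothetical and stand in for a real download.

```python
import csv
import io


def missing_variables(csv_file, required):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip().lower() for name in header}
    return [name for name in required if name.lower() not in present]


# Hypothetical sample standing in for a downloaded dataset.
sample = io.StringIO(
    "patient_id,age_at_diagnosis,disorder_type\n"
    "001,24,epilepsy\n"
)
required = ["age_at_diagnosis", "disorder_type", "progression_state"]
print(missing_variables(sample, required))  # ['progression_state']
```

For a file on disk, pass an open file handle instead of the `io.StringIO` sample. Real datasets often name variables differently than you expect, so consult the data dictionary before ruling a dataset out.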
Is the data within the scope of my project?
The data you find may involve additional work on your part before it becomes useful — for instance, you may need to conduct data cleaning, curation, or analysis to answer your research question. You may also encounter obstacles to working with the data, such as licensing, wait times, and technical challenges.
It may take time and resources to acquire the data you need, so it’s important to recognize whether the data you’ve found fits your constraints, such as an upcoming deadline or budget limitation. For more about this, see our “(Re)Use Data” page.
Other Resources
- Definition of Scientific Data | Final NIH Policy for Data Management and Sharing
- Discovering Associated Data in PMC | NCBI Insights
- Data filters in PMC and PubMed | NLM Technical Bulletin
- Related Data | PubMed Help
- National Center for Data Services | National Library of Medicine
(Re)Use Data

What is data reuse, and what’s open data?
Data reuse is the final R of FAIR, a best practice standard for data sharing. FAIR stands for findable, accessible, interoperable, and reusable. Clear descriptions are at the heart of data reuse. When reusing data, look for well-described data, where the data’s context is apparent. When publishing data, ensure you document it thoroughly, well enough that others can understand and reuse it, and if possible and appropriate, consider publishing your data openly.
Open data is data that can be freely used, re-used, and redistributed by and to anyone as publicly available resources — definition adapted from Open Knowledge Foundation.
Why should you reuse data, and make your data reusable?
Reuse data to:
- Verify your own research.
- Mine the data for new insights.
- Work with data that matches a population or problem you’re interested in — without the overhead of data collection or generation.
- Increase collaboration by analyzing someone else’s data and connecting with like-minded data producers.
- Make open science a standard practice in medicine.
Make your data reusable to:
- Comply with funder mandates — such as NIH’s and NSF’s — and scientific transparency standards.
- Allow others to verify and validate your findings, and potentially collaborate with you.
- Propagate the research cycle and fuel new discoveries, by allowing someone to derive new findings from your data.
- Contribute to the process of tracking scientific inquiry over time.
- Allow citizen scientists to view and interact with health sciences data about their own conditions.
- Align with open science aims as set forth by many professional and cultural organizations, including the UN, UNESCO, and the National Academies of Sciences, Engineering, and Medicine.
What potential challenges should you know about when reusing data?
- Licensing. Some data are licensed under certain terms — a common one is that you won’t attempt to re-identify research subjects — and some data require you to sign a data use agreement. Read licenses and other agreements/terms carefully, and ensure you and your research team can comply.
- Access. Sometimes, you have to fill out a data request form, or contact the creator(s) directly and ask for data access. This process can take time.
- Lack of context. As noted above under “What is data reuse, and what’s open data?”, data documentation is central to whether data can be reused. If you’re having difficulty understanding what data variables mean, or how the data was produced, you may not be able to reuse the data.
- Technical difficulties. Insufficient data storage or unfamiliar data formats can prevent you from accessing a dataset. Before deciding that this is a barrier, reach out to the Medical Library for help.
- Fees. Not all data are free. If the dataset is not yet in Yale’s collections, consider requesting we purchase it through this form. Additionally, sometimes similar datasets can be found for free. Consider consulting our “Find Data” page.
When reusing data, remember to be:
- Curious
- Critical
- Compliant
Additional Resources
- Ten Simple Rules for using public biological data for your research | PLOS Computational Biology
- A FAIR guide for data providers to maximise sharing of human genomic data | PLOS Computational Biology
- Best practices for creating reusable data publications | Dryad
- Your data can live forever: how to plan for data reuse | Mozilla
- A dataset describing data discovery and reuse practices in research | Scientific Data
Manage Data
What is research data management?
Research data management is the care and maintenance of data produced during research. It starts when your project starts, continues through the end of the project, and sometimes extends beyond it. It has many components, but in summary, it involves planning, organizing, documenting, storing, securing, assessing, citing, and sharing your data alongside your research.
Good research data management helps you:
- Stay compliant with institutional, funder, and publisher requirements
- Find, analyze, and reuse your own data — even within your own team
- Communicate your data to others
- Stay publication-ready
- Share your data for reuse
- Contribute to the scientific record
Essential Components of Data Management
- Plan for data management when you start your research project
- Organize your data (preferably according to a schema using established data and metadata standards)
- Document your data so that it can be understood in context later
- Store data with reuse and security in mind — keep original data files, use version control, and back up data in multiple locations
- Secure your data by following all cybersecurity protocols, based on your data’s risk
- Validate your data, and assess for data quality
- Share your data
- Cite your data
Learn more in this Research Data Management guide and consult the NIH Data Sharing Policy.
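One way to confirm that a backup still matches the original data file, in line with the storage guidance above, is to compare checksums. The sketch below uses SHA-256 from Python's standard library. The function names and file names are illustrative, and the demo writes throwaway files to a temporary folder rather than touching real data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """Return True when the backup is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(backup)


if __name__ == "__main__":
    # Demo with throwaway files; replace these with your own paths.
    with tempfile.TemporaryDirectory() as folder:
        original = Path(folder) / "raw_measurements.csv"  # hypothetical name
        backup = Path(folder) / "raw_measurements_backup.csv"
        original.write_text("subject,value\n001,3.2\n")
        shutil.copy(original, backup)
        print(backup_matches(original, backup))  # True: identical copy
        backup.write_text("subject,value\n001,3.3\n")
        print(backup_matches(original, backup))  # False: backup has changed
```

Recording the digest of each original file alongside your documentation lets you re-verify copies later without needing the original on hand.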
Yale Data Management Policies
Many of Yale’s pertinent policies are summarized below:
Research Data and Materials Policy
From Yale’s Office of the Vice Provost for Research, this policy applies to all research data and materials generated with Yale resources, and covers data ownership, retention, transfer, sharing, and access policies. Notable points include that (1) Yale owns the data and Yale researchers are responsible for managing it; (2) data and materials must be retained for at least three years after publication or final reporting; and (3) Yale researchers must make their data publicly available “to the extent feasible while minimizing harm.”
Data Classification Policy
From Yale’s Information Technology Services (ITS), this policy explains data risk level definitions and how to choose secure data systems based on the data’s risk level. For more assistance, read the policy guidelines and minimum security standards, and take the data classification questionnaire to determine your data’s risk.
Other Related Policies
Depending on the nature of your project, we also recommend you consult on relevant data policies with the following: Office of Sponsored Projects (OSP), Human Research Protection Program (HRPP – includes IRB and HIPAA policies as well), the University Privacy Office, and your funder (see below).
Funder Data Management Policies
Basic data management information is summarized below for several major funders. Most government agencies require a data management plan and data sharing upon project completion.
- United Kingdom Research & Innovation (UKRI) Councils
  - Data management plan required: Yes – for BBSRC
  - DMPTool template: No
Popular Data Management Tools
- DMPTool — Free for Yale users, this data management plan (DMP) generator has templates for most major funders, including NIH and NSF. DMPTool guides you through plan completion (e.g., with policy information, sample language, etc.), then allows for plan download in multiple formats. For those who choose to make their plan public, DMPTool lists these – this is great if you’re looking for sample plans to review!
- StorageFinder — This in-house Yale tool helps you find and compare data storage options at and across Yale.
- FAIRsharing.org — This website allows you to search for relevant data and metadata standards and policies across many subject areas.
- re3data.org — This registry of data repositories allows you to search for places to deposit data (and find data to reuse).
- Dryad — This digital repository enables finding and depositing of data. Yale is an institutional member of this service, which means you can deposit data in Dryad for free.
- LabArchives — Licensed by Yale and free for those with a Yale NetID, this cloud-based electronic lab notebook (ELN) allows users to store and manage data in one place.
- REDCap (for Yale medical campus in general | for Yale-New Haven Hospital) — A secure web application for building and managing online surveys and databases.
- YSM Grant Library — Based within the Office of Physician-Scientist and Scientist Development, the Yale School of Medicine Grant Library serves as a model of successful grantsmanship, and currently holds 100+ grants. Access to the library is restricted to Yale faculty, trainees, and students.
Additional Resources
- Data management made simple | Nature
- Ten simple rules for the care and feeding of scientific data | PLoS Computational Biology
- Ten simple rules for maximizing the recommendations of the NIH data management and sharing plan | PLoS Computational Biology
- Ten simple rules for creating a good data management plan | PLoS Computational Biology
- The FAIR guiding principles for scientific data management | Scientific Data
- Data organization in spreadsheets | American Statistician
- Selecting a data repository | National Institutes of Health
- Generalist repository comparison chart | Zenodo
- DataWorks! Help Desk Knowledge Base | FASEB
- RDMkit | ELIXIR
- Data Repository Finder | National Library of Medicine (NLM)
- Dataset Catalog | National Library of Medicine (NLM)
Work with Data
Self-Guided Learning
Looking for asynchronous ways of learning? Here are some of our favorite free resources for learning to work with data.
Python
- Learn Python the Right Way [online book] — A free, comprehensive guide to Python and programming in general. If you’ve taken “Getting Started with Python” at the library, you will already be familiar with Replit, which this book also uses for exercises.
- Python for Non-Programmers [LinkedIn Learning] — Access this free course, aimed at those new to programming, through Yale’s subscription to LinkedIn Learning. Like the above suggestion, this course also uses Replit.
- RealPython.com [online resource and tutorials] — An expansive collection of free Python tutorials, as well as other resources like forums, podcasts, and helpful articles.
- An Introduction to Programming for Bioscientists: A Python-Based Primer. [open-access journal article] — Read this PLOS Computational Biology article for a step-by-step guide to getting started with Python for biological and biomedical use. Full citation: Ekmekci B, McAnany CE, Mura C (2016) An Introduction to Programming for Bioscientists: A Python-Based Primer. PLOS Computational Biology 12(6): e1004867. https://doi.org/10.1371/journal.pcbi.1004867.
- Automate the Boring Stuff [online book] — A free, excellent introduction to all things automation, including web scraping, reminder applications, data formatting, auto-complete forms, and more.
- CS Dojo’s Python Tutorial for Absolute Beginners [YouTube videos] — If you prefer to learn through video, this is a great series.
- Python Documentation — Official Python docs are available at python.org, where you can also find a beginner’s guide and many additional resources. We also recommend the W3Schools Python Tutorial as supplementary quick-reference documentation and as a learning resource.
R
- SwirlStats [interactive tutorial and R package] — SwirlStats allows you to “Learn R, in R!” This interactive tutorial provides an immersive experience for learning R and data science concepts.
- R Programming [online course] — A comprehensive online Coursera course for getting up and running with R, R programming and troubleshooting, and simulation and profiling in R.
- R for Data Science [online book with exercises] — From RStudio’s Chief Scientist and the inventor of the concept of “tidy data” comes this book: the definitive guide to R, the Tidyverse, and how to use R for data science.
- Ten simple rules for teaching yourself R. [open-access journal article] — Read this PLOS Computational Biology article for a step-by-step guide to getting started with R on your own. Full citation: Lawlor J, Banville F, Forero-Muñoz NR, Hébert K, Martínez-Lanfranco JA, et al. (2022) Ten simple rules for teaching yourself R. PLOS Computational Biology 18(9): e1010372. https://doi.org/10.1371/journal.pcbi.1010372.
- R-Bloggers [online resource] — A blog aggregator for content about R, R programming, data science, and statistics. A great place to learn what’s new in R and find tutorials and guides on a variety of topics.
- R Documentation — Official R docs are available at r-project.org. We also recommend the W3Schools R Tutorial as supplementary quick-reference documentation and as a learning resource.
Data cleaning
What is data cleaning?
Data cleaning typically involves changing a dataset to adjust for information that is:
- Malformed (for example, incorrect, incomplete, inconsistent, corrupted, poorly formatted, etc.)
- Duplicated
- Missing
- Other (for example, outliers, irrelevant rows/columns, etc.)
See resources below for more information on what data cleaning is and how to do it.
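The first three categories above can be illustrated with a small, standard-library-only sketch. It assumes each row is a dictionary, that trimming whitespace and lowercasing text is an acceptable normalization for your data (it often is not for identifiers or free text), and that rows with missing required values should be set aside for review rather than deleted. The function name and sample rows are hypothetical.

```python
def clean_records(records, required_fields):
    """Normalize text, drop exact duplicates, and separate incomplete rows."""
    seen = set()
    complete, incomplete = [], []
    for record in records:
        # Malformed: trim stray whitespace and unify letter case in text values.
        tidy = {
            key: value.strip().lower() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Duplicated: skip rows identical to one already kept.
        fingerprint = tuple(sorted(tidy.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Missing: set aside rows lacking a required value for later review.
        if any(tidy.get(field) in (None, "") for field in required_fields):
            incomplete.append(tidy)
        else:
            complete.append(tidy)
    return complete, incomplete


# Hypothetical rows showing each kind of problem.
rows = [
    {"id": "001", "sex": " Female ", "age": 24},
    {"id": "001", "sex": "female", "age": 24},  # duplicate once normalized
    {"id": "002", "sex": "male", "age": None},  # missing age
]
complete, incomplete = clean_records(rows, ["sex", "age"])
print(complete)    # [{'id': '001', 'sex': 'female', 'age': 24}]
print(incomplete)  # [{'id': '002', 'sex': 'male', 'age': None}]
```

Outliers and irrelevant rows or columns are left out of this sketch because handling them depends on your analysis plan. Always keep the original file untouched and apply cleaning to a copy.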
Questions to ask before data cleaning:
- Is anything actually wrong with the data? Deal with this first. See list above for possible issues.
- What’s missing in the data, and why? This may require you to gather more documentation, or more data. Once you have the information you need, make a plan for how you want to deal with missing data.
- What do you have planned for analysis? Do you need to make data more consistent to enable clean visualizations, for instance? Are you most interested in a subset of the data? Data cleaning can be endless; prioritize tasks that affect analysis.
Data cleaning resources:
- What is data cleaning? | Tableau
- Pandas Data Cleaning in Python | W3Schools
- Handling missing data | Search in LIGHTS, the library of guidance for health scientists
Data visualization
Data visualization guidance, types, and recommended tools are compiled in this research guide on data visualization.
Working with data in spreadsheets
Read this article, Data organization in spreadsheets. Full citation: Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989.
Or consider reading through this Data Carpentry course of the same name, “Data Organization in Spreadsheets.”
Free datasets for practicing your data skills
The data science education site, Dataquest, has a great list of free datasets on a wide variety of topics from an array of sources. Some of our favorites in their list include data.gov, World Health Organization, and Pew.
Additional recommendations:
Across Yale
- Bioinformatics Support Hub: Provides consultations and training on various bioinformatics topics, as well as free access to popular bioinformatics software.
- StatLab: Provides walk-in help on statistical tools and topics.
- GIS Support: Provides access to GIS software as well as consultations and workshops on GIS topics.
- Yale School of Medicine’s Biomedical Informatics and Data Science Department: Engages faculty, students, and staff to promote equitable and sustainable health with informatics and data science.
- Yale REDCap Team: A team with expertise in data management, programming, ITS engineering, Linux, research project management, and administrative support for the Yale REDCap instance. The team also provides trainings and other resources.
- Yale Center for Research Computing (YCRC): Supports high-performance computing needs at Yale through office hours and workshops, and has four on-site Linux clusters available for advanced computational projects.
- Yale Center for Biomedical Data Science (YCBDS): Research and education hub for biomedical data science on Yale’s Medical Campus.
- Research Core Facilities: Provides Yale researchers access to scientific instrumentation.
Beyond Yale
- BD2K Foundations of Biomedical Data Science
- NIH Training Modules to Enhance Data Reproducibility
- NCBI Workshops, Webinars, and Codeathons
- Data Carpentry: training in fundamental data skills needed to conduct research, including Python, R, SQL, OpenRefine and more
- Software Carpentry: training in basic lab skills for research computing, including Unix shell, git, and Python/R