Got data questions? Book a free consultation with Cushing/Whitney Medical Library’s data librarian for yourself, your group, or your team to discuss data-related research questions and needs.
Most consultations occur over Zoom, but you may also send your question via email, or schedule an in-person meeting.
Examples of consultation topics
- How to find and select health sciences dataset(s), including how to reuse open public data for research or course assignments
- Best practices for research data management
- How to make your research more open, accessible, and reproducible, and how to make your data FAIR (findable, accessible, interoperable, and reusable)
- Data compliance and governance, including data use agreements, data licensing, and proper data storage methods
- Data processing, analysis, and visualization
- Data tools and software to support your research, such as Python and R
Looking for asynchronous ways of learning? Here are some of our favorite free resources for learning to work with data.
- Learn Python the Right Way [online book] — A free, comprehensive guide to Python and programming in general. If you've taken "Getting Started with Python" at the library, you will already be familiar with Replit, which this book also uses for exercises.
- Python for Non-Programmers [LinkedIn Learning] — Access this free course, aimed at those new to programming, through Yale's subscription to LinkedIn Learning. Like the above suggestion, this course also uses Replit.
- RealPython.com [online resource and tutorials] — An expansive collection of free Python tutorials, as well as other resources like forums, podcasts, and helpful articles.
- An Introduction to Programming for Bioscientists: A Python-Based Primer. [open-access journal article] — Read this PLOS Computational Biology article for a step-by-step guide to getting started with Python for biological and biomedical use. Full citation: Ekmekci B, McAnany CE, Mura C (2016) An Introduction to Programming for Bioscientists: A Python-Based Primer. PLOS Computational Biology 12(6): e1004867. https://doi.org/10.1371/journal.pcbi.1004867.
- Automate the Boring Stuff [online book] — A free, excellent introduction to all things automation, including web scraping, reminder applications, data formatting, auto-complete forms, and more.
- CS Dojo’s Python Tutorial for Absolute Beginners [YouTube videos] — If you prefer to learn through video, this is a great series.
- Python Documentation — Official Python docs are available at python.org, where you can also find a beginner's guide and many additional resources. We also recommend the W3Schools Python Tutorial as supplementary quick-reference documentation and as a learning resource.
- SwirlStats [interactive tutorial and R package] — SwirlStats allows you to "Learn R, in R!" This interactive tutorial provides an immersive experience for learning R and data science concepts.
- R Programming [online course] — A comprehensive online Coursera course for getting up and running with R, R programming and troubleshooting, and simulation and profiling in R.
- R for Data Science [online book with exercises] — From RStudio's Chief Scientist and the inventor of the concept of "tidy data" comes this book: the definitive guide to R, the Tidyverse, and how to use R for data science.
- Ten simple rules for teaching yourself R. [open-access journal article] — Read this PLOS Computational Biology article for a step-by-step guide to getting started with R on your own. Full citation: Lawlor J, Banville F, Forero-Muñoz NR, Hébert K, Martínez-Lanfranco JA, et al. (2022) Ten simple rules for teaching yourself R. PLOS Computational Biology 18(9): e1010372. https://doi.org/10.1371/journal.pcbi.1010372.
- R-Bloggers [online resource] — A blog aggregator for content about R, R programming, data science, and statistics. A great place to learn what's new in R and find tutorials and guides on a variety of topics.
- R Documentation — Official R docs are available at r-project.org. We also recommend the W3Schools R Tutorial as supplementary quick-reference documentation and as a learning resource.
What is data cleaning?
Data cleaning typically involves changing a dataset to correct or remove information that is:
- Malformed (for example, incorrect, incomplete, inconsistent, corrupted, or poorly formatted)
- Otherwise unsuitable for your analysis (for example, outliers or irrelevant rows and columns)
See resources below for more information on what data cleaning is and how to do it.
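As a rough illustration of the kinds of issues listed above, here is a minimal Python sketch that standardizes inconsistent text, catches a malformed number, and flags an implausible value for review. The records, column names, and the 0 to 120 age range are invented for this example; a real project would pick its own rules, and many people would use a library such as pandas rather than plain Python.

```python
# Hypothetical records; the column names and values are invented for illustration.
raw_rows = [
    {"patient_id": "001", "site": " New Haven ", "age": "34"},
    {"patient_id": "002", "site": "new haven", "age": "forty"},  # malformed age
    {"patient_id": "003", "site": "NEW HAVEN", "age": "29"},
    {"patient_id": "004", "site": "Hartford", "age": "310"},     # implausible age
    {"patient_id": "005", "site": "Hartford", "age": "41"},
]


def parse_age(text):
    """Return the age as an int, or None when the text is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def clean_rows(rows, max_age=120):
    """Standardize site names, parse ages, and flag rows that need review."""
    cleaned = []
    for row in rows:
        age = parse_age(row["age"])
        cleaned.append({
            "patient_id": row["patient_id"],
            # Trim stray whitespace and use one capitalization style.
            "site": row["site"].strip().title(),
            "age": age,
            # Flag unparseable or out-of-range ages instead of silently dropping them.
            "needs_review": age is None or not (0 <= age <= max_age),
        })
    return cleaned


cleaned = clean_rows(raw_rows)
flagged = [row["patient_id"] for row in cleaned if row["needs_review"]]
sites = sorted({row["site"] for row in cleaned})
print("Rows to review:", flagged)
print("Distinct sites after cleaning:", sites)
```

Flagging questionable rows, rather than deleting them, keeps the original data available while you decide how each case should be handled.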
Questions to ask before data cleaning:
- Is anything actually wrong with the data? Deal with this first. See list above for possible issues.
- What’s missing in the data, and why? This may require you to gather more documentation, or more data. Once you have the information you need, make a plan for how you want to deal with missing data.
- What do you have planned for analysis? Do you need to make data more consistent to enable clean visualizations, for instance? Are you most interested in a subset of the data? Data cleaning can be endless; prioritize tasks that affect analysis.
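One way to start answering these questions is to count what is missing in each column and then check how many records are complete for the variables your analysis needs. The sketch below uses plain Python with made-up records; the column names, the set of values treated as missing, and the choice of analysis columns are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical survey records; None marks a value that was never recorded.
records = [
    {"id": 1, "weight_kg": 70.5, "smoker": "no", "visit_date": "2023-01-10"},
    {"id": 2, "weight_kg": None, "smoker": "yes", "visit_date": "2023-01-12"},
    {"id": 3, "weight_kg": 82.0, "smoker": None, "visit_date": None},
    {"id": 4, "weight_kg": None, "smoker": "no", "visit_date": "2023-02-01"},
]

# Assumed markers for missing data; check your dataset's documentation for its own.
MISSING = {None, "", "NA", "N/A"}


def count_missing(rows):
    """Count missing values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value in MISSING:
                counts[column] += 1
    return dict(counts)


missing_counts = count_missing(records)

# Focus on the columns the planned analysis actually uses (assumed here).
analysis_columns = ["weight_kg", "smoker"]
complete_cases = [
    row for row in records
    if all(row[column] not in MISSING for column in analysis_columns)
]

print("Missing values per column:", missing_counts)
print("Records complete for analysis:", [row["id"] for row in complete_cases])
```

A summary like this can show whether missing data is concentrated in columns you do not need, in which case cleaning them may not be worth the effort.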
Data cleaning resources:
- What is data cleaning? | Tableau
- Pandas Data Cleaning in Python | W3 Tutorials
- Handling missing data | Search in LIGHTS, the library of guidance for health scientists
- Also, see the data reuse section of this site and the data organization in spreadsheets section below.
Data visualization guidance, types, and recommended tools are compiled in this research guide on data visualization.
Working with data in spreadsheets
- Read this article, Data organization in spreadsheets. Full citation: Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989.
- Or consider reading through this Data Carpentry course of the same name, "Data Organization in Spreadsheets."
Free datasets for practicing your data skills
- The data science education site, Dataquest, has a great list of free datasets on a wide variety of topics from an array of sources. Some of our favorites in their list include data.gov, World Health Organization, and Pew.
- Additional recommendations:
- Bioinformatics Support Hub: Provides consultations and training on various bioinformatics topics, as well as free access to popular bioinformatics software.
- StatLab: Provides walk-in help on statistical tools and topics.
- GIS Support: Provides access to GIS software as well as consultations and workshops on GIS topics.
- Yale School of Medicine's Biomedical Informatics and Data Science Department: Engages faculty, students, and staff to promote equitable and sustainable health with informatics and data science.
- Yale REDCap Team: A team of experts in data management, programming, ITS engineering, Linux, research project management, and administrative support for the Yale REDCap instance that also provides trainings and other resources.
- Yale Center for Research Computing (YCRC): Supports high performance computing needs at Yale through office hours and workshops and has four on-site Linux clusters available for advanced computational projects.
- Yale Center for Biomedical Data Science (YCBDS): Research and education hub for biomedical data science on Yale's Medical Campus.
- Research Core Facilities: Provides Yale researchers access to scientific instrumentation.
- BD2K Foundations of Biomedical Data Science
- NIH Training Modules to Enhance Data Reproducibility
- NCBI Workshops, Webinars, and Codeathons
- Data Carpentry: training in fundamental data skills needed to conduct research, including Python, R, SQL, OpenRefine and more
- Software Carpentry: training in basic lab skills for research computing, including Unix shell, git, and Python/R