Open Science

Vaclav (Vashek) Petras

NCSU GeoForAll Lab at the Center for Geospatial Analytics
North Carolina State University

GIS 710: Geospatial Analytics for Grand Challenges
November 11, 2024

https://bit.ly/2oR9owy

Learning Objectives

  • Understanding the motivation for practicing open science
  • Understanding the complexity of practicing open science
  • Critical thinking about pros, cons, and challenges
  • General understanding of tools and services involved
  • Practical knowledge of tools for sharing research and computations
  • Ideas about how to use them in complex geospatial applications

Reproducibility of Computational Articles

Stodden et al. (PNAS, March 13, 2018)
204 computational articles from Science in 2011–2012

26% reproducible, 74% irreproducible
Stodden, V., Seiler, J., & Ma, Z. (2018). An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences, 115(11), 2584–2589. DOI 10.1073/pnas.1708290115
Discussion questions: Do you know about similar studies? What do they say?

Challenges of Open Science

Focus on Novelty in Publishing and Funding

It would be preferable to have the time to do it right, while simultaneously allowing scientists to be human and make mistakes, instead of focusing on novelty, being first and publishing in highly selective journals.
Holding, A. N. (2019). Novelty in science should not come at the cost of reproducibility. The FEBS Journal, 286(20), 3975–3979. DOI 10.1111/febs.14965
Discussion questions: What research gets published? What research gets funded?

Lack of Incentives for Reproducibility

[...] irreproducible research [...] careers [...] personal cost: young scientists [...] their families [...] visas that are conditional [...] Running out of time [...] pressure on early-career researchers to deliver high-impact results. The outcome is an environment that pushes people to get across the line as quickly as possible, while the incentives to challenge or to reproduce previous studies are minimal.
Holding, A. N. (2019). Novelty in science should not come at the cost of reproducibility. The FEBS Journal, 286(20), 3975–3979. DOI 10.1111/febs.14965
Discussion questions: How important is it to challenge or reproduce previous studies? How important is being able to reproduce your own studies?

Scooping

‘There is always this fear, that someone steals your ideas, or is doing the same thing at the same time, and some people fear it more than other people, I think especially younger people, also some older. I think this causes a lot of stress to the scientists, and it has happened to me. […] you try not to think about it, you still think that what if someone else is doing the same thing and this is useless work, so then it takes your energy.’
— research participant in
Laine, Heidi (2017). Afraid of scooping: Case study on researcher strategies against fear of scooping in the context of open science. Data Science Journal. DOI 10.5334/dsj-2017-029
Discussion questions: What is scooping (being scooped) in science? Are you afraid of it?
PLOS publishes scooped research and negative and null results

Private Data

[...] releasing datasets as open data may threaten privacy, for instance if they contain personal or re-identifiable data. Potential privacy problems include chilling effects on people communicating with the public sector, a lack of individual control over personal information, and discriminatory practices enabled by the released data.
Borgesius, F. Z., Gray, J., & van Eechoud, M. (2016). Open Data, Privacy, and Fair Information Principles: Towards a Balancing Framework. DOI 10.15779/Z389S18
Discussion questions: Do you use or create personal or private data in your research or do you expect you will?

Sensitive Data

Insights obtained by compiling public information from Open Data sources, may represent a risk to Critical Infrastructure Protection efforts. This knowledge can be obtained at any time and can be used to develop strategic plans of sabotage or even terrorism.
Fontana, R. (2014). Open Data analysis to retrieve sensitive information regarding national-centric critical infrastructures. http://open.nlnetlabs.nl/downloads/publications/...
Discussion questions: Do you use or create sensitive data in your research or do you expect you will?

Publishing Source Code with a Paper

[...] that’s going to be harder. [...] I’m expecting to get screenshots of MATLAB procedures and horrible Python code that even the author can’t read anymore, and I don’t know what we’re going to do about that. Because in some sense, you can’t push too hard because if they go back and rewrite the code or clean it up, then they might actually change it.
— An interviewed journal editor-in-chief in
Sholler, D., Ram, K., Boettiger, C., & Katz, D. S. (2019). Enforcing public data archiving policies in academic publishing: A study of ecology journals. Big Data & Society, 6(1). DOI 10.1177/2053951719836258
Discussion questions: Have you ever broadly shared source code or other internal parts of your work?

Open Source Software and Research Funding

“That’s really the tragedy of the funding agencies in general,” says Carpenter. “They’ll fund 50 different groups to make 50 different algorithms, but they won’t pay for one software engineer.”
— Anne Carpenter, a computational biologist at the Broad Institute of Harvard and MIT in Cambridge in
Nowogrodzki, Anna (2019). How to support open-source software and stay sane. Nature, 571(7763), 133–134. DOI 10.1038/d41586-019-02046-0
Discussion questions: What open source software that is highly relevant to research do you know? Any idea about how it is funded?

Open Source Software and Government

[around 1990] [...] GIS industry claimed that it was unfair for the Federal Government to be competing with them.
Westervelt, J. (2004). GRASS Roots. Proceedings of the FOSS/GRASS Users Conference. Bangkok, Thailand.

In 1996 USA/CERL, [...] announced that it was formally withdrawing support [...and...] announced agreements with several commercial GISs, and agreed to provide encouragement to commercialization of GRASS. [...] result is a migration of several former GRASS users to COTS [...] The first two agreements encouraged the incorporation of GRASS concepts into ESRI's and Intergraph's commercial GISs.
Hastings, D. A. (1997). The Geographic Information Systems: GRASS HOWTO. tldp.org/HOWTO/GIS-GRASS
Original announcement: grass.osgeo.org/news/cerl1996/grass.html

Discussion questions: Do you know GRASS GIS?

NSF Open-Source Ecosystem Grant

  • $1.5M NSF grant awarded to NC State, ASU, NMSU, Yale.
  • To enhance infrastructure, revise contributing guidelines, and support community building for GRASS GIS.
  • The NSF program aims at improving sustainability, not at sustaining the project, adding features, or fixing bugs.
Discussion questions: Anything surprising in what is funded? Why? Why not?

Turn of the Tide?

National Institutes of Health New 2023 Policy

NIH expects that [...] researchers will maximize the appropriate sharing of scientific data, acknowledging certain factors (i.e., legal, ethical, or technical) [...] Shared scientific data should be made accessible as soon as possible [...]
NOT-OD-21-013: Final NIH Policy for Data Management and Sharing. Retrieved November 6, 2024, from https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
Discussion questions: Would you expect a health-related organization to be at the forefront of sharing data?

2023: Year of Open Science

[The White House Office of Science and Technology Policy (OSTP)] is [...] launching the Year of Open Science, featuring actions across the federal government throughout 2023 to advance national open science policy, provide access to the results of the nation’s taxpayer-supported research, accelerate discovery and innovation, promote public trust, and drive more equitable outcomes.
FACT SHEET: Biden-Harris Administration Announces New Actions to Advance Open and Equitable Research. January 11, 2023. whitehouse.gov/...open-and-equitable-research
Discussion questions: Anything you consider new?

NASA's 2022 commitment

[In 2022,] NASA committed $20 million per year to advance open science, beginning in 2023.
Why NASA and federal agencies are declaring this the Year of Open Science. Nature 613, 217 (2023). DOI 10.1038/d41586-023-00019-y
Discussion questions: What do you think this will be spent on?

Nelson Memo: Peer-Reviewed Publications (2025)

[...] all peer-reviewed scholarly publications [...] resulting from federally funded research are made freely available [...] without any [...] delay after publication.
White House Office of Science and Technology Policy (2022). Desirable Characteristics of Data Repositories for Federally Funded Research. DOI 10.5479/10088/113528
Discussion questions: What open-science concept does this refer to?

Nelson Memo: Scientific Data (2025)

Scientific data underlying peer-reviewed scholarly publications resulting from federally funded research should be made freely available [...] at the time of publication, unless subject to limitations [...]
White House Office of Science and Technology Policy (2022). Desirable Characteristics of Data Repositories for Federally Funded Research. DOI 10.5479/10088/113528
[...] “scientific data” include the recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings. Such scientific data do not include laboratory notebooks, preliminary analyses, case report forms, drafts of scientific papers, plans for future research, peer-reviews, communications with colleagues, or physical objects and materials, such as laboratory specimens, artifacts, or field notes.
Discussion questions: What open-science concept or concepts does this refer to?

Open Source Software and Industry

Open source became a movement – a mentality. Suddenly infrastructure software was nearly free [compared to 1999]. We paid 10% of the normal costs for the software and that money was for software support. A 90% disruption in cost spawns innovation – believe me.
— Mark Suster (2011) in
Eghbal, Nadia (2016). Roads and bridges: The unseen labor behind our digital infrastructure. Ford Foundation
Discussion questions: Do you know any open-source software success stories?

Other Reasons for Doing Open Science

Open Science Beginnings

First journal ever published:
Philosophical Transactions (of the Royal Society)

CC BY Stefan Janusz, Wikipedia

Theoretical Publishing Goals

  • registration: so that scientists get credit
  • archiving: so that we preserve knowledge for the future
  • dissemination: so that people can use this knowledge
  • peer review: so that we know it's worth it

Discussion questions: How are these publishing goals fulfilled by journal papers?

Internal Reasons for Open Science

  • Open science in your lab (team-oriented reasons):
    • collaboration: work together with your colleagues
    • transfer: transfer research between you and your colleagues
    • longevity: re-usability of parts of research over time
  • Open science by yourself (“selfish” reasons):
    • revisit: return to a project after some time
    • correction: correct a mistake in the research
    • extension: improve or build on an existing project

Discussion questions: What is your experience with getting back to your own research or continuing research started by someone else? (See PhD Comics: Scratch.) How does open science relate to team science? How can making things public help us achieve the desired effect, and what challenges does that bring?

Open Science Components and Definitions

What Open Means

  • The Open Definition
    • Knowledge is open if anyone is free to access, use, modify, and share it — subject, at most, to measures that preserve provenance and openness. [as officially summed up]
  • More than one term in use: open, free, libre
  • Free Cultural Works
  • The Open Source Definition
  • The Open Source AI Definition
  • The Free Software Definition
  • The Debian Free Software Guidelines and the Debian Social Contract

Image: “Free beer bottles” by free beer pool (CC BY 2.0)

Discussion questions: What is the difference between “free as in free beer” and “free as in freedom”? Have you seen “open” being used for something not fulfilling the Open Definition?

Open Science Components

  • 6 pillars [Watson 2015]:
    • open methodology
    • open access
    • open data
    • open source (software)
    • open peer review
    • open education (or educational resources)
  • other components:
    • open hardware
    • open formats
    • open standards
    • open source AI
  • related concepts:
    • Open-notebook science
    • Provenance
    • FAIR Principles
    • Science 2.0 (like Web 2.0)
    • Team science
    • Citizen science
    • Public science
    • Participatory research
    • Open innovation
    • Open organization
    • Crowdsourcing
    • Preprints
    • Inner source

Discussion questions: What would you add to the list? What do you see for the first time? What is openwashing?

The “re” Words

There is no agreement on some of the definitions, especially across fields; the definitions often overlap or are swapped, and some fields make no distinction at all.
  • replicability: independent validation of specific findings
  • repeatability: same conditions, people, instruments, ... (test–retest reliability)
  • reproducibility: same results using the same raw data or same materials, procedures, ...
  • recomputability: same results received by computation (in computational research; see the sketch below)
  • reusability: using the same data, tools, or methods again
For example, Ince et al. (2012) in computational science distinguish direct reproducibility, rerunning the code, from indirect reproducibility, validating something other than the entire code.
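
A minimal sketch of recomputability in Python: any source of randomness must be controlled, for example by fixing the random seed, before two runs of the same code can be expected to produce bit-for-bit identical results (the "analysis" below is a toy example, not taken from any of the cited studies):

    import random

    # Toy analysis: averaging a random sample.
    # Fixing the seed pins down the pseudo-random sequence,
    # so rerunning the script yields exactly the same number.
    random.seed(42)
    sample = [random.random() for _ in range(1000)]
    print(sum(sample) / len(sample))  # identical output on every run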

Discussion questions: As a PhD student, which of these features would you like to see in other research? As a journal paper reviewer, what should you be able to do when you receive a scientific publication for review?

Computational and Geospatial Research

  • code is a part of method description [Ince et al. 2012, Morin et al. 2012, Nature Methods 2007]
  • use of open source tools is a part of reproducibility [Lees 2012, Alsberg & Hagen 2006]
  • an easily reproducible result is a result obtained within 10 minutes [Schwab et al. 2000]
  • geospatial research specifics:
    • some research introduces new code
    • some research requires significant dependencies
    • some research produces user-ready software

Discussion questions: Is spatial special? Is recomputing the results useful for research? How long should it take to recompute results? Do dependencies need to be open source as well?

Open Science Publication: Use Case

Petras et al. 2017

Petras, V., Newcomb, D. J., & Mitasova, H. (2017). Generalized 3D fragmentation index derived from lidar point clouds. In: Open Geospatial Data, Software and Standards 2(1), 9. DOI 10.1186/s40965-017-0021-8

Open Science Publication: Components

Publication components in the Petras et al. 2017 use case:
  • Text: background, methods, results, discussion, conclusions, … (OA)
  • Data: input data (formats readable by open source software)
  • Reusable code: methods as GRASS GIS modules (C & Python)
  • Publication-specific code: scripts to generate results (Bash & Python)
  • Computational environment: details about all dependencies and the code (Docker, Dockerfile*)
  • Versions: repository with current and previous versions* (Git, GitHub)

* The version associated with the publication is also included as a supplemental file.

Petras, V. (2018). Geospatial analytics for point clouds in an open science framework. Doctoral dissertation. URI http://www.lib.ncsu.edu/resolver/1840.20/35242
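
As a hypothetical sketch of the computational environment component, a Dockerfile can pin the operating system, the dependencies, and the publication-specific scripts in a single recipe; the base image, package names, and script path below are illustrative assumptions, not the actual environment of Petras et al. 2017:

    # Illustrative only: not the actual Dockerfile of the paper
    FROM ubuntu:22.04

    # Install the open source geospatial stack the analysis depends on
    RUN apt-get update && apt-get install -y \
        grass \
        python3 \
        && rm -rf /var/lib/apt/lists/*

    # Copy the publication-specific scripts into the image
    COPY scripts/ /analysis/scripts/

    # Rerun the full analysis by default when the container starts
    CMD ["bash", "/analysis/scripts/generate_results.sh"]

Because such a recipe is plain text, it can be versioned with the code and archived with the publication, which ties the computational environment and versions components together.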

Discussion questions: What are other technologies which are good fit for these components? Are there other components or categories? What parts of research did you publish or tried to publish and what challenges did you face?

Open Science Publication: In a Single Package Online

  • Components other than Text and Versions for Petras et al. 2017 are now also available at Code Ocean as a capsule.
Image: Code Ocean capsule content in a web browser
DOI 10.24433/CO.3986355.v2

Discussion questions: What is the skill set needed to publish results like this? What is the long-term sustainability of online recomputability tools such as Code Ocean?

Open Science Publication: Software Platform

  • Preprocessing, visualization, and interfaces (GUI, CLI, API)
  • Data inputs and outputs, memory management
  • Integration with existing analytical tools
  • Preservation of the reusable code component (long-term maintenance)
  • A dependency that would be hard to replace with something else
  • Example: FUTURES model implemented as a set of GRASS GIS modules (r.futures.pga, r.futures.demand, r.futures.parallelpga, ...)
Petras, V. (2018). Geospatial analytics for point clouds in an open science framework. Doctoral dissertation. URI http://www.lib.ncsu.edu/resolver/1840.20/35242
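
As a rough sketch of what such integration looks like in practice, analysis implemented as GRASS GIS modules can be driven from Python through the grass.script API; the session setup is omitted, r.slope.aspect stands in for an analytical module, and the raster names are made up:

    import grass.script as gs

    # Run a GRASS GIS module as one step of a scripted analysis
    # (assumes a GRASS session is active and an "elevation" raster
    # exists in the current mapset; names are illustrative)
    gs.run_command(
        "r.slope.aspect",
        elevation="elevation",
        slope="slope",
        aspect="aspect",
    )

    # Read summary statistics of the result back into Python
    stats = gs.parse_command("r.univar", map="slope", flags="g")
    print(stats["mean"])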

Discussion questions: What software can play this role? What are the different levels of integration with a piece of software and their advantages and disadvantages?

Licensing

“Creative Commons License Spectrum” by Shaddim (CC BY 4.0), Creative Commons: Understanding Free Cultural Works

Discussion questions: Do you read “terms and conditions”? Have you ever read a full “terms and conditions” document or end user license agreement (EULA)? What about an open source software license? (Read the license of GDAL right now! It's less than 170 words.)
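
One programmatic way to at least glance at a license: Python can report the license string declared by an installed package. A small sketch, assuming the GDAL Python bindings are installed and declare their license in the package metadata:

    from importlib.metadata import metadata

    # Print the license declared in the metadata of an installed package
    # (assumes the GDAL Python bindings are installed)
    info = metadata("GDAL")
    print(info["License"])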

FAIR Principles

  • Findable: persistent identifier and metadata for data
  • Accessible: sharing protocol is open, free, and universally implementable
  • Interoperable: formal language and references to other datasets
  • Reusable: clear usage license (can be restricted), detailed provenance

The principles emphasize machine-actionability (…) because humans increasingly rely on computational support to deal with data (…)

[Wilkinson 2016]
https://www.go-fair.org/fair-principles

Image: “FAIR guiding principles for data resources” by SangyaPundir (CC BY-SA 4.0)
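
Machine-actionability can be made concrete with DOI content negotiation: a persistent identifier (Findable) is resolved over an open protocol (Accessible) and returns metadata in a formal, machine-readable format (Interoperable). A minimal sketch using the requests library and the DOI of Petras et al. 2017:

    import requests

    # Ask doi.org for machine-readable citation metadata (CSL JSON)
    # instead of the human-readable landing page
    response = requests.get(
        "https://doi.org/10.1186/s40965-017-0021-8",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    metadata = response.json()
    print(metadata["title"])
    print(metadata["DOI"])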

Discussion questions: Which parts are unique to FAIR and not present in open science? Is source code part of data, part of data provenance, or is it a separate thing?

References

  • Alsberg, B. K., & Hagen, O. J. (2006). How Octave can replace Matlab in chemometrics. Chemometrics and Intelligent Laboratory Systems, 84(1), 195–200. doi:10.1016/j.chemolab.2006.04.025
  • Buckheit, J. B., & Donoho, D. L. (1995). WaveLab and reproducible research. In A. Antoniadis & G. Oppenheim (Eds.), Wavelets and Statistics (Lecture Notes in Statistics, Vol. 103, pp. 55–81). Springer. doi:10.1007/978-1-4612-2544-7_5
  • Ince, D. C., Hatton, L., & Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482(7386), 485–488. doi:10.1038/nature10836
  • Lees, J. M. (2012). Open and free: Software and scientific reproducibility. Seismological Research Letters, 83(5), 751–752.
  • Marwick, B. (2017). Computational reproducibility in archaeological research: Basic principles and a case study of their implementation. Journal of Archaeological Method and Theory, 24(2), 424–450. doi:10.1007/s10816-015-9272-9
  • Morin, A., et al. (2012). Shining light into black boxes. Science, 336(6078), 159–160. doi:10.1126/science.1218263
  • Nature Publishing Group (2007). Social software. Nature Methods, 4(3), 189. doi:10.1038/nmeth0307-189
  • Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227. doi:10.1126/science.1213847
  • Petras, V. (2018). Geospatial analytics for point clouds in an open science framework. Doctoral dissertation. www.lib.ncsu.edu/resolver/1840.20/35242
  • Petras, V., Newcomb, D. J., & Mitasova, H. (2017). Generalized 3D fragmentation index derived from lidar point clouds. Open Geospatial Data, Software and Standards, 2(1), 9. doi:10.1186/s40965-017-0021-8
  • Rocchini, D., & Neteler, M. (2012). Let the four freedoms paradigm apply to ecology. Trends in Ecology and Evolution. doi:10.1016/j.tree.2012.03.009
  • Rodriguez-Sanchez, F., Pérez-Luque, A. J., Bartomeus, I., & Varela, S. (2016). Ciencia reproducible: qué, por qué, cómo [Reproducible science: what, why, how]. Revista Ecosistemas, 25(2), 83–92.
  • Schwab, M., Karrenbach, M., & Claerbout, J. (2000). Making scientific computations reproducible. Computing in Science & Engineering, 2(6), 61–67. doi:10.1109/5992.881708
  • Stodden, V., Seiler, J., & Ma, Z. (2018). An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences, 115(11), 2584–2589. doi:10.1073/pnas.1708290115
  • Watson, M. (2015). When will ‘open science’ become simply ‘science’? Genome Biology, 16(1), 101. doi:10.1186/s13059-015-0669-2
  • Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1), 1–9. doi:10.1038/sdata.2016.18

Resolution for Class Debate

A scientific publication needs to consist of text, data, source code, software environment, and reviews, all openly licensed, in open formats, checked during the submission process, and publicly available without any delay after publication.