class: center, middle, inverse, title-slide # Improving (analysis) process: developing packages ###
Lluís Revilla
### CIBEREHD, IDIBAPS ###
2021/11/09
nhsr2021.llrs.dev
--- # About me Bioinformatician at [CIBEREHD](https://www.ciberehd.org/en "CIBERHD's website") finishing my PhD at [IDIBAPS](https://www.clinicbarcelona.org/en/idibaps "IDIBAPS' website") a research center of the [Hospital Clinic de Barcelona](https://www.clinicbarcelona.org/en/ "Hospital's website"). .pull-left[ <div class="figure"> <img src="images/hospital_clinic.jpg" alt="Source: Hospital Clínic de Barcelona" width="100%" /> <p class="caption">Source: Hospital Clínic de Barcelona</p> </div> ] .pull-right[ Using mostly R for research with multiple packages and developing and maintaining other packages, from implementing mathematical theories ([BaseSet](https://cran.r-project.org/package=BaseSet "Package for fuzzy set operations")), help designing experiments ([experDesign](https://cran.r-project.org/package=experDesign "Package to analyse samples in batches")), comparing annotations of genes ([BioCor](https://bioconductor.org/packages/BioCor "Comparing genes and pathways annotations")). And other packages unrelated to research: analysing R bugs ([bugRzilla](https://github.com/llrs/bugRzilla "Package still in development")) or the official state gazette of Spain ([BOE](https://ropenspain.github.io/BOE/ "BOE in Spanish acronym"))... ] ??? You can also find me on [Twitter](https:/twitter.com/Lluis_Revilla) and on [my website](https://llrs.dev). --- # Index .pull-left[ .center[ A bit of background ![:scale 20% "Image of the sky on the night"](images/background.jpg) Practical recommendations ![:scale 40% "Image of a wall with lots of tools"](images/practical_tools.jpg) A step back to see the bigger picture ] ] .pull-right[ ![:scale 70% "Image of a frame with a paraglider behind it"](images/bigger_picture.jpg) ] --- # Improving processes: not repeating ourselves Much of the progress of humanity has come from simplifying: - Machines: Repeat actions - Templates: Repeat information - Software: Repeat logic ![](images/excavator_form_code.jpg) ??? Our efforts are to reduce repetition by humans or making those faster, easier and more efficient. [TODO: Find image for each topic and add title above them] --- # Science .bg-washed-green.b--dark-green.ba.bw1.center[ Is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions .tr[ — [Wikipedia](https://en.wikipedia.org/wiki/Science) ]] ??? Science is a method of knowledge: there are others like mathematics, philosophy, theology, ... -- .center[ ![Table with data and analysis as axis](images/turing_way_reproducible-matrix.jpg) [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html "CC-BY The Turing Way" ) ] ??? Missing part: code/logic which needs to be the same (Same version, OS, ...) --- # Code .pull-left[ - Shareable - Easier to understand (sometimes) - Modularizable - Repeatable (in principle) ] .pull-right[ Recommendations: - [Follow best practices](https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/ "Software carpentery recommendation for scripts"): Decide a style, record dependencies, comment code... - Find help: at your institution, from [NHS-R community](https://nhsrcommunity.com/r-near-me/ "NHS-R user groups") from organizations like [Research Software Engineer](https://society-rse.org/ "Community for research software engineer") or other organizations. ] ![](images/practical_tools.jpg) ??? Better code benefits the process and the result. If reports are your main outcome they can also be automated via knitr parametrized reports or use interactive reports (See Simon Wellesley-miller talk tomorrow) --- # Brief introduction to packages Slide adapted from a [workshop](https://ghana.llrs.dev/#4) on how to develop packages. .left-column[ <img src="images/tidyverse_dtplyr.png" title="Tree view of the dtplyr package repository on 2021/08/06." alt="Tree view of the dtplyr package repository on 2021/08/06." width="100%" /> ] .right-column[ - `.github` folder: Files specific to GitHub (this is not necessary) - A `R` folder with *.R files: your code. - A `man` folder with *.rd files: your documentation. - A `tests` folder: Check the code of the package. - A `vignette` folder: Long documentation; not just examples. - A `.Rbuildignore`: A file describing what to omit when building the package. - A `DESCRIPTION` file: Summary and description of the package. - A `LICENSE` file: The conditions under the package is released. - A `NAMESPACE` file: What this package shares and needs. - A `NEWS` file: What has changed since last release. - A `README`: How to install and why this packages is needed and some basic examples. ] ??? I recommend to look at it, but here are some resources: - [R extensions](https://cran.r-project.org/doc/manuals/r-release/R-exts.html "Offical complete instructions"): Official documentation, everything is answered here - [R packages](https://r-pkgs.org/ "R pakages"): A book more digestible and using other packages and tools to create packages. --- # From scripts to packages .center[ ![:scale 60% "Plot comparing internal and open source package based on solution breadth and problem definition internal pacakges are more specific and more workflow-like than open source packages" ](images/packages_featured.png "Comparing internal and open source package based on solution breadth and problem definition") [Emily Riederer](https://emilyriederer.netlify.app/post/team-of-packages/ "Link to post about how to create internal packages") ] ??? Use function (use other's packages) Write functions Analysing some data might lead to a package or not (too specific, private logic). Package should provide some non-trivial solution to other's. It might not be worth of publication (on traditional journals) but it might be worth sharing (on Journal of Open Source or just as a package people can install). Check [Emily Rieder's post](https://emilyriederer.netlify.app/post/team-of-packages/) on internal packages --- # When to write a package? After a search of a package that does solves your problem: .pull-left[**Around a central idea** A method, a program, a project, a report (like REDCapR)] .pull-right[**A worflow/salad of processes** Things you repeat (checking column names, match between two tables), steps you can automatize between manual/visual inspection.] -- .indent[ ] .center[ | Project | Use | Example | |:---:|:---:|:---:| |**Small** | script with functions | `function(x){x+5}` | |**Medium** | script with package | [script](https://github.com/llrs/TRIM "Code for a project") + [own_package](https://github.com/llrs/integration "Internal package for integration") | |**Large** | architecture/workflow | [targets](https://cran.r-project.org/package=targets "Keeping track of the results") + [renv](https://rstudio.github.io/renv/index.html "Keeping track of packages and enviroment used") + script + own_package | ] ??? Depending on your work if it is more research based probably package around an idea or a salad to make your life easier Because code in research most of them is run just once and few multiple times If it is more clinical perhaps you do more reporting, and you might be interested on a mix to retrieve data and explore it with reports, plots... --- # Benefits for healthcare analytics - More confidence on results and process - Errors might be caught in advance (& fixed in advance) - Automated repetitive process > We’ve been using R with REDCap’s API since 2012 and have developed REDCapR. Before encapsulating these functions in a package, we were replicating 50+ lines of code to contact REDCap and robustly transform the returned csv into an R `data.frame`; it took twice that much to implement batching. All this can be done in one call to `redcap_read()` > > .right[[REDCapR's](https://cran.r-project.org/package=REDCapR) README] - Reduce effort (long term) > For ease of sharing, documenting, and testing we wrapped these tools up into an R package. > > .right[ [Jay Hugues](https://medium.com/healthfdn-data-analytics/introducing-a-new-r-package-to-process-primary-care-data-88e863130ada#236b "Blog post on how a package was developed")] ??? We built a package (in perl) to automate a process on our lab: GETS: multiple analysis of RNA-seq data -> webpage/program Documentation of process or why things are done that way Advantages - Function checking - Documentation Disadvantages - More experience programming (not that much) Setting checks on scores (endoscopic scores, ) To mitigate the disadvantage it is better paired with other tools like [renv](https://rstudio.github.io/renv/index.html). --- # How and when to use external packages There are three ways: Depends, Imports, Suggests and Enhance **Depending** is equivalent of `library("package")`: If the package is not present it does not install. Enhancing is equivalent of Depends. **Import** is equivalent to `package::function()`: Only use some functions of the package. **Suggests** means that it is not needed for the whole package but just some parts. .details[ Suggests requires more work before hand but makes it easier to install your package. You need to use: ```r if (!requireNamespace("package", quiet = TRUE)) { stop("We need this package") } package::function() ``` ] ??? If you use RNA-seq, 16S data design experiments: [Bioconductor](https://bioconductor.org) [rOpenSci](https://ropensci.org) organization has also scientific tools: [targets](https://cran.r-project.org/package=targets) help avoiding rerunning the same steps. If you are already considering building your own package you will already be aware of other packages. --- # How to develop internal pacakges If it is just you or few of you: your own computer and then share collaboratively via your version control system. ??? Depending on the support of your IT department or your setup/computer facilities. Problems when editing the same file, I recommend using version control systems. -- If it is shared or you are trying to know if some more people might be interested, maybe on github or gitlab. Example: [aurumpipeline](https://github.com/HFAnalyticsLab/aurumpipeline "Package to clean and process patient-level CPRD Aurum" ) ??? Having this public or semi-public will make it easier a workmate or someone from the department besides you. It will make it easier to collaborate in case you release your package to the public. Internal packages are usually not shared outside the organization. [medicaldata](https://higgi13425.github.io/medicaldata/index.html "Package with datasets for teaching") is another example but this is intended to be used by more people and as such is on CRAN. -- Very specific to institution, team, collaborators... which translate in a bit of work on finding what works best for you. .center[ ![:scale 50% "Image of a monkey thinking"](images/rob-schreckhise-8zdEgWg5JAA-unsplash_monkey_thinking.jpg) ] ??? Take into consideration both the infrastructure you have, who develops and *maintains* the software and who benefits from it. Most research software is run few times and not worth keeping it along. --- # How to use internal packages Install from source each time (via `R CMD INSTALL` or `devtools::install`) If you have your own Github organization you can use r-universe to gather the packages and make it easier to install them. You could set up your own internal [CRAN-like repository](https://rstudio.github.io/packrat/custom-repos.html "Instructions for package repositoris"). Once installed use it normally as any other package on your reproducible environment. .center[ ![:scale 40% "Window of installing a program on windows by a 'wizard'"](images/wizard_installation.png) [WizardAssistant](https://wizardassistant.com/) ] ??? The main difference between internal and external packages is how are they installed ( There is a list of [potential soluctions](https://support.rstudio.com/hc/en-us/articles/115000239587-Sharing-Internal-R-Packages). Depending on your institutions you might already have a local CRAN or [miniCRAN](https://cran.r-project.org/package=miniCRAN) so it might be easy to add your internal packages to that repository. This also helps to know if the work is reproducible and if the package works well for the users (and not just you) --- # Packages and process are not static - R is not always equal, you might need to modify the package - Your users (or you) will not have always the same priorities - Might need to move some functions to another package - Use/set deprecation policies .center[ <div class="figure"> <img src="images/La_palma_volcano_flow.jpg" alt="Image of the lava flow from palma, showing it exiting the cone, flowing and reaching the sea." width="50%" /> <p class="caption">Source: El País</p> </div> ] ??? At the beginning lots of activities and changes (eruption) On the middle less changes but they might take new paths (redefining functions or finding better ways to do the same thing) At the end the package development goes quiet but there is friction with changes on R and or other packages. --- # Going back to the big picture Packages help trusting in the reporting and implementation: ![Reproducibility As A Trust Scale from low overall trust to the higher levels of trust ](https://gmbecker.github.io/MayInstituteKeynote2019/trustscale3.png "Reproducibility As A Trust Scale from low overall trust to the higher levels of trust") .right[ [Gabriel Becker, Copyright Genentech Inc. ](https://gmbecker.github.io/MayInstituteKeynote2019/outline.html "Link to the slides with the original source") ] ??? [rOpenSci guide](http://ropensci.github.io/reproducibility-guide) [Gabe Becker's slides about reproducibility](https://gmbecker.github.io/MayInstituteKeynote2019/outline.html) --- # Conclusions Not complicated to have packages. Complicated to be consistent (new people, process change, people leave...) Packages are just a part that help with being reproducible and replicable, there are others. .pull-left[ ![:scale 100% "Hands together"](images/hannah-busing-Zyx1bK9mqmA-unsplash_hands_together.jpg) ] .pull-right[ Thank you! [nhsr2021.llrs.dev](https://nhsr2021.llrs.dev) [<svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg>](https://www.linkedin.com/in/lluisrevilla/) [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> Lluís Revilla](https://twitter.com/Lluis_Revilla/) [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> llrs](https://github.com/llrs/) [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M176 216h160c8.84 0 16-7.16 16-16v-16c0-8.84-7.16-16-16-16H176c-8.84 0-16 7.16-16 16v16c0 8.84 7.16 16 16 16zm-16 80c0 8.84 7.16 16 16 16h160c8.84 0 16-7.16 16-16v-16c0-8.84-7.16-16-16-16H176c-8.84 0-16 7.16-16 16v16zm96 121.13c-16.42 0-32.84-5.06-46.86-15.19L0 250.86V464c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V250.86L302.86 401.94c-14.02 10.12-30.44 15.19-46.86 15.19zm237.61-254.18c-8.85-6.94-17.24-13.47-29.61-22.81V96c0-26.51-21.49-48-48-48h-77.55c-3.04-2.2-5.87-4.26-9.04-6.56C312.6 29.17 279.2-.35 256 0c-23.2-.35-56.59 29.17-73.41 41.44-3.17 2.3-6 4.36-9.04 6.56H96c-26.51 0-48 21.49-48 48v44.14c-12.37 9.33-20.76 15.87-29.61 22.81A47.995 47.995 0 0 0 0 200.72v10.65l96 69.35V96h320v184.72l96-69.35v-10.65c0-14.74-6.78-28.67-18.39-37.77z"></path></svg> lluis.revilla@ciberehd.org](lluis.revilla@ciberehd.org) ] ???