Biodiversity Data Utilities (bdtools)
Biodiversity research is evolving rapidly, progressively changing into a more collaborative and data-intensive science. The integration and analysis of large amounts of data is inevitable, as researchers increasingly address questions at broader spatial, taxonomic and temporal scales than before. A growing number of scientists use R for their data analyses, but the skill set required to handle biodiversity data in R varies considerably. Because users need to retrieve, manage and assess high-volume data with a complex structure (Darwin Core, DwC), only users with a very sound R programming background can currently attempt this.
In recent years, various R packages dealing with biodiversity data, and specifically with data cleaning, have been published (e.g. finch, scrubr, biogeo, and taxize). Numerous new procedures are now available, which hopefully invites more integrative projects. The invaluable products and insights generated in last year's GSoC projects (Ashwin's; Thiloshon's) serve as the foundation for this project idea.
Identifying spatial, temporal, and environmental outliers can single out erroneous records. However, identifying an outlier is a subjective exercise, and not all outliers are errors. Various statistical methods and techniques will be evaluated using virtual species and real species occurrence data, and an environment for assessing and characterizing algorithm performance will be established.
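As one illustration of the kind of method to be evaluated, the sketch below flags environmental outliers with a simple interquartile-range rule. The covariate name and threshold are assumptions made for the example, not a prescribed bdtools method.

```r
# Illustrative only: flag records whose value of an environmental covariate
# lies outside 1.5 * IQR of the species' values. The covariate name
# ("annual_temp") and threshold k are assumptions made for this example.
flag_env_outliers <- function(occ, covariate = "annual_temp", k = 1.5) {
  x   <- occ[[covariate]]
  qs  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[[2]] - qs[[1]]
  occ$env_outlier <- x < (qs[[1]] - k * iqr) | x > (qs[[2]] + k * iqr)
  occ
}

# Simulated data standing in for real occurrence records
set.seed(42)
occ <- data.frame(annual_temp = c(rnorm(98, mean = 22, sd = 2), 5, 40))
table(flag_env_outliers(occ)$env_outlier)
```

Whether such a flagged record is an error or a genuine extreme is exactly the subjective judgement the module aims to characterize.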
Capabilities for handling and accessing taxonomic data in R are impressive, thanks to the rOpenSci team. Their collection of taxonomy packages opened the door to developing many exciting taxonomic workflows. In this module several workflows will be tailored to the needs of the biodiversity informatics community. A good example of such a workflow can be found in Thiloshon's GSoC 2017 project.
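A brief sketch, assuming the taxize package, of a step such a workflow might include: matching raw scientific names to the GBIF backbone and retrieving their higher classification. The species names are purely illustrative.

```r
# Illustrative name-resolution step, assuming the taxize package.
library(taxize)

raw_names <- c("Pteropus alecto", "Macroglossus minimus")

# Match each name to its GBIF backbone identifier (first match only)
ids <- get_gbifid(raw_names, rows = 1)

# Retrieve the higher classification for each matched identifier
classification(ids)
```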
We hope to introduce a parallelization option in some heavy-duty procedures, enabling users to make full use of their multi-core processors and available memory while saving precious running time. In this module you will explore and test different parallelization techniques in R.
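For instance, the minimal sketch below compares a serial lapply() with a socket-cluster parLapply() from the base parallel package; slow_task() is a hypothetical stand-in for a heavy-duty bdtools procedure.

```r
# Minimal serial vs. parallel comparison using the base 'parallel' package;
# slow_task() is a hypothetical stand-in for an expensive bdtools procedure.
library(parallel)

slow_task <- function(i) {
  Sys.sleep(0.1)  # simulate an expensive per-record computation
  i^2
}

n_cores <- max(1, detectCores() - 1)  # leave one core free

# Serial baseline
system.time(serial <- lapply(1:40, slow_task))

# Socket cluster: portable across Windows, macOS and Linux
cl <- makeCluster(n_cores)
system.time(par_res <- parLapply(cl, 1:40, slow_task))
stopCluster(cl)

identical(serial, par_res)
```

Choosing between forked and socket back-ends, chunking strategy, and memory overhead per worker are among the trade-offs this module would examine.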
By offering a collection of ready-to-use solutions for data quality assessment, we can fully harness the synergy of the R package ecosystem.
Students, please contact the mentors listed below after completing at least one of the tests.
- Tomer Gueta [email protected] is the author of the R package bdclean and has been working with large biodiversity datasets for several years. Part of his research deals with integrating data cleaning with data analysis to enhance the usability of biodiversity big data.
- Vijay Barve [email protected] is a biodiversity data scientist who has contributed to several biodiversity-related R packages, e.g. rgbif, rvertnet, rinat and bdvis.
- Yohay Carmel [email protected] is an Associate Professor at the Faculty of Civil and Environmental Engineering, Technion. Yohay is an ecologist dealing with a wide range of biodiversity research.
- Easy: Download 10,000 GBIF occurrence records of bats in Australia (georeferenced records only), using the rgbif R package (see the sketch after this list).
- Medium: Build several visualizations in R that most effectively summarize the bat data you downloaded. Extra points for high aesthetics and creativity.
- Medium: Briefly review possible techniques for implementing parallel processing in R, and give us a full list of considerations for designing the most appropriate parallelization strategy.
- Hard: Please read the paper by Liu et al. (2017) on detecting outliers in species distribution data. Write a critical review of the paper, following these guidelines.
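For orientation on the Easy test only, a minimal sketch assuming the rgbif package: it looks up the GBIF backbone key for the order Chiroptera (bats) and requests up to 10,000 georeferenced records from Australia. Students should still develop and document their own solution.

```r
# Orientation only: download georeferenced bat occurrences from Australia
# via rgbif. Argument names follow the rgbif occ_search() API.
library(rgbif)

# Resolve the backbone taxon key for the order Chiroptera (bats)
chiroptera_key <- name_backbone(name = "Chiroptera", rank = "order")$usageKey

bats <- occ_search(
  taxonKey      = chiroptera_key,
  country       = "AU",          # Australia
  hasCoordinate = TRUE,          # georeferenced records only
  limit         = 10000
)

nrow(bats$data)
```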
Students, please post a link to your test results here.
Gayan Seneviratna | Github | Visualisation dashboard.
Povilas Gibas - Test solutions