Collaborative Data Science Project

Objectives

The objective of this assignment is to work with one or two of your classmates over the final weeks of the class to complete your choice of one of the following options:

A “Methods Exploration” Project

  • In consultation with the instructor, your group should select a particular statistical method, data analysis workflow, and/or data visualization tool that we do not cover in class and that you would like to explore further, and then prepare a 10-15 minute presentation on it for your peers. In your presentation, you will describe the purpose of the method and walk the class through examples/applications of its use with a publicly available dataset (e.g., from Kaggle, Google Dataset Search, Data.gov, etc.).

  • In addition to your presentation, you will develop an accompanying R package, shared through GitHub, with a “vignette” that takes a user through the method/workflow.

  • Ideally, the method you choose to explore would be one that you and your partner(s) will find useful for your own research. This is your chance both to be creative and to delve more deeply into the statistics, data science, and R literature on a topic of your own choice! Below, I list a large number of possible topics. These are by no means exhaustive: your group should feel free to suggest any other topics you might find interesting. Just confirm with me that what you would like to focus on is appropriate.

    • Principal components analysis (PCA)
    • Discriminant function analysis
    • Factor analysis
    • Structural equation modeling
    • Clustering and classification tools (e.g., k-means clustering, classification trees, random forests, SVMs)
    • Nonlinear regression, ridge regression, lasso regression
    • Species occupancy/distribution modeling
    • Social/biological network construction and visualization
    • Geospatial data visualization and spatial queries
    • Text mining and natural language processing
    • Interacting with relational/nonrelational databases and query languages
    • Manipulating and analyzing DNA sequence data
    • Phylogeny construction, phylogenetic comparative analysis
    • Bioinformatics data processing
    • Image analysis (e.g., supervised and unsupervised classification, feature extraction)
    • Machine learning
    • Acoustic data analysis

A “Novel Data Science” Project

  • Implement a modest data science analysis using one or both of your group members’ own data and addressing a theoretically interesting problem in the natural or social sciences. The goal is to incorporate several of the tools learned in this class into a project of your own design that, ideally, might also help you move your own research forward. The project should include both exploratory (descriptive) and inferential data analyses. Your group will give a 10-15 minute presentation on your project to your peers.

  • In addition to your presentation, you will develop an accompanying R package, shared through GitHub, with a “vignette” that walks the reader through your analysis and workflow.

What to Do

You and your groupmate(s) will work together to develop a short (roughly 10-15 minute) but reasonably comprehensive presentation and an associated HTML “vignette” (one that can be rendered from a “.qmd” or “.Rmd” document). Depending on which type of project you are doing, your vignette will provide background either on the statistical method/topic/data visualization procedures that you have chosen to explore or on the dataset and theory that you are exploring with your own data. It should then take the user through a set of analyses either demonstrating the method being explored or applying approaches we have learned in class to your own data. Your module should be organized similarly to those that I have prepared for various other topics this semester, though it need not be as long.

The presentation can either be done from the module itself or as a PowerPoint/Keynote/Google Slides/R presentation/whatever slideshow.

Thus, all groups will need to produce a “.qmd” file that will be the basis for a module vignette that you build and bundle into a custom R package. Besides the vignette, your package should also bundle at least some of the data you are working with into one or more included datasets and should bundle code for at least two functions that are called in the vignette. The functions you include can be custom functions that you create, but it is also perfectly fine to copy functions from other packages and include them in your own package. The important thing is that you are bundling them together with your dataset and vignette for easy distribution. You will want to create your own documentation for each function, as discussed in the Miscellany module on Building Custom R Packages. One of the main objectives of this exercise is to give you experience with tools for distributing shared data and code to other researchers and to pull the curtain back a bit on just what producing an R package entails.
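
As a rough illustration of what one documented function in your package might look like, here is a minimal sketch. It assumes roxygen2-style comments (the `#'` lines), which may or may not match the workflow in the Building Custom R Packages module; the function name `standardize()` and its arguments are hypothetical and only stand in for whatever your group actually writes.

```r
#' Standardize a numeric vector
#'
#' Centers a numeric vector on its mean and scales it by its standard
#' deviation. (Hypothetical example function, not part of any course package.)
#'
#' @param x A numeric vector.
#' @param na.rm Logical; should missing values be ignored when computing
#'   the mean and standard deviation?
#' @return A numeric vector of the same length as `x`.
#' @examples
#' standardize(c(1, 2, 3))
#' @export
standardize <- function(x, na.rm = TRUE) {
  # Fail early with a clear message if the input is not numeric
  if (!is.numeric(x)) {
    stop("`x` must be numeric")
  }
  # Subtract the mean, then divide by the standard deviation
  (x - mean(x, na.rm = na.rm)) / stats::sd(x, na.rm = na.rm)
}
```

A file like this would typically live in the package's `R/` directory, and the vignette would then call `standardize()` on one of the bundled datasets.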

Please be sure to divide up the work with your partner(s) so that each of you is contributing more-or-less equally!

What to Turn In

You will collectively take the class through your presentation/module during our final class meeting period and the final exam period. We will aim for 7-8 presentations total of ~15 minutes each, with time for questions. I may also try to record these presentations on video and share them via the Canvas site.

Your group’s work should also result in a custom R package that can be shared as a single file and loaded into an R workspace (e.g., using the install.packages() function). As noted above, that package should combine the following:

  • A set of functions and associated function documentation appropriate to the topic you have chosen.
  • One or more relevant datasets (either data you have simulated or, preferably, real data) that you use in a module vignette about your chosen topic.
  • A vignette that walks users through demonstrations of relevant analyses.
  • Appropriate metadata for your package (e.g., information on dependencies, etc.).
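
For reference, installing a package from a single bundled file might look something like the sketch below. The helper `install_bundle()` and the file name `mypackage_0.1.0.tar.gz` are hypothetical; the only real call is `install.packages()` with `repos = NULL` and `type = "source"`, which tells R to install from a local source bundle rather than from CRAN.

```r
# Hypothetical helper: install a package from a local source bundle
install_bundle <- function(path) {
  # Stop with a clear message if the bundle file is not where we expect it
  if (!file.exists(path)) {
    stop("Bundle not found: ", path)
  }
  # repos = NULL and type = "source" install from the local file
  utils::install.packages(path, repos = NULL, type = "source")
}

# Example usage (file and package names are placeholders):
# install_bundle("mypackage_0.1.0.tar.gz")
# library(mypackage)
# vignette(package = "mypackage")  # lists the vignettes installed with the package
```

Trying this on a classmate's computer before you submit is a quick way to confirm that your functions, datasets, and vignette all made it into the bundle.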

NOTE: The Miscellany module on Building Custom R Packages takes the user through all of the steps of package development, so use it as a resource! We will work through this module during one of our last class periods.

Additionally, by the due date for this assignment (the last day of the formal final exam period), one group member should upload the final R package’s bundled “.tar.gz” file to the Canvas site for the course. Your group should also upload your entire R package project to a GitHub repository and share the URL to that repository with your instructors via the assignment submission text field on Canvas.

NOTE: We should be able to CLONE your repository and see all of the components associated with your package development.