You, as part of a team, will be responsible for completing an open-ended final project for this course. The goal is to tackle an “interesting” real-life problem using the tools and techniques covered in this class.
Stick to optional interim deadlines.
You will pick a dataset from the dedicated list of data sources given below. These data sets have been carefully selected from different resources to make that step easier for you. Generally, your goal is to do something reasonable with the selected data set using what you have learnt in the course. That is your final project in a nutshell! More specifically, you can think of the following steps as an example (but you are not limited to them)
The final project for this class will consist of analysis on a dataset of your own selection from the suggested list of options. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.
The goal is not to do an exhaustive data analysis, i.e. do not calculate every statistic and procedure you have learned for every variable. Instead, show us that you are proficient at asking meaningful questions and answering them with the results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Stay focused on the data you are investigating and be cautious about overinterpretation. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and the appropriateness of the statistical analysis, should be discussed as part of this critique.
The project is very open-ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use, but sticking to packages we learned in class (e.g. `tidyverse`) is advised. You do not need to visualize all of the data at once.
A single high-quality visualization will contribute much more to a good mark than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R. You must also set up and use a GitHub repository to collaborate with your team.
One member of your team should be responsible for managing the repository of your group project. This team member should clone the project template repository from the course GitHub account and add each of the team members as collaborators. Instructions on how to clone a repository, add collaborators and to create a new version control R project can be found in the lab worksheets.
The project template repository can be found HERE.
You must add the course GitHub account (`uoeIDS`) as a collaborator to your GitHub repository, similar to the homework process.
It is highly recommended that you regularly commit any changes you have made to your work, and that you frequently pull and push these changes with your repository on GitHub. Please be aware of any merge conflicts and try to resolve them (if in doubt, ask your team members how best to resolve a conflict). If you notice an unusual error message, then please seek assistance, either from a tutor in workshops or via an informative post on Piazza.
In order for you to have the greatest chance of success with this project it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored.
For this, we recommend that you select one of the data sets listed below. Read all of the data descriptions and select one that interests you the most:
ID | Link | Description |
---|---|---|
01 | Link | Weekly A&E Activity and Waiting Times, Public Health Scotland |
02 | Link | Scottish heart disease statistics, Public Health Scotland |
03 | Link | Mental health inpatient activity, Public Health Scotland |
04 | Link | Data on air pollutants in Scottish cities, Air Quality Scotland |
05 | Link | Small-Body Database of asteroids and comets, NASA |
06 | Link | Precipitation and temperature measurements, National Centers for Environmental Information |
07 | Link | Prisoner population statistics (England & Wales), GOV.UK |
08 | Link | Formula E championship data, Kaggle |
09 | Link | Laptop prices, Kaggle |
10 | Link | Bird morphological measurements, Ecology Letters |
11 | Link | Communities and Crime, UCI |
It is advised that you select a data set from the list above. However, feel free to search the internet for alternative data sets. Some suggested online data resources can be found at Kaggle, Scottish Government, Public Health Scotland, UK Government, TidyTuesday, World Bank and the UCI machine learning repository. The dataset you select should have at least 200 rows and a mix of numerical and categorical variables. Please check with the course organiser or a lecturer as to whether the data you select is feasible for your project.
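As a rough illustration of the size and variable-type guidance above, the following sketch checks a data frame using base R only. The helper name `check_dataset` is hypothetical, and treating character, factor and logical columns as categorical is an assumption rather than a course rule. The built-in `iris` data is used purely as a stand-in for your own data.

```r
# Hypothetical helper: summarises whether a data frame meets the
# "at least 200 rows, mix of numerical and categorical variables" guidance.
check_dataset <- function(df, min_rows = 200) {
  # Numeric columns (integer or double)
  is_num <- vapply(df, is.numeric, logical(1))
  # Assumption: character, factor and logical columns count as categorical
  is_cat <- vapply(
    df,
    function(x) is.character(x) || is.factor(x) || is.logical(x),
    logical(1)
  )
  list(
    n_rows        = nrow(df),
    enough_rows   = nrow(df) >= min_rows,
    n_numeric     = sum(is_num),
    n_categorical = sum(is_cat),
    has_mix       = any(is_num) && any(is_cat)
  )
}

# Stand-in example: replace iris with the data frame you have loaded
result <- check_dataset(iris)
str(result)
```

A check like this is only a first filter. Whether a dataset is feasible for the project still needs to be confirmed with the course organiser or a lecturer.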
The two ‘check-in’ points are not assessed and do not count towards your final grade. A check-in will consist of a discussion with a tutor to receive feedback on your current progress and on your future plans. We highly recommend that you put some effort into preparing for these check-in discussions so that they run smoothly and you receive as much feedback as possible.
You can record any feedback and suggestions you receive from your tutor in your project_template `README.md` file, so that you can refer back to them later.
Week 6: You should demonstrate that you have a basic understanding of the data you have selected and have made initial steps in cleaning, summarising and visualising your data. We will be looking to understand whether you have a clear problem statement and a plan for how you will explore the data to address it.
Week 10: At this stage, you should be able to demonstrate a reasonable understanding of your data, make some comments in relation to your problem statement, and show that you have thought carefully about what type of model to use. We will be looking to see whether your plans are appropriate and can reasonably be completed within the last two weeks, and whether you have thought carefully about your presentation and report.
It is important to NOT miss these workshops for the health of your final group project.
You will need to prepare a slide deck using the template in the repo. There isn’t a limit to how many slides you can use; however, there is a time limit (10 minutes per group in total). By default, you can think of this time as 7 minutes of talk + 2 minutes of questions, and you should prepare with this time limit in mind.
Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”); instead, it should convey what choices you made, why you made them, and what you found.
Before you finalize your presentation, make sure your code chunks are turned off (not displaying the code) with `echo = FALSE`.
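One common way to hide the code for every chunk at once is to set the knitr default in a setup chunk near the top of the `.Rmd` file. This is a minimal sketch, assuming the template does not already set this option; check the template's setup chunk before adding it.

```r
# Place inside the setup chunk at the top of presentation.Rmd.
# Sets the default for all later chunks: run the code, but do not display it.
knitr::opts_chunk$set(echo = FALSE)
```

An individual chunk can still override this default through its own chunk options, which is useful if you want to show one specific piece of code.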
Presentation schedule: Presentations will take place during the last workshop of the semester (29 November). All teams will give them as a live presentation in the workshop. During your workshop you will watch presentations from other teams in your workshop and will be able to ask questions at the end. The presentation line-up will be generated randomly, later during the semester.
Along with your presentation slides, we want you to provide a brief summary of your project.
Write the report as either a markdown (`.md`) or an R Markdown (`.Rmd`) file, depending on whether you want to include code in your report. This report should provide information on;
To find the word count of your report, ensure that you have your report `.Rmd` file open in the editor panel (top-left) and then navigate to Edit > Word Count. The presented value ignores all text within code chunks. It counts the words that appear in the title, section headings and the main text of your report.
You will be asked to fill out a survey where you rate the contribution and teamwork of each team member at the end. Submitting this information is a prerequisite for getting credit on the team member evaluation.
IMPORTANT: If you are concerned that a member of your team is severely under-contributing to your group project, then please contact the course organiser as soon as possible.
When completing the peer evaluation, if you are suggesting that an individual did less than 5% of the work (“No contribution” or “Very poor” contribution) then you must provide a justifiable explanation.
This survey will be available on WebPA, accessible via the course Learn page (Learn > Assessment > WebPA).
This is a template repository for your group project. One of your group members should clone this repository and add the other team members as collaborators. You should also add the `uoeIDS` course GitHub account as a collaborator.
- `/data` – Save any data you are using for your project in this folder.
- `/img` – If you choose to incorporate any supplementary images into your report or presentation that are not generated by your code, then you should upload the images to this folder.
- `investigation.Rmd` – Use this file as the primary location for doing your data science investigation. Here you can develop code and document your findings as part of your exploration. Feel free to create additional `.Rmd` files for different group members or different investigation directions; do what works best for your group.
- `report.Rmd` – This file is for the write-up of your group project report, which is to be submitted in week 11. Provided that you document your findings during your investigations, writing the report should mostly involve copying and pasting material and ensuring that the report flows coherently, as if from one unified voice.
- `presentation.Rmd` – Use this file to create your presentation. You will need to download and install the `xaringan` package, which contains all of the R and markdown code that you will need to create and compile the presentation slides. For this, run `install.packages("xaringan")` once in your console. Guidance on using the `xaringan` package for creating a presentation in R Markdown can be found at https://bookdown.org/yihui/rmarkdown/xaringan.html.
- `README.md` – This document, which outlines the structure of the report. The contents of this file will be rendered on GitHub, so you can add comments below to keep track of your progress, and you can use it to keep additional notes or task lists.

See the other details in the project template file when you start working.
You will need to submit your project report to Learn (Learn > Assessment > Gradescope > Final project - team) on 29 November:
- `.html` version of your report (including a link to the GitHub repository) by 16:00
- `.html` version of your presentation by 16:00

Only one member of your team should submit the report and presentation to Learn. This person will need to add the other team members from the drop-down menu that appears under View or edit group after uploading the files and viewing the submission. Everyone should submit the peer evaluation individually.
As explained above, you must also add the course GitHub account (username: uoeIDS) to your team GitHub repository by the time of the deadline.
We expect your GitHub repository to contain the following folders and files:
- `presentation.Rmd` + `presentation.html`: Your presentation slides
- `report.Rmd` + `report.html`: Your report
- `investigation.Rmd` + `investigation.html`: Your working document for your data science investigation (Note: the work here will not be assessed; it is up to you how you use these files to distribute work amongst members and to keep notes about what you find.)
- `README.md`: Your progress discussion notes from the check-in meetings earlier in the semester.
- `/data/*`: Your dataset in csv or RDS format, in the `/data` folder.
- `/img/*`: Any supplementary images for your report/presentation.

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.
There are also 10 pts available for reproducibility and organisation. The marks for this will be based on your GitHub repository, so make sure it contains the folders and files listed above.
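For reproducibility, it helps to read data through paths built relative to the project folder rather than machine-specific absolute paths. The sketch below illustrates this with the csv and RDS formats mentioned above. It uses a temporary folder as a stand-in for the repository root, the built-in `iris` data as a placeholder, and hypothetical file names (`example.csv`, `example.rds`); none of these come from the project template.

```r
# Stand-in for the project root; in a real project this is the repository folder
project_root <- file.path(tempdir(), "project_sketch")
data_dir <- file.path(project_root, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical file names inside the data folder
csv_path <- file.path(data_dir, "example.csv")
rds_path <- file.path(data_dir, "example.rds")

# Save a placeholder dataset in both formats
write.csv(iris, csv_path, row.names = FALSE)
saveRDS(iris, rds_path)

# Read the data back in, as a report or presentation file would
from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
from_rds <- readRDS(rds_path)
```

RDS files keep R-specific details such as factor columns exactly as saved, whereas a csv stores plain text and column types are inferred again on reading. Either format is fine as long as the same code runs unchanged on another team member's machine.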
Code: In your presentation your code should be hidden (`echo = FALSE`) so that your document is neat and easy to read. However, your document should include all your code such that if I re-knit your R Markdown file then I should be able to obtain the results you presented.
Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
Teamwork: You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment, and team evaluations will be given at its completion. Anyone judged not to have contributed sufficiently to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.
Component | Points |
---|---|
Presentation | 50 pts |
Report | 30 pts |
Reproducibility and organization | 10 pts |
Team peer evaluation | 10 pts |
Total | 100 pts |
There is no late submission or make-up for the presentation. You must be in class on the day of the presentation to get credit for it (in case of illness, a requirement to self-isolate or similar, special circumstances can be applied for from the university: see the Policies page).
Late submissions are not accepted for the written component of the project, but extensions are permitted (up to 4 days).