• Home   /  
  • Archive by category "1"

Coursera R Programming Assignment 1 Help

The assignment for week 2 is kinda tough if you have not used R before. The video lectures also did not prepare you for it. If you have not taken the swirl tutorial, I strongly recommend that you finish it at the beginning of the week 2. You also want to start working on the assignment as soon as possible.

Derek Franks wrote a great tutorial. If you follow the step by step tutorial closely, you should have no problem finishing some problems in assignment 1. Here is the link to the tutorial:


The second challenge I had about this assignment is that I did not know how to return a data frame in a function. After experimenting a bit and I finally got it to work. Here are the code for returning a data frame in a function.

## initiate the data frame results <- data.frame() ## loop through the files for (i in id) { ## read file and get completed cases ## add to the data frame. results <- rbind(results, data.frame(id=i,nobs=completed_cases)) } ## return the data frame return(results)

Function cor is used in one of the problems, but it’s not taught. You are supposed to figure it out by yourself. The usage is actually quite easy. Suppose you read the file and store it in a data frame called data. To calculate the correlation between column 2 and column 3, you use corr this way.

cor(data[,2], data[,3])

I’m currently going through the John Hopkins Data Science  specialization. So far they are okay. These courses are pretty tough so if you are complete beginner you can complement these courses with Data Camp course if you need more practice.The only annoying part about this class is that they do not mention some of the functions you will need to complete the assignments during lectures.  Luckily there are TA hints to supplement this lack of information.

I am a researcher. When I run into problems I would go online and try to ask the right question to help me solve the problem that’s in front of me. Or I would talk to different people who have expert knowledge on the subject matter or I will just find the answer in a book at my local library. But being a programmer or data scientist involves breaking down problems. The toughest part for me is being comfortable with problem solving.

Background about the data

Air Pollution: look at the report that you did about it.

First thing I do whenever I have data is to explore the file in excel in order to get a better understand of its structure. Each file in the specdata folder contains data for one monitor.

  • Date: the date of the observation
  • sulfate: the level of sulfate particle matter in the air on that date
  • nitrate: the level of nitrate particle matter in the air on that date

The pollutantmean function the prompt states:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

At this point I don’t really know what I’m doing. It’s just me staring at the computer for an hour…

With every coding problem I try to think about things on a higher level. If I have an understanding of the problem then I can muddle my way through the syntax of the code.

What are they asking for here?

There just want the mean of the pollutants by the IDs.

So the higher level stuff is done next we break the problem into steps.

  1. read the files all the data files
  2. merge the data files in one data frame
  3. ignore the NAs
  4. subset the data frame by pollutant
  5. calculate the mean.

Lets find out what the mean function requires.

?mean Usage mean(x, ...) ## Default S3 method: mean(x, trim = 0, na.rm = FALSE, ...) Arguments x An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only. trim the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. na.rm a logical value indicating whether NA values should be stripped before the computation proceeds.

According to the documentation the mean can only take one R object. But there is a problem.


The information for each monitor is in an individual csv file. The monitor id is the same as the csv files names. In order to find the mean of pollutants of monitors by different ids we will have to put each monitor’s information into one data frame.


At this point  I do not know how to walk  rough a directory of files. I know in python there is os.walk . In R there is the list.files function .

In order to perform the same actions to multiple files I looped through the list.

pollutantmean<-function(directory,pollutant,id=1:332){ #create a list of files filesD<-list.files(directory,full.names = TRUE) #create an empty data frame dat <- data.frame() #loop through the list of files until id is found for(i in id){ #read in the file temp<- read.csv(filesD[i],header=TRUE) #add files to the main data frame dat<-rbind(dat,temp) } #find the mean of the pollutant, make sure you remove NA values return(mean(dat[,pollutant],na.rm = TRUE)) }


Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

This question was easier than both parts 1 and parts 2. The question is simple to understand. They just want a report that displays the number of completed cases in each data file. This question reminds me of nobs function  that is used in another statistical software called SAS its used in many businesses. Unlike R which seems to be primarily used in academia and research.


  1. read in the files
  2. remove the NAs from the set
  3. count the number of rows
  4. create a new data set  has two columns that contains the monitors id number and the number of observations
complete <- function(directory,id=1:332){ #create a list of files filesD<-list.files(directory,full.names = TRUE) #create an empty data frame dat <- data.frame() for(i in id){ #read in the file temp<- read.csv(filesD[i],header=TRUE) #delete rows that do not have complete cases temp<-na.omit(temp) #count all of the rows with complete cases tNobs<-nrow(temp) #enumerate the complete cases by index dat<-rbind(dat,data.frame(i,tNobs)) } return(dat) }


Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows

Part 3 was a dosey. Its similar to part 1 for the fact that we are basically aggregating data. But this time instead of a data frame we are aggregating data inside a vector.

  1. read in the files
  2. remove the NAs from the set
  3. check to see if the number of complete cases are > then threshold
  4. find  the correlation between different types of pollutants.

In order to combine different data sets  we used  rbind (combine rows). To combine different  vectors we use cbind(combine columns).

corr<-function(directory,threshold=0){ #create list of file names filesD<-list.files(directory,full.names = TRUE) #create empty vector dat <- vector(mode = "numeric", length = 0) for(i in 1:length(filesD)){ #read in file temp<- read.csv(filesD[i],header=TRUE) #delete NAs temp<-temp[complete.cases(temp),] #count the number of observations csum<-nrow(temp) #if the number of rows is greater than the threshold if(csum>threshold){ #for that file you find the correlation between nitrate and sulfate #combine each correlation for each file in vector format using the concatenate function #since this is not a data frame we cannot use rbind or cbind dat<-c(dat,cor(temp$nitrate,temp$sulfate)) } } return(dat) }

In retrospect this assignment is very useful. In most of the data science courses throughout this specialization you are going to be selecting, aggregating, and doing basic statistics with data.

Like this:



One thought on “Coursera R Programming Assignment 1 Help

Leave a comment

L'indirizzo email non verrà pubblicato. I campi obbligatori sono contrassegnati *