Project R


Submission format:

  • Submit your code as a .R file. The filename MUST have the following format:

LastnameFirstname_Username_ProjectR.R.

Username is your RIT email address. If you don’t have an RIT email address, use the email address that you use in myCourses.

At the beginning of your R code file: Add 2 lines:

dirOUT=path to your R output folder; (this one is the path you create yourself);

sink(paste0(dirOUT,"ProjectR_Ouput.txt"));

In this file the problems must be listed in order. Each problem must also have a comment line that precedes it. For example, for question 2, use:

### q2  ###;

After this line, also print out the problem number in R output using:

cat("q2\n");

Each part of a problem must also have a comment line that precedes it. For example for question 2b, use:

### q2b ###;

After this line, also print out in R output using:

cat("q2b\n");

At the end of your file add a line:

sink()

 

This ensures that I and the graders can follow your work.

The sink function will write R output to a text file when you run your R code file.

After that copy the content of the text file and paste it to a Word document.

In this document, you can also paste other types of output (such as graphs, plots).
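As a rough illustration of how these pieces fit together, the sketch below diverts two labeled outputs to a text file and reads the file back. It uses a temporary folder (an assumption, so that the example runs on any computer); in your own file, dirOUT would be the output folder that you created.

```r
# Assumption: a temporary folder stands in for your own output folder.
dirOUT <- paste0(tempdir(), "/")
outFile <- paste0(dirOUT, "ProjectR_Ouput.txt")

sink(outFile)              # from here on, printed output goes to the file

### q2 ###
cat("q2\n")

### q2b ###
cat("q2b\n")
print(mean(c(1, 2, 3)))    # any printed result is captured as well

sink()                     # stop diverting; output returns to the console

# Read the file back to check what was captured.
captured <- readLines(outFile)
```

Note that R is case sensitive, so the closing call has to be sink() with a lowercase s.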

  • The output of the code is submitted in a Word document.
  • The code should be written so that, except for file-directory locations, the code can be directly run on any computer with RStudio.
  • Any text that is not code (such as responses to a question) will be included in the .R file as comment lines. (IMPORTANT). The Word document is strictly for R output. Everything else should be in the .R file.
  • You must submit the Project to the correct myCourses dropbox.
  • Due date: 5pm Monday Dec 13th, 2019.
  • Make sure you have both files in the submission. Make sure you run your R code file to create the Output file. Make sure the Output file has the output it is supposed to have.
dirOUT="C:/My R folder/";

sink(paste0(dirOUT,"LastNameFirstname_Username_ProjectR_Ouput.txt"));

### q1 ###

cat("q1\n");

### q1a ###

cat("q1a\n");

…some code…

### q2 ###

cat("q2\n");

### q2a ###

cat("q2a\n");

…some code…

sink()

 

Some extra notes:

  1. As always, it is a good idea to read all of the questions first. Some of the later questions may be easier to answer if you keep them in mind while you are answering the earlier ones.
  2. Some questions ask you to create a data frame that later questions depend on. If you can’t figure out how to create such a data frame, send me an email requesting it. You will then get a score of 0 for that question, but you will be able to use the data frame to work on later questions. Note: I do not supply the code to solve the question; I only supply the data frame.
  3. It will be important to get started on the project as soon as you can.
  4. You should also use your good judgment on the project. If it looks like it will take you 3 hours to answer a 5-point question, perhaps you should move on to another one.
  5. For this project, please ignore all of R’s “Warning message… incomplete final line found”, just as we usually do.

 

  1. (15pts) In Week 3, you were asked to read in “the data for snowfall from the 1884-85 season until the 2001-02 season (the 2002-03 season is incomplete so please eliminate it from consideration).”

The problem with that approach is that you had to look in the file manually to find that you needed to start on line 5 and end on line 122.

 

For this reason, and because your supervisor has recently told you that many such files will need to be processed, it is not reasonable to have to look in each file manually to find the starting and ending lines. Instead, she wants you to write code to read in the correct information from such files. To be more specific, here is what you know about the file structure. (You can guess from RochesterSnowfall.csv that this is the structure, but it is being supplied here for clarity):

  1. The files are comma-delimited (like csv).
  2. The lines that contain the snowfall information are of the form shown in csv. Examples:
1884-85,0,T,1,27.1,22.2,17,3.5,19.5,T,90.3,

2001-02,0,T,0.1,7.1,11.9,18.7,13.8,6.5,T,58.1,

2002-03,0.0,0,16.9,41.1,43.4,21.9,,,,,
  3. In particular, the lines that contain the snowfall information always begin with “xxxx” in columns 1-4, where xxxx is a 4-digit number. (The numbers should also be between 1800 and 2200, so the code can be used for a long time. However, you do not need to check this 1800-2200 range in your code.)
  4. These snowfall-information lines occur sequentially in the data file.
  5. The lines that do not contain the snowfall information never begin with such an “xxxx”.
  6. The lines that contain the snowfall information always begin with xxxx-yy in columns 1-6, where xxxx-yy is the Season variable.
  7. The remaining parts of the line that contain the snowfall information are either numeric or “T” (with one more exception that is noted in the next item).
  8. The last line of snowfall information may be incomplete. Such an incomplete line has a missing value (blank entry) for the Total variable (as well as at least one missing value for the months). All other lines of snowfall information will be complete.

 

Also,

  1. Only the variables
    Year,Season,Sep,Oct,Nov,Dec,Jan,Feb,Mar,Apr,May,Total
    should be included in the output data set, and in this order. Here, Year is numeric and is equal to the “xxxx” above, Season is equal to the
    “xxxx-yy” above, and Sep–Total is numeric.
  2. Your code should convert each “T” to a 0.
  3. Only full seasons should be included. (In the above example, the 2001-02 season is a full season—all of the values have been entered; the 2002-03 season is not a full season—some of the values have not been entered. This is mentioned in item 8 above.)

a. (10pts) (q1a) Write code to read in such a file. Use csv as your data file. Your code must work not just for this particular file, of course, but for other files that have this same format. Name your R data frame snow1. Also, “str” your data frame and then print out your data frame.

b. Bonus (2pts) (q1b) Print out

N full years of snowfall data

where N is the number of full years of data. For example, if there are 101 full years of snowfall data, the output should include

101 full years of snowfall data

c. (5pts) (q1c) Do the same work as in (a), but for csv. Name this R data frame snow2. Your code should be exactly the same in (a) and (c) except for the data file name and the data frame name. Make sure you “str” your data frame and then print out your data frame.

Some hints: the na.strings argument will be useful. readLines will be useful for reading in all of the lines so that you can work with them.
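To make these hints a little more concrete, here is a minimal sketch (not a full solution) of one way to pick out the snowfall lines by pattern instead of by line number. The toy lines, the reduced set of columns, and the object names are all invented for illustration; a real file would be read with readLines first.

```r
# Toy stand-in for readLines() output; header and footer text are invented.
lines <- c("Rochester snowfall (inches)",
           "Season,Sep,Oct,Total",
           "1884-85,0,T,90.3,",
           "2001-02,0,T,58.1,",
           "2002-03,0.0,0,,",
           "Source: hypothetical footer")

# Keep only the lines that begin with xxxx-yy followed by a comma.
isData <- grepl("^[0-9]{4}-[0-9]{2},", lines)

# Read every field as character: a column holding only "T" would otherwise
# be converted to logical TRUE. Blank fields become NA via na.strings.
toy <- read.csv(text = lines[isData], header = FALSE,
                colClasses = "character", na.strings = "")
toy <- toy[, 1:4]                       # the trailing comma adds an empty column
names(toy) <- c("Season", "Sep", "Oct", "Total")

# Convert "T" to 0 and make the measured columns numeric.
numCols <- c("Sep", "Oct", "Total")
toy[numCols] <- lapply(toy[numCols], function(x) {
  x[!is.na(x) & x == "T"] <- "0"
  as.numeric(x)
})

# Keep full seasons only (Total present) and add the numeric Year.
toy <- toy[!is.na(toy$Total), ]
toy$Year <- as.numeric(substr(toy$Season, 1, 4))
toy <- toy[, c("Year", "Season", numCols)]
```

Because the selection depends only on how each line begins, the same code should carry over to another file with the same structure, which is the point of the question.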

  2. (25pts) You are working with a group of biostatisticians on a large clinical trial. You have been asked to read in and perform a series of checks on some of the data for this trial, and to find some basic summary measures.

It may be a good idea to read all of these questions first before you do any work.

The data are in the tab-delimited file Pr1CT.dat. This data file contains:

  • PatientID: format is Xyyy, where X is a code (values should be P, J, M, N, or V) that indicates where the hospital is located and yyy is a 3-digit number whose values should be integers between 101 and 999. However, the text “Missing” is also a possible value for PatientID.
  • Sex: should be M, F, or “Missing”.
  • Race (or Ethnicity): should be Caucasian, Asian, Black, Hispanic/Latino, Other, or “Missing”.
  • Treatment: T (treatment group) or C (control group) for a potential anti-allergen drug. (The drug is intended to be taken once each week, but that does not matter for this problem.) There should not be any value of “Missing”, but there may be.
  • Sens0: An allergen-sensitivity measure to the drug at “time 0,” just before the treatment or control was given to the patient for the first time. Lower numbers are better.
  • Sens1, Sens2, Sens6, Sens12,  Sens24: the same measurement at 1, 2,…, 24 months after time 0.

 

In some of the questions below, you will be asked to check some parts of the data for errors. There may be other errors in the data set as well, such as misspellings. In real life, you would also want to detect those errors. However, for this project, please ignore those other errors.

 

Using R:

a. (5pts) (q2a) Read in the data into the data frame CT1 and name the variables as indicated in the data file or above. Change all “Missing” values into the R missing value code. (If you are unable to read in the data, which you will need for later questions, you may send me an email to request this initial data frame. Your score will then be 0/5, and there will be no need to do the next str or print.) Then call str(CT1) and also print out the first 325 rows of the data frame CT1.

b. (5pts) (q2b) Using CT1 as an input data frame, split the PatientID into Location (first character) and IDNumber (last 3 characters), and write the resulting updated data frame to CT2. Both Location and IDNumber should be character variables and they should be the last two variables in the data frame, in that order. (If you are unable to do this step, which you will need for later questions, you may send me an email to request it. Your score will then be 0/5, and there will be no need to do the next str or print.) Then call str(CT2) and also print out the first 325 rows of the data frame CT2.

c. (5pts) (q2c) Using CT2 as an input data frame, verify that all non-missing Location values are P, J, M, N, or V. If they are, print

All Location values are correct

to the R Console. If not, then for each incorrect value, print a line that looks like this. (This line is for PatientID=“Q124”, but of course you need to use the actual incorrect value.) Also, change the incorrect location value to X (but do not make any changes to the Patient ID value).

Error: Location Q changed to X for PatientID Q124.

Write the resulting updated data frame to CT3. Print out the first 325 rows of the data frame CT3.

d. (5pts) (q2d) Using CT3, check whether all non-missing IDNumber values are, in fact, integers, and are between 101 and 999. If so, then print
All IDNumber values are correct

If not, then for each incorrect value, print a line that looks like this. (This line is for PatientID=“V094”, but of course you need to use the actual incorrect value.) Also, change the incorrect value to 999 (but do not make any changes to the Patient ID value).

Error: IDNumber 094 changed to 999 for PatientID V094.

Write the resulting file to the data frame CT4. Print out the first 325 rows of the data frame CT4.

e. (5pts) (q2e) Using CT4, use ddply to find means and standard deviations of Sens0, Sens1,… Sens24 for each Treatment Group. Put Treatment levels (C and T) and the Mean and Std within each treatment level in the rows, and Sens0, Sens1,… Sens24 in the columns (4×6 table of the sensitivity summaries). Round your answers on the printout to the nearest integer. If you can’t achieve this 4×6 format, use another one (-1 to -2 points).

Some hints: %in% will be useful for (q2c). as.numeric function or as.integer function for (q2d).
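The two hints can be sketched on a toy vector as follows. The IDs below are made up, and the sketch only flags the problem values and builds the messages; it does not modify a data frame.

```r
# Invented PatientID values, including a missing one.
ids <- c("P101", "Q124", "V094", NA, "M55x")

loc <- substr(ids, 1, 1)   # first character
num <- substr(ids, 2, 4)   # last three characters

# %in% returns FALSE (never NA) for values outside the allowed set.
badLoc <- !is.na(loc) & !(loc %in% c("P", "J", "M", "N", "V"))

# as.integer gives NA (with a warning) for text that is not a number.
n <- suppressWarnings(as.integer(num))
badNum <- !is.na(num) & (is.na(n) | n < 101 | n > 999)

# One message per incorrect value, in the format shown in the questions.
locMsg <- sprintf("Error: Location %s changed to X for PatientID %s.",
                  loc[badLoc], ids[badLoc])
numMsg <- sprintf("Error: IDNumber %s changed to 999 for PatientID %s.",
                  num[badNum], ids[badNum])
cat(locMsg, numMsg, sep = "\n")
```

Combining the test with !is.na(...) keeps missing values from being reported as errors.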

 

For the next 3 problems, we will be using the snowfall data sets that you have seen for Project SAS.

Monthly weather information has been obtained for weather stations in four cities in upstate New York: Buffalo, Rochester, Syracuse and Albany. (From http://www.ncdc.noaa.gov/cdo-web/search.) This information is contained in three .csv files, whose names should make the contents obvious: Weather_Buffalo.csv, Weather_Rochester.csv, and Weather_SyracuseAlbany.csv

 

The weather information contains monthly readings that start at 1900/01/01 (or later, depending on when the weather station started collecting data) and end at 2013/09/01 (or earlier, depending on when the weather station stopped collecting data). In some cities, more than one weather station has collected the data.

 

The first five monthly readings for a weather station in Buffalo look like this. The readings are transposed here to make them easier to read—in the data file, these columns are rows and these rows are columns. I also added column (really, row, or variable) numbers to this example.

 1  STATION              GHCND: USC00301010  GHCND: USC00301010  GHCND: USC00301010  GHCND: USC00301010
 2  STATION_NAME         BUFFALO NY US       BUFFALO NY US       BUFFALO NY US       BUFFALO NY US
 3  ELEVATION            234.1               234.1               234.1               234.1
 4  LATITUDE             42.88333            42.88333            42.88333            42.88333
 5  LONGITUDE            -78.88333           -78.88333           -78.88333           -78.88333
 6  DATE                 19000101            19000201            19000301            19000401
 7  MXSD                 102                 203                 635                 0
 8  Missing              0                   0                   0                   0
 9  Consecutive Missing  0                   0                   0                   0
10  TPCP                 972                 1327                965                 288
11  Missing              0                   0                   0                   0
12  Consecutive Missing  0                   0                   0                   0
13  TSNW                 241                 649                 854                 13
14  Missing              0                   0                   0                   0
15  Consecutive Missing  0                   0                   0                   0
16  MMXT                 9                   -12                 -1                  103
17  Missing              0                   0                   0                   0
18  Consecutive Missing  0                   0                   0                   0
19  MMNT                 -57                 -83                 -72                 25
20  Missing              0                   0                   0                   0
21  Consecutive Missing  0                   0                   0                   0
22  MNTM                 -24                 -47                 -37                 64
23  Missing              0                   0                   0                   0
24  Consecutive Missing  0                   0                   0                   0

 

Here is the meaning for each variable. Note that there are variables named Missing and Consecutive Missing for each of the six measured variables (the snow, precipitation, and temperature variables); the meaning, however, is only given here for the first pair.

Variable             Meaning                                                            Example
STATION              Station identification code                                        GHCND: USC00301010
STATION_NAME         Usually, a city or airport name                                    BUFFALO NY US
ELEVATION            Above mean sea level (thousandths of meters)                       234.1
LATITUDE             Latitude                                                           42.88333
LONGITUDE            Longitude                                                          -78.88333
DATE                 (Monthly) Year 4 digits; month 2 digits; day 2 digits              19000101
MXSD                 Maximum snow depth during the month (mm)                           102
Missing              N of days MXSD is missing in that month                            0
Consecutive Missing  Maximum N of consecutive days in that month that MXSD is missing.  0
TPCP                 Total precipitation for month (tenths of mm)                       972
Missing                                                                                 0
Consecutive Missing                                                                     0
TSNW                 Total snow fall for month (mm)                                     241
Missing                                                                                 0
Consecutive Missing                                                                     0
MMXT                 Monthly mean maximum temperature (tenths of °C)                    9
Missing                                                                                 0
Consecutive Missing                                                                     0
MMNT                 Monthly mean minimum temperature (tenths of °C)                    -57
Missing                                                                                 0
Consecutive Missing                                                                     0
MNTM                 Monthly mean temperature (tenths of °C)                            -24
Missing                                                                                 0
Consecutive Missing                                                                     0

 

For this project, you will eventually need to rename some of these variables, as follows:

Variable        New Name
STATION_NAME    Station
ELEVATION       Elevation
LATITUDE        Latitude
LONGITUDE       Longitude
DATE            Date
MXSD            MaxSnow
TPCP            Precip
TSNW            Snowfall
MMXT            MeanMaxTemp
MMNT            MeanMinTemp
MNTM            MeanTemp

 

  3. (30pts) Read in the Buffalo csv file only. For this file:

a. (5pts) (q3a) Read in the Buffalo file, letting R read in the variable names that are in the file. (Please note that R automatically renames some names to keep them unique.) Put the columns in the data frame in the order shown above. Make the first two columns class character. There may be missing values in this file or other files, and those are coded as -9999. (The government documentation was misleading: it said “9’s in a field (e.g. 9999)”.) So convert all -9999’s to the R missing value code. Convert the date field to class Date, using the lubridate package (as we have been doing). Name this data frame wBuff1.
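The two conversions in this step can be sketched on a toy data frame as shown below. The values are invented. The sketch uses base R’s as.Date so that it has no package dependency; lubridate’s ymd() is the approach the course asks for.

```r
# Toy data frame with three of the columns; the values are made up.
w <- data.frame(STATION_NAME = c("BUFFALO NY US", "BUFFALO NY US"),
                DATE = c(19000101L, 19000201L),
                TSNW = c(241L, -9999L),
                stringsAsFactors = FALSE)

# Replace the -9999 code with NA in every numeric column.
isNum <- sapply(w, is.numeric)
w[isNum] <- lapply(w[isNum], function(x) replace(x, x %in% -9999, NA))

# Convert yyyymmdd integers to class Date (base R stand-in for lubridate).
w$DATE <- as.Date(as.character(w$DATE), format = "%Y%m%d")
```

Restricting the replacement to numeric columns leaves the character columns untouched.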


b. (5pts) (q3b) Find the total number of missing days for each of the Missing and Consecutive.Missing columns (12 columns in total) and display these in an appropriate listing. The listing should include each variable name and the total number of missing days.

  • Output: the listing. Make the listing go “down,” not “across.”
  • If you wrote code so that the listing goes “across” (for partial credit), make sure the listing does not wrap around several lines (this is a reason for using options(width= a large number)). If you wrote code so the listing goes “down” then this should not be an issue.

c. (5pts) (q3c) Drop these Missing and Consecutive.Missing columns (12 columns in total) from your data frame. Also drop the STATION column. Rename the other columns as shown on the previous page. Then drop the Elevation, Latitude, and Longitude columns. Name this data frame wBuff2. Then, using code, find all of the distinct (unique) Station names in the data frame.

  • Output: (1) an str() of this data frame and (2) a listing of the unique Station names. Make the listing go “down,” not “across.”

d. (5pts) (q3d) Rename the stations from their long all-capital-letter versions to “Buffalo Airport” and “Buffalo City”. Also create two new character variables (and in this order, at the end of the data frame): City and Site. For this problem, City would be “Buffalo” and Site would be either “Airport” or “City”. These new variables, especially City, will be useful later in this project. Make these class character. Name this data frame wBuff3.

e. (5pts) (q3e) Add more date-based variables, as follows (and in this order, at the end of the data frame):

i. MonthN, the month number (1, 2, …, 12) for the date.

ii. Month, the abbreviated month name (“Jan”, “Feb”, … , “Dec”) for the date. Make this class character.

iii. Year, the year of the date.

iv. SnowSeasonLong, the year of the “longer version” of the snow season. Here is an example: for month numbers 10, 11, 12 of 1930 and month numbers 1, 2, 3, 4 of 1931, set SnowSeasonLong to 1930. For any other month numbers, set SnowSeasonLong to be the R missing value code. (Do this for all years, of course.)

Name this data frame wBuff4.

  • Output: (1) an str() of this data frame and (2) a listing of the first 20 records (as always, one record per line; make sure the output width is wide enough).
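As a small illustration of the SnowSeasonLong rule, the sketch below derives the four date-based values for three invented dates. It uses base R’s format() rather than lubridate (an assumption made only to keep the example free of packages).

```r
# Three made-up dates: in season (fall), in season (spring), out of season.
d <- as.Date(c("1930-10-01", "1931-04-01", "1931-07-01"))

MonthN <- as.integer(format(d, "%m"))   # 1..12
Month  <- month.abb[MonthN]             # "Jan".."Dec", class character
Year   <- as.integer(format(d, "%Y"))

# Oct-Dec belong to the season named after their own year;
# Jan-Apr belong to the season that started the previous year.
SnowSeasonLong <- ifelse(MonthN >= 10, Year,
                  ifelse(MonthN <= 4, Year - 1L, NA_integer_))
```

Here the October 1930 and April 1931 dates both land in season 1930, and the July date gets the missing value code.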

 

f. (5pts) (q3f) Clean up the measured variables as follows. Convert the 3 snow and precipitation values into inches (to the nearest 0.1). Convert the 3 temperature values into °F (to the nearest 0.1). Make sure you do these correctly. Name this data frame wBuff5.
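The unit conversions follow from the units listed in the variable table (mm, tenths of mm, tenths of °C). A minimal sketch, applied to the first Buffalo reading shown earlier; the helper function names are invented for illustration:

```r
# Hypothetical helper names; 1 inch = 25.4 mm and F = C * 9/5 + 32.
mmToIn      <- function(mm) round(mm / 25.4, 1)
tenthMmToIn <- function(x)  round(x / 10 / 25.4, 1)
tenthCToF   <- function(x)  round((x / 10) * 9 / 5 + 32, 1)

MaxSnow     <- mmToIn(102)        # MXSD is in mm
Precip      <- tenthMmToIn(972)   # TPCP is in tenths of mm
Snowfall    <- mmToIn(241)        # TSNW is in mm
MeanMaxTemp <- tenthCToF(9)       # MMXT is in tenths of degrees C
```

Precip needs the extra division by 10 because it is recorded in tenths of mm, unlike the two snow variables.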

  4. (15pts) Next, for this Buffalo file, we want to have only one set of values for each month. (The problem is that there may be months when both the airport and city stations have data in the file.)

a. (10pts) (q4a) Verify (well, actually investigate) the extent of any overlapping data between the Airport and City sites. This is a more open-ended question—you will need to figure out how to do this.

  • Output: a (data-frame) listing of every month for which the Airport and City sites have overlapping months (dates) (but not necessarily overlapping useful data—because of any original -9999 coding, some of the records may have missing data). For this particular answer, your listing should include only five variables, and in this order:

Date of overlap, Snowfall amount at Airport, Precip at Airport, Snowfall amount at City, Precip at City. For variable names, use Date,Snowfall.Air,Precip.Air,Snowfall.City,Precip.City.

b. (5pts) (q4b) Regardless of your previous answer, use 1943-07-01 as the switch date: the date to stop using the City data and begin using the Airport data. Create a new data frame from wBuff5 that only contains this subset of rows. (So, this new data frame should still have the same starting and ending dates, but with no more overlapping months. As always, you should be checking your results, for example to see if the N of rows has been reduced by the amount that you expected, that the N of rows makes sense based on the initial and final dates, and so on.) Name this data frame wBuff.
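One possible way to look for overlapping months and then apply a switch date is sketched below on invented data. It assumes that the Airport data are used from the switch date onward (inclusive) and the City data strictly before it; that reading of “switch date” is an assumption.

```r
# Made-up records: two months reported by both sites, one by Airport only.
w <- data.frame(
  Date = as.Date(c("1943-06-01", "1943-07-01",
                   "1943-06-01", "1943-07-01", "1943-08-01")),
  Site = c("City", "City", "Airport", "Airport", "Airport"),
  Snowfall = c(0, 0, 0.1, 0, 0),
  stringsAsFactors = FALSE)

# An inner merge on Date keeps only the months present at both sites.
air  <- w[w$Site == "Airport", c("Date", "Snowfall")]
city <- w[w$Site == "City",    c("Date", "Snowfall")]
overlap <- merge(air, city, by = "Date", suffixes = c(".Air", ".City"))

# Assumption: City before the switch date, Airport on and after it.
switchDate <- as.Date("1943-07-01")
keep <- (w$Site == "City"    & w$Date <  switchDate) |
        (w$Site == "Airport" & w$Date >= switchDate)
wOne <- w[keep, ]
wOne <- wOne[order(wOne$Date), ]
```

Checking that no Date is duplicated afterwards, for example with any(duplicated(wOne$Date)), is one quick way to confirm that the overlap is gone.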

 

  5. Similar work has been done for you for the cities of Rochester, Syracuse, and Albany. (Note: for Rochester, all data have been collected from the Airport. For Syracuse, the switch date from City to Airport is 1940-05-01; for Albany, it is 1938-06-01.) These data frames are available in the file Rdata as wRoch, wSyr, and wAlb. Please load these objects into your workspace.
    You will now need to merge all 4 cities together (from the corresponding 4 data frames, of course), and then create summaries. We are only going to look at the Snowfall data for the rest of the R questions in this project, so that is the only measured variable you should keep here. Also, only keep the dates that all 4 cities have in common. Your final data frame (if you answer the first question below correctly) should then look like this for the first 6 records:

a. (10pts) (q5a) Because R can only merge two data frames at a time, let’s first merge Buffalo and Rochester. You will need to think about how to do this, but the resulting data frame should contain Date, the other date-based variables (see above listing), Snowfall.B (the Snowfall values for Buffalo) and Snowfall.R (the Snowfall values for Rochester). Then, merge Syracuse and Albany data into a second data frame, with suffixes .S and .A for Snowfall. Then merge both data frames together. Name this data frame that contains all four cities as wAll. Make sure that the variables are in the data frame in the order shown above. If you did this correctly, you should get the listing above for the first 6 records.

  • Output: (1) the str() for wAll and (2) a listing of the first 6 records of wAll in the order shown above.
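The pairwise merging described in (q5a) might be sketched like this on four tiny invented data frames. Only Date and Snowfall are included here; the real data frames also carry the other date-based variables, which would need to be part of the merge key.

```r
d <- as.Date(c("1926-10-01", "1926-11-01", "1926-12-01"))

# Invented snowfall values; each city covers a different set of dates.
b <- data.frame(Date = d[1:2], Snowfall = c(1, 2))
r <- data.frame(Date = d[2:3], Snowfall = c(3, 4))
s <- data.frame(Date = d[1:3], Snowfall = c(5, 6, 7))
a <- data.frame(Date = d[2:3], Snowfall = c(8, 9))

# merge() keeps only matching dates by default (an inner join), and
# suffixes= renames the clashing Snowfall columns.
br <- merge(b, r, by = "Date", suffixes = c(".B", ".R"))
sa <- merge(s, a, by = "Date", suffixes = c(".S", ".A"))
all4 <- merge(br, sa, by = "Date")
```

Because each merge is an inner join, the final data frame holds only the dates that all four cities have in common.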

b. (10pts) (q5b) Summarize the data. First, find the yearly total snowfall during the long snow season (from October to April) for each city. Exclude the 1925 season because it is incomplete. Save your results in a data frame that has the variables SnowSeasonLong, Snowfall.B, Snowfall.R, Snowfall.S, and Snowfall.A. You should, of course, have one row for each season. Name this data frame wAllYearSumsSL. Using this data frame, find the overall mean, standard deviation, and %CV for the snowfall in each of the four cities. Round the results to the nearest inch (or nearest %), and put your results in a table that is formatted like this:

     Snowfall.B Snowfall.R Snowfall.S Snowfall.A
mean         xx         xx         xx         xx
std          xx         xx         xx         xx
CV           xx         xx         xx         xx

 

  • Output: (1) the str() for wAllYearSumsSL and (2) the mean/std/CV table.
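A rough sketch of the season totals and the mean/std/CV table, using two cities and invented numbers. It uses base R’s aggregate (one option among several), and %CV is computed as 100 × std / mean.

```r
# Made-up monthly values; the last row is outside the long snow season.
wToy <- data.frame(SnowSeasonLong = c(1930, 1930, 1931, 1931, NA),
                   Snowfall.B = c(10, 20, 5, 15, 0),
                   Snowfall.R = c(8, 12, 6, 14, 0))

# One row per season. The formula interface drops rows that contain NA,
# so out-of-season rows (SnowSeasonLong = NA) are left out.
sums <- aggregate(cbind(Snowfall.B, Snowfall.R) ~ SnowSeasonLong,
                  data = wToy, FUN = sum)

# Summary rows: mean, standard deviation, and %CV for each city.
vals <- sums[, -1]
m <- sapply(vals, mean)
s <- sapply(vals, sd)
tab <- round(rbind(mean = m, std = s, CV = 100 * s / m))
print(tab)
```

One caution with this approach: the same NA handling also drops any month whose snowfall value is missing, so it is worth checking the number of rows that feed into each season total.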

  6. (10pts) Graph the results. There are many graphs that could be made here. Here, you’ll make two sets of them. Both are based on wAllYearSumsSL. For all graphs, please use the default plotting character (pch=1 – but do not bother to specify this) and the default color (black), unless you are otherwise instructed.

a. (5pts) (q6a) Graph the yearly snowfall amounts for each of the four cities vs. Season. Make one graph for each city. Do this with a 2×2 arrangement of graphs, using winds(2,2,7). (There is no need to include the sourcing of winds in the code; that is, you may assume that function is available. However, it is OK if you do include the sourcing in the code.) Make the graphs in the same order shown in the table above: that is, Buffalo, Rochester, Syracuse, and Albany. Make sure that the y-axis limits are the same, and appropriate, on all graphs. Please use code to find the appropriate limits based on the overall minimum and maximum snowfall values.

On each graph, draw horizontal lines at 50, 100, and 150 inches using the “grey75” color. For the Buffalo, Syracuse, and Albany graphs, draw a vertical line in the season when the weather station changed from City to Airport so that any changes in the amount of snowfall can be seen more easily. Use the “lightblue” color here. For the x-axis label, use “Season”; for the y-axis label, say for Rochester, use “Snowfall, Rochester (in)”. (Not part of the project, but did the switch from City to Airport seem to affect the amount of snow reported?)

  • Output: the graph (copy/paste this into the Word file for the R graphs; surround your answer with q6a—typing it in is fine).

b. (5pts) (q6b) Graph the yearly snowfall for each of Buffalo, Syracuse, and Albany (y-axis) vs. the yearly snowfall for Rochester (x-axis). This will result in a total of 3 graphs. Use a 2×2 arrangement of graphs, using winds(2,2,7). Make the graphs in the order Buffalo, Syracuse, and Albany. Use the same y-axis scaling from (a) above, but now on both the y-axis and x-axis of each graph. Use labels like the y-axis labels you used in the last question. Use red plotting symbols for the seasons when the y-axis data was collected from the City. Use black plotting symbols for the other seasons. Draw a y=x reference line on each graph using the “grey75” color. Leave the last graph area blank.

  • Output: the graph (copy/paste this into the Word file for the R graphs; surround your answer with q6b).
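The plotting elements named in (q6a) and (q6b) can be sketched with base graphics as below. The data are invented, only two cities are drawn, par(mfrow=) stands in for the course’s winds function (which is not available here), and the plot is written to a temporary PDF so that the example runs without a screen device.

```r
toy <- data.frame(SnowSeasonLong = 1940:1945,
                  Snowfall.B = c(80, 95, 60, 110, 70, 120),
                  Snowfall.R = c(75, 90, 65, 100, 85, 105))

# Common axis limits from the overall minimum and maximum.
yLim <- range(unlist(toy[, c("Snowfall.B", "Snowfall.R")]), na.rm = TRUE)
switchSeason <- 1943   # assumption: switch season for the toy data only

plotFile <- tempfile(fileext = ".pdf")
pdf(plotFile)
par(mfrow = c(2, 2))   # stand-in for winds(2,2,7)

# Snowfall vs. Season, with reference lines.
plot(toy$SnowSeasonLong, toy$Snowfall.B, ylim = yLim,
     xlab = "Season", ylab = "Snowfall, Buffalo (in)")
abline(h = c(50, 100, 150), col = "grey75")
abline(v = switchSeason, col = "lightblue")

# City vs. city, red for City-station seasons, with a y = x line.
ptCol <- ifelse(toy$SnowSeasonLong < switchSeason, "red", "black")
plot(toy$Snowfall.R, toy$Snowfall.B, xlim = yLim, ylim = yLim, col = ptCol,
     xlab = "Snowfall, Rochester (in)", ylab = "Snowfall, Buffalo (in)")
abline(a = 0, b = 1, col = "grey75")

invisible(dev.off())
```

Unused panels in the 2×2 layout simply stay blank, which matches the request to leave the last graph area empty.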

 

Earlier in the course, we looked at political-contribution data for some names beginning with Church. In the next few problems, we are going to look at more political-contribution data.

 

The web page http://www.elections.ny.gov/ContributionSearchB_zip.html (accessed December 5, 2013) is a “Contribution Search Page” for campaign financial disclosure. It is provided by the New York State Board of Elections.

 

On this page, it is possible to gather a history of donations from a specific contributor’s last name.

 

The cleaned-up data are in the files whose names start with “contrib” and end with “2013a.txt”.

 

As you can see from looking at a particular file:

  1. The first 20 rows contain header information that can be skipped. (Row 20 contains variable names, but not in a useful format.)
  2. The contribution information starts on row 21. Each record exists on 3 lines:

a. Line 1: Contributor Full Name (CFullName);

b. Line 2: Contributor Address1 (CAddr1);

c. Line 3: Remaining information, starting with Contributor Address 2. Please use these as the remaining 10 variable names:
CAddr2 Amount CDate Recipient Filing
Sched Office Dist County Municipality
You should be able to figure out what many of these names mean; however, we will only use a subset of them.

3. You should be able to figure out from one of the files how these fields are delimited.

4. For purposes of this project, you may assume that the general structure of the first few records in any one of these files will be repeated for all records in all files.

5. Please put all six files in one directory. In R, use the object dirMine to refer to this directory, in the way we have been doing for the entire course.
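The three-lines-per-record layout can be regrouped as sketched below. All of the text in the toy lines is invented, and line 3 is left unsplit because the field delimiter is something to work out from the files themselves.

```r
# Invented stand-in for readLines() output: 20 header lines, then 2 records.
hdr  <- paste("header line", 1:20)
body <- c("BROWN, ANN", "1 MAIN ST", "<line 3 of record 1>",
          "BROWN, BOB", "2 ELM ST",  "<line 3 of record 2>")
lines <- c(hdr, body)

recLines <- lines[-(1:20)]               # skip the header block
stopifnot(length(recLines) %% 3 == 0)    # every record must have 3 lines

# Fill a 3-column matrix row by row: one row per record.
rec <- matrix(recLines, ncol = 3, byrow = TRUE)
partial <- data.frame(CFullName = rec[, 1],
                      CAddr1    = rec[, 2],
                      Line3     = rec[, 3],   # still to be split into 10 fields
                      stringsAsFactors = FALSE)
```

The Line3 column name is a placeholder; the real data frame would replace it with the 10 variables listed above.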

 

  7. (32pts)

a. (5pts) (q7a) Read in the data for the first file (data for “Brown”). Use the 12 variable names shown above, and keep the variable names in this order. Make all variables class character except for Amount (numeric) and CDate (class Date; convert the date field using the lubridate package, as we have been doing).

b. (5pts) (q7b) Using regular expressions, extract the two-letter state code from CAddr2 into the variable State (class character) and extract the 5-digit zip code from CAddr2 into the variable Zip (class integer). For this project, please assume that only records with a valid two-letter state code and a valid 5-digit zip code yield a correct value for State and for Zip. Also, please assume that such a valid pair occurs only as follows in CAddr2: one or more spaces, a two-letter code, one or more spaces, a five-digit code. Values that are not correct should be coded with the correct missing-value code. Also, extract the year from Filing into the variable YearF (class integer). Append these three variables to the previous data frame, and in this order (State, Zip, YearF).

  • Output: an str() of this data frame. Also, a head() of the data frame, but only the first 6 records, and only of the 4 variables CAddr2 State Zip YearF, and listed out in this order. Don’t forget to use the print function to ensure the output is printed.
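The pattern described in (q7b), that is, spaces, two letters, spaces, five digits, could be expressed with base R’s regexec and regmatches as in the sketch below. The addresses are invented, and the sketch treats any two uppercase letters as a state code, which is a simplifying assumption.

```r
# Made-up CAddr2 values: two valid, one with no pair, one non-US postal code.
addr <- c("ROCHESTER, NY 14623", "NEW YORK  NY  10001",
          "UNKNOWN", "TORONTO ON M5V2T6")

# One or more spaces, two letters, one or more spaces, five digits.
pat <- "[ ]+([A-Z]{2})[ ]+([0-9]{5})"
m <- regmatches(addr, regexec(pat, addr))

# Each match holds the full text plus the two groups; no match is empty.
pick <- function(x, i) if (length(x) == 3) x[i] else NA_character_
State <- sapply(m, pick, i = 2)
Zip   <- as.integer(sapply(m, pick, i = 3))
```

Storing Zip as an integer drops any leading zero, which does not affect the New York ranges used later but is worth keeping in mind.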

c. (4pts) (q7c) The five boroughs that make up New York City are sometimes considered different from the rest of New York. For this reason, create the variable Group (as the last variable in the data frame and as class character), whose values are determined as follows:

  • Zip Code in 10001-10163 → “Manhattan”
  • Zip Code in 10451-10475 → “Bronx”
  • Zip Code in 11201-11256 → “Brooklyn”
  • Zip Code in 10301-10314 → “Staten Island”
  • Zip Code in 11001-11020, 11351-11499, 11690-11697 → “Queens”
  • The rest of NY → “NY Other”
  • Other states → “Other”
  • Missing for State/Zip → missing value code

  • Output: an str() of this data frame. Also, a head() of the data frame, for the first 20 records but only for the 4 variables CAddr2 State Zip Group, and in this order.
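The Group rules above can be written as a small function, sketched here with an invented name (zipGroup) and invented test values. It starts from the most general label and overwrites it with the more specific ones.

```r
# Hypothetical helper: returns the Group label for each State/Zip pair.
zipGroup <- function(State, Zip) {
  grp <- rep(NA_character_, length(Zip))     # missing State/Zip stays NA
  ok  <- !is.na(State) & !is.na(Zip)
  ny  <- ok & State == "NY"
  inRange <- function(lo, hi) ny & Zip >= lo & Zip <= hi

  grp[ok & State != "NY"] <- "Other"
  grp[ny] <- "NY Other"
  grp[inRange(10001, 10163)] <- "Manhattan"
  grp[inRange(10451, 10475)] <- "Bronx"
  grp[inRange(11201, 11256)] <- "Brooklyn"
  grp[inRange(10301, 10314)] <- "Staten Island"
  grp[inRange(11001, 11020) | inRange(11351, 11499) |
      inRange(11690, 11697)] <- "Queens"
  grp
}

groups <- zipGroup(c("NY", "NY", "NY", "PA", NA, "NY"),
                   c(10001L, 14623L, 11355L, 19104L, NA, 10305L))
```

Guarding every test with the ok vector means that the logical indexes never contain NA.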

d. (5pts) (q7d) Now read in all 6 files with this information into a list of 6 data frames. Access the filenames directly from the operating system. Use the method we covered in class: write a function to read in one file from one directory, and perform the tasks shown in (a), (b), and (c); then use lapply or llply to create a list of such data frames. Name each of the 6 elements of the list as the name of the person. For example, the name of the first element of the list should be “Brown”. Do this naming using code, by extracting the names from the vector of filenames that you have.

e. (3pts) (q7e) Next, using the output from q7d, put all 6 data frames together into one data frame. Include the contributor’s last name as the variable CName (class character) in the data frame, for example “Brown” for the first group of contributors. Make this the last variable in the data frame. Use code to do this, based on the names contained within the file names (the names in the filenames, not the names in the files). Do not simply type in the names. Also, in one line of code, verify that there are no NA’s for the Amount variable.

  • Output: an str() of the data frame; and a verification that there are no NA’s for the Amount variable.
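The read-one-file function plus lapply pattern might look like the sketch below. Everything specific here is an assumption made for illustration: the files are created in a temporary folder, the file names are guessed to look like contribBrown2013a.txt, and the reader function (readOne) only counts lines instead of doing the work of (a), (b), and (c).

```r
# Assumption: a temporary folder with two tiny stand-in files.
dirMine <- paste0(tempfile("contribDemo"), "/")
dir.create(dirMine)
for (nm in c("Brown", "Smith")) {
  writeLines(c("header", "x"), paste0(dirMine, "contrib", nm, "2013a.txt"))
}

# Get the file names from the operating system.
files <- list.files(dirMine, pattern = "^contrib.*2013a\\.txt$")

# Hypothetical reader: a real one would build the full data frame.
readOne <- function(fileName, dir) {
  lines <- readLines(paste0(dir, fileName))
  data.frame(nLines = length(lines))
}
dfList <- lapply(files, readOne, dir = dirMine)

# Name each element from the part of the file name between the fixed pieces.
names(dfList) <- sub("^contrib(.*)2013a\\.txt$", "\\1", files)

# Stack the list, adding the list name as the last column.
allDF <- do.call(rbind, Map(function(d, nm) { d$CName <- nm; d },
                            dfList, names(dfList)))
```

A one-line check such as any(is.na(allDF$Amount)) would then cover the verification asked for in (q7e), once Amount exists in the real data frame.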

f. (5pts) (q7f) Make a table of the fraction (nearest 0.01) or percent (nearest 1%) of counts, with CName in rows (in alphabetical order), and Group in columns. Calculate the fractions (or percents) within each CName, so the values add up to 1 (or 100) for each row. (Note that the idea of a distribution only makes sense here within a name. If we had selected all of the names instead of some arbitrary ones, we could look at an overall distribution, although I’m not sure it would be of much value. Looking within names (that is, conditional on names) is much more likely to be informative.) For each Group, include the marginal fraction or percent across all names (so these marginal values should also add up to 1 or 100%). For full credit, sort the table so that Group is arranged in decreasing marginal percent; that is, the highest marginal-percent group should be first (left) and the smallest should be last (right).
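A minimal sketch of row fractions with a marginal row, sorted by decreasing marginal fraction. The data are invented and the row label All is an arbitrary choice.

```r
d <- data.frame(CName = c("Brown", "Brown", "Brown", "Smith", "Smith"),
                Group = c("Other", "Manhattan", "Other", "Other", "Bronx"),
                stringsAsFactors = FALSE)

tab <- table(d$CName, d$Group)                  # counts, names in rows
rowFrac <- unclass(prop.table(tab, margin = 1)) # fractions within each name
marg <- colSums(tab) / sum(tab)                 # fractions across all names

# Put the group with the largest marginal fraction first.
ord <- order(marg, decreasing = TRUE)
out <- round(rbind(rowFrac[, ord], All = marg[ord]), 2)
print(out)
```

After rounding, a row may add up to 0.99 or 1.01 instead of exactly 1.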

g. (5pts) (q7g) Create a list of the following values for Amount contributed from each CName: minimum, 1st quartile, median, 3rd quartile, and maximum. Display this in a tabular format (a data frame is fine, and it’s OK to keep row names), with CName in the first column, and the other values in the remaining 5 columns in the order listed above (minimum, …, maximum). Round all values to the nearest dollar in the output listing.

  • Output: the listing.
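The five summary values are what quantile() returns by default, so one way to build the listing is sketched below with invented amounts. The column names Min, Q1, Median, Q3, and Max are chosen here for illustration.

```r
d <- data.frame(CName = c("Brown", "Brown", "Brown", "Smith", "Smith"),
                Amount = c(10, 20, 30, 100, 300),
                stringsAsFactors = FALSE)

# quantile() defaults to probabilities 0, 0.25, 0.5, 0.75, and 1.
q <- t(sapply(split(d$Amount, d$CName), quantile))

qTab <- data.frame(CName = rownames(q), round(q),
                   row.names = NULL, check.names = FALSE,
                   stringsAsFactors = FALSE)
names(qTab)[-1] <- c("Min", "Q1", "Median", "Q3", "Max")
print(qTab)
```

split() orders the groups alphabetically, so the rows come out sorted by CName.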
