Search Results for “boxplot” – R-bloggers

The perfect t-test


(This article was first published on Daniel Lakens, and kindly contributed to R-bloggers)
I've created an easy-to-use R script that imports your data, then performs and writes up a state-of-the-art dependent or independent t-test. The goal of this script is to examine whether more researcher-centered statistical tools (i.e., a one-click analysis script that checks normality assumptions, calculates effect sizes and their confidence intervals, creates good figures, calculates Bayesian and robust statistics, and writes the results section) increase the use of novel statistical procedures. Download the script here: https://github.com/Lakens/Perfect-t-test. For comments, suggestions, or errors, e-mail me at D.Lakens@tue.nl. The script will likely be updated - check back for updates or follow me @Lakens to be notified of updates.


Correctly comparing two groups is remarkably challenging. When performing a t-test researchers rarely manage to follow all recommendations that statisticians have made over the years. Where statisticians update their recommendations, statistical textbooks often do not. Even though reporting effect sizes and their confidence intervals has been recommended for decades (e.g., Cohen, 1990), statistical software (e.g., SPSS 22) often does not provide these statistics. Progress is slow, and Sharpe (2013) points to a lack of awareness, a lack of time, a lack of easily usable software, and a lack of education as some of the main reasons for the resistance to adopting statistical innovations.

Here, I propose a way to speed up the widespread adoption of the state-of-the-art statistical techniques by providing researchers with an easy to use script in free statistical software (R) that will perform and report all statistical analyses, practically with a single button press. The script (Lakens, 2015, available at https://github.com/Lakens/Perfect-t-test) follows state-of-the-art recommendations (see below), creates plots of the data, and writes the results section, including a minimally required interpretation of the statistical results.

Automated analyses might strike readers as a bad idea because they facilitate mindless statistics. Having performed statistics mindlessly for most of my professional career, I sincerely doubt access to this script would have reduced my level of understanding. If anything, reading an automatically generated results section of your own data that includes statistics you are not accustomed to calculating or reporting is likely to make you think more about the usefulness of these statistics, not less. However, the goal of this script is not to educate people. The main goal is to get researchers to perform and report the analyses they should, and to make this as efficient as possible.

Comparing two groups


Keselman, Othman, Wilcox, and Fradette (2004) proposed a more robust two-sample t-test that provides better Type 1 error control in situations of variance heterogeneity and nonnormality, but their recommendations have not been widely implemented. Researchers might in general be unsure whether it is necessary to change the statistical tests they use to analyze and report comparisons between groups. As Wilcox, Granger, and Clark (2013, p. 29) remark: “All indications are that generally, the safest way of knowing whether a more modern method makes a practical difference is to actually try it.” Making sure conclusions based on multiple statistical approaches converge is an excellent way to gain confidence in your statistical inferences. This R script calculates traditional Frequentist statistics, Bayesian statistics, and robust statistics, using both a hypothesis testing and an estimation approach, to invite researchers to examine their data from different perspectives.

Since Frequentist and Bayesian t-tests are based on assumptions of equal variances and normally distributed data, the R script provides boxplots and histograms with kernel density plots overlaid with a normal distribution curve to check for outliers and normality. Kernel density plots are a non-parametric technique to visualize the distribution of a continuous variable. They are similar to a histogram, but less dependent on the specific choice of bins used when creating a histogram. The graphs plot both the normal distribution and the kernel density function, making it easier to visually check whether the data are normally distributed or not. Q-Q plots are provided as an additional check for normality.
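As a rough illustration (not the script's own code), the same checks can be sketched in base R for a single hypothetical sample:

set.seed(1)
x <- rnorm(50, mean = 5, sd = 1.2)   # hypothetical sample
m <- mean(x); s <- sd(x)

# Histogram with a kernel density estimate and a normal curve based on the
# sample mean and standard deviation
hist(x, freq = FALSE, main = "Histogram, kernel density, and normal curve")
lines(density(x), lwd = 2)                              # kernel density estimate
curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)  # fitted normal curve

# Q-Q plot as an additional check for normality
qqnorm(x); qqline(x)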

Yap and Sim (2011) show that no single test for normality will perform optimally for all possible distributions. They conclude (p. 2153): “If the distribution is symmetric with low kurtosis values (i.e. symmetric short-tailed distribution), then the D'Agostino-Pearson and Shapiro-Wilkes tests have good power. For symmetric distribution with high sample kurtosis (symmetric long-tailed), the researcher can use the JB, Shapiro-Wilkes, or Anderson-Darling test.” All four normality tests are provided in the R script. Levene’s test for the equality of variances is provided, although for independent t-tests, Welch’s t-test (which does not require equal variances) is provided by default, following recommendations by Ruxton (2006). A short explanation accompanies all plots and assumption checks to help researchers interpret the results.
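For orientation, here is a hedged sketch of what such checks look like in plain R; the script itself relies on the PoweR package for the four normality tests, so this is not its exact code:

set.seed(2)
dat <- data.frame(group = factor(rep(c("A", "B"), each = 30)),
                  y     = c(rnorm(30, 5, 1), rnorm(30, 5.5, 1.5)))  # hypothetical data

# Shapiro-Wilk normality test per group
shapiro.test(dat$y[dat$group == "A"])
shapiro.test(dat$y[dat$group == "B"])

# Levene's test for equality of variances (car package)
car::leveneTest(y ~ group, data = dat)

# Welch's t-test, which does not assume equal variances (R's default)
t.test(y ~ group, data = dat, var.equal = FALSE)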

The script also creates graphs that, for example, visualize the distribution of the data points, and provide both within-subjects and between-subjects confidence intervals:




The script provides interpretations for effect sizes based on the classifications ‘small’, ‘medium’, and ‘large’. Default interpretations of the size of an effect based on these three categories should only be used as a last resort; it is preferable to interpret the size of the effect in relation to other effects in the literature, or in terms of its practical significance. However, since researchers often do not interpret effect sizes (if they are reported to begin with), the default interpretation (and the suggestion to interpret effect sizes in relation to other effects in the literature) should at least function as a reminder that researchers are expected to interpret effect sizes. The common language effect size (McGraw & Wong, 1992) is provided as an additional way to communicate the effect size.
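For two independent groups, the common language effect size is the probability that a randomly drawn observation from one group exceeds a randomly drawn observation from the other. A minimal sketch (the script's own implementation may differ):

# Common language effect size (McGraw & Wong, 1992) for two independent groups
cl_effect_size <- function(x, y) {
  pnorm(abs(mean(x) - mean(y)) / sqrt(var(x) + var(y)))
}

set.seed(3)
cl_effect_size(rnorm(30, 5, 1), rnorm(30, 5.5, 1))  # hypothetical groups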

Similarly, the Bayes factor is classified into anecdotal, moderate, strong, very strong, and decisive evidence for the alternative or null hypothesis, following Jeffreys (1961), even though researchers are reminded that default interpretations of the strength of the evidence should not distract from the fact that strength of evidence is a continuous function of the Bayes factor. We can expect researchers to rely less on default interpretations as they become more acquainted with these statistics, but for novices some help in interpreting effect sizes and Bayes factors will guide their interpretation.
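A default Bayes factor of this kind can be obtained with the BayesFactor package; a minimal sketch with simulated data (the rscale argument is the prior scale that the script lets you change):

library(BayesFactor)

set.seed(4)
x <- rnorm(30, 5, 1)
y <- rnorm(30, 5.6, 1)   # hypothetical groups

# Default Bayes factor for an independent t-test; rscale = 0.707 is the
# "medium" Cauchy prior scale
bf <- ttestBF(x = x, y = y, rscale = 0.707)
bf   # BF10: evidence for the alternative over the null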

Running the Markdown script


R Markdown scripts provide a way to create fully reproducible reports from data files. The script combines the commands to perform all statistical analyses with the written sections of the final output. Calculated statistics and graphs are inserted into the written report at specified locations. After installing the required packages, preparing the data, and specifying some variables in the Markdown document, the report can be generated (and thus, the analysis procedure can be performed) with a single mouse-click (scroll down for an example of the output).
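Outside RStudio (where the Knit button does this for you), rendering such a report comes down to a single call; the file name below is a placeholder, not the script's actual name:

# install.packages("rmarkdown")
rmarkdown::render("the-perfect-t-test.Rmd", output_format = "word_document")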

The R Markdown script and the ReadMe file contain detailed instructions on how to run the script and how to install the required packages, including the PoweR package (Micheaux & Tran, 2014) to perform the normality tests, HLMdiag (Loy & Hofmann, 2014) to create the Q-Q plots, ggplot2 (Wickham, 2009) for all plots, car (Fox & Weisberg, 2011) to perform Levene's test, MBESS (Kelley, 2007) to calculate effect sizes and their confidence intervals, WRS for the robust statistics (Wilcox & Schönbrodt, 2015), BootES (Kirby & Gerlanc, 2013) to calculate a robust effect size for the independent t-test, BayesFactor (Morey & Rouder, 2015) for the Bayes factor, and BEST (Kruschke & Meredith, 2014) to calculate the Bayesian highest density interval.
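For the CRAN packages this roughly amounts to the following (package names as I understand them; the ReadMe is the authoritative source, and WRS is installed from the GitHub repository listed in the references rather than from CRAN):

install.packages(c("PoweR", "HLMdiag", "ggplot2", "car",
                   "MBESS", "bootES", "BayesFactor", "BEST"))
# WRS is not on CRAN; see https://github.com/nicebread/WRS for instructions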

The data file (which should be stored in the same folder as the R Markdown script) needs to be tab-delimited with a header row at the top. Such a file can easily be created from SPSS by saving the data through the 'save as' menu and selecting 'save as type: Tab delimited (*.dat)', or in Excel by saving the data as 'Text (Tab delimited) (*.txt)'. For the independent t-test the data file needs to contain at least two columns (one specifying the independent variable and one specifying the dependent variable); for the dependent t-test the data file needs to contain three columns: one subject identifier column and two columns for the two dependent variables. The script for dependent t-tests allows you to select a subgroup for the analysis, as long as the data file contains an additional grouping variable (see the demo data). The data files can contain irrelevant data, which will be ignored by the script. Researchers need to specify the names (or headers) of the independent and dependent variables, as well as grouping variables. Finally, there are some default settings researchers can change, such as the sidedness of the test, the alpha level, the percentage for the confidence intervals, and the scalar on the prior for the Bayes factor.
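In R terms, the expected input is simply something that read.delim() can handle; the file and column names below are placeholders:

dat <- read.delim("independent_data.txt")   # tab-delimited, header row assumed
str(dat)  # e.g. one column for the grouping variable and one for the dependent variable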

The script can be used to create either a Word document or an HTML document. Researchers can easily interpret all the assumption checks, look at the data for possible outliers, and (after minor adaptations) copy-paste the results section into their article.

The statistical results the script generates have been compared against the results provided by SPSS, JASP, ESCI, online Bayes factor calculators, and BEST online. Minor variations in the HDI calculation between BEST online and this script are possible depending on the burn-in samples and number of samples, and for huge t-values there are minor variations between JASP and the latest version of the BayesFactor package used in this script. This program is distributed in the hope that it will be useful, but without any warranty. If you find an error, please contact me at D.Lakens@tue.nl.

Promoting Statistical Innovations


Statistical software is built around individual statistical tests, while researchers perform a set of procedures. Although it is not possible to create standardized procedures for all statistical analyses, most, if not all, of the steps researchers have to go through when they want to report correlations, regression analyses, ANOVAs, and meta-analyses are sufficiently structured. These tests make up a large portion of the analyses reported in journal articles. Demonstrating this, David Kenny has created R scripts that will perform and report mediation and moderator analyses. Felix Schönbrodt has created a Shiny app that performs several meta-analytic techniques. Making statistical innovations more accessible has a high potential to substantially improve the quality of the statistical tests researchers perform and report. Statisticians who take the application of generated knowledge seriously should try to experiment with the best way to get researchers to use state-of-the-art techniques. R Markdown scripts are an excellent method to combine statistical analyses and a written report in free software. Shiny apps might make these analyses even more accessible, because they no longer require users to install R and R packages.

Despite the name of this script, there is probably no such thing as a ‘perfect’ report of a statistical test. Researchers might prefer to report standard errors instead of standard deviations, perform additional checks for normality, prefer different Bayesian or robust statistics, or change the figures. The benefit of Markdown scripts with a GNU license stored on GitHub is that they can be forked (copied to a new repository), after which researchers are free to remove, add, or change sections of the script to create their own ideal test. After some time, a number of such scripts may be created, allowing researchers to choose an analysis procedure that most closely matches their desires. Alternatively, researchers can post feature requests or errors that can be incorporated in future versions of this script.

It is important that researchers attempt to draw the best possible statistical inferences from their data. As a science, we need to seriously consider the most efficient way to accomplish this. Time is scarce, and scientists need to master many skills in addition to statistics. I believe that some of the problems in adopting new statistical procedures discussed by Sharpe (2013) such as lack of time, lack of awareness, lack of education, and lack of easy to use software can be overcome by scripts that combine traditional and more novel statistics, are easy to use, and provide a brief explanation of what is calculated while linking to the relevant literature. This approach might be a small step towards a better understanding of statistics for individual researchers, but a large step towards better reporting practices.




References

Baguley, T. (2012). Calculating and graphing within-subject confidence intervals for ANOVA. Behavior research methods, 44, 158-175.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Fox, J. & Weisberg, S. (2011). An R Companion to Applied Regression, Second edition. Sage, Thousand Oaks CA.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press, Clarendon Press.
Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20, 1-24.
Kirby, K. N., & Gerlanc, D. (2013). BootES: An R package for bootstrap confidence intervals on effect sizes. Behavior Research Methods, 45, 905-927.
Kruschke, J. K., & Meredith, M. (2014). BEST: Bayesian Estimation Supersedes the t-test. R package version 0.2.2, URL: http://CRAN.R-project.org/package=BEST.
Lakens, D. (2015). The perfect t-test (version 0.1.0). Retrieved from https://github.com/Lakens/perfect-t-test. doi:10.5281/zenodo.17603
Loy, A., & Hofmann, H. (2014). HLMdiag: A suite of diagnostics for hierarchical linear models in R. Journal of Statistical Software, 56(5), 1-28. URL: http://www.jstatsoft.org/v56/i05/.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361-365.
Micheaux, P., & Tran, V. (2012). PoweR. URL: http://www.biostatisticien.eu/PoweR/.
Morey, R. D., & Rouder, J. N. (2015). BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.11-1. URL: http://CRAN.R-project.org/package=BayesFactor.
Sharpe, D. (2013). Why the resistance to statistical innovations? Bridging the communication gap. Psychological Methods, 18, 572-582.
Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer New York. ISBN 978-0-387-98140-6, URL: http://had.co.nz/ggplot2/book.
Wilcox, R. R., Granger, D. A., & Clark, F. (2013). Modern robust statistical methods: Basics with illustrations using psychobiological data. Universal Journal of Psychology, 1, 21-31.
Wilcox, R. R., & Schönbrodt, F. D. (2015). The WRS package for robust statistics in R (version 0.27.5). URL: https://github.com/nicebread/WRS.

Yap, B. W., & Sim, C. H. (2011). Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141-2155.


Getting started with PostgreSQL in R


(This article was first published on Data Shenanigans » R, and kindly contributed to R-bloggers)

When dealing with large datasets that potentially exceed the memory of your machine, it is nice to have another option such as your own server with an SQL/PostgreSQL database on it, where you can query the data in smaller, digestible chunks. For example, recently I was facing a financial dataset of 5 GB. Although 5 GB fits into my RAM, loading the data uses a lot of resources. One solution is to use an SQL-based database, where I can query data in smaller chunks, leaving resources for the computation.

While MySQL is more widely used, PostgreSQL has the advantage of being open source and free for all uses. However, we still need to get a server. One possible way is to rent an Amazon server; however, as I don’t have a budget for my projects and because I only need the data on my own machine, I wanted to set up a server on my Windows 8.1 machine. This is how I did it.

Installing software, Starting the Server and Setting up new Users

First, we need to install the necessary software. Besides R and RStudio we need to install PostgreSQL, which we can find here. During installation we are asked to provide a password; remember it, as we will need it later. Say for this example we set the password to “DataScienceRocks”.

Now we can already access and use the database, for example via the interface (pgAdmin III) that was automatically installed with PostgreSQL. To connect to the database, double-click on the server in pgAdmin III and type in your password. The server should already be running after the installation. If this is not the case (i.e. you get the error “Server doesn’t listen” when trying to connect to the server with pgAdmin III), you can start the server with the following command on the command line:

pg_ctl -D "C:Program FilesPostgreSQL9.4data" start

As we can see, we only have one user (“postgres”). It is good practice to use the database with another user that does not have the createrole privilege (think of it as a non-admin user).

To set up a new user I follow this explanation. Start the command line (go to the start menu and type “cmd”) and move to the folder where you installed PostgreSQL (more precisely, its bin folder). In my case I navigated to the folder by typing:

cd "C:\Program Files\PostgreSQL\9.4\bin"

Now we need to create a new user (openpg), which we can do by executing the following command:

createuser.exe --createdb --username postgres --no-createrole --pwprompt openpg

We have to enter the password for the new user twice (note that there is no feedback from the command line); for this example I set it to “new_user_password”. Lastly, we are asked for the password of the main user (“postgres”), which in this case is “DataScienceRocks”, as specified during the installation.

We can check if we have two users by going into pgAdmin III, which should look like this:

[Figure: Users in pgAdmin III]

Creating a Table in pgAdmin III

An easy way to create a table is to use pgAdmin III. Right-click on “Tables” and choose “New Table”.

[Figure: Tables in pgAdmin III]

For this example we create a table called cartable that we will later populate with the mtcars dataset. For this we need to specify the columns and their types as shown in the next picture.

[Figure: pgAdmin III table columns]

Lastly, we need to specify a primary key under constraints. In this case I use the carname column as the key.

Connection with R

Now it is time to connect to the database with R. We use the RPostgreSQL package and follow this approach.

To connect, we need to enter the following commands in R:

# install.packages("RPostgreSQL")
require("RPostgreSQL")

# create a connection
# save the password so that we can "hide" it as best as we can by collapsing it
pw <- {
  "new_user_password"
}

# loads the PostgreSQL driver
drv <- dbDriver("PostgreSQL")
# creates a connection to the postgres database
# note that "con" will be used later in each connection to the database
con <- dbConnect(drv, dbname = "postgres",
                 host = "localhost", port = 5432,
                 user = "openpg", password = pw)
rm(pw) # removes the password

# check for the cartable
dbExistsTable(con, "cartable")
# TRUE

If we don’t get an error, that means we are connected to the database.

Write and Load Data with RPostgreSQL

The following code shows how we can write data to and read data from the database:

# creates df, a data.frame with the necessary columns
data(mtcars)
df <- data.frame(carname = rownames(mtcars), 
                 mtcars, 
                 row.names = NULL)
df$carname <- as.character(df$carname)
rm(mtcars)

# writes df to the PostgreSQL database "postgres", table "cartable" 
dbWriteTable(con, "cartable", 
             value = df, append = TRUE, row.names = FALSE)

# query the data from postgreSQL 
df_postgres <- dbGetQuery(con, "SELECT * from cartable")

# compares the two data.frames
identical(df, df_postgres)
# TRUE

# Basic Graph of the Data
require(ggplot2)
ggplot(df_postgres, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) + 
  geom_boxplot() + theme_bw()
[Figure: Visualization of the data (boxplot of mpg by cyl)]
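Since the whole point of the database is avoiding loading everything at once, here is a minimal sketch of reading the table back in chunks (the chunk size and the per-chunk work are placeholders):

# fetch the table in chunks instead of pulling everything with one query
res <- dbSendQuery(con, "SELECT * FROM cartable")
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 10)   # 10 rows at a time; choose a size that fits in memory
  print(dim(chunk))             # ... do the actual per-chunk computation here
}
dbClearResult(res)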

Lastly, if we are finished, we have to disconnect from the server:

# close the connection
dbDisconnect(con)
dbUnloadDriver(drv)

Outro

If you have any questions about the code, PostgreSQL or pgAdmin III or if you have remarks or have found a way to do it better/faster feel free to leave a comment or write me an email.

Useful links:

Get the PostgreSQL software here:
http://www.postgresql.org/download/windows/

PostgreSQL commandline commands: http://www.postgresql.org/docs/9.4/static/app-pg-ctl.html

Create a new User: https://doc.odoo.com/install/windows/postgres/

For a short introduction to postgreSQL queries have a look at this: http://www.postgresql.org/docs/8.4/static/tutorial-select.html

Appendix

If you want to create a table in R instead of pgAdmin III, you can of course do that as well. The following creates the same table as the one we created earlier in pgAdmin III:

# specifies the details of the table
sql_command <- "CREATE TABLE cartable
(
  carname character varying NOT NULL,
  mpg numeric(3,1),
  cyl numeric(1,0),
  disp numeric(4,1),  
  hp numeric(3,0),
  drat numeric(3,2),
  wt numeric(4,3),
  qsec numeric(4,2),
  vs numeric(1,0),
  am numeric(1,0),
  gear numeric(1,0),
  carb numeric(1,0),
  CONSTRAINT cartable_pkey PRIMARY KEY (carname)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE cartable
  OWNER TO openpg;
COMMENT ON COLUMN cartable.disp IS '
';"
# sends the command and creates the table
dbGetQuery(con, sql_command)

 



Course on using Oracle R Enterprise


(This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers)

From June 08 up to June 12, BNOSAC will be giving a 5-day crash course on the use of R with Oracle R Enterprise. The course is given together with our Oracle partner in Leuven, Belgium. If you are interested in attending, contact us for further details.

For R users who aren't aware of this yet: Oracle has embedded R into its database, which allows R users to transparently run R code inside the database - yes, really transparently. Oracle R Enterprise is part of the Oracle Advanced Analytics stack, which basically consists of the following elements for R users:

  1. ROracle: the supported native DBI driver to connect R with Oracle, which is open source and available on CRAN (link); a minimal connection sketch follows this list.
  2. Oracle R Enterprise (ORE): an Oracle-released version of R which is kept up to date with the current version of R and is supported by Oracle, plus a number of R packages, available for download at the ORE website, which embed R into Oracle.
  3. Oracle Data Mining (ODM): a set of distributed data mining algorithms accessible from R
  4. Oracle Advanced Analytics for Hadoop (ORAAH) : a set of R packages which allow R users to connect with Hadoop and run data mining models and map reduce jobs on top of Hadoop. (link)
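As mentioned in item 1 above, the plain ROracle route looks roughly like this (credentials and connect string are placeholders; the ORE layer itself is connected with ore.connect instead):

library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "scott", password = "tiger",
                 dbname = "//dbhost:1521/orcl")   # placeholder EZConnect string
dbGetQuery(con, "SELECT count(*) FROM user_tables")
dbDisconnect(con)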

During the 5-day course, you will learn how to use R alongside the Oracle DB. The course covers some base R, advanced R usage and functionality from the Oracle R Enterprise (ORE) suite of packages.

Module 1: Introduction to base R

What is R, packages available (CRAN, R-Forge, ...), R documentation search, finding help, RStudio editor, syntax, Data types (numeric/character/factor/logicals/NA/Dates/Times), Data structures (vector/data.frame/matrix/lists and standard operations on these), Saving (RData) & importing data from flat files, csv, Excel, Oracle, MS SQL Server, SAS, SPSS, Creating functions, data manipulation (subsetting, adding variables, ifelse, control flow, recoding, rbind, cbind) and aggregating and reshaping, Plotting in R using base and lattice functionality (dot plots, barcharts, graphical parameters, legends, devices), Basic statistics in R (mean, variance, crosstabs, quantile, correlation, distributions, densities, histograms, boxplot, t-tests, wilcoxon test, non-parametric tests)

Module 2: Advanced R programming & data manipulation with base R

vectorisation, writing your own functions, control flow, aggregating and data.table - fast group by, joining and data.table programming tricks, reshaping from wide to long format, miscellaneous useful functions, apply family of functions & split-apply-combine strategy, do.call, parallel execution of code, handling of errors and exceptions, debugging code, other goodies: basic regular expressions, data manipulations, rolling data handling, S3 classes, generics and basic S4 methodology

Module 3: ROracle and Oracle R Enterprise (ORE) - transparency layer

•    ROracle - getting and sending SQL queries from Oracle
•    Installing Oracle R Enterprise (ORE)
•    Basic database connectivity: ore.exec, ore.ls, ore.synch, ore.push, ore.pull, ore.create, ore.drop, ore.get
•    ORE data types: ore.character, ore.factor, ore.logical, ore.number, ore.datetime, ore.numeric. Conversion between data types
•    ORE data structures: ore.matrix, ore.frame, ore.vector
•    ORE transparency layer data operations on ore.frame/ore.vector (subset, ncol, nrow, head, ifelse, paste, is.na, sd, mean, tapply, by, c, %in%, ...) and indexing and overwriting in-database ore.vectors
•    Save R objects in Oracle ore.save, ore.load, ore.datastore and ORE data store handling
•    Basic statistics with ORE (ore.univariate, ore.summary, ore.crosstab, ore.corr, exponential smoothing, t.test, wilcoxon, IQR)

Module 4: Oracle R Enterprise - advanced data manipulation

•    Running R functions parallel inside the database: ore.doEval, ore.groupApply, ore.indexApply, ore.rowApply, ore.tableApply
•    Creating R scripts inside the database and accessing ORE stored procedures
•    Embedding R scripts in production database applications
•    Embedded (parallel) R execution within ORE using the R Interface as well as the SQL Interface

Module 5: Data mining models inside Oracle R Enterprise (ORE) and Oracle Data Mining (ODM)

In this session you will become acquainted with some of the most common data mining methods and learn how to use these algorithms in ORE. The following algorithms will be covered.
•    principal component analysis and factor analysis
•    kmeans clustering and orthogonal partitioning
•    data reduction using Minimum Description Length attribute importance
•    linear models and generalized linear models
•    naive bayes, neural networks, decision tree and support vector machines
•    market basket analysis / recommendation engines (apriori)
•    bagging
 

If you are interested in attending, contact us for further details.


 


Vega.jl Rebooted – Now with 100% More Pie and Donut Charts!


(This article was first published on randyzwitch.com » R, and kindly contributed to R-bloggers)

[Figure: pie and donut charts rendered with Vega.jl]
Mmmmm, chartjunk!

Rebooting Vega.jl

Recently, I’ve found myself without a project to hack on, and I’ve always been interested in learning more about browser-based visualization. So I decided to revive the work that John Myles White had done in building Vega.jl nearly two years ago. And since I’ll be giving an analytics & visualization workshop at JuliaCon 2015, I figure I better study the topic in a bit more depth.

Back In Working Order!

The first thing I tackled here was to upgrade the syntax to target v0.4 of Julia. This is just my developer preference, to avoid using Compat.jl when there are so many more visualizations I’d like to support. So if you’re using v0.4, you shouldn’t see any deprecation errors; if you’re using v0.3, well, eventually you’ll use v0.4!

Additionally, I modified the package to recognize the traction that Jupyter Notebook has gained in the community. Whereas the original version of Vega.jl only displayed output in a tab in a browser, I’ve overloaded the writemime method to display :VegaVisualization inline for any environment that can display HTML. If you use Vega.jl from the REPL, you’ll still get the same default browser-opening behavior as existed before.

The First Visualization You Added Was A Pie Chart…

…And Followed With a Donut Chart?

Yup. I’m a troll like that. Besides, being loudly against pie charts is blowhardy (even if studies have shown that people are too stupid to evaluate them).

Adding these two charts (besides trolling) was a proof-of-concept that I understood the codebase sufficiently in order to extend the package. Now that the syntax is working for Julia v0.4, I understand how the package works (important!), and have improved the workflow by supporting Jupyter Notebook, I plan to create all of the visualizations featured in the Trifacta Vega Editor and other standard visualizations such as boxplots. If the community has requests for the order of implementation, I’ll try and accommodate them. Just add a feature request on Vega.jl GitHub issues.

Why Not Gadfly? You’re Not Starting A Language War, Are You?

No, I’m not that big of a troll. Besides, I don’t think we’ve squeezed all the juice (blood?!) out of the R vs. Python infographic yet, we don’t need another pointless debate.

My sole reason for not improving Gadfly is just that I plain don’t understand how the codebase works! There are many amazing computer scientists & developers in the Julia community, and I’m not really one of them. I do, however, understand how to generate JSON strings and in that sense, Vega is the perfect platform for me to contribute.

Collaborators Wanted!

If you’re interested in visualization, as well as learning Julia and/or contributing to a package, Vega.jl might be a good place to start. I’m always up for collaborating with people, and creating new visualizations isn’t that difficult (especially with the Trifacta examples). So hopefully some of you will be interested enough to join me in adding one more great visualization library to the Julia community.


Upcoming Tutorial: Analyzing US Census Data in R


(This article was first published on » R, and kindly contributed to R-bloggers)

Today I am pleased to announce that on May 21 I will run a tutorial titled Analyzing US Census Data in R. While I have spoken at conferences before, this is my first time running a tutorial. My hope is that everyone who participates will learn something interesting about the demographics of the state, county and ZIP code that they are from. Along the way, I hope that people become comfortable doing exploratory data analysis with maps, learn a bit about geography and leave with a better understanding of how US Census data works. Here is the description:

In this tutorial Ari Lamstein will explain how to use R to explore the demographics of US States, Counties and ZIP Codes.  Each person will analyze their home state, county and ZIP code. There will be an emphasis on sharing results with each other. We will use boxplots and choropleth maps to visualize the data.

Time permitting we will also explore historic demographic data, learn more about the data itself, and how to use the Census Bureau’s API.
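The tutorial description does not name specific packages, so purely as an illustration of the kind of output involved (and assuming Ari Lamstein's choroplethr package, which ships a small state-level demo dataset), a state choropleth plus a boxplot of the same values might look like this:

library(choroplethr)
data(df_pop_state)                              # 2012 state population estimates
state_choropleth(df_pop_state, title = "US state population")
boxplot(df_pop_state$value / 1e6, ylab = "Population (millions)")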

Attendance is free. If you are interested and can make it to the event, then I hope to see you there!

Note that the current draft of my slides is available on github. After the talk I plan to publish my slides on slideshare, where I have placed slides from my previous talks.


PERFORMANCE: Calling R_CheckUserInterrupt() every 256 iterations is actually faster than every 1,000,000 iterations


(This article was first published on jottR, and kindly contributed to R-bloggers)

If your native code takes more than a few seconds to finish, it is a nice courtesy to the user to check for user interrupts (Ctrl-C) once in a while, say, every 1,000 or 1,000,000 iterations. The C-level API of R provides R_CheckUserInterrupt() for this (see 'Writing R Extensions' for more information on this function). Here's what the code would typically look like:

for (int ii = 0; ii < n; ii++) {
  /* Some computationally expensive code */
  if (ii % 1000 == 0) R_CheckUserInterrupt();
}

This uses the modulo operator % and tests whether the result is zero, which happens every 1,000 iterations. When that occurs, it calls R_CheckUserInterrupt(), which will interrupt the processing and “return to R” whenever an interrupt is detected.

Interestingly, it turns out that it is significantly faster to do this check every k = 2^m iterations, e.g. instead of doing it every 1,000 iterations, it is faster to do it every 1,024 iterations. Similarly, instead of, say, doing it every 1,000,000 iterations, do it every 1,048,576 - not one less (1,048,575) or one more (1,048,577). The difference is so large that it is even 2-3 times faster to call R_CheckUserInterrupt() every 256 iterations rather than, say, every 1,000,000 iterations, which at least to me was a bit counter-intuitive the first time I observed it.

Below are some benchmark statistics supporting the claim that testing / calculating ii % k == 0 is faster for k = 2^m (blue) than for other choices of k (red).

Note that the times are on the log scale (the results are also tabulated at the end of this post). Now, will it make a big difference to the overall performance of your code if you choose, say, 1,048,576 instead of 1,000,000? Probably not, but on the other hand, it does not hurt to pick an interval that is a 2^m integer. This observation may also be useful in algorithms that make heavy use of the modulo operator.

So why is ii % k == 0 a faster test when k = 2^m? I can only speculate. For instance, the integer 2^m is a binary number with all bits but one set to zero. It might be that this is faster to test for than other bit patterns, but I don't know if this is because of how the native code is optimized by the compiler and/or if it goes down to the hardware/CPU level. I'd be interested in feedback and to hear your thoughts on this.
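One plausible explanation (my reading, not the author's claim) is that for k = 2^m the compiler can replace the modulo with a single bitwise AND against k - 1. The equivalence is easy to verify from R itself:

k  <- 1024L                    # 2^10
ii <- 0:100000L
# ii %% k == 0 is equivalent to testing the low bits with a bitwise AND
identical(ii %% k == 0L, bitwAnd(ii, k - 1L) == 0L)
## TRUE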

Details on how the benchmarking was done

I used the inline package to generate a set of C-level functions with varying interrupt intervals k. I'm not passing k as a parameter to these functions. Instead, I use it as a constant value so that the compiler can optimize as far as possible, but also in order to imitate how most code is written. This is why I generate multiple C functions. I benchmarked across a wide range of interval choices using the microbenchmark package. The C functions (with corresponding R functions calling them) and the corresponding benchmark expressions to be called were generated as follows:

## The interrupt intervals to benchmark
## (a) Classical values
ks <- c(1, 10, 100, 1000, 10e3, 100e3, 1e6)
## (b) 2^m values and the ones before and after
ms <- c(2, 5, 8, 10, 16, 20)
as <- c(-1, 0, +1) + rep(2^ms, each = 3)

## List of unevaluated expressions to benchmark
mbexpr <- list()

for (k in sort(c(ks, as))) {
  name <- sprintf("every_%d", k)

  ## The C function
  assign(name, inline::cfunction(c(length = "integer"), body = sprintf("
    int i, n = asInteger(length);
    for (i = 0; i < n; i++) {
      if (i %% %d == 0) R_CheckUserInterrupt();
    }
    return ScalarInteger(n);
  ", k)))

  ## The corresponding expression to benchmark
  mbexpr <- c(mbexpr, substitute(every(n), list(every = as.symbol(name))))
}

The actual benchmarking of the 25 cases was then done by calling:

n <- 10e6  ## Number of iterations
stats <- microbenchmark::microbenchmark(list=mbexpr)
expr                    min      lq    mean  median      uq     max
every_1(n)           174.05  178.77  184.68  180.76  183.97  262.69
every_3(n)            66.78   69.16   72.10   70.20   72.42  114.75
every_4(n)            53.80   55.31   56.98   56.32   57.26   69.71
every_5(n)            46.17   47.52   49.42   48.83   49.99   66.98
every_10(n)           33.31   34.32   36.58   35.12   36.66   54.83
every_31(n)           23.78   24.45   25.74   25.10   25.83   58.10
every_32(n)           17.81   18.25   18.91   18.82   19.22   25.25
every_33(n)           22.90   23.58   24.90   24.59   25.26   34.45
every_100(n)          18.14   18.55   19.47   19.15   19.63   27.42
every_255(n)          19.96   20.56   21.67   21.16   21.98   42.53
every_256(n)           7.07    7.18    7.54    7.40    7.63   10.73
every_257(n)          19.32   19.72   20.60   20.36   20.85   29.66
every_1000(n)         16.37   16.98   17.81   17.53   18.08   24.24
every_1023(n)         19.54   20.16   20.94   20.50   21.25   28.20
every_1024(n)          6.32    6.40    6.81    6.60    6.83   13.32
every_1025(n)         18.58   19.05   19.91   19.74   20.08   30.51
every_10000(n)        15.92   16.76   17.40   17.38   17.82   24.10
every_65535(n)        18.92   19.60   20.41   20.10   20.80   27.69
every_65536(n)         6.08    6.16    6.62    6.39    6.57   13.40
every_65537(n)        22.08   22.70   23.79   23.69   24.35   31.57
every_100000(n)       16.16   16.55   17.20   17.05   17.61   24.54
every_1000000(n)      16.02   16.42   17.17   16.85   17.42   21.84
every_1048575(n)      18.88   19.23   20.27   19.85   20.52   30.21
every_1048576(n)       6.08    6.18    6.53    6.47    6.58   12.64
every_1048577(n)      22.88   23.23   24.28   23.83   24.63   31.84

I get similar results across various operating systems (Windows, OS X and Linux) all using GNU Compiler Collection (GCC).

Feedback and comments are welcome!

To reproduce these results, do:

> path <- 'https://raw.githubusercontent.com/HenrikBengtsson/jottr.org/master/blog/20150604%2CR_CheckUserInterrupt'
> html <- R.rsp::rfile('R_CheckUserInterrupt.md.rsp', path=path)
> !html ## Open in browser


Visualization and Analysis of Reddit’s "The Button" Data


(This article was first published on everyday analytics, and kindly contributed to R-bloggers)

Introduction

People are weird. And if there’s anything that’s greater collective proof of this fact than Reddit, you’d be hard pressed to find it. I tend to put reddit in the same bucket as companies like Google, Amazon and Netflix, where they have enough money, or freedom, or both, to say something like “wouldn’t it be cool if….?” and then they do it simply because they can.

Enter “the button” (/r/thebutton), reddit’s great social experiment that appeared on April Fool’s Day of this year. An enticing blue rectangle with a timer that counts down from 60 to zero that’s reset when the button is pushed, with no explanation as to what happens when the time is allowed to run out. Sound familiar? The catch here being that it was an experience shared by anyone who visited the site, and each user also only got one press (though many made attempts to game the system, at least initially).

Finally, the timer reached zero, the last button press being at 2015-06-05 21:49:53.069000 UTC, and the game (rather anti-climactically, I might offer) ended.

What does this have to do with people being weird? Well, an entire mythology was built up around the button, amongst other things. Okay, maybe interesting is a better word. And maybe we’re just talking about your average redditor.

Either way, what interests me is that when the experiment ended, all the data were made available. So let’s have a look shall we?

Background

The dataset consists of simply four fields:
press time: the date and time the button was pressed
flair: the flair the user was assigned, based on what the timer was at when they pushed the button
css: the flair class given to the user
outage press: a Boolean indicator of whether the press occurred during a site outage.
The data span a time period from 2015-04-01 16:10:04.468000 to 2015-06-05 21:49:53.069000, with a total of 1,008,316 rows (unique presses).
I found there was css missing for some rows, and a lot of “non presser” flair (users who were not eligible to press the button as their account was created after the event started). For these I used a “missing” value of -1 for the number of seconds remaining when the button was pushed; otherwise it could be stripped from the css field.
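In R that cleaning step might look roughly like this (the file and column names are assumptions on my part; the post links the actual code at the end):

presses <- read.csv("button_presses.csv", stringsAsFactors = FALSE)  # hypothetical file name
# strip the trailing "s" from css values like "34s"; anything non-numeric
# ("non presser", missing) becomes NA and is recoded to -1
secs <- suppressWarnings(as.numeric(sub("s$", "", presses$css)))
presses$seconds <- ifelse(is.na(secs), -1, secs)
table(presses$seconds == -1)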

Analysis

With this data set, we’re looking at a pretty straightforward categorical time series.
Overall Activity in Time
First we can just look at the total number of button presses, regardless of what the clock said (when they occurred in the countdown) by plotting the raw number of presses per day:
Hmmm… you can see there is a massive spike at the beginning of the graph and much, much less activity for the rest of the duration of the experiment. In fact, nearly 32% of all clicks occurred in the first day, and over half (51.3%) in the first two days.
I think this has something to do both with the initial interest in the experiment when it was first announced, and with the fact that the higher the counter is kept, the more people can press the button in the same time period (more on this later).
Perhaps a logarithmic graph for the y-axis would be more suitable?
That’s better. We can see the big drop-off in the first two days or so, and the little blip around the 18th of May is also more apparent. This is likely tied to one of several technical glitches which are noted in the button wiki.

For a more granular look, let’s do the hourly presses as well (with a log scale):

Cool. The spike on the 18th seems to be mainly around one hour with about a thousand presses, and we can see too that perhaps there’s some kind of periodic behavior in the data on an hourly basis? If we exclude some of the earlier data we can also go back to not using a log scale for the y-axis:

Let’s look more into the hours of the day when the button presses occur. We can create a simple bar plot of the count of button presses by hour overall:

You can see that the vast majority occurred around 5 PM and then there is a drop-off after that, with the lows being in the morning hours between about 7 and noon. Note that all the timestamps for the button pushes are in Universal Time. Unfortunately there is no geo data, but assuming most redditors who pushed the button are within the continental United States (a rather fair assumption), the high between 5-7 PM would be roughly 11 AM to 1 PM local time (so, around your lunch hour at work).

But wait, that was just the overall sum of hours over the whole time period. Is there a daily pattern? What about by hour and day of week? Are most redditors pushing the button on the weekend or are they doing it at work (or during school)? We should look into this in more detail.

Hmm, nope! The majority of the clicks occurred Wednesday-Thursday night. But as we know from the previous graphs, the vast majority also occurred within the first two days, which happened to be a Wednesday and Thursday. So the figures above aren’t really that insightful, and perhaps it would make more sense to look at the trending in time across both day and hour? That would give us the figure as below:

As we saw before, there is a huge amount of clicks in the first few days (the first few hours even) so even with log scaling it’s hard to pick out a clear pattern. But most of the presses appear to be present in the bands after 15:00 and before 07:00. You can see the clicks around the outage on the 18th of May were in the same high period, around 18:00 and into the next day.

Maybe alternate colouring would help?

That’s better. Also if we exclude the flurry of activity in the first few days or so, we can drop the logarithmic scaling and see the other data in more detail:

Activity by Seconds Remaining
So far we’ve only looked at the button press activity by the counts in time. What about the time remaining for the presses? That’s what determined each individual reddit user’s flair, and was the basis for all the discussion around the button.

The reddit code granted flairs which were specific to the time remaining when the button was pushed.  For example, if there were 34 seconds remaining, then the css would be “34s”, so it was easy to strip these and convert into numeric data. There were also those that did not press the button who were given the “non presser” flair (6957 rows, ~0.69%), as well as a small number of entries missing flair (67, <0.01%), which I gave the placeholder value of -1.

The remaining flair classes served as a bucketing which functioned very much like a histogram:

Color Have they pressed? Can they press? Timer number when pressed
Grey/Gray N Y NA
Purple Y N 60.00 ~ 51.01
Blue Y N 51.00 ~ 41.01
Green Y N 41.00 ~ 31.01
Yellow Y N 31.00 ~ 21.01
Orange Y N 21.00 ~ 11.01
Red Y N 11.00 ~ 00.00
Silver/White N N NA

We can see this if we plot a histogram of the button presses using the css class, which gives the more granular seconds remaining, with the same breaks as above:

We can see there is a much greater proportion of those who pressed with 51-60s left, and there is a falloff from there (a power law). This is in line with what we saw in the time series graphs: the more the button was pressed, the more presses could occur in a given interval of time, so we expect that most of those presses occurred during the peak activity at the beginning of the experiment (which we’ll examine shortly).

What’s different from the documentation above from the button wiki is the “cheater” class, which was given to those who tried to game the system by doing things like disconnecting their internet and pressing the button multiple times (as far as I can tell). You can see that plotting a bar graph is similar to the above histogram with the difference being contained in the “cheater” class:

Furthermore, looking over the time period, how are the presses distributed in each class? What about in the cheater class? We can plot a more granular histogram:

Here we can more clearly see the exponential nature of the distribution, as well as little ‘bumps’ around the 10, 20, 30 and 45 second marks. Unfortunately this doesn’t tell us anything about the cheater class as it still has valid second values. So let’s do a boxplot by css class as well, showing both the classes (buckets) as well as their distributions:

Obviously each class has to fit into a certain range given their definition, but we can see some are more skewed than others (e.g. class for 51-60s is highly negatively skewed, whereas the class for 41-50 has median around 45). Also we can see that the majority of the cheater class is right near the 60 mark.

If we want to be fancier we can also plot the boxplot using just the points themselves and adding jitter:

This shows the skew of the distributions per class/bucket (focus around “round” times like 10, 30, 45s, etc.) as before, as well as how the vast majority of the cheater class appears to be at 59s mark.
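With ggplot2, the boxplot-plus-jitter figure can be sketched along these lines, assuming the cleaned presses data frame from the earlier sketch (column names assumed):

library(ggplot2)
ggplot(presses, aes(x = css, y = seconds)) +
  geom_boxplot(outlier.shape = NA) +        # boxes per flair class
  geom_jitter(width = 0.2, alpha = 0.05) +  # raw presses with jitter
  theme_bw()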

Presses by seconds remaining and in time
Lastly we can combine the analyses above and look at how the quantity and proportion of button presses varies in time by the class and number of seconds remaining.

First we can look at the raw count of presses per css type per day as a line graph. Note again the scale on the y-axis is logarithmic:

This is a bit noisy, but we can see that the press-6 class (presses with 51-60s remaining) dominates at the beginning, then tapers off toward the end. Presses in the 0-10 class did not appear until after April 15, then eventually overtook the quicker presses, as would have to be the case in order for the timer to run out. The cheater class starts very high along with the press-6 class, then drops off significantly and continues to decrease. I would have liked to break this out into small multiples for more clarity, but it’s not the easiest thing to do using ggplot.

Another way to look at it is the percentage of presses by class per day. I’ve written previously about how stacked area graphs are not your friend, but in this case it’s actually not too bad (plus I wanted to learn how to do it in ggplot). If anything it shows the increase in presses in the 51-60 range right after the outage on May 18, and the increase in the 0-10 range toward the end (green):

This is all very well and good, but let’s get more granular. We can easily visualize the data more granularly using heatmaps with the second values taken from the user flair to get a much more detailed picture. First we’ll look at a heatmap of this by hour over the time period:

Again, the scaling is logarithmic for the counts (here the fill colour). We can see some interesting patterns emerging, but it’s a little too sparse as there are a lot of hours without presses for a particular second value. Let’s really get granular and use all the data on the per second level!
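A heatmap like the hourly one just described can be sketched with geom_tile; again the presses data frame and its press.time column name are assumptions carried over from the earlier sketch:

library(ggplot2)
library(dplyr)

heat <- presses %>%
  mutate(hour = as.POSIXct(format(as.POSIXct(press.time), "%Y-%m-%d %H:00:00"))) %>%
  count(hour, seconds)                      # presses per hour and seconds-remaining value

ggplot(heat, aes(x = hour, y = seconds, fill = n)) +
  geom_tile() +
  scale_fill_gradient(trans = "log10") +    # log scale for the counts, as in the post
  theme_minimal()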

On the left is the data for the whole period with a logarithmic scale, whereas the figure on the right excludes some of the earlier data and uses a linear scale. We can see the peak activity at the beginning in the upper left-hand corner, and then these interesting bands around the 5, 10, 20, 30, and 45 second marks forming and gaining strength over time (particularly toward the end). Interestingly, in addition to the resurgence in near-instantaneous presses after the outage around May 18, there was also a hotspot of presses around the 45s mark close to the end of April. Alternate colouring below:

Finally, we can divide by the number of presses per day and calculate the percent each number of seconds remaining made up over the time period. That gives the figures below:

Here the flurry of activity at the beginning continues to be prominent, but the bands also stand out a little more on a daily basis. We can also see how the proportion of clicks for the smaller number of seconds remaining continues to increase until finally the timer is allowed to run out.

Conclusion

The button experiment is over. In the end there was no momentous meaning to it all, no grand scheme or plan, no hatch exploding into the jungle, just an announcement that the thread would be archived. Again, somewhat anti-climactic.
But it was an interesting experiment, and an interesting data set, given how the nature of the experiment determined the amount of data that could exist in the same interval of time.
And I think it really says something about what the internet allows us to do (both in terms of creating something simply for the sake of it, and collecting and analyzing data), and also about people’s desire to find patterns and create meaning in things, no matter what they are. If you’d asked me, I never would have guessed religions would have sprung up around something as simple as pushing a button. But then again, religions have sprung up around stranger things.
You can read and discuss in the button aftermath thread, and if you want to have a go at it yourself, the code and data are below. Until next time I’ll just keep pressing on.

References & Resources

the button press data (from reddit’s github)
R code for plots
/r/thebutton


From cats to zombies, Wednesday at useR2015


(This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers)

The morning opened with someone I was too bleary-eyed to identify. Possibly the dean of the University of Aalborg. Anyway, he said that this is the largest ever useR conference, and the first ever in a Nordic country. Take that, Norway! Also, considering that there are now quite a few R-based conferences (Bioconductor has its own conference, not to mention R in Finance and EARL), it’s impressive that these haven’t taken away from the main event.

Torben the conference organiser then spoke briefly and mentioned that planning for this event started back in June 2013.

Keynote

Romain Francois gave a talk of equal parts making-R-go-faster, making-R-syntax-easier, and cat-pix. He quipped that he has an open relationship with R: he gets to use Java and C++, and “I’m fine with other people using R”. Jokes aside, he gave an overview of his big R achievements: the new and J functions for rJava that massively simplified that calling syntax; the //[[Rcpp::export]] command that massively simplified writing Rcpp code, and the internals to the dplyr package.

He also gave a demo of JJ Allaire’s RcppParallel, and talked about plans to integrate that into dplyr for a free performance boost on multicore systems.

I also had a coffee-break chat with mathematician Luzia Burger-Ringer (awesome name), who has recently started R programming after a seventeen year career break to raise children. She said:

“When I returned to work I struggled to remember my mathematics training, but using R I could be productive. Compared to Fortran it’s just a few lines of code to get an answer.”

Considering that I’ve forgotten pretty much everything I know by the time I’ve returned from vacation, I’m impressed by Luzia’s ability to dive in after a 17 year break. And I think this is good counter-evidence against R’s perceived tricky learning curve. Try fitting a random forest model in Fortran!

After being suitably caffeinated, I went to the interfacing session discussing connecting R to other languages.

Interfacing

Kasper Hansen had some lessons for integrating external libraries into R packages. He suggested two approaches:

“Either you link to the library, maybe with a function to download that library – this is easiest for the developer; or you include the library in your package – this is easiest for the user”.

He said he’s mostly gone for the latter approach, but said that cross-platform development in this way is mostly a bit of a nightmare.

Kasper gave examples of the illuminaio package, for reading some biological files with no defined specification, some versions of which were encrypted; the affxparser package, for reading Affymetrix RNA sequence files, which didn’t have proper OS-independent file paths; and Rgraphviz, which connects to the apparently awfully implemented Graphviz network visualization software. There were many tales of death-by-memory-leak.

In the discussion afterwards it was interesting to note the exchange between Kasper and Dirk Eddelbuettel. Dirk suggested that Kasper was overly negative about the problems of interfacing with external libraries because he’d had the unfortunate luck to deal with many bad-but-important ones, whereas in general you can just pick good libraries to work with.

My opinion is that Kasper had to pick libraries built by biologists, and my experience is that biologists are generally better at biology than software development (to put it politely).

Christophe Best talked about calling R from Go. After creating the language, Google seem to be making good internal use of Go. And as a large organisation, they suffer from the different-people-writing-different-languages problem quite acutely. Consequently, they have a need for R to plug modelling gaps in their fledgling systems language.

Their R-Go connector runs R in a different process to Go (unlike Rcpp, which uses an intra-process system, according to Christophe). This is more complex to set up, but means that “R and Go don’t have shared crashes”.

It sounds promising, but for the moment, you can only pass atomic types and lists. Support for data frames is planned, as is support for calling Go from R, so this is a project to watch.

Matt Ziubinski talked about libraries to help you work with Rcpp. He recommended Catch, a testing framework for C++. The code for this looked pretty readable (even to me, and I haven’t really touched C++ in over a decade).

He also recommended Boost, which allows compile-time calculations, easy parallel processing, and pipes.

He was also a big fan of C++11, which simplifies a lot of boilerplate coding.

Dan Putler talked about connecting to Spark’s MLlib package for machine learning. He said that connecting to the library was easy, but then they wondered why they had bothered! Always fun to see some software being flamed.

Apparently the regression tools in Spark MLlib don’t hold a candle to R’s lm and glm. They may not be fancy functions, but they’ve been carefully built for robustness.

After some soul-searching, Dan decided that Spark was still worth using, despite the weakness of MLlib, since it nicely handles distributing your data.

He and his team have created a SparkGLM package that ports R’s linear regression algorithms to Spark. lm is mostly done; glm is work-in-progress.

After lunch, I went to the clustering session.

Clustering

Anders Bilgram kicked off the afternoon session with a talk on unsupervised meta-analysis using Gaussian mixed copula models. Say that ten times fast.

He described this as a semi-parametric version of the more standard Gaussian mixed models. I think he meant “mixed” as in “mixture models”, where you consider your data to consist of draws from several different distributions, rather than mixed effects models where you have random effects.

The Gaussian copula bit means that you have to transform your data to be normally distributed first, and he recommended rank normalization for that.

(We do that in proteomics too; you want qnorm(rank(x) / (length(x) + 1)), and yeah, that should be in a package somewhere.)
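
For anyone curious, here is a minimal sketch of that rank normalization (my own illustration, not code from the talk): it maps arbitrary values to approximately standard-normal scores.

rank_normalize <- function(x) qnorm(rank(x) / (length(x) + 1))

set.seed(1)
x <- rexp(1000)          # strongly skewed data
z <- rank_normalize(x)   # roughly N(0, 1) after the transform
hist(z)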

Anders gave a couple of nice examples: he took a 1.4Mpx photo of the space shuttle and clustered it by pixel color, and clustered the ranks of a replicated gene study.

He did warn that he hadn’t tested his approach with high-dimensional data though.

Claudia Beleites, who asked the previous question about high-dimensional data, went on to talk about hierarchical clustering of (you guessed it) high-dimensional data. In particular, she was looking at the results of vibrational spectroscopy. This looks at the vibrations of molecules, in this case to try to determine what some tissue consists of.

The data is a big 3D-array: two dimensional images at lots of different spectral frequencies.

Claudia had a bit of a discussion about k-means versus hierarchical modelling. She suggested that the fact that k-means often overlooks small clusters, and the fact that you need to know the number of clusters in advance, meant that it was unsuitable for her datasets. The latter point was vigorously debated after the talk, with Martin Maechler arguing that for k-means analyses, you just try lots of values for the number of clusters, and see what gives you the best answer.
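
As a small aside, here is a sketch of Martin’s suggestion (my own illustration, not from the talk): fit k-means for a range of cluster counts and compare the total within-cluster sum of squares.

set.seed(42)
dat <- scale(iris[, 1:4])
wss <- sapply(1:10, function(k) kmeans(dat, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")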

Anyway, Claudia had been using hierarchical clustering, and running into problems with calculation time because she has fairly big datasets and hierarchical clustering takes O(n^2) to run.

Her big breakthrough was to notice that you get more or less the same answer clustering on images, or clustering on spectra, and clustering on spectra takes far less time. She had some magic about compressing the information in spectra (peak detection?) but I didn’t follow that too closely.

Silvia Liverani talked about profile regression clustering and her PReMiuM package. She clearly has better accuracy with the shift key than I do.

Anyway, she said that if you have highly correlated variables (body mass and BMI was her example), it can cause instability and general bad behaviour in your models.

Profile regression models were her solution to this, and she described them as “Bayesian infinite mixture models”, but the technical details went over my head.

The package has support for normal/Poisson/binomial/categorical/censored response variable, missing values, and spatial correlations, so it sounds fully featured.

Silvia said it’s written in C++, but runs MCMC underneath, so that makes it medium speed.

I then dashed off to the Kaleidoscope session to hear about Karl Broman’s socks.

Kaleidoscope2

Rasmus Bååth talked about using approximate Bayesian computation to solve the infamous Karl Broman’s socks problem. The big selling point of ABC is that you can calculate stuff where you have no idea how to calculate the maximum likelihood. Anyway, I mostly marvelled at Rasmus’s ability to turn a silly subject into a compelling statistical topic.

Keynote

Adrian Baddeley gave a keynote on spatial statistics and his work with the spatstat package. He said that in 1990, when work began on the S version of spatstat, the field of spatial statistics was considered a difficult domain to work in.

“In 1990 I taught that likelihood methods for spatial statistics were infeasible, and that time-series methods were not extensible to spatial problems.”

Since then, the introduction of MCMC, composite likelihood and non-parametric moments have made things easier, but he gave real credit to the R language for pushing things forward.

“For the first time, we could share code easily to make cumulative progress”

One persistent problem in spatial stats was how to deal with edge corrections. If you sample values inside a rectangular area, and try to calculate the distance to their nearest neighbour, then values near the edge appear to be further away than they really are, because you couldn’t match them against points outside the box that you never sampled.

Apparently large academic wars were fought in the 1980s and early 90s over how best to correct for the edge effects, until R made it easy to compare methods and everyone realised that there wasn’t much difference between them.
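
If you want to see that for yourself, here is a small sketch (my own, assuming the spatstat package; not code from the keynote) comparing several edge corrections of Ripley’s K function on a simulated pattern:

library(spatstat)
set.seed(1)
X <- rpoispp(100)                                    # Poisson point pattern on the unit square
K <- Kest(X, correction = c("border", "isotropic", "translate"))
plot(K)                                              # the corrected curves typically sit close together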

Adrian also talked about pixel logistic regression as being a development made by the spatstat team, where you measure the distance from each pixel in an image to a response feature, then do logistic regression on the distances. This turned out to be equivalent to a Poisson point process.

He also said that the structure of R models helped to generate new research questions. The fact that you are supposed to implement residuals and confint and influence functions for every model meant that they had to invent new mathematics to calculate them.

Adrian concluded with the idea that we should seek a grand unification theory for statistics to parallel the attempt to reconcile relativity and quantum physics. Just as several decades ago lm and glm were considered separate classes of model, but today are grouped together, one day we might reconcile frequentist and Bayesian stats.

Lightning talks

These are 5 minute talks.

Rafaël Coudret described an algorithm for SAEM. It was a bit technical, and I didn’t grasp what the “SA” stood for, but apparently it works well when you can’t figure out how to write the usual Expectation Maximization.

Thomas Leeper talked about the MTurkR interface to Amazon’s Mechanical Turk. This lets you hire workers to do tasks like image recognition, modify and report on tasks, and even pay the workers, all without leaving the R command line.

In future, he wants to support rival services microWorkers and CrowdFunder too.

Luis Candanedo discussed modelling occupancy detection in offices, to save on the heating and electricity bills. He said that IR sensors are too expensive to be practical, so he tried using temperature, humidity, light and CO2 sensors to detect the number of people in the office, then used photographs to make it a supervised dataset.

Random forest models showed that the light sensors were best for predicting occupancy.

He didn’t mention it, but knowing how many hospital beds are taken up is maybe an even more important use case. Though you can probably just see who has been allocated where.

Dirk Eddelbuettel talked about his drat package for making local file systems or github (or possibly anywhere else) behave like an R repo.

Basically, it bugs him that if you use devtools::install_github, then you can’t do utils::update.packages on it afterwards, and drat fixes that problem.
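
A rough sketch of that workflow (the repository and package names here are hypothetical, and this is my illustration rather than Dirk’s):

install.packages("drat")
drat::addRepo("someuser")          # registers https://someuser.github.io/drat as a package repository
install.packages("somepackage")    # installs from that repo, just like from CRAN...
update.packages()                  # ...so later updates work through the usual mechanism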

Saskia Freytag talked about epilepsy gene sequencing. She had 6 datasets of children’s brains gene expression data, and drew some correlation networks of them. (Actually, I’m not sure if they were correlation networks, or partial correlation networks, which seem to be more popular these days.)

Her idea was that true candidate genes for epilepsy should lie in the same networks as known epilepsy genes, thus filtering out many false negatives.

She also had a Shiny interface to help her non-technical colleagues interact with the networks.

Soma Datta talked about teaching R as a first programming language to secondary school children. She said that many of them found C++ and Java to be too hard, and that R had a much higher success rate.

Simple things like having array indices start at one rather than zero, not having to bother with semi-colons to terminate lines, and not having to declare variable types made a huge difference to the students’ ability to be productive.

Alan Friedman talked about Lotka’s Law, which states that a very small number of journal paper authors write most of the papers, and it quickly drops off so that 60% of journal authors only write one paper.

He has an implementation package called LotkasLaw, which librarians might find useful.

Berry Boessenkool talked about extreme value stats. Apparently as the temperature increases, the median chance of precipitation does too. However, when you look at the extreme high quantiles (> 99.9%) of the chance of precipitation, they increase up to a temperature of 25 degrees Celsius or so, then drop again.

Berry suggested that this was a statistical artefact of not having much data, and when he did a more careful extreme value analysis, the high-quantile probability of precipitation kept increasing with temperature, as the underlying physics suggested it should.

When he talked about precipitation, I’m pretty sure he meant rain, since my rudimentary meteorological knowledge suggests that the probability of sleet and snow drops off quite sharply above zero degrees Celsius.

Jonathan Arta talked about his participation in a Kaggle competition predicting NCAA Basketball scores in a knockout competition called March Madness.

His team used results from the previous season’s league games, Las Vegas betting odds, a commercial team metric dataset, and the distance travelled to each game to try to predict the results.

He suggested that they could have done better if they’d used a Bayesian approach: if a poor team wins its first couple of games, you know it is better than your model predicts.

Adolfo Alvarez gave a quick rundown of the different approaches for making your code go faster. No time for details, just a big list.

Vectorization, data.table and dplyr, do things in a database, try alternate R engines, parallelize stuff, use GPUs, use Hadoop and Spark, buy time on Amazon or Azure machines.

Karen Nielsen talked about predicting EEG data (those time series of your brain’s electrical activity) using regression spline mixed models. Her big advance was to include person and trial effects in the model, which was based on the lmer function.

Andrew Kriss talked about his rstats4ag.org website, which gives statistics advice for arable farmers. The stats are fairly basic (on purpose), tailored for use by crop farmers.

Richard Layton talked about teaching graphics. “As well as drawing ‘clear’ graphs, it is important to think about the needs of the audience”, he argued.

While a dot plot may be the best option, if your audience has never seen one before, it may be best to use a boxplot instead. (There are no questions in the lightning talks, so I didn’t get the chance to ask him if he would go so far as to recommend a 4D pie chart!)

One compelling example for considering the psychology of the audience was a mosaic plot of soldiers’ deaths in the (I think first) Iraq war. By itself, the plot evokes little emotion, but if you put a picture of a soldier dying next to it, it reminds you what the numbers mean.

Michael Höhle headlined today with a talk on zombie preparedness, filling in some of the gaps in the Zombie Survival Guide.

He explained that the most important thing was to track possible zombie outbreak metrics in order to get an early warning of a problem. He gave a good explanation of monitoring homicides by headshot and decapitation, then correcting for the fact that the civil servants reporting these numbers had gone on holiday.

His surveillance package can also be used for non-zombie related disease outbreaks.

To leave a comment for the author, please follow the link and comment on his blog: 4D Pie Charts » R.

R Package to access the Open Movie Database (OMDB) API

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

It’s not on CRAN yet, but there’s a devtools-installable R package for getting data from the OMDB API.

It covers all of the public API endpoints:

  • find_by_id: Retrieve OMDB info by IMDB ID search
  • find_by_title: Retrieve OMDB info by title search
  • get_actors: Get actors from an omdb object as a vector
  • get_countries: Get countries from an omdb object as a vector
  • get_directors: Get directors from an omdb object as a vector
  • get_genres: Get genres from an omdb object as a vector
  • get_writers: Get writers from an omdb object as a vector
  • print.omdb: Print an omdb result
  • search_by_title: Lightweight omdb title search

Here’s a bit of it in action:

devtools::install_github("hrbrmstr/omdbapi")
library(dplyr)
library(pbapply)
 
search_by_title("Captain America")
 
# Source: local data frame [10 x 4]
# 
#                                                   Title  Year    imdbID   Type
# 1                    Captain America: The First Avenger  2011 tt0458339  movie
# 2                   Captain America: The Winter Soldier  2014 tt1843866  movie
# 3                                       Captain America  1990 tt0103923  movie
# 4                                       Captain America  1979 tt0078937  movie
# 5           Iron Man and Captain America: Heroes United  2014 tt3911200  movie
# 6                    Captain America II: Death Too Soon  1979 tt0078938  movie
# 7                                       Captain America  1944 tt0036697  movie
# 8                                       Captain America 1966– tt0206474 series
# 9                        Captain America: Super Soldier  2011 tt1740721   game
# 10 Comic Book Origins: Captain America - Winter Soldier  2014 tt3618126  movie
 
search_by_title("Captain America", year_of_release=2013)
 
# Source: local data frame [1 x 4]
# 
#                              Title Year    imdbID  Type
# 1 A Look Back at 'Captain America' 2013 tt3307378 movie
 
games <- search_by_title("Captain America", type="game")
 
glimpse(games)
 
# Observations: 2
# Variables:
# $ Title  (chr) "Captain America: Super Soldier", "Captain America and the A...
# $ Year   (chr) "2011", "1991"
# $ imdbID (chr) "tt1740721", "tt0421939"
# $ Type   (chr) "game", "game"
 
find_by_title(games$Title[1])
 
#      Title: Captain America: Super Soldier
#       Year: 2011
#      Rated: N/A
#   Released: 2011-07-19
#    Runtime: N/A
#      Genre: Action
#   Director: Michael McCormick, Robert Taylor
#     Writer: Christos N. Gage
#     Actors: Hayley Atwell, Chris Evans, Sebastian Stan, Neal McDonough
#       Plot: You play the Sentinel of Liberty as you raid the Red Skull's scientist
#             minion, Armin Zola's, lair.
#   Language: English
#    Country: USA
#     Awards: N/A
#     Poster: http://ia.media-imdb.com/images/M/
#             MV5BMTUwMzQ0NjE5N15BMl5BanBnXkFtZTgwODI3MzQxMTE@._V1_SX300.jpg
#  Metascore: N/A
# imdbRating: 7.2
#  imdbVotes: 271
#     imdbID: tt1740721
#       Type: game
 
find_by_title("Game of Thrones", type="series", season=1, episode=1)
 
#      Title: Winter Is Coming
#       Year: 2011
#      Rated: TV-MA
#   Released: 2011-04-17
#    Runtime: 62 min
#      Genre: Adventure, Drama, Fantasy
#   Director: Timothy Van Patten
#     Writer: David Benioff (created by), D.B. Weiss (created by), George R.R.
#             Martin ("A Song of Ice and Fire" by), David Benioff, D.B.
#             Weiss
#     Actors: Sean Bean, Mark Addy, Nikolaj Coster-Waldau, Michelle Fairley
#       Plot: Jon Arryn, the Hand of the King, is dead. King Robert Baratheon plans
#             to ask his oldest friend, Eddard Stark, to take Jon's
#             place. Across the sea, Viserys Targaryen plans to wed his
#             sister to a nomadic warlord in exchange for an army.
#   Language: English
#    Country: USA
#     Awards: N/A
#     Poster: http://ia.media-imdb.com/images/M/
#             MV5BMTk5MDU3OTkzMF5BMl5BanBnXkFtZTcwOTc0ODg5NA@@._V1_SX300.jpg
#  Metascore: N/A
# imdbRating: 8.5
#  imdbVotes: 12584
#     imdbID: tt1480055
#       Type: episode
 
get_genres(find_by_title("Star Trek: Deep Space Nine", season=5, episode=7))
 
# [1] "Action"    "Adventure" "Drama"
 
get_writers(find_by_title("Star Trek: Deep Space Nine", season=4, episode=6))
 
# [1] "Gene Roddenberry (based upon "Star Trek" created by)"
# [2] "Rick Berman (created by)"                              
# [3] "Michael Piller (created by)"                           
# [4] "David Mack"                                            
# [5] "John J. Ordover"
 
get_directors(find_by_id("tt1371111"))
 
# [1] "Tom Tykwer"     "Andy Wachowski" "Lana Wachowski"
 
get_countries(find_by_title("The Blind Swordsman: Zatoichi"))
 
# [1] "Japan"
 
ichi <- search_by_title("Zatoichi")
bind_rows(lapply(ichi$imdbID, function(x) {
  find_by_id(x, include_tomatoes = TRUE)
})) -> zato
 
par(mfrow=c(3,1)) 
boxplot(zato$tomatoUserMeter, horizontal=TRUE, main="Tomato User Meter", ylim=c(0, 100))
boxplot(zato$imdbRating, horizontal=TRUE, main="IMDB Rating", ylim=c(0, 10))
boxplot(zato$tomatoUserRating, horizontal=TRUE, main="Tomato User Rating", ylim=c(0, 5))

README-usage-1

You can find out more at its GitHub repo

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.

R 101 – Aggregate By Quarter

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

We were asked a question on how to (in R) aggregate quarterly data from what I believe was a daily time series. This is a pretty common task and there are many ways to do this in R, but we’ll focus on one method using the zoo and dplyr packages. Let’s get those imports out of the way:

library(dplyr)
library(zoo)
library(ggplot2)

Now, we need some data. This could be from a database, log file or even Excel spreadsheet or CSV. Since we’re focusing on the aggregation and not the parsing, let’s generate some data, for daily failed logins in calendar year 2014:

set.seed(1492)

yr_2014 <- seq(from=as.Date("2014-01-01"), 
                              to=as.Date("2014-12-31"), 
                              by="day")

logins <- data_frame(date=yr_2014,
                     failures=round(rlnorm(length(yr_2014)) * 
                                      sample(10:50, 1)), 0.5, 3)

glimpse(logins)

## Observations: 365
## Variables:
## $ date     (date) 2014-01-01, 2014-01-02, 2014-01-03, 2014-01-04, 2014...
## $ failures (dbl) 18, 13, 6, 91, 24, 46, 14, 34, 10, 48, 45, 11, 8, 40,...
## $ 0.5      (dbl) 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5...
## $ 3        (dbl) 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...

Using set.seed makes the pseudo-random draws via rlnorm repeatable on other systems. We can get a better look at that data:

ggplot(logins, aes(x=date, y=failures)) + 
  geom_bar(stat="identity") +
  labs(x=NULL, y="# Login Failures\n") +
  theme_bw() +
  theme(panel.grid=element_blank()) +
  theme(panel.border=element_blank())

We can then, summarize the number of failed logins by quarter using as.yearqtr:

logins %>% 
  mutate(qtr=as.yearqtr(date)) %>% 
  count(qtr, wt=failures) -> total_failed_logins_by_qtr

total_failed_logins_by_qtr

## Source: local data frame [4 x 2]
## 
##       qtr    n
## 1 2014 Q1 4091
## 2 2014 Q2 5915
## 3 2014 Q3 6141
## 4 2014 Q4 5229

NOTE: you can control the way those quarter labels look with the format parameter of as.yearqtr:

format

character string specifying format. "%C", "%Y", "%y" and "%q", if present, are replaced with the century, year, last two digits of the year, and quarter (i.e. a number between 1 and 4), respectively.
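
For instance (a small sketch of that format parameter, not part of the original post):

q <- as.yearqtr(as.Date("2014-05-15"))
format(q, "%Y-Q%q")   # "2014-Q2"
format(q, "Q%q '%y")  # "Q2 '14"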

But you can also get more intra-quarter detail as well by looking at the distribution of failed logins:

logins %>% 
  mutate(qtr=as.character(as.yearqtr(date))) %>% 
  ggplot() +
  geom_violin(aes(x=qtr, y=failures), fill="#cab2d6") +
  geom_boxplot(aes(x=qtr, y=failures), alpha=0) +
  scale_y_continuous(expand=c(0, 0)) +
  labs(x=NULL, y=NULL, title="\nDistribution of login failures per quarter") +
  coord_flip() +
  theme_bw() +
  theme(panel.grid=element_blank()) +
  theme(panel.border=element_blank()) +
  theme(axis.ticks.y=element_blank())

To leave a comment for the author, please follow the link and comment on his blog: Data Driven Security.

Stop the madness – no more pie charts

(This article was first published on cpwardell.com » R, and kindly contributed to R-bloggers)

There has been a trend in the last few years to put interesting-looking but non-informative figures in papers; the pie chart is the worst recurrent offender.  I have no idea how they keep getting included, as they’re famously misleading and awful.  I’d love my work to look as much like the cockpit of a mecha or Iron Man’s HUD as possible, but I know that if it did, it wouldn’t be as clear or concise as it should be.

ku-xlarge
So cool

A recent example is this otherwise very interesting and good paper:

Subclonal diversification of primary breast cancer revealed by multiregion sequencing by Yates et al, 2015

Figure 2 is confusing at best and reproduced below.  These pie-chart-like plots were made famous by Florence Nightingale 150 years ago and have been called “coxcomb plots”.  Wikipedia claims that that’s a mistake and they are really called “polar area diagrams” or “courbes circulaires”.

nm.3886-F2

I admit that they look pretty cool and are fine if you’re building some sort of hip infographic about consumer tastes or Facebook trends.  However, I think that they’re inappropriate and misleading because:

  1. They look unusual.  The reader expends a lot of energy working out what they’re looking at instead of processing the relationships in the data
  2. These data contain only one variable that changes, but the plots used can encode two variables (arc length and radius can vary, discussed below).  The plot is therefore unnecessarily complex
  3. In pie charts, the arc length is proportional to the value represented.  In these plots, the arc length is identical for each slice of pie… but you might not know this and you may infer that there is less of a certain segment.  This is appropriate for the type of data that Florence Nightingale was plotting (deaths in each month of the year), but not for arbitrary divisions
  4. In these plots, the radius of each segment (i.e. how far it extends from the centre) is informative.  You’re supposed to read off an almost-invisible grey scale of concentric rings, but it’s not easy.  Also, the visual effect is non-linear because the area of a circle is πr^2, which means that a small increase in radius has a disproportionately large effect on the plotted area (see the quick numeric check after this list)
  5. It’s really hard to compare between plots; visual comparison is difficult and reading the scale to get at the numbers is even harder
  6. Multiple data types are represented in a single plot.  I’m not sure mixing somatic mutation variant allele frequencies and copy number log ratios is very effective
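
As a quick numeric check of point 4 (my own illustration, not from the paper), doubling a radius quadruples the area a reader perceives:

r <- c(1, 2)
pi * r^2    # 3.14 vs 12.57: a 2x radius change reads as a 4x difference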

Some potential solutions

Criticizing without offering a solution isn’t very helpful, so I offer two solutions:

Use barplots or boxplots.  Yes, they’re boring, but they’re also familiar and easy to read.  I suppose you could even put a scale at either edge and mix data types without too much complaint (e.g. somatic VAFs and copy number logRs on the same plot).

A type of 3D barplot.  I generally think 3D is a bad thing when displayed in a 2D plane, particularly if it’s non interactive.  For example, a 3D shape on a 2D display is fine if you can rotate and zoom it freely.  “Lego plots” have been popularized by the Broad Institute in their sequencing papers, usually to show the sequence context of mutations (see below; taken from Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity by Dulak et al, 2013).  The irony of the pie charts isn’t lost on me.

ng_2591-F1

They’re relatively appealing and a relatively concise way of showing what would otherwise be a barplot of 96 values; it would get tricky if the data weren’t carefully arranged so as not to obscure anything, though.

If 3D barplots are now acceptable, why don’t we make a 3D barplot that encodes two variables?  The height would represent one variable and the depth another.

I’ve created a small data set and some R code to illustrate these points below:

Alternative plots

The data set represents 4 mutations (labeled A to D) in a set of 100 samples.  Each sample has a variant allele frequency (VAF) for each mutation, between 0 (not present) and 1 (every sequencing read contains it).

# A is frequent (80/100 samples) and usually subclonal (VAF ~0.1)
# B is infrequent (20/100 samples) and usually clonal (VAF ~0.8)
# C is frequent (90/100 samples) and usually clonal (VAF ~0.9)
# D is infrequent (15/100 samples) and usually subclonal (VAF ~0.15)

  • Coxcomb / polar area plot.  Arc length represents proportion of samples, radius (wedge length) represents median VAF.  Can be difficult to interpret.

coxcomb

  • Boxplot.  Good representation of the data, but hard to tell how many samples are represented.  For example, B is far less common than C but appears to be very similar.  Likewise for D and A.

boxplot

  • Barplot.  No representation of the spread of the data.  The proportion of affected samples is encoded using colour (darker mutations affect more samples).  Perhaps the colours could be applied to the boxplot for the best result?

barplot

  • 3D barplot.  It’s very basic (missing axes and labels), but shows the relationships between proportion and VAF more clearly than other plots.  I added some transparency so no column totally obscures any others.  It’s more convincing when you can rotate the object yourself (download the code and try it yourself), but I think even a static image is better than a Coxcomb/polar coordinate plot.

3dbarplot

Code is below and available on GitHub here.

## Load packages
library(rgl)
library(ggplot2)

#### START OF FUNCTIONS

## Functions modified from the "demo(hist3d)" examples in the rgl package:
# library(rgl)
# demo(hist3d)
## Would it have killed the author to comment their code?

## Draws a single "column" or "stack".
## X and Y coordinates determine the area of the stack
## The Z coordinate determines the height of the stack
stackplot.3d<-function(x,y,z=1,alpha=1,topcol="#078E53",sidecol="#aaaaaa"){

## These lines allow the active rgl device to be updated with multiple changes
save <- par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Determine the coordinates of each surface of the stack and its edges
x1<-c(rep(c(x[1],x[2],x[2],x[1]),3),rep(x[1],4),rep(x[2],4))
z1<-c(rep(0,4),rep(c(0,0,z,z),4))
y1<-c(y[1],y[1],y[2],y[2],rep(y[1],4),rep(y[2],4),rep(c(y[1],y[2],y[2],y[1]),2))
x2<-c(rep(c(x[1],x[1],x[2],x[2]),2),rep(c(x[1],x[2],rep(x[1],3),rep(x[2],3)),2))
z2<-c(rep(c(0,z),4),rep(0,8),rep(z,8) )
y2<-c(rep(y[1],4),rep(y[2],4),rep(c(rep(y[1],3),rep(y[2],3),y[1],y[2]),2) )

## These lines create the sides of the stack and its top surface
rgl.quads(x1,z1,y1,col=rep(sidecol,each=4),alpha=alpha)
rgl.quads(c(x[1],x[2],x[2],x[1]),rep(z,4),c(y[1],y[1],y[2],y[2]),
col=rep(topcol,each=4),alpha=1)
## This line adds black edges to the stack
rgl.lines(x2,z2,y2,col="#000000")
}
# Example:
#stackplot.3d(c(-0.5,0.5),c(4.5,5.5),3,alpha=0.6)

## Calls stackplot.3d repeatedly to create a barplot
# x is a constant distance along x axis
# y is the depth of column
# z is the height of column
barz3d<-function(x,y,z,alpha=1,topcol="#078E53",sidecol="#aaaaaa",scaley=1,scalez=1){
## These lines allow the active rgl device to be updated with multiple changes
save <- par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Plot each of the columns
n=length(x)
breaks.x = seq(0,n-1)
for(i in 1:n){
stackplot.3d(c(breaks.x[i],breaks.x[i]+1),c(0,-y[i])*scaley,z[i]*scalez,alpha=alpha,topcol=topcol)
}
## Set the viewpoint
rgl.viewpoint(theta=30,phi=25)
}
# Example
#barz3d(x=LETTERS[1:4],y=c(0.8,0.2,0.9,0.15),z=c(0.11,0.75,0.89,0.16),alpha=0.4,scaley=2,scalez=2)

#### END OF FUNCTIONS

## Example data:
# 4 mutations in 100 samples
# VAF range is from 0 to 1
# A is frequent and usually subclonal
# B is infrequent and usually clonal
# C is frequent and usually clonal
# D is infrequent and usually subclonal

Avaf=rnorm(80,0.1,0.05)
Bvaf=rnorm(20,0.8,0.1)
Cvaf=rnorm(90,0.9,0.05)
Dvaf=rnorm(15,0.15,0.05)

## Summarize data in new object
vafsum=data.frame(median=sapply(list(Avaf,Bvaf,Cvaf,Dvaf),median),
proportion=sapply(list(Avaf,Bvaf,Cvaf,Dvaf),function(x){length(x)/100}))
rownames(vafsum)=c(LETTERS[1:4])

## Code to produce coxcomb/polar coordinate plot adapted from:
## http://robinlovelace.net/r/2013/12/27/coxcomb-plots-spiecharts-R.html
## https://github.com/Robinlovelace/lilacPlot
pos = 0.5 * (cumsum(vafsum$proportion) + cumsum(c(0, vafsum$proportion[-length(vafsum$proportion)])))
p = ggplot(vafsum, aes(x=pos)) + geom_bar(aes(y=median), width=vafsum$proportion, color = "black", stat = "identity") + scale_x_continuous(labels = rownames(vafsum), breaks = pos) # Linear version is ok
p + coord_polar(theta = "x")
# (ignore warnings thrown)

## A traditional boxplot
boxplot(Avaf,Bvaf,Cvaf,Dvaf,names=LETTERS[1:4])

## A barplot where height represents median VAF and the color of the bar represents
## how many samples contain each mutation
barplot(vafsum$median,names=LETTERS[1:4],col=rgb(0.1,0.1,0.1,vafsum$proportion))

## Our new 3D barplot function
barz3d(x=LETTERS[1:4],y=vafsum$proportion,z=vafsum$median,alpha=0.4,scaley=2,scalez=2)
rgl.snapshot("3dbarplot.png", fmt = "png", top = TRUE )

To leave a comment for the author, please follow the link and comment on his blog: cpwardell.com » R.

Graphs in R – Overlaying Data Summaries in Dotplots

(This article was first published on Design Data Decisions » R, and kindly contributed to R-bloggers)

Dotplots are useful for the graphical visualization of small to medium-sized datasets. These simple plots provide an overview of how the data is distributed, whilst also showing the individual observations. It is however possible to make the simple dotplots more informative by overlaying them with data summaries and/or smooth distributions.

This post is about creating such superimposed dotplots in R – we first see how to create these plots using just base R graphics, and then proceed to create them using the ggplot2 R package.

## First things first - dataset 'chickwts': Weights of
## chickens fed with any one of six different feed types

?chickwts
data(chickwts)  ## load the dataset

 

Graphs using base R:

## First some plot settings

par(cex.main=0.9,cex.lab=0.8,font.lab=2,cex.axis=0.8,font.axis=2,col.axis="grey50")

We first create a dotplot where the median of each group is also displayed as a horizontal line:

## Getting the dotplot first, expanding the x-axis to leave room for the line
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method = "stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Then compute the group-wise medians
medians <- tapply(chickwts[,"weight"], chickwts[,"feed"], median)

## Now add line segments corresponding to the group-wise medians
loc <- 1:length(medians)
segments(loc-0.3, medians, loc+0.3, medians, col="red", lwd=3)

dotWithMedian_baseR

Next, we create a dotplot where the median is shown, along with the 1st and 3rd quartiles, i.e., the ‘box’ of the boxplot of the data is overlaid on the dotplot:

## Getting the dotplot first, expanding the x-axis to leave room for the box
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method="stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Now draw the box, but without the whiskers!
boxplot(weight ~ feed, data = chickwts, add=TRUE, range=0, whisklty = 0, staplelty = 0)

dotWithBox_baseR

Plots similar to ones created above, but using the ggplot2 R package instead:

## Load the ggplot2 package first
library(ggplot2)

## Data and plot settings
p <- ggplot(chickwts, aes(x=feed, y=weight)) +
labs(list(title = "Chicken weights after six weeks", x = "Feed Type", y = "Weight (g)")) +
theme(axis.title.x = element_text(face="bold"), axis.text.x = element_text(face="bold")) +
theme(axis.title.y = element_text(face="bold"), axis.text.y = element_text(face="bold"))

We use the stat_summary function to plot the median line as an errorbar, but we need to define our own function that calculates the group-wise median and produces output in a format suitable for stat_summary like so:

## define custom median function
plot.median <- function(x) {
  m <- median(x)
  c(y = m, ymin = m, ymax = m)
}

## dotplot with median line
p1 <- p + geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5) +
stat_summary(fun.data="plot.median", geom="errorbar", colour="red", width=0.5, size=1)
print(p1)

dotWithMedian_ggplot2

For the dotplot overlaid with the median and the 1st and 3rd quartile, the ‘box’ from the boxplot is plotted using geom_boxplot function:

## dotplot with box
p2 <- p + geom_boxplot(aes(ymin=..lower.., ymax=..upper..)) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)
print(p2)

dotWithBox_ggplot2

Additionally, let’s also plot a dotplot with a violin plot overlaid. We cannot do this in base R!

## dotplot with violin plot
## and add some cool colors
p3 <- p + geom_violin(scale="width", adjust=1.5, trim = FALSE, fill="indianred1", color="darkred", size=0.8) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)
print(p3)

violinPlotColored

To leave a comment for the author, please follow the link and comment on his blog: Design Data Decisions » R.

Why I use Panel/Multilevel Methods

(This article was first published on DiffusePrioR » R, and kindly contributed to R-bloggers)

I don’t understand why any researcher would choose not to use panel/multilevel methods on panel/hierarchical data. Let’s take the following linear regression as an example:

y_{it} = \beta_{0} + \beta_{1}x_{it} + a_{i} + \epsilon_{it},

where a_{i} is a random effect for the i-th group. A pooled OLS regression model for the above is unbiased and consistent. However, it will be inefficient, unless a_{i}=0 for all i.

Let’s have a look at the consequences of this inefficiency using a simulation. I will simulate the following model:

y_{it} = 1 + 5 x_{it} + a_{i} + \epsilon_{it},

with a_{i} \sim N(0, 3) and \epsilon_{it} \sim N(0, 1). I will do this simulation and compare the following 4 estimators: pooled OLS, random effects (RE) AKA a multilevel model with a mixed effect intercept, a correlated random effects (CRE) model (including the group mean as a regressor, as in Mundlak (1978)), and finally the regular fixed effects (FE) model. I am doing this in R: for the first model I will use the simple lm() function, for the second and third lmer() from the lme4 package, and for the last the excellent felm() function from the lfe package. These models will be tested under two conditions. First, we will assume that the random effects assumption holds, i.e. the regressor is uncorrelated with the random effect. After looking at this, we will then allow the random effect to correlate with the regressor x_{it}.

The graph below shows the importance of using panel methods over pooled OLS. It shows boxplots of the 100 simulated estimates. Even when the random effects assumption is violated, the random effects estimator (RE) is far superior to simple pooled OLS. The CRE and FE estimators also perform well: both have the lowest root mean square errors, even when the model satisfies the random effects assumption. Please see my R code below.

remc

# clear ws
rm(list=ls())

# load packages
library(lme4)
library(plyr)
library(lfe)
library(reshape)
library(ggplot2)
# from this:

### set number of individuals
n = 200
# time periods
t = 5

### model is: y=beta0_{i} +beta1*x_{it} + e_{it}
### average intercept and slope
beta0 = 1.0
beta1 = 5.0

### set loop reps
loop = 100
### results to be entered
results1 = matrix(NA, nrow=loop, ncol=4)
results2 = matrix(NA, nrow=loop, ncol=4)

for(i in 1:loop){
  # basic data structure
  data = data.frame(t = rep(1:t,n),
                    n = sort(rep(1:n,t)))
  # random effect/intercept to add to each 
  rand = data.frame(n = 1:n,
                    a = rnorm(n,0,3))
  data = join(data, rand, match="first")
  # random error
  data$u = rnorm(nrow(data), 0, 1)
  # regressor x
  data$x = runif(nrow(data), 0, 1)
  # outcome y
  data$y = beta0 + beta1*data$x + data$a + data$u  
  # make factor for i-units
  data$n = as.character(data$n)
  # group i mean's for correlated random effects
  data$xn = ave(data$x, data$n, FUN=mean)
  # pooled OLS
  a1 = lm(y ~ x, data)
  # random effects
  a2 = lmer(y ~ x + (1|n), data)
  # correlated random effects
  a3 = lmer(y ~ x + xn + (1|n), data)
  # fixed effects
  a4 = felm(y ~ x | n, data)
  
  # gather results
  results1[i,] = c(coef(a1)[2],
                  coef(a2)$n[1,2],
                  coef(a3)$n[1,2],
                  coef(a4)[1])
  ### now let random effects assumption be false
  ### ie E[xa]!=0
  data$x = runif(nrow(data), 0, 1) + 0.2*data$a
  # the below is like above
  data$y = beta0 + beta1*data$x + data$a + data$u  
  data$n = as.character(data$n)
  data$xn = ave(data$x, data$n, FUN=mean)
  a1 = lm(y ~ x, data)
  a2 = lmer(y ~ x + (1|n), data)
  a3 = lmer(y ~ x + xn + (1|n), data)
  a4 = felm(y ~ x | n, data)
  
  results2[i,] = c(coef(a1)[2],
                  coef(a2)$n[1,2],
                  coef(a3)$n[1,2],
                  coef(a4)[1])  
}
# calculate rmse
apply(results1, 2, function(x) sqrt(mean((x-5)^2)))
apply(results2, 2, function(x) sqrt(mean((x-5)^2)))

# shape data and do ggplot
model.names = data.frame(X2 = c("1","2","3","4"),
                         estimator = c("OLS","RE","CRE","FE"))
res1 = melt(results1)
res1 = join(res1, model.names, match="first")
res2 = melt(results2)
res2 = join(res2, model.names, match="first")

res1$split = "RE Valid"
res2$split = "RE Invalid"
res1 = rbind(res1, res2)

res1$split = factor(res1$split, levels =  c("RE Valid", "RE Invalid"))
res1$estimator = factor(res1$estimator, levels =  c("OLS","RE","CRE","FE"))

number_ticks = function(n) {function(limits) pretty(limits, n)}

ggplot(res1, aes(estimator, value)) + 
  geom_boxplot(fill="lightblue") +
  #coord_flip() +
  facet_wrap( ~ split, nrow = 2, scales = "free_y") +
  geom_hline(yintercept = 5) +
  scale_x_discrete('') + 
  scale_y_continuous(bquote(beta==5), breaks=number_ticks(3)) + 
  theme_bw() + 
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16),
        legend.title = element_blank(),
        legend.text = element_text(size=16),
        strip.text.x = element_text(size = 16),
        axis.text.x = element_text(angle = 45, hjust = 1))
ggsave("remc.pdf", width=8, height=6)

To leave a comment for the author, please follow the link and comment on his blog: DiffusePrioR » R.

Computing AIC on a Validation Sample

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

This afternoon, we saw in the data science training that it is possible to use the AIC criterion for model selection.

> library(splines)
> AIC(glm(dist ~ speed, data=train_cars, 
  family=poisson(link="log")))
[1] 438.6314
> AIC(glm(dist ~ speed, data=train_cars, 
  family=poisson(link="identity")))
[1] 436.3997
> AIC(glm(dist ~ bs(speed), data=train_cars, 
  family=poisson(link="log")))
[1] 425.6434
> AIC(glm(dist ~ bs(speed), data=train_cars, 
  family=poisson(link="identity")))
[1] 428.7195

And I’ve been asked why we don’t use a training sample to fit a model, and then use a validation sample to compare the predictive properties of those models, penalizing by the complexity of the model. But it turns out that it is difficult to compute the AIC of those models on a different dataset. I mean, it is possible to write down the likelihood (since we have a Poisson model), but I want code that could work for any model, any distribution…
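
To see what writing down the likelihood by hand would involve, here is a minimal sketch for the Poisson case only (my own illustration, assuming the train_cars / valid_cars split created below; the trick that follows avoids having to redo this for every family):

> reg = glm(dist ~ speed, data=train_cars,
  family=poisson(link="log"))
> mu = predict(reg, newdata=valid_cars, type="response")
> logL = sum(dpois(valid_cars$dist, lambda=mu, log=TRUE))
> 2*length(coef(reg)) - 2*logL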

Fortunately, Heather suggested a very clever idea, using her gnm package.

And actually, it works well.

We first create two datasets, one for training, and one for validation

> set.seed(1)
> idx = sample(1:50,size=10,replace=FALSE)
> train_cars = cars[-idx,]
> valid_cars = cars[idx,]

then use simply

> library(gnm)
> reg1 = gnm(dist ~ speed, data=train_cars, 
  family=poisson(link="log"))
> reg2 = gnm(dist ~ speed, data=valid_cars, 
  constrain = "*", constrainTo = 
  coef(reg1),family=poisson(link="log"))

Here the Akaike criterion on the validation sample is

> AIC(reg2)
[1] 82.57612

Let us keep track of a prediction to plot it later on
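
Note that u, the grid of speeds used for the predictions below, is not defined in this excerpt; something along these lines (an assumption on my part) will do:

> u = seq(min(cars$speed), max(cars$speed), by=.1)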

> v_log=predict(reg1,newdata=
  data.frame(speed=u),type="response")

We can challenge that Poisson model with a log link with a Poisson model with an identity link function

> reg1 = gnm(dist ~ speed, data=train_cars, 
  family=poisson(link="identity"))
> reg2 = gnm(dist ~ speed, data=valid_cars, 
  constrain = "*", constrainTo = 
  coef(reg1),family=poisson(link="identity"))
> AIC(reg2)
[1] 73.24745
> v_id=predict(reg1,newdata=data.frame(speed=u),
  type="response")

We can also try to include splines, either with the log link

> library(splines)
> reg1 = gnm(dist ~ bs(speed), data=train_cars, 
  family=poisson(link="log"))
> reg2 = gnm(dist ~ speed, data=valid_cars, 
  constrain = "*", constrainTo = 
  coef(reg1),family=poisson(link="log"))
> AIC(reg2)
[1] 82.57612
> v_log_bs=predict(reg1,newdata=
  data.frame(speed=u),type="response")

or with the identity link (but always in the same family)

> reg1 = gnm(dist ~ bs(speed), data=train_cars, 
  family=poisson(link="identity"))
> reg2 = gnm(dist ~ speed, data=valid_cars, 
  constrain = "*", constrainTo = 
  coef(reg1),family=poisson(link="identity"))
> AIC(reg2)
[1] 73.24745
> v_id_bs=predict(reg1,newdata=
  data.frame(speed=u),type="response")

If we plot the predictions, we get

> plot(cars)
> points(train_cars,pch=19,cex=.85,col="grey")
> lines(u,v_log,col="blue")
> lines(u,v_id,col="red")
> lines(u,v_log_bs,col="blue",lty=2)
> lines(u,v_id_bs,col="red",lty=2)

Now, the problem with this holdout technique is that we might get unlucky (or lucky) when creating the samples. So why not try some Monte Carlo study, where many samples are generated,

> four_aic=function(i){
+   idx = sample(1:50,size=10,replace=FALSE)
+   train_cars = cars[-idx,]
+   valid_cars = cars[idx,]
+   reg1 = gnm(dist ~ speed, data=train_cars, 
    family=poisson(link="log"))
+   reg2 = gnm(dist ~ speed, data=valid_cars, 
    constrain = "*", constrainTo = 
    coef(reg1),family=poisson(link="log"))
+   a1=AIC(reg2)
+   reg0 = lm(dist ~ speed, data=train_cars)
+   reg1 = gnm(dist ~ speed, data=train_cars, 
    family=poisson(link="identity"), 
    start=c(1,1))
+   reg2 = gnm(dist ~ speed, data=valid_cars, 
    constrain = "*", constrainTo = 
    coef(reg1),family=poisson(link="identity"),
    start=c(1,1))
+   a2=AIC(reg2)
+   reg1 = gnm(dist ~ bs(speed), data=train_cars,
    family=poisson(link="log"))
+   reg2 = gnm(dist ~ bs(speed), data=valid_cars,
    constrain = "*", constrainTo = 
    coef(reg1),family=poisson(link="log"))
+   a3=AIC(reg2)  
+   reg1 = gnm(dist ~ bs(speed), data=train_cars,
    family=poisson(link="identity"))
+   reg2 = gnm(dist ~ bs(speed), data=valid_cars,
    constrain = "*", constrainTo = 
    coef(reg1),family=poisson(link="identity"))
+   a4=AIC(reg2)
+   return(c(a1,a2,a3,a4))}

Consider for instance 1,000 scenarios

> S = sapply(1:1000,four_aic)

The model that most often had the lowest AIC on the validation sample was the log link with splines

> rownames(S) = c("log","id","log+bs","id+bs")
> W = apply(S,2,which.min)  
> barplot(table(W)/10,names.arg=rownames(S))

And indeed,

> boxplot(t(S))

with that model, the AIC is usually lower for the spline model with a log link than for the others (or at least almost the same as for the spline model with an identity link). Or at least, we can confirm here that a nonlinear model should be better than a linear one.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.

15 Questions All R Users Have About Plots

(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

R allows you to create different plot types, ranging from the basic graph types like density plots, dot plots, bar charts, line charts, pie charts, boxplots and scatter plots, to the more statistically complex types of graphs such as probability plots, mosaic plots and correlograms.

In addition, R is pretty well known for its data visualization capabilities: it allows you to go from producing basic graphs with little customization to plotting advanced graphs with full-blown customization in combination with interactive graphics. Nevertheless, we do not always get the results that we want for our R plots: Stack Overflow is flooded with questions on plots, many of them recurring.

This is why DataCamp decided to put all the frequently asked questions and their top rated answers together in a blog post, complemented with additional short explanations that could be of use for R beginners.

question

If you are rather interested in learning how to plot with R, you might consider reading our tutorial on histograms, which covers basic R, ggplot2 and ggvis, or this shorter tutorial which offers an overview of simple graphs in R. However, if you’re looking to learn everything on creating stunning and informative graphical visualizations, our interactive course on (interactive) data visualization with ggvis will definitely interest you!

1. How To Draw An Empty R Plot?

How To Open A New Plot Frame

You can open an empty plot frame and activate the graphics device in R as follows:

plot.new() # or frame()

Note that the plot.new() and frame() functions define a new plot frame without it having any axes, labels, or outlining. It indicates that a new plot is to be made: a new graphics window will open if you don’t have one open yet, otherwise the existing window is prepared to hold the new plot. You can read up on these functions here.

  • x11() can also open a new graphics device in R for the X Window System (version 11)!
  • quartz() starts a graphics device driver for the OS X System.
  • windows() starts a new graphics device for Windows.

How To Set Up The Measurements Of The Graphics Window

You can also use the plot.window() function to set the horizontal and vertical dimensions of the empty plot you just made:

pWidth = 3
pHeight = 2
plot.window(c(0,pWidth),
            c(0,pHeight))

How To Draw An Actual Empty Plot

You can draw an empty plot with the plot() function:

plot(5, 
     5, 
     type="n", 
     axes=FALSE, 
     ann=FALSE, 
     xlim=c(0, 10), 
     ylim = c(0,10))

You give the coordinates (5,5) for the plot, but you don’t actually show any plotting, because you set the type argument to "n".

(Tip: try to put this argument to "p" or "b" and see what happens!)

What’s more, you don’t annotate the plot, but you do put limits from 0 to 10 on the x- and y-axis. Next, you can fill up your empty plot with axes, axes labels and a title with the following commands:

mtext("x-axis", 
      side=1) #Add text to the x-axis
mtext("y-axis",
      side=2) 
title("An R Plot") #Add a title

Note that if you want to know more about the side argument, you can just keep on reading! It will be discussed in more detail below, in question 3 about R plots.

Lastly, you may choose to draw a box around your plot by using the box() function and add some points to it with the points() function:

box() #Draw a box
points(5,  #Put (red) point in the plot at (5,5)
       5, 
       col="red") 
points(5, 
       7, 
       col="orange", 
       pch=3, 
       cex=2)
points(c(0, 0, 1), 
       c(2, 4, 6), 
       col="green", 
       pch=4)

Note that you can put your x-coordinates and y-coordinates in vectors to plot multiple points at once. The pch argument allows you to select a symbol, while the cex argument has a value assigned to it that indicates how much the plotted text and symbols should be scaled with respect to the default.

Tip: if you want to see what number links to what symbol, click here.

2. How To Set The Axis Labels And Title Of The R Plots?

The axes of R plots make up one of the most popular topics of Stack Overflow questions; the questions related to this topic are very diverse. Keep on reading to find out what type of questions DataCamp has found to be quite common!

How To Name Axes (With Up- Or Subscripts) And Put A Title To An R Plot?

You can easily name the axes and put a title in place to make your R plot more specific and understandable for your audience.

This can be easily done by adding the arguments main for the main title, sub for the subtitle, xlab for the label of the x-axis and ylab for the label of the y-axis:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y,
     main="main title", 
     sub="sub-title", 
     xlab="x-axis label", 
     ylab="y-axis label")

Note that if you want to have a title with up-or subscripts, you can easily add these with the following commands:

plot(1,
     1, 
     main=expression("title"^2)) #Upscript
plot(1,
     1, 
     main=expression("title"[2])) #Subscript

This all combined gives you the following plot:

Rplot1

Good to know: for those who want to have Greek letters in your axis labels, the following code can be executed:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y,
     xlab = expression(paste("Greek letter ", phi)),
     ylab = expression(paste("Greek letter ",mu)))

How To Adjust The Appearance Of The Axes’ Labels

To adjust the appearance of the x-and y-axis labels, you can use the arguments col.lab and cex.lab. The first argument is used to change the color of the x-and y-axis labels, while the second argument is used to determine the size of the x-and y-axis labels, relative to the (default) setting of cex.

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y,
     main=expression("main title"^2), 
     sub=expression("sub-title"[2]), 
     xlab="x-axis label", 
     ylab="y-axis label",
     col.lab="blue",
     cex.lab=0.75)

Rplot2

For more information on these arguments, go to this page.

How To Remove A Plot’s Axis Labels And Annotations

If you want to get rid of the axis values of a plot, you can first add the arguments xaxt and yaxt, set to "n". These arguments take a character which specifies the type of the x-axis and y-axis, respectively. If you pass in "n", as in the command below, the plotting of that axis is suppressed.

Note that by giving any other character to the xaxt and yaxt arguments, the x-and y-axes are plotted.

Next, you can add the annotation argument ann and set it to FALSE to make sure that any axis labels are removed.

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y,
     xaxt="n",
     yaxt="n",
     ann=FALSE)

Rplot3

Tip: not the information you are looking for? Go to this page.

How To Rotate A Plot’s Axis Labels

You can add the las argument to the axis() function to rotate the numbers that correspond to each axis:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y, 
     axes=FALSE)
box()
axis(2, 
     las=2)
axis(1, 
     las=0)

Rplot4

Note how this actually requires you to add an argument to the plot() function that basically says that there are no axes to be plotted (yet): this is the task of the two axis() calls that come next.

The las argument can take four values. Depending on which option you choose, the placement of the labels will differ: if you choose 0, the labels will always be parallel to the axis (which is the default); if you choose 1, they will be put horizontally. Pick 2 if you want them to be perpendicular to the axis, and 3 if you want them to be placed vertically.

But there is more. If you want to know more about the possibilities of the axis() function, keep on reading!

How To Move The Axis Labels Of Your R Plot

So, you want to move your axes’ labels around?

No worries, you can do this with the axis() function; As you may have noticed before in this tutorial, this function allows you to first specify where you want to draw the axes. Let’s say you want to draw the x-axis above the plot area and the y-axis to the right of it.

Remember that if you pass 1 or 2 to the axis() function, your axis will be drawn on the bottom and on the left of the plot area. Scroll a bit up to see an example of this in the previous piece of code!

This means that you will want to pass 3 and 4 to the axis() function:

x<-seq(0,2*pi,0.1)
y<-sin(x)
plot(x,
     y,
     axes=FALSE, # Do not plot any axes
     ann=FALSE) # Do not plot any annotations
axis(3)   # Draw the x-axis above the plot area
axis(4)   # Draw the y-axis to the right of the plot area
box()

Rplot5

As you can see, at first, you basically plot x and y, but you leave out the axes and annotations. Then, you add the axes that you want to see and specify their location with respect to the plot.

The flexibility that the axis() function creates for you keeps on growing! Check out the next frequently asked question to see what else you can solve by using this basic R function.

Tip: go to the last question to see more on how to move around text in the axis labels with hjust and vjust!

3. How To Add And Change The Spacing Of The Tick Marks Of Your R Plot

How To Change The Spacing Of The Tick Marks Of Your R Plot

Letting R determine the tick marks of your plot can be quite annoying and there might come a time when you will want to adjust these.

1. Using The axis() Function To Determine The Tick Marks Of Your Plot

Consider the following piece of code:

v1 <- c(0,pi/2,pi,3*pi/2,2*pi) # -> defines position of tick marks.
v2 <- c("0","Pi/2","Pi","3*Pi/2","2*Pi") # defines labels of tick marks.
x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y,
     xaxt = "n")
axis(side = 1, 
     at = v1, 
     labels = v2,
     tck=-.05)

Rplot6

As you can see, you first define the position of the tick marks and their labels. Then, you draw your plot, specifying the x axis type as "n", suppressing the plotting of the axis.

Then, the real work starts:

  • The axis() function allows you to specify the side of the plot on which the axis is to be drawn. In this case, the argument is completed with 1, which means that the axis is to be drawn below. If the value was 2, the axis would be drawn on the left and if the value was 3 or 4, the axis would be drawn above or to the right, respectively;
  • The at argument allows you to indicate the points at which tick marks are to be drawn. In this case, you use the positions that were defined in v1;
  • Likewise, the labels that you want to use are the ones that were specified in v2;
  • You adjust the direction of the ticks through tck: by giving this argument a negative value, you specify that the ticks should appear below the axis.

Tip: try passing a positive value to the tck argument and see what happens!

You can further specify the size of the ticks through tcl and the appearance of the tick labels is controlled with cex.axis, col.axis and font.axis.

v1 <- c(0,pi/2,pi,3*pi/2,2*pi) # -> defines position of tick marks.
v2 <- c("0","Pi/2","Pi","3*Pi/2","2*Pi") # defines labels of tick marks.
x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y,
     xaxt = "n")
axis(side = 1, 
     at = v1, 
     labels = v2,
     tck=-.1,
     tcl = -0.5,
     cex.axis=1.05,
     col.axis="blue",
     font.axis=5)

Rplot7

2. Using Other Functions To Determine The Tick Marks Of Your R Plot

You can also use the par() and plot() functions to define the positions of tick marks and the number of intervals between them.

Note that then you use the argument xaxp to which you pass the position of the tick marks that are located at the extremes of the x-axis, followed by the number of intervals between the tick marks:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y,
     xaxp = c(0, 2*pi, 5))

Rplot8

# Or
x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, 
     y, 
     xaxt="n")
par(xaxp= c(0, 2*pi, 2)) 
axis(1)

Rplot8b

Note that this only works if you are not using a logarithmic scale. If the log coordinates are set to TRUE, or, in other words, if par(xlog=TRUE), the three values that you pass to xaxp have a different meaning: for a small range, n is negative. Otherwise, n is in 1:3, specifying a case number, and x1 and x2 are the lowest and highest power of 10 inside the user coordinates, 10 ^ par("usr")[1:2].

An example:

n <- 1
x <- seq(0, 10000, 1)
y <- exp(n)/(exp(n)+x)
par(xlog=TRUE, 
    xaxp= c(1, 4, 3))
plot(x, 
     y, 
     log="x")

Rplot9
In this example, you use the par() function: you set xlog to TRUE and add the xaxp argument to give the coordinates of the extreme tick marks and the number of intervals between them. In this case, you set the minimal value to 1, the maximal value to 4 and you add that the number of intervals between each tick mark should be 3.

Then, you plot x and y, adding the log argument to specify whether to plot the x-axis, the y-axis or both on a log scale. You can pass "x", "y", and "xy" as values to the log argument to do this.

An example with both axes in logarithmic scale is:

n <- 1
x <- seq(0, 20, 1)
y <- exp(x)/(x)
par(xlog=TRUE, 
    xaxp= c(1, 4, 3))
par(ylog=TRUE, 
    yaxp= c(1, 11, 2)) 
plot(x, 
     y, 
     log="xy")

Rplot10

How To Add Minor Tick Marks To An R Plot

You can quickly add minor tick marks to your plot with the minor.tick() function from the Hmisc package:

plot.new()
library(Hmisc)
minor.tick(nx = 1.5, 
           ny = 2, 
           tick.ratio=0.75)
  • The nx argument allows you to specify the number of intervals in which you want to divide the area between the major tick marks on the axis. If you pass the value 1 to it, the minor tick marks will be suppressed;
  • ny allows you to do the same as nx, but then for the y-axis;
  • The tick.ratio indicates the ratio of the lengths of minor tick marks to major tick marks; the length of the latter is taken from the current par("tck") setting (see the fuller sketch after this list).
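
Since minor.tick() adds ticks to an existing plot, here is a slightly fuller sketch (assuming the Hmisc package is installed) that first draws a curve and then divides each major interval in two on both axes:

library(Hmisc)
x <- seq(0, 2*pi, 0.1)
y <- sin(x)
plot(x, y, type = "l")
minor.tick(nx = 2,            #two minor intervals between major x-axis ticks
           ny = 2,            #and the same on the y-axis
           tick.ratio = 0.5)  #minor ticks half the length of the major ones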

4. How To Create Two Different X- or Y-axes

The first option is to create a first plot and then execute the par() function with the new argument set to TRUE to prevent R from clearing the graphics device:

set.seed(101)
x <- 1:10
y <- rnorm(10)
z <- runif(10, min=1000, max=10000)
plot(x, y) 
par(new = TRUE)

Then, you create the second plot with plot(). This one is a line plot, with the axes, box and labels suppressed:

plot(x, 
     z, 
     type = "l", #Plot with lines
     axes = FALSE, #No axes
     bty = "n", #Box about plot is suppressed
     xlab = "",  #No labels on x-and y-axis
     ylab = "")

Note that the axes argument has been set to FALSE, while you also leave the x- and y-labels blank.

You also add a new axis on the right-hand side by adding the argument side and assigning it the value 4.

Next, you specify the at argument to indicate the points at which tick marks need to be drawn. In this case, you compute a sequence of n+1 equally spaced "round" values covering the range of the values in z with the pretty() function. This ensures that the tick marks actually match the values of the second series, which you named z.

Lastly, you add an axis label on the right-hand side:

axis(side=4, 
     at = pretty(range(z)))
mtext("z", 
      side=4, 
      line=3)

Note that the side argument can take four values: 1 to place the text at the bottom, 2 for a left placement, 3 for a top placement and 4 to put the text to the right. The line argument indicates on which margin line the text starts.

The end result will be like this:

Rplot11

Tip: Try constructing an R plot with two different x-axes! You can find the solution below:

plot.new()
set.seed(101)
x <- 1:10
y <- rnorm(10)
z <- runif(10, min=1000, max=10000) 
par(mar = c(5, 4, 4, 4) + 0.3)
plot(x, y)
par(new = TRUE)
plot(z, y, type = "l", axes = FALSE, bty = "n", xlab = "", ylab = "")
axis(side=3, at = pretty(range(z)))
mtext("z", side=3, line=3)

Rplot12

Note that twoord.plot() of the plotrix package and doubleYScale() of the latticeExtra package automate this process:

library(latticeExtra)
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
Age <- chol$AGE
Chol <- chol$CHOL
Smoke <- chol$SMOKE
State <- chol$MORT
a <- xyplot(Chol ~ Age|State)
b <- xyplot(Smoke ~ Age|State)
doubleYScale(a, b, style1 = 0, style2 = 3, add.ylab2 = TRUE, columns=3)

Rplot13

Note that the example above is made with this dataset. If you’re not sure how you can import your data, check out our tutorial on importing data in R.
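
The doubleYScale() example above uses lattice; as a rough base-graphics sketch of the same idea (assuming the plotrix package is installed), twoord.plot() can reuse the x, y and z vectors from the earlier example:

library(plotrix)
set.seed(101)
x <- 1:10
y <- rnorm(10)
z <- runif(10, min=1000, max=10000)
twoord.plot(lx = x, ly = y,      #left axis: y against x
            rx = x, ry = z,      #right axis: z against x
            xlab = "x", ylab = "y", rylab = "z")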

5. How To Add Or Change The R Plot’s Legend?

Adding And Changing An R Plot’s Legend With Basic R

You can easily add a legend to your R plot with the legend() function:

x <- seq(0,pi,0.1)
y1 <- cos(x)
y2 <- sin(x)
plot(c(0,3), c(0,3), type="n", xlab="x", ylab="y")
lines(x, y1, col="red", lwd=2)
lines(x, y2, col="blue", lwd=2)
legend("topright", 
       inset=.05, 
       cex = 1, 
       title="Legend", 
       c("Cosinus","Sinus"), 
       horiz=TRUE, 
       lty=c(1,1), 
       lwd=c(2,2), 
       col=c("red","blue"), 
       bg="grey96")

Rplot14

Note that the arguments pt.cex and title.cex that are described in the documentation of legend() don’t really work. There are some workarounds:

1. Put the title or the labels of the legend in a different font with text.font

x <- seq(0,pi,0.1)
y1 <- cos(x)
y2 <- sin(x)
plot(c(0,3), c(0,3), type="n", xlab="x", ylab="y")
lines(x, y1, col="red", lwd=2)
lines(x, y2, col="blue", lwd=2)
legend("topright", 
       inset=.05, 
       cex = 1, 
       title="Legend", 
       c("Cosinus","Sinus"), 
       horiz=TRUE, 
       lty=c(1,1), 
       lwd=c(2,2), 
       col=c("red","blue"), 
       bg="grey96",
       text.font=3)

Rplot15

2. Draw the legend twice with different cex values

x <- seq(0,pi,0.1)
y1 <- cos(x)
y2 <- sin(x)
plot(c(0,3), c(0,3), type="n", xlab="x", ylab="y")
lines(x, y1, col="red", lwd=2)
lines(x, y2, col="blue", lwd=2)
legend("topright",
       inset=.05,
       c("Cosinus","Sinus"),
       title="",
       horiz=TRUE,
       lty=c(1,1), 
       lwd=c(2,2), 
       col=c("red","blue"))
legend(2.05, 2.97, 
       inset=.05,
       c("",""),
       title="Legend",
       cex=1.15, 
       bty="n")

Rplot16

Tip: if you’re interested in knowing more about the colors that you can use in R, check out this very helpful PDF document.

How To Add And Change An R Plot’s Legend And Labels In ggplot2

Adding a legend to your ggplot2 plot is fairly easy. You can just execute the following:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) + 
  geom_point(aes(colour = factor(chol$MORT), shape=chol$SMOKE)) + 
  xlab("Weight") + 
  ylab("Height") 

Rplot17

And it gives you a default legend. But, in most cases, you will want to adjust the appearance of the legend some more.

There are two ways of changing the legend title and labels in ggplot2:

1. If you have specified arguments such as colour or shape, or other aesthetics, you need to change the names and labels through scale_color_discrete and scale_shape_discrete, respectively:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) + 
  geom_point(aes(colour = factor(chol$MORT), 
                 shape = chol$SMOKE)) + 
  xlab("Weight") + 
  ylab("Height") + 
  theme(legend.position=c(1,0.5),
        legend.justification=c(1,1)) + 
  scale_color_discrete(name ="Condition", 
                       labels=c("Alive", "Dead")) +
  scale_shape_discrete(name="Smoker", 
                       labels=c("Non-smoker", "Sigare", "Pipe" ))

Rplot18

Note that you create two legends because you map the shape aesthetic inside geom_point() in addition to colour, and give each its own scale with names and labels!

If you want to move the legend to the bottom of the plot, you can specify the legend.position as "bottom". The legend.justification argument, on the other hand, allows you to position the legend inside the plot area.
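
For example, a minimal sketch that places the legend under the plot (reusing the chol plot from above) is:

library(ggplot2)
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) + 
  geom_point(aes(colour = factor(chol$MORT), shape = chol$SMOKE)) + 
  xlab("Weight") + 
  ylab("Height") + 
  theme(legend.position = "bottom")   #both legends drawn below the plot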

Tip: check out all kinds of scales that could be used to let ggplot know that other names and labels should be used here.

2. Change the data frame so that the factor has the desired form. For example:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
# Recode the factor levels to the names that you want to see in the legend;
# the pattern is levels(x)[levels(x)=="old name"] <- "new name"
levels(chol$SMOKE)[levels(chol$SMOKE)=="Non-smoker"] <- "Non-smoker"
levels(chol$SMOKE)[levels(chol$SMOKE)=="Sigare"] <- "Sigare"
levels(chol$SMOKE)[levels(chol$SMOKE)=="Pipe"] <- "Pipe"
# Rename the column itself so that the legend title changes as well
names(chol)[names(chol)=="SMOKE"]  <- "Smoker"

You can then use the new factor names to make your plot in ggplot2, avoiding the “hassle” of changing the names and labels with extra lines of code in your plotting.

Tip: for a complete cheat sheet on ggplot2, you can go here.

6. How To Draw A Grid In Your R Plot?

Drawing A Grid In Your R Plot With Basic R

For some purposes, you might find it necessary to include a grid in your plot. You can easily add a grid to your plot by using the grid() function:

x <- c(1,2,3,4,5)
y <- 2*x
plot(x,y)
grid(10,10)

Rplot19

Drawing A Grid In An R Plot With ggplot2

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) + 
  geom_point(aes(colour = factor(chol$MORT), shape = chol$SMOKE)) + 
  xlab("Weight") + 
  ylab("Height") + 
  scale_color_discrete(name ="Condition", labels=c("Alive", "Dead")) +
  scale_shape_discrete(name="Smoker", labels=c("Non-smoker", "Sigare", "Pipe" )) +
  theme(legend.position=c(1,0.5),
        legend.justification=c(1,1),
        panel.grid.major = element_line(colour = "grey40"),
        panel.grid.minor = element_line(colour = "grey40"))

Rplot20

Tip: if you don’t want to have the minor grid lines, just pass element_blank() to panel.grid.minor. If you want to fill the background up with a color, add the panel.background = element_rect(fill = "navy") to your code, just like this:

library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) + 
  geom_point(aes(colour = factor(chol$MORT), shape = chol$SMOKE)) + 
  xlab("Weight") + 
  ylab("Height") + 
  scale_color_discrete(name ="Condition", labels=c("Alive", "Dead")) +
  scale_shape_discrete(name="Smoker", labels=c("Non-smoker", "Sigare", "Pipe" )) +
  theme(legend.position=c(1,0.5),
        legend.justification=c(1,1),
        panel.grid.major = element_line(colour = "grey40"),
        panel.grid.minor = element_line(colour = "grey40"),
        panel.background = element_rect(fill = "navy")
        )

7. How To Draw A Plot With A PNG As Background?

You can quickly draw a plot with a .png as a background with the help of the png package. You install the package if you need to, load it into your workspace with library() and you can start plotting!

install.packages("png")
library(png)

First, you want to load in the image. Use the readPNG() function to specify the path to the picture!

image <- readPNG("<path to your picture>")

Tip: you can check where your working directory is set at and change it by executing the following commands:

getwd()
setwd("<path to a folder>")

If your picture is saved in your working directory, you can just specify readPNG("picture.png") instead of passing the whole path.

Next, you want to set up the plot area:

plot(1:2, type='n', main="Plotting Over an Image", xlab="x", ylab="y")

And you want to call the par() function:

lim <- par()

The par() call above retrieves the current graphical parameters so you can use them in rasterImage(). Its usr component holds the extremes of the user coordinates of the plotting region, in the order x-min, x-max, y-min, y-max; that is why elements 1, 3, 2 and 4 are passed as the left, bottom, right and top corners of the image:

rasterImage(image, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])

Next, you draw a grid and add some lines:

grid()
lines(c(1, 1.2, 1.4, 1.6, 1.8, 2.0), c(1, 1.3, 1.7, 1.6, 1.7, 1.0), type="b", lwd=5, col="red")

This can give you the following result if you use the DataCamp logo:

library(png)
image <- readPNG("datacamp.png")
plot(1:2, type="n", main="Plotting Over an Image", xlab="x", ylab="y", asp=1)
lim <- par()
rasterImage(image, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])
lines(c(1, 1.2, 1.4, 1.6, 1.8, 2.0), c(1.5, 1.3, 1.7, 1.6, 1.7, 1.0), type="b", lwd=5, col="red")

Rplot21

Note that you need to give a .png file as input to readPNG()!

8. How To Adjust The Size Of Points In An R Plot?

Adjusting The Size Of Points In An R Plot With Basic R

To adjust the size of the points with basic R, you might just simply use the cex argument:

x <- c(1,2,3,4,5)
y <- c(6,7,8,9,10)
plot(x,y,cex=2,col="red")

Remember, however, that R allows you to have much more control over your symbols through the symbols() function:

df <- data.frame(x1=1:10, 
                 x2=sample(10:99, 10), 
                 x3=10:1)
symbols(x=df$x1, 
        y=df$x2, 
        circles=df$x3, 
        inches=1/3, 
        ann=F, 
        bg="steelblue2", 
        fg=NULL)

The circles of this plot receive the values of df$x3 as their radii, while the argument inches controls the size of the symbols. When this argument receives a positive number as input, the symbols are scaled so that their largest dimension is this size in inches.

Adjusting The Size Of Points In Your R Plot With ggplot2

In this case, you will want to adjust the size of the points in your scatterplot. You can do this with the size argument:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
#A fixed point size goes in geom_point(), outside of aes()
ggplot(chol, 
       aes(x=chol$WEIGHT, y=chol$HEIGHT)) + 
  geom_point(size = 2)
#Note: passing size = 2 to ggplot() itself has no effect; the size
#argument belongs to the layer, i.e. geom_point()

9. How To Fit A Smooth Curve To Your R Data

The loess() function is probably every R programmer’s favorite solution for this kind of question. It actually “fits a polynomial surface determined by one or more numerical predictors, using local fitting”.

In short, you have your data:

x <- 1:10
y <- c(2,4,6,8,7,12,14,16,18,20)

And you use the loess() function, in which you model y as a function of x. Through this, you specify the numeric response and one to four numeric predictors:

lo <- loess(y~x) ### estimations between data

You plot x and y:

plot(x,y)

And you plot lines in the original plot where you predict the values of lo:

lines(predict(lo))

Which gives you the following plot:

Rplot22

10. How To Add Error Bars In An R Plot

Drawing Error Bars With Basic R

The bad news: R can’t draw error bars just like that. The good news: you can still draw the error bars without needing to install extra packages!

#Load the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, 
                   measurevar="CHOL", 
                   groupvars=c("MORT","SMOKE"))
#Plot the data
plot(cholc$N, 
     cholc$CHOL,
     ylim=range(c(cholc$CHOL-cholc$sd, cholc$CHOL+cholc$sd)),
     pch=19, 
     xlab="Cholesterol Measurements", 
     ylab="Cholesterol Mean +/- SD",
     main="Scatterplot With sd Error Bars"
)

#Draw arrows of a "special" type
arrows(cholc$N, 
       cholc$CHOL-cholc$sd, 
       cholc$N, 
       cholc$CHOL+cholc$sd, 
       length=0.05, 
       angle=90, 
       code=3)

If you want to read up on all the arguments that arrows() can take, go here.

Drawing Error Bars With ggplot2

Error Bars Representing Standard Error Of Mean

First summarize your data with the summarySE() function from the Rmisc package:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, 
                   measurevar="CHOL", 
                   groupvars=c("MORT","SMOKE"))

Then, you can use the resulting dataframe to plot some of the variables, drawing error bars for them at the same time, with, for example, the standard error of mean:
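
A minimal sketch of such a plot, using the se column that summarySE() adds to cholc, could look like this:

library(ggplot2)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) + 
    geom_errorbar(aes(ymin=CHOL-se, ymax=CHOL+se, group=MORT), 
                  width=.1) +
    geom_line(aes(group=MORT)) +
    geom_point()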

Rplot23

If you want to change the position of the error bars, for example, when they overlap, you might consider using the position_dodge() function:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, 
                   measurevar="CHOL", 
                   groupvars=c("MORT","SMOKE"))
#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) + 
    geom_errorbar(aes(ymin=CHOL-se, ymax=CHOL+se, group=MORT), 
                  width=.1, 
                  position=pd) +
    geom_line(aes(group=MORT)) +
    geom_point()

Rplot24
Tip: if you get a message like "geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?", it usually means that ggplot can't tell which points belong on the same line; setting the group aesthetic explicitly, as with group=MORT above, fixes it.

Error Bars Representing Confidence Intervals

Continuing from the summary of your data that you made with the summarySE() function, you can also draw error bars that represent confidence intervals. In this case, a plot with error bars of 95% confidence are plotted.

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, 
                   measurevar="CHOL", 
                   groupvars=c("MORT","SMOKE"))
#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) + 
    geom_errorbar(aes(ymin=CHOL-ci, ymax=CHOL+ci, group=MORT), 
                  width=.1, 
                  colour="black",
                  position=pd) +
    geom_line(aes(group=MORT)) +
    geom_point()

Rplot25
Note how the color of the error bars is now set to black with the colour argument.

Error Bars Representing The Standard Deviation

Lastly, you can also use the results of the summarySE() function to plot error bars that represent the standard deviation. Specifically, you would just have to adjust the ymin and ymax arguments that you pass to geom_errorbar():

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, 
                   measurevar="CHOL", 
                   groupvars=c("MORT","SMOKE"))
#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) + 
    geom_errorbar(aes(ymin=CHOL-sd, ymax=CHOL+sd, group=MORT), 
                  width=.1,
                  position=pd) +
    geom_line(aes(group=MORT)) +
    geom_point()

Big tip: also take a look at this for more detailed examples on how to plot means and error bars.

11. How To Save A Plot As An Image On Disc

You can use dev.copy() to copy the graph shown in the current graphics device to a file device, such as jpeg or png, at the location that you specify; dev.off() then closes that device and writes the file:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,y)
dev.copy(jpeg,
         filename="<path to your file/name.jpg>")
dev.off() #close the jpeg device so that the file is actually written to disk
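
As an alternative sketch, you can also open a file device directly, draw into it, and close it, which avoids copying from the screen device; the file name below is just an example:

x <- seq(0,2*pi,0.1)
y <- sin(x)
png(filename = "sine.png", width = 800, height = 600)  #example name, written to the working directory
plot(x, y)
dev.off()   #close the png device so the file is written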

12. How To Plot Two R Plots Next To Each Other?

How To Plot Two Plots Side By Side Using Basic R

You can do this with basic R commands:

d0 <- matrix(rnorm(15), ncol=3)
d1 <- matrix(rnorm(15), ncol=3)

limits <- range(d0,d1) #Set limits 

par(mfrow = c(1, 2)) 
boxplot(d0,
        ylim=limits)
boxplot(d1,
        ylim=limits)

By adding the par() function with the mfrow argument, you specify a vector, which in this case contains 1 and 2: all figures will then be drawn in a 1-by-2 array on the device by rows (mfrow). In other words, the boxplots from above will be printed in one row inside two columns.
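
A small side note: par() returns the previous settings when you change them, so you can restore the one-plot-per-page layout afterwards (this sketch reuses d0, d1 and limits from the code above):

op <- par(mfrow = c(1, 2))   #save the old settings while changing them
boxplot(d0, ylim = limits)
boxplot(d1, ylim = limits)
par(op)                      #back to a single plot per page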

How To Plot Two Plots Next To Each Other Using ggplot2

If you want to put plots side by side and if you don’t want to specify limits, you can consider using the ggplot2 package to draw your plots side-by-side:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, 
                   measurevar="CHOL", 
                   groupvars=c("MORT","SMOKE"))
#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) + 
  geom_errorbar(aes(ymin=CHOL-sd, ymax=CHOL+sd, group=MORT), 
                width=.1, 
                position=pd) + 
  geom_line(aes(group=MORT)) + 
  geom_point() + 
  facet_grid(. ~ MORT)

Rplot26
Note how you just add the facet_grid() function to indicate that you want two plots next to each other. The variable that determines how the panels are split is MORT, as you can see above!

How To Plot More Plots Side By Side Using gridExtra

To get plots printed side by side, you can use the gridExtra package; make sure you have the package installed and loaded in your workspace and then execute something like this:

library(gridExtra)
plot1 <- qplot(1)
plot2 <- qplot(1)
grid.arrange(plot1, 
             plot2, 
             ncol=2)

Note how here again you determine how the two plots will appear to you thanks to the ncol argument.

How To Plot More Plots Side By Side Using lattice

Just like the solution with the ggplot2 package, the lattice package doesn't require you to specify limits or the way you want your plots printed next to each other.

Instead, you use bwplot() to make trellis graphs of the box plot type. Trellis graphs display a variable, or the relationship between variables, conditioned on one or more other variables.

In this case, if you’re using the chol data set (which you can find here or load in with the read.table() function given below), you display the variable CHOL separately for every combination of factor SMOKE and MORT levels:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
#Plot two plots side by side
library(lattice)
bwplot(~ CHOL|SMOKE+MORT,
       chol)

Rplot27

Plotting Plots Next To Each Other With gridBase

Yet another way to put two plots next to each other is by using the gridBase package, which takes care of the “integration of base and grid graphics”. This could be handy when you want to put a basic R plot and a ggplot next to each other.

You work as follows: first, you activate the necessary packages in your workspace. In this case, you want to have gridBase ready to put the two plots next to each other and grid and ggplot2 to actually make your plots:

library(grid)
library(gridBase)
library(ggplot2)
plot.new()
gl <- grid.layout(nrow=1, 
                  ncol=2)
vp.1 <- viewport(layout.pos.col=1, 
                 layout.pos.row=1) 
vp.2 <- viewport(layout.pos.col=2, 
                 layout.pos.row=1)
pushViewport(viewport(layout=gl))
pushViewport(vp.1)
par(new=TRUE, 
    fig=gridFIG())
plot(x = 1:10, 
     y = 10:1)
popViewport()
pushViewport(vp.2)
ggplotted <- qplot(x=1:10, y=10:1, geom='point')
print(ggplotted, newpage = FALSE)
popViewport(1)

Rplot28
If you want and need it, you can start an empty plot:

plot.new()

To then set up the layout:

gl <- grid.layout(nrow=1, 
                  ncol=2)

Note that since you want the two plots to be generated next to each other, this requires you to make a grid layout consisting of one row and two columns.

Now, you want to fill up the cells of the grid with viewports. These define rectangular regions on your graphics device with the help of coordinates within those regions. In this case, it’s much more handy to use the specifications of the grid that have just been described above rather than real x-or y-coordinates. That is why you should use the layout.pos.col and layout.pos.row arguments:

vp.1 <- viewport(layout.pos.col=1, 
                 layout.pos.row=1) 
vp.2 <- viewport(layout.pos.col=2, 
                 layout.pos.row=1)

Note again that since you want the two plots to be generated next to each other, you want to put one plot in the first column and the other in the second column, both located on the first row.

Since the viewports are only descriptions or definitions, these kinds of objects need to be pushed onto the viewport tree before you can see any effect on the drawing. You want to use the pushViewport() function to accomplish this:

pushViewport(viewport(layout=gl))

Note the pushViewport() function takes the viewport() function, which in itself contains a layout argument. This last argument indicates “a grid layout object which splits the viewport into subregions”.

Remember that you started out making one of those objects.

Now you can proceed to adding the first rectangular region vp.1 to the ViewPort tree:

pushViewport(vp.1)

After which you tell R with gridFIG() to draw a base plot within a grid viewport (vp.1, that is). The fig argument normally takes the coordinates of the figure region in the display region of the device. In this case, you use the fig argument to start a new plot, adding it to the existing one by also setting new = TRUE in the par() function. You plot the base graphic and remove the viewport from the tree:

par(new=TRUE, 
    fig=gridFIG())
plot(x = 1:10, 
     y = 10:1)
popViewport()

Note that you can specify in the popViewport() function an argument to indicate how many viewports you want to remove from the tree. If this value is 0, this indicates that you want to remove the viewports right up to the root viewport. The default value of this argument is 1.

Go on to add the second rectangular region vp.2 to the ViewPort tree. You can then make the ggplot and remove the viewport from the tree.

pushViewport(vp.2)
ggplotted <- qplot(x=1:10,
                   y=10:1, 
                   geom='point')
print(ggplotted, 
      newpage = FALSE)
popViewport(1)

Note that you need print() to actually draw the graphics object made by qplot() and get it displayed. At the same time, you also want to specify newpage = FALSE, otherwise you'll only see the qplot() output on a fresh page, without the base plot next to it.

Also remember that the default value of viewports to remove in the function popViewport() is set at 1. This makes it kind of redundant to put popViewport(1) in the code.

13. How To Plot Multiple Lines Or Points?

Using Basic R To Plot Multiple Lines Or Points In The Same R Plot

To plot two or more graphs in the same plot, you basically start by making a usual basic plot in R. An example of this could be:

x <- seq(0,pi,0.1)
y1 <- cos(x)
plot(x,
     y1,
     type="l",
     col = "red")

Then, you start adding more lines or points to the plot. In this case, you add more lines, so you'll define more y variables:

y2 <- sin(x)
y3 <- tan(x)
y4 <- log(x)

Then, you plot these y variables with the lines() function:

lines(x,y2,col="green")
lines(x,y3,col="black")
lines(x,y4,col="blue")

This gives the following result:

Rplot29

Note that the lines() function takes the x and y coordinates of the points that you want to join, plus the color (the col argument) in which you want to draw them. You can also include the following features:

Feature Argument Input
Line type lty Integer or character string
Line width lwd Positive number
Plotting symbol pch Integer or single character
Line end style lend Integer or string
Line join style ljoin Integer or string
Line mitre limit lmitre Number larger than 1

Here are some examples:

lines(x,y2,col="green", lty = 2, lwd = 3)
lines(x,y2,col="green", lty = 5, lwd = 2, pch = 2)
lines(x,y3,col="black", lty = 3, lwd = 5, pch = 3, lend = 0, ljoin = 2)
lines(x,y4,col="blue", lty = 1, lwd = 2, pch = 3, lend = 2, ljoin = 1, lmitre = 2)

Note that the pch argument does not function all that well with the lines() function and that it’s best to use it only with points().

Tip: if you want to plot points in the same graph, you can use the points() function:

y5 <- x^3
points(x,
       y5,
       col="yellow")

You can add the same arguments to the points() function as you did with the lines() function and that are listed above. There are some additions, though:

Feature Argument Input
Background (fill) color bg Only if pch = 21:25
Character (or symbol) expansion cex Number

Code examples of these arguments are the following:

points(x,y4,col="blue", pch=21, bg = "red") 
points(x, y5, col="yellow", pch = 5, bg = "blue") 

If you incorporate these changes into the plot that you see above, you will get the following result:

x <- seq(0,pi,0.1)
y1 <- cos(x)
plot(x,y1,type="l" ,col = "red") #basic graphical object
y2 <- sin(x)
y3 <- tan(x)
y4 <- log(x)
y5 <- x^3
lines(x,y2,col="green", lty = 1, lwd = 3) #first layer
lines(x,y2,col="green", lty = 3, lwd = 2, pch = 2) #second layer
lines(x,y3,col="black", lty = 2, lwd = 1, pch = 3, lend = 0, ljoin = 2) #third layer
points(x,y4,col="blue", pch=21, bg = "red") #fourth layer
points(x, y5, col="yellow", pch = 24, bg = "blue") #fifth layer

Rplot30

Using ggplot2 To Plot Multiple Lines Or Points In One R Plot

The ggplot2 package conveniently allows you to create layers, which lets you plot two or more graphs in the same R plot quite easily:

library(ggplot2) 
x <- 1:10
y1 <- c(2,4,6,8,7,12,14,16,18,20)
y2 <- rnorm(10, mean = 5)
df <- data.frame(x, y1, y2)
ggplot(df, aes(x)) +  # basic graphical object
  geom_line(aes(y=y1), 
            colour="red") +  # first layer
  geom_line(aes(y=y2),      # second layer
            colour="green")  

14. How To Fix The Aspect Ratio For Your R Plots

If you want to save your R plot as an image in which the axes keep their proportions, that's a sign that you want to fix the aspect ratio.

Adjusting The Aspect Ratio With Basic R

When you're working with basic R commands to produce your plots, you can add the asp argument of the plot() function, set to a numeric value, to fix your aspect ratio. Look at this first example without a defined aspect ratio:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,y)

Rplot31

And compare this now to the plot where the aspect ratio is defined with the argument asp:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x,
     y, 
     asp=2)

Rplot32

Adjusting The Aspect Ratio For Your Plots With ggplot2

To fix the aspect ratio for ggplot2 plots, you just add the function coord_fixed(), which provides a “fixed scale coordinate system [that] forces a specified ratio between the physical representation of data units on the axes”.

In other words, this function allows you to specify a number of units on the y-axis which is equivalent to one unit on the x-axis. The default is always set at 1, which means that one unit on the x-axis has the same length as one unit on the y-axis. If your ratio is set at a higher value, the units on the y-axis are longer than units on the x-axis and vice versa.

Compare the following examples:

library(ggplot2)
df <- data.frame(
  x = runif(100, 0, 5),
  y = runif(100, 0, 5))

ggplot(df, aes(x=x, y=y)) + geom_point()

Rplot33

versus

library(ggplot2)
df <- data.frame(
  x = runif(100, 0, 5),
  y = runif(100, 0, 5))

ggplot(df, aes(x=x, y=y)) + 
  geom_point() + 
  coord_fixed(ratio=1)

Rplot34
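
To see the effect of a ratio other than 1, here is one more sketch: with ratio = 2, one unit on the y-axis is drawn twice as long as one unit on the x-axis, so the point cloud looks stretched vertically.

library(ggplot2)
df <- data.frame(
  x = runif(100, 0, 5),
  y = runif(100, 0, 5))

ggplot(df, aes(x=x, y=y)) + 
  geom_point() + 
  coord_fixed(ratio=2)   #y units drawn twice as long as x units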

Adjusting The Aspect Ratio For Your Plots With MASS

You can also consider using the MASS package, which encompasses the eqscplot() function: it produces plots with geometrically equal scales. It does this for scatterplots:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(MASS)
x = chol$HEIGHT
y = chol$WEIGHT
z = as.numeric(chol$MORT)

eqscplot(x, 
         y, 
         ratio = 1, 
         col=c("red", "green")[z], #use z (mortality status) to pick the color of each point
         pch=c(1,2)[z])            #and its plotting symbol

Rplot35

Tip: you might do well starting a new plot frame before executing the code above!

Note that you can give additional arguments to the eqscplot() function to customize the scatterplot’s look!

15. What Is The Function Of hjust And vjust In ggplot2?

Well, you basically use these arguments when you want to set the position of text in your ggplot. hjust allows you to define the horizontal justification, while vjust is meant to control the vertical justification. See the documentation on geom_text() for more information.

To demonstrate what exactly happens, you can create a data frame from all combinations of factors with the expand.grid() function:

hjustvjust <- expand.grid(hjust=c(0, 0.5, 1),
                          vjust=c(0, 0.5, 1),
                          angle=c(0, 45, 90),
                          text="Text"
                          )

Note that hjust and vjust typically take values between 0 and 1.

  • 0 means that the text is left-justified; In other words, all text is aligned to the left margin. This is usually what you see when working with text editors such as Word.
  • 1 means that the text is right-justified: all text is aligned to the right margin.

Then, you can plot the data frame that you have just made above with the ggplot() function, defining the x-and y-axis as “hjust” and “vjust” respectively:

library(ggplot2)
ggplot(hjustvjust, aes(x=hjust, y=vjust)) + 
    geom_point() +
    geom_text(aes(label=text, 
                  angle=angle, 
                  hjust=hjust, 
                  vjust=vjust)) + 
    facet_grid(~angle) +
    scale_x_continuous(breaks=c(0, 0.5, 1), 
                       expand=c(0, 0.2)) +
    scale_y_continuous(breaks=c(0, 0.5, 1), 
                       expand=c(0, 0.2))

Rplot36

Also note how the hjust and vjust arguments are added to geom_text(), which takes care of the textual annotations to the plot.

In the plot above you see that the text at the point (0,0) is left-aligned, horizontally as well as vertically. On the other hand, the text at point (1,1) is right-aligned in the horizontal as well as the vertical direction. The point (0.5,0.5) sits right in the middle: the text there is centered, neither left- nor right-aligned, in both the horizontal and the vertical direction.

Note that when these arguments are defined to change the axis text, the horizontal alignment for axis text is defined in relation to the entire plot, not to the x-axis!

DF <- data.frame(x=LETTERS[1:3],
                 y=1:3)
p <- ggplot(DF, aes(x,y)) + 
  geom_point() + 
  ylab("Very long label for y") + 
  theme(axis.title.y=element_text(angle=0))


p1 <- p + theme(axis.title.x=element_text(hjust=0)) + xlab("X-axis at hjust=0")
p2 <- p + theme(axis.title.x=element_text(hjust=0.5)) + xlab("X-axis at hjust=0.5")
p3 <- p + theme(axis.title.x=element_text(hjust=1)) + xlab("X-axis at hjust=1")

library(gridExtra)
grid.arrange(p1, p2, p3)

Rplot37

Also try for yourself what defining the vjust argument to change the axis text does to the representation of your plot:

DF <- data.frame(x=c("ana","b","cdefghijk","l"),
                 y=1:4)
p <- ggplot(DF, aes(x,y)) + geom_point()

p1 <- p + theme(axis.text.x=element_text(vjust=0, colour="red")) + 
        xlab("X-axis labels aligned with vjust=0")
p2 <- p + theme(axis.text.x=element_text(vjust=0.5, colour="red")) + 
        xlab("X-axis labels aligned with vjust=0.5")
p3 <- p + theme(axis.text.x=element_text(vjust=1, colour="red")) + 
        xlab("X-axis labels aligned with vjust=1")


library(gridExtra)
grid.arrange(p1,p2,p3)

To go to the original excellent discussion, from which the code above was adapted, click here.

As A Last Note…

It’s really worth checking out this article, which lists 10 tips for making your R graphics look their best!

Also, if you want to know more about data visualization, you might consider checking out DataCamp’s interactive course on data visualization with ggvis, given by Garrett Grolemund, author of Hands on Programming with R, as well as Data Science with R.

Or maybe our course on reporting with R Markdown can interest you!


The post 15 Questions All R Users Have About Plots appeared first on The DataCamp Blog .

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.


Use box plots to assess the distribution and to identify the outliers in your dataset


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

After you check the distribution of the data by plotting the histogram, the second thing to do is to look for outliers. Identifying the outliers is important because it might happen that an association you find in your analysis can be explained by the presence of outliers.

The best tool to identify the outliers is the box plot. Through box plots we find the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum of a continuous variable. The function to build a box plot is boxplot().

Let's see this example:

# load data
data("iris")

names(iris)
"Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# build the box plot
boxplot(iris$Sepal.Length, main="Box plot", ylab="Sepal Length")

This will generate the following box plot. Each horizontal line, starting from the bottom, shows the minimum, lower quartile, median, upper quartile and maximum value of Sepal.Length.
Plot-Box-plot

We can use box plots to explore the distribution of a continuous variable across strata. I say strata because the grouping variable should be categorical. For example, you may want to see the distribution of age among individuals with and without high blood pressure. In the example below, I'm showing the length of sepal in different species.

# load data
data("iris")

# names of variables
names(iris)
"Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# build the box plot
boxplot(Sepal.Length ~ Species, data=iris,
     main="Box Plot",
     xlab="Species",
     ylab="Sepal Length")

This will generate the following box plots:
Plot-Box-plot-compare
If you look at the bottom of the third box plot you will find an outlier. If you find an outlier in your dataset, I suggest removing it. Although removing outliers deserves a post of its own, for now you can check your dataset and remove the observation manually. However, there are also functions which remove outliers automatically.
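
One way to flag such observations is boxplot.stats(), which returns the values that the standard boxplot rule marks as outliers; a minimal sketch for the third group (virginica) is:

#Values flagged by the boxplot rule for the virginica group
out_vals <- boxplot.stats(iris$Sepal.Length[iris$Species == "virginica"])$out
out_vals

#Drop those rows manually if you decide they should be removed
iris_clean <- iris[!(iris$Species == "virginica" &
                     iris$Sepal.Length %in% out_vals), ]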

Feel free to post comments if you have any question.

To leave a comment for the author, please follow the link and comment on his blog: DataScience+.


Free R Help


(This article was first published on AriLamstein.com » R, and kindly contributed to R-bloggers)

Today I am giving away 10 sessions of free, online, one-on-one R help. My hope is to get a better understanding of how my readers use R, and the issues they face when working on their own projects. The sessions will be over the next two weeks, online and 30-60 minutes each. I just purchased Screenhero, which will allow me to screen share during the sessions.

If you would like to reserve a session then please contact me using this form and describe the project that you want help with.  It can be anything, really. But here are some niches within R that I have a lot of experience with:

  • Packages that I have created
  • Analyzing web site data using R and MySQL
  • Exploratory data analysis using ggplot2, dplyr, etc.
  • Creating apps with Shiny
  • Creating reports using RMarkdown and knitr
  • Developing your own R package
  • Working with shapefiles in R
  • Working with public data sets
  • Marketing R packages

Parting Image

I've included an image, plus the code to create it, in every blog post I've done. I'd hate to stop now just because of the free giveaway. So here's a comparison of two ways to view the distribution of Per Capita Income in the Census Tracts of Orange County, California:

orange-county-tract-income

On the right is a boxplot of the data, which shows the distribution of the values. On the left is a choropleth, which shows us where the values are. The choropleth uses a continuous scale, which highlights outliers. Here is the code to create the map. Note that the choroplethrCaCensusTract package is on github, not CRAN.

library(choroplethrCaCensusTract)
data(df_ca_tract_demographics)
df_ca_tract_demographics$value = df_ca_tract_demographics$per_capita_income

choro = ca_tract_choropleth(df_ca_tract_demographics, 
                            legend      = "Dollars",
                            num_colors  = 1,
                            county_zoom = 6059)

library(ggplot2)
library(scales)
bp = ggplot(df_ca_tract_demographics, aes(value, value)) +
  geom_boxplot() + 
  theme(axis.text.x = element_blank()) +
  labs(x = "", y = "Dollars") +
  scale_y_continuous(labels=comma)

library(gridExtra)
grid.arrange(top = "Orange County, CalifornianCensus Tracts, Per Capita Income", 
            choro, 
            bp, 
            ncol = 2)



The post Free R Help appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on his blog: AriLamstein.com » R.


Importing the New Zealand Income Survey SURF


(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

The quest for income microdata

For a separate project, I've been looking for source data on income and wealth inequality. Not aggregate data like Gini coefficients or the percentage of income earned by the bottom 20% or top 1%, but the sources used to calculate those things. Because it's sensitive personal financial data either from surveys or tax data, it's not easy to get hold of, particularly if you want to do it in the comfort of your own home and don't have time / motivation to fill in application forms. So far, what I've got is a simulated unit record file (SURF) from the New Zealand Income Survey, published by Statistics New Zealand for educational and instructional purposes. It's a simplified, simulated version of the original survey and it lets me make graphics like this one:

This plot shows the density of income from all sources for four of the most prominent ethnicity types in New Zealand. New Zealand official statistics allow people to identify with more than one ethnicity, which means there is some double counting (more on that below). Three things leap out at me from this chart:

  1. The density curves of the different ethnic groups are very similar visually
  2. People of Maori and Pacific Peoples ethnicity have proportionately more $300-$400 per week earners than Europeans and Asians, leading to an overall noticeable lean to the lower numbers
  3. Weekly income is bimodal, bunching up at $345 per week and $820 per week. Actually, this image is misleading in that respect; in reality it is trimodal, with a huge bunch of people with zero income (and some negative), who aren’t shown on this plot because of the logarithmic scale. So you could say that, for New Zealanders getting any income at all, there is a bimodal distribution.

Where’s that bimodal distribution coming from? The obvious candidate is part time workers, and this is seen when we plot income versus hours worked in the next image:

(That interesting diagonal line below which there are very few points is the minimum wage per hour)

Statistics New Zealand warn that while this simulated data is an accurate representation of the actual New Zealand population, it shouldn't be used for precise statistics, so for now I won't draw particularly strong conclusions on anything. A simulated unit record file is a great way of addressing confidentiality concerns, but in the end it has been created by a statistical model. There's a risk that interesting inferences might be just picking up something implicit in the model that wasn't taken into account when it was first put together. That's not likely to be the case for the basic distribution of the income values, but we'll note the exploratory finding for now and move on.

A box plot is better for examining the full range of reported ethnicities but not so good for picking up the bi-modal subtlety. It should also be noted that both these graphs delete people who made losses (negative income) in a given week – the data show income from all sources:

Loading the NZIS SURF into a database

Here's how I drew the graphs, and estimated the two modes. I think I'll want to re-use this data quite often, so it's worthwhile putting it in a database that's accessible from different projects without having to move a bunch of data files around in Windows Explorer. The way Statistics New Zealand have released the file, with codings rather than values for dimensions like ethnicity and region, also makes a database a good way to make the data analysis ready.

After importing the data, the first significant job is to deal with that pesky ethnicity variable. In the released version of the data, respondents with two ethnicities have both code numbers joined together, e.g. 12 means both European (1) and Maori (2). To get around this I split the data into two fact tables, one with a single row per respondent with most of the data; and a second just for ethnicity with either one or two rows for each respondent. Here's how I do that with a combination of {dplyr} and {tidyr}:

library(dplyr)
library(RODBC)
library(tidyr)
library(mbie) # for AskCreds, which is alternatively directly available 
              # https://github.com/nz-mbie/mbie-r-package/blob/master/pkg/R/Creds.R 

# imports, clean up, and save to database the data from
# http://www.stats.govt.nz/tools_and_services/microdata-access/nzis-2011-cart-surf.aspx

url <- "http://www.stats.govt.nz/~/media/Statistics/services/microdata-access/nzis11-cart-surf/nzis11-cart-surf.csv"
nzis <- read.csv(url)


#-----------------fact tables------------------
# Create a main table with a primary key

f_mainheader <- nzis %>%
   mutate(survey_id = 1:nrow(nzis))

# we need to normalise the multiple ethnicities, currently concatenated into a single variable
cat("max number of ethnicities is", max(nchar(nzis$ethnicity)), "n")

f_ethnicity <- f_mainheader %>%
   select(ethnicity, survey_id) %>%
   mutate(First = substring(ethnicity, 1, 1),
          Second = substring(ethnicity, 2, 2)) %>%
   select(-ethnicity) %>%
   gather(ethnicity_type, ethnicity_id, -survey_id) %>%
   filter(ethnicity_id != "") 
   
# drop the original messy ethnicity variable and tidy up names on main header
f_mainheader <- f_mainheader %>%
   select(-ethnicity) %>%
   rename(region_id = lgr,
          sex_id = sex,
          agegrp_id = agegrp,
          qualification_id = qualification,
          occupation_id = occupation)

The second step is to re-create the dimension tables that turn the codes (e.g. 1 and 2) into meaningful values (European and Maori). Statistics New Zealand provide these, but unfortunately in an Excel workbook that's easier for humans than computers to link up to the data. There are not too many of them, so it's easy enough to code them by hand, which the next set of code does:

#-----------------dimension tables------------------
# all drawn from the data dictionary available at the first link given above
d_sex <- data_frame(sex_id = 1:2, sex = c("male", "female"))

d_agegrp <- data_frame(
   agegrp_id = seq(from = 15, to = 65)) %>%
   mutate(agegrp = ifelse(agegrp_id == 65, "65+", paste0(agegrp_id, "-", agegrp_id + 4)))

d_ethnicity <- data_frame(ethnicity_id = c(1,2,3,4,5,6,9),
                          ethnicity = c(
                             "European",
                             "Maori",
                             "Pacific Peoples",
                             "Asian",
                             "Middle Eastern/Latin American/African",
                             "Other Ethnicity",
                             "Residual Categories"))


d_occupation <- data_frame(occupation_id = 1:10,
                       occupation = c(
                          "Managers",
                          "Professionals",
                          "Technicians and Trades Workers",
                          "Community and Personal Service Workers",
                          "Clerical and Adminsitrative Workers",
                          "Sales Workers",
                          "Machinery Operators and Drivers",
                          "Labourers",
                          "Residual Categories",
                          "No occupation"                          
                       ))


d_qualification <- data_frame(qualification_id = 1:5,
                        qualification = c(
                           "None",
                           "School",
                           "Vocational/Trade",
                           "Bachelor or Higher",
                           "Other"
                        ))

d_region <- data_frame(region_id =1:12,
                       region = c("Northland", "Auckland", "Waikato", "Bay of Plenty", "Gisborne / Hawke's Bay",
                                  "Taranaki", "Manawatu-Wanganui", "Wellington", 
                                  "Nelson/Tasman/Marlborough/West Coast", "Canterbury", "Otago", "Southland"))

The final step in the data cleaning is to save all of our tables to a database, create some indexes so they work nice and fast, and join them up in an analysis-ready view. In the below I use an ODBC (open database connectivity) connection to a MySQL server called “PlayPen”. R plays nicely with databases; set it up and forget about it.

#---------------save to database---------------
creds <- AskCreds("Credentials for someone who can create databases")

PlayPen <- odbcConnect("PlayPen_prod", uid = creds$uid, pwd = creds$pwd)
try(sqlQuery(PlayPen, "create database nzis11") )
sqlQuery(PlayPen, "use nzis11")

# fact tables.  These take a long time to load up with sqlSave (which adds one row at a time)
# but it's easier (quick and dirty) than creating a table and doing a bulk upload from a temp 
# file.  Any bigger than this you'd want to bulk upload though - took 20 minutes or more.
sqlSave(PlayPen, f_mainheader, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, f_ethnicity, addPK = TRUE, rownames = FALSE) 
                                            # add a primary key on the fly in this case.  All other tables
                                            # have their own already created by R.

# dimension tables
sqlSave(PlayPen, d_sex, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_agegrp, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_ethnicity, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_occupation, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_qualification, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_region, addPK = FALSE, rownames = FALSE)

#----------------indexing----------------------

sqlQuery(PlayPen, "ALTER TABLE f_mainheader ADD PRIMARY KEY(survey_id)")

sqlQuery(PlayPen, "ALTER TABLE d_sex ADD PRIMARY KEY(sex_id)")
sqlQuery(PlayPen, "ALTER TABLE d_agegrp ADD PRIMARY KEY(agegrp_id)")
sqlQuery(PlayPen, "ALTER TABLE d_ethnicity ADD PRIMARY KEY(ethnicity_id)")
sqlQuery(PlayPen, "ALTER TABLE d_occupation ADD PRIMARY KEY(occupation_id)")
sqlQuery(PlayPen, "ALTER TABLE d_qualification ADD PRIMARY KEY(qualification_id)")
sqlQuery(PlayPen, "ALTER TABLE d_region ADD PRIMARY KEY(region_id)")

#---------------create an analysis-ready view-------------------
# In Oracle we'd use a materialized view, which MySQL can't do.  But
# the below is fast enough anyway:

sql1 <-
   "CREATE VIEW vw_mainheader AS SELECT sex, agegrp, occupation, qualification, region, hours, income FROM
      f_mainheader a   JOIN
      d_sex b          on a.sex_id = b.sex_id JOIN
      d_agegrp c       on a.agegrp_id = c.agegrp_id JOIN
      d_occupation e   on a.occupation_id = e.occupation_id JOIN
      d_qualification f on a.qualification_id = f.qualification_id JOIN
      d_region g       on a.region_id = g.region_id"

sqlQuery(PlayPen, sql1)

Average weekly income in NZ 2011 by various slices and dices

Whew, that's out of the way. Next post that I use this data I can go straight to the database. We're now in a position to check our data matches the summary totals provided by Statistics New Zealand. Statistics New Zealand say this SURF can be treated as a simple random sample, which means each point can get an identical individual weight, which we can estimate from the summary tables in their data dictionary. Each person in the sample represents roughly 117 people in the population (in the tables below I have population figures in thousands, to match the Statistics New Zealand summaries).
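
As a rough sketch of where the 0.11744 used in the queries below comes from (reconstructed from the sex totals shown further down, so treat it as a back-of-the-envelope check):

#Population aged 15+ in thousands (female + male totals from the data dictionary)
pop_thousands <- 1787.1 + 1674.0
#The SURF has 15217 + 14254 = 29471 respondents, so each one represents roughly
pop_thousands * 1000 / 29471   #about 117 people, i.e. 0.11744 thousand per respondent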

Statistics New Zealand doesn't provide region and occupation summary statistics, and the qualification summaries they provide use a more detailed classification than is in the actual SURF. But for the other categories – sex, age group, and the tricky ethnicity – my results match theirs, so I know I haven't munged the data.


sex        Mean   Sample   Population
female      611    15217      1787.10
male        779    14254      1674.00
tab1 <- sqlQuery(PlayPen, 
"SELECT 
                              sex,
                              ROUND(AVG(income))          as Mean, 
                              COUNT(1)                    as Sample,
                              ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY sex")
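The same cross-check can be done on the R side instead of in SQL; here is a minimal sketch with dplyr, assuming the whole view is first pulled down into a data frame (vw is just an illustrative name):

# cross-check tab1 in R rather than in MySQL (sketch)
library(dplyr)
vw <- sqlQuery(PlayPen, "SELECT * FROM vw_mainheader")
vw %>%
   group_by(sex) %>%
   summarise(Mean       = round(mean(income)),
             Sample     = n(),
             Population = round(n() * 0.11744, 1))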


agegrp   Mean   Sample   Population
15-19     198     2632       309.10
20-24     567     2739       321.70
25-29     715     2564       301.10
30-34     796     2349       275.90
35-39     899     2442       286.80
40-44     883     2625       308.30
45-49     871     2745       322.40
50-54     911     2522       296.20
55-59     844     2140       251.30
60-64     816     1994       234.20
65+       421     4719       554.20
tab2 <- sqlQuery(PlayPen, 
"SELECT 
                              agegrp,
                              ROUND(AVG(income))          as Mean, 
                              COUNT(1)                    as Sample,
                              ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY agegrp")


qualification        Mean   Sample   Population
School                564     7064       829.60
None                  565     6891       809.30
Other                 725     1858       218.20
Vocational/Trade      734     8435       990.60
Bachelor or Higher    955     5223       613.40
# qualification summary in data dictionary uses a different classification to that in data
tab3 <- sqlQuery(PlayPen, 
"SELECT 
                              qualification,
                              ROUND(AVG(income))          as Mean, 
                              COUNT(1)                    as Sample,
                              ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY qualification
                           ORDER BY Mean")


occupation                                 Mean   Sample   Population
No occupation                               257    10617      1246.90
Labourers                                   705     2154       253.00
Residual Categories                         726       24         2.80
Community and Personal Service Workers      745     1734       203.60
Sales Workers                               800     1688       198.20
Clerical and Administrative Workers         811     2126       249.70
Technicians and Trades Workers              886     2377       279.20
Machinery Operators and Drivers             917     1049       123.20
Professionals                              1105     4540       533.20
Managers                                   1164     3162       371.30
# occupation summary not given in data dictionary
tab4 <- sqlQuery(PlayPen, 
"SELECT 
                             occupation,
                             ROUND(AVG(income))          as Mean, 
                             COUNT(1)                    as Sample,
                             ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY occupation
                           ORDER BY Mean")


region                                  Mean   Sample   Population
Bay of Plenty                            620     1701       199.80
Taranaki                                 634      728        85.50
Waikato                                  648     2619       307.60
Southland                                648      637        74.80
Manawatu-Wanganui                        656     1564       183.70
Northland                                667     1095       128.60
Nelson/Tasman/Marlborough/West Coast     680     1253       147.20
Otago                                    686     1556       182.70
Gisborne / Hawke’s Bay                   693     1418       166.50
Canterbury                               701     4373       513.60
Auckland                                 720     9063      1064.40
Wellington                               729     3464       406.80
# region summary not given in data dictionary
tab5 <- sqlQuery(PlayPen, 
"SELECT 
                             region,
                             ROUND(AVG(income))          as Mean, 
                             COUNT(1)                    as Sample,
                             ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY region
                           ORDER BY Mean ")


ethnicity                                 Mean   Sample   Population
Residual Categories                        555       85        10.00
Maori                                      590     3652       428.90
Pacific Peoples                            606     1566       183.90
Middle Eastern/Latin American/African      658      343        40.30
Asian                                      678     3110       365.20
European                                   706    22011      2585.00
Other Ethnicity                            756      611        71.80
tab6 <- sqlQuery(PlayPen,
"SELECT
                     ethnicity,
                     ROUND(AVG(income))          as Mean,
                     COUNT(1)                    as Sample,
                     ROUND(COUNT(1) * .11744, 1) as Population
                  FROM f_mainheader m
                  JOIN f_ethnicity e ON m.survey_id = e.survey_id
                  JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id
                  GROUP BY ethnicity
                  ORDER BY Mean")
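Note that ethnicity sits in its own fact table with one row per person-ethnicity combination, so respondents reporting more than one ethnicity appear in more than one row above (total-response ethnicity); this is why the Sample column sums to more than the overall sample size. A quick query to see how many people that affects (a sketch against the same tables):

# how many respondents have more than one ethnicity recorded?  (sketch)
sqlQuery(PlayPen,
         "SELECT COUNT(*) AS people_with_multiple_ethnicities
            FROM (SELECT survey_id
                    FROM f_ethnicity
                   GROUP BY survey_id
                  HAVING COUNT(*) > 1) AS multi")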

Graphics showing density of income

Finally, here’s the code that drew the charts we started with, showing the distribution of weekly income for New Zealanders of different ethnicities.

library(showtext)
library(RODBC)
library(ggplot2)
library(scales)   # for the dollar() label formatter used in the axis scales below
library(dplyr)
font.add.google("Poppins", "myfont")
showtext.auto()

# download individual level data from database including ethnicity join
dtf <- sqlQuery(PlayPen,
"SELECT
                   ethnicity,
                   income
                FROM f_mainheader m
                JOIN f_ethnicity e ON m.survey_id = e.survey_id
                JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id")

# density plot of income by ethnicity
dtf %>%
   filter(ethnicity %in% c("Asian", "European", "Maori", "Pacific Peoples")) %>%
   ggplot(aes(x = income, colour = ethnicity)) +
   geom_density(size = 1.1) +
   scale_x_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 345, 825, 10000)) +
   theme_minimal(base_family = "myfont") +
   theme(legend.position = "bottom") +
   scale_colour_brewer("", palette = "Set1")

# boxplot of income by ethnicity
dtf %>%
   ggplot(aes(y = income, x = ethnicity, colour = ethnicity)) +
   geom_boxplot() +
   geom_rug() +
   scale_y_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 1000, 10000)) +
   theme_minimal() +
   labs(x = "") +
   coord_flip() +
   scale_colour_brewer(palette = "Set2") +
   theme(legend.position = "none")
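One caveat with the log scales in both charts: weekly incomes of zero or less cannot be drawn and are silently dropped by ggplot2. A quick check of how much data that affects (sketch):

# rows lost to the log scale (zero or negative weekly income)
sum(dtf$income <= 0)
round(100 * mean(dtf$income <= 0), 1)   # as a percentage of rows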

# scatter plot of joint density of hours and income
dtf2 <- sqlQuery(PlayPen, "select hours, income from vw_mainheader")
ggplot(dtf2, aes(x = hours, y = income)) +
   geom_jitter(alpha = 0.05) +
   scale_x_log10("Hours worked", breaks = c(1, 10, 20, 40, 80)) +
   scale_y_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 345, 825, 10000)) +
   theme_minimal(base_family = "myfont")   
   
   
# how did I choose to mark $345 and $825 on the scale?
# ripped off (and improved) from http://stackoverflow.com/questions/27418461/calculate-the-modes-in-a-multimodal-distribution-in-r
find_modes<- function(data, ...) {
   dens <- density(data, ...)
   y <- dens$y
   modes <- NULL
   for ( i in 2:(length(y) - 1) ){
      if ( (y[i] > y[i - 1]) & (y[i] > y[i + 1]) ) {
         modes <- c(modes,i)
      }
   }
   if ( length(modes) == 0 ) {
      # indexing dens$x with the message would just return NA, so return the message directly
      return('This is a monotonic distribution')
   }
   return(dens$x[modes])
}

x <- dtf$income
x[x < 1] <- 1 # not interested in negative income just now

# where are those modes?
exp(find_modes(log(x)))
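And a quick visual sanity check that those are sensible points to mark on the scale (a sketch):

# overlay the detected modes on the log-income density to eyeball them
modes <- exp(find_modes(log(x)))
plot(density(log(x)), main = "Log of weekly income, with detected modes")
abline(v = log(modes), col = "red", lty = 2)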

Edited 18 August 2015 to add the scatter plot of the joint density of hours and income.


Hypothesis-Driven Development Part II


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

This post will evaluate signals based on the rank regression hypotheses covered in the last post.

The last time around, we saw that rank regression had a very statistically significant result. The next step, then, is to evaluate the basic signal itself, that is, whether there is statistical significance in the actual trading signal. Since the strategy from SeekingAlpha simply selects the top-ranked ETF every month, this is a very easy signal to evaluate.

Simply put: for each formation period of 1 to 24 months, rank the ETFs on the cumulative sum of their monthly returns, select the highest-ranked ETF, and hold it for one month.

Here’s the code to evaluate the signal (continued from the last post), given the returns, a formation-period length in months, and an equal-weight (EW) portfolio to compare against.


signalBacktest <- function(returns, nMonths, ewPortfolio) {
  # cumulative return over the nMonths formation period, per asset
  nMonthAverage <- apply(returns, 2, runSum, n = nMonths)
  nMonthAverage <- xts(nMonthAverage, order.by = index(returns))
  # cross-sectional rank each month (5 = best of the five ETFs)
  nMonthAvgRank <- t(apply(nMonthAverage, 1, rank))
  nMonthAvgRank <- xts(nMonthAvgRank, order.by = index(returns))
  selection <- (nMonthAvgRank == 5) * 1 # select highest formation-period performance
  sigTest <- Return.portfolio(R = returns, weights = selection)
  difference <- sigTest - ewPortfolio
  # per-month signal-to-noise ratios for the signal and for its excess over EW
  diffZscore <- mean(difference) / sd(difference)
  sigZscore <- mean(sigTest) / sd(sigTest)
  return(list(sigTest, difference, mean(sigTest), sigZscore, mean(difference), diffZscore))
}

ewPortfolio <- Return.portfolio(monthRets, rebalance_on="months")

sigBoxplots <- list()
excessBoxplots <- list()
sigMeans <- list()
sigZscores <- list()
diffMeans <- list()
diffZscores <- list()
for(i in 1:24) {
  tmp <- signalBacktest(monthRets, nMonths = i, ewPortfolio)
  sigBoxplots[[i]] <- tmp[[1]]
  excessBoxplots[[i]] <- tmp[[2]]
  sigMeans[[i]] <- tmp[[3]]
  sigZscores[[i]] <- tmp[[4]]
  diffMeans[[i]] <- tmp[[5]]
  diffZscores[[i]] <- tmp[[6]]
}

sigBoxplots <- do.call(cbind, sigBoxplots)
excessBoxplots <- do.call(cbind, excessBoxplots)
sigMeans <- do.call(c, sigMeans)
sigZscores <- do.call(c, sigZscores)
diffMeans <- do.call(c, diffMeans)
diffZscores <- do.call(c, diffZscores)

par(mfrow=c(2,1))
plot(as.numeric(sigMeans)*100, type='h', main = 'signal means', 
     ylab = 'percent per month', xlab='formation period')
plot(as.numeric(sigZscores), type='h', main = 'signal Z scores', 
     ylab='Z scores', xlab='formation period')

plot(as.numeric(diffMeans)*100, type='h', main = 'mean difference between signal and EW',
     ylab = 'percent per month', xlab='formation period')
plot(as.numeric(diffZscores), type='h', main = 'difference Z scores',
     ylab = 'Z score', xlab='formation period')

boxplot(as.matrix(sigBoxplots), main = 'signal boxplots', xlab='formation period')
abline(h=0, col='red')
points(sigMeans, col='blue')

boxplot(as.matrix(sigBoxplots[,1:12]), main = 'signal boxplots 1 through 12 month formations', 
        xlab='formation period')
abline(h=0, col='red')
points(sigMeans[1:12], col='blue')

boxplot(as.matrix(excessBoxplots), main = 'difference (signal - EW) boxplots', 
        xlab='formation period')
abline(h=0, col='red')
points(diffMeans, col='blue')   # overlay the mean excess return for each formation period

boxplot(as.matrix(excessBoxplots[,1:12]), main = 'difference (signal - EW) boxplots 1 through 12 month formations', 
        xlab='formation period')
abline(h=0, col='red')
points(diffMeans[1:12], col='blue')

To recap what’s going on here: I compare the signal against the equal-weight portfolio, and take means and z scores both of the signal returns themselves and of their difference from the equal-weight portfolio. I plot these values, along with boxplots of the distributions of the signal process and of the difference between the signal process and the equal-weight portfolio.
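For clarity, the “z score” here is simply the mean of the monthly series divided by its standard deviation, a per-month signal-to-noise ratio rather than a formal test statistic; a formal test scales that ratio by the square root of the number of months. A sketch for, say, the 12-month formation period:

# per-month signal-to-noise ratio versus a formal test, 12-month formation period (sketch)
excess12 <- as.numeric(excessBoxplots[, 12])
mean(excess12, na.rm = TRUE) / sd(excess12, na.rm = TRUE)   # essentially diffZscores[12]
t.test(excess12)   # t statistic is roughly that ratio times sqrt(number of months)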

Here are the results:




Note that the percentages are already multiplied by 100, so in the best cases the rank strategy outperforms the equal-weight strategy by about 30 basis points per month. However, these results are not even in the same parking lot as statistical significance, let alone the same ballpark.

Now, in case some readers haven’t yet read Brian Peterson’s paper on strategy development: the point of hypothesis-driven development is to *reject* hypothetical strategies as soon as possible, before looking at any sort of equity curve and trying to explain away periods of underperformance. At this point, I would like to reject this entire strategy, because there is no statistical evidence to continue. Furthermore, because August 2015 was a rather interesting month, especially in terms of volatility dispersion, I want to return to volatility trading strategies, now backed by hypothesis-driven development.

If anyone wants to see me continue on to rule testing with this process, let me know. If not, I have more ideas on the way.

Thanks for reading.

NOTE: while I am currently consulting, I am always open to networking, meeting up (Philadelphia and New York City both work), consulting arrangements, and job discussions. Contact me through my email at ilya.kipnis@gmail.com, or through my LinkedIn, found here.


Hypothesis Driven Development Part III: Monte Carlo In Asset Allocation Tests


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

This post will show how to use Monte Carlo to test for signal intelligence.

Although I rejected this strategy in the last post, I was asked to do a Monte Carlo analysis of a thousand random portfolios to see how the various signal processes performed against that distribution. The process is quite simple: since I select one asset each month to hold, I generate a random integer between 1 and the number of assets (5 in this case) and hold that asset for the month. Repeat for the number of months in the sample, then repeat the whole exercise a thousand times, and see where the signal processes fall across the resulting distribution.

I didn’t use parallel processing here, since the parallel back ends available differ between Windows and Linux-based R, and in the interest of having the code work across all machines, I decided to leave it off (though a portable sketch using base R’s parallel package follows the timing loop below).

Here’s the code:

randomAssetPortfolio <- function(returns) {
  numAssets <- ncol(returns)
  numPeriods <- nrow(returns)
  # pick one asset at random (with replacement) for every month
  assetSequence <- sample.int(numAssets, numPeriods, replace=TRUE)
  wts <- matrix(nrow = numPeriods, ncol=numAssets, 0)
  wts <- xts(wts, order.by=index(returns))
  for(i in 1:nrow(wts)) {
    wts[i,assetSequence[i]] <- 1   # 100% weight in the randomly chosen asset that month
  }
  randomPortfolio <- Return.portfolio(R = returns, weights = wts)
  return(randomPortfolio)
}

t1 <- Sys.time()
randomPortfolios <- list()
set.seed(123)
for(i in 1:1000) {
  randomPortfolios[[i]] <- randomAssetPortfolio(monthRets)
}
randomPortfolios <- do.call(cbind, randomPortfolios)
t2 <- Sys.time()
print(t2-t1)
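As an aside to the earlier note on parallelism: base R’s parallel package works on Windows and Linux alike, so a portable version of the loop above could look roughly like the following (an untested sketch, not what was run for the results below):

# portable parallel version of the Monte Carlo loop (sketch)
library(parallel)
cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, { library(xts); library(PerformanceAnalytics) })   # packages the workers need
clusterExport(cl, c("randomAssetPortfolio", "monthRets"))
clusterSetRNGStream(cl, 123)   # reproducible random draws across workers
randomPortfoliosPar <- parLapply(cl, 1:1000, function(i) randomAssetPortfolio(monthRets))
stopCluster(cl)
randomPortfoliosPar <- do.call(cbind, randomPortfoliosPar)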

algoPortfolios <- sigBoxplots[,1:12]
randomStats <- table.AnnualizedReturns(randomPortfolios)
algoStats <- table.AnnualizedReturns(algoPortfolios)

par(mfrow=c(3,1))
hist(as.numeric(randomStats[1,]), breaks = 20, main = 'histogram of monte carlo annualized returns',
     xlab='annualized returns')
abline(v=as.numeric(algoStats[1,]), col='red')
hist(as.numeric(randomStats[2,]), breaks = 20, main = 'histogram of monte carlo volatilities',
     xlab='annualized vol')
abline(v=as.numeric(algoStats[2,]), col='red')
hist(as.numeric(randomStats[3,]), breaks = 20, main = 'histogram of monte carlo Sharpes',
     xlab='Sharpe ratio')
abline(v=as.numeric(algoStats[3,]), col='red')

allStats <- cbind(randomStats, algoStats)
aggregateMean <- apply(allStats, 1, mean)
aggregateDevs <- apply(allStats, 1, sd)

algoPs <- 1-pnorm(as.matrix((algoStats - aggregateMean)/aggregateDevs))
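Those p-values lean on a normal approximation to the Monte Carlo distribution. An empirical alternative that avoids the normality assumption is simply the share of random portfolios that do at least as well as each signal variant; a sketch for the Sharpe ratio row:

# empirical p-values: proportion of random portfolios with a Sharpe at least as high (sketch)
empiricalSharpeP <- sapply(seq_len(ncol(algoStats)), function(i)
   mean(as.numeric(randomStats[3, ]) >= as.numeric(algoStats[3, i])))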

plot(as.numeric(algoPs[1,])~c(1:12), main='Return p-values',
     xlab='Formation period', ylab='P-value')
abline(h=0.05, col='red')
abline(h=.1, col='green')

plot(1-as.numeric(algoPs[2,])~c(1:12), ylim=c(0, .5), main='Annualized vol p-values',
     xlab='Formation period', ylab='P-value')
abline(h=0.05, col='red')
abline(h=.1, col='green')

plot(as.numeric(algoPs[3,])~c(1:12), main='Sharpe p-values',
     xlab='Formation period', ylab='P-value')
abline(h=0.05, col='red')
abline(h=.1, col='green')

And here are the results:


In short, compared to monkeys throwing darts (to borrow some phrasing from the Price Action Lab blog), these signal processes are only marginally intelligent, if at all, depending on the variation one chooses. Still, it was recommended that I see this process through to the end and evaluate rules, so next time I’ll evaluate one easy-to-implement rule.

Thanks for reading.

NOTE: while I am currently consulting, I am always open to networking, meeting up (Philadelphia and New York City both work), consulting arrangements, and job discussions. Contact me through my email at ilya.kipnis@gmail.com, or through my LinkedIn, found here.
