Which plot can be used to display the relationship between a numeric and a categorical variable?


>

Adapted from Frank E Harrell, Jr: on Graphics:

http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatGraphCourse/graphscourse.pdf

  • Bar charts have many problems:
    • High ink to information ratio
    • Error bars cause perception errors
    • Can only show one-sided confidence intervals well
    • Thick bars reduce the number of  categories that can be shown
    • Labels on vertical bar charts may be difficult to read
  • Dots plots are almost always better
  • Consider multi-panel side-by-side display for comparing several contrasting or similar cases
    • Use same scales for both x- and y-axes across different panels
  • Consider ordering categories by values represented, for more accurate perception

Bar Charts, Error Bars and Dot Plots

As noted previously, bar charts can be problematic. Here is another one presenting means and error bars, but the error bars are misleading because they only extend in one direction. A better alternative would have been to to use full error bars with a scatter plot, as illustrated previously (right).

Which plot can be used to display the relationship between a numeric and a categorical variable?

Source: Hummer BT, Li XL, Hassel BA (2001) Role for p53 in gene

induction by double-stranded RNA. J Virol 75:7774-7777, Figure 4

Which plot can be used to display the relationship between a numeric and a categorical variable?

Consider the four graphs below presenting the incidence of cancer by type. The upper left graph unnecessary uses bars, which take up a lot of ink. This layout also ends up making the fonts for the types of cancer too small. Small font is also a problem for the dot plot at the upper right, and this one also has unnecessary grid lines across the entire width.

The graph at the lower left has more readable labels and uses a simple dot plot, but the rank order is difficult to figure out.

The graph at the lower right is clearly the best, since the labels are readable, the magnitude of incidence is shown clearly by the dot plots, and the cancers are sorted by frequency.

*************************

+

Which plot can be used to display the relationship between a numeric and a categorical variable?

Which plot can be used to display the relationship between a numeric and a categorical variable?

Which plot can be used to display the relationship between a numeric and a categorical variable?

Which plot can be used to display the relationship between a numeric and a categorical variable?

Single Continuous Numeric Variable

In this situation a cumulative distribution function conveys the most information and requires no grouping of the variable. A box plot will show selected quantiles effectively, and box plots are especially useful when stratifying by multiple categories of another variable.

Histograms are also possible. Consider the examples below.

Density Plot

Histogram

Box Plot

Which plot can be used to display the relationship between a numeric and a categorical variable?

Which plot can be used to display the relationship between a numeric and a categorical variable?

Which plot can be used to display the relationship between a numeric and a categorical variable?

Two Variables

Adapted from Frank E. Harrell Jr. on graphics: 

http://biostat.mc.vanderbiltedu/twiki/pub/Main/StatGraphCourse/graphscourse.pdf

Two categorical variables

  • Use frequency table
    • One categorical variable and other continuous variable
  • Box plots of continuous variable values for each category of categorical variable
  • Side-by-side dot plots (means + measure of uncertainty, SE or confidence interval)
    • Do not link means across categories!

Two continuous variables

  • Scatter plot of raw data if sample size is not too large
  • Prediction with confidence bands

 The two graphs below summarize BMI (Body Mass Index) measurements in four categories, i.e., younger and older men and women. The graph on the left shows the means and 95% confidence interval for the mean in each of the four groups. This is easy to interpret, but the viewer cannot see that the data is actually quite skewed. The graph on the right shows the same information presented as a box plot. With this presentation method one gets a better understanding of the skewed distribution and how the groups compare.

Which plot can be used to display the relationship between a numeric and a categorical variable?

Which plot can be used to display the relationship between a numeric and a categorical variable?

The next example is a scatter plot with a superimposed smoothed line of prediction. The shaded region embracing the blue line is a representation of the 95% confidence limits for the estimated prediction. This was created using "ggplot" in the R programming language.

Which plot can be used to display the relationship between a numeric and a categorical variable?

Source: Frank E. Harrell Jr. on graphics:  http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatGraphCourse/graphscourse.pdf (page 121)

Multivariate Data

  • If there aren't too many variables, it may be possible display the relationship among variables using a line plot with multiple lines.
  • Another option is to display the data multiple panels rather than a single plot with multiple lines than may be hard to distinguish.
  • In any event, be sure to use consistent axes and colors across panels.

The example below shows the use of multiple panels.

Which plot can be used to display the relationship between a numeric and a categorical variable?

Source: Cleveland S. The Elements of Graphing Data. Hobart Press, Summit, NJ, 1994.

return to top | previous page | next page

Which plot shows relationship between a numerical variable and a categorical variable?

To study the relationship between two variables, a comparative bar graph will show associations between categorical variables while a scatterplot illustrates associations for measurement variables.

How do you find the relationship between categorical and numerical variables?

If your categorical variable is dichotomous (only two values), then you can use the point-biserial correlation. There is a function to do this in the ltm package. You could do a logistic regression and use various evaluations of it (accuracy, etc.) in place of a correlation coefficient.

What graph is used for numerical and categorical variables?

Histogram: Histograms, similar to bar graphs, use rectangular bars whose heights correspond to frequency. However, bar graphs plot categorical data and have gap between each bar, whereas histograms plot numerical data and are continuous (no gaps).

Is a scatter plot numerical or categorical?

Categorical Scatter Plots Both strip plots and swarm plots are essentially scatter plots where one variable is categorical. I like to use them as additions to other kinds of plots, which we'll discuss below as they are useful for quickly visualizing the number of data points in a group.