For a few years now I’ve been chipping away at Regression Modeling Strategies by Frank Harrell. Between the exposition, the case studies, the notes, the references, the code, and the exercises, it’s a deep and rich source of statistical learning. The book’s associated R package, {rms}, is used throughout the book and is one of the reasons it can be tough sledding at times, especially if you’re like me and learned regression using the base R functions lm() and glm(). It’s not that {rms} is hard to use. In fact it’s quite easy to use. But using functions such as summary() and anova() with {rms} models produces very different output than what you get with a base R lm() model and can seem baffling to the uninitiated. In this post I hope to explain a little of what {rms} is doing by comparing it to the more traditional approaches in R so commonly taught in classrooms and textbooks.
Model fitting and summary output
To begin, let’s load the gala data from the {faraway} package. The gala data set contains data on species diversity on the Galapagos Islands. Below we use the base R lm() function to model the number of plant Species found on an island as a function of the Area of the island (km\(^2\)), the highest Elevation of the island (m), the distance from the Nearest island (km), the distance from Santa Cruz island (km), and the Area of the Adjacent island (km\(^2\)). This is how I and thousands of others learned to do regression in R. Of course we use the familiar summary() function on our model object to see the model coefficients, marginal tests, the residual standard error, R-squared, etc.
library(faraway)
data("gala")
m <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
summary(m)
##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
Now let’s fit the same model using the {rms} package. For this we use the ols() function. Notice we can use R’s formula syntax as usual. One additional argument we need that will come in handy in a few moments is x = TRUE. This stores the predictor variables with the model fit as a design matrix. Notice we don’t need to call summary() on the saved model object. We simply print it.
library(rms)
mr <- ols(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala, x = TRUE)
mr
## Linear Regression Model
##
## ols(formula = Species ~ Area + Elevation + Nearest + Scruz +
## Adjacent, data = gala, x = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 30 LR chi2 43.55 R2 0.766
## sigma60.9752 d.f. 5 R2 adj 0.717
## d.f. 24 Pr(> chi2) 0.0000 g 105.768
##
## Residuals
##
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 7.0682 19.1542 0.37 0.7154
## Area -0.0239 0.0224 -1.07 0.2963
## Elevation 0.3195 0.0537 5.95 <0.0001
## Nearest 0.0091 1.0541 0.01 0.9932
## Scruz -0.2405 0.2154 -1.12 0.2752
## Adjacent -0.0748 0.0177 -4.23 0.0003
The coefficient tables and summaries of residuals are the same. Likewise both return R-squared and Adjusted R-squared. The residual standard error from the lm() output is called sigma in the ols() output.
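As a quick check of that correspondence, base R has a sigma() extractor (available since R 3.3) that pulls the residual standard error directly from the lm() fit:
sigma(m)  # ~60.9752, matching sigma in the ols() output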
The ols() output also includes a “g” statistic. This is Gini’s mean difference and measures the dispersion of predicted values. It’s an alternative to standard deviation. We can use the {Hmisc} function GiniMd to calculate Gini’s mean difference for the model fit with lm() as follows.
Hmisc::GiniMd(predict(m))
## [1] 105.7677
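If you’re curious what GiniMd computes, here’s a from-scratch sketch, assuming the usual definition of Gini’s mean difference as the mean absolute difference over all pairs of values:
# mean of |yhat_i - yhat_j| over all i != j
yhat <- predict(m)
d <- abs(outer(yhat, yhat, "-"))
sum(d) / (length(yhat) * (length(yhat) - 1))  # ~105.77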
Whereas the summary output for lm() includes an F test of the null hypothesis that all of the predictor coefficients equal 0, the ols() function reports a Likelihood Ratio Test. This tests the same null hypothesis using a ratio of likelihoods. We can calculate this test for the lm() model using the logLik() function. First we subtract the log likelihood of a model with no predictors from the log likelihood of the original model, and then multiply by 2. Multiplying by 2 is what makes the test statistic converge to a chi-square distribution when the null is true (Wilks’ theorem). If the null were true, we would expect the ratio of likelihoods to be about 1, and hence the difference of the log likelihoods (the log of that ratio, by the quotient rule of logs) to be about 0. This statistic is very large. Clearly at least one of the predictor coefficients is not 0.
LRchi2 <- (logLik(m) - logLik(lm(gala$Species ~ 1)))*2
LRchi2
## 'log Lik.' 43.55341 (df=7)
If we wanted to calculate the p-value that appears in the ols() output, we can use the pchisq() function. The test has 5 degrees of freedom, one for each tested coefficient, matching the d.f. reported in the ols() output. (The “df=7” printed above is just an attribute logLik() attaches, counting all estimated parameters: six coefficients plus the residual standard error.) Obviously the p-value in the ols() output is rounded.
pchisq(as.numeric(LRchi2), df = 5, lower.tail = FALSE)  # ~2.85e-08
Partial F tests and importance measures
Now let’s use the anova() function on both model objects and compare the output. For the lm() object, we get sequential partial F tests using Type I sums of squares. Each line tests the null hypothesis that adding the listed predictor to the previous model without it explains no additional variability. So the Area line compares a model with just an intercept to a model with an intercept and Area. The Elevation line compares a model with an intercept and Area to a model with an intercept, Area, and Elevation. And so on. Small p-values are evidence against the null.
anova(m)
## Analysis of Variance Table
##
## Response: Species
## Df Sum Sq Mean Sq F value Pr(>F)
## Area 1 145470 145470 39.1262 1.826e-06 ***
## Elevation 1 65664 65664 17.6613 0.0003155 ***
## Nearest 1 29 29 0.0079 0.9300674
## Scruz 1 14280 14280 3.8408 0.0617324 .
## Adjacent 1 66406 66406 17.8609 0.0002971 ***
## Residuals 24 89231 3718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
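To see where these numbers come from, we can reproduce the Area row by hand: its Sum Sq is the drop in residual sum of squares when Area is added to an intercept-only model, and the F statistic divides that by the full model’s residual mean square (3718). A sketch:
m0 <- lm(Species ~ 1, data = gala)
m1 <- lm(Species ~ Area, data = gala)
(deviance(m0) - deviance(m1)) / (deviance(m) / df.residual(m))  # ~39.13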
Calling anova() on the ols() object produces a series of partial F tests using Type II sums of squares. Each line tests the null hypothesis that adding the listed predictor to a model with all the other listed predictors already in it explains no additional variability. So the Area line compares a model with all the other predictors to a model with all the other predictors and Area. The Elevation line compares a model with all the other predictors to a model with all the other predictors and Elevation. The line labeled REGRESSION is the F test reported in the summary output for the lm() model.
anova(mr)
## Analysis of Variance Response: Species
##
## Factor d.f. Partial SS MS F P
## Area 1 4.237718e+03 4.237718e+03 1.14 0.2963
## Elevation 1 1.317666e+05 1.317666e+05 35.44 <.0001
## Nearest 1 2.797576e-01 2.797576e-01 0.00 0.9932
## Scruz 1 4.635787e+03 4.635787e+03 1.25 0.2752
## Adjacent 1 6.640639e+04 6.640639e+04 17.86 0.0003
## REGRESSION 5 2.918500e+05 5.837000e+04 15.70 <.0001
## ERROR 24 8.923137e+04 3.717974e+03
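One way to convince yourself of what each row is testing: the Area row is equivalent to an F test comparing the full model to the model with Area removed. For example:
anova(update(m, . ~ . - Area), m)  # F = 1.14, p = 0.296, as in the Area row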
To run these same partial F tests for the lm() object we can use the base R drop1() function.
drop1(m, test = "F")
## Single term deletions
##
## Model:
## Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 89231 251.93
## Area 1 4238 93469 251.33 1.1398 0.2963180
## Elevation 1 131767 220998 277.14 35.4404 3.823e-06 ***
## Nearest 1 0 89232 249.93 0.0001 0.9931506
## Scruz 1 4636 93867 251.45 1.2469 0.2752082
## Adjacent 1 66406 155638 266.62 17.8609 0.0002971 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The {rms} package offers a convenient plot method for anova() objects that “draws dot charts depicting the importance of variables in the model” (from the anova.rms help page). The default importance measure is the chi-square statistic for each factor minus its degrees of freedom. There are several other importance measures available; see the anova.rms help page for the what argument.
plot(anova(mr))
IQR effect plots
Earlier we noted that we don’t use the summary() function on ols() model objects to see a model summary. That doesn’t mean there isn’t a summary() method available. There is, but it does something entirely different from the summary method for lm() objects. If we try it right now we get an error message:
summary(mr)
## Error in summary.rms(mr): adjustment values not defined here or with datadist for Area Elevation Nearest Scruz Adjacent
Notice what it says: “adjustment values not defined here or with datadist”. This tells us we need to define “adjustment values” before we can use the summary() function. Instead of summarizing the model, summary() called on an {rms} model object returns a “summary of the effects of each factor”. Let’s demonstrate what this means.
The easiest way to define adjustment values is to use the datadist() function on our data frame and then set the global datadist option using the base R options() function. Harrell frequently assigns datadist results to “d”, so we do the same. (Also, the help page for datadist states, “The best method is probably to run datadist once before any models are fitted, storing the distribution summaries for all potential variables.” I elected to wait for presentation purposes.)
d <- datadist(gala)
options(datadist = "d")
If we print “d” we see adjustment levels for all variables in the model.
d
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Low:effect 13.0 7.25 0.2575 97.75 0.800 11.025 0.5200
## Adjust to 42.0 18.00 2.5900 192.00 3.050 46.650 2.5900
## High:effect 96.0 32.25 59.2375 435.25 10.025 81.075 59.2375
## Low:prediction 2.0 1.45 0.0390 47.35 0.445 0.490 0.1000
## High:prediction 319.1 85.40 782.6215 1229.40 40.205 193.905 782.6215
## Low 2.0 0.00 0.0100 25.00 0.200 0.000 0.0300
## High 444.0 95.00 4669.3200 1707.00 47.400 290.200 4669.3200
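For continuous variables these values come straight from the data: the “Low:effect” and “High:effect” rows are the 25th and 75th percentiles, and “Adjust to” is the median. A quick check for Area (assuming datadist’s defaults line up with quantile()’s):
quantile(gala$Area, probs = c(0.25, 0.50, 0.75))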
Now let’s call summary() on our {rms} model object and see what it returns.
summary(mr)
## Effects Response : Species
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## Area 0.2575 59.238 58.980 -1.411900 1.3225 -4.1413 1.3176
## Elevation 97.7500 435.250 337.500 107.820000 18.1110 70.4400 145.2000
## Nearest 0.8000 10.025 9.225 0.084353 9.7244 -19.9860 20.1550
## Scruz 11.0250 81.075 70.050 -16.849000 15.0890 -47.9910 14.2930
## Adjacent 0.5200 59.238 58.718 -4.392400 1.0393 -6.5374 -2.2473
This produces an interquartile range “Effects” summary for all predictors. For example, in the first row we see the effect of Area is about -1.41. To calculate this we predict Species when Area = 59.238 (High, the 75th percentile) and when Area = 0.2575 (Low, the 25th percentile), and take the difference in predicted Species. For each prediction, all other variables are held at their “Adjust to” level, as shown above when we printed the datadist object. In addition to the effect, the standard error (SE) of the effect and a 95% confidence interval on the effect are returned. In this case we’re not sure if the effect is positive or negative.
To calculate this effect measure using our lm() model object, we can use the predict() function to make two predictions and then take the difference using the diff() function. Notice all the values in the newdata argument come from the datadist object above: the two values for Area are the “Low:effect” and “High:effect” values, and the other values come from the “Adjust to” row.
# How to get Area effect in summary(mr) output = -1.411900
p <- predict(m, newdata = data.frame(Area = c(0.2575, 59.238),
Elevation=97.75,
Nearest=3.05,
Scruz=46.65,
Adjacent=2.59))
p
## 1 2
## 26.90343 25.49153
diff(p)
## 2
## -1.411895
To get the standard error and confidence interval we can do the following. (Recall 58.980 is the IQR of Area.)
se <- sqrt((58.980 * vcov(m)["Area","Area"] * 58.980))
se
## [1] 1.32247
diff(p) + c(-1,1) * se * qt(0.975, df = 24)
## [1] -4.141340 1.317549
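The same effect and standard error also fall out of a general contrast-vector calculation, which has the advantage of working even when a predictor appears in more than one term. A sketch using the same newdata values as above:
nd <- data.frame(Area = c(0.2575, 59.238), Elevation = 97.75,
                 Nearest = 3.05, Scruz = 46.65, Adjacent = 2.59)
X <- model.matrix(delete.response(terms(m)), data = nd)
cvec <- X[2, ] - X[1, ]                   # "High" row minus "Low" row
drop(cvec %*% coef(m))                    # effect, ~ -1.4119
sqrt(drop(t(cvec) %*% vcov(m) %*% cvec))  # SE, ~ 1.3225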
This is made a little easier using the emmeans() function in the {emmeans} package. We use the at argument to specify at what values we wish to make predictions for Area. All other values are held at their median by setting cov.reduce = median. Sending the result to the {emmeans} function contrast() with argument “revpairwise” says to subtract the estimated means in reverse order. Finally we pipe into confint() to replicate the IQR effect estimate for Area that we saw in the ols() summary.
library(emmeans)
emmeans(m, specs = "Area", at = list(Area = c(0.2575, 59.238)),
cov.reduce = median) |>
contrast("revpairwise") |>
confint()
## contrast estimate SE df lower.CL upper.CL
## Area59.238 - Area0.2575 -1.41 1.32 24 -4.14 1.32
##
## Confidence level used: 0.95
There is also a plot method for {rms} summary objects. It plots the Effects and the 90, 95, and 99 percent confidence intervals using different shades of blue. Below we see the Area effect of -1.41 with relatively tight confidence intervals hovering around 0.
plot(summary(mr))
Non-linear effects
Harrell advocates using non-linear effects in the form of regression splines. This makes a lot of sense when you pause and consider how many effects in real life are truly linear. Not many. Very few associations in nature follow a straight line indefinitely. Fortunately, we can easily implement regression splines in R and specify how much non-linearity we want to entertain, either as degrees of freedom or as knots. More of either means more potential non-linearity. I personally think of it as the number of times the relationship might change direction.
One way to implement regression splines in R is via the ns() function in the {splines} package, which comes installed with R. Below we fit a model that allows the effect of Nearest to change direction up to three times by specifying df = 3. We might think of this as sort of like using a 3rd-degree polynomial to model Nearest (but that’s not what we’re doing). The summary shows three coefficients for Nearest, none of which has a direct interpretation.
library(splines)
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, df = 3) +
Scruz + Adjacent,
data = gala)
summary(m2)
##
## Call:
## lm(formula = Species ~ Area + Elevation + ns(Nearest, df = 3) +
## Scruz + Adjacent, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.857 -28.285 -0.775 25.498 163.069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.18825 27.39386 0.481 0.6350
## Area -0.03871 0.02098 -1.845 0.0785 .
## Elevation 0.35011 0.05086 6.883 6.51e-07 ***
## ns(Nearest, df = 3)1 -168.44016 68.41261 -2.462 0.0221 *
## ns(Nearest, df = 3)2 -56.97623 57.67426 -0.988 0.3339
## ns(Nearest, df = 3)3 38.61120 46.22137 0.835 0.4125
## Scruz -0.13735 0.19900 -0.690 0.4973
## Adjacent -0.08254 0.01743 -4.734 0.0001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.11 on 22 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.7689
## F-statistic: 14.78 on 7 and 22 DF, p-value: 5.403e-07
To decide whether we should keep this non-linear effect, we can use the anova() function to compare this updated, more complex model to the original model with only linear effects. The null of the test is that both models are equally adequate. It appears there is some evidence that the non-linearity improves the model.
anova(m, m2)
## Analysis of Variance Table
##
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + ns(Nearest, df = 3) + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 89231
## 2 22 66808 2 22424 3.6921 0.04144 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To entertain non-linear effects in {rms}, use the rcs() function, which stands for restricted cubic spline. Instead of degrees of freedom we specify knots. Three degrees of freedom corresponds to four knots.
mr2 <- ols(Species ~ Area + Elevation + rcs(Nearest, 4) +
Scruz + Adjacent,
data = gala, x = TRUE)
mr2
## Linear Regression Model
##
## ols(formula = Species ~ Area + Elevation + rcs(Nearest, 4) +
## Scruz + Adjacent, data = gala, x = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 30 LR chi2 51.32 R2 0.819
## sigma55.9556 d.f. 7 R2 adj 0.762
## d.f. 22 Pr(> chi2) 0.0000 g 107.069
##
## Residuals
##
## Min 1Q Median 3Q Max
## -90.884 -29.870 -3.715 28.019 163.497
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 14.1147 30.2230 0.47 0.6451
## Area -0.0368 0.0212 -1.74 0.0964
## Elevation 0.3444 0.0512 6.73 <0.0001
## Nearest 3.6081 16.9795 0.21 0.8337
## Nearest' -714.5143 891.8922 -0.80 0.4316
## Nearest'' 1001.4929 1207.4895 0.83 0.4158
## Scruz -0.2286 0.1978 -1.16 0.2602
## Adjacent -0.0806 0.0175 -4.60 0.0001
Notice we get three coefficients for Nearest that differ in value from our lm() result. A brief explanation of why that is can be found on datamethods.org. This paper provides a deeper explanation.
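Incidentally, if you want to see where rcs() placed its knots, they’re stored in the “parms” attribute of the basis matrix it returns (we’ll use this attribute again shortly):
attr(rcs(gala$Nearest, 4), "parms")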
Calling the {rms} anova() method on the model object returns a test for nonlinearity under the Nearest test, labeled “Nonlinear”. Notice the resulting p-value is slightly higher than before and now exceeds 0.05.
anova(mr2)
## Analysis of Variance Response: Species
##
## Factor d.f. Partial SS MS F P
## Area 1 9446.295 9446.295 3.02 0.0964
## Elevation 1 141872.398 141872.398 45.31 <.0001
## Nearest 3 20348.931 6782.977 2.17 0.1208
## Nonlinear 2 20348.651 10174.326 3.25 0.0580
## Scruz 1 4182.177 4182.177 1.34 0.2602
## Adjacent 1 66204.433 66204.433 21.14 0.0001
## REGRESSION 7 312198.651 44599.807 14.24 <.0001
## ERROR 22 68882.715 3131.033
To replicate the {rms} anova() output using lm(), we need to change the ns() arguments so the knot locations match. The following R code comes from this hbiostat.org page.
w <- rcs(gala$Nearest, 4)
kn <- attr(w, 'parms')
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, knots = kn[2:3],
Boundary.knots = c(kn[1], kn[4]))
+ Scruz + Adjacent, data = gala)
anova(m, m2)
## Analysis of Variance Table
##
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + ns(Nearest, knots = kn[2:3], Boundary.knots = c(kn[1],
## kn[4])) + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 89231
## 2 22 68883 2 20349 3.2495 0.05801 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which approach is better: ns() or rcs()? I don’t pretend to know, and I’m not sure it really matters. The substance of the results doesn’t differ much. Yes, we saw the p-value jump from 0.041 to 0.058, but we know better than to make hard binary decisions based on a p-value.
Effect plots
Since interpreting non-linear effect coefficients is all but impossible, we turn to visualization. When fitting a non-linear model with lm(), the {effects} and {ggeffects} packages are fantastic for creating effect plots.
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, df = 3) +
Scruz + Adjacent, gala)
library(effects)
Effect("Nearest", m2) |> plot()
library(ggeffects)
ggeffect(m2, "Nearest") |> plot()
In both cases we see the effect of Nearest is negative for values 0–20, but then seems to increase for values above 20.
To get a similar plot for an {rms} model, we can use the Predict() function (note the capital “P”; this is an {rms} function) and then pipe into plot(). You can also pipe into ggplot() and plotp() to get ggplot2 and plotly plots, respectively.
Predict(mr2, Nearest) |> plot()
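For example, the ggplot2 version of the same plot is one substitution away (a sketch; plotp() works the same way for a plotly version):
Predict(mr2, Nearest) |> ggplot()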
To get all effect plots in one go, we can use the {effects} function allEffects() with the lm() model.
allEffects(m2) |> plot()
To do the same with {rms}, we use the Predict() function with no predictors specified. Notice the y-axis has the same limits for all plots.
Predict(mr2) |> plot()
Diagnostics
A linear model makes several assumptions, two of which are constant variance of residuals and normality of residuals. The plot() method for lm() objects makes these easy to assess graphically.
op <- par(mfrow = c(2,2))
plot(m2)
par(op)
The left two plots assess constant variance using two different types of residuals. The upper right plot assesses normality. The bottom right helps identify influential observations. The islands of Isabela and Santa Cruz stand out as either not being well fit by the model or unduly influencing the fit of the model.
We can use the plot.lm() method on the ols() model, but only for the first diagnostic plot (i.e., which = 1).
plot(mr2, which = 1)
To create a QQ plot to assess normality of residuals, we need to extract the residuals using the residuals() function and plot them with qqnorm() and qqline().
r <- residuals(mr2, type="student")
qqnorm(r)
qqline(r)
To check for possible influential observations for an lm() model, we can use the base R function influence.measures() and its associated summary() method. See ?influence.measures for a breakdown of what is returned. The columns that begin with “dfb” are the DFBETAS statistics. DFBETAS indicate the effect that deleting each observation has on the estimates of the regression coefficients. That’s why there’s a DFBETA for every coefficient in the model.
summary(influence.measures(m2))
## Potentially influential observations of
## lm(formula = Species ~ Area + Elevation + ns(Nearest, df = 3) + Scruz + Adjacent, data = gala) :
##
## dfb.1_ dfb.Area dfb.Elvt dfb.n(N,d=3)1 dfb.n(N,d=3)2 dfb.n(N,d=3)3
## Darwin -0.02 0.07 -0.08 0.05 -0.10 -0.10
## Fernandina 0.14 0.19 -0.15 0.02 0.00 0.03
## Genovesa -0.07 -0.08 0.12 0.13 -0.15 -0.43
## Isabela 0.05 -18.76_* 4.55_* -1.69_* -1.12_* 0.94
## SanCristobal -0.09 -0.11 0.18 -0.18 0.22 0.49
## SantaCruz 0.43 -1.03_* 1.58_* -0.57 -0.96 -0.38
## Wolf 0.00 0.00 0.00 0.01 0.00 0.00
## dfb.Scrz dfb.Adjc dffit cov.r cook.d hat
## Darwin 0.48 0.00 0.62 2.36_* 0.05 0.48
## Fernandina -0.07 -0.91 -1.51 27.71_* 0.30 0.95_*
## Genovesa 0.14 -0.10 -0.47 3.16_* 0.03 0.57
## Isabela -0.54 -0.78 -26.78_* 0.45 50.94_* 0.98_*
## SanCristobal -0.18 -0.12 0.58 2.56_* 0.04 0.50
## SantaCruz -0.17 -1.16_* 2.33_* 0.01 0.36 0.21
## Wolf 0.03 0.00 0.04 2.21_* 0.00 0.34
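If you want the DFBETAS themselves rather than the flagged summary, the base R dfbetas() function returns the full matrix. For example, the row for Isabela:
round(dfbetas(m2)["Isabela", ], 2)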
The {rms} package offers the which.influence() and show.influence() tandem to identify observations (via DFBETAS) that affect the regression coefficients when removed. Below we set cutoff = 0.4. An asterisk is placed next to a variable when any of the coefficients associated with that variable change by more than 0.4 standard errors upon removal of the observation. Below we see that removing the Isabela observation changes four of the coefficients. (Also, this is why we set x = TRUE in our call to ols(): so we could use these functions.) The values displayed are the observed values for each observation.
w <- which.influence(mr2, cutoff = 0.4)
show.influence(w, gala)
## Count Area Elevation Nearest Scruz Adjacent
## Darwin 1 2.33 168 34.1 *290.2 2.85
## Fernandina 1 634.49 1494 4.3 95.3 *4669.32
## Isabela 4 *4669.32 *1707 0.7 * 28.1 * 634.49
## Pinta 3 * 59.56 * 777 29.1 119.6 * 129.49
## SanSalvador 3 * 572.33 * 906 0.2 19.8 * 4.89
## SantaCruz 3 * 903.82 * 864 0.6 0.0 * 0.52
## SantaFe 1 24.08 259 *16.5 16.5 0.52
## SantaMaria 5 * 170.92 * 640 * 2.6 49.2 * 0.10
More information
Got a question? Harrell runs a discussion board for the {rms} package.
Chapter 14 of these course notes by Thomas Love is also helpful.
And here’s a nice blog post called “An Introduction to the Harrell-verse: Predictive Modeling using the Hmisc and rms Packages” by Nicholas Ollberding.
Eventually you may want to buy his book, Regression Modeling Strategies. It’s not cheap, but it’s not outrageous either. Considering how much information, advice, code, examples, and references it contains, I think it’s a bargain. It’s a text that will provide many months, if not years, of study.