--- title: "Application: Modeling Relative Humidity in Brasília" author: "Everton da Costa, Francisco Cribari-Neto, and Rodney V. Fonseca" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 number_sections: true fig_caption: true bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{Application: Modeling Relative Humidity in Brasília} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 3.5, fig.align = "center", warning = FALSE, message = FALSE ) ``` > **Note:** This vignette requires suggested packages. Install them with: > `install.packages(c("ggplot2", "gridExtra", "zoo", "forecast", "scales", "moments"))` # Introduction This vignette illustrates a complete $\beta$ARMA modeling workflow applied to monthly relative humidity data from Brasília, Brazil, using the `betaARMA` package. The analysis has two complementary objectives. The first objective concerns **numerical stability**. Conditional maximum likelihood estimation (CMLE) of $\beta\text{ARMA}$ models can fail in practice: Cribari-Neto, Costa, and Fonseca (2025) show, via Monte Carlo simulation, that convergence failure rates can reach 18.68% in challenging scenarios (e.g., $\beta\text{ARMA}(1,4)$ with $n = 125$). Their paper also demonstrates, using two empirical datasets, that certain model specifications only converge when ridge penalization is applied. A natural question then arises: once the penalized estimator (PCMLE) achieves convergence, how reliable is the resulting model? Does it produce a well-specified fit that passes residual diagnostics? This vignette addresses that question directly by presenting a specification for which CMLE fails and PCMLE succeeds, and by carrying out a full residual analysis on the penalized model. The second objective is to present a **full modeling workflow** — from exploratory analysis through feature engineering, model fitting, residual diagnostics, and out-of-sample forecasting — applied to the complete Brasília humidity series. This mirrors the applied methodology of Costa, Cribari-Neto, and Scher (2024), published in the *Journal of Hydrology*. Replication code and data are openly available at . # Data and Exploratory Analysis ```{r libraries} library(betaARMA) library(forecast) library(ggplot2) library(gridExtra) library(zoo) library(scales) library(moments) ``` ```{r config_plot, echo=FALSE} # Shared ggplot configuration sample_size <- length(brasilia_ts) end_time_series <- time(brasilia_ts)[sample_size] + 1 ggplot_size_font <- 9 ggplot_theme <- theme_minimal() + theme( legend.position = "bottom", title = element_text(size = ggplot_size_font), axis.text = element_text(size = ggplot_size_font), axis.title = element_text(size = ggplot_size_font), legend.text = element_text(size = ggplot_size_font), legend.title = element_text(size = ggplot_size_font) ) ggplot_scale_y <- scale_y_continuous( breaks = seq(0.0, 1.00, 0.10) ) ggplot_scale_x <- scale_x_continuous( breaks = seq(1999, end_time_series, 2), limits = c(1999, end_time_series) ) ``` The dataset consists of 306 monthly observations of average relative humidity in Brasília, Brazil, covering January 1999 to June 2024. The data were obtained from NASA's POWER project. Relative humidity is naturally bounded in $(0, 1)$, making it well suited for $\beta$ARMA modeling. # A Case of Numerical Instability To motivate the penalized estimator, we focus on a specific subsample of the Brasília dataset. In Cribari-Neto, Costa, and Fonseca (2025), this subsample was used to illustrate convergence failure under standard CMLE, and it was shown that ridge penalization successfully restores convergence. While that paper established the numerical stability of the penalized estimator, the objective here is different: to verify the statistical adequacy of the penalized fit through residual diagnostics, and to show how this workflow can guide model reduction. To that end, we specify a $\beta\text{ARMA}$ model with $\text{AR} = (10, 18)$, $\text{MA} = (1, 11, 13)$, and harmonic regressors. ## Train--Test Split ```{r subsample} y_subsample_ts <- window( brasilia_ts, start = c(2006, 02), end = c(2019, 07) ) y_train <- window( y_subsample_ts, end = c(2018, 07) ) y_test <- window( y_subsample_ts, start = c(2018, 08) ) ``` The figure below shows the full time series with the training period delimited by dashed vertical lines. ```{r ggplot_subsample, echo=FALSE} len_y_ts <- length(y_train) start_subsample <- time(y_train)[1] end_subsample <- time(y_train)[len_y_ts] ggplot_subsample <- ggplot(brasilia_df, aes(time, y)) + geom_point(size = 0.5) + geom_line(aes(group = 1)) + geom_vline(xintercept = start_subsample, linetype = "dashed", linewidth = 0.5, color = "red") + geom_vline(xintercept = end_subsample, linetype = "dashed", linewidth = 0.5, color = "red") + labs(x = "Year", y = "Relative humidity (proportion)") + ggplot_scale_x + ggplot_scale_y + ggplot_theme ``` ```{r print_ggplot_subsample, echo=FALSE, fig.cap="Relative humidity in Brasília. Dashed red lines delimit the training period used in the instability analysis."} print(ggplot_subsample) ``` ## Descriptive Statistics ```{r descriptive_subsample, echo=FALSE} descriptive_sub_df <- data.frame( Min = min(y_subsample_ts), Max = max(y_subsample_ts), Median = median(y_subsample_ts), Mean = mean(y_subsample_ts), SD = sd(y_subsample_ts), Skewness = moments::skewness(y_subsample_ts), Excess_kurtosis = moments::kurtosis(y_subsample_ts) ) descriptive_sub_df <- round(descriptive_sub_df, 2) ``` ```{r print_descriptive_subsample, echo=FALSE} knitr::kable( descriptive_sub_df, caption = "Descriptive statistics of the monthly relative humidity in Brasília — subsample (February 2006 to July 2019)." ) ``` ## Feature Engineering To capture annual seasonality we include two harmonic regressors, $s_t = \sin(2\pi t / 12)$ and $c_t = \cos(2\pi t / 12)$, which together represent a smooth annual wave. ```{r regressors} freq <- frequency(y_train) n_test <- length(y_test) trend_index_train <- seq_along(y_train) trend_index_test <- (max(trend_index_train) + 1): (max(trend_index_train) + n_test) hs_train <- sin(2 * pi * trend_index_train / freq) hc_train <- cos(2 * pi * trend_index_train / freq) hs_test <- sin(2 * pi * trend_index_test / freq) hc_test <- cos(2 * pi * trend_index_test / freq) X_train <- cbind(hs = hs_train, hc = hc_train) X_test <- cbind(hs = hs_test, hc = hc_test) ``` ## Standard CMLE ```{r fit_cmle} ar_lags <- c(10, 18) ma_lags <- c(1, 11, 13) fit_cmle <- barma( y = y_train, ar = ar_lags, ma = ma_lags, penalty = FALSE, xreg = X_train ) ``` ```{r summary_cmle} summary(fit_cmle) ``` The optimizer failed to converge (`conv = 1`), and the estimates are unreliable. Without a penalized alternative, an analyst would have little recourse but to abandon this lag structure and search for a different specification. ## Penalized CMLE (PCMLE) Applying ridge penalization to the same specification restores convergence. ```{r fit_pcmle} fit_pcmle <- barma( y = y_train, ar = ar_lags, ma = ma_lags, penalty = TRUE, xreg = X_train ) ``` ```{r summary_pcmle} summary(fit_pcmle) ``` The penalized estimator successfully converges. The ridge penalty value $\lambda_n = 1/(n - a)^{0.9}$ is chosen to balance bias and variance, as established in Cribari-Neto, Costa, and Fonseca (2025). This illustrates a practical use case for PCMLE. When standard CMLE presents a convergence failure, an analyst might discard the entire lag structure without further investigation. By applying PCMLE, convergence is achieved and the parameter estimates become available for inspection. Furthermore, the diagnostic plots below for the PCMLE fit confirm that the residuals exhibit no serial correlation (Scher, Cribari-Neto, Pumi, and Bayer, 2020), indicating that the lag structure adequately captures the dynamics of the series. ```{r diagnostics_pcmle, fig.height=7.0, fig.cap="Diagnostic plots for the PCMLE fit: AR=(10,18), MA=(1,11,13)."} plot(fit_pcmle, which = "all") ``` The observed and fitted values track each other closely. All ACF and PACF spikes lie within the confidence bands, and both the Ljung-Box and Monti p-values remain consistently above 0.05 across all evaluated lags, providing no evidence of residual serial correlation. Upon reviewing the PCMLE summary, we note that the $\theta_{11}$ estimate is not statistically significant (p-value = 0.488). This suggests that the MA(11) lag does not contribute meaningfully to the model dynamics. Removing it yields the reduced specification $\text{AR} = (10, 18)$, $\text{MA} = (1, 13)$, for which the likelihood surface is no longer flat and standard CMLE converges. Without PCMLE, the non-significance of $\theta_{11}$ would never have been discovered, and the valid underlying structure of the series would have remained inaccessible. ## Reduced Model ```{r fit_cmle_reduced} ar_lags_reduced <- c(10, 18) ma_lags_reduced <- c(1, 13) fit_cmle_reduced <- barma( y = y_train, ar = ar_lags_reduced, ma = ma_lags_reduced, penalty = FALSE, xreg = X_train ) ``` ```{r summary_cmle_reduced} summary(fit_cmle_reduced) ``` The reduced CMLE model converges and all coefficients are statistically significant. We proceed with residual diagnostics and forecasting using this parsimonious specification. ## Residual Diagnostics The panels below assess whether the fitted model adequately captures the dynamics of the Brasília relative humidity series. ```{r diagnostics_all, fig.height=7.0, fig.cap="Diagnostic plots for the reduced CMLE model AR=(10,18), MA=(1,13)."} plot(fit_cmle_reduced, which = "all") ``` The observed and fitted values track each other closely throughout the sample period. All ACF and PACF spikes lie within the confidence bands, and both the Ljung-Box and Monti p-values remain consistently above 0.05 across all evaluated lags, providing no evidence of residual serial correlation. ```{r diagnostics_qq, fig.height=4.0, fig.width=4.5, fig.cap="Normal Q-Q plot of quantile residuals for the reduced CMLE model."} plot(fit_cmle_reduced, which = "qq") ``` The quantile residuals fall closely along the reference line within the 95% confidence band, supporting the normality assumption. One observation in the upper tail departs slightly from the line but does not indicate a systematic departure from the assumed distribution. The largest residual corresponds to October 2011 (quantile residual = 4.37), a month that marks the climatological transition from the dry to the rainy season in Brasília — the period of highest interannual humidity variability. This observation coincides with the onset of an extreme drought period documented across Brazil from 2011 onward (Cunha et al., 2019). The second largest residual corresponds to January 2015 (−3.44), coinciding with the strong 2015–2016 El Niño event, which suppressed rainfall across central Brazil. These isolated departures reflect genuine meteorological anomalies rather than systematic model misspecification. # Forecasting ```{r forecast_setup} # Define the translation dictionary month_map <- c( "Jan" = "Jan", "Fev" = "Feb", "Mar" = "Mar", "Abr" = "Apr", "Mai" = "May", "Jun" = "Jun", "Jul" = "Jul", "Ago" = "Aug", "Set" = "Sep", "Out" = "Oct", "Nov" = "Nov", "Dez" = "Dec" ) dates <- as.Date(zoo::as.yearmon(time(y_test))) pt_months <- tools::toTitleCase(format(dates, "%b")) months <- unname(month_map[pt_months]) forecasts_df <- data.frame( observed = as.numeric(y_test), forecast = forecast(fit_cmle_reduced, xreg = X_test, h = n_test)$mean, month = months ) forecasts_df$time <- seq_len(n_test) ``` ```{r ggplot_forecast, echo=FALSE} ggplot_forecast <- ggplot(forecasts_df, aes(x = time)) + geom_line(aes(y = forecast, colour = "Forecast"), linewidth = 0.8) + geom_point(aes(y = observed, shape = "Observed"), color = "#00BFC4", size = 3) + scale_colour_manual(name = " ", values = c("Forecast" = "#C77CFF")) + scale_shape_manual(name = " ", values = c("Observed" = 16)) + scale_x_continuous(breaks = seq_len(n_test), limits = c(1, n_test), labels = forecasts_df$month) + scale_y_continuous(breaks = seq(0.3, 1.0, 0.1), limits = c(0.3, 1.0), labels = number_format(accuracy = 0.1)) + labs(x = "Time (months)", y = "Relative humidity (proportion)") + ggplot_theme ``` ```{r print_forecast, echo=FALSE, fig.cap="Out-of-sample forecasts from the reduced CMLE model against the observed test data (teal dots)."} print(ggplot_forecast) ``` ```{r forecast_accuracy} mae <- function(obs, pred) mean(abs(obs - pred)) rmse <- function(obs, pred) sqrt(mean((obs - pred)^2)) obs <- forecasts_df$observed accuracy_table <- data.frame( Model = "betaARMA CMLE (reduced)", MAE = round(mae(obs, forecasts_df$forecast) * 100, 2), RMSE = round(rmse(obs, forecasts_df$forecast) * 100, 2) ) knitr::kable( accuracy_table, caption = paste0( "MAE and RMSE ($\\times 100$) over the ", n_test, "-month forecast horizon." ) ) ``` # Conclusion This vignette illustrated the practical value of PCMLE as a diagnostic tool in $\beta$ARMA modeling, using monthly relative humidity data from Brasília. Standard CMLE failed to converge for the $\beta\text{ARMA}$ model with $\text{AR} = (10, 18)$, $\text{MA} = (1, 11, 13)$. Applying ridge penalization (PCMLE) restored convergence and made the parameter estimates available for inspection. Residual diagnostics on the PCMLE fit confirmed that the lag structure adequately captured the series dynamics. Inspection of the PCMLE estimates revealed that $\theta_{11}$ was not statistically significant, motivating the reduction to the parsimonious $\beta\text{ARMA}$ model with $\text{AR} = (10, 18)$, $\text{MA} = (1, 13)$. The reduced CMLE model converged, passed all residual diagnostics, and produced accurate out-of-sample forecasts, with all predictions guaranteed to lie in $(0, 1)$. This workflow — CMLE fails, PCMLE reveals the structure, reduction yields a stable CMLE model — demonstrates that PCMLE is not merely a fallback estimator but an active analytical tool that can guide model selection in challenging estimation scenarios. For the complete simulation study, bootstrap procedures, and extended empirical results, we refer the reader to Cribari-Neto, Costa, and Fonseca (2025). # References Costa, E., Cribari-Neto, F., & Scher, V. T. (2024). Test inferences and link function selection in dynamic beta modeling of seasonal hydro-environmental time series with temporary abnormal regimes. *Journal of Hydrology*, 638, 131489. Cribari-Neto, F., Costa, E., & Fonseca, R. V. (2025). Numerical stability enhancements in beta autoregressive moving average model estimation. *Brazilian Journal of Probability and Statistics*, 39(4), 410--437. Cunha, A. P. M. A., Zeri, M., Deusdará Leal, K., Costa, L., Cuartas, L. A., Marengo, J. A., ... & Cunningham, C. (2019). Extreme drought events over Brazil from 2011 to 2019. *Atmosphere*, 10(11), 642. Rocha, A. V., & Cribari-Neto, F. (2009). Beta autoregressive moving average models. *TEST*, 18(3), 529--545. Rocha, A. V., & Cribari-Neto, F. (2017). Erratum to: Beta autoregressive moving average models. *TEST*, 26, 451--459. Scher, V. T., Cribari-Neto, F., Pumi, G., & Bayer, F. M. (2020). Goodness-of-fit tests for $\beta$ARMA hydrological time series modeling. *Environmetrics*, 31(3), e2607. # Reproducibility The analysis was conducted in the following computational environment. ```{r session_info, echo=FALSE} cat("=================================================================\n") cat("Session Information\n") cat("=================================================================\n") cat("This report was generated at:", format(Sys.time(), "%B %d, %Y at %I:%M %p"), "\n\n") print(sessionInfo()) ```