---
title: "aelab"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{instruction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
rmarkdown.html_vignette.check_title = FALSE
library(aelab)
library(readxl)
library(tibble)
library(lubridate)
library(stats)
library(dplyr)
library(openxlsx)
```

# Introduction

The goal of achieving net zero emissions by 2050 is a significant driver of research focused on understanding greenhouse gas (GHG) flux dynamics and the autotrophic production of ecosystems. This ambitious target highlights the urgent need to explore how different ecosystems contribute to GHG emissions and how they can be managed to reduce these emissions effectively. To support this research, this R package offers a variety of useful tools designed to facilitate the analysis of data related to GHG flux and production in aquatic environments.

# GHG flux calculation

Various techniques are used to measure greenhouse gas (GHG) flux, including eddy covariance (EC), the boundary layer method (BLM), and the static chamber method. This study provides R functions specifically designed to analyze data from GHG flux measurements using the static chamber method with a portable gas analyzer.
 
## Load and process the raw data file

First, we need to obtain necessary parameters from the raw data downloaded from the portable gas analyzer. We will remove unnecessary rows and columns and extract the GHG concentrations along with the corresponding date and time recorded by the analyzer using `tidy_ghg_analyzer()`. Additionally, this function will also eliminate any NaN values in the GHG data. Currently, this function supports raw data downloaded from:  
(i) LI-COR Trace Gas Analyzer (LI-7810 and LI-7820).
(ii) ABB LGR-ICOS Gas Analyzer (M-GGA-918).

```{r}
# The provided file is a raw data file downloaded from 
# the LI-COR Trace Gas Analyzer
ghg_data_path <- system.file("extdata", "ch4.xlsx", package = "aelab", mustWork = T)
ch4 <- tidy_ghg_analyzer(ghg_data_path, "ch4")
ch4[c(1:3), ]
```

Sometimes, the date and time recorded by the analyzer do not match the actual date and time we recorded in situ. Therefore, we need to convert the `date_time` column from the previous step to align with the real-life time, if there are any discrepancies:

```{r}
# The analyzer's time was assumed to be 
# 15 minutes and 30 seconds faster than real time
ch4 <- convert_time(ch4, min = -15, sec = 30)
ch4[c(1:3), c(5:6)]
```

## Calculate the slope

We first need to obtain the slope from the linear regression of GHG concentration changes over time within the chamber to calculate the flux. This can be done by defining the measurement time interval, either from a reference file (see Method 1 below) or directly in R (see Method 2 below).

### Method 1: Load from excel

In Excel, enter the date and time when the GHG flux measurement started, then load the file into R. Make sure the column containing the date and time information is formatted in ISO 8601 as `YYYY-MM-DD hh:mm:ss`.

```{r}
ref_data_path <- system.file("extdata", "reference.xlsx", package = "aelab", mustWork = T)
ref <- read_excel(ref_data_path)
ref
```

Now, we can perform multiple simple linear regressions on the GHG concentrations, separated by the `date_time` value in the `ref` data, using `calculate_regression()`. The measurement duration, referred to as `duration_minutes`, is set to a default of 7 minutes, and the number of rows used for the regression, also denoted as `duration_minutes`, defaults to 300 rows (which is equivalent to 5 minutes). These values can be modified if desired. 

When the measurement duration (specified by `duration_minutes`) exceeds the data used for regression (specified by `num_rows`), this function automatically selects the rows with the highest R$^2$ within the defined interval.

In the output tibble, `start_time` and `end_time` indicate the time range of the data used for the regression. `slope` represents the slope, and `r_square` denotes the R$^2$ value of the regression. `reference_time` is the start time of the measurement from the input.

```{r}
calculate_regression(ch4, ghg = "CH4", reference_time = ref$date_time,
                     duration_minutes = 7, num_rows = 300)
```

### Method 2: Type directly in R

The start time of measurement can also be input directly into the function. Please note that the reference_time must be in POSIXct format.

```{r}
calculate_regression(ch4, ghg = "CH4", reference_time = as.POSIXct("2023-03-11 07:32:00", tz = "UTC"))
```

## Calculate the flux

The equation we used to calculate the GHG flux was derived from Yong et al. (2024):

$$
F = \frac{(S \times V \times c)}{R \times T \times A}
$$

where S is the slope obtained from the linear regression of GHG concentration changes over time (ppm s$^{-1}$), V is the chamber volume (liters), c is the conversion factor from seconds to hours, R is the ideal gas constant (0.082 L atm K$^{−1}$ mol$^{−1}$), T is the temperature inside the chamber (kelvin), and A is the surface area of the chamber (m$^2$).

Therefore, the function `calculate_ghg_flux()` requires a dataframe with columns for the slope, area, volume, and temperature. Note that the units of the parameters must match those described above, resulting in a flux with units of mmol m$^{-2}$ d$^{-1}$.

```{r}
results_ch4 <- calculate_regression(ch4, ghg = "CH4", reference_time = as.POSIXct("2023-03-11 07:32:00", tz = "UTC"))
flux_ch4 <- data.frame(
    slope = results_ch4$slope,
    area = 1, # in square meter
    volume = 1, # in litre
    temp = 1) # in celcius
calculate_ghg_flux(flux_ch4)
```

## Additional GHG tools

### `convert_ghg_unit()`

`convert_ghg_unit()` converts a GHG flux value to µg m⁻² h⁻¹. The input can be a plain number or a character string containing embedded numbers (e.g., a mean ± SD string); each number in the string is converted individually.

```{r}
# Numeric input: 97 mg CH4 m-2 h-1 → µg m-2 h-1
convert_ghg_unit(97, ghg = "ch4", mass = "mg", area = "m2", time = "hr")

# Character string input: each number is converted individually
convert_ghg_unit("0.002 ± 0.003", ghg = "ch4", mass = "mmol", area = "m2", time = "hr")
```

### `calculate_MDF()`

`calculate_MDF()` calculates the Minimum Detectable Flux (MDF) for a static chamber system. This is useful for evaluating whether a gas analyser is precise enough to detect fluxes at the expected magnitude of the study site.

```{r}
calculate_MDF(
  precision_ppm     = 1,
  closure_time_s    = 300,
  data_point_n      = 300,
  chamber_volume_m3 = 0.0064,
  temperature_C     = 25,
  chamber_area_m2   = 0.07
)
```

### `calculate_total_co2e()`

`calculate_total_co2e()` aggregates individual GHG fluxes (mg m⁻² h⁻¹) into a single total CO₂-equivalent flux (g m⁻² d⁻¹) using IPCC AR6 100-year GWPs (CH₄ = 27, N₂O = 273).

```{r}
calculate_total_co2e(co2 = 4.02, ch4 = 0.001, n2o = 0.003)
```

# NEM calculation

The change in oxygen concentration in the water reflects the balance between photosynthetic production, respiratory consumption, and physical exchange at the water-atmosphere interface. By measuring dissolved oxygen (DO) concentrations at high frequency with in situ data loggers, we can calculate gross primary production (GPP), ecosystem respiration (ER), and net ecosystem metabolism (NEM), as detailed by Staehr et al. (2010).

## Load and process the raw data file

First, we need to obtain the necessary parameters from the raw data downloaded from the data loggers. We will remove unnecessary rows and columns and extract the dissolved oxygen concentrations and water temperature, along with the corresponding date and time recorded by the logger using `process_hobo()`. This function will also eliminate any NA values in the dataframe. The `no_hobo` input was necessary for further analysis. Currently, it supports raw data downloaded from the Onset HOBO Dissolved Oxygen Data Logger (U26-001).

```{r}
hobo_data_path <- system.file("extdata", "ex_hobo.csv", package = "aelab")
do <- process_hobo(hobo_data_path, no_hobo = "code_for_logger")
do[c(1:3), ]
```

The air pressure and wind speed at the sampling site should be obtained either manually or from a nearby weather station. The `process_weather` function tidies the CSV file downloaded from Taiwan's weather station. In addition to extracting the columns with air pressure and wind speed values, it also duplicates the hourly data to match the half-hourly recorded frequency of the DO concentrations.

```{r}
weather_data_path <- system.file("extdata", "ex_weather.csv", package = "aelab")
weather <- process_weather(weather_data_path, date = "2024-04-10", zone = "zone_A")
weather[c(1:5), ]
```

In addition to the previously mentioned parameters, we also needed the water depth, salinity, start and end date and time of the data logger deployment, and the sunrise and sunset times during that period. These parameters can be entered in Excel and then imported into R using `process_info()` to ensure that all necessary parameters are included. The column `zone` corresponds to `process_weather()`, and column `no_hobo` corresponds to `process_hobo()`.

```{r}
info_data_path <- system.file("extdata", "info.xlsx", package = "aelab")
info <- process_info(info_data_path)
info
```

Finally, we can merge the dataframe containing air pressure and wind speed data from `process_weather()` with the processed data downloaded from the data loggers. Next, we will combine this with the other necessary parameters using the `zone` and `no_hobo` columns. We can use the `plot_hobo()` function to inspect the changes in DO concentrations.

```{r}
data <- merge(weather, do, by = "date_time")
merged_df <- data %>% 
  inner_join(info, by = c("zone", "no_hobo"))
merged_df[c(1:3), ]
plot_hobo(merged_df)
```

## Calculate GPP and ER

GPP and ER can be calculated using `calculate_do()`. NEM (denoted as `nep` in the output dataframe) was calculated by GPP + ER.

```{r}
calculate_do(merged_df)
```

# Statistical analysis

aelab provides a small suite of statistical helpers that cover a typical aquatic ecology workflow: describe your data, check for outliers, test normality, transform if needed, and choose between a parametric or non-parametric group comparison.

The examples below use a simple self-contained data frame with three groups.

```{r}
set.seed(42)
stat_df <- data.frame(
  group = rep(c("A", "B", "C"), each = 8),
  value = c(rnorm(8, 2, 0.4), rnorm(8, 3.5, 0.5), rnorm(8, 5, 0.6))
)
```

## Descriptive statistics — `descriptive_statistic()`

`descriptive_statistic()` returns mean ± SD and min–max for each group.

```{r}
descriptive_statistic(stat_df, vars = value, groups = group)
```

## Outlier detection — `find_outlier()`

`find_outlier()` uses the IQR method (values outside 1.5 × IQR from Q1/Q3) to flag outliers.

```{r}
find_outlier(stat_df, var = "value")
```

## Normality testing

Use `normality_test_t()` when comparing two groups (t-test context), and `normality_test_aov()` when comparing three or more groups (ANOVA context). Both apply Shapiro-Wilk to raw, square-root, and log10 transforms so you can identify whether a transformation improves normality.

```{r}
# Two-group context
normality_test_t(stat_df, "value", group, "A", "B")

# One-way ANOVA context (all three groups)
normality_test_aov(stat_df, "value", "group")
```

## Data transformation — `df_trans()`

If residuals are not normal, `df_trans()` appends a transformed column to the data frame for re-testing.

```{r}
stat_df_t <- df_trans(stat_df, "value", "sqrt")
head(stat_df_t)
```

## Parametric test — `aov_test()`

When normality holds, `aov_test()` runs a one-way ANOVA followed by Tukey HSD post-hoc comparisons and returns compact letter display (CLD) groupings.

```{r}
aov_test(stat_df, "value", "group")
```

## Non-parametric test — `ks_test()`

When normality cannot be assumed, `ks_test()` runs Kruskal-Wallis followed by Dunn's post-hoc test with Bonferroni correction.

```{r}
ks_test(stat_df, "value", "group")
```

# Visualization

aelab provides plot wrappers — `plot_point()`, `plot_line()`, `plot_box()`, and `plot_bar()` — with a consistent theme built on Century Gothic. Custom colour palettes are available through `aelab_palettes()`, `scale_fill_aelab_d()`, and related scale functions.

The example below previews the `"ghg"` palette and applies it to a box plot comparing CH₄ fluxes across three habitat types.

```{r eval=FALSE}
# Preview the "ghg" palette
aelab_palettes("ghg")

# Box plot with aelab palette
df_vis <- data.frame(
  site = rep(c("Mangrove", "Mudflat", "Seagrass"), each = 10),
  ch4  = c(rnorm(10, 2, 0.5), rnorm(10, 1, 0.3), rnorm(10, 0.5, 0.2))
)

plot_box(df_vis, site, ch4, site) +
  scale_fill_aelab_d("ghg")
```

# Reference

Staehr, P. A., Bade, D., Van de Bogert, M. C., Koch, G. R., Williamson, C., Hanson, P., Cole, J. J., & Kratz, T. (2010). Lake metabolism and the diel oxygen technique: State of the science. Limnology and Oceanography: Methods, 8(11), 628–644. https://doi.org/10.4319/lom.2010.8.0628
Yong, Z.-J., Lin, W.-J., Lin, C.-W., & Lin, H.-J. (2024). Tidal influence on carbon dioxide and methane fluxes from tree stems and soils in mangrove forests. Biogeosciences, 21(22), 5247–5260. https://doi.org/10.5194/bg-21-5247-2024