---
title: "Getting Started with fuzzystring"
author: "Paul Efren Santos Andrade"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with fuzzystring}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(fuzzystring)
```

## Introduction

**fuzzystring** provides fast, flexible fuzzy string joins for `data.frame` and `data.table` objects using approximate string matching. Built on top of `data.table` and `stringdist`, it uses compiled C++ result assembly plus adaptive candidate planning to reduce unnecessary distance evaluations in single-column joins.

## Installation

You can install **fuzzystring** from CRAN:

```r
install.packages("fuzzystring")
```

You can also install the development version from GitHub:

```r
# Using pak (recommended)
# pak::pak("PaulESantos/fuzzystring")

# Or using remotes
# remotes::install_github("PaulESantos/fuzzystring")
```

## Quick Start

Here's a simple example matching diamond cuts with slight misspellings:

```{r quick-start}
# Your messy data
x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)

# Reference data
y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

# Fuzzy join with max distance of 2 edits
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

## Key Features

### All Join Types Supported

**fuzzystring** supports all standard join types. Below is a small, reusable example dataset so you can compare the behavior of each join family.

```{r join-datasets}
x_join <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Gooood"),
  id = 1:4
)

y_join <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood", "Good"),
  grp = c("A", "B", "C", "D")
)
```

- `fuzzystring_inner_join()`: Only matching rows.
- `fuzzystring_left_join()`: All rows from `x`, matching rows from `y`.
- `fuzzystring_right_join()`: All rows from `y`, matching rows from `x`.
- `fuzzystring_full_join()`: All rows from both tables.
- `fuzzystring_semi_join()`: Rows from `x` that have a match in `y`.
- `fuzzystring_anti_join()`: Rows from `x` that don't have a match in `y`.

#### Inner join

```{r join-inner, eval = TRUE}
fuzzystring_inner_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Left join

```{r join-left, eval = TRUE}
fuzzystring_left_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Right join

```{r join-right, eval = TRUE}
fuzzystring_right_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Full join

```{r join-full, eval = TRUE}
fuzzystring_full_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Semi join (rows from `x` with a match in `y`)

```{r join-semi, eval = TRUE}
fuzzystring_semi_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2
)
```

#### Anti join (rows from `x` without a match in `y`)

```{r join-anti, eval = TRUE}
fuzzystring_anti_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2
)
```

#### Using the generic `fuzzystring_join()`

If you prefer a single entry point, you can use `fuzzystring_join()` directly by specifying `mode`.

```{r join-generic, eval = TRUE}
fuzzystring_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  mode = "left",
  distance_col = "distance"
)
```

### Multiple Distance Methods

You can choose from various distance metrics provided by the `stringdist` package:

```{r distance-methods, eval = FALSE}
# Optimal String Alignment (default)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")

# Damerau-Levenshtein
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")

# Jaro-Winkler (good for names); distances lie in [0, 1],
# so set max_dist on that scale rather than as an edit count
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw")

# Soundex (phonetic matching); distance is 0 when the codes match, 1 otherwise
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex")
```

### Case-Insensitive Matching

Use `ignore_case = TRUE` to ignore capitalization:

```{r ignore-case, eval = FALSE}
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  ignore_case = TRUE,
  max_dist = 1
)
```

## Advanced Usage

### Multiple Column Joins

You can match on multiple string columns at once. The same distance method and threshold are applied to each mapped column.

```{r multi-column, eval = FALSE}
x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales")
)

y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  customer_id = 1:2
)

fuzzystring_inner_join(
  x_multi, y_multi,
  by = c(first = "first_ref", last = "last_ref"),
  method = "osa",
  max_dist = 1
)
```

## Performance

**fuzzystring** now keeps more of the join execution on a compiled C++ path while using `data.table` to orchestrate candidate generation. In practice this means compiled row expansion and binding across join modes, better preservation of typed columns, and adaptive candidate planning that helps both duplicate-heavy and low-duplication workloads.

For a dedicated comparison against `fuzzyjoin::stringdist_join()`, see the benchmark article bundled with the package.
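
To give a rough sense of what "reducing unnecessary distance evaluations" can mean for duplicate-heavy data, here is a minimal base-R sketch of a fuzzy inner join. It is an illustration under stated assumptions, not fuzzystring's implementation: `naive_fuzzy_inner_join()` is a hypothetical helper, it uses `utils::adist()` (generalized Levenshtein distance) instead of a `stringdist` metric, and comparing only distinct strings is just one simple way to avoid repeated work.

```r
# Hypothetical helper for illustration only; not part of fuzzystring.
naive_fuzzy_inner_join <- function(x, y, x_col, y_col, max_dist = 2) {
  xs <- as.character(x[[x_col]])
  ys <- as.character(y[[y_col]])

  # Compare only distinct strings, so repeated values are evaluated once
  ux <- unique(xs)
  uy <- unique(ys)
  d <- utils::adist(ux, uy)  # Levenshtein distance for every distinct pair

  # Expand the distinct-pair distances back to one cell per (x row, y row)
  full <- d[match(xs, ux), match(ys, uy), drop = FALSE]

  # Keep row pairs whose distance is within the threshold
  hits <- which(full <= max_dist, arr.ind = TRUE)
  out <- cbind(
    x[hits[, "row"], , drop = FALSE],
    y[hits[, "col"], , drop = FALSE],
    distance = full[hits]
  )
  rownames(out) <- NULL
  out
}

# Demo data (made up for this sketch); "Idea" appears twice in x_demo
x_demo <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Idea"),
  id = 1:4
)
y_demo <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

res <- naive_fuzzy_inner_join(x_demo, y_demo, "name", "approx_name", max_dist = 2)
print(res)
```

In this sketch the duplicated `"Idea"` is compared against the reference strings only once, yet both of its rows appear in the result. A brute-force approach like this still evaluates every distinct pair, which is the kind of work a dedicated package would aim to cut down further.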