pipelinemapper

Thomas Keggin 2 Apr, 2025

Welcome to the pipelinemapper mini-package.

pipelinemapper is a lightweight tool that lets you tag input and output path strings in scripts of any language (not just R) to produce a graph, or flowchart, of how objects are passed between scripts in a pipeline.

For me, this has been most useful for checking that each script actually imports and exports the files I intend it to whilst prototyping and building pipelines. If you are looking for a broader pipeline or project management tool, something like Snakemake may be a better fit.

Installation

You can install the package directly through CRAN.

install.packages("pipelinemapper")

Or you can install the development version of pipelinemapper directly from the GitHub repository.

# install.packages("devtools")
devtools::install_github("thomaskeggin/pipelinemapper")

Usage

Inline tags

The package is built entirely around inline tagging of input and output paths. If you don’t tag your scripts, paths won’t be detected. There are two basic limitations:

Only one path per line.
Paths must be a single string, i.e. not a composite, or pasted string (see below).

The inline tags can be whatever you want, although I find that the default #input and #output tags are intuitive and easy to implement.

It is important that tags are consistent across scripts.
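
To make the two limitations concrete, here is a minimal base R sketch of how a tagged line could be scanned. It is not the package's actual implementation: the helper find_tagged_paths() and its regular expressions are assumptions made purely for illustration.

```r
# Hypothetical helper: NOT part of pipelinemapper, for illustration only.
# Returns the first double-quoted string on every line that ends with `tag`.
find_tagged_paths <- function(lines, tag = "#input") {

  # keep only the lines whose trailing text is the tag
  tagged <- lines[grepl(paste0(tag, "\\s*$"), lines)]

  # pull the first double-quoted string out of each tagged line
  quoted <- regmatches(tagged, regexpr('"[^"]*"', tagged))

  # strip the surrounding quotes
  gsub('"', "", quoted, fixed = TRUE)
}

script_lines <- c(
  'raw_data <-',
  '  read.csv("data/raw_data.csv") #input',
  'saveRDS(processed_data, "processed_data/matrix_data.rds") #output'
)

find_tagged_paths(script_lines, tag = "#input")
# returns "data/raw_data.csv"

find_tagged_paths(script_lines, tag = "#output")
# returns "processed_data/matrix_data.rds"
```

A scanner like this only ever sees literal text, so a line such as paste0("./external/",string_variable,".csv") #input would yield just the fragment "./external/". That is one way to see why a tagged path has to be a single, complete string.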

✅ Using the default tags, a script within a pipeline could look something like this.

# load in raw data
raw_data <-
  read.csv("data/raw_data.csv") #input

# process raw data (convert to matrix)
processed_data <-
  as.matrix(raw_data)

# write processed data
saveRDS(processed_data, "processed_data/matrix_data.rds") #output

❌ A composite, or pasted, path like this would throw an error.

input_path_1 <-
  paste0("./external/",string_variable,".csv") #input

✅ If your script is looped or run in parallel to load in many files within a single directory, then you can tag that container directory instead of each file individually.

# define the input directory
input_directory <-
  "path/to/inputs/" #input

# list the files within the input directory
input_files <-
  list.files(input_directory)

# create a list container to load the files into
object_container <-
  vector(mode = "list", length = length(input_files))

# loop through each file to load them into a list container
for(file in seq_along(input_files)){
  
  object_container[[file]] <-
    readRDS(paste0(input_directory,
                   input_files[file]))
}
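
The same directory-tagging pattern can be written more compactly with lapply(). The sketch below is only an illustration in base R: the "inputs" directory and the two demo files are made up, and it runs inside a temporary directory so that the tagged path can stay a single literal string.

```r
# work in a temporary directory so the sketch leaves no files behind
old_wd <- setwd(tempdir())

# create two demo input files (illustrative only)
dir.create("inputs", showWarnings = FALSE)
saveRDS(1:3, "inputs/a.rds")
saveRDS(4:6, "inputs/b.rds")

# tag the container directory: a single string on a single line
input_directory <-
  "inputs" #input

# full.names = TRUE returns paths that can be passed straight to readRDS()
input_files <-
  list.files(input_directory, pattern = "\\.rds$", full.names = TRUE)

# load every file into a list, named by file name
object_container <-
  lapply(input_files, readRDS)
names(object_container) <-
  basename(input_files)

# restore the original working directory
setwd(old_wd)
```

Only the directory line carries a tag, so the individual file paths built by list.files() never need to be tagged.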

Mapping and visualisation

Once you have tagged your pipeline scripts, mapping and visualising your data flow is straightforward. The package contains only four functions: the core mapScript() function and three wrapper functions, which run in order.

mapScript() generates an input/output data frame for a single script.
mapPipeline() loops mapScript() through all files with target extensions in a particular directory, returning a compiled input/output data frame for those scripts.
graphPipeline() uses the igraph package to convert the input/output data frame into an igraph object.
plotPipeline() generates a flowchart from the igraph object to visualise the pipeline using the ggraph and ggplot2 packages. I hope this is useful, but if you prefer, the default igraph::plot.igraph() (or any other graph plotting method) also works.
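
As a rough picture of what happens between the data frame and the graph, the sketch below turns a toy input/output table into a directed edge list. The column names script, path and direction are assumptions for illustration and may not match what mapPipeline() actually returns; the igraph step only runs if igraph is installed.

```r
# Hypothetical input/output table: these column names are NOT taken from
# the package, they are placeholders for illustration.
io <- data.frame(
  script    = c("01_clean.R", "01_clean.R", "02_model.R", "02_model.R"),
  path      = c("data/raw_data.csv", "processed_data/matrix_data.rds",
                "processed_data/matrix_data.rds", "results/model.rds"),
  direction = c("input", "output", "input", "output"),
  stringsAsFactors = FALSE
)

# inputs point from file to script, outputs point from script to file
is_input <- io$direction == "input"
edges <- data.frame(
  from = ifelse(is_input, io$path, io$script),
  to   = ifelse(is_input, io$script, io$path),
  stringsAsFactors = FALSE
)

# convert the edge list to a directed graph if igraph is available
if (requireNamespace("igraph", quietly = TRUE)) {
  pipeline_graph <- igraph::graph_from_data_frame(edges, directed = TRUE)
  print(pipeline_graph)
}
```

Because matrix_data.rds is an output of one script and an input of the next, it becomes the node that links the two scripts in the resulting flowchart.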

Implementation can be as simple as this:

# define your pipeline directory containing your scripts.
pipeline_directory <-
  system.file("dummy_pipeline",
              package = "pipelinemapper")

# apply mapScript() to all target files in the pipeline directory.
pipelinemapper::mapPipeline(pipeline_directory_path = pipeline_directory,
                            file_extensions = c(".r",".rmd",".qmd"),
                            input_tag = "#input",
                            output_tag = "#output") |> 
  
  # pass the pipeline input/output data frame to graphPipeline()
  pipelinemapper::graphPipeline() |> 
  
  # visualise the resultant igraph object
  pipelinemapper::plotPipeline(show_full_paths = FALSE)

Note

If your pipeline is split across multiple container directories, it is possible to apply mapPipeline() to each directory, bind the resultant pipeline data frames, then convert to a single graph with graphPipeline().

# map each directory
pipeline_dataframe_1 <-
  pipelinemapper::mapPipeline("pipelines/pipeline_1")

pipeline_dataframe_2 <-
  pipelinemapper::mapPipeline("pipelines/pipeline_2")

# bind pipeline data frames together
composite_pipeline_dataframe <-
  rbind.data.frame(pipeline_dataframe_1,
                   pipeline_dataframe_2)

# convert to pipeline graph
composite_pipeline_graph <-
  pipelinemapper::graphPipeline(composite_pipeline_dataframe)
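
Binding only works cleanly when both data frames share the same columns, and overlapping directories could in principle contribute the same row twice. The sketch below uses toy stand-ins for two mapPipeline() results (the columns script, path and tag are hypothetical) to show one way of dropping duplicates before building the graph.

```r
# Toy stand-ins for two mapPipeline() results; the columns are hypothetical.
pipeline_dataframe_1 <- data.frame(
  script = c("a.R", "a.R"),
  path   = c("data/raw.csv", "data/clean.rds"),
  tag    = c("#input", "#output"),
  stringsAsFactors = FALSE
)

pipeline_dataframe_2 <- data.frame(
  script = c("a.R", "b.R"),
  path   = c("data/clean.rds", "data/clean.rds"),
  tag    = c("#output", "#input"),
  stringsAsFactors = FALSE
)

# bind, then drop rows that appear in both data frames
composite_pipeline_dataframe <-
  unique(rbind(pipeline_dataframe_1,
               pipeline_dataframe_2))

# tidy the row names after removing duplicates
rownames(composite_pipeline_dataframe) <- NULL
```

Here the a.R output row appears in both inputs, so the composite data frame ends up with three rows rather than four.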
