Railog is a Rust-based command-line tool that uses machine learning to analyze log files, identify patterns, and classify new log messages. It transforms unstructured log messages into numerical vectors (embeddings) and groups them into clusters based on semantic similarity. This allows the system to learn the "normal" patterns in your logs and identify new, potentially interesting messages that don't fit known patterns.
The tool is designed with an online learning workflow in mind, allowing the model to adapt and improve as it processes more data over time.
- Log Message Embedding: Utilizes the `sentence-transformers/all-MiniLM-L6-v2` model to convert log messages into 384-dimensional vectors.
- Pattern Discovery via Clustering: Employs DBSCAN clustering to group similar log vectors, effectively identifying distinct log patterns.
- Configurable Preprocessing: Uses a customizable text file (`patterns.txt`) of regular expressions to normalize log messages before analysis (e.g., replacing PIDs and IP addresses with generic tokens like `<PID>` and `<IP>`).
- Efficient Ingestion: The `ingest` command avoids reprocessing log messages by skipping duplicates and logs older than the last model update.
- Online Learning Workflow:
  - Train: Create a baseline model of log patterns from a sample file.
  - Ingest: Process new logs, automatically updating the model for known patterns and separating unknown ones for review.
  - Retrain: Incorporate reviewed, previously unknown logs into the model by creating new clusters.
- Command-Line Interface: A simple and powerful CLI for managing the entire workflow.
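Both clustering and matching ultimately reduce to measuring distance between embedding vectors. The metric Railog uses internally is not stated here; the sketch below shows the two most common choices for sentence-embedding vectors, Euclidean and cosine distance, using short toy vectors in place of the real 384-dimensional embeddings:

```rust
// Illustrative only: how "semantic similarity" between two log embeddings
// can be measured. Which metric Railog actually uses is an assumption.

fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

fn main() {
    // Toy 4-dimensional stand-ins for the real 384-dimensional embeddings.
    let a = [1.0, 0.0, 1.0, 0.0];
    let b = [1.0, 0.1, 0.9, 0.0];
    println!("euclidean = {:.3}", euclidean_distance(&a, &b));
    println!("cosine    = {:.3}", cosine_distance(&a, &b));
}
```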
The intended workflow allows the system to continuously learn and adapt to your log data.
1. Initial Training:
   - Start with a large, representative log file (e.g., `example.txt`).
   - Run the `train` command to analyze this file and create an initial `centroids.json` file, which stores the mathematical centers of the identified log patterns.

2. Ongoing Ingestion:
   - As new logs are generated, collect them into a file (e.g., `new_logs.txt`).
   - Run the `ingest` command. The tool will:
     - Skip log messages that are older than the `centroids.json` file.
     - Skip duplicate log messages within the same run.
     - Update the existing centroids for logs that match known patterns.
     - Write any non-matching logs to an `unmatched.log` file.

3. Manual Review & Retraining:
   - Periodically, a human operator should review the `unmatched.log` file. This file contains logs that the system considers novel.
   - After validating that these logs represent new, valid patterns, run the `retrain` command on `unmatched.log`. This will create new centroids for these patterns and add them to the model.
This cycle of ingesting, reviewing, and retraining allows the model to evolve without requiring a full, costly retraining from scratch.
First, build the project using Cargo:
```sh
cargo build --release
```

The executable will be located at `target/release/railog`.
The `--verbose` (`-v`) flag can be used with any command to enable detailed DEBUG-level logging.
Creates the initial `centroids.json` file from a sample log file.
```sh
./target/release/railog train --input-file <path_to_your_logs.txt> --epsilon 0.5 --min-points 2
```

- `--input-file` (`-i`): The log file to train on. Defaults to `example.txt`.
- `--output-file` (`-o`): The file to save centroids to. Defaults to `centroids.json`.
- `--epsilon` (`-e`): The maximum distance between two points for one to be considered in the neighborhood of the other. Defaults to `0.5`.
- `--min-points` (`-m`): The minimum number of points required to form a dense region (a cluster). Defaults to `2`.
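To make `--epsilon` and `--min-points` concrete, here is a minimal, self-contained DBSCAN sketch (std-only Rust, toy 2-D points standing in for the real 384-dimensional embeddings). This is an illustration of the algorithm, not Railog's actual implementation:

```rust
// Minimal DBSCAN: points within `epsilon` of each other form clusters when
// at least `min_points` neighbors (including the point itself) are present;
// everything else is noise.

fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Returns one label per point: Some(cluster_id) or None for noise.
fn dbscan(points: &[Vec<f32>], epsilon: f32, min_points: usize) -> Vec<Option<usize>> {
    let mut labels: Vec<Option<usize>> = vec![None; points.len()];
    let mut visited = vec![false; points.len()];
    let mut next_id = 0;

    let neighbors = |i: usize| -> Vec<usize> {
        (0..points.len())
            .filter(|&j| distance(&points[i], &points[j]) <= epsilon)
            .collect()
    };

    for i in 0..points.len() {
        if visited[i] {
            continue;
        }
        visited[i] = true;
        let mut seeds = neighbors(i);
        if seeds.len() < min_points {
            continue; // noise (may still be absorbed by a later cluster)
        }
        labels[i] = Some(next_id);
        let mut k = 0;
        while k < seeds.len() {
            let j = seeds[k];
            k += 1;
            if labels[j].is_none() {
                labels[j] = Some(next_id);
            }
            if !visited[j] {
                visited[j] = true;
                let nb = neighbors(j);
                if nb.len() >= min_points {
                    seeds.extend(nb); // j is a core point: expand the cluster
                }
            }
        }
        next_id += 1;
    }
    labels
}

fn main() {
    // Two tight groups plus one outlier; epsilon=0.5, min_points=2 as in the
    // command above.
    let points = vec![
        vec![0.0, 0.0], vec![0.1, 0.0], vec![0.2, 0.1],
        vec![5.0, 5.0], vec![5.1, 5.0],
        vec![9.0, 0.0], // noise
    ];
    let labels = dbscan(&points, 0.5, 2);
    println!("{:?}", labels);
    // → [Some(0), Some(0), Some(0), Some(1), Some(1), None]
}
```

A larger `--epsilon` merges nearby groups into one pattern; a larger `--min-points` discards sparse groups as noise.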
Processes a file of new logs, updating centroids and separating non-matches.
```sh
./target/release/railog ingest --input-file new_logs.txt --threshold 0.5
```

- `--input-file` (`-i`): The file containing new logs. Defaults to `new_logs.txt`.
- `--centroids-file` (`-c`): The centroids model file. Defaults to `centroids.json`.
- `--unmatched-file` (`-u`): The file to write non-matching logs to. Defaults to `unmatched.log`.
- `--threshold` (`-t`): The distance threshold for considering a log a "match". Lower is stricter. Defaults to `1.0`.
- `--learning-rate` (`-l`): The rate at which a matching log influences a cluster's centroid. Defaults to `0.1`.
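The per-log decision `ingest` makes can be sketched as follows. The exact update rule is not documented; the version below assumes a standard online update that moves the nearest centroid toward the new embedding by a `learning_rate` fraction of the gap:

```rust
// Illustrative sketch of one ingest step (assumed update rule, not
// necessarily Railog's exact behavior).

fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Returns true if the embedding matched an existing centroid (which is then
/// nudged toward it); false means the log belongs in the unmatched file.
fn ingest_one(
    centroids: &mut Vec<Vec<f32>>,
    embedding: &[f32],
    threshold: f32,
    learning_rate: f32,
) -> bool {
    // Find the nearest centroid.
    let nearest = centroids
        .iter()
        .enumerate()
        .map(|(i, c)| (i, distance(c, embedding)))
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap());

    match nearest {
        Some((i, d)) if d <= threshold => {
            // Known pattern: move the centroid toward the new point.
            for (c, x) in centroids[i].iter_mut().zip(embedding) {
                *c += learning_rate * (*x - *c);
            }
            true
        }
        _ => false, // novel pattern: would be written to unmatched.log
    }
}

fn main() {
    let mut centroids = vec![vec![0.0, 0.0], vec![10.0, 10.0]];
    // Close to the first centroid: matches and pulls it toward the point.
    assert!(ingest_one(&mut centroids, &[0.5, 0.0], 1.0, 0.1));
    assert!((centroids[0][0] - 0.05).abs() < 1e-6);
    // Far from every centroid: unmatched.
    assert!(!ingest_one(&mut centroids, &[50.0, 50.0], 1.0, 0.1));
}
```

Under this rule, frequently seen patterns keep their centroids aligned with gradual drift in the logs, while a low `--learning-rate` prevents any single message from moving a centroid far.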
Creates new centroids from a file of (typically unmatched) logs and adds them to the model.
```sh
./target/release/railog retrain --input-file unmatched.log
```

- `--input-file` (`-i`): The log file to create new centroids from. Defaults to `unmatched.log`.
- `--centroids-file` (`-c`): The centroids model file to update. Defaults to `centroids.json`.
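Conceptually, `retrain` clusters the reviewed embeddings (as in training) and appends the resulting centers to the existing model. The clustering step is elided here; this sketch shows only the assumed centroid computation (cluster mean) and merge:

```rust
// Illustrative sketch: compute the mean of a new cluster's embeddings and
// append it to the existing centroid list. Using the mean as the centroid
// is an assumption about Railog's internals.

fn cluster_mean(points: &[Vec<f32>]) -> Vec<f32> {
    let dim = points[0].len();
    let mut mean = vec![0.0; dim];
    for p in points {
        for (m, x) in mean.iter_mut().zip(p) {
            *m += *x;
        }
    }
    for m in &mut mean {
        *m /= points.len() as f32;
    }
    mean
}

fn main() {
    let mut centroids = vec![vec![0.0, 0.0]]; // existing model
    let new_cluster = vec![vec![4.0, 4.0], vec![6.0, 6.0]]; // reviewed logs
    centroids.push(cluster_mean(&new_cluster));
    assert_eq!(centroids[1], vec![5.0, 5.0]);
}
```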
A utility command to test your regex patterns on a file without performing any analysis. It prints the original and processed versions of each line.
```sh
./target/release/railog test-patterns --input-file new_logs.txt
```

- `--input-file` (`-i`): The log file to test patterns on. Defaults to `new_logs.txt`.
- `--patterns-file` (`-p`): A global flag to specify the location of your patterns file. Defaults to `patterns.txt`.
To improve accuracy, Railog preprocesses each log message to normalize dynamic or high-variance tokens. The patterns for this are defined in `patterns.txt`.
The format is one pattern per line:

```text
<REGEX> :: <REPLACEMENT>
```
Example `patterns.txt`:

```text
# Each line should be in the format: regex :: replacement
\[\d+\]: :: [<PID>]:
\b(?:\d{1,3}\.){3}\d{1,3}\b :: <IP>
```
These patterns replace process IDs like `[12345]:` with `[<PID>]:` and any IPv4 address with `<IP>`, allowing the model to learn the general pattern rather than the specific noisy data.
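Reading this file format is straightforward. Applying the rules requires a regex engine (in Rust, typically the `regex` crate; which engine Railog uses is an assumption), so the sketch below covers only the parsing: comment lines start with `#`, and each rule is split on the first `" :: "` separator:

```rust
// Illustrative parser for the patterns.txt format described above.

fn parse_patterns(contents: &str) -> Vec<(String, String)> {
    contents
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter_map(|l| {
            // Split on the first " :: "; the regex itself may contain "::".
            l.split_once(" :: ")
                .map(|(re, rep)| (re.to_string(), rep.to_string()))
        })
        .collect()
}

fn main() {
    let file = "\
# Each line should be in the format: regex :: replacement
\\[\\d+\\]: :: [<PID>]:
\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b :: <IP>
";
    let rules = parse_patterns(file);
    assert_eq!(rules.len(), 2);
    assert_eq!(rules[0].1, "[<PID>]:");
    assert_eq!(rules[1].1, "<IP>");
}
```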