dtcpp

Naive attempt at a Decision Tree implementation for continuous variables (WIP !!!)

author: S. Kramm
licence: GPL v3
home page: https://github.com/skramm/dtcpp
language: C++14

WORK IN PROGRESS, NO RELEASES YET !

current status 20210405:

WIP:

investigate IF vs. threshold value

Features

This software can train a decision tree from input data. The tree can then be used to classify other data. During the training step, it also analyses the input data and produces several output data files and plots:

a histogram of the classes found,
for each attribute, a histogram of the attribute values
for each attribute, a plot of the value vs. output class, also showing the mean, median, and standard deviation
a plot of the tree

Input dataset format: csv style

class values: string or numerical (integer values), see -cs option
attribute values: only numerical at present
class value position: either first or last element of the line
field separator character adjustable
decimal character for floating-point values can be either '.' or ','
handles classless points: default behavior is to consider negative values as classless
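
To make the format rules above concrete, here is a minimal sketch of how one line could be parsed. It is not the code from dtcpp.h: the names ParsedPoint and parseLine are hypothetical, the class is assumed to be numerical (no -cs handling), and the field separator is assumed to be neither ',' nor '.', so that both decimal characters can be accepted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch (not taken from dtcpp.h)
struct ParsedPoint {
    std::vector<double> attribs;   // numerical attribute values
    int  classValue = -1;          // numerical class index
    bool classless  = true;        // true when the class value is negative
};

// Parses one line of the input file.
// Assumption: 'sep' is neither ',' nor '.', so a ',' inside a field
// can safely be read as a decimal character.
ParsedPoint parseLine(const std::string& line, char sep, bool classIsFirst)
{
    assert(sep != ',' && sep != '.');

    // split the line into fields
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, sep))
        fields.push_back(field);
    assert(fields.size() >= 2);    // at least one attribute and the class

    const std::size_t classIdx = classIsFirst ? 0 : fields.size() - 1;
    ParsedPoint pt;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i == classIdx) {
            pt.classValue = std::stoi(fields[i]);
            pt.classless  = pt.classValue < 0;   // negative value: no class
        } else {
            // accept both '.' and ',' as decimal character
            std::replace(fields[i].begin(), fields[i].end(), ',', '.');
            pt.attribs.push_back(std::stod(fields[i]));
        }
    }
    return pt;
}
```

With this sketch, parseLine("1,5;2.25;3", ';', false) yields the attributes 1.5 and 2.25 with class 3, and a line whose class field is -1 is flagged as classless.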

Training algorithm

The algorithm is more or less based on C4.5: https://en.wikipedia.org/wiki/C4.5_algorithm At each step, it searches for the best attribute to use and the best threshold on that attribute, so that the split maximizes the decrease in Gini impurity (i.e. minimizes the weighted impurity of the two resulting subsets): https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
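
As an illustration, here is a minimal sketch of a threshold search on a single attribute, where each candidate is scored by the decrease in weighted Gini impurity. It is not the implementation from dtcpp.h: the names Sample, Split, giniImpurity and bestThreshold are hypothetical, and candidates are assumed to be the midpoints between consecutive distinct sorted values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Sample = std::pair<double, int>;   // (attribute value, class index)

// Gini impurity 1 - sum(p_k^2), computed from per-class counts
double giniImpurity(const std::map<int, std::size_t>& counts, std::size_t total)
{
    if (total == 0)
        return 0.0;
    double sumSq = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / total;
        sumSq += p * p;
    }
    return 1.0 - sumSq;
}

// Hypothetical result type: 'valid' is false when no split is possible
struct Split {
    double threshold;
    double impurityDecrease;
    bool   valid;
};

Split bestThreshold(std::vector<Sample> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());

    std::map<int, std::size_t> left, right;
    for (const auto& s : samples)
        ++right[s.second];
    const std::size_t n = samples.size();
    const double parent = giniImpurity(right, n);

    Split best{0.0, 0.0, false};
    for (std::size_t i = 0; i + 1 < n; ++i) {
        // move sample i from the right subset to the left subset
        ++left[samples[i].second];
        --right[samples[i].second];
        if (samples[i].first == samples[i + 1].first)
            continue;                    // no threshold between equal values
        const std::size_t nl = i + 1, nr = n - nl;
        const double weighted =
            (nl * giniImpurity(left, nl) + nr * giniImpurity(right, nr)) / n;
        const double decrease = parent - weighted;
        if (!best.valid || decrease > best.impurityDecrease)
            best = Split{0.5 * (samples[i].first + samples[i + 1].first),
                         decrease, true};
    }
    return best;
}
```

For the four samples (1,0), (2,0), (3,1), (4,1), the parent impurity is 0.5 and the threshold 2.5 separates the classes perfectly, so the sketch returns that threshold with an impurity decrease of 0.5. A full tree builder would repeat this search over every attribute and keep the best result.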

Performance scores

See https://en.wikipedia.org/wiki/Confusion_matrix for definitions.

For two-class problems, the following scores are computed: True Positive Rate (recall), True Negative Rate, Positive Predictive Value (precision), Accuracy, Balanced Accuracy and F1-score.
For multiclass tasks, only macro recall and macro precision are computed at present.
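
The two-class scores follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions; the names ConfusionCounts, Scores, safeRatio and computeScores are hypothetical and not taken from dtcpp.h, and returning 0 for an empty denominator is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Cells of a two-class confusion matrix
struct ConfusionCounts {
    std::size_t tp, fp, tn, fn;
};

struct Scores {
    double recall;            // True Positive Rate
    double tnr;               // True Negative Rate
    double precision;         // Positive Predictive Value
    double accuracy;
    double balancedAccuracy;
    double f1;
};

// Assumption of this sketch: a ratio with an empty denominator is 0
static double safeRatio(std::size_t num, std::size_t den)
{
    return den == 0 ? 0.0 : static_cast<double>(num) / den;
}

Scores computeScores(const ConfusionCounts& c)
{
    const std::size_t total = c.tp + c.fp + c.tn + c.fn;
    assert(total > 0);

    Scores s;
    s.recall           = safeRatio(c.tp, c.tp + c.fn);
    s.tnr              = safeRatio(c.tn, c.tn + c.fp);
    s.precision        = safeRatio(c.tp, c.tp + c.fp);
    s.accuracy         = safeRatio(c.tp + c.tn, total);
    s.balancedAccuracy = 0.5 * (s.recall + s.tnr);
    s.f1               = safeRatio(2 * c.tp, 2 * c.tp + c.fp + c.fn);
    return s;
}
```

For example, with tp=3, fp=2, tn=2, fn=1, this gives a recall of 0.75, a True Negative Rate of 0.5, a precision of 0.6, and an accuracy and balanced accuracy of 0.625.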

Sources:

Related software:

https://rulequest.com/download.html : C5 algorithm

Recommended tools

Although not mandatory, this software benefits greatly from some additional standard tools that are freely available on any OS/architecture:

Gnuplot
Graphviz

Command-line usage

$ dtcpp <switches> input_datafile

Switches

-cs : class value is a string. A numerical index will be automatically associated with each string value.
-sep "x" : use 'x' as field separator in the input file
-cf : class value is the First value of line
-cl : class value is the Last value of line (default)
-ll X : sets log level to 'X'. Available levels: 0,1,2,3
-i : loads the datafile, prints info and stats on its contents, then exits (no training)
-nf x : do training on 'x' folds of data
-nbh xx : number of bins for building the histograms
-ro : remove outliers before training
-md xx : max depth for tree
-fl : First line of input data file holds labels, ignore it
-sd : use sorting of points to find thresholds, to evaluate best split (default is histogram binning technique)
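
To illustrate the idea behind a histogram binning technique, as opposed to sorting the points, here is a minimal sketch that proposes the interior edges of an equal-width histogram as candidate thresholds. This is an assumption about how binning can be used, not a description of what dtcpp.h does; the function name binEdgeThresholds is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the interior bin edges of an equal-width histogram built over
// 'values', to be used as candidate split thresholds.
// Returns an empty vector when all values are equal.
std::vector<double> binEdgeThresholds(const std::vector<double>& values,
                                      std::size_t nbBins)
{
    assert(!values.empty() && nbBins >= 2);

    const auto mm = std::minmax_element(values.begin(), values.end());
    const double lo = *mm.first;
    const double hi = *mm.second;

    std::vector<double> thresholds;
    if (lo == hi)
        return thresholds;          // a single distinct value cannot be split

    const double width = (hi - lo) / nbBins;
    for (std::size_t i = 1; i < nbBins; ++i)
        thresholds.push_back(lo + i * width);
    return thresholds;
}
```

With the values 0, 8, 4, 2 and 4 bins, the candidates are 2, 4 and 6. The number of candidates depends only on the number of bins, whereas a sorting approach can produce one candidate per pair of consecutive distinct values.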

Build information

This software is built from 2 files only:

dtcpp.h : header that holds all the useful code
dtcpp.cpp : basically only a command-line parser, then calls the code from the header

Needed tools

C++14 compiler
GnuMake
Doxygen

Dependencies:

Boost:

Tested successfully (March 2021) with releases 1.70 and 1.75

Warning: 1.70 seems to be the minimum required version and, depending on your distribution, sudo apt install libboost-all-dev might install an older release.
Check with: cat /usr/include/boost/version.hpp

argh: command-line parser, from https://github.com/adishavit/argh
(version 1.3.1 included for convenience)

For the test build only:

Catch2: https://github.com/catchorg/Catch2/ (tested with catch 2.13.4)

Build

The build is make-based. Several options are available and can be passed to make in the form:

$ make <target> <option={Y|N}>

Targets

all
doc: builds Doxygen pages (please do, lots of additional info !). Needs the Graphviz package
cleandoc: erases Doxygen pages
dot: calls Graphviz to build the rendering of the tree, from the generated dot files (see Graphical rendering)
plt: calls gnuplot to build the graphs from the generated plot scripts
clean
cleanout
cleanall
test: runs the tests
check: runs cppcheck (light static analysis)

Options

HMV: this will enable the "missing values" features, by defining the symbol HANDLE_MISSING_VALUES. It slows down computation slightly.
NDEBUG: this will disable all assertions in the code, to speed things up
DEBUG: this will enable some additional debug code, and set the logging level to 4. Not meant to be used by end users, only for dev/debugging purposes
DEBUGS: similar to the above, but will also print a message at the start of each function (automatically defines DEBUG)
HO: enables outlier handling features (defines the symbol HANDLE_OUTLIERS).

Error handling

The default behavior on errors is to throw exceptions, which makes it possible to catch them. The downside is that this might decrease performance. If speed is an issue, you can define the symbol DTCPP_ERRORS_ASSERT. This replaces the tests and throw instructions with classical assertions (which immediately abort execution). The main advantage is that you can then remove all these tests by defining the symbol NDEBUG.
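
The switch between the two behaviors can be pictured with a single checking macro. This is only a sketch of the general technique: the macro SKETCH_CHECK and the function safeDivide are hypothetical and do not appear in dtcpp.h; only the symbols DTCPP_ERRORS_ASSERT and NDEBUG come from the description above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical macro for this sketch, not the one used in dtcpp.h
#ifdef DTCPP_ERRORS_ASSERT
    // assertion flavor: aborts on failure, compiled out when NDEBUG is defined
    #define SKETCH_CHECK(cond, msg) assert((cond) && msg)
#else
    // default flavor: throws, so the caller can catch the error
    #define SKETCH_CHECK(cond, msg) \
        do { \
            if (!(cond)) \
                throw std::runtime_error(std::string("dtcpp: ") + (msg)); \
        } while (false)
#endif

// Example of a function guarded by the check
double safeDivide(double num, double den)
{
    SKETCH_CHECK(den != 0.0, "division by zero");
    return num / den;
}
```

Built without DTCPP_ERRORS_ASSERT, calling safeDivide(1.0, 0.0) throws a std::runtime_error that can be caught. Built with DTCPP_ERRORS_ASSERT, the same call aborts, and if NDEBUG is also defined the check disappears entirely.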
