# COVFEFE: COre Variable Feature Extraction Feature Extractor
- Simple description
- Detailed description
- Installation instructions
- Available Pipelines
- Custom Pipelines
- Optional dependencies
## Simple description

COVFEFE is a tool for feature extraction. Given a folder containing your data, it will compute features for each file in the input folder. Out of the box, it currently supports audio and text inputs, but it is easily extensible (feel free to make a pull request if you want to add more).
As an example, say you had an input folder with two audio files and two text files.
```
input_data
├── file1.txt
├── file1.wav
├── file2.txt
└── file2.wav
```
And suppose you wanted to extract acoustic features for all the wav files and lexicosyntactic features for the txt files. You could run

```
python covfefe.py -i input_data -o output_folder -p opensmile_is10
```

to extract acoustic features for all the wav files, and

```
python covfefe.py -i input_data -o output_folder -p lex
```

to extract lexicosyntactic features for all the txt files. This will create an output folder with all the features.
```
output_folder
├── is10
│   ├── file1.csv
│   └── file2.csv
└── lexicosyntactic
    ├── file1.csv
    └── file2.csv
```
## Detailed description

COVFEFE is a fast, multi-threaded tool for running various feature extraction pipelines. A pipeline is a directed acyclic graph in which each node is a processing task that sends its output to the next node in the graph.
Nodes are defined in `nodes/` and pipelines in `pipelines/`.
An example pipeline is `opensmile_is10_lld`, defined in `pipelines/pipelines.py`:
```python
@pipeline_registry
def opensmile_is10_lld(in_folder, out_folder, num_threads):
    # Find all .wav files in the input folder
    file_finder = helper.FindFiles("file_finder", dir=in_folder, ext=".wav")
    # Run openSMILE with the IS10 config, writing low level descriptors to CSV
    is10 = audio.OpenSmileRunner("is10_lld", out_dir=out_folder, conf_file="IS10_paraling.conf", out_flag="-lldcsvoutput")
    # Chain the two nodes into a multi-threaded pipeline
    p = ProgressPipeline(file_finder | is10, n_threads=num_threads, quiet=True)
    return p
```

The function is decorated with @pipeline_registry, which adds it to a registry containing all pipelines. When called, a
pipeline function will be provided an input folder, output folder and number of threads as parameters. These parameters are
used to configure the pipeline. The opensmile_is10_lld function shown above first creates a node to find all the files in
the input folder that have a .wav extension. The second node it creates is an OpenSmileRunner, which is defined in the
nodes.audio package. This node passes its input to openSMILE (https://audeering.com/technology/opensmile/), a feature
extraction tool. Some common nodes (for example, for converting wav to mp3, resampling audio, or calling MATLAB functions or shell scripts)
are provided, and users can define their own nodes.
After defining the nodes, the opensmile_is10_lld function creates a pipeline using the | operator. This is inspired by the
Unix pipe and simply means that the output of the left node is passed to the right node. The right-hand side of the operator
can also be a list of nodes, in which case the input from the left side is passed to every node in the list.
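For instance, the last line of the example above could fan the file finder's output out to a second node as well. This is a minimal illustrative sketch: second_node is a hypothetical placeholder for any other node defined in nodes/.

```python
# Illustrative fragment: each file path emitted by file_finder is passed
# to both is10 and second_node (a hypothetical placeholder node).
p = ProgressPipeline(file_finder | [is10, second_node], n_threads=num_threads, quiet=True)
```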
COVFEFE is set up so that each node accepts a file path as input and outputs a file path. Standardizing this makes it easier to create new nodes and pipelines that are interoperable.
After creating the pipeline, p, the pipeline function returns it. The pipeline so far has only been defined and not executed.
It will be executed by the main function in covfefe.py, which is the script you can use to call different pipelines.
To execute a pipeline, simply run

```
python covfefe.py -i path/to/in/folder -o /path/to/put/out/files -p pipeline_name
```

where pipeline_name is the name of a function that has been added to the registry (for example, opensmile_is10_lld).

To run with multiple threads (e.g. 8 threads):

```
python covfefe.py -i path/to/in/folder -o /path/to/put/out/files -p pipeline_name -n 8
```

Running python covfefe.py --help will print the help options and list the available pipelines.
## Installation instructions

First, download COVFEFE and set up a virtual environment:

```
git clone https://github.com/SPOClab-ca/COVFEFE.git
cd COVFEFE
virtualenv -p python3 .venv
```

Activate the virtual environment:

```
source .venv/bin/activate
```

Install the Python libraries:

```
pip install -r requirements.txt
```

Install the nltk packages, if not already installed:

```
python -c "import nltk; deps=['cmudict', 'wordnet_id', 'punkt', 'wordnet']; [nltk.download(d) for d in deps];"
```

A script is provided that will download and set up dependencies. Before running this script, you should have openSMILE installed. You can find instructions on the openSMILE website.

```
./setup.sh /path/to/put/downloaded/files
```

This script will ask you to enter the path to the openSMILE source. This is the path to the extracted zip or tar file, not the SMILExtract binary.
Next, the setup script downloads various dependencies (requiring 1.6 GB of disk space) and creates a file called config.ini, which stores the paths to those dependencies. When COVFEFE is run, it will try to find its dependencies from environment variables first, then from this config file.
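As a rough illustration (the actual file is generated by setup.sh, and its keys and layout may differ), config.ini could contain entries such as:

```
OPENSMILE_HOME=/path/to/deps/opensmile
```

Optional entries such as path_to_anew and path_to_rst (described under Optional dependencies below) can also be added here.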
## Available Pipelines

| Pipeline | Input data | Description | Requirements |
|---|---|---|---|
| split_speech_eaf | .wav, .eaf | Reads in wav files and eaf files (with the same name except for the extension) from the input folder and splits each annotated segment into its own wav file | |
| split_speech_txt | .wav, .txt | Same as split_speech_eaf, except it uses tab-separated .txt files as annotations (start \t end \t annotation) | |
| opensmile_is10 | .wav | Computes openSMILE IS10 features describing an entire wav file for each wav file in the input folder | openSMILE is installed and OPENSMILE_HOME is set in `config.ini` |
| opensmile_is10_lld | .wav | Computes openSMILE IS10 low level descriptors for each wav file in the input folder | openSMILE is installed and OPENSMILE_HOME is set in `config.ini` |
| praat_syllable_nuclei | .wav | Runs a Praat script that computes syllable nuclei features | Praat is installed |
| matlab | .wav | Computes MATLAB acoustic features for each wav file in the input folder (very slow) | The MATLAB Engine for Python is installed |
| lex | .txt | Computes lexicosyntactic features for each txt file in the input folder | All dependencies were downloaded using the `setup.sh` script and `config.ini` was correctly generated |
| kaldi_asr | .wav | Runs automatic speech recognition on all wav files using Kaldi. Wav files will be resampled to 8 kHz. | Kaldi is installed and compiled, and the `aspire` example is set up. |
| main | .wav, .eaf | Computes IS10 features (both low level descriptors and full-file summaries) for each wav file in the input. If a file has an associated .eaf file, it will split all annotations into individual files and compute IS10 features on the isolated .wav files | openSMILE + lex + Kaldi |
## Custom Pipelines

You can create your own custom nodes and pipelines. For example, if you wanted to create a pipeline that computed one feature vector for each wav file, you could copy the opensmile_is10_lld pipeline and change the output flag to '-csvoutput', as sketched below.
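A minimal sketch of that change, reusing the helpers from opensmile_is10_lld above (the function name my_is10_summary is hypothetical):

```python
@pipeline_registry
def my_is10_summary(in_folder, out_folder, num_threads):
    file_finder = helper.FindFiles("file_finder", dir=in_folder, ext=".wav")
    # Same as opensmile_is10_lld, but -csvoutput writes one summary feature vector per file
    is10 = audio.OpenSmileRunner("is10", out_dir=out_folder, conf_file="IS10_paraling.conf", out_flag="-csvoutput")
    return ProgressPipeline(file_finder | is10, n_threads=num_threads, quiet=True)
```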
Any pipelines added to the pipelines/ folder and decorated with @pipeline_registry will be automatically discovered
and available through the CLI.
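For instance, once the hypothetical my_is10_summary pipeline sketched above is saved in pipelines/, it could be run with:

```
python covfefe.py -i input_data -o output_folder -p my_is10_summary
```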
If you would like to make your custom pipelines and nodes available for others to use, please feel free to make a pull request.
## Optional dependencies

LIWC2015 features from Receptiviti can also be added to the output. Simply copy secrets.py.example, rename it to secrets.py, and fill in your API key.
If you have MATLAB installed on your system, you can install the MATLAB Engine API for Python. As long as your MATLAB script
takes as input a path to the input file and a path specifying where to save the output, you should be able to create
a pipeline that uses nodes.matlab.MatlabRunner to call your MATLAB script.
If you have access to the ANEW2010 dictionary, you can put the ANEW2010All.txt file in the path you gave to the setup script
to add additional features to the lexicosyntactic output. You will also need to add

```
path_to_anew=/path/to/dependencies/ANEW2010All.txt
```

to config.ini.
Similar to ANEW, you can put the RST treebank data in the dependency folder and add the following lines to config.ini:

```
path_to_rst_python=path/to/deps/RST/rstenv/bin/python
path_to_rst=path/to/deps/RST/src/
```