Tags: mr-c/cnvkit
Version 0.9.8
-------------

Continuing a focus on stability and compatibility with other software:

* Support for reading CRAM files with an optional user-provided local FASTA file for the reference genome sequence. (etal#555; thanks @johnegarza)
* Call the Rscript subprocess with safer flags for the R environment. Previously, `--vanilla` caused R to ignore environments with the library path in a non-default location specified in the user's .Rprofile. Now, `--no-restore` and `--no-environ` ensure a clean environment but still respect the user's .Rprofile settings beyond that. (etal#491; thanks @pablo-gar)
* Compatibility with the latest release of pandas. (etal#502, etal#523)

This release also fixes some regressions reported since the release of CNVkit 0.9.7 (which introduced a number of new performance optimizations).

* `scatter`: A bug when plotting a region of a chromosome. (etal#536, etal#457; thanks @tskir)
* `scatter`: An IndexError when plotting entire chromosomes, e.g. chr7. (etal#541, etal#461, etal#535; thanks @tskir)
* `fix`: A bug that occurred after automatic bias corrections, introducing NaN-valued rows in place of rejected bins, leading to a downstream crash in CBS segmentation. (etal#551, etal#436, etal#547; thanks @johnegarza)
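To illustrate the Rscript change, here is a minimal sketch of launching an R script with the new flags. The function names `build_rscript_command` and `run_rscript` are hypothetical and are not CNVkit's actual code; only the flags `--no-restore` and `--no-environ` come from the notes above.

```python
import subprocess


def build_rscript_command(script_path, rscript_path="Rscript"):
    """Build an Rscript command line (hypothetical helper, not CNVkit's API).

    --no-restore skips loading a saved workspace and --no-environ skips
    Renviron files. Unlike --vanilla, neither flag disables the user's
    .Rprofile, so a custom library path set there is still honored.
    """
    return [rscript_path, "--no-restore", "--no-environ", script_path]


def run_rscript(script_path, rscript_path="Rscript"):
    """Run the script and return its standard output as text."""
    cmd = build_rscript_command(script_path, rscript_path)
    # check=True raises CalledProcessError if R exits with a non-zero status
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the command as a list avoids shell quoting problems when the script path contains spaces.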
Version 0.9.7

Stable release with only minor changes from the previous beta release 0.9.7.b1.

New contributions:

- CRAM support: Look for and use .cram + .crai alignment and index file pairs, in addition to .bam + .bai. (etal#495, etal#434; thanks @sridhar0605)
- Update the Dockerfile to use Python 3 apt packages and pip3. (etal#493; thanks @keiranmraine)
- Documentation fix. (etal#496; thanks @rollf)
Version 0.9.7-beta

This release contains several major enhancements particularly relevant to germline analysis. If used in production pipelines, further evaluation and benchmarking would be wise.

Highlights:

**Control sample clustering**: To make better use of larger reference sample pools, `reference --cluster` will correlate the given normal samples' bin-wise coverage depths to extract clusters to be used as reference profiles. The reference .cnn file produced this way will then contain the `log2` and `spread` summary statistics for each cluster, in addition to the global summary stats. Given this "clustered reference" profile, `fix --cluster` will then correlate each test sample to each clustered `log2` profile in the reference to choose the most relevant control pool for normalization. The `batch` option `--cluster` will perform both these steps. Nod to Gambin lab and the authors of ExomeDepth, CoNVaDING, CLAMMS, and others for inspiration. (etal#308)

Calculation of bin weights has changed. **This will change your segmentation results**, hopefully for the better. Details below. (etal#429)

The `batch` pipeline now performs some **segmentation post-processing** automatically: calculating and filtering segmentation calls by 50% confidence intervals of the segment mean log2 ratios, in order to reduce false positives, followed by separate bin-level testing to detect small (e.g. exon-size) CNVs that were not caught by segmentation. The bin- and segment-level results are returned as separate .cns files; deciding whether and how to combine or use these results together is left as an exercise for the user.

We've **dropped Python 2.7 support**. Python version 3.5 or later is now required.

This is a beta release. Please let me know how it works for you via the Issues page.
If this release contains any issues that are blocking your work, try installing one of the previous stable versions 0.9.6 or 0.9.5::

    conda install cnvkit=0.9.6

Dependencies
------------

- Remove all Python 2.7 compatibility shims.
- Raise minimum pandas version from 0.20.1 to 0.23.3.
- Add scikit-learn (dependency of pomegranate, for HMM segmentation). Remove the older hmmlearn implementation.

Commands
--------

`batch`:

- Post-process segments with `segmetrics` (50% CI), `call` (filter by CI, but don't call integer copy number), and `bintest`.
- Return `bintest` result as a separate, independent .cns output.
- Add option '--segment-method', equivalent to `segment -m`.
- Rename option '--method' to '--seq-method' (but '--method' still accepted for now).
- Add option `--cluster`, passed to `reference` and `fix` if given. (etal#308)

`bintest`:

- New command superseding `cnv_ztest.py` script.
- Report p-value as a column `p_bintest` (previously `ztest`) in the .cns output.
- Fix probabilities for positive log2 values, i.e. gains, which previously always had p-value = 1.0. (etal#429)

`fix`:

- Change calculation of bin weights to be more consistent with `1-var` meaning, with more emphasis on reference spread. It is now simpler, more consistent with `import-rna`, and particularly improves the accuracy of `bintest`. (etal#429)
- Squeeze the range of reference-free weights.
- Drop bins with GC content outside [0.3, 0.7]. The CLAMMS paper shows these bins carry no useful signal.
- With `--cluster` and a clustered reference input, calculate the test sample's Pearson correlation versus each cluster's log2, and take the best one for normalization.

`reference`:

- With `--cluster`, do k-means clustering of the sample bin-level read depth correlation matrix, per [Kusmirek et al. 2018](https://doi.org/10.1101/478313). Parameter k defaults to the cube root of number of samples. Only clusters of at least 4 samples are kept for emitting summary statistics in the reference profile.
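The two `--cluster` rules described for `reference` and `fix` can be sketched in plain Python as follows. This is an illustration under stated assumptions, not CNVkit's implementation: the function names are hypothetical, rounding the cube root to the nearest integer is an assumption (the notes only say "cube root"), and profiles are represented as plain lists of log2 values.

```python
import math


def default_k(n_samples):
    """Default number of clusters: cube root of the sample count.

    Rounding to the nearest integer (minimum 1) is an assumption here.
    """
    return max(1, round(n_samples ** (1 / 3)))


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_cluster(sample_log2, cluster_profiles):
    """Name of the cluster whose log2 profile correlates best with the sample.

    cluster_profiles maps a (hypothetical) cluster name to its per-bin log2 values.
    """
    return max(cluster_profiles,
               key=lambda name: pearson(sample_log2, cluster_profiles[name]))
```

For example, with 27 normal samples `default_k` gives 3 clusters, and a test sample is then normalized against whichever cluster `best_cluster` selects.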
`segment`:

- hmm: Fix pomegranate-based implementation. Use iterative Savitzky-Golay smoothing with a narrow bandwidth.
- Use HMM for post-TCN segmentation on VCF allele freqs.
- Add parameter for smoothing before CBS (thanks @EwaMarek)

`segmetrics`:

- Add 'ttest' option for 1-sample t-test p-value.
- Implement & expose --smooth-bootstrap option. For smoothing, KDE bandwidth is based on each bin's weight as a proxy for the SD of its log2 ratio values. To reduce the risk of over-smoothing on larger sample sizes, we use a loose interpretation of Silverman's Rule to reduce the bandwidth as the number of bins k in a segment increases (proportional to k^(-1/4)).

API
---

- `do_heatmap`: Add 'ax' parameter (thanks @fbrundu)
- `CNA.residuals()`: speed; keep index intact in returned pd.Series
- smoothing: Linearly roll-off weights in mirrored wings. Affects CNA.smoothed() / savgol, but not rolling median bias correction.
- Rename `CNA.smoothed()` to `CNA.smooth_log2()`, since it returns the smoothed log2 values, not a new/altered CNA.

Bug fixes
---------

- `batch`: Fix argparse formatting issue (etal#466)
- `import-rna`: Fix a regression in reading 2-column per-gene counts (`-f counts`).
- `reference`: Fix sex inference/usage when creating haploid-x reference (etal#459; thanks @duartemolha)
- `scatter`: Use a safe matplotlib backend on OS X to avoid crash
- VariantArray: Fix/streamline indexing of variants by bin/segment
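The `--smooth-bootstrap` idea described under `segmetrics` above can be sketched as a smoothed bootstrap of a segment mean. This is only an illustration: the function names are hypothetical, and the per-bin standard deviations are taken as an input because the notes do not say exactly how a bin's weight is converted to an SD.

```python
import random


def bandwidth_scale(n_bins):
    """Shrink factor for the kernel bandwidth: k ** (-1/4) for k bins."""
    return n_bins ** -0.25


def smoothed_bootstrap_means(log2_values, bin_sds, n_boot=100, seed=0):
    """Bootstrap the segment mean, jittering each resampled bin.

    Each draw picks a bin with replacement, then adds Gaussian noise whose
    SD is that bin's SD (an assumed input) scaled by k ** (-1/4).
    """
    rng = random.Random(seed)
    k = len(log2_values)
    scale = bandwidth_scale(k)
    means = []
    for _ in range(n_boot):
        picks = [rng.randrange(k) for _ in range(k)]
        jittered = [rng.gauss(log2_values[i], bin_sds[i] * scale) for i in picks]
        means.append(sum(jittered) / k)
    return means
```

With 16 bins the scale factor is 0.5, so the added noise has half the per-bin SD; with a single bin it is left unscaled.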
Version 0.9.6
=============

Much-needed maintenance and bug fixes, for the most part. Some key dependencies have changed, though this should be generally painless for you, and one or two regressions introduced by recent optimizations have been fixed.

This will be the last CNVkit version to run on Python 2.7. The next major release of pandas (0.25.0) will remove support for Python 2.7, and once that happens it will become increasingly difficult to install future versions of CNVkit on Python 2.7 -- so we're not going to try.

The segmentation method `flasso` depends on the R package `cghFLasso`, which is unmaintained and has been removed from CRAN. For now, `segment -m flasso` is still supported if you already have `cghFLasso` installed. But given the above, `flasso` will be removed from the next CNVkit version in favor of the HMM-based methods.

Dependencies
------------

- Raised minimum pandas version from 0.18.1 to 0.20.1, and support up to 0.24.2, resolving some warnings and an error in pandas 0.22+. (etal#413; thanks @chapmanb)
- The soft dependency on `hmmlearn` is replaced with an explicit dependency on `pomegranate` for the HMM-based segmentation methods. This dependency will now be pulled in automatically when installing via `pip` or `conda`.
- The R package `cghFLasso` has been removed from CRAN, and therefore is no longer a dependency of CNVkit and will not be installed automatically through the standard `conda` installation method. (etal#419)

Commands
--------

`antitarget`:

- Be more specific in removing noncanonical chromosomes (e.g. alternate contigs, mitochondria) from the binned regions. This avoids skipping chromosomes of interest in some non-human genomes with non-numeric contig names, like yeast. (etal#388; credit for regexes to @brentp)

`coverage`:

- With `--count-reads`, use query aligned length to handle soft-clipped reads properly. Now the results with and without this option should be similar. (etal#411; thanks @desnar)

`segment`:

- For `-m flasso`, partition array by chromosome to avoid edge effects. (etal#409, etal#412; thanks @giladmishne)
- Removed the deprecated option `--rlibpath`; use `--rscript-path` instead.
- Note that the HMM methods are still provisional. A stable, supported version of these methods will be provided in the next CNVkit release.

Python API
----------

- `do_scatter` now returns a figure (etal#408; thanks @jeremy9959)

Bug fixes
---------

- `scatter`: Whole chromosomes can once again be specified with `-c`. (In the previous release, a chromosome without coordinates would cause an IndexError.) (etal#393)
- `import-rna`: Option --max-log2 can now be specified by users. (Previously, only the default value of +3.0 worked.)
- VCF I/O (`skgenome.tabio`): Support GATK 4's VCF files that contain records with empty ALT alleles, substituting zero if ALT AD is missing. (etal#391; thanks @chapmanb)
- Due to a certain versioning-dependent interaction between numpy, pandas, cython, and conda (details [here](numpy/numpy#432)), CNVkit may have printed spurious RuntimeWarning messages which could be safely ignored. The current release attempts to silence these messages if they occur. (etal#390).
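To show why the `coverage --count-reads` change matters for soft-clipped reads, here is a small sketch that computes the aligned portion of a read from its CIGAR operations. The operation codes follow the SAM specification; the function name is hypothetical, and whether CNVkit computes the value this way internally is an assumption.

```python
# CIGAR operation codes from the SAM specification:
# 0=M, 1=I, 2=D, 3=N, 4=S (soft clip), 5=H (hard clip), 6=P, 7==, 8=X
ALIGNED_QUERY_OPS = {0, 1, 7, 8}  # ops that consume aligned query bases


def query_aligned_length(cigartuples):
    """Number of read bases inside the alignment, excluding clipped bases.

    cigartuples is a list of (operation, length) pairs.
    """
    return sum(length for op, length in cigartuples if op in ALIGNED_QUERY_OPS)
```

A 100-base read with CIGAR `10S80M10S` contributes 80 aligned bases rather than 100, so counting by full read length would overstate its coverage of the bin.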
Version 0.9.3

This release fixes a single bug that caused the `segmetrics` command to crash (etal#325). Specifically, the command would crash unless at least one option from each of the following option sets was specified:

- Location statistics: --mean, --median, --mode
- Spread statistics: --stdev, --sem, --mad, --mse, --iqr, --bivar
- Interval statistics: --ci, --pi

This bug would not be triggered by calling `cnvlib.do_segmetrics` through the Python API, which is why it was not caught in automated testing.
Version 0.9.2

This release contains a new command `import-rna` to infer coarse-grained copy number from RNA expression data. (etal#151)

Three new HMM-based segmentation methods are offered: 'hmm', 'hmm-germline', and 'hmm-tumor'. These should be considered experimental and used with caution; the implementations are likely to change in the next release.

The option `--male-reference` in the commands `batch`, `reference`, `fix`, `call`, and `export` (at least) has been renamed to `--haploid-x-reference` everywhere to reduce user confusion. A shim is in place so `--male-reference` will continue to work.

Documentation, logging, and some error messages are improved.

Thanks to @chapmanb, @MajoroMask, and others for contributing to this release.

Dependencies
------------

- 'pandas' version 0.22 is supported.
- 'pysam' version 0.13.0 is supported.
- 'hmmlearn' version 0.2 is a run-time requirement to use the new HMM-based segmentation methods. The rest of CNVkit can be run without it. To ensure the right version is installed, install CNVkit with conda as usual, then install hmmlearn with pip within the CNVkit conda environment.
- Assume and require pip/setuptools for installation. (This is included with stock Python 2.7 and later.)

Scripts
-------

- New script "skg_convert.py" to convert between BED, GATK interval list, GFF, VCF, and tabular formats using the 'skgenome.tabio' sub-package, with options for simple post-processing.
- Removed the deprecated script refFlat2bed.py. (Use skg_convert.py instead.)

Commands
--------

`access`:

- Drop noncanonical, untargeted contigs/chromosomes by default. This affects analyses run from scratch with `batch`, too. (etal#169, etal#299)

`segment`:

- Three new methods can be specified with `-m`: `hmm`, `hmm-germline`, and `hmm-tumor`.
- With `-m flasso`, force a breakpoint at centromeres, as was already done for the default 'cbs' method.

`reference`:

- The option `--antitargets` is no longer required to build a flat reference. Previously, building a flat reference for WGS or TAS required creating an empty file to use as antitargets alongside the target BED.
- Print a warning if the sample sex inferred from targets does not match that of antitargets. (etal#281)

`scatter`:

- Removed the deprecated, invisible option `--background-marker`. (Use `--antitarget-marker` instead.)
- Trendlines should reflect small CNVs better, while preserving overall smoothing. The implementation now uses the Savitzky-Golay method instead of a Kaiser window, and the smoothing bandwidth is better-tuned. (This can also slightly improve outlier filtering in `segment`.)

`export seg`:

- Add option `--enumerate-chroms` to replace chromosome or contig names with sequential integers. Previously, this renumbering was always done, following some version of the SEG format. But since most tools don't require the contigs to be sequential integers, and this behavior causes trouble for users, it's now disabled by default. (etal#282)

`gainloss`/`genemetrics`:

- Rename `gainloss` command to `genemetrics`. A shim is in place so `cnvkit.py gainloss` will continue to work. (etal#278)
- Report segment- and bin-level weight and probes separately. (etal#107, etal#278)

Bug fixes
---------

- autobin: Require -g/--access for WGS (etal#289)
- batch: Use the "access" regions for the WGS workflow to choose bin size; these were previously being ignored, so bin sizes were too large, being based on the size of the whole genome, not just sequencing-accessible regions.
- call: Safely handle bins with zero weight when running `call --filter cn`. (bcbio/bcbio-nextgen#2112; thanks @chapmanb)
- coverage, guess_baits.py: Handle input BED files containing >4 columns. (etal#301)
- gainloss: Without `-s`, make 'depth' the weighted mean of bins, not just the first bin's value.
- segment: Ensure the .cns output file's columns are sorted properly (etal#291)
- vcfio: Don't crash if a record has no ALT values (etal#279)
- tabio:

  - Recognize BED format with decimal in chromosome name (etal#293)
  - Improvements to GFF/GTF/GFF3 parsing. The new options are mostly accessible through the Python API and the script 'skg_convert.py'. (etal#311)
  - In 'read_auto' (and all CNVkit commands that take regions as input), determine the file format first by checking the file extension and verifying the format of the first(-ish) line. Only if that doesn't work, fall back to the original method of testing the first(-ish) line against a brittle series of regular expressions. (etal#315)

Python API
----------

- cnvlib.write: Newly available at the top level to write tabular files (like .cnr and .cns), symmetric with 'cnvlib.read()'. The 'cnvlib.tabio' alias to 'skgenome.tabio' has been removed; to read and write formats other than TSV-with-header ('tab'), import and use 'skgenome.tabio' directly.
- CopyNumArray.squash_genes: remove deprecated keyword argument 'squash_background'. Use 'squash_antitarget' instead.
- segmetrics: Move the functions supporting this command from 'cnvlib.command' to a new module 'cnvlib.segmetrics'.
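The two-stage format detection described for 'read_auto' above (extension first, verified against the first line, then a regex fallback) might look roughly like this. Everything in the sketch is hypothetical: the function name `guess_format`, the extension table, and the regular expressions are simplified stand-ins, not the patterns used in 'skgenome.tabio'.

```python
import os
import re

# Hypothetical, simplified extension table and first-line patterns
EXT_TO_FORMAT = {".bed": "bed", ".gff": "gff", ".gtf": "gff",
                 ".gff3": "gff", ".vcf": "vcf"}
FIRST_LINE_PATTERNS = {
    "vcf": re.compile(r"^##fileformat=VCF"),
    "bed": re.compile(r"^\S+\t\d+\t\d+(\t|$)"),
    "gff": re.compile(r"^(##gff-version|\S+\t\S+\t\S+\t\d+\t\d+\t)"),
}


def guess_format(filename, first_line):
    """Guess a region file's format from its extension, then its content."""
    ext = os.path.splitext(filename)[1].lower()
    fmt = EXT_TO_FORMAT.get(ext)
    # Stage 1: trust the extension only if the first line agrees with it
    if fmt is not None and FIRST_LINE_PATTERNS[fmt].match(first_line):
        return fmt
    # Stage 2: fall back to trying each pattern in turn
    for name, pattern in FIRST_LINE_PATTERNS.items():
        if pattern.match(first_line):
            return name
    raise ValueError("Unrecognized format: %s" % filename)
```

Checking the extension first keeps the common case cheap and predictable, while the fallback still handles files with missing or misleading extensions.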
Version 0.9.1

Highlights: Useful enhancements and changes to plotting and segmentation, and a new script for single-exon CNV testing. Plus, bug fixes and usability improvements to avoid unexpected errors. (etal#250, etal#255, etal#262, etc.)

Dependencies
------------

- Compatible with the most recent pandas version 0.21.0 (etal#273, etal#274; thanks @chapmanb)
- R dependencies were reduced to simplify installation

Scripts
-------

- Renamed "cnn_*.py" to "cnv_*.py"
- New script "cnv_ztest.py" to detect single-bin (e.g. single exon) deep deletions and high-level amplifications.
- In "cnv_updater.py", rename "Background" (i.e. off-target) bins to "Antitarget", in addition to adding a "depth" column if it's missing.

Commands
--------

`autobin`:

- Raise the maximum target/antitarget bin sizes to 50kb/1Mb.

`fix`:

- Allow specifying sample_id via ``--sample-id``/``-id``, in case the input coverage filenames do not have the expected form "sample_id.targetcoverage.cnn" and "sample_id.antitargetcoverage.cnn". (etal#269; thanks @chapmanb)

`segment`:

- Process each chromosome arm separately (with 'cbs' and 'haar', but not 'flasso'). Centromere locations are guessed from the largest gap between sequencing-accessible regions, and are not necessarily the true locations, although they do match fairly well on the human genome.
- Logging of dropped bins is streamlined somewhat.
- New method `-m none` to only calculate arm-level segment means (for testing and experimentation).

`scatter`:

- Highlight non-neutral segments from .call.cns. If segments have the columns 'cn' and potentially also 'cn1' and 'cn2' (as added by the `call` command), use those fields to display copy number alterations, LOH and allelic imbalance with colorized segments (orange by default), and use gray for neutral segments. If a VCF is also given, the same is done for SNVs in the lower panel. Otherwise, all segments are colorized as before. (etal#18, etal#157)
- New option `--by-bins` to display x-axis positions by sequential bin number on each chromosome, rather than genomic coordinates. This makes the plots much more useful with targeted amplicon sequencing data, or very small gene panels. (etal#63)
- Trend line (`--trend`) now accounts for bin weights, which generally results in a better fit.
- Improved interaction of -c and -g options:

  - Only apply the window margin (-w) if -g is used alone, or -c specifies a small chromosomal region with no genes.
  - Allow an empty gene list (-g '' or -g ',') to prevent highlighting and labeling of any genes / small non-genic "Selection" in the -c region.
  - If any gene in -g is not fully within the region specified by -c, name that gene and its coordinates in the error message.
  - If the -c region has size <=0, show a specific error message.

- Handle NaN log2 values when calculating y-axis limits.

`heatmap`:

- Incorporate the `--by-bins` argument to match `scatter`. (etal#63)
- Warn if selected region contains no data for a sample. This helps troubleshoot if a chromosome name was mis-specified on the command line. (etal#268)

`export seg`:

- Change column headers to match DNAcopy output. The column headers generally don't matter in the SEG format, but the DNAcopy dataframe is considered the canonical form.

Python API
----------

- cnvlib.do_segment -- new keyword argument min_weight to drop bins with 'weight' below the specified value. If not used, then only bins with weight 0 will be dropped. This feature is not recommended for normal usage and is not available on the command line.
- cnvlib.do_scatter -- Remove deprecated keyword argument 'background_marker' in favor of 'antitarget_marker', corresponding to `scatter` options deprecated in v0.9.0.
- cnvlib.cnary.CopyNumArray: Add method 'smoothed', which calculates the trendline displayed by the `scatter` command.
- skgenome.tabio: Add read support for samtools 'dict' format, which resembles the plain-text SAM header and can contain chromosome names and sizes.
- skgenome.gary.GenomicArray: Add magic methods __bool__ (Py3) and __nonzero__ (Py2) to ensure an empty GenomicArray, i.e. 0 rows, is treated as false-ish on both Python 2.7 and 3.x.
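The truthiness change for GenomicArray can be illustrated with a minimal stand-in class. `RegionTable` is hypothetical and stores rows in a plain list; it only demonstrates the pattern of defining both magic method names so that an empty container is falsy on either Python line.

```python
class RegionTable:
    """Hypothetical stand-in for a row-based genomic container."""

    def __init__(self, rows=()):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __bool__(self):
        # Python 3 truth test: empty (0 rows) is False
        return len(self.rows) > 0

    # Python 2 looks up __nonzero__ instead of __bool__
    __nonzero__ = __bool__
```

With this in place, callers can write `if regions:` instead of `if len(regions) > 0:`.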