Tags: etal/cnvkit

v0.9.13

Thanks to [Wei Gu Lab at Stanford](https://cfna.stanford.edu/) for sponsoring development of this release!

A significant practical improvement to support clinical research is the bedGraph
(.bed.gz) input option to the "batch" and "coverage" commands. With no other change to
the workflow, you can now precalculate the per-base coverage profile of each BAM file,
stripping out the genomic sequence information (PHI) before you or a collaborator
feed the data to CNVkit for copy number analysis.

This approach not only reduces HIPAA/IRB/legal risk, but also greatly reduces the size
of the raw data files that need to be stored for CNV calling, and streamlines reanalysis
of samples using different bin sizes and/or excluded genomic regions.

In steps:

1. Scan each BAM file for per-base coverage depth with e.g. `bedtools genomecov -bg` or
   `mosdepth`. Output is `.bed.gz` (a.k.a. bedGraph).
2. Use the sample .bed.gz file as input to CNVkit's `batch` and `coverage` commands, the
   same as you would use BAMs. It does not meaningfully affect the rest of the CNVkit
   pipeline whether BAM or .bed.gz was used as the original sample input.
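
To illustrate why per-base coverage is enough for step 2, here is a minimal sketch of how the mean depth of a bin can be derived from bedGraph records alone, with no read sequences involved. This is not CNVkit's implementation; the function name `mean_depth` and the in-memory tuples are assumptions made for illustration.

```python
def mean_depth(records, chrom, start, end):
    """Mean per-base depth over [start, end) from bedGraph-style records.

    `records` is an iterable of (chrom, start, end, depth) tuples using
    0-based, half-open coordinates. Bases not covered by any record count as 0.
    """
    if end <= start:
        raise ValueError("bin must have positive length")
    total = 0.0
    for r_chrom, r_start, r_end, depth in records:
        if r_chrom != chrom:
            continue
        # Number of bases shared by the record and the bin.
        overlap = min(end, r_end) - max(start, r_start)
        if overlap > 0:
            total += overlap * depth
    return total / (end - start)


# Hypothetical coverage records, as they might appear in a bedGraph file.
bedgraph = [
    ("chr1", 0, 100, 10.0),
    ("chr1", 100, 150, 30.0),
    ("chr2", 0, 50, 5.0),
]
print(mean_depth(bedgraph, "chr1", 50, 150))  # 20.0
```

Because the bin boundaries are applied only at this stage, the same coverage file can be reused with different bin sizes or excluded regions.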

This release also includes major improvements to HMM segmentation performance,
packaging, testing, and general infrastructure, and fixes bugs in `import-rna` and
handling of genomic intervals.

`coverage`:

* Accept bedGraph (`.bed.gz`) files as input in place of BAM. This enables a
  privacy-preserving workflow: extract per-base coverage from sensitive BAM
  files once (e.g. with `bedtools genomecov -bg`), then share only the non-PHI
  coverage data for downstream CNV analysis and collaboration. Format is
  auto-detected from the file extension. (#984, #985)
* Expose the `samtools bedcov` max-depth option (`-d`) via a new `max_depth`
  parameter, and correctly parse the extra output column that bedcov emits
  when `-d` is used. Default behavior is unchanged. (#973, #974; thanks
  @tobias-beers)

`genemetrics`:

* Implement the same summary statistics as segmetrics: confidence interval (ci),
  prediction interval (pi), mean, median, mode, t-test (p_ttest), stdev, MAD, MSE, IQR,
  midweight bivariance (bivar). These stats are useful for filtering to reduce
  false-positive calls and for building ensemble callers. (#278, #987)
* Enable the `--smooth-bootstrap` option for both `genemetrics` and `segmetrics` to give
  more accurate CI estimates at genes or segments with a small number of bins (default
  10 and below).

`segment`:

* HMM segmentation now uses the pomegranate 1.x API. The minimum pomegranate
  version is raised to 1.0.0. (#910)

`batch`:

* Show the sample name in error messages when a sample fails, instead of
  silently swallowing exceptions. Previously, errors during parallel
  processing were suppressed, making failures difficult to diagnose. (#971,
  #979)
* Generate `target_bed` in `output_dir` when `--output-dir` is given.
  (Thanks @pontushojer)

`segment`:

* Handle empty `.cnr` input cleanly instead of crashing. (#970)

`import-rna`:

* Fix several long-standing bugs. Gene ID mismatches between the counts file
  and the gene resource are now detected and reported. NaN values from
  zero-count genes are replaced with `NULL_LOG_COVERAGE` instead of
  propagating through downstream steps. A new test suite covers these paths.
  (#499, #596, #706, #940, #944, #981)

`fix` / `CNA.by_gene`:

* Fix an indexing bug where `iloc`/`loc` confusion caused incorrect slicing
  when bin coordinates contained duplicates. (#773, #951, #979)

`sniff_region_format`:

* Fix the known-extensions mapping so that file format detection no longer
  mismatches on every input. (#956; thanks @dlaehnemann)

* **Minimum Python version raised to 3.10.** Python 3.14 is now tested in CI.
  `argparse.FileType` usage removed (pending deprecation in 3.14),
  `itertools.pairwise` and `|`-union type syntax adopted throughout.
* **NumPy 2.x compatibility.** Removed `np.asfarray`, `np.float_`, and
  `np.string_` usage. (Thanks @mr-c and @suhas-r)
* **Pandas 3.0 compatibility.** Eliminated chained assignment and addressed
  `FutureWarning` messages.
* Minimum dependency versions raised to match Ubuntu 25.04 Plucky. Notably:
  matplotlib >= 3.9.0, pyfaidx >= 0.8.0, reportlab >= 3.6.13 (security
  fix).
* Python 3.8 and 3.9 support removed.

* Scripts shipped with CNVkit (beyond `cnvkit.py`) are once again installed by
  `pip install`. Argument parsing and invocation setup standardized across
  scripts. (#957; thanks @dlaehnemann)
* Conda recipe updated with `build.run_exports` version pin per Bioconda
  linting requirements. (#877, #880)
* Replaced flake8 with ruff for linting; added ruff formatting.
* Added type annotations across most of the codebase.
* Added devcontainer configuration for local development and testing.
* Added `.dockerignore`; parameterized CNVkit version in Dockerfile.
* Sphinx/ReadTheDocs configuration updated; docstrings converted to NumPy
  format throughout the core pipeline for better API docs.
* CI: integration tests via `test/Makefile`, Codecov upload, security
  scanning (safety + bandit), tox caching.

* @tobias-beers made their first contribution in #974
* @dlaehnemann made their first contribution in #956
* @mr-c and @suhas-r contributed NumPy 2.x compatibility fixes in #934 and #945
* @pontushojer fixed batch output directory handling in #940

**Full Changelog**: v0.9.12...v0.9.13

v0.9.12

Version 0.9.12
==============

Bug fixes
---------

- Re-enable `coverage -q/--min-mapq` option. (#912; thanks @rach-kennedy)
- Prevent CBS segmentation failures due to nulls in input .cnr (#914, #436, #582, maybe #760, #896, #901 and nf-core/sarek#1625)
- Raise max pomegranate dependency version from <=0.14.9 to <1.0.0 to avoid conflicts
  during installation (#911, #890)

v0.9.11

Version 0.9.11
==============

New features
------------

- Most commands include a new option, `--diploid-parx-genome`, to treat the
  pseudoautosomal regions (PAR1/2) of human chromosome X as autosomal, i.e. diploid
  regardless of sample sex. The value it takes is a human reference genome ID such as
  "grch38". This feature should help reduce false calls on sex chromosomes in human
  samples. (Thanks @rollf; #789)
- The `fix` command takes a new option `--smoothing-window-fraction` to allow manual
  tuning of the smoothing window used in GC and other automatic bias corrections.
  (Thanks @kkchau; #859)
- hg38 refFlat and genome accessibility data files are now included in the source tree.
  (Thanks @berguner; #822, #837)

Bug fixes
---------

- The Docker image once again includes the additional scripts beyond cnvkit.py.
- User-specified sample sex with `-x` now works properly. (Thanks @28rietd and @ccoo22;
  #843, #851)
- User-specified smoothing window size now applies in HMM segmentation. (Thanks
  @zhuying412; #833, #835)
- An error in `export vcf` has been fixed. (Thanks @pwwang; #818)

Other updates
-------------

- Dependency versions are updated to match Ubuntu 23.04 Lunar, more or less.
- Automated testing is done on Python version 3.8 through 3.12 -- these are the
  "supported" versions.
- Small documentation fixes.

v0.9.10

Version 0.9.10
==============

This long-awaited release includes major plotting enhancements in the `heatmap`,
`scatter`, and `diagram` commands, as well as a new `export gistic` command, thanks to
joint work by @tetedange13 and @tskir (see below).

There are also significant infrastructure improvements including bug fixes, modernized
packaging, and build/test automation.

New features
------------

`diagram`:

- New options `--no-gene-labels` to not display gene labels on the plot, and `-c` /
  `--chromosome` to plot a single chromosome (#628, #629, #634; thanks @tetedange13)

`heatmap`:

New CLI options  (#35, #625, #632, #652; thanks @tetedange13 and @tskir):

- `--vertical`: Transpose the plot, displaying the genome axis vertically instead of horizontally
- `--delimit-samples`: Add a delimiting line between each sample row (or column, with
  `--vertical`)
- `--title`: Set the plot title

`scatter`:

- New option `--fig-size`: Set the output image dimensions (#600, #641; thanks
  @tetedange13 and @tskir)
- Show triangles at the bottom of the plot to indicate where segments are hidden below
  the plotted region by automatic pruning at 'ymin=-5'. Also log a warning when this
  happens. (#385, #643, #645; thanks @tetedange13, @tskir, and @micknudsen)

`export gistic`:

- New export command to generate an unsegmented "markers" file for use with GISTIC.
  GISTIC also takes a second input file with corresponding segments in SEG format, which
  CNVkit can generate with `export seg`. (#622, #623, #776; thanks @tetedange13, @tskir,
  @BioComSoftware)

API and CLI changes
-------------------

- Running `cnvkit.py` without any arguments will now display the full help text instead
  of an error message.
- Supporting scripts (aside from `cnvkit.py`) are no longer installed automatically.
  They are still available in the source tree.

Documentation
-------------

- Clarified `bintest` usage, provided an example, and explained outputs. (#646; thanks
  @tetedange13 and @tskir)

Bugfixes
--------

- Fixed several errors and warnings due to outdated usage of dependencies, e.g. pandas,
  pysam.
- Fixed the Dockerfile and Docker image to install R packages properly for CNVkit to use
  internally. (#765; thanks @28rietd)
- Made the Makefile example/test workflow more portable across environments. (#661,
  #666, #695, #699; thanks @tetedange13)
- `batch`: Apply --drop-low-coverage option in the segmetrics step. (#694)
- `bintest`: Include 'probes' column in .cns output so that it is valid .cns (closes #693)
- `fix`: Condense the error message when coordinate set contains duplicate values. (#637,
  #638; thanks @tskir)
- `fix`: Choose a smoothing window fraction based on the data size to help correct
  biases better at the extremes of the GC range, where previously some residual GC bias
  could still be present after correction. (#379)
- BED inputs: Handle UCSC BED 'browser' header line, as used in Agilent BED files with a
  2-line header. (closes #696, #618)

Internal
--------

- Modernized the packaging configuration with pyproject.toml, leaving a stub setup.py
  for legacy setuptools compatibility. (#790)
- Set up automated testing through GitHub Actions (GHA) to verify Python versions 3.7
  through 3.10 using pytest and tox. Tox also makes local testing with multiple
  Python versions more reliable. (#792, #793, #794)
- Updated minimum dependency versions to roughly match Ubuntu 22.04 LTS packages; these
  are used in CI, too.
- Applied black and pylint to reformat the codebase consistently and replace deprecated
  calls to libraries. (#795)
- Remove joblib pinning (#589, #770; thanks @DavidCain and @risicle)
- Remove networkx pinning (#606, #771; thanks @DavidCain)
- Make the extreme-GC filters more easily configurable via `params.py` (#738, #752, #753,
  #764; thanks @tetedange13 and @tsivaarumugam)

v0.9.9

Version 0.9.9
-------------

This release contains a new script and, more importantly, a volley of bug fixes
by @tskir, a new CNVkit collaborator.

New script `genome_instability_index.py`:
- For each given sample (.cnr or .cns, ideally .call.cns), this script reports
  two values, the number of non-neutral segments and the fraction of the total
  sequencing-accessible genome that they cover. Together, these values have been
  described as the Genome Instability Index (G2I) by [Bonnet et al.
  (2012)](https://doi.org/10.1186/1755-8794-5-54). These numbers are not
  difficult to calculate directly from .cns files, but they are frequently
  requested, so here you go.
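
As a rough illustration of the two reported values, the sketch below counts non-neutral segments and the fraction of the accessible genome they cover. It is not the code of `genome_instability_index.py`: the function name, the tuple layout, and the rule that "non-neutral" means an integer copy number different from 2 are all assumptions made for this example.

```python
def genome_instability(segments, accessible_bp, neutral_cn=2):
    """Return (number of non-neutral segments, fraction of genome they cover).

    `segments` is a list of (start, end, copy_number) tuples that do not
    overlap; `accessible_bp` is the total sequencing-accessible genome size.
    """
    altered = [(start, end) for start, end, cn in segments if cn != neutral_cn]
    altered_bp = sum(end - start for start, end in altered)
    return len(altered), altered_bp / accessible_bp


# Hypothetical called segments on a 4 Mb accessible region.
segments = [
    (0, 1_000_000, 2),
    (1_000_000, 1_500_000, 3),
    (1_500_000, 3_000_000, 2),
    (3_000_000, 4_000_000, 1),
]
count, fraction = genome_instability(segments, accessible_bp=4_000_000)
print(count, fraction)  # 2 0.375
```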

Bug fixes by @tskir:

Installation:
- Set NetworkX minimum version to work with pomegranate on Python 3.9.
  (#614, #606; thanks @auberginekenobi)

genemetrics, diagram, scatter:

- Fix an error in iterating over chromosomes during gene-wise operations or
  gene selection. (#580, #573, #576, #579; thanks @diushiguzhi @eriktoo
  @hrkemp @drmrgd @HYan-lei)

access:

- Fix an error when all chromosomes listed in the exclusion BED file appear
  only once. (#581, #574; thanks @dajana17)

autobin:

- Allow specifying explicit output filenames via -o/--output. If this option is
  not used, the behavior is the same as before. Some pipeline frameworks such
  as Snakemake require output filenames to be explicit in wrapped commands.
  (#608, #607; thanks @enes-ak)
- Fix median-size file selection. (#613, #611; thanks @michaelsykes)

coverage:

- Fix a potential crash with the -c option; generally make the -c option's
  results more stable. This changes the results you'd get with `coverage -c`
  compared to previous CNVkit versions, but in any case -c isn't recommended
  for production use, only for algorithm exploration. (#598, #593; thanks
  @joys8998)

genemetrics:

- Rename column `n_bins` to `probes` in output, for compatibility with 'call'
  and 'export' commands. (#586, #585; thanks @eriktoo)

scatter:

- Avoid losing short segments in rasterized PNG output, depending on DPI
  settings.  (#615, #604; thanks @jimmy200340)
- Allow NCBI-style chromosome names that contain a ".", e.g. "NC_039902.1".
  (#603, #602; thanks @amora197)

segment:

- Fix an IndexError during smoothing when the signal is shorter than a window,
  e.g. on chrY where the chromosome contains few bins. (#590, #587; thanks
  @tetedange13)

Improvements from other contributors:

- scripts/guess_baits.py: Fix a copy-paste error on script launch.  (#588; thanks @sssimonyang)
- Documentation: Link to the Debian package alongside other packages. (#562; thanks @mr-c)

v0.9.8

Version 0.9.8
-------------

Continuing a focus on stability and compatibility with other software:

* Support for reading CRAM files with an optional user-provided local FASTA
  file for the reference genome sequence. (#555; thanks @johnegarza)
* Call Rscript subprocess with safer flags for the R environment. Previously,
  `--vanilla` ignored R environments with the library path in a non-default
  location specified in the user's .Rprofile. Now, `--no-restore` and
  `--no-environ` ensure a clean environment but still respect the user's
  .Rprofile settings beyond that. (#491; thanks @pablo-gar)
* Compatibility with the latest release of pandas. (#502, #523)

This release also fixes some regressions reported since the release of CNVkit
0.9.7 (which introduced a number of new performance optimizations).

* `scatter`: A bug when plotting a region of a chromosome. (#536, #457; thanks @tskir)
* `scatter`: An IndexError when plotting entire chromosomes, e.g. chr7. (#541,
  #461, #535; thanks @tskir)
* `fix`: A bug that occurred after automatic bias corrections, introducing
  NaN-valued rows in place of rejected bins, leading to a downstream crash in
  CBS segmentation. (#551, #436, #547; thanks @johnegarza)

v0.9.7

Version 0.9.7

Stable release with only minor changes from the previous beta release 0.9.7.b1.

New contributions:

- Cram support: Look for and use .cram + .crai alignment and index file pairs,
  in addition to .bam + .bai. (#495, #434; thanks @sridhar0605)
- Update Docker file to use Python 3 apt packages and pip3 (#493; thanks
  @keiranmraine)
- Documentation fix (#496; thanks @rollf)

v0.9.7.b1

travis: Workaround for OSX openssl dependency quirk

v0.9.7.b0

Version 0.9.7-beta

This release contains several major enhancements particularly relevant to germline
analysis. If used in production pipelines, further evaluation and benchmarking would be
wise. Highlights:

**Control sample clustering**: To make better use of larger reference sample pools,
`reference --cluster` will correlate the given normal samples' bin-wise coverage depths
to extract clusters to be used as reference profiles. The reference .cnn file produced
this way will then contain the `log2` and `spread` summary statistics for each cluster,
in addition to the global summary stats. Given this "clustered reference" profile, `fix
--cluster` will then correlate each test sample to each clustered `log2` profile in the
reference to choose the most relevant control pool for normalization. The `batch` option
`--cluster` will perform both these steps. Nod to Gambin lab and the authors of
ExomeDepth, CoNVaDING, CLAMMS, and others for inspiration. (#308)

Calculation of bin weights has changed. **This will change your segmentation results**,
hopefully for the better. Details below. (#429)

The `batch` pipeline now performs some **segmentation post-processing** automatically:
calculating and filtering segmentation calls by 50% confidence intervals of the segment
mean log2 ratios, in order to reduce false positives, followed by separate bin-level
testing to detect small (e.g. exon-size) CNVs that were not caught by segmentation.
The bin- and segment-level results are returned as separate .cns files; deciding whether
and how to combine or use these results together is left as an exercise for the user.
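
The confidence-interval filter can be pictured with the small sketch below, which keeps only segments whose interval around the mean log2 ratio excludes zero. It is an illustration under stated assumptions, not the logic `batch` runs internally: the helper name `ci_excludes_zero` and the dictionary keys are hypothetical, and the real pipeline may handle the remaining segments differently (for example by merging rather than dropping them).

```python
def ci_excludes_zero(ci_lo, ci_hi):
    """True if the interval [ci_lo, ci_hi] lies entirely above or below 0."""
    return ci_lo > 0 or ci_hi < 0


# Hypothetical segments with 50% confidence intervals of the mean log2 ratio.
segments = [
    {"name": "A", "log2": 0.45, "ci_lo": 0.30, "ci_hi": 0.60},
    {"name": "B", "log2": 0.10, "ci_lo": -0.05, "ci_hi": 0.25},
    {"name": "C", "log2": -0.80, "ci_lo": -1.00, "ci_hi": -0.55},
]
kept = [s["name"] for s in segments if ci_excludes_zero(s["ci_lo"], s["ci_hi"])]
print(kept)  # ['A', 'C']
```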

We've **dropped Python 2.7 support**. Python version 3.5 or later is now required.

This is a beta release. Please let me know how it works for you via the Issues page. If
this release contains any issues that are blocking your work, try installing one of the
previous stable versions 0.9.6 or 0.9.5::

    conda install cnvkit=0.9.6

Dependencies
------------

- Remove all Python 2.7 compatibility shims.
- Raise minimum pandas version from 0.20.1 to 0.23.3.
- Add scikit-learn (dependency of pomegranate, for HMM segmentation). Remove the older
  hmmlearn implementation.

Commands
--------

`batch`:

- Post-process segments with `segmetrics` (50% CI), `call` (filter by CI, but don't call
  integer copy number), and `bintest`.
- Return `bintest` result as a separate, independent .cns output.
- Add option '--segment-method', equivalent to `segment -m`.
- Rename option '--method' to '--seq-method' (but '--method' still accepted for now).
- Add option `--cluster`, passed to `reference` and `fix` if given. (#308)

`bintest`:

- New command superseding `cnv_ztest.py` script.
- Report p-value as a column `p_bintest` (previously `ztest`) in the .cns output.
- Fix probabilities for positive log2 values, i.e. gains, which previously always had
  p-value = 1.0. (#429)
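
For intuition about the gain/loss symmetry, here is a minimal sketch of a two-sided z-test p-value in which gains and losses of equal magnitude receive the same p-value. It is not necessarily the exact test `bintest` performs; the function name `two_sided_p` and the assumption of a normal null distribution centered on zero with a known standard deviation are mine.

```python
import math


def two_sided_p(log2, sd):
    """Two-sided p-value for a log2 ratio under a N(0, sd^2) null."""
    z = log2 / sd
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2))


gain = two_sided_p(0.6, 0.2)
loss = two_sided_p(-0.6, 0.2)
print(gain == loss)  # True
```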

`fix`:

- Change calculation of bin weights to be more consistent with `1-var` meaning,
  with more emphasis on reference spread. It is now simpler, more consistent with
  `import-rna`, and particularly improves the accuracy of `bintest`. (#429)
- Squeeze the range of reference-free weights
- Drop bins with gc outside [.3, .7]. CLAMMS paper shows these bins carry no useful
  signal.
- With `--cluster` and a clustered reference input, calculate the test sample's Pearson
  correlation versus each cluster's log2, and take the best one for normalization.
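
The cluster-selection step described in the last item can be sketched as follows: compute the Pearson correlation of the test sample's log2 values against each cluster profile and pick the highest. This is a simplified stand-in, not CNVkit's code; the names `pearson`, `best_cluster`, and the `cluster_1`/`cluster_2` labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def best_cluster(sample_log2, cluster_profiles):
    """Name of the cluster profile most correlated with the sample."""
    return max(
        cluster_profiles,
        key=lambda name: pearson(sample_log2, cluster_profiles[name]),
    )


# Hypothetical bin-level log2 profiles for two reference clusters.
clusters = {
    "cluster_1": [0.1, -0.2, 0.3, 0.0, -0.1],
    "cluster_2": [-0.3, 0.2, -0.1, 0.4, 0.1],
}
sample = [0.2, -0.3, 0.5, 0.1, -0.2]
print(best_cluster(sample, clusters))  # cluster_1
```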

`reference`:

- With `--cluster`, do k-means clustering of the sample bin-level read depth correlation
  matrix, per [Kusmirek et al. 2018](https://doi.org/10.1101/478313).
  Parameter k defaults to the cube root of number of samples. Only clusters of at least
  4 samples are kept for emitting summary statistics in the reference profile.
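
The two numeric rules above (the default k and the minimum cluster size) can be illustrated with a short sketch. Rounding the cube root to the nearest integer and the helper names `default_k` and `clusters_to_keep` are assumptions for this example; the clustering itself is left out.

```python
from collections import Counter


def default_k(n_samples):
    """Cube root of the sample count, rounded to an integer (at least 1)."""
    return max(1, round(n_samples ** (1 / 3)))


def clusters_to_keep(labels, min_size=4):
    """Cluster labels that have at least `min_size` member samples."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n >= min_size)


print(default_k(27))  # 3

# Hypothetical cluster assignments for 12 samples.
labels = [0] * 5 + [1] * 3 + [2] * 4
print(clusters_to_keep(labels))  # [0, 2]
```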

`segment`:

- hmm: Fix pomegranate-based implementation. Use iterative Savitzky-Golay smoothing with
  a narrow bandwidth.
- Use HMM for post-TCN segmentation on VCF allele freqs
- Add parameter for smoothing before CBS (thanks @EwaMarek)

`segmetrics`:

- Add 'ttest' option for 1-sample t-test p-value.
- Implement & expose --smooth-bootstrap option.  For smoothing, KDE bandwidth is based
  on each bin's weight as a proxy for the SD of its log2 ratio values.  To reduce the
  risk of over-smoothing on larger sample sizes, we use a loose interpretation of
  Silverman's Rule to reduce the bandwidth as the number of bins in a segment increases
  (k^-1/4).
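
A smoothed bootstrap along these lines might look like the sketch below: each resampled bin value is jittered with Gaussian noise whose width is that bin's SD estimate scaled by k^-1/4, where k is the number of bins in the segment. This is a hedged illustration rather than CNVkit's implementation. The function name, the choice to pass per-bin SDs directly (instead of deriving them from bin weights), and the percentile method for the interval are all assumptions.

```python
import random
import statistics


def smoothed_bootstrap_ci(values, sds, alpha=0.5, n_boot=1000, seed=0):
    """Percentile CI of the mean from a Gaussian-smoothed bootstrap.

    `values` are bin-level log2 ratios in one segment and `sds` are per-bin
    standard deviation estimates; alpha=0.5 gives a 50% interval.
    """
    rng = random.Random(seed)
    k = len(values)
    shrink = k ** -0.25  # bandwidth narrows as the segment gains bins
    means = []
    for _ in range(n_boot):
        picks = [rng.randrange(k) for _ in range(k)]  # resample with replacement
        jittered = [rng.gauss(values[i], sds[i] * shrink) for i in picks]
        means.append(statistics.fmean(jittered))
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Hypothetical five-bin segment with equal per-bin SD estimates.
lo, hi = smoothed_bootstrap_ci([0.40, 0.50, 0.60, 0.45, 0.55], [0.1] * 5)
print(lo, hi)
```

With only a handful of bins, a plain bootstrap can draw from very few distinct values; the added jitter is what lets the interval reflect per-bin uncertainty.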

API
---

- `do_heatmap`: Add 'ax' parameter (thanks @fbrundu)
- `CNA.residuals()`: speed; keep index intact in returned pd.Series
- smoothing: Linearly roll-off weights in mirrored wings.  Affects CNA.smoothed() /
  savgol, but not rolling median bias correction.
- Rename `CNA.smoothed()` to `CNA.smooth_log2()`, since it returns the smoothed log2
  values, not a new/altered CNA.

Bug fixes
---------

- `batch`: Fix argparse formatting issue (#466)
- `import-rna`: Fix a regression in reading 2-column per-gene counts (`-f counts`).
- `reference`: Fix sex inference/usage when creating haploid-x reference (#459; thanks
  @duartemolha)
- `scatter`: Use a safe matplotlib backend on OS X to avoid crash
- VariantArray: Fix/streamline indexing of variants by bin/segment

v0.9.6

Version 0.9.6
=============

Much-needed maintenance and bug fixes, for the most part. Some key dependencies
have changed, though this should be generally painless for you, and one or two
regressions introduced by recent optimizations have been fixed.

This will be the last CNVkit version to run on Python 2.7. The next major
release of pandas (0.25.0) will remove support for Python 2.7, and once that
happens it will become increasingly difficult to install future versions of
CNVkit on Python 2.7 -- so we're not going to try.

The segmentation method `flasso` depends on the R package `cghFLasso`, which is
unmaintained and has been removed from CRAN.  For now, `segment -m flasso` is
still supported if you already have `cghFLasso` installed. But given the above,
`flasso` will be removed from the next CNVkit version in favor of the HMM-based
methods.

Dependencies
------------

- Raised minimum pandas version from 0.18.1 to 0.20.1, and support up to 0.24.2,
  resolving some warnings and an error in pandas 0.22+. (#413; thanks @chapmanb)
- The soft dependency on `hmmlearn` is replaced with an explicit dependency on
  `pomegranate` for the HMM-based segmentation methods. This dependency will now
  be pulled in automatically when installing via `pip` or `conda`.
- The R package `cghFLasso` has been removed from CRAN, and therefore is no
  longer a dependency of CNVkit and will not be installed automatically through
  the standard `conda` installation method. (#419)

Commands
--------

`antitarget`:

- Be more specific in removing noncanonical chromosomes (e.g. alternate
  contigs, mitochondria) from the binned regions. This avoids skipping
  chromosomes of interest in some non-human genomes with non-numeric contig
  names, like yeast. (#388; credit for regexes to @brentp)

`coverage`:

- With `--count-reads`, use query aligned length to handle soft-clipped reads
  properly. Now the results with and without this option should be similar.
  (#411; thanks @desnar)

`segment`:

- For `-m flasso`, partition array by chromosome to avoid edge effects. (#409, #412; thanks @giladmishne)
- Removed the deprecated option `--rlibpath`; use `--rscript-path` instead.
- Note that the HMM methods are still provisional. A stable, supported version
  of these methods will be provided in the next CNVkit release.

Python API
----------

- `do_scatter` now returns a figure (#408; thanks @jeremy9959)

Bug fixes
---------

- `scatter`: Whole chromosomes can once again be specified with `-c`. (In the
  previous release, a chromosome without coordinates would cause an IndexError.)
  (#393)
- `import-rna`: Option --max-log2 can now be specified by users. (Previously,
  only the default value of +3.0 worked.)
- VCF I/O (`skgenome.tabio`): Support GATK 4's VCF files that contain records
  with empty ALT alleles, substituting zero if ALT AD is missing. (#391; thanks
  @chapmanb)
- Due to a certain versioning-dependent interaction between numpy, pandas,
  cython, and conda (details in numpy/numpy#432),
  CNVkit may have printed spurious RuntimeWarning messages which could be safely
  ignored. The current release attempts to silence these messages if they occur.
  (#390).