Hi,
Thank you for providing this useful tool. I have used xTea to identify TE insertions in a large cohort and have a few questions regarding quality control and population-level merging.
I am using your x_vcf_merger.py script to combine nearby insertions at the population level because, without merging, I end up with a very large number of rare insertions. I am only merging TEs of the same family (e.g., only merging Alus with Alus, not Alus with LINEs).
Looking at what was merged, I noticed that the script combines TEs regardless of the reported SVLEN, which can vary greatly. For example, full-length LINE1s should be around 6 kb, but the LINEs merged into a single insertion can have an SVLEN anywhere from ~100 bp to ~6 kb.
I have also noticed that the x_vcf_merger.py script reports the most common SVLEN (correct me if I'm wrong), which is often shorter than the expected SVLEN for the TE family.
I had a few questions regarding this:
- Would you recommend applying additional filtering based on SVLEN before running the merge script?
- Should filters be applied based on subclass ("two_sides_tprt_both", "one_half_side", etc.) before merging? If so, where can I find the definition of each subclass? From what I understand, "two_sides_tprt_both" is the most reliable, but it isn't always the most common.
- When I report SVLEN in my downstream analysis, should I use the SVLEN chosen by the x_vcf_merger.py script (even though it can be very small at times, even zero)? Or should I use the maximum SVLEN of the merged insertions (which seems to be closer to the expected length for the TE family)?
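To make the last question concrete, here is a minimal sketch of the comparison I have in mind. It is not xTea code: the function name `summarize_svlen` and the SVLEN values are made up for illustration, and I am assuming the merged record keeps the most common SVLEN, which may not be what the script actually does.

```python
from collections import Counter
from statistics import median


def summarize_svlen(svlens):
    """Summarize the SVLENs of the calls merged into one site (hypothetical helper)."""
    if not svlens:
        raise ValueError("cluster has no SVLEN values")
    # Most common value; on ties, Counter keeps the first one encountered.
    mode_len, _ = Counter(svlens).most_common(1)[0]
    return {"mode": mode_len, "median": median(svlens), "max": max(svlens)}


# Made-up SVLENs for LINE1 calls that ended up in one merged insertion.
example = [120, 120, 0, 6021, 5980, 120]
summary = summarize_svlen(example)
print(summary)
```

In this made-up cluster the most common SVLEN is far shorter than the maximum, which is the pattern I see in my merged output.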
Thank you for your help and for developing xTea.