Cancer - Capstone Project

The document describes an objective to develop a comprehensive cancer disease data model providing factors involved in cancer. It outlines requirements for the data model including various data types and ability to handle ongoing changes. It also describes leveraging graph-oriented data stores, indexed document stores, and ontology-based concept definition. The data model classifies clinical variables and allows filtering cases. It provides visualizations of mutation frequencies and maps mutations to protein-coding regions.

Uploaded by

Sellam V

Objectives

Cancer is a complex, multifactorial disease that affects up to 40% of people across the world.
However, many mechanisms of cancer remain unclear because few studies are built on
systematic knowledge, which leads to ineffective treatment and/or transmission of genetic
defects. Here, we developed a cancer disease data model to provide a comprehensive
resource covering the various factors involved.

The data model is designed to maintain data and metadata consistency, integrity, and
availability while accommodating:

• Biospecimen, clinical, and cancer genomic data and metadata
• Multiple, disparate NCI ongoing projects
• Completely new, as yet unthought of projects
• Ongoing changes and technological progress
• Frequent and complex queries from both external users and internal administrators

To meet these requirements, the design and implementation of the data model leverages:

• Flexible but robust graph-oriented data stores
• Indexed document stores for API and front end performance
• Ontology-based concept and data element definition

Data Processing steps:

• Schema-based entity and relationship validation on loading
• Properties are key-value pairs associated with an entity. Properties cannot be nested,
which means that the value must be numerical, boolean, or a string, and cannot be
another key-value set. Properties can be either required or optional. The following
properties are of particular importance in constructing the cancer data model:
    o Type is a required property for all entities. Entity types
    include project, case, demographic, sample, read_group and others.
    o System properties are properties used in system operation and
    maintenance. They cannot be modified except under special circumstances.
    o Unique keys are properties, or combinations of properties, that can be used to
    uniquely identify the entity in the database. For example, the tuple
    (combination) of [ project_id, submitter_id ] is a unique key for most entities.
• Links define relationships between entities, and the multiplicity of those relationships
(e.g. one-to-one, one-to-many, many-to-many).
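
The property rules above can be illustrated with a minimal Python sketch. It is not the
project's actual loader: the function names, the KNOWN_TYPES set, and the sample values are
assumptions made for illustration only.

```python
from typing import Any, Dict, Tuple

# Hypothetical subset of entity types; the text lists these "and others".
KNOWN_TYPES = {"project", "case", "demographic", "sample", "read_group"}


def validate_properties(entity: Dict[str, Any]) -> None:
    """Raise ValueError if the entity breaks the flat-property rules."""
    if "type" not in entity:
        raise ValueError("'type' is a required property")
    if entity["type"] not in KNOWN_TYPES:
        raise ValueError(f"unknown entity type: {entity['type']!r}")
    for key, value in entity.items():
        # Properties cannot be nested: only numbers, booleans, and strings.
        if not isinstance(value, (int, float, bool, str)):
            raise ValueError(f"property {key!r} must be a number, boolean, or string")


def unique_key(entity: Dict[str, Any]) -> Tuple[str, str]:
    """Return the (project_id, submitter_id) tuple that identifies most entities."""
    return (entity["project_id"], entity["submitter_id"])


# Made-up example values.
case = {"type": "case", "project_id": "PROJ-A", "submitter_id": "CASE-001"}
validate_properties(case)  # passes silently

# A nested value is rejected because properties must stay flat.
nested = {"type": "case", "project_id": "PROJ-A", "submitter_id": {"id": 1}}
```

Calling validate_properties(nested) raises a ValueError, while unique_key(case) returns the
tuple that a loader could use to detect duplicate entities.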
Directed Acyclic Graph – Interconnected entities
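
One way to check that a set of links really forms a directed acyclic graph is a topological
sort. The sketch below uses Kahn's algorithm; the link list and its child-to-parent direction
are assumptions for illustration, not the project's actual schema.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple


def topological_order(links: Iterable[Tuple[str, str]]) -> List[str]:
    """Return entities in topological order; raise ValueError if the links contain a cycle."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in links:
        children[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    # Start from entities that no link points to.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Any entity left unvisited sits on a cycle.
    if len(order) != len(nodes):
        raise ValueError("links contain a cycle")
    return order


# Hypothetical links, written as (entity, entity it links to).
links = [
    ("case", "project"),
    ("sample", "case"),
    ("demographic", "case"),
    ("read_group", "sample"),
]
order = topological_order(links)
```

Adding a link such as ("project", "sample") to this list would close a loop, and
topological_order would then raise a ValueError.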
Users can filter by specific clinical variables, grouped into these categories:

• Demographic: Data that characterizes the patient by segmenting the population
(e.g. by age, sex, race, etc.).
• Diagnoses: Data from the investigation, analysis, and recognition of the presence and
nature of a disease, condition, or injury based on expressed signs and symptoms,
including the concise results of that investigation.
• Treatments: Records of the therapeutic agents administered to a patient, and their
intent, to alter the course of a pathologic process.
• Exposures: Clinically relevant patient information that does not result directly from
genetic predispositions.

The Cases tab gives an overview of all the cases/patients who correspond to the filters chosen
(Cohort).
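
Filtering cases into a cohort can be sketched as follows. This is an illustration only: it
assumes each case is held in memory as one dictionary per clinical category, with at most one
record per category, and the field names and values are made up.

```python
from typing import Any, Dict, List, Set, Tuple

Case = Dict[str, Dict[str, Any]]
Filters = Dict[Tuple[str, str], Set[Any]]


def build_cohort(cases: List[Case], filters: Filters) -> List[Case]:
    """Keep the cases whose value for every (category, field) filter is an allowed value."""
    cohort = []
    for case in cases:
        matches = all(
            case.get(category, {}).get(field) in allowed
            for (category, field), allowed in filters.items()
        )
        if matches:
            cohort.append(case)
    return cohort


# Made-up cases; "alcohol_history" is a hypothetical exposure field.
cases = [
    {"case": {"submitter_id": "CASE-001"},
     "demographic": {"gender": "female"},
     "exposures": {"alcohol_history": "no"}},
    {"case": {"submitter_id": "CASE-002"},
     "demographic": {"gender": "male"},
     "exposures": {"alcohol_history": "no"}},
    {"case": {"submitter_id": "CASE-003"},
     "demographic": {"gender": "female"},
     "exposures": {"alcohol_history": "yes"}},
]

filters = {
    ("demographic", "gender"): {"female"},
    ("exposures", "alcohol_history"): {"no"},
}
cohort = build_cohort(cases, filters)
```

Filters on different fields are combined with AND, while the set of allowed values within one
filter acts as OR. With the data above, only CASE-001 remains in the cohort.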

The top of this section contains pie charts that summarize the cohort by Primary Site,
Project, Disease Type, Gender, and Vital Status.

Below these pie charts is a table of cases, which can be exported, sorted, and saved
using the buttons on the right. It includes the following information:

• Case ID (Submitter ID): The Case ID / submitter ID of that case/patient (e.g. a TCGA
barcode).
• Project: The study name of the project to which the case belongs.
• Primary Site: The primary site of the cancer/project.
• Gender: The gender of the case.
• Files: The total number of files available for that case.
• Available Files per Data Category: Seven columns displaying the number of files
available in each of the seven data categories. These link to the files for the specific
case.
• # Mutations: The number of SSMs (simple somatic mutations) detected in that case.
• # Genes: The number of genes affected by mutations in that case.
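
The "# Mutations" and "# Genes" columns are counts of distinct items per case. A minimal
sketch of that counting, assuming SSM observations arrive as (case ID, mutation ID, gene)
tuples; the tuple layout and all identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mutation_counts(ssms: Iterable[Tuple[str, str, str]]) -> Dict[str, Dict[str, int]]:
    """Count distinct mutations and distinct affected genes for each case."""
    mutations_by_case = defaultdict(set)
    genes_by_case = defaultdict(set)
    for case_id, ssm_id, gene in ssms:
        mutations_by_case[case_id].add(ssm_id)
        genes_by_case[case_id].add(gene)
    return {
        case_id: {
            "mutations": len(mutations_by_case[case_id]),
            "genes": len(genes_by_case[case_id]),
        }
        for case_id in mutations_by_case
    }


# Made-up observations: CASE-001 has three mutations across two genes.
ssms = [
    ("CASE-001", "ssm-1", "GENE_A"),
    ("CASE-001", "ssm-2", "GENE_A"),
    ("CASE-001", "ssm-3", "GENE_B"),
    ("CASE-002", "ssm-1", "GENE_A"),
]
counts = mutation_counts(ssms)
```

Sets are used so that a mutation or gene reported more than once for the same case is
counted only once.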

Note: By default, the UUID is not displayed on summary page tables. You can display the
UUID by clicking on the icon with 3 parallel lines and checking the UUID option.

Case Summary Page

The Case Summary Page displays case details including the project and disease information,
data files that are available for that case, and the experimental strategies employed.

CLINICAL AND BIOSPECIMEN INFORMATION

The page also provides clinical and biospecimen information about that case. Links to export
clinical and biospecimen information in JSON format are provided.

Some clinical record types (Diagnoses, Family Histories, Exposures, Follow-Ups, Molecular
Tests) can hold multiple records for the same case. If only one record exists, the UUID of
the record is provided at the top of the corresponding tab.

If there are multiple records, they are listed as horizontal tabs.


Some record types are nested under another. For example, a Diagnosis record may
have multiple associated Treatment records, and a Follow-Up record may have multiple
associated Molecular Test records. The associated sub-records are listed in a table on the tab.
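
The nesting described above maps naturally onto JSON. The sketch below shows one possible
shape for an exported clinical record; the key names and identifiers are assumptions, not
the actual export format.

```python
import json
from typing import Any, Dict

# Hypothetical clinical record: treatments nest under a diagnosis,
# and molecular tests nest under a follow-up.
case_clinical: Dict[str, Any] = {
    "submitter_id": "CASE-001",
    "diagnoses": [
        {
            "submitter_id": "CASE-001-DX1",
            "treatments": [
                {"submitter_id": "CASE-001-TX1"},
                {"submitter_id": "CASE-001-TX2"},
            ],
        }
    ],
    "follow_ups": [
        {
            "submitter_id": "CASE-001-FU1",
            "molecular_tests": [{"submitter_id": "CASE-001-MT1"}],
        }
    ],
}


def count_sub_records(case: Dict[str, Any], parent: str, child: str) -> Dict[str, int]:
    """Map each parent record's submitter_id to its number of nested child records."""
    return {
        record["submitter_id"]: len(record.get(child, []))
        for record in case.get(parent, [])
    }


# Serialize for export, then read it back to confirm nothing is lost.
exported = json.dumps(case_clinical, indent=2)
restored = json.loads(exported)
treatments_per_diagnosis = count_sub_records(restored, "diagnoses", "treatments")
```

Here treatments_per_diagnosis reports two Treatment records under the single Diagnosis,
which corresponds to the table of sub-records shown on the tab.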

Data Visualization

A table and two bar graphs show how many cases are affected by mutations and copy number
variation within the gene, as both a ratio and a percentage. Each row or bar corresponds to
one project. The final column in the table lists the number of unique mutations
observed on the gene for each project.
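
The per-project figures can be computed as in the sketch below. It covers mutations only
(not copy number variation) and assumes observations for one gene arrive as (project, case
ID, mutation ID) tuples; the names and numbers are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def gene_summary(
    total_cases: Dict[str, int],
    observations: Iterable[Tuple[str, str, str]],
) -> List[Dict[str, object]]:
    """Build one table row per project: affected ratio, percentage, unique mutations."""
    affected = defaultdict(set)
    mutations = defaultdict(set)
    for project, case_id, mutation_id in observations:
        affected[project].add(case_id)
        mutations[project].add(mutation_id)

    rows = []
    for project in sorted(total_cases):
        total = total_cases[project]
        n_affected = len(affected[project])
        # Guard against a project with no cases.
        percent = round(100.0 * n_affected / total, 2) if total else 0.0
        rows.append({
            "project": project,
            "ratio": f"{n_affected} / {total}",
            "percent": percent,
            "unique_mutations": len(mutations[project]),
        })
    return rows


# Made-up data: 2 of 4 cases affected in PROJ-A, 1 of 10 in PROJ-B.
total_cases = {"PROJ-A": 4, "PROJ-B": 10}
observations = [
    ("PROJ-A", "c1", "m1"),
    ("PROJ-A", "c1", "m2"),
    ("PROJ-A", "c2", "m1"),
    ("PROJ-B", "c9", "m3"),
]
rows = gene_summary(total_cases, observations)
```

Each returned row corresponds to one table row or one bar: the ratio and percentage describe
affected cases, and unique_mutations fills the final column.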

PROTEIN VIEWER

Mutations and their frequency across cases are mapped to a graphical visualization of
protein-coding regions with a lollipop plot. Pfam domains are highlighted along the x-axis to
assign functionality to specific protein-coding regions. The bottom track represents a view of
the full gene length. Different transcripts can be selected by using the drop-down menu above
the plot.
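
The data behind a lollipop plot can be prepared as in the following sketch: each point is an
amino-acid position, the number of cases mutated there, and the domain covering that
position. The input layout, domain names, and coordinates are assumptions for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple

Domain = Tuple[str, int, int]          # (name, start, end), inclusive positions
Point = Tuple[int, int, Optional[str]]  # (position, case count, domain name or None)


def lollipop_points(
    mutations: Iterable[Tuple[str, int]],
    domains: List[Domain],
) -> List[Point]:
    """Aggregate (case_id, position) pairs into one point per protein position."""
    cases_at = defaultdict(set)
    for case_id, position in mutations:
        cases_at[position].add(case_id)

    points = []
    for position in sorted(cases_at):
        # Find the first domain whose range contains this position, if any.
        domain = next(
            (name for name, start, end in domains if start <= position <= end),
            None,
        )
        points.append((position, len(cases_at[position]), domain))
    return points


# Made-up domains and mutations.
domains = [("DOMAIN_1", 10, 50), ("DOMAIN_2", 100, 150)]
mutations = [("c1", 12), ("c2", 12), ("c3", 120), ("c1", 75)]
points = lollipop_points(mutations, domains)
```

In a plot, the position would give the x-coordinate of each lollipop, the case count its
height, and the domain name the highlighted region along the x-axis.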
