Group Project Authors:
- Hugo Werner
- Gizem Çaylak
Video link: https://kth.box.com/s/y3jsb9lgp6cll6op15o6z77rchaefh24
Problem description:
SARS-CoV-2 is spreading across the world, and as it spreads, mutations are occurring. One way to understand the spread and the mutations is to explore the structure and information hidden in the genome.
Project goal:
The goal of this project is to explore a SARS-CoV-2 genome dataset and to predict the geographic origin of a SARS-CoV-2 genome sample.
Data:
We use publicly available SARS-CoV-2 genomes from NCBI, together with their geographic region information.
Geographic region | #samples
---|---
Africa | 397
Asia | 2534
Europe | 1418
North America | 26836
Oceania | 13304
South America | 158
Data link: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqTypes=Nucleotide&VirusLineagess=Severe%20acute%20respiratory%20syndrome%20coronavirus%202%20(SARS-CoV-2),%20taxid:2697049
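
The region counts above are heavily skewed toward North America. As a rough illustration of what that means for evaluation, the sketch below scores a hypothetical baseline that always predicts the largest class. The baseline and the choice of macro averaging are assumptions made for this example, not part of the project.

```python
# Sample counts per region, copied from the table above.
counts = {
    "Africa": 397,
    "Asia": 2534,
    "Europe": 1418,
    "North America": 26836,
    "Oceania": 13304,
    "South America": 158,
}

total = sum(counts.values())
majority = max(counts, key=counts.get)

# Hypothetical baseline: always predict the largest class.
accuracy = counts[majority] / total


def f1_for_class(region):
    """F1 of the always-majority baseline for a single region."""
    if region != majority:
        return 0.0  # the region is never predicted, so its recall is 0
    precision = counts[majority] / total  # every sample gets the majority label
    recall = 1.0
    return 2 * precision * recall / (precision + recall)


# Macro averaging (an assumption here) weights every region equally.
macro_f1 = sum(f1_for_class(region) for region in counts) / len(counts)

print(f"baseline accuracy: {accuracy:.3f}")
print(f"baseline macro F1: {macro_f1:.3f}")
```

With these counts the baseline reaches an accuracy of about 0.60 but a macro F1 of only about 0.13, so accuracy alone can look acceptable even when five of the six regions are never predicted.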
Background:
- Genome: Sequence of nucleotides (A-T-G-C) [https://en.wikipedia.org/wiki/Genome]
- k-mer: Subsequences of length k of a genome [https://en.wikipedia.org/wiki/K-mer]
- Sequence analysis of SARS-CoV-2 genome reveals features important for vaccine design [https://www.nature.com/articles/s41598-020-72533-2]
- Latent Dirichlet Allocation (LDA) tutorial from the course
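
To make the k-mer definition concrete, here is a minimal sketch that lists the k-mers of a short sequence. The function name `extract_kmers`, the toy sequence, and the use of overlapping windows (stride 1) are assumptions for illustration.

```python
def extract_kmers(sequence, k=3):
    """Return all overlapping substrings of length k, from left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Toy sequence for illustration only, not taken from the dataset.
toy_sequence = "ATGCGA"
kmers = extract_kmers(toy_sequence, k=3)
print(kmers)  # ['ATG', 'TGC', 'GCG', 'CGA']
```

A sequence of length n yields n - k + 1 overlapping k-mers, and a sequence shorter than k yields none.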
Challenges:
- How to encode a genome sequence?
  - Project solution:
    - Represent each genome as k-mers and use CountVectorizer to convert the k-mers into a matrix of token counts (a term-frequency table).
    - Extract features with Latent Dirichlet Allocation (LDA) by treating each genome sequence as a document and each 3-mer as a word, so the corpus is a collection of genomes made up of 3-mers.
- How to relate the encoded features to the origins?
  - Project solution: We used a Random Forest Classifier and tried two feature sets: the topic distributions (the LDA output) and the k-mer frequencies directly. One advantage of this setup is interpretability: we can examine whether a topic relates positively or negatively to an origin.
- How to handle the class imbalance? For example, North America has 26836 samples but South America has only 158.
  - Project solution: Use the F1 score as the metric.
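
The encoding described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the helper `to_kmer_document`, the toy sequences, and the number of topics are placeholders.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def to_kmer_document(sequence, k=3):
    """Join the overlapping k-mers of a sequence into a space-separated 'document'."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequences for illustration only; real genomes are far longer.
sequences = ["ATGCGATACG", "ATGCGTTAGC", "CCGTAGGCTA", "CCGTAGCATA"]
documents = [to_kmer_document(s) for s in sequences]

# Term-frequency table: one row per genome, one column per distinct 3-mer.
# lowercase=False keeps the k-mers in upper case; the default token pattern
# already accepts three-letter words.
vectorizer = CountVectorizer(lowercase=False)
kmer_counts = vectorizer.fit_transform(documents)

# n_components (the number of topics) is a placeholder value.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(kmer_counts)

print(kmer_counts.shape)     # (number of genomes, number of distinct 3-mers)
print(topic_features.shape)  # (number of genomes, number of topics)
```

Each row of `topic_features` is a topic distribution for one genome, which is what replaces the much wider k-mer count row as classifier input.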
Project steps:
- Get SARS-CoV-2 data from NCBI.
- Process data:
  - Extract 3-mers.
  - Split the data into train and test sets with a split ratio of 0.7.
- Extract topic features: We used Latent Dirichlet Allocation to extract patterns from the k-mer features.
- Classify: We used a Random Forest Classifier in two settings:
  - Classification directly on k-mer features: find a mapping from the extracted k-mer features to the labels (a multiclass problem).
  - Classification on LDA features: find a mapping from the extracted topic distributions to the labels (a multiclass problem).
- Evaluation: We use accuracy and F1 as our evaluation metrics. We compared classification on LDA features against classification directly on k-mer features to see whether LDA can summarize the data and thus reduce the feature dimensionality.
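
The split, classification, and evaluation steps above could look roughly like the sketch below. It runs on synthetic counts rather than the NCBI data, and several choices are assumptions: reading the 0.7 ratio as the training fraction, the stratified split, the macro-averaged F1, and all hyperparameter values.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 "genomes", 64 possible 3-mers,
# and 3 made-up region labels whose k-mer rates differ.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
rates = rng.uniform(5, 15, size=(3, 64))
kmer_counts = rng.poisson(rates[labels])

# Assumption: a split ratio of 0.7 means 70% of the samples go to training.
X_train, X_test, y_train, y_test = train_test_split(
    kmer_counts, labels, train_size=0.7, stratify=labels, random_state=0
)

# Fit LDA on the training split only, then project both splits onto the topics.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
T_train = lda.fit_transform(X_train)
T_test = lda.transform(X_test)


def evaluate(train_features, test_features):
    """Train a random forest and return (accuracy, macro F1) on the test split."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train_features, y_train)
    predictions = model.predict(test_features)
    return (
        accuracy_score(y_test, predictions),
        f1_score(y_test, predictions, average="macro"),
    )


results = {
    "k-mer counts": evaluate(X_train, X_test),
    "LDA topics": evaluate(T_train, T_test),
}
for name, (accuracy, macro_f1) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

Fitting LDA on the training split alone keeps the test genomes out of feature extraction, so the two feature sets are compared on the same unseen samples.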
What we lack mainly:
- A biological interpretation of the results (whether the terms found in the topic distributions are significant or connected in a biological network).