ScaDaMaLe Course site and book

Topic Modeling with SARS-CoV-2 Genome 🧬

Group Project Authors:

Hugo Werner
Gizem Çaylak

Video link: https://kth.box.com/s/y3jsb9lgp6cll6op15o6z77rchaefh24

Problem description:

SARS-CoV-2 is spreading across the world, and as it spreads, mutations are occurring. One way to understand the spread and the mutations is to explore the structure and information hidden in the genome.

Project goal:

The goal of this project is to explore a SARS-CoV-2 genome dataset and try to predict the geographic origin of a SARS-CoV-2 genome sample.

Data:

We will use publicly available SARS-CoV-2 genomes from NCBI together with their geographic region information.

| Geographic region | # samples |
|---|---|
| Africa | 397 |
| Asia | 2534 |
| Europe | 1418 |
| North America | 26836 |
| Oceania | 13304 |
| South America | 158 |

Data link: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqTypes=Nucleotide&VirusLineagess=Severe%20acute%20respiratory%20syndrome%20coronavirus%202%20(SARS-CoV-2),%20taxid:2697049

Background:

Genome: Sequence of nucleotides (A-T-G-C) [https://en.wikipedia.org/wiki/Genome]
k-mer: Subsequences of length k of a genome [https://en.wikipedia.org/wiki/K-mer]
Sequence analysis of SARS-CoV-2 genome reveals features important for vaccine design [https://www.nature.com/articles/s41598-020-72533-2]
Latent Dirichlet Allocation (LDA) tutorial from the course
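
To make the k-mer definition above concrete, here is a minimal sketch in Python. It assumes overlapping k-mers taken with a sliding window of stride 1, which is a common convention but is not stated in this overview; the function name `extract_kmers` is hypothetical.

```python
from typing import List


def extract_kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers of a nucleotide sequence.

    Assumption: a sliding window with stride 1 (overlapping k-mers).
    A sequence shorter than k yields an empty list.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# A sequence of length 6 contains 6 - 3 + 1 = 4 overlapping 3-mers.
print(extract_kmers("ATGCGA", k=3))  # ['ATG', 'TGC', 'GCG', 'CGA']
```

With this convention a genome of length n yields n - k + 1 k-mers, so each genome can be treated as a document whose words are its k-mers.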

Challenges:

How to encode a genome sequence?
- Project solution:
  - Represent each genome as k-mers and use CountVectorizer to convert the k-mers into a matrix of token counts (term-frequency table).
  - Extract features with Latent Dirichlet Allocation (LDA) by treating each genome sequence as a document and each 3-mer as a word. The corpus is then a collection of genomes, each consisting of 3-mers.
How to relate encoded features to the origins?
- Project solution: We used a Random Forest classifier and tried two feature sets: the topic distributions produced by LDA, and the k-mer frequencies directly. One advantage of this approach is interpretability: we can examine whether a topic relates positively or negatively to the origin.
How to handle the class imbalance? E.g. North America has 26836 samples but South America has only 158.
- Project solution: Use the F1 measure as the evaluation metric.
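
The motivation for F1 under class imbalance can be illustrated with the class sizes from the region table. The sketch below scores a trivial baseline that always predicts "North America". It assumes macro-averaged F1 (the unweighted mean of per-class F1); this overview does not say which averaging the project used, and the helper names are hypothetical.

```python
# Class sizes copied from the region table in this document.
class_counts = {
    "Africa": 397, "Asia": 2534, "Europe": 1418,
    "North America": 26836, "Oceania": 13304, "South America": 158,
}


def f1_for_label(y_true, y_pred, label):
    """F1 score of a single class, treated as one-vs-rest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Assumption: macro averaging, i.e. every class counts equally."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# One true label per sample, and a baseline that ignores its input.
y_true = [region for region, n in class_counts.items() for _ in range(n)]
y_pred = ["North America"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                   # about 0.60
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")   # about 0.13
```

The baseline learns nothing about the genomes, yet its accuracy looks respectable because one region dominates the data; the macro F1 exposes that five of the six classes are never predicted correctly.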

Project steps:

Get SARS-CoV-2 data from NCBI
Process data:
Extract 3-mers
Split the dataset into train and test sets with split ratio 0.7
Extract topic features: We used Latent Dirichlet Allocation (LDA) to extract patterns from the k-mer features.
Classify: We used a Random Forest classifier
- (Classification directly on k-mer features) To find a mapping from the extracted k-mer features to labels (multiclass problem).
- (Classification on LDA features) To find a mapping from the extracted topic distributions to labels (multiclass problem).
Evaluation: We use accuracy and F1 as our evaluation metrics. We compared classification on LDA features with classification directly on k-mer features to see whether LDA can summarize the data (and thus reduce the feature dimensionality).
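
The first steps of the pipeline above (3-mer counting and the train/test split) can be sketched with the Python standard library alone. This is a stand-in for illustration, not the project's implementation: the project names CountVectorizer for the term-frequency table, whereas the hypothetical helpers `kmer_counts` and `train_test_split` below use a fixed vocabulary of the 64 possible 3-mers over A, C, G, T, and they read the split ratio 0.7 as the training fraction, which is an assumption. The toy sequences are made up.

```python
import random
from collections import Counter
from itertools import product

K = 3
# Fixed vocabulary: all 4**3 = 64 possible 3-mers over the alphabet ACGT.
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_counts(sequence, k=K):
    """One row of the term-frequency table: counts of each vocabulary k-mer.

    k-mers containing other symbols (e.g. the ambiguity code N) are not in
    VOCAB and are therefore ignored in this sketch.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get(kmer, 0) for kmer in VOCAB]


def train_test_split(rows, train_ratio=0.7, seed=42):
    """Shuffle and cut at train_ratio (assumed to be the training fraction)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


# Toy (sequence, region) pairs, invented for illustration only.
samples = [("ATGATG", "Asia"), ("TTTACG", "Europe"), ("CGCGAT", "Africa"),
           ("GATTAC", "Oceania"), ("ACGTAC", "Asia"), ("TGCATG", "Europe"),
           ("CCATGG", "Africa"), ("AATTCC", "Oceania"), ("GGCCAA", "Asia"),
           ("TACGTA", "Europe")]

features = [(kmer_counts(seq), region) for seq, region in samples]
train, test = train_test_split(features)
print(len(VOCAB), len(train), len(test))  # 64 7 3
```

Each feature row has 64 entries, which is the input dimensionality that the LDA step then reduces to a smaller number of topic proportions per genome.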

What we lack mainly:

A biological interpretation of the results (whether the terms found in the topic distributions are significant or connected in a biological network).

