ScaDaMaLe Course site and book

Word count vectors as features for classification, overlapping k-mers

Confusion matrix
5453.0     8.0    6.0    1.0    0.0
 135.0  2754.0    1.0    1.0    0.0
 211.0     4.0  297.0    4.0    0.0
 115.0     0.0    3.0  275.0    0.0
  42.0     1.0    2.0    3.0   48.0
   7.0     0.0    0.0    2.0    0.0

Summary Statistics
  • Accuracy = 0.942
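The overlapping and nonoverlapping k-mer features compared in this section differ only in the stride used when cutting a genome into k-mers. A minimal Python sketch for illustration (the course pipeline itself ran on Spark clusters; k = 3 here is an assumption, the k used in the experiments may differ):

```python
from collections import Counter

def kmers(seq: str, k: int = 3, overlapping: bool = True) -> list:
    """Cut a sequence into k-mers; stride 1 (overlapping) or k (nonoverlapping)."""
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

def count_vector(seq: str, k: int = 3, overlapping: bool = True) -> Counter:
    """Word-count vector over the k-mers of one genome (one 'document')."""
    return Counter(kmers(seq, k, overlapping))

seq = "ATGCGTAT"
print(kmers(seq, 3, overlapping=True))   # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAT']
print(kmers(seq, 3, overlapping=False))  # ['ATG', 'CGT'] (trailing 'AT' is dropped)
```

Overlapping extraction produces roughly k times as many tokens per genome, which is one source of the extra LDA iterations discussed below.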

Word count vectors as features for classification, nonoverlapping k-mers

Confusion matrix
5446.0     8.0   10.0    4.0    0.0
 137.0  2746.0    0.0    1.0    0.0
 186.0     7.0  314.0    9.0    0.0
 112.0     2.0    9.0  270.0    0.0
  43.0     3.0    1.0    3.0   46.0
  10.0     0.0    0.0    0.0    0.0

Summary Statistics
  • Accuracy = 0.941

Number of topics: 10, number of iterations: 100, nonoverlapping accuracy table:

Confusion matrix
5167.0   288.0    0.0   11.0    0.0
2414.0   472.0    0.0    1.0    0.0
 496.0    20.0    0.0    0.0    0.0
 217.0     5.0    0.0  170.0    0.0
  94.0     2.0    0.0    0.0    0.0
  28.0     0.0    0.0    3.0    0.0

Summary Statistics
  • Accuracy = 0.618
  • fm0 = 0.744
  • fm1 = 0.257
  • fm2 = 0.0
  • fm3 = 0.588
  • fm4 = 0.0
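The reported accuracy and per-class F-measures (fm0–fm4) follow directly from the confusion matrix; the numbers are consistent with rows being true classes and columns predicted labels (the sixth row is a class that gets no predicted column). A sketch that recomputes the metrics from the matrix above:

```python
def accuracy(cm):
    """Fraction of correctly classified samples: diagonal over grand total."""
    correct = sum(cm[i][i] for i in range(len(cm[0])))
    total = sum(sum(row) for row in cm)
    return correct / total

def f_measure(cm, c):
    """F-measure for class c: harmonic mean of precision and recall."""
    tp = cm[c][c]
    if tp == 0:
        return 0.0
    precision = tp / sum(row[c] for row in cm)  # over all predicted as c
    recall = tp / sum(cm[c])                    # over all truly c
    return 2 * precision * recall / (precision + recall)

cm = [
    [5167.0,  288.0, 0.0,  11.0, 0.0],
    [2414.0,  472.0, 0.0,   1.0, 0.0],
    [ 496.0,   20.0, 0.0,   0.0, 0.0],
    [ 217.0,    5.0, 0.0, 170.0, 0.0],
    [  94.0,    2.0, 0.0,   0.0, 0.0],
    [  28.0,    0.0, 0.0,   3.0, 0.0],
]
print(round(accuracy(cm), 3))                          # 0.619
print([round(f_measure(cm, c), 3) for c in range(5)])  # [0.744, 0.257, 0.0, 0.589, 0.0]
```

The tiny discrepancies against the table (0.618 vs 0.619, 0.588 vs 0.589) look like truncation-vs-rounding differences in the original report.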

Number of topics: 10, number of iterations: 1000, nonoverlapping accuracy table:

Confusion matrix
5290.0   113.0   10.0    3.0    0.0
1012.0  1667.0   22.0    0.0    3.0
 430.0    10.0   76.0    0.0    0.0
 213.0     6.0    3.0  169.0    0.0
  93.0     2.0    1.0    0.0    0.0
  31.0     0.0    0.0    0.0    0.0

Summary Statistics
  • Accuracy = 0.787
  • fm0 = 0.847
  • fm1 = 0.740
  • fm2 = 0.242
  • fm3 = 0.600
  • fm4 = 0.0

Number of topics: 20, number of iterations: 100, nonoverlapping accuracy table:

Confusion matrix
4174.0  1286.0    0.0    8.0    0.0
 829.0  2056.0    0.0    6.0    0.0
 408.0   108.0    0.0    0.0    0.0
 189.0    34.0    0.0  170.0    0.0
  48.0    48.0    0.0    0.0    0.0
  29.0     1.0    0.0    1.0    0.0

Summary Statistics
  • Accuracy = 0.681
  • fm0 = 0.749
  • fm1 = 0.640
  • fm2 = 0.0
  • fm3 = 0.588
  • fm4 = 0.0

Number of topics: 20, number of iterations: 20000, nonoverlapping accuracy table:

Confusion matrix
5373.0    89.0    3.0    3.0    0.0
 368.0  2522.0    0.0    1.0    0.0
 478.0    15.0   20.0    3.0    0.0
 193.0     8.0    0.0  192.0    0.0
  89.0     5.0    0.0    1.0    1.0
  14.0     0.0    0.0    0.0    0.0

Summary Statistics
  • Accuracy = 0.865
  • fm0 = 0.897
  • fm1 = 0.912
  • fm2 = 0.074
  • fm3 = 0.648
  • fm4 = 0.021

Number of topics: 20, number of iterations: 15000, overlapping accuracy table:

Confusion matrix
5419.0    25.0   12.0   12.0    0.0
 190.0  2687.0    7.0    7.0    0.0
 398.0    13.0   89.0   15.0    0.0
 193.0     2.0    2.0  196.0    0.0
  89.0     2.0    3.0    1.0    1.0
  11.0     0.0    0.0    0.0    0.0

Summary Statistics
  • Accuracy = 0.895
  • fm0 = 0.921
  • fm1 = 0.956
  • fm2 = 0.283
  • fm3 = 0.628
  • fm4 = 0.021

Number of topics: 50, number of iterations: 100, nonoverlapping accuracy table:

Confusion matrix
5250.0   217.0    0.0    1.0    0.0
2667.0   224.0    0.0    0.0    0.0
 503.0    13.0    0.0    0.0    0.0
 220.0     3.0    0.0  170.0    0.0
  94.0     2.0    0.0    0.0    0.0
  30.0     1.0    0.0    0.0    0.0

Summary Statistics
  • Accuracy = 0.601
  • fm0 = 0.738
  • fm1 = 0.134
  • fm2 = 0.0
  • fm3 = 0.603
  • fm4 = 0.0

Summary Tables for (LDA + Classification) and Classification

  • (LDA + Classification) Varying the number of topics with a fixed number of iterations = 100:
    # topics   Accuracy
    10         0.618
    20         0.681
    50         0.601

Conclusion: 20, approximately the number of amino acids, is a good candidate for the number of topics.

  • We tried both nonoverlapping and overlapping k-mer features directly on the classifier:
    Data type               Accuracy
    Nonoverlapping k-mers   0.941
    Overlapping k-mers      0.942
  • We also compared them on (LDA + classification) with the number of topics = 20. The nonoverlapping run used 20000 iterations, while the overlapping run used 15000; due to an 'unexpected shutdown' of the clusters we could not run LDA on overlapping k-mers for 20000 iterations. The results nevertheless suggest that overlapping k-mers help LDA learn the structure better in terms of predictive power (though not efficiently):
    Data type               Accuracy
    Nonoverlapping k-mers   0.865
    Overlapping k-mers      0.895

Conclusion: For the classifier, there is little difference between overlapping and nonoverlapping k-mer features; for LDA, however, overlapping k-mers help it learn more structure. The increased time complexity of overlapping k-mers requires LDA to run more iterations to learn.

  • (LDA + Classification) Varying the number of iterations and topics:
    (# iterations, # topics, overlapping)   Accuracy
    (100, 10, false)                        0.618
    (1000, 10, false)                       0.787
    (100, 20, false)                        0.681
    (15000, 20, true)                       0.895
    (20000, 20, false)                      0.865

Conclusion: Considering the topic summaries as well, we conclude that the number of iterations strongly affects topic diversity, and thus classifier performance. Although we could monitor convergence of the EM algorithm to decide when to stop, this might be problematic because computation time grows with the number of iterations.
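The stopping-rule concern can be phrased as a simple convergence check on a per-iteration objective such as the log-likelihood. A generic sketch (pure Python; the trace below is a dummy stand-in for an actual LDA run, not real output):

```python
def converged(history, tol=1e-4, patience=1):
    """True once the relative improvement stays below tol for `patience` steps."""
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    return all(
        abs(b - a) / max(abs(a), 1e-12) < tol
        for a, b in zip(recent, recent[1:])
    )

# Dummy log-likelihood trace standing in for an LDA run.
trace = [-1000.0, -900.0, -880.0, -879.9, -879.89]
print(converged(trace, tol=1e-4))  # True: the last step improved by < 0.01%
```

The trade-off the conclusion points at is visible here: each extra check costs a full EM iteration, so a tight `tol` can be expensive on a large corpus.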

  • Performance comparison between the best (LDA + Classification) run (number of topics = 20, number of iterations = 15000) and direct classification on k-mers:
    Method                   Accuracy
    (LDA + Classification)   0.895
    Direct Classification    0.942

Conclusion: We could not run a higher number of iterations (due to limited time and cluster restarts); even so, this result shows that LDA is capable of summarizing k-mer features. Once we learn the mapping from k-mers to the reduced topic-distribution space via LDA, we can use this reduced set of features to train the classifier. This makes the approach more scalable and may save computation time in the long run.

Our conclusions

  • What did we try that failed? What did we learn?
    • Overlapping k-mers with a low number of iterations led to poor topic diversity
    • With expectation maximization, the number of iterations LDA requires increases significantly; deciding where to stop becomes a problem due to computation-time concerns.
    • Changing the doc or term concentration did not lead to better results
    • We tried several numbers of topics (10, 20, 50), and #topics = 20, which is almost equal to the number of amino acids, yields the best result (coincidence, or is the number of amino acids a good choice for the number of topics?)
    • Using a fixed vocabulary of k-mers (over the nucleotide alphabet A-T-G-C) yields poor topic diversity [since there are sequencing errors in the genomes, different k-mers such as TAK and TAR appear; we concluded that these errors actually give some insight into the virus]
  • The comparison between the LDA-based classifier and the classifier trained directly on k-mers demonstrates that, with enough iterations, LDA is capable of summarising the data.
  • There is not much difference between overlapping and nonoverlapping k-mer features when we give them directly to the classifier. However, for LDA, overlapping features require a much higher number of iterations for convergence [or topic diversity].
  • Very final conclusion: Giving k-mers directly to the classifier works better. However, once we learn a reduced number of features from LDA (from k-mers to topic distributions), we can still get a good classification result, which can save computation time and make the approach scalable (by reducing the number of features to the number of topics chosen).
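The scalability point can be made concrete: over the plain A-C-G-T alphabet the k-mer vocabulary grows as 4^k (and larger still once ambiguity codes such as K or R from sequencing errors enter the vocabulary), while the topic representation stays fixed at the chosen number of topics. A small illustration (k values are hypothetical; 20 topics is the value used above):

```python
from itertools import product

def vocab_size(k: int, alphabet: str = "ACGT") -> int:
    """Number of distinct k-mers over the given alphabet."""
    return len(alphabet) ** k

# Feature dimensionality: raw k-mer counts vs. a 20-dimensional topic distribution.
for k in (3, 5, 8):
    print(f"k={k}: {vocab_size(k)} k-mer features -> 20 topic features")

# Sanity check: enumerating the 3-mers (codons) explicitly gives 4^3 = 64.
codons = ["".join(p) for p in product("ACGT", repeat=3)]
assert len(codons) == vocab_size(3) == 64
```

Already at k = 8 the raw representation has 65536 dimensions per genome, which is where mapping to 20 topic features starts to pay off.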

“Everything should be made as simple as possible, but no simpler.”