ScaDaMaLe Course site and book

Gaussian Analysis

Here we evaluate the results on normally distributed points.


import numpy as np
import os
import shutil
import glob
import matplotlib.pyplot as plt
import scipy as sp
import scipy.stats as stats


os.listdir('/dbfs/FileStore/group17/data/')

Reading the files.


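def read_csv(data_name):
  # Each data set directory is expected to contain exactly one CSV file.
  results = glob.glob('/dbfs/FileStore/group17/data/' + data_name + '/*.csv')
  assert(len(results) == 1)
  filepath = results[0]

  csv = np.loadtxt(filepath, delimiter=',')
  # Sort rows by the first column so that rows are in the same order across files.
  csv = csv[csv[:, 0].argsort()]
  return csv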


train_data = read_csv('gaussian_train')
test_data = read_csv('gaussian_test')
weights = read_csv('gaussian_weights')


def display_density(data, weights):
  fig = plt.figure(figsize=(10, 10))
  plt.scatter(data[:, 0], data[:, 1], weights / np.max(weights) * 50)
  display(fig)

True density visualization.


true_density = stats.multivariate_normal.pdf(test_data[:, 1:], mean=np.zeros(2))

display_density(test_data[:, 1:], true_density)
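As a sanity check on the true density, here is a minimal sketch of the closed form it should match. With `mean=np.zeros(2)` and the default covariance, `stats.multivariate_normal.pdf` evaluates the standard bivariate normal density, exp(-(x^2 + y^2) / 2) / (2 * pi). The function name `standard_normal_2d_pdf` is ours, for illustration only.

```python
import math

def standard_normal_2d_pdf(x, y):
  # Density of a 2D normal with zero mean and identity covariance.
  return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

# The density peaks at the origin with value 1 / (2 * pi)
# and depends only on the distance from the origin.
print(standard_normal_2d_pdf(0.0, 0.0))
print(standard_normal_2d_pdf(1.0, 2.0), standard_normal_2d_pdf(2.0, -1.0))
```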

Density obtained from our method.


display_density(test_data[:, 1:], weights[:, 1])

Density obtained from kernel density estimation with a tophat kernel.


from sklearn.neighbors import KernelDensity
kde = KernelDensity(kernel='tophat', bandwidth=0.13).fit(train_data[:, 1:])
kde_weights = kde.score_samples(test_data[:, 1:])
kde_weights = np.exp(kde_weights)

display_density(test_data[:, 1:], kde_weights)
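To make the tophat estimate concrete, here is a minimal pure-Python sketch of what such an estimator computes in two dimensions. It assumes the tophat kernel is the indicator of a disc of radius `bandwidth`, normalized to integrate to one. The function name `tophat_kde_2d` and the toy points are ours, and this is not meant to reproduce scikit-learn's implementation detail for detail.

```python
import math

def tophat_kde_2d(train_points, query, bandwidth):
  # Count the training points that fall strictly inside the disc around the query.
  inside = sum(
    1 for (px, py) in train_points
    if math.hypot(px - query[0], py - query[1]) < bandwidth
  )
  # Divide by n times the disc area so that the estimate integrates to one.
  return inside / (len(train_points) * math.pi * bandwidth ** 2)

toy_train = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]  # hypothetical data
print(tophat_kde_2d(toy_train, (0.0, 0.0), 0.13))  # two of three points are inside
print(tophat_kde_2d(toy_train, (5.0, 5.0), 0.13))  # no neighbours, so the estimate is zero
```

A query with no training point within the bandwidth gets density zero, so its log-density is negative infinity. All such points tie with each other, which matters for any rank-based comparison of the estimates.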

Density obtained from kernel density estimation with a Gaussian kernel.


kde = KernelDensity(kernel='gaussian', bandwidth=0.13).fit(train_data[:, 1:])
gauss_weights = kde.score_samples(test_data[:, 1:])
gauss_weights = np.exp(gauss_weights)

display_density(test_data[:, 1:], gauss_weights)

A simple computation of the number of inversions, i.e. pairs of points that the two arrays rank in opposite order.


def rank_loss(a, b):
  n = a.shape[0]
  assert(n == b.shape[0])
  ans = 0
  for i in range(n):
    for j in range(i + 1, n):
      if (a[i] - a[j]) * (b[i] - b[j]) < 0:
        ans += 1
  return ans
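The double loop above takes quadratic time, which becomes slow on larger test sets. The following is a sketch of an O(n log n) alternative that should return the same count, based on sorting by the first array and counting strict inversions in the second with a merge sort. The name `count_discordant_pairs` is ours. Ties are handled as in the loop above: a pair tied in either array is not counted.

```python
def count_discordant_pairs(a, b):
  # Counts pairs (i, j) with (a[i] - a[j]) * (b[i] - b[j]) < 0.
  if len(a) != len(b):
    raise ValueError("a and b must have the same length")
  # Sort by a, breaking ties by b, so pairs tied in a never appear inverted.
  seq = [y for _, y in sorted(zip(a, b))]

  def sort_and_count(values):
    if len(values) <= 1:
      return values, 0
    mid = len(values) // 2
    left, left_count = sort_and_count(values[:mid])
    right, right_count = sort_and_count(values[mid:])
    merged = []
    count = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
      if left[i] <= right[j]:
        merged.append(left[i])
        i += 1
      else:
        # right[j] is strictly smaller than every remaining element of left.
        merged.append(right[j])
        j += 1
        count += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, count

  return sort_and_count(seq)[1]

# Fully reversed order: all 4 * 3 / 2 = 6 pairs are inverted.
print(count_discordant_pairs([1, 2, 3, 4], [4, 3, 2, 1]))  # 6
```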

Comparison of losses. On this single test, our method gives the smallest loss.

An immediate direction for further work: a proper statistical comparison, including on data sets of different sizes.


rank_loss(weights[:, 1], true_density)


rank_loss(kde_weights, true_density)



rank_loss(gauss_weights, true_density)
