Project Offerings

Project Offerings for Semester 1, 2021

For 2021, I have the following projects on offer in large-scale distributed data management, next-generation data analytics, and data management for bio-data:

Data Privacy Analysis of Health Tracking Services
A Touch Interface for SQL Databases
MongoDB with Transactional Memory
Data Processing on MultiCore Machines: Building a Virtual Database Cluster
PowerDB: Freshness-aware Replication in a Database Cluster
Bio-Data Processing using Map/Reduce
Reporting and Dashboard Facility for TeachingDB
Vulnerability Assessment of Web-Database
Database Cluster Management Tool

If you are interested in any of these projects, please contact me by email or in person.

Projects on Next-Generation Data Analytics

Data Privacy Analysis of Health Tracking Services

In recent years, personal health tracking services have become very popular. These systems collect data from personal health sensors, such as health bands, step counters or smart watches, and provide health data analysis via graphical user interfaces. Some services additionally integrate social networking functionality, for example to share experiences or to provide additional motivation by comparing one's own health habits with those of peers. The underlying data processing infrastructure is typically cloud-based: data is collected locally and then sent to a central service hosted in a cloud data center, where the processing, sharing and visualisation are done.

The goal of this project is to compare popular health tracking services with regard to their processing infrastructure from the point of view of data privacy: How is the data collected, where is it processed, and is any data disclosed to other people or even organisations?

For students interested in taking on this project as a research project, this task can be extended to include the design of a distributed health tracking service with guaranteed data privacy and anonymization functionalities.

A Touch Interface for SQL Databases (Honours project)

More and more computing systems are produced with touch interfaces, from smartphones and tablets to the latest versions of desktop operating systems (Windows 8 and Mac OS X). At the same time, the basic interface to database systems is still SQL, a text-based query language that requires keyboard input and is hard for novice users to learn.

In our TouchQL project, we aim to develop a query 'language' that is based purely on a graphical schema representation and input gestures, and that allows users to query a relational database using a tablet computer.

An initial prototype of TouchQL already exists for Android devices; it supports basic selections, projections and natural joins over local databases.

The goal of this Honours project is to extend this system with a mechanism for grouping and aggregation, and also to support querying remote databases. The challenge in the latter part is to provide timely feedback to the user, because in TouchQL there is no separation between query formulation and query execution: users get immediate feedback on their intended actions on the actual data set. It would be an additional benefit if the student could port TouchQL from the Java-based Android to the Objective-C-based iOS.

Projects on Large-Scale Distributed Data Management:

MongoDB with Transactional Memory

An interesting recent development for server-class, multi-core CPUs is hardware transactional memory, which allows a CPU to execute short code sections with transactional guarantees: memory changes are kept only if the whole code section is executed without conflict with parallel threads; otherwise, program execution is transparently reset to the start of the transactional code and any previous changes are discarded. This is especially beneficial for the efficient execution of critical sections in multi-threaded programs.

In a previous project, we already investigated the core execution characteristics of Intel's hardware transactional memory for MySQL. The goal of this project is to extend this study to a popular NoSQL database, such as MongoDB. We are interested in identifying code sections that tend to become performance bottlenecks once the core count of a CPU gets large enough, because extended periods of blocking occur while all threads but one have to wait to enter the critical section (blocking mutex). These sections shall then be modified to use the hardware transactional memory extensions, and the resulting performance changes evaluated.
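
The keep-or-retry behaviour described above can be illustrated in software. The following is only a minimal sketch of optimistic execution with a fallback lock, written in Python for readability; it is not how the hardware extensions are programmed, and the names VersionedCell and run_transaction are hypothetical. Speculative work runs without holding a lock, the result is kept only if no other thread committed in the meantime, and after too many aborts the code falls back to a conventional critical section.

```python
import threading


class VersionedCell:
    """Shared value plus a version counter (software stand-in for conflict detection)."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0
        # Models only the atomic commit step, not the whole critical section.
        self.commit_lock = threading.Lock()


def run_transaction(cell, update, max_retries=10):
    """Apply update(old) -> new with all-or-nothing semantics; return the abort count."""
    aborts = 0
    for _ in range(max_retries):
        start_version = cell.version   # read the version first, then the value
        snapshot = cell.value
        new_value = update(snapshot)   # speculative work, no lock held
        with cell.commit_lock:
            if cell.version == start_version:
                # No conflicting commit happened: keep the change.
                cell.value = new_value
                cell.version += 1
                return aborts
        aborts += 1                    # conflict: discard the result and retry
    # Fallback path: behave like an ordinary blocking mutex.
    with cell.commit_lock:
        cell.value = update(cell.value)
        cell.version += 1
    return aborts


if __name__ == "__main__":
    cell = VersionedCell(0)

    def worker():
        for _ in range(1000):
            run_transaction(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final value:", cell.value)
```

The number of aborts depends on thread scheduling, which is the kind of effect the project would measure on real hardware.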

Data Processing on MultiCore Machines: Building a Virtual Database Cluster (Honours project)

Multi-core computers are becoming increasingly common for large servers. At the end of this year, server CPUs with 128 cores will become available. This poses a real challenge to database engines, as those are optimised for concurrent workloads that share resources and hide latency, rather than for large numbers of parallel cores that can run many queries completely independently. There is a body of work on optimising databases for distributed systems such as a cluster of databases. In this project, we are interested in learning how those techniques perform when applied to a single multi-core machine that is configured as a 'virtual' cluster by deploying several virtual machines on the same hardware. To this end, we have both a large multi-core machine and a small research cluster available as hardware platforms. The project student shall compare the performance of an open source DBMS on both platforms for a given workload and develop a new load distribution technique that optimises the performance of the virtual database cluster.

Database Cluster Management Tool (Software Development: MIT 12cp / TSP / Engineering Project / Undergraduate project for 1-2 students)

We have a small database cluster of 8 nodes which we use for several research projects. It is a multi-boot cluster (Linux and Windows 2003 Server) that can run different database engines, both commercial and open source, such as Oracle, Microsoft SQL Server, and PostgreSQL. We need a platform-independent monitoring tool with a GUI that helps us (a) keep track of the current cluster state and (b) reboot cluster nodes into different configurations. Ideally, it would also include a cluster allocation component to manage our research projects and allow us to use subsets of the cluster concurrently in different projects. This project shall conduct a study and review of relevant database cluster management tools and set up a suitable solution, possibly extended with self-developed software components.

Skills needed: Some experience with programming and databases; a system administration background is an advantage

Suitable majors: Databases, Software Engineering, Networking

PowerDB: Freshness-aware Replication in a Database Cluster (12-18cp MIT Project or TSP project)

This project aims to set up a freshness-aware replication engine for a cluster of databases. It will be based on an existing cluster coordinator called PowerDB that is written in C++ and optimised for SQL Server. The student shall install, configure and optimise this version on our new database cluster running PostgreSQL. The 18cp version of the project will additionally run performance and scalability tests on the new system.

Skills needed: Good knowledge of C++ and of databases

Suitable majors: Databases, Software Engineering, Computer Science

Projects on Database Applications:

Reporting and Dashboard Facility for TeachingDB (12cp Capstone Project): This project is about developing reporting functionality and a graphical dashboard for the School of Computer Science's teaching database.

Skills needed: PHP and JavaScript; good knowledge of web technologies and SQL databases.

Suitable for the following majors: Databases, Software Engineering

Web Database Vulnerability Analysis & Improvement (12cp Capstone Project): Web security is crucial nowadays. The goal of this project is to conduct a security analysis of a web database application, and to improve the implementation such that the identified security weaknesses are fixed. The web database in question is hosted within the School of Computer Science, is written in PHP and JavaScript, and runs on top of a MySQL database. This project will consist of two phases: In the first phase, the student will conduct a 'white-box' vulnerability analysis of the existing system with regard to code implementation, design issues, known security threats, and data privacy requirements. This will include inspecting the existing code base and system architecture, as well as an analysis of the system design against known vulnerabilities. In the second phase, the student shall fix any high-priority vulnerabilities found during the analysis, and implement a logging component that keeps track of data changes at runtime.

Skills needed: PHP and JavaScript; good knowledge of web technologies and SQL databases.

Suitable for the following majors: Databases, Software Engineering, IT Security

Web Content Mining Database (12cp to 18cp Capstone Project): The content of most websites changes constantly; this is particularly true for forums and news websites. The goal of this project is to implement an automated web content mining system that allows users to follow and analyse the changes of a given website over time. The intended system consists of two parts: The first part is a website tracker that periodically captures the content of a given website and stores the web content in a temporal text database. The student shall compare different open source solutions and, if possible, adapt one of them for this project. The second part is to perform proof-of-concept website monitoring with some simple explorative analysis of the captured content, such as: When is a site most active in terms of updates? Which topics are most popular? How can authors be classified by the articles they are writing?
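
To give a rough idea of the first part, the tracker could be sketched as below. This is a minimal illustration under stated assumptions, not a proposed design: it uses SQLite from the Python standard library as a stand-in for the temporal text database, stores a new snapshot only when the content hash differs from the latest one, and answers the "when is a site most active" question by counting stored changes per hour of day. The table layout and the function names (open_store, record_snapshot, fetch, updates_per_hour) are hypothetical.

```python
import hashlib
import sqlite3
import urllib.request


def open_store(path=":memory:"):
    """Open (or create) the snapshot store; one row per observed content change."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS snapshot (
               url TEXT NOT NULL,
               captured_at TEXT NOT NULL,   -- ISO 8601, e.g. 2021-03-01T09:00:00
               content_hash TEXT NOT NULL,
               content TEXT NOT NULL,
               PRIMARY KEY (url, captured_at))"""
    )
    return conn


def record_snapshot(conn, url, content, captured_at):
    """Store content if it differs from the latest snapshot; return True if stored."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    latest = conn.execute(
        "SELECT content_hash FROM snapshot WHERE url = ? "
        "ORDER BY captured_at DESC LIMIT 1",
        (url,),
    ).fetchone()
    if latest is not None and latest[0] == digest:
        return False  # unchanged since the last capture
    conn.execute(
        "INSERT INTO snapshot VALUES (?, ?, ?, ?)",
        (url, captured_at, digest, content),
    )
    conn.commit()
    return True


def fetch(url, timeout=10):
    """Download a page as text (no politeness delay or robots.txt handling here)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def updates_per_hour(conn, url):
    """Count stored changes per hour of day, keyed by a two-digit hour string."""
    rows = conn.execute(
        "SELECT substr(captured_at, 12, 2) AS hour, COUNT(*) FROM snapshot "
        "WHERE url = ? GROUP BY hour ORDER BY hour",
        (url,),
    ).fetchall()
    return dict(rows)
```

A periodic job would call fetch and record_snapshot for each tracked URL; the first capture of a site counts as a change in this sketch.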

Skills needed: Python; good knowledge of web technologies and SQL databases.

Suitable for the following majors: Databases, Software Engineering

Projects on Database Support for Bioinformatics:

Bio-Data Processing using Map/Reduce (Honours project or 18cp MIT Research Project)

This project will investigate the suitability of a map/reduce framework for the parallel processing of DNA fragment data (so-called 'short reads'). The student shall implement a short-read comparison algorithm on the database research cluster of the DBRG using the open source Hadoop system.
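
As a rough illustration of how a short-read comparison might be phrased in map/reduce terms, the sketch below runs the map, shuffle and reduce phases in a single Python process. It is not Hadoop code, and the comparison criterion (counting shared substrings of length k between pairs of reads) is an assumption made for the example; the actual algorithm is part of the project. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_read(read_id, sequence, k):
    """Map: emit (k-mer, read_id) for every length-k substring of the read."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], read_id


def shuffle(pairs):
    """Shuffle: group the emitted read ids by k-mer, as the framework would."""
    groups = defaultdict(set)
    for kmer, read_id in pairs:
        groups[kmer].add(read_id)
    return groups


def reduce_kmer(kmer, read_ids):
    """Reduce: emit ((read_a, read_b), 1) for every pair of reads sharing this k-mer."""
    for a, b in combinations(sorted(read_ids), 2):
        yield (a, b), 1


def shared_kmer_counts(reads, k=4):
    """Run all phases in-process and sum the shared k-mers per pair of reads."""
    mapped = (pair for rid, seq in reads.items() for pair in map_read(rid, seq, k))
    counts = defaultdict(int)
    for kmer, ids in shuffle(mapped).items():
        # On a cluster, this summation would be a second reduce step.
        for pair, one in reduce_kmer(kmer, ids):
            counts[pair] += one
    return dict(counts)
```

For example, the reads ACGTAC and CGTACG share the two 4-mers CGTA and GTAC, so that pair is reported with a count of 2.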

Skills needed: INFO2x20, COMP5138 or equivalent database course (INFO3404 would be perfect); good programming skills; Bioinformatics background not necessary

Suitable MIT majors: Databases, Software Engineering, Computer Science

I also maintain a list of former projects supervised by me in recent years.