Yesterday I decided to participate in a german contest called the “Bundeswettbewerb Informatik” – or short “BWINF”. It’s a computer science competition where hundreds of students from all germany take part in. It is divided into three rounds, and the final seven contestants are declared the winners and win some money and a grant, which is somewhat attractive to me. Anyways, because cs is one of my two main subjects – the other being maths – I was forced to participate in the first round together with the rest of my course, and as a team some of us – me included – passed the first round. At first I wasn’t interested in a competition, because you know this means working in your free time which is something I try to avoid normally. But I just thought that it might actually be fun and worthwile. Problem: the 2nd round started january and it ends tomorrow. 3 months worth of time lost to sloth.
The second round consists of 3 task of which you have to solve at least one, maximally two. I chose the task, where you have to write a program, which can read in a String together with two numbers k and l. Given these informations, you are supposed to find and output ALL substrings of the original string, which repeat at least k times and are at least l characters long. “Doesn’t sound too hard”, I thought to myself. Oh boy was I wrong.
I wrote the code in Java, because it’s the language I know best. I concentrated on the method that finds the substrings. I implemented this using two for-loops, going through every possibility of starting and ending offset, leading to as much as (n * (n+1)) / 2 substrings (or basically the sum of 1 + 2 + 3 + … + n) which I now have to count and cleanse of those that are always part of a larger substring of equal frequency. This worked with the given string “CAGGAGGATTA” and k = 2. The example output was AGGA, T, A, which my method correctly computed, so I decided to try the other examples. I searched the material section and found what would actually render my method useless: the second input example was the mitochondrial dna of humans, consisting of more than 17000 characters. My very naive method with O(n^2) operations would need a tremendous amount of time to compute the solution, if it wouldn’t crash because of insufficient memory soon after starting the program. If I’d have started in january, I at least could have minimized memory consumption and it maybe even could have managed to compute the right answer within this three months or so of runtime.
Awesome. I’ve tried everything, but the inevitable truth was that my algorithm sucked. Looking for other possibilities, I stumbled upon the suffix tree, which can be used to find all substrings in O(n), but I don’t have the time and the motivation to continue now. Well, seems like I’ll just try again next year…