Skip to content
This repository has been archived by the owner on Apr 9, 2020. It is now read-only.

Latest commit

 

History

History
36 lines (29 loc) · 2.1 KB

README.md

File metadata and controls

36 lines (29 loc) · 2.1 KB

Generalized Suffix Tree

Introduction

This project is the implementation of ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings algorithm. We optimized the vertical partition algorithm for best parallelism. This project is also a problem in The third term Cloud Computing Contest [Sildes]

Usage

This project build by maven, just open the project in Intellij Idea and compile&package into jar.

e.g. spark-submit --master spark://master:7077 --class GST.Main --executor-memory 6G --driver-memory 20G suffixtree-1.0-SNAPSHOT.jar "hdfs://master:9000/input/5000 1000" "hdfs://master:9000/output/5000 1000"

Experiments

The code write in Java and run on Apache Spark. We tested on 20GB memory 48 cores cluster, for 400MB gene dataset this program build suffix tree and save result only consumed 4.4 min.
Note: AAA_BBB means, dataset is composed by 'BBB' strings, the longest string no more than 'AAA' characters.

Dataset

  1. input-20161215
  2. exset-20161215

The first one contains 6 folders. The folder name consists of the length of string and the number of files e.g. "5000000 20". "5000000" means the length of the longest string no more than 5000000 characters. "20" means there are 20 text files in the folder.

The second one is dataset for checking the correctness of the program, which contains input dataset and output result. The result like :

4 3819.txt:284
4 3819.txt:679
3 3819.txt:350
3 3439.txt:913
4 3439.txt:826
4 3439.txt:867
3 3439.txt:805

Each line of contains three parts e.g. "4 3819.txt:284". "4" means the depth of the leaf node in the suffix tree. "3819.txt" is the file name. "284" means the start index of the leaf node in the file "3819.txt"