
HDFS benchmarks #567

Draft: wants to merge 46 commits into base `future`

46 commits
229c91c
porting some of the for loops to hdfs
tammam1998 Apr 26, 2022
a11812d
porting oneliners to use hdfs
tammam1998 Apr 26, 2022
748f7ab
added gitingore for dependecy_untagling
tammam1998 Apr 26, 2022
f89cbdd
add stateless and pure function annotations
tammam1998 Apr 27, 2022
ab5c3b9
improve oneliners hdfs setup script
tammam1998 Apr 27, 2022
32a0b86
fix path bug in spell.sh
tammam1998 May 1, 2022
2b0876b
allow varying replication factor in tests
tammam1998 May 1, 2022
9060742
improve benchmark scripts
tammam1998 May 4, 2022
bc03de7
fix small bug
tammam1998 May 8, 2022
d3246da
Merge branch 'future' into hdfs_benchmarks
tammam1998 May 25, 2022
cc106bc
Merge remote-tracking branch 'origin/future' into hdfs_benchmarks
tammam1998 May 25, 2022
e933a36
port nlp scripts to distributed exec
tammam1998 May 30, 2022
3f65e42
replace non parallelizable tr with parallelizable variation
tammam1998 May 30, 2022
c884fff
Merge branch 'hdfs_benchmarks' of https://github.com/binpash/pash int…
tammam1998 May 30, 2022
9d11dd6
nlp eval script
tammam1998 May 30, 2022
0d9c1d8
Merge branch 'hdfs_benchmarks' of https://github.com/binpash/pash int…
tammam1998 May 30, 2022
eecff28
fix incorrect flags
tammam1998 May 30, 2022
4749aa4
fixed small issues in eval scripts
tammam1998 Jun 2, 2022
747527d
added gitignores to outputs and inputs
tammam1998 Jun 2, 2022
18f0a08
use temp ffiles n pure functions
tammam1998 Jun 2, 2022
fd06883
minor nlp fixes
tammam1998 Jun 2, 2022
c95e7e8
Merge remote-tracking branch 'origin/future' into hdfs_benchmarks
tammam1998 Jun 2, 2022
051df82
ported unix50 for distributed exec
tammam1998 Jun 3, 2022
be3fae2
ported analytics mts to distributed exec
tammam1998 Jun 3, 2022
170e7cd
added gitingore
tammam1998 Jun 3, 2022
017ef55
fix trigrams nlp
tammam1998 Jun 3, 2022
039334b
some fixes
tammam1998 Jun 6, 2022
99c3dbb
port some dependency untagling scripts to hdfs
tammam1998 Jun 6, 2022
6f0f5f3
small changes to eval scripts
tammam1998 Jun 6, 2022
9d82499
improve oneliners eval and setup scripts
tammam1998 Jun 7, 2022
c6c2c38
fix du installation scripts
tammam1998 Jun 7, 2022
a056c46
Add newly added benchmarks to the run all script
tammam1998 Jun 7, 2022
bce37cf
Merge branch 'hdfs_benchmarks' of https://github.com/binpash/pash int…
tammam1998 Jun 8, 2022
9ee0791
use gzip instead of zip for better streaming support
tammam1998 Jun 8, 2022
63fe614
small changes to setup script
tammam1998 Jun 8, 2022
badea8d
fix bug in pcap.sh
tammam1998 Jun 8, 2022
bb3f2e0
Add max-temp benchmark
tammam1998 Jun 11, 2022
8124626
fix typo
tammam1998 Jun 11, 2022
732c78c
fixes to eval scripts
tammam1998 Jun 11, 2022
55aee24
small bug
tammam1998 Jun 12, 2022
ea2e06e
Merge branch 'hdfs_benchmarks' of https://github.com/binpash/pash int…
tammam1998 Jun 15, 2022
e22dd83
fix small issues
tammam1998 Jun 15, 2022
6fc191d
change bigrams to be consistant and add hdfs put annotation
tammam1998 Jun 16, 2022
c2a60db
Merge branch 'hdfs_benchmarks' of https://github.com/binpash/pash int…
tammam1998 Jun 16, 2022
26ce867
fix leftover merge conflict
tammam1998 Jun 16, 2022
0dfd6f8
Merge branch 'hdfs_benchmarks' of https://github.com/binpash/pash int…
tammam1998 Jun 16, 2022
11 changes: 11 additions & 0 deletions annotations/hdfs.json
@@ -13,6 +13,17 @@
"outputs": ["stdout"],
"comments": "This represents hdfs dfs -cat <path>. Slightly hacky since we only check for -cat"
},
{
"predicate":
{
"operator": "exists",
"operands": ["-put"]
},
"class": "pure",
"inputs": ["stdin"],
"outputs": ["stdout"],
"comments": "Ideally we would use stdin-hyphen but unfortunately hdfs put deadlocks on fifo"
},
{
"predicate": "default",
"class": "side-effectful",
12 changes: 12 additions & 0 deletions annotations/pure_func.json
@@ -0,0 +1,12 @@
{
"command": "pure_func",
"cases":
[
{
"predicate": "default",
"class": "pure",
"inputs": ["stdin"],
"outputs": ["stdout"]
}
]
}
12 changes: 12 additions & 0 deletions annotations/stateless_func.json
@@ -0,0 +1,12 @@
{
"command": "stateless_func",
"cases":
[
{
"predicate": "default",
"class": "stateless",
"inputs": ["stdin"],
"outputs": ["stdout"]
}
]
}
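PaSh matches annotation files to commands by name, so a benchmark script can opt a shell function into parallelization simply by defining it as `pure_func` (or `stateless_func`) and exporting it, as the benchmarks below do. A minimal sketch; the `tr` body is an arbitrary stdin-to-stdout filter chosen for illustration, not taken from the benchmarks:

```shell
#!/bin/bash
# Any stdin->stdout filter named pure_func picks up the "pure"
# annotation above when PaSh plans the pipeline.
pure_func() {
    tr '[:lower:]' '[:upper:]'
}
export -f pure_func   # make the function visible to spawned subshells

echo 'hello' | pure_func   # prints HELLO
```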
3 changes: 3 additions & 0 deletions evaluation/distr_benchmarks/.gitignore
@@ -0,0 +1,3 @@
outputs
*.res.*
*.txt
21 changes: 21 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/1.sh
@@ -0,0 +1,21 @@
#!/bin/bash
# Vehicles on the road per day

# <in.csv sed 's/T..:..:..//' |
# awk -F, '!seen[$1 $3] {onroad[$1]++; seen[$1 $3] = 1}
# END { OFS = "\t"; for (d in onroad) print d, onroad[d]}' |
# sort > out1

# curl https://balab.aueb.gr/~dds/oasa-$(date --date='1 days ago' +'%y-%m-%d').bz2 |
# bzip2 -d | # decompress
# Replace the line below with the two lines above to stream the latest file
hdfs dfs -cat $IN | # assumes saved input
sed 's/T..:..:..//' | # hide times
cut -d ',' -f 1,3 | # keep only day and bus no
sort -u | # remove duplicate records due to time
cut -d ',' -f 1 | # keep all dates
sort | # preparing for uniq
uniq -c | # count vehicles per day
awk -v OFS="\t" "{print \$2,\$1}" # print date, then count

# diff out{1,}
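The dedup-then-count idiom above can be checked locally on a few mock records; the `date,line,bus` layout is an assumption inferred from the `cut` field numbers, and the values are illustrative:

```shell
#!/bin/bash
# Two sightings of bus B1 on the same day collapse into one record
# after sort -u, so uniq -c counts distinct vehicles per day.
printf '%s\n' \
  '2021-01-08T06:00:00,154,B1' \
  '2021-01-08T07:00:00,154,B1' \
  '2021-01-08T06:30:00,21,B2' \
  '2021-01-09T06:00:00,154,B1' |
sed 's/T..:..:..//' |   # strip times
cut -d ',' -f 1,3 |     # keep day and bus
sort -u |               # one record per day-bus pair
cut -d ',' -f 1 |       # keep only the date
sort | uniq -c |        # vehicles per day
awk -v OFS="\t" '{print $2,$1}'
# 2021-01-08	2
# 2021-01-09	1
```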
22 changes: 22 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/2.sh
@@ -0,0 +1,22 @@
#!/bin/bash
# Days a vehicle is on the road

# <in.csv sed 's/T..:..:..//' |
# awk -F, '!seen[$1 $3] {onroad[$3]++; seen[$1 $3] = 1}
# END { OFS = "\t"; for (d in onroad) print d, onroad[d]}' |
# sort -k2n >out1

# curl https://balab.aueb.gr/~dds/oasa-$(date --date='1 days ago' +'%y-%m-%d').bz2 |
# bzip2 -d | # decompress
# Replace the line below with the two lines above to stream the latest file
hdfs dfs -cat $IN | # assumes saved input
sed 's/T..:..:..//' | # hide times
cut -d ',' -f 3,1 | # keep only day and bus ID
sort -u | # removing duplicate day-buses
cut -d ',' -f 2 | # keep only bus ID
sort | # preparing for uniq
uniq -c | # count days per bus
sort -k1n | # sort by count, ascending
awk -v OFS="\t" "{print \$2,\$1}" # print bus ID, then count

# diff out{1,}
22 changes: 22 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/3.sh
@@ -0,0 +1,22 @@
#!/bin/bash
# Hours each vehicle is on the road

# <in.csv sed 's/T\(..\):..:../,\1/' |
# awk -F, '!seen[$1 $2 $4] {onroad[$4]++; seen[$1 $2 $4] = 1}
# END { OFS = "\t"; for (d in onroad) print d, onroad[d]}' |
# sort -k2n > out1

# curl https://balab.aueb.gr/~dds/oasa-$(date --date='1 days ago' +'%y-%m-%d').bz2 |
# bzip2 -d | # decompress
# Replace the line below with the two lines above to stream the latest file
hdfs dfs -cat $IN | # assumes saved input
sed 's/T\(..\):..:../,\1/' | # keep date and hour
cut -d ',' -f 1,2,4 | # keep date, hour, and bus ID
sort -u | # remove duplicate entries
cut -d ',' -f 3 | # keep only bus ID
sort | # prepare for uniq
uniq -c | # count hours per bus
sort -k1n | # sort by count, ascending
awk -v OFS="\t" "{print \$2,\$1}" # print bus ID, then count

# diff out{1,}
21 changes: 21 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/4.sh
@@ -0,0 +1,21 @@
#!/bin/bash
# Hours monitored each day

# <in.csv sed 's/T\(..\):..:../,\1/' |
# awk -F, '!seen[$1 $2] {hours[$1]++; seen[$1 $2] = 1}
# END { OFS = "\t"; for (d in hours) print d, hours[d]}' |
# sort

# curl https://balab.aueb.gr/~dds/oasa-$(date --date='1 days ago' +'%y-%m-%d').bz2 |
# bzip2 -d | # decompress
# Replace the line below with the two lines above to stream the latest file
hdfs dfs -cat $IN | # assumes saved input
sed 's/T\(..\):..:../,\1/' | # keep date and hour
cut -d ',' -f 1,2 | # keep only date and hour
sort -u | # remove duplicate entries
cut -d ',' -f 1 | # keep only date
sort | # prepare for uniq
uniq -c | # count hours per day
awk -v OFS="\t" "{print \$2,\$1}" # print date, then count

# diff out{1,}
18 changes: 18 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/5.sh
@@ -0,0 +1,18 @@
#!/bin/bash
# Hours each bus is active each day

# Records are day, hour, line, bus
<in.csv sed 's/T\(..\):..:../,\1/' | awk -F, '
!seen[$1 $2 $4] { seen[$1 $2 $4] = 1; hours[$1 $4]++; bus[$4] = 1; day[$1] = 1; }
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for (d in day)
printf("\t%s", d);
printf("\n");
for (b in bus) {
printf("%s", b);
for (d in day)
printf("\t%s", hours[d b]);
printf("\n");
}
}' > out
10 changes: 10 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/README.md
@@ -0,0 +1,10 @@
# Mass-Transport System Analytics

This set of scripts is part of [a recent study on OASA](https://insidestory.gr/article/noymera-leoforeia-athinas) by Diomidis Spinellis and Eleftheria Tsaliki. OASA is the mass-transport system serving the city of Athens.

1. `1.sh`: Vehicles on the road per day
2. `2.sh`: Days a vehicle is on the road
3. `3.sh`: Hours each vehicle is on the road
4. `4.sh`: Hours monitored each day
5. `5.sh`: Hours each bus is active each day

5 changes: 5 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/input/.gitignore
@@ -0,0 +1,5 @@
./oasa-2021-01-08.bz2
in*.csv
./out
./out1
*.out
33 changes: 33 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/input/setup.sh
@@ -0,0 +1,33 @@
#!/bin/bash

# #Check that we are in the appropriate directory where setup.sh is
# #https://stackoverflow.com/a/246128
# DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
# echo "changing to $DIR to run setup.sh"
# cd $DIR

PASH_TOP=${PASH_TOP:-$(git rev-parse --show-toplevel)}

if [[ "$1" == "-c" ]]; then
rm -f *.bz2 'in.csv' 'in_small.csv'
exit
fi

hdfs dfs -mkdir -p /analytics-mts # -p: do not fail if the directory already exists
if [ ! -f ./in.csv ] && [ "$1" != "--small" ]; then
# yesterday=$(date --date='1 days ago' +'%y-%m-%d')
# curl https://www.balab.aueb.gr/~dds/oasa-$yesterday.bz2 |
curl -sf 'https://www.balab.aueb.gr/~dds/oasa-2021-01-08.bz2' | bzip2 -d > in.csv
if [ $? -ne 0 ]; then
echo "oasa-2021-01-08.bz2 / bzip2 not available, contact the pash authors"
exit 1
fi
hdfs dfs -put in.csv /analytics-mts/in.csv
elif [ ! -f ./in_small.csv ] && [ "$1" = "--small" ]; then
if [ ! -f ./in_small.csv ]; then
echo "Generating small-size inputs"
# FIXME PR: Do we need all of them?
curl -sf 'http://pac-n4.csail.mit.edu:81/pash_data/small/in_small.csv' > in_small.csv
fi
hdfs dfs -put in_small.csv /analytics-mts/in_small.csv
fi
36 changes: 36 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/run-experiment.sh
@@ -0,0 +1,36 @@
#!/usr/bin/env bash

export PASH_TOP=${PASH_TOP:-$(git rev-parse --show-toplevel --show-superproject-working-tree)}

eval_dir="$PASH_TOP/evaluation/buses/"
results_dir="${eval_dir}/results/"

mkdir -p $results_dir

for i in 1 2 3 4
do
script="${eval_dir}/${i}.sh"
echo "Executing $script..."

seq_output=/tmp/seq_output
pash_width_16_no_cat_split_output=/tmp/pash_16_no_cat_split_output
pash_width_16_output=/tmp/pash_16_output

seq_time="${results_dir}/${i}_2_seq.time"
pash_width_16_no_cat_split_time="${results_dir}/${i}_16_distr_auto_split_fan_in_fan_out.time"
pash_width_16_time="${results_dir}/${i}_16_distr_auto_split.time"

echo "Executing the script with bash..."
{ time /bin/bash $script > $seq_output ; } 2> >(tee "${seq_time}" >&2)

echo "Executing the script with pash -w 16 without the cat-split optimization (log in: /tmp/pash_16_log)"
{ time $PASH_TOP/pa.sh -w 16 -d 1 --log_file /tmp/pash_16_no_cat_split_log --no_cat_split_vanish --output_time $script ; } 1> "$pash_width_16_no_cat_split_output" 2> >(tee "${pash_width_16_no_cat_split_time}" >&2)
echo "Checking for output equivalence..."
diff -s $seq_output $pash_width_16_no_cat_split_output | head

echo "Executing the script with pash -w 16 (log in: /tmp/pash_16_log)"
{ time $PASH_TOP/pa.sh -w 16 -d 1 --log_file /tmp/pash_16_log --output_time $script ; } 1> "$pash_width_16_output" 2> >(tee "${pash_width_16_time}" >&2)
echo "Checking for output equivalence..."
diff -s $seq_output $pash_width_16_output | head

done
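The timing capture in the loop above relies on `time` writing to the brace group's stderr. A stripped-down sketch of the same pattern (the file name is illustrative):

```shell
#!/bin/bash
# TIMEFORMAT=%R trims `time` output to wall-clock seconds; redirecting
# the brace group's stderr captures just that measurement.
TIMEFORMAT=%R
{ time sort </dev/null >/dev/null ; } 2> elapsed.time
cat elapsed.time   # e.g. 0.003
# The experiment script additionally pipes stderr through tee via
# process substitution, 2> >(tee file >&2), to save the measurement
# while still showing it on the terminal.
```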
75 changes: 75 additions & 0 deletions evaluation/distr_benchmarks/analytics-mts/run.distr.sh
@@ -0,0 +1,75 @@
PASH_FLAGS='--width 8 --r_split'
export TIMEFORMAT=%R

if [[ "$1" == "--small" ]]; then
export IN="/analytics-mts/in_small.csv"
else
export IN="/analytics-mts/in.csv"
fi

analytics-mts_bash(){
times_file="seq.res"
outputs_suffix="seq.out"
outputs_dir="outputs"

mkdir -p "$outputs_dir"

touch "$times_file"
cat "$times_file" >> "$times_file".d
echo executing MTS analytics $(date) | tee "$times_file"
echo '' >> "$times_file"
## FIXME 5.sh is not working yet
for number in `seq 4`
do
script="${number}"

printf -v pad %20s
padded_script="${script}.sh:${pad}"
padded_script=${padded_script:0:20}
# select the respective input
outputs_file="${outputs_dir}/${script}.${outputs_suffix}"

echo "${padded_script}" $({ time ./${script}.sh > "$outputs_file"; } 2>&1) | tee -a "$times_file"
done
}

analytics-mts_pash(){
flags=${1:-$PASH_FLAGS}
prefix=${2:-par}

times_file="$prefix.res"
outputs_suffix="$prefix.out"
time_suffix="$prefix.time"
outputs_dir="outputs"
pash_logs_dir="pash_logs_$prefix"

mkdir -p "$outputs_dir"
mkdir -p "$pash_logs_dir"

touch "$times_file"
cat "$times_file" >> "$times_file".d
echo executing MTS analytics with pash $(date) | tee "$times_file"
echo '' >> "$times_file"
## FIXME 5.sh is not working yet
for number in `seq 4`
do
script="${number}"

printf -v pad %20s
padded_script="${script}.sh:${pad}"
padded_script=${padded_script:0:20}
outputs_file="${outputs_dir}/${script}.${outputs_suffix}"
pash_log="${pash_logs_dir}/${script}.pash.log"
single_time_file="${outputs_dir}/${script}.${time_suffix}"

echo -n "${padded_script}" | tee -a "$times_file"
{ time "$PASH_TOP/pa.sh" $flags --log_file "${pash_log}" ${script}.sh > "$outputs_file"; } 2> "${single_time_file}"
cat "${single_time_file}" | tee -a "$times_file"
done
}

analytics-mts_bash

analytics-mts_pash "$PASH_FLAGS" "par"

analytics-mts_pash "$PASH_FLAGS --distributed_exec" "distr"
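The three-line padding dance in both functions is a plain-bash fixed-width formatter; sketched in isolation:

```shell
#!/bin/bash
# printf -v with a width but no argument stores a run of spaces in pad;
# appending it and slicing to 20 characters left-aligns the label.
printf -v pad %20s                    # pad = 20 spaces
padded_script="1.sh:${pad}"
padded_script=${padded_script:0:20}
echo "[${padded_script}]"             # prints [1.sh:               ]
```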
3 changes: 3 additions & 0 deletions evaluation/distr_benchmarks/dependency_untangling/.gitignore
@@ -0,0 +1,3 @@
input/*
!input/install-deps.sh
!setup.sh
@@ -0,0 +1,14 @@
#!/bin/bash
# compress all files in a directory
IN=${IN:-/dependency_untangling/pcap_data/}
OUT=${OUT:-$PASH_TOP/evaluation/distr_benchmarks/dependency_untangling/input/output/compress}

mkdir -p ${OUT}

for item in $(hdfs dfs -ls -C ${IN});
do
output_name=$(basename "$item").gz
hdfs dfs -cat "$item" | gzip -c > "$OUT/$output_name"
done

echo 'done';
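Because gzip compresses a plain byte stream, it can sit directly downstream of `hdfs dfs -cat`, unlike zip's archive format (the reason for the switch noted in the commit log). A local round-trip sketch; the sample file name is illustrative:

```shell
#!/bin/bash
# gzip -c writes the compressed stream to stdout; gzip -dc reverses it.
echo 'hello hdfs' | gzip -c > sample.gz
gzip -dc sample.gz            # prints hello hdfs
rm -f sample.gz
```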
18 changes: 18 additions & 0 deletions evaluation/distr_benchmarks/dependency_untangling/encrypt_files.sh
@@ -0,0 +1,18 @@
#!/bin/bash
# encrypt all files in a directory
IN=${IN:-/dependency_untangling/pcap_data}
OUT=${OUT:-$PASH_TOP/evaluation/distr_benchmarks/dependency_untangling/input/output/encrypt}
mkdir -p ${OUT}

pure_func() {
openssl enc -aes-256-cbc -pbkdf2 -iter 20000 -k 'key'
}
export -f pure_func

for item in $(hdfs dfs -ls -C ${IN});
do
output_name=$(basename $item).enc
hdfs dfs -cat $item | pure_func > $OUT/$output_name
done

echo 'done';
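The `pure_func` above is invertible: adding `-d` with the same flags decrypts its output, which gives a quick local sanity check (the literal key 'key' mirrors the benchmark's throwaway key):

```shell
#!/bin/bash
# Encrypt then decrypt with identical parameters; the round trip
# must reproduce the input exactly.
echo 'some payload' |
  openssl enc -aes-256-cbc -pbkdf2 -iter 20000 -k 'key' |
  openssl enc -d -aes-256-cbc -pbkdf2 -iter 20000 -k 'key'
# prints some payload
```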