Dev #32

Merged
merged 108 commits into from
Oct 15, 2024

Commits
5cae88f
Added unit test and created an easy opticluster function
GregJohnsonJr Apr 24, 2024
37ec06d
Release v0.0.1 (#1)
GregJohnsonJr Apr 29, 2024
931b70b
Add cpp test (#3)
GregJohnsonJr May 17, 2024
9efe436
RMD Check is able to run successfully!
GregJohnsonJr May 31, 2024
678d5a3
Correcting the paths of my cpp files, should fix the action errors.
GregJohnsonJr May 31, 2024
2c99c18
Update to the cluster command test fixture
GregJohnsonJr May 31, 2024
8f3cbc1
Modifying the test for opticluster
GregJohnsonJr Jun 3, 2024
626e70a
Ensuring everything works with c++11
GregJohnsonJr Jun 3, 2024
5b1bdb0
Removing code issues from cluster command
GregJohnsonJr Jun 3, 2024
e35e710
Adding the build ignore
GregJohnsonJr Jun 3, 2024
11a41dd
Found some issues where I am using C++17 syntax and not C++11.
GregJohnsonJr Jun 5, 2024
8fcff5d
Github action fixes, needed to update syntax towards cpp 11
GregJohnsonJr Jun 6, 2024
6eb79ec
Modified the testing structure by removing the "Opticluster returns p…
GregJohnsonJr Jun 6, 2024
50c3a7c
Fix cluster unit test (#5)
GregJohnsonJr Jun 10, 2024
b717404
Printing out the metrics after you perform a cluster and added a true…
GregJohnsonJr Jun 10, 2024
77ebc1c
Release polish (#6)
GregJohnsonJr Jun 14, 2024
58a4056
Added a depends for lazy-loading and other R related issues.
GregJohnsonJr Jun 14, 2024
d958121
More cluster features (#7)
GregJohnsonJr Jul 12, 2024
12aaa2e
Merge branch 'master' into dev
GregJohnsonJr Jul 12, 2024
2682ec5
The fix for github actions.
GregJohnsonJr Jul 12, 2024
b7a77be
Change to the include file.
GregJohnsonJr Jul 12, 2024
869656e
Removing srand from Utils, going to attempt to set seeds inside of R.
GregJohnsonJr Jul 12, 2024
fc8b722
Fix for race condition issue.
GregJohnsonJr Jul 15, 2024
8603183
Fix for RCMD check warnings
GregJohnsonJr Jul 15, 2024
79fb369
The fix for the windows version of RMD Check!
GregJohnsonJr Jul 16, 2024
1a4256b
Adding dependency for time.
GregJohnsonJr Jul 16, 2024
8e0ae22
Make shared (#9)
GregJohnsonJr Aug 30, 2024
7564d48
Forgot a unit test. (#10)
GregJohnsonJr Sep 3, 2024
3bd7dea
Fix results (#11)
GregJohnsonJr Sep 10, 2024
2be797e
Removing and fixing check issues.
GregJohnsonJr Sep 10, 2024
a445421
Fix compilation warnings (#12)
GregJohnsonJr Sep 11, 2024
e7d8625
Fix for negative index value
GregJohnsonJr Sep 11, 2024
ad47beb
Cleaning up build notes.
GregJohnsonJr Sep 11, 2024
ba93c19
Merge branch 'master' into dev
GregJohnsonJr Sep 11, 2024
e6a4c9f
lintr fixes
GregJohnsonJr Sep 11, 2024
386a0c7
Fix for lintr
GregJohnsonJr Sep 11, 2024
7ea9c0b
Read phylip files (#14)
GregJohnsonJr Sep 12, 2024
fa25af7
Initial push
GregJohnsonJr Sep 12, 2024
25a357d
Adding r documentation about mothur and clustur
GregJohnsonJr Sep 12, 2024
e6f00a8
Added functionality for column distance file reading!
GregJohnsonJr Sep 13, 2024
6f830c6
Column distance files work!
GregJohnsonJr Sep 13, 2024
3d8015f
Adding read column feature (#15)
GregJohnsonJr Sep 16, 2024
42162e1
Documentation (#16)
GregJohnsonJr Sep 16, 2024
d7dc294
Fix for opticluster clustering.
GregJohnsonJr Sep 16, 2024
37cdb7e
Fixing up the documentation
GregJohnsonJr Sep 16, 2024
48a0f38
I am getting the same number of bins!
GregJohnsonJr Sep 16, 2024
4c63f8c
example data
GregJohnsonJr Sep 17, 2024
58a7e8e
Fix for test error
GregJohnsonJr Sep 17, 2024
0d3e798
Testing values to RMD file
GregJohnsonJr Sep 17, 2024
3bddc53
Small changes
GregJohnsonJr Sep 18, 2024
d486604
Added sorting by bin size to cluster output and fixed the clustering …
GregJohnsonJr Sep 18, 2024
d89afb5
Modification to the test!
GregJohnsonJr Sep 18, 2024
60d16f5
Updates to test file
GregJohnsonJr Sep 18, 2024
acfbc9a
Cleaning up test
GregJohnsonJr Sep 18, 2024
585736e
Small change
GregJohnsonJr Sep 18, 2024
1623601
Method to check if each cluster exist in the dataframe
GregJohnsonJr Sep 21, 2024
d08a209
Using content paths instead of absolutes
GregJohnsonJr Sep 21, 2024
a745510
Create 96_sq_column_results_mac.list
GregJohnsonJr Sep 23, 2024
5af269d
Pushing results for different operating systems
GregJohnsonJr Sep 23, 2024
3cb0ba4
Updating documentation
GregJohnsonJr Sep 24, 2024
7c694e3
Added inst folders
GregJohnsonJr Sep 24, 2024
0b5a145
Update Cluster.R
GregJohnsonJr Sep 24, 2024
7eec4fb
Pushing the temporary fix!
GregJohnsonJr Sep 24, 2024
63d515e
Pushing spare_matrix data file
GregJohnsonJr Sep 24, 2024
6a5df20
Squashed commit of the following:
GregJohnsonJr Sep 24, 2024
a714b7d
Creating vignettes
GregJohnsonJr Sep 25, 2024
4a4fcfa
Created base pkgdown structure
GregJohnsonJr Sep 25, 2024
05c8a07
Base structure of documentation and website
GregJohnsonJr Sep 25, 2024
8c648bd
Small optimization to clustur
GregJohnsonJr Sep 25, 2024
e87319e
Fixing unit test
GregJohnsonJr Sep 25, 2024
1251c3c
Removing comments
GregJohnsonJr Sep 25, 2024
2c642f3
Changed the name of the package to clustur
GregJohnsonJr Sep 25, 2024
3f04170
Removing unneeded data and fixing issue to validate count_table
GregJohnsonJr Sep 25, 2024
3f2d399
Fixing check errors.
GregJohnsonJr Sep 25, 2024
10c2ad9
Consistent randomization (#17)
GregJohnsonJr Sep 26, 2024
45ba179
Consistent randomization (#18)
GregJohnsonJr Sep 26, 2024
6d30db8
Squashed commit of the following:
GregJohnsonJr Sep 26, 2024
00b5013
Merge branch 'Documentation' into dev
GregJohnsonJr Sep 26, 2024
439668a
Documentation (#16) (#19) (#20)
GregJohnsonJr Sep 26, 2024
e2e52c4
Removing old vignette
GregJohnsonJr Sep 26, 2024
ea4d749
Adding additional documentation
GregJohnsonJr Sep 29, 2024
d62037b
Adding links (#22)
GregJohnsonJr Sep 30, 2024
012fc12
Moving RDS file
GregJohnsonJr Sep 30, 2024
3471ba2
Small changes to test
GregJohnsonJr Sep 30, 2024
7501509
Merge branch 'master' into dev
GregJohnsonJr Sep 30, 2024
c1cab5b
Adding a vignette, fixed the tests that were failing, and removed old …
GregJohnsonJr Sep 30, 2024
a9e8095
Small change to test
GregJohnsonJr Sep 30, 2024
69b13c4
Pushing lintr fixes
GregJohnsonJr Sep 30, 2024
4f758f2
Distance files to sparse matrix (#23)
GregJohnsonJr Oct 3, 2024
43b21cf
Unify clustering (#25)
GregJohnsonJr Oct 4, 2024
64812ba
Refactor package methods (#26)
GregJohnsonJr Oct 11, 2024
a8a0863
Added tests for the validate_count_table function
GregJohnsonJr Oct 11, 2024
aca2d2e
Removing R profile from tracking
GregJohnsonJr Oct 13, 2024
cfa9e93
Merge branch 'master' into dev
GregJohnsonJr Oct 13, 2024
f2b152f
Delete .Rprofile
GregJohnsonJr Oct 13, 2024
4a8a2cf
Fix lintr and pkgdown issues
GregJohnsonJr Oct 13, 2024
ca49407
Fix for pkgdown and lintr
GregJohnsonJr Oct 13, 2024
046ed48
Change to test in cluster_object-getters.
GregJohnsonJr Oct 13, 2024
4a5b787
Squashed commit of the following:
GregJohnsonJr Oct 14, 2024
8e509d9
Sync branches (#29)
GregJohnsonJr Oct 14, 2024
3d7cddb
Fix for reading column and phylip files.
GregJohnsonJr Oct 14, 2024
8ad307c
removing data files
GregJohnsonJr Oct 14, 2024
dc9c807
Code profiling (#31)
GregJohnsonJr Oct 14, 2024
b93d89a
Removing extra storage files and merge issues
GregJohnsonJr Oct 14, 2024
1b26f74
Added New test for determining if the file is a phylip or column file…
GregJohnsonJr Oct 15, 2024
9c59c09
Merge branch 'master' into dev
GregJohnsonJr Oct 15, 2024
82de80a
lintr fix
GregJohnsonJr Oct 15, 2024
f8f524a
Merge branch 'dev' of https://github.com/SchlossLab/Clustur into dev
GregJohnsonJr Oct 15, 2024
1 change: 1 addition & 0 deletions R/Cluster.R
@@ -176,6 +176,7 @@ example_path <- function(file = NULL) {
   return(path)
 }

+
 #' Read Count
 #'
 #' @export
4 changes: 4 additions & 0 deletions R/RcppExports.R
@@ -9,6 +9,10 @@ WriteColumnFile <- function(xPosition, yPosition, data, cutoff, countTable, save
     invisible(.Call('_clustur_WriteColumnFile', PACKAGE = 'clustur', xPosition, yPosition, data, cutoff, countTable, saveLocation))
 }

+DetermineIfPhylipOrColumnFile <- function(filePath) {
+    .Call('_clustur_DetermineIfPhylipOrColumnFile', PACKAGE = 'clustur', filePath)
+}
+
 ProcessDistanceFiles <- function(filePath, countTable, cutoff, isSim) {
     .Call('_clustur_ProcessDistanceFiles', PACKAGE = 'clustur', filePath, countTable, cutoff, isSim)
 }
2 changes: 2 additions & 0 deletions src/Adapters/CountTableAdapter.h
@@ -27,11 +27,13 @@ class CountTableAdapter {
     Rcpp::DataFrame GetCountTable() const {return countTable;}
     Rcpp::DataFrame ReCreateDataFrame() const;
 private:
+    void CreateNameToIndex();
     struct IndexAbundancePair {
         int groupIndex;
         int sequenceIndex;
         double abundance;
     };
+    std::unordered_map<std::string, size_t> nameToRowIndex;
     std::vector<std::string> sampleNames;
     std::unordered_map<std::string, std::vector<double>> dataFrameMap;
     std::vector<std::string> groups;
32 changes: 19 additions & 13 deletions src/CountTableAdapter.cpp
@@ -29,6 +29,7 @@ bool CountTableAdapter::CreateDataFrameMap(const Rcpp::DataFrame &countTable) {
     // We only want the actual group names. so everything after
     groups.insert(groups.end(), columnNames.begin() + 2, columnNames.end());
     this->countTable = countTable;
+    CreateNameToIndex();
     return true;
 }

@@ -86,29 +87,27 @@ bool CountTableAdapter::CreateDataFrameMapFromSparseCountTable(const Rcpp::DataF
     dataFrameMap = data;
     // In a count table, the first two columns are the sequence and the total abundance.
     // We only want the actual group names. so everything after

     this->countTable = countTable;
+    CreateNameToIndex();
     return true;
-
 }

 double CountTableAdapter::FindAbundanceBasedOnGroup(const std::string &group, const std::string &sampleName) const {
-    if (std::find(groups.begin(), groups.end(), group) == groups.end())
-        return -1; //Not Found, may need to throw an exception...
-    if (std::find(sampleNames.begin(), sampleNames.end(), sampleName) == sampleNames.end())
-        return -1; //Not Found, may need to throw an exception...
+    // We will preprocess the find during the read dist process. So remove special checks
+    // - Protip hashmap find is faster than vector
+    if(nameToRowIndex.find(sampleName) == nameToRowIndex.end())
+        return -1;
     const std::vector<double> groupCol = GetColumnByName(group);
-    const long index = std::distance(sampleNames.begin(), std::find(sampleNames.begin(),
-        sampleNames.end(), sampleName));
-    return dataFrameMap.at(group)[index];
+    return dataFrameMap.at(group)[nameToRowIndex.at(sampleName)];
 }

 double CountTableAdapter::FindTotalAbundance(const std::string &sampleName) const {
-    if(std::find(sampleNames.begin(), sampleNames.end(), sampleName) == sampleNames.end())
-        return -1; // Not found
-    const long index = std::distance(sampleNames.begin(), std::find(sampleNames.begin(),
-        sampleNames.end(), sampleName));
-    return dataFrameMap.at("total")[index];
+    // We will preprocess the find during the read dist process. So remove special checks
+    // - Protip hashmap find is faster than vector
+    if(nameToRowIndex.find(sampleName) == nameToRowIndex.end())
+        return -1;
+    return dataFrameMap.at("total")[nameToRowIndex.at(sampleName)];
 }

 std::string CountTableAdapter::GetNameByIndex(const int index) const {
@@ -148,6 +147,13 @@ Rcpp::DataFrame CountTableAdapter::ReCreateDataFrame() const {
     return countTable;
 }

+
+void CountTableAdapter::CreateNameToIndex() {
+    for(size_t i = 0; i < sampleNames.size(); i++) {
+        nameToRowIndex[sampleNames[i]] = i;
+    }
+}
+
 // Gets every column but the first column (the sequence names)
 std::vector<double> CountTableAdapter::GetColumnByName(const std::string &name) const {
     if (dataFrameMap.find(name) != dataFrameMap.end())
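Note on the CountTableAdapter.cpp change: the diff swaps a linear `std::find` over `sampleNames` for a lookup in the precomputed `nameToRowIndex` map. The sketch below isolates that idea so the two lookups can be compared side by side. It is only an illustration under stated assumptions: `NameIndex`, `FindLinear`, and `FindHashed` are hypothetical names that do not exist in the package, and the class is a stand-in rather than the real adapter.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the adapter's name lookup (not the package's class).
class NameIndex {
public:
    explicit NameIndex(std::vector<std::string> names) : names_(std::move(names)) {
        // Build the name -> row index map once, mirroring what the
        // CreateNameToIndex() loop in the diff does.
        index_.reserve(names_.size());
        for (std::size_t i = 0; i < names_.size(); ++i) {
            index_[names_[i]] = i;  // a later duplicate overwrites an earlier one
        }
    }

    // Old approach: scan the vector on every call. Returns -1 when absent.
    long FindLinear(const std::string& name) const {
        const auto it = std::find(names_.begin(), names_.end(), name);
        if (it == names_.end()) return -1;
        return static_cast<long>(std::distance(names_.begin(), it));
    }

    // New approach: one hash lookup per call. Returns -1 when absent.
    long FindHashed(const std::string& name) const {
        const auto it = index_.find(name);
        if (it == index_.end()) return -1;
        return static_cast<long>(it->second);
    }

private:
    std::vector<std::string> names_;
    std::unordered_map<std::string, std::size_t> index_;
};
```

One behavioral difference to keep in mind: if a name appears more than once, the linear scan returns the first matching row, while a map filled by assignment in a loop keeps the last one. For count tables with unique sequence names the two agree.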
18 changes: 15 additions & 3 deletions src/RcppExports.cpp
@@ -40,16 +40,27 @@ BEGIN_RCPP
     return R_NilValue;
 END_RCPP
 }
+// DetermineIfPhylipOrColumnFile
+bool DetermineIfPhylipOrColumnFile(const std::string& filePath);
+RcppExport SEXP _clustur_DetermineIfPhylipOrColumnFile(SEXP filePathSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< const std::string& >::type filePath(filePathSEXP);
+    rcpp_result_gen = Rcpp::wrap(DetermineIfPhylipOrColumnFile(filePath));
+    return rcpp_result_gen;
+END_RCPP
+}
 // ProcessDistanceFiles
-SEXP ProcessDistanceFiles(const std::string& filePath, const Rcpp::DataFrame& countTable, double cutoff, bool isSim);
+SEXP ProcessDistanceFiles(const std::string& filePath, const Rcpp::DataFrame& countTable, const double cutoff, const bool isSim);
 RcppExport SEXP _clustur_ProcessDistanceFiles(SEXP filePathSEXP, SEXP countTableSEXP, SEXP cutoffSEXP, SEXP isSimSEXP) {
 BEGIN_RCPP
     Rcpp::RObject rcpp_result_gen;
     Rcpp::RNGScope rcpp_rngScope_gen;
     Rcpp::traits::input_parameter< const std::string& >::type filePath(filePathSEXP);
     Rcpp::traits::input_parameter< const Rcpp::DataFrame& >::type countTable(countTableSEXP);
-    Rcpp::traits::input_parameter< double >::type cutoff(cutoffSEXP);
-    Rcpp::traits::input_parameter< bool >::type isSim(isSimSEXP);
+    Rcpp::traits::input_parameter< const double >::type cutoff(cutoffSEXP);
+    Rcpp::traits::input_parameter< const bool >::type isSim(isSimSEXP);
     rcpp_result_gen = Rcpp::wrap(ProcessDistanceFiles(filePath, countTable, cutoff, isSim));
     return rcpp_result_gen;
 END_RCPP
@@ -132,6 +143,7 @@ RcppExport SEXP run_testthat_tests(SEXP);
 static const R_CallMethodDef CallEntries[] = {
     {"_clustur_WritePhylipFile", (DL_FUNC) &_clustur_WritePhylipFile, 6},
     {"_clustur_WriteColumnFile", (DL_FUNC) &_clustur_WriteColumnFile, 6},
+    {"_clustur_DetermineIfPhylipOrColumnFile", (DL_FUNC) &_clustur_DetermineIfPhylipOrColumnFile, 1},
     {"_clustur_ProcessDistanceFiles", (DL_FUNC) &_clustur_ProcessDistanceFiles, 4},
     {"_clustur_ProcessSparseMatrix", (DL_FUNC) &_clustur_ProcessSparseMatrix, 6},
     {"_clustur_GetDistanceDataFrame", (DL_FUNC) &_clustur_GetDistanceDataFrame, 1},
15 changes: 10 additions & 5 deletions src/main.cpp
@@ -13,7 +13,6 @@
 #include "MothurDependencies/ColumnDistanceMatrixReader.h"
 #include "MothurDependencies/SharedFileBuilder.h"
 #include "Adapters/DistanceFileReader.h"
-#include "Tests/OptimatrixAdapterTestFixture.h"
 #if DEBUG_RCPP
 #include <Rcpp.h>
 #include <cctype>
@@ -53,9 +52,8 @@ Rcpp::DataFrame CreateSharedDataFrame(const CountTableAdapter& countTable, const
 }


-
 //[[Rcpp::export]]
-SEXP ProcessDistanceFiles(const std::string& filePath, const Rcpp::DataFrame& countTable, double cutoff, bool isSim) {
+bool DetermineIfPhylipOrColumnFile(const std::string& filePath) {
     std::fstream data(filePath);
     std::unordered_map<bool, std::string> map;
     map[true] = "This is a phylip file. Processing now...";
@@ -77,19 +75,26 @@ SEXP ProcessDistanceFiles(const std::string& filePath, const Rcpp::DataFrame& co
         isPhylip = false;
     Rcpp::Rcout << map[isPhylip] << "\n";
     data.close();
+    return isPhylip;
+}
+
+//[[Rcpp::export]]
+SEXP ProcessDistanceFiles(const std::string& filePath, const Rcpp::DataFrame& countTable, const double cutoff,
+    const bool isSim) {
+    const bool isPhylip = DetermineIfPhylipOrColumnFile(filePath);

     CountTableAdapter adapter;
     adapter.CreateDataFrameMap(countTable);
     if(isPhylip) {
         DistanceFileReader* read = new ReadPhylipMatrix(cutoff, isSim);
-        std::vector<RowData> rowDataMatrix = read->ReadToRowData(filePath);
+        const std::vector<RowData> rowDataMatrix = read->ReadToRowData(filePath);
         read->SetCountTable(adapter);
         read->SetRowDataMatrix(rowDataMatrix);
         read->ReadRowDataMatrix(rowDataMatrix);
         return Rcpp::XPtr<DistanceFileReader>(read);
     }
     DistanceFileReader* read = new ColumnDistanceMatrixReader(cutoff, isSim);
-    std::vector<RowData> rowDataMatrix = read->ReadToRowData(adapter, filePath);
+    const std::vector<RowData> rowDataMatrix = read->ReadToRowData(adapter, filePath);
     read->SetCountTable(adapter);
     read->SetRowDataMatrix(rowDataMatrix);
     read->ReadRowDataMatrix(rowDataMatrix);
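Note on the main.cpp change: the body of `DetermineIfPhylipOrColumnFile` that actually decides the format is collapsed in this view, so its logic is not visible here. As a rough illustration of how such a check could work, the sketch below assumes the usual layouts: a phylip distance file starts with a line holding only the number of sequences, while a column file has three fields per line (two names and a distance). `LooksLikePhylip` is a hypothetical helper written for this note, not the function from the pull request, and it reads from a stream rather than a path so it can be tried without files.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not the PR's implementation): guess the format from
// the first non-blank line. Assumption: phylip files begin with a lone
// integer (the sequence count); column files begin with "nameA nameB dist".
bool LooksLikePhylip(std::istream& in) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string first;
        std::string extra;
        if (!(tokens >> first)) {
            continue;  // blank line, keep looking
        }
        if (tokens >> extra) {
            return false;  // more than one field: not a phylip header
        }
        for (const char c : first) {
            if (!std::isdigit(static_cast<unsigned char>(c))) {
                return false;  // single field, but not a plain integer
            }
        }
        return true;  // a lone integer on the first non-blank line
    }
    return false;  // empty input: nothing to classify
}
```

Splitting detection from reading, as the pull request does, also makes the check easy to test in isolation, which is what the new testthat case does against the two `amazon_*.dist` fixtures.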
Binary file removed tests/testthat/extdata/sparse_matrix_data.RDS
Binary file not shown.
10 changes: 10 additions & 0 deletions tests/testthat/test-test-opticluster.R
@@ -140,6 +140,16 @@ test_that("Read dist can read column and phylip files", {
   expect_true(nrow(get_distance_data_frame(distance_data_phylip)) == 9604)
 })

+
+test_that("We can determine if a file is phylip or not", {
+  is_not_phylip <-
+    DetermineIfPhylipOrColumnFile(test_path("extdata", "amazon_column.dist"))
+  is_phylip <-
+    DetermineIfPhylipOrColumnFile(test_path("extdata", "amazon_phylip.dist"))
+  expect_true(is_phylip)
+  expect_false(is_not_phylip)
+})
+
 test_that("Validate Count Table returns a valid count table", {
   count_table <- read.delim(test_path("extdata", "amazon.count_table"))
   validated_count_table <- validate_count_table(count_table)