DataFrameCpp - A C++ Dynamic DataFrame library with Python Pandas-like API inspired by hosseinmoein/DataFrame
The goal of this library is to provide a dynamic DataFrame library for C++ with a pandas-like API. Currently it mainly deals with data management. hosseinmoein/DataFrame is a good data library, but I found it frustrating to use because it is mainly designed for static column types and lacks data manipulation methods, e.g. removing rows and casting column types. So I decided to write my own dataframe library.
"Dynamic" means that data type information is stored with the data (I use a variant to store it), so most of the time you don't need to specify data types when programming. What's more, a column's type can be changed on the fly. Of course, the trade-off is lower performance, as type information must be matched at runtime. However, you can also use the template versions of functions and provide a known data type to avoid the runtime comparison.
If performance is critical, I suggest trying hosseinmoein/DataFrame, which is much more mature.
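To illustrate the trade-off between runtime type matching and the template fast path, here is a minimal standalone sketch. It does not use DataFrameCpp itself: `Column`, `column_size`, `typed_values` and `cast_double_to_int` are hypothetical names chosen for this example, and the real library's storage may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for a dynamically typed column (not the library's class).
using Column = std::variant<std::vector<int>, std::vector<double>, std::vector<std::string>>;

// Runtime path: the element type is discovered by visiting the variant.
inline std::size_t column_size(const Column& col) {
    return std::visit([](const auto& values) { return values.size(); }, col);
}

// Template path: the caller names the type, so one check up front is enough.
template <typename T>
const std::vector<T>& typed_values(const Column& col) {
    assert(std::holds_alternative<std::vector<T>>(col) && "column holds a different type");
    return std::get<std::vector<T>>(col);
}

// Changing the column type on the fly: replace the stored alternative.
inline void cast_double_to_int(Column& col) {
    const auto& src = typed_values<double>(col);
    std::vector<int> dst;
    dst.reserve(src.size());
    for (double x : src) dst.push_back(static_cast<int>(x));  // truncates toward zero
    col = std::move(dst);  // the variant now holds vector<int>
}
```

Code that already knows a column's type can call `typed_values<T>` once and then loop over a plain vector, which is the kind of saving the template versions are meant to offer.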
Other limitations are:
- You need a C++20 compiler.
- It depends on some external libraries: Eigen, boost.
- If you want to do more computation, you need to learn Eigen or Armadillo. This library provides some APIs to convert data types. How you work with the converted data is up to you.
This library tries to provide a pandas-like API. Besides pandas being far more complete, there are some important differences:
- Subscripts usually return a view that refers to the original dataframe. Different rows in a view may refer to the same element. If you remove rows from the original dataframe, you shouldn't use views created from that dataframe any more.
- Most modifications happen in place, and there is no inplace option as in pandas. In my own experience with pandas, it's really a pain to set inplace=True every time.
- You can't use subscripts to create a new column. Because columns are stored as vectors of primitive data types, there is no direct way to represent None.
- apply
- select()
- groupby()
- sort()
- Key-value Index
- Statistical and math functions
- Dump to binary and load
- Hierarchical index
- DATE and DATETIME data types using boost.datetime
It's a header-only library.
#include <iostream>
#include "DataFrameCpp/include/DataFrameCpp/DataFrameCpp.hpp"
int main() {
    dfc::DataFrame df{{"a", {1.0, 2.0, 9.0}},
                      {"b", {6, 8, 9}},
                      {"c", {4, 5, 6}},
                      {"d", {1.2, 9.7, 8.6}},
                      {"e", {"A", "B", "C"}}};
    std::cout << df << std::endl;
}
Output:
[0] [1] [2] [3] [4]
a b c d e
double int int double string
0 1 6 4 1.2 A
1 2 8 5 9.7 B
2 9 9 6 8.6 C
Index: (trival)
Shape: (3, 5)
enum DType { NONE = 0, STRING, BOOL, INT, LONGLONG, FLOAT, DOUBLE, DATE, DATETIME, DATEDURATION, TIMEDURATION };
Given a df1.csv file:
a,b,c
1,6,"X2"
2,8,"X3"
9,9,"X1"
Use the following code to read it:
auto df1 = dfc::read_csv("df1.csv");
The data type of each column is decided automatically. However, detection only distinguishes double and string, so every column will be either double or string.
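As a rough illustration of a double-or-string decision, the sketch below classifies a column of raw CSV cells. This is an assumption about how such detection could work, not the logic of dfc::read_csv. `parses_as_double` and `column_is_double` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Returns true when the whole cell parses as a floating-point number.
inline bool parses_as_double(const std::string& cell) {
    if (cell.empty()) return false;
    char* end = nullptr;
    std::strtod(cell.c_str(), &end);
    // Accept only if strtod consumed every character of the cell.
    return end == cell.c_str() + cell.size();
}

// A column is treated as double only if every cell is numeric; otherwise as string.
inline bool column_is_double(const std::vector<std::string>& cells) {
    assert(!cells.empty() && "cannot classify an empty column");
    for (const auto& cell : cells) {
        if (!parses_as_double(cell)) return false;
    }
    return true;
}
```

Under this rule, columns a and b of df1.csv would be classified as double, while column c (X2, X3, X1) would be classified as string.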
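// Construct from column names and data.
dfc::DataFrame df{{"a", {1.0, 2.0, 9.0}},
                  {"b", {6, 8, 9}},
                  {"c", {4, 5, 6}},
                  {"d", {1.2, 9.7, 8.6}},
                  {"e", {"A", "B", "C"}}};
// Alternatively, construct empty columns from names and types.
dfc::DataFrame df({"a", "b", "c"}, {dfc::DType::STRING, dfc::DType::DOUBLE, dfc::DType::INT}); // 3 columns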
// append column
df.append_col("f", dfc::DType::STRING); //blank column
df.append_col("g", {"11", "12", "13"});
// append row
df.append_row(); // blank row. return a DataFrameView of the last row.
df.append_row().set("a", 5).set("b", 7); // append a row and set its values
DataFrameView df1 = df.iloc({2, 1}, {"d", "a"}); // columns d and a, rows 2 and 1
DataFrameView df2 = df.set_index("b");
df.loc({8, 9}, {"d", "a"});
When you pass a concrete unary function, it is applied only to columns whose data type matches the function's argument type. Other columns are left unchanged.
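To make the matching rule concrete, here is a library-independent sketch of applying a function only to columns of one element type. `Frame`, `apply_matching` and `column_as` are hypothetical names for this illustration. The DataFrameCpp calls themselves follow.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for a dynamically typed column and a frame.
using Column = std::variant<std::vector<int>, std::vector<double>, std::vector<std::string>>;
using Frame = std::map<std::string, Column>;

// Applies f in place to every column whose element type is exactly Arg.
// Columns of any other type are skipped. Returns the number of columns modified.
template <typename Arg, typename F>
int apply_matching(Frame& frame, F f) {
    int touched = 0;
    for (auto& entry : frame) {
        if (auto* values = std::get_if<std::vector<Arg>>(&entry.second)) {
            for (auto& v : *values) v = f(v);
            ++touched;
        }
    }
    return touched;
}

// Typed read access used to inspect the result.
template <typename T>
const std::vector<T>& column_as(const Frame& frame, const std::string& name) {
    const auto* values = std::get_if<std::vector<T>>(&frame.at(name));
    assert(values != nullptr && "column has a different element type");
    return *values;
}
```

In this sketch the argument type is given explicitly as `Arg`. A function taking double leaves int columns alone because `std::get_if` returns a null pointer for them.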
dfc::DataFrame df{{"a", {1.0, 2.0, 9.0, 8.0}}, {"b", {6, 7, 8, 9}}};
std::function<double(double)> double_plus_1_functor = [](double x) { return x + 1; };
df.apply<true>(double_plus_1_functor); // column a changes, column b does not; `true` means in place
df.apply_in_place(double_plus_1_functor); // same as above
auto double_to_int_functor = [](double x) { return int(x); };
df.apply<true>(double_to_int_functor); // now a is an int column
df.apply<true>([](int x) { return x + 2; }); // pass a lambda directly
Consider a to_string() that accepts inputs of several types and has a single known return type. You can pass a template functor as a template argument and list the argument types it should target, so that only columns of those types are converted to string.
template <typename T> struct to_string_functor {
    std::string operator()(T x) { return std::to_string(x); }
};
dfc::DataFrame df{{"a", {1.0, 2.0, 9.0, 8.0}}, {"b", {6, 7, 8, 9}}, {"c", {"a", "bc", "d", "e"}}};
df.apply<true, to_string_functor, double, int>(); // convert only the double and int columns to string
A Series stores its data as a vector<T>, so it is always contiguous.
An Index uses a Series to store keys and an unordered_map to store the key-index map. The key map is a variant so that it can represent range indexing (contiguous integers), int indexing and string indexing.
If the key map holds a long long, that value is the start of the range.
Types of indices:
- template <typename KeyType> class KeyValueIndex: key-value index. Multiple rows can be associated with a single key.
- template <typename KeyType> class UniqueKeyValueIndex: unique version of the key-value index. An exception is thrown when multiple rows share a single key.
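The three indexing styles can be sketched with standard containers. The classes below are hypothetical illustrations (`RangeIndexSketch`, `MultiKeyIndexSketch`, `UniqueKeyIndexSketch`) and are not the library's KeyValueIndex or UniqueKeyValueIndex. In particular, the exception type used here is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Range indexing: only the start is stored; a position is an offset from it.
struct RangeIndexSketch {
    long long start;
    long long size;
    std::size_t position(long long key) const {
        assert(key >= start && key < start + size && "key outside the range");
        return static_cast<std::size_t>(key - start);
    }
};

// Key-value indexing where one key may map to several rows.
template <typename KeyType>
class MultiKeyIndexSketch {
public:
    void insert(const KeyType& key, std::size_t row) { rows_[key].push_back(row); }
    const std::vector<std::size_t>& rows(const KeyType& key) const { return rows_.at(key); }
private:
    std::unordered_map<KeyType, std::vector<std::size_t>> rows_;
};

// Unique variant: inserting a second row for an existing key throws.
template <typename KeyType>
class UniqueKeyIndexSketch {
public:
    void insert(const KeyType& key, std::size_t row) {
        if (!row_.emplace(key, row).second) {
            throw std::invalid_argument("duplicate key in unique index");
        }
    }
    std::size_t row(const KeyType& key) const { return row_.at(key); }
private:
    std::unordered_map<KeyType, std::size_t> row_;
};
```

The range form needs no hash map at all, which is why storing only the start value is enough for contiguous integer keys.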
A DataFrame consists of an Index and multiple Series.
A ViewIndex uses an Index* to point to a base index. The ViewIndex may not be contiguous, so there is a pos vector<size_t> that stores the index-position map. For example, [5, 0, 1, 2, 4] means the first row of the view is element 5 of the Index, the second row is element 0, and so on.
This is useful for sorting and subscripting. For example, a view can record the order of a sort result without actually reordering the rows (which would require copying and moving data between rows).
A ViewIndex can also run with Index* = nullptr. In that case it can only use pos to generate the index.
Note that ViewIndex is the index of a view, not a view of an index.
Non-contiguous data may slow down computation, so you should copy the view into a new concrete DataFrame and do the computation on that.
ViewIndex has a key map that maps each integer location in the view to the location in the original index.
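The position-vector idea can be sketched on a single vector of doubles: sorting permutes only the positions, and the base data stays where it is. `sorted_positions`, `view_at` and `materialize` are hypothetical names for this illustration, not library functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Builds a position vector listing the rows of `values` in ascending order.
// The data itself is never moved; only the positions are permuted.
inline std::vector<std::size_t> sorted_positions(const std::vector<double>& values) {
    std::vector<std::size_t> pos(values.size());
    std::iota(pos.begin(), pos.end(), std::size_t{0});
    std::stable_sort(pos.begin(), pos.end(),
                     [&values](std::size_t a, std::size_t b) { return values[a] < values[b]; });
    return pos;
}

// Reads row `row` of the view by following the position map into the base data.
inline double view_at(const std::vector<double>& values,
                      const std::vector<std::size_t>& pos, std::size_t row) {
    assert(row < pos.size() && pos[row] < values.size());
    return values[pos[row]];
}

// Copies the view into contiguous storage, e.g. before heavy computation.
inline std::vector<double> materialize(const std::vector<double>& values,
                                       const std::vector<std::size_t>& pos) {
    std::vector<double> out;
    out.reserve(pos.size());
    for (std::size_t row = 0; row < pos.size(); ++row) {
        out.push_back(view_at(values, pos, row));
    }
    return out;
}
```

Every read through the view costs one extra indirection and may jump around in memory, which is the reason for copying to a concrete container first when performance matters.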
A SeriesView contains a pointer to a Series and a ViewIndex.
A DataFrameView consists of a vector<SeriesView*>. This is the most complex data structure. Each column is independent, so complications arise on concatenation:
- Different columns in a DataFrameView obtained by horizontal concatenation may refer to the same base Series.
- Different elements in one column could refer to the same element after vertical concatenation. To reduce complexity, vertical concatenation always returns a concrete DataFrame (and thus must copy data).
- If the DataFrameView has multiple columns with the same name, the column_map only stores the index of the last one.
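The last point follows naturally from a name-to-position map that is filled from left to right. The sketch below assumes such a map is an unordered_map from column name to column position. `build_column_map` and `column_position` are hypothetical helpers, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Fills the map left to right, so a later duplicate overwrites an earlier entry.
inline std::unordered_map<std::string, std::size_t>
build_column_map(const std::vector<std::string>& names) {
    std::unordered_map<std::string, std::size_t> column_map;
    for (std::size_t i = 0; i < names.size(); ++i) {
        column_map[names[i]] = i;
    }
    return column_map;
}

// Looks up a column by name; with duplicates this returns the last position.
inline std::size_t column_position(const std::unordered_map<std::string, std::size_t>& column_map,
                                   const std::string& name) {
    auto it = column_map.find(name);
    assert(it != column_map.end() && "unknown column name");
    return it->second;
}
```

With column names a, b, a, a lookup of a resolves to position 2, so the first column named a can then only be reached by position.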