DataFrameCpp - A C++ Dynamic DataFrame library with Python Pandas-like API inspired by hosseinmoein/DataFrame
The goal of this library is to provide a dynamic DataFrame library for C++ with a pandas-like API. Currently it mainly deals with data management. hosseinmoein/DataFrame is a good data library, but I found it frustrating to use because it is mainly designed for static column types and lacks data-manipulation methods such as removing rows and casting column types. So I decided to write my own dataframe library.
"Dynamic" means that data type information is stored with the data (I use variant to store data), so most of the time you don't need to specify data types when programming. What's more, a column's type can be changed on the fly. Of course, the trade-off is lower performance, as type information must be matched at runtime. However, you can also use the template versions of functions and provide a known data type to avoid the runtime comparison.
If performance is critical, I suggest trying hosseinmoein/DataFrame, which is much more mature.
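To make the runtime-versus-template trade-off concrete, here is a minimal sketch using only the standard library. `ColumnData`, `column_size`, `column_as`, and `cast_to_double` are hypothetical names chosen for illustration; they are not DataFrameCpp's actual classes or functions.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a dynamically typed column (not the library's real type).
using ColumnData = std::variant<std::vector<int>, std::vector<double>, std::vector<std::string>>;

// Runtime path: the stored type is discovered through std::visit on every call.
inline std::size_t column_size(const ColumnData& col) {
    return std::visit([](const auto& v) { return v.size(); }, col);
}

// Template path: the caller names the type, so a single check is enough.
// Returns nullptr when the column holds a different element type.
template <typename T>
const std::vector<T>* column_as(const ColumnData& col) {
    return std::get_if<std::vector<T>>(&col);
}

// Changing the column type on the fly: replace the stored alternative.
inline void cast_to_double(ColumnData& col) {
    if (const auto* ints = std::get_if<std::vector<int>>(&col)) {
        // The temporary vector<double> is fully built before it is assigned to col.
        col = std::vector<double>(ints->begin(), ints->end());
    }
}
```

Once `column_as<double>` has returned a non-null pointer, a loop over the vector needs no further type checks, which is the point of offering a template version alongside the dynamic one.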
Other limitations are:
- You need a C++20 compiler.
- It depends on some external libraries: Eigen, boost.
- If you want to do more computation, you need to learn Eigen or Armadillo. This library provides some APIs to convert data types; what you do with the converted data is up to you.
This library tries to provide a pandas-like API. However, apart from pandas being far more complete, there are some important differences:
- Subscripts usually return a view that refers to the original dataframe. Different rows in a view may refer to the same element. If you remove rows from the original dataframe, you shouldn't use views created from that dataframe any more.
- Most modifications happen in place, and there is no inplace option like in pandas. In my own experience with pandas, it's really a pain to set inplace=True every time.
- You can't use subscripts to create a new column. Because I use vectors of primitive data types, there is no direct way to represent None.
- apply
- select()
- groupby()
- sort()
- Key-value Index
- Statistical and math functions
- Dump to binary and load
- Hierarchical index
- DATE and DATETIME data types using boost.datetime
It's a header-only library.
#include <iostream>
#include "DataFrameCpp/include/DataFrameCpp/DataFrameCpp.hpp"

int main() {
    // Each entry is a column name followed by that column's values.
    dfc::DataFrame df{{"a", {1.0, 2.0, 9.0}},
                      {"b", {6, 8, 9}},
                      {"c", {4, 5, 6}},
                      {"d", {1.2, 9.7, 8.6}},
                      {"e", {"A", "B", "C"}}};
    std::cout << df << std::endl;
}
Output:
       [0]     [1]  [2]  [3]     [4]
       a       b    c    d       e
       double  int  int  double  string
0      1       6    4    1.2     A
1      2       8    5    9.7     B
2      9       9    6    8.6     C
Index: (trival)
Shape: (3, 5)
enum DType { NONE = 0, STRING, BOOL, INT, LONGLONG, FLOAT, DOUBLE, DATE, DATETIME, DATEDURATION, TIMEDURATION };
Given a df1.csv file:
a,b,c
1,6,"X2"
2,8,"X3"
9,9,"X1"
Use the following code to read it:
auto df1 = dfc::read_csv("df1.csv");
Column data types are inferred automatically. However, inference currently distinguishes only double and string, so every column read from a CSV file will be either double or string.
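The double-or-string rule can be sketched as follows. This is only an illustration of one way such inference could work, assuming the cells of a column are already split into strings with quotes removed; `InferredType`, `parses_as_double`, and `infer_column_type` are hypothetical names, and the library's real `read_csv` may decide differently.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical result type for this sketch (not the library's DType enum).
enum class InferredType { DOUBLE, STRING };

// True when the whole cell parses as a floating-point number.
inline bool parses_as_double(const std::string& cell) {
    if (cell.empty()) return false;
    char* end = nullptr;
    std::strtod(cell.c_str(), &end);
    // strtod stops at the first character it cannot use; require that it reached the end.
    return end == cell.c_str() + cell.size();
}

// A column is DOUBLE only if every cell parses as a number; otherwise it is STRING.
inline InferredType infer_column_type(const std::vector<std::string>& cells) {
    for (const auto& cell : cells) {
        if (!parses_as_double(cell)) return InferredType::STRING;
    }
    return InferredType::DOUBLE;
}
```

Under this rule, columns a and b of df1.csv would be classified as double and column c as string.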
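// Construct from column names and values.
dfc::DataFrame df{{"a", {1.0, 2.0, 9.0}},
                  {"b", {6, 8, 9}},
                  {"c", {4, 5, 6}},
                  {"d", {1.2, 9.7, 8.6}},
                  {"e", {"A", "B", "C"}}};
// Or construct an empty DataFrame from column names and types (3 columns).
dfc::DataFrame df_empty({"a", "b", "c"}, {dfc::DType::STRING, dfc::DType::DOUBLE, dfc::DType::INT});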
// append column
df.append_col("f", dfc::DType::STRING); //blank column
df.append_col("g", {"11", "12", "13"});
// append row
df.append_row(); // blank row; returns a DataFrameView of the last row.
df.append_row().set("a", 5).set("b", 7); // Append a row and set value.
DataFrameView df1 = df.iloc({2, 1}, {"d", "a"}); // rows 2 and 1, columns d and a.
DataFrameView df2 = df.set_index("b"); // use column b as the index.
df.loc({8, 9}, {"d", "a"}); // index keys 8 and 9, columns d and a.
When you pass a concrete unary function, it is applied only to columns whose data type matches the function's argument type; other columns are left unchanged.
dfc::DataFrame df{{"a", {1.0, 2.0, 9.0, 8.0}}, {"b", {6, 7, 8, 9}}};
std::function<double(double)> double_plus_1_functor = [](double x) { return x + 1; };
df.apply<true>(double_plus_1_functor); // column a changes, column b does not; `true` means in place.
df.apply_in_place(double_plus_1_functor); // same as above.
auto double_to_int_functor = [](double x) { return int(x); };
df.apply<true>(double_to_int_functor); // now a is an int column.
df.apply<true>([](int x) { return x + 2; }); // pass a lambda directly.
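The "only matching columns are touched" rule can be illustrated with plain std::variant columns. This is a sketch under the assumption that each column is a variant of typed vectors; `Column` and this free function `apply_in_place` are hypothetical and are not the library's implementation.

```cpp
#include <functional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical column type for this sketch.
using Column = std::variant<std::vector<int>, std::vector<double>, std::vector<std::string>>;

// Apply f in place to every column whose element type is exactly T; skip the rest.
template <typename T>
void apply_in_place(std::vector<Column>& columns, const std::function<T(T)>& f) {
    for (auto& col : columns) {
        // get_if returns nullptr for columns of any other type, so they stay untouched.
        if (auto* values = std::get_if<std::vector<T>>(&col)) {
            for (auto& x : *values) x = f(x);
        }
    }
}
```

Passing a `std::function<double(double)>` deduces `T = double`, so an int column in the same container is skipped without any conversion.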
Consider a to_string() that accepts several input types and has a single, known return type. You can pass a template functor as a template argument, together with a list of target argument types, to convert only columns of those types to string.
template <typename T> struct to_string_functor {
    std::string operator()(T x) { return std::to_string(x); }
};
dfc::DataFrame df{{"a", {1.0, 2.0, 9.0, 8.0}}, {"b", {6, 7, 8, 9}}, {"c", {"a", "bc", "d", "e"}}};
df.apply<true, to_string_functor, double, int>(); // convert only the double and int columns to string.
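One way to dispatch a template functor over a list of source types is a fold expression. The sketch below assumes variant-backed columns and a functor that always returns std::string; `Column`, `convert_if`, and `apply_types` are hypothetical names, not the library's internals.

```cpp
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical column type for this sketch.
using Column = std::variant<std::vector<int>, std::vector<double>, std::vector<std::string>>;

template <typename T> struct to_string_functor {
    std::string operator()(T x) const { return std::to_string(x); }
};

// Try one source type: if the column holds vector<T>, replace it with the mapped strings.
template <template <typename> class Functor, typename T>
bool convert_if(Column& col) {
    auto* values = std::get_if<std::vector<T>>(&col);
    if (values == nullptr) return false;
    std::vector<std::string> out;
    out.reserve(values->size());
    for (const T& x : *values) out.push_back(Functor<T>{}(x));
    col = std::move(out);  // the column now holds strings
    return true;
}

// For each column, try the listed types in order; || stops at the first match.
template <template <typename> class Functor, typename... Ts>
void apply_types(std::vector<Column>& columns) {
    for (auto& col : columns) {
        (convert_if<Functor, Ts>(col) || ...);
    }
}
```

Calling `apply_types<to_string_functor, double, int>(cols)` converts the double and int columns and leaves a string column as it is, because std::string is not in the type list.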
A Series stores its data as a vector<T>, which is always contiguous.
An Index uses a Series to store keys and an unordered_map to store the key-to-index map. The key map is a variant that represents range indexing (contiguous integers), int indexing, or string indexing. If the key map holds a long long, that value is the start of the range.
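A key map of this shape could look like the sketch below. The alias names (`RangeStart`, `IntKeyMap`, `StringKeyMap`, `KeyMap`) and the `find_int` helper are assumptions made for illustration; only the idea of a variant whose long long alternative holds the start of a contiguous range comes from the description above.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <variant>

using RangeStart = long long;  // contiguous integer index: only the first label is stored
using IntKeyMap = std::unordered_map<int, std::size_t>;
using StringKeyMap = std::unordered_map<std::string, std::size_t>;
using KeyMap = std::variant<RangeStart, IntKeyMap, StringKeyMap>;

// Look up an integer label and return its row position, if any.
inline std::optional<std::size_t> find_int(const KeyMap& map, long long label, std::size_t n_rows) {
    if (const auto* start = std::get_if<RangeStart>(&map)) {
        // Range index: the position is simply label - start, with no hash lookup.
        if (label < *start || label - *start >= static_cast<long long>(n_rows)) return std::nullopt;
        return static_cast<std::size_t>(label - *start);
    }
    if (const auto* ints = std::get_if<IntKeyMap>(&map)) {
        auto it = ints->find(static_cast<int>(label));
        if (it == ints->end()) return std::nullopt;
        return it->second;
    }
    return std::nullopt;  // string-keyed index: integer labels do not apply
}
```

The range alternative costs one long long regardless of the number of rows, which is why a default integer index does not need a hash map at all.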
Types of indices:
- template <typename KeyType> class KeyValueIndex: a key-value index. Multiple rows can be associated with a single key.
- template <typename KeyType> class UniqueKeyValueIndex: the unique version of the key-value index. An exception is thrown when there are multiple rows for a single key.
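The difference between the two index kinds can be sketched with standard containers. `MultiKeyIndexSketch` and `UniqueKeyIndexSketch` are hypothetical classes written for this example, and throwing at construction time is an assumption; the library may detect duplicates at a different point.

```cpp
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Non-unique index: one key may map to several row positions.
template <typename Key>
class MultiKeyIndexSketch {
public:
    explicit MultiKeyIndexSketch(const std::vector<Key>& keys) {
        for (std::size_t row = 0; row < keys.size(); ++row) map_[keys[row]].push_back(row);
    }
    // Rows for a key, or an empty vector when the key is absent.
    std::vector<std::size_t> rows(const Key& key) const {
        auto it = map_.find(key);
        return it == map_.end() ? std::vector<std::size_t>{} : it->second;
    }
private:
    std::unordered_map<Key, std::vector<std::size_t>> map_;
};

// Unique index: a duplicate key is rejected with an exception.
template <typename Key>
class UniqueKeyIndexSketch {
public:
    explicit UniqueKeyIndexSketch(const std::vector<Key>& keys) {
        for (std::size_t row = 0; row < keys.size(); ++row) {
            // emplace reports false in .second when the key already exists.
            if (!map_.emplace(keys[row], row).second) {
                throw std::invalid_argument("duplicate key in unique index");
            }
        }
    }
    std::size_t row(const Key& key) const { return map_.at(key); }
private:
    std::unordered_map<Key, std::size_t> map_;
};
```

A unique index can return a single position per lookup, while the non-unique one must always return a collection of positions.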
A DataFrame consists of an Index and multiple Series.
A ViewIndex uses an Index* to point to a base index. The ViewIndex may not be contiguous, so it keeps a pos vector (vector<size_t>) that stores the index-to-position map. For example, [5, 0, 1, 2, 4] means that the first row is element 5 of the Index, the second row is element 0 of the Index, and so on.
This is useful for sorting and subscripting. For example, a view can record the order of a sort result without actually reordering the rows (which would require copying and moving data between rows).
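Sorting through a position vector can be sketched as an argsort: the data stays where it is and only the positions are reordered. `sorted_positions` and `view_at` are hypothetical helpers for this example, not part of the library.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns pos such that data[pos[0]], data[pos[1]], ... is in ascending order.
inline std::vector<std::size_t> sorted_positions(const std::vector<double>& data) {
    std::vector<std::size_t> pos(data.size());
    std::iota(pos.begin(), pos.end(), std::size_t{0});  // start with 0, 1, 2, ...
    std::stable_sort(pos.begin(), pos.end(),
                     [&data](std::size_t a, std::size_t b) { return data[a] < data[b]; });
    return pos;
}

// Read row i of the view: one extra indirection through pos.
inline double view_at(const std::vector<double>& data, const std::vector<std::size_t>& pos,
                      std::size_t i) {
    return data[pos[i]];
}
```

Every other column of the same frame can be read through the same `pos` vector, so a sort moves one vector of positions instead of every column's data.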
A ViewIndex can also run with Index* = nullptr. In that case it can only use pos to generate the index.
Note that ViewIndex is the index of a view, not a view of an index.
Non-contiguous data may slow down computation, so you should try copy to create a new concrete DataFrame and do the computation on that.
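What copying a view into concrete storage amounts to can be sketched as a gather through the position vector. `materialize` is a hypothetical helper written for this example; it only illustrates the idea and is not the library's copy function.

```cpp
#include <cstddef>
#include <vector>

// Gather the rows a view refers to into a new, contiguous vector.
template <typename T>
std::vector<T> materialize(const std::vector<T>& base, const std::vector<std::size_t>& pos) {
    std::vector<T> out;
    out.reserve(pos.size());
    for (std::size_t p : pos) out.push_back(base.at(p));  // at() throws on a stale position
    return out;
}
```

After the gather, later passes over `out` read memory sequentially instead of jumping around the base vector through `pos`.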
ViewIndex has a key map that maps an integer location to the location in the original index.
A SeriesView contains a pointer to a Series and a ViewIndex.
A DataFrameView consists of a vector<SeriesView*>. This is the most complex data structure. Each column is independent, so complications arise on concatenation:
- Different columns in a DataFrameView obtained by horizontal concatenation may refer to the same base Series.
- Different elements in one column may refer to the same element after vertical concatenation. To reduce complexity, vertical concatenation always returns a concrete DataFrame (and thus must copy data).
- If the DataFrameView has multiple columns with the same name, the column_map stores only the last index.
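The "last index wins" behaviour for duplicate column names follows naturally from building a name-to-position map by plain assignment. `build_column_map` is a hypothetical helper for this sketch; it assumes the map is an unordered_map from name to column position.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Build a name -> column position map; a repeated name overwrites the earlier entry.
inline std::unordered_map<std::string, std::size_t>
build_column_map(const std::vector<std::string>& names) {
    std::unordered_map<std::string, std::size_t> column_map;
    for (std::size_t i = 0; i < names.size(); ++i) column_map[names[i]] = i;
    return column_map;
}
```

With column names a, b, a, a lookup of "a" resolves to position 2, so the first column named "a" is reachable only by position.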