******************************
**** ViennaCL Change Logs ****
******************************
*** Version 1.4.x ***
-- Version 1.4.2 --
This is a maintenance release, particularly resolving compilation problems with Visual Studio 2012.
- Largely refactored the internal code base, unifying code for vector, vector_range, and vector_slice.
Similar code refactoring was applied to matrix, matrix_range, and matrix_slice.
This not only resolves the problems in VS 2012, but also leads to shorter compilation times and a smaller code base.
- Improved performance of matrix-vector products of compressed_matrix on CPUs using OpenCL.
- Resolved a bug which shows up if certain rows and columns of a compressed_matrix are empty and the matrix is copied back to host.
- Fixed a bug and improved performance of GMRES. Thanks to Ivan Komarov for reporting via sourceforge.
- Added additional Doxygen documentation.
-- Version 1.4.1 --
This release focuses on improved stability and performance on AMD devices rather than introducing new features:
- Included fast matrix-matrix multiplication kernel for AMD's Tahiti GPUs if matrix dimensions are a multiple of 128.
Our sample HD7970 reaches over 1.3 TFLOPs in single precision and 200 GFLOPs in double precision (counting multiplications and additions as separate operations).
- All benchmark FLOPs are now using the common convention of counting multiplications and additions separately (ignoring fused multiply-add).
- Fixed a bug for matrix-matrix multiplication with matrix_slice<> when slice dimensions are multiples of 64.
- Improved detection logic for Intel OpenCL SDK.
- Fixed issues when resizing an empty compressed_matrix.
- Fixes and improved support for BLAS-1-type operations on dense matrices and vectors.
- Vector expressions can now be passed to inner_prod() and norm_1(), norm_2() and norm_inf() directly.
- Improved performance when using OpenMP.
- Better support for Intel Xeon Phi (MIC).
- Resolved problems when using OpenCL for CPUs if the number of cores is not a power of 2.
- Fixed a flaw when using AMG in debug mode. Thanks to Jakub Pola for reporting.
- Removed accidental external linkage (invalidating header-only model) of SPAI-related functions. Thanks again to Jakub Pola.
- Fixed issues with copy back to host when OpenCL handles are passed to CTORs of vector, matrix, or compressed_matrix. Thanks again to Jakub Pola.
- Added fix for segfaults on program exit when providing custom OpenCL queues. Thanks to Denis Demidov for reporting.
- Fixed bug in copy() to hyb_matrix as reported by Denis Demidov (thanks!).
- Added an overload for result_of::alignment for vector_expression. Thanks again to Denis Demidov.
- Added SSE-enabled code contributed by Alex Christensen.
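The FLOP convention used for the benchmark numbers in this release (multiplications and additions counted as separate operations) can be illustrated with a minimal sketch. The helper names gemm_flops and gflops are illustrative only and not part of ViennaCL; the count 2*m*n*k is the usual operation count for a dense matrix-matrix product.

```cpp
#include <cassert>
#include <cstddef>

// Floating point operations of a dense (m x k) * (k x n) product:
// each of the m*n result entries takes k multiplications and k additions
// (the convention rounds the k-1 additions up to k).
double gemm_flops(std::size_t m, std::size_t n, std::size_t k) {
  return 2.0 * static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Throughput in GFLOPs for a runtime measured in seconds.
double gflops(double flops, double seconds) {
  assert(seconds > 0.0);
  return flops / seconds * 1e-9;
}
```

With this convention, a fused multiply-add still counts as two operations, so numbers obtained this way are comparable across devices with and without FMA hardware.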
-- Version 1.4.0 --
The transition from 1.3.x to 1.4.x features the largest number of additions, improvements, and cleanups since the initial release.
In particular, host-, OpenCL-, and CUDA-based execution is now supported. OpenCL now needs to be enabled explicitly!
New features and feature improvements are as follows:
- Added host-based and CUDA-enabled operations on ViennaCL objects. The default is now a host-based execution for reasons of compatibility.
Enable OpenCL- or CUDA-based execution by defining the preprocessor constant VIENNACL_WITH_OPENCL or VIENNACL_WITH_CUDA, respectively.
Note that CUDA-based execution requires the use of nvcc.
- Added mixed-precision CG solver (OpenCL-based).
- Greatly improved performance of ILU0 and ILUT preconditioners (up to 10-fold). Also fixed a bug in ILUT.
- Added initializer types from Boost.uBLAS (unit_vector, zero_vector, scalar_vector, identity_matrix, zero_matrix, scalar_matrix).
Thanks to Karsten Ahnert for suggesting the feature.
- Added incomplete Cholesky factorization preconditioner.
- Added element-wise operations for vectors as available in Boost.uBLAS (element_prod, element_div).
- Added restart-after-N-cycles option to BiCGStab.
- Added level-scheduling for ILU-preconditioners. Performance strongly depends on matrix pattern.
- Added least-squares example including a function inplace_qr_apply_trans_Q() to compute the right hand side vector Q^T b without rebuilding Q.
- Improved performance of LU-factorization of dense matrices.
- Improved dense matrix-vector multiplication performance (thanks to Philippe Tillet).
- Reduced overhead when copying to/from ublas::compressed_matrix.
- ViennaCL objects (scalar, vector, etc.) can now be used as global variables (thanks to an anonymous user on the support mailing list).
- Refurbished OpenCL vector kernels backend.
All operations of the type v1 = a v2 @ b v3 with vectors v1, v2, v3 and scalars a and b including += and -= instead of = are now temporary-free. Similarly for matrices.
- matrix_range and matrix_slice as well as vector_range and vector_slice can now be used and mixed completely seamlessly with all standard operations except lu_factorize().
- Fixed a bug when using copy() with iterators on vector proxy objects.
- Final reduction step in inner_prod() and norms is now computed on CPU if the result is a CPU scalar.
- Reduced kernel launch overhead of simple vector kernels by packing multiple kernel arguments together.
- Updated SVD code and added routines for the computation of symmetric eigenvalues using OpenCL.
- The constructor of custom_operation now supports multiple arguments, allowing multiple expressions to be packed into the same kernel for improved performance.
However, all data structures in the multiple operations must have the same size.
- Further improvements to the OpenCL kernel generator: Added a repeat feature for generating loops inside a kernel, added element-wise products and division, added support for every one-argument OpenCL function.
- The name of the operation is now a mandatory argument of the constructor of custom_operation.
- Improved performance of the generated matrix-vector product code.
- Updated interfacing code for the Eigen library, now working with Eigen 3.x.y.
- Converter in source-release now depends on Boost.filesystem3 instead of Boost.filesystem2, thus requiring Boost 1.44 or above.
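The remark that level-scheduling performance strongly depends on the matrix pattern can be made concrete with a generic sketch of how levels are computed for a sparse lower triangular solve. This is not ViennaCL's implementation; the type csr_pattern and the functions compute_levels and num_levels are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Strictly lower triangular sparsity pattern in CSR form: the column indices
// of row i are col[row_ptr[i]] .. col[row_ptr[i+1]-1], all smaller than i.
struct csr_pattern {
  std::vector<std::size_t> row_ptr;
  std::vector<std::size_t> col;
};

// level[i] = 0 if row i depends on no other unknown, otherwise
// 1 + the largest level among the rows it depends on.
std::vector<std::size_t> compute_levels(const csr_pattern& L) {
  assert(!L.row_ptr.empty());
  const std::size_t n = L.row_ptr.size() - 1;
  std::vector<std::size_t> level(n, 0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t p = L.row_ptr[i]; p < L.row_ptr[i + 1]; ++p) {
      assert(L.col[p] < i);
      level[i] = std::max(level[i], level[L.col[p]] + 1);
    }
  return level;
}

// Number of sequential steps; all rows within one level can be processed in parallel.
std::size_t num_levels(const std::vector<std::size_t>& level) {
  return level.empty() ? 0 : *std::max_element(level.begin(), level.end()) + 1;
}
```

A pattern without off-diagonal entries yields a single level (fully parallel), whereas a bidiagonal pattern yields one level per row (fully sequential), which is why the benefit varies so much between matrices.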
*** Version 1.3.x ***
-- Version 1.3.1 --
The following bugfixes and enhancements have been applied:
- Fixed a compilation problem with GCC 4.7 caused by the wrong order of function declarations. Also removed unnecessary indirections and unused variables.
- Improved out-of-source build in the src-version (for packagers).
- Added virtual destructor in the runtime_wrapper-class in the kernel generator.
- Extended flexibility of submatrix and subvector proxies (ranges, slices).
- Block-ILU for compressed_matrix is now applied on the GPU during the solver cycle phase. However, for the moment the implementation file in viennacl/linalg/detail/ilu/opencl_block_ilu.hpp needs to be included separately in order to avoid an OpenCL dependency for all ILU implementations.
- SVD now supports double precision.
- Slightly adjusted the interface for NMF. The approximation rank is now specified by the supplied matrices W and H.
- Fixed a problem with matrix-matrix products if the result matrix is not initialized properly (thanks to Laszlo Marak for finding the issue and a fix).
- The operations C += prod(A, B) and C -= prod(A, B) for matrices A, B, and C no longer introduce temporaries if the three matrices are distinct.
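To illustrate what a temporary-free C += prod(A, B) means and why the matrices have to be distinct, here is a minimal host-side sketch. The type dense_matrix and the function add_prod are simple stand-ins invented for this example, not ViennaCL types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major matrix; a deliberately simple stand-in.
struct dense_matrix {
  std::size_t rows, cols;
  std::vector<double> data;
  dense_matrix(std::size_t r, std::size_t c, double v = 0.0)
    : rows(r), cols(c), data(r * c, v) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// C += sign * A * B, accumulated directly into C without forming A * B first.
// If C were the same object as A or B, entries still needed as input
// would be overwritten, hence the distinctness requirement.
void add_prod(dense_matrix& C, const dense_matrix& A, const dense_matrix& B,
              double sign = 1.0) {
  assert(&C != &A && &C != &B);
  assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t j = 0; j < B.cols; ++j) {
      double sum = 0.0;
      for (std::size_t k = 0; k < A.cols; ++k) sum += A(i, k) * B(k, j);
      C(i, j) += sign * sum;
    }
}
```

Passing sign = -1.0 gives the C -= prod(A, B) variant with the same loop.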
-- Version 1.3.0 --
Several new features enter this new minor version release.
Some of the experimental features introduced in 1.2.0 keep their experimental state in 1.3.x due to the short time since 1.2.0, with exceptions listed below along with the new features:
- Full support for ranges and slices for dense matrices and vectors (no longer experimental)
- QR factorization now possible for arbitrary matrix sizes (no longer experimental)
- Further improved matrix-matrix multiplication performance for matrix dimensions which are a multiple of 64 (particularly improves performance for NVIDIA GPUs)
- Added Lanczos and power iteration method for eigenvalue computations of dense and sparse matrices (experimental, contributed by Guenther Mader and Astrid Rupp)
- Added singular value decomposition in single precision (experimental, contributed by Volodymyr Kysenko)
- Two new ILU-preconditioners added: ILU0 (contributed by Evan Bollig) and a block-diagonal ILU preconditioner using either ILUT or ILU0 for each block. Both preconditioners are computed entirely on the CPU.
- Automated OpenCL kernel generator based on high-level operation specifications added (many thanks to Philippe Tillet who had a lot of /fun fun fun/ working on this)
- Two new sparse matrix types: ell_matrix for the ELL format and hyb_matrix for a hybrid format (contributed by Volodymyr Kysenko).
- Added possibility to specify the OpenCL platform used by a context
- Build options for the OpenCL compiler can now be supplied to a context (thanks to Krzysztof Bzowski for the suggestion)
- Added nonnegative matrix factorization by Lee and Seung (contributed by Volodymyr Kysenko).
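For readers unfamiliar with the ELL format mentioned above, the following is a generic host-side sketch of the storage scheme: every row is padded to the same number of entries, and entries are stored so that consecutive rows are adjacent in memory. The names ell_storage, to_ell, and ell_prod are hypothetical, and the internal layout of ell_matrix in ViennaCL may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct ell_storage {
  std::size_t rows = 0, width = 0;   // width = maximum number of nonzeros per row
  std::vector<std::size_t> col;      // rows*width entries, slot (i, k) at k*rows + i
  std::vector<double> val;           // padding slots hold the value 0
};

// Convert a row-wise map representation to ELL storage.
ell_storage to_ell(const std::vector<std::map<std::size_t, double>>& A) {
  ell_storage E;
  E.rows = A.size();
  for (const auto& row : A) E.width = std::max<std::size_t>(E.width, row.size());
  E.col.assign(E.rows * E.width, 0);
  E.val.assign(E.rows * E.width, 0.0);
  for (std::size_t i = 0; i < E.rows; ++i) {
    std::size_t k = 0;
    for (const auto& entry : A[i]) {
      E.col[k * E.rows + i] = entry.first;
      E.val[k * E.rows + i] = entry.second;
      ++k;
    }
  }
  return E;
}

// y = A * x; padding slots contribute 0 * x[0] and therefore do not change y.
std::vector<double> ell_prod(const ell_storage& E, const std::vector<double>& x) {
  std::vector<double> y(E.rows, 0.0);
  for (std::size_t k = 0; k < E.width; ++k)
    for (std::size_t i = 0; i < E.rows; ++i) {
      assert(E.col[k * E.rows + i] < x.size());
      y[i] += E.val[k * E.rows + i] * x[E.col[k * E.rows + i]];
    }
  return y;
}
```

The padding makes the access pattern regular, but wastes memory when a few rows are much longer than the rest; hybrid formats address this by storing the overflow entries separately.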
*** Version 1.2.x ***
-- Version 1.2.1 --
The current release mostly provides a few bug fixes for experimental features introduced in 1.2.0.
In addition, performance improvements for matrix-matrix multiplications are applied.
The main changes (in addition to some internal adjustments) are as follows:
- Fixed problems with double precision on AMD GPUs supporting cl_amd_fp64 instead of cl_khr_fp64 (thanks to Sylvain R.)
- Considerable improvements in the handling of matrix_range. Added project() function for convenience (cf. Boost.uBLAS)
- Further improvements of matrix-matrix multiplication performance (contributed by Volodymyr Kysenko)
- Improved performance of QR factorization
- Added direct element access to elements of compressed_matrix using operator() (thanks to sourceforge.net user Sulif for the hint)
- Fixed incorrect matrix dimensions obtained with the transfer of non-square sparse Eigen and MTL matrices to ViennaCL objects (thanks to sourceforge.net user ggrocca for pointing at this)
-- Version 1.2.0 --
Many new features from the Google Summer of Code and the IuE Summer of Code enter this release.
Due to their complexity, they are for the moment still in experimental state (see the respective chapters for details) and are expected to reach maturity with the 1.3.0 release.
Shorter release cycles are planned for the near future.
- Added a bunch of algebraic multigrid preconditioner variants (contributed by Markus Wagner)
- Added (factored) sparse approximate inverse preconditioner (SPAI, contributed by Nikolay Lukash)
- Added fast Fourier transform (FFT) for vector sizes that are a power of two, and a standard Fourier transform for other sizes (contributed by Volodymyr Kysenko)
- Additional structured matrix classes for circulant matrices, Hankel matrices, Toeplitz matrices and Vandermonde matrices (contributed by Volodymyr Kysenko)
- Added reordering algorithms (Cuthill-McKee and Gibbs-Poole-Stockmeyer, contributed by Philipp Grabenweger)
- Refurbished CMake build system (thanks to Michael Wild)
- Added matrix and vector proxy objects for submatrix and subvector manipulation
- Added (possibly GPU-assisted) QR factorization
- Per default, a viennacl::ocl::context now consists of one device only. The rationale is to provide better out-of-the-box support for machines with hybrid graphics (two GPUs), where one GPU may not be capable of double precision support.
- Fixed problems with viennacl::compressed_matrix which occurred if the number of rows and columns differed
- Improved documentation for the case of multiple custom kernels within a program
- Improved matrix-matrix multiplication kernels (may lead to up to 20 percent performance gains)
- Fixed problems in GMRES for small matrices (dimensions smaller than the maximum number of Krylov vectors)
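The split between a fast transform for power-of-two sizes and a standard transform for all other sizes can be sketched on the host as follows. This is a textbook radix-2 Cooley-Tukey recursion with a direct O(n^2) fallback, shown only to illustrate the dispatch; the function names are hypothetical and the OpenCL kernels in ViennaCL are organized differently.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

bool is_power_of_two(std::size_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Direct O(n^2) discrete Fourier transform, valid for any size.
std::vector<cplx> dft(const std::vector<cplx>& x) {
  const std::size_t n = x.size();
  const double pi = std::acos(-1.0);
  std::vector<cplx> y(n);
  for (std::size_t k = 0; k < n; ++k)
    for (std::size_t j = 0; j < n; ++j)
      y[k] += x[j] * std::polar(1.0, -2.0 * pi * double((k * j) % n) / double(n));
  return y;
}

// Recursive radix-2 FFT, O(n log n); requires a power-of-two size.
std::vector<cplx> fft_radix2(const std::vector<cplx>& x) {
  const std::size_t n = x.size();
  assert(is_power_of_two(n));
  if (n == 1) return x;
  std::vector<cplx> even(n / 2), odd(n / 2);
  for (std::size_t i = 0; i < n / 2; ++i) {
    even[i] = x[2 * i];
    odd[i] = x[2 * i + 1];
  }
  const std::vector<cplx> e = fft_radix2(even), o = fft_radix2(odd);
  const double pi = std::acos(-1.0);
  std::vector<cplx> y(n);
  for (std::size_t k = 0; k < n / 2; ++k) {
    const cplx t = std::polar(1.0, -2.0 * pi * double(k) / double(n)) * o[k];
    y[k] = e[k] + t;           // butterfly: upper half
    y[k + n / 2] = e[k] - t;   // butterfly: lower half
  }
  return y;
}

// Dispatch: fast path for power-of-two sizes, direct transform otherwise.
std::vector<cplx> fourier_transform(const std::vector<cplx>& x) {
  if (x.empty()) return x;
  return is_power_of_two(x.size()) ? fft_radix2(x) : dft(x);
}
```

Both paths compute the same forward transform (sign convention exp(-2 pi i j k / n)), so callers do not need to care which one is taken.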
*** Version 1.1.x ***
-- Version 1.1.2 --
This final release of the ViennaCL 1.1.x family focuses on refurbishing existing functionality:
- Fixed a bug with partial vector copies from CPU to GPU (thanks to sourceforge.net user kaiwen).
- Corrected error estimations in CG and BiCGStab iterative solvers (thanks to Riccardo Rossi for the hint).
- Improved performance of CG and BiCGStab as well as Jacobi and row-scaling preconditioners considerably (thanks to Farshid Mossaiby and Riccardo Rossi for a lot of input).
- Corrected linker statements in CMakeLists.txt for MacOS (thanks to Eric Christiansen).
- Improved handling of ViennaCL types (direct construction, output streaming of matrix- and vector-expressions, etc.).
- Updated old code in the coordinate_matrix type and improved performance (thanks to Dongdong Li for finding this).
- Using size_t instead of unsigned int for the size type on the host.
- Updated double precision support detection for AMD hardware.
- Fixed a name clash in direct_solve.hpp and ilu.hpp (thanks to sourceforge.net user random).
- Prevented unsupported assignments and copies of sparse matrix types (thanks to sourceforge.net user kszyh).
-- Version 1.1.1 --
This new revision release has a focus on better interaction with other linear algebra libraries. The few known glitches with version 1.1.0 are now removed.
- Fixed compilation problems on MacOS X and OpenCL 1.0 header files due to an undefined preprocessor constant (thanks to Vlad-Andrei Lazar and Evan Bollig for reporting this)
- Removed the accidental external linkage for three functions (we appreciate the report by Gordon Stevenson).
- New out-of-the-box support for Eigen and MTL libraries. Iterative solvers from ViennaCL can now directly be used with both libraries.
- Fixed a problem with GMRES when system matrix is smaller than the maximum Krylov space dimension.
- Better default parameter for BLAS3 routines leads to higher performance for matrix-matrix-products.
- Added benchmark for dense matrix-matrix products (BLAS3 routines).
- Added viennacl-info example that displays information about the OpenCL backend used by ViennaCL.
- Cleaned up CMakeLists.txt in order to selectively enable builds that rely on external libraries.
- More than one installed OpenCL platform is now allowed (thanks to Aditya Patel).
-- Version 1.1.0 --
A large number of new features and improvements over the 1.0.5 release are now available:
- The completely rewritten OpenCL back-end allows for multiple contexts, multiple devices and even to wrap existing OpenCL resources into ViennaCL objects. A tutorial demonstrates the new functionality. Thanks to Josip Basic for pushing us in that direction.
- The tutorials are now named according to their purpose.
- The dense matrix type now supports both row-major and column-major storage.
- Dense and sparse matrix types can now be filled using STL-emulated types (std::vector< std::vector<NumericT> > and std::vector< std::map< unsigned int, NumericT> >)
- BLAS level 3 functionality is now complete. We are very happy with the general out-of-the-box performance of matrix-matrix-products, even though it cannot beat the extremely tuned implementations tailored to certain matrix sizes on a particular device yet.
- An automated performance tuning environment allows an optimization of the kernel parameters for the library user's machine. Best parameters can be obtained from a tuning run, stored in an XML file, and read at program startup using pugixml.
- Two new preconditioners are now included: A Jacobi preconditioner and a row-scaling preconditioner. In contrast to ILUT, they are applied on the OpenCL device directly.
- Clean compilation of all examples under Visual Studio 2005 (we recommend newer compilers though...).
- Error handling is now carried out using C++ exceptions.
- Matrix Market now uses index base 1 per default (thanks to Evan Bollig for reporting that)
- Improved performance of norm_X kernels.
- Iterative solver tags now have consistent constructors: First argument is the relative tolerance, second argument is the maximum number of total iterations. Other arguments depend on the respective solver.
- A few minor improvements here and there (thanks go to Riccardo Rossi and anonymous sourceforge.net users for reporting the issues)
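The STL-emulated sparse type named above, std::vector< std::map< unsigned int, NumericT> >, stores one map from column index to value per row. The following host-only sketch shows how such a container is filled and used; the typedef stl_sparse_matrix and the functions laplace_1d and stl_prod are illustrative names, and the transfer to a ViennaCL matrix is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

typedef double NumericT;
typedef std::vector< std::map<unsigned int, NumericT> > stl_sparse_matrix;

// Fill the tridiagonal (-1, 2, -1) matrix of a 1D Laplace problem.
stl_sparse_matrix laplace_1d(unsigned int n) {
  stl_sparse_matrix A(n);
  for (unsigned int i = 0; i < n; ++i) {
    if (i > 0) A[i][i - 1] = -1.0;
    A[i][i] = 2.0;
    if (i + 1 < n) A[i][i + 1] = -1.0;
  }
  return A;
}

// y = A * x; only the stored entries of each row are visited.
std::vector<NumericT> stl_prod(const stl_sparse_matrix& A, const std::vector<NumericT>& x) {
  std::vector<NumericT> y(A.size(), NumericT(0));
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::map<unsigned int, NumericT>::const_iterator it = A[i].begin();
         it != A[i].end(); ++it) {
      assert(it->first < x.size());
      y[i] += it->second * x[it->first];
    }
  return y;
}
```

Because std::map keeps its keys sorted, the entries of each row come out in ascending column order, which is convenient when converting to compressed row storage.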
*** Version 1.0.x ***
-- Version 1.0.5 --
This is the last 1.0.x release. The main changes are as follows:
- Added a reader and writer for MatrixMarket files (thanks to Evan Bollig for suggesting that)
- Eliminated a bug that caused the upper triangular direct solver to fail on NVIDIA hardware for large matrices (thanks to Andrew Melfi for finding that)
- The number of iterations and the final estimated error can now be obtained from iterative solver tags.
- Improvements provided by Klaus Schnass are included in the developer converter script (OpenCL kernels to C++ header)
- Disabled the use of reference counting for OpenCL handles on Mac OS X (caused seg faults on program exit)
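As background for the MatrixMarket reader, the sketch below parses the common "coordinate real general" variant of the format: lines starting with '%' are banner or comments, the first remaining line holds rows, columns, and the number of entries, and each entry is given as "row column value" with 1-based indices. The function read_matrix_market is a hypothetical helper written for this note, not the reader shipped with ViennaCL, and it ignores the symmetric, pattern, and complex variants.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector< std::map<unsigned int, double> > sparse_rows;

// Reads a "coordinate real general" Matrix Market stream into A.
// Returns false on malformed input.
bool read_matrix_market(std::istream& in, sparse_rows& A, std::size_t& cols) {
  std::string line;
  // Skip the banner, comment lines (starting with '%'), and blank lines.
  while (std::getline(in, line))
    if (!line.empty() && line[0] != '%') break;
  std::istringstream size_line(line);
  std::size_t rows = 0, nnz = 0;
  if (!(size_line >> rows >> cols >> nnz)) return false;
  A.assign(rows, std::map<unsigned int, double>());
  for (std::size_t k = 0; k < nnz; ++k) {
    unsigned int i = 0, j = 0;
    double v = 0.0;
    if (!(in >> i >> j >> v)) return false;
    if (i < 1 || i > rows || j < 1 || j > cols) return false;
    A[i - 1][j - 1] = v;   // file indices are 1-based, containers are 0-based
  }
  return true;
}
```

The subtraction of 1 in the last step is the point where an incorrect index base would silently shift all entries, so it is worth checking against a small known file.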
-- Version 1.0.4 --
The changes in this release are:
- All tutorials now work out-of-the-box with Visual Studio 2008.
- Eliminated all ViennaCL related warnings when compiling with Visual Studio 2008.
- Better (experimental) support for double precision on ATI GPUs, but no norm_1, norm_2, norm_inf and index_norm_inf functions using ATI Stream SDK on GPUs in double precision.
- Fixed a bug in GMRES that caused segmentation faults under Windows.
- Fixed a bug in const_sparse_matrix_adapter (thanks to Abhinav Golas and Nico Galoppo for almost simultaneous emails on that)
- Corrected incorrect return values in the sparse matrix regression test suite (thanks to Klaus Schnass for the hint)
-- Version 1.0.3 --
The main improvements in this release are:
- Support for multi-core CPUs with ATI Stream SDK (thanks to Riccardo Rossi, UPC. BARCELONA TECH, for suggesting this)
- inner_prod is now up to a factor of four faster (thanks to Serban Georgescu, ETH, for pointing out the poor performance of the old implementation)
- Fixed a bug with plane_rotation that caused system freezes with ATI GPUs.
- Extended the doxygen generated reference documentation
-- Version 1.0.2 --
A bug-fix release that resolves some problems with the Visual C++ compiler.
- Fixed some compilation problems under Visual C++ (version 2005 and 2008).
- All tutorials accidentally relied on ublas. Now tut1 and tut5 can be compiled without ublas.
- Renamed aux/ folder to auxiliary/ (caused some problems on Windows machines)
-- Version 1.0.1 --
This is a quite large revision of ViennaCL 1.0.0, but mainly improves things under the hood.
- Fixed a bug in lu_substitute for dense matrices
- Changed iterative solver behavior to stop if a certain relative residual is reached
- ILU preconditioning is now fully done on the CPU, because this gives best overall performance
- All OpenCL handles of ViennaCL types can now be accessed via member function handle()
- Improved GPU performance of GMRES by about a factor of two.
- Added generic norm_2 function in header file norm_2.hpp
- Wrapper for clFlush() and clFinish() added
- Device information can be queried by device.info()
- Extended documentation and tutorials
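The relative-residual stopping criterion mentioned above can be illustrated with a small host-side conjugate gradient solver. This is a sketch under stated assumptions: the residual is taken relative to the initial residual (which equals the right hand side for a zero start vector), and the struct cg_settings as well as the function solve_cg are hypothetical names, not the ViennaCL solver interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> vec;
typedef std::vector<vec> dense;   // row-wise dense matrix, assumed symmetric positive definite

// Hypothetical settings object: tolerance and iteration limit go in,
// iteration count and achieved relative residual come out.
struct cg_settings {
  double tolerance;
  std::size_t max_iterations;
  std::size_t iterations;
  double error;
  cg_settings(double tol, std::size_t max_iters)
    : tolerance(tol), max_iterations(max_iters), iterations(0), error(0.0) {}
};

double dot(const vec& a, const vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

vec mat_vec(const dense& A, const vec& x) {
  vec y(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i) y[i] = dot(A[i], x);
  return y;
}

// Conjugate gradient starting from x = 0; stops once ||r|| / ||r0|| <= tolerance
// or the iteration limit is reached.
vec solve_cg(const dense& A, const vec& b, cg_settings& tag) {
  assert(A.size() == b.size());
  vec x(b.size(), 0.0), r = b, p = b;
  double rr = dot(r, r);
  const double r0_norm = std::sqrt(rr);
  tag.iterations = 0;
  tag.error = (r0_norm > 0.0) ? 1.0 : 0.0;
  while (tag.error > tag.tolerance && tag.iterations < tag.max_iterations) {
    const vec Ap = mat_vec(A, p);
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
    tag.error = std::sqrt(rr) / r0_norm;
    ++tag.iterations;
  }
  return x;
}
```

A relative criterion makes the tolerance independent of the scaling of the right hand side, which an absolute residual threshold is not.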
-- Version 1.0.0 --
First release