1️⃣🐝🏎️ The One Billion Row Challenge

At the beginning of January 2024, Gunnar Morling launched The One Billion Row Challenge (1BRC) for the Java community on his blog and GitHub. The challenge aimed to find Java code that processes one billion rows from a text file and aggregates the values in the fastest time possible. Many developers submitted solutions before Morling closed the challenge at the end of the month and published the final leaderboards. After news of the challenge spread, many people built their own solutions using other languages, databases, and tools.

Basics of running the challenge

  1. Generate the measurements file with 1B rows (just once).
  2. Calculate the min, max & average values from the measurements file (a naive baseline is sketched after this list).
    • Measure the time needed for reading the file and calculating the average. Output of the result values is not part of the challenge.
  3. Optimize the heck out of it to speed up your code!
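For reference, a naive single-threaded aggregation (a sketch only, assuming a `measurements.txt` file in the `<station>;<temperature>` format described below; output formatting and rounding are simplified) could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Naive baseline sketch: read the file line by line and aggregate
// min/max/mean per station. Deliberately unoptimized.
var stats = new Dictionary<string, (double Min, double Max, double Sum, long Count)>();

foreach (var line in File.ReadLines("measurements.txt"))
{
    int sep = line.IndexOf(';');                  // format: <station>;<temperature>
    string station = line[..sep];
    double value = double.Parse(line[(sep + 1)..], CultureInfo.InvariantCulture);

    stats[station] = stats.TryGetValue(station, out var s)
        ? (Math.Min(s.Min, value), Math.Max(s.Max, value), s.Sum + value, s.Count + 1)
        : (value, value, value, 1);
}

foreach (var (station, s) in stats)
    Console.WriteLine($"{station}={s.Min:F1}/{s.Sum / s.Count:F1}/{s.Max:F1}");
```

Something like this is the starting point that step 3 then optimizes away.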

Rules and limits

  • No external library dependencies may be used.
  • Implementations must be provided as a single source file.
  • The computation must happen at application runtime, i.e. you cannot process the measurements file at build time and just bake the result into the binary.
  • Input value ranges are as follows:
    • Station name: non null UTF-8 string of min length 1 character and max length 100 bytes, containing neither ; nor \n characters. (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.).
    • Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit.
  • There is a maximum of 10,000 unique station names.
  • Line endings in the file are \n characters on all platforms.
  • Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported.
  • The rounding of output values must be done using the semantics of IEEE 754 rounding-direction "roundTowardPositive".
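The last rule deserves a note: "roundTowardPositive" is the IEEE 754 directed rounding mode toward positive infinity, which is not what Math.Round does by default. A minimal sketch of one literal reading of that rule at one fractional digit (an illustrative helper, not taken from the challenge code, and ignoring binary floating-point edge cases):

```csharp
using System;

// Round a value to one fractional digit toward positive infinity
// (IEEE 754 "roundTowardPositive"), i.e. a ceiling at the first decimal.
static double RoundTowardPositive(double value) => Math.Ceiling(value * 10.0) / 10.0;

Console.WriteLine(RoundTowardPositive(-1.85)); // -1.8 (rounded toward +inf)
Console.WriteLine(RoundTowardPositive(1.84));  // 1.9
```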

Attention:

  • The original generator used a list of weather station names and selected 10k random names from it. The shortest name I found is 3 bytes long and the longest is 24 bytes.
    • Because the temperature always has one fractional digit, its text form takes 3-5 bytes. This gives a line range of 3-24 bytes name + 1 byte separator + 3-5 bytes temperature + 1 byte new line = 8-31 bytes per line. The total file size is therefore 8-31 GB, on average about 19.5 GB.
    • The original post contained a warning that the generated file would be approximately 12 GB in size, which would mean that most names are only about 5-7 bytes long.
  • The extended challenge added generic weather station names with 1-100 bytes in length.
    • This gives a line range of 1-100 bytes name + 1 byte separator + 3-5 bytes temperature + 1 byte new line = 6-107 bytes per line. The total file size is therefore 6-107 GB, on average about 56.5 GB.
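To sanity-check those per-line estimates, the byte length of a single `<station>;<temperature>\n` line can be computed directly (a small illustrative helper; the station names are just examples):

```csharp
using System;
using System.Globalization;
using System.Text;

// UTF-8 byte length of one "<station>;<temperature>\n" line:
// name + 1 byte ';' + temperature text (3-5 bytes) + 1 byte '\n'.
static int LineByteLength(string station, double temperature)
{
    string temp = temperature.ToString("F1", CultureInfo.InvariantCulture); // always one fractional digit
    return Encoding.UTF8.GetByteCount(station) + 1 + temp.Length + 1;
}

Console.WriteLine(LineByteLength("Abha", -31.1));                   // 4 + 1 + 5 + 1 = 11
Console.WriteLine(LineByteLength("Petropavlovsk-Kamchatsky", 9.5)); // 24 + 1 + 3 + 1 = 29
```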

Setup

Old notebook

  • Intel Core i5-4200U - 2.3 GHz
  • 8 GB DDR3 RAM - 1600 MHz
  • Toshiba MQ01ABF050 HDD - 100 MB/s read, 96 MB/s write

Writing & reading a file

There are several classes and methods in .NET that can be used to read & write data to & from a file:

  • FileInfo.CreateText
  • File.Create
  • File.CreateText
  • File.ReadAllText
  • StreamWriter
  • StreamReader
  • FileStream

After an initial test with a reduced data set, FileStream with default settings was the clear winner. Because I wanted to understand why that was the case, I took a deep dive into the documentation, source code, and performance tests for a couple of days. The collected information and results can be found here.
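The read path under test boils down to pulling large chunks out of a FileStream. A minimal sketch of such a chunked read loop (buffer size and file name are illustrative, not the exact benchmark code):

```csharp
using System;
using System.IO;

// Read the measurements file in large chunks through a FileStream
// opened with default options; parsing would happen per chunk.
long totalBytes = 0;
var buffer = new byte[1024 * 1024]; // 1 MB read buffer (illustrative size)

using var stream = new FileStream("measurements.txt", FileMode.Open, FileAccess.Read);
int read;
while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    totalBytes += read; // aggregate/parse the chunk here
}

Console.WriteLine($"Read {totalBytes:N0} bytes");
```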

File generation

The logic to generate the measurement rows isn't too complicated, but writing the file can take a lot of time. So before generating the final measurements file with 1B rows (~13 GB, see the table below), it is worth improving the code a bit. The name length is limited to 7 bytes per name, so a whole line is 12-14 bytes long.
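For orientation, a straightforward (unoptimized) generator in the spirit of the first table row might look like the following sketch. The station list, row count constant, and file name are illustrative; this is not the exact code behind the measurements in the table.

```csharp
using System;
using System.Globalization;
using System.IO;

// Simple generator sketch: one StreamWriter.WriteLine call per row,
// roughly the "baseline" variant in the table below.
string[] stations = { "Berlin", "Bergen", "Bern", "Bologna", "Bristol" }; // names limited to 7 bytes
var random = new Random();
const long rows = 1_000_000_000;

using var writer = new StreamWriter("measurements.txt") { NewLine = "\n" }; // rules require \n line endings
for (long i = 0; i < rows; i++)
{
    string station = stations[random.Next(stations.Length)];
    double temperature = Math.Round(random.NextDouble() * 199.8 - 99.9, 1); // -99.9 .. 99.9, one decimal
    writer.WriteLine($"{station};{temperature.ToString("F1", CultureInfo.InvariantCulture)}");
}
```

The table below tracks how far this naive approach could be pushed.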

| Optimization | Duration (min:sec:ms) | File size in GB | MB/s | (new-old)/old*100% | |
| --- | --- | --- | --- | --- | --- |
| StreamWriter.WriteLine(line as string) | 11:03:084 | 10.720 | 16.17 | baseline | File |
| FileStream.Write(line as byte array) | 12:01:301 | 16.750 | 23.22 | 42.19 | File |
| FileStream.Write(block filled with multiple lines) | 11:04:100 | 13.201 | 19.88 | 22.94 | File |
| Reduced array creation & copy when combining the line | 10:00:579 | 13.201 | 21.98 | 35.93 | File |
| Temperature generation writes bytes directly to block | 03:17:209 | 11.595 | 58.79 | 263.57 | File |
| Parallel execution | 03:06:660 | 13.195 | 70.69 | 337.17 | File |
| Copy name with Marshal.Copy instead of Array.Copy | 02:49:013 | 13.195 | 78.07 | 382.81 | File |
| Replaced real random with iteration through the values | 02:40:766 | 13.033 | 81.07 | 401.36 | File |
| Refactored code (parallel line generation & async write) | 02:39:172 | 13.200 | 82.93 | 412.86 | File |
| Allocate slices, store pointers & P/Invoke CopyMemory | 02:35:141 | 13.200 | 85.08 | 426.16 | File |
| P/Invoke CreateFile, WriteFile & CloseHandle | 02:26:106 | 13.200 | 90.34 | 458.69 | File |
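The last row bypasses FileStream entirely and calls the Win32 file functions directly via P/Invoke. A rough sketch of what that looks like (simplified declarations, no error handling or overlapped I/O, not the exact code used for the measurements):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Open a file with CreateFile, write one block with WriteFile and close
// the handle with CloseHandle, all via P/Invoke into kernel32.dll.
byte[] block = Encoding.UTF8.GetBytes("Berlin;12.3\n");

IntPtr handle = Win32.CreateFile("measurements.txt", Win32.GENERIC_WRITE, 0, IntPtr.Zero,
                                 Win32.CREATE_ALWAYS, Win32.FILE_ATTRIBUTE_NORMAL, IntPtr.Zero);
if (handle != new IntPtr(-1)) // INVALID_HANDLE_VALUE
{
    Win32.WriteFile(handle, block, (uint)block.Length, out uint written, IntPtr.Zero);
    Win32.CloseHandle(handle);
}

static class Win32
{
    public const uint GENERIC_WRITE = 0x40000000;
    public const uint CREATE_ALWAYS = 2;
    public const uint FILE_ATTRIBUTE_NORMAL = 0x80;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    public static extern IntPtr CreateFile(
        string fileName, uint desiredAccess, uint shareMode, IntPtr securityAttributes,
        uint creationDisposition, uint flagsAndAttributes, IntPtr templateFile);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool WriteFile(
        IntPtr handle, byte[] buffer, uint bytesToWrite, out uint bytesWritten, IntPtr overlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool CloseHandle(IntPtr handle);
}
```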