1️⃣🐝🏎️ The One Billion Row Challenge

At the beginning of January 2024, Gunnar Morling launched The One Billion Row Challenge (1BRC) for the Java community on his blog and GitHub. The challenge aimed to find Java code that processes one billion rows from a text file and aggregates the values in the fastest time possible. Many developers submitted solutions before Morling closed the challenge at the end of the month and published the final leaderboards. After news of the challenge spread, many people built their own solutions using other languages, databases, and tools.

Basics of running the challenge

  1. Generate the measurements file with 1B rows (just once).
  2. Calculate the min, max & average values from the measurements file (a naive baseline is sketched after this list).
    • Measure the time needed for reading the file and calculating the average. Output of the result values is not part of the challenge.
  3. Optimize the heck out of it to speed up your code!
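For reference, a naive single-threaded aggregation (a sketch only, assuming a `measurements.txt` file in the `<station>;<temperature>` format described below; output formatting and rounding are simplified) could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Naive baseline sketch: read the file line by line and aggregate
// min/max/mean per station. Deliberately unoptimized.
var stats = new Dictionary<string, (double Min, double Max, double Sum, long Count)>();

foreach (var line in File.ReadLines("measurements.txt"))
{
    int sep = line.IndexOf(';');                  // format: <station>;<temperature>
    string station = line[..sep];
    double value = double.Parse(line[(sep + 1)..], CultureInfo.InvariantCulture);

    stats[station] = stats.TryGetValue(station, out var s)
        ? (Math.Min(s.Min, value), Math.Max(s.Max, value), s.Sum + value, s.Count + 1)
        : (value, value, value, 1);
}

foreach (var (station, s) in stats)
    Console.WriteLine($"{station}={s.Min:F1}/{s.Sum / s.Count:F1}/{s.Max:F1}");
```

Something like this is the starting point that step 3 then optimizes away.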

Rules and limits

  • No external library dependencies may be used.
  • Implementations must be provided as a single source file.
  • The computation must happen at application runtime, i.e. you cannot process the measurements file at build time and just bake the result into the binary.
  • Input value ranges are as follows:
    • Station name: non null UTF-8 string of min length 1 character and max length 100 bytes, containing neither ; nor \n characters. (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.).
    • Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit.
  • There is a maximum of 10,000 unique station names.
  • Line endings in the file are \n characters on all platforms.
  • Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported.
  • The rounding of output values must be done using the semantics of IEEE 754 rounding-direction "roundTowardPositive".
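The last rule deserves a note: "roundTowardPositive" is the IEEE 754 directed rounding mode toward positive infinity, which is not what Math.Round does by default. A minimal sketch of one literal reading of that rule at one fractional digit (an illustrative helper, not taken from the challenge code, and ignoring binary floating-point edge cases):

```csharp
using System;

// Round a value to one fractional digit toward positive infinity
// (IEEE 754 "roundTowardPositive"), i.e. a ceiling at the first decimal.
static double RoundTowardPositive(double value) => Math.Ceiling(value * 10.0) / 10.0;

Console.WriteLine(RoundTowardPositive(-1.85)); // -1.8 (rounded toward +inf)
Console.WriteLine(RoundTowardPositive(1.84));  // 1.9
```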

Attention:

  • The original generator used a list of weather station names and selected 10k random names from it. The shortest name I found is 3 bytes long and the longest is 24 bytes.
    • Because the temperature always has one fractional digit, its text form takes 3-5 bytes. This gives a line range of 3-24 bytes name + 1 byte separator + 3-5 bytes temperature + 1 byte new line = 8-31 bytes per line. The total file size is therefore 8-31 GB, on average about 19.5 GB.
    • The original post contained a warning that the generated file would be approximately 12 GB in size, which would mean that most names are only about 5-7 bytes long.
  • The extended challenge added generic weather station names with 1-100 bytes in length.
    • This gives a line range of 1-100 bytes name + 1 byte separator + 3-5 bytes temperature + 1 byte new line = 6-107 bytes per line. The total file size is therefore 6-107 GB, on average about 56.5 GB.
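To sanity-check those per-line estimates, the byte length of a single `<station>;<temperature>\n` line can be computed directly (a small illustrative helper; the station names are just examples):

```csharp
using System;
using System.Globalization;
using System.Text;

// UTF-8 byte length of one "<station>;<temperature>\n" line:
// name + 1 byte ';' + temperature text (3-5 bytes) + 1 byte '\n'.
static int LineByteLength(string station, double temperature)
{
    string temp = temperature.ToString("F1", CultureInfo.InvariantCulture); // always one fractional digit
    return Encoding.UTF8.GetByteCount(station) + 1 + temp.Length + 1;
}

Console.WriteLine(LineByteLength("Abha", -31.1));                   // 4 + 1 + 5 + 1 = 11
Console.WriteLine(LineByteLength("Petropavlovsk-Kamchatsky", 9.5)); // 24 + 1 + 3 + 1 = 29
```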

Setup

Old notebook

  • Intel Core i5-4200U - 2.3 GHz
  • 8 GB DDR3 RAM - 1600 MHz
  • Toshiba MQ01ABF050 HDD - 100 MB/s read, 96 MB/s write

Writing & reading a file

There are several classes and methods in .NET that can be used to read & write data to & from a file:

  • FileInfo.CreateText
  • File.Create
  • File.CreateText
  • File.ReadAllText
  • StreamWriter
  • StreamReader
  • FileStream

After an initial test with a reduced data set, FileStream with default settings was the clear winner. Because I wanted to understand why that was the case, I took a deep dive into the documentation, source code, and performance tests for a couple of days. The collected information and results can be found here.
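The read path under test boils down to pulling large chunks out of a FileStream. A minimal sketch of such a chunked read loop (buffer size and file name are illustrative, not the exact benchmark code):

```csharp
using System;
using System.IO;

// Read the measurements file in large chunks through a FileStream
// opened with default options; parsing would happen per chunk.
long totalBytes = 0;
var buffer = new byte[1024 * 1024]; // 1 MB read buffer (illustrative size)

using var stream = new FileStream("measurements.txt", FileMode.Open, FileAccess.Read);
int read;
while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    totalBytes += read; // aggregate/parse the chunk here
}

Console.WriteLine($"Read {totalBytes:N0} bytes");
```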

File generation

The logic to generate the measurement rows isn't too complicated, but writing the file can take a lot of time. So before generating the final measurements file with 1B rows (~13 GB, see the table below), it is worth improving the code a bit. The name length is limited to 7 bytes per name, so a whole line is 12-14 bytes long.
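For orientation, a straightforward (unoptimized) generator in the spirit of the first table row might look like the following sketch. The station list, row count constant, and file name are illustrative; this is not the exact code behind the measurements in the table.

```csharp
using System;
using System.Globalization;
using System.IO;

// Simple generator sketch: one StreamWriter.WriteLine call per row,
// roughly the "baseline" variant in the table below.
string[] stations = { "Berlin", "Bergen", "Bern", "Bologna", "Bristol" }; // names limited to 7 bytes
var random = new Random();
const long rows = 1_000_000_000;

using var writer = new StreamWriter("measurements.txt") { NewLine = "\n" }; // rules require \n line endings
for (long i = 0; i < rows; i++)
{
    string station = stations[random.Next(stations.Length)];
    double temperature = Math.Round(random.NextDouble() * 199.8 - 99.9, 1); // -99.9 .. 99.9, one decimal
    writer.WriteLine($"{station};{temperature.ToString("F1", CultureInfo.InvariantCulture)}");
}
```

The table below tracks how far this naive approach could be pushed.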

| Optimization | Duration (min:sec:ms) | File size in GB | MB/s | (new-old)/old*100% | |
| --- | --- | --- | --- | --- | --- |
| StreamWriter.WriteLine(line as string) | 11:03:084 | 10.720 | 16.17 | baseline | File |
| FileStream.Write(line as byte array) | 12:01:301 | 16.750 | 23.22 | 42.19 | File |
| FileStream.Write(block filled with multiple lines) | 11:04:100 | 13.201 | 19.88 | 22.94 | File |
| Reduced array creation & copy when combining the line | 10:00:579 | 13.201 | 21.98 | 35.93 | File |
| Temperature generation writes bytes directly to block | 03:17:209 | 11.595 | 58.79 | 263.57 | File |
| Parallel execution | 03:06:660 | 13.195 | 70.69 | 337.17 | File |
| Copy name with Marshal.Copy instead of Array.Copy | 02:49:013 | 13.195 | 78.07 | 382.81 | File |
| Replaced real random with iteration through the values | 02:40:766 | 13.033 | 81.07 | 401.36 | File |
| Refactored code (parallel line generation & async write) | 02:39:172 | 13.200 | 82.93 | 412.86 | File |
| Allocate slices, store pointers & P/Invoke CopyMemory | 02:35:141 | 13.200 | 85.08 | 426.16 | File |
| P/Invoke CreateFile, WriteFile & CloseHandle | 02:26:106 | 13.200 | 90.34 | 458.69 | File |
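The last row bypasses FileStream entirely and calls the Win32 file functions directly via P/Invoke. A rough sketch of what that looks like (simplified declarations, no error handling or overlapped I/O, not the exact code used for the measurements):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Open a file with CreateFile, write one block with WriteFile and close
// the handle with CloseHandle, all via P/Invoke into kernel32.dll.
byte[] block = Encoding.UTF8.GetBytes("Berlin;12.3\n");

IntPtr handle = Win32.CreateFile("measurements.txt", Win32.GENERIC_WRITE, 0, IntPtr.Zero,
                                 Win32.CREATE_ALWAYS, Win32.FILE_ATTRIBUTE_NORMAL, IntPtr.Zero);
if (handle != new IntPtr(-1)) // INVALID_HANDLE_VALUE
{
    Win32.WriteFile(handle, block, (uint)block.Length, out uint written, IntPtr.Zero);
    Win32.CloseHandle(handle);
}

static class Win32
{
    public const uint GENERIC_WRITE = 0x40000000;
    public const uint CREATE_ALWAYS = 2;
    public const uint FILE_ATTRIBUTE_NORMAL = 0x80;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    public static extern IntPtr CreateFile(
        string fileName, uint desiredAccess, uint shareMode, IntPtr securityAttributes,
        uint creationDisposition, uint flagsAndAttributes, IntPtr templateFile);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool WriteFile(
        IntPtr handle, byte[] buffer, uint bytesToWrite, out uint bytesWritten, IntPtr overlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool CloseHandle(IntPtr handle);
}
```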