diff --git a/404.html b/404.html
index bbdfaf3..6373c4e 100644
--- a/404.html
+++ b/404.html
@@ -21,7 +21,7 @@
-
+
@@ -732,7 +732,7 @@
- Ben J. Ward ·
+ Sabrina J. Ward ·
Powered by the
Academic theme for
diff --git a/author/admin/index.html b/author/admin/index.html
index d31a8d6..40d7f48 100644
--- a/author/admin/index.html
+++ b/author/admin/index.html
@@ -21,7 +21,7 @@
-
+
@@ -333,7 +333,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/author/admin/index.xml b/author/admin/index.xml
index 6c1bd49..661d2ab 100644
--- a/author/admin/index.xml
+++ b/author/admin/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/author/index.xml b/author/index.xml
index 12a1551..fcf7066 100644
--- a/author/index.xml
+++ b/author/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/author/kbonham/index.xml b/author/kbonham/index.xml
index 7847d84..2bbbe4e 100644
--- a/author/kbonham/index.xml
+++ b/author/kbonham/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/authors/admin/index.xml b/authors/admin/index.xml
index 7ba8487..5780b76 100644
--- a/authors/admin/index.xml
+++ b/authors/admin/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/authors/admin/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/authors/index.xml b/authors/index.xml
index 5bddbd6..4fcd3a0 100644
--- a/authors/index.xml
+++ b/authors/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/authors/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/authors/jakobnissen/index.xml b/authors/jakobnissen/index.xml
index 2106142..6209056 100644
--- a/authors/jakobnissen/index.xml
+++ b/authors/jakobnissen/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/authors/jakobnissen/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/authors/kbonham/index.xml b/authors/kbonham/index.xml
index f9e8e19..372e817 100644
--- a/authors/kbonham/index.xml
+++ b/authors/kbonham/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/authors/kbonham/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/categories/index.xml b/categories/index.xml
index c34ec72..4998808 100644
--- a/categories/index.xml
+++ b/categories/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/categories/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/index.html b/index.html
index 4eaef5b..1a75126 100644
--- a/index.html
+++ b/index.html
@@ -21,7 +21,7 @@
-
+
@@ -1087,7 +1087,7 @@ Authors
Authors
Latest
Interests
Interests
Categories
Explore packages and projects
- Ben J. Ward
+ Sabrina Ward
Evolutionary Genetics, Genome Assembly, Retro Computing, Archery, Tabletop gaming
@@ -1852,7 +1852,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/index.json b/index.json
index 33b2b2a..16f5347 100644
--- a/index.json
+++ b/index.json
@@ -1 +1 @@
-[{"authors":["jakobnissen"],"categories":null,"content":"I am a bioinformatician at Statens Serum Institut (the Danish center for disease control), where I analyze genome of influenza with the purpose of better understanding pandemic flu.\n","date":1598797689,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":1598797689,"objectID":"839151c1b508087f6ab27942678f4035","permalink":"https://BioJulia.github.io/authors/jakobnissen/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/jakobnissen/","section":"authors","summary":"I am a bioinformatician at Statens Serum Institut (the Danish center for disease control), where I analyze genome of influenza with the purpose of better understanding pandemic flu.","tags":null,"title":"Jakob Nybo Nissen","type":"authors"},{"authors":["admin"],"categories":null,"content":"","date":1579820469,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":1579820469,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"https://BioJulia.github.io/authors/admin/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/admin/","section":"authors","summary":"","tags":null,"title":"Ben J. Ward","type":"authors"},{"authors":["kbonham"],"categories":null,"content":"I am a microbiologist and immunologist teaching and researching at Harvard University and the Broad Institute. I use computational methods to explore the relationships between microbial communities and human disease, and to understand complex microbial communities. I also teach about the immune system and infectious disease.\n","date":-62135596800,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":-62135596800,"objectID":"f8654d4cd97b4098f5c15014e9d7e021","permalink":"https://BioJulia.github.io/authors/kbonham/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/kbonham/","section":"authors","summary":"I am a microbiologist and immunologist teaching and researching at Harvard University and the Broad Institute. I use computational methods to explore the relationships between microbial communities and human disease, and to understand complex microbial communities. I also teach about the immune system and infectious disease.","tags":null,"title":"Kevin Bonham","type":"authors"},{"authors":["Jakob Nybo Nissen"],"categories":[],"content":"Find this notebook at https://github.com/jakobnissen/automa_tutorial\nIn bioinformatics, we have a saying:\n The first step of any bioinformatics project is to define a new file format, incompatible with all previous ones.\n The situation might not be quite as bad as the saying implies, but it is true that we have a lot of different file formats, representing the various kinds of data we work with.\nFor that reason, creating file parsers is a central task in bioinformatics, and has almost become a craft in itself. An awful, awful craft. For anything but the simplest formats, designing robust parsers is hard, and it is difficult to know if a parser you've built is solid. Insignificant details like trailing whitespace can break a seemingly well-built parser.\nAt its core, a parser is a program that checks whether some input data comforms to some format. For bioinformatics, we typically also want to load the data in the file into some data structure to work on it. The format itself is typically specified in some kind of higher-level, but unambiguous language, like this description of the Newick format. 
A parser, then, can be viewed as a program that:\n Is given a format specified in a high-level \u0026ldquo;language\u0026rdquo;, e.g. \u0026ldquo;a leaf in the Newick format consists of only a name, which is formed from the alphabet \u0026ldquo;a-z, A-Z and _\u0026quot;\u0026quot;, and\n Checks whether some input follows that format using relatively low-level instructions, e.g. checking whether the following bytes until the next (, ), , or ; match the pattern [A-Za-z_]\n The central job of a parser is then to act as a translator between different abstraction levels: The format is specified in a high-level language, but the actual data processing must happen in a lower-level language.\nWhen described like that, parser generation seems quite analogous to code compilation, and begins to look like a machine's job, not a human's. Machines are fast and accurate, and do not make oversights or mistakes - exactly the characteristics you want when making a parser. Let's leave the machine tasks to the machines; we humans ought to put our effort into what we do best: Specifying the formats themselves.\nIn this tutorial, I will introduce Automa, a Julia package for creating parsers. Actually, when reading this you might be able to tell that Automa solves a problem slightly more general than just creating parsers, but creating parsers is what we will use it for in this tutorial. Automa is not an easy package to use; it is complex and a little opaque, but it's worth all the effort and more. Parsers generated by Automa have several advantages over human-made parsers:\n They are failsafe: Any deviation of the input from the specified format, no matter how trivial, will cause it to fail. Therefore, they are much more reliable than human-written parsers. They are easy to change. You can often easily change details in the input format, such as allowing whitespace, with minimal changes to your code. They are quite fast, with Automa-code often being faster than even optimized human-crafted parsers. In this first part of the tutorial I will show how to create parsers from data loaded into memory in order to highlight Automa itself. In the second part, which I will release at a later date, I will show how to create a proper BioJulia reader and writer for a file format.\nOur format For our example, let us begin with a simplified version of the FASTA format - we can call it \u0026ldquo;SimpleFasta\u0026rdquo;. Our format will be defined as something that looks like this:\n\u0026gt;first TGATAGTAGTCGTGTGATA \u0026gt;second AGGACCCAATTATCGGGGTAA I.e.\n A SimpleFasta file is a sequence of zero or more records A record consists of header * newline * sequence * newline, where \u0026ldquo;*\u0026rdquo; signifies concatenation A header consists of \u0026gt; followed by one or more characters from the alphabet a-z A sequence consists of one or more characters from the alphabet A, C, G or T.
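Just to make the specification above concrete, the whole SimpleFasta grammar can also be written as one ordinary Julia Regex. This one-liner is only an aside for intuition - it is not part of the Automa machinery built below, and a plain regex match can only say yes or no, not run code while it parses:
# SimpleFasta: zero or more records, each ">" + [a-z]+ + newline + [ACGT]+ + newline
simplefasta = r"^(?:>[a-z]+\n[ACGT]+\n)*$"
occursin(simplefasta, ">first\nTGATAGTAGTCGTGTGATA\n>second\nAGGACCCAATTATCGGGGTAA\n")  # true
occursin(simplefasta, ">first\nthere!\n")  # false - the sequence line is not [ACGT]+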
We can visualize a hypothetical SimpleFasta parser using a simple flowchart - note: You need to have graphviz installed to be able to run the code below.\ns = \u0026quot;\u0026quot;\u0026quot;digraph { graph [ rankdir = LR ]; begin [ shape = circle ]; begin -\u0026gt; header; header [ shape = circle ]; header -\u0026gt; sequence; sequence [ shape = circle ]; sequence -\u0026gt; end; end [ shape = circle ]; sequence -\u0026gt; header; begin -\u0026gt; end; } \u0026quot;\u0026quot;\u0026quot; function display_machine() write(\u0026quot;/tmp/fasta_machine.dot\u0026quot;, s) run(`dot -Tsvg -o /tmp/fasta_machine.svg /tmp/fasta_machine.dot`) open(\u0026quot;/tmp/fasta_machine.svg\u0026quot;) do file display(\u0026quot;image/svg+xml\u0026quot;, String(read(file))) end end; display_machine() After the beginning of the file, we expect to see a header. Next, we move to the sequence. From there, we can either reach the end of the file, or loop back to another header. Alternatively, in the case of an empty file, we may move from the beginning of the file to the end immediately without any headers or sequences. We consider an empty file to also be specification-compliant.\nNote that this simple flowchart describes any valid SimpleFasta format. But if, for example, a file had two consecutive header lines, there is no path we can take through the flowchart from \u0026ldquo;begin\u0026rdquo; to \u0026ldquo;end\u0026rdquo; which describes the file, and we can tell that the input did not conform to the format.\nThe circles marked \u0026ldquo;begin\u0026rdquo;, \u0026ldquo;header\u0026rdquo;, \u0026ldquo;sequence\u0026rdquo; or \u0026ldquo;end\u0026rdquo; are what we call the four states. Because the parser shown in the diagram has a fixed list of possible states that it changes between, we call the parser a finite state machine (FSM), or finite state automaton (FSA).\nThe flowchart above doesn't do a very good job of describing the FSM that our parser is, because it doesn't tell you how the machine transitions between states. The state transitions are what actually defines the format. For example, we know we are transitioning to a header when we see a \u0026gt; symbol - that is, in a way, what defines the header state.\nIn the flowchart below, the arrows signifying transitions between states are marked with the input character(s) that causes the FSM (our parser) to transition, written in a regex-like notation. For example, the transition from a header to a sequence is defined by a \\n followed by one of the characters in [ACGT]. \u0026ldquo;EOF\u0026rdquo; signifies end of file. Note also that some states have transitions leading to itself. 
For example, when in the \u0026ldquo;header\u0026rdquo; state, the FSM may continue to read characters in a-z and stay in the \u0026ldquo;header\u0026rdquo; state.\ns = \u0026quot;\u0026quot;\u0026quot;digraph { graph [ rankdir = LR ]; begin [ shape = circle ]; header [ shape = circle ]; sequence [ shape = circle ]; end [ shape = circle ]; begin -\u0026gt; header [ label = \u0026quot;\u0026gt;[a-z]\u0026quot; ]; header -\u0026gt; header [ label = \u0026quot;[a-z]\u0026quot;]; header -\u0026gt; sequence [ label = \u0026quot;\\\\\\\\n[ACGT]\u0026quot; ]; sequence -\u0026gt; sequence [ label = \u0026quot;[ACGT]\u0026quot; ]; sequence -\u0026gt; header [ label = \u0026quot;\\\\\\\\n\u0026gt;\u0026quot; ]; sequence -\u0026gt; end [ label = \u0026quot;\\\\\\\\nEOF\u0026quot; ]; begin -\u0026gt; end; } \u0026quot;\u0026quot;\u0026quot; display_machine() How does the machine know when it's supposed to transition between states? Suppose a machine is at the beginning of a file and it begins by observing the input \u0026gt;. Is this part of a transition to the header state, or the beginning of a malformed input?\nIt is certainly possible for the machine to have a memory of the pattern, such that it only transitions to the \u0026ldquo;header\u0026rdquo; state once it sees the two specific inputs \u0026gt; and [a-z] in that order. However, we can make things much simpler by requiring the machine to make a state transition for every single input byte; then we don't need any memory other than what state the machine is in, which is a single integer. To still represent the same format as the flowchart above, we need to insert some extra states:\ns = \u0026quot;\u0026quot;\u0026quot;digraph { graph [ rankdir = LR ]; begin [ shape = circle ]; 2 [ shape = circle ]; 1 [ shape = circle ]; 3 [ shape = circle ]; begin -\u0026gt; 1 [ label = \u0026quot;\u0026gt;\u0026quot; ]; header [ shape = circle ]; 1 -\u0026gt; header [label = \u0026quot;[a-z]\u0026quot; ]; header -\u0026gt; header [ label = \u0026quot;[a-z]\u0026quot; ]; 2 -\u0026gt; sequence [ label = \u0026quot;[ACGT]\u0026quot; ]; header -\u0026gt; 2 [ label = \u0026quot;\\\\\\\\n\u0026quot; ]; sequence [ shape = circle ]; sequence -\u0026gt; sequence [ label = \u0026quot;[ACGT]\u0026quot; ]; sequence -\u0026gt; 3 [ label = \u0026quot;\\\\\\\\n\u0026quot; ]; 3 -\u0026gt; 1 [ label = \u0026quot;\u0026gt;\u0026quot; ]; end [ shape = circle ]; 3 -\u0026gt; end [ label = \u0026quot;EOF\u0026quot; ]; \u0026quot;begin\u0026quot; -\u0026gt; end [ label = \u0026quot;EOF\u0026quot; ]; } \u0026quot;\u0026quot;\u0026quot; display_machine() The diagram above is exactly equivalent to the previous one, but this diagram makes a state transition for every single input byte. All bytes cause a transition, and no transition consumes multiple bytes. For simplicity, Automa's parsers work like that.\nNote that it is always possible, given a FSM with multi-byte transitions, to convert it to a FSM with single-byte transitions by just putting in more nodes. E.g. the transition from header to sequence requires two bytes, a newline and an A, C, G or T. Therefore, an intermediate node (on the illustration marked \u0026ldquo;2\u0026rdquo;) is needed.
Unfortunately, the requirement of single-byte transitions limits what you can do with Automa - I'll come back to the limitations later.\nNow let's create this same FSM using Automa.\nIntroducing Automa # First imports import Automa import Automa.RegExp: @re_str const re = Automa.RegExp; We define the elements that make up the SimpleFasta (the header and the sequence) using Automa.RegExp, a subset of regular expressions. Automa's regular expressions can be manipulated with the following operations:\n Function Alias Meaning cat(re...) * concatenation alt(re1, re2...) \\| alternation rep(re) zero or more repetition rep1(re) one or more repetition opt(re) zero or one repetition isec(re1, re2) \u0026amp; intersection diff(re1, re2) \\ | difference (subtraction) | neg(re) ! negation Furthermore inside the re-strings themselves, you can use\n groups of characters, like [ACGT] representing the set {A, C, G, T} ranges A-Z representing 'A':'Z' (or more accurately, the bytes UInt8('A'):UInt8('Z') *, representing zero or more repetitions of the last element, such as [a-z]* +, representing one or more repetitions of the last element, such as [a-z]+ ?, representing zero or one repetition of the last element, such as [a-z]? We can specify our FSM like below. We use a let statement only so we don't pollute global namespace with header, sequence etc.\nmachine = let # Define SimpleFasta syntax header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; sequence = re\u0026quot;[ACGT]+\u0026quot; record = header * re\u0026quot;\\n\u0026quot; * sequence * re\u0026quot;\\n\u0026quot; fasta = re.rep(record) # Compile the regex to a FSM Automa.compile(fasta) end; If you have dot installed on your computer, you can visualize the compiled FSM. You might want to modify the paths in the function below to make it work on your computer.\nfunction display_machine(machine) write(\u0026quot;/tmp/fasta_machine.dot\u0026quot;, Automa.machine2dot(machine)) run(`dot -Tsvg -o /tmp/fasta_machine.svg /tmp/fasta_machine.dot`) open(\u0026quot;/tmp/fasta_machine.svg\u0026quot;) do file display(\u0026quot;image/svg+xml\u0026quot;, String(read(file))) end end; display_machine(machine) Yes, that looks correct! The start node here is the small point on the left. You can see one of the states are represented by a double circle. This represents an accept state, where the machine may stop at a valid end of input - here after each SimpleFasta record. If the machine stops at any other state, it did not complete correctly. In other words, state \u0026ldquo;1\u0026rdquo; here doubles as the \u0026ldquo;end\u0026rdquo; state.\nWith this parser we can read a SimpleFasta file and determine if it conforms to the specification. In Automa, we do this by having our machine create Julia code in the form of a Julia expression (of type Expr), which we can then use to create a function using metaprogramming.\nFirst, we need an Automa.CodeGenContext, an object which contains the options for code generation. For now, we don't worry about the settings and just make a default context object:\ncontext = Automa.CodeGenContext(); Then, we use two functions to generate code:\n Automa.generate_init_code(::CodeGenContext, ::Machine) creates the Julia code needed to initialize the parsing of the given FSM. Automa.generate_exec_code(::CodeGenContext, ::Machine, ::Dict{Symbol, Expr}) creates the Julia code needed for the main loop of the running parser. Here, the Dict is an action dict, which may contain arbitrary Julia code that the parser executes while parsing. 
We need that action dict later to make our parser do something useful - but for now, we simply want to create a parser, not necessarily a useful one, so we just use an empty dict. We will come back to the action dict later. Let's have a look at these functions:\nAutoma.generate_init_code(context, machine) quote #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:94 =# p::Int = 1 #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:95 =# p_end::Int = 0 #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:96 =# p_eof::Int = -1 #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:97 =# cs::Int = 1 end Besides the comments, the init code seems to define four integers. What do they mean?\nAutoma works by reading the data byte by byte. When doing this, the data is represented in the function as an AbstractVector{UInt8} - let us call that data. p represents the index of data that the parser is currently reading from. p_end represents the last position of the data in data. Because data may just be a small, buffered part of a much larger file being parsed, p_eof represents the index of the actual end of input. Last, cs is short for \u0026ldquo;current state\u0026rdquo;, and is an integer representing the state of the FSM. You can see that, sensibly, it is initialized at state 1, and so, by looking at the FSM diagram, we can see it next expects a \u0026lsquo;\u0026gt;\u0026rsquo;. The zero state is special, and represents the \u0026ldquo;accept state\u0026rdquo;, that is, the state of a correctly exited FSM.\nThe code generated by Automa.generate_exec_code is a little more complicated, and also depends on the CodeGenContext. But it will increment p by one, read a byte from data, and execute a state transition (i.e. change cs) depending on the byte. If cs goes negative, it knows the input is malformed. If cs is zero, the parser has finished. If cs is positive, it will continue until end of file.\nAlright, let's create a parser using these functions!\n@eval function parse_fasta(data::Union{String,Vector{UInt8}}) $(Automa.generate_init_code(context, machine)) # p_end and p_eof were set to 0 and -1 in the init code, # we need to set them to the end of input, i.e. the length of `data`. p_end = p_eof = lastindex(data) # We just use an empty dict here because we don't want our parser # to do anything just yet - only parse the input $(Automa.generate_exec_code(context, machine, Dict{Symbol, Expr}())) # We need to make sure that we reached the accept state, else the # input did not parse correctly iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return nothing end; Does it work?\nfor (inputno, data) in enumerate([ \u0026quot;\u0026quot;, \u0026quot;\u0026gt;hi\\nACGT\\n\u0026quot;, \u0026quot;\u0026gt;hi\\nACGT\u0026quot;, \u0026quot;\u0026gt;hi\\nAA\\n\u0026gt;x\\nGTGATCGTAGTA\\n\u0026quot;, \u0026quot;\u0026gt;hi\\nthere!\u0026quot; ]) try parse_fasta(data) println(\u0026quot;Input $inputno: parsed successfully!\u0026quot;) catch e println(\u0026quot;Input $inputno: \u0026quot;, e) end end Input 1: parsed successfully! Input 2: parsed successfully! Input 3: ErrorException(\u0026quot;failed to parse on byte 9\u0026quot;) Input 4: parsed successfully! Input 5: ErrorException(\u0026quot;failed to parse on byte 5\u0026quot;) Excellent. It correctly parsed the empty data, input 2 and 4. Input 3 lacked a trailing newline, and input 5 didn't have a proper sequence.\nBut just parsing a file is not too useful.
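Before moving on, here is a rough, hand-written sketch of what the loop produced by Automa.generate_exec_code does conceptually. Everything in this snippet is illustrative: the transition table is hypothetical, and the real generated code is far more elaborate (it uses compact lookup tables, handles buffers and EOF, and runs actions), but the core idea is just a byte-at-a-time state update:
# Conceptual stand-in for the generated exec loop - not Automa's actual output.
# `transitions` is a made-up table mapping (current state, input byte) to the next state.
function conceptual_exec(data::Vector{UInt8}, transitions::Dict{Tuple{Int,UInt8},Int})
    cs = 1                   # current state, as in the generated init code
    p = 1                    # index of the byte being read
    p_eof = lastindex(data)  # index of the last byte of input
    while p <= p_eof && cs > 0
        cs = get(transitions, (cs, data[p]), -cs)  # unknown byte: go to a negative (error) state
        p += 1
    end
    return cs
end
The only state carried from byte to byte is the single integer cs, which is exactly the point made above; the real generated code additionally sets cs to zero when the machine exits correctly, which is why parse_fasta checks iszero(cs).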
Normally, we parse a file in order to load it into some kind of data structure to manipulate it. That's where the action dict comes into the picture. We can make Automa execute arbitrary Julia code during parsing. Here is how it works:\n When defining our machine, we add action names, in the form of Julia Symbols, to certain regexes. For example, we may define that exiting the regex sequence should execute the action :exit_sequence. Like other Julia identifiers, the name can be chosen arbitrarily. Using the action dict, which was of type Dict{Symbol, Expr}, we can map action names to Julia code. This tells Automa what each action name means. The action expressions are put into the function we create at parse time, so there is no runtime cost to this. First, we redefine our machine. This time, we add actions:\nmachine = let # Define SimpleFasta syntax newline = re\u0026quot;\\n\u0026quot; header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; sequence = re\u0026quot;[ACGT]+\u0026quot; record = header * newline * sequence * newline fasta = re.rep(record) # Define SimpleFasta actions using arbitrary names header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] sequence.actions[:enter] = [:enter] sequence.actions[:exit] = [:exit_sequence] record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; If we visualize the machine now, the actions will be printed on the relevant edges in the FSM diagram. For example, the newline after a header is labeled '\\n'/exit_header, meaning the transition will happen at the input byte '\\n', and it will execute the action exit_header.\nAlso note that the diagram structure itself changed, because the entering state and exit state are now distinct. The exit state 6 will execute :exit_record at the end of file (EOF), whereas if the FSM ends at state 1, :exit_record is not executed.\nLet's also define a simple data structure to parse the SimpleFasta records into. Like our SimpleFasta format, let's not overcomplicate things:\nusing BioSequences struct FastaRecord{A} header::String sequence::LongSequence{A} end FastaRecord(h, sequence::LongSequence{A}) where A = FastaRecord{A}(h, sequence); And now our action dict. I can use the variables data, p, p_end etc. in the code:\nactions = Dict( # Record which byte position marks the beginning of header or sequence :enter =\u0026gt; quote mark = p end, :exit_header =\u0026gt; quote header = String(data[mark+1:p-1]) end, :exit_sequence =\u0026gt; quote sequence = LongSequence{A}(data[mark:p-1]) end, :exit_record =\u0026gt; quote record = FastaRecord(header, sequence) push!(records, record) end, ); @eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A \u0026lt;: Alphabet} # We can't control the order that the code from our action dict is executed, so # we define variables beforehand so we can be sure they are not used before # they are defined. mark = 0 records = FastaRecord{A}[] header = \u0026quot;\u0026quot; sequence = LongSequence{A}() # The rest is just like in our previous example, except now we use the # action dict we just created.
$(Automa.generate_init_code(context, machine)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, machine, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return records end; We can verify that it works:\nparse_fasta(DNAAlphabet{4}, \u0026quot;\u0026gt;hello\\nTAG\\n\u0026gt;there\\nTAA\\n\u0026quot;) 2-element Array{FastaRecord{DNAAlphabet{4}},1}: FastaRecord{DNAAlphabet{4}}(\u0026quot;hello\u0026quot;, TAG) FastaRecord{DNAAlphabet{4}}(\u0026quot;there\u0026quot;, TAA) Redesign the format With the basics now covered, let's step up the game a tad and make it parse SimpleFasta records with multiple lines per sequence. We first define a seqline pattern that looks like the old sequence pattern, and build a new sequence pattern from multiple seqline patterns.\nmachine = let # Define SimpleFasta syntax header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; seqline = re\u0026quot;[ACGT]+\u0026quot; sequence = seqline * re.rep(re\u0026quot;\\n\u0026quot; * seqline) record = header * re\u0026quot;\\n\u0026quot; * sequence * re\u0026quot;\\n\u0026quot; fasta = re.rep(record) # Define SimpleFasta actions header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] seqline.actions[:enter] = [:enter] seqline.actions[:exit] = [:exit_seqline] record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; The machine now has a subtle change with a small loop between node 5 and 6 representing the parser seeing multiple sequence lines.\ndisplay_machine(machine) We also need to update the actions. Here, I use :(this syntax), which is equivalent to\nquote this syntax end but more readable for one-liners. I keep a buffer containing the sequence, and append every sequence line to the buffer at the end of each seqline. When the record is done, I create the sequence and empty the buffer for the next record.\nactions = Dict( :enter =\u0026gt; :(mark= p), :exit_header =\u0026gt; :(header = String(data[mark+1:p-1])), :exit_seqline =\u0026gt; quote doff = length(seqbuffer) + 1 resize!(seqbuffer, length(seqbuffer) + p - mark) copyto!(seqbuffer, doff, data, mark, p-mark) end, :exit_record =\u0026gt; quote sequence = LongSequence{A}(seqbuffer) empty!(seqbuffer) record = FastaRecord(header, sequence) push!(records, record) end, ); We need to redefine parse_fasta, now also containing the seqbuffer:\n@eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A \u0026lt;: Alphabet} mark = 0 records = FastaRecord{A}[] header = \u0026quot;\u0026quot; sequence = LongSequence{A}() seqbuffer = UInt8[] $(Automa.generate_init_code(context, machine)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, machine, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return records end; And we can confirm it now works for multiple sequences!\nparse_fasta(DNAAlphabet{2}, \u0026quot;\u0026gt;hello\\nTAG\\nGGG\\n\u0026gt;there\\nTAA\\nTAG\\n\u0026quot;) 2-element Array{FastaRecord{DNAAlphabet{2}},1}: FastaRecord{DNAAlphabet{2}}(\u0026quot;hello\u0026quot;, TAGGGG) FastaRecord{DNAAlphabet{2}}(\u0026quot;there\u0026quot;, TAATAG) Even more complexity One of the greatest things about Automa is how we can parse quite complicated formats without needing to manually construct much code. 
For example, here, I change only the regular expressions:\n I allow optional Windows-style newlines (\\r\\n) I allow trailing whitespace on each line (re.space() \\ (re\u0026quot;\\r\u0026quot; | re\u0026quot;\\n\u0026quot;) means all space characters except \\r or \\n), but not whitespace inside headers Sequences can only contain all IUPAC ambiguous RNA or DNA nucleotides The trailing record no longer needs a trailing newline to be valid. It has the same actions as a regular record. machine = let # Define FASTA syntax newline = re.opt(\u0026quot;\\r\u0026quot;) * re\u0026quot;\\n\u0026quot; hspace = re.space() \\ (re\u0026quot;\\r\u0026quot; | re\u0026quot;\\n\u0026quot;) header = re\u0026quot;\u0026gt;\u0026quot; * re.rep1((re.any() \\ re.space())) * re.opt(hspace) seqline = re\u0026quot;[acgturmkyswbdhvnACGTURMKYSWBDHVN\\-*]+\u0026quot; * re.opt(hspace) sequence = seqline * re.rep(newline * seqline) record = header * newline * sequence * newline trailing_record = header * newline * sequence * re.rep(newline * re.opt(hspace)) fasta = re.rep(newline) * re.rep(record) * re.opt(trailing_record) # Define FASTA actions header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] seqline.actions[:enter] = [:enter] seqline.actions[:exit] = [:exit_seqline] record.actions[:exit] = [:exit_record] trailing_record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; The machine is now a bit more complicated. But who cares, it's automatically generated. Look how easy it was to generate!\ndisplay_machine(machine) Pitfall 1: Ambiguous parsers Besides its complexity, building parsers with Automa has some pitfalls - notably those caused by Automa transitioning state for every input byte.\nUnfortunately, it's very easy to accidentally create a machine that can't possibly figure out what action to take when looking only at one byte at a time. The following simple machine will not compile (on the master branch of Automa).\nbad_machine = let left = re\u0026quot;Turn left!\u0026quot; right = re\u0026quot;Turn right!\u0026quot; left_or_right = left | right left.actions[:enter] = [:turn_left] right.actions[:enter] = [:turn_right] Automa.compile(left_or_right) end Ambiguous DFA: Input 0x54 can lead to actions [:turn_right] or [:turn_left] Stacktrace: [1] error(::String) at ./error.jl:33 [2] validate_paths(::Array{Tuple{Union{Nothing, Automa.Edge},Array{Symbol,1}},1}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:176 [3] validate_nfanodes(::Automa.StableSet{Automa.NFANode}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:195 [4] nfa2dfa(::Automa.NFA, ::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:104 [5] compile(::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [6] compile(::Automa.RegExp.RE) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [7] top-level scope at In[27]:9 [8] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 Automa refuses to create this FSM. Why? Well, the first byte it expects is a T (or 0x54) - but there is no way of knowing whether it's entering left or right when it sees that byte, and these two patterns have conflicting actions! If they had the same actions, there would be no problem.\nThis situation is highly artificial, but it happens often in real life.
Here's a very subtle change to one of the previous parsers:\nbad_machine = let # Define SimpleFasta syntax header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; seqline = re\u0026quot;[ACGT]+\u0026quot; sequence = seqline * re.rep(re\u0026quot;\\n\u0026quot; * seqline) record = header * re\u0026quot;\\n\u0026quot; * sequence fasta = re.rep(record * re\u0026quot;\\n\u0026quot;) # Define SimpleFasta actions header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] seqline.actions[:enter] = [:enter] seqline.actions[:exit] = [:exit_seqline] record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; Ambiguous DFA: Input 0x0a can lead to actions [:exit_seqline, :exit_record] or [:exit_seqline] Stacktrace: [1] error(::String) at ./error.jl:33 [2] validate_paths(::Array{Tuple{Union{Nothing, Automa.Edge},Array{Symbol,1}},1}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:176 [3] validate_nfanodes(::Automa.StableSet{Automa.NFANode}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:195 [4] nfa2dfa(::Automa.NFA, ::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:104 [5] compile(::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [6] compile(::Automa.RegExp.RE) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [7] top-level scope at In[28]:16 [8] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 After a seqline, when encountering a newline, there is no way of knowing whether this marks the end of a record or whether it continues at the next line. Automa can't look ahead; it has to make a decision at every byte.\nHere, the solution is to move the newline from the definition of fasta to that of record. That way, the newline functions as a kind of one-byte lookahead - if the byte after the newline is a \u0026gt;, it knows the record is complete.\nIn general, encountering situations like this is usually a sign that the parser is badly built - or that the format is. You can usually resolve it by using single-byte lookahead like above. If not, it is possible to optionally disable the check for ambiguous actions by passing a keyword to the Automa.compile function. But beware that this may lead to absurd results.\nPitfall 2: No recursively defined patterns Some formats are naturally recursive. The Newick format, for example, defines \u0026ldquo;subtree\u0026rdquo; in terms of \u0026ldquo;internal\u0026rdquo;, which is defined in terms of \u0026ldquo;branchset\u0026rdquo;, defined in terms of \u0026ldquo;branch\u0026rdquo; which is defined in terms of\u0026hellip; \u0026ldquo;subtree\u0026rdquo;.\nSomething like that will not work in Automa:\nsimple_newick = let name = re\u0026quot;[a-z_]\u0026quot; clade = name | (re\u0026quot;\\(\u0026quot; * cladeset * \u0026quot;)\u0026quot;) cladeset = clade * re.rep(re\u0026quot;,\u0026quot; * clade) Automa.compile(cladeset * re\u0026quot;;\u0026quot;) end UndefVarError: cladeset not defined Stacktrace: [1] top-level scope at In[29]:3 [2] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 Because, in general, you can't refer to objects in Julia that have not yet been defined. Worse, even if the syntactical challenge were solved, recursive patterns usually only make sense if you have lookahead.
Newick files, for example, simply can't be parsed by FSMs, because every time you see an open parenthesis, you need to make sure there is a closing one further in the file, and this requires lookahead that can't be resolved by parsing one byte at a time.\nLuckily, because Automa can execute arbitrary Julia code, we can have the FSM manipulate a stack and turn the FSM into a pushdown automaton, which can handle formats like Newick. It does require a little more manual fiddling, and is less elegant:\nsimple_newick = let name = re\u0026quot;[a-z_]+\u0026quot; cladestart = re\u0026quot;\\(\u0026quot; cladestop = re\u0026quot;\\)\u0026quot; cladesep = re\u0026quot;,\u0026quot; nodechange = cladestart | cladestop | cladesep newick_element = nodechange * re.opt(name) tree = re.opt(name) * re.rep(newick_element) * re\u0026quot;;\u0026quot; cladestart.actions[:enter] = [:push] cladestop.actions[:enter] = [:pop] cladesep.actions[:enter] = [:pop, :push] Automa.compile(tree) end actions = Dict( :push =\u0026gt; quote level -= 1 end, :pop =\u0026gt; quote iszero(level) \u0026amp;\u0026amp; error(\u0026quot;Too many closing parentheses\u0026quot;) level += 1 end ); @eval function parse_newick(data::Union{String,Vector{UInt8}}) level = 0 $(Automa.generate_init_code(context, simple_newick)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, simple_newick, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) iszero(level) || error(\u0026quot;Too many opening parentheses\u0026quot;) end; And it sort of works:\nprintln(\u0026quot;Does this parse? \u0026quot;, parse_newick(\u0026quot;(hello,hi);\u0026quot;)) println(\u0026quot;Does this parse? \u0026quot;, parse_newick(\u0026quot;(hello,hi)(;\u0026quot;)) Does this parse? true Too many opening parentheses Stacktrace: [1] error(::String) at ./error.jl:33 [2] parse_newick(::String) at ./In[31]:8 [3] top-level scope at In[32]:2 [4] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 Here, I needed to manually keep track of the level of nesting. In fact, I would have to keep track of more stuff to disallow inputs like ();, removing some of the point of Automa - namely, that it is automatic.\nIn the next part of this tutorial, I will show how to use Automa to create honest-to-God parsers which integrate with the BioJulia ecosystem and can work on streamed data.\nOptimizing Automa's parsers are pretty fast. To create hand-written parsers of comparable speeds, you need to microoptimize every operation, which is annoying. That being said, for actually loading parsed files to be fast, you need to optimize the user-defined actions in the actions dictionary, too.\nHow fast is our un-optimized Fasta parser already? Let's test it on some data and find out. I'll make 50k sequences with 2k base pairs each, for a total of 100 MB data.\nRemember to re-run the block of code where the parse_fasta function is defined since we changed the machine.\n# Create 50k records with 2kbp each function generate_data() open(\u0026quot;/tmp/fasta.fna\u0026quot;, \u0026quot;w\u0026quot;) do file for seq in 1:50000 println(file, '\u0026gt;', join(rand('a':'z', 16))) for line in 1:20 println(file, join(rand(\u0026quot;ACGT\u0026quot;, 100))) end end end end generate_data() function parsedata() open(\u0026quot;/tmp/fasta.fna\u0026quot;) do file parse_fasta(DNAAlphabet{2}, read(file)) end end; parsedata(); @time parsedata(); 0.330489 seconds (200.04 k allocations: 138.861 MiB, 4.15% gc time) Alright! 
It does about 300 MB/s on my laptop. Not bad! I'd say for most applications, this speed is more than sufficient.\nBut this is Julia. We want to be able to optimize the crap out of it. So let's optimize the actions:\nactions = Dict( :enter =\u0026gt; :(mark = p), # Create string with only one copy of the data. :exit_header =\u0026gt; :(header = unsafe_string(pointer(data, mark + 1), p - mark - 1)), # Only resize buffer if it's too small, else just keep track of how many bytes are used. :exit_seqline =\u0026gt; quote N = p - mark length(seqbuffer) \u0026lt; filled + N \u0026amp;\u0026amp; resize!(seqbuffer, filled + N) copyto!(seqbuffer, filled+1, data, mark, N) filled += N end, # Use `copyto!` to only encode data directly from data vector into sequence # note this requires the latest versions of BioSequences. :exit_record =\u0026gt; quote sequence = copyto!(LongSequence{A}(filled), 1, seqbuffer, 1, filled) record = FastaRecord(header, sequence) push!(records, record) filled = 0 end, ); If we want to optimize, we also need to use the fastest CodeGenContext. We use the :goto-generator. This creates machine code using @goto-statements, which creates very long and completely unreadable code - but it's fast. Also, who cares if it's hard to read. It's machine-generated code. We disable the FSM's boundschecks when accessing the data vector - since we don't manipulate the position p manually, we are confident it will not go out of bounds. And we allow the code generator to unroll loops:\ncontext = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=4); We also need to modify the parse function because we use a new variable called filled and need to initialize it.\n@eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A \u0026lt;: Alphabet} mark = 0 records = FastaRecord{A}[] header = \u0026quot;\u0026quot; sequence = LongSequence{A}() seqbuffer = UInt8[] filled = 0 $(Automa.generate_init_code(context, machine)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, machine, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return records end; parsedata(); @time parsedata(); 0.162113 seconds (150.04 k allocations: 133.520 MiB, 7.90% gc time) Okay, we're at slightly above 600 MB/s. Profiling confirms nearly all time is spent encoding the DNA sequences to LongSequence. We can make it faster still by not storing the data as a LongSequence, and instead make it a \u0026ldquo;lazy\u0026rdquo; object that only constructs the sequence upon demand. Even further, we could avoid heap-allocating each sequence individually. 
But these optimizations do not pertain directly to Automa, and I will leave them here.\n","date":1598797689,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598797689,"objectID":"9e06435c7c0526e80ee1ebbfe9d021c5","permalink":"https://BioJulia.github.io/post/automa1/","publishdate":"2020-08-30T16:28:09.16+02:00","relpermalink":"/post/automa1/","section":"post","summary":"Find this notebook at https://github.com/jakobnissen/automa_tutorial\nIn bioinformatics, we have a saying:\n The first step of any bioinformatics project is to define a new file format, incompatible with all previous ones.\n The situation might not be quite as bad as the saying implies, but it is true that we have a lot of different file formats, representing the various kinds of data we work with.\nFor that reason, creating file parsers is a central task in bioinformatics, and has almost become a craft in itself.","tags":[],"title":"Tutorial to Automa: Part 1","type":"post"},{"authors":["Jakob Nybo Nissen"],"categories":[],"content":"Find this notebook at https://github.com/jakobnissen/hardware_introduction\nProgramming is used in many fields of science today, where individual scientists often have to write custom code for their own projects. For most scientists, however, computer science is not their field of expertise; they have learned programming by necessity. I count myself as one of them. While we may be reasonably familiar with the software side of programming, we rarely have even a basic understanding of how computer hardware impacts code performance.\nThe aim of this tutorial is to give non-professional programmers a brief overview of the features of modern hardware that you must understand in order to write fast code. It will be a distillation of what I have learned over the last few years. This tutorial will use Julia because it allows these relatively low-level considerations to be demonstrated easily in a high-level, interactive language.\nThis is not a guide to the Julia programming language To write fast code, you must first understand your programming language and its idiosyncrasies. But this is not a guide to the Julia programming language. I recommend reading the performance tips section of the Julia documentation.\nThis is not an explanation of specific datastructures or algorithms Besides knowing your language, you must also know your own code to make it fast. You must understand the idea behind big-O notation, why some algorithms are faster than others, and how different data structures work internally. Without knowing what an Array is, how could you possibly optimize code making use of arrays?\nThis, too, is outside the scope of this paper. However, I would say that as a minimum, a programmer should have an understanding of:\n How a binary integer is represented in memory How a floating point number is represented in memory (learning this is also necessary to understand computational inaccuracies from floating point operations, which is a must when doing scientific programming) The memory layout of a String including ASCII and UTF-8 encoding The basics of how an Array is structured, and what the difference between a dense array of e.g. integers and an array of references to objects is The principles behind how a Dict (i.e.
hash table) and a Set work Furthermore, I would also recommend familiarizing yourself with:\n Heaps Deques Tuples This is not a tutorial on benchmarking your code To write fast code in practice, it is necessary to profile your code to find bottlenecks where your machine spends the majority of the time. One must benchmark different functions and approaches to find the fastest in practice. Julia (and other languages) have tools for exactly this purpose, but I will not cover them here.\nContent Minimize disk writes CPU cache Alignment Inspect generated assembly Minimize allocations Exploit SIMD vectorization Struct of arrays Use specialized CPU instructions Inline small functions Unroll tight loops Avoid unpredictable branches Multithreading GPUs Before you begin: Install packages # If you don't already have these packages installed, uncomment these lines and run them: # using Pkg # Pkg.add(\u0026quot;BenchmarkTools\u0026quot;) # Pkg.add(\u0026quot;StaticArrays\u0026quot;) using StaticArrays using BenchmarkTools \u0026quot;Print median elapsed time of benchmark\u0026quot; function print_median(trial) println(\u0026quot;Median time: \u0026quot;, BenchmarkTools.prettytime(median(trial).time)) end; The basic structure of computer hardware For now, we will work with a simplified mental model of a computer. Through this document, I will add more details to our model as they become relevant.\n[CPU] ↔ [RAM] ↔ [DISK] \nIn this simple diagram, the arrows represent data flow in either direction. The diagram shows three important parts of a computer:\n The central processing unit (CPU) is a chip the size of a stamp. This is where all the computation actually occurs, the brain of the computer. Random access memory (RAM, or just \u0026ldquo;memory\u0026rdquo;) is the short-term memory of the computer. This memory requires electrical power to maintain, and is lost when the computer is shut down. RAM serves as a temporary storage of data between the disk and the CPU. Much of the time spent \u0026ldquo;loading\u0026rdquo; various applications and operating systems is actually spent moving data from disk to RAM and unpacking it there. A typical consumer laptop has around 10^11 bits of RAM memory. The disk is a mass storage unit. The data on disk persists after power is shut down, so the disk contains the long-term memory of the computer. It is also much cheaper per gigabyte than RAM, with consumer PCs having around 10^13 bits of disk space. Avoid writing to disk where possible Even with a fast mass storage unit such as a solid state drive (SSD) or even the newer Optane technology, disks are many times, usually thousands of times, slower than RAM. In particular, seeks, i.e. switching to a new point of the disk to read from or write to, are slow. As a consequence, writing a large chunk of data to disk is much faster than writing many small chunks.\nTo write fast code, you must therefore make sure to have your working data in RAM, and limit disk writes as much as possible.\nThe following example serves to illustrate the difference in speed: The first function opens a file, accesses one byte from the file, and closes it again.
The second function randomly accesses 1,000,000 integers from RAM.\n# Open a file function test_file(path) open(path) do file # Go to 1000'th byte of file and read it seek(file, 1000) read(file, UInt8) end end @time test_file(\u0026quot;test_file\u0026quot;) # Randomly access data N times function random_access(data::Vector{UInt}, N::Integer) n = rand(UInt) mask = length(data) - 1 @inbounds for i in 1:N n = (n \u0026gt;\u0026gt;\u0026gt; 7) ⊻ data[n \u0026amp; mask + 1] end return n end data = rand(UInt, 2^24) @time random_access(data, 1000000); 0.001321 seconds (15 allocations: 976 bytes) 0.130833 seconds Benchmarking this is a little tricky, because the first invocation will include the compilation times of both functions. And in the second invocation, your operating system will have stored a copy of the file (or cached the file) in RAM, making the file seek almost instant. To time it properly, run it once, then change the file, and run it again. So in fact, we should update our computer diagram:\n[CPU] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK] \nOn my computer, finding a single byte in a file (including opening and closing the file) takes about 13 milliseconds, and accessing 1,000,000 integers from memory takes 131 milliseconds. So RAM is on the order of 10,000 times faster than disk.\nWhen working with data too large to fit into RAM, load in the data chunk by chunk, e.g. one line at a time, and operate on that. That way, you don't need random access to your file - only sequential access - and thus don't waste time on extra seeks. And you must strive to write your program such that any input files are only read through once, not multiple times.\nIf you need to read a file byte by byte, for example when parsing a file, great speed improvements can be found by buffering the file. When buffering, you read in larger chunks, the buffer, to memory, and when you want to read from the file, you check if it's in the buffer. If not, read another large chunk into your buffer from the file. This approach minimizes disk reads. Both your operating system and your programming language will make use of caches; however, sometimes it is necessary to manually buffer your files.\nCPU cache The RAM is faster than the disk, and the CPU in turn is faster than RAM. A CPU ticks like a clock, with a speed of about 3 GHz, i.e. 3 billion ticks per second. One \u0026ldquo;tick\u0026rdquo; of this clock is called a clock cycle. While this is not really true, you may imagine that every cycle, the CPU executes a single, simple command called a CPU instruction which does one operation on a small piece of data. The clock speed then can serve as a reference for other timings in a computer. It is worth realizing that in a single clock cycle, a photon will travel only around 10 cm, and this puts a barrier to how fast memory (which is placed some distance away from the CPU) can operate. In fact, modern computers are so fast that a significant bottleneck in their speed is the delay caused by the time needed for electricity to move through the wires inside the computer.\nOn this scale, reading from RAM takes around 100 clock cycles. Similarly to how the slowness of disks can be mitigated by copying data to the faster RAM, data from RAM is copied to a smaller memory chip physically on the CPU, called a cache. The cache is faster because it is physically on the CPU chip (reducing wire delays), and because it uses a faster type of RAM, static RAM, instead of the slower (but cheaper to manufacture) dynamic RAM.
Because it must be placed on the CPU, limiting its size, and because it is more expensive to produce, a typical CPU cache only contains around 10^8 bits, around 1000 times less than RAM. There are actually multiple layers of CPU cache, but here we simplify it and just refer to \u0026ldquo;the cache\u0026rdquo; as one single thing:\n[CPU] ↔ [CPU CACHE] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK] \nWhen the CPU requests a piece of data from the RAM, say a single byte, it will first check if the memory is already in cache. If so, it will read it from there. This is much faster, usually just a few clock cycles, than access to RAM. If not, we have a cache miss, and your program will stall for tens of nanoseconds while your computer copies data from RAM into the cache.\nIt is not possible, except in very low-level languages, to manually manage the CPU cache. Instead, you must make sure to use your cache effectively.\nEffective use of the cache comes down to locality, temporal and spatial locality:\n By temporal locality, I mean that data you recently accessed likely resides in cache already. Therefore, if you must access a piece of memory multiple times, make sure you do it close together in time. By spatial locality, I mean that you should access data from memory addresses close to each other. Your CPU does not copy just the requested bytes to cache. Instead, your CPU will always copy a larger chunk of data (usually 512 consecutive bits). From this information, one can deduce a number of simple tricks to improve performance:\n Use as little memory as possible. When your data takes up less memory, it is more likely that your data will be in cache. Remember, a CPU can do approximately 100 small operations in the time wasted by a single cache miss.\n When reading data from RAM, read it sequentially, such that you mostly have the next data you will be using in cache, instead of in a random order. In fact, modern CPUs will detect if you are reading in data sequentially, and prefetch upcoming data, that is, fetching the next chunk while the current chunk is being processed, reducing delays caused by cache misses.\n To illustrate this, let's compare the time spent on the random_access function above with a new function linear_access. This function performs the same computation as random_access, but uses i instead of n to access the array, so it accesses the data in a linear fashion. Hence, the only difference is the memory access pattern.\nNotice the large discrepancy in time spent.\nfunction linear_access(data::Vector{UInt}, N::Integer) n = rand(UInt) mask = length(data) - 1 for i in 1:N n = (n \u0026gt;\u0026gt;\u0026gt; 7) ⊻ data[i \u0026amp; mask + 1] end return n end print_median(@benchmark random_access(data, 4096)) print_median(@benchmark linear_access(data, 4096)) Median time: 1.997 μs Median time: 435.631 μs This also has implications for your data structures. Hash tables such as Dicts and Sets are inherently cache inefficient and almost always cause cache misses, whereas arrays don't.\nMany of the optimizations in this document indirectly impact cache use, so this is important to have in mind.\nMemory alignment As just mentioned, your CPU will move 512 consecutive bits (64 bytes) at a time between main RAM and cache. These 64 bytes are called a cache line. Your entire main memory is segmented into cache lines. For example, memory addresses 0 to 63 are one cache line, addresses 64 to 127 are the next, 128 to 191 the next, et cetera.
Your CPU may only request one of these cache lines from memory, and not e.g. the 64 bytes from address 30 to 93.\nThis means that some data structures can straddle the boundaries between cache lines. If I request a 64-bit (8 byte) integer at adress 60, the CPU must first generate two memory requests from the single requested memory address (namely to get cache lines 0-63 and 64-127), and then retrieve the integer from both cache lines, wasting time.\nThe time wasted can be significant. In a situation where in-cache memory access proves the bottleneck, the slowdown can approach 2x. In the following example, I use a pointer to repeatedly access an array at a given offset from a cache line boundary. If the offset is in the range 0:56, the integers all fit within one single cache line, and the function is fast. If the offset is in 57:63 all integers will straddle cache lines.\nfunction alignment_test(data::Vector{UInt}, offset::Integer) # Jump randomly around the memory. n = rand(UInt) mask = (length(data) - 9) ⊻ 7 GC.@preserve data begin # protect the array from moving in memory ptr = pointer(data) iszero(UInt(ptr) \u0026amp; 63) || error(\u0026quot;Array not aligned\u0026quot;) ptr += (offset \u0026amp; 63) for i in 1:4096 n = (n \u0026gt;\u0026gt;\u0026gt; 7) ⊻ unsafe_load(ptr, (n \u0026amp; mask + 1) % Int) end end return n end data = rand(UInt, 256 + 8); print_median(@benchmark alignment_test(data, 0)) print_median(@benchmark alignment_test(data, 60)) Median time: 6.561 μs Median time: 12.978 μs In the example above, the short vector easily fit into cache. If we increase the vector size, we will force cache misses. Note that the effect of misalignment is dwarfed by the time wasted on cache misses:\ndata = rand(UInt, 1 \u0026lt;\u0026lt; 24 + 8) print_median(@benchmark alignment_test(data, 10)) print_median(@benchmark alignment_test(data, 60)) Median time: 423.401 μs Median time: 497.868 μs Fortunately, the compiler does a few tricks to make it less likely that you will access misaligned data. First, Julia (and other compiled languages) always places new objects in memory at the boundaries of cache lines. When an object is placed right at the boundary, we say that it is aligned. Julia also aligns the beginning of larger arrays:\nmemory_address = reinterpret(UInt, pointer(data)) @assert iszero(memory_address % 64) Note that if the beginning of an array is aligned, then it's not possible for 1-, 2-, 4-, or 8-byte objects to straddle cache line boundaries, and everything will be aligned.\nIt would still be possible for an e.g. 7-byte object to be misaligned in an array. In an array of 7-byte objects, the 10th object would be placed at byte offset 7 * (10-1) = 63, and the object would straddle the cache line. However, the compiler usually does not allow struct with a nonstandard size for this reason. If we define a 7-byte struct:\nstruct AlignmentTest a::UInt32 # 4 bytes + b::UInt16 # 2 bytes + c::UInt8 # 1 byte = 7 bytes? 
end Then we can use Julia's introspection to get the relative position of each of the three integers in an AlignmentTest object in memory:\nfunction get_mem_layout(T) for fieldno in 1:fieldcount(T) println(\u0026quot;Name: \u0026quot;, fieldname(T, fieldno), \u0026quot;\\t\u0026quot;, \u0026quot;Size: \u0026quot;, sizeof(fieldtype(T, fieldno)), \u0026quot; bytes\\t\u0026quot;, \u0026quot;Offset: \u0026quot;, fieldoffset(T, fieldno), \u0026quot; bytes.\u0026quot;) end end println(\u0026quot;Size of AlignmentTest: \u0026quot;, sizeof(AlignmentTest), \u0026quot; bytes.\u0026quot;) get_mem_layout(AlignmentTest) Size of AlignmentTest: 8 bytes. Name: a\tSize: 4 bytes\tOffset: 0 bytes. Name: b\tSize: 2 bytes\tOffset: 4 bytes. Name: c\tSize: 1 bytes\tOffset: 6 bytes. We can see that, despite an AlignmentTest only having 4 + 2 + 1 = 7 bytes of actual data, it takes up 8 bytes of memory, and accessing an AlignmentTest object from an array will always be aligned.\nAs a coder, there are only a few situations where you can face alignment issues. I can come up with two:\n If you manually create object with a strange size, e.g. by accessing a dense integer array with pointers. This can save memory, but will waste time. My implementation of a Cuckoo filter does this to save space. During matrix operations. If you have a matrix the columns are sometimes unaligned because it is stored densely in memory. E.g. in a 15x15 matrix of Float32s, only the first column is aligned, all the others are not. This can have serious effects when doing matrix operations: I've seen benchmarks where an 80x80 matrix/vector multiplication is 2x faster than a 79x79 one due to alignment issues. Assembly code To run, any program must be translated, or compiled to CPU instructions. The CPU instructions are what is actually running on your computer, as opposed to the code written in your programming language, which is merely a description of the program. CPU instructions are usually presented to human beings in assembly. Assembly is a programming language which has a one-to-one correspondance with CPU instructions.\nViewing assembly code will be useful to understand some of the following sections which pertain to CPU instructions.\nIn Julia, we can easily inspect the compiled assembly code using the code_native function or the equivalent @code_native macro. We can do this for a simple function:\n# View assembly code generated from this function call function foo(x) s = zero(eltype(x)) @inbounds for i in eachindex(x) s = x[i ⊻ s] end return s end # Actually running the function will immediately crash Julia, so don't. 
@code_native foo(data) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[43]:4 within `foo' ; │┌ @ abstractarray.jl:212 within `eachindex' ; ││┌ @ abstractarray.jl:95 within `axes1' ; │││┌ @ abstractarray.jl:75 within `axes' ; ││││┌ @ In[43]:3 within `size' movq\t24(%rdi), %rax ; ││││└ ; ││││┌ @ tuple.jl:157 within `map' ; │││││┌ @ range.jl:320 within `OneTo' @ range.jl:311 ; ││││││┌ @ promotion.jl:409 within `max' testq\t%rax, %rax ; │└└└└└└ jle\tL75 movq\t%rax, %rcx sarq\t$63, %rcx andnq\t%rax, %rcx, %rcx movq\t(%rdi), %rdx ; │ @ In[43]:5 within `foo' ; │┌ @ int.jl:860 within `xor' @ int.jl:301 negq\t%rcx movl\t$1, %esi xorl\t%eax, %eax nopw\t%cs:(%rax,%rax) nopl\t(%rax) L48: xorq\t%rsi, %rax ; │└ ; │┌ @ multidimensional.jl:543 within `getindex' @ array.jl:788 movq\t-8(%rdx,%rax,8), %rax ; │└ ; │┌ @ range.jl:597 within `iterate' ; ││┌ @ promotion.jl:398 within `==' leaq\t(%rcx,%rsi), %rdi addq\t$1, %rdi ; ││└ addq\t$1, %rsi ; ││┌ @ promotion.jl:398 within `==' cmpq\t$1, %rdi ; │└└ jne\tL48 ; │ @ In[43]:7 within `foo' retq L75: xorl\t%eax, %eax ; │ @ In[43]:7 within `foo' retq nop ; └ Let's break it down:\nThe lines beginning with ; are comments, and explain which section of the code the following instructions come from. They show the nested series of function calls, and where in the source code they are. You can see that eachindex, calls axes1, which calls axes, which calls size. Under the comment line containing the size call, we see the first CPU instruction. The instruction name is on the far left, movq. The name is composed of two parts, mov, the kind of instruction (to move content to or from a register), and a suffix q, short for \u0026ldquo;quad\u0026rdquo;, which means 64-bit integer. There are the following suffixes: b (byte, 8 bit), w (word, 16 bit), l, (long, 32 bit) and q (quad, 64 bit).\nThe next two columns in the instruction, 24(%rdi) and %rax are the arguments to movq. These are the names of the registers (we will return to registers later) where the data to operate on are stored.\nYou can also see (in the larger display of assembly code) that the code is segmented into sections beginning with a name starting with \u0026ldquo;L\u0026rdquo;, for example there's a section L48. These sections are jumped between using if-statements, or branches. Here, section L48 marks the actual loop. You can see the following two instructions in the L48 section:\n; ││┌ @ promotion.jl:401 within `==' cmpq $1, %rdi ; │└└ jne L48 The first instruction cmpq (compare quad) compares the data in registry rdi, which hold the data for the number of iterations left (plus one), with the number 1, and sets certain flags (wires) in the CPU based on the result. The next instruction jne (jump if not equal) makes a jump if the \u0026ldquo;equal\u0026rdquo; flag is not set in the CPU, which happens if there is one or more iterations left. You can see it jumps to L48, meaning this section repeat.\nFast instruction, slow instruction Not all CPU instructions are equally fast. Below is a table of selected CPU instructions with very rough estimates of how many clock cycles they take to execute. You can find much more detailed tables in this document. Here, I'll summarize the speed of instructions on modern Intel CPUs. It's very similar for all modern CPUs.\nCPUs instructions typically take multiple CPU cycles to complete. 
However, if an instruction uses different part of the CPU during its execution, the CPU can usually start a new instruction before the old one is finished: If some operation X takes, say 4 clock cycles, they may queue one or even two operations per clock cycle using a feature called instruction pipelining. Hence, instruction X has a latency of 4 cycles, meaning it takes 4 cycles for the instruction to complete. But if the CPU can queue a new instruction every single cycle, it can have a reciprocal throughput of 1 clock cycle, meaning on average, it only takes 1 cycle per operation.\nThe following table measures time in clock cycles:\n Instruction Latency Rec. throughp. move data 1 0.25 and/or/xor 1 0.25 test/compare 1 0.25 do nothing 1 0.25 int add/subtract 1 0.25 bitshift 1 0.5 float multiplication 5 0.5 vector int and/or/xor 1 0.5 vector int add/sub 1 0.5 vector float add/sub 4 0.5 vector float multiplic. 5 0.5 lea 3 1 int multiplic 3 1 float add/sub 3 1 float multiplic. 5 1 float division 15 5 vector float division 13 8 integer division 50 40 The lea instruction takes three inputs, A, B and C, where A must be 2, 4, or 8, and calculates AB + C. We'll come back to what the \u0026ldquo;vector\u0026rdquo; instructions do later.\nFor comparison, we may also add some very rough estimates of other sources of delays:\n Delay Cycles move memory from cache 1 misaligned memory read 10 cache miss 500 read from disk 5000000 If you have an inner loop executing millions of times, it may pay off to inspect the generated assembly code for the loop and check if you can express the computation in terms of fast CPU instructions. For example, if you have an integer you know to be 0 or above, and you want to divide it by 8 (discarding any remainder), you can instead do a bitshift, since bitshifts are way faster than integer division:\ndivide_slow(x) = div(x, 8) divide_fast(x) = x \u0026gt;\u0026gt;\u0026gt; 3; However, modern compilers are pretty clever, and will often figure out the optimal instructions to use in your functions to obtain the same result, by for example replacing an integer divide idivq instruction with a bitshift right (shrq) where applicable to be faster. You need to check the assembly code yourself to see:\n# Calling it with debuginfo=:none removes the comments in the assembly code code_native(divide_slow, (UInt,), debuginfo=:none) .section\t__TEXT,__text,regular,pure_instructions movq\t%rdi, %rax shrq\t$3, %rax retq nopl\t(%rax,%rax) Allocations and immutability As already mentioned, main RAM is much slower than the CPU cache. However, working in main RAM comes with an additional disadvantage: Your operating system (OS) keeps track of which process have access to which memory. If every process had access to all memory, then it would be trivially easy to make a program that scans your RAM for secret data such as bank passwords - or for one program to accidentally overwrite the memory of another program. Instead, every process is allocated a bunch of memory by the OS, and is only allowed to read or write to the allocated data.\nThe creation of new objects in RAM is termed allocation, and the destruction is called deallocation. Really, the (de)allocation is not really creation or destruction per se, but rather the act of starting and stopping keeping track of the memory. Memory that is not kept track of will eventually be overwritten by other data. 
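Julia's built-in @allocated macro gives a quick way to see heap allocation happening: it returns the number of bytes allocated while evaluating an expression. A small illustration (the exact byte counts will vary between Julia versions and systems):
    @allocated zeros(1024)  # roughly 8 KiB: a new heap-allocated array of 1024 Float64s
    @allocated 1 + 1        # 0: plain integer arithmetic allocates nothing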
Allocation and deallocation take a significant amount of time depending on the size of objects, from a few tens to hundreds of nanoseconds per allocation.\nIn programming languages such as Julia, Python, R and Java, deallocation is automatically done using a program called the garbage collector (GC). This program keeps track of which objects are rendered unreachable by the programmer, and deallocates them. For example, if you do:\nthing = [1,2,3] thing = nothing Then there is no way to get the original array [1,2,3] back, it is unreachable. Hence it is simply wasting RAM, and doing nothing. It is garbage. Allocating and deallocating objects sometimes cause the GC to start its scan of all objects in memory and deallocate the unreachable ones, which causes significant lag. You can also start the garbage collector manually:\nGC.gc() The following example illustrates the difference in time spent in a function that allocates a vector with the new result relative to one which simply modifies the vector, allocating nothing:\nfunction increment(x::Vector{\u0026lt;:Integer}) y = similar(x) @inbounds for i in eachindex(x) y[i] = x[i] + 1 end return y end function increment!(x::Vector{\u0026lt;:Integer}) @inbounds for i in eachindex(x) x[i] = x[i] + 1 end return x end data = rand(UInt, 2^10); @btime increment(data); @btime increment!(data); 419.655 ns (1 allocation: 8.13 KiB) 70.350 ns (0 allocations: 0 bytes) On my computer, the allocating function is about 5x slower. This is due to a few properties of the code:\n First, the allocation itself takes time Second, the allocated objects eventually have to be deallocated, also taking time Third, repeated allocations triggers the GC to run, causing overhead Fourth, more allocations sometimes means less efficient cache use because you are using more memory For these reasons, performant code should keep allocations to a minimum. Note that the @btime macro prints the number and size of the allocations. This information is given because it is assumed that any programmer who cares to benchmark their code will be interested in reducing allocations.\nNot all objects need to be allocated Inside RAM, data is kept on either the stack or the heap. The stack is a simple data structure with a beginning and end, similar to a Vector in Julia. The stack can only be modified by adding or subtracting elements from the end, analogous to a Vector with only the two mutating operations push! and pop!. These operations on the stack are very fast. When we talk about \u0026ldquo;allocations\u0026rdquo;, however, we talk about data on the heap. Unlike the stack, the heap has an unlimited size (well, it has the size of your computer's RAM), and can be modified arbitrarily, deleting any objects.\nIntuitively, it may seem obvious that all objects need to be placed in RAM, must be able to be retrieved and deleted at any time by the program, and therefore need to be allocated on the heap. And for some languages, like Python, this is true. However, this is not true in Julia and other efficient, compiled languages. Integers, for example, can often be placed on the stack.\nWhy do some objects need to be heap allocated, while others can be stack allocated? To be stack-allocated, the compiler needs to know for certain that:\n The object is a reasonably small size, so it fits on the stack. This is needed for technical reasons for the stack to operate. 
The compiler can predict exactly when it needs to add and destroy the object so it can destroy it by simply popping the stack (similar to calling pop! on a Vector). This is usually the case for local variables in compiled languages. Julia has even more constrains on stack-allocated objects.\n The object should have a fixed size known at compile time. The compiler must know that object never changes. The CPU is free to copy stack-allocated objects, and for immutable objects, there is no way to distinguish a copy from the original. This bears repeating: With immutable objects, there is no way to distinguish a copy from the original. This gives the compiler and the CPU certain freedoms when operating on it. In Julia, we have a concept of a bitstype, which is an object that recursively contain no heap-allocated objects. Heap allocated objects are objects of types String, Array, Ref and Symbol, mutable objects, or objects containing any of the previous. Bitstypes are more performant exactly because they are immutable, fixed in size and can be stack allocated. The latter point is also why objects are immutable by default in Julia, and leads to one other performance tip: Use immutable objects whereever possible.\nWhat does this mean in practise? In Julia, it means if you want fast stack-allocated objects:\n You object must be created, used and destroyed in a fully compiled function so the compiler knows for certain when it needs to create, use and destroy the object. If the object is returned for later use (and not immediately returned to another, fully compiled function), we say that the object escapes, and must be allocated. Your object's type must be a bitstype. Your type must be limited in size. I don't know exactly how large it has to be, but 100 bytes is fine. The exact memory layout of your type must be known by the compiler. abstract type AllocatedInteger end struct StackAllocated \u0026lt;: AllocatedInteger x::Int end mutable struct HeapAllocated \u0026lt;: AllocatedInteger x::Int end We can inspect the code needed to instantiate a HeapAllocated object with the code needed to instantiate a StackAllocated one:\n@code_native HeapAllocated(1) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[18]:8 within `HeapAllocated' pushq\t%rbx movq\t%rsi, %rbx movabsq\t$jl_get_ptls_states_fast, %rax callq\t*%rax movabsq\t$jl_gc_pool_alloc, %rcx movl\t$1400, %esi ## imm = 0x578 movl\t$16, %edx movq\t%rax, %rdi callq\t*%rcx movabsq\t$4609615504, %rcx ## imm = 0x112C12690 movq\t%rcx, -8(%rax) movq\t%rbx, (%rax) popq\t%rbx retq nopl\t(%rax) ; └ Notice the callq instructions in the HeapAllocated one. This instruction calls out to other functions, meaning that in fact, much more code is really needed to create a HeapAllocated object that what is displayed. In constrast, the StackAllocated really only needs a few instructions:\n@code_native StackAllocated(1) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[18]:4 within `StackAllocated' movq\t%rsi, %rax retq nopw\t%cs:(%rax,%rax) ; └ Because bitstypes dont need to be stored on the heap and can be copied freely, bitstypes are stored inline in arrays. This means that bitstype objects can be stored directly inside the array's memory. Non-bitstypes have a unique identity and location on the heap. They are distinguishable from copies, so cannot be freely copied, and so arrays contain reference to the memory location on the heap where they are stored. 
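You can ask Julia whether a type is a bitstype with the built-in isbitstype function; for the two types defined above:
    isbitstype(StackAllocated)  # true: immutable and contains only plain data
    isbitstype(HeapAllocated)   # false: mutable, so instances live on the heap
This check is a quick way to predict whether instances of your own types will be stored inline in arrays, or as references to heap-allocated objects.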
Accessing such an object from an array then means first accessing the array to get the memory location, and then accessing the object itself using that memory location. Beside the double memory access, objects are stored less efficiently on the heap, meaning that more memory needs to be copied to CPU caches, meaning more cache misses. Hence, even when stored on the heap in an array, bitstypes can be stored more effectively.\nBase.:+(x::Int, y::AllocatedInteger) = x + y.x Base.:+(x::AllocatedInteger, y::AllocatedInteger) = x.x + y.x data_stack = [StackAllocated(i) for i in rand(UInt16, 1000000)] data_heap = [HeapAllocated(i.x) for i in data_stack] @btime sum(data_stack) @btime sum(data_heap); 281.676 μs (1 allocation: 16 bytes) 1.015 ms (1 allocation: 16 bytes) We can verify that, indeed, the array in the data_stack stores the actual data of a StackAllocated object, whereas the data_heap contains pointers (i.e. memory addresses):\nprintln(\u0026quot;First object of data_stack: \u0026quot;, data_stack[1]) println(\u0026quot;First data in data_stack array: \u0026quot;, unsafe_load(pointer(data_stack)), '\\n') println(\u0026quot;First object of data_heap: \u0026quot;, data_heap[1]) first_data = unsafe_load(Ptr{UInt}(pointer(data_heap))) println(\u0026quot;First data in data_heap array: \u0026quot;, repr(first_data)) println(\u0026quot;Data at address \u0026quot;, repr(first_data), \u0026quot;: \u0026quot;, unsafe_load(Ptr{HeapAllocated}(first_data))) First object of data_stack: StackAllocated(3693) First data in data_stack array: StackAllocated(3693) First object of data_heap: HeapAllocated(3693) First data in data_heap array: 0x0000000110cb16a0 Data at address 0x0000000110cb16a0: HeapAllocated(3693) Registers and SIMD It is time yet again to update our simplified computer schematic. A CPU operates only on data present in registers. These are small, fixed size slots (e.g. 8 bytes in size) inside the CPU itself. A register is meant to hold one single piece of data, like an integer or a floating point number. As hinted in the section on assembly code, each instruction usually refers to one or two registers which contain the data the operation works on:\n[CPU] ↔ [REGISTERS] ↔ [CPU CACHE] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK] \nTo operate on data structures larger than one register, the data must be broken up into smaller pieces that fits inside the register. For example, when adding two 128-bit integers on my computer:\n@code_native UInt128(5) + UInt128(11) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ int.jl:53 within `+' addq\t%rcx, %rsi adcxq\t%rdx, %r8 movq\t%rsi, (%rdi) movq\t%r8, 8(%rdi) movq\t%rdi, %rax retq nopw\t%cs:(%rax,%rax) ; └ There is no register that can do 128-bit additions. First the lower 64 bits must be added using a addq instruction, fitting in a register. Then the upper bits are added with a adcxq instruction, which adds the digits, but also uses the carry bit from the previous instruction. Finally, the results are moved 64 bits at a time using movq instructions.\nThe small size of the registers serves as a bottleneck for CPU throughput: It can only operate on one integer/float at a time. In order to sidestep this, modern CPUs contain specialized 256-bit registers (or 128-bit in older CPUs, or 512-bit in the brand new ones) than can hold 4 64-bit integers/floats at once, or 8 32-bit integers, etc. Confusingly, the data in such wide registers are termed \u0026ldquo;vectors\u0026rdquo;. 
The CPU have access to instructions that can perform various CPU operations on vectors, operating on 4 64-bit integers in one instruction. This is called \u0026ldquo;single instruction, multiple data\u0026rdquo;, SIMD, or vectorization. Notably, a 4x64 bit operation is not the same as a 256-bit operation, e.g. there is no carry-over with between the 4 64-bit integers when you add two vectors. Instead, a 256-bit vector operation is equivalent to 4 individual 64-bit operations.\nWe can illustrate this with the following example:\n# Create a single statically-sized vector of 8 32-bit integers # I could also have created 4 64-bit ones, etc. a = @SVector Int32[1,2,3,4,5,6,7,8] # Don't add comments to output code_native(+, (typeof(a), typeof(a)), debuginfo=:none) .section\t__TEXT,__text,regular,pure_instructions movq\t%rdi, %rax vmovdqu\t(%rdx), %ymm0 vpaddd\t(%rsi), %ymm0, %ymm0 vmovdqu\t%ymm0, (%rdi) vzeroupper retq nopw\t%cs:(%rax,%rax) nopl\t(%rax) Here, two 8*32 bit vectors are added together in one single instruction. You can see the CPU makes use of a single vpaddd (vector packed add double) instruction to add 8 32-bit integers, as well as the corresponding move instruction vmovdqu. Note that vector CPU instructions begin with v.\nIt's worth mentioning the interaction between SIMD and alignment: If a series of 256-bit (32-byte) SIMD loads are misaligned, then up to half the loads could cross cache line boundaries, as opposed to just 1/8th of 8-byte loads. Thus, alignment is a much more serious issue when using SIMD. Since array beginnings are always aligned, this is usually not an issue, but in cases where you are not guaranteed to start from an aligned starting point, such as with matrix operations, this may make a significant difference. In brand new CPUs with 512-bit registers, the issues is even worse as the SIMD size is the same as the cache line size, so all loads would be misaligned if the initial load is.\nSIMD vectorization of e.g. 64-bit integers may increase throughput by almost 4x, so it is of huge importance in high-performance programming. Compilers will automatically vectorize operations if they can. What can prevent this automatic vectorization?\nSIMD needs uninterrupted iteration of fixed length Because vectorized operations operates on multiple data at once, it is not possible to interrupt the loop at an arbitrary point. For example, if 4 64-bit integers are processed in one clock cycle, it is not possible to stop a SIMD loop after 3 integers have been processed. Suppose you had a loop like this:\nfor i in 1:8 if foo() break end # do stuff with my_vector[i] end Here, the loop could end on any iteration due to the break statement. Therefore, any SIMD instruction which loaded in multiple integers could operate on data after the loop is supposed to break, i.e. data which is never supposed to be read. This would be wrong behaviour, and so, the compiler cannot use SIMD instructions.\nA good rule of thumb is that simd needs:\n A loop with a predetermined length, so it knows when to stop, and A loop with no branches (i.e. if-statements) in the loop In fact, even boundschecking, i.e. checking that you are not indexing outside the bounds of a vector, causes a branch. After all, if the code is supposed to raise a bounds error after 3 iterations, even a single SIMD operation would be wrong! To achieve SIMD vectorization then, all boundschecks must be disabled. 
We can use this do demonstrate the impact of SIMD:\nfunction sum_nosimd(x::Vector) n = zero(eltype(x)) for i in eachindex(x) n += x[i] end return n end function sum_simd(x::Vector) n = zero(eltype(x)) # By removing the boundscheck, we allow automatic SIMD @inbounds for i in eachindex(x) n += x[i] end return n end # Make sure the vector is small enough to fit in cache so we don't time cache misses data = rand(UInt64, 4096); @btime sum_nosimd(data) @btime sum_simd(data); 2.198 μs (1 allocation: 16 bytes) 200.598 ns (1 allocation: 16 bytes) On my computer, the SIMD code is 10x faster than the non-SIMD code. SIMD alone accounts for only about 4x improvements (since we moved from 64-bits per iteration to 256 bits per iteration). The rest of the gain comes from not spending time checking the bounds and from automatic loop unrolling (explained later), which is also made possible by the @inbounds annotation.\nSIMD needs a loop where loop order doesn't matter SIMD can change the order in which elements in an array is processed. If the result of any iteration depends on any previous iteration such that the elements can't be re-ordered, the compiler will usually not SIMD-vectorize. Often when a loop won't auto-vectorize, it's due to subtleties in which data moves around in registers means that there will be some hidden memory dependency between elements in an array.\nImagine we want to sum some 64-bit integers in an array using SIMD. For simplicity, let's say the array has 8 elements, A, B, C \u0026hellip; H. In an ordinary non-SIMD loop, the additions would be done like so:\n(((((((A + B) + C) + D) + E) + F) + G) + H)\nWhereas when loading the integers using SIMD, four 64-bit integers would be loaded into one vector \u0026lt;A, B, C, D\u0026gt;, and the other four into another \u0026lt;E, F, G, H\u0026gt;. The two vectors would be added: \u0026lt;A+E, B+F, C+G, D+H\u0026gt;. After the loop, the four integers in the resulting vector would be added. So other overall order would be:\n((((A + E) + (B + F)) + (C + G)) + (D + H))\nPerhaps surprisingly, addition of floating point numbers can give different results depending on the order (i.e. float addition is not associative):\nx = eps(1.0) * 0.4 1.0 + (x + x) == (1.0 + x) + x false for this reason, float addition will not auto-vectorize:\ndata = rand(Float64, 4096) @btime sum_nosimd(data) @btime sum_simd(data); 4.339 μs (1 allocation: 16 bytes) 4.339 μs (1 allocation: 16 bytes) However, high-performance programming languages usually provide a command to tell the compiler it's alright to re-order the loop, even for non-associative loops. In Julia, this command is the @simd macro:\nfunction sum_simd(x::Vector) n = zero(eltype(x)) # Here we add the `@simd` macro to allow SIMD of floats @inbounds @simd for i in eachindex(x) n += x[i] end return n end data = rand(Float64, 4096) @btime sum_nosimd(data) @btime sum_simd(data); 4.342 μs (1 allocation: 16 bytes) 297.430 ns (1 allocation: 16 bytes) Julia also provides the macro @simd ivdep which further tells the compiler that there are no memory-dependencies in the loop order. However, I strongly discourage the use of this macro, unless you really know what you're doing. 
In general, the compiler knows best when a loop has memory dependencies, and misuse of @simd ivdep can very easily lead to bugs that are hard to detect.\nStruct of arrays If we create an array containing four AlignmentTest objects A, B, C and D, the objects will lie end to end in the array, like this:\nObjects: | A | B | C | D | Fields: | a | b |c| | a | b |c| | a | b |c| | a | b |c| | Byte: 1 9 17 25 33 Note again that byte no. 8, 16, 24 and 32 are empty to preserve alignment, wasting memory. Now suppose you want to do an operation on all the .a fields of the structs. Because the .a fields are scattered 8 bytes apart, SIMD operations are much less efficient (loading up to 4 fields at a time) than if all the .a fields were stored together (where 8 fields could fit in a 256-bit register). When working with the .a fields only, the entire 64-byte cache lines would be read in, of which only half, or 32 bytes would be useful. Not only does this cause more cache misses, we also need instructions to pick out the half of the data from the SIMD registers we need.\nThe memory structure we have above is termed an \u0026ldquo;array of structs\u0026rdquo;, because, well, it is an array filled with structs. Instead we can strucure our 4 objects A to D as a \u0026ldquo;struct of arrays\u0026rdquo;. Conceptually, it could look like:\nstruct AlignmentTestVector a::Vector{UInt32} b::Vector{UInt16} c::Vector{UInt8} end With the following memory layout for each field:\nObject: AlignmentTestVector .a | A | B | C | D | .b | A | B | C | D | .c |A|B|C|D| Alignment is no longer a problem, no space is wasted on padding. When running through all the a fields, all cache lines contain full 64 bytes of relevant data, so SIMD operations do not need extra operations to pick out the relevant data:\nBase.rand(::Type{AlignmentTest}) = AlignmentTest(rand(UInt32), rand(UInt16), rand(UInt8)) N = 1_000_000 array_of_structs = [rand(AlignmentTest) for i in 1:N] struct_of_arrays = AlignmentTestVector(rand(UInt32, N), rand(UInt16, N), rand(UInt8, N)); @btime sum(x -\u0026gt; x.a, array_of_structs) @btime sum(struct_of_arrays.a); 444.546 μs (1 allocation: 16 bytes) 113.006 μs (1 allocation: 16 bytes) Specialized CPU instructions Most code makes use of only a score of CPU instructions like move, add, multiply, bitshift, and, or, xor, jumps, and so on. However, CPUs in the typical modern laptop support a lot of CPU instructions. Typically, if a certain operation is used heavily in consumer laptops, CPU manufacturers will add specialized instructions to speed up these operations. Depending on the hardware implementation of the instructions, the speed gain from using these instructions can be significant.\nJulia only exposes a few specialized instructions, including:\n The number of set bits in an integer is effectively counted with the popcnt instruction, exposed via the count_ones function. The tzcnt instructions counts the number of trailing zeros in the bits an integer, exposed via the trailing_zeros function The order of individual bytes in a multi-byte integer can be reversed using the bswap instruction, exposed via the bswap function. This can be useful when having to deal with endianness. 
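To see what these functions compute, a few quick calls (the literal values are arbitrary examples):
    count_ones(0b10110)       # 3: three bits are set
    trailing_zeros(0b101000)  # 3: the three lowest bits are zero
    bswap(0x11223344)         # 0x44332211: byte order reversed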
The following example illustrates the performance difference between a manual implementation of the count_ones function, and the built-in version, which uses the popcnt instruction:\nfunction manual_count_ones(x) n = 0 while x != 0 n += x \u0026amp; 1 x \u0026gt;\u0026gt;\u0026gt;= 1 end return n end data = rand(UInt, 10000) @btime sum(manual_count_ones, data) @btime sum(count_ones, data); 217.605 μs (1 allocation: 16 bytes) 2.059 μs (1 allocation: 16 bytes) The timings you observe here will depend on whether your compiler is clever enough to realize that the computation in the first function can be expressed as a popcnt instruction, and thus will be compiled to that. On my computer, the compiler is not able to make that inference, and the second function achieves the same result more than 100x faster.\nCall any CPU instruction Julia makes it possible to call CPU instructions direcly. This is not generally advised, since not all your users will have access to the same CPU with the same instructions.\nThe latest CPUs contain specialized instructions for AES encryption and SHA256 hashing. If you wish to call these instructions, you can call Julia's backend compiler, LLVM, directly. In the example below, I create a function which calls the vaesenc (one round of AES encryption) instruction directly:\n# This is a 128-bit CPU \u0026quot;vector\u0026quot; in Julia const __m128i = NTuple{2, VecElement{Int64}} # Define the function in terms of LLVM instructions aesenc(a, roundkey) = ccall(\u0026quot;llvm.x86.aesni.aesenc\u0026quot;, llvmcall, __m128i, (__m128i, __m128i), a, roundkey); We can verify it works by checking the assembly of the function, which should contain only a single vaesenc instruction, as well as the retq (return) and the nopw (do nothing, used as a filler to align the CPU instructions in memory) instruction:\n@code_native aesenc(__m128i((213132, 13131)), __m128i((31231, 43213))) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[33]:5 within `aesenc' vaesenc\t%xmm1, %xmm0, %xmm0 retq nopw\t%cs:(%rax,%rax) ; └ Algorithms which makes use of specialized instructions can be extremely fast. In a blog post, the video game company Molly Rocket unveiled a new non-cryptographic hash function using AES instructions which reached unprecedented speeds.\nInlining Consider the assembly of this function:\n# Simply throw an error f() = error() @code_native f() .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[35]:2 within `f' pushq\t%rax movabsq\t$error, %rax callq\t*%rax nopl\t(%rax) ; └ This code contains the callq instruction, which calls another function. A function call comes with some overhead depending on the arguments of the function and other things. While the time spent on a function call is measured in microseconds, it can add up if the function called is in a tight loop.\nHowever, if we show the assembly of this function:\ncall_plus(x) = x + 1 code_native(call_plus, (Int,), debuginfo=:none) .section\t__TEXT,__text,regular,pure_instructions leaq\t1(%rdi), %rax retq nopw\t%cs:(%rax,%rax) The function call_plus calls +, and is compiled to a single leaq instruction (as well as some filler retq and nopw). But + is a normal Julia function, so call_plus is an example of one regular Julia function calling another. Why is there no callq instruction to call +?\nThe compiler has chosen to inline the function + into call_plus. That means that instead of calling +, it has copied the content of + directly into call_plus. 
The advantages of this is:\n There is no overhead from the function call There is no need to construct a Tuple to hold the arguments of the + function Whatever computations happens in + is compiled together with call_plus, allowing the compiler to use information from one in the other and possibly simplify some calculations. So why aren't all functions inlined then? Inlining copies code, increases the size of it and consuming RAM. Furthermore, the CPU instructions themselves also needs to fit into the CPU cache (although CPU instructions have their own cache) in order to be efficiently retrieved. If everything was inlined, programs would be enormous and grind to a halt. Inlining is only an improvement if the inlined function is small.\nInstead, the compiler uses heuristics (rules of thumb) to determine when a function is small enough for inlining to increase performance. These heuristics are not bulletproof, so Julia provides the macros @noinline, which prevents inlining of small functions (useful for e.g. functions that raises errors, which must be assumed to be called rarely), and @inline, which does not force the compiler to inline, but strongly suggests to the compiler that it ought to inline the function.\nIf code contains a time-sensitive section, for example an inner loop, it is important to look at the assembly code to verify that small functions in the loop is inlined. For example, in this line in my kmer hashing code, overall minhashing performance drops by a factor of two if this @inline annotation is removed.\nAn extreme difference between inlining and no inlining can be demonstrated thus:\n@noinline noninline_poly(x) = x^3 - 4x^2 + 9x - 11 inline_poly(x) = x^3 - 4x^2 + 9x - 11 function time_function(F, x::AbstractVector) n = 0 for i in x n += F(i) end return n end; @btime time_function(noninline_poly, data) @btime time_function(inline_poly, data); 13.380 μs (1 allocation: 16 bytes) 7.207 μs (1 allocation: 16 bytes) Unrolling Consider a function that sums a vector of 64-bit integers. If the vector's data's memory offset is stored in register %r9, the length of the vector is stored in register %r8, the current index of the vector in %rcx and the running total in %rax, the assembly of the inner loop could look like this:\nL1: ; add the integer at location %r9 + %rcx * 8 to %rax addq (%r9,%rcx,8), %rax ; increment index by 1 addq $1, %rcx ; compare index to length of vector cmpq %r8, %rcx ; repeat loop if index is smaller jb L1 For a total of 4 instructions per element of the vector. The actual code generated by Julia will be similar to this, but also incluce extra instructions related to bounds checking that are not relevant here (and of course will include different comments).\nHowever, if the function is written like this:\nfunction sum_vector(v::Vector{Int}) n = 0 i = 1 for chunk in 1:div(length(v), 4) n += v[i + 0] n += v[i + 1] n += v[i + 2] n += v[i + 3] i += 4 end return n end The result is obviously the same if we assume the length of the vector is divisible by four. If the length is not divisible by four, we could simply use the function above to sum the first N - rem(N, 4) elements and add the last few elements in another loop. Despite the functionally identical result, the assembly of the loop is different and may look something like:\nL1: addq (%r9,%rcx,8), %rax addq 8(%r9,%rcx,8), %rax addq 16(%r9,%rcx,8), %rax addq 24(%r9,%rcx,8), %rax addq $4, %rcx cmpq %r8, %rcx jb L1 For a total of 7 instructions per 4 additions, or 1.75 instructions per addition. 
This is less than half the number of instructions per integer! The speed gain comes from simply checking less often when we're at the end of the loop. We call this process unrolling the loop, here by a factor of four. Naturally, unrolling can only be done if we know the number of iterations beforehand, so we don't \u0026ldquo;overshoot\u0026rdquo; the number of iterations. Often, the compiler will unroll loops automatically for extra performance, but it can be worth looking at the assembly. For example, this is the assembly for the innermost loop generated on my computer for sum([1]):\nL144: vpaddq 16(%rcx,%rax,8), %ymm0, %ymm0 vpaddq 48(%rcx,%rax,8), %ymm1, %ymm1 vpaddq 80(%rcx,%rax,8), %ymm2, %ymm2 vpaddq 112(%rcx,%rax,8), %ymm3, %ymm3 addq $16, %rax cmpq %rax, %rdi jne L144 Where you can see it is both unrolled by a factor of four, and uses 256-bit SIMD instructions, for a total of 128 bytes, 16 integers added per iteration, or 0.44 instructions per integer.\nNotice also that the compiler chooses to use 4 different ymm SIMD registers, ymm0 to ymm3, whereas in my example assembly code, I just used one register rax. This is because, if you use 4 independent registers, then you don't need to wait for one vpaddq to complete (remember, it had a ~3 clock cycle latency) before the CPU can begin the next.\nBranch prediction As mentioned previously, CPU instructions take multiple cycles to complete, but may be queued into the CPU before the previous instruction has finished computing. So what happens when the CPU encounters a branch (i.e. a jump instruction)? It can't know which instruction to queue next, because that depends on the instruction that it just put into the queue and which has yet to be executed.\nModern CPUs make use of branch prediction. The CPU has a branch predictor circuit, which guesses the correct branch based on which branches were recently taken. In essense, the branch predictor attempts to learn simple patterns in which branches are taken in code, while the code is running. After queueing a branch, the CPU immediately queues instructions from whatever branch predicted by the branch predictor. The correctness of the guess is verified later, when the queued branch is being executed. If the guess was correct, great, the CPU saved time by guessing. If not, the CPU has to empty the pipeline and discard all computations since the initial guess, and then start over. This process causes a delay of a few nanoseconds.\nFor the programmer, this means that the speed of an if-statement depends on how easy it is to guess. If it is trivially easy to guess, the branch predictor will be correct almost all the time, and the if statement will take no longer than a simple instruction, typically 1 clock cycle. In a situation where the branching is random, it will be wrong about 50% of the time, and each misprediction may cost around 10 clock cycles.\nBranches caused by loops are among the easiest to guess. If you have a loop with 1000 elements, the code will loop back 999 times and break out of the loop just once. Hence the branch predictor can simply always predict \u0026ldquo;loop back\u0026rdquo;, and get a 99.9% accuracy.\nWe can demonstrate the performance of branch misprediction with a simple function:\n# Copy all odd numbers from src to dst. 
function copy_odds!(dst::Vector{UInt}, src::Vector{UInt}) write_index = 1 @inbounds for i in eachindex(src) # \u0026lt;--- this branch is trivially easy to predict v = src[i] if isodd(v) # \u0026lt;--- this is the branch we want to predict dst[write_index] = v write_index += 1 end end return dst end dst = rand(UInt, 5000) src_random = rand(UInt, 5000) src_all_odd = [2i+1 for i in src_random]; @btime copy_odds!(dst, src_random) @btime copy_odds!(dst, src_all_odd); 11.586 μs (0 allocations: 0 bytes) 1.702 μs (0 allocations: 0 bytes) In the first case, the integers are random, and about half the branches will be mispredicted causing delays. In the second case, the branch is always taken, the branch predictor is quickly able to pick up the pattern and will reach near 100% correct prediction. As a result, on my computer, the latter is around 6x faster.\nNote that if you use smaller vectors and repeat the computation many times, as the @btime macro does, the branch predictor is able to learn the pattern of the small random vectors by heart, and will reach much better than random prediction. This is especially pronounced in the most modern CPUs where the branch predictors have gotten much better. This \u0026ldquo;learning by heart\u0026rdquo; is an artifact of the loop in the benchmarking process. You would not expect to run the exact same computation repeatedly on real-life data:\nsrc_random = rand(UInt, 100) src_all_odd = [2i+1 for i in src_random]; @btime copy_odds!(dst, src_random) @btime copy_odds!(dst, src_all_odd); 62.548 ns (0 allocations: 0 bytes) 42.602 ns (0 allocations: 0 bytes) Because branches are very fast if they are predicted correctly, highly predictable branches caused by error checks are not of much performance concern, assuming that the code essensially never errors. Hence a branch like bounds checking is very fast. You should only remove bounds checks if absolutely maximal performance is critical, or if the bounds check happens in a loop which would otherwise SIMD-vectorize.\nIf branches cannot be easily predicted, it is often possible to re-phrase the function to avoid branches all together. For example, in the copy_odds! example above, we could instead write it like so:\nfunction copy_odds!(dst::Vector{UInt}, src::Vector{UInt}) write_index = 1 @inbounds for i in eachindex(src) v = src[i] dst[write_index] = v write_index += isodd(v) end return dst end dst = rand(UInt, 5000) src_random = rand(UInt, 5000) src_all_odd = [2i+1 for i in src_random]; @btime copy_odds!(dst, src_random) @btime copy_odds!(dst, src_all_odd); 2.295 μs (0 allocations: 0 bytes) 2.341 μs (0 allocations: 0 bytes) Which contains no other branches than the one caused by the loop itself (which is easily predictable), and results in speeds somewhat worse than the perfectly predicted one, but much better for random data.\nThe compiler will often remove branches in your code when the same computation can be done using other instructions. When the compiler fails to do so, Julia offers the ifelse function, which sometimes can help elide branching.\nVariable clock speed A modern laptop CPU optimized for low power consumption consumes roughly 25 watts of power on a chip as small as a stamp (and thinner than a human hair). Without proper cooling, this will cause the temperature of the CPU to skyrocket and melting the plastic of the chip, destroying it. Typically, CPUs have a maximal operating temperature of about 100 degrees C. 
Power consumption, and therefore heat generation, depends, among other factors, on clock speed: higher clock speeds generate more heat.\nModern CPUs are able to adjust their clock speeds according to the CPU temperature to prevent the chip from destroying itself. Often, CPU temperature will be the limiting factor in how fast a CPU is able to run. In these situations, better physical cooling for your computer translates directly to a faster CPU. Old computers can often be revitalized simply by removing dust from the interior, and replacing the cooling fans and CPU thermal paste!\nAs a programmer, there is not much you can do to take CPU temperature into account, but it is good to know. In particular, variations in CPU temperature often explain observed differences in performance:\n CPUs usually work fastest at the beginning of a workload, and then drop in performance as the chip reaches its maximal temperature SIMD instructions usually require more power than ordinary instructions, generating more heat, and lowering the clock frequency. This can offset some performance gains of SIMD, but SIMD will still always be more efficient when applicable Multithreading In the bad old days, CPU clock speed would increase every year as new processors were brought onto the market. Partially because of heat generation, this acceleration slowed down once CPUs hit the 3 GHz mark. Now we see only minor clock speed increments every processor generation. Instead of raw speed of execution, the focus has shifted to getting more computation done per clock cycle. CPU caches, CPU pipelining, branch prediction and SIMD instructions are all important advances in this area, and have all been covered here.\nAnother important area where CPUs have improved is simply in numbers: Almost all CPU chips contain multiple smaller CPUs, or cores, inside them. Each core has its own small CPU cache, and does computations in parallel. Furthermore, many CPUs have a feature called hyper-threading, where two threads (i.e. streams of instructions) are able to run on each core. The idea is that whenever one thread is stalled (e.g. because it experiences a cache miss or a misprediction), the other thread can continue on the same core. The CPU \u0026ldquo;pretends\u0026rdquo; to have twice the number of processors. For example, I am writing this on a laptop with an Intel Core i5-8259U CPU. This CPU has 4 cores, but various operating systems like Windows or Linux would show 8 \u0026ldquo;CPUs\u0026rdquo; in the system monitor program.\nHyperthreading only really matters when your threads are sometimes prevented from doing work. Besides CPU-internal causes like cache misses, a thread can also be paused because it is waiting for an external resource like a webserver or data from a disk. If you are writing a program where some threads spend a significant time idling, the core can be used by the other thread, and hyperthreading can show its value.\nLet's see our first parallel program in action. First, we need to make sure that Julia was actually started with the correct number of threads. You can set the environment variable JULIA_NUM_THREADS before starting Julia. 
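Note that the thread count is fixed when the Julia process starts; you cannot add threads from within a running session. As a sketch (assuming a bash-like shell; pick a number that matches your own machine):
    # Set the environment variable in the shell, then launch Julia, e.g.:
    #   JULIA_NUM_THREADS=8 julia
    # Inside the session, Threads.nthreads() reports how many threads you got.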
I have 4 cores on this CPU, both with hyperthreading so I have set the number of threads to 8:\nThreads.nthreads() 8 # Spend about half the time waiting, half time computing function half_asleep(start::Bool) a, b = 1, 0 for iteration in 1:5 start \u0026amp;\u0026amp; sleep(0.06) for i in 1:100000000 a, b = a + b, a end start || sleep(0.06) end return a end function parallel_sleep(n_jobs) jobs = [] for job in 1:n_jobs push!(jobs, Threads.@spawn half_asleep(isodd(job))) end return sum(fetch, jobs) end parallel_sleep(1); # run once to compile it for njobs in (1, 4, 8, 16) @time parallel_sleep(njobs); end 0.604390 seconds (44 allocations: 1.906 KiB) 0.694747 seconds (292 allocations: 15.500 KiB) 0.609797 seconds (425 allocations: 20.922 KiB) 1.157514 seconds (705 allocations: 33.109 KiB) You can see that with this task, my computer can run 8 jobs in parallel almost as fast as it can run 1. But 16 jobs takes much longer.\nFor CPU-constrained programs, the core is kept busy with only one thread, and there is not much to do as a programmer to leverage hyperthreading. Actually, for the most optimized programs, it usually leads to better performance to disable hyperthreading. Most workloads are not that optimized and can really benefit from hyperthreading, so we'll stick with 8 threads for now.\nParallelizability Multithreading is more difficult that any of the other optimizations, and should be one of the last tools a programmer reaches for. However, it is also an impactful optimization. Compute clusters usually contain CPUs with tens of CPU cores, offering a massive potential speed boost ripe for picking.\nA prerequisite for efficient use of multithreading is that your computation is able to be broken up into multiple chunks that can be worked on independently. Luckily the majority of compute-heavy tasks (at least in my field of work, bioinformatics), contain sub-problems that are embarassingly parallel. This means that there is a natural and easy way to break it into sub-problems that can be processed independently. For example, if a certain independent computation is required for 100 genes, it is natural to use one thread for each gene.\nLet's have an example of a small embarrasingly parallel problem. We want to construct a Julia set. Julia sets are named after Gaston Julia, and have nothing to do with the Julia language. Julia sets are (often) fractal sets of complex numbers. By mapping the real and complex component of the set's members to the X and Y pixel value of a screen, one can generate the LSD-trippy images associated with fractals.\nThe Julia set I create below is defined thus: We define a function f(z) = z^2 + C, where C is some constant. We then record the number of times f can be applied to any given complex number z before abs(z) \u0026gt; 2. The number of iterations correspond to the brightness of one pixel in the image. 
We simply repeat this for a range of real and imaginary values in a grid to create an image.\nFirst, let's see a non-parallel solution:\nconst SHIFT = Complex{Float32}(-0.221, -0.713) f(z::Complex) = z^2 + SHIFT \u0026quot;Set the brightness of a particular pixel represented by a complex number\u0026quot; function mandel(z) n = 0 while ((abs2(z) \u0026lt; 4) \u0026amp; (n \u0026lt; 255)) n += 1 z = f(z) end return n end \u0026quot;Set brightness of pixels in one column of pixels\u0026quot; function fill_column!(M::Matrix, x, real) for (y, im) in enumerate(range(-1.0f0, 1.0f0, length=size(M, 1))) M[y, x] = mandel(Complex{Float32}(real, im)) end end \u0026quot;Create a Julia fractal image\u0026quot; function julia() M = Matrix{UInt8}(undef, 10000, 5000) for (x, real) in enumerate(range(-1.0f0, 1.0f0, length=size(M, 2))) fill_column!(M, x, real) end return M end; @time M = julia(); 4.924423 seconds (2 allocations: 47.684 MiB, 0.39% gc time) That took around 5 seconds on my computer. Now for a parallel one:\nfunction recursive_fill_columns!(M::Matrix, cols::UnitRange) F, L = first(cols), last(cols) # If only one column, fill it using fill_column! if F == L r = range(-1.0f0,1.0f0,length=size(M, 1))[F] fill_column!(M, F, r) # Else divide the range of columns in two, spawning a new task for each half else mid = div(L+F,2) p = Threads.@spawn recursive_fill_columns!(M, F:mid) recursive_fill_columns!(M, mid+1:L) wait(p) end end function julia() M = Matrix{UInt8}(undef, 10000, 5000) recursive_fill_columns!(M, 1:size(M, 2)) return M end; @time M = julia(); 0.833607 seconds (34.32 k allocations: 51.106 MiB) This is almost 6 times as fast! This is close to the best case scenario for 8 threads, only possible for near-perfect embarrasingly parallel tasks.\nDespite the potential for great gains, in my opinion, multithreading should be one of the last resorts for performance improvements, for three reasons:\n Implementing multithreading is harder than other optimization methods in many cases. In the example shown, it was very easy. In a complicated workflow, it can get messy quickly. Multithreading can cause hard-to-diagnose and erratic bugs. These are almost always related to multiple threads reading from, and writing to the same memory. For example, if two threads both increment an integer with value N at the same time, the two threads will both read N from memory and write N+1 back to memory, where the correct result of two increments should be N+2! Infuriatingly, these bugs appear and disappear unpredictably, since they are causing by unlucky timing. These bugs of course have solutions, but it is tricky subject outside the scope of this document. Finally, achieving performance by using multiple threads is really achieving performance by consuming more resources, instead of gaining something from nothing. Often, you pay for using more threads, either literally when buying cloud compute time, or when paying the bill of increased electricity consumption from multiple CPU cores, or metaphorically by laying claim to more of your users\u0026rsquo; CPU resources they could use somewhere else. In contrast, more efficent computation costs nothing. GPUs So far, we've covered only the most important kind of computing chip, the CPU. But there are many other kind of chips out there. 
The most common kind of alternative chip is the graphical processing unit or GPU.\nAs shown in the above example with the Julia set, the task of creating computer images are often embarassingly parallel with an extremely high degree of parallelizability. In the limit, the value of each pixel is an independent task. This calls for a chip with a high number of cores to do effectively. Because generating graphics is a fundamental part of what computers do, nearly all commercial computers contain a GPU. Often, it's a smaller chip integrated into the motherboard (integrated graphics, popular in small laptops). Other times, it's a large, bulky card.\nGPUs have sacrificed many of the bells and whistles of CPUs covered in this document such as specialized instructions, SIMD and branch prediction. They also usually run at lower frequencies than CPUs. This means that their raw compute power is many times slower than a CPU. To make up for this, they have a high number of cores. For example, the high-end gaming GPU NVIDIA RTX 2080Ti has 4,352 cores. Hence, some tasks can experience 10s or even 100s of times speedup using a GPU. Most notably for scientific applications, matrix and vector operations are highly parallelizable.\nUnfortunately, the laptop I'm writing this document on has only integrated graphics, and there is not yet a stable way to interface with integrated graphics using Julia, so I cannot show examples.\nThere are also more esoteric chips like TPUs (explicitly designed for low-precision tensor operations common in deep learning) and ASICs (an umbrella term for highly specialized chips intended for one single application). At the time of writing, these chips are uncommon, expensive, poorly supported and have limited uses, and are therefore not of any interest for non-computer science researchers.\nThanks to Chris Elrod for reviewing this for correctness and teaching me a thing or two about computers\n","date":1587292718,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1587292718,"objectID":"437ea5bb0077aac274a929a0758750ac","permalink":"https://BioJulia.github.io/post/hardware/","publishdate":"2020-04-19T12:38:38+02:00","relpermalink":"/post/hardware/","section":"post","summary":"Find this notebook at https://github.com/jakobnissen/hardware_introduction\nProgramming is used in many fields of science today, where individual scientists often have to write custom code for their own projects. For most scientists, however, computer science is not their field of expertise; They have learned programming by necessity. I count myself as one of them. While we may be reasonably familiar with the software side of programming, we rarely have even a basic understanding of how computer hardware impacts code performance.","tags":[],"title":"What scientists must know about hardware to write fast code","type":"post"},{"authors":["Ben J. Ward"],"categories":[],"content":"Hi there! 
There have been incidents of confusion where newer users have tried to install and run an old and deprecated BioJulia package called Bio.jl, or they have not known which package(s) they need to install in order to start working.\nSo I'd like to take the opportunity to clear this situation up and perhaps put out a clearer explanation, of why Bio.jl exists when there are other packages with overlapping functionality.\nBio.jl was the first package BioJulia produced, at a time when the scale of the project meant it made sense to have a single module - Bio - with several submodules\n Seq, Align, Phylo, Intervals, Structure, Var, Services, Tools. Bio has existed since very early days - or versions - of julia as well.\nEventually there came a point where datatype and algorithm specific packages made more sense than a single monolithic repository.\nAnd so the package got split into many:\n Bio.Seq became BioSequences.jl Bio.Align became BioAlignments.jl Bio.Intervals became GenomicFeatures.jl Bio.Structure became BioStructures.jl Bio.Var became GeneticVariation.jl Bio.Phylo became Phylogenies.jl Bio.Services became BioServices.jl Bio.Tools became BioTools.jl Bio.jl then basically persisted as a convenience wrapper around those packages, so that they could be installed with a single command and were distributed set to compatible versions.\nWith julia v0.7/v1.0 and the new Pkg and Project system the need for Bio.jl to provide that functionality became largely moot.\nSo what should you do? As of today, Bio.jl is available for installation, but mostly just to prevent old scripts and projects breaking.\nIt is not longer recommended that you use Bio.jl. Instead you should start a julia project and add just the BioJulia packages you want to use. For a brief description on how to do this, check out the getting started page.\nHappy hacking.\nBen.\n","date":1579820469,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1579820469,"objectID":"7cc635bf6613cf421bab5391ffdd3a49","permalink":"https://BioJulia.github.io/post/biojl/","publishdate":"2020-01-23T23:01:09Z","relpermalink":"/post/biojl/","section":"post","summary":"Hi there! There have been incidents of confusion where newer users have tried to install and run an old and deprecated BioJulia package called Bio.jl, or they have not known which package(s) they need to install in order to start working.\nSo I'd like to take the opportunity to clear this situation up and perhaps put out a clearer explanation, of why Bio.jl exists when there are other packages with overlapping functionality.","tags":["BioSequences","DNA","RNA","Sequences","Deprecation"],"title":"Bio.jl is old and deprecated","type":"post"},{"authors":["Jakob Nybo Nissen","Ben J. Ward"],"categories":[],"content":"Introduction In October 2019, a paper was published in Proceedings of the ACM on Programming Languages, introducing a new language for bioinformatics called Seq.\nIn it, the authors rightly recognize that the field of bioinformatics is in need of a high-performance, yet expressive and productive language. They present Seq, a domain specific compiled language, with the user friendliness of Python, the performance of C, and bioinformatics-specific data types and optimisations. As Julians, we consider their goal to be noble and well worth pursuit. 
However, they also presented a series of benchmarking results in which they claim the performance of Seq code is far greater than that of equivalent code written in other languages, including C++ and Julia.\nOf course, benchmarking between languages is a tricky thing. Different languages present different syntax, tools and idioms to the programmer, such that what is efficient and natural in one language may be inefficient and clumsy in another. When benchmarking different languages, therefore, it is important to write code that is idiomatic in each language before comparing the code in terms of performance, readability or ease of writing. Disagreements about benchmarks between languages often come down to disagreements over whether the compared code is equivalent - if it isn’t, comparisons may be meaningless.\nFor example, in the majority of the benchmarks between Seq and Julia in the Seq paper, the DNA sequences of Seq are represented by a dedicated sequence type, whereas the sequences in Julia are represented by strings. As the Seq type is crafted specifically for efficient DNA processing, and Julia’s strings are crafted for efficient generic text processing, it is no surprise that Seq is much faster in these benchmarks. Therefore, these particular benchmarks are obviously misleading, and we will not discuss them further here.\nThe Seq authors, to their credit, realize this and include a comparison between Seq and Julia code that uses BioSequences, a package of the BioJulia organization. This package contains custom, efficient types for exactly the kind of work they are benchmarking, and so here, the code is equivalent and the comparison reasonable. Even when comparing against BioSequences, the Seq authors find that Seq offers a large speed advantage. As a core developer and a long-time user of BioJulia, respectively, we were puzzled by these results. BioJulia is highly optimised. Could Seq really be that much faster? Perhaps we could learn something from Seq?\nRecreating the benchmarks The first thing to do was to see if the results in the paper could be replicated. The Seq authors have made a real effort to allow easy replication of their results, as they have released a virtual machine with all necessary code and software to run their benchmarks out of the box. Only their BioJulia code was missing, which they promptly provided upon request. So we spun up their VM and recreated their benchmarks. For details on the benchmarking procedure, see the last section of this post.\nWe note that the provided version of Seq gives wrong answers in the benchmarks on several counts. First, kmers containing ambiguous nucleotides are not skipped during iteration, leading to simpler and faster (but by convention, wrong) code. Second, the middle base is not complemented during reverse-complementation of odd-length sequences. Third, the reverse-complements of ambiguous nucleotides are wrong: e.g. K is complemented to N rather than the correct answer (M). Fourth, we can make little sense of what the result returned by their CpG benchmarking code actually means. However, these small blemishes could easily be fixed by the Seq authors with little or no performance penalty, so they are not of great importance here.\nThe results of our recreation are seen in the plot below. We were happy to find that we could recreate their results, observing nearly the same performance difference they reported.\nFigure 1. Running time of BioJulia (blue) versus Seq (red). 
We could recreate the authors’ findings and found Seq to be much faster than BioJulia for the three provided benchmarks. The barplot shows the mean +/- stddev time of 5 consecutive runs.\nImproving the benchmarked code As long-time developers and users of BioJulia, we feel qualified to judge whether the Julia code used for the benchmarks is idiomatic, and whether BioJulia was being used correctly. Non-idiomatic Julia is the most common source of complaints when new Julia users benchmark Julia code and find the performance disappointing. This often happens because new users code in a way that carries little penalty in languages like R and Python, but which is difficult for Julia’s JIT compiler to optimise. So, a natural starting point was to check their Julia code for inefficiencies.\nHowever, the Seq authors’ Julia benchmarks were well implemented. Only the code for the CpG benchmark could be improved, by fixing an error which resulted in incorrect answers and by computing the result in a single pass over the input sequence. The two other benchmarks only needed minor tweaking to be idiomatic. These minor changes did not improve the timings of BioJulia markedly (Figure 2).\nFigure 2. Improving the Julia code of the Seq paper’s benchmarks (green series) did not change the results very much. The barplot shows the mean +/- stddev time of 5 consecutive runs.\nExplaining the difference in performance It seems that Seq really is much faster than BioJulia, at least for the benchmarked tasks, and we wanted to know why. So we profiled their BioJulia code to see what BioJulia was taking its sweet time doing. The results for the three benchmarks are shown below in Figure 3.\nFigure 3. Barplots showing the fraction of time the BioJulia benchmarking code spent on various sub-tasks, as determined by the built-in Julia Profiler tool.\nSurprisingly, Figure 3 reveals that only a small fraction of the time is spent doing what the benchmark nominally measures. For example, the RC benchmark should presumably benchmark reverse-complementation (RC’ing) of sequences, but BioJulia spends less than 10% of the time actually RC’ing. Likewise, checking symmetric kmers and kmer iteration in the 16-mer benchmark is relatively insignificant time-wise. We implemented Seq’s algorithm for RC’ing kmers as described in their paper, but found no difference in performance compared to BioJulia’s approach, even when benchmarking only the time spent RC’ing.\nIn fact, Figure 3 shows that for all benchmarks, most of the time is spent building instances of the BioSequence type that BioSequences.jl uses to represent long DNA, RNA and amino acid sequences. To find the source of the performance difference between Seq and BioSequences, then, it is necessary to take a small detour into the internals of BioSequences and Seq.\nHow to represent DNA sequences in software DNA consists of four nucleotides called A, C, G and T. In some contexts, a nucleotide may instead be one of 16 IUPAC-defined symbols. Hence, one nucleotide may be represented in either two bits or four bits. In BioSequences, DNA is stored in a compact format, where nucleotides are encoded as 2-bit or 4-bit values in an integer array. This representation comes with a few advantages:\n The encoding and decoding step implicitly validates that the input data is meaningful DNA data instead of some random string. 
Therefore, when given a BioSequence, the user can be confident that the underlying data actually represents valid DNA.\n Encoding saves 4x or 2x on RAM. Speed is important in bioinformatics precisely because of our large datasets, and this also puts a premium on RAM, making these savings worthwhile.\n The encoded representation of DNA makes biological operations easier and more efficient. Complementation of a 2-bit DNA sequence, for example, consists only of inverting the encoded bits (a small toy sketch of this follows below). Converting between DNA and RNA requires only changing the metadata.\n The encoding/decoding also comes with a disadvantage, as it takes more time than just accessing raw bytes. This tradeoff is reminiscent of the pros and cons of a boolean vector versus a bit-vector. As the benchmarks here consist of very simple, fast operations on many small sequences that are read in as text, encoding time dominates.\nIn contrast, Seq represents DNA sequences as a byte array, with the underlying data simply being the bytes of the input string. Constructing this type is obviously much faster than a BioSequence since it can wrap a string directly, but the RAM consumption is 2 or 4 times higher. Remarkably, Seq appears to do no input validation at all, as we confirmed by re-running the benchmarks with corrupted data. BioSequences immediately crashed with an informative error message, whereas Seq happily produced the wrong answer with no warnings.\nSo it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate the inputs given to them rather than failing silently.\nImplementing Seq-like types in Julia Although BioSequences may be more memory efficient and safer to use, we still verified the finding of the Seq authors: Seq really is much faster than BioSequences. That surprised us. Was Seq so fast because of amazing software engineering in the language itself, or simply because so much time is lost in BioSequence’s encoding and decoding? We decided to mimic Seq in Julia by implementing DNA sequence and kmer types with the same data representation Seq uses. Our module, SeqJL, turned out to be simple to write, taking less than 100 lines of code. We note that BioSequences could easily have been written this way, and intentionally isn’t.\nFigure 4 shows the performance of our SeqJL code (red), which is significantly faster than Seq (orange), except in the RC benchmark. We found that even further speed improvements were possible by buffering input and output using the BufferedStreams.jl package, but we did not use that in the benchmarks.\nFigure 4. Timing a Julia implementation of the data types in Seq (red) resulted in timings which beat those achieved by the Seq code (orange). The barplot shows the mean +/- stddev time of 5 consecutive runs.\nLearning from Seq Being challenged often teaches valuable lessons, and the challenge Seq presented to BioJulia is no exception. We were surprised to see a bioinformatics workload where the encoding step of BioSequences proved to be a bottleneck, as we have always believed it to be very fast. We did not expect our simple SeqJL code to outperform BioSequences by such a large factor in these benchmarks.\nIn most realistic workloads, sequences are subject to more intense processing, which makes the speed of encoding and IO operations less important in comparison. 
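To make the encoding argument above concrete, here is a toy sketch (my own illustration, not BioSequences' actual internals) of a 2-bit DNA encoding in which complementing a whole chunk of sequence is a single bitwise NOT:

# Toy 2-bit encoding: A=0b00, C=0b01, G=0b10, T=0b11, packed into a UInt64.
# With this encoding A<->T and C<->G swap by flipping both bits, so complementation
# is one bitwise NOT per 32-nucleotide chunk. The layout is an assumption for
# illustration, not the real BioSequences data layout.
const ENCODE = Dict('A' => 0b00, 'C' => 0b01, 'G' => 0b10, 'T' => 0b11)
const DECODE = ('A', 'C', 'G', 'T')
encode(seq::String) = foldl((acc, c) -> (acc << 2) | ENCODE[c], seq; init=UInt64(0))
decode(x::UInt64, len::Int) = String([DECODE[((x >> (2 * (len - i))) & 0b11) + 1] for i in 1:len])
complement(x::UInt64, len::Int) = ~x & ((UInt64(1) << (2 * len)) - 1)  # flip every base at once
decode(complement(encode("ACGT"), 4), 4)  # "TGCA"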
In addition, we note that BioJulia provides optimised, buffered, state-machine-generated parsers for many file formats like FASTA and FASTQ, but they were not explored in this work as no benchmark involved them. BioJulia also offers ReadDatastores.jl, which implements indexed, disk-backed collections of sequences, stored in the BioSequences encodings. These data-stores mean that commonly used sequence datasets, like sequencing reads stored in FASTQ files, need to be encoded only once, after which the data-store can be reused for a great performance benefit. Nonetheless, the impressive speed of Seq over BioSequences in these benchmarks made us reconsider the need for much faster string/sequence conversion and printing. As a result of this challenge, we have subjected BioSequences to a performance review, optimising a dozen code sections that were causing slowdowns. To benefit from these improvements, users should install BioSequences version 2.X from the BioJuliaRegistry.\nWe were happy to discover that these changes made a big difference. With the updates, BioSequences rivals Seq in speed while keeping its advantages of a lower memory footprint and data validation.\nFigure 5. The newest version of BioSequences (purple) comes with performance improvements in encoding, decoding and IO, making it 3-4 times faster than BioSequences v1.1.0 (green) for these benchmarks, and rivaling Seq (orange). The barplot shows the mean +/- stddev time of 5 consecutive runs.\nOur take-away impressions Jakob I wholeheartedly agree with Seq’s goal of bringing a performant language to the masses, so to speak. I also applaud Seq’s authors for their efforts in providing reproducible results by sharing their code and environment, and for their efforts to compare Seq to efficient, proper Julia code. Ultimately, the speed difference between BioSequences and Seq comes down to a design decision in the implementation of the sequence type. I happen to think BioSequences made the right call by encoding the sequences, and Seq made the wrong one.\nMore broadly, I think Seq brings little of value to bioinformatics. Our simple SeqJL implementation shows that Julia can achieve what Seq aims to do with even higher performance and, I would argue, even more elegant, reusable and concise code. In contrast, Seq comes with serious drawbacks. Besides DNA sequence processing, a typical bioinformatics workflow may include reading CSV files, feature extraction, statistical tests, plotting data and more. Being a domain specific language, Seq has zero chance of having packages for all these tasks available. In contrast, because BioSequences is simply a package in pure Julia, a user of BioSequences has all the tools of the broader Julia ecosystem available.\nMore realistically, Seq could survive by providing interoperation with other languages. But that way madness lies. In the bad old days, in a few hours of work, a bioinformatician would use awk to edit text files, pass them through Perl one-liners, and run the result through Python for data processing before graphing with ggplot, all of it duct-taped together using Bash. Workflows of that kind were inefficient, brittle, and required the programmer to learn a handful of different domain specific languages. 
Surely that path, the path that Seq shows us, must lie behind us.\nBen I can only echo Jakob in applauding the Seq authors’ efforts to bring a performant programming language to life, and in providing reproducible results.\nThat said, before playing with these benchmarks I was fairly skeptical of the idea that what bioinformatics really needs is a domain specific language. When I consider critical problems the field faces, my mind goes to problems such as sometimes undocumented, sometimes poorly understood assumptions about biological systems hardcoded into tools (e.g. assembly pipelines that assume levels of ploidy or genome characteristics). I think of experiments that start with a single reference genome, and run pipelines of heuristic after heuristic, each with its own slew of parameters and biases.\nIn other words, the greatest gains are to be made by improving how we do things, not how fast we do them. Everyone wants speed; that’s why so many Big Data frameworks exist. Indeed, the Seq developers’ intention of having a language with Pythonic user-friendliness and speed approaching C is also one of the main reasons Julia exists today. After Julia 1.0 was released, as a core developer of BioJulia, I considered the argument as to whether or not Julia is performant to be settled. It is.\nWith the question of whether Julia is fast enough for bioinformatics settled, I think the current goal of BioJulia is to provide clear, flexible, easy-to-use frameworks for doing bioinformatics analyses in a robust, reproducible and transparent manner, using clear concepts.\nAfter this review I would echo the same conclusions as Jakob: in brief, the results in the paper can be explained largely by our design choices in BioSequences.jl, which we maintain were the right calls. The exercise has been valuable, as it has resulted in several pull requests that improve the performance of the string conversions and sequence encoding BioSequences.jl does; whilst we think the computational effort is justified, we also want it to be faster if it can be. Without Seq’s benchmarks to prompt us to examine this more closely, those improvements might not have been made for a long time, if at all.\nBenchmarking procedure Benchmarking environment We ran the code in the Vagrant virtual machine provided by the authors. In the virtual machine, all software needed to reproduce the benchmarks was present, except the Seq authors’ BioJulia code, which was provided upon request. The code was run on a 2018 MacBook Pro using Julia v1.0.3 and BioSequences v1.1.0. We could not determine the version of Seq used in the VM.\nBenchmarking method Due to a small virtual disk size in the VM, the whole dataset could not be downloaded, so we generated input data instead by taking the provided small test dataset HG00123_small.txt and concatenating it to itself 16 times, for a total of 2^24 lines, 1.3 GiB. To benchmark, we ran each script 5 times consecutively. For Seq, we used the idiomatic Seq code “seq-id”, and did not include compilation time. For Julia, we included JIT compilation (which we estimated to be ~2 seconds for BioJulia benchmarks, and somewhat less for SeqJL benchmarks), but not package precompilation time. 
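For concreteness, the input generation described above could be done with a few lines of Julia. This is only a sketch: the assumption that 16 whole copies of the small file were appended (rather than repeated doubling) is mine, and the output file name is hypothetical.

# Sketch only: build a larger benchmark input by appending copies of the provided
# small test file.
function concatenate_input(src::AbstractString, dst::AbstractString; copies::Int=16)
    data = read(src)
    open(dst, "w") do out
        for _ in 1:copies
            write(out, data)
        end
    end
end
concatenate_input("HG00123_small.txt", "HG00123_16x.txt")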
Note that unlike the Seq authors, we ran Julia with default optimization level and boundschecks enabled, as these are the most common settings to run Julia in.\n","date":1579820469,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1579820469,"objectID":"ebbc894ff5d5eeacc7025d31dc2bfe20","permalink":"https://BioJulia.github.io/post/seq-lang/","publishdate":"2020-01-23T23:01:09Z","relpermalink":"/post/seq-lang/","section":"post","summary":"Introduction In October 2019, a paper was published in Proceedings of the ACM on Programming Languages, introducing a new language for bioinformatics called Seq.\nIn it, the authors rightly recognize that the field of bioinformatics is in need of a high-performance, yet expressive and productive language. They present Seq, a domain specific compiled language, with the user friendliness of Python, the performance of C, and bioinformatics-specific data types and optimisations. As Julians, we consider their goal to be noble and well worth pursuit.","tags":["BioSequences","Benchmarking","Seq-Lang","DNA","RNA","Sequences"],"title":"On the performance and design of BioSequences compared to the Seq language","type":"post"},{"authors":null,"categories":null,"content":"","date":-62135596800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":-62135596800,"objectID":"4ba4b809a14808b5e20c82aa71829eed","permalink":"https://BioJulia.github.io/getting-started/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/getting-started/","section":"","summary":"","tags":null,"title":"Getting Started","type":"widget_page"}]
\ No newline at end of file
+[{"authors":["jakobnissen"],"categories":null,"content":"I am a bioinformatician at Statens Serum Institut (the Danish center for disease control), where I analyze genome of influenza with the purpose of better understanding pandemic flu.\n","date":1598797689,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":1598797689,"objectID":"839151c1b508087f6ab27942678f4035","permalink":"https://BioJulia.github.io/authors/jakobnissen/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/jakobnissen/","section":"authors","summary":"I am a bioinformatician at Statens Serum Institut (the Danish center for disease control), where I analyze genome of influenza with the purpose of better understanding pandemic flu.","tags":null,"title":"Jakob Nybo Nissen","type":"authors"},{"authors":["admin"],"categories":null,"content":"","date":1579820469,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":1579820469,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"https://BioJulia.github.io/authors/admin/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/admin/","section":"authors","summary":"","tags":null,"title":"Sabrina Ward","type":"authors"},{"authors":["kbonham"],"categories":null,"content":"I am a microbiologist and immunologist teaching and researching at Harvard University and the Broad Institute. I use computational methods to explore the relationships between microbial communities and human disease, and to understand complex microbial communities. I also teach about the immune system and infectious disease.\n","date":-62135596800,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":-62135596800,"objectID":"f8654d4cd97b4098f5c15014e9d7e021","permalink":"https://BioJulia.github.io/authors/kbonham/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/kbonham/","section":"authors","summary":"I am a microbiologist and immunologist teaching and researching at Harvard University and the Broad Institute. I use computational methods to explore the relationships between microbial communities and human disease, and to understand complex microbial communities. I also teach about the immune system and infectious disease.","tags":null,"title":"Kevin Bonham","type":"authors"},{"authors":["Jakob Nybo Nissen"],"categories":[],"content":"Find this notebook at https://github.com/jakobnissen/automa_tutorial\nIn bioinformatics, we have a saying:\n The first step of any bioinformatics project is to define a new file format, incompatible with all previous ones.\n The situation might not be quite as bad as the saying implies, but it is true that we have a lot of different file formats, representing the various kinds of data we work with.\nFor that reason, creating file parsers is a central task in bioinformatics, and has almost become a craft in itself. An awful, awful craft. For anything but the simplest formats, designing robust parsers is hard, and it is difficult to know if a parser you've built is solid. Insignificant details like trailing whitespace can break a seemingly well-built parser.\nAt its core, a parser is a program that checks whether some input data comforms to some format. For bioinformatics, we typically also want to load the data in the file into some data structure to work on it. The format itself is typically specified in some kind of higher-level, but unambiguous language, like this description of the Newick format. 
A parser, then, can be viewed as a program that:\n Is given a format specified in a high-level \u0026ldquo;language\u0026rdquo;, e.g. \u0026ldquo;a leaf in the Newick format consists of only a name, which is formed from the alphabet \u0026ldquo;a-z, A-Z and _\u0026quot;\u0026quot;, and\n Checks whether some input follows that format using relatively low-level instructions, e.g. checking whether the following bytes until the next (, ), , or ; match the pattern [A-Za-z_]\n The central job of a parser is then to act as a translator between different abstraction levels: the format is specified in a high-level language, but the actual data processing must happen in a lower-level language.\nWhen described like that, parser generation seems quite analogous to code compilation, and begins to look like a machine's job, not a human's. Machines are fast and accurate, and do not make oversights or mistakes - exactly the characteristics you want when making a parser. Let's leave the machine tasks to the machines; we humans ought to put our effort into what we do best: specifying the formats themselves.\nIn this tutorial, I will introduce Automa, a Julia package for creating parsers. Actually, when reading this you might be able to tell that Automa solves a problem slightly more general than just creating parsers, but creating parsers is what we will use it for in this tutorial. Automa is not an easy package to use; it is complex and a little opaque, but it's worth all the effort and more. Parsers generated by Automa have several advantages over human-made parsers:\n They are failsafe: any deviation of the input from the specified format, no matter how trivial, will cause it to fail. Therefore, they are much more reliable than human-written parsers. They are easy to change: you can often easily change details in the input format, such as allowing whitespace, with minimal changes to your code. They are quite fast, with Automa-generated code often being faster than even optimized human-crafted parsers. In this first part of the tutorial I will show how to create parsers from data loaded into memory, in order to highlight Automa itself. In the second part, which I will release at a later date, I will show how to create a proper BioJulia reader and writer for a file format.\nOur format For our example, let us begin with a simplified version of the FASTA format - we can call it \u0026ldquo;SimpleFasta\u0026rdquo;. Our format will be defined as something that looks like this:\n\u0026gt;first TGATAGTAGTCGTGTGATA \u0026gt;second AGGACCCAATTATCGGGGTAA I.e.\n A SimpleFasta file is a sequence of zero or more records A record consists of header * newline * sequence * newline, where \u0026ldquo;*\u0026rdquo; signifies concatenation A header consists of \u0026gt; followed by one or more characters from the alphabet a-z A sequence consists of one or more characters from the alphabet A, C, G or T. 
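Before turning to Automa, it is worth noting just how small this grammar is. As a point of reference only (not part of the original tutorial), the whole format can be checked with one of Julia's ordinary regexes; the Automa machine built below does the same job byte by byte, and can additionally run actions while it matches.

# Naive SimpleFasta validator using Julia's built-in PCRE regexes, shown only for
# contrast with the Automa-generated machine developed below.
is_simplefasta(data::AbstractString) = occursin(r"\A(?:>[a-z]+\n[ACGT]+\n)*\z", data)
is_simplefasta(">first\nTGATAGTAGTCGTGTGATA\n>second\nAGGACCCAATTATCGGGGTAA\n")  # true
is_simplefasta(">first\nACGU\n")  # false (U is not in the SimpleFasta alphabet)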
We can visualize a hypothetical SimpleFasta parser using a simple flowchart - note: You need to have graphviz installed to be able to run the code below.\ns = \u0026quot;\u0026quot;\u0026quot;digraph { graph [ rankdir = LR ]; begin [ shape = circle ]; begin -\u0026gt; header; header [ shape = circle ]; header -\u0026gt; sequence; sequence [ shape = circle ]; sequence -\u0026gt; end; end [ shape = circle ]; sequence -\u0026gt; header; begin -\u0026gt; end; } \u0026quot;\u0026quot;\u0026quot; function display_machine() write(\u0026quot;/tmp/fasta_machine.dot\u0026quot;, s) run(`dot -Tsvg -o /tmp/fasta_machine.svg /tmp/fasta_machine.dot`) open(\u0026quot;/tmp/fasta_machine.svg\u0026quot;) do file display(\u0026quot;image/svg+xml\u0026quot;, String(read(file))) end end; display_machine() After the beginning of the file, we expect to see a header. Next, we move to the sequence. From there, we can either reach the end of the file, or loop back to another header. Alternatively, in the case of an empty file, we may move from the beginning of the file to the end immediately without any headers or sequences. We consider an empty file to also be specification-compliant.\nNote that this simple flowchart describes any valid SimpleFasta format. But if, for example, a file had two consecutive header lines, there is no path we can take through the flowchart from \u0026ldquo;begin\u0026rdquo; to \u0026ldquo;end\u0026rdquo; which describes the file, and we can tell that the input did not conform to the format.\nThe circles marked \u0026ldquo;begin\u0026rdquo;, \u0026ldquo;header\u0026rdquo;, \u0026ldquo;sequence\u0026rdquo; or \u0026ldquo;end\u0026rdquo; are what we call the four states. Because the parser shown in the diagram has a fixed list of possible states that it changes between, we call the parser a finite state machine (FSM), or finite state automaton (FSA).\nThe flowchart above doesn't do a very good job of describing the FSM that our parser is, because it doesn't tell you how the machine transitions between states. The state transitions are what actually defines the format. For example, we know we are transitioning to a header when we see a \u0026gt; symbol - that is, in a way, what defines the header state.\nIn the flowchart below, the arrows signifying transitions between states are marked with the input character(s) that causes the FSM (our parser) to transition, written in a regex-like notation. For example, the transition from a header to a sequence is defined by a \\n followed by one of the characters in [ACGT]. \u0026ldquo;EOF\u0026rdquo; signifies end of file. Note also that some states have transitions leading to itself. 
For example, when in the \u0026ldquo;header\u0026rdquo; state, the FSM may continue to read characters in a-z and stay in the \u0026ldquo;header\u0026rdquo; state.\ns = \u0026quot;\u0026quot;\u0026quot;digraph { graph [ rankdir = LR ]; begin [ shape = circle ]; header [ shape = circle ]; sequence [ shape = circle ]; end [ shape = circle ]; begin -\u0026gt; header [ label = \u0026quot;\u0026gt;[a-z]\u0026quot; ]; header -\u0026gt; header [ label = \u0026quot;[a-z]\u0026quot;]; header -\u0026gt; sequence [ label = \u0026quot;\\\\\\\\n[ACGT]\u0026quot; ]; sequence -\u0026gt; sequence [ label = \u0026quot;[ACGT]\u0026quot; ]; sequence -\u0026gt; header [ label = \u0026quot;\\\\\\\\n\u0026gt;\u0026quot; ]; sequence -\u0026gt; end [ label = \u0026quot;\\\\\\\\nEOF\u0026quot; ]; begin -\u0026gt; end; } \u0026quot;\u0026quot;\u0026quot; display_machine() How does the machine know when it's supposed to transition between states? Suppose a machine is at the beginning of a file and it begins by observing the input \u0026gt;. Is this part of a transition to the header state, or the beginning of a malformed input?\nIt is certainly possible for the machine to have a memory of the pattern, such that it only transitions to the \u0026ldquo;header\u0026rdquo; state once it sees the two specific inputs \u0026gt; and [a-z] in that order. However, we can make things much simpler by requiring the machine to make a state transition for every single input byte, then we don't need any memory other than what state the machine is in, which is a single integer. To still represent the same format at the flowchart above, we need to insert some extra states:\ns = \u0026quot;\u0026quot;\u0026quot;digraph { graph [ rankdir = LR ]; begin [ shape = circle ]; 2 [ shape = circle ]; 1 [ shape = circle ]; 3 [ shape = circle ]; begin -\u0026gt; 1 [ label = \u0026quot;\u0026gt;\u0026quot; ]; header [ shape = circle ]; 1 -\u0026gt; header [label = \u0026quot;[a-z]\u0026quot; ]; header -\u0026gt; header [ label = \u0026quot;[a-z]\u0026quot; ]; 2 -\u0026gt; sequence [ label = \u0026quot;[ACGT]\u0026quot; ]; header -\u0026gt; 2 [ label = \u0026quot;\\\\\\\\n\u0026quot; ]; sequence [ shape = circle ]; sequence -\u0026gt; sequence [ label = \u0026quot;[ACGT]\u0026quot; ]; sequence -\u0026gt; 3 [ label = \u0026quot;\\\\\\\\n\u0026quot; ]; 3 -\u0026gt; 1 [ label = \u0026quot;\u0026gt;\u0026quot; ]; end [ shape = circle ]; 3 -\u0026gt; end [ label = \u0026quot;EOF\u0026quot; ]; \u0026quot;begin\u0026quot; -\u0026gt; end [ label = \u0026quot;EOF\u0026quot; ]; } \u0026quot;\u0026quot;\u0026quot; display_machine() The diagram above is exactly equivalent to the previous one, but this diagram makes a state transition for every single input byte. All bytes cause a transition, and no transition consumes multiple bytes. For simplicity, Automa's parsers work like that.\nNote that it is always possible, given a FSM with multi-byte transitions to convert it to a FSM with single-byte transitions by just putting in more nodes. E.g. the transition from header to sequence requires two bytes, a newline and an A, C, G or T. Therefore, an intermediate note (on the illustration marked \u0026ldquo;2\u0026rdquo;) is needed. 
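To make the "one transition per input byte" idea tangible, here is a small hand-written machine corresponding to the diagram above. The state numbering is my own and the code is only a sketch; Automa generates an equivalent (and faster) machine automatically in the next section.

# States: 1 = begin, 2 = after '>', 3 = in header, 4 = after the header's newline,
# 5 = in sequence, 6 = after a sequence's newline (also the accept state).
function simplefasta_fsm(data::AbstractVector{UInt8})
    state = 1
    for byte in data
        state = if state == 1 && byte == UInt8('>')
            2
        elseif state in (2, 3) && UInt8('a') <= byte <= UInt8('z')
            3
        elseif state == 3 && byte == UInt8('\n')
            4
        elseif state in (4, 5) && byte in (UInt8('A'), UInt8('C'), UInt8('G'), UInt8('T'))
            5
        elseif state == 5 && byte == UInt8('\n')
            6
        elseif state == 6 && byte == UInt8('>')
            2
        else
            return false  # no valid single-byte transition: malformed input
        end
    end
    return state == 1 || state == 6  # accept an empty file, or end of a record, at EOF
end
simplefasta_fsm(codeunits(">hi\nACGT\n"))  # true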
Unfortunately, the requirement of single-byte transitions limits what you can do with Automa - I'll come back to the limitations later.\nNow let's create this same FSM using Automa.\nIntroducing Automa # First imports import Automa import Automa.RegExp: @re_str const re = Automa.RegExp; We define the elements that make up the SimpleFasta (the header and the sequence) using Automa.RegExp, a subset of regular expressions. Automa's regular expressions can be manipulated with the following operations:\n Function Alias Meaning cat(re...) * concatenation alt(re1, re2...) \\| alternation rep(re) zero or more repetition rep1(re) one or more repetition opt(re) zero or one repetition isec(re1, re2) \u0026amp; intersection diff(re1, re2) \\ | difference (subtraction) | neg(re) ! negation Furthermore inside the re-strings themselves, you can use\n groups of characters, like [ACGT] representing the set {A, C, G, T} ranges A-Z representing 'A':'Z' (or more accurately, the bytes UInt8('A'):UInt8('Z') *, representing zero or more repetitions of the last element, such as [a-z]* +, representing one or more repetitions of the last element, such as [a-z]+ ?, representing zero or one repetition of the last element, such as [a-z]? We can specify our FSM like below. We use a let statement only so we don't pollute global namespace with header, sequence etc.\nmachine = let # Define SimpleFasta syntax header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; sequence = re\u0026quot;[ACGT]+\u0026quot; record = header * re\u0026quot;\\n\u0026quot; * sequence * re\u0026quot;\\n\u0026quot; fasta = re.rep(record) # Compile the regex to a FSM Automa.compile(fasta) end; If you have dot installed on your computer, you can visualize the compiled FSM. You might want to modify the paths in the function below to make it work on your computer.\nfunction display_machine(machine) write(\u0026quot;/tmp/fasta_machine.dot\u0026quot;, Automa.machine2dot(machine)) run(`dot -Tsvg -o /tmp/fasta_machine.svg /tmp/fasta_machine.dot`) open(\u0026quot;/tmp/fasta_machine.svg\u0026quot;) do file display(\u0026quot;image/svg+xml\u0026quot;, String(read(file))) end end; display_machine(machine) Yes, that looks correct! The start node here is the small point on the left. You can see one of the states are represented by a double circle. This represents an accept state, where the machine may stop at a valid end of input - here after each SimpleFasta record. If the machine stops at any other state, it did not complete correctly. In other words, state \u0026ldquo;1\u0026rdquo; here doubles as the \u0026ldquo;end\u0026rdquo; state.\nWith this parser we can read a SimpleFasta file and determine if it conforms to the specification. In Automa, we do this by having our machine create Julia code in the form of a Julia expression (of type Expr), which we can then use to create a function using metaprogramming.\nFirst, we need an Automa.CodeGenContext, an object which contains the options for code generation. For now, we don't worry about the settings and just make a default context object:\ncontext = Automa.CodeGenContext(); Then, we use two functions to generate code:\n Automa.generate_init_code(::CodeGenContext, ::Machine) creates the Julia code needed to initialize the parsing of the given FSM. Automa.generate_exec_code(::CodeGenContext, ::Machine, ::Dict{Symbol, Expr}) creates the Julia code needed for the main loop of the running parser. Here, the Dict is an action dict, which may contain arbitrary Julia code that the parser executes while parsing. 
We need that action dict later to make our parser do something useful - but for now, we simply want to create a parser, not necessarily a useful one, so we just use an empty dict. We will come back to the action dict later. Let's have a look at these functions:\nAutoma.generate_init_code(context, machine) quote #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:94 =# p::Int = 1 #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:95 =# p_end::Int = 0 #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:96 =# p_eof::Int = -1 #= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:97 =# cs::Int = 1 end Besides the comments, the init code seems to define four integers. What to they mean?\nAutoma works by reading the data byte by byte. When doing this, the data is represented in the function as an AbstractVector{UInt8} - let us call that data. p represents the index of data that the parser is currently reading from. p_end represents the last position of the data in data. Because data may just be a small, buffered part of a much larger file being parsed, p_eof represents the index of actual end of input. Last, cs is short for \u0026ldquo;current state\u0026rdquo;, and is an integer representing the state of the FSM. You can see that sensibly, it is initialized at state 1, and so, by looking at the FSM diagram, we can see it next expects a \u0026lsquo;\u0026gt;\u0026rsquo;. The zero state is special, and represents the \u0026ldquo;accept state\u0026rdquo;, that is, the state of a correctly exited FSM.\nThe code generated by Automa.generate_exec_code is a little more complicated, and also depends on the CodeGenContext. But it will increment p by one, read a byte from data, and execute a state transition (i.e. change cs) depending on the byte. If cs goes negative, it knows the input is malformed. If cs is zero, the parser has finished. If cs is positive, it will continue until end of file.\nAlright, let's create a parser using these functions!\n@eval function parse_fasta(data::Union{String,Vector{UInt8}}) $(Automa.generate_init_code(context, machine)) # p_end and p_eof were set to 0 and -1 in the init code, # we need to set them to the end of input, i.e. the length of `data`. p_end = p_eof = lastindex(data) # We just use an empty dict here because we don't want our parser # to do anything just yet - only parse the input $(Automa.generate_exec_code(context, machine, Dict{Symbol, Expr}())) # We need to make sure that we reached the accept state, else the # input did not parse correctly iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return nothing end; Does it work?\nfor (inputno, data) in enumerate([ \u0026quot;\u0026quot;, \u0026quot;\u0026gt;hi\\nACGT\\n\u0026quot;, \u0026quot;\u0026gt;hi\\nACGT\u0026quot;, \u0026quot;\u0026gt;hi\\nAA\\n\u0026gt;x\\nGTGATCGTAGTA\\n\u0026quot;, \u0026quot;\u0026gt;hi\\nthere!\u0026quot; ]) try parse_fasta(data) println(\u0026quot;Input $inputno: parsed successfully!\u0026quot;) catch e println(\u0026quot;Input $inputno: \u0026quot;, e) end end Input 1: parsed successfully! Input 2: parsed successfully! Input 3: ErrorException(\u0026quot;failed to parse on byte 9\u0026quot;) Input 4: parsed successfully! Input 5: ErrorException(\u0026quot;failed to parse on byte 5\u0026quot;) Excellent. It correctly parsed the empty data, input 2 and 4. Input 3 lacked a trailing newline, and input 5 didn't have a proper sequence.\nBut just parsing a file is not too useful. 
Normally, we parse a file in order to load it into some kind of datastructure to manipulate it. That's where the action dict comes into the picture. We can make Automa execute arbitrary Julia code during parsing. Here is how it works:\n When defining our machine, we add action names, in the form of Julia Symbols to certain regexes. For example, we may define that exiting the regex sequence should execute the action :exit_sequence. Like other Julia identifiers, the name can be chosen arbitrarily. Using the action dict, which was of type Dict{Symbol, Expr}, we can map action names to Julia code. This tells Automa what each action name means. The actions expressions are put into the function we create at parse time, so there is no runtime cost to this. First, we redefine our machine. This time, we add actions:\nmachine = let # Define SimpleFasta syntax newline = re\u0026quot;\\n\u0026quot; header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; sequence = re\u0026quot;[ACGT]+\u0026quot; record = header * newline * sequence * newline fasta = re.rep(record) # Define SimpleFasta actions using arbitrary names header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] sequence.actions[:enter] = [:enter] sequence.actions[:exit] = [:exit_sequence] record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; If we visualize the machine now, the actions will be printed on the relevant edges in the FSM diagram. For example, the newline after a header is labeled '\\n'/exit_header, meaning the transition will happen at the input byte '\\n', and it will execute the action exit_header.\nAlso note that the diagram structure itself changed, because the entering state and exit state are now distinct. The exit state 6 will execute :exit_record at the end of file (EOF), whereas if the FSM ends at state 1, :exit_record is not executed.\ndisplay_machine(machine) Let's also define a simple data structure to parse the SimpleFasta records into. Like our SimpleFasta format, let's not overcomplicate things:\nusing BioSequences struct FastaRecord{A} header::String sequence::LongSequence{A} end FastaRecord(h, sequence::LongSequence{A}) where A = FastaRecord{A}(h, sequence); And now our action dict. I can use the variables data, p, p_end etc. in the code:\nactions = Dict( # Record which byte position marks the beginning of header or sequence :enter =\u0026gt; quote mark = p end, :exit_header =\u0026gt; quote header = String(data[mark+1:p-1]) end, :exit_sequence =\u0026gt; quote sequence = LongSequence{A}(data[mark:p-1]) end, :exit_record =\u0026gt; quote record = FastaRecord(header, sequence) push!(records, record) end, ); @eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A \u0026lt;: Alphabet} # We can't control the order that the code from out action dict is executed, so # we define variables beforehand so we can be sure they are not used before # they are defined. mark = 0 records = FastaRecord{A}[] header = \u0026quot;\u0026quot; sequence = LongSequence{A}() # The rest is just like in our previous example, except now we use the # action dict we just created. 
$(Automa.generate_init_code(context, machine)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, machine, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return records end; We can verify that it works:\nparse_fasta(DNAAlphabet{4}, \u0026quot;\u0026gt;hello\\nTAG\\n\u0026gt;there\\nTAA\\n\u0026quot;) 2-element Array{FastaRecord{DNAAlphabet{4}},1}: FastaRecord{DNAAlphabet{4}}(\u0026quot;hello\u0026quot;, TAG) FastaRecord{DNAAlphabet{4}}(\u0026quot;there\u0026quot;, TAA) Redesign the format With the basics now covered, let's step up the game a tad and make it parse SimpleFasta records with multiple lines per sequence. We first define a seqline pattern that looks like the old sequence pattern, and build a new sequence pattern from multiple seqline patterns.\nmachine = let # Define SimpleFasta syntax header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; seqline = re\u0026quot;[ACGT]+\u0026quot; sequence = seqline * re.rep(re\u0026quot;\\n\u0026quot; * seqline) record = header * re\u0026quot;\\n\u0026quot; * sequence * re\u0026quot;\\n\u0026quot; fasta = re.rep(record) # Define SimpleFasta actions header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] seqline.actions[:enter] = [:enter] seqline.actions[:exit] = [:exit_seqline] record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; The machine now has a subtle change with a small loop between node 5 and 6 representing the parser seeing multiple sequence lines.\ndisplay_machine(machine) We also need to update the actions. Here, I use :(this syntax), which is equivalent to\nquote this syntax end but more readable for one-liners. I keep a buffer containing the sequence, and append every sequence line to the buffer at the end of each seqline. When the record is done, I create the sequence and empty the buffer for the next record.\nactions = Dict( :enter =\u0026gt; :(mark= p), :exit_header =\u0026gt; :(header = String(data[mark+1:p-1])), :exit_seqline =\u0026gt; quote doff = length(seqbuffer) + 1 resize!(seqbuffer, length(seqbuffer) + p - mark) copyto!(seqbuffer, doff, data, mark, p-mark) end, :exit_record =\u0026gt; quote sequence = LongSequence{A}(seqbuffer) empty!(seqbuffer) record = FastaRecord(header, sequence) push!(records, record) end, ); We need to redefine parse_fasta, now also containing the seqbuffer:\n@eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A \u0026lt;: Alphabet} mark = 0 records = FastaRecord{A}[] header = \u0026quot;\u0026quot; sequence = LongSequence{A}() seqbuffer = UInt8[] $(Automa.generate_init_code(context, machine)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, machine, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return records end; And we can confirm it now works for multiple sequences!\nparse_fasta(DNAAlphabet{2}, \u0026quot;\u0026gt;hello\\nTAG\\nGGG\\n\u0026gt;there\\nTAA\\nTAG\\n\u0026quot;) 2-element Array{FastaRecord{DNAAlphabet{2}},1}: FastaRecord{DNAAlphabet{2}}(\u0026quot;hello\u0026quot;, TAGGGG) FastaRecord{DNAAlphabet{2}}(\u0026quot;there\u0026quot;, TAATAG) Even more complexity One of the greatest things about Automa is how we can parse quite complicated formats without needing to manually construct much code. 
For example, here, I change only the regular expressions:\n I allow optional Windows-style newlines (\\r\\n) I allow trailing whitespace on each line (re.space() \\ (re\u0026quot;\\r\u0026quot; | re\u0026quot;\\n\u0026quot;) means all space characters except \\r or \\n), but not whitespace inside headers Sequences can only contain all UIPAC ambiguous RNA or DNA nucleotides The trailing record no longer needs a trailing newline to be valid. It has the same actions as a regular record. machine = let # Define FASTA syntax newline = re.opt(\u0026quot;\\r\u0026quot;) * re\u0026quot;\\n\u0026quot; hspace = re.space() \\ (re\u0026quot;\\r\u0026quot; | re\u0026quot;\\n\u0026quot;) header = re\u0026quot;\u0026gt;\u0026quot; * re.rep1((re.any() \\ re.space())) * re.opt(hspace) seqline = re\u0026quot;[acgturmkyswbdhvnACGTURMKYSWBDHVN\\-*]+\u0026quot; * re.opt(hspace) sequence = seqline * re.rep(newline * seqline) record = header * newline * sequence * newline trailing_record = header * newline * sequence * re.rep(newline * re.opt(hspace)) fasta = re.rep(newline) * re.rep(record) * re.opt(trailing_record) # Define FASTA actions header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] seqline.actions[:enter] = [:enter] seqline.actions[:exit] = [:exit_seqline] record.actions[:exit] = [:exit_record] trailing_record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; The machine is now a bit more complicated. But who cares, it's automatically generated. Look how easy it was to generate!\ndisplay_machine(machine) Pitfall 1: Ambiguous parsers Besides its complexity, building parsers with Automa has some pitfalls - notably those caused by Automa transitioning state for every input byte.\nUnfortunately, it's very easy to accidentally create a machine that can't possibly figure out what action to take when looking only at one byte at a time. The following simple machine will not compile (on the master branch of Automa).\nbad_machine = let left = re\u0026quot;Turn left!\u0026quot; right = re\u0026quot;Turn right!\u0026quot; left_or_right = left | right left.actions[:enter] = [:turn_left] right.actions[:enter] = [:turn_right] Automa.compile(left_or_right) end Ambiguous DFA: Input 0x54 can lead to actions [:turn_right] or [:turn_left] Stacktrace: [1] error(::String) at ./error.jl:33 [2] validate_paths(::Array{Tuple{Union{Nothing, Automa.Edge},Array{Symbol,1}},1}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:176 [3] validate_nfanodes(::Automa.StableSet{Automa.NFANode}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:195 [4] nfa2dfa(::Automa.NFA, ::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:104 [5] compile(::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [6] compile(::Automa.RegExp.RE) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [7] top-level scope at In[27]:9 [8] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 Automa refuse to create this FSM. Why? Well, the first byte it expects is a T (or 0x54) - but there is no way of knowing whether it's entering left or right when it sees that byte, and these two patters have conflicting actions! If they had the same actions, there would be no problem.\nThis situation is highly artificial, but the situation happens often in real life. 
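As the text says, the problem is only that the two branches demand different actions on the same first byte. A quick sanity check (my own, not from the tutorial): give both alternatives the same action name and the very same pattern compiles without complaint.

ok_machine = let
    left = re"Turn left!"
    right = re"Turn right!"
    # Identical action lists on both branches: whichever branch the input turns out
    # to be, the same action runs, so there is nothing for Automa to disambiguate.
    left.actions[:enter] = [:turn]
    right.actions[:enter] = [:turn]
    Automa.compile(left | right)
end;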
Here's a very subtle change to one of the previous parsers:\nbad_machine = let # Define SimpleFasta syntax header = re\u0026quot;\u0026gt;[a-z]+\u0026quot; seqline = re\u0026quot;[ACGT]+\u0026quot; sequence = seqline * re.rep(re\u0026quot;\\n\u0026quot; * seqline) record = header * re\u0026quot;\\n\u0026quot; * sequence fasta = re.rep(record * re\u0026quot;\\n\u0026quot;) # Define SimpleFasta actions header.actions[:enter] = [:enter] header.actions[:exit] = [:exit_header] seqline.actions[:enter] = [:enter] seqline.actions[:exit] = [:exit_seqline] record.actions[:exit] = [:exit_record] Automa.compile(fasta) end; Ambiguous DFA: Input 0x0a can lead to actions [:exit_seqline, :exit_record] or [:exit_seqline] Stacktrace: [1] error(::String) at ./error.jl:33 [2] validate_paths(::Array{Tuple{Union{Nothing, Automa.Edge},Array{Symbol,1}},1}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:176 [3] validate_nfanodes(::Automa.StableSet{Automa.NFANode}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:195 [4] nfa2dfa(::Automa.NFA, ::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:104 [5] compile(::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [6] compile(::Automa.RegExp.RE) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55 [7] top-level scope at In[28]:16 [8] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 After a seqline, when encountering a newline, there is no way of knowing whether this marks the end of a record or whether it continues at the next line. Automa can't look ahead, it has to make a decision at every byte.\nHere, the solution is to move the newline from the definition of fasta to that of sequence. That way, the newline functions as a kind of one-byte lookahead - if the byte after the newline is a \u0026gt;, it knows the record is complete.\nIn general, encountering situations like this is usually a sign that the parser is badly built - or that the format is. You can usually resolve it by using single-byte lookahead like above. If not, it is possible to optionally disable the check for ambiguous actions by passing a keyword to the Automa.parse function. But beware that this may lead to absurd results.\nPitfall 2: No recursively defined patterns Some formats are naturally recursive. The Newick format, for example, defines \u0026ldquo;subtree\u0026rdquo; in terms of \u0026ldquo;internal\u0026rdquo;, which is defined in terms of \u0026ldquo;branchset\u0026rdquo;, defined in terms of \u0026ldquo;branch\u0026rdquo; which is defined in terms of\u0026hellip; \u0026ldquo;subtree\u0026rdquo;.\nSomething like that will not work Automa:\nsimple_newick = let name = re\u0026quot;[a-z_]\u0026quot; clade = name | (re\u0026quot;\\(\u0026quot; * cladeset * \u0026quot;)\u0026quot;) cladeset = clade * re.rep(re\u0026quot;,\u0026quot; * clade) Automa.compile(cladeset * re\u0026quot;;\u0026quot;) end UndefVarError: cladeset not defined Stacktrace: [1] top-level scope at In[29]:3 [2] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 Because, in general, you can't refer to objects in Julia that have not yet been defined. Worse, even if the syntactical challenge was solved, recursive patterns usually only make sense if you have lookahead. 
Newick files, for example, simply can't be parsed by FSMs, because every time you see an open parenthesis, you need to make sure there is a closing one further in the file, and this requires lookahead that can't be resolved by parsing one byte at a time.\nLuckily, because Automa can execute arbitrary Julia code, we can have the FSM manipulate a stack and turn the FSM into a pushdown automaton, which can handle formats like Newick. It does require a little more manual fiddling, and is less elegant:\nsimple_newick = let name = re\u0026quot;[a-z_]+\u0026quot; cladestart = re\u0026quot;\\(\u0026quot; cladestop = re\u0026quot;\\)\u0026quot; cladesep = re\u0026quot;,\u0026quot; nodechange = cladestart | cladestop | cladesep newick_element = nodechange * re.opt(name) tree = re.opt(name) * re.rep(newick_element) * re\u0026quot;;\u0026quot; cladestart.actions[:enter] = [:push] cladestop.actions[:enter] = [:pop] cladesep.actions[:enter] = [:pop, :push] Automa.compile(tree) end actions = Dict( :push =\u0026gt; quote level -= 1 end, :pop =\u0026gt; quote iszero(level) \u0026amp;\u0026amp; error(\u0026quot;Too many closing parentheses\u0026quot;) level += 1 end ); @eval function parse_newick(data::Union{String,Vector{UInt8}}) level = 0 $(Automa.generate_init_code(context, simple_newick)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, simple_newick, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) iszero(level) || error(\u0026quot;Too many opening parentheses\u0026quot;) end; And it sort of works:\nprintln(\u0026quot;Does this parse? \u0026quot;, parse_newick(\u0026quot;(hello,hi);\u0026quot;)) println(\u0026quot;Does this parse? \u0026quot;, parse_newick(\u0026quot;(hello,hi)(;\u0026quot;)) Does this parse? true Too many opening parentheses Stacktrace: [1] error(::String) at ./error.jl:33 [2] parse_newick(::String) at ./In[31]:8 [3] top-level scope at In[32]:2 [4] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091 Here, I needed to manually keep track of the level of nesting. In fact, I would have to keep track of more stuff to disallow inputs like ();, removing some of the point of Automa - namely, that it is automatic.\nIn the next part of this tutorial, I will show how to use Automa to create honest-to-God parsers which integrate with the BioJulia ecosystem and can work on streamed data.\nOptimizing Automa's parsers are pretty fast. To create hand-written parsers of comparable speeds, you need to microoptimize every operation, which is annoying. That being said, for actually loading parsed files to be fast, you need to optimize the user-defined actions in the actions dictionary, too.\nHow fast is our un-optimized Fasta parser already? Let's test it on some data and find out. I'll make 50k sequences with 2k base pairs each, for a total of 100 MB data.\nRemember to re-run the block of code where the parse_fasta function is defined since we changed the machine.\n# Create 50k records with 2kbp each function generate_data() open(\u0026quot;/tmp/fasta.fna\u0026quot;, \u0026quot;w\u0026quot;) do file for seq in 1:50000 println(file, '\u0026gt;', join(rand('a':'z', 16))) for line in 1:20 println(file, join(rand(\u0026quot;ACGT\u0026quot;, 100))) end end end end generate_data() function parsedata() open(\u0026quot;/tmp/fasta.fna\u0026quot;) do file parse_fasta(DNAAlphabet{2}, read(file)) end end; parsedata(); @time parsedata(); 0.330489 seconds (200.04 k allocations: 138.861 MiB, 4.15% gc time) Alright! 
It does about 300 MB/s on my laptop. Not bad! I'd say for most applications, this speed is more than sufficient.\nBut this is Julia. We want to be able to optimize the crap out of it. So let's optimize the actions:\nactions = Dict( :enter =\u0026gt; :(mark = p), # Create string with only one copy of the data. :exit_header =\u0026gt; :(header = unsafe_string(pointer(data, mark + 1), p - mark - 1)), # Only resize buffer if it's too small, else just keep track of how many bytes are used. :exit_seqline =\u0026gt; quote N = p - mark length(seqbuffer) \u0026lt; filled + N \u0026amp;\u0026amp; resize!(seqbuffer, filled + N) copyto!(seqbuffer, filled+1, data, mark, N) filled += N end, # Use `copyto!` to only encode data directly from data vector into sequence # note this requires the latest versions of BioSequences. :exit_record =\u0026gt; quote sequence = copyto!(LongSequence{A}(filled), 1, seqbuffer, 1, filled) record = FastaRecord(header, sequence) push!(records, record) filled = 0 end, ); If we want to optimize, we also need to use the fastest CodeGenContext. We use the :goto-generator. This creates machine code using @goto-statements, which creates very long and completely unreadable code - but it's fast. Also, who cares if it's hard to read. It's machine-generated code. We disable the FSM's boundschecks when accessing the data vector - since we don't manipulate the position p manually, we are confident it will not go out of bounds. And we allow the code generator to unroll loops:\ncontext = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=4); We also need to modify the parse function because we use a new variable called filled and need to initialize it.\n@eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A \u0026lt;: Alphabet} mark = 0 records = FastaRecord{A}[] header = \u0026quot;\u0026quot; sequence = LongSequence{A}() seqbuffer = UInt8[] filled = 0 $(Automa.generate_init_code(context, machine)) p_end = p_eof = lastindex(data) $(Automa.generate_exec_code(context, machine, actions)) iszero(cs) || error(\u0026quot;failed to parse on byte \u0026quot;, p) return records end; parsedata(); @time parsedata(); 0.162113 seconds (150.04 k allocations: 133.520 MiB, 7.90% gc time) Okay, we're at slightly above 600 MB/s. Profiling confirms nearly all time is spent encoding the DNA sequences to LongSequence. We can make it faster still by not storing the data as a LongSequence, and instead make it a \u0026ldquo;lazy\u0026rdquo; object that only constructs the sequence upon demand. Even further, we could avoid heap-allocating each sequence individually. 
But these optimizations do not pertain directly to Automa, and I will leave them here.\n","date":1598797689,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598797689,"objectID":"9e06435c7c0526e80ee1ebbfe9d021c5","permalink":"https://BioJulia.github.io/post/automa1/","publishdate":"2020-08-30T16:28:09.16+02:00","relpermalink":"/post/automa1/","section":"post","summary":"Find this notebook at https://github.com/jakobnissen/automa_tutorial\nIn bioinformatics, we have a saying:\n The first step of any bioinformatics project is to define a new file format, incompatible with all previous ones.\n The situation might not be quite as bad as the saying implies, but it is true that we have a lot of different file formats, representing the various kinds of data we work with.\nFor that reason, creating file parsers is a central task in bioinformatics, and has almost become a craft in itself.","tags":[],"title":"Tutorial to Automa: Part 1","type":"post"},{"authors":["Jakob Nybo Nissen"],"categories":[],"content":"Find this notebook at https://github.com/jakobnissen/hardware_introduction\nProgramming is used in many fields of science today, where individual scientists often have to write custom code for their own projects. For most scientists, however, computer science is not their field of expertise; They have learned programming by necessity. I count myself as one of them. While we may be reasonably familiar with the software side of programming, we rarely have even a basic understanding of how computer hardware impacts code performance.\nThe aim of this tutorial is to give non-professional programmers a brief overview of the features of modern hardware that you must understand in order to write fast code. It will be a distillation of what have learned the last few years. This tutorial will use Julia because it allows these relatively low-level considerations to be demonstrated easily in a high-level, interactive language.\nThis is not a guide to the Julia programming language To write fast code, you must first understand your programming language and its idiosyncrasies. But this is not a guide to the Julia programming language. I recommend reading the performance tips section of the Julia documentation.\nThis is not an explanation of specific datastructures or algorithms Besides knowing your language, you must also know your own code to make it fast. You must understand the idea behind big-O notation, why some algorithms are faster than others, and how different data structures work internally. Without knowing what an Array is, how could you possibly optimize code making use of arrays?\nThis too, is outside the scope of this paper. However, I would say that as a minimum, a programmer should have an understanding of:\n How a binary integer is represented in memory How a floating point number is represented in memory (learning this is also necessary to understand computational inacurracies from floating point operations, which is a must when doing scientific programming) The memory layout of a String including ASCII and UTF-8 encoding The basics of how an Array is structured, and what the difference between a dense array of e.g. integers and an array of references to objects are The principles behind how a Dict (i.e. 
hash table) and a Set works Furthermore, I would also recommend familiarizing yourself with:\n Heaps Deques Tuples This is not a tutorial on benchmarking your code To write fast code in practice, it is necessary to profile your code to find bottlenecks where your machine spends the majority of the time. One must benchmark different functions and approaches to find the fastest in practice. Julia (and other languages) have tools for exactly this purpose, but I will not cover them here.\nContent Minimize disk writes CPU cache Alignment Inspect generated assembly Minimize allocations Exploit SIMD vectorization Struct of arrays Use specialized CPU instructions Inline small functions Unroll tight loops Avoid unpredictable branches Multithreading GPUs Before you begin: Install packages # If you don't already have these packages installed, outcomment these lines and run it: # using Pkg # Pkg.add(\u0026quot;BenchmarkTools\u0026quot;) # Pkg.add(\u0026quot;StaticArrays\u0026quot;) using StaticArrays using BenchmarkTools \u0026quot;Print median elapsed time of benchmark\u0026quot; function print_median(trial) println(\u0026quot;Median time: \u0026quot;, BenchmarkTools.prettytime(median(trial).time)) end; The basic structure of computer hardware For now, we will work with a simplified mental model of a computer. Through this document, I will add more details to our model as they become relevant.\n[CPU] ↔ [RAM] ↔ [DISK] \nIn this simple diagram, the arrows represent data flow in either direction. The diagram shows three important parts of a computer:\n The central processing unit (CPU) is a chip the size of a stamp. This is where all the computation actually occurs, the brain of the computer. Random access memory (RAM, or just \u0026ldquo;memory\u0026rdquo;) is the short-term memory of the computer. This memory requires electrical power to maintain, and is lost when the computer is shut down. RAM serves as a temporary storage of data between the disk and the CPU. Much of time spent \u0026ldquo;loading\u0026rdquo; various applications and operating systems is actually spent moving data from disk to RAM and unpacking it there. A typical consumer laptop has around 10^11 bits of RAM memory. The disk is a mass storage unit. This data on disk persists after power is shut down, so the disk contains the long-term memory of the computer. It is also much cheaper per gigabyte than RAM, with consumer PCs having around 10^13 bits of disk space. Avoid write to disk where possible Even with a fast mass storage unit such as a solid state drive (SSD) or even the newer Optane technology, disks are many times, usually thousands of times, slower than RAM. In particular, seeks, i.e. switching to a new point of the disk to read from or write to, is slow. As a consequence, writing a large chunk of data to disk is much faster than writing many small chunks.\nTo write fast code, you must therefore make sure to have your working data in RAM, and limit disk writes as much as possible.\nThe following example serves to illustrate the difference in speed: The first function opens a file, accesses one byte from the file, and closes it again. 
The second function randomly accesses 1,000,000 integers from RAM.\n# Open a file function test_file(path) open(path) do file # Go to 1000'th byte of file and read it seek(file, 1000) read(file, UInt8) end end @time test_file(\u0026quot;test_file\u0026quot;) # Randomly access data N times function random_access(data::Vector{UInt}, N::Integer) n = rand(UInt) mask = length(data) - 1 @inbounds for i in 1:N n = (n \u0026gt;\u0026gt;\u0026gt; 7) ⊻ data[n \u0026amp; mask + 1] end return n end data = rand(UInt, 2^24) @time random_access(data, 1000000); 0.001321 seconds (15 allocations: 976 bytes) 0.130833 seconds Benchmarking this is a little tricky, because the first invokation will include the compilation times of both functions. And in the second invokation, your operating system will have stored a copy of the file (or cached the file) in RAM, making the file seek almost instant. To time it properly, run it once, then change the file, and run it again. So in fact, we should update our computer diagram:\n[CPU] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK] \nOn my computer, finding a single byte in a file (including opening and closing the file) takes about 13 miliseconds, and accessing 1,000,000 integers from memory takes 131 miliseconds. So RAM is on the order of 10,000 times faster than disk.\nWhen working with data too large to fit into RAM, load in the data chunk by chunk, e.g. one line at a time, and operate on that. That way, you don't need random access to your file and thus need to waste time on extra seeks, but only sequential access. And you must strive to write your program such that any input files are only read through once, not multiple times.\nIf you need to read a file byte by byte, for example when parsing a file, great speed improvements can be found by buffering the file. When buffering, you read in larger chunks, the buffer, to memory, and when you want to read from the file, you check if it's in the buffer. If not, read another large chunk into your buffer from the file. This approach minimizes disk reads. Both your operating system and your programming language will make use of caches, however, sometimes it is necessary to manually buffer your files.\nCPU cache The RAM is faster than the disk, and the CPU in turn is faster than RAM. A CPU ticks like a clock, with a speed of about 3 GHz, i.e. 3 billion ticks per second. One \u0026ldquo;tick\u0026rdquo; of this clock is called a clock cycle. While this is not really true, you may imagine that every cycle, the CPU executes a single, simple command called a CPU instruction which does one operation on a small piece of data. The clock speed then can serve as a reference for other timings in a computer. It is worth realizing that in a single clock cycle, a photon will travel only around 10 cm, and this puts a barrier to how fast memory (which is placed some distance away from the CPU) can operate. In fact, modern computers are so fast that a significant bottleneck in their speed is the delay caused by the time needed for electricity to move through the wires inside the computer.\nOn this scale, reading from RAM takes around 100 clock cycles. Similarly to how the slowness of disks can be mitigated by copying data to the faster RAM, data from RAM is copied to a smaller memory chip physically on the CPU, called a cache. The cache is faster because it is physically on the CPU chip (reducing wire delays), and because it uses a faster type of RAM, static RAM, instead of the slower (but cheaper to manufacture) dynamic RAM. 
Because it must be placed on the CPU, limiting its size, and because it is more expensive to produce, a typical CPU cache only contains around 10^8 bits, around 1000 times less than RAM. There are actually multiple layers of CPU cache, but here we simplify it and just refer to \u0026ldquo;the cache\u0026rdquo; as one single thing:\n[CPU] ↔ [CPU CACHE] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK] \nWhen the CPU requests a piece of data from the RAM, say a single byte, it will first check if the memory is already in cache. If so, it will read from it from there. This is much faster, usually just a few clock cycles, than access to RAM. If not, we have a cache miss, and your program will stall for tens of nanoseconds while your computer copies data from RAM into the cache.\nIt is not possible, except in very low-level languages, to manually manage the CPU cache. Instead, you must make sure to use your cache effectively.\nEffective use of the cache comes down to locality, temporal and spacial locality:\n By temporal locality, I mean that data you recently accessed likely resides in cache already. Therefore, if you must access a piece of memory multiple times, make sure you do it close together in time. By spacial locality, I mean that you should access data from memory addresses close to each other. Your CPU does not copy just the requested bytes to cache. Instead, your CPU will always copy larger chunk of data (usually 512 consecutive bits). From this information, one can deduce a number of simple tricks to improve performance:\n Use as little memory as possible. When your data takes up less memory, it is more likely that your data will be in cache. Remember, a CPU can do approximately 100 small operations in the time wasted by a single cache miss.\n When reading data from RAM, read it sequentially, such that you mostly have the next data you will be using in cache, instead of in a random order. In fact, modern CPUs will detect if you are reading in data sequentially, and prefetch upcoming data, that is, fetching the next chunk while the current chunk is being processed, reducing delays caused by cache misses.\n To illustrate this, let's compare the time spent on the random_access function above with a new function linear_access. This function performans the same computation as random_access, but uses i instead of n to access the array, so it access the data in a linear fashion. Hence, the only difference is the memory access pattern.\nNotice the large discrepency in time spent.\nfunction linear_access(data::Vector{UInt}, N::Integer) n = rand(UInt) mask = length(data) - 1 for i in 1:N n = (n \u0026gt;\u0026gt;\u0026gt; 7) ⊻ data[i \u0026amp; mask + 1] end return n end print_median(@benchmark random_access(data, 4096)) print_median(@benchmark linear_access(data, 4096)) Median time: 1.997 μs Median time: 435.631 μs This also has implications for your data structures. Hash tables such as Dicts and Sets are inherently cache inefficient and almost always cause cache misses, whereas arrays don't.\nMany of the optimizations in this document indirectly impact cache use, so this is important to have in mind.\nMemory alignment As just mentioned, your CPU will move 512 consecutive bits (64 bytes) to and from main RAM to cache at a time. These 64 bytes are called a cache line. Your entire main memory is segmented into cache lines. For example, memory addresses 0 to 63 is one cache line, addresses 64 to 127 is the next, 128 to 191 the next, et cetera. 
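As a concrete illustration of this segmentation (a hypothetical helper, not from the original text), the cache line an address belongs to is simply the address divided by 64:

# Addresses 0-63 give line 0, 64-127 give line 1, and so on.
cacheline_index(address::UInt) = address >>> 6   # same as div(address, 64)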
Your CPU may only request one of these cache lines from memory, and not e.g. the 64 bytes from address 30 to 93.\nThis means that some data structures can straddle the boundaries between cache lines. If I request a 64-bit (8 byte) integer at adress 60, the CPU must first generate two memory requests from the single requested memory address (namely to get cache lines 0-63 and 64-127), and then retrieve the integer from both cache lines, wasting time.\nThe time wasted can be significant. In a situation where in-cache memory access proves the bottleneck, the slowdown can approach 2x. In the following example, I use a pointer to repeatedly access an array at a given offset from a cache line boundary. If the offset is in the range 0:56, the integers all fit within one single cache line, and the function is fast. If the offset is in 57:63 all integers will straddle cache lines.\nfunction alignment_test(data::Vector{UInt}, offset::Integer) # Jump randomly around the memory. n = rand(UInt) mask = (length(data) - 9) ⊻ 7 GC.@preserve data begin # protect the array from moving in memory ptr = pointer(data) iszero(UInt(ptr) \u0026amp; 63) || error(\u0026quot;Array not aligned\u0026quot;) ptr += (offset \u0026amp; 63) for i in 1:4096 n = (n \u0026gt;\u0026gt;\u0026gt; 7) ⊻ unsafe_load(ptr, (n \u0026amp; mask + 1) % Int) end end return n end data = rand(UInt, 256 + 8); print_median(@benchmark alignment_test(data, 0)) print_median(@benchmark alignment_test(data, 60)) Median time: 6.561 μs Median time: 12.978 μs In the example above, the short vector easily fit into cache. If we increase the vector size, we will force cache misses. Note that the effect of misalignment is dwarfed by the time wasted on cache misses:\ndata = rand(UInt, 1 \u0026lt;\u0026lt; 24 + 8) print_median(@benchmark alignment_test(data, 10)) print_median(@benchmark alignment_test(data, 60)) Median time: 423.401 μs Median time: 497.868 μs Fortunately, the compiler does a few tricks to make it less likely that you will access misaligned data. First, Julia (and other compiled languages) always places new objects in memory at the boundaries of cache lines. When an object is placed right at the boundary, we say that it is aligned. Julia also aligns the beginning of larger arrays:\nmemory_address = reinterpret(UInt, pointer(data)) @assert iszero(memory_address % 64) Note that if the beginning of an array is aligned, then it's not possible for 1-, 2-, 4-, or 8-byte objects to straddle cache line boundaries, and everything will be aligned.\nIt would still be possible for an e.g. 7-byte object to be misaligned in an array. In an array of 7-byte objects, the 10th object would be placed at byte offset 7 * (10-1) = 63, and the object would straddle the cache line. However, the compiler usually does not allow struct with a nonstandard size for this reason. If we define a 7-byte struct:\nstruct AlignmentTest a::UInt32 # 4 bytes + b::UInt16 # 2 bytes + c::UInt8 # 1 byte = 7 bytes? 
end Then we can use Julia's introspection to get the relative position of each of the three integers in an AlignmentTest object in memory:\nfunction get_mem_layout(T) for fieldno in 1:fieldcount(T) println(\u0026quot;Name: \u0026quot;, fieldname(T, fieldno), \u0026quot;\\t\u0026quot;, \u0026quot;Size: \u0026quot;, sizeof(fieldtype(T, fieldno)), \u0026quot; bytes\\t\u0026quot;, \u0026quot;Offset: \u0026quot;, fieldoffset(T, fieldno), \u0026quot; bytes.\u0026quot;) end end println(\u0026quot;Size of AlignmentTest: \u0026quot;, sizeof(AlignmentTest), \u0026quot; bytes.\u0026quot;) get_mem_layout(AlignmentTest) Size of AlignmentTest: 8 bytes. Name: a\tSize: 4 bytes\tOffset: 0 bytes. Name: b\tSize: 2 bytes\tOffset: 4 bytes. Name: c\tSize: 1 bytes\tOffset: 6 bytes. We can see that, despite an AlignmentTest only having 4 + 2 + 1 = 7 bytes of actual data, it takes up 8 bytes of memory, and accessing an AlignmentTest object from an array will always be aligned.\nAs a coder, there are only a few situations where you can face alignment issues. I can come up with two:\n If you manually create object with a strange size, e.g. by accessing a dense integer array with pointers. This can save memory, but will waste time. My implementation of a Cuckoo filter does this to save space. During matrix operations. If you have a matrix the columns are sometimes unaligned because it is stored densely in memory. E.g. in a 15x15 matrix of Float32s, only the first column is aligned, all the others are not. This can have serious effects when doing matrix operations: I've seen benchmarks where an 80x80 matrix/vector multiplication is 2x faster than a 79x79 one due to alignment issues. Assembly code To run, any program must be translated, or compiled to CPU instructions. The CPU instructions are what is actually running on your computer, as opposed to the code written in your programming language, which is merely a description of the program. CPU instructions are usually presented to human beings in assembly. Assembly is a programming language which has a one-to-one correspondance with CPU instructions.\nViewing assembly code will be useful to understand some of the following sections which pertain to CPU instructions.\nIn Julia, we can easily inspect the compiled assembly code using the code_native function or the equivalent @code_native macro. We can do this for a simple function:\n# View assembly code generated from this function call function foo(x) s = zero(eltype(x)) @inbounds for i in eachindex(x) s = x[i ⊻ s] end return s end # Actually running the function will immediately crash Julia, so don't. 
@code_native foo(data) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[43]:4 within `foo' ; │┌ @ abstractarray.jl:212 within `eachindex' ; ││┌ @ abstractarray.jl:95 within `axes1' ; │││┌ @ abstractarray.jl:75 within `axes' ; ││││┌ @ In[43]:3 within `size' movq\t24(%rdi), %rax ; ││││└ ; ││││┌ @ tuple.jl:157 within `map' ; │││││┌ @ range.jl:320 within `OneTo' @ range.jl:311 ; ││││││┌ @ promotion.jl:409 within `max' testq\t%rax, %rax ; │└└└└└└ jle\tL75 movq\t%rax, %rcx sarq\t$63, %rcx andnq\t%rax, %rcx, %rcx movq\t(%rdi), %rdx ; │ @ In[43]:5 within `foo' ; │┌ @ int.jl:860 within `xor' @ int.jl:301 negq\t%rcx movl\t$1, %esi xorl\t%eax, %eax nopw\t%cs:(%rax,%rax) nopl\t(%rax) L48: xorq\t%rsi, %rax ; │└ ; │┌ @ multidimensional.jl:543 within `getindex' @ array.jl:788 movq\t-8(%rdx,%rax,8), %rax ; │└ ; │┌ @ range.jl:597 within `iterate' ; ││┌ @ promotion.jl:398 within `==' leaq\t(%rcx,%rsi), %rdi addq\t$1, %rdi ; ││└ addq\t$1, %rsi ; ││┌ @ promotion.jl:398 within `==' cmpq\t$1, %rdi ; │└└ jne\tL48 ; │ @ In[43]:7 within `foo' retq L75: xorl\t%eax, %eax ; │ @ In[43]:7 within `foo' retq nop ; └ Let's break it down:\nThe lines beginning with ; are comments, and explain which section of the code the following instructions come from. They show the nested series of function calls, and where in the source code they are. You can see that eachindex, calls axes1, which calls axes, which calls size. Under the comment line containing the size call, we see the first CPU instruction. The instruction name is on the far left, movq. The name is composed of two parts, mov, the kind of instruction (to move content to or from a register), and a suffix q, short for \u0026ldquo;quad\u0026rdquo;, which means 64-bit integer. There are the following suffixes: b (byte, 8 bit), w (word, 16 bit), l, (long, 32 bit) and q (quad, 64 bit).\nThe next two columns in the instruction, 24(%rdi) and %rax are the arguments to movq. These are the names of the registers (we will return to registers later) where the data to operate on are stored.\nYou can also see (in the larger display of assembly code) that the code is segmented into sections beginning with a name starting with \u0026ldquo;L\u0026rdquo;, for example there's a section L48. These sections are jumped between using if-statements, or branches. Here, section L48 marks the actual loop. You can see the following two instructions in the L48 section:\n; ││┌ @ promotion.jl:401 within `==' cmpq $1, %rdi ; │└└ jne L48 The first instruction cmpq (compare quad) compares the data in registry rdi, which hold the data for the number of iterations left (plus one), with the number 1, and sets certain flags (wires) in the CPU based on the result. The next instruction jne (jump if not equal) makes a jump if the \u0026ldquo;equal\u0026rdquo; flag is not set in the CPU, which happens if there is one or more iterations left. You can see it jumps to L48, meaning this section repeat.\nFast instruction, slow instruction Not all CPU instructions are equally fast. Below is a table of selected CPU instructions with very rough estimates of how many clock cycles they take to execute. You can find much more detailed tables in this document. Here, I'll summarize the speed of instructions on modern Intel CPUs. It's very similar for all modern CPUs.\nCPUs instructions typically take multiple CPU cycles to complete. 
However, if an instruction uses different part of the CPU during its execution, the CPU can usually start a new instruction before the old one is finished: If some operation X takes, say 4 clock cycles, they may queue one or even two operations per clock cycle using a feature called instruction pipelining. Hence, instruction X has a latency of 4 cycles, meaning it takes 4 cycles for the instruction to complete. But if the CPU can queue a new instruction every single cycle, it can have a reciprocal throughput of 1 clock cycle, meaning on average, it only takes 1 cycle per operation.\nThe following table measures time in clock cycles:\n Instruction Latency Rec. throughp. move data 1 0.25 and/or/xor 1 0.25 test/compare 1 0.25 do nothing 1 0.25 int add/subtract 1 0.25 bitshift 1 0.5 float multiplication 5 0.5 vector int and/or/xor 1 0.5 vector int add/sub 1 0.5 vector float add/sub 4 0.5 vector float multiplic. 5 0.5 lea 3 1 int multiplic 3 1 float add/sub 3 1 float multiplic. 5 1 float division 15 5 vector float division 13 8 integer division 50 40 The lea instruction takes three inputs, A, B and C, where A must be 2, 4, or 8, and calculates AB + C. We'll come back to what the \u0026ldquo;vector\u0026rdquo; instructions do later.\nFor comparison, we may also add some very rough estimates of other sources of delays:\n Delay Cycles move memory from cache 1 misaligned memory read 10 cache miss 500 read from disk 5000000 If you have an inner loop executing millions of times, it may pay off to inspect the generated assembly code for the loop and check if you can express the computation in terms of fast CPU instructions. For example, if you have an integer you know to be 0 or above, and you want to divide it by 8 (discarding any remainder), you can instead do a bitshift, since bitshifts are way faster than integer division:\ndivide_slow(x) = div(x, 8) divide_fast(x) = x \u0026gt;\u0026gt;\u0026gt; 3; However, modern compilers are pretty clever, and will often figure out the optimal instructions to use in your functions to obtain the same result, by for example replacing an integer divide idivq instruction with a bitshift right (shrq) where applicable to be faster. You need to check the assembly code yourself to see:\n# Calling it with debuginfo=:none removes the comments in the assembly code code_native(divide_slow, (UInt,), debuginfo=:none) .section\t__TEXT,__text,regular,pure_instructions movq\t%rdi, %rax shrq\t$3, %rax retq nopl\t(%rax,%rax) Allocations and immutability As already mentioned, main RAM is much slower than the CPU cache. However, working in main RAM comes with an additional disadvantage: Your operating system (OS) keeps track of which process have access to which memory. If every process had access to all memory, then it would be trivially easy to make a program that scans your RAM for secret data such as bank passwords - or for one program to accidentally overwrite the memory of another program. Instead, every process is allocated a bunch of memory by the OS, and is only allowed to read or write to the allocated data.\nThe creation of new objects in RAM is termed allocation, and the destruction is called deallocation. Really, the (de)allocation is not really creation or destruction per se, but rather the act of starting and stopping keeping track of the memory. Memory that is not kept track of will eventually be overwritten by other data. 
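As a quick aside not in the original text: if you want to check whether an expression heap-allocates at all, Julia's built-in @allocated macro reports the number of bytes allocated while evaluating it:

# Number of bytes heap-allocated by the expression; here roughly 8 KiB
# (1000 integers of 8 bytes each, plus a small array header).
@allocated zeros(UInt, 1000)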
Allocation and deallocation take a significant amount of time depending on the size of objects, from a few tens to hundreds of nanoseconds per allocation.\nIn programming languages such as Julia, Python, R and Java, deallocation is automatically done using a program called the garbage collector (GC). This program keeps track of which objects are rendered unreachable by the programmer, and deallocates them. For example, if you do:\nthing = [1,2,3] thing = nothing Then there is no way to get the original array [1,2,3] back, it is unreachable. Hence it is simply wasting RAM, and doing nothing. It is garbage. Allocating and deallocating objects sometimes cause the GC to start its scan of all objects in memory and deallocate the unreachable ones, which causes significant lag. You can also start the garbage collector manually:\nGC.gc() The following example illustrates the difference in time spent in a function that allocates a vector with the new result relative to one which simply modifies the vector, allocating nothing:\nfunction increment(x::Vector{\u0026lt;:Integer}) y = similar(x) @inbounds for i in eachindex(x) y[i] = x[i] + 1 end return y end function increment!(x::Vector{\u0026lt;:Integer}) @inbounds for i in eachindex(x) x[i] = x[i] + 1 end return x end data = rand(UInt, 2^10); @btime increment(data); @btime increment!(data); 419.655 ns (1 allocation: 8.13 KiB) 70.350 ns (0 allocations: 0 bytes) On my computer, the allocating function is about 5x slower. This is due to a few properties of the code:\n First, the allocation itself takes time Second, the allocated objects eventually have to be deallocated, also taking time Third, repeated allocations triggers the GC to run, causing overhead Fourth, more allocations sometimes means less efficient cache use because you are using more memory For these reasons, performant code should keep allocations to a minimum. Note that the @btime macro prints the number and size of the allocations. This information is given because it is assumed that any programmer who cares to benchmark their code will be interested in reducing allocations.\nNot all objects need to be allocated Inside RAM, data is kept on either the stack or the heap. The stack is a simple data structure with a beginning and end, similar to a Vector in Julia. The stack can only be modified by adding or subtracting elements from the end, analogous to a Vector with only the two mutating operations push! and pop!. These operations on the stack are very fast. When we talk about \u0026ldquo;allocations\u0026rdquo;, however, we talk about data on the heap. Unlike the stack, the heap has an unlimited size (well, it has the size of your computer's RAM), and can be modified arbitrarily, deleting any objects.\nIntuitively, it may seem obvious that all objects need to be placed in RAM, must be able to be retrieved and deleted at any time by the program, and therefore need to be allocated on the heap. And for some languages, like Python, this is true. However, this is not true in Julia and other efficient, compiled languages. Integers, for example, can often be placed on the stack.\nWhy do some objects need to be heap allocated, while others can be stack allocated? To be stack-allocated, the compiler needs to know for certain that:\n The object is a reasonably small size, so it fits on the stack. This is needed for technical reasons for the stack to operate. 
The compiler can predict exactly when it needs to add and destroy the object so it can destroy it by simply popping the stack (similar to calling pop! on a Vector). This is usually the case for local variables in compiled languages. Julia has even more constrains on stack-allocated objects.\n The object should have a fixed size known at compile time. The compiler must know that object never changes. The CPU is free to copy stack-allocated objects, and for immutable objects, there is no way to distinguish a copy from the original. This bears repeating: With immutable objects, there is no way to distinguish a copy from the original. This gives the compiler and the CPU certain freedoms when operating on it. In Julia, we have a concept of a bitstype, which is an object that recursively contain no heap-allocated objects. Heap allocated objects are objects of types String, Array, Ref and Symbol, mutable objects, or objects containing any of the previous. Bitstypes are more performant exactly because they are immutable, fixed in size and can be stack allocated. The latter point is also why objects are immutable by default in Julia, and leads to one other performance tip: Use immutable objects whereever possible.\nWhat does this mean in practise? In Julia, it means if you want fast stack-allocated objects:\n You object must be created, used and destroyed in a fully compiled function so the compiler knows for certain when it needs to create, use and destroy the object. If the object is returned for later use (and not immediately returned to another, fully compiled function), we say that the object escapes, and must be allocated. Your object's type must be a bitstype. Your type must be limited in size. I don't know exactly how large it has to be, but 100 bytes is fine. The exact memory layout of your type must be known by the compiler. abstract type AllocatedInteger end struct StackAllocated \u0026lt;: AllocatedInteger x::Int end mutable struct HeapAllocated \u0026lt;: AllocatedInteger x::Int end We can inspect the code needed to instantiate a HeapAllocated object with the code needed to instantiate a StackAllocated one:\n@code_native HeapAllocated(1) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[18]:8 within `HeapAllocated' pushq\t%rbx movq\t%rsi, %rbx movabsq\t$jl_get_ptls_states_fast, %rax callq\t*%rax movabsq\t$jl_gc_pool_alloc, %rcx movl\t$1400, %esi ## imm = 0x578 movl\t$16, %edx movq\t%rax, %rdi callq\t*%rcx movabsq\t$4609615504, %rcx ## imm = 0x112C12690 movq\t%rcx, -8(%rax) movq\t%rbx, (%rax) popq\t%rbx retq nopl\t(%rax) ; └ Notice the callq instructions in the HeapAllocated one. This instruction calls out to other functions, meaning that in fact, much more code is really needed to create a HeapAllocated object that what is displayed. In constrast, the StackAllocated really only needs a few instructions:\n@code_native StackAllocated(1) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[18]:4 within `StackAllocated' movq\t%rsi, %rax retq nopw\t%cs:(%rax,%rax) ; └ Because bitstypes dont need to be stored on the heap and can be copied freely, bitstypes are stored inline in arrays. This means that bitstype objects can be stored directly inside the array's memory. Non-bitstypes have a unique identity and location on the heap. They are distinguishable from copies, so cannot be freely copied, and so arrays contain reference to the memory location on the heap where they are stored. 
Accessing such an object from an array then means first accessing the array to get the memory location, and then accessing the object itself using that memory location. Beside the double memory access, objects are stored less efficiently on the heap, meaning that more memory needs to be copied to CPU caches, meaning more cache misses. Hence, even when stored on the heap in an array, bitstypes can be stored more effectively.\nBase.:+(x::Int, y::AllocatedInteger) = x + y.x Base.:+(x::AllocatedInteger, y::AllocatedInteger) = x.x + y.x data_stack = [StackAllocated(i) for i in rand(UInt16, 1000000)] data_heap = [HeapAllocated(i.x) for i in data_stack] @btime sum(data_stack) @btime sum(data_heap); 281.676 μs (1 allocation: 16 bytes) 1.015 ms (1 allocation: 16 bytes) We can verify that, indeed, the array in the data_stack stores the actual data of a StackAllocated object, whereas the data_heap contains pointers (i.e. memory addresses):\nprintln(\u0026quot;First object of data_stack: \u0026quot;, data_stack[1]) println(\u0026quot;First data in data_stack array: \u0026quot;, unsafe_load(pointer(data_stack)), '\\n') println(\u0026quot;First object of data_heap: \u0026quot;, data_heap[1]) first_data = unsafe_load(Ptr{UInt}(pointer(data_heap))) println(\u0026quot;First data in data_heap array: \u0026quot;, repr(first_data)) println(\u0026quot;Data at address \u0026quot;, repr(first_data), \u0026quot;: \u0026quot;, unsafe_load(Ptr{HeapAllocated}(first_data))) First object of data_stack: StackAllocated(3693) First data in data_stack array: StackAllocated(3693) First object of data_heap: HeapAllocated(3693) First data in data_heap array: 0x0000000110cb16a0 Data at address 0x0000000110cb16a0: HeapAllocated(3693) Registers and SIMD It is time yet again to update our simplified computer schematic. A CPU operates only on data present in registers. These are small, fixed size slots (e.g. 8 bytes in size) inside the CPU itself. A register is meant to hold one single piece of data, like an integer or a floating point number. As hinted in the section on assembly code, each instruction usually refers to one or two registers which contain the data the operation works on:\n[CPU] ↔ [REGISTERS] ↔ [CPU CACHE] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK] \nTo operate on data structures larger than one register, the data must be broken up into smaller pieces that fits inside the register. For example, when adding two 128-bit integers on my computer:\n@code_native UInt128(5) + UInt128(11) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ int.jl:53 within `+' addq\t%rcx, %rsi adcxq\t%rdx, %r8 movq\t%rsi, (%rdi) movq\t%r8, 8(%rdi) movq\t%rdi, %rax retq nopw\t%cs:(%rax,%rax) ; └ There is no register that can do 128-bit additions. First the lower 64 bits must be added using a addq instruction, fitting in a register. Then the upper bits are added with a adcxq instruction, which adds the digits, but also uses the carry bit from the previous instruction. Finally, the results are moved 64 bits at a time using movq instructions.\nThe small size of the registers serves as a bottleneck for CPU throughput: It can only operate on one integer/float at a time. In order to sidestep this, modern CPUs contain specialized 256-bit registers (or 128-bit in older CPUs, or 512-bit in the brand new ones) than can hold 4 64-bit integers/floats at once, or 8 32-bit integers, etc. Confusingly, the data in such wide registers are termed \u0026ldquo;vectors\u0026rdquo;. 
The CPU have access to instructions that can perform various CPU operations on vectors, operating on 4 64-bit integers in one instruction. This is called \u0026ldquo;single instruction, multiple data\u0026rdquo;, SIMD, or vectorization. Notably, a 4x64 bit operation is not the same as a 256-bit operation, e.g. there is no carry-over with between the 4 64-bit integers when you add two vectors. Instead, a 256-bit vector operation is equivalent to 4 individual 64-bit operations.\nWe can illustrate this with the following example:\n# Create a single statically-sized vector of 8 32-bit integers # I could also have created 4 64-bit ones, etc. a = @SVector Int32[1,2,3,4,5,6,7,8] # Don't add comments to output code_native(+, (typeof(a), typeof(a)), debuginfo=:none) .section\t__TEXT,__text,regular,pure_instructions movq\t%rdi, %rax vmovdqu\t(%rdx), %ymm0 vpaddd\t(%rsi), %ymm0, %ymm0 vmovdqu\t%ymm0, (%rdi) vzeroupper retq nopw\t%cs:(%rax,%rax) nopl\t(%rax) Here, two 8*32 bit vectors are added together in one single instruction. You can see the CPU makes use of a single vpaddd (vector packed add double) instruction to add 8 32-bit integers, as well as the corresponding move instruction vmovdqu. Note that vector CPU instructions begin with v.\nIt's worth mentioning the interaction between SIMD and alignment: If a series of 256-bit (32-byte) SIMD loads are misaligned, then up to half the loads could cross cache line boundaries, as opposed to just 1/8th of 8-byte loads. Thus, alignment is a much more serious issue when using SIMD. Since array beginnings are always aligned, this is usually not an issue, but in cases where you are not guaranteed to start from an aligned starting point, such as with matrix operations, this may make a significant difference. In brand new CPUs with 512-bit registers, the issues is even worse as the SIMD size is the same as the cache line size, so all loads would be misaligned if the initial load is.\nSIMD vectorization of e.g. 64-bit integers may increase throughput by almost 4x, so it is of huge importance in high-performance programming. Compilers will automatically vectorize operations if they can. What can prevent this automatic vectorization?\nSIMD needs uninterrupted iteration of fixed length Because vectorized operations operates on multiple data at once, it is not possible to interrupt the loop at an arbitrary point. For example, if 4 64-bit integers are processed in one clock cycle, it is not possible to stop a SIMD loop after 3 integers have been processed. Suppose you had a loop like this:\nfor i in 1:8 if foo() break end # do stuff with my_vector[i] end Here, the loop could end on any iteration due to the break statement. Therefore, any SIMD instruction which loaded in multiple integers could operate on data after the loop is supposed to break, i.e. data which is never supposed to be read. This would be wrong behaviour, and so, the compiler cannot use SIMD instructions.\nA good rule of thumb is that simd needs:\n A loop with a predetermined length, so it knows when to stop, and A loop with no branches (i.e. if-statements) in the loop In fact, even boundschecking, i.e. checking that you are not indexing outside the bounds of a vector, causes a branch. After all, if the code is supposed to raise a bounds error after 3 iterations, even a single SIMD operation would be wrong! To achieve SIMD vectorization then, all boundschecks must be disabled. 
We can use this do demonstrate the impact of SIMD:\nfunction sum_nosimd(x::Vector) n = zero(eltype(x)) for i in eachindex(x) n += x[i] end return n end function sum_simd(x::Vector) n = zero(eltype(x)) # By removing the boundscheck, we allow automatic SIMD @inbounds for i in eachindex(x) n += x[i] end return n end # Make sure the vector is small enough to fit in cache so we don't time cache misses data = rand(UInt64, 4096); @btime sum_nosimd(data) @btime sum_simd(data); 2.198 μs (1 allocation: 16 bytes) 200.598 ns (1 allocation: 16 bytes) On my computer, the SIMD code is 10x faster than the non-SIMD code. SIMD alone accounts for only about 4x improvements (since we moved from 64-bits per iteration to 256 bits per iteration). The rest of the gain comes from not spending time checking the bounds and from automatic loop unrolling (explained later), which is also made possible by the @inbounds annotation.\nSIMD needs a loop where loop order doesn't matter SIMD can change the order in which elements in an array is processed. If the result of any iteration depends on any previous iteration such that the elements can't be re-ordered, the compiler will usually not SIMD-vectorize. Often when a loop won't auto-vectorize, it's due to subtleties in which data moves around in registers means that there will be some hidden memory dependency between elements in an array.\nImagine we want to sum some 64-bit integers in an array using SIMD. For simplicity, let's say the array has 8 elements, A, B, C \u0026hellip; H. In an ordinary non-SIMD loop, the additions would be done like so:\n(((((((A + B) + C) + D) + E) + F) + G) + H)\nWhereas when loading the integers using SIMD, four 64-bit integers would be loaded into one vector \u0026lt;A, B, C, D\u0026gt;, and the other four into another \u0026lt;E, F, G, H\u0026gt;. The two vectors would be added: \u0026lt;A+E, B+F, C+G, D+H\u0026gt;. After the loop, the four integers in the resulting vector would be added. So other overall order would be:\n((((A + E) + (B + F)) + (C + G)) + (D + H))\nPerhaps surprisingly, addition of floating point numbers can give different results depending on the order (i.e. float addition is not associative):\nx = eps(1.0) * 0.4 1.0 + (x + x) == (1.0 + x) + x false for this reason, float addition will not auto-vectorize:\ndata = rand(Float64, 4096) @btime sum_nosimd(data) @btime sum_simd(data); 4.339 μs (1 allocation: 16 bytes) 4.339 μs (1 allocation: 16 bytes) However, high-performance programming languages usually provide a command to tell the compiler it's alright to re-order the loop, even for non-associative loops. In Julia, this command is the @simd macro:\nfunction sum_simd(x::Vector) n = zero(eltype(x)) # Here we add the `@simd` macro to allow SIMD of floats @inbounds @simd for i in eachindex(x) n += x[i] end return n end data = rand(Float64, 4096) @btime sum_nosimd(data) @btime sum_simd(data); 4.342 μs (1 allocation: 16 bytes) 297.430 ns (1 allocation: 16 bytes) Julia also provides the macro @simd ivdep which further tells the compiler that there are no memory-dependencies in the loop order. However, I strongly discourage the use of this macro, unless you really know what you're doing. 
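For reference, the syntax looks like the sketch below (my own hypothetical example; the ivdep annotation promises the compiler that no iteration touches memory written by another iteration, here that dst and src do not overlap):

function add_ivdep!(dst::Vector{Float64}, src::Vector{Float64})
    # We assert that dst and src are not views of the same memory,
    # so iterations are independent and may be reordered and vectorized.
    @inbounds @simd ivdep for i in eachindex(dst, src)
        dst[i] += src[i]
    end
    return dst
end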
In general, the compiler knows best when a loop has memory dependencies, and misuse of @simd ivdep can very easily lead to bugs that are hard to detect.\nStruct of arrays If we create an array containing four AlignmentTest objects A, B, C and D, the objects will lie end to end in the array, like this:\nObjects: | A | B | C | D | Fields: | a | b |c| | a | b |c| | a | b |c| | a | b |c| | Byte: 1 9 17 25 33 Note again that byte no. 8, 16, 24 and 32 are empty to preserve alignment, wasting memory. Now suppose you want to do an operation on all the .a fields of the structs. Because the .a fields are scattered 8 bytes apart, SIMD operations are much less efficient (loading up to 4 fields at a time) than if all the .a fields were stored together (where 8 fields could fit in a 256-bit register). When working with the .a fields only, the entire 64-byte cache lines would be read in, of which only half, or 32 bytes would be useful. Not only does this cause more cache misses, we also need instructions to pick out the half of the data from the SIMD registers we need.\nThe memory structure we have above is termed an \u0026ldquo;array of structs\u0026rdquo;, because, well, it is an array filled with structs. Instead we can strucure our 4 objects A to D as a \u0026ldquo;struct of arrays\u0026rdquo;. Conceptually, it could look like:\nstruct AlignmentTestVector a::Vector{UInt32} b::Vector{UInt16} c::Vector{UInt8} end With the following memory layout for each field:\nObject: AlignmentTestVector .a | A | B | C | D | .b | A | B | C | D | .c |A|B|C|D| Alignment is no longer a problem, no space is wasted on padding. When running through all the a fields, all cache lines contain full 64 bytes of relevant data, so SIMD operations do not need extra operations to pick out the relevant data:\nBase.rand(::Type{AlignmentTest}) = AlignmentTest(rand(UInt32), rand(UInt16), rand(UInt8)) N = 1_000_000 array_of_structs = [rand(AlignmentTest) for i in 1:N] struct_of_arrays = AlignmentTestVector(rand(UInt32, N), rand(UInt16, N), rand(UInt8, N)); @btime sum(x -\u0026gt; x.a, array_of_structs) @btime sum(struct_of_arrays.a); 444.546 μs (1 allocation: 16 bytes) 113.006 μs (1 allocation: 16 bytes) Specialized CPU instructions Most code makes use of only a score of CPU instructions like move, add, multiply, bitshift, and, or, xor, jumps, and so on. However, CPUs in the typical modern laptop support a lot of CPU instructions. Typically, if a certain operation is used heavily in consumer laptops, CPU manufacturers will add specialized instructions to speed up these operations. Depending on the hardware implementation of the instructions, the speed gain from using these instructions can be significant.\nJulia only exposes a few specialized instructions, including:\n The number of set bits in an integer is effectively counted with the popcnt instruction, exposed via the count_ones function. The tzcnt instructions counts the number of trailing zeros in the bits an integer, exposed via the trailing_zeros function The order of individual bytes in a multi-byte integer can be reversed using the bswap instruction, exposed via the bswap function. This can be useful when having to deal with endianness. 
The following example illustrates the performance difference between a manual implementation of the count_ones function, and the built-in version, which uses the popcnt instruction:\nfunction manual_count_ones(x) n = 0 while x != 0 n += x \u0026amp; 1 x \u0026gt;\u0026gt;\u0026gt;= 1 end return n end data = rand(UInt, 10000) @btime sum(manual_count_ones, data) @btime sum(count_ones, data); 217.605 μs (1 allocation: 16 bytes) 2.059 μs (1 allocation: 16 bytes) The timings you observe here will depend on whether your compiler is clever enough to realize that the computation in the first function can be expressed as a popcnt instruction, and thus will be compiled to that. On my computer, the compiler is not able to make that inference, and the second function achieves the same result more than 100x faster.\nCall any CPU instruction Julia makes it possible to call CPU instructions direcly. This is not generally advised, since not all your users will have access to the same CPU with the same instructions.\nThe latest CPUs contain specialized instructions for AES encryption and SHA256 hashing. If you wish to call these instructions, you can call Julia's backend compiler, LLVM, directly. In the example below, I create a function which calls the vaesenc (one round of AES encryption) instruction directly:\n# This is a 128-bit CPU \u0026quot;vector\u0026quot; in Julia const __m128i = NTuple{2, VecElement{Int64}} # Define the function in terms of LLVM instructions aesenc(a, roundkey) = ccall(\u0026quot;llvm.x86.aesni.aesenc\u0026quot;, llvmcall, __m128i, (__m128i, __m128i), a, roundkey); We can verify it works by checking the assembly of the function, which should contain only a single vaesenc instruction, as well as the retq (return) and the nopw (do nothing, used as a filler to align the CPU instructions in memory) instruction:\n@code_native aesenc(__m128i((213132, 13131)), __m128i((31231, 43213))) .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[33]:5 within `aesenc' vaesenc\t%xmm1, %xmm0, %xmm0 retq nopw\t%cs:(%rax,%rax) ; └ Algorithms which makes use of specialized instructions can be extremely fast. In a blog post, the video game company Molly Rocket unveiled a new non-cryptographic hash function using AES instructions which reached unprecedented speeds.\nInlining Consider the assembly of this function:\n# Simply throw an error f() = error() @code_native f() .section\t__TEXT,__text,regular,pure_instructions ; ┌ @ In[35]:2 within `f' pushq\t%rax movabsq\t$error, %rax callq\t*%rax nopl\t(%rax) ; └ This code contains the callq instruction, which calls another function. A function call comes with some overhead depending on the arguments of the function and other things. While the time spent on a function call is measured in microseconds, it can add up if the function called is in a tight loop.\nHowever, if we show the assembly of this function:\ncall_plus(x) = x + 1 code_native(call_plus, (Int,), debuginfo=:none) .section\t__TEXT,__text,regular,pure_instructions leaq\t1(%rdi), %rax retq nopw\t%cs:(%rax,%rax) The function call_plus calls +, and is compiled to a single leaq instruction (as well as some filler retq and nopw). But + is a normal Julia function, so call_plus is an example of one regular Julia function calling another. Why is there no callq instruction to call +?\nThe compiler has chosen to inline the function + into call_plus. That means that instead of calling +, it has copied the content of + directly into call_plus. 
The advantages of this is:\n There is no overhead from the function call There is no need to construct a Tuple to hold the arguments of the + function Whatever computations happens in + is compiled together with call_plus, allowing the compiler to use information from one in the other and possibly simplify some calculations. So why aren't all functions inlined then? Inlining copies code, increases the size of it and consuming RAM. Furthermore, the CPU instructions themselves also needs to fit into the CPU cache (although CPU instructions have their own cache) in order to be efficiently retrieved. If everything was inlined, programs would be enormous and grind to a halt. Inlining is only an improvement if the inlined function is small.\nInstead, the compiler uses heuristics (rules of thumb) to determine when a function is small enough for inlining to increase performance. These heuristics are not bulletproof, so Julia provides the macros @noinline, which prevents inlining of small functions (useful for e.g. functions that raises errors, which must be assumed to be called rarely), and @inline, which does not force the compiler to inline, but strongly suggests to the compiler that it ought to inline the function.\nIf code contains a time-sensitive section, for example an inner loop, it is important to look at the assembly code to verify that small functions in the loop is inlined. For example, in this line in my kmer hashing code, overall minhashing performance drops by a factor of two if this @inline annotation is removed.\nAn extreme difference between inlining and no inlining can be demonstrated thus:\n@noinline noninline_poly(x) = x^3 - 4x^2 + 9x - 11 inline_poly(x) = x^3 - 4x^2 + 9x - 11 function time_function(F, x::AbstractVector) n = 0 for i in x n += F(i) end return n end; @btime time_function(noninline_poly, data) @btime time_function(inline_poly, data); 13.380 μs (1 allocation: 16 bytes) 7.207 μs (1 allocation: 16 bytes) Unrolling Consider a function that sums a vector of 64-bit integers. If the vector's data's memory offset is stored in register %r9, the length of the vector is stored in register %r8, the current index of the vector in %rcx and the running total in %rax, the assembly of the inner loop could look like this:\nL1: ; add the integer at location %r9 + %rcx * 8 to %rax addq (%r9,%rcx,8), %rax ; increment index by 1 addq $1, %rcx ; compare index to length of vector cmpq %r8, %rcx ; repeat loop if index is smaller jb L1 For a total of 4 instructions per element of the vector. The actual code generated by Julia will be similar to this, but also incluce extra instructions related to bounds checking that are not relevant here (and of course will include different comments).\nHowever, if the function is written like this:\nfunction sum_vector(v::Vector{Int}) n = 0 i = 1 for chunk in 1:div(length(v), 4) n += v[i + 0] n += v[i + 1] n += v[i + 2] n += v[i + 3] i += 4 end return n end The result is obviously the same if we assume the length of the vector is divisible by four. If the length is not divisible by four, we could simply use the function above to sum the first N - rem(N, 4) elements and add the last few elements in another loop. Despite the functionally identical result, the assembly of the loop is different and may look something like:\nL1: addq (%r9,%rcx,8), %rax addq 8(%r9,%rcx,8), %rax addq 16(%r9,%rcx,8), %rax addq 24(%r9,%rcx,8), %rax addq $4, %rcx cmpq %r8, %rcx jb L1 For a total of 7 instructions per 4 additions, or 1.75 instructions per addition. 
This is less than half the number of instructions per integer! The speed gain comes from simply checking less often when we're at the end of the loop. We call this process unrolling the loop, here by a factor of four. Naturally, unrolling can only be done if we know the number of iterations beforehand, so we don't \u0026ldquo;overshoot\u0026rdquo; the number of iterations. Often, the compiler will unroll loops automatically for extra performance, but it can be worth looking at the assembly. For example, this is the assembly for the innermost loop generated on my computer for sum([1]):\nL144: vpaddq 16(%rcx,%rax,8), %ymm0, %ymm0 vpaddq 48(%rcx,%rax,8), %ymm1, %ymm1 vpaddq 80(%rcx,%rax,8), %ymm2, %ymm2 vpaddq 112(%rcx,%rax,8), %ymm3, %ymm3 addq $16, %rax cmpq %rax, %rdi jne L144 Where you can see it is both unrolled by a factor of four, and uses 256-bit SIMD instructions, for a total of 128 bytes, 16 integers added per iteration, or 0.44 instructions per integer.\nNotice also that the compiler chooses to use 4 different ymm SIMD registers, ymm0 to ymm3, whereas in my example assembly code, I just used one register rax. This is because, if you use 4 independent registers, then you don't need to wait for one vpaddq to complete (remember, it had a ~3 clock cycle latency) before the CPU can begin the next.\nBranch prediction As mentioned previously, CPU instructions take multiple cycles to complete, but may be queued into the CPU before the previous instruction has finished computing. So what happens when the CPU encounters a branch (i.e. a jump instruction)? It can't know which instruction to queue next, because that depends on the instruction that it just put into the queue and which has yet to be executed.\nModern CPUs make use of branch prediction. The CPU has a branch predictor circuit, which guesses the correct branch based on which branches were recently taken. In essense, the branch predictor attempts to learn simple patterns in which branches are taken in code, while the code is running. After queueing a branch, the CPU immediately queues instructions from whatever branch predicted by the branch predictor. The correctness of the guess is verified later, when the queued branch is being executed. If the guess was correct, great, the CPU saved time by guessing. If not, the CPU has to empty the pipeline and discard all computations since the initial guess, and then start over. This process causes a delay of a few nanoseconds.\nFor the programmer, this means that the speed of an if-statement depends on how easy it is to guess. If it is trivially easy to guess, the branch predictor will be correct almost all the time, and the if statement will take no longer than a simple instruction, typically 1 clock cycle. In a situation where the branching is random, it will be wrong about 50% of the time, and each misprediction may cost around 10 clock cycles.\nBranches caused by loops are among the easiest to guess. If you have a loop with 1000 elements, the code will loop back 999 times and break out of the loop just once. Hence the branch predictor can simply always predict \u0026ldquo;loop back\u0026rdquo;, and get a 99.9% accuracy.\nWe can demonstrate the performance of branch misprediction with a simple function:\n# Copy all odd numbers from src to dst. 
function copy_odds!(dst::Vector{UInt}, src::Vector{UInt}) write_index = 1 @inbounds for i in eachindex(src) # \u0026lt;--- this branch is trivially easy to predict v = src[i] if isodd(v) # \u0026lt;--- this is the branch we want to predict dst[write_index] = v write_index += 1 end end return dst end dst = rand(UInt, 5000) src_random = rand(UInt, 5000) src_all_odd = [2i+1 for i in src_random]; @btime copy_odds!(dst, src_random) @btime copy_odds!(dst, src_all_odd); 11.586 μs (0 allocations: 0 bytes) 1.702 μs (0 allocations: 0 bytes) In the first case, the integers are random, and about half the branches will be mispredicted causing delays. In the second case, the branch is always taken, the branch predictor is quickly able to pick up the pattern and will reach near 100% correct prediction. As a result, on my computer, the latter is around 6x faster.\nNote that if you use smaller vectors and repeat the computation many times, as the @btime macro does, the branch predictor is able to learn the pattern of the small random vectors by heart, and will reach much better than random prediction. This is especially pronounced in the most modern CPUs where the branch predictors have gotten much better. This \u0026ldquo;learning by heart\u0026rdquo; is an artifact of the loop in the benchmarking process. You would not expect to run the exact same computation repeatedly on real-life data:\nsrc_random = rand(UInt, 100) src_all_odd = [2i+1 for i in src_random]; @btime copy_odds!(dst, src_random) @btime copy_odds!(dst, src_all_odd); 62.548 ns (0 allocations: 0 bytes) 42.602 ns (0 allocations: 0 bytes) Because branches are very fast if they are predicted correctly, highly predictable branches caused by error checks are not of much performance concern, assuming that the code essensially never errors. Hence a branch like bounds checking is very fast. You should only remove bounds checks if absolutely maximal performance is critical, or if the bounds check happens in a loop which would otherwise SIMD-vectorize.\nIf branches cannot be easily predicted, it is often possible to re-phrase the function to avoid branches all together. For example, in the copy_odds! example above, we could instead write it like so:\nfunction copy_odds!(dst::Vector{UInt}, src::Vector{UInt}) write_index = 1 @inbounds for i in eachindex(src) v = src[i] dst[write_index] = v write_index += isodd(v) end return dst end dst = rand(UInt, 5000) src_random = rand(UInt, 5000) src_all_odd = [2i+1 for i in src_random]; @btime copy_odds!(dst, src_random) @btime copy_odds!(dst, src_all_odd); 2.295 μs (0 allocations: 0 bytes) 2.341 μs (0 allocations: 0 bytes) Which contains no other branches than the one caused by the loop itself (which is easily predictable), and results in speeds somewhat worse than the perfectly predicted one, but much better for random data.\nThe compiler will often remove branches in your code when the same computation can be done using other instructions. When the compiler fails to do so, Julia offers the ifelse function, which sometimes can help elide branching.\nVariable clock speed A modern laptop CPU optimized for low power consumption consumes roughly 25 watts of power on a chip as small as a stamp (and thinner than a human hair). Without proper cooling, this will cause the temperature of the CPU to skyrocket and melting the plastic of the chip, destroying it. Typically, CPUs have a maximal operating temperature of about 100 degrees C. 
Power consumption, and therefore heat generation, depends among many factors on clock speed, higher clock speeds generate more heat.\nModern CPUs are able to adjust their clock speeds according to the CPU temperature to prevent the chip from destroying itself. Often, CPU temperature will be the limiting factor in how quick a CPU is able to run. In these situations, better physical cooling for your computer translates directly to a faster CPU. Old computers can often be revitalized simply by removing dust from the interior, and replacing the cooling fans and CPU thermal paste!\nAs a programmer, there is not much you can do to take CPU temperature into account, but it is good to know. In particular, variations in CPU temperature often explain observed difference in performance:\n CPUs usually work fastest at the beginning of a workload, and then drop in performance as it reaches maximal temperature SIMD instructions usually require more power than ordinary instructions, generating more heat, and lowering the clock frequency. This can offset some performance gains of SIMD, but SIMD will still always be more efficient when applicable Multithreading In the bad old days, CPU clock speed would increase every year as new processors were brought onto the market. Partially because of heat generation, this acceleration slowed down once CPUs hit the 3 GHz mark. Now we see only minor clock speed increments every processor generation. Instead of raw speed of execution, the focus has shifted on getting more computation done per clock cycle. CPU caches, CPU pipelining, branch prediction and SIMD instructions are all important progresses in this area, and have all been covered here.\nAnother important area where CPUs have improved is simply in numbers: Almost all CPU chips contain multiple smaller CPUs, or cores inside them. Each core has their own small CPU cache, and does computations in parallel. Furthermore, many CPUs have a feature called hyper-threading, where two threads (i.e. streams of instructions) are able to run on each core. The idea is that whenever one process is stalled (e.g. because it experiences a cache miss or a misprediction), the other process can continue on the same core. The CPU \u0026ldquo;pretends\u0026rdquo; to have twice the amount of processors. For example, I am writing this on a laptop with an Intel Core i5-8259U CPU. This CPU has 4 cores, but various operating systems like Windows or Linux would show 8 \u0026ldquo;CPUs\u0026rdquo; in the systems monitor program.\nHyperthreading only really matters when your threads are sometimes prevented from doing work. Besides CPU-internal causes like cache misses, a thread can also be paused because it is waiting for an external resource like a webserver or data from a disk. If you are writing a program where some threads spend a significant time idling, the core can be used by the other thread, and hyperthreading can show its value.\nLet's see our first parallel program in action. First, we need to make sure that Julia actually was started with the correct number of threads. You can set the environment variable JULIA_NUM_THREADS before starting Julia. 
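For reference, a small hedged sketch of checking what the machine reports and of one way to launch Julia with threads (the shell line is illustrative; the exact syntax depends on your shell):

# Logical CPUs the operating system reports; on most machines this counts
# hyperthreads, so it can be twice the number of physical cores.
Sys.CPU_THREADS

# One way to start a Julia session with 8 threads from a Unix-like shell:
#   JULIA_NUM_THREADS=8 julia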
I have 4 cores on this CPU, both with hyperthreading so I have set the number of threads to 8:\nThreads.nthreads() 8 # Spend about half the time waiting, half time computing function half_asleep(start::Bool) a, b = 1, 0 for iteration in 1:5 start \u0026amp;\u0026amp; sleep(0.06) for i in 1:100000000 a, b = a + b, a end start || sleep(0.06) end return a end function parallel_sleep(n_jobs) jobs = [] for job in 1:n_jobs push!(jobs, Threads.@spawn half_asleep(isodd(job))) end return sum(fetch, jobs) end parallel_sleep(1); # run once to compile it for njobs in (1, 4, 8, 16) @time parallel_sleep(njobs); end 0.604390 seconds (44 allocations: 1.906 KiB) 0.694747 seconds (292 allocations: 15.500 KiB) 0.609797 seconds (425 allocations: 20.922 KiB) 1.157514 seconds (705 allocations: 33.109 KiB) You can see that with this task, my computer can run 8 jobs in parallel almost as fast as it can run 1. But 16 jobs takes much longer.\nFor CPU-constrained programs, the core is kept busy with only one thread, and there is not much to do as a programmer to leverage hyperthreading. Actually, for the most optimized programs, it usually leads to better performance to disable hyperthreading. Most workloads are not that optimized and can really benefit from hyperthreading, so we'll stick with 8 threads for now.\nParallelizability Multithreading is more difficult that any of the other optimizations, and should be one of the last tools a programmer reaches for. However, it is also an impactful optimization. Compute clusters usually contain CPUs with tens of CPU cores, offering a massive potential speed boost ripe for picking.\nA prerequisite for efficient use of multithreading is that your computation is able to be broken up into multiple chunks that can be worked on independently. Luckily the majority of compute-heavy tasks (at least in my field of work, bioinformatics), contain sub-problems that are embarassingly parallel. This means that there is a natural and easy way to break it into sub-problems that can be processed independently. For example, if a certain independent computation is required for 100 genes, it is natural to use one thread for each gene.\nLet's have an example of a small embarrasingly parallel problem. We want to construct a Julia set. Julia sets are named after Gaston Julia, and have nothing to do with the Julia language. Julia sets are (often) fractal sets of complex numbers. By mapping the real and complex component of the set's members to the X and Y pixel value of a screen, one can generate the LSD-trippy images associated with fractals.\nThe Julia set I create below is defined thus: We define a function f(z) = z^2 + C, where C is some constant. We then record the number of times f can be applied to any given complex number z before abs(z) \u0026gt; 2. The number of iterations correspond to the brightness of one pixel in the image. 
We simply repeat this for a range of real and imaginary values in a grid to create an image.\nFirst, let's see a non-parallel solution:\nconst SHIFT = Complex{Float32}(-0.221, -0.713) f(z::Complex) = z^2 + SHIFT \u0026quot;Set the brightness of a particular pixel represented by a complex number\u0026quot; function mandel(z) n = 0 while ((abs2(z) \u0026lt; 4) \u0026amp; (n \u0026lt; 255)) n += 1 z = f(z) end return n end \u0026quot;Set brightness of pixels in one column of pixels\u0026quot; function fill_column!(M::Matrix, x, real) for (y, im) in enumerate(range(-1.0f0, 1.0f0, length=size(M, 1))) M[y, x] = mandel(Complex{Float32}(real, im)) end end \u0026quot;Create a Julia fractal image\u0026quot; function julia() M = Matrix{UInt8}(undef, 10000, 5000) for (x, real) in enumerate(range(-1.0f0, 1.0f0, length=size(M, 2))) fill_column!(M, x, real) end return M end; @time M = julia(); 4.924423 seconds (2 allocations: 47.684 MiB, 0.39% gc time) That took around 5 seconds on my computer. Now for a parallel one:\nfunction recursive_fill_columns!(M::Matrix, cols::UnitRange) F, L = first(cols), last(cols) # If only one column, fill it using fill_column! if F == L r = range(-1.0f0,1.0f0,length=size(M, 1))[F] fill_column!(M, F, r) # Else divide the range of columns in two, spawning a new task for each half else mid = div(L+F,2) p = Threads.@spawn recursive_fill_columns!(M, F:mid) recursive_fill_columns!(M, mid+1:L) wait(p) end end function julia() M = Matrix{UInt8}(undef, 10000, 5000) recursive_fill_columns!(M, 1:size(M, 2)) return M end; @time M = julia(); 0.833607 seconds (34.32 k allocations: 51.106 MiB) This is almost 6 times as fast! This is close to the best case scenario for 8 threads, only possible for near-perfect embarrasingly parallel tasks.\nDespite the potential for great gains, in my opinion, multithreading should be one of the last resorts for performance improvements, for three reasons:\n Implementing multithreading is harder than other optimization methods in many cases. In the example shown, it was very easy. In a complicated workflow, it can get messy quickly. Multithreading can cause hard-to-diagnose and erratic bugs. These are almost always related to multiple threads reading from, and writing to the same memory. For example, if two threads both increment an integer with value N at the same time, the two threads will both read N from memory and write N+1 back to memory, where the correct result of two increments should be N+2! Infuriatingly, these bugs appear and disappear unpredictably, since they are causing by unlucky timing. These bugs of course have solutions, but it is tricky subject outside the scope of this document. Finally, achieving performance by using multiple threads is really achieving performance by consuming more resources, instead of gaining something from nothing. Often, you pay for using more threads, either literally when buying cloud compute time, or when paying the bill of increased electricity consumption from multiple CPU cores, or metaphorically by laying claim to more of your users\u0026rsquo; CPU resources they could use somewhere else. In contrast, more efficent computation costs nothing. GPUs So far, we've covered only the most important kind of computing chip, the CPU. But there are many other kind of chips out there. 
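Before moving on to GPUs, here is a small sketch to make the data-race hazard described above concrete. The function names are made up for this example, and the exact wrong answer from the racy version will vary from run to run whenever Julia has more than one thread:

# Lost updates: two threads can read the same value of `counter` and both write
# back the same incremented value; this is the N+1 versus N+2 problem above.
function racy_count(n)
    counter = 0
    Threads.@threads for i in 1:n
        counter += 1
    end
    return counter
end

# Making the increment atomic turns the read-modify-write into one indivisible step.
function atomic_count(n)
    counter = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:n
        Threads.atomic_add!(counter, 1)
    end
    return counter[]
end

racy_count(1_000_000)   # usually well below 1_000_000 when run with several threads
atomic_count(1_000_000) # always 1_000_000 (atomics are slower, but correct)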
The most common kind of alternative chip is the graphical processing unit or GPU.\nAs shown in the above example with the Julia set, the task of creating computer images is often embarrassingly parallel with an extremely high degree of parallelizability. In the limit, the value of each pixel is an independent task. This calls for a chip with a high number of cores to do effectively. Because generating graphics is a fundamental part of what computers do, nearly all commercial computers contain a GPU. Often, it's a smaller chip integrated into the motherboard (integrated graphics, popular in small laptops). Other times, it's a large, bulky card.\nGPUs have sacrificed many of the bells and whistles of CPUs covered in this document such as specialized instructions, SIMD and branch prediction. They also usually run at lower frequencies than CPUs. This means that their raw per-core compute power is many times lower than that of a CPU. To make up for this, they have a high number of cores. For example, the high-end gaming GPU NVIDIA RTX 2080Ti has 4,352 cores. Hence, some tasks can experience 10s or even 100s of times speedup using a GPU. Most notably for scientific applications, matrix and vector operations are highly parallelizable.\nUnfortunately, the laptop I'm writing this document on has only integrated graphics, and there is not yet a stable way to interface with integrated graphics using Julia, so I cannot show examples.\nThere are also more esoteric chips like TPUs (explicitly designed for low-precision tensor operations common in deep learning) and ASICs (an umbrella term for highly specialized chips intended for one single application). At the time of writing, these chips are uncommon, expensive, poorly supported and have limited uses, and are therefore not of any interest for non-computer science researchers.\nThanks to Chris Elrod for reviewing this for correctness and teaching me a thing or two about computers\n","date":1587292718,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1587292718,"objectID":"437ea5bb0077aac274a929a0758750ac","permalink":"https://BioJulia.github.io/post/hardware/","publishdate":"2020-04-19T12:38:38+02:00","relpermalink":"/post/hardware/","section":"post","summary":"Find this notebook at https://github.com/jakobnissen/hardware_introduction\nProgramming is used in many fields of science today, where individual scientists often have to write custom code for their own projects. For most scientists, however, computer science is not their field of expertise; They have learned programming by necessity. I count myself as one of them. While we may be reasonably familiar with the software side of programming, we rarely have even a basic understanding of how computer hardware impacts code performance.","tags":[],"title":"What scientists must know about hardware to write fast code","type":"post"},{"authors":["Sabrina Ward"],"categories":[],"content":"Hi there! 
There have been incidents of confusion where newer users have tried to install and run an old and deprecated BioJulia package called Bio.jl, or they have not known which package(s) they need to install in order to start working.\nSo I'd like to take the opportunity to clear this situation up and perhaps put out a clearer explanation, of why Bio.jl exists when there are other packages with overlapping functionality.\nBio.jl was the first package BioJulia produced, at a time when the scale of the project meant it made sense to have a single module - Bio - with several submodules\n Seq, Align, Phylo, Intervals, Structure, Var, Services, Tools. Bio has existed since very early days - or versions - of julia as well.\nEventually there came a point where datatype and algorithm specific packages made more sense than a single monolithic repository.\nAnd so the package got split into many:\n Bio.Seq became BioSequences.jl Bio.Align became BioAlignments.jl Bio.Intervals became GenomicFeatures.jl Bio.Structure became BioStructures.jl Bio.Var became GeneticVariation.jl Bio.Phylo became Phylogenies.jl Bio.Services became BioServices.jl Bio.Tools became BioTools.jl Bio.jl then basically persisted as a convenience wrapper around those packages, so that they could be installed with a single command and were distributed set to compatible versions.\nWith julia v0.7/v1.0 and the new Pkg and Project system the need for Bio.jl to provide that functionality became largely moot.\nSo what should you do? As of today, Bio.jl is available for installation, but mostly just to prevent old scripts and projects breaking.\nIt is not longer recommended that you use Bio.jl. Instead you should start a julia project and add just the BioJulia packages you want to use. For a brief description on how to do this, check out the getting started page.\nHappy hacking.\nBen.\n","date":1579820469,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1579820469,"objectID":"7cc635bf6613cf421bab5391ffdd3a49","permalink":"https://BioJulia.github.io/post/biojl/","publishdate":"2020-01-23T23:01:09Z","relpermalink":"/post/biojl/","section":"post","summary":"Hi there! There have been incidents of confusion where newer users have tried to install and run an old and deprecated BioJulia package called Bio.jl, or they have not known which package(s) they need to install in order to start working.\nSo I'd like to take the opportunity to clear this situation up and perhaps put out a clearer explanation, of why Bio.jl exists when there are other packages with overlapping functionality.","tags":["BioSequences","DNA","RNA","Sequences","Deprecation"],"title":"Bio.jl is old and deprecated","type":"post"},{"authors":["Jakob Nybo Nissen","Sabrina Ward"],"categories":[],"content":"Introduction In October 2019, a paper was published in Proceedings of the ACM on Programming Languages, introducing a new language for bioinformatics called Seq.\nIn it, the authors rightly recognize that the field of bioinformatics is in need of a high-performance, yet expressive and productive language. They present Seq, a domain specific compiled language, with the user friendliness of Python, the performance of C, and bioinformatics-specific data types and optimisations. As Julians, we consider their goal to be noble and well worth pursuit. 
However, they also presented a series of benchmarking results in which they claim the performance of Seq code is far greater than equivalent code written in other languages, including C++ and Julia.\nOf course, benchmarking between languages is a tricky thing. Different languages present different syntax, tools and idioms to the programmer, such that what is efficient and natural in one language may be inefficient and clumsy in another. When benchmarking different languages, therefore, it is important to write code that is idiomatic in each language before comparing the code in terms of performance, readability or ease-of-writing. Disagreements about benchmarks between languages often come down to disagreements of whether the code compared are equivalent - if they aren’t, comparisons may be meaningless.\nFor example, in the majority of the benchmarks between Seq and Julia in the Seq paper, the DNA sequences of Seq are represented by a dedicated sequence type, whereas the sequences in Julia are represented by strings. As the Seq type is crafted specifically for efficient DNA processing, and Julia’s strings are crafted for efficient generic text processing, it is no surprise that Seq is much faster in these benchmarks. Therefore, these particular benchmarks are obviously misleading, and we will not discuss them further here.\nThe Seq authors, to their credit, realize this and include comparison between Seq and Julia code that uses BioSequences, a package of the BioJulia organization. This package contains custom, efficient types for exactly the type of work they are benchmarking, and so here, the code is equivalent and the comparison reasonable. Even when comparing against BioSequences, the Seq authors find that Seq offers a large speed advantage. As a core developer and long-time user of BioJulia, respectively, we were puzzled by these results. BioJulia is highly optimised. Could Seq really be that much faster? Perhaps we could learn something from Seq?\nRecreating the benchmarks The first thing to do was to see if the results in the paper could be replicated. The Seq authors have made a real effort to allow easy replication of their results, as they have released a virtual machine with all necessary code and software to run their benchmarks out of the box. Only their BioJulia code was missing, which they promptly provided upon request. So we spun up their VM, and recreated their benchmarks. For details on the benchmarking procedure, see the last section of this post.\nWe note that the provided version of Seq provides wrong answers to the benchmarks on several counts. First, kmers containing ambiguous nucleotides are not skipped during iteration, leading to simpler and faster (but by convention, wrong) code. Second, the middle base is not complemented during reverse-complementation of odd-length sequences. Third, the reverse-complement of ambiguous nucleotides are wrong, as e.g. K is complemented to N rather than the correct answer (M). Fourth, we can make little sense of what the result returned by their CpG benchmarking code actually means. However, these small blemishes could easily be fixed by the Seq authors with little or no performance penalty, so they are not of great importance here.\nThe results of our recreation are seen in the plot below. We were happy to see we could recreate their performance difference, seeing nearly the same performance difference they found.\nFigure 1. Running time of BioJulia (blue) versus Seq (red). 
We could recreate the authors’ findings and found Seq to be much faster than BioJulia for the three provided benchmarks. The barplot shows the mean +/- stddev time of 5 consecutive runs.\nImproving the benchmarked code As long-time developers and users of BioJulia, we feel qualified to judge whether the Julia code used for the benchmarks is idiomatic, and whether BioJulia was being used correctly. Non-idiomatic Julia is the most common source of complaints when new Julia users benchmark Julia code and find the performance disappointing. This often happens because new users code in a way that carries little penalty in languages like R and Python, but which is difficult for Julia’s JIT compiler to optimise. So, a natural starting point was to check their Julia code for inefficiencies.\nHowever, the Seq authors’ Julia benchmarks were well implemented. Only the code for the CpG benchmark could be improved by fixing an error which resulted in incorrect answers, and by computing the result in a single pass of the input sequence. The two other benchmarks only needed minor tweaking to be idiomatic. These minor changes did not improve the timings of BioJulia markedly (Figure 2).\nFigure 2. Improving the Julia code of the Seq paper’s benchmarks (green series) did not change the results very much. The barplot shows the mean +/- stddev time of 5 consecutive runs.\nExplaining the difference in performance It seems that Seq really is much faster than BioJulia, at least for the benchmarked tasks, and we wanted to know why. So we profiled their BioJulia code to see what BioJulia was taking its sweet time doing. The results for the three benchmarks are shown below in Figure 3.\nFigure 3. Barplots showing the fraction of time BioJulia benchmarking code was spending doing various sub-tasks, as determined by the built-in Julia Profiler tool.\nSurprisingly, Figure 3 reveals that only a small fraction of the time is spent doing what the benchmark nominally measures. For example, the RC benchmark presumably should benchmark reverse-complementation (RC’ing) of sequences, but BioJulia spends less than 10% of the time actually RC’ing. Likewise, checking symmetric kmers and kmer iteration in the 16-mer benchmark are relatively insignificant time-wise. We implemented Seq’s algorithm for RC’ing kmers as described in their paper, but found no difference in performance compared to BioJulia’s approach, even when benchmarking only the time spent RC’ing.\nIn fact, Figure 3 shows that for all benchmarks, most of the time is spent building instances of the BioSequence type that BioSequences.jl uses to represent long DNA, RNA and Amino Acid sequences. To find the source of the performance difference between Seq and BioSequences, then, it is necessary to take a small detour into the internals of BioSequences and Seq.\nHow to represent DNA sequences in software DNA consists of four nucleotides called A, C, G and T. In some contexts, a nucleotide may instead be one of 16 IUPAC defined symbols. Hence, one nucleotide may be represented in either two bits or four bits. In BioSequences, DNA is stored in a compact format, where nucleotides are encoded to 2 or 4-bit encodings in an integer array. This representation comes with a few advantages:\n The encoding and decoding step implicitly validates that the input data is meaningful DNA data instead of some random string. 
Therefore, when given a BioSequence, the user can be confident that the underlying data actually represents valid DNA.\n Encoding saves 4x or 2x on RAM. Speed is important in bioinformatics precisely because of our large datasets, and this also puts a premium on RAM, making these savings worthwhile.\n The encoded representation of DNA makes biological operations easier and more efficient. Complementation of a 2-bit DNA sequence, for example, consists only of inverting the encoded bits. Converting between DNA and RNA involves only changing the metadata.\n The encoding/decoding also comes with a disadvantage, as it takes more time than just accessing raw bytes. This tradeoff is reminiscent of the pros and cons of a boolean vector versus a bit-vector. As the benchmarks here consist of very simple, fast operations on many small sequences that are read in as text, encoding time dominates. (A toy sketch of this bit-packing idea is given a little further below.)\nIn contrast, Seq represents DNA sequences as a byte array with the underlying data simply being the bytes of the input string. Constructing this type is obviously much faster than a BioSequence since it can wrap a string directly, but the RAM consumption is 2 or 4 times higher. Remarkably, Seq appears to do no input validation at all, as we confirmed by re-running the benchmarks with corrupted data. BioSequences immediately crashed with an informative error message, whereas Seq happily produced the wrong answer with no warnings.\nSo it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate the inputs given to them rather than failing silently.\nImplementing Seq-like types in Julia While BioSequences may be more memory efficient and safer to use, we still verified the finding of the Seq authors: Seq really is much faster than BioSequences. That surprised us. Was Seq so fast because of amazing software engineering in the language itself, or simply because so much time is lost in BioSequence’s encoding and decoding? We decided to mimic Seq in Julia by implementing DNA sequence and kmer types with the same data representation Seq uses. Our module, SeqJL, turned out to be simple to write, taking less than 100 lines of code. We note that BioSequences could easily have been written this way, and intentionally isn’t.\nFigure 4 shows the performance of our SeqJL code (Figure 4, red), where it is significantly faster than Seq, except in the RC benchmark (Figure 4, orange). We found that further speed improvements could be gained by buffering input and output using the BufferedStreams.jl package, but we did not use that in the benchmarks.\nFigure 4. Timing a Julia implementation of the data types in Seq (red) resulted in timings which beat those achieved by the Seq code (orange). The barplot shows the mean +/- stddev time of 5 consecutive runs.\nLearning from Seq Being challenged often teaches valuable lessons, and the challenge Seq presented to BioJulia is no exception. We were surprised to see a bioinformatics workload where the encoding step of BioSequences proved to be a bottleneck, as we have always believed it to be very fast. We did not expect our simple SeqJL code to outperform BioSequences by such a large factor in these benchmarks.\nIn most realistic workloads, sequences are subject to more intense processing, which makes the speed of encoding and IO operations less important in comparison. 
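To make the bit-packing idea referenced above concrete, here is a toy 2-bit encoder and decoder. This is only a sketch of the principle, not how BioSequences is actually implemented, and the names NT_CODE, NT_CHAR, encode2bit and decode2bit are made up for this example:

const NT_CODE = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)
const NT_CHAR = ('A', 'C', 'G', 'T')

# Pack up to 32 nucleotides into one UInt64, two bits per nucleotide.
# Invalid characters throw an error, mirroring the validation BioSequences performs.
function encode2bit(seq::AbstractString)
    length(seq) <= 32 || error("toy encoder handles at most 32 nucleotides")
    x = zero(UInt64)
    for (i, c) in enumerate(seq)
        haskey(NT_CODE, c) || error("invalid nucleotide: $c")
        x |= UInt64(NT_CODE[c]) << (2 * (i - 1))
    end
    return x
end

# Unpack `len` nucleotides from the encoded integer back into a String.
function decode2bit(x::UInt64, len::Integer)
    chars = Char[]
    for i in 1:len
        push!(chars, NT_CHAR[Int((x >> (2 * (i - 1))) & 0x03) + 1])
    end
    return join(chars)
end

encode2bit("ACGT")                 # 0x00000000000000e4
decode2bit(encode2bit("ACGT"), 4)  # "ACGT"

Even in this toy, the trade-off discussed above is visible: the packing loop validates and compresses the input, which costs time up front but pays off once the sequence is processed repeatedly.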
In addition, we note that BioJulia provides optimised, buffered, state-machine-generated parsers for many file formats like FASTA and FASTQ, but they were not explored in this work as no benchmark involved them. BioJulia also offers ReadDatastores.jl, which implements indexed, disk-backed collections of sequences, stored in the BioSequences encodings. These data-stores mean commonly used sequence datasets, like sequencing reads stored in FASTQ files, need to be encoded only once, and then the data-store can be reused for a great performance benefit. Nonetheless, the impressive speed of Seq over BioSequences in these benchmarks made us reconsider the need for much faster string/sequence conversion and printing. As a result of this challenge, we have subjected BioSequences to a performance review, optimising a dozen code sections that were causing slowdowns. To benefit from these improvements, users should install BioSequences version 2.X from the BioJuliaRegistry.\nWe were happy to discover that these changes made a big difference. With the updates, BioSequences rivals Seq in speed while keeping its advantages of a lower memory footprint and doing data validation.\nFigure 5. The newest version of BioSequences (purple) comes with performance improvements in encoding, decoding and IO, making it 3-4 times faster than BioSequences v1.1.0 (green) for these benchmarks, and rivaling Seq (orange). The barplot shows the mean +/- stddev time of 5 consecutive runs.\nOur take-away impressions Jakob I wholeheartedly agree with Seq’s goal of bringing a performant language to the masses, so to speak. I also applaud Seq’s authors for their efforts in providing reproducible results by sharing their code and environment, and for their efforts to compare Seq to efficient, proper Julia code. Ultimately, the speed difference between BioSequences and Seq comes down to a design decision in the implementation of the sequence type. I happen to think BioSequences made the right call by encoding the sequences, and Seq made the wrong one.\nMore broadly, I think Seq brings little of value to bioinformatics. Our simple SeqJL implementation shows that Julia can achieve what Seq aims to do with even higher performance and, I would argue, even more elegant, reusable and concise code. In contrast, Seq comes with serious drawbacks. Besides DNA sequence processing, a typical bioinformatics workflow may include reading CSV files, feature extraction, doing statistical tests, plotting data and more. Being a domain specific language, Seq has zero chance of having packages for all these tasks available. By contrast, because BioSequences is simply a package in pure Julia, a user of BioSequences has all the tools of the broader Julia ecosystem available.\nMore realistically, Seq could survive by providing interoperation with other languages. But that way lies madness. In the bad old days, in a few hours of work, a bioinformatician would use awk to edit text files, pass them through Perl one-liners, run them through Python data processing before graphing using ggplot, all these languages duct-taped together using Bash. Workflows of that kind were inefficient, brittle, and required the programmer to learn a handful of different domain specific languages. 
Surely that path, the path that Seq shows us, must lie behind us.\nBen I can only echo Jakob in applauding the Seq authors’ efforts in trying to bring a performant programming language to life, and also applaud their efforts in providing reproducible results.\nThat said, before playing with these benchmarks, it’s fair to say I was fairly skeptical of the idea that what bioinformatics really needs is a domain specific language. When I consider critical problems the field faces, my mind goes to problems such as sometimes undocumented, sometimes poorly understood assumptions about biological systems hardcoded into tools (e.g. assembly pipelines that assume levels of ploidy or genome characteristics). I think of experiments that start with a single reference genome, and run pipelines of heuristic after heuristic, each with its own slew of parameters and biases.\nIn other words, the greatest gains are to be made by improving how we do things, not how fast we do them. Everyone wants speed; that’s why so many Big Data frameworks exist. Indeed, the Seq developers’ intention of having a language that is Pythonic in user friendliness, with speed approaching C, is also one of the main reasons Julia exists today. After Julia 1.0 was released, as a core developer of BioJulia, I considered the argument as to whether or not Julia is performant to be settled. It is.\nNow the question of whether Julia is fast enough for bioinformatics is also settled. I think the current goal of BioJulia is to provide clear, flexible, easy-to-use frameworks for doing bioinformatics analyses in a robust, reproducible and transparent manner, using clear concepts.\nAfter this review I think I would echo the same conclusions as Jakob: in brief, the results in the paper can be explained largely by our design choices in BioSequences.jl, which we maintain were the right calls. The exercise has been valuable, as it has resulted in several pull requests that improve on the performance of the string conversions and sequence encoding BioSequences.jl does; whilst we think the computational effort is justified, we also want it to be faster if it can be. Without Seq’s benchmarks to prompt us to examine this more closely, those improvements might not have been made for a long time, if at all.\nBenchmarking procedure Benchmarking environment We ran the code in the Vagrant virtual machine provided by the authors. In the virtual machine, all software needed to reproduce the benchmarks was there, except the Seq authors’ BioJulia code, which was provided upon request. The code was run on a 2018 MacBook Pro using Julia v1.0.3 and BioSequences v1.1.0. We could not determine the version of Seq used in the VM.\nBenchmarking method Due to a small virtual disk size in the VM, the whole dataset could not be downloaded, so we generated input data instead by taking the provided small test dataset HG00123_small.txt, and concatenating it to itself 16 times for a total of 2^24 lines, 1.3 GiB (a short sketch of this step is shown below). To benchmark, we ran each script 5 times consecutively. For Seq, we used the idiomatic Seq code “seq-id”, and did not include compilation time. For Julia, we included JIT compilation (which we estimated to be ~2 seconds for BioJulia benchmarks, and somewhat less for SeqJL benchmarks), but not package precompilation time. 
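The input-generation step described above could be reproduced with something along these lines (a hedged sketch; the input file name is the one given in the text, the output file name is ours):

# Build the ~1.3 GiB benchmark input by writing 16 copies of the small test file.
open("benchmark_input.txt", "w") do out
    data = read("HG00123_small.txt")
    for _ in 1:16
        write(out, data)
    end
end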
Note that unlike the Seq authors, we ran Julia with default optimization level and boundschecks enabled, as these are the most common settings to run Julia in.\n","date":1579820469,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1579820469,"objectID":"ebbc894ff5d5eeacc7025d31dc2bfe20","permalink":"https://BioJulia.github.io/post/seq-lang/","publishdate":"2020-01-23T23:01:09Z","relpermalink":"/post/seq-lang/","section":"post","summary":"Introduction In October 2019, a paper was published in Proceedings of the ACM on Programming Languages, introducing a new language for bioinformatics called Seq.\nIn it, the authors rightly recognize that the field of bioinformatics is in need of a high-performance, yet expressive and productive language. They present Seq, a domain specific compiled language, with the user friendliness of Python, the performance of C, and bioinformatics-specific data types and optimisations. As Julians, we consider their goal to be noble and well worth pursuit.","tags":["BioSequences","Benchmarking","Seq-Lang","DNA","RNA","Sequences"],"title":"On the performance and design of BioSequences compared to the Seq language","type":"post"},{"authors":null,"categories":null,"content":"","date":-62135596800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":-62135596800,"objectID":"4ba4b809a14808b5e20c82aa71829eed","permalink":"https://BioJulia.github.io/getting-started/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/getting-started/","section":"","summary":"","tags":null,"title":"Getting Started","type":"widget_page"}]
\ No newline at end of file
diff --git a/index.xml b/index.xml
index d377bf8..bfd9939 100644
--- a/index.xml
+++ b/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/post/biojl/index.html b/post/biojl/index.html
index 06c55a6..145e286 100644
--- a/post/biojl/index.html
+++ b/post/biojl/index.html
@@ -21,7 +21,7 @@
-
+
@@ -277,7 +277,7 @@
"author": {
"@type": "Person",
- "name": "Ben J. Ward"
+ "name": "Sabrina Ward"
},
"publisher": {
@@ -688,7 +688,7 @@ BioJulia Developer
Bio.jl is old and deprecated
- Ben J. Ward
+ Sabrina Ward
Bioinformatician at the Earlham Institute, UK.
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/post/index.xml b/post/index.xml
index b20bb16..ef72683 100644
--- a/post/index.xml
+++ b/post/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/post/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/project/biosequences/index.html b/project/biosequences/index.html
index c4f8754..2de8751 100644
--- a/project/biosequences/index.html
+++ b/project/biosequences/index.html
@@ -21,7 +21,7 @@
-
+
@@ -749,7 +749,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/project/biosymbols/index.html b/project/biosymbols/index.html
index a4d5532..5e5858a 100644
--- a/project/biosymbols/index.html
+++ b/project/biosymbols/index.html
@@ -21,7 +21,7 @@
-
+
@@ -740,7 +740,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/project/index.html b/project/index.html
index fce8820..4a553ab 100644
--- a/project/index.html
+++ b/project/index.html
@@ -23,7 +23,7 @@
-
+
@@ -372,7 +372,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/project/index.xml b/project/index.xml
index 10e9e2a..54f0529 100644
--- a/project/index.xml
+++ b/project/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/publication/index.xml b/publication/index.xml
index e92e8c1..177a2ac 100644
--- a/publication/index.xml
+++ b/publication/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/publication/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/publication_types/index.xml b/publication_types/index.xml
index 65f0227..2317598 100644
--- a/publication_types/index.xml
+++ b/publication_types/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/publication_types/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/amino-acid/index.xml b/tags/amino-acid/index.xml
index 6125dca..8549e18 100644
--- a/tags/amino-acid/index.xml
+++ b/tags/amino-acid/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/benchmarking/index.xml b/tags/benchmarking/index.xml
index 708af41..5b89c2d 100644
--- a/tags/benchmarking/index.xml
+++ b/tags/benchmarking/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/benchmarking/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/biosequences/index.xml b/tags/biosequences/index.xml
index 448898c..ab334b5 100644
--- a/tags/biosequences/index.xml
+++ b/tags/biosequences/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/biosequences/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/deprecation/index.xml b/tags/deprecation/index.xml
index e22eeac..5e1dd57 100644
--- a/tags/deprecation/index.xml
+++ b/tags/deprecation/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/deprecation/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/dna/index.xml b/tags/dna/index.xml
index 5952f31..9f2e5e8 100644
--- a/tags/dna/index.xml
+++ b/tags/dna/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/dna/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/index.xml b/tags/index.xml
index 6bc6634..011bf5e 100644
--- a/tags/index.xml
+++ b/tags/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/kmers/index.xml b/tags/kmers/index.xml
index 9da2ffe..4fcaec2 100644
--- a/tags/kmers/index.xml
+++ b/tags/kmers/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/package/index.xml b/tags/package/index.xml
index 176664c..d5209bb 100644
--- a/tags/package/index.xml
+++ b/tags/package/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/rna/index.xml b/tags/rna/index.xml
index 33200a9..0f0408e 100644
--- a/tags/rna/index.xml
+++ b/tags/rna/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/rna/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/seq-lang/index.xml b/tags/seq-lang/index.xml
index 0c77e2f..1d07d70 100644
--- a/tags/seq-lang/index.xml
+++ b/tags/seq-lang/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/seq-lang/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/sequences/index.xml b/tags/sequences/index.xml
index 39d80e3..beb42f1 100644
--- a/tags/sequences/index.xml
+++ b/tags/sequences/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/tags/sequences/
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/sequencing/index.xml b/tags/sequencing/index.xml
index 682f920..7776910 100644
--- a/tags/sequencing/index.xml
+++ b/tags/sequencing/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/tags/skipmers/index.xml b/tags/skipmers/index.xml
index 08e77d6..d76832b 100644
--- a/tags/skipmers/index.xml
+++ b/tags/skipmers/index.xml
@@ -6,7 +6,7 @@
- Ben J. Ward ·
+ Sabrina Ward ·
Powered by the
Academic theme for
diff --git a/talk/index.xml b/talk/index.xml
index 30a652b..1a63b48 100644
--- a/talk/index.xml
+++ b/talk/index.xml
@@ -5,7 +5,7 @@
https://BioJulia.github.io/talk/
On the performance and design of BioSequences compared to the Seq language
- Jakob Nybo Nissen, Ben J. Ward
+ Jakob Nybo Nissen, Sabrina Ward
@@ -1381,7 +1381,7 @@