Some introductory sentence(s). The data set and the task are relatively fixed, so you probably don't have much to say about them (unless you modified them). If you haven't changed the application much, there's also not much to say about it. The following structure therefore only covers preprocessing, feature extraction, dimensionality reduction, classification, and evaluation.
Which evaluation metrics did you use and why? Which baselines did you use and why?
How do the baselines perform with respect to the evaluation metrics?
Is there anything we can learn from these results?
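To make the baseline comparison concrete, here is a minimal, purely illustrative sketch (toy labels, scikit-learn's majority-class dummy as the assumed baseline) showing why reporting accuracy alone can be misleading on imbalanced data and why a second metric such as macro F1 is worth including:

```python
# Hypothetical sketch: a majority-class baseline evaluated with two metrics.
# The data and the choice of baseline are assumptions for illustration only.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

X = np.zeros((8, 1))                     # features are ignored by the dummy
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced toy labels

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print("accuracy:", accuracy_score(y, pred))              # looks decent
print("macro F1:", f1_score(y, pred, average="macro"))   # reveals the problem
```

The baseline scores 0.75 accuracy just by always predicting the majority class, while its macro F1 is far lower because the minority class is never predicted.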
I'm following the "Design Decisions - Results - Interpretation" structure here, but you can also use one subheading per preprocessing step to organize things (depending on what you do, that may give a clearer structure).
Which kind of preprocessing steps did you implement? Why are they necessary and/or useful down the road?
Maybe show a short example of what your preprocessing does.
Probably no real interpretation is possible here, so feel free to leave this section out.
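Such a preprocessing example can be very short. The sketch below assumes a text-classification setting (lowercasing, URL removal, simple tokenization); the concrete steps and the regexes are illustrative assumptions, not a prescribed pipeline:

```python
# Hypothetical sketch of a simple text preprocessing step:
# lowercase, drop URLs and punctuation, then tokenize on whitespace.
import re

def preprocess(text):
    text = text.lower()                        # normalize case
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s#@]", "", text)  # drop punctuation
    return text.split()                        # whitespace tokenization

print(preprocess("Check THIS out!!! https://example.com #NLP"))
# -> ['check', 'this', 'out', '#nlp']
```

Showing one input/output pair like this makes it immediately clear what information the later pipeline stages do and do not see.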
Again, structure this either along Design Decisions - Results - Interpretation or by individual feature; it's up to you.
Which features did you implement? What's their motivation and how are they computed?
Can you say something about how the feature values are distributed? Maybe show some plots?
Can we already guess which features may be more useful than others?
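A lightweight way to inspect feature distributions, even before plotting, is to compare per-class summary statistics. The sketch below uses invented toy data (an assumed "tweet length" feature with different means per class) purely to illustrate the idea:

```python
# Hypothetical sketch: per-class summary statistics for one feature.
# The data is synthetic; in the report, use your actual feature values.
import numpy as np

rng = np.random.default_rng(42)
lengths_pos = rng.normal(120, 20, size=500)  # assumed "positive" class
lengths_neg = rng.normal(80, 25, size=500)   # assumed "negative" class

for name, values in [("positive", lengths_pos), ("negative", lengths_neg)]:
    print(f"{name}: mean={values.mean():.1f}, std={values.std():.1f}, "
          f"median={np.median(values):.1f}")
```

If the per-class means are clearly separated relative to the spread, that already hints the feature will be useful; a histogram per class (e.g. with matplotlib) makes the same point visually.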
If you didn't use any because you only have a few features, just state that here. Even in that case, you can apply some dimensionality reduction in order to analyze how helpful the individual features are during classification.
Which dimensionality reduction technique(s) did you pick and why?
Which features were selected / created? Do you have any scores to report?
Can we somehow make sense of the dimensionality reduction results? Which features are the most important ones and why may that be the case?
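One technique that directly yields reportable per-feature scores is univariate feature selection. The sketch below is an assumed setup with scikit-learn's `SelectKBest` and ANOVA F-scores on synthetic data (one informative feature, one noise feature), chosen only to illustrate what such scores look like:

```python
# Hypothetical sketch: univariate feature selection with per-feature scores.
# Synthetic data: feature 0 correlates with the label, feature 1 is noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
informative = y + rng.normal(0, 0.3, size=200)   # tracks the label
noise = rng.normal(0, 1, size=200)               # pure noise
X = np.column_stack([informative, noise])

selector = SelectKBest(f_classif, k=1).fit(X, y)
print("F-scores:", selector.scores_)         # informative feature scores higher
print("selected:", selector.get_support())   # mask of kept features
```

The scores give you exactly the kind of numbers asked for above: a ranking of features that you can then try to explain from the task.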
Which classifier(s) did you use? Which hyperparameter(s) (with their respective candidate values) did you look at? What were your reasons for this?
The big finale begins: What are the evaluation results you obtained with your classifiers in the different setups? Do you overfit or underfit? For the best selected setup: How well does it generalize to the test set?
How important are the individual hyperparameter settings for the results? How good are we overall? Can this be used in practice, or are we still too bad? Anything else we may have learned?
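The classifier/hyperparameter/test-set workflow described above can be sketched in a few lines. This is an assumed setup (synthetic data, a linear SVM, and a grid over `C` as the hypothetical hyperparameter), not a prescribed configuration:

```python
# Hypothetical sketch: grid search with cross-validation on the training
# split, then a single final evaluation on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

grid = GridSearchCV(LinearSVC(dual=False), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print("best C:", grid.best_params_["C"])
print("validation accuracy:", round(grid.best_score_, 3))
print("test accuracy:", round(grid.score(X_test, y_test), 3))
```

Comparing the cross-validation score of the best setup against its test score is a simple way to discuss over- or underfitting: a large gap between the two suggests the model (or the model selection) overfit the training data.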