1 Introduction Whole game
2 Data visualization - artifacts - communication using data - scenarios - geographical data - fish populations in different water bodies - water quality at different bathing water sites - pollution levels across areas - time series data - climate data over time - networks - water network data -- discovering upstream and downstream - transportation - links between areas - relationships between variables - iris dataset - type of graph is determined by the variables you're trying to visualize - applications - advertising - health care - business analytics - common visualisations - tables (summary, statistical, comparisons) - charts (lines, bars, box, plots, regressions)
- exercises
3 Workflow: basics
- JVM (different arch.es )
- IDEs (cider, calva, cursive)
- tooling (kindly, toolkit/notebooks -- clay, clerk, portal)
- interop (python, R -- config and installation of external environments)
- publishing
- reports, blog posts, presentations
- maintaining history and making data available
- technical details but also what should be included in the contents 4 Data transformation 5 Workflow: code style 6 Data tidying 7 Workflow: scripts and projects 8 Data import 9 Workflow: getting help
- How much is worth including as background?
- How much should we assume readers know about Clojure?
- Write for an audience who know basic Clojure syntax, only reference language guides
- Already lots of good resources. E.g. official Clojure guide on language basics
- How to use this cookbook? Who is it for, what is the best way to use it?
- Possible topics to include:
- Setting up a Clojure environment
- Installing and using Clojure and its libraries
- Executing Clojure code
- Tour of the tool we're using (clay? clerk?)
- Working with data in Clojure
- Accessing subsets of larger data structures
- Types of collections in Clojure, hashed (maps, sets) vs. sequential collections (lists, vectors)
- Tour of useful collections operations?
filter
,take
,drop
,nth
,subvec
- Maybe a good place to introduce tablecloth? Way nicer tool for common dataset manipulation tasks
- Comparisons to other languages?
- E.g. how does negative indexing work? Some languages index backward from the end (python), some drop the element at that index (R)
- EDN data format
- Reading from a CSV file
- Normalize values into consistent formats and types
- Transforming headers
- Specifying separators if it's not a
,
- Specify file encoding
- Reading from a URL, JSON
- Excel, other file formats? spss?
- Reading from a database
- Connecting to a database
- SQL results, SPARQL results
- Generating random sequences
- Generating repeatable sequences of dummy data
- How do you know it's messy?
- Multiple types mixed in one column
- Multiple formats for a thing that's supposed to have one (e.g. phone numbers, postal codes)
- Missing values
- Arbitrary values meant to indicate missing (e.g. "NONE", "N/A", false, etc.)
- Parsing dates, formatting dates
- Parsing numbers, booleans
- Handling missing/blank cells
- Writing to a CSV file
- Leave data in Clojure files
- Notebook artifacts
- HTML pages
- Other formats, options for exporting notebooks? PDFs?
- Partial artifacts, e.g. export just a graph
- Writing to a database?
- Sorting columns, rows, tables
- Defaults and custom sorting functions
- Selecting one column or multiple columns
- Randomizing order
- Repeatable randomisation
- De-duplicating values
- Finding unique rows
- Adding computed columns to data
- e.g. converting a column with numbers to a category (>5 "yes", <5 "no"), summing multiple columns into a new one
- Removing columns
- Transforming values
- Rearranging order of columns
- Renaming columns
- Filtering rows
- Single filter, multiple filters
- Aggregating rows (counts, groups)
- Concatenating datasets
- Merging datasets
- When column headers are the same or different, on multiple columns
- Converting between wide and long formats?
- Summarizing data (mean, standard deviation, confidence intervals etc.)
- Working with sequential data
- Smoothing out data
- Calculating a moving average
- Averaging a sequence in blocks
- Run length encoding?
- Filling
nil
s with last non-nil
value?
- Smoothing out data
- Vega and vega-lite viewers in clerk/clay
- Introduce Hanami? How is the community feeling about Hanami? Do people like it as a go-to tool for data viz or is there something else?
- Getting data into the right format for these libraries (EDN maps)
- Bar graphs
- Of values
- Of counts
- Line graphs
- Scatterplots
- Handling overplotting (overlapping points)
- Distributions
- Histograms and density plots
- With multiple groups
- Box plots
- With mean
- Histograms and density plots
- Q-Q Plot
- Correlation matrix
- Network graph
- Heat map
- 3D Scatterplot
- Tree plot (dendrogram)
- Vector field
- Mosaic Chart
- Pie chart
- Maps
- Geo data formats (GeoJSON, shapefiles)
- Choropleths
- Careful splitting up data for choropleths, see jenks algorithm — you cannot map totals using a choropleth thematic mapping technique (Kenneth Field)
- Layering graphs
- Add regression lines
- Adding a horizontal line
- Adding a vertical line
- Adding lines for the mean
- Means and error bars
- On bar graphs and line graphs
- Over grouped bars
- Over individual grouped bars
- Error bars for within-subjects variables
- Error bands
- Facets and matrices
- Independent scales per facet
- Organizing facets (vertical/horizontal stacks)
- Labelling facets
- Lines on facets, different line on different facets (?)
- Graph size
- Gridlines
- Colors
- Keep colour blindness in mind, colour is last choice for conveying information
- Mapping variables to colours
- Shapes and line types
- Adding titles to graphs
- Labelling axes
- Fonts
- Changing direction of axes
- Changing order of items on a discrete axis
- Transforming axes (e.g. log/linear)
- Labelling data points
- Legends
- Changing labels, order of items
- Organizing multiple graphs
- Saving graphs
- Image formats, SVG, PDF
- Vega embed
Not sure how much of this it makes sense to include — need to balance teaching what these even mean with showing how to do them in Clojure, assuming someone already knows why they would use it and how to interpret the results. Maybe better as a whole separate project? Clojure doesn't have as many sophisticated libraries that just do all these with a single command.
- Linear regression and correlation
- Logistic regression
- p-values
- t-test, z-test, ANOVA
- Confidence intervals
- Resampling
- Bootstrapping, Jacknifing
- Frequency tests
- Chi-square, Fisher’s Exact, Exact binomial, McNemar’s test
- Homogeneity of variance
- Levene’s, Bartlett’s, Fligner-Killeen test
- Inter-rater reliability
- Cohen’s Kappa, weighted Kappa, Scott's pi, Fleiss’s Kappa, Conger’s Kappa, intraclass correlation coefficient
- Types of numbers, when to use which ones
- Integers (
java.lang.Long
,clojure.lang.BigInt
) - Decimals (
java.lang.Double
,java.math.BigDecimal
)
- Integers (
- How to indicate different types of numbers, e.g.
1
vs1.0
vs.1.0M
- Converting between types
- Gotchas to watch out for, e.g. trying to show a number that has infinitely repeating decimal places ⟶
(/ 1.0M 6.0M)
wil throw an error (great guide) - Rounding numbers
- Comparing numbers in Clojure, i.e.
(= 1.0M 1.0 1) => true
, not the case in every language- Comparisons with
nil
, booleans
- Comparisons with
- Not sure how much is useful to include.. maybe just quick mention of regex support/syntax and
clojure.string
standard lib? - String interpolation, using variables in strings
- How to work with Clojure/Java time formats
- Adding and subtracting times/dates
- Converting between date and time formats
- Comparing dates (relatively, i.e. checking for earlier/later than a certain date/time)
- Working with python libraries in Clojure (see libpython-clj)
- How to include python libraries with libpython-clj
- Using python classes as Clojure namespaces
- Examples of some of the most common ones that don't have equivalents in Clojure? (ML/NLP libraries?)
- Mention some common ones that do have equivalents (data manipulation, visualization?)
- Benchmarking blocks of code
- Tips for avoiding performance bottlenecks
- Careful with realizing large lazy sequences
- Careful holding on to a reference to the head of a sequence
- Data Science Walkthrough by Daniel
- Wrangling data with Tablecloth by Mei Beisaron
- Intro to Statistical Inference by Rohit Thadani
- Confidence Intervals from Normalized Data: A correction to Cousineau (2005)
- Confidence intervals in within-subject designs: A simpler solution to Loftus and Masson’s method
- Using confidence intervals in within-subject designs
Choropleths
Python interop