Database format
We currently use a flat-file content DB, but we welcome new ideas or improvements. The flat-file DB has the following structure:
content
--resources.txt
--flags.txt
--nodes
----unique_node_tag
------title.txt
------summary.txt
------goals.txt
------dependencies.txt
------questions.txt (currently ignored)
------resources.txt
------see-also.txt
------flags.txt
------id.txt
------topics.txt (deprecated)
--shortcuts
----unique_node_tag
------goals.txt
------dependencies.txt
------questions.txt (currently ignored)
------resources.txt
------topics.txt (deprecated)
--courses
----unique_course_tag
------title.txt
------concepts.txt
Most of the files in the database are either plain text files or lists of field/value pairs. In the latter format, each item (e.g. a resource or a dependency) is given as an unordered list of field/value pairs. Different items are separated by one or more blank lines. See the resources list section for an example.
Any line beginning with the # symbol is a comment. For example,
# This is a comment.
some stuff # This is not a comment.
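The list format and the comment rule can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual parser: the function name `parse_items` is hypothetical, and it assumes that a field is separated from its value by the first colon, that only lines whose first character is # are comments, and that comment lines do not separate items. Items are kept as lists of pairs because a field such as location may appear more than once.

```python
def parse_items(text):
    """Parse blank-line-separated items made of 'field: value' lines."""
    items, current = [], []
    for raw in text.splitlines():
        if raw.startswith("#"):
            continue  # whole-line comment (assumed not to end an item)
        line = raw.strip()
        if not line:
            # One or more blank lines close the current item.
            if current:
                items.append(current)
                current = []
            continue
        # Split on the first colon only, so URLs and titles keep theirs.
        field, _, value = line.partition(":")
        current.append((field.strip(), value.strip()))
    if current:
        items.append(current)
    return items


sample = """\
# This is a comment.
key: pgm
title: Probabilistic Graphical Models: Principles and Techniques
url: http://pgm.stanford.edu/

key: coursera_hinton
free: 1 # This is not a comment.
"""
items = parse_items(sample)
```

Here `items` holds two entries, and the trailing text after `free: 1` stays part of the value because the # does not begin the line.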
The resources list in content/resources.txt contains metadata about the resources the user is referred to, such as textbooks, papers, or online lectures. We've found that we use certain resources, such as textbooks, over and over again, while other resources, such as individual papers, are only used once. To handle both situations, the global resources list defines default values for a given resource's fields, and the node-specific content/nodes/node_name/resources.txt overrides those default values. By convention, we only include resources in the global content/resources.txt if they are likely to be used multiple times.
Each resource is given as a collection of unordered field/value pairs, and resources are separated by blank lines. All resource entries must specify a key field, which is the tag by which the resource is referenced in the node-specific resources.txt. Other fields which are typically listed in resources.txt include:
- title, the label which is shown to the user (e.g. the name of a textbook)
- authors, the list of authors of the resource. (Multiple authors are separated by the word "and".)
- resource_type, the general category of the resource, e.g. paper, online lectures, etc. Currently this isn't used, but we are considering having HTML templates associated with each resource type which determine how they're rendered.
- free, which indicates whether the resource is freely available
- url, a URL representing the resource in question, e.g. the home page for a textbook or the welcome page for a Coursera course.
- specific_url_base, the base URL for the location resources (see location under node-specific resources)
- level, a string giving the overall maturity level of the resource. While any string is allowed, by convention the options are introductory, advanced undergraduate, graduate, expert, and reference. See Editing Guidelines for the meaning of these categories.
- extra, additional instructions to the user
The required fields are key, title, and resource_type. The other fields are all optional.
There are some other fields which may be specified here, but by convention are specified in the node-specific resources.txt file. These are described in the node-specific resources section.
Here are some example entries:
key: pgm
title: Probabilistic Graphical Models: Principles and Techniques
authors: Daphne Koller and Nir Friedman
url: http://pgm.stanford.edu/
resource_type: textbook
free: 0
level: graduate

key: coursera_hinton
title: Coursera: Neural Networks for Machine Learning
authors: Geoffrey Hinton
url: https://www.coursera.org/course/neuralnets
resource_type: online lectures
free: 1
note: Click on "Preview" to see the videos.
level: advanced undergraduate
See here for more information on what resources represent.
Each concept node lives in a subdirectory of content/nodes. The concept has two identifiers: a human-readable tag which is used in the hand-annotated dependencies and see-also links, and a unique identifier used in the databases. The latter should stay fixed even if the human-readable tag is modified. This way, any graphs a user has saved will still be consistent even if the tag is changed.
The information about a node is stored in plain-text files inside the node's directory. These files are as follows:
- id.txt, the unique identifier. This is machine generated and shouldn't be modified.
- title.txt, a single line giving the title of the node which is shown to the user
- summary.txt, a 2-3 sentence summary of what the concept is and what it is used for
- goals.txt, a list of what the user should understand or be able to do after learning the concept, or questions they should be able to answer
- dependencies.txt, a list of the concept nodes that the current one directly depends on
- resources.txt, a list of resources the user can consult to learn about the topic
- see-also.txt, a list of pointers to related concepts
- flags.txt, a list of caveats about the concept
- questions.txt, a list of questions for the user to think about. We are currently debating what to include here, so you can ignore it for now.
- topics.txt, a listing of the specific topics covered by the concept node. This is deprecated, since it is now subsumed by goals.txt.
The files title.txt, summary.txt, topics.txt and questions.txt are currently treated as plain text files, but we're considering using Markdown or Textile formatting. The remaining files have a particular structure described below.
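As a rough illustration of the plain-text files, the sketch below reads whichever of them exist in a node directory. The function name `load_plain_text` is hypothetical, and the choices to strip surrounding whitespace and to skip missing files are assumptions rather than documented behavior.

```python
from pathlib import Path

# The four files the text above describes as plain text.
PLAIN_TEXT_FILES = ("title.txt", "summary.txt", "topics.txt", "questions.txt")


def load_plain_text(node_dir):
    """Return {file stem: contents} for the plain-text files that exist."""
    node_dir = Path(node_dir)
    fields = {}
    for name in PLAIN_TEXT_FILES:
        path = node_dir / name
        if path.is_file():  # assumed: absent files are simply skipped
            fields[path.stem] = path.read_text(encoding="utf-8").strip()
    return fields
```

A call such as `load_plain_text("content/nodes/covariance_matrices")` would return a dictionary with keys like `title` and `summary` for the files present in that directory.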
See here for more information on the philosophy behind concept nodes.
The file content/nodes/node_name/dependencies.txt gives a list of the concepts which a particular concept depends on. Each dependency is given as a list of field/value pairs, and the dependencies are separated by blank lines. There are three fields:
- tag, the human-readable tag for the required concept
- reason, the reason that concept is required. It is shown in the Context section of the learning view, which lists the dependencies for a given concept, or what the concept is needed for. This field is optional, but it generally should be given unless it is obvious from the titles that one concept is an elaboration of the other.
- shortcut, which specifies whether a shortcut can be used in lieu of the full content node. See the editing guidelines for more discussion of shortcuts and the shortcuts section for the format. The default value is 0 (false), so the only meaningful value to specify is 1 (true).
The ordering of the dependencies in the file is significant: it determines the order in which the concepts are presented in the learning view. See here for more details.
Here is an example, for the covariance_matrices node:
tag: covariance

tag: positive-definite-matrices
reason: The covariance matrix is a PSD matrix.
shortcut: 1
See here for more information on what the dependencies represent.
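To make the defaults concrete, here is a minimal sketch of reading a dependencies.txt file into an ordered list. The function name `parse_dependencies` and the dictionary layout are hypothetical; the sketch assumes the first colon separates field from value, represents a missing reason as None, and treats any shortcut value other than 1 as false.

```python
def parse_dependencies(text):
    """Return dependencies in file order, with defaults filled in."""
    deps, current = [], {}
    # The extra empty string flushes the final item.
    for raw in text.splitlines() + [""]:
        if raw.startswith("#"):
            continue  # whole-line comment
        line = raw.strip()
        if not line:
            if current:
                deps.append({
                    "tag": current["tag"],
                    "reason": current.get("reason"),  # optional field
                    "shortcut": current.get("shortcut", "0") == "1",
                })
                current = {}
            continue
        field, _, value = line.partition(":")
        current[field.strip()] = value.strip()
    return deps


example = """\
tag: covariance

tag: positive-definite-matrices
reason: The covariance matrix is a PSD matrix.
shortcut: 1
"""
deps = parse_dependencies(example)
```

Because the result is a list, the file order that drives the learning view is preserved.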
The file content/nodes/node_name/resources.txt gives a list of resources where you can learn about a concept. The list should be interpreted as "read one of the following," rather than "read all of the following."
There are some resources (such as textbooks or online courses) which are used over and over again, and others (such as individual papers) which are only used once. The former are defined in the global resources list. These may be referred to here by specifying the source field, which will pull in the default values associated with that resource. For unique resources, simply don't specify a source field, and instead specify each of the values individually.
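The override behavior can be pictured as a dictionary merge. The sketch below is an assumption about how the lookup could work, not the project's code: `resolve_resource` is a hypothetical name, entries are modeled as plain dictionaries, and the location value is made up for illustration.

```python
def resolve_resource(entry, global_resources):
    """Layer a node-specific entry over the defaults named by its source."""
    source = entry.get("source")
    if source is None:
        return dict(entry)  # unique resource: nothing to inherit
    # Raises KeyError if the source is not in the global list.
    resolved = dict(global_resources[source])
    resolved.update(entry)  # node-specific values win over defaults
    return resolved


global_resources = {
    "pgm": {
        "key": "pgm",
        "title": "Probabilistic Graphical Models: Principles and Techniques",
        "resource_type": "textbook",
        "level": "graduate",
    }
}
# Hypothetical node-specific entry; the location is illustrative only.
entry = {"source": "pgm", "location": "Chapter 3", "mark": "star"}
resolved = resolve_resource(entry, global_resources)
```

The resolved entry carries the title and level from the global list alongside the node-specific location and mark.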
The resources are given as lists of field/value pairs, and the resources are separated by blank lines. The following fields are conventionally specified in the node-specific resources list:
- location, the location(s) within the resource which the user should read/watch. If there are multiple locations, each one should be listed as a separate location field. If there is a URL associated with the location, put it in brackets at the end of the line. If the resource has a specific_url_base specified (see global resources.txt), it will be prepended to each of the location URLs. (This can be overridden by starting the URL with http: or https:.)
- edition, the edition number of a textbook. Currently this isn't used, but we are planning to allow resources to be added for multiple editions of a textbook, and the user can choose which one is to be displayed.
- mark, an annotation for the node. Currently, the only mark is star, which indicates that the resource is well-written and fits nicely with the structure of the concept map. (Generally, we're expecting that the user would start with a starred resource, and maybe go to one of the other ones for additional clarification.)
- dependencies, a comma-separated list of tags representing additional concepts that resource depends on which aren't already given by the graph structure
In addition, all of the fields listed in the global resources list may be specified here as well. This is often the case for unique resources.
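The location rule (bracketed URL at the end of the line, optional base prefix) could be handled along these lines. This is a sketch under stated assumptions: `parse_location` is a hypothetical name, the base is joined by plain string concatenation, and the base URL in the usage lines is a placeholder rather than a real resource address.

```python
def parse_location(value, specific_url_base=None):
    """Split a location value into (text, url); url is None if absent."""
    value = value.strip()
    text, url = value, None
    if value.endswith("]") and "[" in value:
        # The URL is whatever sits in the final pair of brackets.
        text, _, url = value[:-1].rpartition("[")
        text, url = text.strip(), url.strip()
    # An absolute URL (http: or https:) overrides the base.
    if url and specific_url_base and not url.startswith(("http:", "https:")):
        url = specific_url_base + url
    return text, url


base = "http://example.org/book/"  # placeholder base URL
relative = parse_location('Section "Matrix operations" [section-MO.html]', base)
absolute = parse_location("Lecture [https://www.khanacademy.org/math/linear-algebra]", base)
no_url = parse_location("Section 2.4, pages 67-70", base)
```

In the three calls, the first URL gets the base prepended, the second is left untouched because it starts with https:, and the third has no URL at all.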
Here is part of resources.txt for the matrix_multiplication node:
source: strang
edition: 4
location: Section 2.4, up to "Block matrices and block multiplication," pages 67-70
mark: star

source: khan_academy_linear_algebra
location: Lecture sequence "Transformations and matrix multiplication" [https://www.khanacademy.org/math/linear-algebra/matrix_transformations/composition_of_transformations]
mark: star
extra: Watch the lecture sequence "Functions and linear transformations" if you're not used to thinking of matrices as linear transformations.

source: beezer
edition: 3
location: Section "Matrix operations" [section-MO.html]
location: Section "Matrix multiplication" [section-MM.html]
Here is an example of a unique resource:
resource_type: paper
authors: Yann LeCun and Leon Bottou and Yoshua Bengio and Patrick Haffner
title: Gradient-based learning applied to document recognition
url: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
free: 1
mark: star
See here for more information about what resources represent.
Finally, the file see-also.txt gives pointers to other concept nodes related to the current one. Common examples include techniques which improve on the current one, issues to watch out for, applications where the concept is used, or concepts which specialize or generalize the current one. The file uses Textile-like formatting for both lists and links. Any links corresponding to nonexistent concepts are ignored. Here is the example for gaussian_processes:
* Gaussian processes have a variety of uses in machine learning, including:
** "regression":gaussian_process_regression
** "classification":gaussian_process_classification
** "black-box optimization":bayesian_optimization (where we only get to evaluate the function, and doing so is expensive)
** "reinforcement learning":gaussian_processes_for_reinforcement_learning
* Techniques for "constructing kernel functions":constructing_kernels
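The link syntax in the example above can be recognized with a short regular expression. This is only a sketch: `extract_links` is a hypothetical name, the pattern assumes tags consist of letters, digits, underscores, and hyphens, and "ignored" is interpreted here as simply leaving the link out of the result.

```python
import re

# "label":tag, where the tag is assumed to use [A-Za-z0-9_-] only.
LINK_RE = re.compile(r'"([^"]+)":([A-Za-z0-9_-]+)')


def extract_links(text, known_tags):
    """Return (label, tag) pairs whose tag names an existing concept."""
    return [
        (label, tag)
        for label, tag in LINK_RE.findall(text)
        if tag in known_tags
    ]


see_also = """\
** "regression":gaussian_process_regression
** "classification":gaussian_process_classification
"""
links = extract_links(see_also, {"gaussian_process_regression"})
```

Only the regression link survives here, since the classification tag is not in the set of known tags passed in.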
So far, most of the concepts we include in the graph are mathematical and relatively well understood. These fit the most naturally into our graph structure. For concepts which don't fit so nicely for one reason or another, we flag them, and a note is displayed to the user. The flags.txt in the main directory is a list of items, each of which has key and text fields. Then, for each node, there is an optional flags.txt file where each line is one of the keys. The corresponding text is displayed to the user. Right now, flags.txt consists of a single item:
key: active_research
text: This concept is an active area of research, so our understanding of it may change considerably.
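The relationship between the two flags.txt files amounts to a key lookup, sketched below. The name `flag_texts` is hypothetical, the global list is modeled as a dictionary from key to text, and skipping keys that are missing from the global list is an assumption.

```python
def flag_texts(node_flags, global_flags):
    """Map each key line of a node's flags.txt to its display text."""
    texts = []
    for raw in node_flags.splitlines():
        key = raw.strip()
        if not key or raw.startswith("#"):
            continue  # blank line or whole-line comment
        if key in global_flags:  # assumed: unknown keys are skipped
            texts.append(global_flags[key])
    return texts


global_flags = {
    "active_research": "This concept is an active area of research, "
                       "so our understanding of it may change considerably.",
}
texts = flag_texts("active_research\n", global_flags)
```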
Sometimes one concept only requires understanding another at a very general level. In these cases, the solution is to add a shortcut, which is based on the original concept node, but with a reduced set of dependencies and a different set of resources. The format is simple: the shortcuts directory at the top level contains a list of subdirectories, which should be human-readable tags matching those in the nodes directory. Each shortcut subdirectory contains the files dependencies.txt, resources.txt, and goals.txt, each of which overrides the corresponding file from the nodes directory and has the same format. Note that the dependencies for the shortcut node are required to be a subset of the dependencies for the original concept node.
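The subset requirement is easy to check mechanically. The following sketch assumes the dependency tags of both files have already been collected into lists; the function name `extra_shortcut_dependencies` is hypothetical.

```python
def extra_shortcut_dependencies(node_tags, shortcut_tags):
    """Return shortcut dependency tags that the full node does not have.

    An empty result means the shortcut satisfies the subset requirement.
    """
    return sorted(set(shortcut_tags) - set(node_tags))


node_tags = ["covariance", "positive-definite-matrices"]
ok = extra_shortcut_dependencies(node_tags, ["covariance"])
bad = extra_shortcut_dependencies(node_tags, ["covariance", "matrix_multiplication"])
```

The first call returns an empty list, while the second reports the tag that would violate the requirement.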
A large fraction of users are likely to have already taken basic undergrad courses in subjects like linear algebra and probability theory. For subjects which are sufficiently standardized across institutions, we specify the list of concepts covered, so that those concepts can be hidden from users who specify that they've already taken the course. More details here.
Inside the courses directory is a list of subdirectories, whose names are human-readable tags analogous to the concept tags. Each of these subdirectories contains title.txt, which gives the course title which is displayed to the user, and concepts.txt, which is a listing of all the concepts covered by the course. In concepts.txt, each line is a single concept tag.
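As an illustration of how course listings could be used to hide concepts, the sketch below filters a list of concept tags by the courses a user says they have taken. Everything here is hypothetical: the function name `visible_concepts`, the in-memory mapping from course tag to concept tags, and the course tag `linear_algebra` used in the example.

```python
def visible_concepts(concept_tags, course_concepts, courses_taken):
    """Drop concepts covered by any course the user has already taken."""
    hidden = set()
    for course in courses_taken:
        # Unknown course tags contribute nothing (an assumption).
        hidden.update(course_concepts.get(course, ()))
    return [tag for tag in concept_tags if tag not in hidden]


# Hypothetical course tag mapped to the lines of its concepts.txt.
course_concepts = {"linear_algebra": ["matrix_multiplication"]}
remaining = visible_concepts(
    ["matrix_multiplication", "gaussian_processes"],
    course_concepts,
    ["linear_algebra"],
)
```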