-
Notifications
You must be signed in to change notification settings - Fork 0
/
infer-schema.1
135 lines (135 loc) · 3.44 KB
/
infer-schema.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
.\" Generated by scdoc 1.11.2
.\" Complete documentation for this program is not available as a GNU info page
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.nh
.ad l
.\" Begin generated content:
.TH "infer-schema" "1" "2023-11-19"
.P
.SH NAME
.P
infer-schema - generate JSON Schema from CSV files
.P
.SH SYNOPSIS
.P
\fBinfer-schema\fR file [options]
.P
.SH DESCRIPTION
.P
The infer-schema utility generates JSON Schema (draft 7) corresponding to a CSV
file, such that the CSV file is valid against the generated schema.\& The JSON
Schema is shown on standard output and can be piped or written to a file with
the \fB--output\fR option.\&
.P
The following options are available:
.P
\fB-h\fR, \fB--help\fR
.RS 4
Show usage information
.P
.RE
\fB--enum-threshold\fR \fIthreshold\fR
.RS 4
The \fIthreshold\fR of unique items up to which enum categories should be
populated in the JSON schema.\& Default threshold is 10.\&
.P
.RE
\fB--enum-fields\fR \fIfields\fR
.RS 4
Forces \fIfields\fR (comma-separated) to be classed as an enum, useful for
including fields that do not meet enum \fIthreshold\fR criteria
.P
.RE
\fB--bound-types\fR \fItypes\fR
.RS 4
Comma-separated \fItypes\fR for which bounds should be encoded into the schema,
default is '\&number,integer'\&, for which minimum / maximum are determined.\& For
strings minLength and maxLength are determined.\& Set \fB--bound-types\fR=none to
disable bound detection.\& Allowed bound types are \fBinteger\fR, \fBnumber\fR and
\fBstring\fR
.P
.RE
\fB--explicit-nulls\fR
.RS 4
By default, fields that have null and another type are typed as non-required
with the non-null type.\& This setting makes the nulls explicit by dual typing
a field with the non-null type.\&
.P
As an example, consider a field '\&count'\& that has the following values
20,NA,30.\& By default, this field will be typed as '\&integer'\& and will not be
required.\& With \fB--explicit-nulls\fR set, this will be typed as [integer, null]
.P
.RE
\fB-o\fR \fIoutput\fR, \fB--output\fR \fIoutput\fR
.RS 4
Save schema to \fIoutput\fR file
.P
.RE
.SH EXAMPLES
.P
Given this CSV file called \fIdates.\&csv\fR
.P
.nf
.RS 4
date,num_cases
2022-11-11,4
2022-11-12,5
2022-11-13,6
,10
2022-11-15,10
2022-11-16,5
2022-11-17,3
2022-11-18,2
2022-11-19,10
2022-11-20,11
2022-11-21,4
2022-11-22,20
2022-11-23,
2022-11-24,9
2022-11-25,4
2022-11-26,21
2022-11-27,99
2022-11-28,59
2022-11-30,45
.fi
.RE
.P
Running '\&infer-schema dates.\&csv'\& gives the following output
.P
.nf
.RS 4
{
"$schema": "https://json-schema\&.org/draft-07/schema",
"description": "Description of tests/dates\&.csv",
"properties": {
"date": {
"description": "Description for column date",
"format": "date",
"type": "string"
},
"num_cases": {
"description": "Description for column num_cases",
"maximum": 99,
"minimum": 2,
"type": "integer"
}
},
"required": [],
"title": "JSON Schema for tests/dates\&.csv"
}
.fi
.RE
.P
Here we see that infer-schema determines minimum and maximum values for integer
columns.\& For strings, minLength and maxLength are determined.\& This is controlled
by the \fB--bound-types\fR setting, which can be set to \fBnone\fR to turn off bounds
detection.\&
.P
By default, any column with upto 10 (default \fB--enum-threshold\fR) unique values
is considered categorical and expressed as a JSON Schema enum type.\& Columns with
more than 10 values can be forced to be of enum type by using \fB--enum-fields\fR.\&
.P
.SH BUGS
.P
Report bugs at https://github.\&com/abhidg/infer-schema/issues