DBF encodings in GeoDa

Xun Li edited this page May 8, 2019 · 7 revisions

When GeoDa saves a DBF file, it also creates a .CPG file if a specific encoding (e.g. GB2312 for Chinese characters) is set for the GeoDa Table. The encoding is either loaded from the original dataset or specified manually by the user through the Table->Encode menu.

A .CPG file is an optional file that specifies the code page, i.e. the character set, used for the text in the DBF file. From OGR's Shapefile/DBF driver page (https://www.gdal.org/drv_shapefile.html):

An attempt is made to read the code page setting in the .cpg file, or as a fallback in the LDID/codepage setting from the .dbf file, and use it to translate string fields to UTF-8 on read, and back when writing.

Valid Language Driver ID (LDID) values are defined in http://www.autopark.ru/ASBProgrammerGuide/DBFSTRUC.HTM, which also shows which code page each LDID value maps to.

Note that the DBF table of a Shapefile stores a Language Driver ID (LDID) value in its header. However, when a .CPG file is present, it takes priority over the LDID.
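The priority rule can be sketched in Python as follows. This is a minimal illustration, not GeoDa or OGR code: the function name `detect_encoding_hint` is hypothetical, the `LDID/<n>` return format is a convention chosen for this sketch, and the sketch assumes the LDID is stored in byte 29 of the DBF header and that the .cpg file uses a lowercase extension next to the .dbf file.

```python
from pathlib import Path
from typing import Optional

# Assumption: the Language Driver ID lives at byte offset 29 of the DBF header.
LDID_OFFSET = 29


def detect_encoding_hint(dbf_path: str) -> Optional[str]:
    """Return the raw encoding hint for a DBF file, or None if there is none.

    The .cpg file is checked first; the LDID byte is only a fallback.
    """
    dbf = Path(dbf_path)

    # 1. Highest priority: a non-empty .cpg file next to the .dbf file.
    cpg = dbf.with_suffix(".cpg")
    if cpg.exists():
        text = cpg.read_text(encoding="ascii", errors="ignore").strip()
        if text:
            return text

    # 2. Fallback: a non-zero LDID byte in the DBF header.
    with dbf.open("rb") as f:
        header = f.read(32)
    if len(header) > LDID_OFFSET and header[LDID_OFFSET] != 0:
        return f"LDID/{header[LDID_OFFSET]}"

    # 3. No encoding information available.
    return None
```

The returned value is only a hint (a code page name or an LDID number); it still has to be translated into an encoding name before any text is decoded.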

Judging from the GDAL/OGR source code, OGR translates an LDID value to a code page internally. A code page name (rather than an LDID value) can also be written directly in a .CPG file (e.g. "Big5" for Traditional Chinese). See the code here:

https://github.com/lixun910/gdal/blob/abcc93ccb4a712aed843f13f794a439982310f22/gdal/ogr/ogrsf_frmts/shape/ogrshapelayer.cpp#L233

Valid code page values include:

Windows code page: CPxxxx
ISO code page: ISO-8859-xxx
Others: e.g. UTF-8, Big5, etc.
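As a rough illustration of how a raw hint could be turned into one of these names, here is a minimal Python sketch. It is not the OGR implementation: `normalize_encoding` and `LDID_TO_CODE_PAGE` are hypothetical names, the LDID table is deliberately partial (see the DBF structure page for the full list), and both the handling of bare numeric values as Windows code pages and the ISO-8859-1 fallback for unknown LDID values are assumptions of this sketch.

```python
# Partial LDID table, for illustration only.
LDID_TO_CODE_PAGE = {
    1: 437,    # US MS-DOS
    2: 850,    # International MS-DOS
    3: 1252,   # Windows ANSI
    120: 950,  # Traditional Chinese (Big5)
    122: 936,  # Simplified Chinese (GBK)
}


def normalize_encoding(hint: str) -> str:
    """Map a raw hint ("LDID/3", "1251", "8859-1", "Big5") to an encoding name."""
    hint = hint.strip()
    upper = hint.upper()

    # LDID values are looked up in the table and become CPxxxx names.
    if upper.startswith("LDID/"):
        digits = hint[5:]
        code_page = LDID_TO_CODE_PAGE.get(int(digits)) if digits.isdigit() else None
        # Assumption of this sketch: unknown LDID values fall back to ISO-8859-1.
        return f"CP{code_page}" if code_page else "ISO-8859-1"

    # "8859-1" or "88591" becomes "ISO-8859-1".
    if upper.startswith("8859"):
        return "ISO-8859-" + hint[4:].lstrip("-")

    # Assumption: a bare number such as "1251" names a Windows code page.
    if hint.isdigit():
        return f"CP{hint}"

    # Anything else (e.g. UTF-8, Big5) is used as written.
    return hint
```

For example, `normalize_encoding("LDID/3")` yields `CP1252`, while a name such as `Big5` passes through unchanged.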

For developers, the logic for handling additional files when saving a DBF file is:

  1. don't create a .prj file
  2. don't create a .cpg file if no specific encoding is used in GeoDa
  3. only create a .cpg file when a specific encoding is used in GeoDa
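The three rules above can be sketched in Python as follows. This is an illustration only, not GeoDa's actual code (GeoDa is written in C++): the function names `sidecar_files_to_write` and `write_sidecars` are hypothetical, and an encoding of `None` or an empty string is assumed to mean "no specific encoding".

```python
from pathlib import Path
from typing import List, Optional


def sidecar_files_to_write(dbf_path: str, encoding: Optional[str]) -> List[Path]:
    """Return the additional files to create next to a saved DBF file."""
    dbf = Path(dbf_path)
    sidecars: List[Path] = []
    # Rule 1: a .prj file is never added to the list.
    # Rules 2 and 3: a .cpg file is added only when an encoding is set.
    if encoding and encoding.strip():
        sidecars.append(dbf.with_suffix(".cpg"))
    return sidecars


def write_sidecars(dbf_path: str, encoding: Optional[str]) -> List[Path]:
    """Write the .cpg file (if any) and return the paths that were written."""
    written: List[Path] = []
    for path in sidecar_files_to_write(dbf_path, encoding):
        # The .cpg file contains just the encoding name, e.g. "GB2312".
        path.write_text(encoding.strip(), encoding="ascii")
        written.append(path)
    return written
```

Keeping the decision (`sidecar_files_to_write`) separate from the file I/O makes the rules easy to check without touching the disk.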