Skip to content

Latest commit

 

History

History
130 lines (101 loc) · 6.46 KB

2012-08-06-cfstringtransform.md

File metadata and controls

130 lines (101 loc) · 6.46 KB
layout title ref framework rating description
post
CFStringTransform
CoreFoundation
9.1
<tt>NSString</tt> is the crown jewel of Foundation. But as powerful as it is, we would be remiss no to mention its toll-free bridged cousin, <tt>CFMutableString</tt>. Or more specifically, <tt>CFStringTransform</tt>.

There are two indicators that can tell you pretty much everything you need to know about how nice a language is to use:

  1. API Consistency
  2. Quality of String Implementation

NSString is the crown jewel of Foundation. In an age where other languages still struggle to handle Unicode correctly, NSString can not only handle anything you throw at it, but it can turn around and parse that input into linguistic tags. It's unfairly good.

But as powerful as NSString / NSMutableString are, we would be remiss not to mention their toll-free bridged cousin, CFMutableString. Or more specifically, CFStringTransform.

As denoted by the CF prefix, CFStringTransform is part of Core Foundation, which is a C, rather than Objective-C API. The function returns a Boolean for whether or not the transform was successful, and takes the following arguments:

  • string: The string to be transformed. Since this argument is a CFMutableStringRef, an NSMutableString can be passed using toll-free bridging.
  • range: The range of the string over which the transformation should be applied.
  • transform: The transformation to apply. This argument takes one of the string constants described below.
  • reverse: Whether to run the transformation in reverse, where applicable.

CFStringTransform covers a lot of ground with its transform argument. Here's a rundown of what it can do:

Strip Accents and Diacritics

Énġlišh långuãge lẳcks iñterêßţing diaçrïtičş, so it can be useful to normalize extended Latin characters into ASCII-friendly representations. You can get rid of the squiggly bits using the kCFStringTransformStripCombiningMarks transformation.

Name Unicode Characters

kCFStringTransformToUnicodeName allows you to determine the Unicode standard name for special characters, including Emoji. For instance, "🐑💨✨" becomes "{SHEEP} {DASH SYMBOL} {SPARKLES}".

Transliterate Between Orthographies

With the exception of English (with its complicated spelling inconsistencies), writing systems encode speech sounds into phonetic, written representations. European languages generally use the Latin alphabet with a few added diacritics, Russian uses Cyrillic, Japanese uses Hiragana & Katakana, and Thai, Korean, & Arabic each have their own scripts.

Although each language has a particular inventory of sounds that other languages may not have, the overlap across all of the major writing systems is remarkably high--enough so that one can rather effectively transliterate from one to another (not to be confused with translation).

CFStringTransform can transliterate between Latin and Arabic, Cyrillic, Greek, Korean (Hangul), Hebrew, Japanese (Hiragana & Katakana), Mandarin Chinese, and Thai. And not only that, but those transformations are all reversible:

Transformation Input Output
kCFStringTransformLatinArabic mrḥbạ مرحبا
kCFStringTransformLatinCyrillic privet привет
kCFStringTransformLatinGreek geiá sou γειά σου
kCFStringTransformLatinHangul annyeonghaseyo 안녕하세요
kCFStringTransformLatinHebrew şlwm שלום
kCFStringTransformLatinHiragana hiragana ひらがな
kCFStringTransformLatinKatakana katakana カタカナ
kCFStringTransformLatinThai s̄wạs̄dī สวัสดี
kCFStringTransformHiraganaKatakana にほんご ニホンゴ
kCFStringTransformMandarinLatin 中文 zhōng wén

Normalize User-Generated Content

One of the more practical applications for all of this is to normalize unpredictable user input in useful ways. Even if your application doesn't specifically deal with languages, you should be able to intelligently process anything the user types into your app.

Let's say you want to build a searchable index of movies on the device, which includes titles from around the world. You could:

  • First, apply the kCFStringTransformToLatin transform to transliterate all non-English text into a phonetic Latin alphabetic representation.

Hello! こんにちは! สวัสดี! مرحبا! 您好! →
Hello! kon'nichiha! s̄wạs̄dī! mrḥbạ! nín hǎo!

  • Next, apply the kCFStringTransformStripCombiningMarks transform to remove any diacritics or accents.

Hello! kon'nichiha! s̄wạs̄dī! mrḥbạ! nín hǎo! →
Hello! kon'nichiha! swasdi! mrhba! nin hao!

  • Finally, downcase the text and use CFStringTokenizer to split the text into tokens, and index the movie on them.

(hello, kon'nichiha, swasdi, mrhba, nin, hao)

If you do the same to search text entered by the user, you now have an easy way to search for names of titles, regardless of either the language of the search string or the movie title. Mathematical!

CFStringTransform can be an insanely powerful way to bend language to your will. And it's but one of many powerful features that await you if you're brave enough to explore outside of Objective-C's warm OO embrace.