Using Custom Components in Swift's Regex

Topics: Swift , Regex
Table of Contents

In our last article we learned about Swift’s Regex type and the various different ways to create them. Today we’re going to dive a little deeper into one of those methods. We’ll be building a custom RegexComponent using the CustomConsumingRegexComponent protocol.

For a quick refresher, remember that we can create a custom parser using the RegexBuilder DSL like this:

import RegexBuilder 

Regex {
  Capture {
    Repeat(count: 3) {
      One(.digit)
    }
  }
  "-"
  Capture {
    Repeat(count: 3) {
      One(.digit)
    }
  }
  "-"
  Capture {
    Repeat(count: 4) {
      One(.digit)
    }
  }
}

Don’t forget that we can use many built-in parsers provided by Foundation like this:

let usdRegex = Regex {
  Capture(.currency(code: "USD").sign(strategy: .accounting))
}

let dateRegex = Regex {
    Capture(
        .date(
            .numeric,
            locale: .autoupdatingCurrent,
            timeZone: .autoupdatingCurrent,
            calendar: .autoupdatingCurrent
        )
    )
}

let intRegex = Regex {
    Capture(.localizedInteger(locale: .autoupdatingCurrent))
}

Using NSDataDetector Inside a Swift Regex

Not only can you use Foundation’s parsers, you can also create your own custom parsers, through the CustomConsumingRegexComponent protocol. Let’s create a new custom parser that uses Apple’s NSDataDetector class.

First, let’s get a working example of our NSDataDetector to detect phone numbers:

import Foundation

let types: NSTextCheckingResult.CheckingType = [.phoneNumber]
let detector = try NSDataDetector(types: types.rawValue)
let input = "(789) 555-1234"
let swiftRange = input.startIndex..<input.endIndex
let nsRange = NSRange(swiftRange, in: input) 
var result: String?
detector.enumerateMatches(
    in: input,
    options: [],
    range: nsRange,
    using: { (match, flags, _) in
        guard let phoneNumber = match?.phoneNumber,
            let nsRange = match?.range,
            let swiftRange = Range.init(nsRange, in: input) else {
            print("no phone number found")
            result = nil
            return
        }
        print("found phone number: \(phoneNumber)")
        result = phoneNumber
    }
)

Conforming to CustomConsumingRegexComponent is fairly straightforward. We simply implement consuming(_:startingAt:in:):

public struct PhoneNumberDataDetector: CustomConsumingRegexComponent {
    public typealias RegexOutput = String
    public func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, output: String)? {
        // implementation goes here...
    }
}

CustomConsumingRegexComponent

So now let’s plug in our earlier implementation:

public struct PhoneNumberDataDetector: CustomConsumingRegexComponent {
    public typealias RegexOutput = String
    public func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, output: String)? {
        var result: (upperBound: String.Index, output: String)?
        
        let types: NSTextCheckingResult.CheckingType = [.phoneNumber]
        let detector = try NSDataDetector(types: types.rawValue)
        let swiftRange = index..<input.endIndex
        let nsRange = NSRange(swiftRange, in: input) // Fatal error: String index is out of bounds
        detector.enumerateMatches(
            in: input,
            options: [],
            range: nsRange,
            using: { (match, flags, _) in
                guard let phoneNumber = match?.phoneNumber,
                      let nsRange = match?.range,
                      let swiftRange = Range(nsRange, in: input) else {
                    // no phone number found
                    result = nil; return
                }
                
                result = (upperBound: swiftRange.upperBound, output: phoneNumber)
            }
        )
        
        return result
    }
}

As you can see the NSDataDetector API is a bit cumbersome to use. Notice how we need to convert back and forth between Range and NSRange. As Mattt from NSHipster said: “NSDataDetector has an interface that only a mother could love.” But now we have a new Swift interface that is far easier to use. Now that it is a RegexComponent, we can plug it into any Regex like this:

let phoneNumberDataDetector: some RegexComponent = Regex {
    ChoiceOf {
        Anchor.startOfLine
        Anchor.startOfSubject
        One(.whitespace)
    }
    PhoneNumberDataDetector()
    ChoiceOf {
        Anchor.endOfLine
        Anchor.endOfSubject
        Anchor.endOfSubjectBeforeNewline
        One(.whitespace)
    }
}

Note, however, that in Swift 6 language mode, we have Data Race Safety turned on. Unfortunately, Regex is not Sendable so we will need to isolate it somehow. One of the easiest ways to do this is to use a global actor.

@MainActor
let phoneNumberDataDetector: some RegexComponent = Regex {
    // ...
}

Room For Improvement

Writing parsers is complicated. Even the most sophisticated parser can be overcome by an obscure corner case. But by converting our NSDataDetector into a RegexComponent we now get to take advantage of decades of Apple’s development. Even better, our new detector is composable and can be added as a component to any other RegexComponent.

Still, as great as this solution is, there is still room for improvement. In my testing, so far, I found many correctly parsed strings. “555-1234”, “(808) 555-1234”, “1 (808) 555-1234” were all correctly identified as phone numbers. However, I did have some false negatives (at least what would appear to be ). I would have expected “5555-1234” to NOT be identified as a phone number. Whatver solution you use, remember to test your code. Don’t just test the happy path. Assert that your code does not generate false-positives or false-negatives.

Another issue is, there seems to be a mistake with the bounds calculation for my implementation of this parser. Because of this, it can work effectively with methods like wholeMatch() or contains(). However, it works incorrectly with replace(). Instead of just replacing the matched phone number, it replaces the entire string.

If you see any solutions, please don’t hesitate to reach out to me on Mastodon, or X.

Think of the Possibilities

There are so many more powerful parsing libraries that could benefit from Swift’s Regex. For example Pointfree has a powerful parsing library called swift-parsing. It has an API that looks a lot like RegexBuilders, and yet it’s far more flexible. By creating a CustomConsumingRegexComponent we could allow any Regex to take advantage of the swift-parsing library.

Do you have any code that you think could be super-powered as a RegexComponent?

Conclusion

When writing parsers, we have to fully appreciate the full domain of the problem that we are trying to solve. There are dozens of countries. Perhaps hundreds of phone companies. Many of these are standardized, but there are certainly exceptions to all of these standards and there is no one universally accepted standard. The problem domain is far too large and ever-changing for one team to tackle. Instead we should look to battle-tested parsers established by the community to tackle these problems.

That’s why I created NativeRegexExamples. It’s a library where we can crowd-source our learning, and collectively discover best practices for various parsers. Please contribute, so that the entire community can benefit!