Parsing HTML and Downloading Files in Swift 2.0

October 11, 2015

While web APIs are getting more and more common, at some point you’ll find yourself working with data that’s only available in HTML. I have a problem like this right now: NOAA makes all of their boating charts (depth and navigation maps) available online for free. But the index of files is only available as a huge HTML list or through a funky web viewer. I want to download some charts to use offline, but building an app to do that requires parsing the HTML list. The charts that I want are the Booklet format ones for the Atlantic coast.

HTML parsing support in iOS isn’t very good so handling tasks like this one can end up being quite a challenge. We’ll use the HTMLReader library to make it easier. Our app will have a table view listing the chart names and numbers. Tapping on a row will download the PDF chart. We’ll use Alamofire to download the PDF files and save them locally. And we’ll include an “Open In” function since the default PDF apps on iOS choke on PDFs with as much detail as these ones.

This tutorial was written using Swift 2.0, Xcode 7.0, and Alamofire v3.0.0.

There are a few real issues with HTML parsing. First, it’s not permitted by all websites. Wikipedia has a good discussion about the legality of web scraping that you should read, along with the target site’s TOS, before considering it. Second, it’s fragile. It’s usually nearly impossible to write web scraping code that’s resistant to changes to the HTML. So it’s best suited for hobby projects, not apps submitted to the App Store where you have to wait for Apple to approve each new version.

Project Setup

To start, create a new Master-Detail project in Xcode. That’ll give us the boilerplate for a tableview.

Remove the insert and edit button code from the boilerplate. Replace the array of generic objects with one of Chart objects (we’ll create a Chart class shortly).

Create a Podfile and add HTMLReader, Alamofire, and SwiftyJSON to it:


source 'https://github.com/CocoaPods/Specs.git'
platform :ios, '9.0'

target 'grokHTMLAndDownloads' do
  use_frameworks!
  pod 'HTMLReader', '~> 0.9'
  pod 'Alamofire', '~> 3.0.0'
  pod 'SwiftyJSON', '~> 2.3.0'
end

Make sure the target in your Podfile matches the name of the target in your project.

Close Xcode. Run pod install. Open the new .xcworkspace file that CocoaPods generated for you.

To organize the project we’ll keep the HTML parsing and data storage in its own class. Create a new file called DataController.swift.

This class will have a few responsibilities:

Fetch and parse the HTML list of charts
Make the charts available as an array so we can show them in the tableview
Download a chart’s PDF file from a URL

We’ll also want a class to represent the charts. Create a new Chart.swift file for this class:


import Foundation

class Chart {
}

Here’s the table of charts on the webpage:

Charts table

And here’s what the HTML looks like:

Charts table HTML

Not very pretty or semantic but it’s what we have to work with.

Each chart has a few properties that we can extract from the HTML:

Number
Scale
Title
PDF URL

So we’ll need some properties in our Chart class and an initializer to create a Chart from those properties:


import Foundation

class Chart {
  let title: String
  let url: NSURL
  let number: Int
  let scale: Int

  required init(title: String, url: NSURL, number: Int, scale: Int) {
    self.title = title
    self.url = url
    self.number = number
    self.scale = scale
  }
}

App Transport Security (ATS)

In iOS 9 Apple added checks on the security of the connection to any URL. Since we’ll be grabbing data from charts.noaa.gov, which doesn’t comply with the HTTPS TLS 1.2 requirements that Apple has set, we’ll need to add an exception to ATS in our Info.plist. Because some of the links point to http://www.charts.noaa.gov/... (with a leading www), we’ll include that domain in the ATS exceptions too:

ATS Settings

HTML Parsing

In your DataController, import HTMLReader and Alamofire.

Add a constant to hold the URL.

Add an array to hold the charts:


import Foundation
import Alamofire
import HTMLReader

let URLString = "http://charts.noaa.gov/BookletChart/AtlanticCoastBookletCharts.htm"

class DataController {
  var charts: [Chart]?
}

Add a function to DataController to retrieve the HTML page and parse it into an array of Chart objects. We’ll call it fetchCharts. It’ll need to:

Get the HTML as a String from the URL
Convert the HTML to an HTMLDocument
Find the correct table in the HTMLDocument (there are several on this page)
Loop through the rows in the table and create a Chart for each row (unless it’s one of the title rows)

Here’s the structure:


private func parseHTMLRow(rowElement: HTMLElement) -> Chart? {
  // TODO: implement
}

private func isChartsTable(tableElement: HTMLElement) -> Bool {
  // TODO: implement
}

func fetchCharts(completionHandler: (NSError?) -> Void) {
  Alamofire.request(.GET, URLString)
    .responseString { responseString in
      guard responseString.result.error == nil else {
        completionHandler(responseString.result.error!)
        return
      }
      guard let htmlAsString = responseString.result.value else {
        let error = Error.errorWithCode(.StringSerializationFailed, failureReason: "Could not get HTML as String")
        completionHandler(error)
        return
      }

      let doc = HTMLDocument(string: htmlAsString)
      
      // find the table of charts in the HTML
      let tables = doc.nodesMatchingSelector("tbody")
      var chartsTable:HTMLElement?
      for table in tables {
        if let tableElement = table as? HTMLElement {
          if self.isChartsTable(tableElement) {
            chartsTable = tableElement
            break
          }
        }
      }

      // make sure we found the table of charts
      guard let tableContents = chartsTable else {
        let error = Error.errorWithCode(.DataSerializationFailed, failureReason: "Could not find charts table in HTML document")
        completionHandler(error)
        return
      }

      self.charts = []
      for row in tableContents.children {
        if let rowElement = row as? HTMLElement { // TODO: should be able to combine this with loop above
          if let newChart = self.parseHTMLRow(rowElement) {
            self.charts?.append(newChart)
          }
        }
      }
      completionHandler(nil)
    }
}

I’ve set up fetchCharts to take a completion handler instead of returning a value because the Alamofire call is asynchronous. That means our UI won’t freeze while we’re fetching the webpage, but it makes the flow of the code less linear. Completion handlers (closures in Swift, blocks in Objective-C) keep your app responsive during slow work and are pretty easy to use once you get the hang of them. They’re well worth understanding even if the syntax is hard to remember.

Now we need to implement the two functions that actually do the web parsing. First, isChartsTable:


private func isChartsTable(tableElement: HTMLElement) -> Bool {
  if tableElement.children.count > 0 {
    let firstChild = tableElement.childAtIndex(0)
    let lowerCaseContent = firstChild.textContent.lowercaseString
    if lowerCaseContent.containsString("number") && lowerCaseContent.containsString("scale") && lowerCaseContent.containsString("title") {
      return true
    }
  }
  return false
}

Coding up HTML parsing often involves a lot of trial and error. It’s handy to print out the element or its children to figure out if you’ve got the element that you think you do. You can use print(element.textContent) to do that. Be aware that it will print the text contained in the whole element, including the children, so even if print(element.textContent) shows the text you’re trying to find you might need to go down into element.children or element.childAtIndex(X).children to find the right node.

Within fetchCharts we got the tables by doing let tables = doc.nodesMatchingSelector("tbody"), which gave us all of the table bodies within our HTML document. There are 5 or 6 on this page so we needed to set up isChartsTable to find the one that contains the list of charts. From looking at the web page we can see that the table that we want has 3 titles in its first row: NUMBER, SCALE, and TITLE. So we’ll check for those in the first row of the table (which is the table element’s first child):


let firstChild = tableElement.childAtIndex(0)
let lowerCaseContent = firstChild.textContent.lowercaseString
if lowerCaseContent.containsString("number") && lowerCaseContent.containsString("scale") && lowerCaseContent.containsString("title") {
  return true
}

Within fetchCharts we’ll make sure we do find a table that meets our requirements before trying to parse it.
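The header check itself doesn’t depend on HTMLReader, so it can be tried on plain strings. Here’s a minimal sketch, written in current Swift syntax rather than the Swift 2.0 used in this tutorial (so method names like lowercased() differ). The function name looksLikeChartsHeader and the sample strings are made up for illustration:

```swift
import Foundation

// Returns true when the text of a table's first row mentions all of the
// expected column titles, ignoring case.
// (Hypothetical helper; the tutorial does this inline in isChartsTable.)
func looksLikeChartsHeader(_ firstRowText: String) -> Bool {
  let lowercased = firstRowText.lowercased()
  let expectedTitles = ["number", "scale", "title"]
  return expectedTitles.allSatisfy { lowercased.contains($0) }
}

// Made-up sample text standing in for firstChild.textContent
let chartsHeader = "\n NUMBER \n SCALE \n TITLE \n"
let otherHeader = "Region \n Download"
print(looksLikeChartsHeader(chartsHeader)) // true
print(looksLikeChartsHeader(otherHeader))  // false
```

Keeping the expected titles in an array makes it easy to adjust the check if the page’s header row changes.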

Second, parseHTMLRow:


private func parseHTMLRow(rowElement: HTMLElement) -> Chart? {
  var url: NSURL?
  var number: Int?
  var scale: Int?
  var title: String?
  // first column: URL and number
  if let firstColumn = rowElement.childAtIndex(1) as? HTMLElement {
    // skip the header row, or any other row where the first column doesn't contain a link
    if let urlNode = firstColumn.firstNodeMatchingSelector("a") {
      if let urlString = urlNode.objectForKeyedSubscript("href") as? String {
        url = NSURL(string: urlString)
      }
      // need to make sure it's a number
      let textNumber = firstColumn.textContent.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
      number = Int(textNumber)
    }
  }
  if (url == nil || number == nil) {
    return nil // can't do anything without a URL, e.g., the header row
  }
  
  if let secondColumn = rowElement.childAtIndex(3) as? HTMLElement {
    let text = secondColumn.textContent
      .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
      .stringByReplacingOccurrencesOfString(",", withString: "")
    scale = Int(text)
  }

  if let thirdColumn = rowElement.childAtIndex(5) as? HTMLElement {
    title = thirdColumn.textContent
      .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
      .stringByReplacingOccurrencesOfString("\n", withString: "")
  }

  if let title = title, url = url, number = number, scale = scale {
    return Chart(title: title, url: url, number: number, scale: scale)
  }
  return nil
}

Let’s take that one step at a time. First we declare the variables we’re going to try to extract from the HTML:


var url: NSURL?
var number: Int?
var scale: Int?
var title: String?

Then we tackle each column. The first column should contain the URL and the chart number. This is complicated by the rows at the top of the table that don’t link to charts, so we only treat a row as a chart if we can extract both a URL and a number from it.

To extract the URL we need to step into the column (let firstColumn = rowElement.childAtIndex(1) as? HTMLElement) and then into the correct sub-element. Here’s where the fragility of HTML scraping is obvious: there’s an empty column before the column of numbers that could easily be removed by edits to this page. But for now this works.

Once we have the correct column we can find the link in it (firstNodeMatchingSelector("a")) and then the number is the text contained in that node (after trimming off any excess characters):


// first column: URL and number
if let firstColumn = rowElement.childAtIndex(1) as? HTMLElement {
  // skip the header row, or any other row where the first column doesn't contain a link
  if let urlNode = firstColumn.firstNodeMatchingSelector("a") {
    if let urlString = urlNode.objectForKeyedSubscript("href") as? String {
      url = NSURL(string: urlString)
    }
    // need to make sure it's a number
    let textNumber = firstColumn.textContent.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
    number = Int(textNumber)
  }
}
if (url == nil || number == nil) {
  return nil // can't do anything without a URL, e.g., the header row
}

Then we move on to the next column. Again we can get the text content and then remove the characters that we don’t want: leading and trailing whitespace and commas. Swift’s Int() initializer, which converts a String to an Int, returns nil if the string contains commas:


if let secondColumn = rowElement.childAtIndex(3) as? HTMLElement {
  let text = secondColumn.textContent
    .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
    .stringByReplacingOccurrencesOfString(",", withString: "")
  scale = Int(text)
}
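To see why the commas have to go, here’s a small sketch you can run on its own. It’s written in current Swift syntax, so the method names differ from the Swift 2.0 code above, and parseScale and the sample strings are made up for illustration:

```swift
import Foundation

// Turns cell text such as " 40,000\n" into an Int, or nil if it isn't a number.
// (Hypothetical helper; the tutorial does this inline in parseHTMLRow.)
func parseScale(_ cellText: String) -> Int? {
  let cleaned = cellText
    .trimmingCharacters(in: .whitespacesAndNewlines)
    .replacingOccurrences(of: ",", with: "")
  return Int(cleaned)
}

print(Int("40,000") as Any)           // nil: the comma makes the conversion fail
print(parseScale(" 40,000\n") as Any) // Optional(40000)
print(parseScale("SCALE") as Any)     // nil: header text is not a number
```

Because the result is optional, rows whose scale column isn’t numeric simply produce nil instead of crashing.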

The third column is similar: grab the text and remove the extra characters. This time it’s random line breaks in the middle of the title that we want to remove (stringByTrimmingCharactersInSet only removes unwanted characters at the start and end of the string, not in the middle):


if let thirdColumn = rowElement.childAtIndex(5) as? HTMLElement {
  title = thirdColumn.textContent
    .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
    .stringByReplacingOccurrencesOfString("\n", withString: "")
}
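One thing to watch for when deleting line breaks outright: if a break sits directly between two words, they end up glued together. As a variation (not what the code above does), here’s a sketch in current Swift syntax that collapses every run of whitespace and line breaks into a single space. The name cleanTitle and the sample text are made up for illustration:

```swift
import Foundation

// Splits on whitespace and newlines, drops the empty pieces, and rejoins
// with single spaces, so a line break mid-title becomes one space.
// (Hypothetical helper, shown as an alternative to removing "\n".)
func cleanTitle(_ cellText: String) -> String {
  return cellText
    .components(separatedBy: .whitespacesAndNewlines)
    .filter { !$0.isEmpty }
    .joined(separator: " ")
}

let raw = "  Chesapeake Bay\n   Entrance \n" // made-up sample cell text
print(cleanTitle(raw)) // Chesapeake Bay Entrance
```

This also takes care of the leading and trailing whitespace, so no separate trimming step is needed.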

Finally we make sure that we got all of the necessary data parsed from the HTML and create a new Chart object with it. If something went wrong or this row doesn’t represent a chart then we just return nil:


if let title = title, url = url, number = number, scale = scale {
  return Chart(title: title, url: url, number: number, scale: scale)
}
return nil

Displaying the Charts in the Table View

Now we’re fetching the list of charts but the user can’t see them. So let’s mate up our data controller with the table view controller.

Open your MasterViewController and add a variable for the DataController. In viewWillAppear tell it to fetch the charts and reload the tableview when it’s done:


class MasterViewController: UITableViewController {
  var dataController = DataController()
  
  override func viewWillAppear(animated: Bool) {
    super.viewWillAppear(animated)
    dataController.fetchCharts { _ in
      // TODO: handle errors
      self.tableView.reloadData()
    }
  }

  ...
}

Now we need to adjust all of the autogenerated UITableViewDataSource methods to use our DataController:


override func numberOfSectionsInTableView(tableView: UITableView) -> Int {
  return 1
}

override func tableView(tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
  return dataController.charts?.count ?? 0
}

override func tableView(tableView: UITableView, cellForRowAtIndexPath indexPath: NSIndexPath) -> UITableViewCell {
  let cell = tableView.dequeueReusableCellWithIdentifier("Cell", forIndexPath: indexPath)
  
  if let chart = dataController.charts?[indexPath.row] {
    cell.textLabel!.text = "\(chart.number): \(chart.title)"
  } else {
    cell.textLabel!.text = ""
  }
  
  return cell
}

Save and run. You should see the table view populated with the list of charts.

Downloading and Saving PDFs

Now that we can see the list of charts it would be nice to be able to get them. In the storyboard remove the segue and detail view. Remove the prepareForSegue function in MasterViewController.

In the DataController we’ll add a downloadChart function and an isChartDownloaded function so we can tell if we need to download the chart:


func isChartDownloaded(chart: Chart) -> Bool {
  if let path = chart.urlInDocumentsDirectory?.path {
    let fileManager = NSFileManager.defaultManager()
    return fileManager.fileExistsAtPath(path)
  }
  return false
}

func downloadChart(chart: Chart, completionHandler: (Double?, NSError?) -> Void) {
  guard isChartDownloaded(chart) == false else {
    completionHandler(1.0, nil) // already have it
    return
  }
  
  let destination = Alamofire.Request.suggestedDownloadDestination(directory: .DocumentDirectory, domain: .UserDomainMask)
  Alamofire.download(.GET, chart.url, destination: destination)
    .progress { bytesRead, totalBytesRead, totalBytesExpectedToRead in
      dispatch_async(dispatch_get_main_queue()) {
        let progress = Double(totalBytesRead) / Double(totalBytesExpectedToRead)
        completionHandler(progress, nil)
      }
    }
    .responseString { response in
      print(response.result.error)
      completionHandler(nil, response.result.error)
    }
}

Alamofire.download lets us download a file and specify where it should be saved. We’ll save the file in the documents directory for our app.

Before downloading the file we check that it hasn’t already been downloaded, using the isChartDownloaded function. It just checks for a file with the correct name in the documents directory.
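If you want to try that kind of existence check outside of the app, here’s a rough sketch in current Swift syntax (so FileManager instead of NSFileManager). It uses a scratch folder in the temporary directory rather than the app’s documents directory, and the helper name and file name are made up for illustration:

```swift
import Foundation

// Returns true when a file with the given name exists in the directory.
// (Hypothetical helper standing in for isChartDownloaded.)
func fileExists(named filename: String, in directory: URL) -> Bool {
  let fileURL = directory.appendingPathComponent(filename)
  return FileManager.default.fileExists(atPath: fileURL.path)
}

// Create an empty scratch directory so the check starts from a known state.
let scratch = FileManager.default.temporaryDirectory
  .appendingPathComponent(UUID().uuidString, isDirectory: true)
try FileManager.default.createDirectory(at: scratch, withIntermediateDirectories: true)

let chartName = "12200_BookletChart.pdf" // made-up file name
let existsBefore = fileExists(named: chartName, in: scratch)

// Write a placeholder file, then check again.
try Data("placeholder".utf8).write(to: scratch.appendingPathComponent(chartName))
let existsAfter = fileExists(named: chartName, in: scratch)

print(existsBefore, existsAfter) // false true

// Clean up the scratch directory.
try FileManager.default.removeItem(at: scratch)
```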

Even though we won’t implement a UI for it in this tutorial we’ll show how to track the download’s progress. Just add a .progress block and compare the totalBytesRead to the totalBytesExpectedToRead. In our function we’ll use a completion handler that reports the progress as a fraction (i.e., when it’s half downloaded we’ll call the completion handler with 0.5). Then we’ll know the file is completely downloaded when the completion handler is called with 1.0.
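One edge case worth knowing about: when a server doesn’t report the size of a file, the expected byte count comes back as -1, and dividing by it gives a negative “progress”. Here’s a sketch of the fraction calculation with a guard for that case. It’s in current Swift syntax, and the function name downloadFraction is made up for illustration; the parameter names mirror the ones in the .progress block:

```swift
// Converts byte counts into a fraction between 0.0 and 1.0.
// Returns nil when the expected size is unknown (zero or negative).
// (Hypothetical helper; the tutorial computes the fraction inline.)
func downloadFraction(totalBytesRead: Int64, totalBytesExpectedToRead: Int64) -> Double? {
  guard totalBytesExpectedToRead > 0 else { return nil }
  let fraction = Double(totalBytesRead) / Double(totalBytesExpectedToRead)
  return min(max(fraction, 0.0), 1.0)
}

print(downloadFraction(totalBytesRead: 512, totalBytesExpectedToRead: 1024) as Any)  // Optional(0.5)
print(downloadFraction(totalBytesRead: 1024, totalBytesExpectedToRead: 1024) as Any) // Optional(1.0)
print(downloadFraction(totalBytesRead: 512, totalBytesExpectedToRead: -1) as Any)    // nil
```

With an unknown size you can’t rely on seeing 1.0 from the progress block, so you’d need another signal (such as the response handler being called without an error) to know the download finished.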

To make debugging easier we’ve added a .responseString response serializer that just prints out any errors.

For those functions we need the chart to know where it’s being stored:


class Chart {
  ...
  
  var filename: String? {
    return url.lastPathComponent
  }
  
  var urlInDocumentsDirectory: NSURL? {
    let paths = NSSearchPathForDirectoriesInDomains(.DocumentDirectory, .UserDomainMask, true)
    if let path = paths.first, filename = filename {
      // use a file URL: NSURL(string:) expects a URL string, not a file system path
      let directory = NSURL(fileURLWithPath: path, isDirectory: true)
      return directory.URLByAppendingPathComponent(filename)
    }
    return nil
  }
}
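A location on disk is best represented as a file URL built from a path, since a plain path isn’t a valid URL string if it contains characters like spaces. Here’s a minimal sketch in current Swift syntax (URL instead of NSURL). The helper name, directory, and file name are all made up for illustration:

```swift
import Foundation

// Builds the file URL for a file name inside a directory given as a plain path.
// (Hypothetical helper standing in for urlInDocumentsDirectory.)
func fileURL(forFilename filename: String, inDirectoryPath directoryPath: String) -> URL {
  let directory = URL(fileURLWithPath: directoryPath, isDirectory: true)
  return directory.appendingPathComponent(filename)
}

// Made-up directory and file name; the space in the path is deliberate.
let url = fileURL(forFilename: "12200_BookletChart.pdf", inDirectoryPath: "/tmp/My Documents")
print(url.isFileURL) // true
print(url.path)      // /tmp/My Documents/12200_BookletChart.pdf
```

The path property gives back the plain file system path, which is what a file manager’s existence check expects.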

Now to use this code in MasterViewController, add a didSelectRowAtIndexPath function so it’ll get called when the user taps on a row:


override func tableView(tableView: UITableView, didSelectRowAtIndexPath indexPath: NSIndexPath) {
  if let chart = dataController.charts?[indexPath.row] {
    dataController.downloadChart(chart) { progress, error in
      // TODO: handle error
      print(progress)
      print(error)
      if (progress == 1.0) {
        // TODO: show open in dialog
      }
    }
  }
}

Now we have the file saved locally. We just need to show the “Open In” dialog for it. The downloadChart function takes care of checking if we already have the file so we won’t download it if we don’t need to.

Add Open In Function

Since the charts tend to be big PDF files, Adobe Acrobat handles them a lot better than iBooks or trying to display them in a web view. So we’ll use an “Open In” dialog instead of a preview within the app.

The “Open In” dialog is part of the UIDocumentInteractionController functionality, so we’ll need to add one of those to our MasterViewController, along with declaring that we can be its delegate:


class MasterViewController: UITableViewController, UIDocumentInteractionControllerDelegate {
  var dataController = DataController()
  var docController: UIDocumentInteractionController?
  ...
}

And then when we have a file downloaded (progress == 1.0) we can show the open in dialog:


override func tableView(tableView: UITableView, didSelectRowAtIndexPath indexPath: NSIndexPath) {
  if let chart = dataController.charts?[indexPath.row] {
    dataController.downloadChart(chart) { progress, error in
      ...
      if (progress == 1.0) {
        if let filename = chart.filename {
          let paths = NSSearchPathForDirectoriesInDomains(.DocumentDirectory, .UserDomainMask, true)
          let docs = paths[0]
          let pathURL = NSURL(fileURLWithPath: docs, isDirectory: true)
          let fileURL = NSURL(fileURLWithPath: filename, isDirectory: false, relativeToURL: pathURL)
        
          self.docController = UIDocumentInteractionController(URL: fileURL)
          self.docController?.delegate = self
          if let cell = self.tableView.cellForRowAtIndexPath(indexPath) {
            self.docController?.presentOptionsMenuFromRect(cell.frame, inView: self.tableView, animated: true)
          }
        }
      }
    }
  }
}

After getting the correct path we create a UIDocumentInteractionController and use it to show the “Open In” dialog using presentOptionsMenuFromRect.

To make sure the UIDocumentInteractionController gets nilled out when we’re not using it we declare the MasterViewController as its delegate: self.docController?.delegate = self. Then we can check when the file loading ends and nil out the doc controller:


// MARK: - UIDocumentInteractionControllerDelegate
func documentInteractionController(controller: UIDocumentInteractionController, didEndSendingToApplication application: String?) {
  self.docController = nil
  if let indexPath = tableView.indexPathForSelectedRow {
    tableView.deselectRowAtIndexPath(indexPath, animated:true)
  }
}

That would also be handy if you wanted to show an activity indicator while the document is opening in another app since that can take a little while.

And That’s All

Save and run to test it out. You should be able to see the list of charts. When you tap on one it should download (unless you’ve already downloaded it) then display the “Open In” dialog:

Open In dialog

Here’s the final code on GitHub.

If you want to push your skills here are a few improvements this demo app could use:

Cache the HTML so it doesn’t need to be reloaded or parsed every time the app is run
Track which files you’ve already downloaded and when, and show it in the table view
Allow deleting files within the app (maybe using UITableView’s swipe to delete)
Show a progress meter while downloading the PDFs
Show an activity indicator during the “Open In” process
