October 11, 2015
While web APIs are getting more and more common, at some point you’ll find yourself working with data that’s only available in HTML. I have a problem like this right now: NOAA makes all of their boating charts (depth and navigation maps) available online for free. But the index of files is just a huge HTML list or a funky web viewer. I want to download some charts to use offline, but building an app to do that requires parsing the HTML list. The charts that I want are the Booklet format ones for the Atlantic coast.
HTML parsing support in iOS isn’t very good so handling tasks like this one can end up being quite a challenge. We’ll use the HTMLReader library to make it easier. Our app will have a table view listing the chart names and numbers. Tapping on a row will download the PDF chart. We’ll use Alamofire to download the PDF files and save them locally. And we’ll include an “Open In” function since the default PDF apps on iOS choke on PDFs with as much detail as these ones.
This tutorial was written using Swift 2.0, Xcode 7.0, and Alamofire v3.0.0.
There are a few real issues with HTML parsing. First, not every website permits it. Wikipedia has a good discussion about the legality of web scraping that you should read, along with the target site’s terms of service, before considering it. Second, it’s fragile. It’s nearly impossible to write web scraping code that’s resistant to changes to the HTML. So it’s best suited for hobby projects, not apps submitted to the App Store where you have to wait for Apple to approve each new version.
Project Setup
To start, create a new Master-Detail project in Xcode. That’ll give us the boilerplate for a tableview.
Remove the insert and edit button code from the boilerplate. Replace the array of generic objects with one of Chart objects (we’ll create a Chart class shortly).
Create a Podfile and add HTMLReader, Alamofire, and SwiftyJSON to it:
source 'https://github.com/CocoaPods/Specs.git'
platform :ios, '9.0'
target 'grokHTMLAndDownloads' do
  use_frameworks!
  pod 'HTMLReader', '~> 0.9'
  pod 'Alamofire', '~> 3.0.0'
  pod 'SwiftyJSON', '~> 2.3.0'
end
Make sure the target in your Podfile matches the name of the target in your project.
Close Xcode. Run pod install. Open the new .xcworkspace file that CocoaPods generated for you.
To organize the project we’ll keep the HTML parsing and data storage in its own class. Create a new file called DataController.swift.
This class will have a few responsibilities:
- Fetch and parse the HTML list of charts
- Make the charts available as an array so we can show them in the tableview
- Download a chart’s PDF file from a URL
We’ll also want a class to represent the charts. Create a new Chart.swift file for this class:
import Foundation
class Chart {
}
Here’s the table of charts on the webpage:
And here’s what the HTML looks like:
Not very pretty or semantic but it’s what we have to work with.
Each chart has a few properties that we can extract from the HTML:
- Number
- Scale
- Title
- PDF URL
So we’ll need some properties in our Chart class and an initializer to create a Chart from those properties:
import Foundation
class Chart {
    let title: String
    let url: NSURL
    let number: Int
    let scale: Int
    required init(title: String, url: NSURL, number: Int, scale: Int) {
        self.title = title
        self.url = url
        self.number = number
        self.scale = scale
    }
}
App Transport Security (ATS)
In iOS 9 Apple has added some checks for the security of the connection to any URLs. Since we’ll be grabbing data from charts.noaa.gov, which doesn’t comply with the HTTPS TLS 1.2 requirements that Apple has set, we’ll need to add an exception to ATS in our Info.plist. Because some of the links point to http://www.charts.noaa.gov/... (with a leading www), we’ll include that domain in the ATS exceptions too:
HTML Parsing
In your DataController, import HTMLReader and Alamofire.
Add a constant to hold the URL.
Add an array to hold the charts:
import Foundation
import Alamofire
import HTMLReader
let URLString = "http://charts.noaa.gov/BookletChart/AtlanticCoastBookletCharts.htm"
class DataController {
    var charts: [Chart]?
}
Add a function to DataController to retrieve the HTML page and parse it into an array of Chart objects. We’ll call it fetchCharts. It’ll need to:
- Get the HTML as a String from the URL
- Convert the HTML to an HTMLDocument
- Find the correct table in the HTMLDocument (there are several on this page)
- Loop through the rows in the table and create a Chart for each row (unless it’s one of the title rows)
Here’s the structure:
private func parseHTMLRow(rowElement: HTMLElement) -> Chart? {
    // TODO: implement
    return nil // placeholder so the skeleton compiles
}
private func isChartsTable(tableElement: HTMLElement) -> Bool {
    // TODO: implement
    return false // placeholder so the skeleton compiles
}
func fetchCharts(completionHandler: (NSError?) -> Void) {
    Alamofire.request(.GET, URLString)
        .responseString { responseString in
            guard responseString.result.error == nil else {
                completionHandler(responseString.result.error!)
                return
            }
            guard let htmlAsString = responseString.result.value else {
                let error = Error.errorWithCode(.StringSerializationFailed, failureReason: "Could not get HTML as String")
                completionHandler(error)
                return
            }
            let doc = HTMLDocument(string: htmlAsString)
            // find the table of charts in the HTML
            let tables = doc.nodesMatchingSelector("tbody")
            var chartsTable: HTMLElement?
            for table in tables {
                if let tableElement = table as? HTMLElement {
                    if self.isChartsTable(tableElement) {
                        chartsTable = tableElement
                        break
                    }
                }
            }
            // make sure we found the table of charts
            guard let tableContents = chartsTable else {
                let error = Error.errorWithCode(.DataSerializationFailed, failureReason: "Could not find charts table in HTML document")
                completionHandler(error)
                return
            }
            self.charts = []
            for row in tableContents.children {
                if let rowElement = row as? HTMLElement { // TODO: should be able to combine this with loop above
                    if let newChart = self.parseHTMLRow(rowElement) {
                        self.charts?.append(newChart)
                    }
                }
            }
            completionHandler(nil)
        }
}
I’ve set up fetchCharts to have a completion handler block instead of a return type because the Alamofire call is asynchronous. That means our UI won’t freeze while we’re fetching the webpage, but it makes the flow of the code less linear. Running slow work like network requests asynchronously keeps your app responsive, and blocks are pretty easy to use once you get the hang of them. They’re well worth understanding even if the syntax is hard to remember.
Now we need to implement the two functions that actually do the HTML parsing. First, isChartsTable:
private func isChartsTable(tableElement: HTMLElement) -> Bool {
    if tableElement.children.count > 0 {
        let firstChild = tableElement.childAtIndex(0)
        let lowerCaseContent = firstChild.textContent.lowercaseString
        if lowerCaseContent.containsString("number") && lowerCaseContent.containsString("scale") && lowerCaseContent.containsString("title") {
            return true
        }
    }
    return false
}
Coding up HTML parsing often involves a lot of trial and error. It’s handy to print out the element or its children to figure out if you’ve got the element that you think you do. You can use print(element.textContent) to do that. Be aware that it will print the text contained in the whole element, including the children, so even if print(element.textContent) shows the text you’re trying to find you might need to go down into element.children or element.childAtIndex(X).children to find the right node.
Within fetchCharts we got the tables by doing let tables = doc.nodesMatchingSelector("tbody"), which gave us all of the table bodies within our HTML document. There are 5 or 6 on this page so we needed to set up isChartsTable to find the one that contains the list of charts. From looking at the web page we can see that the table that we want has 3 titles in its first row: NUMBER, SCALE, and TITLE. So we’ll check for those in the first row of the table (which is the table element’s first child):
let firstChild = tableElement.childAtIndex(0)
let lowerCaseContent = firstChild.textContent.lowercaseString
if lowerCaseContent.containsString("number") && lowerCaseContent.containsString("scale") && lowerCaseContent.containsString("title") {
    return true
}
Within fetchCharts we’ll make sure we do find a table that meets our requirements before trying to parse it.
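The header check itself is plain string matching, so it can be tried out separately from HTMLReader. Here is a minimal sketch, written in current Swift syntax rather than the Swift 2.0 used in the rest of this tutorial. The function name looksLikeChartsHeader is hypothetical and takes the row’s text directly instead of an HTMLElement:

```swift
import Foundation

// Hypothetical helper (not part of HTMLReader): decides whether the text of a
// table's first row looks like the header row of the charts table.
func looksLikeChartsHeader(_ rowText: String) -> Bool {
    let lowered = rowText.lowercased()
    let requiredTitles = ["number", "scale", "title"]
    // Every expected column title must appear somewhere in the row text.
    return requiredTitles.allSatisfy { lowered.contains($0) }
}

print(looksLikeChartsHeader("NUMBER\n SCALE\n TITLE"))  // true
print(looksLikeChartsHeader("Booklet Charts Home"))     // false
```

Keeping the expected titles in an array makes it a little easier to adjust the check if the page’s header row changes.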
Second, parseHTMLRow:
private func parseHTMLRow(rowElement: HTMLElement) -> Chart? {
    var url: NSURL?
    var number: Int?
    var scale: Int?
    var title: String?
    // first column: URL and number
    if let firstColumn = rowElement.childAtIndex(1) as? HTMLElement {
        // skip the header row, or any other row whose first column doesn't contain a link
        if let urlNode = firstColumn.firstNodeMatchingSelector("a") {
            if let urlString = urlNode.objectForKeyedSubscript("href") as? String {
                url = NSURL(string: urlString)
            }
            // Int() returns nil if the text isn't a number
            let textNumber = firstColumn.textContent.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
            number = Int(textNumber)
        }
    }
    if (url == nil || number == nil) {
        return nil // can't do anything without a URL and a number, e.g., the header row
    }
    if let secondColumn = rowElement.childAtIndex(3) as? HTMLElement {
        let text = secondColumn.textContent
            .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
            .stringByReplacingOccurrencesOfString(",", withString: "")
        scale = Int(text)
    }
    if let thirdColumn = rowElement.childAtIndex(5) as? HTMLElement {
        title = thirdColumn.textContent
            .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
            .stringByReplacingOccurrencesOfString("\n", withString: "")
    }
    if let title = title, url = url, number = number, scale = scale {
        return Chart(title: title, url: url, number: number, scale: scale)
    }
    return nil
}
Let’s take that one step at a time. First we declare the variables we’re going to try to extract from the HTML:
var url: NSURL?
var number: Int?
var scale: Int?
var title: String?
Then we tackle each column. The first column should contain the URL and the chart number. This is complicated by the rows at the top of the table that don’t link to charts, so we check that we got both a URL and a number to confirm that this is a row with a chart.
To extract the URL we need to step into the column (let firstColumn = rowElement.childAtIndex(1) as? HTMLElement) then into the correct sub-element. Here’s where the fragility of HTML scraping is obvious: there’s an empty column before the column of numbers that could easily be removed by edits to this page. But for now this works.
Once we have the correct column we can find the link in it (firstNodeMatchingSelector("a")) and then the number is the text contained in that column (after trimming off any excess whitespace):
// first column: URL and number
if let firstColumn = rowElement.childAtIndex(1) as? HTMLElement {
    // skip the header row, or any other row whose first column doesn't contain a link
    if let urlNode = firstColumn.firstNodeMatchingSelector("a") {
        if let urlString = urlNode.objectForKeyedSubscript("href") as? String {
            url = NSURL(string: urlString)
        }
        // Int() returns nil if the text isn't a number
        let textNumber = firstColumn.textContent.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
        number = Int(textNumber)
    }
}
if (url == nil || number == nil) {
    return nil // can't do anything without a URL and a number, e.g., the header row
}
Then we move on to the next column. Again we can get the text content and then remove the characters that we don’t want: trailing & leading whitespace and commas. Swift’s Int() initializer, which converts a String to an Int, returns nil if the string contains commas:
if let secondColumn = rowElement.childAtIndex(3) as? HTMLElement {
    let text = secondColumn.textContent
        .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
        .stringByReplacingOccurrencesOfString(",", withString: "")
    scale = Int(text)
}
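To see why the comma has to go, here is a small standalone sketch of the same cleanup in current Swift syntax (the tutorial code above is Swift 2.0). The helper name parseScale and the sample strings are made up for illustration:

```swift
import Foundation

// Hypothetical helper: converts cell text such as " 40,000\n" to an Int.
func parseScale(_ cellText: String) -> Int? {
    let cleaned = cellText
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .replacingOccurrences(of: ",", with: "")
    // Int(_:) returns nil for anything that is not a plain base-10 integer.
    return Int(cleaned)
}

print(parseScale(" 40,000\n") as Any)  // Optional(40000)
print(parseScale("SCALE") as Any)      // nil
print(Int("40,000") as Any)            // nil: the comma alone makes the conversion fail
```

Because the result is optional, a header row or any other cell with non-numeric text ends up as nil, which is what lets the row parser skip it.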
The third column is similar: grab the text and remove the extra characters. This time it’s random line breaks in the middle of the title that we want to remove (stringByTrimmingCharactersInSet will only remove unwanted characters at the start and end of the string, not in the middle):
if let thirdColumn = rowElement.childAtIndex(5) as? HTMLElement {
    title = thirdColumn.textContent
        .stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
        .stringByReplacingOccurrencesOfString("\n", withString: "")
}
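If the line breaks in a title turn out to be surrounded by extra spaces, one alternative is to collapse every run of whitespace into a single space. This is a sketch of that variation in current Swift syntax, not what the tutorial code above does. The helper name collapseWhitespace and the sample title are hypothetical:

```swift
import Foundation

// Hypothetical helper: trims the text and collapses any run of spaces,
// tabs, or line breaks inside it into a single space.
func collapseWhitespace(_ text: String) -> String {
    return text
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }   // drop the empty pieces between adjacent whitespace
        .joined(separator: " ")
}

print(collapseWhitespace("  Cape Cod\n   Bay \n"))  // Cape Cod Bay
```

The trade-off is that any intentional double spacing in a title is lost, which seems acceptable for a one-line table cell label.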
Finally we make sure that we got all of the necessary data parsed from the HTML and create a new Chart object with it. If something went wrong or this row doesn’t represent a chart then we just return nil:
if let title = title, url = url, number = number, scale = scale {
    return Chart(title: title, url: url, number: number, scale: scale)
}
return nil
Displaying the Charts in the Table View
Now we’re fetching the list of charts but the user can’t see them. So let’s mate up our data controller with the table view controller.
Open your MasterViewController and add a variable for the DataController. In viewWillAppear tell it to fetch the charts and reload the tableview when it’s done:
class MasterViewController: UITableViewController {
    var dataController = DataController()
    override func viewWillAppear(animated: Bool) {
        super.viewWillAppear(animated)
        dataController.fetchCharts { _ in
            // TODO: handle errors
            self.tableView.reloadData()
        }
    }
    ...
}
Now we need to adjust all of the autogenerated UITableViewDataSource methods to use our DataController:
override func numberOfSectionsInTableView(tableView: UITableView) -> Int {
    return 1
}
override func tableView(tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
    return dataController.charts?.count ?? 0
}
override func tableView(tableView: UITableView, cellForRowAtIndexPath indexPath: NSIndexPath) -> UITableViewCell {
    let cell = tableView.dequeueReusableCellWithIdentifier("Cell", forIndexPath: indexPath)
    if let chart = dataController.charts?[indexPath.row] {
        cell.textLabel!.text = "\(chart.number): \(chart.title)"
    } else {
        cell.textLabel!.text = ""
    }
    return cell
}
Save and run. You should see the table view populated with the list of charts.
Downloading and Saving PDFs
Now that we can see the list of charts it would be nice to be able to get them. In the storyboard remove the segue and detail view. Remove the prepareForSegue function in MasterViewController.
In the DataController we’ll add a downloadChart function and an isChartDownloaded function so we can tell if we need to download the chart:
func isChartDownloaded(chart: Chart) -> Bool {
    if let path = chart.urlInDocumentsDirectory?.path {
        let fileManager = NSFileManager.defaultManager()
        return fileManager.fileExistsAtPath(path)
    }
    return false
}
func downloadChart(chart: Chart, completionHandler: (Double?, NSError?) -> Void) {
    guard isChartDownloaded(chart) == false else {
        completionHandler(1.0, nil) // already have it
        return
    }
    let destination = Alamofire.Request.suggestedDownloadDestination(directory: .DocumentDirectory, domain: .UserDomainMask)
    Alamofire.download(.GET, chart.url, destination: destination)
        .progress { bytesRead, totalBytesRead, totalBytesExpectedToRead in
            dispatch_async(dispatch_get_main_queue()) {
                let progress = Double(totalBytesRead) / Double(totalBytesExpectedToRead)
                completionHandler(progress, nil)
            }
        }
        .responseString { response in
            print(response.result.error)
            completionHandler(nil, response.result.error)
        }
}
Alamofire.download lets us download a file and specify where it should be saved. We’ll save the file in the documents directory for our app.
Before downloading the file we check that it hasn’t already been downloaded using the isChartDownloaded function. It just checks for a file with the correct name in the documents directory.
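The same kind of existence check can be tried outside the app. Below is a minimal sketch in current Swift syntax (the tutorial code is Swift 2.0) that uses the temporary directory instead of the documents directory. The helper fileExists(named:in:) and the sample file name are hypothetical:

```swift
import Foundation

// Hypothetical helper: true if a file called `filename` exists in `directory`.
func fileExists(named filename: String, in directory: URL) -> Bool {
    let fileURL = directory.appendingPathComponent(filename)
    return FileManager.default.fileExists(atPath: fileURL.path)
}

// Usage with a throwaway file in the temporary directory.
let tempDirectory = FileManager.default.temporaryDirectory
let sampleName = "sample-chart-\(UUID().uuidString).pdf"
let sampleURL = tempDirectory.appendingPathComponent(sampleName)

print(fileExists(named: sampleName, in: tempDirectory))  // false: nothing written yet
try? Data("placeholder".utf8).write(to: sampleURL)
print(fileExists(named: sampleName, in: tempDirectory))  // true once the write succeeds
try? FileManager.default.removeItem(at: sampleURL)       // clean up
```

Note that a check like this only confirms that a file with that name is present. It can’t tell whether an earlier download finished or left a partial file behind.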
Even though we won’t implement a UI for it in this tutorial we’ll show how to track the download’s progress. Just add a .progress block and compare the totalBytesRead to the totalBytesExpectedToRead. In our function we’ll use a completion handler that reports the progress as a fraction (i.e., when it’s half downloaded we’ll call the completion handler with 0.5). Then we’ll know that the file is completely downloaded when the completion handler is called with 1.0.
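One case the simple division doesn’t cover: when the server doesn’t report a content length, the expected byte count is -1 (NSURLSessionTransferSizeUnknown), which would produce a negative fraction. Here’s a hedged sketch, in current Swift syntax, of a more defensive calculation. The helper name downloadFraction is hypothetical and isn’t part of Alamofire:

```swift
import Foundation

// Hypothetical helper: returns the download progress as a fraction from 0.0 to 1.0,
// or nil when the expected size is unknown (zero or negative).
func downloadFraction(totalBytesRead: Int64, totalBytesExpectedToRead: Int64) -> Double? {
    guard totalBytesExpectedToRead > 0 else {
        return nil
    }
    let fraction = Double(totalBytesRead) / Double(totalBytesExpectedToRead)
    // Clamp in case more bytes arrive than the server announced.
    return min(max(fraction, 0.0), 1.0)
}

print(downloadFraction(totalBytesRead: 512, totalBytesExpectedToRead: 1024) as Any)  // Optional(0.5)
print(downloadFraction(totalBytesRead: 100, totalBytesExpectedToRead: -1) as Any)    // nil
```

With a helper like this, a caller would need some other signal (such as the final response callback) to know that a download of unknown size has finished.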
To make debugging easier we’ve added a .responseString response serializer that just prints out any errors.
For those functions we need the chart to know where it’s being stored:
class Chart {
    ...
    var filename: String? {
        return url.lastPathComponent
    }
    var urlInDocumentsDirectory: NSURL? {
        let paths = NSSearchPathForDirectoriesInDomains(.DocumentDirectory, .UserDomainMask, true)
        if let path = paths.first, filename = filename {
            // use a file URL: NSURL(string:) returns nil for paths containing spaces
            let directory = NSURL(fileURLWithPath: path, isDirectory: true)
            return directory.URLByAppendingPathComponent(filename)
        }
        return nil
    }
}
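The idea of deriving the local file location from the last path component of the remote URL can be sketched on its own. This version uses current Swift syntax and the URL type rather than the Swift 2.0 NSURL code above. The helper name, the example.com address, and the /tmp/Documents directory are all made up for illustration:

```swift
import Foundation

// Hypothetical helper: builds the local file URL for a remote file by reusing
// the remote URL's last path component as the file name.
func localURL(forRemoteURL remoteURL: URL, inDirectory directory: URL) -> URL {
    return directory.appendingPathComponent(remoteURL.lastPathComponent)
}

// Placeholder values, not real chart locations.
let remote = URL(string: "http://example.com/charts/12345_BookletChart.pdf")!
let documents = URL(fileURLWithPath: "/tmp/Documents", isDirectory: true)
let local = localURL(forRemoteURL: remote, inDirectory: documents)

print(local.path)       // /tmp/Documents/12345_BookletChart.pdf
print(local.isFileURL)  // true
```

In an app the directory would normally come from FileManager.default.urls(for: .documentDirectory, in: .userDomainMask) instead of a hard-coded path.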
Now to use this code in MasterViewController add a didSelectRowAtIndexPath so it’ll get called when the user taps on a row:
override func tableView(tableView: UITableView, didSelectRowAtIndexPath indexPath: NSIndexPath) {
    if let chart = dataController.charts?[indexPath.row] {
        dataController.downloadChart(chart) { progress, error in
            // TODO: handle error
            print(progress)
            print(error)
            if (progress == 1.0) {
                // TODO: show open in dialog
            }
        }
    }
}
Now we have the file saved locally. We just need to show the “Open In” dialog for it. The downloadChart function takes care of checking if we already have the file so we won’t download it if we don’t need to.
Add Open In Function
Since the charts tend to be big PDF files, Adobe Acrobat handles them a lot better than iBooks or trying to display them in a web view. So we’ll use an “Open In” dialog instead of a preview within the app.
The “Open In” dialog is part of the UIDocumentInteractionController functionality, so we’ll need to add one of those to our MasterViewController, along with declaring that we can be its delegate:
class MasterViewController: UITableViewController, UIDocumentInteractionControllerDelegate {
    var dataController = DataController()
    var docController: UIDocumentInteractionController?
    ...
}
And then when we have a file downloaded (progress == 1.0) we can show the “Open In” dialog:
override func tableView(tableView: UITableView, didSelectRowAtIndexPath indexPath: NSIndexPath) {
    if let chart = dataController.charts?[indexPath.row] {
        dataController.downloadChart(chart) { progress, error in
            ...
            if (progress == 1.0) {
                if let filename = chart.filename {
                    let paths = NSSearchPathForDirectoriesInDomains(.DocumentDirectory, .UserDomainMask, true)
                    let docs = paths[0]
                    let pathURL = NSURL(fileURLWithPath: docs, isDirectory: true)
                    let fileURL = NSURL(fileURLWithPath: filename, isDirectory: false, relativeToURL: pathURL)
                    self.docController = UIDocumentInteractionController(URL: fileURL)
                    self.docController?.delegate = self
                    if let cell = self.tableView.cellForRowAtIndexPath(indexPath) {
                        self.docController?.presentOptionsMenuFromRect(cell.frame, inView: self.tableView, animated: true)
                    }
                }
            }
        }
    }
}
After getting the correct path we create a UIDocumentInteractionController and use it to show the “Open In” dialog using presentOptionsMenuFromRect.
To make sure the UIDocumentInteractionController gets nilled out when we’re not using it we declare the MasterViewController as its delegate: self.docController?.delegate = self. Then we can check when the file has been sent to the other app and nil out the doc controller:
// MARK: - UIDocumentInteractionControllerDelegate
func documentInteractionController(controller: UIDocumentInteractionController, didEndSendingToApplication application: String?) {
    self.docController = nil
    if let indexPath = tableView.indexPathForSelectedRow {
        tableView.deselectRowAtIndexPath(indexPath, animated: true)
    }
}
That would also be handy if you wanted to show an activity indicator while the document is opening in another app since that can take a little while.
And That’s All
Save and run to test it out. You should be able to see the list of charts. When you tap on one it should download (unless you’ve already downloaded it) then display the “Open In” dialog:
Here’s the final code on GitHub.
If you want to push your skills here are a few improvements this demo app could use:
- Cache the HTML so it doesn’t need to be reloaded or parsed every time the app is run
- Track which files you’ve already downloaded and when, and show it in the table view
- Allow deleting files within the app (maybe using UITableView’s swipe to delete)
- Show a progress meter while downloading the PDFs
- Show an activity indicator during the “Open In” process