Episode #587

Parsing Strings into Structured Data

44 minutes
Published on December 19, 2024


In this episode, we explore parsing strings into structured data using the swift-parsing library from Point-Free. We demonstrate how to create composable parsers, starting with a simple example, then build two more complex parsers: one for sample input from Advent of Code and another for CSV data of taxi trips. We'll use both ways of defining parsers and lean on composability to build them up.

This episode uses swift-parsing 0.13.0, Swift 6.0, and Xcode 16.2.

In today's episode, I want to talk about parsing.

We're going to parse some strings that represent some data that we need in our application and parse that into a structured format.

So we're going to do this with the swift-parsing library from Point-Free.

This is a parser-printer library, which lets you build up a composable set of parsers and use them to parse input into structured data.

Now, the input does not have to be a string, but that is the canonical example.

So that's what we're going to be working with in this episode.

So the latest version is 0.13.0.

I'm going to copy this URL, and we're going to add it to our project using right-click here.

And we're going to say Add Package Dependencies.

And then here, I'm going to paste in the URL and add it to our project.

So let's say we want to take this string, Hello World.

I'm going to change this to a variable called input.

And we want to make this part dynamic, so we parse this part out of the string.

And it will always end with this exclamation mark, and it will always start with Hello, space.

So to do that, we can create a parser.

So I'll say parser.

This is going to be Parse with a builder block here.

And here, what we're going to be parsing is a string literal inside this block: Hello, space.

And then we just want to grab-- for now, let's just grab the rest of the string, which is a parser on its own, which just continues until it gets to the end of the string.

So that gives us a parser.

And if we look at the type here, the type is a little bit hard to read.

But it starts with the input type, which is a substring.

And then the rest of this gives us the output.

So what we can do is we can say, let the output of this be parser.parse, and we pass in the input.

So this can fail.

So we're going to mark that with try.

And then we're going to print the output.

Now at this point, we will see that this will either succeed or fail.

And if it fails, it'll throw an error to tell us what happened.

And if it works, then we'll see what was returned from this parser.

Now this part of the string is going to be discarded, because it's considered a string literal.

And this part will be captured and returned as output.

And so we get World, with the exclamation mark still attached.
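
A minimal sketch of this first parser, assuming swift-parsing 0.13.0 has been added to the project and that the on-screen string is "Hello, World!" (the exact literal is an assumption). The input type is spelled out with Parse(input:) only to keep type inference simple.

```swift
import Parsing

// Assumption: the exact input literal; the episode only shows it on screen.
let input = "Hello, World!"

let restParser = Parse(input: Substring.self) {
  "Hello, "  // matched, then discarded (string literals produce no output)
  Rest()     // captures everything that remains
}

// parse(_:) throws if the input does not match.
let output = try restParser.parse(input)
print(output)  // World!
```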

Let's say we wanted world without the exclamation mark.

So we can't do the rest of the string.

So what we're looking for at the end is an exclamation mark.

And what we need here is Prefix, which takes everything while the element is not the exclamation mark.

So it'll start with hello, space, then give me everything up until this exclamation mark, and then an exclamation mark.

And again, because these parts are string literals, they get discarded in the output, and we're left with whatever was parsed here.

So if I run this, we should get world by itself.
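
A sketch of the Prefix version, again assuming "Hello, World!" as the input. The failing call at the end is a made-up input, included only to show that a mismatch throws rather than returning a partial result.

```swift
import Parsing

let nameParser = Parse(input: Substring.self) {
  "Hello, "             // discarded
  Prefix { $0 != "!" }  // captured: everything before the first "!"
  "!"                   // discarded
}

let name = try nameParser.parse("Hello, World!")
print(name)  // World

// A missing "!" makes the whole parse throw.
do {
  _ = try nameParser.parse("Hello, World")
} catch {
  print("Parsing failed:", error)
}
```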

And then this works also if I do that.

Now if I don't have the expected input, this is going to fail, and it's going to tell us what happened here.

It expected this.

So the input is going to fail here.

And this is really good for us to make sure we capture the errors when they happen and not continue to parse things.

If something is indeed optional, then you have to represent that explicitly, and for that we can use Optionally.

Optionally is a parser which will consume this input if it exists, but still continue if it does not.

So if we do that, then we get this value here, which ended up being an optional result in the output.

If we don't want that, then we can use the Skip parser, which still consumes the input but drops it from the output.
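
One way this could look, as a sketch: the trailing exclamation mark is wrapped in Optionally so it may be absent, and in Skip so the resulting optional does not show up in the output. The input strings are assumptions.

```swift
import Parsing

let lenientParser = Parse(input: Substring.self) {
  "Hello, "
  Prefix { $0 != "!" }
  // Consume a trailing "!" if present, and drop the optional result.
  Skip {
    Optionally { "!" }
  }
}

let withMark = try lenientParser.parse("Hello, World!")
let withoutMark = try lenientParser.parse("Hello, World")
print(withMark, withoutMark)
```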

So if we had more than one line, then we could work with that.

So this is the sort of basics of creating a parser.

There are two different ways.

One, we can create it with the Parse type like this, passing a builder closure as the body.

And we can also create-- let's call this a greeting parser.

And this is going to be a parser.

And that's going to have a var body, which gives us the parser that parses a substring and returns to us a string.

And that is going to be some parser.

And what we're going to do in here is essentially the same thing that we did before.

I'm going to copy and paste this in here.

Now at this point, the prefix here is returning a substring.

If we want that to be a string, we can use the map function on any parser to take its output and convert it to whatever output you want.

OK, so if I do that, then we can create a parser, which is a greeting parser, and then call parse input.

Well, we have that already in the other line.

Let's do this.

Greeting parser.

This works in the same way, but now we can share this as a type for the rest of the application or to use in other parsers if we want to.
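
A sketch of the type-based form, assuming the same "Hello, World!" input. The type name GreetingParser follows the narration; everything else mirrors the closure-based version.

```swift
import Parsing

struct GreetingParser: Parser {
  // The body declares both the input (Substring) and the output (String).
  var body: some Parser<Substring, String> {
    Parse {
      "Hello, "
      Prefix { $0 != "!" }.map(String.init)  // Substring -> String
      "!"
    }
  }
}

let greeted = try GreetingParser().parse("Hello, World!")
print(greeted)  // World
```

Because it is a named type, GreetingParser() can also be dropped into another parser's builder block.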

OK, so that's a quick overview of the parsing library.

And I'm going to go through two examples.

One of them is to parse some input from Advent of Code from this year on day 13.

We have a claw machine game, and there are two buttons on the machine.

One is called button A.

They're always called A and B.

And this is going to move the x by that amount and move the y by that amount.

And then there's going to be a prize that exists at some other position where x equals the x-coordinate and y equals the y-coordinate.

OK, so let's load this input into a string.

I'm going to paste a quick function here to load input from the bundle that we're working in.

And here I'm going to say let AOC input equals load input.

That needs to have a try because it can fail.

And that's going to be the AOC 13 text file.

OK, at this point, we can print the AOC input and make sure that that's working.

And we can see we get all of our input.

So what we want to do here is I want to create a struct to represent a point.

And the point is going to be the x and y value, which are always integers.

So let's do let x: Int and let y: Int.

So to create a point, then we're going to have a button A, button B, and a prize.

So we could have our button have an offset, which is a point, and a name, which is a string, or maybe a character.

Let's do a character.

OK, so at this point now, we want to have our puzzle input is-- well, this is going to be each case.

So I mentioned this is going to be a claw machine.

So let's just call this claw machine.

And we're going to have two buttons.

And we can say button A.

And that's going to be a button.

And then button B is going to be a button.

We could also do this as an array if we ever expected there to be more than one button.

And then the prize location is going to be at a point.

So this is the structure that we want our data to be represented in.
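
The model described above might be sketched like this. The property names are assumptions based on the narration, and Equatable is added only so values are easy to compare.

```swift
// Assumption: property names follow the narration; the episode's exact spelling may differ.
struct Point: Equatable {
  let x: Int
  let y: Int
}

struct Button: Equatable {
  let offset: Point    // how far one press moves the claw
  let name: Character  // "A" or "B"
}

struct ClawMachine: Equatable {
  let buttonA: Button
  let buttonB: Button
  let prize: Point     // where the prize sits
}

let machine = ClawMachine(
  buttonA: Button(offset: Point(x: 1, y: 2), name: "A"),
  buttonB: Button(offset: Point(x: 3, y: 4), name: "B"),
  prize: Point(x: 10, y: 20)
)
print(machine.prize.x)  // 10
```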

So how do we do that?

I'm going to start off with the smallest piece, which is the point.

So we're going to create a point parser.

This is going to parse some input, and here we can specify the output type that we want.

If I do output here is the type that we want.

We're going to do point.self.

And then we can specify the parser here.

And this can be useful so that we can get the type that we're going for right up front.

OK, so this one doesn't know about our input yet because we haven't ever said parse with it.

So if we want to specify the input type being a substring, we can look at the other initializers here that have input, output, and with.

So our input type is going to be Substring.self.

Our output is going to be Point.self.

And then with will be our parser block.

So let's just do this.

Input that.

Let's give ourselves some more room.

OK, so looking on the right-hand side, I want to be able to grab x and y, and I want to be able to do it in either of these cases.

So the first thing we can do is we can expect that there's going to be a literal x followed by any character.

We could probably do something like OneOf.

If we did OneOf, we can specify either a plus or an equal sign.

Then we expect a literal comma and a space, then a literal y.

And actually, before this, we need an Int.parser().

And then here we need an Int.parser() as well.

OK, so what is that going to give us?

It's telling us that it requires the types (Int, Int) and () be equivalent.

So it's kind of hard to read what that is, but this should be giving us a parser of (Int, Int).

And we want to map that to a point.

If we look at the map function, that gives us a point.

If we were to not put that in here and we look at the map function, we get our numbers here.

This is the output from the previous parser, which is an (Int, Int).

And now I need to say Point.init(x:y:).

So that can be nums.0, nums.1.

There is also a way to do a member-wise initializer.

If you look in the docs, you can find that as well.

These are essentially the same way of doing that.
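
A sketch of the point parser, assuming lines shaped like "X+94, Y+34" and "X=8400, Y=5400". The Point type is repeated so the block stands alone, and the input type is given with Parse(input:) rather than the longer initializer mentioned in the narration.

```swift
import Parsing

struct Point: Equatable {
  let x: Int
  let y: Int
}

let pointParser = Parse(input: Substring.self) {
  "X"
  OneOf {
    "+"  // button lines use a plus sign
    "="  // prize lines use an equals sign
  }
  Int.parser()
  ", Y"
  OneOf {
    "+"
    "="
  }
  Int.parser()
}
.map { nums in Point(x: nums.0, y: nums.1) }  // (Int, Int) -> Point

let offset = try pointParser.parse("X+94, Y+34")
let prize = try pointParser.parse("X=8400, Y=5400")
print(offset, prize)
```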

OK, so now we have, or we should have, a way to parse points.

We can give that a quick test by saying, let's try to parse this input.

So we can say pointParser.parse.

If our input is this, we can do that.

And let's print out the prize that we got.

That can fail.

OK, so it expected an integer because we did not add our OneOf parser in between the y and the number.

And now we get our point.

So this is the smallest piece that we can use.

I'm going to get rid of that now.

And let's worry about next looking at a button parser.

So a button parser is going to be the same thing here.

We're going to parse.

The input type is going to be substring.self.

And then here, we're going to be parsing the literal string button, followed by a character, though Character does not have a parser of its own.

There are other parsers where we can specify a character set.

And in this case, we could say CharacterSet.letters.

And that is a parser.

And then after that, we expect a colon and a space, and then our point parser.

So here, we're composing the button parser from a built-in character set parser and our point parser.

So at this point now, we should be able to parse this button.

And I, of course, need to map this to the type that I want.

So this gives us a substring and a point.

I don't want this to be a substring.

I want this to grab the first character.

And this we can unwrap, because if there were no characters, this wouldn't have matched.

So at this point, our map transform should be taking a character; Substring.Element is a Character.

So we can say label and point in.

And now we can create our button with the label and point.
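
A sketch of the button parser built on top of the point parser, assuming lines shaped like "Button A: X+94, Y+34". The model types and pointParser are repeated so the block stands alone; their names are assumptions.

```swift
import Foundation
import Parsing

struct Point: Equatable { let x: Int; let y: Int }
struct Button: Equatable { let offset: Point; let name: Character }

let pointParser = Parse(input: Substring.self) {
  "X"
  OneOf { "+"; "=" }
  Int.parser()
  ", Y"
  OneOf { "+"; "=" }
  Int.parser()
}
.map { nums in Point(x: nums.0, y: nums.1) }

let buttonParser = Parse(input: Substring.self) {
  "Button "
  CharacterSet.letters  // captures the label as a Substring
  ": "
  pointParser           // reuse the smaller parser
}
// The force unwrap follows the narration; it assumes the label is never empty.
.map { label, offset in Button(offset: offset, name: label.first!) }

let buttonA = try buttonParser.parse("Button A: X+94, Y+34")
print(buttonA)
```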

OK.

Let's grab our button.

This is going to be buttonParser.parse.

We'll grab that input.

We'll mark it with try.

And then we'll print out the button.

OK.

We got our button.

And I'm just going to assume that we can get our prize.

So at this point now, we're ready for our claw machine parser.

So claw machine parser is going to be, again, a parser with the input type of substring.self.

And then here, we're going to grab a button parser, then a literal new line, then another button parser.

And then another literal new line.

And then we need a prize parser, which I can just do prize space and then have a point parser.

At that point, we should have a button, a button, and a point.

And then we can take that A, B, and prize and turn that into our claw machine.

A, B, and prize.

And we should be able to print our claw machine.

And I'm going to take the-- the claw machine parser now is going to parse one of these blocks, and it will parse everything up to this point here, up to the end.

The last one is not going to have a new line at the end of it, so we don't want to include that.

So at this point now, for us to parse our entire input, we want to get a bunch of claw machines.

So that's going to be a parse.

And we're going to use the built-in Many parser, which runs our claw machine parser repeatedly, with each run separated by a new line.

And in this case, there's actually two new lines because there's a blank one in between.

So at that point, now we can say parse our AOC input.

And let's make sure that we can get all of our claw machines.

So the 4624, which is the last one.

And it looks like it expected the end of the input to be here, but didn't find it.

It's possible there was a new line at the end of that file.

And what ends up happening here is that we actually have a new line at the end of this file, and Xcode won't let me delete it.

So this is one of the annoying parts about working with Xcode and these data files in Advent of Code.

So what I typically will do is call trimmingCharacters(in: .whitespacesAndNewlines) on the input to make sure that our input is as structured as we expect.

So if we get rid of this, this will trim that last new line.

And now we should be able to print out all of our claw machines.
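
Putting the pieces together, a sketch of the whole pipeline on a made-up two-machine sample with a trailing newline. All type and property names are assumptions; the literal "Prize: " and the blank-line separator follow the input shape described above.

```swift
import Foundation
import Parsing

struct Point: Equatable { let x: Int; let y: Int }
struct Button: Equatable { let offset: Point; let name: Character }
struct ClawMachine: Equatable { let buttonA: Button; let buttonB: Button; let prize: Point }

let pointParser = Parse(input: Substring.self) {
  "X"
  OneOf { "+"; "=" }
  Int.parser()
  ", Y"
  OneOf { "+"; "=" }
  Int.parser()
}
.map { nums in Point(x: nums.0, y: nums.1) }

let buttonParser = Parse(input: Substring.self) {
  "Button "
  CharacterSet.letters
  ": "
  pointParser
}
.map { label, offset in Button(offset: offset, name: label.first!) }

let clawMachineParser = Parse(input: Substring.self) {
  buttonParser
  "\n"
  buttonParser
  "\n"
  "Prize: "
  pointParser
}
.map { a, b, prize in ClawMachine(buttonA: a, buttonB: b, prize: prize) }

// Machines are separated by a blank line, i.e. two newlines.
let machinesParser = Many {
  clawMachineParser
} separator: {
  "\n\n"
}

// Made-up sample; note the trailing newline, as in the downloaded file.
let raw = """
Button A: X+1, Y+2
Button B: X+3, Y+4
Prize: X=10, Y=20

Button A: X+5, Y+6
Button B: X+7, Y+8
Prize: X=30, Y=40

"""

// Trim first so the trailing newline does not leave unparsed input behind.
let trimmed = raw.trimmingCharacters(in: .whitespacesAndNewlines)
let machines = try machinesParser.parse(trimmed)
print(machines.count)
```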

Now it is possible for you to take a parser and turn it into a printer, which could take these claw machines and spit out this output.

I'm not going to go over that in this case, but I will point out that the times when that won't work is when you have something like this.

It doesn't know what to print out here.

If you were to take a button parser-- actually, this one may work.

This one definitely will not, because it doesn't know which one we want to print.

And when we're parsing, we're just saying, give me either one, and we're discarding that information.

So if we were to turn around and take that parser with some structured data and want to turn it back into a string using the same parser, just going in reverse, which one would it choose?

It's ambiguous, so this one is not available to be printed.

OK, so that's the claw machines.

There's one more example I want to talk about, and that's taxis.

This is a CSV file for some taxi pickup and drop-off information.

We can take a look at this in Numbers to get a little bit better view of what's happening here.

So we've got pickup, drop-off, passengers, distance, fare, tip, zones, et cetera.

So this is all comma-separated values.

So what we're going to do here is we're going to create a CSV parser for our-- I'm going to call this a taxi trip.

So let's make a taxi trip.

And we're going to basically have a pickup date-- or pickup at is a date-- drop-off at.

Let's call these dates just so that we can have our locations as well.

Then we have number of passengers.

And just scanning through the data here, just making sure that it's well-formed, that these are all integers.

And it looks like they are.

So we'll say numPassengers is an int.

Then we have the distance, fare, tip, tolls, and total.

So distance, that's going to be a double.

And then we're going to have some others.

So this is fare, tip, tolls, and total.

And just, again, scanning through these.

If there are any that are empty, then we need to consider an optional value.

But I don't see any that are empty.

We also have the color of the taxi.

That is either yellow or green, I believe.

I'm just looking through the whole list here.

Yellow or green, yeah.

So that can be an enum.

So let's call this a taxi cab color.

And this will have a string raw value.

And we'll use a case for yellow and a case for green.

Again, we could parse this as a string.

But if your goal is to take it into structured data, and maybe this becomes important later, that's something that we might want to do.

We also have pickup zone, drop off zone, pickup borough, and drop off borough.

So pickup zone, maybe that's a zone type.

I mean, there's many of them.

Some of them are empty.

Oh, I also forgot payment.

So let's do enum payment type.

Again, I'm going to make this string-backed.

This is going to be cash or credit card, I think.

Yeah, credit card.

I'll just do cash or card.

And then we will have to map from one to the other.

Let's see.

This is going to be payment, which will be a payment type.

That one actually is optional.

Our pickup zone, I'm going to do a string.

Drop off zone is going to be a string.

And then we also have the borough.

And the borough is more of a well-defined set.

So maybe this becomes an enum.

So here, we'll say enum.

OK, so pickup zone, drop off zone.

This is also in a borough.

So I wonder if we could do our pickup and drop off locations.

We could do a struct called location, which has a zone and a borough.

And I'm wondering if all of the values that are empty are empty on all of them.

So here's an example where it's empty on all of them.

I don't want to group these types up together if it's possible for only some of them to be optional.

Can we filter on-- we're going to do blank ones.

OK, so they are not all optional together.

Sometimes we have a pickup zone-- oh, it actually does match.

So the pickup zone and the borough, maybe this is unknown or not recorded or whatever.

And then drop off zone was there.

So I think these are correct.

Let's go grab.

And we'll select everything.

And we'll do the same thing for drop off zone.

And in this case, if our pickup zone is there, we always have the pickup borough.

So this is actually a good choice by grabbing our location.

So instead of doing pickup and drop off, this is going to be pickup location, which the entire thing can be optional, and drop off location.
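
A sketch of the model this arrives at. The property names, the Borough cases, and the raw values are assumptions (the narration does not list them); payment is shown as non-optional, matching the later decision to default it to cash.

```swift
import Foundation

enum TaxiCabColor: String { case yellow, green }

// The file's wording for card payments differs, so a mapping step is needed.
enum PaymentType: String { case cash, card }

// Assumption: lowercased borough names as raw values; the cases are illustrative.
enum Borough: String {
  case manhattan, brooklyn, queens, bronx
  case statenIsland = "staten island"
}

struct Location {
  let zone: String
  let borough: Borough
}

struct TaxiTrip {
  let pickupDate: Date
  let dropoffDate: Date
  let numPassengers: Int
  let distance: Double
  let fare: Double
  let tip: Double
  let tolls: Double
  let total: Double
  let color: TaxiCabColor
  let payment: PaymentType
  let pickupLocation: Location?   // nil when the zone and borough are blank
  let dropoffLocation: Location?
}

let borough = Borough(rawValue: "Manhattan".lowercased())
print(borough as Any)
```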

OK, so that's our taxi trip.

So let's create a parser for this.

This time I'm going to create a struct called taxi trip parser, which will be a parser.

And here we'll have a var body, which is going to be some parser of substring to taxi trip.

And because we're going to be parsing CSV, we know that these are all going to be separated by a comma.

And we can take a look at our input here.

So we've got our date and time here.

We've got two of those.

So that may be something that we need to parse specifically.

So let's call this a simple date parser.

This has a body which returns a parser of Substring to Date.

And we're going to give this date parser a formatter.

The formatter for this parser will be a static let formatter, which is a DateFormatter.

That's going to equal the result of this closure.

On the DateFormatter, we're going to set the date format, or maybe the date style.

We're probably going to need a custom one.

I'm not sure if we have one that's built in that will use this style.

Well, especially depending on the locale, we could just say let's go check out nsdateformatter.com.

And here we can take this input, and let's just build up our own parser that looks maybe similar to this one, the ISO 8601 formatter.

So if we do that, then we get this date.

And so year, month, day, we don't need the literal T.

And we don't need the time zone either.

And we do need seconds, though.

OK, so this becomes our date formatter, our date format.

So this is going to be f.dateFormat equals this.

And when we're specifying a date formatter like this, we also want to specify the locale, initialized with the identifier en_US_POSIX, similar to what is happening here.

I guess maybe that's underscored.

And then we can return f.
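
A Foundation-only sketch of the formatter configuration. The format string follows the description above (year, month, day, a space, then hours, minutes, seconds); the helper name, the sample timestamp, and the fixed UTC time zone are assumptions, the last one added so results do not depend on the machine's settings.

```swift
import Foundation

// Hypothetical helper name; the episode builds this inside a closure.
func makeTripDateFormatter() -> DateFormatter {
  let f = DateFormatter()
  f.dateFormat = "yyyy-MM-dd HH:mm:ss"
  // A fixed POSIX locale keeps the custom format from being reinterpreted per user locale.
  f.locale = Locale(identifier: "en_US_POSIX")
  // Assumption: treat timestamps as UTC.
  f.timeZone = TimeZone(secondsFromGMT: 0)
  return f
}

let tripFormatter = makeTripDateFormatter()
let pickup = tripFormatter.date(from: "2019-03-23 20:21:09")  // made-up sample
let invalid = tripFormatter.date(from: "not a date")          // nil
print(pickup as Any, invalid as Any)
```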

OK, so now we can say we expect there to be a string in this input, and that string needs to map to a date.

So at this point, we can say somehow we need to know when to stop.

And in this case, we can stop at the first character that's not in a given character set.

So we can start with CharacterSet.decimalDigits, and add spaces, colons, and dashes.

So we take CharacterSet.decimalDigits and union that with a few others.

And we can give it a sequence.

So this will be colon, dash, space.

And then we're going to take that.

We're going to map this substring then to our formatter.date(from:) with a string.

That becomes self.formatter.

And at this point now, it's an optional date.

OK, so I think what we can use here, similar to how we're doing map here, we can do compactMap.

And that will return a parser that outputs the non-nil result of calling the given closure with the output of this parser.

So if we change this to compactMap, then now our simple date parser is going to work.
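
A sketch of the date parser using compactMap. For simplicity the formatter is stored on the instance here rather than as the static property described above; the UTC time zone and the sample timestamp are assumptions.

```swift
import Foundation
import Parsing

struct SimpleDateParser: Parser {
  let formatter: DateFormatter = {
    let f = DateFormatter()
    f.dateFormat = "yyyy-MM-dd HH:mm:ss"
    f.locale = Locale(identifier: "en_US_POSIX")
    f.timeZone = TimeZone(secondsFromGMT: 0)  // assumption: timestamps are UTC
    return f
  }()

  var body: some Parser<Substring, Date> {
    // Consume digits, colons, dashes, and spaces; this stops at the next comma.
    CharacterSet.decimalDigits
      .union(CharacterSet(charactersIn: ":- "))
      // A nil date makes the parser fail instead of producing an optional.
      .compactMap { self.formatter.date(from: String($0)) }
  }
}

let parsedDate = try SimpleDateParser().parse("2019-03-23 20:21:09")
print(parsedDate)
```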

Now, it's worth noting that this format here is also parsable.

What we could have done here is parse exactly four digits, a literal dash, exactly two digits, a literal dash, et cetera.

And that would do the same thing.

And then we'd have to call the date initializer, which lets us specify the year, month, day, hour, minute, and second.

So that's just a different approach that you might take.

That might be faster than using date formatters each time through.

I'm not quite sure.

OK, so at this point now, we've got our simple date parser for the from date.

Then we've got a comma.

Then we've got another simple date parser.

Then after that, we've got distance, fare.

And so we're going to create a double parser four times.

Each one of these needs to be separated with a comma.

Then distance, fare, tip, and tolls, and total.

So we've got five.

Distance, fare, tip, tolls, and total.

Then after that, we've got the color, the payment, and pickup.

So we've got our color.

This is going to be our taxi cab color.

And at this point now, we need to take that prefix as long as $0 is not equal to a comma.

Then we take the result of that, and we need to compact map this to our taxi cab color with this raw value.

So at this point now, substring.element.

So this is a character.

So not space.

We just have a comma.

And then this one needs to be converted to a string for that to work.

So now we have a way to convert the substring into the taxi cab color, which is an enum.

Again, we expect a comma.

After that, we have our pickup zone and drop-off zone.

So this is where we're going to need to parse the string, but then combine those later, because there's no way for us to take these two and then skip one and come back to it, really.

So we're going to create a prefix, $0 not equal to comma.

And then we'll do the same thing for the drop-off zone.

And then also the-- so we've got pickup zone, drop-off zone, pickup borough, drop-off borough.

And at the very end of that, we have just a new line.

And so that's how we're going to capture multiples.

So at this point, we have all of these things, but these need to return a taxi trip.

So at this point, I need to take all of these things and then combine them into something else.

So these need to return a taxi trip.

Now, in order to do that, I'm going to have to take this function here and wrap it in a parse block.

And I'm going to do that so that we can go down to the bottom of this, and we can map this to our taxi trip.

Now, our taxi trip has all of this stuff in it.

So essentially what we need to do is we need to be able to call this MemberWise initializer with all of these values.

And if you take a look at the value here, it looks like we've got a nested tuple.

And so we've got date, date, double, double, double, double, double, taxi cab color, payment type, and substring.

And those are all wrapped in a tuple.

And it just so happens that there's one, two, three, four, five, six, seven, eight, nine, 10-- is that right?

One, two, three, four, five, six, seven, eight, nine, 10 elements in this tuple.

And I don't think that's a coincidence.

I think that this is just the way that the closure parameter-- or sorry, this builder closure-- is able to wrap up all the values that we have here because there are so many.

And so that's going to make this a little bit awkward for us to represent here.

So what I'm going to try to do is define this as a destructured-- or as a nested tuple here.

So the first tuple is going to have a date.

And that's going to be the pickup date.

Then we're going to have the drop off date.

Then we're going to have num passengers.

And I think I forgot to do that up here.

Yeah, I did.

So this one becomes an Int.parser(), followed by a comma.

OK, so num passengers, that's going to be np is an int.

And then what do we have after that?

We have five doubles.

That is the distance, which will be a double.

We've got the fare.

We've got the tip, the tolls, and then the total.

Let's see.

Yeah.

And then after that, we've got the taxi cab color.

That's going to be taxi cab color.

And then after that, we have-- this actually needs a name.

So this is going to be-- well, this is going to be the first element.

So it's going to be args, which is of this type.

Then after this one, we're going to have those four other things.

This is going to be the pickup location, which is going to be a string.

And then we're going to have the drop off location, which will be a string.

Then we have the pickup zone, which is going to be a borough.

And then we have the drop off zone, which is going to be a borough.

And at that point now, our tuple is-- I believe that this is complete.

And that is a type here.

And then we need to wrap the whole thing.

OK.

So at this point now, we have all of our values.

We're going to do args.0.pickupDate.

And I called that pub instead of p-u-d.

Then drop off date is args.0.dropoffDate.

After that, we have the two locations.

We've got the pickup and drop off and payment.

So we've got payment.

And we've got payment here after color.

And then we've got the four items there.

So payment here is going to be the payment here.

Pay will be the payment type, which we said was non-optional because we're going to default it to cash.

And these things do need to get nested like that.

It's a little bit weird how this nesting works.

All right, so now we have our pickup and our drop off, which we need to add as locations.

The locations have a zone and a borough.

The zone will have-- it may be an empty string, but we'll have it.

But the borough, we may not.

So what we're going to do is grab the pickup borough, which is this one, and the drop off borough.

So we're going to take that and then flat map it to location with the zone of args.pickupZone, which should have been this one.

And then the borough will be $0.

And then we're going to do the same thing for drop off.

This time we'll do drop off borough, flat map.

This will be drop off borough.

And then the-- sorry, this will be zone.

And then the borough will be $0.

OK, we need a comma here.

And it says no exact matches in the call to map.

So we've got payment type here.

Then we've got 1, 2, 3, 4.

Now these things are strings, and these things are boroughs.

And we have not mapped them.

So this is one area where, because we're doing this same thing multiple times, this actually would be good for us to have a struct called a CSV field, which is a parser.

And then this will have a body, which-- not a string.

This is a some Parser that takes a substring and returns a string.

So this would be like a string field type.

And then we can have that do the prefix logic where we take everything up to the next comma.

But we also need to make sure that we don't run past a new line.

So the predicate is: not a comma and not a new line.

Once we have that, we can map that to a string.

We can do map string.init.

And now we've got a CSV field parser, which we can use for these.
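
A sketch of the reusable field parser and how it might be mapped to a borough. The Borough cases and the sample row are assumptions; only the comma-and-newline predicate and the lowercased raw-value lookup come from the narration.

```swift
import Parsing

// Assumption: illustrative cases with lowercased raw values.
enum Borough: String { case manhattan, brooklyn, queens, bronx }

struct CSVField: Parser {
  var body: some Parser<Substring, String> {
    // Take characters until the next comma or the end of the line.
    Prefix { $0 != "," && $0 != "\n" }.map(String.init)
  }
}

let zoneAndBorough = Parse(input: Substring.self) {
  CSVField()  // zone, passed through as a String
  ","
  CSVField().map { Borough(rawValue: $0.lowercased()) }  // nil for blank or unknown values
}

let known = try zoneAndBorough.parse("Upper East Side,Manhattan")
let blank = try zoneAndBorough.parse(",")
print(known, blank)
```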

So this will be a CSV field.

And same thing here and here and here.

Now the first two are going to just pass right through to these two.

But these I need to map to those boroughs.

So this is going to be map to.

And now I'm going to take the borough.

And we're going to try to initialize it with a raw value, which is going to be that string lowercased.

And that can return an optional.

And that's fine.

So this becomes the borough based on the CSV field.

Now we have the same thing for this one.

So we can do CSV field here as well.

And then this one maps from the string value to that.

And we can also do it for the taxi cab color.

OK, let's build this.

We're still missing something here.

If we take a look at what the map wants to give us, we have date, date, int, one, two, three, four, five doubles, taxi cab color, and payment type.

So payment type goes inside.

And then after that, we've got string, string, and two optional boroughs.

So I think that that is correct.

Now this becomes args.0.pay.

OK, so now we have the ability to read all of this data and turn it into a taxi trip.

So at that point now, let's grab our CSV.

It's going to be try load input.

The name is going to be taxis.csv.

The first line of this CSV file is the headers.

And we don't want those.

So at this point, I'm going to split those with a separator of new line.

And then I'm going to drop the first one.

And then I'm going to join the rest with a new line.

There's probably an easier way to do this.

But yeah, I'm not too concerned about performance at this stage.
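
The header-dropping step can be sketched with the standard library alone. The sample text is made up; note that split(separator:) omits empty pieces by default, so a trailing newline disappears as a side effect.

```swift
// Made-up sample with a header row and a trailing newline.
let csvText = """
pickup,dropoff,passengers
2019-03-23 20:21:09,2019-03-23 20:27:24,1
2019-03-04 16:11:55,2019-03-04 16:19:00,1

"""

let rowsOnly = csvText
  .split(separator: "\n")  // one Substring per non-empty line
  .dropFirst()             // skip the header row
  .joined(separator: "\n") // back to a single String for the parser

print(rowsOnly)
```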

So at this point now, let's create our parser for taxis, taxi trips.

That's going to be a parser.

And we'll do try.

Inside of here, we'll do a Many.

We're going to do Many of the taxi trip parser.

And each one has a separator of a new line.

Then we can say parse CSV.

That needs to be parse.

This needs to be parse with just the raw.

For trip in taxi trips, we'll also print out the parsed taxiTrips.count trips.

For each one, we'll do, I don't know, the fare.

We can print out trip fare was the trip.total.

And let's just print these out.
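
To show the overall shape without the nested tuple, here is a deliberately reduced sketch: a hypothetical MiniTrip with only four of the columns, parsed row by row with Many. The type names, column choice, and sample rows are assumptions, not the episode's full TaxiTrip parser.

```swift
import Parsing

enum TaxiCabColor: String { case yellow, green }

// Hypothetical reduced model: four columns instead of the full trip.
struct MiniTrip {
  let passengers: Int
  let distance: Double
  let total: Double
  let color: TaxiCabColor
}

struct MiniTripParser: Parser {
  var body: some Parser<Substring, MiniTrip> {
    Parse {
      Int.parser()
      ","
      Double.parser()
      ","
      Double.parser()
      ","
      // Fails the row if the color is not one of the enum's raw values.
      Prefix { $0 != "," && $0 != "\n" }
        .compactMap { TaxiCabColor(rawValue: String($0)) }
    }
    .map { passengers, distance, total, color in
      MiniTrip(passengers: passengers, distance: distance, total: total, color: color)
    }
  }
}

let tripsParser = Many {
  MiniTripParser()
} separator: {
  "\n"
}

// Made-up rows, header already removed.
let trips = try tripsParser.parse("1,1.6,12.95,yellow\n3,0.79,9.3,green")
print("Parsed \(trips.count) trips")
for trip in trips {
  print("Trip total was \(trip.total)")
}
```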

And now you can see that we're pulling out the structured data from the CSV file.

And we're doing that in a way that is composable.

So there's probably other ways to clean this up.

But I do like the fact that we can sort of build up bigger parsers out of other smaller parsers.

All right.

And this is a great tip to have in your tool belt, especially for Advent of Code, which is going on right now.

Every single one of the problems starts off with some text input.

And many of those could benefit from using a parsing library to parse that into some structured data.