Episode #605

Profiling and Removing Splits

Series: Billion Row Challenge

67 minutes
Published on February 4, 2026
The initial, naive solution for the 1 Billion Row Challenge, which split each chunk on the line separator and then on the semicolon, parsed the city and the temperature, and then collected and output the data, ran in approximately 11 minutes. This time we profile our solution to see where the time is spent. We find some quick wins for optimization and drastically improve the run time.

This episode uses Swift 6.2, Xcode 26.2.

Ben

All right, welcome back, Matt. Let's resume the 1 Billion Row Challenge. Last time we did the basic, straightforward, naive solution without regard to performance, processing these chunks. Each chunk ends up splitting the string on the line separator, then by the semicolon, then we parse the city as the first part and float-parse the temperature as the second part. And then we collect all the data and output it at the end. That solution ran, and I let it run even though we weren't done recording; I put the little timestamp at the very end. It was about 11 minutes. So pretty awful in terms of performance. What I think would be interesting is to profile this first, because there are lots of things that I would like to clean up, but I don't want to clean up for code vanity's sake, which is very tempting to do. I want to see where we're spending all this time and try to be driven by the metrics. Yeah, that's a great idea. Okay, so there are a couple of ways we could do this. I have this input file environment variable set up here. And I don't think I called it test.txt. What did we call it? There's measurements and measurements huge. And I guess that it's going to... Let's see. Input file. Did I call it input file? That looked like...

Matt

I think input, yeah.

Ben

Okay. Input file. And then I'm going to call this measurements.txt, because we don't need to run the full 1 billion rows to get all these metrics; it's doing the same thing over and over again. So we figure out how much time we're spending, percentage-wise, in each thing, and that'll show us the outliers.
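For readers following along, the environment-variable setup being described can be sketched roughly like this. The variable name INPUT_FILE and the file names come from the discussion, but the exact code in the project may differ:

```swift
import Foundation

// Hedged sketch: read the input path from the INPUT_FILE environment
// variable, falling back to the small test file discussed here.
let inputPath = ProcessInfo.processInfo.environment["INPUT_FILE"] ?? "measurements.txt"
print("Reading from \(inputPath)")
```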

Matt

As long as it's not slowing down over time, that's true. But I still think even if that was the case, this is going to be useful information.

Ben

Yeah, I see what you mean. We didn't really keep track of the memory usage over time and stuff like that.

Matt

I don't immediately see that. It's true the result, that dictionary is growing because we're adding new entries to it, but it's not growing that much.

Ben

Yeah, I think there are going to be about 10,000 entries in here.

Matt

Yeah, that doesn't sound outrageous.

Ben

I feel like that's not...

Matt

I would be surprised if that's the problem. So let's take a look at this. I think this is going to be useful.

Ben

Okay. It's also worth noting that when we profile (I'm going to go back over to the scheme editor so that we can see this), profiling is done in release mode, which is important because there's going to be all kinds of slow stuff that happens in debug builds.

Matt

When you ran this, that was in debug. Was it?

Ben

Was it? I don't remember.

Matt

I'm pretty sure. I don't remember you specifically building for release, but maybe you did.

Ben

Remember I did the mise.toml? Oh, yes, you're right. You did do that. So I did. Yeah, I wanted to make sure about that. And let me make that bigger.

Matt

Yeah, this is a cool tool.

Ben

Okay. So when we do profile, it's going to do it in release mode, which is important because in debug builds you're going to have all kinds of code that's added to make debugging easier: bounds checking on arrays, so, for instance, every array index is going to be slower, and probably similar things on dictionaries. A lot of that stuff is just optimized out when you do release mode. There are probably lots of other interesting compiler tricks that I'm not aware of that slow it down to make debugging easier. But we don't want any of that, because we're just concerned with raw performance. Right. Okay, let's look at the... I don't use this very often. I usually use the Time Profiler. I don't know how familiar you are with Instruments and using it.

Matt

Were you thinking of doing something different?

Ben

No, no, no. So what I mean is I don't often use a profiler at all.

Matt

Oh, yeah.

Ben

And when I do, this is kind of the one I'm familiar with. And so I'm happy to learn more about instruments or if there's, you know, a better way to do this sort of thing.

Matt

I think Time Profiler is great. And I really like these tools called signposts. Have you used those before? Yeah. They are incredibly useful. I wouldn't be surprised if we start getting interested in using them. But first, let's just take a look at it, because it's a relatively simple program. I think we'll learn a lot from this.
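As an editorial aside, a minimal sketch of the signpost idea Matt is referring to, assuming macOS 12+ and the os framework. The subsystem, category, and interval names here are made up for illustration:

```swift
import os

// Hedged sketch: OSSignposter marks named intervals that show up in
// Instruments' timeline. The strings are illustrative, not from the
// episode's project.
let signposter = OSSignposter(subsystem: "com.example.1brc", category: "parsing")

let state = signposter.beginInterval("processChunk")
let work = (1...10).reduce(0, +) // stand-in for processing one chunk
signposter.endInterval("processChunk", state)
```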

Ben

Okay. So Time Profiler. It's basically going to launch our application, and it's going to start recording immediately. Maybe, just to make this consistent across every run, we could run for the entire thing, or I could just run for, like, 30 seconds. That way, every single time we would get not only the percentage but the actual time we spent in each thing. Oh, yeah. Okay. I don't remember if ours took 30 seconds. I think it took less.

Matt

I think it took less as well.

Ben

So let's figure it out. If this takes less, then maybe I'll just do it on the big one and we'll just stop.

Matt

I think it will probably stop recording if the process exits anyways. All right.

Ben

So let's do this. Okay, so what are we looking at here? I'm hitting Command-minus and plus to sort of zoom out and zoom in. I think there's a way to... yeah, you can control the window, but we probably want to see the whole thing. And then this is the time spent on the CPU. This thermal state I don't care about.

Matt

And then here's our... I think we might have a problem, though, because I don't see symbolicated traces.

Ben

I think, yeah, we're getting it. We're getting... you're right, we are. Okay. So this is going to show the entire weight of the program. The entire thing took 661 milliseconds, which is faster than I remember. But let me just double-check that. If we do mise r go, it will do a build, and then I think it times the thing. Well, that took five seconds. You know what's interesting, though? Wait, scroll down.

Matt

Your time? Look at those real user and sys.

Ben

Yeah, that's interesting. I'm not sure I understand what's happening here. The "finished in 5.49 seconds" is coming from the time command, right? It's not coming from mise or whatever. Because I'm also doing the build as a dependency. But then when I do it the second time... and just to confirm, we are depending on this, and then we time the result of that.

Matt

Right. So we're not timing the build. This is only timing the execution.

Ben

Yeah. And it finished faster, which is worth noting, because this file... my SSD may well have a cache in front of it. I'm not sure. Well, the OS will cache that for sure. The OS definitely does.

Matt

Yeah.

Ben

So it's like, oh, you asked for this file and it's still there. And so that's why we get different numbers here. But it looks like we're getting about the same as what Instruments saw. It's just strange to me.

Matt

Yeah, that's what's strange. I can't explain that. And in the case where it said five seconds, those numbers were roughly the same. So that is a mystery, but maybe not that important for what we're specifically doing.

Ben

Yeah. Okay, so instruments saw 661 milliseconds. 100% of the time was spent in this binary. And we go down to our main function and we start seeing these numbers decline. And then at some point you open one up and you're like, okay, inside this main function. And is this visible? I can zoom this in for the recording, but I'm just curious if you can see it okay.

Matt

I can see it. Yeah, I can see it. Okay.

Ben

So now all of a sudden these numbers are starting to diverge. And we can see that we're spending 81% of the time in this process-newline-aligned-chunks function, which is kind of where all the work is. But 15% of our time is spent in finding the last newline, which is surprising. So it's tempting to just go for the biggest thing, but I also think that this shouldn't take long at all. Well, let's have a peek. Let's just see what's going on in there. And so we have Sequence... got reversed. That is weird. We had to walk from the back of the file. And then inside of there, it's initializing arrays for 14% of the execution time. This is weird to me, because why are we creating new arrays when we should just be creating iterators and walking backwards?

Matt

I don't know.

Ben

And so it's copying from one array to the other. So this is a really good thing for us to look at. Yes. And it's nice to know that these are inlined, too. I don't think I said that anywhere; in Xcode it just did it, which is good. You can say inline always or something like that. Yeah, inlining is always crazy, and I rarely...

Matt

I never mess around with it unless I have a very good reason to do it.

Ben

Yeah, but it's nice knowing that the compiler chose the right thing here, which is to inline this function. And again, for people who are watching who aren't familiar with this type of stuff, what inlining actually means is: the compiler is going to copy this code... and where was that? Find newline. And it's going to replace this call with all of that code, every time. So we can see it as a nice separate function, which makes it easier to step into and all that stuff. But when you compile in release mode, it just copies the code in.

Matt

And the reason it might choose to do that is because function calls themselves are extremely cheap, but not free. Yes. And so in a loop in particular, it makes a lot of sense to do that.
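A small illustrative sketch of the hints being alluded to above. The attribute spelling is @inline(__always) / @inline(never); in release mode the optimizer usually decides on its own, as it did here, and these only nudge it. The function names are made up:

```swift
// Hedged sketch: Swift's inlining hints. These only nudge the
// optimizer; it is free to inline (or not) without them.
@inline(__always)
func isNewline(_ byte: UInt8) -> Bool {
    byte == 0x0A // ASCII "\n"
}

@inline(never)
func keepMeOutOfLine(_ x: Int) -> Int {
    x + 1
}
```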

Ben

Okay. So find that last newline. I did this data.enumerated().reversed(). So somewhere in here, when it creates the reversed sequence... So it's not in the enumerated part. It's definitely here, where it's creating a new array, which...

Matt

...is super strange to me, and I don't quite understand why that would happen. We're doing data.enumerated(). Why is the enumerated in there? Oh, because we need the offset. That's right, yeah.

Ben

So I think what we could do is change this to a simple for loop where we say for... let's see, we don't care. Yeah, so we just need the index. And then I'm going to use stride(from:to:by:). And this is going to be from data.count to zero by negative one, so it'll go from the end and walk backward. And then, I see what you're saying, now I can check.

Matt

I would not have thought to do that. This is an interesting way to do it for sure.

Ben

What would you have done? A range? Like a negative range?

Matt

What I was thinking was range, but then range only has to go from smaller to bigger. So maybe I would have tried to reverse that.

Ben

range.reversed(), yeah. Which would be an array copy, probably. Maybe.

Matt

I mean, I'm very surprised that reversed requires a copy.

Ben

Yeah, I thought that was the entire point of using these types of APIs: they're just a way to control the iteration, not actually creating new arrays. We'll see what the performance is. But let's try this for sure, compared to that. So now I get data for that index. And if that is equal to our newline byte, which I put... I think I did that.

Matt

Yeah, it's UTF-8. UInt8.

Ben

Yeah. So then we have to return the index. And we're going to fatal error if we got to the end and we didn't find it.

Matt

That's right. Yes.
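A reconstruction of the backward scan just described, under the assumption that the function and constant names match the episode's code. Note the start index: striding from data.count - 1 avoids reading one past the end (the off-by-one that surfaces a little later in the episode):

```swift
import Foundation

let newlineByte: UInt8 = 0x0A // "\n"

// Hedged sketch of findLastNewline: walk backward with stride instead
// of enumerated().reversed(), which was showing up as an array copy
// in the trace.
func findLastNewline(in data: Data) -> Int {
    for index in stride(from: data.count - 1, through: 0, by: -1) {
        if data[index] == newlineByte {
            return index
        }
    }
    fatalError("expected to find a newline")
}
```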

Ben

Okay. I just want to compare for a second. So we did a guard, and then a first. Okay. Yeah. So that should be generating roughly the same code. I am really not versed at looking at the actual assembly here, but it is probably useful at some point to have a general understanding of what we're even looking at. Right. How familiar are you with this type of stuff, looking at assembly? Yes.

Matt

I've done it before, but it's not a thing I like to do.

Ben

That sounds like exactly my experience. Like I did it in college and I have poked around and I can kind of like generally see like, okay, we've got moving stuff from memory into registers. We've got BL. I don't know what that means.

Matt

I can't remember.

Ben

Sometimes B means branch.

Matt

Yeah, it could be. And there's this "outlined"... I mean, the little comment is like a function call of some kind, but I don't know. I'm not sure. Let's run what we have.

Ben

I don't know that we need to look at this yet. I'm just saying, just getting familiar with this... maybe this would help us figure out why we're seeing a copy when...

Matt

Well, to me, line 11 is the interesting thing. If you go back for just a second... sorry, not that one. This one. Yes, line 11, this one: copy. So it's copy sequence. This is clearly intentional. This is built to do something, and it must be that to satisfy the API they have to do this copy. I'm going to assume these are pretty optimized already. So if they're doing this copy, it's probably because they have to. I am surprised, and I'm having trouble understanding why that would be the case. But I think it's interesting that not that far down, we see an "outlined consume of". And the reason that is interesting to me is because, just recently (we talked about this a little last time), non-copyable types are this thing that is being incorporated more and more into the language. And specifically, there was a non-copyable version, or I should say a borrowing version, of Sequence being proposed. And I wouldn't be stunned if we are seeing one of the performance implications of the API right now.

Ben

Yeah, okay. That's interesting. I have done a fair bit of Rust, so I'm familiar with borrowing and consuming and moving, that concept. And I'm recognizing that it's coming to Swift. But I don't often notice why I need those things in day-to-day Swift development. Me neither. But maybe we're seeing it here. Okay, so let's go to this implementation, and we will profile again. I'm going to hit Command-I to open the profiler, and I think I can just click this to make a new run. And hopefully we can save these runs.

Matt

Yeah, we can.

Ben

Well, I guess I also would like to know, do you think we should do more data by using the larger file and just cutting it off at a time frame?

Matt

Right now, I don't. That was so easy to do, and my intuition is that what we made has to be uniformly faster. Yeah. So I think we just keep going, because I want to look at that other one. This was the smaller chunk. Let's see what this looks like, and then we can turn our attention to the smaller chunk. And then maybe we can see if we made enough of a difference.

Ben

Looks like that didn't work. Let me delete this and try it again. So at least we're getting the same results. Something happened... it's crashing now. Okay, so...

Matt

Well, wait a minute, wait a minute. data.count. We can't read from the end. Yeah, there we go.

Ben

Now I know why we don't use these APIs anymore. An off-by-one error. Yeah. Okay, so it doesn't seem faster. But let's use the profiler to confirm. Let me delete this run. And you know what's interesting? I could rename these. I could say "run one, naive". Yeah. And then, here we go. Okay. So this one, run five... what do we call it?

Matt

Well, this is the one with no reversed. Yeah. But this does not seem like it is making progress here. Okay, so... Oh, no, no, no, it just zoomed in. Yeah, there we go. We have something.

Ben

Okay, so here's another macOS trick that I find not many people know: if you hold Option when clicking in any outline view, it will just recursively expand.

Matt

Yeah. And if you have a... we have a very small... look, it didn't even do it. You see how it didn't recursively expand everything? I have done this with Instruments, but for non-trivial traces it's an explosion of too much stuff.

Ben

Maybe there's a max depth that it does. It could be. But I find that to be super useful. I use it all the time, and when I'm watching other people use a computer, I'm like: they click this, then they click this, then they click that, and I'm like, yeah, there's a faster way. Okay, so let's go back. Process newline-aligned chunk is now 88.9%, and now we're missing... what happened to that other code? It's just not there anymore. I don't see our function.

Matt

Yeah, that's interesting. So it's gone. That cost is now gone, but it didn't really speed up the overall program.

Ben

It did. The last one was 660 and this is 574. So that time is definitely gone. I'm just curious; we should still see it in the trace somewhere. It's there somewhere. Yeah, it's in this list somewhere. And because it was inlined, would we see... It should still...

Matt

We should still see it. We'd still see the function name? We should still see it. What was it called again? It was called...

Ben

Find last new line.

Matt

Yeah.

Ben

It should still be there. We can filter. But it's not showing. So, okay. Well, I'm going to call that a win.

Matt

Yeah, okay. That is strange, though.

Ben

We can just start moving on to the next biggest thing. Okay. So the next biggest thing, as we look at our main function: 95% of the time is spent here. We've got some string interpolation, which is probably our print. It's our print. So I'm not worried about that, and it's also tiny. So we take a look at this, and 88% of the time is spent in this function. Then we have two of these which look the same. So I think it's just... We do two different splits, though. That's right. Okay. Yeah. So we do two different splits. And inside each split, 30% of the time is reading from the collection: Collection.subscript for strings. Because this is a UTF-8 string, subscripting into a string isn't quite straightforward. That's right, because you need to count from the beginning and process the UTF-8 lengths of each character. So it's not a...

Matt

It's not a plain array.
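To illustrate the point about string subscripting (an editorial example, not from the episode's code): String has no integer subscript precisely because indices must respect character boundaries in the underlying UTF-8 bytes.

```swift
// Hedged sketch: you advance a String.Index rather than indexing by
// Int, because characters can span multiple UTF-8 bytes.
let city = "Zürich" // "ü" occupies two UTF-8 bytes

let secondCharacter = city[city.index(city.startIndex, offsetBy: 1)] // "ü"
let utf8ByteCount = city.utf8.count // 7 bytes for 6 characters
```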

Ben

A plain array, yeah. Okay. What else is interesting here? String.init. So we're spending 25 milliseconds turning a substring into a string here. And why do we do that again? Oh, that's right: because our key is a regular String, not a Substring, so we have to do that. Now, can you do that with a substring? Like, could I say that this result is a substring, not a string? Then I would change this to Substring. And then here I could change this.

Matt

What I'm wondering about, I think you can do this, but what I'm wondering about is whether the lifetime of the substring lives long enough here.

Ben

Because the string is here. And we-- or sorry, yeah. So that string is here. Then it gets split into these, which each line is its own string, right? No, no, no. Okay, so a line is a subsequence into a string. So we still have a window into this big string, and this big string is not going to stick around for long enough. After this function's over, that memory's gone.

Matt

My suspicion is that this will not actually result in any meaningful performance improvement.

Ben

I agree because it's not the biggest in our list. Well, I didn't mean that.

Matt

What I meant was I think that we're just shifting around but ultimately have to do the same work because fundamentally, I think some copy still needs to happen. So basically, is it cheaper to hold on to a substring or to make a new string? I bet it ends up doing the same work, just in a different form. But it would be interesting to measure.
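A quick editorial illustration of the trade-off being weighed here: a Substring is a view that keeps its parent String's storage alive, while making a String is a small independent copy. The names and sizes below are made up:

```swift
// Hedged sketch: the substring shares the big parent buffer (keeping
// it alive); the String copy is independent but costs an allocation.
let bigLine = String(repeating: "x", count: 1_000) + ";12.3"
let view = bigLine.split(separator: ";")[1] // Substring: parent kept alive
let copy = String(view)                     // independent small String
```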

Ben

Yeah. Okay. So let's see. We also have a string.init for this. the string data one. Oh yeah, because we're doing the split. And I can't remember which, this one is from a substring, so that's the one we were just talking about. Where is the string init?

Matt

I think what this is telling you is that this function spent this time in String.init, but it doesn't distinguish which of the different ones. I don't think it distinguishes between these two. Like, actually...

Ben

...there's this one, which is copying. But then I can't see anything else. So in order for me to do that, I would probably have to jump into here to figure out what it's doing. It's copying from an unsafe buffer, which seems like the data thing. Unchecked.

Matt

If you go back to the tree view for just a second... I'm actually kind of getting interested now. Why is there one String.init even though we do strings more than one time? And why are there two collection splits? Are those the same? They look the same to me.

Ben

Remember, we have two splits: one on newline, one on semicolon. Yes, that's true. But you're saying, like, why...

Matt

Why are there two entries in this table? That's what I don't understand.

Ben

I think perhaps it is a specialized function. Like when it inlined it, it inlined a version that operates just on semicolon and a version that operates just on newline. Specialized? I thought that was about types, not about values. I don't know. But when you inline this function, is that going to be a variable, and the code is duplicated and references the variable, or is it going to copy this as a literal into the inlined code? I thought it would be a literal. So if that happens, then we're going to have one version that has this literal and one version...

Matt

No, here's what's going on: one split doesn't use max and one does. These are two different functions. That's why it shows up twice.

Ben

Ah, okay. Well, this one might call the other one, though, because it's just a default Int.max. Okay. Well, I actually would like to collect some more data. I think that going to the bigger file and processing it for some amount of time will give us more information, because I just think there's not enough happening. So let's use measurements huge. You're right though, this is a good idea. Profile. And this one, I said it was going to stop after... Maybe we don't need 30 seconds.

Matt

So yeah, I mean, we probably have a way better window into what's going on already.

Ben

I think I'm going to just say 10 seconds so that each one is roughly the same. Yes. All right. Time profiler.

Matt

Okay, well, this is good, because it tells us that all that really matters is our newline-aligned chunk function. Yeah.

Ben

So splitting is expensive.

Matt

Yeah.

Ben

And it's expensive. I mean, it's kind of spread out. There are some, like, the subscript read unchecked from UTF-8. So that is part of it: splitting a UTF-8 string is expensive. So what if we didn't do this? Which means we can't do that, but we can utilize the same strategy to find the byte. Instead of finding the string, we find the byte of the newline, because we do have one in here.

Matt

Right

Ben

Well, we have many, right? And then we can do the same thing for finding this. Like, for instance, that could be...

Matt

Can I look? Can I see the file format again, just quickly? I just want to remind myself what it looks like. I remember there are lines, but how many semicolons are inside? There's just one per line? There's always one. Okay.

Ben

This one. Oh, wait, no. This is, yeah, this is the basic readings. And then it generates more values based off of this. The other thing is that we will never have these types of decimals. We always have exactly one. That's right. So if we look at... they don't have that, but let me just look. So we always have either positive or negative, some number of digits. It's not zero-padded, but we will always have exactly one digit after the decimal point, even if it's zero.
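The format invariant just described (exactly one digit after the decimal point) is worth dwelling on, because it enables integer parsing. This sketch is not what the episode implements; it only demonstrates what the invariant makes possible, and the function name is made up:

```swift
// Hedged sketch: with exactly one decimal digit, "-12.3" can be read
// as the integer -123 (tenths of a degree), skipping Float parsing.
func parseTenths(_ bytes: [UInt8]) -> Int {
    var value = 0
    var negative = false
    for byte in bytes {
        switch byte {
        case UInt8(ascii: "-"): negative = true
        case UInt8(ascii: "."): continue // position implied by the format
        default: value = value * 10 + Int(byte - UInt8(ascii: "0"))
        }
    }
    return negative ? -value : value
}
```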

Matt

Yeah. I was just thinking, because we can totally start by not operating on strings from the beginning. And I think that that will lead us towards a higher performance solution. Yeah.

Ben

Well, I also don't think that we need it here, right? Because we can find this with just byte-by-byte comparison. That's true. We can find this with byte-by-byte comparison. We really don't need the key to be the city's name as a UTF-8 string.

Matt

That's also true.

Ben

It could just be...

Matt

Yeah, we could operate on the data and just march through the data chunk that we have.

Ben

And then we do need it at the end, but at the end we have a much smaller set of data to work with. Because we'll need to sort it and print it out. So is that what you're thinking?

Matt

You're thinking about just going through byte by byte of this chunk and like kind of parsing it as we go? Yeah.

Ben

And I think that also leads into the SIMD strategy: okay, now I'm doing it one byte at a time, trying to find a newline byte in this array of bytes. If I structure that problem correctly, we can do it a SIMD width at a time, and that's one instruction. But I don't know exactly how to do that. So let's get it working without strings first.

Matt

Yeah.

Ben

So I need some sort of func, maybe on Data. Let's do that. An extension on Data, which gives me... we're going to say find some byte. Isn't Data a sequence? Can we use first here? Data is a sequence.

Matt

So use first(where:). I think findByte is the same as the first function.

Ben

That needs to return the index. Let's see. So now I'm back at the enumerated().first(where:), and now I need to...

Matt

Oh, that's right, because we don't want the byte, we need the index. That's right, I need the index. I think there's also a firstIndex, something like that. firstIndex(where:)... there we go. Well, actually, firstIndex(of:). firstIndex(of:), yes. There you go.

Ben

That's fine. Okay. Then I don't need this; the only thing I'd be doing there is maybe force unwrapping, but I can do that here. Okay. So let's get the while let newlineIndex equals... Yeah, I'm following you. I need that newline index to start at some number, because I need to start at the start of the string.
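The helper being settled on here can be sketched as follows. Since Data is a Collection of UInt8, firstIndex(of:) already is the "find byte" function, and searching from a position is a slice search; the literal chunk below is mine:

```swift
import Foundation

// Hedged sketch: firstIndex(of:) on Data and on a Data slice. Slices
// share indices with their parent, so the returned index is absolute.
let semicolon = UInt8(ascii: ";")
let chunk = Data("Hamburg;12.0\n".utf8)

let fromStart = chunk.firstIndex(of: semicolon)        // Optional(7)
let fromOffset = chunk[3...].firstIndex(of: semicolon) // also Optional(7)
```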

Matt

Yeah.

Ben

Maybe I need to do this once here and then at the end.

Matt

These kinds of loops can sometimes be modeled by having some sort of beginning. You might not do that. The first thing you can do is have it be, for example, 0. Then you can say: while the newline index is less than... so you don't fall off the edge. So while it's less than chunk's end index, the first operation is find the first... So it's like a while true, basically. Yeah. But you still do need your position, so you need your line index to be outside of the loop, so you can have some sort of state as you're walking through this. Yeah, yeah, you're right.

Ben

Because I need to know. So here's the thing: my chunk... I'm going to have to do, like, my bytePosition is zero or something like that.

Matt

Yeah, you have to start. So that's fine. And the very first thing that you do, wherever you are, is find your first newline.

Ben

Find the newline index in our chunk from our byte position onward.

Matt

And just so I remember: in the newline-aligned chunk, our chunk is guaranteed to start at an entry point and not be a partial entry. Yes. Okay. So here's how our algorithm could work. We know we're at an entry, so we have this little tiny state machine. The first thing it does is find a semicolon, and that's the name. And then the next thing it does is find the float value, and then it's looking for a... Yeah, you're right.

Ben

Okay, so this is going to be the semi-index.

Matt

Yeah.

Ben

So we start at the byte position that we're at. We go to the first one of the semicolon, which I'm going to have to do here.

Matt

Yeah. We've got to look that one up.

Ben

Let's see... the ASCII semicolon.

Matt

asciitable.com.

Ben

I should probably have memorized that. asciitable.com. So it's 59. That sounds believable. No, Alt-59... that's a Windows thing. It is still 59. Okay, good. All right. I'm still going to go there because I want to see it.

Matt

It does not look good, but it has served me well for 30 years.

Ben

Oh, God, the ads.

Matt

It's actually a shame there's no easy way in Swift, instead of writing the hexadecimal, to get back the ASCII version of the character encoding. But that's a non-trivial thing to do because of how Swift represents characters. I believe it is possible to do, but it's not straightforward.

Ben

Well, in Rust, you can say b and the value, and it will give you the byte representation of that. If there's a way to do this... You could just have a literal, you know, a literal of b'\n'.

Matt

Well, so the problem is: b'\n', but what is the encoding of that string? Is that a UTF-8 encoded string? Does Rust define it to be UTF-8? I don't know. So the binary encoding of slash-n...

Ben

...if it's UTF-16 encoded, it's not the same. It's probably just a single character.

Matt

Yeah. But what I'm saying is, the definition of what is a character depends on your encoding. True. These are things we could look up. I am perfectly happy with 59.
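One editorial note here: the standard library does offer a small affordance for exactly this, UInt8(ascii:), which takes a Unicode scalar literal and traps if the scalar is outside ASCII, so 59 doesn't need to be memorized:

```swift
// Hedged sketch: UInt8(ascii:) converts an ASCII scalar literal to its
// byte value, and traps on non-ASCII scalars.
let semicolonByte = UInt8(ascii: ";")  // 59
let newlineByte = UInt8(ascii: "\n")   // 10
```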

Ben

Okay. So, while true, we get our semicolon index: we get the first index of our semicolon, and now we know... yeah, we're parsing here; this is what we're doing. So our city bytes are the chunk from our byte position up to but not including the semi-index. That's right. Those are city bytes. That's currently an optional; it might not exist. I'm going to force unwrap it, because I just want it to fail hard if... Okay. I mean, we could add debugging information if we hit this, but I think this is going to be fine. So we go to... well, I think actually...

Matt

Yeah, I think that's the check for when we're at our end. So if that's an optional... if it's nil, I believe we have finished processing the chunk. So to terminate this while loop, I think we can make that a guard, and the else is just break.

Ben

Okay.

Matt

All of the other ones we're going to have to check more carefully about.

Ben

So it depends on whether, for the last one, we ended on a newline. So are you saying that we...

Matt

I think our final entry, our final byte is going to be the last decimal digit of a floating point number.

Ben

Yes, you're right. Okay, so if we can't find the next one, then we're done.

Matt

I think.

Ben

Got it. Okay. So now that we have that, I can move our byte position equal to the semi-index plus one.

Matt

Yep.

Ben

And now we're looking for the newline, and we find from that position to the newline byte... this one should... And I guess I'm just going to do this, because it's a fatal error: "expected to always find a newline byte here." And then I could even convert this chunk from the byte position to the end of the string into a UTF-8 encoded string. Okay. So now we have our newline index. And I realized after typing this, I hate the word temp, because I have, like, PTSD from temp variables and stuff like that. So I'm going to call this temperature. Yeah, temperatureBytes is equal to our chunk at our byte position up to but not including the newline index. Okay, so now that we have this, our results array... Let's see, what do we do with this? We need to do this sort of stuff.
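Putting the pieces of this loop together, here is a self-contained reconstruction of the scanner being built. Variable names follow the discussion; the literal chunk, the dictionary-of-arrays shape, and the treat-missing-final-newline-as-end-of-chunk behavior are my assumptions to keep the sketch runnable:

```swift
import Foundation

let semicolon = UInt8(ascii: ";")
let newline = UInt8(ascii: "\n")

// Hedged sketch of the byte-by-byte state machine: find ';', slice
// the city, find '\n', slice the temperature, repeat until no ';'.
let chunk = Data("Hamburg;12.0\nOslo;-3.4\n".utf8)
var results: [Data: [Float]] = [:]

var bytePosition = chunk.startIndex
while true {
    // No semicolon left means the chunk is fully processed.
    guard let semiIndex = chunk[bytePosition...].firstIndex(of: semicolon) else { break }
    let cityBytes = chunk[bytePosition..<semiIndex]
    bytePosition = semiIndex + 1
    // End of the line, or end of the chunk for the final entry.
    let newlineIndex = chunk[bytePosition...].firstIndex(of: newline) ?? chunk.endIndex
    let temperatureBytes = chunk[bytePosition..<newlineIndex]
    let temperature = Float(String(decoding: temperatureBytes, as: UTF8.self))!
    results[Data(cityBytes), default: []].append(temperature)
    bytePosition = newlineIndex == chunk.endIndex ? chunk.endIndex : newlineIndex + 1
}
```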

Matt

Yeah.

Ben

But I think our results array can just be those bytes.

Matt

And not a string name. Yeah, absolutely.

Ben

And I'm at the point now where I kind of want to have a type alias for our results. Yeah, for sure. It's equal to Data and Entry, so that I can make these changes without too much issue. Let's see. Results is now... this one is now Results. Okay. So results for city bytes is equal to entry. And then our temperature.
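Roughly, the scanning loop being built up here looks like this sketch — the `Entry` shape and the exact names are hypothetical stand-ins for the episode's code, not the real project:

```swift
import Foundation

// Hypothetical per-city aggregate; the real project tracks something like min/max/sum/count.
struct Entry {
    var min = Double.infinity, max = -Double.infinity, sum = 0.0, count = 0
}
typealias Results = [Data: Entry]

func process(chunk: Data, into results: inout Results) {
    let semicolon = UInt8(ascii: ";"), newline = UInt8(ascii: "\n")
    var bytePosition = chunk.startIndex  // NOT 0 — Data slices keep their parent's indices
    while true {
        // No semicolon left means we've consumed the whole chunk.
        guard let semiIndex = chunk[bytePosition...].firstIndex(of: semicolon) else { break }
        let cityBytes = Data(chunk[bytePosition..<semiIndex])
        bytePosition = semiIndex + 1
        guard let newlineIndex = chunk[bytePosition...].firstIndex(of: newline) else {
            fatalError("expected to always find a newline byte here")
        }
        let temperatureBytes = chunk[bytePosition..<newlineIndex]
        let temperature = Double(String(decoding: temperatureBytes, as: UTF8.self))!
        var entry = results[cityBytes, default: Entry()]
        entry.min = Swift.min(entry.min, temperature)
        entry.max = Swift.max(entry.max, temperature)
        entry.sum += temperature
        entry.count += 1
        results[cityBytes] = entry
        bytePosition = newlineIndex + 1
    }
}
```

The point of keying the dictionary by `Data` rather than `String` is to defer UTF-8 decoding until output time.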

Matt

So we currently rely on converting to a string.

Ben

Yes. And should we continue? Yeah, I think we should continue converting to a string.

Matt

For now, let's just see how that goes. Yeah.

Ben

Because it's the easiest option. And because this will tell us how much the iteration saves us over the split — the scanning versus the splitting. So I will do Float of a String of the data — temperature bytes — with encoding UTF-8. We're going to assume...

Matt

You know what we can do? Can you try... no, no, this is fine. But there is another option for String. There is another initializer for String that allows you to assume that it's already known to be UTF-8 data.

Ben

Yeah, there's a...

Matt

I don't remember what it is, though. I think it's... It's got two arguments.

Ben

I know what you're talking about. UTF-8 string?

Matt

Let me look this up — it's worth it, because it is a good one. Decoding... as. That's the one. Yeah. So then you just pass in the bytes.

Ben

Decoding bytes... sorry: decoding temperature bytes, as UTF8.self. That's exactly it. And then you can get

Matt

rid of the optional. Also... this is going to not work properly if it wasn't actually UTF-8, but we know that it is.

Ben

It is interesting that there's such different APIs and one returns an optional and one doesn't, but they do the same thing.

Matt

Well, this one is... so, this is unsafe: if you pass in data that is not UTF-8, it will just do the wrong thing, whereas the other one is validating that it actually is UTF-8.

Ben

Yeah. It just seems like, if you weren't familiar with this... I mean, I think the thing is, this one operates on collections, whereas the other one operates on just Data, or something. That's true. Yeah, it is. It's weird. It's weird. I think it's good

Matt

that both exist, but the naming of them is not very obvious.
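The difference between the two initializers can be shown in a few lines. One detail worth noting: the decoding initializer doesn't produce garbage on bad input so much as repair it, substituting U+FFFD for invalid byte sequences instead of returning nil:

```swift
import Foundation

let bytes = Data("Hamburg".utf8)

// Validating: checks the bytes are well-formed UTF-8 and returns an Optional.
let validated = String(data: bytes, encoding: .utf8)

// Decoding: non-optional. Invalid sequences are replaced with U+FFFD ("�")
// rather than rejected, so reserve it for known-good UTF-8.
let decoded = String(decoding: bytes, as: UTF8.self)

let bad = Data([0xFF, 0xFE])                         // not valid UTF-8
let rejected = String(data: bad, encoding: .utf8)    // nil
let repaired = String(decoding: bad, as: UTF8.self)  // two replacement characters
```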

Ben

Okay, so what are we doing here? "Missing argument for parameter"... results. This needs to be city bytes. Yeah. Okay. And so now we're not doing any of the string stuff, and we're back to compiling. No, we're not... okay, so this is where we now need to map the... yes, the keys.

Matt

We might want to iterate... at this point, we might want to iterate over the results themselves. Instead of the keys? Well, because if you map the keys, then we lose the ability to index back into the results. See, we're relying on city — in the for loop, city has to be a key into the results structure. Oh, I see what you're saying. Change it? Yeah, right. So just enumerating over the

Ben

results themselves is going to be fine. But I'm going to have to... because I have to sort it, so I've got to have... so I'm going to have to, like... oh, shoot, you're right. Is there a results.mapKeys? I don't think so. There's only mapValues.

Matt

Well, that's fine. I think making another copy — making a copy that sorts it — is still going to be... I think that's still going to be a win.

Ben

So that's going to be cities as results.map, which gives me key and value, I think. Or am I doing reduce here? I'm trying to think.

Matt

What is this map? Also, that map, it looks like you're writing Ruby here.

Ben

Oh, sorry. Ruby, Rust... I've been playing with Odin recently; I'm all over the place. map gives me back an array, which is not what I want. mapValues gives me a dictionary, but that's also not what I want.

Matt

I think we have to make a copy into a string.

Ben

So we could do reduce(into:)... let's just do this: reduce(into:)... and that gives me back the new dictionary, and the key and value. That might be right.

Matt

I think that dictionary literal is going to need a type. Something's going to need a type here.

Ben

Yeah, I think I'm going to say this: pair.0, and that becomes String decoding pair.0... yes, as UTF8.self — our newfound friend.

Matt

Yeah.

Ben

Equals pair.1. And then that works. Okay. Okay. So now I can say for city in keys.sorted()... no, that's cities.keys.sorted().
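The re-keying step they land on can be sketched like this — a simplified `Entry` stands in for the real per-city aggregate:

```swift
import Foundation

typealias Entry = Int  // stand-in for the real per-city aggregate

let results: [Data: Entry] = [
    Data("Oslo".utf8): 1,
    Data("Accra".utf8): 2,
]

// Dictionary has mapValues but no mapKeys, so reduce(into:) re-keys
// the Data keys as decoded Strings in a single extra pass.
let cities = results.reduce(into: [String: Entry]()) { dict, pair in
    dict[String(decoding: pair.0, as: UTF8.self)] = pair.1
}
```

As noted below, this second pass over the results is a drop in the bucket next to scanning the file itself.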

Matt

There we go. Yeah.

Ben

And then this becomes cities. Let me rename that to entries...

Matt

Maybe. Okay. I think this is doing the right thing. Anyway, this is a drop in the bucket compared to processing the whole file, so... yeah. Yeah, I think it's fine to go over this twice.

Ben

Okay. So we'll run again.

Matt

You know, I just thought about this, but I'm pretty sure that the lexicographical sorting of Data and String will be the same in this specific case. So we could actually sort these things — we could make a sort function on the Data, and it would end up being sorted alphabetically.

Ben

But then you still have to print it.

Matt

Yes. But then we don't have to copy and pass over it twice. Anyway — this is not important.

Ben

Okay. We must have crashed. Honestly, not surprised. So let's run it on the small set and see. There we go. I love that. Okay, let me make sure that when I run... because I don't want to run on the huge one. But I guess it doesn't matter, because I'm going to crash it anyway. So I'm just going to run. That's true. Okay, so we broke here — byte position was not advanced after...

Matt

Okay, that's right.

Ben

Okay, byte position plus-equals... no, equals newline index plus one.

Matt

Yeah, that seems right.

Ben

So it was always that second chunk. Still — byte position is still zero. Let me just step.

Matt

Are we crashing on the first?

Ben

No, it's always the second chunk. So this is the first chunk. We've got eight bytes for the city. Then byte position is nine. Okay, byte position is that. We get our temperature, which is negative 39.8. We get our entry. We save it. And then now it's the next one, which we're going to... It's going to go through all of these. So I need to go until we hit that again.

Matt

Why? Hmm. Okay. I don't understand yet, but I'm not surprised we have a problem here.

Ben

Byte position zero. So what I could do is find... let's see, how big is our chunk? Yeah, that's what I'm wondering too. It's like 10 megabytes, right? It's a 10-megabyte chunk, ish.

Matt

But if the chunk is that size, how could we ever be out of bounds here?

Ben

I don't think we're out of bounds. I think this is failing. Oh, wait, no, it's not.

Matt

No, I think our index operation is what's crashing. Can you... is it possible... are we hitting a chunk of zero length?

Ben

Yeah, look at this. It's a chunk... I can't use zero; I've got to use chunk.startIndex. Why? Because it's one big array... sorry, if we go back up here, we have... oh, it's not. It's a sub-data.

Matt

Oh, yes, yes, yes, yes. That's important.

Ben

And that is clear when you use it as a String — you get a String.SubSequence. Right. In another language, they'd call it a string slice: you're just a slice into a window of the data. You could also call that an array slice. But with Data, what is the type of this? It's just Data.

Matt

No, that should cause a copy. That shouldn't be what's happening. So whatever this call is, are you certain that's what's going on?

Ben

Why would chunk start index be that?

Matt

Chunk's start index is a range... yeah, it's an integer. Why is it... oh, its start index is so high. Wait a minute. Why would that be the case? What is its last index... or its end index? Okay, you're right — it's a slice.

Ben

Do you understand why that's the case? No, because I didn't know that Data did that.
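This is the gotcha in miniature: unlike String, whose slices get their own Substring type, `Data`'s `SubSequence` is `Data` itself, and a slice keeps the parent's indices rather than re-indexing from zero:

```swift
import Foundation

let data = Data("Hello, world".utf8)
let slice = data[7...]   // a view into `data`, but still typed as `Data`

// The slice does not re-index from zero: its startIndex is 7.
// Subscripting slice[0] would trap — exactly the crash hit above.
let first = slice[slice.startIndex]
```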

Matt

You know what's happening here? I have a feeling I know what's going on here. I think that when you index into a Data, you get back some sort of sub-data type. Uh oh — I don't have my... let's see what we return. Can you go... can you show me?

Ben

Sometimes I wish this was just faster to do. Like, just give me a playground. I don't use playgrounds

Matt

very much. I got so burned by them, because top-level code does not behave identically to code that appears in function bodies, and so I always get afraid of using them. macOS — if you use macOS,

Ben

it's also way faster. Yes — you're not booting a simulator. All right. So let's... well, let's use this greeting, and let me get the Data from that greeting. And now if I do po data.count... well, it told me already.

Matt

But if you subscript into the data — that's what we're doing.

Ben

So if I say one to the end.

Matt

Yeah. What is the type of object that gets returned there? It's a data.

Ben

Yeah, let me do this. The type here, run. Yeah. It is a... It's a pointer.

Matt

Well, that's internal to this data structure,

Ben

but it's just like, okay. But what is the type of this thing? It's data.

Matt

Like, assign this to a... make a, yeah, make a let of this. What is x here? It's a Data. It's a Data.

Ben

It's a data.

Matt

What's its first, what's its first index? So that's, that's the question.

Ben

I'm going to do this. So we're going to have a MyData, which has, say, as its inner storage, an array of these. Okay. So let's say I did that, and then I added the subscript operator, which took a range — a range of Int. And then now I need to return something, so I could return a MyData here, and then I would do inner... you know, I would have to create a new MyData that had, like, a copy of the bytes. But we don't want to do a copy of the bytes, so I think this is what it's doing. I think you're right, I think you're right — this

Matt

is what it's doing. It's just surprising to me, because normally when collections do that, like...

Ben

String — String doesn't return a String, it returns a Substring. Yeah, so that's why I was expecting to see that. But Data is a really old type, and lots of things operate on Data, so I wonder if they just can't mess with it in the same way that they could change String — because an NSString is not a Swift

Matt

String. Okay. I mean, I buy it. This is... this is why sometimes I think it would be nice to get,

Ben

you know, somebody from the Swift standard library team or something to come in and be like,

Matt

"that is true," or... Well, fundamentally, what we're doing is treating chunk like a Collection, and the answer to that question is: you have to use startIndex. Zero is not defined — that's not how this thing is indexed; you have to operate at that level of abstraction. So we're cheating even by adding one. Technically we need to do chunk's index(after:), or advance the index by whatever. We're cheating everywhere here. Now, I think that would work,
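Matt's point about integer arithmetic being a shortcut can be shown directly — `index(after:)` is the protocol-honest way to advance, and it only coincides with `+ 1` because `Data.Index` happens to be `Int`:

```swift
import Foundation

let slice = Data("abc".utf8)[1...]   // a slice whose startIndex is 1, not 0

// The Collection-abstraction-respecting way to advance an index:
var i = slice.startIndex
i = slice.index(after: i)

// Because Data.Index == Int, plain arithmetic gives the same answer —
// but that's a property of this particular collection, not of collections in general.
let j = slice.startIndex + 1
```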

Ben

but what does advancing by one even do? It still gives you the Int. That's because the index is of

Matt

type Int. Right, that's because the index is of type Int. But more to the point: this would work — our algorithm could now work — if we changed it to a more sophisticated sequence that didn't use an

Ben

integer index. Got it. But in this case it does work. Okay. So... I think that will fix the issue? I think so too. It was a surprising problem. Let's run here. Good. It got faster — it got a lot faster. Okay, stop that. Profile. I kind of wish that crashes could be eliminated from here, because that's...

Matt

I am very surprised that Instruments shows you something different when your process crashes. Yeah... again, I could be holding it wrong — I doubt it, but yes. I suppose we could maybe learn how to do it. Okay. So we ran for 10

Ben

seconds, so we don't have, like, an overall speedup. But we did, for the small one — like, we did notice

Matt

it was like 339 right we don't know how much progress we made in that 10 seconds yeah um

Ben

But this is promising. Absolutely. In fact, it looks about twice as fast. So I think we could run the entire thing. In fact...

Matt

Well, while we're looking at this, why don't you start it off? Kick it off while we're looking at...

Ben

We'll say go big or go home. Yeah, go big. And what did I say? I could just pass in an argument

Matt

or does it have to be input file? It has to be an environment variable. This me thing is really cool. I have never used this before, and I'm interested.

Ben

Input file. Yeah. Equals measurements huge.txt. I think that's how you do it. Go big. That worked. Yeah. Okay. So while it's running... it doesn't look to me like it's going fast enough, though. No, it's going to take a while — like, at least five or six minutes. Yeah. Okay. But while that's running, we can take a look at this and see where we stand in terms of our

Matt

performance. So... but all the time is still in the processing of a line chunk, which is interesting.

Ben

Right, process-new-line... well, I mean, where else would it be? No, no, you're right. That's true. firstIndex(where:)... okay, interesting — that's taking 27 percent. That's interesting.

Matt

data representation this is this is our scan so this is the operation that's scanning

Ben

looking for the next character. Yeah. And UnsafeMutableRawPointer's load(fromByteOffset:as:)... I don't... I mean, you just have this assumption that everything is, like, really optimized, but I also know that it has to be optimized for every possible type of application, right? You know what I mean? So sometimes the standard library algorithms may not be... like, of course they wouldn't optimize for this specific problem. You know what I mean?

Matt

Yeah. And in fact, I think we're seeing right here that we're paying a very steep cost for that abstraction, because look at those dyld stubs. We're spending a lot of time just in the mechanics involved in implementing this stuff. I think that's what this is telling us.

Ben

So maybe we could shave, like, it's not that this would be free, but maybe we could shave some time off of this where its self-weight was 500 milliseconds.

Matt

Well, you know what we can do? You know what I think might help us here is a span.

Ben

A span. I have not used that yet.

Matt

I think that instead of using firstIndex(of:) to scan this data internally — which relies on going through all this abstraction of making the subscript work — I think we can get at the bytes themselves. Okay. And Span will let us do that safely. It's basically... it's going to be the exact same. The exact same API? Algorithm. No, I think our algorithm is the same; all we're changing is how we are... yeah, exactly — how we are getting at the data inside. So you can leave it as Data... no, no — you can leave this as Data. Okay. And what we want to do, before we get started, is — right above this... yeah. The thing we're going to operate on is not the Data; we're going to look at the bytes inside, which we can get at with chunk.span.

Ben

I have never used this before.

Matt

So what this is... before this existed, what we could do is use a withUnsafeMutableBytes wrapper around this Data. And that's a closure, and inside of that we just have raw UInt8s — we just have bytes.

Ben

And a span is similar to that then.

Matt

So a span is basically the exact same thing, except it doesn't require the closure, because the compiler is able to enforce that you can't use that span after chunk is no longer available. The span — the internal bytes — only makes sense as long as chunk is alive, and the compiler can now enforce that. Now, the problem is that our firstIndex(of:) needs to become a loop. That was doing some work for us that we have to give up by using this technique.
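A sketch of the two approaches side by side. The closure-based API is the long-established one; `chunk.span` is assumed available here (it arrives with `Span` in the Swift 6.2-era SDKs), and since `Span` has no `firstIndex(of:)`, the scan becomes a manual loop:

```swift
import Foundation

let chunk = Data("Oslo;10.2\n".utf8)
let semicolon = UInt8(ascii: ";")

// Before Span: a closure bounds the lifetime of the raw buffer.
let viaClosure: Int? = chunk.withUnsafeBytes { buffer in
    buffer.firstIndex(of: semicolon)
}

// With Span: no closure — the compiler enforces that `span` cannot
// outlive `chunk`. Span is not a Collection, so we scan by hand.
let span = chunk.span
var viaSpan: Int? = nil
for i in span.indices where span[i] == semicolon {
    viaSpan = i
    break
}
```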

Ben

The first index needs to be a what?

Matt

I think we're going to have to replace firstIndex(of:) with an inner loop that compares byte by byte.

Ben

Can I do extension span?

Matt

Well, you know, actually, let's play around. I'm not sure. Let's play around and see what span. Is span a sequence?

Ben

Sorry. Hang on. Before you do this, let's just check.

Matt

It said it was, right? Then maybe we don't need to do that.

Ben

Where am I? Go back again? I don't know what happened there.

Matt

Just see if that span we created has a dot first index. Maybe this is easier than I thought.

Ben

It did not.

Matt

It did not.

Ben

Oh, it doesn't.

Matt

Okay.

Ben

Okay, so this span is a Span of UInt8. Come on. So you read this as: non-escapable, copyable, bitwise-copyable, where the Element is non-copyable.

Matt

Yes.

Ben

Okay.

Matt

That's right.

Ben

Okay, it's just a struct — and what extensions does it have? Yeah, it might have a lot of extensions... isEmpty. So it has some of the behavior of collections, but it is not a Collection. Yeah, at least as far as I can see. So Span, at this point, has indices, which I would like to see. I'm going to check back in on this... oh, check it out. Oh, okay — that was a huge improvement. Huge improvement. I am impressed. All right. Yeah — turns out working with raw data and not UTF-8 is... yeah, it's huge. Okay. So here, in our

Matt

case, it wasn't just that. It was also... well, we were doing a lot of extra work there. We're now visiting the bytes exactly one time, whereas before we were visiting all of them once and then visiting them all again for each one of those. We were doing lots of splits.

Ben

Yeah. I want to see what our first index in the span is. So the span now has a correct window into the data. The first chunk, yes. Oh, yeah, yeah. Thank you. Let me go to the next one. I've got to go here... you have to advance. There you go. Do what I mean, not what I say. Okay. Oh, this is even more convenient, too. Yeah. Okay, so we're going to say our byte position is our

Matt

span dot... well, it can be zero now, because... I mean, I think it's fine to work with it.

Ben

Wait, where did startIndex... I thought I looked for startIndex and didn't find it. Oh, I don't know. And we can still use the same thing? Yes — mostly it's the same. The difference is going to be... yeah,

Matt

this is what I'm not sure about — I don't know what this is going to do, exactly. So here's the thing. Okay. So Span doesn't let us grab a range of stuff. Let's see... I wasn't sure how this was

Ben

going to work. So: indices, first, last, lazy... yeah, how do we get a range of these things? So these... I think you can just do span at a given index, and that gives you the byte, right? I don't need...

Matt

Let's see. Oh, you know what we can do? Hang on a second. So what we're doing here now is we're doing this...

Ben

Sorry, let me undo that and go back to chunk here. This, can I do a span here? Yes, you can. Or is that getting rid of what we wanted?

Matt

I don't... well, I think that... So there's actually a lot happening in that line without the span. The first thing we're doing is producing a sub-chunk. And we're offsetting — advancing through. We start off with the whole chunk, and we push an index through it, and then for each one of those, we're looking for a little section. I'm unfamiliar enough with the Span API that I don't know how to do those same two operations, but I think we can do it. And in fact, just looking at this, that is what we need to do: we need to get, like, a sub-span, I think. If I knew this API better, this would be easier.

Ben

Yeah. I think maybe I would like to read up on Span and see what it could potentially offer us. I like the idea of "just get out of my way and give me the raw bytes." And we lose this, but we can rewrite something that does that.
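For that reading-up: Span does appear to offer the "sub-span" operation the scan needs — `extracting(_:)`, from the same proposal that introduced Span (SE-0447). A sketch, with the caveat that this API surface is new and worth verifying:

```swift
import Foundation

let chunk = Data("Oslo;10.2\n".utf8)
let span = chunk.span

// extracting(_:) returns a sub-span over the same underlying storage — no copy.
// Note: unlike a Data slice, the extracted span re-indexes from zero.
let city = span.extracting(0..<4)
```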

Matt

Yeah.

Ben

So maybe that's where we start next time: investigating Span. Yeah, yeah, yeah. Because I like ending on a win, and I certainly think that's a win. Yeah, for sure. Huge improvement. All right. So that's it for this week, and we will see you again next week. Thank you.