Episode #604

Billion Row Challenge - Intro

50 minutes
Published on January 23, 2026
In this episode, I am thrilled to be joined by Matt Massicotte to kick off the "1 Billion Row Challenge" in Swift. The goal is to efficiently parse a massive file containing weather station data and calculate the min, max, and mean temperatures for each city. We set up a Swift package, explore basic file I/O methods, memory mapping, chunked reading, and implement a naive solution that processes the data and sets the stage for significant optimization in upcoming episodes.

Links

Swift Concepts & APIs Discussed:

  • Foundation
  • Data.init(contentsOf:options:) (with .mappedIfSafe)
  • FileHandle
  • FileManager
  • ByteCountFormatter
  • CustomStringConvertible

This episode uses Swift 6.2, Xcode 26.0.

Ben

All right. Hello and welcome to another episode of NSScreencast. Today I'm joined by Matt Massicotte, and we are going to be doing the 1 Billion Row Challenge in Swift. The 1 Billion Row Challenge started out in the Java world, and the idea is that you take these readings from weather stations, so temperature readings, and you have to parse out the city name. The city and the temperature are always separated by a semicolon. Then you have the temperature, which is a floating-point number, could be negative, and it will always have exactly one digit after the decimal point. So even whole numbers will have a .0 at the end, and you'll never have more than one decimal place. All of these details will become super important as we go, because we're going to be optimizing around a lot of them. And the trick is that there's a billion rows, so anything we do is going to take a lot of time. And what's interesting is, most of the time I'm doing iOS development in Swift, and I also do some back-end development, but most of the time we're programming at a high level, using high-level types and things like that. This is a way for us to look at some of the lower-level stuff: how the code we write actually runs on the computer, and which decisions we make could make the program slower. So does that sound interesting? Sounds awesome. Okay, so the point of this, just to be clear, is to explore some parts that we may not touch in our day jobs and learn a lot about Swift in the process. So there's a repo with a whole bunch of stuff in it. It's kind of overwhelming. But basically there is a create-measurements script, which I have right here, and this is going to create the file on my machine. So while we're talking, I'm going to run it, and this is going to create a billion rows. Let's see. Oh, I needed this weather stations CSV. Let me pause and grab that real quick. Okay. So now I can run.
Oh, I'm on the wrong branch now. Let's go here. And okay. So it's saying this file is going to be potentially up to 30 gigs, but around half. So it's probably going to be about 14 gigs. And that's going to take a couple of minutes. And while we're doing that, we can kind of talk about, you know, like, what is our strategy for, you know, for doing this? So, and this, sorry, this weather station CSV is literally just, it's like a real CSV from some place. But the only thing we're getting from it is the cities. Oh, I see.

Matt

So the data on the other side is going to be like random or something?

Ben

Yeah. So it's generating a bunch of random data that matches all these cities. And there's not that many cities, 10,000 or something. So in this file, they don't repeat. But in ours, they're going to repeat a lot. Oh, interesting. Okay. So in the end, we need to calculate the min, the mean, and the max temperature per weather station. Okay. And so we need to keep track of each one of those as we go. Our initial solutions are going to be kind of slow, and it's probably worth talking about why, but maybe we should just do the naive solution and see how long it takes.

Matt

And then go from there. That's what I would be inclined to do as well. We might want to also make a smaller version of the file so we can iterate.

Ben

Yeah. Yeah. Yeah. So when I did the naive version a few days ago, just in preparation for this, I think that my solution didn't finish after 10 minutes. And I've seen other naive solutions that ran in just a few minutes.

Matt

Interesting.

Ben

So I was like, wow, I'm way off. And it's really, I think, kind of useful to figure out, like, what is, how fast could this possibly be, right? Like, because your operating system has to, like, read the file. There's going to be some, you know, depending on the hardware, but on a Mac, you've got, like, an SSD that's got a really fast bridge to the CPU. And the CPU's got its own register and stuff.

Matt

Fundamentally, I feel like we're gated by the I/O speed. Like, we can't go faster than the speed you can read the file, right? Yeah. But also, the file is immutable. I don't know. Say that again?

Ben

Because we have the whole file, we can split it into chunks and read it in parallel. And I think that's something we should do.

Matt

Is that helpful? Like that isn't helpful for us. I don't think that's particularly helpful for a spinning disk, for example, but maybe it's a good idea for an SSD.

Ben

And there are also memory-mapped files. That's another thing I think we can explore. I have done this in C, just as an exploration of the topic. In Swift, the API for it is suspiciously high level. What API did you find? It's an initializer on Data. So you know how you can say Data, contents of URL, or contents of file, or whatever? There's another one that takes options, with a memory-mapped-if-safe option. And I was like, really, that's it? I wonder what that actually does. Because there are a lot of things you can... Let me open this up. It's like MADV_SEQUENTIAL or something. Well, mmap works.
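The high-level API being discussed can be sketched like this. This is a minimal example, not the episode's actual code: the sample file and its contents are made up, and it writes a tiny file first so the snippet is self-contained.

```swift
import Foundation

// Sketch of the high-level memory-mapping API discussed here
// (file name and contents assumed). Write a small sample file,
// then open it with .mappedIfSafe, which asks Foundation to
// mmap the file rather than copying every byte into memory.
let url = FileManager.default.temporaryDirectory
    .appendingPathComponent("sample-measurements.txt")
try! "Hamburg;12.0\nBulawayo;8.9\n".write(to: url, atomically: true, encoding: .utf8)

let data = try! Data(contentsOf: url, options: .mappedIfSafe)

// Touching a byte only faults in the page that contains it, so
// even a huge file doesn't have to be fully resident.
print(data.count)                       // 26
print(data.last == UInt8(ascii: "\n"))  // true
```

The same call against the 15 GB measurements file is what makes the later "read the last byte" experiment cheap.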

Matt

You can call mmap from Swift.

Ben

Yeah, yeah. So we can go to the lower-level one. So it's something like this: madvise is basically telling the operating system, hey, I'm going to read this sequentially. And maybe this is just a Linux thing. Yeah, it looks like it.

Matt

Let's see.

Ben

So yeah, madvise. So basically you're just like...

Matt

No, madvise is an API on macOS.

Ben

Okay. So when we get into memory-mapped files, which isn't going to be today, we'll explore what that means and how that can possibly help. It may not help, because it's such a big file that we still have to read the entire thing.
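The lower-level mmap-plus-madvise route being alluded to might look roughly like this from Swift. This is a sketch under assumptions: the file name is made up, it writes a tiny file so the snippet runs standalone, and it calls the C APIs directly.

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif
import Foundation

// Lower-level sketch (file name assumed): mmap the file ourselves
// so we can hand MADV_SEQUENTIAL to madvise.
let path = NSTemporaryDirectory() + "madvise-demo.txt"
try! "Hamburg;12.0\n".write(toFile: path, atomically: true, encoding: .utf8)

let fd = open(path, O_RDONLY)
precondition(fd >= 0, "open failed")

var info = stat()
precondition(fstat(fd, &info) == 0)
let size = Int(info.st_size)

// Map the whole file read-only into our address space.
let ptr = mmap(nil, size, PROT_READ, MAP_PRIVATE, fd, 0)
precondition(ptr != UnsafeMutableRawPointer(bitPattern: -1), "mmap failed")

// Hint that we'll read the mapping front to back, so the kernel
// can prefetch pages ahead of us.
madvise(ptr, size, MADV_SEQUENTIAL)

let bytes = ptr!.assumingMemoryBound(to: UInt8.self)
print(UnicodeScalar(bytes[0]))  // "H"

munmap(ptr, size)
close(fd)
```

As noted in the conversation, this is the route you'd need if you want the madvise hint, since Data's .mappedIfSafe doesn't expose the underlying descriptor.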

Matt

Well, it doesn't... yes, that's true. But it could help, because what we don't want... it's going to be, what, 15 gigs at the end, something like that? We don't want to have 15 gigs resident, plus all of our other working structures as well.

Ben

Okay. So that took a while. Sorry, I should have done that earlier. So I'm going to create one that has a mere 1 million lines, and we're going to call that test.txt. Can I do this? Or maybe I should just rename this one to measurements-huge, and then we'll run it again with 1 million. Bam. That was faster. Okay, so one idea that I had is, if I do cat on this measurements-huge and I just pipe it to /dev/null, this is going to read in the entire file and discard the results, right? So this is going to process it from disk, to establish a baseline. And I probably should have timed it the first time, but that's pretty fast, right?

Matt

Yeah, that is fast.

Ben

This was surprising to me, how fast this is. Another thing we could do is use word count, and this time I don't want it to go into /dev/null. That's an interesting idea. So wc will actually go through and count words. And obviously, word count was probably invented in the 70s or something; it's probably been optimized as much as it possibly could be. And it doesn't really have to do anything that crazy, but still, this could give us a number to say: there's no way we could possibly be faster than the most bare-bones thing that's just skipping over words.
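The two baselines being run here can be sketched like this. The sample file is made up so the commands are self-contained; the episode runs them against the real 15 GB measurements file.

```shell
# Make a tiny stand-in for the measurements file (hypothetical data).
printf 'Hamburg;12.0\nBulawayo;8.9\n' > sample.txt

# Baseline 1: how fast can we merely pull the bytes in and discard them?
time cat sample.txt > /dev/null

# Baseline 2: how fast can a decades-old, heavily optimized tool
# actually scan through the contents?
time wc sample.txt
```

Anything our Swift program does has to at least pay the cat cost, and it's unlikely to beat wc, so these two numbers bracket what "fast" can mean.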

Matt

Yeah, that's a really interesting idea. So that took 21 seconds. Right. So my intuition would have been that catting the file to /dev/null would be roughly the same amount of time as counting the number of words.

Ben

So cat is just going to be pulling in the bytes and discarding them, right?

Matt

Yes, that's right. And the measurement needs to pull in. I thought that we were going to be dominated by IO, right? So pulling in the bytes should have been the most expensive thing. But I think your test has proven that that's not the case.

Ben

Yeah. So it's good to get some just rough ideas of like, man, we could make this really fast. Okay, so let's create a Swift package. It's, and I'll open it in Xcode. And what I'd like to be able to do is have a, let's see, I need to be careful not to click on this huge file in Xcode. It's probably not going to be happy with me. All right. So let me get the font size a little bigger. Oh, another thing I should mention, there are people who do this challenge who say no third-party libraries. And I think that's fine. I think it's useful, but I think that we may reach a point where I don't know how to do advanced hashing algorithms. So I may permit a library or two for stuff that is just totally out of my wheelhouse. But I will also import Foundation.

Matt

What does third party mean? Not included in the standard library of the language you're using?

Ben

Yeah.

Matt

Yeah. Okay. So Foundation is straddling the line, I think.

Ben

Yes, it is. But I mean, there's no, like, how do you read a file? Okay, the standard library doesn't really have any answer to that, right? Yeah. It's more data structures and core types. Okay. So let's just do something really basic. So we'll grab the path of the file. I want to be able to pass in a path, so I'm going to say input file, and then do... and if that doesn't exist, then we're going to use measurements.txt from the current folder. So that'll be the small one. Yeah. Okay. So then there are many ways we could read a file. Well, maybe first I could say fileExists(atPath:), path, and let's just make sure that this exists. Yeah, that makes sense. Okay. So then I'm going to close that, because I don't want to accidentally click on anything. We're just going to be working in the same file. Okay, so there is obviously Data contents-of, and it will dutifully pull in the entire thing into memory and eat up, you know, many gigs. So there's also FileHandle.

Matt

Yes.

Ben

And there's also buffered, what is it called? Buffered reader or something?

Matt

Oh, I'm not familiar with buffered reader.

Ben

I came across it when I was searching, and I ended up with this FileHandle approach. So we may do some research and see what the pros and cons are. But this one says fileHandle(forReadingAtPath:). Does this return nil, or throw, or something? Yeah, so we'll do a guard here, and then I'll just fatalError: unable to read file. Okay, so the way the file handle reads is you can read up to a count. So we'll figure out some buffer size. I have no idea what's appropriate here. But when you think about it, okay: we're going to take some bit of data, then we're going to start going through the file. I'm trying to think if there's maybe an even more naive way of doing this. Like if we did...

Matt

Well, the most naive way is to do what you suggested, I think, which is just make a new data object with the contents, but that won't work because it's too large.

Ben

Yeah. Um, okay, so I'm just wondering if I'm going too fast into reading bytes, but I think we just have to. Well, so you said that Data has this...

Matt

...mapped-if-safe thing. I'm kind of curious what that does. I have used that API, but it was a very long time ago, and I don't recall it working the way I thought it was

Ben

going to. Let's look at it. Read, contents of URL, file path, mappedIfSafe: a hint indicating the file should be mapped into virtual memory, if possible and safe. So we are not really in control of whether or not we...

Matt

Well, if this works, I think it'll return really fast. And if it doesn't work, it'll return... It won't return. Okay.

Ben

So I'm just... This is a try-bang type of program? Yeah. Yeah, yeah, yeah. Let's see. let data equals that. And then let's print out the... we can even use the formatter, ByteCountFormatter. Oh, I don't know about this. Formatter, then we can do string from data.count, I think. Oh, that's cool. Oh, I need to do string(fromByteCount:). There's a way you can do, like, a measurement API between two different units and figure out how to do math on them and all, but this is just data.count. Okay. And we'll say that the file is that. That needs to be an Int64. Okay, and let's just run it. I typically like to do things in the command line. I often do this when I'm doing command-line type of stuff, but I also have it set up so that we can just set the input file in our environment so we can use it.
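The formatting step just described can be sketched like this. The byte count is a placeholder; the exact output string depends on locale.

```swift
import Foundation

// Sketch of the ByteCountFormatter usage: string(fromByteCount:)
// takes an Int64, which is why data.count needs converting.
let formatter = ByteCountFormatter()
formatter.countStyle = .file
let pretty = formatter.string(fromByteCount: Int64(15_000_000))
print("File is \(pretty)")  // something like "File is 15 MB", locale-dependent
```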

Matt

Oh yeah. No, I'm fine with command line.

Ben

Okay. So, handle was never used; that's fine. Also, we're going to be running in release mode a lot. Yeah. Because we want to get the fastest possible thing. So I'm going to actually do that. And also, because we're going to do this a lot, I'm going to use a tool that I like called mise. Are you familiar with mise?

Matt

I have heard it talked about a lot, but I've never used it myself.

Ben

So it's a tool dependency manager as well as a task runner. So I can say that I depend on, like, xcbeautify and SwiftLint and swift-format or whatever, and it will install the right versions and make sure that I have all of those. And this is something you check in, so that your whole team has the same exact versions of things. Oh, that's cool. Okay. And then you can say tasks.build, let's say, and this is going to be a swift build. And then I can say tasks.build-release, and that's going to be swift build -c release. I could say tasks.go, maybe, and this is going to depend on the build-release task. So it's got dependency tracking. And when it runs, it's going to run time on the release binary in .build/release.

Matt

Yeah, I see what's going on here. This is cool. And then you do, like, mise go.

Ben

So if you do mise r, it will say, give me a task to run. And I think that these are supposed to be tasks, not task. And the other cool thing about this is, once these expand beyond a one- or two-line command, you can actually just eject it into a file and it'll still pick it up as a mise task. Okay, cool. So it's pretty nice. Okay, so if I do that mise r with no parameters, then I get to pick, or I can say mise r go and it will do it. Okay, so that file is 15 megabytes. Okay. I am going to try it with the other one, measurements-huge. If Zoom crashes, we'll know.
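A mise.toml along the lines of what's being set up might look like this. Task names, the tool list, and the binary name are all assumptions, not the episode's actual file.

```toml
# mise.toml — sketch of the tasks described above (names assumed)
[tools]
swiftlint = "latest"
swiftformat = "latest"

[tasks.build]
run = "swift build"

[tasks.build-release]
run = "swift build -c release"

[tasks.go]
depends = ["build-release"]
run = "time .build/release/<binary-name>"
```

Checking this file in is what keeps everyone on the team running the same tool versions.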

Matt

I think you'll be able to tell really quick.

Ben

Yep. So it either didn't work, or it's not really doing anything intense. I'm pretty sure it worked. It says mappedIfSafe... oh yeah, I see what you're saying. So if it was safe, it did it. And that means that we have a memory-mapped file: we can now read a point in memory, and it will go grab that part of the file. Okay. I think so. Now here, let's do

Matt

it. I think this was actually a really good experiment. But what I'm curious about

Ben

is reading the last byte and seeing if that's fast or not. So, data.last, right? .last? That's fine. Yeah, yeah, that's fine.

Matt

Okay, okay. Yeah, so pretty neat. I think this has proven to us that it can map the file, and also that it's cheap for us to do seeking in the file, too. Yeah. And this is where I was saying that

Ben

madvise with MADV_SEQUENTIAL is an optimization hint you can give to the kernel to say, hey, when I read this byte, you can probably fetch the next few as well, or not this byte, but this page. Okay, let's stick that in our back pocket. Now, in this case, this is interesting,

Matt

because in this case, I don't know that you can get out the file handle that Data is using internally. So I think that if we wanted to do some madvise trickery, we can't use Data anymore.

Ben

Right. Yep. And I think we'll explore that, because I'm going to do this FileHandle one next. Okay. So I'm just going to get rid of that for the moment. And now I'm going to talk about reading the file in. I don't need that either. Actually, no, I can do this. We can grab... does this show me the file size? availableData.count?

Matt

Oh yeah, I wonder what that does.

Ben

FileHandle... I think this just opens a file, seeks to the end, gets the offset. I think it's a fast operation.

Matt

Yeah, okay. Oh, and there's the file descriptor. That'll be useful, maybe.

Ben

Where was I missing that? Wasn't there a handle.availableData? There was. Okay. Okay.

Matt

Bytes. Now, availableData, the type of that is a capital-D Data, I think. So I wonder what that's going to do. Oh, sorry. I think it's going to read.

Ben

Oh, that's not it. Okay. "The available data in the receiver", that's not what I want. Bytes? No, no, no. That's not what we want. Oh, that's interesting, too. That is interesting. That's fascinating, actually. Okay. How about FileManager.default.attributesOfItem(atPath:)? We'll get the attrs, and then we'll say let size is attrs, .size. And what type is this? Is it Any? Let's try. Still is Any. So I think maybe I can do that?

Matt

I bet you can just force cast this. Yeah. It's probably an NS number internally, but, I think that'll still work.
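The file-size lookup being fumbled toward here might look like this. The path is made up and the snippet writes its own file so it runs standalone; the attributes dictionary is [FileAttributeKey: Any], hence the force cast being suggested.

```swift
import Foundation

// Sketch: look up a file's size via FileManager. The attributes
// dictionary is untyped, so the size comes back as Any (an NSNumber
// underneath) and gets cast to Int.
let path = NSTemporaryDirectory() + "size-demo.txt"
try! "Hamburg;12.0\n".write(toFile: path, atomically: true, encoding: .utf8)

let attrs = try! FileManager.default.attributesOfItem(atPath: path)
let size = attrs[.size] as! Int
print(size)  // 13: the byte count of the sample line
```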

Ben

Okay. Build, and we'll do the small one. Okay, works. Okay, so I mentioned that there's a read(upToCount:), which throws and returns an optional Data. So we need some sort of buffer size that we're going to read, and I don't have a good answer here. One thought is we could have something really small that would fit in a register, or we could just take it in a manageable chunk, so we could say, okay, let's say we've got a 50-megabyte chunk of data. And then later, once we have these different chunks, we can pass them off to all the cores to work in parallel. Yep, that makes sense to me. Okay. So this buffer size, I'll just start at 10 megabytes and we can tweak it if we need to. So I'm going to read up to that number of bytes. Guard let chunk equals... no, I'm going to say while let chunk.

Matt

Read up to count. So will this give us always the... This will give us...

Ben

The file handle will keep track of where we are.

Matt

Okay, okay, okay. Yeah.

Ben

And we may get back less than 10 megabytes, right? Because the file we're using, the small one, is only 15 megabytes. So we're going to have two chunks. And I'm just going to say processChunk, and we'll grab the chunk. There. We'll make a func processChunk. Chunk is...

Matt

This may need to be static as well.

Ben

And you know, I was just thinking, we could make this throws, and then I could get rid of this. Oh, yeah, that's a good idea. Okay. And this is almost certainly going to throw. Well, I don't know. Maybe it will. We'll figure that out.

Matt

We'll find some error case, I'm sure.

Ben

So here's the issue now. This data may not be aligned to a Unicode character. That's another important point. If we look at the, like there are Unicode characters in the data set.

Matt

Yeah, I mean, this is a parsing problem.

Ben

Yeah. So, you know, I'm going to do probably a poor job of explaining Unicode. But if it was all ASCII, this would be easy, because every character would be one byte, and we're already aligned to a byte because we read a whole number of bytes. But Unicode can be an arbitrary number of bytes per character, and we don't know if we just sliced off in the middle of a Unicode character. And so what we can do...

Matt

I think we have two problems. So one is we have this unit of character, but then we also have another higher level problem, which is the unit of like parsable token inside of this list. So even if it was ASCII, I think we would still have an issue here because we could chop one of these lines in half.

Ben

That's true. Yes, we could chop one of these lines in half. And that's where I think what we should do next is, in processChunk, I'm going to have a find-the-last-newline in some data, and that's going to return to me the index of the last newline in that data. So we can start from the end and just walk backwards until we find a newline. This is operating under the giant assumption that a newline is only one byte. And let's see: newline is one byte and is not present in any... not ASCII, Unicode sequence. So it's never a byte in the middle of a multi-byte sequence.

Matt

I believe for UTF-8 that that's the case.

Ben

Yeah. UTF-8 sequence. Okay. So we could do something like data.reversed().first(where:).

Matt

Yeah.

Ben

And that's going to give us a byte, and then I want to see if the byte is equal to that. So I think it would be useful for us to have an extension on, let's see, UInt8, to give us the byte for newline. So I could do, and I don't know what that is off the top of my head, newline ASCII... and I'm searching ASCII because it's simpler, but it's 0x0A. ASCII and Unicode are... Unicode is overlaid on top of ASCII, so everything that is ASCII is still the same code in Unicode. So if we do 0x0A... that needs to be var... no, static let. Yeah. Okay. So now we have that, and we're going to return where the byte is equal to UInt8.newline. And there should be one in here; if there's not, we have a big problem. So I'll say guard let. Oh, here's the other thing: I need this to be enumerated, then reversed. Oh, because we need the index. Yeah. And is it offset-element or element-offset? Offset, element. Okay. Okay. So then this is going to say guard let, and we don't care about the byte, and we'll say else fatalError: didn't find newline in data. Return offset. Now, I don't want to derail us too much,
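The helper being assembled here might look like this as a sketch. The sample data is hypothetical; the key invariant is the one just discussed, that a 0x0A byte can never appear inside a multi-byte UTF-8 character.

```swift
import Foundation

extension UInt8 {
    // 0x0A is '\n'. In UTF-8, continuation bytes always have the
    // high bit set, so a newline byte can never appear inside a
    // multi-byte character; scanning raw bytes for it is safe.
    static let newline: UInt8 = 0x0A
}

// Walk backwards from the end of the chunk to the last newline.
func findLastNewline(in data: Data) -> Int {
    guard let (offset, _) = data.enumerated().reversed()
        .first(where: { $0.element == .newline }) else {
        fatalError("didn't find newline in data")
    }
    return offset
}

// A chunk whose final line got chopped mid-city-name.
let chunk = Data("Hamburg;12.0\nBulawayo;8.9\nPalemb".utf8)
print(findLastNewline(in: chunk))  // 25: the '\n' after "8.9"
```

As the conversation notes next, these offsets are relative to the chunk, not to the whole file.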

Matt

but I think it would be possible for us to construct a chunk size. Our chunk size is very big. But I think it would be possible either for a very big input or for a very small chunk size where we could have only a portion of a line. In this case, I think the city names are like reasonable lengths.

Ben

Oh, yeah, I see what you're saying. So like what if we had one that was like 10 million characters long? Yeah, this could fail.

Matt

Yeah. But I don't think that will actually happen in our case.

Ben

Yeah, and we'll have to take some of those assumptions that the, you know, the readme does promise a handful of things. I think the max length of the city name is actually in there as well.

Matt

Oh, okay. Yeah, we could use that. I wonder if that's exploitable somehow.

Ben

Okay, so this seems okay. And so I'm just going to print out the last new line in chunk just to see if that... To find last new line in chunk. And we'll see if that shows us anything. Okay, so we had two chunks and it found the position in the data of that last new line.

Matt

Okay, interesting.

Ben

Okay, so once we have that new line, I really think that...

Matt

Wait, sorry, just one quick question. No, no, never mind. Please continue. Yep. What I was getting hung up on was, and I know enumerated is a little strange, and I was wondering if those offsets are relative to the chunk or relative to the total size of the file.

Ben

Oh, this is a fantastic point. Data.

Matt

So I don't think so. I think these datas are independent. So those offsets are relative to.

Ben

Yeah, they have to be because they didn't increase, right?

Matt

Yeah.

Ben

And I think it's okay as long as we know that because we're going to be processing from the chunk beginning.

Matt

We will need to know that. Yeah, a chunk needs more than just a data object. That's true.

Ben

Okay. So I'm going to call this process aligned chunk. and then I'm going to move this into here so that we get the aligned amount. What I was thinking is that we'd call process aligned chunk and so the chunk would already be aligned and then we'd have some like leftover data.

Matt

So the term, the only thing, this is correct. The one thing I'm hesitating on is the aligned, the terminology of aligned is like, because we're talking about very low level things and this might actually matter, the actual memory alignment, like how we're aligned to pages. So that terminology.

Ben

So should I call it newline-aligned?

Matt

Yeah, I guess so.

Ben

And then, okay, so that's going to give us the last new line index. And so this is going to be my chunk from the start of it up to, but not including the new line. No, but including the new line.

Matt

Well, we've fixed the end, but we have to look at the beginning as well.

Ben

Say that again. No, so what I'm saying is once I have that, I'm going to have some like remainder and I'm just going to spill that over into the next chunk maybe.

Matt

Yes, yes, yes. So that's absolutely, so I think we need to do both ends of this chunk because the chunk isn't necessarily starting on a line and it's not finishing on a line. We've backed up the end, but I think we need to do the same thing for the beginning.

Ben

If we're starting from... Let me think. If we're starting... Oh, no, you're right.

Matt

If we're doing it sequentially, then we're always starting on a line.

Ben

Yeah. Yes. Because if we do the other, then I'm not sure who's going to be the one to get the new line. Is it going to be the previous chunk or the next chunk?

Matt

Yeah, yeah, yeah. The first one starts. So yes, as long as we do this in order, then this will work fine.

Ben

Okay. So remainderBytes is going to be some sort of Data... something like that. And then remainderBytes is now equal to the chunk from the last newline index plus one. And we need to be careful not to walk over the end of it. Yeah. Like that. So I have the size, so I can say if the last newline index plus one is greater than the size, then we break. Mm-hmm. So then we assign remainderBytes to whatever was remaining, and then on the next read, we need to add the remainderBytes to the beginning of this one. That's right. This seems like it's going to be very slow.
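The whole spill-over loop being described can be sketched like this. Names, the tiny demo file, and the deliberately small buffer size are all assumptions, chosen so the carry-over path actually runs.

```swift
import Foundation

// Sketch of the chunk-and-spill loop: read fixed-size chunks, cut
// each at its last newline, carry the tail into the next read.
func readNewlineAlignedChunks(
    from handle: FileHandle,
    bufferSize: Int,
    process: (Data) -> Void
) throws {
    var remainder = Data()
    while let chunk = try handle.read(upToCount: bufferSize) {
        var data = remainder + chunk             // prepend last read's tail
        guard let lastNewline = data.lastIndex(of: UInt8(ascii: "\n")) else {
            remainder = data                     // no complete line yet
            continue
        }
        remainder = Data(data[data.index(after: lastNewline)...])
        data = data[...lastNewline]              // ends exactly on a newline
        process(data)
    }
    precondition(remainder.isEmpty, "input should end with a newline")
}

// Tiny buffer forces the carry-over path on made-up data.
let path = NSTemporaryDirectory() + "chunks-demo.txt"
try! "a;1.0\nbb;2.0\nccc;3.0\n".write(toFile: path, atomically: true, encoding: .utf8)
var rebuilt = Data()
try! readNewlineAlignedChunks(from: FileHandle(forReadingAtPath: path)!, bufferSize: 8) {
    rebuilt.append($0)
}
print(rebuilt.count)  // 21: every byte came through, in order
```

As discussed, this only works because the chunks are processed sequentially: each chunk is guaranteed to start right after a newline.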

Matt

Yeah, I mean, I think there are probably much smarter ways to do it, but this is straightforward. I like that. Yeah.

Ben

Okay. Let's see if we got newline-aligned chunks here. So I'm going to say... well, no, I mean, I think we're good there. What I want to do now is just print out the data and see if it looks reasonable. Okay. So we can do the naive thing and just print it out and hope it doesn't crash. Okay, so that seems reasonable. And I don't see anything...

Matt

We're looking for chopped chunks. Yeah, that'll help.

Ben

There's still a million lines in here.

Matt

Yeah, there's a lot. I wonder if there's a way that we could do this.

Ben

Even smaller?

Matt

Well, I'm not sure we could do it even smaller, but I was wondering if there's a way that we could do it more, like a more rigorous test. I feel pretty good about it right now. I'm not very worried about this. It doesn't look like anything is going wrong.

Ben

Yeah. Did I scroll past my buffer? I did.

Matt

Yeah. If we got this wrong, the way I'd imagine it would happen is that once it went wrong, it would progressively get worse and worse. So if the end is right, I think we've done a pretty good job.

Ben

Yeah. What I can do is grep for Chunk and grab, like, five lines on either side.

Matt

Yeah, yeah, yeah. That was smart.

Ben

So what is this?

Matt

Oh, yeah, there's only two. Oh, that's interesting.

Ben

And what is... we've got an extra newline here.

Matt

Is that what Grep is doing?

Ben

Oh, yes, almost certainly. This is, sorry, this is the context separator from grep. So it's doing five before, five after.

Matt

Yeah.

Ben

And before, let's see. So this one was the start of it. So we had one, two, three, four, five lines after. Then one, two, three, four, five lines before the chunk. So before the chunk, we did have an extra new line.

Matt

Right.

Ben

I'm wondering if it's because I didn't advance the new line index or I need to go...

Matt

That seems right. You found the new line index, but then... No, yeah, that does seem right, actually.

Ben

Because on the remainder, we would have started here. I mean, I think it's probably... let me... I don't know why that happened. Oh, well, it could have been...

Matt

Hmm, why did that happen?

Ben

Is it this? It's not... oh, the print... print prints a newline. It prints a newline. Oh, but this one does also. Yeah, yeah. So we need this to have a terminator of nothing, so that we're just printing the data itself. Yeah. So now if we do it... Okay. Perfect.

Matt

There we go.

Ben

Okay. So now that we have that, we can take... I feel like, before we finish, I'd like to get to: let's sum the numbers and see, because this is going to be pretty bad. But now I have a big string, you know what I mean? And we can do bigString.split on newline, right? And this is obviously the naive approach, but at least we get to a working example and then we move on from there. So that's going to give us our lines, and then for the lines, say we map each line, and line.split on separator semicolon. And is there a maxSplits? There is. Separator, maxSplits. So we'll split on that, and we're going to say maxSplits is one. So we should now get parts, and we would assert here that parts.count is equal to 2. And then the left side is our city, parts[0], and the right side is our temperature, and that needs to be parsed as a floating-point number. So what happens? Is that a try, or what is this? Or optional?
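The naive split-based parse just described, run on a couple of made-up lines:

```swift
// Naive parse: split the chunk's string on newlines, then each
// line on the first semicolon. Sample data is hypothetical.
let bigString = "Hamburg;12.0\nKuala Lumpur;-3.4"

for line in bigString.split(separator: "\n") {
    // maxSplits: 1, so a line always yields exactly two parts.
    let parts = line.split(separator: ";", maxSplits: 1)
    assert(parts.count == 2)
    let city = String(parts[0])
    let temp = Double(parts[1])!  // the input format guarantees a parseable number
    print(city, temp)
}
```

The force-unwrap on the Double is exactly the "I will just force unwrap that" from the conversation: the challenge's format rules make a parse failure impossible.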

Matt

Optional, I bet. Yeah.

Ben

Okay. I will just force unwrap that. And this is where we're going to have to accumulate these into some sort of results dictionary. So we'll do string to Entry. An entry will have a min, a max, and a mean. And in order to figure out the mean, we need to be done, so we need the sum and the count.

Matt

Ah, right.

Ben

So let's create the dictionary here: results equals string to Entry. And then this one is going to need to be a var. So even more things that'll blow up if we do this in parallel.

Matt

Yeah.

Ben

You know, it needs to be here. Okay. Um, this is really a for line in lines; I'm not mapping anything, really. And then here I'll say results for city. So we'll need to check if there is one: if results.keys.contains(city), then... let's see, how do I want to do that? Actually, I'll do it like this: var entry equals results[city] or... yeah, that'll work. And then it'll be all zeros. No, not all zeros, that's a trap. This needs to be... Float.max? What's it called... greatestFiniteMagnitude.

Matt

That's it. That's the one I was trying to remember.

Ben

leastNormalMagnitude... sum is zero, count is zero. Yeah. And for those following along at home, that's because our numbers can be negative, so we don't want to start with zero, because we may never get a zero. Maybe the hottest temperature in that city is negative. Okay, that needs to be converted to a string. And now that we have our entry, we can say count plus equals one. entry.min is the min of entry.min and the temp that we just got. Max is max and max. Sum plus equals temp. And then we keep going. Okay, we can store it now: results[city] equals entry. Okay. Lots of opportunities to make this faster. Okay. So we go through each chunk, and we process the chunk into that results dictionary. Now, I think what we need to do is sort the results at the end. It looks sorted, alphabetically ordered. Yep. Yeah, here it is, okay: min, max, and mean, alphabetically ordered, like so. Okay, so we're going to get the keys... well, I can call it cities; that's going to be results.keys.sorted(). And then for city in cities, we're going to print out...
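Pulling the entry bookkeeping together, a sketch of the accumulator (names and sample data assumed; note the sentinels: min starts at the largest value so any real reading replaces it, and max starts at the most negative for the same reason):

```swift
import Foundation

// Per-city accumulator: track min, max, and the sum/count needed
// to compute the mean once all readings are in.
struct Entry {
    var min = Double.greatestFiniteMagnitude
    var max = -Double.greatestFiniteMagnitude
    var sum = 0.0
    var count = 0

    mutating func add(_ temp: Double) {
        count += 1
        min = Swift.min(min, temp)
        max = Swift.max(max, temp)
        sum += temp
    }
}

extension Entry: CustomStringConvertible {
    // min/mean/max, each formatted to one decimal place.
    var description: String {
        String(format: "%.1f/%.1f/%.1f", min, sum / Double(count), max)
    }
}

var results = [String: Entry]()
for (city, temp) in [("Hamburg", 12.0), ("Hamburg", -3.4)] {
    results[city, default: Entry()].add(temp)
}
for city in results.keys.sorted() {
    print("\(city)=\(results[city]!)")  // Hamburg=-3.4/4.3/12.0
}
```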

Matt

Can you show me the output again? It looked structured to me. Is it supposed to produce some sort of JSON structure? Doesn't look like it. I think this

Ben

is just a dictionary being printed. Oh, I see. Yeah, okay. Or something. Yeah. And so the numbers are like that, the cities like that. So I could say the city equals... and then let's do an extension on Entry and have it be CustomStringConvertible, maybe, and we can do description and return the... let's see, we need the mean: it's going to be sum over count. And this is going to be... oh, I don't remember how to do the formatting. Is it like this? Yeah, so percent-dot-one-f, slash, percent-dot-one-f. Oh, yes: min, mean, max. Okay, so then that's going to be city equals, goodness, entry, results for city. Yes, there you go. And that's optional, but it definitely exists, because we proved it exists here. Okay, let's run it. Yeah, give that a spin. Okay. Oh, is this correct? I don't know. It could be. It looks like it might be. I can actually search for this in the weather stations data. Copy that, and go here, and find this. Yeah.

Matt

Well, our rendering is a little weird. Oh, no, no, it's not. No, no. It's the, it's the, yeah.

Ben

Let's find these. Yeah.

Matt

Okay. So we did the right thing. The numbers certainly look believable.

Ben

Yeah. Okay. So it printed it out and it did a million rows in 2.9 seconds. We'll give this a go. Now these chunks are going to affect the time a little bit like the printing,

Matt

but yeah,

Ben

not that much. And we're not printing every line. Right. I should probably have done like a thing that prints, like how many chunks do we have to go? Cause we did 10 megabytes, right?

Matt

Well, it's a thousand times the size. So if we had two chunks, We now need to have 2000 prints, I think. Something like that.

Ben

Yeah, we're going to be here a while. So I think this is actually a really good place to end it. I, you know, if it ends while we're talking, we'll see the time. Otherwise, I will just fast forward to the end to show everybody the time. But I feel like my mind is already like kind of spinning with all the things that I did that were really inefficient.

Matt

Yeah, but we got so far. This is the thing, right? We've solved the problem. It's true, we have a lot of opportunity for optimization, but even just getting a handle on what tools we can use... and we learned a lot about some APIs that are useful as well. So that was a pretty productive session, I think.

Ben

Yeah, I'm going to let this run. I think it's going to be quite a while. Okay. So next time, we will profile this so that we can see exactly where the time is being spent, and then we'll make a plan for what's the next thing to tackle.

Matt

Yeah. Sounds awesome.

All right. Yeah. Thanks for joining me and we'll see you in the next one. Okay, great. Thank you for having me.


Source Code

View on GitHub Download Source