Episode #590

Audio Fundamentals

Series: Learn AudioKit

17 minutes
Published on March 28, 2025
We kick off a new series on audio programming for iOS using AudioKit. Before we jump into code, however, it is important to get a foundational understanding of how audio works and how it is represented by the audio hardware (sound cards, etc.) that the software integrates with. In this episode we will talk about the fundamentals of audio, learn some essential terminology, and look at some real-world audio equipment so we can better understand the mental model behind AudioKit.

Hello, and welcome to NSScreencast. My name is Ben, and in this series we're going to learn all about audio programming for Apple platforms. We're going to learn about the fundamentals of audio, some essential terminology, and how to think about it. Then we're going to get our hands dirty with AudioKit, a very popular open source framework for doing digital signal processing, audio manipulation, and a bunch of other fun things related to audio programming for iOS. So whether you're building a music app or just trying to have some fun, this series will give you the tools and techniques to get started.

Before we jump into the programming part, let's talk a little bit about what audio actually is. Audio is waves of sound, vibrations that hit your eardrums, wiggle them, and cause you to perceive the sound. These waves have an amplitude, which is the height of the wave, and that determines the volume. They also have a frequency, which is how many times the wave oscillates in a given second, and that determines the pitch of the sound. A higher frequency means a higher pitched sound. So at the low end we have lower sounds, like bass or rumble, and at the high end we have high pitched sounds.

The range of human hearing runs from around 20 hertz at the low end, which is very deep bass, up to 20 kilohertz at the high end, which is extremely high pitched. I personally can't hear that high; my hearing probably stops around 15k, so I don't have the greatest hearing, but people with healthier hearing might be able to hear something that high pitched. That's the range of values, and note that it's logarithmic, so most of the useful information sits way down at the lower end and much less at the higher end.

To give you some intuition about sounds you might recognize and where their frequencies sit, we can take a look at this example. On the left we have a subwoofer. In electronic dance music, hip hop, anything with deep bass, or the kick drum on a drum kit, these sit on the lower side, around 60 to 80 hertz, and produce a thumping sound, a low rumble. In the 250 hertz to 2 kilohertz range we have most instruments and vocals. An important one to remember is 440 hertz, the standard for concert pitch tuning: on a piano, 440 hertz is the A above middle C. You might see this if you ever open up a tuning app; 440 is just a common number to remember, and it sits roughly in the middle here. Then at the upper end of what I would consider the common frequencies, from about 2 kilohertz to 6 kilohertz, you have higher pitched content like the hi-hats or cymbals on a drum kit. Above that there are still plenty of frequencies, but people typically describe them as breath, air, sparkle, or sizzle. They add to the sound and make it feel more open and lively, but too much can make the sound harsh. A lot of the time I'm coming at this from a music mixing perspective, because I'm in a band, but this gives you some sense of where these frequencies lie and what the numbers mean.
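To make the relationship between a note and its frequency concrete, here is a small Swift sketch. It isn't from the episode and isn't AudioKit API; the function name and constants are just illustrative of the A = 440 Hz convention and the rough 20 Hz to 20 kHz hearing range discussed above.

```swift
import Foundation

// A4, the A above middle C (MIDI note 69), is tuned to 440 Hz by convention.
// Each semitone up multiplies the frequency by the twelfth root of two.
func frequency(forMIDINote note: Int, concertA: Double = 440.0) -> Double {
    concertA * pow(2.0, Double(note - 69) / 12.0)
}

// Roughly the range of human hearing discussed above: 20 Hz to 20 kHz.
let humanHearingRange = 20.0...20_000.0

print(frequency(forMIDINote: 69))   // 440.0    (A above middle C)
print(frequency(forMIDINote: 60))   // ~261.63  (middle C)
print(frequency(forMIDINote: 21))   // 27.5     (lowest A on a piano)
print(humanHearingRange.contains(frequency(forMIDINote: 21)))   // true
```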
Now, I mentioned that all of these sources produce waves that hit your eardrums, or hit a microphone and make it wiggle. But the waves don't always arrive as pure sine waves. In fact, all of them arrive at the same time, so they get summed together, and we end up with a much more wiggly, oscillating graph. If you're an audio sound card and a microphone is plugged in, that microphone is wiggling: the diaphragm inside it is moving and producing small voltages, and the sound card reads those as samples. Depending on the format or configuration of the sound device, the samples can be in any number of formats, usually some sort of integer or floating point number. In this case we're going to use floating point numbers in the range of negative 1 to 1, with 1 being the highest value and negative 1 the lowest. As the wave moves through its peaks and valleys, the hardware samples the value, the voltage, at each specific moment in time. So in this case we get 0.8, 0.4, 0.2, or negative 0.2. These values are called samples, and each one captures a specific point in time.

An important question is how many samples we're going to listen for or play back on the audio hardware. That's called the sample rate: how many samples per second we play back or record audio at. Some common ones you'll see are 44.1 kilohertz, which is CD quality sound, and 48 kilohertz, which is a bit higher end. It doesn't seem like much more than 44.1, but it gives you some extra headroom at the top of the frequency range. You can imagine that the more samples you have, the smoother the line will be; there will be fewer straight segments connecting one sample point to the next, and the more accurately we can reproduce the curve, the better. Above 48 kilohertz there's also 96 kilohertz, and even higher than that, but it's a trade-off with diminishing returns. The sample rate also impacts how much memory, disk storage, and processing power you need, because all of those samples get packed into a single second's worth of audio.
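As a minimal sketch of what those sample values look like in code (illustrative names only, not AudioKit API), here is one second of a 440 Hz sine wave generated as Float samples in the negative 1 to 1 range at a 44.1 kHz sample rate.

```swift
import Foundation

let sampleRate = 44_100.0          // samples per second (CD quality)
let toneFrequency = 440.0          // pitch, in hertz (the A above middle C)
let amplitude: Float = 0.8         // volume, as a fraction of full scale

let frameCount = Int(sampleRate)   // one second's worth of samples
var samples = [Float](repeating: 0, count: frameCount)

for n in 0..<frameCount {
    let time = Double(n) / sampleRate   // time of this sample, in seconds
    samples[n] = amplitude * Float(sin(2.0 * .pi * toneFrequency * time))
}
// The 44,100 values in `samples` trace out 440 full cycles of the wave,
// each value sitting somewhere between -0.8 and 0.8.
```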
The next thing I want to talk about is how audio is unforgiving. Contrast this with video programming, or video in general: video is made up of a bunch of pictures played in sequence. In the case of a video game, say we have 10 frames of a running animation for Samus from the game Super Metroid, and you can play it back at 10 frames per second, which is the top one, or 20 frames per second, which is the bottom one. It doesn't really matter which one you choose. The bottom one looks a little smoother, but it also looks like she's running faster, and our brains are OK with either. The top one just looks like a slower running animation; it doesn't look super choppy. If you played a game and this was the running animation, it would be fine. Your brain is happy to fill in the gaps, interpolating between the frames.

Contrast that with audio. If we're responsible for playing back audio, we're sending samples to an audio buffer, and the sound card takes those values and sends voltages out to move the speaker back and forth. If we do something that blocks the thread we're using and miss an opportunity to fill up our sound buffer, that buffer gets filled with either junk or all zeros, which is bad. The listener perceives this as clicks, pops, and crackles, which sound awful, and it's something our ears actively reject; people don't want to listen to music with that sort of artifact in it. So audio is unforgiving in that way, and it's really important that audio programming is done at a level where you can control memory allocations and deallocations. Typically, you allocate all the memory you're going to use once, ahead of time, and then keep reading from or writing to that buffer. In some cases we'll write to that buffer in windows, then wrap around and start writing at the beginning of the buffer again. That way we're always using the same block of memory instead of continuously allocating and deallocating, and we get a consistent set of performance characteristics. For that reason, a lot of audio frameworks are written at a low level, in languages like C or C++.
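Here is a minimal sketch of that allocate-once, wrap-around idea: a preallocated ring buffer for Float samples. The type and method names are made up for illustration; a real implementation would also need to be real-time safe and handle concurrent reads and writes.

```swift
struct RingBuffer {
    private var storage: [Float]
    private var writeIndex = 0

    init(capacity: Int) {
        // All memory is allocated once, up front.
        storage = [Float](repeating: 0, count: capacity)
    }

    mutating func write(_ samples: [Float]) {
        for sample in samples {
            storage[writeIndex] = sample
            writeIndex = (writeIndex + 1) % storage.count   // wrap around to the beginning
        }
    }
}

var buffer = RingBuffer(capacity: 44_100)    // one second of audio at 44.1 kHz
buffer.write([0.8, 0.4, 0.2, -0.2])          // reuses the same memory on every write
```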
Now I want to move on to some real world analogs, comparisons that will help the APIs of AudioKit and some other audio frameworks make sense. The first is a mixer. If you've never seen a mixer before, it can look really complicated: a bunch of knobs and buttons. But at its most basic level, if we look just at channel 2 here, which is highlighted, at the top we have an input where we can plug in a microphone or an instrument. We can ignore all the knobs and go all the way down to the bottom, where we have a fader, and the fader controls the volume of that instrument. Those are the essentials of a mixer. Each of these vertical channels works the same way: you plug in all your audio sources and move the faders up to mix the audio. On the right-hand side there's a master fader, which controls the overall volume of the entire mix. We'll see later on that AudioKit has a mixer, and we can use it in exactly the same way.

The next piece of physical hardware I want to talk about is a synthesizer. Synthesizers come in all shapes, sizes, and forms. You can get modular ones; they can take up entire rooms. It's an entire world. A synthesizer produces sound and then manipulates that sound in really interesting ways. Its fundamental building blocks are oscillators, and an oscillator produces a stream of values for a sound. We can have a sine wave oscillator, which produces a perfect sine wave. You can switch that to a square wave oscillator, in which case, if we stick with floating point, the values are positive 1 for one half of the period and negative 1 for the other, with no transition in between, so it sounds very unnatural and harsh. But used the right way, it can produce a really interesting effect. Even though something may not be, quote unquote, "smooth" like a sine wave, it turns out that distortion, when applied correctly, is really pleasing to our ears. It sounds great in synthesizers, on guitar, and in many other contexts. So used properly, these harsh types of waveforms can sound really pleasing. Two others you might see are sawtooth waves and triangle waves. They each have their own character, and you can switch between them on most synthesizers and compare.

Another thing you'll see on a synthesizer, and it is so common, is attack, decay, sustain, and release. These terms show up in many types of audio equipment, but on synthesizers specifically they are so common that they're abbreviated ADSR. Imagine I push a key on a synthesizer's keyboard and it produces a sound. That sound doesn't start at its full volume immediately: given this graph, it takes the attack time to ramp up to its peak volume. Then it decays; the peak only lasts temporarily, and the sound drops down to a resting level over the decay time. The sustain level is the volume that's held for as long as I keep the key down. And when I let go, the release time is how long the audio takes to drop back down to zero. With these four parameters you can vastly change how sounds are perceived, how they perform, and how they mix together; the result can have a totally different feel. It's definitely a fun thing to play around with.
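To make ADSR concrete, here is a small sketch of a linear envelope. The struct, its names, and the piecewise-linear shape are illustrative assumptions, not how any particular synthesizer or AudioKit models its envelopes.

```swift
// Times are in seconds, levels are in the 0...1 range.
struct ADSREnvelope {
    var attack  = 0.05   // time to ramp from 0 up to the peak volume
    var decay   = 0.20   // time to fall from the peak to the sustain level
    var sustain = 0.6    // level held for as long as the key stays down
    var release = 0.30   // time to fall from the sustain level back to 0

    /// Amplitude while the key is held, `t` seconds after it was pressed.
    func levelWhileHeld(at t: Double) -> Double {
        if t < attack { return t / attack }                 // attack ramp
        if t < attack + decay {                             // decay toward sustain
            let progress = (t - attack) / decay
            return 1.0 - progress * (1.0 - sustain)
        }
        return sustain                                      // sustain plateau
    }

    /// Amplitude `t` seconds after the key was released.
    func levelAfterRelease(at t: Double) -> Double {
        max(0, sustain * (1.0 - t / release))               // release ramp down to 0
    }
}

let envelope = ADSREnvelope()
print(envelope.levelWhileHeld(at: 0.025))    // 0.5, halfway up the attack ramp
print(envelope.levelAfterRelease(at: 0.15))  // 0.3, halfway down the release ramp
```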
The last piece of physical audio equipment that's useful to have a mental model of is a guitar pedal board. With a pedal board, you take the cable coming from your guitar and plug it into the right side of the board here: the right side is the input, and the left side is the output. If you follow the cables along, the output of each pedal is plugged into the input of the next pedal. By doing that, you have one effect whose output feeds into another effect, and the order of these matters a whole lot. Typically, with guitar processing, you put gain, distortion, and things like that toward the front, and delays, reverbs, and any kind of modulation effects toward the back. But these are guidelines. People will call them rules, but rules are meant to be broken, and if it sounds good, it is good. It's easy to swap things around: you can take one pedal, move it to the front of the chain, and see how it sounds. Sometimes these rules are there for a reason, because they're generally a good idea, but with programming especially, experimenting is a lot easier than swapping cables around. We'll see examples of this in AudioKit, where these building blocks are called nodes, and you can connect the output of one node to the input of another node.

OK, that's it for the real world hardware analogs. Let's take a look at the building blocks of audio on Apple's platforms. Apple ships Core Audio, a low-level framework, a set of C APIs, that gives you full control over sound playback and recording on iOS. It gives you sample-level control: you can do whatever you want with the samples, and you're in full control of the audio buffers that are used. But as such, it is quite cumbersome to even get started with Core Audio. Now, let's say you're building a podcast player. The podcast player I use, Overcast, has a voice boost function. That may be done as an audio unit plugged into Core Audio that takes the samples, picks out certain frequencies, boosts them, and sends the output back to the buffer. Another thing Overcast does is skip silence: it analyzes the audio, and when it detects a gap of silence, it shortens that gap, so we hear just the spoken words without all the pauses in between. That can speed up listening to a track without making it sound unnaturally fast.

On top of Core Audio is AVFoundation, a set of APIs built with Core Audio. AVFoundation also covers video, but I'm specifically talking about the audio part here. It's usable from Swift, it's much higher level, and it's much easier to use. If you have a URL to a file, either on a network or on disk, and you just want to play it back, this is the easiest way to do that. Same thing for recording: if you just want to start recording from a microphone into a file, this is the easiest way to do it. But if you want sample-level control, or you want building blocks for more complex things, it's not as flexible as Core Audio.

And that brings me to AudioKit, a popular open source third party framework for doing all kinds of interesting things with audio on iOS, and we're going to see a lot of those soon. So in the rest of the series, we're going to dive into the world of audio programming, see what's possible with AudioKit, and build some fun things.
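As a closing illustration of that easy AVFoundation playback path mentioned above, here is what playing a local file with AVAudioPlayer can look like. The file path is a placeholder; in a real app you'd keep a strong reference to the player for as long as playback should continue, and streaming from a network URL would typically go through AVPlayer instead.

```swift
import AVFoundation

let fileURL = URL(fileURLWithPath: "/path/to/recording.m4a")   // hypothetical local file

do {
    // Load the file and start playback.
    let player = try AVAudioPlayer(contentsOf: fileURL)
    if player.play() {
        print("Playing \(fileURL.lastPathComponent)")
    }
} catch {
    print("Could not load the file: \(error)")
}
```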
