
Getting the Best Out of FLAC on ARMv7: Performance Optimization Tips

Overview

FLAC stands for Free Lossless Audio Codec, an audio format that, like MP3, compresses audio, but without any loss in quality. It is generally used when we have to encode audio without compromising quality.

FLAC is an open-source codec (software or hardware that compresses or decompresses digital audio) and is free to use.

We chose to deploy the FLAC encoder on an ARMv7 embedded platform.

ARMv7 is a version of the ARM processor architecture; it is used in a wide range of devices, including smartphones, tablets, and embedded systems.

Let's dive into how to optimize FLAC's performance specifically for the ARMv7 architecture. Along the way, you'll also get a sense of why optimizing FLAC matters.

So, tighten your seat belts, and let’s get started.

Why Do We Need to Optimize FLAC?

Optimizing FLAC for performance makes it encode/decode (compress/decompress) audio faster. The points below explain why we need fast codecs.

  • Suppose you’re using one of your favorite music streaming apps, and suddenly, you encounter glitches or pauses in your listening experience.
  • How would you react to the above? A poor user experience will cause this app to lose users to the competition.
  • There can be many reasons for that glitch to happen, possibly a network problem, a server problem, or maybe the audio codec.
  • The app’s audio codec may not be fast enough for your device to deliver the music without any glitches. That’s the reason we need fast codecs. It is a critical component within our control.
  • FLAC is a widely used HiRes audio codec because of its lossless nature.

Optimizing FLAC for ARMv7

Why Optimize for the ARM Platform?

  • Most music devices use ARM-based processors, like mobiles, tablets, car systems, FM radios, wireless headphones, and speakers. 
  • They use ARM because of its small chip size, low energy consumption (good for battery-powered devices), and lower heat output.

Optimization Techniques

FLAC's source code is written in the C programming language, so there are two ways to optimize it.

  1. We can rearrange or rewrite the FLAC C source code so that it executes faster. Let's call this the C Optimization technique.
  2. We can convert some parts of the FLAC source code into machine-specific assembly language. Let's call this ARM Assembly Optimization, as we are optimizing for ARMv7.

In my experience, assembly optimization gives better results.

To discuss optimization techniques, first, we need to identify where codec performance typically lags.

  • A typical codec uses complex algorithms that involve many mathematical operations.
  • Loops are another place where codecs generally spend much of their time.
  • These calculations also require frequent access to main memory (RAM), which carries a performance penalty.
  • Therefore, before optimizing FLAC, we have to keep these points in mind. Our main goal should be to make mathematical calculations, loops, and memory access faster.

C Optimization

There are many ways in which we can approach C optimizations. Most methods are generalized and can be applied to any C source code.

Loop Optimizations

As discussed earlier, loops are one of the parts where a codec generally spends more time. We can optimize loops in C itself.

There are two widely used methods to optimize loops in C.

Loop Unrolling
  • Loops have three parts: initialization, condition checking, and increment.
  • On every iteration, the loop has to test the exit condition and increment the counter.
  • This condition check disrupts the flow of execution and imposes a significant performance penalty when working on a large data set.
  • Loop unrolling reduces branching overhead by working on a larger data chunk before the condition check.

Let’s try to understand by an example:

CODE: https://gist.github.com/velotiotech/1161a016ffdcd581a86397664e0a5229.js

As you can see, after unrolling by 4, we test the exit condition and increment the counter n/4 times instead of n times.
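
For reference, here is a minimal sketch of the same idea, using a hypothetical array-summation loop rather than actual FLAC code:

#include <stddef.h>

/* Straightforward version: the exit test and increment run n times. */
long sum_simple(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4: the exit test and increment run roughly n/4 times.
   A small tail loop handles the remaining 0-3 elements when n is not
   a multiple of 4. */
long sum_unrolled(const int *a, size_t n)
{
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        sum += a[i];
    return sum;
}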

Loop Fusion

When two loops iterate over the same data structure, we can combine them into one. This removes the overhead of one loop, so the code executes faster. However, we need to ensure that both loops have the same number of iterations and that their operations are independent of each other.

Let’s see an example.

CODE: https://gist.github.com/velotiotech/0fb06d70fc881ca07babeb4cfaa4d3a6.js

As you can see in the above code, the array a[] is used in both loops, so we can merge them; the condition check and increment then run n times instead of 2n times.
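
For reference, here is a minimal sketch of loop fusion on hypothetical arrays (not the actual FLAC code):

#include <stddef.h>

/* Before fusion: two loops over the same array a[], so the condition
   check and increment run 2n times in total. */
void scale_and_offset_separate(const float *a, float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 2.0f;
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + 1.0f;
}

/* After fusion: one loop, n condition checks, and a[i] is read while it
   is still warm in a register or cache line. The two statements are
   independent, so merging them does not change the result. */
void scale_and_offset_fused(const float *a, float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i] * 2.0f;
        c[i] = a[i] + 1.0f;
    }
}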

Memory Optimizations for the ARM Architecture

Memory access can significantly impact performance, since each access consumes multiple processor cycles. ARM is a load/store architecture: it cannot operate directly on data stored in memory, so the data must first be transferred to the register bank. This highlights the need to streamline the flow of data to the ARM CPU for processing.

We can also utilize cache memory, which is much faster than main memory, to help minimize this performance penalty.

To make memory access faster, data can be rearranged so that it is accessed sequentially, which consumes fewer cycles. By optimizing memory access, we can improve FLAC's overall performance.

Fig-1 Cache memory lies between the main memory and the processor

Below are some tips for using the data cache more efficiently.

  • Preload the frequently used data into the cache memory.
  • Group related data together, as sequential memory accesses are faster.
  • Similarly, try to access array values sequentially instead of randomly.
  • Use arrays instead of linked lists wherever possible for sequential memory access.

Let’s understand the above by an example:

CODE: https://gist.github.com/velotiotech/c46677b7439dbb21c3623b42e94d8328.js

As we can see in the above example, loop interchange significantly reduces cache misses; the optimized code has a cache-miss rate of only 0.1923%. For an array a[1000][900], this adds up to a performance improvement of 20% on ARMv7.
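
For reference, here is a minimal sketch of the kind of loop interchange measured above, assuming a hypothetical two-dimensional array and relying on C's row-major memory layout (this is not the actual FLAC code):

#define ROWS 1000
#define COLS 900

static int a[ROWS][COLS];

/* Column-by-column traversal: consecutive iterations touch elements that
   are COLS * sizeof(int) bytes apart, so most accesses miss the cache. */
long sum_column_major(void)
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}

/* Row-by-row traversal (loops interchanged): consecutive iterations touch
   adjacent bytes, so every cache line fetched from RAM is fully used. */
long sum_row_major(void)
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}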

Assembly Optimizations

First, we need to understand why assembly optimizations are required.

  • In C optimization, we can access limited hardware features.
  • In ARM Assembly, we can leverage the processor features to the full extent, which will further help in the fast execution of code.
  • ARMv7 provides a NEON co-processor, a Floating Point Unit, and an EDSP unit, which accelerate mathematical operations. We can explicitly target such hardware via assembly language (or compiler intrinsics that map directly to it).
  • Compilers convert C code to assembly code, but may not always generate efficient code for certain functions. Writing those functions directly in assembly can lead to further optimization.

The below points explain why the compiler doesn’t generate efficient assembly for some functions.

  • The first obvious reason is that compilers are designed to convert any C code to assembly without changing the meaning of the code. The compiler does not understand the algorithms or calculations being used.
  • The person who understands the algorithm can, of course, write better assembly than the compiler.
  • An experienced assembly programmer can modify the code to leverage specific hardware features to speed up performance.

Now let me explain the most widely used hardware units in ARM, which accelerate mathematical operations.

NEON - 
  • The NEON co-processor is an additional computational unit to which the ARM processor can offload mathematical calculations.
  • It is just like a sub-conscious mind (co-processor) in our brain (processor), which helps ease the workload.
  • NEON performs parallel processing; it can perform up to 16 additions, subtractions, etc., in a single instruction (a short code sketch follows after Fig-3).

Fig-2 Instead of adding 4 variables one by one, NEON adds them in parallel

  • FLOATING POINT UNIT (FPU) - This hardware unit performs operations on floating-point numbers. Typical operations it supports are addition, subtraction, multiplication, division, and square roots.
  • EDSP (Enhanced Digital Signal Processing) - This hardware unit supports fast multiplication, multiply-accumulate, and vector operations.

Fig-3 ARMv7 CPU, NEON, EDSP, FPU, and Cache under ARM Core
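
To make the idea in Fig-2 concrete, here is a minimal sketch of 4-wide parallel addition. It uses the NEON intrinsics from <arm_neon.h>, which map almost one-to-one onto NEON instructions, rather than hand-written assembly; the function and buffers are hypothetical, not taken from FLAC, and on ARMv7 the file would be built with NEON enabled (e.g., -mfpu=neon):

#include <arm_neon.h>

/* Adds two int32 buffers four elements at a time.
   n is assumed to be a multiple of 4 to keep the sketch short. */
void add_buffers_neon(const int32_t *a, const int32_t *b, int32_t *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);   /* load 4 lanes from a */
        int32x4_t vb = vld1q_s32(b + i);   /* load 4 lanes from b */
        int32x4_t vr = vaddq_s32(va, vb);  /* 4 additions in one instruction */
        vst1q_s32(out + i, vr);            /* store 4 results */
    }
}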

Approaching Optimizations

First of all, we have to identify which functions to optimize. We can find that out by profiling FLAC.

Profiling is a technique for learning which sections of code take the most time to execute and which functions are called most frequently. We can then focus our optimization effort on those sections or functions.

Below are some tips for deciding which optimization technique to use.

  • For performance-critical functions, ARM Assembly should be considered first, as it typically provides better performance than C optimization because we can directly leverage hardware features.
  • When there is no scope for using the hardware units (which primarily deal with mathematical operations), we can go for C optimizations.
  • To determine whether the generated assembly can be improved, we can inspect the compiler's assembly output (for example, by compiling with gcc -S).
  • If there is scope for improvement, we can write the code directly in assembly to make better use of hardware features such as NEON and the FPU, as in the sketch below.
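
As a small illustration of the last point, here is a sketch of dropping to assembly from C using GCC's inline-assembly syntax. The saturating-add helper is hypothetical and not taken from FLAC; it simply shows how a single EDSP instruction (QADD) can be used for one hot operation:

#include <stdint.h>

/* Saturating 32-bit addition using the ARMv7 EDSP QADD instruction.
   The constraints tell the compiler the inputs live in registers ("r")
   and the result comes back in a register ("=r"). */
static inline int32_t saturating_add(int32_t a, int32_t b)
{
    int32_t result;
    __asm__ ("qadd %0, %1, %2" : "=r"(result) : "r"(a), "r"(b));
    return result;
}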

Results 

After applying the above techniques to the FLAC Encoder, we saw an improvement of 22.1% in encoding time. As the results below show, we used a combination of assembly and C optimizations.

Fig-4 Graphical visualization of average encoding time vs Sampling frequency before and after optimization.

Conclusion

FLAC is a lossless audio codec used to preserve quality for HiRes audio applications. Optimizations that target the platform on which the codec is deployed help provide a great user experience by drastically improving the speed at which audio can be compressed or decompressed. The same techniques can be applied to other codecs by identifying and optimizing their performance-critical functions.

The optimization techniques we have used are bit-exact, i.e., after optimization, you will get exactly the same audio output as before.

However, it is important to note that although we can trade bit-exactness for speed, it should be done judiciously, as it can negatively impact the perceived audio quality.

Looking to the future, with ongoing research into new compression algorithms and hardware, it is likely that we will see new and innovative ways to optimize audio codecs for better performance and quality.


