Fluid Implicit Particle Simulation for CPU and GPU (2024)

Pedro Pereira da Silva Anastácio Centeno
(p.silva.centeno@tecnico.ulisboa.pt)

João Madeiras Pereira
(jap@inesc-id.pt)

Instituto Superior Técnico, Lisboa, Portugal

Abstract

One of the current challenges in physically-based simulations, and, more specifically, fluid simulations, is to produce visually appealing results at interactive rates, capable of being used in multiple forms of media. In recent times, much effort has been directed at exploiting multi-core architectures, as many of the computations involved in the algorithms for these simulations are very well suited to them. Although there is a considerable body of work on acceleration techniques in this field, there is still room to further explore and analyze some of them. To investigate this problem, we surveyed the topic of fluid simulations and some of the recent contributions to this field. Additionally, we implemented two versions of a fluid simulation algorithm, one on the CPU and the other on the GPU using NVIDIA’s CUDA framework, with the intent of gaining a better understanding of the effort needed to move these simulations to a multi-core architecture and the performance gains obtained by doing so.

Keywords: Fluid Simulations, PIC, FLIP, Poisson Solvers, Surface Reconstruction, CUDA.

1. Introduction

Physically-based fluid animation is an area of great interest in computer graphics and, in particular, can be used to capture stunning visual effects, from clouds and smoke, to fire and explosions, viscous, elastic and plastic flows, sand, droplets, foams, bubbles or storms. In the movie industry, it has its roots in Antz and has been employed in countless other works such as The Lord of the Rings, Pirates of the Caribbean and The Day After Tomorrow. In the videogame industry it has been used to great effect in games like Shadow of the Tomb Raider.

With the nimble progress we are seeing in computer graphics research, it is now not only possible to perform beautiful fluid simulations, but we can also simulate more advanced effects.

A fluid’s behavior can be modeled in multiple ways, ranging from a collection of particles, to a level-set defined surface or a voxelized domain. Each of these, on its own terms, can mold the way the behavior of the fluid is perceived when simulating it. The Fluid Implicit Particle algorithm and its extensions, the current hallmarks for capturing accurate and minute effects, are hybrid methods that combine the advantages of particle- and grid-based methods, offering a distinctive approach for handling fluid simulation.

However, this type of simulation involves intensive numerical calculations and, due to the complexity of the employed techniques, we often have to wait several seconds, if not minutes, for every frame to be computed, which means leaving our computer crunching these scenes overnight or having to wait days for results to appear. These computations can range from the simulation steps themselves, in the form of differential operators, integration and particle and grid interpolation, to posterior surface reconstruction algorithms, and can even extend to affect the representation of the simulation data.

Fortunately, since FLIP operates on the fact that the potential of the fluid system changes little over the course of a single timestep, each particle can be assumed to be independent of its neighbors. This, in turn, means many of the computations have become more tractable in recent times, as they lend themselves to being trivially parallelized.

2. Background

There is a large body of work on physically-based fluid simulation in computer graphics and, as such, in this section we will only try to touch upon topics of considerable relevance to this project - enough so to help it stand on its own and allow for an outline of most developments that led to it.

2.1. Fluid Simulation

Physically-based simulation methods model the dynamics of fluids by solving the governing equations, namely the Navier-Stokes equations, which express conservation of momentum, mass and energy for Newtonian fluids. For simplicity’s sake, we will focus solely on incompressible and homogeneous fluids.

2.1.1 Navier-Stokes Equations

There are two of these titular equations: the momentum equation and the incompressibility condition.

The momentum equation is really just Newton’s Second Law of Motion, $\mathbf{F}=m\mathbf{a}$, in disguise. Imagining the fluid as a particle system, where each particle represents a small blob of fluid with mass $m$, volume $V$ and velocity $\mathbf{u}$, the same law then tells us how each particle accelerates, allowing us to integrate the system forward in time.

The acceleration term can be written as the material derivative of the particle’s velocity, leaving us with $\mathbf{F}=m\frac{D\mathbf{u}}{Dt}$. With external forces, which usually amount to just gravity, set aside for now, the remaining forces acting on each particle are due to pressure and viscosity.

For pressure, what matters is the imbalance of higher pressure on one side of the particle than the other, resulting in a force pointing away from the high pressure and towards the low pressure. The simplest way to measure the imbalance at the position of the particle is to take the negative gradient of the pressure which, when integrated by approximation over the particle’s volume, results in the pressure force $-V\nabla p$.

For viscosity, there is a force that tries to make a particle move at the average velocity of its neighbours, minimizing the difference in velocity between them. The differential operator that measures how far a quantity is from the average around it is the Laplacian which, when also integrated by approximation over the particle’s volume and multiplied by the viscosity coefficient, yields the viscous force $V\mu\nabla^{2}\mathbf{u}$.

Putting it all together, we get a differential equation which dictates how a blob of fluid moves:

$$m\frac{D\mathbf{u}}{Dt}=m\mathbf{g}-V\nabla p+V\mu\nabla^{2}\mathbf{u}$$

Given that it is erroneous to approximate a fluid to a finite number of particles, we should take the limit as our number of particles goes to infinity and the size of each goes to zero. This poses a problem, as the mass and volume of a particle are then also going to zero. This can be solved by dividing the equation by the volume and only taking the limit subsequently.

Remembering that $m/V$ is the fluid density $\rho$ and $\mu/\rho$ is the kinematic viscosity $\nu$, this is what we are left with:

$$\rho\frac{D\mathbf{u}}{Dt}=\rho\mathbf{g}-\nabla p+\mu\nabla^{2}\mathbf{u}\;\Leftrightarrow\;\frac{D\mathbf{u}}{Dt}=\mathbf{g}-\frac{1}{\rho}\nabla p+\nu\nabla^{2}\mathbf{u} \tag{1}$$

Picking an arbitrary chunk of fluid at some instant in time, with volume $\Omega$ and boundary surface $\partial\Omega$, we can measure how fast said volume is changing by integrating the normal component of its velocity around the boundary:

$$\frac{d}{dt}\operatorname{Volume}(\Omega)=\iint_{\partial\Omega}\mathbf{u}\cdot\hat{n}$$

For an incompressible fluid, the volume should then remain constant, i.e. this rate of change is zero:

$$\iint_{\partial\Omega}\mathbf{u}\cdot\hat{n}=0$$

Using the divergence theorem, we can translate this to a volume integral and get:

$$\iiint_{\Omega}\nabla\cdot\mathbf{u}=0$$

This equation should hold true for any choice of $\Omega$ and the only function that integrates to zero independent of the volume of integration is zero itself, prompting that:

$$\nabla\cdot\mathbf{u}=0 \tag{2}$$

Known as the incompressibility condition, this is the other part of the incompressible Navier-Stokes equations.

2.1.2 Lagrangian and Eulerian Viewpoints

As previously stated, there are multiple approaches to trailing a moving continuum.

On the one hand, much like how we began by deriving the momentum equation, the Lagrangian approach treats a continuum as a particle system - "a set of discrete particles that move through space"[4], each with its own position, velocity, density, mass and other assortment of properties.

On the other hand, the Eulerian approach, instead, "looks at fixed points in space and measures how each fluid quantity changes with time at each of those points"[13]. In Eulerian methods, the Navier-Stokes equations are discretized with grids.

The link between the two viewpoints is the material derivative, which states, for any quantity $q$, that:

$$\frac{Dq}{Dt}=\frac{\partial q}{\partial t}+\nabla q\cdot\mathbf{u}$$

Returning now to equation 1 and applying this result we finally get:

$$\frac{\partial\mathbf{u}}{\partial t}=-(\mathbf{u}\cdot\nabla)\mathbf{u}+\nu\nabla^{2}\mathbf{u}-\frac{1}{\rho}\nabla p+\mathbf{g} \tag{3}$$

2.1.3 Stable Fluids

At the inception of physically-based fluid simulation, researchers "did not tackle the Navier-Stokes equations directly"[13]. In exchange for simplicity and efficiency, their approaches did not capture a score of interesting phenomena such as overturning waves, sprays and splashes.

In 1999, Jos Stam introduced the first unconditionally stable algorithm that solves the Navier-Stokes equations[12]. Stam achieved this by exercising the idea of time splitting, effectively dividing the Eulerian method in equation 3 into four distinct steps, where $\mathbf{w}_{0}(\mathbf{x})=\mathbf{u}(\mathbf{x},t)$ and $\mathbf{w}_{4}(\mathbf{x})=\mathbf{u}(\mathbf{x},t+\Delta t)$:

$$\mathbf{w}_{0}(\mathbf{x})\xrightarrow{\text{force}}\mathbf{w}_{1}(\mathbf{x})\xrightarrow{\text{advect}}\mathbf{w}_{2}(\mathbf{x})\xrightarrow{\text{diffuse}}\mathbf{w}_{3}(\mathbf{x})\xrightarrow{\text{project}}\mathbf{w}_{4}(\mathbf{x})$$

We can discard the diffusion step because we will mostly be dealing with inviscid fluids.

2.1.4 Hybrid Methods: PIC and FLIP

Lagrangian techniques avoid continuous re-sampling during the advection step, which allows for finer-grained details, such as turbulent splashing flows, bubbles and foams. Additionally, they have an easier time depicting irregular boundaries. In general, however, this comes at the expense of stability restrictions, numerical smoothing, inability to guarantee incompressibility and a need for complicated re-meshing.

Eulerian techniques, in contrast, allow for larger time steps and, thanks to grid discretizations, have well defined finite differences for computing the projection step. Having said that, they suffer from numerical dissipation in the semi-Lagrangian advection step.

Just as the name proposes, hybrid methods try to combine the advantages of both Lagrangian and Eulerian techniques.

Evans and Harlow presented the Particle in Cell (PIC) method in 1957[7]. The central idea behind this method is that particles represent the fluid domain and carry velocity but, before the projection step, the particles’ velocities are transferred to a grid where this step is handled. Once this new velocity field has been computed, the velocities are interpolated back to the particles. This back-and-forth interpolation between particles and grid introduces some numerical dissipation, which leads to fluids appearing more viscous than they actually are. Abreu et al.[17] describe an implementation of PIC on CUDA-enabled GPUs.

Brackbill and Ruppel later proposed the Fluid Implicit Particle, or FLIP, method in 1986, whose main contribution was computing the change in velocities instead and adding it back to each particle[3]. Effectively, this alternate approach removes half of the smoothing introduced by PIC but introduces some unwanted visible noise. The common solution, then, became to linearly combine PIC and FLIP, producing close to ideal results. Zhu and Bridson consolidated this with incompressible flow, and it became the go-to algorithm for fluid simulation[16].

2.2. Surface Tracking and Reconstruction

Discretizing and maintaining moving surfaces are essential parts of capturing accurate behavior in physically-based fluid simulations. There are two main paradigms into which the employed methods can be categorized: implicit (or Eulerian) and explicit (or Lagrangian) surface tracking.

2.2.1 Level-Set Methods

Unlike other volumetric effects such as smoke or clouds, liquids have clear interfaces and can be plagued by extreme topology changes - "examples include merging, i.e., multiple surfaces becoming one; splitting, i.e., one surface being divided into several; and changes in genus, i.e., opening and closing of holes in the fluid volume"[6]. For this reason, implicit techniques are the tried-and-true approach to describe liquids.

The most common way to define implicit surfaces is through the use of level-set methods, proposed by Zhao et al.[15]. The surface is characterized by a set of distance samples stored in a fixed grid, where it can be interpreted as the zero-contour of a signed distance field, $\Phi(\mathbf{x})$ - positive $\Phi(\mathbf{x})$ indicates $\mathbf{x}$ is outside the fluid region, while negative $\Phi(\mathbf{x})$ identifies the inside.

When dealing with implicit surfaces, it is necessary to reconstruct them at render time in order to visualize the fluid domain. One of the first instances of implicit surfaces for particle-based simulations, called blobbies, was proposed by Blinn in 1982[1], in which a set of smooth scalar functions (or smoothing kernels) is used to map points in space to a mass density, i.e., each particle of the fluid defines an implicit sphere with a set radius.

In 2005, Zhu and Bridson introduced a method called improved blobbies[16], producing more pleasing results by exactly reconstructing the signed distance field of each isolated particle.

2.2.2 Marching Cubes

The above methods only extract scalar distance fields from the particles. To convert the field into a triangle mesh, one of the more common and simpler approaches for offline rendering is the marching cubes algorithm, introduced by Lorensen and Cline in 1987[10].

The algorithm is based on evaluating a real-valued function into a discrete scalar field discretized on a regular grid. The standard MC constructs a "facetized isosurface" by processing the signed distance function for every cell corner in the grid in a "sequential, cube-by-cube (scanline) manner"[11].

3. System Overview

3.1. CPU-based Implementation

3.1.1 Staggered Grid

When simulating Eulerian fluids, where grids are used to maintain information, there is a need to not only discretize the simulation in time, but also in space. The most fundamental innovation in this field was the Marker-and-Cell grid, introduced by Harlow and Welch[8]. The MAC grid has a staggered configuration, i.e., it stores different variables at different locations. Figure 1 illustrates a three-dimensional MAC grid.


Figure 1: One cell from a three-dimensional MAC grid.

When applied to FLIP, this means scalars have their sample points located at the center of the cell, whereas sample points of the velocity field (a vector measure) coincide with the cell face boundaries. The benefit of using this configuration comes from the fact that the discretization of differential operators is greatly simplified - namely,
we can use accurate central differences for the pressure gradient and for the divergence of the velocity. To estimate the derivative of any quantity $q$ at a grid point $i$, we can do so as:

$$\left(\frac{\partial q}{\partial x}\right)_{i}\approx\frac{q_{i+1/2}-q_{i-1/2}}{\Delta\tau}$$

where $\Delta\tau$ is the size of a cell. This is "unbiased and accurate to $O(\Delta x^{2})$"[5], naturally resulting in smaller-scale errors.

Note also that when manipulating staggered grids, we sample the normal components of the velocity at the center of each of its faces, which intuitively allows us to estimate the amount of fluid flowing into and out of the cell.

Although the MAC grid is suitable for handling pressure and incompressibility, it does come with its downsides. One significant example is evaluating the full velocity vector somewhere in the fluid domain, where we always need some sort of interpolation (in the case of three dimensions, trilinear) for each component.

Moreover, while half indices, such as $i+1/2$, are convenient theoretically and conceptually, an implementation should use integer indices, so we represent an $n$-dimensional MAC grid as a flattened array of cells (see figure 2 as an example).


Figure 2: Representation of a two-dimensional grid as a flattened array.
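
As an illustration of this bookkeeping, below is a minimal C++ sketch of flattened indexing for cell-centered and staggered samples; the names (GridDims, cellIndex, uFaceIndex) and the x-fastest layout are our assumptions, not taken from the actual implementation.

```cpp
#include <cstddef>

// Hypothetical dimensions of the cell-centered grid (nx * ny * nz cells).
struct GridDims { int nx, ny, nz; };

// Flattened index of the cell (i, j, k) in a 1D array laid out x-fastest.
inline std::size_t cellIndex(const GridDims& d, int i, int j, int k) {
    return static_cast<std::size_t>(i)
         + static_cast<std::size_t>(j) * d.nx
         + static_cast<std::size_t>(k) * d.nx * d.ny;
}

// For the staggered u component, the half index i+1/2 maps to integer i in a
// face-centered array that has one extra sample along x: (nx+1) * ny * nz.
inline std::size_t uFaceIndex(const GridDims& d, int i, int j, int k) {
    return static_cast<std::size_t>(i)
         + static_cast<std::size_t>(j) * (d.nx + 1)
         + static_cast<std::size_t>(k) * (d.nx + 1) * d.ny;
}
```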

3.1.2 Seeding

We handle seeding by defining a sharp region, given by a level set, inside of which the emission density is constant, dropping to zero outside of said region. Inside each cell that represents the fluid domain, the particles are seeded in a jittered grid - much like super-sampling in a renderer - with a number of subcells proportional to a specified density.


Figure 3: A density $\rho=2$ totals eight subcells per grid cell, all of which will be populated with one fluid particle randomly located inside.

According to Zhu and Bridson[16], this method helps "avoid aliasing when the flow is underresolved at the simulation resolution". Bridson also proposes that, in order to avoid gaps in the fluid and noise when running the simulation, a fluid cell should contain no less than three and no more than twelve particles[5].
At each timestep, if a fluid cell fails to fulfill the specified density, it is naively reseeded until the number of particles it houses reaches the specified value.
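
A minimal sketch of this jittered seeding for a single cell is shown below; all names (seedCell, cellMin, dtau) are illustrative, and the choice of std::mt19937 as the random source is our assumption.

```cpp
#include <random>
#include <vector>

struct Vec3 { float x, y, z; };

// Seed one fluid cell with jittered particles: the cell is split into
// density^3 subcells and one particle is placed at a random position inside
// each subcell (cf. figure 3). cellMin is the cell's minimum corner and dtau
// the cell size.
void seedCell(Vec3 cellMin, float dtau, int density,
              std::mt19937& rng, std::vector<Vec3>& particles) {
    std::uniform_real_distribution<float> jitter(0.0f, 1.0f);
    const float sub = dtau / density;               // subcell edge length
    for (int i = 0; i < density; ++i)
        for (int j = 0; j < density; ++j)
            for (int k = 0; k < density; ++k)
                particles.push_back({
                    cellMin.x + (i + jitter(rng)) * sub,
                    cellMin.y + (j + jitter(rng)) * sub,
                    cellMin.z + (k + jitter(rng)) * sub });
}
```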

3.1.3 Advection

Advecting the velocity field is akin to moving each particle through the current state of the velocity field. In this step, we apply the physically-motivated semi-Lagrangian method as introduced by Stam[12]. Semi-Lagrangian advection traces the particles through the grid, finding the value $q$ at any current grid point by moving back through the path dictated by the velocity field and grabbing the value at the starting point.

Note that at this point in the simulation, the particles are still freely moving, so we’re effectively dealing with a Lagrangian measure and, as it happens, in our implementation, we’re not carrying any supplementary quantities on a particle, which greatly simplifies the semi-Lagrangian step. We use a simple forward Euler scheme, which translates to applying, for each particle: $\mathbf{x}_{P}=\mathbf{x}_{P}+\mathbf{u}\Delta t$.

This, however, is unconditionally unstable for any timestep, which can cause particles that have penetrated grid boundaries to become stuck. To fix this issue we clamp the particles back into the grid, moving them in the normal direction of the offending face by an arbitrary skin width.
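
A minimal sketch of this advection-plus-clamping step for one particle, assuming the velocity has already been interpolated at the particle's position; the names and the simple axis-wise clamp are our assumptions.

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };

// Forward Euler step for one particle followed by clamping back into the grid
// by a small "skin" margin, as described above. u is the velocity already
// interpolated at the particle's position; domainMin/domainMax/skin are
// illustrative names.
Vec3 advectParticle(Vec3 x, Vec3 u, float dt,
                    Vec3 domainMin, Vec3 domainMax, float skin) {
    x.x += u.x * dt;  x.y += u.y * dt;  x.z += u.z * dt;
    // Push any escaped particle back inside along each axis.
    x.x = std::min(std::max(x.x, domainMin.x + skin), domainMax.x - skin);
    x.y = std::min(std::max(x.y, domainMin.y + skin), domainMax.y - skin);
    x.z = std::min(std::max(x.z, domainMin.z + skin), domainMax.z - skin);
    return x;
}
```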

Better results could be obtained with slightly more sophisticated techniques, such as a higher-order Runge-Kutta method, but we found that forward Euler yielded adequate results for our purposes.

3.1.4 CFL Condition

Generally speaking, in physically-based simulations, the goal is to recreate visually pleasing results as fast as feasibly possible. As noted above, our advection scheme is unconditionally unstable, i.e., for any timestep of our choosing, be it bigger or smaller, it can eventually blow up, so, in theory, we shouldn’t have to bother with stability. Empirically, what we noticed was that, if we dampened the rate at which we diverge towards an inaccurate solution by choosing a $\Delta t$ that satisfies the Courant-Friedrichs-Lewy, or CFL, condition, we got better results.

Bridson notes that, for advection, the true solution is moving at speed $\|\mathbf{u}\|$, so the speed at which numerical information is transmitted must be at least as fast, that is:

$$\alpha\frac{\Delta\tau}{\Delta t}\geq\|\mathbf{u}\|$$

Which turns into a condition on the timestep:

$$\Delta t\leq\alpha\frac{\Delta\tau}{\|\mathbf{u}\|}$$

where $\alpha$ is the CFL number, a useful parameter.

Thus, to compute $\Delta t$, we iterate through all the particles to find which is travelling the fastest, and use that in conjunction with a chosen $\alpha=3$:

$$\Delta t = \alpha\frac{\Delta\tau}{\|\mathbf{u}_{\max}\|}$$
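
A sketch of this timestep computation, assuming particle velocities are available in a simple array; the small epsilon guarding against a fluid at rest is our addition, not something stated in the text.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Compute the timestep from the CFL condition: find the fastest particle and
// set dt = alpha * dtau / ||u_max||, with alpha = 3 as in the text.
float computeTimestep(const std::vector<Vec3>& velocities, float dtau,
                      float alpha = 3.0f) {
    float maxSpeed = 0.0f;
    for (const Vec3& u : velocities)
        maxSpeed = std::max(maxSpeed,
                            std::sqrt(u.x * u.x + u.y * u.y + u.z * u.z));
    const float eps = 1e-6f;                // avoid division by zero at rest
    return alpha * dtau / (maxSpeed + eps);
}
```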

3.1.5 Particle Binning

We added auxiliary subdivisions to our domain as a way to hash particles and the cells they occupy in a manner that speeds up grid traversal when necessary, as we get to step over air cells. For this, we created two additional data structures: a vector that stores raw indices to fluid cells in the MAC grid and a grid that bins particles by the cell they are in, as seen in figure 4:


Figure 4: Representation of the updates carried out for the vector of fluid cells and space hash. First, we compute the raw index for each particle. Then, we sort the vector of particles based on each particle’s raw index. Finally, we iterate through the vector, appending a new fluid cell for every new raw index we find and storing the beginning offset and count of particles for each grid cell in the space hash.

The space hash holds a grid-like structure with the same dimensions as the MAC grid and is used during the advection step to survey which cells should be marked as containing fluid versus air (i.e., whether their particle count is greater than zero) and during the reseeding step where, if the particle count of a cell either exceeds or falls below the predefined thresholds, we remove or emit new particles, respectively. The fluid cells vector is used in the projection step.
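
A rough CPU sketch of the sort-and-offset construction of figure 4, under the assumption that the space hash stores a begin/count pair per grid cell; all names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One entry per MAC-grid cell: where its particles start in the sorted order
// and how many there are. spaceHash is assumed zero-initialized for this step.
struct SpaceHashEntry { std::size_t begin = 0; std::size_t count = 0; };

void binParticles(const std::vector<std::size_t>& rawIndex,   // cell index of each particle
                  std::vector<std::size_t>& particleOrder,    // starts as 0..n-1
                  std::vector<std::size_t>& fluidCells,       // raw indices of fluid cells
                  std::vector<SpaceHashEntry>& spaceHash) {   // sized to the cell count
    // 1. Sort particle indices by the cell each particle falls into.
    std::sort(particleOrder.begin(), particleOrder.end(),
              [&](std::size_t a, std::size_t b) { return rawIndex[a] < rawIndex[b]; });
    // 2. Walk the sorted list, recording every new cell and its particle range.
    fluidCells.clear();
    for (std::size_t i = 0; i < particleOrder.size(); ++i) {
        const std::size_t cell = rawIndex[particleOrder[i]];
        if (fluidCells.empty() || fluidCells.back() != cell) {
            fluidCells.push_back(cell);
            spaceHash[cell].begin = i;
        }
        ++spaceHash[cell].count;
    }
}
```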

3.1.6 Particle-to-Grid Transfer

Bridson[5] mentions that the simplest way to compute the velocity reaching a certain grid point is to sum, at each grid point, the particle values modulated by a kernel function, $k$, that only gives weight to nearby particles. The kernel function should be adapted to the grid spacing, just as in rendering - if its support is less than $\Delta\tau$ then some particles between grid points may momentarily vanish from the grid (i.e., not contribute to any grid points), but if the support is much larger than $\Delta\tau$, the method becomes inefficient and will blur away a lot of the desirable sharpness. He recommends the use of a trilinear hat function, as seen in equation 4.

$$k(x,y,z)=h\!\left(\frac{x}{\Delta\tau}\right)h\!\left(\frac{y}{\Delta\tau}\right)h\!\left(\frac{z}{\Delta\tau}\right) \tag{4a}$$
$$h(r)=\begin{cases}1-r,&0\leq r\leq 1\\1+r,&-1\leq r<0\\0,&\text{otherwise}\end{cases} \tag{4b}$$

Calculating velocity is done component-wise and, for example, the $u$ component can be computed as follows:

$$u_{i+1/2,j,k}=\frac{\sum_{P}u_{P}\,k\left(\mathbf{x}_{P}-\mathbf{x}_{i+1/2,j,k}\right)}{\sum_{P}k\left(\mathbf{x}_{P}-\mathbf{x}_{i+1/2,j,k}\right)}$$

With the other components following suit. Bridson further recommends each weight be calculated separately for each grid point due to the non-uniform particle distribution.

Unlike what is suggested in the same chapter, instead of looping through all particles and updating the grid points affected by the current particle, we use what is called a gathering method: we loop over all grid points and calculate the velocity using the previous equation, looping over all particles that affect said grid point.
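
A sketch of this gathering transfer for a single u face sample, using the trilinear hat kernel of equation 4; the candidate-particle list is assumed to come from the binning structure of section 3.1.5, and all names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Trilinear hat kernel of equation 4, with support equal to one cell.
inline float hat(float r) { r = std::fabs(r); return r < 1.0f ? 1.0f - r : 0.0f; }
inline float kernel(Vec3 d, float dtau) {
    return hat(d.x / dtau) * hat(d.y / dtau) * hat(d.z / dtau);
}

// Gathering transfer of the u component to one face sample located at xFace:
// loop over the candidate particles (those binned near the face) and form the
// kernel-weighted average.
float gatherU(Vec3 xFace, const std::vector<Vec3>& positions,
              const std::vector<float>& uParticle,
              const std::vector<std::size_t>& nearbyParticles, float dtau) {
    float num = 0.0f, den = 0.0f;
    for (std::size_t p : nearbyParticles) {
        const Vec3 d = { positions[p].x - xFace.x,
                         positions[p].y - xFace.y,
                         positions[p].z - xFace.z };
        const float w = kernel(d, dtau);
        num += w * uParticle[p];
        den += w;
    }
    return den > 0.0f ? num / den : 0.0f;   // leave the sample at zero if nothing contributes
}
```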

3.1.7 Pressure Solve

The projection routine will subtract the pressure gradient from the intermediate velocity field computed up to this point,

$$\mathbf{u}^{n+1}=\mathbf{u}-\Delta t\frac{1}{\rho}\nabla p$$

so the result satisfies incompressibility inside the fluid, as seen in equation 2, and satisfies the solid wall boundary conditions while also respecting the free surface boundary condition that pressure be zero there.

We highlight here the importance of adopting a staggered grid configuration, "making central differences robust". For example, where we need to subtract the $\partial/\partial x$-component of the pressure gradient from the $u$ component of the velocity, there are two pressure values lined up perfectly on either side:

$$u_{i+1/2,j,k}^{n+1}=u_{i+1/2,j,k}-\Delta t\frac{1}{\rho}\frac{p_{i+1,j,k}-p_{i,j,k}}{\Delta\tau} \tag{5}$$

This follows suit for the other two components.

As we want our fluid to be incompressible (equation 2), on a grid we approximate this condition with finite differences while requiring that the divergence estimated at each fluid grid cell be zero for $\mathbf{u}^{n+1}$. Remember that the divergence, in three dimensions, is

$$\nabla\cdot\mathbf{u}=\frac{\partial u}{\partial x}+\frac{\partial v}{\partial y}+\frac{\partial w}{\partial z} \tag{6}$$

which we can also approximate using central differences.

Putting equations 2, 5 and 6 together, we reach a numerical approximation to the Poisson problem $-\Delta t/\rho\,\nabla\cdot\nabla p=-\nabla\cdot\mathbf{u}$.

Finally, we can think of this as a large system of linear equations: conceptually, a large coefficient matrix, $A$, times a vector consisting of all pressure unknowns, $p$, equal to a vector consisting of the negative divergences in each fluid grid cell, $b$: $Ap=b$.

In our implementation, $p$ and $b$ are grids represented as flattened arrays. With regards to matrix $A$, it is relevant to understand we do not need to store it directly as a matrix. Each row of $A$ corresponds to one equation, i.e., one fluid cell, and its entries are the coefficients of all the pressure unknowns in that equation, almost all of which are zero except (possibly) for the seven entries corresponding to $p_{i,j,k}$ and its six neighbors, $p_{i\pm 1,j,k}$, $p_{i,j\pm 1,k}$ and $p_{i,j,k\pm 1}$. With this said, not only is $A$ sparse, it is also symmetric; that is, for example, $A_{(i,j,k),(i+1,j,k)}$, the coefficient of $p_{i+1,j,k}$ in the equation for grid cell $(i,j,k)$, has to be equal to $A_{(i+1,j,k),(i,j,k)}$ - either it is zero if one of those two cells is not fluid, or it is the same non-zero value. Ergo, we just keep a vector of data for each fluid cell.

The $A$ matrix is a very well-known type of matrix, referred to as the seven-point Laplacian matrix in three dimensions. It is a prototypical sparse matrix and has been meticulously studied, so we will not delve very deep into the vast body of work that exists on solving these systems. We implemented a Jacobi, a Gauss-Seidel and a PCG solver on the CPU (a more thorough overview of which can be found in Bridson’s book[5]).
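
For illustration, a matrix-free Jacobi sweep over this seven-point Laplacian is sketched below, assuming (unrealistically) that every cell is fluid and that pressure is zero outside the grid; real code must distinguish fluid, air and solid cells, and the Gauss-Seidel and PCG variants use the same stencil.

```cpp
#include <cstddef>
#include <vector>

// One matrix-free Jacobi sweep for the 7-point Laplacian system A p = b on an
// nx*ny*nz grid. scale corresponds to dt / (rho * dtau^2); the caller swaps
// p and pNew between sweeps.
void jacobiSweep(const std::vector<float>& p, std::vector<float>& pNew,
                 const std::vector<float>& b, int nx, int ny, int nz, float scale) {
    auto idx = [=](int i, int j, int k) {
        return static_cast<std::size_t>(i + nx * (j + ny * k));
    };
    auto at = [&](int i, int j, int k) -> float {
        if (i < 0 || j < 0 || k < 0 || i >= nx || j >= ny || k >= nz) return 0.0f;
        return p[idx(i, j, k)];
    };
    for (int k = 0; k < nz; ++k)
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx; ++i) {
                // Sum of the six neighbouring pressures (off-diagonal terms of A).
                const float nbr = at(i - 1, j, k) + at(i + 1, j, k)
                                + at(i, j - 1, k) + at(i, j + 1, k)
                                + at(i, j, k - 1) + at(i, j, k + 1);
                pNew[idx(i, j, k)] = (b[idx(i, j, k)] + scale * nbr) / (6.0f * scale);
            }
}
```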

3.1.8 Grid-to-Particle Transfer

To finalize a simulation step, the particles need to be updated with the newly computed velocities stored in the MAC grid. This is managed by applying a trilinear interpolation for every velocity component of the eight neighbouring cells of the particle. Essentially, this means each particle will get a weighted velocity depending on its position in a grid cell.

However, since the final velocities are a combination of PIC and FLIP, we first compute the change in velocity and store it and, following this, interpolate the new velocity field and this change based on a blending factor.
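
A minimal sketch of this blend for one particle; the blending factor alpha and the interpolated grid velocities (before and after the pressure solve) are assumed inputs, and the names are ours.

```cpp
struct Vec3 { float x, y, z; };

// PIC/FLIP blend for one particle: uGridNew and uGridOld are the grid
// velocities trilinearly interpolated at the particle's position after and
// before the pressure solve; alpha = 1 gives pure FLIP, alpha = 0 pure PIC.
Vec3 blendPicFlip(Vec3 uParticle, Vec3 uGridNew, Vec3 uGridOld, float alpha) {
    const Vec3 pic  = uGridNew;                                   // PIC: take the new grid velocity
    const Vec3 flip = { uParticle.x + (uGridNew.x - uGridOld.x),  // FLIP: add the change in velocity
                        uParticle.y + (uGridNew.y - uGridOld.y),
                        uParticle.z + (uGridNew.z - uGridOld.z) };
    return { alpha * flip.x + (1.0f - alpha) * pic.x,
             alpha * flip.y + (1.0f - alpha) * pic.y,
             alpha * flip.z + (1.0f - alpha) * pic.z };
}
```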

3.2. CUDA-based Implementation

Before introducing our GPU developments, it is important to clarify that the FLIP algorithm, as a whole, was preserved. What we mean by this is that certain steps of the algorithm will not be mentioned in detail in this section as they map directly to our GPU implementation without any non-trivialities.

3.2.1 Data Representation

We first look at how data for fluid particles and grid nodes should be laid out in GPU memory. Vector fields can be stored in one of two configurations: as an Array of Structures, where each element in the array is an instance of a structure storing vector components of the fluid; or a Structure of Arrays, where each vector component over the entire domain of the fluid is first grouped together and then stored consecutively in an array.


Figure 5: Different data layouts for particles in memory.

If we were to access each component of a particle concurrently, then AoS would make sense as successive reads of each component would be continuous. However, during the advection step and when binning particles, we assign one thread per particle and not all of the different physical components are accessed at the same time. To put it differently, if we use an AoS configuration, scattered memory access patterns would occur, as each thread would fetch values for each particle, which would be displaced in memory. By adopting a SoA configuration, we can coalesce memory accesses, achieving higher memory bandwidth.

Additionally, we maintain one global array for each velocity component for all grid cells and, for parameters that remain constant throughout the simulation (e.g., gravitational acceleration, grid size and spacing, etc.), we utilize CUDA’s constant memory to store them for efficient data access.
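
A sketch of what such an SoA layout and constant-memory parameter block might look like; the field names are illustrative and error checking is omitted.

```cpp
#include <cuda_runtime.h>

// Structure-of-Arrays particle storage: each physical component lives in its
// own contiguous device array, so one-thread-per-particle kernels read
// coalesced memory.
struct ParticlesSoA {
    float* x;  float* y;  float* z;    // positions
    float* u;  float* v;  float* w;    // velocities
    int    count;
};

// Constant simulation parameters kept in CUDA constant memory for cached,
// broadcast reads from every thread.
struct SimParams { float3 gravity; int3 gridRes; float cellSize; };
__constant__ SimParams d_params;

// Host-side upload of the parameters.
void uploadParams(const SimParams& h) {
    cudaMemcpyToSymbol(d_params, &h, sizeof(SimParams));
}
```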

3.2.2 Particle Binning

For the sake of simplicity, all grid cells are the same size spatially, which means each grid cell only has to keep track of the section of the main particle list that it represents and its spatial origin.

The spatial origin of the grid cell is hashed using a 16-bit unsigned integer, referred to as the cell index, which is used for determining the region of the domain the cell is responsible for and is computed as index $=(\mathbf{u}_{P}-\texttt{grid\_min})*\texttt{cell\_size}$.

In order to compute grid offsets, we do an all-prefix-sums operation on the number of particles in each grid cell. An all-prefix-sums operation, also known as a scan, is described as an operation that "takes a binary associative operator, $\oplus$, with an identity, $I$, and an array of $n$ elements, $[a_{0},a_{1},\ldots,a_{n-1}]$, and returns the array $[I,a_{0},(a_{0}\oplus a_{1}),\ldots,(a_{0}\oplus a_{1}\oplus\ldots\oplus a_{n-2})]$"[9]. To achieve a work-efficient parallel scan, we follow the approach described in [9]. In essence, the algorithm is based on the idea of using, conceptually, a balanced binary tree on the input data and sweeping it to and from the root to compute the prefix sum.

Finally, the last step is responsible for moving all particle information from their old locations in memory to the new sorted locations, utilizing the double-buffering technique.
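
Purely as an illustration of the offsets-from-counts relationship (the paper implements the work-efficient scan of [9] by hand, not a library call), the same result can be obtained with an off-the-shelf scan such as Thrust's exclusive_scan:

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// An exclusive prefix sum over the per-cell particle counts yields, for every
// grid cell, the position in the sorted particle array where its particles
// begin. Names are illustrative.
void cellOffsetsFromCounts(const thrust::device_vector<unsigned int>& counts,
                           thrust::device_vector<unsigned int>& offsets) {
    offsets.resize(counts.size());
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
}
```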

3.2.3 Pressure Solve

Jacobi solvers are inherently parallelizable but tend to have very slow convergence rates. Meanwhile, conjugate gradient solvers are non-trivial to parallelize. For these reasons, we decided not to develop GPU implementations for either of them. We redirect curious readers to [2] for a fine-grained survey on sparse matrix solvers on the GPU.

Gauss-Seidel utilizes values from the same iteration it is computing, making it also non-trivial to parallelize as is. As it happens, however, for sparse matrices, such as $A$ in our projection step, it is fairly simple to implement a parallel solution. This applies because we can find partitions of variables that can be solved independently. More specifically, if $m_{i,j}$ is non-zero, then the variables $x_{i}$ and $x_{j}$ must not be solved for in parallel; otherwise, it is fair game.

This version of Gauss-Seidel is computed in two interleaved passes, according to the pattern shown in figure 6. The values of $i$ and $j$ are obtained with the thread


Figure 6: Gauss-Seidel red-black checkerboard pattern for a 2D grid.

and block information when calling the kernel and are fixed; then, for every iteration, we iterate over $k$, updating the solution according to the Gauss-Seidel update rule.
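
A minimal CUDA sketch of one such colored half-sweep, under the same simplifying assumption as the CPU sketch above (uniform fluid interior, zero pressure outside the grid); the actual implementation must also account for the fluid-cell classification.

```cpp
// Flattened index and boundary-checked read, usable from device code.
__device__ inline int cellIdx(int i, int j, int k, int nx, int ny) {
    return i + nx * (j + ny * k);
}
__device__ inline float pAt(const float* p, int i, int j, int k,
                            int nx, int ny, int nz) {
    if (i < 0 || j < 0 || k < 0 || i >= nx || j >= ny || k >= nz) return 0.0f;
    return p[cellIdx(i, j, k, nx, ny)];
}

// One red-black Gauss-Seidel half-sweep for the 7-point Laplacian. Threads are
// laid out over (i, j) and loop over k, as in the text; color is 0 for the
// "red" pass and 1 for "black". scale corresponds to dt / (rho * dtau^2).
__global__ void gaussSeidelColorSweep(float* p, const float* b,
                                      int nx, int ny, int nz,
                                      float scale, int color) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;
    for (int k = 0; k < nz; ++k) {
        if (((i + j + k) & 1) != color) continue;   // only update cells of this color
        const float nbr = pAt(p, i - 1, j, k, nx, ny, nz) + pAt(p, i + 1, j, k, nx, ny, nz)
                        + pAt(p, i, j - 1, k, nx, ny, nz) + pAt(p, i, j + 1, k, nx, ny, nz)
                        + pAt(p, i, j, k - 1, nx, ny, nz) + pAt(p, i, j, k + 1, nx, ny, nz);
        p[cellIdx(i, j, k, nx, ny)] =
            (b[cellIdx(i, j, k, nx, ny)] + scale * nbr) / (6.0f * scale);
    }
}
```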

One downside to this approach is that, since matrix values are stored in global memory, it suffers from global memory latency. However, since we store the data for all individual slices of $z$ in a single array, this array’s size surpasses that of the allowed shared memory for each thread block, so using shared memory was not an option.

3.3. Surface Reconstruction

We achieve this in two distinct steps: creating a scalar field out of the simulation data and obtaining vertex data to extract a mesh.

3.3.1 Scalar Field

For this step, we follow an approach similar to what Zhu and Bridson propose in their paper on improved blobbies[16]. For every grid corner, we calculate the average position and radius of the nearby particles and determine whether that corner is located inside the resulting sphere. This approach is very inefficient, especially with higher resolution grids, at $O(m\cdot n)$, where $n$ is the number of particles and $m$ the number of grid corners. Using a structure similar to the space hash described in section 3.1.5, we can narrow our formula to only the particles we compute a priori to be around the grid corner. What we end up with are negative distances for corners that are close to many particles, and positive distances for those which are not. In effect, this forms a signed distance function with the following definition:

$$\phi(\mathbf{x})=\left\|\mathbf{x}-\sum_{i}w_{i}\mathbf{x}_{i}\right\|-\sum_{i}w_{i}r_{i}$$

where $r_{i}$ is the influence radius of a particle. The points are weighted as:

$$w_{i}=\frac{k\left(\left\|\mathbf{x}-\mathbf{x}_{i}\right\|/R\right)}{\sum_{j}k\left(\left\|\mathbf{x}-\mathbf{x}_{j}\right\|/R\right)}$$

where $R$ is the search radius of nearby particles to the grid corner and $k$ is the kernel function

$$k(s)=\max\left(0,\left(1-s^{2}\right)^{3}\right)$$

which is a function that decays as the distance between particles and the grid corner grows, being exactly zero at a distance equal to the search radius.
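
A sketch of evaluating this signed distance at one grid corner, assuming the nearby particles have already been gathered through the space hash; names and the fallback value when no particle is nearby are our assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Kernel k(s) = max(0, (1 - s^2)^3), which is exactly zero at the search radius.
inline float kfun(float s) {
    const float t = 1.0f - s * s;
    return t > 0.0f ? t * t * t : 0.0f;
}

// Signed distance at grid corner x following the weighted-average formula
// above; positions/radii hold the nearby particles and R is the search radius.
float signedDistance(Vec3 x, const std::vector<Vec3>& positions,
                     const std::vector<float>& radii, float R) {
    float wSum = 0.0f, rAvg = 0.0f;
    Vec3 xAvg = {0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const float dx = x.x - positions[i].x, dy = x.y - positions[i].y,
                    dz = x.z - positions[i].z;
        const float w = kfun(std::sqrt(dx * dx + dy * dy + dz * dz) / R);
        wSum += w;
        xAvg.x += w * positions[i].x;  xAvg.y += w * positions[i].y;  xAvg.z += w * positions[i].z;
        rAvg += w * radii[i];
    }
    if (wSum <= 0.0f) return R;        // no particles nearby: safely "outside"
    xAvg.x /= wSum;  xAvg.y /= wSum;  xAvg.z /= wSum;  rAvg /= wSum;
    const float ddx = x.x - xAvg.x, ddy = x.y - xAvg.y, ddz = x.z - xAvg.z;
    return std::sqrt(ddx * ddx + ddy * ddy + ddz * ddz) - rAvg;
}
```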

3.3.2 Marching Cubes

We decided to classify each corner as either below or above the isovalue, giving us 256 possible configurations of corner classifications. Two of these are trivial, where all points are inside or outside and the cube does not contribute to the isosurface. For all other configurations we need to determine where, along each cube edge, the isosurface crosses, and use these edge intersection points to create one or more triangular patches for the isosurface. Each possibility is characterised by the number of vertices that have values above or below the isosurface: for example, if one vertex is above the isosurface and an adjacent vertex is below it, then we know the isosurface cuts the edge between these two vertices, the position of the cut being linearly interpolated.
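
A sketch of that per-edge interpolation; the fallback to the midpoint when the field values coincide is our assumption.

```cpp
struct Vec3 { float x, y, z; };

// Linear interpolation of the isosurface crossing along a cube edge: p1/p2 are
// the edge endpoints and v1/v2 the scalar field sampled there; iso is the
// isovalue. Called once per edge flagged by the cube's configuration.
Vec3 interpolateEdge(float iso, Vec3 p1, Vec3 p2, float v1, float v2) {
    const float denom = v2 - v1;
    const float t = (denom != 0.0f) ? (iso - v1) / denom : 0.5f;  // midpoint if the field is flat
    return { p1.x + t * (p2.x - p1.x),
             p1.y + t * (p2.y - p1.y),
             p1.z + t * (p2.z - p1.z) };
}
```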

When polygonising a field, we can also control the resolution of the sampling grid (which does not need to match that of the simulation grid). This allows coarser or finer approximations to the isosurface to be generated, depending on the smoothness we want out of our fluid mesh.

Moreover, for smooth rendering purposes, we also compute normals for each vertex of the triangular faces. To do this, after having created the triangle patches, we average the normals of all the patches that share a vertex, weighted by the inverse of the area of said patch. This makes it so small polygons have greater weight, with the idea being that small polygons may occur in regions of higher surface curvature.
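
A sketch of this inverse-area-weighted normal averaging; it relies on the fact that the cross product of two triangle edges has length proportional to the face area, which is a standard identity rather than something stated in the text, and all names are illustrative.

```cpp
#include <cmath>
#include <initializer_list>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { int v0, v1, v2; };   // indices into the vertex array

// Per-vertex normals as the average of adjacent face normals weighted by the
// inverse of the face area: dividing the (unnormalized) face normal by its
// squared length yields a unit direction scaled by 1/(2*area).
void computeVertexNormals(const std::vector<Vec3>& verts,
                          const std::vector<Triangle>& tris,
                          std::vector<Vec3>& normals) {
    normals.assign(verts.size(), {0.0f, 0.0f, 0.0f});
    for (const Triangle& t : tris) {
        const Vec3 a = verts[t.v0], b = verts[t.v1], c = verts[t.v2];
        const Vec3 e1 = {b.x - a.x, b.y - a.y, b.z - a.z};
        const Vec3 e2 = {c.x - a.x, c.y - a.y, c.z - a.z};
        Vec3 n = {e1.y * e2.z - e1.z * e2.y,          // face normal, length = 2 * area
                  e1.z * e2.x - e1.x * e2.z,
                  e1.x * e2.y - e1.y * e2.x};
        const float len2 = n.x * n.x + n.y * n.y + n.z * n.z;
        if (len2 > 0.0f) { n.x /= len2; n.y /= len2; n.z /= len2; }  // 1/area weighting
        for (int vi : {t.v0, t.v1, t.v2}) {
            normals[vi].x += n.x; normals[vi].y += n.y; normals[vi].z += n.z;
        }
    }
    for (Vec3& n : normals) {                          // renormalize the accumulated normals
        const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
        if (len > 0.0f) { n.x /= len; n.y /= len; n.z /= len; }
    }
}
```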

4. Evaluation

The test results were captured on a machine with the following specifications: an Intel Core i7-8750H @ 2.20GHz CPU, 16GB of DDR4 RAM and an NVIDIA GeForce GTX 1050 Ti GPU.

We compare the performance of our implementations on three different, relatively simple, benchmarks, with varying particle counts and grid sizes. Typical total speedups between our CPU (running the PCG solver) and GPU implementations can be seen in table 1. Every benchmark was run with the specified parameters for 50 full steps. Overall, on the GPU, we can achieve simulation times that are an order of magnitude better than the corresponding CPU simulation, averaging a speedup of around 20x.

Table 1: Average time per simulation step. All times are in milliseconds, averaged after 50 steps were taken.

| Benchmark | Particle # | Grid Res | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|---|
| Dam-Break | 100k | $32^{3}$ | 451 | 18 | 25x |
| | 500k | $64^{3}$ | 2065 | 83 | 24x |
| | 1M | $128^{3}$ | 4247 | 204 | 20x |
| Double Dam-Break | 100k | $32^{3}$ | 630 | 27 | 23x |
| | 500k | $64^{3}$ | 3462 | 154 | 22x |
| | 1M | $128^{3}$ | 18114 | 918 | 19x |
| Water Drop | 100k | $32^{3}$ | 323 | 16 | 20x |
| | 500k | $64^{3}$ | 2960 | 171 | 17x |
| | 1M | $128^{3}$ | 16588 | 1056 | 15x |

It should be noted that the increasing simulation times across benchmarks are very likely due to the different distribution of particles in space between them: whereas the dam-break example is fairly muted in visual spectacle, the double dam-break and water drop examples result in more splashes and scatters. This, in turn, increases the ratio of "active" grid cells (i.e., grid cells that contain fluid) at a more rapid rate during the initial stages of the simulation, resulting in slower particle binning phases in both approaches.

It is also clear that there is no linearity in simulation times among the samples gathered in table 1. This can be correlated with the distributions we chose for each of them, i.e., in order to increase the particle count between the dam-break samples, we not only increased the initial particle density per grid cell, but also the measurements (height, width and depth) of the initial domain volumes. This can result in mismatched behavior per sample of each benchmark with, for example, new behavior emerging out of these differences.

Perhaps the most interesting result observed from table 1 is that the speedup decreases with a combination of particle count and grid size. We know we are not allocating enough particles to saturate the GPU, so this speaks to the scalability of our solution. In theory, for smaller particle counts, time spent initializing threads and calling kernels should amount to a sizeable chunk of each timestep, which would become less relevant at higher particle counts and, as such, not impact our speedup as much. This, however, is not our case, as we, again, in general, are met with decreasing speedups.

In order to explore this, we can take a look at figures 7, 8 and 9.


Figure 7: Time for each stage of the simulation in a simulation step for a varying number of particles in a $32\times32\times32$ grid.


Figure 8: Time for each stage of the simulation in a simulation step for a varying number of particles in a $64\times64\times64$ grid.
For figures 7 and 8, the simulation was run on the GPU for two standard grid sizes of $32^{3}$ and $64^{3}$, respectively. Straightaway, what jumps out is that the pressure solve step takes around the same amount of time, independently of particle count, for the same grid size. This adds up, given that the size of matrix $A$ is invariable in these cases.

Additionally, for larger grids (e.g., $64^{3}$), the pressure step is by far the most expensive for all ranges of particles, whereas for smaller grids (e.g., $32^{3}$), the pressure step dominates for fewer than 800k particles, but drops below particle binning and the particle-to-grid transfer (accumulated) for counts larger than that. With this in mind, it is not unreasonable to conclude that the reason behind this is that most of the data for the pressure step is stored in global memory, which can result in a large number of memory access conflicts, especially considering the relevant values for each grid cell are in adjacent neighbours which, in turn, neighbour multiple cells. On the other hand, most of the computations for the particle binning step are stored in shared memory and we only access global memory to store the final results, usually in a coalesced manner, which corroborates the values shown.

Another point of interest is the scaling of the particle binning step. Note that, between figures 7 and 8, it seemingly only scales with the particle count and is completely independent of grid size. One might expect to find a variance due to the grid cell size and the distance the particle values have to be moved in memory during this step, but this does not seem to be the case. The same happens for the advection step, but this is expected as the grid has no bearing on it.


Figure 9: Time for each stage of the simulation in a simulation step for varying grid side sizes with 500k particles.

With figure 9, we can take a look at the next parameter of interest, the grid resolution. We ran the simulation with 500k particles for varying resolutions, from $16^{3}$ to $128^{3}$. The pressure step is the primary stage of interest here and, as expected, scales roughly linearly (note that the x-axis is on a log scale) with the number of grid cells. It is also feasible to assume that for grid resolutions that don’t saturate the GPU, the pressure step should be bounded by a smaller scaling factor.

The more subtle curve comes from the grid-to-particle interpolation step, which scales with both parameters, particle count and grid resolution, as can be seen from
the above figures combined. The scaling is steeper with grid resolution, which is plausible as particles are sorted and memory access to each of the particles’ positions can be coalesced, whereas in order to get face sample values we have to access sporadic memory positions for each grid cell.

In light of these findings, we can understand why the aggregated speedup sees decreasing totals when both parameters are adjusted. We see the pressure step consuming ever more time with higher grid resolutions, which, coupled with an increase in particles, results in more time-intensive particle binning and grid-to-particle interpolation steps as well. From this, it is clear that our solution doesn’t scale particularly well. This can be attributed to a lack of use of shared memory, which is only actively employed during binning and, consequently, results in relatively slow memory access patterns. In theory, this could be offset by the use of texture memory to store the diagonal of matrix $A$, as it is only read from, but this is not something we investigated. It is also possible that further tweaks could have been made to kernel parameters, in order to ensure each thread was performing enough work to warrant the cost of its creation.

Before moving on, there is one more thing we want to touch upon by looking at figure 10.


Figure 10: Total system energy for sets of 25k particles running on the CPU and on the GPU.

This graph shows the variance in system energy for simulations running on the CPU and GPU for 200 full timesteps. We mentioned that our advection scheme, a simple forward Euler, is unconditionally unstable and that can be demonstrated by the downward trajectory in the curve, where the total energy of the system dissipates with each iteration. The samples in figure 10 were taken from the dam-break benchmark, which, as previously indicated, is relatively simple and culminates in a state that allows for less numerical dissipation overall. For more aggressive benchmarks, this scheme can cause the simulation to eventually explode due to the instability introduced.

The gap between the values gathered from the two implementations, conversely, can be attributed not to the algorithm itself, but to the use of lower precision floating point variables in the GPU implementation, with no care taken to avoid loss of precision when accumulating intermediate results. A more exact result could be established by performing the calculations at higher precision or by structuring the algorithm in a fashion that accommodates this difference, but neither option was explored.

4.1. Visual Results

Table 2: Average time per MC run. All times in milliseconds.

| Benchmark | Particle # | Grid Res | Image | MC Time |
|---|---|---|---|---|
| Water Drop | 300k | $104^{3}$ | a) | 348 |
| | | | b) | 341 |
| | | | c) | 382 |
| | | | d) | 391 |

Table 2 shows the run times for MC on two of the benchmarks - the double dam-break and the water drop - with static particle counts and grid resolution (remember that the grid used for meshing can be of a different resolution than the one used during the simulation), for the images in figure 11.

The times in table 2 relate to simulations with a number of particles an order of magnitude smaller than what we would ideally want to run. Additionally, running with grid resolutions much lower than what we used results in an overall loss of detail, as mentioned in sec ??. As the times in this table indicate, this algorithm is not suited for online rendering, i.e., rasterizing and observing the mesh while the simulation is running in real-time, when running large-scale (that is, with many millions of particles) simulations, since for comparably smaller particle counts we get dangerously close to one second per frame, which is not acceptable.

4.2. Discussion

All in all, our study emphasizes the idea that fluid simulations are highly parallelizable applications. After identifying the steps of the algorithm that were not trivially convertible to CUDA, such as particle binning (which is indispensable for an efficient particle-to-grid transfer) and the projection step, we nevertheless landed on fairly similar approaches between the two implementations, with a high speedup gained from moving to the GPU.

As discussed in the previous section, there are some limitations to our work which could be addressed in future studies. Firstly, the most harmful of these is the lack of use of shared memory or of other more efficient memory options at our disposal (e.g., textures). We see sharp declines in speedup (and, conversely, increases in simulation times) due in large part to the steps where we make the most use of global memory (such as the pressure solve and the grid-to-particle transfer). We believe that approaches that pay closer attention to thread workload and global memory bandwidth could achieve much higher speedups, including for higher particle counts and grid resolutions.

Figure 11: Fluid meshes (a)-(d) for the water drop benchmark, extracted with Marching Cubes and rendered using PBRTv3.

Secondly, although red-black Gauss-Seidel is a good midpoint for parallel solvers, we believe that other solvers, such as a matrix-free Conjugate Gradient method (as seen in [14]), would benefit our implementation, with the potential not only to avoid storing the system matrix in global memory during the projection step, but also to converge, in theory, in fewer iterations.
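To make the matrix-free idea concrete: the heart of such a Conjugate Gradient solver is a kernel that applies the 7-point Laplacian directly from cell flags, so the matrix A is never assembled or stored. The sketch below is a simplified illustration under our own naming and boundary assumptions (non-fluid neighbours treated as solid walls), not the approach of [14] verbatim.

```cuda
// Hedged sketch of a matrix-free y = A x for the pressure Laplacian, as would
// be called inside each Conjugate Gradient iteration. Boundary handling is
// deliberately simplified.
__global__ void applyLaplacian(float* y,
                               const float* __restrict__ x,
                               const unsigned char* __restrict__ isFluid,
                               int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i > N - 2 || j > N - 2 || k > N - 2) return;

    int idx = (k * N + j) * N + i;
    if (!isFluid[idx]) { y[idx] = 0.0f; return; }

    // Regenerate the stencil coefficients on the fly from the cell flags
    // instead of reading them from a stored sparse matrix.
    float diag = 0.0f, off = 0.0f;
    const int nb[6] = { idx - 1, idx + 1, idx - N, idx + N, idx - N * N, idx + N * N };
    for (int n = 0; n < 6; ++n) {
        if (isFluid[nb[n]]) { diag += 1.0f; off += x[nb[n]]; }
    }
    y[idx] = diag * x[idx] - off;
}
```

The remaining CG operations (dot products and axpy-style updates) reduce to standard vector kernels or library calls.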

Finally, another thing of note is that our CPU implementation was not written with performance as its primary goal. Rather, we wanted to get a better grasp of the algorithm and did not concern ourselves with optimization, which may, in part, have skewed the empirical comparisons against our GPU implementation.

5. Conclusions

This study was conducted with the goal of gaining a better understanding of physically-based simulations, both computationally and visually, and of how they are impacted by the growing trend of GPGPU. We wanted to come away with concrete conclusions regarding the performance impact of a multi-core approach, and we feel our CUDA implementation does just that, allowing for better real-time performance in simulations with higher particle counts and finer-grained grid resolutions.

To investigate this problem, we started off by devising an implementation of the FLIP algorithm on the CPU and exploring the core steps of said algorithm, in order to build a solid foundation and understand exactly what the algorithm is composed of and how to extract the desired results efficiently. Following this, we adapted the algorithm to an implementation in CUDA, keeping the algorithm steps as simple as possible.

While the end results were what we expected, we learned that a certain amount of care has to be taken when implementing the FLIP algorithm on the GPU. Over the course of this project we analyzed the methods we could use to carry out this implementation, with the main issues we identified being particle binning and the pressure solve step. The conclusions we came to were the following: either we use sparse data structures to continually keep track of only the grid cells that contain fluid, or we need a way to properly sort particles in order to ensure that accumulating values onto the grid can be done in a structured manner; and not all solvers can be trivially implemented on a parallel architecture, so we need to choose between better convergence rates and shorter development time.

In the end, we cannot claim that our work brings novel contributions to the current landscape of fluid simulations, as most of what we worked with has been explored, often in greater detail, in other studies. Still, we achieved what we set out to do, which was to improve our understanding of this specific field and those adjacent to it, and to showcase that fluid simulations are indeed readily convertible to multi-core architectures.

References

[1] J. F. Blinn. A generalization of algebraic surface drawing. In ACM Transactions on Graphics, pages 235-256. ACM, 1982.

[2] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. SIGGRAPH ’03, pages 917-924, New York, NY, USA, 2003. Association for Computing Machinery.

[3] J. U. Brackbill and H. M. Ruppel. FLIP: A method for adaptively zoned, particle-in-cell calculations of fluid flows in two dimensions. In Journal of Computational Physics, pages 314-343. Academic Press Professional, Inc., 1986.

[4] C. Braley and A. Sandu. Fluid simulation for computer graphics: A tutorial in grid based and particle based methods. 2009.

[5] R. Bridson. Fluid Simulation for Computer Graphics, Second Edition. Taylor & Francis, 2015.

[6] T. Brochu. Toward better surface tracking for fluid simulation. In IEEE Computer Graphics and Applications, pages 74-81, 2015.

[7] M. Evans and F. Harlow. The particle-in-cell method for hydrodynamic calculations. Technical report, Los Alamos Scientific Lab., N. Mex., 1957.

[8] F. H. Harlow and J. E. Welch. Numerical calculation of time-dependent viscous incompressible flow of fluid with free surface. In The Physics of Fluids, pages 2182-2189, 1965.

[9] M. Harris, S. Sengupta, and J. Owens. Parallel prefix sum (scan) with CUDA. In GPU Gems 3, chapter 39, pages 851-876. Addison-Wesley, 2007.

[10] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In SIGGRAPH Comput. Graph., pages 163-169. ACM, 1987.

[11] T. Newman and H. Yi. A survey of the marching cubes algorithm. Computers & Graphics, 30:854-879, 2006.

[12] J. Stam. Stable fluids. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 121-128, 1999.

[13] J. Tan and X. Yang. Physically-based fluid animation: A survey. In Science in China Series F: Information Sciences, pages 723-740, 2009.

[14] K. Wu, N. Truong, C. Yuksel, and R. Hoetzlein. Fast fluid simulations with sparse volumes on the gpu. In Computer Graphics Forum, pages 157-167, 2018.

[15] H. Zhao, S. Osher, and R. Fedkiw. Fast surface reconstruction using the level set method. In Proceedings IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 194-201, 2001.

[16] Y. Zhu and R. Bridson. Animating sand as a fluid. In ACM Transactions on Graphics, pages 965-972. ACM New York, 2005.

[17] P. Abreu, R. A. Fonseca, J. M. Pereira, and L. O. Silva. PIC codes in new processors: A full relativistic PIC code in CUDA-enabled hardware with direct visualization. In IEEE Transactions on Plasma Science, vol. 39, no. 2, pages 675-685, 2011.
