
[SQL] Backend Architecture: How to scale with larger SQL datasets when offering paginated...

Discussion in 'Outras Linguagens' started by Stack, October 15, 2024 at 12:23.

  1. Stack

    Let's imagine a website that offers a table view of some entity, e.g. vehicles.

    DB: In my database there is a table vehicles with 500k entries. Each user has permissions on a subset of those vehicles, let's say 100k.

    The frontend is a website with a simple table view, i.e. some columns that are also present in SQL, e.g. license plate, mileage, number of wheels, previous owner, current owner, etc.
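
    For reference, a minimal sketch of what such a schema might look like. All table and column names below are assumptions for illustration (the direct vs. group permission structure is described further down), not the actual schema:

    -- Hypothetical schema sketch; names are assumptions, not the real schema.
    CREATE TABLE vehicles (
        id             uuid PRIMARY KEY,
        license_plate  text NOT NULL,
        mileage        integer NOT NULL,
        num_wheels     integer NOT NULL,
        previous_owner text,
        current_owner  text
    );

    -- Permissions can be granted directly per vehicle...
    CREATE TABLE direct_permissions (
        user_id    uuid NOT NULL,
        vehicle_id uuid NOT NULL REFERENCES vehicles (id),
        PRIMARY KEY (user_id, vehicle_id)
    );

    -- ...or via groups of vehicles.
    CREATE TABLE vehicle_groups (
        id uuid PRIMARY KEY
    );

    CREATE TABLE vehicle_group_members (
        group_id   uuid NOT NULL REFERENCES vehicle_groups (id),
        vehicle_id uuid NOT NULL REFERENCES vehicles (id),
        PRIMARY KEY (group_id, vehicle_id)
    );

    CREATE TABLE group_permissions (
        user_id  uuid NOT NULL,
        group_id uuid NOT NULL REFERENCES vehicle_groups (id),
        PRIMARY KEY (user_id, group_id)
    );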

    I obviously need to offer pagination for the frontend, so queries work with LIMIT and OFFSET. However, I now have the problem that I also want to offer custom ordering (+ searching), which means the user can order the data ascending/descending by any of the columns he sees.
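
    In other words, each page request boils down to something like the following (the sort column, direction, and offset are placeholders, and permissions are ignored for the moment):

    -- One page of a user-sorted listing, 100 rows per page (placeholder values).
    SELECT *
    FROM vehicles
    ORDER BY mileage DESC
    LIMIT 100 OFFSET 300;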

    Question: What are options to build/query this in such a way that query/request times don't scale up with data size?

    I now have a (PostgreSQL) SQL query that does the following (a simplified sketch of the query shape follows the list):


    1. Joins several tables for permission-related information, resulting in a list of all vehicle UUIDs the user may see. Since permissions can be granted in different ways (directly and via groups of vehicles), a DISTINCT clause is needed somewhere so that the response does not contain the same vehicle more than once when the user has both direct and group access. This is the first aspect that seems to increase computation time as data volume grows.


    2. Joins these results with the actual vehicle data in a CTE (Common Table Expression) so I can get a total count of all vehicles the user has access to, which the frontend needs in order to display how many pages are available. It's done in a CTE at the moment so I can perform the count and the further processing in one query. This is the second factor that appears to take longer as the data grows.


    3. Returns 100 results with an offset of X from the CTE, ordered by the column the user wants to sort by. This is the third operation that seems to scale with data size. The same would apply to any searches using LIKE expressions.
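
    For reference, a simplified sketch of that query shape, reusing the hypothetical table names from the schema sketch above. It is only an illustration of the described structure (the real query may compute the count differently, e.g. in a separate CTE rather than via a window function):

    -- Sketch of the described query; table/column names and parameters are assumptions.
    WITH accessible AS (
        -- Step 1: all vehicle UUIDs the user may see, deduplicated because a
        -- vehicle can be reachable both directly and via a group.
        SELECT DISTINCT vehicle_id
        FROM (
            SELECT vehicle_id
            FROM direct_permissions
            WHERE user_id = $1
            UNION ALL
            SELECT m.vehicle_id
            FROM group_permissions gp
            JOIN vehicle_group_members m ON m.group_id = gp.group_id
            WHERE gp.user_id = $1
        ) AS p
    ),
    rows_with_count AS (
        -- Step 2: join with the actual data and carry the total count along,
        -- so the frontend can compute the number of pages.
        SELECT veh.*, count(*) OVER () AS total_count
        FROM accessible a
        JOIN vehicles veh ON veh.id = a.vehicle_id
    )
    -- Step 3: one page, ordered by the column the user chose.
    SELECT *
    FROM rows_with_count
    ORDER BY mileage DESC   -- user-selected sort column and direction
    LIMIT 100 OFFSET $2;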

    These are very common features, so how should this be approached from a database/backend design perspective? I currently have response times of around 80 ms, but I'm expecting the data volume to increase 100 times in a year and would like to implement this correctly.

    Considerations:

    • I know that subqueries are somewhat faster than CTEs (see the sketch after this list), but that doesn't address the main issue, right?
    • Materialized views or any other form of "caching" would obviously require significant storage since it would create different subsets for each user and, theoretically, numerous possibilities for ordering those subsets.
    • Upgrading to a larger server can help, but it may not be sufficient as both data size and query times continue to increase. Also, my understanding is that PostgreSQL does not distribute the load of a single query across more than one core, so theoretically, only a more powerful core would help, rather than simply having more cores.
    • Any attempt to scale by distributing the same dataset across multiple machines (sharding) would face the same challenge: at some point, you need the entire dataset to order and return the limited results (100 offset X), correct?
    • Any indexing on the overall table wouldn't help since it's always a subset.
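
    On the first consideration above: a small sketch, assuming PostgreSQL 12 or newer and the hypothetical tables from earlier, of marking a CTE NOT MATERIALIZED so the planner is free to inline it just like a subquery. Whether that changes the plan depends on the query, and it does not address the underlying scaling question:

    -- Since PostgreSQL 12 the planner may inline CTEs; NOT MATERIALIZED makes
    -- that explicit, so the CTE-vs-subquery distinction largely disappears.
    WITH accessible AS NOT MATERIALIZED (
        SELECT DISTINCT vehicle_id
        FROM direct_permissions
        WHERE user_id = $1
    )
    SELECT veh.*
    FROM accessible a
    JOIN vehicles veh ON veh.id = a.vehicle_id
    ORDER BY veh.mileage DESC
    LIMIT 100 OFFSET $2;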

    As mentioned above, query/request times are currently around 80 ms, and it seems this will only increase as my data grows. As a beginner, could someone point me in the right direction here?

