[Python] Partitioning Large R-MAT Graph Datasets From Binary File

Stack · October 7, 2024 at 12:52

I am currently using a Python script that uses the number of non-zeros (nnz) per row to find the ideal points at which to split a sparse graph (taken from this answer):

import numpy as np
from scipy import sparse


def myhorsplit(
    matrix: sparse.sparray, n_compute_units: int = 4,
) -> list[sparse.sparray]:
    # Cumulative number of non-zeros up to and including each row
    nnz = matrix.getnnz(axis=1).cumsum()
    total = nnz[-1]
    # Cumulative nnz at which each partition would ideally start
    ideal_breaks = np.arange(0, total, total/n_compute_units)
    # First row whose cumulative nnz reaches each ideal break;
    # the trailing None closes the last slice
    break_idx = [*nnz.searchsorted(ideal_breaks), None]
    return [
        matrix[i: j, :]
        for i, j in zip(break_idx[:-1], break_idx[1:])
    ]


def main() -> None:
    rand = np.random.default_rng(seed=0)
    # Create an 8x8 adjacency matrix with the modified element
    adjacency_matrix = [
        (1, 1, 1, 1, 0, 0, 0, 0),
        (1, 0, 1, 0, 0, 0, 0, 0),
        (1, 1, 0, 1, 0, 0, 0, 0),
        (1, 0, 1, 0, 0, 0, 0, 0),
        (0, 0, 1, 0, 0, 1, 0, 1),
        (0, 0, 0, 0, 1, 0, 0, 0),
        (0, 0, 0, 0, 1, 1, 0, 1),
        (0, 0, 1, 0, 1, 0, 1, 0),
    ]
    # Uncomment to use the 8x8 example instead of the random matrix
    # csr_matrix = sparse.csr_array(adjacency_matrix)
    csr_matrix = sparse.csr_array(
        rand.integers(low=0, high=2, size=(10_000, 50), dtype=np.uint8)
    )

    partitions = myhorsplit(csr_matrix)

    for i, partition in enumerate(partitions):
        print(f"Partition {i}: {partition.nnz} ones, shape {partition.shape}")
        # print(partition.toarray())


if __name__ == "__main__":
    main()

With the 8x8 example matrix from the code, splitting into 4 partitions gives:

Partition 0: 4 ones, shape (1, 8)
[[1 1 1 1 0 0 0 0]]
Partition 1: 5 ones, shape (2, 8)
[[1 0 1 0 0 0 0 0]
 [1 1 0 1 0 0 0 0]]
Partition 2: 6 ones, shape (3, 8)
[[1 0 1 0 0 0 0 0]
 [0 0 1 0 0 1 0 1]
 [0 0 0 0 1 0 0 0]]
Partition 3: 6 ones, shape (2, 8)
[[0 0 0 0 1 1 0 1]
 [0 0 1 0 1 0 1 0]]
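
To make the break rule explicit, here is a small sketch that reproduces these four partitions from the per-row counts alone, without building a sparse matrix. The variable names are only illustrative; the arithmetic is the same cumsum/searchsorted step used in myhorsplit.

```python
import numpy as np

# Ones per row of the 8x8 example matrix
row_nnz = np.array([4, 2, 3, 2, 3, 1, 3, 3])

cum = row_nnz.cumsum()                      # [ 4  6  9 11 14 15 18 21]
ideal = np.arange(0, cum[-1], cum[-1] / 4)  # [ 0.    5.25 10.5  15.75]
starts = cum.searchsorted(ideal)            # first row of each partition
sizes = np.add.reduceat(row_nnz, starts)    # ones that land in each partition

print(starts.tolist())  # [0, 1, 3, 6]
print(sizes.tolist())   # [4, 5, 6, 6]
```

The point is that only the per-row counts are needed to decide where the partitions start.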

I am now trying to split an R-MAT graph with scale=29 and edge factor=16. The binary file is around 120 GB, and it has to be read and converted to a sparse array. The Python script hangs and is then terminated with an OOM Killed message.
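
As a rough sanity check on why this runs out of memory, the sizes can be estimated from the Graph500 parameters (2^scale vertices, edge factor times that many edges). The 16 bytes per edge below is an assumption (two 64-bit integers per edge); the actual record size depends on how the generator was built, and the helper name is hypothetical.

```python
def rmat_footprint(scale, edge_factor, bytes_per_edge=16):
    """Return (vertices, edges, bytes) for a Graph500-style edge list.

    bytes_per_edge=16 assumes each edge is stored as two int64 values.
    """
    n_vertices = 1 << scale
    n_edges = edge_factor * n_vertices
    return n_vertices, n_edges, n_edges * bytes_per_edge


n_vertices, n_edges, n_bytes = rmat_footprint(29, 16)
print(f"{n_vertices:,} vertices, {n_edges:,} edges, {n_bytes / 2**30:.0f} GiB")
# 536,870,912 vertices, 8,589,934,592 edges, 128 GiB
```

Under that assumption the raw edge list alone is of the same order as the file size reported above, so loading it whole and then converting it to a sparse array would need more RAM than most single machines have. Dividing the real file size by the edge count is a quick way to check the assumed record size.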

Is there a way to compute this partitioning directly from the binary file (generated with graph500), by only reading the file rather than loading the whole graph?
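
One possible direction, sketched under assumptions: stream the file in chunks, count edges per source vertex, and apply the same cumsum/searchsorted rule to those counts. The sketch assumes the file is a flat sequence of little-endian int64 (src, dst) pairs, which should be verified against the generator settings. The function names and the chunk_edges parameter are hypothetical.

```python
import numpy as np


def count_out_degrees(path, n_vertices, chunk_edges=1 << 20):
    """Stream the edge file once and count edges per source vertex."""
    # Assumed layout: consecutive (src, dst) pairs of little-endian int64.
    edges = np.memmap(path, dtype="<i8", mode="r").reshape(-1, 2)
    degrees = np.zeros(n_vertices, dtype=np.int64)
    for start in range(0, edges.shape[0], chunk_edges):
        src = np.asarray(edges[start:start + chunk_edges, 0])
        vertices, counts = np.unique(src, return_counts=True)
        # vertices holds no repeats, so the fancy-indexed += is safe
        degrees[vertices] += counts
    return degrees


def row_breaks(degrees, n_compute_units=4):
    """Apply the myhorsplit break rule to per-row edge counts."""
    nnz = degrees.cumsum()
    total = nnz[-1]
    ideal_breaks = np.arange(0, total, total / n_compute_units)
    return nnz.searchsorted(ideal_breaks)
```

Only the degree array stays in memory (one int64 per vertex), not the edges. Note that this counts every edge record, so duplicate edges in the file are counted separately, which can differ from the nnz of a matrix built from the same file.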

One option I thought of is a function that orders all edges by their source node and then splits that ordering. Would that still keep the partition regions correct? How could I approach this problem?
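
A full sort by source node may not be necessary. Since each partition is a contiguous range of rows, it could be enough to bucket every edge by the range its source falls into. Below is a minimal sketch of that idea, assuming the first row of each partition is already known (for example from per-source edge counts) and that the file holds little-endian int64 (src, dst) pairs. The function name, the output file naming, and chunk_edges are hypothetical.

```python
import numpy as np


def scatter_edges(path, breaks, out_prefix, chunk_edges=1 << 20):
    """Append each edge to the file of the partition that owns its source row.

    breaks[k] is the first row of partition k (sorted, breaks[0] == 0).
    """
    breaks = np.asarray(breaks)
    # Assumed layout: consecutive (src, dst) pairs of little-endian int64.
    edges = np.memmap(path, dtype="<i8", mode="r").reshape(-1, 2)
    out_paths = [f"{out_prefix}.part{k}" for k in range(len(breaks))]
    outputs = [open(p, "wb") for p in out_paths]
    try:
        for start in range(0, edges.shape[0], chunk_edges):
            chunk = np.asarray(edges[start:start + chunk_edges])
            # Index of the last break that is <= src gives the partition
            part = np.searchsorted(breaks, chunk[:, 0], side="right") - 1
            for k, out in enumerate(outputs):
                chunk[part == k].tofile(out)
    finally:
        for out in outputs:
            out.close()
    return out_paths
```

Each output file then holds only the edges of one row range and can be loaded on its own, subtracting the partition's first row from the source IDs before building that block's sparse array. Edges inside a file keep their original order, so the regions stay correct even though nothing is sorted.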
