1. Anuncie Aqui ! Entre em contato fdantas@4each.com.br

[Python] Can I speed up querying my >600Mio lines SQLite datebase?

Discussão em 'Python' iniciado por Stack, Outubro 1, 2024 às 06:23.

  1. Stack

    Stack Membro Participativo

    I created a local db that has about 600Mio entries that I want to efficiently query (I first tried pandas, but that's way too slow). This is a local dbSNP that only contains rsIDs and genomic positions for those interested. For this I used this code:

    import sqlite3
    import gzip
    import csv

    rsid_db = sqlite3.connect('rsid.db')

    rsid_cursor = rsid_db.cursor()

    rsid_cursor.execute(
    """
    CREATE TABLE rsids (
    rsid TEXT,
    chrom TEXT,
    pos INTEGER,
    ref TEXT,
    alt TEXT
    )
    """
    )

    with gzip.open('00-All.vcf.gz', 'rt') as vcf: # from https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
    reader = csv.reader(vcf, delimiter="\t")
    i = 0
    for row in reader:
    if not ''.join(row).startswith('#'):
    rsid_cursor.execute(
    f"""
    INSERT INTO rsids (rsid, chrom, pos, ref, alt)
    VALUES ('{row[2]}', '{row[0]}', '{row[1]}', '{row[3]}', '{row[4]}');
    """
    )
    i += 1
    if i % 1000000 == 0:
    print(f'{i} entries written')
    rsid_db.commit()

    rsid_db.commit()

    rsid_db.close()


    All I want to do with this db is query multiple rsIDs and get their genomic position. One entry looks like this:

    rsid chrom pos ref alt
    rs537152180 1 4002401 G A,C


    I currently query it using this function:

    import sqlite3
    import pandas as pd


    def query_rsid(rsid_list,
    rsid_db_path='rsid.db'):

    with sqlite3.connect(rsid_db_path) as rsid_db:

    rsid_cursor = rsid_db.cursor()
    rsid_cursor.execute(
    f"""
    SELECT * FROM rsids
    WHERE rsid IN ('{"', '".join(rsid_list)}');
    """
    )

    query = rsid_cursor.fetchall()

    return query


    It takes about 1.5 minutes to query this no matter how many entries. Is there a way to speed this up using SQLite or is this simply limited by my machine?

    Continue reading...

Compartilhe esta Página