1. Anuncie Aqui ! Entre em contato fdantas@4each.com.br

[Python] Stuck on scraping with BeautifulSoup while learning. Need some pointers

Discussão em 'Python' iniciado por Stack, Setembro 28, 2024 às 05:52.

  1. Stack

    Stack Membro Participativo

    I started learning screen scraping using BeautifulSoup. To get started I took a wikipedia article in the following format

    <table class="wikitable sortable jquery-tablesorter">
    <caption></caption>
    <thead>
    <tr>
    <th colspan="2" style="width: 6%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Opening</th>
    <th style="width: 20%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Title</th>
    <th style="width: 10%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Director</th>
    <th style="width: 45%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Cast</th>
    <th style="width: 30%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Production company</th>
    <th class="unsortable" style="width: 1%;"><abbr title="Reference(s)">Ref.</abbr></th>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td rowspan="3" style="text-align: center; background: #77bc83;">
    <b>
    O<br />
    C<br />
    T
    </b>
    </td>
    <td rowspan="1" style="text-align: center; background: #77bc83;"><b>11</b></td>
    <td style="text-align: center;">
    <i><a href="/wiki/Viswam_(film)" title="Viswam (film)">Viswam</a></i>
    </td>
    <td>Sreenu Vaitla</td>
    <td>
    <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
    <div class="hlist">
    <ul>
    <li><a href="/wiki/Gopichand_(actor)" title="Gopichand (actor)">Gopichand</a></li>
    <li><a href="/wiki/Kavya_Thapar" title="Kavya Thapar">Kavya Thapar</a></li>
    <li><a href="/wiki/Vennela_Kishore" title="Vennela Kishore">Vennela Kishore</a></li>
    <li><a href="/wiki/Sunil" title="Sunil">Sunil</a></li>
    <li><a href="/wiki/Naresh" title="Naresh">Naresh</a></li>
    </ul>
    </div>
    </td>
    <td>
    Chitralayam Studios<br />
    People Media Factory
    </td>
    <td style="text-align: center;">
    <sup id="cite_ref-180" class="reference">
    <a href="#cite_note-180"><span class="cite-bracket">[</span>178<span class="cite-bracket">]</span></a>
    </sup>
    </td>
    </tr>
    <tr>
    <td rowspan="2" style="text-align: center; background: #77bc83;"><b>31</b></td>
    <td style="text-align: center;">
    <i><a href="/wiki/Lucky_Baskhar" title="Lucky Baskhar">Lucky Baskhar</a></i>
    </td>
    <td><a href="/wiki/Venky_Atluri" title="Venky Atluri">Venky Atluri</a></td>
    <td>
    <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
    <div class="hlist">
    <ul>
    <li><a href="/wiki/Dulquer_Salmaan" title="Dulquer Salmaan">Dulquer Salmaan</a></li>
    <li><a href="/wiki/Meenakshi_Chaudhary" title="Meenakshi Chaudhary">Meenakshi Chaudhary</a></li>
    </ul>
    </div>
    </td>
    <td><a href="/wiki/S._Radha_Krishna" title="S. Radha Krishna">Sithara Entertainments</a></td>
    <td style="text-align: center;">
    <sup id="cite_ref-181" class="reference">
    <a href="#cite_note-181"><span class="cite-bracket">[</span>179<span class="cite-bracket">]</span></a>
    </sup>
    </td>
    </tr>
    <tr>
    <td style="text-align: center;">
    <i><a href="/wiki/Mechanic_Rocky" title="Mechanic Rocky">Mechanic Rocky</a></i>
    </td>
    <td>Ravi Teja Mullapudi</td>
    <td>
    <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
    <div class="hlist">
    <ul>
    <li><a href="/wiki/Vishwak_Sen" title="Vishwak Sen">Vishwak Sen</a></li>
    <li><a href="/wiki/Meenakshi_Chaudhary" title="Meenakshi Chaudhary">Meenakshi Chaudhary</a></li>
    </ul>
    </div>
    </td>
    <td>SRT Entertainments</td>
    <td style="text-align: center;">
    <sup id="cite_ref-182" class="reference">
    <a href="#cite_note-182"><span class="cite-bracket">[</span>180<span class="cite-bracket">]</span></a>
    </sup>
    </td>
    </tr>
    <tr>
    <td style="text-align: center; background: #77ea83;">
    <b>
    N<br />
    O<br />
    V
    </b>
    </td>
    <td style="text-align: center; background: #77ea83;"><b>9</b></td>
    <td style="text-align: center;">
    <i><a href="/wiki/Telusu_Kada" title="Telusu Kada">Telusu Kada</a></i>
    </td>
    <td><a href="/wiki/Neeraja_Kona" title="Neeraja Kona">Neeraja Kona</a></td>
    <td>
    <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
    <div class="hlist">
    <ul>
    <li><a href="/wiki/Siddhu_Jonnalagadda" title="Siddhu Jonnalagadda">Siddhu Jonnalagadda</a></li>
    <li><a href="/wiki/Raashii_Khanna" title="Raashii Khanna">Raashii Khanna</a></li>
    <li><a href="/wiki/Srinidhi_Shetty" title="Srinidhi Shetty">Srinidhi Shetty</a></li>
    </ul>
    </div>
    </td>
    <td>People Media Factory</td>
    <td style="text-align: center;">
    <sup id="cite_ref-183" class="reference">
    <a href="#cite_note-183"><span class="cite-bracket">[</span>181<span class="cite-bracket">]</span></a>
    </sup>
    </td>
    </tr>
    <tr>
    <td rowspan="2" style="text-align: center; background: #f4ca16; textcolor: #000;">
    <b>
    D<br />
    E<br />
    C
    </b>
    </td>
    <td rowspan="1" style="text-align: center; background: #f8de7e;"><b>6</b></td>
    <td style="text-align: center;">
    <i><a href="/wiki/Pushpa_2:_The_Rule" title="Pushpa 2: The Rule">Pushpa 2: The Rule</a></i>
    </td>
    <td><a href="/wiki/Sukumar" title="Sukumar">Sukumar</a></td>
    <td>
    <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
    <div class="hlist">
    <ul>
    <li><a href="/wiki/Allu_Arjun" title="Allu Arjun">Allu Arjun</a></li>
    <li><a href="/wiki/Fahadh_Faasil" title="Fahadh Faasil">Fahadh Faasil</a></li>
    <li><a href="/wiki/Rashmika_Mandanna" title="Rashmika Mandanna">Rashmika Mandanna</a></li>
    </ul>
    </div>
    </td>
    <td><a href="/wiki/Mythri_Movie_Makers" title="Mythri Movie Makers">Mythri Movie Makers</a></td>
    <td style="text-align: center;">
    <sup id="cite_ref-184" class="reference">
    <a href="#cite_note-184"><span class="cite-bracket">[</span>182<span class="cite-bracket">]</span></a>
    </sup>
    </td>
    </tr>
    <tr>
    <td rowspan="1" style="text-align: center; background: #f8de7e;"><b>20</b></td>
    <td style="text-align: center;"><i>Robinhood</i></td>
    <td><a href="/wiki/Venky_Kudumula" title="Venky Kudumula">Venky Kudumula</a></td>
    <td>
    <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
    <div class="hlist">
    <ul>
    <li><a href="/wiki/Nithiin" title="Nithiin">Nithiin</a></li>
    <li><a href="/wiki/Sreeleela" title="Sreeleela">Sreeleela</a></li>
    </ul>
    </div>
    </td>
    <td><a href="/wiki/Mythri_Movie_Makers" title="Mythri Movie Makers">Mythri Movie Makers</a></td>
    <td style="text-align: center;">
    <sup id="cite_ref-185" class="reference">
    <a href="#cite_note-185"><span class="cite-bracket">[</span>183<span class="cite-bracket">]</span></a>
    </sup>
    </td>
    </tr>
    </tbody>
    <tfoot></tfoot>
    </table>


    This is my python script that I wrote:

    soup = BeautifulSoup(html_page, "html.parser")

    tables = soup.find_all("table",{"class":"wikitable sortable"})
    headers = ['month','day','movie','director','cast','producer','reference']
    movie_tables = []
    total_movies = 0
    for table in tables:
    caption = table.find("caption")
    if not caption or not caption.get_text().strip():
    movie_tables.append(table)

    #captions = soup.find_all("caption")

    max_columns = len(headers)

    # List to store dictionaries
    data_dict_list = []

    movies= []
    for movie_table in movie_tables:
    table_rows = movie_table.find("tbody").find_all("tr")[1:]
    for table_row in table_rows:
    total_movies += 1
    columns = table_row.find_all('td')
    row_data = [col.get_text(strip=True) for col in columns]
    # If the row has fewer columns than the max, pad it with None
    if len(row_data) == 6:
    row_data.insert(0, None)
    elif len(row_data) == 5:
    row_data.insert(0, None)
    row_data.insert(1, None)
    for col in columns:
    li_tags = col.find_all('li')
    if li_tags:
    cast=""
    for li in li_tags:
    li_values = li.get_text(strip=True)
    cast = ', '.join(li_values)

    row_data.append(cast)
    else:
    row_data.append(col.get_text())
    # Create a dictionary mapping headers to row data
    row_dict = dict(zip(headers, row_data))

    # Append the dictionary to the list
    data_dict_list.append(row_dict)

    # Print the list of dictionaries
    for row_dict in data_dict_list:
    print(row_dict)


    This is the output I am getting (Just showing a few items here):

    {'month': 'OCT', 'day': '11', 'movie': 'Viswam', 'director': 'Sreenu Vaitla', 'cast': 'GopichandKavya ThaparVennela KishoreSunilNaresh', 'producer': 'Chitralayam StudiosPeople Media Factory', 'reference': '[178]'}

    {'month': None, 'day': '31', 'movie': 'Lucky Baskhar', 'director': 'Venky Atluri', 'cast': 'Dulquer SalmaanMeenakshi Chaudhary', 'producer': 'Sithara Entertainments', 'reference': '[179]'}

    {'month': None, 'day': None, 'movie': 'Mechanic Rocky', 'director': 'Ravi Teja Mullapudi', 'cast': 'Vishwak SenMeenakshi Chaudhary', 'producer': 'SRT Entertainments', 'reference': '[180]'}

    {'month': 'NOV', 'day': '9', 'movie': 'Telusu Kada', 'director': 'Neeraja Kona', 'cast': 'Siddhu JonnalagaddaRaashii KhannaSrinidhi Shetty', 'producer': 'People Media Factory', 'reference': '[181]'}

    {'month': 'DEC', 'day': '6', 'movie': 'Pushpa 2: The Rule', 'director': 'Sukumar', 'cast': 'Allu ArjunFahadh FaasilRashmika Mandanna', 'producer': 'Mythri Movie Makers', 'reference': '[182]'}

    {'month': None, 'day': '20', 'movie': 'Robinhood', 'director': 'Venky Kudumula', 'cast': 'NithiinSreeleela', 'producer': 'Mythri Movie Makers', 'reference': '[183]'}


    This is what I am trying to get(Just showing the last item here):

    {'month': 'DEC', 'day': '20', 'movie': 'Robinhood', 'director': 'Venky Kudumula', 'cast': 'Nithiin|Sreeleela', 'producer': 'Mythri Movie Makers', 'reference': '[183]'}


    I've been trying to debug this for the last day or so but I cannot figure out where I went wrong.

    I am expecting:

    1. the month, day filled up in all the items when those columns span multiple rows and are not represented in all rows.
    2. Next, I want to have a separator between the different cast members so that it will be easy for me to create a graph later.
    3. Also, how do I extract the hyperlinks and store in a separate key in the dictionary while doing all of this?

    Continue reading...

Compartilhe esta Página