python - Convert HTML table to dictionary without losing structure
I'm new to Python (and programming), and I'm using BeautifulSoup for the first time.
I'm trying to find the best way to parse the contents of a table in HTML and convert it to a dictionary, ideally in the least brittle way.
Here is an example of the HTML I'm trying to parse (I've put "key" and "value" with numbers in place of the text I'm trying to pick up).
<div class="tablename"> <table border="0" cellpadding="0" cellspacing="0" style="border: 1px solid #dddddd; border-collapse: collapse; font-family: arial, helvetica, sans-serif; font-size: 14px; margin: 0; padding: 0; width: 100%"> <thead> <tr> <th colspan="4" style="background-color: #000; border: 1px solid #616161; color: #ffffff; font-size: 14px; font-weight: bold; line-height: 20px; padding: 14px 20px 12px 20px; text-align: left">some text not needed</th> </tr> </thead> <tbody> <tr> <td style="width: 20px"> </td> <td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; width: 42.5%; vertical-align: middle">key 1</td> <td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">value 1</td> <td style="width: 20px"> </td> </tr> <tr> <td> </td> <td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">key 2</td> <td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">value 2</td> <td> </td> </tr> <tr> <td> </td> <td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">key 3</td> <td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">value 3</td> <td> </td> </tr> <tr>
And here is the code I'm using:
import requests
from bs4 import BeautifulSoup

# Download the page and parse it with the parser from the standard library.
html = requests.get('https://examplewebaddress.com')
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.tbody.text)
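I could loop over the soup.tbody.text string and split it into key/value pairs, but that doesn't seem like the right way: I seem to be losing the structure of the table by converting it to a string and then building it up again as a dictionary.
Is there a more direct way to parse the table with BeautifulSoup (or something more suitable) into a dictionary I can use?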
The idea is to iterate over the table rows and, for each row, extract the text of the second and third cells, which represent the key and the value of the future dictionary:
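soup = BeautifulSoup(html.text, 'html.parser')
# For every row after the header, take the text of the second and third cells as a (key, value) pair.
result = dict(
    [item.get_text(strip=True) for item in row.find_all('td')[1:3]]
    for row in soup.select("div.tablename table tr")[1:]
)
print(result)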
For the provided sample data, this prints (under Python 2, hence the u prefixes):
{u'key 1': u'value 1', u'key 2': u'value 2', u'key 3': u'value 3'}
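The same idea can also be written as an explicit loop, which makes it easier to skip rows that do not contain a key/value pair. The following is a minimal sketch for Python 3, not the only way to do it: the function name table_to_dict and the trimmed-down SAMPLE_HTML string are assumptions made for illustration, standing in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the table in the question.
SAMPLE_HTML = """
<div class="tablename">
  <table>
    <thead><tr><th colspan="4">some text not needed</th></tr></thead>
    <tbody>
      <tr><td> </td><td>key 1</td><td>value 1</td><td> </td></tr>
      <tr><td> </td><td>key 2</td><td>value 2</td><td> </td></tr>
      <tr><td> </td><td>key 3</td><td>value 3</td><td> </td></tr>
      <tr></tr>
    </tbody>
  </table>
</div>
"""


def table_to_dict(html):
    """Map the text of each row's second cell to the text of its third cell."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.select("div.tablename table tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            # Header rows (th cells only) and empty rows carry no key/value pair.
            continue
        key = cells[1].get_text(strip=True)
        value = cells[2].get_text(strip=True)
        result[key] = value
    return result


result = table_to_dict(SAMPLE_HTML)
print(result)
```

Checking the number of cells instead of slicing off the first row means the header and any empty or malformed rows are ignored rather than raising an error, which may suit the "least brittle" goal in the question.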
div.tablename table tr is a CSS selector that matches the tr elements under a table element that sits inside a div with class="tablename". Slicing the result of select ([1:]) skips the first row, which is the header.
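To see what the selector returns and why the first row is sliced off, here is a small sketch. The HTML string is a made-up miniature of the question's markup, and selecting through tbody is shown only as a possible alternative to slicing, assuming the page keeps its header row inside thead.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page: one table inside div.tablename, one outside it.
HTML = """
<div class="tablename">
  <table>
    <thead><tr><th>header</th></tr></thead>
    <tbody>
      <tr><td>key 1</td><td>value 1</td></tr>
      <tr><td>key 2</td><td>value 2</td></tr>
    </tbody>
  </table>
</div>
<table><tr><td>unrelated</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Every tr inside a table inside div.tablename, header row included.
all_rows = soup.select("div.tablename table tr")

# Only the rows inside tbody, so no slicing is needed.
body_rows = soup.select("div.tablename table tbody tr")

print(len(all_rows))
print(len(body_rows))
```

The row of the second table is not matched by either selector, because that table is not inside the div with class="tablename".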