python - Convert html table to dictionary without losing structure

June 15, 2014

I'm new to Python (and programming), and I'm using BeautifulSoup for the first time.

I'm trying to find the best way to parse the contents of a table in HTML and convert it to a dictionary, ideally in the least brittle way.

Here is an example of the HTML I'm trying to parse (I've put key and value numbers in the text I'm trying to pick up).

<div class="tablename">
  <table border="0" cellpadding="0" cellspacing="0" style="border: 1px solid #dddddd;  border-collapse: collapse; font-family: arial, helvetica, sans-serif; font-size: 14px; margin: 0; padding: 0; width: 100%">
    <thead>
      <tr>
        <th colspan="4" style="background-color: #000; border: 1px solid #616161; color: #ffffff; font-size: 14px; font-weight: bold; line-height: 20px; padding: 14px 20px 12px 20px; text-align: left">some text not needed</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td style="width: 20px"> </td>
        <td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; width: 42.5%; vertical-align: middle">key 1</td>
        <td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">value 1</td>
        <td style="width: 20px"> </td>
      </tr>
      <tr>
        <td> </td>
        <td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">key 2</td>
        <td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">value 2</td>
        <td> </td>
      </tr>
      <tr>
        <td> </td>
        <td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">key 3</td>
        <td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">value 3</td>
        <td> </td>
      </tr>
      <tr>

And the code I'm using:

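    import requests
    from bs4 import BeautifulSoup

    # Download the page and parse it with the standard-library HTML parser
    html = requests.get('https://examplewebaddress.com')
    soup = BeautifulSoup(html.text, 'html.parser')

    # Prints the text of the first <tbody> as one flat string
    print(soup.tbody.text)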

I could loop over the soup.tbody.text string and split it into key/value pairs. That doesn't seem like the right way, though: I seem to be losing the structure of the table by converting it to a string and then building it up again into a dictionary.

Is there a more direct way to parse the table with BeautifulSoup (or something more suitable) into a dictionary I can use?

The idea is to iterate over the table rows and, for each row, extract the text of the second and third cells, which represent the key and the value of the future dictionary:

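    soup = BeautifulSoup(html.text, 'html.parser')

    # Each row contributes one [key, value] pair: the text of its 2nd and 3rd cells.
    # The [1:] slice drops the header row.
    result = dict([[item.get_text(strip=True) for item in row.find_all('td')[1:3]]
                   for row in soup.select("div.tablename table tr")[1:]])

    print(result)

For the provided sample data, it prints: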

{u'key 1': u'value 1', u'key 2': u'value 2', u'key 3': u'value 3'}

div.tablename table tr is a CSS selector that matches all tr elements under a table element located inside a div with class="tablename". Slicing the result of select ([1:]) skips the first (header) row.
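
Since the question asks for the least brittle approach, here is a minimal, self-contained sketch of the same idea written as an explicit loop. It is only an illustration under a few assumptions: the function name table_to_dict and the trimmed SAMPLE_HTML string are made up for this example, the inline styles are omitted, and the page is assumed to keep its data rows inside a tbody element, as in the sample. Rows that do not have at least three cells are skipped instead of raising an error.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (inline styles omitted).
SAMPLE_HTML = """
<div class="tablename">
  <table>
    <thead>
      <tr><th colspan="4">some text not needed</th></tr>
    </thead>
    <tbody>
      <tr><td> </td><td>key 1</td><td>value 1</td><td> </td></tr>
      <tr><td> </td><td>key 2</td><td>value 2</td><td> </td></tr>
      <tr><td> </td><td>key 3</td><td>value 3</td><td> </td></tr>
    </tbody>
  </table>
</div>
"""


def table_to_dict(html):
    """Map the second cell of each body row to the third cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # Selecting only rows inside tbody leaves out the header row without slicing.
    for row in soup.select("div.tablename table tbody tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # Spacer or malformed row: no key/value pair to read.
        key = cells[1].get_text(strip=True)
        value = cells[2].get_text(strip=True)
        if key:  # Ignore rows whose key cell is empty.
            result[key] = value
    return result


result = table_to_dict(SAMPLE_HTML)
print(result)
```

Under Python 3 this prints {'key 1': 'value 1', 'key 2': 'value 2', 'key 3': 'value 3'}. If the real page has no tbody element, the selector would match nothing, and the original div.tablename table tr selector with the [1:] slice would be the safer choice.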
