Converting an HTML table to CSV in Python
Question
I am trying to convert an HTML table to a CSV file in Python. The table I am trying to extract is this one:
<table class="tblperiode">
<caption>Dades de període</caption>
<tr>
<th class="sortable"><span title="Període (Temps Universal)">Període</span><br/>TU</th>
<th><span title="Temperatura mitjana (°C)">TM</span><br/>°C</th>
<th><span title="Temperatura màxima (°C)">TX</span><br/>°C</th>
<th><span title="Temperatura mínima (°C)">TN</span><br/>°C</th>
<th><span title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
<th><span title="Precipitació (mm)">PPT</span><br/>mm</th>
<th><span title="Velocitat mitjana del vent (km/h)">VVM (10 m)</span><br/>km/h</th>
<th><span title="Direcció mitjana del vent (graus)">DVM (10 m)</span><br/>graus</th>
<th><span title="Ratxa màxima del vent (km/h)">VVX (10 m)</span><br/>km/h</th>
<th><span title="Irradiància solar global mitjana (W/m2)">RS</span><br/>W/m<sup>2</sup></th>
</tr>
<tr>
<th>
00:00 - 00:30
</th>
<td>16.2</td>
<td>16.5</td>
<td>15.4</td>
<td>93</td>
<td>0.0</td>
<td>6.5</td>
<td>293</td>
<td>10.4</td>
<td>0</td>
</tr>
<tr>
<th>
00:30 - 01:00
</th>
<td>16.4</td>
<td>16.5</td>
<td>16.1</td>
<td>90</td>
<td>0.0</td>
<td>5.8</td>
<td>288</td>
<td>8.6</td>
<td>0</td>
</tr>
And I want it to look something like this:
To achieve this, I parsed the HTML and managed to build a dataframe that holds the data correctly, as follows:
from bs4 import BeautifulSoup
import csv
import pandas as pd  # needed for pd.DataFrame below

html = open("table.html").read()
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.tblperiode")

output_rows = []
for table_row in table.find_all('tr'):
    # only <td> cells are collected, so the <th> header and interval cells are lost
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

df = pd.DataFrame(output_rows)
print(df)
However, I would also like to have the column names and a column indicating the time interval. In the HTML example above only two intervals appear, 00:00 - 00:30 and 00:30 - 01:00. My table should therefore have two rows: one with the observations for 00:00 - 00:30 and another with the observations for 00:30 - 01:00.
How can I get this information from my HTML?
Accepted answer
Here's one way of doing it. It's probably not the nicest way, but it works. You can read through the comments to see what the code is doing.
from bs4 import BeautifulSoup
import csv

# read the html
html = open("table.html").read()
soup = BeautifulSoup(html, 'html.parser')

# get the table from html
table = soup.select_one("table.tblperiode")

# find all rows
rows = table.find_all('tr')

# the first row holds the column headers
headers = rows[0]
header_text = []

# add the header text to array
for th in headers.find_all('th'):
    header_text.append(th.text)

# init row text array
row_text_array = []

# loop through the data rows and add row text to array
for row in rows[1:]:
    row_text = []
    # loop through the elements (<th> holds the interval, <td> the values)
    for row_element in row.find_all(['th', 'td']):
        # append the array with the element's inner text
        row_text.append(row_element.text.replace('\n', '').strip())
    # append the text array to the row text array
    row_text_array.append(row_text)

# output csv (newline="" stops the csv module from writing blank lines on Windows)
with open("out.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    # loop through each row array
    for row_text_single in row_text_array:
        wr.writerow(row_text_single)
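One detail worth noting about the header cells: with `.text`, a cell such as `<th><span>TM</span><br/>°C</th>` comes out as `TM°C`, because `<br/>` contributes no text of its own. The sketch below is a compact variant of the same row-by-row approach that joins the pieces with a space via `get_text(" ", strip=True)`. The names `SAMPLE_HTML` and `table_to_rows` are hypothetical, the sample is a shortened copy of the table from the question, and the CSV is written to an in-memory buffer rather than `out.csv` so the example is self-contained.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, shortened sample of the table from the question.
SAMPLE_HTML = """
<table class="tblperiode">
<tr>
<th class="sortable"><span title="Període (Temps Universal)">Període</span><br/>TU</th>
<th><span title="Temperatura mitjana (°C)">TM</span><br/>°C</th>
<th><span title="Precipitació (mm)">PPT</span><br/>mm</th>
</tr>
<tr>
<th>
00:00 - 00:30
</th>
<td>16.2</td>
<td>0.0</td>
</tr>
<tr>
<th>
00:30 - 01:00
</th>
<td>16.4</td>
<td>0.0</td>
</tr>
</table>
"""


def table_to_rows(html, selector="table.tblperiode"):
    """Return the selected table as a list of rows of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(selector)
    rows = []
    for tr in table.find_all("tr"):
        # Collect both <th> (headers, interval labels) and <td> (values).
        cells = tr.find_all(["th", "td"])
        # Join text fragments with a space and trim surrounding whitespace.
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


rows = table_to_rows(SAMPLE_HTML)

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="")
# to produce a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Since the header row and the data rows are handled by the same loop here, the first CSV line is the header and every following line starts with its time interval.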