How do I find the comment tag <!-- ... --> with BeautifulSoup?

Question

Asked May 19, 2011
12435 views
2 answers
Solved

I tried soup.find('!--'), but it doesn't seem to work. Thanks in advance.

Edit: Thanks for the tip on how to find all comments. I have a follow-up question: how can I search for one specific comment?

For example, I have the following comment tag:

<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

I really only want this part: <i>Wednesday 110518</i>. The "110518" is the date in YYMMDD format, which I want to use as the search target. However, I don't know how to find something inside a specific comment tag.

Asked May 19, 2011 by 1stsage

Answer 1

2 Answers

Answer 2

24 votes

yan (20278 points)

You can find all comments in a document with the findAll method. The "Removing elements" example shows how to do exactly what you are trying to do.

In short, you want this:

from bs4 import Comment  # BeautifulSoup 3: from BeautifulSoup import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
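For readers who want something to run end to end, here is a minimal sketch of the same idea. It assumes BeautifulSoup 4 (the bs4 package) and its find_all(string=...) spelling rather than the older findAll(text=...) shown above. The sample HTML is invented for illustration. It also shows why soup.find('!--') returns nothing: comments are string nodes, not tags.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample document, not taken from the original question
html = """
<html><body>
<!-- first comment -->
<p>Visible text</p>
<!-- second comment -->
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() looks up tags by name, and a comment is not a tag
print(soup.find("!--"))

# Comments are a special kind of string node, so filter strings by type
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
for comment in comments:
    print(comment.strip())
```

Each item in comments behaves like an ordinary string, so the usual string methods and the re module can be applied to it directly.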

Edit: If you are trying to search within the comments, you can try:

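import re
from bs4 import Comment

comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    # re.search scans the whole comment; re.match would only match at its start
    match = re.search(r'<i>([^<]*)</i>', comment)
    if match:
        print(match.group(1))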
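As an alternative to a regular expression, the comment body can itself be parsed as HTML, which makes the tags inside it searchable. The sketch below assumes BeautifulSoup 4. The helper name find_comment_by_date is hypothetical, and the sample markup is modeled on the date format described in the question.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup modeled on the comment described in the question
html = """
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""

def find_comment_by_date(markup, date_code):
    """Return the <i> text of the first comment whose <i> tag contains date_code."""
    soup = BeautifulSoup(markup, "html.parser")
    for comment in soup.find_all(string=lambda node: isinstance(node, Comment)):
        # Re-parse the comment body so the markup inside it becomes searchable
        inner = BeautifulSoup(str(comment), "html.parser")
        italic = inner.find("i")
        if italic is not None and date_code in italic.get_text():
            return italic.get_text()
    return None

print(find_comment_by_date(html, "110518"))
```

Re-parsing costs a little more than a regex, but it tolerates attribute order and whitespace changes inside the comment.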

Answered May 19, 2011 by yan (20278 points)

Answer 3

0 votes

PaulMcG (59178 points)

With pyparsing, you can search for HTML comments using the built-in htmlComment expression, and attach parse-time callbacks to validate and extract the various data fields within the comment:

from pyparsing import makeHTMLTags, oneOf, withAttribute, Word, nums, Group, htmlComment
import calendar

# have pyparsing define tag start/end expressions for the 
# tags we want to look for inside the comments
span,spanEnd = makeHTMLTags("span")
i,iEnd = makeHTMLTags("i")

# only want spans with class=titlefont
span.addParseAction(withAttribute(**{'class':'titlefont'}))

# define what specifically we are looking for in this comment
weekdayname = oneOf(list(calendar.day_name))
integer = Word(nums)
dateExpr = Group(weekdayname("day") + integer("daynum"))
commentBody = '<!--' + span + i + dateExpr("date") + iEnd

# define a parse action to attach to the standard htmlComment expression,
# to extract only what we want (or raise a ParseException in case 
# this is not one of the comments we're looking for)
def grabCommentContents(tokens):
    return commentBody.parseString(tokens[0])
htmlComment.addParseAction(grabCommentContents)

# let's try it
htmlsource = """
want to match this one
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

don't want the next one, wrong span class
<!-- <span class="bodyfont"> <i>Wednesday 110519</i>(05:00PM)<br /></span> -->

not even a span tag!
<!-- some other text with a date in italics <i>Wednesday 110520</i>(05:00PM)<br /></span> -->

another matching comment, on a different day
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""

for comment in htmlComment.searchString(htmlsource):
    parsedDate = comment.date
    # date info can be accessed like elements in a list
    print(parsedDate[0], parsedDate[1])
    # because we named the expressions within the dateExpr Group
    # we can also get at them by name (this is much more robust, and
    # easier to maintain/update later)
    print(parsedDate.day)
    print(parsedDate.daynum)
    print()

Prints:

Wednesday 110518
Wednesday
110518

Thursday 110521
Thursday
110521

Answered May 19, 2011 by PaulMcG (59178 points)
