In R XML XPath, @href is returning the text "href"
I am trying to get the contents of href using the XPath code described in these two posts. Unfortunately, the code is returning the actual text "href" and several spaces in addition to the URL. How can I avoid that?
library(XML)
html <- readLines("http://www.msu.edu")
html.parse <- htmlParse(html)
node <- getNodeSet(html.parse, "//div[@id='msu-top-utilities']//a/@href")
node[[1]]
# > node[[1]]
#                  href
# "students/index.html"
# attr(,"class")
# [1] "XMLAttributeValue"
It's a named character vector: the "href" shown in the output is the element's name, not part of its value. You can do:
as.character(node[[1]])
which gives
## [1] "students/index.html"
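To see why the label appears, here is a minimal base R sketch that needs no network access. The object `attr_value` is a hypothetical stand-in built by hand to mimic what `node[[1]]` looks like; it is not produced by the XML package here.

```r
# Hypothetical stand-in for node[[1]]: a named character vector that also
# carries the "XMLAttributeValue" class, constructed by hand for illustration.
attr_value <- structure("students/index.html",
                        names = "href",
                        class = "XMLAttributeValue")

# The "href" printed above the value is just the element's name.
print(names(attr_value))

# as.character() drops the name and the class, leaving only the string.
plain <- as.character(attr_value)
print(plain)
```

After the conversion, `plain` has no names, so nothing is printed above the value any more.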
Alternatively, here's a better idiom using the xml2 package:
library(xml2)
doc <- read_html("http://www.msu.edu")
nodes <- xml_find_all(doc, "//div[@id='msu-top-utilities']//a")
xml_attr(nodes, "href")
## [1] "students/index.html"      "faculty-staff/index.html" "alumni/index.html"
## [4] "businesses/index.html"    "visitors/index.html"
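Since the live page may change over time, the same idiom can be tried offline. This is a rough sketch that assumes xml2 is installed; the HTML fragment below is made up for illustration and only imitates the structure the XPath expression expects.

```r
library(xml2)

# Hypothetical page fragment standing in for the live site.
html <- '<div id="msu-top-utilities">
  <a href="students/index.html">Students</a>
  <a href="alumni/index.html">Alumni</a>
</div>'

doc <- read_html(html)

# Select the <a> elements themselves, then ask each one for its href.
nodes <- xml_find_all(doc, "//div[@id='msu-top-utilities']//a")
hrefs <- xml_attr(nodes, "href")
print(hrefs)
```

Selecting the elements and reading the attribute with `xml_attr()` returns a plain, unnamed character vector, so no extra conversion step is needed.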