python - Find and display links to specified URLs using regex


I'm trying to extract links to particular sites. I wrote the following after sifting through this site for hours, but it does not work well for me.

    match = re.compile('<a href="(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)(youtu|www.youtube|youtube|vimeo|dailymotion|)\.(.+?)"',re.DOTALL).findall(html)
    for title in match:
        print '<a href="'+title+'>'+title+'</a>'

The method above gives this error:

    print '<a href="'+title+'>'+title+'</a>'
    TypeError: cannot concatenate 'str' and 'tuple' objects

And if I put "print title" instead, I get the following ugly result:

('https://www.', 'youtube', 'com/watch?v=gm2sgfjvgjm') 

All the links being scraped look like this:

<a href="https://www.youtube.com/watch?v=gm2sgfjvgjm" 

I'm hoping to have it print the following:

<a href="https://www.youtube.com/watch?v=gm2sgfjvgjm">youtube</a>
<a href="http://www.dailymotion.com/video/x5zuvuu">dailymotion</a>

I've been playing with Python for a while, but I still struggle a lot. FYI, I've spent endless hours trying to figure out Beautiful Soup and I just don't get it. I'd appreciate help with this without changing the method totally, if possible.

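Your pattern seems okay. The problem is the capturing groups inside it: when a pattern contains more than one group, findall returns a tuple of the groups for each match instead of a single string. Make the inner groups non-capturing with ?: and capture the whole expression together in one group.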

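    # Raw strings keep the backslashes intact for the regex engine.
    p = re.compile(r'<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'\
                   r'(?:youtu|www.youtube|youtube|vimeo|dailymotion|)'\
                   r'\.(?:.+?))"',re.DOTALL)
    match = p.findall(html)
    for title in match:
        # title is a plain string here; note the closing quote before '>'.
        print('<a href="' + title + '">' + title + '</a>')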
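To see why the original pattern produced tuples, here is a small sketch of how re.findall behaves with several capturing groups versus a single one. The sample string, the shortened patterns, and the variable names are made up for illustration and are not the patterns from the question.

```python
import re

# Hypothetical sample input, shaped like the scraped link in the question.
html = '<a href="https://www.youtube.com/watch?v=gm2sgfjvgjm"'

# Several capturing groups: findall returns one tuple per match.
many_groups = re.findall(r'(https?://www\.)(youtube)\.([^"]+)', html)

# Non-capturing groups wrapped in a single capturing group:
# findall returns one plain string per match.
one_group = re.findall(r'((?:https?://www\.)(?:youtube)\.(?:[^"]+))', html)

print(many_groups)
print(one_group)
```

The first call yields a list of 3-tuples, which is what caused the TypeError when concatenating with a string; the second yields a list of strings.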

To retain both the link and the domain name, a small change is needed: capture the whole expression and the website name as two separate groups (the former contains the latter):

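    # Outer group 0 is the full URL, inner group 1 is the site name.
    p = re.compile(r'<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'\
                   r'(youtu|www.youtube|youtube|vimeo|dailymotion|)'\
                   r'\.(?:.+?))"',re.DOTALL)

    match = p.findall(html)
    for title in match:
        # title is a (url, site) tuple; note the closing quote before '>'.
        print('<a href="' + title[0] + '">' + title[1] + '</a>')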

Access the groups with title[i].
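Putting the two-group idea together, a minimal runnable sketch in Python 3 might look like the following. It assumes a simplified pattern (optional "www.", a fixed list of site names, and [^"]+ instead of .+? with DOTALL), which is not the exact pattern from the answer; LINK_RE, format_links, and the sample page are hypothetical names and data used only for illustration.

```python
import re

# Assumed simplification of the pattern above: scheme, optional "www.",
# one of the site names, then the rest of the URL up to the closing quote.
# Group 1 is the whole URL, group 2 is the site name nested inside it.
LINK_RE = re.compile(
    r'<a href="(https?://(?:www\.)?(youtube|youtu|vimeo|dailymotion)\.[^"]+)"'
)

def format_links(html):
    """Return one '<a href="URL">site</a>' string per matching link."""
    return ['<a href="{0}">{1}</a>'.format(url, site)
            for url, site in LINK_RE.findall(html)]

# Hypothetical sample page for illustration only.
sample = (
    '<a href="https://www.youtube.com/watch?v=gm2sgfjvgjm" class="x">a</a>\n'
    '<a href="http://www.dailymotion.com/video/x5zuvuu">b</a>\n'
    '<a href="http://example.com/page">c</a>\n'
)

for line in format_links(sample):
    print(line)
```

Unpacking each match as url, site in the loop does the same job as title[0] and title[1], and the link to example.com is skipped because its domain is not in the list.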

