javascript - Cannot screen scrape site


I am trying to screen scrape the contents of the following page (which lists medical specialties) and the contents of the links contained in that page (all the universities that offer training in a given specialty). Here is the main page for the specialties: https://services.aamc.org/eras/erasstats/par/index.cfm

The first link on the page is for the specialty Adolescent Medicine (Pediatrics), and its URL is https://services.aamc.org/eras/erasstats/par/display8.cfm?nav_row=par&spec_cd=321

When I try to navigate directly to the above URL, the server redirects me to the main page. However, clicking on the Adolescent Medicine link takes me to the page I want to access.

I am confused as to why clicking on the link takes me to the correct page, but navigating to the same URL does not. Here are my thoughts:

1) JavaScript issues - I disabled JS and still keep getting redirected to the main page. Moreover, clicking the link in the table still takes me to the correct page.

2) Cookies - I disabled cookies but still ran into the same problem when trying to access the Adolescent Medicine page using the direct URL. Interestingly, navigating using the table links still works with cookies disabled.

3) HTTPS - I am not sure how to check whether this is causing the problem.

I have spent quite a lot of time looking at the developer console in Chrome and have not been able to find the reason why I cannot access the specialty pages using the direct URL. Any help or guidance would be appreciated!

The key is making sure the appropriate headers are set. The server apparently requires a valid user agent and the correct referrer to be set. Once I set the headers as shown below, it worked!

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch',
    'accept-language': 'en-us,en;q=0.8',
    'connection': 'keep-alive',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    # the referrer must point at the main specialties page
    'referer': 'https://services.aamc.org/eras/erasstats/par/index.cfm'
}
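For anyone who wants to try this, here is a minimal sketch of how those headers might be attached to a request, using only Python's standard library. The function names build_request and fetch_specialty are made up for illustration, and I am assuming that User-Agent and Referer are the only two headers the server actually checks. Accept-Encoding is deliberately left out because urllib does not decompress gzip responses by itself.

```python
import urllib.parse
import urllib.request

# URLs taken from the question.
INDEX_URL = "https://services.aamc.org/eras/erasstats/par/index.cfm"
DETAIL_URL = "https://services.aamc.org/eras/erasstats/par/display8.cfm"

# A browser-like user agent string (assumed to be acceptable to the server).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)


def build_request(spec_cd):
    """Build a request for one specialty page (hypothetical helper)."""
    query = urllib.parse.urlencode({"nav_row": "par", "spec_cd": spec_cd})
    headers = {
        "User-Agent": USER_AGENT,
        # Pretend we arrived by clicking a link on the index page.
        "Referer": INDEX_URL,
    }
    return urllib.request.Request(DETAIL_URL + "?" + query, headers=headers)


def fetch_specialty(spec_cd, timeout=30):
    """Download the specialty page and return its HTML as text."""
    with urllib.request.urlopen(build_request(spec_cd), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Example (performs a real network request):
#   html = fetch_specialty(321)  # 321 = Adolescent Medicine (Pediatrics)
```

If the server still redirects to the main page, the other headers from the dictionary above can be added to the headers dict in build_request in the same way.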

