Is it over-scraping scholar, or filter strings being misapplied? Which is it, and in any case, how should exceptions be handled?
Created by: russelljjarvis
It seems now that no results return from a query.
It generates an error, but I think the error occurs because no results came through (converting a NaN).
But scraping still took time? If so, then the scraper may be functioning correctly, but the words are being thrown out in post-processing.
I suggest putting the lines:

```python
import streamlit as st
st.text(corpus)
```

at https://github.com/russelljjarvis/ScienceAccess/blob/master/science_access/t_analysis.py#L150 to see what words are going into t_analysis.
Do it again at https://github.com/russelljjarvis/ScienceAccess/blob/master/science_access/t_analysis.py#L217:

```python
import streamlit as st
st.text(tokens)
```

to see what words were kept after filtering in the analysis. You might find that one of these datatypes is empty.
If 'privacy policy' is in the keywords, then nothing useful was ever scraped. In that case, you can gracefully fall back to the more robust Open Access paper search. If 'privacy policy' is in the first result, it will be in all the results, so instead of waiting 4-5 minutes for a scrape of 15 pages of privacy-policy text to return, you can find out straight away whether len(ar)==0 or 'privacy-policy' is in tokens. If either is true, the code logic should fall back at the first sighting of 'privacy policy' rather than after a five-minute wait.
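That early-exit check could be pulled into a small predicate; a minimal sketch, where `scholar_scrape_failed` is a hypothetical helper name and `ar` and `tokens` are assumed from the surrounding code:

```python
def scholar_scrape_failed(ar, tokens):
    """Return True when the scholar scrape clearly produced nothing usable:
    either no results came back at all, or the scraped text is a
    privacy-policy page (in which case every page will be one)."""
    if len(ar) == 0:
        return True
    if "privacy-policy" in tokens:
        return True
    return False
```

Calling this right after the first page is scraped lets the code bail out immediately instead of scraping all 15 pages first.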
You need to modify a function so that it can test whether the scholar scrape worked, and if it didn't, set the variable:

```python
st.text("scholar scrape failed; falling back to Open Access search")
OPEN_ACCESS = True
```

This variable needs to be changed to lower case now. At https://github.com/russelljjarvis/ScienceAccess/blob/master/science_access/online_app_backend.py#L176:

```python
(ar, trainingDats) = ar_manipulation(ar)
if len(ar) == 0:
    openaccess = True
    st.text("scholar scrape failed; falling back to Open Access search")

if openaccess:
    import requests
    from crossref_commons.iteration import iterate_publications_as_json

    # filter_ = {'type': 'journal-article'}
    queries = {'query.author': NAME}
    ar = []
    bi = [p for p in iterate_publications_as_json(max_results=100, queries=queries)]
    for p in bi[0:9]:
        res = 'https://api.unpaywall.org/v2/' + str(p['DOI']) + '?email=YOUR_EMAIL'
        response = requests.get(res).json()
        # Check for None before indexing into the response.
        if response is not None and response['is_oa']:
            st.text(response)
            print(response.keys())
            try:
                temp = response['best_oa_location']['url_for_pdf']
            except (KeyError, TypeError):
                temp = response['best_oa_location']['url']
            st.text(temp)
            if temp is not None:
                urlDat = process(temp)
                if urlDat is not None:
                    ar.append(urlDat)
    (ar, trainingDats) = ar_manipulation(ar)
```
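The try/except around `best_oa_location` could also be written with `dict.get` chains, which handle a missing location, a missing PDF link, or a null location in one expression. A sketch, where `extract_pdf_url` is a hypothetical helper and the field names follow Unpaywall's documented response shape:

```python
def extract_pdf_url(response):
    """Prefer the direct PDF link from an Unpaywall best_oa_location,
    falling back to the landing-page URL; return None if neither exists."""
    loc = response.get("best_oa_location") or {}
    return loc.get("url_for_pdf") or loc.get("url")
```

This way `temp = extract_pdf_url(response)` never raises, and the existing `if temp is not None:` guard covers the no-URL case.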
@mcgurrgurr