Closed · Created Aug 17, 2020 by Russell Jarvis (@russelljjarvis), Owner

Over-scraping Scholar, or filter strings being misapplied? Which is it, and in either case, how should the exceptions be handled?

Created by: russelljjarvis

It seems that no results are returned from a query now.

It generates an error, but I think the error occurs because no results came through (a NaN conversion). [screenshot of the error]
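
For context, this is exactly how an empty result set fails on a numeric conversion (a minimal sketch of the suspected failure mode, not the app's actual code):

    # Hypothetical reproduction: an empty result list yields NaN,
    # and converting NaN to an integer raises the error
    import math
    scores = []                                # no query results came back
    mean = sum(scores) / len(scores) if scores else math.nan
    int(mean)                                  # ValueError: cannot convert float NaN to integer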

But scraping still took time? If so, the scraper may be functioning correctly and the words are being thrown out in post-processing.

I suggest putting the lines:

import streamlit as st
st.text(corpus)

at line https://github.com/russelljjarvis/ScienceAccess/blob/master/science_access/t_analysis.py#L150 to see what words are going into t_analysis.

Do it again at https://github.com/russelljjarvis/ScienceAccess/blob/master/science_access/t_analysis.py#L217

import streamlit as st
st.text(tokens)

This shows which words remain after filtering and go into the analysis. You might find that one of these data structures is empty.
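
A minimal sketch of the kind of guard I mean (assuming corpus and tokens are the variables at those two lines of t_analysis.py):

    import streamlit as st
    # Surface emptiness early instead of letting it fail later as a NaN conversion
    if not corpus:
        st.text("corpus is empty: nothing usable came back from scraping")
    if not tokens:
        st.text("tokens is empty: filtering removed every word")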

If 'privacy policy' is in the keywords, then privacy-policy pages are all the scrape ever returned. In that case, you can gracefully fall back to the more robust Open Access paper search. Since 'privacy policy' appearing in the first result means it will be in all the results, there is no need to wait 4-5 minutes scraping 15 pages of privacy-policy text: you can find out straight away whether len(ar)==0 and whether 'privacy-policy' is in tokens. If so, the code should fall back at the first sighting of 'privacy policy' rather than after a 5-minute wait.
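
Something like this sketch is what I have in mind (looks_like_privacy_wall is a hypothetical helper; ar and tokens are assumed from the existing code):

    # Fail fast: if the first scraped result is a privacy-policy wall,
    # all 15 pages will be, so trigger the fallback immediately.
    def looks_like_privacy_wall(tokens):
        return 'privacy-policy' in tokens or 'privacy policy' in ' '.join(tokens)

    if len(ar) == 0 or looks_like_privacy_wall(tokens):
        openaccess = True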

You need to modify a function so it can test whether the Scholar scrape worked and, if it didn't, set a variable:

st.text("scholar scrape failed falling back to Open Access search")
OPEN_ACCESS=True

Note that this variable needs to be lower case (openaccess) to match the code at https://github.com/russelljjarvis/ScienceAccess/blob/master/science_access/online_app_backend.py#L176:

    (ar, trainingDats) = ar_manipulation(ar)
    openaccess = False
    if len(ar) == 0:
        openaccess = True
        st.text("scholar scrape failed, falling back to Open Access search")
    if openaccess:
        from crossref_commons.iteration import iterate_publications_as_json
        import requests

        # filter_ = {'type': 'journal-article'}
        queries = {'query.author': NAME}
        ar = []
        # Pull candidate publications for this author from Crossref
        bi = [p for p in iterate_publications_as_json(max_results=100, queries=queries)]
        for p in bi[0:9]:
            # Ask Unpaywall whether an Open Access copy of this DOI exists
            res = 'https://api.unpaywall.org/v2/' + str(p['DOI']) + '?email=YOUR_EMAIL'
            response = requests.get(res).json()
            # Check for None before indexing, otherwise the 'is_oa' lookup can raise
            if response is not None and response.get('is_oa'):
                st.text(response)
                print(response.keys())
                # Prefer a direct PDF link; fall back to the landing-page URL
                try:
                    temp = response['best_oa_location']['url_for_pdf']
                except (KeyError, TypeError):
                    temp = response['best_oa_location']['url']
                st.text(temp)
                if temp is not None:
                    urlDat = process(temp)
                    if urlDat is not None:
                        ar.append(urlDat)

        (ar, trainingDats) = ar_manipulation(ar)
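
(YOUR_EMAIL above is left as a placeholder: the Unpaywall API requires a real contact address in the email query parameter before these requests will succeed.)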
