Artificial Intelligence: AI & Information Literacy

AI & Information Literacy

AI & Information Literacy considerations:

  • Generative AI tools largely draw on the same datasets and are limited to the data they were trained on, so they generally do not reflect newer content or content that is behind a paywall. This means that you cannot use them to discover scholarly content that exists only in a closed-access database available through the Library. A tool cannot contain scholarship that it was not trained on!
  • Generative AI tools are designed to generate answers and will do so even if they have to hallucinate the content. This means that every source produced by AI must be evaluated and verified. A tool can provide you with citations, but the danger is that it may simply cobble them together, and it often generates false information and nonexistent resources. Evaluating and verifying AI-provided sources can take more time than you may want to give it.
  • Generative AI cannot do research: because of its generative nature, it produces new text rather than locating and evaluating existing sources. It is also not a search engine.
  • AI tools have cost considerations, and you get what you pay for: free versions are typically more limited than paid ones.
  • AI tools that let you upload articles and scholarship still require you to be able to do quality research in the first place.
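
One way to start verifying a citation supplied by an AI tool is to look up its DOI in a public registry and compare the registered title with the title the tool gave you. The sketch below is a minimal illustration, not a complete verification workflow. It assumes the citation includes a DOI and uses the public Crossref REST API; the function names (`crossref_url`, `titles_match`, `check_citation`) and the 0.9 similarity threshold are hypothetical choices made for this example.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from difflib import SequenceMatcher

# Public Crossref REST API endpoint for a single work (assumed for this sketch).
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref lookup URL for a DOI."""
    return CROSSREF_WORKS + urllib.parse.quote(doi.strip())


def titles_match(claimed: str, registered: str, threshold: float = 0.9) -> bool:
    """Compare two titles, ignoring case and extra whitespace."""
    a = " ".join(claimed.lower().split())
    b = " ".join(registered.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


def check_citation(doi: str, claimed_title: str) -> str:
    """Return 'match', 'mismatch', or 'not found' for an AI-supplied citation."""
    try:
        with urllib.request.urlopen(crossref_url(doi), timeout=10) as resp:
            record = json.load(resp)["message"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Crossref has no record of this DOI.
            return "not found"
        raise
    # Crossref returns the title as a list; it may be empty or missing.
    registered = (record.get("title") or [""])[0]
    return "match" if titles_match(claimed_title, registered) else "mismatch"
```

A "not found" or "mismatch" result is a strong hint that the citation was fabricated or garbled. A "match" only confirms that the work exists, so you still need to read and evaluate the source yourself.
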
AI Tools & AI Datasets Overview
| Model / Tool | Type | Primary Data Sources | Discipline Coverage (approx.) | Special Features |
| --- | --- | --- | --- | --- |
| GPT‑4 (OpenAI) | LLM | Web (Common Crawl, RefinedWeb), textbooks, code, Wikipedia, licensed data | Web: ~70–80%; Books: ~5–10%; Academic: ~5–10%; Code: ~5–10% | General-purpose: research Q&A, writing, coding |
| GPT‑3.5 / ChatGPT | LLM | WebText2, Books1/2, Wikipedia, Common Crawl | Web: ~82%; Books: ~16%; Wiki: ~3% | Fast general use; limited academic depth |
| Gemini (Google) | LLM (multimodal) | Web, books, textbooks, code, multimedia, real-time web | Web: ~70–80%; Books: ~5–10%; Code: ~10%; Multimedia: ~5–10% | Multimodal (text, images, audio, video) |
| Copilot (GitHub) | Code assistant | GitHub repos, Stack Overflow, public codebases | Code: ~80%; Docs/QA: ~15%; NLP: ~5% | Optimized for coding and developer needs |
| LLaMA / Falcon / BLOOM | Open-source LLM | Common Crawl, books, Wikipedia, arXiv, PubMed, The Pile | Web: ~70%; Books: ~10%; Academic: ~5%; Code: ~5% | Open-access, customizable |
| BERT | Encoder model | Wikipedia, BookCorpus | Wiki: ~60%; Books: ~40% | Foundation for text classification & summarization |
| T5 | Encoder-decoder | C4 dataset (filtered web) | Web: ~90%; Books: <10% | Text-to-text tasks (translate, summarize, etc.) |
| SciBERT / PubMedBERT | Domain LLM | PubMed + biomedical articles | Biomedical: ~100% | Strong in medicine & life sciences |
| BloombergGPT | Domain LLM | Financial docs, filings, news, web data | Finance: ~50%; Web: ~42%; News: ~7% | Specialized for financial research |
| Elicit | Research AI tool | Semantic Scholar (126M+ papers) | Academic: ~100% | Finds papers, extracts claims, Q&A |
| Research Rabbit | Discovery tool | Semantic Scholar, CrossRef metadata | Academic: ~100% | Builds citation & co-author networks |
| Jenni AI | Summarizer | Preprints, PDFs, research summaries | Academic: ~70%; Preprints/blogs: ~30% | Draft generation & summarization |
| Paperpal | Writing assistant | Publisher data, user uploads | Academic: ~100% | Proofreading, grammar, style suggestions |
| Scite | Citation analytics | Scholarly papers (citation contexts) | Academic: ~100% | Shows supportive/contrasting citations |
| SciSpace | Research assistant | Academic PDFs (STEM-heavy), user uploads | STEM academic: ~100% | PDF annotation, term definitions |
| ResearchGPT (Consensus / SciSpace) | Research GPT assistant | ~200–282M academic papers via Consensus or SciSpace | Academic: ~100% across disciplines | Chat-based Q&A, citation-backed summaries, PDF upload, image support (dailyai.com, yeschat.ai, gptr.dev, digital-science.com, morganstanley.com, consensus.app) |

Developed and modified in ChatGPT using the following prompt: "Create a professional table showing the major Large language models and generative AI Tools, especially for students and researchers, along with what datasets they pull and the percentage of material from the various disciplines included in the databases."

| Provider | Index Size | Key Sources | OA Full‑Text Access | API Available | Best For |
| --- | --- | --- | --- | --- | --- |
| Allen Institute for AI (AI2) | ~280M papers | arXiv, bioRxiv, SSRN, HAL, CrossRef, publisher feeds, U.S. gov sites | Partial (OA + some partnerships) | Yes (academic) | Scholarly search & AI‑driven insights |
| OurResearch (Unpaywall) | ~250M works | CrossRef metadata, DOAJ, institutional repos, Unpaywall, MAG legacy, ROR | Extensive OA | Yes (open API) | Metadata research, alt‑metrics, open data |
| Google Scholar | ~400M+ items | Publisher sites, repositories, personal pages, preprints, theses | Mixed (OA + paywalled links) | No official API | Broad scholarly search |
| Digital Science | ~350M publications + grants etc. | CrossRef, PubMed, arXiv, publisher content, clinical trials, patents | Mixed (OA + proprietary) | Yes (limited) | Citation analysis, grants, patents, altmetrics |
| CORE (Jisc) | ~300M documents | Institutional repositories, open-access journals via OAI-PMH | Fully Open Access | Yes (open API) | Full-text discovery across OA repositories |
| Cambia / Lens Foundation | ~200M works | CrossRef, PubMed, Patents, Preprints, institutional repos | Mixed (OA + paywalled links) | Yes (API & data dumps) | Scholarly + patent search, licensing analysis |

Developed and modified in ChatGPT using the prompt: "create a professional table showing the major datasets used in AI Tools and Web-based research and where their data/ material comes from."
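
The "OA Full‑Text Access" and "API Available" columns can be made concrete with a small lookup. The sketch below is an illustration only: it assumes the public Unpaywall REST API (version 2), which asks for a contact email as a query parameter, and the function names (`unpaywall_url`, `summarize_oa`, `lookup`) are hypothetical. It reports whether an open-access copy of a work is known for a given DOI.

```python
import json
import urllib.parse
import urllib.request

# Public Unpaywall REST API endpoint (assumed for this sketch).
UNPAYWALL = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL; Unpaywall asks for a contact email as a parameter."""
    query = urllib.parse.urlencode({"email": email})
    return UNPAYWALL + urllib.parse.quote(doi.strip()) + "?" + query


def summarize_oa(record: dict) -> str:
    """Turn an Unpaywall record into a one-line open-access summary."""
    if not record.get("is_oa"):
        return "no open-access copy found; check the Library's databases"
    # best_oa_location can be null even in a valid response.
    location = record.get("best_oa_location") or {}
    return "open access: " + (location.get("url") or "no URL listed")


def lookup(doi: str, email: str) -> str:
    """Fetch the record for a DOI and summarize it (requires network access)."""
    with urllib.request.urlopen(unpaywall_url(doi, email), timeout=10) as resp:
        return summarize_oa(json.load(resp))
```

A "no open-access copy found" result does not mean the work is unavailable to you. It may still be accessible through a subscription database at the Library.
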