Guides: Artificial Intelligence: AI & Information Literacy

AI & Information Literacy

AI & Information Literacy considerations:

Generative AI tools largely draw on the same datasets and are limited to the data they were trained on, so they do not access newer content or content behind a paywall. As a result, you cannot use them to discover scholarly content that exists only in the closed-access databases available through the Library. A tool cannot draw on scholarship it was never trained on!
Generative AI tools are built to generate answers, and they will do so even if they have to hallucinate the content. This means that every source produced by AI must be evaluated and verified. A tool can provide you with citations, but the danger is that it simply cobbles them together, and it often generates false information and nonexistent resources. Evaluating and verifying AI-provided sources can take more time than you may want to give it.
Generative AI cannot do research: by its nature it generates text rather than searching for and evaluating sources. It is also not a search engine.
AI tools have cost considerations, and you get what you pay for.
The AI tools that allow for uploading articles and scholarship still require you to be able to do quality research.
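
The verification step described above can be partly scripted. What follows is a minimal sketch, not a complete workflow. It assumes the AI-supplied citation includes a DOI, that you look the DOI up yourself (for example through Crossref's public REST API at https://api.crossref.org/works/ followed by the DOI), and that you pass in the title the registry returned. The function names, the loose DOI pattern, and the 0.9 similarity threshold are illustrative assumptions, not part of any library.

```python
import re
from difflib import SequenceMatcher
from typing import Optional
from urllib.parse import quote

# Loose pattern for modern DOIs: "10.", a 4-9 digit registrant code, "/", a suffix.
# Assumption for illustration only; it will not match every valid DOI.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def crossref_lookup_url(doi: str) -> str:
    """Build the Crossref REST API URL where a DOI's metadata can be checked."""
    return "https://api.crossref.org/works/" + quote(doi.strip(), safe="/")


def normalize(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace for comparison."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def titles_match(cited: str, registered: str, threshold: float = 0.9) -> bool:
    """Compare two titles with a similarity ratio (the threshold is an assumption)."""
    ratio = SequenceMatcher(None, normalize(cited), normalize(registered)).ratio()
    return ratio >= threshold


def check_citation(doi: str, cited_title: str, registry_title: Optional[str]) -> str:
    """Return a short verdict for one AI-supplied citation.

    registry_title is the title you found when looking up the DOI,
    or None if the lookup returned nothing.
    """
    if not DOI_PATTERN.match(doi.strip()):
        return "malformed DOI: treat the citation as unverified"
    if registry_title is None:
        return "DOI not found: the citation may be fabricated"
    if not titles_match(cited_title, registry_title):
        return "DOI exists but points to a different work"
    return "DOI and title agree: now read and evaluate the source itself"


# Placeholder DOI and titles, used only to show the call.
print(crossref_lookup_url("10.1000/xyz123"))
print(check_citation("10.1000/xyz123", "A Cited Title", None))
```

Even when the DOI and title agree, the check only confirms that the work exists. Whether it actually supports the claim the AI attached to it still has to be judged by reading it.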

AI Tools & AI Datasets Overview

AI Tools
AI Datasets

Model / Tool	Type	Primary Data Sources	Discipline Coverage (approx.)	Special Features
GPT‑4 (OpenAI)	LLM	Web (Common Crawl, RefinedWeb), textbooks, code, Wikipedia, licensed data	Web: ~70–80%; Books: ~5–10%; Academic: ~5–10%; Code: ~5–10%	General-purpose: research Q&A, writing, coding
GPT‑3.5 / ChatGPT	LLM	WebText2, Books1/2, Wikipedia, Common Crawl	Web: ~82%; Books: ~16%; Wiki: ~3%	Fast general use; limited academic depth
Gemini (Google)	LLM (multimodal)	Web, books, textbooks, code, multimedia, real-time web	Web: ~70–80%; Books: ~5–10%; Code: ~10%; Multimedia: ~5–10%	Multimodal (text, images, audio, video)
GitHub Copilot	Code assistant	GitHub repos, StackOverflow, public codebases	Code: ~80%; Docs/QA: ~15%; NLP: ~5%	Optimized for coding and developer needs
LLaMA / Falcon / BLOOM	Open-source LLM	Common Crawl, books, Wikipedia, arXiv, PubMed, The Pile	Web: ~70%; Books: ~10%; Academic: ~5%; Code: ~5%	Open-access, customizable
BERT	Encoder model	Wikipedia, BookCorpus	Wiki: ~75%; Books: ~25%	Foundation for text classification & summarization
T5	Encoder-decoder	C4 dataset (filtered web)	Web: ~90%; Books: <10%	Text-to-text tasks (translate, summarize, etc.)
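SciBERT / PubMedBERT	Domain LLM	Semantic Scholar papers (SciBERT); PubMed abstracts and articles (PubMedBERT)	SciBERT: biomedical ~82%, computer science ~18%; PubMedBERT: biomedical ~100%	Strong in medicine & life sciences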
BloombergGPT	Domain LLM	Financial docs, filings, news, web data	Finance: ~50%; Web: ~42%; News: ~7%	Specialized for financial research
Elicit	Research AI tool	Semantic Scholar (126M+ papers)	Academic: ~100%	Finds papers, extracts claims, Q&A
Research Rabbit	Discovery tool	Semantic Scholar, CrossRef metadata	Academic: ~100%	Builds citation & co-author networks
Jenni AI	Summarizer	Preprints, PDFs, research summaries	Academic: ~70%; Preprints/blogs: ~30%	Draft generation & summarization
Paperpal	Writing assistant	Publisher data, user uploads	Academic: ~100%	Proofreading, grammar, style suggestions
Scite	Citation analytics	Scholarly papers (citation contexts)	Academic: ~100%	Shows supportive/contrasting citations
SciSpace	Research assistant	Academic PDFs (STEM-heavy), user uploads	STEM academic: ~100%	PDF annotation, term definitions
ResearchGPT (Consensus / SciSpace)	Research GPT assistant	~200–282M academic papers via Consensus or SciSpace	Academic: ~100% across disciplines	Chat-based Q&A, citation-backed summaries, PDF upload, image support (dailyai.com, yeschat.ai, gptr.dev, digital-science.com, morganstanley.com, consensus.app)

Developed and modified in ChatGPT using the following prompt: "Create a professional table showing the major Large language models and generative AI Tools, especially for students and researchers, along with what datasets they pull and the percentage of material from the various disciplines included in the databases."

Provider	Index Size	Key Sources	OA Full‑Text Access	API Available	Best For
Allen Institute for AI (AI2; Semantic Scholar)	~280 M papers	arXiv, bioRxiv, SSRN, HAL, CrossRef, publisher feeds, U.S. gov sites	Partial (OA + some partnerships)	Yes (academic)	Scholarly search & AI‑driven insights
OurResearch (OpenAlex)	~250 M works	CrossRef metadata, DOAJ, institutional repos, Unpaywall, MAG legacy, ROR	Extensive OA	Yes (open API)	Metadata research, alt‑metrics, open data
Google Scholar	~400 M+ items	Publisher sites, repositories, personal pages, preprints, theses	Mixed (OA + paywalled links)	No official API	Broad scholarly search
Digital Science (Dimensions)	~350 M publications + grants etc.	CrossRef, PubMed, arXiv, publisher content, clinical trials, patents	Mixed (OA + proprietary)	Yes (limited)	Citation analysis, grants, patents, altmetrics
CORE (Jisc)	~300 M documents	Institutional repositories, open-access journals via OAI-PMH	Fully Open Access	Yes (open API)	Full-text discovery across OA repositories
Cambia / Lens Foundation	~200 M works	CrossRef, PubMed, Patents, Preprints, institutional repos	Mixed (OA + paywalled links)	Yes (API & data dumps)	Scholarly + patent search, licensing analysis

Developed and modified in ChatGPT using the prompt: "create a professional table showing the major datasets used in AI Tools and Web-based research and where their data/ material comes from."
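
Where the table lists an API, the index can be searched programmatically instead of through a chat interface. The sketch below is hedged and minimal: it builds a search URL for the Semantic Scholar Academic Graph API (the Allen Institute for AI index) and filters a decoded response for open-access titles. The helper names and the sample response are illustrative assumptions, and fetching the URL (with rate limits and any API key the provider requires) is left out.

```python
from typing import List
from urllib.parse import urlencode

# Public paper-search endpoint of the Semantic Scholar Academic Graph API.
SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, limit: int = 5) -> str:
    """Build a search URL asking for title, year, and open-access status."""
    params = {
        "query": query,
        "fields": "title,year,isOpenAccess",
        "limit": limit,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def open_access_titles(response: dict) -> List[str]:
    """Pick out titles flagged as open access from a decoded JSON response."""
    return [
        paper["title"]
        for paper in response.get("data", [])
        if paper.get("isOpenAccess")
    ]


# Hypothetical response, shaped like the API's "data" list, for illustration only.
sample_response = {
    "data": [
        {"title": "Example open paper", "year": 2023, "isOpenAccess": True},
        {"title": "Example paywalled paper", "year": 2022, "isOpenAccess": False},
    ]
}

print(build_search_url("information literacy"))
print(open_access_titles(sample_response))
```

A search like this returns records from the index itself, so each result corresponds to a real paper. It still covers only what that provider has indexed and does not replace the Library's closed-access databases.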
