glossagen.pipelines package

Submodules

glossagen.pipelines.generate_glossary module

Module for generating a glossary based on a research document.

class glossagen.pipelines.generate_glossary.GlossaryGenerator(research_doc: ResearchDoc, chunk_size: int = 20000)[source]

Bases: object

A class that generates a glossary based on a research document.

Attributes

research_doc (ResearchDoc): The research document to generate the glossary from.

glossary_predictor (dspy.Predict): The predictor used to generate the glossary.

Methods

generate_glossary_from_doc: Generates the glossary based on the research document.

deduplicate_entries(glossary: list[TerminusTechnicus]) list[TerminusTechnicus][source]

Deduplicate the glossary entries by considering plurals and similar-sounding terms.

format_nicely(glossary: list[TerminusTechnicus]) str[source]

Format the glossary nicely.

Args:

glossary (list[TerminusTechnicus]): The glossary to format.

Returns

str: The nicely formatted glossary.
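The exact output format is not documented here; a minimal sketch of such a formatter, assuming entries render one per line as term: definition (the dataclass Entry is a hypothetical stand-in for TerminusTechnicus):

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical stand-in for TerminusTechnicus (term/definition pair).
    term: str
    definition: str


def format_nicely(glossary: list[Entry]) -> str:
    # Render one "term: definition" line per entry; the layout is an assumption.
    return "\n".join(f"{e.term}: {e.definition}" for e in glossary)


entries = [Entry("XRD", "X-ray diffraction"), Entry("MOF", "Metal-organic framework")]
print(format_nicely(entries))
```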

generate_glossary_from_doc() Any[source]

Generate the glossary based on the research document.

Returns

Any: The generated glossary.

normalize_term(term: str) str[source]

Normalize a term by converting it to lowercase and removing common plural endings.
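The normalization and deduplication steps above can be sketched as follows; the plural-stripping rules are an assumption (the docstring only states "lowercase and removing common plural endings"), and plain (term, definition) tuples stand in for TerminusTechnicus objects:

```python
def normalize_term(term: str) -> str:
    # Lowercase, then strip common English plural endings.
    # The exact suffix list is an assumption, not the library's actual rule.
    t = term.lower()
    for suffix in ("ies", "es", "s"):
        if t.endswith(suffix) and len(t) > len(suffix) + 2:
            if suffix == "ies":
                return t[: -len(suffix)] + "y"
            return t[: -len(suffix)]
    return t


def deduplicate_entries(glossary: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Keep only the first entry for each normalized term.
    seen: set[str] = set()
    result = []
    for term, definition in glossary:
        key = normalize_term(term)
        if key not in seen:
            seen.add(key)
            result.append((term, definition))
    return result


print(deduplicate_entries([("Polymers", "a"), ("polymer", "b")]))
```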

class glossagen.pipelines.generate_glossary.KeepImportantTerms(*, termini_technici: list[TerminusTechnicus], important_terms: list[TerminusTechnicus])[source]

Bases: Signature

Keep only the important terms from a list of termini technici.

important_terms: list[TerminusTechnicus]
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'important_terms': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of important terms extracted from the termini technici.\n        NEEDS to be abbreviations or very important terms.', '__dspy_field_type': 'output', 'prefix': 'Important Terms:'}), 'termini_technici': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of termini technici extracted from the text.', '__dspy_field_type': 'input', 'prefix': 'Termini Technici:'})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

termini_technici: list[TerminusTechnicus]
class glossagen.pipelines.generate_glossary.TerminusTechnicus(*, term: str, definition: str)[source]

Bases: BaseModel

A terminus technicus, i.e. a technical term in materials science and chemistry.

definition: str
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'definition': FieldInfo(annotation=str, required=True, title='The definition of the technical term.'), 'term': FieldInfo(annotation=str, required=True, title='The technical term. Can also be an abbreviation.')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

term: str
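The model can be reconstructed from the field metadata above; a sketch assuming plain Pydantic fields with the titles shown in model_fields:

```python
from pydantic import BaseModel, Field


class TerminusTechnicus(BaseModel):
    # Fields reconstructed from the model_fields metadata above.
    term: str = Field(title="The technical term. Can also be an abbreviation.")
    definition: str = Field(title="The definition of the technical term.")


entry = TerminusTechnicus(term="XRD", definition="X-ray diffraction")
print(entry.term)
```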
class glossagen.pipelines.generate_glossary.Text2GlossarySignature(*, text: str, glossary: list[TerminusTechnicus])[source]

Bases: Signature

Generating a list of termini technici from a text in materials science and chemistry.

glossary: list[TerminusTechnicus]
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'glossary': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of termini technici extracted from the text.\n        ONLY TAKE VERY IMPORTANT TERMS, no general terms like Chemistry.', '__dspy_field_type': 'output', 'prefix': 'Glossary:'}), 'text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'The text to extract the termini technici from.', '__dspy_field_type': 'input', 'prefix': 'Text:'})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

text: str
glossagen.pipelines.generate_glossary.generate_glossary(document_directory: str, log_to_wandb_flag: bool = True) Any[source]

Generate a glossary based on a research document.

Args:

document_directory (str): The directory where the research document is stored.

log_to_wandb_flag (bool): Whether to log the generated glossary to wandb. Defaults to True.

Returns

Any: The generated glossary.

glossagen.pipelines.generate_glossary.log_to_wandb(glossary: list[TerminusTechnicus], chunk_size: int, project_name: str = 'GlossaGen', config: Dict[Any, Any] | None = None) None[source]

Initialize wandb and log the generated glossary as a wandb.Table.

Args:

glossary (list[TerminusTechnicus]): The list of glossary terms to log.

chunk_size (int): The size of the chunks the research document was split into.

project_name (str): The name of the wandb project.

config (dict): Configuration parameters for the wandb run.

glossagen.pipelines.generate_glossary.main() None[source]

Demonstrate the generation of a glossary from a research document.

glossagen.pipelines.glossary_to_ontology module

Module to generate an ontology from a glossary.

class glossagen.pipelines.glossary_to_ontology.Glossary2Labels(*, input_text: str, labels: list[OntologyEntityLabels])[source]

Bases: Signature

Generating a list of entity labels (general overarching classes, e.g. Material, Property, …, Other) from glossary term and description pairs.

input_text: str
labels: list[OntologyEntityLabels]
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'input_text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'Glossary term and description pairs, one by line.', '__dspy_field_type': 'input', 'prefix': 'Input Text:'}), 'labels': FieldInfo(annotation=list[OntologyEntityLabels], required=True, json_schema_extra={'desc': 'List of general labels categorizing the terms.', '__dspy_field_type': 'output', 'prefix': 'Labels:'})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class glossagen.pipelines.glossary_to_ontology.Glossary2Relations(*, input_text: str, relations: list[OntologyRelation])[source]

Bases: Signature

Generating a list of relations (single verb) between entities from glossary term and description pairs.

input_text: str
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'input_text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'Glossary term and description pairs, one by line.', '__dspy_field_type': 'input', 'prefix': 'Input Text:'}), 'relations': FieldInfo(annotation=list[OntologyRelation], required=True, json_schema_extra={'desc': '\n        List of general relations between glossary terms for an ontology.\n        Short terms or verbs describing the relation.', '__dspy_field_type': 'output', 'prefix': 'Relations:'})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

relations: list[OntologyRelation]
class glossagen.pipelines.glossary_to_ontology.Ontology(*, labels: Dict[str, str], relationships: List[str])[source]

Bases: BaseModel

Represent an ontology with labels and relationships.

labels: Dict[str, str]
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'labels': FieldInfo(annotation=Dict[str, str], required=True), 'relationships': FieldInfo(annotation=List[str], required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

print_labels()[source]

Print all the labels in the ontology.

print_relationships()[source]

Print all the relationships in the ontology.

relationships: List[str]
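The model and its two print helpers can be reconstructed from the field metadata above; a sketch assuming print_labels emits one label: description line per entry and print_relationships one relation per line (the exact output layout is an assumption):

```python
from typing import Dict, List

from pydantic import BaseModel


class Ontology(BaseModel):
    # Fields reconstructed from the model_fields metadata above.
    labels: Dict[str, str]
    relationships: List[str]

    def print_labels(self) -> None:
        # Print each label with its description; layout is an assumption.
        for label, description in self.labels.items():
            print(f"{label}: {description}")

    def print_relationships(self) -> None:
        for relation in self.relationships:
            print(relation)


onto = Ontology(
    labels={"Material": "A physical substance", "Property": "A measurable attribute"},
    relationships=["has_property", "composed_of"],
)
onto.print_labels()
onto.print_relationships()
```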
class glossagen.pipelines.glossary_to_ontology.OntologyEntityLabels(*, label: str)[source]

Bases: BaseModel

An ontology label, i.e. an entity label in materials science and chemistry.

label: str
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'label': FieldInfo(annotation=str, required=True, title="A string 'label: description' describing an entity class.")}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class glossagen.pipelines.glossary_to_ontology.OntologyGenerator(glossary: Dict[str, str])[source]

Bases: object

A class that generates an ontology based on a glossary.

Attributes

glossary (Dict[str, str]): The glossary to generate the ontology from.

Methods

generate_ontology_from_glossary: Generates the ontology based on the glossary.

generate_ontology_from_glossary(verbose: bool = False) Any[source]

Generate the ontology based on the glossary.

Returns

Any: The generated ontology.

class glossagen.pipelines.glossary_to_ontology.OntologyRelation(*, relation: str)[source]

Bases: BaseModel

An ontology relation, i.e. a relation between two entities.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'relation': FieldInfo(annotation=str, required=True, title='A relation type of the ontology.')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

relation: str
glossagen.pipelines.glossary_to_ontology.generate_ontology_from_glossary(document_directory: str) Any[source]

Generate ontology from a glossary.

glossagen.pipelines.knowledge_graph module

Pipeline to generate a knowledge graph from research documents.

glossagen.pipelines.knowledge_graph.create_documents_from_text_chunks(text: str, max_length: int = 2000) List[Document][source]

Create documents from text by dividing it into chunks of at most max_length characters.
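A minimal sketch of the character-based split, assuming a naive fixed-size chunking with the last chunk shorter (the real implementation wraps each chunk in a Document and may respect sentence or paragraph boundaries):

```python
def split_into_chunks(text: str, max_length: int = 2000) -> list[str]:
    # Naive fixed-size split; the library may use smarter boundaries
    # and wraps each chunk in a Document object.
    return [text[i : i + max_length] for i in range(0, len(text), max_length)]


chunks = split_into_chunks("a" * 4500, max_length=2000)
print([len(c) for c in chunks])
```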

glossagen.pipelines.knowledge_graph.main() None[source]

Orchestrate graph generation from research documents.

glossagen.pipelines.latex_glossary module

Extract glossary from LaTeX document.

glossagen.pipelines.latex_glossary.extract_text_from_latex(latex_file_path: str) str[source]

Extract readable text from a .tex document, focusing on content between \begin{document} and \end{document}.
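Extracting the document body can be sketched with a single regular expression, assuming the only goal is to capture what sits between \begin{document} and \end{document} (real extraction would also strip LaTeX commands and comments):

```python
import re


def extract_document_body(latex_source: str) -> str:
    # Capture everything between \begin{document} and \end{document}.
    # Stripping LaTeX commands/comments is out of scope for this sketch.
    match = re.search(
        r"\\begin\{document\}(.*?)\\end\{document\}", latex_source, re.DOTALL
    )
    return match.group(1).strip() if match else ""


tex = r"\documentclass{article}\begin{document}Hello, glossary.\end{document}"
print(extract_document_body(tex))
```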

glossagen.pipelines.latex_glossary.main(latex_file_path: str) None[source]

Extract glossary from LaTeX document.

Module contents